Immoviz.io - real estate search engine

Transcript of Immoviz.io - real estate search engine

Page 1: Immoviz.io - real estate search engine

Immoviz - #WeAreAnts

IMMOVIZ BORDEAUX

Emeline Gaulard - Du Phan

Page 2: Immoviz.io - real estate search engine

WHO ARE WE

EMELINE GAULARD: BACKEND DEVELOPER, EPITECH ’18

DU PHAN: DATA SCIENTIST, ENSC ’17

Page 3: Immoviz.io - real estate search engine

WE ARE ANTS

WHERE DO WE WORK

Prototyping

Data Science

Internet of Things

Fun

Page 4: Immoviz.io - real estate search engine

Immoviz?

Page 5: Immoviz.io - real estate search engine

Page 6: Immoviz.io - real estate search engine

TOOLBOX

Search: Elastic Search

Machine Learning: Python, ipython notebook, pandas/numpy, seaborn/folium, scikit-learn, hyperopt

Backend: NodeJs, Express, CasperJS, PostgreSQL, Slack bot (node-slackr)

Page 7: Immoviz.io - real estate search engine

INFRASTRUCTURE: before

Page 8: Immoviz.io - real estate search engine

INFRASTRUCTURE: now

Page 9: Immoviz.io - real estate search engine

OUTLINE

1. SCRAPERS

2. ELASTIC SEARCH

3. DUPLICATE AGGREGATION

4. PRICE PREDICTION

Page 10: Immoviz.io - real estate search engine

Scrapers

Real estate sites (seloger, leboncoin, bienici, sudouest, …)

CasperJS

PostgreSQL

Elastic Search

Page 11: Immoviz.io - real estate search engine

Scrapers

Simple to use

Lightweight

Debugging is non-trivial

Page 12: Immoviz.io - real estate search engine

Elastic Search

Indexing

Mapping

Analyzing

Querying

Page 13: Immoviz.io - real estate search engine

Analyzer example
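
The analyzer shown on this slide did not survive in the transcript. As a stand-in, here is a minimal sketch of what a custom analyzer for French advert text could look like, written as a Python dict that can be serialized to JSON and sent as index settings. The analyzer name `advert_text` and the filter names `french_stop` and `french_stemmer` are assumptions made for illustration, not the authors' configuration. The filter types themselves (`stop`, `stemmer`, `lowercase`, `asciifolding`) are built into Elastic Search.

```python
import json

# Hypothetical index settings: names are illustrative, not taken from the talk.
ADVERT_INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            "filter": {
                # Remove common French words ("le", "de", "et", ...).
                "french_stop": {"type": "stop", "stopwords": "_french_"},
                # Reduce words to a stem ("maisons" and "maison" match).
                "french_stemmer": {"type": "stemmer", "language": "light_french"},
            },
            "analyzer": {
                "advert_text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # Filters run in order; accents are folded last so that
                    # stop words and stemming still see the accented forms.
                    "filter": [
                        "lowercase",
                        "french_stop",
                        "french_stemmer",
                        "asciifolding",
                    ],
                }
            },
        }
    }
}


def settings_as_json(settings=ADVERT_INDEX_SETTINGS):
    """Serialize the settings so they can be used as a request body."""
    return json.dumps(settings, indent=2)
```

The exact request used to create the index depends on the Elastic Search version and client, so it is left out here.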

Page 14: Immoviz.io - real estate search engine

Query example
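
The query shown on this slide is also missing from the transcript. The sketch below builds a typical search body for this kind of engine: a full-text match on the advert description combined with a price range filter. The field names `description` and `price` and the helper `build_advert_query` are hypothetical; `bool`, `match` and `range` are standard Elastic Search query types.

```python
def build_advert_query(text, min_price, max_price, size=20):
    """Build a search body: text must match, price is filtered (not scored)."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Scored clause: every word of the user's text must appear.
                "must": [
                    {"match": {"description": {"query": text, "operator": "and"}}}
                ],
                # Unscored clause: keep only adverts inside the price range.
                "filter": [
                    {"range": {"price": {"gte": min_price, "lte": max_price}}}
                ],
            }
        },
    }


# Example: houses with a garden between 100 000 and 300 000 euros.
example_query = build_advert_query("maison jardin", 100_000, 300_000)
```

Putting the price range in `filter` rather than `must` keeps it out of relevance scoring, so ranking depends only on the text match.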

Page 15: Immoviz.io - real estate search engine

Duplicate aggregation

PostgreSQL: ID comparison, price comparison

Elastic Search: text analysis, price comparison
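
The three checks named on this slide (ID comparison, price comparison, text analysis) can be sketched in plain Python as below. This is only an illustration under stated assumptions: the `Advert` fields, the 2% price tolerance and the 0.6 similarity threshold are invented for the example, and a word-overlap (Jaccard) score stands in for the text analysis that the talk delegates to Elastic Search.

```python
from dataclasses import dataclass


@dataclass
class Advert:
    source: str       # site the advert was scraped from (hypothetical field)
    source_id: str    # identifier assigned by that site
    price: float
    description: str


def same_id(a, b):
    """ID comparison: the same advert scraped twice from one site."""
    return a.source == b.source and a.source_id == b.source_id


def close_price(a, b, tolerance=0.02):
    """Price comparison: prices within a relative tolerance (assumed 2%)."""
    return abs(a.price - b.price) <= tolerance * max(a.price, b.price)


def text_similarity(a, b):
    """Jaccard overlap of the word sets, a stand-in for real text analysis."""
    words_a = set(a.description.lower().split())
    words_b = set(b.description.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def is_duplicate(a, b, min_similarity=0.6):
    """Same ID, or a close price together with a similar description."""
    if same_id(a, b):
        return True
    return close_price(a, b) and text_similarity(a, b) >= min_similarity
```

Requiring both a close price and a similar description for cross-site matches is a design choice of this sketch: either signal alone would merge many unrelated adverts.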

Page 18: Immoviz.io - real estate search engine

MACHINE LEARNING WORKFLOW

Data Cleaning: Check input format, Split data and hide holdout, Drop/impute null values, Filter outliers, …

Feature Engineering: Extract features, Scale/normalize data, Test contextual data

Data Modeling: Cross-validate for model selection, Optimize hyper-parameters, …

Error analysis: Cross-validate for testing error, Locate sensitive zone, Visualize error

Page 19: Immoviz.io - real estate search engine

If a data set has affected any step in the learning process, its ability to assess the outcome has been compromised.

Data snooping
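
One practical consequence is that the holdout set has to be put aside before any statistic is computed, and every preprocessing parameter has to be learned from the training part only. The sketch below illustrates this with a simple standardization step. The function names and the 20% holdout fraction are assumptions made for the example.

```python
import random
from statistics import mean, pstdev


def split_holdout(rows, holdout_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and set aside a holdout part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]  # (train, holdout)


def fit_scaler(train_values):
    """Learn mean and standard deviation from the training values only."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # avoid dividing by zero
    return mu, sigma


def scale(values, mu, sigma):
    """Apply parameters learned elsewhere; never refit on the holdout."""
    return [(v - mu) / sigma for v in values]


# Usage: the holdout is scaled with the training statistics.
surfaces = [float(s) for s in range(20, 120)]
train, holdout = split_holdout(surfaces, holdout_fraction=0.2, seed=42)
mu, sigma = fit_scaler(train)
train_scaled = scale(train, mu, sigma)
holdout_scaled = scale(holdout, mu, sigma)
```

Computing `mu` and `sigma` on the full data set instead would let the holdout influence a step of the learning process, which is exactly the situation the quote warns about.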

Page 20: Immoviz.io - real estate search engine

k-fold Cross Validation
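
In k-fold cross validation the training data is cut into k parts; each part serves once as the test fold while the model is trained on the other k - 1 parts, and the k errors are averaged. The index bookkeeping can be sketched in plain Python as below. The function name is hypothetical; in a scikit-learn workflow the `KFold` class plays the same role.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [
        n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Usage: every sample is used for testing exactly once.
folds = list(k_fold_indices(10, 3))
```

The folds here are contiguous, so the data should be shuffled beforehand if its order carries information (for example adverts sorted by date).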

Page 21: Immoviz.io - real estate search engine

MACHINE LEARNING WORKFLOW

Data Cleaning: Check input format, Drop/impute null values, Filter outliers, Split data and hide holdout, …

Feature Engineering: Extract features, Scale/normalize data, Test contextual data

Data Modeling: Cross-validate for model selection, Optimize hyper-parameters, …

Error analysis: Cross-validate for testing error, Locate sensitive zone, Visualize error

Page 22: Immoviz.io - real estate search engine

Source: Professor Yaser Abu-Mostafa, Caltech

Page 23: Immoviz.io - real estate search engine

If you torture the data long enough, it will confess.

Data snooping

Page 24: Immoviz.io - real estate search engine

Some key numbers

60 000 adverts, including 20 432 selling ads

12 839 unique selling ads with 61 features

10 883 selling ads remaining with 52 features after filtering

8 months of data

Page 25: Immoviz.io - real estate search engine

Allocation of time

Feature Engineering: 50%

Data Cleaning & EDA, Data Modeling, Error Analysis: the remaining 50% (slices of 20%, 20% and 10%)

Page 26: Immoviz.io - real estate search engine

Feature engineering: what works?

Location features
Contextual data (Open Moulinette)
Imputing
Room features
Removing contextual outliers
Improving ES queries

Time series features
NLP on text data
Dimensionality reduction
Numerical values transforming/scaling

Page 27: Immoviz.io - real estate search engine

Data Modeling: what algorithms to use?

Linear Model

Tree-based model

Average Ensemble method

Metamodel Ensemble method

Page 28: Immoviz.io - real estate search engine

“This is how you win ML competitions: you take other people’s work and ensemble them together.”

Vitaly Kuznetsov - NIPS 2014

Page 29: Immoviz.io - real estate search engine

Meta-model ensemble method: explanation
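
The explanatory diagram is not part of the transcript. The general idea of a meta-model (stacking) ensemble is that each base model produces out-of-fold predictions on the training data, and a second model learns how to combine those predictions. The sketch below is a deliberately small illustration, not the authors' pipeline: the two base models (a mean predictor and a one-feature linear fit) and the meta-model (a single learned blending weight) are assumptions chosen to keep the example short.

```python
from statistics import mean


def fit_mean(xs, ys):
    """Base model 1: always predict the mean training price."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Base model 2: least-squares line on one feature (e.g. surface)."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var if var else 0.0
    return lambda x: my + slope * (x - mx)


def out_of_fold_predictions(fit, xs, ys, k):
    """Predict each sample with a model that never saw it during training."""
    preds = [0.0] * len(xs)
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in range(fold, len(xs), k):  # samples held out in this fold
            preds[i] = model(xs[i])
    return preds


def fit_stack(xs, ys, k=5):
    """Learn weight w so that w * line + (1 - w) * mean fits the targets."""
    p_mean = out_of_fold_predictions(fit_mean, xs, ys, k)
    p_line = out_of_fold_predictions(fit_line, xs, ys, k)
    # Closed-form least squares for a single blending weight.
    num = sum((y - pm) * (pl - pm) for y, pl, pm in zip(ys, p_line, p_mean))
    den = sum((pl - pm) ** 2 for pl, pm in zip(p_line, p_mean))
    w = num / den if den else 0.5
    # Refit the base models on all the training data for final predictions.
    mean_model, line_model = fit_mean(xs, ys), fit_line(xs, ys)
    return (lambda x: w * line_model(x) + (1 - w) * mean_model(x)), w
```

An average ensemble is the special case where the weight is fixed at 0.5 instead of being learned from out-of-fold predictions.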

Page 30: Immoviz.io - real estate search engine

Kaggle Homesite winner

Source: Homesite Quote Conversion, Winners' Write-Up, 1st Place: KazAnova

Page 31: Immoviz.io - real estate search engine

Error analysis: visualization is key

Page 32: Immoviz.io - real estate search engine

Error analysis: visualization is key

Page 33: Immoviz.io - real estate search engine

Result

[Bar chart: 10-fold CV mean error (%), on an axis from 0 to 26, for Linear Regression, Lasso, Random Forest, Gradient Boosting, Average Ensemble Method and Metamodel Ensemble Method]

Page 34: Immoviz.io - real estate search engine

Result

CV mean error: 12.3%

Holdout mean error: 13.1%

CV median error: 8.8%

Holdout median error: 9.3%
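
The slides do not define how the error is computed. Assuming it is the absolute percentage error between predicted and actual price (an assumption, not something the talk states), the mean and median figures could be obtained as in this sketch. The function names are hypothetical.

```python
from statistics import mean, median


def percentage_errors(actual_prices, predicted_prices):
    """Absolute error of each prediction, as a percentage of the true price."""
    return [
        abs(predicted - actual) / actual * 100
        for actual, predicted in zip(actual_prices, predicted_prices)
    ]


def summarize_errors(actual_prices, predicted_prices):
    """Return the mean and the median percentage error."""
    errors = percentage_errors(actual_prices, predicted_prices)
    return {"mean_error": mean(errors), "median_error": median(errors)}


# Usage with made-up prices: one badly mispriced advert among three.
summary = summarize_errors(
    [100_000.0, 200_000.0, 400_000.0],
    [110_000.0, 190_000.0, 600_000.0],
)
```

A few badly mispriced adverts pull the mean up while leaving the median almost unchanged, which is one plausible reason for the mean errors being higher than the median errors here.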

Page 35: Immoviz.io - real estate search engine

Feature importance

Page 36: Immoviz.io - real estate search engine

Page 37: Immoviz.io - real estate search engine

How to improve the model

More data

Improve ES queries (sector, type, … )

Leverage time series data

More data

Page 38: Immoviz.io - real estate search engine

How to improve the model

Page 39: Immoviz.io - real estate search engine

What’s next?

Metrics

Recommendation System

User Experience

Speed

Page 40: Immoviz.io - real estate search engine

Conclusion

Better data beats a cleverer algorithm

System monitoring is vital

There needs to be a coherent data flow between backend and ML engine

Page 41: Immoviz.io - real estate search engine

Thank you for your attention.

Any questions?