Immoviz.io - real estate search engine
Uploaded by: du-phan
Category: Data & Analytics
Transcript of Immoviz.io - real estate search engine
Immoviz - #WeAreAnts
IMMOVIZ BORDEAUX
Emeline Gaulard - Du Phan
WHO ARE WE
EMELINE GAULARD
BACKEND DEVELOPER EPITECH ’18
DU PHAN
DATA SCIENTIST ENSC ’17
WE ARE ANTS
WHERE DO WE WORK
Prototyping
Data Science
Internet of Things
Fun
Immoviz?
TOOLBOX
Search: Elastic Search
Machine Learning: Python, ipython notebook, pandas/numpy, seaborn/folium, scikit-learn, hyperopt
Backend: NodeJs, Express, CasperJS, PostgreSQL, Slack bot (node-slackr)
INFRASTRUCTURE: before
INFRASTRUCTURE: now
OUTLINE
1. SCRAPERS
2. ELASTIC SEARCH
3. DUPLICATE AGGREGATION
4. PRICE PREDICTION
Scrapers
Real estate sites (seloger, leboncoin, bienici, sudouest, …)
CasperJS
PostgreSQL
Elastic Search
Scrapers
Simple to use
Lightweight
Debugging is non-trivial
Elastic Search
Indexing
Mapping
Analyzing
Querying
Analyzer example
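The analyzer shown on this slide is not preserved in the transcript. As a rough stand-in, the sketch below builds index settings for a custom French analyzer as a Python dict that could be sent as the body of an index-creation request. The analyzer name `advert_french`, the field names `description` and `price`, and the typeless mapping layout (as in recent Elasticsearch versions) are all assumptions for illustration, not the authors' configuration.

```python
import json


def build_index_settings(analyzer_name="advert_french"):
    """Return index settings with a custom French analyzer (all names are illustrative)."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Built-in French stop-word list.
                    "french_stop": {"type": "stop", "stopwords": "_french_"},
                    # Light French stemmer, so singular/plural forms match.
                    "french_stemmer": {"type": "stemmer", "language": "light_french"},
                },
                "analyzer": {
                    analyzer_name: {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "french_stop", "french_stemmer"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                # Hypothetical advert fields.
                "description": {"type": "text", "analyzer": analyzer_name},
                "price": {"type": "integer"},
            }
        },
    }


INDEX_SETTINGS = build_index_settings()

if __name__ == "__main__":
    print(json.dumps(INDEX_SETTINGS, indent=2))
```

The filter order matters: lowercasing runs first so that the stop-word and stemming filters see normalized tokens.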
Query example
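The query from this slide is also missing from the transcript. A minimal sketch of what an advert search could look like is given here: a `bool` query that scores on a full-text `match` and narrows results with non-scoring `range` filters. The field names `description`, `price` and `rooms` and the helper `build_search_query` are hypothetical.

```python
def build_search_query(text, max_price, min_rooms=None):
    """Build an Elasticsearch bool query body (field names are assumptions)."""
    # Filters restrict the result set without affecting the relevance score.
    filters = [{"range": {"price": {"lte": max_price}}}]
    if min_rooms is not None:
        filters.append({"range": {"rooms": {"gte": min_rooms}}})
    return {
        "query": {
            "bool": {
                # The match clause is analyzed and contributes to scoring.
                "must": [{"match": {"description": text}}],
                "filter": filters,
            }
        }
    }


EXAMPLE_QUERY = build_search_query("appartement balcon", max_price=250000, min_rooms=2)
```

Keeping hard constraints in `filter` rather than `must` lets Elasticsearch cache them and keeps the score driven by the text match alone.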
Duplicate aggregation
PostgreSQL: ID comparison, price comparison
Elastic Search: text analysis, price comparison
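The two stages listed above can be sketched in memory as follows. This is only an illustration of the idea, assuming that an identical ID marks a duplicate outright and that cross-site duplicates are caught by a close price plus similar text. The `Advert` fields, the 5% price tolerance, the 0.6 similarity threshold, and the use of Jaccard word overlap in place of Elastic Search text analysis are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Advert:
    source_id: str
    price: float
    description: str


def price_close(a, b, tolerance=0.05):
    """True when the prices differ by at most `tolerance`, relative to the larger one."""
    highest = max(a, b)
    return highest > 0 and abs(a - b) / highest <= tolerance


def text_similarity(a, b):
    """Jaccard similarity between the lowercase word sets of two descriptions."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def is_duplicate(x, y, text_threshold=0.6):
    """Stage 1: identical ID. Stage 2: close price and similar description."""
    if x.source_id == y.source_id:
        return True
    return (
        price_close(x.price, y.price)
        and text_similarity(x.description, y.description) >= text_threshold
    )


first = Advert("seloger-1", 200000, "Bel appartement T3 avec balcon centre Bordeaux")
second = Advert("leboncoin-9", 198000, "Bel appartement T3 avec balcon proche centre Bordeaux")
other = Advert("seloger-2", 450000, "Maison avec jardin")
```

Here `first` and `second` come from different sites but share almost all words and differ by 1% in price, so they are flagged as one listing, while `other` is not.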
MACHINE LEARNING WORKFLOW
Data Cleaning: check input format, split data and hide holdout, drop/impute null values, filter outliers, …
Feature Engineering: extract features, scale/normalize data, test contextual data, …
Data Modeling: cross-validate for model selection, optimize hyper-parameters, …
Error Analysis: cross-validate for testing error, locate sensitive zones, visualize error, …
Data snooping
If a data set has affected any step in the learning process, its ability to assess the outcome has been compromised.
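One concrete way to respect this rule is to set the holdout aside first and to compute every statistic used for cleaning (here, an imputation median) from the training part only. The sketch below uses made-up adverts and hypothetical helper names; it is not the authors' pipeline.

```python
import random
import statistics


def split_holdout(rows, holdout_ratio=0.2, seed=0):
    """Shuffle a copy of the rows and return (train, holdout)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_ratio)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def fit_median(rows, column):
    """Median of the non-null values of `column`."""
    return statistics.median(r[column] for r in rows if r[column] is not None)


def impute(rows, column, value):
    """Return copies of the rows with nulls in `column` replaced by `value`."""
    return [dict(r, **{column: value}) if r[column] is None else dict(r) for r in rows]


# Made-up adverts: (surface in m2, price in euros); None marks a missing surface.
adverts = [
    {"surface": s, "price": p}
    for s, p in [
        (30, 120000), (None, 200000), (55, 210000), (80, 300000), (None, 150000),
        (45, 180000), (100, 390000), (65, 250000), (70, 260000), (38, 140000),
    ]
]

train_raw, holdout_raw = split_holdout(adverts)
# The median is learned from the training rows only ...
surface_median = fit_median(train_raw, "surface")
# ... and then applied unchanged to both parts.
train = impute(train_raw, "surface", surface_median)
holdout = impute(holdout_raw, "surface", surface_median)
```

Computing the median on all ten rows before splitting would let the holdout influence the cleaning step, which is exactly the leak the quote warns about.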
k-fold Cross Validation
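As a reminder of the mechanics, k-fold cross-validation partitions the samples into k folds and uses each fold once for validation while training on the other k-1. The generator below is a minimal sketch with a hypothetical name; it keeps the samples in order, whereas in practice the data is usually shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (training_indices, validation_indices) for each of the k folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size


folds = list(k_fold_indices(10, 3))
```

With 10 samples and 3 folds, the validation folds hold 4, 3 and 3 samples, and every sample is validated exactly once.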
MACHINE LEARNING WORKFLOW
Data Cleaning: check input format, drop/impute null values, filter outliers, split data and hide holdout, …
Feature Engineering: extract features, scale/normalize data, test contextual data, …
Data Modeling: cross-validate for model selection, optimize hyper-parameters, …
Error Analysis: cross-validate for testing error, locate sensitive zones, visualize error, …
Source: Professor Yaser Abu-Mostafa, Caltech
Data snooping
If you torture the data long enough, it will confess.
Some key numbers
60 000 adverts, including 20 432 selling ads
12 839 unique selling ads with 61 features
10 883 selling ads remaining with 52 features after filtering
8 months of data
Allocation of time (pie chart)
Feature Engineering: 50%
Other slices: Data Cleaning & EDA, Data Modeling, Error Analysis (20%, 10%, 20%; slice order unclear)
Feature engineering: what works?
Location features, Contextual data (Open Moulinette), Imputing, Room features, Removing contextual outliers, Improving ES queries
Time series features, NLP on text data, Dimensionality reduction, Numerical values transforming/scaling
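The slide does not say how the location features were built. One plausible example, given purely as an assumption, is the great-circle distance from an advert to the city centre. The sketch below uses the haversine formula; the centre coordinates are approximate and the `lat`/`lon` field names are hypothetical.

```python
import math

# Approximate latitude/longitude of central Bordeaux (an assumption for illustration).
BORDEAUX_CENTRE = (44.8378, -0.5792)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def distance_to_centre_km(advert):
    """Location feature: distance from the advert to the assumed city centre."""
    return haversine_km(advert["lat"], advert["lon"], *BORDEAUX_CENTRE)


example_advert = {"lat": 44.9378, "lon": -0.5792}  # 0.1 degree north of the centre
example_distance = distance_to_centre_km(example_advert)
```

A single numeric distance like this is easy for both linear and tree-based models to use, unlike raw coordinates or a free-text address.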
Data Modeling: what algorithms to use?
Linear model
Tree-based model
Average ensemble method
Metamodel ensemble method
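The simplest of the ensemble options, averaging, just takes the mean of the base models' predictions. The sketch below shows the idea with two toy stand-ins for a linear and a tree-based model; both functions and their coefficients are invented for illustration.

```python
def average_ensemble(models):
    """Return a predictor that averages the predictions of the given models."""
    def predict(x):
        predictions = [model(x) for model in models]
        return sum(predictions) / len(predictions)
    return predict


def linear_model(surface):
    """Toy linear model: price grows with surface (made-up coefficients)."""
    return 3000.0 * surface + 20000.0


def tree_like_model(surface):
    """Toy one-split tree: a single threshold on surface (made-up values)."""
    return 150000.0 if surface < 50 else 280000.0


ensemble = average_ensemble([linear_model, tree_like_model])
```

Averaging tends to help when the base models make different kinds of mistakes, which is why mixing a linear model with a tree-based one is a common choice.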
“This is how you win ML competitions: you take other people’s work and ensemble them together.”
Vitaly Kuznetsov, NIPS 2014
Meta-model ensemble method: explanation
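The explanation on this slide was a figure that the transcript does not keep. In general terms, a meta-model ensemble (stacking) trains base models, collects their out-of-fold predictions, and fits a second-level model on those predictions. The pure-Python sketch below illustrates that general scheme with two toy base learners and a two-weight linear meta-model; none of it is taken from the authors' code.

```python
def fit_linear(xs, ys):
    """Base learner 1: one-feature least-squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x if var_x else 0.0
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_nearest(xs, ys):
    """Base learner 2: predict the target of the nearest training point."""
    points = list(zip(xs, ys))
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]


def out_of_fold_predictions(fit, xs, ys, k):
    """Predict each sample with a model trained on the folds that exclude it."""
    predictions = [0.0] * len(xs)
    for fold in range(k):
        train_idx = [i for i in range(len(xs)) if i % k != fold]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        for i in range(fold, len(xs), k):
            predictions[i] = model(xs[i])
    return predictions


def fit_meta_weights(p1, p2, ys):
    """Least squares for y ~ w1*p1 + w2*p2, solved via the 2x2 normal equations."""
    a11 = sum(a * a for a in p1)
    a12 = sum(a * b for a, b in zip(p1, p2))
    a22 = sum(b * b for b in p2)
    b1 = sum(a * y for a, y in zip(p1, ys))
    b2 = sum(b * y for b, y in zip(p2, ys))
    det = a11 * a22 - a12 * a12
    if abs(det) <= 1e-12 * a11 * a22:
        return 0.5, 0.5  # base predictions are collinear: fall back to averaging
    return (b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det


def fit_stacked(xs, ys, k=3):
    """Fit the meta-model on out-of-fold predictions, then refit base models on all data."""
    base_fits = [fit_linear, fit_nearest]
    oof = [out_of_fold_predictions(fit, xs, ys, k) for fit in base_fits]
    w1, w2 = fit_meta_weights(oof[0], oof[1], ys)
    linear, nearest = (fit(xs, ys) for fit in base_fits)
    return lambda x: w1 * linear(x) + w2 * nearest(x)
```

Using out-of-fold predictions is the key design point: if the meta-model were trained on predictions the base models made for their own training data, it would reward whichever base model overfits most.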
Kaggle Homesite winner
Source: Homesite Quote Conversion, Winners' Write-Up, 1st Place: KazAnova
Error analysis: visualization is key
Result
[Bar chart: 10-fold CV mean error (%), axis from 0 to 26, for Linear Regression, Lasso, Random Forest, Gradient Boosting, Average Ensemble Method, Metamodel Ensemble Method]
Result
CV mean error: 12.3%
Holdout mean error: 13.1%
CV median error: 8.8%
Holdout median error: 9.3%
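The slides do not define the error metric. Assuming it is the absolute percentage error between predicted and actual price, the mean and median could be computed as in this sketch; the prices are made up and the function names are hypothetical.

```python
import statistics


def absolute_percentage_errors(actual, predicted):
    """Per-advert absolute error as a percentage of the actual price."""
    return [abs(p - a) / a * 100.0 for a, p in zip(actual, predicted)]


def summarize_errors(actual, predicted):
    """Mean and median absolute percentage error."""
    errors = absolute_percentage_errors(actual, predicted)
    return {"mean": statistics.mean(errors), "median": statistics.median(errors)}


# Made-up prices: the last prediction is badly off and acts as an outlier.
actual_prices = [100000, 200000, 300000, 400000]
predicted_prices = [110000, 190000, 300000, 600000]
summary = summarize_errors(actual_prices, predicted_prices)
```

A few badly mispriced adverts pull the mean up much more than the median, which is consistent with the median error being lower than the mean error in the results above.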
Feature importance
How to improve the model
More data
Improve ES queries (sector, type, … )
Leverage time series data
More data
What’s next?
Metrics
Recommendation System
User Experience
Speed
Conclusion
Better data beats cleverer algorithms
System monitoring is vital
There needs to be a coherent data flow between backend and ML engine
Thank you for your attention.
Any questions?