BlaBlaCar Elastic Search Feedback

Posted on 12-Jul-2015


1/37

ElasticSearch feedback

2/37

Introduction

3/37

Nicolas Blanc - BlaBlArchitect

Sinfomic (1999)

@thewhitegeek

(2001)

(2005)

(2008)

(2012)

4/37

What is BlaBlaCar?

5/37

3 000 000 MEMBERS IN EUROPE

6/37

10 countries (previously 9)

● France
● Spain
● Italy
● UK
● Poland
● Portugal
● Netherlands
● Belgium
● Luxembourg
● NEW: Germany

7/37

Growth

[Chart: growth from January 2008 to January 2013, with axis marks at 25 million and 50 million]

8/37

Infrastructure

2 front web servers
2 MySQL masters (+4 slaves on SSD)
1 private cloud (KVM + Open vSwitch)
● Redis
● Memcache
● RabbitMQ/workers
1 ElasticSearch cluster

9/37

Changing the Search Engine

10/37

What exists? Why change?

MySQL database
● Relational DB (lots of joins needed)
● Plain SQL queries
● Homemade geographical search

Recent problems
● New features mean more complex queries
● Scalability: performance depends on DB load

11/37

Initial requirements

Scalability
● Trip searches need to complete in less than 200 ms
● The system side of the solution must be easy to maintain
● Must be clusterable (also to avoid a SPOF)

Low code impact on the existing application
● Same features as today (geographical search)
● Minimize the developers' work
● Add one missing feature: facets

12/37

Initial Competitors

SenseiDB

13/37

Why ElasticSearch

✔ Easiest clustering
✔ Good performance when indexing
✔ Little code to write to use it
✔ Schemaless
✔ Based on Lucene
✔ Written in Java (we need to code a grouping feature)

14/37

ElasticSearch has won, now let's migrate our search!

15/37

Changing our mindset

Object in a relational database
● Can be split across multiple tables
● Lots of information reachable through JOINs

Object in a document-oriented database
● Only one big index for these objects
● All information needs to be in the object, not spread across multiple tables
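
The shift from joined tables to one self-contained document can be sketched as follows. This is only an illustration: the table names, column names and the `build_trip_document` helper are assumptions, not the real BlaBlaCar schema, and SQLite stands in for MySQL.

```python
import sqlite3


def make_demo_db():
    """Create a tiny in-memory relational schema (hypothetical tables)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE member (id INTEGER PRIMARY KEY, is_blocked INTEGER);
        CREATE TABLE vehicle (member_id INTEGER, model TEXT);
        CREATE TABLE trip (id INTEGER PRIMARY KEY, member_id INTEGER,
                           price REAL, seats_left INTEGER);
        INSERT INTO member VALUES (42, 0);
        INSERT INTO vehicle VALUES (42, 'Clio');
        INSERT INTO trip VALUES (1234567, 42, 25.0, 3);
        """
    )
    return conn


def build_trip_document(conn, trip_id):
    """Join the rows for one trip and return them as a single flat document."""
    row = conn.execute(
        """
        SELECT t.id, t.price, t.seats_left, m.id, m.is_blocked, v.model
        FROM trip t
        JOIN member m ON m.id = t.member_id
        JOIN vehicle v ON v.member_id = m.id
        WHERE t.id = ?
        """,
        (trip_id,),
    ).fetchone()
    if row is None:
        return None
    # Everything the search needs is copied into the document itself.
    return {
        "trip_id": row[0],
        "price": row[1],
        "seats_left": row[2],
        "userid": str(row[3]),
        "is_user_blocked": bool(row[4]),
        "vehicle": row[5],
    }
```

The JOIN happens once, at indexing time, so the search side never needs it.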

17/37

Defining our objects well

We need to know what we want to search
● Trips (front office usage)
● Members (back office usage)
● FAQ (front office usage)

Think of all the needed fields
● The ones used for queries
● The ones used for filters
● The ones used for facets

18/37

Defining the index well

System point of view
● Number of nodes in the cluster
● Number of shards
● Number of replicas

Application point of view
● Define the type and attributes of every field (mapping)
● Use parent/child or nested documents to improve indexing
● How to push documents from the DB?

19/37

Indexing: using a river or not?

River advantages
● Plugs directly into our source backend
● An ElasticSearch API exists to code a new one

River problems
● Not easy to add business logic on some fields
● Really hard when your DB is unconventional
● Reindexing means a full reindex of all the documents

20/37

Indexing: our manual way

We wrote an asynchronous indexer
● Written in Java
● Applies business logic when fetching from the DB
● Fetches from multiple DBs/sources
● Uses the Java ES library
● Easy interface
  ● Send {"trip":1234567} and the server answers {"OK"}
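
A minimal sketch of that interface, in Python rather than the Java of the real indexer. The class name and the two callbacks (`fetch_document`, `index_document`) are hypothetical. The message and reply formats are the only parts taken from the slide.

```python
import json
import queue
import threading

ACK = '{"OK"}'  # reply text copied from the slide


class AsyncIndexer:
    """Accepts {"trip": id} messages and indexes them on a background thread."""

    def __init__(self, fetch_document, index_document):
        self._fetch = fetch_document   # hypothetical: builds the document from the DB(s)
        self._index = index_document   # hypothetical: pushes the document to the cluster
        self._queue = queue.Queue()
        self.failures = []             # (trip_id, error) pairs, kept for inspection
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, message):
        """Queue the trip id and acknowledge at once, before any indexing happens."""
        trip_id = json.loads(message)["trip"]
        self._queue.put(trip_id)
        return ACK

    def wait(self):
        """Block until every queued trip has been processed."""
        self._queue.join()

    def _run(self):
        while True:
            trip_id = self._queue.get()
            try:
                self._index(trip_id, self._fetch(trip_id))
            except Exception as error:
                self.failures.append((trip_id, error))
            finally:
                self._queue.task_done()
```

The caller only sends an id. All business logic stays in the fetch step, which is the point of not using a river.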

21/37

One sample index: Trip

22/37

Defining our Trip object well

Think of all the needed fields
● The ones used for queries
  ● Trip departure date, from where, to where, user id
● The ones used for filters
  ● User ratings, price, vehicle, seats left, is user blocked (a blocked user is a user who performed a forbidden action on the website)
● The ones used for facets
  ● User ratings, price, vehicle

23/37

Defining our Trip index well

Think of all the system requirements
● The cluster has 2 nodes
● We keep the default configuration for shards/replicas

Think of the object mapping
● For each field:
  ● Define the type (string, long, geo_point, date, float, boolean)
  ● Define the scope (include_in_all)
  ● Define the analyzer (for type string)
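
The three per-field decisions above (type, scope, analyzer) can be sketched as a small helper that emits one mapping entry per field. The helper name and its parameters are assumptions for illustration. The emitted keys follow the legacy mapping syntax this talk uses.

```python
# Field types listed on the slide.
VALID_TYPES = {"string", "long", "geo_point", "date", "float", "boolean"}


def field_mapping(field_type, include_in_all=False, analyzed=True,
                  analyzer=None, date_format=None):
    """Return one mapping entry describing a single field."""
    if field_type not in VALID_TYPES:
        raise ValueError("unsupported field type: %s" % field_type)
    entry = {"type": field_type, "include_in_all": include_in_all}
    if field_type == "string":
        if not analyzed:
            entry["index"] = "not_analyzed"   # store the value as one exact term
        elif analyzer is not None:
            entry["analyzer"] = analyzer
    if field_type == "date" and date_format is not None:
        entry["format"] = date_format
    return entry


def trip_mapping_subset():
    """Build a few of the Trip fields; the field selection here is illustrative."""
    return {
        "trip": {
            "properties": {
                "from": field_mapping("geo_point"),
                "price": field_mapping("float"),
                "trip_date": field_mapping("date", date_format="dateOptionalTime"),
                "userid": field_mapping("string", analyzed=False),
            }
        }
    }
```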

24/37

Trip Mapping

"trip": {
  "properties": {
    "is_user_blocked": { "type": "boolean", "include_in_all": false },
    "user_ratings": { "type": "long", "include_in_all": false },
    "from": { "type": "geo_point", "include_in_all": false },
    "price": { "type": "float", "include_in_all": false },
    "price_euro": { "type": "float", "include_in_all": false },
    "seats_left": { "type": "long", "include_in_all": false },
    "seats_offered": { "type": "long", "include_in_all": false },
    "to": { "type": "geo_point", "include_in_all": false },
    "trip_date": { "type": "date", "format": "dateOptionalTime", "include_in_all": false },
    "vehicle": { "type": "string", "include_in_all": false },
    "userid": { "type": "string", "index": "not_analyzed", "include_in_all": false }
  }
}

25/37

Indexing events well

Which modifications send a change event
● All trip creations/deletions/modifications
● Member modifications (blocked or not)
● New ratings from other members
● A seat has been reserved
● A member changes their vehicle

A change event is a call to the internal indexer
● Send '{"trip":123456}' to the indexer (create/update)
● Send '{"tripd":123456}' to the indexer (delete)
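
The event list above can be sketched as a lookup from event to indexer message. The event identifiers are hypothetical, since the slide describes the events without naming them. Only the two message shapes come from the slide.

```python
import json

# Hypothetical identifiers for the events listed on the slide.
REINDEX_EVENTS = {
    "trip_created",
    "trip_modified",
    "member_blocked_changed",
    "rating_added",
    "seat_reserved",
    "vehicle_changed",
}
DELETE_EVENTS = {"trip_deleted"}


def indexer_message(event, trip_id):
    """Return the JSON message to send to the indexer for one affected trip."""
    if event in DELETE_EVENTS:
        return json.dumps({"tripd": trip_id})   # delete
    if event in REINDEX_EVENTS:
        return json.dumps({"trip": trip_id})    # create or update
    raise ValueError("event does not touch the trip index: %s" % event)
```

Member-level events (blocking, ratings, vehicle) presumably have to be sent once per trip of that member, because each trip document carries its own copy of the member data.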

26/37

Sample trip index query

{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "and": [
          { "geo_distance": { "distance": "40.14937866995km", "from": { "lat": 48.856614, "lon": 2.3522219 } } },
          { "geo_distance": { "distance": "40.14937866995km", "to": { "lat": 45.764043, "lon": 4.835659 } } },
          { "range": { "price": { "from": 0, "include_lower": false } } }
        ]
      }
    }
  },
  "sort": [ { "trip_date": { "order": "asc" } } ],
  "filter": { "term": { "is_user_blocked": false } },
  "from": 0,
  "size": 10
}
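
A query of this shape can be assembled from a few parameters, as in the sketch below. The function names are assumptions. The body deliberately keeps the legacy syntax shown on the slide ("filtered" query, "and" filter, top-level "filter"), which later Elasticsearch releases changed.

```python
import json


def geo_filter(field, point, distance_km):
    """Keep documents whose geo_point `field` lies within distance_km of point."""
    lat, lon = point
    return {
        "geo_distance": {
            "distance": "%skm" % distance_km,
            field: {"lat": lat, "lon": lon},
        }
    }


def build_trip_search(origin, destination, distance_km, offset=0, size=10):
    """Build the search body: trips near origin and destination, soonest first."""
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {
                    "and": [
                        geo_filter("from", origin, distance_km),
                        geo_filter("to", destination, distance_km),
                        # Strictly positive prices only.
                        {"range": {"price": {"from": 0, "include_lower": False}}},
                    ]
                },
            }
        },
        "sort": [{"trip_date": {"order": "asc"}}],
        # Top-level filter: hides trips from blocked users.
        "filter": {"term": {"is_user_blocked": False}},
        "from": offset,
        "size": size,
    }


def to_json(body):
    """Serialize the body for sending to the search endpoint."""
    return json.dumps(body)
```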

27/37

The Real World

A trip now has more than 30 fields
● (FAQ is around 25 fields)
● (members even more...)

To build a trip document we need 3 different SQL queries
● (FAQ: 2 different SQL queries)
● (Member: 10 different SQL queries)

The Trip index has only 1 shard (because of grouping)

28/37

And now the caveats

29/37

Preloaded Scripts

We use MVEL scripts to improve scoring
● They are not clustered
● Each node needs to have the scripts
● A node restart is needed to add or modify them

Solution: Chef (tool from Opscode)
All node configurations are centralized in a Chef repository

30/37

Grouping documents

Homemade patches to ElasticSearch (based on work by Martijn Van Groningen for lusini.de)

Coming soon in ElasticSearch (I hope so much)

31/37

Mapping modification

On a running index:
Changing a type is not allowed
Changing an analyzer is not allowed

Solution: index alias
1) Changing the mapping → create a new index
2) When the new index is up to date → switch the alias
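
The alias switch in step 2 can be done as one request to the `_aliases` endpoint, which removes the alias from the old index and adds it to the new one together. The sketch below only builds that request body. The alias and index names in the usage note are hypothetical.

```python
import json


def alias_swap_actions(alias, old_index, new_index):
    """Build the body that moves `alias` from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


def alias_swap_json(alias, old_index, new_index):
    """Serialized form, ready to POST to the cluster's _aliases endpoint."""
    return json.dumps(alias_swap_actions(alias, old_index, new_index))
```

For example, `alias_swap_json("trips", "trips_v1", "trips_v2")` would repoint a `trips` alias. The application keeps querying the alias name and never sees the index change.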

32/37

IOs limits

We have only 2 nodes
● The Trip index is around 2 GB
● But only 1 shard for the Trip index
● Can index 100 trips per second on a busy evening

Solution: we installed Intel SSDs (while waiting for a distributed grouping feature)

33/37

Choosing the analyzer

Some fields must not be analyzed
● For example, ISO country codes (IT for Italy or DE for Germany are ignored in some cases)

A global analyzer has limits
● Accented characters from countries like France, Germany or Spain are not always parsed correctly
● One analyzer per country is difficult to implement in some cases
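
One plausible reason for the ignored country codes, stated here as an assumption since the slide gives none, is stop-word removal: once lowercased, "IT" and "DE" collide with common words in English and French stop lists. The toy analyzer below is a simplified stand-in, not the real Lucene analysis chain, and its stop lists are short illustrative samples.

```python
import re

# Short illustrative stop lists (assumption, not the exact Lucene lists).
ENGLISH_STOP_WORDS = {"a", "an", "and", "in", "is", "it", "of", "the", "to"}
FRENCH_STOP_WORDS = {"de", "la", "le", "les", "et", "un", "une"}


def analyze(text, stop_words=ENGLISH_STOP_WORDS):
    """Lowercase, split on non-word characters, then drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in stop_words]


def keyword(text):
    """What a not_analyzed field keeps: the whole value as one exact term."""
    return [text]
```

With this sketch, `analyze("IT")` yields no terms at all, while `keyword("IT")` keeps the code searchable. That matches the advice to leave such fields unanalyzed.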

34/37

OK, sweet. What's next?

35/37

Using ElasticSearch to ease log analysis

36/37

By the way…

We’re hiring !!! Dev, HTML Ninja, leader,…

Come & See me right now… or send me your friends

(And we have beer, table football and an arcade cabinet)

37/37

Thank you !

Follow us !

@covoiturage

Apply now :

join@BlaBlaCar.com