Post on 11-May-2015
How Lucene Powers LinkedIn Segmentation & Targeting Platform
Lucene/SOLR Revolution EU, November 2013 Hien Luu, Raj Rangaswamy
©2013 LinkedIn Corporation. All Rights Reserved.
About Us
Hien Luu, Rajasekaran Rangaswamy
Agenda
§ Little bit about LinkedIn
§ Segmentation & Targeting Platform Overview
§ How Lucene powers Segmentation & Targeting Platform
§ Q&A
Our Mission
Connect the world’s professionals to make them more productive and successful.
Our Vision
Create economic opportunity for every professional in the world.
Members First!
The world’s largest professional network
Over 65% of members are now international
Company Pages: >3M
Languages: 19
Professional searches in 2012: >5.7B
>30M
>90% of Fortune 100 companies use LinkedIn Talent Solutions to hire
Other Company Facts
• Headquartered in Mountain View, Calif., with offices around the world
• LinkedIn has ~4200 full-time employees located around the world
Source: http://press.linkedin.com/about
Segmentation & Targeting
Segmentation & Targeting Attribute types
Bhaskar Ghosh
Segmentation & Targeting
1. Create attributes
§ Name
§ Email
§ State
§ Occupation
§ Etc.
2. Attributes Added to Table
Name        Email            State       Occupation  …
John Smith  jsmith@blah.com  California  Engineer
Jane Smith  smithj@mail.com  Nevada      HR Manager
3. Create Target Segment: California, Engineer
Name        Email            State       Occupation
John Smith  jsmith@blah.com  California  Engineer
Jane Doe    jdoe@email.com   California  Engineer
4. Export List & Send Vendor
Jane Doe    jdoe@email.com   California  Engineer
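The four steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not the platform's code: the rows mirror the example table, and the helper names `build_segment` and `export_list` are hypothetical.

```python
import csv
import io

# Hypothetical member rows mirroring the example table on the slide.
members = [
    {"name": "John Smith", "email": "jsmith@blah.com", "state": "California", "occupation": "Engineer"},
    {"name": "Jane Smith", "email": "smithj@mail.com", "state": "Nevada", "occupation": "HR Manager"},
    {"name": "Jane Doe", "email": "jdoe@email.com", "state": "California", "occupation": "Engineer"},
]


def build_segment(rows, **criteria):
    """Return the rows whose attributes equal every given criterion."""
    return [r for r in rows if all(r.get(k) == v for k, v in criteria.items())]


def export_list(rows, columns=("name", "email")):
    """Render the selected columns of a segment as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Step 3: target segment "California, Engineer"; step 4: export the list.
segment = build_segment(members, state="California", occupation="Engineer")
exported = export_list(segment)
```

Here `segment` holds the two California engineers, and `exported` is the CSV text that would be handed to a vendor.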
Segmentation & Targeting
§ Business definition
– Business would like to launch new campaigns often
– Business would like to specify targeting criteria using an arbitrary set of attributes
– Attributes need to be computed to fulfill the targeting criteria
– The attribute data resides on Hadoop or Teradata (TD)
– Business is most comfortable with a SQL-like language
Segmentation & Targeting
Attribute Computation Engine
Attribute Serving Engine
Segmentation & Targeting
Attribute Computation Engine
Self-service
Support various data sources
Attribute consolidation
Attribute availability
Segmentation & Targeting
Attribute computation
[Diagram labels: data volumes PB, TB, TB; table dimensions ~238M and ~440]
Segmentation & Targeting
Attribute Serving Engine
Self-service
Attribute predicate expression
Build segments
Build lists
Segmentation & Targeting
Serving Engine
[Diagram: count, filter, sum, and complex expressions evaluated over the LinkedIn Member Attribute table (dimensions labeled ~238M and ~440)]
LinkedIn Segmentation & Targeting Platform
Who are the North American recruiters that don’t work for a competitor?
Who are the LinkedIn Talent Solution prospects in Europe?
Who are the job seekers?
LinkedIn Segmentation & Targeting Platform
Complex tree-like attribute predicate expressions
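A tree-like predicate expression can be modeled as nested AND / OR / NOT nodes over attribute comparisons. The sketch below is a minimal illustration, not the platform's representation: the attribute names (`country`, `occupation`, `works_for_competitor`, `connections`) and the sample rows are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

Member = Dict[str, Any]


@dataclass
class Eq:
    """Leaf predicate: the attribute equals the given value."""
    attr: str
    value: Any

    def matches(self, member: Member) -> bool:
        return member.get(self.attr) == self.value


@dataclass
class And:
    children: List[Any]

    def matches(self, member: Member) -> bool:
        return all(c.matches(member) for c in self.children)


@dataclass
class Or:
    children: List[Any]

    def matches(self, member: Member) -> bool:
        return any(c.matches(member) for c in self.children)


@dataclass
class Not:
    child: Any

    def matches(self, member: Member) -> bool:
        return not self.child.matches(member)


# Hypothetical sample rows; real attributes would come from the attribute table.
members = [
    {"id": 1, "country": "US", "occupation": "Recruiter", "works_for_competitor": False, "connections": 500},
    {"id": 2, "country": "CA", "occupation": "Recruiter", "works_for_competitor": True, "connections": 300},
    {"id": 3, "country": "DE", "occupation": "Recruiter", "works_for_competitor": False, "connections": 200},
    {"id": 4, "country": "US", "occupation": "Engineer", "works_for_competitor": False, "connections": 100},
    {"id": 5, "country": "CA", "occupation": "Recruiter", "works_for_competitor": False, "connections": 50},
]

# "North American recruiters that don't work for a competitor" as a tree.
north_american_recruiters = And([
    Or([Eq("country", "US"), Eq("country", "CA")]),
    Eq("occupation", "Recruiter"),
    Not(Eq("works_for_competitor", True)),
])

segment = [m for m in members if north_american_recruiters.matches(m)]  # filter
count = len(segment)                                                    # count
total_connections = sum(m["connections"] for m in segment)              # sum
```

The same tree serves the filter, count, and sum operations, which is why a single predicate format can back both segment building and list building.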
Agenda
§ Architecture
– Indexer Architecture
– Serving Architecture
§ Load Balanced Model
§ Next Steps - Distributed Model
§ DocValues
§ Lessons Learnt
§ Why not use an existing solution?
Architecture
[Diagram labels: Data Storage Layer, Attribute Creation Engine, Attribute Materialization Engine, Attribute Computation Engine, Attribute Metastore, Attribute Indexing, Attribute Serving Engine, LuceneOutputFormat, RecordWriter, LuceneDocumentWrapper, Document, Index, Index Merger, Web Servers]
Indexer
[Diagram: Avro data in HDFS and attribute definitions from the MySQL attribute store feed the Hadoop Indexer MR job, which writes index shards 1 to n to HDFS]
Mapper: K => AvroKey<GenericRecord>, V => AvroValue<NullWritable>
Reducer: K => NullWritable, V => LuceneDocumentWrapper
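The shape of the indexing job (partition records, build one index per shard, merge later) can be illustrated without Hadoop or Lucene. The following is a plain-Python sketch under stated assumptions: records are partitioned by `member_id` modulo the shard count, and each "index" is a simple term-to-postings dictionary. Neither choice is confirmed by the slides, and the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(records, num_shards):
    """Assign each record to a shard (assumed scheme: member_id modulo shard count)."""
    partitions = defaultdict(list)
    for rec in records:
        partitions[rec["member_id"] % num_shards].append(rec)
    return partitions


def reduce_phase(partition):
    """Build a tiny inverted index for one shard: (field, value) -> sorted member ids."""
    index = defaultdict(list)
    for rec in sorted(partition, key=lambda r: r["member_id"]):
        for field, value in rec.items():
            if field != "member_id":
                index[(field, value)].append(rec["member_id"])
    return dict(index)


def merge_indexes(indexes):
    """Merge per-shard indexes into one, as a separate merge step would."""
    merged = defaultdict(set)
    for index in indexes:
        for term, postings in index.items():
            merged[term].update(postings)
    return {term: sorted(ids) for term, ids in merged.items()}


# Hypothetical input records standing in for the Avro data.
records = [
    {"member_id": 1, "state": "California", "occupation": "Engineer"},
    {"member_id": 2, "state": "Nevada", "occupation": "HR Manager"},
    {"member_id": 3, "state": "California", "occupation": "Engineer"},
    {"member_id": 4, "state": "Nevada", "occupation": "Engineer"},
]

partitions = map_phase(records, num_shards=2)
shards = {shard_id: reduce_phase(rows) for shard_id, rows in partitions.items()}
merged = merge_indexes(shards.values())
```

Each shard can be built independently, which is what lets the work spread across reducers.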
Serving
[Diagram: a JSON Predicate Expression passes through the JSON Lucene Query Parser, is evaluated against the Inverted Indexes, and yields a Segment & List]
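To make the serving path concrete, here is a small sketch of how a JSON predicate expression could be evaluated against an inverted index using set operations on postings. The JSON shape (`and`, `or`, `not`, `term`) is a made-up format for illustration; the slides do not show the platform's actual JSON schema or the parser's output.

```python
import json


def evaluate(node, index, all_docs):
    """Recursively evaluate a one-key JSON node and return the matching doc ids."""
    op, arg = next(iter(node.items()))
    if op == "term":
        return set(index.get((arg["field"], arg["value"]), ()))
    if op == "and":
        results = [evaluate(child, index, all_docs) for child in arg]
        return set(all_docs).intersection(*results)
    if op == "or":
        results = [evaluate(child, index, all_docs) for child in arg]
        return set().union(*results)
    if op == "not":
        return set(all_docs) - evaluate(arg, index, all_docs)
    raise ValueError(f"unknown operator: {op}")


# Hypothetical inverted index: (field, value) -> doc ids.
index = {
    ("state", "California"): [1, 3, 5],
    ("state", "Nevada"): [2, 4],
    ("occupation", "Engineer"): [1, 3, 4],
    ("occupation", "HR Manager"): [2],
    ("occupation", "Recruiter"): [5],
}
all_docs = {1, 2, 3, 4, 5}

# Hypothetical predicate: in California and not a recruiter.
predicate_json = """
{"and": [
    {"term": {"field": "state", "value": "California"}},
    {"not": {"term": {"field": "occupation", "value": "Recruiter"}}}
]}
"""

matching = evaluate(json.loads(predicate_json), index, all_docs)
```

In Lucene the equivalent work is done by query objects over posting lists; the set arithmetic here only mirrors the logic.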
Serving – Load Balanced Model
[Diagram: an HTTP Request reaches the Load Balancer, which routes it to Web Server 1, Web Server 2, … Web Server n; the servers read Shard 1, Shard 2, … Shard n from a Shared Drive]
Serving – Load Balanced Model
But Wait…..
• Is load balancing alone good enough?
• What about distribution and failover?
Next Steps - Distributed Model
Apache Helix
• A generic cluster management framework
• Used to manage partitioned and replicated resources in distributed systems
• Built on top of ZooKeeper; hides the complexity of ZK primitives
• Provides distributed features such as leader election, two-phase commit, etc. via a state machine model
http://helix.incubator.apache.org/
Next Steps - Distributed Model
[Diagram: an HTTP Request reaches the Load Balancer, which routes it to Web Server 1, Web Server 2, or Web Server 3; the servers scatter-gather across shards. Each web server hosts one active and one standby shard: Shard 1 active with Shard 2 standby, Shard 2 active with Shard 3 standby, Shard 3 active with Shard 1 standby]
Next Steps - Distributed Model
[Diagram: the same topology after a web server failure. The server hosting Shard 3 (active) and Shard 1 (standby) is marked as failed; on the server hosting Shard 2 and Shard 3, the Shard 3 standby becomes active, so both of its shards are active]
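The active/standby behavior in the two diagrams can be sketched as a small state model. This is a hand-rolled illustration, not the Helix API: the `Cluster` class, its methods, and the shard contents are hypothetical, and a real deployment would delegate the state transitions to the cluster manager.

```python
class Cluster:
    """Tracks which server holds the active and standby replica of each shard."""

    def __init__(self, placement):
        self.servers = {
            name: {"active": set(roles["active"]), "standby": set(roles["standby"])}
            for name, roles in placement.items()
        }

    def owner(self, shard):
        """Return the server currently serving the shard's active replica."""
        for name, roles in self.servers.items():
            if shard in roles["active"]:
                return name
        raise LookupError(f"no active replica for shard {shard}")

    def fail(self, name):
        """Remove a server and promote a standby replica for each shard it served."""
        lost = self.servers.pop(name)
        for shard in lost["active"]:
            for roles in self.servers.values():
                if shard in roles["standby"]:
                    roles["standby"].remove(shard)
                    roles["active"].add(shard)
                    break

    def scatter_gather(self, shard_data, predicate):
        """Query every shard via its active owner and sum the per-shard counts."""
        total, routing = 0, {}
        for shard, rows in shard_data.items():
            routing[shard] = self.owner(shard)
            total += sum(1 for row in rows if predicate(row))
        return total, routing


# Placement taken from the first diagram; shard contents are made up.
cluster = Cluster({
    "ws1": {"active": [1], "standby": [2]},
    "ws2": {"active": [2], "standby": [3]},
    "ws3": {"active": [3], "standby": [1]},
})
shard_data = {
    1: [{"state": "California"}, {"state": "Nevada"}],
    2: [{"state": "California"}],
    3: [{"state": "California"}, {"state": "Texas"}],
}


def in_california(row):
    return row["state"] == "California"


before = cluster.scatter_gather(shard_data, in_california)
cluster.fail("ws3")
after = cluster.scatter_gather(shard_data, in_california)
```

After `ws3` fails, shard 3 is served by `ws2` and the query result is unchanged; shard 1 keeps its active replica but no longer has a standby.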
DocValues – Use Case
• Once segments are built, users want to forecast, that is, to see a target revenue projection for the campaigns they want to run
• Campaigns can be run on various Revenue Models
• This involves adding per-member Propensity Scores and Dollar Amounts
DocValues – Why not Stored Fields?
Why not use Stored Fields?
• Stored fields have one indirection per document, resulting in two disk seeks per document
• Performance cost quickly adds up when fetching millions of documents
[Diagram: Document ID, then .fdx (fetch file pointer to field data), then .fdt (scan by ID until field is found)]
DocValues – Why not Field Cache?
Why not use Field Cache?
• Is memory resident
• Works fine when there is enough memory
• But keeping millions of un-inverted values in memory is impossible
• Additional cost to parse values (from String and to String)
DocValues
• Dense column based storage (1 Value per Document and 1 Column per field and segment)
• Accepts primitives
• No conversion from/to String needed
• Loads 80x-100x faster than building a FieldCache
• All the work is done during Indexing
• DocValue fields can be indexed and stored too
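The difference between per-document stored values and a dense typed column can be illustrated with plain Python. This sketch does not use Lucene: `ColumnStore` is a hypothetical stand-in for the column layout, and the revenue formula (propensity times dollar amount, summed over the segment) is only one assumed revenue model.

```python
from array import array


class ColumnStore:
    """One dense, typed column per field, addressed directly by document id."""

    def __init__(self, num_docs):
        self.num_docs = num_docs
        self.columns = {}

    def add_column(self, field, values, typecode="d"):
        column = array(typecode, values)  # primitives, no String conversion
        if len(column) != self.num_docs:
            raise ValueError("a dense column needs exactly one value per document")
        self.columns[field] = column


def projected_revenue(store, doc_ids):
    """Assumed revenue model: sum of propensity * dollar amount over the segment."""
    propensity = store.columns["propensity"]
    dollars = store.columns["dollar_amount"]
    return sum(propensity[d] * dollars[d] for d in doc_ids)


def projected_revenue_from_rows(rows, doc_ids):
    """Row-oriented contrast: every value is fetched per document and parsed from text."""
    return sum(float(rows[d]["propensity"]) * float(rows[d]["dollar_amount"]) for d in doc_ids)


# Made-up values for four documents.
store = ColumnStore(num_docs=4)
store.add_column("propensity", [0.5, 0.25, 0.0, 1.0])
store.add_column("dollar_amount", [100.0, 200.0, 300.0, 400.0])

rows = [
    {"propensity": "0.5", "dollar_amount": "100.0"},
    {"propensity": "0.25", "dollar_amount": "200.0"},
    {"propensity": "0.0", "dollar_amount": "300.0"},
    {"propensity": "1.0", "dollar_amount": "400.0"},
]

segment_docs = [0, 1, 3]
revenue = projected_revenue(store, segment_docs)
revenue_from_rows = projected_revenue_from_rows(rows, segment_docs)
```

Both paths give the same number; the column version avoids the per-document lookup and string parsing that the slides cite as the cost of stored fields and the field cache.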
Lessons Learnt
Indexing
• Reuse index writers, field and document instances
• Create many partitions and merge them in a different process
• Rebuild (bootstrap) entire index if possible
• Use partial updates with caution
• Analyze the index
Serving
• Reuse a single instance of IndexSearcher
• Limit usage of stored fields and term vectors
• Plan for load balancing and failover
• Cache term frequencies
• Use different machines for serving and indexing
Why not use an existing solution?
• Doesn’t allow dynamic schema
• Difficult to bootstrap indexes built in Hadoop
• Indexing elevates query latency

• Doesn’t allow dynamic schema
• Difficult to bootstrap indexes built in Hadoop
• Larger memory overhead
• Comparatively slow

Questions? More info: data.linkedin.com