How Lucene Powers the LinkedIn Segmentation and Targeting Platform

How Lucene Powers LinkedIn Segmentation & Targeting Platform. Lucene/SOLR Revolution EU, November 2013. Hien Luu, Raj Rangaswamy. ©2013 LinkedIn Corporation. All Rights Reserved.

description

Presented by Hien Luu, Technical Lead, LinkedIn, and Rajasekaran Rangaswamy, LinkedIn.

For internet companies, marketing campaigns play an important role in acquiring new customers, retaining and engaging existing customers, and promoting new products. The LinkedIn segmentation and targeting platform helps marketing teams quickly and easily create member segments based on member attributes, using nested predicate expressions that range from simple to complex. Once segments are created, the qualifying members are targeted with marketing campaigns. Lucene is a key piece of technology in this platform. This session covers how we leverage Hadoop to efficiently build Lucene indexes for a large and growing member attribute data set of 225 million members, and how Lucene is used to create segments based on complex nested predicate expressions. It also shares some of the lessons we learned and the challenges we encountered in using Lucene to search over large data sets.

Transcript of How Lucene Powers the LinkedIn Segmentation and Targeting Platform

Page 1: How Lucene Powers the LinkedIn Segmentation and Targeting Platform

How Lucene Powers LinkedIn Segmentation & Targeting Platform

Lucene/SOLR Revolution EU, November 2013 Hien Luu, Raj Rangaswamy

©2013 LinkedIn Corporation. All Rights Reserved.

Page 2: How Lucene Powers the LinkedIn Segmentation and Targeting Platform

About Us

Hien Luu, Rajasekaran Rangaswamy

Page 3: How Lucene Powers the LinkedIn Segmentation and Targeting Platform

Agenda

• A little bit about LinkedIn
• Segmentation & Targeting Platform overview
• How Lucene powers the Segmentation & Targeting Platform
• Q&A

Page 4: How Lucene Powers the LinkedIn Segmentation and Targeting Platform

Our Mission: Connect the world’s professionals to make them more productive and successful.

Our Vision: Create economic opportunity for every professional in the world.

Members First!

Page 5: How Lucene Powers the LinkedIn Segmentation and Targeting Platform

The world’s largest professional network
Over 65% of members are now international

• Company Pages: >3M
• Languages: 19
• >30M
• >90% of Fortune 100 companies use LinkedIn Talent Solutions to hire
• Professional searches in 2012: >5.7B

Page 6: How Lucene Powers the LinkedIn Segmentation and Targeting Platform

Other Company Facts

• Headquartered in Mountain View, Calif., with offices around the world
• LinkedIn has ~4200 full-time employees located around the world

Source : http://press.linkedin.com/about

Page 7: How Lucene Powers the LinkedIn Segmentation and Targeting Platform

Segmentation & Targeting

Page 8: How Lucene Powers the LinkedIn Segmentation and Targeting Platform

Segmentation & Targeting

Page 9: How Lucene Powers the LinkedIn Segmentation and Targeting Platform

Segmentation & Targeting Attribute types

Bhaskar Ghosh

Page 10: How Lucene Powers the LinkedIn Segmentation and Targeting Platform

Segmentation & Targeting

1. Create attributes

• Name
• Email
• State
• Occupation
• Etc.

2. Attributes added to table

Name       | Email             | State      | Occupation | …
John Smith | [email protected] | California | Engineer   |
Jane Smith | [email protected] | Nevada     | HR Manager |

3. Create target segment: California, Engineer

Name       | Email             | State      | Occupation
John Smith | [email protected] | California | Engineer
Jane Doe   | [email protected] | California | Engineer

4. Export list & send to vendor

Jane Doe   | [email protected] | California | Engineer
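The four steps on this slide can be sketched in a few lines of Python. This is only an illustration of the workflow, not the platform's code: the member rows, the example.com email addresses, and the helper names `build_segment` and `export_list` are all assumptions made up for the example.

```python
import csv
import io

# Hypothetical member rows mirroring the slide's table; emails are placeholders.
MEMBERS = [
    {"name": "John Smith", "email": "john.smith@example.com",
     "state": "California", "occupation": "Engineer"},
    {"name": "Jane Smith", "email": "jane.smith@example.com",
     "state": "Nevada", "occupation": "HR Manager"},
    {"name": "Jane Doe", "email": "jane.doe@example.com",
     "state": "California", "occupation": "Engineer"},
]


def build_segment(members, **criteria):
    """Return the members whose attributes equal every given criterion."""
    return [m for m in members
            if all(m.get(attr) == value for attr, value in criteria.items())]


def export_list(segment, columns=("name", "email", "state", "occupation")):
    """Serialize a segment as CSV text that could be handed to a vendor."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(segment)
    return buf.getvalue()


# Step 3: target segment "California, Engineer"; step 4: export the list.
segment = build_segment(MEMBERS, state="California", occupation="Engineer")
csv_text = export_list(segment)
```

A linear scan like this is fine for three rows; the rest of the deck is about why an inverted index is used once the table has hundreds of millions of rows.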

Page 11: How Lucene Powers the LinkedIn Segmentation and Targeting Platform

Segmentation & Targeting

• Business definition
  – The business would like to launch new campaigns often
  – The business would like to specify targeting criteria using an arbitrary set of attributes
  – Attributes need to be computed to fulfill the targeting criteria
  – The attribute data resides on Hadoop or TD
  – The business is most comfortable with a SQL-like language

Page 12: How Lucene Powers the LinkedIn Segmentation and Targeting Platform

Segmentation & Targeting

Attribute Computation Engine

Attribute Serving Engine

Page 13: How Lucene Powers the LinkedIn Segmentation and Targeting Platform

Segmentation & Targeting

Attribute Computation Engine

Self-service

Support various data sources

Attribute consolidation

Attribute availability

Page 14: How Lucene Powers the LinkedIn Segmentation and Targeting Platform

Segmentation & Targeting

Attribute computation

[Figure: attribute computation over PB- and TB-scale data; table dimensions labeled ~238M and ~440]

Page 15: How Lucene Powers the LinkedIn Segmentation and Targeting Platform

Segmentation & Targeting

Attribute Serving Engine

Self-service

Attribute predicate expression

Build segments

Build lists

Page 16: How Lucene Powers the LinkedIn Segmentation and Targeting Platform

Segmentation & Targeting

Serving Engine

• Operations: count, filter, sum, complex expressions
• Data: LinkedIn Member Attribute table (~238M, ~440)

Page 17: How Lucene Powers the LinkedIn Segmentation and Targeting Platform

LinkedIn Segmentation & Targeting Platform

Who are the North American recruiters who don’t work for a competitor?

Who are the LinkedIn Talent Solution prospects in Europe?

Who are the job seekers?

Page 18: How Lucene Powers the LinkedIn Segmentation and Targeting Platform

LinkedIn Segmentation & Targeting Platform

Complex tree-like attribute predicate expressions
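To make "tree-like predicate expression" concrete, here is a minimal sketch of evaluating a nested expression against one member record. The node layout (`op`, `operands`, `attribute`, `value`) and the attribute names are assumptions for illustration; the slides do not show the platform's actual expression format.

```python
def evaluate(expr, member):
    """Recursively evaluate a nested predicate expression against one member."""
    op = expr["op"]
    if op == "and":
        return all(evaluate(e, member) for e in expr["operands"])
    if op == "or":
        return any(evaluate(e, member) for e in expr["operands"])
    if op == "not":
        return not evaluate(expr["operand"], member)
    # Leaf predicates compare a single attribute value.
    value = member.get(expr["attribute"])
    if op == "eq":
        return value == expr["value"]
    if op == "in":
        return value in expr["values"]
    raise ValueError(f"unknown operator: {op}")


# Hypothetical encoding of "North American recruiters that don't work
# for a competitor"; attribute names and values are made up.
EXPR = {"op": "and", "operands": [
    {"op": "eq", "attribute": "region", "value": "north_america"},
    {"op": "eq", "attribute": "occupation", "value": "recruiter"},
    {"op": "not", "operand": {"op": "in", "attribute": "company",
                              "values": ["CompetitorA", "CompetitorB"]}},
]}

MEMBERS = [
    {"id": 1, "region": "north_america", "occupation": "recruiter", "company": "Acme"},
    {"id": 2, "region": "north_america", "occupation": "recruiter", "company": "CompetitorA"},
    {"id": 3, "region": "europe", "occupation": "recruiter", "company": "Acme"},
]

segment_ids = [m["id"] for m in MEMBERS if evaluate(EXPR, m)]
```

Because `and`, `or`, and `not` nodes can contain further nodes, the same function handles expressions of any depth.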

Page 19: How Lucene Powers the LinkedIn Segmentation and Targeting Platform

Agenda

• Architecture
  – Indexer Architecture
  – Serving Architecture
• Load Balanced Model
• Next Steps - Distributed Model
• DocValues
• Lessons Learnt
• Why not use an existing solution?

Page 20: How Lucene Powers the LinkedIn Segmentation and Targeting Platform

Architecture

• Data Storage Layer
• Attribute Creation Engine
• Attribute Materialization Engine
• Attribute Computation Engine
• Attribute Metastore
• Attribute Indexing
• Attribute Serving Engine

Page 21: How Lucene Powers the LinkedIn Segmentation and Targeting Platform

Indexer

• Inputs: Avro data in HDFS; attribute definitions (MySQL attribute store)
• Hadoop Indexer MR
  – Mapper: K => AvroKey<GenericRecord>, V => AvroValue<NullWritable>
  – Reducer: K => NullWritable, V => LuceneDocumentWrapper
  – LuceneOutputFormat, RecordWriter, LuceneDocumentWrapper, Document, Index
• Output on HDFS: shard 1, shard 2, ..., shard n
• Index Merger
• Web Servers
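The shape of the indexing job (partition records into shards, build one index per shard, merge later) can be illustrated without Hadoop or Lucene. The sketch below is a plain-Python stand-in under stated assumptions: the hash-based partitioning, the dictionary-based "inverted index", and the names `shard_for`, `build_shards`, and `merge_shards` are all hypothetical and are not taken from the slides.

```python
import zlib
from collections import defaultdict


def shard_for(member_id, num_shards):
    """Pick a shard with a stable hash of the member id (assumed scheme)."""
    return zlib.crc32(str(member_id).encode("utf-8")) % num_shards


def build_shards(records, num_shards):
    """Build one small inverted index per shard.

    Each index maps (attribute, value) -> list of member ids, which is the
    role a per-shard Lucene index plays in the real job.
    """
    shards = [defaultdict(list) for _ in range(num_shards)]
    for rec in records:
        index = shards[shard_for(rec["member_id"], num_shards)]
        for attr, value in rec.items():
            if attr != "member_id":
                index[(attr, value)].append(rec["member_id"])
    return shards


def merge_shards(shards):
    """Merge per-shard indexes into one, as a separate index-merger step would."""
    merged = defaultdict(list)
    for index in shards:
        for term, ids in index.items():
            merged[term].extend(ids)
    return {term: sorted(ids) for term, ids in merged.items()}


RECORDS = [
    {"member_id": 1, "state": "California", "occupation": "Engineer"},
    {"member_id": 2, "state": "Nevada", "occupation": "HR Manager"},
    {"member_id": 3, "state": "California", "occupation": "Engineer"},
]

shards = build_shards(RECORDS, num_shards=2)
merged = merge_shards(shards)
```

Keeping the merge separate from shard building means the expensive step can run in a different process from the one that wrote the shards.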

Page 22: How Lucene Powers the LinkedIn Segmentation and Targeting Platform

Serving

JSON Predicate Expression → JSON Lucene Query Parser → Inverted Indexes → Segment & List
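One way to picture the "JSON Lucene Query Parser" step is as a translation from a JSON predicate tree into a query. The sketch below emits a string in Lucene's classic query parser syntax (`field:"term"`, `AND`, `OR`, `NOT`, parentheses). The JSON layout and the function names are assumptions for illustration; the actual parser may well build Lucene `Query` objects directly rather than a string.

```python
import json


def quote(value):
    """Wrap a term in double quotes, escaping backslashes and quotes."""
    text = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return '"' + text + '"'


def to_lucene(expr):
    """Translate a nested predicate (hypothetical JSON layout) to query text."""
    op = expr["op"]
    if op in ("and", "or"):
        joiner = " AND " if op == "and" else " OR "
        return "(" + joiner.join(to_lucene(e) for e in expr["operands"]) + ")"
    if op == "not":
        # A purely negative clause matches nothing on its own in Lucene,
        # so it is paired with the match-all query *:*.
        return "(*:* AND NOT " + to_lucene(expr["operand"]) + ")"
    if op == "eq":
        return expr["attribute"] + ":" + quote(expr["value"])
    if op == "in":
        field = expr["attribute"]
        return "(" + " OR ".join(field + ":" + quote(v) for v in expr["values"]) + ")"
    raise ValueError(f"unknown operator: {op}")


PREDICATE_JSON = """
{"op": "and", "operands": [
  {"op": "eq", "attribute": "region", "value": "north america"},
  {"op": "eq", "attribute": "occupation", "value": "recruiter"},
  {"op": "not", "operand":
    {"op": "in", "attribute": "company", "values": ["CompetitorA", "CompetitorB"]}}
]}
"""

query = to_lucene(json.loads(PREDICATE_JSON))
print(query)
```

The parentheses emitted at every boolean node keep the nesting of the original tree explicit, so operator precedence in the query parser never changes the meaning.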

Page 23: How Lucene Powers the LinkedIn Segmentation and Targeting Platform

Agenda

• Architecture
  – Indexer Architecture
  – Serving Architecture
• Load Balanced Model
• Next Steps - Distributed Model
• DocValues
• Lessons Learnt
• Why not use an existing solution?

Page 24: How Lucene Powers the LinkedIn Segmentation and Targeting Platform

Serving – Load Balanced Model

HTTP Request → Load Balancer → Web Server 1, Web Server 2, ..., Web Server n → Shared Drive (Shard 1, Shard 2, ..., Shard n)

Page 25: How Lucene Powers the LinkedIn Segmentation and Targeting Platform

Serving – Load Balanced Model

But wait...

• Is load balancing alone good enough?
• What about distribution and failover?

Page 26: How Lucene Powers the LinkedIn Segmentation and Targeting Platform

Agenda

• Architecture
  – Indexer Architecture
  – Serving Architecture
• Load Balanced Model
• Next Steps - Distributed Model
• DocValues
• Lessons Learnt
• Why not use an existing solution?

Page 27: How Lucene Powers the LinkedIn Segmentation and Targeting Platform

Next Steps - Distributed Model

Apache Helix:

• A generic cluster management framework
• Used to manage partitioned and replicated resources in distributed systems
• Built on top of ZooKeeper; hides the complexity of ZK primitives
• Provides distributed features such as leader election, two-phase commit, etc. via a state machine model

http://helix.incubator.apache.org/

Page 28: How Lucene Powers the LinkedIn Segmentation and Targeting Platform

Next Steps - Distributed Model

HTTP Request → Load Balancer → Web Server 1, Web Server 2, Web Server 3 (scatter-gather)

Each web server hosts one active shard and one standby shard:

• Shard 1 (active), Shard 2 (standby)
• Shard 2 (active), Shard 3 (standby)
• Shard 3 (active), Shard 1 (standby)

Page 29: How Lucene Powers the LinkedIn Segmentation and Targeting Platform

Next Steps - Distributed Model

HTTP Request → Load Balancer → Web Server 1, Web Server 2, Web Server 3 (scatter-gather)

After one web server fails, the standby copy on a surviving server becomes active:

• Shard 1 (active), Shard 2 (standby)
• Shard 2 (active), Shard 3 (active)
• Shard 3 (failure), Shard 1 (failure)
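The active/standby failover and scatter-gather behavior in this diagram can be simulated in plain Python. This is a toy model only: it does not use the Helix API, and the `Cluster` class, the node names `ws1` to `ws3`, and the assignment of shards to specific servers are assumptions made for the sketch.

```python
class Cluster:
    """Toy model of shard replicas with active/standby roles per node."""

    def __init__(self, placement):
        # placement: node name -> {shard name: "active" | "standby"}
        self.placement = {node: dict(roles) for node, roles in placement.items()}
        self.live = set(placement)

    def active_node(self, shard):
        """Return the live node currently serving the shard, or None."""
        for node in sorted(self.live):
            if self.placement[node].get(shard) == "active":
                return node
        return None

    def fail(self, node):
        """Mark a node as failed and promote standbys for its active shards."""
        self.live.discard(node)
        for shard, role in self.placement[node].items():
            if role == "active":
                self._promote_standby(shard)

    def _promote_standby(self, shard):
        for node in sorted(self.live):
            if self.placement[node].get(shard) == "standby":
                self.placement[node][shard] = "active"
                return node
        return None

    def scatter_gather(self, shards, query_fn):
        """Send the query to the active replica of every shard and combine results."""
        results = []
        for shard in shards:
            node = self.active_node(shard)
            if node is None:
                raise RuntimeError(f"no active replica for {shard}")
            results.extend(query_fn(node, shard))
        return results


cluster = Cluster({
    "ws1": {"shard1": "active", "shard2": "standby"},
    "ws2": {"shard2": "active", "shard3": "standby"},
    "ws3": {"shard3": "active", "shard1": "standby"},
})
cluster.fail("ws3")  # shard3's standby on ws2 takes over

routed = cluster.scatter_gather(
    ["shard1", "shard2", "shard3"],
    lambda node, shard: [(shard, node)],  # stand-in for a per-shard search
)
```

After the failure every shard still has an active replica, but shard1 and shard3 no longer have a standby, which is the state the diagram shows.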

Page 30: How Lucene Powers the LinkedIn Segmentation and Targeting Platform

Agenda

• Architecture
  – Indexer Architecture
  – Serving Architecture
• Load Balanced Model
• Next Steps - Distributed Model
• DocValues
• Lessons Learnt
• Why not use an existing solution?

Page 31: How Lucene Powers the LinkedIn Segmentation and Targeting Platform

DocValues – Use Case

• Once segments are built, users want to forecast, i.e. see a target revenue projection for the campaigns they want to run
• Campaigns can be run on various revenue models
• This involves adding per-member propensity scores and dollar amounts

Page 32: How Lucene Powers the LinkedIn Segmentation and Targeting Platform

DocValues – Why not Stored Fields?

• Stored fields have one indirection per document, resulting in two disk seeks per document
• Performance cost quickly adds up when fetching millions of documents

Lookup path for a document ID:

1. .fdx: fetch the file pointer to the field data
2. .fdt: scan by ID until the field is found

Page 33: How Lucene Powers the LinkedIn Segmentation and Targeting Platform

DocValues – Why not Field Cache?

• Is memory resident
• Works fine when there is enough memory
• But keeping millions of un-inverted values in memory is impossible
• Additional cost to parse values (from String and to String)

Page 34: How Lucene Powers the LinkedIn Segmentation and Targeting Platform

DocValues

• Dense column-based storage (one value per document, and one column per field and segment)
• Accepts primitives
• No conversion from/to String needed
• Loads 80x-100x faster than building a FieldCache
• All the work is done during indexing
• DocValues fields can be indexed and stored too
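The idea behind dense column storage can be shown with Python's typed arrays: one array per field, indexed by document ID, holding primitives that need no string parsing. This is a conceptual sketch, not the Lucene DocValues API; the `ColumnStore` class, the field names, the sample numbers, and the "propensity times amount" projection formula are all assumptions for illustration.

```python
from array import array


class ColumnStore:
    """One typed array per field; position i holds the value for document i."""

    def __init__(self, num_docs):
        self.num_docs = num_docs
        self.columns = {}

    def add_column(self, field, typecode, values):
        if len(values) != self.num_docs:
            raise ValueError("a dense column needs exactly one value per document")
        # array stores raw primitives ('d' = double, 'q' = signed 64-bit int).
        self.columns[field] = array(typecode, values)

    def sum_over(self, field, doc_ids):
        """Add up one column's values for the matching documents."""
        column = self.columns[field]
        return sum(column[doc_id] for doc_id in doc_ids)


store = ColumnStore(num_docs=4)
store.add_column("propensity", "d", [0.5, 0.25, 0.75, 1.0])
store.add_column("amount", "q", [100, 200, 400, 50])

matched = [0, 2, 3]  # document IDs returned by a segment query (made up)
total_amount = store.sum_over("amount", matched)

# Assumed revenue model: weight each member's dollar amount by propensity.
projection = sum(
    store.columns["propensity"][d] * store.columns["amount"][d] for d in matched
)
```

Reading a value is a single array lookup by document ID, in contrast with the pointer-then-scan path described for stored fields.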

Page 35: How Lucene Powers the LinkedIn Segmentation and Targeting Platform

Agenda

• Architecture
  – Indexer Architecture
  – Serving Architecture
• Load Balanced Model
• Next Steps - Distributed Model
• DocValues
• Lessons Learnt
• Why not use an existing solution?

Page 36: How Lucene Powers the LinkedIn Segmentation and Targeting Platform

Lessons Learnt

Indexing

• Reuse index writers, field and document instances
• Create many partitions and merge them in a different process
• Rebuild (bootstrap) the entire index if possible
• Use partial updates with caution
• Analyze the index

Serving

• Reuse a single instance of IndexSearcher
• Limit usage of stored fields and term vectors
• Plan for load balancing and failover
• Cache term frequencies
• Use different machines for serving and indexing

Page 37: How Lucene Powers the LinkedIn Segmentation and Targeting Platform

Agenda

• Architecture
  – Indexer Architecture
  – Serving Architecture
• Load Balanced Model
• Next Steps - Distributed Model
• DocValues
• Lessons Learnt
• Why not use an existing solution?

Page 38: How Lucene Powers the LinkedIn Segmentation and Targeting Platform

Why not use an existing solution?

• Doesn’t allow dynamic schema
• Difficult to bootstrap indexes built in Hadoop
• Indexing elevates query latency

• Doesn’t allow dynamic schema
• Difficult to bootstrap indexes built in Hadoop
• Larger memory overhead
• Comparatively slow

Page 39: How Lucene Powers the LinkedIn Segmentation and Targeting Platform

Questions? More info: data.linkedin.com
