How Lucene Powers the LinkedIn Segmentation and Targeting Platform

Post on 11-May-2015


Presented by Hien Luu, Technical Lead, LinkedIn, and Rajasekaran Rangaswamy, LinkedIn.

For internet companies, marketing campaigns play an important role in acquiring new customers, retaining and engaging existing customers, and promoting new products. The LinkedIn segmentation and targeting platform helps marketing teams quickly and easily create member segments based on member attributes, using nested predicate expressions that range from simple to complex. Once segments are created, the qualifying members are targeted with marketing campaigns. Lucene is a key piece of technology in this platform. This session covers how we leverage Hadoop to efficiently build Lucene indexes for a large and growing member attribute data set of 225 million members, and how Lucene is used to create segments based on complex nested predicate expressions. It also shares some of the lessons we learned and the challenges we encountered in using Lucene to search over large data sets.

Transcript of How Lucene Powers the LinkedIn Segmentation and Targeting Platform

How Lucene Powers LinkedIn Segmentation & Targeting Platform

Lucene/SOLR Revolution EU, November 2013 Hien Luu, Raj Rangaswamy

©2013 LinkedIn Corporation. All Rights Reserved.

About Us

Hien Luu
Rajasekaran Rangaswamy

Agenda

§  A little bit about LinkedIn
§  Segmentation & Targeting Platform Overview
§  How Lucene powers the Segmentation & Targeting Platform
§  Q&A

Our Mission: Connect the world’s professionals to make them more productive and successful.

Our Vision: Create economic opportunity for every professional in the world.

Members First!


The world’s largest professional network

Over 65% of members are now international

Company Pages

>3M

Languages

>30M

>90% of Fortune 100 companies use LinkedIn Talent Solutions to hire

Professional searches in 2012

>5.7B

19

Other Company Facts

•  Headquartered in Mountain View, Calif., with offices around the world!
•  LinkedIn has ~4200 full-time employees located around the world

Source: http://press.linkedin.com/about


Segmentation & Targeting

Segmentation & Targeting Attribute types

Bhaskar Ghosh

Segmentation & Targeting


1. Create attributes

§  Name
§  Email
§  State
§  Occupation
§  Etc.

2. Attributes Added to Table

Name         Email             State        Occupation   …
John Smith   jsmith@blah.com   California   Engineer
Jane Smith   smithj@mail.com   Nevada       HR Manager

3. Create Target Segment: California, Engineer

Name         Email             State        Occupation
John Smith   jsmith@blah.com   California   Engineer
Jane Doe     jdoe@email.com    California   Engineer

4. Export List & Send to Vendor

Jane Doe     jdoe@email.com    California   Engineer
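
The four steps above can be sketched in a few lines of Python. This is only an illustration: the rows combine the sample members shown on the slides, and the helper names build_segment and export_list are invented here, not part of the platform.

```python
import csv
import io

# Sample rows taken from the slide tables (combined into one list for the sketch).
MEMBERS = [
    {"Name": "John Smith", "Email": "jsmith@blah.com", "State": "California", "Occupation": "Engineer"},
    {"Name": "Jane Smith", "Email": "smithj@mail.com", "State": "Nevada", "Occupation": "HR Manager"},
    {"Name": "Jane Doe", "Email": "jdoe@email.com", "State": "California", "Occupation": "Engineer"},
]


def build_segment(members, criteria):
    """Return the members whose attributes equal every value in `criteria`."""
    return [m for m in members if all(m.get(k) == v for k, v in criteria.items())]


def export_list(segment, fields):
    """Serialize the chosen fields of a segment to CSV text (hypothetical hand-off format)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fields, extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(segment)
    return buf.getvalue()


segment = build_segment(MEMBERS, {"State": "California", "Occupation": "Engineer"})
print(export_list(segment, ["Name", "Email"]))
```

Flat equality criteria are enough for this example; the nested predicate expressions discussed later generalize the criteria argument.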

Segmentation & Targeting

§  Business definition
–  Business would like to launch new campaigns often
–  Business would like to specify targeting criteria using an arbitrary set of attributes
–  Attributes need to be computed to fulfill the targeting criteria
–  The attribute data resides on Hadoop or TD
–  Business is most comfortable with a SQL-like language

Segmentation & Targeting

Attribute Computation Engine

Attribute Serving Engine

Segmentation & Targeting

Attribute Computation Engine

Self-service

Support various data sources

Attribute consolidation

Attribute availability

Segmentation & Targeting


Attribute computation

~238M

PB

TB

TB

~440

Segmentation & Targeting


Attribute Serving Engine

Self-service

Attribute predicate expression

Build segments

Build lists

Segmentation & Targeting


Serving Engine

count, filter, sum, complex expressions

LinkedIn Member Attribute table

~238M

~440

LinkedIn Segmentation & Targeting Platform

Who are the North American recruiters that don’t work for a competitor?

Who are the LinkedIn Talent Solution prospects in Europe?

Who are the job seekers?

LinkedIn Segmentation & Targeting Platform


Complex tree-like attribute predicate expressions
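
To make "tree-like predicate expression" concrete, here is a minimal sketch that encodes the first question above as a nested tree and checks it against one member at a time. The node encoding, the attribute names (country, occupation, company) and the sample values are all assumptions made for illustration; the slides do not show the platform's actual expression format.

```python
# Hypothetical node shapes:
#   ("and", [children])  ("or", [children])  ("not", child)
#   ("eq", attribute, value)  ("in", attribute, set_of_values)
def matches(node, member):
    """Recursively evaluate a predicate tree against one member's attributes."""
    op = node[0]
    if op == "and":
        return all(matches(child, member) for child in node[1])
    if op == "or":
        return any(matches(child, member) for child in node[1])
    if op == "not":
        return not matches(node[1], member)
    if op == "eq":
        return member.get(node[1]) == node[2]
    if op == "in":
        return member.get(node[1]) in node[2]
    raise ValueError(f"unknown operator: {op}")


# "North American recruiters that don't work for a competitor" (values invented).
expr = ("and", [
    ("in", "country", {"US", "CA", "MX"}),
    ("eq", "occupation", "Recruiter"),
    ("not", ("in", "company", {"CompetitorA", "CompetitorB"})),
])

members = [
    {"name": "A", "country": "US", "occupation": "Recruiter", "company": "Acme"},
    {"name": "B", "country": "CA", "occupation": "Recruiter", "company": "CompetitorA"},
    {"name": "C", "country": "DE", "occupation": "Recruiter", "company": "Acme"},
    {"name": "D", "country": "MX", "occupation": "Engineer", "company": "Acme"},
]

segment = [m["name"] for m in members if matches(expr, m)]
print(segment)
```

Scanning every member like this would not scale to hundreds of millions of rows, which is the motivation for evaluating such expressions against an index instead.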

Agenda

§  Architecture
–  Indexer Architecture
–  Serving Architecture
§  Load Balanced Model
§  Next Steps - Distributed Model
§  DocValues
§  Lessons Learnt
§  Why not use an existing solution?

Architecture

Data Storage Layer

Attribute Creation Engine

Attribute Materialization Engine

Attribute Computation Engine

Attribute Metastore

Attribute Indexing

Attribute Serving Engine

LuceneOutputFormat, RecordWriter, LuceneDocumentWrapper

Document

Index

Index Merger

Web Servers

Indexer

HDFS

shard 1

shard 2

shard n

Avro data in HDFS

MySQL attribute store

Hadoop Indexer MR

Attribute Definitions

Mapper: K => AvroKey<GenericRecord>, V => AvroValue<NullWritable>

Reducer: K => NullWritable, V => LuceneDocumentWrapper
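
The indexing flow can be imitated in plain Python to show the shape of the job: records are projected onto the defined attributes, routed to a shard, and grouped into one document list per shard. This is a sketch under stated assumptions, not the actual MapReduce job: the slides do not say how members are assigned to shards, so the modulo partitioner, the shard count, and the function names here are hypothetical, and plain dicts stand in for Avro records and Lucene documents.

```python
from collections import defaultdict

NUM_SHARDS = 3  # assumed; the slides only show "shard 1 .. shard n"


def shard_for(member_id, num_shards=NUM_SHARDS):
    """Deterministically map a member to a shard (simple modulo partitioner)."""
    return member_id % num_shards


def map_phase(records, attribute_definitions):
    """Emit (shard, document) pairs, keeping only the attributes that are defined."""
    for record in records:
        doc = {name: record[name] for name in attribute_definitions if name in record}
        doc["member_id"] = record["member_id"]
        yield shard_for(record["member_id"]), doc


def reduce_phase(pairs):
    """Group documents by shard; each group stands in for one index shard."""
    shards = defaultdict(list)
    for shard, doc in pairs:
        shards[shard].append(doc)
    return dict(shards)


records = [{"member_id": i, "state": "California", "scratch": i * 10} for i in range(6)]
shards = reduce_phase(map_phase(records, ["state"]))
for shard_id in sorted(shards):
    print(shard_id, [doc["member_id"] for doc in shards[shard_id]])
```

Driving the projection from a list of attribute definitions mirrors the diagram, where the definitions come from a separate attribute store rather than being hard-coded in the job.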


JSON Predicate Expression

JSON Lucene Query Parser

Inverted Index

Inverted Index

Inverted Index

Segment & List

Serving
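
The serving path (JSON predicate expression, query parser, inverted index, segment) can be illustrated with set operations over a toy inverted index. The JSON shape used here ("and", "or", "not", "term") is invented for the sketch, since the slides do not show the real format, and Python sets stand in for Lucene postings and Boolean queries.

```python
import json


def build_inverted_index(docs):
    """Map each (field, value) pair to the set of document ids containing it."""
    index = {}
    for doc_id, doc in enumerate(docs):
        for field, value in doc.items():
            index.setdefault((field, value), set()).add(doc_id)
    return index


def evaluate(node, index, all_docs):
    """Evaluate a parsed JSON predicate against the index; returns matching doc ids."""
    if "term" in node:
        field, value = node["term"]
        return set(index.get((field, value), set()))
    if "and" in node:
        result = set(all_docs)
        for child in node["and"]:
            result &= evaluate(child, index, all_docs)
        return result
    if "or" in node:
        result = set()
        for child in node["or"]:
            result |= evaluate(child, index, all_docs)
        return result
    if "not" in node:
        return set(all_docs) - evaluate(node["not"], index, all_docs)
    raise ValueError(f"unknown predicate node: {node}")


docs = [
    {"state": "California", "occupation": "Engineer"},
    {"state": "Nevada", "occupation": "HR Manager"},
    {"state": "California", "occupation": "HR Manager"},
    {"state": "Oregon", "occupation": "Engineer"},
]
index = build_inverted_index(docs)

# (state is California OR Oregon) AND NOT (occupation is HR Manager)
predicate = json.loads("""
{"and": [
  {"or": [{"term": ["state", "California"]}, {"term": ["state", "Oregon"]}]},
  {"not": {"term": ["occupation", "HR Manager"]}}
]}
""")

segment = sorted(evaluate(predicate, index, range(len(docs))))
print(segment, len(segment))
```

A segment count is then just the size of the resulting set, and a list is the set itself mapped back to member records.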


Serving – Load Balanced Model

HTTP Request

Load Balancer

Web Server 1, Web Server 2, Web Server n

Shared Drive: Shard 1, Shard 2, Shard n

Serving – Load Balanced Model

But Wait…..

•  Is load balancing alone good enough?

•  What about distribution and failover?


Next Steps - Distributed Model

•  Apache Helix: a generic cluster management framework

•  Used to manage partitioned and replicated resources in distributed systems

•  Built on top of ZooKeeper; hides the complexity of ZK primitives

•  Provides distributed features such as leader election and two-phase commit via a state machine model

http://helix.incubator.apache.org/

Next Steps - Distributed Model

HTTP Request

Load Balancer

Scatter Gather

Web Server 1: Shard 1 (active), Shard 2 (standby)

Web Server 2: Shard 2 (active), Shard 3 (standby)

Web Server 3: Shard 3 (active), Shard 1 (standby)

Next Steps - Distributed Model

HTTP Request

Load Balancer

Scatter Gather

Web Server 1: Shard 1 (active), Shard 2 (standby)

Web Server 2: Shard 2 (active), Shard 3 (active)

Web Server 3: Shard 3 (failure), Shard 1 (failure)
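
The failover behaviour in the two diagrams can be sketched as a state change on an assignment map: when a server fails, any shard it held as active is promoted on a surviving server that holds the standby copy. This is a plain-Python illustration only. It does not use the Helix API, and the server and shard names, the data layout, and the function names are assumptions.

```python
def promote_standbys(assignments, failed_server):
    """Return the assignment map after `failed_server` goes down.

    assignments: {server: {shard: "active" | "standby"}}
    """
    survivors = {
        server: dict(shards)
        for server, shards in assignments.items()
        if server != failed_server
    }
    lost_active = [
        shard for shard, state in assignments[failed_server].items() if state == "active"
    ]
    for shard in lost_active:
        for shards in survivors.values():
            if shards.get(shard) == "standby":
                shards[shard] = "active"  # promote the first standby copy found
                break
    return survivors


def active_servers(assignments):
    """Map each shard to the server currently serving it (the scatter targets)."""
    return {
        shard: server
        for server, shards in assignments.items()
        for shard, state in shards.items()
        if state == "active"
    }


cluster = {
    "web1": {"shard1": "active", "shard2": "standby"},
    "web2": {"shard2": "active", "shard3": "standby"},
    "web3": {"shard3": "active", "shard1": "standby"},
}

after = promote_standbys(cluster, "web3")
print(after)
print(active_servers(after))
```

After the failure every shard still has an active copy, so a scatter-gather query can reach all shards, but shard 1 and shard 3 no longer have a standby until the failed server is replaced.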


DocValues – Use Case

•  Once segments are built, users want to forecast, that is, see a target revenue projection for the campaigns that they want to run.

•  Campaigns can be run on various revenue models

•  This involves adding per-member propensity scores and dollar amounts

DocValues – Why not Stored Fields?

Why not use Stored Fields?

•  Stored fields have one indirection per document, resulting in two disk seeks per document

•  Performance cost quickly adds up when fetching millions of documents

Document ID

.fdx: fetch file pointer to field data

.fdt: scan by ID until field is found

DocValues – Why not Field Cache?

Why not use Field Cache?

•  Is memory resident

•  Works fine when there is enough memory

•  But keeping millions of un-inverted values in memory is impossible

•  Additional cost to parse values (from String and to String)


DocValues

•  Dense column based storage (1 Value per Document and 1 Column per field and segment)

•  Accepts primitives

•  No conversion from/to String needed

•  Loads 80x-100x faster than building a FieldCache

•  All the work is done during Indexing

•  DocValue fields can be indexed and stored too
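
The difference between row-oriented stored fields and column-oriented DocValues can be shown with a small in-memory analogy. This is not Lucene code. The field names (propensity, dollars), the values, and the function names are invented, and a Python array of doubles stands in for a per-field DocValues column.

```python
from array import array

# Row-oriented analogy for stored fields: one record per document, values kept
# as strings, so every read pays for a record lookup plus a parse.
stored_fields = [
    {"propensity": "0.25", "dollars": "120.0"},
    {"propensity": "0.5", "dollars": "80.0"},
    {"propensity": "0.75", "dollars": "200.0"},
]

# Column-oriented analogy for DocValues: one dense array of primitives per field,
# addressed directly by document id, with no string conversion.
doc_values = {
    "propensity": array("d", [0.25, 0.5, 0.75]),
    "dollars": array("d", [120.0, 80.0, 200.0]),
}


def sum_from_rows(field, doc_ids):
    """Sum a field by fetching each document's record and parsing the value."""
    return sum(float(stored_fields[doc_id][field]) for doc_id in doc_ids)


def sum_from_column(field, doc_ids):
    """Sum a field by indexing straight into its column."""
    column = doc_values[field]
    return sum(column[doc_id] for doc_id in doc_ids)


segment = [0, 2]  # document ids that matched some segment query
print(sum_from_rows("dollars", segment), sum_from_column("dollars", segment))
```

Both functions return the same total; the point of the column layout is that summing a value over millions of matching documents touches one contiguous array instead of one record per document.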


Lessons Learnt

Indexing
•  Reuse index writers, field and document instances
•  Create many partitions and merge them in a different process
•  Rebuild (bootstrap) entire index if possible
•  Use partial updates with caution
•  Analyze the index

Serving
•  Reuse a single instance of IndexSearcher
•  Limit usage of stored fields and term vectors
•  Plan for load balancing and failover
•  Cache term frequencies
•  Use different machines for serving and indexing


Why not use an existing solution?

•  Doesn’t allow dynamic schema
•  Difficult to bootstrap indexes built in Hadoop
•  Indexing elevates query latency

•  Doesn’t allow dynamic schema
•  Difficult to bootstrap indexes built in Hadoop
•  Larger memory overhead
•  Comparatively slow

Questions? More info: data.linkedin.com
