Search Engines Performance - Explained


Transcript of Search Engines Performance - Explained

Page 1: Search Engines Performance - Explained

About me

Over 12 years in the software world

Israeli Air Force, Israel Discount Bank, SAP

Team Leader, System Architect

Java Eco-System, Continuous Delivery, Search, Big Data

Contact: @alonaizenberg, alonaizenberg.blogspot.com, alon.aizenberg@gmail.com

Page 2: Search Engines Performance - Explained

Search Engine Performance

explained

Page 3: Search Engines Performance - Explained
Page 4: Search Engines Performance - Explained

Apache Solr

In this talk we will use Apache Solr as our example search engine, but the majority of concepts and mechanisms hold for most of the products available on the market

Page 5: Search Engines Performance - Explained

Agenda

History

Market at a glance

Anatomy of a typical search system

Scenarios and problems

Scaling the search scenario

● Handling large data-sets

● Handling request load

Achieving high availability

Page 6: Search Engines Performance - Explained

History

● 1994 - Lycos

● 1995 - AltaVista, Yahoo!

● 1997 - Yandex

● 1998 - Google, MSN search

● 2000 - First Lucene release (marks the rise of custom search implementations)

● 2006 - ask.com, AOL search

● 2009 - Bing

Page 7: Search Engines Performance - Explained

Market at a Glance

Many open source offerings: Apache Lucene, Apache Solr (built on Lucene), Nutch, Sphinx, ElasticSearch (built on Lucene), Xapian, and many more...

Some enterprise solutions: Google (Google Search Appliance, Google Mini), SAP (TREX, Enterprise Search), IBM (OmniFind), Oracle (Oracle Secure Enterprise Search), Microsoft (FAST Search Server)

Almost no standards: OpenSearch, Robots Exclusion Standard

Page 8: Search Engines Performance - Explained

Anatomy

of a typical search system

Page 9: Search Engines Performance - Explained

Anatomy of a typical search system

Page 10: Search Engines Performance - Explained

Anatomy of a typical search system

How is data stored in the engine?

● Index file(s)
● Each index is a collection of Documents (we will see later that this is not really true)
● A Document is a collection of data fields
● A field can be of any data type (text, integer, boolean, etc.)
● An index file has an internal data structure mapping from terms to Documents (an inverted index)
● Very similar to a database table

How is information indexed?

● An indexing API allows programs to index information in a transparent way
● Remote

How is information retrieved / searched?

● A rich query language (like SQL) allows complex search queries
● Remote (see the SolrJ sketch below)
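To make the remote indexing and search APIs concrete, here is a minimal SolrJ sketch; the core name "products", the URL, and the id/title fields are assumptions for illustration, not part of the original slides:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class SolrHelloWorld {
    public static void main(String[] args) throws Exception {
        // Assumed core name and URL -- adjust to your deployment.
        try (HttpSolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/products").build()) {

            // Indexing: a Document is just a collection of typed fields.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");
            doc.addField("title", "Search engine performance explained");
            client.add(doc);
            client.commit(); // make the newly written segment visible to searchers

            // Searching: the query travels over HTTP, much like SQL to a database.
            QueryResponse rsp = client.query(new SolrQuery("title:performance"));
            for (SolrDocument hit : rsp.getResults()) {
                System.out.println(hit.getFieldValue("id") + " -> " + hit.getFieldValue("title"));
            }
        }
    }
}
```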

Page 11: Search Engines Performance - Explained

Solr index structure

Page 12: Search Engines Performance - Explained

Scenarios and their problems

Page 13: Search Engines Performance - Explained

2 main scenarios

● Search scenario: search for a term

Problems:
■ How to execute a search on a big data-set, fast.
■ How to scale the solution to serve any given number of concurrent requests.
■ How to provide a highly available service.

● Indexing scenario: build indexes via add/delete/update document operations

Problems:
■ How to index a large number of documents, fast.

We will discuss only the search scenario; if we have time, we will touch on the indexing scenario too.

Page 14: Search Engines Performance - Explained

Scaling the Search Scenario

handling large data-sets

Page 15: Search Engines Performance - Explained

Handling Large Data-Sets

● Search time in an index is a function of data size.

● To process a big data-set efficiently, we have to break the data down into smaller parts, process them concurrently, and then combine the results.

● This principle is also called Map Reduce.

● In search, we split large indexes into shards, and search each shard concurrently.

● Concurrent search request processing can happen on the same machine across multiple CPUs, or on different machines (a minimal sketch follows this list).
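A minimal, self-contained scatter-gather sketch of this split/search/combine idea, using plain Java threads and in-memory lists standing in for shards (all names here are illustrative, not Solr internals):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;

public class ScatterGatherSketch {
    public static void main(String[] args) throws Exception {
        // Each inner list stands in for one index shard.
        List<List<String>> shards = List.of(
                List.of("solr performance", "big data"),
                List.of("search performance", "continuous delivery"),
                List.of("java ecosystem", "high performance search"));

        String term = "performance";
        ExecutorService pool = Executors.newFixedThreadPool(shards.size());

        // "Map": search every shard concurrently.
        List<Future<List<String>>> partials = new ArrayList<>();
        for (List<String> shard : shards) {
            partials.add(pool.submit(() -> shard.stream()
                    .filter(doc -> doc.contains(term))
                    .collect(Collectors.toList())));
        }

        // "Reduce": wait for every shard and combine the partial hits.
        List<String> results = new ArrayList<>();
        for (Future<List<String>> f : partials) {
            results.addAll(f.get()); // blocks until the slowest shard answers
        }
        pool.shutdown();

        System.out.println(results);
    }
}
```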

Page 16: Search Engines Performance - Explained

Handling Large Data-Sets - Map Reduce

A 2-step process:

"Map" step - The master node takes the input, partitions it up into smaller sub-problems, and distributes them to worker nodes. The worker node processes the smaller problem, and passes the answer back to its master node.

"Reduce" step - The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.

Page 17: Search Engines Performance - Explained

Map Reduce Examples

Distributed Grep: The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.

Count of URL Access Frequency: The map function processes logs of web page requests and outputs <URL, 1>. The reduce function adds together all values for the same URL and emits a <URL, total count> pair.

Generate Inverted Index: The map function parses each document, and emits a sequence of <word, document ID> pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a <word, list(document ID)> pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.
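A compact sketch of that last example in plain Java, with a single process standing in for the distributed map and reduce workers (names are illustrative):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class InvertedIndexSketch {
    public static void main(String[] args) {
        // Input: document ID -> document text.
        Map<Integer, String> docs = Map.of(
                1, "search engines explained",
                2, "search performance",
                3, "engines and performance");

        // "Map" step: emit a <word, docId> pair for every word.
        // "Reduce" step: collect doc IDs per word (TreeMap keeps words sorted).
        Map<String, List<Integer>> index = new TreeMap<>();
        docs.forEach((id, text) -> {
            for (String word : text.split("\\s+")) {
                index.computeIfAbsent(word, w -> new ArrayList<>()).add(id);
            }
        });
        index.values().forEach(Collections::sort); // sort doc IDs per word

        // A simple inverted index: word -> sorted list of document IDs.
        System.out.println(index);
        // e.g. {and=[3], engines=[1, 3], explained=[1], performance=[2, 3], search=[1, 2]}
    }
}
```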

Page 18: Search Engines Performance - Explained

Scaling Map Reduce

● Scale up (vertical) approach.

● Add more CPU / memory so more threads can run concurrently, searching different parts of the index.

● Pros:
○ Easy to implement on a single machine.
○ Usually no performance compromises as we add more CPU/memory (almost linear scalability).

● Cons:
○ We want to use small and cheap machines, yet run big scenarios.
○ Large data sets cannot fit into one physical machine.

● A "scale up"-only approach is not realistic, due to the cons above.

Page 19: Search Engines Performance - Explained

Scaling Map Reduce

● Scale out (horizontal) approach.

● Split the data across multiple machines and run the search tasks in parallel on multiple nodes (a.k.a. sharding or distributed search).

● Pros:
○ Cheap machines.
○ No limits on data set size.

● Cons:
○ Complex implementation.
○ Performance penalty for large clusters (not linearly scalable).

● Each request is handled by all machines / shards.

● This approach is what most of the big projects implement (see the sketch below).
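In classic (pre-SolrCloud) Solr, a distributed search is triggered by listing the shards in the request itself; a hedged SolrJ sketch, where the host names and core name are assumptions:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DistributedSearch {
    public static void main(String[] args) throws Exception {
        // The node we talk to coordinates the distributed request.
        try (HttpSolrClient client =
                 new HttpSolrClient.Builder("http://shard1:8983/solr/products").build()) {

            SolrQuery q = new SolrQuery("title:performance");
            // Classic distributed search: list every shard to query;
            // the receiving node scatters the request and merges the results.
            q.set("shards", "shard1:8983/solr/products,shard2:8983/solr/products");

            QueryResponse rsp = client.query(q);
            System.out.println("hits: " + rsp.getResults().getNumFound());
        }
    }
}
```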

Page 20: Search Engines Performance - Explained

Handling Large Data-Sets - Distributed Sharding

Page 21: Search Engines Performance - Explained

Handling Large Data-Sets - Distributed Sharding

Page 22: Search Engines Performance - Explained

Handling Large Data-Sets - Distributed Sharding Problems

● The more data we have, the more index shards we will split our cluster into.

● Adding shards is not free.

● Each shard brings a performance penalty.

● Distributing the query to multiple nodes is time- and network-bound.

● We wait for results from all nodes, and not all nodes behave equally, even when they have the same hardware specifications and the same amount of data indexed.

● Time is wasted executing the reduce function.

● Does not scale linearly: more nodes = less performance gain from each new node.

Page 23: Search Engines Performance - Explained

Scaling the Search Scenario

handling request load

Page 24: Search Engines Performance - Explained

Handling Request Load

● Now we know how to cope with a LOT of data. But how do we handle a LOT of users / search requests?

● Scale the solution out again (horizontally), by replicating each shard.

● Each shard exists on one Master machine and multiple Slave machines (replicas of the Master).

● The Master is responsible for running indexing requests only. It has the most recent index instance.

● Slaves replicate the index(es) from the Master node, and serve search requests.

● Load balance the Slaves with standard hardware / software load balancing solutions (a minimal sketch follows).

● The more load / users you have, the more Slaves you add to handle the search requests.
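SolrJ ships a simple client-side load balancer that fits this topology; a hedged sketch, where the slave host names and core name are assumptions:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.LBHttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SlaveLoadBalancing {
    public static void main(String[] args) throws Exception {
        // Round-robins requests across the slave replicas of one shard,
        // skipping nodes that stop responding.
        try (LBHttpSolrClient lb = new LBHttpSolrClient.Builder()
                .withBaseSolrUrls("http://slave1:8983/solr/products",
                                  "http://slave2:8983/solr/products",
                                  "http://slave3:8983/solr/products")
                .build()) {

            QueryResponse rsp = lb.query(new SolrQuery("title:performance"));
            System.out.println("hits: " + rsp.getResults().getNumFound());
        }
    }
}
```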

Page 25: Search Engines Performance - Explained

Handling Request Load - Replication

Page 26: Search Engines Performance - Explained

Handling Request Load - replication

● Indexes may be composed of multiple sub-indexes, or segments. Each segment is a fully independent index which can be searched separately.

● Common scenario: add a bulk of documents to an index shard on a Master server.

● A new segment is created or altered (in remove-document or update scenarios).

● The Master takes a snapshot of the index state at a given time, marking the new/changed segments in the index.

● Slaves poll the Master to see if any segment should be replicated.

● Segments are replicated to all Slave nodes.

● A new 'view' is created for the new segment configuration (see the configuration sketch below).
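With Solr's pure Java replication, this master/slave polling is configured in solrconfig.xml; a hedged sketch, where the master host, core name, and poll interval are assumptions:

```xml
<!-- On the Master: publish a new index version after each commit/optimize -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="replicateAfter">optimize</str>
  </lst>
</requestHandler>

<!-- On each Slave: poll the Master and pull only the changed segments -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master:8983/solr/products/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>
```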

Page 27: Search Engines Performance - Explained

Handling Request Load - replication problems

● As an index grows, it becomes more segmented.

● The search function becomes inefficient.

● Therefore, index optimization happens on all Master and Slave nodes, to merge and compact the segments.

● The replication protocol can be selected and tuned, including the replication rules.

● Solr supports Unix rsync/script-based or pure Java replication mechanisms.

Page 28: Search Engines Performance - Explained

Handling Request Load - Search Query

Page 29: Search Engines Performance - Explained

Handling Request Load - Search Query

● A user executes a search query.

● The load balancer selects a Slave node on one of the shards and forwards the request to it.

● That node distributes the request to the other index shards, and executes the processing on its own piece of the index.

● When all shards finish processing, they send their results back to the node that got the original request.

● All results are sorted and returned to the user.

Page 30: Search Engines Performance - Explained

Handling Request Load - Search Query Problems

● The more users / requests we have, the more Slaves we can add.

● Adding slaves is not free.

● Each Slave adds more network chatter to the system.

● Each Slave polls the Master for updates, putting load on the Master and using bandwidth.

● Each Slave replicates the index deltas, adding further load on the Master and the network.

● The more you distribute, the more performance overhead you get.

● Not linearly scalable.

Page 31: Search Engines Performance - Explained

Achieving high availability

Page 32: Search Engines Performance - Explained

Achieving high availability

● If we have many Slaves serving the same search function, we can continue to serve search requests even if not all the Slaves in a given shard are available.

● Solr's search API is HTTP based.

● With HTTP health checks on the load balancer side, we can take the problematic Slave nodes out of the cluster (see the sketch below).

● The more Slaves we have for each shard, the more highly available the system is.

● We get search high availability for free.
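Solr exposes a ping handler that the load balancer's health checks can target; a hedged solrconfig.xml sketch, where the healthcheck file name is an assumption:

```xml
<!-- GET /solr/<core>/admin/ping returns an error whenever the healthcheck
     file is missing, so removing the file drains the node from the balancer -->
<requestHandler name="/admin/ping" class="solr.PingRequestHandler">
  <str name="healthcheckFile">server-enabled.txt</str>
</requestHandler>
```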

Page 33: Search Engines Performance - Explained

Achieving high availability

Page 34: Search Engines Performance - Explained

Summary

● To handle more data, split the system into shards.

● To handle more requests, add more Slave nodes.

● We achieved a highly available and fully scalable (data- and load-wise) search system.

Page 35: Search Engines Performance - Explained

Questions

?

Page 36: Search Engines Performance - Explained

Thank you