
Wize Commerce® Search Infrastructure

Zixin Wu, Patanachai Tangchaisin, Joe Foo, Yayati Rajpal, Jijoe Vurghese Platform and Infrastructure, Wize Commerce, San Mateo, California

INTRODUCTION
BACKGROUND
SEARCH SYSTEM COMPONENTS
    DATA INGESTION (AKA INDEXING)
        Event driven mode vs. batch mode
        Incremental and full indexing (Kappa architecture)
        Distributed messaging system
    INDEX DEPLOYMENT
        Index update
        Sharding
        Distribution
        Master-slave vs. SolrCloud deployment model
    RUNTIME SEARCH
        Service oriented architecture
        Solr architecture
        Customized features
        Monitoring
    FAILURE MODES
        System failures
        Data failures
DISCUSSION
    WHAT WORKS WELL
    FUTURE WORK
ACKNOWLEDGEMENT


Introduction

For over ten years, Wize Commerce has been helping clients maximize their revenue and traffic using optimization technologies that operate at massive scale and across every channel, device, and digital ecosystem. Wize Commerce is a platform that delivers qualified traffic, increased monetization, and revenue and profit growth to its merchants, driving over $1B in annual merchant sales. We specialize in acquiring and engaging customers, helping people find what they're looking for by leading them to robust comparison-shopping sites.

The search subsystem within the Wize Commerce platform is one of the key differentiators. In addition to basic information retrieval based on search terms like "ipad" or "marine vhf radios", this system applies sophisticated scoring and ranking algorithms that predict user behavior on our sites. Another key aspect of the search system is the freshness of the data it operates on. This is even more critical for comparison-shopping focused sites like pricemachine.com and nextag.com.

A couple of years ago, we ran into the proverbial brick wall in terms of scalability and throughput of our search subsystem. This led us down the path of a complete re-architecture of the system. What follows is a detailed, technical rundown of this evolution of the Wize Commerce search platform and the lessons we learned along the way.

Background

Information retrieval (IR) technology is key to Wize Commerce's marquee offerings (nextag.com and pricemachine.com). This recognition came very early in the company's life: the company began using Apache Lucene 1.3 in 2004, state of the art at the time. Lucene is well respected in IR circles and has a very active community. While Lucene is great for general IR, we quickly realized it had to be customized to work well in the e-commerce domain. We added support for numeric types in Lucene documents, randomly accessible fields, faceting, geo-spatial search (for our Travel and Real Estate comparison-shopping offerings), and various scorers, tokenizers, and filters.


Very soon, our search index outgrew the capacity of a single machine. We ended up retrofitting Lucene to support a fan-out mechanism (known internally as the request broker) (Figure 1). The search index was broken up (sharded) into smaller chunks that could be accommodated on each search machine. Search requests were sent to the request brokering tier. This tier fanned out the search request to all shards and waited for results. After results were received from each shard, the results were "merged" (re-ranked) before the top N results were returned to the requesting application.

Figure 1: Request flow in original search system
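To make the fan-out and merge flow concrete, here is a minimal, hypothetical sketch of the scatter-gather pattern the request broker implemented. ShardClient and ScoredDoc are illustrative stand-ins, not the actual internal classes, and the thread pool size and per-shard timeout are assumptions.

```java
import java.util.*;
import java.util.concurrent.*;

public class RequestBroker {
    private final List<ShardClient> shards;
    private final ExecutorService pool = Executors.newFixedThreadPool(32);

    public RequestBroker(List<ShardClient> shards) { this.shards = shards; }

    /** Fan the query out to every shard, wait for the results, then merge and return the top N. */
    public List<ScoredDoc> search(String query, int topN) throws Exception {
        List<Future<List<ScoredDoc>>> futures = new ArrayList<>();
        for (ShardClient shard : shards) {
            futures.add(pool.submit(() -> shard.search(query, topN)));  // each shard returns its own top N
        }
        // Merge: collect every shard's candidates and re-rank globally by score.
        PriorityQueue<ScoredDoc> merged =
            new PriorityQueue<>(Comparator.comparingDouble(ScoredDoc::score).reversed());
        for (Future<List<ScoredDoc>> f : futures) {
            merged.addAll(f.get(2, TimeUnit.SECONDS));  // bounded wait per shard
        }
        List<ScoredDoc> result = new ArrayList<>(topN);
        while (result.size() < topN && !merged.isEmpty()) {
            result.add(merged.poll());
        }
        return result;
    }

    public interface ShardClient { List<ScoredDoc> search(String query, int topN); }
    public record ScoredDoc(String productId, double score) {}
}
```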

In IR terminology, indexing is the process of transforming data into a searchable form. In Lucene, this transformation results in Lucene documents held together in a Lucene search index. We started with custom Java tasks that ran every so often to pull updated content from our catalog system into the search index. The catalog system is constantly refreshed with offer data provided by our merchants. We apply a matching process to these offers across all merchants to create unified representations of them (aka normalization). This process results in the creation of (normalized) products in our catalog. A Lucene document in our search index roughly corresponds to a product in the catalog system.

To minimize the number of remote calls (between the WWW tier and the search system) while painting search results pages on our sites, all data associated with a search result "row" was stored within each Lucene document. This system served us well for about 10 years. However, as Wize Commerce's merchant base grew, the system started to show its limitations:

1. The index generation process ran in batch mode. This resulted in long data latency from the time merchants updated their offers to that data being reflected on Wize Commerce sites. We were clocking in at 18-24 hours of latency.

2. The above limitation was exacerbated by the decision to store all data related to a product into its Lucene document. Hence, the size of each document was relatively big. This made it hard to scale out along two dimensions - for the search use case (IR) vs. the data storage use case (key lookup).

3. The large size of index shards also meant it took longer to complete the search operation (~3 secs at 95th percentile).

4. Lucene is an append-only system, which means we cannot update an existing Lucene document in situ once it has been committed to the index. To update data in a document, we have to create a new document with all the data (including updates) and write it out to the index. Over time, this means we end up with obsolete documents in the search index and hence have to run a full re-build. This took 3+ days to complete.

5. An increasing number of merchants moved to near real time offer updates (pricing and availability, mainly) during peak shopping events like Black Friday, Cyber Monday, Flash sales, etc. The high data latency in our current system was now a show stopper.

6. A lot of search code was embedded directly into WWW tier code. This meant WWW tier engineers had a steep learning curve before being able to modify search code.

We looked at various incremental design modifications, but very soon, we realized nothing would give us the required quantum leap improvement. We had to go for a ground up re-architecture of the search system.


We set the following goals for our next generation search system:

1. Data latency of 10 minutes
2. Search response time of 1 second at the 95th percentile
3. Adopt a service oriented architecture for the search service
4. Reduce bloat of Lucene documents and separate data storage from IR concerns

Search system components

There are two major components of our search system: indexing and search. Offer information from merchants is sanitized, normalized, and stored in the catalog datastore by the catalog import process. At the same time, changes are emitted as events to Apache Kafka for other parts of the system to consume. These events are processed using Apache Storm and the resulting Lucene documents are pushed to Solr master servers.

Search requests from clients first go through pre-processing steps such as spelling correction and then land on Solr slave servers through a load balancer. Candidate documents are fetched from the index by matching the search keyword as well as constraints such as category and price. The matched documents go through a scorer that computes a score for each. These documents are finally sorted by their scores and diversified (see the Seller diversification section below). The top N documents to be displayed on the search results page are returned to the requesting application.

Both Tomcat and Jetty are common web containers used for deploying Solr. Initially we deployed Solr on Tomcat, as it is the most popular web container used in our web applications. However, we saw significant performance gains by switching to Jetty.


Data Ingestion (aka indexing)

Figure 2: Data ingestion pipeline

Event driven mode vs. batch mode

One big gain from the system re-architecture is the significant reduction in data latency. Our original system operated in batch mode. The indexer ran every 6-12 hours to pick up product information changed since the last run from the catalog datastore. The changes were processed through a series of steps and written into new Lucene index segments. After all products in the batch were processed, we deployed the segments to the Lucene search servers. A batch of products usually took about 6 hours to complete, resulting in a data latency ranging from 12-18 hours.

When we designed the new search system, we initially built prototypes in batch mode using Pig, Hive, and Cascading. We tried to reduce data latency by removing unused features and running the batch process more frequently, but we only gained an hour or so against our goal of 10 minutes of data latency. It was clear we needed a complete rethinking of the processing pipeline. The result was reimagining the flow as a stream of change events. The advantage of an event driven model is that after an event is picked up by the processing pipeline, it can go through the pipeline and reach the live site as fast as the pipeline can process it, independent of other events.

The stream processing landscape at the time was very sparse: S4 from Yahoo! and Apache Storm stood out as the viable, open source candidates. Given the cleaner abstractions provided by the latter (Spouts, Bolts, Topologies), we settled on Apache Storm.
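As an illustration only, here is a minimal sketch of how a change-event topology might be wired up with the storm-kafka spout of that era (Storm 0.9.x, packages under backtype.storm). The topic name, ZooKeeper hosts, bolt logic, and parallelism numbers are assumptions, not our production values.

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;

public class IndexingTopology {

    /** Turns a raw change event into the flattened columns needed to build a Lucene document. */
    public static class TransformBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            String event = input.getString(0);   // raw change event payload
            String columns = flatten(event);     // placeholder for the real transformation
            collector.emit(new Values(columns));
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("columns"));
        }
        private String flatten(String event) { return event; }  // stand-in only
    }

    public static void main(String[] args) throws Exception {
        // Read product change events from a (hypothetical) Kafka topic.
        SpoutConfig spoutConfig = new SpoutConfig(
                new ZkHosts("zk1:2181,zk2:2181"), "product-changes", "/kafka-spouts", "indexing-topology");
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("change-events", new KafkaSpout(spoutConfig), 4);
        builder.setBolt("transform", new TransformBolt(), 16).shuffleGrouping("change-events");

        Config conf = new Config();
        conf.setNumWorkers(8);
        StormSubmitter.submitTopology("indexing", conf, builder.createTopology());
    }
}
```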

Incremental and full indexing (Kappa architecture)

What we described in the previous section is incremental indexing, because it handles updates triggered by our catalog import process when offer information changes in merchant feeds. However, some fields in our Lucene documents do not come from merchant feeds. For example, we compute and store the click-through rate of a product for ranking purposes. Click-through rate computation is done in batch mode for all products, once a day, and hence there are no change events emitted. Therefore, in addition to incremental indexing, we need a full indexing process to update such fields. This full indexing process runs perpetually in parallel with the incremental indexing process, iterating over every product and enqueuing a pseudo "change" event. The rest of the event processing pipeline is exactly the same as for incremental processing.

Using the same pipeline for both incremental and full indexing has the benefit of a single, shared code base. This is a counterexample to the Lambda architecture proposed by Nathan Marz (author of Apache Storm) - in other words, it is an example of the Kappa architecture, a term coined by Jay Kreps. The full indexing process also serves as a catch-all for lost update events. We can control the rate of full indexing to keep the load on Solr in check.
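A minimal sketch of what such a perpetual full-indexing loop could look like: scan product IDs from the catalog and enqueue pseudo change events at a throttled rate. The catalog iterator, topic name, and rate are illustrative assumptions, and the modern Kafka producer API (plus Guava's RateLimiter) is used for brevity; our deployment at the time was Kafka 0.7, whose API differed.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import com.google.common.util.concurrent.RateLimiter;

public class FullIndexer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1:9092,kafka2:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Throttle so the full scan never overwhelms the downstream pipeline or Solr.
        RateLimiter limiter = RateLimiter.create(500.0);  // pseudo events per second (illustrative)

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            while (true) {  // run perpetually; start over when the catalog scan completes
                for (String productId : CatalogScanner.allProductIds()) {  // hypothetical catalog iterator
                    limiter.acquire();
                    // A pseudo "change" event; the payload carries only the id, and the pipeline
                    // looks the actual data up from the catalog at processing time.
                    producer.send(new ProducerRecord<>("product-changes", productId,
                            "{\"productId\":\"" + productId + "\",\"source\":\"full-index\"}"));
                }
            }
        }
    }

    // Stand-in for iterating the catalog (e.g. an HBase scan over the product table).
    static class CatalogScanner {
        static Iterable<String> allProductIds() { return java.util.List.of("p1", "p2", "p3"); }
    }
}
```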

Distributed messaging system

A common situation with event stream processing systems is the need to decouple parts of the processing pipeline that differ in throughput and availability (aka an impedance mismatch). A messaging system is one way to achieve this decoupling, by buffering events between the various parts of the pipeline. We realized we did not need the overhead of standards compliance (like the Java Messaging Service (JMS)). The field of lightweight message queues has numerous contenders: Apache Qpid, Kestrel, Kafka, RabbitMQ, ZeroMQ, Akka, Beanstalk, Apache ActiveMQ, etc. We had to narrow the field first. A quick proof of concept with ZeroMQ convinced us that it operated at too low a level (sockets) for our indexing pipeline use case. Next up was Akka. Akka is a great framework for generic Actor based programming; however, we would have to roll our own event processing constructs like failure handling. Given the excellent documentation, its Erlang pedigree, and rich community, RabbitMQ emerged as the winner.

After a month or so of use, we realized we had missed a very important criterion in our evaluation. We required messages to be persistent, because regenerating lost messages (due to RabbitMQ outages, message errors, etc.) was expensive. We thought RabbitMQ has a persistence mode - let's enable that and call it done. The only problem: throughput of the message queue system dropped by about 50x. We had to go back to the drawing board. On our second (quicker) pass over the message queue contenders, Apache Kafka emerged as a clear winner, with excellent throughput while persisting every message (the default mode!). We were able to push about 400,000 messages/second on an untuned Dell R410 class machine, which is close to what Jay Kreps' team at LinkedIn got in their tests. One crucial advantage of Apache Kafka is the ability for each consumer to replay messages independent of other consumers. This feature has saved our skin many times when we had to fix code in the Storm processing pipeline and needed to rewind and replay messages over the new codebase.
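The replay capability boils down to re-reading from an earlier offset. A small sketch of rewinding a consumer group to the beginning of a topic so fixed code can reprocess past change events; the topic and group names are illustrative, and the modern consumer API is used for brevity (Kafka 0.7's SimpleConsumer worked differently).

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1:9092");
        props.put("group.id", "indexing-topology");   // each consumer group keeps its own offsets
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("product-changes"));
            consumer.poll(Duration.ofSeconds(1));              // join the group, get partition assignments
            consumer.seekToBeginning(consumer.assignment());   // rewind: replay everything still retained

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Reprocess the change event with the fixed pipeline code.
                    System.out.println(record.offset() + " " + record.value());
                }
            }
        }
    }
}
```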

Index Deployment

Index update

The indexing pipeline is broken into two topologies - "Indexing" and "Push to Solr" (Figure 2). The former topology is responsible for transforming a product change event into a flattened set of data columns containing all the data needed to construct a Lucene document. This data is persisted into our catalog (Apache HBase). The latter topology is responsible for picking up the set of data columns and pushing it as a Lucene document into the Solr master servers. The primary motivation for separating these topologies is the speed mismatch between them (the latter is much faster) and the benefit of being able to quickly rebuild the entire Solr index (in case we need to build multiple indexes to stage multiple incompatible versions of Solr, build subset indexes, rebuild the entire Solr index for disaster recovery, etc.). To achieve the goal of 10 minutes data latency, we commit changes on the Solr master every 5 minutes, or whenever there are more than a specified number of uncommitted updates.
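A minimal sketch of what the "Push to Solr" step could look like with a recent SolrJ client. The field names and the five-minute commitWithin are illustrative; the commit policy described above can equally be enforced by autoCommit settings (maxTime / maxDocs) on the master, and the SolrJ classes of the Solr 4.x era differed slightly.

```java
import java.util.Map;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;

public class SolrPusher {
    private final SolrClient master;

    public SolrPusher(String masterUrl) {
        this.master = new HttpSolrClient.Builder(masterUrl).build();
    }

    /** Push one flattened column set (from the indexing topology) as a Solr document. */
    public void push(Map<String, Object> columns) throws Exception {
        SolrInputDocument doc = new SolrInputDocument();
        for (Map.Entry<String, Object> e : columns.entrySet()) {
            doc.addField(e.getKey(), e.getValue());   // e.g. id, title, category, price, ctr
        }
        UpdateRequest update = new UpdateRequest();
        update.add(doc);
        update.setCommitWithin(5 * 60 * 1000);        // ask Solr to commit within 5 minutes
        update.process(master);                       // send to the master; slaves pick it up via replication
    }
}
```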


Sharding

We have more than 100 million documents in our search index, and we partition it into 64 shards, where documents are uniformly distributed by the hashed value of their product IDs. This decision is based on performance tests we ran on different configurations such as 16 and 32 shards - 64 shards had the lowest response times at our QPS (queries per second) rating. It is worth mentioning that we have to occasionally redo these performance tests as we make significant performance improvements or change the size of Lucene documents dramatically.

To achieve high availability, we have two to four replicas of each shard in a cluster, so each replica gets only a portion of the total traffic. If we host one shard per server, the server is underutilized. Therefore, we host multiple shards on a server. This configuration also provides the flexibility to scale without re-sharding, by moving some shards to new servers if the index size grows. Hosting multiple shards on a server is not the only way to achieve high availability and parallelism, however. We could instead partition the index into fewer shards, host one shard per server (as if combining the shards on a server), and execute a search on segments in parallel. Unfortunately, the Solr versions currently available are not ready for this feature (more details in LUCENE-5299).

We created different indexes for different markets, so that we can more easily leverage Solr functionality around language specific stemming, protected word lists, stopwords, etc., and also to avoid IDF bias across languages. Different markets' indexes can be hosted on the same server.
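Document-to-shard routing by hashed product ID comes down to a stable modulo. A tiny sketch of the idea; the hash function chosen here (CRC32) is illustrative, not necessarily the one our pipeline uses.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public final class ShardRouter {
    private static final int NUM_SHARDS = 64;

    /** Map a product ID to one of the 64 shards; the same ID must always land on the same shard. */
    public static int shardFor(String productId) {
        CRC32 crc = new CRC32();                        // a stable, well-distributed hash
        crc.update(productId.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % NUM_SHARDS);
    }

    public static void main(String[] args) {
        System.out.println(shardFor("product-123456")); // always prints the same shard number
    }
}
```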

Distribution

We have four data centers globally: one on the west coast of the U.S., two on the east coast of the U.S., and one in Ireland (Figure 3). To achieve the goals of high availability, high scalability, low data latency, low cross-data-center bandwidth usage, and ease of maintenance, we designed our index distribution architecture as follows.


Figure 3: Deployment view

Only our "main" data center has Solr master servers. The master servers do not accept search queries; they are only responsible for updating the indexes and letting slaves or relays pull from their indexes. To reduce cross-data-center bandwidth usage, we put relay servers in each data center except the "main" one, and let slaves pull from the relays.

Master-slave vs. SolrCloud deployment model

We chose the master-slave structure instead of SolrCloud because:

1. We tried SolrCloud and had a live site search failure incident due to a leader election failure (SOLR-5373). In this situation, SolrCloud stops serving requests until the cloud achieves consistency. In other words, it behaves like a CP system (CAP theorem). We require an AP system.

2. In the SolrCloud model, we need to handle failure of index updates in multiple data centers.

3. We have more direct control in the master-slave structure.

4. One advantage of SolrCloud is high availability of master nodes. But our business can tolerate some data latency, so it is acceptable to restore master nodes manually if they fail.


Runtime search

Service oriented architecture

While reviewing our original search system, we realized search related code was strewn throughout other systems that are not search related, such as the web UI code. This makes it very hard to coordinate changes to search code across multiple code bases. It was compounded by the fact that Java serialization was used to move search requests and responses across application tiers. In order to achieve scalability, testability, and maintainability, we decided to move our search code into search micro services. A micro service is a logical unit responsible for a self-contained function - e.g. a service using Solr for recall and ranking, a service for retrieving product detail information by product ID lookup, a service for spelling suggestions - and we create an abstraction on top of these for inter-service communication.

As mentioned earlier, a major problem with our original system was the inability to evolve the search data model due to limitations of Java serialization. This caused backward compatibility issues, and the deployment process required a lot of careful coordination to sequence change rollout and prevent site outages. Therefore, we chose Apache Thrift for data transport as well as remote procedure calls (RPC). This helps us avoid compatibility issues when clients communicate with micro services and when services talk to each other. Moreover, it makes our service tier language independent, so that clients in Ruby, C#, and other languages can be supported in the future, thanks to the richness of Apache Thrift's language support.
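For illustration, a minimal sketch of how a client might call such a Thrift-based search service. SearchService, SearchRequest, and SearchResponse stand in for Thrift-generated stubs (the actual service definition and field names are not shown in this paper), and the host, port, and transport choices are assumptions.

```java
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class SearchClientExample {
    public static void main(String[] args) throws Exception {
        // Connect to a (hypothetical) search micro service endpoint.
        TTransport transport = new TFramedTransport(new TSocket("search-service.internal", 9090));
        transport.open();
        try {
            // SearchService, SearchRequest and SearchResponse would be Thrift-generated classes.
            SearchService.Client client = new SearchService.Client(new TBinaryProtocol(transport));

            SearchRequest request = new SearchRequest();
            request.setKeyword("marine vhf radios");
            request.setPageSize(25);

            SearchResponse response = client.search(request);
            System.out.println("hits: " + response.getTotalHits());
        } finally {
            transport.close();
        }
    }
}
```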


Figure 4: Micro Services search architecture

Solr architecture

Solr is a Lucene-based distributed search framework with REST APIs. A search request is accepted by a "merger", which fans out requests to all shards, then merges and sorts the responses from the shards (Figure 5). Note that a Solr instance can play the role of merger or shard interchangeably, per request. Besides distributed search, Solr also provides a framework for index updates and replication, as well as features such as faceting, caching, statistics, admin, and spelling correction.


Figure 5: Data and request flow through search system

Customized features

While Solr's architecture provides a good starting point for building our search system, the built-in features do not suffice for our business needs. Therefore, we implemented some of our business logic as Solr components and plug-ins, without changing the Solr code base, to minimize the difficulty of upgrading to future versions of Solr. Of the many search features we have, we briefly describe the major ones in the following sections.

Global Inverse Document Frequency (IDF)

The widely used TF-IDF (term frequency–inverse document frequency) model is part of our ranking formula to measure the importance of a document to a given query. The inverse document frequency (IDF) is a measure of how much information a word provides, that is, whether the term is common or rare across all documents. In a distributed search environment, the IDF of a term may differ from shard to shard if it is computed locally on each shard. This is not an issue if the frequency of the term is high, as it is likely similar in each shard. But for "long tail" keywords, this inequality creates significant bias in the TF-IDF value between shards.


Solr doesn't address this problem as yet, though there is a placeholder in the QueryComponent. Since text match score is a non-trivial signal in our ranking formula, we implemented our own solution as a Solr component. This resulted in a significant improvement in the click-through rate (CTR) of our search results and, hence, revenue. Our implementation is straightforward: before sending out search requests from the merger to the shards, the merger sends requests for the terms in the given query to the shards, and each shard returns the document frequency of the requested terms (excluding documents marked as "deleted"), along with the total number of documents in the shard. The merger then combines the responses, computes the "global IDF" of the terms, and finally puts the "global IDF" values into the search requests sent to the shards. On the shards, we use these values by replacing TermQuery and PhraseQuery with our own implementations in our QParserPlugin. Since the global IDF values usually do not change much over time, an optimization we made to reduce the overhead of this feature is to put a local cache with a TTL on each merger, so that we don't request the frequency of a term repeatedly within a short period of time. If all terms are found in the cache, we do not need to send out "IDF requests" to the shards at all.

Scoring

In the domain of product shopping, a pure text match score such as TF-IDF is not good enough to measure the relevance of a product to a given query. Therefore, we use many more ranking signals to sort the recalled products. The formula to combine these ranking signals is generated by regression analysis based on logged data such as impressions and clicks. Some ranking signals have good predictive power, such as product-specific CTR and keyword-product-specific CTR. The former is query independent; we store the value in a DocValues field of the Lucene documents and read it in our scorer. For the latter, since a product may have a different CTR associated with different search keywords, we cannot store this information as a field in Lucene documents. Instead, we hold this data as a map of maps in memory and look up the values in the scorer using the query's search keyword and the ID of the product being scored. Due to our complex ranking formula and the usage of the above ranking signals, we implemented our own scorer and plugged it into the Lucene query object.
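A simplified sketch of the signal lookup just described: product-level CTR (which the real system reads from a DocValues field) and keyword-product CTR looked up from an in-memory map of maps, combined with the text match score. The weights and the linear combination are illustrative only; the real formula comes from offline regression and lives inside a custom Lucene scorer.

```java
import java.util.Map;

/** Illustrative only: combines a text match score with CTR-based ranking signals. */
public class RankingSignals {
    // keyword -> (productId -> CTR), refreshed from offline data files.
    private final Map<String, Map<String, Double>> keywordProductCtr;
    // productId -> query-independent CTR (in the real system this comes from a DocValues field).
    private final Map<String, Double> productCtr;

    public RankingSignals(Map<String, Map<String, Double>> keywordProductCtr,
                          Map<String, Double> productCtr) {
        this.keywordProductCtr = keywordProductCtr;
        this.productCtr = productCtr;
    }

    /** Hypothetical linear combination; the production formula is learned offline. */
    public double score(String keyword, String productId, double textMatchScore) {
        double prodCtr = productCtr.getOrDefault(productId, 0.0);
        double kwProdCtr = keywordProductCtr
                .getOrDefault(keyword, Map.of())
                .getOrDefault(productId, 0.0);
        return 0.4 * textMatchScore + 0.3 * prodCtr + 0.3 * kwProdCtr;
    }
}
```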


The system loads the abovementioned ranking data and model through data files. These files are generated offline at regular intervals, but transferring a file from one machine to every Solr slave is not a scalable solution, because it takes a very long time to complete the transfer and adds the maintenance overhead of managing a list of all Solr slave servers. Instead, we wanted to download external data files from the master Solr server to the slave Solr servers, just like indexes are replicated. We implemented this via a custom request handler that allows slaves to query the latest files on the master and compare them with their current files. If the files on the Solr master are newer, each Solr slave server downloads them.

Search runtime document filtering

One of our business goals is to drive high quality traffic to merchants, which means improving the conversion rate on merchants' websites. One way to do this is to show products from a merchant to more or fewer visitors based on the products' categories and the merchant's conversion rate in those categories. For example, a merchant may have a higher conversion rate on computer products than on televisions. We achieve this by randomly removing products from some visitors' search results. Since this effect is unknown when the index is built, we cannot put this information in the documents and filter them by query constraints. Instead, before a document is handled by the scorer, it has a chance to be discarded, based on the product's merchant, its category, the merchant's conversion rate in that category, etc.

Seller diversification

A good comparison shopping website should show products from a variety of merchants. If we sort recalled products purely by their scores (as described in the "Scoring" section), then similar products from a merchant likely have very close scores and are thus shown consecutively on the search results page. The worst case is an entire page of results from a single merchant. To diversify products in search results, we retrieve more documents than the current page size (suppose the current page is the first page) from the shards, and after merging the responses in the merger, we iterate through the merged document list and give exposure to a larger number of merchants.
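A minimal sketch of one way such a diversification pass over the merged, score-sorted list might work: cap how many slots a single merchant can take on a page while otherwise preserving score order, and backfill if diversity leaves the page short. The cap, types, and backfill policy are illustrative, not our production logic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SellerDiversifier {

    public record Result(String productId, String merchantId, double score) {}

    /**
     * Pick pageSize results from a score-sorted candidate list (which is larger than the page),
     * limiting how many slots any single merchant can take on the page.
     */
    public static List<Result> diversify(List<Result> sortedCandidates, int pageSize, int maxPerMerchant) {
        List<Result> page = new ArrayList<>(pageSize);
        List<Result> overflow = new ArrayList<>();
        Map<String, Integer> perMerchant = new HashMap<>();

        for (Result r : sortedCandidates) {
            if (page.size() == pageSize) break;
            int used = perMerchant.getOrDefault(r.merchantId(), 0);
            if (used < maxPerMerchant) {
                page.add(r);
                perMerchant.put(r.merchantId(), used + 1);
            } else {
                overflow.add(r);   // keep for backfill if we run out of diverse candidates
            }
        }
        // Backfill with the best remaining candidates if diversity left the page short.
        for (Result r : overflow) {
            if (page.size() == pageSize) break;
            page.add(r);
        }
        return page;
    }
}
```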

Monitoring

Since we have a large Solr cluster, with more than 200 machines per data center, we need a way to monitor the clusters and notify us when things go wrong. Previously, we used an in-house monitoring system that tracked metrics by polling a statistics endpoint (e.g. REST API, XML, HTML, JMX) and sent out an email notification to alert the responsible party when a specific threshold was reached. However, as our system grew larger, it was no longer practical to poll every system. Therefore, we incorporated push model monitoring with this platform change. We use StatsD integrated with Graphite (Figure 6) to track all the statistics we want from our system in a push model. This setup relies on asynchronous UDP to push data from applications to the monitoring system (Figure 4) with minimal overhead in application code. The application is not impacted even if the monitoring system goes down. Additionally, we use Cubism.js (Figure 7) for aggregate visualization of cluster metrics, to observe any abnormality in near real time.

Since Solr already ships with a statistics page (XML styled with XSL) and JMX, in addition to the above we also use our in-house monitoring system to track many aspects of the Solr servers from the Solr statistics page, such as memory usage, JVM and CPU utilization, requests per second, index sizes, and deleted document ratio (Figure 8).

Figure 6: Graphite monitoring


Figure 8: Node level metric tracking

Figure 7: Cubism.js near real time visualization
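To illustrate the push model: StatsD accepts plain-text metrics over UDP, so instrumenting application code is a fire-and-forget datagram. A minimal sketch follows; the metric names and StatsD host are illustrative, and in practice a StatsD client library would typically be used instead of raw sockets.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

public class StatsDSender {
    private final DatagramSocket socket;
    private final InetAddress host;
    private final int port;

    public StatsDSender(String host, int port) throws Exception {
        this.socket = new DatagramSocket();
        this.host = InetAddress.getByName(host);
        this.port = port;
    }

    /** Fire-and-forget: if StatsD is down the datagram is simply lost and the app is unaffected. */
    private void send(String metric) {
        try {
            byte[] payload = metric.getBytes(StandardCharsets.UTF_8);
            socket.send(new DatagramPacket(payload, payload.length, host, port));
        } catch (Exception ignored) {
            // never let monitoring failures affect the application
        }
    }

    public void incrementCounter(String name) { send(name + ":1|c"); }                           // counter
    public void recordTimerMs(String name, long millis) { send(name + ":" + millis + "|ms"); }   // timer

    public static void main(String[] args) throws Exception {
        StatsDSender statsd = new StatsDSender("statsd.internal", 8125);
        long start = System.currentTimeMillis();
        // ... handle a search request ...
        statsd.incrementCounter("search.requests");
        statsd.recordTimerMs("search.latency", System.currentTimeMillis() - start);
    }
}
```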


Failure modes

System failures

Apache Kafka

We currently run Kafka version 0.7, which lacks support for message replication. We're using the most basic redundancy for messages (tolerating the loss of one disk) via RAID10 disks on the machines running Kafka. In the event of a complete machine failure, we lose the unprocessed messages in queues residing on that machine. However, we can re-enqueue the changes from the source system (the catalog) by reprocessing merchant feed files. We plan to upgrade to Kafka 0.8+ very soon to leverage message replication support, at which point we will switch back to a RAID0 or JBOD disk setup on these machines (for maximum throughput).

Apache Storm

Storm 0.9.x has no redundancy for Nimbus nodes. Even without this redundancy, a downed Nimbus node only prevents submission of new topologies; it doesn't affect topologies already running. When machines running Storm worker nodes fail, the worker processes transparently migrate to other machines in the Storm cluster and pick up where they left off. Of course, this only works if there are free "slots" available across the Storm cluster. We keep roughly 100 slots free to allow for the failure of 3 nodes.

Zookeeper

We run a Zookeeper quorum of 5 nodes, which can tolerate the failure of 2 nodes and continue functioning. Storm itself also uses Zookeeper. We keep this instance separate and run it on one of the Storm nodes. We aren't concerned with redundancy for this Storm Zookeeper because it is used only for Storm's internal state control, and recovering a downed Storm Zookeeper node is as simple as bringing the Zookeeper process back up.

Solr

Solr master and relay machines have no redundancy. They can quickly be rebuilt by bringing up a new machine and copying the index files over from any slave Solr machine. The new master machine will then start processing from that point forward as if nothing happened (we've done this many times). Solr slave machines are, of course, deployed for redundancy.

Data failures

Event processing failures

Errors do occur during event processing. Fundamentally, we treat such errors as transient problems. Experience tells us these errors are usually due to loss of connectivity to datastores used during event processing, which self-corrects after some time. Initially, we relied on Storm's built-in (infinite) retry mechanism, but soon realized that these retries took up all the processing bandwidth in the Storm cluster and the remaining, newer events backed up. We designed a retry mechanism that takes errored events off the main queues after a certain number (3, in our case) of retries. The number of retries is stored as metadata with the event. After 3 unsuccessful retries, the event is taken off the Kafka queue, placed into an error queue, and also persisted into an HBase error table (more on the latter below). Another error handling Storm topology runs continuously, picking up events from the error queue and retrying them indefinitely. If a retry succeeds, the event entry is removed from the error queue and the HBase error table.

How do we know we don't have runaway errors in the system? The HBase error table has the error count and exception stack trace per Storm topology bolt. We generate a daily email notification for each topology and bolt from this HBase table. Over time, we have developed a good sense of the steady state of errors per topology and can act on deviations daily.

Lost events

Specifically for Solr, we have a full indexing topology that scans all products in our catalog in perpetuity and generates events for each product. This process serves as a catch-all for lost updates, in addition to refreshing certain moving-window product data every so often, even if the product itself wasn't modified by merchants. Since the events don't actually contain the product data itself, this data is looked up at event processing time from the underlying HBase datastore, and hence we avoid the problem of stale data overwriting more recent data in destination systems. We keep the full indexing topology tuned for maximum parallelism while scanning the catalog HBase datastore. We do this by making sure to scan all available HBase regions in parallel.

Another problem is ensuring that stale (deleted) products are removed from Solr in a timely manner, in case a product deletion event is lost. This is important because otherwise these products would participate in searches when they should not. It is also important for the accuracy of the distributed global IDF functionality we implemented in Solr. We run a continuous stale products Storm topology. This topology queries Solr for documents not touched within a certain time period. These documents represent the candidate products for deletion. The product IDs for these documents are enqueued into Kafka + Storm for processing by the stale products topology.
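A simplified sketch of the bounded-retry pattern described above under "Event processing failures": re-enqueue with an incremented retry count, and after the third failure route the event to an error topic and record it in an error table. The event format, topic names, and the ErrorTable helper are illustrative, not our actual implementation.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class BoundedRetryHandler {
    private static final int MAX_RETRIES = 3;

    private final KafkaProducer<String, String> producer;
    private final ErrorTable errorTable;   // stand-in for the HBase error table writer

    public BoundedRetryHandler(Properties kafkaProps, ErrorTable errorTable) {
        this.producer = new KafkaProducer<>(kafkaProps);
        this.errorTable = errorTable;
    }

    /** Called when processing an event fails; retryCount travels as metadata with the event. */
    public void onFailure(String productId, String eventJson, int retryCount, Exception cause) {
        if (retryCount < MAX_RETRIES) {
            // Put it back on the main topic with an incremented retry count.
            producer.send(new ProducerRecord<>("product-changes", productId,
                    withRetryCount(eventJson, retryCount + 1)));
        } else {
            // Give up on the fast path: move to the error topic and record in the error table,
            // where a separate error-handling topology retries it indefinitely.
            producer.send(new ProducerRecord<>("product-changes-errors", productId, eventJson));
            errorTable.record(productId, cause);
        }
    }

    private String withRetryCount(String eventJson, int retries) {
        return eventJson.replaceFirst("\\{", "{\"retryCount\":" + retries + ",");  // naive, for illustration
    }

    public interface ErrorTable {
        void record(String productId, Exception cause);  // e.g. an HBase Put keyed by topology/bolt
    }
}
```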

Discussion

What works well

The event driven indexing pipeline is very stable, especially the back pressure mechanism that kicks in when one of the systems along the path is backed up. The Zookeeper used by Storm became a bottleneck when we upgraded to Storm v0.9.x recently, because this version of Storm is more chatty with Zookeeper. Our workaround was to split the Storm cluster into two, giving each new Storm cluster its own Zookeeper.

The error handling mechanism has been invaluable in tracking down bad code changes, data errors, etc., and the ability to rewind Kafka event queues to reprocess previous changes (after fixing code/data) has been a life saver. Splitting the indexing topology into two (one for data transformation to generate the data required to populate Solr, and the other to actually push this data into Solr) also helps us very quickly rebuild an entire index (just run all data output from the first topology through the second topology) and generate data summaries offline (how many offers each seller has in the search index, etc.).


We gained lower search data latency and search response time compared to the original search system by moving to an event-driven system, upgrading from a very old Lucene version to Solr, removing unused features, and doing multiple rounds of performance tuning to find the bottlenecks and fix them. We also gained benefits from better system design and software engineering, such as:

● Easy Solr upgrades, since we wrote our features as Solr components.

● Easy to add new features. Adding new components is done via configuration, as the system is modular. Our custom analyzers are kept in a common library so that they can be used in different modules.

● Avoiding Java jar dependency problems by componentizing the system into services. Each service is a self-contained business logic unit. A change to a service does not affect other services as long as the interface doesn't change.

● Easy to find system slowness via Graphite. Graphite is a scalable system that can maintain numeric time-series data through its Whisper file format. Our system sends data such as response times and error codes to Graphite, and this data is then exposed on the Graphite and Cubism UIs. It becomes very easy to see which servers are slow when average response time is high, or which layer or data center is affected, and corresponding analysis can be performed for further investigation.

● Continuous index updates. Thanks to Solr's index deployment and automatic reader opening mechanism, we don't need to take search servers out of service during index deployment.

Future work

● Since we have a lot of customization in our code, we need to bundle our code into the Solr web application (.war file). This means our Solr application needs to be shut down during a deployment. However, during a shutdown, persistent connections are lost, and this affects performance on the live site while new connections are re-established. We are working on better fault tolerance between Solr servers during code deployment.

● Add parallel "branches" in the indexing topologies to further reduce data latency for frequently changed data attributes like price, URL, offer name, and description that don't require recomputing other Solr document fields.

● Use query independent scores on documents so that we can use EarlyTerminatingSortingCollector to skip documents during scoring once enough documents have been scored. This can significantly reduce search response time.

● Reduce the number of shards by searching segments in parallel (depending on LUCENE-5299).

● Create high priority shards for high value products. For most users, if all high priority shards respond to the merger, the merger can return results to the client without waiting for the non-high-priority shards.

Acknowledgement

We would like to thank Anuj Jalan, Jai Sharma, Srinivasan Ramaswamy, Rajan Gangadharan, Yingji Zhang, Jaipal Deswal, Shashilpi Krishan, Sunil Sharma, et al. for their invaluable contributions to this project, as well as the Technology Operations, Systems Administration, and Networking teams.