Solr at Zvents: 6 Years Later & Still Going Strong
Solr @ Zvents: 6 years later
Amit Nithianandan, Lead Engineer – Search and Analytics
About Me
My “Street Cred”
• Joined Zvents in Aug 2008 as a member of the search engineering team.
– Knew nothing about Solr/Lucene. (Lucene… isn't that a misspelling of the Safeway brand milk?)
• Worked on small features early on.
– New ranking configuration for the "hot tickets" module on the site.
• Worked on larger initiatives.
– Multiple rewrites of the federated search component.
– Recent upgrade to Solr 4.0.
• Contribute to the community.
– Authored a few articles/blog posts, most notably on running Solr in Eclipse.
– Wrote a Chrome extension for easily editing long (Solr) API URLs.
Overview
• About Zvents
• Why Solr?
• Search @ Zvents Details
• Federated Search discussion
• Integration with external data stores
• Development/Deployment
• Operations/Performance Details
About Zvents
• Helping people find fun things to do since 2005!
• Content sourced from a variety of places:
– Normal end users
– Internal content editors
– External content editors @ local newspapers
– Feeds
• Powers the events guide section of hundreds of local newspaper sites around the nation.
Technologies used (including but not limited to).
Why Solr?
• Flexible, Powerful, Customizable
• RESTful query API
• Scales reasonably well without hassle.
• Fast and easy to get started given the samples.
• Strong and active community
– The mailing list is amazing; conferences and meetups help too.
Zvents Search at a quick glance…
• 2 masters / 10 slaves; not sharded.
• Solr 4.x running on Jetty.
• Six cores – five host actual data; the sixth is used for federated search.
• Federated search across eight different document types (e.g. venues, restaurants, movies…).
• ~5M documents in total.
• We allow blank-text ("what") searches so people can look for things by date and location alone.
• How to surface the most relevant things to do?
Search Challenges
Document Design Notes
• Venues and artists are as you would expect.
• For movies, index each showtime: pk = {theater_id, movie_id, time} triple.
– When searching, filter by location, collapse on movie_id, and sort by time ascending (see the sketch below).
• For events, index each occurrence (time).
– When searching, collapse on a sequence_id and sort by time ascending to show the nearest upcoming occurrence.
• Avoid showing visual "duplicates".
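A minimal SolrJ sketch of the movie collapse described above, under stated assumptions: the core URL and the "location" geofilt field are hypothetical; movie_id and time are the field names from the slide. Not the actual Zvents code.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class MovieCollapseExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical core URL; the Zvents deployment uses its own paths.
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/movies");
    SolrQuery q = new SolrQuery("batman");
    // Filter by location: 75 km around the user (assumes a "location" LatLonType field).
    q.addFilterQuery("{!geofilt sfield=location pt=37.7752,-122.419 d=75}");
    // Collapse on movie_id; within each group, earliest showtime first,
    // so one visual result represents all of a movie's showtimes.
    q.set("group", true);
    q.set("group.field", "movie_id");
    q.set("group.sort", "time asc");
    QueryResponse rsp = solr.query(q);
    System.out.println(rsp.getGroupResponse().getValues());
  }
}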
Request Flow
Zvents Search Service API
• Essentially the Solr API with a few changes.
• A ServletFilter and a custom QueryComponent translate URL parameters into proper Solr parameter "syntax".
– E.g. latitude/longitude/radius are converted into a geospatial query, with distance in km.
• Federated search is executed using a thread pool.
– Parallel searches, results blended together (see the sketch below).
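A sketch of the parallel fan-out, assuming one SolrJ query per document type submitted to a fixed thread pool; class names, pool size, and per-call client construction are illustrative, not the actual Zvents federator.

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.*;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class Federator {
  private final ExecutorService pool = Executors.newFixedThreadPool(8);

  public Map<String, QueryResponse> search(final String userQuery, List<String> coreUrls)
      throws InterruptedException, ExecutionException {
    // Submit one search per product core; all run concurrently.
    Map<String, Future<QueryResponse>> futures = new LinkedHashMap<>();
    for (final String url : coreUrls) {
      futures.put(url, pool.submit(new Callable<QueryResponse>() {
        public QueryResponse call() throws Exception {
          // Real code would reuse one client per core rather than building one per call.
          return new HttpSolrServer(url).query(new SolrQuery(userQuery));
        }
      }));
    }
    // Wait for every product's results before blending downstream.
    Map<String, QueryResponse> results = new LinkedHashMap<>();
    for (Map.Entry<String, Future<QueryResponse>> e : futures.entrySet()) {
      results.put(e.getKey(), e.getValue().get());
    }
    return results;
  }
}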
Sample Query
http://localhost:8983/map_prod/select?qt=zvents&trim=1
  &start=0&start_spn=0&rows_spn=6&rows=10&zsort=0
  &rcity=San+Francisco&latitude=37.7752&longitude=-122.419&radius=75.0
  &category=event,event_spn,venue&sd=201212190000
  &fq:event:=has_city:true&fq:event_spn:=has_city:true
  &wt=ruby&q=the%20fillmore&fl=id,name,score&facet=true&indent=on
Callouts from the slide: category-specific fq parameters; lat/long params; collapsed results (grouping).
Sample Response (abbreviated)
{
  'organic' => {
    'response' => { 'numFound' => 379, 'start' => 0, 'maxScore' => 75.485054, 'docs' => […] },
    'facet_counts' => {
      'facet_fields' => {
        'category' => {
          'event' => 67,
          'venue' => 312 } } } },
  'sponsored' => {
    'response' => { 'numFound' => 0, 'start' => 0, 'maxScore' => 0.0, 'docs' => […] },
    'facet_counts' => {
      'facet_fields' => {
        'category' => {
          'event_spn' => 0 } } } }
}
Federated Search
Federated Search (notice movies + events mixed)
Federated Search (cont’d)
• The Zvents federator component executes multiple concurrent searches and blends the results.
• Raw scores are meaningless across products, so they must be normalized to mean something on a common scale.
• Dividing by the max to yield a 0–1 scale throws away the differences between score distributions.
• We chose to use the Z score: (score – avg) / stddev.
• Getting statistics like the average and standard deviation of the result scores is not trivial.
• Initially thought about hacking the handler to plug in my own collector/scorer.
PostFilter to the rescue!
• PostFilters allow you to (as the name suggests) execute filtering logic *after* the main query and all other filters have executed.
• Lucene filters + the main query execute in parallel in a leap-frog manner; some filters (e.g. filter by distance to the user) are expensive to generate up front for all documents.
• You can create a delegate Collector that optionally calls "super.collect()" if some condition is true.
• Since I am now effectively at the lowest level of Lucene (Collector/Scorer), I can record distribution information about the scores as they pass through the collector and custom scorer (see the sketch below).
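A minimal sketch of that idea against the Solr 4.x plugin API: a PostFilter whose DelegatingCollector observes every score on the way through while still passing all documents along. Class and field names are mine, not the Zvents code.

import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Scorer;
import org.apache.solr.search.DelegatingCollector;
import org.apache.solr.search.ExtendedQueryBase;
import org.apache.solr.search.PostFilter;

public class ScoreStatsFilter extends ExtendedQueryBase implements PostFilter {
  @Override public boolean getCache() { return false; } // PostFilters must not be cached
  @Override public int getCost()  { return 200; }       // cost >= 100 => runs as a post filter

  @Override
  public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
    return new DelegatingCollector() {
      private Scorer scorer;
      private double sum, sumSquared;
      private long numDocs;

      @Override public void setScorer(Scorer s) throws IOException {
        this.scorer = s;
        super.setScorer(s);
      }

      @Override public void collect(int doc) throws IOException {
        float score = scorer.score();
        sum += score;
        sumSquared += score * score;
        numDocs++;
        super.collect(doc); // we only observe; every document passes through
      }

      @Override public void finish() throws IOException {
        // The totals (sum, sumSquared, numDocs) would be stashed somewhere the
        // response writer can reach, to emit the score_stats block shown below.
        super.finish();
      }
    };
  }
}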
Example Result Snippet
<lst name="score_stats">
<float name="min">1.3786081E-6</float>
<float name="max">10.416486</float>
<float name="avg">1.8479956</float>
<float name="stdDev">1.544854</float>
<long name="numDocs">561</long>
<float name="sumSquaredScores">3254.7324</float>
<float name="sumScores">1036.7256</float>
</lst>
Federated Search – Victory!
• Now the federator, when executing the product-specific searches, can extract this information to produce a "normal" score (see the sketch below).
• Results from different products can be blended based on how good individual results are relative to their peers.
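The sums in the score_stats block are enough to recompute everything the Z score needs; a quick sketch in plain Java (method and class names mine). Note the numbers check out against the snippet above: 1036.7256 / 561 ≈ 1.848 and sqrt(3254.7324 / 561 − 1.848²) ≈ 1.545, matching the avg and stdDev Solr reported.

public class ScoreNormalizer {
  // Z score of one hit, derived from the sums Solr returned in score_stats:
  //   avg = sumScores / numDocs
  //   var = sumSquaredScores / numDocs - avg^2
  static double zScore(double score, double sumScores, double sumSquaredScores, long numDocs) {
    double avg = sumScores / numDocs;
    double variance = sumSquaredScores / numDocs - avg * avg;
    double stdDev = Math.sqrt(Math.max(0, variance));
    return stdDev == 0 ? 0 : (score - avg) / stdDev;
  }
}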
Ranking/Filtering using (highly) volatile data…
• Store the data in a field and constantly re-index the document with the updated field value.
• Atomic updates? A Solr 4.0 feature.
– Claiming ignorance here: don't know the performance impacts or usage.
• Use functions/FunctionQuery + pseudo-fields.
– Instead of an indexed click field, use a clk() function.
• Use a PostFilter to support filtering of documents based on this volatile data.
Solr + External Data Store == Sweet!
[Diagram: log-processing pipeline. A separate thread updates EhCache from Hypertable; Solr functions running in the Jetty container pull volatile data from EhCache. Example function call: log(clk(EVENT,sequence_id))]
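The slides don't show the plugin code, but a function like clk() would plug in through Solr's ValueSourceParser extension point. A hedged sketch against the Solr 4.x API, simplified to a single argument; CLICK_CACHE is a hypothetical stand-in for the EhCache that the background thread refreshes from Hypertable.

import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.queries.function.FunctionValues;
import org.apache.lucene.queries.function.ValueSource;
import org.apache.lucene.queries.function.docvalues.FloatDocValues;
import org.apache.solr.search.FunctionQParser;
import org.apache.solr.search.SyntaxError;
import org.apache.solr.search.ValueSourceParser;

public class ClickCountParser extends ValueSourceParser {
  // Hypothetical stand-in for the EhCache of {sequence_id => click count}.
  static final ConcurrentHashMap<Long, Float> CLICK_CACHE = new ConcurrentHashMap<>();

  @Override
  public ValueSource parse(FunctionQParser fp) throws SyntaxError {
    final ValueSource idSource = fp.parseValueSource(); // e.g. the sequence_id field
    return new ValueSource() {
      @Override
      public FunctionValues getValues(Map context, AtomicReaderContext readerContext) throws IOException {
        final FunctionValues ids = idSource.getValues(context, readerContext);
        return new FloatDocValues(this) {
          @Override public float floatVal(int doc) {
            // Volatile data is read from the cache at query time; no re-indexing.
            Float clicks = CLICK_CACHE.get(ids.longVal(doc));
            return clicks == null ? 0f : clicks;
          }
        };
      }
      @Override public String description() { return "clk(" + idSource.description() + ")"; }
      @Override public boolean equals(Object o) { return o == this; }
      @Override public int hashCode() { return System.identityHashCode(this); }
    };
  }
}

Registered under the name clk in solrconfig.xml, this lets queries sort or rank on log(clk(...)) without the click counts ever touching the index.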
Filtering events based on ticket availability
Example: &fq={!ticket_filter idField=id}
[Diagram: a ticket availability publisher pushes ticket information via AMQP into an EhCache inside Jetty; the cache stores {event_id => ticket_count}. The Solr PostFilter (1) fetches the ticket information for each candidate document and (2) filters the document out if its ticket_count == 0.]
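A sketch of what {!ticket_filter idField=id} could look like as a PostFilter. The QParserPlugin wiring and the per-document id lookup are stubbed, and TICKETS is a hypothetical stand-in for the AMQP-fed EhCache; only the collect-time filtering matches the diagram.

import java.io.IOException;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.lucene.search.IndexSearcher;
import org.apache.solr.search.DelegatingCollector;
import org.apache.solr.search.ExtendedQueryBase;
import org.apache.solr.search.PostFilter;

public class TicketFilter extends ExtendedQueryBase implements PostFilter {
  // Stand-in for the EhCache kept fresh by the AMQP subscriber: {event_id => ticket_count}.
  static final ConcurrentHashMap<Long, Integer> TICKETS = new ConcurrentHashMap<>();

  @Override public boolean getCache() { return false; }
  @Override public int getCost()  { return 200; } // run after the query and cheaper filters

  @Override
  public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
    return new DelegatingCollector() {
      @Override public void collect(int doc) throws IOException {
        long eventId = resolveEventId(doc);     // step 1: fetch ticket info for this doc
        Integer count = TICKETS.get(eventId);
        if (count == null || count > 0) {
          super.collect(doc);                   // step 2: drop docs with ticket_count == 0
        }
      }
    };
  }

  // Stub: real code would read the idField named in the fq (e.g. via the field cache).
  long resolveEventId(int doc) { return doc; }
}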
Development and Deployment
Production Environment
• Java 1.7
• Quad-core 2.8 GHz
• 10 GB RAM
– 8 GB dedicated to the JVM heap.
• All provisioned as VMs on VMware ESX servers.
– Significantly simplifies cluster growth: simply add servers and go!
• 10 slaves, 2 masters
– From a configuration standpoint, masters == slaves, except masters have a 4 GB JVM heap instead of 8 GB.
Solr Project Configuration
• Maven based – treat Solr as a dependency, *not* as an application.
• Other dependencies are specified in the POM and bundled into the war during the assembly phase.
• Build a tarball that is pushed to Nexus.
– The tarball contains configuration scripts, the Jetty jar, etc.
• Bundle Jetty with the app for an all-in-one deployment.
Advantages of using Maven
• Solr version upgrades are as simple as bumping the dependency version in pom.xml.
– Of course, run tests before deploying!
• All dependencies are managed by pom.xml and bundled into the deployment artifact.
– No management of the classpath via solrconfig.xml.
• Take advantage of standard release-management practices. Everything is self-contained.
Deployment via Capistrano
• Capistrano – a framework/utility for executing commands in parallel via SSH on multiple servers (https://github.com/capistrano/capistrano).
• capistrano-nexus gem – a Zvents-built gem that deploys a tarball hosted on a Nexus server out to staging/production.
Examples
• Staging/Development deploy:
– mvn deploy
– RELEASE="2.10-SNAPSHOT" cap staging deploy
• Production deploy:
– mvn release:prepare
– mvn release:perform
– RELEASE="2.10" cap production deploy
Monitoring – New Relic
Monitoring – New Relic (cont’d)
CONTACT
Amit Nithianandan
Anithian-at-gmail.com