Search Analytics Component: Presented by Steven Bower, Bloomberg L.P.
Bloomberg • Largest provider of financial news and information
• Our strength is quickly and accurately delivering data, news and analytics
• Creating high performance and accurate information retrieval systems is core to our strength
Bloomberg Search Team • Search infrastructure
• Develop and support search as a service platform • Support for other search applications within the company
• Consultancy • Provide design consultancy/support to application teams • Promote search best practices/standardization throughout the company
• Machine learning • Develop machine learning techniques to improve relevancy • Create natural language processors to answer questions
• Unified search • Create information retrieval tools to organize and connect the vast and varied datasets provided to our clients
Our Approach • Use Search/Solr as it provides flexible search/filtering over large, fast-moving result sets
• Initially used StatsComponent, but quickly ran into limitations
• Wanted to push the bounds of analytics capabilities in Solr/Lucene
• Needed a pluggable framework to perform complex calculations/aggregations on numerical time-series data
• DocValues provided high performance columnar access to fields in the index (without un-inversion cost)
DocValues • DocValues provide high-performance columnar access to fields in the index
• No un-inversion cost
• Increased storage footprint
• Helps achieve near-real-time (NRT) search
• Values live off-heap in memory map
Analytics Component • New component from the ground up
• Designed/Implemented by the Bloomberg Search Team over the summer of 2013
• Initial implementation was built using DocValues API directly, but moved to FieldCache
• Refactored existing faceting implementation to support analytics
• Created simple prefix notation for statistical expressions
• Available as a Solr Contrib module in Solr 5.x or patches for 4.8+ on SOLR-5302
Features • Flexible/Extendable framework for adding additional statistics/faceting
• Supports Multiple Analytics Requests per query execution • Multiple statistic calculations per request • Multiple facets per request • Each request can facet statistics over different fields and ranges
Features - Faceting • Field Faceting
• Support for int, long, float, double, date, string fields • Support for multi-value fields • Support for limit, offset and mincount • Support for sorting of stats-facets by any statistic (e.g. sort by mean)
• Range faceting • Numeric types and dates • Dynamically calculate range/gap based on calculated statistics
• Support for query faceting of stats • Use calculated statistics to generate facet queries
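The field-faceting options above (mincount, offset, limit, sort by a statistic) can be pictured with a small in-memory sketch. This is plain Python, not the component's API: the documents, the field names `sector` and `price`, and the helper `field_facet` are all hypothetical and only illustrate the behaviour the bullets describe.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical documents; field names are illustrative only.
docs = [
    {"sector": "tech", "price": 10.0},
    {"sector": "tech", "price": 30.0},
    {"sector": "energy", "price": 50.0},
    {"sector": "energy", "price": 70.0},
    {"sector": "retail", "price": 5.0},
]

def field_facet(docs, facet_field, stat_field, mincount=1, offset=0, limit=None):
    """Group documents by facet_field and compute statistics per bucket."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[facet_field]].append(doc[stat_field])

    # Drop buckets with fewer than mincount documents.
    rows = [
        {"value": key, "count": len(vals), "mean": mean(vals)}
        for key, vals in buckets.items()
        if len(vals) >= mincount
    ]

    # Sort the facet buckets by a calculated statistic (here: mean, descending).
    rows.sort(key=lambda row: row["mean"], reverse=True)

    # Apply offset/limit paging after sorting.
    end = None if limit is None else offset + limit
    return rows[offset:end]

top = field_facet(docs, "sector", "price", mincount=2, limit=1)
```

With `mincount=2` the single-document `retail` bucket is dropped, and `limit=1` keeps only the bucket with the highest mean.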
Features – Map Operators • Basic Math
• neg(<expr>) • add(<expr>,...) • mult(<expr>,...) • div(<expr>,<expr>) • pow(<expr>,<expr>) • log(<expr>,<expr>)
• Constants • const_num(<number>) • const_date(<date>) • const_str(<string>)
• Date Math • date_math(<date expr>,<date op>,...)
• String operations • rev(<expr>) • concat(<expr>,...)
• Field • <field>
• Missing Values • miss(<expr>,<value>)
Features – Reduction Operators • Statistical
• min(<expr>) • max(<expr>) • sum(<expr>) • count(<expr>) • miss(<expr>) • unique(<expr>)
• Complex • sumofsquares(<expr>) • mean(<expr>) • stddev(<expr>) • median(<expr>) • percentile(<expr>)
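The split between map and reduction operators can be sketched in a few lines of Python: a map operator yields one value per document, and a reduction operator collapses those values into a single statistic over the result set. This is an illustration only, not the component's implementation. The helper names (`field`, `r_sum` and so on) are hypothetical, and treating `miss(<expr>,<value>)` as "substitute a default for a missing value" is an assumption based on the slide wording.

```python
import math
from statistics import mean

# Hypothetical documents; field_b is missing in the last one.
docs = [
    {"field_a": 2.0, "field_b": 1.0},
    {"field_a": 4.0, "field_b": 3.0},
    {"field_a": 6.0, "field_b": None},
]

# --- Map operators: each returns a function evaluated once per document ---
def field(name):
    return lambda doc: doc.get(name)

def mult(*exprs):
    return lambda doc: math.prod(e(doc) for e in exprs)

def miss(expr, default):
    # Assumed semantics: replace a missing value with a default.
    def evaluate(doc):
        value = expr(doc)
        return default if value is None else value
    return evaluate

# --- Reduction operators: collapse per-document values into one number ---
def values(expr, docs):
    return [v for v in (expr(d) for d in docs) if v is not None]

def r_sum(expr, docs):
    return sum(values(expr, docs))

def r_mean(expr, docs):
    return mean(values(expr, docs))

def r_count(expr, docs):
    return len(values(expr, docs))

# sum( mult(field_a, miss(field_b, 0)) ) = 2*1 + 4*3 + 6*0
total = r_sum(mult(field("field_a"), miss(field("field_b"), 0.0)), docs)
```

Nesting a map expression inside a reduction mirrors the prefix notation used on the slides.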
Examples
• Weighted Average • Calculate weighted average of field_a with field_b as the weight
div( sum( mult(field_a, field_b) ), sum(field_b) )
• Variance • Calculate the variance of field_a
pow( stddev(field_a), const_num(2) )
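As a cross-check of the two statistics above, here is a plain-Python sketch using the standard definitions: a weighted average is the sum of value times weight divided by the sum of the weights, and variance is the standard deviation squared. The sample values are made up, and the use of the population standard deviation (`pstdev`) is an assumption, since the slides do not say which convention `stddev` follows.

```python
from statistics import pstdev

# Made-up sample data standing in for field_a (values) and field_b (weights).
field_a = [10.0, 20.0, 30.0]
field_b = [1.0, 2.0, 3.0]

def weighted_average(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def variance(values):
    """Variance as the square of the (population) standard deviation."""
    return pstdev(values) ** 2

weighted = weighted_average(field_a, field_b)  # (10 + 40 + 90) / 6
var = variance(field_a)
```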
Examples
• T-Score • Calculate a t-score where ## is the value and all values in your sample are stored in field_a.
div( add( const_num(##), neg( mean(field_a) ) ), div( stddev(field_a), pow( count(field_a), const_num(.5) ) ) )
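The t-score expression reads as (value - mean) / (stddev / sqrt(count)). A minimal Python sketch of the same arithmetic follows. The function name and the sample data are hypothetical, and the population standard deviation is assumed because the slides do not state the convention.

```python
import math
from statistics import mean, pstdev

def t_score(value, sample):
    """(value - mean) divided by the standard error of the sample."""
    standard_error = pstdev(sample) / math.sqrt(len(sample))
    return (value - mean(sample)) / standard_error

# Made-up sample standing in for the values stored in field_a.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
score = t_score(6.0, sample)
```

Here the sample mean is 5 and the population standard deviation is 2, so the standard error is 2 / sqrt(8).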
How We Use It • Segment, aggregate and analyze financial data quickly
• Aggregate time series data across multiple fields to render charts
• Created flexible diagnostic tools/visualizations to analyze Solr performance
Future Plans • Multi-shard support
• Pivot Facet Support
• Statistics on Multi-value fields • To support unique()
• Filter result set based upon calculated statistics
• Generalize facet implementation