Post on 24-May-2015
1 |
Search Discover Analyze
Grant Ingersoll, Chief Scientist, Lucid Imagination
Enabling Scalable Search, Discovery and Analytics with Solr, Mahout and Hadoop
2 |
• ________ data growth in the next ___ days/months/years
  – Many estimate that 80–90% of data is "unstructured" (multi-structured?)
• The Age of "Data Paranoia"
  – What if I don't collect it all?
  – What if I miss something or lose something?
  – What if I can't store it long enough?
  – How do I secure it?
  – Can I afford to do any of this? Can I afford not to?
  – What if I can't make sense of it?
We All Know the Pain
3 |
Big Data Premise and Promise
Premise                               Promise
Large Scale Data Collection/Storage     ✔
Prevents Data Loss                      ✔
Long Term Storage                       ✔
Affordable                              ✔
New Science Delivering New Insights     ?
4 |
• User Needs:
  – Real-time, ad hoc access to content
  – Aggressive prioritization based on importance
  – Serendipity
• Batch processing isn't enough
• Search is built for multi-structured data
• Deeper analysis yields:
  – Business insight into users
  – Better Search and Discovery for users
Why Search, Discovery and Analytics (SDA)?
[Diagram: Search, Discovery, Analytics]
5 |
• Fast, efficient, scalable search
  – Bulk and Near Real Time Indexing
• Large scale, cost-effective storage
• Large scale processing power
  – Large scale and distributed for whole-data consumption and analysis
  – Sampling tools
  – Distributed in-memory where appropriate
• NLP and machine learning tools that scale to enhance discovery and analysis
What do you need for SDA?
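The "sampling tools" item above can be illustrated with a classic single-pass technique for data too large to hold in memory: reservoir sampling, which keeps a uniform random sample of k items from a stream of unknown length. This is an illustrative sketch, not code from the product described in these slides.

```python
import random

# Reservoir sampling: keep a uniform random sample of k items from a
# stream in one pass, without knowing the stream's length in advance.
# Illustrative only -- not a LucidWorks Big Data component.

def reservoir_sample(stream, k, seed=None):
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep item i with probability k / (i + 1) by replacing
            # a uniformly chosen slot.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# Sample 10 items from a million-element stream in one pass.
sample = reservoir_sample(range(1_000_000), k=10, seed=42)
```

Because each item's survival probability works out to exactly k/n, the sample stays uniform no matter how long the stream runs, which is what makes this useful for getting quick estimates over data that otherwise requires a full batch job.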
6 |
• Dark Data
  – Petabytes (and beyond) of content in storage with little insight into what's in it
  – Forensics, Intelligence Gathering, Risk Analysis, etc.
• Financial
  – Enable a total customer view to better understand risks and opportunities
• Medical
  – Extend research capabilities through deeper analysis of scientific data, publications and field usage
• Social Media Monitoring
  – Understand and analyze social networks and their trends at all times, no matter the scale
• Commerce
  – Drive more sales through metric-driven search and discovery without the guesswork
Example Use Cases
7 |
An application development platform aimed at enabling Search, Discovery and Analysis of your content and user interactions, no matter the volume, variety and velocity of that content, nor the number of users
Announcing LucidWorks Big Data Beta
8 |
Architecture
9 |
• Combines the real-time, ad hoc data accessibility of LucidWorks with the compute and storage capabilities of Hadoop
• Delivers analytic capabilities along with scalable machine learning algorithms for deeper insight into both content and users
• RESTful API supporting JSON input/output formats for easy integration
• Full Stack – minimizes the impact of provisioning Hadoop, LucidWorks and other components
• Hosted in the cloud and supported by Lucid Imagination
Key Features of Beta
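A RESTful API with JSON input/output typically means indexing looks like POSTing a JSON array of documents. The sketch below shows what building such a batch payload might look like; the endpoint path and field names are hypothetical, not taken from the actual LucidWorks Big Data API.

```python
import json

# Sketch of preparing a batch-indexing request for a JSON-in/JSON-out
# REST API. Field names and the endpoint path in the comment below are
# assumptions for illustration only.

def build_index_payload(docs):
    """Serialize a batch of documents as a JSON array for a bulk-index call."""
    for doc in docs:
        if "id" not in doc:
            raise ValueError("every document needs a unique id")
    return json.dumps(docs)

docs = [
    {"id": "doc-1", "title": "Scaling Solr", "body": "Sharding with SolrCloud..."},
    {"id": "doc-2", "title": "Mahout Clustering", "body": "K-Means at scale..."},
]
payload = build_index_payload(docs)

# The payload would then be POSTed to the service, e.g. (hypothetical path):
#   POST /sda/v1/collections/<name>/index
#   Content-Type: application/json
```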
10 |
APIs
• Search and Indexing
  – Full power of LucidWorks (Solr)
  – Bulk and Near Real Time Indexing
  – Sharded via SolrCloud
• Workflows
  – Predefined workflows ease common data tasks such as bulk indexing
• Administration
  – Access to key system information
  – User management
• Analytics
  – Common search analytics for better understanding of relevancy, based on log analysis
  – Historical views
• Machine Learning
  – Clustering
  – Statistically Interesting Phrases
  – Future enhancements planned
• Proxy APIs
  – LucidWorks
  – WebHDFS
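The "search analytics based on log analysis" bullet boils down to aggregating query-log records into metrics such as top queries and the zero-result rate. A minimal sketch of that aggregation, assuming a simplified (query, hit-count) record format rather than the real log schema:

```python
from collections import Counter

# Minimal search-log analytics: top queries and zero-result rate.
# The (query, num_hits) record format is an assumption for illustration;
# the real system derives metrics from LucidWorks logs shipped into HDFS.

def log_metrics(records):
    """records: iterable of (query, num_hits) tuples from a query log."""
    counts = Counter()
    zero_results = 0
    total = 0
    for query, num_hits in records:
        counts[query.strip().lower()] += 1  # normalize before counting
        total += 1
        if num_hits == 0:
            zero_results += 1
    return {
        "top_queries": counts.most_common(3),
        "zero_result_rate": zero_results / total if total else 0.0,
    }

log = [("solr", 120), ("hadoop", 42), ("Solr", 98), ("asdfgh", 0)]
metrics = log_metrics(log)
```

Zero-result queries are a direct signal of relevancy gaps: they show what users asked for that the index could not answer, which is exactly the kind of insight the Analytics API is meant to surface.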
11 |
Under the Hood
• Lucene/Solr 4.0-dev
• Sharded with SolrCloud
  – 1 second (default) soft commits for NRT updates
  – 1 minute (default) hard commits (no searcher reopen)
  – Transaction logs for recovery
  – Solr takes care of leader election, etc., so no more master/worker
• See Mark Miller's talk on SolrCloud
• RESTful services built on Restlet 2.1
• Service discovery, load balancing and failover enabled via ZooKeeper + Netflix Curator
• Authentication and authorization over SSL (optional)
• Proxies for the LucidWorks and WebHDFS APIs
• Workflow engine coordinates data flow
LucidWorks 2.1 SDA Engine
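Sharding "via SolrCloud" means each document is routed to a shard by hashing its unique key, so the same id always lands on the same shard and updates/deletes route consistently. SolrCloud itself uses MurmurHash3 over hash ranges; the MD5-modulo scheme below is a simplified stand-in to show the idea, not Solr's actual algorithm.

```python
import hashlib

# Simplified illustration of hash-based document routing in a sharded
# index. NOT SolrCloud's real routing (which uses MurmurHash3 over hash
# ranges); this just demonstrates the principle.

def shard_for(doc_id: str, num_shards: int) -> int:
    # Hash the unique key and map it onto a shard.
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Deterministic: the same id always routes to the same shard.
shard = shard_for("doc-1", 4)
shards_hit = {shard_for(f"doc-{i}", 4) for i in range(100)}
```

Deterministic routing is what lets each shard independently handle its own soft/hard commits and transaction-log recovery without a central master deciding where documents live.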
12 |
Under the Hood
• Apache Hadoop
  – Map-Reduce (MR) jobs for ETL and bulk indexing into the SolrCloud-sharded system
  – Leverage Pig and custom MR jobs for log processing and metric calculation
  – WebHDFS
• Apache Mahout
  – K-Means Clustering
  – Statistically Interesting Phrases
  – More to come
• Apache HBase
  – Key-value and time-series storage of all calculated metrics
• Apache Pig
  – ETL
  – Log analysis -> HBase
• Apache ZooKeeper
  – Netflix Curator for service discovery and a higher-level ZK client
• Apache Kafka
  – Pub-sub for collecting logs from LucidWorks into HDFS
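Mahout runs K-Means as distributed Map-Reduce jobs over vectors in HDFS; the single-machine sketch below just illustrates the algorithm itself on 1-D points: assign each point to its nearest centroid, move each centroid to the mean of its assigned points, and repeat.

```python
import random

# Pure-Python K-Means on 1-D points, for illustration only. Mahout's
# version does the same assign/update loop, but distributed via
# Map-Reduce over vectors stored in HDFS.

def kmeans_1d(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct points to start
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return sorted(centroids)

# Two obvious groups: one near 1.0, one near 10.0.
points = [1.0, 1.2, 0.8, 9.8, 10.0, 10.2]
centroids = kmeans_1d(points, k=2)
```

The same assign/update loop parallelizes naturally: the assignment step is a map over points and the centroid update is a reduce per cluster, which is why K-Means was among the first algorithms Mahout scaled out.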
13 |
• Our approach is from search and discovery outwards to analytics
  – Analytics in the beta are focused on analysis of search logs
• Analytics Themes
  – Relevance
  – Data quality
  – Discovery
  – Integration with other packages (R?)
• Machine Learning
  – Classification
  – NLP
• More analytics on the index itself?
The Road Ahead
14 |
• http://bit.ly/lucidworks-big-data
• http://www.lucidimagination.com
• grant@lucidimagination.com
• @gsingers
Contacts