Post on 24-May-2015
1 |
Search Discover Analyze
Grant Ingersoll, Chief Scientist, Lucid Imagination
Enabling Scalable Search, Discovery and Analytics with Solr, Mahout and Hadoop
2 |
• ________ data growth in the next ___ days/months/years
  – Many estimate that 80–90% of data is "unstructured" (multi-structured?)
• The Age of "Data Paranoia"
  – What if I don't collect it all?
  – What if I miss something or lose something?
  – What if I can't store it long enough?
  – How do I secure it?
  – Can I afford to do any of this? Can I afford not to?
  – What if I can't make sense of it?
We All Know the Pain
3 |
Big Data Premise and Promise
Premise                               Promise
Large Scale Data Collection/Storage     ✔
Prevents Data Loss                      ✔
Long Term Storage                       ✔
Affordable                              ✔
New Science Delivering New Insights     ?
4 |
• User Needs:
  – Real-time, ad hoc access to content
  – Aggressive prioritization based on importance
  – Serendipity
• Batch processing isn't enough
• Search is built for multi-structured data
• Deeper analysis yields:
  – Business insight into users
  – Better Search and Discovery for users
Why Search, Discovery and Analytics (SDA)?
[Diagram: Search, Discovery, Analytics]
5 |
• Fast, efficient, scalable search
  – Bulk and Near Real Time Indexing
• Large scale, cost-effective storage
• Large scale processing power
  – Large scale and distributed for whole-data consumption and analysis
  – Sampling tools
  – Distributed in-memory where appropriate
• NLP and machine learning tools that scale to enhance discovery and analysis
What do you need for SDA?
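The "sampling tools" item above can be illustrated with a classic single-pass technique for data too large to hold in memory: reservoir sampling, which keeps a uniform random sample of k items from a stream of unknown length. This is an illustrative sketch, not code from the product described in these slides.

```python
import random

# Reservoir sampling: keep a uniform random sample of k items from a
# stream in one pass, without knowing the stream's length in advance.
# Illustrative only -- not a LucidWorks Big Data component.

def reservoir_sample(stream, k, seed=None):
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep item i with probability k / (i + 1) by replacing
            # a uniformly chosen slot.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# Sample 10 items from a million-element stream in one pass.
sample = reservoir_sample(range(1_000_000), k=10, seed=42)
```

Because each item's survival probability works out to exactly k/n, the sample stays uniform no matter how long the stream runs, which is what makes this useful for getting quick estimates over data that otherwise requires a full batch job.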
6 |
• Dark Data
  – Petabytes (and beyond) of content in storage with little insight into what's in it
  – Forensics, Intelligence Gathering, Risk Analysis, etc.
• Financial
  – Enable a total customer view to better understand risks and opportunities
• Medical
  – Extend research capabilities through deeper analysis of scientific data, publications and field usage
• Social Media Monitoring
  – Understand and analyze social networks and their trends at all times, no matter the scale
• Commerce
  – Drive more sales through metric-driven search and discovery without the guesswork
Example Use Cases
7 |
An application development platform aimed at enabling Search, Discovery and Analysis of your content and user interactions, no matter the volume, variety and velocity of that content, nor the number of users
Announcing LucidWorks Big Data Beta
8 |
Architecture
9 |
• Combines the real-time, ad hoc data accessibility of LucidWorks with the compute and storage capabilities of Hadoop
• Delivers analytic capabilities along with scalable machine learning algorithms for deeper insight into both content and users
• RESTful API supporting JSON input/output formats for easy integration
• Full Stack – minimizes the impact of provisioning Hadoop, LucidWorks and other components
• Hosted in the cloud and supported by Lucid Imagination
Key Features of Beta
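A RESTful API with JSON input/output typically means indexing looks like POSTing a JSON array of documents. The sketch below shows what building such a batch payload might look like; the endpoint path and field names are hypothetical, not taken from the actual LucidWorks Big Data API.

```python
import json

# Sketch of preparing a batch-indexing request for a JSON-in/JSON-out
# REST API. Field names and the endpoint path in the comment below are
# assumptions for illustration only.

def build_index_payload(docs):
    """Serialize a batch of documents as a JSON array for a bulk-index call."""
    for doc in docs:
        if "id" not in doc:
            raise ValueError("every document needs a unique id")
    return json.dumps(docs)

docs = [
    {"id": "doc-1", "title": "Scaling Solr", "body": "Sharding with SolrCloud..."},
    {"id": "doc-2", "title": "Mahout Clustering", "body": "K-Means at scale..."},
]
payload = build_index_payload(docs)

# The payload would then be POSTed to the service, e.g. (hypothetical path):
#   POST /sda/v1/collections/<name>/index
#   Content-Type: application/json
```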
10 |
APIs
• Search and Indexing
  – Full power of LucidWorks (Solr)
  – Bulk and Near Real Time Indexing
  – Sharded via SolrCloud
• Workflows
  – Predefined workflows ease common data tasks such as bulk indexing
• Administration
  – Access to key system information
  – User management
• Analytics
  – Common search analytics for better understanding of relevancy, based on log analysis
  – Historical views
• Machine Learning
  – Clustering
  – Statistically Interesting Phrases
  – Future enhancements planned
• Proxy APIs
  – LucidWorks
  – WebHDFS
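The "search analytics based on log analysis" bullet boils down to aggregating query-log records into metrics such as top queries and the zero-result rate. A minimal sketch of that aggregation, assuming a simplified (query, hit-count) record format rather than the real log schema:

```python
from collections import Counter

# Minimal search-log analytics: top queries and zero-result rate.
# The (query, num_hits) record format is an assumption for illustration;
# the real system derives metrics from LucidWorks logs shipped into HDFS.

def log_metrics(records):
    """records: iterable of (query, num_hits) tuples from a query log."""
    counts = Counter()
    zero_results = 0
    total = 0
    for query, num_hits in records:
        counts[query.strip().lower()] += 1  # normalize before counting
        total += 1
        if num_hits == 0:
            zero_results += 1
    return {
        "top_queries": counts.most_common(3),
        "zero_result_rate": zero_results / total if total else 0.0,
    }

log = [("solr", 120), ("hadoop", 42), ("Solr", 98), ("asdfgh", 0)]
metrics = log_metrics(log)
```

Zero-result queries are a direct signal of relevancy gaps: they show what users asked for that the index could not answer, which is exactly the kind of insight the Analytics API is meant to surface.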
11 |
Under the Hood
• Lucene/Solr 4.0-dev
• Sharded with SolrCloud
  – 1 second (default) soft commits for NRT updates
  – 1 minute (default) hard commits (no searcher reopen)
  – Transaction logs for recovery
  – Solr takes care of leader election, etc., so no more master/worker
• See Mark Miller's talk on SolrCloud
• RESTful services built on Restlet 2.1
• Service discovery, load balancing and failover enabled via ZooKeeper + Netflix Curator
• Authentication and authorization over SSL (optional)
• Proxies for the LucidWorks and WebHDFS APIs
• Workflow engine coordinates data flow
LucidWorks 2.1 SDA Engine
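Sharding "via SolrCloud" means each document is routed to a shard by hashing its unique key, so the same id always lands on the same shard and updates/deletes route consistently. SolrCloud itself uses MurmurHash3 over hash ranges; the MD5-modulo scheme below is a simplified stand-in to show the idea, not Solr's actual algorithm.

```python
import hashlib

# Simplified illustration of hash-based document routing in a sharded
# index. NOT SolrCloud's real routing (which uses MurmurHash3 over hash
# ranges); this just demonstrates the principle.

def shard_for(doc_id: str, num_shards: int) -> int:
    # Hash the unique key and map it onto a shard.
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Deterministic: the same id always routes to the same shard.
shard = shard_for("doc-1", 4)
shards_hit = {shard_for(f"doc-{i}", 4) for i in range(100)}
```

Deterministic routing is what lets each shard independently handle its own soft/hard commits and transaction-log recovery without a central master deciding where documents live.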
12 |
Under the Hood
• Apache Hadoop
  – Map-Reduce (MR) jobs for ETL and bulk indexing into the SolrCloud-sharded system
  – Leverage Pig and custom MR jobs for log processing and metric calculation
  – WebHDFS
• Apache Mahout
  – K-Means Clustering
  – Statistically Interesting Phrases
  – More to come
• Apache HBase
  – Key-value and time-series storage of all calculated metrics
• Apache Pig
  – ETL
  – Log analysis -> HBase
• Apache ZooKeeper
  – Netflix Curator for service discovery and a higher-level ZK client
• Apache Kafka
  – Pub-sub for collecting logs from LucidWorks into HDFS
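Mahout runs K-Means as distributed Map-Reduce jobs over vectors in HDFS; the single-machine sketch below just illustrates the algorithm itself on 1-D points: assign each point to its nearest centroid, move each centroid to the mean of its assigned points, and repeat.

```python
import random

# Pure-Python K-Means on 1-D points, for illustration only. Mahout's
# version does the same assign/update loop, but distributed via
# Map-Reduce over vectors stored in HDFS.

def kmeans_1d(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct points to start
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return sorted(centroids)

# Two obvious groups: one near 1.0, one near 10.0.
points = [1.0, 1.2, 0.8, 9.8, 10.0, 10.2]
centroids = kmeans_1d(points, k=2)
```

The same assign/update loop parallelizes naturally: the assignment step is a map over points and the centroid update is a reduce per cluster, which is why K-Means was among the first algorithms Mahout scaled out.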
13 |
• Our approach is from search and discovery outwards to analytics
  – Analytics in the beta are focused on analysis of search logs
• Analytics Themes
  – Relevance
  – Data quality
  – Discovery
  – Integration with other packages (R?)
• Machine Learning
  – Classification
  – NLP
• More analytics on the index itself?
The Road Ahead
14 |
• http://bit.ly/lucidworks-big-data
• http://www.lucidimagination.com
• grant@lucidimagination.com
• @gsingers
Contacts