SolrCloud on Hadoop
SolrCloud on Hadoop Cleveland Big Data and Hadoop User Group, January 2014 Alex Moundalexis [email protected] @technmsg
Disclaimer
• Technologies, not products • Cloudera builds software
• most donated to Apache • some closed-source
• I will likely mention “Cloudera Something” • Cloudera “products” I reference are open source
• Apache Licensed • Source code is on GitHub
• https://github.com/cloudera
What This Talk Isn’t About
• Deploying • Puppet, Chef, Ansible, homegrown scripts, intern labor
• Sizing & Tuning • Depends heavily on data and workload
• Coding • Unless you count XML or CSV
• Algorithms
Quick and dirty, more time for use cases.
The Apache Hadoop Ecosystem
Why “Ecosystem?”
• In the beginning, just Hadoop • HDFS • MapReduce
• Today, dozens of interrelated components • I/O • Processing • Specialty Applications • Configuration • Workflow
Partial Ecosystem
[Diagram: Hadoop in the center. Sources: external system, RDBMS / DWH, web server, device logs (via API access, log collection, DB table import). Inside: batch processing, machine learning. Consumers: external system, user, RDBMS / DWH (via API access, DB table export, BI tool + JDBC/ODBC, Search, SQL).]
HDFS
• Distributed, highly fault-tolerant filesystem • Optimized for large streaming access to data • Based on Google File System
• http://research.google.com/archive/gfs.html
Lots of Commodity Machines
Image: Yahoo! Hadoop cluster [ OSCON ’07 ]
MapReduce (MR)
• Programming paradigm • Batch oriented, not realtime • Works well with distributed computing • Lots of Java, but other languages supported • Based on Google’s paper
• http://research.google.com/archive/mapreduce.html
Under the Covers
You specify map() and reduce() functions. The framework does the rest.
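To make the map()/reduce() split concrete, here is a minimal word-count sketch in plain Python. It is not Hadoop code: run_job is a hypothetical stand-in for the framework, and all function names are assumptions made for illustration. The only point is which part you write (map_fn, reduce_fn) and which part the framework handles (grouping values by key).

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(line: str) -> Iterator[Tuple[str, int]]:
    """map(): emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """reduce(): combine all values that were emitted for one key."""
    return word, sum(counts)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Hypothetical stand-in for the framework: map, group by key, reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            grouped[key].append(value)  # the "shuffle" step, done in memory here
    return dict(reduce_fn(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["big data", "big search"]))
```

On a real cluster the grouping step runs across many machines and the input comes from HDFS; the two user-supplied functions keep the same shape.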
Apache HBase
• Random, realtime read/write access • Key/value columnar store • (b|tr)illions of rows/columns • Based on Google BigTable
• http://research.google.com/archive/bigtable.html
Cloudera Hue
• Hadoop User Experience • Hadoop is largely command line • Hue provides a UI for end-users • SDK to build your own apps on top
Apache Tika
• Content analysis toolkit • Simply put, a lot of parsers • Detect/extract metadata/text from documents
• HTML • XML • Office • PDF • mbox • More…
Apache ZooKeeper
• Distributed systems are HARD • Everyone was trying to implement the same subsystems • Bugs lead to race conditions, other bad things
• ZK: Highly reliable distributed coordination services • Configuration • Naming • Synchronization • Group Services
Cloudera Morphlines
• In-memory transformations • Load, parse, transform, process • Records as name-value pairs w/ optional blob/pojo objects
• Java library, embedded in your codebase • Used to ETL data from Flume and MR into Solr
• Was part of CDK, now part of Kite • http://kitesdk.org
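The idea of a chain of in-memory transformations over name-value records can be sketched in a few lines of Python. This is not the Morphlines API (Morphlines is a Java library); split_csv, drop_if_missing, and run_pipeline are hypothetical names invented for this illustration.

```python
from typing import Callable, Dict, Iterable, List, Optional

# A record maps field names to lists of values (an assumption for this sketch).
Record = Dict[str, List[str]]
# A command returns the transformed record, or None to drop the record.
Command = Callable[[Record], Optional[Record]]


def split_csv(names: List[str]) -> Command:
    """Parse the 'message' field as one CSV line into named fields."""
    def command(record: Record) -> Optional[Record]:
        values = record["message"][0].split(",")
        out = dict(record)
        for name, value in zip(names, values):
            out[name] = [value.strip()]
        return out
    return command


def drop_if_missing(name: str) -> Command:
    """Drop records whose named field is absent or empty."""
    def command(record: Record) -> Optional[Record]:
        return record if record.get(name, [""])[0] else None
    return command


def run_pipeline(records: Iterable[Record], commands: List[Command]) -> List[Record]:
    """Pass each record through every command in order."""
    results = []
    for record in records:
        current: Optional[Record] = record
        for command in commands:
            current = command(current)
            if current is None:
                break  # a command dropped the record
        else:
            results.append(current)
    return results
```

In this sketch the last stage simply collects records into a list; in the setup the slide describes, the final step would load them into Solr.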
Apache Lucene
• Java-based index and search • ranked or sorted results • hits streamed through QP
• mem(results) < mem(collection)
• rich/extensible query operators • bool, phrase, range, span, spatial
• Features • spellchecking • hit highlighting • tokenization
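The point that results need less memory than the collection can be illustrated with a bounded top-k selection over a stream of hits. This is a toy sketch, not Lucene's implementation: the scoring (a raw term count) and the function names are assumptions.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple


def score_hits(docs: Iterable[Tuple[str, str]], term: str) -> Iterator[Tuple[int, str]]:
    """Stream (score, doc_id) pairs; the score is a toy term-frequency count."""
    for doc_id, text in docs:
        score = text.lower().split().count(term.lower())
        if score > 0:
            yield score, doc_id


def top_k(docs: Iterable[Tuple[str, str]], term: str, k: int = 2) -> List[Tuple[int, str]]:
    """Return the k best hits, best first, without materializing all hits."""
    # heapq.nlargest consumes the generator and returns only the k largest items.
    return heapq.nlargest(k, score_hits(docs, term))


if __name__ == "__main__":
    sample = [("a", "solr solr solr"), ("b", "solr"), ("c", "hadoop"), ("d", "solr solr")]
    print(top_k(sample, "solr"))
```

Because hits are generated lazily and only the best k are retained, memory use depends on k rather than on the number of matching documents.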
Apache Solr
• Enterprise search platform • Based on Apache Lucene
• Full-text search • Faceting • NRT indexing • UI
Apache Solr – Simple Indexing via CLI
$ java -jar post.jar solr.xml money.xml
SimplePostTool: version 1.4
SimplePostTool: POSTing files to http://localhost:8983/solr/update..
SimplePostTool: POSTing file solr.xml
SimplePostTool: POSTing file money.xml
SimplePostTool: COMMITting Solr index changes..
$ post.sh *.xml
Apache Solr – Document money.xml
<add>
  <doc>
    <field name="id">USD</field>
    <field name="name">One Dollar</field>
    <field name="manu">Bank of America</field>
    <field name="manu_id_s">boa</field>
    <field name="cat">currency</field>
    <field name="features">Coins and notes</field>
    <field name="price_c">1,USD</field>
    <field name="inStock">true</field>
  </doc>
  <doc>
    <field name="id">EUR</field>
    <field name="name">One Euro</field>
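Documents in this add/doc/field shape do not have to be written by hand. As a rough sketch, the following Python builds the same structure from dictionaries using only the standard library; build_add_xml is a hypothetical helper name, and the sample values are taken from the document above.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable


def build_add_xml(docs: Iterable[Dict[str, str]]) -> str:
    """Serialize dictionaries into the <add><doc><field name="..."> layout."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", {"name": name})
            field.text = value  # ElementTree escapes special characters on output
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    print(build_add_xml([{"id": "USD", "name": "One Dollar", "cat": "currency"}]))
```

The resulting string could be saved to a file and posted with the same tool shown on the previous slide.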
Apache Solr – More Advanced Indexing
• From DB, using Data Import Handler (DIH) • Load a CSV file • POST JSON documents • Index binary documents (uses Tika) • SolrJ for programmatic document creation
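As one example of these options, posting JSON documents might look roughly like the sketch below. It assumes the collection's /update handler accepts a JSON array of documents when the Content-Type is application/json (older Solr releases exposed this at /update/json instead), and that commit=true is acceptable for a demo. The host and collection name are placeholders, and build_update_request is a hypothetical helper. The request is only constructed here, not sent.

```python
import json
import urllib.request
from typing import Dict, List


def build_update_request(base_url: str, docs: List[Dict[str, object]]) -> urllib.request.Request:
    """Prepare (but do not send) an HTTP POST carrying documents as JSON."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/update?commit=true",  # assumed handler path
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_update_request(
        "http://solr:8983/solr/collection1",  # placeholder host and collection
        [{"id": "USD", "name": "One Dollar", "cat": "currency"}],
    )
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request)
```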
Apache Solr – Querying
• HTTP GET • http://solr:8983/solr/collection1/select/
• Examples • ?q=timestamp:[* TO NOW] • ?q=-instock:false • ?q={!lucene q.op=AND df=text}myfield:foo +bar -bat
Apache Solr – Querying
• HTTP GET • http://solr:8983/solr/collection1/select/?q=video
• Examples • &fl=name,id (return only name and id fields) • &fl=name,id,score (return relevancy score as well) • &fl=*,score (return all fields + relevancy score) • &sort=price desc&fl=name,id,price (sort by price desc) • &wt=json (return response in JSON format)
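Since a query is just an HTTP GET with parameters, the URL can be assembled with the standard library, which also takes care of escaping spaces and commas. A minimal sketch, assuming the host and collection from the slide; build_query_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Placeholder endpoint taken from the slide.
SELECT_URL = "http://solr:8983/solr/collection1/select"


def build_query_url(q: str, **params: str) -> str:
    """Combine the main query with optional parameters such as fl, sort, wt."""
    return f"{SELECT_URL}?{urlencode({'q': q, **params})}"


if __name__ == "__main__":
    # Search for "video", return three fields, sort by price, ask for JSON.
    print(build_query_url("video", fl="name,id,price", sort="price desc", wt="json"))
    # Range queries need escaping too; urlencode handles the brackets.
    print(build_query_url("timestamp:[* TO NOW]"))
```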
What the Heck is Faceting?
• Generate counts for properties or categories • Links allow drill-down or refine search results
What?
Facets on Amazon.com
Apache Solr – Facets at Query Time
• HTTP GET • http://solr:8983/solr/collection1/select/?q=video • All docs, count by category: q=*:*&facet=true&facet.field=cat
• All docs, count by category and in-stock status: q=*:*&facet=true&facet.field=cat&facet.field=inStock
• Docs matching “ipod”, count by price (above/below $100): q=ipod&facet=true&facet.query=price:[0 TO 100]&facet.query=price:[100 TO *]
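Facet parameters repeat (several facet.field or facet.query entries in one request), so they are easier to build from a list of pairs than from a dictionary. The sketch below also pairs up facet counts, assuming Solr's default JSON layout in which each facet field is a flat list alternating value and count. The helper names and the sample counts are made up for illustration.

```python
from typing import Dict, List, Sequence, Tuple, Union
from urllib.parse import urlencode


def facet_params(q: str, fields: Sequence[str] = (), queries: Sequence[str] = ()) -> str:
    """Build a query string with repeated facet.field / facet.query entries."""
    params: List[Tuple[str, str]] = [("q", q), ("facet", "true")]
    params += [("facet.field", field) for field in fields]
    params += [("facet.query", query) for query in queries]
    params.append(("wt", "json"))
    return urlencode(params)


def pair_counts(flat: Sequence[Union[str, int]]) -> Dict[str, int]:
    """Turn ["a", 3, "b", 1] (assumed flat response layout) into {"a": 3, "b": 1}."""
    return dict(zip(flat[0::2], flat[1::2]))


if __name__ == "__main__":
    print(facet_params("*:*", fields=["cat", "inStock"]))
    print(facet_params("ipod", queries=["price:[0 TO 100]", "price:[100 TO *]"]))
    # Hypothetical sample of what a "cat" facet might contain.
    print(pair_counts(["electronics", 14, "currency", 4]))
```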
Apache Solr – Querying via UI
Apache SolrCloud
• Integration of Solr + ZooKeeper • Provides for shard failover
Cloudera Search
• Based on Apache Solr (incl. Lucene and SolrCloud) • Fault-tolerance: collections backed by HDFS or HBase • Integration galore:
• HBase/Flume/MapReduce w/ Lucene • Hue w/ Solr • Avro w/ Tika • HDFS w/ Solr/Lucene • Sentry w/ Solr
Cloudera Search + Hue
Apologies, I swiped some pretty slides from marketing…
Why Search?
Search Design Strategy
One pool of data
One security framework
One set of system resources
One management interface
An Integrated Part of the Hadoop System
[Diagram: layered platform. Storage: HDFS (TEXT, RCFILE, PARQUET, AVRO, ETC.) and HBase (RECORDS). Integration, Resource Management, Metadata. Engines: Batch Processing (MAPREDUCE, HIVE & PIG), Interactive SQL (CLOUDERA IMPALA), Interactive Search (CLOUDERA SEARCH), Machine Learning (MAHOUT), Math & Statistics (SAS, R), …]
Benefits of Search Integration
Improved Big Data ROI • An interactive experience without technical knowledge • Single data set for multiple computing frameworks
Faster Time to Insight • Exploratory analysis, esp. unstructured data • Broad range of indexing options to accommodate needs
Cost Efficiency • Single scalable platform; no incremental investment • No need for separate systems, storage
Solid Foundations & Reliability • Solr in production environments for years • Hadoop-powered reliability and scalability
Some quick examples.
Search Use Cases
Powerful, proven search capabilities that let organizations:
• Offer easy access to non-technical resources
• Explore data prior to processing and modeling
• Gain immediate access and find correlations in mission-critical data
Monsanto
Scalable, efficient image search for analysis and research
Track plant characteristics throughout their lifecycle
Before: Manual attribute extraction and search queries within database
Now: Parse and index images at acquisition and on demand, index archived images in batch
Cloudera: Internal Field Portal
Custom Aggregated Search
Cloudera – Internal Field Portal
• Single stop for field engineers • Mailing lists: public, private • Tickets: support, development, public ASF • Customer data: accounts, clusters, KB articles • Customer Clusters: configs, audits, logs, events • Books and papers • Discussion forums
• Dogfooding, yes • Makes my life easier
Cloudera – Internal Field Portal
• Varied fetchers/observers for web/API content • Content is retrieved via Flume, Sqoop
• Search indexes and replicates into HBase • Each collection has collection-specific filters/fields • Provides title, content snippet, link to original
• Morphlines extracts books and papers using Tika • Impala for analytics
• Future: Use MapReduce to ingest logs
Parting thoughts… in no particular order.
Summary
Search Simplifies Interaction
Explore
Navigate
Correlate
Experts know MapReduce. Savvy people know SQL. Everyone knows Search.
Summary
• With Hadoop, it depends. • The tools are out there. • Open source software, hooray!
• Many interconnected pieces • Many unexplored opportunities • A thriving community awaits you…
• Data can make a difference. • Search allows everyone to interact with data.
• This is a Big Deal.
What’s Next?
• Search examples • http://blog.cloudera.com/blog/category/search/
• Cloudera provides pre-loaded VMs • http://tiny.cloudera.com/quickstartvm
• Clone our repos! • https://github.com/cloudera
Preferably related to the talk…
Questions?
Thank You! Alex Moundalexis [email protected] @technmsg We’re hiring, kids! Well, not kids.