Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

description

You’ve got your Hadoop cluster, you’ve got your petabytes of unstructured data, you run MapReduce jobs and SQL-on-Hadoop queries. Something is still missing, though. After all, we are not expected to enter SQL queries while looking for information on the web. AltaVista and Google solved it for us ages ago. Why are we still requiring SQL or Java certification from our enterprise bigdata users? In this talk, we will look into how the integration of SolrCloud into Apache Bigtop now makes it possible to build bigdata indexing solutions and ingest pipelines. We will dive into the details of integrating full-text search into the lifecycle of your bigdata management applications and exposing the power of Google-in-a-box to all enterprise users, not just a chosen few data scientists.

Transcript of Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

Page 1: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

Page 2: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

Who’s this guy?

Page 3: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

Roman Shaposhnik, @rhatr

•  Sr. Manager at Pivotal Inc. building a team of ASF contributors
•  ASF junkie
    •  VP of Apache Incubator, former VP of Apache Bigtop
    •  Hadoop/Sqoop/Giraph committer
    •  contributor across the Hadoop ecosystem
•  Used to be root@Cloudera
•  Used to be a PHB at Yahoo!
•  Used to be a UNIX hacker at Sun Microsystems
•  First time author: “Giraph in Action”

Page 4: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

What’s this all about?

Page 5: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

This is NOT this kind of talk

Page 6: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

This is this kind of a talk:

Page 7: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

What are we building?

Page 8: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

WWW analytics platform

[Architecture diagram. Labels shown: WWW, Nutch, HDFS, HBase, MapReduce, Zookeeper, Lily HBase Indexer, Replication, Morphlines, Solr Cloud, Hive, Pig, Hue, DataSci]

Page 9: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

Google papers
•  GFS (Google FS) == HDFS
•  MapReduce == MapReduce
•  Bigtable == HBase
•  Sawzall == Pig/Hive
•  F1 == HAWQ/Impala

Page 10: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

Storage design requirements
•  Low-level storage layer: KISS
    •  commodity hardware
    •  massively scalable
    •  highly available
    •  minimalistic set of APIs (non-POSIX)
•  Application-specific storage layer
    •  leverages the low-level storage layer (LLSL)
    •  Fast r/w random access (vs. immutable streaming)
    •  Scan operations

Page 11: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

Design patterns
•  HDFS is the “data lake”
    •  Great simplification of storage administration (no SAN/NAS)
    •  Persistence layer for “stateless” distributed applications
•  Applications are “stateless” compositions of various services
    •  Can be instantiated anywhere (think YARN)
    •  Can restart serving up the state from HDFS
    •  Are coordinated via Zookeeper

Page 12: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

Application design: SolrCloud

[Diagram. Components: HDFS, Zookeeper, Solr svc … Solr svc. Messages shown: “I am alive”, “Who Am I? What do I do?”]

Page 13: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

Application design: SolrCloud

[Diagram. Components: HDFS, Zookeeper, Solr svc. Messages shown: “Peer is dead”, “What do I do?”]

Page 14: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

Application design: SolrCloud

[Diagram. Components: HDFS, Zookeeper, Solr svc … Solr svc. Messages shown: “I am alive”, “Who Am I? What do I do?”, “replication kicks in”]
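The three diagrams above walk through one coordination cycle: a service announces itself, learns its role, and survivors take over when a peer dies. The following is only a toy in-memory sketch of that flow in Python. The `Coordinator` class, the node names and the round-robin assignment are all hypothetical and are not how Zookeeper or SolrCloud implement it (SolrCloud relies on Zookeeper watches, leader election and replicas); the sketch only shows why keeping the index data in shared storage lets any live node pick up the work.

```python
class Coordinator:
    """Toy stand-in for Zookeeper: tracks live nodes and shard assignments."""

    def __init__(self, shards):
        self.shards = list(shards)
        self.live = []

    def register(self, node):
        # "I am alive", followed by "Who am I? What do I do?"
        self.live.append(node)
        return self.assignments()[node]

    def node_died(self, node):
        # "Peer is dead. What do I do?": survivors pick up the orphaned shards.
        self.live.remove(node)
        return self.assignments()

    def assignments(self):
        # Hypothetical policy: round-robin shards over the live nodes.
        # The index data itself is assumed to stay in shared storage (HDFS).
        result = {node: [] for node in self.live}
        for i, shard in enumerate(self.shards):
            if self.live:
                result[self.live[i % len(self.live)]].append(shard)
        return result


zk = Coordinator(["shard1", "shard2"])
first = zk.register("solr-a")
second = zk.register("solr-b")
after_failure = zk.node_died("solr-a")
print(first, second, after_failure)
```

Because no node owns state locally in this model, reassigning a shard is just a change of bookkeeping in the coordinator.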

Page 15: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

How do we build something like this?

Page 16: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

The bill of materials
•  HDFS
•  Zookeeper
•  HBase
•  Nutch
•  Lily HBase indexer
•  SolrCloud
•  Morphlines (part of Project Kite)
•  Hue
•  Hive/Pig/…

Page 17: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

How about?

$ for comp in hadoop hbase zookeeper … ; do
    wget http://dist.apache.org/$comp
    tar xzvf $comp.tar.gz
    cd $comp ; mvn/ant/make install
    scp …
    ssh …
  done

Page 19: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

We’ve seen this before!

[Diagram labels: GNU software, Linux kernel]

Page 20: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

Apache Bigtop!

[Diagram labels: HBase, Solr, ...; Hadoop]

Page 21: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

Let’s get down to business

Page 22: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

Still remember this?

[Architecture diagram. Labels shown: HDFS, HBase, Solr Cloud, Lily HBase Indexer, Replication, Morphlines, Zookeeper, Hue, DataSci]

Page 23: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

HBase: row-key design

Row key: com.cnn.www/a.html

    content:        <html>...
    anchor:a.com    CNN
    anchor:b.com    CNN.com
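The row key on this slide stores the host name with its labels reversed (com.cnn.www rather than www.cnn.com), so pages from the same domain sort next to each other. Below is a minimal Python sketch of deriving such a key from a URL. The function name `row_key` is hypothetical, and dropping the scheme, port and query string is an assumption; the slide does not say how those are handled.

```python
from urllib.parse import urlsplit


def row_key(url: str) -> str:
    """Build a reversed-host row key such as 'com.cnn.www/a.html'."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    reversed_host = ".".join(reversed(host.split(".")))
    # Assumption: scheme, port, query and fragment are not part of the key.
    return reversed_host + parts.path


print(row_key("http://www.cnn.com/a.html"))
```

With keys built this way, a prefix scan on `com.cnn.` would cover every crawled page under cnn.com.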

Page 24: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

Indexing: schema design
•  Bad news: no more “schema on write”
•  Good news: you can change it on the fly
•  Let’s start with the simplest one:

<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="text" type="text_general" indexed="true" stored="true"/>
<field name="url" type="string" indexed="true" stored="true"/>
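To make the three fields concrete, here is a small Python sketch of a document that fits this schema. The helper `make_doc` and the sample values are hypothetical. In the pipeline described in this talk, documents reach Solr through the Lily HBase indexer rather than by hand, so this only illustrates the document shape.

```python
import json


def make_doc(row_key: str, text: str, url: str) -> dict:
    """Return a document with the three fields declared in the schema."""
    if not row_key:
        # 'id' is declared required="true" in the schema.
        raise ValueError("id is required")
    return {"id": row_key, "text": text, "url": url}


doc = make_doc("com.cnn.www/a.html", "CNN breaking news", "http://www.cnn.com/a.html")
# Solr's JSON update handler accepts a JSON array of documents.
print(json.dumps([doc], indent=2))
```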

Page 25: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

Deployment
•  Single-node pseudo-distributed configuration
•  Puppet-driven deployment
    •  Bigtop comes with modules
    •  You provide your own cluster topology in cluster.pp

Page 26: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

Deploying the ‘data lake’
•  Zookeeper
    •  3-5 members of the ensemble

# vi /etc/zookeeper/conf/zoo.cfg
# service zookeeper-server init
# service zookeeper-server start

•  HDFS
    •  tons of configurations to consider: HA, NFS, etc.
    •  see above, plus: /usr/lib/hadoop/libexec/init-hdfs.sh
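For the 3-5 member ensemble mentioned above, each member needs the same list of `server.N` entries in zoo.cfg. This Python sketch renders such a file. The host names, the `dataDir` default and the timing values are assumptions for illustration, not values from the talk; the keys themselves are standard Zookeeper settings.

```python
def render_zoo_cfg(hosts, data_dir="/var/lib/zookeeper", client_port=2181):
    """Render zoo.cfg text for a replicated Zookeeper ensemble."""
    lines = [
        "tickTime=2000",
        "initLimit=10",
        "syncLimit=5",
        f"dataDir={data_dir}",
        f"clientPort={client_port}",
    ]
    # server.N=host:peerPort:electionPort; N must match the 'myid' file on that host.
    for n, host in enumerate(hosts, start=1):
        lines.append(f"server.{n}={host}:2888:3888")
    return "\n".join(lines) + "\n"


# Hypothetical host names for a three-member ensemble.
cfg = render_zoo_cfg(["zk1.example.com", "zk2.example.com", "zk3.example.com"])
print(cfg)
```

An odd number of members is the usual choice, since the ensemble needs a majority to stay available.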

Page 27: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

HBase asynchronous indexing
•  leveraging WAL for indexing
•  can achieve infinite scalability of the indexer
•  doesn’t slow down HBase (unlike co-processors)
•  /etc/hbase/conf/hbase-site.xml:

<property>
  <name>hbase.replication</name>
  <value>true</value>
</property>
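Since the indexer depends on replication being switched on, a quick check of the config file can save some debugging. Below is a minimal Python sketch that reads a Hadoop-style XML config and looks up one property. The helper name `hadoop_property` and the inline sample are hypothetical; in practice you would read /etc/hbase/conf/hbase-site.xml instead.

```python
import xml.etree.ElementTree as ET


def hadoop_property(xml_text: str, name: str):
    """Return the value of a <property> in a Hadoop-style config, or None."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


sample = """
<configuration>
  <property>
    <name>hbase.replication</name>
    <value>true</value>
  </property>
</configuration>
"""
print(hadoop_property(sample, "hbase.replication") == "true")
```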

Page 28: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

Different clouds

[Diagram. HBase “cloud” (HBase Region Servers …), Lily Indexer “cloud” (Lily Indexer Nodes …, home of Morphline ETL), SolrCloud (Solr Nodes …). Arrows labeled: replication, Solr docs]

Page 29: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

Lily HBase indexer
•  Pretends to be a region server on the receiving end
•  Gets records
•  Pipes them through the Morphline ETL
•  Feeds the result to Solr
•  All operations are managed via individual indexers

Page 30: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

Creating an indexer

$ hbase-indexer add-indexer \
    --name web_crawl \
    --indexer-conf ./indexer.xml \
    --connection-param solr.zk=localhost/solr \
    --connection-param solr.collection=web_crawl \
    --zookeeper localhost:2181

Page 31: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

indexer.xml

<indexer table="web_crawl"
         mapper="com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper">
  <param name="morphlineFile" value="/etc/hbase-solr/conf/morphlines.conf"/>
  <!-- <param name="morphlineId" value="morphline1"/> -->
</indexer>

Page 32: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

Morphlines
•  Part of Project Kite (look for it on GitHub)
•  A very flexible ETL library (not just for HBase)
•  “UNIX pipes” for bigdata
•  Designed for NRT processing
•  Record-oriented processing driven by HOCON definition
•  Require a “pump” (most of the time)
•  Have built-in sinks (e.g. loadSolr)
•  Essentially a push-based data flow engine
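The “push-based data flow engine” idea can be sketched in a few lines of Python. This is a toy model and not the Morphlines API: the `Command` class, `build_chain` and the stage functions are hypothetical. It shows each command transforming a record and pushing the result to its child, with an external “pump” feeding records in and a sink collecting them at the end.

```python
class Command:
    """One stage in the chain: transform a record, then push results to the child."""

    def __init__(self, fn, child=None):
        self.fn = fn          # maps one record to a list of zero or more records
        self.child = child    # next stage, or None for the last stage

    def process(self, record):
        for out in self.fn(record):
            if self.child is not None:
                self.child.process(out)


def build_chain(fns):
    """Wire the stages back to front so each one knows its child."""
    child = None
    for fn in reversed(fns):
        child = Command(fn, child)
    return child


collected = []  # stands in for a sink such as loadSolr


def sink(record):
    collected.append(record)
    return []


chain = build_chain([
    lambda r: [{**r, "text": r["raw"].decode("utf-8")}],   # decode bytes to text
    lambda r: [r] if r["text"].strip() else [],            # drop empty records
    sink,
])

# The "pump": something outside the chain has to push records in.
for raw in [b"hello world", b"   ", b"solr"]:
    chain.process({"raw": raw})

print([r["text"] for r in collected])
```

A stage may emit zero, one or many records per input, so record counts can change from stage to stage.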

Page 33: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

Different clouds

[Pipeline diagram: WAL entries → extractHBaseCells → N records → convertHTML → M records → xquery → P records → logInfo → Solr docs]

Page 34: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

Morphline spec

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.morphline.**", "com.ngdata.**"]
    commands : [
      { extractHBaseCells {…} }
      { convertHTML {charset : UTF-8} }
      { xquery {…} }
      { logInfo { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]

Page 35: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

extractHBaseCells

{
  extractHBaseCells {
    mappings : [
      {
        inputColumn : "content:*"
        outputField : "_attachment_body"
        type : "byte[]"
        source : value
      }
    ]
  }
}

Page 36: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

xquery

{
  xquery {
    fragments : [
      {
        fragmentPath : "/"
        queryString : """
          <fieldsToIndex>
            <webpage>
              {for $tk in //text() return concat($tk, ' ')}
            </webpage>
          </fieldsToIndex>
        """
      }
    ]
  }
}
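The query above gathers every text node of the page and appends a space to each one, producing a single searchable string. A rough Python equivalent, using only the standard library, is sketched below. The names `TextNodes` and `webpage_text` are hypothetical, and this is an approximation of what the convertHTML and xquery steps produce, not a reimplementation of them.

```python
from html.parser import HTMLParser


class TextNodes(HTMLParser):
    """Collect every text node, similar in spirit to //text() in the query."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def webpage_text(html: str) -> str:
    parser = TextNodes()
    parser.feed(html)
    parser.close()
    # Mirror concat($tk, ' '): each text node is followed by a single space.
    return "".join(chunk + " " for chunk in parser.chunks)


print(webpage_text("<html><body><h1>CNN</h1><p>Breaking news</p></body></html>"))
```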

Page 37: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

SolrCloud
•  Serves up Lucene indices from HDFS
•  A webapp running on bigtop-tomcat
    •  gets configured via /etc/default/solr

SOLR_PORT=8983
SOLR_ADMIN_PORT=8984
SOLR_LOG=/var/log/solr
SOLR_ZK_ENSEMBLE=localhost:2181/solr
SOLR_HDFS_HOME=hdfs://localhost:8020/solr
SOLR_HDFS_CONFIG=/etc/hadoop/conf
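The `SOLR_ZK_ENSEMBLE` value packs two things into one string: the Zookeeper host list and a chroot path (`/solr`) under which Solr keeps its nodes. Here is a small Python sketch that reads KEY=VALUE settings and splits that value. The helper names are hypothetical and the parser only handles plain assignments, not full shell syntax.

```python
def parse_defaults(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def split_zk_ensemble(value: str):
    """Split 'host1:2181,host2:2181/solr' into (hosts, chroot)."""
    hosts_part, slash, chroot = value.partition("/")
    hosts = hosts_part.split(",")
    return hosts, (slash + chroot) if slash else ""


sample = """\
SOLR_PORT=8983
SOLR_ZK_ENSEMBLE=localhost:2181/solr
SOLR_HDFS_HOME=hdfs://localhost:8020/solr
"""
conf = parse_defaults(sample)
print(split_zk_ensemble(conf["SOLR_ZK_ENSEMBLE"]))
```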

Page 38: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

Collections and instancedirs
•  All of these objects reside in Zookeeper
    •  An unfortunate trend we already saw with Lily indexers
•  Collection
    •  a distributed set of Lucene indices
    •  an object defined by Zookeeper configuration
•  Collections require (and can share) configurations in instancedir
•  Bigtop-provided tool: solrctl

$ solrctl [init|instancedir|collection|…]

Page 39: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

Creating a collection

# solrctl init
$ solrctl instancedir --generate /tmp/web_crawl
$ vim /tmp/web_crawl/conf/schema.xml
$ vim /tmp/web_crawl/conf/solrconfig.xml
$ solrctl instancedir --create web_crawl /tmp/web_crawl
$ solrctl collection --create web_crawl -s 1
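Once the collection exists and has documents, the “Google-in-a-box” part is a plain HTTP query against Solr’s select handler. This Python sketch only builds the URL. The helper `select_url` is hypothetical, the host and port defaults follow the localhost and SOLR_PORT=8983 settings shown earlier, and the `text` field name assumes the simple schema from the indexing slide. The query is passed through unescaped, so real user input would need escaping of Lucene query syntax first.

```python
from urllib.parse import urlencode


def select_url(query: str, collection: str = "web_crawl",
               host: str = "localhost", port: int = 8983, rows: int = 10) -> str:
    """Build a URL for Solr's standard /select handler."""
    params = urlencode({"q": f"text:({query})", "wt": "json", "rows": rows})
    return f"http://{host}:{port}/solr/{collection}/select?{params}"


print(select_url("breaking news"))
```

Fetching that URL with any HTTP client should return the matching documents as JSON, assuming the collection is up.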

Page 40: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

Hue
•  Apache licensed, but not an ASF project
•  A nice, flexible UI for the Hadoop bigdata management platform
•  Follows an extensible app model

Page 41: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

Demo time!

Page 42: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

Where to go from here

Page 43: Building Google-in-a-box: using Apache SolrCloud and Bigtop to index your bigdata

Questions?