Webinar: Solr & Spark for Real Time Big Data Analytics


Transcript of Webinar: Solr & Spark for Real Time Big Data Analytics

Page 1: Webinar: Solr & Spark for Real Time Big Data Analytics
Page 2: Webinar: Solr & Spark for Real Time Big Data Analytics


lucenerevolution.org

October 13-16 • Austin, TX

Page 3: Webinar: Solr & Spark for Real Time Big Data Analytics

Solr & Spark

•  Spark Overview

•  Resilient Distributed Dataset (RDD)

•  Solr as a SparkSQL DataSource

•  Interacting with Solr from the Spark shell

•  Indexing from Spark Streaming

•  Lucidworks Fusion and Spark

•  Other goodies (document matching, term vectors, etc.)

Page 4: Webinar: Solr & Spark for Real Time Big Data Analytics

About Me …

•  Solr user since 2010, committer since April 2014, PMC member since ~May 2015; I work for Lucidworks

•  Focus mainly on SolrCloud features … and bin/solr!

•  Release manager for Lucene / Solr 5.1

•  Co-author of Solr in Action

•  Several years' experience working with Hadoop, Pig, Hive, ZooKeeper … working with Spark for about 1 year

•  Other contributions include Solr on YARN, the Solr Scale Toolkit, and the Solr/Storm integration project on GitHub

Page 5: Webinar: Solr & Spark for Real Time Big Data Analytics

About Solr

•  Vibrant, thriving open source community

•  Solr 5.3 just released!

✓  Authentication and authorization framework; plugins for Kerberos and Ranger

✓  ~2x indexing performance w/ replication

    http://lucidworks.com/blog/indexing-performance-solr-5-2-now-twice-fast/

✓  Field cardinality estimation using HyperLogLog

✓  Rule-based replica placement strategy

•  Deploy to a YARN cluster using Slider

Page 6: Webinar: Solr & Spark for Real Time Big Data Analytics

Spark Overview

•  Wealth of overview / getting-started resources on the Web

    →  Start here: https://spark.apache.org/

    →  A must-read: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

•  Faster, more modern alternative to MapReduce

    →  Spark running on Hadoop sorted 100TB in 23 minutes (3x faster than Yahoo's previous record while using 10x less computing power)

•  Unified platform for Big Data

    →  Great for iterative algorithms (PageRank, K-Means, logistic regression) & interactive data mining

•  Write code in Java, Scala, or Python … REPL interface too

•  Runs on YARN (or Mesos), plays well with HDFS

Page 7: Webinar: Solr & Spark for Real Time Big Data Analytics

Spark: Memory-centric Distributed Computation

[Stack diagram]

•  Components: Spark SQL, Spark Streaming, MLlib (machine learning), GraphX (BSP)

•  Execution engine: Spark Core (execution model, the shuffle, caching)

•  Cluster mgmt: Hadoop YARN, Mesos, Standalone

•  Languages: Scala, Java, Python, R

•  Shared memory: Tachyon

•  Storage: HDFS

Page 8: Webinar: Solr & Spark for Real Time Big Data Analytics

Physical Architecture

[Architecture diagram]

•  Spark Master (daemon)
   -  Keeps track of live workers
   -  Web UI on port 8080
   -  Task scheduler; restarts failed tasks
   -  When selecting which node to execute a task on, takes data locality into account
   -  Losing the master prevents new applications from being executed; HA is possible using ZooKeeper and multiple master nodes

•  My Spark App (driver)
   -  Creates a SparkContext: RDD graph, DAG scheduler, block tracker, shuffle tracker
   -  Ships my-spark-job.jar (w/ shaded deps) to the cluster (see the sketch below)

•  Spark Worker Node (1...N of these)
   -  Runs the Spark Slave daemon
   -  Spark Executor (JVM process) runs in a separate process from the slave daemon
   -  Executor runs tasks and manages a cache; each task works on some partition of a data set to apply a transformation or action
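To make the wiring concrete, here is a minimal Java sketch of a driver registering with a standalone master; the app name, master URL, and jar path are assumptions for illustration, not from the talk:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class DriverSketch {
    public static void main(String[] args) {
        // The driver registers with the standalone master (Web UI on port 8080);
        // the master then launches executors for this app on the worker nodes.
        SparkConf conf = new SparkConf()
            .setAppName("MySparkApp")                    // hypothetical app name
            .setMaster("spark://master-host:7077")       // hypothetical standalone master URL
            .setJars(new String[]{"my-spark-job.jar"});  // shaded deps shipped to executors
        JavaSparkContext jsc = new JavaSparkContext(conf);
        // ... define RDDs and run actions here; tasks execute in the executors ...
        jsc.stop();
    }
}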

Page 9: Webinar: Solr & Spark for Real Time Big Data Analytics

Spark vs. Hadoop’s Map/Reduce

Hadoop Map-Reduce:

•  Operators: map, reduce

•  File system: HDFS, S3

•  Fault-tolerance: replicated blocks

•  Ecosystem: Java API, Pig, Hive

Spark:

•  Operators: filter, map, flatMap, join, groupByKey, reduce, sortByKey, count, distinct, top, cogroup, etc.

•  File system: HDFS, S3, Tachyon

•  Fault-tolerance: immutable RDD lineage

•  Ecosystem: Python / Scala / Java API, SparkSQL, GraphX, MLlib, SparkR

Page 10: Webinar: Solr & Spark for Real Time Big Data Analytics

RDD Illustrated: Word count

val file = spark.textFile("hdfs://...")   // file: RDD of lines from HDFS
// e.g. Partition 1: "quick brown fox jumped …"
//      Partition 2: "quick brownie recipe …"
//      Partition 3: "quick drying glue …"

file.flatMap(line => line.split(" "))     // split lines into words: quick, brown, fox, quick, quick, …

    .map(word => (word, 1))               // map words into pairs with a count of 1:
                                          // (quick,1), (brown,1), (fox,1), (quick,1), (quick,1), …

    .reduceByKey(_ + _)                   // send all pairs with the same key to the same reducer and
                                          // sum (shuffle across machine boundaries): (quick,3), …

Executors are assigned based on data locality if possible; narrow transformations occur in the same executor. Spark keeps track of the transformations made to generate each RDD.

The complete program:

val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Page 11: Webinar: Solr & Spark for Real Time Big Data Analytics

Understanding Resilient Distributed Datasets (RDDs)

•  Read-only, partitioned collection of records with fault-tolerance

•  Created from an external system OR by transforming another RDD

•  RDDs track the lineage of coarse-grained transformations (map, join, filter, etc.)

•  If a partition is lost, RDDs can be re-computed by re-playing the transformations

•  User can choose to persist an RDD (for reuse during interactive data mining); see the sketch below

•  User can control the partitioning scheme
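A minimal Java sketch of the persistence point above; the log-filtering example and local master are assumptions for illustration, not from the talk:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class PersistSketch {
    public static void main(String[] args) {
        JavaSparkContext jsc = new JavaSparkContext("local[2]", "persist-sketch");
        JavaRDD<String> lines = jsc.textFile("hdfs://...");  // path elided, as in the slides
        JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR"));
        // Persist so repeated interactive queries reuse the cached partitions instead of
        // re-reading HDFS; if a partition is lost, Spark re-plays the textFile + filter
        // lineage to rebuild just that partition.
        errors.persist(StorageLevel.MEMORY_ONLY());
        long total = errors.count();                                  // first action materializes + caches
        long fatal = errors.filter(l -> l.contains("FATAL")).count(); // served from the cache
        System.out.println(total + " errors, " + fatal + " fatal");
        jsc.stop();
    }
}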

Page 12: Webinar: Solr & Spark for Real Time Big Data Analytics

Solr as a Spark SQL Data Source

•  DataFrame is a DSL for distributed data manipulation

•  Data source provides a DataFrame

•  Uniform way of working with data from multiple sources

•  Hive, JDBC, Solr, Cassandra, etc.

•  Seamless integration with other Spark technologies: SparkR, Python, MLlib …

Map<String, String> options = new HashMap<String, String>();
options.put("zkhost", zkHost);
options.put("collection", "tweets");

DataFrame df = sqlContext.read().format("solr").options(options).load();
count = df.filter(df.col("type_s").equalTo("echo")).count();

Page 13: Webinar: Solr & Spark for Real Time Big Data Analytics

Spark SQL

Query Solr, then expose the results as a SQL table:

Map<String, String> options = new HashMap<String, String>();
options.put("zkhost", zkHost);
options.put("collection", "tweets");

DataFrame df = sqlContext.read().format("solr").options(options).load();
df.registerTempTable("tweets");

sqlContext.sql("SELECT count(*) FROM tweets WHERE type_s='echo'");

Page 14: Webinar: Solr & Spark for Real Time Big Data Analytics

SolrRDD: Reading data from Solr into Spark

•  Can execute any query and expose the results as an RDD

•  SolrRDD produces JavaRDD<SolrDocument>

•  Uses deep paging (cursorMark) if needed; see the sketch below

•  Streams docs from Solr (vs. building lists on the server side)

•  More parallelism using a range filter on a numeric field (_version_),
   e.g. 10 shards x 10 splits per shard == 100 concurrent Spark tasks
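For reference, a minimal SolrJ sketch of the cursorMark deep-paging loop this approach relies on; the zkHost, collection, and sort field are assumptions, and the commented filter query shows where a per-task _version_ range split would go:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorSketch {
    public static void main(String[] args) throws Exception {
        CloudSolrClient solr = new CloudSolrClient("localhost:9983"); // hypothetical zkHost
        solr.setDefaultCollection("gettingstarted");
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(1000);
        q.setSort(SolrQuery.SortClause.asc("id")); // cursorMark needs a stable, unique sort
        // A per-task split would additionally restrict the range, e.g.:
        // q.addFilterQuery("_version_:[" + lower + " TO " + upper + "]");
        String cursorMark = CursorMarkParams.CURSOR_MARK_START;
        boolean done = false;
        while (!done) {
            q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
            QueryResponse rsp = solr.query(q);
            for (SolrDocument doc : rsp.getResults()) {
                // stream each doc into the RDD partition
            }
            String next = rsp.getNextCursorMark();
            done = cursorMark.equals(next); // unchanged mark => result set fully consumed
            cursorMark = next;
        }
        solr.close();
    }
}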

Page 15: Webinar: Solr & Spark for Real Time Big Data Analytics

SolrRDD: Reading data from Solr into Spark

[Diagram] The Spark driver app calls sqlContext.load("solr", opts) with
Map("collection" -> "socialdata", "zkHost" -> "…"), producing a DataFrame
(schema + RDD<Row>). solr.DefaultSource reads the collection metadata
(num shards, replica URLs, etc.) from ZooKeeper, then parallelizes query
execution: one Spark partition (task) per shard of the socialdata collection
(Partition 1 -> Shard 1, Partition 2 -> Shard 2, …).

Table scan query sent to each shard:

q=*:*&rows=1000&distrib=false&cursorMark=*

Results are streamed back from Solr.

Page 16: Webinar: Solr & Spark for Real Time Big Data Analytics

Query Solr from the Spark Shell

Interactive data mining with the full power of Solr queries:

ADD_JARS=$PROJECT_HOME/target/spark-solr-1.0-SNAPSHOT.jar bin/spark-shell

import org.apache.spark.sql.functions.desc

val solrDF = sqlContext.load("solr", Map(
    "zkHost" -> "localhost:9983",
    "collection" -> "gettingstarted")).filter("provider_s='twitter'")

solrDF.registerTempTable("tweets")

sqlContext.sql("SELECT COUNT(type_s) FROM tweets WHERE type_s='echo'").show()

solrDF.filter("type_s='post'").groupBy("author_s").count()
    .orderBy(desc("count")).limit(10).show()

Page 17: Webinar: Solr & Spark for Real Time Big Data Analytics

Spark Streaming: Nuts & Bolts

•  Transforms a stream of records into small, deterministic batches

    ✓  Discretized stream: a sequence of RDDs

    ✓  Once you have an RDD, you can use all the other Spark libs (MLlib, etc.)

    ✓  Low-latency micro batches

    ✓  Time to process a batch must be less than the batch interval time (see the sketch below)

•  Two types of operators:

    ✓  Transformations (group by, join, etc.)

    ✓  Output (send to some external sink, e.g. Solr)

•  Impressive performance!

    ✓  4GB/s (40M records/s) on a 100-node cluster with less than 1 second latency

    ✓  Haven't found any unbiased, reproducible performance comparisons between Storm and Spark
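As a point of reference, a minimal sketch of the batch-interval setup; the app name and local master are assumptions. Each 1-second interval below becomes one RDD in the discretized stream, so per-batch processing must finish within that second:

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("stream-sketch").setMaster("local[2]");
        // 1-second batch interval: the stream is discretized into one RDD per second
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
        // ... create DStreams, apply transformations, wire an output operator ...
        jssc.start();            // begin receiving and processing micro-batches
        jssc.awaitTermination(); // block until stopped
    }
}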

Page 18: Webinar: Solr & Spark for Real Time Big Data Analytics

Spark Streaming Example: Solr as Sink

Data flow: Twitter → Spark receiver → map() → SolrInputDocuments → Solr

Submit the job:

./spark-submit --master MASTER --class com.lucidworks.spark.SparkApp spark-solr-1.0.jar \
    twitter-to-solr -zkHost localhost:2181 -collection social

// Provided by Spark: receive the stream of tweets
JavaReceiverInputDStream<Status> tweets = TwitterUtils.createStream(jssc, null, filters);

// Custom Java / Scala code (class TwitterToSolrStreamProcessor extends SparkApp.StreamProcessor):
// apply various transformations / enrichments to each tweet
// (e.g. sentiment analysis, language detection)
JavaDStream<SolrInputDocument> docs = tweets.map(
    new Function<Status,SolrInputDocument>() {
        // Convert a twitter4j Status object into a SolrInputDocument
        public SolrInputDocument call(Status status) {
            SolrInputDocument doc = new SolrInputDocument();
            …
            return doc;
        }
    });

// Provided by Lucidworks: index the stream into Solr in batches of 100
SolrSupport.indexDStreamOfDocs(zkHost, collection, 100, docs);

Page 19: Webinar: Solr & Spark for Real Time Big Data Analytics

Spark Streaming Example: Solr as Sink

// start receiving a stream of tweets ...
JavaReceiverInputDStream<Status> tweets =
    TwitterUtils.createStream(jssc, null, filters);

// map incoming tweets into SolrInputDocument objects for indexing in Solr
JavaDStream<SolrInputDocument> docs = tweets.map(
    new Function<Status,SolrInputDocument>() {
        public SolrInputDocument call(Status status) {
            SolrInputDocument doc =
                SolrSupport.autoMapToSolrInputDoc("tweet-"+status.getId(), status, null);
            doc.setField("provider_s", "twitter");
            doc.setField("author_s", status.getUser().getScreenName());
            doc.setField("type_s", status.isRetweet() ? "echo" : "post");
            return doc;
        }
    }
);

// when ready, send the docs into a SolrCloud cluster
SolrSupport.indexDStreamOfDocs(zkHost, collection, 100, docs);

Page 20: Webinar: Solr & Spark for Real Time Big Data Analytics

com.lucidworks.spark.SolrSupport

public static void indexDStreamOfDocs(final String zkHost, final String collection,
                                      final int batchSize, JavaDStream<SolrInputDocument> docs) {
    docs.foreachRDD(
        new Function<JavaRDD<SolrInputDocument>, Void>() {
            public Void call(JavaRDD<SolrInputDocument> solrInputDocumentJavaRDD) throws Exception {
                solrInputDocumentJavaRDD.foreachPartition(
                    new VoidFunction<Iterator<SolrInputDocument>>() {
                        public void call(Iterator<SolrInputDocument> solrInputDocumentIterator) throws Exception {
                            // get (or reuse) a SolrServer connected to the SolrCloud cluster
                            final SolrServer solrServer = getSolrServer(zkHost);
                            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
                            while (solrInputDocumentIterator.hasNext()) {
                                batch.add(solrInputDocumentIterator.next());
                                if (batch.size() >= batchSize)
                                    sendBatchToSolr(solrServer, collection, batch);
                            }
                            if (!batch.isEmpty())
                                sendBatchToSolr(solrServer, collection, batch);
                        }
                    }
                );
                return null;
            }
        }
    );
}

Page 21: Webinar: Solr & Spark for Real Time Big Data Analytics

com.lucidworks.spark.ShardPartitioner

•  Custom partitioning scheme for an RDD using Solr's DocRouter

•  Stream docs directly to each shard leader using metadata from ZooKeeper, document shard assignment, and ConcurrentUpdateSolrClient

final ShardPartitioner shardPartitioner = new ShardPartitioner(zkHost, collection);
pairs.partitionBy(shardPartitioner).foreachPartition(
    new VoidFunction<Iterator<Tuple2<String, SolrInputDocument>>>() {
        public void call(Iterator<Tuple2<String, SolrInputDocument>> tupleIter) throws Exception {
            ConcurrentUpdateSolrClient cuss = null;
            while (tupleIter.hasNext()) {
                // ... initialize ConcurrentUpdateSolrClient once per partition
                cuss.add(doc);
            }
        }
    });

Page 22: Webinar: Solr & Spark for Real Time Big Data Analytics

Lucidworks Fusion

Page 23: Webinar: Solr & Spark for Real Time Big Data Analytics

Fusion & Spark

•  Users leave evidence of their needs & experience as they use your app

•  Fusion closes the feedback loop to improve results based on user "signals"

•  Scale out distributed aggregation jobs across a Spark cluster

•  Recommendations at scale

Page 24: Webinar: Solr & Spark for Real Time Big Data Analytics

Extra goodies: Reading Term Vectors from Solr

•  Pull TF/IDF (or just TF) for each term in a field, for each document in the query results from Solr

•  Can be used to construct an RDD<Vector>, which can then be passed to MLlib:

SolrRDD solrRDD = new SolrRDD(zkHost, collection);

JavaRDD<Vector> vectors =
    solrRDD.queryTermVectors(jsc, solrQuery, field, numFeatures);
vectors.cache();

KMeansModel clusters =
    KMeans.train(vectors.rdd(), numClusters, numIterations);

// Evaluate clustering by computing Within Set Sum of Squared Errors
double WSSSE = clusters.computeCost(vectors.rdd());

Page 25: Webinar: Solr & Spark for Real Time Big Data Analytics

Document Matching using Stored Queries

•  For each document, determine which of a large set of stored queries matches

•  Useful for alerts, alternative flow paths through a stream, etc.

•  Index a micro-batch into an embedded (in-memory) Solr instance and then determine which queries match (a minimal sketch follows below)

•  It's a matching framework: you decide where to load the stored queries from and what to do when matches are found

•  Scale it out using Spark … if you need to scale to many queries, check out Luwak
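A minimal sketch of the match-by-indexing idea; the EmbeddedSolrServer setup and the storedQueries map are assumed here, and the real project wraps this pattern behind DocFilterContext:

import java.util.Map;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class MatchSketch {
    // Index a micro-batch into the in-memory core, then probe each stored query.
    static void matchBatch(EmbeddedSolrServer solr,
                           Iterable<SolrInputDocument> microBatch,
                           Map<String, SolrQuery> storedQueries) throws Exception {
        for (SolrInputDocument doc : microBatch) {
            solr.add(doc);
        }
        solr.commit();
        for (Map.Entry<String, SolrQuery> stored : storedQueries.entrySet()) {
            QueryResponse rsp = solr.query(stored.getValue());
            if (rsp.getResults().getNumFound() > 0) {
                // your action here: raise an alert, route docs to another stream, ...
            }
        }
        solr.deleteByQuery("*:*"); // clear the micro-batch before the next one
        solr.commit();
    }
}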

Page 26: Webinar: Solr & Spark for Real Time Big Data Analytics

Document Matching using Stored Queries

[Diagram] Twitter → receiver → map() → SolrSupport.filterDocuments (stored queries) → matched/enriched stream

// Provided by Spark: receive the stream of tweets
JavaReceiverInputDStream<Status> tweets = TwitterUtils.createStream(jssc, null, filters);

// Custom Java / Scala code: convert a twitter4j Status object into a SolrInputDocument
JavaDStream<SolrInputDocument> docs = tweets.map(
    new Function<Status,SolrInputDocument>() {
        public SolrInputDocument call(Status status) {
            SolrInputDocument doc = new SolrInputDocument();
            …
            return doc;
        }
    });

// Provided by Lucidworks: match each micro-batch against the stored queries
JavaDStream<SolrInputDocument> enriched =
    SolrSupport.filterDocuments(docFilterContext, …);

•  Docs are indexed into an EmbeddedSolrServer, initialized from configs stored in ZooKeeper

•  DocFilterContext is the key abstraction that lets you plug in how the stored queries are loaded and what action to take when docs match

Page 27: Webinar: Solr & Spark for Real Time Big Data Analytics

Getting Started

git clone https://github.com/LucidWorks/spark-solr/
cd spark-solr
mvn clean package -DskipTests

See: https://lucidworks.com/blog/solr-spark-sql-datasource/

Page 28: Webinar: Solr & Spark for Real Time Big Data Analytics

Wrap-up and Q & A

Download Fusion: http://lucidworks.com/fusion/download/

Feel free to reach out to me with questions:

[email protected] / @thelabdude