C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

67
DataStax Enterprise 3.x Realtime Analytics with Solr Jason Rutherglen

description

The presentation demonstrates how Solr may be used to create real-time analytics applications. In addition, Datastax Enterprise 3.0 will be showcased, which offers Solr version 4.0 with a number of improvements over the previous DSE release. A realtime financial application will run for the audience, and then a detailed look at how the application was built. An overview of Datastax Enterprise Solr features will be given, and how the many enhancements in DSE make it unique in the marketplace.

Transcript of C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

Page 1: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 1

DataStax Enterprise 3.x Realtime Analytics with Solr Jason Rutherglen

Page 2: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 2

•  Big Data Engineer at DataStax •  Co-author of ‘Programming Hive’

and ‘Introduction to Solr’ from O’Reilly

About the Presenter

Page 3: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 3

•  The company behind Cassandra •  Sells DataStax Enterprise

About DataStax

Page 4: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 4

DataStax Enterprise 3.x

Page 5: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 5

•  Single stack •  Cassandra •  Solr •  Hadoop •  Consulting •  Support

DataStax Enterprise

Page 6: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 6

•  ZDNet article: “The biggest cloud app of all: Netflix”

•  http://zd.net/ZHtrmW •  Built on Cassandra •  “According to Cockcroft, if something

goes wrong, Netflix can continue to run the entire service on two out of three zones”

Cassandra at Netflix

Page 7: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 7

•  Petabytes of growing data •  Hadoop is for batch work •  What are the solutions for realtime?

What is Big Data?

Page 8: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 8

•  Near realtime •  1000 millisecond latency

What is realtime?

Page 9: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 9

•  Cost of scaling to petabytes •  Physical limitations

Why not relational databases?

Page 10: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 10

•  Hadoop for batch •  Solr and Cassandra for realtime •  Gives most of relational capability at

1/10 the cost, scales linearly

Relational to Big Data

Page 11: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 11

•  Distributed database heavy lifting •  Simple dynamo model •  Executes replication tasks extremely

well

Why Cassandra?

Page 12: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 12

•  Cassandra is easier, code is readable •  Fewer moving parts •  Multi-datacenter replication •  Enables low level IO tuning

Cassandra vs. HBase

Page 13: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 13

•  HBase runs on HDFS •  HDFS is not designed for random

access IO •  Multiple hacks / products to perform

random access (MapR, HDFS Jiras)

Cassandra vs. HBase

Page 14: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 14

•  Cassandra is peer to peer, there is no single point of failure (SPOF)

•  The HDFS name node is a single point of failure

Cassandra vs. HBase

Page 15: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 15

•  Most of Facebook runs on MySQL •  Memcache front ends the reads

HBase at Facebook

Page 16: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 16

•  Hive with Hadoop •  A vague dialect of SQL •  Requires Java for UDFs •  Relational Joins

Batch Analytics

Page 17: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 17

•  Solr •  SQL features except relational joins

•  Use Hive for relational joins

•  CEP (Complex Event Processing) •  Storm

Realtime Analytics

Page 18: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 18

•  Storm, computes results on streaming data

CEP (Complex Event Processing)

Page 19: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 19

•  Java inverted indexing library •  Text analytics is raw computation

over linear sets of data •  High speed computation engine

Lucene

Page 20: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 20

•  Terms dictionary points to list document ids (integers)

•  Tokenizes text •  Complete variety of computation on

vectors of data

Inverted Indexes

Page 21: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 21

•  Search server built around Lucene •  Adds faceting, distributed search •  Missed the cloud environment

features of NoSQL systems for many years

Solr

Page 22: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 22

•  Solr Cloud is a Zookeeper based system

•  New and probably not production ready

•  Playing catch up

Solr Cloud

Page 23: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 23

•  High overlap with Solr •  More mature than Solr Cloud •  Less distributed features than

Cassandra

Elastic Search

Page 24: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 24

•  Columns, column families, keyspaces •  Peer to peer •  Eventual consistency •  Implements basic Google BigTable

model

Cassandra Concepts

Page 25: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 25

•  Both implement a log structured merge tree file architecture

Lucene and Cassandra

Page 26: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 26

•  Data is stored in Cassandra •  Data placement controlled by

Cassandra •  Solr is a secondary index (only)

DataStax Enterprise with Solr

Page 27: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 27

•  Separation of church and state, eg, data and index

DataStax Enterprise with Solr

Page 28: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 28

•  Indexing is a CPU intensive task •  Not IO bound because of multi-

threading •  When a thread is flushing, other

threads are indexing, CPU is saturated at all times

Indexing

Page 29: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 29

•  IO bound, index needs to fit in RAM, then CPU bound

•  Lucene enables multithreading queries

•  Solr does not multithread queries

Queries

Page 30: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 30

•  Eventual consistency, each node has it’s own Lucene index

•  Lucene segment files are not replicated (like Solr Cloud and ElasticSearch)

DataStax Enterprise with Solr

Page 31: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 31

•  Query requests are round robin’d across nodes automatically

Distributed Search Architecture

Page 32: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 32

•  3.0.1 is the current release of DataStax Enterprise

DSE 3.0.1

Page 33: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 33

•  Ease of re-indexing •  Re-index the entire cluster or per-

node •  Re-indexing occurs when the Solr

schema changes

New Features in DSE 3.0.1

Page 34: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 34

•  Solr Cloud requires re-indexing from an external data source such as a relational database

New Features in DSE 3.0.1

Page 35: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 35

•  DSE re-indexes directly from Cassandra

•  No custom code is required for re-indexing

New Features in DSE 3.0.1

Page 36: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 36

•  View the heap memory usage of the field caches

•  Perform capacity planning

New Features in DSE 3.0.1

Page 37: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 37

•  Multithreaded re-indexing and repair •  Adding a new Solr node is fast

New Features in DSE 3.0.1

Page 38: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 38

•  Kerberos and SSL security •  Security audit logging

New Features in DSE 3.0.1

Page 39: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 39

•  Near realtime: per-segment filters, facets, multivalue facets

•  Solr 4.3

DataStax Enterprise 3.1

Page 40: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 40

•  vNodes •  Composite keys

DataStax Enterprise 3.1

Page 41: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 41

•  Multi datacenter live Solr schema updates and re-indexing

•  CQL -> Solr queries, makes porting SQL applications easy for SQL developers

Future

Page 42: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 42

Demo of Wikipedia

Page 43: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 43

•  Details about every trade •  Tick data generated real time and is

quantitatively query-able •  Too big to query on in real time? Not

anymore!

Real World Example: Tick Data

Page 44: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 44

•  Computing the moving stock price average in real time

•  Comparing multiple moving averages for different stock_symbols

•  Requires statistical analysis, group by companies, and faceting features

Tick Data - Moving Average

Page 45: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 45

•  Read latest ticks for a given company •  Query ticks for companies in specific

verticals during large events such as press releases

•  Compute deviation of stock data over 5 years for groups of companies

Tick Data Analytics - Ad Hoc Searches

Page 46: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 46

Real Time Stocks Demo

Page 47: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 47

General

Page 48: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 48

•  Like an SQL CREATE TABLE statement

•  Defines field types •  Defines fields

Schema

Page 49: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 49

•  XML based configuration options for Solr

Solr Config

Page 50: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 50

•  Commits new index segment to RAM •  Avoids ‘hard’ commit fsync

Soft Commit

Page 51: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 51

Auto Soft Commit <!-- The default high-performance update handler --> !<updateHandler class="solr.DirectUpdateHandler2”>! <autoSoftCommit> !

<maxTime>1000</maxTime> <!– Near Realtime of 1 second -->! </autoSoftCommit> !</updateHandler>!

Page 52: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 52

•  Loaded for sort and facet queries •  Uses heap space

Field Cache

Page 53: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 53

•  Java based API for interacting with a Solr server

•  DSE supports SolrJ/HTTP with no changes

SolrJ / HTTP

Page 54: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 54

•  Auto data type mapping •  Copy fields •  Dynamic fields

Insert data with CQL

Page 55: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 55

•  Exists however is mainly useful for debugging

•  Limited functionality, queries a single node

CQL with Solr Query

Page 56: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 56

CQL Insert Example INSERT INTO wikipedia (key, text) !VALUES ('1', 'when in rome')!

Page 57: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 57

How to convert applications

Page 58: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 58

•  Common to convert existing SQL applications to Big Data

•  Focus on the application functionality

SQL to Solr

Page 59: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 59

•  Cassandra makes all distributed operations easy

SQL to Solr

Page 60: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 60

•  SELECT * FROM wikipedia WHERE type = ‘pdf’

•  q=type:pdf

SELECT WHERE

Page 61: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 61

•  SELECT title,text FROM wikipedia •  q=*:* •  fl=title,text

SELECT columns

Page 62: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 62

•  SELECT COUNT(*) FROM wikipedia WHERE type = ‘pdf’

•  q=type:pdf •  Get the num found

SELECT COUNT

Page 63: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 63

•  SELECT * FROM stocks ORDER BY price ASC

•  q=*:* •  sort=price asc

SELECT ORDER BY

Page 64: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 64

•  SELECT AVG(price) FROM stocks •  q=*:* •  stats=true •  stats.field=price •  The average is called ‘mean’ in the

Solr results

SELECT AVG

Page 65: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 65

•  SELECT AVG(price) FROM stocks GROUP BY symbol

•  q=*:* •  stats=true •  stats.field=price •  stats.facet=symbol

SELECT AVG GROUP BY

Page 66: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 66

•  SELECT * FROM wikipedia WHERE text LIKE ‘rom%’

•  q=text:rom*

SELECT WHERE LIKE

Page 67: C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Rutherglen

©2012 DataStax 67