No sql and sql - open analytics summit

28
NoSQL and SQL Work Side-by-Side to Tackle Real-time Big Data Needs Allen Day MapR Technologies

Transcript of No sql and sql - open analytics summit

NoSQL and SQL Work Side-by-Side to Tackle Real-time Big Data Needs

Allen DayMapR Technologies

Me

• Allen Day– Principal Data Scientist @ MapR– Human Genomics / Bioinformatics (PhD,

UCLA School of Medicine)

• @allenday

[email protected][email protected]

You

• I’m assuming that the typical attendee:

– is a software developer

– is interested and familiar with open source

– is familiar with Hadoop, relational DBs

– has heard of or has used some NoSQL technology

Big Data Workloads

• Offline– ETL– Model creation & clustering & indexing– Web Crawling– Batch reporting

• Online– Lightweight OLTP– Classification & anomaly detection– Stream processing– Interactive reporting

SQL

What is NoSQL? Why use it?

• Traditional storage (relational DBs) are unable to accommodate increasing # and variety of observations– Culprits: sensors, event logs, electronic payments

• Solution: stay responsive by relaxing ACID storage requirements– Denormalize (#)– Loosen schema (variety), loosen consistency

• This is the essence of NoSQL

NoSQL Impact on Business Processes

• Traditional business intelligence (BI) tech stack assumes relational DB storage– Company decisions depend on this (reports, charts)

• NoSQL collected data aren’t in relational DB– Data volume/variety is still increasing– Tech and methods are still in flux

• Decoupled data storage and decision support systems– BI can’t access freshest, largest data sets– Very high opportunity cost to business

Ideal Solution Features

• Scalable & Reliable– Distributed replicated storage– Distributed parallel processing

• BI application support– Ad-hoc, interactive queries– Real-time responsiveness

• Flexible– Handles rapid storage and schema evolution– Handles new analytics methods and functions

Hadoop FSMap/Reduce, YARN{

SQL Interface{Extensible for NoSQL,Advanced Analytics{

From Ideals to Possibilities

• Migrate NoSQL data/processing to SQL– High cost to marshal NoSQL data to SQL storage– SQL systems lack advanced analytics capabilities

• Migrate SQL data to NoSQL– Breaks compatibility for BI-dependent functions, e.g.

financial reporting– Limited support for relational operations (joins)

• high latency

– NoSQL tech is still in flux (continuity)• Other Approaches?– Yes. First let’s consider a SQL/NoSQL use case

Impala

Interactive Queries & Hadoop

low-latency

Example Problem: Marketing Campaign

• Jane is an analyst at an e-commerce company

• How does she figure out good targeting segments for the next marketing campaign?

• She has some ideas… …and lots of data

User profiles

Transaction information

Access logs

Traditional System Solution 1: RDBMS

• ETL the data from MongoDB and Hadoop into the RDBMS– MongoDB data must be

flattened, schematized, filtered and aggregated

– Hadoop data must be filtered and aggregated

• Query the data using any SQL-based tool

User profiles

Access logs

Transaction information

Traditional System Solution 2: Hadoop

• ETL the data from Oracle and MongoDB into Hadoop– MongoDB data must be

flattened and schematized

• Work with the MapReduce team to write custom code to generate the desired analyses

User profiles

Access logs

Transaction information

Traditional System Solution 3: Hive

• ETL the data from Oracle and MongoDB into Hadoop– MongoDB data must be

flattened and schematized

• But HiveQL queries are slow and BI tool support is limited– Marshaling/Coding

User profiles

Access logs

Transaction information

What Would Google Do?

Distributed File System NoSQL Interactive

analysisBatch

processing

GFS BigTable Dremel MapReduce

HDFS HBase ??? Hadoop MapReduce

Build Apache Drill to provide a true open source solution to interactive analysis of Big Data

Apache Drill Overview

• Interactive analysis of Big Data using standard SQL• Fast

– Low latency queries– Complement native interfaces and

MapReduce/Hive/Pig• Open

– Community driven open source project– Under Apache Software Foundation

• Modern– Standard ANSI SQL:2003 (select/into)– Nested data support– Schema is optional– Supports RDBMS, Hadoop and NoSQL

Interactive queriesData analystReporting100 ms-20 min

Data miningModelingLarge ETL20 min-20 hr

MapReduceHive

Pig

Apache Drill

How Does It Work?SELECT * FROM oracle.transactions, mongo.users, hdfs.eventsLIMIT 1

How Does It Work?

• Drillbits run on each node, designed to maximize data locality

• Processing is done outside MapReduce paradigm (but possibly within YARN)

• Queries can be fed to any Drillbit• Coordination, query planning, optimization,

scheduling, and execution are distributed

SELECT * FROM oracle.transactions, mongo.users, hdfs.eventsLIMIT 1

Apache Drill: Key Features

• Full ANSI SQL:2003 support– Use any SQL-based tool

• Nested data support– Flattening is error-prone and often impossible

• Schema-less data source support– Schema can change rapidly and may be record-specific

• Extensible– DSLs, UDFs– Custom operators (e.g. k-means clustering)– Well-documented data source & file format APIs

How Does Impala Fit In?Impala Strengths• Beta currently available• Easy install and setup on top of

Cloudera• Faster than Hive on some queries• SQL-like query language

Questions• Open Source ‘Lite’• Lacks RDBMS support• Lacks NoSQL support beyond

HBase• Early row materialization

increases footprint and reduces performance

• Limited file format support• Query results must fit in memory!• Rigid schema is required• No support for nested data• SQL-like (not SQL)

Many important features are “coming soon”. Architectural foundation is constrained. No community development.

Drill Status: Alpha Available July• Heavy active development by multiple organizations

– Contributors from Oracle, IBM Netezza, Informatica, Clustrix, Pentaho

• Available– Logical plan syntax and interpreter– Reference interpreter

• In progress– SQL interpreter– Storage engine implementations for Accumulo, Cassandra, HBase and

various file formats

• Significant community momentum– Over 200 people on the Drill mailing list– Over 200 members of the Bay Area Drill User Group– Drill meetups across the US and Europe

• Beta: Q3

Why Apache Drill Will Be SuccessfulResources• Contributors have

strong backgrounds from companies like Oracle, IBM Netezza, Informatica, Clustrix and Pentaho

Community• Development done in

the open• Active contributors

from multiple companies

• Rapidly growing

Architecture• Full SQL• New data support• Extensible APIs• Full Columnar

Execution• Beyond Hadoop

Bottom Line: Apache Drill enables NoSQL and SQL Work Side-by-Side to Tackle Real-time Big Data Needs

Me

• Allen Day– Principal Data Scientist @ MapR

• @allenday

[email protected][email protected]

ADDITIONAL SLIDES

Full SQL (ANSI SQL:2003)

• Drill supports SQL (ANSI SQL:2003 standard)– Correlated subqueries, analytic functions, …– SQL-like is not enough

• Use any SQL-based tool with Apache Drill– Tableau, Microstrategy, Excel, SAP Crystal Reports, Toad, SQuirreL, …– Standard ODBC and JDBC drivers

Nested Data• Nested data is becoming prevalent

– JSON, BSON, XML, Protocol Buffers, Avro, etc.– The data source may or may not be aware

• MongoDB supports nested data natively• A single HBase value could be a JSON document

(compound nested type)– Google Dremel’s innovation was efficient columnar

storage and querying of nested data

• Flattening nested data is error-prone and often impossible– Think about repeated and optional fields at every

level…

• Apache Drill supports nested data– Extensions to ANSI SQL:2003

enum Gender { MALE, FEMALE}

record User { string name; Gender gender; long followers;}

{ "name": "Homer", "gender": "Male", "followers": 100 children: [ {name: "Bart"}, {name: "Lisa”} ]}

JSON

Avro

Schema is Optional

• Many data sources do not have rigid schemas– Schemas change rapidly– Each record may have a different schema, may be sparse/wide

• Apache Drill supports querying against unknown schemas– Query any HBase, Cassandra or MongoDB table

• User can define the schema or let the system discover it automatically– System of record may already have schema information– No need to manage schema evolution

Row Key CF contents CF anchor"com.cnn.www" contents:html = "<html>…" anchor:my.look.ca = "CNN.com"

anchor:cnnsi.com = "CNN""com.foxnews.www" contents:html = "<html>…" anchor:en.wikipedia.org = "Fox News"

… … …

Flexible and Extensible Architecture

• Apache Drill is designed for extensibility• Well-documented APIs and interfaces• Data sources and file formats

– Implement a custom scanner to support a new source/format

• Query languages– SQL:2003 is the primary language– Implement a custom Parser to support a Domain Specific Language– UDFs

• Optimizers– Drill will have a cost-based optimizer– Clear surrounding APIs support easy optimizer exploration

• Operators– Custom operators can be implemented (e.g. k-Means clustering)– Operator push-down to data source (RDBMS)