Integrate Hue with your Hadoop cluster - Yahoo! Hadoop Meetup
Jethro data meetup index base sql on hadoop - oct-2014
-
Upload
eli-singer -
Category
Software
-
view
835 -
download
3
description
Transcript of Jethro data meetup index base sql on hadoop - oct-2014
![Page 1: Jethro data meetup index base sql on hadoop - oct-2014](https://reader034.fdocuments.net/reader034/viewer/2022052601/559326d31a28abfb1f8b46d9/html5/thumbnails/1.jpg)
JethroData Indexed Based SQL-on-Hadoop - An Architectural Comparison of Tools
Simpler. Faster. Cheaper.
![Page 2: Jethro data meetup index base sql on hadoop - oct-2014](https://reader034.fdocuments.net/reader034/viewer/2022052601/559326d31a28abfb1f8b46d9/html5/thumbnails/2.jpg)
About the presenter
• Over 25 years experience in software design & mgmt• Expertise in database architecture, information
retrieval and search technologies• Led numerous information retrieval projects for
various Israeli intelligence agencies as well as for commercial companies
• Started JethroData in 2010 with the idea of integrating database and search technologies to accelerate big data analytics
• Bachelor's degree in Computer Science and Philosophy from the Tel-Aviv University
Boaz Raufman – Co-Founder / CTO
![Page 3: Jethro data meetup index base sql on hadoop - oct-2014](https://reader034.fdocuments.net/reader034/viewer/2022052601/559326d31a28abfb1f8b46d9/html5/thumbnails/3.jpg)
SQL-on-Hadoop
Hadoop uses the same parallel design pattern as the parallel databases from last decade
Frameworks
• MapReduce• Tez• Spark
Reborn on Hadoop
• Pivotal HAWQ
• IBM BigSQL• Teradata
Aster• Actian
New Comers
• Hive• Impala• Presto• Tajo• Drill• Spark SQL
![Page 4: Jethro data meetup index base sql on hadoop - oct-2014](https://reader034.fdocuments.net/reader034/viewer/2022052601/559326d31a28abfb1f8b46d9/html5/thumbnails/4.jpg)
Data Node
Full-Scan Execution
Client: SELECT day, sum(sales) FROM t1 WHERE prod=‘abc’ GROUP BY day
Data Node
Data Node
Data Node
Data Node
Query Executor
Query Executor
Query Executor
Query Executor
Query Executor
Query Planner
/Mgr
Query Planner/
Mgr
Query Planner/
Mgr
Query Planner/
Mgr
Query Planner/
Mgr
Performance and resources based on the size of the dataset
![Page 5: Jethro data meetup index base sql on hadoop - oct-2014](https://reader034.fdocuments.net/reader034/viewer/2022052601/559326d31a28abfb1f8b46d9/html5/thumbnails/5.jpg)
Shared-Nothing MPP Design Principles
Parallel Processing• Divide the work across
many nodes• Try to minimize inter-
node communication• Work should be evenly
distributed
Full Data Scanning• Full sequential scan -
massive I/O• Data locality and local
processing• Minimize amount of data
being read– Columnar data store– Partition by specific key– Block stats
Performance and resource requirement based on the dataset size
![Page 6: Jethro data meetup index base sql on hadoop - oct-2014](https://reader034.fdocuments.net/reader034/viewer/2022052601/559326d31a28abfb1f8b46d9/html5/thumbnails/6.jpg)
MPP Complex queries processing
LocalAggregation
GlobalAggregation
(join, distinct,group by, order by,
sub-query)
Result Merge
Example:SELECTDAY,COUNT( DISTINCT ITEM)FROM T1WHEREPRODUCT=‘abc’GROUP BY DAY
![Page 7: Jethro data meetup index base sql on hadoop - oct-2014](https://reader034.fdocuments.net/reader034/viewer/2022052601/559326d31a28abfb1f8b46d9/html5/thumbnails/7.jpg)
Index based SQL-on-Hadoop
![Page 8: Jethro data meetup index base sql on hadoop - oct-2014](https://reader034.fdocuments.net/reader034/viewer/2022052601/559326d31a28abfb1f8b46d9/html5/thumbnails/8.jpg)
Data Node
The Index-Access Design
Data Node
Data Node
Data Node
Data Node
Jethro Query Node
Query Node
Client: SELECT day, sum(sales) FROM t1 WHERE prod=‘abc’ GROUP BY day
1. Index Access 2. Read data only for require rows
Performance and resources based on the size of the result-set
![Page 9: Jethro data meetup index base sql on hadoop - oct-2014](https://reader034.fdocuments.net/reader034/viewer/2022052601/559326d31a28abfb1f8b46d9/html5/thumbnails/9.jpg)
Index-Based Design
• Surgical scan – minimum I/O• Performance and required resources based
on the result set size• Extremely efficient for Interactive SQL use
cases• Pay at load time
![Page 10: Jethro data meetup index base sql on hadoop - oct-2014](https://reader034.fdocuments.net/reader034/viewer/2022052601/559326d31a28abfb1f8b46d9/html5/thumbnails/10.jpg)
Architecture – Contrarian Concepts
• Index everything– Every column is indexed
• Colum oriented– Columnar or row-groups– Append only data model
• Everything is stored in HDFS– Can also work with S3 or Posix
• Shared Everything– Separate compute and storage, each
scales-out independently– Minimize cross-node operations– Stateless Query nodes
• Parallelized multi-threaded execution– multiple parallelization dimensions:
columns, row ranges, partitions, pipelining and bucketing
JethroNode
Jethro Node
Storage Layer
Processing Layer
HDFS/Posix/S3
Client
![Page 11: Jethro data meetup index base sql on hadoop - oct-2014](https://reader034.fdocuments.net/reader034/viewer/2022052601/559326d31a28abfb1f8b46d9/html5/thumbnails/11.jpg)
Jethro Indexes
Indexes map each column value to a list of rows
Jethro stores indexes as hierarchical compressed bitmaps Very fast query operations – AND / OR / NOT Processed the entire WHERE clause to a final list of rows Patent pending: http://
www.google.com/patents/WO2013001535A3?cl=en INSERT Performance
– Load is very fast: files are appended, no random read/write, no locks– Jethro Indexes are append-only. If needed, duplicate entries are allowed– Periodic background merge (non-blocking)– Compatible with HDFS
Value Rows
FR rows 5,9,10,11,14
IL rows 1,3,7,12,13
US rows 2,4,6,8,15
![Page 12: Jethro data meetup index base sql on hadoop - oct-2014](https://reader034.fdocuments.net/reader034/viewer/2022052601/559326d31a28abfb1f8b46d9/html5/thumbnails/12.jpg)
Built in optimizations
• Code is written in C++• Column store and true column processing• Vectorization for expression evaluation• Multi-threaded and parallelized execution• Planner using indexes meta data - index-based
queries• Server-side cache in memory and local disk
![Page 13: Jethro data meetup index base sql on hadoop - oct-2014](https://reader034.fdocuments.net/reader034/viewer/2022052601/559326d31a28abfb1f8b46d9/html5/thumbnails/13.jpg)
Full-Scan: Performance depends on size of datasetIndex-Access: Performance depends on size of result-set
Use-Case Analysis
![Page 14: Jethro data meetup index base sql on hadoop - oct-2014](https://reader034.fdocuments.net/reader034/viewer/2022052601/559326d31a28abfb1f8b46d9/html5/thumbnails/14.jpg)
Comparing Recent Benchmarks – Jun 2014
Impala Parquet Vs. Hive/Tez, Presto, Shark
Source
Jethro Vs. Impala/Parquet
Source
Impala
Using the same queries in Jun-2014 Impala benchmarks, we compared Impala with Jethro (TPC-DS, SF 1,000)
![Page 15: Jethro data meetup index base sql on hadoop - oct-2014](https://reader034.fdocuments.net/reader034/viewer/2022052601/559326d31a28abfb1f8b46d9/html5/thumbnails/15.jpg)
Benchmark – Jethro vs. Impala – Oct 2014
q19 q42 q52 q55 q63 q68 q73 q980
20
40
60
80
100
120
140
160
180
200
103
39.8 39.4 39.9
73.4
188.4
84.3 85.2
6.4 5 4.9 412.3 11.7 10.3 4.4
TPC-DS Interactive queriesOct 2014
Impala 1.4.2 Jethro 0.9
* Queries use original TPC-DS filter criteria
*
![Page 16: Jethro data meetup index base sql on hadoop - oct-2014](https://reader034.fdocuments.net/reader034/viewer/2022052601/559326d31a28abfb1f8b46d9/html5/thumbnails/16.jpg)
DEMO
Go to Tableau demo
1. Point browser at: http://54.245.114.83/ 2. Login as try-jethro/jethro1233. Edit workbooks:
1. Jethro: Jethro sd – save2. Impala: Impala sd save
![Page 17: Jethro data meetup index base sql on hadoop - oct-2014](https://reader034.fdocuments.net/reader034/viewer/2022052601/559326d31a28abfb1f8b46d9/html5/thumbnails/17.jpg)
Jethro Query Node
Side-By-Side Implementation
MapReduce / Impala
Jethro Indexer
DataStream
Jethro Query Nodes
▪▪
▪▪▪
▪▪ ▪ ▪▪▪
▪▪ ▪▪▪▪▪
▪
▪▪
Hive / Pig
▪▪▪▪▪
▪▪
▪▪▪
BI Tools
SQL
▪
Existing Hadoop tables are untouched
Indexes are added to select
tables. ~30% incremental
storage
![Page 18: Jethro data meetup index base sql on hadoop - oct-2014](https://reader034.fdocuments.net/reader034/viewer/2022052601/559326d31a28abfb1f8b46d9/html5/thumbnails/18.jpg)
1. Installing JethroData
• Existing Hadoop cluster– CDH 4.x, CDH 5.x, HDP 2.x, EMR 3.x
• Designated Jethro server– Can be inside or outside the cluster– HW: CPU: 16+ cores, Mem: 64GB+, Net: 1GB / 10GB, SSD for
cache• Install Jethro – download package
– rpm install– Install HDFS client (if needed) – Create /jethro dir in HDFS
• Start Jethro– service jethro start
![Page 19: Jethro data meetup index base sql on hadoop - oct-2014](https://reader034.fdocuments.net/reader034/viewer/2022052601/559326d31a28abfb1f8b46d9/html5/thumbnails/19.jpg)
2. Load Data into Jethro
• Run create instance script– JethroAdmin create-instance demo /Jethro/demo
• Define a new table– JethroClient demo localhost 9111
• Create table sales_demo (…);
• Run JethroLoader process– JethroLoader demo sales_demo.desc
sales_demo.csv &• Start Querying – ODBC, JDBC, JethroClient
That’s it!
![Page 20: Jethro data meetup index base sql on hadoop - oct-2014](https://reader034.fdocuments.net/reader034/viewer/2022052601/559326d31a28abfb1f8b46d9/html5/thumbnails/20.jpg)
Road Map
• Jethro S3• Analytic functions• UDF• Cascading optimizer• Function indexes• Light weight text search• Rows group format (Parquet/ORC)• Integration with YARN for resource management• Sync with Hive Metastore/HCatalog• Nested data• Materialized views• Distributes query
![Page 21: Jethro data meetup index base sql on hadoop - oct-2014](https://reader034.fdocuments.net/reader034/viewer/2022052601/559326d31a28abfb1f8b46d9/html5/thumbnails/21.jpg)
Functional Indexes
Solution• Function index created for commonly
used functions• Some function indexes are automatically
created for specific data types. Example:year function for timestamp
• Query optimizer will identify scenarios where functional indexes should be used
• Function index can also be user defined or created on the fly via adaptive optimization
Base Index Function YAER index
Value Rows Value Rows
02/04/2007 5,9,10,11,14 20071,2,3,4,5,6,7,
8,9,10,11,12,
13,14,15
03/05/2007 1,3,7,12,13
10/10/2007 2,4,6,8,15
01/02/2008 15,18 2008 15,16,17,18
05/03/2008 16,17
Query uses function index examples:year(c1)=2007 explicit use for year indexC1 between 2007-01-01 and 2007-12-31 implicit use for year indexC1 between 2007-01-01 and 2008-02-15 Mix: take year index for 2007 and rest from base index
ProblemHow to accelerate this query:Select count(*) from T where year(birthdate)=2007;
![Page 22: Jethro data meetup index base sql on hadoop - oct-2014](https://reader034.fdocuments.net/reader034/viewer/2022052601/559326d31a28abfb1f8b46d9/html5/thumbnails/22.jpg)
Jethro’s Benefits
Simple to use
• Implemented side by side with existing Hadoop system
• Access via SQL or your favorite BI tool
• Integrates with Hadoop eco system
10X Faster queries
• Interactive analysis with Sub second latency
• Access to data as it arrives
• Analyze granular, raw data
50% Cheaper to operate
• Significantly less computing resources
• No dual systems, costly ETL
• Elastically scalable on commodity hardware
![Page 23: Jethro data meetup index base sql on hadoop - oct-2014](https://reader034.fdocuments.net/reader034/viewer/2022052601/559326d31a28abfb1f8b46d9/html5/thumbnails/23.jpg)
• Point browser at: http://www.jethrodata.com/home
• Click • Register:
Try it Today
![Page 24: Jethro data meetup index base sql on hadoop - oct-2014](https://reader034.fdocuments.net/reader034/viewer/2022052601/559326d31a28abfb1f8b46d9/html5/thumbnails/24.jpg)
Jethro – Big Data Analytics. Real-Time.
Thank You!