Trivadis TechEvent 2017 Querying distributed data with SQL and Apache Drill by Jonatan Kazmierczak

29
BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENF HAMBURG KOPENHAGEN LAUSANNE MÜNCHEN STUTTGART WIEN ZÜRICH Querying distributed data Querying distributed data (CSV, JSON, MongoDB) (CSV, JSON, MongoDB) with SQL and Apache Drill with SQL and Apache Drill Jonatan Kazmierczak Jonatan Kazmierczak

Transcript of Trivadis TechEvent 2017 Querying distributed data with SQL and Apache Drill by Jonatan Kazmierczak

BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENF HAMBURG KOPENHAGEN LAUSANNE MÜNCHEN STUTTGART WIEN ZÜRICH

Querying distributed data Querying distributed data (CSV, JSON, MongoDB) (CSV, JSON, MongoDB) with SQL and Apache Drillwith SQL and Apache Drill

Jonatan KazmierczakJonatan Kazmierczak

-- SQL and Drill -- Jonatan-Kazmierczak -- TechEvent 2017 --

Why is it important?

Apache Drill allows to query, explore, transform and expose distributed data stored in various formatsusing SQL standardwithout involving expensive and complex tools, processes and infrastructure.

-- SQL and Drill -- Jonatan-Kazmierczak -- TechEvent 2017 --

About author

senior consultant at Trivadis creator of Class Visualizer working with Java+SQL for 20 years top rated participant in contests

in programming and data science:HackerRank, TopCoder, Code Jam

conference speaker fan of Atari XL/XE demos

-- SQL and Drill -- Jonatan-Kazmierczak -- TechEvent 2017 --

About author – cont.www.hackerrank.com/jonatan_k

1st rank in Java 1st rank in JavaScript Top 1% in functional programming in Scala Top 1% in SQL Medalist in algorithmic contests

-- SQL and Drill -- Jonatan-Kazmierczak -- TechEvent 2017 --

Agenda

Introduction Demo: starting with Drill Technical details Demo: deep dive into Drill Summary Q & A

Introduction

-- SQL and Drill -- Jonatan-Kazmierczak -- TechEvent 2017 --

Computers – beforewww.amibay.com/showthread.php?71410-Atari-65XE-BOX-XC12-BOX-2-Quickshots

-- SQL and Drill -- Jonatan-Kazmierczak -- TechEvent 2017 --

Data – before

-- SQL and Drill -- Jonatan-Kazmierczak -- TechEvent 2017 --

Data – now

-- SQL and Drill -- Jonatan-Kazmierczak -- TechEvent 2017 --

Computers – now

32GB RAM3TB RAM

-- SQL and Drill -- Jonatan-Kazmierczak -- TechEvent 2017 --

What is Apache Drill ?

low latency distributed schema-free SQL query engine for large-scale datasets

designed to scale to several thousands of nodes and query petabytes of data at the speeds required by BI/Analytics environments

Demo: starting with Drill

Technical details

-- SQL and Drill -- Jonatan-Kazmierczak -- TechEvent 2017 --

Basic info

Website drill.apache.org

Current version 1.11.0

Query language SQL:2003

Interfaces shell, web console, JDBC/ODBC, REST API, Java API, C++ API

-- SQL and Drill -- Jonatan-Kazmierczak -- TechEvent 2017 --

Supported data sources and formats

RDBMS FSNoSQL

-- SQL and Drill -- Jonatan-Kazmierczak -- TechEvent 2017 --

Features

Dynamic schema discovery Flexible data model In-memory data processing (whenever possible) Extensible architecture Distributed and embedded mode

-- SQL and Drill -- Jonatan-Kazmierczak -- TechEvent 2017 --

Distributed setup

Node 1

Node 2

Node 3

ZooKeeper Drillbit

Drillbit

Drillbit

Client

-- SQL and Drill -- Jonatan-Kazmierczak -- TechEvent 2017 --

Sample query

select * from dfs.demo.`countries.csv`

storage plugin

workspace

table / view / file / document

-- SQL and Drill -- Jonatan-Kazmierczak -- TechEvent 2017 --

Config – storage plugins

-- SQL and Drill -- Jonatan-Kazmierczak -- TechEvent 2017 --

Query execution

Drillbit

Client

Client

Foreman

Foreman

SQL Parser

SQL Parser

Optimizer

Optimizer

Parallelizer

Parallelizer

Executor

Executor

Storage Plugin

Storage Plugin

SQL query

parse SQL query

logical plan

optimize logical plan

physical plan

parallelize physical plan

execution tree (fragments)

execute fragments

fetch data

data

results

results

-- SQL and Drill -- Jonatan-Kazmierczak -- TechEvent 2017 --

Drill insideover 0x2000 classes

Demo: deep dive into Drill

Summary

-- SQL and Drill -- Jonatan-Kazmierczak -- TechEvent 2017 --

Advantages

Easy to start working with

Concept of SQL-on-Anything

Using standard SQL

-- SQL and Drill -- Jonatan-Kazmierczak -- TechEvent 2017 --

Disadvantages

Partially implemented or unfinished features

Lacks in documentation

-- SQL and Drill -- Jonatan-Kazmierczak -- TechEvent 2017 --

Use cases

Data exploration

Data transformation

BI / Data analytics

Questions

-- SQL and Drill -- Jonatan-Kazmierczak -- TechEvent 2017 --

Session Feedback – now

Please use the Trivadis Events mobile app to give feedback on each session

Use "My schedule" if you have registered for a session;

Otherwise use "Agenda" and the search function

If the mobile app does not work, use the web browser URL: http://trivadis.quickmobileplatform.eu/ User name: <your_loginname> (such as "svv") Password: sent by e-mail...

Thank you!Thank you!Jonatan KazmierczakJonatan Kazmierczak