Evolving from RDBMS to NoSQL + SQL

Post on 16-Apr-2017

Transcript of Evolving from RDBMS to NoSQL + SQL

© 2016 MapR Technologies

Evolving from RDBMS to NoSQL + SQL

Jim Scott, @kingmesal, #strataconf

Why Does this Matter

• 90%+ of the use cases do not deal with “relational” data
• RDBMS data models are more complex than a single table
  – One-to-many relationships require multiple tables
  – Creating code to persist data takes time and QA
• Inferred (or removed) keys used without actual foreign keys
  – Difficult for others to understand relationships
• Transactional tables never look the same as analytics tables
  – OLTP -> ETL -> OLAP
  – This takes significant time to build

Topics

• Changing Data Models
  – Relational Model to JSON Model
• A New Database for JSON Data
  – Document Database (OJAI)
• Querying JSON Data and More
  – Drill
• Resources

Empowering “as it happens” businesses by speeding up the data-to-action cycle

Changing Data Models

180 Tables NOT SHOWN!

236 tables to describe 7 kinds of things

Searching for Elvis

// Find discs where Elvis was credited
> SELECT DISTINCT album_id, name
  FROM (SELECT id album_id, artist_id, name, FLATTEN(credit)
        FROM release) albums
  JOIN (SELECT DISTINCT artist_id
        FROM (SELECT id artist_id, FLATTEN(alias)
              FROM artist
              WHERE name LIKE 'Elvis%Presley')
       ) artists
  USING (artist_id);

Benefits

• Extended relational model allows massive simplification
  – On a real example, we see >20x reduction in number of tables
• Simplification drives improved introspection
  – This is good
• Apache Drill gives very high performance execution for extended relational problems
• You can try this out today

A New Database for JSON Data

Basics of the API

• http://ojai.github.io/
• Entry point to a table: DocumentStore
  – insert()
  – insertOrReplace()
  – find()
  – delete()
  – replace()
  – update()
  – increment()

Working with JSON in Java

• Step 1 – Create instance of JSON serializer
    Gson gson = new Gson();
• Step 2 – Serialize POJO to JSON
    String json = gson.toJson(myObject);
• Step 3 – Deserialize JSON into POJO
    MyObject myObject = gson.fromJson(json, MyObject.class);

Creating Documents in Java OJAI

• Use static methods on class org.ojai.json.Json
    Document doc = Json.newDocument(myObject);
    Document doc = Json.newDocument(jsonString);
• Alternatively
  – Use builders
  – Stream from disk
  – Use InputStream

Creating New Documents

• DocumentStore.insert(doc)

Done!

• DocumentStore.insertOrReplace(doc)

Done!

Easy right?

Updating Existing Documents

• DocumentStore.update(_id, DocumentMutation)
• Mutation methods
  – mutation.append(FieldPath, "user visited URL");
  – mutation.set("field.name", "What a great example");
  – mutation.increment("field", 1);
  – mutation.merge("field", Map<String, Object>);
  – mutation.setOrReplace(…);
  – mutation.delete(field);

Yes, these are atomic.

Deleting Documents

• DocumentStore.delete(doc);

Done!

• DocumentStore.delete(_id);

Done!

This is easy too, right?

Finding Documents

• DocumentStore.find(QueryCondition);
• Query condition setup:
    qc.is("field", EQUAL, "blue")
      .and().notExists("other.field")
      .or().like("field", "%purple")
      .or().matches("another.field", "regular expression")

Querying JSON Data and More

How to Bring SQL to Non-Relational Data Stores?

Familiarity of SQL
• ANSI SQL semantics
• BI (Tableau, MicroStrategy, etc.)
• Low latency

Agility of NoSQL
• No schema management
  – HDFS (Parquet, JSON, etc.)
  – HBase
  – …
• No transformation
  – No silos of data
• Ease of use

Drill Supports Schema Discovery On-The-Fly

Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ)
• Fixed schema
• Leverage schema in centralized repository (Hive Metastore)

Schema Discovered On-The-Fly (SCHEMA ON THE FLY)
• Fixed schema, evolving schema or schema-less
• Leverage schema in centralized repository or self-describing data

Drill’s Data Model is Flexible

[Figure: data formats (JSON, BSON, HBase, Parquet, Avro, CSV, TSV) plotted on axes labeled Flexibility, running from fixed schema to dynamic schema and from flat to complex.]

RDBMS/SQL-on-Hadoop table

| Name     | Gender | Age |
| Michael  | M      | 6   |
| Jennifer | F      | 3   |

Apache Drill table

{ name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }
{ name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }

Enabling “As-It-Happens” Business with Instant Analytics

Traditional approach
Hadoop data -> Data modeling -> Transformation -> Data movement (optional) -> Users
Total time to insight: weeks to months

Exploratory approach
Hadoop data -> Users
Total time to insight: minutes

[Figure annotations: New business questions; Source data evolution]

Evolution Towards Self-Service Data Exploration

|                                  | Traditional BI w/ RDBMS | Self-Service BI w/ RDBMS | SQL-on-Hadoop | Self-Service Data Exploration |
| Data Modeling and Transformation | IT-driven               | IT-driven                | IT-driven     | Optional                      |
| Data Visualization               | IT-driven               | Self-service             | Self-service  | Self-service                  |

Zero-day analytics

Common Use Cases

• Raw Data Exploration
• JSON Analytics
• DWH offload

[Figure: data sources shown include Hive, HBase, Files, Directories, …, {JSON}, Parquet, Text Files, …]

Drill Enables ‘SQL-on-Everything’

SELECT * FROM dfs.yelp.`business.json`

• dfs: storage plugin instance
  – DFS (Text, Parquet, JSON)
  – HBase/MapR-DB
  – Hive Metastore/HCatalog
  – Easy API to go beyond Hadoop
• yelp: workspace
  – Sub-directory
  – HBase namespace
  – Hive database
• `business.json`: table
  – Pathnames
  – Hive table
  – HBase table

Reuse Existing SQL Tools and Skills

Leverage SQL-compatible tools (BI, query builders, etc.) via Drill’s standard ODBC, JDBC and ANSI SQL support

Enable business analysts, technical analysts and data scientists to explore and analyze large volumes of real-time data

Security Controls

Access Controls that Scale

PAM Authentication + User Impersonation

Fine-grained row and column level access control with Drill Views – no centralized security repository required

[Figure: users (U) query Drill View 1 and Drill View 2; the data sources shown are Files, HBase and Hive.]

Granular Security via Drill Views

Raw File (/raw/cards.csv)
Owner: Admins
Permission: Admins

| Name | City     | State | Credit Card #       |
| Dave | San Jose | CA    | 1374-7914-3865-4817 |
| John | Boulder  | CO    | 1374-9735-1794-9711 |

Data Scientist View (/views/maskedcards.csv)
Owner: Admins
Permission: Data Scientists
Not a physical data copy

| Name | City     | State | Credit Card #       |
| Dave | San Jose | CA    | 1374-1111-1111-1111 |
| John | Boulder  | CO    | 1374-1111-1111-1111 |

Business Analyst View
Owner: Admins
Permission: Business Analysts

| Name | City     | State |
| Dave | San Jose | CA    |
| John | Boulder  | CO    |
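To make the "not a physical data copy" point concrete, here is a minimal sketch in plain Java of what the two views compute from the raw rows. It is not Drill code and not how Drill implements views; the class and method names (ViewSketch, scientistRow, analystRow) are made up for illustration, and the masking rule is simply read off the example rows on the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ViewSketch {
    // Data Scientist view: all four columns, card number masked after the first group.
    // (Masking rule assumed from the slide's example rows.)
    static List<String> scientistRow(List<String> raw) {
        String masked = raw.get(3).substring(0, 4) + "-1111-1111-1111";
        return Arrays.asList(raw.get(0), raw.get(1), raw.get(2), masked);
    }

    // Business Analyst view: the card number column is not exposed at all.
    static List<String> analystRow(List<String> raw) {
        return new ArrayList<>(raw.subList(0, 3));
    }

    public static void main(String[] args) {
        List<List<String>> rawFile = Arrays.asList(
            Arrays.asList("Dave", "San Jose", "CA", "1374-7914-3865-4817"),
            Arrays.asList("John", "Boulder", "CO", "1374-9735-1794-9711"));
        // Both views are derived at read time; the raw rows are never copied to storage.
        for (List<String> row : rawFile) {
            System.out.println(scientistRow(row) + "  " + analystRow(row));
        }
    }
}
```

In Drill itself the same effect would come from a view definition over the raw file, with file system permissions on the view deciding who can read it.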

Ownership Chaining

Combine Self Service Exploration with Data Governance

Raw File (/raw/cards.csv)
John (Owner)

| Name | City     | State | Credit Card #       |
| Dave | San Jose | CA    | 1374-7914-3865-4817 |
| John | Boulder  | CO    | 1374-9735-1794-9711 |

Data Scientist (/views/V_Scientist)
Jane (Read), John (Owner)

| Name | City     | State | Credit Card #       |
| Dave | San Jose | CA    | 1374-1111-1111-1111 |
| John | Boulder  | CO    | 1374-1111-1111-1111 |

Analyst (/views/V_Analyst)
Jack (Read), Jane (Owner)

| Name | City     | State |
| Dave | San Jose | CA    |
| John | Boulder  | CO    |

Ownership chaining: V_Analyst -> V_Scientist -> RAW FILE

Access path: Jack queries the view V_Analyst

• Does Jack have access to V_Analyst? -> YES
• Who is the owner of V_Analyst? -> Jane
• Drill accesses V_Analyst as Jane (impersonation hop 1)
• Does Jane have access to V_Scientist? -> YES
• Who is the owner of V_Scientist? -> John
• Drill accesses V_Scientist as John (impersonation hop 2)
• Does John have permissions on raw file? -> YES
• Who is the owner of raw file? -> John
• Drill accesses source file as John (no impersonation here)

* Ownership chain length (# hops) is configurable
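The access path above can be walked through with a small toy model. This is only a sketch in plain Java of the check-then-impersonate sequence described on the slide, not Drill's implementation; the Node class, the resolve method and the maxHops parameter are hypothetical names, and counting a hop only when the acting identity changes is an assumption taken from the "no impersonation here" remark.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class OwnershipChainSketch {
    // A view or file: who owns it, who may read it, and what it is defined over.
    static final class Node {
        final String name;
        final String owner;
        final Set<String> readers;
        final Node source; // null for a raw file

        Node(String name, String owner, Node source, String... readers) {
            this.name = name;
            this.owner = owner;
            this.source = source;
            this.readers = new HashSet<>(Arrays.asList(readers));
            this.readers.add(owner); // assumption: an owner can read its own object
        }
    }

    // Returns the number of impersonation hops used, or -1 if access is denied
    // or the chain needs more hops than maxHops allows.
    static int resolve(String user, Node top, int maxHops) {
        String actingAs = user;
        int hops = 0;
        for (Node n = top; n != null; n = n.source) {
            if (!n.readers.contains(actingAs)) {
                return -1; // the current identity may not read this object
            }
            // A view is read as its owner; that is a hop only if the identity changes.
            if (n.source != null && !n.owner.equals(actingAs)) {
                hops++;
                if (hops > maxHops) {
                    return -1;
                }
                actingAs = n.owner;
            }
        }
        return hops;
    }

    public static void main(String[] args) {
        Node raw = new Node("/raw/cards.csv", "John", null);
        Node vScientist = new Node("V_Scientist", "John", raw, "Jane");
        Node vAnalyst = new Node("V_Analyst", "Jane", vScientist, "Jack");
        System.out.println("Hops for Jack: " + resolve("Jack", vAnalyst, 3));
    }
}
```

With the slide's setup, Jack's query passes through two identity changes (Jack to Jane, Jane to John), and lowering the hypothetical maxHops to 1 would reject it.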

Security Summary

• Logical
  – No physical data copies/silos
• Granular
  – Row level and column level security controls
• De-centralized
  – User impersonation respecting storage system permissions
  – No separate permission repository for granular controls
  – Integrated with Hadoop File System permissions and LDAP
• Self-service w/ governance
  – If you have access to data, you control who can access it and how widely
  – Audits

Using Drill with Yelp

Business dataset

{
  "business_id": "4bEjOyTaDG24SY5TxsaUNQ",
  "full_address": "3655 Las Vegas Blvd S\nThe Strip\nLas Vegas, NV 89109",
  "hours": {
    "Monday": {"close": "23:00", "open": "07:00"},
    "Tuesday": {"close": "23:00", "open": "07:00"},
    "Friday": {"close": "00:00", "open": "07:00"},
    "Wednesday": {"close": "23:00", "open": "07:00"},
    "Thursday": {"close": "23:00", "open": "07:00"},
    "Sunday": {"close": "23:00", "open": "07:00"},
    "Saturday": {"close": "00:00", "open": "07:00"}
  },
  "open": true,
  "categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"],
  "city": "Las Vegas",
  "review_count": 4084,
  "name": "Mon Ami Gabi",
  "neighborhoods": ["The Strip"],
  "longitude": -115.172588519464,
  "state": "NV",
  "stars": 4.0,
  "attributes": {
    "Alcohol": "full_bar",
    "Noise Level": "average",
    "Has TV": false,
    "Attire": "casual",
    "Ambience": {
      "romantic": true, "intimate": false, "touristy": false, "hipster": false,
      "classy": true, "trendy": false, "casual": false
    },
    "Good For": {
      "dessert": false, "latenight": false, "lunch": false,
      "dinner": true, "breakfast": false, "brunch": false
    }
  }
}

Zero to Results in 2 minutes

Install
$ tar -xvzf apache-drill-1.9.0.tar.gz

Launch shell (embedded mode)
$ bin/sqlline -u jdbc:drill:zk=local
$ bin/drill-embedded

Query files and directories
> SELECT state, city, count(*) AS businesses
  FROM dfs.yelp.`business.json`
  GROUP BY state, city
  ORDER BY businesses DESC LIMIT 10;

Results
+-------+------------+------------+
| state | city       | businesses |
+-------+------------+------------+
| NV    | Las Vegas  | 12021      |
| AZ    | Phoenix    | 7499       |
| AZ    | Scottsdale | 3605       |
| EDH   | Edinburgh  | 2804       |
| AZ    | Mesa       | 2041       |
| AZ    | Tempe      | 2025       |
| NV    | Henderson  | 1914       |
| AZ    | Chandler   | 1637       |
| WI    | Madison    | 1630       |
| AZ    | Glendale   | 1196       |
+-------+------------+------------+

Directories are implicit partitions

SELECT dir0, SUM(amount)
FROM sales
WHERE dir1 IN ('q1', 'q2')
GROUP BY dir0

sales
├── 2014
│   ├── q1
│   ├── q2
│   ├── q3
│   └── q4
└── 2015
    └── q1
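As a rough illustration of how the directory levels under sales line up with dir0 and dir1, the sketch below splits a path relative to the table root into its directory segments. It is plain Java, not Drill code; the helper name dirColumns and the file names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DirColumnsSketch {
    // Splits a file path relative to the table root into dir0, dir1, ... values.
    // The last segment is the file name, so it is not one of the dirN values.
    static List<String> dirColumns(String relativePath) {
        String[] parts = relativePath.split("/");
        return new ArrayList<>(Arrays.asList(parts).subList(0, parts.length - 1));
    }

    public static void main(String[] args) {
        // Hypothetical file names placed under the sales directory tree.
        String[] files = {"2014/q1/a.json", "2014/q3/b.json", "2015/q1/c.json"};
        for (String p : files) {
            List<String> dirs = dirColumns(p);
            System.out.println(p + " -> dir0=" + dirs.get(0) + ", dir1=" + dirs.get(1));
        }
    }
}
```

Under this reading, dir0 carries the year and dir1 the quarter, which is why a filter on dir1 can skip whole subdirectories.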

Intuitive SQL Access to Complex Data

// It's Friday 10pm in Vegas and we're looking for hummus
> SELECT name, stars, b.hours.Friday friday, categories
  FROM dfs.yelp.`business.json` b
  WHERE b.hours.Friday.`open` < '22:00'
    AND b.hours.Friday.`close` > '22:00'
    AND REPEATED_CONTAINS(categories, 'Mediterranean')
    AND city = 'Las Vegas'
  ORDER BY stars DESC LIMIT 2;

+-------------------------------+-------+----------------------------------+-------------------------------------------------------------+
| name                          | stars | friday                           | categories                                                  |
+-------------------------------+-------+----------------------------------+-------------------------------------------------------------+
| Olives                        | 4.0   | {"close":"22:30","open":"11:00"} | ["Mediterranean","Restaurants"]                             |
| Marrakech Moroccan Restaurant | 4.0   | {"close":"23:00","open":"17:30"} | ["Mediterranean","Middle Eastern","Moroccan","Restaurants"] |
+-------------------------------+-------+----------------------------------+-------------------------------------------------------------+

Query data with any levels of nesting

Reviews dataset

{ "votes": {"funny": 0, "useful": 2, "cool": 1}, "user_id": "Xqd0DzHaiyRqVH3WRG7hzg", "review_id": "15SdjuK7DmYqUAj6rjGowg", "stars": 5, "date": "2007-05-17", "text": "dr. goldberg offers everything ...", "type": "review", "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"}

ANSI SQL Compatibility

// Get top cool rated businesses
> SELECT b.name
  FROM dfs.yelp.`business.json` b
  WHERE b.business_id IN (
    SELECT r.business_id
    FROM dfs.yelp.`review.json` r
    GROUP BY r.business_id
    HAVING SUM(r.votes.cool) > 2000
    ORDER BY SUM(r.votes.cool) DESC);

+-------------------------------+
| name                          |
+-------------------------------+
| Earl of Sandwich              |
| XS Nightclub                  |
| The Cosmopolitan of Las Vegas |
| Wicked Spoon                  |
+-------------------------------+

Use familiar SQL functionality (Joins, Aggregations, Sorting, Sub-queries, SQL data types)

Logical Views

// Create a view combining business and reviews datasets
> CREATE OR REPLACE VIEW dfs.tmp.BusinessReviews AS
  SELECT b.name, b.stars, r.votes.funny, r.votes.useful, r.votes.cool, r.`date`
  FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r
  WHERE r.business_id = b.business_id;

+------+-----------------------------------------------------------------+
| ok   | summary                                                         |
+------+-----------------------------------------------------------------+
| true | View 'BusinessReviews' created successfully in 'dfs.tmp' schema |
+------+-----------------------------------------------------------------+

> SELECT COUNT(*) AS Total FROM dfs.tmp.BusinessReviews;

+---------+
| Total   |
+---------+
| 1125458 |
+---------+

Lightweight file system based views for granular and de-centralized data management

Materialized Views AKA Tables

> ALTER SESSION SET `store.format` = 'parquet';

> CREATE TABLE dfs.yelp.BusinessReviewsTbl AS
  SELECT b.name, b.stars, r.votes.funny funny, r.votes.useful useful, r.votes.cool cool, r.`date`
  FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r
  WHERE r.business_id = b.business_id;

+----------+---------------------------+
| Fragment | Number of records written |
+----------+---------------------------+
| 1_0      | 176448                    |
| 1_1      | 192439                    |
| 1_2      | 198625                    |
| 1_3      | 200863                    |
| 1_4      | 181420                    |
| 1_5      | 175663                    |
+----------+---------------------------+

Save analysis results as tables using familiar CTAS syntax

Repeated Values Support

// Flatten repeated categories
> SELECT name, categories FROM dfs.yelp.`business.json` LIMIT 3;

+----------------------------+------------------------------------------+
| name                       | categories                               |
+----------------------------+------------------------------------------+
| Eric Goldberg, MD          | ["Doctors","Health & Medical"]           |
| Pine Cone Restaurant       | ["Restaurants"]                          |
| Deforest Family Restaurant | ["American (Traditional)","Restaurants"] |
+----------------------------+------------------------------------------+

> SELECT name, FLATTEN(categories) AS categories FROM dfs.yelp.`business.json` LIMIT 5;

+----------------------------+------------------------+
| name                       | categories             |
+----------------------------+------------------------+
| Eric Goldberg, MD          | Doctors                |
| Eric Goldberg, MD          | Health & Medical       |
| Pine Cone Restaurant       | Restaurants            |
| Deforest Family Restaurant | American (Traditional) |
| Deforest Family Restaurant | Restaurants            |
+----------------------------+------------------------+

Dynamically flatten repeated and nested data elements as part of SQL queries. No ETL necessary
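The row expansion that FLATTEN performs in the query above can be mimicked in a few lines of plain Java. This is a sketch of the semantics only (one output row per array element, with the other columns repeated), not Drill's implementation; FlattenSketch and flatten are illustrative names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FlattenSketch {
    // Emits one {name, category} row per element of the repeated field.
    static List<String[]> flatten(String name, List<String> categories) {
        List<String[]> rows = new ArrayList<>();
        for (String category : categories) {
            rows.add(new String[] {name, category});
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String[]> out = new ArrayList<>();
        out.addAll(flatten("Eric Goldberg, MD", Arrays.asList("Doctors", "Health & Medical")));
        out.addAll(flatten("Pine Cone Restaurant", Arrays.asList("Restaurants")));
        for (String[] row : out) {
            System.out.println(row[0] + " | " + row[1]);
        }
    }
}
```

A record with two categories yields two rows and a record with one category yields one, matching the shape of the flattened result table.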

Checkins dataset

{
  "checkin_info": {
    "3-4": 1, "13-5": 1, "6-6": 1, "14-5": 1, "14-6": 1, "14-2": 1, "14-3": 1,
    "19-0": 1, "11-5": 1, "13-2": 1, "11-6": 2, "11-3": 1, "12-6": 1, "6-5": 1,
    "5-5": 1, "9-2": 1, "9-5": 1, "9-6": 1, "5-2": 1, "7-6": 1, "7-5": 1,
    "7-4": 1, "17-5": 1, "8-5": 1, "10-2": 1, "10-5": 1, "10-6": 1
  },
  "type": "checkin",
  "business_id": "JwUE5GmEO-sH1FuwJgKBlQ"
}

Supports Dynamic / Unknown Columns

> SELECT KVGEN(checkin_info) checkins FROM dfs.yelp.`checkin.json` LIMIT 1;

+----------+
| checkins |
+----------+
| [{"key":"3-4","value":1},{"key":"13-5","value":1},{"key":"6-6","value":1},{"key":"14-5","value":1},{"key":"14-6","value":1},{"key":"14-2","value":1},{"key":"14-3","value":1},{"key":"19-0","value":1},{"key":"11-5","value":1},{"key":"13-2","value":1},{"key":"11-6","value":2},{"key":"11-3","value":1},{"key":"12-6","value":1},{"key":"6-5","value":1},{"key":"5-5","value":1},{"key":"9-2","value":1},{"key":"9-5","value":1},{"key":"9-6","value":1},{"key":"5-2","value":1},{"key":"7-6","value":1},{"key":"7-5","value":1},{"key":"7-4","value":1},{"key":"17-5","value":1},{"key":"8-5","value":1},{"key":"10-2","value":1},{"key":"10-5","value":1},{"key":"10-6","value":1}] |
+----------+

> SELECT FLATTEN(KVGEN(checkin_info)) checkins FROM dfs.yelp.`checkin.json` LIMIT 6;

+---------------------------+
| checkins                  |
+---------------------------+
| {"key":"3-4","value":1}   |
| {"key":"13-5","value":1}  |
| {"key":"6-6","value":1}   |
| {"key":"14-5","value":1}  |
| {"key":"14-6","value":1}  |
| {"key":"14-2","value":1}  |
+---------------------------+

Convert Map with a wide set of dynamic columns into an array of key-value pairs
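The map-to-pairs conversion can be pictured with a short plain-Java sketch. It only mirrors the shape of the KVGEN output shown above (a list of objects with "key" and "value" fields) and is not Drill code; KvgenSketch and kvgen are illustrative names, and the sample map uses three entries from the checkins record.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class KvgenSketch {
    // Turns a map with arbitrary, data-dependent keys into a list of
    // {key, value} pairs, so every record ends up with the same two fields.
    static List<Map<String, Object>> kvgen(Map<String, Integer> m) {
        List<Map<String, Object>> pairs = new ArrayList<>();
        for (Map.Entry<String, Integer> e : m.entrySet()) {
            Map<String, Object> pair = new LinkedHashMap<>();
            pair.put("key", e.getKey());
            pair.put("value", e.getValue());
            pairs.add(pair);
        }
        return pairs;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, like the key order in the JSON record.
        Map<String, Integer> checkinInfo = new LinkedHashMap<>();
        checkinInfo.put("3-4", 1);
        checkinInfo.put("13-5", 1);
        checkinInfo.put("11-6", 2);
        System.out.println(kvgen(checkinInfo));
    }
}
```

Once the keys are ordinary values in a "key" field, they can be filtered, grouped and flattened like any other column, which is what the FLATTEN(KVGEN(...)) query relies on.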

Resources

Drill is Top-Ranked SQL-on-Hadoop

Source: Gigaom Research, 2015

Key:
• Number indicates company's relative strength across all vectors
• Size of ball indicates company's relative strength along individual vector

“Drill isn’t just about SQL-on-Hadoop. It’s about SQL-on-pretty-much-anything, immediately, and without formality.”

OJAI and MapR-DB

Where to find it…
– The source: https://github.com/ojai/ojai
– The site: http://ojai.github.io/
– Python bindings: https://github.com/mapr-demos/python-bindings
– Javascript bindings: https://github.com/mapr-demos/js-bindings

Ready to play with your data?
– Download the sandbox: http://maprdb.io
– Examples:
  • Java: https://github.com/mapr-demos/maprdb-ojai-101
  • Python: https://github.com/mapr-demos/maprdb_python_examples

Drill Walkthrough

• Example queries
  https://www.mapr.com/blog/drilling-healthy-choices
• Conversion from relational model to flat JSON model
  https://www.mapr.com/blog/evolution-database-schemas-using-sql-nosql

@kingmesal

jscott@mapr.com

Engage with us!

kingmesal