NotaQL Is Not a Query Language! It's for Data Transformation on Wide-Column Stores

M. Sc. Johannes Schildgen, 2015-07-08, [email protected]


Page 2

"A DBA walks into a NoSQL bar, but turns and leaves because he couldn't find a table"

Page 3: Column Families

RowId | info | children

Page 4: Column Families

RowId | info                                | children
Peter | born: 1965, cmpny: IBM, salary: 70k | Lisa: €5, Carl: €0, Susi: €10, Toni: €7
Lisa  | born: 1997, school: BSIT            |

Graph view: nodes Peter (1965, IBM, 70k) and Lisa (1997, BSIT); edge labels €10, €5, €0, €7.

Page 5: HBase API

    put 'pers', 'Carl', 'info:born', '1982'
    put 'pers', 'Carl', 'info:school', 'BSIT'
    put 'pers', 'Carl', 'info:school', 'BUIT'
    get 'pers', 'Carl'
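To make the put/get semantics concrete, here is a minimal Python sketch of the data model behind these shell commands. It is not the HBase client API: `tables`, `put` and `get` are hypothetical in-memory stand-ins, and the sketch ignores cell versions and timestamps, which HBase does keep.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in: table name -> row id -> "family:qualifier" -> value.
tables = defaultdict(lambda: defaultdict(dict))

def put(table, row_id, column, value):
    """Write one cell; a later put to the same cell replaces the visible value."""
    tables[table][row_id][column] = value

def get(table, row_id):
    """Return all cells of one row as a plain dict."""
    return dict(tables[table][row_id])

put('pers', 'Carl', 'info:born', '1982')
put('pers', 'Carl', 'info:school', 'BSIT')
put('pers', 'Carl', 'info:school', 'BUIT')  # same cell as the previous put
carl = get('pers', 'Carl')
print(carl)
```

In this sketch the second write to info:school simply overwrites the first, so the row ends up with two cells.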

Page 6: Jaspersoft HBase QL

    {
      "tableName": "pers",
      "deserializerClass": "com.jaspersoft…DefaultDeserializer",
      "filter": {
        "SingleColumnValueFilter": {
          "family": "info",
          "qualifier": "school",
          "compareOp": "EQUAL",
          "comparator": {
            "SubstringComparator": {
              "substr": "BSIT"
            }
          }
        }
      }
    }

σ_{school = 'BSIT'}(pers)

http://community.jaspersoft.com/wiki/jaspersoft-hbase-query-language

Page 7: Phoenix

    SELECT * FROM pers WHERE school = 'BSIT'

σ_{school = 'BSIT'}(pers)

"Parent of each person?"

https://github.com/forcedotcom/phoenix


Page 9

Input Table -> Output Table

Page 10

Input Cell:  (RowID, Column, Value)
Output Cell: (RowID, Column, Value)

Page 11

Input Cell:  (_r, _c, _v)
Output Cell: (_r, _c, _v)

Page 12

Input Cell:  _r = Lisa, _c = born, _v = 1997
Output Cell: _r = Lisa, _c = born, _v = 1997

    OUT._r <- IN._r,
    OUT.born <- IN.born;

Relational algebra: π_{born}(pers)

Page 13

Input Cell:  _r = Lisa, _c = born, _v = 1997
Output Cell: _r = Lisa, _c = born, _v = 1997

    OUT._r <- IN._r,
    OUT.born <- IN.born,
    OUT.school <- IN.school;

Relational algebra: π_{born, school}(pers)
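The two projections can be mimicked in plain Python. This is only a sketch of the semantics, not how NotaQL is implemented: the table is modelled as a dict of rows, `project` is a hypothetical helper, and "70k" is read as 70000.

```python
# Toy stand-in for the pers table: row id -> column -> value (families omitted).
pers = {
    'Peter': {'born': 1965, 'cmpny': 'IBM', 'salary': 70_000},
    'Lisa': {'born': 1997, 'school': 'BSIT'},
}

def project(table, columns):
    """OUT._r <- IN._r, OUT.<col> <- IN.<col> for every requested column."""
    out = {}
    for row_id, row in table.items():
        for col in columns:
            if col in row:  # a missing column simply produces no output cell
                out.setdefault(row_id, {})[col] = row[col]
    return out

born_only = project(pers, ['born'])
born_school = project(pers, ['born', 'school'])
print(born_only)
print(born_school)
```

Because rows need not share a schema, Peter has no school cell in the second result. No placeholder value is written for it.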

Page 14

Input Cell:  _r = Lisa, _c = born, _v = 1997
Output Cell: _r = Lisa, _c = born, _v = 1997

    OUT._r <- IN._r,
    OUT.$(IN._c) <- IN._v;

Goal: σ_{school = 'BSIT'}(pers)

Page 15

Input Cell:  _r = Lisa, _c = born, _v = 1997
Output Cell: _r = Lisa, _c = born, _v = 1997

    IN-FILTER: school = 'BSIT',    (row predicate)
    OUT._r <- IN._r,
    OUT.$(IN._c) <- IN._v;

Relational algebra: σ_{school = 'BSIT'}(pers)
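A row predicate combined with the copy-everything rule behaves like a relational selection. The following Python sketch illustrates that reading; `select_rows` and the dict-based table are assumptions made for illustration, not part of NotaQL.

```python
pers = {
    'Peter': {'born': 1965, 'cmpny': 'IBM', 'salary': 70_000},
    'Lisa': {'born': 1997, 'school': 'BSIT'},
}

def select_rows(table, row_pred):
    """Keep only rows that satisfy the row predicate, then copy every cell."""
    out = {}
    for row_id, row in table.items():
        if not row_pred(row):          # IN-FILTER: row predicate
            continue
        for col, val in row.items():   # OUT._r <- IN._r, OUT.$(IN._c) <- IN._v
            out.setdefault(row_id, {})[col] = val
    return out

bsit = select_rows(pers, lambda row: row.get('school') == 'BSIT')
print(bsit)
```

Rows without a school column, such as Peter, fail the predicate and are skipped.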

Page 16

That was: Selection and Projection

Page 17

Now: Grouping

Page 18

Salary sum of each company.

Input Cell:  _r = Peter, _c = cmpny, _v = IBM
Output Cell: _r = IBM, _c = salsum, _v = 645k

    OUT._r <- IN.cmpny,
    OUT.salsum <- SUM(IN.salary);

Page 19

    OUT._r <- IN.cmpny,
    OUT.salsum <- SUM(IN.salary);

Input table:

RowId | info
Eve   | born: 1965, cmpny: IBM, salary: 70k
Carl  | born: 1966, cmpny: IBM, job: intern
Julia | born: 1967, cmpny: IBM, salary: 80k
Lisa  | born: 1997, school: BSIT, salary: 1k

Intermediate output cells: (IBM, salsum, 70k) and (IBM, salsum, 80k)

Output table:

RowId | salsum
IBM   | 150k
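The grouping example can be traced in a few lines of Python. This sketch assumes that the four rows carry the values in the order listed on the slide, that salaries are given in thousands, and that a row lacking either cmpny or salary contributes nothing. `salary_sum` is a hypothetical helper.

```python
# Salaries in thousands (70 stands for "70k").
employees = {
    'Eve':   {'born': 1965, 'cmpny': 'IBM', 'salary': 70},
    'Carl':  {'born': 1966, 'cmpny': 'IBM', 'job': 'intern'},
    'Julia': {'born': 1967, 'cmpny': 'IBM', 'salary': 80},
    'Lisa':  {'born': 1997, 'school': 'BSIT', 'salary': 1},
}

def salary_sum(table):
    """OUT._r <- IN.cmpny, OUT.salsum <- SUM(IN.salary)."""
    out = {}
    for row in table.values():
        if 'cmpny' not in row or 'salary' not in row:
            continue  # no output cell can be built for this row
        cell = out.setdefault(row['cmpny'], {'salsum': 0})
        cell['salsum'] += row['salary']
    return out

result = salary_sum(employees)
print(result)
```

Carl (no salary) and Lisa (no cmpny) drop out, which leaves 70k + 80k = 150k for IBM.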

Page 20

Advanced Transformations: More Filters

Page 21

RowId | info                                | children
Peter | born: 1965, cmpny: IBM, salary: 70k | Lisa: €5, Carl: €0, Susi: €10, Toni: €7
Lisa  | born: 1997, school: BSIT            |

Graph view: nodes Peter (1965, IBM, 70k) and Lisa (1997, BSIT); edge labels €10, €5, €0, €7.

Page 22

RowId | info                                | children
Peter | born: 1965, cmpny: IBM, salary: 70k | Lisa: €5, Carl: €0, Susi: €10, Toni: €7
Lisa  | born: 1997, school: BSIT            |

    OUT._r <- IN._r,
    OUT.$(IN._c) <- IN._v;

Page 23

RowId | info                                | children
Peter | born: 1965, cmpny: IBM, salary: 70k | Lisa: €5, Carl: €0, Susi: €10, Toni: €7
Lisa  | born: 1997, school: BSIT            |

    IN-FILTER: COL_COUNT(children) > 0,
    OUT._r <- IN._r,
    OUT.$(IN._c) <- IN._v;

Page 24

RowId | info                                | children
Peter | born: 1965, cmpny: IBM, salary: 70k | Lisa: €5, Carl: €0, Susi: €10, Toni: €7
Lisa  | born: 1997, school: BSIT            |

    IN-FILTER: COL_COUNT(children) > 0,
    OUT._r <- IN._r,
    OUT.$(IN.children._c) <- IN._v;

Page 25

RowId | info                                | children
Peter | born: 1965, cmpny: IBM, salary: 70k | Lisa: €5, Carl: €0, Susi: €10, Toni: €7
Lisa  | born: 1997, school: BSIT            |

    IN-FILTER: COL_COUNT(children) > 0,
    OUT._r <- IN._r,
    OUT.$(IN.children._c?(@>5)) <- IN._v;    (cell predicate)

Page 26

RowId | info                                | children
Peter | born: 1965, cmpny: IBM, salary: 70k | Lisa: €5, Carl: €0, Susi: €10, Toni: €7
Lisa  | born: 1997, school: BSIT            |

    IN-FILTER: COL_COUNT(children) > 0,
    OUT._r <- IN._r,
    OUT.$(IN.children._c?(!Carl)) <- IN._v;    (cell predicate)
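The difference between the row predicate and the two cell predicates can be sketched in Python. Several things here are assumptions: that all four children cells belong to Peter's row, that `@` refers to the value of the current cell, and that `!Carl` excludes the column named Carl. `children_cells` is a hypothetical helper, and amounts are plain numbers of euros.

```python
pers = {
    'Peter': {'info': {'born': 1965, 'cmpny': 'IBM', 'salary': 70_000},
              'children': {'Lisa': 5, 'Carl': 0, 'Susi': 10, 'Toni': 7}},
    'Lisa': {'info': {'born': 1997, 'school': 'BSIT'},
             'children': {}},
}

def children_cells(table, cell_pred=lambda col, val: True):
    """Copy cells of the children family, subject to a row and a cell predicate."""
    out = {}
    for row_id, row in table.items():
        if len(row['children']) == 0:        # IN-FILTER: COL_COUNT(children) > 0
            continue
        for col, val in row['children'].items():
            if cell_pred(col, val):          # cell predicate, checked per cell
                out.setdefault(row_id, {})[col] = val
    return out

all_children = children_cells(pers)                                # IN.children._c
more_than_5 = children_cells(pers, lambda col, val: val > 5)       # ?(@>5)
not_carl = children_cells(pers, lambda col, val: col != 'Carl')    # ?(!Carl)
print(more_than_5)
print(not_carl)
```

The row predicate removes whole rows before any cell is looked at, while a cell predicate only decides whether an individual cell is copied.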

Page 27

NotaQL Transformation Platform: MapReduce

Page 28

map(rowId, row)

1. row violates row pred.? yes: Stop
2. has more columns? no: Stop
3. cell violates cell pred.? yes: continue with step 2
4. map IN.{_r,_c,_v}, fetched columns and constants to r, c and v
5. emit((r, c), v), then continue with step 2

Example: row Peter (info: born 1965, cmpny IBM, salary 70k) yields the output cell (IBM, salsum, 70k), emitted as ((IBM, salsum), 70k).

Page 29

reduce((r,c), {v})

1. put(r, c, aggregateAll(v))
2. Stop

Example: ((IBM, salsum), {70k, 80k, 10k}) becomes ((IBM, salsum), 160k).
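The map and reduce steps can be simulated without Hadoop. The sketch below hard-codes the salary-sum transformation instead of interpreting a NotaQL script, and the shuffle phase is a plain dictionary. The row ids `row2` and `row3` are hypothetical, since the slide only shows the three salary values; salaries are in thousands.

```python
from collections import defaultdict

def map_row(row_id, row):
    """Map step for OUT._r <- IN.cmpny, OUT.salsum <- SUM(IN.salary)."""
    if 'cmpny' in row and 'salary' in row:
        # emit((r, c), v)
        yield (row['cmpny'], 'salsum'), row['salary']

def reduce_cell(key, values):
    """Reduce step: aggregate all values of one output cell, here with SUM."""
    r, c = key
    return r, c, sum(values)

def run(table):
    groups = defaultdict(list)  # shuffle: collect values per (r, c) key
    for row_id, row in table.items():
        for key, value in map_row(row_id, row):
            groups[key].append(value)
    return [reduce_cell(key, values) for key, values in groups.items()]

rows = {
    'Peter': {'born': 1965, 'cmpny': 'IBM', 'salary': 70},
    'row2': {'cmpny': 'IBM', 'salary': 80},   # hypothetical row id
    'row3': {'cmpny': 'IBM', 'salary': 10},   # hypothetical row id
}
cells = run(rows)
print(cells)
```

Using the pair (r, c) as the key means that every output cell is aggregated independently of all others.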

Page 30

Advanced Transformations: Graph Algorithms

Page 31

RowId | info                                | children
Peter | born: 1965, cmpny: IBM, salary: 70k | Lisa: €5, Carl: €0, Susi: €10, Toni: €7
Lisa  | born: 1997, school: BSIT            |

Graph view: nodes Peter (1965, IBM, 70k) and Lisa (1997, BSIT); edge labels €10, €5, €0, €7.

Page 32

"Parent of each person?"

Input Cell:  _r = Peter, _c = Lisa, _v = €5
Output Cell: _r = Lisa, _c = Peter, _v = €5

    OUT._r <- IN.children._c,
    OUT.$(IN._r) <- IN._v;
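Swapping row id and column name turns the "children of a parent" table into a "parents of a child" table. A minimal Python sketch of that idea follows; `invert_children` is a hypothetical helper, and the children cells are assumed to belong to Peter's row.

```python
pers = {
    'Peter': {'children': {'Lisa': 5, 'Carl': 0, 'Susi': 10, 'Toni': 7}},
    'Lisa': {'children': {}},
}

def invert_children(table):
    """OUT._r <- IN.children._c, OUT.$(IN._r) <- IN._v."""
    out = {}
    for parent, row in table.items():
        for child, amount in row.get('children', {}).items():
            # The child's name becomes the row id, the parent's name the column.
            out.setdefault(child, {})[parent] = amount
    return out

parents = invert_children(pers)
print(parents['Lisa'])
```

The column name is data here: the output column is called Peter because Peter was the row id of the input cell.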

Page 33

RowId     | info                      | links
Wikipedia | crawled: 17:35, pr: 0.333 | Twitter: -, Google: -
Twitter   | crawled: 17:36, pr: 0.333 | Google: -
Google    | crawled: 17:36, pr: 0.333 | Wikipedia: -

    OUT._r <- IN.links._c,
    OUT.incoming.$(IN._r) <- IN._v;

Page 34: Reversing the graph

RowId     | info                      | links                 | incoming
Wikipedia | crawled: 17:35, pr: 0.333 | Twitter: -, Google: - | Google: -
Twitter   | crawled: 17:36, pr: 0.333 | Google: -             | Wikipedia: -
Google    | crawled: 17:36, pr: 0.333 | Wikipedia: -          | Twitter: -, Wikipedia: -

    OUT._r <- IN.links._c,
    OUT.incoming.$(IN._r) <- IN._v;

Page 35

RowId     | info                      | links                 | incoming
Wikipedia | crawled: 17:35, pr: 0.333 | Twitter: -, Google: - | Google: -
Twitter   | crawled: 17:36, pr: 0.333 | Google: -             | Wikipedia: -
Google    | crawled: 17:36, pr: 0.333 | Wikipedia: -          | Twitter: -, Wikipedia: -

    OUT._r <- IN.links._c,
    OUT.info.pr <- SUM(IN.pr/COL_COUNT(links));

Page 36: PageRank

RowId     | info                      | links                 | incoming
Wikipedia | crawled: 17:35, pr: 0.333 | Twitter: -, Google: - | Google: -
Twitter   | crawled: 17:36, pr: 0.167 | Google: -             | Wikipedia: -
Google    | crawled: 17:36, pr: 0.5   | Wikipedia: -          | Twitter: -, Wikipedia: -

    OUT._r <- IN.links._c,
    OUT.info.pr <- SUM(IN.pr/COL_COUNT(links));
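One application of this transformation is a single PageRank iteration without a damping factor, as the script contains none. The Python sketch below reproduces the numbers on the slide under two assumptions: the displayed 0.333 stands for 1/3, and the links are as reconstructed in the table above. `pagerank_step` is a hypothetical helper.

```python
web = {
    'Wikipedia': {'pr': 1 / 3, 'links': ['Twitter', 'Google']},
    'Twitter': {'pr': 1 / 3, 'links': ['Google']},
    'Google': {'pr': 1 / 3, 'links': ['Wikipedia']},
}

def pagerank_step(table):
    """OUT._r <- IN.links._c, OUT.info.pr <- SUM(IN.pr/COL_COUNT(links))."""
    new_pr = {}
    for page, row in table.items():
        if not row['links']:
            continue  # a page without outgoing links emits nothing
        share = row['pr'] / len(row['links'])   # IN.pr / COL_COUNT(links)
        for target in row['links']:             # one output cell per link
            new_pr[target] = new_pr.get(target, 0.0) + share
    return new_pr

new_pr = pagerank_step(web)
print({page: round(pr, 3) for page, pr in new_pr.items()})
```

Google receives half of Wikipedia's rank plus all of Twitter's, which gives 1/6 + 1/3 = 0.5.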

Page 37

Advanced Transformations: Text Processing

Page 38

RowId     | info
Wikipedia | crawled: 17:35, pr: 0.333, body: all information can be found here
Twitter   | crawled: 17:36, pr: 0.167, body: click here for more information

    OUT._r <- IN._r,
    OUT.words <- COUNT(IN.body.split(' '));

Page 39: Word Count

RowId     | info
Wikipedia | crawled: 17:35, pr: 0.333, body: all information can be found here, words: 6
Twitter   | crawled: 17:36, pr: 0.167, body: click here for more information, words: 5

    OUT._r <- IN._r,
    OUT.words <- COUNT(IN.body.split(' '));
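The word count per page can be checked with a short Python sketch. `word_counts` is a hypothetical helper, and only the body column of each row is modelled.

```python
pages = {
    'Wikipedia': {'body': 'all information can be found here'},
    'Twitter': {'body': 'click here for more information'},
}

def word_counts(table):
    """OUT._r <- IN._r, OUT.words <- COUNT(IN.body.split(' '))."""
    return {row_id: {'words': len(row['body'].split(' '))}
            for row_id, row in table.items()}

counts = word_counts(pages)
print(counts)
```

The row id is kept, so the count is a horizontal aggregation inside each row rather than a grouping across rows.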

Page 40: Term Index

Input table:

RowId     | info
Wikipedia | crawled: 17:35, pr: 0.333, body: all information can be found here, words: 6
Twitter   | crawled: 17:36, pr: 0.167, body: click here for more information, words: 5

    OUT._r <- IN.body.split(' '),
    OUT.$(IN._r) <- COUNT(*);

Output table (one row shown):

RowId | info
here  | Wikipedia: 1, Twitter: 1
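For the term index, every word becomes a row id and every page a column. The Python sketch below follows that reading; `term_index` is a hypothetical helper, and only the body column is modelled.

```python
pages = {
    'Wikipedia': {'body': 'all information can be found here'},
    'Twitter': {'body': 'click here for more information'},
}

def term_index(table):
    """OUT._r <- IN.body.split(' '), OUT.$(IN._r) <- COUNT(*)."""
    index = {}
    for row_id, row in table.items():
        for term in row['body'].split(' '):
            counts = index.setdefault(term, {})          # one output row per term
            counts[row_id] = counts.get(row_id, 0) + 1   # one column per page
    return index

index = term_index(pages)
print(index['here'])
```

Compared with the word count, only the roles of row id and column have changed: the split result now defines the output row.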

Page 41: Conclusion

Selection, Projection
Grouping, Aggregation
Schema-Flexible
Horizontal Aggregation
Metadata <-> Data
Graph Processing
Text Processing

SQL

Page 42

Thank you!