Survey of Accumulo Techniques for Indexing Data

SURVEY OF ACCUMULO TECHNIQUES FOR INDEXING DATA Donald Miner @donaldpminer January 18th, 2015


INTRODUCTION TO ACCUMULO

The Apache Accumulo sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.


Adelaide Bartkowski

Alyssa Files

Beatriz Palmore

Cecilia Ours

Craig Avalos

Dianna Lapointe

Erma Davis

Fermina Smead

Garrett Harsh

Gaylene Sherry

Gilberto Pardue

Hui Nodal

Janell Tomita

Jannette Betters

Jeana Delk

Madlyn Radke

Peggie Allis

Rhona Zygmont

Tran Degarmo

Wilhelmina Papp


-inf to D: Adelaide Bartkowski, Alyssa Files, Beatriz Palmore, Cecilia Ours, Craig Avalos, Dianna Lapointe

E to H: Erma Davis, Fermina Smead, Garrett Harsh, Gaylene Sherry, Gilberto Pardue, Hui Nodal

J to +inf: Janell Tomita, Jannette Betters, Jeana Delk, Madlyn Radke, Peggie Allis, Rhona Zygmont, Tran Degarmo, Wilhelmina Papp


Accumulo Master

TabletServer TabletServer TabletServer

ZooKeeper


KEY VALUE

Adelaide Bartkowski 91294124

Alyssa Files 491294

Beatriz Palmore 4124124124

Cecilia Ours 419120

Craig Avalos 940124

Dianna Lapointe 4921

Erma Davis 050194

Fermina Smead 10024599949

Garrett Harsh 140095931

Gaylene Sherry 914815

Gilberto Pardue 412414124124

Hui Nodal 962195192

Janell Tomita 12121

Jannette Betters 9192012

Jeana Delk 9120150

Madlyn Radke 4921

Peggie Allis 944944

Rhona Zygmont 123103

Tran Degarmo 9499494

Wilhelmina Papp 11221

Lookup by key (“Garrett Harsh”): FAST

Lookup by value (“4921”): SLOW
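
The difference between the two lookups can be sketched with a plain Java `TreeMap` standing in for a sorted key/value table. This is only an illustration, not the Accumulo API; the class and method names are hypothetical, and the entries are a few of the pairs from the table above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy stand-in for a sorted key/value table; not the Accumulo API. */
final class SortedLookupDemo {
    /** Builds a small sorted table from a few of the name/number pairs. */
    static TreeMap<String, String> sampleTable() {
        TreeMap<String, String> table = new TreeMap<>();
        table.put("Garrett Harsh", "140095931");
        table.put("Dianna Lapointe", "4921");
        table.put("Madlyn Radke", "4921");
        table.put("Janell Tomita", "12121");
        return table;
    }

    /** Key lookup: the sorted structure locates the entry in O(log n). */
    static String lookupByKey(TreeMap<String, String> table, String key) {
        return table.get(key);
    }

    /** Value lookup: nothing is sorted by value, so every entry is examined. */
    static List<String> lookupByValue(TreeMap<String, String> table, String value) {
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, String> e : table.entrySet()) {
            if (e.getValue().equals(value)) {
                keys.add(e.getKey());
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = sampleTable();
        System.out.println(lookupByKey(table, "Garrett Harsh"));
        System.out.println(lookupByValue(table, "4921"));
    }
}
```

The full scan in `lookupByValue` is the cost that a separate index table, keyed by value, is meant to avoid.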


MIT Lincoln Lab study:
http://arxiv.org/ftp/arxiv/papers/1406/1406.4923.pdf

• 100 million inserts per second using Accumulo

Booz Allen Hamilton study:
http://sqrrl.com/media/Accumulo-Benchmark-10312013-1.pdf

• 942 tablet servers, 7.56 trillion entries, 408 TB, 26 hours

• 94 MB/sec, 15 TB/hr, 80 million inserts per second

• 11 tablet servers went down with no interruption

• Showed linear scalability for write throughput

• 22,000 queries per second

HBase vs. Accumulo

• Subtle yet important differences in visibility implementation

• Coprocessors vs. Iterators

• Accumulo has faster write throughput*

• HBase’s reads are faster*

• HBase has more ecosystem integration

• Accumulo can shift around column families and locality groups after the fact

• Accumulo has been shown to work with no problems at 1,000 nodes (BAH paper). Facebook and others run a “cell” design for HBase. Largest clusters in the hundreds*.

* We believe

Disclaimer: I am biased

Column Visibility Syntax

Label                Description
A & B                Both ‘A’ and ‘B’ are required
A | B                Either ‘A’ or ‘B’ is required
A & (C | B)          ‘A’ and ‘C’, or ‘A’ and ‘B’, are required
A | (B & C)          ‘A’, or both ‘B’ and ‘C’, are required
(A | B) & (C & D)    ?
A & (B & (C | D))    ?

Patient has schizophrenia: insurer | (MD & psych)

Patient has stomach ulcers: insurer | doctor

Patient has cavity: insurer | dentist

Patient has consent for general anesthesia: surgeon
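
How a label expression is checked against a reader's authorizations can be sketched with a small recursive-descent evaluator. This is an illustration only and not Accumulo's own implementation: the class `VisibilityDemo` and its `canSee` method are hypothetical names, labels are assumed to be plain alphanumeric tokens, and the sketch assumes that mixing `&` and `|` at the same level without parentheses is rejected.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Toy evaluator for visibility expressions built from labels, &, | and parentheses. */
final class VisibilityDemo {
    private final String expr;
    private final Set<String> auths;
    private int pos = 0;

    private VisibilityDemo(String expr, Set<String> auths) {
        this.expr = expr.replace(" ", "");
        this.auths = auths;
    }

    /** Returns true if the given authorizations satisfy the expression. */
    static boolean canSee(String expression, String... authorizations) {
        VisibilityDemo p = new VisibilityDemo(expression, new HashSet<>(Arrays.asList(authorizations)));
        boolean result = p.parseExpr();
        if (p.pos != p.expr.length()) {
            throw new IllegalArgumentException("unexpected input at position " + p.pos);
        }
        return result;
    }

    /** expr := term ('&' term)* | term ('|' term)* */
    private boolean parseExpr() {
        boolean result = parseTerm();
        char op = 0;
        while (pos < expr.length() && (expr.charAt(pos) == '&' || expr.charAt(pos) == '|')) {
            char next = expr.charAt(pos++);
            if (op != 0 && op != next) {
                throw new IllegalArgumentException("mixing & and | requires parentheses");
            }
            op = next;
            boolean rhs = parseTerm();
            result = (op == '&') ? (result && rhs) : (result || rhs);
        }
        return result;
    }

    /** term := label | '(' expr ')' */
    private boolean parseTerm() {
        if (pos < expr.length() && expr.charAt(pos) == '(') {
            pos++;
            boolean inner = parseExpr();
            if (pos >= expr.length() || expr.charAt(pos) != ')') {
                throw new IllegalArgumentException("missing )");
            }
            pos++;
            return inner;
        }
        int start = pos;
        while (pos < expr.length() && Character.isLetterOrDigit(expr.charAt(pos))) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("expected a label at position " + pos);
        }
        return auths.contains(expr.substring(start, pos));
    }

    public static void main(String[] args) {
        System.out.println(canSee("insurer | (MD & psych)", "MD", "psych"));
        System.out.println(canSee("(A | B) & (C & D)", "A", "C"));
    }
}
```

Under these assumptions, a reader holding only `MD` cannot see an entry labeled `insurer | (MD & psych)`, while a reader holding both `MD` and `psych` can.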

More cool features

• Iterator framework: customizable server-side processing

• Constraints: user-defined Java functions that allow or prevent new writes based on a condition

• Large rows: no limit on data stored in a row

• MapReduce InputFormats

• Thrift proxy: access Accumulo through Ruby, Python, …

• Monitor page: shows performance, status, errors, more

• Locality groups: group column families together on disk for performance tuning (changeable later)

• On-HDFS at rest encryption (work in progress)

• Table import and export

Scalability & Performance

• Multiple HDFS volumes: Accumulo can use multiple NameNodes to store its data

• Master stores metadata in an Accumulo table

• Native in-memory map: data is first written into a buffer written in C++, outside of Java

• Relative encoding: fields that repeat across consecutive keys (such as the row ID) are flagged instead of rewritten

• Scan pipelines: stages of the read path are parallelized into separate threads

• Caching: data recently scanned is cached

HOW IT WORKS

Data Model

KEY: ROW ID | COLUMN (FAMILY, QUALIFIER, VISIBILITY) | TIMESTAMP
VALUE

• Row ID: lookup key

• Column family: collection of data that is kept together

• Column qualifier: what the data is

• Column visibility: who can see the data

• Timestamp: when the data was created

• Key: UNIQUENESS, SORTED

• Value: some piece of information

Row ID  Family   Qualifier  Visibility       Timestamp  Value
derek   …        …          …                …          …
don     contact  email      admin | private  11905014   [email protected]
don     contact  email      admin | private  12412412   [email protected]
don     contact  email      public           12412412   dm…@cl....com
don     contact  twitter    public           12423523   @donaldpminer
don     info     height     public           12314514   5’ 9”
don     info     SSN        private          12314514   123-45-6789
erica   …        …          …                …          …

Writing data into Accumulo

Row ID  Family  Qualifier  Visibility  Timestamp  Value
don     info    picture    public      13119103   dd3ae1d3b951a33f…

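import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.ColumnVisibility;
import org.apache.hadoop.io.Text;

// Build the parts of the key; conn is assumed to be an existing Connector
Text rowID = new Text("don");
Text colFam = new Text("info");
Text colQual = new Text("picture");
ColumnVisibility colVis = new ColumnVisibility("public");
long timestamp = System.currentTimeMillis();
Value value = new Value(MyPictureObj.getBytes());

// A Mutation collects the changes to a single row
Mutation mutation = new Mutation(rowID);
mutation.put(colFam, colQual, colVis, timestamp, value);

// The BatchWriter sends the mutation; close() flushes pending writes
BatchWriterConfig config = new BatchWriterConfig();
BatchWriter writer = conn.createBatchWriter("usertable", config);
writer.add(mutation);
writer.close();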

Row ID  Family   Qualifier  Visibility       Timestamp  Value
derek   …        …          …                …          …
don     contact  email      admin | private  11905014   [email protected]
don     contact  email      admin | private  12412412   [email protected]
don     contact  email      public           12412412   dm…@cl....com
don     contact  twitter    public           12423523   @donaldpminer
don     info     height     public           12314514   5’ 9”
don     info     picture    public           13119103   dd3ae1d3b951a33f…
don     info     SSN        private          12314514   123-45-6789
erica   …        …          …                …          …

Reading data

Range    Family  Visibilities
don-don  info    public

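import java.util.Map.Entry;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

// Only entries visible to the given authorizations are returned
Authorizations auths = new Authorizations("public");
Scanner scan = conn.createScanner("usertable", auths);
scan.setRange(new Range("don", "don"));
scan.fetchColumnFamily(new Text("info"));
for (Entry<Key,Value> entry : scan) {
    String row = entry.getKey().getRow().toString();
    Value value = entry.getValue();
}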

Reading data

Range    Family  Visibilities
don-don  info    public, user, tech

Range    Visibilities
don-don  public, user, tech

Scan:

Range  Visibilities
d-e    public, user, tech

INDEXING & TABLE DESIGN

Basic Structured Data

Row ID  Column Family  Column Qualifier  Column Visibility  Timestamp  Value
bob     attribute      height            public             Jun 2012   5’11”
bob     attribute      surname           public             Jul 2013   doe
bob     insurance      dental            private            Sep 2009   MetLife
jane    attribute      bloodType         public             Jul 2011   ab-
jane    attribute      surname           public             Aug 2013   doe
jane    contact        cellPhone         public             Dec 2010   (808) 345-9876
jane    insurance      vision            private            Jan 2008   VSP
john    allergy        major             private            Feb 1988   amoxicillin
john    attribute      weight            public             Sep 2013   180
john    contact        homeAddr          public             Mar 2003   34 Baker LN

Indexing Everything

Event Table:
Row ID    Column Fam    Column Qual           Visibility    Time    value

Index Table:
index     Column Fam    Column Qual:Row ID    Visibility    Time    -
to        Column Fam    Column Qual:Row ID    Visibility    Time    -
values    Column Fam    Column Qual:Row ID    Visibility    Time    -

Index Table

Row ID          Column Family  Column Qualifier  Column Visibility  Timestamp  Value
(808) 345-9876  contact        cellPhone:jane    public             Dec 2010   -
180             attribute      weight:john       public             Sep 2013   -
34 Baker LN     contact        homeAddr:john     public             Mar 2003   -
5’11”           attribute      height:bob        public             Jun 2012   -
MetLife         insurance      dental:bob        private            Sep 2009   -
VSP             insurance      vision:jane       private            Jan 2008   -
ab-             attribute      bloodType:jane    public             Jul 2011   -
amoxicillin     allergy        major:john        private            Feb 1988   -
doe             attribute      surname:bob       public             Jul 2013   -
doe             attribute      surname:jane      public             Aug 2013   -
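
The step from event table to index table can be sketched in plain Java, with a `TreeMap` standing in for the sorted index table. This is only an illustration, not Accumulo client code: the `Cell` class, the `'\0'` separator inside the flattened key, and the method names are all assumptions made for the example, and row IDs are assumed not to contain `:`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

/** Toy version of "index everything": one index entry per event-table entry. */
final class IndexEverythingDemo {
    private static final char SEP = '\0'; // hypothetical separator between key parts

    /** One event-table entry: row ID, column family, column qualifier, value. */
    static final class Cell {
        final String row, family, qualifier, value;
        Cell(String row, String family, String qualifier, String value) {
            this.row = row; this.family = family; this.qualifier = qualifier; this.value = value;
        }
    }

    /** Index key: value as row ID, same family, "qualifier:rowID" as qualifier; empty value. */
    static TreeMap<String, String> buildIndex(List<Cell> events) {
        TreeMap<String, String> index = new TreeMap<>();
        for (Cell c : events) {
            index.put(c.value + SEP + c.family + SEP + c.qualifier + ":" + c.row, "");
        }
        return index;
    }

    /** Finds event-table row IDs for a value with a short scan over the sorted index. */
    static List<String> rowsWithValue(TreeMap<String, String> index, String value) {
        List<String> rows = new ArrayList<>();
        String prefix = value + SEP;
        for (String key : index.tailMap(prefix, true).keySet()) {
            if (!key.startsWith(prefix)) {
                break; // past the last entry for this value
            }
            rows.add(key.substring(key.lastIndexOf(':') + 1));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Cell> events = Arrays.asList(
            new Cell("bob", "attribute", "surname", "doe"),
            new Cell("jane", "attribute", "surname", "doe"),
            new Cell("john", "allergy", "major", "amoxicillin"));
        TreeMap<String, String> index = buildIndex(events);
        System.out.println(rowsWithValue(index, "doe"));
    }
}
```

Because the index is sorted by value, finding every row with the surname `doe` touches only the matching entries rather than the whole event table.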

Data Lake

PATIENTS  MEDICINES  DOCTORS
INDEX

“Tell me everything you know of amoxicillin”

Data Lake

PATIENTS  DISEASES  DOCTORS
INDEX

amoxicillin:

• bob:allergy:amoxicillin

• larry:takes:amoxicillin

• Stomach ulcer:treatment:amoxicillin

• Infection:treatment:amoxicillin

• Diarrhea:side effect:amoxicillin

• smith:prescribed:amoxicillin

Visibility labels help converge data sources but still protect who can see them.

Graphs

[Figure: directed graph with nodes a, b, c, d, e]

Adjacency matrix (columns: start nodes; rows: end nodes):

    a  b  c  d  e
a   -     1
b   1  -
c         -  1
d   1     1  -  1
e               -

Row ID  Column Family  Column Qualifier  Value
a       edge           b                 1
a       edge           d                 1
c       edge           a                 1
c       edge           d                 1
d       edge           c                 1
e       edge           d                 1

• Random walk

• Neighborhoods

• Traversals

Each edge can have a visibility label!
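
A neighborhood query over the edge table can be sketched in plain Java, again with a `TreeMap` standing in for the sorted table. This is an assumption-laden illustration rather than Accumulo code: the flattened `start\0end` key, the class name, and the method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy edge table: key = start node + separator + end node, value = edge weight. */
final class GraphTableDemo {
    private static final char SEP = '\0'; // hypothetical separator between row ID and qualifier

    /** The six edges from the edge table. */
    static TreeMap<String, Integer> edgeTable() {
        TreeMap<String, Integer> edges = new TreeMap<>();
        String[][] pairs = {{"a", "b"}, {"a", "d"}, {"c", "a"}, {"c", "d"}, {"d", "c"}, {"e", "d"}};
        for (String[] p : pairs) {
            edges.put(p[0] + SEP + p[1], 1);
        }
        return edges;
    }

    /** Outgoing neighbors of a node: a scan over the single row for that node. */
    static List<String> neighbors(TreeMap<String, Integer> edges, String node) {
        List<String> result = new ArrayList<>();
        String prefix = node + SEP;
        for (String key : edges.tailMap(prefix, true).keySet()) {
            if (!key.startsWith(prefix)) {
                break;
            }
            result.add(key.substring(prefix.length()));
        }
        return result;
    }

    /** Nodes reachable from start within the given number of hops (breadth-first). */
    static Set<String> neighborhood(TreeMap<String, Integer> edges, String start, int hops) {
        Set<String> visited = new TreeSet<>();
        visited.add(start);
        List<String> frontier = new ArrayList<>();
        frontier.add(start);
        for (int hop = 0; hop < hops && !frontier.isEmpty(); hop++) {
            List<String> next = new ArrayList<>();
            for (String node : frontier) {
                for (String n : neighbors(edges, node)) {
                    if (visited.add(n)) {
                        next.add(n);
                    }
                }
            }
            frontier = next;
        }
        visited.remove(start);
        return visited;
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> edges = edgeTable();
        System.out.println(neighbors(edges, "a"));
        System.out.println(neighborhood(edges, "a", 2));
    }
}
```

Each hop is one row scan per frontier node, which is why storing the start node as the row ID suits random walks and traversals.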

Term-Partitioned Index

Query: “baseball” AND “football” AND “soccer”

Tablet Server 1 (knows about the term “baseball”)
Row ID    Column Family  Value
baseball  document       docid_3
baseball  document       docid_2
bat       document       docid_2
RESULTS: [docid_2, docid_3]

Tablet Server 2 (knows about the term “football”)
Row ID    Column Family  Value
football  document       docid_1
football  document       docid_3
glove     document       docid_1
RESULTS: [docid_1, docid_3]

Tablet Server 3 (knows about the term “soccer”)
Row ID    Column Family  Value
nba       document       docid_1
shoes     document       docid_1
soccer    document       docid_3
RESULTS: [docid_3]

Client: Client-side Set Intersection
[docid_2, docid_3]
[docid_1, docid_3]
[docid_3]

Visibility labels allow protected search

Iterators can maintain stats about docs
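
The client-side step can be sketched in plain Java: each term's document list is fetched separately (here from an in-memory map standing in for the three tablet servers) and the client intersects them. The class and method names are hypothetical, and no Accumulo calls are involved.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy term-partitioned index with a client-side AND query. */
final class TermIndexDemo {
    /** Term -> document IDs, matching the three tablet server tables. */
    static Map<String, Set<String>> sampleIndex() {
        Map<String, Set<String>> index = new HashMap<>();
        index.put("baseball", new TreeSet<>(Arrays.asList("docid_2", "docid_3")));
        index.put("bat", new TreeSet<>(Arrays.asList("docid_2")));
        index.put("football", new TreeSet<>(Arrays.asList("docid_1", "docid_3")));
        index.put("glove", new TreeSet<>(Arrays.asList("docid_1")));
        index.put("nba", new TreeSet<>(Arrays.asList("docid_1")));
        index.put("shoes", new TreeSet<>(Arrays.asList("docid_1")));
        index.put("soccer", new TreeSet<>(Arrays.asList("docid_3")));
        return index;
    }

    /** Looks up each term on its own, then intersects the result sets on the client. */
    static Set<String> andQuery(Map<String, Set<String>> index, String... terms) {
        Set<String> result = null;
        for (String term : terms) {
            Set<String> docs = index.getOrDefault(term, Collections.<String>emptySet());
            if (result == null) {
                result = new TreeSet<>(docs); // copy so the index itself is not modified
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<String>() : result;
    }

    public static void main(String[] args) {
        System.out.println(andQuery(sampleIndex(), "baseball", "football", "soccer"));
    }
}
```

The trade-off of this layout is that every term's full posting list travels to the client before the intersection shrinks it.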

Geospatial Indexing: Grid Squares

Geospatial Indexing: Z-Order Curve

33.333W, 55.555N = 3535.353535 (the digits of the two coordinates are interleaved)

3535.353535 is the row key
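
The digit interleaving in the example can be sketched as a short Java helper. This is a minimal illustration under stated assumptions: the class and method names are hypothetical, both coordinates are assumed to be formatted with the same number of digits on each side of the decimal point, and hemisphere or sign handling is left out. Production Z-order implementations commonly interleave bits rather than decimal digits.

```java
/** Toy Z-order row key: interleaves the decimal digits of two coordinates. */
final class ZOrderDemo {
    /**
     * Interleaves two equally formatted coordinate strings digit by digit,
     * keeping a single decimal point in the same relative position.
     */
    static String interleaveDigits(String first, String second) {
        if (first.length() != second.length() || first.indexOf('.') != second.indexOf('.')) {
            throw new IllegalArgumentException("coordinates must share the same format");
        }
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < first.length(); i++) {
            char a = first.charAt(i);
            char b = second.charAt(i);
            if (a == '.') {
                key.append('.');
            } else {
                key.append(a).append(b);
            }
        }
        return key.toString();
    }

    public static void main(String[] args) {
        System.out.println(interleaveDigits("33.333", "55.555"));
    }
}
```

Points that are close in both dimensions tend to share a key prefix, so a range scan over row keys covers a compact region of the grid.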

Temporal Indexing

Row ID    Column Family  Column Qualifier  Value
Router37  2014-12        1418624102        cold
Router37  2015-01        1421633979        cold
Router37  2015-01        1421634319        hot
Router37  2015-01        1421635001        cold
Server92  2014-12        1418555102        cold
Server92  2014-12        1418556999        hot
Server92  2014-12        1418651002        cold
Server92  2014-12        1418756987        hot
Server92  2014-12        1418853304        cold
Server98  2014-12        1418555104        cold
Server98  2015-01        1421633319        cold

Note: Dynamically adding column families

Resources

Apache Accumulo website

accumulo.apache.org

Accumulo Summit 2014

accumulosummit.com

slideshare.net/AccumuloSummit

Accumulo Summit 2015

End of April!

accumulosummit.com