One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP Randy Guck...

24
One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP Randy Guck Principal Engineer Dell Software Group

Transcript of One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP Randy Guck...

One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP

Randy GuckPrincipal EngineerDell Software Group

What is Doradus? Storage and query service

Leverages Cassandra NoSQL DB

Pure Java

- Stateless

- Embeddable or standalone

Open source: Apache 2.0 License

30 Doradus: The Tarantula Nebula Source: Hubble Space Telescope

Why Use Doradus? Easy to use: no client driver

Spider storage manager

- Good for unstructured data

OLAP storage manager

- Near real time data warehousing

Compared to Cassandra alone:

- Data model, searching, analytics

Compared to Hadoop:

- Fast data loads and queries

- Dense storage: less hardware

Cassandra

Data

Applications

REST API

OLAPSpide

r

Doradus

A Multi-Node Cluster

Doradus

Cassandra

Data

Node 2

Cassandra

Data

Doradus

Cassandra

Data

Node 1 Node 3

Applications

REST API

Secondary Doradusinstances are optional

Why Did We Build Doradus OLAP? Some tough customer requirements:

- Statistical queries most important

- Need to scan millions of objects/second

- User-customizable “insights” = millions of possible queries

Couldn’t use indexes, pre-computed queries, etc.

Disk physics

- ~100's of random reads/second

- ~1000's of serial reads/second

Needed a radically new approach!

Doradus OLAP Combines ideas from:

- Online Analytical Processing: data arranged in static cubes

- Columnar databases: Column-oriented storage and compression

- NoSQL databases: Sharding

Features:

- Fast loading: up to 500K objects/second/node

- Dense storage: 1 billion objects in 2 GB!

- Fast cube merging: typically seconds

- No indexes!

Example: Message Tracking Schema

Message Participant Address

PersonManager

Employees

PersonAddress

AttachmentsMessage

ParticipantsMessage

AddressParticipants

Attachment

DQL Object Queries Builds on Lucene syntax

- Full text queries

Adds link paths

- Directed graph searches

- Quantifiers and filters

- Transitive searches

Other features

- Stateless paging

- Sorting

Examples:

- LastName = Smith AND NOT (FirstName : Jo*) AND BirthDate = [1986 TO 1992]

- ALL(Participants).ANY(Address.WHERE(Email='*.gmail.com')).Person.Department : support

- Employees^(4).Office='San Jose’

DQL Aggregate Queries Metric functions

- COUNT, AVERAGE, MIN, MAX, DISTINCT, ...

Multi-level grouping

Grouping functions

- BATCH, BOTTOM, FIRST, LAST, LOWER, SETS, TERMS, TOP, TRUNCATE, UPPER, WHERE, ...

Examples:

- metric=COUNT(*), AVERAGE(Size), MIN(Participants.Address.Person.Birthdate)

- metric=DISTINCT(Attachments.Extension);groups=Tags,

Participants.Address.Person.Department;query=Attachments.Size > 100000

- metric=AVERAGE(Size);groups=TOP(10,Participants.Address.Email)

OLAP Data Loading

EventsEventsEvents

EventsEventsPeople

EventsEventsComputers

EventsEventsDomains

Sources

OLAP Data Loading

Batch 1

EventsEventsEvents

EventsEventsPeople

EventsEventsComputers

EventsEventsDomains

Batch 2

Batch 3

...

Sources Batches

Batch 4

OLAP Data Loading

Batch 1

EventsEventsEvents

EventsEventsPeople

EventsEventsComputers

EventsEventsDomains

Batch 2

Batch 3

...

2014-03-01

2014-02-28

2014-02-27

Sources Batches Shards

Batch 4

Merge

OLAP Data Loading

Batch 1

EventsEventsEvents

EventsEventsPeople

EventsEventsComputers

EventsEventsDomains

Batch 2

Batch 3

...

2014-03-01

2014-02-28

2014-02-27

Sources Batches Shards OLAP Store

Batch 4

Merge

Storing Batches

Field ValuesID 5amhvv7J2otBu48Z6PE5cA 7CgvDf5mOU78jNVc58eu cZpz2q4Jf8Rc2HK9Cg08 ...

Size 48120 5435 24220 ...

SendDate 1280246462000 1279354872112 1279357261413 ...

Priority 0 0 1 ...

Subject.txt ballades encash nautch colloquy geared

nettlier outdoors culvert hypothec winder

stolons ungot guiding rupiahs outgone

...

Subject 1 2 0 ...

...

Data is sorted by object ID and stored as columnar, compressed blobs

Key ColumnsEmail/Message/2014-03-01/{Batch GUID}/ID [compressed data]

Email/Message/2014-03-01/{Batch GUID}/Size [compressed data]

Email/Message/2014-03-01/{Batch GUID}/SendDate [compressed data]

... ......

OLAP Table

Field Value Arrays

Compressed rows

Merging Batches

Key Columns

Email/Message/2014-03-01/ID [compressed data]

Email/Message/2014/03-01/Size [compressed data]

Email/Message/2014-03-01/SendDate [compressed data]

... ...

Email/Person/2014-03-01/ID [compressed data]

Email/Person/2014-03-01/FirstName [compressed data]

Email/Person/2014-03-01/LastName [compressed data]

... ...

Email/Address/2014-03-01/ID [compressed data]

Email/Address/2014-03-01/Person [compressed data]

Email/Address/2014/-03-01/Message [compressed data]

... ...

Email/Message/2014-02-28/ID [compressed data]

Email/Message/2014-02-28/Size [compressed data]

...

Batch #1: Shard 2014-03-01Message Table

ID ...

Size ...

SendDate ...

...

Batch #2: Shard 2014-03-01Message Table

ID ...

Size ...

SendDate ...

...

...

OLAP Store

Person Table

ID ...

FirstName ...

Lastname ...

...

Address Table

ID ...

Person ...

Messages ...

...

Person Table

ID ...

FirstName ...

Lastname ...

...

Address Table

ID ...

Person ...

Messages ...

...

Message table dataShard 2014-03-01

Person table dataShard 2014-03-01

Address table dataShard 2014-03-01

Data for other shards

Does Merging Take Long?

OLAP Query Execution Example query:

- Count messages with Size between 1000-10000 andHasBeenSent=false in shards 2014-03-01 to 2014-03-31

How many rows are read?

- 2 fields x 31 shards = 62 rows

- Typically represents millions of values

Value arrays are scanned in memory

Physical rows are read on “cold” start only

- Multiple caching levels for “warm” and “hot” data

1 Billion Objects in 2GB?Example Security Event (CSV format):

Fixed Fields Variable FieldsComputer Name MAILSERVER18 1 MAILSERVER18$Log Name Security 2Time Stamp Sun, 22 Jan 2013 08:09:50 UTC 3 WorkstationType Success Audit 4 (0x0,0x142999A)Source Security 5 3Category Logon/Logoff 6 KerberosEvent ID 540 7 KerberosUser Domain NT AUTHORITYUser Name SYSTEM  User SID S-1-5-18

MAILSERVER18,Security,"Sun, 22 Jan 2013 08:09:50 UTC","Success Audit",Security,"Logon/Logoff", 540,"NT AUTHORITY",SYSTEM,S-1-5-18,7,MAILSERVER18$,,Workstation,"(0x0,0x142999A)",3,Kerberos,Kerberos

Events Schema

EventsInsertio

nStrings

Fields:• ComputerName (text)• LogName (text)• Timestamp (timestamp)• Type (text)• Source (text)

Fields:• Index (integer)• Value (text)• Event (link)

Count: 115 Million Count: 880 Million

Params

Event (inverse)

• Category (text)• EventID (integer)• UserDomain (text)• UserSID (text)• Params (link)

Event Schema Load Load stats:

Total shards: 860

Total events: 114,572,247

Total ins strings: 879,529,753

Total objects: 994,102,000

Total load time: 2 hours, 2 minutes, 36 seconds (MacBook Air)

Space usage::nodetool -h localhost status

Datacenter: datacenter1

=======================

Status=Up/Down

|/ State=Normal/Leaving/Joining/Moving

-- Address Load Owns Host ID Token Rack

UN 127.0.0.1 1.96 GB 100.0% 860887ef-2027-431a-a425-c67a9445d0e6 -9176223118562734495 rack1

Demo 1) Count all Events in all shards

- 860 shards => 115M events

2) Find the top 5 hours-of-the-day when certain privileged events fail:

- Event IDs are any of 577, 681, 529

- Event type is ‘Failure Audit’

- Insertion string 8 is (0x0,0x3E7)

- Event occurred in first half of 2005 (181 shards)

Doradus OLAP Summary Advantages:

Simple REST API

All fields are searchable without indexes

Ad-hoc statistical searches

Support for graph-based queries

Near real time data warehousing

Dense storage = less hardware

Horizontally scalable when needed

Doradus OLAP Summary Good for applications where data:

Is continuous/streaming

Is structured to semi-structured

Can be loaded in batches

Is partitionable, especially by time

Is typically queried in a subset of shards

Emphasizes statistical queries

Thank You! Where to find Doradus

- Source: github.com/dell-oss/Doradus

- Downloads: search.maven.org

Contact me

- [email protected]

- @randyguck