Accumulo Summit 2015: Using Fluo to incrementally process data in Accumulo [API]
Mike Walch
Using Fluo to incrementally process data in Accumulo
Problem: Maintain counts of inbound links
Example Data
Website        # Inbound Links
fluo.io        0
github.com     3
apache.org     2
nytimes.com    0
Example Graph
[Figure: link graph connecting fluo.io, github.com, apache.org, and nytimes.com]
Solution 1 - Maintain counts using batch processing
Link count change log
Website        # Inbound
fluo.io        +1
github.com     -1
apache.org     +1
github.com     -1
nytimes.com    +1
apache.org     +1
Last Hour Aggregates
Website        # Inbound
fluo.io        +1
github.com     -23
apache.org     +65
nytimes.com    +105
Historical
Website        # Inbound
fluo.io        53
github.com     1,385,192
apache.org     2,528,190
nytimes.com    53,395,000
Latest Counts
Website        # Inbound
fluo.io        54
github.com     1,385,169
apache.org     2,528,255
nytimes.com    53,395,105
[Diagram: Internet, WebCrawler, WebCache, and two MapReduce jobs]
Solution 2 - Maintain counts using Fluo
Fluo Table
Website        # Inbound
fluo.io        53
github.com     1,385,192
apache.org     2,528,190
nytimes.com    53,395,000
[Diagram: Internet, WebCrawler, and WebCache; the WebCrawler sends +1 and -1 updates to the Fluo table]
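The incremental approach amounts to applying each +1 or -1 from the crawler directly to the stored count, instead of recomputing totals in a batch job. A minimal plain-Java sketch of that idea follows. It assumes an in-memory map as a stand-in for the Fluo table; the class and method names are hypothetical and it does not use the Fluo API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the Fluo table: website -> inbound link count.
class LinkCounts {
    private final Map<String, Long> inbound = new HashMap<>();

    // Apply one change from the crawler: +1 for a new link, -1 for a removed link.
    void apply(String site, long delta) {
        inbound.merge(site, delta, Long::sum);
    }

    // Sites that have never been updated report a count of zero.
    long get(String site) {
        return inbound.getOrDefault(site, 0L);
    }
}
```

Each update touches only the affected site, so the cost depends on the number of changes rather than on the size of the full table.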
Solution 3 - Use both: update popular sites using batch processing & update long tail using Fluo
[Chart: Website Distribution by # inbound links, marking nytimes.com, github.com, and fluo.io. The popular head is updated every hour using MapReduce; the long tail is updated in real time using Fluo.]
Fluo 101 - Basics
- Provides cross-row transactions and snapshot isolation which makes it safe to do concurrent updates
- Allows for incremental processing of data
- Based on Google’s Percolator paper
- Started as a side project by Keith Turner in 2013
- Originally called Accismus
- Tested using synthetic workloads
- Almost ready for production environments
Fluo 101 - Accumulo vs Fluo
- Fluo is a transactional API built on top of Accumulo
- Fluo stores its data in Accumulo
- Fluo uses Accumulo conditional mutations for transactions
- Fluo has a table structure (row, column, value) similar to Accumulo except Fluo has no timestamp
- Each Fluo application runs its own processes
  - Oracle allocates timestamps for transactions
  - Workers run user code (called observers) that performs transactions
Fluo 101 - Architecture
[Diagram: Fluo architecture]
- Infrastructure: Accumulo, HDFS, Zookeeper, YARN
- Tables: Table1, Table2
- Client Cluster: Fluo Clients for App 1 and a Fluo Client for App 2
- Fluo Application 1: Fluo Oracle and Fluo Workers running Observer1 and Observer2
- Fluo Application 2: Fluo Oracle and a Fluo Worker running ObserverA
Fluo 101 - Client API
Used by developers to ingest data or interact with Fluo from external applications (REST services, crawlers, etc)
public void addDocument(FluoClient fluoClient, String docId, String content) {
  TypeLayer typeLayer = new TypeLayer(new StringEncoder());
  try (TypedTransaction tx1 = typeLayer.wrap(fluoClient.newTransaction())) {
    // Only write the document if it is not already in the table
    if (tx1.get().row(docId).col(CONTENT_COL).toString() == null) {
      tx1.mutate().row(docId).col(CONTENT_COL).set(content);
      tx1.commit();
    }
  }
}
Fluo 101 - Observers
- Developers can write observers, which Fluo workers run when an observed column is modified.
- Best practice: Do work/transactions in observers over client code
public class DocumentObserver extends TypedObserver {

  @Override
  public void process(TypedTransactionBase tx, Bytes row, Column column) {
    // do work here
  }

  @Override
  public ObservedColumn getObservedColumn() {
    return new ObservedColumn(CONTENT_COL, NotificationType.STRONG);
  }
}
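To make the trigger mechanism concrete, here is a toy plain-Java model of it: writing to an observed column queues a notification, and a worker later runs the matching observer. This is only a sketch under simplifying assumptions (single-threaded, in-memory, no transactions). `MiniTable` and everything in it are hypothetical names, not part of the Fluo API.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory table that mimics observer notifications.
class MiniTable {
    interface Observer {
        void process(MiniTable table, String row, String column);
    }

    private final Map<String, String> cells = new HashMap<>();
    private final Map<String, Observer> observers = new HashMap<>();
    private final Deque<String[]> notifications = new ArrayDeque<>();

    // Register an observer for one column.
    void observe(String column, Observer observer) {
        observers.put(column, observer);
    }

    // Write a cell; if the column is observed, queue a notification.
    void set(String row, String column, String value) {
        cells.put(row + "|" + column, value);
        if (observers.containsKey(column)) {
            notifications.add(new String[] {row, column});
        }
    }

    String get(String row, String column) {
        return cells.get(row + "|" + column);
    }

    // Play the role of a worker: drain the queue and run the observers.
    int runWorker() {
        int processed = 0;
        while (!notifications.isEmpty()) {
            String[] n = notifications.poll();
            observers.get(n[1]).process(this, n[0], n[1]);
            processed++;
        }
        return processed;
    }
}
```

In this model the client only writes the observed column; derived values appear after `runWorker()` has run, which mirrors the advice to do the real work in observers rather than in client code.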
Example Fluo Application
- Problem: Maintain word & document counts as documents are added and deleted from Fluo in real time
- Fluo client performs two actions:
  1. Add document to table
  2. Mark document for deletion
- These actions trigger two observers:
  - Add Observer - increase word and document counts
  - Delete Observer - decrease counts and clean up
Add first document to table
Fluo Table
Row            Column   Value
d : doc1       doc      my first hello world
[Diagram: Fluo Client (in the Client Cluster), Fluo Table, AddObserver, DeleteObserver]
An observer increments word counts
Fluo Table
Row            Column   Value
d : doc1       doc      my first hello world
w : first      cnt      1
w : hello      cnt      1
w : my         cnt      1
w : world      cnt      1
total : docs   cnt      1
A second document is added
Fluo Table
Row            Column   Value
d : doc1       doc      my first hello world
d : doc2       doc      second hello world
w : first      cnt      1
w : hello      cnt      2
w : my         cnt      1
w : second     cnt      1
w : world      cnt      2
total : docs   cnt      2
First document is marked for deletion
Fluo Table
Row            Column   Value
d : doc1       doc      my first hello world
d : doc1       delete
d : doc2       doc      second hello world
w : first      cnt      1
w : hello      cnt      2
w : my         cnt      1
w : second     cnt      1
w : world      cnt      2
total : docs   cnt      2
Observer decrements counts and deletes document
Fluo Table
Row            Column   Value
d : doc1       doc      my first hello world
d : doc1       delete
d : doc2       doc      second hello world
w : first      cnt      1
w : hello      cnt      1
w : my         cnt      1
w : second     cnt      1
w : world      cnt      1
total : docs   cnt      1
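The bookkeeping done by the two observers in this walkthrough can be sketched in plain Java. The sketch assumes in-memory maps in place of the Fluo table and plain method calls in place of observer notifications. `WordCounts` and its methods are hypothetical names, and removing a word once its count reaches zero is a choice made for this sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory version of the word and document counts.
class WordCounts {
    private final Map<String, String> docs = new HashMap<>();
    private final Map<String, Integer> wordCounts = new HashMap<>();
    private int docTotal = 0;

    // Role of the Add Observer: increase word counts and the document total.
    void addDocument(String docId, String content) {
        if (docs.putIfAbsent(docId, content) != null) {
            return; // document already present, nothing to count
        }
        for (String word : content.split("\\s+")) {
            wordCounts.merge(word, 1, Integer::sum);
        }
        docTotal++;
    }

    // Role of the Delete Observer: decrease counts and clean up.
    void deleteDocument(String docId) {
        String content = docs.remove(docId);
        if (content == null) {
            return; // unknown document
        }
        for (String word : content.split("\\s+")) {
            // Returning null removes the entry once its count would reach zero.
            wordCounts.computeIfPresent(word, (w, n) -> n > 1 ? n - 1 : null);
        }
        docTotal--;
    }

    int count(String word) {
        return wordCounts.getOrDefault(word, 0);
    }

    int totalDocs() {
        return docTotal;
    }
}
```

Adding "my first hello world" as doc1 and "second hello world" as doc2 gives a count of 2 for both hello and world and a document total of 2. Deleting doc1 brings all three values down to 1.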
Things to watch out for...
- Collisions occur when two transactions update the same data at the same time
  - Only one transaction will succeed. The others need to be retried.
  - A few collisions are fine, but too many can slow computation
  - Avoid collisions by not updating the same row/column on every transaction
- Write skew occurs when two transactions read an overlapping data set and make disjoint updates without seeing each other's update
  - The result differs from what a serial execution of the transactions would produce
  - Prevent write skew by making both transactions update the same row/column. If they run concurrently, a collision will occur and only one transaction will succeed.
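Both effects can be reproduced with a toy snapshot-isolation store. The sketch below is single-threaded, in-memory, plain Java and is not how Fluo is implemented. `SnapshotStore`, `Tx`, and the key names used in the scenario are all hypothetical. A commit fails only when a key the transaction wrote was committed by someone else after the transaction started.

```java
import java.util.HashMap;
import java.util.Map;

// Toy store with snapshot isolation; a teaching model, not the Fluo API.
class SnapshotStore {
    private final Map<String, Integer> values = new HashMap<>();
    private final Map<String, Long> lastCommitTs = new HashMap<>();
    private long clock = 0;

    class Tx {
        private final long startTs = clock;
        // Reads see the data as of the moment the transaction started.
        private final Map<String, Integer> snapshot = new HashMap<>(values);
        private final Map<String, Integer> writes = new HashMap<>();

        int get(String key) {
            return snapshot.getOrDefault(key, 0);
        }

        void set(String key, int value) {
            writes.put(key, value);
        }

        // Returns false (a collision) if any written key was committed
        // by another transaction after this one started.
        boolean commit() {
            for (String key : writes.keySet()) {
                if (lastCommitTs.getOrDefault(key, 0L) > startTs) {
                    return false;
                }
            }
            clock++;
            for (Map.Entry<String, Integer> e : writes.entrySet()) {
                values.put(e.getKey(), e.getValue());
                lastCommitTs.put(e.getKey(), clock);
            }
            return true;
        }
    }

    Tx begin() {
        return new Tx();
    }

    int read(String key) {
        return values.getOrDefault(key, 0);
    }
}
```

Suppose keys a and b both start at 1 and the application wants at least one of them to stay at 1. Two concurrent transactions each read both keys, see a sum of 2, and then zero a different key. Their write sets are disjoint, so both commits succeed and the sum ends at 0: that is write skew. If each transaction also writes a shared guard key, the second commit reports a collision and can be retried against fresh data.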
How does Fluo fit in?
[Chart: large join throughput (lower to higher) versus processing latency (slower to faster), positioning three approaches]
- Batch Processing: MapReduce, Spark
- Incremental Processing: Fluo, Percolator
- Stream Processing: Storm
Don’t use Fluo if...
1. You want to do ad-hoc analysis on your data (use batch processing instead)
2. Your incoming data is being joined with a small data set (use stream processing instead)
Use Fluo if...
1. You want to maintain a large scale computation using a series of small transaction updates
2. Periodic batch processing jobs are taking too long to join new data with existing data
Fluo Application Lifecycle
1. Use batch processing to seed computation with historical data
2. Use Fluo to process incoming data and maintain computation in real-time
3. While processing, Fluo can be queried and users can be notified
Major Progress
- 2010: Google releases Percolator paper
- 2013: Keith Turner starts work on Percolator implementation for Accumulo as a side project (originally called Accismus)
- 2014 to 2015:
  - Fluo can process transactions
  - 1.0.0-alpha released
  - Oracle and worker can be run in YARN
  - Changed project name to Fluo
  - 1.0.0-beta releasing soon
  - Solidified Fluo Client/Observer API
  - Automated running Fluo cluster on Amazon EC2
  - Multi-application support
  - Improved how observer notifications are found
  - Created Stress Test
Fluo Stress Test
- Motivation: Needed test that stresses Fluo and is easy to verify for correctness
- The stress test computes the number of unique integers by building a bitwise trie
- New integers are added at leaf nodes
- Observers watch all nodes, create parents, and percolate total up to root node
- Test runs successfully if the count at the root is the same as the number of leaf nodes
- Multiple transactions can operate on same nodes causing collisions
[Figure: example trie over 4-bit integers. Root xxxx = 5; children 11xx = 3, 10xx = 0, 01xx = 1, 00xx = 1; leaves shown include 1110, 1100, 0101, and 0001.]
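The counting scheme can be sketched in plain Java. This is a synchronous, single-threaded simplification: in the real test, observers percolate totals asynchronously inside transactions, while here each new leaf updates its ancestors directly. The class name, the 4-bit width, and the two bits per level are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory trie that counts unique 4-bit integers.
class UniqueCountTrie {
    private static final int BITS = 4;           // width of each integer (assumed)
    private static final int BITS_PER_LEVEL = 2; // bits consumed per trie level (assumed)

    private final Map<String, Integer> counts = new HashMap<>();

    // Node key for the first prefixLen bits of value, for example "11xx".
    static String key(int value, int prefixLen) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < BITS; i++) {
            if (i < prefixLen) {
                sb.append((value >> (BITS - 1 - i)) & 1);
            } else {
                sb.append('x');
            }
        }
        return sb.toString();
    }

    // Add a leaf; only a new leaf changes the totals of its ancestors.
    void add(int value) {
        String leaf = key(value, BITS);
        if (counts.containsKey(leaf)) {
            return; // duplicate integer, already counted
        }
        counts.put(leaf, 1);
        for (int len = BITS - BITS_PER_LEVEL; len >= 0; len -= BITS_PER_LEVEL) {
            counts.merge(key(value, len), 1, Integer::sum);
        }
    }

    int count(String node) {
        return counts.getOrDefault(node, 0);
    }
}
```

Verification is a single comparison: after all integers are added, the count stored at the root key `xxxx` must equal the number of distinct leaves, regardless of duplicates in the input.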
Easy to run Fluo
1. On machine with Maven+Git, clone the fluo-dev and fluo repos
2. Follow some basic configuration steps
3. Run the following commands
fluo-dev download   # Downloads Accumulo, Hadoop, Zookeeper tarballs
fluo-dev setup      # Sets up Accumulo, Hadoop, etc. locally
fluo-dev deploy     # Builds Fluo distribution and deploys it locally
fluo new myapp      # Creates configuration for 'myapp' Fluo application
fluo init myapp     # Initializes 'myapp' in Zookeeper
fluo start myapp    # Starts the oracle and worker processes of 'myapp' in YARN
fluo scan myapp     # Prints snapshot of data in Fluo table of 'myapp'
It’s just as easy to run a Fluo cluster on Amazon EC2
Fluo Ecosystem
- fluo: Main project repo
- fluo-quickstart: Simple Fluo example
- fluo-stress: Stresses Fluo on cluster
- fluo-io.github.io: Fluo project website
- phrasecount: In-depth Fluo example
- fluo-deploy: Run Fluo on EC2 cluster
- fluo-dev: Helps developers run Fluo locally
Future Direction
- Primary focus: Ship a production-ready 1.0 release with a stable API
- Other possible work:
  - Fluo-32: Real world example application (possibly using CommonCrawl data)
  - Fluo-58: Support writing observers in Python
  - Fluo-290: Support running Fluo on Mesos
  - Fluo-478: Automatically scale up & down Fluo workers based on workload
Get involved!
1. Experiment with Fluo
  - API has stabilized
  - Tools and development process make it easy
  - Not recommended for production yet (wait for 1.0)
2. Contribute to Fluo
  - ~85 open issues on GitHub
  - Review-then-commit process