S3MPER
S3mper is a library that provides an additional layer of
consistency checking on top of Amazon's S3 index through use of a
consistent, secondary index.
Efficient ETL with Cassandra
Offline Analysis
Evolution Speed!
We Want to Aggregate, Index, and Query Data in Real Time
Interactive Exploration
Let's walk through some use cases
client activity event * /name = movieStarts
Pipeline Challenges
App owners: send and forget
Data scientists: validation, ETL, batch processing
DevOps: stream processing, targeted search
Message Routing
We Want to Consume Data Selectively in Different Ways
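Selective consumption can be pictured as a set of routes, each pairing a sink with a filter over the event. The following is a minimal sketch only, not the pipeline's actual router: the class `EventRouter`, the sink names, and the map-based event shape are all hypothetical, chosen for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch: deliver each event to every sink whose filter matches.
final class EventRouter {
    // Sink name -> filter over the event's attributes (illustrative shapes).
    private final Map<String, Predicate<Map<String, String>>> routes = new LinkedHashMap<>();

    void addRoute(String sink, Predicate<Map<String, String>> filter) {
        routes.put(sink, filter);
    }

    // Returns the names of all sinks that should receive the event,
    // in the order the routes were registered.
    List<String> route(Map<String, String> event) {
        List<String> sinks = new ArrayList<>();
        for (Map.Entry<String, Predicate<Map<String, String>>> entry : routes.entrySet()) {
            if (entry.getValue().test(event)) {
                sinks.add(entry.getKey());
            }
        }
        return sinks;
    }

    public static void main(String[] args) {
        EventRouter router = new EventRouter();
        // Everything goes to the archive; only movieStarts events are indexed.
        router.addRoute("archive", event -> true);
        router.addRoute("realtime-index", event -> "movieStarts".equals(event.get("name")));
        System.out.println(router.route(Map.of("name", "movieStarts")));
        System.out.println(router.route(Map.of("name", "heartbeat")));
    }
}
```

Keeping the filters as plain predicates means a new consumer can be added without touching the producers, which matches the "send and forget" expectation of app owners.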
Message broker
High-throughput
Persistent and replicated
There Is More
Intelligent Alerts
Guided Debugging in the Right Context
What We Need
Ad-hoc query with different dimensions
Quick aggregations and Top-N queries
Time series with flexible filters
Quick access to raw data using boolean queries
Druid
Rapid exploration of high dimensional data
Fast ingestion and querying
Time series
Real-time indexing of event streams
Killer feature: boolean search
Great UI: Kibana
The Old Pipeline
The New Pipeline
There Is More
It's Not All About Counters and Time Series
RequestId    Parent Id  Node Id  Service Name  Status
4965-4a74    0          123      Edge Service  200
4965-4a74    123        456      Gateway       200
4965-4a74    456        789      Service A     200
4965-4a74e   456        abc      Service B     200
Status:200
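The parent id and node id columns are enough to rebuild the call tree for a request. The sketch below illustrates that idea under stated assumptions: the `TraceTree` and `Span` names are hypothetical, the input is assumed to hold the spans of a single request, parent id "0" is assumed to mark the root, and the spans are assumed to form a tree with no cycles.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: rebuild a call tree from (parent id, node id) rows.
final class TraceTree {
    // One row of the tracing table, without the request id.
    static final class Span {
        final String parentId;
        final String nodeId;
        final String service;
        final int status;

        Span(String parentId, String nodeId, String service, int status) {
            this.parentId = parentId;
            this.nodeId = nodeId;
            this.service = service;
            this.status = status;
        }
    }

    // Groups spans by parent id so that children can be looked up per node.
    static Map<String, List<Span>> childrenByParent(List<Span> spans) {
        Map<String, List<Span>> children = new LinkedHashMap<>();
        for (Span span : spans) {
            children.computeIfAbsent(span.parentId, key -> new ArrayList<>()).add(span);
        }
        return children;
    }

    // Renders the tree depth-first, starting from the assumed root parent "0".
    static String render(List<Span> spans) {
        StringBuilder out = new StringBuilder();
        append(childrenByParent(spans), "0", 0, out);
        return out.toString();
    }

    private static void append(Map<String, List<Span>> children, String parentId,
                               int depth, StringBuilder out) {
        for (Span span : children.getOrDefault(parentId, List.of())) {
            out.append("  ".repeat(depth))
               .append(span.service).append(" (").append(span.status).append(")\n");
            append(children, span.nodeId, depth + 1, out);
        }
    }

    public static void main(String[] args) {
        List<Span> spans = List.of(
            new Span("0", "123", "Edge Service", 200),
            new Span("123", "456", "Gateway", 200),
            new Span("456", "789", "Service A", 200),
            new Span("456", "abc", "Service B", 200));
        System.out.print(render(spans));
    }
}
```

With the four rows from the table, the Gateway appears under the Edge Service, and Service A and Service B appear as siblings under the Gateway.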
Distributed Tracing
A System that Supports All These
A Data Pipeline To Glue Them All
Make It Simple
Message Producing
Simple and Uniform API
    messageBus.publish(event)
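Consumption Is Simple Too
    consumer.observe().subscribe(new Subscriber<Ackable>() {
        @Override
        public void onNext(Ackable ackable) {
            // Process the event, then acknowledge it.
            process(ackable.getEntity(MyEventType.class));
            ackable.ack();
        }

        @Override
        public void onError(Throwable error) {
            // Handle stream errors here.
        }

        @Override
        public void onCompleted() {
            // The stream has ended.
        }
    });
    consumer.pause();
    consumer.resume();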
RxJava
Functional reactive programming model
Powerful streaming API
Separation of logic and threading model
Design Decisions
Top Priority: app stability and throughput
Asynchronous operations
Aggressive buffering
Drops messages if necessary
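These decisions can be illustrated with a bounded in-memory buffer that never blocks the publishing application and counts what it has to drop. This is a sketch under assumptions, not the pipeline's real client: `DroppingBuffer` and its method names are hypothetical, and the background sender that would call `drain` is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: a bounded buffer that drops messages instead of blocking.
final class DroppingBuffer<T> {
    private final BlockingQueue<T> queue;
    private final AtomicLong dropped = new AtomicLong();

    DroppingBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Never blocks the caller: when the buffer is full the message is
    // dropped, the drop is counted, and false is returned.
    boolean publish(T message) {
        boolean accepted = queue.offer(message);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    // Intended for a background sender: removes up to maxBatch buffered
    // messages in arrival order.
    List<T> drain(int maxBatch) {
        List<T> batch = new ArrayList<>();
        queue.drainTo(batch, maxBatch);
        return batch;
    }

    long droppedCount() {
        return dropped.get();
    }

    public static void main(String[] args) {
        DroppingBuffer<String> buffer = new DroppingBuffer<>(2);
        buffer.publish("a");
        buffer.publish("b");
        buffer.publish("c"); // buffer is full, so this one is dropped
        System.out.println(buffer.drain(10) + " dropped=" + buffer.droppedCount());
    }
}
```

The trade-off is explicit: the application thread pays only for a non-blocking `offer`, and data loss under sustained overload is accepted in exchange for app stability.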
Anything Can Fail
Cloud Resiliency
Fault Tolerance Features
Write and forward with auto-reattached EBS (Amazon's Elastic Block Store)
Disk-backed queue: big-queue
Customized scaling down
There's More to Do
Contribute to @NetflixOSS
Join us :-)
Summary
http://netflix.github.io +
You can build your own web-scale data pipeline using open
source components