Lambda architecture: from zero to One
Lambda Architecture: from zero to One
Serhiy Masyutin
Me
- Staff Engineer @ Lohika
- Passionate Developer
- Father
- Mountain Biker
Agenda
- Project Overview
- Architecture Evolution
- What is Lambda Architecture?
- Cluster Evolution
- What We Achieved?
Project Overview
Project Goals
- Portfolio-driven R&D project
- Focus on Technology
- Focus on Knowledge
- Focus on a new remote Team
Service designed to offload the highly concurrent scenario of live voting
- User casts a vote
- User requests results on a campaign
- Manager requests reports on campaigns
- Admin controls the system
Architecture Goals
- SaaS Solution
- High Throughput
- Scalability
- Low Latency
Software as a Service
Essential Data Model
campaign { startDate, endDate }
vote { user, campaign, timestamp }
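The two entities above could be sketched in Java roughly as follows. This is only an illustration: the class names, the field types and the `accepts` rule (a vote counts only inside the campaign window) are assumptions, not part of the original slides.

```java
import java.time.Instant;

// Hypothetical Java shape for the campaign/vote model; names are illustrative.
final class VotingModel {

    static final class Campaign {
        final Instant startDate;
        final Instant endDate;

        Campaign(Instant startDate, Instant endDate) {
            this.startDate = startDate;
            this.endDate = endDate;
        }

        // Assumed rule: a vote counts only inside [startDate, endDate).
        boolean accepts(Vote vote) {
            return vote.campaign == this
                    && !vote.timestamp.isBefore(startDate)
                    && vote.timestamp.isBefore(endDate);
        }
    }

    static final class Vote {
        final String user;
        final Campaign campaign;
        final Instant timestamp;

        Vote(String user, Campaign campaign, Instant timestamp) {
            this.user = user;
            this.campaign = campaign;
            this.timestamp = timestamp;
        }
    }

    public static void main(String[] args) {
        Campaign campaign = new Campaign(
                Instant.parse("2015-09-01T00:00:00Z"),
                Instant.parse("2015-09-02T00:00:00Z"));
        Vote vote = new Vote("alice", campaign, Instant.parse("2015-09-01T12:00:00Z"));
        System.out.println("accepted: " + campaign.accepts(vote));
    }
}
```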
Architecture Evolution
Start Simple [diagram: Votes, Reports]
MariaDB 5.5.44 (no second-level cache in our persistence layer; all tests started with an empty DB; connection pooling)
Start Simple
- Java 8
- Spring Boot 1.2.5
- MariaDB 5.5
- AngularJS 1.4
Spring Boot makes it easy to create stand-alone, production-grade Spring based Applications that you can "just run".
Spring Data's mission is to provide a familiar and consistent, Spring-based programming model for data access while still retaining the special traits of the underlying data store.
MariaDB: an enhanced, drop-in replacement for MySQL. MariaDB 5.5 is a stable (GA) release of MariaDB. It is MariaDB 5.3 + MySQL 5.5.
Benchmark it!
- Simple throughput scenario: user.vote(), user.request(results)
- Stop tests when the error rate rises above 5%
- Benchmark tool runs locally, targeting a cloud server
1000 campaigns, 15-100k votes
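The stop rule (abort once the error rate rises above 5%) can be sketched as a small counter. This is a minimal illustration in plain Java, not how Gatling or the original test harness implemented it; the class name `ErrorRateGuard` and its methods are hypothetical.

```java
// Hypothetical helper: tracks request outcomes and signals when to stop a run.
final class ErrorRateGuard {
    private final double maxErrorRate;
    private long total;
    private long failed;

    ErrorRateGuard(double maxErrorRate) {
        this.maxErrorRate = maxErrorRate;
    }

    // Record the outcome of one request.
    void record(boolean success) {
        total++;
        if (!success) {
            failed++;
        }
    }

    // Fraction of failed requests so far (0.0 when nothing was recorded).
    double errorRate() {
        return total == 0 ? 0.0 : (double) failed / total;
    }

    // True once the error rate is strictly above the threshold.
    boolean shouldStop() {
        return errorRate() > maxErrorRate;
    }

    public static void main(String[] args) {
        ErrorRateGuard guard = new ErrorRateGuard(0.05);
        for (int i = 0; i < 100; i++) {
            guard.record(i % 10 != 0); // every tenth request fails in this demo
        }
        System.out.println("error rate: " + guard.errorRate() + ", stop: " + guard.shouldStop());
    }
}
```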
Gatling
- An open-source load testing framework based on Scala, Akka and Netty
- High performance
- Out-of-box HTTP support
- Ready-to-present HTML reports
- Scenario recorder and developer-friendly DSL
http://gatling.io
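Gatling
    import io.gatling.core.Predef._
    import io.gatling.http.Predef._

    // Each virtual user repeats: cast a vote, then request the report.
    scenario("Throughput simulation").repeat(repeatCount) {
      feed(voteFeeder())
        // POST the vote taken from the feeder and expect HTTP 200
        .exec(http("Vote")
          .post(voteLink)
          .headers(sentHeaders).header("Authorization", token)
          .body(StringBody("${vote}"))
          .check(status.is(200)).asJSON)
        // GET the report for the same voting schema and expect HTTP 200
        .exec(http("Report")
          .get(reportByOptionLink + "/${votingSchemaId}")
          .headers(sentHeaders).header("Authorization", token)
          .check(status.is(200)).asJSON)
    }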
Gatling
Benchmark!
Kafka
- Publisher-subscriber
- Distributed by design
- Scalable
- Fast
- Durable
http://kafka.apache.org
A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.
Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of co-ordinated consumers.
Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.
Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.
Incoming Queue [diagram: Votes, Reports]
Kafka 0.8.2.1 (Scala 2.9.1, only one partition)
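The idea of an incoming queue is that the request path only enqueues the vote and returns, while a separate consumer moves votes to storage in batches. The sketch below shows that decoupling with a `java.util.concurrent.BlockingQueue` standing in for the Kafka topic; it does not use the Kafka client API, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Stand-in for the "incoming queue": an in-process FIFO queue instead of a Kafka topic.
final class IncomingQueueSketch {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final List<String> stored = new ArrayList<>();

    // Request path: enqueue the vote and return immediately.
    boolean acceptVote(String vote) {
        return queue.offer(vote);
    }

    // Consumer side: move up to maxBatch queued votes to storage, in arrival order.
    int drainToStorage(int maxBatch) {
        List<String> batch = new ArrayList<>();
        int moved = queue.drainTo(batch, maxBatch);
        stored.addAll(batch);
        return moved;
    }

    int pending() {
        return queue.size();
    }

    List<String> stored() {
        return stored;
    }

    public static void main(String[] args) {
        IncomingQueueSketch sketch = new IncomingQueueSketch();
        sketch.acceptVote("vote-1");
        sketch.acceptVote("vote-2");
        sketch.acceptVote("vote-3");
        int moved = sketch.drainToStorage(2);
        System.out.println("moved " + moved + ", pending " + sketch.pending());
    }
}
```

With a single consumer and a single queue, votes reach storage in the order they arrived, which mirrors the one-partition setup noted above.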
Benchmark!
1.5 - 2.1x
Redis
- In-memory data structure store (set, map, etc.)
- Easy leader board implementation
- HyperLogLog is a native data structure
http://redis.io
HyperLogLog: in the Redis implementation it uses only 12 KB per key to count with a standard error of 0.81%, and there is no limit to the number of items you can count, unless you approach 2^64 items (which seems quite unlikely). Each key is one set (key == set).
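Both numbers follow from the HyperLogLog parameters. Redis uses 16384 registers of 6 bits per key, and the standard error of HyperLogLog is about 1.04 / sqrt(m) for m registers. The arithmetic below is a worked check, not a HyperLogLog implementation; the class name is hypothetical.

```java
// Worked arithmetic for the 12 KB / 0.81% figures; not an HLL implementation.
final class HyperLogLogMath {
    static final int REGISTERS = 16384;     // 2^14 registers per key in Redis
    static final int BITS_PER_REGISTER = 6;

    // Standard error of HyperLogLog with the given number of registers.
    static double standardError(int registers) {
        return 1.04 / Math.sqrt(registers);
    }

    // Size of the dense register array in bytes.
    static int denseSizeBytes(int registers, int bitsPerRegister) {
        return registers * bitsPerRegister / 8;
    }

    public static void main(String[] args) {
        System.out.println("bytes per key: " + denseSizeBytes(REGISTERS, BITS_PER_REGISTER));
        System.out.println("standard error: " + standardError(REGISTERS));
    }
}
```

So 16384 registers x 6 bits = 12288 bytes, and 1.04 / 128 = 0.008125, which rounds to the quoted 0.81%.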
In-memory Storage [diagram: Votes, Reports]
Benchmark!
~1.6x performance boost
Spark: a fast and general engine for large-scale data processing
http://spark.apache.org
Spark has an advanced DAG execution engine that supports acyclic data flow and in-memory computing.
Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala, Python and R shells.
Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.
You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos. Access data in HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source.
Scalable Processing [diagram: Votes, Reports]
Live layer: Spark Streaming.
TODO: Benchmarks
- Processing latency
- Latency vs Data Volume
TODO: Scalable Storage [diagram: Reports, Votes]
HDFS.
Architecture Goals Met
- High Throughput
- Scalable Storage
- Scalable Processing
- Extensible Processing
- Low Latency Reads & Updates
Lambda Architecture
A Single Picture
http://lambda-architecture.net/img/la-overview_small.png
A Single Picture
query = f_query(batch_view, realtime_view)
batch_view = f_batch(all_data)
realtime_view = f_speed(new_data, realtime_view)
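For vote counting, the three functions can be sketched as below. This is an illustration under simple assumptions: votes are reduced to the name of the chosen option, views are in-memory maps of counts per option, and the batch and realtime views cover disjoint sets of votes. The names `fBatch`, `fSpeed` and `fQuery` simply mirror the formulas.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the three Lambda Architecture functions for per-option vote counts.
final class LambdaQuerySketch {

    // batch_view = f_batch(all_data): recount every vote from scratch.
    static Map<String, Long> fBatch(List<String> allVotes) {
        Map<String, Long> view = new HashMap<>();
        for (String option : allVotes) {
            view.merge(option, 1L, Long::sum);
        }
        return view;
    }

    // realtime_view = f_speed(new_data, realtime_view): incremental update of the old view.
    static Map<String, Long> fSpeed(List<String> newVotes, Map<String, Long> realtimeView) {
        Map<String, Long> updated = new HashMap<>(realtimeView);
        for (String option : newVotes) {
            updated.merge(option, 1L, Long::sum);
        }
        return updated;
    }

    // query = f_query(batch_view, realtime_view): add both views per option.
    static Map<String, Long> fQuery(Map<String, Long> batchView, Map<String, Long> realtimeView) {
        Map<String, Long> result = new HashMap<>(batchView);
        realtimeView.forEach((option, count) -> result.merge(option, count, Long::sum));
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> batchView = fBatch(Arrays.asList("a", "b", "a"));
        Map<String, Long> realtimeView =
                fSpeed(Arrays.asList("b", "c"), Collections.<String, Long>emptyMap());
        System.out.println(fQuery(batchView, realtimeView));
    }
}
```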
Batch Layer
- Immutable append-only data store
- Batch computations produce batch views
Hadoop
The batch layer precomputes results using a distributed processing system that can handle very large quantities of data. The batch layer aims at perfect accuracy by being able to process all available data when generating views. This means it can fix any errors by recomputing based on the complete data set, then updating existing views. Output is typically stored in a read-only database, with updates completely replacing existing precomputed views.
Apache Hadoop is the de facto standard batch-processing system used in most high-throughput architectures.
Serving Layer
- Random reads/queries on batch views
- Batch updates from batch layer
- No need for random writes
Simple, robust, predictable, easy to configure and operate. Cassandra/HBase/ElephantDB, Hive/Impala
Output from the batch and speed layers is stored in the serving layer, which responds to ad-hoc queries by returning precomputed views or building views from the processed data.
Examples of technologies used in the serving layer include Druid, which provides a single cluster to handle output from both layers. Dedicated stores used in the serving layer include Apache Cassandra or Apache HBase for speed-layer output, and Elephant DB or Cloudera Impala for batch-layer output.
Batch + Serving Layer
- Robustness and fault tolerance
- Scalability
- Generalization
- Extensibility
- Minimal maintenance
- Debuggability
Speed Layer
- Low latency reads and updates
- Incremental computation (different from batch one)
- Scalability
- Fault tolerance
- Minimal amount of stored data
Complexity isolation: complexity is pushed into the layer whose results are temporary.
Eventual accuracy: eventually all results are taken from the serving layer.
The speed layer might use approximate algorithms such as HyperLogLog and Bloom filters for computations.
Storm/Spark Streaming
The speed layer processes data streams in real time and without the requirements of fix-ups or completeness. This layer sacrifices throughput as it aims to minimize latency by providing real-time views into the most recent data. Essentially, the speed layer is responsible for filling the "gap" caused by the batch layer's lag in providing views based on the most recent data. This layer's views may not be as accurate or complete as the ones eventually produced by the batch layer, but they are available almost immediately after data is received, and can be replaced when the batch layer's views for the same data become available.
Stream-processing technologies typically used in this layer include Apache Storm, SQLstream and Apache Spark. Output is typically stored on fast NoSQL databases.
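The "gap filling" can be sketched as a realtime view that keeps only the data the batch layer has not absorbed yet. The example below is an assumption-laden illustration: votes are bucketed by arrival minute, and the batch layer is assumed to report the minute up to which its views are complete. The class and method names are hypothetical.

```java
import java.util.TreeMap;

// Sketch: realtime counts are temporary and dropped once the batch view covers them.
final class SpeedLayerHandoff {
    // Realtime vote counts keyed by the epoch minute in which the votes arrived.
    private final TreeMap<Long, Long> countsByMinute = new TreeMap<>();

    // Incremental update for one incoming vote.
    void recordVote(long epochMinute) {
        countsByMinute.merge(epochMinute, 1L, Long::sum);
    }

    // Discard every bucket strictly older than the minute the batch view now covers.
    void discardCoveredBy(long batchCoveredUpTo) {
        countsByMinute.headMap(batchCoveredUpTo, false).clear();
    }

    // Votes still served from the speed layer.
    long realtimeCount() {
        long sum = 0;
        for (long count : countsByMinute.values()) {
            sum += count;
        }
        return sum;
    }

    public static void main(String[] args) {
        SpeedLayerHandoff speed = new SpeedLayerHandoff();
        speed.recordVote(1);
        speed.recordVote(2);
        speed.recordVote(3);
        speed.discardCoveredBy(3); // batch view now covers minutes 1 and 2
        System.out.println("votes left in realtime view: " + speed.realtimeCount());
    }
}
```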
Goals
- Robustness and fault tolerance
- Scalability
- Generalization
- Extensibility
- Minimal maintenance
- Debuggability
- Low latency reads and updates
Lambda Architecture
http://lambda-architecture.net/img/la-overview_small.png
High-load meets big data
Cluster Evolution
Start Simple: single box
Optimization: Tomcat Connector
- Start with a single machine
- The number of threads matters, benchmark it
- Fine-tuning can be OS specific
Java NIO2 connector, 500 threads
http://techblog.netflix.com/2015/07/tuning-tomcat-for-high-throughput-fail.html
Benchmark!
???
HAProxy
- The Reliable, High Performance TCP/HTTP Load Balancer
- A single-process program
http://haproxy.org
A cluster of 10 servers
Ubuntu 14.04 (virtual machines, 2 cores, 8 GB RAM)
Haproxy vs nginx
Optimization: Load Balancing
!!!
When to Stop?
Columns: CPU %, Memory GB
haproxy 952.5
tomcat 3976
kafka 11.3
redis 553.5
What We Achieved?
Experience
- Lambda Architecture: we have One
- Cluster Scaling & Optimization
- Excellent team
Technology
- Java 8
- Spring Boot 1.2.5
- Spring Data 1.2.5
- Tomcat 8
- MariaDB 5.5
- HAProxy 1.5.14
- Kafka 0.8
- Redis 2.8
- Spark 1.4
- HDFS 2.6
- Gatling 2.2
- AngularJS 1.4
Things That Matter
- Small steps make a huge difference
- Choose the right metrics
- Benchmark
- Optimize!
Q/A
Thank You!