2014 09-12 lambda-architecture-at-indix
Lambda Architecture at Indix (Only Part I Edition)
2014-09-12 Yu Ishikawa
Abstract
• Evolution of our data platform
• Introduce Lambda Architecture
• Motivation for using it
• Original Source
  – http://engineering.indix.com/blog/lambda-architecture-at-indix/
  – http://engineering.indix.com/blog/lambda-architecture-at-indix-part-II/
Agenda
• Background
• Data Platform v1.0
• Challenges
• Data Platform v2.0
  – Lambda Architecture
  – Components
  – Principles
• Conclusion
Background
Indix
• A product intelligence platform
  – The world’s largest product database and APIs to enable brands, retailers and developers to deliver the right product to the right customer at the right place, every time
Build a catalog
• Several million products and billions of price points collected from thousands of e-commerce web sites
• Collect product data as semi-structured HTML by crawling product pages from these web sites
  – Extract product attributes from the pages
  – Run through a series of machine learning algorithms to classify and extract deeper product attributes
  – Use this data to compute aggregates across multiple dimensions and derive actionable insights
• Data is also indexed by our search engine
  – Consumed by our apps, API and mobile platforms
Data Platform 1.0
Diagram of 1.0
• We were looking for a data model that would allow us to keep a copy of the millions of web pages collected by our crawler
Challenges
Challenges Faced in 1.0
• Operational Issues
  – Required best-practice configuration to avoid issues due to compaction
• Data Corruption
  – Had to run a migration job to restore some of the data
    • When a web page redirected to another web page, we used to copy all the data corresponding to the older web page to the new one and mark the row corresponding to the old web page for deletion
• Data Loss
  – Ended up storing null instead of valid JSON
    • Due to a bug in the Google GSON library when converting our POJOs to JSON
• Wrong Choice of MapReduce Abstractions
  – Quite a bit of plumbing code was needed to express complex data workflows
System 1.0 was too complex to manage, understand and extend.
We needed a simpler approach that scales, is easier to reason about, is tolerant of human errors and can evolve with our product.
Data Platform 2.0
Lambda Architecture
• A set of architecture principles and components
  – Allows both batch and real-time to work together while building immutability, recomputation and human fault tolerance into the system
• Coined by Nathan Marz
  – An ex-Twitter engineer and the creator of Storm
Diagram of 2.0
• Three layers: batch, serving and speed
• To get the final result, the batch and realtime views must be queried and the results merged together.
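The query-time merge described above can be sketched as follows. This is a minimal illustration, not Indix's implementation: the view contents (price-point counts per product), the names `batch_view`, `realtime_view`, `merged_count` and `merged_view`, and the choice of an additive merge are all assumptions made for the example.

```python
# Hypothetical views: number of price points seen per product ID.
# batch_view covers all data up to the last batch run;
# realtime_view covers only data that arrived since then.
batch_view = {"product-1": 120, "product-2": 45}
realtime_view = {"product-2": 3, "product-3": 1}


def merged_count(product_id, batch, realtime):
    """Answer a single query by combining the batch and realtime views."""
    return batch.get(product_id, 0) + realtime.get(product_id, 0)


def merged_view(batch, realtime):
    """Merge both views for every product present in either one."""
    keys = set(batch) | set(realtime)
    return {key: merged_count(key, batch, realtime) for key in keys}
```

An additive merge only works for aggregates that can be combined this way (counts, sums); other views would need their own merge function.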
Layers and Components
• Collect data
  – Akka: an actor-based concurrency library in Scala
    • Used to implement their crawler
• Batch layer:
  – Responsible for computing arbitrary functions on the master data
  – Scalding: a Scala library to write MR jobs in the Scala collections API
  – Spark: a framework for in-memory distributed computing
• Serving layer:
  – Indexes and exposes precomputed views to serve ad-hoc queries with low latency
  – HBase, Solr, Oogway
• Speed layer:
  – Deals only with new data and compensates for the high-latency updates of the serving layer by creating realtime views
  – Our real-time latency requirements are in hours and not in seconds
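As a rough sketch of what "computing arbitrary functions on the master data" means in the batch layer, the example below rebuilds a view from scratch on every run. It is plain Python rather than Scalding or Spark, and the record layout, the `master_data` list and the `compute_min_price_view` function are hypothetical names chosen for illustration.

```python
from collections import defaultdict

# Hypothetical master dataset: an append-only list of immutable facts,
# each a (product_id, store, price) tuple. Records are only ever added,
# never updated or deleted in place.
master_data = [
    ("product-1", "store-a", 19.99),
    ("product-1", "store-b", 17.49),
    ("product-2", "store-a", 5.00),
]


def compute_min_price_view(records):
    """Batch function: recompute the whole view from all master data."""
    view = defaultdict(lambda: float("inf"))
    for product_id, _store, price in records:
        view[product_id] = min(view[product_id], price)
    return dict(view)


batch_view = compute_min_price_view(master_data)

# New facts are appended; the next batch run rebuilds the view from scratch.
master_data.append(("product-2", "store-b", 4.25))
batch_view = compute_min_price_view(master_data)
```

Because the view is a pure function of the master data, a bug in `compute_min_price_view` can be fixed and the view simply recomputed, without repairing stored state by hand.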
Principles
A core set of architectural principles helps you build simple and robust big data systems
• Immutability and Human Fault Tolerance
  – Recomputation: easy to recover from mistakes
    • e.g. writing bad data, bugs in the computation
  – At any point in time, you have the previous versions of the batch and serving layer
• Complexity Isolation
  – To isolate complexity, HBase is used in the speed layer
  – HBase has high operational complexity because random writes lead to compactions
• Enforceable Schemas
  – Advocates the use of enforceable schemas to ensure read time validation of data to avoid data loss
  – These schemas are implemented by language-neutral serialization frameworks like Thrift, Avro, Protobuf, etc.
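To illustrate the idea of an enforceable schema, here is a minimal hand-rolled sketch. It does not use Thrift, Avro or Protobuf; the `PricePoint` record, its fields and the `try_parse` helper are hypothetical. The same check could run when a record is read or when it is written.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PricePoint:
    """Hypothetical record type; the field names are illustrative only."""
    product_id: str
    store: str
    price: float

    def __post_init__(self):
        # Reject records that do not match the declared schema.
        if not isinstance(self.product_id, str) or not self.product_id:
            raise ValueError("product_id must be a non-empty string")
        if not isinstance(self.store, str) or not self.store:
            raise ValueError("store must be a non-empty string")
        if isinstance(self.price, bool) or not isinstance(self.price, (int, float)):
            raise ValueError("price must be a number")


def try_parse(raw):
    """Return a PricePoint, or None if the raw value violates the schema."""
    try:
        return PricePoint(**raw)
    except (TypeError, ValueError):
        return None
```

A check like this makes a null or malformed record fail loudly at a known point instead of being stored or consumed silently.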
Conclusion
• Lambda Architecture is technology and domain agnostic
• 2.0 is more robust and simple
  – Has been in production for more than a year now
  – The batch and serving layers have matured during the last year and are stable
• However, our speed and merge layers are still a work in progress
SEE ALSO
• Questioning the Lambda Architecture - O'Reilly Radar
  – http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
• Lambda Architecture | MapR
  – https://www.mapr.com/developercentral/lambda-architecture
• Applying the Lambda Architecture with Spark
  – http://spark-summit.org/wp-content/uploads/2014/07/Lambda-Architecture-Jim-Scott..pdf