2014 09-12 lambda-architecture-at-indix

Lambda Architecture at Indix (Only Part I Edition)

2014-09-12 Yu Ishikawa

Abstract

• Evolution of our data platform
• Introduce Lambda Architecture
• Motivation for using it
• Original Source
  – http://engineering.indix.com/blog/lambda-architecture-at-indix/
  – http://engineering.indix.com/blog/lambda-architecture-at-indix-part-II/

Agenda

• Background
• Data Platform v1.0
• Challenges
• Data Platform v2.0
  – Lambda Architecture
  – Components
  – Principles
• Conclusion

Background

Indix

• A product intelligence platform
  – The world’s largest product database and APIs to enable brands, retailers and developers to deliver the right product to the right customer at the right place, every time

Build a catalog

• Several million products and billions of price points collected from thousands of e-commerce web sites
• Product data is collected as semi-structured HTML by crawling product pages from these web sites
  – Extract product attributes from the pages
  – Run the data through a series of machine learning algorithms to classify and extract deeper product attributes
  – Use this data to compute aggregates across multiple dimensions and derive actionable insights
• Data is also indexed by our search engine
  – Consumed by our apps, API and mobile platforms

Data Platform 1.0

Diagram of 1.0

• We were looking for a data model that would allow us to keep a copy of the millions of web pages collected by our crawler

Challenges

Challenges Faced in 1.0

• Operational Issues
  – Needed best-practice configuration to avoid issues due to compaction
• Data Corruption
  – Had to run a migration job to restore some of the data
    • When a web page redirected to another web page, we used to copy all the data corresponding to the older web page to the new one and mark the row corresponding to the old web page for deletion
• Data Loss
  – Ended up storing null instead of valid JSON
    • Due to a bug in the Google GSON library, used for converting our POJOs to JSON
• Wrong Choice of MapReduce Abstractions
  – Quite a bit of plumbing code was needed to express complex data workflows

System 1.0 was too complex to manage, understand and extend

We needed a simpler approach that would scale, be easier to reason about, tolerate human errors and evolve with our product

Data Platform 2.0

Lambda Architecture

• A set of architecture principles and components
  – Allows batch and real-time processing to work together while building immutability, recomputation and human fault tolerance into the system
• Coined by Nathan Marz
  – An ex-Twitter engineer and the creator of Storm

Diagram of 2.0

• Three layers: batch, serving and speed
• To get the final result, the batch and realtime views must be queried and the results merged together
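The query-time merge of batch and realtime views can be sketched as follows. This is a minimal illustration and not Indix's implementation: the view contents, the metric (number of price points per product) and the function name `merged_count` are all assumptions made for the example.

```python
# Hypothetical views, each mapping product id -> number of price points seen.
# The batch view covers everything up to the last batch run; the realtime
# view covers only records that arrived after that run.
batch_view = {"p1": 10, "p2": 4}
realtime_view = {"p1": 2, "p3": 1}


def merged_count(product_id, batch_view, realtime_view):
    """Answer a query by combining the batch result with the realtime delta."""
    return batch_view.get(product_id, 0) + realtime_view.get(product_id, 0)


for pid in ("p1", "p2", "p3"):
    print(pid, merged_count(pid, batch_view, realtime_view))
```

A count merges by simple addition; other metrics need their own merge rule (for example, a minimum price merges by taking the smaller of the two values).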

Layers and Components

• Collect data
  – Akka: an actor-based concurrency library in Scala
    • Used to implement the crawler
• Batch layer
  – Responsible for computing arbitrary functions on the master data
  – Scalding: a Scala library for writing MapReduce jobs in the style of the Scala collections API
  – Spark: a framework for in-memory distributed computing
• Serving layer
  – Indexes and exposes precomputed views to serve ad-hoc queries with low latency
  – HBase, Solr, Oogway
• Speed layer
  – Deals only with new data and compensates for the high-latency updates of the serving layer by creating realtime views
  – Our realtime latency requirements are in hours, not seconds
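The split of work between the batch and speed layers might look roughly like the sketch below. It uses plain Python rather than Scalding or Spark, and the record shape `(product_id, price)` and both function names are assumptions for illustration only.

```python
from collections import Counter


def build_batch_view(master_records):
    """Batch layer: recompute the view from the entire master dataset."""
    return Counter(product_id for product_id, _price in master_records)


def update_realtime_view(realtime_view, record):
    """Speed layer: fold one newly arrived record into the realtime view."""
    product_id, _price = record
    realtime_view[product_id] += 1
    return realtime_view


# Hypothetical data: what the last batch run saw, and what arrived afterwards.
master = [("p1", 9.99), ("p1", 9.49), ("p2", 20.00)]
arrived_since_batch = [("p1", 9.29), ("p3", 5.00)]

batch_view = build_batch_view(master)
realtime_view = Counter()
for record in arrived_since_batch:
    update_realtime_view(realtime_view, record)

# Once the next batch run has absorbed the new records, the realtime view
# for that period is no longer needed and can be thrown away.
next_batch_view = build_batch_view(master + arrived_since_batch)
```

In this sketch the batch view plus the realtime view equals what the next full batch run produces, which is why the speed layer only ever has to hold a small window of recent data.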

Principles

A core set of architectural principles that help you build simple and robust big data systems

• Immutability and Human Fault Tolerance
  – Recomputation: easy to recover from mistakes
    • E.g. writing bad data, bugs in the computation
  – At any point in time, you have the previous versions of the batch and serving layers
• Complexity Isolation
  – To isolate complexity, HBase is used in the speed layer
  – HBase has high operational complexity because random writes lead to compactions
• Enforceable Schemas
  – Advocates the use of enforceable schemas to ensure read-time validation of data and avoid data loss
  – These schemas are implemented with language-neutral serialization frameworks such as Thrift, Avro and Protobuf
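The first and last principles can be illustrated together in a small sketch. A frozen dataclass stands in for a Thrift/Avro/Protobuf schema here, which is a simplification and not how those frameworks work internally. The `PricePoint` type, the `append` helper and the truncation bug in `latest_price_v1` are all hypothetical. The check runs when a record is appended; the same check could equally run when records are read back.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a stored fact cannot be modified afterwards
class PricePoint:
    product_id: str
    price_text: str  # raw price as crawled, e.g. "9.99"

    def __post_init__(self):
        # Schema enforcement: reject bad records instead of storing them.
        if not isinstance(self.product_id, str) or not self.product_id:
            raise ValueError("product_id must be a non-empty string")
        if not isinstance(self.price_text, str):
            raise ValueError("price_text must be a string")
        float(self.price_text)  # raises ValueError if not numeric


master = []  # append-only master dataset: records are never updated or deleted


def append(raw):
    """Validate a raw dict against the schema, then append it."""
    if not isinstance(raw, dict):
        raise ValueError("record must be a dict, got %r" % (raw,))
    master.append(PricePoint(raw.get("product_id"), raw.get("price_text")))


def latest_price_v1(records):
    # Hypothetical buggy computation: int() silently drops the cents.
    return {r.product_id: int(float(r.price_text)) for r in records}


def latest_price_v2(records):
    # Fixed computation, run over the same untouched master data.
    return {r.product_id: float(r.price_text) for r in records}


for raw in [{"product_id": "p1", "price_text": "9.99"},
            {"product_id": "p2", "price_text": "20.50"},
            {"product_id": "p1", "price_text": "9.49"}]:
    append(raw)

views = {"v1": latest_price_v1(master)}  # the buggy view goes live
views["v2"] = latest_price_v2(master)    # recomputed after the fix; v1 is kept
```

Because the master data is never mutated, recovering from the bug only requires rerunning the corrected function. A null record, such as the one that caused data loss in 1.0, raises an error in `append` instead of being stored.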

Conclusion

Conclusion

• Lambda Architecture is technology and domain agnostic
• 2.0 is more robust and simpler
  – Has been in production for more than a year now
  – Has matured during the last year and is stable
• However, our speed and merge layers are still a work in progress

SEE ALSO

• Questioning the Lambda Architecture - O'Reilly Radar
  – http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
• Lambda Architecture | MapR
  – https://www.mapr.com/developercentral/lambda-architecture
• Applying the Lambda Architecture with Spark
  – http://spark-summit.org/wp-content/uploads/2014/07/Lambda-Architecture-Jim-Scott..pdf
