2014 09-12 lambda-architecture-at-indix

Lambda Architecture at Indix (Only Part I Edition) 2014-09-12 Yu Ishikawa

description

Introduces the blog article "Lambda Architecture at Indix"

Transcript of 2014 09-12 lambda-architecture-at-indix

Page 1: 2014 09-12 lambda-architecture-at-indix

Lambda Architecture at Indix (Only Part I Edition)

2014-09-12 Yu Ishikawa

Agenda

• Background
• Data Platform v1.0
• Challenges
• Data Platform v2.0
  – Lambda Architecture
  – Components
  – Principles
• Conclusion

Background

Indix

• A product intelligence platform
  – The world’s largest product database and APIs to enable brands, retailers and developers to deliver the right product to the right customer at the right place, every time

Build a catalog

• Several million products and billions of price points collected from thousands of e-commerce web sites

• Collect product data as semi-structured HTML by crawling product pages from these web sites
  – Extract product attributes from the pages
  – Run the pages through a series of machine learning algorithms to classify and extract deeper product attributes
  – Use this data to compute aggregates across multiple dimensions and derive actionable insights
• Data is also indexed by our search engine
  – Consumed by our apps, API and mobile platforms
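As a rough illustration of "compute aggregates across multiple dimensions", the sketch below groups price points by product and by store. The record shape, field names and values are hypothetical and are not taken from the Indix platform.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical price points: (product_id, store, price). Illustrative values only.
price_points = [
    ("p1", "store-a", 10.0),
    ("p1", "store-b", 12.0),
    ("p2", "store-a", 5.0),
    ("p1", "store-a", 8.0),
]

def aggregate(points, key_fn):
    """Group prices by the key chosen by key_fn and summarise each group."""
    groups = defaultdict(list)
    for product_id, store, price in points:
        groups[key_fn(product_id, store)].append(price)
    return {
        key: {"min": min(prices), "max": max(prices), "avg": mean(prices)}
        for key, prices in groups.items()
    }

# The same data aggregated along two different dimensions.
by_product = aggregate(price_points, lambda product_id, store: product_id)
by_store = aggregate(price_points, lambda product_id, store: store)
```

Passing the grouping key as a function keeps one aggregation routine for every dimension; at the scale described in the slides this would run as a distributed job rather than in memory.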

Data Platform 1.0

Diagram of 1.0

• We were looking for a data model that would allow us to keep a copy of the millions of web pages collected by our crawler

Challenges

Challenges Faced in 1.0

• Operational Issues
  – Needed best-practice configuration to avoid issues due to compaction
• Data Corruption
  – Had to run a migration job to restore some of the data
    • When a web page redirected to another web page, we used to copy all the data corresponding to the older web page to the new one and mark the row corresponding to the old web page for deletion
• Data Loss
  – Ended up storing null instead of valid JSON
    • Due to a bug in the Google Gson library when converting our POJOs to JSON
• Wrong Choice of MapReduce Abstractions
  – Quite a bit of plumbing code was needed to express complex data workflows

System 1.0 was too complex to manage, understand and extend

Needed a simpler approach that would scale, be easier to reason about, tolerate human errors and evolve with our product

Data Platform 2.0

Lambda Architecture

• A set of architecture principles and components
  – Allows batch and real-time processing to work together while building immutability, recomputation and human fault tolerance into the system
• Coined by Nathan Marz
  – An ex-Twitter engineer and the creator of Storm

Diagram of 2.0

• Three layers: batch, serving and speed
• To get the final result, the batch and realtime views must be queried and the results merged together.
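The query-time merge described above can be sketched as follows. This is a minimal sketch under stated assumptions: the view contents and the `query` function are hypothetical, and the merge here is a simple sum because the example views hold counts. Other kinds of view need a different merge function.

```python
# Hypothetical views: product_id -> number of price points seen.
batch_view = {"p1": 120, "p2": 45}   # precomputed from the master data
realtime_view = {"p1": 3, "p3": 1}   # covers only data newer than the last batch run

def query(product_id, batch, realtime):
    """Answer a query by merging both views; for counts, merging means adding."""
    return batch.get(product_id, 0) + realtime.get(product_id, 0)
```

For example, `query("p1", batch_view, realtime_view)` combines the 120 price points known to the batch view with the 3 seen since the last batch run.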

Layers and Components

• Collect data
  – Akka: an actor-based concurrency library for Scala
    • Used to implement their crawler
• Batch layer:
  – Responsible for computing arbitrary functions on the master data
  – Scalding: a Scala library for writing MapReduce jobs in the style of the Scala collections API
  – Spark: a framework for in-memory distributed computing
• Serving layer:
  – Indexes and exposes precomputed views to serve ad-hoc queries with low latency
  – HBase, Solr, Oogway
• Speed layer:
  – Deals only with new data and compensates for the high-latency updates of the serving layer by creating realtime views
  – Our real-time latency requirements are in hours and not in seconds
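To make the division of labour between the batch and speed layers concrete, here is a small in-memory sketch. It is not how Indix implements these layers (the slides name Scalding and Spark for batch work); the `PricePoint` record and the `latest_price_view` function are hypothetical names chosen for illustration.

```python
from typing import NamedTuple

class PricePoint(NamedTuple):
    """One immutable fact in the master data (hypothetical shape)."""
    timestamp: int
    product_id: str
    price: float

def latest_price_view(records):
    """A pure function of its input: product_id -> most recent price."""
    view = {}
    for record in sorted(records, key=lambda r: r.timestamp):
        view[record.product_id] = record.price
    return view

# Batch layer: the view is computed over all of the master data.
master_data = [
    PricePoint(1, "p1", 10.0),
    PricePoint(2, "p2", 5.0),
    PricePoint(3, "p1", 9.0),
]
batch_view = latest_price_view(master_data)

# Speed layer: the same kind of view, computed only over data that
# arrived after the batch run. New facts are appended, never updated in place.
new_data = [PricePoint(4, "p2", 4.5)]
realtime_view = latest_price_view(new_data)

# Realtime entries are newer, so they override batch entries when merged.
merged = {**batch_view, **realtime_view}

# The next batch run recomputes from scratch over all data, after which
# the realtime view for that period can be discarded.
next_batch_view = latest_price_view(master_data + new_data)
```

In this sketch the merged result matches what the next full batch run produces, which is the property that lets the speed layer stay small and disposable.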

Principles

A core set of architectural principles helps you build simple and robust big data systems

• Immutability and Human Fault Tolerance
  – Recomputation: easy to recover from mistakes
    • E.g., writing bad data, bugs in the computation
  – At any point in time, you have the previous versions of the batch and serving layers
• Complexity Isolation
  – To isolate complexity, HBase is used in the speed layer
  – HBase has high operational complexity because random writes lead to compactions
• Enforceable Schemas
  – Advocates the use of enforceable schemas to ensure read-time validation of data to avoid data loss
  – These schemas are implemented with language-neutral serialization frameworks like Thrift, Avro, Protobuf, etc.
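The idea behind enforceable schemas can be sketched with the Python standard library alone. This is a stand-in, not the Thrift, Avro or Protobuf API: the `ProductRecord` fields and the `serialize` / `deserialize` helpers are hypothetical, and the sketch assumes validation both when a record is built and when it is read back.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ProductRecord:
    """Hypothetical schema: every field is required and typed."""
    url: str
    title: str
    price: float

    def __post_init__(self):
        # Enforce the schema whenever a record is constructed.
        if not isinstance(self.url, str) or not self.url:
            raise ValueError("url must be a non-empty string")
        if not isinstance(self.title, str) or not self.title:
            raise ValueError("title must be a non-empty string")
        if isinstance(self.price, bool) or not isinstance(self.price, (int, float)):
            raise ValueError("price must be a number")

def serialize(record):
    """Write path: only an already-validated ProductRecord is encoded."""
    return json.dumps(asdict(record))

def deserialize(payload):
    """Read path: reject null or malformed payloads instead of passing them on."""
    data = json.loads(payload)
    if not isinstance(data, dict):
        raise ValueError("payload is not a JSON object")
    # Missing or unexpected fields raise TypeError; bad values raise ValueError.
    return ProductRecord(**data)
```

With a check like this, a payload of `null` fails loudly on read rather than flowing silently into downstream jobs, which is the kind of data loss the 1.0 platform ran into.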

Conclusion

• Lambda Architecture is technology and domain agnostic

• 2.0 is more robust and simpler
  – Has been in production for more than a year now
  – Has matured during the last year and is stable
• However, our speed and merge layers are still a work in progress