GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elasticsearch

MAKING ENTERPRISE DATA AVAILABLE IN REAL TIME WITH ELASTICSEARCH Yann Cluchey CTO @ Cogenta CTO @ GfK Online Pricing Intelligence

description

My talk from GOTO Aarhus, 30th September 2014. Cogenta is a retail intelligence company which tracks ecommerce web sites around the world to provide competitive monitoring and analysis services to retailers. Using its proprietary crawler technology, Lucene and SQL Server, a stream of 20 million raw product data entries is captured and processed each day. This case study looks at how Cogenta uses Elasticsearch to break the shackles imposed by the RDBMS (and a limited budget) to make the data available in real time to its customers. Cogenta uses SQL as its canonical store & for complex reporting, and Elasticsearch for real-time processing & to drive its SaaS web applications. Elasticsearch is easy to use, delivers the powerful features of Lucene and enables the data & platform cost to scale linearly. But… synchronising your existing data in two places presents some interesting challenges such as aggregation and concurrency control. This talk will take a detailed look at how Cogenta overcame those challenges, with a perpetually changing and asynchronously updated dataset. http://gotocon.com/aarhus-2014/presentation/Cogenta%20-%20Making%20Enterprise%20Data%20Available%20in%20Real%20Time%20with%20Elasticsearch

Transcript of GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elasticsearch

Page 1: GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elasticsearch

MAKING ENTERPRISE DATA

AVAILABLE IN REAL TIME

WITH ELASTICSEARCH

Yann Cluchey

CTO @ Cogenta

CTO @ GfK Online Pricing Intelligence

Page 2: GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elasticsearch

What is Enterprise Data?

Page 3: GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elasticsearch

What is Enterprise Data?

Page 4: GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elasticsearch

Online Pricing Intelligence

1. Gather data from 500+ eCommerce sites

2. Organise into high quality market view

3. Competitive intelligence tools

Page 5: GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elasticsearch

Custom Crawler

Parse web content

Discover product data

Tracking 20m products

Daily+

[Diagram: HTML pages go into the crawler; Price, Stock and Meta records come out]

Page 6: GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elasticsearch

Processing, Storage

Enrichment

Persistent Storage

Product Catalogue

+ time series data

[Diagram: Processing, Database]

Page 7: GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elasticsearch

Thing #1 - Detection

Identify distinct products

Automated information retrieval

Lucene + custom index builder

Continuous process

(Humans for QA)

[Diagram: Database, Index Builder, Lucene Index, Matcher, GUI]

Page 8: GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elasticsearch

Thing #2 - BI Tools

Web Applications

Also based on Lucene

Batch index build process

Per-customer indexes

[Diagram: Database, Index Builder, Customer Index 1, Customer Index 2, Customer Index 3, BI Tools]

Page 9: GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elasticsearch

Thing #1 - Pain

Continuously indexing

Track changes, read back out to index

Drain on performance

Latency, coping with peaks

Full rebuild for index schema change or inconsistencies

Full rebuild doesn’t scale well…

Unnecessary work..?

[Diagram: Database, Index Builder, Lucene Index, GUI]

Page 10: GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elasticsearch

Thing #2 - Pain

Twice daily batch rebuild, per customer

Very slow

Moar customers?

Moar data?

Moar often?

Data set too complex, keeps changing

Index shipping

Moar web servers?

[Diagram: Database, Index Builder, Customer Index 1, Customer Index 2, Customer Index 3, BI Tools; Indexing, Database, Batch Sync, Web Server 1, Web Server 2]

Page 11: GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elasticsearch

Pain Points

As data, customers scale, processes slow down

Adapting to change

Easy to layer on, hard to make fundamental changes

Read vs write concerns

Database Maintenance

[Diagram: Database, Index Builder, Index]

Page 12: GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elasticsearch

Goals

Eliminate latencies

Improve scalability

Improve availability

Something achievable

Your mileage will vary

Page 13: GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elasticsearch

elasticsearch

Open source, distributed search engine

Based on Lucene, fully featured API

Querying, filtering, aggregation

Text processing / IR

Schema-free

Yummy

(real-time, sharding, highly available)

Silver bullets not included

Page 14: GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elasticsearch

Our Pipeline

[Diagram: Crawlers, Processors, Database, Indexing, Indexers, Indexes, Web Servers]

Page 15: GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elasticsearch

Our New Pipeline

[Diagram: Crawlers, Processors, Database, Indexers, Indexes, Web Servers]

Page 16: GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elasticsearch

Event Hooks

Messages fired OnCreate.. and OnUpdate

Payload contains everything needed for indexing

The data

Keys (still mastered in SQL)

Versioning

Sender has all the information already

Use RabbitMQ to control event message flow

Messages are durable
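As a rough illustration of the payload described on this slide, the sketch below builds a self-contained event message in Python. The class name, field names and sample values are hypothetical and not taken from the talk; durability would be configured on the broker and publish side (for example a durable queue with persistent delivery in RabbitMQ) and is not shown here.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class IndexEvent:
    """Hypothetical event message; field names are illustrative only."""
    event_type: str  # "OnCreate" or "OnUpdate"
    keys: dict       # identifiers that are still mastered in SQL
    version: int     # revision number used later to order updates
    data: dict       # everything the indexer needs, so it never reads SQL


def to_message(event: IndexEvent) -> bytes:
    """Serialise the event to a JSON message body."""
    return json.dumps(asdict(event), sort_keys=True).encode("utf-8")


event = IndexEvent(
    event_type="OnUpdate",
    keys={"product_id": 42, "retailer_id": 7},
    version=1001,
    data={"price": 199.0, "in_stock": True},
)
body = to_message(event)
```

Because the sender already holds the data, keys and version, the consumer can index from the message alone without going back to the database.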

Page 17: GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elasticsearch

Indexing Strategy

RESTful API (HTTP, Thrift, Memcache)

Use bulk methods

They support percolation

Rivers (pull)

RabbitMQ River

JDBC River

Mongo/Couch/etc. River

Logstash

[Diagram: Event Q, Indexer, Index Q]
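To make "use bulk methods" concrete, here is a minimal sketch of how a bulk request body is laid out: one action line followed by one source line per document, newline-delimited, with a trailing newline. The index name and documents are made up for illustration, and the helper `build_bulk_body` is hypothetical. Older Elasticsearch versions also expect a type, supplied either in the action metadata or in the URL path.

```python
import json


def build_bulk_body(index_name, docs):
    """Build a newline-delimited bulk body from (doc_id, source) pairs."""
    lines = []
    for doc_id, source in docs:
        # Action line: says what to do and where.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        # Source line: the document itself.
        lines.append(json.dumps(source))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


bulk_body = build_bulk_body(
    "products",
    [
        ("42", {"name": "Camera", "price": 199.0}),
        ("43", {"name": "Tripod", "price": 35.5}),
    ],
)
```

The resulting string would be sent as the body of a single request to the `_bulk` endpoint, so many documents cost one round trip instead of one each.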

Page 18: GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elasticsearch

Model Your Data

What’s in your documents?

Database = Index

Table = Type ...?

Start backwards

What do your applications need?

How will they need to query the data?

Prototyping! Fail quickly!

elasticsearch supports Nested objects, parent/child docs
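One way to "start backwards" is to write down the document an application screen would want, then check that a typical question can be answered from that one document. The sketch below assumes a product with a list of retailer offers; every field name and value is hypothetical.

```python
# Hypothetical product document shaped around what the application queries.
product = {
    "product_id": 42,
    "name": "Compact Camera",
    "brand": "ExampleBrand",
    # One entry per retailer line-item; a candidate for a nested object mapping.
    "offers": [
        {"retailer": "Retailer A", "price": 199.0, "in_stock": True},
        {"retailer": "Retailer B", "price": 189.0, "in_stock": False},
    ],
}


def cheapest_in_stock(doc):
    """Answer a typical application question directly from one document."""
    offers = [offer for offer in doc["offers"] if offer["in_stock"]]
    return min(offers, key=lambda offer: offer["price"]) if offers else None


best = cheapest_in_stock(product)
```

If a question like this needs data from several documents, that is a hint the model may want nesting or parent/child relationships instead.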

Page 19: GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elasticsearch

Joins

Events relate to line-items

Amazon decreased price

Pixmania is running a promotion

Need to group by Product

Use key/value store

Get full Product document

Modify it, write it back

Enqueue indexing instruction

[Diagram: Event Q, Indexer, Key/value store and Index Q, with flow steps numbered 1 to 5]
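The get, modify, write back, enqueue cycle on this slide can be sketched with in-memory stand-ins. Here a plain dict plays the key/value store and a deque plays the Index Q; the event shape and the `handle_event` helper are assumptions made for illustration, not the talk's actual code.

```python
import copy
from collections import deque

kv_store = {}          # stand-in for the key/value store
index_queue = deque()  # stand-in for the Index Q


def handle_event(event):
    """Fold one line-item event into its Product document and queue it for indexing."""
    product_id = event["product_id"]
    # Get the full Product document, or start a new one.
    product = kv_store.get(product_id, {"product_id": product_id, "offers": {}})
    # Modify it: line-items are grouped under the product, keyed by retailer.
    product["offers"][event["retailer"]] = {"price": event["price"]}
    # Write it back.
    kv_store[product_id] = product
    # Enqueue an indexing instruction carrying a snapshot of the whole document.
    index_queue.append({"id": product_id, "doc": copy.deepcopy(product)})


handle_event({"product_id": 42, "retailer": "Amazon", "price": 189.0})
handle_event({"product_id": 42, "retailer": "Pixmania", "price": 195.0})
```

Each instruction on the queue carries the complete product, so the indexer never has to join line-items itself.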

Page 20: GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elasticsearch

Where to join?

elasticsearch

Consider performance

Depends how data is structured/indexed (e.g. parent/child)

Compression, collisions

In-memory cache (e.g. Memcache)

Persistent storage (e.g. Cassandra or Mongo)

Two awesome benefits

Quickly re-index if needed

Updates have access to the full Product data

Serialisation is costly

Page 21: GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elasticsearch

Synchronisation & Concurrency

Fault tolerance

Code to expect missing data

Out of sequence events

Concurrency Control

Apply Optimistic Concurrency Control at Mongo

Optimise for collisions
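Optimistic concurrency control can be sketched as "write only if the revision I read is still current, otherwise re-read and retry". The in-memory `Store` below is a hypothetical stand-in; with MongoDB the same check is commonly expressed as a conditional update that filters on both the document id and the expected revision.

```python
class VersionConflict(Exception):
    """Raised when a document changed between read and write."""


class Store:
    """In-memory stand-in for a document store with a revision per document."""

    def __init__(self):
        self._docs = {}

    def read(self, key):
        # Returns (revision, copy of document); revision 0 means not stored yet.
        rev, doc = self._docs.get(key, (0, {}))
        return rev, dict(doc)

    def write_if_unchanged(self, key, expected_rev, doc):
        current_rev, _ = self._docs.get(key, (0, {}))
        if current_rev != expected_rev:
            raise VersionConflict(key)
        self._docs[key] = (expected_rev + 1, dict(doc))


def update_with_retry(store, key, mutate, max_attempts=5):
    """Apply mutate to the document, retrying when another writer got in first."""
    for _ in range(max_attempts):
        rev, doc = store.read(key)
        mutate(doc)
        try:
            store.write_if_unchanged(key, rev, doc)
            return rev + 1
        except VersionConflict:
            continue  # someone else won the race; re-read and try again
    raise VersionConflict(key)


store = Store()
new_rev = update_with_retry(store, "product:42", lambda d: d.update(price=189.0))
```

No locks are held while the document is being modified; the cost is only paid when two writers actually collide.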

Page 22: GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elasticsearch

Synchronisation & Concurrency

Synchronisation

Out of sequence index instructions

elasticsearch external versioning

Can rebuild from scratch if need to

Consistency

Which version is right?

Dates

Revision numbers from SQL

Independent updates
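With external versioning, an index request carries a version number from the source system and is accepted only if that number is higher than the one already stored; in Elasticsearch this is requested with `version_type=external` plus a `version` value. The toy class below only mimics that rule in memory so the effect on out-of-sequence instructions is visible; it is not Elasticsearch code.

```python
class ExternalVersionIndex:
    """Toy index mimicking external versioning: a write wins only if its version is higher."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def index(self, doc_id, version, source):
        stored = self._docs.get(doc_id)
        if stored is not None and version <= stored[0]:
            return False  # stale or duplicate instruction, rejected
        self._docs[doc_id] = (version, source)
        return True

    def get(self, doc_id):
        return self._docs[doc_id]


idx = ExternalVersionIndex()
results = [
    idx.index("42", 3, {"price": 189.0}),  # newest revision arrives first
    idx.index("42", 2, {"price": 195.0}),  # older revision arrives late
]
```

The late instruction is dropped, so the index converges on revision 3 regardless of arrival order.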

Page 23: GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elasticsearch

Figures

Ingestion

20m data points/day (continuously)

~ 200GB

3K msgs/second at peak

Hardware

SQL: 2 x 12-core, 64GB, 72-spindle SAN

Indexing: 4 x 4-core, 8GB

Mongo: 1 x 4-core, 16GB, 1xSSD

Elastic: 5 x 4-core, 16GB, 1xSSD

             Custom-Built Lucene    elasticsearch
Latency      3 hours                < 1 second
Bottleneck   Disk (SQL)             CPU
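For a sense of scale, the ingestion figures can be turned into an average rate and compared with the quoted peak. This assumes, purely for illustration, that each data point corresponds to one message; the slides do not state that.

```python
SECONDS_PER_DAY = 24 * 60 * 60

data_points_per_day = 20_000_000  # from the slide
peak_msgs_per_second = 3_000      # from the slide

# Assumption: one message per data point (not stated in the talk).
average_per_second = data_points_per_day / SECONDS_PER_DAY
peak_to_average = peak_msgs_per_second / average_per_second

print(f"average: {average_per_second:.0f}/s, peak is {peak_to_average:.1f}x the average")
```

Under that assumption the pipeline has to absorb bursts roughly an order of magnitude above its average load, which is where a durable queue in front of the indexers helps.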

Page 24: GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elasticsearch

Managing Change

[Diagram: Event Q, Indexer, Key/value store, Index, Index_A, Index_B, Alias, Client]
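The slide appears to show clients reading through an alias while a second index is built alongside the first. Assuming that reading is right, the switch can be sketched as a single alias update that removes the alias from the old index and adds it to the new one. The alias and index names below are illustrative, and the helper `alias_swap_actions` is hypothetical.

```python
import json


def alias_swap_actions(alias, old_index, new_index):
    """Build the body for one alias update that moves alias from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


swap_body = json.dumps(alias_swap_actions("products", "index_a", "index_b"))
```

Sent to the `_aliases` endpoint, both actions are applied together, so clients querying the alias never see a moment with no index behind it.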

Page 25: GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elasticsearch

Thanks

@YannCluchey

Concurrency Patterns with MongoDB: http://slidesha.re/YFOehF

Consistency without Consensus, Peter Bourgon, SoundCloud: http://bit.ly/1DUAO1R

Eventually Consistent Data Structures, Sean Cribbs, Basho: https://vimeo.com/43903960