GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elasticsearch

Post on 01-Jul-2015

2.098 views 1 download

description

My talk from GOTO Aarhus, 30th September 2014. Cogenta is a retail intelligence company which tracks ecommerce web sites around the world to provide competitive monitoring and analysis services to retailers. Using its proprietary crawler technology, Lucene and SQL Server, a stream of 20 million raw product data entries is captured and processed each day. This case study looks at how Cogenta uses Elasticsearch to break the shackles imposed by the RDBMS (and a limited budget) to make the data available in real time to its customers. Cogenta uses SQL as its canonical store & for complex reporting, and Elasticsearch for real-time processing & to drive its SaaS web applications. Elasticsearch is easy to use, delivers the powerful features of Lucene and enables the data & platform cost to scale linearly. But… synchronising your existing data in two places presents some interesting challenges such as aggregation and concurrency control. This talk will take a detailed look at how Cogenta how overcame those challenges, with a perpetually changing and asynchronously updated dataset. http://gotocon.com/aarhus-2014/presentation/Cogenta%20-%20Making%20Enterprise%20Data%20Available%20in%20Real%20Time%20with%20Elasticsearch

Transcript of GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elasticsearch

MAKING ENTERPRISE DATA

AVAILABLE IN REAL TIME

WITH ELASTICSEARCH

Yann Cluchey

CTO @ Cogenta

CTO @ GfK Online Pricing Intelligence

What is Enterprise Data?

What is Enterprise Data?

Online Pricing Intelligence

1. Gather data from 500+ of eCommerce sites

2. Organise into high quality market view

3. Competitive intelligence tools

Price,

Stock,

Meta

Price,

Stock,

Meta

Price,

Stock,

Meta Price,

Stock,

Meta

HTML

Custom Crawler

Parse web content

Discover product data

Tracking 20m products

Daily+

HTML

HTML

HTML

Database

Processing, Storage

Enrichment

Persistent Storage

Product Catalogue

+ time series data

Processing

Database

Thing #1 - Detection

Identify distinct products

Automated information retrieval

Lucene + custom index builder

Continuous process

(Humans for QA)

Lucene

Index

Index Builder

GUI

Matcher

Thing #2 - BI Tools

Web Applications

Also based on Lucene

Batch index build process

Per-customer indexes

Database

Customer

Index 1

Index Builder

BI Tools

Customer

Index 2

Customer

Index 3

Thing #1 - Pain

Continuously indexing

Track changes, read back out to index

Drain on performance

Latency, coping with peaks

Full rebuild for index schema change

or inconsistencies

Full rebuild doesn’t scale well…

Unnecessary work..?Lucene

Index

Index Builder

GUI

Database

Customer

Index 2

Thing #2 - Pain

Twice daily batch rebuild, per customer

Very slow

Moar customers?

Moar data?

Moar often?

Data set too complex,

keeps changing

Index shipping

Moar web servers?

Database

Customer

Index 1

Index Builder

BI Tools

Customer

Index 3

Indexing

Database

Batch Sync

Web Server 1 Web Server 2

Pain Points

As data, customers scale,

processes slow down

Adapting to change

Easy to layer on,

hard to make fundamental changes

Read vs write concerns

Database Maintenance

Index

Index Builder

Database

Goals

Eliminate latencies

Improve scalability

Improve availability

Something achievable

Your mileage will vary

elasticsearch

Open source, distributed search engine

Based on Lucene, fully featured API

Querying, filtering, aggregation

Text processing / IR

Schema-free

Yummy

(real-time, sharding, highly available)

Silver bullets not included

IndexingDatabase

IndexingDatabase

Our Pipeline

Database

CrawlersCrawlers

ProcessorsProcessors

ProcessorsProcessors

ProcessorsProcessorsIndexers

IndexesIndexes

Indexes

Web ServersWeb Servers

Web Servers

Our New Pipeline

Database

CrawlersCrawlers

ProcessorsProcessors

ProcessorsProcessors

ProcessorsProcessorsIndexers

IndexesIndexes

IndexesWeb Servers

Web ServersWeb Servers

Event Hooks

Messages fired OnCreate.. and OnUpdate

Payload contains everything needed for indexing

The data

Keys (still mastered in SQL)

Versioning

Sender has all the information already

Use RabbitMQ to control event message flow

Messages are durable

Indexing Strategy

RESTful API (HTTP, Thrift, Memcache)

Use bulk methods

They support percolation

Rivers (pull)

RabbitMQ River

JDBC River

Mongo/Couch/etc. River

Logstash

Index QIndexer

Event Q

Model Your Data

What’s in your documents?

Database = Index

Table = Type ...?

Start backwards

What do your applications need?

How will they need to query the data?

Prototyping! Fail quickly!

elasticsearch supports Nested objects, parent/child docs

Joins

Events relate to line-items

Amazon decreased price

Pixmania is running a promotion

Need to group by Product

Use key/value store

Get full Product document

Modify it, write it back

Enqueue indexing instruction

IndexerEvent Q3 3 5

1 4

1

2

13

4Index Q

Key/value

store

5

Where to join?

elasticsearch

Consider performance

Depends how data is structured/indexed (e.g. parent/child)

Compression, collisions

In-memory cache (e.g. Memcache)

Persistent storage (e.g. Cassandra or Mongo)

Two awesome benefits

Quickly re-index if needed

Updates have access to the full Product data

Serialisation is costly

Synchronisation & Concurrency

Fault tolerance

Code to expect missing data

Out of sequence events

Concurrency Control

Apply Optimistic Concurrency Control at Mongo

Optimise for collisions

Synchronisation & Concurrency

Synchronisation

Out of sequence index instructions

elasticsearch external versioning

Can rebuild from scratch if need to

Consistency

Which version is right?

Dates

Revision numbers from SQL

Independent updates

Figures

Ingestion

20m data points/day (continuously)

~ 200GB

3K msgs/second at peak

Hardware

SQL: 2 x 12-core, 64GB, 72-spindle SAN

Indexing: 4 x 4-core, 8GB

Mongo: 1 x 4-core, 16GB, 1xSSD

Elastic: 5 x 4-core, 16GB, 1xSSD

Custom-Built

Lucene

elasticsearch

Latency 3 hours < 1 second

Bottleneck Disk (SQL) CPU

Managing Change

Key/value

store

Index_A

Client

IndexerEvent Q

Alias

Index_B

Index

Index_BIndex_A

Thanks

@YannCluchey

Concurrency Patterns with MongoDBhttp://slidesha.re/YFOehF

Consistency without ConsensusPeter Bourgon, SoundCloudhttp://bit.ly/1DUAO1R

Eventually Consistent Data StructuresSean Cribbs, Bashohttps://vimeo.com/43903960