GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elasticsearch

MAKING ENTERPRISE DATA

AVAILABLE IN REAL TIME

WITH ELASTICSEARCH

Yann Cluchey

CTO @ Cogenta

CTO @ GfK Online Pricing Intelligence

What is Enterprise Data?

Online Pricing Intelligence

1. Gather data from 500+ of eCommerce sites

2. Organise into high quality market view

3. Competitive intelligence tools

Price,

Stock,

Price,

Stock,

Price,

Stock,

Meta Price,

Stock,

Custom Crawler

Parse web content

Discover product data

Tracking 20m products

Daily+

Database

Processing, Storage

Enrichment

Persistent Storage

Product Catalogue

+ time series data

Processing

Database

Thing #1 - Detection

Identify distinct products

Automated information retrieval

Lucene + custom index builder

Continuous process

(Humans for QA)

Lucene

Index Builder

Matcher

Thing #2 - BI Tools

Web Applications

Also based on Lucene

Batch index build process

Per-customer indexes

Database

Customer

Index 1

Index Builder

BI Tools

Customer

Index 2

Customer

Index 3

Thing #1 - Pain

Continuously indexing

Track changes, read back out to index

Drain on performance

Latency, coping with peaks

Full rebuild for index schema change

or inconsistencies

Full rebuild doesn’t scale well…

Unnecessary work..?Lucene

Index Builder

Database

Customer

Index 2

Thing #2 - Pain

Twice daily batch rebuild, per customer

Very slow

Moar customers?

Moar data?

Moar often?

Data set too complex,

keeps changing

Index shipping

Moar web servers?

Database

Customer

Index 1

Index Builder

BI Tools

Customer

Index 3

Indexing

Database

Batch Sync

Web Server 1 Web Server 2

Pain Points

As data, customers scale,

processes slow down

Adapting to change

Easy to layer on,

hard to make fundamental changes

Read vs write concerns

Database Maintenance

Index Builder

Database

Eliminate latencies

Improve scalability

Improve availability

Something achievable

Your mileage will vary

elasticsearch

Open source, distributed search engine

Based on Lucene, fully featured API

Querying, filtering, aggregation

Text processing / IR

Schema-free

(real-time, sharding, highly available)

Silver bullets not included

IndexingDatabase

Our Pipeline

Database

CrawlersCrawlers

ProcessorsProcessors

ProcessorsProcessorsIndexers

IndexesIndexes

Indexes

Web ServersWeb Servers

Web Servers

Our New Pipeline

Database

CrawlersCrawlers

ProcessorsProcessors

ProcessorsProcessorsIndexers

IndexesIndexes

IndexesWeb Servers

Web ServersWeb Servers

Event Hooks

Messages fired OnCreate.. and OnUpdate

Payload contains everything needed for indexing

The data

Keys (still mastered in SQL)

Versioning

Sender has all the information already

Use RabbitMQ to control event message flow

Messages are durable

Indexing Strategy

RESTful API (HTTP, Thrift, Memcache)

Use bulk methods

They support percolation

Rivers (pull)

RabbitMQ River

JDBC River

Mongo/Couch/etc. River

Logstash

Index QIndexer

Event Q

Model Your Data

What’s in your documents?

Database = Index

Table = Type ...?

Start backwards

What do your applications need?

How will they need to query the data?

Prototyping! Fail quickly!

elasticsearch supports Nested objects, parent/child docs

Events relate to line-items

Amazon decreased price

Pixmania is running a promotion

Need to group by Product

Use key/value store

Get full Product document

Modify it, write it back

Enqueue indexing instruction

IndexerEvent Q3 3 5

4Index Q

Key/value

Where to join?

elasticsearch

Consider performance

Depends how data is structured/indexed (e.g. parent/child)

Compression, collisions

In-memory cache (e.g. Memcache)

Persistent storage (e.g. Cassandra or Mongo)

Two awesome benefits

Quickly re-index if needed

Updates have access to the full Product data

Serialisation is costly

Synchronisation & Concurrency

Fault tolerance

Code to expect missing data

Out of sequence events

Concurrency Control

Apply Optimistic Concurrency Control at Mongo

Optimise for collisions

Synchronisation & Concurrency

Synchronisation

Out of sequence index instructions

elasticsearch external versioning

Can rebuild from scratch if need to

Consistency

Which version is right?

Revision numbers from SQL

Independent updates

Figures

Ingestion

20m data points/day (continuously)

~ 200GB

3K msgs/second at peak

Hardware

SQL: 2 x 12-core, 64GB, 72-spindle SAN

Indexing: 4 x 4-core, 8GB

Mongo: 1 x 4-core, 16GB, 1xSSD

Elastic: 5 x 4-core, 16GB, 1xSSD

Custom-Built

Lucene

elasticsearch

Latency 3 hours < 1 second

Bottleneck Disk (SQL) CPU

Managing Change

Key/value

Index_A

Client

IndexerEvent Q

Index_B

Index_BIndex_A

Thanks

@YannCluchey

Concurrency Patterns with MongoDBhttp://slidesha.re/YFOehF

Consistency without ConsensusPeter Bourgon, SoundCloudhttp://bit.ly/1DUAO1R

Eventually Consistent Data StructuresSean Cribbs, Bashohttps://vimeo.com/43903960

GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elasticsearch

Technology

Transcript of GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elasticsearch

C# for Java Developers - GOTO Conference › dl › jaoo-aarhus-2010 › slides › ... · C# for Java Developers Patrick Linskey plinskey@gmail.com @plinskey. Who am I? ... •Can

Goto aarhus

Innovation through Collaborationjaoo.dk/dl/goto-aarhus-2011/slides/MortenSkjoldborgMadsen_Advan… · Quality Manager Coordinate quality assurance plans, processes and resources Team

Elasticsearch - nosqlroadshow.comnosqlroadshow.com/.../elasticsearch_Alexander_Reelsen.pdf · Elasticsearch - The Company • Founded in 2012 • By the people behind the Elasticsearch

Lars Bak & Gilad Bracha - GOTO Conferencegotocon.com/dl/goto-aarhus-2011/slides/GiladBracha...WEB APPLICATION IN DART • Newsreader completely written in Dart • App code: 3210 LOC

Groovy & Grails in Depth - GOTO Conferencegotocon.com/dl/jaoo-aarhus-2009/slides/GraemeRocher_GrailsInDepth.pdf · Title: PRESENTATION TITLE UP TO A MAXIMUM OF THREE LINES FONT IS

Adapting software for the public cloud - GOTO Conferencegotocon.com/dl/jaoo-aarhus-2010/slides/DougWilson... · adapted from existing on premise application systems. This paper examines

Kanban vs Scrum - GOTO Conference · Kanban vs Scrum Henrik Kniberg Agile/Lean coach @ Crisp, Stockholm JAOO, Aarhus Oct 6, 2009 ... assessment • End-usersupportmateirla • Glossary

Rod Johnson Spring Roo - GOTO Conferencegotocon.com/dl/jaoo-aarhus-2009/slides/RodJohnson_ExtremeJava... · Sucks Major Ass in Comparison to Anything and ... Rails/Django/PHP etc.

C# for Java Developers - GOTO Conferencegotocon.com/dl/jaoo-aarhus-2010/slides/PatrickLinskey_CForJava... · Who are you? •Hopefully, Java developers who are interested in learning

GOTO Conference Magazine 2011 Aarhus Denmark

Measuring Progress and Performance in Large Agile …gotocon.com/dl/goto-aarhus-2011/slides/AndyCarmichael_Measuring...MEASURING PROGRESS AND PERFORMANCE IN LARGE AGILE DEVELOPMENTS

Operations and Monitoring with Spring - GOTO Conferencegotocon.com/dl/jaoo-aarhus-2009/slides/Eberhard... · Operations and Monitoring with Spring Eberhard Wolff Regional Director

Elasticsearch 5 in Amazon Elasticsearch Service

Dell EMC Search · above Elasticsearch. Elasticsearch cluster ports NGINX TCP/ HTTPS 9300– 9400 Ports for communicating with Elasticsearch (Index data nodes). Elasticsearch cluster

GoTo™ North America Focused Delivery Program … | Linear Motion GoTo Bosch Rexroth Corporation 5 GoTo Focused Delivery Program Your local GoTo contact: GoTo Program Delivery Conditions

Beware Red Flags On The Internet - GOTO Conferencegotocon.com/...aar-2012/...MorningKeynoteBewareRedFlagsOnTheInternet.pdf · @Falkvinge GOTO 2012, Aarhus Beware Red Flags On The

Designing Apps for Amazon Web Servicesjaoo.dk/dl/goto-aarhus-2011/slides/MathiasMeyer... · Designing Apps for Amazon Web Services Mathias Meyer, GOTO Aarhus 2011 Montag, 10. Oktober

Goto aarhus: Mobile Browser as a platform

Getting to know the Grid - Goto Aarhus 2013