Clickstream Analysis With Apache Spark

CLICKSTREAM ANALYSIS WITH APACHE SPARK Andreas Zitzelsberger

Transcript of Clickstream Analysis With Apache Spark

Page 1: Clickstream Analysis With Apache Spark

CLICKSTREAM ANALYSIS WITH APACHE SPARK

Andreas Zitzelsberger

Page 2: Clickstream Analysis With Apache Spark

THE CHALLENGE

Page 3: Clickstream Analysis With Apache Spark

ONE POT TO RULE THEM ALL

Web Tracking
▪ Clicks & Views
▪ Conversions

Ad Tracking
▪ Ad Impressions
▪ Ad Costs

ERP
▪ Products
▪ Inventory
▪ Margins

CRM
▪ Customer
▪ Orders
▪ Creditworthiness

Page 4: Clickstream Analysis With Apache Spark

ONE POT TO RULE THEM ALL

Retention

Reach

Monetization

steer …
▪ Campaigns
▪ Offers
▪ Contents

Page 5: Clickstream Analysis With Apache Spark

REACT TO WEBSITE TRAFFIC IN REAL TIME

Image: https://www.flickr.com/photos/nick-m/3663923048

Page 6: Clickstream Analysis With Apache Spark

SAMPLE RESULTS

Geolocated and gender-specific conversions.

Frequency of visits

Performance of an ad campaign

Page 7: Clickstream Analysis With Apache Spark

THE CONCEPTS

Image: Randy Paulino

Page 8: Clickstream Analysis With Apache Spark

THE FIRST SKETCH

(= real-time)

SQL

Page 9: Clickstream Analysis With Apache Spark
Page 10: Clickstream Analysis With Apache Spark

CALCULATING USER JOURNEYS

C V VT VT VT C X

C V

V V V V V V V

C V V C V V V

VT VT V V V VT C

V X

Event stream: User journeys:

Web / Ad tracking

KPIs:
▪ Unique users
▪ Conversions
▪ Ad costs / conversion value
▪ …

Legend:
C: Click
V: View
VT: View Time
X: Conversion

Page 11: Clickstream Analysis With Apache Spark

THE ARCHITECTURE

Big Data

Page 12: Clickstream Analysis With Apache Spark

„LARRY & FRIENDS“ ARCHITECTURE

Does not run well for more than 1 TB of data in terms of ingestion speed, query time, and optimization effort

Page 13: Clickstream Analysis With Apache Spark

Image: adweek.com

Nope. Sorry, no Big Data.

Page 14: Clickstream Analysis With Apache Spark

„HADOOP & FRIENDS“ ARCHITECTURE

Aggregation takes too long

Cumbersome programming model (can be solved with Pig, Cascading et al.)

Not interactive enough

Page 15: Clickstream Analysis With Apache Spark

Nope. Too sluggish.

Page 16: Clickstream Analysis With Apache Spark

Κ-ARCHITECTURE

Cumbersome programming model

Over-engineered: We only need 15min real-time ;-)

Stateful aggregations (unique x, conversions) require a separate DB with high throughput and fast aggregations & lookups.

Page 17: Clickstream Analysis With Apache Spark

Λ-ARCHITECTURE

Cumbersome programming model

Complex architecture

Redundant logic

Page 18: Clickstream Analysis With Apache Spark

FEELS OVER-ENGINEERED…

http://www.brainlazy.com/article/random-nonsense/over-engineered

Page 19: Clickstream Analysis With Apache Spark

The Final Architecture*

*) Maybe called µ-architecture one day ;-)

Page 20: Clickstream Analysis With Apache Spark

FUNCTIONAL ARCHITECTURE

[Diagram: Ingestion, Collection, Events, Processing, Analytics Warehouse; labels: Raw Event Stream, Atomic Event Frames, Fact Entries, Strange Events, Data Lake, Master Data Integration]

▪ Buffers load peaks
▪ Ensures message delivery (fire & forget for client)

▪ Create user journeys and unique user sets
▪ Enrich dimensions
▪ Aggregate events to KPIs
▪ Ability to replay for schema evolution

▪ The representation of truth
▪ Multidimensional data model
▪ Interactive queries for actions in real time and data exploration

▪ Eternal memory for all events (even strange ones)
▪ One schema per event type. Time partitioned.

▪ Fault-tolerant message handling
▪ Event handling: apply schema, time partitioning, de-dup, sanity checks, pre-aggregation, filtering, fraud detection
▪ Tolerates delayed events
▪ High throughput, moderate latency (~ 1min)
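Two of the event handling steps named on this slide, de-duplication and time partitioning, can be illustrated in plain Java. This is only a minimal sketch under assumptions: `RawEventSketch`, its fields, the hourly UTC partition key and the in-memory set of seen ids are all hypothetical and stand in for whatever the real collection job uses.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CollectionSketch {
    // Hourly partitions in UTC; the granularity is an assumption of this sketch.
    private static final DateTimeFormatter HOUR =
            DateTimeFormatter.ofPattern("yyyy-MM-dd/HH").withZone(ZoneOffset.UTC);

    /** Partition key derived from the event time, not from the arrival time. */
    static String partitionOf(Instant eventTime) {
        return HOUR.format(eventTime);
    }

    /** Drops events whose id was seen before and groups the remaining ids by partition. */
    static Map<String, List<String>> collect(List<RawEventSketch> rawEvents) {
        Set<String> seenIds = new HashSet<>();
        Map<String, List<String>> partitions = new LinkedHashMap<>();
        for (RawEventSketch e : rawEvents) {
            if (!seenIds.add(e.id)) {
                continue; // duplicate delivery, ignore
            }
            partitions.computeIfAbsent(partitionOf(e.eventTime), k -> new ArrayList<>()).add(e.id);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<RawEventSketch> raw = Arrays.asList(
                new RawEventSketch("a", Instant.parse("2015-06-01T10:05:00Z")),
                new RawEventSketch("a", Instant.parse("2015-06-01T10:05:00Z")),  // delivered twice
                new RawEventSketch("b", Instant.parse("2015-06-01T11:59:00Z")),
                new RawEventSketch("c", Instant.parse("2015-06-01T10:59:00Z"))); // arrives out of order
        System.out.println(collect(raw)); // {2015-06-01/10=[a, c], 2015-06-01/11=[b]}
    }
}

/** Hypothetical raw event; the field names are assumptions. */
final class RawEventSketch {
    final String id;
    final Instant eventTime;

    RawEventSketch(String id, Instant eventTime) {
        this.id = id;
        this.eventTime = eventTime;
    }
}
```

Partitioning by event time rather than arrival time is what lets a delayed event still land in the partition it belongs to.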

Page 21: Clickstream Analysis With Apache Spark

SERIAL CONNECTION OF STREAMING AND BATCHING

[Diagram: Ingestion, Collection, Event Data Lake, Processing, Analytics Warehouse, SQL Interface; labels: Raw Event Stream, Atomic Event Frames, Fact Entries]

▪ Cool programming model
▪ Uniform dev&ops

▪ Simple solution
▪ High compression ratio due to column-oriented storage
▪ High scan speed

▪ Cool programming model
▪ Uniform dev&ops
▪ High performance
▪ Interface to R out-of-the-box
▪ Useful libs: MLlib, GraphX, NLP, …

▪ Good connectivity (JDBC, ODBC, …)

▪ Interactive queries
▪ Uniform ops
▪ Can easily be replaced due to Hive Metastore

▪ Obvious choice for cloud-scale messaging
▪ By far the best throughput and scalability of all evaluated alternatives

Page 22: Clickstream Analysis With Apache Spark

public Map<Long, UserJourney> sessionize(JavaRDD<AtomicEvent> events) {
    return events
        // Convert to a pair RDD with the userId as key
        .mapToPair(e -> new Tuple2<>(e.getUserId(), e))
        // Build user journeys
        .<UserJourneyAcc>combineByKey(
            UserJourneyAcc::create,
            UserJourneyAcc::add,
            UserJourneyAcc::combine)
        // Convert to a Java map
        .collectAsMap();
}
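The slide does not show UserJourneyAcc itself. The sketch below is one possible shape of such an accumulator, written in plain Java without Spark so that it runs on its own. All class names, fields and the `journey()` helper are assumptions; only the three roles (create from the first event, add an event, combine two partial results) follow from the `combineByKey` call above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;
import java.util.function.Function;

final class JourneyDemo {
    /** Simulates what combineByKey does for one user whose events sit in two partitions. */
    static String demo() {
        Function<EventSketch, JourneyAccSketch> create = JourneyAccSketch::create;
        BiFunction<JourneyAccSketch, EventSketch, JourneyAccSketch> add = JourneyAccSketch::add;
        BinaryOperator<JourneyAccSketch> combine = JourneyAccSketch::combine;

        // Partition 1 holds the click and the conversion, partition 2 holds the view.
        JourneyAccSketch p1 = add.apply(
                create.apply(new EventSketch(7L, 100L, "C")),
                new EventSketch(7L, 300L, "X"));
        JourneyAccSketch p2 = create.apply(new EventSketch(7L, 200L, "V"));
        return combine.apply(p1, p2).journey();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // C V X
    }
}

/** Hypothetical event; the field names are assumptions for this sketch. */
final class EventSketch {
    final long userId;
    final long timestamp; // event time in epoch millis
    final String type;    // "C", "V", "VT" or "X"

    EventSketch(long userId, long timestamp, String type) {
        this.userId = userId;
        this.timestamp = timestamp;
        this.type = type;
    }
}

/** Hypothetical accumulator with the three operations combineByKey needs. */
final class JourneyAccSketch {
    private final List<EventSketch> events = new ArrayList<>();

    /** Starts an accumulator from the first event seen for a user. */
    static JourneyAccSketch create(EventSketch first) {
        return new JourneyAccSketch().add(first);
    }

    /** Adds one more event of the same user within a partition. */
    JourneyAccSketch add(EventSketch e) {
        events.add(e);
        return this;
    }

    /** Merges the partial result of another partition into this one. */
    JourneyAccSketch combine(JourneyAccSketch other) {
        events.addAll(other.events);
        return this;
    }

    /** The journey as event types ordered by event time. */
    String journey() {
        List<EventSketch> sorted = new ArrayList<>(events);
        sorted.sort(Comparator.comparingLong(e -> e.timestamp));
        StringBuilder sb = new StringBuilder();
        for (EventSketch e : sorted) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(e.type);
        }
        return sb.toString();
    }
}
```

Sorting only when the journey is read keeps add and combine cheap and makes the result independent of the order in which partitions are merged.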

Page 23: Clickstream Analysis With Apache Spark

STREAM VERSUS BATCH

https://en.wikipedia.org/wiki/Tanker_(ship)#/media/File:Sirius_Star_2008b.jpg
https://blog.allstate.com/top-5-safety-tips-at-the-gas-pump/

Page 24: Clickstream Analysis With Apache Spark

APACHE FLINK

■ Also has a nice, Spark-like API
■ Promises similar or better performance than Spark
■ Looks like the best solution for a κ-Architecture
■ But it’s also the newest kid on the block

Page 25: Clickstream Analysis With Apache Spark

EVENT VERSUS PROCESSING TIME

■ There’s a difference between event time (te) and processing time (tp).
■ Events arrive out of order even during normal operation.
■ Events may arrive arbitrarily late.

Apply a grace period before processing events.

Allow arbitrary update windows of metrics.
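One way to read the grace period rule is sketched below in plain Java. It is an illustration under assumptions, not the talk's implementation: the method names, the 15 minute aggregation time frame, the 1 minute grace period and the sample timestamps are all made up, and the sketch assumes a window is aggregated as soon as its grace period has passed.

```java
import java.time.Duration;
import java.time.Instant;

final class GracePeriodSketch {
    /** Start of the aggregation time frame (length dtp) that contains event time te. */
    static Instant windowStart(Instant te, Duration dtp) {
        long frame = dtp.toMillis();
        return Instant.ofEpochMilli(Math.floorDiv(te.toEpochMilli(), frame) * frame);
    }

    /** A window is aggregated for the first time once its end plus the grace period dtw has passed. */
    static boolean readyForInsert(Instant windowStart, Duration dtp, Duration dtw, Instant tp) {
        return !tp.isBefore(windowStart.plus(dtp).plus(dtw));
    }

    /** An event ingested at ti after its window was aggregated requires an update of that fact. */
    static boolean causesUpdate(Instant te, Instant ti, Duration dtp, Duration dtw) {
        return readyForInsert(windowStart(te, dtp), dtp, dtw, ti);
    }

    public static void main(String[] args) {
        Duration dtp = Duration.ofMinutes(15); // aggregation time frame (assumed value)
        Duration dtw = Duration.ofMinutes(1);  // grace period (assumed value)
        Instant te = Instant.parse("2015-06-01T10:07:00Z");

        Instant start = windowStart(te, dtp);
        System.out.println(start); // 2015-06-01T10:00:00Z
        System.out.println(readyForInsert(start, dtp, dtw, Instant.parse("2015-06-01T10:15:30Z"))); // false
        System.out.println(readyForInsert(start, dtp, dtw, Instant.parse("2015-06-01T10:16:00Z"))); // true
        System.out.println(causesUpdate(te, Instant.parse("2015-06-01T10:40:00Z"), dtp, dtw));      // true
    }
}
```

Events that arrive within the grace period are simply part of the first aggregation; only events that arrive later need the update path.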

Page 26: Clickstream Analysis With Apache Spark

EXAMPLE

[Figure: facts per resolution in time (Minute, Hour, Day, Week, Month, Quarter, Year) plotted over time, each marked I or U]

Legend:
tp: Processing time
ti: Ingestion time
te: Event time
dtp: Aggregation time frame
dtw: Grace period
I: Insert fact
U: Update fact

Page 27: Clickstream Analysis With Apache Spark

LESSONS LEARNED

Image: http://hochmeister-alpin.at

Page 28: Clickstream Analysis With Apache Spark

BEST-OF-BREED INSTEAD OF COMMODITY SOLUTIONS

ETL

Analytics

Realtime Analytics

Slice & Dice

Data Exploration

Polyglot Processing

http://datadventures.ghost.io/2014/07/06/polyglot-processing

Page 29: Clickstream Analysis With Apache Spark

POLYGLOT ANALYTICS

Data Lake Analytics Warehouse

SQL lane

R lane

Time series lane

Reporting, Data Exploration, Data Science

Page 30: Clickstream Analysis With Apache Spark

NO RETENTION PARANOIA

Data Lake

Analytics Warehouse

▪ Eternal memory
▪ Close to raw events
▪ Allows replays and refills into warehouse

Aggressive forgetting with a clearly defined retention policy per aggregation level, like:
▪ 15min:30d
▪ 1h:4m
▪ …
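A retention policy per aggregation level could be expressed as a small lookup table, as in the following sketch. The class and method names are hypothetical, and reading "4m" as four months is an assumption; only the two pairs 15min:30d and 1h:4m come from the slide.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.Period;
import java.util.LinkedHashMap;
import java.util.Map;

final class RetentionSketch {
    /** Retention per aggregation level. Reading "4m" as four months is an assumption. */
    static final Map<Duration, Period> POLICY = new LinkedHashMap<>();
    static {
        POLICY.put(Duration.ofMinutes(15), Period.ofDays(30)); // 15min:30d
        POLICY.put(Duration.ofHours(1), Period.ofMonths(4));   // 1h:4m
    }

    /** True if a fact of the given aggregation level may be deleted at time now. */
    static boolean expired(Duration level, LocalDateTime factTime, LocalDateTime now) {
        Period keep = POLICY.get(level);
        if (keep == null) {
            return false; // no policy known for this level: keep the fact
        }
        return factTime.plus(keep).isBefore(now);
    }

    public static void main(String[] args) {
        LocalDateTime fact = LocalDateTime.of(2015, 1, 1, 0, 0);
        LocalDateTime now = LocalDateTime.of(2015, 3, 1, 0, 0);
        System.out.println(expired(Duration.ofMinutes(15), fact, now)); // true
        System.out.println(expired(Duration.ofHours(1), fact, now));    // false
    }
}
```

Because the data lake keeps the raw events, a fact that was dropped too early can still be rebuilt by a replay.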

Events

Strange Events

Page 31: Clickstream Analysis With Apache Spark

BEWARE OF THE HIPSTERS

Image: h&m

Page 32: Clickstream Analysis With Apache Spark

ENSURE YOUR SOFTWARE RUNS LOCALLY

The entire architecture must be able to run locally. Keep the round trips low for development and testing.

Throughput and reaction times need to be monitored continuously. Tune your software and the underlying frameworks as needed.

Page 33: Clickstream Analysis With Apache Spark

TUNE CONTINUOUSLY

[Diagram: Ingestion, Collection, Event Data Lake, Processing, Analytics Warehouse, SQL Interface; labels: Raw Event Stream, Atomic Event Frames, Fact Entries]

Load generator

Throughput & latency probes

System, container and process monitoring

Page 34: Clickstream Analysis With Apache Spark

IN NUMBERS

Overall dev effort until the first release: 250 person days

Dimensions: 10

KPIs: 26

Integrated 3rd party systems: 7

Inbound data volume per day: 80GB

New data in DWH per day: 2GB

Total price of cheapest cluster which is able to handle production load:

Page 35: Clickstream Analysis With Apache Spark
Page 36: Clickstream Analysis With Apache Spark

THANK YOU

@andreasz82

Page 37: Clickstream Analysis With Apache Spark

BONUS SLIDES

Page 38: Clickstream Analysis With Apache Spark

CALCULATING UNIQUE USERS

■ We need an exact unique user count.

■ If you can, you should use an approximation such as HyperLogLog.

[Figure: users U1, U2, U3, U1, U4 plotted over time; counts shown: 3 UU, 2 UU, 4 UU]

Flajolet, P.; Fusy, E.; Gandouet, O.; Meunier, F. (2007). "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm". AOFA ’07: Proceedings of the 2007 International Conference on the Analysis of Algorithms.
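The reason exact unique user counts are expensive is that they are not additive: counts of two time windows cannot be summed, the underlying user sets have to be merged. The sketch below shows this in plain Java. The split of the five events into two windows (U1, U2, U3 and U1, U4) is an assumption chosen to match the counts in the figure, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class UniqueUsersSketch {
    /** Exact unique users of one window: the set of distinct user ids. */
    static Set<String> uniqueUsers(List<String> userIdsOfEvents) {
        return new HashSet<>(userIdsOfEvents);
    }

    /** Rolling up two windows needs a set union; adding the counts would count U1 twice. */
    static Set<String> merge(Set<String> a, Set<String> b) {
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union;
    }

    public static void main(String[] args) {
        Set<String> first = uniqueUsers(Arrays.asList("U1", "U2", "U3"));
        Set<String> second = uniqueUsers(Arrays.asList("U1", "U4"));
        System.out.println(first.size() + " UU, " + second.size() + " UU, merged: "
                + merge(first, second).size() + " UU"); // 3 UU, 2 UU, merged: 4 UU
    }
}
```

Keeping the full sets is what makes the count exact and also what makes it costly; a sketch structure such as HyperLogLog trades that exactness for a small, mergeable summary.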

Page 39: Clickstream Analysis With Apache Spark

CHARTING TECHNOLOGY

https://github.com/qaware/big-data-landscape

Page 40: Clickstream Analysis With Apache Spark

CHOOSING WHERE TO AGGREGATE

[Diagram: Ingestion, Event Data Lake, Processing, Analytics Warehouse, Analytics; labels: Atomic Event Frames, Fact Entries]

1
- Enrichment
- Preprocessing
- Validation

2
The heavy lifting.

3
- Processing steps that can be done at query time.
- Interactive queries.