MongoDB ClickStream and Visualization

Implementing and Visualizing Click-Stream Data with MongoDB Jan 22, 2013 - New York MongoDB User Group Cameron Sim - LearnVest.com Monday, April 15, 13


Implementing ClickStream Analytics with Spring, Java, MongoDB and Django


Page 1: MongoDB ClickStream and Visualization

Implementing and Visualizing Click-Stream Data with MongoDB

Jan 22, 2013 - New York MongoDB User Group

Cameron Sim - LearnVest.com

Monday, April 15, 13

Page 2: MongoDB ClickStream and Visualization

Agenda

About LearnVest

HL Application Architecture

Data Capture

Event Packaging

MongoDB Data Warehousing

Loading & Visualization

Finishing up


Page 3: MongoDB ClickStream and Visualization

LearnVest Inc. (www.learnvest.com)

Company
• Founded in 2008 by Alexa Von Tobel, CEO
• 50+ people and growing rapidly
• Based in NYC

Platforms
• Web & iPhone

Mission Statement
• Aiming to make financial planning as accessible as having a gym membership

Key Products
• Account Aggregation and Management (Bank, Credit, Loan, Investment, Mortgage)
• Original and Syndicated Newsletter Content
• Financial Planning (tiered product offering)

Stack

Operational
• WordPress, Backbone.js, Node.js
• Java Spring 3, Redis, Memcached, MongoDB, ActiveMQ, Nginx, MySQL 5.x

Analytics
• MongoDB 2.2.0 (3-node replica-set)
• Java 6, Spring 3
• pyMongo
• Django 1.4


Page 4: MongoDB ClickStream and Visualization

LearnVest.com: Web (screenshot)


Page 5: MongoDB ClickStream and Visualization

LearnVest.com: iPhone (screenshot)


Page 6: MongoDB ClickStream and Visualization

High Level Architecture (diagram)

Pipeline stages: Event Collection, Event Packaging, MongoDB Data Warehousing, Loading & Visualization

Component labels: Production Platform, Delivery Services, Analytics Services, Loaders & Dashboards

Connection labels: HTTPS, pyMongo, MongoDB Java Conn, MongoDB Replication, JDBC

Page 13: MongoDB ClickStream and Visualization

Philosophy For Data Collection

Capture Everything
• User-driven events over web and mobile
• System-level exceptions
• Everything else

Temporary Data
• Be 'ok' with approximate data
• Operational databases are the system of record

Aggregate events as they come in
• Remove the overhead of basic metrics (counts, sums) on core events
• Group by user unique id and increment counts per event, over time dimensions (day, week-ending, month, year)
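The "aggregate as events come in" idea can be sketched in a few lines of Python. This is only an illustration, not the production code: the key formats, the Sunday week-ending convention, and the names `dimension_keys` and `record` are assumptions made for the example.

```python
from collections import Counter
from datetime import date, datetime, timedelta


def week_ending(d: date) -> date:
    """Return the Sunday that closes the week containing d (assumed convention)."""
    return d + timedelta(days=6 - d.weekday())


def dimension_keys(ts: datetime) -> dict:
    """Map one event timestamp to a bucket key per time dimension."""
    d = ts.date()
    return {
        "day": d.isoformat(),
        "week": week_ending(d).isoformat(),
        "month": d.strftime("%Y-%m"),
        "year": d.strftime("%Y"),
    }


def record(counts: Counter, user_id: str, event: str, ts: datetime) -> None:
    """Increment one counter per (user, event, dimension, bucket) as the event arrives."""
    for dim, key in dimension_keys(ts).items():
        counts[(user_id, event, dim, key)] += 1
```

Two logins by the same user on consecutive days would then yield two separate day counters of 1 and a single week, month and year counter of 2, so basic counts never need to be recomputed from the raw events.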


Page 14: MongoDB ClickStream and Visualization

Data Capture

iOS

- (void)sendAnalyticEventType:(NSString *)eventType
                       object:(NSString *)object
                         name:(NSString *)name
                         page:(NSString *)page
                       source:(NSString *)source
{
    // params was not declared on the slide; a local dictionary is assumed here
    NSMutableDictionary *params = [NSMutableDictionary dictionary];
    NSMutableDictionary *eventData = [NSMutableDictionary dictionary];

    // Only set non-nil values: setObject:forKey: throws on nil
    if (eventType != nil) [params setObject:eventType forKey:@"eventType"];
    if (object != nil) [eventData setObject:object forKey:@"object"];
    if (name != nil) [eventData setObject:name forKey:@"name"];
    if (page != nil) [eventData setObject:page forKey:@"page"];
    if (source != nil) [eventData setObject:source forKey:@"source"];
    if (eventData != nil) [params setObject:eventData forKey:@"eventData"];

    [[LVNetworkEngine sharedManager] analytics_send:params];
}


Page 15: MongoDB ClickStream and Visualization

Data Capture

Web (JavaScript)

function internalTrackPageView() {
    var cookie = {
        userContext: jQuery.cookie('UserContextCookie')
    };

    var trackEvent = {
        eventType: "pageView",
        eventData: {
            page: window.location.pathname + window.location.search
        }
    };

    // AJAX
    jQuery.ajax({
        url: "/api/track",
        type: "POST",
        dataType: "json",
        data: JSON.stringify(trackEvent),
        // Set request headers
        beforeSend: function (xhr, settings) {
            xhr.setRequestHeader('Accept', 'application/json');
            xhr.setRequestHeader('User-Context', cookie.userContext);
            if (settings.type === 'PUT' || settings.type === 'POST') {
                xhr.setRequestHeader('Content-Type', 'application/json');
            }
        }
    });
}


Page 16: MongoDB ClickStream and Visualization

Bus Event Packaging

1. Spring 3 RESTful service layer: controller methods define the eventCode via the @Tracking annotation

2. A custom interceptor class extends HandlerInterceptorAdapter and implements postHandle() (for each event) to invoke calls via Spring @Async to an EventPublisher

3. The EventPublisher publishes to a common event bus queue with multiple subscribers, one of which packages the eventPayload Map<String, Object> object and forwards it to the Analytics REST service
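The publish/subscribe flow in the three steps above can be sketched as follows. This is a simplified, synchronous Python stand-in: the real system uses Spring and a message queue, and the names `EventBus` and `make_analytics_packager` as well as the `outbox` list (which stands in for the HTTP POST to the analytics service) are hypothetical.

```python
import logging
import uuid
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]
log = logging.getLogger("eventbus")


class EventBus:
    """Synchronous stand-in for a shared event bus with multiple subscribers."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[Event], None]] = []

    def subscribe(self, handler: Callable[[Event], None]) -> None:
        self._subscribers.append(handler)

    def publish(self, event_code: str, event_data: Dict[str, Any]) -> Event:
        event = {
            "eventCode": event_code,
            "eventId": str(uuid.uuid4()),
            "eventTime": datetime.now(timezone.utc),
            "eventData": dict(event_data),
        }
        for handler in self._subscribers:
            try:
                handler(event)
            except Exception:
                # A failing subscriber must not stop delivery to the others
                log.exception("subscriber failed for %s", event_code)
        return event


def make_analytics_packager(outbox: List[Event]) -> Callable[[Event], None]:
    """Return a subscriber that packages events for the analytics service."""

    def package(event: Event) -> None:
        payload = dict(event)
        payload["version"] = "1.0"  # assumed constant, as in the sample JSON
        outbox.append(payload)      # a real subscriber would POST this instead

    return package
```

Keeping tracking in a subscriber means the request-handling code only publishes an event code and data; a tracking failure is logged rather than propagated to the user's request.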


Page 17: MongoDB ClickStream and Visualization

Bus Event Packaging

1) Spring RestController Methods

Interface

@RequestMapping(value = "/user/login", method = RequestMethod.POST,
                headers = "Accept=application/json")
public Map<String, Object> userLogin(@RequestBody Map<String, Object> event,
                                     HttpServletRequest request);

Concrete/Impl Class

@Override
@Tracking("user.login")
public Map<String, Object> userLogin(@RequestBody Map<String, Object> event,
                                     HttpServletRequest request) {

    // Implementation

    return event;
}


Page 18: MongoDB ClickStream and Visualization

Bus Event Packaging

2) Custom interceptor class extends HandlerInterceptorAdapter

protected void handleTracking(String trackingCode, Map<String, Object> modelMap,
                              HttpServletRequest request) {

    Map<String, Object> responseModel = new HashMap<String, Object>();

    // remove non-serializables & copy over data from modelMap

    try {
        this.eventPublisher.publish(trackingCode, responseModel, request);
    } catch (Exception e) {
        // Tracking failures are logged and never propagated to the caller
        log.error("Error tracking event '" + trackingCode + "' : "
                  + ExceptionUtils.getStackTrace(e));
    }
}


Page 19: MongoDB ClickStream and Visualization

Bus Event Packaging

3) EventPublisher

public void publish(String eventCode, Map<String, Object> eventData,
                    HttpServletRequest request) {

    Map<String, Object> payload = new HashMap<String, Object>();
    String eventId = UUID.randomUUID().toString();
    Map<String, String> requestMap = HttpRequestUtils.getRequestHeaders(request);

    // Normalize message (the slide read "eventType" for all three keys,
    // which looks like a copy-paste slip)
    payload.put("eventType", eventData.get("eventType"));
    payload.put("eventData", eventData.get("eventData"));
    payload.put("version", eventData.get("version"));
    payload.put("eventId", eventId);
    payload.put("eventTime", new Date());
    payload.put("request", requestMap);
    . . .
    // Send to the Analytics Service for MongoDB persistence
}

public void sendPost(EventPayload payload) {
    HttpEntity request = new HttpEntity(payload.getEventPayload(), headers);
    Map m = restTemplate.postForObject(endpoint, request, java.util.Map.class);
}


Page 20: MongoDB ClickStream and Visualization

Bus Event Packaging

The Serialized JSON (User Action)

{
  "eventCode" : "user.login",
  "eventType" : "login",
  "version" : "1.0",
  "eventTime" : "1358603157746",
  "eventData" : {
    "" : "",
    "" : "",
    "" : ""
  },
  "request" : {
    "call-source" : "WEB",
    "user-context" : "00002b4f1150249206ac2b692e48ddb3",
    "user-agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.101 Safari/537.11",
    "cookie" : "size=4; CP.mode=B; PHPSESSID=c087908516ee2fae50cef6500101dc89; resolution=1920; JSESSIONID=56EB165266A2C4AFF946F139669D746F; csrftoken=73bdcdddf151dc56b8020855b2cb10c8",
    "content-length" : "204",
    "accept-encoding" : "gzip,deflate,sdch"
  }
}


Page 21: MongoDB ClickStream and Visualization

Bus Event Packaging

The Serialized JSON (Generic Event)

{
  "eventCode" : "generic.ui",
  "eventType" : "pageView",
  "version" : "1.0",
  "eventTime" : "1358603157746",
  "eventData" : {
    "page" : "/learnvest/moneycenter/inbox",
    "section" : "transactions",
    "name" : "view transactions",
    "object" : "page"
  },
  "request" : {
    "call-source" : "WEB",
    "user-context" : "00002b4f1150249206ac2b692e48ddb3",
    "user-agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.101 Safari/537.11",
    "cookie" : "size=4; CP.mode=B; PHPSESSID=c087908516ee2fae50cef6500101dc89; resolution=1920; JSESSIONID=56EB165266A2C4AFF946F139669D746F; csrftoken=73bdcdddf151dc56b8020855b2cb10c8",
    "content-length" : "204",
    "accept-encoding" : "gzip,deflate,sdch"
  }
}



Page 23: MongoDB ClickStream and Visualization

MongoDB Data Warehousing

MongoDB Information
• v2.2.0
• 3-node replica-set
• 1 Large (primary), 2x Medium (secondary) AWS Amazon-Linux machines
• Each with a single 500GB EBS volume mounted to /opt/data

MongoDB Config File
dbpath = /opt/data/mongodb/data
rest = true
replSet = voyager

Volumes
• ~1M events daily on web, ~600K on mobile
• 2-3 GB per day at start, slowed to ~1GB per day
• Currently at 78GB (collecting since August 2012)

Future Scaling Strategy
• Set up a 2nd replica-set
• Shard replica-sets to n at 60% / 250GB per EBS volume
• Shard key probably based on a sequential mix of email_address & an additional string


Page 24: MongoDB ClickStream and Visualization

MongoDB Data Warehousing

Approach

1. Persist all events, bucketed by source:
   • WEB
   • MOBILE

2. Persist all events, bucketed by source, event code and time:
   • WEB/MOBILE
   • user.login
   • time (day, week-ending, month, year)

3. Insert into collection e_web / e_mobile

4. Upsert into:
   • e_web_user_login_day
   • e_web_user_login_week
   • e_web_user_login_month
   • e_web_user_login_year

5. Predictable model for scaling and measuring business growth
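The naming scheme in the approach above is regular enough to derive mechanically. A minimal sketch in Python, assuming dots in the event code map to underscores (as the collection names suggest); the function names are hypothetical.

```python
from typing import List

DIMENSIONS = ("day", "week", "month", "year")


def raw_collection(source: str) -> str:
    """Collection holding every raw event for a source, e.g. e_web."""
    return "e_" + source.lower()


def rollup_collections(source: str, event_code: str) -> List[str]:
    """Per-dimension counter collections for one source and event code."""
    base = raw_collection(source) + "_" + event_code.replace(".", "_")
    return ["%s_%s" % (base, dim) for dim in DIMENSIONS]
```

For example, `rollup_collections("WEB", "user.login")` yields the four `e_web_user_login_*` names listed in step 4.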


Page 25: MongoDB ClickStream and Visualization

MongoDB Data Warehousing

2. Persist all events, bucketed by source, event code and time:

// instantiate collections dynamically
DBCollection collection_day = mongodb.getCollection(eventCode + "_day");
DBCollection collection_week = mongodb.getCollection(eventCode + "_week");
DBCollection collection_month = mongodb.getCollection(eventCode + "_month");
DBCollection collection_year = mongodb.getCollection(eventCode + "_year");

// $inc creates "count" on the first upsert and increments it afterwards
BasicDBObject newDocument = new BasicDBObject()
    .append("$inc", new BasicDBObject().append("count", 1));

// update day dimension (upsert = true, multi = false)
collection_day.update(
    new BasicDBObject().append("user-context", userContext)
        .append("eventType", eventType)
        .append("date", sdf_day.format(d)),
    newDocument, true, false);

// update week dimension (w is presumably the week-ending date)
collection_week.update(
    new BasicDBObject().append("user-context", userContext)
        .append("eventType", eventType)
        .append("date", sdf_day.format(w)),
    newDocument, true, false);

// update month dimension
collection_month.update(
    new BasicDBObject().append("user-context", userContext)
        .append("eventType", eventType)
        .append("date", sdf_month.format(d)),
    newDocument, true, false);

// update year dimension
collection_year.update(
    new BasicDBObject().append("user-context", userContext)
        .append("eventType", eventType)
        .append("date", sdf_year.format(d)),
    newDocument, true, false);
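To make the upsert-with-`$inc` behaviour concrete, here is a small Python sketch. The "MM/dd" day format follows the sample documents; the month and year formats and the Sunday week-ending rule are assumptions, and `apply_upsert` is a hypothetical in-memory stand-in that only understands `$inc`. With PyMongo 3 or later, each pair could instead be passed to `Collection.update_one(filter, update, upsert=True)`.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, List, Tuple

Doc = Dict[str, Any]


def upsert_specs(user_context: str, event_type: str,
                 ts: datetime) -> Dict[str, Tuple[Doc, Doc]]:
    """Build one (filter, update) pair per time dimension."""
    d = ts.date()
    week_end = d + timedelta(days=6 - d.weekday())  # assumed: week ends Sunday
    dates = {
        "day": d.strftime("%m/%d"),
        "week": week_end.strftime("%m/%d"),
        "month": d.strftime("%Y/%m"),  # assumed format
        "year": d.strftime("%Y"),      # assumed format
    }
    update = {"$inc": {"count": 1}}
    return {
        dim: ({"user-context": user_context,
               "eventType": event_type,
               "date": value}, update)
        for dim, value in dates.items()
    }


def apply_upsert(docs: List[Doc], flt: Doc, update: Doc) -> None:
    """In-memory stand-in for an upsert: match on flt, else insert it, then $inc."""
    for doc in docs:
        if all(doc.get(k) == v for k, v in flt.items()):
            break
    else:
        doc = dict(flt)
        docs.append(doc)
    for field, amount in update["$inc"].items():
        doc[field] = doc.get(field, 0) + amount
```

Applying the same day-level pair twice leaves one document with `count` equal to 2, which is why no separate "create the counter" step is needed.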


Page 26: MongoDB ClickStream and Visualization

MongoDB Data Warehousing

Persist all events, bucketed by source, event code and time:

> show collections
e_mobile
e_web
e_web_account_addManual_day
e_web_account_addManual_month
e_web_account_addManual_week
e_web_account_addManual_year
e_web_user_login_day
e_web_user_login_week
e_web_user_login_month
e_web_user_login_year
e_mobile_generic_ui_day
e_mobile_generic_ui_month
e_mobile_generic_ui_week
e_mobile_generic_ui_year

> db.e_web_user_login_day.find()
{ "_id" : ObjectId("50e4b9871b36921910222c42"), "count" : 5, "date" : "01/02", "user-context" : "c4ca4238a0b923820dcc509a6f75849b" }
{ "_id" : ObjectId("50cd6cfcb9a80a2b4ee21422"), "count" : 7, "date" : "01/02", "user-context" : "c4ca4238a0b923820dcc509a6f75849b" }
{ "_id" : ObjectId("50cd6e51b9a80a2b4ee21427"), "count" : 2, "date" : "01/02", "user-context" : "c4ca4238a0b923820dcc509a6f75849b" }
{ "_id" : ObjectId("50e4b9871b36921910222c42"), "count" : 3, "date" : "01/03", "user-context" : "50e49a561b36921910222c33" }


Page 27: MongoDB ClickStream and Visualization

MongoDB Data Warehousing

Persist all events

> db.e_web.findOne()
{
  "_id" : ObjectId("50e4a1ab0364f55ed07c2662"),
  "created_datetime" : ISODate("2013-01-02T21:07:55.656Z"),
  "created_date" : ISODate("2013-01-02T00:00:00.000Z"),
  "request" : {
    "content-type" : "application/json",
    "connection" : "keep-alive",
    "accept-language" : "en-US,en;q=0.8",
    "host" : "localhost:8080",
    "call-source" : "WEB",
    "accept" : "*/*",
    "user-context" : "c4ca4238a0b923820dcc509a6f75849b",
    "origin" : "chrome-extension://fdmmgilgnpjigdojojpjoooidkmcomcm",
    "user-agent" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.101 Safari/537.11",
    "accept-charset" : "ISO-8859-1,utf-8;q=0.7,*;q=0.3",
    "cookie" : "size=4; CP.mode=B; PHPSESSID=c087908516ee2fae50cef6500101dc89; resolution=1920; JSESSIONID=56EB165266A2C4AFF946F139669D746F; csrftoken=73bdcdddf151dc56b8020855b2cb10c8",
    "content-length" : "255",
    "accept-encoding" : "gzip,deflate,sdch"
  },
  "eventType" : "flick",
  "eventData" : {
    "object" : "button",
    "name" : "split transaction button",
    "page" : "#inbox/79876/",
    "section" :


Page 28: MongoDB ClickStream and Visualization

MongoDB Data Warehousing

Indexing Strategy

• Indexes on core collections (e_web and e_mobile) come in under 3GB (7.5GB of RAM on the Large instance, 3.75GB on the Medium instances)

• Split datetime into two fields and build compound indexes on the date with other fields like eventType and user unique id (user-context)

• Heavy insertion rates, much lower read rates, so the fewer indexes the better
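The "split datetime into two fields" step can be sketched as below. The field names `created_datetime` and `created_date` and the two index key lists match the sample document and `getIndexes()` output in this deck; the helper name `with_date_fields` is hypothetical. Each key list could be handed to PyMongo's `Collection.create_index`.

```python
from datetime import datetime
from typing import Any, Dict

# Compound index keys: a lookup field first, then the day-granularity date
INDEXES = [
    [("request.user-context", 1), ("created_date", 1)],
    [("eventData.name", 1), ("created_date", 1)],
]


def with_date_fields(event: Dict[str, Any], now: datetime) -> Dict[str, Any]:
    """Return a copy of event carrying both the full timestamp and its midnight date."""
    doc = dict(event)
    doc["created_datetime"] = now
    doc["created_date"] = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return doc
```

Indexing the truncated date rather than the full timestamp keeps the indexed values few and repetitive (one per day), which suits equality matches on a day and keeps index growth predictable under heavy inserts.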


Page 29: MongoDB ClickStream and Visualization

MongoDB Data Warehousing

Indexing Strategy

> db.e_web.getIndexes()
[
  {
    "v" : 1,
    "key" : { "request.user-context" : 1, "created_date" : 1 },
    "ns" : "moneycenter.e_web",
    "name" : "request.user-context_1_created_date_1"
  },
  {
    "v" : 1,
    "key" : { "eventData.name" : 1, "created_date" : 1 },
    "ns" : "moneycenter.e_web",
    "name" : "eventData.name_1_created_date_1"
  }
]


Page 30: MongoDB ClickStream and Visualization

Loading & Visualization

Objective
• Show historic and intraday stats on core use cases (logins, conversions)
• Show user funnel rates on conversion pages
• Show general usability: how do users really use the Web and iOS platforms?

Non-Functionals
• Intraday doesn't need to be "real-time"; polling is good enough for now
• Overnight batch job for historic data must scale horizontally

General Implementation Strategy
• Do all heavy lifting & object manipulation in the service; the UI should just display a graph or table
• Modularize the service to be able to regenerate any graphs/tables without a full load


Page 31: MongoDB ClickStream and Visualization

Loading & Visualization

Java Batch Service

Java Mongo library to query key collections and return user counts and sum of events

DBCursor webUserLogins = c.find(
    new BasicDBObject("date", sdf.format(new Date())));

private HashMap<String, Object> getSumAndCount(DBCursor cursor) {
    HashMap<String, Object> m = new HashMap<String, Object>();
    int sum = 0;    // total events across all users
    int count = 0;  // number of per-user documents
    while (cursor.hasNext()) {
        DBObject obj = cursor.next();
        count++;
        sum += (Integer) obj.get("count");
    }
    m.put("sum", sum);
    m.put("count", count);
    // df is assumed to be a java.text.DecimalFormat; the slide reused the name sdf here.
    // Guard against an empty cursor so NaN is never formatted.
    m.put("average", count == 0 ? "0" : df.format((float) sum / count));
    return m;
}


Page 32: MongoDB ClickStream and Visualization

Loading & Visualization

Java Batch Service

Use Aggregation Framework where required on core collections (e_web) and external data

// create aggregation objects ("fields" is defined elsewhere and not shown on the slide)
DBObject project = new BasicDBObject("$project",
    new BasicDBObject("day_value", fields));
DBObject day_value = new BasicDBObject("day_value", "$day_value");
DBObject groupFields = new BasicDBObject("_id", day_value);

// add the accumulator: "number" counts the documents in each group
groupFields.put("number", new BasicDBObject("$sum", 1));

// create the group stage
DBObject group = new BasicDBObject("$group", groupFields);

// execute
AggregationOutput output = mycollection.aggregate(project, group);
for (DBObject obj : output.results()) {
    ..
}


Page 33: MongoDB ClickStream and Visualization

Loading & Visualization

Java Batch Service

MongoDB command-line example of aggregation over a time period, e.g. a month

> db.e_web.aggregate([
    { $match : { created_date : { $gt : ISODate("2012-10-25T00:00:00") } } },
    { $project : { day_value : { "day" : { $dayOfMonth : "$created_date" },
                                 "month" : { $month : "$created_date" } } } },
    { $group : { _id : { day_value : "$day_value" }, number : { $sum : 1 } } },
    // after $group the key lives under _id, so sort on _id.day_value
    { $sort : { "_id.day_value" : -1 } }
])
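For readers less familiar with the aggregation framework, the same computation can be written as plain Python over an in-memory list. This is only an illustration of what each stage does, not how the batch service runs it; `events_per_day` is a hypothetical name, and the sort here orders by (month, day), which is an assumed ordering.

```python
from collections import Counter
from datetime import datetime
from typing import Any, Dict, Iterable, List, Tuple


def events_per_day(events: Iterable[Dict[str, Any]],
                   since: datetime) -> List[Tuple[Tuple[int, int], int]]:
    """Count events per (month, day) for events created strictly after since."""
    counts: Counter = Counter()
    for event in events:
        created = event["created_date"]
        if created > since:                            # $match with $gt
            counts[(created.month, created.day)] += 1  # $project + $group/$sum
    return sorted(counts.items(), reverse=True)        # $sort descending
```

Note that the strict `$gt` comparison excludes documents whose `created_date` is exactly the boundary value, so midnight-truncated dates on the boundary day are dropped.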


Page 34: MongoDB ClickStream and Visualization

Loading & Visualization

Java Batch Service

Persisting events into graph and table collections

> db.homeGraphs.find()

{ "_id" : ObjectId("50f57b5c1d4e714b581674e2"), "accounts_natural" : 54, "accounts_total" : 54, "date" : ISODate("2011-02-06T05:00:00Z"), "linked_rate" : "12.96", "premium_rate" : "0", "str_date" : "2011,01,06", "upgrade_rate" : "0", "users_avg_linked" : "3.43", "users_linked" : 7 }

{ "_id" : ObjectId("50f57b5c1d4e714b581674e3"), "accounts_natural" : 144, "accounts_total" : 144, "date" : ISODate("2011-02-07T05:00:00Z"), "linked_rate" : "11.11", "premium_rate" : "0", "str_date" : "2011,01,07", "upgrade_rate" : "0", "users_avg_linked" : "4", "users_linked" : 16 }

{ "_id" : ObjectId("50f57b5c1d4e714b581674e4"), "accounts_natural" : 119, "accounts_total" : 119, "date" : ISODate("2011-02-08T05:00:00Z"), "linked_rate" :


Page 35: MongoDB ClickStream and Visualization

Loading & Visualization

Django and HighCharts

Extract data (pyMongo)

def getHomeChart(dt_from, dt_to):
    """Called by home method to get latest 30 day numbers"""
    try:
        # MongoClient replaces the deprecated pymongo.Connection
        conn = pymongo.MongoClient('localhost', 27017)
        db = conn['lvanalytics']

        cursor = db.accountmetrics.find(
            {"date": {"$gte": dt_from, "$lte": dt_to}}).sort("date")
        return buildMetricsDict(cursor)

    except Exception as e:
        logger.error(e)

Return the graph object (as a list or a dict of lists) to the view that called the method

# The slide called getHomeChart() without its two date arguments;
# a 30-day window ending now is assumed here
dt_to = datetime.datetime.utcnow()
dt_from = dt_to - datetime.timedelta(days=30)

pagedata = {}
pagedata['accountsGraph'] = mongodb_home.getHomeChart(dt_from, dt_to)

return render_to_response('home.html', {'pagedata': pagedata},
                          context_instance=RequestContext(request))


Page 36: MongoDB ClickStream and Visualization

Loading & Visualization

Django and HighCharts

Populate the series (JavaScript with Django templating)

seriesOptions[0] = {
    id: 'naturalAccounts',
    name: "Natural Accounts",
    data: [
        {% for a in pagedata.metrics.accounts_natural %}
            {% if not forloop.first %}, {% endif %}
            [Date.UTC({{a.0}}),{{a.1}}]
        {% endfor %}
    ],
    tooltip: {
        valueDecimals: 2
    }
};
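The `str_date` values in the `homeGraphs` sample documents (for example "2011,01,06" alongside a `date` of 2011-02-06) appear to be pre-formatted arguments for JavaScript's `Date.UTC`, whose month argument is zero-based. A minimal Python sketch of producing such pairs on the server side follows; the function names are hypothetical and the zero-padded format is copied from the sample documents.

```python
from datetime import date
from typing import Any, Dict, Iterable, List, Tuple


def date_utc_args(d: date) -> str:
    """Format a date as 'year,month,day' arguments for Date.UTC (month is 0-based)."""
    return "%d,%02d,%02d" % (d.year, d.month - 1, d.day)


def series_points(rows: Iterable[Dict[str, Any]],
                  field: str) -> List[Tuple[str, Any]]:
    """Build (Date.UTC argument string, value) pairs for one chart series."""
    return [(date_utc_args(row["date"]), row[field]) for row in rows]
```

Each pair then maps directly onto `{{a.0}}` and `{{a.1}}` in the template, so the template does no date arithmetic of its own.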


Page 37: MongoDB ClickStream and Visualization

Loading & Visualization

Django and HighCharts

And Create the Charts and Tables...



Page 39: MongoDB ClickStream and Visualization

Lessons Learned

• Date Time managed as two fields, Datetime and Date

• Aggregating and upserting documents as events are received works for us

• Real-time Map-Reduce in pyMongo - too slow, don’t do this.

• Django-nonrel - unstable; use Django and configure MongoDB as a datastore only

• Memcached on Django is good enough (at the moment) - use django-celery with rabbitmq to pre-cache all data after data loading

• HighCharts is buggy - considering D3 & other libraries

• Don’t need to retrieve data directly from MongoDB to Django, perhaps provide all data via a service layer (at the expense of ever-additional features in pyMongo)


Page 40: MongoDB ClickStream and Visualization

Next Steps

• A/B testing framework, experiments and variances

• Unauthenticated / Authenticated user tracking

• Provide data async over service layer

• Segmentation with graphical libraries like D3 & Cross-Filter (http://square.github.com/crossfilter/)

• Saving Query Criteria, expanding out BI tools for internal users

• MongoDB Connector, Hadoop and Hive (maybe Tableau and other tools)

• Storm / Kafka for real-time analytics processing

• Shard the Replica-Set, looking into Gizzard as the middleware
