Building a Graph Database in Neo4j with Spark & Spark SQL to Gain New Insights from Log Data
ROBERT HRYNIEWICZ – DATA SCIENTIST/EVANGELIST, HORTONWORKS
RACHEL POULSEN – DATA SCIENCE DIRECTOR, TIVO
Overview
• Overview of business problem
• Introduction to TiVo and TiVo data
• Challenges with using data to optimize UI navigation
• Solution
• Demo
• Next Steps
• Questions
TiVo
Background and context
• TiVo is a discovery platform for integrated entertainment
• Multiple ways to find content
TiVo Roamio Demo Video
• TiVo collects ~2 million logs a day from their boxes
• User action events, TiVo action events, inventory events
• These events are “memory-less” and don’t know what happened prior to the event
Motivation
TiVo Business Initiatives
• Help users get to content they want faster
• Help users discover new content more easily
• Is feature X important to discovering and getting to content? (e.g., is the guide still used to find content?)
Challenge
• Measuring a KPI for initiatives in a log stream that is “memory-less”
• Identifying events or a pattern of events that impacts the KPIs
Data Objective – “Path Analysis”
• Analysis that answers questions around the navigational paths users take to get to or from defined start and/or end points
Architecture Challenges
• Traditional data platform that was sample-based and SQL-based
• Relational databases
Technical Challenges
Relational Database Challenges
• Little flexibility in “click path” definition
• Decisions about defining “paths” are made during the processing step
• Many business assumptions have to be made with little insight
Solution
Graph Database (Neo4j)
• Relationships are first-class citizens
• Simple abstractions
• Enable sophisticated models
“Path Analysis”
Prototype
One Day Graph Info
• Edges: 57K relationships
• Nodes:
  • 135 UI or “screen” nodes
  • 12 “watch content” nodes
• Size reduction (2K times)
  • 70 GB log data
  • 35 MB Neo DB (all nodes & edges)
  • 1 Day: Oct 1, 2015
[Diagram: LOGS (70+ GB) reduced to UI nodes, Watch nodes, and Edges (35 MB)]
Screens, Transitions, and Content
• Screen events
• Remote button press events
• Watch content events
Sample Graph
[Figure: nodes “Live Movie”, “My Shows”, and “TiVo Central (Home)”, connected by “Switched to” edges]
Architecture Overview
What’s captured in the graph?
Node (UI)
• Name
• Timestamp
Node (Watch)
• Type, e.g. Recorded
• Genre, e.g. TV Show
• Timestamp
Edge
• Average Time
• Total number of keys pressed
• Key sequence, e.g. Home Up/Down Select/Play
• Total number of times path taken
• Unique number of users taking this path
What’s captured in the graph?
Node: TiVo Central, 1/1/2016
Edge: 0.4s average time, 3 keys pressed, Home-Down-Select, 50 times path used, 27 unique users
Node: Live Movie, 1/1/2016
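The node and edge properties listed above can be written down as a small data model. This is a minimal sketch in plain Python, not the schema used in the talk: the class and field names are hypothetical, and the edge direction (TiVo Central to Live Movie) is an assumption, since the slide does not state it. The values are the ones shown in the slide example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UINode:
    name: str
    timestamp: str  # day the node belongs to, as shown on the slide


@dataclass(frozen=True)
class WatchNode:
    watch_type: str  # e.g. "Recorded" or "Live"
    genre: str       # e.g. "TV Show" or "Movie"
    timestamp: str


@dataclass(frozen=True)
class Edge:
    source: object        # UINode or WatchNode
    target: object        # UINode or WatchNode
    avg_time_s: float     # average transition time in seconds
    keys_pressed: int     # number of keys in the sequence
    key_sequence: str     # e.g. "Home-Down-Select"
    times_taken: int      # total number of times the path was taken
    unique_users: int     # distinct users who took the path


# Values from the slide example; the direction of the edge is assumed.
tivo_central = UINode(name="TiVo Central", timestamp="1/1/2016")
live_movie = WatchNode(watch_type="Live", genre="Movie", timestamp="1/1/2016")
edge = Edge(tivo_central, live_movie, 0.4, 3, "Home-Down-Select", 50, 27)
```

Keeping the aggregates on the edge rather than on the nodes matches the slide's point that relationships carry the path information.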
Raw Log File example
…
1444809715713072|Watch|live|WBINDT|MV|506|EP019641150097...
1444809715812909|Key|HOME
1444880816123454|UI|TivoCentralScreen
1444809716234553|Key|DOWN
1444809716354363|Key|SELECT
1444809716518701|Trick3|PLAY|116|1|100|-1
1444809719888072|Key|PLAY
1444809719889072|Trick3|PLAY|119|1|100|-1
1444809726966880|Watch|rec|WFXTDT|SH|508|...
…
Filtered Log
...
Watch: LIVE MOVIE
Key: HOME
UI: TIVO HOME
Key: DOWN
Key: SELECT
Key: PLAY
Watch: REC SHOW
…
[Slide annotations on the filtered log: Node, Edge, Node, Edge, Node (same day)]
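The step from the raw log to the filtered log can be sketched as follows. This is a plain-Python illustration, not the Spark job from the talk. The function names are hypothetical, the pipe-delimited layout (timestamp, event type, remaining fields) is inferred from the sample lines, and the sample lines are copied as shown, including their truncated tails.

```python
# Event types kept by the filter; everything else (e.g. Trick3) is dropped.
KEEP = {"Watch", "UI", "Key"}


def parse_line(line):
    """Split 'timestamp|type|field|...' into (timestamp, type, fields)."""
    parts = line.strip().split("|")
    return int(parts[0]), parts[1], parts[2:]


def filter_events(lines):
    """Keep only Watch, UI, and Key events, preserving file order."""
    events = []
    for line in lines:
        if "|" not in line:
            continue  # skip blank or malformed lines
        ts, kind, fields = parse_line(line)
        if kind in KEEP:
            events.append((ts, kind, fields))
    return events


raw = [
    "1444809715713072|Watch|live|WBINDT|MV|506|EP019641150097...",
    "1444809715812909|Key|HOME",
    "1444880816123454|UI|TivoCentralScreen",
    "1444809716234553|Key|DOWN",
    "1444809716354363|Key|SELECT",
    "1444809716518701|Trick3|PLAY|116|1|100|-1",
    "1444809719888072|Key|PLAY",
    "1444809719889072|Trick3|PLAY|119|1|100|-1",
    "1444809726966880|Watch|rec|WFXTDT|SH|508|...",
]

events = filter_events(raw)
for ts, kind, fields in events:
    print(f"{kind}: {fields[0]}")
```

The two Trick3 lines are dropped, leaving the seven Watch, Key, and UI events that appear in the filtered log.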
Algorithm Overview (1 of 2)
1. Filter for desired events
   • Remove non-Screen, non-Watch, non-Key events
2. Session-ize and order logs to reflect Screen/Watch/Edge events
3. Define display for Key Press events - two formats
   • Normal: SELECT & UP x 2 & GUIDE & SELECT
   • Compact: TIVO & 9 KEYS & SELECT
4. Generate an Edge if transition < max time set by stakeholders (e.g. 5 min)
   • For all logs find the following sequence:
     Node X - timestamp x (start time)
     key A
     key B
     key C
     Node Y - timestamp y (end time)
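Steps 3 and 4 above can be sketched in plain Python. This is an illustration under stated assumptions, not the talk's Spark implementation: the events are assumed to be already filtered, ordered, and limited to one session; timestamps are simplified to whole seconds; the sample session is made up; and the reading of the compact format (first key, count of keys in between, last key) is a guess based on the single example on the slide.

```python
from itertools import groupby

MAX_TRANSITION_S = 5 * 60  # example threshold from the slide (5 min)


def format_keys(keys, compact=False):
    """Render a key sequence in the 'normal' or (assumed) 'compact' display."""
    if compact and len(keys) > 2:
        return f"{keys[0]} & {len(keys) - 2} KEYS & {keys[-1]}"
    parts = []
    for key, run in groupby(keys):  # collapse consecutive repeats: UP, UP -> UP x 2
        n = len(list(run))
        parts.append(key if n == 1 else f"{key} x {n}")
    return " & ".join(parts)


def build_edges(events, max_s=MAX_TRANSITION_S):
    """Emit one edge per node-to-node transition shorter than max_s seconds."""
    edges = []
    start = None  # (timestamp, name) of the most recent node event
    keys = []     # key presses seen since that node
    for ts, kind, name in events:
        if kind == "Key":
            if start is not None:
                keys.append(name)
            continue
        # Any non-Key event (UI or Watch) is treated as a node.
        if start is not None and ts - start[0] < max_s:
            edges.append({"from": start[1], "to": name,
                          "seconds": ts - start[0], "keys": list(keys)})
        start = (ts, name)
        keys = []
    return edges


# Hypothetical session: (seconds, event type, name).
session = [
    (0, "Watch", "LIVE MOVIE"),
    (1, "Key", "HOME"),
    (2, "UI", "TIVO HOME"),
    (3, "Key", "DOWN"),
    (4, "Key", "SELECT"),
    (7, "Key", "PLAY"),
    (14, "Watch", "REC SHOW"),
    (900, "UI", "GUIDE"),  # more than 5 min after the previous node: no edge
]
edges = build_edges(session)
```

Resetting the key buffer at every node, even when the gap is too long, keeps a stale key sequence from leaking into the next edge.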
Algorithm Overview (2 of 2)
5. For each unique node-edge-node calculate:
   1. Average transition time
   2. Number of transitions
   3. Number of unique transitions
   4. Number of keys pressed
   5. Key sequence (normal or compact)
6. Export results to CSV files
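Steps 5 and 6 amount to a group-by followed by a CSV export. The sketch below uses plain Python rather than Spark SQL and rests on several assumptions: a "unique node-edge-node" is taken to mean the triple (start node, key sequence, end node), "number of unique transitions" is read as the number of distinct users (matching the earlier edge description), and the field names and sample transitions are hypothetical.

```python
import csv
import io
from collections import defaultdict


def aggregate(transitions):
    """Group transitions by (from, key sequence, to) and compute per-path stats."""
    groups = defaultdict(list)
    for t in transitions:
        groups[(t["from"], t["keys"], t["to"])].append(t)
    rows = []
    for (src, keys, dst), ts in sorted(groups.items()):
        rows.append({
            "from": src,
            "to": dst,
            "key_sequence": "-".join(keys),
            "keys_pressed": len(keys),
            "avg_seconds": sum(t["seconds"] for t in ts) / len(ts),
            "transitions": len(ts),
            "unique_users": len({t["user"] for t in ts}),
        })
    return rows


def to_csv(rows):
    """Serialize the aggregated rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical transitions, not TiVo data; keys are tuples so they can be dict keys.
transitions = [
    {"user": "u1", "from": "TIVO HOME", "to": "REC SHOW",
     "keys": ("DOWN", "SELECT", "PLAY"), "seconds": 10},
    {"user": "u2", "from": "TIVO HOME", "to": "REC SHOW",
     "keys": ("DOWN", "SELECT", "PLAY"), "seconds": 14},
    {"user": "u1", "from": "TIVO HOME", "to": "REC SHOW",
     "keys": ("DOWN", "SELECT", "PLAY"), "seconds": 12},
    {"user": "u1", "from": "LIVE MOVIE", "to": "TIVO HOME",
     "keys": ("HOME",), "seconds": 2},
]
rows = aggregate(transitions)
csv_text = to_csv(rows)
```

Each CSV row then corresponds to one edge with its properties, which is the shape the earlier graph description calls for.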
DEMO
• What is the most popular path people take to get to content? Live vs. recorded
• What percent of total paths are most popular?
• What path is most popular? Overall? Unique?
• What app is most popular?
• What percent of total paths involve the Guide screen?
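Two of the demo questions reduce to simple ratios over the aggregated edges. The sketch below shows the arithmetic in plain Python on made-up numbers. It is not the query used in the demo, and the path names and counts are hypothetical, not TiVo results.

```python
# Hypothetical aggregated paths: (from node, to node, times taken). Not TiVo data.
paths = [
    ("TIVO HOME", "REC SHOW", 50),
    ("GUIDE", "LIVE SHOW", 30),
    ("MY SHOWS", "REC SHOW", 20),
]

total = sum(n for _, _, n in paths)

# Most popular path overall, and its share of all path traversals.
most_popular = max(paths, key=lambda p: p[2])
share_most_popular = 100 * most_popular[2] / total

# Share of traversals that start or end at the Guide screen.
guide_share = 100 * sum(n for src, dst, n in paths if "GUIDE" in (src, dst)) / total

print(most_popular[:2], share_most_popular, guide_share)
```

Swapping the traversal count for the unique-user count gives the "unique" variant of the same questions.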
Business Advantages
• Measure KPIs for time to content and content discovery
• Optimize KPIs (understanding user behavior that impacts the KPIs)
• Enhance A/B Testing by helping to answer “why?”
• Simplify user experience across products
• Increase engagement with new content
• Understand feature usage interactions not only as a mutually exclusive experience
Future Work
• Deploy to production -- multi-day queries
• Add relationships and nodes for feature usage
• Classify paths (“discovery” or “known destination”)
• Exploratory analysis
Thanks! @RobHryniewicz @Bayesbabe