How the power of graphs helps deliver Packt's strategic vision
Posted: 16-Apr-2017. Category: Data & Analytics
How the Power of Graphs Helps Packt Deliver Our Strategic Vision. Greg Roberts, Twitter: @GregData
How graphs have helped us learn more about our market than we previously thought possible.
This talk is about how thinking with graphs helped evolve our mindset at Packt. We started off as basically a retailer; now we're aiming to be a leading developer learning solution, with personalisation and skill mapping.
So let's get started.
Who am I?Background in Physics and Maths
Joined Packt after graduating
Worked in products before moving into marketing
Joined as a Data Analyst two years ago; started in products, eventually moved into marketing. Now Senior Data Analyst: analysing campaigns and researching ideas for data-driven products.
For those of you who don't know Packt...
Packt Publishing: a software textbook publishing company. Over 3,500 books, over 150 videos, over 6,000 blogs and articles: lots of data to analyse! Focus on practicality, actionable knowledge, and emerging technologies. Subscription platform: PacktLib, with over 10,000 active users.
An IT book publisher, based in Birmingham and Mumbai, started by the founders of Wrox.
Focus on practical, results-oriented content and on emerging technologies: identifying information gaps and getting to market fast. PacktLib has a growing, engaged userbase. So when I joined, there was lots of interesting data floating about, mostly siloed and unused.
This journey started when we were asked to implement upsell recommendations.
Upsell Recommendations - VBA
Basket analysis. Affinity matrix in Excel. VBA macro (gross).
Analyse one month of data: 3 minutes
With the available data, we decided to do collaborative filtering on basket affinities (Amazon's "customers also bought"), implemented in the legacy tool of choice: VBA.
Produced great results
-----------------------------------
- Very slow and inflexible
At the time, my hammer of choice was Python, so I decided to re-implement it in Python
Upsell Recommendations - Vanilla.py
Basket analysis. Affinity matrix in a NumPy array. Optimised to one pass-through.
Analyse one month of data: 3 seconds
Same algorithm, with NumPy; after optimising, much faster. Still not satisfying. I'd been reading about graphs recently, and realised they were a natural way to think about the problem.
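To make the basket-affinity idea concrete, here is a minimal sketch of a one-pass co-purchase matrix in NumPy. The baskets and product indices are hypothetical illustration data, not Packt's actual dataset.

```python
import numpy as np

# Hypothetical sample data: each basket is a set of product indices.
baskets = [{0, 1}, {0, 1}, {1, 2}, {0, 2}]
n_products = 3

# Build the affinity (co-purchase) matrix in a single pass over the baskets.
affinity = np.zeros((n_products, n_products), dtype=int)
for basket in baskets:
    items = sorted(basket)
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            affinity[a, b] += 1
            affinity[b, a] += 1

# "Customers who bought product 0 also bought...", strongest affinity first.
also_bought = np.argsort(affinity[0])[::-1]
```

The single loop over baskets replaces repeated scans of the transaction log, which is where the speed-up over the cell-by-cell VBA version comes from.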
Upsell Recommendations - Neo4j. Baskets are nodes. The affinity matrix is a traversal. Already optimised!!
Analyse one month of data: 30 milliseconds
Downloaded Neo4j; put in products, customers, and baskets as nodes, and purchases as edges.
Ran the same algorithm. The performance was amazing.
Flexibility was apparent
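The shift in mindset is that "also bought" stops being a matrix computation and becomes a two-hop traversal: product, to the baskets containing it, to the other products in those baskets. Here is a sketch of that traversal idea in plain Python over a toy purchase graph (the basket and product names are made up); the production version would be a Cypher query against Neo4j.

```python
from collections import Counter

# Hypothetical purchase graph: each basket node links to product nodes.
basket_products = {
    "b1": {"neo4j-book", "python-ml"},
    "b2": {"neo4j-book", "python-ml"},
    "b3": {"python-ml", "spark-book"},
}

def also_bought(product):
    """Traverse product -> baskets -> co-purchased products, counting hits."""
    counts = Counter()
    for products in basket_products.values():
        if product in products:
            counts.update(products - {product})
    return counts.most_common()
```

Because the affinities are never materialised as a matrix, adding a new node type (say, authors) doesn't invalidate anything: you just write a different traversal.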
However, I'm not here to talk about recommendations; that's been done better by more interesting people.
Recommendations are just the beginning. Graphs are an amazing tool for thinking about data; everything's a graph. Talk structure: the data model, into production, applications, the future.
This talk is about how thinking about our data with graphs led us to a deeper understanding of our market, and helped shift our focus from retailer to e-learning provider. Talk structure: walk through and build up the model, then current applications, then future applications. So, the story so far...
What do we know so far? Some products are related to others... What does that mean?
Engagement data gives us WHAT's connected, not WHY. What do the connections mean? What's actually motivating our customers? To answer WHY, we first need to know WHAT.
For that, you need to know what our titles are about: content filtering.
Content Filtering - What's it all about? It requires robust metadata, which takes cost, time, and resource, and introduces errors. Automate! But how do you automate metadata generation?
Content filtering requires robust metadata: good coverage and domain knowledge. What if you DON'T HAVE good metadata? We had categories and keywords, both at the wrong level. Manually applying metadata is a big job, so I decided to automate it. You need: topics, links between topics, and ideally a pre-made set of both.
Fortunately, for our domain, the solution was right under our nose
Stick it all in the graph! Over 10M questions, ~60k distinct tags. Vast coverage, real-world uses, and an excellent API!
Generating a Topic Network
A developer Q&A site with over 10M questions. All questions are tagged, and all tags are moderated. Tags are linked by co-occurrence, and co-occurrence represents real-world use by the customers in our market.
StackOverflow.com
Stick all the tags and edges into the graph; example here. You can see languages, and also concepts, e.g. NoSQL and graph-databases.
You can already see the network effects, and how much information there is.
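The core of the topic network is just pairwise co-occurrence counting over tagged questions. A minimal sketch, using hypothetical question tags rather than real Stack Exchange API output:

```python
from collections import Counter
from itertools import combinations

# Hypothetical tagged questions, standing in for Stack Overflow data.
questions = [
    {"python", "machine-learning"},
    {"python", "numpy"},
    {"neo4j", "graph-databases"},
    {"python", "machine-learning", "numpy"},
]

# Edge weight = number of questions in which the two tags co-occur.
edges = Counter()
for tags in questions:
    for a, b in combinations(sorted(tags), 2):
        edges[(a, b)] += 1
```

Each `(tag, tag)` pair with a nonzero count becomes a weighted edge in the graph; the tags themselves become the topic nodes.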
So we have a network of topics; now we need to attach them to content.
Generating the metadata
Text Extraction
Tag Extraction
Initially we used website copy, which is keyword-rich. Plain keywords give some detail, but lots of noise. Extracting tags gives us a clearer picture of the topics.
There is still some noise, though. Something which is specific on SO may still be noise for us. How do you overcome this?
Getting More Information: tf-idf
An algorithm from the field of information retrieval. It boosts less common terms, giving the terms with the most information about a document.
So we apply this to the terms extracted before
tf-idf: gives more information!
Here is Rik's book again. You can see that the crucial topics are floating to the top.
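The weighting itself is a short formula: term frequency in the document, scaled by the log-inverse of how many documents in the corpus contain the term. A minimal sketch over hypothetical extracted tag lists (real pipelines would typically use a library such as scikit-learn):

```python
import math
from collections import Counter

# Hypothetical tag lists extracted from three books.
docs = [
    ["python", "neo4j", "cypher", "neo4j"],
    ["python", "numpy", "pandas"],
    ["python", "java", "spring"],
]

def tfidf(doc, corpus):
    """Score each term in `doc`: tf * log(N / document frequency)."""
    n = len(corpus)
    tf = Counter(doc)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for d in corpus if term in d)
        scores[term] = (count / len(doc)) * math.log(n / df)
    return scores

scores = tfidf(docs[0], docs)
# "python" appears in every document, so its idf is zero and it drops away;
# "neo4j" is rare in the corpus but frequent in the doc, so it floats to the top.
```

This is exactly the behaviour described above: terms that are everywhere carry no information about any one book, while distinctive terms rise.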
Step back and realise: now we've built up a complex picture of our market, and we can start making visualisations like this.
The top 1,000 SO tags. You can see the key influencers and the clusters. The clusters already reflect real-life use, so this is immediately useful for customer segmentation.
This picture is idealised; real life is more of a hairball, and the clusters are not so well defined. How do you decide which of these tags represent key concepts? You need more context: you need an ontology.
Moving to an Ontology
Decide on a set of classes (programming language, database, task, etc.) and run a small manual tagging exercise to generate some entities. Most importantly, attach the ontology to the SO tags. Now we can use the SO network to grow the ontology, and it sits on top of our existing knowledge.
Let's take a step back again.
What have we achieved? The model so far: customer touchpoints, all content and metadata, and topics and their dependencies.
How do you add value?
We have a large view of all our topics, and a reasonable view of customer touchpoints. We can do some nice stuff with this: segmentation, outbound recommendations. Still, we don't know much about how our content is being consumed. It's all very well to say "learn X with book Y, job done". To really deliver value to our customers, we need to understand how they're USING our content: help organise learning, reduce pain.
We need to add two more things: book parts, and book consumption data (from PacktLib).
We need to go deeper
Get all the epub data (books as XML), parse out all chapters, sections, and subsections, add them all to the graph, and attach SO tags as before.
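As a sketch of that parsing step, here is stdlib ElementTree pulling a (book, chapter, section) structure out of book XML. The element and attribute names are assumptions for illustration; the real epub markup differs.

```python
import xml.etree.ElementTree as ET

# Hypothetical simplified book XML; the real epub schema will differ.
book_xml = """
<book title="Learning Neo4j">
  <chapter title="Introduction">
    <section title="What is a graph?"/>
    <section title="Installing Neo4j"/>
  </chapter>
  <chapter title="Cypher">
    <section title="MATCH and RETURN"/>
  </chapter>
</book>
"""

def parse_structure(xml_text):
    """Return (book, chapter, section) triples, ready to load as graph nodes."""
    root = ET.fromstring(xml_text)
    rows = []
    for chapter in root.findall("chapter"):
        for section in chapter.findall("section"):
            rows.append((root.get("title"), chapter.get("title"),
                         section.get("title")))
    return rows
```

Each triple becomes a small chain of nodes and HAS_PART-style edges, so a book is no longer one opaque node but a tree the tags can attach to at any level.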
Great! Surely this will solve all our problems and tell us exactly what's going on?
Learning Neo4J (again)
Here are tags taken from the actual book, with tf-idf applied. It looks pretty good: lots of specifics. But it doesn't work for all books.
For example, Python Machine Learning
Python Machine Learning
Very specific. Too specific: where's Python?? At this level, Python is too common; we get lots of Python-esque terms instead.
So how do we get back to the RIGHT level of information?
For this we use the idea of spreading activation
Spreading Activation - Python Machine Learning
[Slide diagram: activation spreading from "Python Machine Learning" through related tags such as Keras, Scikit-Learn, Python, and GPU, along edges with weights such as 0.6, 0.5, 0.2, 0.8, and 0.4.]
Spreading activation comes originally from cognitive psychology, where it was used to model memory. It is also used in information retrieval, and can be applied to any associative network. The algorithm: start with the initially activated concepts, spread that weight to ALL related concepts, and iterate over all activated concepts.
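Those three steps can be sketched in a few lines. The tag graph, edge weights, and decay factor below are hypothetical illustration values, not the production tuning:

```python
# Hypothetical tag graph: each tag maps to {neighbour: edge weight}.
graph = {
    "scikit-learn": {"python": 0.8, "machine-learning": 0.6},
    "keras": {"python": 0.7, "gpu": 0.4},
    "python": {}, "machine-learning": {}, "gpu": {},
}

def spread_activation(initial, graph, decay=0.5, iterations=2):
    """Spread weight from initially activated tags to their neighbours."""
    activation = dict(initial)
    for _ in range(iterations):
        updates = {}
        for node, weight in activation.items():
            for neighbour, edge in graph.get(node, {}).items():
                updates[neighbour] = updates.get(neighbour, 0) + weight * edge * decay
        for node, extra in updates.items():
            activation[node] = activation.get(node, 0) + extra
    return activation

# Very specific tags like "scikit-learn" and "keras" both push
# activation onto the general tag "python", lifting it back up.
result = spread_activation({"scikit-learn": 1.0, "keras": 1.0}, graph)
```

Because several specific terms all feed the same general neighbour, a tag like "python" accumulates more activation than any single specific term passes on, which is exactly the corrective effect we need after tf-idf.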
This brings Python back up to the top, and we have our picture of what the book's about. Now the final piece of the puzzle is actual book consumption.
Now we add in the consumption data from the subscription library.
Very powerful for marketingAND products
We can aggregate all those product views
How do people consume textbooks?
This shows an aggregate of how people consume Learning Neo4j; some chapters are more interesting than others. We can go up again and look at different types of book, or go down and look at different types of consumer. Does an expert read differently from a newbie? Is one more likely to skim than the other? What does this even mean??
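The aggregation behind a chart like this is straightforward once views are attached to chapter nodes. A minimal sketch over hypothetical view events (the event shape and names are made up for illustration):

```python
from collections import Counter

# Hypothetical page-view events from the subscription library:
# (user, book, chapter) per viewing session.
views = [
    ("u1", "Learning Neo4j", "Introduction"),
    ("u2", "Learning Neo4j", "Introduction"),
    ("u1", "Learning Neo4j", "Cypher"),
    ("u3", "Learning Neo4j", "Introduction"),
]

# Views per chapter: which chapters attract the most reading?
chapter_views = Counter(
    chapter for _, book, chapter in views if book == "Learning Neo4j"
)
```

Grouping the same events by user instead of by chapter gives the per-reader view needed for the expert-vs-newbie question above.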
What have we achieved? The graph contains: all content down to the page level, topics (Stack Overflow), key concepts (the ontology), topic relationships, all touchpoints, and content consumption.
Let's put it into action.
So that's our model: a very comprehensive view of our market. As I've demonstrated, lots of interesting research has come out of it. In a moment I'll talk about applications; first, an aside on putting it into production.
Production Environment
Neo4j sits on a server, coupled to our CMS and lots of other sources. A Rails API sits between the graph and the website. The API is very thin; the business logic is stored in Cypher queries, which means split testing and query tuning are trivial.
Obviously, in real life the situation's a bit more complicated.
Actual Production Environment
I'll talk about this a bit if there's time.