How the power of graphs helps deliver Packt's strategic vision

How the Power of Graphs Helps Packt Deliver our Strategic Vision
Greg Roberts
Twitter: @GregData

How graphs have helped us learn more about our market than we previously thought possible.

Talk is about how thinking with graphs helped evolve our mindset at Packt
Started off as basically a retailer
Now we're aiming to be a leading developer learning solution
Personalised
Skill mapping

So let's get started

Who am I?
Background in Physics and Maths

Joined Packt after graduating

Worked in products before moving into marketing

Joined as Data Analyst two years ago
Started in products, eventually moved into marketing
Now Senior Data Analyst:
Analysing campaigns
Researching ideas for data-driven products

For those of you who don't know Packt

Packt Publishing
Software textbook publishing company
Over 3,500 Books
Over 150 Videos
Over 6,000 Blogs and Articles
Lots of data to analyse!
Focus on Practicality
Actionable Knowledge
Emerging technologies
Subscription platform: PacktLib
Over 10,000 active users

IT book publisher
Based in Birmingham and Mumbai
Started by founders of Wrox

Focus on practical, results-oriented content
Focus on emerging technologies: identify information gaps and get to market fast

PacktLib: growing, engaged userbase

So when I joined, lots of interesting data floating about, mostly siloed and unused

This journey started when we were asked to implement upsell recs.

Upsell Recommendations - VBA

Basket Analysis
Affinity matrix in Excel
VBA Macro (gross)

Analyse one month of data: 3 minutes

- With the available data, decided to do collaborative filtering: basket affinities, like Amazon's "also bought"
- Implemented in the legacy tool of choice, VBA

Produced great results

- Very slow and inflexible

At the time, my hammer of choice was Python, so I decided to re-implement it in Python

Upsell Recommendations - Vanilla.py

Basket Analysis
Affinity matrix in NumPy array
Optimise to one pass-through

Analyse one month of data: 3 seconds

- Same algorithm, with NumPy
- Optimised, much faster

Still not satisfying.
I'd been reading about graphs recently, and realised they were a natural way to think about the problem.
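The basket-affinity idea in the notes above can be sketched as follows. This is a minimal illustration, not the code used at Packt: the integer product ids, the sample baskets and the helper names `affinity_matrix` and `also_bought` are all assumptions made for the example.

```python
from itertools import combinations

import numpy as np


def affinity_matrix(baskets, n_products):
    """Count how often each pair of products appears in the same basket."""
    counts = np.zeros((n_products, n_products), dtype=np.int64)
    # A single pass over the baskets fills the whole matrix.
    for basket in baskets:
        # De-duplicate so a product bought twice in one basket counts once.
        for a, b in combinations(sorted(set(basket)), 2):
            counts[a, b] += 1
            counts[b, a] += 1
    return counts


def also_bought(counts, product, top_n=3):
    """Return the products most often bought alongside `product`."""
    row = counts[product]
    ranked = np.argsort(-row, kind="stable")
    return [int(p) for p in ranked if row[p] > 0][:top_n]


# Hypothetical data: each basket is a list of integer product ids.
baskets = [[0, 1, 2], [0, 1], [1, 3], [0, 1, 3]]
counts = affinity_matrix(baskets, n_products=4)
print(also_bought(counts, product=0))
```

The matrix is symmetric and its diagonal stays at zero, so a product is never recommended alongside itself.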

Upsell Recommendations - Neo4j
Baskets are Nodes
Affinity matrix is a Traversal
Already optimised!!

Analyse one month of data: 30 milliseconds

Downloaded Neo4j, put in products, customers and baskets as nodes, purchases as edges

Did the same algorithm

Performance was amazing
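To see why the affinity matrix becomes a traversal, here is a rough sketch using plain Python dictionaries as a stand-in for the graph. It does not use Neo4j or Cypher; the purchase edges and the `also_bought` helper are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical purchase edges: (basket_id, product_name).
purchases = [
    ("b1", "Learning Neo4j"), ("b1", "Learning Cypher"),
    ("b2", "Learning Neo4j"), ("b2", "Learning Cypher"),
    ("b3", "Learning Neo4j"), ("b3", "Python Machine Learning"),
]

# Index the edges in both directions, as a graph store would.
basket_to_products = defaultdict(set)
product_to_baskets = defaultdict(set)
for basket, product in purchases:
    basket_to_products[basket].add(product)
    product_to_baskets[product].add(basket)


def also_bought(product):
    """Two-hop traversal: product -> its baskets -> the other products in them."""
    counts = Counter()
    for basket in product_to_baskets[product]:
        for other in basket_to_products[basket]:
            if other != product:
                counts[other] += 1
    return counts.most_common()


print(also_bought("Learning Neo4j"))
```

Only the baskets that actually contain the starting product are visited, which is one plausible reason the graph version needs no separate optimisation step.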

Flexibility was apparent

However, I'm not here to talk about recommendations

Been done better by more interesting people

Recommendations are just the beginning
Graphs are an amazing tool for thinking about data
Everything's a graph
Talk Structure
The data model
Into Production
Applications
The future

Talk is about how thinking about our data with graphs
led us to a deeper understanding of our market
& helped shift our focus from retailer to e-learning provider

Talk structure:
Walk through & build up model
Current applications
Future applications

So, the story so far

What do we know so far?
Some products are related to others...
What does that mean?

Engagement data gives us WHAT's connected, not WHY

What do the connections mean?
What's actually motivating our customers?
To answer WHY, need to know WHAT

You need to know what our titles are about: content filtering

Content Filtering - What's it all about?
Requires robust metadata
Takes Cost / Time / Resource / Errors
Automate! How do you automate metadata generation?

Content filtering requires robust metadata
Good coverage
Domain knowledge
What if you DON'T HAVE GOOD METADATA? Categories and keywords, both at the wrong level
Manually applying metadata is a big job
So I decided to automate it

You need:
Topics
Links between topics
Ideally a pre-made set of both

Fortunately, for our domain, the solution was right under our nose

Stick it all in the graph!
>10M questions
~60k distinct tags
Vast coverage
Real world uses
Excellent API!

Generating a Topic Network

Developer Q&A site with > 10M questions
All questions tagged, all tags moderated
All tags linked by co-occurrence
Co-occurrence represents real world use by customers in our market
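As a rough sketch of "tags linked by co-occurrence", the snippet below counts how often two tags appear on the same question. The sample questions are made up, and no Stack Overflow API call is shown; only the counting step is illustrated.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample: the tag list attached to each question.
questions = [
    ["neo4j", "cypher", "graph-databases"],
    ["neo4j", "cypher"],
    ["python", "neo4j"],
    ["nosql", "graph-databases", "neo4j"],
]

tag_counts = Counter()
edge_weights = Counter()
for tags in questions:
    unique = sorted(set(tags))
    tag_counts.update(unique)
    # Every pair of tags on the same question adds weight to one undirected edge.
    edge_weights.update(combinations(unique, 2))

for (a, b), weight in edge_weights.most_common(3):
    print(a, b, weight)
```

Sorting each tag list first means every undirected edge is stored under a single key, so `("cypher", "neo4j")` and `("neo4j", "cypher")` never get counted separately.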

StackOverflow.com

Stick all tags and edges into the graph, example here
See languages
Also see concepts, e.g. NoSQL, graph-databases

Can already see the network effects
And how much information there is

So we have a network of topics, now need to attach them to content

Generating the metadata

Text Extraction

Tag Extraction

Initially use website copy; keyword rich

Plain keywords give some detail, but lots of noise

Extracting tags gives us a clearer picture of the topics
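One simple way to extract tags from copy is to match the text against the known tag vocabulary. The sketch below assumes a tiny hypothetical vocabulary and a naive tokeniser; the real matching rules are not described in the talk.

```python
import re
from collections import Counter

# Hypothetical subset of the tag vocabulary.
known_tags = {"neo4j", "cypher", "graph-databases", "python", "nosql"}


def extract_tags(text, vocabulary):
    """Count occurrences of vocabulary tags in free text."""
    tokens = re.findall(r"[a-z0-9+#]+", text.lower())
    # Tags are hyphenated, so also try each adjacent word pair joined by "-".
    bigrams = ["-".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(t for t in tokens + bigrams if t in vocabulary)


copy = (
    "Learn Neo4j and the Cypher query language. "
    "Graph databases such as Neo4j are a NoSQL technology."
)
print(extract_tags(copy, known_tags))
```

Words outside the vocabulary ("learn", "query", "technology") are dropped, which is the noise reduction the notes describe.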

There is still some noise though. Something which is specific on SO may still be noise for us
How do you overcome this?

Getting More Information: tf-idf

Algorithm from the field of IR
Boosts less common terms
Gives the terms with the most information about a document

So we apply this to the terms extracted before
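For illustration, one common tf-idf variant can be written in a few lines. The corpus below is hypothetical, and the exact weighting used in the talk is not stated, so treat this as a sketch of the general technique.

```python
import math
from collections import Counter


def tf_idf(doc_tags, corpus):
    """Score each tag in one document by term frequency x inverse document frequency.

    Assumes `doc_tags` is itself one of the documents in `corpus`,
    so every tag occurs in at least one document.
    """
    n_docs = len(corpus)
    scores = {}
    for tag, count in Counter(doc_tags).items():
        docs_with_tag = sum(1 for doc in corpus if tag in doc)
        idf = math.log(n_docs / docs_with_tag)
        scores[tag] = (count / len(doc_tags)) * idf
    return scores


# Hypothetical corpus: the tags extracted from three books.
corpus = [
    ["neo4j", "cypher", "neo4j", "database"],
    ["python", "database", "numpy"],
    ["java", "database", "spring"],
]
scores = tf_idf(corpus[0], corpus)
for tag, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(tag, round(score, 3))
```

A tag such as "database" that appears in every document scores zero, while a tag that is frequent in one book and rare elsewhere rises to the top.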

tf-idf: gives more information!

Here you have Rik's book again
Can see that the crucial topics are floating to the top

Step back, realise...
Now we've built up a complex picture of our market
We can start making visualisations like this

Top 1000 SO tags
Can see the key influencers and clusters
Clusters already reflect real life use
Immediately useful: customer segmentation

This picture is idealised. Real life is more of a hairball, clusters not so well defined.
How do you decide which of these tags represent key concepts?
Need more context, need an ontology

Moving to an Ontology

Decide on a set of classes (programming language, database, task, etc.)
Small manual tagging exercise to generate some entities

Most importantly, attach the ontology to SO tags
Now we can use the SO network to grow the ontology
Sits on top of our existing knowledge

Let's take a step back again

What have we achieved?
Model so far:
Customer touchpoints
All content & metadata
Topics and their dependencies

How do you add value?

We have a large view of all our topics
Also a reasonable view of customer touchpoints
Can do some nice stuff with this:
Segmentation
Outbound recommendations
Still, we don't know much about how our content's being consumed
All very well to say LEARN X WITH BOOK Y, JOB DONE

To really deliver value to our customers, we need to understand how they're USING our content
Help organise learning
Reduce pain

We need to add two more things:
Book parts
Book consumption data (from PacktLib)

We need to go deeper

Get all EPUB data, books as XML
Parse out all chapters, sections, subsections
Add them all to the graph
Attach SO tags as before
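The parsing step might look something like the sketch below. The markup is a simplified, hypothetical stand-in (real EPUB XHTML is namespaced and less regular), and `book_part_edges` is an illustrative helper, not part of any library.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup for one chapter of a book.
chapter_xml = """
<chapter title="Graph Theory">
  <section title="What is a graph?">
    <subsection title="Nodes"/>
    <subsection title="Relationships"/>
  </section>
  <section title="Why graphs?"/>
</chapter>
"""


def book_part_edges(element, edges=None):
    """Walk the tree and emit (parent_title, child_title) edges for the graph."""
    if edges is None:
        edges = []
    for child in element:
        edges.append((element.get("title"), child.get("title")))
        book_part_edges(child, edges)
    return edges


edges = book_part_edges(ET.fromstring(chapter_xml))
for parent, child in edges:
    print(parent, "->", child)
```

Each edge becomes a parent-to-part relationship in the graph, and tags can then be attached to any level of the hierarchy.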

Great! Surely this will solve all our problems and tell us exactly what's going on?

Learning Neo4j (again)

Here are tags taken from the actual book, with tf-idf applied
Looks pretty good, lots of specifics
Doesn't work for all books

For example, Python Machine Learning

Python Machine Learning

Very specific
Too specific, where's Python??
At this level, Python is too common
Lots of Pythonesque terms

So how do we get back to the RIGHT level of info?

For this we use the idea of spreading activation

Spreading Activation
[Figure: Python Machine Learning. Tag nodes Keras, Scikit-Learn, Python and GPU, with activation weights (0.6, 0.5, 0.2, 0.8, 0.4) spreading between them]

Originally from cognitive psychology, to model memory
Also used in IR, can be applied to any associative network
Algorithm:
Start with initially activated concepts
Spread that weight to ALL related concepts
Iterate over all activated concepts
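The algorithm outlined above can be sketched as follows. The graph, the link strengths, the initial weights and the `decay` parameter are all assumptions chosen for illustration; the talk does not give the actual values or the stopping rule.

```python
from collections import defaultdict


def spread_activation(graph, initial, decay=0.5, iterations=2):
    """Propagate activation from the initially active tags to their neighbours."""
    activation = dict(initial)
    frontier = dict(initial)
    for _ in range(iterations):
        received = defaultdict(float)
        # Each active node passes a decayed share of its weight along every link.
        for node, weight in frontier.items():
            for neighbour, strength in graph.get(node, {}).items():
                received[neighbour] += weight * strength * decay
        for node, weight in received.items():
            activation[node] = activation.get(node, 0.0) + weight
        # Only newly received activation spreads in the next round.
        frontier = received
    return activation


# Hypothetical topic links: tag -> {related tag: link strength}.
graph = {
    "keras": {"python": 1.0, "gpu": 0.5},
    "scikit-learn": {"python": 1.0},
}
# Hypothetical tf-idf scores for a book where "python" ranks low.
initial = {"keras": 0.6, "scikit-learn": 0.5, "python": 0.2}

result = spread_activation(graph, initial)
for tag, weight in sorted(result.items(), key=lambda kv: -kv[1]):
    print(tag, round(weight, 2))
```

In this toy example both specific tags feed weight into "python", so the general concept ends up with the highest activation even though its own score started lowest.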

This brings Python back up to the top, and we have our picture of what the book's about.
Now the final piece of the puzzle is actual book consumption

Now we add in the consumption from the subscription library

Very powerful for marketing
AND products

We can aggregate all those product views

How do people consume textbooks?

Shows an aggregate of how people consume Learning Neo4j
Some chapters are more interesting than others

Can go up again and look at different types of book
Can also go down and look at different types of consumer
Does an expert read differently from a newbie?
Is one more likely to skim than the other?
What does this even mean??
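A per-chapter aggregate of this kind could be computed roughly as below. The event format and the "share of readers" measure are assumptions; the talk does not say which metric the chart shows.

```python
from collections import defaultdict

# Hypothetical reading events: (reader_id, chapter_number).
events = [
    ("u1", 1), ("u1", 2), ("u2", 1), ("u3", 1),
    ("u3", 4), ("u2", 4), ("u1", 1),
]

readers_by_chapter = defaultdict(set)
for reader, chapter in events:
    # A set ignores repeat visits by the same reader.
    readers_by_chapter[chapter].add(reader)

all_readers = {reader for reader, _ in events}
reach = {
    chapter: len(readers) / len(all_readers)
    for chapter, readers in sorted(readers_by_chapter.items())
}
print(reach)
```

Splitting `events` by reader segment before aggregating would give the expert-versus-newbie comparison raised in the notes.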

What have we achieved?
Graph Contains:
All content down to the page level
Topics (Stack Overflow)
Key Concepts (Ontology)
Topic Relationships
All touchpoints
Content Consumption

Let's put it into action

So that's our model
A very comprehensive view of our market
As I've demonstrated, lots of interesting research to come out of it
In a moment I will talk about applications

First an aside on putting it into production

Production Environment

Neo4j sits on a server, coupled to our CMS and lots of other sources
Rails API sits between the graph and the website
API is very thin
Business logic is stored in Cypher queries
Means split testing and query tuning is trivial

Obviously in real life the situation's a bit more complicated
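To illustrate why keeping the logic in queries makes split testing easy, here is a small sketch in Python (the production API is Rails; Python is used only for consistency with the other examples). The Cypher text, node labels, relationship types and helper names are all hypothetical, not the production schema.

```python
import zlib

# Hypothetical Cypher variants; labels, relationship types and properties
# are illustrative only.
QUERIES = {
    "also_bought": (
        "MATCH (p:Product {id: $id})<-[:CONTAINS]-(:Basket)-[:CONTAINS]->(o:Product) "
        "RETURN o.id AS id, count(*) AS score ORDER BY score DESC LIMIT 5"
    ),
    "shared_topics": (
        "MATCH (p:Product {id: $id})-[:HAS_TAG]->(:Tag)<-[:HAS_TAG]-(o:Product) "
        "RETURN o.id AS id, count(*) AS score ORDER BY score DESC LIMIT 5"
    ),
}


def pick_variant(customer_id):
    """Assign a customer to a query variant, stably across requests."""
    names = sorted(QUERIES)
    return names[zlib.crc32(customer_id.encode("utf-8")) % len(names)]


def recommendation_query(customer_id):
    """Return (variant name, Cypher text); the API layer only runs the text."""
    variant = pick_variant(customer_id)
    return variant, QUERIES[variant]


variant, query = recommendation_query("customer-42")
print(variant)
```

Adding or tuning a variant means editing one string in `QUERIES`; the thin API layer does not change.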

Actual Production Environment

Talk a bit if there's time