CSC 8101 Non Relational Databases
Outline
• Part 1: Non-Relational Databases (NoSQL)
– Trends forcing change
– NoSQL database types
– Graph Databases (Neo4J)
– Demo
• Part 2: Making Recommendations
– Background/example
– Pearson Score
– User based
– Item based
Credit: http://ecogreenliving.net/
Trend 1: Data Size
[Chart: digital information created, captured, and replicated worldwide, 2006–2012, in exabytes. Source: IDC 2009]
Trend 2: Connectedness
Source: http://nosql.mypopescu.com/post/342947902/presentation-graphs-neo4j-teh-awesome
Trend 3: Semi-Structure
• “The great majority of the data out there is not structured and
[there’s] no way in the world you can force people to structure
it.” [1]
• Trend accelerated by the decentralisation of content generation
that is the hallmark of the age of participation (“web 2.0”)
• Evolving applications
[1] Stefano Mazzocchi, Apache and MIT
Types of Databases
• Relational
• Key-Value Stores
• BigTable Clones
• Document Databases
• Graph Databases
Relational Databases
• Data Model: Normalised, multi-table with referential integrity
• Good for very static data
– Payroll, accounts
– Well understood
– Not evolving
• SQL Queries (joins etc.)
• Good Tooling
• Examples: Oracle, MySQL, Postgres, …
Key-Value Stores
• Data Model: (global) collection of K-V pairs
• Massive Distributed HashMap
• Partitioning and Replication usually ring based
– Load Balancer round robins the requests
– Hash(key) = partition
– Partition map maintains partition -> node mapping
– Quorum System (N, R, W), usually (3,2,2)
• Scales Well (1000B rows)
• How many apps need that?
– Google, Amazon, Facebook etc.
– <10 in the world
• Examples: Dynomite, Voldemort, Tokyo
[http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf]
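The partitioning scheme above (Hash(key) = partition, partition map → node) can be sketched in a few lines of Python. The `KVStore` class, its fixed partition map, and the node names are illustrative assumptions, not any real system's API; Dynamo-style stores layer a consistent-hashing ring and (N, R, W) quorum replication on top of this basic idea.

```python
import hashlib

class KVStore:
    """Toy key-value store: hash(key) -> partition -> node (a sketch)."""

    def __init__(self, nodes, num_partitions=8):
        self.nodes = nodes
        self.num_partitions = num_partitions
        # Partition map maintains partition -> node (round-robin assignment here)
        self.partition_map = {p: nodes[p % len(nodes)]
                              for p in range(num_partitions)}
        self.data = {node: {} for node in nodes}

    def _partition(self, key):
        # Hash(key) = partition
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return h % self.num_partitions

    def put(self, key, value):
        node = self.partition_map[self._partition(key)]
        self.data[node][key] = value
        return node  # which node received the write

    def get(self, key):
        node = self.partition_map[self._partition(key)]
        return self.data[node].get(key)

store = KVStore(["node-a", "node-b", "node-c"])
store.put("user:42", {"name": "Simon"})
print(store.get("user:42"))  # {'name': 'Simon'}
```

Any client can route a request by hashing the key itself, which is why these stores scale: no single node has to see every request.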
BigTable Clones
• Data model: single table, column families
• Distributed storage of semi-structured data (column families)
• Scale: “Petabyte range”
• Supports MapReduce well
• Example: Hbase, Hypertable
[http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en//papers/bigtable-osdi06.pdf]
Document Databases
• Inspired by Lotus Notes
• Data model: collections of K-V collections
• Document:
– Collection of K-V pairs (often JSON)
– Often versioned
• Scale: dependent on implementation
• Can (potentially) store entire 3 tier web app
in the database (probably NOT the best
architecture!)
• Example: CouchDB, MongoDB
Graph Databases
• Inspired by Euler & graph theory
• Data model: nodes, relationships, K-V on both
• Scale: 10B entities
• SPARQL Queries
• No O/R Impedance mismatch
• Semi Structured & Evolving Schema
• Example: AllegroGraph, VertexDB, Neo4j
Social Network Problem
• System stores people and friends
• Find all “friends of friends”
RDBMS Solution
• SQL: single join to get friends
• SELECT p.name, p2.name
  FROM people AS p, people AS p2, friends AS f
  WHERE p.id = 1 AND p.id = f.id1 AND p2.id = f.id2;
• SQL: 2-3 joins or subqueries to get “friends of friends”
• i.e. Not trivial and doesn’t scale
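The single-join query above, and the extra self-join that “friends of friends” forces, can be reproduced with Python's built-in sqlite3 module. Only the table and column names (people, friends, id1, id2) come from the slide; the sample people and friendships are made up for illustration.

```python
import sqlite3

# In-memory schema mirroring the slide's people/friends tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE friends (id1 INTEGER, id2 INTEGER);
INSERT INTO people VALUES (1,'Simon'),(2,'Chris'),(3,'Paul'),(4,'Dave');
INSERT INTO friends VALUES (1,2),(1,3),(2,4);
""")

# Direct friends: a single join, as on the slide.
friends = conn.execute("""
    SELECT p2.name FROM people AS p, people AS p2, friends AS f
    WHERE p.id = 1 AND p.id = f.id1 AND p2.id = f.id2
""").fetchall()

# Friends of friends: one more self-join on friends -- and every extra
# hop needs yet another join, which is why this approach doesn't scale.
fof = conn.execute("""
    SELECT p2.name FROM friends AS f1, friends AS f2, people AS p2
    WHERE f1.id1 = 1 AND f2.id1 = f1.id2 AND p2.id = f2.id2
""").fetchall()

print(friends, fof)
```

Depth 3 ("friends of friends of friends") would need a third self-join on friends, and so on: the query grows with the traversal depth.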
Graph DB Solution
• Graph Traversal
• pathExists(a, b) limit depth 2
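A minimal sketch of what pathExists(a, b) with a depth limit might look like, using a plain Python adjacency dict rather than the real Neo4j traversal API (this is not Neo4j code): a breadth-first traversal that simply stops expanding past the depth limit.

```python
from collections import deque

def path_exists(graph, a, b, max_depth=2):
    """Breadth-first traversal with a depth limit: is b reachable
    from a in at most max_depth hops?"""
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        node, depth = queue.popleft()
        if node == b:
            return True
        if depth < max_depth:
            for neighbour in graph.get(node, []):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append((neighbour, depth + 1))
    return False

graph = {"Simon": ["Chris"], "Chris": ["Simon", "Paul"], "Paul": ["Chris"]}
print(path_exists(graph, "Simon", "Paul"))  # True: friend of a friend
```

The work done is proportional to the neighbourhood actually visited, not to the total number of rows in a friends table, which is the point of the graph approach.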
Neo4J Model
• Nodes
• Relationships (edges)
• Properties on both
[Diagram: three nodes (1, 2, 3); node 1 carries name = “Simon”, job = “RA”; a second node carries name = “Chris”; the relationship between them has type = “KNOWS” and property age = “4 years”]
Live Demo!
Neo4J Model
• Transactions
• Reference Node
• Indexes (Apache Lucene)
• Visualisation
– Neoclipse
– The JIT
Neoclipse
Pros and Cons
• “Whiteboard friendly” – fits domain models better
• Scales up “enough”
• Evolve Schema
• Can represent semi-structured data
• Good Performance for graph/network traversals
• Lacks tool support
• Harder to write ad-hoc queries (SPARQL vs. SQL)
Important Reminders
• Other options exist apart from the
Relational Database
• Fit the technology to the domain model, not
the domain model to the technology
Questions?
• http://neo4j.org/
• Some material from
[http://nosql.mypopescu.com/post/
342947902/presentation-graphs-
neo4j-teh-awesome]
Part 2: Collaborative Filtering
• Calculating Similarities
• User based filtering
• Item based filtering
Why?
• Sell more items
• Increase market share
• Better targeted advertising
• Up sell rather than new-sell
• Make more £££
• Not perfect
– Bad recommendations
– Inappropriate recommendations
It can go wrong
It will go wrong
Preference Data
• Movie ratings: 5, 4, 3, 2, 1
• Online shopping site: Bought = 1, Didn’t Buy = 0
• Recommender: Like = 1, No vote = 0, Didn’t Like = -1
Recommending Items
• Step 1: Calculate similarities
– either user-user or item-item
• Step 2: Predict scores for “unseen”
items
• Step 3: Normalise and order
Example Data: Movie Reviews

|       | Shawshank Redemption | The Ghost | Lock Stock | Love Actually | Titanic | Seven |
| Simon | 5                    | 4         | 4          | 1             |         |       |
| Chris | 1                    | 3         | 4          | 5             | 4       |       |
| Paul  | 4                    | 5         |            |               | 2       | 4     |
Calculating Similarity
• Method 1: Euclidean Distance Score
• Compare Common Rankings
• n-dimensional preference space
• Score 0 – 1
• 1 = Identical
• 0 = Highly dissimilar
Calculating Euclidean Distance Score
• Done for each pair of people
• Difference in each axis
• Square
• Add them together
• Add 1 (avoids divide by zero)
• Square Root
• Invert
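The steps above can be written directly as a small Python function. The `prefs` dictionary layout (person → item → rating) is an assumed representation of the ratings table, not code from the lecture:

```python
from math import sqrt

def euclidean_score(prefs, a, b):
    """Similarity per the steps above: squared differences over
    commonly rated items, summed, plus 1, square-rooted, inverted."""
    common = [item for item in prefs[a] if item in prefs[b]]
    if not common:
        return 0.0  # no shared ratings: treat as no similarity
    sum_sq = sum((prefs[a][item] - prefs[b][item]) ** 2 for item in common)
    return 1 / sqrt(sum_sq + 1)

# The Chris/Simon example that follows: differences of 4 and 1.
prefs = {"Simon": {"Shawshank": 5, "The Ghost": 4},
         "Chris": {"Shawshank": 1, "The Ghost": 3}}
print(euclidean_score(prefs, "Simon", "Chris"))  # 0.23570226...
```

Adding 1 before the square root guarantees the score lies in (0, 1] and avoids division by zero for identical raters.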
Chris and Simon
• Difference in each axis
– (5-1), (4-3) = 4, 1
• Square
– 16, 1
• Add them together
– 17
• Add 1 (avoids divide by zero)
– = 18
• Square Root
– = 4.24264069
• Invert
– = 0.23570226
Euclidean Distance Score
• Easy to calculate
• Bad for people who are similar but
consistently rate higher/lower
Pearson Correlation Coefficient
• More Complicated
• Line of Best Fit between commonly rated items
• Deals with grade inflation
• Other measures
– Jaccard Coefficient
– Manhattan Distance
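A sketch of the Pearson coefficient over commonly rated items, using the standard correlation formula rather than code from the lecture. Because it measures how well the two sets of ratings fit a straight line, rather than how close they are, it handles the "grade inflation" case the Euclidean score gets wrong:

```python
from math import sqrt

def pearson_score(prefs, a, b):
    """Pearson correlation coefficient over commonly rated items."""
    common = [i for i in prefs[a] if i in prefs[b]]
    n = len(common)
    if n == 0:
        return 0.0
    sum_a = sum(prefs[a][i] for i in common)
    sum_b = sum(prefs[b][i] for i in common)
    sum_a2 = sum(prefs[a][i] ** 2 for i in common)
    sum_b2 = sum(prefs[b][i] ** 2 for i in common)
    sum_ab = sum(prefs[a][i] * prefs[b][i] for i in common)
    # Covariance over the product of standard deviations (line of best fit)
    num = sum_ab - sum_a * sum_b / n
    den = sqrt((sum_a2 - sum_a ** 2 / n) * (sum_b2 - sum_b ** 2 / n))
    return 0.0 if den == 0 else num / den

# A harsh critic who ranks films identically to a generous rater,
# just two points lower across the board, still scores 1.0:
prefs = {"gen": {"x": 5, "y": 4, "z": 3}, "critic": {"x": 3, "y": 2, "z": 1}}
print(pearson_score(prefs, "gen", "critic"))  # 1.0
```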
User based Filtering
• Look at what similar people have liked but
you haven’t seen
– What if a similar person likes something that has bad
reviews from everyone else?
• Weighted Score that ranks the other
people and takes into account similarity
Recommending Items

|               | Similarity (ED) | Titanic | Sim x Titanic | Seven | Sim x Seven |
| Chris         | 0.23            | 4       | 0.92          |       |             |
| Paul          | 0.78            | 2       | 1.56          | 4     | 3.12        |
| Total         |                 |         | 2.48          |       | 3.12        |
| Sim Sum       |                 |         | 1.01          |       | 0.78        |
| Total/Sim Sum |                 |         | 2.455         |       | 4.0         |
User Based Filtering - Conclusions
• Calculate Similarity between users
• Recommend based on similar users
• Similarity
– Euclidean Distance Score
– Pearson Coefficient – better for non-normalised data
• Problem – need to compare every user/item to every
other user/item
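The weighted-score calculation from the table above can be sketched as follows. The dictionary layouts are assumed representations; similarities would come from one of the similarity functions already discussed:

```python
def predict_user_based(prefs, sims, user):
    """For each item the user hasn't rated: sum(similarity * rating)
    over users who rated it, divided by the sum of those users'
    similarities (the Total / Sim Sum rows of the table)."""
    totals, sim_sums = {}, {}
    for other, sim in sims[user].items():
        if sim <= 0:
            continue  # ignore dissimilar users
        for item, rating in prefs[other].items():
            if item in prefs[user]:
                continue  # only predict unseen items
            totals[item] = totals.get(item, 0.0) + sim * rating
            sim_sums[item] = sim_sums.get(item, 0.0) + sim
    return {item: totals[item] / sim_sums[item] for item in totals}

# Numbers from the table: Chris (sim 0.23) rated Titanic 4;
# Paul (sim 0.78) rated Titanic 2 and Seven 4.
prefs = {"Simon": {"Shawshank": 5},
         "Chris": {"Shawshank": 1, "Titanic": 4},
         "Paul": {"Titanic": 2, "Seven": 4}}
sims = {"Simon": {"Chris": 0.23, "Paul": 0.78}}
print(predict_user_based(prefs, sims, "Simon"))
# Titanic: 2.48 / 1.01 = 2.455...; Seven: 3.12 / 0.78 = 4.0
```

Dividing by the similarity sum normalises for the fact that different items were rated by different (and differently similar) people.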
Item Based Filtering
• Pre-compute most similar items for each
item
– Item similarities change less often than user
similarities and can be re-used
• Create a weighted list of items most
similar to user’s top rated items
Recommending Items

|               | Rating | Titanic (ED) | Rat x Titanic | Seven (ED) | Rat x Seven |
| Shawshank     | 5      | 0.084        | 0.42          | 0.366      | 1.83        |
| The Ghost     | 4      | 0.125        | 0.5           | 0.487      | 1.948       |
| Lock Stock    | 4      | 0.091        | 0.364         | 0.318      | 1.272       |
| Love Actually | 1      | 0.737        | 0.737         | 0.184      | 0.184       |
| Total         |        | 1.037        | 2.021         | 1.355      | 5.234       |
| Normalised (Total / Sim Total) | |     | 1.948         |            | 3.8627      |
Item Based Filtering - Conclusions
• Calculate Similarity between items
• Recommend based on user’s ratings for items
• Similarity (as before)
– Euclidean Distance Score
– Pearson Coefficient – better for non-normalised data
• Problem – need to maintain item similarity data set
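The item-based prediction from the table above can be sketched like this, assuming the item-item similarities have already been pre-computed (the dictionary-of-pairs layout is an assumption for illustration):

```python
def predict_item_based(user_ratings, item_sims, candidates):
    """For each candidate item, take a weighted average of the user's
    existing ratings, weighted by the pre-computed item-item
    similarity: sum(rating * sim) / sum(sim)."""
    predictions = {}
    for candidate in candidates:
        weighted = sim_total = 0.0
        for item, rating in user_ratings.items():
            sim = item_sims.get((item, candidate), 0.0)
            weighted += rating * sim
            sim_total += sim
        if sim_total > 0:
            predictions[candidate] = weighted / sim_total
    return predictions

# Simon's ratings and the item-item Euclidean scores from the table.
ratings = {"Shawshank": 5, "The Ghost": 4,
           "Lock Stock": 4, "Love Actually": 1}
sims = {("Shawshank", "Titanic"): 0.084, ("The Ghost", "Titanic"): 0.125,
        ("Lock Stock", "Titanic"): 0.091, ("Love Actually", "Titanic"): 0.737,
        ("Shawshank", "Seven"): 0.366, ("The Ghost", "Seven"): 0.487,
        ("Lock Stock", "Seven"): 0.318, ("Love Actually", "Seven"): 0.184}
print(predict_item_based(ratings, sims, ["Titanic", "Seven"]))
# Titanic: 2.021 / 1.037 = 1.948...; Seven: 5.234 / 1.355 = 3.8627...
```

Note that only the user's own ratings and a similarity lookup are needed at prediction time, which is why this variant scales better once the similarity data set is maintained offline.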
Item vs. User Based Filtering
• Item based scales better
– Need to maintain the similarities data set
• User based simpler to implement
• May (or may not) want to show users who is
similar in terms of habits
• Perform equally on dense data sets
• Item based performs better on sparse data sets
Questions?
• Reference: Programming Collective
Intelligence, Toby Segaran, O’Reilly 2007