CSC 8101 Non Relational Databases
Outline
• Part 1: Non-Relational Databases (NoSQL)
– Trends forcing change
– NoSQL database types
– Graph Databases (Neo4J)
– Demo
• Part 2: Making Recommendations
– Background/example
– Pearson Score
– User based
– Item based
Credit: http://ecogreenliving.net/
Trend 1: Data Size
[Chart: digital information created, captured, and replicated worldwide, 2006–2012, in exabytes. Source: IDC 2009]
Trend 2: Connectedness
Source: http://nosql.mypopescu.com/post/342947902/presentation-graphs-neo4j-teh-awesome
Trend 3: Semi-Structure
• “The great majority of the data out there is not structured and
[there’s] no way in the world you can force people to structure
it.” [1]
• Trend accelerated by the decentralisation of content generation
that is the hallmark of the age of participation (“web 2.0”)
• Evolving applications
[1] Stefano Mazzocchi, Apache and MIT
Types of Databases
• Relational
• Key-Value Stores
• BigTable Clones
• Document Databases
• Graph Databases
Relational Databases
• Data Model: Normalised, multi-table with referential integrity
• Good for very static data
– Payroll, accounts
– Well understood
– Not evolving
• SQL Queries (joins etc.)
• Good Tooling
• Examples: Oracle, MySQL, Postgres, …
Key-Value Stores
• Data Model: (global) collection of K-V pairs
• Massive Distributed HashMap
• Partitioning and Replication usually ring based
– Load Balancer round robins the requests
– Hash(key) = partition
– Partition map maintains partition -> node mapping
– Quorum System (N, R, W), usually (3,2,2)
• Scales Well (1000B rows)
• How many apps need that?
– Google, Amazon, Facebook etc.
– <10 in the world
• Examples: Dynomite, Voldemort, Tokyo
[http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf]
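The partitioning scheme above (Hash(key) = partition, partition map → node) can be sketched in a few lines of Python. The `KVStore` class, its fixed partition map, and the node names are illustrative assumptions, not any real system's API; Dynamo-style stores layer a consistent-hashing ring and (N, R, W) quorum replication on top of this basic idea.

```python
import hashlib

class KVStore:
    """Toy key-value store: hash(key) -> partition -> node (a sketch)."""

    def __init__(self, nodes, num_partitions=8):
        self.nodes = nodes
        self.num_partitions = num_partitions
        # Partition map maintains partition -> node (round-robin assignment here)
        self.partition_map = {p: nodes[p % len(nodes)]
                              for p in range(num_partitions)}
        self.data = {node: {} for node in nodes}

    def _partition(self, key):
        # Hash(key) = partition
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return h % self.num_partitions

    def put(self, key, value):
        node = self.partition_map[self._partition(key)]
        self.data[node][key] = value
        return node  # which node received the write

    def get(self, key):
        node = self.partition_map[self._partition(key)]
        return self.data[node].get(key)

store = KVStore(["node-a", "node-b", "node-c"])
store.put("user:42", {"name": "Simon"})
print(store.get("user:42"))  # {'name': 'Simon'}
```

Any client can route a request by hashing the key itself, which is why these stores scale: no single node has to see every request.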
BigTable Clones
• Data model: single table, column families
• Distributed storage of semi-structured data (column families)
• Scale: “Petabyte range”
• Supports MapReduce well
• Example: Hbase, Hypertable
[http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en//papers/bigtable-osdi06.pdf]
Document Databases
• Inspired by Lotus Notes
• Data model: collections of K-V collections
• Document:
– Collection of K-V pairs (often JSON)
– Often versioned
• Scale: dependent on implementation
• Can (potentially) store entire 3 tier web app
in the database (probably NOT the best
architecture!)
• Example: CouchDB, MongoDB
Graph Databases
• Inspired by Euler & graph theory
• Data model: nodes, relationships, K-V on both
• Scale: 10B entities
• SPARQL Queries
• No O/R Impedance mismatch
• Semi Structured & Evolving Schema
• Example: AllegroGraph, VertexDB, Neo4j
Social Network Problem
• System stores people and friends
• Find all “friends of friends”
RDBMS Solution
• SQL: single join to get friends
• SELECT p.name, p2.name
  FROM people AS p, people AS p2, friends AS f
  WHERE p.id = 1 AND p.id = f.id1 AND p2.id = f.id2;
• SQL: 2-3 joins or subqueries to get “friends of friends”
• i.e. Not trivial and doesn’t scale
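The single-join query above, and the extra self-join that “friends of friends” forces, can be reproduced with Python's built-in sqlite3 module. Only the table and column names (people, friends, id1, id2) come from the slide; the sample people and friendships are made up for illustration.

```python
import sqlite3

# In-memory schema mirroring the slide's people/friends tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE friends (id1 INTEGER, id2 INTEGER);
INSERT INTO people VALUES (1,'Simon'),(2,'Chris'),(3,'Paul'),(4,'Dave');
INSERT INTO friends VALUES (1,2),(1,3),(2,4);
""")

# Direct friends: a single join, as on the slide.
friends = conn.execute("""
    SELECT p2.name FROM people AS p, people AS p2, friends AS f
    WHERE p.id = 1 AND p.id = f.id1 AND p2.id = f.id2
""").fetchall()

# Friends of friends: one more self-join on friends -- and every extra
# hop needs yet another join, which is why this approach doesn't scale.
fof = conn.execute("""
    SELECT p2.name FROM friends AS f1, friends AS f2, people AS p2
    WHERE f1.id1 = 1 AND f2.id1 = f1.id2 AND p2.id = f2.id2
""").fetchall()

print(friends, fof)
```

Depth 3 ("friends of friends of friends") would need a third self-join on friends, and so on: the query grows with the traversal depth.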
Graph DB Solution
• Graph Traversal
• pathExists(a, b) limit depth 2
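A minimal sketch of what pathExists(a, b) with a depth limit might look like, using a plain Python adjacency dict rather than the real Neo4j traversal API (this is not Neo4j code): a breadth-first traversal that simply stops expanding past the depth limit.

```python
from collections import deque

def path_exists(graph, a, b, max_depth=2):
    """Breadth-first traversal with a depth limit: is b reachable
    from a in at most max_depth hops?"""
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        node, depth = queue.popleft()
        if node == b:
            return True
        if depth < max_depth:
            for neighbour in graph.get(node, []):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append((neighbour, depth + 1))
    return False

graph = {"Simon": ["Chris"], "Chris": ["Simon", "Paul"], "Paul": ["Chris"]}
print(path_exists(graph, "Simon", "Paul"))  # True: friend of a friend
```

The work done is proportional to the neighbourhood actually visited, not to the total number of rows in a friends table, which is the point of the graph approach.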
Neo4J Model
• Nodes
• Relationships (edges)
• Properties on both
[Diagram: three nodes (1, 2, 3); node 1 carries name = “Simon”, job = “RA”; a second node carries name = “Chris”; the relationship between them has type = “KNOWS” and property age = “4 years”]
Live Demo!
Neo4J Model
• Transactions
• Reference Node
• Indexes (Apache Lucene)
• Visualisation
– Neoclipse
– The JIT
Neoclipse
Pros and Cons
• “Whiteboard friendly” – fits domain models better
• Scales up “enough”
• Evolve Schema
• Can represent semi-structured data
• Good Performance for graph/network traversals
• Lacks tool support
• Harder to write ad-hoc queries (SPARQL vs. SQL)
Important Reminders
• Other options exist apart from the
Relational Database
• Fit the technology to the domain model, not
the domain model to the technology
Questions?
• http://neo4j.org/
• Some material from
[http://nosql.mypopescu.com/post/
342947902/presentation-graphs-
neo4j-teh-awesome]
Part 2: Collaborative Filtering
• Calculating Similarities
• User based filtering
• Item based filtering
Why?
• Sell more items
• Increase market share
• Better targeted advertising
• Up sell rather than new-sell
• Make more £££
• Not perfect
– Bad recommendations
– Inappropriate recommendations
It can go wrong
It will go wrong
Preference Data
• Movie ratings: 5, 4, 3, 2, 1
• Online shopping site: Bought = 1, Didn’t Buy = 0
• Recommender: Like = 1, No vote = 0, Didn’t Like = -1
Recommending Items
• Step 1: Calculate similarities
– either user-user or item-item
• Step 2: Predict scores for “unseen”
items
• Step 3: Normalise and order
Example Data: Movie Reviews

|       | Shawshank Redemption | The Ghost | Lock Stock | Love Actually | Titanic | Seven |
| Simon | 5                    | 4         | 4          | 1             |         |       |
| Chris | 1                    | 3         | 4          | 5             | 4       |       |
| Paul  | 4                    | 5         |            |               | 2       | 4     |
Calculating Similarity
• Method 1: Euclidean Distance Score
• Compare Common Rankings
• n-dimensional preference space
• Score 0 – 1
• 1 = Identical
• 0 = Highly dissimilar
Calculating Euclidean Distance Score
• Done for each pair of people
• Difference in each axis
• Square
• Add them together
• Add 1 (avoids divide by zero)
• Square Root
• Invert
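The steps above can be written directly as a small Python function. The `prefs` dictionary layout (person → item → rating) is an assumed representation of the ratings table, not code from the lecture:

```python
from math import sqrt

def euclidean_score(prefs, a, b):
    """Similarity per the steps above: squared differences over
    commonly rated items, summed, plus 1, square-rooted, inverted."""
    common = [item for item in prefs[a] if item in prefs[b]]
    if not common:
        return 0.0  # no shared ratings: treat as no similarity
    sum_sq = sum((prefs[a][item] - prefs[b][item]) ** 2 for item in common)
    return 1 / sqrt(sum_sq + 1)

# The Chris/Simon example that follows: differences of 4 and 1.
prefs = {"Simon": {"Shawshank": 5, "The Ghost": 4},
         "Chris": {"Shawshank": 1, "The Ghost": 3}}
print(euclidean_score(prefs, "Simon", "Chris"))  # 0.23570226...
```

Adding 1 before the square root guarantees the score lies in (0, 1] and avoids division by zero for identical raters.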
Chris and Simon
• Difference in each axis
– (5-1), (4-3) = 4, 1
• Square
– 16, 1
• Add them together
– 17
• Add 1 (avoids divide by zero)
– = 18
• Square Root
– = 4.24264069
• Invert
– = 0.23570226
Euclidean Distance Score
• Easy to calculate
• Bad for people who are similar but
consistently rate higher/lower
Pearson Correlation Coefficient
• More Complicated
• Line of Best Fit between commonly rated items
• Deals with grade inflation
• Other measures
– Jaccard Coefficient
– Manhattan Distance
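A sketch of the Pearson coefficient over commonly rated items, using the standard correlation formula rather than code from the lecture. Because it measures how well the two sets of ratings fit a straight line, rather than how close they are, it handles the "grade inflation" case the Euclidean score gets wrong:

```python
from math import sqrt

def pearson_score(prefs, a, b):
    """Pearson correlation coefficient over commonly rated items."""
    common = [i for i in prefs[a] if i in prefs[b]]
    n = len(common)
    if n == 0:
        return 0.0
    sum_a = sum(prefs[a][i] for i in common)
    sum_b = sum(prefs[b][i] for i in common)
    sum_a2 = sum(prefs[a][i] ** 2 for i in common)
    sum_b2 = sum(prefs[b][i] ** 2 for i in common)
    sum_ab = sum(prefs[a][i] * prefs[b][i] for i in common)
    # Covariance over the product of standard deviations (line of best fit)
    num = sum_ab - sum_a * sum_b / n
    den = sqrt((sum_a2 - sum_a ** 2 / n) * (sum_b2 - sum_b ** 2 / n))
    return 0.0 if den == 0 else num / den

# A harsh critic who ranks films identically to a generous rater,
# just two points lower across the board, still scores 1.0:
prefs = {"gen": {"x": 5, "y": 4, "z": 3}, "critic": {"x": 3, "y": 2, "z": 1}}
print(pearson_score(prefs, "gen", "critic"))  # 1.0
```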
User based Filtering
• Look at what similar people have liked but
you haven’t seen
– What if a similar person likes something that has bad
reviews from everyone else?
• Weighted Score that ranks the other
people and takes into account similarity
Recommending Items

|               | Similarity (ED) | Titanic | Sim x Titanic | Seven | Sim x Seven |
| Chris         | 0.23            | 4       | 0.92          |       |             |
| Paul          | 0.78            | 2       | 1.56          | 4     | 3.12        |
| Total         |                 |         | 2.48          |       | 3.12        |
| Sim Sum       |                 |         | 1.01          |       | 0.78        |
| Total/Sim Sum |                 |         | 2.455         |       | 4.0         |
User Based Filtering - Conclusions
• Calculate Similarity between users
• Recommend based on similar users
• Similarity
– Euclidean Distance Score
– Pearson Coefficient – better for non-normalised data
• Problem – need to compare every user/item to every
other user/item
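The weighted-score calculation from the table above can be sketched as follows. The dictionary layouts are assumed representations; similarities would come from one of the similarity functions already discussed:

```python
def predict_user_based(prefs, sims, user):
    """For each item the user hasn't rated: sum(similarity * rating)
    over users who rated it, divided by the sum of those users'
    similarities (the Total / Sim Sum rows of the table)."""
    totals, sim_sums = {}, {}
    for other, sim in sims[user].items():
        if sim <= 0:
            continue  # ignore dissimilar users
        for item, rating in prefs[other].items():
            if item in prefs[user]:
                continue  # only predict unseen items
            totals[item] = totals.get(item, 0.0) + sim * rating
            sim_sums[item] = sim_sums.get(item, 0.0) + sim
    return {item: totals[item] / sim_sums[item] for item in totals}

# Numbers from the table: Chris (sim 0.23) rated Titanic 4;
# Paul (sim 0.78) rated Titanic 2 and Seven 4.
prefs = {"Simon": {"Shawshank": 5},
         "Chris": {"Shawshank": 1, "Titanic": 4},
         "Paul": {"Titanic": 2, "Seven": 4}}
sims = {"Simon": {"Chris": 0.23, "Paul": 0.78}}
print(predict_user_based(prefs, sims, "Simon"))
# Titanic: 2.48 / 1.01 = 2.455...; Seven: 3.12 / 0.78 = 4.0
```

Dividing by the similarity sum normalises for the fact that different items were rated by different (and differently similar) people.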
Item Based Filtering
• Pre-compute most similar items for each
item
– Item similarities change less often than user
similarities and can be re-used
• Create a weighted list of items most
similar to user’s top rated items
Recommending Items

|               | Rating | Titanic (ED) | Rat x Titanic | Seven (ED) | Rat x Seven |
| Shawshank     | 5      | 0.084        | 0.42          | 0.366      | 1.83        |
| The Ghost     | 4      | 0.125        | 0.5           | 0.487      | 1.948       |
| Lock Stock    | 4      | 0.091        | 0.364         | 0.318      | 1.272       |
| Love Actually | 1      | 0.737        | 0.737         | 0.184      | 0.184       |
| Total         |        | 1.037        | 2.021         | 1.355      | 5.234       |
| Normalised (Total / Sim Total) | |     | 1.948         |            | 3.8627      |
Item Based Filtering - Conclusions
• Calculate Similarity between items
• Recommend based on user’s ratings for items
• Similarity (as before)
– Euclidean Distance Score
– Pearson Coefficient – better for non-normalised data
• Problem – need to maintain item similarity data set
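The item-based prediction from the table above can be sketched like this, assuming the item-item similarities have already been pre-computed (the dictionary-of-pairs layout is an assumption for illustration):

```python
def predict_item_based(user_ratings, item_sims, candidates):
    """For each candidate item, take a weighted average of the user's
    existing ratings, weighted by the pre-computed item-item
    similarity: sum(rating * sim) / sum(sim)."""
    predictions = {}
    for candidate in candidates:
        weighted = sim_total = 0.0
        for item, rating in user_ratings.items():
            sim = item_sims.get((item, candidate), 0.0)
            weighted += rating * sim
            sim_total += sim
        if sim_total > 0:
            predictions[candidate] = weighted / sim_total
    return predictions

# Simon's ratings and the item-item Euclidean scores from the table.
ratings = {"Shawshank": 5, "The Ghost": 4,
           "Lock Stock": 4, "Love Actually": 1}
sims = {("Shawshank", "Titanic"): 0.084, ("The Ghost", "Titanic"): 0.125,
        ("Lock Stock", "Titanic"): 0.091, ("Love Actually", "Titanic"): 0.737,
        ("Shawshank", "Seven"): 0.366, ("The Ghost", "Seven"): 0.487,
        ("Lock Stock", "Seven"): 0.318, ("Love Actually", "Seven"): 0.184}
print(predict_item_based(ratings, sims, ["Titanic", "Seven"]))
# Titanic: 2.021 / 1.037 = 1.948...; Seven: 5.234 / 1.355 = 3.8627...
```

Note that only the user's own ratings and a similarity lookup are needed at prediction time, which is why this variant scales better once the similarity data set is maintained offline.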
Item vs. User Based Filtering
• Item based scales better
– Need to maintain the similarities data set
• User based simpler to implement
• May (or may not) want to show users who is
similar in terms of habits
• Perform equally on dense data sets
• Item based performs better on sparse data sets
Questions?
• Reference: Programming Collective
Intelligence, Toby Segaran, O’Reilly 2007