Open Bank Project at APIDays Open Banking and Fintech APIs 2015
Data Mining and Open APIs
Toby Segaran
About Me
- Software Developer at Genstruct
- Work directly with scientists
- Design algorithms to aid in drug testing
- “Programming Collective Intelligence”, published by O’Reilly, due out in August
- Consult with open-source projects and other companies
- http://kiwitobes.com
Presentation Goals
- Look at some Open APIs
- Get some data
- Visualize algorithms for data-mining
- Work through some Python code
- Variety of techniques and sources
- Advocacy (why you should care)
Open data APIs
- Zillow
- eBay
- Facebook
- del.icio.us
- HotOrNot
- Upcoming
- Yahoo Answers
- Amazon
- Technorati
- Twitter
- Google News
programmableweb.com/apis for more…
Open API uses
- Mashups
- Integration
- Automation
- Command-line tools
- Most importantly, creating datasets!
What is data mining?
From a large dataset, find the:
- Implicit
- Unknown
- Useful

Data could be:
- Tabular, e.g. price lists
- Free text
- Pictures
Why it’s important now
- More devices produce more data
- People share more data
- The internet is vast
- Products are more customized
- Advertising is targeted
- Human cognition is limited
Traditional Applications
- Computational Biology
- Financial Markets
- Retail Markets
- Fraud Detection
- Surveillance
- Supply Chain Optimization
- National Security
Traditional = Inaccessible
- Real applications are esoteric
- Tutorial examples are trivial
- Generally lacking in “interest value”
Fun, Accessible Applications
- Home price modeling
- Where are the hottest people?
- Which bloggers are similar?
- Important attributes on eBay
- Predicting fashion trends
- Movie popularity
Zillow
The Zillow API
- Allows querying by address
- Returns information about the property:
  - Bedrooms
  - Bathrooms
  - Zip Code
  - Price Estimate
  - Last Sale Price
- Requires a registration key: http://www.zillow.com/howto/api/PropertyDetailsAPIOverview.htm
The Zillow API
REST Request
http://www.zillow.com/webservice/GetDeepSearchResults.htm?zws-id=key&address=address&citystatezip=citystatezip
The Zillow API

<SearchResults:searchresults xmlns:SearchResults="http://www.zillow.com/vstatic/3/static/xsd/SearchResults.xsd">
  …
  <response><results><result>
    <zpid>48749425</zpid>
    <links>…</links>
    <address>
      <street>2114 Bigelow Ave N</street>
      <zipcode>98109</zipcode>
      <city>SEATTLE</city>
      <state>WA</state>
      <latitude>47.637934</latitude>
      <longitude>-122.347936</longitude>
    </address>
    <yearBuilt>1924</yearBuilt>
    <lotSizeSqFt>4680</lotSizeSqFt>
    <finishedSqFt>3290</finishedSqFt>
    <bathrooms>2.75</bathrooms>
    <bedrooms>4</bedrooms>
    <lastSoldDate>06/18/2002</lastSoldDate>
    <lastSoldPrice currency="USD">770000</lastSoldPrice>
    <valuation><amount currency="USD">1091061</amount></valuation>
  </result></results></response>
Zillow from Python

import urllib2
import xml.dom.minidom

def getaddressdata(address,city):
    escad=address.replace(' ','+')

    # Construct the URL
    url='http://www.zillow.com/webservice/GetDeepSearchResults.htm?'
    url+='zws-id=%s&address=%s&citystatezip=%s' % (zwskey,escad,city)

    # Parse resulting XML
    doc=xml.dom.minidom.parseString(urllib2.urlopen(url).read())
    code=doc.getElementsByTagName('code')[0].firstChild.data

    # Code 0 means success, otherwise there was an error
    if code!='0': return None

    # Extract the info about this property
    try:
        zipcode=doc.getElementsByTagName('zipcode')[0].firstChild.data
        use=doc.getElementsByTagName('useCode')[0].firstChild.data
        year=doc.getElementsByTagName('yearBuilt')[0].firstChild.data
        bath=doc.getElementsByTagName('bathrooms')[0].firstChild.data
        bed=doc.getElementsByTagName('bedrooms')[0].firstChild.data
        rooms=doc.getElementsByTagName('totalRooms')[0].firstChild.data
        price=doc.getElementsByTagName('amount')[0].firstChild.data
    except:
        return None

    return (zipcode,use,int(year),float(bath),int(bed),int(rooms),price)
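The slides don't show how getaddressdata is driven; a minimal driver sketch (the key and the helper name are placeholders, not from the talk):

zwskey='YOUR-ZWS-ID'   # placeholder: your Zillow registration key

def getpricelist(addresses,city):
    # Collect one tuple per address that Zillow can resolve
    l1=[]
    for address in addresses:
        data=getaddressdata(address,city)
        if data!=None: l1.append(data)
    return l1

# e.g. getpricelist(['2114 Bigelow Ave N'],'Seattle,WA')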
A home price dataset

House  Zip    Bathrooms  Bedrooms  Type     Built  Price
A      02138  1.50       2         Single   1847   505296
B      02139  3.50       9         Triplex  1916   776378
C      02140  3.50       4         Duplex   1894   595027
D      02139  2.50       4         Duplex   1854   552213
E      02138  3.50       5         Duplex   1909   947528
F      02138  3.50       4         Single   1930   2107871
etc..
What can we learn?
- A made-up house’s price
- How important is Zip Code?
- What are the important attributes?
- Can we do better than averages?
Introducing Regression Trees

A    B       Value
18   Circle  6
22   Square  8
11   Square  22
10   Circle  20
Minimizing deviation

- Standard deviation is the “spread” of results
- Try all possible divisions
- Choose the division that decreases deviation the most

A    B       Value
18   Circle  6
22   Square  8
11   Square  22
10   Circle  20

Initially: Average = 14, Standard Deviation = 8.2

B = Circle: Average = 13, Standard Deviation = 9.9
B = Square: Average = 15, Standard Deviation = 9.9

A > 20:  Average = 8,  Standard Deviation = 0
A <= 20: Average = 16, Standard Deviation = 8.7

A > 11:  Average = 7,  Standard Deviation = 1.4
A <= 11: Average = 21, Standard Deviation = 1.4

Splitting on A > 11 decreases the deviation the most.
Python Code

def variance(rows):
    if len(rows)==0: return 0
    data=[float(row[len(row)-1]) for row in rows]
    mean=sum(data)/len(data)
    variance=sum([(d-mean)**2 for d in data])/len(data)
    return variance

def divideset(rows,column,value):
    # Make a function that tells us if a row is in
    # the first group (true) or the second group (false)
    split_function=None
    if isinstance(value,int) or isinstance(value,float):
        split_function=lambda row:row[column]>=value
    else:
        split_function=lambda row:row[column]==value

    # Divide the rows into two sets and return them
    set1=[row for row in rows if split_function(row)]
    set2=[row for row in rows if not split_function(row)]
    return (set1,set2)
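Run against the toy table above, these functions reproduce the splits (a quick check; note that variance() returns the population variance, while the slides quote sample standard deviations such as 1.4 and 8.2):

rows=[[18,'Circle',6],[22,'Square',8],[11,'Square',22],[10,'Circle',20]]
print variance(rows)                  # 50.0 for the whole table
set1,set2=divideset(rows,0,18)        # numeric column, so the test is A>=18
print variance(set1),variance(set2)   # 1.0 1.0 -- the low-spread A split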
CART Algorithm

A    B       Value
18   Circle  6
22   Square  8
11   Square  22
10   Circle  20

After the best split (A > 11), the rows divide into:

A    B       Value
18   Circle  6
22   Square  8

A    B       Value
11   Square  22
10   Circle  20
Python Code

def buildtree(rows,scoref=variance):
    if len(rows)==0: return decisionnode()
    current_score=scoref(rows)

    # Set up some variables to track the best criteria
    best_gain=0.0
    best_criteria=None
    best_sets=None

    column_count=len(rows[0])-1
    for col in range(0,column_count):
        # Generate the list of different values in this column
        column_values={}
        for row in rows:
            column_values[row[col]]=1

        # Now try dividing the rows up for each value in this column
        for value in column_values.keys():
            (set1,set2)=divideset(rows,col,value)

            # Information gain
            p=float(len(set1))/len(rows)
            gain=current_score-p*scoref(set1)-(1-p)*scoref(set2)
            if gain>best_gain and len(set1)>0 and len(set2)>0:
                best_gain=gain
                best_criteria=(col,value)
                best_sets=(set1,set2)

    # Create the sub branches
    if best_gain>0:
        trueBranch=buildtree(best_sets[0])
        falseBranch=buildtree(best_sets[1])
        return decisionnode(col=best_criteria[0],value=best_criteria[1],
                            tb=trueBranch,fb=falseBranch)
    else:
        return decisionnode(results=uniquecounts(rows))
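buildtree() refers to decisionnode and uniquecounts, which never appear on the slides; minimal versions consistent with how they are used would be:

class decisionnode:
    def __init__(self,col=-1,value=None,results=None,tb=None,fb=None):
        self.col=col          # column index tested at this node
        self.value=value      # value the column must match or exceed
        self.results=results  # for leaves: dict of outcome counts
        self.tb=tb            # subtree where the test is true
        self.fb=fb            # subtree where the test is false

def uniquecounts(rows):
    # Count occurrences of each result (the last column)
    results={}
    for row in rows:
        r=row[len(row)-1]
        results.setdefault(r,0)
        results[r]+=1
    return results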
Zillow Results

The fitted tree splits first on Bathrooms > 3; further down it asks questions such as: Zip 02139? Built after 1903? Triplex? Duplex? Bedrooms > 4? Zip 02140?
Just for Fun… Hot or Not
Supervised and Unsupervised
Regression trees are supervised:
- “Answers” are in the dataset
- Tree models predict answers

Some methods are unsupervised:
- There are no answers
- Methods just characterize the data
- Show interesting patterns
Next challenge - Bloggers
- Millions of blogs online
- Usually focus on a subject area
- Can they be characterized automatically?
- … using only the words in the posts?
The Technorati Top 100
A single blog
Getting the content
- Use Mark Pilgrim’s Universal Feed Parser
- Retrieve the post titles and text
- Split up the words
- Count occurrence of each word
Python Code

import feedparser
import re

# Returns title and dictionary of word counts for an RSS feed
def getwordcounts(url):
    # Parse the feed
    d=feedparser.parse(url)
    wc={}

    # Loop over all the entries
    for e in d.entries:
        if 'summary' in e: summary=e.summary
        else: summary=e.description

        # Extract a list of words
        words=getwords(e.title+' '+summary)
        for word in words:
            wc.setdefault(word,0)
            wc[word]+=1
    return d.feed.title,wc

def getwords(html):
    # Remove all the HTML tags
    txt=re.compile(r'<[^>]+>').sub('',html)

    # Split words by all non-alpha characters
    words=re.compile(r'[^A-Z^a-z]+').split(txt)

    # Convert to lowercase
    return [word.lower() for word in words if word!='']
Building a Word Matrix
- Build a matrix of word counts
- Blogs are rows, words are columns
- Eliminate words that are:
  - Too common
  - Too rare
Python Code

apcount={}
wordcounts={}
feedlist=[line for line in file('feedlist.txt')]
for feedurl in feedlist:
    title,wc=getwordcounts(feedurl)
    wordcounts[title]=wc
    for word,count in wc.items():
        apcount.setdefault(word,0)
        if count>1:
            apcount[word]+=1

wordlist=[]
for w,bc in apcount.items():
    frac=float(bc)/len(feedlist)
    if frac>0.1 and frac<0.5: wordlist.append(w)

out=file('blogdata.txt','w')
out.write('Blog')
for word in wordlist: out.write('\t%s' % word)
out.write('\n')
for blog,wc in wordcounts.items():
    out.write(blog)
    for word in wordlist:
        if word in wc: out.write('\t%d' % wc[word])
        else: out.write('\t0')
    out.write('\n')
The Word Matrix

Blog               “china”  “kids”  “music”  “yahoo”
Gothamist             0       3        3        0
GigaOM                6       0        1        2
Quick Online Tips     0       2        2       12
Determining distance

Euclidean, “as the crow flies” — e.g. between GigaOM and Quick Online Tips:

sqrt((6-0)^2 + (0-2)^2 + (1-2)^2 + (2-12)^2) = 12 (approx)
Other Distance Metrics
- Manhattan
- Tanimoto
- Pearson Correlation
- Chebychev
- Spearman
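The clustering code below defaults to a Pearson-correlation distance. It isn't shown in the transcript; a standard version (1 minus the correlation, so similar vectors are “close”) is:

from math import sqrt

def pearson(v1,v2):
    sum1=sum(v1); sum2=sum(v2)
    sum1Sq=sum([pow(v,2) for v in v1])
    sum2Sq=sum([pow(v,2) for v in v2])
    pSum=sum([v1[i]*v2[i] for i in range(len(v1))])

    # Pearson correlation coefficient
    num=pSum-(sum1*sum2/len(v1))
    den=sqrt((sum1Sq-pow(sum1,2)/len(v1))*(sum2Sq-pow(sum2,2)/len(v1)))
    if den==0: return 0
    return 1.0-num/den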
Hierarchical Clustering
- Find the two closest items
- Combine them into a single item
- Repeat…
Hierarchical Algorithm
Dendrogram
Python Code

class bicluster:
    def __init__(self,vec,left=None,right=None,distance=0.0,id=None):
        self.left=left
        self.right=right
        self.vec=vec
        self.id=id
        self.distance=distance

def hcluster(rows,distance=pearson):
    distances={}
    currentclustid=-1

    # Clusters are initially just the rows
    clust=[bicluster(rows[i],id=i) for i in range(len(rows))]

    while len(clust)>1:
        lowestpair=(0,1)
        closest=distance(clust[0].vec,clust[1].vec)

        # Loop through every pair looking for the smallest distance
        for i in range(len(clust)):
            for j in range(i+1,len(clust)):
                # distances is the cache of distance calculations
                if (clust[i].id,clust[j].id) not in distances:
                    distances[(clust[i].id,clust[j].id)]=distance(clust[i].vec,clust[j].vec)
                d=distances[(clust[i].id,clust[j].id)]
                if d<closest:
                    closest=d
                    lowestpair=(i,j)

        # Calculate the average of the two clusters
        mergevec=[(clust[lowestpair[0]].vec[i]+clust[lowestpair[1]].vec[i])/2.0
                  for i in range(len(clust[0].vec))]

        # Create the new cluster
        newcluster=bicluster(mergevec,left=clust[lowestpair[0]],
                             right=clust[lowestpair[1]],
                             distance=closest,id=currentclustid)

        # Cluster ids that weren't in the original set are negative
        currentclustid-=1
        del clust[lowestpair[1]]
        del clust[lowestpair[0]]
        clust.append(newcluster)

    return clust[0]
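The returned tree can be inspected with a simple text traversal (a sketch, not from the slides; labels is the list of blog names):

def printclust(clust,labels=None,n=0):
    print ' '*n,                 # indent to make the hierarchy visible
    if clust.id<0:
        print '-'                # negative id means a merged branch
    else:
        if labels==None: print clust.id
        else: print labels[clust.id]
    if clust.left!=None: printclust(clust.left,labels=labels,n=n+1)
    if clust.right!=None: printclust(clust.right,labels=labels,n=n+1)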
Hierarchical Blog Clusters
Rotating the Matrix
Words in a blog -> blogs containing each word
Word      Quick Online Tips  GigaOM  Gothamist
“yahoo”          12             2        0
“music”           2             1        3
“kids”            2             0        3
“china”           0             6        0
Hierarchical Word Clusters
K-Means Clustering
- Divides data into distinct clusters
- User determines how many
- Algorithm:
  - Start with arbitrary centroids
  - Assign points to centroids
  - Move the centroids
  - Repeat
K-Means Algorithm
Python Code

import random

def kcluster(rows,distance=pearson,k=4):
    # Determine the minimum and maximum values for each point
    ranges=[(min([row[i] for row in rows]),max([row[i] for row in rows]))
            for i in range(len(rows[0]))]

    # Create k randomly placed centroids
    clusters=[[random.random()*(ranges[i][1]-ranges[i][0])+ranges[i][0]
               for i in range(len(rows[0]))] for j in range(k)]

    lastmatches=None
    for t in range(100):
        print 'Iteration %d' % t
        bestmatches=[[] for i in range(k)]

        # Find which centroid is the closest for each row
        for j in range(len(rows)):
            row=rows[j]
            bestmatch=0
            for i in range(k):
                d=distance(clusters[i],row)
                if d<distance(clusters[bestmatch],row): bestmatch=i
            bestmatches[bestmatch].append(j)

        # If the results are the same as last time, this is complete
        if bestmatches==lastmatches: break
        lastmatches=bestmatches

        # Move the centroids to the average of their members
        for i in range(k):
            avgs=[0.0]*len(rows[0])
            if len(bestmatches[i])>0:
                for rowid in bestmatches[i]:
                    for m in range(len(rows[rowid])):
                        avgs[m]+=rows[rowid][m]
                for j in range(len(avgs)):
                    avgs[j]/=len(bestmatches[i])
                clusters[i]=avgs

    return bestmatches
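The results on the next slide assume the word matrix was read back from blogdata.txt into rownames and data; a minimal loader (an assumption, not shown on the slides):

def readfile(filename):
    lines=[line for line in file(filename)]
    colnames=lines[0].strip().split('\t')[1:]    # first line: the words
    rownames=[]; data=[]
    for line in lines[1:]:
        p=line.strip().split('\t')
        rownames.append(p[0])                    # first column: blog name
        data.append([float(x) for x in p[1:]])
    return rownames,colnames,data

rownames,colnames,data=readfile('blogdata.txt')
k=kcluster(data,k=10)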
K-Means Results
>> [rownames[r] for r in k[0]]
['The Viral Garden', 'Copyblogger', 'Creating Passionate Users', 'Oilman', 'ProBlogger Blog Tips', "Seth's Blog"]

>> [rownames[r] for r in k[1]]
['Wonkette', 'Gawker', 'Gothamist', 'Huffington Post']
2D Visualizations
- Instead of clusters, a 2D map
- Goals:
  - Preserve distances as much as possible
  - Draw in two dimensions
- Dimension reduction:
  - Principal Components Analysis
  - Multidimensional Scaling
Multidimensional Scaling
import random
from math import sqrt

def scaledown(data,distance=pearson,rate=0.01):
    n=len(data)

    # The real distances between every pair of items
    realdist=[[distance(data[i],data[j]) for j in range(n)]
              for i in range(0,n)]

    # Randomly initialize the starting points of the locations in 2D
    loc=[[random.random(),random.random()] for i in range(n)]
    fakedist=[[0.0 for j in range(n)] for i in range(n)]

    lasterror=None
    for m in range(0,1000):
        # Find projected distances
        for i in range(n):
            for j in range(n):
                fakedist[i][j]=sqrt(sum([pow(loc[i][x]-loc[j][x],2)
                                         for x in range(len(loc[i]))]))

        # Move points
        grad=[[0.0,0.0] for i in range(n)]

        totalerror=0
        for k in range(n):
            for j in range(n):
                if j==k: continue
                # The error is percent difference between the distances
                errorterm=(fakedist[j][k]-realdist[j][k])/realdist[j][k]

                # Each point needs to be moved away from or towards the other
                # point in proportion to how much error it has
                grad[k][0]+=((loc[k][0]-loc[j][0])/fakedist[j][k])*errorterm
                grad[k][1]+=((loc[k][1]-loc[j][1])/fakedist[j][k])*errorterm

                # Keep track of the total error
                totalerror+=abs(errorterm)
        print totalerror

        # If the answer got worse by moving the points, we are done
        if lasterror and lasterror<totalerror: break
        lasterror=totalerror

        # Move each of the points by the learning rate times the gradient
        for k in range(n):
            loc[k][0]-=rate*grad[k][0]
            loc[k][1]-=rate*grad[k][1]

    return loc
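One way to view the result is to scale the coordinates onto an image and draw each blog name at its location; a sketch using the Python Imaging Library (not from the slides):

from PIL import Image,ImageDraw

def draw2d(data,labels,jpeg='mds2d.jpg'):
    img=Image.new('RGB',(2000,2000),(255,255,255))
    draw=ImageDraw.Draw(img)
    for i in range(len(data)):
        # shift and scale the roughly unit-sized coordinates onto the canvas
        x=(data[i][0]+0.5)*1000
        y=(data[i][1]+0.5)*1000
        draw.text((x,y),labels[i],(0,0,0))
    img.save(jpeg,'JPEG')

coords=scaledown(data)
draw2d(coords,rownames,jpeg='blogs2d.jpg')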
Numerical Predictions
- Back to “supervised” learning
- We have a set of numerical attributes:
  - Specs for a laptop
  - Age and rating for wine
  - Ratios for a stock
- Want to predict another attribute, e.g. price
- Formula/model is unknown
Regression Trees?
- Regression trees find hard boundaries
- Can’t deal with complex formulae
Statistical regression
- Requires specification of a model
- Usually linear
- Doesn’t handle context
Alternative - Interpolation
- Find “similar” items
- Guess price based on similar items
- Need to determine:
  - What is similar?
  - How should we aggregate prices?
Price Data from eBay
The eBay API
- XML API
- Send XML over HTTPS
- Receive results in XML
http://developer.ebay.com/quickstartguide.
Some Python Code

import httplib

# serverUrl, devKey, appKey, certKey and userToken come from your
# eBay developer account
def sendRequest(apicall,xmlparameters):
    connection = httplib.HTTPSConnection(serverUrl)
    connection.request("POST", '/ws/api.dll', xmlparameters, getHeaders(apicall))
    response = connection.getresponse()
    data = None
    if response.status != 200:
        print "Error sending request:" + response.reason
    else:
        data = response.read()
    connection.close()
    return data

def getHeaders(apicall,siteID="0",compatabilityLevel="433"):
    headers = {"X-EBAY-API-COMPATIBILITY-LEVEL": compatabilityLevel,
               "X-EBAY-API-DEV-NAME": devKey,
               "X-EBAY-API-APP-NAME": appKey,
               "X-EBAY-API-CERT-NAME": certKey,
               "X-EBAY-API-CALL-NAME": apicall,
               "X-EBAY-API-SITEID": siteID,
               "Content-Type": "text/xml"}
    return headers
Some Python Code

from xml.dom.minidom import parseString

def getItem(itemID):
    xml = "<?xml version='1.0' encoding='utf-8'?>"+\
          "<GetItemRequest xmlns=\"urn:ebay:apis:eBLBaseComponents\">"+\
          "<RequesterCredentials><eBayAuthToken>" +\
          userToken +\
          "</eBayAuthToken></RequesterCredentials>" + \
          "<ItemID>" + str(itemID) + "</ItemID>"+\
          "<DetailLevel>ItemReturnAttributes</DetailLevel>"+\
          "</GetItemRequest>"
    data=sendRequest('GetItem',xml)
    result={}
    response=parseString(data)
    result['title']=getSingleValue(response,'Title')
    sellingStatusNode = response.getElementsByTagName('SellingStatus')[0]
    result['price']=getSingleValue(sellingStatusNode,'CurrentPrice')
    result['bids']=getSingleValue(sellingStatusNode,'BidCount')
    seller = response.getElementsByTagName('Seller')
    result['feedback'] = getSingleValue(seller[0],'FeedbackScore')
    attributeSet=response.getElementsByTagName('Attribute')
    attributes={}
    for att in attributeSet:
        attID=att.attributes.getNamedItem('attributeID').nodeValue
        attValue=getSingleValue(att,'ValueLiteral')
        attributes[attID]=attValue
    result['attributes']=attributes
    return result
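getItem() leans on a small helper, getSingleValue, that doesn't appear on the slides; a minimal version consistent with its use:

def getSingleValue(node,tag):
    # Return the text of the first matching child tag, or '-1'
    nl=node.getElementsByTagName(tag)
    if len(nl)>0:
        tagNode=nl[0]
        if tagNode.hasChildNodes():
            return tagNode.firstChild.nodeValue
    return '-1'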
Building an item table

Laptop     RAM   CPU   HDD  Screen  DVD  Price
Pavillion  1024  1600  120    17     1   $800
T22         256   900   20    14     1   $200
Lenovo      160   300    5    13     0    $80
D600        512  1400   40    14     1   $350
etc..
Distance between items

Laptop  RAM  CPU   HDD  Screen  DVD  Price
New     512  1400   40    14     1   ???
T22     256   900   20    14     1   $200

Euclidean, just like in clustering:

sqrt((512-256)^2 + (1400-900)^2 + (40-20)^2 + (14-14)^2 + (1-1)^2)
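The same Euclidean distance as a function (it is used again by getdistances below; a sketch):

from math import sqrt

def euclidean(v1,v2):
    d=0.0
    for i in range(len(v1)):
        d+=(v1[i]-v2[i])**2
    return sqrt(d)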
Idea 1 – use the closest item
With the item whose price I want to guess:
- Calculate the distance for every item in my dataset
- Guess that the price is the same as the closest
This is called kNN with k=1
Problems with “outliers”
The closest item may be anomalous. Why?
- Exceptional deal that won’t occur again
- Something missing from the dataset
- Data errors
Using an average

Laptop  RAM   CPU   HDD  Screen  DVD  Price
New      512  1400   40    14     1   ???
No. 1    512  1400   30    13     1   $360
No. 2    512  1400   60    14     1   $400
No. 3   1024  1600  120    15     0   $325

k=3, estimate = $361
Using a weighted average

Laptop  RAM   CPU   HDD  Screen  DVD  Weight  Price
New      512  1400   40    14     1    -      ???
No. 1    512  1400   30    13     1    3      $360
No. 2    512  1400   60    14     1    2      $400
No. 3   1024  1600  120    15     0    1      $325

Estimate = (3 x $360 + 2 x $400 + 1 x $325) / (3 + 2 + 1) = $367 (approx)
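weightedknn() below defaults to a Gaussian weighting function that never appears in the transcript; one common form (sigma is a tunable assumption):

import math

def gaussian(dist,sigma=10.0):
    # weight falls off smoothly with distance and never reaches zero
    return math.e**(-dist**2/(2*sigma**2))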
Python code

def weightedknn(data,vec1,k=5,weightf=gaussian):
    # Get distances
    dlist=getdistances(data,vec1)
    avg=0.0
    totalweight=0.0

    # Get weighted average
    for i in range(k):
        dist=dlist[i][0]
        idx=dlist[i][1]
        weight=weightf(dist)
        avg+=weight*data[idx]['result']
        totalweight+=weight
    avg=avg/totalweight
    return avg

def getdistances(data,vec1):
    distancelist=[]
    for i in range(len(data)):
        vec2=data[i]['input']
        distancelist.append((euclidean(vec1,vec2),i))
    distancelist.sort()
    return distancelist
Too few – k too low
Too many – k too high
Determining the best k
- Divide the dataset up:
  - Training set
  - Test set
- Guess the prices for the test set using the training set
- See how good the guesses are for different values of k
- Known as “cross-validation”
Determining the best k

Training set:

Attribute  Price
6          0
8          10
11         30

Test set:

Attribute  Price
10         20

For k = 1, guess = 30, error = 10
For k = 2, guess = 20, error = 0
For k = 3, guess = 13, error = 7

Repeat with different test sets, average the error
Python code

from random import random

def dividedata(data,test=0.05):
    trainset=[]
    testset=[]
    for row in data:
        if random()<test:
            testset.append(row)
        else:
            trainset.append(row)
    return trainset,testset

def testalgorithm(algf,trainset,testset):
    error=0.0
    for row in testset:
        guess=algf(trainset,row['input'])
        error+=(row['result']-guess)**2
    return error/len(testset)

def crossvalidate(algf,data,trials=100,test=0.05):
    error=0.0
    for i in range(trials):
        trainset,testset=dividedata(data,test)
        error+=testalgorithm(algf,trainset,testset)
    return error/trials
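crossvalidate() can now compare settings, e.g. different values of k for the weighted kNN estimator; the wrapper functions here are illustrative:

def wknn3(d,v): return weightedknn(d,v,k=3)
def wknn5(d,v): return weightedknn(d,v,k=5)

print crossvalidate(wknn3,data)   # lower average squared error is better
print crossvalidate(wknn5,data)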
Problems with scale
Scaling the data
Scaling to zero
Determining the best scale
- Try different weights
- Use the “cross-validation” method
- Different ways of choosing a scale (see the sketch below):
  - Range-scaling
  - Intuitive guessing
  - Optimization
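Range-scaling multiplies each attribute by its own weight before distances are computed, so cross-validation (or an optimizer) can tune the weights; a sketch, assuming rows shaped like {'input':...,'result':...}:

def rescale(data,scale):
    scaleddata=[]
    for row in data:
        scaled=[scale[i]*row['input'][i] for i in range(len(scale))]
        scaleddata.append({'input':scaled,'result':row['result']})
    return scaleddata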
Methods covered
- Regression trees
- Hierarchical clustering
- k-means clustering
- Multidimensional scaling
- Weighted k-nearest neighbors
New Projects
Openads
- An open-source ad server
- Users can share impression/click data
- Matrix of what hits, based on:
  - Page text
  - Ad
  - Ad placement
  - Search query
- Can we improve targeting?
New Projects
Finance
- Analysts already drowning in info
- Stories sometimes broken on blogs
- Message boards show sentiment
- Extremely low signal-to-noise ratio
New Projects
Entertainment
- How much buzz is a movie generating?
- Which psychographic profiles like this type of movie?
- Of interest to studios and media investors