Open Bank Project at APIDays Open Banking and Fintech APIs 2015
Data Mining and Open APIs
Toby Segaran
About Me
- Software Developer at Genstruct
- Work directly with scientists
- Design algorithms to aid in drug testing
- “Programming Collective Intelligence”, published by O’Reilly, due out in August
- Consult with open-source projects and other companies
- http://kiwitobes.com
Presentation Goals
- Look at some Open APIs
- Get some data
- Visualize algorithms for data-mining
- Work through some Python code
- Variety of techniques and sources
- Advocacy (why you should care)
Open data APIs
- Zillow
- eBay
- Facebook
- del.icio.us
- HotOrNot
- Upcoming
- Yahoo Answers
- Amazon
- Technorati
- Twitter
- Google News
programmableweb.com/apis for more…
Open API uses
- Mashups
- Integration
- Automation
- Command-line tools
- Most importantly, creating datasets!
What is data mining?
From a large dataset, find the:
- Implicit
- Unknown
- Useful

Data could be:
- Tabular, e.g. price lists
- Free text
- Pictures
Why it’s important now
- More devices produce more data
- People share more data
- The internet is vast
- Products are more customized
- Advertising is targeted
- Human cognition is limited
Traditional Applications
- Computational Biology
- Financial Markets
- Retail Markets
- Fraud Detection
- Surveillance
- Supply Chain Optimization
- National Security
Traditional = Inaccessible
- Real applications are esoteric
- Tutorial examples are trivial
- Generally lacking in “interest value”
Fun, Accessible Applications
- Home price modeling
- Where are the hottest people?
- Which bloggers are similar?
- Important attributes on eBay
- Predicting fashion trends
- Movie popularity
Zillow
The Zillow API
- Allows querying by address
- Returns information about the property:
  - Bedrooms
  - Bathrooms
  - Zip Code
  - Price Estimate
  - Last Sale Price
- Requires a registration key: http://www.zillow.com/howto/api/PropertyDetailsAPIOverview.htm
The Zillow API
REST Request
http://www.zillow.com/webservice/GetDeepSearchResults.htm?zws-id=key&address=address&citystatezip=citystatezip
The Zillow API

<SearchResults:searchresults xmlns:SearchResults="http://www.zillow.com/vstatic/3/static/xsd/SearchResults.xsd">
  …
  <response><results><result>
    <zpid>48749425</zpid>
    <links>…</links>
    <address>
      <street>2114 Bigelow Ave N</street>
      <zipcode>98109</zipcode>
      <city>SEATTLE</city>
      <state>WA</state>
      <latitude>47.637934</latitude>
      <longitude>-122.347936</longitude>
    </address>
    <yearBuilt>1924</yearBuilt>
    <lotSizeSqFt>4680</lotSizeSqFt>
    <finishedSqFt>3290</finishedSqFt>
    <bathrooms>2.75</bathrooms>
    <bedrooms>4</bedrooms>
    <lastSoldDate>06/18/2002</lastSoldDate>
    <lastSoldPrice currency="USD">770000</lastSoldPrice>
    <valuation><amount currency="USD">1091061</amount></valuation>
  </result></results></response>
Zillow from Python

import urllib2
import xml.dom.minidom

def getaddressdata(address,city):
    escad=address.replace(' ','+')

    # Construct the URL
    url='http://www.zillow.com/webservice/GetDeepSearchResults.htm?'
    url+='zws-id=%s&address=%s&citystatezip=%s' % (zwskey,escad,city)

    # Parse resulting XML
    doc=xml.dom.minidom.parseString(urllib2.urlopen(url).read())
    code=doc.getElementsByTagName('code')[0].firstChild.data

    # Code 0 means success, otherwise there was an error
    if code!='0': return None

    # Extract the info about this property
    try:
        zipcode=doc.getElementsByTagName('zipcode')[0].firstChild.data
        use=doc.getElementsByTagName('useCode')[0].firstChild.data
        year=doc.getElementsByTagName('yearBuilt')[0].firstChild.data
        bath=doc.getElementsByTagName('bathrooms')[0].firstChild.data
        bed=doc.getElementsByTagName('bedrooms')[0].firstChild.data
        rooms=doc.getElementsByTagName('totalRooms')[0].firstChild.data
        price=doc.getElementsByTagName('amount')[0].firstChild.data
    except:
        return None

    return (zipcode,use,int(year),float(bath),int(bed),int(rooms),price)
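The slides don't show how getaddressdata is driven; a minimal driver sketch (the key and the helper name are placeholders, not from the talk):

zwskey='YOUR-ZWS-ID'   # placeholder: your Zillow registration key

def getpricelist(addresses,city):
    # Collect one tuple per address that Zillow can resolve
    l1=[]
    for address in addresses:
        data=getaddressdata(address,city)
        if data!=None: l1.append(data)
    return l1

# e.g. getpricelist(['2114 Bigelow Ave N'],'Seattle,WA')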
A home price dataset

House  Zip    Bathrooms  Bedrooms  Type     Built  Price
A      02138  1.50       2         Single   1847   505296
B      02139  3.50       9         Triplex  1916   776378
C      02140  3.50       4         Duplex   1894   595027
D      02139  2.50       4         Duplex   1854   552213
E      02138  3.50       5         Duplex   1909   947528
F      02138  3.50       4         Single   1930   2107871
etc..
What can we learn?
- A made-up house’s price
- How important is Zip Code?
- What are the important attributes?
- Can we do better than averages?
Introducing Regression Trees

A    B       Value
18   Circle  6
22   Square  8
11   Square  22
10   Circle  20
Minimizing deviation

- Standard deviation is the “spread” of results
- Try all possible divisions
- Choose the division that decreases deviation the most

A    B       Value
18   Circle  6
22   Square  8
11   Square  22
10   Circle  20

Initially: Average = 14, Standard Deviation = 8.2

B = Circle: Average = 13, Standard Deviation = 9.9
B = Square: Average = 15, Standard Deviation = 9.9

A > 20:  Average = 8,  Standard Deviation = 0
A <= 20: Average = 16, Standard Deviation = 8.7

A > 11:  Average = 7,  Standard Deviation = 1.4
A <= 11: Average = 21, Standard Deviation = 1.4

Splitting on A > 11 decreases the deviation the most.
Python Code

def variance(rows):
    if len(rows)==0: return 0
    data=[float(row[len(row)-1]) for row in rows]
    mean=sum(data)/len(data)
    variance=sum([(d-mean)**2 for d in data])/len(data)
    return variance

def divideset(rows,column,value):
    # Make a function that tells us if a row is in
    # the first group (true) or the second group (false)
    split_function=None
    if isinstance(value,int) or isinstance(value,float):
        split_function=lambda row:row[column]>=value
    else:
        split_function=lambda row:row[column]==value

    # Divide the rows into two sets and return them
    set1=[row for row in rows if split_function(row)]
    set2=[row for row in rows if not split_function(row)]
    return (set1,set2)
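Run against the toy table above, these functions reproduce the splits (a quick check; note that variance() returns the population variance, while the slides quote sample standard deviations such as 1.4 and 8.2):

rows=[[18,'Circle',6],[22,'Square',8],[11,'Square',22],[10,'Circle',20]]
print variance(rows)                  # 50.0 for the whole table
set1,set2=divideset(rows,0,18)        # numeric column, so the test is A>=18
print variance(set1),variance(set2)   # 1.0 1.0 -- the low-spread A split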
CART Algorithm

A    B       Value
18   Circle  6
22   Square  8
11   Square  22
10   Circle  20

After the best split (A > 11), the rows divide into:

A    B       Value
18   Circle  6
22   Square  8

A    B       Value
11   Square  22
10   Circle  20
Python Code

def buildtree(rows,scoref=variance):
    if len(rows)==0: return decisionnode()
    current_score=scoref(rows)

    # Set up some variables to track the best criteria
    best_gain=0.0
    best_criteria=None
    best_sets=None

    column_count=len(rows[0])-1
    for col in range(0,column_count):
        # Generate the list of different values in this column
        column_values={}
        for row in rows:
            column_values[row[col]]=1

        # Now try dividing the rows up for each value in this column
        for value in column_values.keys():
            (set1,set2)=divideset(rows,col,value)

            # Information gain
            p=float(len(set1))/len(rows)
            gain=current_score-p*scoref(set1)-(1-p)*scoref(set2)
            if gain>best_gain and len(set1)>0 and len(set2)>0:
                best_gain=gain
                best_criteria=(col,value)
                best_sets=(set1,set2)

    # Create the sub branches
    if best_gain>0:
        trueBranch=buildtree(best_sets[0])
        falseBranch=buildtree(best_sets[1])
        return decisionnode(col=best_criteria[0],value=best_criteria[1],
                            tb=trueBranch,fb=falseBranch)
    else:
        return decisionnode(results=uniquecounts(rows))
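buildtree() refers to decisionnode and uniquecounts, which never appear on the slides; minimal versions consistent with how they are used would be:

class decisionnode:
    def __init__(self,col=-1,value=None,results=None,tb=None,fb=None):
        self.col=col          # column index tested at this node
        self.value=value      # value the column must match or exceed
        self.results=results  # for leaves: dict of outcome counts
        self.tb=tb            # subtree where the test is true
        self.fb=fb            # subtree where the test is false

def uniquecounts(rows):
    # Count occurrences of each result (the last column)
    results={}
    for row in rows:
        r=row[len(row)-1]
        results.setdefault(r,0)
        results[r]+=1
    return results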
Zillow Results

The fitted tree splits first on Bathrooms > 3; further down it asks questions such as: Zip 02139? Built after 1903? Triplex? Duplex? Bedrooms > 4? Zip 02140?
Just for Fun… Hot or Not
Supervised and Unsupervised
Regression trees are supervised:
- “Answers” are in the dataset
- Tree models predict answers

Some methods are unsupervised:
- There are no answers
- Methods just characterize the data
- Show interesting patterns
Next challenge - Bloggers
- Millions of blogs online
- Usually focus on a subject area
- Can they be characterized automatically?
- … using only the words in the posts?
The Technorati Top 100
A single blog
Getting the content
- Use Mark Pilgrim’s Universal Feed Parser
- Retrieve the post titles and text
- Split up the words
- Count occurrence of each word
Python Code

import feedparser
import re

# Returns title and dictionary of word counts for an RSS feed
def getwordcounts(url):
    # Parse the feed
    d=feedparser.parse(url)
    wc={}

    # Loop over all the entries
    for e in d.entries:
        if 'summary' in e: summary=e.summary
        else: summary=e.description

        # Extract a list of words
        words=getwords(e.title+' '+summary)
        for word in words:
            wc.setdefault(word,0)
            wc[word]+=1
    return d.feed.title,wc

def getwords(html):
    # Remove all the HTML tags
    txt=re.compile(r'<[^>]+>').sub('',html)

    # Split words by all non-alpha characters
    words=re.compile(r'[^A-Z^a-z]+').split(txt)

    # Convert to lowercase
    return [word.lower() for word in words if word!='']
Building a Word Matrix
- Build a matrix of word counts
- Blogs are rows, words are columns
- Eliminate words that are:
  - Too common
  - Too rare
Python Code

apcount={}
wordcounts={}
feedlist=[line for line in file('feedlist.txt')]
for feedurl in feedlist:
    title,wc=getwordcounts(feedurl)
    wordcounts[title]=wc
    for word,count in wc.items():
        apcount.setdefault(word,0)
        if count>1:
            apcount[word]+=1

wordlist=[]
for w,bc in apcount.items():
    frac=float(bc)/len(feedlist)
    if frac>0.1 and frac<0.5: wordlist.append(w)

out=file('blogdata.txt','w')
out.write('Blog')
for word in wordlist: out.write('\t%s' % word)
out.write('\n')
for blog,wc in wordcounts.items():
    out.write(blog)
    for word in wordlist:
        if word in wc: out.write('\t%d' % wc[word])
        else: out.write('\t0')
    out.write('\n')
The Word Matrix

Blog               “china”  “kids”  “music”  “yahoo”
Gothamist             0       3        3        0
GigaOM                6       0        1        2
Quick Online Tips     0       2        2       12
Determining distance

Euclidean, “as the crow flies” — e.g. between GigaOM and Quick Online Tips:

sqrt((6-0)^2 + (0-2)^2 + (1-2)^2 + (2-12)^2) = 12 (approx)
Other Distance Metrics
- Manhattan
- Tanimoto
- Pearson Correlation
- Chebychev
- Spearman
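The clustering code below defaults to a Pearson-correlation distance. It isn't shown in the transcript; a standard version (1 minus the correlation, so similar vectors are “close”) is:

from math import sqrt

def pearson(v1,v2):
    sum1=sum(v1); sum2=sum(v2)
    sum1Sq=sum([pow(v,2) for v in v1])
    sum2Sq=sum([pow(v,2) for v in v2])
    pSum=sum([v1[i]*v2[i] for i in range(len(v1))])

    # Pearson correlation coefficient
    num=pSum-(sum1*sum2/len(v1))
    den=sqrt((sum1Sq-pow(sum1,2)/len(v1))*(sum2Sq-pow(sum2,2)/len(v1)))
    if den==0: return 0
    return 1.0-num/den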
Hierarchical Clustering
- Find the two closest items
- Combine them into a single item
- Repeat…
Hierarchical Algorithm
Dendrogram
Python Code

class bicluster:
    def __init__(self,vec,left=None,right=None,distance=0.0,id=None):
        self.left=left
        self.right=right
        self.vec=vec
        self.id=id
        self.distance=distance

def hcluster(rows,distance=pearson):
    distances={}
    currentclustid=-1

    # Clusters are initially just the rows
    clust=[bicluster(rows[i],id=i) for i in range(len(rows))]

    while len(clust)>1:
        lowestpair=(0,1)
        closest=distance(clust[0].vec,clust[1].vec)

        # Loop through every pair looking for the smallest distance
        for i in range(len(clust)):
            for j in range(i+1,len(clust)):
                # distances is the cache of distance calculations
                if (clust[i].id,clust[j].id) not in distances:
                    distances[(clust[i].id,clust[j].id)]=distance(clust[i].vec,clust[j].vec)
                d=distances[(clust[i].id,clust[j].id)]
                if d<closest:
                    closest=d
                    lowestpair=(i,j)

        # Calculate the average of the two clusters
        mergevec=[(clust[lowestpair[0]].vec[i]+clust[lowestpair[1]].vec[i])/2.0
                  for i in range(len(clust[0].vec))]

        # Create the new cluster
        newcluster=bicluster(mergevec,left=clust[lowestpair[0]],
                             right=clust[lowestpair[1]],
                             distance=closest,id=currentclustid)

        # Cluster ids that weren't in the original set are negative
        currentclustid-=1
        del clust[lowestpair[1]]
        del clust[lowestpair[0]]
        clust.append(newcluster)

    return clust[0]
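The returned tree can be inspected with a simple text traversal (a sketch, not from the slides; labels is the list of blog names):

def printclust(clust,labels=None,n=0):
    print ' '*n,                 # indent to make the hierarchy visible
    if clust.id<0:
        print '-'                # negative id means a merged branch
    else:
        if labels==None: print clust.id
        else: print labels[clust.id]
    if clust.left!=None: printclust(clust.left,labels=labels,n=n+1)
    if clust.right!=None: printclust(clust.right,labels=labels,n=n+1)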
Hierarchical Blog Clusters
Rotating the Matrix
Words in a blog -> blogs containing each word
Word      Quick Online Tips  GigaOM  Gothamist
“yahoo”          12             2        0
“music”           2             1        3
“kids”            2             0        3
“china”           0             6        0
Hierarchical Word Clusters
K-Means Clustering
- Divides data into distinct clusters
- User determines how many
- Algorithm:
  - Start with arbitrary centroids
  - Assign points to centroids
  - Move the centroids
  - Repeat
K-Means Algorithm
Python Code

import random

def kcluster(rows,distance=pearson,k=4):
    # Determine the minimum and maximum values for each point
    ranges=[(min([row[i] for row in rows]),max([row[i] for row in rows]))
            for i in range(len(rows[0]))]

    # Create k randomly placed centroids
    clusters=[[random.random()*(ranges[i][1]-ranges[i][0])+ranges[i][0]
               for i in range(len(rows[0]))] for j in range(k)]

    lastmatches=None
    for t in range(100):
        print 'Iteration %d' % t
        bestmatches=[[] for i in range(k)]

        # Find which centroid is the closest for each row
        for j in range(len(rows)):
            row=rows[j]
            bestmatch=0
            for i in range(k):
                d=distance(clusters[i],row)
                if d<distance(clusters[bestmatch],row): bestmatch=i
            bestmatches[bestmatch].append(j)

        # If the results are the same as last time, this is complete
        if bestmatches==lastmatches: break
        lastmatches=bestmatches

        # Move the centroids to the average of their members
        for i in range(k):
            avgs=[0.0]*len(rows[0])
            if len(bestmatches[i])>0:
                for rowid in bestmatches[i]:
                    for m in range(len(rows[rowid])):
                        avgs[m]+=rows[rowid][m]
                for j in range(len(avgs)):
                    avgs[j]/=len(bestmatches[i])
                clusters[i]=avgs

    return bestmatches
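The results on the next slide assume the word matrix was read back from blogdata.txt into rownames and data; a minimal loader (an assumption, not shown on the slides):

def readfile(filename):
    lines=[line for line in file(filename)]
    colnames=lines[0].strip().split('\t')[1:]    # first line: the words
    rownames=[]; data=[]
    for line in lines[1:]:
        p=line.strip().split('\t')
        rownames.append(p[0])                    # first column: blog name
        data.append([float(x) for x in p[1:]])
    return rownames,colnames,data

rownames,colnames,data=readfile('blogdata.txt')
k=kcluster(data,k=10)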
K-Means Results
>> [rownames[r] for r in k[0]]
['The Viral Garden', 'Copyblogger', 'Creating Passionate Users', 'Oilman', 'ProBlogger Blog Tips', "Seth's Blog"]

>> [rownames[r] for r in k[1]]
['Wonkette', 'Gawker', 'Gothamist', 'Huffington Post']
2D Visualizations
- Instead of clusters, a 2D map
- Goals:
  - Preserve distances as much as possible
  - Draw in two dimensions
- Dimension reduction:
  - Principal Components Analysis
  - Multidimensional Scaling
Multidimensional Scaling
import random
from math import sqrt

def scaledown(data,distance=pearson,rate=0.01):
    n=len(data)

    # The real distances between every pair of items
    realdist=[[distance(data[i],data[j]) for j in range(n)]
              for i in range(0,n)]

    # Randomly initialize the starting points of the locations in 2D
    loc=[[random.random(),random.random()] for i in range(n)]
    fakedist=[[0.0 for j in range(n)] for i in range(n)]

    lasterror=None
    for m in range(0,1000):
        # Find projected distances
        for i in range(n):
            for j in range(n):
                fakedist[i][j]=sqrt(sum([pow(loc[i][x]-loc[j][x],2)
                                         for x in range(len(loc[i]))]))

        # Move points
        grad=[[0.0,0.0] for i in range(n)]

        totalerror=0
        for k in range(n):
            for j in range(n):
                if j==k: continue
                # The error is percent difference between the distances
                errorterm=(fakedist[j][k]-realdist[j][k])/realdist[j][k]

                # Each point needs to be moved away from or towards the other
                # point in proportion to how much error it has
                grad[k][0]+=((loc[k][0]-loc[j][0])/fakedist[j][k])*errorterm
                grad[k][1]+=((loc[k][1]-loc[j][1])/fakedist[j][k])*errorterm

                # Keep track of the total error
                totalerror+=abs(errorterm)
        print totalerror

        # If the answer got worse by moving the points, we are done
        if lasterror and lasterror<totalerror: break
        lasterror=totalerror

        # Move each of the points by the learning rate times the gradient
        for k in range(n):
            loc[k][0]-=rate*grad[k][0]
            loc[k][1]-=rate*grad[k][1]

    return loc
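One way to view the result is to scale the coordinates onto an image and draw each blog name at its location; a sketch using the Python Imaging Library (not from the slides):

from PIL import Image,ImageDraw

def draw2d(data,labels,jpeg='mds2d.jpg'):
    img=Image.new('RGB',(2000,2000),(255,255,255))
    draw=ImageDraw.Draw(img)
    for i in range(len(data)):
        # shift and scale the roughly unit-sized coordinates onto the canvas
        x=(data[i][0]+0.5)*1000
        y=(data[i][1]+0.5)*1000
        draw.text((x,y),labels[i],(0,0,0))
    img.save(jpeg,'JPEG')

coords=scaledown(data)
draw2d(coords,rownames,jpeg='blogs2d.jpg')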
Numerical Predictions
- Back to “supervised” learning
- We have a set of numerical attributes:
  - Specs for a laptop
  - Age and rating for wine
  - Ratios for a stock
- Want to predict another attribute, e.g. price
- Formula/model is unknown
Regression Trees?
- Regression trees find hard boundaries
- Can’t deal with complex formulae
Statistical regression
- Requires specification of a model
- Usually linear
- Doesn’t handle context
Alternative - Interpolation
- Find “similar” items
- Guess price based on similar items
- Need to determine:
  - What is similar?
  - How should we aggregate prices?
Price Data from eBay
The eBay API
- XML API
- Send XML over HTTPS
- Receive results in XML
http://developer.ebay.com/quickstartguide.
Some Python Code

import httplib

# serverUrl, devKey, appKey, certKey and userToken come from your
# eBay developer account
def sendRequest(apicall,xmlparameters):
    connection = httplib.HTTPSConnection(serverUrl)
    connection.request("POST", '/ws/api.dll', xmlparameters, getHeaders(apicall))
    response = connection.getresponse()
    data = None
    if response.status != 200:
        print "Error sending request:" + response.reason
    else:
        data = response.read()
    connection.close()
    return data

def getHeaders(apicall,siteID="0",compatabilityLevel="433"):
    headers = {"X-EBAY-API-COMPATIBILITY-LEVEL": compatabilityLevel,
               "X-EBAY-API-DEV-NAME": devKey,
               "X-EBAY-API-APP-NAME": appKey,
               "X-EBAY-API-CERT-NAME": certKey,
               "X-EBAY-API-CALL-NAME": apicall,
               "X-EBAY-API-SITEID": siteID,
               "Content-Type": "text/xml"}
    return headers
Some Python Code

from xml.dom.minidom import parseString

def getItem(itemID):
    xml = "<?xml version='1.0' encoding='utf-8'?>"+\
          "<GetItemRequest xmlns=\"urn:ebay:apis:eBLBaseComponents\">"+\
          "<RequesterCredentials><eBayAuthToken>" +\
          userToken +\
          "</eBayAuthToken></RequesterCredentials>" + \
          "<ItemID>" + str(itemID) + "</ItemID>"+\
          "<DetailLevel>ItemReturnAttributes</DetailLevel>"+\
          "</GetItemRequest>"
    data=sendRequest('GetItem',xml)
    result={}
    response=parseString(data)
    result['title']=getSingleValue(response,'Title')
    sellingStatusNode = response.getElementsByTagName('SellingStatus')[0]
    result['price']=getSingleValue(sellingStatusNode,'CurrentPrice')
    result['bids']=getSingleValue(sellingStatusNode,'BidCount')
    seller = response.getElementsByTagName('Seller')
    result['feedback'] = getSingleValue(seller[0],'FeedbackScore')
    attributeSet=response.getElementsByTagName('Attribute')
    attributes={}
    for att in attributeSet:
        attID=att.attributes.getNamedItem('attributeID').nodeValue
        attValue=getSingleValue(att,'ValueLiteral')
        attributes[attID]=attValue
    result['attributes']=attributes
    return result
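getItem() leans on a small helper, getSingleValue, that doesn't appear on the slides; a minimal version consistent with its use:

def getSingleValue(node,tag):
    # Return the text of the first matching child tag, or '-1'
    nl=node.getElementsByTagName(tag)
    if len(nl)>0:
        tagNode=nl[0]
        if tagNode.hasChildNodes():
            return tagNode.firstChild.nodeValue
    return '-1'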
Building an item table

Laptop     RAM   CPU   HDD  Screen  DVD  Price
Pavillion  1024  1600  120    17     1   $800
T22         256   900   20    14     1   $200
Lenovo      160   300    5    13     0    $80
D600        512  1400   40    14     1   $350
etc..
Distance between items

Laptop  RAM  CPU   HDD  Screen  DVD  Price
New     512  1400   40    14     1   ???
T22     256   900   20    14     1   $200

Euclidean, just like in clustering:

sqrt((512-256)^2 + (1400-900)^2 + (40-20)^2 + (14-14)^2 + (1-1)^2)
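The same Euclidean distance as a function (it is used again by getdistances below; a sketch):

from math import sqrt

def euclidean(v1,v2):
    d=0.0
    for i in range(len(v1)):
        d+=(v1[i]-v2[i])**2
    return sqrt(d)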
Idea 1 – use the closest item
With the item whose price I want to guess:
- Calculate the distance for every item in my dataset
- Guess that the price is the same as the closest
This is called kNN with k=1
Problems with “outliers”
The closest item may be anomalous. Why?
- Exceptional deal that won’t occur again
- Something missing from the dataset
- Data errors
Using an average

Laptop  RAM   CPU   HDD  Screen  DVD  Price
New      512  1400   40    14     1   ???
No. 1    512  1400   30    13     1   $360
No. 2    512  1400   60    14     1   $400
No. 3   1024  1600  120    15     0   $325

k=3, estimate = $361
Using a weighted average

Laptop  RAM   CPU   HDD  Screen  DVD  Weight  Price
New      512  1400   40    14     1    -      ???
No. 1    512  1400   30    13     1    3      $360
No. 2    512  1400   60    14     1    2      $400
No. 3   1024  1600  120    15     0    1      $325

Estimate = (3 x $360 + 2 x $400 + 1 x $325) / (3 + 2 + 1) = $367 (approx)
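weightedknn() below defaults to a Gaussian weighting function that never appears in the transcript; one common form (sigma is a tunable assumption):

import math

def gaussian(dist,sigma=10.0):
    # weight falls off smoothly with distance and never reaches zero
    return math.e**(-dist**2/(2*sigma**2))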
Python code

def weightedknn(data,vec1,k=5,weightf=gaussian):
    # Get distances
    dlist=getdistances(data,vec1)
    avg=0.0
    totalweight=0.0

    # Get weighted average
    for i in range(k):
        dist=dlist[i][0]
        idx=dlist[i][1]
        weight=weightf(dist)
        avg+=weight*data[idx]['result']
        totalweight+=weight
    avg=avg/totalweight
    return avg

def getdistances(data,vec1):
    distancelist=[]
    for i in range(len(data)):
        vec2=data[i]['input']
        distancelist.append((euclidean(vec1,vec2),i))
    distancelist.sort()
    return distancelist
Too few – k too low
Too many – k too high
Determining the best k
- Divide the dataset up:
  - Training set
  - Test set
- Guess the prices for the test set using the training set
- See how good the guesses are for different values of k
- Known as “cross-validation”
Determining the best k

Training set:

Attribute  Price
6          0
8          10
11         30

Test set:

Attribute  Price
10         20

For k = 1, guess = 30, error = 10
For k = 2, guess = 20, error = 0
For k = 3, guess = 13, error = 7

Repeat with different test sets, average the error
Python code

from random import random

def dividedata(data,test=0.05):
    trainset=[]
    testset=[]
    for row in data:
        if random()<test:
            testset.append(row)
        else:
            trainset.append(row)
    return trainset,testset

def testalgorithm(algf,trainset,testset):
    error=0.0
    for row in testset:
        guess=algf(trainset,row['input'])
        error+=(row['result']-guess)**2
    return error/len(testset)

def crossvalidate(algf,data,trials=100,test=0.05):
    error=0.0
    for i in range(trials):
        trainset,testset=dividedata(data,test)
        error+=testalgorithm(algf,trainset,testset)
    return error/trials
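crossvalidate() can now compare settings, e.g. different values of k for the weighted kNN estimator; the wrapper functions here are illustrative:

def wknn3(d,v): return weightedknn(d,v,k=3)
def wknn5(d,v): return weightedknn(d,v,k=5)

print crossvalidate(wknn3,data)   # lower average squared error is better
print crossvalidate(wknn5,data)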
Problems with scale
Scaling the data
Scaling to zero
Determining the best scale
- Try different weights
- Use the “cross-validation” method
- Different ways of choosing a scale (see the sketch below):
  - Range-scaling
  - Intuitive guessing
  - Optimization
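Range-scaling multiplies each attribute by its own weight before distances are computed, so cross-validation (or an optimizer) can tune the weights; a sketch, assuming rows shaped like {'input':...,'result':...}:

def rescale(data,scale):
    scaleddata=[]
    for row in data:
        scaled=[scale[i]*row['input'][i] for i in range(len(scale))]
        scaleddata.append({'input':scaled,'result':row['result']})
    return scaleddata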
Methods covered
- Regression trees
- Hierarchical clustering
- k-means clustering
- Multidimensional scaling
- Weighted k-nearest neighbors
New Projects
Openads
- An open-source ad server
- Users can share impression/click data
- Matrix of what hits, based on:
  - Page text
  - Ad
  - Ad placement
  - Search query
- Can we improve targeting?
New Projects
Finance
- Analysts already drowning in info
- Stories sometimes broken on blogs
- Message boards show sentiment
- Extremely low signal-to-noise ratio
New Projects
Entertainment
- How much buzz is a movie generating?
- Which psychographic profiles like this type of movie?
- Of interest to studios and media investors