Data Mining and Open APIs


Page 1: Data Mining and Open APIs

Data Mining and Open APIs

Toby Segaran

Page 2: Data Mining and Open APIs

About Me

Software Developer at Genstruct
Work directly with scientists
Design algorithms to aid in drug testing

“Programming Collective Intelligence”
Published by O’Reilly
Due out in August

Consult with open-source projects and other companies
http://kiwitobes.com

Page 3: Data Mining and Open APIs

Presentation Goals

Look at some Open APIs
Get some data
Visualize algorithms for data mining
Work through some Python code
Variety of techniques and sources

Advocacy (why you should care)

Page 4: Data Mining and Open APIs

Open data APIs

Zillow
eBay
Facebook
del.icio.us
HotOrNot
Upcoming

Yahoo Answers
Amazon
Technorati
Twitter
Google News

programmableweb.com/apis for more…

Page 5: Data Mining and Open APIs

Open API uses

Mashups
Integration
Automation
Command-line tools
Most importantly, creating datasets!

Page 6: Data Mining and Open APIs

What is data mining?

From a large dataset find the:
Implicit
Unknown
Useful

Data could be:
Tabular, e.g. price lists
Free text
Pictures

Page 7: Data Mining and Open APIs

Why it’s important now

More devices produce more data
People share more data
The internet is vast
Products are more customized
Advertising is targeted
Human cognition is limited

Page 8: Data Mining and Open APIs

Traditional Applications

Computational Biology
Financial Markets
Retail Markets
Fraud Detection
Surveillance
Supply Chain Optimization
National Security

Page 9: Data Mining and Open APIs

Traditional = Inaccessible

Real applications are esoteric
Tutorial examples are trivial
Generally lacking in “interest value”

Page 10: Data Mining and Open APIs

Fun, Accessible Applications

Home price modeling
Where are the hottest people?
Which bloggers are similar?
Important attributes on eBay
Predicting fashion trends
Movie popularity

Page 11: Data Mining and Open APIs

Zillow

Page 12: Data Mining and Open APIs

The Zillow API

Allows querying by address
Returns information about the property:

Bedrooms
Bathrooms
Zip Code
Price Estimate
Last Sale Price

Requires registration key
http://www.zillow.com/howto/api/PropertyDetailsAPIOverview.htm

Page 13: Data Mining and Open APIs

The Zillow API

REST Request

http://www.zillow.com/webservice/GetDeepSearchResults.htm?zws-id=key&address=address&citystatezip=citystateszip

Page 14: Data Mining and Open APIs

The Zillow API

<SearchResults:searchresults xmlns:SearchResults="http://www.zillow.com/vstatic/3/static/xsd/SearchResults.xsd">…
<response><results><result>
  <zpid>48749425</zpid>
  <links>…</links>
  <address>
    <street>2114 Bigelow Ave N</street>
    <zipcode>98109</zipcode>
    <city>SEATTLE</city>
    <state>WA</state>
    <latitude>47.637934</latitude>
    <longitude>-122.347936</longitude>
  </address>
  <yearBuilt>1924</yearBuilt>
  <lotSizeSqFt>4680</lotSizeSqFt>
  <finishedSqFt>3290</finishedSqFt>
  <bathrooms>2.75</bathrooms>
  <bedrooms>4</bedrooms>
  <lastSoldDate>06/18/2002</lastSoldDate>
  <lastSoldPrice currency="USD">770000</lastSoldPrice>
  <valuation><amount currency="USD">1091061</amount></valuation>
</result></results></response>

Page 15: Data Mining and Open APIs

The Zillow API

<zipcode>98109</zipcode>
<city>SEATTLE</city>
<state>WA</state>
<latitude>47.637934</latitude>
<longitude>-122.347936</longitude>
</address>
<yearBuilt>1924</yearBuilt>
<lotSizeSqFt>4680</lotSizeSqFt>
<finishedSqFt>3290</finishedSqFt>
<bathrooms>2.75</bathrooms>
<bedrooms>4</bedrooms>
<lastSoldDate>06/18/2002</lastSoldDate>
<lastSoldPrice currency="USD">770000</lastSoldPrice>
<valuation><amount currency="USD">1091061</amount></valuation>

Page 16: Data Mining and Open APIs

Zillow from Python

import urllib2
import xml.dom.minidom

def getaddressdata(address,city):
  escad=address.replace(' ','+')

  # Construct the URL (zwskey holds the Zillow registration key)
  url='http://www.zillow.com/webservice/GetDeepSearchResults.htm?'
  url+='zws-id=%s&address=%s&citystatezip=%s' % (zwskey,escad,city)

  # Parse resulting XML
  doc=xml.dom.minidom.parseString(urllib2.urlopen(url).read())
  code=doc.getElementsByTagName('code')[0].firstChild.data

  # Code 0 means success, otherwise there was an error
  if code!='0': return None

  # Extract the info about this property
  try:
    zipcode=doc.getElementsByTagName('zipcode')[0].firstChild.data
    use=doc.getElementsByTagName('useCode')[0].firstChild.data
    year=doc.getElementsByTagName('yearBuilt')[0].firstChild.data
    bath=doc.getElementsByTagName('bathrooms')[0].firstChild.data
    bed=doc.getElementsByTagName('bedrooms')[0].firstChild.data
    rooms=doc.getElementsByTagName('totalRooms')[0].firstChild.data
    price=doc.getElementsByTagName('amount')[0].firstChild.data
  except:
    return None

  return (zipcode,use,int(year),float(bath),int(bed),int(rooms),price)

Page 17: Data Mining and Open APIs

Zillow from Python

# Construct the URL
url='http://www.zillow.com/webservice/GetDeepSearchResults.htm?'
url+='zws-id=%s&address=%s&citystatezip=%s' % (zwskey,escad,city)

Page 18: Data Mining and Open APIs

Zillow from Python

# Parse resulting XML
doc=xml.dom.minidom.parseString(urllib2.urlopen(url).read())
code=doc.getElementsByTagName('code')[0].firstChild.data

Page 19: Data Mining and Open APIs

Zillow from Python

zipcode=doc.getElementsByTagName('zipcode')[0].firstChild.data
use=doc.getElementsByTagName('useCode')[0].firstChild.data
year=doc.getElementsByTagName('yearBuilt')[0].firstChild.data
bath=doc.getElementsByTagName('bathrooms')[0].firstChild.data
bed=doc.getElementsByTagName('bedrooms')[0].firstChild.data
rooms=doc.getElementsByTagName('totalRooms')[0].firstChild.data
price=doc.getElementsByTagName('amount')[0].firstChild.data

Page 20: Data Mining and Open APIs

A home price dataset

House  Zip    Bathrooms  Bedrooms  Type     Built  Price
F      02138  3.5        4         Single   1930   2107871
E      02138  3.5        5         Duplex   1909   947528
D      02139  2.5        4         Duplex   1854   552213
C      02140  3.5        4         Duplex   1894   595027
B      02139  3.5        9         Triplex  1916   776378
A      02138  1.5        2         Single   1847   505296
etc..

Page 21: Data Mining and Open APIs

What can we learn?

A made-up house's price
How important is Zip Code?
What are the important attributes?

Can we do better than averages?

Page 22: Data Mining and Open APIs

Introducing Regression Trees

A   B       Value
18  Circle  6
22  Square  8
11  Square  22
10  Circle  20

Page 23: Data Mining and Open APIs

Introducing Regression Trees

A   B       Value
18  Circle  6
22  Square  8
11  Square  22
10  Circle  20

Page 24: Data Mining and Open APIs

Minimizing deviation

Standard deviation is the “spread” of results
Try all possible divisions
Choose the division that decreases deviation the most

A   B       Value
18  Circle  6
22  Square  8
11  Square  22
10  Circle  20

Initially:
Average = 14
Standard Deviation = 8.2

Page 25: Data Mining and Open APIs

Minimizing deviation

Standard deviation is the “spread” of results
Try all possible divisions
Choose the division that decreases deviation the most

A   B       Value
18  Circle  6
22  Square  8
11  Square  22
10  Circle  20

B = Circle:
Average = 13
Standard Deviation = 9.9

B = Square:
Average = 15
Standard Deviation = 9.9

Page 26: Data Mining and Open APIs

Minimizing deviation

Standard deviation is the “spread” of results
Try all possible divisions
Choose the division that decreases deviation the most

A   B       Value
18  Circle  6
22  Square  8
11  Square  22
10  Circle  20

A > 18:
Average = 8
Standard Deviation = 0

A <= 18:
Average = 16
Standard Deviation = 8.7

Page 27: Data Mining and Open APIs

Minimizing deviation

Standard deviation is the “spread” of results
Try all possible divisions
Choose the division that decreases deviation the most

A   B       Value
18  Circle  6
22  Square  8
11  Square  22
10  Circle  20

A > 11:
Average = 7
Standard Deviation = 1.4

A <= 11:
Average = 21
Standard Deviation = 1.4
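These figures can be checked in a few lines of Python (a quick sketch, not from the slides; it uses the sample standard deviation, which is what the numbers above appear to be):

def stddev(values):
  # sample standard deviation (divide by n-1), matching the slide figures
  m=sum(values)/float(len(values))
  return (sum([(v-m)**2 for v in values])/(len(values)-1))**0.5

rows=[(18,'Circle',6),(22,'Square',8),(11,'Square',22),(10,'Circle',20)]
values=[v for (a,b,v) in rows]
print(sum(values)/float(len(values)))            # 14.0
print(stddev(values))                            # about 8.2
print(stddev([v for (a,b,v) in rows if a>11]))   # values 6 and 8: about 1.4
print(stddev([v for (a,b,v) in rows if a<=11]))  # values 22 and 20: about 1.4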

Page 28: Data Mining and Open APIs

Python Code

def variance(rows):
  if len(rows)==0: return 0
  data=[float(row[len(row)-1]) for row in rows]
  mean=sum(data)/len(data)
  variance=sum([(d-mean)**2 for d in data])/len(data)
  return variance

def divideset(rows,column,value):
  # Make a function that tells us if a row is in
  # the first group (true) or the second group (false)
  split_function=None
  if isinstance(value,int) or isinstance(value,float):
    split_function=lambda row:row[column]>=value
  else:
    split_function=lambda row:row[column]==value

  # Divide the rows into two sets and return them
  set1=[row for row in rows if split_function(row)]
  set2=[row for row in rows if not split_function(row)]
  return (set1,set2)
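As a quick usage sketch (not on the slides), running these two functions on the toy table from the previous slides picks out the same best split. Note that variance() returns the population variance (50.0 for the whole table), while the slides quote the sample standard deviation (8.2):

rows=[[18,'Circle',6],[22,'Square',8],[11,'Square',22],[10,'Circle',20]]

print(variance(rows))      # 50.0 for the whole table

# Split on column 0 (A) at the value 18, i.e. A>=18 versus A<18
set1,set2=divideset(rows,0,18)
print(set1)                # [[18, 'Circle', 6], [22, 'Square', 8]]
print(variance(set1))      # 1.0
print(variance(set2))      # 1.0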

Page 29: Data Mining and Open APIs

Python Code

def variance(rows):
  if len(rows)==0: return 0
  data=[float(row[len(row)-1]) for row in rows]
  mean=sum(data)/len(data)
  variance=sum([(d-mean)**2 for d in data])/len(data)
  return variance

Page 30: Data Mining and Open APIs

Python Code

# Make a function that tells us if a row is in
# the first group (true) or the second group (false)
split_function=None
if isinstance(value,int) or isinstance(value,float):
  split_function=lambda row:row[column]>=value
else:
  split_function=lambda row:row[column]==value

Page 31: Data Mining and Open APIs

Python Code

# Divide the rows into two sets and return them
set1=[row for row in rows if split_function(row)]
set2=[row for row in rows if not split_function(row)]
return (set1,set2)

Page 32: Data Mining and Open APIs

CART Algorithm

A   B       Value
18  Circle  6
22  Square  8
11  Square  22
10  Circle  20

Page 33: Data Mining and Open APIs

CART Algorithm

A   B       Value
18  Circle  6
22  Square  8
11  Square  22
10  Circle  20

Page 34: Data Mining and Open APIs

CART Algorithm

A <= 11:
A   B       Value
11  Square  22
10  Circle  20

A > 11:
A   B       Value
18  Circle  6
22  Square  8

Page 35: Data Mining and Open APIs

CART Algorithm

Page 36: Data Mining and Open APIs

Python Code

def buildtree(rows,scoref=variance):
  if len(rows)==0: return decisionnode()
  current_score=scoref(rows)

  # Set up some variables to track the best criteria
  best_gain=0.0
  best_criteria=None
  best_sets=None

  column_count=len(rows[0])-1
  for col in range(0,column_count):
    # Generate the list of different values in this column
    column_values={}
    for row in rows:
      column_values[row[col]]=1

    # Now try dividing the rows up for each value in this column
    for value in column_values.keys():
      (set1,set2)=divideset(rows,col,value)

      # Information gain
      p=float(len(set1))/len(rows)
      gain=current_score-p*scoref(set1)-(1-p)*scoref(set2)
      if gain>best_gain and len(set1)>0 and len(set2)>0:
        best_gain=gain
        best_criteria=(col,value)
        best_sets=(set1,set2)

  # Create the sub branches
  if best_gain>0:
    trueBranch=buildtree(best_sets[0])
    falseBranch=buildtree(best_sets[1])
    return decisionnode(col=best_criteria[0],value=best_criteria[1],
                        tb=trueBranch,fb=falseBranch)
  else:
    return decisionnode(results=uniquecounts(rows))
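buildtree returns decisionnode objects and calls a uniquecounts helper, neither of which is shown on these slides. A minimal sketch consistent with how they are used here:

class decisionnode:
  def __init__(self,col=-1,value=None,results=None,tb=None,fb=None):
    self.col=col          # index of the column this node tests
    self.value=value      # value the column is compared against
    self.results=results  # for leaf nodes, a dictionary of outcomes
    self.tb=tb            # branch followed when the test is true
    self.fb=fb            # branch followed when the test is false

def uniquecounts(rows):
  # Count how often each outcome (the last column) appears in this group of rows
  results={}
  for row in rows:
    r=row[len(row)-1]
    results.setdefault(r,0)
    results[r]+=1
  return results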

Page 37: Data Mining and Open APIs

Python Code

def buildtree(rows,scoref=variance):
  if len(rows)==0: return decisionnode()
  current_score=scoref(rows)

  # Set up some variables to track the best criteria
  best_gain=0.0
  best_criteria=None
  best_sets=None
  column_count=len(rows[0])-1

Page 38: Data Mining and Open APIs

Python Code

for value in column_values.keys():
  (set1,set2)=divideset(rows,col,value)

  # Information gain
  p=float(len(set1))/len(rows)
  gain=current_score-p*scoref(set1)-(1-p)*scoref(set2)
  if gain>best_gain and len(set1)>0 and len(set2)>0:
    best_gain=gain
    best_criteria=(col,value)
    best_sets=(set1,set2)

Page 39: Data Mining and Open APIs

Python Code

# Create the sub branches
if best_gain>0:
  trueBranch=buildtree(best_sets[0])
  falseBranch=buildtree(best_sets[1])
  return decisionnode(col=best_criteria[0],value=best_criteria[1],
                      tb=trueBranch,fb=falseBranch)
else:
  return decisionnode(results=uniquecounts(rows))

Page 40: Data Mining and Open APIs

Zillow Results

Bathrooms > 3

Zip: 02139? After 1903?

Triplex?
Duplex?
Bedrooms > 4?
Zip: 02140?

Page 41: Data Mining and Open APIs

Just for Fun… Hot or Not

Page 42: Data Mining and Open APIs

Just for Fun… Hot or Not

Page 43: Data Mining and Open APIs

Supervised and Unsupervised

Regression trees are supervised
“Answers” are in the dataset
Tree models predict answers

Some methods are unsupervised
There are no answers
Methods just characterize the data
Show interesting patterns

Page 44: Data Mining and Open APIs

Next challenge - Bloggers

Millions of blogs online
Usually focus on a subject area
Can they be characterized automatically?
… using only the words in the posts?

Page 45: Data Mining and Open APIs

The Technorati Top 100

Page 46: Data Mining and Open APIs

A single blog

Page 47: Data Mining and Open APIs

Getting the content

Use Mark Pilgrim’s Universal Feed Parser
Retrieve the post titles and text
Split up the words
Count occurrence of each word

Page 48: Data Mining and Open APIs

Python Code

import feedparser
import re

# Returns title and dictionary of word counts for an RSS feed
def getwordcounts(url):
  # Parse the feed
  d=feedparser.parse(url)
  wc={}

  # Loop over all the entries
  for e in d.entries:
    if 'summary' in e: summary=e.summary
    else: summary=e.description

    # Extract a list of words
    words=getwords(e.title+' '+summary)
    for word in words:
      wc.setdefault(word,0)
      wc[word]+=1

  return d.feed.title,wc

def getwords(html):
  # Remove all the HTML tags
  txt=re.compile(r'<[^>]+>').sub('',html)

  # Split words by all non-alpha characters
  words=re.compile(r'[^A-Z^a-z]+').split(txt)

  # Convert to lowercase
  return [word.lower() for word in words if word!='']

Page 49: Data Mining and Open APIs

Python Code

for e in d.entries:
  if 'summary' in e: summary=e.summary
  else: summary=e.description

  # Extract a list of words
  words=getwords(e.title+' '+summary)
  for word in words:
    wc.setdefault(word,0)
    wc[word]+=1

Page 50: Data Mining and Open APIs

Python Code

def getwords(html):
  # Remove all the HTML tags
  txt=re.compile(r'<[^>]+>').sub('',html)

  # Split words by all non-alpha characters
  words=re.compile(r'[^A-Z^a-z]+').split(txt)

  # Convert to lowercase
  return [word.lower() for word in words if word!='']

Page 51: Data Mining and Open APIs

Building a Word Matrix

Build a matrix of word counts
Blogs are rows, words are columns
Eliminate words that are:

Too common
Too rare

Page 52: Data Mining and Open APIs

Python Code

apcount={}
wordcounts={}
feedlist=[line for line in file('feedlist.txt')]
for feedurl in feedlist:
  title,wc=getwordcounts(feedurl)
  wordcounts[title]=wc
  for word,count in wc.items():
    apcount.setdefault(word,0)
    if count>1:
      apcount[word]+=1

wordlist=[]
for w,bc in apcount.items():
  frac=float(bc)/len(feedlist)
  if frac>0.1 and frac<0.5: wordlist.append(w)

out=file('blogdata.txt','w')
out.write('Blog')
for word in wordlist: out.write('\t%s' % word)
out.write('\n')
for blog,wc in wordcounts.items():
  out.write(blog)
  for word in wordlist:
    if word in wc: out.write('\t%d' % wc[word])
    else: out.write('\t0')
  out.write('\n')

Page 53: Data Mining and Open APIs

Python Code

feedlist=[line for line in file('feedlist.txt')]
for feedurl in feedlist:
  title,wc=getwordcounts(feedurl)
  wordcounts[title]=wc
  for word,count in wc.items():
    apcount.setdefault(word,0)
    if count>1:
      apcount[word]+=1

Page 54: Data Mining and Open APIs

Python Code

wordlist=[]
for w,bc in apcount.items():
  frac=float(bc)/len(feedlist)
  if frac>0.1 and frac<0.5: wordlist.append(w)

Page 55: Data Mining and Open APIs

Python Code

out=file('blogdata.txt','w')
out.write('Blog')
for word in wordlist: out.write('\t%s' % word)
out.write('\n')
for blog,wc in wordcounts.items():
  out.write(blog)
  for word in wordlist:
    if word in wc: out.write('\t%d' % wc[word])
    else: out.write('\t0')
  out.write('\n')

Page 56: Data Mining and Open APIs

The Word Matrix

                   "china"  "kids"  "music"  "yahoo"
Gothamist             0       3        3        0
GigaOM                6       0        1        2
Quick Online Tips     0       2        2       12

Page 57: Data Mining and Open APIs

Determining distance

                   "china"  "kids"  "music"  "yahoo"
Gothamist             0       3        3        0
GigaOM                6       0        1        2
Quick Online Tips     0       2        2       12

Euclidean “as the crow flies”:

sqrt((6-0)^2 + (0-2)^2 + (1-2)^2 + (2-12)^2) = 12 (approx)
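A small Euclidean distance function along these lines (a sketch, not shown on the slides) reproduces the number above for the GigaOM and Quick Online Tips rows:

def euclidean(v1,v2):
  # Sum the squared differences in each dimension and take the square root
  return sum([(v1[i]-v2[i])**2 for i in range(len(v1))])**0.5

gigaom=[6,0,1,2]          # "china", "kids", "music", "yahoo"
quickonline=[0,2,2,12]
print(euclidean(gigaom,quickonline))   # about 11.9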

Page 58: Data Mining and Open APIs

Other Distance Metrics

Manhattan
Tanimoto
Pearson Correlation
Chebyshev
Spearman
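The clustering code later in the deck passes distance=pearson as its default without showing the function. A sketch of a Pearson-correlation distance (1 minus the correlation, so strongly correlated blogs count as close even if one is much wordier than the other):

def pearson(v1,v2):
  n=len(v1)
  sum1=sum(v1)
  sum2=sum(v2)
  sum1sq=sum([v**2 for v in v1])
  sum2sq=sum([v**2 for v in v2])
  psum=sum([v1[i]*v2[i] for i in range(n)])

  # Pearson correlation coefficient
  num=psum-(sum1*sum2/float(n))
  den=((sum1sq-sum1**2/float(n))*(sum2sq-sum2**2/float(n)))**0.5
  if den==0: return 0

  # Return it as a distance: 0 when perfectly correlated, larger when less similar
  return 1.0-num/den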

Page 59: Data Mining and Open APIs

Hierarchical Clustering

Find the two closest items
Combine them into a single item
Repeat…

Page 60: Data Mining and Open APIs

Hierarchical Algorithm

Page 61: Data Mining and Open APIs

Hierarchical Algorithm

Page 62: Data Mining and Open APIs

Hierarchical Algorithm

Page 63: Data Mining and Open APIs

Hierarchical Algorithm

Page 64: Data Mining and Open APIs

Hierarchical Algorithm

Page 65: Data Mining and Open APIs

Dendrogram

Page 66: Data Mining and Open APIs

Python Code

class bicluster:
  def __init__(self,vec,left=None,right=None,distance=0.0,id=None):
    self.left=left
    self.right=right
    self.vec=vec
    self.id=id
    self.distance=distance

Page 67: Data Mining and Open APIs

Python Code

def hcluster(rows,distance=pearson):
  distances={}
  currentclustid=-1

  # Clusters are initially just the rows
  clust=[bicluster(rows[i],id=i) for i in range(len(rows))]

  while len(clust)>1:
    lowestpair=(0,1)
    closest=distance(clust[0].vec,clust[1].vec)

    # loop through every pair looking for the smallest distance
    for i in range(len(clust)):
      for j in range(i+1,len(clust)):
        # distances is the cache of distance calculations
        if (clust[i].id,clust[j].id) not in distances:
          distances[(clust[i].id,clust[j].id)]=distance(clust[i].vec,clust[j].vec)
        d=distances[(clust[i].id,clust[j].id)]
        if d<closest:
          closest=d
          lowestpair=(i,j)

    # calculate the average of the two clusters
    mergevec=[(clust[lowestpair[0]].vec[i]+clust[lowestpair[1]].vec[i])/2.0
              for i in range(len(clust[0].vec))]

    # create the new cluster
    newcluster=bicluster(mergevec,left=clust[lowestpair[0]],
                         right=clust[lowestpair[1]],
                         distance=closest,id=currentclustid)

    # cluster ids that weren't in the original set are negative
    currentclustid-=1
    del clust[lowestpair[1]]
    del clust[lowestpair[0]]
    clust.append(newcluster)

  return clust[0]

Page 68: Data Mining and Open APIs

Python Code

distances={}
currentclustid=-1

# Clusters are initially just the rows
clust=[bicluster(rows[i],id=i) for i in range(len(rows))]

Page 69: Data Mining and Open APIs

Python Code

while len(clust)>1:
  lowestpair=(0,1)
  closest=distance(clust[0].vec,clust[1].vec)

  # loop through every pair looking for the smallest distance
  for i in range(len(clust)):
    for j in range(i+1,len(clust)):
      # distances is the cache of distance calculations
      if (clust[i].id,clust[j].id) not in distances:
        distances[(clust[i].id,clust[j].id)]=distance(clust[i].vec,clust[j].vec)
      d=distances[(clust[i].id,clust[j].id)]
      if d<closest:
        closest=d
        lowestpair=(i,j)

Page 70: Data Mining and Open APIs

Python Code

# calculate the average of the two clusters
mergevec=[
  (clust[lowestpair[0]].vec[i]+clust[lowestpair[1]].vec[i])/2.0
  for i in range(len(clust[0].vec))
]

# create the new cluster
newcluster=bicluster(mergevec,left=clust[lowestpair[0]],
                     right=clust[lowestpair[1]],
                     distance=closest,id=currentclustid)

del clust[lowestpair[1]]
del clust[lowestpair[0]]
clust.append(newcluster)

Page 71: Data Mining and Open APIs

Hierarchical Blog Clusters

Page 72: Data Mining and Open APIs

Hierarchical Blog Clusters

Page 73: Data Mining and Open APIs

Hierarchical Blog Clusters

Page 74: Data Mining and Open APIs

Rotating the Matrix

Words in a blog -> blogs containing each word

          Quick Onl  GigaOM  Gothamist
"yahoo"      12         2        0
"music"       2         1        3
"kids"        2         0        3
"china"       0         6        0
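The rotation itself is just a transpose of the word matrix; a minimal sketch (assuming data is the nested list of counts with blogs as rows):

def rotatematrix(data):
  # Swap rows and columns: entry [i][j] becomes entry [j][i]
  newdata=[]
  for i in range(len(data[0])):
    newrow=[data[j][i] for j in range(len(data))]
    newdata.append(newrow)
  return newdata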

Page 75: Data Mining and Open APIs

Hierarchical Word Clusters

Page 76: Data Mining and Open APIs

K-Means Clustering

Divides data into distinct clusters
User determines how many

Algorithm:
Start with arbitrary centroids
Assign points to centroids
Move the centroids
Repeat

Page 77: Data Mining and Open APIs

K-Means Algorithm

Page 78: Data Mining and Open APIs

K-Means Algorithm

Page 79: Data Mining and Open APIs

K-Means Algorithm

Page 80: Data Mining and Open APIs

K-Means Algorithm

Page 81: Data Mining and Open APIs

K-Means Algorithm

Page 82: Data Mining and Open APIs

Python Code

import random

def kcluster(rows,distance=pearson,k=4):
  # Determine the minimum and maximum values for each point
  ranges=[(min([row[i] for row in rows]),max([row[i] for row in rows]))
          for i in range(len(rows[0]))]

  # Create k randomly placed centroids
  clusters=[[random.random()*(ranges[i][1]-ranges[i][0])+ranges[i][0]
             for i in range(len(rows[0]))] for j in range(k)]

  lastmatches=None
  for t in range(100):
    print 'Iteration %d' % t
    bestmatches=[[] for i in range(k)]

    # Find which centroid is the closest for each row
    for j in range(len(rows)):
      row=rows[j]
      bestmatch=0
      for i in range(k):
        d=distance(clusters[i],row)
        if d<distance(clusters[bestmatch],row): bestmatch=i
      bestmatches[bestmatch].append(j)

    # If the results are the same as last time, this is complete
    if bestmatches==lastmatches: break
    lastmatches=bestmatches

    # Move the centroids to the average of their members
    for i in range(k):
      avgs=[0.0]*len(rows[0])
      if len(bestmatches[i])>0:
        for rowid in bestmatches[i]:
          for m in range(len(rows[rowid])):
            avgs[m]+=rows[rowid][m]
        for j in range(len(avgs)):
          avgs[j]/=len(bestmatches[i])
        clusters[i]=avgs

  return bestmatches

Page 83: Data Mining and Open APIs

Python Code

# Determine the minimum and maximum values for each point
ranges=[(min([row[i] for row in rows]),
         max([row[i] for row in rows]))
        for i in range(len(rows[0]))]

# Create k randomly placed centroids
clusters=[[random.random()*
           (ranges[i][1]-ranges[i][0])+ranges[i][0]
           for i in range(len(rows[0]))]
          for j in range(k)]

Page 84: Data Mining and Open APIs

Python Code

for t in range(100):
  bestmatches=[[] for i in range(k)]

  # Find which centroid is the closest for each row
  for j in range(len(rows)):
    row=rows[j]
    bestmatch=0
    for i in range(k):
      d=distance(clusters[i],row)
      if d<distance(clusters[bestmatch],row): bestmatch=i
    bestmatches[bestmatch].append(j)

Page 85: Data Mining and Open APIs

Python Code

# If the results are the same as last time, this is complete
if bestmatches==lastmatches: break
lastmatches=bestmatches

Page 86: Data Mining and Open APIs

Python Code

# Move the centroids to the average of their members
for i in range(k):
  avgs=[0.0]*len(rows[0])
  if len(bestmatches[i])>0:
    for rowid in bestmatches[i]:
      for m in range(len(rows[rowid])):
        avgs[m]+=rows[rowid][m]
    for j in range(len(avgs)):
      avgs[j]/=len(bestmatches[i])
    clusters[i]=avgs

Page 87: Data Mining and Open APIs

K-Means Results

>> [rownames[r] for r in k[0]]
['The Viral Garden', 'Copyblogger', 'Creating Passionate Users', 'Oilman', 'ProBlogger Blog Tips', "Seth's Blog"]

>> [rownames[r] for r in k[1]]
['Wonkette', 'Gawker', 'Gothamist', 'Huffington Post']

Page 88: Data Mining and Open APIs

2D Visualizations

Instead of clusters, a 2D map

Goals:
Preserve distances as much as possible
Draw in two dimensions

Dimension reduction:
Principal Components Analysis
Multidimensional Scaling

Page 89: Data Mining and Open APIs

Multidimensional Scaling

Page 90: Data Mining and Open APIs

Multidimensional Scaling

Page 91: Data Mining and Open APIs

Multidimensional Scaling

Page 92: Data Mining and Open APIs

import random
from math import sqrt

def scaledown(data,distance=pearson,rate=0.01):
  n=len(data)

  # The real distances between every pair of items
  realdist=[[distance(data[i],data[j]) for j in range(n)]
            for i in range(0,n)]
  outersum=0.0

  # Randomly initialize the starting points of the locations in 2D
  loc=[[random.random(),random.random()] for i in range(n)]
  fakedist=[[0.0 for j in range(n)] for i in range(n)]

  lasterror=None
  for m in range(0,1000):
    # Find projected distances
    for i in range(n):
      for j in range(n):
        fakedist[i][j]=sqrt(sum([pow(loc[i][x]-loc[j][x],2)
                                 for x in range(len(loc[i]))]))

    # Move points
    grad=[[0.0,0.0] for i in range(n)]

    totalerror=0
    for k in range(n):
      for j in range(n):
        if j==k: continue
        # The error is percent difference between the distances
        errorterm=(fakedist[j][k]-realdist[j][k])/realdist[j][k]

        # Each point needs to be moved away from or towards the other
        # point in proportion to how much error it has
        grad[k][0]+=((loc[k][0]-loc[j][0])/fakedist[j][k])*errorterm
        grad[k][1]+=((loc[k][1]-loc[j][1])/fakedist[j][k])*errorterm

        # Keep track of the total error
        totalerror+=abs(errorterm)
    print totalerror

    # If the answer got worse by moving the points, we are done
    if lasterror and lasterror<totalerror: break
    lasterror=totalerror

    # Move each of the points by the learning rate times the gradient
    for k in range(n):
      loc[k][0]-=rate*grad[k][0]
      loc[k][1]-=rate*grad[k][1]

  return loc

Page 93: Data Mining and Open APIs


n=len(data)

# The real distances between every pair of items
realdist=[[distance(data[i],data[j]) for j in range(n)]
          for i in range(0,n)]
outersum=0.0

Page 94: Data Mining and Open APIs


# Randomly initialize the starting points of the locations in 2D
loc=[[random.random(),random.random()] for i in range(n)]
fakedist=[[0.0 for j in range(n)] for i in range(n)]

Page 95: Data Mining and Open APIs


lasterror=None
for m in range(0,1000):
  # Find projected distances
  for i in range(n):
    for j in range(n):
      fakedist[i][j]=sqrt(sum([pow(loc[i][x]-loc[j][x],2)
                               for x in range(len(loc[i]))]))

Page 96: Data Mining and Open APIs


# Move points
grad=[[0.0,0.0] for i in range(n)]

totalerror=0
for k in range(n):
  for j in range(n):
    if j==k: continue
    # The error is percent difference between the distances
    errorterm=(fakedist[j][k]-realdist[j][k])/realdist[j][k]

    # Each point needs to be moved away from or towards the
    # other point in proportion to how much error it has
    grad[k][0]+=((loc[k][0]-loc[j][0])/fakedist[j][k])*errorterm
    grad[k][1]+=((loc[k][1]-loc[j][1])/fakedist[j][k])*errorterm

    # Keep track of the total error
    totalerror+=abs(errorterm)

Page 97: Data Mining and Open APIs


# If the answer got worse by moving the points, we are done
if lasterror and lasterror<totalerror: break
lasterror=totalerror

Page 98: Data Mining and Open APIs


# Move each of the points by the learning rate times the gradient
for k in range(n):
  loc[k][0]-=rate*grad[k][0]
  loc[k][1]-=rate*grad[k][1]

Page 99: Data Mining and Open APIs
Page 100: Data Mining and Open APIs
Page 101: Data Mining and Open APIs
Page 102: Data Mining and Open APIs
Page 103: Data Mining and Open APIs

Numerical Predictions

Back to “supervised” learning
We have a set of numerical attributes:

Specs for a laptop
Age and rating for wine
Ratios for a stock

Want to predict another attribute
Formula/model is unknown
e.g. price

Page 104: Data Mining and Open APIs

Regression Trees?

Regression trees find hard boundaries
Can’t deal with complex formulae

Page 105: Data Mining and Open APIs

Statistical regression

Requires specification of a model
Usually linear
Doesn’t handle context

Page 106: Data Mining and Open APIs

Alternative - Interpolation

Find “similar” items
Guess price based on similar items
Need to determine:

What is similar?
How should we aggregate prices?

Page 107: Data Mining and Open APIs

Price Data from EBay

Page 108: Data Mining and Open APIs

The eBay API

XML API
Send XML over HTTPS
Receive results in XML

http://developer.ebay.com/quickstartguide.

Page 109: Data Mining and Open APIs

Some Python Code

def sendRequest(apicall,xmlparameters):
  connection = httplib.HTTPSConnection(serverUrl)
  connection.request("POST", '/ws/api.dll', xmlparameters, getHeaders(apicall))
  response = connection.getresponse()
  if response.status != 200:
    print "Error sending request:" + response.reason
    return None
  data = response.read()
  connection.close()
  return data

def getHeaders(apicall,siteID="0",compatabilityLevel = "433"):
  headers = {"X-EBAY-API-COMPATIBILITY-LEVEL": compatabilityLevel,
             "X-EBAY-API-DEV-NAME": devKey,
             "X-EBAY-API-APP-NAME": appKey,
             "X-EBAY-API-CERT-NAME": certKey,
             "X-EBAY-API-CALL-NAME": apicall,
             "X-EBAY-API-SITEID": siteID,
             "Content-Type": "text/xml"}
  return headers

Page 110: Data Mining and Open APIs

Some Python Code

def getItem(itemID):
  xml = "<?xml version='1.0' encoding='utf-8'?>"+\
        "<GetItemRequest xmlns=\"urn:ebay:apis:eBLBaseComponents\">"+\
        "<RequesterCredentials><eBayAuthToken>" +\
        userToken +\
        "</eBayAuthToken></RequesterCredentials>" + \
        "<ItemID>" + str(itemID) + "</ItemID>"+\
        "<DetailLevel>ItemReturnAttributes</DetailLevel>"+\
        "</GetItemRequest>"
  data=sendRequest('GetItem',xml)
  result={}
  response=parseString(data)
  result['title']=getSingleValue(response,'Title')
  sellingStatusNode = response.getElementsByTagName('SellingStatus')[0]
  result['price']=getSingleValue(sellingStatusNode,'CurrentPrice')
  result['bids']=getSingleValue(sellingStatusNode,'BidCount')
  seller = response.getElementsByTagName('Seller')
  result['feedback'] = getSingleValue(seller[0],'FeedbackScore')
  attributeSet=response.getElementsByTagName('Attribute')
  attributes={}
  for att in attributeSet:
    attID=att.attributes.getNamedItem('attributeID').nodeValue
    attValue=getSingleValue(att,'ValueLiteral')
    attributes[attID]=attValue
  result['attributes']=attributes
  return result

Page 111: Data Mining and Open APIs

Building an item table

           RAM   CPU   HDD  DVD  Screen  Price
Pavillion  1024  1600  120   1     17    $800
T22         256   900   20   1     14    $200
Lenovo        ?     ?    ?   ?     13    $800
D600        512  1400   40   1     14    $350
etc..

Page 112: Data Mining and Open APIs

Distance between items

     RAM   CPU   HDD  DVD  Screen  Price
New  512  1400    40   1     14     ???
T22  256   900    20   1     14    $200

Euclidean, just like in clustering:

sqrt((512-256)^2 + (1400-900)^2 + (40-20)^2 + (14-14)^2 + (1-1)^2)

Page 113: Data Mining and Open APIs

Idea 1 – use the closest item

With the item whose price I want to guess:

Calculate the distance for every item in my dataset
Guess that the price is the same as the closest

This is called kNN with k=1

Page 114: Data Mining and Open APIs

Problems with “outliers”

The closest item may be anomalous
Why?

Exceptional deal that won’t occur again
Something missing from the dataset
Data errors

Page 115: Data Mining and Open APIs

Using an average

       RAM   CPU   HDD  DVD  Screen  Price
New    512  1400    40   1     14     ???
No. 1  512  1400    30   1     13    $360
No. 2  512  1400    60   1     14    $400
No. 3  1024 1600   120   0     15    $325

k=3, estimate = $361
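A plain-average kNN estimate along these lines is only a few lines of code (a sketch; getdistances is defined a couple of slides later, and the rows are assumed to be {'input': ..., 'result': ...} dictionaries as in the later code):

def knnestimate(data,vec1,k=3):
  # Average the prices of the k items closest to vec1
  dlist=getdistances(data,vec1)
  avg=0.0
  for i in range(k):
    idx=dlist[i][1]
    avg+=data[idx]['result']
  return avg/k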

Page 116: Data Mining and Open APIs

Using a weighted average

       RAM   CPU   HDD  DVD  Screen  Price  Weight
New    512  1400    40   1     14     ???
No. 1  512  1400    30   1     13    $360      3
No. 2  512  1400    60   1     14    $400      2
No. 3  1024 1600   120   0     15    $325      1

Estimate = $367
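The weighted estimate is just a weighted mean of the neighbors' prices; a two-line check of the figure above:

prices=[360,400,325]
weights=[3,2,1]
print(sum([w*p for (w,p) in zip(weights,prices)])/float(sum(weights)))   # 367.5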

Page 117: Data Mining and Open APIs

Python code

def weightedknn(data,vec1,k=5,weightf=gaussian):
  # Get distances
  dlist=getdistances(data,vec1)
  avg=0.0
  totalweight=0.0

  # Get weighted average
  for i in range(k):
    dist=dlist[i][0]
    idx=dlist[i][1]
    weight=weightf(dist)
    avg+=weight*data[idx]['result']
    totalweight+=weight

  avg=avg/totalweight
  return avg

def getdistances(data,vec1):
  distancelist=[]
  for i in range(len(data)):
    vec2=data[i]['input']
    distancelist.append((euclidean(vec1,vec2),i))
  distancelist.sort()
  return distancelist
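weightedknn defaults to weightf=gaussian, but the weight function itself isn't shown on the slides. One common choice (a sketch, with sigma as an assumed tuning parameter) gives nearby items a weight near 1 that falls off smoothly with distance and never quite reaches zero:

import math

def gaussian(dist,sigma=10.0):
  return math.e**(-dist**2/(2*sigma**2))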

Page 118: Data Mining and Open APIs

Python code

def getdistances(data,vec1):
  distancelist=[]
  for i in range(len(data)):
    vec2=data[i]['input']
    distancelist.append((euclidean(vec1,vec2),i))
  distancelist.sort()
  return distancelist


Page 119: Data Mining and Open APIs

Too few – k too low

Page 120: Data Mining and Open APIs

Too many – k too high

Page 121: Data Mining and Open APIs

Determining the best k

Divide the dataset up:
Training set
Test set

Guess the prices for the test set using the training set
See how good the guesses are for different values of k
Known as “cross-validation”

Page 122: Data Mining and Open APIs

Determining the best k

Attribute  Price
    6        0
    8       10
   11       30
   10       20

Test set:
Attribute  Price
   10       20

Training set:
Attribute  Price
    6        0
    8       10
   11       30

For k = 1, guess = 30, error = 10
For k = 2, guess = 20, error = 0
For k = 3, guess = 13, error = 7

Repeat with different test sets, average the error
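A quick sketch that reproduces the three guesses above from this tiny training set:

train=[(6,0),(8,10),(11,30)]    # (attribute, price)
testattr,testprice=10,20

# Sort the training items by how far their attribute is from the test item's
bydist=sorted(train,key=lambda item: abs(item[0]-testattr))
for k in (1,2,3):
  guess=sum([p for (a,p) in bydist[:k]])/float(k)
  print('k=%d guess=%.1f error=%.1f' % (k,guess,abs(testprice-guess)))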

Page 123: Data Mining and Open APIs

Python code

from random import random

def dividedata(data,test=0.05):
  trainset=[]
  testset=[]
  for row in data:
    if random()<test:
      testset.append(row)
    else:
      trainset.append(row)
  return trainset,testset

def testalgorithm(algf,trainset,testset):
  error=0.0
  for row in testset:
    guess=algf(trainset,row['input'])
    error+=(row['result']-guess)**2
  return error/len(testset)

def crossvalidate(algf,data,trials=100,test=0.05):
  error=0.0
  for i in range(trials):
    trainset,testset=dividedata(data,test)
    error+=testalgorithm(algf,trainset,testset)
  return error/trials

Page 124: Data Mining and Open APIs

Python code


def dividedata(data,test=0.05):
  trainset=[]
  testset=[]
  for row in data:
    if random()<test:
      testset.append(row)
    else:
      trainset.append(row)
  return trainset,testset

Page 125: Data Mining and Open APIs

Python code

def testalgorithm(algf,trainset,testset):
  error=0.0
  for row in testset:
    guess=algf(trainset,row['input'])
    error+=(row['result']-guess)**2
  return error/len(testset)

Page 126: Data Mining and Open APIs

Python code

def crossvalidate(algf,data,trials=100,test=0.05):
  error=0.0
  for i in range(trials):
    trainset,testset=dividedata(data,test)
    error+=testalgorithm(algf,trainset,testset)
  return error/trials

Page 127: Data Mining and Open APIs

Problems with scale

Page 128: Data Mining and Open APIs

Scaling the data

Page 129: Data Mining and Open APIs

Scaling to zero

Page 130: Data Mining and Open APIs

Determining the best scale

Try different weights
Use the “cross-validation” method
Different ways of choosing a scale:

Range-scaling
Intuitive guessing
Optimization
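Whichever method picks the weights, applying them is just multiplying every input column by its scale factor before running kNN; a minimal sketch (assuming the same {'input': ..., 'result': ...} row format used above), after which crossvalidate can compare different weightings:

def rescale(data,scale):
  # Multiply each attribute by its weight; unimportant columns get small weights
  scaleddata=[]
  for row in data:
    scaled=[scale[i]*row['input'][i] for i in range(len(scale))]
    scaleddata.append({'input':scaled,'result':row['result']})
  return scaleddata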

Page 131: Data Mining and Open APIs

Methods covered

Regression trees
Hierarchical clustering
k-means clustering
Multidimensional scaling
Weighted k-nearest neighbors

Page 132: Data Mining and Open APIs

New projects

Openads
An open-source ad server
Users can share impression/click data
Matrix of what hits based on:

Page text
Ad
Ad placement
Search query

Can we improve targeting?

Page 133: Data Mining and Open APIs

New Projects

Finance
Analysts already drowning in info
Stories sometimes broken on blogs
Message boards show sentiment

Extremely low signal-to-noise ratio

Page 134: Data Mining and Open APIs

New Projects

Entertainment
How much buzz is a movie generating?
What psychographic profiles like this type of movie?

Of interest to studios and media investors