We Are More Than Our Features
![Page 1: We Are More Than Our Features](https://reader036.fdocuments.net/reader036/viewer/2022062301/56815757550346895dc4ff6b/html5/thumbnails/1.jpg)
We Are More Than Our Features
Craig Evans
CAS587 – Culture As Data
Project Results
4 December 2012
![Page 2: We Are More Than Our Features](https://reader036.fdocuments.net/reader036/viewer/2022062301/56815757550346895dc4ff6b/html5/thumbnails/2.jpg)
Challenge: Finding the Right Data Set
• Wide variety of data types presented
  • Global, national, local
  • Big data, personal data
• Discussed varying technologies
  • Data mining
  • Text mining
  • Machine learning
  • Visualisation
• All very abstract …
![Page 3: We Are More Than Our Features](https://reader036.fdocuments.net/reader036/viewer/2022062301/56815757550346895dc4ff6b/html5/thumbnails/3.jpg)
Motivation: Something Personal/Relatable
• Never lose sight of the data
• It's not about the technology
• Technology is a tool, not an endpoint
• Choose data that we can all see something in
• So …
![Page 4: We Are More Than Our Features](https://reader036.fdocuments.net/reader036/viewer/2022062301/56815757550346895dc4ff6b/html5/thumbnails/4.jpg)
Goal
• CAS587 is an interdisciplinary class
• We have different interests/focus – do they come out through our readings analysis?
• Analyse the writings of the CAS587 class, and see if there is any apparent trend in their writing.
![Page 5: We Are More Than Our Features](https://reader036.fdocuments.net/reader036/viewer/2022062301/56815757550346895dc4ff6b/html5/thumbnails/5.jpg)
Importance …
• To the student:
  • Who else in the class has a similar interest?
  • Who has expressed skills that are complementary?
  • Who would you reach out to to build a team later?
• To the instructor:
  • Has the right message been communicated?
  • Have your goals in educating the class been met?
• To the wider population:
  • This is an example of how data can get used in an unintended way.
  • Would you write differently if you knew the text was going to be used for this purpose?
  • Would you choose to post anonymously instead?
![Page 6: We Are More Than Our Features](https://reader036.fdocuments.net/reader036/viewer/2022062301/56815757550346895dc4ff6b/html5/thumbnails/6.jpg)
Data Appropriateness
• It is a “raw” data set
• No previous preprocessing
• It is not what the data was intended for
• It is a little “random” in nature – not a traditional structured dataset found in an online repository
![Page 7: We Are More Than Our Features](https://reader036.fdocuments.net/reader036/viewer/2022062301/56815757550346895dc4ff6b/html5/thumbnails/7.jpg)
CAS587 – The Data Set
• Starts as a PDF file
• Converted to standard ASCII text file
• Manual cleanup of data required
• Removal of heading/footer information
• Result?
  • 150 files
  • 96677 words
  • 1150150 chars
![Page 8: We Are More Than Our Features](https://reader036.fdocuments.net/reader036/viewer/2022062301/56815757550346895dc4ff6b/html5/thumbnails/8.jpg)
The Process
1. PDFs submitted to CAS587 website
2. Results exported to plain text files
   • Used trial version of publicly available PDF2Text tool
3. Results imported to database
   • MySQL
4. Results analyzed in custom Java application
   • Text parsed to individual words
   • Text stemmed using WordNet
   • tf*idf weightings used to generate keywords per person/article
   • If time permits – run sentiment analysis over corpus
5. Results returned to database
6. Results returned to Excel / visualisation tool
   • Excel is easy, but once the data is processed, I can have some fun with the visualisation
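The keyword-generation step can be sketched roughly as follows. This is a minimal Python illustration, not the custom Java application described on the slide: the function names `tokenize` and `top_keywords` are hypothetical, the toy documents are invented for the example, the base-10 logarithm is an assumption, and the WordNet stemming step is left out entirely.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic words (no stemming)."""
    return re.findall(r"[a-z]+", text.lower())


def top_keywords(documents, k=5):
    """Return the k highest-scoring tf*idf terms for each document.

    documents: dict mapping a document name to its raw text (hypothetical shape).
    """
    counts = {name: Counter(tokenize(text)) for name, text in documents.items()}

    # Document frequency: the number of documents each term occurs in.
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())

    n_docs = len(documents)
    result = {}
    for name, term_counts in counts.items():
        total = sum(term_counts.values())
        scores = {
            term: (freq / total) * math.log10(n_docs / doc_freq[term])
            for term, freq in term_counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result[name] = ranked[:k]
    return result


# Invented toy corpus, only to show the call.
docs = {
    "a": "privacy privacy policy data",
    "b": "music playlist data",
    "c": "art data colour",
}
print(top_keywords(docs, k=2))
```

A word such as "data" that appears in every document gets an idf of zero, so it drops to the bottom of each ranking, which is the behaviour the weighting is meant to provide.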
![Page 9: We Are More Than Our Features](https://reader036.fdocuments.net/reader036/viewer/2022062301/56815757550346895dc4ff6b/html5/thumbnails/9.jpg)
tf*idf … term frequency × inverse document frequency (from Wikipedia)
… a numerical statistic which reflects how important a word is to a document in a collection or corpus. It is used as a weighting factor in information retrieval and text mining. The tf*idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to control for the fact that some words are generally more common than others.
Example: Consider a document containing 100 words where the word cow appears 3 times. Following the previously defined formulas, the term frequency (TF) for cow is then (3 / 100) = 0.03. Now, assume we have 10 million documents and cow appears in one thousand of these. Then, the inverse document frequency is calculated as log(10 000 000 / 1 000) = 4. The tf*idf score is the product of these quantities: 0.03 × 4 = 0.12.
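The arithmetic in the cow example can be reproduced with a few lines of Python. This is only a sketch: the function name `tf_idf` and its parameter names are made up for illustration, and the base-10 logarithm is inferred from the example, where log(10 000) is given as 4.

```python
import math


def tf_idf(term_count, doc_length, num_docs, docs_with_term):
    """tf*idf with tf = count / document length and idf = log10(N / df)."""
    tf = term_count / doc_length
    idf = math.log10(num_docs / docs_with_term)
    return tf * idf


# The "cow" example: 3 occurrences in a 100-word document,
# 1 000 of 10 000 000 documents contain the word.
score = tf_idf(term_count=3, doc_length=100,
               num_docs=10_000_000, docs_with_term=1_000)
print(round(score, 2))
```

Note that a term occurring in every document has idf = log10(1) = 0, so its tf*idf score is zero however often it appears.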
![Page 10: We Are More Than Our Features](https://reader036.fdocuments.net/reader036/viewer/2022062301/56815757550346895dc4ff6b/html5/thumbnails/10.jpg)
The Class Week 2-6
• Week 2: What is Culture as Data?
  • filter, comparative, autism, scholarship, writer, closed, overload, library, net, outward, air, inside, coin, ecology, region
• Week 3: Social Media - culture, trends, and data
  • activism, movie, stock, flu, market, happy, mood, tweet, Trends, predict, happier, weak, Democrats, happiness, television
• Week 4: Visualization, the challenges of visualizing culture - the challenges of manipulating large amounts of data
  • visualization, template, analyst, analytic, seer, visual, computing, dot, cloud, distort, manipulate, viewer, map, lie, trap
• Week 5: Books, Music, Images, Movies
  • music, dementia, rating, alzheimer, movie, taste, playlist, political, novel, Books, musical, affiliation, preference, listen, writing
• Week 6: Data as Culture: Curating, Scrubbing, and Sampling
  • classification, hire, narrative, card, database, replicate, icd, scientific, decline, finding, poetic, viscosity, replication, solution, electronic
![Page 11: We Are More Than Our Features](https://reader036.fdocuments.net/reader036/viewer/2022062301/56815757550346895dc4ff6b/html5/thumbnails/11.jpg)
The Class Week 7-11
• Week 7: Prediction
  • customer, habit, pregnant, economy, economics, coupon, routine, cue, prediction, purchasing, evaluation, trigger
• Week 8: Personal data online. Conversations and Persistence. Interpretations of personal data.
  • Spider, thesis, speaker, oatmeal, report, communicative, annual, persona, public, email, private, eat, analyzeword, mouth, wife
• Week 9: History of Big Data Critiques
  • skull, friction, reductionism, craniology, maturity, downfall, shimmering, positivism, introspectometer, domain, inaccurate, conflict, economics, igy, dominate
• Week 10: Life After Privacy
  • obfuscation, protect, privacy, car, policy, setting, private, default, public, option, breach, anonymize, identifiable, regulation, photo
• Week 11: Art as Data; Data as Art
  • art, wind, transfinite, installation, artistic, cascade, choir, hint, visualization, rose, color, contents, flow, beautiful
![Page 12: We Are More Than Our Features](https://reader036.fdocuments.net/reader036/viewer/2022062301/56815757550346895dc4ff6b/html5/thumbnails/12.jpg)
Picking on an Individual
Craig Evans – Total Corpus
• Keywords from total corpus
  • cent, visualisation, suspect, secondary, teach, zip, irb, material, illustrate, interestingly, openly, playlist, artwork, profile, century, experience, lose, computationally, reuse
• Most negative sentiment …
  • not, lose, suspect, base, dementia, secondary, paranoid, bias, present, present, disturbing, insufficient, paranoia, difficult, number
• Most positive sentiment …
  • model, interesting, good, well, better, researcher, accurate, aware, time, time, beneficial, enable, teach, illustrate, find, method, read, add, excellent, art
![Page 13: We Are More Than Our Features](https://reader036.fdocuments.net/reader036/viewer/2022062301/56815757550346895dc4ff6b/html5/thumbnails/13.jpg)
Picking on an Individual
Craig Evans – Week 7
• Week 7: Prediction … Keywords
  • customer, habit, pregnant, economy, economics, coupon, routine, cue, prediction, purchasing, evaluation, trigger
• Keywords against rest of corpus
  • model, influence, buying, paper, predictive, joke, series, economist, valid, pregnant, resource, woman, link
• Most negative sentiment …
  • bias, difficult, not, base, nefarious, invalid, defunct, savage, hard, blue, miss, number, scale, pregnant
• Most positive sentiment …
  • model, find, color, joke, read, sound, accurate, interesting, valid, valid, privacy, improve, influence, compare, reasoning, group, improvement, absolute
![Page 14: We Are More Than Our Features](https://reader036.fdocuments.net/reader036/viewer/2022062301/56815757550346895dc4ff6b/html5/thumbnails/14.jpg)
CAS587 Wordle – Just for Karrie
![Page 15: We Are More Than Our Features](https://reader036.fdocuments.net/reader036/viewer/2022062301/56815757550346895dc4ff6b/html5/thumbnails/15.jpg)
Questions?