Michael Cole AOL Search Data 4 December 2006
Agenda
Review the time line for the AOL Search data event
Look at the data
Can we use the data?
The future of data access to the research community
The AOL Search Data
Michael [email protected]
4 December 2006
Introduce the AOL Search data.
Review the social controversy around the release of data intended for the research community.
Look at the data.
Discuss the legality and ethics of using the data in research.
Time line for event
Aug 4: The data is released
Aug 6: Several blogs pick up on release and post links to AOL Research
Aug 7: The link to the data is taken down ~11 pm EDT. AOL issues an apology calling the release a “screw-up”.
Aug 7: The data is available at several mirrors and three websites with search interfaces to the data are functioning:
http://www.aolsearchdatabase.com/
http://dontdelete.com
http://aolsearchlogs.com/
Aug 8: World Privacy Forum files a complaint with the FTC.
Aug 11: http://aolpsycho.com/ comes on line
Aug 14: Electronic Frontier Foundation files FTC complaint for violation of privacy policy.
Aug 21: The AOL CTO Maureen Govern resigns; lead researcher Chowdhury and his supervisor are “dismissed”.
Sept 21: Nielsen/NetRatings US search engine metrics report for August shows 18.2% fewer queries on AOL search compared to a year earlier. AOL was the only search engine to experience a loss in traffic.
Sept 25: Lawsuit filed in US District Court of Northern California
Data: Basic Collection Statistics
Dates: 1 March, 2006 - 31 May, 2006
Normalized queries: 36,389,567 lines of data
21,011,340 instances of new queries (w/ or w/o click-through)
7,887,022 requests for "next page" of results
19,442,629 user click-through events
16,946,938 queries w/o user click-through
10,154,742 unique (normalized) queries
657,426 unique user ID's
Source: aol.U500K_README.txt (distributed w/data set)
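The figures above can be cross-checked with a little arithmetic. The sketch below only restates the numbers from the README; the variable names are illustrative, and the per-user averages are derived here rather than taken from the data set documentation. The click-through and no-click-through counts sum exactly to the number of lines, which suggests each line is one or the other.

```python
# Figures copied from the statistics above (aol.U500K_README.txt).
lines_of_data = 36_389_567
new_queries = 21_011_340
next_page_requests = 7_887_022
click_events = 19_442_629
no_click_queries = 16_946_938
unique_queries = 10_154_742
unique_users = 657_426

# The two click counts together account for every line of data.
assert click_events + no_click_queries == lines_of_data

# Averages derived from the totals (illustrative only).
lines_per_user = lines_of_data / unique_users
new_queries_per_user = new_queries / unique_users
print(f"{lines_per_user:.1f} lines per user ID")
print(f"{new_queries_per_user:.1f} new queries per user ID")
```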
Data (cont)
During this period, AOL Search had a market share of about 6.5% according to comScore Media Metrix. The data covers about 1.5% of AOL Search activity, so this data set is (very roughly) a fraction of about 0.001, or 0.1%, of all search activity.
The raw data file in text format is about 2.2 GB.
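The rough share of all search activity is the product of the two percentages quoted above. A minimal sketch of that arithmetic, with illustrative variable names; as a fraction the result is about 0.001, which is roughly 0.1 percent.

```python
aol_market_share = 0.065   # about 6.5% of all search activity
sample_coverage = 0.015    # about 1.5% of AOL Search activity

# Fraction of all search activity represented by the data set.
fraction_of_all_search = aol_market_share * sample_coverage
print(f"fraction: {fraction_of_all_search:.6f}")
print(f"percent:  {fraction_of_all_search:.4%}")
```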
Data: tag cloud
Top queries
created using http://tagcrowd.com/
Queries as questions?
How many queries are formulated as questions to the system?
Word     Unique queries   Fraction of unique queries
Who             8717      0.0009
What           51725      0.0051
Where          10688      0.0011
When            6477      0.0006
Why             7210      0.0007
total          84817      0.0084
20487936 what amphibian did norwegian composer edvard grieg keep in his pocket and stroke whenever he needed inspiration 2006-04-13 11:54:4
20487936.playingAlongAtHome
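The slides do not say how the question counts were produced. One plausible approach, sketched here under the assumption that a query counts as a question when its first word is one of the five question words, is shown below. The function name is illustrative, and the sample queries other than the Grieg one are made up.

```python
from collections import Counter

QUESTION_WORDS = ("who", "what", "where", "when", "why")

def count_question_queries(unique_queries):
    """Count unique queries whose first token is a question word."""
    counts = Counter()
    for query in unique_queries:
        tokens = query.lower().split()
        if tokens and tokens[0] in QUESTION_WORDS:
            counts[tokens[0]] += 1
    return counts

# Hypothetical sample of unique queries (a set, so no duplicates).
sample = {
    "what amphibian did norwegian composer edvard grieg keep in his pocket "
    "and stroke whenever he needed inspiration",
    "where is georgia",
    "whole foods",          # starts with "who" but is not a question word
    "why is the sky blue",
    "myspace.com",
}
counts = count_question_queries(sample)
for word in QUESTION_WORDS:
    print(word, counts[word], round(counts[word] / len(sample), 4))
```

Matching on the whole first token, rather than on a prefix, keeps queries such as "whole foods" out of the "who" count.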
Queries involving urls
How many queries are requests to find a link to a url?
How many such requests are duplicates?
Is this evidence people use the search engines as substitutes for bookmarks?
TLD     in unique queries   in full collection   ratio (unique / full)
.com        1840127              5849400          0.31
.edu          52127               128321          0.41
.org         154821               331359          0.47
.net          78057               206650          0.38
.gov          44996               143200          0.31
.mil           6743                18036          0.37
total       2176871              6676966          0.33
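A table like the one above could be built by scanning each query for a top-level domain and counting it once per distinct query and once per occurrence. This is only a sketch: the matching rule (a dot followed by the TLD and no further letter or digit) is an assumption, not the method used for the slides, and the example log is hypothetical.

```python
import re
from collections import Counter

TLDS = ("com", "edu", "org", "net", "gov", "mil")
# Assumed rule: the TLD follows a dot and is not followed by a letter or digit.
TLD_PATTERN = re.compile(r"\.(" + "|".join(TLDS) + r")(?![a-z0-9])")

def tld_counts(queries):
    """Return (unique, full) Counters of queries mentioning each TLD."""
    unique, full = Counter(), Counter()
    seen = set()
    for query in queries:
        q = query.lower().strip()
        tlds = set(TLD_PATTERN.findall(q))
        for tld in tlds:
            full[tld] += 1            # every occurrence in the log
        if q not in seen:
            seen.add(q)
            for tld in tlds:
                unique[tld] += 1      # first time this query text is seen
    return unique, full

# Hypothetical query log with repeats.
log = ["google.com", "google.com", "www.myspace.com", "irs.gov",
       "google.com", "cheap flights"]
unique, full = tld_counts(log)
for tld in TLDS:
    if full[tld]:
        print(f".{tld}", unique[tld], full[tld], round(unique[tld] / full[tld], 2))
```

A low ratio means the same url-like queries are repeated often, which is the pattern one would expect if the search box is being used as a bookmark.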
Queries involving urls (cont.)
Of which (in the full collection):
google.com 146379
yahoo.com 176541
ask.com 19828
msnsearch.com 6
myspace.com 157599
Together, they are 0.086 of the total .com queries.
It is surprising so many urls are entered as search terms. Is this just evidence for an interface error - confusing the search box with the browser address bar?
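The 0.086 share quoted above follows from the five site counts and the .com total for the full collection. A short check, using only numbers from the slides (the variable names are illustrative):

```python
site_counts = {
    "google.com": 146_379,
    "yahoo.com": 176_541,
    "ask.com": 19_828,
    "msnsearch.com": 6,
    "myspace.com": 157_599,
}
total_com_queries = 5_849_400  # ".com" queries in the full collection

combined = sum(site_counts.values())
share = combined / total_com_queries
print(combined, round(share, 3))
```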
Privacy Issues
To date, only one user has been publicly identified. A New York Times reporter found user 4417749: Thelma Arnold, a 62-year-old woman living in Georgia.
How easy was it to identify her from the queries?
thelma
New York Times 9 Aug 2006
Information Behavior
The data contains the AOL search activity of each individual over a three-month period. This real-world, unscripted information seeking provides a window into information behavior across many domains.
Example: Searching for medical information
Which sites are used? Are they popular or authoritative?
Which terms are used? Is there an attempt to use medical terminology?
Information Behavior (cont.)
The comprehensive search log may be able to support reasonable guesses of broad demographic categories for individuals.
It is almost irresistible to connect the queries and build a description of a life.
At least one public web site uses the AOL data as the stuff of voyeurs: http://aolpsycho.com
While the activity at aolpsycho.com is not kind, the collective effort is labelling the AOL users: http://www.aolpsycho.com/tag/list
Of course, associating a person with a query is not always justified: nyt09aol.htm
AOL Search: Prepared research files
Ten fold, randomized files have been prepared. They are suitable as training and testing sets for statistical model selection and for machine learning.
randomized by ids
randomized by time
randomized with no time stratification
stratified by week (Sun - Sat) [the short weeks at beginning and end of data set are eliminated]
stratified by month
in addition, time-sorted data is available by week and month
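As an illustration of the "randomized by ids" variant, the sketch below splits records into ten folds so that all of a user's queries land in the same fold. It is not the procedure used to prepare the files; the function name, the (user_id, query) record shape, and the fixed seed are assumptions made for the example.

```python
import random
from collections import defaultdict

def folds_by_user(records, n_folds=10, seed=0):
    """Assign every record of a user to the same randomly chosen fold."""
    user_ids = sorted({user_id for user_id, _ in records})
    rng = random.Random(seed)          # fixed seed keeps the split repeatable
    rng.shuffle(user_ids)
    fold_of = {user_id: i % n_folds for i, user_id in enumerate(user_ids)}
    folds = defaultdict(list)
    for user_id, query in records:
        folds[fold_of[user_id]].append((user_id, query))
    return [folds[i] for i in range(n_folds)]

# Hypothetical records: 20 users with 3 queries each.
records = [(uid, f"query {n}") for uid in range(1, 21) for n in range(3)]
folds = folds_by_user(records)

# One fold is held out for testing, the other nine are used for training.
test_set = folds[0]
training_set = [rec for fold in folds[1:] for rec in fold]
```

Splitting by user rather than by line keeps one person's search history from appearing in both the training and the testing set.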
Can the data be used? Legal
AOL's privacy policy indicates query data can be used:
to operate and improve the Web sites, services and offerings available through the AOL Network;
to personalize the content and advertisements provided to you;
to fulfil your requests for products, programs, and services;
to communicate with you and respond to your inquiries;
to conduct research about your use of the AOL Network; and
to help offer you other products, programs, or services that may be of interest.
The disclosure was part of an AOL Research program, and so seems to be covered by the policy.
Should the data be used? Academic research public reputation
Can any valid statistics be compiled about real world queries without considering this data?
Data has already been used in research publications by AOL researchers.
So the fact of publishing results based on the data is not the issue.
At a minimum, if the data is used to check assumptions, form hypotheses etc. that would seem to be OK. But doesn't this need to be disclosed?
Can one reference use of the data even if the research is not based on the AOL data?
Should the data be used? (cont.)
AOL withdrew access to the data on 6 August 2006.
They apologized for releasing it, but have not rescinded authorization to use the data as originally stated.
A number of visible web sites using the data have been set up and are still running.
No evidence AOL has delivered take down notices.
http://www.aolsearchdatabase.com/
http://dontdelete.com
http://aolpsycho.com
http://www.seosleuth.com/site/
raw data mirrors: http://www.gregsadetsky.com/aol-data/
Impact?
Chilling effect on cooperation between academic community and commercial operations?
Special relationships to access and use data may be more critical than ever. Researchers without those relationships will be frozen out.
Developing large data sets for research use would be one response, but where are the resources for such an effort?
Using Real World Data Sets
Further anonymization techniques may be used. Potential identifiers, such as strings of digits that may be social security numbers or bank account numbers, can be replaced with random numbers or zeros. Place names can also be encoded. This could break the inference links that can lead to identification.
Does this information hiding compromise statistical work? Probably not.
... work on information behavior? Maybe not.
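The replacement of potential identifiers with zeros, described above, could be sketched as follows. The rule used here, treating any run of nine or more digits (optionally grouped by spaces or dashes) as a potential identifier, is an assumption made for illustration and would need tuning on real data; the function name and example queries are hypothetical.

```python
import re

# Assumption: nine or more digits, optionally separated by single spaces or
# dashes, are treated as a potential identifier (e.g. a social security number).
IDENTIFIER_PATTERN = re.compile(r"\b\d(?:[ -]?\d){8,}\b")

def mask_identifiers(query):
    """Replace each digit of a suspected identifier with a zero."""
    return IDENTIFIER_PATTERN.sub(
        lambda match: re.sub(r"\d", "0", match.group()), query
    )

print(mask_identifiers("ssn 123-45-6789 lookup"))   # digits become zeros
print(mask_identifiers("top 10 movies 2006"))       # short numbers are kept
```

Keeping the separators and the length of the masked string preserves the shape of the query for statistical work while removing the value itself.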