
Michael Cole AOL Search Data 4 December 2006

Agenda

Review the time line for the AOL Search data event

Look at the data

Can we use the data?

The future of data access to the research community

The AOL Search Data

Michael Cole [email protected]

4 December 2006


Introduce the AOL Search data.

Review the social controversy around the release of data intended for the research community.

Look at the data.

Discuss the legality and ethics of using the data in research.


Time line for event

Aug 4: The data is released

Aug 6: Several blogs pick up on release and post links to AOL Research

Aug 7: The link to the data is taken down ~11 pm EDT. AOL issues an apology calling the release a “screw-up”.

Aug 7: The data is available at several mirrors and three websites with search interfaces to the data are functioning:

http://www.aolsearchdatabase.com/

http://dontdelete.com

http://aolsearchlogs.com/

Aug 8: World Privacy Forum files a complaint with the FTC.

Aug 11: http://aolpsycho.com/ comes on line

Aug 14: Electronic Frontier Foundation files FTC complaint for violation of privacy policy.

Aug 21: The AOL CTO Maureen Govern resigns; lead researcher Chowdhury and his supervisor are “dismissed”.

Sept 21: Nielsen/NetRatings US search engine metrics report for August shows 18.2% fewer queries on AOL search compared to a year earlier. AOL was the only search engine to experience a loss in traffic.

Sept 25: Lawsuit filed in the US District Court for the Northern District of California


The data

statistics

samples

ML sets


Data: Basic Collection Statistics

Dates: 1 March, 2006 - 31 May, 2006

Normalized queries: 36,389,567 lines of data

21,011,340 instances of new queries (w/ or w/o click-through)

7,887,022 requests for "next page" of results

19,442,629 user click-through events

16,946,938 queries w/o user click-through

10,154,742 unique (normalized) queries

657,426 unique user IDs

Source: aol.U500K_README.txt (distributed w/data set)
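A minimal sketch of how statistics like these could be recomputed from the raw log. It assumes, and this is not stated on the slide, that each line is tab-separated with the columns AnonID, Query, QueryTime, ItemRank, ClickURL and that a non-empty ClickURL marks a click-through. The function name `collection_stats` and the sample rows are hypothetical. The final assertion only checks that the README figures above are internally consistent: click-through events plus queries without click-through equal the total number of lines.

```python
import csv
import io


def collection_stats(lines):
    """Count basic statistics over tab-separated log lines.

    Assumed columns: AnonID, Query, QueryTime, ItemRank, ClickURL.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    total = clicks = 0
    users, queries = set(), set()
    for row in reader:
        if not row or row[0] == "AnonID":  # skip blank lines and the header
            continue
        total += 1
        users.add(row[0])
        queries.add(row[1])
        if len(row) >= 5 and row[4]:  # non-empty ClickURL => click-through
            clicks += 1
    return {
        "lines": total,
        "click_throughs": clicks,
        "no_click_through": total - clicks,
        "unique_queries": len(queries),
        "unique_users": len(users),
    }


# Tiny made-up sample in the assumed format.
SAMPLE = (
    "AnonID\tQuery\tQueryTime\tItemRank\tClickURL\n"
    "1\tweather\t2006-03-01 10:00:00\t\t\n"
    "1\tweather\t2006-03-01 10:00:05\t1\thttp://example.com\n"
    "2\tmaps\t2006-03-02 09:30:00\t\t\n"
)
stats = collection_stats(io.StringIO(SAMPLE))
print(stats)

# Consistency check on the README figures quoted above.
assert 19_442_629 + 16_946_938 == 36_389_567
```

For the full file, the same function could be fed an open file handle instead of the in-memory sample.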


Data (cont)

During this period, AOL Search had a market share of about 6.5% according to comScore Media Metrix. The data covers about 1.5% of AOL Search activity, so this data set is (very roughly) 0.065 × 0.015 ≈ 0.001, or about 0.1%, of all search activity.

The raw data file in text format is about 2.2 GB


Data: tag cloud

Top queries

created using http://tagcrowd.com/


Two simple explorations


Queries as questions?

How many queries are formulated as questions to the system?

Word     Unique queries   Fraction of all unique queries

Who        8717           0.0009

What      51725           0.0051

Where     10688           0.0011

When       6477           0.0006

Why        7210           0.0007

total     84817           0.0084

20487936 what amphibian did norwegian composer edvard grieg keep in his pocket and stroke whenever he needed inspiration 2006-04-13 11:54:4

20487936.playingAlongAtHome
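A minimal sketch of how such a count might be produced. The slide does not say how a query was classified as a question; the rule used here (the first word of the normalized query is one of the five question words) is an assumption, and `count_question_queries` and the sample queries are hypothetical.

```python
from collections import Counter

QUESTION_WORDS = ("who", "what", "where", "when", "why")


def count_question_queries(unique_queries):
    """Count unique queries whose first word is a question word."""
    counts = Counter()
    for query in unique_queries:
        tokens = query.lower().split()
        if tokens and tokens[0] in QUESTION_WORDS:
            counts[tokens[0]] += 1
    return counts


# Made-up set of unique queries; "whole foods" must not count as "who".
sample = {
    "what amphibian did norwegian composer edvard grieg keep in his pocket",
    "where is georgia",
    "whole foods",
    "weather",
}
counts = count_question_queries(sample)
total = sum(counts.values())
fraction = total / len(sample)
print(dict(counts), total, fraction)
```

Matching on whole tokens rather than string prefixes avoids counting queries such as "whole foods" or "whyte" as questions.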


Queries involving urls

How many queries are requests to find a link to a url?

How many such requests are duplicates?

Is this evidence people use the search engines as substitutes for bookmarks?

TLD      in unique queries   in full collection   ratio

.com     1840127             5849400              0.31

.edu       52127              128321              0.41

.org      154821              331359              0.47

.net       78057              206650              0.38

.gov       44996              143200              0.31

.mil        6743               18036              0.37

total    2176871             6676966              0.33
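The table above could be produced along these lines. This is a sketch under assumptions: the slide does not give the matching rule, so a query is counted for a TLD here if it contains, for example, ".com" followed by a word boundary. The names `TLD_PATTERN` and `tld_counts` and the sample queries are hypothetical.

```python
import re
from collections import Counter

# Assumed rule: ".com", ".edu", ... followed by a word boundary.
TLD_PATTERN = re.compile(r"\.(com|edu|org|net|gov|mil)\b")


def tld_counts(queries):
    """Return (counts over unique queries, counts over all queries)."""
    unique, full = Counter(), Counter()
    seen = set()
    for query in queries:
        tlds = set(TLD_PATTERN.findall(query.lower()))
        first_time = query not in seen
        seen.add(query)
        for tld in tlds:
            full[tld] += 1
            if first_time:
                unique[tld] += 1
    return unique, full


# Made-up query stream with one duplicate.
sample = ["google.com", "google.com", "www.myspace.com", "harvard.edu", "weather"]
unique, full = tld_counts(sample)
ratios = {tld: unique[tld] / full[tld] for tld in full}
print(unique, full, ratios)
```

A low ratio of unique to total counts means the same url-like query is issued repeatedly, which is the pattern the bookmark question asks about.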


Queries involving urls (cont.)

Of which (in the full collection):

google.com 146379

yahoo.com 176541

ask.com 19828

msnsearch.com 6

myspace.com 157599

Together, they account for a fraction of 0.086 (about 8.6%) of the total .com queries.

It is surprising so many urls are entered as search terms. Is this just evidence for an interface error - confusing the search box with the browser address bar?


Privacy Issues

To date, only one user has been publicly identified. A New York Times reporter found user 4417749: Thelma Arnold, a 62-year-old woman living in Georgia.

How easy was it to identify her from the queries?

thelma

New York Times 9 Aug 2006


Information Behavior

The data contains the AOL search activity of each individual over a three-month period. This real-world, unscripted information seeking provides a window into information behavior across many domains.

Example: Searching for medical information

Which sites are used? Are they popular/authoritative?

Which terms are used? Is there an attempt to use medical terminology?


Information Behavior (cont.)

The comprehensive search log may be able to support reasonable guesses of broad demographic categories for individuals.


It is almost irresistible to connect the queries and build a description of a life.

At least one public web site uses the AOL data as the stuff of voyeurs: http://aolpsycho.com

While the activity at aolpsycho.com is not kind, the collective effort is labelling the AOL users: http://www.aolpsycho.com/tag/list

Of course, associating a person with a query is not always justified: nyt09aol.htm


AOL Search: Prepared research files

Ten-fold randomized files have been prepared. They are suitable as training and testing sets for statistical model selection and for machine learning.

randomized by ids

randomized by time

randomized with no time stratification

stratified by week (Sun - Sat) [the short weeks at beginning and end of data set are eliminated]

stratified by month

in addition, time-sorted data is available by week and month
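As an illustration of the "randomized by ids" option, the following sketch splits records into ten folds so that all of a user's queries land in the same fold. It is not the procedure used to prepare the files, which the slide does not describe; the function `ten_fold_by_user`, the fixed seed, and the made-up records are assumptions.

```python
import random


def ten_fold_by_user(records, n_folds=10, seed=0):
    """Assign each user ID to one fold; keep a user's records together."""
    user_ids = sorted({record[0] for record in records})
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(user_ids)
    fold_of = {uid: i % n_folds for i, uid in enumerate(user_ids)}
    folds = [[] for _ in range(n_folds)]
    for record in records:
        folds[fold_of[record[0]]].append(record)
    return folds


# Made-up records: (user ID, query) for 20 users with 2 queries each.
records = [(uid, f"query {k}") for uid in range(20) for k in range(2)]
folds = ten_fold_by_user(records)
print([len(fold) for fold in folds])
```

Splitting by user rather than by line keeps one person's search history out of both the training and the testing set, which matters when a model could otherwise memorize individual users.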


Can the data be used?

Legal

AOL's privacy policy indicates query data can be used:

to operate and improve the Web sites, services and offerings available through the AOL Network;

to personalize the content and advertisements provided to you;

to fulfil your requests for products, programs, and services;

to communicate with you and respond to your inquiries;

to conduct research about your use of the AOL Network; and

to help offer you other products, programs, or services that may be of interest.

The disclosure was part of an AOL Research program, and so seems to be covered by the policy.


Should the data be used?

Academic research public reputation

Can any valid statistics be compiled about real world queries without considering this data?

Data has already been used in research publications by AOL researchers.

So the fact of publishing results based on the data is not the issue.

At a minimum, if the data is used to check assumptions, form hypotheses etc. that would seem to be OK. But doesn't this need to be disclosed?

Can one reference use of the data even if the research is not based on the AOL data?


Should the data be used? (cont.)

AOL withdrew access to the data on 7 August 2006.

They apologized for releasing it, but have not rescinded authorization to use the data as originally stated.

A number of visible web sites using the data have been set up and are still running.

No evidence AOL has delivered take-down notices.

http://www.aolsearchdatabase.com/

http://dontdelete.com

http://aolpsycho.com

http://www.seosleuth.com/site/

raw data mirrors: http://www.gregsadetsky.com/aol-data/


Impact?

Chilling effect on cooperation between academic community and commercial operations?

Special relationships to access and use data may be more critical than ever. Researchers without those relationships will be frozen out.

Developing large data sets for research use would be one response, but where are the resources for such an effort?


Using Real World Data Sets

Further anonymization techniques may be used: replacing potential identifiers, such as strings of digits that may be social security or bank account numbers, with random numbers or zeros. Place names can also be encoded. This could break the inference links that can lead to identification.
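A rough sketch of what such scrubbing could look like. Everything here is an assumption made for illustration: the threshold (runs of at least seven digits and hyphens), the choice of zeros over random digits, the `PLACE_CODES` table, and the function name `scrub`. A real pipeline would need a proper gazetteer and a more careful definition of what counts as an identifier.

```python
import re

# Assumed rule: a run of 7+ digits/hyphens that starts and ends with a digit.
LONG_NUMBER = re.compile(r"\d[\d-]{5,}\d")

# Hypothetical place-name table; a real one would come from a gazetteer.
PLACE_CODES = {"georgia": "PLACE_1"}


def scrub(query):
    """Zero out long digit runs and replace known place names with codes."""
    query = LONG_NUMBER.sub(lambda m: re.sub(r"\d", "0", m.group()), query)
    words = [PLACE_CODES.get(word, word) for word in query.split()]
    return " ".join(words)


print(scrub("ssn 123-45-6789 georgia"))
print(scrub("route 66 diner"))
```

Note that this rule also zeroes out dates written as digit runs (for example 2006-04-13), which is one concrete way such information hiding could affect later analysis.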

Does this information hiding compromise statistical work? Probably not.

... work on information behavior? Maybe not.

