Rules of Thumb for Information Acquisition from Large and Redundant Data
description
Transcript of Rules of Thumb for Information Acquisition from Large and Redundant Data
Rules of Thumb for Information Acquisition
from Large and Redundant Data
Wolfgang Gatterbauer
http://UniqueRecall.comDatabase groupUniversity of Washington
Version April 21, 2011
33rd European Conference on Information Retrieval (ECIR'11)
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com
Information Acquisition from Redundant DataPareto principle (80-20 rule) 20% causes 80% effect
e.g. business clients salese.g. software bugs errorse.g. health care patients HC resources
Information acquisition ? 20% data(instances)
? 80% information(concepts)
e.g. words in a corpus all words different wordse.g. used first names individual names different names
e.g. web harvesting web data web information
Motivating question:Can we learn 80% of the information,by looking at only 20% of the data?
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 3
Information Acquisition from Redundant Data
Au A B Bu"Unique"
informationAvailable
DataRetrieved and extracted data
Acquired information
InformationIntegration
Information Retrieval, Information Extraction
InformationDissemination
Information Acquisition
Redundancydistribution k Recall r Expected sample
distribution k
Expected unique recall ru
Three assumptions• no disambiguity in data• random sampling w/o replacement• very large data sets
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 4
Outline
• A horizontal sampling model• The role of power-laws• Real data & Discussion
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com
A Simple Balls-and-Urn Sampling Model
Redundancy Distribution kk=(6,3,3,2,1)
Information i
Redundancy ki
Data
1 2 3 4 5
1
5
6
4
3
2
frequency of i-th most often appearing information (color)
(# colors: au=5)
(# balls: a=15)
Redundancy ki
1 2 3 4 5
1
5
6
4
3
2
Sampled Data(# balls: b=3)
Sampled Information i(# Colors: bu=2)
Recall rr=3/15=0.2
Unique recall ru=2/5=0.4Sample Redundancy
distribution k=(2,1)
5
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com
A model for sampling in the Limit of large data sets
1 2 3 4 51
56
432
6
1 3 5 7 91
56
432
2 4 6 8 10k=(6,3,3,2,1) k=(6,6,3,3,3,3,2,2,1,1)
1 2 3 4 51
56
432
a=(0.2,0.2,0.4,0,0,0.2)
a=15 a∞a=30
k2-3=3
a3=0.4
vertical perspective horizontal perspectivek3-6=3
a3=0.4 a3=0.4
...
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com
The Intuition for constant redundancy k
0 1
ru = 1-(1-0.5)3=0.875
ru
1
2
3
k=const=3
7
Unique recall
2
Expected sample distribution
2=(3 choose 2)0.52(1-0.5)1=0.375 Binomial distribution
Indep. sampling with p=r
lima∞
r=0.5
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com
The Intuition for arbitrary redundancy distributions
a3=0.4
a6= 0.2
a2= 0.2
a1= 0.2
1
2
3
5
6
4
Redundancy k
0.2 0.60 0.8 1
8
ru= a6[ 1-(1-r)6 ] + a3[ 1-(1-r)3 ] + a3[ 1-(1-r)2 ] + a1[ 1-(1-r)1 ]
Stratified sampling
lima∞
r=0.5
2
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com
A horizontal Perspective for Sampling
9
au=5 au=20
au=100 au=1000
Horizontal layer of redundancy 1=0.8
Expected sampleredundancy k1=3
r=0.5
bu=800
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 10
Unique Recall for arbitrary redundancy distributions
10
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 11
Outline
• A horizontal sampling model• The role of power-laws• Real data & Discussion
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 12
Three formulations of Power law distributions
12
Stumpf et al. [PNAS’05]
Mitzenmacher [Internet Math.’04]
Adamic [TR’00]
Mitzenmacher [IM’04]
redundancy kiredundancy frequency ak
complementarycumulativefrequency k
Zipf-Mandelbrot ParetoPower-law probability Zipf [1932]
Commonly assumed to be different represen-tations of the same distribution
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com
Unique Recall with Power laws
13 13
For =1log-log plot
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com
Unique Recall with Power laws
14 14
Rule of Thumb 1: When sampling 20% of data from a Power-law distribution, we expect to learn less than 40% of the information
For =1
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com
Invariants under Sampling
15 15
Given our model: Which redundancy distribution remains invariant under sampling?
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 16
Invariant is a power-law! Hence, the tail of allpower laws remains invariant under sampling
16
ruk=k/k
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com
Also, the power law tail breaks in
17 17
Rule of Thumb 2: When sampling data from a Power-law then the core of the sample distribution follows
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 18
Outline
• A horizontal sampling model• The role of power-laws• Real data & Discussion
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 19
Sampling from Real DataIncoming links for one domain
Tag distribution on delicious.com
Rule of Thumb 1:not good!
Rule of Thumb 1:perfect!
Rule of Thumb 2:works, but only for small area
Rule of Thumb 2:works!
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com
Theory on Sampling from Power-Laws
Stumpf et al. [PNAS’05]
"Here, we show that random subnets sampled from scale-free networks are not themselves scale-free."
This paper
"Here, we show that there is one power-law family that is invariant under sampling, and the core of other power-law function remains invariant under sampling too.
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com
Some other related workPopulation sampling
Downey et al. [IJCAI’05]
Ipeirotis et al. [Sigmod’06]
Stumpf et al. [PNAS’05]
• urn model to estimate the probability that extracted information is correct.
• random sampling with replacement
• show that random subnets sampled from scale-free networks are not scale-free
• decision framework to search or to crawl• random sampling without replacement with
known population sizes
• goal: estimate size of population• e.g. mark-recapture animal sampling• sampling of small fraction w/ replacement
Bar-Yossef, Gurevich [WWW’06]
• biased methods to sample from search engine's index to estimate index size
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 22
Summary 1/2 Au A B Bu
Information AvailableData
Retrieved Data
Acquired information
Inf. IntegrationIR & IEInf. Dissemination
Recall rUnique recall ru
• A simple model ofinformation acquisition from redundant data
• Full analytic solution
Inf. Acquisition
- no disambiguity- random sampling
w/o replacementRedundancy
distribution k
- large data
Normalized Redundancy
distribution a
ru(r)
ruk(r)
ruk=k/k
Sampledistribution k
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 23
Summary 2/2
• Rule of thumb 1:
• Rule of thumb 2:
- 80/20 40/20
- power-law coreremains invariant
Unique recall for power-laws3 different power-laws
Sampling from power-lawsInvariant distribution
ruk r
- sensitive to exact power-law root
http://uniqueRecall.com
24
backup
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com
Geometric interpretation of k(, k, r)
25
BACKUP
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 26
Information Acquisition from Redundant Data3 pieces of data, containing 2 pieces of (“unique”) information*
Capurro, Hjørland [ARIST’03]
e.g. words of a corpus:word appearances / vocabulary
e.g. used first names in a groupindividual names / different names
e.g. web harvesting:web data / web information
Motivating question:Can we learn 80% of the information,by looking at only 20% of the data?
Data(instances)
Information(concepts)
*data interpreted as redundant representation of information