Making Sense of Data Big and Small


Transcript of Making Sense of Data Big and Small

Page 1: Making Sense of Data Big and Small

Bruno Gonçalves www.bgoncalves.com


Page 2: Making Sense of Data Big and Small


Page 3: Making Sense of Data Big and Small


Page 4: Making Sense of Data Big and Small


Big Data

The Unreasonable Effectiveness of Data
Alon Halevy, Peter Norvig, and Fernando Pereira, Google (IEEE Intelligent Systems, Expert Opinion, 2009)

Eugene Wigner's article "The Unreasonable Effectiveness of Mathematics in the Natural Sciences"1 examines why so much of physics can be neatly explained with simple mathematical formulas such as f = ma or e = mc2. Meanwhile, sciences that involve human beings rather than elementary particles have proven more resistant to elegant mathematics. Economists suffer from physics envy over their inability to neatly model human behavior. An informal, incomplete grammar of the English language runs over 1,700 pages.2 Perhaps when it comes to natural language processing and related fields, we're doomed to complex theories that will never have the elegance of physics equations. But if that's so, we should stop acting as if our goal is to author extremely elegant theories, and instead embrace complexity and make use of the best ally we have: the unreasonable effectiveness of data.

One of us, as an undergraduate at Brown University, remembers the excitement of having access to the Brown Corpus, containing one million English words.3 Since then, our field has seen several notable corpora that are about 100 times larger, and in 2006, Google released a trillion-word corpus with frequency counts for all sequences up to five words long.4 In some ways this corpus is a step backwards from the Brown Corpus: it's taken from unfiltered Web pages and thus contains incomplete sentences, spelling errors, grammatical errors, and all sorts of other errors. It's not annotated with carefully hand-corrected part-of-speech tags. But the fact that it's a million times larger than the Brown Corpus outweighs these drawbacks. A trillion-word corpus—along with other Web-derived corpora of millions, billions, or trillions of links, videos, images, tables, and user interactions—captures even very rare aspects of human behavior. So, this corpus could serve as the basis of a complete model for certain tasks—if only we knew how to extract the model from the data.

Learning from Text at Web Scale
The biggest successes in natural-language-related machine learning have been statistical speech recognition and statistical machine translation. The reason for these successes is not that these tasks are easier than other tasks; they are in fact much harder than tasks such as document classification that extract just a few bits of information from each document. The reason is that translation is a natural task routinely done every day for a real human need (think of the operations of the European Union or of news agencies). The same is true of speech transcription (think of closed-caption broadcasts). In other words, a large training set of the input-output behavior that we seek to automate is available to us in the wild. In contrast, traditional natural language processing problems such as document classification, part-of-speech tagging, named-entity recognition, or parsing are not routine tasks, so they have no large corpus available in the wild. Instead, a corpus for these tasks requires skilled human annotation. Such annotation is not only slow and expensive to acquire but also difficult for experts to agree on, being bedeviled by many of the difficulties we discuss later in relation to the Semantic Web. The first lesson of Web-scale learning is to use available large-scale data rather than hoping for annotated data that isn't available. For instance, we find that useful semantic relationships can be automatically learned from the statistics of search queries and the corresponding results5 or from the accumulated evidence of Web-based text patterns and formatted tables,6 in both cases without needing any manually annotated data.

Page 5: Making Sense of Data Big and Small


From Data To Information

Statistics are everywhere

Everywhere you look you can find statistics, whether you're browsing the Internet, playing sports, or looking through the top scores of your favorite video game. But what actually is a statistic?

Statistics are numbers that summarize raw facts and figures in some meaningful way. They present key ideas that may not be immediately apparent by just looking at the raw data, and by data, we mean facts or figures from which we can draw conclusions. As an example, you don’t have to wade through lots of football scores when all you want to know is the league position of your favorite team. You need a statistic to quickly give you the information you need.

The study of statistics covers where statistics come from, how to calculate them, and how you can use them effectively.

Gather data: At the root of statistics is data. Data can be gathered by looking through existing sources, by conducting experiments, or by conducting surveys.

Analyze: Once you have data, you can analyze it and generate statistics. You can calculate probabilities to see how likely certain events are, test ideas, and indicate how confident you are about your results.

Draw conclusions: When you've analyzed your data, you make decisions and predictions.

Page 6: Making Sense of Data Big and Small


Data Science

[Venn diagram with three overlapping circles: Hacking Skills, Domain Knowledge, and Statistics. Machine Learning sits in the Hacking + Statistics overlap, Traditional Research in the Statistics + Domain Knowledge overlap, and the Danger Zone! in the Hacking + Domain Knowledge overlap. Data Science is the intersection of all three.]

Page 7: Making Sense of Data Big and Small


Data Science

[Venn diagram with overlapping circles labelled Data Nerds, Art Nerds, and Stats Nerds; the overlap regions are labelled Data Mining, Visualization, GUI Programmers, and High Salaries.]

Page 8: Making Sense of Data Big and Small


Data Science

Data Scientist: The Sexiest Job of the 21st Century

Meet the people who can coax treasure out of messy, unstructured data. by Thomas H. Davenport and D.J. Patil

When Jonathan Goldman arrived for work in June 2006 at LinkedIn, the business networking site, the place still felt like a start-up. The company had just under 8 million accounts, and the number was growing quickly as existing members invited their friends and colleagues to join. But users weren't seeking out connections with the people who were already on the site at the rate executives had expected. Something was apparently missing in the social experience. As one LinkedIn manager put it, "It was like arriving at a conference reception and realizing you don't know anyone. So you just stand in the corner sipping your drink—and you probably leave early."

Harvard Business Review, October 2012 (Spotlight on Big Data)

Page 9: Making Sense of Data Big and Small


“Zero is the most natural number” (E. W. Dijkstra)

Count!

• How many items do we have?

Page 10: Making Sense of Data Big and Small


Descriptive Statistics

Min, Max

Mean:

\mu = \frac{1}{N} \sum_{i=1}^{N} x_i

Standard Deviation:

\sigma = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 }
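A minimal sketch of these quantities with NumPy (the data values here are made up for illustration):

import numpy as np

x = np.array([4.0, 8.0, 15.0, 16.0, 23.0, 42.0])   # hypothetical sample

print("min  =", x.min())
print("max  =", x.max())
print("mean =", x.mean())   # mu = (1/N) sum_i x_i
print("std  =", x.std())    # sigma, with the same 1/N convention as above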

Page 11: Making Sense of Data Big and Small


Anscombe's Quartet

 x1     y1      x2     y2      x3     y3      x4     y4
10.0   8.04    10.0   9.14    10.0   7.46     8.0   6.58
 8.0   6.95     8.0   8.14     8.0   6.77     8.0   5.76
13.0   7.58    13.0   8.74    13.0  12.74     8.0   7.71
 9.0   8.81     9.0   8.77     9.0   7.11     8.0   8.84
11.0   8.33    11.0   9.26    11.0   7.81     8.0   8.47
14.0   9.96    14.0   8.10    14.0   8.84     8.0   7.04
 6.0   7.24     6.0   6.13     6.0   6.08     8.0   5.25
 4.0   4.26     4.0   3.10     4.0   5.39    19.0  12.50
12.0  10.84    12.0   9.13    12.0   8.15     8.0   5.56
 7.0   4.82     7.0   7.26     7.0   6.42     8.0   7.91
 5.0   5.68     5.0   4.74     5.0   5.73     8.0   6.89

All four data sets share (almost) the same summary statistics: \mu_x = 9, \mu_y = 7.50, variance of x = 11, variance of y ≈ 4.125, correlation = 0.816, and linear fit y = 3 + 0.5x.

[Four scatter plots, one per data set, each shown with the same fitted line y = 3 + 0.5x on identical axes, despite the very different shapes of the data.]
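A quick check, not part of the slides, that the four sets really do share (almost) the same summary statistics and fit despite looking nothing alike:

import numpy as np

x123 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])
y3 = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)
y4 = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])

for x, y in [(x123, y1), (x123, y2), (x123, y3), (x4, y4)]:
    slope, intercept = np.polyfit(x, y, 1)   # least-squares straight line
    r = np.corrcoef(x, y)[0, 1]              # Pearson correlation
    print(f"mean_x={x.mean():.2f}  mean_y={y.mean():.2f}  r={r:.3f}  "
          f"fit: y = {intercept:.2f} + {slope:.2f} x")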

Page 12: Making Sense of Data Big and Small


Central Limit Theorem

• As n \to \infty, the random variables \sqrt{n} \, (S_n - \mu), with S_n = \frac{1}{n} \sum_i x_i, converge to a normal distribution N(0, \sigma^2).

• After some manipulation, we find:

S_n \sim \mu + \frac{N(0, \sigma^2)}{\sqrt{n}}

• The estimate of the mean converges to the true mean with the square root of the number of samples:

SE = \frac{\sigma}{\sqrt{n}}
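A small simulation, with an assumed non-Gaussian (exponential) sample just for illustration, showing the spread of the sample mean shrinking like sigma/sqrt(n):

import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0   # an exponential with scale 2 has standard deviation 2
for n in [10, 100, 1000, 10000]:
    means = rng.exponential(scale=sigma, size=(5000, n)).mean(axis=1)
    print(f"n={n:5d}  empirical SE={means.std():.4f}  sigma/sqrt(n)={sigma/np.sqrt(n):.4f}")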

Page 13: Making Sense of Data Big and Small


Gaussian Distribution - Maximally Entropic

P_N(x; \mu, \sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}} \, e^{-\frac{(x - \mu)^2}{2 \sigma^2}}

Page 14: Making Sense of Data Big and Small


Broad-tailed distributions

P_p(k; \gamma) = \frac{1}{C} \, k^{-\gamma}

Page 15: Making Sense of Data Big and Small


Broad-tailed distributions

P_p(k; \gamma) = \frac{1}{C} \, k^{-\gamma}

Page 16: Making Sense of Data Big and Small


(Almost) Everyone is below average!

[Sketch of a broad-tailed distribution with the mean \mu marked: 80% of the values lie below \mu, 20% above.]

Page 17: Making Sense of Data Big and Small


Outliers

• “Bill Gates walks into a bar and, on average, every patron is a millionaire…”

• “…but the median remains the same.”

• Median: “the value that separates the lower 50% of the distribution from the higher 50%”

• Example: 1 1 1 2 2 2 1000 → Mean = 144.14, Median = 2
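Checking the numbers on this slide directly:

import numpy as np

patrons = np.array([1, 1, 1, 2, 2, 2, 1000])   # "Bill Gates walks into a bar"
print(np.mean(patrons))     # 144.14...
print(np.median(patrons))   # 2.0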

Page 18: Making Sense of Data Big and Small


Quantiles

• Quantiles - Points taken at regular intervals of the cumulative distribution function

• Quartiles - Ranked set of points that divide the range into 4 equal intervals (the 25%, 50%, and 75% quantiles)

Page 19: Making Sense of Data Big and Small


Box and whiskers plot

[Figure: panels A-F of box-and-whisker plots versus time (0, 7, 14, 21 days): number of infected countries, number of cases outside the target country (log scale), and probability of international outbreak (%).]

• Show the variation of the data for each bin.

• More informative than just averages or medians.

• Useful to summarize experimental measurements, simulation results, natural variations, etc. when fluctuations are important.

Page 20: Making Sense of Data Big and Small


Reference Range

BMC Medicine 2009, 7:45, http://www.biomedcentral.com/1741-7015/7/45

[Figure 3 of the paper: delay effect induced by the use of antiviral drugs for treatment, with 30% case detection and drug administration. (a) Peak times of the epidemic activity in the worst-case scenario and in the antiviral-treatment scenario for a set of Northern hemisphere countries, with 95% confidence intervals. (b, c) Incidence profiles for Spain and Germany in the two scenarios; the mitigation yields a delay of about 4 weeks.]

Median, with the 95% reference range (RR) bounded by the 2.5% and 97.5% percentiles.

• Useful for continuous curves

• Indicates level of certainty: “95% of the cases are in this range”

Page 21: Making Sense of Data Big and Small


Tools For Statistical Analysis

Name                Advantages                                         Disadvantages                               Open Source
R                   Library support and visualization                  Steep learning curve                        Yes
Matlab              Native matrix support, visualization               Expensive, incomplete statistics support    No
Scientific Python   Ease and simplicity                                Heavy development                           Yes
Excel               Easy, visual, flexible                             Large datasets                              No
SAS                 Large datasets                                     Expensive, outdated programming language    No
Stata               Easy statistical analysis                                                                      No
SPSS                Like Stata, but more expensive and less flexible                                               No

Page 22: Making Sense of Data Big and Small


Correlations

Page 23: Making Sense of Data Big and Small


Pearson (Linear) Correlation Coefficient

• Does increasing one variable also increase the other?

\rho = \frac{\sum_{i=1}^{N} (x_i - \mu_X)(y_i - \mu_Y)}{N \, \sigma_X \sigma_Y}
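A minimal sketch, on made-up data, comparing the formula above with NumPy's built-in corrcoef:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # roughly y = 2x

rho = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) * x.std() * y.std())
print(rho, np.corrcoef(x, y)[0, 1])       # the two values agree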

Page 24: Making Sense of Data Big and Small


R^2

• The square of the Pearson correlation between the data and the fit.

• The fraction of the variance of the data that is explained by the “model”.

Page 25: Making Sense of Data Big and Small


Spearman Rank Correlation

• Equivalent to the Pearson correlation coefficient of the ranked variables

• d_i is the difference in ranks of the i-th observation

• Less sensitive to outliers, as values are limited by their rank

\rho = 1 - \frac{6 \sum_{i=1}^{N} d_i^2}{N (N^2 - 1)}
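A sketch, on made-up data with one large outlier, comparing the rank-difference formula above with SciPy's spearmanr:

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.5, 2.0, 3.5, 4.0, 5.5, 600.0])   # the outlier barely matters once ranked

d = stats.rankdata(x) - stats.rankdata(y)        # difference in ranks
n = len(x)
rho_formula = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))
rho_scipy, p_value = stats.spearmanr(x, y)
print(rho_formula, rho_scipy)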

Page 26: Making Sense of Data Big and Small


Causation

Page 27: Making Sense of Data Big and Small


Probability

[The whole sample space has total probability p = 1.]

Page 28: Making Sense of Data Big and Small


Probability

P(A) = Area of A

[Venn diagram: events A and B inside a sample space of total area p = 1.]

Page 29: Making Sense of Data Big and Small


Probability

P(A) = Area of A

P(A or B) = P(A) + P(B)

[Venn diagram: events A and B inside a sample space of total area p = 1.]

Page 30: Making Sense of Data Big and Small


Probability

P(A) = Area of A

P(A or B) = P(A) + P(B) - P(A and B)

P(A and B) = overlap of A and B

[Venn diagram: overlapping events A and B inside a sample space of total area p = 1.]

Page 31: Making Sense of Data Big and Small


Probability

P(A) = Area of A

P(A or B) = P(A) + P(B) - P(A and B)

P(A and B) = overlap of A and B

P(B|A) = P(A and B) / P(A)

P(A|B) = P(B|A) P(A) / P(B)

[Venn diagram: overlapping events A and B inside a sample space of total area p = 1.]

Page 32: Making Sense of Data Big and Small


Bayes Theorem

P(A) = Area of A

P(A or B) = P(A) + P(B) - P(A and B)

P(A and B) = overlap of A and B

P(B|A) = P(A and B) / P(A)

P(A|B) = P(B|A) P(A) / P(B)

P(B) = P(B|A) P(A) + P(B|¬A) P(¬A)

[Venn diagram: overlapping events A and B inside a sample space of total area p = 1.]

Page 33: Making Sense of Data Big and Small


Medical Tests

Your doctor thinks you might have a rare disease that affects 1 person in 10,000. A test that is 99% accurate comes out positive. What’s the probability of you having the disease?

Bayes Theorem:

P(disease | positive test) = \frac{P(positive test | disease) \, P(disease)}{P(positive test)}

Total Probability:

P(positive test) = P(positive test | disease) P(disease) + P(positive test | no disease) P(no disease)

Finally:

P(disease | positive test) = 0.0098

Page 34: Making Sense of Data Big and Small


Medical Tests

Your doctor thinks you might have a rare disease that affects 1 person in 10,000. A test that is 99% accurate comes out positive. What’s the probability of you having the disease?

Bayes Theorem:

P(disease | positive test) = \frac{P(positive test | disease) \, P(disease)}{P(positive test)}

Total Probability:

P(positive test) = P(positive test | disease) P(disease) + P(positive test | no disease) P(no disease)

Finally:

P(disease | positive test) = 0.0098

Base Rate Fallacy: a low base rate combined with a non-zero false positive rate.

Page 35: Making Sense of Data Big and Small


Consider a population of 1,000,000 individuals. The numbers we should expect are:

Medical Tests

              disease    no disease
positive          99         9,999       10,098
negative           1       989,901      989,902
                 100       999,900    1,000,000

Page 36: Making Sense of Data Big and Small


Consider a population of 1,000,000 individuals. The numbers we should expect are:

Medical Tests

              disease    no disease
positive          99         9,999       10,098
negative           1       989,901      989,902
                 100       999,900    1,000,000

(The row and column totals are the marginals.)

Page 37: Making Sense of Data Big and Small


Consider a population of 1,000,000 individuals. The numbers we should expect are:

Medical Tests

              disease    no disease
positive          99         9,999       10,098
negative           1       989,901      989,902
                 100       999,900    1,000,000

(The row and column totals are the marginals.)

P(disease | negative test is ruled out; given a positive test) = \frac{TP}{TP + FP} = 0.0098

P(no disease | negative test) = \frac{TN}{TN + FN} = 0.99999

Page 38: Making Sense of Data Big and Small


(Confusion Matrix)

                 Feature
Test          positive   negative
positive         TP         FP
negative         FN         TN

accuracy = \frac{TP + TN}{TP + TN + FP + FN}

precision = \frac{TP}{TP + FP}

sensitivity = \frac{TP}{TP + FN}

specificity = \frac{TN}{FP + TN}

F_1 = \frac{2\,TP}{2\,TP + FP + FN}   (harmonic mean of precision and sensitivity)
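A small helper computing these metrics from the counts in the medical-test table above (TP, FP, FN, TN as defined there):

TP, FP, FN, TN = 99, 9_999, 1, 989_901

accuracy    = (TP + TN) / (TP + TN + FP + FN)
precision   = TP / (TP + FP)                 # = P(disease | positive test)
sensitivity = TP / (TP + FN)                 # also called recall
specificity = TN / (FP + TN)
f1          = 2 * TP / (2 * TP + FP + FN)    # harmonic mean of precision and sensitivity

print(accuracy, precision, sensitivity, specificity, f1)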

Page 39: Making Sense of Data Big and Small


A second Test

Bayes Theorem still looks the same:

P(disease | positive test) = \frac{P(positive test | disease) \, P(disease)}{P(positive test)}

but now the probability that we have the disease has been updated:

P^{\dagger}(disease) = 0.0098

So this time we find:

P^{\dagger}(disease | positive test) = 0.4949

Each test provides new evidence, and Bayes theorem simply tells us how to use it to update our beliefs.
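A sketch of this repeated update in a few lines of Python, assuming the same numbers as the slides (prior 1/10,000, 99% accurate test):

def update(prior, p_pos_given_disease=0.99, p_pos_given_healthy=0.01):
    # Posterior probability of disease after one more positive test
    p_pos = p_pos_given_disease * prior + p_pos_given_healthy * (1 - prior)
    return p_pos_given_disease * prior / p_pos

p = 1 / 10_000
for test in range(1, 4):
    p = update(p)
    print(f"after positive test {test}: P(disease) = {p:.4f}")
# roughly 0.0098 after the first positive test and 0.49 after the second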

Page 40: Making Sense of Data Big and Small


Bayesian Coin Flips

• Biased coin with unknown probability of heads (p)

• Perform N flips and update our belief after each flip using Bayes Theorem

P(p | heads) = \frac{P(heads | p) \, P(p)}{P(heads)}

P(p | tails) = \frac{P(tails | p) \, P(p)}{P(tails)}

http://youtu.be/GTx0D8VY0CY

Page 41: Making Sense of Data Big and Small


Bayesian Coin Flips

• Biased coin with unknown probability of heads (p)

• Perform N flips and update our belief after each flip using Bayes Theorem

import numpy as np

bins = 100                                   # assumed resolution of the grid over p
flips = np.random.random(1000) < 0.7         # assumed: 1000 flips of a coin with p = 0.7

# Uninformative prior over the possible values of p
prior = np.ones(bins, dtype='float') / bins
likelihood_heads = np.arange(bins) / float(bins)
likelihood_tails = 1 - likelihood_heads

for coin in flips:
    if coin:   # Heads
        posterior = prior * likelihood_heads
    else:      # Tails
        posterior = prior * likelihood_tails

    # Normalize
    posterior /= np.sum(posterior)

    # The posterior is now the new prior
    prior = posterior

P(p | heads) = \frac{P(heads | p) \, P(p)}{P(heads)}

P(p | tails) = \frac{P(tails | p) \, P(p)}{P(tails)}

http://youtu.be/GTx0D8VY0CY

Page 42: Making Sense of Data Big and Small


Naive Bayes Classifier

• Let’s consider spam detection for a second. Suppose you know P(spam | word_i) and P(not spam | word_i): the probability that an email is spam given that it contains a specific word.

• But how can you determine the probability that an entire email (a set of words) is spam, P(spam | word_1, word_2, \dots, word_n)?

• You can simply assume that all the probabilities are independent:

P(spam | word_1, word_2, \dots, word_n) = \prod_i P(spam | word_i)

• This is known as Naive Bayes and is surprisingly effective in many real-world contexts.

Page 43: Making Sense of Data Big and Small


Maximum Likelihood Estimation

• Given a distribution P(x), how likely are we to see a given set of data points x_i?

• The probability of each point is simply P(x_i).

• So the probability of a given realization is \prod_i P(x_i).

• For mathematical convenience, we define the (log-)likelihood as:

\mathcal{L} = \sum_i \log \left[ P(x_i) \right]

• The set of parameters that maximizes \mathcal{L} characterizes the distribution most likely to have generated the data.

Page 44: Making Sense of Data Big and Small


MLE Coin Flips

• Biased coin with unknown probability of heads (p)

• In a sequence of N flips, the likelihood of N_h heads and N_t = N - N_h tails is (ignoring the combinatorial factor):

\mathcal{L} = \log \left[ p^{N_h} (1 - p)^{N - N_h} \right]

• or simply:

\mathcal{L} = N_h \log p + (N - N_h) \log (1 - p)

• Taking the derivative:

\frac{\partial \mathcal{L}}{\partial p} = \frac{N_h}{p} - \frac{N - N_h}{1 - p}

• Setting it to zero and solving for p:

p = \frac{N_h}{N}

Page 45: Making Sense of Data Big and Small


Binomial Distribution

• The probability of getting k successes in n trials of probability p (k heads in n coin flips):

P_B(k; n, p) = \frac{n!}{k! \, (n - k)!} \, p^k (1 - p)^{n - k}

• The mean value is: \mu = np

• and the variance: \sigma^2 = np(1 - p)

• and for sufficiently large n: P_B(k; n, p) \sim P_N(np, np(1 - p))

Page 46: Making Sense of Data Big and Small


(Beta Distribution)

• Related to the Binomial, P_B(k; n, p) = \frac{n!}{k!(n-k)!} p^k (1-p)^{n-k}, and has a very similar form:

P_\beta(x; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \, \Gamma(\beta)} \, x^{\alpha - 1} (1 - x)^{\beta - 1}

• with x \in [0, 1] and \alpha, \beta > 0.

• \Gamma(a) is the continuous extension of the factorial a!

• The mean is: \mu = \frac{\alpha}{\alpha + \beta}

• And the variance: \sigma^2 = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}

Page 47: Making Sense of Data Big and Small


A/B Testing

• Divide users into two groups A and B

• Measure some metric for each group (the conversion probabilities p_A and p_B, for example)

Page 48: Making Sense of Data Big and Small


A/B Testing

• If conversion is a binomial process, then the standard error of each group is:

SE = \sqrt{\frac{p (1 - p)}{N}}

• Z score:

Z = \frac{p_A - p_B}{\sqrt{SE_A^2 + SE_B^2}}

Page 49: Making Sense of Data Big and Small


p-value

• Calculate the probability of an event more extreme than the observation under the “null hypothesis”

• The smaller the p-value, the better:

• p < 0.05: moderate evidence against the null hypothesis

• p < 0.01: strong evidence against the null hypothesis

• p < 0.001: very strong evidence against the null hypothesis

Page 50: Making Sense of Data Big and Small


Berkeley Discrimination Case Part I

          Candidates   Acceptance Rate      SE
Men          8442            0.44         5.4x10^-3
Women        4321            0.35         7.2x10^-3

Z = \frac{p_A - p_B}{\sqrt{SE_A^2 + SE_B^2}} = 9.9

p \approx 10^{-23}
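A sketch reproducing this Z score and p-value (two-sided tail of a standard normal; the SE definition is the one from the A/B-testing slide):

import numpy as np
from scipy.stats import norm

n_men, p_men = 8442, 0.44
n_women, p_women = 4321, 0.35

se_men = np.sqrt(p_men * (1 - p_men) / n_men)
se_women = np.sqrt(p_women * (1 - p_women) / n_women)

z = (p_men - p_women) / np.sqrt(se_men**2 + se_women**2)
p_value = 2 * norm.sf(z)     # two-sided tail probability
print(z, p_value)            # roughly z = 9.9, p of the order of 1e-23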

Page 51: Making Sense of Data Big and Small


p-value

“Statistical significance does not imply scientific significance”

Page 52: Making Sense of Data Big and Small


(Bonferroni Correction)

Page 53: Making Sense of Data Big and Small


(Bonferroni Correction)

• You can think of p as the probability of observing a result as extreme by chance. With n comparisons, this probability becomes:

p_n = 1 - (1 - p)^n

which quickly goes to 1 as n increases.

• However, by replacing p with p/n for each individual comparison, we obtain:

p_n = 1 - \left( 1 - \frac{p}{n} \right)^n

• and for sufficiently large n:

p_n \approx 1 - e^{-p} \approx p

• allowing us to keep the probability of false results arbitrarily low even with arbitrarily large numbers of comparisons.
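A quick numerical illustration with assumed values p = 0.05 and n = 100 comparisons:

p, n = 0.05, 100

family_wise_uncorrected = 1 - (1 - p) ** n
family_wise_corrected   = 1 - (1 - p / n) ** n

print(family_wise_uncorrected)   # ~0.994: almost certain to see a "significant" result by chance
print(family_wise_corrected)     # ~0.049: stays at about the nominal level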

Page 54: Making Sense of Data Big and Small


The Simpsons

Page 55: Making Sense of Data Big and Small


Simpson's Paradox

Page 56: Making Sense of Data Big and Small


Simpson's Paradox

          Candidates   Acceptance Rate
Men          8442            0.44
Women        4321            0.35

Berkeley Discrimination Case Part II: The statisticians strike back.

Science 187, 398 (1975)

Page 57: Making Sense of Data Big and Small


Simpson's Paradox

          Candidates   Acceptance Rate
Men          8442            0.44
Women        4321            0.35

                Men                       Women
Dept     Candidates  Acceptance    Candidates  Acceptance
A            825        0.62           108        0.82
B            560        0.63            25        0.68
C            325        0.37           594        0.34
D            417        0.33           375        0.35
E            191        0.28           393        0.24
F            272        0.06           341        0.07
Total       2590        0.46          1835        0.30

Berkeley Discrimination Case Part II: The statisticians strike back.

Science 187, 398 (1975)
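A sketch, using the department counts from the table above, of how the aggregation produces the reversal:

men   = {"A": (825, 0.62), "B": (560, 0.63), "C": (325, 0.37),
         "D": (417, 0.33), "E": (191, 0.28), "F": (272, 0.06)}
women = {"A": (108, 0.82), "B": (25, 0.68), "C": (594, 0.34),
         "D": (375, 0.35), "E": (393, 0.24), "F": (341, 0.07)}

def aggregate(groups):
    total = sum(n for n, _ in groups.values())
    accepted = sum(n * rate for n, rate in groups.values())
    return accepted / total

print("men  :", aggregate(men))    # ~0.46 overall
print("women:", aggregate(women))  # ~0.30 overall, despite similar per-department rates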

Page 58: Making Sense of Data Big and Small


Simpson's Paradox

          Candidates   Acceptance Rate
Men          8442            0.44
Women        4321            0.35

                Men                       Women
Dept     Candidates  Acceptance    Candidates  Acceptance
A            825        0.62           108        0.82
B            560        0.63            25        0.68
C            325        0.37           594        0.34
D            417        0.33           375        0.35
E            191        0.28           393        0.24
F            272        0.06           341        0.07
Total       2590        0.46          1835        0.30

Science 187, 398 (1975)

Page 59: Making Sense of Data Big and Small


Simpson's Paradox

          Candidates   Acceptance Rate
Men          8442            0.44
Women        4321            0.35

                Men                       Women
Dept     Candidates  Acceptance    Candidates  Acceptance
A            825        0.62           108        0.82
B            560        0.63            25        0.68
C            325        0.37           594        0.34
D            417        0.33           375        0.35
E            191        0.28           393        0.24
F            272        0.06           341        0.07
Total       2590        0.46          1835        0.30

Science 187, 398 (1975)

“aggregated data can appear to reverse important trends in the numbers being combined” (WSJ, Dec 2, 2009)

Page 60: Making Sense of Data Big and Small


MLE - Fitting a theoretical function to experimental data

• In an experimental measurement, we expect (by the CLT) the experimental values to be normally distributed around the theoretical value with a certain variance. Mathematically, this means:

P(y | f(x)) \approx \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left[ -\frac{(y - f(x))^2}{2 \sigma^2} \right]

• where y are the experimental values and f(x) the theoretical ones. The (log-)likelihood is then:

\mathcal{L} = -\frac{N}{2} \log \left[ 2 \pi \sigma^2 \right] - \sum_i \frac{(y_i - f(x_i))^2}{2 \sigma^2}

• where we see that to maximize the likelihood we must minimize the sum of squares: Least Squares Fitting.
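A minimal sketch of such a fit on made-up data, using SciPy's curve_fit (which minimizes exactly this sum of squared residuals):

import numpy as np
from scipy.optimize import curve_fit

def f(x, a, b):
    return a * x + b            # assumed model; any parametric f(x) works

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 3.0 * x + 1.0 + rng.normal(0, 2.0, size=x.size)   # data with Gaussian noise

params, cov = curve_fit(f, x, y)
print("a, b =", params)         # should come out close to (3, 1)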

Page 61: Making Sense of Data Big and Small


MLE - Fitting a power-law to experimental data

• We often find what look like power-law distributions in empirical data:

P(k) = \frac{\gamma - 1}{k_{min}} \left( \frac{k}{k_{min}} \right)^{-\gamma}

and we would like to find the right parameter values.

• The likelihood of any set of points is:

\mathcal{L} = \sum_i \log \left[ \frac{\gamma - 1}{k_{min}} \left( \frac{k_i}{k_{min}} \right)^{-\gamma} \right]

• And maximizing, we find:

\gamma = 1 + n \left[ \sum_i \log \left( \frac{k_i}{k_{min}} \right) \right]^{-1}

• with a standard error of:

SE = \frac{\gamma - 1}{\sqrt{n}}

SIAM Rev. 51, 661 (2009)
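A sketch of this estimator on synthetic data drawn from a continuous power law by inverse-transform sampling (gamma_true, k_min and n are assumed values):

import numpy as np

rng = np.random.default_rng(2)
gamma_true, k_min, n = 2.5, 1.0, 10_000
k = k_min * (1 - rng.random(n)) ** (-1 / (gamma_true - 1))   # power-law distributed samples

gamma_hat = 1 + n / np.sum(np.log(k / k_min))
se = (gamma_hat - 1) / np.sqrt(n)
print(f"gamma_hat = {gamma_hat:.3f} +/- {se:.3f}")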

Page 62: Making Sense of Data Big and Small


Clustering

Page 63: Making Sense of Data Big and Small


K-Means

• Choose k random points as the initial centroids of the clusters

• Assign each point to the cluster whose centroid is closest

• Recompute the centroid positions (the mean position of the points in each cluster)

• Repeat until convergence

Page 64: Making Sense of Data Big and Small


K-Means: Structure

Page 65: Making Sense of Data Big and Small


K-Means: Structure (Voronoi Tessellation)

Page 66: Making Sense of Data Big and Small


K-Means: Convergence

• How do we quantify the “quality” of the solution found at each iteration n?

• Measure the “inertia”, the squared intra-cluster distance:

I_n = \sum_{i=0}^{N} \| x_i - \mu_i \|^2

where \mu_i are the coordinates of the centroid of the cluster to which x_i is assigned.

• Smaller values are better.

• We can stop when the relative variation is smaller than some tolerance:

\frac{|I_{n+1} - I_n|}{I_n} < tol

Page 67: Making Sense of Data Big and Small


K-Means: sklearn

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=nclusters)   # nclusters chosen beforehand
kmeans.fit(data)                        # data: (n_samples, n_features) array

centroids = kmeans.cluster_centers_     # centroid coordinates, one row per cluster
labels = kmeans.labels_                 # cluster index assigned to each point

Page 68: Making Sense of Data Big and Small


K-Means: Limitations

Page 69: Making Sense of Data Big and Small


K-Means: Limitations

• No guarantee of finding the “best” solution

• Each run can find a different solution

• No clear way to determine k

Page 70: Making Sense of Data Big and Small


Silhouettes

• For each point x_i, define a_c(x_i) as the average distance between x_i and every other point within cluster c:

a_c(x_i) = \frac{1}{N_c} \sum_{j \in c} \| x_i - x_j \|

• Let b(x_i) be the minimum value of a_c(x_i) over all clusters other than the one x_i belongs to, c_i:

b(x_i) = \min_{c \neq c_i} a_c(x_i)

• The silhouette of x_i is then:

s(x_i) = \frac{b(x_i) - a_{c_i}(x_i)}{\max \{ b(x_i), a_{c_i}(x_i) \}}
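A sketch using scikit-learn's implementation of this measure to compare different values of k on synthetic blob data:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

data, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    print(k, silhouette_score(data, labels))   # the average silhouette peaks near k = 4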

Page 71: Making Sense of Data Big and Small


Silhouettes
http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html

Page 72: Making Sense of Data Big and Small


Silhouettes
http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html

Page 73: Making Sense of Data Big and Small


Silhouettes
http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html

Page 74: Making Sense of Data Big and Small


Silhouettes
http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html

Page 75: Making Sense of Data Big and Small


Silhouettes
http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html

Page 76: Making Sense of Data Big and Small


Expectation Maximization

• Iterative algorithm to learn parameter estimates in models with unobserved latent variables

• Two steps for each iteration

• Expectation: Calculate the likelihood of the data given the current parameter estimates

• Maximization: Find the parameter values that maximize the likelihood

• Stop when the relative variation of the parameter estimates is smaller than some value

Page 77: Making Sense of Data Big and Small


Expectation Maximization
Nature Biotech. 26, 897 (2008)

Page 78: Making Sense of Data Big and Small


Expectation Maximization

# Snippet from the worked example linked below; it assumes numpy as np, the list of
# coin-flip experiments, the get_mn_likelihood helper, and initial guesses tA, tB.
while improvement > delta:
    expectation_A = np.zeros((5, 2), dtype=float)
    expectation_B = np.zeros((5, 2), dtype=float)

    for i in range(len(experiments)):
        e = experiments[i]  # i'th experiment (heads/tails counts)
        # E-step: likelihood of this experiment under each coin's current bias
        ll_A = get_mn_likelihood(e, np.array([tA[-1], 1 - tA[-1]]))
        ll_B = get_mn_likelihood(e, np.array([tB[-1], 1 - tB[-1]]))

        weightA = ll_A / (ll_A + ll_B)
        weightB = ll_B / (ll_A + ll_B)

        expectation_A[i] = np.dot(weightA, e)
        expectation_B[i] = np.dot(weightB, e)

    # M-step: re-estimate each coin's bias from the weighted counts
    tA.append(sum(expectation_A)[0] / sum(sum(expectation_A)))
    tB.append(sum(expectation_B)[0] / sum(sum(expectation_B)))

    improvement = max(abs(np.array([tA[-1], tB[-1]]) - np.array([tA[-2], tB[-2]])))

http://stats.stackexchange.com/questions/72774/numerical-example-to-understand-expectation-maximization

Page 79: Making Sense of Data Big and Small


Expectation Maximization

Page 80: Making Sense of Data Big and Small


Gaussian Mixture Models

Page 81: Making Sense of Data Big and Small


Gaussian Mixture Models

• One solution is to try to characterize each cluster as a Gaussian. In this case we want to find the set of parameters and mixture weights that best reproduces the data:

p(\theta) = \sum_i \pi_i \, N(\mu_i, \sigma_i)

• Given some data points x_i, we can calculate the prior p(\theta), which we can update using the data, p(x | \theta), to obtain the posterior:

p(\theta | x) = \frac{p(x | \theta) \, p(\theta)}{p(x)}

• which we can use to choose a new set of parameters and mixtures.

• Iterate using Expectation Maximization.
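A sketch of fitting such a mixture with scikit-learn, whose GaussianMixture runs the Expectation Maximization iterations described above:

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

data, _ = make_blobs(n_samples=500, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, random_state=0).fit(data)
print(gmm.weights_)          # mixture weights, one per component
print(gmm.means_)            # one mean vector per Gaussian
labels = gmm.predict(data)   # most probable component for each point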

Page 82: Making Sense of Data Big and Small


Conclusions

• Don’t trust descriptive statistics too much, they can be misleading

• Get a feel for the data using visualizations, etc…

• Know the properties of the distributions you are using

• Know the assumptions (implicit and explicit) that you are making.

• Be careful about how you aggregate data

• Use machine learning methods like k-means, Expectation Maximization, etc. to better understand and describe the data, but be sure you understand them as well.

Page 83: Making Sense of Data Big and Small


If it still doesn’t make sense…

Page 84: Making Sense of Data Big and Small


References