Estimating Symmetric Distribution Properties: Maximum Likelihood Strikes Back!
Jayadev Acharya (Cornell)
Hirakendu Das (Yahoo)
Alon Orlitsky (UCSD)
Ananda Suresh (Google)
Shannon Channel, March 5, 2018
Based on:
"A Unified Maximum Likelihood Approach for Estimating Symmetric Properties of Discrete Distributions", International Conference on Machine Learning (ICML), 2017
http://proceedings.mlr.press/v70/acharya17a.html
Slides available at: https://people.ece.cornell.edu/acharya/talks/shannon-18acharya.pdf
Outline
1. Motivation
2. Methods
3. Proofs
4. Future directions
Chapter 1: Motivation
Distribution properties
𝒫: a collection of discrete distributions
• 𝒫 = Δ_k: all distributions over [k] = {1, …, k}
• Δ_6: distributions over [6]
Property f: 𝒫 → ℝ
• p(3) = ?
• Is it fair? Is p(i) = 1/6 for all i?
Property estimation
p: unknown distribution in 𝒫
Given independent samples X^n = X_1, X_2, …, X_n ∼ p
Estimate f(p)
Sample complexity S(f, 𝒫, ε, δ):
minimum n necessary to
• estimate f(p) to ±ε
• with error probability < δ (usually a constant)
Symmetric properties
f is symmetric if unchanged under input permutations
Entropy: H(p) ≜ ∑_x p(x) log(1/p(x))
Support size: S(p) ≜ ∑_x 𝕀[p(x) > 0]
Many others: Rényi entropy, support coverage, …
Coins with bias 0.4 and with bias 0.6 have the same entropy!
Entropy
H(p) ≜ −∑_x p(x) log p(x)
• Most popular measure of randomness
• Central quantity in information theory [Shannon '48]
How many samples to estimate H(p) to ±ε?
Long line of work: [empirical, Miller–Madow, jackknifed, coverage-adjusted, BUB (Paninski '03), …]
Estimating entropy:
• Randomness of neural spike trains
• Feature selection in decision trees
• Graphical models (Chow–Liu)
Traditional setting for property estimation:
𝒫 = Δ_k: distributions on [k] (small k); obtain many samples (large n)
Genetics, neural spikes, text, computer vision, ecology: k large, possibly infinite, perhaps unknown
Estimating the unseen: Corbet's butterflies
2 years trapping butterflies in the Malay peninsula.
Asked Fisher:
how many new species if he goes for two more years?

Frequency: 1    2   3   4   5   6   7   8   9  10  11  12  13  14  15
Species:  118  74  44  24  29  22  20  19  20  15  12  14   6  12   6
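Fisher's answer can be reproduced from this table alone. A minimal sketch using the Good–Toulmin alternating-sum estimator (mentioned later in the talk), which predicts the expected number of new species in a follow-up survey of the same length:

```python
# Corbet's butterfly data: species[i] = number of species observed exactly i+1 times.
species = [118, 74, 44, 24, 29, 22, 20, 19, 20, 15, 12, 14, 6, 12, 6]

# Good-Toulmin estimator: the expected number of NEW species in a follow-up
# sample of the same size is the alternating sum of these counts.
new_species = sum((-1) ** i * phi for i, phi in enumerate(species))
print(new_species)  # 75 -- Fisher's celebrated answer
```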
Estimating the unseen: formulation
p: unknown discrete distribution
S_m(p) ≜ 𝔼[# distinct symbols in m ind. samples ∼ p]
S_m(p) = ∑_x (1 − (1 − p(x))^m)
Normalized coverage: S_m(p)/m
How many samples to estimate S_m(p)/m to ±ε?
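As a sanity check, the formula for S_m(p) can be evaluated directly and compared against simulation. A sketch (the helper name `expected_distinct` is mine), with sizes chosen to match the Uniform[500] / 700-sample experiment shown later in the talk:

```python
import random

def expected_distinct(p, m):
    """S_m(p) = sum_x (1 - (1 - p(x))**m): expected number of distinct
    symbols among m independent samples from the distribution p."""
    return sum(1 - (1 - px) ** m for px in p)

# Uniform distribution over 500 symbols, m = 700 draws.
p = [1 / 500] * 500
exact = expected_distinct(p, 700)

# Monte Carlo estimate of the same quantity.
random.seed(0)
sims = [len({random.randrange(500) for _ in range(700)}) for _ in range(200)]
mc = sum(sims) / len(sims)
print(round(exact, 1), round(mc, 1))  # both close to 377
```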
Estimating the unseen: applications
• Estimating vocabulary size
• Ecological diversity
• Microbial diversity on skin
Well studied [Good–Toulmin, Efron–Thisted]. For constant ε: requires m/2 samples
More recently [Zou Valiant Valiant Chan … '16, Orlitsky Suresh Wu '16]: requires m/log m samples
Chapter 2: Methods
Plug-in estimation
Using X^n, find an estimate p̂ of p
Estimate f(p) by f(p̂)
How to estimate p?
Sequence maximum likelihood (SML)
p_SML(x^n) ≜ argmax_p p(x^n) = argmax_p ∏_i p(x_i)
Example: X^3 = h, h, t
p_SML = argmax_p p(h)² · p(t)
p_SML(h) = 2/3, p_SML(t) = 1/3
Same as the empirical-frequency distribution
Multiplicity N_x: # times x appears in X^n
p_SML(x) = N_x / n
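A minimal sketch of the SML (empirical-frequency) estimator:

```python
from collections import Counter

def sml(samples):
    """Sequence maximum likelihood = empirical frequencies: p(x) = N_x / n."""
    n = len(samples)
    return {x: c / n for x, c in Counter(samples).items()}

p_hat = sml(["h", "h", "t"])
print(p_hat)  # h -> 2/3, t -> 1/3
```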
SML for entropy
Sample complexity of SML to estimate H(p) over Δ_k:
S_SML(H, Δ_k, ε) = Θ(k/ε)
In the asymptotic regime n → ∞, SML is optimal
Sample complexity of entropy
Sample complexity of H(p):
S(H, Δ_k, ε) = Θ(k/(ε · log k))
[…, Paninski '03, Valiant Valiant '11, Han Jiao Venkat Weissman '15, Wu Yang '15]
[Valiant Valiant '11]: plug-in, sub-optimal in ε
[Han Jiao Weissman '18]: optimal in ε by tweaking VV '11
Optimal estimators
General recipe:
1. Approximate H(p) with a polynomial in p
2. Estimate the polynomial
Different non-plug-in estimator for each property
Sophisticated approximation-theory results
Prior work
For several important properties, the optimal sample complexity is a logarithmic factor better than empirical (SML):

Property   | 𝒫    | SML         | Optimal              | References
H(p)       | Δ_k  | k/ε         | k/(ε · log k)        | Valiant Valiant '11, Han Jiao Venkat Weissman '15, Wu Yang '15
S_m(p)/m   | Δ_∞  | m           | (m/log m) · log(1/ε) | Orlitsky Suresh Wu '16
S(p)       | Δ_k  | k · log(1/ε)| (k/log k) · log²(1/ε)| Wu Yang '16
∥p − u∥₁   | Δ_k  | k/ε²        | k/(ε² · log k)       | Han Jiao Weissman '16
Our results [Acharya Das Orlitsky Suresh '17]
Unified, simple, sample-optimal approach for all of the above problems
• Plug-in estimator
• Maximum likelihood principle: sequence maximum likelihood (SML) → profile maximum likelihood (PML)
Chapter 3: PML
Profiles
The profile is the multiset of multiplicities:
Φ(X^n) ≜ {N_x : x ∈ X^n}
Φ(h, h, t) = Φ(t, h, t) = {1, 2}
Φ(α, γ, β, γ) = {1, 1, 2}
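Computing a profile is a one-liner. A sketch (representing the multiset as a sorted list):

```python
from collections import Counter

def profile(samples):
    """Profile = multiset of multiplicities of the observed symbols."""
    return sorted(Counter(samples).values())

print(profile("hht"), profile("tht"))  # [1, 2] [1, 2] -- same profile, different labels
print(profile(["a", "c", "b", "c"]))   # [1, 1, 2]
```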
Probability multiset
Symmetric properties are determined by the probability multiset {p(1), p(2), …}
p(h) = 0.4 ⟹ {0.4, 0.6}
p(h) = 0.6 ⟹ {0.4, 0.6}
Profiles are a sufficient statistic for symmetric properties
h, h, t OR t, h, t ⟹ same estimate
Estimating the probability multiset
Orlitsky Santhanam Viswanathan Zhang '04: "On modeling profiles instead of values", UAI
More extensively:
OSVZ: "On estimating a probability multiset", online
Profile maximum likelihood (PML)
Profile probability:
p(Φ) = ∑_{x^n : Φ(x^n) = Φ} p(x^n)
Distribution maximizing the profile probability:
p_PML ≜ argmax_{p ∈ 𝒫} p(Φ)
PML example
X^3 = h, h, t; Φ(h, h, t) = {1, 2}
p(Φ = {1, 2}) = p(s, s, d) + p(s, d, s) + p(d, s, s)   (s = the repeated symbol, d = the distinct one)
             = 3 · p(s, s, d)
             = 3 · ∑_{x ≠ y} p(x)² p(y)
A symmetric polynomial
SML of {1, 2}
p(Φ = {1, 2}) = 3 ∑_{x ≠ y} p(x)² p(y)
p_SML(h) = 2/3, p_SML(t) = 1/3
p_SML(Φ = {1, 2}) = 3 · [(2/3)² · (1/3) + (1/3)² · (2/3)] = 18/27 = 2/3
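The value 2/3 can be checked by brute force: sum p(x^n) over all length-3 sequences whose profile is {1, 2}, under the SML distribution. A sketch using exact rational arithmetic (the helper name `profile_probability` is mine):

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def profile_probability(p, n, phi):
    """p(Phi) = sum of p(x^n) over all length-n sequences x^n with profile phi."""
    total = Fraction(0)
    for seq in product(list(p), repeat=n):
        if sorted(Counter(seq).values()) == sorted(phi):
            prob = Fraction(1)
            for x in seq:
                prob *= p[x]
            total += prob
    return total

p_sml = {"h": Fraction(2, 3), "t": Fraction(1, 3)}
print(profile_probability(p_sml, 3, [1, 2]))  # 2/3
```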
PML of {1, 2}
p(Φ = {1, 2}) = 3 ∑_{x ≠ y} p(x)² p(y)
∑_{x ≠ y} p(x)² p(y) = ∑_x p(x)² (1 − p(x))
                     = ∑_x p(x) · p(x)(1 − p(x))
                     ≤ 1/4,   since p(x)(1 − p(x)) ≤ 1/4
with equality for the uniform distribution over two symbols, so
p_PML(Φ = {1, 2}) = 3/4
PML of {1, 1, 2}
Φ(α, γ, β, γ) = {1, 1, 2}
Maximize:
∑_{x, y, z distinct} p(x)² p(y) p(z),
subject to:
∑_x p(x) = 1, p(x) ≥ 0
p_PML({1, 1, 2}) = U[5], the uniform distribution over 5 symbols
PML can predict the existence of unseen symbols: only 3 distinct symbols were observed
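The claim that U[5] maximizes the profile probability of {1, 1, 2} can be illustrated numerically. Restricting the search to uniform distributions U[k] (the slide states the true maximizer is uniform), a brute-force sweep over k finds that k = 5 gives the largest profile probability — in particular, support size 5 beats the 3 observed symbols:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def uniform_profile_prob(k, phi):
    """Probability that Uniform[k] produces profile phi, by exhaustively
    enumerating all k**n length-n sequences (n = sum(phi); fine for tiny n, k)."""
    n = sum(phi)
    hits = sum(1 for seq in product(range(k), repeat=n)
               if sorted(Counter(seq).values()) == sorted(phi))
    return Fraction(hits, k ** n)

probs = {k: uniform_profile_prob(k, [1, 1, 2]) for k in range(3, 9)}
best = max(probs, key=probs.get)
print(best, probs[best])  # 5 72/125  (= 0.576)
```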
[Plot: PML estimates from 700 samples of Uniform[500], 12 experiments]
PML plug-in
To estimate a symmetric f(p):
• Find p_PML(Φ(X^n))
• Output f(p_PML)
Advantages:
• No tuning parameters
• Not function specific
Rooted in the maximum likelihood principle
Ingredient 1: Goodness of ML
General ML plug-in estimation
𝒫: collection of distributions over an abstract domain 𝒵
f: 𝒫 → ℝ any property
Given z ∈ 𝒵 drawn from p, estimate f(p)
ML estimator:
• Determine p_MLE ≜ argmax_{p ∈ 𝒫} p(z)
• Output f(p_MLE)
How good is the MLE?
Competitiveness of the ML plug-in
Theorem: Suppose f̂: 𝒵 → ℝ is such that for all p ∈ 𝒫,
Pr_{Z∼p}[|f̂(Z) − f(p)| > ε] < δ,
then the MLE plug-in error is bounded by
Pr_{Z∼p}[|f(p_MLE) − f(p)| > 2ε] < δ · |𝒵|.
Competitive with the best f̂
Competitiveness of the MLE plug-in: proof
Consider any p ∈ 𝒫, and let 𝒵_hi ≜ {z ∈ 𝒵 : p(z) ≥ δ}
• For z ∈ 𝒵_hi:
  • |f̂(z) − f(p)| ≤ ε (otherwise the event in the Theorem would have probability ≥ p(z) ≥ δ)
  • p_MLE(z) ≥ p(z) ≥ δ, hence |f̂(z) − f(p_MLE)| ≤ ε
  • Triangle inequality: |f(p_MLE) − f(p)| ≤ 2ε
• For z ∈ 𝒵_lo ≜ {z : p(z) < δ}:
  • Pr[|f(p_MLE(Z)) − f(p)| > 2ε] ≤ Pr[𝒵_lo] < δ · |𝒵|
Ingredient 2: Error probabilities
PML performance bound
Theorem: If n = S(f, 𝒫, ε, δ), then
S_PML(f, 𝒫, 2ε, |Φ_n| · δ) ≤ n
|Φ_n|: number of profiles of length n
A profile of length n is a partition of n:
{3}, {1, 2}, {1, 1, 1} ⟷ 3, 2+1, 1+1+1
|Φ_n| = partition number of n
Hardy–Ramanujan: |Φ_n| < e^{3√n}
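The partition-number bound is easy to verify numerically; this sketch counts partitions with a standard dynamic program and compares against e^{3√n}:

```python
import math

def partitions(n):
    """Number of integer partitions of n (= number of profiles of length n),
    via the classic coin-change dynamic program."""
    p = [1] + [0] * n
    for part in range(1, n + 1):
        for total in range(part, n + 1):
            p[total] += p[total - part]
    return p[n]

for n in (10, 50, 100):
    print(n, partitions(n), "<", round(math.exp(3 * math.sqrt(n))))
# e.g. partitions(100) = 190569292, well below e^{30} ≈ 1.07e13
```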
PML performance: Try 1
Theorem: n = S(f, 𝒫, ε, 1/3) ⇒ S_PML(f, 𝒫, 2ε, 1/3) ≤ O(n²).
Proof:
• Boost the error probability (median trick):
  • Take n · ℓ independent samples, and divide them into ℓ parts
  • Estimate f from each part
  • Take the median of the estimates ⟹ δ < exp(−ℓ)
• |Φ_{nℓ}| < exp(3(nℓ)^{0.5})
Error probability of PML is at most exp(−ℓ + 3(nℓ)^{0.5})
When ℓ > 9n, the error exponent dominates the number of profiles, and the total sample count is nℓ = O(n²)
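The median-boosting step can be quantified exactly: if each of the ℓ independent estimates misses by more than ε with probability 1/3, the median misses only when at least half the estimates miss — a binomial tail that decays exponentially in ℓ. A sketch (the helper name is mine; the exact constant in the exponent is not from the slides):

```python
from math import comb

def median_failure_prob(l, delta=1/3):
    """Upper bound on the failure probability of the median of l independent
    estimates, each failing with probability delta < 1/2: the median can be
    wrong only if at least half of the individual estimates are wrong."""
    return sum(comb(l, k) * delta ** k * (1 - delta) ** (l - k)
               for k in range((l + 1) // 2, l + 1))

for l in (1, 9, 25, 49):
    print(l, median_failure_prob(l))
# the failure probability decays exponentially in l
```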
PML performance: Try 2
Recall
S(H, Δ_k, ε, 1/3) = Θ(k/(ε · log k))
With twice the samples, the error drops exponentially:
S(H, Δ_k, ε, e^{−n^{0.9}}) = Θ(k/(ε · log k))
• Modified estimators with small bounded differences
• Stronger guarantees from McDiarmid's inequality
Most technical part of the paper; similar results for other properties
Combining everything
Fast error for the properties we study:
If n = S(f, 𝒫, ε, 1/3), then S(f, 𝒫, ε, e^{−4√n}) ≤ 4n
ML plug-in result:
If S(f, 𝒫, ε, e^{−4√n}) ≤ 4n, then S_PML(f, 𝒫, 2ε, e^{−√n}) ≤ 4n
Combining, we are done!
Computing the PML distribution
• EM algorithm [Orlitsky Pan Sajama Santhanam Viswanathan Zhang '04, '13]
• Approximate PML via Bethe permanents [Vontobel '14]
• Extensions of Markov chains [Vatedka Vontobel '16]
• Approximation via relaxation [Jiao Pavlichin Weissman '17]
Chapter 5: Directions
Approximate PML
• Perhaps finding the exact PML is hard
• Can show that an approximate PML is enough
Question: Compute an exp(−n^β)-approximate PML for any β < 1
Even this is optimal (for large k)
Higher dimensions
• Estimate KL divergence between discrete distributions given samples (under assumptions, of course)
[Bu Zou Liang Veeravalli '16, Han Jiao Weissman '16]
[Acharya '18]: PML is optimal for KL divergence estimation (via higher-order partitions)
Independent proof techniques
Our guarantee shows that the MLE is good whenever some other estimator is good
• Can we bound ML performance independently of other results?
Other
Is PML optimal for every symmetric property?
Can we do something for continuous distributions?
Summary
• Symmetric property estimation
• PML plug-in approach
• Universal, simple to state
• Independent of the particular property
In Fisher's words…
"Of course nobody has been able to prove that MLE is best under all circumstances. MLE computed with all the information available may turn out to be inconsistent. Throwing away a substantial part of the information may render them consistent."
R. A. Fisher
Thank You!