Estimating Symmetric Distribution Properties: Maximum Likelihood Strikes Back!
Jayadev Acharya (Cornell)
Hirakendu Das (Yahoo)
Alon Orlitsky (UCSD)
Ananda Suresh (Google)
Shannon Channel, March 5, 2018
Based on:
"A Unified Maximum Likelihood Approach for Estimating Symmetric Properties of Discrete Distributions", International Conference on Machine Learning (ICML), 2017
http://proceedings.mlr.press/v70/acharya17a.html
Slides available at: https://people.ece.cornell.edu/acharya/talks/shannon-18acharya.pdf
Outline
1. Motivation
2. Methods
3. Proofs
4. Future directions
Chapter 1: Motivation
Distribution properties
𝒫: a collection of discrete distributions
• 𝒫 = Δ_k: all distributions over [k] = {1, …, k}
• Δ_6: distributions over [6]
Property f: 𝒫 → ℝ
• p(3) = ?
• Is it fair? Is p(i) = 1/6 for all i?
Property estimation
p: unknown distribution in 𝒫
Given independent samples X^n = X_1, X_2, …, X_n ∼ p
Estimate f(p)
Sample complexity S(f, 𝒫, ε, δ):
minimum n necessary to
• estimate f(p) to ±ε
• with error probability < δ (usually a constant)
Symmetric properties
f is symmetric if unchanged under input permutations
Entropy: H(p) ≜ ∑_x p(x) log(1/p(x))
Support size: S(p) ≜ ∑_x 𝕀[p(x) > 0]
Many others: Rényi entropy, support coverage, …
Coins with bias 0.4 and with bias 0.6 have the same entropy!
Entropy
H(p) ≜ −∑_x p(x) log p(x)
• Most popular measure of randomness
• Central quantity in information theory [Shannon '48]
How many samples to estimate H(p) to ±ε?
Long line of work: [empirical, Miller–Madow, jackknifed, coverage-adjusted, BUB (Paninski '03), …]
Estimating entropy:
• Randomness of neural spike trains
• Feature selection in decision trees
• Graphical models (Chow–Liu)
Traditional setting for property estimation:
𝒫 = Δ_k: distributions on [k] (small k); obtain many samples (large n)
Genetics, neural spikes, text, computer vision, ecology: k large, possibly infinite, perhaps unknown
Estimating the unseen: Corbet's butterflies
2 years trapping butterflies in the Malay peninsula.
Asked Fisher:
how many new species if he goes for two more years?

Frequency: 1    2   3   4   5   6   7   8   9  10  11  12  13  14  15
Species:  118  74  44  24  29  22  20  19  20  15  12  14   6  12   6
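Fisher's answer can be reproduced from this table alone. A minimal sketch using the Good–Toulmin alternating-sum estimator (mentioned later in the talk), which predicts the expected number of new species in a follow-up survey of the same length:

```python
# Corbet's butterfly data: species[i] = number of species observed exactly i+1 times.
species = [118, 74, 44, 24, 29, 22, 20, 19, 20, 15, 12, 14, 6, 12, 6]

# Good-Toulmin estimator: the expected number of NEW species in a follow-up
# sample of the same size is the alternating sum of these counts.
new_species = sum((-1) ** i * phi for i, phi in enumerate(species))
print(new_species)  # 75 -- Fisher's celebrated answer
```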
Estimating the unseen: formulation
p: unknown discrete distribution
S_m(p) ≜ 𝔼[# distinct symbols in m ind. samples ∼ p]
S_m(p) = ∑_x (1 − (1 − p(x))^m)
Normalized coverage: S_m(p)/m
How many samples to estimate S_m(p)/m to ±ε?
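As a sanity check, the formula for S_m(p) can be evaluated directly and compared against simulation. A sketch (the helper name `expected_distinct` is mine), with sizes chosen to match the Uniform[500] / 700-sample experiment shown later in the talk:

```python
import random

def expected_distinct(p, m):
    """S_m(p) = sum_x (1 - (1 - p(x))**m): expected number of distinct
    symbols among m independent samples from the distribution p."""
    return sum(1 - (1 - px) ** m for px in p)

# Uniform distribution over 500 symbols, m = 700 draws.
p = [1 / 500] * 500
exact = expected_distinct(p, 700)

# Monte Carlo estimate of the same quantity.
random.seed(0)
sims = [len({random.randrange(500) for _ in range(700)}) for _ in range(200)]
mc = sum(sims) / len(sims)
print(round(exact, 1), round(mc, 1))  # both close to 377
```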
Estimating the unseen: applications
• Estimating vocabulary size
• Ecological diversity
• Microbial diversity on skin
Well studied [Good–Toulmin, Efron–Thisted]. For constant ε: requires m/2 samples
More recently [Zou Valiant Valiant Chan … '16, Orlitsky Suresh Wu '16]: requires m/log m samples
Chapter 2: Methods
Plug-in estimation
Using X^n, find an estimate p̂ of p
Estimate f(p) by f(p̂)
How to estimate p?
Sequence maximum likelihood (SML)
p_SML(x^n) ≜ argmax_p p(x^n) = argmax_p ∏_i p(x_i)
Example: X^3 = h, h, t
p_SML = argmax_p p(h)² · p(t)
p_SML(h) = 2/3, p_SML(t) = 1/3
Same as the empirical-frequency distribution
Multiplicity N_x: # times x appears in X^n
p_SML(x) = N_x / n
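A minimal sketch of the SML (empirical-frequency) estimator:

```python
from collections import Counter

def sml(samples):
    """Sequence maximum likelihood = empirical frequencies: p(x) = N_x / n."""
    n = len(samples)
    return {x: c / n for x, c in Counter(samples).items()}

p_hat = sml(["h", "h", "t"])
print(p_hat)  # h -> 2/3, t -> 1/3
```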
SML for entropy
Sample complexity of SML to estimate H(p) over Δ_k:
S_SML(H, Δ_k, ε) = Θ(k/ε)
In the asymptotic regime n → ∞, SML is optimal
Sample complexity of entropy
Sample complexity of H(p):
S(H, Δ_k, ε) = Θ(k/(ε · log k))
[…, Paninski '03, Valiant Valiant '11, Han Jiao Venkat Weissman '15, Wu Yang '15]
[Valiant Valiant '11]: plug-in, sub-optimal in ε
[Han Jiao Weissman '18]: optimal in ε by tweaking VV '11
Optimal estimators
General recipe:
1. Approximate H(p) with a polynomial in p
2. Estimate the polynomial
Different non-plug-in estimator for each property
Sophisticated approximation-theory results
Prior work
For several important properties, the optimal sample complexity is a logarithmic factor better than empirical (SML):

Property   | 𝒫    | SML         | Optimal              | References
H(p)       | Δ_k  | k/ε         | k/(ε · log k)        | Valiant Valiant '11, Han Jiao Venkat Weissman '15, Wu Yang '15
S_m(p)/m   | Δ_∞  | m           | (m/log m) · log(1/ε) | Orlitsky Suresh Wu '16
S(p)       | Δ_k  | k · log(1/ε)| (k/log k) · log²(1/ε)| Wu Yang '16
∥p − u∥₁   | Δ_k  | k/ε²        | k/(ε² · log k)       | Han Jiao Weissman '16
Our results [Acharya Das Orlitsky Suresh '17]
Unified, simple, sample-optimal approach for all of the above problems
• Plug-in estimator
• Maximum likelihood principle: sequence maximum likelihood (SML) → profile maximum likelihood (PML)
Chapter 3: PML
Profiles
The profile is the multiset of multiplicities:
Φ(X^n) ≜ {N_x : x ∈ X^n}
Φ(h, h, t) = Φ(t, h, t) = {1, 2}
Φ(α, γ, β, γ) = {1, 1, 2}
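Computing a profile is a one-liner. A sketch (representing the multiset as a sorted list):

```python
from collections import Counter

def profile(samples):
    """Profile = multiset of multiplicities of the observed symbols."""
    return sorted(Counter(samples).values())

print(profile("hht"), profile("tht"))  # [1, 2] [1, 2] -- same profile, different labels
print(profile(["a", "c", "b", "c"]))   # [1, 1, 2]
```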
Probability multiset
Symmetric properties are determined by the probability multiset {p(1), p(2), …}
p(h) = 0.4 ⟹ {0.4, 0.6}
p(h) = 0.6 ⟹ {0.4, 0.6}
Profiles are a sufficient statistic for symmetric properties
h, h, t OR t, h, t ⟹ same estimate
Estimating the probability multiset
Orlitsky Santhanam Viswanathan Zhang '04: "On modeling profiles instead of values", UAI
More extensively:
OSVZ: "On estimating a probability multiset", online
Profile maximum likelihood (PML)
Profile probability:
p(Φ) = ∑_{x^n : Φ(x^n) = Φ} p(x^n)
Distribution maximizing the profile probability:
p_PML ≜ argmax_{p ∈ 𝒫} p(Φ)
PML example
X^3 = h, h, t; Φ(h, h, t) = {1, 2}
p(Φ = {1, 2}) = p(s, s, d) + p(s, d, s) + p(d, s, s)   (s = the repeated symbol, d = the distinct one)
             = 3 · p(s, s, d)
             = 3 · ∑_{x ≠ y} p(x)² p(y)
A symmetric polynomial
SML of {1, 2}
p(Φ = {1, 2}) = 3 ∑_{x ≠ y} p(x)² p(y)
p_SML(h) = 2/3, p_SML(t) = 1/3
p_SML(Φ = {1, 2}) = 3 · [(2/3)² · (1/3) + (1/3)² · (2/3)] = 18/27 = 2/3
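The value 2/3 can be checked by brute force: sum p(x^n) over all length-3 sequences whose profile is {1, 2}, under the SML distribution. A sketch using exact rational arithmetic (the helper name `profile_probability` is mine):

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def profile_probability(p, n, phi):
    """p(Phi) = sum of p(x^n) over all length-n sequences x^n with profile phi."""
    total = Fraction(0)
    for seq in product(list(p), repeat=n):
        if sorted(Counter(seq).values()) == sorted(phi):
            prob = Fraction(1)
            for x in seq:
                prob *= p[x]
            total += prob
    return total

p_sml = {"h": Fraction(2, 3), "t": Fraction(1, 3)}
print(profile_probability(p_sml, 3, [1, 2]))  # 2/3
```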
PML of {1, 2}
p(Φ = {1, 2}) = 3 ∑_{x ≠ y} p(x)² p(y)
∑_{x ≠ y} p(x)² p(y) = ∑_x p(x)² (1 − p(x))
                     = ∑_x p(x) · p(x)(1 − p(x))
                     ≤ 1/4,   since p(x)(1 − p(x)) ≤ 1/4
with equality for the uniform distribution over two symbols, so
p_PML(Φ = {1, 2}) = 3/4
PML of {1, 1, 2}
Φ(α, γ, β, γ) = {1, 1, 2}
Maximize:
∑_{x, y, z distinct} p(x)² p(y) p(z),
subject to:
∑_x p(x) = 1, p(x) ≥ 0
p_PML({1, 1, 2}) = U[5], the uniform distribution over 5 symbols
PML can predict the existence of unseen symbols: only 3 distinct symbols were observed
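The claim that U[5] maximizes the profile probability of {1, 1, 2} can be illustrated numerically. Restricting the search to uniform distributions U[k] (the slide states the true maximizer is uniform), a brute-force sweep over k finds that k = 5 gives the largest profile probability — in particular, support size 5 beats the 3 observed symbols:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def uniform_profile_prob(k, phi):
    """Probability that Uniform[k] produces profile phi, by exhaustively
    enumerating all k**n length-n sequences (n = sum(phi); fine for tiny n, k)."""
    n = sum(phi)
    hits = sum(1 for seq in product(range(k), repeat=n)
               if sorted(Counter(seq).values()) == sorted(phi))
    return Fraction(hits, k ** n)

probs = {k: uniform_profile_prob(k, [1, 1, 2]) for k in range(3, 9)}
best = max(probs, key=probs.get)
print(best, probs[best])  # 5 72/125  (= 0.576)
```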
[Plot: PML estimates from 700 samples of Uniform[500], 12 experiments]
PML plug-in
To estimate a symmetric f(p):
• Find p_PML(Φ(X^n))
• Output f(p_PML)
Advantages:
• No tuning parameters
• Not function specific
Rooted in the maximum likelihood principle
Ingredient 1: Goodness of ML
General ML plug-in estimation
𝒫: collection of distributions over an abstract domain 𝒵
f: 𝒫 → ℝ any property
Given z ∈ 𝒵 drawn from p, estimate f(p)
ML estimator:
• Determine p_MLE ≜ argmax_{p ∈ 𝒫} p(z)
• Output f(p_MLE)
How good is the MLE?
Competitiveness of the ML plug-in
Theorem: Suppose f̂: 𝒵 → ℝ is such that for all p ∈ 𝒫,
Pr_{Z∼p}[|f̂(Z) − f(p)| > ε] < δ,
then the MLE plug-in error is bounded by
Pr_{Z∼p}[|f(p_MLE) − f(p)| > 2ε] < δ · |𝒵|.
Competitive with the best f̂
Competitiveness of the MLE plug-in: proof
Consider any p ∈ 𝒫, and let 𝒵_hi ≜ {z ∈ 𝒵 : p(z) ≥ δ}
• For z ∈ 𝒵_hi:
  • |f̂(z) − f(p)| ≤ ε (otherwise the event in the Theorem would have probability ≥ p(z) ≥ δ)
  • p_MLE(z) ≥ p(z) ≥ δ, hence |f̂(z) − f(p_MLE)| ≤ ε
  • Triangle inequality: |f(p_MLE) − f(p)| ≤ 2ε
• For z ∈ 𝒵_lo ≜ {z : p(z) < δ}:
  • Pr[|f(p_MLE(Z)) − f(p)| > 2ε] ≤ Pr[𝒵_lo] < δ · |𝒵|
Ingredient 2: Error probabilities
PML performance bound
Theorem: If n = S(f, 𝒫, ε, δ), then
S_PML(f, 𝒫, 2ε, |Φ_n| · δ) ≤ n
|Φ_n|: number of profiles of length n
A profile of length n is a partition of n:
{3}, {1, 2}, {1, 1, 1} ⟷ 3, 2+1, 1+1+1
|Φ_n| = partition number of n
Hardy–Ramanujan: |Φ_n| < e^{3√n}
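The partition-number bound is easy to verify numerically; this sketch counts partitions with a standard dynamic program and compares against e^{3√n}:

```python
import math

def partitions(n):
    """Number of integer partitions of n (= number of profiles of length n),
    via the classic coin-change dynamic program."""
    p = [1] + [0] * n
    for part in range(1, n + 1):
        for total in range(part, n + 1):
            p[total] += p[total - part]
    return p[n]

for n in (10, 50, 100):
    print(n, partitions(n), "<", round(math.exp(3 * math.sqrt(n))))
# e.g. partitions(100) = 190569292, well below e^{30} ≈ 1.07e13
```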
PML performance: Try 1
Theorem: n = S(f, 𝒫, ε, 1/3) ⇒ S_PML(f, 𝒫, 2ε, 1/3) ≤ O(n²).
Proof:
• Boost the error probability (median trick):
  • Take n · ℓ independent samples, and divide them into ℓ parts
  • Estimate f from each part
  • Take the median of the estimates ⟹ δ < exp(−ℓ)
• |Φ_{nℓ}| < exp(3(nℓ)^{0.5})
Error probability of PML is at most exp(−ℓ + 3(nℓ)^{0.5})
When ℓ > 9n, the error exponent dominates the number of profiles, and the total sample count is nℓ = O(n²)
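The median-boosting step can be quantified exactly: if each of the ℓ independent estimates misses by more than ε with probability 1/3, the median misses only when at least half the estimates miss — a binomial tail that decays exponentially in ℓ. A sketch (the helper name is mine; the exact constant in the exponent is not from the slides):

```python
from math import comb

def median_failure_prob(l, delta=1/3):
    """Upper bound on the failure probability of the median of l independent
    estimates, each failing with probability delta < 1/2: the median can be
    wrong only if at least half of the individual estimates are wrong."""
    return sum(comb(l, k) * delta ** k * (1 - delta) ** (l - k)
               for k in range((l + 1) // 2, l + 1))

for l in (1, 9, 25, 49):
    print(l, median_failure_prob(l))
# the failure probability decays exponentially in l
```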
PML performance: Try 2
Recall
S(H, Δ_k, ε, 1/3) = Θ(k/(ε · log k))
With twice the samples, the error drops exponentially:
S(H, Δ_k, ε, e^{−n^{0.9}}) = Θ(k/(ε · log k))
• Modified estimators with small bounded differences
• Stronger guarantees from McDiarmid's inequality
Most technical part of the paper; similar results for other properties
Combining everything
Fast error for the properties we study:
If n = S(f, 𝒫, ε, 1/3), then S(f, 𝒫, ε, e^{−4√n}) ≤ 4n
ML plug-in result:
If S(f, 𝒫, ε, e^{−4√n}) ≤ 4n, then S_PML(f, 𝒫, 2ε, e^{−√n}) ≤ 4n
Combining, we are done!
Computing the PML distribution
• EM algorithm [Orlitsky Pan Sajama Santhanam Viswanathan Zhang '04, '13]
• Approximate PML via Bethe permanents [Vontobel '14]
• Extensions of Markov chains [Vatedka Vontobel '16]
• Approximation via relaxation [Jiao Pavlichin Weissman '17]
Chapter 5: Directions
Approximate PML
• Perhaps finding the exact PML is hard
• Can show that an approximate PML is enough
Question: Compute an exp(−n^β)-approximate PML for any β < 1
Even this is optimal (for large k)
Higher dimensions
• Estimate KL divergence between discrete distributions given samples (under assumptions, of course)
[Bu Zou Liang Veeravalli '16, Han Jiao Weissman '16]
[Acharya '18]: PML is optimal for KL divergence estimation (via higher-order partitions)
Independent proof techniques
Our guarantee shows that the MLE is good whenever some other estimator is good
• Can we bound ML performance independently of other results?
Other
Is PML optimal for every symmetric property?
Can we do something for continuous distributions?
Summary
• Symmetric property estimation
• PML plug-in approach
• Universal, simple to state
• Independent of the particular property
In Fisher's words…
"Of course nobody has been able to prove that MLE is best under all circumstances. MLE computed with all the information available may turn out to be inconsistent. Throwing away a substantial part of the information may render them consistent."
R. A. Fisher
Thank You!