Learning Mixtures of Structured Distributions over Discrete Domains
Xiaorui Sun (Columbia University)
Joint work with Siu-On Chan (UC Berkeley), Ilias Diakonikolas (U Edinburgh), Rocco Servedio (Columbia University)
Density Estimation
• PAC-type learning model
• C: a set of possible target distributions over [n]
• Learner
 – Knows the set C but does not know the target distribution p ∈ C
 – Independently draws a few samples from p
 – Outputs (a succinct description of) a distribution h which is ε-close to p
• Total variation distance d_TV(p, h) = (1/2) Σ_i |p(i) − h(i)| is the standard measure in statistics
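The total variation distance between two distributions over [n] is easy to compute directly; a minimal Python sketch (illustrative only, not part of the talk):

```python
def total_variation(p, q):
    """d_TV(p, q) = (1/2) * sum over i of |p(i) - q(i)|."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Two distributions over a domain of size 4.
p = [0.4, 0.3, 0.2, 0.1]
q = [0.25, 0.25, 0.25, 0.25]
print(total_variation(p, q))  # ~0.2
```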
Learn a structured distribution
• If C = {all distributions over [n]}, Ω(n/ε²) samples are required
• Much better sample complexities are possible for structured distributions
 – Poisson binomial distributions [DDS12a]: Õ(1/ε³) samples
 – Monotone / k-modal [Bir87, DDS12b]: O(log(n)/ε³) samples / Õ(k log(n)/ε⁴) samples
This work: learn mixtures of structured distributions
• Learn a mixture of k distributions?
 – C: a set of distributions over [n]
 – The target distribution p is a mixture of k distributions from C
 – i.e. p = μ₁p₁ + … + μ_k p_k, with p_i ∈ C, μ_i ≥ 0, such that Σ_i μ_i = 1
• Our result: learn mixtures for several structured distribution classes
 – Sample complexity close to optimal
 – Efficient running time
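A k-mixture Σ_i μ_i p_i can be evaluated pointwise, or sampled by first choosing a component; a small sketch (function names and the list-of-probabilities convention are mine):

```python
import random

def mixture_pmf(weights, components):
    """Pointwise pmf of p = sum_i mu_i * p_i over a common domain [n]."""
    n = len(components[0])
    return [sum(w * c[j] for w, c in zip(weights, components)) for j in range(n)]

def sample_mixture(weights, components, rng=random):
    """Choose component i with probability mu_i, then draw from p_i."""
    i = rng.choices(range(len(weights)), weights=weights)[0]
    return rng.choices(range(len(components[i])), weights=components[i])[0]

weights = [0.7, 0.3]
components = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
print(mixture_pmf(weights, components))  # [0.35, 0.35, 0.15, 0.15]
```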
Our results: learning mixtures of log-concave distributions
• Log-concave distribution p over [n]
 – The support of p is an interval
 – p(i)² ≥ p(i−1) p(i+1) for 1 < i < n
[figure: a bell-shaped log-concave distribution over the domain 1…n]
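The defining inequality is straightforward to test on a given pmf; a hypothetical checker sketch (not from the talk):

```python
def is_log_concave(p):
    """Check contiguous support and p(i)^2 >= p(i-1) * p(i+1)."""
    support = [i for i, x in enumerate(p) if x > 0]
    if support and support[-1] - support[0] + 1 != len(support):
        return False  # support has a gap, so p is not log-concave
    return all(p[i] ** 2 >= p[i - 1] * p[i + 1] for i in range(1, len(p) - 1))

print(is_log_concave([0.05, 0.25, 0.4, 0.25, 0.05]))  # True
print(is_log_concave([0.45, 0.05, 0.45, 0.05, 0.0]))  # False
```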
Our results: log-concave
• Algorithm to learn a mixture of k log-concave distributions
 – Sample complexity: Õ(k/ε⁴)
 – Running time: near-linear in the number of samples (bit operations)
• Lower bound: Ω(k/ε^{5/2}) samples
Our results: mixture of unimodal distributions
• Unimodal distribution p over [n]
 – There is a mode m ∈ [n] s.t. p is non-decreasing on [1, m] and non-increasing on [m, n]
[figure: a single-peaked distribution over the domain 1…n]
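Unimodality can be verified with one pass (climb to a mode, then descend); an illustrative sketch:

```python
def is_unimodal(p):
    """True if p is non-decreasing up to some mode, non-increasing after it."""
    i, n = 0, len(p)
    while i + 1 < n and p[i] <= p[i + 1]:   # climb to the mode
        i += 1
    while i + 1 < n and p[i] >= p[i + 1]:   # descend after the mode
        i += 1
    return i == n - 1

print(is_unimodal([0.1, 0.2, 0.4, 0.2, 0.1]))  # True
print(is_unimodal([0.3, 0.1, 0.3, 0.2, 0.1]))  # False (two peaks)
```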
Our results: mixture of unimodal distributions
• A mixture of 2 unimodal distributions may have 2 modes (a mixture of k may have k modes)
• Algorithm to learn a mixture of k unimodal distributions
 – Sample complexity: Õ(k log(n)/ε⁴) samples
 – Running time: near-linear in the number of samples (bit operations)
• Lower bound: Ω(k log(n)/ε³) samples
Our results: mixture of MHR distributions
• Monotone hazard rate (MHR) distribution
 – Hazard rate of p at i: H(i) = p(i) / Σ_{j≥i} p(j)
 – MHR distribution: H is a non-decreasing function over the support of p
[figure: an MHR distribution over the domain 1…n]
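The hazard rate is computable from tail sums, which makes the MHR property easy to check; a sketch (function names are mine):

```python
def hazard_rate(p):
    """H(i) = p(i) / sum_{j >= i} p(j), at the points where the tail mass is positive."""
    tails = [0.0] * len(p)
    tail = 0.0
    for i in range(len(p) - 1, -1, -1):
        tail += p[i]
        tails[i] = tail
    return [p[i] / tails[i] for i in range(len(p)) if tails[i] > 0]

def is_mhr(p):
    h = hazard_rate(p)
    return all(a <= b for a, b in zip(h, h[1:]))

# A geometric-like distribution has (nearly) constant hazard rate, hence is MHR.
print(is_mhr([0.5, 0.25, 0.125, 0.125]))  # True
print(is_mhr([0.8, 0.1, 0.1]))            # False: hazard rate dips after 0
```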
Our results: mixture of MHR distributions
• Algorithm to learn a mixture of k MHR distributions
 – Sample complexity: Õ(k log(n/ε)/ε⁴)
 – Running time: near-linear in the number of samples (bit operations)
• Lower bound: Ω(k log(n/ε)/ε³) samples
Compare with parameter estimation
• Parameter estimation [KMV10, MV10]
 – Learn a mixture of k Gaussians
 – Independently draw a few samples from p
 – Estimate the parameters of each Gaussian component accurately
• The number of samples inherently depends exponentially on k, even for a mixture of 1-dimensional normal distributions [MV10]
Compare with parameter estimation
• Parameter estimation needs at least exp(k) samples to learn a mixture of k binomial distributions
 – Similar to the lower bound in [MV10]
• Density estimation makes it possible to estimate non-parametric distributions
 – E.g. log-concave, unimodal, MHR
• Density estimation learns a mixture of k binomial distributions over [n] using Õ(k/ε⁴) samples
 – A binomial distribution is log-concave
Outline
• Learning algorithm based on decomposition
• Structural results for log-concave, unimodal, MHR distributions
Flat decomposition
• Key definition: a distribution p is (ε, t)-flat if there exists a partition I of [n] into t intervals such that d_TV(p, (p)_I) ≤ ε
 – I is an (ε, t)-flat decomposition for p
• (p)_I is obtained by "flattening" p within each interval
 – (p)_I(j) = p(I_i)/|I_i| for j ∈ I_i
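Flattening over a given interval partition is mechanical; a sketch with intervals as half-open (start, end) index pairs (my convention, illustrative only):

```python
def flatten(p, intervals):
    """(p)_I: spread the mass of each interval [a, b) uniformly within it."""
    q = [0.0] * len(p)
    for a, b in intervals:  # disjoint intervals covering 0..len(p)-1
        mass = sum(p[a:b])
        for j in range(a, b):
            q[j] = mass / (b - a)
    return q

p = [0.1, 0.3, 0.1, 0.1, 0.4]
print(flatten(p, [(0, 2), (2, 5)]))  # each entry ~0.2
```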
Flat decomposition
[figure: a distribution over 1…n and its flattening over a partition into intervals]
Learn (ε, t)-flat distributions
• Main general Thm: Let C = {all (ε, t)-flat distributions}. There is an algorithm which draws Õ(t/ε³) samples from p ∈ C, and outputs a hypothesis h such that d_TV(p, h) ≤ O(ε).
• Linear running time with respect to the number of samples
Easier problem: known decomposition
• Given
 – Samples from an (ε, t)-flat distribution p
 – An (ε, t)-flat decomposition I for p
• Idea: estimate the probability mass of every interval in I
• O(t/ε²) samples are enough
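With the decomposition in hand, the learner can simply spread each interval's empirical mass uniformly within it; a sketch (function name and interval convention are mine):

```python
import random

def learn_with_known_decomposition(samples, intervals, n):
    """Hypothesis: empirical mass of each interval, spread uniformly within it."""
    m = len(samples)
    h = [0.0] * n
    for a, b in intervals:
        mass = sum(1 for s in samples if a <= s < b) / m
        for j in range(a, b):
            h[j] = mass / (b - a)
    return h

rng = random.Random(0)
samples = [rng.randrange(2) for _ in range(1000)]  # target: uniform over {0, 1}
h = learn_with_known_decomposition(samples, [(0, 2), (2, 4)], 4)
print(h)  # [0.5, 0.5, 0.0, 0.0]
```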
Real problem: unknown decomposition
• Only given samples from an (ε, t)-flat distribution p
• Some (ε, t)-flat decomposition I for p exists, but it is unknown
• A useful fact [DDS+13]: if I is an (ε, t)-flat decomposition of p, and J is a "refinement" of I with t′ intervals, then J is a (2ε, t′)-flat decomposition of p
 – So if we know a refinement of I, that is good enough
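A refinement of two interval partitions (cut at every boundary appearing in either) is cheap to form; a small sketch, again with half-open (start, end) intervals as my convention:

```python
def common_refinement(part1, part2, n):
    """Cut [0, n) at every interval boundary appearing in either partition."""
    cuts = sorted({a for a, _ in part1} | {a for a, _ in part2} | {n})
    return list(zip(cuts, cuts[1:]))

I = [(0, 4), (4, 10)]
K = [(0, 2), (2, 7), (7, 10)]
print(common_refinement(I, K, 10))  # [(0, 2), (2, 4), (4, 7), (7, 10)]
```

Note that the refinement has at most |part1| + |part2| intervals, which is what makes refinements affordable.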
Unknown flat decomposition (cont.)
• Idea: partition [n] into Θ(t/ε) intervals 𝒦, each with small probability mass
 – Achieved by sampling from p
[figure: a partition 𝒦 of the domain 1…n into intervals of small mass]
Unknown flat decomposition (cont.)
• There exists an (unknown) partition 𝒥
 – A refinement of both I and 𝒦
 – At most |I| + |𝒦| intervals
[figure: the common refinement 𝒥 of I and 𝒦 over the domain 1…n]
Unknown flat decomposition (cont.)
• There exists 𝒥
 – A refinement of both I and 𝒦
 – At most |I| + |𝒦| intervals
 – A (2ε, |𝒥|)-flat decomposition for p
[figure: the refinement 𝒥 drawn over the domain 1…n]
Unknown flat decomposition (cont.)
• Compare (p)_𝒥 and (p)_𝒦
[figure: the flattenings (p)_𝒥 and (p)_𝒦 over the domain 1…n]
Unknown flat decomposition (cont.)
• If the total probability mass of every interval of 𝒦 is at most ε/t, then d_TV((p)_𝒦, (p)_𝒥) ≤ 2ε
• Partition [n] into Θ(t/ε) intervals, each with probability mass at most ε/t
 – Õ(t/ε³) samples are enough
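The partition into intervals of small empirical mass can be built greedily from a sample; a sketch (the greedy construction is my choice for illustration, the talk does not specify one):

```python
import random

def small_mass_partition(samples, n, mass_bound):
    """Greedily cut [0, n) so each interval's empirical mass is at most mass_bound.
    A single heavy point may force a one-point interval exceeding the bound."""
    m = len(samples)
    counts = [0] * n
    for s in samples:
        counts[s] += 1
    intervals, start, acc = [], 0, 0
    for j in range(n):
        if j > start and acc + counts[j] > mass_bound * m:
            intervals.append((start, j))
            start, acc = j, 0
        acc += counts[j]
    intervals.append((start, n))
    return intervals

rng = random.Random(1)
samples = [rng.randrange(8) for _ in range(4000)]  # roughly uniform over [0, 8)
parts = small_mass_partition(samples, 8, 0.25)
print(parts)  # consecutive intervals tiling [0, 8), each with empirical mass <= ~0.25
```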
Learn (ε, t)-flat distributions
• Main general Thm: Let C = {all (ε, t)-flat distributions}. There is an algorithm which draws Õ(t/ε³) samples from p ∈ C, and outputs a hypothesis h such that d_TV(p, h) ≤ O(ε).
Learn mixture of distributions
• Lem:A mixture of -flat distributions has an -flat decomposition– Tight for interesting distribution classes
• Thm(Learn mixture): Let be a mixture of -flat distributions. There is an algorithm which draws samples, and outputs a hypothesis s.t.
First application: learning mixtures of log-concave distributions
• Recall the definition:
 – The support of p is an interval
 – p(i)² ≥ p(i−1) p(i+1) for 1 < i < n
• Lem: Every log-concave distribution is (ε, O(log(1/ε)/ε))-flat
• Learn a mixture of k log-concave distributions with Õ(k/ε⁴) samples
Second application: learning mixtures of unimodal distributions
• Lem: Every unimodal distribution is (ε, O(log(n)/ε))-flat [Bir87, DDS+13]
• Learn a mixture of k unimodal distributions with Õ(k log(n)/ε⁴) samples
Third application: learning mixtures of MHR distributions
• Monotone hazard rate distribution
 – Hazard rate of p at i: H(i) = p(i) / Σ_{j≥i} p(j)
 – H is a non-decreasing function over the support of p
• Lem: Every MHR distribution is (ε, O(log(n/ε)/ε))-flat
• Learn a mixture of k MHR distributions with Õ(k log(n/ε)/ε⁴) samples
Conclusion and further directions
• Flat decomposition is a useful way to study mixtures of structured distributions
• Extend to higher dimensions?
• Efficient algorithms with optimal sample complexity?

| Distribution | Sample complexity | Lower bound |
| --- | --- | --- |
| Log-concave | Õ(k/ε⁴) | Ω(k/ε^{5/2}) |
| Unimodal | Õ(k log(n)/ε⁴) | Ω(k log(n)/ε³) |
| MHR | Õ(k log(n/ε)/ε⁴) | Ω(k log(n/ε)/ε³) |
Thank you !