
Learning Juntas

Elchanan Mossel

UC Berkeley

Ryan O’Donnell

MIT

Rocco Servedio

Harvard

What’s a junta?

junta:
– A council or committee for political or governmental purposes
– A group of persons controlling a government
– A junto

junta:
– A Boolean function which depends on only k << n Boolean variables

Example: a 3-junta

x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 | f(x)
 1  1  0  1  1  1  1  1  0  1  |  1
 1  1  0  1  0  0  1  0  0  0  |  0
 0  1  0  1  0  1  1  0  0  1  |  1
 0  0  1  1  0  1  0  1  1  0  |  1
 1  1  0  1  0  1  0  1  0  1  |  0

f(x1,...,x10) = x3 OR (x6 AND x7)

Learning juntas

The problem: you get data labeled according to some k-junta. What’s the junta?

x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 | f(x)
 1  1  0  1  1  1  1  1  0  1  |  1
 1  1  0  0  1  1  0  0  1  0  |  0
 0  1  0  1  1  1  1  0  0  1  |  1
 0  0  1  0  0  0  1  1  0  0  |  1
 1  1  0  1  0  1  0  1  0  1  |  0

Outline of talk

• Motivation

• Warm-ups

• Our results

• How we do it

• Future work

Why learn juntas?

• Natural, general problem (no assumptions on f)

• Real-world learning problems often have lots of irrelevant information

• Important special case of notorious open questions in learning theory: learning DNF, learning decision trees...

Learning decision trees

Given data labeled according to some decision tree, what’s the tree?

[Figure: a decision tree branching on x5, x3, x1, x2, x4, and x6, with 0/1 leaves]

Learning decision trees (cont)

• Any k-junta is expressible as a decision tree of size 2^k

• So to learn poly(n)-size decision trees, we must be able to learn log(n)-juntas.

Big open question: are decision trees of size poly(n) learnable in poly(n) time?

Similar situation for learning DNF.

Learning decision trees (cont)

• If we can learn log(n)-juntas, we can learn decision trees of size log(n); even this would be a big step forward.

So progress on juntas is necessary for progress on decision trees.

It’s also sufficient!

Again, similar situation for DNF.

The problem: PAC learn k-juntas under uniform

• Setup: we get random examples (x^1, f(x^1)), (x^2, f(x^2)), … where
– each x^i is uniform from {0,1}^n
– f is an unknown k-junta

• Goal: output h such that, with very high probability, Pr[h(x) ≠ f(x)] < ε.

The problem refined

• Setup: we get random examples (x^1, f(x^1)), (x^2, f(x^2)), … where
– each x^i is uniform from {0,1}^n
– f is an unknown k-junta

• Goal: output h such that Pr[h(x) ≠ f(x)] < ε.

• Equivalent goal: output h = f
• Equivalent goal: find the k relevant variables of f

What’s known?

• Easy lower bound: need at least 2^k + k log n examples

• Easy information-theoretic upper bound: 2^k + k log n examples are sufficient

• Easy computational upper bound: there are (n choose k) possible sets of relevant variables, so can do exhaustive search in 2^{O(k)} · (n choose k) = O(n^k) time

Can we learn in time poly(n, 2^k)?

Variant #1: membership queries

• If learner can make queries, can learn in poly(n, 2^k) time.
– Draw random points. If all positive or all negative, done. Otherwise, “walk” from a positive point to a negative point to identify a relevant variable (see the sketch below):

1 1 0 1 0 0 1 0 1 ; 1   (positive point)
0 1 0 1 1 0 0 1 1 ; 0   (negative point)

0 1 0 1 0 0 1 0 1 ; 1
0 1 0 1 1 0 1 0 1 ; 1
0 1 0 1 1 0 0 0 1 ; 0   (the label flips when x7 flips, so x7 is relevant)

– Recurse.
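A minimal Python sketch of this walk (my illustration, not the talk’s code; f stands in for membership-query access):

```python
def find_relevant_variable(f, pos, neg):
    """Walk from a positive example toward a negative one, flipping one
    differing coordinate at a time; the first flip that changes the
    label exposes a relevant variable. f is a membership-query oracle."""
    x = list(pos)
    label = f(tuple(x))
    for i in range(len(x)):
        if x[i] != neg[i]:
            x[i] = neg[i]                 # take one step of the walk
            new_label = f(tuple(x))
            if new_label != label:
                return i                  # flipping x_i changed f, so x_i is relevant
            label = new_label
    raise ValueError("endpoints must have different labels")

# Example with the 3-junta from earlier: f = x3 OR (x6 AND x7)
f = lambda x: x[2] | (x[5] & x[6])
print(find_relevant_variable(f, pos=(1,) * 10, neg=(0,) * 10))  # prints 5, i.e. x6
```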

Variant #2: monotone functions

• If the junta is monotone, can learn in poly(n, 2^k) time.
– If xi is irrelevant, have Pr[f(x) = 1 | xi = 1] = Pr[f(x) = 1 | xi = 0].
– If xi is relevant, have Pr[f(x) = 1 | xi = 1] > Pr[f(x) = 1 | xi = 0].
– Each probability is an integer multiple of 1/2^k.
– So can test each variable in poly(2^k) time (see the sketch below).
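A sampling sketch of this test (mine; the oracle f, the names, and the sample count are illustrative assumptions): estimate the two conditional probabilities and threshold the gap halfway between 0 and 1/2^k.

```python
import random

def is_relevant_monotone(f, i, n, k, trials=None):
    """Estimate Pr[f(x)=1 | x_i=1] - Pr[f(x)=1 | x_i=0] from uniform
    samples. For a monotone k-junta the gap is 0 if x_i is irrelevant
    and at least 1/2^k if x_i is relevant, so threshold halfway."""
    trials = trials or 100 * 4 ** k      # enough samples to resolve a 1/2^k gap whp
    ones, seen = [0, 0], [0, 0]          # counts split by the value of x_i
    while min(seen) < trials:
        x = tuple(random.randint(0, 1) for _ in range(n))
        seen[x[i]] += 1
        ones[x[i]] += f(x)
    gap = ones[1] / seen[1] - ones[0] / seen[0]
    return gap > 1 / 2 ** (k + 1)
```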

Variant #3: random functions

• If the junta is random, whp can learn in poly(n, 2^k) time.
– If xi is irrelevant, have Pr[f(x) = xi] = 1/2 for sure.
– If xi is relevant, have Pr[ Pr[f(x) = xi] = 1/2 ] ≈ 1/2^{k/2}.
– Each probability is an integer multiple of 1/2^k.
– So whp can find the relevant variables this way (see the sketch below).
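The analogous check for this variant, again as an illustrative sketch (function name and sample count are my assumptions): estimate each coordinate’s agreement with the label and flag deviations from 1/2.

```python
import random

def likely_relevant(f, n, k, samples):
    """Estimate Pr[f(x) = x_i] for every coordinate. Irrelevant
    variables give exactly 1/2; since each probability is an integer
    multiple of 1/2^k, a true deviation is at least 1/2^k, so we
    threshold halfway. (For a random junta, a relevant x_i shows no
    deviation only with probability about 1/2^(k/2).)"""
    agree = [0] * n
    for _ in range(samples):
        x = tuple(random.randint(0, 1) for _ in range(n))
        y = f(x)
        for i in range(n):
            agree[i] += (x[i] == y)
    return [i for i in range(n)
            if abs(agree[i] / samples - 0.5) > 1 / 2 ** (k + 1)]
```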

Back to real problem

• Lower bound: need at least 2^k + k log n examples

• Upper bound: there are (n choose k) possible sets of relevant variables, so can do exhaustive search in 2^{O(k)} · (n choose k) time

Can we learn in time poly(n, 2^k)?

Previous work

• [Blum & Langley, 1994] suggested problem

• Little progress until….

• [Kalai & Mansour, 2001] gave an algorithm that learns in time n^{k − k^{1/2}}

Our result

• We give an algorithm that learns in time n^{ωk/(ω+1)}, where ω < 2.376 is the matrix multiplication exponent. So currently this is n^{.704k}.

The main idea

• Let g be the hidden k-bit function

• Look at two different representations for g:
– Only “weird” functions are hard to learn under the first representation
– Only “perverse” functions are hard to learn under the second representation
– No function is both weird and perverse

First representation:real polynomial

• View inputs and outputs as −1/1 valued

• Fact: every Boolean function g: {−1,1}^k → {−1,1} has a unique interpolating real polynomial g_R(x1, x2, …, xk)
– g_R’s coefficients are the Fourier coefficients of g
– Examples:
• parity on x1, x2, …, xk: polynomial is x1 x2 ⋯ xk
• x1 AND x2: polynomial is (1 + x1 + x2 − x1 x2)/2
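To make the representation concrete, here is a brute-force sketch (mine, not from the talk) that recovers g_R’s coefficients by averaging, and verifies the AND example under the usual convention that the bit b is encoded as (−1)^b, so −1 plays the role of TRUE:

```python
from itertools import product
from math import prod

def real_polynomial(g, k):
    """Brute-force the coefficients of the unique multilinear real
    polynomial interpolating g on {-1,1}^k: the coefficient of the
    monomial x_T equals the average of g(x) * prod_{i in T} x_i."""
    points = list(product([-1, 1], repeat=k))
    coeffs = {}
    for T in product([0, 1], repeat=k):   # T as a 0/1 indicator vector
        total = sum(g(x) * prod(x[i] for i in range(k) if T[i]) for x in points)
        coeffs[T] = total / len(points)
    return coeffs

# AND of two variables, with -1 encoding TRUE:
AND = lambda x: -1 if x == (-1, -1) else 1
print(real_polynomial(AND, 2))
# {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.5, (1, 1): -0.5}, i.e. (1 + x1 + x2 - x1 x2)/2
```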

Real polynomials

• Fourier coefficients measure the correlation of g with the corresponding parities:

E[g(x) · x_T] = coefficient of x_T in g_R,   where x_T = ∏_{i∈T} xi

• So given a set T of variables, can estimate the coefficient of x_T via sampling (see the sketch below)
– Nonzero only if every variable in T is relevant
– Problem: may have to test all sets of up to k variables to find a nonzero coefficient
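The estimate-by-sampling step might look like the following sketch, where draw_example is a hypothetical source of uniform labeled examples with ±1-valued coordinates and labels:

```python
from math import prod

def estimate_fourier_coefficient(draw_example, T, samples):
    """Estimate E[f(x) * x_T], the coefficient of x_T in f_R, from
    random labeled examples; whp the estimate is far from 0 only if
    the true coefficient is nonzero, which requires every variable
    in T to be relevant."""
    total = 0.0
    for _ in range(samples):
        x, y = draw_example()        # x in {-1,1}^n, y = f(x) in {-1,1}
        total += y * prod(x[i] for i in T)
    return total / samples
```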

First technical theorem:

Let g be a Boolean function on k variables such that g_R has nonzero constant term:

g_R(x) = c_0 + Σ_{|T| ≥ s} c_T x_T,   with c_0 ≠ 0

(s = degree of the smallest nontrivial monomial). Then s ≤ 2k/3.

Second representation: GF(2) polynomial

• View inputs and outputs as 0/1 valued

• Fact: every Boolean function g: {0,1}^k → {0,1} has a unique interpolating GF(2) polynomial g_2(x1, x2, …, xk)

• Examples:
• parity on x1, x2, …, xk: polynomial is x1 + x2 + ⋯ + xk
• x1 AND ⋯ AND xk: polynomial is x1 x2 ⋯ xk
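A brute-force sketch of this representation (my code): the coefficient of the monomial on a set S is the XOR of g over all inputs whose support lies inside S (the GF(2) Möbius transform), which reproduces both examples above:

```python
from itertools import product

def gf2_polynomial(g, k):
    """Compute the coefficients of the unique GF(2) polynomial
    interpolating g: {0,1}^k -> {0,1}. The coefficient of the monomial
    indexed by S is the XOR of g(x) over all x supported inside S."""
    coeffs = {}
    for S in product([0, 1], repeat=k):
        bit = 0
        for x in product([0, 1], repeat=k):
            if all(x[i] <= S[i] for i in range(k)):   # support(x) inside S
                bit ^= g(x)
        coeffs[S] = bit
    return coeffs

# parity(x1, x2) comes out as x1 + x2; AND(x1, x2) comes out as x1*x2:
print(gf2_polynomial(lambda x: x[0] ^ x[1], 2))   # 1 exactly on (1,0) and (0,1)
print(gf2_polynomial(lambda x: x[0] & x[1], 2))   # 1 exactly on (1,1)
```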

Learning parities

• Suppose g is some parity function, e.g. g(x) = parity(x1, x2, x4)

• Can add labeled examples mod 2:

  0 1 0 1 0 0 1 0 1 ; 0
+ 1 1 1 1 1 0 1 0 1 ; 1
= 1 0 1 0 1 0 0 0 0 ; 1

Learning parities (cont)

• Given a set of labeled examples, can do Gaussian elimination to obtain the row

1 0 0 0 0 0 0 0 0 ; b

• Will have b = 1 iff x1 is in the parity

• Repeat for x2, …, xn to learn the parity (see the sketch below)
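A self-contained sketch of the parity learner (my rendering, assuming the uniform examples span all n coordinates, which happens whp after O(n) draws): run Gauss-Jordan elimination over GF(2) on the augmented matrix of examples and read the parity off the pivot rows.

```python
def learn_parity(examples, n):
    """Recover the coefficient vector a of an unknown parity
    y = a . x (mod 2) from labeled examples, by Gaussian elimination
    over GF(2). Each example (x, y) is one linear equation in a;
    adding two labeled examples mod 2 gives another valid example."""
    rows = [list(x) + [y] for x, y in examples]   # augmented matrix [x | y]
    r = 0
    for c in range(n):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c]), None)
        if pivot is None:
            continue                    # column c not spanned by the sample
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):      # zero out column c in all other rows
            if i != r and rows[i][c]:
                rows[i] = [u ^ v for u, v in zip(rows[i], rows[r])]
        r += 1
    a = [0] * n
    for row in rows[:r]:                # pivot row for column c reads off a_c
        a[row.index(1)] = row[-1]
    return a

# The parity(x1, x2, x4) example from the previous slide; with enough
# examples, learn_parity(exs, 9) converges to [1, 1, 0, 1, 0, 0, 0, 0, 0]:
exs = [((0,1,0,1,0,0,1,0,1), 0), ((1,1,1,1,1,0,1,0,1), 1), ((1,0,1,0,1,0,0,0,0), 1)]
```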

Learning GF2 polynomials

• Given any g: {0,1}^k → {0,1}, can view g_2 as a parity over monomials (ANDs)

• If deg(g_2) = d, have ≤ k^d monomials

• In the junta setting, have ≤ n^d monomials (see the sketch below)
– Problem: d could be as large as k
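The reduction to parity learning is then a feature expansion, sketched below (names are mine): map each example to the 0/1 evaluations of all monomials of degree ≤ d, and run learn_parity from the previous sketch over the expanded features.

```python
from itertools import combinations

def expand(x, d):
    """Map x in {0,1}^n to the evaluations of all monomials of degree
    <= d (the AND of each small subset of coordinates), plus a constant
    monomial; a degree-d GF(2) polynomial in x is exactly a parity over
    these features."""
    feats = [1]                                  # degree-0 monomial (constant term)
    for size in range(1, d + 1):
        for S in combinations(range(len(x)), size):
            feats.append(min(x[i] for i in S))   # AND of the bits indexed by S
    return tuple(feats)

# e.g. learn a degree-d GF(2) polynomial from examples (x, y) via:
#   learn_parity([(expand(x, d), y) for x, y in examples],
#                len(expand(examples[0][0], d)))
```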

Second technical theorem:

Let g be a Boolean function on k variables such that g_R has zero constant term:

g_R(x) = Σ_{|T| ≥ s} c_T x_T.

Then deg(g_2) ≤ k − s.

Algorithm to learn k-juntas

• Sample to test whether f is constant

• If not, sample to estimate the Fourier coefficients of all sets of up to αk variables
– Nonzero coefficient on a set of size m: recurse on all 2^m settings of those variables
– All small coefficients zero: run the parity-learning algorithm with monomials of size up to (1 − α)k

Why does it work?

• If f is unbalanced, will find a nonzero coefficient of size at most 2k/3 ≤ αk

• If f is balanced and all coefficients of size ≤ αk are zero, then s > αk, so deg(f_2) ≤ (1 − α)k and the parity-learning algorithm is guaranteed to succeed
• So either way, we make progress

Take α ≥ 2/3.

Running time

• Checking sets of (up to) αk variables takes n^{αk} time

• Running Gaussian elimination on monomials of size (1 − α)k takes n^{ω(1−α)k} time (ω = matrix multiplication exponent)

• So the best choice of α balances the two, giving n^{ωk/(ω+1)} ≈ n^{.704k} (worked out below)
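The balancing arithmetic, written out (my restatement of the slide’s calculation):

```latex
% Equate the two costs and solve for alpha:
n^{\alpha k} = n^{\omega(1-\alpha)k}
\;\Longrightarrow\; \alpha = \omega(1-\alpha)
\;\Longrightarrow\; \alpha = \frac{\omega}{\omega+1} \approx \frac{2.376}{3.376} \approx 0.704,
\qquad \text{giving running time } n^{\omega k/(\omega+1)} \approx n^{0.704k}.
```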

What else can we do?

• Restrictions:

• Can look at f under “small” restrictions, e.g. fixing x1 = 0, x3 = 1, x4 = 0, x9 = 0:

x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
0  x2 1  0  x5 x6 x7 x8 0  x10
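Applying such a restriction is mechanical; a small sketch (the helper restrict is mine):

```python
def restrict(f, fixed, n):
    """Return the restriction of f: fix the coordinates in `fixed`
    (a dict mapping index -> bit) and leave the rest free; the restricted
    function takes just the free coordinates, in their original order."""
    def g(free_bits):
        it = iter(free_bits)
        return f(tuple(fixed[i] if i in fixed else next(it) for i in range(n)))
    return g

# The restriction pictured above (x1=0, x3=1, x4=0, x9=0, 1-indexed),
# applied to the example junta f(x) = x3 OR (x6 AND x7):
f = lambda x: x[2] | (x[5] & x[6])
g = restrict(f, {0: 0, 2: 1, 3: 0, 8: 0}, 10)   # g now takes the 6 free bits
```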

A question

• Suppose g: {−1,1}^k → {−1,1} has all its Fourier weight on large sets:

g_R(x) = Σ_{|T| > 2k/3} ĝ_T x_T.

• Must there be some restriction ρ fixing at most 2k/3 variables such that g(ρ(x)) is a parity function?

• If yes, can learn k-juntas in time n^{2k/3}

Future work

• Faster algorithms?

• Non-binary input alphabets? – (non-binary outputs easy)

• Non-uniform distributions?– Product distributions?– General distributions?