Learning Juntas (transcript of a talk). Elchanan Mossel (UC Berkeley), Ryan O'Donnell (MIT), Rocco Servedio (Harvard).
What’s a junta?
junta:
– A council or committee for political or governmental purposes
– A group of persons controlling a government
– A junto
junta:
– A Boolean function which depends on only k << n of its n Boolean variables
Example: a 3-junta
f(x1,...,x10) = x3 OR (x6 AND x7)

x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 | f(x)
 1  1  0  1  1  1  1  1  0  1  |  1
 1  1  0  1  0  0  1  0  0  0  |  0
 0  1  0  1  0  1  1  0  0  1  |  1
 0  0  1  1  0  1  0  1  1  0  |  1
 1  1  0  1  0  1  0  1  0  1  |  0
Learning juntas
The problem: you get data labeled according to some k-junta. What’s the junta?
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 | f(x)
 1  1  0  1  1  1  1  1  0  1  |  1
 1  1  0  0  1  1  0  0  1  0  |  0
 0  1  0  1  1  1  1  0  0  1  |  1
 0  0  1  0  0  0  1  1  0  0  |  1
 1  1  0  1  0  1  0  1  0  1  |  0
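To make the data format concrete, here is a small Python sketch (illustrative code, not from the talk) that generates labeled examples from the 3-junta f(x1,...,x10) = x3 OR (x6 AND x7):

```python
import random

def f(x):
    # the 3-junta from the slides: f(x1,...,x10) = x3 OR (x6 AND x7)
    # (variables are 1-indexed on the slides, 0-indexed here)
    return x[2] | (x[5] & x[6])

# draw five uniform-random examples and label them with f
random.seed(0)
examples = [[random.randint(0, 1) for _ in range(10)] for _ in range(5)]
for x in examples:
    print(*x, ';', f(x))
```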
Why learn juntas?
• Natural, general problem (no assumptions on f )
• Real-world learning problems often have lots of irrelevant information
• Important special case of notorious open questions in learning theory: learning DNF, learning decision trees...
Learning decision trees

Given data labeled according to some decision tree, what's the tree?

[Figure: an example decision tree with internal nodes labeled x5, x3, x1, x2, x1, x4, x6 and leaves labeled 0 or 1.]
Learning decision trees (cont)
• Any k-junta is expressible as a decision tree of size 2^k
• So to learn poly(n)-size decision trees, must be able to learn (log n)-juntas.

Big open question: are decision trees of size poly(n) learnable in poly(n) time?
Similar situation for learning DNF.
Learning decision trees (cont)
• If we can learn (log n)-juntas, we can learn decision trees of size log n… even this would be a big step forward.
So progress on juntas is necessary for progress on decision trees.
It’s also sufficient!
Again, similar situation for DNF.
The problem: PAC learn k-juntas under uniform
• Setup: we get random examples (x^1, f(x^1)), (x^2, f(x^2)), … where
  – each x^i is uniform from {0,1}^n
  – f is an unknown k-junta
• Goal: output h such that whp Pr[h(x) ≠ f(x)] < ε.
The problem refined
• Setup: we get random examples (x^1, f(x^1)), (x^2, f(x^2)), … where
  – each x^i is uniform from {0,1}^n
  – f is an unknown k-junta
• Goal: output h such that Pr[h(x) ≠ f(x)] < ε.
• Equivalent goal: output h = f
• Equivalent goal: find the k relevant variables of f
What's known?

• Easy lower bound: need at least 2^k + k log n examples
• Easy information-theoretic upper bound: 2^k + k log n examples are sufficient
• Easy computational upper bound: there are (n choose k) possible sets of relevant variables, so can do exhaustive search in 2^(O(k)) · (n choose k) = O(n^k) time

Can we learn in time poly(n, 2^k)?
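The exhaustive-search upper bound can be sketched in Python (an illustrative sketch, not the talk's code): try every size-k subset of variables and keep the first one on which the examples are consistent.

```python
from itertools import combinations, product

def exhaustive_junta_search(examples, n, k):
    """Try every size-k subset of variables (there are C(n,k) of them);
    accept the first subset on which the examples are consistent, i.e.
    no two examples agree on the subset but carry different labels.
    With enough random examples this pins down the relevant variables,
    for a total of 2^O(k) * C(n,k) = O(n^k) work."""
    for subset in combinations(range(n), k):
        table = {}
        consistent = True
        for x, y in examples:
            key = tuple(x[i] for i in subset)
            if table.setdefault(key, y) != y:
                consistent = False
                break
        if consistent:
            return subset, table   # relevant variables + partial truth table
    return None

# demo: labels come from x1 XOR x3 on n = 4 variables (0-indexed: x[0] ^ x[2])
examples = [(list(x), x[0] ^ x[2]) for x in product([0, 1], repeat=4)]
print(exhaustive_junta_search(examples, 4, 2)[0])   # (0, 2)
```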
Variant #1: membership queries
• If the learner can make queries, can learn in poly(n, 2^k) time.
  – Draw random points. If all positive or all negative, done. Otherwise, "walk" from a positive point to a negative point to identify a relevant variable:
  – Recurse.

1 1 0 1 0 0 1 0 1 ; 1
0 1 0 1 1 0 0 1 1 ; 0

0 1 0 1 0 0 1 0 1 ; 1
0 1 0 1 1 0 1 0 1 ; 1
0 1 0 1 1 0 0 0 1 ; 0

(The last flip, of x7, changes the label, so x7 is relevant.)
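The "walk" step can be sketched as follows (a hypothetical helper, assuming membership-query access to f as a Python callable):

```python
def find_relevant_variable(f, pos, neg):
    """Walk from a positive point to a negative point, flipping one
    disagreeing coordinate at a time; the flip that changes f's value
    pins down a relevant variable.  f is a membership-query oracle."""
    cur = list(pos)
    val = f(cur)                   # starts at the positive point's label
    for i in range(len(pos)):
        if cur[i] != neg[i]:
            cur[i] = neg[i]
            new_val = f(cur)
            if new_val != val:
                return i           # flipping x_i changed f: x_i is relevant
            val = new_val
    raise ValueError("pos and neg must have different labels")

# demo oracle: the 3-junta x3 OR (x6 AND x7) from the earlier slide
oracle = lambda x: x[2] | (x[5] & x[6])
print(find_relevant_variable(oracle, [1] * 10, [0] * 10))   # 5, i.e. x6 is relevant
```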
Variant #2: monotone functions
• If the junta is monotone, can learn in poly(n, 2^k) time.
  – If x_i is irrelevant: Pr[f(x) = 1 | x_i = 1] = Pr[f(x) = 1 | x_i = 0].
  – If x_i is relevant: Pr[f(x) = 1 | x_i = 1] > Pr[f(x) = 1 | x_i = 0].
  – Each probability is an integer multiple of 1/2^k.
  – So can test each variable in poly(2^k) time.
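A rough sampling-based sketch of this test (illustrative; the slide's argument uses the exact 1/2^k granularity of the probabilities, while this hypothetical version just estimates the gap from samples):

```python
import random

def relevant_vars_monotone(f, n, k, samples=20000, seed=1):
    """For a monotone k-junta, x_i is relevant iff
    Pr[f=1 | x_i=1] > Pr[f=1 | x_i=0]; both probabilities are integer
    multiples of 1/2^k, so a gap of at least 1/2^k separates the cases.
    Here we estimate the conditional probabilities by sampling."""
    rng = random.Random(seed)
    counts = [[0, 0] for _ in range(n)]   # counts[i][b] = #(f(x)=1 and x_i=b)
    totals = [[0, 0] for _ in range(n)]   # totals[i][b] = #(x_i=b)
    for _ in range(samples):
        x = [rng.randint(0, 1) for _ in range(n)]
        y = f(x)
        for i in range(n):
            totals[i][x[i]] += 1
            counts[i][x[i]] += y
    relevant = []
    for i in range(n):
        p1 = counts[i][1] / max(totals[i][1], 1)
        p0 = counts[i][0] / max(totals[i][0], 1)
        if p1 - p0 > 1 / 2 ** (k + 1):    # threshold: half the minimum gap
            relevant.append(i)
    return relevant
```

For example, on the monotone 3-junta f(x) = x2 OR (x4 AND x5) over n = 8 variables (0-indexed x[1], x[3], x[4]), the estimated gaps single out exactly those three coordinates.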
Variant #3: random functions
• If the junta is random, whp can learn in poly(n, 2^k) time.
  – If x_i is irrelevant, Pr[f(x) = x_i] = 1/2 for sure.
  – If x_i is relevant, Pr[ Pr[f(x) = x_i] = 1/2 ] ≈ 1/2^(k/2) over the choice of the random junta.
  – Each probability is an integer multiple of 1/2^k.
  – So whp can find the relevant variables this way.
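The correlation test above might be sketched like this (a hypothetical estimator, assuming oracle access to f; note that for non-random juntas, e.g. parities, relevant variables can also have correlation exactly 1/2, which is why this only works whp for random functions):

```python
import random

def correlated_vars(f, n, samples=20000, seed=2):
    """Estimate Pr[f(x) = x_i] for each coordinate i.  An irrelevant
    variable has this probability exactly 1/2; for a random junta, a
    relevant variable's probability is bounded away from 1/2 except with
    probability ~ 1/2^(k/2) over the choice of the junta."""
    rng = random.Random(seed)
    agree = [0] * n
    for _ in range(samples):
        x = [rng.randint(0, 1) for _ in range(n)]
        y = f(x)
        for i in range(n):
            agree[i] += (x[i] == y)
    return [agree[i] / samples for i in range(n)]

# demo: f = x1 AND x2 has Pr[f = x1] = 3/4, while an irrelevant x3 gives 1/2
print(correlated_vars(lambda x: x[0] & x[1], 4))
```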
Back to real problem
• Lower bound: need at least 2^k + k log n examples
• Upper bound: there are (n choose k) possible sets of relevant variables, so can do exhaustive search in 2^(O(k)) · (n choose k) time

Can we learn in time poly(n, 2^k)?
Previous work
• [Blum & Langley, 1994] suggested problem
• Little progress until….
• [Kalai & Mansour, 2001] gave an algorithm that learns in time n^(k - k^(1/2))
Our result
• We give an algorithm that learns in time

    ≈ n^(ωk/(ω+1))

where ω ≈ 2.376 is the matrix multiplication exponent. So currently ≈ n^(.704k).
The main idea
• Let g be the hidden k-bit function
• Look at two different representations for g:
  – Only "weird" functions are hard to learn under the first representation
  – Only "perverse" functions are hard to learn under the second representation
  – No function is both weird and perverse
First representation: real polynomial

• View inputs and outputs as -1/1 valued
• Fact: every Boolean function g: {-1,1}^k → {-1,1} has a unique interpolating real polynomial g_R(x1, x2, …, xk)
  – g_R's coefficients are the Fourier coefficients of g
  – Examples:
    • parity on x1, x2, …, xk: polynomial is x1 x2 ··· xk
    • x1 AND x2: polynomial is (1 + x1 + x2 - x1 x2)/2
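As a quick sanity check on the AND polynomial, here is a hypothetical snippet; it assumes the convention (common in this literature, and implied by the formula) that -1 denotes TRUE and +1 denotes FALSE:

```python
from itertools import product

def and_poly(x1, x2):
    # interpolating real polynomial for x1 AND x2 from the slide
    return (1 + x1 + x2 - x1 * x2) / 2

# with -1 = TRUE: AND evaluates to -1 exactly when both inputs are -1
for x1, x2 in product((-1, 1), repeat=2):
    expected = -1 if (x1, x2) == (-1, -1) else 1
    assert and_poly(x1, x2) == expected
print("matches AND on all four inputs")
```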
Real polynomials
• Fourier coefficients measure the correlation of g with the corresponding parities:

    E[g(x) · x_T] = coefficient of x_T in g_R,   where x_T = ∏_{i∈T} x_i

• So given a set T of variables, can estimate the coefficient of x_T via sampling
  – Nonzero only if every variable in T is relevant
  – Problem: may have to test all sets of up to k variables to find a nonzero coefficient
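The sampling estimate might look like this in Python (a sketch; it assumes oracle access to a ±1-valued f):

```python
import random

def estimate_fourier_coeff(f, T, n, samples=20000, seed=3):
    """Estimate the Fourier coefficient E[f(x) * prod_{i in T} x_i]
    by averaging over uniform x in {-1,1}^n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        x = [rng.choice((-1, 1)) for _ in range(n)]
        chi_T = 1
        for i in T:
            chi_T *= x[i]          # the parity (character) x_T at x
        total += f(x) * chi_T
    return total / samples
```

For the parity g(x) = x1 x2, the coefficient on T = {x1, x2} is 1 and every other coefficient is 0, which the estimator recovers up to sampling error.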
First technical theorem:
Let g be a Boolean function on k variables such that g_R has nonzero constant term:

    g_R(x) = c_0 + Σ_{|T| ≥ s} c_T x_T,   c_0 ≠ 0

(s = degree of the smallest nontrivial monomial)

Then s ≤ 2k/3.
Second representation: GF(2) polynomial

• View inputs and outputs as 0/1 valued
• Fact: every Boolean function g: {0,1}^k → {0,1} has a unique interpolating GF(2) polynomial g_2(x1, x2, …, xk)
• Examples:
  • parity on x1, x2, …, xk: polynomial is x1 + x2 + … + xk
  • x1 AND … AND xk: polynomial is x1 x2 ··· xk
Learning parities
• Suppose g is some parity function, e.g. g(x) = parity(x1, x2, x4)
• Can add labeled examples mod 2:

  0 1 0 1 0 0 1 0 1 ; 0
+ 1 1 1 1 1 0 1 0 1 ; 1
= 1 0 1 0 1 0 0 0 0 ; 1
Learning parities (cont)
• Given a set of labeled examples, can do Gaussian elimination to obtain

    1 0 0 0 0 0 0 0 0 ; b

• Will have b = 1 iff x1 is in the parity
• Repeat for x2, …, xn to learn the parity
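The Gaussian-elimination step can be sketched over GF(2) as follows (an illustrative implementation, not the talk's code; it returns one parity consistent with the examples, or None if no parity fits):

```python
from itertools import product

def learn_parity(examples, n):
    """Gaussian elimination over GF(2): find a set S such that
    f(x) = XOR of x_i for i in S is consistent with every example.
    Each example is (x, label) with x a 0/1 list of length n."""
    pivots = {}                        # pivot column -> index into basis
    basis = []                         # reduced rows (x, y)
    for x, y in examples:
        x = list(x)
        for c, r in pivots.items():    # reduce against existing pivot rows
            if x[c]:
                bx, by = basis[r]
                x = [a ^ b for a, b in zip(x, bx)]
                y ^= by
        lead = next((i for i in range(n) if x[i]), None)
        if lead is None:
            if y:
                return None            # 0 = 1: examples fit no parity
            continue                   # redundant example
        pivots[lead] = len(basis)
        basis.append((x, y))
    sol = [0] * n                      # back-substitute (free columns = 0)
    for lead in sorted(pivots, reverse=True):
        x, y = basis[pivots[lead]]
        sol[lead] = y ^ sum(x[i] & sol[i] for i in range(lead + 1, n)) % 2
    return [i for i in range(n) if sol[i]]

# demo: labels are parity(x1, x3) (0-indexed: x[0] ^ x[2]) on n = 5
demo = [(list(x), x[0] ^ x[2]) for x in product([0, 1], repeat=5)]
print(learn_parity(demo, 5))   # [0, 2]
```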
Learning GF2 polynomials
• Given any g: {0,1}^k → {0,1}, can view g_2 as a parity over monomials (ANDs)
• If deg(g_2) = d, have ≈ k^d monomials
• In the junta setting, have ≈ n^d monomials
  – Problem: d could be as large as k
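Viewing g_2 as a parity over monomials amounts to a feature expansion, which can be sketched as (illustrative helper):

```python
from itertools import combinations

def monomial_features(x, d):
    """Expand a 0/1 vector into the values of all monomials (ANDs) of
    degree <= d.  A degree-d GF(2) polynomial in x is then just a parity
    over these ~ n^d features, so parity learning applies."""
    feats = [1]                                   # the constant monomial
    for size in range(1, d + 1):
        for S in combinations(range(len(x)), size):
            v = 1
            for i in S:
                v &= x[i]
            feats.append(v)
    return feats

# degree-2 expansion of (x1, x2, x3) = (1, 0, 1):
# [1, x1, x2, x3, x1x2, x1x3, x2x3]
print(monomial_features([1, 0, 1], 2))   # [1, 1, 0, 1, 0, 1, 0]
```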
Second technical theorem:
Let g be a Boolean function on k variables such that g_R has zero constant term:

    g_R(x) = Σ_{|T| ≥ s} c_T x_T.

Then deg(g_2) ≤ k - s.
Algorithm to learn k-juntas
• Sample to test whether f is constant
• If not, sample to estimate the Fourier coefficients of all sets of up to αk variables
  – Nonzero coefficient of size m: recurse on all 2^m settings of those variables
  – All small coefficients zero: run the parity-learning algorithm with monomials of size up to (1-α)k
Why does it work?
• If f is unbalanced, will find a nonzero coefficient of size at most 2k/3 ≤ αk (first technical theorem)
• If f is balanced and all small coefficients vanish, then s > αk, so deg(g_2) ≤ k - s < (1-α)k and the parity-learning algorithm is guaranteed to succeed (second technical theorem)
• So either way, make progress

Take α ≥ 2/3.
Running time
• Checking sets of (up to) αk variables takes ≈ n^(αk) time
• Running Gaussian elimination on the ≈ n^((1-α)k) monomials of size up to (1-α)k takes time ≈ n^(ω(1-α)k) (ω = matrix multiplication exponent)
• So the best choice balances the two: α = ω/(ω+1), giving time ≈ n^(ωk/(ω+1)) ≈ n^(.704k)
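The balancing can be checked numerically (using ω ≈ 2.376, the bound quoted in the talk):

```python
omega = 2.376            # matrix multiplication exponent (bound quoted in the talk)
# the two phases cost n^(alpha*k) and n^(omega*(1-alpha)*k);
# they balance when alpha = omega*(1 - alpha), i.e. alpha = omega/(omega + 1)
alpha = omega / (omega + 1)
print(round(alpha, 4))   # 0.7038, i.e. overall running time ~ n^(0.704 k)
```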
What else can we do?
• Restrictions: can look at f under "small" restrictions that fix a few variables, e.g.

  x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
   0 x2  1  0 x5 x6 x7 x8  0 x10
A question
• Suppose g: {-1,1}^k → {-1,1} has all of its Fourier weight on large sets:

    g_R(x) = Σ_{|T| > 2k/3} ĝ_T x_T.

• Must there be some restriction ρ fixing at most 2k/3 variables such that g(ρ(x)) is a parity function?
• If yes, can learn k-juntas in time ≈ n^(2k/3)