Transcript of “Sampling: an Algorithmic Perspective”, Richard Peng, M.I.T.

Page 1:

Sampling: an Algorithmic Perspective

Richard Peng, M.I.T.

Page 2:

OUTLINE

• Structure preserving sampling
• Sampling as a recursive ‘driver’
• Sampling the inaccessible
• What can sampling preserve?

Page 3:

RANDOM SAMPLING

• Collection of many objects
• Pick a small subset of them

Goal:
• Estimate quantities
• Obtain small approximations
• Use in algorithms

Page 4:

SAMPLING CAN APPROXIMATE

• Point sets
• Matrices
• Graphs
• Gradients

Page 5:

PRESERVING GRAPH STRUCTURES

Undirected graph, n vertices, m < n² edges

Are n² edges (dense) sometimes necessary?

For some information, e.g. connectivity: encoded by a spanning forest, < n edges

Deterministic, O(m) time algorithm
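For concreteness, a minimal union-find sketch of this idea (a near-linear variant rather than literally O(m); the function name and path-halving choice are illustrative):

```python
# Build a spanning forest (< n edges) that encodes connectivity exactly.
def spanning_forest(n, edges):
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    forest = []
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:              # u, v in different components: keep edge
            parent[ru] = rv
            forest.append((u, v))
    return forest                 # at most n - 1 edges

# e.g. a triangle on vertices 0,1,2 keeps only 2 of its 3 edges:
print(spanning_forest(3, [(0, 1), (1, 2), (2, 0)]))
```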


Page 6:

MORE INTRICATE STRUCTURES

k-connectivity: # of disjoint paths between s and t

[Benczur-Karger `96]: for ANY G, can sample to get H with O(n log n) edges s.t. G ≈ H on all cuts

Stronger: weights of all 2ⁿ cuts in the graph

Cut: # of edges leaving a subset of vertices


Menger’s theorem / maxflow-mincut


≈: multiplicative approximation

Page 7:

MORE GENERAL: ROW SAMPLING

L2 row sampling: given A with m >> n, sample a few rows to form A’ s.t. ║Ax║₂ ≈ ║A’x║₂ ∀ x

[Figure: an m × n matrix A is sampled down to A’ with ≈ n rows; sample rows shown: (0, -1, 0, 0, 0, 1, 0) and (0, -5, 0, 0, 0, 5, 0).]

• ║Ax║_p: finite dimensional Banach space
• Sampling: embedding Banach spaces, e.g. [BLM `89], [Talagrand `90]
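As a quick illustration of the definition (not any particular algorithm from the talk): sampling rows with probabilities pᵢ and rescaling kept rows by 1/√pᵢ makes ║A’x║₂² an unbiased estimate of ║Ax║₂². Uniform probabilities here are a placeholder:

```python
# Sample rows of A with probabilities p_i, rescale by 1/sqrt(p_i), so
# E[||A'x||_2^2] = ||Ax||_2^2 for every fixed x.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10000, 5))
p = np.full(len(A), 0.1)                  # keep ~10% of rows (illustrative)
keep = rng.random(len(A)) < p
Ap = A[keep] / np.sqrt(p[keep])[:, None]
x = rng.standard_normal(5)
print(np.linalg.norm(A @ x), np.linalg.norm(Ap @ x))  # close for random A
```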

Page 8:

HOW TO SAMPLE?

Widely used: uniform sampling. Works well when the data is uniform, e.g. a complete graph.

Problem: on a long path, removing any edge changes connectivity

(can also have both in one graph)
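A tiny numeric version of this contrast, with illustrative constants (n, p, k):

```python
# Uniform sampling with reweighting approximates cuts of the complete
# graph K_n well; on a path, every single edge is essential.
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 200, 0.2, 60
keep = np.triu(rng.random((n, n)) < p, 1)   # keep each K_n edge w.p. p
approx = keep[:k, k:].sum() / p             # reweight kept edges by 1/p
print(approx, k * (n - k))                  # sampled vs exact cut value
# On an n-vertex path, dropping any one edge already changes connectivity.
```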

More systematic view of sampling?

Page 9:

SPECTRAL SPARSIFICATION VIA EFFECTIVE RESISTANCE

[Spielman-Srivastava `08]: suffices to sample with probabilities at least O(log n) times weight times effective resistance

Effective resistance:
• commute time / m
• statistical leverage score (in unweighted graphs)

Page 10:

L2 MATRIX-CHERNOFF BOUNDS

[Foster `49]: Σᵢ τᵢ = rank ≤ n, so O(n log n) rows

[Rudelson-Vershynin `07], [Tropp `12]: sampling with pᵢ ≥ τᵢ · O(log n) gives B’ s.t. ║Bx║₂ ≈ ║B’x║₂ ∀x w.h.p.

τ: L2 statistical leverage scores, τᵢ = bᵢᵀ(BᵀB)⁻¹bᵢ = ║bᵢ║²_{L⁻¹} (where L = BᵀB)

Near optimal:
• L2 row samples of B
• Graph sparsifiers

• In practice, a factor of about 5 in place of O(log n) usually suffices
• Can also improve via derandomization
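A minimal sketch of this sampling rule for a dense matrix, with an illustrative constant C in pᵢ = min(1, C·τᵢ·log n):

```python
# Leverage-score row sampling: keep row i w.p. p_i, rescale by 1/sqrt(p_i).
import numpy as np

def leverage_score_sample(B, C=4.0, rng=np.random.default_rng(0)):
    m, n = B.shape
    M = np.linalg.pinv(B.T @ B)
    tau = np.einsum('ij,jk,ik->i', B, M, B)      # tau_i = b_i^T (B^T B)^+ b_i
    p = np.minimum(1.0, C * tau * np.log(n + 1))
    keep = rng.random(m) < p
    return B[keep] / np.sqrt(p[keep])[:, None]   # ~ n log n rows

B = np.random.randn(5000, 8)
Bp = leverage_score_sample(B)
x = np.random.randn(8)
print(np.linalg.norm(B @ x), np.linalg.norm(Bp @ x))  # close w.h.p.
```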

Page 11:

THE `RIGHT’ PROBABILITIES

[Figure: extreme examples. A matrix with only one non-zero row, or a column with a single entry (0 0 1 0 0), needs that row kept with probability 1, while uniform sampling would assign n/m to every row. In a path + clique graph, path edges have τᵢ ≈ 1 and clique edges have τᵢ ≈ 1/n.]

τ: L2 statistical leverage scores, τᵢ = bᵢᵀ(BᵀB)⁻¹bᵢ = ║bᵢ║²_{L⁻¹}

Any good upper bound on the τᵢ leads to size reductions

Page 12:

OUTLINE

• Structure preserving sampling
• Sampling as a recursive ‘driver’
• Sampling the inaccessible
• What can sampling preserve?

Page 13:

ALGORITHMIC TEMPLATES

W-cycle: T(m) = 2T(m/2) + O(m), which solves to O(m log m)
Instances:
• Sorting
• FFT
• Voronoi / Delaunay

V-cycle: T(m) = T(m/2) + O(m), which solves to O(m)
Instances:
• Selection
• Parallel independent set
• Routing
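Both recurrences are easy to check numerically; the following sketch (names mine) evaluates them exactly:

```python
# W-cycle recurrence grows like m log m; V-cycle stays linear in m.
def w_cycle(m):
    return 0 if m <= 1 else 2 * w_cycle(m // 2) + m

def v_cycle(m):
    return 0 if m <= 1 else v_cycle(m // 2) + m

m = 1 << 16
print(w_cycle(m), "~ m log2 m =", m * 16)   # W-cycle: O(m log m)
print(v_cycle(m), "~ 2m =", 2 * m)          # V-cycle: O(m)
```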

Page 14:

EFFICIENT GRAPH ALGORITHMS

Partition via separators

Difficulty:
• Many non-separable graphs exist
• Easy to compose hard instances

Page 15:

SIZE REDUCTION

Ultra-sparsifier: for any k, can find H ≈ₖ G that is a tree + O(m log^c n / k) edges

e.g. [Koutis-Miller-P `10]: obtain crude estimates of the τᵢ via a tree

• H is equivalent to a graph of size O(m log^c n / k)
• Picking k > log^c n gives size reductions

Page 16:

INSTANCE: Lx = b

Input: graph Laplacian L, vector b
Output: x ≈ε L⁺b

Runtimes:
• [KMP `10, `11]: O(m log n) work, O(m^{1/3}) depth
• [CKPPR `14, CMPPX `14]: O(m log^{1/2} n) work, O(m^{1/3}) depth

Note:
• L⁺: pseudo-inverse
• Approximate solution
• Omitting log(1/ε) factors

+ recursive Chebyshev iteration: T(m) = k^{1/2} (T(m log^c n / k) + O(m))
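For a runnable statement of the problem (not the cited solvers), here is a plain conjugate gradient loop on the Laplacian of a path graph, with b projected into range(L); sizes and the iteration cap are illustrative:

```python
import numpy as np

n = 50
L = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
L[0, 0] = L[-1, -1] = 1             # Laplacian of an n-vertex path
b = np.random.randn(n)
b -= b.mean()                       # remove the all-ones nullspace component

x = np.zeros(n)
r = b.copy()
p = r.copy()
for _ in range(200):
    if np.linalg.norm(r) < 1e-12:   # converged
        break
    Lp = L @ p
    alpha = (r @ r) / (p @ Lp)
    x += alpha * p
    r_new = r - alpha * Lp
    p = r_new + ((r_new @ r_new) / (r @ r)) * p
    r = r_new
print(np.linalg.norm(L @ x - b))    # ~ 0, so x ≈ L⁺b
```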

Page 17:

INSTANCE: INPUT-SPARSITY TIME NUMERICAL ALGORITHMS

[Li-Miller-P `13]:
• Create a smaller approximation (sample)
• Recurse on it
• Bring the solution back (post-process)

Similar: the Nystrom method

Page 18:

INSTANCE: APPROX MAXFLOW

[Sherman `13], [KLOS `14]: structure approximators give fast maxflow routines

[Racke-Shah-Taubig `14]: good approximators are obtained by solving maxflows

[P `14]: build the approximator on a smaller graph
• Absorb the additional (small) error via more calls to the approximator
• Recurse on instances with smaller total size; total cost: O(m log^c n)

Page 19:

OUTLINE

• Structure preserving sampling
• Sampling as a recursive ‘driver’
• Sampling the inaccessible
• What can sampling preserve?

Page 20:

DENSE OBJECTS

• Matrix inverse
• Schur complement
• k-step random walks

• Cost-prohibitive to store
• Application of separators

Directly access sparse approximations?

Page 21:

TWO-STEP RANDOM WALKS

A: one step of a random walk
A²: two-step random walk

Still a graph, can sparsify!

Page 22:

WHAT THIS ENABLED

[P-Spielman `14] use this to approximate (I - A)⁻¹ = (I + A)(I + A²)(I + A⁴)⋯

• Similar to multi-level methods
• Skipped here: control / propagation of error

Combining known tools: efficiently sparsify I - A² without computing A²
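The identity above is easy to verify numerically, e.g. for a generic matrix with spectral norm < 1 (the random-walk setting guarantees convergence differently):

```python
# (I - A)^{-1} = (I + A)(I + A^2)(I + A^4)... telescopes, since
# (I - A)(I + A)(I + A^2)...(I + A^{2^{k-1}}) = I - A^{2^k} -> I.
import numpy as np

rng = np.random.default_rng(0)
A0 = rng.standard_normal((6, 6))
A0 *= 0.4 / np.linalg.norm(A0, 2)       # ensure ||A|| < 1
I = np.eye(6)
prod, A = I.copy(), A0.copy()
for _ in range(40):                     # multiply the (I + A^{2^i}) factors
    prod = prod @ (I + A)
    A = A @ A                           # repeated squaring
print(np.linalg.norm(prod - np.linalg.inv(I - A0)))   # ~ 1e-15
```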

[Cheng-Cheng-Liu-P-Teng `15]: sparsified Newton’s method for matrix roots and Gaussian sampling

Page 23:

MATRIX SQUARING

                    Connectivity        More general
Iteration:          Aᵢ₊₁ ≈ Aᵢ²          I - Aᵢ₊₁ ≈ I - Aᵢ²
Until:              ║A_d║ small         ║A_d║ small
Size reduction:     low degree          sparse graph
Method:             derandomized        randomized
Solution transfer:  connectivity        solution vectors

• NC algorithm for shortest path
• Logspace connectivity: [Reingold `02]
• Deterministic squaring: [Rozenman-Vadhan `05]

Page 24:

LONGER RANDOM WALKS

A: one step of random walk

A³: three steps of the random walk

(Part of) edge uv in A³ corresponds to a length-3 path in A: u-y-z-v

Page 25:

PSEUDOCODE

Repeat O(c · m · log n · ε⁻²) times:
1. Uniformly randomly pick 1 ≤ k ≤ c and an edge e = uv
2. Perform a (k - 1)-step random walk from u
3. Perform a (c - k)-step random walk from v
4. Add a scaled copy of the resulting edge to the sparsifier

Resembles:
• Local clustering
• Approximate triangle counting (c = 3)

[Cheng-Cheng-Liu-P-Teng `15]: combine this with repeated squaring to approximate any random walk polynomial in nearly-linear time.
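A runnable sketch of the loop above for an unweighted edge list; the sample count, the uniform edge choice, and the m / num_samples scaling are illustrative assumptions, not the paper’s exact weights:

```python
import math
import random
from collections import defaultdict

def sample_walk_polynomial(edges, c, eps=0.5, const=1.0):
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    m, n = len(edges), len(adj)
    num_samples = int(const * c * m * math.log(n + 1) / eps ** 2)

    def walk(x, steps):
        for _ in range(steps):
            x = random.choice(adj[x])
        return x

    sparsifier = defaultdict(float)
    for _ in range(num_samples):
        k = random.randint(1, c)       # 1. position k and a uniform edge uv
        u, v = random.choice(edges)
        a = walk(u, k - 1)             # 2. (k-1)-step walk from u
        b = walk(v, c - k)             # 3. (c-k)-step walk from v
        sparsifier[(min(a, b), max(a, b))] += m / num_samples   # 4. scaled copy
    return dict(sparsifier)

# e.g. approximate edges of A^3 for a 5-cycle:
cycle = [(i, (i + 1) % 5) for i in range(5)]
print(sample_walk_polynomial(cycle, c=3))
```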

Page 26:

GAUSSIAN ELIMINATION

Partial state of Gaussian elimination: a linear system on a subset of the variables

Graph theoretic interpretation: equivalent circuit on the boundary vertices, Y-Δ transform

[Lee-P-Spielman, in progress]: approximate such circuits in O(m log^c n) time
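A small numpy check of this interpretation: eliminating one vertex of a graph Laplacian yields a Schur complement that is again a graph Laplacian, here the Y-Δ transform of a 3-star into a triangle:

```python
import numpy as np

# Laplacian of a star: center vertex 0 joined to 1, 2, 3 (unit conductances)
L = np.array([[ 3, -1, -1, -1],
              [-1,  1,  0,  0],
              [-1,  0,  1,  0],
              [-1,  0,  0,  1]], dtype=float)

# Schur complement eliminating vertex 0: S = L[1:,1:] - L[1:,0] L[0,1:] / L[0,0]
S = L[1:, 1:] - np.outer(L[1:, 0], L[0, 1:]) / L[0, 0]
print(S)   # Laplacian of a triangle with weights 1/3: the Y-Δ transform
```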

Page 27:

WHAT THIS ENABLES

[Lee-P-Spielman, in progress] O(n) time approximate Cholesky factorization for graph Laplacians

[Lee-Sun `15]: constructible in nearly-linear work

Page 28:

OUTLINE

• Structure preserving sampling
• Sampling as a recursive ‘driver’
• Sampling the inaccessible
• What can sampling preserve?

Page 29:

MORE GENERAL STRUCTURES

• Non-linear structures
• Directed constraints: Ax ≤ b

Page 30:

OTHER NORMS

Generalization of row sampling: given A and q, find A’ s.t. ║Ax║_q ≈ ║A’x║_q ∀ x

q-norm: ║y║_q = (Σᵢ |yᵢ|^q)^{1/q}

[Figure: unit balls of ║y║₁ and ║y║₂]

1-norm: standard for representing cuts; used in sparse recovery / robust regression

Applications (for general A):
• Feature selection
• Low-rank approximation / PCA

Page 31:

L1 ROW SAMPLING

L1 Lewis weights ([Lewis `78]): w s.t. wᵢ² = aᵢᵀ(AᵀW⁻¹A)⁻¹aᵢ

A recursive definition!

Sampling with pᵢ ≥ wᵢ · O(log n) gives ║Ax║₁ ≈ ║A’x║₁ ∀x
Can check: Σᵢ wᵢ ≤ n, so O(n log n) rows

[Talagrand `90, “Embedding subspaces of L1 into l₁^N”]: can be analyzed as row sampling / sparsification

[Cohen-P `15]: iterate w’ᵢ ← (aᵢᵀ(AᵀW⁻¹A)⁻¹aᵢ)^{1/2}; converges in O(log log n) steps
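A sketch of this fixed-point iteration in numpy, assuming A has full column rank; the fixed iteration cap is an illustrative stand-in for the O(log log n) bound:

```python
# Fixed-point iteration for L1 Lewis weights:
# w_i <- (a_i^T (A^T W^{-1} A)^{-1} a_i)^{1/2}, with W = diag(w).
import numpy as np

def l1_lewis_weights(A, iters=20):
    m, n = A.shape
    w = np.ones(m)
    for _ in range(iters):
        M = np.linalg.inv(A.T @ (A / w[:, None]))   # (A^T W^{-1} A)^{-1}
        tau = np.einsum('ij,jk,ik->i', A, M, A)     # a_i^T M a_i
        w = np.sqrt(tau)
    return w

A = np.random.randn(200, 5)
w = l1_lewis_weights(A)
print(w.sum(), "<= n =", A.shape[1])   # sum of weights is at most n
```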

Page 32:

WHERE THIS FITS IN

                        #rows, q=2          #rows, q=1            Runtime
Dasgupta et al. `09                         n^{2.5}               mn^5
Magdon-Ismail `10       n log^2 n                                 mn^2
Sohler-Woodruff `11                         n^{3.5}               mn^{ω-1+θ}
Drineas et al. `12      n log n                                   mn log n
Clarkson et al. `12                         n^{4.5} log^{1.5} n   mn log n
Clarkson-Woodruff `12   n^2 log n           n^8                   nnz
Mahoney-Meng `12        n^2                 n^{3.5}               nnz + n^6
Nelson-Nguyen `12       n^{1+θ}                                   nnz
Li et al. `13           n log n             n^{3.66}              nnz + n^{ω+θ}
Cohen et al. `14,
Cohen-P `15             n log n             n log n               nnz + n^{ω+θ}

[Cohen-P `15]: elementary, optimization-motivated proof of w.h.p. concentration for L1

Page 33:

CONNECTION TO LEARNING THEORY

Sparsely-used Dictionary Learning: given Y, find A, X so that ║Y - AX║ is small and X is sparse

[Spielman-Wang-Wright `12]: L1 regression solves this using about n² samples

[Luh-Vu `15]: via generic chaining, O(n log⁴ n) samples suffice

The proof in [Cohen-P `15] gives O(n log² n) samples

Key: if X satisfies the Bernoulli-Subgaussian model, then ║Xy║₁ is close to its expectation for all y

The ‘right’ bound should be O(n log n)

Page 34:

UNSPARSIFIABLE INSTANCE

Directed complete bipartite graph: removing any edge uv makes v unreachable from u

Preserve less structure?

Page 35:

WEAKER REQUIREMENT

The sample only needs to make gains in some directions

[Figure: P ≈ Q₁ with probability 1/2, and P ≈ Q₂ with probability 1/2]

[Cohen-Kyng-Pachocki-P-Rao `14]: point-wise convergence without matrix concentration

Page 36:

UNIFORM SAMPLING?

Nystrom method (on matrices):
• Pick a random subset of the data
• Compute on the subset
• Post-process the result

Post-processing:
• Theoretical works before us: copy x over
• Practical: projection, least-squares fitting

[CLMMPS `15]: taking half the rows as A’ gives good sampling probabilities for A that sum to ≤ 2n
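A sketch of this estimator: use a uniformly random half of the rows to get (over-)estimates of the leverage scores of A. The constant-free form and the use of the pseudo-inverse are my choices:

```python
import numpy as np

def approx_leverage_scores(A, rng=np.random.default_rng(0)):
    m = A.shape[0]
    half = rng.choice(m, size=m // 2, replace=False)   # uniform half A'
    M = np.linalg.pinv(A[half].T @ A[half])            # (A'^T A')^+
    # tau_i ~= a_i^T M a_i, capped at 1; overestimates the true scores
    tau = np.minimum(np.einsum('ij,jk,ik->i', A, M, A), 1.0)
    return tau   # sums to <= 2n per the slide's claim

A = np.random.randn(1000, 10)
print(approx_leverage_scores(A).sum())
```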

How powerful is (recursive) post-processing?

Page 37:

WHY IS THIS EFFECTIVE?

Needle in a haystack: there are only d dimensions, so there cannot be too many needles, and they are easy to find via post-processing

Hay in a haystack: half the data should still contain some of the information

Page 38:

FUTURE WORK

• What structures can sampling preserve?
• What does sampling need to preserve?

More concretely:
• More sparsification-based algorithms? E.g. multigrid maxflow?
• Sampling directed graphs
• Hardness results?