Some topics on Sparsitykowon.dongseo.ac.kr/~lbg/cagd/kmmcs/201008/NamSangNam.pdfI Morphological...
Transcript of Some topics on Sparsitykowon.dongseo.ac.kr/~lbg/cagd/kmmcs/201008/NamSangNam.pdfI Morphological...
Some topics on Sparsity
Sangnam Nam
August 12, 2010
Why Sparsity?: randomly generated image
Why Sparsity?: randomly generated image
Why Sparsity?: randomly generated image
Why Sparsity?: randomly generated image
Why Sparsity?: randomly generated image
Why Sparsity?: randomly generated image
Why Sparsity?: DCT of randomly generated image
Why Sparsity?: DCT of randomly generated image
Compression: Image
Compression: Image
Compression: Audio
Denoising
Denoising
Denoising: Simple example
signal x dct transform of x
Denoising: Simple example
signal + noise dct transform of (signal + noise)
Denoising: Simple example
Denoising with hard thresholding at 0.07
Denoising: Simple example
Denoising with hard thresholding at 0.18
Denoising: Simple example
With augmented dictionary:
Transform coefficients Denoised Result
Inpainting: Original Image
Inpainting: letter
Inpainting: letter
Inpainting: letter
Inpainting: lines
Inpainting: lines
Inpainting: lines
Inpainting: random
Inpainting: random
Inpainting: random
Compressed Sensing
Compressed Sensing
Compressed Sensing
Compressed Sensing
Compressed Sensing
Compressed Sensing
Separation
Separation
Super-resolution
Super-resolution
Notation/Terminology
I y ∈ Rd : signal
I φ ∈ Φ: a column of Φ ∈ Rd×K
I φ : φ ∈ Φ spans Rd ⇒ Φ : a dictionary for Rd
I φ ∈ Φ: an atom
I y = Φx ⇒ x : a coefficient vector / representation
Sparse Model
`0-minimization
To find the sparsest representation of y :we solve
minx‖x‖0 subject to y = Mx
How do we know we have the solution?
`0-minimization
To find the sparsest representation of y :we solve
minx‖x‖0 subject to y = Mx
How do we know we have the solution?
Uniqueness of Sparse Solutions
I Let ‖x‖0 := #i : xi 6= 0.
I Let Spark of Φ := minx∈ker(Φ),x 6=0 ‖x‖0.
I Let y = Mx , ‖x‖0 ≤ (Spark of Φ)/2
⇒ x is unique solution of minz ‖z‖0 subject to y = Mz .
Uniqueness of Sparse Solutions
I Let ‖x‖0 := #i : xi 6= 0.I Let Spark of Φ := minx∈ker(Φ),x 6=0 ‖x‖0.
I Let y = Mx , ‖x‖0 ≤ (Spark of Φ)/2
⇒ x is unique solution of minz ‖z‖0 subject to y = Mz .
Uniqueness of Sparse Solutions
I Let ‖x‖0 := #i : xi 6= 0.I Let Spark of Φ := minx∈ker(Φ),x 6=0 ‖x‖0.
I Let y = Mx , ‖x‖0 ≤ (Spark of Φ)/2
⇒ x is unique solution of minz ‖z‖0 subject to y = Mz .
Uniqueness of Sparse Solutions
I Let ‖x‖0 := #i : xi 6= 0.I Let Spark of Φ := minx∈ker(Φ),x 6=0 ‖x‖0.
I Let y = Mx , ‖x‖0 ≤ (Spark of Φ)/2
⇒ x is unique solution of minz ‖z‖0 subject to y = Mz .
`0-minimization
Good, we (kind of) know when we have the solution.
Unfortunately, solving `0-minimization is NP-hard.What to do?
`0-minimization
Good, we (kind of) know when we have the solution.Unfortunately, solving `0-minimization is NP-hard.
What to do?
`0-minimization
Good, we (kind of) know when we have the solution.Unfortunately, solving `0-minimization is NP-hard.What to do?
Alternatives
I Greedy Algorithm/Matching Pursuit
I Convex Optimization/Basis Pursuit
Matching Pursuit
Matching Pursuit [Mallat & Zhang]
1. Given signal y , dictionary Φ.
2. Set k = 0, y0 = 0, r0 = y .
3. Find φ := arg minφ∈Φ〈φ, rk〉.4. Update yk+1 := yk + 〈rk , φ〉φ, rk+1 := rk − 〈rk , φ〉φ.
5. Increment k .
6. If yk uses a specified number of atoms or rk is smaller than aspecified error, stop. Otherwise, go to step 3.
Energy Preservation
Energy Preservation (Pythagoras theorem)
‖rk‖22 = ‖rk+1‖2
2 + |〈rk , φ〉|2
Drawback of Matching Pursuit?
MP can select an atom more than once.
Orthogonal Matching Pursuit
Orthogonal Matching Pursuit (OMP)
1. Given signal y , dictionary Φ.
2. Set k = 0, Λ = , y0 = 0, r0 = y .
3. Find ik := arg mini 〈φi , rk〉 and set Λ := Λ ∪ ik.4. Compute ∆y := (best `2-approximation from span of ΦΛ to
rk), update yk+1 := yk + ∆y , and rk+1 := rk −∆y .
5. Increment k .
6. If yk uses a specified number of atoms or rk is smaller than aspecified error, stop. Otherwise, go to step 3.
Variety of Matching Pursuits
I Morphological Component Analysis [MCA, Bobin et al]
I Stagewise OMP [Donoho et al]
I CoSAMP [Needell & Tropp]
I ROMP [Needell & Vershynin]
I Iterative Hard Thresholding [Blumensath & Davies]
Sampling Issue
To find good approximation of f : [0, 5]→ R, one may sample thefunction values
1. Uniformly at N locations for some large N, or
2. Adaptively sample–more locations near 5 in the figure– atsome large number N locations,
then, use linear interpolation, cubic spline approximation, etc.
What if it is a polynomial of degree 6?
What we know it is a polynomial of degree 6? One needs only 7different samples! Approximation is perfect!
What if it is a sum of 6 monomials?
What if it is a sum of 6 monomials of degree less than 200? Howmany samples do we need?
Think about Economy!
Digital Camera
1. High resolution photo (e.g. size 1024x1024) requires largenumber of samples (e.g. 1024x1024 samples).
2. Image is (immediately) compressed to smaller and lossy JPEGfile.
3. Not a big deal for digital cameras?
Shannon Sampling / Nyquist Rate
Nyquist rate ⇒ 44.1 KHz sampling rate
⇒ large audio file⇒ Compress to lossy MP3 file.
Shannon Sampling / Nyquist Rate
Nyquist rate ⇒ 44.1 KHz sampling rate⇒ large audio file
⇒ Compress to lossy MP3 file.
Shannon Sampling / Nyquist Rate
Nyquist rate ⇒ 44.1 KHz sampling rate⇒ large audio file⇒ Compress to lossy MP3 file.
Fewer Samples
I Can we sample less to begin with?
I How do we recover/approximate the original data?
I Key idea to exploit: Sparsity
Fewer Samples
I Can we sample less to begin with?
I How do we recover/approximate the original data?
I Key idea to exploit: Sparsity
Fewer Samples
I Can we sample less to begin with?
I How do we recover/approximate the original data?
I Key idea to exploit: Sparsity
`1-minimization
Relaxminx‖x‖0 subject to y = Φx
Solveminx‖x‖1 subject to y = Φx
Convex problem. Can be recast into Linear Programming.
`1-minimization
Relaxminx‖x‖0 subject to y = Φx
Solveminx‖x‖1 subject to y = Φx
Convex problem. Can be recast into Linear Programming.
Why does `1-minimization work?
Feasible solutions
Why does `1-minimization work?
`2-balls `1-balls `p-balls, 0 < p < 1
Why does `1-minimization work?
`2-minimizer `1-minimizer `p-minimizer, 0 < p < 1
Why does `1-minimization work?
`2-minimizer `1-minimizer `p-minimizer, 0 < p < 1
When can we guarantee successful recovery?
Null Space Property
Notation:N(Φ) := (Kernel / Null Space of Φ)Λ := (Support of x∗)Λ := Λc
Null Space Property
minx‖x‖1 s.t. Φx = Φx∗ (A)
`1-minimization (A) recovers x∗ if and only if
|〈z , sign(x∗)〉| < ‖zΛ‖1
for all z ∈ N(Φ).
Null Space Property
minx‖x‖1 s.t. Φx = Φx∗ (A)
`1-minimization (A) recovers x∗ if and only if
|〈z , sign(x∗)〉| < ‖zΛ‖1
`1-minimization (A) recovers every x∗ with support Λ if
‖zΛ‖1 < ‖zΛ‖1
for all z ∈ N(Φ).
Null Space Property
minx‖x‖1 s.t. Φx = Φx∗ (A)
`1-minimization (A) recovers x∗ if and only if
|〈z , sign(x∗)〉| < ‖zΛ‖1
`1-minimization (A) recovers every x∗ with support Λ if
‖zΛ‖1 < ‖zΛ‖1
`1-minimization (A) recovers every k-sparse x∗ if
‖zΛ‖1 < ‖zΛ‖1
for all z ∈ N(Φ) and Λ with #Λ ≤ k .
Uniqueness for `1 using NSP
If x∗ is k-sparse, then
‖x∗ + z‖1 ≥ ‖x∗‖1 − ‖zΛ‖1 + ‖zΛ‖1
≥ ‖x∗‖1 .
I.e., x∗ is the unique `1-minimizer.
Coherence
Coherence M(Φ) := maxi 6=j |〈φi , φj〉|.
[Gribonval,Nielsen]
TheoremIf ‖x∗‖0 ≤
12
(1 + 1
M(Φ)
)is a solution of y = Mx, then x∗ is the
unique minimizer of the `1-problem. x∗ is also the uniqueminimizer of the `0-problem!
Coherence
Coherence M(Φ) := maxi 6=j |〈φi , φj〉|.
[Gribonval,Nielsen]
TheoremIf ‖x∗‖0 ≤
12
(1 + 1
M(Φ)
)is a solution of y = Mx, then x∗ is the
unique minimizer of the `1-problem. x∗ is also the uniqueminimizer of the `0-problem!
Derivation of `1-guaranteeing sparsity level
For z ∈ N(Φ),
φjzj = −∑i 6=j
φizi .
|zj | ≤ M(Φ)∑i 6=j
|zi |
(1 + M(Φ))|zj | ≤ M(Φ) ‖z‖1
(1 + M(Φ)) ‖zΛ‖1 ≤ (#Λ)M(Φ) ‖z‖1
⇒ ‖zΛ‖1 ≤‖z‖1
2 if
#Λ ≤ 1
2(1 +
1
M(Φ)).
Derivation of `1-guaranteeing sparsity level
For z ∈ N(Φ),
φjzj = −∑i 6=j
φizi .
|zj | ≤ M(Φ)∑i 6=j
|zi |
(1 + M(Φ))|zj | ≤ M(Φ) ‖z‖1
(1 + M(Φ)) ‖zΛ‖1 ≤ (#Λ)M(Φ) ‖z‖1
⇒ ‖zΛ‖1 ≤‖z‖1
2 if
#Λ ≤ 1
2(1 +
1
M(Φ)).
Derivation of `1-guaranteeing sparsity level
For z ∈ N(Φ),
φjzj = −∑i 6=j
φizi .
|zj | ≤ M(Φ)∑i 6=j
|zi |
(1 + M(Φ))|zj | ≤ M(Φ) ‖z‖1
(1 + M(Φ)) ‖zΛ‖1 ≤ (#Λ)M(Φ) ‖z‖1
⇒ ‖zΛ‖1 ≤‖z‖1
2 if
#Λ ≤ 1
2(1 +
1
M(Φ)).
Derivation of `1-guaranteeing sparsity level
For z ∈ N(Φ),
φjzj = −∑i 6=j
φizi .
|zj | ≤ M(Φ)∑i 6=j
|zi |
(1 + M(Φ))|zj | ≤ M(Φ) ‖z‖1
(1 + M(Φ)) ‖zΛ‖1 ≤ (#Λ)M(Φ) ‖z‖1
⇒ ‖zΛ‖1 ≤‖z‖1
2 if
#Λ ≤ 1
2(1 +
1
M(Φ)).
Derivation of `1-guaranteeing sparsity level
For z ∈ N(Φ),
φjzj = −∑i 6=j
φizi .
|zj | ≤ M(Φ)∑i 6=j
|zi |
(1 + M(Φ))|zj | ≤ M(Φ) ‖z‖1
(1 + M(Φ)) ‖zΛ‖1 ≤ (#Λ)M(Φ) ‖z‖1
⇒ ‖zΛ‖1 ≤‖z‖1
2 if
#Λ ≤ 1
2(1 +
1
M(Φ)).
Coherence gives suboptimal estimates
For Φ ∈ Rd×K , we may suppose M(Φ) = O(
1√d
).
We can recover x∗ if x∗ is 12
(1 +
√O(d)
)= O(
√d) sparse.
To put it differently, in order to recover k-sparse x∗, d = O(k2)measurements are desired.
Coherence gives suboptimal estimates
For Φ ∈ Rd×K , we may suppose M(Φ) = O(
1√d
).
We can recover x∗ if x∗ is 12
(1 +
√O(d)
)= O(
√d) sparse.
To put it differently, in order to recover k-sparse x∗, d = O(k2)measurements are desired.
Coherence gives suboptimal estimates
For Φ ∈ Rd×K , we may suppose M(Φ) = O(
1√d
).
We can recover x∗ if x∗ is 12
(1 +
√O(d)
)= O(
√d) sparse.
To put it differently, in order to recover k-sparse x∗, d = O(k2)measurements are desired.
Recovery of Logan-Shepp phantom using FourierTransform Samples
An Image to recover:
Recovery of Logan-Shepp phantom using FourierTransform Samples
Available samples in Fourier Domain (2.7% (4.3%)):
Recovery of Logan-Shepp phantom using FourierTransform Samples
Reconstructed image—perfect!:
How many samples do we need?
To recover
we (obviously) need k ≤ m ≤ N samples.Need to identify k nonzero positions out of N locations ⇒log2
(Nk
)≈ k log N
k measurements
How many samples do we need?
To recover
we (obviously) need k ≤ m ≤ N samples.
Need to identify k nonzero positions out of N locations ⇒log2
(Nk
)≈ k log N
k measurements
How many samples do we need?
To recover
we (obviously) need k ≤ m ≤ N samples.Need to identify k nonzero positions out of N locations ⇒log2
(Nk
)≈ k log N
k measurements
Random Sampling of Fourier Transform
[Candes, Romberg, Tao]
TheoremLet N = n2, f be an n × n real-valued image supported on T . LetΩ be a subset of 1, . . . , n2 chosen unformly at random. If
#Ω ≥ C (#T ) log N,
then with high probability (1− O(N−M)), the minimizer to
ming‖g‖1 s.t. g |Ω = f |Ω
is unique and is equal to f .
Restricted Isometry Property
For s ∈ N, A matrix Φ satisfies the Restricted Isometry Propertywith the isometry constant δs if
(1− δs) ‖x‖22 ≤ ‖Φx‖2
2 ≤ (1 + δs) ‖x‖22
for all s-sparse signal x .
Perfect Recovery condition in terms of RIP constant
[Candes]
TheoremAssume that δ2s <
√2− 1. Then, the solution x∗ to
minx‖x‖1 s.t. Mx = Mx
satisfies‖x∗ − x‖2 ≤ C0s−1/2 ‖x − xs‖1
where C0 is a constant.
Remark: The sensing matrix M is non-adaptive, but the`1-minimization with M recovers every s-sparse signal.
RIP and NSP
[Candes]
TheoremIf h is in the nullspace of Φ, then
‖hT‖1 ≤ ρ ‖hT c‖1 , ρ :=√
2δ2s(1− δ2s)−1
for every T with #T = s.
Examples of Measurement Matrices with good RIP
I Random matrices with i.i.d. entries. If entries of Φ ∈ Rm×d
are drawn from i.i.d. Gaussian with mean 0 and variance 1/m,then with overwhelming probability the RIC δs is less than√
2− 1 when s ≤ Cm/ log(d/m).
I Fourier ensemble. Φ ∈ Rm×d is constructed by samplingrandom m rows of the discrete Fourier transform.
I General orthogonal measurement ensembles.
Questions
1. ‘Random dictionaries’ are good for compressed sensing. Arethere any deterministic dictionaries that are as good asrandom ones?
2. Any alternative to RIP?
3. What happens when the model is not exactly sparse?
4. What happens when there is noise?
Stable and Robust Recovery
We observey = Φx + n.
We reconstruct x as the solution to
minx‖x‖1 s.t. ‖y − Φx‖2 ≤ ε.
Stable and Robust Recovery
[Candes]
TheoremAssume that δ2s <
√2− 1 and ‖n‖2 ≤ ε. Then, the solution x∗
satisfies‖x∗ − x‖2 ≤ C0s−1/2 ‖x − xs‖1 + C1ε
where C0 and C1 are some constants.
Global Optimization Approach for Noisy Problem
We solve
minx
1
2‖Mx − y‖2
2 + λ ‖x‖1
for some λ > 0.
λ→ 0?
λ→∞?
Global Optimization Approach for Noisy Problem
We solve
minx
1
2‖Mx − y‖2
2 + λ ‖x‖1
for some λ > 0.
λ→ 0?
λ→∞?
Global Optimization Approach for Noisy Problem
We solve
minx
1
2‖Mx − y‖2
2 + λ ‖x‖1
for some λ > 0.
λ→ 0?
λ→∞?
Compressed Sensing
Paper: Compressed Sensing by David Donoho
‖θ − θN‖2 ≤ C (N + 1)1/2−1/p, N = 0, 1, . . . .
We Need Good Dictionaries
Techniques/Algorithms of sparse modeling begin with somedictionary Φ that provides sparse representations of the class ofsignals of interest. What are those Φ’s?
Good Dictionaries
I Smooth functions ⇒ Fourier transform
I Smooth functions with point singularities ⇒ Wavelets
I Singularities along smooth curves ⇒ Curvelets, Shearlets, etc.
I . . .
PCA / SVD?
We want to design/learn dictionaries from data.
Principal Component Analysis ⇒ Not suitable for non-Gaussianmixtures
[Olshausen & Field]Used Sparsity Promoting regularization to learn Dictionary⇒ Localized, oriented, bandpass receptive fields emerge
PCA / SVD?
We want to design/learn dictionaries from data.
Principal Component Analysis ⇒ Not suitable for non-Gaussianmixtures
[Olshausen & Field]Used Sparsity Promoting regularization to learn Dictionary⇒ Localized, oriented, bandpass receptive fields emerge
PCA / SVD?
We want to design/learn dictionaries from data.
Principal Component Analysis ⇒ Not suitable for non-Gaussianmixtures
[Olshausen & Field]Used Sparsity Promoting regularization to learn Dictionary⇒ Localized, oriented, bandpass receptive fields emerge
Dictionary Learning via `1-criterion
If Φ were a good dictionary for a signal x , then we can recover xby solving
minx‖x‖1 s.t. y = Φx .
⇒ Given a data set Y := [y1, . . . , yN ], we may find a gooddictionary Φ by solving
minΦ,X‖X‖1 s.t. Y = ΦX .
Dictionary Learning via `1-criterion
If Φ were a good dictionary for a signal x , then we can recover xby solving
minx‖x‖1 s.t. y = Φx .
⇒ Given a data set Y := [y1, . . . , yN ], we may find a gooddictionary Φ by solving
minΦ,X‖X‖1 s.t. Y = ΦX .
Some issues
I We can decrease ‖X‖1 as small as we wish by scaling because
αΦα−1X = ΦX
for any α ∈ R.
I The problemminΦ,X‖X‖1 s.t. Y = ΦX
is not convex.
The first issue can be dealt with by requiring that each column ofΦ to be of unit length. We will write Φ ∈ U(d ,K ) to mean thatΦ ∈ Rd×K is such a dictionary.
Some issues
I We can decrease ‖X‖1 as small as we wish by scaling because
αΦα−1X = ΦX
for any α ∈ R.
I The problemminΦ,X‖X‖1 s.t. Y = ΦX
is not convex.
The first issue can be dealt with by requiring that each column ofΦ to be of unit length. We will write Φ ∈ U(d ,K ) to mean thatΦ ∈ Rd×K is such a dictionary.
Some issues
I We can decrease ‖X‖1 as small as we wish by scaling because
αΦα−1X = ΦX
for any α ∈ R.
I The problemminΦ,X‖X‖1 s.t. Y = ΦX
is not convex.
The first issue can be dealt with by requiring that each column ofΦ to be of unit length. We will write Φ ∈ U(d ,K ) to mean thatΦ ∈ Rd×K is such a dictionary.
Dictionary Identification
General Question: Given a data set Y , we learned a dictionary Φvia some learning method. Is Φ the ‘right’ one?
Dictionary Identification
Suppose that the training data Y is generated by
Y = Φ0X0.
Under what condition on X0 (and Φ0), can we be sure that theminimization problem
minΦ,X‖X‖1 s.t. Y = ΦX
will recover the pair Φ0 and X0?
Necessary and Sufficient Condition
[Gribonval & Schnass]
TheoremSuppose that K ≥ d. (Φ0,X0) is a strict local minimum of theproblem
minΦ,X‖X‖1 s.t. ΦX = Φ0X0,Φ ∈ U(d ,K )
if and only if
|〈CX0 + V , sign(X0)〉| < ‖(CX0 + V )Λ‖1
for every CX0 + V 6= 0 with diag(Φ∗0Φ0C ) = 0 and V ∈ N(Φ0).
Sufficient Condition for Basis
Notation:
Sufficient Condition for Basis
Theorem(Φ0,X0) is a strict local minimum of the problem
minΦ,X‖X‖1 s.t. ΦX = Φ0X0,Φ ∈ U(d , d)
if for every k = 1, . . . , d, there exists dk ∈ Rd with ‖dk‖∞ < 1such that
Xkdk = Xksk − diag((∥∥x i∥∥)i)
mk
where mk is the k-th column of Φ∗0Φ0.
Geometric Interpretation
Local minimum Not local minimum
uk := Xksk − diag((∥∥x i∥∥)i)
mk , Qk := [−1, 1]rk .
Bernoulli-Gaussian Model
X0 follows the Bernoulli-Gaussian Model with parameter 0 < p < 1if each entry of X0 is the product of independent Bernoulli(p)random variable ξij and Normal random variable gij , i.e.,
xij = ξijgij .
Concentration of Measure Phenomena
If X0 follows the Bernoulli-Gaussian distribution with parameter p,then with high probability XkQk contains the `2-ball of radius
α ≈ Np(1− p)
√2
π,
Xksk is contained in the `2-ball of radius
β ≈√
NK p,
and diag((∥∥x i∥∥)i)
mk is contained in the `2-ball of radius
γ ≈ Np
√2
π.
Concentration of Measure Phenomena
If X0 follows the Bernoulli-Gaussian distribution with parameter p,then with high probability XkQk contains the `2-ball of radius
α ≈ Np(1− p)
√2
π,
Xksk is contained in the `2-ball of radius
β ≈√
NK p,
and diag((∥∥x i∥∥)i)
mk is contained in the `2-ball of radius
γ ≈ Np
√2
π.
Concentration of Measure Phenomena
If X0 follows the Bernoulli-Gaussian distribution with parameter p,then with high probability XkQk contains the `2-ball of radius
α ≈ Np(1− p)
√2
π,
Xksk is contained in the `2-ball of radius
β ≈√
NK p,
and diag((∥∥x i∥∥)i)
mk is contained in the `2-ball of radius
γ ≈ Np
√2
π.
Sufficient Condition under Bernoulli-Gaussian Model
If the coherence M(Φ0) of Φ0 is less than 1− p then for largeenough N, Φ0 is locally identifiable.
Surprise: One needs only
N ≥ CK log K .
Sufficient Condition under Bernoulli-Gaussian Model
If the coherence M(Φ0) of Φ0 is less than 1− p then for largeenough N, Φ0 is locally identifiable.
Surprise: One needs only
N ≥ CK log K .
Some dictionary learning methods
I K-SVD [Elad et al.]
I ISI [Gowreesunker]
I . . .