ICML2012 Reading Group: Scaling Up Coordinate Descent Algorithms for Large L1 regularization Problems

ICML2012 Reading Group: Scaling Up Coordinate Descent Algorithms for Large L1 regularization Problems. 2012-07-28, Yoshihiko Suhara (@sleepy_yoshi)

Description

Presentation slides for the ICML2012 reading group (2012-07-28) on "Scaling Up Coordinate Descent Algorithms for Large L1 regularization Problems".


Page 1: ICML2012 Reading Group: Scaling Up Coordinate Descent Algorithms for Large L1 regularization Problems

ICML2012 Reading Group: Scaling Up Coordinate Descent Algorithms for Large L1 regularization Problems

2012-07-28

Yoshihiko Suhara
@sleepy_yoshi

Page 2

Paper covered

• Scaling Up Coordinate Descent Algorithms for Large L1 regularization Problems

– by C. Scherrer, M. Halappanavar, A. Tewari, D. Haglin

• Parallel computation of coordinate descent

– e.g., [Bradley+ 11] Parallel Coordinate Descent for L1-Regularized Loss Minimization (ICML2011)


Page 3

Overview

• Introduces a generic framework for parallel coordinate descent on shared-memory multicore machines

• Proposes the following two methods
– Thread-Greedy Coordinate Descent
– Coloring-Based Coordinate Descent

• Experimentally compares four parallel CD methods

– Thread-Greedy performed unexpectedly well


Page 4

Optimization of the L1-regularized loss function

• The L1-regularized loss function

min_𝒘 (1/𝑛) Σ_{𝑖=1}^{𝑛} ℓ(𝒚_𝑖, (𝑿𝒘)_𝑖) + 𝜆‖𝒘‖₁

• where
– 𝑿 ∈ ℝ^{𝑛×𝑘}: design matrix
– 𝒘 ∈ ℝ^𝑘: weight vector
– ℓ(𝑦, ⋅): differentiable convex function

• Examples
– Lasso (L1 + squared error)
– L1-regularized logistic regression


Page 5

Notation

𝑿 = (𝑿₁, 𝑿₂, …, 𝑿_𝑗, …, 𝑿_𝑘)   (columns of 𝑿)

𝒆_𝑗 = (0, 0, …, 1, …, 0)ᵀ   (𝑗-th standard basis vector)

𝑿 = (𝒙₁ᵀ; 𝒙₂ᵀ; …; 𝒙_𝑖ᵀ; …; 𝒙_𝑛ᵀ)   (rows of 𝑿)


Page 6

Supplement: Coordinate Descent

• Also called 座標降下法 in Japanese (?)
• Line search along the selected coordinate
• Various ways to choose which coordinate
– e.g., Cyclic Coordinate Descent

• For parallel computation, a subset of all coordinates is selected and updated
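As a concrete illustration of the bullets above (not from the slides), a minimal cyclic coordinate descent for the Lasso objective min_𝒘 (1/2𝑛)‖𝒚 − 𝑿𝒘‖² + 𝜆‖𝒘‖₁ might look like the following sketch; the function names and the incremental residual update are my own choices:

```python
import numpy as np

def soft_threshold(x, t):
    """Soft-thresholding: closed-form solution of the 1-D Lasso problem."""
    return np.sign(x) * max(abs(x) - t, 0.0)

def cyclic_cd_lasso(X, y, lam, n_iters=100):
    """Cyclic CD for min_w (1/2n)||y - Xw||^2 + lam * ||w||_1."""
    n, k = X.shape
    w = np.zeros(k)
    col_sq = (X ** 2).sum(axis=0) / n  # per-coordinate curvature ||X_j||^2 / n
    r = y - X @ w                      # residual, kept up to date incrementally
    for _ in range(n_iters):
        for j in range(k):             # cycle over coordinates in order
            w_old = w[j]
            # exact line search along coordinate j
            rho = X[:, j] @ r / n + col_sq[j] * w_old
            w[j] = soft_threshold(rho, lam) / col_sq[j]
            r += X[:, j] * (w_old - w[j])  # keep r = y - Xw consistent
    return w
```

Keeping the residual up to date makes each coordinate update O(n) instead of recomputing 𝑿𝒘 from scratch, which is the same trick the Update step of GenCD relies on.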


Page 7


GenCD: A Generic Framework for Parallel Coordinate Descent

(For some reason, the original slides switch to English from here on.)

Page 8

Generic Coordinate Descent (GenCD)


Page 9

Step 1: Select

• Select 𝐽 coordinates

• The selection criterion differs across CD variants
– cyclic CD (CCD)
– stochastic CD (SCD)
  • selects a singleton
– fully greedy CD
  • 𝐽 = {1, …, 𝑘}
– Shotgun [Bradley+ 11]
  • selects a random subset of a given size


Page 10

Step 2: Propose

• The Propose step computes a proposed increment 𝛿_𝑗 for each 𝑗 ∈ 𝐽
– this step does not actually change the weights

• In Step 2, we maintain a vector 𝝋 ∈ ℝ^𝑘, where 𝝋_𝑗 is a proxy for the objective function evaluated at 𝒘 + 𝛿_𝑗 𝒆_𝑗

– 𝝋_𝑗 is updated whenever a new proposal is calculated for 𝑗

– 𝝋 is not needed if the algorithm accepts all proposals


Page 11

Step 3: Accept

• In the Accept step, the algorithm accepts a subset 𝐽′ ⊆ 𝐽
– [Bradley+ 11] show that correlations among features can lead to divergence if too many coordinates are updated at once (see figure below)

• In CCD, SCD, and Shotgun, all proposals are accepted
– no need to compute 𝝋


Page 12

Step 4: Update

• In the Update step, the algorithm updates the weights according to the accepted set 𝐽′


(The implementation maintains the product 𝑿𝒘 so it can be updated incrementally.)
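The four steps can be sketched as a single generic loop. This is a hypothetical rendering of the GenCD template, not the paper's code: `propose` is an assumed algorithm-specific callback returning an increment and its proxy value, and all names are mine.

```python
import numpy as np

def gencd(X, y, propose, n_rounds=50, select_size=4, accept_all=True, seed=0):
    """Sketch of the GenCD template: Select -> Propose -> Accept -> Update.

    `propose(j, w, z)` is an assumed callback returning (delta_j, phi_j):
    a proposed increment and a proxy for the objective at w + delta_j * e_j.
    """
    n, k = X.shape
    w = np.zeros(k)
    z = X @ w                       # maintain z = Xw across updates
    rng = np.random.default_rng(seed)
    for _ in range(n_rounds):
        # Step 1: Select a subset J of coordinates
        J = rng.choice(k, size=min(select_size, k), replace=False)
        # Step 2: Propose increments (the weights are not changed yet)
        proposals = {j: propose(j, w, z) for j in J}
        # Step 3: Accept all proposals (CCD/SCD/Shotgun) or only the best one
        if accept_all:
            accepted = list(J)
        else:
            accepted = [min(proposals, key=lambda j: proposals[j][1])]
        # Step 4: Update the weights and z = Xw incrementally
        for j in accepted:
            delta = proposals[j][0]
            w[j] += delta
            z += delta * X[:, j]
    return w
```

Specializing `propose` and the Accept rule recovers the concrete algorithms compared later (SHOTGUN, GREEDY, THREAD-GREEDY, COLORING).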

Page 13

Approximate Minimization (1/2)

• The Propose step calculates a proposed increment 𝛿_𝑗 for each 𝑗 ∈ 𝐽

𝛿_𝑗 = argmin_𝛿 𝐹(𝒘 + 𝛿𝒆_𝑗) + 𝜆|𝒘_𝑗 + 𝛿|

where 𝐹(𝒘) = (1/𝑛) Σ_{𝑖=1}^{𝑛} ℓ(𝒚_𝑖, (𝑿𝒘)_𝑖)

• For a general loss function, there is no closed-form solution along a given coordinate. – Thus, consider approximate minimization


Page 14

Approximate Minimization (2/2)

• Well known minimizer (e.g., [Yuan and Lin 10])

𝛿 = −𝜓(𝒘_𝑗; (𝛻_𝑗𝐹(𝒘) − 𝜆)/𝛽, (𝛻_𝑗𝐹(𝒘) + 𝜆)/𝛽)

where 𝜓(𝑥; 𝑎, 𝑏) = 𝑎 if 𝑥 < 𝑎; 𝑏 if 𝑥 > 𝑏; 𝑥 otherwise

(𝛽 = 1 for squared loss, 𝛽 = 1/4 for logistic loss.)
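A direct transcription of this minimizer into code might look like the sketch below; the function names are mine, but the formula is the one on the slide:

```python
def psi(x, a, b):
    """The clip function from the slide: a if x < a, b if x > b, else x."""
    if x < a:
        return a
    if x > b:
        return b
    return x

def proposed_increment(w_j, grad_j, lam, beta):
    """delta = -psi(w_j; (grad_j - lam)/beta, (grad_j + lam)/beta);
    beta = 1 for squared loss, beta = 1/4 for logistic loss."""
    return -psi(w_j, (grad_j - lam) / beta, (grad_j + lam) / beta)
```

For example, with 𝜆 = 1, 𝛽 = 1, 𝒘_𝑗 = 0.5 and a zero gradient, the increment is −0.5: the penalty pulls the weight exactly to zero, which is the soft-thresholding behavior that produces sparse solutions.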

Page 15

Step 2: Propose (Approximated)


(Slide figure: the approximated Propose step, computing the gradient 𝛻_𝑗𝐹(𝒘) = ⟨(ℓ′(𝒚_𝑖, 𝒛_𝑖))_𝑖, 𝑿_𝑗⟩ / 𝑛 and recording the decrease in the approximated objective.)

Page 16

Experiments


Page 17

Algorithms (conventional)

• SHOTGUN [Bradley+ 11]
– Select step: a random subset of the columns
– Accept step: accepts every proposal
  • no need to compute a proxy for the objective
– convergence is guaranteed only if the number of coordinates selected is at most 𝑃* = 𝑘 / (2𝜌) (*1)

• GREEDY
– Select step: all coordinates
– Propose step: each thread generates proposals for some subset of the coordinates using the approximation
– Accept step: only the single best proposal across all threads is accepted

(*1) 𝜌 is the largest eigenvalue of 𝑿ᵀ𝑿
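As a rough numeric illustration of this bound (a sketch, taking 𝜌 to be the largest eigenvalue of 𝑿ᵀ𝑿 as in the footnote; the exact normalization is defined in [Bradley+ 11], and the function name is mine):

```python
import numpy as np

def shotgun_p_star(X):
    """P* = k / (2 * rho), with rho the largest eigenvalue of X^T X."""
    rho = np.linalg.eigvalsh(X.T @ X).max()  # eigvalsh: symmetric eigensolver
    return X.shape[1] / (2 * rho)
```

With orthogonal columns the bound stays relatively large, while strongly correlated columns inflate 𝜌 and shrink 𝑃*, matching the warning in the Accept step that feature correlations limit how many coordinates can safely be updated at once.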

Page 18

Comparisons of the Algorithms


Page 19

Algorithms (proposed)

• THREAD-GREEDY
– Select step: a random set of coordinates (?)
– Propose step: each thread generates proposals for some subset of the coordinates using the approximation
– Accept step: each thread accepts the best of its proposals
– no proof of convergence (however, empirical results are encouraging)

• COLORING

– Preprocessing: structurally independent features are identified via partial distance-2 coloring

– Select step: a random color is selected
– Accept step: accepts every proposal
  • since the selected features are structurally disjoint
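The preprocessing idea behind COLORING can be imitated with a simple greedy coloring over columns that share a nonzero row; this is only a stand-in sketch for the partial distance-2 coloring the paper uses (the function name and the pairwise conflict test are my simplification):

```python
import numpy as np

def greedy_column_coloring(X):
    """Assign colors so that columns sharing a nonzero row never get the
    same color; such columns would interfere if updated in parallel."""
    n, k = X.shape
    rows_of = [set(np.flatnonzero(X[:, j])) for j in range(k)]
    colors = [-1] * k
    for j in range(k):
        # colors already taken by earlier columns that conflict with j
        forbidden = {colors[i] for i in range(j) if rows_of[i] & rows_of[j]}
        c = 0
        while c in forbidden:
            c += 1
        colors[j] = c
    return colors
```

All columns of one color can then be updated in parallel with every proposal accepted, since no two of them touch the same training example.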


Page 20

Implementation and Platform

• Implementation
– gcc with OpenMP
  • -O3 -fopenmp flags
  • parallel for pragma
  • static scheduling: given n iterations and p threads, each thread gets n/p iterations

• Platform
– AMD Opteron (Magny-Cours)
  • 48 cores (12 cores × 4 sockets)
– 256GB memory
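The static scheduling described above can be mimicked in a few lines (a sketch of one common chunking rule; OpenMP's exact split of a remainder may differ slightly):

```python
def static_schedule(n_iters, n_threads):
    """Split the iteration space into contiguous chunks of about
    n_iters / n_threads per thread (ceiling division for the chunk size)."""
    chunk = (n_iters + n_threads - 1) // n_threads
    return [list(range(t * chunk, min((t + 1) * chunk, n_iters)))
            for t in range(n_threads)]
```

Contiguous chunks keep each thread working on adjacent columns, which is cache-friendly for the column-major access pattern of coordinate descent.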


Page 21

Datasets


(Slide table of datasets; NNZ = number of non-zero entries.)

Page 22

Convergence rates


(Presenter's note: I have no idea why the curves look this way.)

Page 23

Scalability


Page 24

Summary

• Presented GenCD, a generic framework for expressing parallel coordinate descent
– Select, Propose, Accept, Update

• Performed convergence and scalability tests for the four algorithms
– but the authors do not favor any of these algorithms over the others

• The condition for convergence of the THREAD-GREEDY algorithm is an open question


Page 25

References

• [Yuan and Lin 10] G. Yuan, C. Lin, "A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification", Journal of Machine Learning Research, vol. 11, pp. 3183-3234, 2010.

• [Bradley+ 11] J. K. Bradley, A. Kyrola, D. Bickson, C. Guestrin, "Parallel Coordinate Descent for L1-Regularized Loss Minimization", In Proc. ICML '11, 2011.


Page 26

The End
