Orthogonal Array Based Big Data Subsampling, Hongquan Xu (hqxu/stat201A/talk-bigdata2019.pdf · 2019. 5. 17)

Transcript of the slides.

Page 1:

Orthogonal Array Based Big Data Subsampling

Hongquan Xu

University of California, Los Angeles

Joint work with

Lin Wang, Jake Kramer and Weng Kee Wong

Hongquan Xu (UCLA) Orthogonal Array Based Big Data Subsampling 1 / 20

Page 2:

Motivation

- Data are big (sometimes redundant)
- Analyzing the full data may be computationally infeasible
- Storing all of the data may not be possible
- In many circumstances X is big while the label (or response) Y is expensive to obtain:

  Big X                      Small Y
  Patients' records          Performance of a new medicine
  Images as visual stimuli   Brain response


Page 3:

Linear Regression Setup

y = β0 + β1x1 + · · · + βpxp + ε

Question:

1. If the budget only allows k responses (labels), with k ≪ n, which k points should be labeled?
2. When all responses are available, how should a subsample of size k ≪ n be chosen to accelerate the computation?


Page 4:

Leverage Subsampling

A subsampling method consists of

- sampling probabilities πi, i = 1, . . . , n, with ∑_{i=1}^n πi = 1
- a weighted estimator β̂s = (Xsᵀ W Xs)⁻¹ Xsᵀ W Ys, where Xs is the subsample taken from X and W = diag(w1, . . . , wk) is a weight matrix

Two common choices:

1. Uniform sampling: πi = 1/n, wi = 1
2. Leveraging: πi = hii/(p + 1), wi = 1/πi, where hii = (X(XᵀX)⁻¹Xᵀ)ii

(Drineas et al., 2006; Ma et al., 2015)
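The two schemes above can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' code; the toy data, seed, and function names are my own.

```python
import numpy as np

def leverage_scores(X):
    """Diagonal of the hat matrix H = X (X'X)^{-1} X', via thin QR (H = Q Q')."""
    Q, _ = np.linalg.qr(X)
    return np.einsum("ij,ij->i", Q, Q)

def subsample_ols(X, y, pi, w, k, rng):
    """Weighted estimator beta_s = (Xs' W Xs)^{-1} Xs' W ys on a random subsample."""
    idx = rng.choice(len(y), size=k, replace=True, p=pi)
    Xs, ys, ws = X[idx], y[idx], w[idx]
    A = Xs.T @ (ws[:, None] * Xs)
    b = Xs.T @ (ws * ys)
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
n, p, k = 1000, 2, 100
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, (n, p))])
beta = np.array([1.0, 2.0, 2.0])
y = X @ beta + rng.normal(size=n)

# 1. Uniform sampling: pi_i = 1/n, w_i = 1
pi_u = np.full(n, 1 / n)
b_unif = subsample_ols(X, y, pi_u, np.ones(n), k, rng)

# 2. Leveraging: pi_i = h_ii / (p + 1), w_i = 1/pi_i
# (leverage scores of the model matrix sum to p + 1, so pi sums to 1)
h = leverage_scores(X)
pi_l = h / (p + 1)
b_lev = subsample_ols(X, y, pi_l, 1 / pi_l, k, rng)
```

Note that any constant rescaling of the weights cancels in (Xsᵀ W Xs)⁻¹ Xsᵀ W Ys, so wi = 1/πi can be used directly.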


Page 5:

Information-based Subsampling

The OLS estimator for a subsample Xs is β̂s = (Xsᵀ Xs)⁻¹ Xsᵀ Ys

E(β̂s) = β,  Var(β̂s) = σ²(Xsᵀ Xs)⁻¹

The Fisher information matrix for β with subdata is Ms = Xsᵀ Xs

D-optimality: find an Xs with k points that maximizes det(Ms)

Available approach: IBOSS (Wang H., Yang, and Stufken, 2018, JASA), which includes the data points with extreme (largest and smallest) covariate values
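The IBOSS idea of keeping extreme covariate values can be sketched as follows. This follows Wang, Yang and Stufken (2018) only loosely (in particular, overlap between covariates is not re-drawn here), and the function name is my own.

```python
import numpy as np

def iboss(X, k):
    """Select roughly k rows: for each covariate, keep the k/(2p) rows with
    the smallest and the k/(2p) rows with the largest values."""
    n, p = X.shape
    r = k // (2 * p)
    keep = set()
    for j in range(p):
        order = np.argsort(X[:, j])
        keep.update(order[:r])    # smallest values of covariate j
        keep.update(order[-r:])   # largest values of covariate j
    return np.array(sorted(keep))

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (1000, 2))
idx = iboss(X, k=100)
```

By construction the subsample contains the global minimum and maximum of every covariate, which is what drives the determinant of the information matrix up.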


Page 6:

Orthogonal Array-based Subsampling

Lemma
Suppose each covariate is scaled to [−1, 1]. For a subsample Xs of size k,

det(Ms) ≤ k^(p+1),

with equality (D-optimality) if and only if Xs forms a two-level OA with levels from {−1, 1} and strength t ≥ 2.
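The bound is easy to check numerically on a tiny case: the 2² full factorial is a two-level OA of strength 2, and with the intercept column appended (so that Ms is the (p+1) × (p+1) information matrix for β0, . . . , βp) the determinant attains k^(p+1) exactly. A sketch:

```python
import numpy as np

# 2^2 full factorial: a two-level OA of strength 2 with k = 4 runs, p = 2
Xs = np.array([[-1, -1],
               [-1,  1],
               [ 1, -1],
               [ 1,  1]], dtype=float)
k, p = Xs.shape

# Model matrix with intercept column; all columns are mutually orthogonal,
# so Ms = F'F = k * I and det(Ms) = k**(p+1)
F = np.column_stack([np.ones(k), Xs])
Ms = F.T @ F
print(np.linalg.det(Ms))   # 64.0 = 4**3 = k**(p+1)
```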


Page 7:

A Measure of “Similarity”

For x, y ∈ [−1, 1], define

δ(x, y) = 2 − (x² + y²)/2, if sign(x) = sign(y);
δ(x, y) = 1 − (x² + y²)/2, otherwise.

[Figure: plot of δ(x, y) over the square [−1, 1]²]

Consequently,

1 ≤ δ(x, y) ≤ 2, if sign(x) = sign(y);
0 ≤ δ(x, y) ≤ 1, otherwise.

δ(−1, 1) = δ(1, −1) = 0
δ(−1, −1) = δ(1, 1) = 1
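The two-branch definition collapses to a one-liner: add an indicator for matching signs. A minimal sketch of δ (the behavior at x = 0 or y = 0 is my own convention, since the slide leaves it unspecified):

```python
import numpy as np

def delta(x, y):
    """delta(x, y) on [-1, 1]: 2 - (x^2 + y^2)/2 when sign(x) = sign(y),
    and 1 - (x^2 + y^2)/2 otherwise."""
    same_sign = 1.0 if np.sign(x) == np.sign(y) else 0.0
    return same_sign + 1.0 - (x**2 + y**2) / 2.0

print(delta(-1, 1), delta(1, 1))   # 0.0 1.0, the two boundary cases above
```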


Page 8:

J-optimality

For two data points xi = (xi1, . . . , xip) and xj = (xj1, . . . , xjp), let

δ(xi, xj) = ∑_{l=1}^p δ(xil, xjl)

and define the J-optimality criterion (Xu 2002, Technometrics) as

J(Xs) = ∑_{1≤i<j≤k} [δ(xi, xj)]².

Theorem
For any k-point subsample Xs over [−1, 1]^p,

J(Xs) ≥ [k²p(p + 1) − 4kp²]/8,

with equality if and only if Xs forms an OA with strength 2.

Question: How to find an Xs that minimizes J(Xs)?
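The lower bound can be verified on the 2² full factorial, which is an OA of strength 2. A sketch (names are mine; δ is as defined on page 7):

```python
import numpy as np
from itertools import combinations

def delta1(x, y):
    """Coordinate-wise delta: 2 - (x^2+y^2)/2 if signs agree, else 1 - (x^2+y^2)/2."""
    same_sign = 1.0 if np.sign(x) == np.sign(y) else 0.0
    return same_sign + 1.0 - (x * x + y * y) / 2.0

def J(Xs):
    """J(Xs) = sum over pairs i < j of [sum_l delta(x_il, x_jl)]^2."""
    return sum(sum(delta1(a, b) for a, b in zip(xi, xj)) ** 2
               for xi, xj in combinations(Xs, 2))

Xs = [(-1, -1), (-1, 1), (1, -1), (1, 1)]   # OA of strength 2, k = 4, p = 2
k, p = 4, 2
bound = (k**2 * p * (p + 1) - 4 * k * p**2) / 8
print(J(Xs), bound)   # both equal 4.0: the bound is attained
```

The six pairwise δ values here are 1, 1, 0, 0, 1, 1, so J(Xs) = 4, matching [16 · 2 · 3 − 4 · 4 · 4]/8 = 4.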

Page 9:

Algorithm: Select and eliminate points simultaneously

Step 0. Given the n × p matrix X, scale each variable to [−1, 1].
Step 1. Find the point farthest from the origin; include it in Xs and let i = 1.
Step 2. For each x ∈ X, compute the J-score J(x, Xs) = ∑_{xs ∈ Xs} δ(x, xs)².
Step 3. Find the x* ∈ X that minimizes J(x, Xs) and add x* to Xs.
Step 4. Keep the t = ⌊n/i⌋ points in X with the t smallest J(x, Xs) values; remove x* and all other points from X.
Step 5. Increase i by 1 and repeat Steps 2–4 until Xs contains k points.

Note: The complexity is O(np log(k)), since ∑_{i=1}^k (1/i) = O(log(k)).


Page 10:

Algorithm

[Figure: scatter plot of x[, 1] vs x[, 2] over [−1, 1]²]


Page 11:

Algorithm

[Figure: scatter plot of the remaining candidate points, x[candi, 1] vs x[candi, 2]]


Page 12:

Algorithm

[Figure: scatter plot of the remaining candidate points, x[candi, 1] vs x[candi, 2]]


Page 13:

Algorithm

[Figure: scatter plot of the remaining candidate points, x[candi, 1] vs x[candi, 2]]


Page 14:

Algorithm

[Figure: scatter plot of the remaining candidate points, x[candi, 1] vs x[candi, 2]]


Page 15:

A Toy Simulation

Setup: x1, x2 iid ∼ Unif[−1, 1], n = 1000, p = 2, k = 100, ε ∼ N(0, 1)

y = 1 + 2x1 + 2x2 + ε
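This toy setup is small enough to reproduce directly. A sketch of the data-generating step and the OLS fits on the full data and on a uniform random subsample (the seed and variable names are my own; the IBOSS and OA-based selections are not repeated here):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 1000, 100
x = rng.uniform(-1, 1, (n, 2))                             # x1, x2 ~ Unif[-1, 1]
y = 1 + 2 * x[:, 0] + 2 * x[:, 1] + rng.normal(size=n)     # eps ~ N(0, 1)

# Full-data OLS as the benchmark
F = np.column_stack([np.ones(n), x])
beta_full, *_ = np.linalg.lstsq(F, y, rcond=None)

# OLS on a uniform random subsample of size k
idx = rng.choice(n, size=k, replace=False)
beta_sub, *_ = np.linalg.lstsq(F[idx], y[idx], rcond=None)
```

Both estimates should land near (1, 2, 2), with the subsample estimate noticeably noisier; the point of the slide is that the OA-based subsample shrinks that extra noise.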

[Figure: three scatter plots of x_1 vs x_2 over [−1, 1]²: the full data (1000 × 2), the IBOSS subsample with k = 100, and the OA-based subsample with k = 100]


Page 16:

Comparison

[Figure: boxplots of the MSEs for the slope parameters over 1000 repetitions, comparing Unif_Sampling, Leveraging, IBOSS, and OA-based; y-axis roughly 0.00 to 0.20]


Page 17:

Theoretical Properties

Assume that the covariates x1, . . . , xp are i.i.d. with a uniform distribution on [a, b], so that (x1, . . . , xp) is uniform on [a, b]^p, and consider the model

y = β0 + β1x1 + · · · + βpxp + ε.

Theorem
OA-based subsampling minimizes the MSE, i.e., the sum of the variances of the coefficient estimates, almost surely as n → ∞.


Page 18:

Model with Interactions

y = β0 + β1x1 + · · · + βpxp + β12x1x2 + · · · + β(p−1)p x(p−1)xp + ε

Define the J-optimality criterion as

J4(Xs) = ∑_{1≤i<j≤k} [δ(xi, xj)]⁴.

Assume that the covariates x1, . . . , xp are i.i.d. with a uniform distribution on [a, b].

Theorem
OA-based subsampling minimizes the MSE, i.e., the sum of the variances of the coefficient estimates, almost surely as n → ∞.
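J4 differs from J only in the exponent on the pairwise δ. A sketch (δ as defined on page 7; names are mine), again evaluated on the 2² full factorial:

```python
import numpy as np
from itertools import combinations

def delta1(x, y):
    """Coordinate-wise delta from page 7."""
    same_sign = 1.0 if np.sign(x) == np.sign(y) else 0.0
    return same_sign + 1.0 - (x * x + y * y) / 2.0

def J4(Xs):
    """J4(Xs) = sum over pairs i < j of [sum_l delta(x_il, x_jl)]^4."""
    return sum(sum(delta1(a, b) for a, b in zip(xi, xj)) ** 4
               for xi, xj in combinations(Xs, 2))

Xs = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
print(J4(Xs))   # pairwise deltas are 1, 1, 0, 0, 1, 1, so J4 = 4.0
```

Raising the exponent from 2 to 4 penalizes large pairwise similarities more heavily, which is what makes the criterion sensitive to the two-factor interaction columns.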


Page 19:

Simulation

Setup: x1, . . . , xp iid ∼ Unif[−1, 1], n = 10^4, p = 10, k = 128, ε ∼ N(0, 3²)

y = 1 + x1 + · · · + x10 + x1x2 + · · · + x9x10 + ε

[Figure: boxplots of the MSEs for the slope parameters over 100 repetitions, comparing Unif_Sampling, Leveraging, IBOSS, and OA-based; y-axis roughly 0.5 to 1.5]


Page 20:

Summary

- Proposes a new framework for subsampling: OA-based subsampling
- Computational complexity: O(np log(k))
- Minimizing J(Xs) for a linear model without interactions attains optimality for coefficient estimation for large n
- Minimizing J4(Xs) for a linear model with interactions attains optimality for coefficient estimation for large n
- More efficient than leverage sampling and IBOSS
