20110620 amst rdam_kpb

104 views 0 download

description

AmstRdam presentation

Transcript of 20110620 amst rdam_kpb

IntroductionComputing in databases

Conclusion

Computing near the data:let someone else do the heavy lifting for you

Konrad Banachewicz

AmstRdam, June 20th 2011

Konrad Banachewicz Computing near the data

IntroductionComputing in databases

Conclusion

”We’re drowning in data and starving for information”

Konrad Banachewicz Computing near the data

IntroductionComputing in databases

Conclusion

Data coming in from the market:

1 liquid instrument (front month DAX Future), 1 day, 1exchange → 400 MB in pure ASCII

different parameters → ”clones” of the same instrument

{ exchanges } x { instruments } x { days }...= A LOT

Konrad Banachewicz Computing near the data

IntroductionComputing in databases

Conclusion

Data coming in from the market:

1 liquid instrument (front month DAX Future), 1 day, 1exchange → 400 MB in pure ASCII

different parameters → ”clones” of the same instrument

{ exchanges } x { instruments } x { days }...= A LOT

Konrad Banachewicz Computing near the data

IntroductionComputing in databases

Conclusion

Data coming in from the market:

1 liquid instrument (front month DAX Future), 1 day, 1exchange → 400 MB in pure ASCII

different parameters → ”clones” of the same instrument

{ exchanges } x { instruments } x { days }...= A LOT

Konrad Banachewicz Computing near the data

IntroductionComputing in databases

Conclusion

Data coming in from the market:

1 liquid instrument (front month DAX Future), 1 day, 1exchange → 400 MB in pure ASCII

different parameters → ”clones” of the same instrument

{ exchanges } x { instruments } x { days }...= A LOT

Konrad Banachewicz Computing near the data

IntroductionComputing in databases

Conclusion

Problems:

memory

bandwidth

Konrad Banachewicz Computing near the data

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

Typical approach

read the data to memory

analyze there

save the results

Konrad Banachewicz Computing near the data

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

Typical approach

read the data to memory

analyze there

save the results

Konrad Banachewicz Computing near the data

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

Typical approach

read the data to memory

analyze there

save the results

Konrad Banachewicz Computing near the data

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

Typical approach

read the data to memory

analyze there

save the results

Konrad Banachewicz Computing near the data

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

But is it really necessary?

Konrad Banachewicz Computing near the data

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

In many cases what we really need is aggregate info:Example: linear regression

classic estimatorβ̂ = (XTX )−1XT y

come to think about it, what we really need are sums, sums ofsquares and cross-products

Konrad Banachewicz Computing near the data

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

In many cases what we really need is aggregate info:Example: linear regression

classic estimatorβ̂ = (XTX )−1XT y

come to think about it, what we really need are sums, sums ofsquares and cross-products

Konrad Banachewicz Computing near the data

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

In many cases what we really need is aggregate info:Example: linear regression

classic estimatorβ̂ = (XTX )−1XT y

come to think about it, what we really need are sums, sums ofsquares and cross-products

Konrad Banachewicz Computing near the data

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

Two possible approaches:

1 Ripley i Chen: extra interface, pure R

2 R + SQL

Konrad Banachewicz Computing near the data

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

Ripley i Chen

R(user) // CORBA // R(servant)

��DB

Konrad Banachewicz Computing near the data

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

Alternative

R(user) // DBoo

Two scenarios:

1 pure R processing

2 computations partially in DB

Konrad Banachewicz Computing near the data

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

base model:Yt = β1 + β2Xt + εt

estimator:

β̂ =(XTX

)−1XTY

in the DB: arithmetic operations on a limited set of columns

Konrad Banachewicz Computing near the data

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

base model:Yt = β1 + β2Xt + εt

estimator:

β̂ =(XTX

)−1XTY

in the DB: arithmetic operations on a limited set of columns

Konrad Banachewicz Computing near the data

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

base model:Yt = β1 + β2Xt + εt

estimator:

β̂ =(XTX

)−1XTY

in the DB: arithmetic operations on a limited set of columns

Konrad Banachewicz Computing near the data

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

base model:Yt = β1 + β2Xt + εt

estimator:

β̂ =(XTX

)−1XTY

in the DB: arithmetic operations on a limited set of columns

Konrad Banachewicz Computing near the data

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

Pure R processing

200000 400000 600000 800000 1000000

05

1015

2025

30

Case study 1, method 1

Dataset size (number of rows)

Exec

utio

n tim

e (s

econ

ds)

Ingres VWIngresMySQLPostgreSQLDBMS X

Konrad Banachewicz Computing near the data

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

Computations partially in DB

200000 400000 600000 800000 1000000

05

1015

2025

30

Case study 1, method 2

Dataset size (number of rows)

Exec

utio

n tim

e (s

econ

ds)

Ingres VWIngresMySQLPostgreSQLDBMS X

Konrad Banachewicz Computing near the data

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

base model:

Cov(X ,Y ) = E [XY ]− EXEY

estimator:

ˆCov(X ,Y ) =1

n

n∑i=1

XiYi −

(1

n

n∑i=1

Xi

)(1

n

n∑i=1

Yi

)

in the DB: large queries

Konrad Banachewicz Computing near the data

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

base model:

Cov(X ,Y ) = E [XY ]− EXEY

estimator:

ˆCov(X ,Y ) =1

n

n∑i=1

XiYi −

(1

n

n∑i=1

Xi

)(1

n

n∑i=1

Yi

)

in the DB: large queries

Konrad Banachewicz Computing near the data

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

base model:

Cov(X ,Y ) = E [XY ]− EXEY

estimator:

ˆCov(X ,Y ) =1

n

n∑i=1

XiYi −

(1

n

n∑i=1

Xi

)(1

n

n∑i=1

Yi

)

in the DB: large queries

Konrad Banachewicz Computing near the data

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

base model:

Cov(X ,Y ) = E [XY ]− EXEY

estimator:

ˆCov(X ,Y ) =1

n

n∑i=1

XiYi −

(1

n

n∑i=1

Xi

)(1

n

n∑i=1

Yi

)

in the DB: large queries

Konrad Banachewicz Computing near the data

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

Pure R processing

15 20 25 30 35

010

2030

4050

60

Case study 1, method 1

Dataset size (columns)

Exec

utio

n tim

e (s

econ

ds)

Ingres VWIngresMySQLPostgreSQLDBMS X

Konrad Banachewicz Computing near the data

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

Computations partially in DB

15 20 25 30 35

010

2030

4050

60

Case study 1, method 1

Dataset size (columns)

Exec

utio

n tim

e (s

econ

ds)

Ingres VWIngresMySQLPostgreSQLDBMS X

Konrad Banachewicz Computing near the data

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

calculate a quantile of the portfolio PnL

Vp = inf {u : F (u) ≥ 1− p}

estimator:V̂p = X[n(1−p)]+1

in the DB: sorting

Konrad Banachewicz Computing near the data

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

calculate a quantile of the portfolio PnL

Vp = inf {u : F (u) ≥ 1− p}

estimator:V̂p = X[n(1−p)]+1

in the DB: sorting

Konrad Banachewicz Computing near the data

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

calculate a quantile of the portfolio PnL

Vp = inf {u : F (u) ≥ 1− p}

estimator:V̂p = X[n(1−p)]+1

in the DB: sorting

Konrad Banachewicz Computing near the data

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

calculate a quantile of the portfolio PnL

Vp = inf {u : F (u) ≥ 1− p}

estimator:V̂p = X[n(1−p)]+1

in the DB: sorting

Konrad Banachewicz Computing near the data

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

Pure R processing

2000000 4000000 6000000 8000000 10000000

020

4060

8010

0

Case study 3, method 1

Dataset size (number of rows)

Exec

utio

n tim

e (s

econ

ds)

Ingres VWIngresMySQLPostgreSQLDBMS X

Konrad Banachewicz Computing near the data

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

Computations partially in DB

200000 400000 600000 800000 1000000

020

4060

8010

0

Case study 3, method 2

Dataset size (number of rows)

Exec

utio

n tim

e (s

econ

ds)

Ingres VWIngresMySQLPostgreSQLDBMS X

Konrad Banachewicz Computing near the data

IntroductionComputing in databases

Conclusion

1 with minimal effort, significant speedups are possible

2 ODBC as minimal requirement

3 extensions: parallel computing...

Konrad Banachewicz Computing near the data