useR 2014 jskim


The R User Conference 2014: useR 2014

김진섭

UCLA in LA

6.30 − 7.3

김진섭 (UCLA in LA) The R User Conference 2014 6.30 − 7.3 1 / 33


Contents

1 What is useR?

2 1st day: Tutorial
   Applied Predictive Modeling in R
   Graphical Models and Bayesian Networks with R

3 2nd day

4 3rd day

5 4th day


What is useR?

Since 2004...

Main meeting of the R user and developer community.

Invited keynote lectures: a broad spectrum of topics ranging from technical and R-related computing issues to general statistical topics of current interest

User-contributed presentations


Figure. Afternoon tutorial: dplyr


1st day: Tutorial


List

http://user2014.stat.ucla.edu/#tutorials


Applied Predictive Modeling in R

Max Kuhn, Ph.D. (Pfizer Global R&D)

http://appliedpredictivemodeling.com

caret package in R

Outline

Conventions in R

Data Splitting and Estimating Performance

Data Pre-Processing

Over-Fitting and Resampling

Training and Tuning Tree Models

Training and Tuning A Support Vector Machine

Comparing Models

Parallel Processing


Lessons learned

1 Interpretation vs. prediction
   Trade-off: logistic regression is easy to interpret (odds ratios), but there is no need to insist on it. The logit assumption can be dropped whenever that improves predictive accuracy or speed.
   Transform the X variables freely as needed (e.g., log transformation, scaling, centering).
   R², AIC, p-values... vs. cross-validation, bootstrapping, sampling, ROC curves

2 Supervised machine learning
   Regression: simple, glm, PCA, penalized (ridge, lasso, elastic net)...
   Classification: K-nearest neighbors, trees
   Common to both: boosting, support vector machine (SVM)

3 Parallel processing
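
The contrast between in-sample criteria and resampling can be sketched as follows. This is a minimal illustration in plain Python, not the caret workflow from the tutorial; the simulated data, the helper names (`fit_line`, `mse`, `kfold_mse`) and the choice of five folds are all assumptions made for the example.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x, computed on centered x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return a, b

def mse(a, b, xs, ys):
    """Mean squared prediction error of the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def kfold_mse(xs, ys, k=5, seed=1):
    """Average held-out MSE over k folds (k-fold cross-validation)."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors.append(mse(a, b, [xs[i] for i in held_out],
                          [ys[i] for i in held_out]))
    return sum(errors) / k

# Simulated data (hypothetical): y = 2 + 0.5 * x + noise
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(60)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

a, b = fit_line(xs, ys)
in_sample = mse(a, b, xs, ys)   # error on the data used for fitting
cv = kfold_mse(xs, ys)          # error on data the model has not seen
print(f"in-sample MSE = {in_sample:.3f}, 5-fold CV MSE = {cv:.3f}")
```

The in-sample figure measures fit to data the model has already seen, while the cross-validated figure estimates how it predicts new data, which is the quantity the tutorial emphasised.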


Immediate and future tasks

1 Supervised learning

Apply it right away.
Can R handle our data???

2 Unsupervised learning

Deep learning: deep neural networks (DNN)
   Named a key technology of 2014, e.g., speech recognition.
   Take an online course? https://class.coursera.org/neuralnets-2012-001/lecture


Graphical Models and Bayesian Networks with R

Probability propagation with Bayesian networks (BNs) and their implementation in the gRain (gRaphical independence networks) package.

A look under the hood of BNs to understand mechanisms of probability propagation. Dependency graphs and conditional independence restrictions.

Log-linear models, graphical models, decomposable models and their implementation in the gRim (gRaphical independence models) package.

Model selection with gRim

Converting a decomposable graphical model to a Bayesian network.


The chest clinic narrative

p(V) = p(a) p(t|a) p(s) p(l|s) p(b|s) p(e|t,l) p(d|e,b) p(x|e)
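
The factorization above can be evaluated by brute-force enumeration, as in the sketch below. It is plain Python rather than gRain, and it sums over all states instead of using the propagation algorithm covered in the tutorial. The reading of the letters (a: visit to Asia, t: tuberculosis, s: smoking, l: lung cancer, b: bronchitis, e: either t or l, d: dyspnoea, x: positive X-ray) follows the usual telling of the chest clinic example, and the probability values are illustrative assumptions, not numbers from the slides.

```python
from itertools import product

# Illustrative conditional probabilities (assumed for this sketch).
P_A = 0.01                         # p(a = yes)
P_T = {True: 0.05, False: 0.01}    # p(t = yes | a)
P_S = 0.5                          # p(s = yes)
P_L = {True: 0.10, False: 0.01}    # p(l = yes | s)
P_B = {True: 0.60, False: 0.30}    # p(b = yes | s)
P_X = {True: 0.98, False: 0.05}    # p(x = yes | e)
P_D = {(True, True): 0.9, (True, False): 0.7,
       (False, True): 0.8, (False, False): 0.1}   # p(d = yes | e, b)

def bern(p_yes, value):
    """Probability of a yes/no value given the probability of yes."""
    return p_yes if value else 1.0 - p_yes

def p_e(e, t, l):
    """e is taken to be the logical OR of t and l (deterministic node)."""
    return 1.0 if e == (t or l) else 0.0

def joint(a, t, s, l, b, e, d, x):
    """p(V) as the product of the eight factors in the formula above."""
    return (bern(P_A, a) * bern(P_T[a], t) * bern(P_S, s)
            * bern(P_L[s], l) * bern(P_B[s], b) * p_e(e, t, l)
            * bern(P_D[(e, b)], d) * bern(P_X[e], x))

states = list(product([True, False], repeat=8))   # order: a, t, s, l, b, e, d, x
total = sum(joint(*v) for v in states)
p_t = sum(joint(*v) for v in states if v[1])
p_t_given_x = (sum(joint(*v) for v in states if v[1] and v[7])
               / sum(joint(*v) for v in states if v[7]))

print(f"sum over all states = {total:.6f}")
print(f"p(t = yes) = {p_t:.4f}, p(t = yes | x = yes) = {p_t_given_x:.4f}")
```

Enumeration touches all 2^8 states, which is fine for eight binary nodes but grows exponentially; avoiding that cost is the reason for the propagation machinery discussed in the tutorial.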


2nd day


Opening Keynote

John Chambers (Statistics, Stanford): Interfaces, Efficiency and Big Data

Rcpp: C++ functions → R

Rllvm, http://www.omegahat.org/Rllvm: compile R code and run the compiled version.

h2o: Java-based machine learning for big data.


1185 Terra Bella Ave, Mountain View, CA 94043 | 650-429-8337 | h2o.ai


H2O is the world’s fastest in-memory platform for machine learning and predictive analytics on big data. It is the only alternative to combine the power of highly advanced algorithms, the freedom of open source, and the capacity of truly scalable in-memory processing for big data on one or many nodes. Combined, these capabilities make it faster, easier, and more cost effective to harness big data to maximum benefit for the business.

With H2O, you can:

• Make better predictions. Harness sophisticated, ready-to-use algorithms and the processing power you need to analyze bigger data sets, more models, and more variables.

• Get started with minimal effort and investment. H2O is an extensible open source platform that offers the most pragmatic way to put big data to work for your business. With H2O, you can work with your existing languages and tools. Further, you can extend the platform seamlessly into your Hadoop environments.

Churn Prediction

• Banking. What are the profiles and usage patterns of customers who are most likely to defect?

• Online retail. What are the leading indicators and patterns of behavior to predict customer churn? Predict the segment of customers most likely to churn, and when, in order to intercept it and change their behavior.

Fraud Prediction

• Payment processor. Predict fraudulent activity using anomaly detection methods.

• Insurance. Stop fraud before claims are paid using real time scoring. Identify repeat offenders and score incoming claims based on fraudulent history patterns.

Scoring Engine

• Score customers based on purchase history and analyze the lifetime value of key accounts to discover upsell and cross-sell opportunities.

Pricing Engine

• Travel. Analyze different cost and promotional packages to create the most competitive combination of services.

• Healthcare. Discover new insights and create competitive services and healthcare programs by analyzing patient attributes, including environment, lifestyle and medical history.

Forecast

• Real estate. Predict property value and forecast sales by neighborhoods and regional variables. Analyze larger nationwide datasets vs. smaller sample sets to realize greater accuracy and find previously unnoticed patterns.

Key Benefits

BETTER PREDICTIONS

• Ready-to-use, powerful algorithms for regression, classification, clustering, and deep learning, along with advanced capabilities for churn prediction, recommendations, fraud prediction, and more.

SPEED

• In-memory processing provides real-time responsiveness and enables you to run more models.

• Fine-grain parallel distribution on big data, enabling accurate computations across one or many nodes by moving the code to the data.

EASE OF USE

• Easy setup and use, either through an intuitive Web interface or your existing tools, including R, Java, Scala, and Python.

• Model export in plain Java code for real-time scoring in any environment.

EXTENSIBILITY

• Seamless Hadoop integration with distributed data ingestion from HDFS and S3.

Algorithms

EXPLORATORY DATA ANALYTICS (EDA)

• Summary*
• K-Means*
• PCA*
• Data Munging/Transformation*

* Supported in R

ADVANCED ALGORITHMS

• Generalized Linear Model (GLM): Poisson, Gamma, Tweedie, binomial (logit), Gaussian*
• Random Forest*
• Gradient Boosted Regression*
• Gradient Boosted Classification*

* Low Latency Java Scoring

SCORING AND PREDICTION ENGINES

• GLM
• Random Forest
• Gradient Boosted Regression
• Gradient Boosted Classification
• K-Means

DEEP LEARNING

• Neural Networks

H2O: The Open Source In-Memory Prediction Engine

What can you do with better predictions? Expect more from your data.


Figure. H2O Billion Row Machine Learning Benchmark: GLM Logistic Regression

Configurations shown: H2O on 16 EC2 nodes, H2O on 48 EC2 nodes, Hadoop/Mahout

Timings shown:
   16.5 sec, 2 iterations, numerical
   34.9 sec, 3 iterations, numerical and categorical
   14.2 sec, 3 iterations, numerical and categorical
   5.6 sec, 2 iterations, numerical

Airline Dataset 1987-2013, 42 GB CSV, 1 billion rows, 12 input columns, 1 outcome column; 9 numerical features, 3 categorical features with cardinalities 30, 376 and 380

Compute Hardware: AWS EC2 c3.2xlarge, 8 cores and 15 GB per node, 1 GbE interconnect

Work with R, Familiar Tools and Intuitive Interfaces

Through its intuitive Web interface and integration with common tools, H2O makes it fast and easy to get started with big data analytics. The solution works seamlessly with R and R Studio. For example, using the R interface, you can forward workflows to H2O for big data processing, and work in a familiar interface while running algorithms on data sets that are hundreds of times larger than what would be possible on a user machine. H2O also features native support for Java, Scala, and Python. The solution’s interface is driven by JSON APIs, which makes it easy to plug into your organization’s existing tools and processes to train your data and continuously improve your models and predictive accuracy.

In-Memory Processing Responsiveness

With H2O, your organization can harness the responsiveness of highly optimized in-memory processing, so you can operationalize many more models and gain real-time intelligence in business transactions and interactions. With model export as plain Java code, you gain lightning fast real-time scoring in any environment. In addition, the solution enables data scientists to view partial query results while longer processes are running, so they can immediately spot a job that should be stopped and more quickly iterate to find the optimal approach.

Native R and Seamless Hadoop Integration

H2O can run as a standalone platform or within an existing Hadoop installation, bringing in-memory performance to Hadoop. H2O works with data in HDFS and supports familiar programming tools, such as Hive and Pig. In addition, the solution can be efficiently run in Amazon Web Services environments.

Fine-Grain Distributed Processing on Big Data at Speeds Up to 100x Faster

H2O lets you model interactively using in-memory processing, and delivers parallel distributed scalability required to support your big data production environments. The solution combines the responsiveness of in-memory processing with the ability to run fast serialization between nodes and clusters, so you can support the size requirements of your large data sets. Further, H2O does this distributed processing with fine-grain parallelism, which enables optimal efficiency, without introducing degradation in computational accuracy.

Join the H2O Movement

H2O brings better algorithms to big data. H2O is a fast open source in-memory prediction engine and machine learning platform. With H2O enterprises can use all of their data (instead of sampling) in real-time for better predictions. Users can model data quickly and make better data-driven decisions faster by running advanced algorithms such as Deep Learning, Classification, Regression, Decision Trees, Forests, Gradient Boosting, GLM, PCA and more. Data Scientists can take both simple & sophisticated models to production from the same interactive platform used for modeling within R and JSON.

Our earliest customers have built powerful domain specific predictive engines for Recommendations, Pricing, Outlier Detection and Fraud Prediction for Insurance and Ad Platforms. H2O is nurturing a grassroots movement of math, systems and data scientists to herald the new wave of Discovery with Big Data Science. H2O is on CRN’s 10 Coolest Big Data Products of 2013. www.h2o.ai

For latest features and updates, go to H2O Open Source Github Repository http://0xdata.github.io/h2o/

Copyright © H2O All rights reserved. All trademarks referenced herein belong to their respective companies.


Session 1: Bayesian

Approximation in Bayesian statistical computation

Applications to spatial analysis.

Bayesian software...


Session 2

R For Improving Consumer Engagement and Health Outcomes
   ActiveHealth Management, Inc
   Internal member data + externally purchased lifestyle & behavioral data
   K-Means Clustering and CART classification trees
   Response rates differ by group, so the approach is tailored to each group.

Shiny: R made interactive
   http://shiny.rstudio.com/gallery/kmeans-example.html

Fostering the next generation of open science with R
   http://ropensci.org/


Invited Talk

Martin Maechler (Math, Zurich): Good Practices in R Programming

Assorted coding tips (e.g., <- vs. =, comments, spacing, etc.)


Session 3

dplyr, data.table: high performance in the data step

PivotalR: A Package for Machine Learning on Big Data
   https://github.com/gopivotal/PivotalR


Poster Session 1: 28 posters

Reproducible Research in Public Health

Jinseob Kim1, Joohon Sung1,*

1. Complex Disease and Genetic Epidemiology Branch, Graduate School of Public Health, Seoul National University

Objectives

In this study, we aimed to construct pipelines of reproducible statistical analysis in health research. The development of pipelines in this study consists of

• Automatic suggestions of a summary table describing the general characteristics of the study

• Univariate analysis of both explanatory and outcome variables of a study

• Graphical presentations of summary and univariate analyses

• Automatic analysis and tabulations of main results

based on frequently used analytical methods in the health research area (e.g., multiple regression, logistic regression, survival analysis, multilevel analysis, genome-wide association study (GWAS)).

Background

Most health researchers who have learned and properly applied statistical methods have spent a long time learning statistics. Additionally, statistical methods are ever evolving, and keeping knowledge and analytic skills up to date requires the most precious resource: time.

Methods

With the xtable package in R and tex4ht in LaTeX, researchers get a PDF or odt (OpenDocument text) version of descriptive statistics when they select the data, the variables (factor or numeric), and a strata variable (sex, region, etc.) [1, 2]. Next, among the various analyses, we present a GWAS example using the FASTA method in the GenABEL R package or the fastassoc method in MERLIN. If they select lists of phenotypes, covariates and a kinship matrix, PDF or odt files including each phenotype's name, narrow-sense heritability, top SNPs in the GWAS results, QQ plot and Manhattan plot are created automatically [3, 4]. The example dataset is the TG and LDL phenotypes and chromosome 21 in the Healthy Twin Study, Korea [5].

Key Methods

Researchers can obtain tables and figures if they select the data set and the dependent variables of interest, and define the nature of each variable (e.g., continuous, binomial, count), the explanatory variables, and the group variable (e.g., sex, region, unit of random effects or family structure).
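
The descriptive-statistics step of such a pipeline (mean (SD) for numeric variables, N (%) for factors, per stratum) might be sketched as below. This is a plain-Python illustration, not the poster's R/xtable implementation; the function name `describe` and the four toy records are hypothetical and are not Healthy Twin Study data.

```python
from collections import Counter
from statistics import mean, stdev

def describe(rows, variable, kind, strata="sex"):
    """Summarise one variable within each stratum.

    kind == "numeric" gives "mean (SD)";
    kind == "factor" gives {level: "N (%)"}.
    """
    groups = {}
    for row in rows:
        groups.setdefault(row[strata], []).append(row[variable])
    summary = {}
    for group, values in groups.items():
        if kind == "numeric":
            summary[group] = f"{mean(values):.2f} ({stdev(values):.2f})"
        else:
            counts = Counter(values)
            summary[group] = {
                level: f"{n} ({100 * n / len(values):.2f})"
                for level, n in sorted(counts.items())
            }
    return summary

# Toy records, invented for the example.
rows = [
    {"sex": "Male", "age": 40, "smoke": "Current"},
    {"sex": "Male", "age": 50, "smoke": "No"},
    {"sex": "Female", "age": 30, "smoke": "No"},
    {"sex": "Female", "age": 50, "smoke": "No"},
]

print(describe(rows, "age", "numeric"))
print(describe(rows, "smoke", "factor"))
```

The user only declares the variable, its type and the strata variable, which mirrors the selection step described above; the P-value columns of the poster's tables are left out of this sketch.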

Example: GWAS-TG

Table: PDF file, descriptive statistics of TG (values are mean (SD) or N (%))

Variable    Male             Female           P-value   P-value: NP*
Age         44.6 (13.6)      43.95 (12.73)    0.208     0.294
Smoke                                         < 0.001   < 0.001
  No        278 (26.18)      1506 (90.29)
  Past      290 (27.31)      53 (3.18)
  Current   494 (46.52)      109 (6.53)
TG          140.34 (92.34)   98.8 (62.82)     < 0.001   < 0.001

*NP: non-parametric

Figure: PDF file, GWAS results of TG

TG (h² = 0.48)

SNP         Chromosome  Position  A1  A2  N     MAF    B FASTA  SE FASTA  P FASTA   B fassoc  SE fassoc  P fassoc
rs12626621  21          23496579  C   T   1838  0.167  13.61    3.46      8.41E-05  -12.82    3.68       5.00E-04
rs1702393   21          30866791  C   T   1832  0.353  10.67    2.72      8.76E-05  -11.51    2.96       9.99E-05
rs12627596  21          30867101  T   A   1840  0.352  10.06    2.72      2.10E-04  -11.06    2.95       1.81E-04
rs128592    21          30878280  C   T   1841  0.352  10.04    2.72      2.18E-04  -11.06    2.95       1.80E-04
rs198935    21          30867998  C   T   1799  0.358  10.13    2.74      2.19E-04  -11.42    2.98       1.26E-04
rs11702393  21          39304254  G   A   1832  0.297  10.71    2.91      2.35E-04  -12.55    3.13       6.23E-05
rs382004    21          19226026  T   C   1840  0.035  25.36    7.00      2.92E-04  -22.86    7.63       2.75E-03
rs1888516   21          41219289  G   A   1837  0.014  38.42    10.61     2.93E-04  -34.38    11.09      1.92E-03
rs9306107   21          44798662  G   A   1827  0.008  51.02    14.36     3.82E-04  -45.77    15.16      2.53E-03
rs426803    21          19225690  T   C   1839  0.037  24.02    6.78      3.98E-04  -22.16    7.38       2.67E-03
rs198936    21          30869966  C   T   1837  0.351  9.41     2.73      5.54E-04  -10.69    2.96       3.10E-04
rs1702405   21          30914735  A   G   1841  0.493  -9.05    2.63      5.86E-04  10.97     2.86       1.23E-04
rs2257149   21          36213059  G   A   1806  0.360  -9.39    2.75      6.24E-04  10.86     3.00       2.96E-04
rs174897    21          30919037  A   G   1835  0.492  -8.74    2.63      8.89E-04  10.62     2.85       1.92E-04
rs198871    21          30932579  G   A   1819  0.492  -8.76    2.64      8.91E-04  10.42     2.85       2.59E-04

Table 1: GWAS: TG

Figure 1: QQplot: TG. Panels: (a) QQplot-FASTA: TG, (b) QQplot-Fassoc: TG

Figure 2: Manhattan plot: TG. Panels: (a) Manhattan plot-FASTA: TG, (b) Manhattan plot-Fassoc: TG

Example: GWAS-LDL

Table: PDF file, descriptive statistics of LDL (values are mean (SD) or N (%))

Variable       Male             Female           P-value   P-value: NP*
Age            44.6 (13.6)      43.95 (12.73)    0.208     0.294
FBS            97.12 (19.67)    91.36 (16.54)    < 0.001   < 0.001
tCholesterol   191.08 (34.99)   188.63 (35.83)   0.077     0.031
HDL            46.45 (11.24)    52.26 (12.72)    < 0.001   < 0.001
LDL            112.84 (30.7)    108.76 (30.3)    < 0.001   < 0.001

*NP: non-parametric

Figure: PDF file, GWAS results of LDL

LDL (h² = 0.47)

SNP         Chromosome  Position  A1  A2  N     MAF    B FASTA  SE FASTA  P FASTA   B fassoc  SE fassoc  P fassoc
rs4818418   21          18802344  C   T   1823  0.276  4.87     1.13      1.53E-05  -4.54     1.22       1.93E-04
rs1735790   21          18788001  A   G   1792  0.274  4.90     1.15      2.10E-05  -5.20     1.24       2.82E-05
rs2824856   21          18793090  G   A   1811  0.278  4.82     1.14      2.21E-05  -4.71     1.22       1.15E-04
rs2824898   21          18813513  T   C   1841  0.280  4.70     1.12      2.50E-05  -4.46     1.21       2.34E-04
rs2252190   21          18792169  A   G   1839  0.277  4.74     1.13      2.54E-05  -4.81     1.22       7.94E-05
rs2824899   21          18813559  T   C   1840  0.280  4.69     1.12      2.68E-05  -4.46     1.21       2.34E-04
rs914244    21          46340779  C   T   1839  0.408  -4.38    1.05      2.88E-05  4.78      1.15       3.07E-05
rs2824857   21          18793134  G   T   1837  0.279  4.67     1.12      3.26E-05  -4.71     1.22       1.09E-04
rs2026211   21          18806617  C   T   1842  0.276  4.64     1.12      3.51E-05  -4.38     1.21       3.10E-04
rs2824880   21          18807606  A   G   1842  0.276  4.64     1.12      3.51E-05  -4.38     1.21       3.10E-04
rs2838534   21          44498077  T   C   1811  0.352  3.94     1.06      2.04E-04  -4.69     1.16       5.22E-05
rs456164    21          27761178  T   C   1842  0.395  -3.82    1.05      2.85E-04  4.46      1.14       9.88E-05

Table 1: GWAS: LDL

Figure 1: QQplot: LDL. Panels: (a) QQplot-FASTA: LDL, (b) QQplot-Fassoc: LDL

Figure 2: Manhattan plot: LDL. Panels: (a) Manhattan plot-FASTA: LDL, (b) Manhattan plot-Fassoc: LDL

Conclusion

Using the xtable package in R, LaTeX and the tex4ht package in LaTeX together with various statistical packages in R, we developed a tool that automatically generates the text describing the result tables and figures, directly in PDF or OpenDocument format [1, 2]. Though we presented only descriptive statistics and GWAS examples, pipelines for other analyses (e.g., survival analysis, multilevel analysis, etc.) were also built using similar packages plus some additional ones. These automated statistical pipeline tools will help individual researchers in health-related or broader areas reduce their analytical burden and conduct appropriate statistical analyses faster and more reliably.

References

[1] David B. Dahl. xtable: Export tables to LaTeX or HTML, 2014. R package version 1.7-3.

[2] Emma Cliffe. Methods to produce flexible and accessible learning resources in mathematics: overview document. 2012.

[3] GenABEL project developers. GenABEL: genome-wide SNP association analysis, 2013. R package version 1.8-0.

[4] Wei-Min Chen and Gonçalo R Abecasis. Family-based association tests for genomewide association scans. The American Journal of Human Genetics, 81(5):913–926, 2007.

[5] Joohon Sung, Sung-Il Cho, Yun-Mi Song, Kayoung Lee, Eun-Young Choi, Mina Ha, Jihae Kim, Ho Kim, Yeonju Kim, Eun-Kyung Shin, et al. Do we need more twin studies? The Healthy Twin Study, Korea. International Journal of Epidemiology, 35(2):488–490, 2006.

Contact Information

• Web: http://snugepi.snu.ac.kr
• Email: [email protected]
• Phone: +82-2-880-2743


Is visualization the trend?

Two Koreans: Yonsei University College of Dentistry (absent), Sungkyunkwan University (engineering?)


3rd day


Invited Talk

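Dirk Eddelbuettel: R, C++ and Rcpp

Rcpp: the convenience of R + the speed of C++

Write C++ functions easily from R (no pointers needed).
R functions can be embedded in C++ code.
Move freely back and forth between R and C++...

Docker: a new kind of virtual machine

Traditional virtual machine: program + libraries + OS
Docker: program + libraries only
Linux only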


Sponsors Talk

Revolution Analytics

Commercial R.

Better than SAS.

Oracle: ROracle

R + Oracle database

Google

No SAS.

Hiring engineers and computer science and statistics graduates.


Session 4

R in the Midst of Exploding Stars: Distributed, Time-Domain Transient Classification

NASA uses R too.

Machine learning (clustering, etc.) is used to classify stars such as supernovae.

Imputation of Missing Values with the R Package VIM

Supports a variety of imputation methods.

Provides graphical displays of imputation results.

PSAboot: An R Package for Bootstrapping Propensity Score Analysis

http://github.com/jbryer/psa

Risk of bias due to unobserved covariates

Runs PSA with bootstrapping using several (five) methods.

Permutation Tests in Multidimensional Scaling

Multidimensional information → represented as relative distances.

smacof package in R

Permutation test: dissimilarities distributed at random (null) vs. the distribution in the data
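
The logic of a permutation test (compare the observed statistic with its distribution under random relabelling) can be sketched generically. The example below is not the smacof procedure from the talk: instead of an MDS fit statistic it uses a swapped-in, much simpler statistic, the absolute difference of two group means, and the function name and data are hypothetical.

```python
import random

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(sum(x) / len(x) - sum(y) / len(y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                      # random relabelling = null
        px, py = pooled[:len(x)], pooled[len(x):]
        if abs(sum(px) / len(px) - sum(py) / len(py)) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)

separated = permutation_p_value([10, 11, 12, 13, 14, 15], [1, 2, 3, 4, 5, 6])
identical = permutation_p_value([1, 2, 3, 4], [1, 2, 3, 4])
print(separated, identical)
```

In the MDS setting the same comparison would be made with a fit statistic recomputed on permuted dissimilarities rather than with a mean difference.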


Invited Talk

David Diez (OpenIntro): Textbooks struggle where software succeeds

http://www.openintro.org

Open source textbook: paperback < $10

Labs, videos, materials for teachers (slides, etc.)

statsTeachR.org, OpenStaxCollege.org, https://www.coursera.org


Session 5: Biology/Ecology

Simulating Influenza Transmission with Real Network Data

Network data: High school, chip??, statnet package: graph

Simulation: various incubation period, infection period, transmission probability. . .

Enhancing Medical Reporting by Combining Electronic Health Records with REDCap: Applications of the REDCap API

http://www.project-redcap.org

When collecting data, there is no need to review charts one by one; the tool identifies the patients the physician should pay attention to.

Simulations for regulatory decision making: How many simulations do we need to run?

Simulation matters to regulatory agencies (e.g., FDA).

Simulation is essential for complex models (not analytically tractable).

In some cases 1,000 or 10,000 runs are nowhere near enough; it can take more than 4 million. R & parallel computing...

Monitoring Patients with Ongoing Reduced Kidney Function
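
On the question raised in the regulatory simulation talk above (how many runs are needed), a back-of-the-envelope calculation shows how the count can reach the millions. The sketch assumes the simulation estimates a probability p and uses the normal-approximation interval p ± z·sqrt(p(1−p)/n); the target values (a 5% rate, estimated to within ±1 or ±0.02 percentage points) are illustrative assumptions, not figures from the talk.

```python
import math
from statistics import NormalDist

def runs_needed(p, half_width, confidence=0.95):
    """Smallest n with z * sqrt(p * (1 - p) / n) <= half_width."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil(p * (1 - p) * (z / half_width) ** 2)

coarse = runs_needed(0.05, 0.01)      # 5% rate, within +/- 1 percentage point
fine = runs_needed(0.05, 0.0002)      # 5% rate, within +/- 0.02 points
print(coarse, fine)
```

Because the required n grows with the inverse square of the half-width, tightening the precision 50-fold multiplies the number of runs by 2,500, which is where parallel computing becomes necessary.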


Poster session 2: 28 posters

The trends are:

1 Visualization (e.g., shiny)

   Shiny-ing compareGroups
   Data Works: An Interactive Data Visualization Application Built with Shiny
   R graphics in Tidal Wetland Restoration
   Statistics without Numbers: Using Data Visualization to Quantify Trends for Cycling Safety
   Visually Analyzing and Running Multilevel Data in R and BUGS
   Using RGraphviz as a first pass for layout of small structural model graphs
   Developing shiny applications for the classroom

2 Automatic reporting tools

   Multi-center Clinical trials reporting with R
   Teaching data analysis in R through the lens of reproducibility
   Better Data Quality In Clinical Trials


4th day


Invited Talk

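Karline Soetaert (Head of the Department of Ecosystem Studies, Royal Netherlands Institute of Sea Research): Solving differential equations in R

Marine science: you cannot measure everything. It is expensive...

The rest is estimated with differential equations.

After working with a mix of different tools, she developed the deSolve package in 2008 to do everything in R.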
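
As a rough idea of what solving a differential equation numerically involves, here is a fixed-step classical Runge-Kutta (RK4) integrator in plain Python. It is not deSolve and does not use deSolve's solvers; the first-order decay model dy/dt = -k·y and its parameter values are hypothetical, chosen because the exact solution is known and can be compared.

```python
import math

def rk4(f, y0, t0, t1, steps):
    """Integrate dy/dt = f(t, y) from t0 to t1 with classical RK4."""
    h = (t1 - t0) / steps
    t, y = t0, y0
    for _ in range(steps):
        k1 = f(t, y)
        k2 = f(t + h / 2, y + h * k1 / 2)
        k3 = f(t + h / 2, y + h * k2 / 2)
        k4 = f(t + h, y + h * k3)
        y += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += h
    return y

k = 0.3                                   # hypothetical decay rate

def decay(t, y):
    """Right-hand side of dy/dt = -k * y."""
    return -k * y

approx = rk4(decay, 10.0, 0.0, 5.0, 100)  # numerical solution at t = 5
exact = 10.0 * math.exp(-k * 5.0)         # analytical solution at t = 5
print(f"RK4: {approx:.8f}, exact: {exact:.8f}")
```

For a model like this the exact answer is available, so the solver can be checked; the models in the talk are the ones where only the numerical route exists.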


Session 6

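Missed because of a 13:10 departure flight.

Plan

mlr: machine learning package in R
rapport: a report templating system in R

http://rapport-package.info/#templates

A report tool for commonly used statistical analyses
   e.g., "The mean of Var1 was mean and the standard deviation was sd." + Table + Histogram.
   Publication-style tables are not implemented.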
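
The idea of a report template (a fixed sentence with computed values filled in) can be sketched in a few lines. This uses Python's `string.Template`, not rapport's own template syntax; the sentence wording, the function name `report` and the sample values are assumptions for illustration.

```python
from statistics import mean, stdev
from string import Template

# Placeholder sentence, modelled on the example above.
SENTENCE = Template(
    "The mean of $name was $mean and the standard deviation was $sd."
)

def report(name, values, digits=2):
    """Fill the sentence template with summary statistics of `values`."""
    return SENTENCE.substitute(
        name=name,
        mean=f"{mean(values):.{digits}f}",
        sd=f"{stdev(values):.{digits}f}",
    )

line = report("Var1", [2, 4, 4, 4, 5, 5, 7, 9])
print(line)
```

A full templating system would add the table and histogram to the same document; only the text part is shown here.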


Summary: the trends at useR 2014?

1 Machine Learning: Predictive modelling

H2O, Alteryx, RStudio, TIBC: mostly predictive modelling companies

2 Performance : parallel, other languages . . .

3 Visualization

4 Reproducible research


http://user2015.math.aau.dk
