NATIONAL CENTER FOR SUPERCOMPUTING APPLICATIONS
2017 NCSA Industry Conference
HPC for Agriculture, in Genomics and elsewhere
2017 Industry Conference
NCSA Genomics for Industry and Academe
Hospitals
Data Analytics
Computational thermo/fluids
Multiphysics
Computational biology, genomics
Pharma
Agro
International science
UIUC science
NCSA Industry
NCSA Genomics: an interdisciplinary team
Team roles: Staff, Post-Docs, Graduate students, Undergraduates
• Brian Bliss: Computer Science
• Matthew Kendzior: Crop Science, Plant Biotechnology, Molecular Biology
• Katherine Kendig: Fine Arts, Writing, Management
• Daniel Wickland: Informatics
• Ramshankar Venkatakrishnan: ECE, Electronics, Communications
• Weihao Ge: Physics, Biophysics
• Joshua Allen: Math, Writing, Bioinformatics
• Daniel Lanier: Math, containerization, cloud computing
• Brian Rao: Molecular Biology, Bioinformatics
• Shubham Rawlani: ECE, Information Management
• Yazhuo Zhang: Information Management
• Priya Balgi: Business analytics, Information Technology
• Sparsh Agarwal: Biotechnology
• Prakruthi Burra: Bioinformatics, Math, CS
• Dave Istanto: Crop Sciences, Bioinformatics
• Mingyu Yang: CS
• Dipro Ray: CS, Math, Statistics
• Tajesvi Bhat: CS
Diverse range of projects
IT projects
1. Big data network transfers
2. Parallel file packaging and archiving utilities for big data
3. Hardware benchmarking
4. Testing and validation of data analytics platform
5. Software installation, testing, profiling, optimization
Software projects
6. Optimization of R code for data cleaning in multidimensional heterogeneous datasets
7. Scaling computation of epistatic interactions in GWAS data
8. Instrumenting a high-throughput variant calling workflow on Blue Waters, iForge, AWS
9. Workflow management and automation with Swift/T, Nextflow and Cromwell/WDL
Science projects
11. Profiling mutations in cancer
12. Detecting structural variants in polyploid plants via de-novo assembly
13. Population-level variant recall for very large cohorts in Alzheimer’s disease
14. Evolution of persistence strategies in biomolecular networks
Part 1: IT project examples
Example 1: Software installs and config
The field is far from settled: there are a great many software packages, and they keep changing.
In-house software installs
• On iForge: >200 software packages
https://omictools.com/
omicX (France) owns and operates a platform for searching available bioinformatics tools:
• Genomics: 26,325 tools
• Proteomics: 8,034
• Metabolomics: 3,396
Why this is successful:
• We actually practice using these software packages daily
Hail: SCALING GENETIC DATA ANALYSIS WITH APACHE SPARK: Keynote by Cotton
Hail: 0.1 (stable) and 0.2 beta (development). “We recommend that new users install 0.2 beta”
Example 2: workflow maintenance
Cantarel, Brandi L., et al. "MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes." Genome research 18.1 (2008): 188-196.
Mohamed, Amin Roushdy. Transcriptomics of coral-algal interactions: novel insights into the establishment of symbiosis. Diss. James Cook University, 2016.
Example 3: Hardware purchasing decisions
Problem:
• Stakeholder is planning to scale up their analysis in software X
• Current infrastructure is not adequate
• In-house scalability analysis yields contradictory and confusing results
Solution: run a project with NCSA Industry
• A domain expert studies software X
• Performs the scalability analysis correctly
• Scales up the problem on iForge
• Makes a purchasing recommendation based on data
Why this is successful:
• We already have the equipment before it hits the market
• We practice performance and scalability analysis daily
• If we do not yet have something ourselves, we will ask for it, and shall receive
• Diverse expertise is available on hand
Investigate issues of:
• User behavior
• Software installation
• Nature of the data
• Script misconfiguration
• Filesystem
• Cluster interconnect
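A scalability analysis like the one in this example usually starts from wall-clock timings taken at several core counts. The sketch below is only an illustration, not the team's actual procedure: the function names and the timings are hypothetical. It computes speedup and parallel efficiency relative to the smallest run, and estimates the serial fraction with the Karp-Flatt metric, which assumes the baseline run used a single core.

```python
def scaling_table(timings):
    """Return (cores, speedup, efficiency) rows relative to the smallest core count.

    `timings` maps core count -> wall-clock seconds (hypothetical measurements).
    """
    base_cores = min(timings)
    base_time = timings[base_cores]
    rows = []
    for cores in sorted(timings):
        speedup = base_time / timings[cores]
        ideal = cores / base_cores  # speedup expected under perfect scaling
        rows.append((cores, speedup, speedup / ideal))
    return rows


def karp_flatt(speedup, cores):
    """Experimentally determined serial fraction; requires cores > 1
    and a speedup measured against a single-core baseline."""
    return (1.0 / speedup - 1.0 / cores) / (1.0 - 1.0 / cores)


# Hypothetical timings (seconds) for software X at 1, 2, 4 and 8 cores.
timings = {1: 100.0, 2: 55.0, 4: 32.5, 8: 21.25}
for cores, speedup, efficiency in scaling_table(timings):
    print(f"{cores:>3} cores  speedup {speedup:5.2f}  efficiency {efficiency:4.2f}")
```

A serial fraction that stays flat as the core count grows points at inherently serial code; one that rises points at overheads such as filesystem or interconnect contention.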
Part 2: Software project examples
Example 1: acceleration, performance tuning
Problem:
• Software is scientifically solid, but too slow to go live in production, given the data volumes at the company
Solution: a team
• Studies the algorithms and their language implementation
• Profiles the code
• Benchmarks performance on various hardware
• Implements code improvements
• Recommends better ways of configuring, scheduling and running the software
• Suggests software alternatives
Why this is successful:
• We run research projects in our domain, thus staying on top of recent developments
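Code profiling is typically the first step in this kind of tuning. The following is a minimal sketch using Python's standard `cProfile` and `pstats` modules. The `slow_sum_of_squares` workload and the `profile_call` helper are hypothetical stand-ins, not part of any software named in the slides.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Hypothetical hot spot: a deliberately naive loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args, top=5):
    """Run func(*args) under cProfile.

    Returns the function result and a text report of the `top` entries,
    sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args)
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


result, report = profile_call(slow_sum_of_squares, 100_000)
print(report)
```

The report shows where the time goes, so later benchmarking and code changes can target the functions that dominate the run.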
Example 2: novel, optimized applications
Fast, epistatic GWAS models in large datasets
Complex continuous traits require complex genotype/phenotype models
Stat research:
• Several approaches for analysis of complex traits, e.g. SEMS and LASSO
• Include biological information to focus the model search
• Develop methods for reducing parameter search space
Code work:
• Generate algorithm and memory models for SEMS and LASSO.
• Determine which computational components can utilize GPU vs. CPU.
• Identify when it is appropriate to use SPARK vs. MPI.
• Compare, test, improve software components.
• Develop plugins for data pre/post-processing.
• Containerize this application.
SPARK/Scala
C/MPI/Hadoop/TensorFlow
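As a rough illustration of how LASSO can pick out an epistatic term, the sketch below adds pairwise SNP products as features and fits them by cyclic coordinate descent. It is a toy under stated assumptions, not the project's implementation: genotypes are coded -1/+1 for simplicity, the phenotype is synthetic, there is no intercept, and all function names are hypothetical.

```python
from itertools import combinations, product


def soft_threshold(value, penalty):
    """Soft-thresholding operator used in LASSO coordinate descent."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def add_pairwise_interactions(genotypes):
    """Append the product of every SNP pair to each row (a simple epistasis encoding)."""
    pairs = list(combinations(range(len(genotypes[0])), 2))
    rows = [list(row) + [row[i] * row[j] for i, j in pairs] for row in genotypes]
    return rows, pairs


def lasso_coordinate_descent(X, y, penalty, n_iter=200):
    """Minimize (1/2n)*||y - Xb||^2 + penalty*||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta while beta is all zeros
    col_sq = [sum(X[r][j] ** 2 for r in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes beta[j].
            rho = sum(X[r][j] * (residual[r] + X[r][j] * beta[j]) for r in range(n)) / n
            new_beta = soft_threshold(rho, penalty) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for r in range(n):
                    residual[r] -= X[r][j] * delta
                beta[j] = new_beta
    return beta


# Synthetic data: 3 SNPs over all -1/+1 combinations; the phenotype depends
# only on the interaction between SNP 0 and SNP 1.
genotypes = [list(g) for g in product((-1, 1), repeat=3)]
phenotype = [2.0 * g[0] * g[1] for g in genotypes]
features, pairs = add_pairwise_interactions(genotypes)
beta = lasso_coordinate_descent(features, phenotype, penalty=0.1)
print(dict(zip(["snp0", "snp1", "snp2"] + [f"snp{i}*snp{j}" for i, j in pairs], beta)))
```

The number of pairwise terms grows quadratically with the number of SNPs, which is why reducing the parameter search space matters at GWAS scale.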
Part 3: Science project examples
Example 1: Structural variation & HLHS
Structural variant calling
Variant calling by de-novo genome assembly and graph comparison
Cortex, Cortex-var, Popins, BASIL/ANISE
• Sequencing data harmonization (hundreds of samples)
• Assembly and variant calling workflow automation
• Development of the post-processing steps
• Annotation, validation and statistical analysis of results
Figure (workflow diagram), panel labels:
• Uncleaned graphs for samples i ... N
• Combined graph with low-coverage nodes removed
• Cleaned graphs for samples i ... N
• Reference graph
• Multi-color graph with pool, cleaned samples, reference
• Cortex_var variant calling output format
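The idea of removing low-coverage nodes from an assembly graph can be sketched with a toy de Bruijn graph. This is not Cortex's implementation, whose cleaning is considerably more involved. The function names, reads and threshold are hypothetical, and reverse complements are ignored.

```python
from collections import Counter


def kmer_coverage(reads, k):
    """Count how often each k-mer (a de Bruijn graph node) occurs across reads."""
    counts = Counter()
    for read in reads:
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts


def remove_low_coverage(counts, min_coverage):
    """Keep only nodes seen at least `min_coverage` times."""
    return {kmer: c for kmer, c in counts.items() if c >= min_coverage}


def edges(nodes):
    """Edges join k-mers whose (k-1)-base suffix and prefix overlap."""
    node_set = set(nodes)
    found = []
    for kmer in node_set:
        for base in "ACGT":
            successor = kmer[1:] + base
            if successor in node_set:
                found.append((kmer, successor))
    return sorted(found)


# Hypothetical reads: three agree, one carries a likely sequencing error.
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGTTC"]
uncleaned = kmer_coverage(reads, k=4)
cleaned = remove_low_coverage(uncleaned, min_coverage=2)
print(sorted(cleaned.items()))
print(edges(cleaned))
```

K-mers that come only from the erroneous read have coverage 1 and drop out, leaving the path supported by the majority of reads.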
Example 2: Data harmonization
Need: to perform a multidimensional correlative analysis or ML on:
• genomic,
• metabolomic,
• socioeconomic,
• phenotypic factors,
• GIS data
Data need to be:
• cleaned,
• normalized,
• brought into the same range/mean/STD,
• corrected for batch effects
All this complicated stat work needs to be automated for arbitrary, large datasets.
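The harmonization steps listed above can be sketched for a single numeric variable as follows. This is a minimal illustration with hypothetical function names, using only Python's standard `statistics` module. Per-batch mean centering is used here as a deliberately simple stand-in for batch-effect correction, which in practice calls for more careful methods.

```python
from statistics import mean, pstdev


def zscore(values):
    """Scale to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def minmax(values):
    """Scale to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def center_by_batch(values, batches):
    """Subtract each batch's mean: a very simple batch-effect adjustment."""
    by_batch = {}
    for value, batch in zip(values, batches):
        by_batch.setdefault(batch, []).append(value)
    batch_mean = {batch: mean(vs) for batch, vs in by_batch.items()}
    return [value - batch_mean[batch] for value, batch in zip(values, batches)]


# Hypothetical measurements from two batches with different offsets.
measurements = [1.0, 3.0, 10.0, 14.0]
batch_labels = ["a", "a", "b", "b"]
print(center_by_batch(measurements, batch_labels))
print(minmax(measurements))
print(zscore(measurements))
```

Automating this for arbitrary datasets mostly means applying such functions column by column, with the choice of scaling recorded so that the transformation can be reproduced.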
THANKS!