NATIONAL CENTER FOR SUPERCOMPUTING APPLICATIONS

2017 NCSA Industry Conference

HPC for Agriculture, in Genomics and elsewhere

NCSA Genomics for Industry and Academe

Hospitals

Data Analytics

Computational thermo/fluids

Multiphysics

Computational biology, genomics

Pharma

Agro

International science

UIUC science

NCSA Industry

Slide 2 of 14


NCSA Genomics: an interdisciplinary team

Team members include staff, post-docs, graduate students and undergraduates:

• Brian Bliss: Computer Science
• Matthew Kendzior: Crop Science, Plant Biotechnology, Molecular Biology
• Katherine Kendig: Fine Arts, Writing, Management
• Daniel Wickland: Informatics
• Ramshankar Venkatakrishnan: ECE, Electronics, Communications
• Weihao Ge: Physics, Biophysics
• Joshua Allen: Math, Writing, Bioinformatics
• Daniel Lanier: Math, containerization, cloud computing
• Brian Rao: Molecular Biology, Bioinformatics
• Shubham Rawlani: ECE, Information Management
• Yazhuo Zhang: Information Management
• Priya Balgi: Business analytics, Information Technology
• Sparsh Agarwal: Biotechnology
• Prakruthi Burra: Bioinformatics, Math, CS
• Dave Istanto: Crop Sciences, Bioinformatics
• Mingyu Yang: CS
• Dipro Ray: CS, Math, Statistics
• Tajesvi Bhat: CS

Slide 3 of 14


Diverse range of projects

IT projects
1. Big data network transfers
2. Parallel file packaging and archiving utilities for big data
3. Hardware benchmarking
4. Testing and validation of data analytics platform
5. Software installation, testing, profiling, optimization

Software projects
6. Optimization of R code for data cleaning in multidimensional heterogeneous datasets
7. Scaling computation of epistatic interactions in GWAS data
8. Instrumenting a high-throughput variant calling workflow on Blue Waters, iForge, AWS
9. Workflow management and automation with Swift/T, Nextflow and Cromwell/WDL

Science projects
11. Profiling mutations in cancer
12. Detecting structural variants in polyploid plants via de-novo assembly
13. Population-level variant recall for very large cohorts in Alzheimer’s disease
14. Evolution of persistence strategies in biomolecular networks

Slide 4 of 14


Part 1: IT project examples


Example 1: Software installs and config

The field is far from fully baked: there are tons of software packages, and they keep changing.

In-house software installs
• On iForge: >200 software packages

https://omictools.com/: omicX (France) owns and operates a platform for searching available bioinformatics tools
• Genomics: 26,325 tools
• Proteomics: 8,034
• Metabolomics: 3,396

Why this is successful:
• We actually practice using these software packages daily

Hail: SCALING GENETIC DATA ANALYSIS WITH APACHE SPARK: Keynote by Cotton

Hail: 0.1 (stable) and 0.2 beta (development). “We recommend that new users install 0.2 beta”

Slide 6 of 14


Example 2: workflow maintenance

Cantarel, Brandi L., et al. "MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes." Genome research 18.1 (2008): 188-196.

Mohamed, Amin Roushdy. Transcriptomics of coral-algal interactions: novel insights into the establishment of symbiosis. Diss. James Cook University, 2016.


Example 3: Hardware purchasing decisions

Problem:
• Stakeholder is planning to scale up their analysis in software X
• Current infrastructure is not adequate
• In-house scalability analysis yields contradictory and confusing results

Solution: run a project with NCSA Industry
• A domain expert studies software X
• Performs scalability analysis correctly
• Scales up the problem on iForge
• Makes a purchasing recommendation based on data

Why this is successful:
• We already have the equipment before it hits the market
• We practice performance and scalability analysis daily
• If we do not yet have something ourselves, we will ask for it and receive it
• Diverse expertise is available on hand

Investigate issues of:
• User behavior
• Software installation
• Nature of the data
• Script misconfiguration
• Filesystem
• Cluster interconnect
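As a rough illustration of the scalability analysis mentioned for this example, the sketch below computes strong-scaling speedup and parallel efficiency from wall-clock timings of a fixed-size problem. The function name `scaling_table` and the timing values are hypothetical and made up for illustration; they are not measurements from iForge or any NCSA project.

```python
def scaling_table(timings):
    """Return (cores, speedup, efficiency) rows relative to the smallest core count.

    `timings` maps core count -> wall-clock seconds for the same fixed-size problem.
    """
    base_cores = min(timings)
    base_time = timings[base_cores]
    rows = []
    for cores in sorted(timings):
        speedup = base_time / timings[cores]
        # Efficiency of 1.0 means perfect scaling relative to the baseline run.
        efficiency = speedup / (cores / base_cores)
        rows.append((cores, speedup, efficiency))
    return rows


# Hypothetical wall-clock times in seconds (illustrative only, not measured data).
example_timings = {1: 1000.0, 2: 520.0, 4: 280.0, 8: 170.0, 16: 130.0}

for cores, speedup, efficiency in scaling_table(example_timings):
    print(f"{cores:>3} cores  speedup {speedup:5.2f}  efficiency {efficiency:4.2f}")
```

A falling efficiency column is one simple signal that adding cores has stopped paying off, which is the kind of evidence a purchasing recommendation could rest on.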

Slide 8 of 14


Part 2: Software project examples


Example 1: acceleration, performance tuning

Problem:
• Software is scientifically solid, but too slow to go live in production, given the data volumes at the company

Solution: a team investigated the issues and worked to:
• Understand algorithms and language implementation
• Profile the code
• Benchmark performance on various hardware
• Implement code improvements
• Recommend better ways of configuring, scheduling and running the software
• Suggest software alternatives

Why this is successful:
• We run research projects in our domain, thus staying on top of recent developments
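The code profiling and code improvement steps in this example can be sketched with Python's standard `cProfile` and `pstats` modules. The two functions being compared are hypothetical stand-ins invented for illustration; the slides do not say which software or language was profiled.

```python
import cProfile
import io
import pstats


def slow_pairwise_sum(values):
    """Deliberately quadratic stand-in for a hot spot in real analysis code."""
    total = 0
    for a in values:
        for b in values:
            total += a * b
    return total


def fast_pairwise_sum(values):
    """Same result in linear time, since sum_i sum_j a_i * a_j == (sum a)^2."""
    s = sum(values)
    return s * s


def profile(func, *args):
    """Run func under cProfile; return its result and a text report of the top entries."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
    return result, buffer.getvalue()


data = list(range(300))
slow_result, slow_report = profile(slow_pairwise_sum, data)
fast_result, fast_report = profile(fast_pairwise_sum, data)

# An optimization is only useful if it preserves the answer.
assert slow_result == fast_result
print(slow_report)
```

The pattern is the point here: measure first, change the algorithm, then confirm the result is unchanged before comparing timings.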

Slide 10 of 14


Example 2: novel, optimized applications

Fast, epistatic GWAS models in large datasets

Complex continuous traits require complex genotype/phenotype models

Stat research:
• Several approaches for analysis of complex traits, e.g. SEMS and LASSO
• Include biological information to focus the model search
• Develop methods for reducing parameter search space

Code work:
• Generate algorithm and memory models for SEMS and LASSO
• Determine which computational components can utilize GPU vs. CPU
• Identify when it is appropriate to use SPARK vs. MPI
• Compare, test, improve software components
• Develop plugins for data pre/post-processing
• Containerize this application

Technologies: SPARK/Scala, C/MPI/Hadoop/TensorFlow
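To make the LASSO approach named above concrete, here is a minimal sketch of an L1-penalised regression over SNP main effects plus all pairwise (epistatic) interaction terms, fitted by cyclic coordinate descent in plain Python. It is not the team's implementation: the function names, the simulated genotypes, the effect size and the penalty value are all assumptions chosen for illustration. A production version would use the SPARK or MPI tooling listed for this example.

```python
import random
from itertools import combinations


def add_pairwise_interactions(genotypes):
    """Append the product g_i * g_j for every SNP pair to each sample row."""
    n_snps = len(genotypes[0])
    pairs = list(combinations(range(n_snps), 2))
    rows = [row + [row[i] * row[j] for i, j in pairs] for row in genotypes]
    names = [f"snp{i}" for i in range(n_snps)] + [f"snp{i}*snp{j}" for i, j in pairs]
    return rows, names


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; this is what zeroes out weak effects."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(features, target, penalty, n_sweeps=200):
    """Minimise (1/2n)*||y - b0 - Xb||^2 + penalty*||b||_1 by cyclic coordinate descent."""
    n, p = len(features), len(features[0])
    columns = [[row[j] for row in features] for j in range(p)]
    col_sq = [sum(v * v for v in col) / n for col in columns]
    coefs = [0.0] * p
    intercept = sum(target) / n
    residual = [y - intercept for y in target]
    for _ in range(n_sweeps):
        # The intercept is not penalised: move it by the mean of the residual.
        shift = sum(residual) / n
        intercept += shift
        residual = [r - shift for r in residual]
        for j, col in enumerate(columns):
            if col_sq[j] == 0.0:
                continue
            # Covariance of column j with the residual that excludes its own effect.
            rho = sum(c * r for c, r in zip(col, residual)) / n + col_sq[j] * coefs[j]
            new_coef = soft_threshold(rho, penalty) / col_sq[j]
            delta = new_coef - coefs[j]
            if delta != 0.0:
                residual = [r - delta * c for r, c in zip(residual, col)]
                coefs[j] = new_coef
    return intercept, coefs


# Simulated data (illustrative only): 200 samples, 6 SNPs coded 0/1/2.
rng = random.Random(0)
genotypes = [[rng.choice([0, 1, 2]) for _ in range(6)] for _ in range(200)]
# Hypothetical trait driven by one epistatic pair (snp1 x snp3) plus noise.
phenotype = [1.5 * g[1] * g[3] + rng.gauss(0.0, 0.5) for g in genotypes]

features, names = add_pairwise_interactions(genotypes)
intercept, coefs = lasso_coordinate_descent(features, phenotype, penalty=0.1)
for name, coef in zip(names, coefs):
    if coef != 0.0:
        print(f"{name:>10}  {coef:+.3f}")
```

The number of interaction columns grows quadratically with the number of SNPs, which is why the slide's points about reducing the parameter search space and choosing between SPARK and MPI matter at real GWAS scale.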

Slide 11 of 14


Part 3: Science project examples


Example 1: Structural variation & HLHS

Structural variant calling
Variant calling by de-novo genome assembly and graph comparison
Cortex, Cortex-var, Popins, BASIL/ANISE

• Sequencing data harmonization: hundreds of samples
• Assembly and variant calling workflow automation
• Development of the post-processing steps
• Annotation, validation and statistical analysis of results

Slide 13 of 14

[Figure: workflow diagram. Panel labels: Sample i ... N uncleaned graphs; Combined graph with low coverage nodes removed; Sample i ... N cleaned graphs; Reference graph; Multi-color graph with pool, cleaned samples, reference; Cortex_var variant calling output format]
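The idea behind removing low coverage nodes and comparing sample graphs against a reference can be illustrated with a toy k-mer example. This is a heavily simplified sketch and not how Cortex_var is implemented: real tools build colored de Bruijn graphs and call variants from graph structure. The sequences, the value of k and the coverage threshold are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer across reads; each distinct k-mer is a node in a de Bruijn graph."""
    counts = Counter()
    for read in reads:
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts


def remove_low_coverage(counts, min_coverage):
    """Keep only k-mers seen at least min_coverage times (likely errors are dropped)."""
    return {kmer: n for kmer, n in counts.items() if n >= min_coverage}


def private_kmers(sample_kmers, reference_kmers):
    """k-mers present in the sample but absent from the reference: candidate variant sites."""
    return set(sample_kmers) - set(reference_kmers)


K = 4
reference = "ACGTTGCAAGCT"
# Hypothetical sample: three reads carrying a C->G change, one read with a sequencing error.
reads = ["ACGTTGGAAGCT"] * 3 + ["ACGTTGGATGCT"]

uncleaned = count_kmers(reads, K)
cleaned = remove_low_coverage(uncleaned, min_coverage=2)
reference_kmers = count_kmers([reference], K)

print(sorted(private_kmers(cleaned, reference_kmers)))
```

Cleaning before comparison matters: without the coverage filter, the k-mers created by the single erroneous read would also show up as differences from the reference.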


Example 2: Data harmonization

Need: to perform a multidimensional correlative analysis or ML on:
• genomic,
• metabolomic,
• socioeconomic,
• phenotypic factors,
• GIS data

Data need to be:
• cleaned
• normalized
• brought into the same range/mean/STD
• corrected for batch effects

All this complicated stat work needs to be automated for arbitrary, large datasets.
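A minimal sketch of what bringing variables into the same range, mean and standard deviation can look like, using only Python's standard library. The helper names are hypothetical, and the per-batch mean centering shown here is a deliberately crude stand-in for proper batch-effect correction, which typically needs a statistical model.

```python
from statistics import mean, pstdev


def zscore(values):
    """Rescale to mean 0 and standard deviation 1 (population SD)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Rescale linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def center_by_batch(values, batches):
    """Subtract each batch's own mean: a location-only batch adjustment."""
    by_batch = {}
    for v, b in zip(values, batches):
        by_batch.setdefault(b, []).append(v)
    batch_means = {b: mean(vs) for b, vs in by_batch.items()}
    return [v - batch_means[b] for v, b in zip(values, batches)]


# Hypothetical measurements from two batches with different baselines.
measurements = [1.0, 3.0, 10.0, 14.0]
batch_labels = ["a", "a", "b", "b"]

centered = center_by_batch(measurements, batch_labels)
print(centered)
print(zscore(centered))
print(min_max(measurements))
```

Applying the same small, tested functions to every column is one way to automate this work for arbitrary datasets instead of repeating it by hand.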

Slide 14 of 14

THANKS!