NIH Data Science SIG

Abstract: Driven by the continuing decline in the cost of sequencing a single human genome, «big data» sequencing studies (>10,000 samples) are becoming common in both industrial and research settings. Recently, several prominent libraries, including GATK4, ADAM, and Hail, have adopted Apache Spark to analyze data at this scale. Apache Spark is a «map-reduce»-like system that allows code written in Scala, Java, Python, R, or SQL to run in parallel across a cluster with hundreds to thousands of cores. In this talk, we will briefly explain what Apache Spark is and how it works. Then, we will look at a few genomic analyses where Apache Spark reduces latency from hours to minutes, which makes it practical for an analyst to refine an analysis interactively (human-in-the-loop). As part of these analyses, we will also explore how Apache Spark can be used to integrate other data sources (clinical measurements, imaging) with genomics data.

About the speaker: Frank Austin Nothaft is the Lead for Genomics at Databricks, where he drives the use of Apache Spark for genomics and other biomedical use cases. Prior to joining Databricks, Frank completed his doctorate in Computer Science at the UC Berkeley AMPLab, where he was a core developer of the Big Data Genomics/ADAM project and a contributor to the Toil workflow management system. Prior to joining UC Berkeley, Frank worked at Broadcom Corporation on design automation techniques for industrial-scale wireless communication chips.

Links of interest:

If you are interested in meeting with the speaker, please email seandavi@gmail.com.

Analyzing massive genomics datasets using Apache Spark