Panel: Exploring the Genome: Hurdles, Obstacles, and Solutions with the Genome Analysis Toolkit

Wednesday July 25, 11:00am, Meeting Room 12F

Exploring genomic data involves processing petabyte-scale sets of data with computationally intensive algorithms to produce a set of results. With advances in technology, the time to sequence an individual’s whole genome has gone from months to days and the number of genomes being sequenced has increased exponentially. With sequencing data so readily available, the pressure on genomic data processing software is greater than ever to be fast, correct, and inexpensive.

During this panel we will explore the current state of analyzing genomic data from sequencing to variant discovery using The Broad Institute’s Genome Analysis Toolkit (GATK) as a case study. The panel will begin with the biological basis for genomics. Then we will discuss issues of scalability, and move on to variant discovery techniques, including new machine learning methods such as variational inference and convolutional neural networks. During this discussion, we will review the progress in the field of genomics over the past decade, highlight current challenges, and look ahead to future developments.

The Broad Institute’s Genome Analysis Toolkit (GATK) is one of the most widely used software toolkits for variant discovery and genotyping of genomic data. In the latest GATK release (Version 4) we have begun leveraging the latest machine learning techniques, such as convolutional neural networks and variational inference, to discover a much wider range of genetic variation (somatic variants, copy number variants, and structural variants).

The GATK is an open-source tool released under a BSD 3-clause license. It can be found on GitHub at:

The Broad Institute of MIT and Harvard is one of the premier genome analysis centers in the world. Its goal is to improve human health by using genetics to advance the understanding of biology and the treatment of human disease, and to lay the groundwork for a new generation of therapies.

Short biographies of the GATK framework team are below.

Invited Speakers

Jonn Smith – Senior Software Engineer, The Broad Institute

Jonn Smith is a member of the GATK team. Previously he led several software integration efforts and was Principal Investigator for a project sponsored by the Defense Advanced Research Projects Agency (DARPA). Smith’s background is in Neuroscience and Computer Science. His interests include genomics, machine learning, and communication in distributed systems.

David Roazen – Principal Software Engineer, The Broad Institute

A member of the Broad Institute for the past 7 years, David currently leads a team of developers working on the Genome Analysis Toolkit (GATK), an open-source suite of tools that use statistical and machine learning methods to identify variation in genetic data. His background is in Computer Science, and his interests include genomics, parallel computing, and machine learning.

Louis Bergelson – Senior Software Engineer, The Broad Institute

Louis Bergelson has been a member of the Broad Institute for the past 5 years, where he has worked as a cancer analyst and software engineer. He now serves as a member of the GATK team. His background is in Computer Engineering and pottery.

James Emery – Associate Software Engineer, The Broad Institute

James Emery is currently a member of the GATK team. His background is in Computer Science and Chemistry. James’ interests include optimization of distributed and parallel computing and efficient big data analysis using commercial cloud computing solutions.