NCI Containers and Workflows Interest Group Webinar Series (Past Webinars)

Past Webinars

Friday, September 9, 2022, 3 to 4 pm EST


Dr. Travis Zack, MD, PhD

UCSF Information Commons, Clinical Use Cases and Models - Session II of II
Dr. Travis Zack, MD, PhD, Oncology Fellow
Bakar Computational Health Sciences Institute, UCSF

Travis Zack is interested in combining his backgrounds in computational genomics and clinical oncology with machine learning methods on real-world oncologic treatment data to better understand and predict treatment efficacy and toxicity. He graduated with a PhD in Biophysics from Harvard University and continued there to earn an MD in the Health Sciences and Technology program. He completed his Internal Medicine residency at UCSF and is now in his Oncology fellowship at the same institution.

 

Abstract:
Recently, there have been increasing efforts to use machine learning on large datasets obtained from electronic medical records (EMR) to inform and improve clinical care, yet the standardization and organization of these data have so far limited their utility in oncology. The number and complexity of cancer treatment options continue to expand, and with frequent protocol modifications due to patient intolerance, accurate and properly controlled comparisons and cohort identification across providers and institutions can be challenging. Here we expand on previous methods to leverage HemOnc.org, a physician-curated comprehensive database of oncology treatment protocols, to create a database of 5,146 regimens across 146 hematology and oncology diseases that includes information about drug names, optimal dosages, administration days, cycle lengths, and number of cycles within a complete treatment. We use rule-based natural language processing to convert this text database into a structured database of anti-neoplastic regimens. We have also developed a convolutional time-series maximum-likelihood estimation algorithm to identify the most likely regimen a patient is undergoing at each point in their treatment history, as well as modifications in therapy. We illustrate the utility of these tools to analyze and compare treatments and treatment modifications within UCSF and across 5 UC campuses.
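The regimen-identification idea can be illustrated with a toy sketch. Everything below (the regimen definitions, the candidate start days, and the hit-fraction score) is a hypothetical simplification for intuition, not the authors' actual algorithm:

```python
# Hypothetical structured regimens: for each drug, the administration days
# within one cycle. Illustrative only; these are not real treatment protocols.
REGIMENS = {
    "regimen_A": {"cycle_days": 21, "doses": {"drugX": [1], "drugY": [1, 8]}},
    "regimen_B": {"cycle_days": 28, "doses": {"drugX": [1, 15]}},
}

def score(regimen, history, start_day):
    """Fraction of expected administrations found in the patient's history
    when the regimen template is anchored at start_day. A crude stand-in
    for a maximum-likelihood score."""
    expected = [
        (drug, start_day + day - 1)
        for drug, days in regimen["doses"].items()
        for day in days
    ]
    return sum((drug, day) in history for drug, day in expected) / len(expected)

def best_regimen(history, max_start=30):
    """Slide every template over candidate start days; return the best match
    as a (score, regimen_name) pair."""
    return max(
        (score(reg, history, start), name)
        for name, reg in REGIMENS.items()
        for start in range(1, max_start + 1)
    )
```

For a toy history with drugX on day 1 and drugY on days 1 and 8, this sketch ranks regimen_A highest with a perfect score of 1.0.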

Presentation Recording

Resource Files


Friday, June 10, 2022, 3 to 4 pm EST


Dr. Sharat Israni, PhD

UCSF Information and Cancer Commons: A Multi-factor Platform for Deep Integrative Biomedical Research and Precision Medicine - Session I of II
Dr. Sharat Israni, PhD, Executive Director and CTO
Bakar Computational Health Sciences Institute, UCSF

Sharat Israni, PhD, is Executive Director & CTO at UCSF’s Bakar Computational Health Sciences Institute, which is building UCSF’s Information Commons next-generation research computing capability. Previously, he was Executive Director, Data Science, at Stanford Medicine. A long-serving technology executive, Sharat led teams that pioneered the use of “Big Data.” He served as SVP/VP of Data at Yahoo! (1999-2008) and Intuit (2010-13), amongst the foremost companies that used data science and AI to re-invent their products. He led Digital Media systems for broadcast/interactive TV at Silicon Graphics, and Data teams at IBM and HP. Sharat was PI for the NITRD Open Knowledge Network workshop in 2017, hosted by NSF and NIH, as well as other workshops and grants. A graduate of IIT Kanpur and the University of Wisconsin-Madison, Sharat is a frequent peer reviewer of journal articles and grant proposals.

 

Dr. Gundolf Schenk, PhD

Dr. Gundolf Schenk, PhD, Principal Data Scientist
Bakar Computational Health Sciences Institute, UCSF

Gundolf Schenk, PhD, is interested in modeling biological data and providing informatics tools for biomedical research. His background comprises structural bioinformatics, X-ray scattering experiments of biomolecules in solution, and machine learning. He graduated from the Universities of Hamburg and Goettingen, Germany, and continued theoretical and experimental method development and data analysis for biomolecular research at the European Molecular Biology Laboratory and at Stanford University. At BCHSI he applies these skills to the integration of clinical notes and the automatic detection of structure within various modalities of biomedical data.

Abstract:
UCSF Information Commons (IC) is a research data platform geared toward the multifactor inquiry, at scale, that AI and precision medicine demand of biomedical research today. Spanning over 5.5 million UCSF patients since the 1980s, IC brings together their clinical records, clinical notes, radiology images of most modalities, and clinical genomic profiles. These are all certifiably deidentified, so researchers do not need further IRB approval to explore the data and shape their research. IC is built entirely on open-source software, with rich support for deep data science and AI. Further, it is compatible with the UC Data Discovery Platform, which makes clinical records available across the six UC medical centers for wide inquiry. We present examples of how multifactor research results in provably richer findings.

UCSF Cancer Commons, built on the above IC, aims to support data-driven cancer research with best-of-breed technology. It includes cancer-specific patient data in the above IC formats, plus pathology imaging and the Cancer Registry. It makes these data accessible for exploration, cohort building, statistical analyses, and AI model building via user-friendly tools, such as UCSF’s local installation of cBioPortal, and via programming tools such as Jupyter and RStudio, leveraging distributed computing technologies like Apache Spark and Facebook’s Presto in the AWS cloud and on on-premises HPC.

In the next edition of this CWIG series, a practicing oncologist will present their use of the IC for high-dimensional oncology studies.

Presentation Recording

Resource Links:
Resource Files


Friday, May 13, 2022, 3 to 4 pm EST


Dr. W. Lee Pang

Scalable and Reproducible Genomics Data Analysis on AWS
Dr. W. Lee Pang, Principal Developer Advocate
Amazon Web Services, HealthAI

Dr. Pang is a Principal Developer Advocate with the Health AI services team at AWS, where his focus is optimizing and elevating the user experience of performing large-scale bioinformatics and genomics data analysis in the cloud. He has a PhD in Bioengineering and over a decade of hands-on experience as a research scientist and software engineer in bioinformatics, computational systems biology, and data science, developing tools ranging from high-throughput pipelines for *omics data processing to compliant software for clinical data capture and analysis.

 

Abstract:
The volume of raw genomics data is growing rapidly, and some estimate that the amount of data worldwide is on the order of exabytes. Processing such mountains of information into science-ready formats like variant calls, expression matrices, etc., is nontrivial and requires workflow architectures that can scale in both performance and cost efficiency. Similarly, with more worldwide interest in genomics from individual researchers to global consortia, computational methods need to be portable, easy to share, and easy to deploy. AWS offers practically unlimited compute capacity, elasticity, and flexibility to process enormous amounts of genomics data cost-effectively and on demand. With its global footprint, AWS enables the rapid deployment of reproducible computing infrastructure worldwide. In this talk, we’ll highlight the core patterns, architectures, and tooling used by many genomics customers who are leveraging AWS to tackle their biggest genomics data processing challenges. We’ll also highlight ways that AWS facilitates computational portability and reproducible research, accelerating advances in genomics worldwide.

Presentation Recording

Resource Files


Friday, April 8, 2022, 3 to 4 pm EST


James Eddy

Getting to know the GA4GH Workflow Execution Service (WES) API
James Eddy, Director of Informatics & Biocomputing
Sage Bionetworks

Dr. Eddy is the Director of Informatics & Biocomputing at Sage Bionetworks. He is a computational biologist and cancer researcher with experience in developing and managing high-throughput molecular databases, bioinformatics pipelines, and analytical workflows. He and his team serve as key technical contributors on the data coordinating center for several large consortia, including the NCI Cancer Systems Biology Consortium (CSBC) and the NCI Human Tumor Atlas Network (HTAN). Dr. Eddy is well versed in best practices for reproducible data science in biomedical research. He has worked closely with the Global Alliance for Genomics & Health (GA4GH) to develop standards for data sharing and analysis, including serving as champion for the Workflow Execution Service (WES) API and as the Cloud Work Stream representative on the Technical Alignment Subcommittee (TASC).

 

Ian Fore, D.Phil.

Ian Fore, D.Phil., Senior Biomedical Informatics Program Manager
National Cancer Institute, Center for Biomedical Informatics and Information Technology (CBIIT)

Dr. Ian Fore is a senior biomedical informatics program manager at CBIIT, specializing in integrating data from both basic and clinical science. His current work focuses on the core components of the Cancer Research Data Commons (CRDC) that enable data integration, including subject and specimen data, identifiers, and data aggregation. Beyond the technology, interoperability is also about how humans work together, and Dr. Fore leads CBIIT’s One Team One Mission program to enhance collaboration within and between projects.

Abstract:
The Global Alliance for Genomics & Health (GA4GH) Cloud Work Stream focuses on API standards (and implementations from partner Driver Projects and other community members) that make it easier to “send the algorithms to the data.” Developed collaboratively by bioinformaticians, cloud and workflow platform engineers, and other stakeholders in the GA4GH community, the Workflow Execution Service (WES) API provides a standard way for users to submit workflow requests to workflow execution systems and to monitor their execution. This API lets users run a single workflow (e.g., a CWL, WDL, or Nextflow formatted workflow) on multiple different platforms, clouds, and environments. We will provide an overview of the existing functionality described in the WES API standard, and of how WES fits in with other standards from the GA4GH Cloud Work Stream like the Tool Registry Service (TRS) and Data Repository Service (DRS) APIs. We will also present some current use cases and implementations of WES and review ongoing development. We hope that this introduction to the WES API will encourage feedback and contributions from CWIG members.
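As a concrete illustration of the request shape, the WES standard's RunWorkflow operation accepts multipart form fields such as `workflow_url`, `workflow_type`, `workflow_type_version`, and `workflow_params` (field names per the GA4GH WES schema; the endpoint and workflow URLs below are placeholders). A minimal sketch that only assembles the fields, without making a network call:

```python
import json

WES_BASE = "https://wes.example.org/ga4gh/wes/v1"  # placeholder WES endpoint

def build_run_request(workflow_url, workflow_type, type_version, params):
    """Assemble the multipart form fields for a WES RunWorkflow request.
    Field names follow the GA4GH WES 1.x schema."""
    return {
        "workflow_url": workflow_url,
        "workflow_type": workflow_type,         # e.g. "CWL", "WDL", "NFL"
        "workflow_type_version": type_version,  # e.g. "v1.2" for CWL
        "workflow_params": json.dumps(params),  # JSON-encoded workflow inputs
    }

fields = build_run_request(
    "https://example.org/workflows/align.cwl",  # hypothetical workflow
    "CWL",
    "v1.2",
    {"reads": {"class": "File", "path": "sample.fastq"}},
)
# In a real client, these fields would be POSTed to f"{WES_BASE}/runs";
# the returned run_id is then polled at f"{WES_BASE}/runs/{run_id}/status".
```

The same field dictionary works regardless of which WES implementation sits behind the endpoint, which is the portability point the abstract makes.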

Presentation Recording

Resource Files


Friday, March 11, 2022, 3 to 4 pm EST


Dr. Enis Afgan

Galaxy and Software Containers: A Recipe for Success
Dr. Enis Afgan, Research Scientist
Johns Hopkins University

Enis Afgan is a research scientist in the Schatz Lab at Johns Hopkins University, working on the Galaxy and AnVIL projects. His area of focus has been applying distributed computing techniques to making biomedical computing more accessible. He has been working with cloud computing technologies and software containers to deliver scalable software services for researchers.

 

Abstract:
Galaxy (galaxyproject.org) is a popular tool and workflow execution platform used by thousands of researchers. How does it work? How does it scale to expose thousands of popular tools? And how does it handle millions of jobs per month? We will go over the system architecture of Galaxy and all the components it needs to function. We will explore how to install Galaxy locally for development and production use cases. We will also showcase how Galaxy is deployed at usegalaxy.org and in a FedRAMP-managed environment at AnVIL. Galaxy is increasingly making use of software containers to ensure consistency and promote software portability. We will take a detailed look at the latest available model for deploying Galaxy: the Galaxy Helm chart. The Galaxy Helm chart abstracts the mechanics of deploying Galaxy into a single, highly configurable package that handles installation and management of all the software services required to run Galaxy. This will include a look at compute infrastructure and software components, as well as tools and reference data. Overall, this will be an expository talk about what goes on behind the scenes to make a Galaxy installation function at scale.

Presentation Recording

Resource Files


Friday, February 11, 2022, 3 to 4 pm EST


Jeremy Goecks, Ph.D.

The Galaxy Platform for Accessible, Reproducible, and Scalable Biomedical Data Science
Jeremy Goecks, Ph.D., Associate Professor, Department of Biomedical Engineering
Oregon Health & Science University

Dr. Jeremy Goecks is an Associate Professor of Biomedical Engineering and Section Head for Cancer Data Science at Oregon Health & Science University. Dr. Goecks has leadership positions in several national and international biomedical consortia that span from generation of single-cell tumor atlases to the development of cloud-scale computational infrastructure for biomedical data. He is principal investigator for an NCI Cancer Moonshot Center in the Human Tumor Atlas Network (HTAN; https://humantumoratlas.org/), where he leads the generation and analysis of single-cell omics and imaging datasets from longitudinal biopsies. He is a principal investigator for the Galaxy platform (https://galaxyproject.org), a computational workbench used daily by thousands of scientists across the world, and is a key contributor to NHGRI’s AnVIL cloud-based data commons.

 

Abstract:
Started in 2005, the Galaxy Project (https://galaxyproject.org/) has worked to solve key issues plaguing modern data-intensive biomedicine—the ability of researchers to access cutting-edge analysis methods, to precisely reproduce and share complex computational analyses, and to perform large-scale analyses across many datasets. Galaxy has become one of the largest and most widely used open-source platforms for biomedical data science. Promoting openness and collaboration in all facets of the project has enabled Galaxy to build a vibrant world-wide community of scientist users, software developers, system engineers, and educators who continuously contribute new software features, add the latest analysis tools, adopt modern infrastructure such as package managers and software containers, author training materials, and lead research and training workshops. In this talk, I will share an overview of the Galaxy Project and highlight several recent applications of Galaxy to cancer research, including the use of machine learning to predict therapeutic response and analysis of single-cell spatial omics to understand tumor spatial biology.

Presentation Recording

Resource Links:
Resource Files


Friday, January 14, 2022, 3 to 4 pm EST


Melissa Cline, Ph.D.

Federated Analysis for Cancer Variant Interpretation
Melissa Cline, Ph.D., Associate Research Scientist
UC Santa Cruz Genomics Institute

Dr. Melissa Cline is an Associate Research Scientist at the UC Santa Cruz Genomics Institute. She is the Program Manager of the BRCA Challenge, a consortium launched by the Global Alliance for Genomics and Health to pioneer methods for privacy-preserving data sharing with the goal of expediting the interpretation of genetic variation. She leads the development of BRCA Exchange, the world’s largest public repository of BRCA variation, and of federated analysis methods for germline variant interpretation. Dr. Cline received her Ph.D. from UC Santa Cruz in 2000. Subsequently, she worked as a Senior Research Scientist at Affymetrix before returning to academia as a postdoctoral researcher at the Pasteur Institute and UC Santa Cruz, ultimately joining the Genomics Institute in 2009.

 

Abstract:
Pathogenic variation in BRCA1 and BRCA2 is a major risk factor for cancers including breast, ovarian, pancreatic and prostate. Genetic testing is empowering individuals and their health care providers to understand and better manage their heritable risk of cancer, but is limited by the many gaps in our knowledge of human genetic variation. These gaps, termed “Variants of Uncertain Significance” (VUS), are rare genetic variants for which there is insufficient evidence to determine their clinical impact. Variant interpretation frequently requires some amount of patient-level data: clinical data describing the incidence of disease in patients with the VUS. Due to their sensitive nature, these data are mostly siloed. In 2015, the BRCA Challenge was launched to address this problem by assembling a team of experts to develop new approaches to share variant data on BRCA1 and BRCA2, as exemplars for other genes and heritable disorders. One promising approach is federated analysis. By sharing containerized analysis software with institutions that hold patient-level data, we have been able to analyze this data in situ, without the need to share the patient-level data directly, generating variant-level summaries that are less sensitive and can be shared more easily, and yet contain sufficient information to further variant interpretation. We will describe our experience with this approach, as well as future directions in container technology that will encourage greater variant data sharing.
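The summary-not-data idea behind this federated approach can be sketched with a toy aggregation. The record layout, field names, and example variants below are hypothetical illustrations, not the project's actual software:

```python
from collections import defaultdict

def variant_level_summary(patient_records):
    """Reduce patient-level observations to per-variant counts.
    In a federated setting, only these aggregate counts would leave the
    data-holding institution; the patient-level rows stay in place."""
    summary = defaultdict(lambda: {"carriers": 0, "affected": 0})
    for rec in patient_records:
        entry = summary[rec["variant"]]
        entry["carriers"] += 1
        entry["affected"] += rec["has_disease"]  # bool counts as 0 or 1
    return dict(summary)

# Toy patient-level data that never leaves the institution:
records = [
    {"variant": "BRCA1:c.68_69del", "has_disease": True},
    {"variant": "BRCA1:c.68_69del", "has_disease": False},
    {"variant": "BRCA2:c.5946del",  "has_disease": True},
]
summary = variant_level_summary(records)
```

Shipping this function inside a container to each institution, then pooling the returned summaries, is the shape of the in-situ analysis the abstract describes.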

Presentation Recording

Resource Files


Friday, November 12, 2021, 3 to 4 pm EST


Jeffrey Grover, Ph.D.

Developing Scalable Bioinformatics Workflows on the Cancer Genomics Cloud
Jeffrey Grover, Ph.D., Genomics Scientist
Seven Bridges

Jeff is a bioinformatics scientist and received his Ph.D. in molecular and cellular biology from the University of Arizona in 2020. He is experienced in integrating results from multi-omics data analysis, data visualization, and in bioinformatics workflow automation. At Seven Bridges Jeff works to expand our bioinformatics workflow offerings, provide technical expertise across public programs, and prototype internal technical solutions. In his free time he’s usually hacking in Python or R, traveling to a national park, or keeping his guitar collection in top playing shape.

 

Abstract:
The Cancer Genomics Cloud (CGC) is a cloud-based bioinformatics ecosystem supported by the National Cancer Institute (NCI). The CGC allows users to run computational workflows defined in the Common Workflow Language (CWL) on a wealth of large datasets, in place, in the cloud. Users may also upload their own data and take advantage of the scalability of cloud computing for their data analysis. In addition to the hundreds of publicly available bioinformatics workflows in the CGC Public Apps Gallery, users can employ a variety of methods to develop their own. These include an integrated graphical user interface for creating workflows, as well as an ecosystem of tools enabling local development and automated deployment of workflows to the CGC. We will detail how to develop efficient workflows for the CGC and how to apply best practices such as version control and continuous integration, using publicly available tools developed by Seven Bridges.
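For readers new to CWL, a tool description is a YAML or JSON document with a small required core. A minimal sketch of that structure, expressed here as a Python dict and serialized to JSON (the wrapped samtools command is illustrative, not a CGC app):

```python
import json

# Minimal CWL v1.2 CommandLineTool, expressed as a Python dict.
# The structure follows the CWL standard; the wrapped command is illustrative.
flagstat_tool = {
    "cwlVersion": "v1.2",
    "class": "CommandLineTool",
    "baseCommand": ["samtools", "flagstat"],
    "inputs": {
        "alignments": {"type": "File", "inputBinding": {"position": 1}},
    },
    "outputs": {"stats": {"type": "stdout"}},
    "stdout": "flagstat.txt",
}

# CWL accepts JSON as well as YAML, so this is ready to save as a .cwl file.
cwl_json = json.dumps(flagstat_tool, indent=2)
```

A document of this shape is what the CGC's graphical editor produces under the hood, and what the local-development tooling versions and deploys.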

Presentation Recording

Resource Links:
Resource Files


Friday, October 8, 2021, 3 to 4 pm EST


Dr. Pjotr Prins

Reproducible FAIR+ workflows and the CCWL
Dr. Pjotr Prins, Assistant Professor
University of Tennessee Health Science Center

Dr. Prins is a bioinformatician at large & assistant (coding) professor at the Department of Genetics, Genomics, and Informatics at the University of Tennessee Health Science Center. He is the director of GeneNetwork.org and, notwithstanding collecting appointments at various institutes (see https://thebird.nl/), he is a dedicated free software and hardware champion who writes critical software for genetics and pangenomics.

 

Arun Isaac

Arun Isaac, PhD Student
University of Tennessee Health Science Center

Arun Isaac is a Ph.D. student at the Department of Computational and Data Sciences, Indian Institute of Science. He spends much of his free time contributing to free software and hacking on Emacs Lisp and Scheme. He contributes regularly to GNU Guix and is the author of guile-email, an email parser for Guile. He also contributes to Tamil localization.

Abstract:
FAIR principles focus on data and fail to account for reproducible and on-demand workflows. In this talk, we will explore FAIR+ (Findable, Accessible, Interoperable, Reusable, and Computable) in the context of GeneNetwork.org, one of the oldest web resources in bioinformatics. With GeneNetwork we are realizing reproducible software deployment, building on free and open-source software including GNU Guix and containers. We are also building scalable workflows that are triggered on demand to run in the cloud or on bare metal, and we created our own HPC to run GNU Guix-based pangenomics. We will present our infrastructure, including a prototype COVID-19 cloud setup, with a hands-on introduction to GNU Guix and the Concise CWL, a CWL generator whose workflows look like shell scripts but can be reasoned about and are far more portable.

The Common Workflow Language (CWL) is an open standard for describing analysis workflows and tools in a way that makes them portable and scalable across a variety of software and hardware environments, from workstations to cluster, cloud, and high-performance computing (HPC) environments.

Guix is an advanced distribution of the GNU operating system developed by the GNU Project which respects the freedom of computer users. Guix supports transactional upgrades and roll-backs, unprivileged package management, and more. When used as a standalone distribution, Guix supports declarative system configuration for transparent and reproducible operating systems.

The Concise Common Workflow Language (CCWL) is a concise syntax to express CWL workflows. It is implemented as an Embedded Domain Specific Language (EDSL) in the Scheme programming language, a minimalist dialect of the Lisp family of programming languages.

Presentation Recording

Resource Links:
Resource Files


Friday, September 10, 2021, 3 to 4 pm EST


Junjun Zhang

WFPM: A novel WorkFlow Package Manager to enable collaborative bioinformatics workflow development
Junjun Zhang, Senior Bioinformatics Manager
Ontario Institute for Cancer Research

Junjun Zhang has over 19 years of experience designing and building innovative solutions for cancer genomics and bioinformatics. He has led NCI’s GDC data model development, co-led the development of the International Cancer Genome Consortium Data Portal, and developed the central metadata tracking system for the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes project. Recently, he led the establishment of best practices for ICGC ARGO uniform workflow development and developed the first full-featured workflow package manager tool.

 

Abstract:
Recent advances in bioinformatics workflow development solutions have focused on addressing reproducibility and portability but significantly lag behind in supporting component reuse and sharing, which results in poor adoption of the widely practiced Don’t Repeat Yourself (DRY) principle and the divide-and-conquer strategy.

To address these limitations, the International Cancer Genome Consortium Accelerating Research in Genomic Oncology (ICGC ARGO) initiative (https://www.icgc-argo.org) has adopted a modular approach in which a series of "best practice" genome analysis workflows are encapsulated in well-defined packages which are then incorporated into higher-level workflows. Following this approach, we have developed five production workflows that extensively reuse component packages. This flexible architecture enables ARGO developers spread across the globe to collaboratively build its uniform workflows, with different developers focusing on different components. All ARGO component packages are reusable by the general bioinformatics community, who can import them as modules to build their own workflows.

Recently, we have developed an open-source command-line interface (CLI) tool called WorkFlow Package Manager (WFPM) CLI that provides assistance throughout the entire workflow development lifecycle to implement best practices and the aforementioned modular approach. With a highly streamlined process and automation of template code generation, continuous integration testing, and releasing, WFPM CLI significantly lowers the barriers for users to develop standard, reusable workflow packages. WFPM CLI source code: https://github.com/icgc-argo/wfpm; documentation: https://wfpm.readthedocs.io

Presentation Recording

Resource Links:
Resource Files