NCI Containers and Workflows Interest Group Webinar Series (Past Webinars)
Friday, September 9, 2022, 3 to 4 pm EDT
UCSF Information Commons, Clinical Use Cases and Models - Session II of II
Dr. Travis Zack, MD, PhD, Oncology Fellow
Bakar Computational Health Sciences Institute, UCSF
Recently, there have been increasing efforts to use machine learning on large datasets obtained from electronic medical records (EMRs) to inform and improve clinical care, yet the standardization and organization of these data have so far limited their utility in oncology. The options and complexity of cancer treatment continue to expand, and with frequent protocol modifications due to patient intolerance, accurate and properly controlled comparisons and cohort identification across providers and institutions can be challenging. Here we expand on previous methods [1] to leverage HemOnc.org, a physician-curated comprehensive database of oncology treatment protocols, to create a database of 5,146 regimens across 146 hematology and oncology diseases that includes drug names, optimal dosages, administration days, cycle lengths, and the number of cycles in a complete treatment. We use rule-based natural language processing to convert this text database into a structured database of antineoplastic regimens. We have also developed a convolutional time-series maximum likelihood estimation algorithm to identify the most likely regimen a patient is undergoing at each point in their treatment history, as well as modifications in therapy. We illustrate the utility of these tools by analyzing and comparing treatments and treatment modifications within UCSF and across five UC campuses.
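For readers unfamiliar with this kind of regimen matching, the following is a toy sketch, not the authors' algorithm, of scoring an observed sequence of (drug, day) administration events against structured regimen templates and picking the best match. The regimen contents shown are simplified and for illustration only.

```python
# Toy illustration (not the authors' algorithm) of matching an observed
# sequence of drug administrations against structured regimen templates.
# Regimen contents below are simplified and for demonstration only.

# Hypothetical structured regimens: drug -> administration days within one cycle
REGIMENS = {
    "FOLFOX":  {"oxaliplatin": [1], "fluorouracil": [1, 2], "leucovorin": [1]},
    "FOLFIRI": {"irinotecan": [1],  "fluorouracil": [1, 2], "leucovorin": [1]},
}

def score(observed, regimen):
    """Fraction of expected (drug, day) events observed, penalizing extras."""
    expected = {(drug, day) for drug, days in regimen.items() for day in days}
    obs = set(observed)
    hits = len(expected & obs)
    return hits / len(expected) - 0.5 * len(obs - expected) / max(len(obs), 1)

def best_regimen(observed):
    """Return the template that best explains the observed events."""
    return max(REGIMENS, key=lambda name: score(observed, REGIMENS[name]))
```

A real implementation would score over sliding windows of the full treatment timeline and use a probabilistic likelihood rather than a simple overlap fraction, but the matching idea is the same.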
Friday, June 10, 2022, 3 to 4 pm EDT
UCSF Information and Cancer Commons: A Multi-factor Platform for Deep Integrative Biomedical Research and Precision Medicine - Session I of II
Dr. Sharat Israni, PhD, Executive Director and CTO
Bakar Computational Health Sciences Institute, UCSF
Dr. Gundolf Schenk, PhD, Principal Data Scientist
Bakar Computational Health Sciences Institute, UCSF
UCSF Information Commons (IC) is a research data platform geared toward the multifactor inquiry, at the scale of AI and precision medicine, that marks biomedical research today. Spanning more than 5.5 million UCSF patients since the 1980s, IC brings together their clinical records, clinical notes, radiology images of most modalities, and clinical genomic profiles. These are all certifiably deidentified, so researchers do not need further IRB approval to explore the data and shape their research. IC is built entirely on open-source software, with rich support for deep data science and AI. Further, it is compatible with the UC Data Discovery Platform, which makes clinical records available across the six UC medical centers for wide inquiry. We present examples of how multifactor research results in provably richer findings.
UCSF Cancer Commons, built on the above IC, aims to support data-driven cancer research with best-of-breed technology. It includes cancer-specific patient data in the above IC formats, plus pathology imaging and the Cancer Registry. It makes these data accessible for exploration, cohort building, statistical analyses, and AI model building via user-friendly tools, such as UCSF's local installation of cBioPortal, and via programming tools such as Jupyter and RStudio, leveraging distributed computing technologies like Apache Spark and Presto in the AWS cloud and on-premises HPC.
In the next edition of this CWIG, a practicing oncologist will present their use of the IC for very large, high-dimensional oncology studies.
Friday, May 13, 2022, 3 to 4 pm EDT
Scalable and Reproducible Genomics Data Analysis on AWS
Dr. W. Lee Pang, Principal Developer Advocate
Amazon Web Services, HealthAI
The volume of raw genomics data is growing rapidly, and some estimate that the amount of data worldwide is on the order of exabytes. Processing such mountains of information into science-ready formats such as variant calls and expression matrices is nontrivial and requires workflow architectures that can scale in both performance and cost efficiency. Similarly, with growing worldwide interest in genomics, from individual researchers to global consortia, computational methods need to be portable, easy to share, and easy to deploy. AWS offers practically unlimited compute capacity, elasticity, and flexibility to process enormous amounts of genomics data cost-effectively and on demand. With its global footprint, AWS enables the rapid deployment of reproducible computing infrastructure worldwide. In this talk, we’ll highlight the core patterns, architectures, and tooling used by many genomics customers who are leveraging AWS to tackle their biggest genomics data processing challenges. We’ll also highlight ways that AWS facilitates computational portability and reproducible research, accelerating advances in genomics worldwide.
Friday, April 8, 2022, 3 to 4 pm EDT
Getting to know the GA4GH Workflow Execution Service (WES) API
James Eddy, Director of Informatics & Biocomputing
Ian Fore, D.Phil., Senior Biomedical Informatics Program Manager
The Global Alliance for Genomics & Health (GA4GH) Cloud Work Stream focuses on API standards (and implementations from partner Driver Projects and other community members) that make it easier to “send the algorithms to the data.” Developed collaboratively by bioinformaticians, cloud and workflow platform engineers, and other stakeholders in the GA4GH community, the Workflow Execution Service (WES) API provides a standard way for users to submit workflow requests to workflow execution systems and to monitor their execution. This API lets users run a single workflow (e.g., in CWL, WDL, or Nextflow format) on multiple different platforms, clouds, and environments. We will provide an overview of the existing functionality described in the WES API standard, as well as how WES fits in with other standards from the GA4GH Cloud Work Stream, such as the Tool Registry Service (TRS) and Data Repository Service (DRS) APIs. We will also present some current use cases and implementations of WES and review ongoing development. We hope that this introduction to the WES API will encourage feedback and contributions from CWIG members.
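The submit-and-monitor cycle described above can be sketched in a few lines. The base URL and workflow URL below are hypothetical placeholders; the endpoint paths, form fields, and run states follow the GA4GH WES API standard.

```python
# Minimal sketch of the WES request/poll cycle. The server base URL and the
# workflow URL are hypothetical; the paths and fields follow the WES standard.
import json
import time

WES_BASE = "https://wes.example.org/ga4gh/wes/v1"  # hypothetical WES server

def build_run_request(workflow_url, params, wf_type="CWL", wf_type_version="v1.0"):
    """Form fields for POST {WES_BASE}/runs."""
    return {
        "workflow_url": workflow_url,
        "workflow_type": wf_type,
        "workflow_type_version": wf_type_version,
        "workflow_params": json.dumps(params),
    }

def submit_and_poll(session, workflow_url, params, interval=10):
    """Submit a run using an HTTP session (e.g. requests.Session), then poll
    GET /runs/{run_id}/status until the run reaches a terminal state."""
    resp = session.post(f"{WES_BASE}/runs",
                        data=build_run_request(workflow_url, params))
    run_id = resp.json()["run_id"]
    terminal = {"COMPLETE", "EXECUTOR_ERROR", "SYSTEM_ERROR", "CANCELED"}
    while True:
        state = session.get(f"{WES_BASE}/runs/{run_id}/status").json()["state"]
        if state in terminal:
            return run_id, state
        time.sleep(interval)
```

Because the same request shape works against any conformant WES server, switching execution platforms amounts to changing `WES_BASE`.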
Friday, March 11, 2022, 3 to 4 pm EST
Galaxy and Software Containers: A Recipe for Success
Dr. Enis Afgan, Research Scientist
Johns Hopkins University
Galaxy (galaxyproject.org) is a popular tool and workflow execution platform used by thousands of researchers. How does it work? How does it scale to expose thousands of popular tools? And how does it handle millions of jobs per month? We will go over the system architecture of Galaxy and all the components it needs to function. We will explore how to install Galaxy locally for development and production use cases. We will also showcase how Galaxy is deployed at usegalaxy.org and in a FedRAMP-managed environment at AnVIL. Galaxy is increasingly making use of software containers to ensure consistency and promote software portability. We will take a detailed look at the latest available model for deploying Galaxy: the Galaxy Helm chart. The chart abstracts the mechanics of deploying Galaxy into a single, highly configurable package that handles installation and management of all the software services required to run Galaxy. This will include reflection on compute infrastructure and software components, as well as tools and reference data. Overall, this will be an expository talk about what goes on behind the scenes to make a Galaxy installation function at scale.
Friday, February 11, 2022, 3 to 4 pm EST
The Galaxy Platform for Accessible, Reproducible, and Scalable Biomedical Data Science
Jeremy Goecks, Ph.D., Associate Professor, Department of Biomedical Engineering
Oregon Health & Science University
Started in 2005, the Galaxy Project (https://galaxyproject.org/) has worked to solve key issues plaguing modern data-intensive biomedicine—the ability of researchers to access cutting-edge analysis methods, to precisely reproduce and share complex computational analyses, and to perform large-scale analyses across many datasets. Galaxy has become one of the largest and most widely used open-source platforms for biomedical data science. Promoting openness and collaboration in all facets of the project has enabled Galaxy to build a vibrant world-wide community of scientist users, software developers, system engineers, and educators who continuously contribute new software features, add the latest analysis tools, adopt modern infrastructure such as package managers and software containers, author training materials, and lead research and training workshops. In this talk, I will share an overview of the Galaxy Project and highlight several recent applications of Galaxy to cancer research, including the use of machine learning to predict therapeutic response and analysis of single-cell spatial omics to understand tumor spatial biology.
Friday, January 14, 2022, 3 to 4 pm EST
Federated Analysis for Cancer Variant Interpretation
Melissa Cline, Ph.D., Associate Research Scientist
UC Santa Cruz Genomics Institute
Pathogenic variation in BRCA1 and BRCA2 is a major risk factor for cancers including breast, ovarian, pancreatic, and prostate cancer. Genetic testing is empowering individuals and their health care providers to understand and better manage their heritable risk of cancer, but it is limited by the many gaps in our knowledge of human genetic variation. These gaps, termed “Variants of Uncertain Significance” (VUS), are rare genetic variants for which there is insufficient evidence to determine their clinical impact. Variant interpretation frequently requires some amount of patient-level data: clinical data describing the incidence of disease in patients with the VUS. Due to their sensitive nature, these data are mostly siloed. In 2015, the BRCA Challenge was launched to address this problem by assembling a team of experts to develop new approaches to sharing variant data on BRCA1 and BRCA2, as exemplars for other genes and heritable disorders. One promising approach is federated analysis. By sharing containerized analysis software with institutions that hold patient-level data, we have been able to analyze these data in situ, without the need to share the patient-level data directly, generating variant-level summaries that are less sensitive and can be shared more easily, yet contain sufficient information to further variant interpretation. We will describe our experience with this approach, as well as future directions in container technology that will encourage greater variant data sharing.
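To make the federated pattern concrete, here is a minimal sketch of the kind of aggregation that the containerized software could run inside each institution, so that only variant-level counts, never patient-level rows, leave the site. The record field names (`variant`, `affected`) are invented for illustration and are not the project's actual schema.

```python
# Sketch of in-situ aggregation for federated analysis: patient-level records
# stay at the institution; only per-variant summary counts are exported.
# Record field names ("variant", "affected") are hypothetical.
from collections import defaultdict

def summarize(records):
    """Collapse patient-level rows into shareable per-variant counts."""
    summary = defaultdict(lambda: {"cases": 0, "controls": 0})
    for rec in records:
        group = "cases" if rec["affected"] else "controls"
        summary[rec["variant"]][group] += 1
    return dict(summary)
```

The summary output is far less re-identifiable than the input rows, which is what makes it shareable across institutional boundaries.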
Friday, November 12, 2021, 3 to 4 pm EST
Developing Scalable Bioinformatics Workflows on the Cancer Genomics Cloud
Jeffrey Grover, Ph.D., Genomics Scientist
The Cancer Genomics Cloud (CGC) is a cloud-based bioinformatics ecosystem supported by the National Cancer Institute (NCI). The CGC allows users to run computational workflows defined in the Common Workflow Language (CWL) on a wealth of large datasets, in place, in the cloud. Users may also upload their own data and take advantage of the scalability of cloud computing for their data analysis. In addition to the hundreds of publicly available bioinformatics workflows in the CGC Public Apps Gallery, users can employ a variety of methods to develop their own. These include an integrated graphical user interface for creating workflows, as well as an ecosystem of tools enabling local development and automated deployment of workflows to the CGC. We will detail how to develop efficient workflows for the CGC and how to use best practices such as version control and continuous integration with the CGC, using publicly available tools developed by Seven Bridges.
Friday, October 8, 2021, 3 to 4 pm EDT
Reproducible FAIR+ workflows and the CCWL
Dr. Pjotr Prins, Assistant Professor
University of Tennessee Health Science Center
Arun Isaac, PhD Student
University of Tennessee Health Science Center
FAIR principles are focused on data and fail to account for reproducible and (on-demand) workflows. In this talk, we will explore FAIR+ (Findable, Accessible, Interoperable, Reusable, and Computable) in the context of GeneNetwork.org, one of the oldest web resources in bioinformatics. With GeneNetwork we are realizing reproducible software deployment, building on free and open-source software including GNU Guix and containers. We are also building scalable workflows that are triggered on demand to run in the cloud or on bare metal, and we have created our own HPC cluster to run GNU Guix-based pangenomics. We will present our infrastructure, including a prototype COVID-19 cloud setup, with a hands-on introduction to GNU Guix and the concise CWL, a CWL generator whose workflows look like shell scripts but can in fact be reasoned about and are far more portable.
The Common Workflow Language (CWL) is an open standard for describing analysis workflows and tools in a way that makes them portable and scalable across a variety of software and hardware environments, from workstations to cluster, cloud, and high-performance computing (HPC) environments.
Guix is an advanced distribution of the GNU operating system, developed by the GNU Project, which respects the freedom of computer users. Guix supports transactional upgrades and roll-backs, unprivileged package management, and more. When used as a standalone distribution, Guix supports declarative system configuration for transparent and reproducible operating systems.
The Concise Common Workflow Language (CCWL) is a concise syntax to express CWL workflows. It is implemented as an Embedded Domain Specific Language (EDSL) in the Scheme programming language, a minimalist dialect of the Lisp family of programming languages.
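As a concrete reference point for the CWL description above, a minimal CWL tool definition looks like the following. This is a generic line-counting example written by the editor, not one of the workflows discussed in the talk; CCWL expresses the same information in a terser, Scheme-embedded syntax that compiles down to CWL like this.

```yaml
cwlVersion: v1.2
class: CommandLineTool
doc: Count the lines of an input file.
baseCommand: [wc, -l]
inputs:
  infile:
    type: File
    inputBinding:
      position: 1
stdout: line_count.txt
outputs:
  line_count:
    type: stdout
```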
Friday, September 10, 2021, 3 to 4 pm EST
WFPM: A novel WorkFlow Package Manager to enable collaborative bioinformatics workflow development
Junjun Zhang, Senior Bioinformatics Manager
Ontario Institute for Cancer Research
Recent advances in bioinformatics workflow development solutions have focused on addressing reproducibility and portability but lag significantly behind in supporting component reuse and sharing. This results in poor adoption of the widely practiced Don’t Repeat Yourself (DRY) principle and the divide-and-conquer strategy.
To address these limitations, the International Cancer Genome Consortium Accelerating Research in Genomic Oncology (ICGC ARGO) initiative (https://www.icgc-argo.org) has adopted a modular approach in which a series of "best practice" genome analysis workflows are encapsulated in well-defined packages, which are then incorporated into higher-level workflows. Following this approach, we have developed five production workflows that extensively reuse component packages. This flexible architecture enables ARGO developers spread across the globe to collaboratively build its uniform workflows, with different developers focusing on different components. All ARGO component packages are reusable; the general bioinformatics community can import them as modules to build their own workflows.
Recently, we have developed an open-source command line interface (CLI) tool called the WorkFlow Package Manager (WFPM) CLI, which provides assistance throughout the entire workflow development lifecycle to implement best practices and the aforementioned modular approach. By streamlining and automating template code generation, continuous integration testing, and releasing, WFPM CLI significantly lowers the barrier for users to develop standard, reusable workflow packages. WFPM CLI source code: https://github.com/icgc-argo/wfpm; documentation: https://wfpm.readthedocs.io