Past Webinars
Docker Tech Talk
Scott Johnston, Chief Executive Officer
Docker
Todd Densmore, Sr. Solutions Architect
Docker
Abstract:
What is Docker? What problem does it solve? Why does everyone love talking about it? In this session, we will provide a foundational understanding of Docker and several live walkthroughs that demonstrate the power of Docker and containers, including functionality that data scientists and medical researchers across Docker's user base rely on to improve the reproducibility of their work, collaboration across teams, and resource efficiency.
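The walkthroughs themselves are not reproduced here, but as a rough illustration of the kind of container run the session demonstrates, here is a minimal sketch using the Docker SDK for Python (an assumption; the presenters may well use the docker CLI instead). The image tag and command are placeholders; pinning an exact image tag or digest is what makes a run like this reproducible.

import docker

client = docker.from_env()  # connect to the local Docker daemon

# Run a short-lived container from a pinned image and capture its output.
output = client.containers.run(
    image="python:3.11-slim",                          # illustrative pinned image
    command=["python", "-c", "print('hello from a container')"],
    remove=True,                                       # remove the container on exit
)
print(output.decode())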
Resource Files
Use of Containers for Custom Software Development at the NCI for AWS Cloud and On Premises
Krish Seshadri, Sr. Cloud Architect
NCI/CBIIT/DSSB
Lawrence Brem, Federal Manager for Software Development
NCI/CBIIT
Abstract:
The System Engineering team collaborates with software developers and researchers to construct their infrastructure, employing technologies like Docker and platform-specific services such as AWS ECS and Fargate. This presentation explores the nuances of Docker image construction, offering insights into the underlying procedures. Furthermore, an overview of the architectural pattern used to create a resilient Drupal platform based on Docker will be presented.
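As a companion to the discussion of image construction, here is a minimal, hypothetical sketch of building and tagging an image programmatically with the Docker SDK for Python; the build-context path and image tag are placeholders, and a real ECS/Fargate deployment would additionally push the resulting image to a registry such as Amazon ECR.

import docker

client = docker.from_env()

# Build an image from a Dockerfile in the current directory (placeholder path and tag).
image, build_log = client.images.build(
    path=".",                       # directory containing the Dockerfile (assumed)
    tag="drupal-platform:latest",   # illustrative tag
    rm=True,                        # remove intermediate containers after the build
)
for chunk in build_log:
    if "stream" in chunk:
        print(chunk["stream"], end="")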
Resource Files
AWS HealthOmics: Transform omics data into insights
Ariella Sasson, Principal Solutions Architect
Amazon Web Services
Abstract:
AWS HealthOmics helps healthcare and life science organizations store, query, and analyze genomic, transcriptomic, and other omics data at scale. By removing the undifferentiated heavy lifting, it lets you generate deeper insights from omics data to improve health and advance scientific discovery. With AWS HealthOmics, you can either bring your own containerized bioinformatics workflows written in Nextflow, WDL, or CWL, or run one of the Ready2Run pre-configured workflows through simple API calls, for a managed compute experience.
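For readers who want a concrete picture of the "simple API calls" mentioned above, here is a hedged sketch of starting a Ready2Run workflow run with boto3. The workflow ID, IAM role ARN, parameter names, and S3 URIs are placeholders, not values from the talk; real workflow IDs and expected inputs come from the Ready2Run catalog.

import boto3

omics = boto3.client("omics", region_name="us-east-1")

# Start a Ready2Run workflow run (all identifiers below are placeholders).
response = omics.start_run(
    workflowType="READY2RUN",
    workflowId="1234567",                                    # placeholder workflow ID
    roleArn="arn:aws:iam::123456789012:role/OmicsRunRole",   # placeholder IAM role
    name="example-ready2run-run",
    parameters={"sample_fastq": "s3://my-bucket/sample_R1.fastq.gz"},  # placeholder input
    outputUri="s3://my-bucket/omics-output/",
)
print(response["id"], response["status"])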
Resource Files
Realizing FAIR principles and Reproducible Computational Workflows with the Arvados Platform
Brett Smith, Senior Software Engineer
Curii
Abstract:
Reproducible research is necessary to ensure that scientific work can be understood, independently verified, and built upon in future work. FAIR principles provide guidelines to improve the Findability, Accessibility, Interoperability, and Reuse of digital assets (e.g., data and metadata). Computational workflows can follow FAIR principles both for the workflow descriptions themselves and for the (meta)data the workflows use or produce. FAIR principles have been suggested for research software, and FAIR principles specifically for scientific computational workflows remain an active area of discussion. However, existing FAIR principles for digital assets, software, or workflows don’t address computational reproducibility, so additional guidelines are needed for reproducible workflows and research.
Our talk will focus on the FAIR principles and the other aspects of data and workflow management we believe are necessary for reproducible research. We will discuss how the Arvados platform helps you “go FAIR” and beyond with your data, digital objects, and all aspects of your computational workflows. The Arvados Platform is a 100% open-source platform that integrates a data management system and a compute management system to create a unified environment to store and organize data and run Common Workflow Language (CWL) workflows. Specifically, we will discuss using Arvados to run, record, and reproduce complex workflows, define and access metadata, determine data provenance, and share and publish FAIR results directly from the platform.
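As one small, hypothetical example of the metadata capabilities mentioned above, the sketch below uses the Arvados Python SDK (arvados-python-client) to find collections by a metadata property. It assumes ARVADOS_API_HOST and ARVADOS_API_TOKEN are set in the environment, and the property name and value are invented for illustration.

import arvados

api = arvados.api("v1")  # credentials read from ARVADOS_API_HOST / ARVADOS_API_TOKEN

# List collections whose "properties" metadata matches a (hypothetical) key/value pair.
result = api.collections().list(
    filters=[["properties.sample_type", "=", "tumor"]],
    limit=10,
).execute()

for collection in result["items"]:
    print(collection["uuid"], collection.get("name"))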
Resource Files
Google Cloud Medical Imaging
Marcos Novaes, Solution Architect
Google
Abstract:
1. Google Cloud Medical Imaging Suite Overview
2. JupyterLab extensions for Medical Imaging:
- Interactive Python Widgets
- 3DSlicer Kernel
- Running 3DSlicer and MONAILabel in a Jupyter environment using IDC datasets
3. Demonstration
4. Questions & Answers
Resource Files
Using Google for NCI Research
Mike Callaghan, Cloud Customer Experience Account Lead
Google
Dave Belardo, Customer Engineer
Google
Dr. Philip Meacham, PhD, NIH/CIT STRIDES Initiative Cloud Instructional Specialist
Deloitte Consulting
Abstract:
1. Alphabet / Google Overview
2. Google Healthcare & Life Sciences (HCLS) Overview
3. STRIDES Overview & Benefits for NCI Researchers
4. Question & Answer
Resource Files
UCSF Information Commons, Clinical Use Cases and Models - Session II of II
Dr. Travis Zack, MD, PhD, Oncology Fellow
Bakar Computational Health Sciences Institute, UCSF
Abstract:
Recently, there have been increasing efforts to use machine learning on large datasets obtained from electronic medical records (EMR) to inform and improve clinical care, yet the standardization and organization of this data has so far limited its utility in oncology. The options for and complexity of cancer treatment continue to expand, and with frequent protocol modifications due to patient intolerance, accurate and properly controlled comparisons and cohort identification across providers and institutions can be challenging. Here we expand on previous methods [1] to leverage HemOnc.org, a physician-curated comprehensive database of oncology treatment protocols, to create a database of 5,146 regimens across 146 hematology and oncology diseases that includes information about drug names, optimal dosages, administration days, cycle lengths, and the number of cycles within a complete treatment. We use rule-based natural language processing to convert this text database into a structured database of anti-neoplastic regimens. We have developed a convolutional time series maximum likelihood estimation algorithm to identify the most likely regimen a patient is undergoing at each point in a patient’s treatment history, as well as modifications in therapy. We illustrate the utility of these tools to analyze and compare treatment and treatment modifications within UCSF and across five UC campuses.
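To make the matching idea easier to picture, here is a deliberately simplified, hypothetical sketch that slides a binary regimen template across a binary drug-administration timeline and scores each offset by the number of matched administration days. This is not the authors' algorithm (their method is a convolutional time-series maximum likelihood estimator over thousands of curated regimens); it only illustrates the underlying template-matching intuition.

import numpy as np

def best_alignment(events: np.ndarray, template: np.ndarray) -> tuple[int, float]:
    """Return (offset, score) for the best placement of `template` within `events`.

    `events` is a binary vector over days (1 = drug administered);
    `template` is a binary vector over the days of one regimen cycle.
    """
    # Sliding-window cross-correlation counts matched administration days per offset.
    scores = np.correlate(events, template, mode="valid")
    offset = int(np.argmax(scores))
    return offset, float(scores[offset])

# Toy example: a 21-day cycle with the drug given on cycle days 1 and 8.
template = np.zeros(21)
template[[0, 7]] = 1
events = np.zeros(60)
events[[10, 17, 31, 38]] = 1   # two observed cycles, the first starting on day 11
print(best_alignment(events, template))   # -> (10, 2.0)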
Resource Files
UCSF Information and Cancer Commons: A Multi-factor Platform for Deep Integrative Biomedical Research and Precision Medicine - Session I of II
Dr. Sharat Israni, PhD, Executive Director and CTO
Bakar Computational Health Sciences Institute, UCSF
Dr. Gundolf Schenk, PhD, Principal Data Scientist
UCSF
Abstract:
UCSF Information Commons (IC) is a research data platform geared toward the multifactor inquiry, at the scale of AI and precision medicine, that marks biomedical research today. Spanning over 5.5 million UCSF patients since the 1980s, IC brings together their clinical records, clinical notes, radiology images of most modalities, and clinical genomic profiles. These are all certifiably de-identified, so researchers do not need any further IRB approval to explore the data and shape their research. IC is built entirely on open source, with rich support for deep data science and AI. Further, it is compatible with the UC Data Discovery Platform, which makes clinical records available across the six UC medical centers for wide inquiry. We present examples of how multifactor research results in provably richer findings.
UCSF Cancer Commons, built on the IC described above, aims to support data-driven cancer research with best-of-breed technology. It includes cancer-specific patient data in the IC formats above, plus pathology imaging and the Cancer Registry. It makes these data accessible for exploration, cohort building, statistical analyses, and AI model building via user-friendly tools, such as UCSF’s local installation of cBioPortal, and via programming tools such as Jupyter and RStudio, leveraging distributed computing technologies like Apache Spark and Presto in the AWS cloud and on on-premises HPC.
In the next edition of this CWIG, a practicing oncologist will present their use of the IC for very high-dimensional oncology studies.
Resource Links:
Scalable and Reproducible Genomics Data Analysis on AWS
Dr. W. Lee Pang, Principal Developer Advocate
Amazon Web Services, HealthAI
Abstract:
Raw genomics data is being generated at a rapidly growing rate, and some estimate that the amount of data worldwide is on the order of exabytes. Processing such mountains of information into science-ready formats like variant calls, expression matrices, etc., is nontrivial and requires workflow architectures that can scale in both performance and cost efficiency. Similarly, with growing worldwide interest in genomics, from individual researchers to global consortia, computational methods need to be portable, easy to share, and easy to deploy. AWS offers practically unlimited compute capacity, elasticity, and flexibility to process enormous amounts of genomics data cost-effectively and on demand. With its global footprint, AWS enables the rapid deployment of reproducible computing infrastructure worldwide. In this talk, we’ll highlight the core patterns, architectures, and tooling used by many genomics customers who are leveraging AWS to tackle their biggest genomics data processing challenges. We’ll also highlight ways that AWS facilitates computational portability and reproducible research, accelerating advances in genomics worldwide.
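The specific architectures are covered in the talk itself; as a hedged illustration of one widely used pattern (submitting containerized genomics jobs to AWS Batch), here is a minimal boto3 sketch. The job queue, job definition, command, and S3 path are placeholders rather than names from the presentation.

import boto3

batch = boto3.client("batch", region_name="us-east-1")

# Submit one containerized job to a Batch queue (all names are placeholders).
response = batch.submit_job(
    jobName="align-sample-001",
    jobQueue="genomics-spot-queue",
    jobDefinition="bwa-mem-aligner:1",
    containerOverrides={
        "command": ["align.sh", "s3://my-bucket/sample_001_R1.fastq.gz"],
        "resourceRequirements": [
            {"type": "VCPU", "value": "8"},
            {"type": "MEMORY", "value": "32000"},
        ],
    },
)
print("Submitted job:", response["jobId"])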
Resource Files
Getting to know the GA4GH Workflow Execution Service (WES) API
James Eddy, Director of Informatics & Biocomputing
Sage Bionetworks
Ian Fore, D.Phil., Senior Biomedical Informatics Program Manager
Center for Biomedical Informatics and Information Technology (CBIIT)
Abstract:
The Global Alliance for Genomics & Health (GA4GH) Cloud Work Stream focuses on API standards (and implementations from partner Driver Projects and other community members) that make it easier to “send the algorithms to the data.” Developed collaboratively by bioinformaticians, cloud and workflow platform engineers, and other stakeholders in the GA4GH community, the Workflow Execution Service (WES) API provides a standard way for users to submit workflow requests to workflow execution systems and to monitor their execution. This API lets users run a single workflow (e.g., one written in CWL, WDL, or Nextflow) on multiple platforms, clouds, and environments. We will provide an overview of the existing functionality described in the WES API standard, as well as how WES fits together with other standards from the GA4GH Cloud Work Stream, such as the Tool Registry Service (TRS) and Data Repository Service (DRS) APIs. We will also present some current use cases and implementations of WES and review ongoing development. We hope that this introduction to the WES API will encourage feedback and contributions from CWIG members.
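Because WES is a plain HTTP API, a run submission and status check are easy to sketch. The example below follows the standard /ga4gh/wes/v1 endpoints; the base URL, bearer token, and workflow URL are placeholders, and authentication details vary by WES implementation.

import json
import requests

WES_BASE = "https://wes.example.org/ga4gh/wes/v1"     # placeholder endpoint
HEADERS = {"Authorization": "Bearer <token>"}          # placeholder auth

# Submit a run: the RunWorkflow request body is a multipart form.
resp = requests.post(
    f"{WES_BASE}/runs",
    headers=HEADERS,
    files={
        "workflow_url": (None, "https://example.org/workflows/hello.cwl"),  # placeholder
        "workflow_type": (None, "CWL"),
        "workflow_type_version": (None, "v1.2"),
        "workflow_params": (None, json.dumps({"message": "hello"})),
    },
)
run_id = resp.json()["run_id"]

# Poll the run's state until it reaches a terminal value such as COMPLETE.
status = requests.get(f"{WES_BASE}/runs/{run_id}/status", headers=HEADERS).json()
print(run_id, status["state"])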
Resource Files
Galaxy and Software Containers: A Recipe for Success
Dr. Enis Afgan, Research Scientist
Johns Hopkins University
Abstract:
Galaxy (galaxyproject.org) is a popular tool and workflow execution platform used by thousands of researchers. How does it work? How does it scale to expose thousands of popular tools? And how does it handle millions of jobs per month? We will go over the system architecture of Galaxy and all the components it needs to function. We will explore how to install Galaxy locally for development and production use cases. We will also showcase how Galaxy is deployed at usegalaxy.org and in a FedRAMP-managed environment at AnVIL. Galaxy increasingly makes use of software containers to ensure consistency and promote software portability. We will take a detailed look at the latest available model for deploying Galaxy: the Galaxy Helm chart. The chart abstracts the mechanics of deploying Galaxy into a single, highly configurable package that handles the installation and management of all the software services required to run Galaxy. This will include discussion of compute infrastructure and software components, as well as tools and reference data. Overall, this will be an expository talk about what goes on behind the scenes to make a Galaxy installation function at scale.
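The deployment story is the subject of the talk; as a small complement, here is a hedged sketch of interacting with a running Galaxy server through its REST API using BioBlend. The server URL and API key are placeholders, and this illustrates only client-side access, not the Helm-based deployment described above.

from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://usegalaxy.org", key="<your-api-key>")  # placeholder key

# List a few of the tools the server exposes.
for tool in gi.tools.get_tools()[:5]:
    print(tool["id"], tool.get("name"))

# List the workflows visible to this account.
for workflow in gi.workflows.get_workflows():
    print(workflow["id"], workflow["name"])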
Resource Files
The Galaxy Platform for Accessible, Reproducible, and Scalable Biomedical Data Science
Jeremy Goecks, Ph.D., Associate Professor, Department of Biomedical Engineering
Oregon Health & Science University
Abstract:
Started in 2005, the Galaxy Project (https://galaxyproject.org/) has worked to solve key issues plaguing modern data-intensive biomedicine—the ability of researchers to access cutting-edge analysis methods, to precisely reproduce and share complex computational analyses, and to perform large-scale analyses across many datasets. Galaxy has become one of the largest and most widely used open-source platforms for biomedical data science. Promoting openness and collaboration in all facets of the project has enabled Galaxy to build a vibrant world-wide community of scientist users, software developers, system engineers, and educators who continuously contribute new software features, add the latest analysis tools, adopt modern infrastructure such as package managers and software containers, author training materials, and lead research and training workshops. In this talk, I will share an overview of the Galaxy Project and highlight several recent applications of Galaxy to cancer research, including the use of machine learning to predict therapeutic response and analysis of single-cell spatial omics to understand tumor spatial biology.
Resource Links:
Federated Analysis for Cancer Variant Interpretation
Melissa Cline, Ph.D., Associate Research Scientist
UC Santa Cruz Genomics Institute
Abstract:
Pathogenic variation in BRCA1 and BRCA2 is a major risk factor for cancers including breast, ovarian, pancreatic and prostate. Genetic testing is empowering individuals and their health care providers to understand and better manage their heritable risk of cancer, but is limited by the many gaps in our knowledge of human genetic variation. These gaps, termed “Variants of Uncertain Significance” (VUS), are rare genetic variants for which there is insufficient evidence to determine their clinical impact. Variant interpretation frequently requires some amount of patient-level data: clinical data describing the incidence of disease in patients with the VUS. Due to their sensitive nature, these data are mostly siloed. In 2015, the BRCA Challenge was launched to address this problem by assembling a team of experts to develop new approaches to share variant data on BRCA1 and BRCA2, as exemplars for other genes and heritable disorders. One promising approach is federated analysis. By sharing containerized analysis software with institutions that hold patient-level data, we have been able to analyze this data in situ, without the need to share the patient-level data directly, generating variant-level summaries that are less sensitive and can be shared more easily, and yet contain sufficient information to further variant interpretation. We will describe our experience with this approach, as well as future directions in container technology that will encourage greater variant data sharing.
Resource Files
Developing Scalable Bioinformatics Workflows on the Cancer Genomics Cloud
Jeffrey Grover, Ph.D., Genomics Scientist
Seven Bridges
Abstract:
The Cancer Genomics Cloud (CGC) is a cloud-based bioinformatics ecosystem supported by the National Cancer Institute (NCI). The CGC allows users to run computational workflows defined in the Common Workflow Language (CWL) on a wealth of large datasets, in place, in the cloud. Users may also upload their own data and take advantage of the scalability of cloud computing for their data analysis. In addition to the hundreds of publicly available bioinformatics workflows in the CGC Public Apps Gallery, users can employ a variety of methods to develop their own. These include an integrated graphical user interface for creating workflows, as well as an ecosystem of tools enabling local development and automated deployment of workflows to the CGC. We will detail how to develop efficient workflows for the CGC and how to apply best practices such as version control and continuous integration with the CGC, using publicly available tools developed by Seven Bridges.
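As a hedged illustration of the programmatic route to the CGC, the sketch below uses the sevenbridges-python client to create and start a task. The project slug, app ID, file name, and input port name are placeholders for illustration only.

import sevenbridges as sbg

api = sbg.Api(url="https://cgc-api.sbgenomics.com/v2", token="<your-auth-token>")

project = api.projects.get(id="my-username/my-project")       # placeholder project
app = api.apps.get(id="my-username/my-project/fastqc/0")      # placeholder app

# Look up an input file already uploaded to the project (placeholder file name).
input_file = api.files.query(project=project, names=["sample_R1.fastq.gz"])[0]

task = api.tasks.create(
    name="fastqc-sample-001",
    project=project,
    app=app,
    inputs={"input_fastq": input_file},   # placeholder input port name
    run=True,                             # start the task immediately
)
print(task.id, task.status)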
Resource Links:
Reproducible FAIR+ workflows and the CCWL
Dr. Pjotr Prins, Assistant Professor
University of Tennessee Health Science Center
Arun Isaac, PhD Student
University of Tennessee Health Science Center
Abstract:
FAIR principles are focused on data and fail to account for reproducible and on-demand workflows. In this talk, we will explore FAIR+ (Findable, Accessible, Interoperable, Reusable, and Computable) in the context of GeneNetwork.org, one of the oldest web resources in bioinformatics. With GeneNetwork we are realizing reproducible software deployment, building on free and open-source software including GNU Guix and containers. We are also building scalable workflows that are triggered on demand to run in the cloud or on bare metal, and we have created our own HPC cluster to run GNU Guix-based pangenomics. In this talk, we will present our infrastructure, including a prototype COVID-19 cloud setup, with a hands-on introduction to GNU Guix and the Concise CWL (CCWL), a CWL generator whose workflows look like shell scripts but can in fact be reasoned about and are far more portable.
The Common Workflow Language (CWL) is an open standard for describing analysis workflows and tools in a way that makes them portable and scalable across a variety of software and hardware environments, from workstations to cluster, cloud, and high-performance computing (HPC) environments.
Guix is an advanced distribution of the GNU operating system developed by the GNU Project which respects the freedom of computer users. Guix supports transactional upgrades and roll-backs, unprivileged package management, and more. When used as a standalone distribution, Guix supports declarative system configuration for transparent and reproducible operating systems.
The Concise Common Workflow Language (CCWL) is a concise syntax to express CWL workflows. It is implemented as an Embedded Domain Specific Language (EDSL) in the Scheme programming language, a minimalist dialect of the Lisp family of programming languages.
Resource Links:
WFPM: A novel WorkFlow Package Manager to enable collaborative bioinformatics workflow development
Junjun Zhang, Senior Bioinformatics Manager
Ontario Institute for Cancer Research
Abstract:
Recent advances in bioinformatics workflow development solutions have focused on addressing reproducibility and portability but significantly lag behind in supporting component reuse and sharing, which results in poor adoption of the widely practiced Don’t Repeat Yourself (DRY) principle and the divide-and-conquer strategy.
To address these limitations, the International Cancer Genome Consortium Accelerating Research in Genomic Oncology (ICGC ARGO) initiative (https://www.icgc-argo.org) has adopted a modular approach in which a series of "best practice" genome analysis workflows have been encapsulated in well-defined packages that are then incorporated into higher-level workflows. Following this approach, we have developed five production workflows that extensively reuse component packages. This flexible architecture enables ARGO developers spread across the globe to collaboratively build its uniform workflows, with different developers focusing on different components. All ARGO component packages are also reusable by the general bioinformatics community, which can import them as modules to build their own workflows.
Recently, we have developed an open-source command-line interface (CLI) tool called the WorkFlow Package Manager (WFPM) CLI that provides assistance throughout the entire workflow development lifecycle to implement best practices and the aforementioned modular approach. With a highly streamlined process and automation of template code generation, continuous integration testing, and releasing, WFPM CLI significantly lowers the barriers for users to develop standard, reusable workflow packages. WFPM CLI source code: https://github.com/icgc-argo/wfpm; documentation: https://wfpm.readthedocs.io
Resource Links:
Please contact ncicwigusermail@mail.nih.gov for access to older webinar presentations.