NCI Containers and Workflows Interest Group Webinar Series (Past Webinars)

Past Webinars

November 12, 2021, 3:00-4:00 p.m. ET


Topic: Developing Scalable Bioinformatics Workflows on the Cancer Genomics Cloud
Presenter: Dr. Jeffrey Grover


Jeffrey Grover, Ph.D.
Title: Genomics Scientist
Organization: Seven Bridges

 

Speaker Bio: Jeff is a bioinformatics scientist and received his Ph.D. in molecular and cellular biology from the University of Arizona in 2020. He is experienced in integrating results from multi-omics data analysis, in data visualization, and in bioinformatics workflow automation. At Seven Bridges, Jeff works to expand our bioinformatics workflow offerings, provide technical expertise across public programs, and prototype internal technical solutions. In his free time he’s usually hacking in Python or R, traveling to a national park, or keeping his guitar collection in top playing shape.

 

Abstract: The Cancer Genomics Cloud (CGC) is a cloud-based bioinformatics ecosystem supported by the National Cancer Institute (NCI). The CGC allows users to run computational workflows defined in the Common Workflow Language (CWL) on a wealth of large datasets, in place, in the cloud. Users may also upload their own data and take advantage of the scalability of cloud computing for their data analysis. In addition to the hundreds of publicly available bioinformatics workflows in the CGC Public Apps Gallery, users can employ a variety of methods to develop their own. These include an integrated graphical user interface for creating workflows, as well as an ecosystem of tools enabling local development and automated deployment of workflows to the CGC. We will detail how to develop efficient workflows for the CGC and how to apply best practices such as version control and continuous integration, using publicly available tools developed by Seven Bridges.

Resources:
https://www.cancergenomicscloud.org/
https://rabix.io/
https://github.com/rabix/sb-ci
https://github.com/rabix/benten
https://github.com/rabix/sbpack

Presentation recording

Slides

October 8, 2021, 3:00-4:00 p.m. ET


Topic: Reproducible FAIR+ workflows and the CCWL
Presenters: Dr. Pjotr Prins (Assistant Professor) & Arun Isaac (PhD Student)


Dr. Pjotr Prins
Title: Assistant Professor
Organization: University of Tennessee Health Science Center

Speaker Bio: Dr. Prins is a bioinformatician at large and assistant (coding) professor at the Department of Genetics, Genomics, and Informatics at the University of Tennessee Health Science Center. He is the director of GeneNetwork.org and, despite collecting appointments at various institutes (see https://thebird.nl/), remains a dedicated free software and hardware champion who writes critical software for genetics and pangenomics.

 

Arun Isaac
Title: PhD Student
Organization: University of Tennessee Health Science Center

Speaker Bio: Arun Isaac is a Ph.D. student at the Department of Computational and Data Sciences, Indian Institute of Science. He spends much of his free time contributing to free software and hacking on Emacs Lisp and Scheme. He contributes regularly to GNU Guix and is the author of guile-email, an email parser for Guile. He also contributes to Tamil localization.

 

Abstract: The FAIR principles focus on data and fail to account for reproducible, on-demand workflows. In this talk, we will explore FAIR+ (Findable, Accessible, Interoperable, Reusable, and Computable) in the context of GeneNetwork.org, one of the oldest web resources in bioinformatics. With GeneNetwork we are realizing reproducible software deployment, building on free and open-source software including GNU Guix and containers. We are also building scalable workflows that are triggered on demand to run in the cloud or on bare metal, and we built our own HPC cluster to run GNU Guix-based pangenomics. We will present our infrastructure, including a prototype COVID19 cloud setup, with a hands-on introduction to GNU Guix and the Concise CWL (CCWL), a CWL generator whose workflow definitions look like shell scripts but can in fact be reasoned about and are far more portable.

The Common Workflow Language (CWL) is an open standard for describing analysis workflows and tools in a way that makes them portable and scalable across a variety of software and hardware environments, from workstations to cluster, cloud, and high-performance computing (HPC) environments.
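To make "describing a tool" concrete, the following is a minimal sketch, not material from the talk. It assumes the small `echo` example commonly used to introduce CWL and relies on the fact that CWL documents may be written in JSON as well as YAML. The function name `make_echo_tool` is hypothetical and used only for illustration.

```python
import json


def make_echo_tool() -> dict:
    """Build a minimal CWL CommandLineTool description as a plain dict.

    Illustrative only: the tool wraps `echo` and takes one string input.
    """
    return {
        "cwlVersion": "v1.2",
        "class": "CommandLineTool",
        "baseCommand": "echo",
        "inputs": {
            "message": {
                "type": "string",
                # Place the value as the first positional argument.
                "inputBinding": {"position": 1},
            }
        },
        "outputs": [],
    }


tool = make_echo_tool()
# Serialize to JSON, which is a valid way to write a CWL document.
print(json.dumps(tool, indent=2))
```

The printed document could be saved to a file and handed to a CWL runner together with a value for `message`. Because the description names the command, its inputs, and its outputs explicitly, a runner can execute it on a workstation, cluster, or cloud without changes to the description itself.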

Guix is an advanced distribution of the GNU operating system developed by the GNU Project which respects the freedom of computer users. Guix supports transactional upgrades and roll-backs, unprivileged package management, and more. When used as a standalone distribution, Guix supports declarative system configuration for transparent and reproducible operating systems.

The Concise Common Workflow Language (CCWL) is a concise syntax to express CWL workflows. It is implemented as an Embedded Domain Specific Language (EDSL) in the Scheme programming language, a minimalist dialect of the Lisp family of programming languages.

Resources:
https://genenetwork.org/
https://covid19.genenetwork.org/ (FAIR+ workflows)
https://hpc.guix.info/blog/2019/01/creating-a-reproducible-workflow-with-cwl/
https://guix.gnu.org/
https://git.systemreboot.net/ccwl/tree/README.org (CCWL)

Presentation Recording

Dr. Prins' Slides

Arun Isaac's Slides

 

September 10, 2021, 3:00-4:00 p.m. ET


Topic: WFPM: A novel WorkFlow Package Manager to enable collaborative bioinformatics workflow development
Presenter: Junjun Zhang


Junjun Zhang
Title: Senior Bioinformatics Manager
Organization: Ontario Institute for Cancer Research

 

Speaker Bio: Junjun Zhang has over 19 years of extensive experience in designing and building innovative solutions for cancer genomics and bioinformatics. He has led NCI’s GDC data model development, co-led the development of the International Cancer Genome Consortium Data Portal, and developed the central metadata tracking system for the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes project. Recently, he led the establishment of best practices for the ICGC ARGO uniform workflow development, and developed the first full-featured workflow package manager tool.

 

Abstract: Recent advances in bioinformatics workflow development solutions have focused on addressing reproducibility and portability but significantly lag behind in supporting component reuse and sharing, which results in poor adoption of the widely practiced Don’t Repeat Yourself (DRY) principle and the divide-and-conquer strategy.

To address these limitations, the International Cancer Genome Consortium Accelerating Research in Genomic Oncology (ICGC ARGO) initiative (https://www.icgc-argo.org) has adopted a modular approach in which "best practice" genome analysis workflows are encapsulated in well-defined packages that are then incorporated into higher-level workflows. Following this approach, we have developed five production workflows which extensively reuse component packages. This flexible architecture enables ARGO developers spread across the globe to build its uniform workflows collaboratively, with different developers focusing on different components. All ARGO component packages can be imported as modules by the general bioinformatics community to build their own workflows.

Recently, we have developed an open source command line interface (CLI) tool called WorkFlow Package Manager (WFPM) CLI that provides assistance throughout the entire workflow development lifecycle to implement best practices and the aforementioned modular approach. With a highly streamlined process and automation in template code generation, continuous integration testing, and releasing, WFPM CLI significantly lowers the barriers for users to develop standard reusable workflow packages. The WFPM CLI source code is available at https://github.com/icgc-argo/wfpm and the documentation at https://wfpm.readthedocs.io

Resource: https://softeng.oicr.on.ca/junjun_zhang/2021/03/31/build-workflows-collaboratively-using-packages/

Presentation Recording

Slides

 

Access archived webinars.