NCI Containers and Workflows Interest Group Webinar Series (Past Webinars)
Friday, September 9, 2022, 3 to 4 pm EDT
UCSF Information Commons, Clinical Use Cases and Models - Session II of II
Dr. Travis Zack, MD, PhD, Oncology Fellow
Bakar Computational Health Sciences Institute, UCSF
Recently, there have been increasing efforts to use machine learning on large datasets obtained from electronic medical records (EMRs) to inform and improve clinical care, yet the standardization and organization of these data have so far limited their utility in oncology. The options and complexity of cancer treatment continue to expand, and with frequent protocol modifications due to patient intolerance, accurate and properly controlled comparisons and cohort identification across providers and institutions can be challenging. Here we expand on previous methods [1] to leverage HemOnc.org, a physician-curated comprehensive database of oncology treatment protocols, to create a database of 5,146 regimens across 146 hematology and oncology diseases that includes drug names, optimal dosages, administration days, cycle lengths, and the number of cycles in a complete treatment. We use rule-based natural language processing to convert this text database into a structured database of antineoplastic regimens. We have also developed a convolutional time-series maximum likelihood estimation algorithm to identify the most likely regimen a patient is undergoing at each point in their treatment history, as well as modifications in therapy. We illustrate the utility of these tools by analyzing and comparing treatments and treatment modifications within UCSF and across five UC campuses.
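For readers unfamiliar with this kind of regimen matching, the following is a toy sketch, not the authors' algorithm, of scoring an observed sequence of (drug, day) administration events against structured regimen templates and picking the best match. The regimen contents shown are simplified and for illustration only.

```python
# Toy illustration (not the authors' algorithm) of matching an observed
# sequence of drug administrations against structured regimen templates.
# Regimen contents below are simplified and for demonstration only.

# Hypothetical structured regimens: drug -> administration days within one cycle
REGIMENS = {
    "FOLFOX":  {"oxaliplatin": [1], "fluorouracil": [1, 2], "leucovorin": [1]},
    "FOLFIRI": {"irinotecan": [1],  "fluorouracil": [1, 2], "leucovorin": [1]},
}

def score(observed, regimen):
    """Fraction of expected (drug, day) events observed, penalizing extras."""
    expected = {(drug, day) for drug, days in regimen.items() for day in days}
    obs = set(observed)
    hits = len(expected & obs)
    return hits / len(expected) - 0.5 * len(obs - expected) / max(len(obs), 1)

def best_regimen(observed):
    """Return the template that best explains the observed events."""
    return max(REGIMENS, key=lambda name: score(observed, REGIMENS[name]))
```

A real implementation would score over sliding windows of the full treatment timeline and use a probabilistic likelihood rather than a simple overlap fraction, but the matching idea is the same.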
Friday, June 10, 2022, 3 to 4 pm EDT
UCSF Information and Cancer Commons: A Multi-factor Platform for Deep Integrative Biomedical Research and Precision Medicine - Session I of II
Dr. Sharat Israni, PhD, Executive Director and CTO
Bakar Computational Health Sciences Institute, UCSF
Dr. Gundolf Schenk, PhD, Principal Data Scientist
Bakar Computational Health Sciences Institute, UCSF
UCSF Information Commons (IC) is a research data platform geared toward the multifactor inquiry, at the scale of AI and precision medicine, that marks biomedical research today. Spanning more than 5.5 million UCSF patients since the 1980s, IC brings together their clinical records, clinical notes, radiology images of most modalities, and clinical genomic profiles. These are all certifiably deidentified, so researchers do not need further IRB approval to explore the data and shape their research. IC is built entirely on open-source software, with rich support for deep data science and AI. Further, it is compatible with the UC Data Discovery Platform, which makes clinical records available across the six UC medical centers for wide inquiry. We present examples of how multifactor research results in provably richer findings.
UCSF Cancer Commons, built on the above IC, aims to support data-driven cancer research with best-of-breed technology. It includes cancer-specific patient data in the above IC formats, plus pathology imaging and the Cancer Registry. It makes these data accessible for exploration, cohort building, statistical analyses, and AI model building via user-friendly tools, such as UCSF's local installation of cBioPortal, and via programming tools such as Jupyter and RStudio, leveraging distributed computing technologies like Apache Spark and Presto in the AWS cloud and on-premises HPC.
In the next edition of this CWIG, a practicing oncologist will present their use of the IC for very large, high-dimensional oncology studies.
Friday, May 13, 2022, 3 to 4 pm EDT
Scalable and Reproducible Genomics Data Analysis on AWS
Dr. W. Lee Pang, Principal Developer Advocate
Amazon Web Services, HealthAI
The volume of raw genomics data is growing rapidly, and some estimate that the amount of data worldwide is on the order of exabytes. Processing such mountains of information into science-ready formats such as variant calls and expression matrices is nontrivial and requires workflow architectures that can scale in both performance and cost efficiency. Similarly, with growing worldwide interest in genomics, from individual researchers to global consortia, computational methods need to be portable, easy to share, and easy to deploy. AWS offers practically unlimited compute capacity, elasticity, and flexibility to process enormous amounts of genomics data cost-effectively and on demand. With its global footprint, AWS enables the rapid deployment of reproducible computing infrastructure worldwide. In this talk, we’ll highlight the core patterns, architectures, and tooling used by many genomics customers who are leveraging AWS to tackle their biggest genomics data processing challenges. We’ll also highlight ways that AWS facilitates computational portability and reproducible research, accelerating advances in genomics worldwide.
Friday, April 8, 2022, 3 to 4 pm EDT
Getting to know the GA4GH Workflow Execution Service (WES) API
James Eddy, Director of Informatics & Biocomputing
Ian Fore, D.Phil., Senior Biomedical Informatics Program Manager
The Global Alliance for Genomics & Health (GA4GH) Cloud Work Stream focuses on API standards (and implementations from partner Driver Projects and other community members) that make it easier to “send the algorithms to the data.” Developed collaboratively by bioinformaticians, cloud and workflow platform engineers, and other stakeholders in the GA4GH community, the Workflow Execution Service (WES) API provides a standard way for users to submit workflow requests to workflow execution systems and to monitor their execution. This API lets users run a single workflow (e.g., in CWL, WDL, or Nextflow format) on multiple different platforms, clouds, and environments. We will provide an overview of the existing functionality described in the WES API standard, as well as how WES fits in with other standards from the GA4GH Cloud Work Stream, such as the Tool Registry Service (TRS) and Data Repository Service (DRS) APIs. We will also present some current use cases and implementations of WES and review ongoing development. We hope that this introduction to the WES API will encourage feedback and contributions from CWIG members.
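The submit-and-monitor cycle described above can be sketched in a few lines. The base URL and workflow URL below are hypothetical placeholders; the endpoint paths, form fields, and run states follow the GA4GH WES API standard.

```python
# Minimal sketch of the WES request/poll cycle. The server base URL and the
# workflow URL are hypothetical; the paths and fields follow the WES standard.
import json
import time

WES_BASE = "https://wes.example.org/ga4gh/wes/v1"  # hypothetical WES server

def build_run_request(workflow_url, params, wf_type="CWL", wf_type_version="v1.0"):
    """Form fields for POST {WES_BASE}/runs."""
    return {
        "workflow_url": workflow_url,
        "workflow_type": wf_type,
        "workflow_type_version": wf_type_version,
        "workflow_params": json.dumps(params),
    }

def submit_and_poll(session, workflow_url, params, interval=10):
    """Submit a run using an HTTP session (e.g. requests.Session), then poll
    GET /runs/{run_id}/status until the run reaches a terminal state."""
    resp = session.post(f"{WES_BASE}/runs",
                        data=build_run_request(workflow_url, params))
    run_id = resp.json()["run_id"]
    terminal = {"COMPLETE", "EXECUTOR_ERROR", "SYSTEM_ERROR", "CANCELED"}
    while True:
        state = session.get(f"{WES_BASE}/runs/{run_id}/status").json()["state"]
        if state in terminal:
            return run_id, state
        time.sleep(interval)
```

Because the same request shape works against any conformant WES server, switching execution platforms amounts to changing `WES_BASE`.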
Friday, March 11, 2022, 3 to 4 pm EST
Galaxy and Software Containers: A Recipe for Success
Dr. Enis Afgan, Research Scientist
Johns Hopkins University
Galaxy (galaxyproject.org) is a popular tool and workflow execution platform used by thousands of researchers. How does it work? How does it scale to expose thousands of popular tools? And how does it handle millions of jobs per month? We will go over the system architecture of Galaxy and all the components it needs to function. We will explore how to install Galaxy locally for development and production use cases. We will also showcase how Galaxy is deployed at usegalaxy.org and in a FedRAMP-managed environment at AnVIL. Galaxy is increasingly making use of software containers to ensure consistency and promote software portability. We will take a detailed look at the latest available model for deploying Galaxy: the Galaxy Helm chart. The chart abstracts the mechanics of deploying Galaxy into a single, highly configurable package that handles installation and management of all the software services required to run Galaxy. This will include reflection on compute infrastructure and software components, as well as tools and reference data. Overall, this will be an expository talk about what goes on behind the scenes to make a Galaxy installation function at scale.
Friday, February 11, 2022, 3 to 4 pm EST
The Galaxy Platform for Accessible, Reproducible, and Scalable Biomedical Data Science
Jeremy Goecks, Ph.D., Associate Professor, Department of Biomedical Engineering
Oregon Health & Science University
Started in 2005, the Galaxy Project (https://galaxyproject.org/) has worked to solve key issues plaguing modern data-intensive biomedicine—the ability of researchers to access cutting-edge analysis methods, to precisely reproduce and share complex computational analyses, and to perform large-scale analyses across many datasets. Galaxy has become one of the largest and most widely used open-source platforms for biomedical data science. Promoting openness and collaboration in all facets of the project has enabled Galaxy to build a vibrant world-wide community of scientist users, software developers, system engineers, and educators who continuously contribute new software features, add the latest analysis tools, adopt modern infrastructure such as package managers and software containers, author training materials, and lead research and training workshops. In this talk, I will share an overview of the Galaxy Project and highlight several recent applications of Galaxy to cancer research, including the use of machine learning to predict therapeutic response and analysis of single-cell spatial omics to understand tumor spatial biology.
Friday, January 14, 2022, 3 to 4 pm EST
Federated Analysis for Cancer Variant Interpretation
Melissa Cline, Ph.D., Associate Research Scientist
UC Santa Cruz Genomics Institute
Pathogenic variation in BRCA1 and BRCA2 is a major risk factor for cancers including breast, ovarian, pancreatic, and prostate cancer. Genetic testing is empowering individuals and their health care providers to understand and better manage their heritable risk of cancer, but it is limited by the many gaps in our knowledge of human genetic variation. These gaps, termed “Variants of Uncertain Significance” (VUS), are rare genetic variants for which there is insufficient evidence to determine their clinical impact. Variant interpretation frequently requires some amount of patient-level data: clinical data describing the incidence of disease in patients with the VUS. Due to their sensitive nature, these data are mostly siloed. In 2015, the BRCA Challenge was launched to address this problem by assembling a team of experts to develop new approaches to sharing variant data on BRCA1 and BRCA2, as exemplars for other genes and heritable disorders. One promising approach is federated analysis. By sharing containerized analysis software with institutions that hold patient-level data, we have been able to analyze these data in situ, without the need to share the patient-level data directly, generating variant-level summaries that are less sensitive and can be shared more easily, yet contain sufficient information to further variant interpretation. We will describe our experience with this approach, as well as future directions in container technology that will encourage greater variant data sharing.
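To make the federated pattern concrete, here is a minimal sketch of the kind of aggregation that the containerized software could run inside each institution, so that only variant-level counts, never patient-level rows, leave the site. The record field names (`variant`, `affected`) are invented for illustration and are not the project's actual schema.

```python
# Sketch of in-situ aggregation for federated analysis: patient-level records
# stay at the institution; only per-variant summary counts are exported.
# Record field names ("variant", "affected") are hypothetical.
from collections import defaultdict

def summarize(records):
    """Collapse patient-level rows into shareable per-variant counts."""
    summary = defaultdict(lambda: {"cases": 0, "controls": 0})
    for rec in records:
        group = "cases" if rec["affected"] else "controls"
        summary[rec["variant"]][group] += 1
    return dict(summary)
```

The summary output is far less re-identifiable than the input rows, which is what makes it shareable across institutional boundaries.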
Friday, November 12, 2021, 3 to 4 pm EST
Developing Scalable Bioinformatics Workflows on the Cancer Genomics Cloud
Jeffrey Grover, Ph.D., Genomics Scientist
The Cancer Genomics Cloud (CGC) is a cloud-based bioinformatics ecosystem supported by the National Cancer Institute (NCI). The CGC allows users to run computational workflows defined in the Common Workflow Language (CWL) on a wealth of large datasets, in place, in the cloud. Users may also upload their own data and take advantage of the scalability of cloud computing for their data analysis. In addition to the hundreds of publicly available bioinformatics workflows in the CGC Public Apps Gallery, users can employ a variety of methods to develop their own. These include an integrated graphical user interface for creating workflows, as well as an ecosystem of tools enabling local development and automated deployment of workflows to the CGC. We will detail how to develop efficient workflows for the CGC and how to use best practices such as version control and continuous integration with the CGC, using publicly available tools developed by Seven Bridges.
Friday, October 8, 2021, 3 to 4 pm EDT
Reproducible FAIR+ workflows and the CCWL
Dr. Pjotr Prins, Assistant Professor
University of Tennessee Health Science Center
Arun Isaac, PhD Student
University of Tennessee Health Science Center
FAIR principles are focused on data and fail to account for reproducible and (on-demand) workflows. In this talk, we will explore FAIR+ (Findable, Accessible, Interoperable, Reusable, and Computable) in the context of GeneNetwork.org, one of the oldest web resources in bioinformatics. With GeneNetwork we are realizing reproducible software deployment, building on free and open-source software including GNU Guix and containers. We are also building scalable workflows that are triggered on demand to run in the cloud or on bare metal, and we have created our own HPC cluster to run GNU Guix-based pangenomics. We will present our infrastructure, including a prototype COVID-19 cloud setup, with a hands-on introduction to GNU Guix and the concise CWL, a CWL generator whose workflows look like shell scripts but can in fact be reasoned about and are far more portable.
The Common Workflow Language (CWL) is an open standard for describing analysis workflows and tools in a way that makes them portable and scalable across a variety of software and hardware environments, from workstations to cluster, cloud, and high-performance computing (HPC) environments.
Guix is an advanced distribution of the GNU operating system, developed by the GNU Project, which respects the freedom of computer users. Guix supports transactional upgrades and roll-backs, unprivileged package management, and more. When used as a standalone distribution, Guix supports declarative system configuration for transparent and reproducible operating systems.
The Concise Common Workflow Language (CCWL) is a concise syntax to express CWL workflows. It is implemented as an Embedded Domain Specific Language (EDSL) in the Scheme programming language, a minimalist dialect of the Lisp family of programming languages.
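As a concrete reference point for the CWL description above, a minimal CWL tool definition looks like the following. This is a generic line-counting example written by the editor, not one of the workflows discussed in the talk; CCWL expresses the same information in a terser, Scheme-embedded syntax that compiles down to CWL like this.

```yaml
cwlVersion: v1.2
class: CommandLineTool
doc: Count the lines of an input file.
baseCommand: [wc, -l]
inputs:
  infile:
    type: File
    inputBinding:
      position: 1
stdout: line_count.txt
outputs:
  line_count:
    type: stdout
```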
Friday, September 10, 2021, 3 to 4 pm EST
WFPM: A novel WorkFlow Package Manager to enable collaborative bioinformatics workflow development
Junjun Zhang, Senior Bioinformatics Manager
Ontario Institute for Cancer Research
Recent advances in bioinformatics workflow development solutions have focused on addressing reproducibility and portability but lag significantly behind in supporting component reuse and sharing. This results in poor adoption of the widely practiced Don’t Repeat Yourself (DRY) principle and the divide-and-conquer strategy.
To address these limitations, the International Cancer Genome Consortium Accelerating Research in Genomic Oncology (ICGC ARGO) initiative (https://www.icgc-argo.org) has adopted a modular approach in which a series of "best practice" genome analysis workflows are encapsulated in well-defined packages, which are then incorporated into higher-level workflows. Following this approach, we have developed five production workflows that extensively reuse component packages. This flexible architecture enables ARGO developers spread across the globe to collaboratively build its uniform workflows, with different developers focusing on different components. All ARGO component packages are reusable; the general bioinformatics community can import them as modules to build their own workflows.
Recently, we have developed an open-source command line interface (CLI) tool called the WorkFlow Package Manager (WFPM) CLI, which provides assistance throughout the entire workflow development lifecycle to implement best practices and the aforementioned modular approach. By streamlining and automating template code generation, continuous integration testing, and releasing, WFPM CLI significantly lowers the barrier for users to develop standard, reusable workflow packages. WFPM CLI source code: https://github.com/icgc-argo/wfpm; documentation: https://wfpm.readthedocs.io