Quick Summary: The Seven Bridges Cancer Genomics Cloud (CGC) is a platform designed to provide researchers with immediate access to massive datasets like The Cancer Genome Atlas (TCGA), enabling high-performance analysis without the need for local infrastructure.

The Challenge of Big Data in Genomics

As cancer genomic datasets grow in size and complexity, scalable cloud computing resources make rapid, cost-effective analysis possible. Downloading and storing a dataset such as TCGA, which contains genomic and clinical data from more than 11,000 patients, demands significant time and resources.

The CGC was funded as a pilot project by the US National Cancer Institute (NCI) to explore new ways of democratizing access to these massive datasets, together with the tools and computational resources needed to analyze them.

Key Features of the CGC

The platform offers a comprehensive suite of technologies for modern biomedical research:

  • Massive Hosted Datasets: Immediate access to over a petabyte of data, including TCGA and the Cancer Cell Line Encyclopedia (CCLE).
  • Curated Tools: More than 200 pre-installed bioinformatics workflows, including pipelines for variant calling and RNA sequencing analysis.
  • Reproducibility: The CGC ensures reproducibility by using the Common Workflow Language (CWL) to record every aspect of an analysis, including the files used, tool versions, and parameter settings.
  • Scalability: Analyses scale on demand. For example, one researcher performed targeted variant calling across 11,000 participants in about three hours for under $15.
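
To make the reproducibility point concrete, here is a minimal sketch of what recording the files, tool version, and parameters of a run could look like. It is an illustration only: the `RunRecord` class, its field names, and the example tool name are hypothetical and do not reflect the CGC's actual schema or API, and the record is plain JSON rather than CWL.

```python
import hashlib
import json
from dataclasses import dataclass, asdict, field


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical provenance record; not the CGC's actual schema."""
    tool: str
    tool_version: str
    input_files: tuple  # file names (or checksums) used by the run
    parameters: dict = field(default_factory=dict)

    def fingerprint(self) -> str:
        # Serialize with sorted keys so equal records always hash identically,
        # regardless of the order in which parameters were supplied.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Two runs with the same inputs, tool version, and parameters (in a different order).
run_a = RunRecord("variant-caller", "1.2.0", ("s1.bam", "s2.bam"),
                  {"min_quality": 30, "region": "chr17"})
run_b = RunRecord("variant-caller", "1.2.0", ("s1.bam", "s2.bam"),
                  {"region": "chr17", "min_quality": 30})
# Same inputs and parameters, but a newer tool version.
run_c = RunRecord("variant-caller", "1.3.0", ("s1.bam", "s2.bam"),
                  {"min_quality": 30, "region": "chr17"})

same_run = run_a.fingerprint() == run_b.fingerprint()
changed_run = run_a.fingerprint() == run_c.fingerprint()
print(same_run)     # True: identical analysis, identical fingerprint
print(changed_run)  # False: a tool version change is detectable
```

The design choice here is that any change to an input file, tool version, or parameter produces a different fingerprint, which is the property that lets an analysis be re-run and compared later.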

Visualizing the Platform

The following figure illustrates how the CGC integrates data, computation, and visualization to support collaborative research.

Figure 1: The Cancer Genomics Cloud Ecosystem.

  • A: The system enables users to upload private data, annotate it with metadata, and run analyses using optimized resources.
  • B: Time course showing over 9,000 RNA-Seq samples analyzed in parallel, all completed within 100 minutes.
  • C: The Data Browser allows users to explore and select data by specifying properties of interest visually.
  • D: The Case Explorer focuses on genetic properties, allowing global views of gene expression and mutation status.
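
The scale figures quoted in this article can be turned into rough rates with a few lines of arithmetic. The sketch below uses only the numbers stated here (over 9,000 RNA-Seq samples within 100 minutes; 11,000 participants in about three hours for under $15); the function and variable names are illustrative, and the results are approximate bounds rather than measured benchmarks.

```python
def rate_per_minute(items: int, minutes: float) -> float:
    """Aggregate throughput in items per minute."""
    return items / minutes


# Panel B: over 9,000 samples finished within 100 minutes (a lower bound).
rnaseq_rate = rate_per_minute(9_000, 100)

# Variant-calling example: 11,000 participants in about three hours.
variant_rate = rate_per_minute(11_000, 3 * 60)

# "Under $15" in total gives an upper bound on the cost per participant.
cost_ceiling = 15 / 11_000

print(f"RNA-Seq: at least {rnaseq_rate:.1f} samples per minute")       # 90.0
print(f"Variant calling: about {variant_rate:.1f} participants per minute")  # 61.1
print(f"Cost: under ${cost_ceiling:.4f} per participant")              # 0.0014
```

These are aggregate rates across many parallel jobs, not the speed of any single job.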

Impact on Scientific Discovery

Since its launch in February 2016, the CGC has registered over 1,900 researchers from 150 institutions across 30 countries. The platform enables a diverse range of research, such as the study of mammary-tumor-associated RNAs (MaTARs). By analyzing TCGA data on the cloud, researchers were able to confirm the relevance of human MaTAR orthologs in clinical breast cancer.

References

  1. Lau, J. W., et al. (2017). The Cancer Genomics Cloud: Collaborative, reproducible, and democratized-a new paradigm in large-scale computational research. Cancer Res., 77(21), e3-e6.
  2. Stein, L. D., et al. (2015). Data analysis: Create a cloud commons. Nature, 523, 149-151.
  3. The Cancer Genome Atlas Research Network. (2017). Integrated genomic and molecular characterization of cervical cancer. Nature.
  4. Diermeier, S. D., et al. (2016). Mammary Tumor-Associated RNAs Impact Tumor Cell Proliferation, Invasion, and Migration. Cell Rep., 17, 261-274.
  5. Kaushik, G., et al. (2016). Rabix: an open-source workflow executor supporting recomputability and interoperability. Pac Symp Biocomput., 22, 154-165.