cancer_genomics_vis

Visualizing Multivariate Analysis of Cancer Data

Dick Kreisberg
Institute for Systems Biology
http://bit.ly/regulome

In breast cancer, which inter-chromosomal molecular features are most associated with BRCA1 mRNA expression?

Chr 22, mir-hsa-3200, associated with ERBB2/Her2 gene in paired tissues

Outline

Introduction
The Cancer Genome Atlas
Heterogeneous Cancer Data
Visualizing Multivariate Analysis
Challenges in Cancer Genomics Visualization

Who am I?

A software engineer in the Shmulevich Lab at the Institute for Systems Biology in Seattle, Washington in the USA.

Who is the Shmulevich Lab?

We are a group of bioinformaticians and software engineers focused on the challenges of computational biology.

What is the Institute for Systems Biology?

It is a non-profit research institute dedicated to understanding biological complexity. www.systemsbiology.org

The Cancer Genome Atlas

cancergenome.nih.gov

TCGA Tumor Data

Clinical information (age, gender, vital status, tumor grade, histology)
Tumor sample high-throughput molecular data:

mRNA expression (Agilent arrays and/or Illumina GA/HiSeq)
microRNA expression (Illumina GA/HiSeq)
Protein expression (Reverse Phase Protein Array)
DNA methylation (Illumina 27k/450k)
DNA copy-number segmentation (Affymetrix SNP6)
DNA mutations (Illumina)

Downstream Information:

Subtype / cluster assignments (single-platform analysis)
GISTIC regions of interest (copy-number)
Tumor sample purity / ploidy
Somatic mutation rates
Microsatellite instability

The Reality of Research Data:

messy
noisy
nuanced
contradictory

Heterogeneous Feature Matrix

Association Analysis

Pairwise Analysis

Pairwise Associations

Significantly associated features (according to corrected p-value) form a pair of connected nodes.
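
The edge rule above can be sketched in a few lines. This is a minimal illustration, not the pipeline used for these slides: the slide only says "corrected p-value", so the Benjamini-Hochberg correction, the 0.05 cutoff, and the feature names are all assumptions chosen for the example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values stay monotone in the rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def significant_edges(pairs, alpha=0.05):
    """pairs: list of (feature_a, feature_b, raw_p). Returns the kept edges."""
    adjusted = benjamini_hochberg([p for _, _, p in pairs])
    return [(a, b, q) for (a, b, _), q in zip(pairs, adjusted) if q <= alpha]


# Hypothetical pairwise test results (feature names and p-values are made up).
pairs = [
    ("BRCA1_mRNA", "feature_A", 0.001),
    ("BRCA1_mRNA", "feature_B", 0.04),
    ("feature_A", "feature_C", 0.03),
    ("feature_B", "feature_C", 0.60),
]
edges = significant_edges(pairs)
for a, b, q in edges:
    print(f"{a} -- {b} (adjusted p = {q:.3g})")
```

Each surviving tuple is one edge between two feature nodes. Note that two raw p-values below 0.05 no longer pass after correction, which is why the choice of correction method matters for how dense the resulting network looks.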

Why do we need tools that explore cancer data?

Quickly ask a question of the data
Lower the barrier to the data
Provide supporting information and context
Distribute the effort across the research community

Barriers: the data are messy, arrive in strange formats, and differ between tumor types.

Multivariate Analysis

There are many methods being used in cancer research. In general, we see three types of approaches.

Statistical
Information Theoretic
Machine Learning

Colorectal Cancer Aggressiveness

A combined p-value approach to identifying molecular features associated with tumor aggressiveness.
Vesteinn Thorsson

Clinical variables contributing to CRC tumor aggressiveness

These are the clinical variables we will focus on: those that correlate with tumor aggressiveness.

Histological type: mucinous (mucin-producing, with extracellular mucin) or non-mucinous adenocarcinoma.

H&E (hematoxylin and eosin) staining is a popular staining method in histology and the most widely used stain in medical diagnosis. For example, when a pathologist examines a biopsy of a suspected cancer, the histological section is likely to be stained with H&E and termed an H&E section, H+E section, or HE section.

Tumor aggressiveness as a composite of six p-value associations

Fisher's Product Method for combining statistical tests (Fisher, 1948)
The combined statistic follows a χ² distribution with 2 × 6 = 12 degrees of freedom
Weights w_i are used to equalize the contributions of the 6 clinical variables
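
A minimal sketch of the unweighted form of Fisher's method may help make the degrees-of-freedom count concrete. The statistic is -2 times the sum of the log p-values, compared against a chi-square distribution with 2k degrees of freedom for k tests. The weighting scheme mentioned on the slide is not specified there, so it is deliberately left out; the function names are illustrative, and the closed-form survival function below is valid only for even degrees of freedom (which is always the case here).

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable, for even df only.

    Uses the closed form exp(-x/2) * sum_{j<k} (x/2)^j / j!, with k = df/2.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    k = df // 2
    half = x / 2.0
    term = 1.0
    total = 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    return math.exp(-half) * total


def fisher_combined_p(p_values):
    """Unweighted Fisher combination: returns (statistic, combined p-value)."""
    if any(p <= 0.0 or p > 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    return statistic, chi2_sf_even_df(statistic, 2 * len(p_values))


# Six hypothetical p-values, one per clinical variable (values are made up).
stat, combined = fisher_combined_p([0.01, 0.20, 0.03, 0.50, 0.04, 0.10])
print(f"statistic = {stat:.2f} on 12 degrees of freedom, combined p = {combined:.3g}")
```

With six clinical variables the reference distribution has 2 × 6 = 12 degrees of freedom, matching the slide.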

Random Forest

Decision Tree Ensemble Learning
RF-ACE: Random Forest with Artificial Contrast Ensembles, developed by Timo Erkkila
http://rf-ace.googlecode.com

Breiman, Leo (2001). "Random Forests", Machine Learning, 45 (1): 5-32.
Tuv, Eugene, et al. "Feature selection with ensembles, artificial variables, and redundancy elimination." The Journal of Machine Learning Research 10 (2009): 1341-1366.

A Decision Tree

Learning the data for the Histological Feature "HER2 Status"

Randomness - Bootstrapping
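
As a rough illustration of bootstrapping in a random forest: each tree is trained on a sample of the tumor samples drawn with replacement, and the samples that were never drawn (the "out-of-bag" set) can be used to check that tree. The sketch below uses only the standard library; the function name and the sample count are assumptions made for the example, not part of RF-ACE.

```python
import random


def bootstrap_sample(n_samples, rng):
    """Draw n_samples indices with replacement; also return the unused ones."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag


rng = random.Random(0)  # fixed seed so the example is repeatable
in_bag, out_of_bag = bootstrap_sample(10, rng)
print("in-bag indices (with repeats):", in_bag)
print("out-of-bag indices:", out_of_bag)
```

Because sampling is with replacement, some indices repeat and others are left out, so every tree sees a slightly different view of the same data.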

Randomness - Bagging

A Forest of Voting Trees
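
The voting step can be sketched as a simple majority count over the individual tree predictions. The three "trees" below are hand-written one-split rules standing in for trained trees; the feature names and thresholds are hypothetical and chosen only to echo the HER2 status example.

```python
from collections import Counter


def forest_predict(trees, sample):
    """Return the majority label and the fraction of trees that voted for it."""
    votes = Counter(tree(sample) for tree in trees)
    label, count = votes.most_common(1)[0]
    return label, count / len(trees)


# Hypothetical stand-ins for trained decision trees (thresholds are made up).
trees = [
    lambda s: "Positive" if s["ERBB2_expression"] > 2.0 else "Negative",
    lambda s: "Positive" if s["ERBB2_copy_number"] > 0.5 else "Negative",
    lambda s: "Positive" if s["GRB7_expression"] > 1.5 else "Negative",
]

sample = {"ERBB2_expression": 3.1, "ERBB2_copy_number": 0.2, "GRB7_expression": 2.4}
label, support = forest_predict(trees, sample)
print(f"predicted HER2 status: {label} ({support:.0%} of trees)")
```

Using an odd number of trees avoids ties in a two-class problem; with an even number, this sketch would fall back to whichever label was counted first.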

Multi-scale Association Explorer

A tool for exploring associations among genomic and non-genomic features.

Features for Exploring the Data

Edges Between Genomic Features
Edges With Non-Genomic Features
Managing Scale
Analytical Insight and Oversight (scatterplot, violinplot, cubbyhole)

Go to the tool!

PubCrawl

Incorporating Semantic and Interaction Associations
Andrea Eakin, Brady Bernard

Normalized Google Distance

A measure of semantic similarity.
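
For reference, the standard definition of Normalized Google Distance compares how often two terms occur separately and together in an indexed corpus. The sketch below implements that formula; how PubCrawl obtains its counts is not described on the slide, so the argument names and the example counts are assumptions.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """NGD from occurrence counts.

    fx, fy: number of documents containing term x, term y
    fxy:    number of documents containing both terms
    n:      total number of indexed documents
    """
    if fx <= 0 or fy <= 0:
        raise ValueError("each term must occur at least once")
    if fxy <= 0:
        return math.inf  # the terms never co-occur
    log_fx, log_fy = math.log(fx), math.log(fy)
    numerator = max(log_fx, log_fy) - math.log(fxy)
    denominator = math.log(n) - min(log_fx, log_fy)
    return numerator / denominator


# Hypothetical counts: two terms with 100 documents each, 10 in common,
# in a corpus of 10,000 documents.
distance = normalized_google_distance(100, 100, 10, 10_000)
print(f"NGD = {distance:.2f}")
```

Values near 0 mean the two terms almost always appear together; larger values mean they are rarely mentioned in the same document.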

Protein Domain Interactions

Raghavachari, Balaji, et al. "DOMINE: a database of protein domain interactions." Nucleic acids research 36.suppl 1 (2008): D656-D661.

Semantic and Domain Associations for MYC combined with GBM statistical and multivariate analysis.

Challenges in (Interactive) Cancer Visualization

There are many challenges left!

Associations

Problems:

Comparison and grouping of multiple cancer types (and subtypes) across all types of features
Inclusion of functional, semantic, analytical and physical associations
Highly connected (and confusing) networks

Possible Solutions:

Multigraphs
Managed layout
Automatic grouping based on scale
Topological Data Analysis (identify the simplicial complexes)

Lum, P. Y., et al. "Extracting insights from the shape of complex data using topology." Scientific Reports 3 (2013).

Scale

Problems:

Tying together information at many scales: Protein -> Pathway -> Hallmark -> Tumor -> Patient
Portraying feature data per patient with thousands of samples across multiple cancer types
A seemingly endless number of interesting results

Possible Solutions:

???
Collaborative tools that enable scientific effort to be distributed

Interpretation

How best to convey the information to the cancer biologist?

Which visual abstractions are most useful to reason on?

When is an exploratory tool called for? When is it not called for?

Scared?

Don't be

High-Throughput Computation

600,000 cores running Random Forest on Google Compute Engine

The Center for Systems Analysis of the Cancer Regulome

Ilya Shmulevich (ISB) + Wei Zhang (MD Anderson Cancer Center)

www.cancerregulome.org

Projects @ ISB

Regulome Explorer explorer.cancerregulome.org
Pubcrawl explorer.cancerregulome.org/pubcrawl/
Genespot www.genespot.org
Transcriptional Regulation & Epigenetic Landscape trel.systemsbiology.net

Search 'codefor@systemsbiology.org' at code.google.com

Acknowledgements

Ilya Shmulevich, Wei Zhang, Andrea Eakin, Hector Rovira, Da Yang, Jake Lin, Ryan Bressler, Yuexin Lin, Brady Bernard, Timo Erkkila, Yan Sun, Sheila Reynolds, Vesteinn Thorsson, Lisa Iype, Kalle Leinonen, Patrick May, Lesley Wilkerson

dick.kreisberg@systemsbiology.org

何か質問はありますか？

Nanika shitsumon wa arimasu ka? Are there any questions?

Resources

d3.js d3js.org
Science.js github.com/jasondavies/science.js/
CytoscapeWeb cytoscapeweb.cytoscape.org/
Cytoscape.js cytoscape.github.com/cytoscape.js/
Circos circos.ca

Reveal.JS

lab.hakim.se/reveal-js/

JSFiddle

jsfiddle.net


rbkreisberg
