Visualizing Multivariate Analysis of Cancer Data
Dick Kreisberg
Institute for Systems Biology
http://bit.ly/regulome
Chr 22, mir-hsa-3200, associated with ERBB2/Her2 gene in paired tissues
Outline
- Introduction
- The Cancer Genome Atlas
- Heterogeneous Cancer Data
- Visualizing Multivariate Analysis
- Challenges in Cancer Genomics Visualization
Who am I?
A software engineer in the Shmulevich Lab at the Institute for Systems Biology in Seattle, Washington, USA.
Who is the Shmulevich Lab?
We are a group of bioinformaticians and software engineers focused on the challenges of computational biology.
What is the Institute for Systems Biology?
It is a non-profit research institute dedicated to understanding biological complexity.
www.systemsbiology.org
The Cancer Genome Atlas
cancergenome.nih.gov
TCGA Tumor Data
Clinical information (age, gender, vital status, tumor grade, histology)
Tumor sample high-throughput molecular data
mRNA expression (Agilent arrays and/or Illumina GA/HiSeq)
microRNA expression (Illumina GA/HiSeq)
protein expression (Reverse Phase Protein Array)
DNA methylation (Illumina 27k/450k)
DNA copy-number segmentation (Affymetrix SNP6)
DNA mutations (Illumina)
Downstream Information:
Subtype / Cluster assignments (single-platform analysis)
GISTIC regions of interest (copy-number)
Tumor Sample purity / ploidy
Somatic Mutation rates
Micro-Satellite Instability
The Reality of Research Data:
messy
noisy
nuanced
contradictory
Heterogeneous Feature Matrix
Pairwise Associations
Significantly associated features (according to corrected p-value) form a pair of connected nodes.
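As a minimal sketch of how corrected p-values can be turned into edges: the slides do not say which correction is used, so the Benjamini-Hochberg procedure below is an assumption, as are the feature names, the raw p-values, and the 0.05 cutoff.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant_edges(raw: Dict[Pair, float], alpha: float = 0.05) -> List[Pair]:
    """Keep the feature pairs whose adjusted p-value is at most alpha."""
    pairs = list(raw)
    adjusted = benjamini_hochberg([raw[pair] for pair in pairs])
    return [pair for pair, q in zip(pairs, adjusted) if q <= alpha]


# Hypothetical pairwise test results (feature names and values are made up).
raw_p = {
    ("ERBB2_expression", "ERBB2_copy_number"): 0.0001,
    ("age", "TP53_mutation"): 0.04,
    ("gender", "MLH1_methylation"): 0.6,
}
edges = significant_edges(raw_p)
print(edges)
```

Each surviving pair becomes an edge between two feature nodes. A pair with a raw p-value just under the cutoff can still be dropped after correction.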
Why do we need tools that explore cancer data?
- Quickly ask a question of the data
- Lower the barrier to the data
- Provide supporting information and context
- Distribute the effort across the research community
Barriers: the data is messy, arrives in strange formats, and differs between tumors.
Multivariate Analysis
There are many methods being used in cancer research. In general, we see three types of approaches.
Statistical
Information Theoretic
Machine Learning
Colorectal Cancer Aggressiveness
A combined p-value approach to identifying molecular features associated with tumor aggressiveness.
Vesteinn Thorsson
Clinical variables contributing to CRC tumor aggressiveness
These are the clinical variables we will focus on: those that correlate with tumor aggressiveness.
Histological Type. Mucinous (mucin producing ~ extracellular mucin) or non-mucinous adenocarcinoma
H&E stain, HE stain or hematoxylin and eosin stain is a popular staining method in histology. It is the most widely used stain in medical diagnosis; for example when a pathologist looks at a biopsy of a suspected cancer, the histological section is likely to be stained with H&E and termed H&E section, H+E section, or HE section.
Tumor aggressiveness as a composite of six p-value associations
Fisher’s Product Method for combining statistical tests (Fisher, 1948)
Follows a χ² distribution with 2 × 6 = 12 degrees of freedom
Weights w_i used to equalize contributions of the 6 clinical variables
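A minimal sketch of the unweighted form of Fisher's method: for k independent p-values, the statistic -2 Σ ln(p_i) is compared against a chi-square distribution with 2k degrees of freedom. The weighting scheme is not specified in the slides, so it is left out here. The function names and the six example p-values are illustrative only.

```python
import math
from typing import Sequence


def chi2_sf_even_df(x: float, df: int) -> float:
    """Chi-square survival function for an even number of degrees of freedom.

    For df = 2k the tail probability has the closed form
    exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


def fisher_combined_p(p_values: Sequence[float]) -> float:
    """Combine independent p-values (each in (0, 1]) with Fisher's method."""
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    return chi2_sf_even_df(statistic, 2 * len(p_values))


# Six made-up p-values, one per clinical variable: 12 degrees of freedom.
example = [0.02, 0.10, 0.30, 0.04, 0.50, 0.08]
print(fisher_combined_p(example))
```

With a single p-value the combined result equals that p-value, which is a quick sanity check on the implementation.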
Random Forest
Decision Tree Ensemble Learning
RF-ACE: Random Forest with Artificial Contrast Ensembles
developed by Timo Erkkila
http://rf-ace.googlecode.com
Breiman, Leo (2001). "Random Forests", Machine Learning, 45(1):5-32.
Tuv, Eugene, et al. "Feature selection with ensembles, artificial variables, and redundancy elimination." The Journal of Machine Learning Research 10 (2009): 1341-1366.
A Decision Tree
Learning the data for the Histological Feature "HER2 Status"
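To illustrate what one node of such a tree does, here is a minimal sketch that picks a threshold on a single numeric feature by minimizing weighted Gini impurity. This is the generic CART-style criterion, not necessarily what RF-ACE uses internally. The expression values and HER2 labels are made up.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def gini(labels: Sequence[str]) -> float:
    """Gini impurity of a set of class labels (0.0 for a pure set)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(
    values: Sequence[float], labels: Sequence[str]
) -> Tuple[Optional[float], float]:
    """Return (threshold, weighted Gini) for the best split `value <= threshold`.

    The threshold is None when no candidate split improves on the parent node.
    """
    n = len(values)
    best_t: Optional[float] = None
    best_score = gini(labels)
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        t = (low + high) / 2.0
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score


# Hypothetical ERBB2 expression values and HER2 status labels.
expression = [0.2, 0.5, 0.9, 2.1, 2.8, 3.3]
her2_status = ["negative"] * 3 + ["positive"] * 3
threshold, impurity = best_threshold(expression, her2_status)
print(threshold, impurity)
```

A full tree repeats this search over all candidate features at each node and recurses into the two resulting subsets.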
Randomness - Bootstrapping
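A minimal sketch of the bootstrapping step in a random forest: each tree is trained on a sample of the patients drawn with replacement, and the patients that were never drawn form that tree's out-of-bag set. The function name, sample count, and seed are illustrative.

```python
import random
from typing import List, Tuple


def bootstrap_sample(n_samples: int, rng: random.Random) -> Tuple[List[int], List[int]]:
    """Draw n_samples indices with replacement; also return the unused indices."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in drawn]
    return in_bag, out_of_bag


rng = random.Random(42)  # fixed seed so the sketch is reproducible
in_bag, out_of_bag = bootstrap_sample(10, rng)
print("in bag:    ", sorted(in_bag))
print("out of bag:", out_of_bag)
```

For large sample counts, about 36.8% of samples (1/e) are left out of each bootstrap on average. Random forests use these out-of-bag samples to estimate prediction error without a separate test set.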
Multi-scale Association Explorer
A tool for exploring associations among genomic and non-genomic features.
Features for Exploring the Data
- Edges Between Genomic Features
- Edges With Non-Genomic Features
- Managing Scale
- Analytical Insight and Oversight (scatterplot, violin plot, cubbyhole)
PubCrawl
Incorporating Semantic and Interaction Associations
Andrea Eakin, Brady Bernard
Normalized Google Distance
A measure of semantic similarity.
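A minimal sketch of the Normalized Google Distance as defined by Cilibrasi and Vitanyi, computed from document counts. Where the counts come from in PubCrawl is not stated in the slides, so the corpus size and hit counts below are placeholders.

```python
import math


def normalized_google_distance(fx: float, fy: float, fxy: float, n: float) -> float:
    """NGD from hit counts.

    fx, fy: number of documents mentioning term x, term y
    fxy:    number of documents mentioning both terms
    n:      total number of documents in the corpus
    """
    if fx <= 0 or fy <= 0:
        raise ValueError("each term must occur at least once")
    if n <= min(fx, fy):
        raise ValueError("corpus size must exceed the smaller term count")
    if fxy <= 0:
        return math.inf  # terms never co-occur
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))


# Placeholder counts: two terms in a corpus of one million documents.
print(normalized_google_distance(fx=1000, fy=100, fxy=50, n=1_000_000))
```

A value of 0 means the two terms always appear together. Larger values mean they co-occur less often than their individual frequencies would suggest.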
Protein Domain Interactions
Raghavachari, Balaji, et al. "DOMINE: a database of protein domain interactions." Nucleic Acids Research 36.suppl 1 (2008): D656-D661.
Challenges in (Interactive) Cancer Visualization
There are many challenges left!
Associations
Problems:
- Comparison and grouping of multiple cancer types (and subtypes) across all types of features
- Inclusion of functional, semantic, analytical and physical associations
- Highly connected (and confusing) networks
Possible Solutions:
- Multigraphs
- Managed layout
- Automatic grouping based on scale
- Topological Data Analysis (identify the simplicial complexes)
Lum, P. Y., et al. "Extracting insights from the shape of complex data using topology." Scientific Reports 3 (2013).
Scale
Problems:
Tying together information at many scales
Protein -> Pathway -> Hallmark -> Tumor -> Patient
Portraying feature data per patient with thousands of samples across multiple cancer types
A seemingly endless number of interesting results
Possible Solutions:
???
Collaborative tools that enable scientific effort to be distributed.
Interpretation
How best to convey the information to the cancer biologist?
Which visual abstractions are most useful to reason on?
When is an exploratory tool called for? When is it not called for?
High-Throughput Computation
600,000 cores running Random Forest on Google Compute Engine
The Center for Systems Analysis of the Cancer Regulome
Ilya Shmulevich (ISB) + Wei Zhang (MD Anderson Cancer Center)
www.cancerregulome.org
Acknowledgements
Ilya Shmulevich
Wei Zhang
Andrea Eakin
Hector Rovira
Da Yang
Jake Lin
Ryan Bressler
Yuexin Lin
Brady Bernard
Timo Erkkila
Yan Sun
Sheila Reynolds
Vesteinn Thorsson
Lisa Iype
Kalle Leinonen
Patrick May
Lesley Wilkerson
dick.kreisberg@systemsbiology.org
Nanika shitsumon wa arimasu ka?
Are there any questions?
Resources
d3.js
d3js.org
Science.js
github.com/jasondavies/science.js/
CytoscapeWeb
cytoscapeweb.cytoscape.org/
Cytoscape.js
cytoscape.github.com/cytoscape.js/
Circos
circos.ca
Reveal.JS
lab.hakim.se/reveal-js/
JSFiddle
jsfiddle.net