How machine learning helps cancer research

Evelina Gabasova

@evelgab

MRC Cancer Unit, University of Cambridge

DNA

DNA path: 2.5 km, 10,000 colored stripes representing the BRCA2 gene, one of approx. 30,000 genes in the human genome. 3 billion base pairs in total. Whole DNA: 20 times around the Earth.

Cost of whole genome sequencing

The whole field is about 15 years old. The Human Genome Project finished in 2003. It was a joint effort of 20 research centres.

Sequencing data

It is projected that the volume of sequencing data alone will soon exceed the data processed in astronomy and particle physics. Within a couple of years, genomics is projected to hold more data than YouTube (300 hours of video are uploaded to YouTube every minute). Challenge: high dimensionality (DNA is huge).

DNA and genes

DNA is stable storage. It contains genes, and each gene encodes a protein. To synthesize a protein, we need to copy out a piece of DNA. Explain what gene expression is: a crucial part of analysing cells.

Cancer

Genetic mutations
Oncogenes and tumour suppressors

BRCA1 and BRCA2 are chromosome guardians

Cancer is not a single disease

Cancer is a group of diseases characterized by uncontrolled growth of tissues. BRCA genes act as guardians during chromosome duplication: they repair errors made during the duplication. BRCA2 was discovered in 1994. Cancer ~ bugs in the genetic code: a mutation in a gene (a change in a base pair) gives a malformed protein with a different function. Finding the causal mechanisms of cancer is like debugging a large program in an unknown programming language that you can't easily run (it's possible to isolate little parts of the system and tweak them in the lab).

Clustering

Clustering is about finding groups in data whose members are similar to each other and different from the members of other groups. It helps identify types of cancer, because cancer is not a single disease. We use gene expression to find subtypes of cancer that behave differently in different patients.

Example

Clustering wholesale customers

440 wholesale customers

Annual spending on

Fresh produce
Milk products
Grocery products
Frozen products
Detergents and paper
Delicatessen

An unsupervised algorithm tries to find patterns in data. Hotels spend a lot on paper and detergents.

Methods for clustering

k-means clustering
hierarchical clustering
spectral clustering
Gaussian mixture model and other probabilistic methods
...
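As an illustration of the first method in the list, here is a minimal k-means sketch in plain Python. The six customers and their spending figures are made up for the example (they are not rows from the wholesale dataset), and the function names `kmeans` and `squared_distance` are hypothetical. It alternates the two standard steps: assign each point to its nearest centroid, then move each centroid to the mean of its points.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=100, seed=0):
    """Plain k-means: returns (assignment, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    assignment = None
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing changed, so the clustering has converged
        assignment = new_assignment
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids


# Hypothetical annual spending: (fresh produce, detergents and paper).
customers = [
    (12000, 500), (11000, 700), (13000, 400),   # mostly fresh produce
    (2000, 6000), (2500, 7000), (1800, 6500),   # mostly detergents and paper
]
labels, centres = kmeans(customers, k=2)
print(labels)
print(centres)
```

The cluster numbers themselves are arbitrary; what matters is which customers end up sharing a label.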

find gif of k-means or gaussian mixture model

Visualisation of high-dimensional data

Principal component analysis

Clustering cancer data

Genes instead of customers

Gene expression instead of spending on products

Why would we even want to cluster cancer data?

Conventional medicine

Precision medicine

Clustering in cancer research

TCGA breast cancer Clustering 368 tumour samples based on expression of 648 genes.

Find different subtypes of cancer. [DEMO - cluster gene expression data from TCGA] Survival analysis.

Integrative clustering

A single data type in isolation doesn't tell us much. Example of what I do: identify groups that exist across these different data types. Problem: we cannot just put the data together, because they measure different things, so we have to devise more intelligent methods.

Collaborative filtering

What is collaborative filtering

Example

The Netflix prize

User     Film 1   Film 2   Film 3   Film 4   ...   Film 1000
Alice                                        ...
Bob                                          ...
Carol                                        ...
Zoe                                          ...

Matrix factorization

what methods are used to compute matrix factorisation?
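One common answer to the question above is stochastic gradient descent over the ratings that are actually observed, which became widely known during the Netflix prize. The sketch below is a minimal version under assumptions: the ratings, the rank, the learning rate and the regularisation strength are all made up, and `factorize`, `predict` and `rmse` are hypothetical helper names. Each user and each film gets a short vector, and a rating is predicted as their dot product.

```python
import random


def predict(P, Q, user, film):
    """Predicted rating: dot product of the user and film factor vectors."""
    return sum(p * q for p, q in zip(P[user], Q[film]))


def rmse(ratings, P, Q):
    """Root mean squared error over the observed ratings only."""
    errors = [(r - predict(P, Q, u, f)) ** 2 for (u, f), r in ratings.items()]
    return (sum(errors) / len(errors)) ** 0.5


def factorize(ratings, n_users, n_films, rank=2, epochs=2000,
              lr=0.01, reg=0.02, seed=0):
    """Fit user factors P and film factors Q by stochastic gradient descent."""
    rng = random.Random(seed)
    P = [[rng.random() for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.random() for _ in range(rank)] for _ in range(n_films)]
    for _ in range(epochs):
        for (u, f), r in ratings.items():
            err = r - predict(P, Q, u, f)
            for k in range(rank):
                pu, qf = P[u][k], Q[f][k]
                # Gradient step on squared error with L2 regularisation.
                P[u][k] += lr * (err * qf - reg * pu)
                Q[f][k] += lr * (err * pu - reg * qf)
    return P, Q


# Hypothetical ratings, keyed by (user index, film index); missing pairs are unrated.
users = ["Alice", "Bob", "Carol", "Zoe"]
ratings = {
    (0, 0): 5, (0, 1): 4, (0, 2): 1,
    (1, 0): 4, (1, 1): 5, (1, 3): 1,
    (2, 0): 1, (2, 2): 5, (2, 3): 4,
    (3, 1): 1, (3, 2): 4, (3, 3): 5,
}
P, Q = factorize(ratings, n_users=4, n_films=4)
print("training RMSE:", rmse(ratings, P, Q))
print("Alice, film 4 (unrated):", predict(P, Q, 0, 3))
```

The prediction for an unrated pair, such as Alice and film 4, is what a recommender would use to rank films the user has not seen.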

Collaborative filtering of cancer data

Patients instead of users

DNA mutations instead of film ratings

Mutational signatures

Patient   C/A   C/G   C/T   T/A   T/C   T/G
Alice
Bob
Carol
...
Zoe

C/T and T/C are associated with environmental damage. Different cancers have different signatures.

Matrix factorization to identify features
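A sketch of how such features could be extracted from a patients-by-mutation-type count matrix, assuming non-negative matrix factorisation with multiplicative updates (a common choice for count data; the slides do not say which algorithm is used). The counts below are invented, and `nmf`, `matmul`, `transpose` and `frobenius_error` are hypothetical helper names. The matrix V is approximated by W times H, where each row of H is a candidate signature over the six mutation types and each row of W says how strongly each signature is present in a patient.

```python
import random


def matmul(A, B):
    """Matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_error(V, W, H):
    """Frobenius norm of V - W*H."""
    WH = matmul(W, H)
    return sum((v - wh) ** 2
               for row_v, row_wh in zip(V, WH)
               for v, wh in zip(row_v, row_wh)) ** 0.5


def nmf(V, k, iterations=500, seed=0, eps=1e-9):
    """Non-negative factorisation V ~ W*H using multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise.
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise.
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
    return W, H


mutation_types = ["C/A", "C/G", "C/T", "T/A", "T/C", "T/G"]
# Hypothetical mutation counts, one row per patient (Alice, Bob, Carol, Zoe).
V = [
    [40, 5, 60, 4, 30, 3],
    [38, 6, 55, 5, 28, 4],
    [5, 30, 8, 35, 6, 25],
    [6, 28, 10, 33, 5, 27],
]
W, H = nmf(V, k=2)
print("reconstruction error:", frobenius_error(V, W, H))
for signature in H:
    print(dict(zip(mutation_types, (round(x, 2) for x in signature))))
```

Because both factors stay non-negative, each patient's counts are described as a sum of signatures, which is what makes the rows of H readable as mutational processes.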

Proving system stability

Theorem proving

SAT

(A ∨ ¬B) ∧ (¬A ∨ B)

A = true, B = true
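With only two variables, the formula above can be checked by trying every assignment. The sketch below does exactly that; real SAT solvers use far more efficient search, and the function name `formula` is just for this example.

```python
from itertools import product


def formula(a, b):
    """(A or not B) and (not A or B)"""
    return (a or not b) and (not a or b)


# Enumerate all four truth assignments and keep the satisfying ones.
solutions = [(a, b) for a, b in product([False, True], repeat=2) if formula(a, b)]
print(solutions)  # [(False, False), (True, True)]
```

The assignment A = true, B = true from the slide is one of the two satisfying assignments; the formula simply says that A and B must have the same value.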

Theorem proving

Satisfiability Modulo Theories (SMT)

(A ∨ ¬B) ∧ (¬A ∨ B)

((a > 3) ∨ (b < 1)) ∧ ((a < 5) ∨ (b = 0))

a = 4, b = 0
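To make the integer example concrete, the sketch below checks the slide's assignment and searches a small, arbitrarily chosen range of integers for any satisfying pair. This is only an illustration: an SMT solver such as Z3 reasons about the arithmetic symbolically instead of enumerating values, and the names `constraint` and `find_model` and the search bounds are assumptions made for this example.

```python
from itertools import product


def constraint(a, b):
    """((a > 3) or (b < 1)) and ((a < 5) or (b == 0))"""
    return (a > 3 or b < 1) and (a < 5 or b == 0)


def find_model(low=-10, high=10):
    """Return the first (a, b) in the bounded range that satisfies the constraint."""
    for a, b in product(range(low, high + 1), repeat=2):
        if constraint(a, b):
            return a, b
    return None  # no model inside the bounds (says nothing about values outside)


print(constraint(4, 0))   # the assignment from the slide
print(find_model())       # some other satisfying assignment
```

Any pair the search returns is a valid model, but a failed search would not prove unsatisfiability, which is one reason a symbolic solver is needed in practice.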

[DEMO]

Software verification

Z3 theorem prover

Preconditions
Postconditions
Loop conditions
Assertions
...
➜ SMT formulas

Software verification

Spec#

Software verification

Spec#

Proving stability of biological processes

Proteins
Genes
Receptors
…
➜ Variables

$$
v \to \begin{cases}
v + 1 & \text{if } v < T(v) \\
v & \text{if } v = T(v) \\
v - 1 & \text{if } v > T(v)
\end{cases}
$$

Bio Model Analyser

The traditional way to do this is via systems of differential equations. Pathways consist of inhibition and activation. The target function T is the average effect of inhibitors and activators. If we can get to a state where v = T(v) for all variables, the system stabilises. Transform this into an SMT formula and find a value of the state. Iterate until the system stabilises or ends up in a cycle.
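The update rule and the "iterate until stable or cycling" idea can be sketched as a direct simulation from one starting state. This is not how Bio Model Analyser itself works (the notes describe an SMT encoding), and several things here are assumptions for illustration: the tiny networks, the exact form of the target function (average activator level minus average inhibitor level, clamped to the allowed range, with a missing activator set treated as fully on), and the names `target`, `step` and `run`.

```python
def target(name, state, activators, inhibitors, max_level):
    """Assumed target function T: avg(activators) - avg(inhibitors), clamped."""
    act = [state[s] for s in activators.get(name, [])]
    inh = [state[s] for s in inhibitors.get(name, [])]
    if not act and not inh:
        return state[name]  # a variable without inputs keeps its level
    a = sum(act) / len(act) if act else max_level
    i = sum(inh) / len(inh) if inh else 0
    return int(min(max(a - i, 0), max_level) + 0.5)  # clamp, then round


def step(state, activators, inhibitors, max_level):
    """Move every variable one level towards its target, all at once."""
    new_state = {}
    for name, v in state.items():
        t = target(name, state, activators, inhibitors, max_level)
        new_state[name] = v + (t > v) - (t < v)
    return new_state


def run(state, activators, inhibitors, max_level, max_steps=1000):
    """Iterate until v = T(v) for all variables or a state repeats."""
    seen = set()
    for _ in range(max_steps):
        nxt = step(state, activators, inhibitors, max_level)
        if nxt == state:
            return "stable", state
        key = tuple(sorted(state.items()))
        if key in seen:
            return "cycle", state
        seen.add(key)
        state = nxt
    return "undecided", state


# Hypothetical pathway: signal -> kinase -> growth, with a suppressor inhibiting growth.
activators = {"kinase": ["signal"], "growth": ["kinase"]}
inhibitors = {"growth": ["suppressor"]}
start = {"signal": 2, "suppressor": 0, "kinase": 0, "growth": 0}
print(run(start, activators, inhibitors, max_level=2))

# Hypothetical negative feedback loop: x activates y, y inhibits x.
loop = run({"x": 0, "y": 0}, {"y": ["x"]}, {"x": ["y"]}, max_level=1)
print(loop)
```

In the first network the state settles with growth at its maximum level; raising the suppressor to 2 in the starting state drives growth to 0 instead. The feedback loop never settles and is reported as a cycle.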

Chronic myeloid leukemia

Proving stability of biological systems

Chronic myeloid leukemia

The model is constructed from interactions identified in published literature: 54 nodes and 104 interactions, sourced from 160 papers. It is a relatively simple system, because the disease is caused by a single change when two chromosomes overlap and exchange parts. The model correctly corresponds to the currently existing drug targets. It allows testing what happens when we disable a receptor or a specific protein. It turns out the system is remarkably stable: it needs disruption of at least two elements.

Machine learning is not just

for targeted advertising

or algorithmic trading

@evelgab evelina@evelinag.com github.com/evelinag evelinag.com

Links
