Metagenomic Assembly Evaluation

Decreasing the assembler space

Ino de Bruijn / @inodb
Presentation at ino.pm/2014hack

Content

de Bruijn graph (optional)
de novo Metagenomic Assembly benchmark
- Comparing assemblers
- Classifying errors in a single assembly

de novo metagenomic assembly benchmark

Assembling Illumina paired end reads with de Bruijn graph assemblers

de Bruijn Graph

cut up reads into overlapping kmers
number of unique kmers determines the computational requirements
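
A minimal sketch of the two points above: cutting reads into overlapping kmers and counting how many unique kmers result. The reads are invented toy data, and reverse complements are not merged the way a real assembler would merge them.

```python
from collections import Counter


def kmers(read, k):
    """Yield every overlapping k-mer of a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


reads = ["ACGTAC", "CGTACG"]  # toy reads, invented for illustration
counts = count_kmers(reads, 3)
print(len(counts), "unique 3-mers")
print(dict(counts))
```

The number of keys in `counts` is what drives memory use in this sketch; the per-kmer count is the kmer coverage used later in the talk.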

Choosing the right k

smaller k: more sensitive, more coverage
larger k: more specific, less coverage
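
The "less coverage" side of the trade-off can be made concrete with the usual relation between base coverage C, read length L and kmer coverage, C_k = C * (L - k + 1) / L. The sketch below only evaluates that formula; the coverage of 30 and read length of 100 are assumed example values, not figures from the benchmark.

```python
def kmer_coverage(base_coverage, read_length, k):
    """Expected k-mer coverage for error-free reads of a fixed length."""
    return base_coverage * (read_length - k + 1) / read_length


# Assumed example: 30x base coverage, 100 bp reads.
for k in (21, 31, 51, 75):
    print(k, round(kmer_coverage(30, 100, k), 1))
```

Each read of length L contributes only L - k + 1 kmers, so raising k lowers the kmer coverage even though the sequencing depth is unchanged.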

de Bruijn Graph information available

kmer coverage
read that created the kmer
overlap between kmers
insert size distribution
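
As a rough illustration of the first and third items, the sketch below builds a toy de Bruijn graph that records kmer coverage and the overlap edges between consecutive kmers. It is a simplification: read origin and insert sizes are not tracked, and the function name and reads are made up for this example.

```python
from collections import Counter, defaultdict


def build_graph(reads, k):
    """Return (coverage, successors) for the k-mers of the given reads.

    coverage[kmer]   -> number of times the k-mer was seen
    successors[kmer] -> k-mers that followed it in some read (overlap of k-1)
    """
    coverage = Counter()
    successors = defaultdict(set)
    for read in reads:
        previous = None
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            coverage[kmer] += 1
            if previous is not None:
                successors[previous].add(kmer)
            previous = kmer
    return coverage, successors


coverage, successors = build_graph(["ACGTA", "CGTAC"], 3)  # toy reads
print(dict(coverage))
print({kmer: sorted(nxt) for kmer, nxt in successors.items()})
```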

Single genome

Estimate coverage distribution over genome
Identify repeats
Regions appearing multiple times
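
For a single genome, the idea can be sketched as: take the typical kmer coverage as the baseline and flag kmers well above it as repeat candidates. The median as baseline and the factor of 1.5 are arbitrary assumptions for this illustration, and the coverage values are invented.

```python
from statistics import median


def repeat_candidates(coverage, factor=1.5):
    """Return (typical coverage, k-mers whose coverage is >= factor * typical)."""
    typical = median(coverage.values())
    flagged = {kmer for kmer, count in coverage.items() if count >= factor * typical}
    return typical, flagged


# Invented k-mer coverages: one k-mer is seen about twice as often as the rest.
toy_coverage = {"AAA": 10, "AAC": 11, "ACG": 9, "CGT": 21, "GTT": 10}
typical, flagged = repeat_candidates(toy_coverage)
print(typical, flagged)
```

This only works because a single genome has one expected coverage level, which is exactly the assumption that breaks down for metagenomes.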

Metagenomics complicates the process

number of genomes unknown
coverage of genomes differs
- different abundance and length per genome
- short k for low-abundance genomes
- large k for high-abundance genomes
closely related organisms
- shared regions longer than k

Sounds impossible!

Let's validate some assemblers!

Mock community

in vitro mock community of 52 species
reference genomes available
even community
log-normal community

Assembly recipes

Contiging single kmer
```
{velvet,ray,metavelvet}noscaf
```

Scaffolding single kmer
```
{velvet,ray,metavelvet}scaf
{velvet,ray,metavelvet}noscaf => {velvet,ray,metavelvet}bambus2
```

Merging kmers 21 to 75
```
{velvet,ray,metavelvet}noscaf => {newbler,minimus2}
```

Merging kmers and scaffolding
```
{velvet,ray,metavelvet}noscaf => {newbler,minimus2}bambus2
```
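
The brace notation above expands to a small grid of recipes. The sketch below enumerates the first three families. The merged names follow the `{velvet,ray}noscafnewbler` pattern used in the conclusions. The fourth family (merging plus scaffolding) is left out because the slides do not show how those recipes are named.

```python
from itertools import product

assemblers = ["velvet", "ray", "metavelvet"]
mergers = ["newbler", "minimus2"]

# Contigs from a single kmer size.
contig = [assembler + "noscaf" for assembler in assemblers]

# Scaffolds from a single kmer size: built-in scaffolder or Bambus2.
scaffold = [assembler + suffix for assembler, suffix in product(assemblers, ["scaf", "bambus2"])]

# Contigs from several kmer sizes merged with Newbler or Minimus2.
merged = [recipe + merger for recipe, merger in product(contig, mergers)]

for family in (contig, scaffold, merged):
    print(family)
```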

GNU Make Pipeline to run assembly recipes

# make velvet assemblies
make velvet
# make ray assemblies
make ray
# make all assemblies
make all

MetAssemble

Pipeline can schedule rules

PBS or SBATCH
Resource usage per rule
Includes job dependencies

$ make -f Makefile-sbatch raynoscaf31

sbatch  -A b2010008 \
    -J processed-reads/pair.qtrim \
    -t 01:00:00 -p core \
    /glob/inod/github/metassemble/bin/wrapper_jobscript.sbatch \
    make -e processed-reads/pair.qtrim

sbatch -d afterok:metassemble/processed-reads/pair.qtrim -A b2010008 \
    -J assemblies/ray/noscaf/noscaf_31/ma-contigs.fa \
    -t 01-00:00:00 -p node -N 2 -n 32 \
    /glob/inod/github/metassemble/bin/wrapper_jobscript.sbatch \
    make -e assemblies/ray/noscaf/noscaf_31/ma-contigs.fa
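
To show how such submissions can be generated per rule, here is a small sketch that assembles the same two command lines from a target name and its resource settings. The function and its parameters are hypothetical and not taken from MetAssemble. Only the `sbatch` options visible above are used. The dependency value is passed through unchanged, as on the slide, although Slurm itself expects job IDs after `afterok:`.

```python
import shlex


def sbatch_command(target, account, time_limit, partition, wrapper,
                   depends_on=None, extra=()):
    """Build an sbatch command line that runs `make -e <target>` via a wrapper script."""
    command = ["sbatch"]
    if depends_on is not None:
        command += ["-d", "afterok:" + depends_on]  # wait for the prerequisite job
    command += ["-A", account, "-J", target, "-t", time_limit, "-p", partition]
    command += list(extra)  # e.g. node and task counts
    command += [wrapper, "make", "-e", target]
    return shlex.join(command)


WRAPPER = "/glob/inod/github/metassemble/bin/wrapper_jobscript.sbatch"

trim = sbatch_command("processed-reads/pair.qtrim", "b2010008",
                      "01:00:00", "core", WRAPPER)
assemble = sbatch_command("assemblies/ray/noscaf/noscaf_31/ma-contigs.fa", "b2010008",
                          "01-00:00:00", "node", WRAPPER,
                          depends_on="metassemble/processed-reads/pair.qtrim",
                          extra=["-N", "2", "-n", "32"])
print(trim)
print(assemble)
```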

Benchmark metrics

Map assembly to references with MUMmer
Purity of a contig
- Largest alignment coverage times identity
- Expressed as ratio of contig length
- Encompasses many erroneous features
  - chimericity, errors, indels, inversions, relocations
Metagenome coverage
- How much of the reference is covered by the purest alignments of contigs

Calculated with MASMVALI
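
The two metrics can be sketched in a few lines. This is an illustration of the definitions above, not MASMVALI's code: alignments are given as plain tuples rather than parsed from MUMmer output, intervals are assumed to be 0-based and half-open, and all numbers are invented.

```python
def purity(contig_length, alignments):
    """Purity of a contig: its largest alignment's length times identity,
    expressed as a ratio of the contig length.

    alignments: list of (aligned_length, identity) with identity in [0, 1].
    """
    if not alignments:
        return 0.0
    aligned_length, identity = max(alignments, key=lambda a: a[0])
    return aligned_length * identity / contig_length


def covered_fraction(reference_length, intervals):
    """Fraction of the reference covered by the union of (start, end) intervals."""
    covered, end_of_last = 0, 0
    for start, end in sorted(intervals):
        start = max(start, end_of_last)  # skip the part that is already counted
        if end > start:
            covered += end - start
            end_of_last = end
    return covered / reference_length


print(purity(1000, [(900, 0.99), (100, 1.0)]))
print(covered_fraction(100, [(0, 30), (20, 50), (70, 80)]))
```

A contig that aligns in one piece with few mismatches scores close to 1, while chimeric or rearranged contigs are penalised because only their largest alignment counts.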

Purity noscaf recipes

Purity thresholds

Conclusions

Purest but shortest contigs
```
velvetnoscaf
```
Slightly less pure, slightly longer
```
raynoscaf
```
Longest contigs at a small price of purity
```
{velvet,ray}noscafnewbler
```
Scaffolding and MetaVelvet don't work well on this data
Rule of thumb: high coverage indicates high purity

Analyze kmer origin with Kraken

Check for each kmer in a contig whether it is erroneous or chimeric
If chimeric:
- which rank in the taxonomic tree does the kmer come from?
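
A toy version of the per-kmer check, with plain Python sets standing in for Kraken's kmer database. The definitions used here are assumptions for the sketch: a kmer is "erroneous" if it occurs in no reference genome, and "chimeric" if it occurs only in genomes other than the one the contig is expected to come from. The taxonomic rank lookup is not modelled.

```python
def kmer_origins(contig, k, reference_kmers, expected_genome):
    """Label each k-mer of a contig as 'ok', 'chimeric' or 'erroneous'.

    reference_kmers: dict mapping genome name -> set of k-mers in that genome.
    """
    labels = []
    for i in range(len(contig) - k + 1):
        kmer = contig[i:i + k]
        genomes = {name for name, known in reference_kmers.items() if kmer in known}
        if not genomes:
            labels.append("erroneous")  # seen in no reference genome
        elif expected_genome in genomes:
            labels.append("ok")
        else:
            labels.append("chimeric")   # seen only in another genome
    return labels


# Invented reference k-mer sets for two hypothetical genomes.
references = {"genome_a": {"ACG", "CGT"}, "genome_b": {"GTT"}}
labels = kmer_origins("ACGTTA", 3, references, "genome_a")
print(labels)
```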

Kmer origin

Impure contigs

Number of kmers chimeric: 219020
Number of kmers erroneous: 89314

unclassified: not in any of the categories below
a_tip_error: has erroneous kmers at the tip, but also other erroneous kmers
one_tip_error: has one tip with erroneous kmers and no other erroneous kmers
two_tip_error: has two tips with erroneous kmers and no other erroneous kmers
one_break: has one point not located at the tips with only erroneous kmers
100% qrycov: the query contig is completely aligned to the reference
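
The kmer-based categories can be read as rules over the positions of erroneous kmers along a contig. The sketch below is one possible reading, under stated assumptions: a "tip" is a run of erroneous kmers touching either end of the contig, and a "break" is a run that touches neither end. The `100% qrycov` category depends on the alignment rather than on kmers and is not covered.

```python
from itertools import groupby


def erroneous_runs(flags):
    """Return (start, end) index pairs of maximal runs of erroneous k-mers."""
    runs, position = [], 0
    for is_error, group in groupby(flags):
        length = len(list(group))
        if is_error:
            runs.append((position, position + length))
        position += length
    return runs


def classify_contig(flags):
    """Classify a contig from per-k-mer flags (True = erroneous k-mer)."""
    runs = erroneous_runs(flags)
    tips = [run for run in runs if run[0] == 0 or run[1] == len(flags)]
    internal = [run for run in runs if run not in tips]
    if tips and internal:
        return "a_tip_error"
    if len(tips) == 2:
        return "two_tip_error"
    if len(tips) == 1:
        return "one_tip_error"
    if len(internal) == 1:
        return "one_break"
    return "unclassified"


print(classify_contig([True, False, False, False]))
print(classify_contig([False, True, True, False]))
```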

2014-11-masmvali-presentation

Predicting contig impurity

Current State masmvaliweb

Stop

Time
