Metagenomic Assembly Evaluation – Decreasing the assembler space





2014-11-masmvali-presentation

Presentation at metagenomics hackathon NI Cambridge

On Github inodb / 2014-11-masmvali-presentation

Metagenomic Assembly Evaluation

Decreasing the assembler space

Ino de Bruijn / @inodb Presentation at ino.pm/2014hack

Content

  • de Bruijn graph (optional)
  • de novo Metagenomic Assembly benchmark
    • Comparing assemblers
    • Classifying errors in a single assembly

de novo metagenomic assembly benchmark

Assembling Illumina paired end reads with de Bruijn graph assemblers

de Bruijn Graph

  • cut up reads into overlapping kmers
  • the number of unique kmers determines the computational requirements
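
A minimal sketch of the kmer step above, assuming plain string reads. The function names `kmers` and `count_kmers` are illustrative and not taken from any assembler. Real assemblers also merge a kmer with its reverse complement, which is left out here.

```python
from collections import Counter


def kmers(read, k):
    """Return all overlapping substrings of length k, in read order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def count_kmers(reads, k):
    """Count how often each kmer occurs over all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


reads = ["ACGTA", "CGTAC"]  # toy reads, not real data
counts = count_kmers(reads, 3)
# The number of distinct keys is the number of graph nodes to hold in memory.
print(len(counts), dict(counts))
```

The size of `counts` (unique kmers), not the number of reads, is what drives memory use in this toy model.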

Choosing the right k

  • smaller k: more sensitive, more kmer coverage
  • larger k: more specific, less kmer coverage
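
The coverage trade-off can be made concrete with the standard relation between base coverage C and kmer coverage, C_k = C * (L - k + 1) / L for reads of length L (as given in the Velvet manual). The sketch below only evaluates that formula; the read length and coverage values are made-up inputs, not numbers from this benchmark.

```python
def kmer_coverage(base_coverage, read_length, k):
    """Expected kmer coverage for a given base coverage, read length and k."""
    if not 1 <= k <= read_length:
        raise ValueError("k must be between 1 and the read length")
    return base_coverage * (read_length - k + 1) / read_length


# Hypothetical genome sequenced at 30x with 100 bp reads.
for k in (21, 31, 51, 75):
    print(k, kmer_coverage(30, 100, k))
```

A genome that is already at low abundance can drop to very little kmer coverage at large k, which is why short k suits low-abundance genomes.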

de Bruijn Graph information available

  • kmer coverage
  • reads that contributed each kmer
  • overlap between kmers
  • insert size distribution

Single genome

  • Estimate coverage distribution over genome
  • Identify repeats
  • Regions appearing multiple times

Metagenomics complicates the process

  • number of genomes unknown
  • coverage of genomes differs
    • different abundance and length per genome
    • short k for low-abundance genomes
    • large k for high-abundance genomes
  • closely related organisms
    • shared regions longer than k

Sounds impossible!

Let's validate some assemblers!

Mock community

  • in vitro mock community of 52 species
  • reference genomes available
  • even community
  • log-normal community

Assembly recipes

  • Contigging with a single kmer size
    {velvet,ray,metavelvet}noscaf
  • Scaffolding single kmer
    {velvet,ray,metavelvet}scaf
    {velvet,ray,metavelvet}noscaf =>
    {velvet,ray,metavelvet}bambus2
  • Merging assemblies for kmer sizes 21 to 75
    {velvet,ray,metavelvet}noscaf =>
    {newbler,minimus2}
  • Merging kmers and scaffolding
    {velvet,ray,metavelvet}noscaf =>
    {newbler,minimus2}bambus2
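
The brace notation above reads like shell brace expansion. A small sketch that enumerates the recipe names it implies, assuming names are formed by plain concatenation (this matches names such as velvetnoscafnewbler used in the conclusions, but the full set of pipeline targets is an assumption). The `expand` helper is hypothetical.

```python
from itertools import product

ASSEMBLERS = ["velvet", "ray", "metavelvet"]
MERGERS = ["newbler", "minimus2"]


def expand(*parts):
    """Concatenate every combination of the given alternatives."""
    return ["".join(combo) for combo in product(*parts)]


contigs = expand(ASSEMBLERS, ["noscaf"])              # single kmer, no scaffolding
scaffolds = expand(ASSEMBLERS, ["scaf"])              # single kmer, scaffolded
merged = expand(ASSEMBLERS, ["noscaf"], MERGERS)      # merged over kmer sizes
merged_scaf = expand(ASSEMBLERS, ["noscaf"], MERGERS, ["bambus2"])

for name in contigs + scaffolds + merged + merged_scaf:
    print(name)
```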

GNU Make Pipeline to run assembly recipes

# make velvet assemblies
make velvet
# make ray assemblies
make ray
# make all assemblies
make all

MetAssemble

Pipeline can schedule rules

  • PBS or SBATCH
  • Resource usage per rule
  • Includes job dependencies
$ make -f Makefile-sbatch raynoscaf31

sbatch  -A b2010008 \
    -J processed-reads/pair.qtrim \
    -t 01:00:00 -p core \
    /glob/inod/github/metassemble/bin/wrapper_jobscript.sbatch \
    make -e processed-reads/pair.qtrim

sbatch -d afterok:metassemble/processed-reads/pair.qtrim -A b2010008 \
    -J assemblies/ray/noscaf/noscaf_31/ma-contigs.fa \
    -t 01-00:00:00 -p node -N 2 -n 32 \
    /glob/inod/github/metassemble/bin/wrapper_jobscript.sbatch \
    make -e assemblies/ray/noscaf/noscaf_31/ma-contigs.fa
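
How such a submission line could be assembled from a rule, as a rough sketch rather than the pipeline's actual code. The function `sbatch_command` and the job ID `12345` are hypothetical. Only the sbatch flags already shown above (-d, -A, -J, -t, -p, -N, -n) are used. Note that stock Slurm expects job IDs after `afterok:`.

```python
import shlex


def sbatch_command(target, account, time_limit, partition, wrapper,
                   depends_on=None, extra_args=()):
    """Build the sbatch argument list that runs `make -e <target>` as a job."""
    cmd = ["sbatch"]
    if depends_on is not None:
        # Start only after the prerequisite job finished successfully.
        cmd += ["-d", f"afterok:{depends_on}"]
    cmd += ["-A", account, "-J", target, "-t", time_limit, "-p", partition]
    cmd += list(extra_args)
    cmd += [wrapper, "make", "-e", target]
    return cmd


wrapper = "/glob/inod/github/metassemble/bin/wrapper_jobscript.sbatch"
reads = sbatch_command("processed-reads/pair.qtrim", "b2010008",
                       "01:00:00", "core", wrapper)
assembly = sbatch_command("assemblies/ray/noscaf/noscaf_31/ma-contigs.fa",
                          "b2010008", "01-00:00:00", "node", wrapper,
                          depends_on="12345",  # hypothetical job ID
                          extra_args=("-N", "2", "-n", "32"))
print(shlex.join(reads))
print(shlex.join(assembly))
```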


Benchmark metrics

  • Map assembly to references with MUMmer
  • Purity of a contig
    • Coverage of the largest alignment times its identity
    • Expressed as a fraction of contig length
    • Encompasses many erroneous features
      • chimericity, errors, indels, inversions, relocations
  • Metagenome coverage
    • How much of the reference is covered by the purest alignments of the contigs
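
A minimal sketch of the purity metric as described above, not MASMVALI's implementation. It assumes each alignment is already parsed into the hypothetical `Alignment` record below (for example from MUMmer output) and that identity is a fraction between 0 and 1. Metagenome coverage is not computed here.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    contig: str        # contig name
    contig_len: int    # total contig length in bases
    aligned_len: int   # contig bases covered by this alignment
    identity: float    # fraction of matching bases, 0..1


def purity(alignments):
    """Map each contig to the best (aligned_len * identity) / contig_len."""
    best = {}
    for aln in alignments:
        score = aln.aligned_len * aln.identity / aln.contig_len
        if score > best.get(aln.contig, 0.0):
            best[aln.contig] = score
    return best


# Toy alignments, not benchmark data.
alns = [
    Alignment("c1", 1000, 800, 0.99),  # one large alignment with mismatches
    Alignment("c1", 1000, 300, 1.0),   # a smaller perfect one
    Alignment("c2", 500, 500, 1.0),    # fully aligned, perfect identity
]
print(purity(alns))
```

A chimeric contig scores low because no single alignment covers most of its length, which is how one number can capture several error types.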

Calculated with MASMVALI

Purity noscaf recipes

Purity thresholds

Conclusions

  • Purest but shortest contigs
    velvetnoscaf
  • Slightly less pure, slightly longer
    raynoscaf
  • Longest contigs at a small price of purity
    {velvet,ray}noscafnewbler
  • Scaffolding and MetaVelvet don't work well on this data
  • Rule of thumb: high coverage indicates high purity

Analyze kmer origin with Kraken

  • Check for each kmer in a contig whether it is erroneous or chimeric
  • If chimeric
    • at which rank in the taxonomic tree does the kmer originate

Kmer origin

Impure contigs

  • Number of chimeric kmers: 219020
  • Number of erroneous kmers: 89314
  • unclassified: not in any of the categories below
  • a_tip_error: has erroneous kmers at the tip, but also other erroneous kmers
  • one_tip_error: has one tip with erroneous kmers and no other erroneous kmers
  • two_tip_error: has two tips with erroneous kmers and no other erroneous kmers
  • one_break: has one point not located at the tips with only erroneous kmers
  • 100% qrycov: the query contig is completely aligned to the reference
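
The tip and break categories can be read as rules over the runs of erroneous kmers along a contig. The sketch below is one possible interpretation, not the code used for the slides. It assumes each contig is given as a list of booleans (True = erroneous kmer), and the function names are hypothetical. A contig consisting only of erroneous kmers counts as having two erroneous tips here, which is an arbitrary choice.

```python
from itertools import groupby


def error_runs(flags):
    """Return (start, end) pairs, end exclusive, for each run of True values."""
    runs, pos = [], 0
    for is_error, group in groupby(flags):
        length = len(list(group))
        if is_error:
            runs.append((pos, pos + length))
        pos += length
    return runs


def classify_contig(flags):
    """Assign one of the error categories based on where error runs sit."""
    n = len(flags)
    runs = error_runs(flags)
    # A tip is erroneous when a run touches the first or last kmer.
    tips = sum(s == 0 for s, _ in runs) + sum(e == n for _, e in runs)
    internal = sum(1 for s, e in runs if s > 0 and e < n)
    if tips == 1 and internal == 0:
        return "one_tip_error"
    if tips == 2 and internal == 0:
        return "two_tip_error"
    if tips >= 1 and internal >= 1:
        return "a_tip_error"
    if tips == 0 and internal == 1:
        return "one_break"
    return "unclassified"


print(classify_contig([False, True, True, False]))
```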

Predicting contig impurity

  • Tried several assembly quality predictors without luck
    • FRCurve
    • REAPR
    • ALE

Current State masmvaliweb
