Content
- de Bruijn graph (optional)
- de novo metagenomic assembly benchmark
- Comparing assemblers
- Classifying errors in a single assembly
de novo metagenomic assembly benchmark
Assembling Illumina paired end reads with de Bruijn graph assemblers
de Bruijn Graph
- cut up reads into overlapping kmers
- number of unique kmers determines the computational requirements
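As a minimal sketch of the two bullets above: the snippet cuts reads into overlapping kmers, counts how often each kmer occurs, and links consecutive kmers (which overlap by k-1 bases). The function names and the toy reads are assumptions made for illustration. Real assemblers also handle reverse complements and sequencing errors, which this sketch ignores.

```python
from collections import Counter, defaultdict


def kmers(read, k):
    """Yield every overlapping kmer of a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_de_bruijn(reads, k):
    """Return (kmer coverage, edges between consecutive kmers)."""
    coverage = Counter()
    edges = defaultdict(set)
    for read in reads:
        previous = None
        for kmer in kmers(read, k):
            coverage[kmer] += 1
            if previous is not None:
                # Consecutive kmers in a read share k-1 bases.
                edges[previous].add(kmer)
            previous = kmer
    return coverage, edges


# Toy reads, invented for this example.
reads = ["ACGTAC", "CGTACG"]
coverage, edges = build_de_bruijn(reads, k=4)
print(len(coverage), "unique kmers")
```

The number of keys in `coverage` is the number of unique kmers, which is what drives memory use in this toy representation.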
Choosing the right k
- smaller k: more sensitive, higher kmer coverage
- larger k: more specific, lower kmer coverage
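The coverage side of this trade-off can be made concrete with the commonly used relation between base coverage C, read length L and kmer coverage, C_k = C * (L - k + 1) / L. The sketch below assumes error-free reads of uniform length; the numbers (20x coverage, 100 bp reads) are illustrative only.

```python
def kmer_coverage(base_coverage, read_length, k):
    """Expected kmer coverage for reads of uniform length without errors."""
    if not 1 <= k <= read_length:
        raise ValueError("k must be between 1 and the read length")
    # A read of length L contains L - k + 1 kmers.
    return base_coverage * (read_length - k + 1) / read_length


# Illustrative values: 20x base coverage, 100 bp reads.
for k in (21, 51, 75):
    print(k, kmer_coverage(20, 100, k))
```

With everything else fixed, raising k lowers the kmer coverage, which is why low-coverage genomes tend to need a smaller k.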
de Bruijn Graph information available
- kmer coverage
- read that created the kmer
- overlap between kmers
- insert size distribution
Single genome
- Estimate coverage distribution over genome
- Identify repeats
- Regions appearing multiple times
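One simple way to picture this for a single genome: take the most common kmer coverage as the typical coverage and flag kmers well above it as likely repeats. This is a heuristic sketch, not how any particular assembler does it. The `factor` threshold and the toy counts are assumptions.

```python
from collections import Counter


def repeat_kmers(coverage, factor=1.5):
    """Return (typical coverage, kmers whose coverage suggests a repeat).

    coverage: dict mapping kmer -> number of times it was seen.
    factor:   assumed threshold, in multiples of the typical coverage.
    """
    histogram = Counter(coverage.values())
    # Mode of the coverage distribution; ties go to the lower coverage.
    typical = max(histogram, key=lambda c: (histogram[c], -c))
    repeats = {kmer for kmer, c in coverage.items() if c >= factor * typical}
    return typical, repeats


# Toy counts, invented for this example.
toy = {"AAAC": 10, "AACG": 10, "ACGT": 10, "CGTT": 21, "GTTA": 9}
typical, repeats = repeat_kmers(toy)
print(typical, sorted(repeats))
```

In a metagenome this assumption breaks down, because a kmer with high coverage may simply belong to a more abundant genome.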
Metagenomics complicates the process
- number of genomes unknown
- coverage of genomes differs
- different abundances/length per genome
- short k for low-abundance genomes
- large k for high-abundance genomes
- closely related organisms
Sounds impossible!
Let's validate some assemblers!
Mock community
- in vitro mock community of 52 species
- reference genomes available
- even community
- log-normal community
Assembly recipes
- Contiging single kmer
{velvet,ray,metavelvet}noscaf
- Scaffolding single kmer
{velvet,ray,metavelvet}scaf
{velvet,ray,metavelvet}noscaf =>
{velvet,ray,metavelvet}bambus2
- Merging kmers 21 to 75
{velvet,ray,metavelvet}noscaf =>
{newbler,minimus2}
- Merging kmers and scaffolding
{velvet,ray,metavelvet}noscaf =>
{newbler,minimus2}bambus2
GNU Make Pipeline to run assembly recipes
# make velvet assemblies
make velvet
# make ray assemblies
make ray
# make all assemblies
make all
MetAssemble
Pipeline can schedule rules
- PBS or SBATCH
- Resource usage per rule
- Includes job dependencies
$ make -f Makefile-sbatch raynoscaf31
sbatch -A b2010008 \
-J processed-reads/pair.qtrim \
-t 01:00:00 -p core \
/glob/inod/github/metassemble/bin/wrapper_jobscript.sbatch \
make -e processed-reads/pair.qtrim
sbatch -d afterok:metassemble/processed-reads/pair.qtrim -A b2010008 \
-J assemblies/ray/noscaf/noscaf_31/ma-contigs.fa \
-t 01-00:00:00 -p node -N 2 -n 32 \
/glob/inod/github/metassemble/bin/wrapper_jobscript.sbatch \
make -e assemblies/ray/noscaf/noscaf_31/ma-contigs.fa
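The pattern in the output above (one sbatch call per make target, with later jobs depending on earlier ones) can be sketched as a small command builder. This is an illustration only, not the pipeline's own code. The function name and its parameters are hypothetical; the sbatch options used (`-A`, `-J`, `-t`, `-p`, `-d afterok:`) are the ones shown in the output.

```python
def sbatch_command(target, account, time_limit, partition, wrapper,
                   depends_on=None, extra=()):
    """Build the argument list for submitting one make target via sbatch."""
    cmd = ["sbatch"]
    if depends_on is not None:
        # Start only after the job we depend on finished successfully.
        cmd += ["-d", f"afterok:{depends_on}"]
    cmd += ["-A", account, "-J", target, "-t", time_limit, "-p", partition]
    cmd += list(extra)
    # The wrapper script runs the remaining arguments as the job.
    cmd += [wrapper, "make", "-e", target]
    return cmd


wrapper = "/glob/inod/github/metassemble/bin/wrapper_jobscript.sbatch"
trim = sbatch_command("processed-reads/pair.qtrim", "b2010008",
                      "01:00:00", "core", wrapper)
assembly = sbatch_command("assemblies/ray/noscaf/noscaf_31/ma-contigs.fa",
                          "b2010008", "01-00:00:00", "node", wrapper,
                          depends_on="metassemble/processed-reads/pair.qtrim",
                          extra=("-N", "2", "-n", "32"))
print(" ".join(trim))
print(" ".join(assembly))
```

Returning a list rather than a string keeps the command safe to pass to `subprocess.run` without shell quoting.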
Benchmark metrics
- Map assembly to references with MUMmer
- Purity of a contig
- Largest alignment coverage times identity
- Expressed as ratio of contig length
- Encompasses many erroneous features
- chimericity, errors, indels, inversions, relocations
- Metagenome coverage
- How much of the reference is covered by the purest alignments of contigs
Calculated with MASMVALI
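To make the purity definition concrete, here is a minimal sketch, not MASMVALI's implementation. It assumes each alignment of a contig is given by its query start and end (1-based, inclusive) and an identity fraction, and it reads "largest alignment coverage times identity" as the maximum of aligned length times identity. The `Alignment` class and the example numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    query_start: int   # 1-based, inclusive position on the contig
    query_end: int     # 1-based, inclusive position on the contig
    identity: float    # fraction between 0 and 1


def purity(contig_length, alignments):
    """Largest (aligned length * identity) as a ratio of contig length."""
    if not alignments:
        return 0.0
    best = max((a.query_end - a.query_start + 1) * a.identity
               for a in alignments)
    return best / contig_length


# Hypothetical contig of 1000 bp with two alignments to different references.
hits = [Alignment(1, 800, 0.99), Alignment(801, 1000, 1.0)]
print(purity(1000, hits))
```

A chimeric contig like this one scores below 1 because no single alignment covers it, and mismatches or indels lower the score further through the identity term.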
Analyze kmer origin with Kraken
- Check for each kmer in a contig whether it is erroneous or chimeric
- If chimeric
- which rank in the taxonomic tree the kmer comes from
Impure contigs
- Number of kmers chimeric: 219020
- Number of kmers erroneous: 89314
- unclassified: not in any of the categories below
- a_tip_error: has erroneous kmers at a tip, but also other erroneous kmers
- one_tip_error: has one tip with erroneous kmers and no other erroneous kmers
- two_tip_error: has two tips with erroneous kmers and no other erroneous kmers
- one_break: has one point not located at the tips with only erroneous kmers
- 100% qrycov: the query contig is completely aligned to the reference
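The kmer-based categories above can be sketched as a small classifier. This is one possible reading of the definitions, not the code used for the slides. It assumes the contig is given as one boolean per kmer (True means erroneous), and it treats a "tip" as a run of erroneous kmers touching either end of the contig. The 100% qrycov category depends on the alignment rather than on kmers, so it is left out.

```python
from itertools import groupby


def classify_contig(erroneous):
    """erroneous: one boolean per kmer along the contig, True = erroneous."""
    if not erroneous:
        return "unclassified"
    # Collapse the flags into runs of (flag, first index, last index).
    runs = []
    position = 0
    for flag, group in groupby(erroneous):
        length = len(list(group))
        runs.append((flag, position, position + length - 1))
        position += length
    last = len(erroneous) - 1
    # Erroneous runs that touch neither end of the contig.
    internal = sum(1 for flag, start, end in runs
                   if flag and start > 0 and end < last)
    # Number of contig ends that carry erroneous kmers (0, 1 or 2).
    tips = int(erroneous[0]) + int(erroneous[-1])
    if tips and internal:
        return "a_tip_error"
    if tips == 1:
        return "one_tip_error"
    if tips == 2:
        return "two_tip_error"
    if internal == 1:
        return "one_break"
    return "unclassified"


T, F = True, False
print(classify_contig([T, T, F, F, F]))
print(classify_contig([F, F, T, T, F]))
```

Under this reading, a contig with several internal runs of erroneous kmers and clean tips falls through to unclassified.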
Predicting contig impurity
- Tried several assembly quality predictors without luck
Current State masmvaliweb