Sanger 2014

View Github Repository Open presentation in a new window

bmpvieira

See all presentation from bmpvieira

Sanger 2014

0 0

sanger14

On Github bmpvieira / sanger14

Sanger 2014

bmpvieira.com/sanger14

Bruno Vieira | @bmpvieira

Phd Student @

Bioinformatics and Population Genomics

Supervisor:Yannick Wurm | @yannick__

Before:

© 2014 Bruno Vieira CC-BY 4.0

Some problems I faced during my research:

Difficulty getting relevant descriptions and datasets from NCBI API using bio* libs
For web projects, needed to implement the same functionality on browser and server
Difficulty writing scalable, reproducible and complex bioinformatic pipelines

Bionode.io - Modular and universal bioinformatics

Pipeable UNIX command line tools and JavaScript / Node.js APIs for bioinformatic analysis workflows on the server and browser.Collaborates with BioJS - Represent biological data on the web

Dat - Build data pipelines

Provides a streaming interface between every file format and data storage backend. "git for data"

dat-data.com | @maxogden | @mafintosh

Difficulty getting relevant description and datasets from NCBI API using bio* libs

Python example: URL for the Achromyrmex assembly?

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000188075.1_Si_gnG

import xml.etree.ElementTree as ET
from Bio import Entrez
Entrez.email = "mail@bmpvieira.com"
esearch_handle = Entrez.esearch(db="assembly", term="Achromyrmex")
esearch_record = Entrez.read(esearch_handle)
for id in esearch_record['IdList']:
  esummary_handle = Entrez.esummary(db="assembly", id=id)
  esummary_record = Entrez.read(esummary_handle)
  documentSummarySet = esummary_record['DocumentSummarySet']
  document = documentSummarySet['DocumentSummary'][0]
  metadata_XML = document['Meta'].encode('utf-8')
  metadata = ET.fromstring('' + metadata_XML + '')
  for entry in Metadata[1]:
    print entry.text

Solution: bionode-ncbi

Difficulty getting relevant description and datasets from NCBI API using bio* libs

Example: URL for the Achromyrmex assembly?

http://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000204515.1_Aech_3.9/GCA_000204515.1_Aech_3.9_genomic.fna.gz

JavaScript

var bio = require('bionode')
bio.ncbi.urls('assembly', 'Acromyrmex', function(urls) {
  console.log(urls[0].genomic.fna)
})

bio.ncbi.urls('assembly', 'Acromyrmex').on('data', printGenomeURL)
function printGenomeURL(urls) {
  console.log(urls[0].genomic.fna)
})

Difficulty getting relevant description and datasets from NCBI API using bio* libs

Example: URL for the Achromyrmex assembly?

http://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000204515.1_Aech_3.9/GCA_000204515.1_Aech_3.9_genomic.fna.gz

JavaScript

var ncbi = require('bionode-ncbi')
var ndjson = require('ndjson')
ncbi.urls('assembly', 'Acromyrmex')
.pipe(ndjson.stringify())
.pipe(process.stdout)

BASH

bionode-ncbi urls assembly Acromyrmex |
tool-stream extractProperty genomic.fna

Need to reimplement the same code on browser and server.

Solution: JavaScript everywhere

Module counts

Benefit from other JS projects

Reusable, small and tested modules

Difficulty writing scalable, reproducible and complex bioinformatic pipelines.

Solution: Node.js Streams everywhere

 var ncbi = require('bionode-ncbi')
 var tool = require('tool-stream')
 var through = require('through2')
 var fork1 = through.obj()
 var fork2 = through.obj()

Difficulty writing scalable, reproducible and complex bioinformatic pipelines.

Solution: Node.js Streams everywhere

ncbi
.search('sra', 'Solenopsis invicta')
.pipe(fork1)
.pipe(dat.reads)

fork1
.pipe(tool.extractProperty('expxml.Biosample.id'))
.pipe(ncbi.search('biosample'))
.pipe(dat.samples)

fork1
.pipe(tool.extractProperty('uid'))
.pipe(ncbi.link('sra', 'pubmed'))
.pipe(ncbi.search('pubmed'))
.pipe(fork2)
.pipe(dat.papers)

Difficulty writing scalable, reproducible and complex bioinformatic pipelines.

bionode-ncbi search genome Guillardia theta |
tool-stream extractProperty assemblyid |
bionode-ncbi download assembly |
tool-stream collectMatch status completed |
tool-stream extractProperty uid|
bionode-ncbi link assembly bioproject |
tool-stream extractProperty destUID |
bionode-ncbi link bioproject sra |
tool-stream extractProperty destUID |
grep 35526 |
bionode-ncbi download sra |
bionode-sra fastq-dump |
tool-stream extractProperty destFile |
bionode-bwa mem 503988/GCA_000315625.1_Guith1_genomic.fna.gz |
tool-stream collectMatch status finished|
tool-stream extractProperty sam|
bionode-sam

Difficulty writing scalable, reproducible and complex bioinformatic pipelines.

Difficulty writing scalable, reproducible and complex bioinformatic pipelines.

 { 
   "import-data": [ 
     "bionode-ncbi search genome eukaryota", 
     "dat import --json --primary=uid" 
   ], 
   "search-ncbi": [ 
     "dat cat", 
     "grep Guillardia", 
     "tool-stream extractProperty assemblyid", 
     "bionode-ncbi download assembly -", 
     "tool-stream collectMatch status completed", 
     "tool-stream extractProperty uid", 
     "bionode-ncbi link assembly bioproject -", 
     "tool-stream extractProperty destUID", 
     "bionode-ncbi link bioproject sra -", 
     "tool-stream extractProperty destUID", 
     "grep 35526", 
     "bionode-ncbi download sra -", 
     "tool-stream collectMatch status completed", 
     "tee > metadata.json" 
   ],

Difficulty writing scalable, reproducible and complex bioinformatic pipelines.

   "index-and-align": [ 
     "cat metadata.json", 
     "bionode-sra fastq-dump -", 
     "tool-stream extractProperty destFile", 
     "bionode-bwa mem **/*fna.gz" 
   ], 
   "convert-to-bam": [ 
     "bionode-sam 35526/SRR070675.sam" 
   ] 
 }

pipeline main
  run pipeline import

pipeline import run foobar | run dat import --json

Project status:

Databases

NCBI

Wrappers

Parsers

Databases

EBI and ENSEMBL

Wrappers

QSUB
BLAST and BLAT
Bowtie
KHMER

Parsers

FASTQ
SAM/BAM
VCF/BCF

Users and Contributors:

Dat
Biodalliance
BioJS
Yeo Lab (UC San Diego)
- Michael Lovci
- Olga Botvinnik
Afra
GeneValidator

Soon?

DNADigest
Erik Garrison | erikgarrison

Thanks!

Acknowledgements:

@yannick__ @maxogden @mafintosh @alanmrice @dasmoth

Links

Why Node.js / JavaScript

Streams applies well to Bioinformatics
Easy to write CLI wrappers for Streams
Reusable, small and tested modules
Same language everywhere (JavaScript)
Package Manager that works (NPM)
Huge number modules (93327, 199/day)
Use other JS projects (Dat, BioJS, NoFlo)
Possible to write Desktop GUI apps in JS

Package Manager that works

npm install bionode
npm install bionode -g
npm test
npm start
npm run test-browser
npm run build-docs
npm init
npm publish