Sanger 2014





On Github bmpvieira / sanger14

bmpvieira.com/sanger14

PhD Student @

Bioinformatics and Population Genomics

Supervisor: Yannick Wurm | @yannick__

Before:

Some problems I faced during my research:

  • Difficulty getting relevant descriptions and datasets from NCBI API using bio* libs
  • For web projects, needed to implement the same functionality on browser and server
  • Difficulty writing scalable, reproducible and complex bioinformatic pipelines

Bionode.io - Modular and universal bioinformatics

Pipeable UNIX command-line tools and JavaScript / Node.js APIs for bioinformatic analysis workflows on the server and in the browser.

Collaborates with BioJS - Represent biological data on the web

Dat - Build data pipelines

Provides a streaming interface between every file format and data storage backend. "git for data"

dat-data.com | @maxogden | @mafintosh

Difficulty getting relevant description and datasets from NCBI API using bio* libs

Python example: URL for the Acromyrmex assembly?

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000188075.1_Si_gnG

import xml.etree.ElementTree as ET
from Bio import Entrez

Entrez.email = "mail@bmpvieira.com"

# Search the assembly database, then request a summary for each hit
esearch_handle = Entrez.esearch(db="assembly", term="Acromyrmex")
esearch_record = Entrez.read(esearch_handle)
for uid in esearch_record['IdList']:
  esummary_handle = Entrez.esummary(db="assembly", id=uid)
  esummary_record = Entrez.read(esummary_handle)
  documentSummarySet = esummary_record['DocumentSummarySet']
  document = documentSummarySet['DocumentSummary'][0]
  # 'Meta' is an XML fragment with no single root, so wrap it before parsing
  metadata_XML = document['Meta']
  metadata = ET.fromstring('<root>' + metadata_XML + '</root>')
  for entry in metadata[1]:
    print(entry.text)

Solution: bionode-ncbi

JavaScript

var bio = require('bionode')

// Callback style
bio.ncbi.urls('assembly', 'Acromyrmex', function(urls) {
  console.log(urls[0].genomic.fna)
})

// Event style
bio.ncbi.urls('assembly', 'Acromyrmex').on('data', printGenomeURL)
function printGenomeURL(urls) {
  console.log(urls[0].genomic.fna)
}

JavaScript

var ncbi = require('bionode-ncbi')
var ndjson = require('ndjson')

ncbi.urls('assembly', 'Acromyrmex')
  .pipe(ndjson.stringify())
  .pipe(process.stdout)
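To make the serialization step concrete, here is a minimal sketch of what newline-delimited JSON output amounts to, written with only Node's core `stream` module. `toNdjsonLine` and `ndjsonStringify` are hypothetical names used for illustration and are not the `ndjson` module's implementation; the sample record and its URL are made up.

```javascript
const { Readable, Transform } = require('stream')

// Hypothetical helper: serialize one object as a single NDJSON line.
function toNdjsonLine (obj) {
  return JSON.stringify(obj) + '\n'
}

// Hypothetical stand-in for a stringify stream: objects in, text lines out.
function ndjsonStringify () {
  return new Transform({
    writableObjectMode: true,
    transform (obj, _encoding, callback) {
      callback(null, toNdjsonLine(obj))
    }
  })
}

// Made-up record; real results carry many more fields.
const records = [
  { uid: 'example-1', genomic: { fna: 'ftp://example.org/a.fna.gz' } }
]

Readable.from(records)
  .pipe(ndjsonStringify())
  .pipe(process.stdout)
```

One object per line is what lets the next tool in a shell pipeline parse results as they arrive instead of waiting for a complete JSON document.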

BASH

bionode-ncbi urls assembly Acromyrmex |
tool-stream extractProperty genomic.fna
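The property extraction step above can be pictured as a small object stream. The following is a sketch under stated assumptions, not the `tool-stream` source: `getPath` and `extractProperty` are hypothetical helpers, and the assumption is that a dotted path such as `genomic.fna` is followed key by key into each incoming object.

```javascript
const { Transform } = require('stream')

// Hypothetical helper: follow a dotted path such as 'genomic.fna'.
// Returns undefined as soon as a step along the path is missing.
function getPath (obj, path) {
  return path.split('.').reduce(
    (value, key) => (value == null ? undefined : value[key]),
    obj
  )
}

// Hypothetical stand-in for an "extract property" stream: emit the value
// found at `path` for each object and skip objects that do not have it.
function extractProperty (path) {
  return new Transform({
    objectMode: true,
    transform (obj, _encoding, callback) {
      const value = getPath(obj, path)
      // Pushing null would end an object stream, so skip null and undefined.
      if (value != null) this.push(value)
      callback()
    }
  })
}

// Usage with a made-up record.
const extractor = extractProperty('genomic.fna')
extractor.on('data', (url) => console.log(url))
extractor.write({ genomic: { fna: 'ftp://example.org/a.fna.gz' } })
extractor.end()
```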

Need to reimplement the same code on browser and server.

Solution: JavaScript everywhere

Module counts

Benefit from other JS projects

Dat

Reusable, small and tested modules

Difficulty writing scalable, reproducible and complex bioinformatic pipelines.

Solution: Node.js Streams everywhere

var ncbi = require('bionode-ncbi')
var tool = require('tool-stream')
var through = require('through2')
var fork1 = through.obj()
var fork2 = through.obj()

ncbi
  .search('sra', 'Solenopsis invicta')
  .pipe(fork1)
  .pipe(dat.reads)

fork1
  .pipe(tool.extractProperty('expxml.Biosample.id'))
  .pipe(ncbi.search('biosample'))
  .pipe(dat.samples)

fork1
  .pipe(tool.extractProperty('uid'))
  .pipe(ncbi.link('sra', 'pubmed'))
  .pipe(ncbi.search('pubmed'))
  .pipe(fork2)
  .pipe(dat.papers)
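The fork pattern above relies on the fact that one readable stream can be piped into several destinations, each of which receives every object. Here is a self-contained sketch using only Node's core `stream` module; `collectInto` and `forkDemo` are hypothetical names, and the in-memory arrays stand in for the dat stores, which are not shown in the slides.

```javascript
const { Readable, PassThrough, Transform, Writable } = require('stream')

// Hypothetical sink: appends every object it receives to `store`.
function collectInto (store) {
  return new Writable({
    objectMode: true,
    write (obj, _encoding, callback) {
      store.push(obj)
      callback()
    }
  })
}

// Send records through one fork point and into two independent branches.
function forkDemo (records) {
  const fork = new PassThrough({ objectMode: true })
  const all = []
  const ids = []
  const allSink = collectInto(all)
  const idSink = collectInto(ids)

  // Branch 1: store the full records.
  fork.pipe(allSink)

  // Branch 2: derive something from the same records (here, only the uid).
  const pickUid = new Transform({
    objectMode: true,
    transform (obj, _encoding, callback) {
      callback(null, obj.uid)
    }
  })
  fork.pipe(pickUid).pipe(idSink)

  // Feed the fork; when the source ends, both branches end too.
  Readable.from(records).pipe(fork)

  const done = (sink) => new Promise((resolve) => sink.on('finish', resolve))
  return Promise.all([done(allSink), done(idSink)]).then(() => ({ all, ids }))
}

forkDemo([{ uid: '1' }, { uid: '2' }]).then((result) => console.log(result))
```

Because both branches share one source, the slower branch applies backpressure to the whole fork, which keeps memory use bounded on large result sets.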

Difficulty writing scalable, reproducible and complex bioinformatic pipelines.

bionode-ncbi search genome Guillardia theta |
tool-stream extractProperty assemblyid |
bionode-ncbi download assembly |
tool-stream collectMatch status completed |
tool-stream extractProperty uid |
bionode-ncbi link assembly bioproject |
tool-stream extractProperty destUID |
bionode-ncbi link bioproject sra |
tool-stream extractProperty destUID |
grep 35526 |
bionode-ncbi download sra |
bionode-sra fastq-dump |
tool-stream extractProperty destFile |
bionode-bwa mem 503988/GCA_000315625.1_Guith1_genomic.fna.gz |
tool-stream collectMatch status finished |
tool-stream extractProperty sam |
bionode-sam

Difficulty writing scalable, reproducible and complex bioinformatic pipelines.

 { 
   "import-data": [ 
     "bionode-ncbi search genome eukaryota", 
     "dat import --json --primary=uid" 
   ], 
   "search-ncbi": [ 
     "dat cat", 
     "grep Guillardia", 
     "tool-stream extractProperty assemblyid", 
     "bionode-ncbi download assembly -", 
     "tool-stream collectMatch status completed", 
     "tool-stream extractProperty uid", 
     "bionode-ncbi link assembly bioproject -", 
     "tool-stream extractProperty destUID", 
     "bionode-ncbi link bioproject sra -", 
     "tool-stream extractProperty destUID", 
     "grep 35526", 
     "bionode-ncbi download sra -", 
     "tool-stream collectMatch status completed", 
     "tee > metadata.json" 
   ], 

   "index-and-align": [ 
     "cat metadata.json", 
     "bionode-sra fastq-dump -", 
     "tool-stream extractProperty destFile", 
     "bionode-bwa mem **/*fna.gz" 
   ], 
   "convert-to-bam": [ 
     "bionode-sam 35526/SRR070675.sam" 
   ] 
 } 
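One way to read a definition like this, assuming each array is a chain of commands connected by pipes (as the command-line version of the pipeline suggests), is sketched below. `toShellCommands` is a hypothetical helper and not part of any tool named in these slides; the sample definition copies two stages from the JSON above.

```javascript
// Hypothetical helper: turn { stageName: [cmd, cmd, ...] } into one shell
// command line per stage by joining that stage's commands with pipes.
function toShellCommands (pipelines) {
  return Object.entries(pipelines).map(([stage, commands]) => ({
    stage,
    command: commands.join(' | ')
  }))
}

const definition = {
  'index-and-align': [
    'cat metadata.json',
    'bionode-sra fastq-dump -',
    'tool-stream extractProperty destFile',
    'bionode-bwa mem **/*fna.gz'
  ],
  'convert-to-bam': [
    'bionode-sam 35526/SRR070675.sam'
  ]
}

// Print each stage next to the shell command line it would correspond to.
for (const { stage, command } of toShellCommands(definition)) {
  console.log(`${stage}: ${command}`)
}
```

Keeping stages as named arrays makes each one individually re-runnable, which is where the reproducibility benefit comes from.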

datscript

pipeline main
  run pipeline import

pipeline import
  run foobar | run dat import --json

example

Project status:

Databases

Wrappers

Parsers

Wishlist:

Databases

  • EBI and ENSEMBL

Wrappers

  • QSUB
  • BLAST and BLAT
  • Bowtie
  • KHMER

Parsers

  • FASTQ
  • SAM/BAM
  • VCF/BCF

Users and Contributors:

Soon?

Thanks!

Acknowledgements:

@yannick__ @maxogden @mafintosh @alanmrice @dasmoth

Links

Why Node.js / JavaScript

Package Manager that works

npm install bionode
npm install bionode -g
npm test
npm start
npm run test-browser
npm run build-docs
npm init
npm publish