Galaxy Community Conference 2015



On Github pvanheus / GCC2015-presentation

Peter van Heusden (pvh@sanbi.ac.za)
* Background: what do we want?
* Scientist-centric software
* Support iterative development
* Record provenance of data
* Map scientific enquiry to computing resources
* Galaxy analysis model is a workbench with tools down the left and results on the right
* Restricted to the set of tools provided by the Galaxy administrator
* Adding a tool requires admin privileges and "tool XML"
* Enter Interactive Environment
* IPython available, R planned
* IPython notebook with a straightforward interface to Galaxy datasets; can call back into Galaxy using the BioBlend API
* Behind the scenes, the notebook is encapsulated in a Docker container
* Downside: the analysis is not a tool, and cannot be used elsewhere in Galaxy
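A minimal sketch of the call-back path mentioned above, assuming the notebook can reach the Galaxy server with a URL and an API key. `GalaxyInstance` and `histories.show_history` are BioBlend calls; the helper names `connect` and `dataset_names`, the URL, and the key are placeholders introduced for illustration only.

```python
def connect(url, api_key):
    """Open a BioBlend connection to a Galaxy server (URL and key are placeholders)."""
    # Imported lazily so the rest of the sketch loads without BioBlend installed.
    from bioblend.galaxy import GalaxyInstance
    return GalaxyInstance(url=url, key=api_key)


def dataset_names(gi, history_id):
    """Return the names of the items in one Galaxy history."""
    # With contents=True, show_history returns one dict per history item.
    contents = gi.histories.show_history(history_id, contents=True)
    return [item["name"] for item in contents]


# Hypothetical usage inside the notebook:
# gi = connect("https://galaxy.example.org", "YOUR_API_KEY")
# print(dataset_names(gi, "HISTORY_ID"))
```

Keeping the connection separate from the query lets the query be exercised against a stub object, which is convenient when no Galaxy server is at hand.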
* Galaxy tool building is "an art, not a science"
* Tools combine XML and Cheetah templates
* Complexity is hard to avoid, but
* Planemo provides tools for automating tool linting, testing and publishing to the Tool Shed
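To make "XML plus Cheetah" concrete, here is a rough sketch that assembles a very small tool wrapper with Python's standard library. The tool id, the script name `seq_length.py`, and the parameter names are invented for illustration; a real wrapper would also carry help text and tests.

```python
import xml.etree.ElementTree as ET


def build_tool_xml(tool_id, name, version, command):
    """Build a minimal Galaxy tool wrapper as an ElementTree element."""
    tool = ET.Element("tool", id=tool_id, name=name, version=version)
    # The command is a Cheetah template: Galaxy fills in $input and $output.
    ET.SubElement(tool, "command").text = command
    inputs = ET.SubElement(tool, "inputs")
    ET.SubElement(inputs, "param", name="input", type="data",
                  format="fasta", label="Sequences")
    outputs = ET.SubElement(tool, "outputs")
    ET.SubElement(outputs, "data", name="output", format="tabular")
    return tool


tool = build_tool_xml(
    "seq_length",                           # hypothetical tool id
    "Sequence length",
    "0.1.0",
    "seq_length.py '$input' > '$output'",   # hypothetical script name
)
xml_text = ET.tostring(tool, encoding="unicode")
print(xml_text)
```

A wrapper of this shape, saved to a file, is the kind of input that `planemo lint` and `planemo test` operate on.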
* Planemo is a command line tool for working with Galaxy tools
* Mostly, it helps developers follow "best practice" in tool building
* Sticking point: tool dependencies
* Galaxy tool dependency spec is effectively a build descriptor
* Like Maven, NPM, Homebrew
* Talk within the Galaxy community is of moving towards Homebrew / Linuxbrew as package manager
* Docker: Linux containers ++
* Specify software as layers: Ubuntu + PostgreSQL + Galaxy
* Build specification is (typically) via Dockerfile
* Allows composition of machine image
* Compile, distribute, run everywhere
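The layering idea can be pictured with a toy model: each layer records only the files it adds or changes, and the image is the stack read from the top down. This is an analogy written in Python, not how Docker is implemented; the paths and contents are made up.

```python
from collections import ChainMap

# Each layer records only the files it adds or changes (illustrative values).
ubuntu = {"/bin/sh": "shell", "/etc/os-release": "ubuntu"}
postgresql = {"/usr/bin/postgres": "postgres server"}
galaxy = {"/galaxy/run.sh": "galaxy launcher",
          "/etc/os-release": "ubuntu + galaxy"}


def compose(*layers):
    """Stack layers bottom to top; later layers shadow earlier ones."""
    # ChainMap searches its first mapping first, so the top layer goes first.
    return ChainMap(*reversed(layers))


image = compose(ubuntu, postgresql, galaxy)
print(sorted(image))
print(image["/etc/os-release"])
```

Because the lower layers are untouched, the same Ubuntu and PostgreSQL layers could sit underneath a different top layer, which is the reuse that makes composition cheap.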
* Bjoern Gruening, University of Freiburg
* Moving more towards the admin:
* Galaxy Flavours
* Everything, wrapped up in a Docker container
* Specified by a Dockerfile that builds Galaxy from Galaxy source + command line installer for tools
* Good for providing a relatively fixed Galaxy instance

Visualisation powered by BioJS

* Sebastian Wilzbach, Technical University Munich
* Much BioJS work was developed at The Genome Analysis Centre (TGAC)
* BioJS is a community-oriented project for developing JavaScript visualisation components for bioinformatics
* BioJS2Galaxy binds BioJS visualisations to Galaxy data types
* Dataset collections, announced at GCC2014, facilitate MapReduce-style processing of data
* Push to extend tool support for collections to tools provided by Galaxy devteam and Intergalactic Utilities Commission (community)
* Collection types: List / Pair / List of Pairs
* First step towards a new execution model in Galaxy, revealing need for changes in UI, provenance tracking, backend execution
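The map and reduce steps over a list of pairs can be sketched in plain Python. The file names and the `align` and `merge` functions are stand-ins for real tools, chosen only to show the shape of the computation.

```python
from collections import namedtuple

# A "pair" holds forward and reverse reads for one sample.
Pair = namedtuple("Pair", ["forward", "reverse"])

# A "list of pairs": one Pair per sample (file names are made up).
samples = {
    "sample1": Pair("s1_R1.fastq", "s1_R2.fastq"),
    "sample2": Pair("s2_R1.fastq", "s2_R2.fastq"),
}


def align(pair):
    """Stand-in for a tool that consumes one pair and produces one output."""
    return "aligned({}, {})".format(pair.forward, pair.reverse)


def merge(outputs):
    """Stand-in for a tool that reduces a list to a single dataset."""
    return "merged[" + "; ".join(outputs) + "]"


# Map: run the tool once per element; the result keeps the same identifiers.
aligned = {name: align(pair) for name, pair in samples.items()}

# Reduce: collapse the list into one dataset.
summary = merge([aligned[name] for name in sorted(aligned)])
print(summary)
```

The element identifiers survive the map step, which is what lets a later step match each output back to its sample.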
* Current Galaxy workflows are abstraction intolerant
* No sub-workflows (work during GCC2015 hackathon)
* Workflow components have received relatively little attention in recent years
* Change from "Galaxy is a workflow engine" (2010 paper) to "Galaxy isn't a workflow engine" (GCC 2012)

John Chilton's presentation at BOSC 2015

* What I like to call the Galaxy 10,000 problem
* Currently Galaxy workflows map to tasks on the backend
* Large collections threaten to overwhelm the scheduler
* As it grows, a workflow system starts looking more and more like a (dataflow or pure functional) language
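A back-of-the-envelope sketch of why large collections strain the scheduler, assuming one backend job per collection element per workflow step. The five-step workflow and the batch size of 100 are illustrative assumptions, and batching is shown only as one hypothetical mitigation.

```python
def jobs_per_element(n_elements, n_steps):
    """One backend job per collection element per workflow step."""
    return n_elements * n_steps


def chunks(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def jobs_batched(n_elements, n_steps, batch_size):
    """One backend job per batch per step (a hypothetical mitigation)."""
    n_batches = len(chunks(list(range(n_elements)), batch_size))
    return n_batches * n_steps


print(jobs_per_element(10000, 5))
print(jobs_batched(10000, 5, 100))
```

Under these assumptions the job count drops from 50,000 to 500, at the cost of coarser scheduling and coarser provenance per job.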
* Galaxy didn't get here first
* Extensive literature on workflow systems that many are not aware of
* Kepler model of "pluggable directors" provides a powerful abstraction
* Not every execution is alike
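The idea of a pluggable director is that the workflow definition stays fixed while the execution strategy is swapped. The sketch below illustrates that separation in Python; the class names are hypothetical and do not mirror Kepler's actual API.

```python
from concurrent.futures import ThreadPoolExecutor


# A tiny linear workflow: each step is a plain function.
def trim(text):
    return text.strip()


def upper(text):
    return text.upper()


WORKFLOW = [trim, upper]


def run_workflow(item):
    """Push one item through every step in order."""
    for step in WORKFLOW:
        item = step(item)
    return item


class SequentialDirector:
    """Runs the workflow on one item at a time."""

    def run(self, items):
        return [run_workflow(item) for item in items]


class ThreadedDirector:
    """Runs the same workflow on several items concurrently."""

    def __init__(self, workers=4):
        self.workers = workers

    def run(self, items):
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            # pool.map returns results in input order.
            return list(pool.map(run_workflow, items))


def execute(director, items):
    """The workflow is unchanged; only the director decides how it runs."""
    return director.run(items)


print(execute(SequentialDirector(), [" acgt ", "ttga "]))
print(execute(ThreadedDirector(), [" acgt ", "ttga "]))
```

Both directors produce the same results here; the point is that the choice between them is made at execution time, not baked into the workflow.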
* COMAD extension to Kepler (Shawn Bowers and Timothy McPhillips)
* Need to provide first-class support for selecting, filtering, transforming collections in workflows
* Manish Anand's work on recording provenance as logic language statements maps quite closely to provenance as a graph in a graph DB
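The correspondence between logic-style provenance statements and a graph can be shown with a small sketch: each fact is an edge, and a lineage query is a graph traversal. The predicate names `used` and `generated` and the job and file names are assumptions for illustration, not the notation of the cited work.

```python
# Provenance recorded as logic-style facts: (predicate, job, dataset).
FACTS = [
    ("used", "align_job", "reads.fastq"),
    ("used", "align_job", "genome.fa"),
    ("generated", "align_job", "hits.bam"),
    ("used", "count_job", "hits.bam"),
    ("generated", "count_job", "counts.tsv"),
]


def derived_from(dataset, facts):
    """Return every dataset that `dataset` depends on, directly or transitively."""
    # Read the facts as graph edges: output -> producing job -> inputs.
    producers = {obj: job for pred, job, obj in facts if pred == "generated"}
    inputs = {}
    for pred, job, obj in facts:
        if pred == "used":
            inputs.setdefault(job, []).append(obj)

    ancestors, stack = set(), [dataset]
    while stack:
        job = producers.get(stack.pop())
        for parent in inputs.get(job, []):
            if parent not in ancestors:
                ancestors.add(parent)
                stack.append(parent)
    return ancestors


print(sorted(derived_from("counts.tsv", FACTS)))
```

In a graph database the same question would be a path query over these edges rather than a hand-written traversal.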

Slide adapted from Ben Lorica

* Data model in Galaxy is data/type to data/type
* To integrate workflow with insight requires integrating workflow system with model storage and query system
* Possibly embed Galaxy workflow (e.g. Refinery, https://github.com/parklab/refinery-platform)

What next?

  • SANBI cluster support for Docker (October 2015)
  • Implement RNASeq and genome annotation in Galaxy workflows
  • BioJS already in use in Bass Explorer (bassex) at SANBI, increase use and contribute
  • Re-engineer Galaxy workflow engine (results by mid 2016)

Thank you

GCC 2015

Many thanks to Alan Christoffels and Olabode Ajayi for supporting and collaborating on this research.
