Biological file format & Data submission

Preecha Patumcharoenpol

Goal

Understand the concept of a file format, and gain hands-on experience with commonly used file formats and tools.

Preparation, Validation, Conversion

The problem?

File format is for

Examples

Sequence: FASTA, GENBANK, EMBL, GCG, GFF, BED
Alignment: CIGAR, SAM, PSL, BAM, BLAST
Data: SBML, KGML
Common: Delimited (Tabular), XML, JSON

http://xkcd.com/927/

FASTA

>SPLC1_S230110 putative signaling protein with GGDEF and EAL domain protein [Arthrospira platensis C1]
MLSLVAKIIQNLVRDTDLLARLGGDEFVIVLEDLEATNEATRVAERILESLRSSPLQVGK
RDVFVNSSIGIVVRTNRHEKAEDLLRDADLAMYRAKHEGRGRYAIFDPLMHFQAVQQMHL
ENDLRKAIENNQLVLYYQPIVNIKNQRIQGLEALVRWQHPERGLLAPGHFINIAENTGLI
IPIGRWLLHTACQQLAEWENQFPHHFLKMSVNLSVKQLDIFLLEQLDEVLNNYNLKQNSL
VLEITESMLVANIEKTCDLLNQIKAKGIGLSIDDFGTGYSSLSYLHQLPVNSLKIDRSFV
SPANLSDRHQVIAKSIIALSKLLKLHVIAEGVETPEQFHWLKKLGCEAAQGYLFSRPVPA
SDITEL

>gi|493673229|ref|WP_006623555.1| MULTISPECIES: diguanylate cyclase [Arthrospira]
MLSLVAKIIQNLVRDTDLLARLGGDEFVIVLEDLEATNEATRVAERILESLRSSPLQVGKRDVFVNSSIG
IVVRTNRHEKAEDLLRDADLAMYRAKHEGRGRYAIFDPLMHFQAVQQMHLENDLRKAIENNQLVLYYQPI
VNIKNQRIQGLEALVRWQHPERGLLAPGHFINIAENTGLIIPIGRWLLHTACQQLAEWENQFPHHFLKMS
VNLSVKQLDIFLLEQLDEVLNNYNLKQNSLVLEITESMLVANIEKTCDLLNQIKAKGIGLSIDDFGTGYS
SLSYLHQLPVNSLKIDRSFVSPANLSDRHQVIAKSIIALSKLLKLHVIAEGVETPEQFHWLKKLGCEAAQ
GYLFSRPVPASDITEL

Multi-FASTA

>gi|459201371|ref|YP_007507330.1| 3-hydroxypropionic acid resistance peptide [Escherichia coli str. K-12 substr. MG1655]
MKPALRDFIAIVQERLASVTA
>gi|459201369|ref|NP_414883.5| 2-hydroxy-6-ketonona-2,4-dienedioic acid hydrolase [Escherichia coli str. K-12 substr. MG1655]
MSYQPQTEAATSRFLNVEEAGKTLRIHFNDCGQGDETVVLLHGSGPGATGWANFSRNIDP
LVEAGYRVILLDCPGWGKSDSVVNSGSRSDLNARILKSVVDQLDIAKIHLLGNSMGGHSS
VAFTLKWPERVGKLVLMGGGTGGMSLFTPMPTEGIKRLNQLYRQPTIENLKLMMDIFVFD
TSDLTDALFEARLNNMLSRRDHLENFVKSLEANPKQFPDFGPRLAEIKAQTLIVWGRNDR
FVPMDAGLRLLSGIAGSELHIFRDCGHWAQWEHADAFNQLVLNFLARP
>gi|459201370|ref|YP_007507329.1| Mn(2)-response protein, MntR-repressed [Escherichia coli str. K-12 substr. MG1655]
MNEFKRCMRVFSHSPFKVRLMLLSMLCDMVNNKPQQDKPSDK
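The two FASTA examples above wrap the same sequence at different line widths, so a record cannot be treated as "one header line plus one sequence line". A minimal sketch of counting records and measuring sequence lengths in a multi-FASTA file follows. The file contents, IDs and sequences are made up for illustration, and plain awk is used as a stand-in for gawk.

```shell
# Hypothetical mini multi-FASTA; IDs and sequences are invented for this sketch.
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>seq1 first example protein
MKPALRDFIA
IVQERL
>seq2 second example protein
MNEFKRC
EOF

# One record per header line, so counting ">" lines counts records.
n_records=$(grep -c '^>' "$fasta")
echo "records: $n_records"

# Add up every sequence line that follows a header until the next header.
lengths=$(awk '
    /^>/ { if (id != "") print id "\t" len; id = substr($1, 2); len = 0; next }
         { len += length($0) }
    END  { if (id != "") print id "\t" len }
' "$fasta")
echo "$lengths"
rm -f "$fasta"
```

Because the length is accumulated per header, the result does not depend on how the sequence lines are wrapped.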

Common format

Delimited (Tabular)
JSON
XML

Always check your data before doing anything.

cat blast.fmt7 | head -n50
cat blast.fmt7 | grep -v "#" | head -n50
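Beyond eyeballing the first lines, one quick sanity check is that every data row has the same number of columns. The sketch below assumes BLAST tabular output with comment lines (12 tab-separated columns by default). The sample rows are invented, and plain awk stands in for gawk.

```shell
# Hypothetical sample in BLAST tabular-with-comments layout.
blast=$(mktemp)
{
    printf '# Fields: query id, subject id, %% identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score\n'
    printf 'q1\ts1\t95.20\t250\t12\t0\t1\t250\t1\t250\t1e-150\t480\n'
    printf 'q1\ts2\t78.50\t200\t43\t2\t5\t204\t8\t205\t2e-90\t300\n'
    printf 'q2\ts3\t85.00\t100\t15\t0\t1\t100\t1\t100\t3e-40\t150\n'
} > "$blast"

# How many data rows are there once comment lines are dropped?
n_hits=$(grep -vc '^#' "$blast")

# Which column counts occur? A single value means the table is rectangular.
distinct_widths=$(grep -v '^#' "$blast" | awk -F'\t' '{ print NF }' | sort -u)

echo "hits: $n_hits, column counts seen: $distinct_widths"
rm -f "$blast"
```

If more than one column count shows up, some rows are truncated or contain stray delimiters and should be inspected before any filtering.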

Analysis

Filter by identity.

Get everything above 80% identity.

grep -v "#" blast.fmt7 | gawk '{if ($3 > 80) print}'

# Same thing
cat blast.fmt7 | gawk '/^[^#]/{if ($3 > 80) print}'
gawk '/^[^#]/{if ($3 > 80) print}' blast.fmt7

Filter by identity range (example bounds: above 80 and below 90).

cat blast.fmt7 | gawk '/^[^#]/{if ($3 > 80 && $3 < 90) print}'
# Or
cat blast.fmt7 | gawk -f 01_filter.awk
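The contents of 01_filter.awk are not shown here. As a rough sketch of what a filter kept in its own file could look like, the example below writes a hypothetical filter_sketch.awk and passes the thresholds in with -v. The file name, the variable names min and max, and the sample rows are all assumptions, and plain awk stands in for gawk.

```shell
workdir=$(mktemp -d)

# Hypothetical awk script: skip comments, keep hits with identity in [min, max].
cat > "$workdir/filter_sketch.awk" <<'EOF'
BEGIN { FS = "\t" }
/^#/  { next }
$3 + 0 >= min && $3 + 0 <= max { print }
EOF

# Invented sample hits in 12-column tabular layout.
{
    printf '# hypothetical sample hits\n'
    printf 'q1\ts1\t95.20\t250\t12\t0\t1\t250\t1\t250\t1e-150\t480\n'
    printf 'q1\ts2\t78.50\t200\t43\t2\t5\t204\t8\t205\t2e-90\t300\n'
    printf 'q2\ts3\t85.00\t100\t15\t0\t1\t100\t1\t100\t3e-40\t150\n'
} > "$workdir/blast.fmt7"

# Thresholds live on the command line, so the script file never needs editing.
kept=$(awk -v min=80 -v max=90 -f "$workdir/filter_sketch.awk" "$workdir/blast.fmt7" | cut -f1-3)
echo "$kept"
rm -rf "$workdir"
```

Keeping the filter in a file makes it easy to rerun with different thresholds and to keep under version control.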

Cut

grep -v "#" blast.fmt7 | gawk '{print}'
grep -v "#" blast.fmt7 | gawk '{print $1 "\t" $2 "\t" $3}'
# How about.
grep -v "#" blast.fmt7 | cut -d$'\t' -f1-3
# Even better
grep -v "#" blast.fmt7 | cut -d$'\t' -f1-3,11,12
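A small note on the cut commands above: cut already splits on TAB by default, so the -d option can be dropped for tab-delimited files. The sketch below compares both forms on invented sample rows; the explicit delimiter is built with printf so that it also works in shells without the $'\t' syntax.

```shell
# Invented sample hits in 12-column tabular layout.
blast=$(mktemp)
{
    printf '# hypothetical sample hits\n'
    printf 'q1\ts1\t95.20\t250\t12\t0\t1\t250\t1\t250\t1e-150\t480\n'
    printf 'q2\ts3\t85.00\t100\t15\t0\t1\t100\t1\t100\t3e-40\t150\n'
} > "$blast"

# Default delimiter: TAB. Columns 1-3 plus e-value (11) and bit score (12).
default_cut=$(grep -v '^#' "$blast" | cut -f1-3,11,12)

# The same selection with the delimiter spelled out.
tab=$(printf '\t')
explicit_cut=$(grep -v '^#' "$blast" | cut -d "$tab" -f1-3,11,12)

echo "$default_cut"
rm -f "$blast"
```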

Discuss

Why not Excel?

On the command line, you can save your work as a script.
Most work can be done with a combination of grep/sed/gawk/cut and other scripting tools.
Excel cannot easily perform complex operations.
- Some might argue that you can do it with VBA, but if you have to write a program anyway, you might as well write a Python/Perl script.
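To make the "save your work as a script" point concrete, here is a minimal sketch that stores a grep/awk/cut pipeline in a script file and reruns it with a chosen threshold. The script name filter_hits.sh, its two arguments and the sample rows are hypothetical, and plain awk stands in for gawk.

```shell
workdir=$(mktemp -d)

# Hypothetical reusable script: the pipeline is written once and then replayed.
cat > "$workdir/filter_hits.sh" <<'EOF'
#!/bin/sh
# Usage: filter_hits.sh BLAST_FMT7_FILE MIN_IDENTITY
# Prints query, subject and identity for hits at or above MIN_IDENTITY.
grep -v '^#' "$1" | awk -F'\t' -v min="$2" '$3 + 0 >= min' | cut -f1-3
EOF

# Invented sample hits in 12-column tabular layout.
{
    printf '# hypothetical sample hits\n'
    printf 'q1\ts1\t95.20\t250\t12\t0\t1\t250\t1\t250\t1e-150\t480\n'
    printf 'q1\ts2\t78.50\t200\t43\t2\t5\t204\t8\t205\t2e-90\t300\n'
    printf 'q2\ts3\t85.00\t100\t15\t0\t1\t100\t1\t100\t3e-40\t150\n'
} > "$workdir/blast.fmt7"

hits=$(sh "$workdir/filter_hits.sh" "$workdir/blast.fmt7" 80)
echo "$hits"
rm -rf "$workdir"
```

The same file can be rerun on new data or with a different threshold, which is hard to reproduce with manual spreadsheet steps.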

JSON (JavaScript Object Notation)

Data-interchange format

Easy to write and read (Human).
Easy to parse and generate (Machine).

It is the closest thing we have to an industry standard.

What it looks like

curl -s http://togows.org/entry/kegg-pathway/cre03440.json

Remember this one?

We had some difficulty when we tried to extract specific fields

curl -s http://rest.kegg.jp/get/path:cre03440

The reason is very simple: most of the tools we used assume that each record is complete on a single line.
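The one-record-per-line assumption can be illustrated with a small flat-file record in which one field continues over several indented lines. The record below only imitates that layout; its IDs and names are invented, and plain awk stands in for gawk.

```shell
# Hypothetical flat-file record; continuation lines start with spaces.
record=$(mktemp)
cat > "$record" <<'EOF'
ENTRY       map_demo                    Pathway
NAME        Demo pathway
GENE        GENE_0001  hypothetical protein A
            GENE_0002  hypothetical protein B
            GENE_0003  hypothetical protein C
COMPOUND    C00001  H2O
///
EOF

# A line-oriented search only finds the first line of the GENE field.
grep_hits=$(grep -c '^GENE' "$record")

# Remembering which field we are inside also recovers the continuation lines.
gene_ids=$(awk '
    /^[^ ]/ { in_gene = ($1 == "GENE") }   # any unindented line starts a new field
    in_gene { id = ($1 == "GENE") ? $2 : $1; print id }
' "$record")

echo "grep matched $grep_hits line(s)"
echo "$gene_ids"
rm -f "$record"
```

grep matches one line while the field holds three genes, which is the kind of bookkeeping that a structured format such as JSON makes unnecessary.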

Basic

echo '{"a" : 1, "b": "2"}' | jq '.'
echo '{"a" : 1, "b": "2"}' | jq '.a'
echo '{"a" : 1, "b": "2"}' | jq '.b'

Array

echo '[1, "item", 29, "anotheritem", "moreitem"]' | jq '.[0]'
echo '[1, "item", 29, "anotheritem", "moreitem"]' | jq '.[5]'

Hierarchical

DATA='[{"entry": 1, "data": {"inside": "deep"}}, {"entry": 2, "data": {"inside": "very deep"}}]'
echo "$DATA" | jq '.'
echo "$DATA" | jq '.[0]'
echo "$DATA" | jq '.[0] | keys'
echo "$DATA" | jq '.[0] | .entry '
echo "$DATA" | jq '.[0] | .data.inside'

echo "$DATA" | jq '.[] | .data.inside'
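Once nested values can be selected, a common next step is to flatten them into tab-delimited rows so that cut, grep and awk work on them again. The sketch below reuses the DATA value from above and relies on jq's array construction and @tsv formatter; it assumes jq is installed.

```shell
DATA='[{"entry": 1, "data": {"inside": "deep"}}, {"entry": 2, "data": {"inside": "very deep"}}]'

# Build one array per element, then print it as a tab-separated row.
tsv=$(printf '%s\n' "$DATA" | jq -r '.[] | [.entry, .data.inside] | @tsv')
echo "$tsv"

# The result is ordinary tabular text, so column tools apply again.
second_column=$(printf '%s\n' "$tsv" | cut -f2)
echo "$second_column"
```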

Intermission

http://togows.org is a web service that provides data in JSON format.

Direction: Build a command that prints all genes from the pathway.

Hint: I have already given you everything you need.

# curl -s http://rest.kegg.jp/get/path:cre03440
curl -s "http://togows.org/entry/kegg-pathway/cre03440.json" | # Put your code here

# -r: raw output; .[]: iterate over the elements; .genes: select the "genes" key; keys: list all keys; join("\n"): one key per line
curl -s "http://togows.org/entry/kegg-pathway/cre03440.json" | jq -r '.[] | .genes | keys | join("\n")'

Google it.
Just google it.

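# Element names are reconstructed from the XPath used below; <outside> is an assumed name
DATA='<start><inside>Data here</inside><outside>Data not here</outside></start>'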
echo "$DATA" | xmllint --format -
echo "$DATA" | xmllint --xpath '/start/inside/text()' -

Exercise

Download goo.gl/YTHDdT

curl -L goo.gl/YTHDdT > NC_005213.gbk

Exercise

Readseq

http://www.ebi.ac.uk/cgi-bin/readseq.cgi (EMBL-EBI)
http://www-bimas.cit.nih.gov/molbio/readseq/ (NIH)
