On GitHub: yumyai/2015_experimental_biology_fileformat

Biological file format & Data submission

Preecha Patumcharoenpol

Goal

Understand the concept of a file format, and get hands-on experience with commonly used file formats and tools:
  • Preparation, Validation, Conversion

The problem?

File format is for

Examples

  • Sequence: FASTA, GenBank, EMBL, GCG, GFF, BED
  • Alignment: CIGAR, SAM, PSL, BAM, BLAST
  • Data: SBML, KGML
  • Common: Delimited (Tabular), XML, JSON

http://xkcd.com/927/

FASTA

>SPLC1_S230110 putative signaling protein with GGDEF and EAL domain protein [Arthrospira platensis C1]
MLSLVAKIIQNLVRDTDLLARLGGDEFVIVLEDLEATNEATRVAERILESLRSSPLQVGK
RDVFVNSSIGIVVRTNRHEKAEDLLRDADLAMYRAKHEGRGRYAIFDPLMHFQAVQQMHL
ENDLRKAIENNQLVLYYQPIVNIKNQRIQGLEALVRWQHPERGLLAPGHFINIAENTGLI
IPIGRWLLHTACQQLAEWENQFPHHFLKMSVNLSVKQLDIFLLEQLDEVLNNYNLKQNSL
VLEITESMLVANIEKTCDLLNQIKAKGIGLSIDDFGTGYSSLSYLHQLPVNSLKIDRSFV
SPANLSDRHQVIAKSIIALSKLLKLHVIAEGVETPEQFHWLKKLGCEAAQGYLFSRPVPA
SDITEL
          
>gi|493673229|ref|WP_006623555.1| MULTISPECIES: diguanylate cyclase [Arthrospira]
MLSLVAKIIQNLVRDTDLLARLGGDEFVIVLEDLEATNEATRVAERILESLRSSPLQVGKRDVFVNSSIG
IVVRTNRHEKAEDLLRDADLAMYRAKHEGRGRYAIFDPLMHFQAVQQMHLENDLRKAIENNQLVLYYQPI
VNIKNQRIQGLEALVRWQHPERGLLAPGHFINIAENTGLIIPIGRWLLHTACQQLAEWENQFPHHFLKMS
VNLSVKQLDIFLLEQLDEVLNNYNLKQNSLVLEITESMLVANIEKTCDLLNQIKAKGIGLSIDDFGTGYS
SLSYLHQLPVNSLKIDRSFVSPANLSDRHQVIAKSIIALSKLLKLHVIAEGVETPEQFHWLKKLGCEAAQ
GYLFSRPVPASDITEL
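
The two records above appear to hold the same protein, wrapped at 60 and 70 columns, so a line-by-line comparison would call them different. A minimal sketch of how to compare sequences regardless of line width follows. The file name and the short records are made up for illustration, and plain awk is used here, though gawk works the same way.

```shell
# Two made-up records holding the same sequence, wrapped at different widths.
cat > wrap_demo.fasta <<'EOF'
>recA hypothetical record, wrapped at 6 columns
MLSLVA
KIIQNL
>recB hypothetical record, wrapped at 4 columns
MLSL
VAKI
IQNL
EOF

# Join the sequence lines of each record into one line, dropping the headers.
unwrapped=$(awk '
  /^>/ { if (seq != "") print seq; seq = ""; next }  # new record: flush the previous one
       { seq = seq $0 }                              # sequence line: append
  END  { if (seq != "") print seq }                  # flush the last record
' wrap_demo.fasta)

echo "$unwrapped"
# Count distinct sequences: 1 means every record holds the same sequence.
distinct=$(echo "$unwrapped" | sort -u | wc -l | tr -d ' ')
echo "distinct sequences: $distinct"
```

Line wrapping is presentation only, so it is safer to unwrap before comparing or hashing sequences.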
          

Multi-FASTA

>gi|459201371|ref|YP_007507330.1| 3-hydroxypropionic acid resistance peptide [Escherichia coli str. K-12 substr. MG1655]
MKPALRDFIAIVQERLASVTA
>gi|459201369|ref|NP_414883.5| 2-hydroxy-6-ketonona-2,4-dienedioic acid hydrolase [Escherichia coli str. K-12 substr. MG1655]
MSYQPQTEAATSRFLNVEEAGKTLRIHFNDCGQGDETVVLLHGSGPGATGWANFSRNIDP
LVEAGYRVILLDCPGWGKSDSVVNSGSRSDLNARILKSVVDQLDIAKIHLLGNSMGGHSS
VAFTLKWPERVGKLVLMGGGTGGMSLFTPMPTEGIKRLNQLYRQPTIENLKLMMDIFVFD
TSDLTDALFEARLNNMLSRRDHLENFVKSLEANPKQFPDFGPRLAEIKAQTLIVWGRNDR
FVPMDAGLRLLSGIAGSELHIFRDCGHWAQWEHADAFNQLVLNFLARP
>gi|459201370|ref|YP_007507329.1| Mn(2)-response protein, MntR-repressed [Escherichia coli str. K-12 substr. MG1655]
MNEFKRCMRVFSHSPFKVRLMLLSMLCDMVNNKPQQDKPSDK
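
A quick sanity check for a multi-FASTA file is to count its records and their lengths. The sketch below assumes that every record starts with a ">" header line. The file name and sequences are invented for the example.

```shell
# Made-up multi-FASTA file with three records.
cat > multi_demo.fasta <<'EOF'
>seq1 hypothetical short peptide
MKPALRDF
>seq2 hypothetical longer protein
MSYQPQTEAA
TSRFLNVEEA
>seq3 hypothetical peptide
MNEFKR
EOF

# Every record starts with ">", so counting header lines counts records.
n_records=$(grep -c '^>' multi_demo.fasta)
echo "records: $n_records"

# Print "ID<TAB>length" for each record.
lengths=$(awk '
  /^>/ { if (id != "") print id "\t" len; id = substr($1, 2); len = 0; next }
       { len += length($0) }
  END  { if (id != "") print id "\t" len }
' multi_demo.fasta)
echo "$lengths"
```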
          

Common format

  • Delimited (Tabular)
  • JSON
  • XML

Always check your data before doing anything.

cat blast.fmt7 | head -n50
cat blast.fmt7 | grep -v "#" | head -n50
            

Analysis

Filter by identity.

Get every hit with identity above 80%.

grep -v "#" blast.fmt7 | gawk '{if ($3 > 80) print}'

# Same thing
cat blast.fmt7 | gawk '/^[^#]/{if ($3 > 80) print}'
gawk '/^[^#]/{if ($3 > 80) print}' blast.fmt7
            

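Filter by an identity range.

# Identity between 80 and 90; the bounds are only an example
cat blast.fmt7 | gawk '/^[^#]/{if ($3 > 80 && $3 < 90) print}'
# Or keep the filter in a script file
cat blast.fmt7 | gawk -f 01_filter.awk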
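
The contents of 01_filter.awk are not shown here, so the following is only a guess at what such a script file could look like. The file names, the `min`/`max` variables, and the sample hits are all assumptions made for this sketch.

```shell
# Hypothetical stand-in for 01_filter.awk; the real script may differ.
cat > filter_demo.awk <<'EOF'
BEGIN { FS = "\t" }               # BLAST tabular output is tab-separated
/^#/  { next }                    # skip comment lines
$3 > min && $3 < max { print }    # keep hits whose identity lies inside the range
EOF

# Made-up hits: query, subject, identity (remaining columns omitted).
printf '# BLASTP demo\nq1\ts1\t95.0\nq1\ts2\t85.5\nq2\ts3\t42.1\n' > hits_demo.tsv

# Pass the bounds in with -v so the script itself stays unchanged.
kept=$(awk -v min=80 -v max=90 -f filter_demo.awk hits_demo.tsv)
echo "$kept"
```

Keeping the bounds outside the script means the same file can be reused for other thresholds.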
            

Cut

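# Print every column
grep -v "#" blast.fmt7 | gawk '{print}'
# Print the first three columns, tab-separated
grep -v "#" blast.fmt7 | gawk '{print $1 "\t" $2 "\t" $3}'
# The same with cut
grep -v "#" blast.fmt7 | cut -d$'\t' -f1-3
# Even better: add columns 11 and 12 (e-value and bit score in the default layout)
grep -v "#" blast.fmt7 | cut -d$'\t' -f1-3,11,12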
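
Column numbers are easy to mix up, so it can help to label the output. The sketch below assumes the default 12-column BLAST tabular layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The hit itself is invented.

```shell
# One made-up hit with the 12 default columns of BLAST tabular output.
printf '# demo\nq1\ts1\t98.5\t200\t3\t0\t1\t200\t5\t204\t1e-100\t370\n' > cols_demo.tsv

# Columns 1-3, 11 and 12: query, subject, identity, e-value, bit score.
# Tab is the default delimiter of cut, so -d is not needed.
picked=$(grep -v '^#' cols_demo.tsv | cut -f1-3,11,12)

# Prepend a header so the columns are self-describing.
printf 'qseqid\tsseqid\tpident\tevalue\tbitscore\n%s\n' "$picked"
```

If the BLAST run used a custom column list, the numbers above no longer apply; the "# Fields:" comment line in the fmt7 output names the columns actually present.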
          

Discuss

Why not Excel?

  • On the command line, you can save your work as a script.
  • Most work can be done with a combination of grep/sed/gawk/cut and other scripting.
  • Excel struggles with complex operations.
    • Some might argue that you can do them with VBA, but if you have to write a program anyway, you might as well write a Python/Perl script.
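
To make the first point concrete, here is a small sketch of a saved script. The script name, its argument order, and the sample data are illustrative only.

```shell
# Write a small reusable script (name and arguments are illustrative).
cat > filter_identity.sh <<'EOF'
#!/bin/sh
# Usage: filter_identity.sh MIN_IDENTITY FILE
# Print BLAST tabular hits whose identity (column 3) is above MIN_IDENTITY.
grep -v '^#' "$2" | awk -F'\t' -v min="$1" '$3 > min'
EOF

# Made-up hits to run it on.
printf '# demo\nq1\ts1\t95.0\nq2\ts2\t42.1\n' > script_demo.tsv

# The same analysis can now be repeated on any file with one command.
result=$(sh filter_identity.sh 80 script_demo.tsv)
echo "$result"
```

The script doubles as a record of exactly what was done to the data, which a sequence of mouse clicks in a spreadsheet does not give you.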

JSON (JavaScript Object Notation)

Data-interchange format

  • Easy to write and read (Human).
  • Easy to parse and generate (Machine).

It is the closest thing we have to an industry standard.

What it looks like

curl -s http://togows.org/entry/kegg-pathway/cre03440.json
          

Remember this one?

We had some difficulty when we tried to extract specific fields.

curl -s http://rest.kegg.jp/get/path:cre03440
            

The reason is simple: most of the tools we used assume that each record fits on one line.
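
A minimal sketch of the problem, assuming a KEGG-style flat layout in which a field name starts the line and continuation lines are indented. The entry and gene names below are invented, not real KEGG data.

```shell
# Made-up flat-file entry: the GENE field continues over indented lines.
cat > kegg_demo.txt <<'EOF'
ENTRY       demo00001         Pathway
GENE        GENE_A  first hypothetical gene
            GENE_B  second hypothetical gene
            GENE_C  third hypothetical gene
COMPOUND    C00001  hypothetical compound
///
EOF

# A line-oriented grep only sees the first line of the field.
grep -c '^GENE' kegg_demo.txt

# awk has to remember which field it is in while reading continuation lines.
genes=$(awk '
  /^[A-Z]/ { in_gene = ($1 == "GENE"); if (in_gene) print $2; next }  # a field name starts a new field
  /^ /     { if (in_gene) print $1 }                                  # indented line continues the field
' kegg_demo.txt)
echo "$genes"
```

With JSON the structure is explicit, so no such state tracking is needed.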

Basic

echo '{"a" : 1, "b": "2"}' | jq '.'
echo '{"a" : 1, "b": "2"}' | jq '.a'
echo '{"a" : 1, "b": "2"}' | jq '.b'
          

Array

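# Indices start at 0, so this prints the first element
echo '[1, "item", 29, "anotheritem", "moreitem"]' | jq '.[0]'
# Index 5 is past the end (valid indices are 0-4), so jq prints null
echo '[1, "item", 29, "anotheritem", "moreitem"]' | jq '.[5]'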
          

Hierarchical

DATA='[{"entry": 1, "data": {"inside": "deep"}}, {"entry": 2, "data": {"inside": "very deep"}}]'
echo "$DATA" | jq '.'
echo "$DATA" | jq '.[0]'
echo "$DATA" | jq '.[0] | keys'
echo "$DATA" | jq '.[0] | .entry'
echo "$DATA" | jq '.[0] | .data.inside'

echo "$DATA" | jq '.[] | .data.inside'
          

Intermission

http://togows.org is a web service that provides data in JSON format.

Task: build a command that prints all genes from the pathway.

Hint: I have already given you everything you need.

# curl -s http://rest.kegg.jp/get/path:cre03440
curl -s "http://togows.org/entry/kegg-pathway/cre03440.json" | # Put your code here
          

# -r            raw output
# .[]           select the current entry
# .genes        select the key "genes"
# |             pipe
# keys          list all keys
# join("\n")    join them with newlines
curl -s "http://togows.org/entry/kegg-pathway/cre03440.json" | jq -r '.[] .genes | keys | join("\n")'
          

  • Google it.
  • Just google it.
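
XML

# Element names <start> and <inside> come from the XPath below; <outside> is an assumed name
DATA='<start><inside>Data here</inside><outside>Data not here</outside></start>'
echo "$DATA" | xmllint --format -
echo "$DATA" | xmllint --xpath '/start/inside/text()' -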
          

Exercise

Download goo.gl/YTHDdT

curl -L goo.gl/YTHDdT > NC_005213.gbk
          

Exercise

Readseq

  • http://www.ebi.ac.uk/cgi-bin/readseq.cgi (EMBL-EBI)
  • http://www-bimas.cit.nih.gov/molbio/readseq/ (NIH)