Preecha Patumcharoenpol
http://xkcd.com/927/
>SPLC1_S230110 putative signaling protein with GGDEF and EAL domain protein [Arthrospira platensis C1]
MLSLVAKIIQNLVRDTDLLARLGGDEFVIVLEDLEATNEATRVAERILESLRSSPLQVGK
RDVFVNSSIGIVVRTNRHEKAEDLLRDADLAMYRAKHEGRGRYAIFDPLMHFQAVQQMHL
ENDLRKAIENNQLVLYYQPIVNIKNQRIQGLEALVRWQHPERGLLAPGHFINIAENTGLI
IPIGRWLLHTACQQLAEWENQFPHHFLKMSVNLSVKQLDIFLLEQLDEVLNNYNLKQNSL
VLEITESMLVANIEKTCDLLNQIKAKGIGLSIDDFGTGYSSLSYLHQLPVNSLKIDRSFV
SPANLSDRHQVIAKSIIALSKLLKLHVIAEGVETPEQFHWLKKLGCEAAQGYLFSRPVPA
SDITEL
>gi|493673229|ref|WP_006623555.1| MULTISPECIES: diguanylate cyclase [Arthrospira]
MLSLVAKIIQNLVRDTDLLARLGGDEFVIVLEDLEATNEATRVAERILESLRSSPLQVGKRDVFVNSSIG
IVVRTNRHEKAEDLLRDADLAMYRAKHEGRGRYAIFDPLMHFQAVQQMHLENDLRKAIENNQLVLYYQPI
VNIKNQRIQGLEALVRWQHPERGLLAPGHFINIAENTGLIIPIGRWLLHTACQQLAEWENQFPHHFLKMS
VNLSVKQLDIFLLEQLDEVLNNYNLKQNSLVLEITESMLVANIEKTCDLLNQIKAKGIGLSIDDFGTGYS
SLSYLHQLPVNSLKIDRSFVSPANLSDRHQVIAKSIIALSKLLKLHVIAEGVETPEQFHWLKKLGCEAAQ
GYLFSRPVPASDITEL
>gi|459201371|ref|YP_007507330.1| 3-hydroxypropionic acid resistance peptide [Escherichia coli str. K-12 substr. MG1655]
MKPALRDFIAIVQERLASVTA
>gi|459201369|ref|NP_414883.5| 2-hydroxy-6-ketonona-2,4-dienedioic acid hydrolase [Escherichia coli str. K-12 substr. MG1655]
MSYQPQTEAATSRFLNVEEAGKTLRIHFNDCGQGDETVVLLHGSGPGATGWANFSRNIDP
LVEAGYRVILLDCPGWGKSDSVVNSGSRSDLNARILKSVVDQLDIAKIHLLGNSMGGHSS
VAFTLKWPERVGKLVLMGGGTGGMSLFTPMPTEGIKRLNQLYRQPTIENLKLMMDIFVFD
TSDLTDALFEARLNNMLSRRDHLENFVKSLEANPKQFPDFGPRLAEIKAQTLIVWGRNDR
FVPMDAGLRLLSGIAGSELHIFRDCGHWAQWEHADAFNQLVLNFLARP
>gi|459201370|ref|YP_007507329.1| Mn(2)-response protein, MntR-repressed [Escherichia coli str. K-12 substr. MG1655]
MNEFKRCMRVFSHSPFKVRLMLLSMLCDMVNNKPQQDKPSDK
Always check your data before doing anything.
cat blast.fmt7 | head -n50
cat blast.fmt7 | grep -v "#" | head -n50
Filter by identity.
Get everything above 80% identity.
grep -v "#" blast.fmt7 | gawk '{if ($3 > 80) print}'
# Same thing
cat blast.fmt7 | gawk '/^[^#]/{if ($3 > 80) print}'
gawk '/^[^#]/{if ($3 > 80) print}' blast.fmt7
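To see what the filter does without a real BLAST run, here is a small sketch on made-up data. The three hits and their identity values are invented for illustration; only the layout (comment lines starting with `#`, percent identity in column 3) follows the commands above. Plain `awk` stands in for `gawk` so the sketch runs on most systems.

```shell
# Made-up sample in the commented tabular layout: two comment lines, three hits.
# Columns here: query, subject, percent identity (later columns left out).
sample=$(
  printf '%s\n' '# BLASTP' '# Fields: query id, subject id, % identity'
  printf '%s\t%s\t%s\n' q1 s1 95.50 q1 s2 80.00 q2 s3 42.10
)

# Keep non-comment lines whose identity (column 3) is strictly above 80.
filtered=$(printf '%s\n' "$sample" | awk '/^[^#]/ && $3 > 80')
printf '%s\n' "$filtered"
```

The hit at exactly 80.00 is dropped because the comparison is strict; use `>=` if the threshold itself should pass.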
Filter by identity.
# Keep hits whose identity lies in a range; the upper bound of 90 is an assumed value
cat blast.fmt7 | gawk '/^[^#]/{if ($3 > 80 && $3 < 90) print}'
# Or
cat blast.fmt7 | gawk -f 01_filter.awk
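The contents of 01_filter.awk are not shown in these slides. The following is a minimal sketch of what such a script file might look like, assuming it filters identity to a range; the bounds 80 and 90, the temporary file, and the sample hits are all placeholders, not the author's values.

```shell
# Write a hypothetical filter script to a temporary file
# (the real 01_filter.awk is not shown, so this is only a guess at its shape).
script=$(mktemp)
cat > "$script" <<'EOF'
# Skip comment lines; keep hits whose identity (column 3) lies between lo and hi.
BEGIN { lo = 80; hi = 90 }   # assumed bounds
/^[^#]/ && $3 > lo && $3 < hi { print }
EOF

# Run it on three made-up hits; only the one with identity 85.0 is in range.
in_range=$(printf '%s\t%s\t%s\n' q1 s1 95.5 q1 s2 85.0 q2 s3 42.1 | awk -f "$script")
printf '%s\n' "$in_range"
rm -f "$script"
```

Keeping the condition in a file makes longer filters easier to read and reuse than a one-liner quoted on the command line.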
grep -v "#" blast.fmt7 | gawk '{print}'
grep -v "#" blast.fmt7 | gawk '{print $1 "\t" $2 "\t" $3}'
# How about this?
grep -v "#" blast.fmt7 | cut -d$'\t' -f1-3
# Even better
grep -v "#" blast.fmt7 | cut -d$'\t' -f1-3,11,12
Why not Excel?
Data-interchange format
JSON is the closest thing we have to an industry standard for data interchange.
curl -s http://togows.org/entry/kegg-pathway/cre03440.json
We had some difficulty when we tried to extract specific fields.
curl -s http://rest.kegg.jp/get/path:cre03440
The reason is simple: most of the tools we have used assume that each record fits on one line.
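A small sketch of the problem, using a made-up record laid out like a flat-file entry in which a field name appears once and its values continue on indented lines. The entry ID and gene names are invented for illustration.

```shell
# Made-up multi-line record: the GENE field spans three lines.
record=$(printf '%s\n' \
  'ENTRY       demo00001' \
  'GENE        g1  first gene' \
  '            g2  second gene' \
  '            g3  third gene' \
  'COMPOUND    C00001')

# Line-oriented search: only the line that carries the field name matches.
naive=$(printf '%s\n' "$record" | grep -c '^GENE')

# State-aware awk: remember the current field, so continuation lines count too.
full=$(printf '%s\n' "$record" | awk '
  /^[A-Z]/ { field = $1 }      # a new field name starts in column 1
  field == "GENE" { n++ }      # this line belongs to the GENE block
  END { print n }')

echo "grep sees $naive gene line, awk sees $full"
```

The extra bookkeeping is the price of a multi-line layout, which is why a structured format that a dedicated tool can parse is easier to work with.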
Basic
echo '{"a" : 1, "b": "2"}' | jq '.'
echo '{"a" : 1, "b": "2"}' | jq '.a'
echo '{"a" : 1, "b": "2"}' | jq '.b'
Array
echo '[1, "item", 29, "anotheritem", "moreitem"]' | jq '.[0]'
echo '[1, "item", 29, "anotheritem", "moreitem"]' | jq '.[5]'
Hierarchical
DATA='[{"entry": 1, "data": {"inside": "deep"}}, {"entry": 2, "data": {"inside": "very deep"}}]'
echo "$DATA" | jq '.'
echo "$DATA" | jq '.[0]'
echo "$DATA" | jq '.[0] | keys'
echo "$DATA" | jq '.[0] | .entry'
echo "$DATA" | jq '.[0] | .data.inside'
echo "$DATA" | jq '.[] | .data.inside'
http://togows.org is a web service that provides data in JSON format.
Directions: Build a command that prints all genes from the pathway.
Hint: I have already given you everything you need.
# curl -s http://rest.kegg.jp/get/path:cre03440
curl -s "http://togows.org/entry/kegg-pathway/cre03440.json" | # Put your code here
# -r       raw output
# .[]      select current
# .genes   select key ("genes")
# | keys   pipe, list all keys
# | join   join
curl -s "http://togows.org/entry/kegg-pathway/cre03440.json" |
  jq -r '.[] .genes | keys | join("\n")'
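# Element names start and inside come from the XPath below; outside is an assumed name
DATA='<start><inside>Data here</inside><outside>Data not here</outside></start>'
echo "$DATA" | xmllint --format -
echo "$DATA" | xmllint --xpath '/start/inside/text()' -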
Download goo.gl/YTHDdT
curl -L goo.gl/YTHDdT > NC_005213.gbk
Readseq