[Biojava-l] FASTA header, loc attribute question

JP jp at javaclass.co.uk
Mon May 11 06:05:43 EDT 2009


Hi there at Biojava,

I have two FASTA files: one containing amino acid sequences and the other
containing DNA sequences.

In the AA FASTA file I have something like:

>FBpp0077713 type=protein;
loc=2L:join(384551..384894,385701..385746,386308..386576,386703..387270);
ID=FBpp0077713; name=al-PA; parent=FBgn0000061,FBtr0078053;
dbxref=FlyBase:FBpp0077713,GB_protein:AAF51505.1,GB_protein:AAF51505,FlyBase_Annotation_IDs:CG3935-PA,REFSEQ:NP_722629;
MD5=64a866db3e2913b97a2158c2de9d02f6; length=408; release=r5.9;
species=Dmel;
MGISEEIKLEELPQEAKLAHPDAVVLVDRAPGSSAASAGAALTVSMSVSG
GAPSGASGASGGTNSPVSDGNSDCEADEYAPKRKQRRYRTTFTSFQLEEL...
etc etc etc

I would like to parse this header line, in particular the loc attribute, and
use it to extract the corresponding region from the matching entry in the DNA
FASTA file (so that I get the genomic sequence for the protein):

>FBgn0000061 type=gene; loc=2L:378116..387439; ID=FBgn0000061; name=al;
dbxref=FlyBase:FBgn0000061,FlyBase:FBan0003935,FlyBase_Annotation_IDs:CG3935,GB:AE003589,GB_protein:AAF51505,GB:AY121696,GB_protein:AAM52023,GB:BI485174,GB:CZ486795,GB:L08401,GB_protein:AAA28840,UniProt/Swiss-Prot:Q06453,INTERPRO:IPR000047,INTERPRO:IPR001356,INTERPRO:IPR003654,INTERPRO:IPR009057,INTERPRO:IPR012287,bdgpinsituexpr:al,dedb:5830,drsc:FBgn0000061,flight:FBgn0000061,flyatlas:FBgn0000061,flyexpress:FBgn0000061,flygrid:59464,flymine:FBgn0000061,geo:FBgn0000061,hdri:FBgn0000061,if:/gene/aristal.htm,orthologs:ensANOGA:ENSANGP00000011877,orthologs:ensBOSTA:ENSBTAP00000015907,orthologs:ensCANFA:ENSCAFP00000009888,orthologs:ensGALGA:ENSGALP00000005255,orthologs:ensHOMSA:ENSP00000298420,orthologs:ensMACMU:ENSMMUP00000007349,orthologs:ensMONDO:ENSMODP00000008388,orthologs:ensPANTR:ENSPTRP00000004281,orthologs:ensRATNO:ENSRNOP00000027186,orthologs:ensTETNI:GSTENP00015517001,orthologs:graORYSA:Q6YYB8,orthologs:graORYSA:Q8W0T5,orthologs:modCAEEL:WBGene00044330,orthologs:modDANRE:ZDB-GENE-990415-15,orthologs:modMUSMU:MGI:1097716,panther:FBgn0000061;
cyto_range=21C1-21C1; gbunit=AE014134; MD5=0f5568cf13aeb2c7076f11b1ce3d6b2f;
length=9324; release=r5.9; species=Dmel;
GTAGTTTGCTGCCGGCTCTGGAACAGCCCGGTCATCTCGTCGCGTTCGGT
TCCGATTCCGATTCGAATAGTCGAGCTGGGGATACATTGTTGTTTCCGGG
etc etc etc
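
To make the first step concrete, this is roughly how I imagine splitting such a
header into its attributes. It is only a sketch in plain Java and uses no
BioJava calls; the class and method names (FlyBaseHeader, id, attributes) are
made up for illustration. It assumes the header is a single line in the real
file (the wrapping above comes from my mail client) and that ';' and '=' never
occur inside a value.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration only; not a BioJava class. */
class FlyBaseHeader {

    /** Returns the ID token that follows '>' (e.g. "FBpp0077713"). */
    static String id(String headerLine) {
        String h = headerLine.startsWith(">") ? headerLine.substring(1) : headerLine;
        int space = h.indexOf(' ');
        return space < 0 ? h.trim() : h.substring(0, space);
    }

    /** Splits the "key=value;" pairs after the ID into an ordered map. */
    static Map<String, String> attributes(String headerLine) {
        Map<String, String> attrs = new LinkedHashMap<>();
        int space = headerLine.indexOf(' ');
        if (space < 0) {
            return attrs; // header holds an ID only
        }
        // Assumption: ';' separates fields and the first '=' separates key from value.
        for (String field : headerLine.substring(space + 1).split(";")) {
            int eq = field.indexOf('=');
            if (eq > 0) {
                attrs.put(field.substring(0, eq).trim(), field.substring(eq + 1).trim());
            }
        }
        return attrs;
    }
}
```

With this, FlyBaseHeader.attributes(header).get("loc") would hand back the raw
loc string, which still has to be interpreted.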

I understand this is not exactly a conventional FASTA header, but does BioJava
support parsing the loc attribute (join, complement, etc.)?
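
To show what I mean by extraction, here is a rough sketch in plain Java, again
without any BioJava calls and with invented names (LocExtractor, ranges,
extract, reverseComplement). It assumes that coordinates are 1-based and
inclusive, that the gene and the feature lie on the same arm, that the gene
sequence in the DNA file is given on the forward strand starting at the gene's
first coordinate, and that complement(...) wraps the whole location.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not a BioJava class. */
class LocExtractor {

    private static final Pattern RANGE = Pattern.compile("(\\d+)\\.\\.(\\d+)");

    /** Returns the {start, end} pairs of a loc value, in the order written. */
    static List<int[]> ranges(String loc) {
        List<int[]> out = new ArrayList<>();
        Matcher m = RANGE.matcher(loc);
        while (m.find()) {
            out.add(new int[] { Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)) });
        }
        return out;
    }

    /** Cuts the feature's ranges out of the gene sequence and joins them. */
    static String extract(String geneLoc, String geneSeq, String featureLoc) {
        // Assumption: geneSeq[0] corresponds to the first coordinate of geneLoc.
        int geneStart = ranges(geneLoc).get(0)[0];
        StringBuilder joined = new StringBuilder();
        for (int[] r : ranges(featureLoc)) {
            // Convert 1-based inclusive genome coordinates to 0-based offsets.
            joined.append(geneSeq, r[0] - geneStart, r[1] - geneStart + 1);
        }
        // Assumption: complement(...) applies to the joined sequence as a whole.
        return featureLoc.contains("complement(")
                ? reverseComplement(joined.toString())
                : joined.toString();
    }

    /** Reverse complement; anything other than A, C, G, T becomes N. */
    static String reverseComplement(String dna) {
        StringBuilder sb = new StringBuilder(dna.length());
        for (int i = dna.length() - 1; i >= 0; i--) {
            switch (Character.toUpperCase(dna.charAt(i))) {
                case 'A': sb.append('T'); break;
                case 'C': sb.append('G'); break;
                case 'G': sb.append('C'); break;
                case 'T': sb.append('A'); break;
                default:  sb.append('N'); break;
            }
        }
        return sb.toString();
    }
}
```

For the example above, the four joined ranges add up to 344 + 46 + 269 + 568 =
1227 bases, i.e. 409 codons, which fits length=408 plus a stop codon. I would
rather use something that already exists in BioJava than maintain this myself.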

Many Thanks
JP


