[Biopython] missing fields in SeqIO EMBL parser?

Wim De Smet Wim.DeSmet at UGent.be
Fri May 7 14:59:36 UTC 2010


On 07-05-10 16:50, Peter wrote:
> On Fri, May 7, 2010 at 3:36 PM, Wim De Smet<Wim.DeSmet at ugent.be>  wrote:
>>
>> Sure, take this record:
>> http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-page+EntryPage+-id+7BIdF1bEbRt+-e+[EMBL:FJ904258]+-vn+2
>>
>> I'm looking for the data from the database cross reference lines (DR), i.e.:
>> DR   RFAM; RF00177; SSU_rRNA_5.
>> DR   SILVA-SSU; FJ904258.
>>
>> I assumed this would be in the record.dbxrefs field, but it's empty when I
>> parse this file. It's more of a nice to have than anything else at this
>> point, but I'll have to figure out another way to get a hold of these
>> elements then.
>
> That was also left as a TODO - the dbxrefs list is normally used for single
> identifiers - here it would be "RFAM:RF00177" and "SILVA-SSU:FJ904258"
> for consistency with the other parsers. At the time I was undecided on how
> to handle any secondary identifier. Would you need/want this too? Maybe
> as "RFAM:RF00177:SSU_rRNA_5"?

I don't really need it as such, I'm just parsing the file and dropping
the fields into the database, so they could be in there verbatim for all I
care. (I'm not even sure what the secondary identifier means in this case.)

For what I'm doing, the easiest fix would really be if the parser took
the lines it didn't understand and just added them to the record anyway
as extra 'stuff' that I can extract the rest from.

For example, for those DR lines it might look a bit like this:
 >>> print record.unknown['DR']
('RFAM; RF00177; SSU_rRNA_5.', 'SILVA-SSU; FJ904258.')
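
As a stopgap, the same information can be pulled straight out of the flat
file. The following is only a rough sketch in plain Python, not anything
Biopython provides: the function names collect_lines_by_code and
dr_to_dbxref are made up for this example, and it assumes the usual EMBL
layout of a two-letter line code followed by data starting in column 6.

```python
from collections import defaultdict


def collect_lines_by_code(handle, wanted=("DR",)):
    """Group the raw data of EMBL lines whose two-letter code is in `wanted`.

    Hypothetical helper: reads only the first record (up to the '//' line).
    """
    found = defaultdict(list)
    for line in handle:
        code = line[:2]
        if code in wanted:
            # Assumed layout: "XX   data", so the data starts at index 5.
            found[code].append(line[5:].strip())
        elif line.startswith("//"):
            break  # end of the first record
    return dict((code, tuple(values)) for code, values in found.items())


def dr_to_dbxref(dr_line):
    """Turn 'RFAM; RF00177; SSU_rRNA_5.' into 'RFAM:RF00177'.

    Hypothetical helper: keeps database name and primary identifier only,
    dropping any secondary identifier.
    """
    fields = [field.strip() for field in dr_line.rstrip(".").split(";")]
    return fields[0] + ":" + fields[1]


# The two DR lines quoted above, followed by the record terminator.
example = [
    "DR   RFAM; RF00177; SSU_rRNA_5.\n",
    "DR   SILVA-SSU; FJ904258.\n",
    "//\n",
]
unknown = collect_lines_by_code(example)
print(unknown["DR"])
print([dr_to_dbxref(entry) for entry in unknown["DR"]])
```

The second helper produces the "database:identifier" form mentioned above
for the dbxrefs list. A parser that kept unrecognised lines itself would
make this sort of workaround unnecessary.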

That way, you'd be (sorta) Future Proof(TM). Just a suggestion anyway. 
Thanks for taking the time to respond.

cheers,
Wim

-- 
Wim De Smet
http://www.straininfo.net/


