[Biopython] Can the GenBank/EMBL parser recover from errors?

Peter biopython at maubp.freeserve.co.uk
Fri May 14 13:27:34 UTC 2010


On Wed, May 5, 2010 at 7:09 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Peter wrote:
>> I'd suggest you use Bio.SeqIO.index instead (assuming the file isn't
>> so corrupt that it can't be scanned to identify each record). Just
>> wrap each record access in an error handler.
>
> That approach should now work with the latest code on the trunk.
> Up until recently the EMBL index code was not picking up on the
> AC line which can be used for the record.id in the parser. This
> didn't seem to matter for the EMBL files in our unit tests, but does
> for those from the IMGT:
>
> http://github.com/biopython/biopython/commit/e3fb9f7b643099042cb7188f383f256b36befb52

That fix was a bit premature (I rushed it); see the follow-up
revision of 6 May 2010:
http://github.com/biopython/biopython/commit/06af841fde2b94c06bee0cbf81ed84c0dfa7f314

On 12 May 2010, as Bug 3069 comment 11, Uri wrote:
http://bugzilla.open-bio.org/show_bug.cgi?id=3069#c11
> Also note that the SeqIO.index function doesn't treat the IMGT headers
> correctly, so it's not possible to access any of the records from the index it
> creates (this was also addressed in my patch where I subclassed an
> independent IMGT parser).

Could you clarify what is going wrong? I've tried this file:
http://imgt.cines.fr/download/LIGM-DB/imgt.dat.Z

>>> from Bio import SeqIO
>>> data = SeqIO.index("imgt.dat", "embl")
>>> len(data)
145795
>>> data.keys()[:10]
['EU619982', 'E00551', 'AX616599', 'U21449', 'AY885180', 'AY885181',
'AY885182', 'AY885183', 'AF273409', 'AF273408']
>>> data["EU619982"]
SeqRecord(seq=Seq('AGCTGGGCCTCAGTGAAAACCCTCCTGCTAGCCTCTGGATACAGGTTGACTAGT...CCA',
IUPACAmbiguousDNA()), id='EU619982', name='EU619982',
description='Homo sapiens clone SeqHK32 immunoglobulin heavy chain
variable region mRNA, partial cds.  ; mRNA; rearranged configuration;
Ig-Heavy; regular; group IGHV.', dbxrefs=[])
>>> data["E00551"]
SeqRecord(seq=Seq('GGCCTCCTCCGGGGGGGCTGGAACGACGTGG',
IUPACAmbiguousDNA()), id='E00551', name='E00551', description='Genomic
DNA fragment encoding human antibody D gene on h-chain.  ; unassigned
DNA; unknown configuration; Ig-Heavy; regular.', dbxrefs=[])

Of course, accessing a "broken" record like AF273408 raises a
LocationParserError because of the location 1..445>, and the same goes
for any other record with a malformed location.
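
To make the "wrap each record access in an error handler" suggestion
concrete, here is a rough sketch rather than a tested recipe. The helper
names read_records_tolerantly and scan_embl_file are made up for this
example, and catching the broad Exception class is an assumption on my
part; you may prefer to catch only the specific parser error you see.

```python
def read_records_tolerantly(index):
    """Try every key in a dict-like index, one record at a time.

    Returns (good_keys, failed), where failed maps each key whose
    record could not be parsed to the exception that was raised.
    Hypothetical helper, not part of Biopython.
    """
    good_keys = []
    failed = {}
    for key in index:
        try:
            index[key]  # parsing happens here, on access
        except Exception as err:  # assumption: any parser error means "skip"
            failed[key] = err
        else:
            good_keys.append(key)
    return good_keys, failed


def scan_embl_file(filename):
    """Index an EMBL file and report which records fail to parse."""
    # Imported here so the helper above can be used without Biopython.
    from Bio import SeqIO

    data = SeqIO.index(filename, "embl")
    good_keys, failed = read_records_tolerantly(data)
    print("%i records parsed, %i failed" % (len(good_keys), len(failed)))
    for key, err in failed.items():
        print("%s: %s" % (key, err))
    return good_keys, failed
```

With the IMGT file above this would be called as
scan_embl_file("imgt.dat"). The index only scans for record boundaries
and identifiers, so a bad record should only cause trouble when it is
actually fetched, which is why the try/except sits around the lookup.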

Peter


