[Biopython] Can the GenBank/EMBL parser recover from errors?

Peter Cock p.j.a.cock at googlemail.com
Wed Apr 28 22:11:43 UTC 2010


On Wednesday, April 28, 2010, Uri Laserson <laserson at mit.edu> wrote:
> Hi,
>
> I am trying to parse a large file of EMBL records that I know has some
> errors in it.  However, rather than having the parser break when it gets to
> the error, I'd rather it just skip that record, and move on to the next one.
>  I was wondering if this functionality is already built in somewhere.  One
> way I can do this is like this:
>
> iterator = iter(SeqIO.parse(ip, 'embl'))
> while True:
>     try:
>         # next() works on both Python 2.6+ and Python 3
>         record = next(iterator)
>     # Now I specify all the parsing errors I want to catch:
>     except LocationParserError:
>         # Reinitialize iterator at current file position. The iterator
>         # then skips to the beginning of the next record and continues.
>         iterator = iter(SeqIO.parse(ip, 'embl'))
>     except StopIteration:
>         break
>
> This way, whenever there is a parsing error, I just reinitialize the
> iterator at the current file position, and it seeks to the beginning of the
> next record.  However, this requires me to write out the for loop manually
> (using StopIteration).  Does anyone know of a cleaner/more elegant way of
> doing this?
>
> Thanks!

Hi Uri,

There is no obvious way to handle this within the Bio.SeqIO.parse framework.

I'd suggest you use Bio.SeqIO.index instead (assuming the file isn't
so corrupt that it can't be scanned to identify each record). It only
parses a record when you look it up by key, so you can wrap each
record access in a try/except block and skip the ones that fail.
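
Something along these lines might do it. This is only a rough sketch:
the helper names iter_readable_records and skip_bad_embl_records and
the optional "skipped" list are made up for illustration, and it
assumes the parsing failures you care about are raised as ValueError
(or a subclass). Pass a different tuple of exception classes if your
Biopython version raises something else.

```python
def iter_readable_records(index, errors=(ValueError,), skipped=None):
    """Yield (key, record) pairs from a dict-like index, skipping bad ones.

    `index` can be any mapping that parses on lookup, such as the object
    returned by Bio.SeqIO.index(). Keys whose lookup raises one of
    `errors` are skipped and, if a list is given, appended to `skipped`.
    """
    for key in index:
        try:
            record = index[key]  # the record is only parsed here
        except errors:
            # Assumption: a malformed record surfaces as one of `errors`.
            if skipped is not None:
                skipped.append(key)
            continue
        yield key, record


def skip_bad_embl_records(path):
    """Hypothetical usage: return the IDs of the records that parse cleanly."""
    from Bio import SeqIO  # imported here so the helper above has no dependency

    index = SeqIO.index(path, "embl")
    skipped = []
    good_ids = [rec.id for _, rec in iter_readable_records(index, skipped=skipped)]
    print("Skipped %d unparseable record(s)" % len(skipped))
    return good_ids
```

Because only the lookup sits inside the try block, an exception raised
by your own per-record code won't be silently swallowed.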

Peter



