[Biopython-dev] [Biopython] skipping a bad record read in SeqIO

Sun Jun 7 15:10:16 EDT 2009

Thanks Peter.

OK, it's a GenBank file, but the point is not hacking around that problem
(which I did); it's more of a Biopython policy question.

Biopython cannot handle every record format variant (i.e. error) out there,
and we should probably have a mechanism for skipping over unreadable records.
Skipped records should be reported, of course, e.g. by writing to stderr. If
a record cannot be read, then the ID of the preceding record and/or the
serial number of the failed record should be written.

Does that sound like something we should be doing?

On Sun, Jun 7, 2009 at 4:52 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> On Sun, Jun 7, 2009 at 3:36 AM, Iddo Friedberg <idoerg at gmail.com> wrote:
> > Suppose an iterator-based reader throws an exception due to a bad record.
> > I want to note that in stderr and move on to the next record. How do I do
> > that?
>
> The short answer is you can't (at least not easily), but the details
> would depend on which parser you are using (i.e. which file format).
>
> Do you have a corrupt file, or do you think you might have found a bug
> in a parser? More details would help.
>
> If you really have to do this, then if the file format is simple I
> would suggest you manually read the file into chunks and then pass
> them to SeqIO one by one. Not elegant but it would work. For example
> with a GenBank file, loop over the file line by line caching the data
> until you reach a new LOCUS line. Then turn the cached lines into a
> StringIO handle and give it to Bio.SeqIO.read() to parse that single
> record (in a try/except).
>
> Peter
>
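
For reference, here is a minimal sketch of the chunk-and-parse workaround
Peter describes, combined with the stderr reporting proposed above. It is an
illustration under stated assumptions, not an existing Biopython feature: the
names split_genbank_records, parse_skipping_bad and the parse_one/log
parameters are hypothetical, and it assumes a current Python 3 and Biopython
where Bio.SeqIO.read(handle, "genbank") parses exactly one record.

```python
import sys
from io import StringIO


def split_genbank_records(handle):
    """Yield the text of each record, splitting at lines starting with LOCUS.

    Hypothetical helper: anything before the first LOCUS line is ignored.
    """
    chunk = []
    for line in handle:
        if line.startswith("LOCUS"):
            if chunk:
                yield "".join(chunk)
            chunk = [line]
        elif chunk:
            chunk.append(line)
    if chunk:
        yield "".join(chunk)


def parse_skipping_bad(handle, parse_one=None, log=None):
    """Yield the records that parse; report the ones that do not.

    parse_one is a hypothetical hook taking one record's text and returning
    an object with an .id attribute. By default it wraps Bio.SeqIO.read().
    """
    if parse_one is None:
        from Bio import SeqIO  # imported lazily, only for the default parser

        def parse_one(text):
            return SeqIO.read(StringIO(text), "genbank")

    if log is None:
        log = sys.stderr

    last_good_id = None
    for serial, text in enumerate(split_genbank_records(handle), start=1):
        try:
            record = parse_one(text)
        except Exception as err:  # deliberately broad: any parser failure
            log.write(
                f"Skipping record {serial} "
                f"(previous good record: {last_good_id!r}): {err}\n"
            )
            continue
        last_good_id = record.id
        yield record
```

Usage would be along the lines of "for record in
parse_skipping_bad(open(filename)): ...". Catching Exception is a blunt
choice made for brevity; a real policy would probably want to restrict it to
the errors the parser is known to raise.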

-- 
Iddo Friedberg, Ph.D.
Atkinson Hall, mail code 0446
University of California, San Diego
9500 Gilman Drive
La Jolla, CA 92093-0446, USA
T: +1 (858) 534-0570
http://iddo-friedberg.org