[Biopython-dev] Bio.GenBank (was: Bio.File)

Sun Sep 11 14:06:13 UTC 2011

On Sun, Sep 11, 2011 at 4:22 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> Hi all,
>
> There are several issues here.
> Let's talk about Bio.GenBank first.
>
> I think it's OK to have a module Bio.GenBank in addition
> to Bio.SeqIO, but it's a bit unclear to me which code in
> Bio.GenBank is still relevant and which (if any) can
> potentially be deprecated.

Bio.GenBank uses a scanner/consumer to offer two
object models for GenBank/EMBL files. First, SeqRecord
objects which is wrapped by Bio.SeqIO. Second, a
more faithful GenBank record object which also
supports non-sequence based GenBank whole
genome shotgun master records. These are GenBank
files that summarize the content of a project, and
provide lists of scaffold and contig files in the project.
I have never used this - Iddo has though.

So currently none of Bio.GenBank can really be
deprecated. If we don't care about WGS records,
then perhaps the RecordParser could be deprecated
and later with some refactoring Bio.SeqIO could parse
things directly. That would be my long term ideal.

Maybe we can represent the WGS records as
SeqRecord objects without a sequence, but I
don't like that idea really. Such files are NOT
sequence files at all.

>
> Also we'd need some documentation for Bio.GenBank.
>

In general it would be a good idea to have a
worked example parsing a (small) GenBank
file and showing where in the SeqRecord
each bit of annotation goes. Doing this as
a doctest (embedded in the Tutorial perhaps)
would keep the documentation up to date
(any changes should show up as a unit test
failure).

> In particular it's not clear to me which classes in
> Bio.GenBank are intended to be used by users.
> The description at the top of Bio.GenBank says
> that only Bio.GenBank.RecordParser should be
> used directly.

What is says is "Currently the ONLY reason to use
Bio.GenBank directly is for the RecordParser which
turns a GenBank file into GenBank-specific Record
objects.", by which I mean if you want SeqRecord
objects, use Bio.SeqIO instead (which will call
Bio.GenBank.FeatureParser internally), since that
is our standard API for parsing as SeqRecords.

> However, in the test code in
> Bio.Graphics.GenomeDiagram (after
> "if name=='__main__':") Bio.GenBank.FeatureParser
> is used. Should that be replaced by Bio.SeqIO then?

Yes. If the code is needed at all...

> Also I think that the RecordParser should
> raise an Exception if it cannot find a record
> when parsing.

I disagree (or at least, when exposed via
Bio.SeqIO I disagree).

> Compare the following:
>
>>>> from Bio import SeqIO
>>>> from StringIO import StringIO
>>>> handle = StringIO("no record here")
>>>> SeqIO.read(handle, 'fasta')
> Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
>  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 617, in read
>    raise ValueError("No records found in handle")
> ValueError: No records found in handle

That's fine - the read function says it
will raise an exception if there is not
exactly one record.

Perhaps you meant to use parse here
as in the following example? If you do,
you get no records and no exception.

>>>> from Bio import GenBank
>>>> parser = GenBank.RecordParser()
>>>> handle = StringIO("no record here")
>>>> parser.parse(handle)
>>>> # no error raised
>
> This still lets us ignore header text before
> the actual start of a GenBank record; the
> error should only be raised if no GenBank
> record can be found anywhere.
>

If you used Bio.SeqIO.read(...) with GenBank
format on an empty file you'd also get an
exception.

I explicitly test the SeqIO parsers to check
they handle an empty file gracefully - and
for simple sequential formats like FASTA
and GenBank that means returns no records.
This is an important special case, and it
should be handled this way for generic
pipelines. I often have empty FASTA files.

Peter