[Biopython-dev] [Bug 2417] New: Bio.SeqIO single SeqRecord read/parse function

bugzilla-daemon at portal.open-bio.org bugzilla-daemon at portal.open-bio.org
Sat Dec 8 13:09:00 UTC 2007


http://bugzilla.open-bio.org/show_bug.cgi?id=2417

           Summary: Bio.SeqIO single SeqRecord read/parse function
           Product: Biopython
           Version: Not Applicable
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk


Most sequence file formats can contain a single record, and in this situation
having to use an iterator returned by Bio.SeqIO.parse() can be clumsy.

This comes up, for example, when dealing with GenBank files for bacterial
genomes or chromosomes.  Another example, from the tutorial as of Biopython 1.44:

from Bio.WWW import ExPASy
from Bio import SeqIO
seq_record = SeqIO.parse(ExPASy.get_sprot_raw("O23729"), "swiss").next()
print seq_record.id
print seq_record.seq
print len(seq_record.seq)

Using the iterator's .next() method as above works fine, but it will silently
ignore any unexpected subsequent records if present.  Checking that your file
has only one record would require an additional check to confirm that a second
.next() call fails, or another such workaround.

I am proposing a new function for use with a handle containing one and only one
record.  This would raise an error if the handle contained no records, or if it
contained more than one record.  It would be defined in Bio/SeqIO/__init__.py
as a simple wrapper for Bio.SeqIO.parse().
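
A minimal sketch of what such a wrapper might look like, written against a
plain iterator so that it runs without Biopython.  The name read_single and the
choice of ValueError are assumptions for illustration, not the final Bio.SeqIO
API; in practice the argument would be the iterator returned by
Bio.SeqIO.parse(handle, format).

```python
def read_single(records):
    """Return the one and only item from an iterable of records.

    Raises ValueError if there are no records, or more than one.
    (Hypothetical helper; name and exception type are assumptions.)
    """
    iterator = iter(records)
    try:
        first = next(iterator)
    except StopIteration:
        raise ValueError("No records found in handle") from None
    try:
        next(iterator)
    except StopIteration:
        # The iterator is exhausted after one record, as required.
        return first
    raise ValueError("More than one record found in handle")
```

The second next() call is the "additional check" described above: it has to
fail for the input to count as a single-record file.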

Note - My proposed "read single record" function would NOT work for cases where
the handle contains multiple records and you only want the first one (because
it would raise an exception).  I regard this as a corner case, and catering
to it risks silently ignoring unexpected second and subsequent records in
other use cases.  In such situations using Bio.SeqIO.parse(...).next() is
advised.
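
For comparison, the first-record-only pattern can be sketched as follows.  The
generator fake_parse is a hypothetical stand-in for Bio.SeqIO.parse(), used
here only so the example runs without Biopython or a data file, and the
built-in next() is the Python 3 spelling of the .next() method call.

```python
def fake_parse(names):
    """Hypothetical stand-in for Bio.SeqIO.parse(): yields one record at a time."""
    for name in names:
        yield name


records = fake_parse(["rec1", "rec2", "rec3"])
first = next(records)      # take the first record only
remaining = list(records)  # records a first-only caller never looks at

print(first)
print(len(remaining))
```

Nothing here complains about the extra records, which is acceptable when
taking the first record is the intent and a problem when a single-record file
was expected.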

I had previously suggested "parse1", "parse_sole", "parse_only" - none of which
are very appealing.  On the dev mailing list today, Michiel has proposed
"read":

Michiel de Hoon wrote:
>
> Peter wrote:
> > I'd suggested a Bio.SeqIO function, with a name like parse1() or
> > parse_sole() etc which would return a single SeqRecord - and raise
> > an error if the handle didn't contain one and only one record.  We
> > could call this function read() if you prefer.
> >
> I'd prefer read() instead of parse1(), parse_sole() etc. for the
> following reasons:
> 
> 1) Having two names that are clearly different emphasizes the fact that
> they return different things (parse() returns an iterator, read() a record).
> 
> 2) Some modules deal with data that always consist of one record (for
> example, gene expression data in case of Bio.Cluster). Such modules can
> have a read() function but not a parse(). It would feel strange if a
> module has a parse1() function but not a parse().

I plan to add this functionality to Bio/SeqIO/__init__.py as a "read" function,
and update the tutorial accordingly shortly.

