[Biopython-dev] Reading sequences: FormatIO, SeqIO, etc

Thu Aug 17 15:13:40 UTC 2006

Albert Krewinkel wrote:
> Peter wrote:
> 
>>Oh - you meant just adding EMBL feature iteration.  I was thinking 
>>about the larger task of full EMBL file reading.
> 
> I started working on that, but I'm not very far yet.

Are you starting from Bio.GenBank or from scratch?  I would point out 
that the code in Bio.GenBank was inserted into what was once a 
Martel-based parser, and was designed to be a transparent change for the 
end user.

What I would like to do is recycle that code into a new, far simpler 
SeqIO GenBank parser which would only return SeqRecords.  In particular 
I would get rid of the whole scanner/consumer model with all its 
function callbacks.

At this point I would try and handle both GenBank and EMBL files together.

I expect this to be faster, and easier to understand.  It would be a lot 
less flexible for the "power user", but then so is all the new SeqIO 
code I have been writing.
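
To illustrate the general shape of an iterator that just hands back records, with no scanner/consumer callbacks, here is a minimal sketch. It is not the planned SeqIO code: SimpleRecord and iterate_records are hypothetical names, and SimpleRecord is only a stand-in for a SeqRecord. The sketch relies on GenBank entries starting with a LOCUS line, EMBL entries starting with an ID line, and both ending with a "//" line.

```python
from collections import namedtuple
from io import StringIO

# Hypothetical stand-in for a SeqRecord: just an identifier and the raw lines.
SimpleRecord = namedtuple("SimpleRecord", ["id", "lines"])


def iterate_records(handle):
    """Yield one SimpleRecord per GenBank or EMBL entry found in handle."""
    record_id = None
    lines = []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("LOCUS") or line.startswith("ID "):
            # The identifier is the second whitespace-separated field;
            # newer EMBL ID lines put a ';' straight after it.
            record_id = line.split()[1].rstrip(";")
        if line.startswith("//"):
            # Both formats close an entry with a "//" line.
            if record_id is not None:
                yield SimpleRecord(record_id, lines)
            record_id, lines = None, []
        else:
            lines.append(line)


# Usage: the caller simply loops over the records.
example = StringIO("LOCUS       AB000001  10 bp DNA\nORIGIN\n//\n")
for record in iterate_records(example):
    print(record.id)
```

The caller never registers callbacks; it only loops over what the generator yields, which is the trade of flexibility for simplicity described above.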

>>Doing just the features is very easy, here you go:
>>
>>http://bugzilla.open-bio.org/show_bug.cgi?id=2059#c2
> 
> Wow, that was quick.

Well, I did have something along these lines planned in advance - that's 
why my parse function was outside the GenbankCdsFeatureIterator class.

> And it works almost perfectly. One exception:
> In _parse_embl_or_genbank_feature(), when parsing the location, it
> should say something like
> 
> <code>
> from string import digits
> # Keep appending lines until the location ends with ')' or a digit.
> while feature_location[-1] not in ')' + digits:
>     line = next(iterator)
>     feature_location += line[FEATURE_QUALIFIER_INDENT:].strip()
> </code>
> 
> This way, features may have multiline join(...) positions.

Good point, something I was aware of and coped with in Bio.GenBank but 
hadn't done in the CDS iterator.  Thanks for pointing this out.

This affects both GenBank and EMBL files by the way.  My code is very 
similar but I included an assert to check the indent, and I only check 
for a trailing comma.  This works on all the files I have tried.
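
For illustration, the trailing-comma approach could be sketched roughly as follows. This is not the code attached to the bug: the function name read_feature_location is hypothetical, the value 21 for FEATURE_QUALIFIER_INDENT is an assumption about the feature table layout, and allowing a leading "FT" on EMBL lines is also an assumption.

```python
# Assumed column at which locations and qualifiers start in the feature table.
FEATURE_QUALIFIER_INDENT = 21


def read_feature_location(first_part, iterator):
    """Join a feature location that wraps over several lines.

    first_part is the location text from the feature's first line;
    iterator yields the lines that follow it in the file.
    """
    location = first_part.strip()
    while location.endswith(","):
        line = next(iterator)
        # Continuation lines should hold nothing before the indent,
        # apart from the "FT" prefix on EMBL lines (assumption).
        prefix = line[:FEATURE_QUALIFIER_INDENT].strip()
        assert prefix in ("", "FT"), "Unexpected indent: %r" % line
        location += line[FEATURE_QUALIFIER_INDENT:].strip()
    return location


# Usage with a GenBank-style continuation line:
following = iter([" " * 21 + "3300..4037,4120..4567)\n"])
print(read_feature_location("join(1..100,", following))
```

Only a trailing comma triggers a continuation here, so a location that wrapped at some other character would not be joined by this sketch.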

Peter