[Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
Peter (BioPython Dev)
biopython-dev at maubp.freeserve.co.uk
Thu Aug 17 15:13:40 UTC 2006
Albert Krewinkel wrote:
> Peter wrote:
>>Oh - you meant just adding EMBL feature iteration. I was thinking
>>about the larger task of full EMBL file reading.
> I started working on that, but I'm not very far yet.
Are you starting from Bio.GenBank or from scratch? I would point out
that the code in Bio.GenBank was inserted into what was once a
Martel-based parser, and designed to be a transparent change for the end user.
What I would like to do is recycle that code into a new far simpler
SeqIO GenBank parser which would only return SeqRecords. In particular
I would get rid of all the scanner/consumer model with all its function
calls. At this point I would try and handle both GenBank and EMBL files together.
I expect this to be faster, and easier to understand. It would be a lot
less flexible for the "power user", but then so is all the new SeqIO
code I have been writing.
>>Doing just the features is very easy, here you go:
> Wow, that was quick.
Well, I did have something along these lines planned in advance - that's
why my parse function was outside the GenbankCdsFeatureIterator class.
> And it works almost perfectly. One exception:
> In _parse_embl_or_genbank_feature(), when parsing the location, it
> should say something like
> from string import digits
> while feature_location[-1] not in ')' + digits:
>     line = iterator.next()
>     feature_location += line[FEATURE_QUALIFIER_INDENT:].strip()
> This way, features may have multiline join(...) positions.
Good point, something I was aware of and coped with in Bio.GenBank but
hadn't done in the CDS iterator. Thanks for pointing this out.
This affects both GenBank and EMBL files by the way. My code is very
similar but I included an assert to check the indent, and I only check
for a trailing comma. This works on all the files I have tried.
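
To make the trailing-comma approach concrete, here is a minimal sketch in
present-day Python. It is not the actual Biopython code: the function name
read_feature_location is hypothetical, and the value 21 for
FEATURE_QUALIFIER_INDENT is an assumption about the column where locations
and qualifiers start in both formats.

```python
FEATURE_QUALIFIER_INDENT = 21  # assumed column where locations/qualifiers start


def read_feature_location(first_line, lines):
    """Return a feature's full location string, joining continuation lines.

    first_line: the feature's first line (key plus the start of the location).
    lines: an iterator over the lines that follow it in the feature table.
    """
    location = first_line[FEATURE_QUALIFIER_INDENT:].strip()
    # A wrapped location is split after a comma, so a trailing comma
    # means the next line carries more of the same location.
    while location.endswith(","):
        line = next(lines)
        # Continuation lines carry no feature key: GenBank pads the indent
        # with spaces, EMBL starts it with "FT" followed by spaces.
        prefix = line[:FEATURE_QUALIFIER_INDENT]
        assert prefix.strip() in ("", "FT"), "unexpected indent: %r" % line
        location += line[FEATURE_QUALIFIER_INDENT:].strip()
    return location


# Usage: a GenBank-style CDS feature whose join(...) wraps onto a second line.
first = " " * 5 + "CDS".ljust(16) + "join(1..100,200..300,"
rest = iter([" " * 21 + "400..500)"])
print(read_feature_location(first, rest))
```

Checking only for a trailing comma keeps the loop condition simple, and the
assert catches a line that turns out not to be a continuation.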