[Biopython-dev] About a new GenBankWriter class with SeqIO interface

Wed May 16 07:53:41 UTC 2007

Howard Salis wrote:
>> Sounds nice - it's something I've been thinking about doing myself,
>> but I wanted to do both GenBank and EMBL, sharing the feature
>> table writing code.
> 
> Yep, since EMBL and GenBank share the same feature format, I've 
> separated the "foreword", feature table, and sequence write
> functions.

Using "foreword / features / sequence" avoids clashing with the terms
"header" and "footer" used in Bio.SeqIO to mean parts of a
multi-sequence file which do not belong to a specific record.  Maybe I
should update Bio/GenBank/Scanner.py to use similar terminology...
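
As a rough sketch of that three-part split (not the code attached to the
bug; every class and method name here is hypothetical, and a record is
just a plain dict to keep the example self-contained), the feature table
code could be shared while the foreword and sequence parts stay
format-specific:

```python
class RecordWriterSketch:
    """Hypothetical base class: foreword / features / sequence."""

    feature_prefix = None  # set by each format-specific subclass

    def foreword_lines(self, record):
        raise NotImplementedError("format-specific")

    def feature_lines(self, record):
        # Shared by both formats: only the line prefix differs.
        return ["%s%-16s%s" % (self.feature_prefix, key, location)
                for key, location in record["features"]]

    def sequence_lines(self, record):
        raise NotImplementedError("format-specific")

    def write_record(self, record):
        lines = (self.foreword_lines(record)
                 + self.feature_lines(record)
                 + self.sequence_lines(record))
        return "\n".join(lines) + "\n"


class GenBankWriterSketch(RecordWriterSketch):
    feature_prefix = "     "

    def foreword_lines(self, record):
        # Placeholder only: a real LOCUS line carries more fields.
        return ["LOCUS       %s" % record["name"]]

    def sequence_lines(self, record):
        # Placeholder only: the sequence lines themselves are omitted.
        return ["ORIGIN", "//"]


class EmblWriterSketch(RecordWriterSketch):
    feature_prefix = "FT   "

    def foreword_lines(self, record):
        # Placeholder only: a real ID line carries more fields.
        return ["ID   %s" % record["name"]]

    def sequence_lines(self, record):
        # Placeholder only: the sequence lines themselves are omitted.
        return ["SQ   ", "//"]
```

Calling write_record() on either subclass then gives the same feature
table body behind a different foreword and line prefix.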

> So if someone wants to write the EMBL writer, they just need to write
> the appropriate foreword.

There is also the issue of translation between EMBL/GenBank terminology, 
for example where someone has read in an EMBL file and wants to write it 
out as a GenBank file.  For a simple example, the division class should
probably map: {'PRI': 'MAM', 'BCT': 'PRO', 'UNA': 'UNC'}
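
A minimal sketch of how such a table might be applied, assuming the keys
are the codes as read and the values the codes to write, and assuming
(my choice, not settled anywhere) that unlisted codes pass through
unchanged.  The function name is made up for illustration:

```python
# The three example pairs from this email; not a complete table.
DIVISION_MAP = {'PRI': 'MAM', 'BCT': 'PRO', 'UNA': 'UNC'}


def translate_division(code, mapping=DIVISION_MAP):
    """Return the mapped division code, or the code itself if unlisted."""
    return mapping.get(code, code)
```

So translate_division('BCT') gives 'PRO', while a code missing from the
table, such as 'PLN', is returned as is.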

> I think the sequence data is stored the same too? Is that correct?

Actually, the way the sequence is printed out is slightly different:
GenBank puts the position count at the start of each line, while EMBL
puts it at the end.
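
To illustrate the layout difference, here is a rough sketch with made-up
function names.  The column widths are my understanding of the two flat
file layouts (60 letters per line in blocks of ten) and should be checked
against real files before relying on them:

```python
def _blocks(seq, start, width=60, block=10):
    """Return up to `width` letters from `start`, in blocks of ten."""
    chunk = seq[start:start + width]
    return " ".join(chunk[i:i + block] for i in range(0, len(chunk), block))


def genbank_sequence_lines(seq):
    """GenBank style: the position of the first letter leads the line."""
    seq = seq.lower()
    return ["%9d %s" % (i + 1, _blocks(seq, i))
            for i in range(0, len(seq), 60)]


def embl_sequence_lines(seq):
    """EMBL style: the running count of letters ends the line."""
    seq = seq.lower()
    lines = []
    for i in range(0, len(seq), 60):
        end = min(i + 60, len(seq))
        # Five leading spaces, 65 columns of blocks, 10 for the count.
        lines.append("     %-65s%10d" % (_blocks(seq, i), end))
    return lines
```

The block splitting is common to both, so only the line assembly would
need to be format-specific.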

>>> I also add/change a couple of lines in __init__.py to store
>>> whether a sequence was linear or circular and to store the string
>>> that encodes its molecule type (ss-RNA, etc).
>> I thought we already stored this information - but I'm not sure off
>> hand.
> 
> Well, there's the alphabet of the sequence (e.g. UnAmbiguousDNA()) 
> that says whether it's DNA, RNA, peptide, etc, but even if I matched 
> these up with strings, then the "ss-", "ds-", etc part would be 
> missing. I just saved the exact wording of the sequence type (e.g. 
> "ds-DNA", "ss-RNA", etc) to a dictionary key named 
> self.data.annotations["sequence_type"] in the _FeatureConsumer class 
> under GenBank. This is in addition to the alphabet of the sequence so
> it shouldn't conflict.

That's probably a good idea.  However, we would need to check what the 
EMBL equivalents are and convert them when writing GenBank files. Maybe 
we should just keep things simple and write one of RNA/DNA/Protein only?
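
If we did go the simple route, the reduction could look something like
this sketch.  The function name is hypothetical, and matching on the
substrings "DNA" and "RNA" is only an assumption about what the stored
strings look like:

```python
def simple_molecule_type(sequence_type):
    """Reduce a string such as "ds-DNA" to DNA, RNA or Protein."""
    text = sequence_type.upper()
    if "DNA" in text:
        return "DNA"
    if "RNA" in text:
        return "RNA"
    if "PROTEIN" in text:
        return "Protein"
    # Refuse to guess rather than write a wrong molecule type.
    raise ValueError("Unrecognised sequence type: %r" % sequence_type)
```

The input would be whatever was saved under
annotations["sequence_type"], so "ss-RNA" becomes "RNA" and "ds-DNA"
becomes "DNA".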

> Ok, done! It's at http://bugzilla.open-bio.org/show_bug.cgi?id=2294

I have made some more specific comments on the bug.  In this email I have 
tried to stick to the broader picture.

Peter