[BioPython] Determining if GenBank record is circular
Peter
biopython at maubp.freeserve.co.uk
Tue Sep 2 09:00:41 UTC 2008
On Tue, Sep 2, 2008 at 2:25 AM, Chris Lasher <chris.lasher at gmail.com> wrote:
> On Mon, Sep 1, 2008 at 8:19 PM, Iddo Friedberg <idoerg at gmail.com> wrote:
>>
>> Should be in LOCUS:
>>
>> LOCUS NC_002678 7036071 bp DNA circular BCT 22-JUL-2008
>
> Ah, sure. Let me re-state my question more precisely: Where is this
> represented in the SeqRecord object created by SeqIO.parse(), or is it
> represented at all?
Currently, I don't think circularity is represented at all in the
SeqRecord you get back from parsing.
Bio.SeqIO uses the Bio.GenBank.FeatureParser, which gets passed this
information from the Scanner via the residue_type event. This is a
combined lump of data containing both the sequence type (DNA, RNA etc)
and whether it is linear or circular. It is currently only used to
determine the Seq alphabet, and has never been recorded. So in
addition to not recording if the LOCUS line said the sequence was
circular, if the LOCUS line contained cDNA, mRNA, ... this fine detail
is also currently lost in the SeqRecord representation. On the other
hand, the Bio.GenBank.RecordParser stores all this as the record's
residue_type property (a single combined field, presumably reflecting
the layout of early GenBank files).
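
To illustrate what splitting that combined field might look like, here is
a rough sketch in plain Python. It is not Biopython code: the helper name
split_residue_type is made up for this example, and it assumes the
combined string is whitespace separated with the topology given as the
word "linear" or "circular".

```python
def split_residue_type(residue_type):
    """Split a combined residue_type string into (molecule_type, topology).

    Hypothetical helper, not part of Biopython. The topology is returned
    as "linear", "circular", or None when the string does not state one.
    """
    topologies = {"linear", "circular"}
    molecule_parts = []
    topology = None
    for token in residue_type.split():
        if token.lower() in topologies:
            # Topology keyword found; normalise its case.
            topology = token.lower()
        else:
            # Anything else is treated as part of the molecule type
            # (DNA, mRNA, ss-RNA, ...), so that detail is kept.
            molecule_parts.append(token)
    return " ".join(molecule_parts), topology


molecule_type, topology = split_residue_type("DNA circular")
print(molecule_type, topology)
```

A string with no topology keyword, such as "mRNA", would give None for
the topology rather than guessing "linear".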
It would be a logical improvement to record this information
(molecule type and whether the sequence is circular) in the SeqRecord's
annotations dictionary - perhaps as two fields, but we'd need to check
whether that would be straightforward for EMBL files too. Alternatively,
if Biopython included a native CircularSeq object, we could use that
explicitly when the sequence is declared as circular. This might be
considered a little surprising though.
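
As a sketch of the two-field idea, the snippet below fills a plain dict
standing in for a SeqRecord's annotations. Everything here is an
assumption for illustration: the function name annotate_from_locus and
the keys "molecule_type" and "topology" are invented, and the LOCUS line
is split on whitespace rather than handled the way the real Scanner does.

```python
def annotate_from_locus(locus_line, annotations=None):
    """Record molecule type and topology from a nucleotide LOCUS line.

    Hypothetical helper; the dict stands in for SeqRecord.annotations and
    the key names are assumptions, not an existing Biopython convention.
    """
    if annotations is None:
        annotations = {}
    tokens = locus_line.split()
    # Expect: LOCUS <name> <length> bp <molecule> [<topology>] <division> <date>
    if not tokens or tokens[0] != "LOCUS" or "bp" not in tokens[2:]:
        raise ValueError("not a nucleotide LOCUS line: %r" % locus_line)
    after_bp = tokens[tokens.index("bp", 2) + 1:]
    if not after_bp:
        raise ValueError("LOCUS line has nothing after 'bp': %r" % locus_line)
    annotations["molecule_type"] = after_bp[0]
    # The topology is optional; only store it when it is stated explicitly.
    if len(after_bp) > 1 and after_bp[1].lower() in ("linear", "circular"):
        annotations["topology"] = after_bp[1].lower()
    return annotations


line = "LOCUS NC_002678 7036071 bp DNA circular BCT 22-JUL-2008"
print(annotate_from_locus(line))
```

Keeping the two values under separate keys means a caller could test
annotations.get("topology") == "circular" without having to pick apart a
combined string.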
Do you want to file a bug on this, Chris?
Peter