[Biopython] Creating GenBank files
Peter
biopython at maubp.freeserve.co.uk
Wed Sep 16 16:31:27 UTC 2009
On Wed, Sep 16, 2009 at 5:14 PM, Peter Saffrey <pzs at dcs.gla.ac.uk> wrote:
> Peter wrote:
>>
>> Yes, you must create a SeqRecord object with suitable SeqFeature objects,
>> and then write it out with SeqIO in GenBank format. If all your features
>> have trivial locations, this is pretty easy.
>
> Thanks for this. I've managed to get this to work, but encountered a few
> minor issues.
>
> I already have GenBank files created by CLC Genomics Workbench 3 but I want
> to make these in a script. The CLC generated GenBank files look like this:
>
> LOCUS Setd2-tagged 11750 bp DNA linear UNA
> FEATURES Location/Qualifiers
> misc_feature 1..50
> /label="Subcloning HA Upstream"
> ...(snip other features)
>
> ORIGIN
> 1 TTGGTGTGAG CTCTTTGTGT CTTGCCTAAG TATGTGCATC TGTCTTGTCT
>
> ...(snip sequence)
>
>
> To do this in biopython, I need to create my feature thus:
>
> sf = SeqFeature.SeqFeature(SeqFeature.FeatureLocation(0,50),
> type="misc_feature", qualifiers = { "label" : [ "Subcloning HA Upstream" ]})
>
> The issues I had were:
>
> - In the docstring for SeqFeature, it says the attribute is "qualifier" but
> it should be "qualifiers".
I've fixed that in CVS - thanks for reporting it.
> - My first stab at the qualifiers argument was to do
>
> qualifiers = { "label" : "mylabel" }
>
> but if I do that, it iterates over "mylabel" giving me one "label" for each
> character! Maybe the qualifier printer should check it's being given a list
> and not a string?
As you have realised, based on what the GenBank (and other) parsers
do, the GenBank output code was expecting the qualifier values to be
a list (of strings). There are similar issues in the BioSQL code, and yes,
I agree we should cope with either here too.
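
Until the output code copes with both, a rough workaround is to normalise
the qualifiers dictionary yourself before building the SeqFeature. The
sketch below is plain Python and only illustrates the idea:
normalize_qualifiers is a hypothetical helper name, not something that
exists in Biopython.

```python
def normalize_qualifiers(qualifiers):
    """Return a copy of the dict in which every value is a list of strings.

    Hypothetical helper (not part of Biopython): a bare value such as
    "mylabel" becomes ["mylabel"], while lists and tuples are copied.
    """
    normalized = {}
    for key, value in qualifiers.items():
        if isinstance(value, (list, tuple)):
            normalized[key] = [str(item) for item in value]
        else:
            # A single value (typically a string) is wrapped in a list so
            # that code looping over the values sees one entry, not one
            # entry per character.
            normalized[key] = [str(value)]
    return normalized


# Looping over a bare string yields its characters, which is why a plain
# string value produced one "label" line per character.
characters = list("mylabel")

fixed = normalize_qualifiers({"label": "mylabel"})
print(fixed)
```

The result can then be passed as the qualifiers argument in place of the
original dictionary.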
> - I'd like to remove some of the extraneous header from the GenBank file:
>
> DEFINITION  .
> ACCESSION   <unknown id>
> VERSION     <unknown id>
> KEYWORDS    .
> SOURCE      .
>   ORGANISM  .
>             .
>
> Is this possible?
>
Why would you want to?
They are there deliberately: according to the NCBI GenBank release
notes (which are pretty much the official file format definition), those are
all mandatory keywords, so they should be present (even if with just a
dot/period indicating no data). I would regard the CLC Genomics Workbench 3
output as technically out of spec.
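
If you want to see which of these header lines a given file lacks, here is
a minimal sketch in plain Python. It assumes the keyword list is just the
six header lines quoted above (not a complete reading of the release notes),
and missing_header_keywords is a made-up name for illustration. It only
looks at the first word of each line before FEATURES, so a wrapped
continuation line that happens to start with one of these words would fool
it.

```python
# Header keywords taken from the lines quoted in this thread (an assumption,
# not an exhaustive list from the NCBI release notes).
MANDATORY_KEYWORDS = (
    "DEFINITION",
    "ACCESSION",
    "VERSION",
    "KEYWORDS",
    "SOURCE",
    "ORGANISM",
)


def missing_header_keywords(genbank_text):
    """Return the keywords from MANDATORY_KEYWORDS absent from the header."""
    seen = set()
    for line in genbank_text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == "FEATURES":
            # The header ends where the feature table begins.
            break
        seen.add(tokens[0])
    return [keyword for keyword in MANDATORY_KEYWORDS if keyword not in seen]


clc_style = (
    "LOCUS Setd2-tagged 11750 bp DNA linear UNA\n"
    "FEATURES Location/Qualifiers\n"
    "misc_feature 1..50\n"
)
print(missing_header_keywords(clc_style))
```

Run on the start of the CLC file quoted above, this reports all six
keywords as missing.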
>
> Sorry for the long message,
>
Not at all.
Peter C.