[Biopython-dev] [Bug 2544] Bio.GenBank and SeqFeature improvements

Wed Jul 16 06:30:21 EDT 2008

http://bugzilla.open-bio.org/show_bug.cgi?id=2544

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Bio.SeqIO improvements      |Bio.GenBank and SeqFeature
                   |                            |improvements

------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk  2008-07-16 06:30 EST -------
(In reply to comment #0)
> $ python
> 
> Python 2.5.2 (r252:60911, Jul  2 2008, 22:55:24) 
> [GCC 4.3.1] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> from Bio import SeqIO
> >>> handle = open("genbank-synthetic.gb")

I'm guessing the missing line here was something like:
seq_record = SeqIO.read(handle, "genbank")

> >>> print seq_record
> ID: EF452680.2
> Name: EF452680
> Description: Synthetic construct nitric oxide synthase (NOS) gene, partial cds.
> /comment=On Feb 4, 2008 this sequence version replaced gi:145391444.
> /sequence_version=2
> /source=synthetic construct
> /taxonomy=['other sequences', 'artificial sequences']
> /keywords=['']
> /references=[<Bio.SeqFeature.Reference instance at 0x834cb6c>,
> <Bio.SeqFeature.Reference instance at 0x834c16c>, <Bio.SeqFeature.Reference
> instance at 0x834ceac>, <Bio.SeqFeature.Reference instance at 0x834c2ec>]
> /accessions=['EF452680']
> /data_file_division=SYN
> /date=11-JUN-2008
> /organism=synthetic construct
> /gi=166831528
> Seq('TAGGCCTCTGCTTGCCGTTTGTTTCGTCAGCGATTTTTATAGTCTCAGCCTCCT...GCC',
> IUPACAmbiguousDNA())
> >>> 
> 
> 
> I do not see how I could access the value 'DNA' from the LOCUS line:
> LOCUS       EF452680                 260 bp    DNA     linear   SYN 11-JUN-2008

Currently the sequence type (DNA, RNA, Protein) is used internally by the
GenBank parser to determine the alphabet.  It is not currently recorded in the
SeqRecord object's annotation but could be.  How about something like this?:

seq_record.annotations["seq_type"]

> No, I do not want to read seq_record.features[0].qualifiers['mol_type'][0].

Assuming that the first feature is the source (typically the case), and
assuming it has a specified molecule type, then your suggestion is one work
around.  But I agree, its not nice.

> Could seq_record.features have a repr() function to give me something useful
> instead of this?
> 
> >>> print seq_record.features
> [<Bio.SeqFeature.SeqFeature instance at 0x837bc2c>, <Bio.SeqFeature.SeqFeature
> instance at 0x837b9cc>, <Bio.SeqFeature.SeqFeature instance at 0x837bd4c>]

Yes we could add that, but you wouldn't want to do that on a typical genome
with thousands of features.  Adding a repr method for the Reference object is
also something I had wondered about doing.

> I don't see documented anywhere in the biopython docs access the features,
> pasting something like the following into docs would give a user clue where to
> look for for values:
> 
> >>> print seq_record.features[0].qualifiers
> {'db_xref': ['taxon:32630'], 'mol_type': ['other DNA'], 'organism': ['synthetic
> construct'], 'chromosome': ['Ib'], 'PCR_primers': ['fwd_seq:
> aggcctctgcttgccgtttgtttcg, rev_seq: cgccggcggcacacgctcaactaattac']}
> >>> print seq_record.features[1].qualifiers
> {'gene': ['NOS']}
> >>> print seq_record.features[2].qualifiers
> {'product': ['nitric oxide synthase'], 'codon_start': ['2'], 'EC_number':
> ['1.14.13.39'], 'transl_table': ['11'], 'note': ['derived from Toxoplasma
> gondii'], 'db_xref': ['GI:166831529'], 'translation':
> ['RPLLAVCFVSDFYSLSLLHFASVPFHESDGCVGRSHWLPGKHANYVKPAGARKRPEVGCRSSCLLRSVCCDILSPVRTRGN'],
> 'gene': ['NOS'], 'protein_id': ['ABP65329.2']}

There is a minimal bit of text in what is currently Chapter 10 of the tutorial
on the SeqFeature object.  I agree, this is an area that needs improvement.

Perhaps a full example of parsing a simple GenBank file in the SeqIO chapter
would help?

> >>> print seq_record.features[3].qualifiers
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> IndexError: list index out of range

You must have only three features (indexed 0, 1 and 2) which explains the index
error.

> I wonder if I could access the above dicts as seq_record.features['source']
> or seq_record.features['CDS']. Where is the 'source', 'gene', 'CDS' gone?

As the .type attribute, try this:

for feature in seq_record.features:
   print feature.type

You can't access the features by type (e.g. seq_record.features['CDS']) because
there is generally more than one feature of each type.

Peter

P.S. Most of your comments are not on Bio.SeqIO itself, but actually about the
underlying Bio.GenBank parser or the SeqFeature object.  I have therefore
changed the title.

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.