[Biopython-dev] Bio.GFF and Brad's code
Peter
biopython at maubp.freeserve.co.uk
Sat Apr 18 09:54:44 EDT 2009
On Sat, Apr 18, 2009 at 5:28 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>
> This is how I used the parser:
>
>>>> from GFFParser import GFFAddingIterator
>>>> gff_iterator = GFFAddingIterator()
>>>> rec_dict = gff_iterator.get_all_features("Data/miRBase/hsa.gff")
> # It would be better to pass a handle to get_all_features
> # instead of a file name. The file may be gzipped or bzipped,
> # or the user may want to read it from the internet.
>>>> len(rec_dict['1'])
> 50
> # fifty microRNAs on chromosome 1
>>>> rec_dict['1'].features[0]
> Bio.SeqFeature.SeqFeature(Bio.SeqFeature.FeatureLocation(Bio.SeqFeature.ExactPosition(20228),Bio.SeqFeature.ExactPosition(20366)), type='miRNA', strand=1, id='hsa-mir-1302-2')
>>>> rec_dict['1'].features[0].qualifiers['ACC']
> ['MI0006363']
>>>> rec_dict['1'].features[0].qualifiers['ID']
> ['hsa-mir-1302-2']
> # This is still OK, though a bit more deeply nested than I would like.
>>>> rec_dict['1'].features[0].location
> Bio.SeqFeature.FeatureLocation(Bio.SeqFeature.ExactPosition(20228),Bio.SeqFeature.ExactPosition(20366))
>>>> rec_dict['1'].features[0].location._start
> Bio.SeqFeature.ExactPosition(20228)
> # Am I supposed to use _start here? It looks like a private variable.
>>>> rec_dict['1'].features[0].location._start.position
> 20228
No, you are meant to use start, e.g.:
>>> print rec_dict['1'].features[0].location.start
20228
>>> rec_dict['1'].features[0].location.start.position
20228
This is what I was talking about in the earlier email on this thread,
the SeqFeature has start and end "attributes", but they are done
with some magic in __getattr__. I plan to update this to use a
modern python property get (so they will show up in dir(...) and
we can give them docstring), but don't recall filing a bug on this
issue yet.
See also,
http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005734.html
http://lists.open-bio.org/pipermail/biopython/2007-September/003703.html
Related to this, perhaps the position classes (and in particular
the ExactPosition class) should have an __int__ method, so you
can use the object directly (rather than messing about with
subproperties like .position). This should let you do the following
(untested):
record = ... #e.g. a SeqRecord from a GFF file or GenBank
feature = record.features[5] #for example
sub_seq = my_seq[feature.location.start:feature.location.end]
Coupled with a variation of Brad's suggestion of adding start
and end properties to the SeqFeature, if we make these act
as proxies for feature.location.start and feature.location.end
that would become just:
record = ...
feature = record.features[5] #for example
sub_seq = my_seq[feature.start:feature.end]
The fuzzy locations (from GenBank or EMBL files) would need
a bit of care, ideally matching how the NCBI do things (easily
checked by taking an NCBI GenBank files and comparing it to
the simpler locations given in their FASTA, PTT or GFF files).
Peter
More information about the Biopython-dev
mailing list