[Biopython-dev] Bio.GFF and Brad's code
Peter Cock
p.j.a.cock at googlemail.com
Tue Apr 21 11:52:26 UTC 2009
On Mon, Apr 20, 2009 at 2:29 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> [accessing start and end]
>> >>> print rec_dict['1'].features[0].location.start
>> 20228
>> >>> rec_dict['1'].features[0].location.start.position
>> 20228
> [...]
>> Coupled with a variation of Brad's suggestion of adding start
>> and end properties to the SeqFeature, if we make these act
>> as proxies for feature.location.start and feature.location.end
>> that would become just:
>>
>> record = ...
>> feature = record.features[5] #for example
>> sub_seq = my_seq[feature.start:feature.end]
>
> Thanks Peter, that's exactly right.
Actually, it isn't - my mistake. Adding start and end properties to
the SeqFeature as proxies for feature.location.start and
feature.location.end wouldn't be a great idea. Currently
feature.location.start and feature.location.end are position objects,
and even if they had an __int__ method you couldn't do this:
record[feature.location.start:feature.location.end]
or:
record.seq[feature.location.start:feature.location.end]
You would have to do this:
record[int(feature.location.start):int(feature.location.end)]
or:
record.seq[int(feature.location.start):int(feature.location.end)]
The above wouldn't work well for fuzzy locations, so we're better off
with the current explicit option:
record[feature.location.start.position:feature.location.end.position]
or:
record.seq[feature.location.start.position:feature.location.end.position]
where if the user wants to they can take into account the fuzzy
details, such as adding feature.location.end.extension to the
end slice point.
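
The point about __int__ can be checked with a small sketch. The
Position class below is a hypothetical stand-in written for this
illustration, not Biopython's actual position class, and a plain string
stands in for the record. Slicing does not call __int__, so the objects
cannot be used as slice bounds directly:

```python
class Position(object):
    """Hypothetical stand-in for a position object (not Biopython's class)."""

    def __init__(self, position):
        self.position = position

    def __int__(self):
        return self.position


seq = "ACGTACGTAC"  # a plain string standing in for record.seq
start, end = Position(2), Position(6)

# Slicing does not call __int__, so using the objects directly fails.
try:
    seq[start:end]
    direct_slice_ok = True
except TypeError:
    direct_slice_ok = False

# Explicit conversion, or the explicit .position attribute, both work.
via_int = seq[int(start):int(end)]
via_position = seq[start.position:end.position]
```

Here direct_slice_ok ends up False, while both explicit forms give the
same four-letter slice.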
----------------
Now the good news: we can instead simply use the FeatureLocation
shortcuts for (approximated) plain integers:
record[feature.location.nofuzzy_start:feature.location.nofuzzy_end]
or:
record.seq[feature.location.nofuzzy_start:feature.location.nofuzzy_end]
These shortcuts already take fuzzy ends into consideration, and know to
treat the start and end differently to get the wider feature.
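
As a rough sketch of what "the wider feature" means, assume a fuzzy
position is stored as a position plus an extension, as mentioned above.
The class names and the min/max rule below are assumptions made for
illustration; they are not copied from Biopython's source:

```python
class FuzzyPosition(object):
    """Hypothetical stand-in: a position plus an extension for fuzziness."""

    def __init__(self, position, extension=0):
        self.position = position
        self.extension = extension


class Location(object):
    """Hypothetical stand-in for a location with integer shortcuts."""

    def __init__(self, start, end):
        self.start = start
        self.end = end

    @property
    def nofuzzy_start(self):
        # Assumed rule: take the leftmost possible start.
        return min(self.start.position,
                   self.start.position + self.start.extension)

    @property
    def nofuzzy_end(self):
        # Assumed rule: take the rightmost possible end.
        return max(self.end.position,
                   self.end.position + self.end.extension)


seq = "ABCDEFGHIJKLMNOPQRST"
# Start somewhere within 2..5, end somewhere within 10..14.
loc = Location(FuzzyPosition(2, 3), FuzzyPosition(10, 4))

narrow = seq[loc.start.position:loc.end.position]
wide = seq[loc.nofuzzy_start:loc.nofuzzy_end]
```

Under these assumptions the wide slice includes the end extension, while
the narrow slice ignores it.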
So, a slight variation of the proposed internal details would be to
make SeqFeature.start and end proxies for
SeqFeature.location.nofuzzy_start and SeqFeature.location.nofuzzy_end
(i.e. plain integers), achieving the goal of just:
record[feature.start:feature.end]
or:
record.seq[feature.start:feature.end]
(Suitable for non-join features, and gives a reasonable approximation
for fuzzy locations).
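
A minimal sketch of that proxy idea, using hypothetical stand-in classes
rather than the real SeqFeature and FeatureLocation (only the attribute
names nofuzzy_start, nofuzzy_end, start and end come from the discussion
above):

```python
class Location(object):
    """Hypothetical stand-in exposing only the integer shortcuts."""

    def __init__(self, nofuzzy_start, nofuzzy_end):
        self.nofuzzy_start = nofuzzy_start
        self.nofuzzy_end = nofuzzy_end


class Feature(object):
    """Hypothetical stand-in for a feature with proxy properties."""

    def __init__(self, location):
        self.location = location

    @property
    def start(self):
        # Proxy for location.nofuzzy_start (a plain integer).
        return self.location.nofuzzy_start

    @property
    def end(self):
        # Proxy for location.nofuzzy_end (a plain integer).
        return self.location.nofuzzy_end


seq = "ACGTACGTACGT"  # a plain string standing in for record.seq
feature = Feature(Location(3, 8))
sub_seq = seq[feature.start:feature.end]
```

Because the properties return plain integers, they can be used as slice
bounds directly, and the full location object remains available for
anyone who needs the fuzzy details.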
> Accessing the start and end coordinates in SeqFeatures is unnecessarily
> cumbersome right now, but can be fixed fairly simply. We should be able
> to get this in now that 1.50 is rolled out.
> ...
> To be clear, start and end in SeqFeature would be integers and not
> handle any fuzzy stuff. All of the representation is still there for
> those actually dealing with fuzziness, but the top level attributes
> would expose the coordinates nicely for the remaining 99% of cases.
Right - and with the above correction that SeqFeature.start and end
would be proxies for SeqFeature.location.nofuzzy_start and
SeqFeature.location.nofuzzy_end, you would get plain integers, and
this should cover most use cases. At least for non-Eukaryotes ;)
>> I think that for a basic parser (as opposed to a parser integrated with Bio.SeqIO),
>> the SeqFeatures are way too complicated for my mind.
> [...]
>> For a basic parser, I like the _gff_line_map function much better.
>> Applied to the first line in the GFF file, it returns
> [...]
>> which is exactly what I need, in (almost) the places where I'd expect them.
>
> Does solving the start/end problem as described above help bridge the
> gap between SeqFeatures and the custom representation? Are there other
> usability issues you found? I would prefer to expose one data structure
> and think SeqFeature can handle the data well. They scale to nested
> cases, and will be familiar to those using features in SeqIO or BioSQL.
You must agree that SeqFeature and FeatureLocation objects are not
very lightweight. I understood that one of your goals with Bio.GFF
and map/reduce is to handle massive files, so surely it makes sense to
use a simple object structure here?
Peter