[Bioperl-l] Entrez Gene ASN

Sat Mar 12 19:55:44 EST 2005

On Friday, March 11, 2005, at 11:02  AM, Stefan Kirov wrote:

>
>
> Hilmar Lapp wrote:
>
>> Gene shouldn't be fundamentally different from LocusLink, and 
>> LocusLink was represented as an annotated SeqI within bioperl.
>
> It is not, you are right.
>
>>
>> If at all possible I'd still like it to remain that way for Gene in 
>> order to allow for a smooth transition from LL to Gene for code 
>> that's been using the former.
>>
> hmmmm, back compatibility is good thing, but sometimes it may be hard 
> to achieve.

Well, now you contradict yourself. Above you agree that Gene and 
LocusLink are fundamentally the same, and here you say representing 
them in a compatible fashion may be hard to achieve ...

There are problems indeed though, read on ...

>
>> If you want to emphasize the fact that it's a container for 
>> sequences, then that sounds like a ClusterI to me, which can be 
>> richly annotated too.
>
> Let me disagree here. Cluster is designed for independent sequences, 
> where Gene should deal with sequences, that have hierarchical 
> relationship among themselves.

Two notes here. First, ClusterI is not designed for independent 
sequences. It is just meant as a container for sequences, be those 
related to each other or not.

Second, the ability to represent hierarchical relationships between 
sequences is basically absent from bioperl, not just from ClusterI 
(aside from ClusterI representing a relationship between the containing 
seq and the contained seqs).

We should think seriously before we add that capability. Most of the 
people and effort devoted to hierarchical relationships between 
biological entities with sequence are concentrated in the domain of 
feature hierarchies, *not* sequence hierarchies. See GFF3, SO, GBrowse, 
Chado, and related efforts.

The only place I know of where sequence hierarchies are extensively 
used is in our local adaptation of BioSQL, and we do all of this in SQL 
(as bioperl, and therefore bioperl-db, has zero support for it).

It's possible, but I'm not sure it's also wise, to duplicate the 
support for feature hierarchies for sequences ... Wouldn't it in the 
end benefit more people if you were able to tie Gene into the 
Unflattener that Chris wrote?

>  This is one of the issues I think  Seq object is not designed to deal 
> with.  What we need is:
> genome--(Bio::Seq)-
>                   |--transcript(Bio::Seq)
>                                          |--protein(Bio::Seq)
>                     |--transcript(Bio::Seq)
>                                          |--protein(Bio::Seq)

Well, yeah, if you replace Bio::Seq with Bio::SeqFeatureI you are 
pretty close to GFF3 and a growing wealth of support for it.
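
To make the comparison concrete, here is a rough sketch of how a 
GFF3-style feature hierarchy hangs together through ID/Parent 
attributes. It is plain Python rather than bioperl code, and everything 
in it is assumed for illustration: the feature IDs, the coordinates, the 
"example" source column, and the helper names build_hierarchy and 
render are all made up. The protein level of the diagram above is 
approximated here by CDS features, and every feature is assumed to 
carry an ID.

```python
from collections import defaultdict

# Hypothetical GFF3 records: IDs, coordinates and the "example" source
# are invented for this sketch.
GFF3 = """\
chr1\texample\tgene\t1000\t5000\t.\t+\t.\tID=gene1
chr1\texample\tmRNA\t1000\t5000\t.\t+\t.\tID=mrna1;Parent=gene1
chr1\texample\tCDS\t1200\t4800\t.\t+\t0\tID=cds1;Parent=mrna1
chr1\texample\tmRNA\t1000\t4500\t.\t+\t.\tID=mrna2;Parent=gene1
chr1\texample\tCDS\t1200\t4300\t.\t+\t0\tID=cds2;Parent=mrna2
"""


def parse_attributes(column):
    """Split the ninth GFF3 column ("key=value;key=value") into a dict."""
    return dict(pair.split("=", 1) for pair in column.split(";") if pair)


def build_hierarchy(text):
    """Return (features by ID, child IDs by parent ID) for GFF3 text."""
    features = {}
    children = defaultdict(list)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and directives/comments
        cols = line.split("\t")
        attrs = parse_attributes(cols[8])
        # Simplification: this sketch requires an ID on every feature.
        features[attrs["ID"]] = {
            "type": cols[2],
            "start": int(cols[3]),
            "end": int(cols[4]),
        }
        # A feature may list several parents, separated by commas.
        if "Parent" in attrs:
            for parent in attrs["Parent"].split(","):
                children[parent].append(attrs["ID"])
    return features, children


def render(features, children, feature_id, depth=0):
    """Return indented "type id" lines for a feature and its descendants."""
    lines = ["  " * depth + f"{features[feature_id]['type']} {feature_id}"]
    for child in children.get(feature_id, []):
        lines.extend(render(features, children, child, depth + 1))
    return lines


features, children = build_hierarchy(GFF3)
print("\n".join(render(features, children, "gene1")))
```

The point of the sketch is only that the parent/child links live on the 
features themselves, so no separate sequence object is needed per level 
of the hierarchy.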

>
> Another significant concern I have is that if we store everything as 
> SeqFeature or the overhead may become huge (some records have hundreds 
> of different features)

Have you talked to Lincoln about this? I believe GBrowse is dealing 
pretty well with this huge overhead but I may be missing something here.

> [...] and any user of the parser will have to do quite of a data 
> mining to find the relevant feature. One approach would be to add more 
> Bio::Annotation:: objects (for example Bio::Annotation::STS, 
> Bio::Annotation::GRIF, etc).

Possibly. Bio::Annotation objects were in fact what I was primarily 
referring to when I spoke about annotation.

> We may decide to create a simplified (Bio::Seq, no relationships) or 
> more complex object (Gene), based on the user request.

Just as an aside, I guess you know that there is a Gene object already, 
but it's feature based.

	-hilmar
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------