[Bioperl-l] new GFF3 support methods added

Lincoln Stein lstein at cshl.edu
Mon Mar 8 13:17:26 EST 2004


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Chris,

Nice job.  My only comment is that there's been a great deal of 
consternation over the role of whitespace in GFF3 recently and I am 
thinking changing the column delimiter back to strict tabs and 
allowing spaces (but no tabs or other unescaped whitespace) in the 
fields.  I don't think this will affect your methods at all, but just 
a heads up.

Lincoln

On Friday 05 March 2004 08:52 pm, Chris Mungall wrote:
> I have committed some new stuff to bioperl-live:
>
> the script seq/unflatten_seq will now generate GFF3 - the
> unflattener module is used to build the 'feature graph' connecting
> genes, transcripts, exons and CDSs together. This means we can have
> GFF3 for anything in genbank!
>
> As far as I'm aware, the only other sensible output formats to use
> here (ie formats that support feature graphs/containment
> hierarchies) are: chado, chaos, and the write-only asciitree.
>
> This feature graph is written out in the GFF3 using the ID and
> Parent tags. To do this there is an extra intermediate step - the
> bioperl FeatureHolderI hierarchy is traversed and ID/Parent tags
> are generated.
>
> Here is a description of the changes I have made:
>
> [unless you're a bioperl hacker you don't really need to read the
> rest of this]
>
> You can get the context of what I'm on about from this thread:
> http://bioperl.org/pipermail/bioperl-l/2003-December/014150.html
>
> Two new public methods:
>
>   FeatureHolderI->set_ParentIDs_from_hierarchy
>
>     sets both ID and ParentID from FeatureHolder hierarchy
>
>   SeqFeatureI->generate_unique_persistent_id
>
>     this is required by the above method
>
>     Lincoln wanted this to be private, but I think it has
>     to be called from outside
>
>   FeatureHolderI->create_hierarchy_from_ParentIDs
>
>     the inverse of set_ParentIDs_from_hierarchy
>
> (note that I have put the implementation in the interface - in the
> absence of proper abstract classes, this was deemed the best thing
> to do in the previous discussion on this)
>
> Modifications:
>
>   SeqFeatureI->primary_id
>
> This now maps to the tag_value 'ID' (ie the tag that GFF3 uses to
> uniquely identify a feature).
>
> Minor modification
>
>   Bio::Tools::GFF now allows the -noparse=>1 option
>
>   this is simply to stop the module waiting on input from STDIN
>   when used in write-mode (maybe there's a better way of doing this
>   but I didn't want to mess with this module)
>
> Proto-test
>
>   t/FeatureHolder.x
>
>   This unflattens a genbank sequences and roundtrips it to chadoxml
>   via GFF3
>
>   This doesn't work yet - if you dump a splitfeature as GFF3 and
>   re-import it, it becomes two features. Any volunteers to help
>   fix this?
>
> Unique IDs in bioperl:
>
> In the discussion that preceeded this, it seemed that people liked
> the idea of persistent unique IDs, but there was no suggestions as
> to how to go about it. This is inherently difficult with objects,
> but I borrowed a solution from relational modeling.
>
> A persistent unique ID is generated using
>
>   seq_id
>   primary_tag
>   start
>   end
>
> It is assumed that these are all set and comprise a "unique key"
> over features. Of course, there's no way to enforce this with
> objects. The generated ID is simply these values concatenated with
> : delimiters. You can think of this is a skolem function if you're
> that way inclined.
>
> Another assumption is that seq_id is unique and persistent.
>
> Of course, if you're dealing with data that changes with time, then
> changing the coordinates of a feature will change it's id. This is
> fine. If you want to use your own IDs rather than the generated
> ones, you can simply set the primary_id() field - or if you are
> using genbank files, add something like this
>
>   /ID=CG12345-RA
>
> to the feature.
>
> Stuff still to do:
>
> * fix GFF3 to deal with roundtripping splitlocations
>
> * A nest_features() method, as discussed in the previous email.
> This is the opposite of set_ParentIDs_from_hierarchy(), for reading
> in GFF3 (and then writing to a feature-graph compatible format, or
> to a database such as biosql or chado).
>
> * Bio::SeqIO::GFF3
>
> I know a lot of us would like this a lot - is there any plans to
> implement this yet?
>
> * A GeneModel factory
>
> This would take the output of the unflattener (a set of feature
> graphs typed to SO) and make SeqFeature::Gene objects
>
> Cheers
> Chris
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l

- -- 
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)

iD8DBQFATLi20CIvUP7P+AkRAuErAKCc4iNS3cnVLpbkLAfpba176o29aQCgndia
SZzP/ANnxz7kvKmg+5Ovq9c=
=FM1X
-----END PGP SIGNATURE-----


More information about the Bioperl-l mailing list