[Bioperl-l] EntrezGene ASN parser

Hilmar Lapp hlapp at gmx.net
Fri Apr 1 03:52:10 EST 2005


On Wednesday, March 30, 2005, at 02:31  PM, Stefan Kirov wrote:

> I just finished a Bioperl EntrezGene Parser based on Mingyi Liu's ASN 
> Gene parser. It creates two main objects: a Bio::Seq object which 
> contains most of the data such as references, description, map 
> location, etc; and a Bio::Cluster::SequenceFamily object, which 
> contains the refseqs and the gene structure (through NT/NC annotation, 
> represented as Bio::SeqFeature::Gene objects).

You added Bio::SeqFeature::Gene objects to a 
Bio::Cluster::SequenceFamily instance?

Bio::Cluster::SequenceFamily, as a Bio::ClusterI, should accept only 
Bio::PrimarySeqI objects as members; i.e., these clusters were 
originally meant to hold sequences.

I'm not sure it's a good idea to mix bags of sequences with bags of 
features.

Or did I misunderstand, and you meant something else?

> Another piece of data I make available is the uncaptured data. Each 
> time some data is transferred from the hash that represents the parsed 
> data, I delete the respective key. Everything else is considered 
> uncaptured. I am doing this since some records could be non-compliant, 
> or NCBI may simply supply new data. Naturally, there will be some data 
> that is not interesting and therefore is not captured (there is a lot 
> of redundant data in EntrezGene). So the parser would act like this:
> my ($egene,$assoc_seq,$uncaptured)=$egparser->next_seq;

Be careful here, this is non-compliant with Bio::SeqIO which mandates 
that next_seq() return a sequence object.

You could use wantarray to determine whether to return a single object 
(presumably $egene?) or three elements, but if someone does

	my $seq = $egparser->next_seq();

the result should not be the scalar 3 (i.e., number of elements).
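
A minimal sketch of the wantarray idea, under stated assumptions: 
My::EntrezGeneParser and its record layout are hypothetical stand-ins, 
not the actual parser, and plain strings take the place of the real 
sequence and cluster objects.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the parser; not the real Bio::SeqIO module.
package My::EntrezGeneParser;

sub new {
    my ($class, @records) = @_;
    return bless { records => [@records] }, $class;
}

sub next_seq {
    my ($self) = @_;
    # No more records: undef in scalar context, empty list in list context.
    my $rec = shift @{ $self->{records} } or return;
    my ($egene, $assoc_seq, $uncaptured) = @$rec;
    # List context gets all three items; scalar context gets only the
    # sequence object, never an element count.
    return wantarray ? ($egene, $assoc_seq, $uncaptured) : $egene;
}

package main;

my $parser = My::EntrezGeneParser->new(
    [ 'gene_A', 'family_A', { leftover => 1 } ],
    [ 'gene_B', 'family_B', {} ],
);

my $seq = $parser->next_seq();                    # scalar context
my ($egene, $assoc_seq, $uncaptured) = $parser->next_seq();   # list context
print "$seq $egene $assoc_seq\n";
```

With this shape, a loop such as while (my $seq = $parser->next_seq) 
still sees one object per call, which is what Bio::SeqIO callers expect.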

> There are a few things I need to add (Markers and GO are not yet in 
> these objects), but most of the work is done. Unless somebody objects, 
> I will commit the code (Bio::SeqIO::entrezgene?) when I write the 
> documentation to match the standard.

Sounds like a good name. I suggest you commit so that interested others 
(i.e., me :) can have a look.

Also, if you have certain use cases driving your work that expect 
certain things in certain places, it'd be good if you start writing 
test cases that check for those things. I certainly have such a use 
case as I probably indicated earlier; so if I need things in different 
places than you put them it'd be good to see where changes can be made 
easily and where not. I depend(ed) a lot on the LocusLink annotation 
and that will be no different for its successor.

> A few notes:
> 1. It would be nice if there were a Bio::Annotation::DBLink::url 
> method. It makes sense (I think) since most DB links would also refer 
> to a webpage.

Feel free to add, but don't expect e.g. bioperl-db to (de)serialize 
this.

> 2. It now takes 45 minutes to parse the whole human ASN file, which is 
> 4 times slower. Keeping uncaptured data slows things down a bit, so I 
> will introduce a -debug option. Anyway, I think the speed is not going 
> to be an issue.

What would -debug do?

I think there should be an option to disable the keeping of what you 
call uncaptured data. Also, as I said before, the standard way of 
calling is to ask for a sequence object, so if I know in advance that 
that's all I'm ever going to do, I should have the option to disable 
construction of those other 2 objects you propose to return from 
next_seq.
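
As a rough sketch of such a switch, assuming a hypothetical helper 
capture_record() with a made-up keep_uncaptured option and illustrative 
key names (none of which come from the actual parser), the 
delete-as-you-capture bookkeeping could be made optional like this:

```perl
use strict;
use warnings;

# Hypothetical helper, not part of BioPerl. Copies the keys of interest
# out of the parsed-record hash; the key names are illustrative only.
sub capture_record {
    my ($parsed, %opt) = @_;
    my @wanted = qw(gene_id description map_location);
    my %captured;
    for my $key (@wanted) {
        next unless exists $parsed->{$key};
        # Delete captured keys only when the leftovers are wanted.
        $captured{$key} = $opt{keep_uncaptured}
            ? delete($parsed->{$key})
            : $parsed->{$key};
    }
    # Whatever remains was not captured; return it only on request.
    my $uncaptured = $opt{keep_uncaptured} ? { %$parsed } : undef;
    return (\%captured, $uncaptured);
}

my %record = (gene_id => 'G1', description => 'example gene', new_field => 'x');

# Pass a copy so the caller's hash is left untouched.
my ($captured, $rest) = capture_record({ %record }, keep_uncaptured => 1);
print join(',', sort keys %$rest), "\n";

my ($captured_only, $nothing) = capture_record({ %record });
```

When the option is off, the sketch neither deletes keys nor builds the 
leftover hash, so callers who only want the sequence object do not pay 
for the extra bookkeeping.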

Sounds like the Entrez Gene parser is coming along without me having to 
write it. I'm thrilled Stefan!!

	-hilmar


> 3. Due to the cyclic reference in the GeneStructure object I am 
> removing the Transcript->{parent} in the parser. This code should be 
> deleted once the Transcript object is fixed.
> There are also some other minor issues, but I think I will be able to 
> fix them by the end of the week.
> Please let me know what you think.
> Stefan
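
One common way to deal with a parent back-reference like the one in 
point 3 is a weak reference. This is only an illustration of that 
general technique, using plain hashes as hypothetical stand-ins for the 
gene structure and transcript objects; it is not how the Transcript 
object is or will be implemented.

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken isweak);

# Hypothetical stand-ins for the gene structure and one of its transcripts.
my $gene       = { name => 'example_gene', transcripts => [] };
my $transcript = { name => 'example_transcript', parent => $gene };
push @{ $gene->{transcripts} }, $transcript;

# gene -> transcript -> gene forms a cycle. Weakening the back-reference
# means it no longer counts toward the gene's reference count, so both
# structures can be freed once nothing else refers to the gene.
weaken($transcript->{parent});

print "parent link is weak\n" if isweak($transcript->{parent});
print "parent is still reachable: $transcript->{parent}{name}\n";
```

The parent link stays usable for as long as something else holds the 
gene, and becomes undef once the gene is freed.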
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------



