[Bioperl-l] Entrez Gene and bioperl-db

Stefan A Kirov skirov at utk.edu
Thu Jan 6 21:33:05 EST 2005


Peter,
Why can't unigene be added as a Bio::Annotation object, for example? Also,
would you mind if I give you a hand? I am also doing some Entrez Gene
DB parsing.
Hilmar,
Getting back to your post, I have some concerns about automatic
parsing of multiple files (if I got this right...). If one downloads
the whole Entrez Gene set and all is OK, I don't see why this can't be
done. But if something goes wrong (and occasionally it will), it will be
really hard for the user to notice that parts of the data are missing. Of
course this could be handled through warnings, but what about people who
intentionally parse only part of the DB? I guess one could add something like
-suppress_warning=>1/0.
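
A rough sketch of what such a check might look like, using core Perl only.
The helper name find_gene_files and its -dir argument are made up for
illustration and are not part of BioPerl; only the -suppress_warning
switch and the three file names come from this thread.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper (not a BioPerl API): report which of the
# tab-delimited Entrez Gene files exist in a directory, and warn about
# the missing ones unless -suppress_warning is true.
sub find_gene_files {
    my (%args) = @_;
    my $dir   = $args{-dir};
    my $quiet = $args{-suppress_warning};
    my (%found, @missing);
    for my $name (qw(gene_info gene2accession gene2unigene)) {
        my $path = File::Spec->catfile($dir, $name);
        if (-e $path) { $found{$name} = $path }
        else          { push @missing, $name }
    }
    warn "Entrez Gene files not found in $dir: @missing\n"
        if @missing && !$quiet;
    # Return both lists so the caller can decide how strict to be.
    return (\%found, \@missing);
}
```

Someone who deliberately parses only gene_info would call it with
-suppress_warning => 1 and could still inspect the returned list of
missing files.
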
Another issue that comes to mind: the stream approach is fine for
people who want the whole DB. But if you need a particular
record, I guess you could index the files, but this is a totally different
game. Any volunteers?
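
To make the indexing idea a bit more concrete, here is a minimal sketch
in core Perl that records the byte offset of each line and later seeks
straight to one record. It assumes GeneID is the second tab-separated
column of gene_info and that lines starting with '#' are comments; both
are assumptions about the file layout, and build_index and fetch_record
are illustrative names, not existing BioPerl functions.

```perl
use strict;
use warnings;

# Build an in-memory index: GeneID => byte offset of its line.
# Assumption: GeneID is the second tab-separated column.
sub build_index {
    my ($file) = @_;
    my %offset;
    open(my $fh, '<', $file) or die "Cannot open $file: $!";
    while (1) {
        my $pos  = tell($fh);          # offset of the line about to be read
        my $line = <$fh>;
        last unless defined $line;
        next if $line =~ /^#/;         # skip comment/header lines
        chomp $line;
        my (undef, $gene_id) = split /\t/, $line;
        $offset{$gene_id} = $pos if defined $gene_id;
    }
    close $fh;
    return \%offset;
}

# Fetch one record by seeking to its stored offset.
# Returns an array ref of fields, or nothing if the id is not indexed.
sub fetch_record {
    my ($file, $index, $gene_id) = @_;
    my $pos = $index->{$gene_id};
    return unless defined $pos;
    open(my $fh, '<', $file) or die "Cannot open $file: $!";
    seek($fh, $pos, 0);
    my $line = <$fh>;
    close $fh;
    chomp $line;
    return [ split /\t/, $line ];
}
```

For a persistent index, the same hash could presumably be tied to an
on-disk file (for instance via DB_File, as the taxonomy flatfile code
mentioned below does), but that is not shown here.
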


On Thu, 6 Jan 2005, Peter Robinson wrote:

>Dear Bioperlers,
>
>I have started looking at writing some modules to parse the new Entrez
>gene, which is kind of an expanded LocusLink. The really interesting
>files are species specific and are in the ASN.1 format, and I am still
>experimenting around with the best way of parsing them. To get started,
>I am looking at the tab-delimited flat files. It seems to me that it
>would be interesting to be able to parse gene_info and gene2accession
>using the Bio::SeqIO system; the other files such as gene2unigene seem
>less suited for this (the latter has just two entries which could be
>parsed ad hoc easily enough).
>
>In any case, I am sending a proposed module Bio::SeqIO::geneinfo.pm as
>well as a test script (which contains a small excerpt of gene_info in
>the data section) for comments and criticism to the list. I am presently
>working on another module for Bio::SeqIO::gene2accession and plan to
>write a demo script using both modules to convert NCBI accession numbers
>to MGI accession numbers (which is something one might want to do in
>order to use Gene Ontology for Affymetrix data, although one needs
>additional work for probesets which are only related to ESTs).
>
>For the moment it seemed better to just parse in the NCBI taxon id into
>the Bio::Species object (only this info is supplied by gene_info), and
>expect users who need the information to use the taxonomy support of
>other Bioperl modules in their scripts.
>
>I will continue to work on parsing the species specific ASN.1 files, but
>I will be trying a combination of lex/yacc/C to do this. If that works I
>will look into trying perl support for lex/yacc for potential use in
>Bioperl, but since I am not sure how long this will take me, I do not
>want to scare off anyone else who would like to give this a shot.
>
>best,
>peter
>
>
>On Tue, 2005-01-04 at 22:03, Jason Stajich wrote:
>> On Jan 4, 2005, at 3:52 PM, Peter Robinson wrote:
>>
>> > Hi Jason,
>> >
>> > thanks for the advice. It seems as if the documentation of
>> > Bio::DB::Taxonomy is a bit out of sync.
>> >  my $db = new Bio::DB::Taxonomy(-source => 'flatfile',
>> >                                  -nodesfile => $nodesfile,
>> >                                  -namesfile => $namefile);
>> > What does 'flatfile' refer to here? It is not apparent upon looking at
>> > the code for new.
>> >
>> See Bio::DB::Taxonomy::flatfile for more information.  As I mentioned
>> in the mail I sent, flatfile is for when you have downloaded the
>> taxonomy DB from NCBI.  This lets you run it locally using an indexed
>> (BerkeleyDB via DB_File) version of the file.
>>
>> You probably need the most up-to-date version of the modules - works fine
>> for me for both the entrez and flatfile code, but you may have to
>> upgrade off of the 1.4.0 release. Code from CVS or the bioperl-1.5 RC1
>> code should work fine.
>>
>>
>>
>> > I had somewhat better luck using the entrez version, but I got a
>> > pretty amusing error message:
>> >
>> > MSG: can't create a species object for Homo sapiens (human) because it
>> > isn't a species but is a '' instead
>> >
>> > ###
>> > Full error and a dump of the script follow:
>> >
>> > my $db = new Bio::DB::Taxonomy(-source => 'entrez'); #
>> > my $taxaid = $db->get_taxonid('Homo sapiens');
>> > my $species = $db->get_Taxonomy_Node(-taxonid => '9606');
>> > print Dumper($species);
>> >
>> > ###
>> >
>> > Use of uninitialized value in string eq at
>> > /usr/local/share/perl/5.8.4/Bio/DB/Taxonomy/entrez.pm line 192.
>> > Use of uninitialized value in sprintf at
>> > /usr/local/share/perl/5.8.4/Bio/DB/Taxonomy/entrez.pm line 201.
>> >
>> > -------------------- WARNING ---------------------
>> > MSG: can't create a species object for Homo sapiens (human) because it
>> > isn't a species but is a '' instead
>> > ---------------------------------------------------
>> > Use of uninitialized value in string eq at
>> > /usr/local/share/perl/5.8.4/Bio/DB/Taxonomy/entrez.pm line 192.
>> > Use of uninitialized value in sprintf at
>> > /usr/local/share/perl/5.8.4/Bio/DB/Taxonomy/entrez.pm line 201.
>> >
>> > -------------------- WARNING ---------------------
>> > MSG: can't create a species object for Homo sapiens (human) because it
>> > isn't a species but is a '' instead
>> > ---------------------------------------------------
>> > $VAR1 = {
>> >           'TaxId' => '9606',
>> >           'Division' => 'mammals',
>> >           'GeneNumber' => '32775',
>> >           'Rank' => 'species',
>> >           'ProtNumber' => '247791',
>> >           'ScientificName' => 'Homo sapiens',
>> >           'CommonName' => 'human',
>> >           'NucNumber' => '9025800',
>> >           'GenNumber' => '25',
>> >           'StructNumber' => '5638'
>> >         };
>> > peter at anna:~/programs/bioperlTest$
>> >
>> >
>> > --best, peter
>> >
>> > On Mon, 2005-01-03 at 23:51, Jason Stajich wrote:
>> >> Bio::DB::Taxonomy is the factory code - it is pretty easy to get a
>> >> species object (or equivalent) using this code.  But you cannot (or
>> >> could not when I wrote this, not sure of the current status) get the
>> >> full classification from the NCBI taxonomy retrieval via cgi, i.e. you
>> >> can only get genus and species for a taxon id and I don't know how to
>> >> walk up the hierarchy using the web API.  Earlier emails to NCBI seemed
>> >> to indicate this is all they intended to provide, but not sure what the
>> >> current status is.
>> >>
>> >>   my $db = new Bio::DB::Taxonomy(-source => 'entrez'); # use NCBI Entrez over HTTP
>> >>    my $taxaid = $db->get_taxonid('Homo sapiens');
>> >>    my $taxonnode = $db->get_Taxonomy_Node(-taxonid => '9606');
>> >>
>> >> You can get the full classification if you use the
>> >> Bio::DB::Taxonomy::flatfile factory which requires you to have
>> >> downloaded the taxonomy db flatfile from NCBI.  Since this is more
>> >> reliable (and faster) it is what I have tended to use for grouping sets
>> >> of seqDB search results, etc.
>> >>
>> >> -jason
>> >> On Jan 3, 2005, at 5:40 PM, Peter Robinson wrote:
>> >>
>> >>> Hi Bioperlers, hi Hilmar,
>> >>>
>> >>> after some thinking I have embarked on a lex/yacc parser for the Entrez
>> >>> Gene ASN.1 format as the path of least resistance, although I am not sure
>> >>> how that would fit into BioPerl. If anyone is interested in this (or
>> >>> has a better idea of how to go about it..), please drop me a line.
>> >>>
>> >>> In the meantime I have been looking at writing code to parse some of the
>> >>> "easy" Entrez Gene documents, starting off with gene_info. This file
>> >>> includes the NCBI taxon id for each entry. I would like to convert this
>> >>> to a Bio::Species object to pass to the following
>> >>> 	my $seq = $self->sequence_factory->create(
>> >>> 			     -verbose => $self->verbose(),
>> >>> 			     -accession_number => $geneID,
>> >>> 			     -desc => $description,
>> >>> 			     -display_id => $symbol,
>> >>> 			     -species =>  ???
>> >>> 			     -annotation => $ann);
>> >>>
>> >>> and saw the Bio::Taxonomy::FactoryI code, which appears to want to do
>> >>> this sort of thing. However, the code for that is pretty preliminary. Is
>> >>> anyone working on this at the moment? Or is there a better way of doing
>> >>> this (it seems a shame not to provide the actual species name if one has
>> >>> the taxid...)
>> >>>
>> >>> best
>> >>>
>> >>> Peter
>> >>>
>> >>>
>> >>>
>> >>> On Tue, 2004-12-28 at 07:17, Hilmar Lapp wrote:
>> >>>> Great to hear that someone is giving this a shot. Yes, at this point it
>> >>>> appears that NCBI is only offering the ASN.1, not a conversion to XML.
>> >>>> Their asn2xml tool will not work with this ASN.1 format either, just
>> >>>> checked it to be sure. They do seem to be mulling the option of XML
>> >>>> though on the Gene FAQ. Maybe if enough people get in their ears they
>> >>>> will spend some effort towards that. After all, the entrez gene web
>> >>>> interface can display XML on demand - even though it looks fairly
>> >>>> hideous.
>> >>>>
>> >>>> There is no ASN.1 support in bioperl at all. Also, ASN.1 support in
>> >>>> perl is actually thin - there is Convert::ASN1 at version 0.18 two
>> >>>> years ago that I could find ... doesn't make me feel warm and fuzzy.
>> >>>>
>> >>>> In the absence of any XML available from NCBI, gene_info might be the
>> >>>> best start. An option could be to check for the presence of the other
>> >>>> tab-delimited files and use those that are present. These are
>> >>>> tab-delimited and hence the format itself is trivial so you can focus
>> >>>> entirely on setting up a Bio::Seq plus annotation that's
>> >>>> comparable/compatible to what the current SeqIO::locuslink does.
>> >>>>
>> >>>> My $0.02 (worth less and less almost every day).
>> >>>>
>> >>>> 	-hilmar
>> >>>>
>> >>>> On Thursday, December 23, 2004, at 10:51  AM, Peter Robinson wrote:
>> >>>>
>> >>>>> Hi,
>> >>>>>
>> >>>>> I have been thinking about giving a BioPerl EntrezGene parser a try since
>> >>>>> I have been a heavy user of locus link to date. One issue is that the
>> >>>>> files that correspond to LL_tmpl (which was a flat file) are now in asn
>> >>>>> format
>> >>>>> http://www.ncbi.nlm.nih.gov/entrez/query/static/help/genehelp.html#query
>> >>>>> Although I saw some mention of ASN support in Bioperl by googling, I
>> >>>>> can't seem to find any module that does this in the present
>> >>>>> distribution. What is the status on that? In any case, I will be working
>> >>>>> on this in the next month or two and if anything nice comes of it I will
>> >>>>> send it to you / BioPerl.
>> >>>>>
>> >>>>> best wishes & happy holidays
>> >>>>>
>> >>>>> Peter
>> >>>>>
>> >>>>> On Tue, 2004-12-14 at 09:00, Hilmar Lapp wrote:
>> >>>>>> Since load_seqdatabase.pl will use bioperl's SeqIO parsers for parsing
>> >>>>>> any input file, what you're asking is whether or not there is a SeqIO
>> >>>>>> parser for NCBI Gene.
>> >>>>>>
>> >>>>>> The answer to that question is no, not yet. Anybody who feels motivated
>> >>>>>> is welcome to give it a try ... Since I'll need it, I'll write the
>> >>>>>> parser if nobody else does within the next 3 months, but I'm not going
>> >>>>>> to promise when exactly this will happen.
>> >>>>>>
>> >>>>>> 	-hilmar
>> >>>>>>
>> >>>>>> On Monday, December 13, 2004, at 08:03  AM, Law, Annie wrote:
>> >>>>>>
>> >>>>>>> Hi,
>> >>>>>>>
>> >>>>>>> I was wondering, with regard to the bioperl-db scripts, schema and
>> >>>>>>> load_seqdatabase.pl, has there been preparation for integration of Entrez
>> >>>>>>> Gene information when locuslink is phased out?  Or if it has already been
>> >>>>>>> changed, could somebody point me to the documentation or changed code?
>> >>>>>>>
>> >>>>>>> Thanks,
>> >>>>>>> Annie.
>> >>>>>>> _______________________________________________
>> >>>>>>> Bioperl-l mailing list
>> >>>>>>> Bioperl-l at portal.open-bio.org
>> >>>>>>> http://portal.open-bio.org/mailman/listinfo/bioperl-l
>> >>>>>>>
>> >>>>>>>
>> >>>>> --
>> >>>>> Peter N. Robinson
>> >>>>> peter.robinson at t-online.de
>> >>>>> peter.robinson at charite.de
>> >>>>> http://www.charite.de/ch/medgen/robinson/
>> >>>>>
>> >>>>>
>> >> --
>> >> Jason Stajich
>> >> jason.stajich at duke.edu
>> >> http://www.duke.edu/~jes12/
>--
>Peter N. Robinson
>peter.robinson at t-online.de
>peter.robinson at charite.de
>http://www.charite.de/ch/medgen/robinson/
>

