[Bioperl-l] Entrez Gene and bioperl-db

Peter Robinson Peter.Robinson at t-online.de
Thu Jan 6 15:44:27 EST 2005


Dear Bioperlers,

I have started writing some modules to parse the new Entrez Gene
database, which is a kind of expanded LocusLink. The really interesting
files are species-specific and in ASN.1 format, and I am still
experimenting with the best way of parsing them. To get started, I am
looking at the tab-delimited flat files. It seems to me that it would
be interesting to be able to parse gene_info and gene2accession using
the Bio::SeqIO system. The other files, such as gene2unigene, seem less
suited to this (gene2unigene has just two columns, which could be
parsed ad hoc easily enough).
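
As a rough illustration of the flat-file approach, here is a minimal
sketch in core Perl that turns one gene_info line into a hash. It is
not the attached Bio::SeqIO::geneinfo module. The column order in
@COLUMNS is an assumption that should be checked against the header of
the actual download, parse_gene_info_line is a hypothetical helper
name, and the sample line is made up.

```perl
use strict;
use warnings;

# Assumed order of the first nine gene_info columns; verify against the
# header line of your own copy of the file before relying on it.
my @COLUMNS = qw(tax_id GeneID Symbol LocusTag Synonyms dbXrefs
                 chromosome map_location description);

# Hypothetical helper: parse one tab-delimited line into a hash
# reference. Comment and blank lines yield undef; a lone '-' is
# treated as an empty field.
sub parse_gene_info_line {
    my ($line) = @_;
    chomp $line;
    return if $line =~ /^#/ || $line !~ /\S/;
    my @fields = split /\t/, $line, -1;    # -1 keeps trailing empty fields
    my %record;
    for my $i (0 .. $#COLUMNS) {
        my $value = $fields[$i];
        $record{ $COLUMNS[$i] } =
            (defined $value && $value ne '-') ? $value : undef;
    }
    return \%record;
}

# Made-up example line, for illustration only.
my $line = join "\t", 10090, 1, 'GeneA', '-', 'GA1|GA2', 'MGI:0000001',
                      1, '1 A1', 'example gene A';
my $gene = parse_gene_info_line($line);
print "$gene->{GeneID}\t$gene->{Symbol}\t$gene->{tax_id}\n";
```

A SeqIO-style parser would presumably wrap the same per-line logic in
next_seq() and hand the fields to the sequence factory.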

In any case, I am sending a proposed module Bio::SeqIO::geneinfo.pm as
well as a test script (which contains a small excerpt of gene_info in
the data section) for comments and criticism to the list. I am presently
working on another module for Bio::SeqIO::gene2accession and plan to
write a demo script using both modules to convert NCBI accession numbers
to MGI accession numbers (which is something one might want to do in
order to use Gene Ontology for Affymetrix data, although additional
work is needed for probesets that are related only to ESTs).
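
The conversion could look roughly like the following core-Perl sketch,
which chains two lookups: accession to GeneID (from gene2accession) and
GeneID to MGI cross-reference (from gene_info). This is not the planned
demo script. The column positions are assumptions about the two file
layouts, mgi_for_accession is a hypothetical helper name, and all rows
are invented stand-ins for real data.

```perl
use strict;
use warnings;

# Invented rows standing in for the two flat files.
# Assumed layouts: gene2accession has GeneID in column 2 and the RNA
# accession.version in column 4; gene_info has GeneID in column 2 and
# pipe-separated dbXrefs in column 6.
my @gene2accession = (
    join("\t", 10090, 1, 'PROVISIONAL', 'NM_000001.1', '-', 'NP_000001.1', '-'),
    join("\t", 10090, 2, 'PROVISIONAL', 'NM_000002.1', '-', 'NP_000002.1', '-'),
);
my @gene_info = (
    join("\t", 10090, 1, 'GeneA', '-', '-', 'MGI:0000001', 1, '-', 'example gene A'),
    join("\t", 10090, 2, 'GeneB', '-', '-', '-', 2, '-', 'example gene B'),
);

# Step 1: map each RNA accession (version suffix removed) to its GeneID.
my %gene_for_accession;
for my $row (@gene2accession) {
    my @f = split /\t/, $row, -1;
    my ($gene_id, $rna) = @f[1, 3];
    next if $rna eq '-';
    (my $accession = $rna) =~ s/\.\d+$//;
    $gene_for_accession{$accession} = $gene_id;
}

# Step 2: map each GeneID to its MGI cross-reference, if it has one.
my %mgi_for_gene;
for my $row (@gene_info) {
    my @f = split /\t/, $row, -1;
    my ($gene_id, $xrefs) = @f[1, 5];
    my ($mgi) = grep { /^MGI:/ } split /\|/, $xrefs;
    $mgi_for_gene{$gene_id} = $mgi if defined $mgi;
}

# Step 3: chain the two lookups; undef means no mapping was found.
sub mgi_for_accession {
    my ($accession) = @_;
    my $gene_id = $gene_for_accession{$accession};
    return undef unless defined $gene_id;
    return $mgi_for_gene{$gene_id};
}

for my $acc (qw(NM_000001 NM_000002)) {
    my $mgi = mgi_for_accession($acc);
    printf "%s\t%s\n", $acc, defined $mgi ? $mgi : 'no MGI xref';
}
```

For the full files the two hashes could get large, so a tied DB_File
hash might be a better fit than in-memory hashes.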

For the moment it seemed better to store just the NCBI taxon ID in the
Bio::Species object (gene_info supplies only this information), and to
expect users who need more to use the taxonomy support of other Bioperl
modules in their scripts.

I will continue to work on parsing the species-specific ASN.1 files,
trying a combination of lex/yacc/C. If that works, I will look into
Perl support for lex/yacc for potential use in Bioperl, but since I am
not sure how long this will take me, I do not want to scare off anyone
else who would like to give this a shot.

best,
peter


On Tue, 2005-01-04 at 22:03, Jason Stajich wrote:
> On Jan 4, 2005, at 3:52 PM, Peter Robinson wrote:
> 
> > Hi Jason,
> >
> > thanks for the advice. It seems as if the documentation of
> > Bio::DB::Taxonomy is a bit out of sync.
> >  my $db = new Bio::DB::Taxonomy(-source    => 'flatfile',
> >                                  -nodesfile => $nodesfile,
> >                                  -namesfile => $namefile);
> > What does 'flatfile' refer to here? It is not apparent upon looking at 
> > the code for new.
> >
> See Bio::DB::Taxonomy::flatfile for more information.  As I mentioned
> in the mail I sent, flatfile is for use with the taxonomy DB downloaded
> from NCBI.  This lets you run it locally using an indexed (BerkeleyDB
> via DB_File) version of the file.
> 
> You probably need the most up-to-date version of the modules - it works
> fine for me for both the entrez and flatfile code, but you may have to
> upgrade from the 1.4.0 release. Code from CVS or the bioperl-1.5 RC1
> code should work fine.
> 
> 
> 
> > I had somewhat better luck using the entrez version, but I got a 
> > pretty amusing error
> > message:
> >
> > MSG: can't create a species object for Homo sapiens (human) because it
> > isn't a species but is a '' instead
> >
> > ###
> > Full error and a dump of the script follow:
> >
> > my $db = new Bio::DB::Taxonomy(-source => 'entrez'); #
> > my $taxaid = $db->get_taxonid('Homo sapiens');
> > my $species = $db->get_Taxonomy_Node(-taxonid => '9606');
> > print Dumper($species);
> >
> > ###
> >
> > Use of uninitialized value in string eq at
> > /usr/local/share/perl/5.8.4/Bio/DB/Taxonomy/entrez.pm line 192.
> > Use of uninitialized value in sprintf at
> > /usr/local/share/perl/5.8.4/Bio/DB/Taxonomy/entrez.pm line 201.
> >
> > -------------------- WARNING ---------------------
> > MSG: can't create a species object for Homo sapiens (human) because it
> > isn't a species but is a '' instead
> > ---------------------------------------------------
> > Use of uninitialized value in string eq at
> > /usr/local/share/perl/5.8.4/Bio/DB/Taxonomy/entrez.pm line 192.
> > Use of uninitialized value in sprintf at
> > /usr/local/share/perl/5.8.4/Bio/DB/Taxonomy/entrez.pm line 201.
> >
> > -------------------- WARNING ---------------------
> > MSG: can't create a species object for Homo sapiens (human) because it
> > isn't a species but is a '' instead
> > ---------------------------------------------------
> > $VAR1 = {
> >           'TaxId' => '9606',
> >           'Division' => 'mammals',
> >           'GeneNumber' => '32775',
> >           'Rank' => 'species',
> >           'ProtNumber' => '247791',
> >           'ScientificName' => 'Homo sapiens',
> >           'CommonName' => 'human',
> >           'NucNumber' => '9025800',
> >           'GenNumber' => '25',
> >           'StructNumber' => '5638'
> >         };
> > peter at anna:~/programs/bioperlTest$
> >
> >
> > --best, peter
> >
> > On Mon, 2005-01-03 at 23:51, Jason Stajich wrote:
> >> Bio::DB::Taxonomy is the factory code - it is pretty easy to get a
> >> species object (or equivalent) using this code.  But you cannot (or
> >> could not when I wrote this, not sure of the current status) get the
> >> full classification from the NCBI taxonomy retrieval via cgi, i.e. you
> >> can only get genus and species for a taxon id and I don't know how to
> >> walk up the hierarchy using the web API.  Earlier emails to NCBI seemed
> >> to indicate this is all they intended to provide, but not sure what the
> >> current status is.
> >>
> >>    # use NCBI Entrez over HTTP
> >>    my $db = new Bio::DB::Taxonomy(-source => 'entrez');
> >>    my $taxaid = $db->get_taxonid('Homo sapiens');
> >>    my $taxonnode = $db->get_Taxonomy_Node(-taxonid => '9606');
> >>
> >> You can get the full classification if you use the
> >> Bio::DB::Taxonomy::flatfile factory which requires you to have
> >> downloaded the taxonomy db flatfile from NCBI.  Since this is more
> >> reliable (and faster) it is what I have tended to use for grouping sets
> >> of seqDB search results, etc.
> >>
> >> -jason
> >> On Jan 3, 2005, at 5:40 PM, Peter Robinson wrote:
> >>
> >>> Hi Bioperlers, hi Hilmar,
> >>>
> >>> after some thinking I have embarked on a lex/yacc parser for the Entrez
> >>> Gene ASN.1 format as the path of least resistance, although I am not
> >>> sure how that would fit into BioPerl. If anyone is interested in this (or
> >>> has a better idea of how to go about it), please drop me a line.
> >>>
> >>> In the meantime I have been looking at writing code to parse some of
> >>> the "easy" Entrez Gene documents, starting off with gene_info. This file
> >>> includes the NCBI taxon id for each entry. I would like to convert this
> >>> to a Bio::Species object to pass to the following
> >>> 	my $seq = $self->sequence_factory->create(
> >>> 			     -verbose => $self->verbose(),
> >>> 			     -accession_number => $geneID,
> >>> 			     -desc => $description,
> >>> 			     -display_id => $symbol,
> >>> 			     -species =>  ???
> >>> 			     -annotation => $ann);
> >>>
> >>> and saw the Bio::Taxonomy::FactoryI code, which appears to want to do
> >>> this sort of thing. However, the code for that is pretty preliminary.
> >>> Is anyone working on this at the moment? Or is there a better way of
> >>> doing this (it seems a shame not to provide the actual species name if
> >>> one has the taxid...)
> >>>
> >>> best
> >>>
> >>> Peter
> >>>
> >>>
> >>>
> >>> On Tue, 2004-12-28 at 07:17, Hilmar Lapp wrote:
> >>>> Great to hear that someone is giving this a shot. Yes, at this point it
> >>>> appears that NCBI is only offering the ASN.1, not a conversion to XML.
> >>>> Their asn2xml tool will not work with this ASN.1 format either, just
> >>>> checked it to be sure. They do seem to be mulling the option of XML
> >>>> though on the Gene FAQ. Maybe if enough people get in their ears they
> >>>> will spend some effort towards that. After all, the entrez gene web
> >>>> interface can display XML on demand - even though it looks fairly
> >>>> hideous.
> >>>>
> >>>> There is no ASN.1 support in bioperl at all. Also, ASN.1 support in
> >>>> perl is actually thin - there is Convert::ASN1 at version 0.18 two
> >>>> years ago that I could find ... doesn't make me feel warm and fuzzy.
> >>>>
> >>>> In the absence of any XML available from NCBI, gene_info might be the
> >>>> best start. An option could be to check for the presence of the other
> >>>> tab-delimited files and use those that are present. These are
> >>>> tab-delimited and hence the format itself is trivial so you can focus
> >>>> entirely on setting up a Bio::Seq plus annotation that's
> >>>> comparable/compatible to what the current SeqIO::locuslink does.
> >>>>
> >>>> My $0.02 (worth less and less almost every day).
> >>>>
> >>>> 	-hilmar
> >>>>
> >>>> On Thursday, December 23, 2004, at 10:51  AM, Peter Robinson wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I have been thinking about giving a BioPerl EntrezGene parser a try
> >>>>> since I have been a heavy user of LocusLink to date. One issue is that
> >>>>> the files that correspond to LL_tmpl (which was a flat file) are now in
> >>>>> ASN.1 format
> >>>>> http://www.ncbi.nlm.nih.gov/entrez/query/static/help/genehelp.html#query
> >>>>> Although I saw some mention of ASN support in Bioperl by googling, I
> >>>>> can't seem to find any module that does this in the present
> >>>>> distribution. What is the status on that? In any case, I will be
> >>>>> working on this in the next month or two and if anything nice comes of
> >>>>> it I will send it to you / BioPerl.
> >>>>>
> >>>>> best wishes & happy holidays
> >>>>>
> >>>>> Peter
> >>>>>
> >>>>> On Tue, 2004-12-14 at 09:00, Hilmar Lapp wrote:
> >>>>>> Since load_seqdatabase.pl will use bioperl's SeqIO parsers for
> >>>>>> parsing any input file, what you're asking is whether or not there
> >>>>>> is a SeqIO parser for NCBI Gene.
> >>>>>>
> >>>>>> The answer to that question is no, not yet. Anybody who feels
> >>>>>> motivated is welcome to give it a try ... Since I'll need it, I'll
> >>>>>> write the parser if nobody else does within the next 3 months, but
> >>>>>> I'm not going to promise when exactly this will happen.
> >>>>>>
> >>>>>> 	-hilmar
> >>>>>>
> >>>>>> On Monday, December 13, 2004, at 08:03  AM, Law, Annie wrote:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> I was wondering, with regard to the bioperl-db scripts, schema, and
> >>>>>>> load_seqdatabase.pl: has there been preparation for integration of
> >>>>>>> Entrez Gene information when LocusLink is phased out?  Or, if it has
> >>>>>>> already been changed, could somebody point me to the documentation
> >>>>>>> or changed code?
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Annie.
> >>>>>>> _______________________________________________
> >>>>>>> Bioperl-l mailing list
> >>>>>>> Bioperl-l at portal.open-bio.org
> >>>>>>> http://portal.open-bio.org/mailman/listinfo/bioperl-l
> >>>>>>>
> >>>>>>>
> >>>>> -- 
> >>>>> Peter N. Robinson
> >>>>> peter.robinson at t-online.de
> >>>>> peter.robinson at charite.de
> >>>>> http://www.charite.de/ch/medgen/robinson/
> >>>>>
> >>>>>
> >> --
> >> Jason Stajich
> >> jason.stajich at duke.edu
> >> http://www.duke.edu/~jes12/
-- 
Peter N. Robinson
peter.robinson at t-online.de
peter.robinson at charite.de
http://www.charite.de/ch/medgen/robinson/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: geneinfo.pm
Type: application/x-perl
Size: 10931 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/bioperl-l/attachments/20050106/6754c375/geneinfo.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: geneinfotest.pl
Type: application/x-perl
Size: 11184 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/bioperl-l/attachments/20050106/6754c375/geneinfotest.bin