[Bioperl-l] Packages retrieving online alignment sequences

Chris Fields cjfields at illinois.edu
Sat Aug 7 22:07:39 UTC 2010


On Aug 7, 2010, at 4:14 PM, Gregory Jordan wrote:

> Maybe I'm just a bit naive here, but what is the expected difference between
> accession and ID and why do we need a separate method for each?

Depends on the remote service, but in many cases there is a difference.  With NCBI eutils you can use either an accession or the unique identifier (UID, or GI for nuc/protein seqs).  efetch accepts both, but only the UID is guaranteed to retrieve a single sequence every time; an accession can (very rarely) map to more than one sequence.

The other eutils services require either a string (esearch) or a UID, but do not allow an accession.

> Seems to me
> that one could just have a single method, get_Aln, which determines under
> the hood whether the query string is an accession or ID.

A simpler method could be introduced, but I can see that being potentially brittle in the long run.  A naked alphanumeric string doesn't reveal much about what it is at face value w/o knowing database/service-specific behavior.  And then we're reliant on that behavior not changing, which we can't guarantee (this has bitten us in the past).  What would one do if NCBI (for instance) allowed accessions composed entirely of digits, or conversely a unique ID with mixed alphanumerics?
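
To make the brittleness concrete, here is a minimal sketch of the kind of guessing a single get_Aln would have to do under the hood.  It is written in Python purely as a language-neutral illustration; the function name and the "all digits means ID" rule are assumptions made up for this example, not BioPerl code and not documented NCBI behavior.

```python
import re


def guess_identifier_type(query):
    """Hypothetical heuristic: an all-digit string is treated as a unique ID,
    anything else as an accession."""
    if re.fullmatch(r"\d+", query):
        return "id"
    return "accession"


print(guess_identifier_type("12345"))      # id
print(guess_identifier_type("NM_000546"))  # accession
```

The guess only holds for as long as the service keeps IDs purely numeric and accessions non-numeric.  An all-digit accession would silently be fetched as an ID, and nothing in the calling code would notice.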

Using methods specific for ID/acc at least guarantees a behavior on the backend w/o guessing, and if there is no danger of overlap (a service accepts either/or) one could simply be an alias of the other.
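
A rough sketch of that layout, again in Python for illustration only.  The class and method names are hypothetical, and the methods just record which field would be sent to the service rather than contacting anything.

```python
class AlignmentFetcher:
    """Hypothetical fetcher for a service that treats IDs and accessions
    differently.  Each method states explicitly which field it queries."""

    def get_aln_by_id(self, uid):
        # A real implementation would query the service by unique ID.
        return {"field": "id", "value": uid}

    def get_aln_by_acc(self, acc):
        # A real implementation would query the service by accession.
        return {"field": "acc", "value": acc}


class EitherOrFetcher(AlignmentFetcher):
    """Hypothetical fetcher for a service that accepts either form:
    the accession method is simply an alias of the ID method."""

    get_aln_by_acc = AlignmentFetcher.get_aln_by_id
```

The caller always says what kind of string is being passed, so the backend never has to guess; where a service doesn't care, the alias costs one line.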

> It would be nice if the SimpleAlign object had its Annotation filled with
> some extra metadata (such as accession, ID, database version number, URI,
> etc.).

According to the deobfuscator SimpleAlign does have accession() and id().  The others could be simple attributes, and can be added as simple getter/setters, or as annotation via Bio::Annotation (this is the way Stockholm annotation is currently handled).  Something to think about.

> One other thing: have you thought about adding an Ensembl adaptor? Or maybe
> something similar already exists in BioPerl...?

That's a good idea, though it might make more sense if this was done when mem-efficient (possibly DB-dependent) AlignI modules are present within bioperl, which is part of the GSoC (see below).  For instance, have a Bio::Align::AlignI with a backend ensembl DB adaptor that works lazily.
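
By "works lazily" I mean something along these lines: the alignment object holds only an identifier plus a loader, and the sequences are pulled from the backend on first access.  This is a minimal sketch in Python, not the AlignI interface; LazyAlignment, fake_loader, and the family name are all hypothetical, and the loader stands in for whatever would actually query the Ensembl DB.

```python
class LazyAlignment:
    """Hypothetical alignment that defers loading until first access."""

    def __init__(self, accession, loader):
        self.accession = accession
        self._loader = loader  # callable that would query the backend
        self._seqs = None      # nothing loaded yet

    @property
    def sequences(self):
        # Hit the backend only on first access, then reuse the result.
        if self._seqs is None:
            self._seqs = self._loader(self.accession)
        return self._seqs


calls = []  # records each backend request, to show when loading happens


def fake_loader(accession):
    """Stand-in for a real DB adaptor; returns two made-up aligned rows."""
    calls.append(accession)
    return ["ACGT-", "AC-TT"]


aln = LazyAlignment("hypothetical_family_1", fake_loader)
```

Constructing aln does not call the loader; the first read of aln.sequences does, and later reads reuse the cached rows.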

If using the Ensembl Perl API, a few possible roadblocks/problems might pop up. Ensembl currently requires bioperl (v1.2.3, but it works with the latest as well, at least when I've used it).  If using the ensembl perl API we would just need to ensure we aren't conflicting with ensembl code that pulls in bioperl classes expecting a v1.2.3 API when we only support the latest.  I don't foresee this being an issue, though (there is precedent for this, see Sendu's Ensembl module Bio::Tools::Run::Ensembl in bioperl-run).

> Sure Ensembl provides their own Perl API, but for someone who doesn't want
> to go through the hassle of installing it from CVS (pardon my french, but
> wtf!?! Who still uses CVS) and learning a whole new API, it might be
> convenient to have a simple BioPerl module for quickly grabbing gene family
> alignments from the public Ensembl MySQL databases. I'd be willing to help
> write the necessary SQL queries for this.
> 
> greg

The GSoC project on alignment subsystem refactoring will be finishing up this month, so I'm sure Jun can discuss ideas for initial DB-dependent implementations.  The more input and coders implementing the better, IMO.

As for writing up an adaptor to ensembl outside of its API, overall I don't think it's a bad idea, but if it's possible maybe start without reinventing things, then move to direct SQL.  Unless it's easier to use SQL.

chris

> On 6 August 2010 14:11, Jun Yin <jun.yin at ucd.ie> wrote:
> 
>> Hi, Dave,
>> 
>> Thx for reminding me this. I will definitely try it.
>> 
>> Cheers,
>> Jun Yin
>> Ph.D. student in U.C.D.
>> 
>> Bioinformatics Laboratory
>> Conway Institute
>> University College Dublin
>> 
>> 
>> -----Original Message-----
>> From: Dave Messina [mailto:David.Messina at sbc.su.se]
>> Sent: Friday, August 06, 2010 2:07 PM
>> To: Jun Yin
>> Cc: bioperl-l at lists.open-bio.org
>> Subject: Re: [Bioperl-l] Packages retrieving online alignment sequences
>> 
>> Sounds great, Jun!
>> 
>> Did you happen to test your code on very large alignments? I know there's
>> one in Pfam that's something like 100,000 sequences. An rRNA, I believe.
>> 
>> 
>> Dave
>> 
>> 
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> 