[Bioperl-l] Problem retrieving CDS by Accession #

Chris Fields cjfields at uiuc.edu
Thu Sep 7 16:33:39 UTC 2006


...
> >
> > get_Seq_by_version() worked.
> >
> > That does not explain why get_Seq_by_acc does not work with the primary
> > part of the accession #.
> 
> As an example of why this shouldn't work, doing a search in entrez (online
> version) will bring up the newest version of an accession if the version is
> not included.  If one specifies the version, though, one gets that version,
> even if it is not the newest.  So, asking get_Seq_by_acc() with a version
> and ignoring the version would potentially get you the wrong version for
> the accession.
> 
> If you know that you want the most recent version, just strip the version
> information and use get_Seq_by_acc().

As an aside, if you want exactly one unique sequence back, as the
get_Seq_by* methods assume, you should consider using the GI.

NCBI recommends retrieving sequence data using the GI or accession.version
and not the accession alone.  Using the accession works 99% of the time, but
I have seen a few instances where retrieving a sequence by accession alone
gets the wrong sequence via Bio::DB::GenBank/GenPept get_Seq_by_acc(),
sometimes even getting mixed-up sequences:

http://article.gmane.org/gmane.comp.lang.perl.bio.general/11560/
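
A minimal sketch of picking the most specific lookup for whatever ID you
have.  classify_id and fetch_seq are hypothetical helper names, not part of
BioPerl, and the regex is only a rough guess at what a GI or an
accession.version looks like.  The get_Seq_by_gi(), get_Seq_by_version() and
get_Seq_by_acc() calls are the usual Bio::DB::GenBank ones and need network
access.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): decide which lookup fits an ID.
# Returns 'gi' for all-digit IDs, 'version' for accession.version,
# and 'acc' for anything else (treated as a bare accession).
sub classify_id {
    my ($id) = @_;
    return 'gi'      if $id =~ /^\d+$/;
    return 'version' if $id =~ /^[A-Za-z_]+\d+\.\d+$/;
    return 'acc';
}

# Hypothetical helper: fetch one sequence using the most specific lookup.
# Bio::DB::GenBank is loaded lazily, so classify_id() works without it.
sub fetch_seq {
    my ($id) = @_;
    require Bio::DB::GenBank;
    my $db   = Bio::DB::GenBank->new;
    my $kind = classify_id($id);
    return $db->get_Seq_by_gi($id)      if $kind eq 'gi';
    return $db->get_Seq_by_version($id) if $kind eq 'version';
    warn "Bare accession '$id': check that the record returned is the one you want\n";
    return $db->get_Seq_by_acc($id);
}
```

Calling fetch_seq() with a GI or an accession.version avoids the ambiguous
accession-only path; a bare accession still works but prints a warning.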

Part of the reason is a quirk with some sequences returned by NCBI's
EUtilities, which may be due to misclassification in the database.  Another
is that NCBI does not consider the accession unique, since NCBI may not be
the one that assigned it (it may come from another database, for instance),
so more than one sequence may be returned.  The get_Seq* methods return only
one sequence from the stream, namely the first (they expect only one anyway).
If the first one in the sequence stream is the wrong one...
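
One way to guard against that is to ask for the stream instead of a single
sequence and complain if it holds more than one record.  A rough sketch:
only_one is a made-up helper name, not a BioPerl method, and it accepts any
object with a next_seq() method, such as the stream returned by
get_Stream_by_acc().

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): drain a sequence stream and
# insist that it holds exactly one record.  $stream is any object with a
# next_seq() method, e.g. the stream returned by get_Stream_by_acc().
sub only_one {
    my ($stream) = @_;
    my @seqs;
    while ( my $seq = $stream->next_seq ) {
        push @seqs, $seq;
    }
    die "no sequence returned\n" unless @seqs;
    die "expected 1 sequence, got " . scalar(@seqs) . "\n" if @seqs > 1;
    return $seqs[0];
}

# Usage (needs BioPerl and network access; 'ACCESSION' is a placeholder):
#   use Bio::DB::GenBank;
#   my $db  = Bio::DB::GenBank->new;
#   my $seq = only_one( $db->get_Stream_by_acc('ACCESSION') );
```

This does not tell you which record is the right one, only that the
accession was ambiguous, at which point falling back to the GI or
accession.version is the safer route.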

Chris




