[Bioperl-l] Uniprot/Swiss accessions?

Tue May 19 01:44:35 UTC 2009

No, that doesn't work :-(
Here's some BLAST output with the database formatted with local ids (note the "Unknown" descriptions and the alignments with no subject lines):
=====================================================================
Database: uniprot_sprot.fasta
           466,739 sequences; 165,389,953 total letters

Searching..................................................done

                                                                 Score    E
Sequences producing significant alignments:                      (bits) Value

sp|Q4U9M9|104K_THEAN Unknown                                          421   e-117
sp|P15711|104K_THEPA Unknown                                          265   6e-70
sp|Q2SPQ2|CHED_HAHCH Unknown                                           33   4.2

 Score =  421 bits (1083), Expect = e-117,   Method: Compositional matrix adjust.
 Identities = 0/209 (0%), Positives = 0/209 (0%)

Query: 1   VHKVVEGDIVIWENEEMPLYTCAIVTQNEVPYMAYVELLEDPDLIFFLKEGDQWAPIPED 60

Query: 61  QYLAXXXXXXXXIHTESFFSLNLSFQHENYKYEMVSSFQHSIKMVVFTPKNGHICKMVYD 120

Query: 121 KNIRIFKALYNEYVTSVIGFFRGLKLLLLNIFVIDDRGMIGNKYFQLLDDKYAPISVQGY 180

Query: 181 VATIPKLKDFAEPYHPIILDISDIDYVNF 209

===========================================================================

If I tweak the FASTA file, changing the ids from lcl to gi, and re-run formatdb, everything works correctly:

===========================================================================
Query= test
         (612 letters)

Database: uniprot_sprot.fasta
           466,739 sequences; 165,389,953 total letters

Searching..................................................done

                                                                 Score    E
Sequences producing significant alignments:                      (bits) Value

sp|Q4U9M9|104K_THEAN 104 kDa microneme/rhoptry antigen OS=Theile...   421   e-117
sp|P15711|104K_THEPA 104 kDa microneme/rhoptry antigen OS=Theile...   265   6e-70
sp|Q2SPQ2|CHED_HAHCH Probable chemoreceptor glutamine deamidase ...    33   4.2

>sp|Q4U9M9|104K_THEAN 104 kDa microneme/rhoptry antigen OS=Theileria annulata GN=TA08425
           PE=3 SV=1
          Length = 893

 Score =  421 bits (1083), Expect = e-117,   Method: Compositional matrix adjust.
 Identities = 201/209 (96%), Positives = 201/209 (96%)

Query: 1   VHKVVEGDIVIWENEEMPLYTCAIVTQNEVPYMAYVELLEDPDLIFFLKEGDQWAPIPED 60
           VHKVVEGDIVIWENEEMPLYTCAIVTQNEVPYMAYVELLEDPDLIFFLKEGDQWAPIPED
Sbjct: 72  VHKVVEGDIVIWENEEMPLYTCAIVTQNEVPYMAYVELLEDPDLIFFLKEGDQWAPIPED 131

Query: 61  QYLAXXXXXXXXIHTESFFSLNLSFQHENYKYEMVSSFQHSIKMVVFTPKNGHICKMVYD 120
           QYLA        IHTESFFSLNLSFQHENYKYEMVSSFQHSIKMVVFTPKNGHICKMVYD
Sbjct: 132 QYLARLQQLRQQIHTESFFSLNLSFQHENYKYEMVSSFQHSIKMVVFTPKNGHICKMVYD 191

Query: 121 KNIRIFKALYNEYVTSVIGFFRGLKLLLLNIFVIDDRGMIGNKYFQLLDDKYAPISVQGY 180
           KNIRIFKALYNEYVTSVIGFFRGLKLLLLNIFVIDDRGMIGNKYFQLLDDKYAPISVQGY
Sbjct: 192 KNIRIFKALYNEYVTSVIGFFRGLKLLLLNIFVIDDRGMIGNKYFQLLDDKYAPISVQGY 251

Query: 181 VATIPKLKDFAEPYHPIILDISDIDYVNF 209
           VATIPKLKDFAEPYHPIILDISDIDYVNF
Sbjct: 252 VATIPKLKDFAEPYHPIILDISDIDYVNF 280

============================================================================

To my mind, this is a bug in formatdb, but NCBI don't see it that way.
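
For anyone repeating the workaround, here is a rough Python sketch of the id change. It assumes the deflines start with '>lcl|' followed by an integer, as in Bill's example below; the real files may differ. The function name lcl_to_gi is made up for illustration, and everything after the id prefix is left untouched.

```python
import re

# Assumption: deflines look like ">lcl|12345 ..." (integer local id).
LCL_PREFIX = re.compile(r"^>lcl\|(\d+)")


def lcl_to_gi(line):
    """Rewrite a leading '>lcl|N' as '>gi|N'; any other line passes through."""
    return LCL_PREFIX.sub(r">gi|\1", line, count=1)


def convert(lines):
    """Apply lcl_to_gi to every line of a FASTA file (deflines and sequence)."""
    return [lcl_to_gi(line) for line in lines]
```

The converted file still has to be run through formatdb again before searching.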

--Russell

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org]
> On Behalf Of bill at genenformics.com
> Sent: Tuesday, 19 May 2009 12:20 p.m.
> To: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Uniprot/Swiss accessions?
> 
> Hi, Smithies,
> 
> Using an integral local id should work as well.
> 
> A defline will look like '>lcl|12345 ...'
> 
> Bill
> 
> > Hi guys,
> > Thanx for your suggestions.
> >
> > With the magic of awk and comm, I split the amalgamated accessions and
> > created lists of swissprot IDs for both the file from NCBI and the file
> > from Uniprot.
> >
> > sp_ncbi_accessions.txt          458,377 ids
> > sp_uniprot_accessions.txt       466,739 ids
> >
> > * The NCBI file has 95 ids that don't appear in the Uniprot list
> > * The Uniprot file has 8,457 ids that don't appear in the NCBI list
> > * There are 458,282 ids that appear on both lists.
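
The awk and comm comparison described here amounts to three set operations. A minimal Python sketch, assuming each file holds one accession per line (the helper names read_ids and compare_id_lists are illustrative, not part of any library):

```python
def read_ids(path):
    """Read one id per line, skipping blank lines."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def compare_id_lists(ncbi_ids, uniprot_ids):
    """Return the ids unique to each list and the ids shared by both."""
    ncbi = set(ncbi_ids)
    uniprot = set(uniprot_ids)
    return {
        "ncbi_only": ncbi - uniprot,      # in the NCBI list only
        "uniprot_only": uniprot - ncbi,   # in the Uniprot list only
        "both": ncbi & uniprot,           # in both lists
    }
```

With the two files named above this would be called as compare_id_lists(read_ids("sp_ncbi_accessions.txt"), read_ids("sp_uniprot_accessions.txt")). The reported counts are self-consistent: 458,377 - 95 and 466,739 - 8,457 both give 458,282.
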
> >
> > I did a quick random sample of the 8,457 ids unique to Uniprot and none
> > could be found in the "protein" database at NCBI but all were in the
> > "gene" database as "reference sequences that belong to a specific genome
> > build" and all belonged to recently sequenced bacterial genomes. As none
> > are in the "protein" database, they don't have GI numbers.
> >
> > The 95 ids that were at NCBI but not in Uniprot were usually (random
> > sample again) described as "putative protein" (or "very putative protein"
> > in one case) and are the result of gene predictions, e.g.
> > http://www.ncbi.nlm.nih.gov/protein/48429254
> >
> >
> > So what I'll do is use the NCBI database and add in the extra 8,457 ids
> > unique to Uniprot and assign them fake GI numbers so I can formatdb them
> > with the "-o T" option.
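
One way the fake GI numbers could be assigned, sketched in Python. Assumptions: the Uniprot deflines start with '>sp|' as in the BLAST output at the top of this message, and the starting number (900000001 in the usage note, purely hypothetical) is higher than any real GI in the NCBI file. The function name add_fake_gis is illustrative, and whether formatdb accepts the result has not been checked here.

```python
def add_fake_gis(lines, start):
    """Prefix each '>sp|' defline with 'gi|N|', counting up from start.

    Sequence lines and any other deflines are passed through unchanged.
    """
    next_gi = start
    out = []
    for line in lines:
        if line.startswith(">sp|"):
            # ">sp|ACC|NAME desc" becomes ">gi|N|sp|ACC|NAME desc"
            out.append(">gi|%d|%s" % (next_gi, line[1:]))
            next_gi += 1
        else:
            out.append(line)
    return out
```

For example, add_fake_gis(lines, 900000001) numbers the extra entries consecutively from 900000001.
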
> >
> >
> > Thanx again for your help,
> >
> >
> >
> > Russell Smithies
> > Bioinformatics Applications Developer
> > T +64 3 489 9085
> > E  russell.smithies at agresearch.co.nz
> > Invermay  Research Centre
> > Puddle Alley,
> > Mosgiel,
> > New Zealand
> > T  +64 3 489 3809
> > F  +64 3 489 9174
> > www.agresearch.co.nz
> >
> >
> > Toitu te whenua, Toitu te tangata
> > Sustain the land, Sustain the people
> >
> >
> >
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l