[Bioperl-l] Uniprot/Swiss accessions?
Smithies, Russell
Russell.Smithies at agresearch.co.nz
Tue May 19 01:44:35 UTC 2009
No, that doesn't work :-(
Here's some blast output with the database formatted with local ids:
=====================================================================
Database: uniprot_sprot.fasta
466,739 sequences; 165,389,953 total letters
Searching..................................................done
Score E
Sequences producing significant alignments: (bits) Value
sp|Q4U9M9|104K_THEAN Unknown 421 e-117
sp|P15711|104K_THEPA Unknown 265 6e-70
sp|Q2SPQ2|CHED_HAHCH Unknown 33 4.2
Score = 421 bits (1083), Expect = e-117, Method: Compositional matrix adjust.
Identities = 0/209 (0%), Positives = 0/209 (0%)
Query: 1 VHKVVEGDIVIWENEEMPLYTCAIVTQNEVPYMAYVELLEDPDLIFFLKEGDQWAPIPED 60
Query: 61 QYLAXXXXXXXXIHTESFFSLNLSFQHENYKYEMVSSFQHSIKMVVFTPKNGHICKMVYD 120
Query: 121 KNIRIFKALYNEYVTSVIGFFRGLKLLLLNIFVIDDRGMIGNKYFQLLDDKYAPISVQGY 180
Query: 181 VATIPKLKDFAEPYHPIILDISDIDYVNF 209
===========================================================================
If I tweak the fasta and change the ids from lcl to gi and re-formatdb, all works correctly:
===========================================================================
Query= test
(612 letters)
Database: uniprot_sprot.fasta
466,739 sequences; 165,389,953 total letters
Searching..................................................done
Score E
Sequences producing significant alignments: (bits) Value
sp|Q4U9M9|104K_THEAN 104 kDa microneme/rhoptry antigen OS=Theile... 421 e-117
sp|P15711|104K_THEPA 104 kDa microneme/rhoptry antigen OS=Theile... 265 6e-70
sp|Q2SPQ2|CHED_HAHCH Probable chemoreceptor glutamine deamidase ... 33 4.2
>sp|Q4U9M9|104K_THEAN 104 kDa microneme/rhoptry antigen OS=Theileria annulata GN=TA08425
PE=3 SV=1
Length = 893
Score = 421 bits (1083), Expect = e-117, Method: Compositional matrix adjust.
Identities = 201/209 (96%), Positives = 201/209 (96%)
Query: 1 VHKVVEGDIVIWENEEMPLYTCAIVTQNEVPYMAYVELLEDPDLIFFLKEGDQWAPIPED 60
VHKVVEGDIVIWENEEMPLYTCAIVTQNEVPYMAYVELLEDPDLIFFLKEGDQWAPIPED
Sbjct: 72 VHKVVEGDIVIWENEEMPLYTCAIVTQNEVPYMAYVELLEDPDLIFFLKEGDQWAPIPED 131
Query: 61 QYLAXXXXXXXXIHTESFFSLNLSFQHENYKYEMVSSFQHSIKMVVFTPKNGHICKMVYD 120
QYLA IHTESFFSLNLSFQHENYKYEMVSSFQHSIKMVVFTPKNGHICKMVYD
Sbjct: 132 QYLARLQQLRQQIHTESFFSLNLSFQHENYKYEMVSSFQHSIKMVVFTPKNGHICKMVYD 191
Query: 121 KNIRIFKALYNEYVTSVIGFFRGLKLLLLNIFVIDDRGMIGNKYFQLLDDKYAPISVQGY 180
KNIRIFKALYNEYVTSVIGFFRGLKLLLLNIFVIDDRGMIGNKYFQLLDDKYAPISVQGY
Sbjct: 192 KNIRIFKALYNEYVTSVIGFFRGLKLLLLNIFVIDDRGMIGNKYFQLLDDKYAPISVQGY 251
Query: 181 VATIPKLKDFAEPYHPIILDISDIDYVNF 209
VATIPKLKDFAEPYHPIILDISDIDYVNF
Sbjct: 252 VATIPKLKDFAEPYHPIILDISDIDYVNF 280
============================================================================
To my mind, this is a bug in formatdb but NCBI don't see it that way.
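
For anyone wanting to reproduce the workaround, here is a minimal Python sketch of the defline tweak. It assumes the deflines start with '>lcl|' followed by an integer, as in Bill's example; the function name and the sample defline are purely illustrative, and the real file may be laid out differently.

```python
import re

# Assumption: deflines look like ">lcl|12345 sp|Q4U9M9|104K_THEAN ...".
# Only a leading ">lcl|<integer>" is rewritten; everything else is kept.
_LCL_PREFIX = re.compile(r"^>lcl\|(\d+)")


def lcl_to_gi(line):
    """Rewrite a leading '>lcl|N' to '>gi|N'; return other lines unchanged."""
    return _LCL_PREFIX.sub(r">gi|\1", line)


def convert(lines):
    """Apply lcl_to_gi to every line of a FASTA file held as a list of strings."""
    return [lcl_to_gi(line) for line in lines]
```

The converted file still has to be run through formatdb again before blasting against it.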
--Russell
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of bill at genenformics.com
> Sent: Tuesday, 19 May 2009 12:20 p.m.
> To: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Uniprot/Swiss accessions?
>
> Hi, Smithies,
>
> Using an integral local id should work as well.
>
> A defline will look like '>lcl|12345 ...'
>
> Bill
>
> > Hi guys,
> > Thanx for your suggestions.
> >
> > With the magic of awk and comm, I split the amalgamated accessions and
> > created lists of swissprot IDs for both the file from NCBI and the file
> > from Uniprot.
> >
> > sp_ncbi_accessions.txt 458,377 ids
> > sp_uniprot_accessions.txt 466,739 ids
> >
> > * The NCBI file has 95 ids that don't appear in the Uniprot list
> > * The Uniprot file has 8,457 ids that don't appear in the NCBI list
> > * There are 458,282 ids that appear on both lists.
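
The three counts above are consistent with each other (95 + 458,282 = 458,377 and 8,457 + 458,282 = 466,739). The original comparison was done with awk and comm; the same comparison can be sketched with Python sets. The helper names below are hypothetical, and the sketch assumes one accession per line in each list file.

```python
def read_ids(path):
    """Read one accession per line, skipping blank lines."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def compare_ids(ncbi_ids, uniprot_ids):
    """Split two collections of accessions into NCBI-only, UniProt-only and shared sets."""
    ncbi, uniprot = set(ncbi_ids), set(uniprot_ids)
    return {
        "ncbi_only": ncbi - uniprot,
        "uniprot_only": uniprot - ncbi,
        "both": ncbi & uniprot,
    }
```

Calling compare_ids(read_ids("sp_ncbi_accessions.txt"), read_ids("sp_uniprot_accessions.txt")) would give the three sets whose sizes are listed above.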
> >
> > I did a quick random sample of the 8,457 ids unique to Uniprot and none
> > could be found in the "protein" database at NCBI but all were in the
> > "gene" database as "reference sequences that belong to a specific genome
> > build" and all belonged to recently sequenced bacterial genomes. As none
> > are in the "protein" database, they don't have GI numbers.
> >
> > The 95 ids that were at NCBI but not in Uniprot were usually (random
> > sample again) described as "putative protein" (or "very putative protein"
> > in one case) and are the result of gene predictions, e.g.
> > http://www.ncbi.nlm.nih.gov/protein/48429254
> >
> >
> > So what I'll do is use the NCBI database and add in the extra 8,457 ids
> > unique to Uniprot and assign them fake GI numbers so I can formatdb them
> > with the " -o T" option.
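
A rough Python sketch of that plan, under stated assumptions: the Uniprot deflines follow the '>sp|ACCESSION|NAME description' layout seen in the blast output, the fake GI is prepended in the NCBI-style '>gi|N|sp|...' form, and the starting number is picked by hand so it cannot collide with a real GI. The function name and its arguments are hypothetical.

```python
def add_fake_gi(fasta_lines, wanted_ids, first_gi):
    """Yield only the records whose accession is in wanted_ids,
    with a sequential fake GI prepended to each defline."""
    next_gi = first_gi
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # Assumed layout: >sp|ACCESSION|ENTRY_NAME description
            fields = line[1:].split("|")
            keep = len(fields) >= 3 and fields[1] in wanted_ids
            if keep:
                yield ">gi|%d|%s" % (next_gi, line[1:])
                next_gi += 1
        elif keep:
            # Sequence line belonging to a record we are keeping
            yield line
```

The yielded lines would be appended to the NCBI fasta file before running formatdb with "-o T".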
> >
> >
> > Thanx again for your help,
> >
> >
> >
> > Russell Smithies
> > Bioinformatics Applications Developer
> > T +64 3 489 9085
> > E russell.smithies at agresearch.co.nz
> > Invermay Research Centre
> > Puddle Alley,
> > Mosgiel,
> > New Zealand
> > T +64 3 489 3809
> > F +64 3 489 9174
> > www.agresearch.co.nz
> >
> >
> > Toitu te whenua, Toitu te tangata
> > Sustain the land, Sustain the people
> >
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >
>
>