[Bioperl-l] Refseq Version
Smithies, Russell
Russell.Smithies at agresearch.co.nz
Tue Feb 9 21:12:27 UTC 2010
At a rough guess, I'd say they've merged sequences to remove redundancy, i.e. where two or more sequences from different species are identical, they've added only a single sequence to the BLAST database and put all the IDs in the FASTA header, the same way the NR database is built.
I don't have the refseq_protein database at the moment, but you could check by dumping it as FASTA and grepping for the Ctrl-A character that separates the merged descriptions in a FASTA header.
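As a rough illustration of that check, here is a minimal Python sketch that counts how many FASTA headers carry more than one ID. It assumes the dump really does use Ctrl-A (byte 0x01) between merged descriptions; the function name and the sample records are made up for the example.

```python
from typing import Iterable, Tuple

CTRL_A = "\x01"  # assumed separator between merged descriptions in one header


def count_merged_headers(lines: Iterable[str], sep: str = CTRL_A) -> Tuple[int, int, int]:
    """Return (sequences, merged_sequences, total_ids) for lines of FASTA text."""
    sequences = merged = total_ids = 0
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence data, not a header
        sequences += 1
        ids_here = line.count(sep) + 1  # one ID plus one per separator
        total_ids += ids_here
        if ids_here > 1:
            merged += 1
    return sequences, merged, total_ids


# Hypothetical two-record dump: the first header holds two merged entries.
sample = [
    ">gi|1|ref|NP_000001.1| protein A [species one]\x01gi|2|ref|NP_000002.1| protein A [species two]\n",
    "MKVLA\n",
    ">gi|3|ref|NP_000003.1| protein B [species three]\n",
    "MAAGT\n",
]
print(count_merged_headers(sample))
```

If the guess about merging is right, total_ids for the real dump should be noticeably larger than the sequence count that fastacmd reports. To run it on a real dump, pass an open file handle instead of the sample list.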
--Russell
From: shalu sharma [mailto:sharmashalu.bio at gmail.com]
Sent: Tuesday, 9 February 2010 5:38 a.m.
To: Smithies, Russell
Cc: bioperl-l at lists.open-bio.org
Subject: Re: [Bioperl-l] Refseq Version
Thanks a lot, Russell.
But I am still confused. I asked the server admin, and he said that the one I am using is RefSeq's latest version.
But the number of sequences I am getting from the BLAST report does not match the RefSeq 38 release (or I don't know which numbers to match).
For example, from the BLAST report I am getting:
$ fastacmd -I -d /db/ncbiblast/refseq/refseq_protein
Database: NCBI Protein Reference Sequences
7,585,993 sequences; 2,644,770,521 total letters
And when I look at the RefSeq release notes, I don't understand which numbers to match these against, because I don't see these numbers in the release notes.
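One way to put the two sets of numbers side by side is sketched below in Python. It parses the fastacmd summary line shown above and compares its "total letters" with the amino acid total for release 38 from the list quoted later in this thread. That this is the closest comparable figure is an assumption: the "records" total in the release statistics appears to cover nucleotide as well as protein records, so it cannot be matched against a protein-only sequence count. The function and constant names are made up for the example.

```python
import re
from typing import Tuple


def parse_db_summary(line: str) -> Tuple[int, int]:
    """Parse 'N sequences; M total letters' into (N, M) as integers."""
    match = re.search(r"([\d,]+)\s+sequences;\s+([\d,]+)\s+total letters", line)
    if match is None:
        raise ValueError(f"unrecognised summary line: {line!r}")
    sequences, letters = (int(g.replace(",", "")) for g in match.groups())
    return sequences, letters


# Amino acid total for release 38, copied from the list quoted in this thread.
RELEASE_38_AMINO_ACIDS = 3_115_246_540

sequences, letters = parse_db_summary("7,585,993 sequences; 2,644,770,521 total letters")
fraction = letters / RELEASE_38_AMINO_ACIDS
print(f"{sequences:,} sequences, {letters:,} letters")
print(f"letters as a fraction of release 38 amino acids: {fraction:.3f}")
```

A fraction below 1 would fit the idea that identical sequences were merged when the BLAST database was built, but it does not prove it, and the database date (Jan 30, 2010) suggests it may correspond to a later release than 38.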
Thanks a lot, I really appreciate your help.
Thanks
Shalu
On Sun, Feb 7, 2010 at 4:05 PM, Smithies, Russell <Russell.Smithies at agresearch.co.nz> wrote:
AAArrrgg, what is it with Outlook this morning!!!
Formatting kaput again but I'm sure you can work it out from there!
--Russell
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org]
> On Behalf Of Smithies, Russell
> Sent: Monday, 8 February 2010 9:59 a.m.
> To: 'shalu sharma'
> Cc: 'bioperl-l at lists.open-bio.org'
> Subject: Re: [Bioperl-l] Refseq Version
>
> I should have known it would break the formatting :-(
>
> Try this:
>
> Release 1:June 30, 2003;Release Size: 4672871949 bases, 263588685 amino
> acids, 1061675 records
> Release 2:October 21, 2003;Release Size: 2124 organisms, 7745398573
> nucleotide bases, 286957682 amino acids, 1097404 records
> Release 3:January 13, 2004;Release Size: 2218 organisms, 7992741222
> nucleotide bases, 294647847 amino acids, 1101244 records
> Release 4:March 24, 2004;Release Size: 2358 organisms, 8175128887
> nucleotide bases, 318253841 amino acids, 1193457 records
> Release 5:May 2, 2004;Release Size: 2395 organisms, 8325515623 nucleotide
> bases, 337229387 amino acids, 1255613 records
> Release 6:July 5, 2004;Release Size: 2467 organisms, 8696371716 nucleotide
> bases, 365446682 amino acids, 1367206 records
> Release 7:September 12, 2004;Release Size: 2558 organisms, 21072808460
> nucleotide bases, 405233619 amino acids, 1579579 records
> Release 8:October 31, 2004;Release Size: 2645 organisms, 26814386658
> nucleotide bases, 430300369 amino acids, 1709723 records
> Release 9:January 9, 2005;Release Size: 2780 organisms, 36786975473
> nucleotide bases, 470534907 amino acids, 1843944 records
> Release 10:March 6, 2005;Release Size:2827 organisms, 36893741150
> nucleotide bases, 482862858 amino acids, 1893478 records
> Release 11:May 8, 2005;Release Size:2928 organisms, 39731702362 nucleotide
> bases, 507980644 amino acids, 2477893 records
> Release 12:July 10, 2005;Release Size:2969 organisms, 43043256058
> nucleotide bases, 608493108 amino acids, 2869675 records
> Release 13:September 11, 2005;Release Size:3060 organisms, 44727484853
> nucleotide bases, 686768902 amino acids, 3400773 records
> Release 14:November 20, 2005;Release Size:3198 organisms, 47364955367
> nucleotide bases, 763761075 amino acids, 3272776 records
> Release 15:January 1, 2006;Release Size:3244 organisms, 52645441913
> nucleotide bases, 810009733 amino acids, 3436263 records
> Release 16:March 11, 2006;Release Size:3397 organisms, 56175443059
> nucleotide bases, 887509001 amino acids, 3715260 records
> Release 17:May 1, 2006;Release Size:3497 organisms, 62130037371 nucleotide
> bases, 927587669 amino acids, 3999859 records
> Release 18:July 11, 2006;Release Size:3695 organisms, 70474041999
> nucleotide bases, 974374765 amino acids, 4186692 records
> Release 19:September 10, 2006;Release Size: 3774 organisms, 70694879544
> nucleotide bases, 1012985077 amino acids, 4311543 records
> Release 20:November 5, 2006;Release Size:3919 organisms, 72679681505
> nucleotide bases, 1061797276 amino acids, 4567569 records
> Release 21:January 6, 2007;Release Size:4079 organisms, 73864990566
> nucleotide bases, 1144795927 amino acids, 4742335 records
> Release 22:March 5, 2007;Release Size:4187 organisms, 82441128546
> nucleotide bases, 1215085694 amino acids, 5207865 records
> Release 23:May 8, 2007;Release Size:4300 organisms, 83148327110 nucleotide
> bases, 1291050995 amino acids, 5503385 records
> Release 24:July 10, 2007;Release Size:4511 organisms, 89856995521
> nucleotide bases, 1365916222 amino acids, 6073814 records
> Release 25:September 11, 2007;Release Size:4646 organisms, 91265840843
> nucleotide bases, 1470475398 amino acids, 6515132 records
> Release 26:November 4, 2007;Release Size:4737 organisms, 99105705485
> nucleotide bases, 1495032507 amino acids, 6698250 records
> Release 27:January 6, 2008;Release Size:4926 organisms, 101059552113
> nucleotide bases, 1556356987 amino acids, 7025715 records
> Release 28:March 9, 2008;Release Size: 5059 organisms, 102051350525
> nucleotide bases, 1770627427 amino acids, 7914560 records
> Release 29:May 4, 2008;Release Size:5168 organisms, 104671101150
> nucleotide bases, 1870214220 amino acids, 8376141 records
> Release 30:July 7, 2008;Release Size:5395 organisms, 105074486709
> nucleotide bases, 1913447691 amino acids, 8572852 records
> Release 31:August 30, 2008;Release Size: 5513 organisms, 109214348591
> nucleotide bases, 2026768719 amino acids, 9145702 records
> Release 32:November 10, 2008;Release Size: 5726 organisms, 111122203221
> nucleotide bases, 2089596746 amino acids, 9501764 records
> Release 33:January 16, 2009;Release Size:7773 organisms, 116001583818
> nucleotide bases, 2204073443 amino acids, 10325282 records
> Release 34:March 6, 2009;Release Size: 8054 organisms, 111792574830
> nucleotide bases, 2299682138 amino acids, 10021870 records
> Release 35:May 4, 2009;Release Size: 8393 organisms, 113210655336
> nucleotide bases, 2565199170 amino acids, 10993891 records
> Release 36:July 2, 2009;Release Size: 8665 organisms, 117013741530
> nucleotide bases, 2756884219 amino acids, 12141825 records
> Release 37:September 3, 2009;Release Size: 9005 organisms, 119151229820
> nucleotide bases, 2965450333 amino acids, 12941750 records
> Release 38:November 7, 2009;Release Size: 9166 organisms, 119196622435
> nucleotide bases, 3115246540 amino acids, 13436447 records
>
>
>
> > -----Original Message-----
> > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org]
> > On Behalf Of Smithies, Russell
> > Sent: Monday, 8 February 2010 9:47 a.m.
> > To: 'shalu sharma'
> > Cc: 'bioperl-l at lists.open-bio.org'
> > Subject: Re: [Bioperl-l] Refseq Version
> >
> > Release 39 was Jan 30 and according to the README releases only come out
> > in odd months (January, March, May, July, September, November)
> > The stats file is here:
> > ftp://ftp.ncbi.nih.gov/refseq/release/release-statistics/RefSeq-release39.01232010.stats.txt
> >
> > The numbers of sequences in the fasta release and the pre-built blast
> > databases seem to differ, but I guess only NCBI can explain that.
> > I can't see any way of extracting the release number from the pre-built
> > blast databases (apart from the build date), but it might be worth asking
> > NCBI if they'd include the information in future releases.
> >
> >
> > FYI, here's the old release stats.
> > (I wget'ed and grep'ed all the stats files)
> >
> > Release  Date    Year  Organisms  Nucleotide Bases    Amino Acids     Records
> > 1        Jun-30  2003          -     4,672,871,949    263,588,685   1,061,675
> > 2        Oct-21  2003      2,124     7,745,398,573    286,957,682   1,097,404
> > 3        Jan-13  2004      2,218     7,992,741,222    294,647,847   1,101,244
> > 4        Mar-24  2004      2,358     8,175,128,887    318,253,841   1,193,457
> > 5        May-02  2004      2,395     8,325,515,623    337,229,387   1,255,613
> > 6        Jul-05  2004      2,467     8,696,371,716    365,446,682   1,367,206
> > 7        Sep-12  2004      2,558    21,072,808,460    405,233,619   1,579,579
> > 8        Oct-31  2004      2,645    26,814,386,658    430,300,369   1,709,723
> > 9        Jan-09  2005      2,780    36,786,975,473    470,534,907   1,843,944
> > 10       Mar-06  2005      2,827    36,893,741,150    482,862,858   1,893,478
> > 11       May-08  2005      2,928    39,731,702,362    507,980,644   2,477,893
> > 12       Jul-10  2005      2,969    43,043,256,058    608,493,108   2,869,675
> > 13       Sep-11  2005      3,060    44,727,484,853    686,768,902   3,400,773
> > 14       Nov-20  2005      3,198    47,364,955,367    763,761,075   3,272,776
> > 15       Jan-01  2006      3,244    52,645,441,913    810,009,733   3,436,263
> > 16       Mar-11  2006      3,397    56,175,443,059    887,509,001   3,715,260
> > 17       May-01  2006      3,497    62,130,037,371    927,587,669   3,999,859
> > 18       Jul-11  2006      3,695    70,474,041,999    974,374,765   4,186,692
> > 19       Sep-10  2006      3,774    70,694,879,544  1,012,985,077   4,311,543
> > 20       Nov-05  2006      3,919    72,679,681,505  1,061,797,276   4,567,569
> > 21       Jan-06  2007      4,079    73,864,990,566  1,144,795,927   4,742,335
> > 22       Mar-05  2007      4,187    82,441,128,546  1,215,085,694   5,207,865
> > 23       May-08  2007      4,300    83,148,327,110  1,291,050,995   5,503,385
> > 24       Jul-10  2007      4,511    89,856,995,521  1,365,916,222   6,073,814
> > 25       Sep-11  2007      4,646    91,265,840,843  1,470,475,398   6,515,132
> > 26       Nov-04  2007      4,737    99,105,705,485  1,495,032,507   6,698,250
> > 27       Jan-06  2008      4,926   101,059,552,113  1,556,356,987   7,025,715
> > 28       Mar-09  2008      5,059   102,051,350,525  1,770,627,427   7,914,560
> > 29       May-04  2008      5,168   104,671,101,150  1,870,214,220   8,376,141
> > 30       Jul-07  2008      5,395   105,074,486,709  1,913,447,691   8,572,852
> > 31       Aug-30  2008      5,513   109,214,348,591  2,026,768,719   9,145,702
> > 32       Nov-10  2008      5,726   111,122,203,221  2,089,596,746   9,501,764
> > 33       Jan-16  2009      7,773   116,001,583,818  2,204,073,443  10,325,282
> > 34       Mar-06  2009      8,054   111,792,574,830  2,299,682,138  10,021,870
> > 35       May-04  2009      8,393   113,210,655,336  2,565,199,170  10,993,891
> > 36       Jul-02  2009      8,665   117,013,741,530  2,756,884,219  12,141,825
> > 37       Sep-03  2009      9,005   119,151,229,820  2,965,450,333  12,941,750
> > 38       Nov-07  2009      9,166   119,196,622,435  3,115,246,540  13,436,447
> >
> >
> >
> > --Russell
> >
> >
> > From: shalu sharma [mailto:sharmashalu.bio at gmail.com]
> > Sent: Saturday, 6 February 2010 3:56 a.m.
> > To: Smithies, Russell
> > Cc: bioperl-l at lists.open-bio.org
> > Subject: Re: [Bioperl-l] Refseq Version
> >
> > Hi Russell,
> > Thanks for your response.
> > I am getting the number of sequences in the database but not the release
> > number (like 38, 39).
> > This is what I did:
> >
> > $ fastacmd -I -d /db/ncbiblast/refseq/refseq_protein
> > Database: NCBI Protein Reference Sequences
> > 7,585,993 sequences; 2,644,770,521 total letters
> >
> > File names:
> > /db/ncbiblast/refseq/refseq_protein.00
> > Date: Jan 30, 2010 8:34 PM Version: 4 Longest sequence: 36,805 res
> > /db/ncbiblast/refseq/refseq_protein.01
> > Date: Jan 30, 2010 8:34 PM Version: 4 Longest sequence: 33,403 res
> > /db/ncbiblast/refseq/refseq_protein.02
> > Date: Jan 30, 2010 8:34 PM Version: 4 Longest sequence: 15,830 res
> >
> > I am still confused about how I can get the release number. I know RefSeq 39
> > was released on Jan 30, 2010, but I don't know how to confirm this. I also
> > tried looking at the RefSeq release file but was not able to find anything.
> >
> > I would really appreciate it if anyone could help me out with this.
> >
> > Thanks
> > Shalu
> >
> > On Thu, Feb 4, 2010 at 6:39 PM, Smithies, Russell
> > <Russell.Smithies at agresearch.co.nz> wrote:
> > If you have access to the blast database, use fastacmd -I -d databasename
> > Otherwise, it's usually at the bottom of your blast result.
> >
> > --Russell
> >
> > > -----Original Message-----
> > > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org]
> > > On Behalf Of shalu sharma
> > > Sent: Friday, 5 February 2010 11:02 a.m.
> > > To: bioperl-l at lists.open-bio.org
> > > Subject: [Bioperl-l] Refseq Version
> > >
> > > Hi All,
> > > This is not a BioPerl query.
> > > Is there any way to check the RefSeq version (release)? I am using a
> > > server to BLAST my sequences (blastall) against RefSeq. Is there any way
> > > I can get the version information for the RefSeq database (from the BLAST
> > > file or directly from the database)?
> > >
> > > Thanks
> > > Shalu
> > > _______________________________________________
> > > Bioperl-l mailing list
> > > Bioperl-l at lists.open-bio.org
> > > http://lists.open-bio.org/mailman/listinfo/bioperl-l