Indexing RefSeq
simon andrews (BI)
simon.andrews at bbsrc.ac.uk
Mon Oct 21 14:00:35 UTC 2002
I'm having all sorts of problems working with the latest release of RefSeq, due to a change in the way the files are laid out.
In older releases of RefSeq the LOCUS identifier was the same as the accession number (e.g. NM_0123456), but in the latest version the LOCUS identifier is the gene identifier, and these aren't unique in the database!
This means that when I run dbiflat (even using -idformat REFSEQ) I get a load of warnings about duplicate entries, and when I later try to use the database I find that a load of entries are inaccessible because only one entry can be retrieved for each duplicated ID.
For example, accessions NM_134265, NM_134264 and NM_015626 all have the ID WSB1.
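To see how widespread the clash is, something like the following could list every LOCUS name shared by more than one record. This is only a minimal sketch: it assumes GenBank-style flat files in which each record has a LOCUS line followed by an ACCESSION line, the function name duplicate_locus_names is made up for illustration, and the sample records are abbreviated stand-ins built from the accessions above rather than real file content. It does not change how dbiflat indexes anything.

```python
from collections import defaultdict


def duplicate_locus_names(lines):
    """Return {locus_name: [accessions]} for LOCUS names used by 2+ records.

    Hypothetical helper: assumes each record has a LOCUS line followed by
    an ACCESSION line, with the identifier as the second field.
    """
    accessions_by_locus = defaultdict(list)
    locus = None
    for line in lines:
        fields = line.split()
        if line.startswith("LOCUS") and len(fields) > 1:
            locus = fields[1]
        elif line.startswith("ACCESSION") and len(fields) > 1 and locus is not None:
            # Record only the primary (first) accession of this record.
            accessions_by_locus[locus].append(fields[1])
            locus = None
    return {name: accs for name, accs in accessions_by_locus.items() if len(accs) > 1}


# Abbreviated, illustrative records (not real RefSeq file content).
records = [
    "LOCUS       WSB1", "ACCESSION   NM_134265", "//",
    "LOCUS       WSB1", "ACCESSION   NM_134264", "//",
    "LOCUS       WSB1", "ACCESSION   NM_015626", "//",
]

for name, accs in duplicate_locus_names(records).items():
    print(name, ", ".join(accs))  # prints: WSB1 NM_134265, NM_134264, NM_015626
```

For a real release file, the open file handle can be passed in place of the records list, since the function only iterates over lines.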
How can I get dbiflat to index with the accession number as its primary identifier so I don't lose entries when indexing them?
Thanks
Simon
PS: This actually looks like a mistake by the RefSeq curators. I mean, who thought that having a non-unique primary sequence identifier was a good idea?!
--
Simon Andrews PhD
Bioinformatics Dept
The Babraham Institute
simon.andrews at bbsrc.ac.uk
+44 (0)1223 496463