Indexing RefSeq

Mon Oct 21 14:00:35 UTC 2002

I'm having all sorts of problems working with the latest release of RefSeq, due to a change in the way the files are being laid out.

In older releases of RefSeq the LOCUS identifier was the same as the accession number (e.g. NM_0123456), but in the latest version the LOCUS identifier is the gene identifier, and these aren't unique in the database!

This means that when I run dbiflat (even using -idformat REFSEQ) I get a load of warnings about duplicate entries, and when I later try to use the database I find that many entries are inaccessible because of the duplicated IDs.

For example, accessions NM_134265, NM_134264 and NM_015626 all have the ID WSB1.
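
To see how widespread the problem is, something like the following could list every LOCUS name that is shared by more than one accession. This is only a rough sketch in Python and not part of EMBOSS: it assumes GenBank-style flat files in which each entry has a LOCUS line followed by an ACCESSION line, the function name is made up for illustration, and the sample records are cut down to the relevant lines (EXAMPLE1 / XX_000001 is a hypothetical placeholder entry).

```python
from collections import defaultdict


def find_duplicate_locus_ids(lines):
    """Return {locus_name: [accessions]} for LOCUS names used by more than one entry."""
    accessions_by_locus = defaultdict(list)
    current_locus = None
    for line in lines:
        if line.startswith("LOCUS"):
            fields = line.split()
            # The second field of the LOCUS line is the entry name.
            current_locus = fields[1] if len(fields) > 1 else None
        elif line.startswith("ACCESSION") and current_locus is not None:
            fields = line.split()
            if len(fields) > 1:
                # The first accession on the line is the primary one.
                accessions_by_locus[current_locus].append(fields[1])
            # Ignore any further ACCESSION lines until the next LOCUS line.
            current_locus = None
    return {
        locus: accessions
        for locus, accessions in accessions_by_locus.items()
        if len(accessions) > 1
    }


# Cut-down sample records (assumption: real entries carry many more lines).
SAMPLE = """\
LOCUS       WSB1
ACCESSION   NM_134265
//
LOCUS       WSB1
ACCESSION   NM_134264
//
LOCUS       EXAMPLE1
ACCESSION   XX_000001
//
LOCUS       WSB1
ACCESSION   NM_015626
//
""".splitlines()

for locus, accessions in find_duplicate_locus_ids(SAMPLE).items():
    print(locus, ", ".join(accessions))
```

For a real release, the list of lines could be replaced by an open file handle, since the function only iterates over lines.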

How can I get dbiflat to index with the accession number as its primary identifier, so I don't lose entries when indexing them?

Thanks

Simon

PS This actually looks like a mistake by the RefSeq curators - I mean, who thought that having a non-unique primary sequence identifier was a good idea?!

--
Simon Andrews PhD
Bioinformatics Dept
The Babraham Institute

simon.andrews at bbsrc.ac.uk
+44 (0)1223 496463