[EMBOSS] Long Refseq accessions and dbifasta/seqret bug??
    Rob Pollock 
    rob.pollock at gmail.com
       
    Tue Aug 23 03:34:07 UTC 2005
    
    
  
Hi,
I download human.protein.faa from NCBI on a weekly basis and use dbifasta
to build an EMBOSS database. However, I have recently noticed that seqret
has problems finding sequences with certain accessions, even though they
are in the database. I suspect this has something to do with the new long
RefSeq accessions and the way the index is built.
I am using EMBOSS 3.0.0, btw (version 2.10.0 did the same thing).
Here is an example (I call my RefSeq protein database 'genprot' for want
of a better name).
This fails:
  % seqret 'genprot:NP_001015.1'
  Reads and writes (returns) sequences
  Error: Unable to read sequence 'genprot:NP_001015.1'
  Died: seqret terminated: Bad value for '-sequence' and no prompt
However this works:
  % seqret 'genprot:NP_001015'
  Reads and writes (returns) sequences
  Output sequence [np_001015.fasta]:
  % more np_001015.fasta
  >NP_001015.1 NP_001015.1 ribosomal protein S21 [Homo sapiens]
  MQNDAGEFVDLYVPRKCSASNRIIGAKDHASIQMNVAEVDKVTGRFNGQFKTYAICGAIR
  RMGESDDSILRLAKADGIVSKNF
If I search human.protein.faa for strings matching NP_001015 I get a
whole list:
>gi|62632744|ref|NP_001015050.1| hypothetical protein LOC200810 [Homo sapiens]
>gi|62821803|ref|NP_001015884.1| RPB11b2alpha protein [Homo sapiens]
>gi|62865862|ref|NP_001015508.1| purine-rich element binding protein G isoform B [Homo sapiens]
>gi|62865641|ref|NP_001015879.1| aurora kinase
... etc
Again, using the full accession for any one of these fails:
  % seqret 'genprot:NP_001015071.1'
  Reads and writes (returns) sequences
  Error: Unable to read sequence 'genprot:NP_001015071.1'
  Died: seqret terminated: Bad value for '-sequence' and no prompt
But, again, the truncated accession works:
  % seqret 'genprot:NP_001015071'
  Reads and writes (returns) sequences
  Output sequence [np_001015071.fasta]:
Other sequences searched by full accession work fine (as long as they're
_in_ the database!):
  % seqret 'genprot:NP_000544.1'
  Reads and writes (returns) sequences
  Output sequence [np_000544.fasta]:
I think the long accessions are causing dbifasta/seqret to spit the
dummy somehow.
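One way to check whether the failures line up with accessions that are
prefixes of longer ones is to scan the FASTA headers directly. Below is a
minimal Python sketch, assuming NCBI-style headers with a
'|ref|ACCESSION.VERSION|' field. The function names are made up for
illustration, and the script only inspects the FASTA file; it does not
reproduce how dbifasta builds its index.

```python
import re
from collections import defaultdict

# Accession and version in an NCBI-style header, e.g.
# ">gi|62632744|ref|NP_001015050.1| hypothetical protein ..."
REF_RE = re.compile(r"\|ref\|([A-Z]{2}_\d+)\.(\d+)\|")


def refseq_accessions(lines):
    """Yield (accession, version) for each header line with a ref| field."""
    for line in lines:
        if line.startswith(">"):
            match = REF_RE.search(line)
            if match:
                yield match.group(1), match.group(2)


def prefix_collisions(accessions):
    """Map each accession to the longer accessions that start with it."""
    accs = sorted(set(accessions))
    hits = defaultdict(list)
    for i, short in enumerate(accs):
        # In sorted order, strings sharing a prefix sit directly after it,
        # so we can stop at the first one that does not match.
        for longer in accs[i + 1:]:
            if not longer.startswith(short):
                break
            hits[short].append(longer)
    return dict(hits)


def report(path):
    """Print every accession in a FASTA file that prefixes a longer one."""
    with open(path) as handle:
        accs = [acc for acc, _version in refseq_accessions(handle)]
    for short, longer in sorted(prefix_collisions(accs).items()):
        print(short, "is a prefix of", ", ".join(longer))
```

Calling report('human.protein.faa') would list short accessions such as
NP_001015 alongside the long ones that begin with them, which makes it
easier to see whether the failing IDs share that pattern.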
Note: This is how I format my database:
  % dbifasta -dbname genprot -idformat ncbi -directory ~/genprot -filenames \*.faa -auto
Maybe there is a workaround at this point?? Any suggestions??
[I have a file, diff.faa, containing sequences I need that aren't in the
current release of RefSeq, hence the -filenames \*.faa bit.]
Thanks in advance
Rob Pollock.
    
    