[EMBOSS] notseq and fasta definition headers
Peter Rice
pmr at ebi.ac.uk
Tue Jun 17 20:28:47 UTC 2008
Andres Pinzon wrote:
> The output is correct, but notseq changes the definition in the fasta
> headers, so if the fasta header in "xaa.list.fasta" was:
>
> lcl|29855|ORF26673_6
>
> the corresponding fasta header in sequence in 1000-1.fasta is:
>
> 29855
>
> Is there a way to tell "notseq" to keep the original fasta headers intact?
Yes.
FASTA format is not simple. We have seen many ways to hide extra
information in the ID (EMBOSS recognizes NCBI ID formats, which is why it
parses out the ID 29855) and also in the description (we try to recognize
conventions used by GCG and ACEDB).
But you can also specify "pearson" format, which reads the ID without
parsing. Just add to the command line:
  notseq -sf pearson
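
As a rough illustration of the difference between the two readings, here
is a toy Python sketch. It is not EMBOSS code: the function names and the
NCBI_TAGS set are assumptions made for this example, and the real parser
handles many more cases than the single "lcl" tag shown here.

```python
# Toy illustration only; EMBOSS's real FASTA parser is far more involved.
# NCBI_TAGS and both function names are hypothetical.
NCBI_TAGS = {"lcl"}  # only the tag that appears in this thread


def id_ncbi_style(header: str) -> str:
    """Mimic NCBI-aware parsing: pull the identifier out of 'tag|id|...'."""
    token = header.lstrip(">").split()[0]
    fields = token.split("|")
    if len(fields) >= 2 and fields[0] in NCBI_TAGS:
        return fields[1]
    return token


def id_pearson_style(header: str) -> str:
    """Mimic 'pearson' reading: first word of the header, kept verbatim."""
    return header.lstrip(">").split()[0]


header = ">lcl|29855|ORF26673_6"
print(id_ncbi_style(header))     # 29855
print(id_pearson_style(header))  # lcl|29855|ORF26673_6
```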
Now you have another problem: excluding by the full, unparsed ID will not
work for notseq.
The exclude string in notseq is a pattern. In processing the pattern,
some pattern characters are removed:
  whitespace
  ',' and ';'
  '|'
So your exclude pattern cannot include any '|' characters.
As a workaround, you can exclude "*ORF26673_6" and the IDs will be
preserved.
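
To see why the full ID fails as a pattern while the wildcard works, here
is a small Python sketch of the behaviour described above. It is a
simplified model under stated assumptions, not notseq's implementation:
the helper names are made up, and Python's fnmatch stands in for whatever
wildcard matching notseq actually uses.

```python
import re
from fnmatch import fnmatchcase


def clean_pattern(pattern: str) -> str:
    """Drop whitespace, ',', ';' and '|' from a pattern (hypothetical helper)."""
    return re.sub(r"[\s,;|]", "", pattern)


def is_excluded(seq_id: str, pattern: str) -> bool:
    """Match the cleaned pattern against the unparsed ID; '*' is a wildcard."""
    return fnmatchcase(seq_id, clean_pattern(pattern))


seq_id = "lcl|29855|ORF26673_6"  # ID as read with -sf pearson

print(clean_pattern("lcl|29855|ORF26673_6"))        # lcl29855ORF26673_6
print(is_excluded(seq_id, "lcl|29855|ORF26673_6"))  # False
print(is_excluded(seq_id, "*ORF26673_6"))           # True
```

Note that a pattern like "*ORF26673_6" would also match any other ID that
ends in ORF26673_6, so it is worth checking that the suffix is unique in
your file.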
For the next release we will allow '|' characters. When notseq was first
written it was possible to use regular expressions, but now we only use
simple text matching, so the pipe characters are no longer a problem.
Hope that helps
Peter