[Open-bio-l] FASTQ identifiers
Charles Plessy
charles-listes+open-bio at plessy.org
Sun Aug 2 01:25:37 UTC 2009
On Fri, Jul 31, 2009 at 10:15:57AM +0100, Peter wrote:
> The situation is similar to the FASTA format (and others), in that there
> are a number of reasonably well documented conventions in use
> (e.g. the NCBI FASTA identifiers with | characters). However, equally,
> there are thousands of ad hoc local conventions.
Hello,
I would just like to mention one such ad hoc convention in use at my workplace:
with FASTQ sequences we sometimes replace the original name with the sequence
itself. This can be useful, for instance, to troubleshoot sequence
manipulations.
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+EAS54_6_R1_2_1_413_324
;;3;;;;;;;;;;;;7;;;;;;;88
becomes:
@CCCTTCTTGTCTTCAGCGTTTCTCC
CCCTTCTTGTCTTCAGCGTTTCTCC
+CCCTTCTTGTCTTCAGCGTTTCTCC
;;3;;;;;;;;;;;;7;;;;;;;88
and after some arbitrary trimming at the ends:
@CCCTTCTTGTCTTCAGCGTTTCTCC
TTCTTGTCTTCAGCGTTTCT
+CCCTTCTTGTCTTCAGCGTTTCTCC
;;;;;;;;;;;;7;;;;;;;
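The renaming and trimming shown above could be sketched as follows. This is only a minimal illustration: the (name, sequence, quality) tuple layout and the function names rename_to_sequence, trim and format_fastq are assumptions made for this example, not part of any existing toolkit.

```python
def rename_to_sequence(record):
    """Replace the identifier of a FASTQ record with its own sequence.

    `record` is a (name, sequence, quality) tuple; this layout is an
    assumption made for the sketch, not an existing library type.
    """
    _name, seq, qual = record
    return (seq, seq, qual)


def trim(record, left, right):
    """Remove `left` bases from the start and `right` from the end.

    The name is left untouched, so the untrimmed read stays visible in it.
    """
    name, seq, qual = record
    end = len(seq) - right
    return (name, seq[left:end], qual[left:end])


def format_fastq(record):
    """Render a record as four FASTQ lines, repeating the name after '+'."""
    name, seq, qual = record
    return "@{0}\n{1}\n+{0}\n{2}\n".format(name, seq, qual)


read = ("EAS54_6_R1_2_1_413_324",
        "CCCTTCTTGTCTTCAGCGTTTCTCC",
        ";;3;;;;;;;;;;;;7;;;;;;;88")
renamed = rename_to_sequence(read)
trimmed = trim(renamed, 3, 2)  # 3 bases off the start, 2 off the end
print(format_fastq(trimmed), end="")
```

Because the name still carries the full original read, comparing it with the sequence line shows directly what a trimming step removed.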
With FASTA format, we sometimes eliminate redundant sequences and record how
many times they occurred by adding the count to the name.
For instance:
>seq1
AAATTT
>seq2
AAATAT
>seq3
AAATTT
becomes:
>AAATTT_2
AAATTT
>AAATAT_1
AAATAT
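A function for this collapsing step might look like the sketch below. It is an illustration under stated assumptions rather than an existing API: the names read_fasta_sequences and collapse are hypothetical, the "_" separator follows the example above, and all sequences are held in memory, which may not suit very large files.

```python
from collections import Counter


def read_fasta_sequences(lines):
    """Yield the sequence of each FASTA record, joining wrapped lines."""
    chunks = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if chunks is not None:
                yield "".join(chunks)
            chunks = []
        elif line and chunks is not None:
            chunks.append(line)
    if chunks is not None:
        yield "".join(chunks)


def collapse(sequences):
    """Collapse duplicate sequences into (name, sequence) pairs.

    Each name is the sequence followed by '_' and its number of
    occurrences. Order of first appearance is kept (Python 3.7+).
    """
    counts = Counter(sequences)
    return [("{}_{}".format(seq, n), seq) for seq, n in counts.items()]


fasta = """>seq1
AAATTT
>seq2
AAATAT
>seq3
AAATTT
"""
collapsed = collapse(read_fasta_sequences(fasta.splitlines()))
for name, seq in collapsed:
    print(">{}\n{}".format(name, seq))
```

The original names (seq1, seq2, seq3) are discarded here, as in the example above; keeping them would need an extra mapping from each sequence to its names.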
If this is popular elsewhere, it may be useful to implement functions that
allow doing this efficiently.
Have a nice day,
--
Charles Plessy
Tsurumi, Kanagawa, Japan