[EMBOSS] Counting the number of sequences in a file

Peter biopython at maubp.freeserve.co.uk
Tue Jul 20 12:27:42 EDT 2010


Hi all,

Is there a tool in EMBOSS to just count the number of sequences in a file?
For simple file formats like FASTA or GenBank I'd typically just use grep:

$ grep -c "^LOCUS " gbvrt1.seq
31065

However, this becomes more complicated for general file formats (e.g. FASTQ
files, where quality lines as well as identifier lines can start with @) or
binary files like BAM, which EMBOSS now supports.
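
To illustrate the FASTQ problem, here is a minimal Python sketch (not an
EMBOSS tool; the function name count_fastq_records is made up for this
example). It assumes the usual FASTQ layout of an @ title line, one or more
sequence lines, a + line, and quality lines totalling the same length as the
sequence. Tracking those lengths is what lets it ignore quality lines that
happen to start with @.

```python
def count_fastq_records(lines):
    """Count FASTQ records in an iterable of text lines.

    Hypothetical helper for illustration only. It tracks sequence and
    quality lengths, so a quality line starting with '@' is not
    mistaken for the start of a new record.
    """
    count = 0
    it = iter(lines)
    for line in it:
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines between records
        if not line.startswith("@"):
            raise ValueError("expected '@' title line, got: %r" % line)

        # Read (possibly wrapped) sequence lines up to the '+' separator.
        seq_len = 0
        for line in it:
            line = line.rstrip("\r\n")
            if line.startswith("+"):
                break
            seq_len += len(line)
        else:
            raise ValueError("truncated record: no '+' line found")

        # Read quality lines until they cover the whole sequence.
        qual_len = 0
        while qual_len < seq_len:
            try:
                line = next(it)
            except StopIteration:
                raise ValueError("truncated record: quality too short")
            qual_len += len(line.rstrip("\r\n"))
        if qual_len != seq_len:
            raise ValueError("quality length does not match sequence length")

        count += 1
    return count
```

With a file handle this would be used as count_fastq_records(open(name)),
whereas a plain grep -c "^@" would over-count any file in which a quality
line begins with @.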

Right now I could handle this by using seqret to convert the file into FASTA
and then piping that through grep to count the records. But an EMBOSS tool
would be more elegant, e.g.

$ countseq -sformat=genbank gbvrt1.seq
31065

For the implementation, you might offer a choice between the normal EMBOSS
parsing (as in seqret) and format-specific regular expression searches that
just look for marker lines without checking validity. The latter should be
really fast.
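
As a rough sketch of the marker-line idea (in Python rather than as EMBOSS
code; the MARKERS table and the count_by_marker name are my own inventions
for illustration), something like this would do for formats where every
record starts with an unambiguous line:

```python
import re

# One start-of-record pattern per format. These only spot marker lines;
# they do not check that the file is valid. FASTQ is deliberately left
# out because '@' can also start a quality line.
MARKERS = {
    "fasta": re.compile(rb"^>"),
    "genbank": re.compile(rb"^LOCUS "),
    "embl": re.compile(rb"^ID "),
}


def count_by_marker(path, fmt):
    """Count lines in the file at 'path' matching the marker for 'fmt'.

    Raises KeyError if 'fmt' has no entry in MARKERS.
    """
    pattern = MARKERS[fmt]
    with open(path, "rb") as handle:
        return sum(1 for line in handle if pattern.match(line))
```

For example, count_by_marker("gbvrt1.seq", "genbank") does the same job as
the grep command above. Formats without a safe marker line would still need
the full parser.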

Regards,

Peter C.

