[Bioperl-l] Bio::SeqIO can't guess the format of data from a pipe

Chris Fields cjfields at illinois.edu
Sat Aug 27 03:54:05 UTC 2011


On Aug 27, 2011, at 6:12 AM, Florent Angly wrote:

> On the topic of guessing file formats, last I checked, it was difficult to reuse the format guessed by Bio::SeqIO
> 
> For example, if I want to take sequences in any format (FASTA, FASTQ, ...), filter some of them out, and put the rest in a new file in the same format, I need to do something along these lines:
> 
>    # Open the file and let BioPerl guess its format
>    my $in = Bio::SeqIO->new( -file => $input_seqfile );
> 
>    # Have Bioperl guess the format (again) so we can use the same format for the output file
>    my $format = $in->_guess_format( $input_seqfile );
> 
>    # Open the output file (same format as the input file)
>    my $out = Bio::SeqIO->new( -file => ">".$output_seqfile, -format => $format );
> 
>    # Now do the work...
> 
> The limitation of the code above is that it is more complex than it should be and forces BioPerl to check the file format twice. My proposal would be to store the format of a file somewhere in the Bio::SeqIO object and create a new get/set method in Bio::SeqIO called format() to store or access its value.

The name of the class is the format (that's how the format modules are loaded).  We could add this as a convenience method for Bio::SeqIO (fairly easy to do, actually), but it would only make sense as a getter.  Bio::SeqIO dynamically loads the proper Bio::SeqIO::<format> module in the constructor (Bio::SeqIO::genbank, for example).  Setting the format to 'fasta' on an already-loaded Bio::SeqIO::genbank object would still give you GenBank format.
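
Since the class name carries the format, a getter can be sketched by inspecting the class of the stream object. This is only a rough illustration, not an existing BioPerl method: the helper name seqio_format_of is made up here, and the stand-in object is blessed by hand so the snippet runs without BioPerl installed.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Hypothetical helper (not part of BioPerl): recover the format name from
# the class an object is blessed into, e.g. Bio::SeqIO::genbank -> 'genbank'.
sub seqio_format_of {
    my ($stream) = @_;
    my $class = blessed($stream);
    return undef unless defined $class;    # not an object at all
    return $class =~ /^Bio::SeqIO::(\w+)$/ ? $1 : undef;
}

# Stand-in object so the sketch runs on its own; a real stream would come
# from Bio::SeqIO->new( -file => $input_seqfile ).
my $in = bless {}, 'Bio::SeqIO::genbank';
print seqio_format_of($in), "\n";
```

With a real stream, the returned name could be passed as -format when opening the output file, so the format would only be guessed once.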

> The idea would be that the example code above could be rewritten as:
> 
>    # Open the file and let BioPerl guess its format
>    my $in = Bio::SeqIO->new( -file => $input_seqfile );
> 
>    # Retrieve the format guessed by BioPerl
>    my $format = $in->format( );
> 
>    # Open the output file using the same format as the input file
>    my $out = Bio::SeqIO->new( -file => ">".$output_seqfile, -format => $format );
> 
>    # Now do the work...
> 
> I think this is more elegant since it is more readable, requires less computation (the file format is guessed only once), and is more consistent with other Bio::SeqIO methods like alphabet, which guesses the alphabet but also has a get/set method to access it.
> 
> Florent

Guessing the alphabet for the vast majority of sequence data isn't quite as complex and quixotic as guessing a sequence format. Formats are far more variable, and their number keeps growing, much like standards (ex: http://xkcd.com/927/).

Not that sequences aren't capable of change...

chris


