[Bioperl-l] Bio::SeqIO can't guess the format of data from a pipe

J.J. Emerson jj.emerson at gmail.com
Thu Aug 25 01:53:38 UTC 2011


Hello All,

I have experienced some behavior in SeqIO that doesn't seem to be what I
would expect. Basically, for a certain script, if I try to pass something
like "-fh => \*STDIN" to Bio::SeqIO->new(), it will fail if both of the
following two conditions are met simultaneously:

   1. STDIN is coming from a pipe;
   2. SeqIO is trying to guess the format.

If STDIN comes from redirection instead of a pipe, or if the format is
specified manually (i.e. BioPerl doesn't have to guess), the error doesn't
occur.

This issue has been reported previously:

http://lists.open-bio.org/pipermail/bioperl-l/2010-July/033681.html
https://redmine.open-bio.org/issues/3122

This issue ultimately comes down to calling seek() on a pipe, which is not
possible because pipes are not seekable (see below). To be clear, there are
kludgy ways around this that allow BioPerl to take input from a pipe AND
guess the format. My naive and inefficient kludge was to test whether the
script is reading from STDIN and whether a format was specified. If it is
reading from STDIN and no format was given, I slurp STDIN into a variable,
open a filehandle on that variable, and pass it to SeqIO. SeqIO can then
guess the format, because the fh is no longer opened on a pipe, and does its
usual SeqIO thing, at the expense of having the program pass over the data
at least twice. And if the input is huge, it could consume all available
memory. A better way to address the problem would be to process the input
one line at a time, but this seems to require more extensive changes.
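
For concreteness, the slurp-and-reopen kludge described above might look
roughly like the following. This is only a minimal sketch in core Perl, not
necessarily the code in the attached script: seekable_fh is a hypothetical
helper name (it is not part of BioPerl), and it decides whether to slurp by
checking whether the handle is a plain file rather than by checking whether a
format was given.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): return a filehandle that
# supports seek(). Plain files are returned untouched; anything else
# (pipe, socket, terminal) is slurped and reopened as an in-memory handle.
sub seekable_fh {
    my ($fh) = @_;

    # -f on a filehandle is true only for plain files, which are seekable.
    return $fh if -f $fh;

    # Slurp the whole input. This holds everything in memory at once,
    # which is the cost mentioned above for very large inputs.
    my $data = do { local $/; <$fh> };
    $data = '' unless defined $data;

    # Open a read-only in-memory filehandle on the slurped data.
    open(my $mem_fh, '<', \$data)
        or die "Cannot open in-memory filehandle: $!";
    return $mem_fh;
}

# Assumed usage with BioPerl (not executed in this sketch):
#   my $in = Bio::SeqIO->new(-fh => seekable_fh(\*STDIN));
#   while (my $seq = $in->next_seq) { ... }
```

The handle returned for a pipe is a GLOB reference that can be rewound, so the
format guesser's seek() back to the start should succeed on it.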

The reason I'm reposting this is that I think the inability to guess the
sequence format of data arriving through a pipe is an important limitation
in a fundamental part of BioPerl. For scripts designed to be used in
pipelines, it limits BioPerl's pipelineability substantially. Even though
this has been reported before and a bug was opened and closed, I was
wondering whether anyone thinks this is worth fixing, so as to make SeqIO
(and probably AlignIO as well?) more flexible?

Does anyone think this should be refiled as a bug?

Cheers,

J.J.

PS

Below are snippets of code and/or errors related to reproducing the failure
to guess unspecified formats. I'll see how Mailman treats my attachments and
post the code as a reply if they don't work.

The bioperl_fhtest.pl attachment is the script that reproduces the error.
w.fa is a FASTA file containing some sequences (in my case, the w gene from
different *Drosophila* species).

Here are the command lines that generate the behavior I observe:

    ./bioperl_fhtest.pl fasta < w.fa        # Works (redirection, no guessing)
    ./bioperl_fhtest.pl < w.fa              # Works (redirection, guessing)

    cat w.fa | ./bioperl_fhtest.pl fasta    # Works (pipe, no guessing)
    cat w.fa | ./bioperl_fhtest.pl          # DOESN'T work (pipe, guessing)


Here's the error I get in the last case:

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Failed resetting the filehandle; IO error occurred
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/local/share/perl/5.10.1/Bio/Root/Root.pm:472
STACK: Bio::Tools::GuessSeqFormat::guess /usr/local/share/perl/5.10.1/Bio/Tools/GuessSeqFormat.pm:512
STACK: Bio::SeqIO::new /usr/local/share/perl/5.10.1/Bio/SeqIO.pm:381
STACK: ./bioperl_fhtest.pl:8
-----------------------------------------------------------

From what I gather, the error is triggered by a failure of seek() on the
STDIN fh on lines 517-518 (text from the version of GuessSeqFormat.pm
installed on my server):

    512     if (defined $self->{-file}) {
    513         # Close the file we opened.
    514         close($fh);
    515     } elsif (ref $fh eq 'GLOB') {
    516         # Try seeking to the start position.
    517         seek($fh, $start_pos, 0) || $self->throw("Failed resetting the ".
    518                                         "filehandle; IO error occurred");;
    519     } elsif (defined $fh && $fh->can('setpos')) {
    520         # Seek to the start position.
    521         $fh->setpos($start_pos);
    522     }
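
The seek() failure itself can be reproduced without BioPerl. The following is
a small sketch in core Perl under a few assumptions: can_rewind is a
hypothetical helper name, the pipe() call stands in for "cat w.fa | script",
and the in-memory handle stands in for a seekable input such as a redirected
file.

```perl
use strict;
use warnings;

# Hypothetical helper: report whether $fh can be rewound with seek().
sub can_rewind {
    my ($fh) = @_;
    return seek($fh, 0, 0) ? 1 : 0;
}

# Case 1: the read end of a pipe, standing in for piped STDIN.
pipe(my $reader, my $writer) or die "pipe failed: $!";
print {$writer} ">seq1\nACGT\n";
close($writer) or die "close failed: $!";
my $pipe_ok  = can_rewind($reader);
my $pipe_err = "$!";    # capture the OS error text right after seek()

# Case 2: an in-memory handle, standing in for a seekable input.
my $data = ">seq1\nACGT\n";
open(my $mem_fh, '<', \$data) or die "open failed: $!";
my $mem_ok = can_rewind($mem_fh);

printf "pipe:      %s\n", $pipe_ok ? 'seek ok' : "seek failed ($pipe_err)";
printf "in-memory: %s\n", $mem_ok  ? 'seek ok' : 'seek failed';
```

On a pipe, seek() returns false, which is the condition that makes the
"|| $self->throw(...)" on line 517 fire.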
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bioperl_fhtest.pl
Type: text/x-perl-script
Size: 505 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/bioperl-l/attachments/20110824/9a20f472/attachment-0004.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: w.fa
Type: application/octet-stream
Size: 6335 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/bioperl-l/attachments/20110824/9a20f472/attachment-0004.obj>

