[Bioperl-l] Bio::SeqIO can't guess the format of data from a pipe

Thu Aug 25 21:04:15 UTC 2011

On Aug 25, 2011, at 1:52 PM, J.J. Emerson wrote:

> Hi Chris,
> 
> You asked:
> 
> > My question (not a criticism, just trying to understand the problem): why are you going through all the trouble of using GuessSeqFormat as a permanent solution anyway?  If you have a stream returning a possibly unknown data type, I would argue that the fundamental bug is not GuessSeqFormat but something else, more specifically not knowing the behavior of the data source and the returned format to begin with.  Is something preventing that?
> 
> In my particular case, I'm trying not to impose a particular usage scenario onto the script I'm writing in the hopes it will be useful (and general) to others in my lab in the future*. In my proximate case, I will certainly be able to provide SeqIO with a format argument. But insofar as GuessSeqFormat is considered desirable (and reasonable people could indeed disagree whether it is desirable) I think its applicability shouldn't hinge on whether it is guessing on a pipe or a file.
> 
> > My point is, GuessSeqFormat is fine as a temporary stop-gap, but it is not a permanent solution to your problems (it is guessing, after all).  Note the code has had very little development over the years, and the related SeqIO code hasn't aged particularly well.
> 
> I see. I wasn't aware that GuessSeqFormat was relatively neglected. Given the rather challenging nature of the more elegant fix you suggested (using the buffering of Root::IO), perhaps I should consider dropping my issue or filing it as a feature request rather than a bug?

That's fine.  I don't want to dissuade you from taking this on, either.  

> Cheers,
> 
> J.J.
> 
> PS
> 
> * The way I plan on using my script is roughly as follows:
> 
> prog1 [some arguments] \
> | myscript.pl --informat fasta \
> | prog2 \
> | prog3 > pipeline.output
> 
> However, I'd like the "--informat" switch to be optional, mainly to increase usability for other users. For any well-considered format, the data itself carries enough information to identify the format, so providing the format a second time is somewhat redundant. In principle, being able to do the following would be very useful:
> 
> prog1 [some arguments] \
> | myscript.pl \
> | prog2 > pipeline.output
> 
> The modularity of pipelining is very valuable and this is what caused me to anticipate a usage scenario that involved both GuessSeqFormat and reading from a pipe.

Not disagreeing with you at all; flexible code is best.
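
In the meantime, one possible workaround for the pipe case is to buffer STDIN yourself, guess from the buffered text, and then hand SeqIO an in-memory filehandle. The following is only a minimal sketch under some assumptions: the whole input fits in memory, and the helper names (slurp_stream, memory_handle, seqio_from_stream) are made up for illustration. It is not the Root::IO buffering fix discussed earlier, just a stop-gap along the same lines.

```perl
use strict;
use warnings;

# NOTE: all three helper names below are hypothetical, for illustration only.

# Read an entire (possibly unseekable) stream, such as a pipe, into memory
# so the same bytes can be used for guessing and then for parsing.
sub slurp_stream {
    my ($fh) = @_;
    local $/;                       # undef record separator: slurp mode
    my $data = <$fh>;
    return defined $data ? $data : '';
}

# Open a read-only, seekable in-memory filehandle over a string.
sub memory_handle {
    my ($data) = @_;
    open(my $mem, '<', \$data)
        or die "Cannot open in-memory handle: $!";
    return $mem;
}

# Return a Bio::SeqIO reader for $fh. If $format is defined it is used
# as given; otherwise the stream is buffered and the format is guessed.
sub seqio_from_stream {
    my ($fh, $format) = @_;

    # Loaded lazily so the helpers above work without BioPerl installed.
    require Bio::SeqIO;
    return Bio::SeqIO->new(-fh => $fh, -format => $format)
        if defined $format;

    require Bio::Tools::GuessSeqFormat;
    my $data  = slurp_stream($fh);
    my $guess = Bio::Tools::GuessSeqFormat->new(-text => $data)->guess;
    die "Could not guess the input format; please pass --informat\n"
        unless defined $guess;

    return Bio::SeqIO->new(-fh => memory_handle($data), -format => $guess);
}
```

In myscript.pl you would then call something like seqio_from_stream(\*STDIN, $informat), with $informat left undefined when the switch is omitted, and loop over next_seq as usual. The obvious cost is that the guessing path holds the entire stream in memory, so for large inputs an explicit --informat is still the safer choice.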

chris