[Biopython] File format autodetection.

Ivan Gregoretti ivangreg at gmail.com
Tue Jun 24 19:40:40 UTC 2014


Thank you Lenna. It works now. It took me a little while to find that
I was expected to pass a file handle to UndoHandle().
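
For the archive, here is a minimal sketch of the peek-then-dispatch idea
discussed in this thread. It is not the code actually used in
fxcounttags.py. To run without Biopython it uses a hypothetical
PeekableHandle class as a stand-in for the behaviour of
UndoHandle.peekline(), and the names guess_format and count_records are
made up for illustration. It assumes FASTA records start with ">" and
that FASTQ records are strictly four lines each, starting with "@".

```python
import io


class PeekableHandle:
    """Hypothetical stand-in for the peekline() idea; not Biopython's class."""

    def __init__(self, handle):
        self._handle = handle
        self._saved = []  # lines read ahead but not yet consumed

    def peekline(self):
        """Return the next line without consuming it ('' at end of stream)."""
        if not self._saved:
            line = self._handle.readline()
            if line:
                self._saved.append(line)
            return line
        return self._saved[0]

    def __iter__(self):
        return self

    def __next__(self):
        # Hand back any peeked line first, then read on from the stream.
        if self._saved:
            return self._saved.pop(0)
        line = self._handle.readline()
        if not line:
            raise StopIteration
        return line


def guess_format(first_line):
    """Guess 'fasta' or 'fastq' from the first line only (an assumption)."""
    if first_line.startswith(">"):
        return "fasta"
    if first_line.startswith("@"):
        return "fastq"
    raise ValueError("cannot tell FASTA from FASTQ: %r" % first_line)


def count_records(handle):
    """Return (format, record count) for a read-once stream such as stdin."""
    wrapped = PeekableHandle(handle)
    fmt = guess_format(wrapped.peekline())
    if fmt == "fasta":
        count = sum(1 for line in wrapped if line.startswith(">"))
    else:
        # Assumes strict four-line FASTQ records (no wrapped sequences).
        count = sum(1 for _ in wrapped) // 4
    return fmt, count


# Small in-memory demo; a real script would pass sys.stdin instead.
demo = io.StringIO(">a\nACGT\n>b\nGG\n")
print(count_records(demo))
```

With the Biopython version discussed here, the same shape should apply:
wrap the open handle in UndoHandle, call .peekline() on the wrapper, and
then pass the wrapper (not the raw handle) to SeqIO.parse() together with
the guessed format name.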

Ivan



Ivan Gregoretti, PhD
Bioinformatics



On Tue, Jun 24, 2014 at 2:59 PM, Lenna Peterson <arklenna at gmail.com> wrote:
>
>
>
> On Tue, Jun 24, 2014 at 2:00 PM, Ivan Gregoretti <ivangreg at gmail.com> wrote:
>>
>> Indeed, the STDIN stream is the challenge. That is why I thought that
>> the question was worth documenting in the Biopython list.
>>
>> Would anybody mind showing how peekline() is used? I tried using it on
>> a SeqIO.parse generator but I get an error:
>>
>> AttributeError: 'generator' object has no attribute 'peekline'
>
>
> peekline() is a method of UndoHandle, not the generator.
>
> Cheers,
>
> Lenna
>
>
>>
>>
>> I am using Biopython 1.61 and Python 2.7.3 on 64-bit Linux.
>>
>> Thank you,
>>
>> Ivan
>>
>>
>>
>>
>> Ivan Gregoretti, PhD
>> Bioinformatics
>>
>>
>> On Tue, Jun 24, 2014 at 1:41 PM, Fields, Christopher J
>> <cjfields at illinois.edu> wrote:
>> > On Jun 24, 2014, at 11:54 AM, Peter Cock <p.j.a.cock at googlemail.com>
>> > wrote:
>> >
>> >> Hi Ivan,
>> >>
>> >> Biopython's SeqIO does not (and will not) do automatic file
>> >> format detection, it is just too hard to get right so instead
>> >> that's the user's task:
>> >>
>> >> Zen of Python: Explicit is better than implicit.
>> >> http://legacy.python.org/dev/peps/pep-0020/
>> >>
>> >> (BioPerl's SeqIO can do format guessing)
>> >
>> > (somewhat)
>> >
>> > You are welcome to try it, but Bio::Tools::GuessSeqFormat is IMHO one of
>> > the misbegotten step-children of Bioperl; if you delve into it, you’ll find
>> > it also tries to guess whether something is a sequence or an alignment file.
>> > My general feeling is that if you don’t know the source of your data (and
>> > from that the format) then there is only so much we can do to help.  Doing
>> > so from STDIN is even trickier.
>> >
>> > So, it’s there, it works in most cases so we keep it around, but caveat
>> > emptor.  We don’t really maintain that module beyond very routine
>> > bug fixes.
>> >
>> >> Your use case is one which highlights a technical reason
>> >> why this is hard - you are using stdin, a read-once handle.
>> >> You cannot peek at the file, guess the format, seek back to
>> >> the beginning, and then give the handle to a specific parser.
>> >>
>> >> You could use Biopython's UndoHandle here, but it will
>> >> impose a (modest) performance overhead.
>> >>
>> >> from Bio.File import UndoHandle
>> >> help(UndoHandle)
>> >>
>> >> e.g. Use the .peekline() method to spot FASTA vs FASTQ?
>> >>
>> >> Peter
>> >
>> > That seems like a pretty reasonable option.
>> >
>> > chris
>> >
>> >> On Tue, Jun 24, 2014 at 5:16 PM, Ivan Gregoretti <ivangreg at gmail.com>
>> >> wrote:
>> >>> Hello Biopythoneers,
>> >>>
>> >>> The question:
>> >>>
>> >>> What is the strategy currently used for file format autodetection?
>> >>>
>> >>>
>> >>> The context:
>> >>>
>> >>> I have written a command line program that gets a stream of FASTQ data
>> >>> and reports how many records are contained. You can visualise it like
>> >>> this
>> >>>
>> >>> zcat myfile.fq.gz | fxcounttags.py -i /dev/stdin -o /dev/stdout > myfile.counts
>> >>>
>> >>> That works fine for FASTQ but I need to extend the functionality to
>> >>> FASTA streams. How would you write fxcounttags.py to detect
>> >>> FASTQ/FASTA?
>> >>>
>> >>> Thank you,
>> >>>
>> >>> Ivan
>> >>>
>> >>>
>> >>>
>> >>> Ivan Gregoretti, PhD
>> >>> Bioinformatics
>> >>> _______________________________________________
>> >>> Biopython mailing list  -  Biopython at mailman.open-bio.org
>> >>> http://mailman.open-bio.org/mailman/listinfo/biopython
>> >
>>
>
>


