[Biopython] File format autodetection.

Ivan Gregoretti ivangreg at gmail.com
Wed Jun 25 03:14:16 UTC 2014


Hello everybody.

After considering all contributions, for our case I have settled on a
solution based on first_line.startswith(">").

With that, I can now handle without exceptions "@" FASTQ, ">" FASTA
and the tricky "" empty file that is neither FASTQ nor FASTA (or
perhaps both at the same time).
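
A minimal sketch of that check follows. It is not the actual fxcounttags.py
code: the function name guess_format and the labels it returns are made up
for illustration, and it assumes the handle yields text lines. An empty
stream gives an empty first line, which matches neither prefix, so it is
reported as None instead of raising.

```python
import io


def guess_format(handle):
    """Guess FASTA vs FASTQ from the first line of a text handle.

    Returns a (format_name, first_line) pair. format_name is None when
    the stream is empty or starts with an unrecognised character. The
    first line is handed back because a read-once stream such as stdin
    cannot be rewound after it has been read.
    """
    first_line = handle.readline()  # "" when the stream is empty
    if first_line.startswith(">"):
        return "fasta", first_line
    if first_line.startswith("@"):
        return "fastq", first_line
    return None, first_line


if __name__ == "__main__":
    # Hypothetical inputs: one FASTA record, one FASTQ record, empty stream.
    for text in (">seq1\nACGT\n", "@read1\nACGT\n+\nIIII\n", ""):
        fmt, _ = guess_format(io.StringIO(text))
        print(repr(text[:6]), "->", fmt)
```

The caller still has to deal with the consumed first line itself, for
example by counting it before looping over the rest of the handle.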

Peter, great catch, empty files are common in our work.

Thank you all.

Ivan



Ivan Gregoretti, PhD
Bioinformatics


On Tue, Jun 24, 2014 at 5:50 PM, Rohan Sachdeva <rsachdev at usc.edu> wrote:
> Seems like you could just do this in pure python?
>
> import sys
>
> # Inspect only the first line of the stream, then stop.
> for line in sys.stdin:
>     if line[0] == '>':
>         print('fasta')
>     elif line[0] == '@':
>         print('fastq')
>     break
>
>
>
> --------------------------------
> Rohan Sachdeva
> @archaeaologist
> Doctoral Candidate, John Heidelberg Laboratory
> Marine Environmental Biology
> University of Southern California
> Los Angeles, CA
> rsachdev at usc.edu
> (213) 740-4748
>
>
> On Tue, Jun 24, 2014 at 12:40 PM, Ivan Gregoretti <ivangreg at gmail.com>
> wrote:
>>
>> Thank you Lenna. It works now. It took me a little while to find that
>> I was expected to pass a file handle to UndoHandle().
>>
>> Ivan
>>
>>
>>
>> Ivan Gregoretti, PhD
>> Bioinformatics
>>
>>
>>
>> On Tue, Jun 24, 2014 at 2:59 PM, Lenna Peterson <arklenna at gmail.com>
>> wrote:
>> >
>> >
>> >
>> > On Tue, Jun 24, 2014 at 2:00 PM, Ivan Gregoretti <ivangreg at gmail.com>
>> > wrote:
>> >>
>> >> Indeed, the STDIN stream is the challenge. That is why I thought that
>> >> the question was worth documenting in the Biopython list.
>> >>
>> >> Would anybody mind showing how peekline() is used? I tried using it on
>> >> a SeqIO.parse generator but I get an error:
>> >>
>> >> AttributeError: 'generator' object has no attribute 'peekline'
>> >
>> >
>> > peekline() is a method of UndoHandle, not the generator.
>> >
>> > Cheers,
>> >
>> > Lenna
>> >
>> >
>> >>
>> >>
>> >> I am using Biopython 1.61 and Python 2.7.3 on linux 64bit.
>> >>
>> >> Thank you,
>> >>
>> >> Ivan
>> >>
>> >>
>> >>
>> >>
>> >> Ivan Gregoretti, PhD
>> >> Bioinformatics
>> >>
>> >>
>> >> On Tue, Jun 24, 2014 at 1:41 PM, Fields, Christopher J
>> >> <cjfields at illinois.edu> wrote:
>> >> > On Jun 24, 2014, at 11:54 AM, Peter Cock <p.j.a.cock at googlemail.com>
>> >> > wrote:
>> >> >
>> >> >> Hi Ivan,
>> >> >>
>> >> >> Biopython's SeqIO does not (and will not) do automatic file
>> >> >> format detection, it is just too hard to get right so instead
>> >> >> that's the user's task:
>> >> >>
>> >> >> Zen of Python: Explicit is better than implicit.
>> >> >> http://legacy.python.org/dev/peps/pep-0020/
>> >> >>
>> >> >> (BioPerl's SeqIO can do format guessing)
>> >> >
>> >> > (somewhat)
>> >> >
>> >> > You are welcome to try it, but Bio::Tools::GuessSeqFormat is IMHO one
>> >> > of the misbegotten step-children of Bioperl; if you delve into it,
>> >> > you’ll find it also tries to guess whether something is a sequence or
>> >> > an alignment file. My general feeling is that if you don’t know the
>> >> > source of your data (and from that the format) then there is only so
>> >> > much we can do to help. Doing so from STDIN is even trickier.
>> >> >
>> >> > So, it’s there, it works in most cases so we keep it around, but
>> >> > caveat emptor.  We don’t really maintain that module beyond very
>> >> > routine bug fixes.
>> >> >
>> >> >> Your use case is one which highlights a technical reason
>> >> >> why this is hard - you are using stdin, a read-once handle.
>> >> >> You cannot peek at the file, guess the format, seek back to
>> >> >> the beginning, and then give the handle to a specific parser.
>> >> >>
>> >> >> You could use Biopython's UndoHandle here, but it will
>> >> >> impose a (modest) performance overhead.
>> >> >>
>> >> >> from Bio.File import UndoHandle
>> >> >> help(UndoHandle)
>> >> >>
>> >> >> e.g. Use the .peekline() method to spot FASTA vs FASTQ?
>> >> >>
>> >> >> Peter
>> >> >
>> >> > That seems like a pretty reasonable option.
>> >> >
>> >> > chris
>> >> >
>> >> >> On Tue, Jun 24, 2014 at 5:16 PM, Ivan Gregoretti
>> >> >> <ivangreg at gmail.com>
>> >> >> wrote:
>> >> >>> Hello Biopythoneers,
>> >> >>>
>> >> >>> The question:
>> >> >>>
>> >> >>> What is the strategy currently used for file format autodetection?
>> >> >>>
>> >> >>>
>> >> >>> The context:
>> >> >>>
>> >> >>> I have written a command line program that gets a stream of FASTQ
>> >> >>> data and reports how many records are contained. You can visualise
>> >> >>> it like this
>> >> >>>
>> >> >>> zcat myfile.fq.gz | fxcounttags.py -i /dev/stdin -o /dev/stdout > myfile.counts
>> >> >>>
>> >> >>> That works fine for FASTQ but I need to extend the functionality to
>> >> >>> FASTA streams. How would you write fxcounttags.py to detect
>> >> >>> FASTQ/FASTA?
>> >> >>>
>> >> >>> Thank you,
>> >> >>>
>> >> >>> Ivan
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> Ivan Gregoretti, PhD
>> >> >>> Bioinformatics
>> >> >>> _______________________________________________
>> >> >>> Biopython mailing list  -  Biopython at mailman.open-bio.org
>> >> >>> http://mailman.open-bio.org/mailman/listinfo/biopython
>> >> >
>> >>
>> >
>> >
>>
>
>


