[Bioperl-l] Windows bug in Bio::DB::Fasta?

Lincoln Stein lstein at cshl.edu
Mon Aug 22 18:18:07 EDT 2005


I've just looked into this. The bug occurs when Windows opens the FASTA file 
in text mode rather than binary mode. In text mode the "\r\n" sequence is 
invisibly mapped to "\n" during readline operations, so Bio::DB::Fasta 
thinks that it is dealing with a Unix-format file. Then, when the module 
tries to seek() to the computed byte offset, Windows doesn't do the line-end 
mapping, so it seeks to the wrong offset.  (sound of hairs being pulled)
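
To make the drift concrete, here is a minimal sketch of the offset arithmetic. 
It is not the module's own code: base_offset() and the numbers fed to it are 
made up for illustration, and the start of the sequence data is held fixed 
for simplicity.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration, not Fasta.pm code):
# byte offset of the 0-based base $i when the sequence data starts at byte
# $start and each line holds $per_line bases followed by $term_len
# terminator bytes.
sub base_offset {
    my ($start, $per_line, $term_len, $i) = @_;
    return $start + $i + $term_len * int($i / $per_line);
}

# The index was built believing lines end in "\n" (1 byte) ...
my $assumed = base_offset(7, 60, 1, 150);
# ... but on disk every line really ends in "\r\n" (2 bytes).
my $actual  = base_offset(7, 60, 2, 150);

printf "assumed=%d actual=%d drift=%d\n",
    $assumed, $actual, $actual - $assumed;
```

The seek falls short by one byte for every line break that precedes the 
requested base, so the error grows the further into the sequence you ask.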

I've fixed the problem by explicitly calling binmode() on all filehandles that 
Bio::DB::Fasta opens. The new version of Fasta.pm is in both bioperl CVS and 
the gbrowse 1.63 CVS version. It ought to fix Chris' GC content weirdness.
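
For anyone hitting the same thing in their own scripts, the idea looks roughly 
like this. It is a sketch only, not the actual patch to Fasta.pm: the 
temporary file, its contents, and the variable names are assumptions made for 
the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a tiny CRLF-terminated FASTA file. binmode keeps "\r\n" exactly as
# written on every platform.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} ">seq1\r\n", "ACGT\r\n", "TTGA\r\n";
close $out or die "close failed: $!";

open my $in, '<', $path or die "cannot open $path: $!";
binmode $in;    # no CRLF translation on read, so readline and seek agree

my $header     = <$in>;
my $term_len   = $header =~ /\r\n$/ ? 2 : 1;   # same test as the Fasta.pm line quoted below
my $data_start = tell $in;                     # first byte of sequence data

# Fetch the 0-based base 6 of "ACGTTTGA", with 4 bases per line.
my $i      = 6;
my $offset = $data_start + $i + $term_len * int($i / 4);
seek $in, $offset, 0 or die "seek failed: $!";
read $in, my $base, 1;
close $in;

print "term_len=$term_len base=$base\n";
```

With binmode in place the "\r" is still there when the regex runs, so the 
terminator length comes out as 2 and the seek lands on the right byte.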

Lincoln

On Monday 15 August 2005 01:22 pm, Scott Cain wrote:
> Just to follow up on my own email with a little more information: in
> Fasta.pm, line 697:
>
>   $termination_length ||= /\r\n$/ ? 2 : 1;  # account for crlf-terminated Windows files
>
> The pattern match is failing on DOS formatted files; I don't know why.
> Does anyone else?
>
> On Mon, 2005-08-15 at 10:35 -0400, Scott Cain wrote:
> > Hello all,
> >
> > I am investigating a bug in GBrowse that seems to only surface when
> > people are using the memory (ie, file) adaptor on Windows systems.
> > Here's the bug report:
> >
> > https://sourceforge.net/tracker/?func=detail&atid=391291&aid=1256169&group_id=27707
> >
> > I've tracked the problem down to Bio::DB::Fasta. When the file is DOS
> > formatted (that is, each line ends with a carriage return followed by
> > a line feed), BDF returns the wrong string when a subsequence is
> > requested, but when the file is Unix formatted (that is, each line
> > ends with a line feed only), it returns the right string.  I wrote the
> > very simple test script below and stepped it
> > through the perl debugger.  It looks like the bug is in the caloffset
> > method, as it returns the same offsets regardless of the file type,
> > which then makes the subsequent seek into the file go to the wrong
> > coordinates of dos formatted files.
> >
> > Unfortunately, I don't really know what is going on in caloffset, so I
> > don't know how to fix it, but it presumably has to check the format of
> > the file somewhere and take that into account.
> >
> > Thanks,
> > Scott

-- 
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
FOR URGENT MESSAGES & SCHEDULING, 
PLEASE CONTACT MY ASSISTANT, 
SANDRA MICHELSEN, AT michelse at cshl.edu

