[Bioperl-l] Bio::SeqIO issue

Thu Aug 6 15:01:05 UTC 2009

I'm not sure what version we have. Cornel may have installed it a while
ago from CVS:

Module id = Bio::Root::Build
    CPAN_USERID  CJFIELDS (Christopher Fields <cjfields at bioperl.org>)
    CPAN_VERSION 1.006000
    INST_FILE    /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Build.pm
    INST_VERSION 1.006900
cpan> m Bio::Root::Version
Module id = Bio::Root::Version
    CPAN_USERID  CJFIELDS (Christopher Fields <cjfields at bioperl.org>)
    CPAN_VERSION 1.006000
    INST_FILE    /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Version.pm
    INST_VERSION 1.006900
cpan> m Bio::SeqIO
Module id = Bio::SeqIO
    CPAN_USERID  CJFIELDS (Christopher Fields <cjfields at bioperl.org>)
    CPAN_VERSION 1.006000
    INST_FILE    /usr/lib/perl5/site_perl/5.8.8/Bio/SeqIO.pm
    INST_VERSION undef

Cornel still has the checked-out "bioperl-live" directory and the last
changes are from March this year.

As for why he used 'Fasta' instead of 'fasta' as the format parameter in
Bio::SeqIO: that is what the module's manual says. He has now tried
'fasta' instead and sees no change in behavior. When the format
parameter is omitted altogether, FASTA-formatted sequence continues to be
treated correctly, with the first (header) line removed. Raw sequence,
however, is treated differently: the first line is no longer removed.
Instead, the program returns only the first line. In the example I am
going to forward in my next message, that yields 60 amino acids out of a
raw sequence of 300 aa. Can't win with raw sequence...
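
To illustrate why a header-less file loses its first line when it is read
as 'fasta', here is a minimal pure-Perl sketch of the chunking Chris
describes in the quoted text below (records split on "\n>", first line of
each chunk taken as the header). It is a simplified model under that
assumption, not the actual Bio::SeqIO::fasta code, and split_fasta_like()
is a hypothetical helper name.

```perl
use strict;
use warnings;

# Simplified model of a FASTA-style reader; NOT the real Bio::SeqIO::fasta.
# split_fasta_like() is a hypothetical helper used only for illustration.
sub split_fasta_like {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $/ = "\n>";              # one record per read, as described in the thread
    my @records;
    while (defined(my $chunk = <$fh>)) {
        $chunk =~ s/\n>\z//;       # drop the trailing record separator
        $chunk =~ s/\A>//;         # drop a leading '>' if there is one
        my ($header, $seq) = split /\n/, $chunk, 2;
        $seq = '' unless defined $seq;
        $seq =~ s/\s+//g;          # join the remaining lines into one sequence
        push @records, { header => $header, seq => $seq };
    }
    close $fh;
    return @records;
}

# Raw input without a '>' line: the first sequence line becomes the "header".
my @raw = split_fasta_like("MKVLAAGIVG\nLLLAQPSWAA\nTTGHHEELRK\n");
print "header: $raw[0]{header}\n";
print "seq:    $raw[0]{seq}\n";
```

In this model the first 10 residues end up in the header field and are
missing from the sequence, which matches what we see when raw files are
declared as fasta.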

The files may have been created on different platforms, but we didn't
notice any difference between files created on Windows and files created
on Linux.
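
One way to check the line-ending question directly, rather than by
observation, is to count the terminators in each input file. The sketch
below is plain Perl and does not use BioPerl; count_line_endings() is a
hypothetical helper name, and the file path is taken from the command
line.

```perl
use strict;
use warnings;

# count_line_endings() is a hypothetical helper, not part of BioPerl.
# It counts Windows (CRLF), Unix (LF) and old Mac (CR) line terminators.
sub count_line_endings {
    my ($data) = @_;
    my %count = (crlf => 0, lf => 0, cr => 0);
    # "\r\n" is listed first so a CRLF pair is counted once, not as CR + LF.
    while ($data =~ /(\r\n|\n|\r)/g) {
        my $eol = $1;
        if    ($eol eq "\r\n") { $count{crlf}++ }
        elsif ($eol eq "\n")   { $count{lf}++ }
        else                   { $count{cr}++ }
    }
    return \%count;
}

# Read in raw mode so that no I/O layer translates the line endings.
my $path = shift @ARGV;
if (defined $path) {
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;                       # slurp the whole file
    my $c = count_line_endings(scalar <$fh>);
    close $fh;
    print "CRLF: $c->{crlf}  LF: $c->{lf}  CR: $c->{cr}\n";
}
```

A file with only CR terminators would look like a single line to a reader
that splits on "\n", so a non-zero CR count would be worth reporting.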

Thanks
Uwe

-----Original Message-----
From: Hilmar Lapp [mailto:hlapp at gmx.net] 
Sent: Wednesday, August 05, 2009 6:54 PM
To: Chris Fields
Cc: Hilgert, Uwe; BioPerl List
Subject: Re: [Bioperl-l] Bio::SeqIO issue

I don't think that can be the problem. If anything, providing the
format ought to give better results than not providing it.

Uwe - I'd like you to go back to Chris' initial questions that you
haven't answered yet: "What version of bioperl are you using, OS,
etc?  What does your data look like?" I'd add to that: can you show us
your full script, or a smaller code snippet that reproduces the problem?

I suspect that either something in your script is swallowing the line,  
or that the line endings in your data file are from a different OS  
than the one you're running the script on. (Or that you are running a  
very old version of BioPerl, which is entirely possible if you  
installed through CPAN.)

	-hilmar

On Aug 5, 2009, at 5:37 PM, Chris Fields wrote:

> Uwe,
>
> Please keep replies on the list.
>
> It's very possible that's the issue; IIRC the fasta parser pulls out  
> the full sequence in chunks (based on local $/ = "\n>") and splits  
> the header off as the first line in that chunk.  You could probably  
> try leaving the format out and letting SeqIO guess it, or passing  
> the file into Bio::Tools::GuessSeqFormat directly, but it's probably  
> better to go through the files and add a file extension that  
> corresponds to the format.
>
> chris
>
> On Aug 5, 2009, at 4:23 PM, Hilgert, Uwe wrote:
>
>> Thanks, Chris. The files have no extension, but we indicate what format
>> to use, like in the manual:
>>
>> $in  = Bio::SeqIO->new(-file => "file_path", -format => 'Fasta');
>>
>> I wonder now whether exactly this could cause the problem: as we are
>> telling SeqIO that the input files are in fasta format, they are being
>> treated as such (= first line removed), regardless of whether they
>> really are fasta?
>>
>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>> Uwe Hilgert, Ph.D.
>> Dolan DNA Learning Center
>> Cold Spring Harbor Laboratory
>>
>> C: (516) 857-1693
>> V: (516) 367-5185
>> E: hilgert at cshl.edu
>> F: (516) 367-5182
>> W: http://www.dnalc.org
>>
>> -----Original Message-----
>> From: Chris Fields [mailto:cjfields at illinois.edu]
>> Sent: Wednesday, August 05, 2009 5:04 PM
>> To: Hilgert, Uwe
>> Cc: bioperl-l at lists.open-bio.org
>> Subject: Re: [Bioperl-l] Bio::SeqIO issue
>>
>> On Aug 5, 2009, at 3:27 PM, Hilgert, Uwe wrote:
>>
>>> Is my impression correct that Bio::SeqIO just assumes that sequences
>>> are being submitted in FASTA format?
>>
>> No. See:
>>
>> http://www.bioperl.org/wiki/HOWTO:SeqIO
>>
>> SeqIO tries to guess at the format using the file extension, and if
>> one isn't present makes use of Bio::Tools::GuessSeqFormat.  It's
>> possible that the extension is causing the problem, or that
>> GuessSeqFormat is guessing wrong (it's apt to do that, as it's forced
>> to guess).  In any case, it's always advisable to explicitly indicate
>> the format when possible.
>>
>> Relevant lines:
>>
>>   return 'fasta'   if /\.(fasta|fast|fas|seq|fa|fsa|nt|aa|fna|faa)$/i;
>> ...
>>   return 'raw'     if /\.(txt)$/i;
>>
>>> In our experience, implementing Bio::SeqIO led to the first line of
>>> files being cut off, regardless of whether the files were indeed
>>> fasta files or files that only contained sequence.
>>
>> Files that only contain sequence are 'raw'.  Ones in FASTA are 'fasta'.
>>
>>> Which, in the latter, led to sequence submissions that had the first
>>> line of nucleotides removed. Has anyone tried to write a fix for this?
>>
>> This sounds like a bug, but we have very little to go on beyond your
>> description.  What version of bioperl are you using, OS, etc?  What
>> does your data look like?  File extension?
>>
>> chris
>>
>>> Thanks,
>>>
>>> Uwe
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================