[Bioperl-l] Bio::SeqIO issue

Thu Aug 6 05:12:13 UTC 2009

Just to confirm: the following is using bioperl-live on my macbook pro  
(perl 5.10.0, 64bit).  We need to decide if this is a legit bug or a  
user issue (if it's the former, we can easily add an exception  
indicating lack of a header).  Note that 'raw' also fails for the raw  
example below (doesn't appear to remove newlines).

-c

cjfields4:fasta cjfields$ cat raw_v_fasta.pl
#!/usr/bin/perl -w

use strict;
use warnings;
use IO::String;
use Bio::SeqIO;
use Test::More qw(no_plan);

my %seq;

$seq{raw} = <<RAW;
MWTALPLLCAGAWLLSAGATAELTVNAIEKFHFTSWMKQHQKTYSSREYSHRLQVFANNWRKIQAHNQRN
HTFKMGLNQFSDMSFAEIKHKYLWSEPQNCSATKSNYLRGTGPYPSSMDWRKKGNVVSPVKNQGACGSCW
TFSTTGALESAVAIASGKMMTLAEQQLVDCAQNFNNHGCQGGLPSQAFEYILYNKGIMGEDSYPYIGKNG
QCKFNPEKAVAFVKNVVNITLNDEAAMVEAVALYNPVSFAFEVTEDFMMYKSGVYSSNSCHKTPDKVNHA
VLAVGYGEQNGLLYWIVKNSWGSNWGNNGYFLIERGKNMCGLAACASYPIPQV
RAW

$seq{fasta} = <<FASTA;
 >CATH_RAT
MWTALPLLCAGAWLLSAGATAELTVNAIEKFHFTSWMKQHQKTYSSREYSHRLQVFANNWRKIQAHNQRN
HTFKMGLNQFSDMSFAEIKHKYLWSEPQNCSATKSNYLRGTGPYPSSMDWRKKGNVVSPVKNQGACGSCW
TFSTTGALESAVAIASGKMMTLAEQQLVDCAQNFNNHGCQGGLPSQAFEYILYNKGIMGEDSYPYIGKNG
QCKFNPEKAVAFVKNVVNITLNDEAAMVEAVALYNPVSFAFEVTEDFMMYKSGVYSSNSCHKTPDKVNHA
VLAVGYGEQNGLLYWIVKNSWGSNWGNNGYFLIERGKNMCGLAACASYPIPQV
FASTA

my %newdata;
for my $input (sort keys %seq) {
     my $fh = IO::String->new($seq{$input});
     my $seq = Bio::SeqIO->new(-format => 'fasta',
                              -fh => $fh)->next_seq;
     $newdata{$input} = $seq->seq;
}
is($newdata{raw}, $newdata{fasta}, 'format');

cjfields4:fasta cjfields$ perl raw_v_fasta.pl
not ok 1 - format
#   Failed test 'format'
#   at raw_v_fasta.pl line 36.
#          got:  
'HTFKMGLNQFSDMSFAEIKHKYLWSEPQNCSATKSNYLRGTGPYPSSMDWRKKGNVVSPVKNQGACGSCWTFSTTGALESAVAIASGKMMTLAEQQLVDCAQNFNNHGCQGGLPSQAFEYILYNKGIMGEDSYPYIGKNGQCKFNPEKAVAFVKNVVNITLNDEAAMVEAVALYNPVSFAFEVTEDFMMYKSGVYSSNSCHKTPDKVNHAVLAVGYGEQNGLLYWIVKNSWGSNWGNNGYFLIERGKNMCGLAACASYPIPQV'
#     expected:  
'MWTALPLLCAGAWLLSAGATAELTVNAIEKFHFTSWMKQHQKTYSSREYSHRLQVFANNWRKIQAHNQRNHTFKMGLNQFSDMSFAEIKHKYLWSEPQNCSATKSNYLRGTGPYPSSMDWRKKGNVVSPVKNQGACGSCWTFSTTGALESAVAIASGKMMTLAEQQLVDCAQNFNNHGCQGGLPSQAFEYILYNKGIMGEDSYPYIGKNGQCKFNPEKAVAFVKNVVNITLNDEAAMVEAVALYNPVSFAFEVTEDFMMYKSGVYSSNSCHKTPDKVNHAVLAVGYGEQNGLLYWIVKNSWGSNWGNNGYFLIERGKNMCGLAACASYPIPQV'
1..1
# Looks like you failed 1 test of 1.

On Aug 5, 2009, at 6:12 PM, Mark A. Jensen wrote:

> If these items were included in a Bugzilla report, that would be  
> most convenient (= most likely to get looked carefully)
> and is the best place for us to keep track of these kinds of  
> issues-- http://bugzilla.bioperl.org/
> cheers MAJ
> ----- Original Message ----- From: "Hilmar Lapp" <hlapp at gmx.net>
> To: "Chris Fields" <cjfields at illinois.edu>
> Cc: "BioPerl List" <bioperl-l at lists.open-bio.org>
> Sent: Wednesday, August 05, 2009 6:53 PM
> Subject: Re: [Bioperl-l] Bio::SeqIO issue
>
>
>> I don't think that can be the problem. If anything, providing the   
>> format ought to be better in terms of result than not providing it?
>> Uwe - I'd like you to go back to Chris' initial questions that you   
>> haven't answered yet: "What version of bioperl are you using, OS,   
>> etc?  What does your data look like?" I'd add to that, can you show  
>> us  your full script, or a smaller code snippet that reproduces the  
>> problem.
>> I suspect that either something in your script is swallowing the  
>> line,  or that the line endings in your data file are from a  
>> different OS  than the one you're running the script on. (Or that  
>> you are running a  very old version of BioPerl, which is entirely  
>> possible if you  installed through CPAN.)
>> -hilmar
>> On Aug 5, 2009, at 5:37 PM, Chris Fields wrote:
>>> Uwe,
>>>
>>> Please keep replies on the list.
>>>
>>> It's very possible that's the issue; IIRC the fasta parser pulls  
>>> out  the full sequence in chunks (based on local $/ = "\n>") and  
>>> splits  the header off as the first line in that chunk.  You could  
>>> probably  try leaving the format out and letting SeqIO guess it,  
>>> or passing  the file into Bio::Tools::GuessSeqFormat directly, but  
>>> it's probably  better to go through the files and add a file  
>>> extension that  corresponds to the format.
>>>
>>> chris
>>>
>>> On Aug 5, 2009, at 4:23 PM, Hilgert, Uwe wrote:
>>>
>>>> Thanks, Chris. The files have no extension, but we indicate what   
>>>> format
>>>> to use, like in the manual:
>>>>
>>>> $in  = Bio::SeqIO->new(-file => "file_path", -format => 'Fasta');
>>>>
>>>> I wonder now whether this could exactly cause the problem: as we  
>>>> are
>>>> telling that input files are in fasta format they are being  
>>>> treated  as
>>>> such (=remove first line) - regardless of whether they really  
>>>> are  fasta?
>>>>
>>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>>>> Uwe Hilgert, Ph.D.
>>>> Dolan DNA Learning Center
>>>> Cold Spring Harbor Laboratory
>>>>
>>>> C: (516) 857-1693
>>>> V: (516) 367-5185
>>>> E: hilgert at cshl.edu
>>>> F: (516) 367-5182
>>>> W: http://www.dnalc.org
>>>>
>>>> -----Original Message-----
>>>> From: Chris Fields [mailto:cjfields at illinois.edu]
>>>> Sent: Wednesday, August 05, 2009 5:04 PM
>>>> To: Hilgert, Uwe
>>>> Cc: bioperl-l at lists.open-bio.org
>>>> Subject: Re: [Bioperl-l] Bio::SeqIO issue
>>>>
>>>> On Aug 5, 2009, at 3:27 PM, Hilgert, Uwe wrote:
>>>>
>>>>> Is my impression correct that Bio::SeqIO just assumes that  
>>>>> sequences
>>>>> are
>>>>> being submitted in FASTA format?
>>>>
>>>> No. See:
>>>>
>>>> http://www.bioperl.org/wiki/HOWTO:SeqIO
>>>>
>>>> SeqIO tries to guess at the format using the file extension, and if
>>>> one isn't present makes use of Bio::Tools::GuessSeqFormat.  It's
>>>> possible that the extension is causing the problem, or that
>>>> GuessSeqFormat guessing wrong (it's apt to do that, as it's  
>>>> forced to
>>>> guessing).  In any case, it's always advisable to explicitly  
>>>> indicate
>>>> the format when possible.
>>>>
>>>> Relevant lines:
>>>>
>>>>  return 'fasta'   if /\.(fasta|fast|fas|seq|fa|fsa|nt|aa|fna|faa) 
>>>> $/ i;
>>>> ...
>>>>  return 'raw'     if /\.(txt)$/i;
>>>>
>>>>> In our experience, implementing
>>>>> Bio::SeqIO led to the first line of files being cut off,   
>>>>> regardless of
>>>>> whether the files were indeed fasta files or files that only   
>>>>> contained
>>>>> sequence.
>>>>
>>>> Files that only contain sequence are 'raw'.  Ones in FASTA are   
>>>> 'fasta'.
>>>>
>>>>> Which, in the latter, led to sequence submissions that had the
>>>>> first line of nucleotides removed. Has anyone tried to write a  
>>>>> fix  for
>>>>> this?
>>>>
>>>> This sounds like a bug, but we have very little to go on beyond  
>>>> your
>>>> description.  What version of bioperl are you using, OS, etc?  What
>>>> does your data look like?  File extension?
>>>>
>>>> chris
>>>>
>>>>> Thanks,
>>>>>
>>>>> Uwe
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>>>>>
>>>>> Uwe Hilgert, Ph.D.
>>>>>
>>>>> Dolan DNA Learning Center
>>>>>
>>>>> Cold Spring Harbor Laboratory
>>>>>
>>>>>
>>>>>
>>>>> V: (516) 367-5185
>>>>>
>>>>> E: hilgert at cshl.edu <mailto:hilgert at cshl.edu>
>>>>>
>>>>> F: (516) 367-5182
>>>>>
>>>>> W: http://www.dnalc.org
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Bioperl-l mailing list
>>>>> Bioperl-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> -- 
>> ===========================================================
>> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
>> ===========================================================
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>