[Bioperl-l] Bio::SeqIO issue
Hilmar Lapp
hlapp at gmx.net
Thu Aug 6 13:36:08 UTC 2009
Why is specifying fasta format when your input is not in fasta format
not a user error?
I agree with the not removing newlines in raw format being a bug.
-hilmar
On Aug 6, 2009, at 1:12 AM, Chris Fields wrote:
> Just to confirm: the following is using bioperl-live on my macbook
> pro (perl 5.10.0, 64bit). We need to decide if this is a legit bug
> or a user issue (if it's the former, we can easily add an exception
> indicating lack of a header). Note that 'raw' also fails for the
> raw example below (doesn't appear to remove newlines).
>
> -c
>
> cjfields4:fasta cjfields$ cat raw_v_fasta.pl
> #!/usr/bin/perl -w
>
> use strict;
> use warnings;
> use IO::String;
> use Bio::SeqIO;
> use Test::More qw(no_plan);
>
> my %seq;
>
> $seq{raw} = <<RAW;
> MWTALPLLCAGAWLLSAGATAELTVNAIEKFHFTSWMKQHQKTYSSREYSHRLQVFANNWRKIQAHNQRN
> HTFKMGLNQFSDMSFAEIKHKYLWSEPQNCSATKSNYLRGTGPYPSSMDWRKKGNVVSPVKNQGACGSCW
> TFSTTGALESAVAIASGKMMTLAEQQLVDCAQNFNNHGCQGGLPSQAFEYILYNKGIMGEDSYPYIGKNG
> QCKFNPEKAVAFVKNVVNITLNDEAAMVEAVALYNPVSFAFEVTEDFMMYKSGVYSSNSCHKTPDKVNHA
> VLAVGYGEQNGLLYWIVKNSWGSNWGNNGYFLIERGKNMCGLAACASYPIPQV
> RAW
>
> $seq{fasta} = <<FASTA;
> >CATH_RAT
> MWTALPLLCAGAWLLSAGATAELTVNAIEKFHFTSWMKQHQKTYSSREYSHRLQVFANNWRKIQAHNQRN
> HTFKMGLNQFSDMSFAEIKHKYLWSEPQNCSATKSNYLRGTGPYPSSMDWRKKGNVVSPVKNQGACGSCW
> TFSTTGALESAVAIASGKMMTLAEQQLVDCAQNFNNHGCQGGLPSQAFEYILYNKGIMGEDSYPYIGKNG
> QCKFNPEKAVAFVKNVVNITLNDEAAMVEAVALYNPVSFAFEVTEDFMMYKSGVYSSNSCHKTPDKVNHA
> VLAVGYGEQNGLLYWIVKNSWGSNWGNNGYFLIERGKNMCGLAACASYPIPQV
> FASTA
>
> my %newdata;
> for my $input (sort keys %seq) {
> my $fh = IO::String->new($seq{$input});
> my $seq = Bio::SeqIO->new(-format => 'fasta',
> -fh => $fh)->next_seq;
> $newdata{$input} = $seq->seq;
> }
> is($newdata{raw}, $newdata{fasta}, 'format');
>
> cjfields4:fasta cjfields$ perl raw_v_fasta.pl
> not ok 1 - format
> # Failed test 'format'
> # at raw_v_fasta.pl line 36.
> # got:
> 'HTFKMGLNQFSDMSFAEIKHKYLWSEPQNCSATKSNYLRGTGPYPSSMDWRKKGNVVSPVKNQGACGSCWTFSTTGALESAVAIASGKMMTLAEQQLVDCAQNFNNHGCQGGLPSQAFEYILYNKGIMGEDSYPYIGKNGQCKFNPEKAVAFVKNVVNITLNDEAAMVEAVALYNPVSFAFEVTEDFMMYKSGVYSSNSCHKTPDKVNHAVLAVGYGEQNGLLYWIVKNSWGSNWGNNGYFLIERGKNMCGLAACASYPIPQV'
> # expected:
> 'MWTALPLLCAGAWLLSAGATAELTVNAIEKFHFTSWMKQHQKTYSSREYSHRLQVFANNWRKIQAHNQRNHTFKMGLNQFSDMSFAEIKHKYLWSEPQNCSATKSNYLRGTGPYPSSMDWRKKGNVVSPVKNQGACGSCWTFSTTGALESAVAIASGKMMTLAEQQLVDCAQNFNNHGCQGGLPSQAFEYILYNKGIMGEDSYPYIGKNGQCKFNPEKAVAFVKNVVNITLNDEAAMVEAVALYNPVSFAFEVTEDFMMYKSGVYSSNSCHKTPDKVNHAVLAVGYGEQNGLLYWIVKNSWGSNWGNNGYFLIERGKNMCGLAACASYPIPQV'
> 1..1
> # Looks like you failed 1 test of 1.
>
> On Aug 5, 2009, at 6:12 PM, Mark A. Jensen wrote:
>
>> If these items were included in a Bugzilla report, that would be
>> most convenient (= most likely to get looked carefully)
>> and is the best place for us to keep track of these kinds of
>> issues-- http://bugzilla.bioperl.org/
>> cheers MAJ
>> ----- Original Message ----- From: "Hilmar Lapp" <hlapp at gmx.net>
>> To: "Chris Fields" <cjfields at illinois.edu>
>> Cc: "BioPerl List" <bioperl-l at lists.open-bio.org>
>> Sent: Wednesday, August 05, 2009 6:53 PM
>> Subject: Re: [Bioperl-l] Bio::SeqIO issue
>>
>>
>>> I don't think that can be the problem. If anything, providing the
>>> format ought to be better in terms of result than not providing it?
>>> Uwe - I'd like you to go back to Chris' initial questions that
>>> you haven't answered yet: "What version of bioperl are you using,
>>> OS, etc? What does your data look like?" I'd add to that, can
>>> you show us your full script, or a smaller code snippet that
>>> reproduces the problem.
>>> I suspect that either something in your script is swallowing the
>>> line, or that the line endings in your data file are from a
>>> different OS than the one you're running the script on. (Or that
>>> you are running a very old version of BioPerl, which is entirely
>>> possible if you installed through CPAN.)
>>> -hilmar
>>> On Aug 5, 2009, at 5:37 PM, Chris Fields wrote:
>>>> Uwe,
>>>>
>>>> Please keep replies on the list.
>>>>
>>>> It's very possible that's the issue; IIRC the fasta parser pulls
>>>> out the full sequence in chunks (based on local $/ = "\n>") and
>>>> splits the header off as the first line in that chunk. You
>>>> could probably try leaving the format out and letting SeqIO
>>>> guess it, or passing the file into Bio::Tools::GuessSeqFormat
>>>> directly, but it's probably better to go through the files and
>>>> add a file extension that corresponds to the format.
>>>>
>>>> chris
>>>>
>>>> On Aug 5, 2009, at 4:23 PM, Hilgert, Uwe wrote:
>>>>
>>>>> Thanks, Chris. The files have no extension, but we indicate
>>>>> what format
>>>>> to use, like in the manual:
>>>>>
>>>>> $in = Bio::SeqIO->new(-file => "file_path", -format => 'Fasta');
>>>>>
>>>>> I wonder now whether this could exactly cause the problem: as we
>>>>> are
>>>>> telling that input files are in fasta format they are being
>>>>> treated as
>>>>> such (=remove first line) - regardless of whether they really
>>>>> are fasta?
>>>>>
>>>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>>>>> Uwe Hilgert, Ph.D.
>>>>> Dolan DNA Learning Center
>>>>> Cold Spring Harbor Laboratory
>>>>>
>>>>> C: (516) 857-1693
>>>>> V: (516) 367-5185
>>>>> E: hilgert at cshl.edu
>>>>> F: (516) 367-5182
>>>>> W: http://www.dnalc.org
>>>>>
>>>>> -----Original Message-----
>>>>> From: Chris Fields [mailto:cjfields at illinois.edu]
>>>>> Sent: Wednesday, August 05, 2009 5:04 PM
>>>>> To: Hilgert, Uwe
>>>>> Cc: bioperl-l at lists.open-bio.org
>>>>> Subject: Re: [Bioperl-l] Bio::SeqIO issue
>>>>>
>>>>> On Aug 5, 2009, at 3:27 PM, Hilgert, Uwe wrote:
>>>>>
>>>>>> Is my impression correct that Bio::SeqIO just assumes that
>>>>>> sequences
>>>>>> are
>>>>>> being submitted in FASTA format?
>>>>>
>>>>> No. See:
>>>>>
>>>>> http://www.bioperl.org/wiki/HOWTO:SeqIO
>>>>>
>>>>> SeqIO tries to guess at the format using the file extension, and
>>>>> if
>>>>> one isn't present makes use of Bio::Tools::GuessSeqFormat. It's
>>>>> possible that the extension is causing the problem, or that
>>>>> GuessSeqFormat guessing wrong (it's apt to do that, as it's
>>>>> forced to
>>>>> guessing). In any case, it's always advisable to explicitly
>>>>> indicate
>>>>> the format when possible.
>>>>>
>>>>> Relevant lines:
>>>>>
>>>>> return 'fasta' if /\.(fasta|fast|fas|seq|fa|fsa|nt|aa|fna|faa)
>>>>> $/ i;
>>>>> ...
>>>>> return 'raw' if /\.(txt)$/i;
>>>>>
>>>>>> In our experience, implementing
>>>>>> Bio::SeqIO led to the first line of files being cut off,
>>>>>> regardless of
>>>>>> whether the files were indeed fasta files or files that only
>>>>>> contained
>>>>>> sequence.
>>>>>
>>>>> Files that only contain sequence are 'raw'. Ones in FASTA are
>>>>> 'fasta'.
>>>>>
>>>>>> Which, in the latter, led to sequence submissions that had the
>>>>>> first line of nucleotides removed. Has anyone tried to write a
>>>>>> fix for
>>>>>> this?
>>>>>
>>>>> This sounds like a bug, but we have very little to go on beyond
>>>>> your
>>>>> description. What version of bioperl are you using, OS, etc?
>>>>> What
>>>>> does your data look like? File extension?
>>>>>
>>>>> chris
>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Uwe
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>>>>>>
>>>>>> Uwe Hilgert, Ph.D.
>>>>>>
>>>>>> Dolan DNA Learning Center
>>>>>>
>>>>>> Cold Spring Harbor Laboratory
>>>>>>
>>>>>>
>>>>>>
>>>>>> V: (516) 367-5185
>>>>>>
>>>>>> E: hilgert at cshl.edu <mailto:hilgert at cshl.edu>
>>>>>>
>>>>>> F: (516) 367-5182
>>>>>>
>>>>>> W: http://www.dnalc.org
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioperl-l mailing list
>>>>>> Bioperl-l at lists.open-bio.org
>>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>>
>>>>
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>> --
>>> ===========================================================
>>> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
>>> ===========================================================
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================
More information about the Bioperl-l
mailing list