[Bioperl-l] Bio::SeqIO issue
Chris Fields
cjfields at illinois.edu
Thu Aug 6 05:12:13 UTC 2009
Just to confirm: the following is using bioperl-live on my macbook pro
(perl 5.10.0, 64bit). We need to decide if this is a legit bug or a
user issue (if it's the former, we can easily add an exception
indicating lack of a header). Note that 'raw' also fails for the raw
example below (doesn't appear to remove newlines).
-c
cjfields4:fasta cjfields$ cat raw_v_fasta.pl
#!/usr/bin/perl -w
use strict;
use warnings;
use IO::String;
use Bio::SeqIO;
use Test::More qw(no_plan);
my %seq;
$seq{raw} = <<RAW;
MWTALPLLCAGAWLLSAGATAELTVNAIEKFHFTSWMKQHQKTYSSREYSHRLQVFANNWRKIQAHNQRN
HTFKMGLNQFSDMSFAEIKHKYLWSEPQNCSATKSNYLRGTGPYPSSMDWRKKGNVVSPVKNQGACGSCW
TFSTTGALESAVAIASGKMMTLAEQQLVDCAQNFNNHGCQGGLPSQAFEYILYNKGIMGEDSYPYIGKNG
QCKFNPEKAVAFVKNVVNITLNDEAAMVEAVALYNPVSFAFEVTEDFMMYKSGVYSSNSCHKTPDKVNHA
VLAVGYGEQNGLLYWIVKNSWGSNWGNNGYFLIERGKNMCGLAACASYPIPQV
RAW
$seq{fasta} = <<FASTA;
>CATH_RAT
MWTALPLLCAGAWLLSAGATAELTVNAIEKFHFTSWMKQHQKTYSSREYSHRLQVFANNWRKIQAHNQRN
HTFKMGLNQFSDMSFAEIKHKYLWSEPQNCSATKSNYLRGTGPYPSSMDWRKKGNVVSPVKNQGACGSCW
TFSTTGALESAVAIASGKMMTLAEQQLVDCAQNFNNHGCQGGLPSQAFEYILYNKGIMGEDSYPYIGKNG
QCKFNPEKAVAFVKNVVNITLNDEAAMVEAVALYNPVSFAFEVTEDFMMYKSGVYSSNSCHKTPDKVNHA
VLAVGYGEQNGLLYWIVKNSWGSNWGNNGYFLIERGKNMCGLAACASYPIPQV
FASTA
my %newdata;
for my $input (sort keys %seq) {
my $fh = IO::String->new($seq{$input});
my $seq = Bio::SeqIO->new(-format => 'fasta',
-fh => $fh)->next_seq;
$newdata{$input} = $seq->seq;
}
is($newdata{raw}, $newdata{fasta}, 'format');
cjfields4:fasta cjfields$ perl raw_v_fasta.pl
not ok 1 - format
# Failed test 'format'
# at raw_v_fasta.pl line 36.
# got:
'HTFKMGLNQFSDMSFAEIKHKYLWSEPQNCSATKSNYLRGTGPYPSSMDWRKKGNVVSPVKNQGACGSCWTFSTTGALESAVAIASGKMMTLAEQQLVDCAQNFNNHGCQGGLPSQAFEYILYNKGIMGEDSYPYIGKNGQCKFNPEKAVAFVKNVVNITLNDEAAMVEAVALYNPVSFAFEVTEDFMMYKSGVYSSNSCHKTPDKVNHAVLAVGYGEQNGLLYWIVKNSWGSNWGNNGYFLIERGKNMCGLAACASYPIPQV'
# expected:
'MWTALPLLCAGAWLLSAGATAELTVNAIEKFHFTSWMKQHQKTYSSREYSHRLQVFANNWRKIQAHNQRNHTFKMGLNQFSDMSFAEIKHKYLWSEPQNCSATKSNYLRGTGPYPSSMDWRKKGNVVSPVKNQGACGSCWTFSTTGALESAVAIASGKMMTLAEQQLVDCAQNFNNHGCQGGLPSQAFEYILYNKGIMGEDSYPYIGKNGQCKFNPEKAVAFVKNVVNITLNDEAAMVEAVALYNPVSFAFEVTEDFMMYKSGVYSSNSCHKTPDKVNHAVLAVGYGEQNGLLYWIVKNSWGSNWGNNGYFLIERGKNMCGLAACASYPIPQV'
1..1
# Looks like you failed 1 test of 1.
On Aug 5, 2009, at 6:12 PM, Mark A. Jensen wrote:
> If these items were included in a Bugzilla report, that would be
> most convenient (= most likely to get looked carefully)
> and is the best place for us to keep track of these kinds of
> issues-- http://bugzilla.bioperl.org/
> cheers MAJ
> ----- Original Message ----- From: "Hilmar Lapp" <hlapp at gmx.net>
> To: "Chris Fields" <cjfields at illinois.edu>
> Cc: "BioPerl List" <bioperl-l at lists.open-bio.org>
> Sent: Wednesday, August 05, 2009 6:53 PM
> Subject: Re: [Bioperl-l] Bio::SeqIO issue
>
>
>> I don't think that can be the problem. If anything, providing the
>> format ought to be better in terms of result than not providing it?
>> Uwe - I'd like you to go back to Chris' initial questions that you
>> haven't answered yet: "What version of bioperl are you using, OS,
>> etc? What does your data look like?" I'd add to that, can you show
>> us your full script, or a smaller code snippet that reproduces the
>> problem.
>> I suspect that either something in your script is swallowing the
>> line, or that the line endings in your data file are from a
>> different OS than the one you're running the script on. (Or that
>> you are running a very old version of BioPerl, which is entirely
>> possible if you installed through CPAN.)
>> -hilmar
>> On Aug 5, 2009, at 5:37 PM, Chris Fields wrote:
>>> Uwe,
>>>
>>> Please keep replies on the list.
>>>
>>> It's very possible that's the issue; IIRC the fasta parser pulls
>>> out the full sequence in chunks (based on local $/ = "\n>") and
>>> splits the header off as the first line in that chunk. You could
>>> probably try leaving the format out and letting SeqIO guess it,
>>> or passing the file into Bio::Tools::GuessSeqFormat directly, but
>>> it's probably better to go through the files and add a file
>>> extension that corresponds to the format.
>>>
>>> chris
>>>
>>> On Aug 5, 2009, at 4:23 PM, Hilgert, Uwe wrote:
>>>
>>>> Thanks, Chris. The files have no extension, but we indicate what
>>>> format
>>>> to use, like in the manual:
>>>>
>>>> $in = Bio::SeqIO->new(-file => "file_path", -format => 'Fasta');
>>>>
>>>> I wonder now whether this could exactly cause the problem: as we
>>>> are
>>>> telling that input files are in fasta format they are being
>>>> treated as
>>>> such (=remove first line) - regardless of whether they really
>>>> are fasta?
>>>>
>>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>>>> Uwe Hilgert, Ph.D.
>>>> Dolan DNA Learning Center
>>>> Cold Spring Harbor Laboratory
>>>>
>>>> C: (516) 857-1693
>>>> V: (516) 367-5185
>>>> E: hilgert at cshl.edu
>>>> F: (516) 367-5182
>>>> W: http://www.dnalc.org
>>>>
>>>> -----Original Message-----
>>>> From: Chris Fields [mailto:cjfields at illinois.edu]
>>>> Sent: Wednesday, August 05, 2009 5:04 PM
>>>> To: Hilgert, Uwe
>>>> Cc: bioperl-l at lists.open-bio.org
>>>> Subject: Re: [Bioperl-l] Bio::SeqIO issue
>>>>
>>>> On Aug 5, 2009, at 3:27 PM, Hilgert, Uwe wrote:
>>>>
>>>>> Is my impression correct that Bio::SeqIO just assumes that
>>>>> sequences
>>>>> are
>>>>> being submitted in FASTA format?
>>>>
>>>> No. See:
>>>>
>>>> http://www.bioperl.org/wiki/HOWTO:SeqIO
>>>>
>>>> SeqIO tries to guess at the format using the file extension, and if
>>>> one isn't present makes use of Bio::Tools::GuessSeqFormat. It's
>>>> possible that the extension is causing the problem, or that
>>>> GuessSeqFormat guessing wrong (it's apt to do that, as it's
>>>> forced to
>>>> guessing). In any case, it's always advisable to explicitly
>>>> indicate
>>>> the format when possible.
>>>>
>>>> Relevant lines:
>>>>
>>>> return 'fasta' if /\.(fasta|fast|fas|seq|fa|fsa|nt|aa|fna|faa)
>>>> $/ i;
>>>> ...
>>>> return 'raw' if /\.(txt)$/i;
>>>>
>>>>> In our experience, implementing
>>>>> Bio::SeqIO led to the first line of files being cut off,
>>>>> regardless of
>>>>> whether the files were indeed fasta files or files that only
>>>>> contained
>>>>> sequence.
>>>>
>>>> Files that only contain sequence are 'raw'. Ones in FASTA are
>>>> 'fasta'.
>>>>
>>>>> Which, in the latter, led to sequence submissions that had the
>>>>> first line of nucleotides removed. Has anyone tried to write a
>>>>> fix for
>>>>> this?
>>>>
>>>> This sounds like a bug, but we have very little to go on beyond
>>>> your
>>>> description. What version of bioperl are you using, OS, etc? What
>>>> does your data look like? File extension?
>>>>
>>>> chris
>>>>
>>>>> Thanks,
>>>>>
>>>>> Uwe
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>>>>>
>>>>> Uwe Hilgert, Ph.D.
>>>>>
>>>>> Dolan DNA Learning Center
>>>>>
>>>>> Cold Spring Harbor Laboratory
>>>>>
>>>>>
>>>>>
>>>>> V: (516) 367-5185
>>>>>
>>>>> E: hilgert at cshl.edu <mailto:hilgert at cshl.edu>
>>>>>
>>>>> F: (516) 367-5182
>>>>>
>>>>> W: http://www.dnalc.org
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Bioperl-l mailing list
>>>>> Bioperl-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> --
>> ===========================================================
>> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
>> ===========================================================
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
More information about the Bioperl-l
mailing list