[Bioperl-l] Bio::SeqIO issue

Chris Fields cjfields at illinois.edu
Thu Aug 6 16:49:53 UTC 2009


Cornel,

I'm failing to see how adding '>' would solve the problem.

This is a simple validation issue: should we throw an exception on bad  
input (no '>'), or just argue GIGO based on user error (the assumption  
that the SeqIO parser will read raw sequence correctly when set to  
'fasta' is wrong)?

I think, in this circumstance, the former applies.  It is easy to add,  
and the use of an exception in this case is violently user-friendly,  
e.g. it will stop cold and immediately point out the problem.   
Otherwise data is (silently) being modified, which is always a bad  
thing.

chris

On Aug 6, 2009, at 11:04 AM, Ghiban, Cornel wrote:

> Hi,
>
> It doesn't matter what sequence we use. As Chris Fields's showed in  
> his test, not having
> ">" as the 1st character on the first line is the problem.
> We always assumed the sequence is in FASTA format and this seems to  
> be wrong.
>
> I think, the solution to our problem is to check whether the ">"  
> symbol is present or not.
> If not present then it will be added.
>
> Thank you,
> Cornel Ghiban
>
> -----Original Message-----
> From: Hilmar Lapp [mailto:hlapp at gmx.net]
> Sent: Thursday, August 06, 2009 11:18 AM
> To: Hilgert, Uwe
> Cc: Chris Fields; BioPerl List; Ghiban, Cornel
> Subject: Re: [Bioperl-l] Bio::SeqIO issue
>
> Uwe - could you send an actual data file (as an attachment) that  
> reproduces the problem, or is that not possible?
>
> 	-hilmar
>
> On Aug 6, 2009, at 11:01 AM, Hilgert, Uwe wrote:
>
>> I'm not sure what version we have. Cornel may have installed it a
>> while ago from CVS:
>>
>> Module id = Bio::Root::Build
>>   CPAN_USERID  CJFIELDS (Christopher Fields <cjfields at bioperl.org>)
>>   CPAN_VERSION 1.006000
>>   INST_FILE    /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Build.pm
>>   INST_VERSION 1.006900
>> cpan> m Bio::Root::Version
>> Module id = Bio::Root::Version
>>   CPAN_USERID  CJFIELDS (Christopher Fields <cjfields at bioperl.org>)
>>   CPAN_VERSION 1.006000
>>   INST_FILE    /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Version.pm
>>   INST_VERSION 1.006900
>> cpan> m Bio::SeqIO
>> Module id = Bio::SeqIO
>>   CPAN_USERID  CJFIELDS (Christopher Fields <cjfields at bioperl.org>)
>>   CPAN_VERSION 1.006000
>>   INST_FILE    /usr/lib/perl5/site_perl/5.8.8/Bio/SeqIO.pm
>>   INST_VERSION undef
>>
>> Cornel still has the checked-out "bioperl-live" directory and the  
>> last
>> changes are from March this year.
>>
>> As per why he used "Fasta" instead of 'fasta" as the format parameter
>> in Bio::SeqIO, it's because that what it says in the modules manual.
>> He now tried 'fasta' instead and see no changes in behavior. Omitting
>> the format parameter altogether, fasta-formatted sequence continues  
>> to
>> be treated correctly, the first line being removed. However, raw
>> sequence is being treated differently in that the first line is not
>> being removed any more. Instead, the program returns the first line
>> only. Which, in the example I am going to forward in my next message,
>> will return 60 amino acids out of raw sequence of 300 aa. Can't win
>> with raw sequence...
>>
>>
>> The files may be created on different platforms, we didn't notice any
>> difference between using files created on Windows or Linux.
>>
>> Thanks
>> Uwe
>>
>>
>>
>>
>> -----Original Message-----
>> From: Hilmar Lapp [mailto:hlapp at gmx.net]
>> Sent: Wednesday, August 05, 2009 6:54 PM
>> To: Chris Fields
>> Cc: Hilgert, Uwe; BioPerl List
>> Subject: Re: [Bioperl-l] Bio::SeqIO issue
>>
>> I don't think that can be the problem. If anything, providing the
>> format ought to be better in terms of result than not providing it?
>>
>> Uwe - I'd like you to go back to Chris' initial questions that you
>> haven't answered yet: "What version of bioperl are you using, OS,  
>> etc?
>> What does your data look like?" I'd add to that, can you show us your
>> full script, or a smaller code snippet that reproduces the problem.
>>
>> I suspect that either something in your script is swallowing the  
>> line,
>> or that the line endings in your data file are from a different OS
>> than the one you're running the script on. (Or that you are running a
>> very old version of BioPerl, which is entirely possible if you
>> installed through CPAN.)
>>
>> 	-hilmar
>>
>> On Aug 5, 2009, at 5:37 PM, Chris Fields wrote:
>>
>>> Uwe,
>>>
>>> Please keep replies on the list.
>>>
>>> It's very possible that's the issue; IIRC the fasta parser pulls out
>>> the full sequence in chunks (based on local $/ = "\n>") and splits
>>> the header off as the first line in that chunk.  You could probably
>>> try leaving the format out and letting SeqIO guess it, or passing  
>>> the
>>> file into Bio::Tools::GuessSeqFormat directly, but it's probably
>>> better to go through the files and add a file extension that
>>> corresponds to the format.
>>>
>>> chris
>>>
>>> On Aug 5, 2009, at 4:23 PM, Hilgert, Uwe wrote:
>>>
>>>> Thanks, Chris. The files have no extension, but we indicate what
>>>> format to use, like in the manual:
>>>>
>>>> $in  = Bio::SeqIO->new(-file => "file_path", -format => 'Fasta');
>>>>
>>>> I wonder now whether this could exactly cause the problem: as we  
>>>> are
>>>> telling that input files are in fasta format they are being treated
>>>> as such (=remove first line) - regardless of whether they really  
>>>> are
>>>> fasta?
>>>>
>>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Uwe
>>>> Hilgert, Ph.D.
>>>> Dolan DNA Learning Center
>>>> Cold Spring Harbor Laboratory
>>>>
>>>> C: (516) 857-1693
>>>> V: (516) 367-5185
>>>> E: hilgert at cshl.edu
>>>> F: (516) 367-5182
>>>> W: http://www.dnalc.org
>>>>
>>>> -----Original Message-----
>>>> From: Chris Fields [mailto:cjfields at illinois.edu]
>>>> Sent: Wednesday, August 05, 2009 5:04 PM
>>>> To: Hilgert, Uwe
>>>> Cc: bioperl-l at lists.open-bio.org
>>>> Subject: Re: [Bioperl-l] Bio::SeqIO issue
>>>>
>>>> On Aug 5, 2009, at 3:27 PM, Hilgert, Uwe wrote:
>>>>
>>>>> Is my impression correct that Bio::SeqIO just assumes that
>>>>> sequences are being submitted in FASTA format?
>>>>
>>>> No. See:
>>>>
>>>> http://www.bioperl.org/wiki/HOWTO:SeqIO
>>>>
>>>> SeqIO tries to guess at the format using the file extension, and if
>>>> one isn't present makes use of Bio::Tools::GuessSeqFormat.  It's
>>>> possible that the extension is causing the problem, or that
>>>> GuessSeqFormat guessing wrong (it's apt to do that, as it's forced
>>>> to guessing).  In any case, it's always advisable to explicitly
>>>> indicate the format when possible.
>>>>
>>>> Relevant lines:
>>>>
>>>> return 'fasta'   if /\.(fasta|fast|fas|seq|fa|fsa|nt|aa|fna|faa)$/
>>>> i;
>>>> ...
>>>> return 'raw'     if /\.(txt)$/i;
>>>>
>>>>> In our experience, implementing
>>>>> Bio::SeqIO led to the first line of files being cut off,  
>>>>> regardless
>>>>> of whether the files were indeed fasta files or files that only
>>>>> contained sequence.
>>>>
>>>> Files that only contain sequence are 'raw'.  Ones in FASTA are
>>>> 'fasta'.
>>>>
>>>>> Which, in the latter, led to sequence submissions that had the
>>>>> first line of nucleotides removed. Has anyone tried to write a fix
>>>>> for this?
>>>>
>>>> This sounds like a bug, but we have very little to go on beyond  
>>>> your
>>>> description.  What version of bioperl are you using, OS, etc?  What
>>>> does your data look like?  File extension?
>>>>
>>>> chris
>>>>
>>>>> Thanks,
>>>>>
>>>>> Uwe
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>>>>>
>>>>> Uwe Hilgert, Ph.D.
>>>>>
>>>>> Dolan DNA Learning Center
>>>>>
>>>>> Cold Spring Harbor Laboratory
>>>>>
>>>>>
>>>>>
>>>>> V: (516) 367-5185
>>>>>
>>>>> E: hilgert at cshl.edu <mailto:hilgert at cshl.edu>
>>>>>
>>>>> F: (516) 367-5182
>>>>>
>>>>> W: http://www.dnalc.org
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Bioperl-l mailing list
>>>>> Bioperl-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>> --
>> ===========================================================
>> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
>> ===========================================================
>>
>>
>
> --
> ===========================================================
> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
> ===========================================================
>
>
>




More information about the Bioperl-l mailing list