[Bioperl-l] Bio::SeqIO issue

Thu Aug 6 20:09:22 UTC 2009

If one supplies raw sequence (no descriptor) to a FASTA parser  
(requires a descriptor), then it is bad input.  One can't reasonably  
expect the parser to work correctly under those circumstance.  Garbage  
in, garbage out.

The simplest and (IMHO) best solution under such circumstances is for  
the parser to die meaningfully ("Sequence is not FASTA format; '>'  
descriptor line is missing" or similar).  Tacking a '>' onto bad data  
doesn't make it magically work, it's just bad data with a '>' appended.

To take this one step further, what if this were genbank data?  Or  
XML?  A well-formed exception, though initially inconvenient to the  
user, will indicate the problem right away.  Silently trying to fix  
the problem by appending '>' to bad input data wouldn't work, and the  
resulting failure downstream (likely from validate_seq) would obscure  
the real problem, being the user is using the wrong format parser.

chris

On Aug 6, 2009, at 2:36 PM, Hilgert, Uwe wrote:

> Hmmm, I fail to see how supplying raw sequence could be a called "bad"
> input or a "problem". In our case, for example, not every user is a
> bioinformatics expert and Cornel was suggesting to account for that
> instead of trying to "train" the user to adhere to requirements that
> have not much to do with what s/he tries to accomplish. I don't really
> see data being modified, rather that the data format is being  
> adopted to
> the needs of the software; which I would argue should be something the
> software is being able to take care of.
>
> Uwe
>
>
> -----Original Message-----
> From: Chris Fields [mailto:cjfields at illinois.edu]
> Sent: Thursday, August 06, 2009 12:50 PM
> To: Ghiban, Cornel
> Cc: Hilmar Lapp; Hilgert, Uwe; BioPerl List
> Subject: Re: [Bioperl-l] Bio::SeqIO issue
>
> Cornel,
>
> I'm failing to see how adding '>' would solve the problem.
>
> This is a simple validation issue: should we throw an exception on bad
> input (no '>'), or just argue GIGO based on user error (the assumption
> that the SeqIO parser will read raw sequence correctly when set to
> 'fasta' is wrong)?
>
> I think, in this circumstance, the former applies.  It is easy to add,
> and the use of an exception in this case is violently user-friendly,
> e.g. it will stop cold and immediately point out the problem.
> Otherwise data is (silently) being modified, which is always a bad
> thing.
>
> chris
>
> On Aug 6, 2009, at 11:04 AM, Ghiban, Cornel wrote:
>
>> Hi,
>>
>> It doesn't matter what sequence we use. As Chris Fields's showed in
>> his test, not having
>> ">" as the 1st character on the first line is the problem.
>> We always assumed the sequence is in FASTA format and this seems to
>> be wrong.
>>
>> I think, the solution to our problem is to check whether the ">"
>> symbol is present or not.
>> If not present then it will be added.
>>
>> Thank you,
>> Cornel Ghiban
>>
>> -----Original Message-----
>> From: Hilmar Lapp [mailto:hlapp at gmx.net]
>> Sent: Thursday, August 06, 2009 11:18 AM
>> To: Hilgert, Uwe
>> Cc: Chris Fields; BioPerl List; Ghiban, Cornel
>> Subject: Re: [Bioperl-l] Bio::SeqIO issue
>>
>> Uwe - could you send an actual data file (as an attachment) that
>> reproduces the problem, or is that not possible?
>>
>> 	-hilmar
>>
>> On Aug 6, 2009, at 11:01 AM, Hilgert, Uwe wrote:
>>
>>> I'm not sure what version we have. Cornel may have installed it a
>>> while ago from CVS:
>>>
>>> Module id = Bio::Root::Build
>>>  CPAN_USERID  CJFIELDS (Christopher Fields <cjfields at bioperl.org>)
>>>  CPAN_VERSION 1.006000
>>>  INST_FILE    /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Build.pm
>>>  INST_VERSION 1.006900
>>> cpan> m Bio::Root::Version
>>> Module id = Bio::Root::Version
>>>  CPAN_USERID  CJFIELDS (Christopher Fields <cjfields at bioperl.org>)
>>>  CPAN_VERSION 1.006000
>>>  INST_FILE    /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Version.pm
>>>  INST_VERSION 1.006900
>>> cpan> m Bio::SeqIO
>>> Module id = Bio::SeqIO
>>>  CPAN_USERID  CJFIELDS (Christopher Fields <cjfields at bioperl.org>)
>>>  CPAN_VERSION 1.006000
>>>  INST_FILE    /usr/lib/perl5/site_perl/5.8.8/Bio/SeqIO.pm
>>>  INST_VERSION undef
>>>
>>> Cornel still has the checked-out "bioperl-live" directory and the
>>> last
>>> changes are from March this year.
>>>
>>> As per why he used "Fasta" instead of 'fasta" as the format  
>>> parameter
>>> in Bio::SeqIO, it's because that what it says in the modules manual.
>>> He now tried 'fasta' instead and see no changes in behavior.  
>>> Omitting
>>> the format parameter altogether, fasta-formatted sequence continues
>>> to
>>> be treated correctly, the first line being removed. However, raw
>>> sequence is being treated differently in that the first line is not
>>> being removed any more. Instead, the program returns the first line
>>> only. Which, in the example I am going to forward in my next  
>>> message,
>>> will return 60 amino acids out of raw sequence of 300 aa. Can't win
>>> with raw sequence...
>>>
>>>
>>> The files may be created on different platforms, we didn't notice  
>>> any
>>> difference between using files created on Windows or Linux.
>>>
>>> Thanks
>>> Uwe
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Hilmar Lapp [mailto:hlapp at gmx.net]
>>> Sent: Wednesday, August 05, 2009 6:54 PM
>>> To: Chris Fields
>>> Cc: Hilgert, Uwe; BioPerl List
>>> Subject: Re: [Bioperl-l] Bio::SeqIO issue
>>>
>>> I don't think that can be the problem. If anything, providing the
>>> format ought to be better in terms of result than not providing it?
>>>
>>> Uwe - I'd like you to go back to Chris' initial questions that you
>>> haven't answered yet: "What version of bioperl are you using, OS,
>>> etc?
>>> What does your data look like?" I'd add to that, can you show us  
>>> your
>>> full script, or a smaller code snippet that reproduces the problem.
>>>
>>> I suspect that either something in your script is swallowing the
>>> line,
>>> or that the line endings in your data file are from a different OS
>>> than the one you're running the script on. (Or that you are  
>>> running a
>>> very old version of BioPerl, which is entirely possible if you
>>> installed through CPAN.)
>>>
>>> 	-hilmar
>>>
>>> On Aug 5, 2009, at 5:37 PM, Chris Fields wrote:
>>>
>>>> Uwe,
>>>>
>>>> Please keep replies on the list.
>>>>
>>>> It's very possible that's the issue; IIRC the fasta parser pulls  
>>>> out
>>>> the full sequence in chunks (based on local $/ = "\n>") and splits
>>>> the header off as the first line in that chunk.  You could probably
>>>> try leaving the format out and letting SeqIO guess it, or passing
>>>> the
>>>> file into Bio::Tools::GuessSeqFormat directly, but it's probably
>>>> better to go through the files and add a file extension that
>>>> corresponds to the format.
>>>>
>>>> chris
>>>>
>>>> On Aug 5, 2009, at 4:23 PM, Hilgert, Uwe wrote:
>>>>
>>>>> Thanks, Chris. The files have no extension, but we indicate what
>>>>> format to use, like in the manual:
>>>>>
>>>>> $in  = Bio::SeqIO->new(-file => "file_path", -format => 'Fasta');
>>>>>
>>>>> I wonder now whether this could exactly cause the problem: as we
>>>>> are
>>>>> telling that input files are in fasta format they are being  
>>>>> treated
>>>>> as such (=remove first line) - regardless of whether they really
>>>>> are
>>>>> fasta?
>>>>>
>>>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  
>>>>> Uwe
>>>>> Hilgert, Ph.D.
>>>>> Dolan DNA Learning Center
>>>>> Cold Spring Harbor Laboratory
>>>>>
>>>>> C: (516) 857-1693
>>>>> V: (516) 367-5185
>>>>> E: hilgert at cshl.edu
>>>>> F: (516) 367-5182
>>>>> W: http://www.dnalc.org
>>>>>
>>>>> -----Original Message-----
>>>>> From: Chris Fields [mailto:cjfields at illinois.edu]
>>>>> Sent: Wednesday, August 05, 2009 5:04 PM
>>>>> To: Hilgert, Uwe
>>>>> Cc: bioperl-l at lists.open-bio.org
>>>>> Subject: Re: [Bioperl-l] Bio::SeqIO issue
>>>>>
>>>>> On Aug 5, 2009, at 3:27 PM, Hilgert, Uwe wrote:
>>>>>
>>>>>> Is my impression correct that Bio::SeqIO just assumes that
>>>>>> sequences are being submitted in FASTA format?
>>>>>
>>>>> No. See:
>>>>>
>>>>> http://www.bioperl.org/wiki/HOWTO:SeqIO
>>>>>
>>>>> SeqIO tries to guess at the format using the file extension, and  
>>>>> if
>>>>> one isn't present makes use of Bio::Tools::GuessSeqFormat.  It's
>>>>> possible that the extension is causing the problem, or that
>>>>> GuessSeqFormat guessing wrong (it's apt to do that, as it's forced
>>>>> to guessing).  In any case, it's always advisable to explicitly
>>>>> indicate the format when possible.
>>>>>
>>>>> Relevant lines:
>>>>>
>>>>> return 'fasta'   if /\.(fasta|fast|fas|seq|fa|fsa|nt|aa|fna|faa)$/
>>>>> i;
>>>>> ...
>>>>> return 'raw'     if /\.(txt)$/i;
>>>>>
>>>>>> In our experience, implementing
>>>>>> Bio::SeqIO led to the first line of files being cut off,
>>>>>> regardless
>>>>>> of whether the files were indeed fasta files or files that only
>>>>>> contained sequence.
>>>>>
>>>>> Files that only contain sequence are 'raw'.  Ones in FASTA are
>>>>> 'fasta'.
>>>>>
>>>>>> Which, in the latter, led to sequence submissions that had the
>>>>>> first line of nucleotides removed. Has anyone tried to write a  
>>>>>> fix
>>>>>> for this?
>>>>>
>>>>> This sounds like a bug, but we have very little to go on beyond
>>>>> your
>>>>> description.  What version of bioperl are you using, OS, etc?   
>>>>> What
>>>>> does your data look like?  File extension?
>>>>>
>>>>> chris
>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Uwe
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>>>>>>
>>>>>> Uwe Hilgert, Ph.D.
>>>>>>
>>>>>> Dolan DNA Learning Center
>>>>>>
>>>>>> Cold Spring Harbor Laboratory
>>>>>>
>>>>>>
>>>>>>
>>>>>> V: (516) 367-5185
>>>>>>
>>>>>> E: hilgert at cshl.edu <mailto:hilgert at cshl.edu>
>>>>>>
>>>>>> F: (516) 367-5182
>>>>>>
>>>>>> W: http://www.dnalc.org
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioperl-l mailing list
>>>>>> Bioperl-l at lists.open-bio.org
>>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>>
>>>>
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>> --
>>> ===========================================================
>>> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
>>> ===========================================================
>>>
>>>
>>
>> --
>> ===========================================================
>> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
>> ===========================================================
>>
>>
>>
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l