[Bioperl-l] Bio::SeqIO issue

Ghiban, Cornel ghiban at cshl.edu
Thu Aug 6 16:04:38 UTC 2009


Hi,

It doesn't matter what sequence we use. As Chris Fields showed in his test, the problem
is that ">" is not the first character on the first line.
We always assumed the sequence was in FASTA format, and that assumption seems to be wrong.
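
To illustrate why the first line disappears, here is a minimal sketch in core Perl. It is
not BioPerl's actual parser; it only assumes the behavior Chris described further down in
this thread (records read in chunks ending at "\n>", first line of each chunk taken as the
header). The sub name naive_fasta_records and the sample sequences are made up for the example.

```perl
use strict;
use warnings;

# Simplified stand-in for a FASTA reader, NOT BioPerl's actual code.
# Assumption (from the thread): records are read in chunks ending at "\n>"
# and the first line of each chunk is taken as the header.
sub naive_fasta_records {
    my ($text) = @_;
    my @records;
    open my $fh, '<', \$text or die "cannot open in-memory handle: $!";
    local $/ = "\n>";
    while ( my $chunk = <$fh> ) {
        chomp $chunk;        # drop the trailing "\n>" separator, if present
        $chunk =~ s/^>//;    # drop the leading ">" of the first record
        my ( $header, @lines ) = split /\n/, $chunk;
        push @records, { header => $header, seq => join( '', @lines ) };
    }
    close $fh;
    return @records;
}

# Made-up sample data: the same sequence with and without a header line.
my $fasta = ">seq1 example\nACGTACGT\nTTGGCCAA\n";
my $raw   = "ACGTACGT\nTTGGCCAA\n";

my ($from_fasta) = naive_fasta_records($fasta);
my ($from_raw)   = naive_fasta_records($raw);

print "fasta: header='$from_fasta->{header}' seq=$from_fasta->{seq}\n";
print "raw:   header='$from_raw->{header}' seq=$from_raw->{seq}\n";
```

With the raw input, the first line of sequence ends up in the header slot and is missing
from the sequence, which matches what we see.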

I think the solution to our problem is to check whether the ">" symbol is present at the
start of the input. If it is not present, we will add it.
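
A rough sketch of that check, in core Perl only. The helper name ensure_fasta_header and
the placeholder ID "unnamed" are hypothetical and not part of BioPerl; this is just one way
the check could look, not a tested fix.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): return the text unchanged if
# its first non-whitespace character is ">", otherwise prepend a
# placeholder header line so the text can be read as FASTA.
sub ensure_fasta_header {
    my ( $text, $id ) = @_;
    $id = 'unnamed' unless defined $id && length $id;
    return $text if $text =~ /\A\s*>/;
    $text =~ s/\A\s+//;    # drop leading whitespace before adding the header
    return ">$id\n$text";
}

# Made-up sample inputs: one raw, one already in FASTA format.
print ensure_fasta_header("ACGTACGT\nTTGGCCAA\n");
print ensure_fasta_header(">seq1\nACGTACGT\n");
```

The first call prints the sequence with a ">unnamed" line in front; the second prints its
input unchanged. The result could then be handed to Bio::SeqIO as fasta, as before.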

Thank you,
Cornel Ghiban

-----Original Message-----
From: Hilmar Lapp [mailto:hlapp at gmx.net] 
Sent: Thursday, August 06, 2009 11:18 AM
To: Hilgert, Uwe
Cc: Chris Fields; BioPerl List; Ghiban, Cornel
Subject: Re: [Bioperl-l] Bio::SeqIO issue

Uwe - could you send an actual data file (as an attachment) that reproduces the problem, or is that not possible?

	-hilmar

On Aug 6, 2009, at 11:01 AM, Hilgert, Uwe wrote:

> I'm not sure what version we have. Cornel may have installed it a 
> while ago from CVS:
>
> Module id = Bio::Root::Build
>    CPAN_USERID  CJFIELDS (Christopher Fields <cjfields at bioperl.org>)
>    CPAN_VERSION 1.006000
>    INST_FILE    /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Build.pm
>    INST_VERSION 1.006900
> cpan> m Bio::Root::Version
> Module id = Bio::Root::Version
>    CPAN_USERID  CJFIELDS (Christopher Fields <cjfields at bioperl.org>)
>    CPAN_VERSION 1.006000
>    INST_FILE    /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Version.pm
>    INST_VERSION 1.006900
> cpan> m Bio::SeqIO
> Module id = Bio::SeqIO
>    CPAN_USERID  CJFIELDS (Christopher Fields <cjfields at bioperl.org>)
>    CPAN_VERSION 1.006000
>    INST_FILE    /usr/lib/perl5/site_perl/5.8.8/Bio/SeqIO.pm
>    INST_VERSION undef
>
> Cornel still has the checked-out "bioperl-live" directory and the last 
> changes are from March this year.
>
> As for why he used 'Fasta' instead of 'fasta' as the format parameter
> in Bio::SeqIO, it's because that's what it says in the module's manual.
> He has now tried 'fasta' instead and sees no change in behavior. Omitting
> the format parameter altogether, fasta-formatted sequence continues to 
> be treated correctly, the first line being removed. However, raw 
> sequence is being treated differently in that the first line is not 
> being removed any more. Instead, the program returns the first line 
> only. Which, in the example I am going to forward in my next message, 
> will return 60 amino acids out of raw sequence of 300 aa. Can't win 
> with raw sequence...
>
>
> The files may be created on different platforms, we didn't notice any 
> difference between using files created on Windows or Linux.
>
> Thanks
> Uwe
>
>
>
>
> -----Original Message-----
> From: Hilmar Lapp [mailto:hlapp at gmx.net]
> Sent: Wednesday, August 05, 2009 6:54 PM
> To: Chris Fields
> Cc: Hilgert, Uwe; BioPerl List
> Subject: Re: [Bioperl-l] Bio::SeqIO issue
>
> I don't think that can be the problem. If anything, providing the 
> format ought to be better in terms of result than not providing it?
>
> Uwe - I'd like you to go back to Chris' initial questions that you 
> haven't answered yet: "What version of bioperl are you using, OS, etc?  
> What does your data look like?" I'd add to that, can you show us your 
> full script, or a smaller code snippet that reproduces the problem.
>
> I suspect that either something in your script is swallowing the line, 
> or that the line endings in your data file are from a different OS 
> than the one you're running the script on. (Or that you are running a 
> very old version of BioPerl, which is entirely possible if you 
> installed through CPAN.)
>
> 	-hilmar
>
> On Aug 5, 2009, at 5:37 PM, Chris Fields wrote:
>
>> Uwe,
>>
>> Please keep replies on the list.
>>
>> It's very possible that's the issue; IIRC the fasta parser pulls out 
>> the full sequence in chunks (based on local $/ = "\n>") and splits 
>> the header off as the first line in that chunk.  You could probably 
>> try leaving the format out and letting SeqIO guess it, or passing the 
>> file into Bio::Tools::GuessSeqFormat directly, but it's probably 
>> better to go through the files and add a file extension that 
>> corresponds to the format.
>>
>> chris
>>
>> On Aug 5, 2009, at 4:23 PM, Hilgert, Uwe wrote:
>>
>>> Thanks, Chris. The files have no extension, but we indicate what 
>>> format to use, like in the manual:
>>>
>>> $in  = Bio::SeqIO->new(-file => "file_path", -format => 'Fasta');
>>>
>>> I wonder now whether exactly this could be causing the problem: since we
>>> tell Bio::SeqIO that the input files are in fasta format, they are treated
>>> as such (= first line removed), regardless of whether they really are
>>> fasta?
>>>
>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>>> Uwe Hilgert, Ph.D.
>>> Dolan DNA Learning Center
>>> Cold Spring Harbor Laboratory
>>>
>>> C: (516) 857-1693
>>> V: (516) 367-5185
>>> E: hilgert at cshl.edu
>>> F: (516) 367-5182
>>> W: http://www.dnalc.org
>>>
>>> -----Original Message-----
>>> From: Chris Fields [mailto:cjfields at illinois.edu]
>>> Sent: Wednesday, August 05, 2009 5:04 PM
>>> To: Hilgert, Uwe
>>> Cc: bioperl-l at lists.open-bio.org
>>> Subject: Re: [Bioperl-l] Bio::SeqIO issue
>>>
>>> On Aug 5, 2009, at 3:27 PM, Hilgert, Uwe wrote:
>>>
>>>> Is my impression correct that Bio::SeqIO just assumes that 
>>>> sequences are being submitted in FASTA format?
>>>
>>> No. See:
>>>
>>> http://www.bioperl.org/wiki/HOWTO:SeqIO
>>>
>>> SeqIO tries to guess at the format using the file extension, and if 
>>> one isn't present makes use of Bio::Tools::GuessSeqFormat.  It's 
>>> possible that the extension is causing the problem, or that 
>>> GuessSeqFormat is guessing wrong (it's apt to do that, as it's forced
>>> to guess).  In any case, it's always advisable to explicitly
>>> indicate the format when possible.
>>>
>>> Relevant lines:
>>>
>>>  return 'fasta'   if /\.(fasta|fast|fas|seq|fa|fsa|nt|aa|fna|faa)$/i;
>>> ...
>>>  return 'raw'     if /\.(txt)$/i;
>>>
>>>> In our experience, implementing
>>>> Bio::SeqIO led to the first line of files being cut off, regardless 
>>>> of whether the files were indeed fasta files or files that only 
>>>> contained sequence.
>>>
>>> Files that only contain sequence are 'raw'.  Ones in FASTA are 
>>> 'fasta'.
>>>
>>>> Which, in the latter, led to sequence submissions that had the 
>>>> first line of nucleotides removed. Has anyone tried to write a fix 
>>>> for this?
>>>
>>> This sounds like a bug, but we have very little to go on beyond your 
>>> description.  What version of bioperl are you using, OS, etc?  What 
>>> does your data look like?  File extension?
>>>
>>> chris
>>>
>>>> Thanks,
>>>>
>>>> Uwe
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>>>>
>>>> Uwe Hilgert, Ph.D.
>>>>
>>>> Dolan DNA Learning Center
>>>>
>>>> Cold Spring Harbor Laboratory
>>>>
>>>>
>>>>
>>>> V: (516) 367-5185
>>>>
>>>> E: hilgert at cshl.edu <mailto:hilgert at cshl.edu>
>>>>
>>>> F: (516) 367-5182
>>>>
>>>> W: http://www.dnalc.org
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>
>
> --
> ===========================================================
> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
> ===========================================================
>
>

--
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================






