[Bioperl-l] Bio::SeqIO issue

Wed Aug 5 21:37:52 UTC 2009

Uwe,

Please keep replies on the list.

It's very possible that's the issue; IIRC the fasta parser pulls out  
the full sequence in chunks (based on local $/ = "\n>") and splits the  
header off as the first line in that chunk.  You could probably try  
leaving the format out and letting SeqIO guess it, or passing the file  
into Bio::Tools::GuessSeqFormat directly, but it's probably better to  
go through the files and add a file extension that corresponds to the  
format.
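
To illustrate the chunking described above, here is a minimal sketch in plain Perl. It is not the actual Bio::SeqIO::fasta code; read_fasta_like and looks_like_fasta are hypothetical helpers written only for this example. It shows why a sequence-only file read as fasta loses its first line (that line is taken as the header), and one simple check that could be run before choosing between 'fasta' and 'raw'.

```perl
use strict;
use warnings;

# Hypothetical helper (not BioPerl code): read records using "\n>" as the
# input record separator and treat the first line of each chunk as the header.
sub read_fasta_like {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $/ = "\n>";
    my @records;
    while ( my $chunk = <$fh> ) {
        chomp $chunk;        # drops a trailing "\n>" if there is one
        $chunk =~ s/^>//;    # drops the leading '>' of the first record
        my ( $header, @lines ) = split /\n/, $chunk;
        my $seq = join '', @lines;
        $seq =~ s/\s+//g;
        push @records, { header => $header, seq => $seq };
    }
    close $fh;
    return @records;
}

# Hypothetical helper: assume the data is fasta only if the first
# non-whitespace character is '>'.
sub looks_like_fasta {
    my ($text) = @_;
    return $text =~ /\A\s*>/ ? 1 : 0;
}

# A sequence-only ("raw") file read as if it were fasta:
my $raw_text = "ACGT\nTTGG\nCCAA\n";
my ($rec) = read_fasta_like($raw_text);
print "header: $rec->{header}\n";    # the first line of sequence ends up here
print "seq:    $rec->{seq}\n";       # only the remaining lines are kept
print "looks like fasta? ", looks_like_fasta($raw_text), "\n";
```

With a check like this in front, the format string handed to SeqIO could be set to 'raw' for files that do not start with '>', instead of hard-coding 'Fasta' for every input.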

chris

On Aug 5, 2009, at 4:23 PM, Hilgert, Uwe wrote:

> Thanks, Chris. The files have no extension, but we indicate what format
> to use, like in the manual:
>
> $in  = Bio::SeqIO->new(-file => "file_path", -format => 'Fasta');
>
> I wonder now whether exactly this could be causing the problem: since we
> are telling SeqIO that the input files are in fasta format, they are
> treated as such (i.e. the first line is removed), regardless of whether
> they really are fasta?
>
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> Uwe Hilgert, Ph.D.
> Dolan DNA Learning Center
> Cold Spring Harbor Laboratory
>
> C: (516) 857-1693
> V: (516) 367-5185
> E: hilgert at cshl.edu
> F: (516) 367-5182
> W: http://www.dnalc.org
>
> -----Original Message-----
> From: Chris Fields [mailto:cjfields at illinois.edu]
> Sent: Wednesday, August 05, 2009 5:04 PM
> To: Hilgert, Uwe
> Cc: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Bio::SeqIO issue
>
> On Aug 5, 2009, at 3:27 PM, Hilgert, Uwe wrote:
>
>> Is my impression correct that Bio::SeqIO just assumes that sequences are
>> being submitted in FASTA format?
>
> No. See:
>
> http://www.bioperl.org/wiki/HOWTO:SeqIO
>
> SeqIO tries to guess at the format using the file extension, and if
> one isn't present makes use of Bio::Tools::GuessSeqFormat.  It's
> possible that the extension is causing the problem, or that
> GuessSeqFormat is guessing wrong (it's apt to do that, as it's forced
> to guess).  In any case, it's always advisable to explicitly indicate
> the format when possible.
>
> Relevant lines:
>
>    return 'fasta'   if /\.(fasta|fast|fas|seq|fa|fsa|nt|aa|fna|faa)$/i;
> ...
>    return 'raw'     if /\.(txt)$/i;
>
>> In our experience, implementing
>> Bio::SeqIO led to the first line of files being cut off, regardless of
>> whether the files were indeed fasta files or files that only contained
>> sequence.
>
> Files that only contain sequence are 'raw'.  Ones in FASTA are 'fasta'.
>
>> Which, in the latter, led to sequence submissions that had the
>> first line of nucleotides removed. Has anyone tried to write a fix for
>> this?
>
> This sounds like a bug, but we have very little to go on beyond your
> description.  What version of bioperl are you using, OS, etc?  What
> does your data look like?  File extension?
>
> chris
>
>> Thanks,
>>
>> Uwe