[Bioperl-l] Bio::SeqIO issue
Chris Fields
cjfields at illinois.edu
Wed Aug 5 21:37:52 UTC 2009
Uwe,
Please keep replies on the list.
It's very possible that's the issue; IIRC the fasta parser reads each
record as one chunk (based on local $/ = "\n>") and splits the header
off as the first line of that chunk, so a file containing only sequence
would lose its first line to the header. You could try leaving the
format out and letting SeqIO guess it, or passing the file to
Bio::Tools::GuessSeqFormat directly, but it's probably better to go
through the files and add a file extension that corresponds to the
format.
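
To illustrate the chunking described above, here is a minimal sketch in plain Perl. It is not BioPerl's actual parser code: parse_as_fasta is a hypothetical helper written only for this example, and it assumes nothing beyond the behaviour mentioned (record separator "\n>", first line of each chunk taken as the header).

```perl
use strict;
use warnings;

# Hypothetical helper, NOT BioPerl's implementation: read records with the
# input record separator set to "\n>" and treat the first line of each
# chunk as the header, whatever that line contains.
sub parse_as_fasta {
    my ($text) = @_;
    open my $fh, '<', \$text or die "cannot open in-memory file: $!";
    local $/ = "\n>";
    my @records;
    while (defined(my $chunk = <$fh>)) {
        $chunk =~ s/\n>\z//;    # drop the trailing record separator
        $chunk =~ s/\A>//;      # drop the leading '>' of the first record
        my ($header, $body) = split /\n/, $chunk, 2;
        $body = '' unless defined $body;
        $body =~ s/\s+//g;      # join the sequence lines
        push @records, { id => $header, seq => $body };
    }
    close $fh;
    return @records;
}

my $fasta = ">seq1 demo\nACGT\nTTGA\n";   # real FASTA record
my $raw   = "ACGT\nTTGA\n";               # sequence only, no header

my ($f) = parse_as_fasta($fasta);
my ($r) = parse_as_fasta($raw);

print "fasta: id=$f->{id} seq=$f->{seq}\n";
print "raw:   id=$r->{id} seq=$r->{seq}\n";
```

With the raw input, the first line of nucleotides ends up as the id and is missing from the sequence, which matches the symptom reported in this thread.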
chris
On Aug 5, 2009, at 4:23 PM, Hilgert, Uwe wrote:
> Thanks, Chris. The files have no extension, but we indicate what format
> to use, like in the manual:
>
> $in = Bio::SeqIO->new(-file => "file_path", -format => 'Fasta');
>
> I wonder now whether exactly this could be causing the problem: because we
> tell SeqIO that the input files are in fasta format, they are treated as
> such (i.e. the first line is removed), regardless of whether they really
> are fasta?
>
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> Uwe Hilgert, Ph.D.
> Dolan DNA Learning Center
> Cold Spring Harbor Laboratory
>
> C: (516) 857-1693
> V: (516) 367-5185
> E: hilgert at cshl.edu
> F: (516) 367-5182
> W: http://www.dnalc.org
>
> -----Original Message-----
> From: Chris Fields [mailto:cjfields at illinois.edu]
> Sent: Wednesday, August 05, 2009 5:04 PM
> To: Hilgert, Uwe
> Cc: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Bio::SeqIO issue
>
> On Aug 5, 2009, at 3:27 PM, Hilgert, Uwe wrote:
>
>> Is my impression correct that Bio::SeqIO just assumes that sequences are
>> being submitted in FASTA format?
>
> No. See:
>
> http://www.bioperl.org/wiki/HOWTO:SeqIO
>
> SeqIO tries to guess the format from the file extension, and if one
> isn't present it makes use of Bio::Tools::GuessSeqFormat. It's
> possible that the extension is causing the problem, or that
> GuessSeqFormat is guessing wrong (it's apt to do that, as it's forced
> to guess). In any case, it's always advisable to explicitly indicate
> the format when possible.
>
> Relevant lines:
>
> return 'fasta' if /\.(fasta|fast|fas|seq|fa|fsa|nt|aa|fna|faa)$/i;
> ...
> return 'raw' if /\.(txt)$/i;
>
>> In our experience, implementing Bio::SeqIO led to the first line of files
>> being cut off, regardless of whether the files were indeed fasta files or
>> files that only contained sequence.
>
> Files that only contain sequence are 'raw'. Ones in FASTA are 'fasta'.
>
>> In the latter case, this led to sequence submissions that had the
>> first line of nucleotides removed. Has anyone tried to write a fix for
>> this?
>
> This sounds like a bug, but we have very little to go on beyond your
> description. What version of bioperl are you using, OS, etc? What
> does your data look like? File extension?
>
> chris
>
>> Thanks,
>>
>> Uwe