[Bioperl-l] SeqIO bug?

Heikki Lehvaslaiho heikki at nildram.co.uk
Sat Feb 12 07:38:39 EST 2005


Ryan,

Most of our parsers assume that if you state the format, the sequences really
are in that format. There is some built-in guessing in SeqIO: if you do not
specify the format, the code will look into the sequence file.

If the file does not follow any well-defined format, it is very difficult to
guess what it could be...
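
Because the fasta parser trusts the stated format, one way to catch a missing
header is to check the file before handing it to SeqIO. The following is only
a sketch in core Perl: looks_like_fasta is a hypothetical helper, not part of
BioPerl, and it assumes that a FASTA file's first non-blank line starts
with ">".

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not a BioPerl function): returns 1 if the first
# non-blank line of the file starts with '>', otherwise 0.
sub looks_like_fasta {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $result = 0;                  # stays 0 for an empty file
    while (my $line = <$fh>) {
        next if $line =~ /^\s*$/;    # ignore leading blank lines
        $result = ($line =~ /^>/) ? 1 : 0;
        last;                        # only the first real line matters
    }
    close $fh;
    return $result;
}

# Usage: perl check_fasta.pl myseqfile
if (@ARGV) {
    my $file = shift @ARGV;
    my $msg  = looks_like_fasta($file)
        ? "$file starts with a FASTA header\n"
        : "$file has no FASTA header\n";
    print $msg;
}
```

If the check fails, the file can be converted first instead of being read
with -format => 'fasta'.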

As for your raw output: raw assumes that there is one sequence per line. In
your case, 70 seems to be the number of characters in one line.
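
To see the difference between the length of one line and the length of the
whole sequence, the lines can be summed by hand. This is a rough sketch that
assumes the file holds a single header-less sequence wrapped over several
lines; wrapped_length is a hypothetical helper name.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: total residue count of a header-less file,
# treating all lines as ONE wrapped sequence (an assumption).
sub wrapped_length {
    my ($fh) = @_;
    my $total = 0;
    while (my $line = <$fh>) {
        $line =~ s/\s+//g;        # drop the newline and any stray spaces
        $total += length $line;
    }
    return $total;
}

# Usage: perl seqlen.pl myseqfile
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    my $len = wrapped_length($fh);
    close $fh;
    print "$len\n";
}
```

If this number is larger than 70, the raw parser was returning only the first
line as its first sequence.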

I suggest that you first convert your sequences into fasta with a simple
script. You do not say what the delimiting character or characters between
sequences are, nor whether you have names for them, so I cannot write the
script for you, but you can use the following code as a basis:

#----------------------------
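#!/usr/bin/env perl
use strict;
use warnings;

# Each record ends with a line containing only "xx";
# change this to whatever separates your sequences.
my $delimiter = "xx\n";
$/ = $delimiter;

my $count = 0;

while (<>) {
    chomp;                # strip the trailing delimiter ($/)
    s/[\W\d]//g;          # drop whitespace, punctuation and digits
    next unless length;   # skip empty records
    $count++;
    print ">$count\n$_\n";
}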
#----------------------------

It assumes files like this as input:

-----------------------------------------------
ag cagc
xx
1 catgctagctacgtatgc
2 cgtcagctagctga
3 catcgtagc
xx
ttt tgtt ttatt atatat
xx
-----------------------------------------------
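
As a check on what the conversion produces for this sample, here is the same
logic wrapped in a subroutine and run on the input above from memory. The
name records_to_fasta is hypothetical, and the sketch skips empty records.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical wrapper around the conversion loop; returns FASTA text.
sub records_to_fasta {
    my ($fh, $delimiter) = @_;
    local $/ = $delimiter;         # read one delimited record at a time
    my ($count, $fasta) = (0, '');
    while (my $record = <$fh>) {
        chomp $record;             # remove the trailing delimiter
        $record =~ s/[\W\d]//g;    # drop whitespace, punctuation and digits
        next unless length $record;
        $count++;
        $fasta .= ">$count\n$record\n";
    }
    return $fasta;
}

my $input = <<'END';
ag cagc
xx
1 catgctagctacgtatgc
2 cgtcagctagctga
3 catcgtagc
xx
ttt tgtt ttatt atatat
xx
END

open my $in, '<', \$input or die "Cannot open in-memory input: $!";
my $output = records_to_fasta($in, "xx\n");
close $in;
print $output;
```

For the sample this prints three records:

>1
agcagc
>2
catgctagctacgtatgccgtcagctagctgacatcgtagc
>3
ttttgttttattatatat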


I hope this helps,

	-Heikki

On Friday 11 February 2005 19:31, Ryan Golhar wrote:
> I have a bunch of cDNA sequences that I'm trying to process.  The
> sequences are in FASTA format, but they are all missing the FASTA header,
> i.e. they just contain the sequence.  As a test to make sure I'm reading
> them in correctly, I'm doing the following:
>
>         my $seq_in = Bio::SeqIO->new(-file   => "<myseqfile",
>                                      -format => 'fasta');
>         my $seq = $seq_in->next_seq();
>         print $seq->length;
>
> It prints out a number, but reads the first line as the FASTA header
> even though it's not there.  Wouldn't it make more sense to either print
> out an error message about the missing FASTA header, or read in the file
> as just the sequence regardless of specifying the FASTA format?
>
> If I try to read the sequence in as "raw", the length is always printed
> out as 70...
>
> Ryan

-- 
______ _/      _/_____________________________________________________
      _/      _/                      http://www.ebi.ac.uk/mutations/
     _/  _/  _/  Heikki Lehvaslaiho    heikki at_ebi _ac _uk
    _/_/_/_/_/  EMBL Outstation, European Bioinformatics Institute
   _/  _/  _/  Wellcome Trust Genome Campus, Hinxton
  _/  _/  _/  Cambridge, CB10 1SD, United Kingdom
     _/      Phone: +44 (0)1223 494 644   FAX: +44 (0)1223 494 468
___ _/_/_/_/_/________________________________________________________

