[Bioperl-l] Another GuessSeqFormat question

Heikki Lehvaslaiho heikki at ebi.ac.uk
Thu Aug 18 05:54:37 EDT 2005


Tim,

I thought there must be something in your problem I did not catch!

In principle it could be done, but in practice it would be really difficult: 
these text-based formats just vary too much, as the most recent GuessSeqFormat 
shows well. I would suggest that you try to determine ways to separate 
AlignIO, SeqIO and SearchIO files from each other and then call the 
appropriate one. Once you have the heuristics together, you might want to 
think about putting the logic into a module.
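
As a rough illustration of what such a separating heuristic could look like, 
here is a minimal sketch in plain Perl that inspects only the first non-blank 
line of the input and suggests an IO family plus a format name. The function 
name guess_io_family and the signature table are hypothetical (they are not 
part of BioPerl), the table covers only a handful of formats, and anything 
outside it simply comes back as unknown.

```perl
use strict;
use warnings;

# Hypothetical signature table: a pattern for the first non-blank line,
# followed by the IO family and the format name to try.
my @SIGNATURES = (
    [ qr/^CLUSTAL/,         'Bio::AlignIO',  'clustalw'  ],
    [ qr/^# STOCKHOLM 1\./, 'Bio::AlignIO',  'stockholm' ],
    [ qr/^T?BLAST[NPX]\s/,  'Bio::SearchIO', 'blast'     ],
    [ qr/^LOCUS\s/,         'Bio::SeqIO',    'genbank'   ],
    # Fasta is ambiguous (it may hold an alignment); SeqIO is only a default.
    [ qr/^>/,               'Bio::SeqIO',    'fasta'     ],
);

# Returns (class, format) for recognised input, or an empty list otherwise.
sub guess_io_family {
    my ($text) = @_;

    # Take the first line that contains something other than whitespace.
    my ($first) = grep { /\S/ } split /\n/, $text;
    return unless defined $first;

    for my $sig (@SIGNATURES) {
        my ($pattern, $class, $format) = @$sig;
        return ($class, $format) if $first =~ $pattern;
    }
    return;    # unknown: the caller has to decide what to do
}

my ($class, $format) =
    guess_io_family("CLUSTAL W (1.83) multiple sequence alignment\n");
print "$class $format\n";
```

With the pair in hand, the caller would load the class and construct the 
parser with -file and -format as usual; whether the guess was right is still 
best checked by wrapping the first read in an eval block.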

Fasta files pose a big problem here. There is no general way to know whether a 
fasta file represents an alignment or not. For your specific case, you 
might find a heuristic that tells them apart, e.g. the ratio of gap characters 
to residues, but that is highly unlikely to hold on someone else's data.
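
To make the gap-ratio idea concrete, here is a minimal sketch in plain Perl. 
The function name fasta_gap_ratio, the choice of '-' and '.' as gap 
characters, and the threshold value are all assumptions for illustration; the 
cut-off in particular would have to be tuned on your own files and should not 
be expected to transfer to other data.

```perl
use strict;
use warnings;

# Fraction of gap characters ('-' and '.') among all sequence characters
# in fasta text. Header lines (starting with '>') are skipped.
sub fasta_gap_ratio {
    my ($text) = @_;
    my ($gaps, $total) = (0, 0);

    for my $line (split /\n/, $text) {
        next if $line =~ /^>/;
        $line =~ s/\s+//g;                 # ignore whitespace inside sequence lines
        $gaps  += ($line =~ tr/-.//);      # tr with an empty replacement only counts
        $total += length $line;
    }
    return $total ? $gaps / $total : 0;    # empty input counts as gap-free
}

# Hypothetical cut-off, chosen only for this example.
my $THRESHOLD = 0.01;

my $fasta = ">a\nACGT--ACGT\n>b\nAC-TTTACGT\n";
my $ratio = fasta_gap_ratio($fasta);
print $ratio > $THRESHOLD ? "looks aligned\n" : "looks unaligned\n";
```

An alignment that happens to contain no gaps would be misclassified by this 
test, which is exactly why it cannot serve as a general solution.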

Good luck,

 -Heikki

On Thursday 18 August 2005 00:18, Tim Erwin wrote:
> Thanks, Heikki, but I am trying to parse different IO objects such as
> AlignIO, SeqIO and SearchIO. What I am trying to do is guess the
> format of any IO object and then use the appropriate parser.
>
> i.e. if I have an unknown file output.out, I want to guess the format and
> then the appropriate IO parser to use. Is there a way to do this, or
> should I just test all the IO parsers with an eval block?
>
> Regards,
>
> Tim
>
> On Wed, 2005-08-17 at 10:03 +0100, Heikki Lehvaslaiho wrote:
> > Tim,
> >
> > Bio::Tools::GuessSeqFormat is not meant to be used directly. It is called
> > automatically by the constructor (new() method) of Bio::SeqIO:
> >
> >  my $format = $param{'-format'} ||
> >      $class->_guess_format( $param{-file} || $ARGV[0] );
> >
> >  if( ! $format ) {
> >      if ($param{-file}) {
> >          $format = Bio::Tools::GuessSeqFormat->new(-file => $param{-file}||
> >                    $ARGV[0] )->guess;
> >      } elsif ($param{-fh}) {
> >          $format = Bio::Tools::GuessSeqFormat->new(-fh => $param{-fh}||
> >                    $ARGV[0] )->guess;
> >      }
> >  }
> >  # ... code removed
> >  return "Bio::SeqIO::$format"->new(@args);
> >
> > The logic from the above code is as follows:
> >
> > 1. _guess_format() tries to determine the format of the file based on the
> > filename extension.
> >
> > 2. Only if that fails does it look into the file/stream to guess the
> > format, using the Bio::Tools::GuessSeqFormat code.
> >
> > 3. The returned object is not a Bio::SeqIO but a Bio::SeqIO::$format
> > object, which has the correct next_seq() and write_seq() methods. You can
> > therefore use ref($seqobject) to find out what parser is being used.
> >
> >
> >
> > The standard code for doing this should contain all the automation
> > needed:
> >
> > foreach my $inputfilename (@all_files) {
> >     my $in  = Bio::SeqIO->new(-file => $inputfilename);
> >     while ( my $seq = $in->next_seq() ) {
> >      # do something
> >     }
> > }
> >
> >
> > Yours,
> >        -Heikki
> >
> > On Wednesday 17 August 2005 08:15, Tim Erwin wrote:
> > > Hi,
> > >
> > > Is there a way to determine which parser to use based on the guess from
> > > Bio::Tools::GuessSeqFormat without hard coding a hash? I am interested
> > > in parsing and storing various files to a database.
> > >
> > > I was wondering if it is a good idea to make some extra functions so
> > > that files could be parsed automatically.
> > >
> > > e.g. for a fasta file
> > >
> > > my $obj = Bio::Tools::GuessSeqFormat->new( -file => $filename );
> > > my $format = $obj->guess;
> > > my $parser = $obj->parser;              #RETURNS Bio::SeqIO
> > > my $next_method = $obj->next_method;    #RETURNS next_seq
> > > my $write_method = $obj->write_method;  #RETURNS write_seq
> > >
> > > #PARSE FILE
> > > my $infile = $parser->new(-file => $filename, -format => $format);
> > > while (my $result = $infile->$next_method) {
> > >
> > >   #DO STUFF HERE
> > >   #ADD $result TO DATABASE
> > >
> > > }
> > >
> > > Perhaps there is a better way to do this? Any suggestions would be
> > > great.
> > >
> > > Regards,
> > >
> > > Tim
> > >
> > > _______________________________________________
> > > Bioperl-l mailing list
> > > Bioperl-l at portal.open-bio.org
> > > http://portal.open-bio.org/mailman/listinfo/bioperl-l

-- 
______ _/      _/_____________________________________________________
      _/      _/                      http://www.ebi.ac.uk/mutations/
     _/  _/  _/  Heikki Lehvaslaiho    heikki at_ebi _ac _uk
    _/_/_/_/_/  EMBL Outstation, European Bioinformatics Institute
   _/  _/  _/  Wellcome Trust Genome Campus, Hinxton
  _/  _/  _/  Cambridge, CB10 1SD, United Kingdom
     _/      Phone: +44 (0)1223 494 644   FAX: +44 (0)1223 494 468
___ _/_/_/_/_/________________________________________________________

