[Bioperl-l] Bio::SeqIO issue

Sun Aug 9 02:41:53 UTC 2009

Hello,
  It sounds like you want a layer to to figure out what they're giving
your program before you open it, you could use Bio::Tools::GuessSeqFormat
and spare your user the pain of knowledge.  It seems reasonable that
coddling happens only when requested.

use IO::String;
use Bio::SeqIO;
use Bio::Tools::GuessSeqFormat;

my @files = ( 'NC_000913.fasta', '.gb' );

for my $file ( @files ) {
  my ( $string, $strio, $out );
  $strio = IO::String->new( $string );
  $out = Bio::SeqIO->new ( -fh => $strio, -format => 'raw' );

  my $guesser = new Bio::Tools::GuessSeqFormat( -file => $file ); 
  my $in = Bio::SeqIO->new( -format => $guesser->guess , -file => $file ); 

  while ( my $seq = $in->next_seq() ) {
    $out->write_seq( $seq );
    print substr($string, 0, 30), "\n";
  }
}

Joel

On Thu, Aug 06, 2009 at 03:36:36PM -0400, Hilgert, Uwe wrote:
> Hmmm, I fail to see how supplying raw sequence could be a called "bad"
> input or a "problem". In our case, for example, not every user is a
> bioinformatics expert and Cornel was suggesting to account for that
> instead of trying to "train" the user to adhere to requirements that
> have not much to do with what s/he tries to accomplish. I don't really
> see data being modified, rather that the data format is being adopted to
> the needs of the software; which I would argue should be something the
> software is being able to take care of.
> 
> Uwe
> 
> 
> -----Original Message-----
> From: Chris Fields [mailto:cjfields at illinois.edu] 
> Sent: Thursday, August 06, 2009 12:50 PM
> To: Ghiban, Cornel
> Cc: Hilmar Lapp; Hilgert, Uwe; BioPerl List
> Subject: Re: [Bioperl-l] Bio::SeqIO issue
> 
> Cornel,
> 
> I'm failing to see how adding '>' would solve the problem.
> 
> This is a simple validation issue: should we throw an exception on bad  
> input (no '>'), or just argue GIGO based on user error (the assumption  
> that the SeqIO parser will read raw sequence correctly when set to  
> 'fasta' is wrong)?
> 
> I think, in this circumstance, the former applies.  It is easy to add,  
> and the use of an exception in this case is violently user-friendly,  
> e.g. it will stop cold and immediately point out the problem.   
> Otherwise data is (silently) being modified, which is always a bad  
> thing.
> 
> chris
> 
> On Aug 6, 2009, at 11:04 AM, Ghiban, Cornel wrote:
> 
> > Hi,
> >
> > It doesn't matter what sequence we use. As Chris Fields's showed in  
> > his test, not having
> > ">" as the 1st character on the first line is the problem.
> > We always assumed the sequence is in FASTA format and this seems to  
> > be wrong.
> >
> > I think, the solution to our problem is to check whether the ">"  
> > symbol is present or not.
> > If not present then it will be added.
> >
> > Thank you,
> > Cornel Ghiban
> >
> > -----Original Message-----
> > From: Hilmar Lapp [mailto:hlapp at gmx.net]
> > Sent: Thursday, August 06, 2009 11:18 AM
> > To: Hilgert, Uwe
> > Cc: Chris Fields; BioPerl List; Ghiban, Cornel
> > Subject: Re: [Bioperl-l] Bio::SeqIO issue
> >
> > Uwe - could you send an actual data file (as an attachment) that  
> > reproduces the problem, or is that not possible?
> >
> > 	-hilmar
> >
> > On Aug 6, 2009, at 11:01 AM, Hilgert, Uwe wrote:
> >
> >> I'm not sure what version we have. Cornel may have installed it a
> >> while ago from CVS:
> >>
> >> Module id = Bio::Root::Build
> >>   CPAN_USERID  CJFIELDS (Christopher Fields <cjfields at bioperl.org>)
> >>   CPAN_VERSION 1.006000
> >>   INST_FILE    /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Build.pm
> >>   INST_VERSION 1.006900
> >> cpan> m Bio::Root::Version
> >> Module id = Bio::Root::Version
> >>   CPAN_USERID  CJFIELDS (Christopher Fields <cjfields at bioperl.org>)
> >>   CPAN_VERSION 1.006000
> >>   INST_FILE    /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Version.pm
> >>   INST_VERSION 1.006900
> >> cpan> m Bio::SeqIO
> >> Module id = Bio::SeqIO
> >>   CPAN_USERID  CJFIELDS (Christopher Fields <cjfields at bioperl.org>)
> >>   CPAN_VERSION 1.006000
> >>   INST_FILE    /usr/lib/perl5/site_perl/5.8.8/Bio/SeqIO.pm
> >>   INST_VERSION undef
> >>
> >> Cornel still has the checked-out "bioperl-live" directory and the  
> >> last
> >> changes are from March this year.
> >>
> >> As per why he used "Fasta" instead of 'fasta" as the format parameter
> >> in Bio::SeqIO, it's because that what it says in the modules manual.
> >> He now tried 'fasta' instead and see no changes in behavior. Omitting
> >> the format parameter altogether, fasta-formatted sequence continues  
> >> to
> >> be treated correctly, the first line being removed. However, raw
> >> sequence is being treated differently in that the first line is not
> >> being removed any more. Instead, the program returns the first line
> >> only. Which, in the example I am going to forward in my next message,
> >> will return 60 amino acids out of raw sequence of 300 aa. Can't win
> >> with raw sequence...
> >>
> >>
> >> The files may be created on different platforms, we didn't notice any
> >> difference between using files created on Windows or Linux.
> >>
> >> Thanks
> >> Uwe
> >>
> >>
> >>
> >>
> >> -----Original Message-----
> >> From: Hilmar Lapp [mailto:hlapp at gmx.net]
> >> Sent: Wednesday, August 05, 2009 6:54 PM
> >> To: Chris Fields
> >> Cc: Hilgert, Uwe; BioPerl List
> >> Subject: Re: [Bioperl-l] Bio::SeqIO issue
> >>
> >> I don't think that can be the problem. If anything, providing the
> >> format ought to be better in terms of result than not providing it?
> >>
> >> Uwe - I'd like you to go back to Chris' initial questions that you
> >> haven't answered yet: "What version of bioperl are you using, OS,  
> >> etc?
> >> What does your data look like?" I'd add to that, can you show us your
> >> full script, or a smaller code snippet that reproduces the problem.
> >>
> >> I suspect that either something in your script is swallowing the  
> >> line,
> >> or that the line endings in your data file are from a different OS
> >> than the one you're running the script on. (Or that you are running a
> >> very old version of BioPerl, which is entirely possible if you
> >> installed through CPAN.)
> >>
> >> 	-hilmar
> >>
> >> On Aug 5, 2009, at 5:37 PM, Chris Fields wrote:
> >>
> >>> Uwe,
> >>>
> >>> Please keep replies on the list.
> >>>
> >>> It's very possible that's the issue; IIRC the fasta parser pulls out
> >>> the full sequence in chunks (based on local $/ = "\n>") and splits
> >>> the header off as the first line in that chunk.  You could probably
> >>> try leaving the format out and letting SeqIO guess it, or passing  
> >>> the
> >>> file into Bio::Tools::GuessSeqFormat directly, but it's probably
> >>> better to go through the files and add a file extension that
> >>> corresponds to the format.
> >>>
> >>> chris
> >>>
> >>> On Aug 5, 2009, at 4:23 PM, Hilgert, Uwe wrote:
> >>>
> >>>> Thanks, Chris. The files have no extension, but we indicate what
> >>>> format to use, like in the manual:
> >>>>
> >>>> $in  = Bio::SeqIO->new(-file => "file_path", -format => 'Fasta');
> >>>>
> >>>> I wonder now whether this could exactly cause the problem: as we  
> >>>> are
> >>>> telling that input files are in fasta format they are being treated
> >>>> as such (=remove first line) - regardless of whether they really  
> >>>> are
> >>>> fasta?
> >>>>
> >>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Uwe
> >>>> Hilgert, Ph.D.
> >>>> Dolan DNA Learning Center
> >>>> Cold Spring Harbor Laboratory
> >>>>
> >>>> C: (516) 857-1693
> >>>> V: (516) 367-5185
> >>>> E: hilgert at cshl.edu
> >>>> F: (516) 367-5182
> >>>> W: http://www.dnalc.org
> >>>>
> >>>> -----Original Message-----
> >>>> From: Chris Fields [mailto:cjfields at illinois.edu]
> >>>> Sent: Wednesday, August 05, 2009 5:04 PM
> >>>> To: Hilgert, Uwe
> >>>> Cc: bioperl-l at lists.open-bio.org
> >>>> Subject: Re: [Bioperl-l] Bio::SeqIO issue
> >>>>
> >>>> On Aug 5, 2009, at 3:27 PM, Hilgert, Uwe wrote:
> >>>>
> >>>>> Is my impression correct that Bio::SeqIO just assumes that
> >>>>> sequences are being submitted in FASTA format?
> >>>>
> >>>> No. See:
> >>>>
> >>>> http://www.bioperl.org/wiki/HOWTO:SeqIO
> >>>>
> >>>> SeqIO tries to guess at the format using the file extension, and if
> >>>> one isn't present makes use of Bio::Tools::GuessSeqFormat.  It's
> >>>> possible that the extension is causing the problem, or that
> >>>> GuessSeqFormat guessing wrong (it's apt to do that, as it's forced
> >>>> to guessing).  In any case, it's always advisable to explicitly
> >>>> indicate the format when possible.
> >>>>
> >>>> Relevant lines:
> >>>>
> >>>> return 'fasta'   if /\.(fasta|fast|fas|seq|fa|fsa|nt|aa|fna|faa)$/
> >>>> i;
> >>>> ...
> >>>> return 'raw'     if /\.(txt)$/i;
> >>>>
> >>>>> In our experience, implementing
> >>>>> Bio::SeqIO led to the first line of files being cut off,  
> >>>>> regardless
> >>>>> of whether the files were indeed fasta files or files that only
> >>>>> contained sequence.
> >>>>
> >>>> Files that only contain sequence are 'raw'.  Ones in FASTA are
> >>>> 'fasta'.
> >>>>
> >>>>> Which, in the latter, led to sequence submissions that had the
> >>>>> first line of nucleotides removed. Has anyone tried to write a fix
> >>>>> for this?
> >>>>
> >>>> This sounds like a bug, but we have very little to go on beyond  
> >>>> your
> >>>> description.  What version of bioperl are you using, OS, etc?  What
> >>>> does your data look like?  File extension?
> >>>>
> >>>> chris
> >>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Uwe
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> >>>>>
> >>>>> Uwe Hilgert, Ph.D.
> >>>>>
> >>>>> Dolan DNA Learning Center
> >>>>>
> >>>>> Cold Spring Harbor Laboratory
> >>>>>
> >>>>>
> >>>>>
> >>>>> V: (516) 367-5185
> >>>>>
> >>>>> E: hilgert at cshl.edu <mailto:hilgert at cshl.edu>
> >>>>>
> >>>>> F: (516) 367-5182
> >>>>>
> >>>>> W: http://www.dnalc.org
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> _______________________________________________
> >>>>> Bioperl-l mailing list
> >>>>> Bioperl-l at lists.open-bio.org
> >>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >>>>
> >>>
> >>> _______________________________________________
> >>> Bioperl-l mailing list
> >>> Bioperl-l at lists.open-bio.org
> >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >>
> >> --
> >> ===========================================================
> >> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
> >> ===========================================================
> >>
> >>
> >
> > --
> > ===========================================================
> > : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
> > ===========================================================
> >
> >
> >
> 
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l