[Bioperl-l] Bio::SeqIO issue
Joel Martin
j_martin at lbl.gov
Sun Aug 9 02:41:53 UTC 2009
Hello,
It sounds like you want a layer to to figure out what they're giving
your program before you open it, you could use Bio::Tools::GuessSeqFormat
and spare your user the pain of knowledge. It seems reasonable that
coddling happens only when requested.
use IO::String;
use Bio::SeqIO;
use Bio::Tools::GuessSeqFormat;
my @files = ( 'NC_000913.fasta', '.gb' );
for my $file ( @files ) {
my ( $string, $strio, $out );
$strio = IO::String->new( $string );
$out = Bio::SeqIO->new ( -fh => $strio, -format => 'raw' );
my $guesser = new Bio::Tools::GuessSeqFormat( -file => $file );
my $in = Bio::SeqIO->new( -format => $guesser->guess , -file => $file );
while ( my $seq = $in->next_seq() ) {
$out->write_seq( $seq );
print substr($string, 0, 30), "\n";
}
}
Joel
On Thu, Aug 06, 2009 at 03:36:36PM -0400, Hilgert, Uwe wrote:
> Hmmm, I fail to see how supplying raw sequence could be a called "bad"
> input or a "problem". In our case, for example, not every user is a
> bioinformatics expert and Cornel was suggesting to account for that
> instead of trying to "train" the user to adhere to requirements that
> have not much to do with what s/he tries to accomplish. I don't really
> see data being modified, rather that the data format is being adopted to
> the needs of the software; which I would argue should be something the
> software is being able to take care of.
>
> Uwe
>
>
> -----Original Message-----
> From: Chris Fields [mailto:cjfields at illinois.edu]
> Sent: Thursday, August 06, 2009 12:50 PM
> To: Ghiban, Cornel
> Cc: Hilmar Lapp; Hilgert, Uwe; BioPerl List
> Subject: Re: [Bioperl-l] Bio::SeqIO issue
>
> Cornel,
>
> I'm failing to see how adding '>' would solve the problem.
>
> This is a simple validation issue: should we throw an exception on bad
> input (no '>'), or just argue GIGO based on user error (the assumption
> that the SeqIO parser will read raw sequence correctly when set to
> 'fasta' is wrong)?
>
> I think, in this circumstance, the former applies. It is easy to add,
> and the use of an exception in this case is violently user-friendly,
> e.g. it will stop cold and immediately point out the problem.
> Otherwise data is (silently) being modified, which is always a bad
> thing.
>
> chris
>
> On Aug 6, 2009, at 11:04 AM, Ghiban, Cornel wrote:
>
> > Hi,
> >
> > It doesn't matter what sequence we use. As Chris Fields's showed in
> > his test, not having
> > ">" as the 1st character on the first line is the problem.
> > We always assumed the sequence is in FASTA format and this seems to
> > be wrong.
> >
> > I think, the solution to our problem is to check whether the ">"
> > symbol is present or not.
> > If not present then it will be added.
> >
> > Thank you,
> > Cornel Ghiban
> >
> > -----Original Message-----
> > From: Hilmar Lapp [mailto:hlapp at gmx.net]
> > Sent: Thursday, August 06, 2009 11:18 AM
> > To: Hilgert, Uwe
> > Cc: Chris Fields; BioPerl List; Ghiban, Cornel
> > Subject: Re: [Bioperl-l] Bio::SeqIO issue
> >
> > Uwe - could you send an actual data file (as an attachment) that
> > reproduces the problem, or is that not possible?
> >
> > -hilmar
> >
> > On Aug 6, 2009, at 11:01 AM, Hilgert, Uwe wrote:
> >
> >> I'm not sure what version we have. Cornel may have installed it a
> >> while ago from CVS:
> >>
> >> Module id = Bio::Root::Build
> >> CPAN_USERID CJFIELDS (Christopher Fields <cjfields at bioperl.org>)
> >> CPAN_VERSION 1.006000
> >> INST_FILE /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Build.pm
> >> INST_VERSION 1.006900
> >> cpan> m Bio::Root::Version
> >> Module id = Bio::Root::Version
> >> CPAN_USERID CJFIELDS (Christopher Fields <cjfields at bioperl.org>)
> >> CPAN_VERSION 1.006000
> >> INST_FILE /usr/lib/perl5/site_perl/5.8.8/Bio/Root/Version.pm
> >> INST_VERSION 1.006900
> >> cpan> m Bio::SeqIO
> >> Module id = Bio::SeqIO
> >> CPAN_USERID CJFIELDS (Christopher Fields <cjfields at bioperl.org>)
> >> CPAN_VERSION 1.006000
> >> INST_FILE /usr/lib/perl5/site_perl/5.8.8/Bio/SeqIO.pm
> >> INST_VERSION undef
> >>
> >> Cornel still has the checked-out "bioperl-live" directory and the
> >> last
> >> changes are from March this year.
> >>
> >> As per why he used "Fasta" instead of 'fasta" as the format parameter
> >> in Bio::SeqIO, it's because that what it says in the modules manual.
> >> He now tried 'fasta' instead and see no changes in behavior. Omitting
> >> the format parameter altogether, fasta-formatted sequence continues
> >> to
> >> be treated correctly, the first line being removed. However, raw
> >> sequence is being treated differently in that the first line is not
> >> being removed any more. Instead, the program returns the first line
> >> only. Which, in the example I am going to forward in my next message,
> >> will return 60 amino acids out of raw sequence of 300 aa. Can't win
> >> with raw sequence...
> >>
> >>
> >> The files may be created on different platforms, we didn't notice any
> >> difference between using files created on Windows or Linux.
> >>
> >> Thanks
> >> Uwe
> >>
> >>
> >>
> >>
> >> -----Original Message-----
> >> From: Hilmar Lapp [mailto:hlapp at gmx.net]
> >> Sent: Wednesday, August 05, 2009 6:54 PM
> >> To: Chris Fields
> >> Cc: Hilgert, Uwe; BioPerl List
> >> Subject: Re: [Bioperl-l] Bio::SeqIO issue
> >>
> >> I don't think that can be the problem. If anything, providing the
> >> format ought to be better in terms of result than not providing it?
> >>
> >> Uwe - I'd like you to go back to Chris' initial questions that you
> >> haven't answered yet: "What version of bioperl are you using, OS,
> >> etc?
> >> What does your data look like?" I'd add to that, can you show us your
> >> full script, or a smaller code snippet that reproduces the problem.
> >>
> >> I suspect that either something in your script is swallowing the
> >> line,
> >> or that the line endings in your data file are from a different OS
> >> than the one you're running the script on. (Or that you are running a
> >> very old version of BioPerl, which is entirely possible if you
> >> installed through CPAN.)
> >>
> >> -hilmar
> >>
> >> On Aug 5, 2009, at 5:37 PM, Chris Fields wrote:
> >>
> >>> Uwe,
> >>>
> >>> Please keep replies on the list.
> >>>
> >>> It's very possible that's the issue; IIRC the fasta parser pulls out
> >>> the full sequence in chunks (based on local $/ = "\n>") and splits
> >>> the header off as the first line in that chunk. You could probably
> >>> try leaving the format out and letting SeqIO guess it, or passing
> >>> the
> >>> file into Bio::Tools::GuessSeqFormat directly, but it's probably
> >>> better to go through the files and add a file extension that
> >>> corresponds to the format.
> >>>
> >>> chris
> >>>
> >>> On Aug 5, 2009, at 4:23 PM, Hilgert, Uwe wrote:
> >>>
> >>>> Thanks, Chris. The files have no extension, but we indicate what
> >>>> format to use, like in the manual:
> >>>>
> >>>> $in = Bio::SeqIO->new(-file => "file_path", -format => 'Fasta');
> >>>>
> >>>> I wonder now whether this could exactly cause the problem: as we
> >>>> are
> >>>> telling that input files are in fasta format they are being treated
> >>>> as such (=remove first line) - regardless of whether they really
> >>>> are
> >>>> fasta?
> >>>>
> >>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Uwe
> >>>> Hilgert, Ph.D.
> >>>> Dolan DNA Learning Center
> >>>> Cold Spring Harbor Laboratory
> >>>>
> >>>> C: (516) 857-1693
> >>>> V: (516) 367-5185
> >>>> E: hilgert at cshl.edu
> >>>> F: (516) 367-5182
> >>>> W: http://www.dnalc.org
> >>>>
> >>>> -----Original Message-----
> >>>> From: Chris Fields [mailto:cjfields at illinois.edu]
> >>>> Sent: Wednesday, August 05, 2009 5:04 PM
> >>>> To: Hilgert, Uwe
> >>>> Cc: bioperl-l at lists.open-bio.org
> >>>> Subject: Re: [Bioperl-l] Bio::SeqIO issue
> >>>>
> >>>> On Aug 5, 2009, at 3:27 PM, Hilgert, Uwe wrote:
> >>>>
> >>>>> Is my impression correct that Bio::SeqIO just assumes that
> >>>>> sequences are being submitted in FASTA format?
> >>>>
> >>>> No. See:
> >>>>
> >>>> http://www.bioperl.org/wiki/HOWTO:SeqIO
> >>>>
> >>>> SeqIO tries to guess at the format using the file extension, and if
> >>>> one isn't present makes use of Bio::Tools::GuessSeqFormat. It's
> >>>> possible that the extension is causing the problem, or that
> >>>> GuessSeqFormat guessing wrong (it's apt to do that, as it's forced
> >>>> to guessing). In any case, it's always advisable to explicitly
> >>>> indicate the format when possible.
> >>>>
> >>>> Relevant lines:
> >>>>
> >>>> return 'fasta' if /\.(fasta|fast|fas|seq|fa|fsa|nt|aa|fna|faa)$/
> >>>> i;
> >>>> ...
> >>>> return 'raw' if /\.(txt)$/i;
> >>>>
> >>>>> In our experience, implementing
> >>>>> Bio::SeqIO led to the first line of files being cut off,
> >>>>> regardless
> >>>>> of whether the files were indeed fasta files or files that only
> >>>>> contained sequence.
> >>>>
> >>>> Files that only contain sequence are 'raw'. Ones in FASTA are
> >>>> 'fasta'.
> >>>>
> >>>>> Which, in the latter, led to sequence submissions that had the
> >>>>> first line of nucleotides removed. Has anyone tried to write a fix
> >>>>> for this?
> >>>>
> >>>> This sounds like a bug, but we have very little to go on beyond
> >>>> your
> >>>> description. What version of bioperl are you using, OS, etc? What
> >>>> does your data look like? File extension?
> >>>>
> >>>> chris
> >>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Uwe
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> >>>>>
> >>>>> Uwe Hilgert, Ph.D.
> >>>>>
> >>>>> Dolan DNA Learning Center
> >>>>>
> >>>>> Cold Spring Harbor Laboratory
> >>>>>
> >>>>>
> >>>>>
> >>>>> V: (516) 367-5185
> >>>>>
> >>>>> E: hilgert at cshl.edu <mailto:hilgert at cshl.edu>
> >>>>>
> >>>>> F: (516) 367-5182
> >>>>>
> >>>>> W: http://www.dnalc.org
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> _______________________________________________
> >>>>> Bioperl-l mailing list
> >>>>> Bioperl-l at lists.open-bio.org
> >>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >>>>
> >>>
> >>> _______________________________________________
> >>> Bioperl-l mailing list
> >>> Bioperl-l at lists.open-bio.org
> >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >>
> >> --
> >> ===========================================================
> >> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
> >> ===========================================================
> >>
> >>
> >
> > --
> > ===========================================================
> > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
> > ===========================================================
> >
> >
> >
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
More information about the Bioperl-l
mailing list