[Bioperl-l] SeqIO

Staffa, Nick (NIH/NIEHS) staffa at niehs.nih.gov
Thu Mar 6 18:27:31 EST 2008


Thanks.
I really appreciate all the interest and help.
That sure sounds like a great idea, but I think
Bio::Tools::GuessSeqFormat needs more RIGOR before it declares itself.
Is there a substitute?
It works great with
>> !!NA_SEQUENCE 1.0
>>    NewDNA  Length: 810  March 5, 2008 18:26  Type: N  Check: 3368  ..
>> 
>>        1  TGTTCGAATT CCGTGCGGTC CACCTCCCCT AGGAGCTCAG TGGGCTGGTT
>> etc.

as seen in:
gir.niehs.nih.gov> CGwindows.pl TestDNA.seq.org | more
guesser guesses gcg
TGTTCGAATTCCGTGCGGTCCACCTCCCCTAGGAGCTCAGTGGGCTGGTTGGATTCCGTGCCATCCCGGCAGGGCA
GAGCCTCGGGA  etc.
(yes, I added
my $file_type = $guesser->guess;
print "guesser guesses $file_type\n";
)

BUT
when applied to a GenBank sequence passed through the Seqlab editor and turned
into GCG, to wit:
!!NA_SEQUENCE 1.0
LOCUS       HSPGK2G      1911 bp    DNA             PRI       12-SEP-1993
DEFINITION  Human testis-specific PGK-2 gene for phosphoglycerate kinase
            (ATP:3-phospho-D-glycerate 1-phosphotransferase, EC 2.7.2.3).
ACCESSION   X05246 Y00261
...
...
BASE COUNT      583 a    367 c    442 g    519 t
ORIGIN

 HSPGK2G  Length: 1911  August 24, 1998 10:56  Type: N  Check: 4156  ..

       1  GCCCCTCAAC AGCAAGTTGG TTCTTCAGCA TTAAGATCCA GGTGTCAGCC
etc.

It thinks it is a flawed PIR:

gir.niehs.nih.gov> CGwindows.pl hspgk2g.seq | more
guesser guesses pir

------------- EXCEPTION  -------------
MSG: PIR stream read attempted without leading '>P1;' [ !!NA_SEQUENCE 1.0
LOCUS       HSPGK2G      1911 bp    DNA             PRI       12-SEP-1993


I must look into why the guesser thinks this is PIR.
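
In the meantime, one possible workaround is to check for the GCG
"!!NA_SEQUENCE" / "!!AA_SEQUENCE" marker before asking the guesser at all.
The following is only a minimal sketch: format_override and choose_format
are hypothetical helper names, not part of BioPerl, and whether
Bio::SeqIO's gcg parser then reads the GenBank-derived file cleanly has
not been confirmed here.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): returns 'gcg' when the first
# non-blank line carries the "!!NA_SEQUENCE" or "!!AA_SEQUENCE" marker,
# otherwise undef so the caller can fall back to the guesser.
sub format_override {
    my ($file) = @_;
    open my $fh, '<', $file or die "cannot open $file: $!";
    while (my $line = <$fh>) {
        next unless $line =~ /\S/;    # skip leading blank lines
        close $fh;
        return $line =~ /^!!(?:NA|AA)_SEQUENCE\b/ ? 'gcg' : undef;
    }
    close $fh;
    return;                           # empty file: no override
}

# Hypothetical wrapper: explicit override first, then the guesser.
sub choose_format {
    my ($file) = @_;
    my $format = format_override($file);
    return $format if defined $format;
    require Bio::Tools::GuessSeqFormat;    # loaded only when needed
    return Bio::Tools::GuessSeqFormat->new(-file => $file)->guess;
}
```

The result would then be passed along as
Bio::SeqIO->new(-file => $file, -format => choose_format($file)),
so files without the marker still go through the guesser as before.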




On 3/6/08 11:22 AM, "Marc Logghe" <Marc.Logghe at ablynx.com> wrote:

> Hi Nick,
> I don't think you should leave out the -format option. You have to leave
> it in but the format should be provided by the B::T::GuessSeqFormat
> object.
> Something like:
> 
> #!/usr/bin/perl
> use strict;
> use Bio::SeqIO;
> use Bio::Tools::GuessSeqFormat;
> 
> $| = 1;
> my $number_of_files = @ARGV;
> if(!$number_of_files){print "no files entered\n";exit;}
> foreach my $file (@ARGV){
>   my $guesser = Bio::Tools::GuessSeqFormat->new(-file => $file);
>   my $seqio_object = Bio::SeqIO->new(-file => $guesser->file, -format =>
> $guesser->guess);
>   my $seq_object = $seqio_object->next_seq;
>   my $sequence = $seq_object->seq;
>   print "$sequence\n";
> }
> 
> HTH,
> Marc
> 
> 
>> -----Original Message-----
>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
>> bounces at lists.open-bio.org] On Behalf Of Staffa, Nick (NIH/NIEHS)
>> Sent: donderdag 6 maart 2008 16:24
>> To: Heikki Lehvaslaiho; bioperl-l at lists.open-bio.org
>> Cc: Chris Fields
>> Subject: Re: [Bioperl-l] SeqIO
>> 
>> Here's the scoop:
>> When I use Jason's suggestion (-format => 'gcg'),
>> my program works without complaint on the original file that looks like:
>> !!NA_SEQUENCE 1.0
>>    NewDNA  Length: 810  March 5, 2008 18:26  Type: N  Check: 3368  ..
>> 
>>        1  TGTTCGAATT CCGTGCGGTC CACCTCCCCT AGGAGCTCAG TGGGCTGGTT
>> etc.
>> 
>> BUT if I remove the first line to test Bio::Tools::GuessSeqFormat,
>> (which should be retro-gcg format (before version 11?)),
>> my program runs, but there IS a complaint:
>> Use of uninitialized value in scalar chomp at
>> /usr/lib/perl5/site_perl/5.8.5/Bio/SeqIO/gcg.pm line 118, <GEN0> line 1.
>> BUT
>> If I remove (-format => 'gcg'), I get no complaint, but the sequence
>> returned still has its numbers embedded. This affects my calculations.
>> 
>> Thanks, at least I know what my options are.
>> 
>> 
>> 
>> Nick Staffa
>> Telephone: 919-316-4569  (NIEHS: 6-4569)
>> Scientific Computing Support Group
>> NIEHS Information Technology Support Services Contract
>> (Science Task Monitor: Roy W. Reter (reter at niehs.nih.gov))
>> National Institute of Environmental Health Sciences
>> National Institutes of Health
>> Research Triangle Park, North Carolina
>


