[Bioperl-l] Bug in GCG SeqIO Formatting?

Tex Thompson tex at biosysadmin.com
Tue Feb 17 01:21:51 EST 2004


Hilmar,

Thanks for the tip. There was no stack trace or error message; here is the
complete output from the test program quoted below:


>AF317472 !!NA_SEQUENCE 1.0LOCUS       AF317472                2679 bp    DNA     linear   PLN 07-DEC-2000DEFINITION  Candida albicans cAMP-dependent protein kinase regulatory subunit            (PKA-R) gene, complete cds.ACCESSION   AF317472VERSION     AF317472.1  GI:11596392KEYWORDS    .SOURCE      Candida albicans  ORGANISM  Candida albicans            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;            Saccharomycetales; mitosporic Saccharomycetales; Candida.REFERENCE   1  (bases 1 to 2679)  AUTHORS   Giasson,L. and Parrot,M.  TITLE     Sequence of the Candida albicans cAMP-dependent protein kinase            regulatory subunit  JOURNAL   UnpublishedREFERENCE   2  (bases 1 to 2679)  AUTHORS   Giasson,L. and Parrot,M.  TITLE     Direct Submission  JOURNAL   Submitted (27-OCT-2000) School of Dentistry, Laval University,            GREB, Ste-Foy, Quebec G1K 7P4, CanadaFEATURES             Location/Qualifiers     source          1. .2679               
       /organism="Candida albicans"                     /mol_type="genomic DNA"                     /strain="CAI4"                     /db_xref="taxon:5476"     gene            <977. .>2356                     /gene="PKA-R"     mRNA            <977. .>2356                     /gene="PKA-R"                     /product="cAMP-dependent protein kinase regulatory                     subunit"     CDS             977. .2356                     /gene="PKA-R"                     /codon_start=1                     /transl_table=12                     /product="cAMP-dependent protein kinase regulatory                     subunit"                     /protein_id="AAG38599.1"                     /db_xref="GI:11596393"                     /translation="MSNPQQQFISDELSQLQKEIISKNPQDVLQFCANYFNTKLQAQR                     SELWSQQAKAEAAGIDLFPSVDHVNVNSSGVSIVNDRQPSFKSPFGVNDPHSNHDEDP                     HAKDTKTDTAAAAVGGGIFKSNFDVKKSASNPPTKEVDPDDPSKPSSSSQPNQQSASA                     SSKTPSSKIPVAFNANR
 RTSVSAEALNPAKLKLDSWKPPVNNLSITEEETLANNLKNN                     FLFKQLDANSKKTVIAALQQKSFAKDTVIIQQGDEGDFFYIIETGTVDFYVNDAKVSS                     SSEGSSFGELALMYNSPRAATAVAATDVVCWALDRLTFRRILLEGTFNKRLMYEDFLK                     DIEVLKSLSDHARSKLADALSTEMYHKGDKIVTEGEQGENFYLIESGNCQVYNEKLGN                     IKQLTKGDYFGELALIKDLPRQATVEALDNVIVATLGKSGFQRLLGPVVEVLKEQDPT                     KSQDPTAGH"ORIGIN
GAATTCAAAAAATCAAAAAAATCAAAAAAAAACCGTGGAAGGTAAGTTGTATATTTATAA
ATCAACGTGAATAATTTTCAACACTGTGTCAACATCTGTGAAAAAAACCTGTGTGTACTG
CATATAGGACCTCACCTATTACGTAGAATATACTAGAAATAGTTACAACCATAAAAAGAT
TAATTGTGCTTACGTGGCAACTTTGAGATTTTTCTTTTTTCTGTTTCTTTCTTTCTTTTT
TTGGCTTAAACAACAAATGTCGCAAATTATACAAACGACATTTGCTGCCCATGTCATTTT
GTCGTTATCACGTGAAGTGTCGCAGATTTATGTATTCTCACTTCATTTCTATGGTCATCA
ATTGTTCATTCATTCTCTATCTTCAAAAATCTGTGATTTGATGATTTTGATTAAAAGAAA
GCAAAGAGAATACTGAAAAAAAGCAAAGAGAATATAGAAAAGAAACAATAAAAGAATAGT
TTCTAAGTTACTTTGGAGTCTGCTATTACCATGTATCTATGTGATTGCCCTATCAAATTG
GACAATACGGGTTTTTGTTTAGTCACGATAATCACAAACTTCCCCCAGCAATGACATACG
TAGCAAGTAATATTTATATCTCTTCTATTTTTTTGATCTTACATAATCTGTCGTGTTTTT
TTAAGTTGTTGTTATGAAGAAGTAATTTCATAATGATCAAGTGTGTAACTGAAATTTCAT
CGCAATTTTAAACAAACAAGCTAATAATTATTATTATTAATAGTTAATTTGCTAAGTTGA
GTAAAATTTGCTTTTCTTGAGAAAAAGGAGAAATTACTTTGGGAGTGAGTTTGAAGAGAG
AAACTAAAGTAAGTAAATGAGTGAGAGGGAGAGACAGAGAGCGAGAGGGGGAGTAAAAAA
AAAAGTTGCCCACAAACAAATTGTGATACCGGTCTTTTAGCATATATCTTCTACTCTTCA
ATCAACATCTTTACCAATGTCTAATCCTCAACAACAATTCATATCTGATGAATTGTCGCA
GTTACAGAAAGAAATAATTTCCAAAAACCCGCAAGATGTCTTACAGTTTTGCGCCAACTA
TTTCAACACCAAGTTACAAGCTCAAAGAAGTGAGTTATGGTCGCAACAAGCTAAAGCAGA
AGCCGCAGGCATCGACTTATTCCCATCTGTTGATCATGTGAATGTTAATTCTAGTGGTGT
GAGCATTGTGAATGATAGACAACCAAGTTTTAAATCACCTTTTGGTGTTAATGATCCACA
TCTGAATCACGACGAAGATCCCCATGCCAAAGATACCAAAACAGATACTGCTGCTGCTGC
TGTTGGTGGGGGTATTTTCAAATCAAATTTTGATGTTAAAAAGAGTGCTTCTAATCCTCC
AACCAAGGAAGTAGATCCAGATGACCCATCAAAACCATCGTCATCGAGCCAACCAAATCA
ACAATCAGCATCAGCATCATCAAAAACGCCATCATCAAAGATCCCAGTTGCTTTCAACGC
TAATAGAAGAACATCTGTATCTGCTGAAGCCTTGAATCCAGCAAAATTGAAATTAGATAG
TTGGAAACCTCCAGTTAATAATTTGAGCATTACCGAAGAAGAAACATTAGCCAACAATTT
AAAGAACAATTTCCTTTTCAAACAATTGGACGCAAACTCTAAGAAAACTGTGATTGCTGC
TTTACAACAAAAATCATTTGCTAAAGATACAGTAATTATCCAACAAGGTGATGAAGGGGA
CTTTTTTTACATTATTGAAACTGGTACAGTTGATTTCTATGTTAATGATGCTAAAGTAAG
TTCCAGTAGCGAAGGGTCATCTTTTGGGGAATTGGCTTTGATGTATAATTCACCAAGAGC
TGCTACGGCAGTTGCTGCCACCGATGTTGTCTGTTGGGCATTGGACCGTTTGACATTCCG
TCGAATTCTTTTGGAAGGTACTTTTAACAAGAGATTGATGTACGAGGATTTCTTAAAAGA
TATTGAGGTTTTGAAATCTCTTTCGGATCATGCACGTTCAAAATTGGCAGATGCATTGAG
CACAGAAATGTATCACAAGGGTGATAAAATAGTCACTGAAGGTGAACAAGGAGAGAACTT
TTATTTAATAGAAAGTGGAAACTGTCAAGTTTACAATGAAAAGTTGGGCAATATCAAACA
ATTAACAAAAGGTGATTATTTTGGTGAGCTTGCATTAATAAAAGACTTACCAAGACAAGC
TACTGTGGAAGCATTGGATAATGTAATCGTTGCCACATTAGGTAAATCCGGGTTCCAAAG
ATTATTGGGTCCTGTTGTGGAGGTATTGAAAGAACAAGACCCTACAAAGAGTCAAGACCC
AACTGCTGGTCATTAAGTGTACAATAAGTAGTTGTTTATTATCTTATATTGTTTTATGTT
AGTATATTCTATCTTTTTTTTTTTGGCTTACTCACCTTCTGGTGTTTTCGTTGCGATTTT
GATAATGGATGGTTGGTGCAAAAGTTCAACTACATTTCTTGTTGTCAGGTATATACGAGA
TGGCAGCATGAACGAGCTCACCATGGGTTGAACATTATTGAAGTTATCCGGCCGTGCCTT
TTGCGAAACATGGTAACTAATATATTGCAAACTTGGCTTCTACAGAAAATATACAATCTA
ATACCTTGAGGAATTTCCTCTATATATAATAGAGAATTC

It looks like most of the GenBank header has been stuck onto that
first (description) line. Strictly speaking the output is still a
valid FASTA file, but is this really the desired behavior?
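
For what it's worth, my understanding (an assumption on my part, not
something I have confirmed in the Bio::SeqIO::gcg source) is that in a
single-sequence GCG file everything before the line ending in ".." is
free-form text, which would explain why the whole GenBank header ends up
in the description. Here is a minimal pure-Perl sketch of that split.
The split_gcg helper is hypothetical and not part of BioPerl, and the
sample record is an abbreviated, made-up one (its Length and Check
values are invented):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not part of BioPerl: split the text of a
# single-sequence GCG file into (header, divider line, sequence).
# Assumption: the divider is the first line that ends in "..".
sub split_gcg {
    my ($text) = @_;
    my ( @header, @seq_lines );
    my $divider;
    for my $line ( split /\n/, $text ) {
        if ( !defined $divider ) {
            if ( $line =~ /\.\.\s*$/ ) { $divider = $line }
            else                       { push @header, $line }
        }
        else {
            push @seq_lines, $line;
        }
    }
    return unless defined $divider;    # no ".." line: not GCG-like
    my $seq = join '', @seq_lines;
    $seq =~ s/[\s\d]//g;               # drop position numbers and spacing
    return ( join( "\n", @header ), $divider, $seq );
}

# Abbreviated, invented sample record (single-quoted heredoc, so
# nothing in it is interpolated).
my $sample = <<'END';
!!NA_SEQUENCE 1.0
LOCUS       AF317472                2679 bp    DNA     linear   PLN 07-DEC-2000
ACCESSION   AF317472

AF317472  Length: 20  Type: N  Check: 0  ..

       1  GAATTCAAAA AATCAAAAAA
END

my ( $header, $divider, $seq ) = split_gcg($sample);
print "Header (becomes the description):\n$header\n";
print "Divider: $divider\n";
print "Sequence: $seq\n";
```

If that reading is right, one option would be to shorten the
description (for example, keep only the DEFINITION text) before calling
write_seq, but I have not tried that yet.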

Thanks for the help,

Tex Thompson
RIT Bioinformatics

On Mon, 16 Feb 2004, Hilmar Lapp wrote:

> Rule #1: If your code doesn't work the way you think it should, or 
> fails with an exception, and you do want help from the mailing list, 
> then be sure to send along the *complete* output, in particular the 
> stack trace if there was any.
> 
> Rule #2: Double check that you followed rule #1.
> 
> Rule #3: Check again that you followed rule #1.
> 
> There really aren't any other rules here. If you choose not to follow 
> rule #1 you indicate that you're not actually interested in getting 
> help.
> 
> 	-hilmar
> 
> On Monday, February 16, 2004, at 02:49  PM, Tex Thompson wrote:
> 
> > Hello Mailing List,
> >
> > I have a user complaining that the following code isn't working on his
> > GCG-formatted sequence files:
> >
> > #!/usr/bin/perl
> >
> > use strict;
> >
> > use Bio::SeqIO;
> > my $io  = Bio::SeqIO->new( -file => "af317472.gbpln3", -format => "gcg" );
> > my $out = Bio::SeqIO->new( -fh => \*STDOUT, -format => "fasta" );
> >
> > while ( my $seq = $io->next_seq ) {
> >    $out->write_seq( $seq );
> > }
> >
> > Here's an example sequence file:
> >
> > !!NA_SEQUENCE 1.0
> > LOCUS       AF317472                2679 bp    DNA     linear   PLN 07-DEC-2000
> > DEFINITION  Candida albicans cAMP-dependent protein kinase regulatory subunit
> >             (PKA-R) gene, complete cds.
> > ACCESSION   AF317472
> > VERSION     AF317472.1  GI:11596392
> > KEYWORDS    .
> > SOURCE      Candida albicans
> >   ORGANISM  Candida albicans
> >             Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
> >             Saccharomycetales; mitosporic Saccharomycetales; Candida.
> > REFERENCE   1  (bases 1 to 2679)
> >   AUTHORS   Giasson,L. and Parrot,M.
> >   TITLE     Sequence of the Candida albicans cAMP-dependent protein kinase
> >             regulatory subunit
> >   JOURNAL   Unpublished
> > REFERENCE   2  (bases 1 to 2679)
> >   AUTHORS   Giasson,L. and Parrot,M.
> >   TITLE     Direct Submission
> >   JOURNAL   Submitted (27-OCT-2000) School of Dentistry, Laval University,
> >             GREB, Ste-Foy, Quebec G1K 7P4, Canada
> > FEATURES             Location/Qualifiers
> >      source          1. .2679
> >                      /organism="Candida albicans"
> >                      /mol_type="genomic DNA"
> >                      /strain="CAI4"
> >                      /db_xref="taxon:5476"
> >      gene            <977. .>2356
> >                      /gene="PKA-R"
> >      mRNA            <977. .>2356
> >                      /gene="PKA-R"
> >                      /product="cAMP-dependent protein kinase regulatory
> >                      subunit"
> >      CDS             977. .2356
> >                      /gene="PKA-R"
> >                      /codon_start=1
> >                      /transl_table=12
> >                      /product="cAMP-dependent protein kinase regulatory
> >                      subunit"
> >                      /protein_id="AAG38599.1"
> >                      /db_xref="GI:11596393"
> >                      /translation="MSNPQQQFISDELSQLQKEIISKNPQDVLQFCANYFNTKLQAQR
> >                      SELWSQQAKAEAAGIDLFPSVDHVNVNSSGVSIVNDRQPSFKSPFGVNDPHSNHDEDP
> >                      HAKDTKTDTAAAAVGGGIFKSNFDVKKSASNPPTKEVDPDDPSKPSSSSQPNQQSASA
> >                      SSKTPSSKIPVAFNANRRTSVSAEALNPAKLKLDSWKPPVNNLSITEEETLANNLKNN
> >                      FLFKQLDANSKKTVIAALQQKSFAKDTVIIQQGDEGDFFYIIETGTVDFYVNDAKVSS
> >                      SSEGSSFGELALMYNSPRAATAVAATDVVCWALDRLTFRRILLEGTFNKRLMYEDFLK
> >                      DIEVLKSLSDHARSKLADALSTEMYHKGDKIVTEGEQGENFYLIESGNCQVYNEKLGN
> >                      IKQLTKGDYFGELALIKDLPRQATVEALDNVIVATLGKSGFQRLLGPVVEVLKEQDPT
> >                      KSQDPTAGH"
> > ORIGIN
> >
> > AF317472  Length: 2679  February 16, 2004 17:02  Type: N  Check: 9369  ..
> >
> >        1  GAATTCAAAA AATCAAAAAA ATCAAAAAAA AACCGTGGAA GGTAAGTTGT
> >
> >       51  ATATTTATAA ATCAACGTGA ATAATTTTCA ACACTGTGTC AACATCTGTG
> >
> >      101  AAAAAAACCT GTGTGTACTG CATATAGGAC CTCACCTATT ACGTAGAATA
> >
> >      151  TACTAGAAAT AGTTACAACC ATAAAAAGAT TAATTGTGCT TACGTGGCAA
> >
> >      201  CTTTGAGATT TTTCTTTTTT CTGTTTCTTT CTTTCTTTTT TTGGCTTAAA
> >
> >      251  CAACAAATGT CGCAAATTAT ACAAACGACA TTTGCTGCCC ATGTCATTTT
> >
> >      301  GTCGTTATCA CGTGAAGTGT CGCAGATTTA TGTATTCTCA CTTCATTTCT
> >
> >      351  ATGGTCATCA ATTGTTCATT CATTCTCTAT CTTCAAAAAT CTGTGATTTG
> >
> >      401  ATGATTTTGA TTAAAAGAAA GCAAAGAGAA TACTGAAAAA AAGCAAAGAG
> >
> >      451  AATATAGAAA AGAAACAATA AAAGAATAGT TTCTAAGTTA CTTTGGAGTC
> >
> >      501  TGCTATTACC ATGTATCTAT GTGATTGCCC TATCAAATTG GACAATACGG
> >
> >      551  GTTTTTGTTT AGTCACGATA ATCACAAACT TCCCCCAGCA ATGACATACG
> >
> >      601  TAGCAAGTAA TATTTATATC TCTTCTATTT TTTTGATCTT ACATAATCTG
> >
> >      651  TCGTGTTTTT TTAAGTTGTT GTTATGAAGA AGTAATTTCA TAATGATCAA
> >
> >      701  GTGTGTAACT GAAATTTCAT CGCAATTTTA AACAAACAAG CTAATAATTA
> >
> >      751  TTATTATTAA TAGTTAATTT GCTAAGTTGA GTAAAATTTG CTTTTCTTGA
> >
> >      801  GAAAAAGGAG AAATTACTTT GGGAGTGAGT TTGAAGAGAG AAACTAAAGT
> >
> >      851  AAGTAAATGA GTGAGAGGGA GAGACAGAGA GCGAGAGGGG GAGTAAAAAA
> >
> >      901  AAAAGTTGCC CACAAACAAA TTGTGATACC GGTCTTTTAG CATATATCTT
> >
> >      951  CTACTCTTCA ATCAACATCT TTACCAATGT CTAATCCTCA ACAACAATTC
> >
> >     1001  ATATCTGATG AATTGTCGCA GTTACAGAAA GAAATAATTT CCAAAAACCC
> >
> >     1051  GCAAGATGTC TTACAGTTTT GCGCCAACTA TTTCAACACC AAGTTACAAG
> >
> >     1101  CTCAAAGAAG TGAGTTATGG TCGCAACAAG CTAAAGCAGA AGCCGCAGGC
> >
> >     1151  ATCGACTTAT TCCCATCTGT TGATCATGTG AATGTTAATT CTAGTGGTGT
> >
> >     1201  GAGCATTGTG AATGATAGAC AACCAAGTTT TAAATCACCT TTTGGTGTTA
> >
> >     1251  ATGATCCACA TCTGAATCAC GACGAAGATC CCCATGCCAA AGATACCAAA
> >
> >     1301  ACAGATACTG CTGCTGCTGC TGTTGGTGGG GGTATTTTCA AATCAAATTT
> >
> >     1351  TGATGTTAAA AAGAGTGCTT CTAATCCTCC AACCAAGGAA GTAGATCCAG
> >
> >     1401  ATGACCCATC AAAACCATCG TCATCGAGCC AACCAAATCA ACAATCAGCA
> >
> >     1451  TCAGCATCAT CAAAAACGCC ATCATCAAAG ATCCCAGTTG CTTTCAACGC
> >
> >     1501  TAATAGAAGA ACATCTGTAT CTGCTGAAGC CTTGAATCCA GCAAAATTGA
> >
> >     1551  AATTAGATAG TTGGAAACCT CCAGTTAATA ATTTGAGCAT TACCGAAGAA
> >
> >     1601  GAAACATTAG CCAACAATTT AAAGAACAAT TTCCTTTTCA AACAATTGGA
> >
> >     1651  CGCAAACTCT AAGAAAACTG TGATTGCTGC TTTACAACAA AAATCATTTG
> >
> >     1701  CTAAAGATAC AGTAATTATC CAACAAGGTG ATGAAGGGGA CTTTTTTTAC
> >
> >     1751  ATTATTGAAA CTGGTACAGT TGATTTCTAT GTTAATGATG CTAAAGTAAG
> >
> >     1801  TTCCAGTAGC GAAGGGTCAT CTTTTGGGGA ATTGGCTTTG ATGTATAATT
> >
> >     1851  CACCAAGAGC TGCTACGGCA GTTGCTGCCA CCGATGTTGT CTGTTGGGCA
> >
> >     1901  TTGGACCGTT TGACATTCCG TCGAATTCTT TTGGAAGGTA CTTTTAACAA
> >
> >     1951  GAGATTGATG TACGAGGATT TCTTAAAAGA TATTGAGGTT TTGAAATCTC
> >
> >     2001  TTTCGGATCA TGCACGTTCA AAATTGGCAG ATGCATTGAG CACAGAAATG
> >
> >     2051  TATCACAAGG GTGATAAAAT AGTCACTGAA GGTGAACAAG GAGAGAACTT
> >
> >     2101  TTATTTAATA GAAAGTGGAA ACTGTCAAGT TTACAATGAA AAGTTGGGCA
> >
> >     2151  ATATCAAACA ATTAACAAAA GGTGATTATT TTGGTGAGCT TGCATTAATA
> >
> >     2201  AAAGACTTAC CAAGACAAGC TACTGTGGAA GCATTGGATA ATGTAATCGT
> >
> >     2251  TGCCACATTA GGTAAATCCG GGTTCCAAAG ATTATTGGGT CCTGTTGTGG
> >
> >     2301  AGGTATTGAA AGAACAAGAC CCTACAAAGA GTCAAGACCC AACTGCTGGT
> >
> >     2351  CATTAAGTGT ACAATAAGTA GTTGTTTATT ATCTTATATT GTTTTATGTT
> >
> >     2401  AGTATATTCT ATCTTTTTTT TTTTGGCTTA CTCACCTTCT GGTGTTTTCG
> >
> >     2451  TTGCGATTTT GATAATGGAT GGTTGGTGCA AAAGTTCAAC TACATTTCTT
> >
> >     2501  GTTGTCAGGT ATATACGAGA TGGCAGCATG AACGAGCTCA CCATGGGTTG
> >
> >     2551  AACATTATTG AAGTTATCCG GCCGTGCCTT TTGCGAAACA TGGTAACTAA
> >
> >     2601  TATATTGCAA ACTTGGCTTC TACAGAAAAT ATACAATCTA ATACCTTGAG
> >
> >     2651  GAATTTCCTC TATATATAAT AGAGAATTC
> >
> > I'm not a GCG expert, but is this a correctly formatted GCG file in
> > the first place? If it is, is this an error in the SeqIO parser? I've
> > found this behavior to be the same on Solaris 8 and on Linux, both
> > running BioPerl 1.4 and Perl 5.8.1.
> >
> > Thanks a bunch,
> >
> > Tex Thompson
> > RIT Bioinformatics
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at portal.open-bio.org
> > http://portal.open-bio.org/mailman/listinfo/bioperl-l
> >
> >
> 


