[Bioperl-l] SeqIO (stress) testing

Kris Boulez krbou@pgsgent.be
Wed, 20 Dec 2000 12:13:57 +0100


Quoting Hilmar Lapp (hlapp@gmx.net):
> Kris Boulez wrote:
> > 
> > - starting from t/test.genbank, writing a swiss-prot file gives (we die,
> >   no error thrown)
> 
> test.genbank is DNA. Do you translate it?
> 
Nope. I had checked that test.fasta is protein, but forgot to check this one.
Should this matter (i.e. does the Swissprot writer check that it is writing
a protein sequence)?

> Genbank (DNA) and Swissprot feature tables are basically incompatible.
> The post I quoted lately contains an example I think. (E.g., you can't
> have 'source' in a Swissprot feature table; the latter is supposed to
> contain only protein sites.)
> 
> > 
> > - starting from t/test.genbank, writing a gcg file, reading this gcg
> >   file gives
> > -------------------- EXCEPTION --------------------
> > MSG: Looks like start of another sequence. See documentation.
> > CONTEXT: Error in uNKNOWN CONTEXT
> > SCRIPT: seqtest.pl
> > STACK:
> > Bio::SeqIO::gcg::next_seq(123)
> > main::seqtest.pl(14)
> > ---------------------------------------------------
> > 
> > - starting from t/test.embl, there is a problem for SeqIO to read a gcg
> >   file it wrote itself (it just loops forever). I will investigate this
> >   one further as it's not clear when/what happens.
> > 
> 
> The GCG module seems to be broken. I wanted to use it some time ago, but
> it didn't even want to read simple sequence files. At that time we had
> GCG 10, maybe something in the format has changed. GCG format is
> problematic, because there really isn't a genuine GCG format. A Genbank
> sequence in GCG format is in fact the sequence in Genbank format with 1
> header line prepended and the sequence formatted specially (with a line
> containing checksum etc, and the notorious two dots). Likewise for an
> EMBL sequence.
> 
> How many people have a serious interest in this module? If there are
> some, could you also provide some example files of a recent GCG version
> (e.g., 10.1); I personally don't have access to GCG presently.
> 

Given the widespread use of GCG there is (I guess) an interest. We ran
into the same lack of a well-defined GCG format in another project as well.
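
To make the layout described above concrete, here is a minimal,
self-contained Python sketch (not BioPerl code, and not the
Bio::SeqIO::gcg implementation). It assumes that a GCG-style record is
some free-text header, then a divider line ending in the two dots, then
numbered sequence lines. The checksum formula is the one commonly used
by GCG-compatible tools; the exact layout of the divider line and the
function names (gcg_checksum, write_gcg, read_gcg) are my own
assumptions for illustration.

```python
import re

def gcg_checksum(seq):
    """Weighted character sum, weights cycling 1..57, modulo 10000."""
    total = 0
    for i, ch in enumerate(seq.upper()):
        total += (i % 57 + 1) * ord(ch)
    return total % 10000

def write_gcg(name, seq, header_lines=()):
    """Build a GCG-style record (assumed layout, for illustration only)."""
    lines = list(header_lines)
    lines.append("")
    # Divider line: carries length and checksum, ends with the two dots.
    lines.append(f"  {name}  Length: {len(seq)}  Check: {gcg_checksum(seq)}  ..")
    lines.append("")
    # Sequence lines: position number, then blocks of 10 residues.
    for start in range(0, len(seq), 50):
        chunk = seq[start:start + 50]
        blocks = " ".join(chunk[i:i + 10] for i in range(0, len(chunk), 10))
        lines.append(f"{start + 1:>8}  {blocks}")
    return "\n".join(lines) + "\n"

def read_gcg(text):
    """Read one record: skip to the '..' divider, then collect residues."""
    lines = text.splitlines()
    for n, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            divider = n
            break
    else:
        raise ValueError("no '..' divider line found")
    # Everything after the divider is sequence; drop numbers and whitespace.
    seq = re.sub(r"[\s\d]", "", "\n".join(lines[divider + 1:]))
    match = re.search(r"Check:\s*(\d+)", lines[divider])
    if match and int(match.group(1)) != gcg_checksum(seq):
        raise ValueError("checksum mismatch")
    return seq
```

Because the header is free text, a reader like this has to rely on the
divider line alone, which is one reason the format is hard to parse
robustly when the header itself comes from Genbank or EMBL.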

> > By looking at the test (and test sequences) we have now I saw that we
> > only try to read the first sequence from our test sequence files (apart
> > from GCG, which reads more than one file). The test.embl even contains
> > only one sequence. I think that we should test for reading/writing
> > multiple sequences from one file.
> > 
> 
> Genbank format and FASTA are tested for reads of multiple entries.
> (Check further down the script.)
> 
I missed the Genbank test. As far as I can see the test for Fasta is
using Bio::SeqIO::MultiFile (test 17) or works on one sequence (tests
2-5).
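
The kind of multi-sequence check I have in mind can be sketched as
follows. This is a self-contained Python illustration of the test
pattern only (write several records to one file, read them all back,
compare), using a deliberately tiny FASTA reader and writer. The
function names are hypothetical and none of this is BioPerl code.

```python
def write_fasta(records):
    """Serialise (name, sequence) pairs as a multi-record FASTA string."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)

def read_fasta(text):
    """Parse every record in a FASTA string, not just the first one."""
    records = []
    name, parts = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(parts)))
            name, parts = line[1:].strip(), []
        elif line.strip():
            parts.append(line.strip())
    if name is not None:
        records.append((name, "".join(parts)))
    return records

def round_trip_ok(records):
    """True if writing and re-reading yields the same records in order."""
    return read_fasta(write_fasta(records)) == list(records)
```

A test that only reads the first record would pass even if the reader
never advanced past it, so the comparison is made on the full list.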


Kris,