Bioperl: Embl parsing oddities (long)

Keith James kdj@sanger.ac.uk
16 Mar 2000 01:12:07 +0000


I've been evaluating the pre-release 2 as a replacement for the Sanger
prEMBL modules for parsing EMBL feature tables in some of our
heavily-used scripts. I've figured out the basics of how the info is
stored by running some test scripts in the debugger, but there are a
couple of points that I could do with some advice on.

First off, the S. coelicolor genome project is cosmid based and so has
many CDSs which are partial because they run off the end of a
cosmid. I wrote a script to produce a database of translations for our
blast server which also collects the 'partial' CDSs and joins their
translations where possible, so that they can also be searched as
full-length sequences while we are still in progress.

Using the prEMBL modules I can get the location in the form:

<1..228

which indicates a partial CDS. There is no /partial qualifier as this
seems to be optional where the < implies that information. Neither
does there have to be a /codon_start=n, which would also warn of a
partial.

Looking at Bio::SeqIO::embl I can see a couple of lines where the <>'s
are stripped out, but can't see where this information has been stored
(has it been?). The resultant feature tells me that it starts at base
1, whereas I would say that biologically speaking it starts beyond
base 1 (of the cosmid). It's a vital piece of information.

It looks like I'm going to be writing some Embl->Ace stuff too and
need the same info to tag CDSs with Start_not_found/End_not_found.

Secondly, what behaviour can we expect for non-standard tags? Our
annotations are littered with 'made up qualifiers' while they are in
progress. At the moment the splitting of qualifiers into tag & value
seems to be unpredictable as I'm getting codon_start=2 or
label=SCJ4.52c as tags with the value being empty (looks like I should
send that to the bug list).

Finally, I'm not clear on which way the treatment of stop codons is
going to go. A CDS in EMBL includes the stop codon in its range, so
calling translate returns a string of amino acids with a * at the
end. Trying to do this with such a CDS

my $bio_aa = $bio_nt->translate();

throws an exception because the translation has a * which isn't
allowed by the sub seq in BIO::PrimarySeq (so I hacked it to allow my
tests to work). Also the length function includes the * when it
calculates the aa length (reported).

So will the *'s be staying in the translations, or will they be
disappearing in future? I can't think of a compelling reason for
either choice, myself.

On the whole it looks like a transition to Bioperl for some of the
Pathogens group scripts won't make me want to chew off my own limbs ;)

cheers,

Keith

-- 

Keith James  --  kdj@sanger.ac.uk  --  http://www.sanger.ac.uk/Users/kdj
The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambs CB10 1SA
=========== Bioperl Project Mailing List Message Footer =======
Project URL: http://bio.perl.org/
For info about how to (un)subscribe, where messages are archived, etc:
http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/vsns-bcd-perl.html
====================================================================