[Biojava-l] [biojavax] EMBL parser : features parsing

Richard Holland richard.holland at ebi.ac.uk
Thu Apr 20 12:05:00 UTC 2006


Hi.

I made some small changes to the code, although nothing that would fix
this kind of problem, committed it back to CVS, checked it out again,
compiled, and ran a test program that read in an EMBL file with the
feature table you describe below, and output it in EMBL format to
another file. I then compared the two files... and found no differences!
The split-on-equals problem didn't occur, and all notes appeared
alongside their correct features.
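
For anyone who wants to check the two behaviours independently of
BioJavaX, here is a minimal standalone sketch of what the output should
look like: each qualifier stays with the feature it was read under, and
the qualifier is split at the first "=" only. This is not the EmblFormat
code. The names FeatureTableSketch, Feature and parse are hypothetical,
and the sketch assumes one qualifier per FT line with no continuation
lines.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Standalone illustration only; this is NOT the BioJavaX EmblFormat code.
// All names here (FeatureTableSketch, Feature, parse) are hypothetical.
class FeatureTableSketch {

    /** One feature: its key, its location, and its qualifiers in file order. */
    static final class Feature {
        final String key;
        final String location;
        // Simplification: a repeated qualifier name overwrites the earlier value.
        final Map<String, String> qualifiers = new LinkedHashMap<>();

        Feature(String key, String location) {
            this.key = key;
            this.location = location;
        }
    }

    /** Parses "FT" lines, assuming one qualifier per line and no continuations. */
    static List<Feature> parse(List<String> lines) {
        List<Feature> features = new ArrayList<>();
        Feature current = null;
        for (String line : lines) {
            if (!line.startsWith("FT") || line.length() <= 5) {
                continue; // not a feature-table line
            }
            String body = line.substring(5); // drop "FT" and its padding
            if (!body.startsWith(" ")) {
                // Text in the first column of the body starts a new feature.
                String[] parts = body.trim().split("\\s+", 2);
                current = new Feature(parts[0], parts.length > 1 ? parts[1] : "");
                features.add(current);
            } else if (current != null) {
                String qualifier = body.trim();
                if (!qualifier.startsWith("/")) {
                    continue; // continuation lines are out of scope here
                }
                // Split at the FIRST '=' only, so a value such as
                // "transcript_id=ENSMUST00000048680" is kept whole.
                int eq = qualifier.indexOf('=');
                String name = eq < 0 ? qualifier.substring(1) : qualifier.substring(1, eq);
                String value = eq < 0 ? "" : qualifier.substring(eq + 1);
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1); // strip the quotes
                }
                // Attach to the feature being read, never to the next one.
                current.qualifiers.put(name, value);
            }
        }
        return features;
    }
}
```

Feeding it the gene/mRNA/CDS lines from the message below should give a
gene carrying both /gene and /note, an mRNA carrying /gene and /product,
and a CDS with no qualifiers at all.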

Could the problem perhaps be in the script you are using?

I've really no idea what the problem is as I can't reproduce it based on
the current CVS contents!

cheers,
Richard

On Thu, 2006-04-20 at 11:35 +0200, Morgane THOMAS-CHOLLIER wrote:
> Hi,
> 
> I have tested today's version from CVS.
> 
> Both EBI and Ensembl files now behave the same way.
> However, the last annotation of each feature is still attached to the
> feature that immediately follows it.
> e.g. :
> 
> FT   gene            <1..>118
> FT                   /gene="Hoxb9"
> FT                   /note="Hoxb-9"
> FT   mRNA            <1..>118
> FT                   /gene="Hoxb9"
> FT                   /product="HOXB9"
> FT   CDS             <1..>118
> 
> /note="Hoxb-9" is related to mRNA
> /product="HOXB9" is related to CDS
> 
> As for the split-on-equals problem, I still observe it :
> 
>  [(#2) biojavax:note: transcript_i]
> 
> for this annotation :  /note="transcript_id=ENSMUST00000048680"
> 
> Thanks for helping,
> 
> Cheers,
> 
> Morgane.
> 
> Richard Holland wrote:
> > I have committed an UNTESTED patch based on Jolyon's suggestion, and
> > also attempted to fix the split-on-equals problem Morgane observed. 
> >
> > Please let me know if there are any problems with it.
> >
> > As this problem affected the UniProt parser in a similar manner (much of
> > the code is identical), the same fixes were applied there too.
> >
> > cheers,
> > Richard
> >
> > On Thu, 2006-04-13 at 17:42 +0100, Jolyon Holdstock wrote:
> >   
> >> Hi Morgane,
> >>
> >> I have amended the EmblFormat readSection method as below and the
> >> parsing seems to work; please test it.
> >>
> >> I think that the last bit of annotation is carried over into the next
> >> feature so before adding the new feature I dump the annotation and reset
> >> currentTag and currentVal.
> >>
> >> if (!line.startsWith(" ")) {
> >>   //--------- new code starts ---------------------------
> >>   // flush the pending tag/value so it stays with the previous feature
> >>   if (currentTag != null) {
> >>     section.add(new String[]{currentTag, currentVal.toString()});
> >>     currentTag = null;
> >>     currentVal = null;
> >>   }
> >>   //--------- new code ends -----------------------------
> >>   // case 1 : word value - splits into key-value on its own
> >>   section.add(line.split("\\s+"));
> >> }
> >>
> >> Cheers,
> >>
> >> Jolyon
> >>
> >>
> >>
> >> -----Original Message-----
> >> From: biojava-l-bounces at lists.open-bio.org
> >> [mailto:biojava-l-bounces at lists.open-bio.org] On Behalf Of Morgane
> >> THOMAS-CHOLLIER
> >> Sent: 12 April 2006 09:35
> >> To: biojava-l at open-bio.org
> >> Subject: [Biojava-l] [biojavax] EMBL parser : features parsing[Scanned]
> >>
> >> Hello again,
> >>
> >> I am currently using biojavax to parse EMBL files exported from Ensembl 
> >> website.
> >>
> >> Compared to the EBI files I have, they show a difference in the
> >> Features lines :
> >>
> >> sometimes, only one "/word" is present, i.e.:
> >>
> >> EBI file :
> >>
> >> FT   gene            <1..>118
> >> FT                   /gene="Hoxb9"
> >> FT                   /note="Hoxb-9"
> >>
> >> Ensembl file :
> >>
> >> FT   gene         complement(1..3218)
> >> FT                   /gene="ENSMUSG00000038227"
> >>
> >> The problem I encounter is that the parser correctly converts the
> >> "/word" into a Note, but the Note is then attached to the immediately
> >> following feature (i.e. mRNA).
> >> The current gene feature thus has no annotation.
> >>
> >> This behavior is reproducible when removing one "/word" of an EBI file.
> >>
> >> Apart from this issue, I noted that Ensembl EMBL files use "=" inside a
> >> feature annotation (e.g. /note="transcript_id=ENSMUST00000048680"),
> >> which ends up as an incomplete Note, as the parser seems to split on
> >> "=" to separate the Key and the Value.
> >>
> >> Thanks for your help,
> >>
> >> Morgane.
> >>
> >>     
> 
-- 
Richard Holland (BioMart Team)
EMBL-EBI
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
UNITED KINGDOM
Tel: +44-(0)1223-494416