From p.j.a.cock at googlemail.com  Tue Aug  2 14:01:54 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 2 Aug 2011 19:01:54 +0100
Subject: [emboss-dev] EMBOSS 6.4.0 using EMBOSS_001 as the ID in ABI files
Message-ID: <CAKVJ-_66L2gJHL0jpDxK85AfkMQ6K53Dit7SONdEv7VgkcFPfw@mail.gmail.com>

Hi EMBOSS folk,

I'm reporting a regression in EMBOSS 6.4.0 spotted by Wibowo Arindrarto
who has been adding ABI support to Biopython.

With EMBOSS 6.3.1 compiled from source on Mac (as an example),

$ seqret -osformat="fastq-sanger" -filter 310.ab1
@D11F
TGATNTTNACNNTTTTGAANCANTGAGTTAATAGCAATNCTTTACNAATAAGAATATACACTTTCTGCTTAGGGATGATAATTGGCAGGCAAGTGAATCCCTGAGCGTGNATTTGATAATGACCTAAATAATGGATGGGGTTTTAATTCCCAGACCTTCCCCTTTTTAANNGGNGGATTANTGGGGGNNNAACNNGGGGGGCCCTTNCCNAAGGGGGAAAAAATTTNAAACCCCCCNAGGNNGGGNAAAAAAAAATTTCCAAATTNCCGGGGTNNCCCCCAANTTTTTNCCGCNGGGAAAANNNNCCCCCCCNGGGNCCCCCCCCNNAAAAAAAAAAAAAAAAACCCCCCCCCCNTTGGGGNGGTNTNCNCCCCCNNANAANNGGGGGNNAAAAAAAAAGGCCCCCCCCAAAAAAAACCCNCNTTCTNNCNNNNNGNNCNGNNCCCCCNNCCNTNTNGGGGGGGGGGGNGGAAAAAAAACCCCTTTNTGNNNANANNAACCCNCTCNTNTTTTTTTTTTTANGNNNNCNNNNCAAAAAAAAANCNCCCCCNNCNNNCNNNCNCCCCNNNNTNAAAANANNAANNNNTTTTTTTNGGGGGGGTGNGCGNCCCNNANCNNNNNNNNGCGNGGNCNCCNNCCCNCNANAAANNNTNTTTTTTTTTTTTTTTNTNNTCNNCCCNNNCCCCNNCCCCCCCCCCCCCNCCNCNNNNNGGGGNNNCGGNNCNNNNNNNCCNTNCTNNANATNCCNTTNNNNNNNNGNNNNNNNNACNNNNNTNNTNNNCNNNNNNNNNNNNNNCNNNNNNCNNCCCNNCANNNNNNNCNNNNNNNNNNNNNNNNNNNNNTCNCTNCNCNCCCCNCCCNNNNNNNG
+
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!


With EMBOSS 6.4.0 compiled from source on 64 bit Linux, rather
than the expected ID from within the file we get EMBOSS_001,

$ seqret -osformat="fastq-sanger" -filter 310.ab1
@EMBOSS_001
TGATNTTNACNNTTTTGAANCANTGAGTTAATAGCAATNCTTTACNAATAAGAATATACACTTTCTGCTTAGGGATGATAATTGGCAGGCAAGTGAATCCCTGAGCGTGNATTTGATAATGACCTAAATAATGGATGGGGTTTTAATTCCCAGACCTTCCCCTTTTTAANNGGNGGATTANTGGGGGNNNAACNNGGGGGGCCCTTNCCNAAGGGGGAAAAAATTTNAAACCCCCCNAGGNNGGGNAAAAAAAAATTTCCAAATTNCCGGGGTNNCCCCCAANTTTTTNCCGCNGGGAAAANNNNCCCCCCCNGGGNCCCCCCCCNNAAAAAAAAAAAAAAAAACCCCCCCCCCNTTGGGGNGGTNTNCNCCCCCNNANAANNGGGGGNNAAAAAAAAAGGCCCCCCCCAAAAAAAACCCNCNTTCTNNCNNNNNGNNCNGNNCCCCCNNCCNTNTNGGGGGGGGGGGNGGAAAAAAAACCCCTTTNTGNNNANANNAACCCNCTCNTNTTTTTTTTTTTANGNNNNCNNNNCAAAAAAAAANCNCCCCCNNCNNNCNNNCNCCCCNNNNTNAAAANANNAANNNNTTTTTTTNGGGGGGGTGNGCGNCCCNNANCNNNNNNNNGCGNGGNCNCCNNCCCNCNANAAANNNTNTTTTTTTTTTTTTTTNTNNTCNNCCCNNNCCCCNNCCCCCCCCCCCCCNCCNCNNNNNGGGGNNNCGGNNCNNNNNNNCCNTNCTNNANATNCCNTTNNNNNNNNGNNNNNNNNACNNNNNTNNTNNNCNNNNNNNNNNNNNNCNNNNNNCNNCCCNNCANNNNNNNCNNNNNNNNNNNNNNNNNNNNNTCNCTNCNCNCCCCNCCCNNNNNNNG
+
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
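
The difference between the two runs is only in the header line, so it can be checked mechanically. A minimal sketch, assuming seqret's output has been captured as text and that each FASTQ record occupies exactly four lines (as in the output above); the function names are illustrative and not part of EMBOSS or Biopython:

```python
def fastq_ids(text):
    """Return the record IDs from FASTQ text with four lines per record."""
    lines = text.strip().splitlines()
    ids = []
    for i in range(0, len(lines), 4):
        header = lines[i]
        if not header.startswith("@"):
            raise ValueError("line %d is not a FASTQ header: %r" % (i + 1, header))
        # The ID is the first whitespace-delimited word after the '@'.
        words = header[1:].split()
        ids.append(words[0] if words else "")
    return ids


def uses_placeholder_id(text, placeholder="EMBOSS_001"):
    """True if any record carries the placeholder instead of a stored name."""
    return placeholder in fastq_ids(text)


# Shortened stand-ins for the two outputs shown above.
old_output = "@D11F\nTGATNTT\n+\n!!!!!!!\n"
new_output = "@EMBOSS_001\nTGATNTT\n+\n!!!!!!!\n"
print(fastq_ids(old_output), uses_placeholder_id(new_output))
```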

Regards,

Peter Cock

---------- Forwarded message ----------
From: Wibowo Arindrarto <w.arindrarto at gmail.com>
Date: Sat, Jul 30, 2011 at 8:42 AM
Subject: Re: [Biopython-dev] SeqIO Abi Parser
To: Peter Cock <p.j.a.cock at googlemail.com>
Cc: biopython-dev at lists.open-bio.org


Hi Peter,
I've done some more improvements to the code:
- I've written the check and unittest for the file handle mode. I've
set it so that abi file has to be opened in 'rb' mode, otherwise it'll
return an error. While it's ok to open in 'r' mode in python 2 in
Linux, it has to be specified as 'rb' in Windows and/or Python 3 for
the file to be read correctly. So I decided forcing it to 'rb' is the
best. Because of this, I changed 'test_SeqIO.py:503' to include the
mode argument when opening.
- I've also checked against test_Emboss.py for seqret output, after
including the abi format in it. My EMBOSS version is 6.4.0. There was
a slight problem with this testing, since for some reason the ID
returned by seqret is always "EMBOSS_001". Something might be wrong
with my EMBOSS installation, since when I previously tested it against
6.1.0, the ID was correct (although the qual values were not, so I had to
upgrade). As expected, if I comment out the code that tests for
sequence id ('test_Emboss.py:168-172') the tests pass. Maybe you could
try testing it as well and see if EMBOSS also returns the default id
instead of the sample name?
- Finally, I did some small cosmetic changes to the code (typos, etc).
All changes have been pushed to my github fork. Now I still have time
for the weekend to improve whatever needs to be improved :).
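
A minimal sketch of the kind of handle check described in the first point, assuming Python 3's io module. The function name is hypothetical and this is not the code in the fork mentioned above:

```python
import io


def require_binary_handle(handle):
    """Return handle unchanged, or raise ValueError if it is in text mode."""
    # Handles opened with 'r' (or io.StringIO) are text streams.
    if isinstance(handle, io.TextIOBase):
        raise ValueError("ABI files must be opened in binary mode ('rb')")
    # Some file-like objects only expose a mode string; check that as well.
    mode = getattr(handle, "mode", "")
    if isinstance(mode, str) and mode and "b" not in mode:
        raise ValueError("ABI files must be opened in binary mode ('rb')")
    return handle


require_binary_handle(io.BytesIO(b"ABIF"))  # accepted
try:
    require_binary_handle(io.StringIO("ABIF"))
except ValueError as err:
    print("rejected:", err)
```
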
Regards,
---
Wibowo Arindrarto (bow)
http://bow.web.id


On Fri, Jul 29, 2011 at 18:20, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>
> Hi again,
>
> I had a bit of time this afternoon so I looked at this.
>
> On Fri, Jul 29, 2011 at 1:14 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> > On Fri, Jul 29, 2011 at 12:34 PM, Wibowo Arindrarto wrote:
> >> Hi Peter,
> >> Thanks for explaining. I understand why we should stick to the stored
> >> sequence id. In this case, we can use the filename as SeqRecord.name as
> >> well. Regarding BioPerl, I don't have it installed myself -- but I took a
> >> quick look at their source and it seems they also use the stored sequence ID
> >> as their main identifier instead of the filename. If the stored sequence ID
> >> is not present, it's "(unknown)" in their case.
> >
> > OK good, that means Biopython, BioPerl and EMBOSS should be
> > consistent :)
>
> I've made that switch,
>
> >> I'll look on the test_SeqIO.py over the weekend. I think it'll have
> >> something to do with some ambiguous dna base stored in the abi files.
> >> Regards,
> >
> > Some of the alphabet stuff is a bit nasty - so please feel free to ask
> > or get me to help.
>
> I've done enough to get the test_SeqIO.py unit test to pass.
>
> We probably need a check (like in SFF) to check the user hasn't given
> a handle opened in text mode. That should probably have a unit test
> too.
>
> I still haven't cross checked the sequence and PHRED scores from
> your code and EMBOSS.
>
> Anyway - I'll leave the code for you to work on for now...
>
> Peter


From pmr at ebi.ac.uk  Tue Aug  2 14:27:07 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 02 Aug 2011 19:27:07 +0100
Subject: [emboss-dev] EMBOSS 6.4.0 using EMBOSS_001 as the ID in ABI
	files
In-Reply-To: <CAKVJ-_66L2gJHL0jpDxK85AfkMQ6K53Dit7SONdEv7VgkcFPfw@mail.gmail.com>
References: <CAKVJ-_66L2gJHL0jpDxK85AfkMQ6K53Dit7SONdEv7VgkcFPfw@mail.gmail.com>
Message-ID: <4E38417B.6000505@ebi.ac.uk>

On 02/08/2011 19:01, Peter Cock wrote:
> Hi EMBOSS folk,
>
> I'm reporting a regression in EMBOSS 6.4.0 spotted by Wibowo Arindrarto
> who has been adding ABI support to Biopython.
>
> With EMBOSS 6.4.0 compiled from source on 64 bit Linux, rather
> than the expected ID from within the file we get EMBOSS_001,

Can you please run with -debug on the command line and send me the 
seqret.dbg file to see what it thought was in the file

regards,

Peter Rice
EMBOSS team

From p.j.a.cock at googlemail.com  Wed Aug  3 03:57:01 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 3 Aug 2011 08:57:01 +0100
Subject: [emboss-dev] EMBOSS 6.4.0 using EMBOSS_001 as the ID in ABI
	files
In-Reply-To: <4E38417B.6000505@ebi.ac.uk>
References: <CAKVJ-_66L2gJHL0jpDxK85AfkMQ6K53Dit7SONdEv7VgkcFPfw@mail.gmail.com>
	<4E38417B.6000505@ebi.ac.uk>
Message-ID: <CAKVJ-_7jeiY9yPriP8TkjHuoHxKQ80rSG_=+W9=cPXV+zZd_tQ@mail.gmail.com>

On Tue, Aug 2, 2011 at 7:27 PM, Peter Rice <pmr at ebi.ac.uk> wrote:
>
> On 02/08/2011 19:01, Peter Cock wrote:
>>
>> Hi EMBOSS folk,
>>
>> I'm reporting a regression in EMBOSS 6.4.0 spotted by Wibowo Arindrarto
>> who has been adding ABI support to Biopython.
>>
>> With EMBOSS 6.4.0 compiled from source on 64 bit Linux, rather
>> than the expected ID from within the file we get EMBOSS_001,
>
> Can you please run with -debug on the command line and send me the
> seqret.dbg file to see what it thought was in the file

No problem - sent directly to Peter R,

Peter

From ajb at ebi.ac.uk  Thu Aug 11 09:22:25 2011
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Thu, 11 Aug 2011 14:22:25 +0100 (BST)
Subject: [emboss-dev] EMBOSS and mEMBOSS bug-fixes for 6.4.0 released
Message-ID: <53905.82.26.12.214.1313068945.squirrel@imap04.ebi.ac.uk>

New bug-fix files are available for EMBOSS-6.4.0 and, for Windows
users, a new version of mEMBOSS is available.

The bugs fixed are appended for easy reference.

1) UNIX

As usual, the most convenient way of applying the bug-fixes should be
to apply the patch file:

ftp://emboss.open-bio.org/pub/EMBOSS/fixes/patches/patch-1-11.gz

to a freshly extracted copy of the EMBOSS-6.4.0.tar.gz source code
and then to recompile and install.

(see ftp://emboss.open-bio.org/pub/EMBOSS/fixes/patches/README.patch
 for instructions on using 'patch').

Alternatively, you can individually copy the patched files
from the ftp://emboss.open-bio.org/pub/EMBOSS/fixes/ directory
if your system does not support 'patch'.

2) mEMBOSS

The new version incorporates all the bug-fixes listed below.
Uninstall your previous mEMBOSS installation and download and install
the new setup file from:

ftp://emboss.open-bio.org/pub/EMBOSS/windows/mEMBOSS-6.4.0.2-setup.exe


Alan

-----------------------------------------------------------------------

Fix 1. EMBOSS-6.4.0/emboss/dbiflat.c
       EMBOSS-6.4.0/emboss/dbxflat.c

10 Aug 2011: The SwissProt description line format includes additional
             tags which interfere with the EMBL parser used in
             previous releases. The fix replaces this with a SwissProt
             parser that strips out the extra tags. After patching the
             release, any existing SwissProt description index files
             should be reindexed. Other indexes are unchanged.

Fix 2. EMBOSS-6.4.0/ajax/core/ajquery.c

10 Aug 2011: For databases with more than one valid format (examples
             include the EBI dbfetch server) this fix allows the
             format to be specified with a qualifier on the command
             line. In the original release, only a format in the query
             string was used.


Fix 3. EMBOSS-6.4.0/ajax/core/ajfeatread.c

10 Aug 2011: When parsing GFF3 format input, long feature tags (for
             example extremely long translations) exceeded limits in
             regular expression parsing. This fix decouples testing for
             escaped quotes from the main task of finding quoted
             strings.


Fix 4. EMBOSS-6.4.0/emboss/data/Etcode.dat

10 Aug 2011: The local data file used by application tcode had a missing
             parameter line.


Fix 5. EMBOSS-6.4.0/ajax/core/ajrange.c

10 Aug 2011: When sequence ranges (and possible highlighting for
             showalign) were in a list file, the parser overwrote
             string values.


Fix 5. EMBOSS-6.4.0/ajax/core/ajseqabi.c

10 Aug 2011: Sample names in ABI format files were stored in
             incompletely defined strings. This fix corrects the
             string object. The sample name is also used as the
             sequence name.


Fix 6. EMBOSS-6.4.0/emboss/dbxresource.c

10 Aug 2011: A future change to the format of Data Resource Catalogue
             entries in DRCAT.dat requires an update to the parsing of
             category lines. The current version is not affected.


Fix 7. EMBOSS-6.4.0/emboss/server.ensemblgenomes
       EMBOSS-6.4.0/emboss/cacheensembl.c
       EMBOSS-6.4.0/ajax/ensembl/ensregistry.c
       EMBOSS-6.4.0/ajax/ensembl/ensregistry.c
       EMBOSS-6.4.0/ajax/ensembl/ensdatabaseadaptor.c
       EMBOSS-6.4.0/ajax/ensembl/ensdatabaseadaptor.h

10 Aug 2011: Microbial genomes use an enumerated species code which
             must be added to the query for data retrieval. This fix
             adds the species code to the comment field. In the next
             release a more complete solution will be implemented.


Fix 8. EMBOSS-6.4.0/ajax/core/ajarch.h

10-Aug-2011: Corrects the size of long integers on Windows systems only.


Fix 9. EMBOSS-6.4.0/emboss/cirdna.c

10-Aug-2011: Cirdna prints text inside solid blocks invisibly. When
             printed outside, the text scaling was too small. The text
             scale is now adjusted for the radius and sequence length
             so that labels should be readable outside the box.


Fix 10. EMBOSS-6.4.0/ajax/core/ajpat.c

10-Aug-2011: Fuzznuc, fuzzpro and fuzztran using a pattern file
             ignored the command line -mismatch qualifier for the
             first pattern. The default mismatch is now set to this
             value at the start of the pattern matching loop in the
             library.


Fix 11. EMBOSS-6.4.0/ajax/core/ajfmt.c

11-Aug-2011: The function ajFmtScanF() handled va_list incorrectly. This
             potentially affected only code developers.


From ajb at ebi.ac.uk  Thu Aug 11 11:58:25 2011
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Thu, 11 Aug 2011 16:58:25 +0100 (BST)
Subject: [emboss-dev] EMBOSS and mEMBOSS bug-fixes for 6.4.0 released
Message-ID: <49005.82.26.12.214.1313078305.squirrel@imap04.ebi.ac.uk>

UNIX users who downloaded the bug-fix patch file for EMBOSS earlier
this afternoon may have found that there were compilation problems on
a limited number of architectures.

The patch has been amended slightly to hopefully fix this problem
so please download it again if you were affected.

If anyone continues to experience compilation problems then
please let me know.

Alan


From p.j.a.cock at googlemail.com  Tue Aug 16 11:03:26 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 16 Aug 2011 16:03:26 +0100
Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO)
Message-ID: <CAKVJ-_6N0hWOu2o1GAdV81YJiEqzMCfLJmaHnWnNa9SwhQUSyQ@mail.gmail.com>

Dear Peter R. (et al.),

I recall from one of our chats in person that EMBOSS has some
mapping tables to convert the various different data file formats'
feature names into a common standard (the Sequence Ontology?),
for the purpose of inter-converting files. e.g. Converting a UniProt/
SwissProt plain text protein file into a GenPept protein file or GFF3

Is that a fair summary?

It seems to match the minutes of this meeting (found with
Google) http://emboss.sourceforge.net/meetings/2009-02-16.html

> DASGFF requires a sequence ontology (or BioSapiens
> ontology) tag for protein features. Peter has updated the
> Efeatures definitions for proteins to use GFF3 sequence
> ontology codes as internal identifiers, and to use GFF3
> as the principal definitions for all protein features. All
> SwissProt feature types (36 in the current Swissprot
> release) are also defined with the closest possible match
> to the sequence ontology. Where there is no exact match,
> an EMBOSS internal type is defined using the closest SO
> code and the original feature type as a suffix. For SwissProt
> output this is converted back to the swissprot feature type.
> For GFF3 output the internal type is an alias for the closest
> (more general) SO term.

Can you point me at these mapping tables in the EMBOSS
source code please?

I'm particularly interested in the SwissProt to SO mapping
right now.

Thanks.

Peter C.

From pmr at ebi.ac.uk  Tue Aug 16 11:26:51 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 16 Aug 2011 16:26:51 +0100
Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO)
In-Reply-To: <4E4A8BF7.4020106@ebi.ac.uk>
References: <4E4A8BF7.4020106@ebi.ac.uk>
Message-ID: <4E4A8C3B.8030306@ebi.ac.uk>

On 08/16/2011 04:03 PM, Peter Cock wrote:
> Dear Peter R. (et al.),
> 
> I recall from one of our chats in person that EMBOSS has some
> mapping tables to convert the various different data file formats'
> feature names into a common standard (the Sequence Ontology?),
> for the purpose of inter-converting files. e.g. Converting a UniProt/
> SwissProt plain text protein file into a GenPept protein file or GFF3
> 
> Is that a fair summary?

Yes, we needed an internal identifier for feature types, and picked SO
for nucleotides - and then were able to add the protein terms when they
became available.

There are a few made-up internal names, with _text after the SO term,
that were needed in the early days of the BioSapiens Ontology, and some
dodgy mappings between SO and EMBL/GenBank for immunoglobulin gene
regions, but I believe they are no longer used.

The first term in the file is defined as the default if nothing is
recognized (region or misc_feature)
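
The fallback rule can be sketched as follows. This is an illustration only: the table below is a hypothetical in-memory list of (native type, internal type) pairs, not the real layout of the Efeatures files, which is not shown in this thread.

```python
def map_feature_type(native_type, table):
    """Map a native feature type to an internal one.

    table is a list of (native_type, internal_type) pairs in file order;
    the first entry doubles as the default for unrecognized types.
    """
    if not table:
        raise ValueError("mapping table is empty")
    default_internal = table[0][1]
    return dict(table).get(native_type, default_internal)


# Hypothetical entries, loosely modelled on the terms quoted in this thread.
table = [
    ("REGION", "region"),
    ("HELIX", "alpha_helix"),
    ("STRAND", "beta_strand"),
]
print(map_feature_type("HELIX", table))
print(map_feature_type("NOT_A_KNOWN_KEY", table))
```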

> Can you point me at these mapping tables in the EMBOSS
> source code please?

emboss/data/Efeatures.embl
emboss/data/Efeatures.swiss

> I'm particularly interested in the SwissProt to SO mapping
> right now.

That was originally done by the BioSapiens "Network of excellence" for
annotating ENCODE data. They developed the protein features which were
then added to the sequence ontology.

You can look at SO terms in EMBOSS with:

ontoget so:0001094

or

ontoget -filter -oformat excel so:0001094

(Hmmm, should do something better for a missing namespace - it was
defined as a format for EDAM)


Let me know if you spot anything in need of updating.

We also have (especially for EMBL) equivalent Etags files listing the
available feature qualifiers.

regards,

Peter Rice
EMBOSS Team

From p.j.a.cock at googlemail.com  Tue Aug 16 11:36:24 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 16 Aug 2011 16:36:24 +0100
Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO)
In-Reply-To: <4E4A8C3B.8030306@ebi.ac.uk>
References: <4E4A8BF7.4020106@ebi.ac.uk>
	<4E4A8C3B.8030306@ebi.ac.uk>
Message-ID: <CAKVJ-_7XSaQ2D+rQ+5d9q-KdvnHdj2s8qN6r+RjEYHwy=YBz1w@mail.gmail.com>

On Tue, Aug 16, 2011 at 4:26 PM, Peter Rice <pmr at ebi.ac.uk> wrote:
> Yes, we needed an internal identifier for feature types, and picked SO
> for nucleotides - and then were able to add the protein terms when they
> became available.
> ...

Thanks!

>
> Let me know if you spot anything in need of updating.
>

I have found three protein features which have been
renamed, and one which appears to be wrong... see below.

I recently noticed that UniProt provides GFF3 files,
e.g. http://www.uniprot.org/uniprot/P99999.gff

========================================
##gff-version 3
##sequence-region P99999 1 105
P99999	UniProtKB	Initiator methionine	1	1	.	.	.	Note=Removed	
P99999	UniProtKB	Chain	2	105	.	.	.	ID=PRO_0000108218;Note=Cytochrome c	
P99999	UniProtKB	Metal binding	19	19	.	.	.	Note=Iron (heme axial ligand)	
P99999	UniProtKB	Metal binding	81	81	.	.	.	Note=Iron (heme axial ligand)	
P99999	UniProtKB	Binding site	15	15	.	.	.	Note=Heme (covalent)	
P99999	UniProtKB	Binding site	18	18	.	.	.	Note=Heme (covalent)	
P99999	UniProtKB	Modified residue	2	2	.	.	.	Note=N-acetylglycine	
P99999	UniProtKB	Modified residue	49	49	.	.	.	Note=Phosphotyrosine;Status=By similarity
P99999	UniProtKB	Modified residue	98	98	.	.	.	Note=Phosphotyrosine;Status=By similarity
P99999	UniProtKB	Natural variant	42	42	.	.	.	ID=VAR_044450;Note=In THC4%3B increases the pro-apoptotic function by triggering caspase activation more efficiently than wild-type%3B does not affect the redox function.
P99999	UniProtKB	Natural variant	56	56	.	.	.	ID=VAR_048850
P99999	UniProtKB	Natural variant	66	66	.	.	.	ID=VAR_002204;Note=In 10%25 of the molecules.
P99999	UniProtKB	Sequence conflict	18	18	.	.	.	.	
P99999	UniProtKB	Sequence conflict	41	41	.	.	.	.	
P99999	UniProtKB	Helix	4	14	.	.	.	.	
P99999	UniProtKB	Turn	16	18	.	.	.	.	
P99999	UniProtKB	Beta strand	23	25	.	.	.	.	
P99999	UniProtKB	Beta strand	28	30	.	.	.	.	
P99999	UniProtKB	Turn	36	38	.	.	.	.	
P99999	UniProtKB	Helix	51	56	.	.	.	.	
P99999	UniProtKB	Helix	62	70	.	.	.	.	
P99999	UniProtKB	Helix	72	75	.	.	.	.	
P99999	UniProtKB	Helix	89	102	.	.	.	.	
========================================

However, they are not using Sequence Ontology terms
in column three and so fail the online GFF3 validator
http://modencode.oicr.on.ca/cgi-bin/validate_gff3_online
listed in http://www.sequenceontology.org/gff3.shtml
(GFF3 specification currently at v1.20). Additionally
that UniProt GFF3 uses an upper case reserved tag,
"Status" rather than perhaps "status", in the modified
residue features.
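
Both complaints can be reproduced with a small check on a single feature line. A rough sketch, not the validator's actual logic: the reserved tag list is the one defined in the GFF3 specification, while the set of known types passed in here is a tiny illustrative subset standing in for the full Sequence Ontology.

```python
RESERVED_TAGS = {
    "ID", "Name", "Alias", "Parent", "Target", "Gap", "Derives_from",
    "Note", "Dbxref", "Ontology_term", "Is_circular",
}


def check_gff3_line(line, known_types):
    """Return a list of problems found in one GFF3 feature line."""
    cols = line.strip().split("\t")
    if len(cols) != 9:
        return ["expected 9 tab-separated columns, found %d" % len(cols)]
    problems = []
    if cols[2] not in known_types:
        problems.append("type %r is not a known SO term" % cols[2])
    if cols[8] != ".":
        for pair in cols[8].split(";"):
            tag = pair.split("=", 1)[0]
            # Tags starting with an upper case letter are reserved by the spec.
            if tag[:1].isupper() and tag not in RESERVED_TAGS:
                problems.append("tag %r is upper case but not reserved" % tag)
    return problems


known = {"cleaved_initiator_methionine", "beta_strand"}  # illustrative subset
uniprot = ("P99999\tUniProtKB\tModified residue\t49\t49\t.\t.\t.\t"
           "Note=Phosphotyrosine;Status=By similarity")
for problem in check_gff3_line(uniprot, known):
    print(problem)
```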

I will report this to UniProt later. However, first I thought
I would try converting one of the other files provided into
GFF3 using EMBOSS seqret for an alternative, e.g. the
plain text "swiss" format: http://www.uniprot.org/uniprot/P99999.txt

I can convert this using seqret as follows:

========================================
$ seqret -feature -osformat=gff3 -sformat=swiss -sequence P99999.txt -stdout -auto
##gff-version 3
##sequence-region CYC_HUMAN 1 105
#!Date 2011-08-16
#!Type Protein
#!Source-version EMBOSS 6.4.0.0
CYC_HUMAN	SWISSPROT	cleaved_initiator_methionine	1	1	.	+	.	ID=CYC_HUMAN.1;note=Removed
CYC_HUMAN	SWISSPROT	mature_protein_region	2	105	.	+	.	ID=CYC_HUMAN.2;note=Cytochrome c;ftid=PRO_0000108218
CYC_HUMAN	SWISSPROT	metal_binding	19	19	.	+	.	ID=CYC_HUMAN.3;note=Iron;comment=heme axial ligand
CYC_HUMAN	SWISSPROT	metal_binding	81	81	.	+	.	ID=CYC_HUMAN.4;note=Iron;comment=heme axial ligand
CYC_HUMAN	SWISSPROT	binding_site	15	15	.	+	.	ID=CYC_HUMAN.5;note=Heme;comment=covalent
CYC_HUMAN	SWISSPROT	binding_site	18	18	.	+	.	ID=CYC_HUMAN.6;note=Heme;comment=covalent
CYC_HUMAN	SWISSPROT	protein_modification_categorized_by_chemical_process	2	2	.	+	.	ID=CYC_HUMAN.7;note=N-acetylglycine
CYC_HUMAN	SWISSPROT	protein_modification_categorized_by_chemical_process	49	49	.	+	.	ID=CYC_HUMAN.8;note=Phosphotyrosine;comment=By similarity
CYC_HUMAN	SWISSPROT	protein_modification_categorized_by_chemical_process	98	98	.	+	.	ID=CYC_HUMAN.9;note=Phosphotyrosine;comment=By similarity
CYC_HUMAN	SWISSPROT	natural_variant	42	42	.	+	.	ID=CYC_HUMAN.10;note=G -> S;comment=in THC4%3B increases the pro- apoptotic function by triggering caspase activation more efficiently than wild- type%3B does not affect the redox function;ftid=VAR_044450
CYC_HUMAN	SWISSPROT	natural_variant	56	56	.	+	.	ID=CYC_HUMAN.11;note=K -> R;comment=in dbSNP:rs11548795;ftid=VAR_048850
CYC_HUMAN	SWISSPROT	natural_variant	66	66	.	+	.	ID=CYC_HUMAN.12;note=M -> L;comment=in 10%25 of the molecules;ftid=VAR_002204
CYC_HUMAN	SWISSPROT	sequence_conflict	18	18	.	+	.	ID=CYC_HUMAN.13;note=C -> Y;comment=in Ref. 8%3B AAH15130
CYC_HUMAN	SWISSPROT	sequence_conflict	41	41	.	+	.	ID=CYC_HUMAN.14;note=T -> I;comment=in Ref. 8%3B AAH68464
CYC_HUMAN	SWISSPROT	alpha_helix	4	14	.	+	.	ID=CYC_HUMAN.15
CYC_HUMAN	SWISSPROT	turn	16	18	.	+	.	ID=CYC_HUMAN.16
CYC_HUMAN	SWISSPROT	beta_strand	23	25	.	+	.	ID=CYC_HUMAN.17
CYC_HUMAN	SWISSPROT	beta_strand	28	30	.	+	.	ID=CYC_HUMAN.18
CYC_HUMAN	SWISSPROT	turn	36	38	.	+	.	ID=CYC_HUMAN.19
CYC_HUMAN	SWISSPROT	alpha_helix	51	56	.	+	.	ID=CYC_HUMAN.20
CYC_HUMAN	SWISSPROT	alpha_helix	62	70	.	+	.	ID=CYC_HUMAN.21
CYC_HUMAN	SWISSPROT	alpha_helix	72	75	.	+	.	ID=CYC_HUMAN.22
CYC_HUMAN	SWISSPROT	alpha_helix	89	102	.	+	.	ID=CYC_HUMAN.23
##FASTA
>CYC_HUMAN P99999 Cytochrome c
MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIW
GEDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE
========================================

Interestingly EMBOSS includes the sequence at the bottom
(using the FASTA directive) and has generated unique ID tags
for each feature. It has also added more note tags.

Unfortunately this also failed the GFF3 validation. The EMBOSS
output does a lot better (e.g. "cleaved_initiator_methionine" is
valid while "Initiator methionine" in the UniProt file was not).

However, some of the terms in column 3 are apparently out of
date - but http://www.sequenceontology.org does list them as
synonyms:

* metal_binding -> polypeptide_metal_contact
* natural_variant -> natural_variant_site
* turn -> polypeptide_turn_motif

It looks like the EMBOSS sequence ontology table may need
updating for at least these three cases.
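
As a stop-gap until the table is updated, the three renames could be applied to seqret's output. A minimal sketch, assuming tab-separated GFF3 lines; the mapping holds only the three renames listed above and the function name is illustrative:

```python
RENAMED_TERMS = {
    "metal_binding": "polypeptide_metal_contact",
    "natural_variant": "natural_variant_site",
    "turn": "polypeptide_turn_motif",
}


def update_feature_type(line):
    """Replace an out-of-date term in column three of a GFF3 feature line."""
    if line.startswith("#") or not line.strip():
        return line  # directives, comments and blank lines pass through
    cols = line.split("\t")
    if len(cols) >= 3:
        cols[2] = RENAMED_TERMS.get(cols[2], cols[2])
    return "\t".join(cols)


print(update_feature_type(
    "CYC_HUMAN\tSWISSPROT\tturn\t16\t18\t.\t+\t.\tID=CYC_HUMAN.16"))
```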

Finally protein_modification_categorized_by_chemical_process
does not seem to be valid (I failed to find it in the ontology).

Additionally the validator complained about part of the note
on line 15, probably due to the %3B escaped semi-colon,
but that may be a bug in the validator.

Peter C.

From p.j.a.cock at googlemail.com  Tue Aug 16 14:39:05 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 16 Aug 2011 19:39:05 +0100
Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO)
In-Reply-To: <CAKVJ-_7XSaQ2D+rQ+5d9q-KdvnHdj2s8qN6r+RjEYHwy=YBz1w@mail.gmail.com>
References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk>
	<CAKVJ-_7XSaQ2D+rQ+5d9q-KdvnHdj2s8qN6r+RjEYHwy=YBz1w@mail.gmail.com>
Message-ID: <CAKVJ-_4_HRr-LLj2yBQ2=X2QtLktKpJ9P0o60cyqQ-pnUY01Tg@mail.gmail.com>

On Tue, Aug 16, 2011 at 4:36 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>
> I recently noticed that UniProt provides GFF3 files,
> e.g. http://www.uniprot.org/uniprot/P99999.gff
>
> ...
> http://www.uniprot.org/uniprot/P99999.txt
> ...
>
> $ seqret -feature -osformat=gff3 -sformat=swiss -sequence P99999.txt

I also noticed the seqret GFF3 output is using "+" as the strand,
which is wrong for a protein reference like this. It should be using
"." (period) as the features on a protein are strand-less (as done
in the UniProt GFF3 file).
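
Until this is changed in EMBOSS, the strand column of protein output could be corrected after the fact. A minimal sketch, assuming tab-separated GFF3 feature lines; the function name is illustrative:

```python
def make_strandless(line):
    """Set column seven (strand) of a GFF3 feature line to '.'."""
    if line.startswith("#") or not line.strip():
        return line  # directives, comments and blank lines pass through
    cols = line.split("\t")
    if len(cols) >= 9:  # sequence lines after ##FASTA contain no tabs
        cols[6] = "."
    return "\t".join(cols)


print(make_strandless(
    "CYC_HUMAN\tSWISSPROT\talpha_helix\t4\t14\t.\t+\t.\tID=CYC_HUMAN.15"))
```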

Regards,

Peter C.

From p.j.a.cock at googlemail.com  Wed Aug 17 06:37:06 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 17 Aug 2011 11:37:06 +0100
Subject: [emboss-dev] Problems with EMBOSS seqret GenBank to GFF3
Message-ID: <CAKVJ-_6hc6j+oy3bZ+ydWL1KJ7m4Xh2xnz-Cjx1vhNUf_dzN8A@mail.gmail.com>

Hi again Peter R. (et al.),

Following yesterday's discussion about GFF3 files from UniProt,
I'm trying seqret to produce GFF3 from GenBank files. I'd already
found the NCBI currently provides some very broken GFF3 files:

http://blastedbio.blogspot.com/2011/08/why-are-ncbi-gff3-files-still-broken.html

$ wget ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans_Kin4_M_uid58009/NC_005213.gff
$ wget ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans_Kin4_M_uid58009/NC_005213.gbk
$ seqret --version
EMBOSS:6.4.0.0
$ seqret -filter -feature -sequence NC_005213.gbk -sformat=genbank -osformat=gff3 | head -n 20
##gff-version 3
##sequence-region NC_005213 1 490885
#!Date 2011-08-17
#!Type DNA
#!Source-version EMBOSS 6.4.0.0
NC_005213	EMBL	databank_entry	1	490885	.	+	.	ID=NC_005213.1;organism=Nanoarchaeum equitans Kin4-M;mol_type=genomic DNA;strain=Kin4-M;db_xref=taxon:228908
NC_005213	EMBL	gene	3254	35301	.	+	.	ID=NC_005213.2;locus_tag=NEQ_t01;experiment=experimental evidence%2C no additional details recorded;trans_splicing=true;db_xref=GeneID:3362429
NC_005213	EMBL	gene	35233	35301	.	+	.	Parent=NC_005213.2
NC_005213	EMBL	gene	3254	3289	.	+	.	Parent=NC_005213.2
NC_005213	EMBL	tRNA	3254	35287	.	+	.	ID=NC_005213.5;locus_tag=NEQ_t01;product=tRNA-Met;experiment=experimental evidence%2C no additional details recorded;trans_splicing=true;db_xref=GeneID:3362429
NC_005213	EMBL	tRNA	35249	35287	.	+	.	Parent=NC_005213.5
NC_005213	EMBL	tRNA	3254	3289	.	+	.	Parent=NC_005213.5
NC_005213	EMBL	gene	1	490885	.	-	.	ID=NC_005213.8;locus_tag=NEQ001;db_xref=GeneID:2732620
NC_005213	EMBL	gene	490883	490885	.	-	.	Parent=NC_005213.8
NC_005213	EMBL	gene	1	879	.	-	.	Parent=NC_005213.8
NC_005213	EMBL	CDS	1	490885	.	-	0	ID=NC_005213.11;locus_tag=NEQ001;note=conserved hypothetical [Methanococcus jannaschii]%3B COG1583:Uncharacterized ACR%3B IPR001472:Bipartite nuclear localization signal%3B IPR002743: Protein of unknown function DUF57;codon_start=1;transl_table=11;product=hypothetical protein;protein_id=NP_963295.1;db_xref=GI:41614797;db_xref=GeneID:2732620;translation=MRLLLELKALNSIDKKQLSNYLIQGFIYNILKNTEYSWLHNWKKEKYFNFTLIPKKDIIENKRYYLIISSPDKRFIEVLHNKIKDLDIITIGLAQFQLRKTKKFDPKLRFPWVTITPIVLREGKIVILKGDKYYKVFVKRLEELKKYNLIKKKEPILEEPIEISLNQIKDGWKIIDVKDRYYDFRNKSFSAFSNWLRDLKEQSLRKYNNFCGKNFYFEEAIFEGFTFYKTVSIRIRINRGEAVYIGTLWKELNVYRKLDKEEREFYKFLYDCGLGSLNSMGFGFVNTKKNSAR
NC_005213	EMBL	CDS	490883	490885	.	-	0	Parent=NC_005213.11
NC_005213	EMBL	CDS	1	879	.	-	0	Parent=NC_005213.11
NC_005213	EMBL	sequence_feature	7	879	.	-	.	ID=NC_005213.14;locus_tag=NEQ001;note=CRISPR/Cas system-associated RAMP superfamily protein Cas6%3B Region: Cas6-I-III%3B cl11443;db_xref=CDD:196236
NC_005213	EMBL	gene	883	2691	.	+	.	ID=NC_005213.15;locus_tag=NEQ003;db_xref=GeneID:2654355

I've deliberately cut the example here to include all of NEQ_t01, an
interesting trans-spliced tRNA, and all of NEQ001, an interesting gene
because it spans the origin of this circular genome. I use these examples
in the blog post and discuss them again below.

Given some of the points below, I suspect EMBOSS is producing GFF3
prior to the additions made in v1.18 (24 June 2010) regarding circular
genomes.

The following numbering reflects the issues listed on my blog post
about the NCBI version of the GFF3 file (link given above).

------------------------------------------

Problem One - Invalid Feature Types

EMBOSS looks OK here, you're converting the GenBank feature types
source and misc_feature into databank_entry and sequence_feature
respectively.

------------------------------------------

Problem Two - Circular features not marked

EMBOSS is also lacking in this area.

EMBOSS has used feature type databank_entry and generated feature ID
NC_005213.1 for the landmark. However, this should include the special
tag entry Is_circular=true, since this is the landmark feature for the whole
circular chromosome.
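
Adding the tag to the landmark line is a small change to column nine. A minimal sketch, assuming a tab-separated GFF3 feature line; the function name is illustrative and this is not how EMBOSS builds its output:

```python
def mark_circular(line):
    """Append Is_circular=true to the attributes of a GFF3 landmark line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns")
    attrs = [] if cols[8] in ("", ".") else cols[8].split(";")
    # Do not add the tag twice if it is already present.
    if not any(a.split("=", 1)[0] == "Is_circular" for a in attrs):
        attrs.append("Is_circular=true")
    cols[8] = ";".join(attrs)
    return "\t".join(cols)


landmark = ("NC_005213\tEMBL\tdatabank_entry\t1\t490885\t.\t+\t.\t"
            "ID=NC_005213.1;strain=Kin4-M")
print(mark_circular(landmark))
```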

------------------------------------------

Problem Three - Missing ID tags on multi-location features

Unlike the NCBI file which fails to cross link multi-location features like
trans-spliced NEQ_t01, EMBOSS looks better. However, I don't think
you are following the expected pattern as used in the canonical GFF3
examples.

In the GenBank file, this tRNA is join(35233..35301,3254..3289)

For the gene and tRNA features for NEQ_t01, EMBOSS is generating
three GFF3 lines. First a very broad parent feature 3254 to 35301,
then two children 35233 to 35301 and 3254 to 3289.

I would expect two GFF3 lines (for each of gene and tRNA), just
35233 to 35301 and 3254 to 3289 which would be linked by virtue
of having the same ID.
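
The pattern I have in mind can be sketched like this: one GFF3 line per sub-location, with no broad parent line, and every line carrying the same ID. The function name and the ID value are made up for illustration:

```python
def gff3_lines_for_join(seqid, source, ftype, parts, strand, feature_id):
    """Build one GFF3 line per sub-location, all sharing a single ID."""
    lines = []
    for start, end in parts:
        cols = [seqid, source, ftype, str(start), str(end),
                ".", strand, ".", "ID=%s" % feature_id]
        lines.append("\t".join(cols))
    return lines


# The two pieces of the NEQ_t01 gene; "gene_NEQ_t01" is a hypothetical ID.
parts = [(35233, 35301), (3254, 3289)]
for line in gff3_lines_for_join("NC_005213", "EMBL", "gene", parts, "+",
                                "gene_NEQ_t01"):
    print(line)
```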

The online GFF3 validator would seem to support my interpretation,
reporting errors like this:

8            [ERROR]   invalid type pair - check all parents (at line 7; gene to gene)
11           [ERROR]   invalid type pair - check all parents (at line 10; tRNA to tRNA)
14           [ERROR]   invalid type pair - check all parents (at line 13; gene to gene)
17           [ERROR]   invalid type pair - check all parents (at line 16; CDS to CDS)
28           [ERROR]   invalid type pair - check all parents (at line 27; sequence_feature to sequence_feature)


This is related to "Problem Six" and "Problem Seven" below.

------------------------------------------

Problem Four - Wrong tag for database cross references

I had noticed the NCBI using a local tag (lower case) db_xref rather
than the standard (upper case = reserved) tag Dbxref. EMBOSS
does the same - is this deliberate and if so why?

------------------------------------------

Problem Five - Missing stop codon in CDS features

EMBOSS looks OK here

------------------------------------------

Problem Six - Features wrapping the origin of a circular genome

Related to the landmark feature lacking the Is_circular=true tag, the
gene and CDS features for origin wrapping NEQ003 look funny to me.
EMBOSS seems to be generating three GFF3 lines for the gene and CDS
for NEQ003, a surprisingly broad entry 1 to 490885 and two children
490883 to 490885 and 1 to 879 (which do look sensible).

This is essentially the same point I raised above with NEQ_t01, but
with the added complication of spanning the origin.

Based on the old specification, I had expected two GFF3 lines each for the
gene and CDS, giving the regions 490883 to 490885 and 1 to 879, linked
by virtue of having the same ID.

Thankfully this potential confusion has been addressed in the updated
specification, so I would expect a single GFF3 line for each of the gene
and CDS for NEQ003, using start 490883 and end of 879+490885=491764.
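
The arithmetic above can be sketched in a few lines of Python. The function
name is hypothetical, and the sketch assumes the feature has exactly two
parts that meet at the origin of a circular sequence of known length.

```python
def wrapped_span(before_origin, after_origin, seq_length):
    """Collapse the two parts of an origin-spanning feature into one (start, end).

    The end is reported past the sequence length, as in the updated GFF3 spec.
    """
    if before_origin[1] != seq_length or after_origin[0] != 1:
        raise ValueError("parts do not meet at the origin")
    return before_origin[0], after_origin[1] + seq_length


# NEQ003: parts 490883..490885 and 1..879 on a 490885 bp circular genome.
start, end = wrapped_span((490883, 490885), (1, 879), 490885)
print(start, end)  # 490883 491764
```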

------------------------------------------

Problem Seven - No parent/child relationships

The NCBI GFF3 file had no parent/child relationships at all.

The EMBOSS 6.4.0 GFF3 file does use parent/child relationships
but not in the way I expected (and not in a way the validator likes).
As discussed above, for the GenBank join locations EMBOSS
seems to create broad parent features with children for each
sub-location (parent/child relations of the same type = bad).

What I'm expecting instead is parent/child relationships between
the CDS and gene features, between tRNA and gene features, etc.
Note that these relationships are implicit in the GenBank (and EMBL)
flat files, so I accept trying to deduce them might be hard (and
perhaps best not doing immediately - the other issues are more
pressing).

------------------------------------------

Problem Eight - Invalid tags

The online validator complains that EMBOSS too is using EC_number
(uppercase tags are reserved).

------------------------------------------

So my conclusion is that while the EMBOSS generated GFF3 is
better than those produced by the NCBI, it still is invalid and needs
some work.

As usual, I am of course happy to help with testing fixes. And if
there are any mistakes in my understanding of the GFF3 spec,
please tell me ;)

Regards,

Peter C.


From pmr at ebi.ac.uk  Wed Aug 17 11:38:23 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 17 Aug 2011 16:38:23 +0100
Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO)
In-Reply-To: <CAKVJ-_7XSaQ2D+rQ+5d9q-KdvnHdj2s8qN6r+RjEYHwy=YBz1w@mail.gmail.com>
References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk>
	<CAKVJ-_7XSaQ2D+rQ+5d9q-KdvnHdj2s8qN6r+RjEYHwy=YBz1w@mail.gmail.com>
Message-ID: <4E4BE06F.9040503@ebi.ac.uk>

On 16/08/2011 16:36, Peter Cock wrote:
> Interestingly EMBOSS includes the sequence at the bottom
> (using the FASTA directive) and has generated unique ID tags
> for each feature. It has also added more note tags.

The sequence is included if you are writing sequence data. GFF3 allows 
sequence to be included, so we add it. Using a separate feature file is 
always awkward for users, but is supported.

> Unfortunately this also failed the GFF3 validation. The EMBOSS
> output does a lot better (e.g. "cleaved_initiator_methionine" is
> valid while "Initiator methionine" in the UniProt file was not)
>
> However, some of the terms in column 3 are apparently out of
> date - but http://www.sequenceontology.org does list them as
> synonyms:

Thanks. I'll update the table, but synonyms should be acceptable.

> Finally protein_modification_categorized_by_chemical_process
> does not seem to be valid (I failed to find it in the ontology).

Not in SO, but in a separate ontology (MOD). Should also be valid in GFF 
I believe, but perhaps the parser insists on using SO and excluding 
related ontologies.

> Additionally the validator complained about some of the note
> in Line 15, probably due to the %3B escaped semi-colon,
> but that may be a bug in the validator.

Interesting. Let me know if we are not escaping the right characters, 
but I believe we are supposed to escape ';' in those positions.
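
For reference, a minimal Python sketch of the escaping under discussion. It
assumes the characters to percent-encode in a column 9 attribute value are
';', '=', '&', ',', '%', tab, newline, carriage return and other control
characters. The function name is hypothetical and this is not the EMBOSS
implementation.

```python
def escape_attribute_value(value):
    """Percent-encode characters that are reserved inside a GFF3 attribute value."""
    reserved = set(";=&,%\t\n\r")
    out = []
    for ch in value:
        code = ord(ch)
        if ch in reserved or code < 0x20 or code == 0x7F:
            out.append("%%%02X" % code)   # e.g. ';' becomes %3B
        else:
            out.append(ch)
    return "".join(out)


print(escape_attribute_value("Uncharacterized ACR; Bipartite signal"))
# Uncharacterized ACR%3B Bipartite signal
```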

regards,

Peter Rice
EMBOSS Team

From pmr at ebi.ac.uk  Wed Aug 17 11:39:39 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 17 Aug 2011 16:39:39 +0100
Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO)
In-Reply-To: <CAKVJ-_4_HRr-LLj2yBQ2=X2QtLktKpJ9P0o60cyqQ-pnUY01Tg@mail.gmail.com>
References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk>
	<CAKVJ-_7XSaQ2D+rQ+5d9q-KdvnHdj2s8qN6r+RjEYHwy=YBz1w@mail.gmail.com>
	<CAKVJ-_4_HRr-LLj2yBQ2=X2QtLktKpJ9P0o60cyqQ-pnUY01Tg@mail.gmail.com>
Message-ID: <4E4BE0BB.70302@ebi.ac.uk>

On 16/08/2011 19:39, Peter Cock wrote:
> I also noticed the seqret GFF3 output is using "+" as the strand,
> which is wrong for a protein reference like this. It should be using
> "." (period) as the features on a protein are strand-less (as done
> in the UniProt GFF3 file).

Thanks.

We'll fix it for the next release, but my understanding is it should be 
acceptable to most parsers.

regards,

Peter Rice
EMBOSS Team

From p.j.a.cock at googlemail.com  Wed Aug 17 11:48:32 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 17 Aug 2011 16:48:32 +0100
Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO)
In-Reply-To: <4E4BE06F.9040503@ebi.ac.uk>
References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk>
	<CAKVJ-_7XSaQ2D+rQ+5d9q-KdvnHdj2s8qN6r+RjEYHwy=YBz1w@mail.gmail.com>
	<4E4BE06F.9040503@ebi.ac.uk>
Message-ID: <CAKVJ-_7Ao_mTa=8KsDYFFW-ct+RaYmVJWPJqruYRdGwOC1p0Xw@mail.gmail.com>

On Wed, Aug 17, 2011 at 4:38 PM, Peter Rice <pmr at ebi.ac.uk> wrote:
> On 16/08/2011 16:36, Peter Cock wrote:
>>
>> Interestingly EMBOSS includes the sequence at the bottom
>> (using the FASTA directive) and has generated unique ID tags
>> for each feature. It has also added more note tags.
>
> The sequence is included if you are writing sequence data. GFF3 allows
> sequence to be included, so we add it. Using a separate feature file is
> always awkward for users, but is supported.

See also the discussion today on gmod-gbrowse / song-devel where
it sounds like GFF3 should have a single block of FASTA embedded
sequence at the end of the file, rather than interleaved. As I suggest
on that thread, the practical solution for EMBOSS seqret might be to
omit the FASTA sequence altogether. Or cache them in memory/on
disk to write out at the very end, after all the features?

http://generic-model-organism-system-database.450254.n5.nabble.com/Mailing-list-for-GFF3-specification-discussion-td4707740.html

>> Unfortunately this also failed the GFF3 validation. The EMBOSS
>> output does a lot better (e.g. "cleaved_initiator_methionine" is
>> valid while "Initiator methionine" in the UniProt file was not)
>>
>> However, some of the terms in column 3 are apparently out of
>> date - but http://www.sequenceontology.org does list them as
>> synonyms:
>
> Thanks. I'll update the table, but synonyms should be acceptable.

I can see plus points for either view; certainly the validator could
downgrade that error to a warning.

>> Finally protein_modification_categorized_by_chemical_process
>> does not seem to be valid (I failed to find it in the ontology).
>
> Not in SO, but in a separate ontology (MOD). Should also be valid
> in GFF I believe, but perhaps the parser insists on using SO and
> excluding related ontologies.

OK, but in that case shouldn't you then be declaring this with a
##feature-ontology directive?

>> Additionally the validator complained about some of the note
>> in Line 15, probably due to the %3B escaped semi-colon,
>> but that may be a bug in the validator.
>
> Interesting. Let me know if we are not escaping the right characters, but I
> believe we are supposed to escape ';' in those positions.

I haven't checked this aspect carefully (since this is fiddly).

Peter

From p.j.a.cock at googlemail.com  Wed Aug 17 11:50:57 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 17 Aug 2011 16:50:57 +0100
Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO)
In-Reply-To: <4E4BE0BB.70302@ebi.ac.uk>
References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk>
	<CAKVJ-_7XSaQ2D+rQ+5d9q-KdvnHdj2s8qN6r+RjEYHwy=YBz1w@mail.gmail.com>
	<CAKVJ-_4_HRr-LLj2yBQ2=X2QtLktKpJ9P0o60cyqQ-pnUY01Tg@mail.gmail.com>
	<4E4BE0BB.70302@ebi.ac.uk>
Message-ID: <CAKVJ-_58k83YPoG9MWeNpG=6tB4YO52ATAKJjJXD-AthygOgCg@mail.gmail.com>

On Wed, Aug 17, 2011 at 4:39 PM, Peter Rice <pmr at ebi.ac.uk> wrote:
> On 16/08/2011 19:39, Peter Cock wrote:
>>
>> I also noticed the seqret GFF3 output is using "+" as the strand,
>> which is wrong for a protein reference like this. It should be using
>> "." (period) as the features on a protein are strand-less (as done
>> in the UniProt GFF3 file).
>
> Thanks.
>
> We'll fix it for the next release, but my understanding is it should be
> acceptable to most parsers.
>

I agree this is pretty harmless - in practice all that really matters
is if the strand is "-" or not. Still, it should be straightforward to fix.

Peter

From pmr at ebi.ac.uk  Wed Aug 17 11:52:21 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 17 Aug 2011 16:52:21 +0100
Subject: [emboss-dev] Problems with EMBOSS seqret GenBank to GFF3
In-Reply-To: <CAKVJ-_6hc6j+oy3bZ+ydWL1KJ7m4Xh2xnz-Cjx1vhNUf_dzN8A@mail.gmail.com>
References: <CAKVJ-_6hc6j+oy3bZ+ydWL1KJ7m4Xh2xnz-Cjx1vhNUf_dzN8A@mail.gmail.com>
Message-ID: <4E4BE3B5.4080601@ebi.ac.uk>

On 17/08/2011 11:37, Peter Cock wrote:
> Hi again Peter R. (et al.),
>
> Following yesterday's discussion about GFF3 files from UniProt,
> I'm trying seqret to produce GFF3 from GenBank files. I'd already
> found the NCBI currently provides some very broken GFF3 files:
>
> ------------------------------------------
>
> Problem Two - Circular features not marked
>
> EMBOSS is also lacking in this area.
>
> EMBOSS has used feature type databank_entry and generated feature ID
> NC_005213.1 for the landmark. However, this should include the special
> tag entry Is_circular=true, since this is the landmark feature for the whole
> circular chromosome.

Thanks. I'll make sure we add it for the next release.

> ------------------------------------------
>
> Problem Three - Missing ID tags on multi-location features
>
> Unlike the NCBI file which fails to cross link multi-location features like
> trans-spliced NEQ_t01, EMBOSS looks better. However, I don't think
> you are following the expected pattern as used in the canonical GFF3
> examples.
>
> In the GenBank file, this tRNA is join(35233..35301,3254..3289)
>
> For the gene and tRNA features for NEQ_t01, EMBOSS is generating
> three GFF3 lines. First a very broad parent feature 3254 to 35301,
> then two children 35233 to 35301 and 3254 to 3289.
>
> I would expect two GFF3 lines (for each of gene and tRNA), just
> 35233 to 35301 and 3254 to 3289 which would be linked by virtue
> of having the same ID.

EMBOSS is reporting what is stored internally (feature and subfeatures 
for the exons). Looks like we should skip reporting the feature. I'll 
check what that means for the IDs.


> This is related to "Problem Six" and "Problem Seven" below.
>
> ------------------------------------------
>
> Problem Four - Wrong tag for database cross references
>
> I had noticed the NCBI using a local tag (lower case) db_xref rather
> than the standard (upper case = reserved) tag Dbxref. EMBOSS
> does the same - is this deliberate and if so why?

It is deliberate - we are using the db_xref tag from the EMBL/GenBank 
feature table.

But we could convert to the GFF3 tag (and back again on reading). I'll 
have a look at how easy that would be.

> ------------------------------------------
>
> Problem Six - Features wrapping the origin of a circular genome
>
> Related to the landmark feature lacking the Is_curcular=true tag, the
> gene and CDS features for origin wrapping NEQ003 look funny to me.
> EMBOSS seems to be generating three GFF3 lines for the gene and CDS
> for NEQ003, a surprisingly broad entry 1 to 490885 and two children
> 490883 to 490885 and 1 to 879 (which do look sensible).
>
> This is essentially the same point I raised above with NEQ_t01, but
> with the added complication of spanning the origin.

Ah, something to do with the way start and end positions are stored 
internally. I'll fix that along with other circular feature issues.

> Thankfully this potential confusion has been address in the updated
> specification, so I would expect a single GFF3 line for each of the gene
> and CDS for NEQ003, using start 490883 and end of 879+490885=491764.

I'll try to write (and read) that way too.

> ------------------------------------------
>
> Problem Seven - No parent/child relationships
>
> The NCBI GFF3 file had no parent/child relationships at all.
>
> The EMBOSS 6.4.0 GFF3 file does use parent/child relationships
> but not in the way I expected (and not in a way the validator likes).
> As discussed above, for the GenBank join locations EMBOSS
> seems to create broad parent features with children for each
> sub-location (parent/child relations of the same type = bad).
>
> What I'm expecting instead is parent child relationships between
> the CDS and gene features, between tRNA and gene features, etc.
> Note that these relationships are implicit in the GenBank (and EMBL)
> flat files, so I accept trying to deduce them might be hard (and
> perhaps best not doing immediately - the other issues are more
> pressing).

Could be possible by matching common exons (stored internally as 
subfeatures). I'll have a look.

> ------------------------------------------
>
> Problem Eight - Invalid tags
>
> The online validator complains that EMBOSS too is using EC_number
> (uppercase tags are reserved

Pah! We use the EMBL/GenBank tag names. Looks like we will have to
convert to lower case so may as well include that with the
db_xref/Dbxref conversion in GFF3 writing and reading.
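
A minimal Python sketch of such a two-way tag conversion. The lookup tables
and function names are hypothetical; only the db_xref/Dbxref and EC_number
cases come from this thread, and this is not how EMBOSS actually stores its
tag names.

```python
# Hypothetical lookup table: GenBank/EMBL qualifier -> GFF3 attribute tag.
TO_GFF3 = {"db_xref": "Dbxref", "EC_number": "ec_number"}
FROM_GFF3 = {gff3: genbank for genbank, gff3 in TO_GFF3.items()}


def to_gff3_tag(tag):
    """Map a feature table qualifier name to a tag that is safe in GFF3."""
    if tag in TO_GFF3:
        return TO_GFF3[tag]
    if tag[:1].isupper():
        # Tags starting with an upper case letter are reserved in GFF3.
        return tag.lower()
    return tag


def from_gff3_tag(tag):
    """Restore the qualifier name on reading; unknown tags pass through."""
    return FROM_GFF3.get(tag, tag)


print(to_gff3_tag("db_xref"), to_gff3_tag("EC_number"), to_gff3_tag("locus_tag"))
```

Note that a tag lowercased by the fallback rule can only be restored on
reading if it is also listed in the table.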

> ------------------------------------------
>
> So my conclusion is that while the EMBOSS generated GFF3 is
> better than those produced by the NCBI, it still is invalid and needs
> some work.
>
> As usual, I am of course happy to help with testing fixes. And if
> there are any mistakes in my understanding of the GFF3 spec,
> please tell me ;)

Many, many thanks for finding these.

EMBOSS feature internals had a major rewrite in 6.4.0 to store exons as
subfeatures, which makes all this much easier to handle.

regards,

Peter Rice
EMBOSS Team

From pmr at ebi.ac.uk  Wed Aug 17 11:55:53 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 17 Aug 2011 16:55:53 +0100
Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO)
In-Reply-To: <CAKVJ-_7Ao_mTa=8KsDYFFW-ct+RaYmVJWPJqruYRdGwOC1p0Xw@mail.gmail.com>
References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk>
	<CAKVJ-_7XSaQ2D+rQ+5d9q-KdvnHdj2s8qN6r+RjEYHwy=YBz1w@mail.gmail.com>
	<4E4BE06F.9040503@ebi.ac.uk>
	<CAKVJ-_7Ao_mTa=8KsDYFFW-ct+RaYmVJWPJqruYRdGwOC1p0Xw@mail.gmail.com>
Message-ID: <4E4BE489.3040703@ebi.ac.uk>

On 17/08/2011 16:48, Peter Cock wrote:
> On Wed, Aug 17, 2011 at 4:38 PM, Peter Rice<pmr at ebi.ac.uk>  wrote:
>> On 16/08/2011 16:36, Peter Cock wrote:
>>>
>>> Interestingly EMBOSS includes the sequence at the bottom
>>> (using the FASTA directive) and has generated unique ID tags
>>> for each feature. It has also added more note tags.
>>
>> The sequence is included if you are writing sequence data. GFF3 allows
>> sequence to be included, so we add it. Using a separate feature file is
>> always awkward for users, but is supported.
>
> See also the discussion today on gmod-gbrowse / song-devel where
> it sounds like GFF3 should have a single block of FASTA embedded
> sequence at the end of the fine, rather than interleaved. As I suggest
> on that thread, the practical solution for EMBOSS seqret might be to
> omit the FASTA sequence altogether. Or cache them in memory/on
> disk to write out at the very end of the all the features?

Thanks. We already save sequences and write at the end for some formats
so I'll add it for GFF3. Reading GFF3 input will need more work, but it
may not be too bad.

If we are reading it as feature input, we don't look for the sequence.

If we are reading as sequence input, we need to read all the sequences
into memory and then go back to read the features. For streamed input we 
can buffer to make the rewind work.

regards,

Peter Rice
EMBOSS Team

From p.j.a.cock at googlemail.com  Wed Aug 17 12:05:13 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 17 Aug 2011 17:05:13 +0100
Subject: [emboss-dev] Problems with EMBOSS seqret GenBank to GFF3
In-Reply-To: <4E4BE3B5.4080601@ebi.ac.uk>
References: <CAKVJ-_6hc6j+oy3bZ+ydWL1KJ7m4Xh2xnz-Cjx1vhNUf_dzN8A@mail.gmail.com>
	<4E4BE3B5.4080601@ebi.ac.uk>
Message-ID: <CAKVJ-_4A=UwuXUexeMs=gg_eZ0n8LeCh96C57TNY9HZWSpQ2Ow@mail.gmail.com>

On Wed, Aug 17, 2011 at 4:52 PM, Peter Rice <pmr at ebi.ac.uk> wrote:
> On 17/08/2011 11:37, Peter Cock wrote:
>> ------------------------------------------
>>
>> Problem Four - Wrong tag for database cross references
>>
>> I had noticed the NCBI using a local tag (lower case) db_xref rather
>> than the standard (upper case = reserved) tag Dbxref. EMBOSS
>> does the same - is this deliberate and if so why?
>
> It is deliberate - we are using the db_xref tag from the EMBL/GenBank
> feature table.
>
> But we could convert to the GFF3 tag (and back again on reading). I'll
> have a look at how easy that would be.

Do you want to check this one with Lincoln on the song-devel mailing list
first - after all, using a lower case tag is quite allowable and valid GFF3.
My point is it does seem to be exactly what the reserved tag Dbxref is
intended for.

>> ------------------------------------------
>>
>> Problem Seven - No parent/child relationships
>>
>> The NCBI GFF3 file had no parent/child relationships at all.
>>
>> The EMBOSS 6.4.0 GFF3 file does use parent/child relationships
>> but not in the way I expected (and not in a way the validator likes).
>> As discussed above, for the GenBank join locations EMBOSS
>> seems to create broad parent features with children for each
>> sub-location (parent/child relations of the same type = bad).
>>
>> What I'm expecting instead is parent child relationships between
>> the CDS and gene features, between tRNA and gene features, etc.
>> Note that these relationships are implicit in the GenBank (and EMBL)
>> flat files, so I accept trying to deduce them might be hard (and
>> perhaps best not doing immediately - the other issues are more
>> pressing).
>
> Could be possible by matching common exons (stored internally as
> subfeatures). I'll have a look.

Usually yes, but not all the time. I've seen GenBank files where
the gene and CDS features have slightly different locations which
makes doing this automatically hard. Off the top of my head this
was a programmed frame shift example... I'll see if I can find you
a specific example.

>> ------------------------------------------
>>
>> So my conclusion is that while the EMBOSS generated GFF3 is
>> better than those produced by the NCBI, it still is invalid and needs
>> some work.
>>
>> As usual, I am of course happy to help with testing fixes. And if
>> there are any mistakes in my understanding of the GFF3 spec,
>> please tell me ;)
>
> Many, many thanks for finding these.

I've come to value NC_005213.gbk as a reasonably small circular
genome with some rather complicated annotation - it's one of my
favourite test cases.

> EMBOSS feature internals had a major rewrite in 6.4.0 to sore exons as
> subfeatures, which makes all this much easier to handle.

Oh good - that restructuring should now pay dividends :)

Peter C.

From p.j.a.cock at googlemail.com  Wed Aug 17 12:07:54 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 17 Aug 2011 17:07:54 +0100
Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO)
In-Reply-To: <4E4BE489.3040703@ebi.ac.uk>
References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk>
	<CAKVJ-_7XSaQ2D+rQ+5d9q-KdvnHdj2s8qN6r+RjEYHwy=YBz1w@mail.gmail.com>
	<4E4BE06F.9040503@ebi.ac.uk>
	<CAKVJ-_7Ao_mTa=8KsDYFFW-ct+RaYmVJWPJqruYRdGwOC1p0Xw@mail.gmail.com>
	<4E4BE489.3040703@ebi.ac.uk>
Message-ID: <CAKVJ-_4C54oTnRF4szi=Au0OTTn+GNQpSU9nSaYv9465BC1v7A@mail.gmail.com>

On Wed, Aug 17, 2011 at 4:55 PM, Peter Rice <pmr at ebi.ac.uk> wrote:
> On 17/08/2011 16:48, Peter Cock wrote:
>> See also the discussion today on gmod-gbrowse / song-devel where
>> it sounds like GFF3 should have a single block of FASTA embedded
>> sequence at the end of the fine, rather than interleaved. As I suggest
>> on that thread, the practical solution for EMBOSS seqret might be to
>> omit the FASTA sequence altogether. Or cache them in memory/on
>> disk to write out at the very end of the all the features?
>
> Thanks. We already save sequences and write at the end for some
> formats so I'll add it for GFF3. We will need more work for reading
> GFF3 input though, but it may not be too bad.
>
> If we are reading it as feature input, we don't look for the sequence.
>
> If we are reading as sequence input, we need to read all the sequeces
> into memory and then go back to read the features. For streamed input
> we can buffer to make the rewind work.

I'm curious what other file formats needed this kind of work. But it
is good that you've already got some buffer/cache infrastructure
in place. Does it boil down to writing temp files in /tmp ?

Peter C.

From pmr at ebi.ac.uk  Wed Aug 17 12:14:15 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 17 Aug 2011 17:14:15 +0100
Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO)
In-Reply-To: <CAKVJ-_4C54oTnRF4szi=Au0OTTn+GNQpSU9nSaYv9465BC1v7A@mail.gmail.com>
References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk>
	<CAKVJ-_7XSaQ2D+rQ+5d9q-KdvnHdj2s8qN6r+RjEYHwy=YBz1w@mail.gmail.com>
	<4E4BE06F.9040503@ebi.ac.uk>
	<CAKVJ-_7Ao_mTa=8KsDYFFW-ct+RaYmVJWPJqruYRdGwOC1p0Xw@mail.gmail.com>
	<4E4BE489.3040703@ebi.ac.uk>
	<CAKVJ-_4C54oTnRF4szi=Au0OTTn+GNQpSU9nSaYv9465BC1v7A@mail.gmail.com>
Message-ID: <4E4BE8D7.4010203@ebi.ac.uk>

On 17/08/2011 17:07, Peter Cock wrote:
> On Wed, Aug 17, 2011 at 4:55 PM, Peter Rice<pmr at ebi.ac.uk>  wrote:
>> If we are reading as sequence input, we need to read all the sequeces
>> into memory and then go back to read the features. For streamed input
>> we can buffer to make the rewind work.
>
> I'm curious what other file formats needed this kind of work. But it
> is good that you've already got some buffer/cache infrastructure
> in place. Does it boil down to writing temp files in /tmp ?

MSF (checksum at the top), Phylip (number of sequences at the top).

In ajseqwrite.c these are the ones with the Save attribute set true.

We keep them in memory and write them when the output file is closed.
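
Applied to GFF3, the same save-and-write-at-close idea might look like the
following Python sketch. The class and method names are hypothetical and
this is not the ajseqwrite.c code. It only shows features being written
immediately while sequences are held in memory until the single ##FASTA
block at the end.

```python
import io


class DeferredFastaGff3Writer:
    """Write feature lines as they arrive; emit all sequences once, at close."""

    def __init__(self, handle):
        self.handle = handle
        self.saved = []          # (seqid, sequence) pairs kept in memory
        handle.write("##gff-version 3\n")

    def write_record(self, seqid, feature_lines, sequence):
        for line in feature_lines:
            self.handle.write(line + "\n")
        self.saved.append((seqid, sequence))

    def close(self):
        if self.saved:
            self.handle.write("##FASTA\n")
            for seqid, sequence in self.saved:
                self.handle.write(">%s\n%s\n" % (seqid, sequence))
        self.saved = []


out = io.StringIO()
writer = DeferredFastaGff3Writer(out)
writer.write_record("seq1", ["seq1\t.\tregion\t1\t4\t.\t.\t.\tID=r1"], "ACGT")
writer.write_record("seq2", ["seq2\t.\tregion\t1\t3\t.\t.\t.\tID=r2"], "GGC")
writer.close()
text = out.getvalue()
print(text)
```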

regards,

Peter Rice
EMBOSS Team

From p.j.a.cock at googlemail.com  Wed Aug 17 12:33:29 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 17 Aug 2011 17:33:29 +0100
Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO)
In-Reply-To: <4E4BE8D7.4010203@ebi.ac.uk>
References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk>
	<CAKVJ-_7XSaQ2D+rQ+5d9q-KdvnHdj2s8qN6r+RjEYHwy=YBz1w@mail.gmail.com>
	<4E4BE06F.9040503@ebi.ac.uk>
	<CAKVJ-_7Ao_mTa=8KsDYFFW-ct+RaYmVJWPJqruYRdGwOC1p0Xw@mail.gmail.com>
	<4E4BE489.3040703@ebi.ac.uk>
	<CAKVJ-_4C54oTnRF4szi=Au0OTTn+GNQpSU9nSaYv9465BC1v7A@mail.gmail.com>
	<4E4BE8D7.4010203@ebi.ac.uk>
Message-ID: <CAKVJ-_4iABoj8y2s1f9kJ=pujKm41=PqECXg49MDRBY3T5wFEA@mail.gmail.com>

On Wed, Aug 17, 2011 at 5:14 PM, Peter Rice <pmr at ebi.ac.uk> wrote:
> On 17/08/2011 17:07, Peter Cock wrote:
>> I'm curious what other file formats needed this kind of work. But it
>> is good that you've already got some buffer/cache infrastructure
>> in place. Does it boil down to writing temp files in /tmp ?
>
> MSF (checksum at the top), Phylip (number of sequences at the top).
>
> In ajseqwrite.c these are the ones with the Save attribute set true.
>
> We keep them in memory and write them when the output file is closed.

I wasn't thinking of alignments, but that makes perfect sense.

Thanks,

Peter C.

From p.j.a.cock at googlemail.com  Wed Aug 17 12:54:10 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 17 Aug 2011 17:54:10 +0100
Subject: [emboss-dev] Moving EMBOSS from OBF hosted CVS to git on github
Message-ID: <CAKVJ-_78kPXyt3bCj+WX++q+WHDTQOz3Q8bJXY+kMOZUmYJQcA@mail.gmail.com>

Dear EMBOSS team,

Have you made any decisions regarding the proposal to
move the EMBOSS repository from CVS hosted by the
OBF to git hosted on github (where most of the other OBF
backed projects are now)?

I see this made it to the minutes of the 27 June 2011 meeting:
http://emboss.sourceforge.net/meetings/2011-06-27.html

As I recall from talking to Peter Rice at BOSC/ISMB 2011
in Vienna last month, EMBOSS currently uses a single branch
in CVS (like Biopython used to), so migrating the repository
to git shouldn't be too complicated.

I recommend in the short term maintaining a git mirror of the
CVS repository on github.com, which can be kept current
via a cron job running on the OBF server. You can then
treat this git repository as a read only mirror and continue
to make all commits via CVS.

During this interim period, external contributors can make
their own branches etc (without touching the official EMBOSS
repository) and send you patches. The internal developers can
also try this out as a way to get familiar with git gradually.

This is what we did with Biopython, and it worked very well.
I am happy to assist with this if you want. I think I made this
offer in person in Vienna, but I'm repeating it publicly now.

You might also be able to adopt the existing mirror
maintained by Pjotr Prins (CC'd), although that does
include a branch with BioLib work in it:
https://github.com/pjotrp/EMBOSS/

Regards,

Peter C.

P.S. You'll need to have a different project name on github
since emboss was used by Martin Bosslet back in Nov 2010.
How about emboss-prj or even open-bio for this?

P.P.S. This page seems to be missing:
http://emboss.sourceforge.net/meetings/2011-07-04.html

It is linked to from at least these two pages:
http://emboss.sourceforge.net/meetings/
http://emboss.sourceforge.net/meetings/2011-07-11.html

From pmr at ebi.ac.uk  Thu Aug 18 08:28:28 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 18 Aug 2011 13:28:28 +0100
Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO)
In-Reply-To: <CAKVJ-_7XSaQ2D+rQ+5d9q-KdvnHdj2s8qN6r+RjEYHwy=YBz1w@mail.gmail.com>
References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk>
	<CAKVJ-_7XSaQ2D+rQ+5d9q-KdvnHdj2s8qN6r+RjEYHwy=YBz1w@mail.gmail.com>
Message-ID: <4E4D056C.5050508@ebi.ac.uk>

On 08/16/2011 04:36 PM, Peter Cock wrote:
> I will report this to UniProt later. However, first I thought
> I would try converting one of the other files provided into
> GFF3 using EMBOSS seqret for an alternative, e.g. the
> plain text "swiss" format: http://www.uniprot.org/uniprot/P99999.txt
> 
> I can convert this using seqret as follows:
> 
> ========================================
> $ seqret -feature -osformat=gff3 -sformat=swiss -sequence P99999.txt

> However, some of the terms in column 3 are apparently out of
> date - but http://www.sequenceontology.org does list them as
> synonyms:
> 
> It looks like the EMBOSS sequence ontology table may need
> updating for at least these three cases.
> 
> Finally protein_modification_categorized_by_chemical_process
> does not seem to be valid (I failed to find it in the ontology).

That was a name from the MOD ontology. GFF3 output now uses an SO term
(but SO is lacking detail for MOD_RES, having only:

id: SO:0001089
name: post_translationally_modified_region

and

id: SO:0001700
name: histone_modification

... and then more descendants of histone modification). SO is still
showing its DNA_only roots.

EMBOSS internally uses MOD terms for MOD_RES features. The details are
in the note tag in GFF3 output.

> Additionally the validator complained about some of the note
> in Line 15, probably due to the %3B escaped semi-colon,
> but that may be a bug in the validator.

Worked for me. Perhaps it was confused by the term name errors (or
perhaps the validator has been fixed).

However, one nasty bug ... EMBOSS was so careful to only read real GFF3
format that the EMBOSS comment "#!Type Protein" was ignored and features
were read into EMBOSS as nucleotide.

I suspect there is no way in GFF3 to identify a protein file. In the
next patch we can parse the EMBOSS comment again but that will not help
with non-EMBOSS protein GFF3 files.

Is there some official distinction between protein and nucleotide GFF3
files?

regards,

Peter Rice
EMBOSS Team

From pmr at ebi.ac.uk  Wed Aug 24 06:36:34 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 24 Aug 2011 11:36:34 +0100
Subject: [emboss-dev] Problems with EMBOSS seqret GenBank to GFF3
In-Reply-To: <CAKVJ-_6hc6j+oy3bZ+ydWL1KJ7m4Xh2xnz-Cjx1vhNUf_dzN8A@mail.gmail.com>
References: <CAKVJ-_6hc6j+oy3bZ+ydWL1KJ7m4Xh2xnz-Cjx1vhNUf_dzN8A@mail.gmail.com>
Message-ID: <4E54D432.8030309@ebi.ac.uk>

On 08/17/2011 11:37 AM, Peter Cock wrote:
> Hi again Peter R. (et al.),
>
> Following yesterday's discussion about GFF3 files from UniProt,
> I'm trying seqret to produce GFF3 from GenBank files.
>
> ------------------------------------------
>
> Problem Two - Circular features not marked
>
> EMBOSS is also lacking in this area.

Current status: circular tags will be passed better in the next EMBOSS
release. Sequence inputs will have a new -scircular qualifier and
feature inputs will have -fcircular to cover cases where the input
format does not define a circular sequence (but if it does, these will
not turn it off).

We will tag a feature with Is_circular in the output, even if we have to 
make one up.

> ------------------------------------------
>
> Problem Six - Features wrapping the origin of a circular genome
>
> Related to the landmark feature lacking the Is_circular=true tag, the
> gene and CDS features for origin wrapping NEQ003 look funny to me.
> EMBOSS seems to be generating three GFF3 lines for the gene and CDS
> for NEQ003, a surprisingly broad entry 1 to 490885 and two children
> 490883 to 490885 and 1 to 879 (which do look sensible).
>
> Based on the old specification, I had expected two GFF3 lines each for the
> gene and CDS, giving the regions 490883 to 490885 and 1 to 879, linked
> by virtue of the having the same ID.
>
> Thankfully this potential confusion has been address in the updated
> specification, so I would expect a single GFF3 line for each of the gene
> and CDS for NEQ003, using start 490883 and end of 879+490885=491764.

Unfortunately GFF3 is sadly lacking in details on how to define the 
sequence length. It appears there is no standard for defining the 
length, yet it is critical to interpreting a circular feature that goes 
across the origin as GFF3 makes the end position greater than the length.

We will make a best guess but cannot guarantee we get the right answer.
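
To make the dependency on the length concrete, here is a minimal Python
sketch of reading such a feature back. The function name is hypothetical,
and it assumes the sequence length has already been obtained by some means,
which is exactly the part GFF3 leaves open.

```python
def split_wrapped(start, end, seq_length):
    """Split a GFF3 span whose end runs past the sequence length into two parts."""
    if end <= seq_length:
        return [(start, end)]                 # ordinary feature, nothing to do
    return [(start, seq_length), (1, end - seq_length)]


# NEQ003 written as a single line 490883..491764 on a 490885 bp genome.
print(split_wrapped(490883, 491764, 490885))  # [(490883, 490885), (1, 879)]
```

With a wrong length guess the second part comes out shifted, which is why
the guess matters.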

> ------------------------------------------
>
> Problem Seven - No parent/child relationships
>
> The EMBOSS 6.4.0 GFF3 file does use parent/child relationships
> but not in the way I expected (and not in a way the validator likes).
> As discussed above, for the GenBank join locations EMBOSS
> seems to create broad parent features with children for each
> sub-location (parent/child relations of the same type = bad).
>
> What I'm expecting instead is parent child relationships between
> the CDS and gene features, between tRNA and gene features, etc.
> Note that these relationships are implicit in the GenBank (and EMBL)
> flat files, so I accept trying to deduce them might be hard (and
> perhaps best not doing immediately - the other issues are more
> pressing).

The obvious fix is to lie about the feature types of the exons so the 
validator is happy. We could call them exons, but "region" would be safer.

But there is a silly complication with CDS features: we could keep the 
CDS parent record and have it as a parent of a group of "regions" for 
the processed exons. But GFF3 wants the exons to be type "CDS" so what 
do we call the parent?

So in the cobbled together example below, ignoring the circular aspects, 
we would want to keep the CDS on the parent (ID=NC_005213.11) record 
where all the annotation tags are, but I suspect GFF3 wants that to be 
something else. We could of course specifically lie about CDS features 
for EMBOSS generated GFF3 files (we tag the header) so we can restore 
the correct internal structure on input.

NC_005213	EMBL	CDS	490883	491764	.	-	0	ID=NC_005213.11;locus_tag=NEQ001;note=conserved hypothetical [Methanococcus jannaschii]%3B COG1583:Uncharacterized ACR%3B IPR001472:Bipartite nuclear localization signal%3B IPR002743: Protein of unknown function DUF57;codon_start=1;transl_table=11;product=hypothetical protein;protein_id=NP_963295.1;db_xref=GI:41614797;db_xref=GeneID:2732620;translation=MRLLLELKALNSIDKKQLSNYLIQGFIYNILKNTEYSWLHNWKKEKYFNFTLIPKKDIIENKRYYLIISSPDKRFIEVLHNKIKDLDIITIGLAQFQLRKTKKFDPKLRFPWVTITPIVLREGKIVILKGDKYYKVFVKRLEELKKYNLIKKKEPILEEPIEISLNQIKDGWKIIDVKDRYYDFRNKSFSAFSNWLRDLKEQSLRKYNNFCGKNFYFEEAIFEGFTFYKTVSIRIRINRGEAVYIGTLWKELNVYRKLDKEEREFYKFLYDCGLGSLNSMGFGFVNTKKNSAR
NC_005213	EMBL	CDS	490883	490885	.	-	0	ID=NC_005213.12;Parent=NC_005213.11
NC_005213	EMBL	CDS	1	879	.	-	0	ID=NC_005213.13;Parent=NC_005213.11

> ------------------------------------------
>
> Problem Eight - Invalid tags
>
> The online validator complains that EMBOSS too is using EC_number
> (uppercase tags are reserved

Fixed and we can patch the release. Making all tags lower case is 
trivial - they are automatically converted on input to the internal 
mixed case.
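
As a rough illustration of the lower-casing described here, the Python 
sketch below leaves the attribute tags that the GFF3 specification itself 
defines untouched and lower-cases everything else. The function name is 
hypothetical and this is not the EMBOSS patch.

```python
# Tags defined by the GFF3 specification; these keep their capital letter.
RESERVED_TAGS = {
    "ID", "Name", "Alias", "Parent", "Target", "Gap", "Derives_from",
    "Note", "Dbxref", "Ontology_term", "Is_circular",
}


def gff3_tag(tag):
    """Return the tag unchanged if it is reserved, otherwise lower-cased."""
    return tag if tag in RESERVED_TAGS else tag.lower()


print(gff3_tag("EC_number"))  # ec_number
print(gff3_tag("Parent"))     # Parent
```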

> ------------------------------------------
>
> So my conclusion is that while the EMBOSS generated GFF3 is
> better than those produced by the NCBI, it still is invalid and needs
> some work.
>
> As usual, I am of course happy to help with testing fixes. And if
> there are any mistakes in my understanding of the GFF3 spec,
> please tell me ;)

Hope this helps. Progress is being made.

However, as GFF3 is such a pain, I am wondering whether to switch the 
default feature format to something else - back to GFF2 or maybe to use GTF.

Does anyone have a preference?

regards,

Peter Rice
EMBOSS Team

From pmr at ebi.ac.uk  Wed Aug 24 10:45:33 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 24 Aug 2011 15:45:33 +0100
Subject: [emboss-dev] Problems with EMBOSS seqret GenBank to GFF3
In-Reply-To: <4E54D432.8030309@ebi.ac.uk>
References: <CAKVJ-_6hc6j+oy3bZ+ydWL1KJ7m4Xh2xnz-Cjx1vhNUf_dzN8A@mail.gmail.com>
	<4E54D432.8030309@ebi.ac.uk>
Message-ID: <4E550E8D.8010506@ebi.ac.uk>

On 08/24/2011 11:36 AM, Peter Rice wrote:
> On 08/17/2011 11:37 AM, Peter Cock wrote:
>
>> ------------------------------------------
>>
>> Problem Seven - No parent/child relationships
>>
>> The EMBOSS 6.4.0 GFF3 file does use parent/child relationships
>> but not in the way I expected (and not in a way the validator likes).

As a first attempt, using the EMBL entry v00508 in the EMBOSS test set, 
I can make the CDS "parent" feature change its type to 
"biological_region" and add a featflags tag with the true type. Code 
(not yet checked in) can reconstruct the EMBL feature table from this GFF.

However, the EMBL tags are all on the parent (now biological_region) 
feature.

Any suggestions where I should stick them for them to be useful in GFF3?

EMBL feature table:

FT   source          1..3919
FT                   /organism="Homo sapiens"
FT                   /mol_type="genomic DNA"
FT                   /db_xref="taxon:9606"
FT   CDS             join(2079..2171,2294..2515,3371..3499)
FT                   /db_xref="GDB:119299"
FT                   /db_xref="GOA:P02100"
FT                   /db_xref="HGNC:4830"
FT                   /db_xref="InterPro:IPR000971"
FT                   /db_xref="InterPro:IPR002337"
FT                   /db_xref="InterPro:IPR009050"
FT                   /db_xref="InterPro:IPR012292"
FT                   /db_xref="PDB:1A9W"
FT                   /db_xref="UniProtKB/Swiss-Prot:P02100"
FT                   /protein_id="CAA23766.1"
FT                   /translation="MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDS
FT                   FGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENF
FT                   KLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH"

proposed GFF3 version

V00508	EMBL	databank_entry	1	3919	.	+	.	ID=V00508.1;organism=Homo sapiens;mol_type=genomic DNA;db_xref=taxon:9606
V00508	EMBL	biological_region	2079	3499	.	+	0	ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_xref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH
V00508	EMBL	CDS	2079	2171	.	+	0	Parent=V00508.2
V00508	EMBL	CDS	2294	2515	.	+	0	Parent=V00508.2
V00508	EMBL	CDS	3371	3499	.	+	0	Parent=V00508.2
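
The mapping from a join() location to this parent/child layout can be 
sketched in Python as follows. It is only an illustration under stated 
assumptions: the function name is hypothetical, only forward-strand CDS 
features with codon_start=1 are handled, and attribute values are not 
percent-escaped. The parent is given phase 0 simply to mirror the 
proposed output.

```python
def join_to_gff3(seqid, source, ftype, segments, strand, attrs, parent_id):
    """Build GFF3 lines for a join() location (forward strand only).

    segments: list of (start, end) tuples in sequence order
    attrs:    list of (tag, value) pairs copied from the feature table
    """
    if strand != "+":
        raise NotImplementedError("sketch handles the forward strand only")
    start = min(s for s, _ in segments)
    end = max(e for _, e in segments)
    # The parent carries all annotation plus the true type in featflags.
    parent_attrs = [("ID", parent_id), ("featflags", "type:" + ftype)] + attrs
    col9 = ";".join("%s=%s" % (t, v) for t, v in parent_attrs)
    lines = ["\t".join([seqid, source, "biological_region",
                        str(start), str(end), ".", strand, "0", col9])]
    phase = 0  # assumes codon_start=1
    for s, e in segments:
        lines.append("\t".join([seqid, source, ftype, str(s), str(e),
                                ".", strand, str(phase),
                                "Parent=" + parent_id]))
        # Bases left over from this segment decide the next segment's phase.
        phase = (3 - (e - s + 1 - phase) % 3) % 3
    return lines


for line in join_to_gff3("V00508", "EMBL", "CDS",
                         [(2079, 2171), (2294, 2515), (3371, 3499)],
                         "+", [("protein_id", "CAA23766.1")], "V00508.2"):
    print(line)
```

For V00508 every segment length is a multiple of three, so all three CDS 
lines come out with phase 0, as in the proposal.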


regards,

Peter Rice
EMBOSS Team

From p.j.a.cock at googlemail.com  Wed Aug 24 20:44:47 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 25 Aug 2011 01:44:47 +0100
Subject: [emboss-dev] Problems with EMBOSS seqret GenBank to GFF3
In-Reply-To: <4E550E8D.8010506@ebi.ac.uk>
References: <CAKVJ-_6hc6j+oy3bZ+ydWL1KJ7m4Xh2xnz-Cjx1vhNUf_dzN8A@mail.gmail.com>
	<4E54D432.8030309@ebi.ac.uk> <4E550E8D.8010506@ebi.ac.uk>
Message-ID: <CAKVJ-_7nwF=DEFcg11JqyeRcSHp8YxwX509wU6S20MQeaiQtUQ@mail.gmail.com>

On Wed, Aug 24, 2011 at 11:36 AM, Peter Rice <pmr at ebi.ac.uk> wrote:
>
> However, as GFF3 is such a pain, I am wondering whether to switch the
> default feature format to something else - back to GFF2 or maybe to use GTF.
>

Sadly I have to agree with you - the current version of the GFF3
spec leaves far too much open to multiple interpretation, as we
have been discussing on the song-devel mailing lists. I'm not
sure that GFF2 or GTF are any better though.

On Wed, Aug 24, 2011 at 3:45 PM, Peter Rice <pmr at ebi.ac.uk> wrote:
> On 08/24/2011 11:36 AM, Peter Rice wrote:
>>
>> On 08/17/2011 11:37 AM, Peter Cock wrote:
>>
>>> ------------------------------------------
>>>
>>> Problem Seven - No parent/child relationships
>>>
>>> The EMBOSS 6.4.0 GFF3 file does use parent/child relationships
>>> but not in the way I expected (and not in a way the validator likes).
>
> As a first attempt, using the EMBL entry v00508 in the EMBOSS test set, I
> can make the CDS "parent" feature change its type to "biological_region" and
> add a featflags tag with the true type. Code (not yet checked in) can
> reconstruct the EMBL feature table from this GFF.
>
> However, the EMBL tags are all on the parent (now biological_region)
> feature.
>
> Any suggestions where I should stick them for them to be useful in GFF3?
>
> EMBL feature table:
>
> FT   source          1..3919
> FT                   /organism="Homo sapiens"
> FT                   /mol_type="genomic DNA"
> FT                   /db_xref="taxon:9606"
> FT   CDS             join(2079..2171,2294..2515,3371..3499)
> FT                   /db_xref="GDB:119299"
> FT                   /db_xref="GOA:P02100"
> FT                   /db_xref="HGNC:4830"
> FT                   /db_xref="InterPro:IPR000971"
> FT                   /db_xref="InterPro:IPR002337"
> FT                   /db_xref="InterPro:IPR009050"
> FT                   /db_xref="InterPro:IPR012292"
> FT                   /db_xref="PDB:1A9W"
> FT                   /db_xref="UniProtKB/Swiss-Prot:P02100"
> FT                   /protein_id="CAA23766.1"
> FT                   /translation="MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDS
> FT                   FGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENF
> FT                   KLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH"
>
> proposed GFF3 version
>
> V00508	EMBL	databank_entry	1	3919	.	+	.	ID=V00508.1;organism=Homo sapiens;mol_type=genomic DNA;db_xref=taxon:9606
> V00508	EMBL	biological_region	2079	3499	.	+	0	ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_xref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH
> V00508	EMBL	CDS	2079	2171	.	+	0	Parent=V00508.2
> V00508	EMBL	CDS	2294	2515	.	+	0	Parent=V00508.2
> V00508	EMBL	CDS	3371	3499	.	+	0	Parent=V00508.2
>

I was expecting something like this (done by hand) where we follow the
example on http://www.sequenceontology.org/gff3.shtml and have a
single GFF gene feature represented by three lines linked by virtue of
having the same ID:


V00508	EMBL	databank_entry	1	3919	.	+	.	ID=V00508.1;organism=Homo sapiens;mol_type=genomic DNA;db_xref=taxon:9606
V00508	EMBL	CDS	2079	2171	.	+	0	ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_xref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH
V00508	EMBL	CDS	2294	2515	.	+	0	ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_xref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH
V00508	EMBL	CDS	3371	3499	.	+	0	ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_xref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH

On the downside, I have repeated all the annotation three times - but
that is what was done in the GFF3 example in the spec.

Perhaps this should be raised on the song-devel mailing list along
with our other GFF3 queries.

Regards,

Peter C.


From pmr at ebi.ac.uk  Thu Aug 25 09:52:30 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 25 Aug 2011 14:52:30 +0100
Subject: [emboss-dev] Problems with EMBOSS seqret GenBank to GFF3
In-Reply-To: <CAKVJ-_7nwF=DEFcg11JqyeRcSHp8YxwX509wU6S20MQeaiQtUQ@mail.gmail.com>
References: <CAKVJ-_6hc6j+oy3bZ+ydWL1KJ7m4Xh2xnz-Cjx1vhNUf_dzN8A@mail.gmail.com>
	<4E54D432.8030309@ebi.ac.uk> <4E550E8D.8010506@ebi.ac.uk>
	<CAKVJ-_7nwF=DEFcg11JqyeRcSHp8YxwX509wU6S20MQeaiQtUQ@mail.gmail.com>
Message-ID: <4E56539E.6030400@ebi.ac.uk>

On 25/08/2011 01:44, Peter Cock wrote:
> On Wed, Aug 24, 2011 at 11:36 AM, Peter Rice<pmr at ebi.ac.uk>  wrote:
>>
>> However, as GFF3 is such a pain, I am wondering whether to switch the
>> default feature format to something else - back to GFF2 or maybe to use GTF.
>>
>
> Sadly I have to agree with you - the current version of the GFF3
> spec leaves far too much open to multiple interpretation, as we
> have been discussing on the song-devel mailing lists. I'm not
> sure that GFF2 or GTF are any better though.

GTF is no good for EMBOSS ... way too picky about start and stop codons.

If pushed we could read it in using a version of the GTF parser, but I 
see no point trying to write it using data from any source.


> I was expecting something like this (done by hand) where we follow the
> example on http://www.sequenceontology.org/gff3.shtml and have a
> single GFF gene feature represented by three lines linked by virtue of
> having the same ID:
>
>
> V00508  EMBL    databank_entry  1       3919    .       +       .
> ID=V00508.1;organism=Homo sapiens;mol_type=genomic
> DNA;db_xref=taxon:9606
> V00508  EMBL    CDS     2079    2171    .       +       0
> ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_xref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH
> V00508  EMBL    CDS     2294    2515    .       +       0
> ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_xref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH
> V00508  EMBL    CDS     3371    3499    .       +       0
> ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_xref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH
>
> On the downside, I have repeated all the annotation three times - but
> that is what was done in the GFF3 example in the spec.

Urgh. How about a gene with 80 exons? That's what I was trying to avoid.

How would you plan to read it back in? Transferring all features to the 
parent perhaps, with checks every time for an existing exact copy?
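
The read-back strategy raised in this question (collect rows that share 
an ID, and skip any annotation that is an exact copy of one already 
seen) could look roughly like the Python sketch below. The data layout 
and function name are assumptions made for illustration, not EMBOSS 
internals.

```python
from collections import OrderedDict


def merge_by_id(rows):
    """Group GFF3 rows sharing an ID into one multi-location feature.

    rows: iterable of (start, end, attrs), where attrs is a list of
          (tag, value) pairs that includes the ID.
    Returns an OrderedDict: ID -> {"locations": [...], "attrs": [...]}.
    """
    features = OrderedDict()
    for start, end, attrs in rows:
        fid = dict(attrs)["ID"]
        feat = features.setdefault(fid, {"locations": [], "attrs": []})
        feat["locations"].append((start, end))
        for pair in attrs:
            # Only keep annotation not already recorded for this feature.
            if pair[0] != "ID" and pair not in feat["attrs"]:
                feat["attrs"].append(pair)
    return features


rows = [
    (2079, 2171, [("ID", "V00508.2"), ("protein_id", "CAA23766.1")]),
    (2294, 2515, [("ID", "V00508.2"), ("protein_id", "CAA23766.1")]),
    (3371, 3499, [("ID", "V00508.2"), ("protein_id", "CAA23766.1")]),
]
merged = merge_by_id(rows)
print(merged["V00508.2"]["locations"])
print(merged["V00508.2"]["attrs"])
```

The list membership test repeats for every tag on every row, which is 
the cost behind the worry about a gene with 80 exons.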

I am less impressed with GFF3 each time I look.

I think we'll go with the annotation of the "biological_region" parent 
and wait for anyone with a use case that actually requires massively 
replicated annotation.

regards,

Peter Rice
EMBOSS Team


From p.j.a.cock at googlemail.com  Thu Aug 25 22:27:31 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 26 Aug 2011 03:27:31 +0100
Subject: [emboss-dev] Problems with EMBOSS seqret GenBank to GFF3
In-Reply-To: <4E56539E.6030400@ebi.ac.uk>
References: <CAKVJ-_6hc6j+oy3bZ+ydWL1KJ7m4Xh2xnz-Cjx1vhNUf_dzN8A@mail.gmail.com>
	<4E54D432.8030309@ebi.ac.uk> <4E550E8D.8010506@ebi.ac.uk>
	<CAKVJ-_7nwF=DEFcg11JqyeRcSHp8YxwX509wU6S20MQeaiQtUQ@mail.gmail.com>
	<4E56539E.6030400@ebi.ac.uk>
Message-ID: <CAKVJ-_7L2oTm3f41hPbfiM7NGAfNqtAXRQF+pMFMZ=eqd2qVmw@mail.gmail.com>

On Thu, Aug 25, 2011 at 2:52 PM, Peter Rice <pmr at ebi.ac.uk> wrote:
> On 25/08/2011 01:44, Peter Cock wrote:
>>
>> On Wed, Aug 24, 2011 at 11:36 AM, Peter Rice<pmr at ebi.ac.uk>  wrote:
>>>
>>> However, as GFF3 is such a pain, I am wondering whether to switch the
>>> default feature format to something else - back to GFF2 or maybe to use
>>> GTF.
>>>
>>
>> Sadly I have to agree with you - the current version of the GFF3
>> spec leaves far too much open to multiple interpretation, as we
>> have been discussing on the song-devel mailing lists. I'm not
>> sure that GFF2 or GTF are any better though.
>
> GTF is no good for EMBOSS ... way too picky about start and stop codons
>
> If pushed we could read it in using a version of the GTF parser but I see no
> point trying to write it using data from any source
>
>
>> I was expecting something like this (done by hand) where we follow the
>> example on http://www.sequenceontology.org/gff3.shtml and have a
>> single GFF gene feature represented by three lines linked by virtue of
>> having the same ID:
>>
>> ...
>>
>> On the downside, I have repeated all the annotation three times - but
>> that is what was done in the GFF3 example in the spec.
>
> Urgh. How about a gene with 80 exons? That's what I was trying to avoid.
>
> How would you plan to read it back in? Transferring all features to the
> parent perhaps, with checks every time for an existing exact copy?
>

It would make sense to propose that the first line has all the annotation,
and the subsequent lines from the same feature just need the ID, plus
(if it is adopted) the part tag recently discussed on the song-devel
list to make the order of the sub-parts explicit.
http://sourceforge.net/mailarchive/message.php?msg_id=27960475
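
A minimal Python sketch of that proposal: every line of a multi-location 
feature shares the ID, but only the first line carries the annotation. 
The function name is hypothetical, values are not percent-escaped, and 
the part tag is left out because it is still only under discussion.

```python
def same_id_lines(seqid, source, ftype, segments, strand, fid, attrs):
    """Yield one GFF3 line per segment, all sharing ID=fid.

    segments: list of (start, end, phase) tuples
    attrs:    list of (tag, value) pairs, written on the first line only
    """
    for i, (start, end, phase) in enumerate(segments):
        pairs = [("ID", fid)] + (attrs if i == 0 else [])
        col9 = ";".join("%s=%s" % pair for pair in pairs)
        yield "\t".join([seqid, source, ftype, str(start), str(end),
                         ".", strand, str(phase), col9])


for line in same_id_lines("V00508", "EMBL", "CDS",
                          [(2079, 2171, 0), (2294, 2515, 0), (3371, 3499, 0)],
                          "+", "V00508.2", [("protein_id", "CAA23766.1")]):
    print(line)
```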

>
> I am less impressed with GFF3 each time I look.
>

Me too.

>
> I think we'll go with the annotation of the "biological_region" parent and
> wait for anyone with a use case that actually requires massively replicated
> annotation.
>

Have you looked at the BioPerl GenBank to GFF3 conversion?
I understand GBrowse recommends this as a way to get
GenBank format data into GBrowse. I'm also pretty sure that
this is being used inside TogoWS for GenBank/EMBL to GFF3:

http://togows.dbcls.jp/entry/embl/V00508  <-- original EMBL
http://togows.dbcls.jp/entry/embl/V00508.gff  <-- as GFF3

Interestingly their GFF3 output is pretty close to your proposed
EMBOSS output, only they've got a "region" rather than
"biological_region" for the parent meta-feature.

However, I think introducing extra biological_region features to
act as the parent of multi-location features would run counter to
the canonical gene model given in the GFF3 specification (which
appears to be just a suggestion rather than a requirement).

Also, introducing this meta-feature would complicate any
future wish to try to express explicit parent/child relationships
between operon, gene, mRNA and CDS features. Of course, as
we've discussed, these biological relationships are only implicit
in the GenBank/EMBL feature table.

This is probably a good example to discuss on the GFF3
song-devel mailing list - small and apparently very simple
except for how to represent the (forward strand) join location.

Peter C.


From pmr at ebi.ac.uk  Tue Aug 30 11:48:25 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 30 Aug 2011 16:48:25 +0100
Subject: [emboss-dev] Problems with EMBOSS seqret GenBank to GFF3
In-Reply-To: <CAKVJ-_7L2oTm3f41hPbfiM7NGAfNqtAXRQF+pMFMZ=eqd2qVmw@mail.gmail.com>
References: <CAKVJ-_6hc6j+oy3bZ+ydWL1KJ7m4Xh2xnz-Cjx1vhNUf_dzN8A@mail.gmail.com>
	<4E54D432.8030309@ebi.ac.uk> <4E550E8D.8010506@ebi.ac.uk>
	<CAKVJ-_7nwF=DEFcg11JqyeRcSHp8YxwX509wU6S20MQeaiQtUQ@mail.gmail.com>
	<4E56539E.6030400@ebi.ac.uk>
	<CAKVJ-_7L2oTm3f41hPbfiM7NGAfNqtAXRQF+pMFMZ=eqd2qVmw@mail.gmail.com>
Message-ID: <4E5D0649.3010905@ebi.ac.uk>

On 08/26/2011 03:27 AM, Peter Cock wrote:
> On Thu, Aug 25, 2011 at 2:52 PM, Peter Rice<pmr at ebi.ac.uk>  wrote:
>> On 25/08/2011 01:44, Peter Cock wrote:

> It would make sense to propose that the first line has all the annotation,
> and the subsequent lines from the same feature just need the ID,
> and if it is adopted the part tag recently discussed on the song-devel
> list to make the order of the sub-parts explicit.
> http://sourceforge.net/mailarchive/message.php?msg_id=27960475

The part tag is interesting and would map to the internal "exon" 
attribute in EMBOSS which we reserve for sorting.

>> I think we'll go with the annotation of the "biological_region" parent and
>> wait for anyone with a use case that actually requires massively replicated
>> annotation.
>>
>
> Have you looked at the BioPerl GenBank to GFF3 conversion?
> I understand GBrowse recommends this as a way to get
> GenBank format data into GBrowse. I'm also pretty sure that
> this is being used inside TogoWS for GenBank/EMBL to GFF3:
>
> http://togows.dbcls.jp/entry/embl/V00508  <-- original EMBL
> http://togows.dbcls.jp/entry/embl/V00508.gff  <-- as GFF3

Hmmm .... the GFF3 has Parent references to the protein_id, but it 
doesn't appear as an ID.

I do not like using a second region to put the description line in. 
Using the organism as the ID for the source line also looks odd.

> Interestingly their GFF3 output is pretty close to your proposed
> EMBOSS output, only they've got a "region" rather than
> "biological_region" for the parent meta-feature.

I don't see a parent meta-feature there.

> However, I think introducing extra biological_region features to
> act as the parent of multi-location features would run counter to
> the canonical gene model given in the GFF3 specification (which
> appears to be just a suggestion rather than a requirement).
>
> Also, introducing this meta-feature would complicate any
> future wish to try to express explicit parent/child relationships
> between operon, gene, mRNA and CDS features. Of course, as
> we've discussed, these biological relationships are only implicit
> in the GenBank/EMBL feature table.

I tried the canonical gene example:

##gff-version 3
##sequence-region ctg123 1 9000
ctg123	.	gene	1000	9000	.	+	.	ID=gene00001;Name=EDEN
ctg123	.	TF_binding_site	1000	1012	.	+	.	ID=tfbs00001;Parent=gene00001
ctg123	.	mRNA	1050	9000	.	+	.	ID=mRNA00001;Parent=gene00001;Name=EDEN.1
ctg123	.	five_prime_UTR	1050	1200	.	+	.	Parent=mRNA00001
ctg123	.	CDS	1201	1500	.	+	0	ID=cds00001;Parent=mRNA00001
ctg123	.	CDS	3000	3902	.	+	0	ID=cds00001;Parent=mRNA00001
ctg123	.	CDS	5000	5500	.	+	0	ID=cds00001;Parent=mRNA00001
ctg123	.	CDS	7000	7600	.	+	0	ID=cds00001;Parent=mRNA00001
ctg123	.	three_prime_UTR	7601	9000	.	+	.	Parent=mRNA00001
ctg123	.	cDNA_match	1050	1500	5.8e-42	+	.	ID=match00001;Target=cdna0123+12+462
ctg123	.	cDNA_match	5000	5500	8.1e-43	+	.	ID=match00001;Target=cdna0123+463+963
ctg123	.	cDNA_match	7000	9000	1.4e-40	+	.	ID=match00001;Target=cdna0123+964+2964
##FASTA
>ctg123
cttctgggcgtacccgattctcggagaacttgccgcaccattccgccttg
tgttcattgctgcctgcatgttcattgtctacctcggctacgtgtggcta
tctttcctcggtgccctcgtgcacggagtcgagaaaccaaagaacaaaaa
aagaaattaaaatatttattttgctgtggtttttgatgtgtgttttttat
aatgatttttgatgtgaccaattgtacttttcctttaaatgaaatgtaat
cttaaatgtatttccgacgaattcgaggcctgaaaagtgtgacgccattc
gtatttgatttgggtttactatcgaataatgagaattttcaggcttaggc
ttaggcttaggcttaggcttaggcttaggcttaggcttaggcttaggctt
aggcttaggcttaggcttaggcttaggcttaggcttaggcttaggcttag
aatctagctagctatccgaaattcgaggcctgaaaagtgtgacgccattc
>cdna0123
ttcaagtgctcagtcaatgtgattcacagtatgtcaccaaatattttggc
agctttctcaagggatcaaaattatggatcattatggaatacctcggtgg
aggctcagcgctcgatttaactaaaagtggaaagctggacgaaagtcata
tcgctgtgattcttcgcgaaattttgaaaggtctcgagtatctgcatagt
gaaagaaaaatccacagagatattaaaggagccaacgttttgttggaccg
tcaaacagcggctgtaaaaatttgtgattatggttaaagg

I can now (code not yet checked in) reproduce this, subject to the 
sequence being too short.

Internally, EMBOSS generates parent features for CDS and cDNA_match 
(where several features share an ID), and the parent structure is preserved.

On output, the generated features are not reported so GFF3 input is 
identical.

If we read EMBL/GenBank entries then we will generate a parent feature 
with type "biological_region" to attach the annotation from the join. 
Reproducing the "parent" relationships is a separate exercise that could 
be a separate application. In terms of reading one format and writing 
another I prefer to not generate any GFF3-specific extras.

> This is probably a good example to discuss on the GFF3
> song-devel mailing list - small and apparently very simple
> except for how to represent the (forward strand) join location.

We could propose something for the 
http://www.sequenceontology.org/wiki/index.php/GFF3_best_practices page 
to describe how to represent EMBL/GenBank entries in GFF3 (after due 
discussion on the SONG-devel list)

regards,

Peter Rice
EMBOSS Team

From p.j.a.cock at googlemail.com  Tue Aug  2 18:01:54 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 2 Aug 2011 19:01:54 +0100
Subject: [emboss-dev] EMBOSS 6.4.0 using EMBOSS_001 as the ID in ABI files
Message-ID: <CAKVJ-_66L2gJHL0jpDxK85AfkMQ6K53Dit7SONdEv7VgkcFPfw@mail.gmail.com>

Hi EMBOSS folk,

I'm reporting a regression in EMBOSS 6.4.0 spotted by Wibowo Arindrarto
who has been adding ABI support to Biopython.

With EMBOSS 6.3.1 compiled from source on Mac (as an example),

$ seqret -osformat="fastq-sanger" -filter 310.ab1
@D11F
TGATNTTNACNNTTTTGAANCANTGAGTTAATAGCAATNCTTTACNAATAAGAATATACACTTTCTGCTTAGGGATGATAATTGGCAGGCAAGTGAATCCCTGAGCGTGNATTTGATAATGACCTAAATAATGGATGGGGTTTTAATTCCCAGACCTTCCCCTTTTTAANNGGNGGATTANTGGGGGNNNAACNNGGGGGGCCCTTNCCNAAGGGGGAAAAAATTTNAAACCCCCCNAGGNNGGGNAAAAAAAAATTTCCAAATTNCCGGGGTNNCCCCCAANTTTTTNCCGCNGGGAAAANNNNCCCCCCCNGGGNCCCCCCCCNNAAAAAAAAAAAAAAAAACCCCCCCCCCNTTGGGGNGGTNTNCNCCCCCNNANAANNGGGGGNNAAAAAAAAAGGCCCCCCCCAAAAAAAACCCNCNTTCTNNCNNNNNGNNCNGNNCCCCCNNCCNTNTNGGGGGGGGGGGNGGAAAAAAAACCCCTTTNTGNNNANANNAACCCNCTCNTNTTTTTTTTTTTANGNNNNCNNNNCAAAAAAAAANCNCCCCCNNCNNNCNNNCNCCCCNNNNTNAAAANANNAANNNNTTTTTTTNGGGGGGGTGNGCGNCCCNNANCNNNNNNNNGCGNGGNCNCCNNCCCNCNANAAANNNTNTTTTTTTTTTTTTTTNTNNTCNNCCCNNNCCCCNNCCCCCCCCCCCCCNCCNCNNNNNGGGGNNNCGGNNCNNNNNNNCCNTNCTNNANATNCCNTTNNNNNNNNGNNNNNNNNACNNNNNTNNTNNNCNNNNNNNNNNNNNNCNNNNNNCNNCCCNNCANNNNNNNCNNNNNNNNNNNNNNNNNNNNNTCNCTNCNCNCCCCNCCCNNNNNNNG
+
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!


With EMBOSS 6.4.0 compiled from source on 64 bit Linux, rather
than the expected ID from within the file we get EMBOSS_001,

$ seqret -osformat="fastq-sanger" -filter 310.ab1
@EMBOSS_001
TGATNTTNACNNTTTTGAANCANTGAGTTAATAGCAATNCTTTACNAATAAGAATATACACTTTCTGCTTAGGGATGATAATTGGCAGGCAAGTGAATCCCTGAGCGTGNATTTGATAATGACCTAAATAATGGATGGGGTTTTAATTCCCAGACCTTCCCCTTTTTAANNGGNGGATTANTGGGGGNNNAACNNGGGGGGCCCTTNCCNAAGGGGGAAAAAATTTNAAACCCCCCNAGGNNGGGNAAAAAAAAATTTCCAAATTNCCGGGGTNNCCCCCAANTTTTTNCCGCNGGGAAAANNNNCCCCCCCNGGGNCCCCCCCCNNAAAAAAAAAAAAAAAAACCCCCCCCCCNTTGGGGNGGTNTNCNCCCCCNNANAANNGGGGGNNAAAAAAAAAGGCCCCCCCCAAAAAAAACCCNCNTTCTNNCNNNNNGNNCNGNNCCCCCNNCCNTNTNGGGGGGGGGGGNGGAAAAAAAACCCCTTTNTGNNNANANNAACCCNCTCNTNTTTTTTTTTTTANGNNNNCNNNNCAAAAAAAAANCNCCCCCNNCNNNCNNNCNCCCCNNNNTNAAAANANNAANNNNTTTTTTTNGGGGGGGTGNGCGNCCCNNANCNNNNNNNNGCGNGGNCNCCNNCCCNCNANAAANNNTNTTTTTTTTTTTTTTTNTNNTCNNCCCNNNCCCCNNCCCCCCCCCCCCCNCCNCNNNNNGGGGNNNCGGNNCNNNNNNNCCNTNCTNNANATNCCNTTNNNNNNNNGNNNNNNNNACNNNNNTNNTNNNCNNNNNNNNNNNNNNCNNNNNNCNNCCCNNCANNNNNNNCNNNNNNNNNNNNNNNNNNNNNTCNCTNCNCNCCCCNCCCNNNNNNNG
+
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Regards,

Peter Cock

---------- Forwarded message ----------
From: Wibowo Arindrarto <w.arindrarto at gmail.com>
Date: Sat, Jul 30, 2011 at 8:42 AM
Subject: Re: [Biopython-dev] SeqIO Abi Parser
To: Peter Cock <p.j.a.cock at googlemail.com>
Cc: biopython-dev at lists.open-bio.org


Hi Peter,
I've done some more improvements to the code:
- I've written the check and unittest for the file handle mode. I've
set it so that abi file has to be opened in 'rb' mode, otherwise it'll
return an error. While it's ok to open in 'r' mode in python 2 in
Linux, it has to be specified as 'rb' in Windows and/or Python 3 for
the file to be read correctly. So I decided forcing it to 'rb' is the
best. Because of this, I changed 'test_SeqIO.py:503' to include the
mode argument when opening.
- I've also checked against test_Emboss.py for seqret output, after
including the abi format in it. My EMBOSS version is 6.4.0. There was
a slight problem with this testing, since for some reason the ID
returned by seqret is always "EMBOSS_001". Something might be wrong
with my EMBOSS installation, since when I previously tested it against
6.1.0, the ID was correct (although the qual values were not, so I had to
upgrade). As expected, if I comment out the code that tests for
sequence id ('test_Emboss.py:168-172') the tests pass. Maybe you could
try testing it as well and see if EMBOSS also returns the default id
instead of the sample name?
- Finally, I did some small cosmetic changes to the code (typos, etc).
All changes have been pushed to my github fork. Now I still have time
for the weekend to improve whatever needs to be improved :).
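
The binary-mode check described in the first point could be done along 
these lines. This is a Python 3 sketch with a hypothetical function 
name, not the code pushed to the fork: it reads zero bytes and looks at 
the type that comes back.

```python
import io


def require_binary(handle):
    """Raise ValueError unless the handle yields bytes."""
    # Reading zero bytes does not move the file position, but the return
    # type shows whether the handle is in text mode (str) or binary (bytes).
    if isinstance(handle.read(0), str):
        raise ValueError("ABI files must be opened in binary mode ('rb')")
    return handle


require_binary(io.BytesIO(b"ABIF"))  # accepted: binary handle
try:
    require_binary(io.StringIO("ABIF"))
except ValueError as err:
    print(err)
```

Checking the returned type rather than the handle's mode attribute also 
covers in-memory handles, which have no mode.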
Regards,
---
Wibowo Arindrarto (bow)
http://bow.web.id


On Fri, Jul 29, 2011 at 18:20, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>
> Hi again,
>
> I had a bit of time this afternoon so I looked at this.
>
> On Fri, Jul 29, 2011 at 1:14 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> > On Fri, Jul 29, 2011 at 12:34 PM, Wibowo Arindrarto wrote:
> >> Hi Peter,
> >> Thanks for explaining. I understand why we should stick to the stored
> >> sequence id. In this case, we can use the filename as SeqRecord.name as
> >> well. Regarding BioPerl, I don't have it installed myself -- but I took a
> >> quick look at their source and it seems they also use the stored sequence ID
> >> as their main identifier instead of the filename. If the stored sequence ID
> >> is not present, it's "(unknown)" in their case.
> >
> > OK good, that means Biopython, BioPerl and EMBOSS should be
> > consistent :)
>
> I've made that switch,
>
> >> I'll look on the test_SeqIO.py over the weekend. I think it'll have
> >> something to do with some ambiguous dna base stored in the abi files.
> >> Regards,
> >
> > Some of the alphabet stuff is a bit nasty - so please feel free to ask
> > or get me to help.
>
> I've done enough to get the test_SeqIO.py unit test to pass.
>
> We probably need a check (like in SFF) to check the user hasn't given
> a handle opened in text mode. That should probably have a unit test
> too.
>
> I still haven't cross checked the sequence and PHRED scores from
> your code and EMBOSS.
>
> Anyway - I'll leave the code for you to work on for now...
>
> Peter


From pmr at ebi.ac.uk  Tue Aug  2 18:27:07 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 02 Aug 2011 19:27:07 +0100
Subject: [emboss-dev] EMBOSS 6.4.0 using EMBOSS_001 as the ID in ABI
	files
In-Reply-To: <CAKVJ-_66L2gJHL0jpDxK85AfkMQ6K53Dit7SONdEv7VgkcFPfw@mail.gmail.com>
References: <CAKVJ-_66L2gJHL0jpDxK85AfkMQ6K53Dit7SONdEv7VgkcFPfw@mail.gmail.com>
Message-ID: <4E38417B.6000505@ebi.ac.uk>

On 02/08/2011 19:01, Peter Cock wrote:
> Hi EMBOSS folk,
>
> I'm reporting a regression in EMBOSS 6.4.0 spotted by Wibowo Arindrarto
> who has been adding ABI support to Biopython.
>
> With EMBOSS 6.4.0 compiled from source on 64 bit Linux, rather
> than the expected ID from within the file we get EMBOSS_001,

Can you please run with -debug on the command line and send me the 
seqret.dbg file to see what it thought was in the file

regards,

Peter Rice
EMBOSS team


From p.j.a.cock at googlemail.com  Wed Aug  3 07:57:01 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 3 Aug 2011 08:57:01 +0100
Subject: [emboss-dev] EMBOSS 6.4.0 using EMBOSS_001 as the ID in ABI
	files
In-Reply-To: <4E38417B.6000505@ebi.ac.uk>
References: <CAKVJ-_66L2gJHL0jpDxK85AfkMQ6K53Dit7SONdEv7VgkcFPfw@mail.gmail.com>
	<4E38417B.6000505@ebi.ac.uk>
Message-ID: <CAKVJ-_7jeiY9yPriP8TkjHuoHxKQ80rSG_=+W9=cPXV+zZd_tQ@mail.gmail.com>

On Tue, Aug 2, 2011 at 7:27 PM, Peter Rice <pmr at ebi.ac.uk> wrote:
>
> On 02/08/2011 19:01, Peter Cock wrote:
>>
>> Hi EMBOSS folk,
>>
>> I'm reporting a regression in EMBOSS 6.4.0 spotted by Wibowo Arindrarto
>> who has been adding ABI support to Biopython.
>>
>> With EMBOSS 6.4.0 compiled from source on 64 bit Linux, rather
>> than the expected ID from within the file we get EMBOSS_001,
>
> Can you please run with -debug on the command line and send me the
> seqret.dbg file to see what it thought was in the file

No problem - sent directly to Peter R,

Peter


From ajb at ebi.ac.uk  Thu Aug 11 13:22:25 2011
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Thu, 11 Aug 2011 14:22:25 +0100 (BST)
Subject: [emboss-dev] EMBOSS and mEMBOSS bug-fixes for 6.4.0 released
Message-ID: <53905.82.26.12.214.1313068945.squirrel@imap04.ebi.ac.uk>

New bug-fix files are available for EMBOSS-6.4.0 and, for Windows
users, a new version of mEMBOSS is available.

The bugs fixed are appended for easy reference.

1) UNIX

As usual, the most convenient way of applying the bug-fixes should be
to apply the patch file:

ftp://emboss.open-bio.org/pub/EMBOSS/fixes/patches/patch-1-11.gz

to a freshly extracted copy of the EMBOSS-6.4.0.tar.gz source code
and recompiling/installing.

(see ftp://emboss.open-bio.org/pub/EMBOSS/fixes/patches/README.patch
 for instructions on using 'patch').

Alternatively, you can individually copy the patched files
from the ftp://emboss.open-bio.org/pub/EMBOSS/fixes/ directory
if your system does not support 'patch'.

2) mEMBOSS

The new version incorporates all the bug-fixes listed below.
Uninstall your previous mEMBOSS installation and download and install
the new setup file from:

ftp://emboss.open-bio.org/pub/EMBOSS/windows/mEMBOSS-6.4.0.2-setup.exe


Alan

-----------------------------------------------------------------------

Fix 1. EMBOSS-6.4.0/emboss/dbiflat.c
       EMBOSS-6.4.0/emboss/dbxflat.c

10 Aug 2011: The SwissProt description line format includes additional
             tags which interfere with the EMBL parser used in
             previous releases. The fix replaces this with a SwissProt
             parser that strips out the extra tags. After patching the
             release, any existing SwissProt description index files
             should be reindexed. Other indexes are unchanged.

Fix 2. EMBOSS-6.4.0/ajax/core/ajquery.c

10 Aug 2011: For databases with more than one valid format (examples
             include the EBI dbfetch server) this fix allows the
             format to be specified with a qualifier on the command
             line. In the original release, only a format in the query
             string was used.


Fix 3. EMBOSS-6.4.0/ajax/core/ajfeatread.c

10 Aug 2011: When parsing GFF3 format input, long feature tags (for
             example extremely long translations) exceeded limits in
             regular expression parsing. This fix decouples testing for
             escaped quotes from the main task of finding quoted
             strings.


Fix 4. EMBOSS-6.4.0/emboss/data/Etcode.dat

10 Aug 2011: The local data file used by application tcode had a missing
             parameter line.


Fix 5. EMBOSS-6.4.0/ajax/core/ajrange.c

10 Aug 2011: When sequence ranges (and possible highlighting for
             showalign) were in a list file, the parser overwrote
             string values.


Fix 5. EMBOSS-6.4.0/ajax/core/ajseqabi.c

10 Aug 2011: Sample names in ABI format files were stored in
             incompletely defined strings. This fix corrects the
             string object. The sample name is also used as the
             sequence name.


Fix 6. EMBOSS-6.4.0/emboss/dbxresource.c

10 Aug 2011: A future change to the format of Data Resource Catalogue
             entries in DRCAT.dat requires an update to the parsing of
             category lines. The current version is not affected.


Fix 7. EMBOSS-6.4.0/emboss/server.ensemblgenomes
       EMBOSS-6.4.0/emboss/cacheensembl.c
       EMBOSS-6.4.0/ajax/ensembl/ensregistry.c
       EMBOSS-6.4.0/ajax/ensembl/ensregistry.c
       EMBOSS-6.4.0/ajax/ensembl/ensdatabaseadaptor.c
       EMBOSS-6.4.0/ajax/ensembl/ensdatabaseadaptor.h

10 Aug 2011: Microbial genomes use an enumerated species code which
             must be added to the query for data retrieval. This fix
             adds the species code to the comment field. In the next
             release a more complete solution will be implemented.


Fix 8. EMBOSS-6.4.0/ajax/core/ajarch.h

10-Aug-2011: Corrects the size of long integers on Windows systems only.


Fix 9. EMBOSS-6.4.0/emboss/cirdna.c

10-Aug-2011: Cirdna prints text inside solid blocks invisibly. When
             printed outside, the text scaling was too small. The text
             scale is now adjusted for the radius and sequence length
             so that labels should be readable outside the box.


Fix 10. EMBOSS-6.4.0/ajax/core/ajpat.c

10-Aug-2011: Fuzznuc, fuzzpro and fuzztran using a pattern file
             ignored the command line -mismatch qualifier for the
             first pattern. The default mismatch is now set to this
             value at the start of the pattern matching loop in the
             library.


Fix 11. EMBOSS-6.4.0/ajax/core/ajfmt.c

11-Aug-2011: The function ajFmtScanF() handled va_list incorrectly. This
             potentially affected code developers only.


From ajb at ebi.ac.uk  Thu Aug 11 15:58:25 2011
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Thu, 11 Aug 2011 16:58:25 +0100 (BST)
Subject: [emboss-dev] EMBOSS and mEMBOSS bug-fixes for 6.4.0 released
Message-ID: <49005.82.26.12.214.1313078305.squirrel@imap04.ebi.ac.uk>

UNIX users who downloaded the bug-fix patch file for EMBOSS earlier
this afternoon may have found that there were compilation problems on
a limited number of architectures.

The patch has been amended slightly to hopefully fix this problem
so please download it again if you were affected.

If anyone continues to experience compilation problems then
please let me know.

Alan


From p.j.a.cock at googlemail.com  Tue Aug 16 15:03:26 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 16 Aug 2011 16:03:26 +0100
Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO)
Message-ID: <CAKVJ-_6N0hWOu2o1GAdV81YJiEqzMCfLJmaHnWnNa9SwhQUSyQ@mail.gmail.com>

Dear Peter R. (et al.),

I recall from one of our chats in person that EMBOSS has some
mapping tables to convert the various different data file formats'
feature names into a common standard (the Sequence Ontology?),
for the purpose of inter-converting files. e.g. Converting a UniProt/
SwissProt plain text protein file into a GenPept protein file or GFF3

Is that a fair summary?

It seems to match the minutes of this meeting (found with
Google) http://emboss.sourceforge.net/meetings/2009-02-16.html

> DASGFF requires a sequence ontology (or BioSapiens
> ontology) tag for protein features. Peter has updated the
> Efeatures definitions for proteins to use GFF3 sequence
> ontology codes as internal identifiers, and to use GFF3
> as the principal definitions for all protein features. All
> SwissProt feature types (36 in the current Swissprot
> release) are also defined with the closest possible match
> to the sequence ontology. Where there is no exact match,
> an EMBOSS internal type is defined using the closest SO
> code and the original feature type as a suffix. For SwissProt
> output this is converted back to the swissprot feature type.
> For GFF3 output the internal type is an alias for the closest
> (more general) SO term.

Can you point me at these mapping tables in the EMBOSS
source code please?

I'm particularly interested in the SwissProt to SO mapping
right now.

Thanks.

Peter C.


From pmr at ebi.ac.uk  Tue Aug 16 15:26:51 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 16 Aug 2011 16:26:51 +0100
Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO)
In-Reply-To: <4E4A8BF7.4020106@ebi.ac.uk>
References: <4E4A8BF7.4020106@ebi.ac.uk>
Message-ID: <4E4A8C3B.8030306@ebi.ac.uk>

On 08/16/2011 04:03 PM, Peter Cock wrote:
> Dear Peter R. (et al.),
> 
> I recall from one of our chats in person that EMBOSS has some
> mapping tables to convert the various different data file format's
> feature names into a common standard (the Sequence Ontology?),
> for the purpose of inter-converting files. e.g. Converting a UniProt/
> SwissProt plain text protein file into a GenPept protein file or GFF3
> 
> Is that a fair summary?

Yes, we needed an internal identifier for feature types, and picked SO
for nucleotides - and then were able to add the protein terms when they
became available.

There are a few made-up internal names, with _text after the SO term,
that were needed in the early days of the BioSapiens Ontology, and some
dodgy mapping between SO and EMBL/GenBank for immunoglobulin gene
regions, but I believe they are no longer used.

The first term in the file is defined as the default if nothing is
recognized (region or misc_feature)

> Can you point me at these mapping tables in the EMBOSS
> source code please?

emboss/data/Efeatures.embl
emboss/data/Efeatures.swiss

> I'm particularly interested in the SwissProt to SO mapping
> right now.

That was originally done by the BioSapiens "Network of excellence" for
annotating ENCODE data. They developed the protein features which were
then added to the sequence ontology.

You can look at SO terms in EMBOSS with:

ontoget so:0001094

or

ontoget -filter -oformat excel so:0001094

(Hmmm, should do something better for a missing namespace - it was
defined as a format for EDAM)


Let me know if you spot anything in need of updating.

We also have (especially for EMBL) equivalent Etags files listing the
available feature qualifiers.

regards,

Peter Rice
EMBOSS Team


From p.j.a.cock at googlemail.com  Tue Aug 16 15:36:24 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 16 Aug 2011 16:36:24 +0100
Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO)
In-Reply-To: <4E4A8C3B.8030306@ebi.ac.uk>
References: <4E4A8BF7.4020106@ebi.ac.uk>
	<4E4A8C3B.8030306@ebi.ac.uk>
Message-ID: <CAKVJ-_7XSaQ2D+rQ+5d9q-KdvnHdj2s8qN6r+RjEYHwy=YBz1w@mail.gmail.com>

On Tue, Aug 16, 2011 at 4:26 PM, Peter Rice <pmr at ebi.ac.uk> wrote:
> Yes, We needed an internal identifier for feature types, and picked SO
> for nucleotides - and then were able to add the protein terms when they
> became available.
> ...

Thanks!

>
> Let me know if you spot anything in need of updating.
>

I have found three protein features which have been
renamed, and one which appears to be wrong... see below.

I recently noticed that UniProt provides GFF3 files,
e.g. http://www.uniprot.org/uniprot/P99999.gff

========================================
##gff-version 3
##sequence-region P99999 1 105
P99999	UniProtKB	Initiator methionine	1	1	.	.	.	Note=Removed	
P99999	UniProtKB	Chain	2	105	.	.	.	ID=PRO_0000108218;Note=Cytochrome c	
P99999	UniProtKB	Metal binding	19	19	.	.	.	Note=Iron (heme axial ligand)	
P99999	UniProtKB	Metal binding	81	81	.	.	.	Note=Iron (heme axial ligand)	
P99999	UniProtKB	Binding site	15	15	.	.	.	Note=Heme (covalent)	
P99999	UniProtKB	Binding site	18	18	.	.	.	Note=Heme (covalent)	
P99999	UniProtKB	Modified residue	2	2	.	.	.	Note=N-acetylglycine	
P99999	UniProtKB	Modified residue	49	49	.	.	.	Note=Phosphotyrosine;Status=By similarity
P99999	UniProtKB	Modified residue	98	98	.	.	.	Note=Phosphotyrosine;Status=By similarity
P99999	UniProtKB	Natural variant	42	42	.	.	.	ID=VAR_044450;Note=In THC4%3B increases the pro-apoptotic function by triggering caspase activation more efficiently than wild-type%3B does not affect the redox function.
P99999	UniProtKB	Natural variant	56	56	.	.	.	ID=VAR_048850	
P99999	UniProtKB	Natural variant	66	66	.	.	.	ID=VAR_002204;Note=In 10%25 of the molecules.
P99999	UniProtKB	Sequence conflict	18	18	.	.	.	.	
P99999	UniProtKB	Sequence conflict	41	41	.	.	.	.	
P99999	UniProtKB	Helix	4	14	.	.	.	.	
P99999	UniProtKB	Turn	16	18	.	.	.	.	
P99999	UniProtKB	Beta strand	23	25	.	.	.	.	
P99999	UniProtKB	Beta strand	28	30	.	.	.	.	
P99999	UniProtKB	Turn	36	38	.	.	.	.	
P99999	UniProtKB	Helix	51	56	.	.	.	.	
P99999	UniProtKB	Helix	62	70	.	.	.	.	
P99999	UniProtKB	Helix	72	75	.	.	.	.	
P99999	UniProtKB	Helix	89	102	.	.	.	.	
========================================

However, they are not using Sequence Ontology terms
in column three and so fail the online GFF3 validator
http://modencode.oicr.on.ca/cgi-bin/validate_gff3_online
listed in http://www.sequenceontology.org/gff3.shtml
(GFF3 specification currently at v1.20). Additionally
that UniProt GFF3 uses an upper case reserved tag,
"Status" rather than perhaps "status", in the modified
residue features.
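
To illustrate the reserved tag point, here is a minimal Python sketch (not
part of EMBOSS or the validator; the function name is made up for this
example). It assumes the reserved attribute tags are the ones listed in the
GFF3 specification and flags any other tag starting with an upper case letter:

```python
# Attribute tags reserved by the GFF3 specification (all start upper case).
RESERVED_TAGS = {
    "ID", "Name", "Alias", "Parent", "Target", "Gap",
    "Derives_from", "Note", "Dbxref", "Ontology_term", "Is_circular",
}


def unreserved_uppercase_tags(gff3_line):
    """Return column 9 tags that start upper case but are not reserved."""
    columns = gff3_line.rstrip("\n").split("\t")
    if len(columns) < 9 or columns[8] in ("", "."):
        return []
    bad = []
    for pair in columns[8].split(";"):
        tag = pair.split("=", 1)[0].strip()
        if tag and tag[0].isupper() and tag not in RESERVED_TAGS:
            bad.append(tag)
    return bad


# One of the modified residue lines from the UniProt file above.
line = ("P99999\tUniProtKB\tModified residue\t49\t49\t.\t.\t.\t"
        "Note=Phosphotyrosine;Status=By similarity")
print(unreserved_uppercase_tags(line))  # ['Status']
```

This only looks at the tag names in column 9; it does not check the feature
type in column three against the Sequence Ontology.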

I will report this to UniProt later. However, first I thought
I would try converting one of the other files provided into
GFF3 using EMBOSS seqret for an alternative, e.g. the
plain text "swiss" format: http://www.uniprot.org/uniprot/P99999.txt

I can convert this using seqret as follows:

========================================
$ seqret -feature -osformat=gff3 -sformat=swiss -sequence P99999.txt -stdout -auto
##gff-version 3
##sequence-region CYC_HUMAN 1 105
#!Date 2011-08-16
#!Type Protein
#!Source-version EMBOSS 6.4.0.0
CYC_HUMAN	SWISSPROT	cleaved_initiator_methionine	1	1	.	+	.	ID=CYC_HUMAN.1;note=Removed
CYC_HUMAN	SWISSPROT	mature_protein_region	2	105	.	+	.	ID=CYC_HUMAN.2;note=Cytochrome c;ftid=PRO_0000108218
CYC_HUMAN	SWISSPROT	metal_binding	19	19	.	+	.	ID=CYC_HUMAN.3;note=Iron;comment=heme axial ligand
CYC_HUMAN	SWISSPROT	metal_binding	81	81	.	+	.	ID=CYC_HUMAN.4;note=Iron;comment=heme axial ligand
CYC_HUMAN	SWISSPROT	binding_site	15	15	.	+	.	ID=CYC_HUMAN.5;note=Heme;comment=covalent
CYC_HUMAN	SWISSPROT	binding_site	18	18	.	+	.	ID=CYC_HUMAN.6;note=Heme;comment=covalent
CYC_HUMAN	SWISSPROT	protein_modification_categorized_by_chemical_process	2	2	.	+	.	ID=CYC_HUMAN.7;note=N-acetylglycine
CYC_HUMAN	SWISSPROT	protein_modification_categorized_by_chemical_process	49	49	.	+	.	ID=CYC_HUMAN.8;note=Phosphotyrosine;comment=By similarity
CYC_HUMAN	SWISSPROT	protein_modification_categorized_by_chemical_process	98	98	.	+	.	ID=CYC_HUMAN.9;note=Phosphotyrosine;comment=By similarity
CYC_HUMAN	SWISSPROT	natural_variant	42	42	.	+	.	ID=CYC_HUMAN.10;note=G -> S;comment=in THC4%3B increases the pro- apoptotic function by triggering caspase activation more efficiently than wild- type%3B does not affect the redox function;ftid=VAR_044450
CYC_HUMAN	SWISSPROT	natural_variant	56	56	.	+	.	ID=CYC_HUMAN.11;note=K -> R;comment=in dbSNP:rs11548795;ftid=VAR_048850
CYC_HUMAN	SWISSPROT	natural_variant	66	66	.	+	.	ID=CYC_HUMAN.12;note=M -> L;comment=in 10%25 of the molecules;ftid=VAR_002204
CYC_HUMAN	SWISSPROT	sequence_conflict	18	18	.	+	.	ID=CYC_HUMAN.13;note=C -> Y;comment=in Ref. 8%3B AAH15130
CYC_HUMAN	SWISSPROT	sequence_conflict	41	41	.	+	.	ID=CYC_HUMAN.14;note=T -> I;comment=in Ref. 8%3B AAH68464
CYC_HUMAN	SWISSPROT	alpha_helix	4	14	.	+	.	ID=CYC_HUMAN.15
CYC_HUMAN	SWISSPROT	turn	16	18	.	+	.	ID=CYC_HUMAN.16
CYC_HUMAN	SWISSPROT	beta_strand	23	25	.	+	.	ID=CYC_HUMAN.17
CYC_HUMAN	SWISSPROT	beta_strand	28	30	.	+	.	ID=CYC_HUMAN.18
CYC_HUMAN	SWISSPROT	turn	36	38	.	+	.	ID=CYC_HUMAN.19
CYC_HUMAN	SWISSPROT	alpha_helix	51	56	.	+	.	ID=CYC_HUMAN.20
CYC_HUMAN	SWISSPROT	alpha_helix	62	70	.	+	.	ID=CYC_HUMAN.21
CYC_HUMAN	SWISSPROT	alpha_helix	72	75	.	+	.	ID=CYC_HUMAN.22
CYC_HUMAN	SWISSPROT	alpha_helix	89	102	.	+	.	ID=CYC_HUMAN.23
##FASTA
>CYC_HUMAN P99999 Cytochrome c
MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIW
GEDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE
========================================

Interestingly EMBOSS includes the sequence at the bottom
(using the FASTA directive) and has generated unique ID tags
for each feature. It has also added more note tags.

Unfortunately this also failed the GFF3 validation. The EMBOSS
output does a lot better (e.g. "cleaved_initiator_methionine" is
valid while "Initiator methionine" in the UniProt file was not)

However, some of the terms in column 3 are apparently out of
date - but http://www.sequenceontology.org does list them as
synonyms:

* metal_binding -> polypeptide_metal_contact
* natural_variant -> natural_variant_site
* turn -> polypeptide_turn_motif

It looks like the EMBOSS sequence ontology table may need
updating for at least these three cases.
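
As a stop-gap on the user side, the three renames can be applied to the
seqret output after the fact. This is only a sketch in Python: the table holds
just the three cases listed above, the function name is hypothetical, and it
says nothing about how the EMBOSS Efeatures files themselves should be edited:

```python
# Old column 3 term -> current Sequence Ontology term (three cases only).
SO_RENAMES = {
    "metal_binding": "polypeptide_metal_contact",
    "natural_variant": "natural_variant_site",
    "turn": "polypeptide_turn_motif",
}


def update_feature_type(gff3_line):
    """Rewrite column 3 of a feature line; leave other lines untouched."""
    if gff3_line.startswith("#"):
        return gff3_line
    columns = gff3_line.split("\t")
    if len(columns) != 9:
        # FASTA header or sequence line, not a feature line.
        return gff3_line
    columns[2] = SO_RENAMES.get(columns[2], columns[2])
    return "\t".join(columns)


old = "CYC_HUMAN\tSWISSPROT\tturn\t16\t18\t.\t+\t.\tID=CYC_HUMAN.16"
print(update_feature_type(old))
```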

Finally protein_modification_categorized_by_chemical_process
does not seem to be valid (I failed to find it in the ontology).

Additionally the validator complained about some of the note
in Line 15, probably due to the %3B escaped semi-colon,
but that may be a bug in the validator.

Peter C.


From p.j.a.cock at googlemail.com  Tue Aug 16 18:39:05 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 16 Aug 2011 19:39:05 +0100
Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO)
In-Reply-To: <CAKVJ-_7XSaQ2D+rQ+5d9q-KdvnHdj2s8qN6r+RjEYHwy=YBz1w@mail.gmail.com>
References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk>
	<CAKVJ-_7XSaQ2D+rQ+5d9q-KdvnHdj2s8qN6r+RjEYHwy=YBz1w@mail.gmail.com>
Message-ID: <CAKVJ-_4_HRr-LLj2yBQ2=X2QtLktKpJ9P0o60cyqQ-pnUY01Tg@mail.gmail.com>

On Tue, Aug 16, 2011 at 4:36 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>
> I recently noticed that the UniProt provide GFF3 files,
> e.g. http://www.uniprot.org/uniprot/P99999.gff
>
> ...
> http://www.uniprot.org/uniprot/P99999.txt
> ...
>
> $ seqret -feature -osformat=gff3 -sformat=swiss -sequence P99999.txt

I also noticed the seqret GFF3 output is using "+" as the strand,
which is wrong for a protein reference like this. It should be using
"." (period) as the features on a protein are strand-less (as done
in the UniProt GFF3 file).
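
Until this is changed in seqret, a small filter can do the substitution. The
Python sketch below is an assumption-laden workaround, not EMBOSS code: it
relies on the "#!Type Protein" line that seqret writes (as in the output
above) to decide whether the strand column should be reset, and the function
name is made up:

```python
def fix_protein_strand(lines):
    """Yield GFF3 lines, setting column 7 to '.' once '#!Type Protein' is seen."""
    is_protein = False
    in_fasta = False
    for line in lines:
        if line.startswith("#!Type"):
            is_protein = line.split()[1:2] == ["Protein"]
        elif line.startswith("##FASTA"):
            in_fasta = True
        elif is_protein and not in_fasta and not line.startswith("#"):
            columns = line.split("\t")
            if len(columns) == 9:
                columns[6] = "."
                line = "\t".join(columns)
        yield line


example = [
    "##gff-version 3",
    "#!Type Protein",
    "CYC_HUMAN\tSWISSPROT\tbeta_strand\t23\t25\t.\t+\t.\tID=CYC_HUMAN.17",
]
for fixed in fix_protein_strand(example):
    print(fixed)
```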

Regards,

Peter C.


From p.j.a.cock at googlemail.com  Wed Aug 17 10:37:06 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 17 Aug 2011 11:37:06 +0100
Subject: [emboss-dev] Problems with EMBOSS seqret GenBank to GFF3
Message-ID: <CAKVJ-_6hc6j+oy3bZ+ydWL1KJ7m4Xh2xnz-Cjx1vhNUf_dzN8A@mail.gmail.com>

Hi again Peter R. (et al.),

Following yesterday's discussion about GFF3 files from UniProt,
I'm trying seqret to produce GFF3 from GenBank files. I'd already
found the NCBI currently provides some very broken GFF3 files:

http://blastedbio.blogspot.com/2011/08/why-are-ncbi-gff3-files-still-broken.html

$ wget ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans_Kin4_M_uid58009/NC_005213.gff
$ wget ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans_Kin4_M_uid58009/NC_005213.gbk
$ seqret --version
EMBOSS:6.4.0.0
$ seqret -filter -feature -sequence NC_005213.gbk -sformat=genbank -osformat=gff3 | head -n 20
##gff-version 3
##sequence-region NC_005213 1 490885
#!Date 2011-08-17
#!Type DNA
#!Source-version EMBOSS 6.4.0.0
NC_005213	EMBL	databank_entry	1	490885	.	+	.	ID=NC_005213.1;organism=Nanoarchaeum equitans Kin4-M;mol_type=genomic DNA;strain=Kin4-M;db_xref=taxon:228908
NC_005213	EMBL	gene	3254	35301	.	+	.	ID=NC_005213.2;locus_tag=NEQ_t01;experiment=experimental evidence%2C no additional details recorded;trans_splicing=true;db_xref=GeneID:3362429
NC_005213	EMBL	gene	35233	35301	.	+	.	Parent=NC_005213.2
NC_005213	EMBL	gene	3254	3289	.	+	.	Parent=NC_005213.2
NC_005213	EMBL	tRNA	3254	35287	.	+	.	ID=NC_005213.5;locus_tag=NEQ_t01;product=tRNA-Met;experiment=experimental evidence%2C no additional details recorded;trans_splicing=true;db_xref=GeneID:3362429
NC_005213	EMBL	tRNA	35249	35287	.	+	.	Parent=NC_005213.5
NC_005213	EMBL	tRNA	3254	3289	.	+	.	Parent=NC_005213.5
NC_005213	EMBL	gene	1	490885	.	-	.	ID=NC_005213.8;locus_tag=NEQ001;db_xref=GeneID:2732620
NC_005213	EMBL	gene	490883	490885	.	-	.	Parent=NC_005213.8
NC_005213	EMBL	gene	1	879	.	-	.	Parent=NC_005213.8
NC_005213	EMBL	CDS	1	490885	.	-	0	ID=NC_005213.11;locus_tag=NEQ001;note=conserved hypothetical [Methanococcus jannaschii]%3B COG1583:Uncharacterized ACR%3B IPR001472:Bipartite nuclear localization signal%3B IPR002743: Protein of unknown function DUF57;codon_start=1;transl_table=11;product=hypothetical protein;protein_id=NP_963295.1;db_xref=GI:41614797;db_xref=GeneID:2732620;translation=MRLLLELKALNSIDKKQLSNYLIQGFIYNILKNTEYSWLHNWKKEKYFNFTLIPKKDIIENKRYYLIISSPDKRFIEVLHNKIKDLDIITIGLAQFQLRKTKKFDPKLRFPWVTITPIVLREGKIVILKGDKYYKVFVKRLEELKKYNLIKKKEPILEEPIEISLNQIKDGWKIIDVKDRYYDFRNKSFSAFSNWLRDLKEQSLRKYNNFCGKNFYFEEAIFEGFTFYKTVSIRIRINRGEAVYIGTLWKELNVYRKLDKEEREFYKFLYDCGLGSLNSMGFGFVNTKKNSAR
NC_005213	EMBL	CDS	490883	490885	.	-	0	Parent=NC_005213.11
NC_005213	EMBL	CDS	1	879	.	-	0	Parent=NC_005213.11
NC_005213	EMBL	sequence_feature	7	879	.	-	.	ID=NC_005213.14;locus_tag=NEQ001;note=CRISPR/Cas system-associated RAMP superfamily protein Cas6%3B Region: Cas6-I-III%3B cl11443;db_xref=CDD:196236
NC_005213	EMBL	gene	883	2691	.	+	.	ID=NC_005213.15;locus_tag=NEQ003;db_xref=GeneID:2654355

I've deliberately cut the example here to include all of NEQ_t01, an
interesting trans-spliced tRNA, and all of NEQ001, an interesting gene
because it spans the origin of this circular genome. I use these examples
in the blog post and discuss them again below.

Given some of the points below, I suspect EMBOSS is producing GFF3
prior to the additions made in v1.18 (24 June 2010) regarding circular
genomes.

The following numbering reflects the issues listed on my blog post
about the NCBI version of the GFF3 file (link given above).

------------------------------------------

Problem One - Invalid Feature Types

EMBOSS looks OK here, you're converting the GenBank feature types
source and misc_feature into databank_entry and sequence_feature
respectively.

------------------------------------------

Problem Two - Circular features not marked

EMBOSS is also lacking in this area.

EMBOSS has used feature type databank_entry and generated feature ID
NC_005213.1 for the landmark. However, this should include the special
tag entry Is_circular=true, since this is the landmark feature for the whole
circular chromosome.

------------------------------------------

Problem Three - Missing ID tags on multi-location features

Unlike the NCBI file which fails to cross link multi-location features like
trans-spliced NEQ_t01, EMBOSS looks better. However, I don't think
you are following the expected pattern as used in the canonical GFF3
examples.

In the GenBank file, this tRNA is join(35233..35301,3254..3289)

For the gene and tRNA features for NEQ_t01, EMBOSS is generating
three GFF3 lines. First a very broad parent feature 3254 to 35301,
then two children 35233 to 35301 and 3254 to 3289.

I would expect two GFF3 lines (for each of gene and tRNA), just
35233 to 35301 and 3254 to 3289 which would be linked by virtue
of having the same ID.
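
A minimal Python sketch of the layout I have in mind, one GFF3 line per
sub-location with a shared ID and no broad parent line. The function name and
argument order are made up for illustration, and the ID value is simply the
one EMBOSS generated above:

```python
def multi_location_lines(seqid, source, ftype, strand, parts, feature_id):
    """Return one GFF3 line per (start, end) part, all sharing the same ID."""
    return [
        "\t".join([seqid, source, ftype, str(start), str(end),
                   ".", strand, ".", "ID=%s" % feature_id])
        for start, end in parts
    ]


# The two regions from join(35233..35301,3254..3289) for the NEQ_t01 gene.
gene_lines = multi_location_lines(
    "NC_005213", "EMBL", "gene", "+",
    [(35233, 35301), (3254, 3289)], "NC_005213.2")
for gff3_line in gene_lines:
    print(gff3_line)
```

The other attributes (locus_tag, db_xref and so on) are left out here to keep
the example short.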

The online GFF3 validator would seem to support my interpretation,
reporting errors like this:

8            [ERROR]   invalid type pair - check all parents (at line 7; gene to gene)
11           [ERROR]   invalid type pair - check all parents (at line 10; tRNA to tRNA)
14           [ERROR]   invalid type pair - check all parents (at line 13; gene to gene)
17           [ERROR]   invalid type pair - check all parents (at line 16; CDS to CDS)
28           [ERROR]   invalid type pair - check all parents (at line 27; sequence_feature to sequence_feature)


This is related to "Problem Six" and "Problem Seven" below.

------------------------------------------

Problem Four - Wrong tag for database cross references

I had noticed the NCBI using a local tag (lower case) db_xref rather
than the standard (upper case = reserved) tag Dbxref. EMBOSS
does the same - is this deliberate and if so why?
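
For reference, this is roughly the rewrite I would expect, sketched in Python.
It assumes that repeated values of one tag are written as a single
comma-separated list, as the GFF3 specification describes for multi-valued
attributes; the function name is hypothetical and this is not how EMBOSS
builds column 9 internally:

```python
def merge_db_xref(attributes):
    """Collect repeated db_xref=... pairs into one reserved Dbxref tag."""
    xrefs = []
    others = []
    for pair in attributes.split(";"):
        tag, _, value = pair.partition("=")
        if tag == "db_xref":
            xrefs.append(value)
        else:
            others.append(pair)
    if xrefs:
        others.append("Dbxref=" + ",".join(xrefs))
    return ";".join(others)


column9 = "ID=NC_005213.11;db_xref=GI:41614797;db_xref=GeneID:2732620"
print(merge_db_xref(column9))
# ID=NC_005213.11;Dbxref=GI:41614797,GeneID:2732620
```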

------------------------------------------

Problem Five - Missing stop codon in CDS features

EMBOSS looks OK here

------------------------------------------

Problem Six - Features wrapping the origin of a circular genome

Related to the landmark feature lacking the Is_circular=true tag, the
gene and CDS features for origin wrapping NEQ001 look funny to me.
EMBOSS seems to be generating three GFF3 lines for the gene and CDS
for NEQ001, a surprisingly broad entry 1 to 490885 and two children
490883 to 490885 and 1 to 879 (which do look sensible).

This is essentially the same point I raised above with NEQ_t01, but
with the added complication of spanning the origin.

Based on the old specification, I had expected two GFF3 lines each for the
gene and CDS, giving the regions 490883 to 490885 and 1 to 879, linked
by virtue of having the same ID.

Thankfully this potential confusion has been addressed in the updated
specification, so I would expect a single GFF3 line for each of the gene
and CDS for NEQ001, using start 490883 and end of 879+490885=491764.
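
The arithmetic, as a small Python sketch (the function and argument names are
made up; it only covers the simple case of one feature crossing the origin
once, as I read the circular genome addition to the specification):

```python
def origin_spanning_coords(start_before_origin, end_after_origin, genome_length):
    """Return (start, end) for a feature wrapping the origin of a circular
    genome, with the end extended past the landmark length."""
    if not 1 <= end_after_origin < start_before_origin <= genome_length:
        raise ValueError("coordinates do not describe an origin-spanning feature")
    return start_before_origin, end_after_origin + genome_length


# Parts 490883..490885 and 1..879 on a 490885 bp circular chromosome.
print(origin_spanning_coords(490883, 879, 490885))  # (490883, 491764)
```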

------------------------------------------

Problem Seven - No parent/child relationships

The NCBI GFF3 file had no parent/child relationships at all.

The EMBOSS 6.4.0 GFF3 file does use parent/child relationships
but not in the way I expected (and not in a way the validator likes).
As discussed above, for the GenBank join locations EMBOSS
seems to create broad parent features with children for each
sub-location (parent/child relations of the same type = bad).

What I'm expecting instead is parent/child relationships between
the CDS and gene features, between tRNA and gene features, etc.
Note that these relationships are implicit in the GenBank (and EMBL)
flat files, so I accept trying to deduce them might be hard (and
perhaps best not doing immediately - the other issues are more
pressing).

------------------------------------------

Problem Eight - Invalid tags

The online validator complains that EMBOSS too is using EC_number
(uppercase tags are reserved).

------------------------------------------

So my conclusion is that while the EMBOSS generated GFF3 is
better than those produced by the NCBI, it still is invalid and needs
some work.

As usual, I am of course happy to help with testing fixes. And if
there are any mistakes in my understanding of the GFF3 spec,
please tell me ;)

Regards,

Peter C.


From pmr at ebi.ac.uk  Wed Aug 17 15:38:23 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 17 Aug 2011 16:38:23 +0100
Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO)
In-Reply-To: <CAKVJ-_7XSaQ2D+rQ+5d9q-KdvnHdj2s8qN6r+RjEYHwy=YBz1w@mail.gmail.com>
References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk>
	<CAKVJ-_7XSaQ2D+rQ+5d9q-KdvnHdj2s8qN6r+RjEYHwy=YBz1w@mail.gmail.com>
Message-ID: <4E4BE06F.9040503@ebi.ac.uk>

On 16/08/2011 16:36, Peter Cock wrote:
> Interestingly EMBOSS includes the sequence at the bottom
> (using the FASTA directive) and has generated unique ID tags
> for each feature. It has also added more note tags.

The sequence is included if you are writing sequence data. GFF3 allows 
sequence to be included, so we add it. Using a separate feature file is 
always awkward for users, but is supported.

> Unfortunately this also failed the GFF3 validation. The EMBOSS
> output does a lot better (e.g. "cleaved_initiator_methionine" is
> valid while "Initiator methionine" in the UniProt file was not)
>
> However, some of the terms in column 3 are apparently out of
> date - but http://www.sequenceontology.org does list them as
> synonyms:

Thanks. I'll update the table, but synonyms should be acceptable.

> Finally protein_modification_categorized_by_chemical_process
> does not seem to be valid (I failed to find it in the ontology).

Not in SO, but in a separate ontology (MOD). Should also be valid in GFF 
I believe, but perhaps the parser insists on using SO and excluding 
related ontologies.

> Additionally the validator complained about some of the note
> in Line 15, probably due to the %3B escaped semi-colon,
> but that may be a bug in the validator.

Interesting. Let me know if we are not escaping the right characters, 
but I believe we are supposed to escape ';' in those positions.
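
For comparison, a minimal Python sketch of the escaping as I read the GFF3
specification for column 9: the characters ';', '=', '&' and ',' plus '%' and
control characters (including tab and newline) are percent-encoded. This is
not the EMBOSS implementation, and the function name is made up:

```python
def escape_attribute_value(value):
    """Percent-encode characters with a reserved meaning in GFF3 column 9."""
    escaped = []
    for ch in value:
        if ch in ";=&,%" or ord(ch) < 0x20 or ord(ch) == 0x7F:
            escaped.append("%%%02X" % ord(ch))
        else:
            escaped.append(ch)
    return "".join(escaped)


print(escape_attribute_value("in THC4; increases"))      # in THC4%3B increases
print(escape_attribute_value("in 10% of the molecules"))  # in 10%25 of the molecules
```

Both results match what seqret wrote in the natural_variant lines, so the
escaping itself looks consistent with this reading.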

regards,

Peter Rice
EMBOSS Team


From pmr at ebi.ac.uk  Wed Aug 17 15:39:39 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 17 Aug 2011 16:39:39 +0100
Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO)
In-Reply-To: <CAKVJ-_4_HRr-LLj2yBQ2=X2QtLktKpJ9P0o60cyqQ-pnUY01Tg@mail.gmail.com>
References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk>
	<CAKVJ-_7XSaQ2D+rQ+5d9q-KdvnHdj2s8qN6r+RjEYHwy=YBz1w@mail.gmail.com>
	<CAKVJ-_4_HRr-LLj2yBQ2=X2QtLktKpJ9P0o60cyqQ-pnUY01Tg@mail.gmail.com>
Message-ID: <4E4BE0BB.70302@ebi.ac.uk>

On 16/08/2011 19:39, Peter Cock wrote:
> I also noticed the seqret GFF3 output is using "+" as the strand,
> which is wrong for a protein reference like this. It should be using
> "." (period) as the features on a protein are strand-less (as done
> in the UniProt GFF3 file).

Thanks.

We'll fix it for the next release, but my understanding is it should be 
acceptable to most parsers.

regards,

Peter Rice
EMBOSS Team


From p.j.a.cock at googlemail.com  Wed Aug 17 15:48:32 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 17 Aug 2011 16:48:32 +0100
Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO)
In-Reply-To: <4E4BE06F.9040503@ebi.ac.uk>
References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk>
	<CAKVJ-_7XSaQ2D+rQ+5d9q-KdvnHdj2s8qN6r+RjEYHwy=YBz1w@mail.gmail.com>
	<4E4BE06F.9040503@ebi.ac.uk>
Message-ID: <CAKVJ-_7Ao_mTa=8KsDYFFW-ct+RaYmVJWPJqruYRdGwOC1p0Xw@mail.gmail.com>

On Wed, Aug 17, 2011 at 4:38 PM, Peter Rice <pmr at ebi.ac.uk> wrote:
> On 16/08/2011 16:36, Peter Cock wrote:
>>
>> Interestingly EMBOSS includes the sequence at the bottom
>> (using the FASTA directive) and has generated unique ID tags
>> for each feature. It has also added more note tags.
>
> The sequence is included if you are writing sequence data. GFF3 allows
> sequence to be included, so we add it. Using a separate feature file is
> always awkward for users, but is supported.

See also the discussion today on gmod-gbrowse / song-devel where
it sounds like GFF3 should have a single block of FASTA embedded
sequence at the end of the file, rather than interleaved. As I suggest
on that thread, the practical solution for EMBOSS seqret might be to
omit the FASTA sequence altogether. Or cache them in memory/on
disk to write out at the very end of all the features?

http://generic-model-organism-system-database.450254.n5.nabble.com/Mailing-list-for-GFF3-specification-discussion-td4707740.html
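
The caching idea, sketched in Python rather than EMBOSS C. The record layout
(seqid, list of feature lines, sequence string) and the function name are
assumptions made for this example only; sequences are held in memory and
written once after all features:

```python
import io


def write_gff3(records, handle):
    """Write features for all records first, then one ##FASTA block."""
    handle.write("##gff-version 3\n")
    cached = []
    for seqid, feature_lines, sequence in records:
        for line in feature_lines:
            handle.write(line + "\n")
        cached.append((seqid, sequence))  # keep the sequence for the end
    handle.write("##FASTA\n")
    for seqid, sequence in cached:
        handle.write(">%s\n" % seqid)
        for i in range(0, len(sequence), 60):
            handle.write(sequence[i:i + 60] + "\n")


records = [
    ("seq1", ["seq1\tEMBOSS\tregion\t1\t4\t.\t.\t.\tID=seq1.1"], "MGDV"),
    ("seq2", ["seq2\tEMBOSS\tregion\t1\t3\t.\t.\t.\tID=seq2.1"], "EKG"),
]
buffer = io.StringIO()
write_gff3(records, buffer)
text = buffer.getvalue()
print(text)
```

For very large inputs the cache could be a temporary file instead of a list.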

>> Unfortunately this also failed the GFF3 validation. The EMBOSS
>> output does a lot better (e.g. "cleaved_initiator_methionine" is
>> valid while "Initiator methionine" in the UniProt file was not)
>>
>> However, some of the terms in column 3 are apparently out of
>> date - but http://www.sequenceontology.org does list them as
>> synonyms:
>
> Thanks. I'll update the table, but synonyms should be acceptable.

I can see plus points for either view; certainly the validator could
downgrade that error to a warning.

>> Finally protein_modification_categorized_by_chemical_process
>> does not seem to be valid (I failed to find it in the ontology).
>
> Not in SO, but in a separate ontology (MOD). Should also be valid
> in GFF I believe, but perhaps the parser insists on using SO and
> excluding related ontologies.

OK, but in that case shouldn't you then be declaring this with a
##feature-ontology directive?

>> Additionally the validator complained about some of the note
>> in Line 15, probably due to the %3B escaped semi-colon,
>> but that may be a bug in the validator.
>
> Interesting. Let me know if we are not escaping the right characters, but I
> believe we are supposed to escape ';' in those positions.

I haven't checked this aspect carefully (since this is fiddly).
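
For reference, here is a minimal Python sketch of column 9 escaping as I
read the GFF3 spec: tab, newline, carriage return, other control
characters, '%', and the reserved ';', '=', '&' and ',' become %XX.
The helper name escape_gff3_value is made up for this example; it is
not EMBOSS or validator code.

```python
def escape_gff3_value(text):
    """Percent-encode characters that are reserved in GFF3 column 9."""
    reserved = set(";=&,%")
    out = []
    for ch in text:
        # Control characters (including tab and newline) are escaped too.
        if ch in reserved or ord(ch) < 0x20 or ord(ch) == 0x7F:
            out.append("%%%02X" % ord(ch))
        else:
            out.append(ch)
    return "".join(out)


print(escape_gff3_value("Uncharacterized ACR; IPR001472:Bipartite signal"))
```

Spaces are left alone here, which matches the "%3B " pattern seen in the
EMBOSS note values.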

Peter


From p.j.a.cock at googlemail.com  Wed Aug 17 15:50:57 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 17 Aug 2011 16:50:57 +0100
Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO)
In-Reply-To: <4E4BE0BB.70302@ebi.ac.uk>
References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk>
	<CAKVJ-_7XSaQ2D+rQ+5d9q-KdvnHdj2s8qN6r+RjEYHwy=YBz1w@mail.gmail.com>
	<CAKVJ-_4_HRr-LLj2yBQ2=X2QtLktKpJ9P0o60cyqQ-pnUY01Tg@mail.gmail.com>
	<4E4BE0BB.70302@ebi.ac.uk>
Message-ID: <CAKVJ-_58k83YPoG9MWeNpG=6tB4YO52ATAKJjJXD-AthygOgCg@mail.gmail.com>

On Wed, Aug 17, 2011 at 4:39 PM, Peter Rice <pmr at ebi.ac.uk> wrote:
> On 16/08/2011 19:39, Peter Cock wrote:
>>
>> I also noticed the seqret GFF3 output is using "+" as the strand,
>> which is wrong for a protein reference like this. It should be using
>> "." (period) as the features on a protein are strand-less (as done
>> in the UniProt GFF3 file).
>
> Thanks.
>
> We'll fix it for the next release, but my understanding is it should be
> acceptable to most parsers.
>

I agree this is pretty harmless - in practice all that really matters
is if the strand is "-" or not. Still, it should be straightforward to fix.

Peter


From pmr at ebi.ac.uk  Wed Aug 17 15:52:21 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 17 Aug 2011 16:52:21 +0100
Subject: [emboss-dev] Problems with EMBOSS seqret GenBank to GFF3
In-Reply-To: <CAKVJ-_6hc6j+oy3bZ+ydWL1KJ7m4Xh2xnz-Cjx1vhNUf_dzN8A@mail.gmail.com>
References: <CAKVJ-_6hc6j+oy3bZ+ydWL1KJ7m4Xh2xnz-Cjx1vhNUf_dzN8A@mail.gmail.com>
Message-ID: <4E4BE3B5.4080601@ebi.ac.uk>

On 17/08/2011 11:37, Peter Cock wrote:
> Hi again Peter R. (et al.),
>
> Following yesterday's discussion about GFF3 files from UniProt,
> I'm trying seqret to produce GFF3 from GenBank files. I'd already
> found the NCBI currently provides some very broken GFF3 files:
>
> ------------------------------------------
>
> Problem Two - Circular features not marked
>
> EMBOSS is also lacking in this area.
>
> EMBOSS has used feature type databank_entry and generated feature ID
> NC_005213.1 for the landmark. However, this should include the special
> tag entry Is_circular=true, since this is the landmark feature for the whole
> circular chromosome.

Thanks. I'll make sure we add it for the next release.

> ------------------------------------------
>
> Problem Three - Missing ID tags on multi-location features
>
> Unlike the NCBI file which fails to cross link multi-location features like
> trans-spliced NEQ_t01, EMBOSS looks better. However, I don't think
> you are following the expected pattern as used in the canonical GFF3
> examples.
>
> In the GenBank file, this tRNA is join(35233..35301,3254..3289)
>
> For the gene and tRNA features for NEQ_t01, EMBOSS is generating
> three GFF3 lines. First a very broad parent feature 3254 to 35301,
> then two children 35233 to 35301 and 3254 to 3289.
>
> I would expect two GFF3 lines (for each of gene and tRNA), just
> 35233 to 35301 and 3254 to 3289 which would be linked by virtue
> of having the same ID.

EMBOSS is reporting what is stored internally (feature and subfeatures 
for the exons). Looks like we should skip reporting the feature. I'll 
check what that means for the IDs.
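
To illustrate the pattern being asked for (a Python sketch only, not
EMBOSS code): one GFF3 line per sub-location, with no broad parent line,
linked by a shared ID. The function name, the "EMBL" source column, the
"+" strand and the ID value "NEQ_t01" are assumptions for the example.

```python
def join_to_gff3(seqid, source, ftype, parts, strand, feature_id):
    """Return one GFF3 line per sub-location, all sharing the same ID."""
    lines = []
    for start, end in parts:
        fields = [seqid, source, ftype, str(start), str(end),
                  ".", strand, ".", "ID=%s" % feature_id]
        lines.append("\t".join(fields))
    return lines


# GenBank location join(35233..35301,3254..3289) for the tRNA NEQ_t01
parts = [(35233, 35301), (3254, 3289)]
for line in join_to_gff3("NC_005213", "EMBL", "tRNA", parts, "+", "NEQ_t01"):
    print(line)
```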


> This is related to "Problem Six" and "Problem Seven" below.
>
> ------------------------------------------
>
> Problem Four - Wrong tag for database cross references
>
> I had noticed the NCBI using a local tag (lower case) db_xref rather
> than the standard (upper case = reserved) tag Dbxref. EMBOSS
> does the same - is this deliberate and if so why?

It is deliberate - we are using the db_xref tag from the EMBL/GenBank 
feature table.

But we could convert to the GFF3 tag (and back again on reading). I'll 
have a look at how easy that would be.
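
A rough Python sketch of what that round trip could look like (not the
EMBOSS implementation; all names are invented here). It assumes values
contain no characters that need escaping, and it relies on GFF3 allowing
several values for one tag separated by commas.

```python
GENBANK_TO_GFF3 = {"db_xref": "Dbxref"}
GFF3_TO_GENBANK = {gff: gb for gb, gff in GENBANK_TO_GFF3.items()}


def write_attributes(qualifiers):
    """Build column 9 from (tag, value) pairs in feature-table order."""
    merged = {}
    for tag, value in qualifiers:
        merged.setdefault(GENBANK_TO_GFF3.get(tag, tag), []).append(value)
    return ";".join("%s=%s" % (tag, ",".join(values))
                    for tag, values in merged.items())


def read_attributes(column9):
    """Split column 9 back into (tag, value) pairs with GenBank tag names."""
    pairs = []
    for item in column9.split(";"):
        tag, _, values = item.partition("=")
        for value in values.split(","):
            pairs.append((GFF3_TO_GENBANK.get(tag, tag), value))
    return pairs


quals = [("locus_tag", "NEQ001"),
         ("db_xref", "GI:41614797"),
         ("db_xref", "GeneID:2732620")]
print(write_attributes(quals))
```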

> ------------------------------------------
>
> Problem Six - Features wrapping the origin of a circular genome
>
> Related to the landmark feature lacking the Is_circular=true tag, the
> gene and CDS features for origin wrapping NEQ003 look funny to me.
> EMBOSS seems to be generating three GFF3 lines for the gene and CDS
> for NEQ003, a surprisingly broad entry 1 to 490885 and two children
> 490883 to 490885 and 1 to 879 (which do look sensible).
>
> This is essentially the same point I raised above with NEQ_t01, but
> with the added complication of spanning the origin.

Ah, something to do with the way start and end positions are stored 
internally. I'll fix that along with other circular feature issues.

> Thankfully this potential confusion has been addressed in the updated
> specification, so I would expect a single GFF3 line for each of the gene
> and CDS for NEQ003, using start 490883 and end of 879+490885=491764.

I'll try to write (and read) that way too.
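
The arithmetic for writing, as a small Python sketch (the function name
is hypothetical, and it assumes the sequence length is known when the
feature is written):

```python
def wrap_origin(first_part, second_part, seq_length):
    """Merge the two parts of an origin-spanning feature into (start, end).

    first_part must run up to the last base of the sequence and
    second_part must start at base 1.
    """
    start, end_of_seq = first_part
    restart, end = second_part
    if end_of_seq != seq_length or restart != 1:
        raise ValueError("parts do not meet at the origin")
    # GFF3 represents the wrap by letting end exceed the sequence length.
    return start, end + seq_length


# NEQ003: 490883..490885 joined to 1..879 on a 490885 bp circular genome
print(wrap_origin((490883, 490885), (1, 879), 490885))
```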

> ------------------------------------------
>
> Problem Seven - No parent/child relationships
>
> The NCBI GFF3 file had no parent/child relationships at all.
>
> The EMBOSS 6.4.0 GFF3 file does use parent/child relationships
> but not in the way I expected (and not in a way the validator likes).
> As discussed above, for the GenBank join locations EMBOSS
> seems to create broad parent features with children for each
> sub-location (parent/child relations of the same type = bad).
>
> What I'm expecting instead is parent child relationships between
> the CDS and gene features, between tRNA and gene features, etc.
> Note that these relationships are implicit in the GenBank (and EMBL)
> flat files, so I accept trying to deduce them might be hard (and
> perhaps best not doing immediately - the other issues are more
> pressing).

Could be possible by matching common exons (stored internally as 
subfeatures). I'll have a look.

> ------------------------------------------
>
> Problem Eight - Invalid tags
>
> The online validator complains that EMBOSS too is using EC_number
> (uppercase tags are reserved)

Pah! We use the EMBL/Genbank tag names. Looks like we will have to 
convert to lower case so may as well include that with the 
db_xref/Dbxref conversion in GFF3 writing and reading

> ------------------------------------------
>
> So my conclusion is that while the EMBOSS generated GFF3 is
> better than those produced by the NCBI, it still is invalid and needs
> some work.
>
> As usual, I am of course happy to help with testing fixes. And if
> there are any mistakes in my understanding of the GFF3 spec,
> please tell me ;)

Many, many thanks for finding these.

EMBOSS feature internals had a major rewrite in 6.4.0 to store exons as 
subfeatures, which makes all this much easier to handle.

regards,

Peter Rice
EMBOSS Team


From pmr at ebi.ac.uk  Wed Aug 17 15:55:53 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 17 Aug 2011 16:55:53 +0100
Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO)
In-Reply-To: <CAKVJ-_7Ao_mTa=8KsDYFFW-ct+RaYmVJWPJqruYRdGwOC1p0Xw@mail.gmail.com>
References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk>
	<CAKVJ-_7XSaQ2D+rQ+5d9q-KdvnHdj2s8qN6r+RjEYHwy=YBz1w@mail.gmail.com>
	<4E4BE06F.9040503@ebi.ac.uk>
	<CAKVJ-_7Ao_mTa=8KsDYFFW-ct+RaYmVJWPJqruYRdGwOC1p0Xw@mail.gmail.com>
Message-ID: <4E4BE489.3040703@ebi.ac.uk>

On 17/08/2011 16:48, Peter Cock wrote:
> On Wed, Aug 17, 2011 at 4:38 PM, Peter Rice<pmr at ebi.ac.uk>  wrote:
>> On 16/08/2011 16:36, Peter Cock wrote:
>>>
>>> Interestingly EMBOSS includes the sequence at the bottom
>>> (using the FASTA directive) and has generated unique ID tags
>>> for each feature. It has also added more note tags.
>>
>> The sequence is included if you are writing sequence data. GFF3 allows
>> sequence to be included, so we add it. Using a separate feature file is
>> always awkward for users, but is supported.
>
> See also the discussion today on gmod-gbrowse / song-devel where
> it sounds like GFF3 should have a single block of FASTA embedded
> sequence at the end of the file, rather than interleaved. As I suggest
> on that thread, the practical solution for EMBOSS seqret might be to
> omit the FASTA sequence altogether. Or cache them in memory/on
> disk to write out at the very end of all the features?

Thanks. We already save sequences and write at the end for some formats 
so I'll add it for GFF3. We will need more work for reading GFF3 input 
though, but it may not be too bad.

If we are reading it as feature input, we don't look for the sequence.

If we are reading as sequence input, we need to read all the sequences 
into memory and then go back to read the features. For streamed input we 
can buffer to make the rewind work.
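
As a sketch of the idea in Python (not how EMBOSS does it; the function
name is made up): on a stream that cannot be rewound, the feature lines
can be buffered in memory while reading forward to the ##FASTA section,
which has the same effect as going back for them afterwards.

```python
import io


def split_gff3(handle):
    """Return (feature_lines, sequences) from a GFF3 stream read once."""
    feature_lines = []  # buffered so the stream never has to be rewound
    chunks = {}
    current = None
    in_fasta = False
    for line in handle:
        line = line.rstrip("\n")
        if not in_fasta:
            if line == "##FASTA":
                in_fasta = True
            elif line and not line.startswith("#"):
                feature_lines.append(line)
        elif line.startswith(">"):
            # Sequence name is the first word of the FASTA header.
            current = (line[1:].split() or [""])[0]
            chunks[current] = []
        elif current is not None:
            chunks[current].append(line.strip())
    sequences = {name: "".join(parts) for name, parts in chunks.items()}
    return feature_lines, sequences


example = ("##gff-version 3\n"
           "chr1\tEMBL\tgene\t1\t4\t.\t+\t.\tID=g1\n"
           "##FASTA\n>chr1 test\nAC\nGT\n")
features, seqs = split_gff3(io.StringIO(example))
print(len(features), seqs)
```

This only recognises the explicit ##FASTA directive; files that start the
sequence block with a bare ">" line would need an extra check.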

regards,

Peter Rice
EMBOSS Team


From p.j.a.cock at googlemail.com  Wed Aug 17 16:05:13 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 17 Aug 2011 17:05:13 +0100
Subject: [emboss-dev] Problems with EMBOSS seqret GenBank to GFF3
In-Reply-To: <4E4BE3B5.4080601@ebi.ac.uk>
References: <CAKVJ-_6hc6j+oy3bZ+ydWL1KJ7m4Xh2xnz-Cjx1vhNUf_dzN8A@mail.gmail.com>
	<4E4BE3B5.4080601@ebi.ac.uk>
Message-ID: <CAKVJ-_4A=UwuXUexeMs=gg_eZ0n8LeCh96C57TNY9HZWSpQ2Ow@mail.gmail.com>

On Wed, Aug 17, 2011 at 4:52 PM, Peter Rice <pmr at ebi.ac.uk> wrote:
> On 17/08/2011 11:37, Peter Cock wrote:
>> ------------------------------------------
>>
>> Problem Four - Wrong tag for database cross references
>>
>> I had noticed the NCBI using a local tag (lower case) db_xref rather
>> than the standard (upper case = reserved) tag Dbxref. EMBOSS
>> does the same - is this deliberate and if so why?
>
> It is deliberate - we are using the db_xref tag from the EMBL/GenBank
> feature table.
>
> But we could convert to the GFF3 tag (and back again on reading). I'll
> have a look at how easy that would be.

Do you want to check this one with Lincoln on the song-devel mailing list
first - after all, using a lower case tag is quite allowable and valid GFF3.
My point is it does seem to be exactly what the reserved tag Dbxref is
intended for.

>> ------------------------------------------
>>
>> Problem Seven - No parent/child relationships
>>
>> The NCBI GFF3 file had no parent/child relationships at all.
>>
>> The EMBOSS 6.4.0 GFF3 file does use parent/child relationships
>> but not in the way I expected (and not in a way the validator likes).
>> As discussed above, for the GenBank join locations EMBOSS
>> seems to create broad parent features with children for each
>> sub-location (parent/child relations of the same type = bad).
>>
>> What I'm expecting instead is parent child relationships between
>> the CDS and gene features, between tRNA and gene features, etc.
>> Note that these relationships are implicit in the GenBank (and EMBL)
>> flat files, so I accept trying to deduce them might be hard (and
>> perhaps best not doing immediately - the other issues are more
>> pressing).
>
> Could be possible by matching common exons (stored internally as
> subfeatures). I'll have a look.

Usually yes, but not all the time. I've seen GenBank files where
the gene and CDS features have slightly different locations which
makes doing this automatically hard. Off the top of my head this
was a programmed frame shift example... I'll see if I can find you
a specific example.

>> ------------------------------------------
>>
>> So my conclusion is that while the EMBOSS generated GFF3 is
>> better than those produced by the NCBI, it still is invalid and needs
>> some work.
>>
>> As usual, I am of course happy to help with testing fixes. And if
>> there are any mistakes in my understanding of the GFF3 spec,
>> please tell me ;)
>
> Many, many thanks for finding these.

I've come to value NC_005213.gbk as a reasonably small circular
genome with some rather complicated annotation - it's one of my
favourite test cases.

> EMBOSS feature internals had a major rewrite in 6.4.0 to store exons as
> subfeatures, which makes all this much easier to handle.

Oh good - that restructuring should now pay dividends :)

Peter C.


From p.j.a.cock at googlemail.com  Wed Aug 17 16:07:54 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 17 Aug 2011 17:07:54 +0100
Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO)
In-Reply-To: <4E4BE489.3040703@ebi.ac.uk>
References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk>
	<CAKVJ-_7XSaQ2D+rQ+5d9q-KdvnHdj2s8qN6r+RjEYHwy=YBz1w@mail.gmail.com>
	<4E4BE06F.9040503@ebi.ac.uk>
	<CAKVJ-_7Ao_mTa=8KsDYFFW-ct+RaYmVJWPJqruYRdGwOC1p0Xw@mail.gmail.com>
	<4E4BE489.3040703@ebi.ac.uk>
Message-ID: <CAKVJ-_4C54oTnRF4szi=Au0OTTn+GNQpSU9nSaYv9465BC1v7A@mail.gmail.com>

On Wed, Aug 17, 2011 at 4:55 PM, Peter Rice <pmr at ebi.ac.uk> wrote:
> On 17/08/2011 16:48, Peter Cock wrote:
>> See also the discussion today on gmod-gbrowse / song-devel where
>> it sounds like GFF3 should have a single block of FASTA embedded
>> sequence at the end of the file, rather than interleaved. As I suggest
>> on that thread, the practical solution for EMBOSS seqret might be to
>> omit the FASTA sequence altogether. Or cache them in memory/on
>> disk to write out at the very end of all the features?
>
> Thanks. We already save sequences and write at the end for some
> formats so I'll add it for GFF3. We will need more work for reading
> GFF3 input though, but it may not be too bad.
>
> If we are reading it as feature input, we don't look for the sequence.
>
> If we are reading as sequence input, we need to read all the sequences
> into memory and then go back to read the features. For streamed input
> we can buffer to make the rewind work.

I'm curious what other file formats needed this kind of work. But it
is good that you've already got some buffer/cache infrastructure
in place. Does it boil down to writing temp files in /tmp ?

Peter C.


From pmr at ebi.ac.uk  Wed Aug 17 16:14:15 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 17 Aug 2011 17:14:15 +0100
Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO)
In-Reply-To: <CAKVJ-_4C54oTnRF4szi=Au0OTTn+GNQpSU9nSaYv9465BC1v7A@mail.gmail.com>
References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk>
	<CAKVJ-_7XSaQ2D+rQ+5d9q-KdvnHdj2s8qN6r+RjEYHwy=YBz1w@mail.gmail.com>
	<4E4BE06F.9040503@ebi.ac.uk>
	<CAKVJ-_7Ao_mTa=8KsDYFFW-ct+RaYmVJWPJqruYRdGwOC1p0Xw@mail.gmail.com>
	<4E4BE489.3040703@ebi.ac.uk>
	<CAKVJ-_4C54oTnRF4szi=Au0OTTn+GNQpSU9nSaYv9465BC1v7A@mail.gmail.com>
Message-ID: <4E4BE8D7.4010203@ebi.ac.uk>

On 17/08/2011 17:07, Peter Cock wrote:
> On Wed, Aug 17, 2011 at 4:55 PM, Peter Rice<pmr at ebi.ac.uk>  wrote:
>> If we are reading as sequence input, we need to read all the sequences
>> into memory and then go back to read the features. For streamed input
>> we can buffer to make the rewind work.
>
> I'm curious what other file formats needed this kind of work. But it
> is good that you've already got some buffer/cache infrastructure
> in place. Does it boil down to writing temp files in /tmp ?

MSF (checksum at the top), Phylip (number of sequences at the top).

In ajseqwrite.c these are the ones with the Save attribute set true.

We keep them in memory and write them when the output file is closed.
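
A toy Python illustration of the same save-then-write pattern (this is
not the ajseqwrite.c code; the class is invented and the output is only
a simplified Phylip-like layout). Nothing reaches the file until close,
because the count on the first line is not known before then.

```python
import io


class BufferedPhylipWriter:
    """Hold sequences in memory and write the count header on close."""

    def __init__(self, handle):
        self.handle = handle
        self.records = []

    def write(self, name, seq):
        # Nothing is written yet; the header needs the final count.
        self.records.append((name, seq))

    def close(self):
        length = len(self.records[0][1]) if self.records else 0
        self.handle.write(" %d %d\n" % (len(self.records), length))
        for name, seq in self.records:
            self.handle.write("%-10s%s\n" % (name[:10], seq))


out = io.StringIO()
writer = BufferedPhylipWriter(out)
writer.write("seq_a", "ACGT")
writer.write("seq_b", "ACGA")
writer.close()
print(out.getvalue())
```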

regards,

Peter Rice
EMBOSS Team


From p.j.a.cock at googlemail.com  Wed Aug 17 16:33:29 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 17 Aug 2011 17:33:29 +0100
Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO)
In-Reply-To: <4E4BE8D7.4010203@ebi.ac.uk>
References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk>
	<CAKVJ-_7XSaQ2D+rQ+5d9q-KdvnHdj2s8qN6r+RjEYHwy=YBz1w@mail.gmail.com>
	<4E4BE06F.9040503@ebi.ac.uk>
	<CAKVJ-_7Ao_mTa=8KsDYFFW-ct+RaYmVJWPJqruYRdGwOC1p0Xw@mail.gmail.com>
	<4E4BE489.3040703@ebi.ac.uk>
	<CAKVJ-_4C54oTnRF4szi=Au0OTTn+GNQpSU9nSaYv9465BC1v7A@mail.gmail.com>
	<4E4BE8D7.4010203@ebi.ac.uk>
Message-ID: <CAKVJ-_4iABoj8y2s1f9kJ=pujKm41=PqECXg49MDRBY3T5wFEA@mail.gmail.com>

On Wed, Aug 17, 2011 at 5:14 PM, Peter Rice <pmr at ebi.ac.uk> wrote:
> On 17/08/2011 17:07, Peter Cock wrote:
>> I'm curious what other file formats needed this kind of work. But it
>> is good that you've already got some buffer/cache infrastructure
>> in place. Does it boil down to writing temp files in /tmp ?
>
> MSF (checksum at the top), Phylip (number of sequences at the top).
>
> In ajseqwrite.c these are the ones with the Save attribute set true.
>
> We keep them in memory and write them when the output file is closed.

I wasn't thinking of alignments, but that makes perfect sense.

Thanks,

Peter C.


From p.j.a.cock at googlemail.com  Wed Aug 17 16:54:10 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 17 Aug 2011 17:54:10 +0100
Subject: [emboss-dev] Moving EMBOSS from OBF hosted CVS to git on github
Message-ID: <CAKVJ-_78kPXyt3bCj+WX++q+WHDTQOz3Q8bJXY+kMOZUmYJQcA@mail.gmail.com>

Dear EMBOSS team,

Have you made any decisions regarding the proposal to
move the EMBOSS repository from CVS hosted by the
OBF to git hosted on github (where most of the other OBF
backed projects are now)?

I see this made it to the minutes of the 27 June 2011 meeting:
http://emboss.sourceforge.net/meetings/2011-06-27.html

As I recall from talking to Peter Rice at BOSC/ISMB 2011
in Vienna last month, EMBOSS currently uses a single branch
in CVS (like Biopython used to), so migrating the repository
to git shouldn't be too complicated.

I recommend in the short term maintaining a git mirror of the
CVS repository on github.com, which can be kept current
via a cron job running on the OBF server. You can then
treat this git repository as a read only mirror and continue
to make all commits via CVS.

During this interim period, external contributors can make
their own branches etc (without touching the official EMBOSS
repository) and send you patches. The internal developers can
also try this out as a way to get familiar with git gradually.

This is what we did with Biopython, and it worked very well.
I am happy to assist with this if you want. I think I made this
offer in person in Vienna, but I'm repeating it publicly now.

You might also be able to adopt the existing mirror
maintained by Pjotr Prins (CC'd), although that does
include a branch with BioLib work in it:
https://github.com/pjotrp/EMBOSS/

Regards,

Peter C.

P.S. You'll need to have a different project name on github
since emboss was used by Martin Bosslet back in Nov 2010.
How about emboss-prj or even open-bio for this?

P.P.S. This page seems to be missing:
http://emboss.sourceforge.net/meetings/2011-07-04.html

It is linked to from at least these two pages:
http://emboss.sourceforge.net/meetings/
http://emboss.sourceforge.net/meetings/2011-07-11.html


From pmr at ebi.ac.uk  Thu Aug 18 12:28:28 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 18 Aug 2011 13:28:28 +0100
Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO)
In-Reply-To: <CAKVJ-_7XSaQ2D+rQ+5d9q-KdvnHdj2s8qN6r+RjEYHwy=YBz1w@mail.gmail.com>
References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk>
	<CAKVJ-_7XSaQ2D+rQ+5d9q-KdvnHdj2s8qN6r+RjEYHwy=YBz1w@mail.gmail.com>
Message-ID: <4E4D056C.5050508@ebi.ac.uk>

On 08/16/2011 04:36 PM, Peter Cock wrote:
> I will report this to UniProt later. However, first I thought
> I would try converting one of the other files provided into
> GFF3 using EMBOSS seqret for an alternative, e.g. the
> plain text "swiss" format: http://www.uniprot.org/uniprot/P99999.txt
> 
> I can convert this using seqret as follows:
> 
> ========================================
> $ seqret -feature -osformat=gff3 -sformat=swiss -sequence P99999.txt

> However, some of the terms in column 3 are apparently out of
> date - but http://www.sequenceontology.org does list them as
> synonyms:
> 
> It looks like the EMBOSS sequence ontology table may need
> updating for at least these three cases.
> 
> Finally protein_modification_categorized_by_chemical_process
> does not seem to be valid (I failed to find it in the ontology).

That was a name from the MOD ontology. GFF3 output now uses an SO term,
but SO is lacking detail for MOD_RES, having only:

id: SO:0001089
name: post_translationally_modified_region

and

id: SO:0001700
name: histone_modification

... and then more descendants of histone modification. Still showing its
DNA_only roots.

EMBOSS internally uses MOD terms for MOD_RES features. The details are
in the note tag in GFF3 output.

> Additionally the validator complained about some of the note
> in Line 15, probably due to the %3B escaped semi-colon,
> but that may be a bug in the validator.

Worked for me. Perhaps it was confused by the term name errors (or
perhaps the validator has been fixed)

However, one nasty bug ... EMBOSS was so careful to only read real GFF3
format that the EMBOSS comment "#!Type Protein" was ignored and features
were read into EMBOSS as nucleotide.

I suspect there is no way in GFF3 to identify a protein file. In the
next patch we can parse the EMBOSS comment again but that will not help
with non-EMBOSS protein GFF3 files.
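
Roughly what parsing that comment amounts to, as a Python sketch (the
function name and the "nucleotide" default are assumptions; only the
"#!Type Protein" line itself is what EMBOSS writes):

```python
def guess_gff3_type(lines, default="nucleotide"):
    """Look for an EMBOSS '#!Type ...' comment in the GFF3 header."""
    for line in lines:
        if not line.startswith("#"):
            break  # header comments are over once features start
        if line.startswith("#!Type"):
            fields = line.split()
            if len(fields) > 1:
                return fields[1].lower()
    return default


header = ["##gff-version 3", "#!Type Protein"]
print(guess_gff3_type(header))
```

For files without the comment this falls back to the default, which is
exactly the non-EMBOSS case that remains unsolved.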

Is there some official distinction between protein and nucleotide GFF3
files?

regards,

Peter Rice
EMBOSS Team


From pmr at ebi.ac.uk  Wed Aug 24 10:36:34 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 24 Aug 2011 11:36:34 +0100
Subject: [emboss-dev] Problems with EMBOSS seqret GenBank to GFF3
In-Reply-To: <CAKVJ-_6hc6j+oy3bZ+ydWL1KJ7m4Xh2xnz-Cjx1vhNUf_dzN8A@mail.gmail.com>
References: <CAKVJ-_6hc6j+oy3bZ+ydWL1KJ7m4Xh2xnz-Cjx1vhNUf_dzN8A@mail.gmail.com>
Message-ID: <4E54D432.8030309@ebi.ac.uk>

On 08/17/2011 11:37 AM, Peter Cock wrote:
> Hi again Peter R. (et al.),
>
> Following yesterday's discussion about GFF3 files from UniProt,
> I'm trying seqret to produce GFF3 from GenBank files.
>
> ------------------------------------------
>
> Problem Two - Circular features not marked
>
> EMBOSS is also lacking in this area.

Current status: circular tags will be passed better in the next EMBOSS 
release. Sequence inputs will have a new -scircular qualifier and 
feature inputs will have -fcircular to cover cases where the input 
format does not define a circular sequence (but if it does, these will 
not turn it off).

We will tag a feature with Is_circular in the output, even if we have to 
make one up.

> ------------------------------------------
>
> Problem Six - Features wrapping the origin of a circular genome
>
> Related to the landmark feature lacking the Is_circular=true tag, the
> gene and CDS features for origin wrapping NEQ003 look funny to me.
> EMBOSS seems to be generating three GFF3 lines for the gene and CDS
> for NEQ003, a surprisingly broad entry 1 to 490885 and two children
> 490883 to 490885 and 1 to 879 (which do look sensible).
>
> Based on the old specification, I had expected two GFF3 lines each for the
> gene and CDS, giving the regions 490883 to 490885 and 1 to 879, linked
> by virtue of having the same ID.
>
> Thankfully this potential confusion has been addressed in the updated
> specification, so I would expect a single GFF3 line for each of the gene
> and CDS for NEQ003, using start 490883 and end of 879+490885=491764.

Unfortunately GFF3 is sadly lacking in details on how to define the 
sequence length. It appears there is no standard for defining the 
length, yet it is critical to interpreting a circular feature that goes 
across the origin as GFF3 makes the end position greater than the length.

We will make a best guess but cannot guarantee we get the right answer.
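
To show why the length matters, a Python sketch of the reading side
(function names are invented). It assumes the length is available from
somewhere, for example the end coordinate of a "##sequence-region"
directive when the file has one; that is a guess, not a guarantee.

```python
def length_from_sequence_region(line):
    """Return the end coordinate of '##sequence-region seqid start end'."""
    fields = line.split()
    if len(fields) != 4 or fields[0] != "##sequence-region":
        return None
    return int(fields[3])


def unwrap_origin(start, end, seq_length):
    """Split a feature whose end exceeds the length of a circular sequence."""
    if end <= seq_length:
        return [(start, end)]
    return [(start, seq_length), (1, end - seq_length)]


length = length_from_sequence_region("##sequence-region NC_005213 1 490885")
print(unwrap_origin(490883, 491764, length))
```

With the wrong length the second part comes out shifted, so a bad guess
silently moves the feature.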

> ------------------------------------------
>
> Problem Seven - No parent/child relationships
>
> The EMBOSS 6.4.0 GFF3 file does use parent/child relationships
> but not in the way I expected (and not in a way the validator likes).
> As discussed above, for the GenBank join locations EMBOSS
> seems to create broad parent features with children for each
> sub-location (parent/child relations of the same type = bad).
>
> What I'm expecting instead is parent child relationships between
> the CDS and gene features, between tRNA and gene features, etc.
> Note that these relationships are implicit in the GenBank (and EMBL)
> flat files, so I accept trying to deduce them might be hard (and
> perhaps best not doing immediately - the other issues are more
> pressing).

The obvious fix is to lie about the feature types of the exons so the 
validator is happy. We could call them exons, but "region" would be safer.

But there is a silly complication with CDS features: we could keep the 
CDS parent record and have it as a parent of a group of "regions" for 
the processed exons. But GFF3 wants the exons to be type "CDS" so what 
do we call the parent?

So in the cobbled together example below, ignoring the circular aspects, 
we would want to keep the CDS on the parent (ID=NC_005213.11) record 
where all the annotation tags are, but I suspect GFF3 wants that to be 
something else. We could of course specifically lie about CDS features 
for EMBOSS generated GFF3 files (we tag the header) so we can restore 
the correct internal structure on input.

NC_005213	EMBL	CDS	490883	491764	.	-	0	ID=NC_005213.11;locus_tag=NEQ001;note=conserved hypothetical [Methanococcus jannaschii]%3B COG1583:Uncharacterized ACR%3B IPR001472:Bipartite nuclear localization signal%3B IPR002743: Protein of unknown function DUF57;codon_start=1;transl_table=11;product=hypothetical protein;protein_id=NP_963295.1;db_xref=GI:41614797;db_xref=GeneID:2732620;translation=MRLLLELKALNSIDKKQLSNYLIQGFIYNILKNTEYSWLHNWKKEKYFNFTLIPKKDIIENKRYYLIISSPDKRFIEVLHNKIKDLDIITIGLAQFQLRKTKKFDPKLRFPWVTITPIVLREGKIVILKGDKYYKVFVKRLEELKKYNLIKKKEPILEEPIEISLNQIKDGWKIIDVKDRYYDFRNKSFSAFSNWLRDLKEQSLRKYNNFCGKNFYFEEAIFEGFTFYKTVSIRIRINRGEAVYIGTLWKELNVYRKLDKEEREFYKFLYDCGLGSLNSMGFGFVNTKKNSAR
NC_005213	EMBL	CDS	490883	490885	.	-	0	ID=NC_005213.12;Parent=NC_005213.11
NC_005213	EMBL	CDS	1	879	.	-	0	ID=NC_005213.13;Parent=NC_005213.11

> ------------------------------------------
>
> Problem Eight - Invalid tags
>
> The online validator complains that EMBOSS too is using EC_number
> (uppercase tags are reserved)

Fixed and we can patch the release. Making all tags lower case is 
trivial - they are automatically converted on input to the internal 
mixed case.

> ------------------------------------------
>
> So my conclusion is that while the EMBOSS generated GFF3 is
> better than those produced by the NCBI, it still is invalid and needs
> some work.
>
> As usual, I am of course happy to help with testing fixes. And if
> there are any mistakes in my understanding of the GFF3 spec,
> please tell me ;)

Hope this helps. Progress is being made.

However, as GFF3 is such a pain, I am wondering whether to switch the 
default feature format to something else - back to GFF2 or maybe to use GTF.

Does anyone have a preference?

regards,

Peter Rice
EMBOSS Team


From pmr at ebi.ac.uk  Wed Aug 24 14:45:33 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 24 Aug 2011 15:45:33 +0100
Subject: [emboss-dev] Problems with EMBOSS seqret GenBank to GFF3
In-Reply-To: <4E54D432.8030309@ebi.ac.uk>
References: <CAKVJ-_6hc6j+oy3bZ+ydWL1KJ7m4Xh2xnz-Cjx1vhNUf_dzN8A@mail.gmail.com>
	<4E54D432.8030309@ebi.ac.uk>
Message-ID: <4E550E8D.8010506@ebi.ac.uk>

On 08/24/2011 11:36 AM, Peter Rice wrote:
> On 08/17/2011 11:37 AM, Peter Cock wrote:
>
>> ------------------------------------------
>>
>> Problem Seven - No parent/child relationships
>>
>> The EMBOSS 6.4.0 GFF3 file does use parent/child relationships
>> but not in the way I expected (and not in a way the validator likes).

As a first attempt, using the EMBL entry v00508 in the EMBOSS test set, 
I can make the CDS "parent" feature change its type to 
"biological_region" and add a featflags tag with the true type. Code 
(not yet checked in) can reconstruct the EMBL feature table from this GFF.

However, the EMBL tags are all on the parent (now biological_region) 
feature.

Any suggestions where I should stick them for them to be useful in GFF3?

EMBL feature table:

FT   source          1..3919
FT                   /organism="Homo sapiens"
FT                   /mol_type="genomic DNA"
FT                   /db_xref="taxon:9606"
FT   CDS             join(2079..2171,2294..2515,3371..3499)
FT                   /db_xref="GDB:119299"
FT                   /db_xref="GOA:P02100"
FT                   /db_xref="HGNC:4830"
FT                   /db_xref="InterPro:IPR000971"
FT                   /db_xref="InterPro:IPR002337"
FT                   /db_xref="InterPro:IPR009050"
FT                   /db_xref="InterPro:IPR012292"
FT                   /db_xref="PDB:1A9W"
FT                   /db_xref="UniProtKB/Swiss-Prot:P02100"
FT                   /protein_id="CAA23766.1"
FT                   /translation="MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDS
FT                   FGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENF
FT                   KLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH"

proposed GFF3 version

V00508	EMBL	databank_entry	1	3919	.	+	.	ID=V00508.1;organism=Homo sapiens;mol_type=genomic DNA;db_xref=taxon:9606
V00508	EMBL	biological_region	2079	3499	.	+	0	ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_xref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH
V00508	EMBL	CDS	2079	2171	.	+	0	Parent=V00508.2
V00508	EMBL	CDS	2294	2515	.	+	0	Parent=V00508.2
V00508	EMBL	CDS	3371	3499	.	+	0	Parent=V00508.2
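
For anyone wanting to reproduce the child lines, a short Python sketch
(not the EMBOSS code; the function name is invented). It assumes the
parts are given in translation order on the forward strand with
codon_start=1, and it works out the phase of each segment from the
number of coding bases before it.

```python
def cds_children(seqid, source, parts, strand, parent_id):
    """Return child CDS lines for the exons of a join() location."""
    lines = []
    consumed = 0  # coding bases seen in earlier segments
    for start, end in parts:
        # Phase = bases to skip to reach the first complete codon.
        phase = (3 - consumed % 3) % 3
        fields = [seqid, source, "CDS", str(start), str(end),
                  ".", strand, str(phase), "Parent=%s" % parent_id]
        lines.append("\t".join(fields))
        consumed += end - start + 1
    return lines


# join(2079..2171,2294..2515,3371..3499) from V00508
exons = [(2079, 2171), (2294, 2515), (3371, 3499)]
for line in cds_children("V00508", "EMBL", exons, "+", "V00508.2"):
    print(line)
```

Here the first two exons are 93 and 222 bases long, both multiples of
three, which is why every child line has phase 0.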


regards,

Peter Rice
EMBOSS Team


From p.j.a.cock at googlemail.com  Thu Aug 25 00:44:47 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 25 Aug 2011 01:44:47 +0100
Subject: [emboss-dev] Problems with EMBOSS seqret GenBank to GFF3
In-Reply-To: <4E550E8D.8010506@ebi.ac.uk>
References: <CAKVJ-_6hc6j+oy3bZ+ydWL1KJ7m4Xh2xnz-Cjx1vhNUf_dzN8A@mail.gmail.com>
	<4E54D432.8030309@ebi.ac.uk> <4E550E8D.8010506@ebi.ac.uk>
Message-ID: <CAKVJ-_7nwF=DEFcg11JqyeRcSHp8YxwX509wU6S20MQeaiQtUQ@mail.gmail.com>

On Wed, Aug 24, 2011 at 11:36 AM, Peter Rice <pmr at ebi.ac.uk> wrote:
>
> However, as GFF3 is such a pain, I am wondering whether to switch the
> default feature format to something else - back to GFF2 or maybe to use GTF.
>

Sadly I have to agree with you - the current version of the GFF3
spec leaves far too much open to multiple interpretation, as we
have been discussing on the song-devel mailing lists. I'm not
sure that GFF2 or GTF are any better though.

On Wed, Aug 24, 2011 at 3:45 PM, Peter Rice <pmr at ebi.ac.uk> wrote:
> On 08/24/2011 11:36 AM, Peter Rice wrote:
>>
>> On 08/17/2011 11:37 AM, Peter Cock wrote:
>>
>>> ------------------------------------------
>>>
>>> Problem Seven - No parent/child relationships
>>>
>>> The EMBOSS 6.4.0 GFF3 file does use parent/child relationships
>>> but not in the way I expected (and not in a way the validator likes).
>
> As a first attempt, using the EMBL entry v00508 in the EMBOSS test set, I
> can make the CDS "parent" feature change its type to "biological_region" and
> add a featflags tag with the true type. Code (not yet checked in) can
> reconstruct the EMBL feature table from this GFF.
>
> However, the EMBL tags are all on the parent (now biological_region)
> feature.
>
> Any suggestions where I should stick them for them to be useful in GFF3?
>
> EMBL feature table:
>
> FT   source          1..3919
> FT                   /organism="Homo sapiens"
> FT                   /mol_type="genomic DNA"
> FT                   /db_xref="taxon:9606"
> FT   CDS             join(2079..2171,2294..2515,3371..3499)
> FT                   /db_xref="GDB:119299"
> FT                   /db_xref="GOA:P02100"
> FT                   /db_xref="HGNC:4830"
> FT                   /db_xref="InterPro:IPR000971"
> FT                   /db_xref="InterPro:IPR002337"
> FT                   /db_xref="InterPro:IPR009050"
> FT                   /db_xref="InterPro:IPR012292"
> FT                   /db_xref="PDB:1A9W"
> FT                   /db_xref="UniProtKB/Swiss-Prot:P02100"
> FT                   /protein_id="CAA23766.1"
> FT                   /translation="MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDS
> FT                   FGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENF
> FT                   KLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH"
>
> proposed GFF3 version
>
> V00508	EMBL	databank_entry	1	3919	.	+	.	ID=V00508.1;organism=Homo sapiens;mol_type=genomic DNA;db_xref=taxon:9606
> V00508	EMBL	biological_region	2079	3499	.	+	0	ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_xref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH
> V00508	EMBL	CDS	2079	2171	.	+	0	Parent=V00508.2
> V00508	EMBL	CDS	2294	2515	.	+	0	Parent=V00508.2
> V00508	EMBL	CDS	3371	3499	.	+	0	Parent=V00508.2
>

I was expecting something like this (done by hand) where we follow the
example on http://www.sequenceontology.org/gff3.shtml and have a
single GFF gene feature represented by three lines linked by virtue of
having the same ID:


V00508	EMBL	databank_entry	1	3919	.	+	.	ID=V00508.1;organism=Homo sapiens;mol_type=genomic DNA;db_xref=taxon:9606
V00508	EMBL	CDS	2079	2171	.	+	0	ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_xref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH
V00508	EMBL	CDS	2294	2515	.	+	0	ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_xref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH
V00508	EMBL	CDS	3371	3499	.	+	0	ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_xref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH

On the downside, I have repeated all the annotation three times - but
that is what was done in the GFF3 example in the spec.

Perhaps this should be raised on the song-devel mailing list along
with our other GFF3 queries.

Regards,

Peter C.


From pmr at ebi.ac.uk  Thu Aug 25 13:52:30 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 25 Aug 2011 14:52:30 +0100
Subject: [emboss-dev] Problems with EMBOSS seqret GenBank to GFF3
In-Reply-To: <CAKVJ-_7nwF=DEFcg11JqyeRcSHp8YxwX509wU6S20MQeaiQtUQ@mail.gmail.com>
References: <CAKVJ-_6hc6j+oy3bZ+ydWL1KJ7m4Xh2xnz-Cjx1vhNUf_dzN8A@mail.gmail.com>
	<4E54D432.8030309@ebi.ac.uk> <4E550E8D.8010506@ebi.ac.uk>
	<CAKVJ-_7nwF=DEFcg11JqyeRcSHp8YxwX509wU6S20MQeaiQtUQ@mail.gmail.com>
Message-ID: <4E56539E.6030400@ebi.ac.uk>

On 25/08/2011 01:44, Peter Cock wrote:
> On Wed, Aug 24, 2011 at 11:36 AM, Peter Rice<pmr at ebi.ac.uk>  wrote:
>>
>> However, as GFF3 is such a pain, I am wondering whether to switch the
>> default feature format to something else - back to GFF2 or maybe to use GTF.
>>
>
> Sadly I have to agree with you - the current version of the GFF3
> spec leaves far too much open to multiple interpretation, as we
> have been discussing on the song-devel mailing lists. I'm not
> sure that GFF2 or GTF are any better though.

GTF is no good for EMBOSS ... way too picky about start and stop codons

If pushed we could read it in using a version of the GTF parser but I 
see no point trying to write it using data from any source


> I was expecting something like this (done by hand) where we follow the
> example on http://www.sequenceontology.org/gff3.shtml and have a
> single GFF gene feature represented by three lines linked by virtue of
> having the same ID:
>
>
> V00508  EMBL    databank_entry  1       3919    .       +       .
> ID=V00508.1;organism=Homo sapiens;mol_type=genomic
> DNA;db_xref=taxon:9606
> V00508  EMBL    CDS     2079    2171    .       +       0
> ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_xref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH
> V00508  EMBL    CDS     2294    2515    .       +       0
> ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_xref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH
> V00508  EMBL    CDS     3371    3499    .       +       0
> ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_xref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH
>
> On the downside, I have repeated all the annotation three times - but
> that is what was done in the GFF3 example in the spec.

Urgh. How about a gene with 80 exons? That's what I was trying to avoid.

How would you plan to read it back in? Transferring all features to the 
parent perhaps, with checks every time for an existing exact copy?

I am less impressed with GFF3 each time I look.

I think we'll go with the annotation of the "biological_region" parent 
and wait for anyone with a use case that actually requires massively 
replicated annotation.

regards,

Peter Rice
EMBOSS Team


From p.j.a.cock at googlemail.com  Fri Aug 26 02:27:31 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 26 Aug 2011 03:27:31 +0100
Subject: [emboss-dev] Problems with EMBOSS seqret GenBank to GFF3
In-Reply-To: <4E56539E.6030400@ebi.ac.uk>
References: <CAKVJ-_6hc6j+oy3bZ+ydWL1KJ7m4Xh2xnz-Cjx1vhNUf_dzN8A@mail.gmail.com>
	<4E54D432.8030309@ebi.ac.uk> <4E550E8D.8010506@ebi.ac.uk>
	<CAKVJ-_7nwF=DEFcg11JqyeRcSHp8YxwX509wU6S20MQeaiQtUQ@mail.gmail.com>
	<4E56539E.6030400@ebi.ac.uk>
Message-ID: <CAKVJ-_7L2oTm3f41hPbfiM7NGAfNqtAXRQF+pMFMZ=eqd2qVmw@mail.gmail.com>

On Thu, Aug 25, 2011 at 2:52 PM, Peter Rice <pmr at ebi.ac.uk> wrote:
> On 25/08/2011 01:44, Peter Cock wrote:
>>
>> On Wed, Aug 24, 2011 at 11:36 AM, Peter Rice <pmr at ebi.ac.uk> wrote:
>>>
>>> However, as GFF3 is such a pain, I am wondering whether to switch the
>>> default feature format to something else - back to GFF2 or maybe to use
>>> GTF.
>>>
>>
>> Sadly I have to agree with you - the current version of the GFF3
>> spec leaves far too much open to multiple interpretation, as we
>> have been discussing on the song-devel mailing lists. I'm not
>> sure that GFF2 or GTF are any better though.
>
> GTF is no good for EMBOSS ... way too picky about start and stop codons
>
> If pushed we could read it in using a version of the GTF parser but I see no
> point trying to write it using data from any source
>
>
>> I was expecting something like this (done by hand) where we follow the
>> example on http://www.sequenceontology.org/gff3.shtml and have a
>> single GFF gene feature represented by three lines linked by virtue of
>> having the same ID:
>>
>> ...
>>
>> On the downside, I have repeated all the annotation three times - but
>> that is what was done in the GFF3 example in the spec.
>
> Urgh. How about a gene with 80 exons? That's what I was trying to avoid.
>
> How would you plan to read it back in? Transferring all features to the
> parent perhaps, with checks every time for an existing exact copy?
>

It would make sense to propose that the first line carries all the
annotation, and that the subsequent lines of the same feature need only
the ID plus, if it is adopted, the part tag recently discussed on the
song-devel list to make the order of the sub-parts explicit.
http://sourceforge.net/mailarchive/message.php?msg_id=27960475
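
To make the read-back question concrete, here is a minimal Python sketch
(not EMBOSS or Biopython code; the function names are made up for
illustration) of merging GFF3 rows that share an ID into one
multi-location feature. It assumes tab-separated rows and does no
URL-unescaping of attribute values. Exact duplicate tag/value pairs are
kept once, so it copes both with fully replicated annotation and with
later rows that carry only the ID.

```python
from collections import OrderedDict


def parse_attributes(column9):
    """Split a GFF3 attribute column into a list of (tag, value) pairs."""
    pairs = []
    for field in column9.strip().split(";"):
        if field:
            tag, _, value = field.partition("=")
            pairs.append((tag, value))
    return pairs


def merge_by_id(lines):
    """Group GFF3 rows sharing an ID into one feature with many locations."""
    features = OrderedDict()
    for number, line in enumerate(lines):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        attrs = parse_attributes(cols[8])
        # Rows without an ID cannot be linked, so give each its own key.
        key = dict(attrs).get("ID", "_row%d" % number)
        feature = features.setdefault(
            key, {"type": cols[2], "locations": [], "attributes": []}
        )
        feature["locations"].append((int(cols[3]), int(cols[4])))
        for pair in attrs:
            if pair not in feature["attributes"]:  # drop exact copies
                feature["attributes"].append(pair)
    return features
```

With 80 exons the duplicate check is repeated for every row, which is
the cost Peter R. was worried about; putting the annotation on the first
row only would make that check trivial.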

>
> I am less impressed with GFF3 each time I look.
>

Me too.

>
> I think we'll go with the annotation of the "biological_region" parent and
> wait for anyone with a use case that actually requires massively replicated
> annotation.
>

Have you looked at the BioPerl GenBank to GFF3 conversion?
I understand GBrowse recommends this as a way to get
GenBank format data into GBrowse. I'm also pretty sure that
this is being used inside TogoWS for GenBank/EMBL to GFF3:

http://togows.dbcls.jp/entry/embl/V00508  <-- original EMBL
http://togows.dbcls.jp/entry/embl/V00508.gff  <-- as GFF3

Interestingly their GFF3 output is pretty close to your proposed
EMBOSS output, only they've got a "region" rather than
"biological_region" for the parent meta-feature.

However, I think introducing extra biological_region features to
act as the parent of multi-location features would run counter to
the canonical gene model given in the GFF3 specification (which
appears to be just a suggestion rather than a requirement).

Also, introducing this meta-feature would complicate any
future wish to try to express explicit parent/child relationships
between operon, gene, mRNA and CDS features. Of course, as
we've discussed, these biological relationships are only implicit
in the GenBank/EMBL feature table.

This is probably a good example to discuss on the GFF3
song-devel mailing list - small and apparently very simple
except for how to represent the (forward strand) join location.

Peter C.


From pmr at ebi.ac.uk  Tue Aug 30 15:48:25 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 30 Aug 2011 16:48:25 +0100
Subject: [emboss-dev] Problems with EMBOSS seqret GenBank to GFF3
In-Reply-To: <CAKVJ-_7L2oTm3f41hPbfiM7NGAfNqtAXRQF+pMFMZ=eqd2qVmw@mail.gmail.com>
References: <CAKVJ-_6hc6j+oy3bZ+ydWL1KJ7m4Xh2xnz-Cjx1vhNUf_dzN8A@mail.gmail.com>
	<4E54D432.8030309@ebi.ac.uk> <4E550E8D.8010506@ebi.ac.uk>
	<CAKVJ-_7nwF=DEFcg11JqyeRcSHp8YxwX509wU6S20MQeaiQtUQ@mail.gmail.com>
	<4E56539E.6030400@ebi.ac.uk>
	<CAKVJ-_7L2oTm3f41hPbfiM7NGAfNqtAXRQF+pMFMZ=eqd2qVmw@mail.gmail.com>
Message-ID: <4E5D0649.3010905@ebi.ac.uk>

On 08/26/2011 03:27 AM, Peter Cock wrote:
> On Thu, Aug 25, 2011 at 2:52 PM, Peter Rice<pmr at ebi.ac.uk>  wrote:
>> On 25/08/2011 01:44, Peter Cock wrote:

> It would make sense to propose that the first line has all the annotation,
> and the subsequent lines from the same feature just need the ID,
> and if it is adopted the part tag recently discussed on the song-devel
> list to make the order of the sub-parts explicit.
> http://sourceforge.net/mailarchive/message.php?msg_id=27960475

The part tag is interesting and would map to the internal "exon" 
attribute in EMBOSS which we reserve for sorting.

>> I think we'll go with the annotation of the "biological_region" parent and
>> wait for anyone with a use case that actually requires massively replicated
>> annotation.
>>
>
> Have you looked at the BioPerl GenBank to GFF3 conversion?
> I understand GBrowse recommends this as a way to get
> GenBank format data into GBrowse. I'm also pretty sure that
> this is being used inside TogoWS for GenBank/EMBL to GFF3:
>
> http://togows.dbcls.jp/entry/embl/V00508  <-- original EMBL
> http://togows.dbcls.jp/entry/embl/V00508.gff  <-- as GFF3

Hmmm .... the GFF3 has Parent references to the protein_id, but it 
doesn't appear as an ID.
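
A check along the following lines would flag that kind of problem. This
is only a sketch in Python, not EMBOSS code, and the function name is
made up; it assumes tab-separated rows and that a Parent attribute may
list several comma-separated IDs, as the GFF3 specification allows.

```python
def dangling_parents(lines):
    """Return Parent values that never appear as an ID in the same file."""
    ids = set()
    parents = set()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # everything after this directive is sequence data
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete feature row
        for field in cols[8].split(";"):
            tag, _, value = field.partition("=")
            if tag == "ID":
                ids.add(value)
            elif tag == "Parent":
                parents.update(value.split(","))
    return sorted(parents - ids)
```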

I do not like using a second region to put the description line in. 
Using the organism as the ID for the source line also looks odd.

> Interestingly their GFF3 output is pretty close to your proposed
> EMBOSS output, only they've got a "region" rather than
> "biological_region" for the parent meta-feature.

I don't see a parent meta-feature there.

> However, I think introducing extra biological_region features to
> act as the parent of multi-location features would run counter to
> the canonical gene model given in the GFF3 specification (which
> appears to be just a suggestion rather than a requirement).
>
> Also, introducing this meta-feature would complicate any
> future wish to try to express explicit parent/child relationships
> between operon, gene, mRNA and CDS features. Of course, as
> we've discussed, these biological relationships are only implicit
> in the GenBank/EMBL feature table.

I tried the canonical gene example:

##gff-version 3
##sequence-region ctg123 1 9000
ctg123	.	gene	1000	9000	.	+	.	ID=gene00001;Name=EDEN
ctg123	.	TF_binding_site	1000	1012	.	+	.	ID=tfbs00001;Parent=gene00001
ctg123	.	mRNA	1050	9000	.	+	.	ID=mRNA00001;Parent=gene00001;Name=EDEN.1
ctg123	.	five_prime_UTR	1050	1200	.	+	.	Parent=mRNA00001
ctg123	.	CDS	1201	1500	.	+	0	ID=cds00001;Parent=mRNA00001
ctg123	.	CDS	3000	3902	.	+	0	ID=cds00001;Parent=mRNA00001
ctg123	.	CDS	5000	5500	.	+	0	ID=cds00001;Parent=mRNA00001
ctg123	.	CDS	7000	7600	.	+	0	ID=cds00001;Parent=mRNA00001
ctg123	.	three_prime_UTR	7601	9000	.	+	.	Parent=mRNA00001
ctg123	.	cDNA_match	1050	1500	5.8e-42	+	.	ID=match00001;Target=cdna0123+12+462
ctg123	.	cDNA_match	5000	5500	8.1e-43	+	.	ID=match00001;Target=cdna0123+463+963
ctg123	.	cDNA_match	7000	9000	1.4e-40	+	.	ID=match00001;Target=cdna0123+964+2964
##FASTA
>ctg123
cttctgggcgtacccgattctcggagaacttgccgcaccattccgccttg
tgttcattgctgcctgcatgttcattgtctacctcggctacgtgtggcta
tctttcctcggtgccctcgtgcacggagtcgagaaaccaaagaacaaaaa
aagaaattaaaatatttattttgctgtggtttttgatgtgtgttttttat
aatgatttttgatgtgaccaattgtacttttcctttaaatgaaatgtaat
cttaaatgtatttccgacgaattcgaggcctgaaaagtgtgacgccattc
gtatttgatttgggtttactatcgaataatgagaattttcaggcttaggc
ttaggcttaggcttaggcttaggcttaggcttaggcttaggcttaggctt
aggcttaggcttaggcttaggcttaggcttaggcttaggcttaggcttag
aatctagctagctatccgaaattcgaggcctgaaaagtgtgacgccattc
>cdna0123
ttcaagtgctcagtcaatgtgattcacagtatgtcaccaaatattttggc
agctttctcaagggatcaaaattatggatcattatggaatacctcggtgg
aggctcagcgctcgatttaactaaaagtggaaagctggacgaaagtcata
tcgctgtgattcttcgcgaaattttgaaaggtctcgagtatctgcatagt
gaaagaaaaatccacagagatattaaaggagccaacgttttgttggaccg
tcaaacagcggctgtaaaaatttgtgattatggttaaagg

I can now (code not yet checked in) reproduce this, subject to the 
sequence being too short.

Internally, EMBOSS generates parent features for CDS and cDNA_match 
(where several features share an ID), and the parent structure is preserved.

On output, the generated features are not reported, so the GFF3 output 
is identical to the input.

If we read EMBL/GenBank entries then we will generate a parent feature 
with type "biological_region" to attach the annotation from the join. 
Reproducing the "parent" relationships is a separate exercise that could 
be a separate application. In terms of reading one format and writing 
another I prefer to not generate any GFF3-specific extras.
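
As a rough illustration of that mapping (a Python sketch only, not the
EMBOSS implementation; the function name and arguments are invented),
the join can be turned into one annotated biological_region row plus one
child row per segment. It assumes a forward-strand join with segments in
order and translation starting at the first base, and it does not escape
attribute values.

```python
import re


def join_to_gff3(seqid, source, ftype, location, feature_id, tags):
    """Build GFF3 rows for an EMBL join(): annotated parent plus children."""
    spans = [(int(s), int(e)) for s, e in re.findall(r"(\d+)\.\.(\d+)", location)]
    start = min(s for s, _ in spans)
    end = max(e for _, e in spans)
    # The parent keeps the true type in featflags and carries all the tags.
    attrs = ["ID=%s" % feature_id, "featflags=type:%s" % ftype]
    attrs += ["%s=%s" % (tag, value) for tag, value in tags]
    # Phase is "." here because the parent is not itself a CDS row.
    rows = ["\t".join([seqid, source, "biological_region", str(start),
                       str(end), ".", "+", ".", ";".join(attrs)])]
    done = 0  # coding bases consumed by earlier segments
    for s, e in spans:
        phase = (3 - done % 3) % 3  # bases to skip to reach the next codon
        rows.append("\t".join([seqid, source, ftype, str(s), str(e),
                               ".", "+", str(phase),
                               "Parent=%s" % feature_id]))
        done += e - s + 1
    return rows
```

For V00508 every segment length (93, 222 and 129) is a multiple of three,
so each CDS row gets phase 0, matching the output shown earlier in the
thread.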

> This is probably a good example to discuss on the GFF3
> song-devel mailing list - small and apparently very simple
> except for how to represent the (forward strand) join location.

We could propose something for the 
http://www.sequenceontology.org/wiki/index.php/GFF3_best_practices page 
to describe how to represent EMBL/GenBank entries in GFF3 (after due 
discussion on the SONG-devel list)

regards,

Peter Rice
EMBOSS Team