From p.j.a.cock at googlemail.com Tue Aug 2 14:01:54 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 2 Aug 2011 19:01:54 +0100 Subject: [emboss-dev] EMBOSS 6.4.0 using EMBOSS_001 as the ID in ABI files Message-ID: Hi EMBOSS folk, I'm reporting a regression in EMBOSS 6.4.0 spotted by Wibowo Arindrarto who has been adding ABI support to Biopython. With EMBOSS 6.3.1 compiled from source on Mac (as an example), $ seqret -osformat="fastq-sanger" -filter 310.ab1 @D11F TGATNTTNACNNTTTTGAANCANTGAGTTAATAGCAATNCTTTACNAATAAGAATATACACTTTCTGCTTAGGGATGATAATTGGCAGGCAAGTGAATCCCTGAGCGTGNATTTGATAATGACCTAAATAATGGATGGGGTTTTAATTCCCAGACCTTCCCCTTTTTAANNGGNGGATTANTGGGGGNNNAACNNGGGGGGCCCTTNCCNAAGGGGGAAAAAATTTNAAACCCCCCNAGGNNGGGNAAAAAAAAATTTCCAAATTNCCGGGGTNNCCCCCAANTTTTTNCCGCNGGGAAAANNNNCCCCCCCNGGGNCCCCCCCCNNAAAAAAAAAAAAAAAAACCCCCCCCCCNTTGGGGNGGTNTNCNCCCCCNNANAANNGGGGGNNAAAAAAAAAGGCCCCCCCCAAAAAAAACCCNCNTTCTNNCNNNNNGNNCNGNNCCCCCNNCCNTNTNGGGGGGGGGGGNGGAAAAAAAACCCCTTTNTGNNNANANNAACCCNCTCNTNTTTTTTTTTTTANGNNNNCNNNNCAAAAAAAAANCNCCCCCNNCNNNCNNNCNCCCCNNNNTNAAAANANNAANNNNTTTTTTTNGGGGGGGTGNGCGNCCCNNANCNNNNNNNNGCGNGGNCNCCNNCCCNCNANAAANNNTNTTTTTTTTTTTTTTTNTNNTCNNCCCNNNCCCCNNCCCCCCCCCCCCCNCCNCNNNNNGGGGNNNCGGNNCNNNNNNNCCNTNCTNNANATNCCNTTNNNNNNNNGNNNNNNNNACNNNNNTNNTNNNCNNNNNNNNNNNNNNCNNNNNNCNNCCCNNCANNNNNNNCNNNNNNNNNNNNNNNNNNNNNTCNCTNCNCNCCCCNCCCNNNNNNNG + 
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! With EMBOSS 6.4.0 compiled from source on 64 bit Linux, rather than the expected ID from within the file we get EMBOSS_001, $ seqret -osformat="fastq-sanger" -filter 310.ab1 @EMBOSS_001 TGATNTTNACNNTTTTGAANCANTGAGTTAATAGCAATNCTTTACNAATAAGAATATACACTTTCTGCTTAGGGATGATAATTGGCAGGCAAGTGAATCCCTGAGCGTGNATTTGATAATGACCTAAATAATGGATGGGGTTTTAATTCCCAGACCTTCCCCTTTTTAANNGGNGGATTANTGGGGGNNNAACNNGGGGGGCCCTTNCCNAAGGGGGAAAAAATTTNAAACCCCCCNAGGNNGGGNAAAAAAAAATTTCCAAATTNCCGGGGTNNCCCCCAANTTTTTNCCGCNGGGAAAANNNNCCCCCCCNGGGNCCCCCCCCNNAAAAAAAAAAAAAAAAACCCCCCCCCCNTTGGGGNGGTNTNCNCCCCCNNANAANNGGGGGNNAAAAAAAAAGGCCCCCCCCAAAAAAAACCCNCNTTCTNNCNNNNNGNNCNGNNCCCCCNNCCNTNTNGGGGGGGGGGGNGGAAAAAAAACCCCTTTNTGNNNANANNAACCCNCTCNTNTTTTTTTTTTTANGNNNNCNNNNCAAAAAAAAANCNCCCCCNNCNNNCNNNCNCCCCNNNNTNAAAANANNAANNNNTTTTTTTNGGGGGGGTGNGCGNCCCNNANCNNNNNNNNGCGNGGNCNCCNNCCCNCNANAAANNNTNTTTTTTTTTTTTTTTNTNNTCNNCCCNNNCCCCNNCCCCCCCCCCCCCNCCNCNNNNNGGGGNNNCGGNNCNNNNNNNCCNTNCTNNANATNCCNTTNNNNNNNNGNNNNNNNNACNNNNNTNNTNNNCNNNNNNNNNNNNNNCNNNNNNCNNCCCNNCANNNNNNNCNNNNNNNNNNNNNNNNNNNNNTCNCTNCNCNCCCCNCCCNNNNNNNG + 
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Regards, Peter Cock ---------- Forwarded message ---------- From: Wibowo Arindrarto Date: Sat, Jul 30, 2011 at 8:42 AM Subject: Re: [Biopython-dev] SeqIO Abi Parser To: Peter Cock Cc: biopython-dev at lists.open-bio.org Hi Peter, I've done some more improvements to the code: - I've written the check and unittest for the file handle mode. I've set it so that abi file has to be opened in 'rb' mode, otherwise it'll return an error. While it's ok to open in 'r' mode in python 2 in Linux, it has to be specified as 'rb' in Windows and/or Python 3 for the file to be read correctly. So I decided forcing it to 'rb' is the best. Because of this, I changed 'test_SeqIO.py:503' to include the mode argument when opening. - I've also checked against test_Emboss.py for seqret output, after including the abi format in it. My EMBOSS version is 6.4.0. There was a slight problem with this testing, since for some reason the ID returned by seqret is always "EMBOSS_001". 
Something might be wrong with my EMBOSS installation, since when I previously tested it against 6.1.0, the ID was correct (although the qual values not, so I had to upgrade). As expected, if I comment out the code that tests for sequence id ('test_Emboss.py:168-172') the tests pass. Maybe you could try testing it as well and see if EMBOSS also returns the default id instead of the sample name? - Finally, I did some small cosmetic changes to the code (typos, etc). All changes have been pushed to my github fork. Now I still have time for the weekend to improve whatever needs to be improved :). Regards, --- Wibowo Arindrarto (bow) http://bow.web.id On Fri, Jul 29, 2011 at 18:20, Peter Cock wrote: > > Hi again, > > I had a bit of time this afternoon so I looked at this. > > On Fri, Jul 29, 2011 at 1:14 PM, Peter Cock wrote: > > On Fri, Jul 29, 2011 at 12:34 PM, Wibowo Arindrarto wrote: > >> Hi Peter, > >> Thanks for explaining. I understand why we should stick to the stored > >> sequence id. In this case, we can use the filename as SeqRecord.name as > >> well. Regarding BioPerl, I don't have it installed myself -- but I took a > >> quick look at their source and it seems they also use the stored sequence ID > >> as their main identifier instead of the filename. If the stored sequence ID > >> is not present, it's "(unknown)" in their case. > > > > OK good, that means Biopython, BioPerl and EMBOSS should be > > consistent :) > > I've made that switch, > > >> I'll look on the test_SeqIO.py over the weekend. I think it'll have > >> something to do with some ambiguous dna base stored in the abi files. > >> Regards, > > > > Some of the alphabet stuff is a bit nasty - so please feel free to ask > > or get me to help. > > I've done enough to get the test_SeqIO.py unit test to pass. > > We probably need a check (like in SFF) to check the user hasn't given > a handle opened in text mode. That should probably have a unit test > too. 
> > I still haven't cross checked the sequence and PHRED scores from > your code and EMBOSS. > > Anyway - I'll leave the code for you to work on for now... > > Peter From pmr at ebi.ac.uk Tue Aug 2 14:27:07 2011 From: pmr at ebi.ac.uk (Peter Rice) Date: Tue, 02 Aug 2011 19:27:07 +0100 Subject: [emboss-dev] EMBOSS 6.4.0 using EMBOSS_001 as the ID in ABI files In-Reply-To: References: Message-ID: <4E38417B.6000505@ebi.ac.uk> On 02/08/2011 19:01, Peter Cock wrote: > Hi EMBOSS folk, > > I'm reporting a regression in EMBOSS 6.4.0 spotted by Wibowo Arindrarto > who has been adding ABI support to Biopython. > > With EMBOSS 6.4.0 compiled from source on 64 bit Linux, rather > than the expected ID from within the file we get EMBOSS_001, Can you please run with -debug on the command line and send me the seqret.dbg file to see what it thought was in the file regards, Peter Rice EMBOSS team From p.j.a.cock at googlemail.com Wed Aug 3 03:57:01 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 3 Aug 2011 08:57:01 +0100 Subject: [emboss-dev] EMBOSS 6.4.0 using EMBOSS_001 as the ID in ABI files In-Reply-To: <4E38417B.6000505@ebi.ac.uk> References: <4E38417B.6000505@ebi.ac.uk> Message-ID: On Tue, Aug 2, 2011 at 7:27 PM, Peter Rice wrote: > > On 02/08/2011 19:01, Peter Cock wrote: >> >> Hi EMBOSS folk, >> >> I'm reporting a regression in EMBOSS 6.4.0 spotted by Wibowo Arindrarto >> who has been adding ABI support to Biopython. 
>> >> With EMBOSS 6.4.0 compiled from source on 64 bit Linux, rather >> than the expected ID from within the file we get EMBOSS_001, > > Can you please run with -debug on the command line and send me the > seqret.dbg file to see what it thought was in the file No problem - sent directly to Peter R, Peter From ajb at ebi.ac.uk Thu Aug 11 09:22:25 2011 From: ajb at ebi.ac.uk (ajb at ebi.ac.uk) Date: Thu, 11 Aug 2011 14:22:25 +0100 (BST) Subject: [emboss-dev] EMBOSS and mEMBOSS bug-fixes for 6.4.0 released Message-ID: <53905.82.26.12.214.1313068945.squirrel@imap04.ebi.ac.uk> New bug-fix files are available for EMBOSS-6.4.0 and, for Windows users, a new version of mEMBOSS is available. The bugs fixed are appended for easy reference. 1) UNIX As usual, the most convenient way of applying the bug-fixes should be to apply the patch file: ftp://emboss.open-bio.org/pub/EMBOSS/fixes/patches/patch-1-11.gz to a freshly extracted copy of the EMBOSS-6.4.0.tar.gz source code and recompiling/installing. (see ftp://emboss.open-bio.org/pub/EMBOSS/fixes/patches/README.patch for instructions on using 'patch'). Alternatively, you can individually copy the patched files from the ftp://emboss.open-bio.org/pub/EMBOSS/fixes/ directory if your system does not support 'patch'. 2) mEMBOSS The new version incorporates all the bug-fixes listed below. Uninstall your previous mEMBOSS installation and download and install the new setup file from: ftp://emboss.open-bio.org/pub/EMBOSS/windows/mEMBOSS-6.4.0.2-setup.exe Alan ----------------------------------------------------------------------- Fix 1. EMBOSS-6.4.0/emboss/dbiflat.c EMBOSS-6.4.0/emboss/dbxflat.c 10 Aug 2011: The SwissProt description line format includes additional tags which interfere with the EMBL parser used in previous releases. The fix replaces this with a SwissProt parser that strips out the extra tags. After patching the release, any existing SwissProt description index files should be reindexed. Other indexes are unchanged. 
Fix 2. EMBOSS-6.4.0/ajax/core/ajquery.c 10 Aug 2011: For databases with more than one valid format (examples include the EBI dbfetch server) this fix allows the format to be specified with a qualifier on the command line. In the original release, only a format in the query string was used. Fix 3. EMBOSS-6.4.0/ajax/core/ajfeatread.c 10 Aug 2011: When parsing GFF3 format input, long feature tags (for example extremely long translations) exceeded limits in regular expression parsing. This fix decouples testing for escaped quotes from the main task of finding quoted strings. Fix 4. EMBOSS-6.4.0/emboss/data/Etcode.dat 10 Aug 2001: The local data file used by application tcode had a missing parameter line. Fix 5. EMBOSS-6.4.0/ajax/core/ajrange.c 10 Aug 2011: When sequence ranges (and possible highlighting for showalign) were in a list file, the parser overwrote string values. Fix 5. EMBOSS-6.4.0/ajax/core/ajseqabi.c 10 Aug 2011: Sample names in ABI format files were stored in incompletely defined strings. This fix corrects the string object. The sample name is also used as the sequence name. Fix 6. EMBOSS-6.4.0/emboss/dbxresource.c 10 Aug 2011: A future change to the format of Data Resource Catalogue entries in DRCAT.dat requires an update to the parsing of category lines. The current version is not affected. Fix 7. EMBOSS-6.4.0/emboss/server.ensemblgenomes EMBOSS-6.4.0/emboss/cacheensembl.c EMBOSS-6.4.0/ajax/ensembl/ensregistry.c EMBOSS-6.4.0/ajax/ensembl/ensregistry.c EMBOSS-6.4.0/ajax/ensembl/ensdatabaseadaptor.c EMBOSS-6.4.0/ajax/ensembl/ensdatabaseadaptor.h 10 Aug 2011: Microbial genomes use an enumerated species code which must be added to the query for data retrieval. This fix adds the species code to the comment field. In the next release a more complete solution will be implemented. Fix 8. EMBOSS-6.4.0/ajax/core/ajarch.h 10-Aug-2011: Corrects the size of long integers on Windows systems only. Fix 9. 
EMBOSS-6.4.0/emboss/cirdna.c 10-Aug-2011: Cirdna prints text inside solid blocks invisibly. When printed outside the text scaling was too small. The text scale is now adjusted for the radius and sequence length so that labels should be readable outside the box. Fix 10. EMBOSS-6.4.0/ajax/core/ajpat.c 10-Aug-2011: Fuzznuc, fuzzpro and fuzztran using a pattern file ignored the command line -mismatch qualifier for the first pattern. The default mismatch is now set to this value at the start of the pattern matching loop in the library. Fix 11. EMBOSS-6.4.0/ajax/core/ajfmt.c 11-Aug-2011: The function ajFmtScanF() handled va_list incorrectly. Only potentially affected code developers. From ajb at ebi.ac.uk Thu Aug 11 11:58:25 2011 From: ajb at ebi.ac.uk (ajb at ebi.ac.uk) Date: Thu, 11 Aug 2011 16:58:25 +0100 (BST) Subject: [emboss-dev] EMBOSS and mEMBOSS bug-fixes for 6.4.0 released Message-ID: <49005.82.26.12.214.1313078305.squirrel@imap04.ebi.ac.uk> UNIX users who downloaded the bug-fix patch file for EMBOSS earlier this afternoon may have found that there were compilation problems on a limited number of architectures. The patch has been amended slightly to hopefully fix this problem so please download it again if you were affected. If anyone continues to experience compilation problems then please let me know. Alan From p.j.a.cock at googlemail.com Tue Aug 16 11:03:26 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 16 Aug 2011 16:03:26 +0100 Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO) Message-ID: Dear Peter R. (et al.), I recall from one of our chats in person that EMBOSS has some mapping tables to convert the various different data file format's feature names into a common standard (the Sequence Ontology?), for the purpose of inter-converting files. e.g. Converting a UniProt/ SwissProt plain text protein file into a GenPept protein file or GFF3 Is that a fair summary? 
It seems to match the minutes of this meeting (found with Google)
http://emboss.sourceforge.net/meetings/2009-02-16.html

> DASGFF requires a sequence ontology (or BioSapiens
> ontology) tag for protein features. Peter has updated the
> Efeatures definitions for proteins to use GFF3 sequence
> ontology codes as internal identifiers, and to use GFF3
> as the principal definitions for all protein features. All
> SwissProt feature types (36 in the current Swissprot
> release) are also defined with the closest possible match
> to the sequence ontology. Where there is no exact match,
> an EMBOSS internal type is defined using the closest SO
> code and the original feature type as a suffix. For SwissProt
> output this is converted back to the swissprot feature type.
> For GFF3 output the internal type is an alias for the closest
> (more general) SO term.

Can you point me at these mapping tables in the EMBOSS
source code please?

I'm particularly interested in the SwissProt to SO mapping
right now.

Thanks.

Peter C.

From pmr at ebi.ac.uk Tue Aug 16 11:26:51 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 16 Aug 2011 16:26:51 +0100
Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO)
In-Reply-To: <4E4A8BF7.4020106@ebi.ac.uk>
References: <4E4A8BF7.4020106@ebi.ac.uk>
Message-ID: <4E4A8C3B.8030306@ebi.ac.uk>

On 08/16/2011 04:03 PM, Peter Cock wrote:
> Dear Peter R. (et al.),
>
> I recall from one of our chats in person that EMBOSS has some
> mapping tables to convert the various different data file format's
> feature names into a common standard (the Sequence Ontology?),
> for the purpose of inter-converting files. e.g. Converting a UniProt/
> SwissProt plain text protein file into a GenPept protein file or GFF3
>
> Is that a fair summary?

Yes, we needed an internal identifier for feature types, and picked SO
for nucleotides - and then were able to add the protein terms when they
became available.
There are a few made up internal names, with _text after the SO term, that were needed in the early days of the BioSapiens Ontology and some dodgy mapping between SO and EMBL/GenBank for immunoglobulin gene regions, but I believe are no longer used. The first term in the file is defined as the default if nothing is recognized (region or misc_feature) > Can you point me at these mapping tables in the EMBOSS > source code please? emboss/data/Efeatures.embl emboss/data/Efeatures.swiss > I'm particularly interested in the SwissProt to SO mapping > right now. That was originally done by the BioSapiens "Network of excellence" for annotating ENCODE data. They developed the protein features which were then added to the sequence ontology. You can look at SO terms in EMBOSS with: ontoget so:0001094 or ontoget -filter -oformat excel so:0001094 (Hmmm, should do something better for a missing namespace - it was defined as a format for EDAM) Let me know if you spot anything in need of updating. We also have (especially for EMBL) equivalent Etags files listing the available feature qualifiers. regards, Peter Rice EMBOSS Team From p.j.a.cock at googlemail.com Tue Aug 16 11:36:24 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 16 Aug 2011 16:36:24 +0100 Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO) In-Reply-To: <4E4A8C3B.8030306@ebi.ac.uk> References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk> Message-ID: On Tue, Aug 16, 2011 at 4:26 PM, Peter Rice wrote: > Yes, We needed an internal identifier for feature types, and picked SO > for nucleotides - and then were able to add the protein terms when they > became available. > ... Thanks! > > Let me know if you spot anything in need of updating. > I have found three protein features which have been renamed, and one which appears to be wrong... see below. I recently noticed that the UniProt provide GFF3 files, e.g. 
http://www.uniprot.org/uniprot/P99999.gff ======================================== ##gff-version 3 ##sequence-region P99999 1 105 P99999 UniProtKB Initiator methionine 1 1 . . . Note=Removed P99999 UniProtKB Chain 2 105 . . . ID=PRO_0000108218;Note=Cytochrome c P99999 UniProtKB Metal binding 19 19 . . . Note=Iron (heme axial ligand) P99999 UniProtKB Metal binding 81 81 . . . Note=Iron (heme axial ligand) P99999 UniProtKB Binding site 15 15 . . . Note=Heme (covalent) P99999 UniProtKB Binding site 18 18 . . . Note=Heme (covalent) P99999 UniProtKB Modified residue 2 2 . . . Note=N-acetylglycine P99999 UniProtKB Modified residue 49 49 . . . Note=Phosphotyrosine;Status=By similarity P99999 UniProtKB Modified residue 98 98 . . . Note=Phosphotyrosine;Status=By similarity P99999 UniProtKB Natural variant 42 42 . . . ID=VAR_044450;Note=In THC4%3B increases the pro-apoptotic function by triggering caspase activation more efficiently than wild-type%3B does not affect the redox function. P99999 UniProtKB Natural variant 56 56 . . . ID=VAR_048850 P99999 UniProtKB Natural variant 66 66 . . . ID=VAR_002204;Note=In 10%25 of the molecules. P99999 UniProtKB Sequence conflict 18 18 . . . . P99999 UniProtKB Sequence conflict 41 41 . . . . P99999 UniProtKB Helix 4 14 . . . . P99999 UniProtKB Turn 16 18 . . . . P99999 UniProtKB Beta strand 23 25 . . . . P99999 UniProtKB Beta strand 28 30 . . . . P99999 UniProtKB Turn 36 38 . . . . P99999 UniProtKB Helix 51 56 . . . . P99999 UniProtKB Helix 62 70 . . . . P99999 UniProtKB Helix 72 75 . . . . P99999 UniProtKB Helix 89 102 . . . . ======================================== However, they are not using Sequence Ontology terms in column three and so fail the online GFF3 validator http://modencode.oicr.on.ca/cgi-bin/validate_gff3_online listed in http://www.sequenceontology.org/gff3.shtml (GFF3 specification currently at v1.20). 
Additionally that UniProt GFF3 uses an upper case reserved tag, "Status" rather than perhaps "status", in the modified residue features. I will report this to UniProt later. However, first I thought I would try converting one of the other files provided into GFF3 using EMBOSS seqret for an alternative, e.g. the plain text "swiss" format: http://www.uniprot.org/uniprot/P99999.txt I can convert this using seqret as follows: ======================================== $ seqret -feature -osformat=gff3 -sformat=swiss -sequence P99999.txt -stdout -auto ##gff-version 3 ##sequence-region CYC_HUMAN 1 105 #!Date 2011-08-16 #!Type Protein #!Source-version EMBOSS 6.4.0.0 CYC_HUMAN SWISSPROT cleaved_initiator_methionine 1 1 . + . ID=CYC_HUMAN.1;note=Removed CYC_HUMAN SWISSPROT mature_protein_region 2 105 . + . ID=CYC_HUMAN.2;note=Cytochrome c;ftid=PRO_0000108218 CYC_HUMAN SWISSPROT metal_binding 19 19 . + . ID=CYC_HUMAN.3;note=Iron;comment=heme axial ligand CYC_HUMAN SWISSPROT metal_binding 81 81 . + . ID=CYC_HUMAN.4;note=Iron;comment=heme axial ligand CYC_HUMAN SWISSPROT binding_site 15 15 . + . ID=CYC_HUMAN.5;note=Heme;comment=covalent CYC_HUMAN SWISSPROT binding_site 18 18 . + . ID=CYC_HUMAN.6;note=Heme;comment=covalent CYC_HUMAN SWISSPROT protein_modification_categorized_by_chemical_process 2 2 . + . ID=CYC_HUMAN.7;note=N-acetylglycine CYC_HUMAN SWISSPROT protein_modification_categorized_by_chemical_process 49 49 . + . ID=CYC_HUMAN.8;note=Phosphotyrosine;comment=By similarity CYC_HUMAN SWISSPROT protein_modification_categorized_by_chemical_process 98 98 . + . ID=CYC_HUMAN.9;note=Phosphotyrosine;comment=By similarity CYC_HUMAN SWISSPROT natural_variant 42 42 . + . ID=CYC_HUMAN.10;note=G -> S;comment=in THC4%3B increases the pro- apoptotic function by triggering caspase activation more efficiently than wild- type%3B does not affect the redox function;ftid=VAR_044450 CYC_HUMAN SWISSPROT natural_variant 56 56 . + . 
ID=CYC_HUMAN.11;note=K -> R;comment=in dbSNP:rs11548795;ftid=VAR_048850 CYC_HUMAN SWISSPROT natural_variant 66 66 . + . ID=CYC_HUMAN.12;note=M -> L;comment=in 10%25 of the molecules;ftid=VAR_002204 CYC_HUMAN SWISSPROT sequence_conflict 18 18 . + . ID=CYC_HUMAN.13;note=C -> Y;comment=in Ref. 8%3B AAH15130 CYC_HUMAN SWISSPROT sequence_conflict 41 41 . + . ID=CYC_HUMAN.14;note=T -> I;comment=in Ref. 8%3B AAH68464 CYC_HUMAN SWISSPROT alpha_helix 4 14 . + . ID=CYC_HUMAN.15 CYC_HUMAN SWISSPROT turn 16 18 . + . ID=CYC_HUMAN.16 CYC_HUMAN SWISSPROT beta_strand 23 25 . + . ID=CYC_HUMAN.17 CYC_HUMAN SWISSPROT beta_strand 28 30 . + . ID=CYC_HUMAN.18 CYC_HUMAN SWISSPROT turn 36 38 . + . ID=CYC_HUMAN.19 CYC_HUMAN SWISSPROT alpha_helix 51 56 . + . ID=CYC_HUMAN.20 CYC_HUMAN SWISSPROT alpha_helix 62 70 . + . ID=CYC_HUMAN.21 CYC_HUMAN SWISSPROT alpha_helix 72 75 . + . ID=CYC_HUMAN.22 CYC_HUMAN SWISSPROT alpha_helix 89 102 . + . ID=CYC_HUMAN.23 ##FASTA >CYC_HUMAN P99999 Cytochrome c MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIW GEDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE ======================================== Interestingly EMBOSS includes the sequence at the bottom (using the FASTA directive) and has generated unique ID tags for each feature. It has also added more note tags. Unfortunately this also failed the GFF3 validation. The EMBOSS output does a lot better (e.g. "cleaved_initiator_methionine" is valid while "Initiator methionine" in the UniProt file was not) However, some of the terms in column 3 are apparently out of date - but http://www.sequenceontology.org does list them as synonyms: * metal_binding -> polypeptide_metal_contact * natural_variant -> natural_variant_site * turn -> polypeptide_turn_motif It looks like the EMBOSS sequence ontology table may need updating for at least these three cases. Finally protein_modification_categorized_by_chemical_process does not seem to be valid (I failed to find it in the ontology). 
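As a rough illustration of the three renames listed above, a post-processing step along the following lines could rewrite column three of the seqret output until the EMBOSS table is updated. This is only a sketch under assumptions: the helper name update_feature_type is made up for illustration, patching the output is a stand-in for fixing the Efeatures files, and the mapping holds only the three terms named in this message.

```python
# Hypothetical helper: rename outdated feature types in GFF3 column three.
# The mapping covers only the three renames mentioned in this thread.
SO_RENAMES = {
    "metal_binding": "polypeptide_metal_contact",
    "natural_variant": "natural_variant_site",
    "turn": "polypeptide_turn_motif",
}


def update_feature_type(line):
    """Return a GFF3 line (without newline) with column three renamed if needed."""
    line = line.rstrip("\n")
    # Leave directives, comments and FASTA headers untouched.
    if line.startswith("#") or line.startswith(">"):
        return line
    columns = line.split("\t")
    # Feature lines have exactly nine tab-separated columns; anything else
    # (for example raw sequence after ##FASTA) is passed through unchanged.
    if len(columns) != 9:
        return line
    columns[2] = SO_RENAMES.get(columns[2], columns[2])
    return "\t".join(columns)
```

Types that are not in the mapping, such as cleaved_initiator_methionine, pass through unchanged.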
Additionally the validator complained about some of the note in Line 15, probably due to the %3B escaped semi-colon, but that may be a bug in the validator. Peter C. From p.j.a.cock at googlemail.com Tue Aug 16 14:39:05 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 16 Aug 2011 19:39:05 +0100 Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO) In-Reply-To: References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk> Message-ID: On Tue, Aug 16, 2011 at 4:36 PM, Peter Cock wrote: > > I recently noticed that the UniProt provide GFF3 files, > e.g. http://www.uniprot.org/uniprot/P99999.gff > > ... > http://www.uniprot.org/uniprot/P99999.txt > ... > > $ seqret -feature -osformat=gff3 -sformat=swiss -sequence P99999.txt I also noticed the seqret GFF3 output is using "+" as the strand, which is wrong for a protein reference like this. It should be using "." (period) as the features on a protein are strand-less (as done in the UniProt GFF3 file). Regards, Peter C. From p.j.a.cock at googlemail.com Wed Aug 17 06:37:06 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 17 Aug 2011 11:37:06 +0100 Subject: [emboss-dev] Problems with EMBOSS seqret GenBank to GFF3 Message-ID: Hi again Peter R. (et al.), Following yesterday's discussion about GFF3 files from UniProt, I'm trying seqret to produce GFF3 from GenBank files. 
I'd already found the NCBI currently provides some very broken GFF3 files: http://blastedbio.blogspot.com/2011/08/why-are-ncbi-gff3-files-still-broken.html $ wget ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans_Kin4_M_uid58009/NC_005213.gff $ wget ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans_Kin4_M_uid58009/NC_005213.gbk $ seqret --version EMBOSS:6.4.0.0 $ seqret -filter -feature -sequence NC_005213.gbk -sformat=genbank -osformat=gff3 | head -n 20 ##gff-version 3 ##sequence-region NC_005213 1 490885 #!Date 2011-08-17 #!Type DNA #!Source-version EMBOSS 6.4.0.0 NC_005213 EMBL databank_entry 1 490885 . + . ID=NC_005213.1;organism=Nanoarchaeum equitans Kin4-M;mol_type=genomic DNA;strain=Kin4-M;db_xref=taxon:228908 NC_005213 EMBL gene 3254 35301 . + . ID=NC_005213.2;locus_tag=NEQ_t01;experiment=experimental evidence%2C no additional details recorded;trans_splicing=true;db_xref=GeneID:3362429 NC_005213 EMBL gene 35233 35301 . + . Parent=NC_005213.2 NC_005213 EMBL gene 3254 3289 . + . Parent=NC_005213.2 NC_005213 EMBL tRNA 3254 35287 . + . ID=NC_005213.5;locus_tag=NEQ_t01;product=tRNA-Met;experiment=experimental evidence%2C no additional details recorded;trans_splicing=true;db_xref=GeneID:3362429 NC_005213 EMBL tRNA 35249 35287 . + . Parent=NC_005213.5 NC_005213 EMBL tRNA 3254 3289 . + . Parent=NC_005213.5 NC_005213 EMBL gene 1 490885 . - . ID=NC_005213.8;locus_tag=NEQ001;db_xref=GeneID:2732620 NC_005213 EMBL gene 490883 490885 . - . Parent=NC_005213.8 NC_005213 EMBL gene 1 879 . - . Parent=NC_005213.8 NC_005213 EMBL CDS 1 490885 . 
- 0 ID=NC_005213.11;locus_tag=NEQ001;note=conserved hypothetical [Methanococcus jannaschii]%3B COG1583:Uncharacterized ACR%3B IPR001472:Bipartite nuclear localization signal%3B IPR002743: Protein of unknown function DUF57;codon_start=1;transl_table=11;product=hypothetical protein;protein_id=NP_963295.1;db_xref=GI:41614797;db_xref=GeneID:2732620;translation=MRLLLELKALNSIDKKQLSNYLIQGFIYNILKNTEYSWLHNWKKEKYFNFTLIPKKDIIENKRYYLIISSPDKRFIEVLHNKIKDLDIITIGLAQFQLRKTKKFDPKLRFPWVTITPIVLREGKIVILKGDKYYKVFVKRLEELKKYNLIKKKEPILEEPIEISLNQIKDGWKIIDVKDRYYDFRNKSFSAFSNWLRDLKEQSLRKYNNFCGKNFYFEEAIFEGFTFYKTVSIRIRINRGEAVYIGTLWKELNVYRKLDKEEREFYKFLYDCGLGSLNSMGFGFVNTKKNSAR
NC_005213 EMBL CDS 490883 490885 . - 0 Parent=NC_005213.11
NC_005213 EMBL CDS 1 879 . - 0 Parent=NC_005213.11
NC_005213 EMBL sequence_feature 7 879 . - . ID=NC_005213.14;locus_tag=NEQ001;note=CRISPR/Cas system-associated RAMP superfamily protein Cas6%3B Region: Cas6-I-III%3B cl11443;db_xref=CDD:196236
NC_005213 EMBL gene 883 2691 . + . ID=NC_005213.15;locus_tag=NEQ003;db_xref=GeneID:2654355

I've deliberately cut the example here to include all of NEQ_t01, an interesting trans-spliced tRNA, and all of NEQ001, an interesting gene because it spans the origin of this circular genome. I use these examples in the blog post and discuss them again below.

Given some of the points below, I suspect EMBOSS is producing GFF3 as specified prior to the additions made in v1.18 (24 June 2010) regarding circular genomes.

The following numbering reflects the issues listed on my blog post about the NCBI version of the GFF3 file (link given above).

------------------------------------------
Problem One - Invalid Feature Types

EMBOSS looks OK here, you're converting the GenBank feature types source and misc_feature into databank_entry and sequence_feature respectively.

------------------------------------------
Problem Two - Circular features not marked

EMBOSS is also lacking in this area.
EMBOSS has used feature type databank_entry and generated feature ID NC_005213.1 for the landmark. However, this should include the special tag entry Is_circular=true, since this is the landmark feature for the whole circular chromosome. ------------------------------------------ Problem Three - Missing ID tags on multi-location features Unlike the NCBI file which fails to cross link multi-location features like trans-spliced NEQ_t01, EMBOSS looks better. However, I don't think you are following the expected pattern as used in the canonical GFF3 examples. In the GenBank file, this tRNA is join(35233..35301,3254..3289) For the gene and tRNA features for NEQ_t01, EMBOSS is generating three GFF3 lines. First a very broad parent feature 3254 to 35301, then two children 35233 to 35301 and 3254 to 3289. I would expect two GFF3 lines (for each of gene and tRNA), just 35233 to 35301 and 3254 to 3289 which would be linked by virtue of having the same ID. The online GFF3 validator would seem to support my interpretation, reporting errors like this: 8 [ERROR] invalid type pair - check all parents (at line 7; gene to gene) 11 [ERROR] invalid type pair - check all parents (at line 10; tRNA to tRNA) 14 [ERROR] invalid type pair - check all parents (at line 13; gene to gene) 17 [ERROR] invalid type pair - check all parents (at line 16; CDS to CDS) 28 [ERROR] invalid type pair - check all parents (at line 27; sequence_feature to sequence_feature) This is related to "Problem Six" and "Problem Seven" below. ------------------------------------------ Problem Four - Wrong tag for database cross references I had noticed the NCBI using a local tag (lower case) db_xref rather than the standard (upper case = reserved) tag Dbxref. EMBOSS does the same - is this deliberate and if so why? 
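The tag naming point can be checked mechanically. The following is a minimal sketch, assuming the reserved tag list from the GFF3 specification (ID, Name, Alias, Parent, Target, Gap, Derives_from, Note, Dbxref, Ontology_term, Is_circular) and the rule that any other tag starting with an upper case letter is reserved for future use. The function name check_attributes and the wording of the messages are made up for illustration and are not part of EMBOSS or the online validator.

```python
# Tags reserved by the GFF3 specification for column nine.
RESERVED_TAGS = {
    "ID", "Name", "Alias", "Parent", "Target", "Gap", "Derives_from",
    "Note", "Dbxref", "Ontology_term", "Is_circular",
}


def check_attributes(column9):
    """Return a list of messages about questionable tags (illustrative only)."""
    problems = []
    for pair in column9.split(";"):
        if not pair:
            continue  # tolerate a trailing semi-colon
        tag = pair.split("=", 1)[0]
        if tag[:1].isupper() and tag not in RESERVED_TAGS:
            # Upper case tags outside the reserved list are not allowed.
            problems.append(tag + ": upper case but not a reserved tag")
        elif tag == "db_xref":
            # Lower case is legal, but Dbxref is the standard tag.
            problems.append(tag + ": consider the reserved tag Dbxref")
    return problems
```

For example, applied to the attributes of the NEQ_t01 gene line above, this reports only the db_xref tag; lower case tags such as locus_tag are free for application use and are not flagged.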
------------------------------------------
Problem Five - Missing stop codon in CDS features

EMBOSS looks OK here.

------------------------------------------
Problem Six - Features wrapping the origin of a circular genome

Related to the landmark feature lacking the Is_circular=true tag, the gene and CDS features for origin wrapping NEQ001 look funny to me.

EMBOSS seems to be generating three GFF3 lines for the gene and CDS for NEQ001, a surprisingly broad entry 1 to 490885 and two children 490883 to 490885 and 1 to 879 (which do look sensible). This is essentially the same point I raised above with NEQ_t01, but with the added complication of spanning the origin.

Based on the old specification, I had expected two GFF3 lines each for the gene and CDS, giving the regions 490883 to 490885 and 1 to 879, linked by virtue of having the same ID.

Thankfully this potential confusion has been addressed in the updated specification, so I would expect a single GFF3 line for each of the gene and CDS for NEQ001, using start 490883 and end of 879+490885=491764.

------------------------------------------
Problem Seven - No parent/child relationships

The NCBI GFF3 file had no parent/child relationships at all.

The EMBOSS 6.4.0 GFF3 file does use parent/child relationships but not in the way I expected (and not in a way the validator likes). As discussed above, for the GenBank join locations EMBOSS seems to create broad parent features with children for each sub-location (parent/child relations of the same type = bad).

What I'm expecting instead is parent/child relationships between the CDS and gene features, between tRNA and gene features, etc. Note that these relationships are implicit in the GenBank (and EMBL) flat files, so I accept trying to deduce them might be hard (and perhaps best not doing immediately - the other issues are more pressing).
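The origin-wrapping arithmetic from Problem Six can be sketched as follows. This assumes a two-part location whose first part runs up to the last base of the circular genome and whose second part starts at base 1; the function name wrap_location is hypothetical and only illustrates the calculation, it is not how EMBOSS handles locations internally.

```python
def wrap_location(parts, genome_length):
    """Collapse a two-part origin-spanning location into one (start, end) pair.

    parts is a list of two (start, end) tuples ordered along the forward
    coordinates, e.g. [(490883, 490885), (1, 879)] on a genome of length
    490885. The end is allowed to exceed the genome length, as permitted
    for circular landmarks in the updated GFF3 specification.
    """
    (first_start, first_end), (second_start, second_end) = parts
    if first_end != genome_length or second_start != 1:
        raise ValueError("location does not span the origin")
    return first_start, second_end + genome_length
```

Calling wrap_location([(490883, 490885), (1, 879)], 490885) gives (490883, 491764), matching the single GFF3 line suggested for the gene and CDS.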
------------------------------------------ Problem Eight - Invalid tags The online validator complains that EMBOSS too is using EC_number (uppercase tags are reserved ------------------------------------------ So my conclusion is that while the EMBOSS generated GFF3 is better than those produced by the NCBI, it still is invalid and needs some work. As usual, I am of course happy to help with testing fixes. And if there are any mistakes in my understanding of the GFF3 spec, please tell me ;) Regards, Peter C. From pmr at ebi.ac.uk Wed Aug 17 11:38:23 2011 From: pmr at ebi.ac.uk (Peter Rice) Date: Wed, 17 Aug 2011 16:38:23 +0100 Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO) In-Reply-To: References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk> Message-ID: <4E4BE06F.9040503@ebi.ac.uk> On 16/08/2011 16:36, Peter Cock wrote: > Interestingly EMBOSS includes the sequence at the bottom > (using the FASTA directive) and has generated unique ID tags > for each feature. It has also added more note tags. The sequence is included if you are writing sequence data. GFF3 allows sequence to be included, so we add it. Using a separate feature file is always awkward for users, but is supported. > Unfortunately this also failed the GFF3 validation. The EMBOSS > output does a lot better (e.g. "cleaved_initiator_methionine" is > valid while "Initiator methionine" in the UniProt file was not) > > However, some of the terms in column 3 are apparently out of > date - but http://www.sequenceontology.org does list them as > synonyms: Thanks. I'll update the table, but synonyms should be acceptable. > Finally protein_modification_categorized_by_chemical_process > does not seem to be valid (I failed to find it in the ontology). Not in SO, but in a separate ontology (MOD). Should also be valid in GFF I believe, but perhaps the parser insists on using SO and excluding related ontologies. 
> Additionally the validator complained about some of the note > in Line 15, probably due to the %3B escaped semi-colon, > but that may be a bug in the validator. Interesting. Let me know if we are not escaping the right characters, but I believe we are supposed to escape ';' in those positions. regards, Peter Rice EMBOSS Team From pmr at ebi.ac.uk Wed Aug 17 11:39:39 2011 From: pmr at ebi.ac.uk (Peter Rice) Date: Wed, 17 Aug 2011 16:39:39 +0100 Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO) In-Reply-To: References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk> Message-ID: <4E4BE0BB.70302@ebi.ac.uk> On 16/08/2011 19:39, Peter Cock wrote: > I also noticed the seqret GFF3 output is using "+" as the strand, > which is wrong for a protein reference like this. It should be using > "." (period) as the features on a protein are strand-less (as done > in the UniProt GFF3 file). Thanks. We'll fix it for the next release, but my understanding is it should be acceptable to most parsers. regards, Peter Rice EMBOSS Team From p.j.a.cock at googlemail.com Wed Aug 17 11:48:32 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 17 Aug 2011 16:48:32 +0100 Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO) In-Reply-To: <4E4BE06F.9040503@ebi.ac.uk> References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk> <4E4BE06F.9040503@ebi.ac.uk> Message-ID: On Wed, Aug 17, 2011 at 4:38 PM, Peter Rice wrote: > On 16/08/2011 16:36, Peter Cock wrote: >> >> Interestingly EMBOSS includes the sequence at the bottom >> (using the FASTA directive) and has generated unique ID tags >> for each feature. It has also added more note tags. > > The sequence is included if you are writing sequence data. GFF3 allows > sequence to be included, so we add it. Using a separate feature file is > always awkward for users, but is supported. 
See also the discussion today on gmod-gbrowse / song-devel where it sounds like GFF3 should have a single block of FASTA embedded sequence at the end of the file, rather than interleaved. As I suggest on that thread, the practical solution for EMBOSS seqret might be to omit the FASTA sequence altogether. Or cache them in memory/on disk to write out at the very end of all the features? http://generic-model-organism-system-database.450254.n5.nabble.com/Mailing-list-for-GFF3-specification-discussion-td4707740.html >> Unfortunately this also failed the GFF3 validation. The EMBOSS >> output does a lot better (e.g. "cleaved_initiator_methionine" is >> valid while "Initiator methionine" in the UniProt file was not) >> >> However, some of the terms in column 3 are apparently out of >> date - but http://www.sequenceontology.org does list them as >> synonyms: > > Thanks. I'll update the table, but synonyms should be acceptable. I can see plus points for either view, certainly the validator could downgrade that error to a warning. >> Finally protein_modification_categorized_by_chemical_process >> does not seem to be valid (I failed to find it in the ontology). > > Not in SO, but in a separate ontology (MOD). Should also be valid > in GFF I believe, but perhaps the parser insists on using SO and > excluding related ontologies. OK, but in that case shouldn't you then be declaring this with a ##feature-ontology directive? >> Additionally the validator complained about some of the note >> in Line 15, probably due to the %3B escaped semi-colon, >> but that may be a bug in the validator. > > Interesting. Let me know if we are not escaping the right characters, but I > believe we are supposed to escape ';' in those positions. I haven't checked this aspect carefully (since this is fiddly). 
Peter From p.j.a.cock at googlemail.com Wed Aug 17 11:50:57 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 17 Aug 2011 16:50:57 +0100 Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO) In-Reply-To: <4E4BE0BB.70302@ebi.ac.uk> References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk> <4E4BE0BB.70302@ebi.ac.uk> Message-ID: On Wed, Aug 17, 2011 at 4:39 PM, Peter Rice wrote: > On 16/08/2011 19:39, Peter Cock wrote: >> >> I also noticed the seqret GFF3 output is using "+" as the strand, >> which is wrong for a protein reference like this. It should be using >> "." (period) as the features on a protein are strand-less (as done >> in the UniProt GFF3 file). > > Thanks. > > We'll fix it for the next release, but my understanding is it should be > acceptable to most parsers. > I agree this is pretty harmless - in practice all that really matters is if the strand is "-" or not. Still, it should be straight forward to fix. Peter From pmr at ebi.ac.uk Wed Aug 17 11:52:21 2011 From: pmr at ebi.ac.uk (Peter Rice) Date: Wed, 17 Aug 2011 16:52:21 +0100 Subject: [emboss-dev] Problems with EMBOSS seqret GenBank to GFF3 In-Reply-To: References: Message-ID: <4E4BE3B5.4080601@ebi.ac.uk> On 17/08/2011 11:37, Peter Cock wrote: > Hi again Peter R. (et al.), > > Following yesterday's discussion about GFF3 files from UniProt, > I'm trying seqret to produce GFF3 from GenBank files. I'd already > found the NCBI currently provides some very broken GFF3 files: > > ------------------------------------------ > > Problem Two - Circular features not marked > > EMBOSS is also lacking in this area. > > EMBOSS has used feature type databank_entry and generated feature ID > NC_005213.1 for the landmark. However, this should include the special > tag entry Is_circular=true, since this is the landmark feature for the whole > circular chromosome. Thanks. I'll make sure we add it for the next release. 
> ------------------------------------------ > > Problem Three - Missing ID tags on multi-location features > > Unlike the NCBI file which fails to cross link multi-location features like > trans-spliced NEQ_t01, EMBOSS looks better. However, I don't think > you are following the expected pattern as used in the canonical GFF3 > examples. > > In the GenBank file, this tRNA is join(35233..35301,3254..3289) > > For the gene and tRNA features for NEQ_t01, EMBOSS is generating > three GFF3 lines. First a very broad parent feature 3254 to 35301, > then two children 35233 to 35301 and 3254 to 3289. > > I would expect two GFF3 lines (for each of gene and tRNA), just > 35233 to 35301 and 3254 to 3289 which would be linked by virtue > of having the same ID. EMBOSS is reporting what is stored internally (feature and subfeatures for the exons). Looks like we should skip reporting the feature. I'll check what that means for the IDs. > This is related to "Problem Six" and "Problem Seven" below. > > ------------------------------------------ > > Problem Four - Wrong tag for database cross references > > I had noticed the NCBI using a local tag (lower case) db_xref rather > than the standard (upper case = reserved) tag Dbxref. EMBOSS > does the same - is this deliberate and if so why? It is deliberate - we are using the db_xref tag from the EMBL/GenBank feature table. But we could convert to the GFF3 tag (and back again on reading). I'll have a look at how easy that would be. > ------------------------------------------ > > Problem Six - Features wrapping the origin of a circular genome > > Related to the landmark feature lacking the Is_circular=true tag, the > gene and CDS features for origin wrapping NEQ003 look funny to me. > EMBOSS seems to be generating three GFF3 lines for the gene and CDS > for NEQ003, a surprisingly broad entry 1 to 490885 and two children > 490883 to 490885 and 1 to 879 (which do look sensible). 
> > This is essentially the same point I raised above with NEQ_t01, but > with the added complication of spanning the origin. Ah, something to do with the way start and end positions are stored internally. I'll fix that along with other circular feature issues. > Thankfully this potential confusion has been addressed in the updated > specification, so I would expect a single GFF3 line for each of the gene > and CDS for NEQ003, using start 490883 and end of 879+490885=491764. I'll try to write (and read) that way too. > ------------------------------------------ > > Problem Seven - No parent/child relationships > > The NCBI GFF3 file had no parent/child relationships at all. > > The EMBOSS 6.4.0 GFF3 file does use parent/child relationships > but not in the way I expected (and not in a way the validator likes). > As discussed above, for the GenBank join locations EMBOSS > seems to create broad parent features with children for each > sub-location (parent/child relations of the same type = bad). > > What I'm expecting instead is parent/child relationships between > the CDS and gene features, between tRNA and gene features, etc. > Note that these relationships are implicit in the GenBank (and EMBL) > flat files, so I accept trying to deduce them might be hard (and > perhaps best not doing immediately - the other issues are more > pressing). Could be possible by matching common exons (stored internally as subfeatures). I'll have a look. > ------------------------------------------ > > Problem Eight - Invalid tags > > The online validator complains that EMBOSS too is using EC_number > (uppercase tags are reserved) Pah! We use the EMBL/GenBank tag names. 
Looks like we will have to convert to lower case, so we may as well include that with the db_xref/Dbxref conversion in GFF3 writing and reading. > ------------------------------------------ > > So my conclusion is that while the EMBOSS generated GFF3 is > better than those produced by the NCBI, it still is invalid and needs > some work. > > As usual, I am of course happy to help with testing fixes. And if > there are any mistakes in my understanding of the GFF3 spec, > please tell me ;) Many, many thanks for finding these. EMBOSS feature internals had a major rewrite in 6.4.0 to store exons as subfeatures, which makes all this much easier to handle. regards, Peter Rice EMBOSS Team From pmr at ebi.ac.uk Wed Aug 17 11:55:53 2011 From: pmr at ebi.ac.uk (Peter Rice) Date: Wed, 17 Aug 2011 16:55:53 +0100 Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO) In-Reply-To: References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk> <4E4BE06F.9040503@ebi.ac.uk> Message-ID: <4E4BE489.3040703@ebi.ac.uk> On 17/08/2011 16:48, Peter Cock wrote: > On Wed, Aug 17, 2011 at 4:38 PM, Peter Rice wrote: >> On 16/08/2011 16:36, Peter Cock wrote: >>> >>> Interestingly EMBOSS includes the sequence at the bottom >>> (using the FASTA directive) and has generated unique ID tags >>> for each feature. It has also added more note tags. >> >> The sequence is included if you are writing sequence data. GFF3 allows >> sequence to be included, so we add it. Using a separate feature file is >> always awkward for users, but is supported. > > See also the discussion today on gmod-gbrowse / song-devel where > it sounds like GFF3 should have a single block of FASTA embedded > sequence at the end of the file, rather than interleaved. As I suggest > on that thread, the practical solution for EMBOSS seqret might be to > omit the FASTA sequence altogether. Or cache them in memory/on > disk to write out at the very end of all the features? Thanks. 
We already save sequences and write at the end for some formats so I'll add it for GFF3. We will need more work for reading GFF3 input though, but it may not be too bad. If we are reading it as feature input, we don't look for the sequence. If we are reading as sequence input, we need to read all the sequences into memory and then go back to read the features. For streamed input we can buffer to make the rewind work. regards, Peter Rice EMBOSS Team From p.j.a.cock at googlemail.com Wed Aug 17 12:05:13 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 17 Aug 2011 17:05:13 +0100 Subject: [emboss-dev] Problems with EMBOSS seqret GenBank to GFF3 In-Reply-To: <4E4BE3B5.4080601@ebi.ac.uk> References: <4E4BE3B5.4080601@ebi.ac.uk> Message-ID: On Wed, Aug 17, 2011 at 4:52 PM, Peter Rice wrote: > On 17/08/2011 11:37, Peter Cock wrote: >> ------------------------------------------ >> >> Problem Four - Wrong tag for database cross references >> >> I had noticed the NCBI using a local tag (lower case) db_xref rather >> than the standard (upper case = reserved) tag Dbxref. EMBOSS >> does the same - is this deliberate and if so why? > > It is deliberate - we are using the db_xref tag from the EMBL/GenBank > feature table. > > But we could convert to the GFF3 tag (and back again on reading). I'll > have a look at how easy that would be. Do you want to check this one with Lincoln on the song-devel mailing list first - after all, using a lower case tag is quite allowable and valid GFF3. My point is it does seem to be exactly what the reserved tag Dbxref is intended for. >> ------------------------------------------ >> >> Problem Seven - No parent/child relationships >> >> The NCBI GFF3 file had no parent/child relationships at all. >> >> The EMBOSS 6.4.0 GFF3 file does use parent/child relationships >> but not in the way I expected (and not in a way the validator likes). 
>> As discussed above, for the GenBank join locations EMBOSS >> seems to create broad parent features with children for each >> sub-location (parent/child relations of the same type = bad). >> >> What I'm expecting instead is parent/child relationships between >> the CDS and gene features, between tRNA and gene features, etc. >> Note that these relationships are implicit in the GenBank (and EMBL) >> flat files, so I accept trying to deduce them might be hard (and >> perhaps best not doing immediately - the other issues are more >> pressing). > > Could be possible by matching common exons (stored internally as > subfeatures). I'll have a look. Usually yes, but not all the time. I've seen GenBank files where the gene and CDS features have slightly different locations which makes doing this automatically hard. Off the top of my head this was a programmed frame shift example... I'll see if I can find you a specific example. >> ------------------------------------------ >> >> So my conclusion is that while the EMBOSS generated GFF3 is >> better than those produced by the NCBI, it still is invalid and needs >> some work. >> >> As usual, I am of course happy to help with testing fixes. And if >> there are any mistakes in my understanding of the GFF3 spec, >> please tell me ;) > > Many, many thanks for finding these. I've come to value NC_005213.gbk as a reasonably small circular genome with some rather complicated annotation - it's one of my favourite test cases. > EMBOSS feature internals had a major rewrite in 6.4.0 to store exons as > subfeatures, which makes all this much easier to handle. Oh good - that restructuring should now pay dividends :) Peter C. 
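One naive way to picture the gene/CDS (or gene/tRNA) linking discussed in this thread is sketched below in Python. It is only an assumption-laden illustration: the Feature class, covers and find_parent are invented names, this is not how EMBOSS stores features, and the strand value in the example is a placeholder. A child is attached to the first gene on the same strand whose sub-locations cover all of the child's sub-locations.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Feature:
    ftype: str                    # e.g. "gene", "CDS", "tRNA"
    strand: str                   # "+" or "-"
    parts: List[Tuple[int, int]]  # 1-based inclusive sub-locations


def covers(gene: Feature, child: Feature) -> bool:
    """True if every child sub-location lies inside some gene sub-location."""
    return all(
        any(gs <= cs and ce <= ge for gs, ge in gene.parts)
        for cs, ce in child.parts
    )


def find_parent(child: Feature, genes: List[Feature]) -> Optional[Feature]:
    """Return the first same-strand gene covering the child, else None."""
    for gene in genes:
        if gene.strand == child.strand and covers(gene, child):
            return gene
    return None


# NEQ_t01 coordinates from the thread; the strand is a placeholder.
gene = Feature("gene", "+", [(35233, 35301), (3254, 3289)])
trna = Feature("tRNA", "+", [(35233, 35301), (3254, 3289)])
print(find_parent(trna, [gene]) is gene)
```

As noted above, entries where the gene and CDS locations differ slightly may not be matched by a rule like this, and overlapping genes on the same strand would make the first match ambiguous.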
From p.j.a.cock at googlemail.com Wed Aug 17 12:07:54 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 17 Aug 2011 17:07:54 +0100 Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO) In-Reply-To: <4E4BE489.3040703@ebi.ac.uk> References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk> <4E4BE06F.9040503@ebi.ac.uk> <4E4BE489.3040703@ebi.ac.uk> Message-ID: On Wed, Aug 17, 2011 at 4:55 PM, Peter Rice wrote: > On 17/08/2011 16:48, Peter Cock wrote: >> See also the discussion today on gmod-gbrowse / song-devel where >> it sounds like GFF3 should have a single block of FASTA embedded >> sequence at the end of the file, rather than interleaved. As I suggest >> on that thread, the practical solution for EMBOSS seqret might be to >> omit the FASTA sequence altogether. Or cache them in memory/on >> disk to write out at the very end of all the features? > > Thanks. We already save sequences and write at the end for some > formats so I'll add it for GFF3. We will need more work for reading > GFF3 input though, but it may not be too bad. > > If we are reading it as feature input, we don't look for the sequence. > > If we are reading as sequence input, we need to read all the sequences > into memory and then go back to read the features. For streamed input > we can buffer to make the rewind work. I'm curious what other file formats needed this kind of work. But it is good that you've already got some buffer/cache infrastructure in place. Does it boil down to writing temp files in /tmp? Peter C. 
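The write-at-the-end idea from this exchange can be sketched roughly as follows in Python. This is a toy under stated assumptions, not EMBOSS code: the Gff3Writer class and its methods are invented for illustration, the sequences are simply held in memory, and the feature lines and sequences in the example are made up.

```python
import io


class Gff3Writer:
    """Write feature lines immediately, hold sequences until close()."""

    def __init__(self, handle):
        self.handle = handle
        self.pending = []  # (seqid, sequence) pairs saved for the FASTA block
        handle.write("##gff-version 3\n")

    def add_record(self, seqid, feature_lines, sequence):
        # Features go straight to the output; the sequence is only cached.
        for line in feature_lines:
            self.handle.write(line + "\n")
        self.pending.append((seqid, sequence))

    def close(self):
        # A single ##FASTA block after all features, rather than interleaved.
        if self.pending:
            self.handle.write("##FASTA\n")
            for seqid, sequence in self.pending:
                self.handle.write(f">{seqid}\n{sequence}\n")
        self.pending = []


# Made-up records, purely to show the ordering of the output.
out = io.StringIO()
writer = Gff3Writer(out)
writer.add_record("seq1", ["seq1\t.\tgene\t1\t9\t.\t+\t.\tID=g1"], "ATGAAATAG")
writer.add_record("seq2", ["seq2\t.\tgene\t1\t6\t.\t+\t.\tID=g2"], "ATGTAG")
writer.close()
text = out.getvalue()
print(text)
```

For large inputs the cached sequences could instead go to a temporary file, which is the trade-off the question about /tmp is getting at.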
From pmr at ebi.ac.uk Wed Aug 17 12:14:15 2011 From: pmr at ebi.ac.uk (Peter Rice) Date: Wed, 17 Aug 2011 17:14:15 +0100 Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO) In-Reply-To: References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk> <4E4BE06F.9040503@ebi.ac.uk> <4E4BE489.3040703@ebi.ac.uk> Message-ID: <4E4BE8D7.4010203@ebi.ac.uk> On 17/08/2011 17:07, Peter Cock wrote: > On Wed, Aug 17, 2011 at 4:55 PM, Peter Rice wrote: >> If we are reading as sequence input, we need to read all the sequences >> into memory and then go back to read the features. For streamed input >> we can buffer to make the rewind work. > > I'm curious what other file formats needed this kind of work. But it > is good that you've already got some buffer/cache infrastructure > in place. Does it boil down to writing temp files in /tmp? MSF (checksum at the top), Phylip (number of sequences at the top). In ajseqwrite.c these are the ones with the Save attribute set true. We keep them in memory and write them when the output file is closed. regards, Peter Rice EMBOSS Team From p.j.a.cock at googlemail.com Wed Aug 17 12:33:29 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 17 Aug 2011 17:33:29 +0100 Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO) In-Reply-To: <4E4BE8D7.4010203@ebi.ac.uk> References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk> <4E4BE06F.9040503@ebi.ac.uk> <4E4BE489.3040703@ebi.ac.uk> <4E4BE8D7.4010203@ebi.ac.uk> Message-ID: On Wed, Aug 17, 2011 at 5:14 PM, Peter Rice wrote: > On 17/08/2011 17:07, Peter Cock wrote: >> I'm curious what other file formats needed this kind of work. But it >> is good that you've already got some buffer/cache infrastructure >> in place. Does it boil down to writing temp files in /tmp? > > MSF (checksum at the top), Phylip (number of sequences at the top). > > In ajseqwrite.c these are the ones with the Save attribute set true. 
> > We keep them in memory and write them when the output file is closed. I wasn't thinking of alignments, but that makes perfect sense. Thanks, Peter C. From p.j.a.cock at googlemail.com Wed Aug 17 12:54:10 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 17 Aug 2011 17:54:10 +0100 Subject: [emboss-dev] Moving EMBOSS from OBF hosted CVS to git on github Message-ID: Dear EMBOSS team, Have you made any decisions regarding the proposal to move the EMBOSS repository from CVS hosted by the OBF to git hosted on github (where most of the other OBF backed projects are now)? I see this made it to the minutes of the 27 June 2011 meeting: http://emboss.sourceforge.net/meetings/2011-06-27.html As I recall from talking to Peter Rice at BOSC/ISMB 2011 in Vienna last month, EMBOSS currently uses a single branch in CVS (like Biopython used to), so migrating the repository to git shouldn't be too complicated. I recommend in the short term maintaining a git mirror of the CVS repository on github.com, which can be kept current via a cron job running on the OBF server. You can then treat this git repository as a read only mirror and continue to make all commits via CVS. During this interim period, external contributors can make their own branches etc (without touching the official EMBOSS repository) and send you patches. The internal developers can also try this out as a way to get familiar with git gradually. This is what we did with Biopython, and it worked very well. I am happy to assist with this if you want. I think I made this offer in person in Vienna, but I'm repeating it publicly now. You might also be able to adopt the existing mirror maintained by Pjotr Prins (CC'd), although that does include a branch with BioLib work in it: https://github.com/pjotrp/EMBOSS/ Regards, Peter C. P.S. You'll need to have a different project name on github since emboss was used by Martin Bosslet back in Nov 2010. How about emboss-prj or even open-bio for this? P.P.S. 
This page seems to be missing: http://emboss.sourceforge.net/meetings/2011-07-04.html It is linked to from at least these two pages: http://emboss.sourceforge.net/meetings/ http://emboss.sourceforge.net/meetings/2011-07-11.html From pmr at ebi.ac.uk Thu Aug 18 08:28:28 2011 From: pmr at ebi.ac.uk (Peter Rice) Date: Thu, 18 Aug 2011 13:28:28 +0100 Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO) In-Reply-To: References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk> Message-ID: <4E4D056C.5050508@ebi.ac.uk> On 08/16/2011 04:36 PM, Peter Cock wrote: > I will report this to UniProt later. However, first I thought > I would try converting one of the other files provided into > GFF3 using EMBOSS seqret for an alternative, e.g. the > plain text "swiss" format: http://www.uniprot.org/uniprot/P99999.txt > > I can convert this using seqret as follows: > > ======================================== > $ seqret -feature -osformat=gff3 -sformat=swiss -sequence P99999.txt > However, some of the terms in column 3 are apparently out of > date - but http://www.sequenceontology.org does list them as > synonyms: > > It looks like the EMBOSS sequence ontology table may need > updating for at least these three cases. > > Finally protein_modification_categorized_by_chemical_process > does not seem to be valid (I failed to find it in the ontology). That was a name from the MOD ontology. GFF3 output now uses an SO term (but SO is lacking detail for MOD_RES, having only: id: SO:0001089 name: post_translationally_modified_region and id: SO:0001700 name: histone_modification ... and then more descendants of histone modification). Still showing its DNA-only roots. EMBOSS internally uses MOD terms for MOD_RES features. The details are in the note tag in GFF3 output. > Additionally the validator complained about some of the note > in Line 15, probably due to the %3B escaped semi-colon, > but that may be a bug in the validator. Worked for me. 
Perhaps it was confused by the term name errors (or perhaps the validator has been fixed). However, one nasty bug ... EMBOSS was so careful to only read real GFF3 format that the EMBOSS comment "#!Type Protein" was ignored and features were read into EMBOSS as nucleotide. I suspect there is no way in GFF3 to identify a protein file. In the next patch we can parse the EMBOSS comment again but that will not help with non-EMBOSS protein GFF3 files. Is there some official distinction between protein and nucleotide GFF3 files? regards, Peter Rice EMBOSS Team From pmr at ebi.ac.uk Wed Aug 24 06:36:34 2011 From: pmr at ebi.ac.uk (Peter Rice) Date: Wed, 24 Aug 2011 11:36:34 +0100 Subject: [emboss-dev] Problems with EMBOSS seqret GenBank to GFF3 In-Reply-To: References: Message-ID: <4E54D432.8030309@ebi.ac.uk> On 08/17/2011 11:37 AM, Peter Cock wrote: > Hi again Peter R. (et al.), > > Following yesterday's discussion about GFF3 files from UniProt, > I'm trying seqret to produce GFF3 from GenBank files. > > ------------------------------------------ > > Problem Two - Circular features not marked > > EMBOSS is also lacking in this area. Current status: circular tags will be passed better in the next EMBOSS release. Sequence inputs will have a new -scircular qualifier and feature inputs will have -fcircular to cover cases where the input format does not define a circular sequence (but if it does, these will not turn it off). We will tag a feature with Is_circular in the output, even if we have to make one up. > ------------------------------------------ > > Problem Six - Features wrapping the origin of a circular genome > > Related to the landmark feature lacking the Is_circular=true tag, the > gene and CDS features for origin wrapping NEQ003 look funny to me. > EMBOSS seems to be generating three GFF3 lines for the gene and CDS > for NEQ003, a surprisingly broad entry 1 to 490885 and two children > 490883 to 490885 and 1 to 879 (which do look sensible). 
> > Based on the old specification, I had expected two GFF3 lines each for the > gene and CDS, giving the regions 490883 to 490885 and 1 to 879, linked > by virtue of having the same ID. > > Thankfully this potential confusion has been addressed in the updated > specification, so I would expect a single GFF3 line for each of the gene > and CDS for NEQ003, using start 490883 and end of 879+490885=491764. Unfortunately GFF3 is lacking in details on how to define the sequence length. It appears there is no standard for defining the length, yet it is critical to interpreting a circular feature that goes across the origin as GFF3 makes the end position greater than the length. We will make a best guess but cannot guarantee we get the right answer. > ------------------------------------------ > > Problem Seven - No parent/child relationships > > The EMBOSS 6.4.0 GFF3 file does use parent/child relationships > but not in the way I expected (and not in a way the validator likes). > As discussed above, for the GenBank join locations EMBOSS > seems to create broad parent features with children for each > sub-location (parent/child relations of the same type = bad). > > What I'm expecting instead is parent/child relationships between > the CDS and gene features, between tRNA and gene features, etc. > Note that these relationships are implicit in the GenBank (and EMBL) > flat files, so I accept trying to deduce them might be hard (and > perhaps best not doing immediately - the other issues are more > pressing). The obvious fix is to lie about the feature types of the exons so the validator is happy. We could call them exons, but "region" would be safer. But there is a silly complication with CDS features: we could keep the CDS parent record and have it as a parent of a group of "regions" for the processed exons. But GFF3 wants the exons to be type "CDS" so what do we call the parent? 
So in the cobbled together example below, ignoring the circular aspects, we would want to keep the CDS on the parent (ID=NC_005213.11) record where all the annotation tags are, but I suspect GFF3 wants that to be something else. We could of course specifically lie about CDS features for EMBOSS generated GFF3 files (we tag the header) so we can restore the correct internal structure on input. NC_005213 EMBL CDS 490883 491764 . - 0 ID=NC_005213.11;locus_tag=NEQ001;note=conserved hypothetical [Methanococcus jannaschii]%3B COG1583:Uncharacterized ACR%3B IPR001472:Bipartite nuclear localization signal%3B IPR002743: Protein of unknown function DUF57;codon_start=1;transl_table=11;product=hypothetical protein;protein_id=NP_963295.1;db_xref=GI:41614797;db_xref=GeneID:2732620;translation=MRLLLELKALNSIDKKQLSNYLIQGFIYNILKNTEYSWLHNWKKEKYFNFTLIPKKDIIENKRYYLIISSPDKRFIEVLHNKIKDLDIITIGLAQFQLRKTKKFDPKLRFPWVTITPIVLREGKIVILKGDKYYKVFVKRLEELKKYNLIKKKEPILEEPIEISLNQIKDGWKIIDVKDRYYDFRNKSFSAFSNWLRDLKEQSLRKYNNFCGKNFYFEEAIFEGFTFYKTVSIRIRINRGEAVYIGTLWKELNVYRKLDKEEREFYKFLYDCGLGSLNSMGFGFVNTKKNSAR NC_005213 EMBL CDS 490883 490885 . - 0 ID=NC_005213.12;Parent=NC_005213.11 NC_005213 EMBL CDS 1 879 . - 0 ID=NC_005213.13;Parent=NC_005213.11 > ------------------------------------------ > > Problem Eight - Invalid tags > > The online validator complains that EMBOSS too is using EC_number > (uppercase tags are reserved Fixed and we can patch the release. Making all tags lower case is trivial - they are automatically converted on input to the internal mixed case. > ------------------------------------------ > > So my conclusion is that while the EMBOSS generated GFF3 is > better than those produced by the NCBI, it still is invalid and needs > some work. > > As usual, I am of course happy to help with testing fixes. And if > there are any mistakes in my understanding of the GFF3 spec, > please tell me ;) Hope this helps. Progress is being made. 
However, as GFF3 is such a pain, I am wondering whether to switch the default feature format to something else - back to GFF2 or maybe to use GTF. Does anyone have a preference? regards, Peter Rice EMBOSS Team From pmr at ebi.ac.uk Wed Aug 24 10:45:33 2011 From: pmr at ebi.ac.uk (Peter Rice) Date: Wed, 24 Aug 2011 15:45:33 +0100 Subject: [emboss-dev] Problems with EMBOSS seqret GenBank to GFF3 In-Reply-To: <4E54D432.8030309@ebi.ac.uk> References: <4E54D432.8030309@ebi.ac.uk> Message-ID: <4E550E8D.8010506@ebi.ac.uk> On 08/24/2011 11:36 AM, Peter Rice wrote: > On 08/17/2011 11:37 AM, Peter Cock wrote: > >> ------------------------------------------ >> >> Problem Seven - No parent/child relationships >> >> The EMBOSS 6.4.0 GFF3 file does use parent/child relationships >> but not in the way I expected (and not in a way the validator likes). As a first attempt, using the EMBL entry v00508 in the EMBOSS test set, I can make the CDS "parent" feature change its type to "biological_region" and add a featflags tag with the true type. Code (not yet checked in) can reconstruct the EMBL feature table from this GFF. However, the EMBL tags are all on the parent (now biological_region) feature. Any suggestions where I should stick them for them to be useful in GFF3? EMBL feature table: FT source 1..3919 FT /organism="Homo sapiens" FT /mol_type="genomic DNA" FT /db_xref="taxon:9606" FT CDS join(2079..2171,2294..2515,3371..3499) FT /db_xref="GDB:119299" FT /db_xref="GOA:P02100" FT /db_xref="HGNC:4830" FT /db_xref="InterPro:IPR000971" FT /db_xref="InterPro:IPR002337" FT /db_xref="InterPro:IPR009050" FT /db_xref="InterPro:IPR012292" FT /db_xref="PDB:1A9W" FT /db_xref="UniProtKB/Swiss-Prot:P02100" FT /protein_id="CAA23766.1" FT /translation="MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDS FT FGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENF FT KLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH" proposed GFF3 version V00508 EMBL databank_entry 1 3919 . + . 
ID=V00508.1;organism=Homo sapiens;mol_type=genomic DNA;db_xref=taxon:9606 V00508 EMBL biological_region 2079 3499 . + 0 ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_x ref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLV VYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH V00508 EMBL CDS 2079 2171 . + 0 Parent=V00508.2 V00508 EMBL CDS 2294 2515 . + 0 Parent=V00508.2 V00508 EMBL CDS 3371 3499 . + 0 Parent=V00508.2 regards, Peter Rice EMBOSS Team From p.j.a.cock at googlemail.com Wed Aug 24 20:44:47 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 25 Aug 2011 01:44:47 +0100 Subject: [emboss-dev] Problems with EMBOSS seqret GenBank to GFF3 In-Reply-To: <4E550E8D.8010506@ebi.ac.uk> References: <4E54D432.8030309@ebi.ac.uk> <4E550E8D.8010506@ebi.ac.uk> Message-ID: On Wed, Aug 24, 2011 at 11:36 AM, Peter Rice wrote: > > However, as GFF3 is such a pain, I am wondering whether to switch the > default feature format to something else - back to GFF2 or maybe to use GTF. > Sadly I have to agree with you - the current version of the GFF3 spec leaves far too much open to multiple interpretation, as we have been discussing on the song-devel mailing lists. I'm not sure that GFF2 or GTF are any better though. On Wed, Aug 24, 2011 at 3:45 PM, Peter Rice wrote: > On 08/24/2011 11:36 AM, Peter Rice wrote: >> >> On 08/17/2011 11:37 AM, Peter Cock wrote: >> >>> ------------------------------------------ >>> >>> Problem Seven - No parent/child relationships >>> >>> The EMBOSS 6.4.0 GFF3 file does use parent/child relationships >>> but not in the way I expected (and not in a way the validator likes). 
>
> As a first attempt, using the EMBL entry v00508 in the EMBOSS test set, I
> can make the CDS "parent" feature change its type to "biological_region" and
> add a featflags tag with the true type. Code (not yet checked in) can
> reconstruct the EMBL feature table from this GFF.
>
> However, the EMBL tags are all on the parent (now biological_region)
> feature.
>
> Any suggestions where I should stick them for them to be useful in GFF3?
>
> EMBL feature table:
>
> FT   source          1..3919
> FT                   /organism="Homo sapiens"
> FT                   /mol_type="genomic DNA"
> FT                   /db_xref="taxon:9606"
> FT   CDS             join(2079..2171,2294..2515,3371..3499)
> FT                   /db_xref="GDB:119299"
> FT                   /db_xref="GOA:P02100"
> FT                   /db_xref="HGNC:4830"
> FT                   /db_xref="InterPro:IPR000971"
> FT                   /db_xref="InterPro:IPR002337"
> FT                   /db_xref="InterPro:IPR009050"
> FT                   /db_xref="InterPro:IPR012292"
> FT                   /db_xref="PDB:1A9W"
> FT                   /db_xref="UniProtKB/Swiss-Prot:P02100"
> FT                   /protein_id="CAA23766.1"
> FT                   /translation="MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDS
> FT                   FGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENF
> FT                   KLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH"
>
> proposed GFF3 version
>
> V00508 EMBL databank_entry 1 3919 . + . ID=V00508.1;organism=Homo sapiens;mol_type=genomic DNA;db_xref=taxon:9606
> V00508 EMBL biological_region 2079 3499 . + 0 ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_xref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH
> V00508 EMBL CDS 2079 2171 . + 0 Parent=V00508.2
> V00508 EMBL CDS 2294 2515 . + 0 Parent=V00508.2
> V00508 EMBL CDS 3371 3499 . + 0 Parent=V00508.2
>

I was expecting something like this (done by hand) where we follow the
example on http://www.sequenceontology.org/gff3.shtml and have a
single GFF gene feature represented by three lines linked by virtue of
having the same ID:

V00508 EMBL databank_entry 1 3919 . + . ID=V00508.1;organism=Homo sapiens;mol_type=genomic DNA;db_xref=taxon:9606
V00508 EMBL CDS 2079 2171 . + 0 ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_xref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH
V00508 EMBL CDS 2294 2515 . + 0 ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_xref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH
V00508 EMBL CDS 3371 3499 . + 0 ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_xref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH

On the downside, I have repeated all the annotation three times - but
that is what was done in the GFF3 example in the spec.

Perhaps this should be raised on the song-devel mailing list along
with our other GFF3 queries.

Regards,

Peter C.

From pmr at ebi.ac.uk Thu Aug 25 09:52:30 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 25 Aug 2011 14:52:30 +0100
Subject: [emboss-dev] Problems with EMBOSS seqret GenBank to GFF3
In-Reply-To:
References: <4E54D432.8030309@ebi.ac.uk> <4E550E8D.8010506@ebi.ac.uk>
Message-ID: <4E56539E.6030400@ebi.ac.uk>

On 25/08/2011 01:44, Peter Cock wrote:
> On Wed, Aug 24, 2011 at 11:36 AM, Peter Rice wrote:
>>
>> However, as GFF3 is such a pain, I am wondering whether to switch the
>> default feature format to something else - back to GFF2 or maybe to use GTF.
>>
>
> Sadly I have to agree with you - the current version of the GFF3
> spec leaves far too much open to multiple interpretation, as we
> have been discussing on the song-devel mailing lists. I'm not
> sure that GFF2 or GTF are any better though.

GTF is no good for EMBOSS ...
way too picky about start and stop codons If pushed we could read it in using a version of the GTF parser but I see no point trying to write it using data from any source > I was expecting something like this (done by hand) where we follow the > example on http://www.sequenceontology.org/gff3.shtml and have a > single GFF gene feature represented by three lines linked by virtue of > having the same ID: > > > V00508 EMBL databank_entry 1 3919 . + . > ID=V00508.1;organism=Homo sapiens;mol_type=genomic > DNA;db_xref=taxon:9606 > V00508 EMBL CDS 2079 2171 . + 0 > ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_xref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH > V00508 EMBL CDS 2294 2515 . + 0 > ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_xref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH > V00508 EMBL CDS 3371 3499 . 
+ 0 > ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_xref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH > > On the downside, I have repeated all the annotation three times - but > that is what was done in the GFF3 example in the spec. Urgh. How about a gene with 80 exons? That's what I was trying to avoid. How would you plan to read it back in? Transferring all features to the parent perhaps, with checks every time for an existing exact copy? I am less impressed with GFF3 each time I look. I think we'll go with the annotation of the "biological_region" parent and wait for anyone with a use case that actually requires massively replicated annotation. regards, Peter Rice EMBOSS Team From p.j.a.cock at googlemail.com Thu Aug 25 22:27:31 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 26 Aug 2011 03:27:31 +0100 Subject: [emboss-dev] Problems with EMBOSS seqret GenBank to GFF3 In-Reply-To: <4E56539E.6030400@ebi.ac.uk> References: <4E54D432.8030309@ebi.ac.uk> <4E550E8D.8010506@ebi.ac.uk> <4E56539E.6030400@ebi.ac.uk> Message-ID: On Thu, Aug 25, 2011 at 2:52 PM, Peter Rice wrote: > On 25/08/2011 01:44, Peter Cock wrote: >> >> On Wed, Aug 24, 2011 at 11:36 AM, Peter Rice ?wrote: >>> >>> However, as GFF3 is such a pain, I am wondering whether to switch the >>> default feature format to something else - back to GFF2 or maybe to use >>> GTF. >>> >> >> Sadly I have to agree with you - the current version of the GFF3 >> spec leaves far too much open to multiple interpretation, as we >> have been discussing on the song-devel mailing lists. I'm not >> sure that GFF2 or GTF are any better though. > > GTF is no good for EMBOSS ... 
way too picky about start and stop codons
>
> If pushed we could read it in using a version of the GTF parser but I see no
> point trying to write it using data from any source
>
>
>> I was expecting something like this (done by hand) where we follow the
>> example on http://www.sequenceontology.org/gff3.shtml and have a
>> single GFF gene feature represented by three lines linked by virtue of
>> having the same ID:
>>
>> ...
>>
>> On the downside, I have repeated all the annotation three times - but
>> that is what was done in the GFF3 example in the spec.
>
> Urgh. How about a gene with 80 exons? That's what I was trying to avoid.
>
> How would you plan to read it back in? Transferring all features to the
> parent perhaps, with checks every time for an existing exact copy?
>

It would make sense to propose that the first line has all the annotation,
and the subsequent lines from the same feature just need the ID,
and, if it is adopted, the part tag recently discussed on the song-devel
list to make the order of the sub-parts explicit.
http://sourceforge.net/mailarchive/message.php?msg_id=27960475

>
> I am less impressed with GFF3 each time I look.
>

Me too.

>
> I think we'll go with the annotation of the "biological_region" parent and
> wait for anyone with a use case that actually requires massively replicated
> annotation.
>

Have you looked at the BioPerl GenBank to GFF3 conversion?
I understand GBrowse recommends this as a way to get
GenBank format data into GBrowse. I'm also pretty sure that
this is being used inside TogoWS for GenBank/EMBL to GFF3:

http://togows.dbcls.jp/entry/embl/V00508 <-- original EMBL
http://togows.dbcls.jp/entry/embl/V00508.gff <-- as GFF3

Interestingly their GFF3 output is pretty close to your proposed
EMBOSS output, only they've got a "region" rather than
"biological_region" for the parent meta-feature.
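To make the "first line has all the annotation" idea concrete, here is a minimal Python sketch of how a reader might merge GFF3 rows sharing an ID back into one multi-location feature. The names parse_attributes and merge_by_id and the dictionary layout are made up for this example (this is not EMBOSS, BioPerl or Biopython code). It assumes well-formed tab-separated rows and does no URL-decoding of attribute values.

```python
from collections import OrderedDict


def parse_attributes(column9):
    """Split a GFF3 attribute column into ordered (tag, value) pairs."""
    pairs = []
    for field in column9.strip().split(";"):
        if field:
            tag, _, value = field.partition("=")
            pairs.append((tag, value))
    return pairs


def merge_by_id(lines):
    """Group tab-separated GFF3 rows that share an ID into single features.

    Hypothetical helper: each returned dict holds the feature type, strand,
    a list of (start, end) parts in file order, and the union of attributes.
    """
    features = OrderedDict()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # sequence data follows, no more feature rows
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        attrs = parse_attributes(cols[8])
        feature_id = dict(attrs).get("ID")
        # Rows without an ID are never merged; give each its own key.
        key = feature_id if feature_id is not None else ("no-id", len(features))
        feature = features.setdefault(
            key, {"type": cols[2], "strand": cols[6], "parts": [], "attributes": []}
        )
        feature["parts"].append((int(cols[3]), int(cols[4])))
        for pair in attrs:
            if pair not in feature["attributes"]:
                feature["attributes"].append(pair)  # keep the first copy only
    return list(features.values())


# First row carries the annotation, later rows only repeat the ID.
rows = [
    "V00508\tEMBL\tCDS\t2079\t2171\t.\t+\t0\tID=V00508.2;protein_id=CAA23766.1",
    "V00508\tEMBL\tCDS\t2294\t2515\t.\t+\t0\tID=V00508.2",
    "V00508\tEMBL\tCDS\t3371\t3499\t.\t+\t0\tID=V00508.2",
]
merged = merge_by_id(rows)
print(merged[0]["parts"])
```

The same loop also copes with the fully replicated annotation, since an attribute pair already seen for that ID is simply skipped. That is the "check for an existing exact copy" cost raised above, paid once per attribute per row.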
However, I think introducing extra biological_region features to act as the parent of multi-location features would run counter to the canonical gene model given in the GFF3 specification (which appears to be just a suggestion rather than a requirement). Also, introducing this meta-feature would complicate any future wish to try to express explicit parent/child relationships between operon, gene, mRNA and CDS features. Of course, as we've discussed, these biological relationships are only implicit in the GenBank/EMBL feature table. This is probably a good example to discuss on the GFF3 song-devel mailing list - small and apparently very simple except for how to represent the (forward strand) join location. Peter C. From pmr at ebi.ac.uk Tue Aug 30 11:48:25 2011 From: pmr at ebi.ac.uk (Peter Rice) Date: Tue, 30 Aug 2011 16:48:25 +0100 Subject: [emboss-dev] Problems with EMBOSS seqret GenBank to GFF3 In-Reply-To: References: <4E54D432.8030309@ebi.ac.uk> <4E550E8D.8010506@ebi.ac.uk> <4E56539E.6030400@ebi.ac.uk> Message-ID: <4E5D0649.3010905@ebi.ac.uk> On 08/26/2011 03:27 AM, Peter Cock wrote: > On Thu, Aug 25, 2011 at 2:52 PM, Peter Rice wrote: >> On 25/08/2011 01:44, Peter Cock wrote: > It would make sense to propose that the first line has all the annotation, > and the subsequence lines from the same feature just need the ID, > and if it is adopted the part tag recently discussed on the song-devel > list to make the order of the sub-parts explicit. > http://sourceforge.net/mailarchive/message.php?msg_id=27960475 The part tag is interesting and would map to the internal "exon" attribute in EMBOSS which we reserve for sorting. >> I think we'll go with the annotation of the "biological_region" parent and >> wait for anyone with a use case that actually requires massively replicated >> annotation. >> > > Have you looked at the BioPerl GenBank to GFF3 conversion? > I understand GBrowse recommends this as a way to get > GenBank format data into GBrowse. 
I'm also pretty sure that
> this is being used inside TogoWS for GenBank/EMBL to GFF3:
>
> http://togows.dbcls.jp/entry/embl/V00508 <-- original EMBL
> http://togows.dbcls.jp/entry/embl/V00508.gff <-- as GFF3

Hmmm .... the GFF3 has Parent references to the protein_id, but it
doesn't appear as an ID. I do not like using a second region to put the
description line in. Using the organism as the ID for the source line
also looks odd.

> Interestingly their GFF3 output is pretty close to your proposed
> EMBOSS output, only they've got a "region" rather than
> "biological_region" for the parent meta-feature.

I don't see a parent meta-feature there.

> However, I think introducing extra biological_region features to
> act as the parent of multi-location features would run counter to
> the canonical gene model given in the GFF3 specification (which
> appears to be just a suggestion rather than a requirement).
>
> Also, introducing this meta-feature would complicate any
> future wish to try to express explicit parent/child relationships
> between operon, gene, mRNA and CDS features. Of course, as
> we've discussed, these biological relationships are only implicit
> in the GenBank/EMBL feature table.

I tried the canonical gene example:

##gff-version 3
##sequence-region ctg123 1 9000
ctg123 . gene 1000 9000 . + . ID=gene00001;Name=EDEN
ctg123 . TF_binding_site 1000 1012 . + . ID=tfbs00001;Parent=gene00001
ctg123 . mRNA 1050 9000 . + . ID=mRNA00001;Parent=gene00001;Name=EDEN.1
ctg123 . five_prime_UTR 1050 1200 . + . Parent=mRNA00001
ctg123 . CDS 1201 1500 . + 0 ID=cds00001;Parent=mRNA00001
ctg123 . CDS 3000 3902 . + 0 ID=cds00001;Parent=mRNA00001
ctg123 . CDS 5000 5500 . + 0 ID=cds00001;Parent=mRNA00001
ctg123 . CDS 7000 7600 . + 0 ID=cds00001;Parent=mRNA00001
ctg123 . three_prime_UTR 7601 9000 . + . Parent=mRNA00001
ctg123 . cDNA_match 1050 1500 5.8e-42 + . ID=match00001;Target=cdna0123+12+462
ctg123 . cDNA_match 5000 5500 8.1e-43 + . ID=match00001;Target=cdna0123+463+963
ctg123 . cDNA_match 7000 9000 1.4e-40 + . ID=match00001;Target=cdna0123+964+2964
##FASTA
>ctg123
cttctgggcgtacccgattctcggagaacttgccgcaccattccgccttg
tgttcattgctgcctgcatgttcattgtctacctcggctacgtgtggcta
tctttcctcggtgccctcgtgcacggagtcgagaaaccaaagaacaaaaa
aagaaattaaaatatttattttgctgtggtttttgatgtgtgttttttat
aatgatttttgatgtgaccaattgtacttttcctttaaatgaaatgtaat
cttaaatgtatttccgacgaattcgaggcctgaaaagtgtgacgccattc
gtatttgatttgggtttactatcgaataatgagaattttcaggcttaggc
ttaggcttaggcttaggcttaggcttaggcttaggcttaggcttaggctt
aggcttaggcttaggcttaggcttaggcttaggcttaggcttaggcttag
aatctagctagctatccgaaattcgaggcctgaaaagtgtgacgccattc
>cdna0123
ttcaagtgctcagtcaatgtgattcacagtatgtcaccaaatattttggc
agctttctcaagggatcaaaattatggatcattatggaatacctcggtgg
aggctcagcgctcgatttaactaaaagtggaaagctggacgaaagtcata
tcgctgtgattcttcgcgaaattttgaaaggtctcgagtatctgcatagt
gaaagaaaaatccacagagatattaaaggagccaacgttttgttggaccg
tcaaacagcggctgtaaaaatttgtgattatggttaaagg

I can now (code not yet checked in) reproduce this, subject to the
sequence being too short.

Internally, EMBOSS generates parent features for CDS and cDNA_match
(where several features share an ID), and the parent structure is
preserved. On output, the generated features are not reported so GFF3
input is identical.

If we read EMBL/GenBank entries then we will generate a parent feature
with type "biological_region" to attach the annotation from the join.

Reproducing the "parent" relationships is a separate exercise that could
be a separate application. In terms of reading one format and writing
another I prefer to not generate any GFF3-specific extras.

> This is probably a good example to discuss on the GFF3
> song-devel mailing list - small and apparently very simple
> except for how to represent the (forward strand) join location.
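For anyone wanting to experiment with the forward-strand join case, a minimal Python sketch of the "biological_region parent plus child rows" layout proposed earlier in this thread might look like the following. The function name join_to_gff3 and its arguments are hypothetical (this is not the EMBOSS implementation). It only handles plain start..end spans on the forward strand, assumes codon_start=1, and writes "." as the phase of the parent row where the proposed EMBOSS output carried 0.

```python
import re


def join_to_gff3(seqid, feature_id, feature_type, location, tags=""):
    """Turn a forward-strand EMBL join() into a parent row plus child rows.

    Hypothetical helper; handles only plain start..end spans.
    """
    spans = [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]
    if not spans:
        raise ValueError("no start..end spans found in %r" % location)
    parent_attrs = "ID=%s;featflags=type:%s" % (feature_id, feature_type)
    if tags:
        parent_attrs += ";" + tags  # EMBL qualifiers stay on the parent
    start = min(s for s, _ in spans)
    end = max(e for _, e in spans)
    rows = ["\t".join([seqid, "EMBL", "biological_region",
                       str(start), str(end), ".", "+", ".", parent_attrs])]
    consumed = 0  # bases covered by earlier segments
    for s, e in spans:
        # GFF3 phase: bases to skip to reach the next codon start (CDS only).
        phase = str((3 - consumed % 3) % 3) if feature_type == "CDS" else "."
        rows.append("\t".join([seqid, "EMBL", feature_type,
                               str(s), str(e), ".", "+", phase,
                               "Parent=" + feature_id]))
        consumed += e - s + 1
    return rows


for row in join_to_gff3("V00508", "V00508.2", "CDS",
                        "join(2079..2171,2294..2515,3371..3499)",
                        "protein_id=CAA23766.1"):
    print(row)
```

Keeping the true type in featflags on the parent is what lets the EMBL feature table be rebuilt on input. The child rows carry nothing but coordinates, phase and the Parent link.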
We could propose something for the
http://www.sequenceontology.org/wiki/index.php/GFF3_best_practices
page to describe how to represent EMBL/GenBank entries in GFF3
(after due discussion on the SONG-devel list)

regards,

Peter Rice
EMBOSS Team

From p.j.a.cock at googlemail.com Tue Aug 2 18:01:54 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 2 Aug 2011 19:01:54 +0100
Subject: [emboss-dev] EMBOSS 6.4.0 using EMBOSS_001 as the ID in ABI files
Message-ID:

Hi EMBOSS folk,

I'm reporting a regression in EMBOSS 6.4.0 spotted by Wibowo Arindrarto
who has been adding ABI support to Biopython.

With EMBOSS 6.3.1 compiled from source on Mac (as an example),

$ seqret -osformat="fastq-sanger" -filter 310.ab1
@D11F
TGATNTTNACNNTTTTGAANCANTGAGTTAATAGCAATNCTTTACNAATAAGAATATACACTTTCTGCTTAGGGATGATAATTGGCAGGCAAGTGAATCCCTGAGCGTGNATTTGATAATGACCTAAATAATGGATGGGGTTTTAATTCCCAGACCTTCCCCTTTTTAANNGGNGGATTANTGGGGGNNNAACNNGGGGGGCCCTTNCCNAAGGGGGAAAAAATTTNAAACCCCCCNAGGNNGGGNAAAAAAAAATTTCCAAATTNCCGGGGTNNCCCCCAANTTTTTNCCGCNGGGAAAANNNNCCCCCCCNGGGNCCCCCCCCNNAAAAAAAAAAAAAAAAACCCCCCCCCCNTTGGGGNGGTNTNCNCCCCCNNANAANNGGGGGNNAAAAAAAAAGGCCCCCCCCAAAAAAAACCCNCNTTCTNNCNNNNNGNNCNGNNCCCCCNNCCNTNTNGGGGGGGGGGGNGGAAAAAAAACCCCTTTNTGNNNANANNAACCCNCTCNTNTTTTTTTTTTTANGNNNNCNNNNCAAAAAAAAANCNCCCCCNNCNNNCNNNCNCCCCNNNNTNAAAANANNAANNNNTTTTTTTNGGGGGGGTGNGCGNCCCNNANCNNNNNNNNGCGNGGNCNCCNNCCCNCNANAAANNNTNTTTTTTTTTTTTTTTNTNNTCNNCCCNNNCCCCNNCCCCCCCCCCCCCNCCNCNNNNNGGGGNNNCGGNNCNNNNNNNCCNTNCTNNANATNCCNTTNNNNNNNNGNNNNNNNNACNNNNNTNNTNNNCNNNNNNNNNNNNNNCNNNNNNCNNCCCNNCANNNNNNNCNNNNNNNNNNNNNNNNNNNNNTCNCTNCNCNCCCCNCCCNNNNNNNG
+
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! With EMBOSS 6.4.0 compiled from source on 64 bit Linux, rather than the expected ID from within the file we get EMBOSS_001, $ seqret -osformat="fastq-sanger" -filter 310.ab1 @EMBOSS_001 TGATNTTNACNNTTTTGAANCANTGAGTTAATAGCAATNCTTTACNAATAAGAATATACACTTTCTGCTTAGGGATGATAATTGGCAGGCAAGTGAATCCCTGAGCGTGNATTTGATAATGACCTAAATAATGGATGGGGTTTTAATTCCCAGACCTTCCCCTTTTTAANNGGNGGATTANTGGGGGNNNAACNNGGGGGGCCCTTNCCNAAGGGGGAAAAAATTTNAAACCCCCCNAGGNNGGGNAAAAAAAAATTTCCAAATTNCCGGGGTNNCCCCCAANTTTTTNCCGCNGGGAAAANNNNCCCCCCCNGGGNCCCCCCCCNNAAAAAAAAAAAAAAAAACCCCCCCCCCNTTGGGGNGGTNTNCNCCCCCNNANAANNGGGGGNNAAAAAAAAAGGCCCCCCCCAAAAAAAACCCNCNTTCTNNCNNNNNGNNCNGNNCCCCCNNCCNTNTNGGGGGGGGGGGNGGAAAAAAAACCCCTTTNTGNNNANANNAACCCNCTCNTNTTTTTTTTTTTANGNNNNCNNNNCAAAAAAAAANCNCCCCCNNCNNNCNNNCNCCCCNNNNTNAAAANANNAANNNNTTTTTTTNGGGGGGGTGNGCGNCCCNNANCNNNNNNNNGCGNGGNCNCCNNCCCNCNANAAANNNTNTTTTTTTTTTTTTTTNTNNTCNNCCCNNNCCCCNNCCCCCCCCCCCCCNCCNCNNNNNGGGGNNNCGGNNCNNNNNNNCCNTNCTNNANATNCCNTTNNNNNNNNGNNNNNNNNACNNNNNTNNTNNNCNNNNNNNNNNNNNNCNNNNNNCNNCCCNNCANNNNNNNCNNNNNNNNNNNNNNNNNNNNNTCNCTNCNCNCCCCNCCCNNNNNNNG + 
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Regards, Peter Cock ---------- Forwarded message ---------- From: Wibowo Arindrarto Date: Sat, Jul 30, 2011 at 8:42 AM Subject: Re: [Biopython-dev] SeqIO Abi Parser To: Peter Cock Cc: biopython-dev at lists.open-bio.org Hi Peter, I've done some more improvements to the code: - I've written the check and unittest for the file handle mode. I've set it so that abi file has to be opened in 'rb' mode, otherwise it'll return an error. While it's ok to open in 'r' mode in python 2 in Linux, it has to be specified as 'rb' in Windows and/or Python 3 for the file to be read correctly. So I decided forcing it to 'rb' is the best. Because of this, I changed 'test_SeqIO.py:503' to include the mode argument when opening. - I've also checked against test_Emboss.py for seqret output, after including the abi format in it. My EMBOSS version is 6.4.0. There was a slight problem with this testing, since for some reason the ID returned by seqret is always "EMBOSS_001". 
Something might be wrong with my EMBOSS installation, since when I previously tested it against 6.1.0, the ID was correct (although the qual values not, so I had to upgrade). As expected, if I comment out the code that tests for sequence id ('test_Emboss.py:168-172') the tests pass. Maybe you could try testing it as well and see if EMBOSS also returns the default id instead of the sample name? - Finally, I did some small cosmetic changes to the code (typos, etc). All changes have been pushed to my github fork. Now I still have time for the weekend to improve whatever needs to be improved :). Regards, --- Wibowo Arindrarto (bow) http://bow.web.id On Fri, Jul 29, 2011 at 18:20, Peter Cock wrote: > > Hi again, > > I had a bit of time this afternoon so I looked at this. > > On Fri, Jul 29, 2011 at 1:14 PM, Peter Cock wrote: > > On Fri, Jul 29, 2011 at 12:34 PM, Wibowo Arindrarto wrote: > >> Hi Peter, > >> Thanks for explaining. I understand why we should stick to the stored > >> sequence id. In this case, we can use the filename as SeqRecord.name as > >> well. Regarding BioPerl, I don't have it installed myself -- but I took a > >> quick look at their source and it seems they also use the stored sequence ID > >> as their main identifier instead of the filename. If the stored sequence ID > >> is not present, it's "(unknown)" in their case. > > > > OK good, that means Biopython, BioPerl and EMBOSS should be > > consistent :) > > I've made that switch, > > >> I'll look on the test_SeqIO.py over the weekend. I think it'll have > >> something to do with some ambiguous dna base stored in the abi files. > >> Regards, > > > > Some of the alphabet stuff is a bit nasty - so please feel free to ask > > or get me to help. > > I've done enough to get the test_SeqIO.py unit test to pass. > > We probably need a check (like in SFF) to check the user hasn't given > a handle opened in text mode. That should probably have a unit test > too. 
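The text-mode check mentioned above could be sketched along these lines in Python. This is only an illustration under stated assumptions, not the code in the Biopython fork: the helper name require_binary_handle is made up, and the test relies on Python 3 behaviour, where a text-mode handle returns str from read() and a binary-mode handle returns bytes.

```python
import io


def require_binary_handle(handle):
    """Raise ValueError if the handle yields text (str) rather than bytes.

    Hypothetical helper; reading zero characters does not move the file
    position, so the handle can still be parsed from the start afterwards.
    """
    if isinstance(handle.read(0), str):
        raise ValueError(
            "ABI files must be opened in binary mode, e.g. open(name, 'rb')"
        )
    return handle


# A binary handle passes straight through ...
require_binary_handle(io.BytesIO(b"ABIF"))

# ... while a text handle is rejected.
try:
    require_binary_handle(io.StringIO("ABIF"))
except ValueError as err:
    print("rejected:", err)
```

A unit test for this only needs the two in-memory handles shown, so it does not depend on the platform's newline handling.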
> > I still haven't cross checked the sequence and PHRED scores from > your code and EMBOSS. > > Anyway - I'll leave the code for you to work on for now... > > Peter From pmr at ebi.ac.uk Tue Aug 2 18:27:07 2011 From: pmr at ebi.ac.uk (Peter Rice) Date: Tue, 02 Aug 2011 19:27:07 +0100 Subject: [emboss-dev] EMBOSS 6.4.0 using EMBOSS_001 as the ID in ABI files In-Reply-To: References: Message-ID: <4E38417B.6000505@ebi.ac.uk> On 02/08/2011 19:01, Peter Cock wrote: > Hi EMBOSS folk, > > I'm reporting a regression in EMBOSS 6.4.0 spotted by Wibowo Arindrarto > who has been adding ABI support to Biopython. > > With EMBOSS 6.4.0 compiled from source on 64 bit Linux, rather > than the expected ID from within the file we get EMBOSS_001, Can you please run with -debug on the command line and send me the seqret.dbg file to see what it thought was in the file regards, Peter Rice EMBOSS team From p.j.a.cock at googlemail.com Wed Aug 3 07:57:01 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 3 Aug 2011 08:57:01 +0100 Subject: [emboss-dev] EMBOSS 6.4.0 using EMBOSS_001 as the ID in ABI files In-Reply-To: <4E38417B.6000505@ebi.ac.uk> References: <4E38417B.6000505@ebi.ac.uk> Message-ID: On Tue, Aug 2, 2011 at 7:27 PM, Peter Rice wrote: > > On 02/08/2011 19:01, Peter Cock wrote: >> >> Hi EMBOSS folk, >> >> I'm reporting a regression in EMBOSS 6.4.0 spotted by Wibowo Arindrarto >> who has been adding ABI support to Biopython. 
>> >> With EMBOSS 6.4.0 compiled from source on 64 bit Linux, rather >> than the expected ID from within the file we get EMBOSS_001, > > Can you please run with -debug on the command line and send me the > seqret.dbg file to see what it thought was in the file No problem - sent directly to Peter R, Peter From ajb at ebi.ac.uk Thu Aug 11 13:22:25 2011 From: ajb at ebi.ac.uk (ajb at ebi.ac.uk) Date: Thu, 11 Aug 2011 14:22:25 +0100 (BST) Subject: [emboss-dev] EMBOSS and mEMBOSS bug-fixes for 6.4.0 released Message-ID: <53905.82.26.12.214.1313068945.squirrel@imap04.ebi.ac.uk> New bug-fix files are available for EMBOSS-6.4.0 and, for Windows users, a new version of mEMBOSS is available. The bugs fixed are appended for easy reference. 1) UNIX As usual, the most convenient way of applying the bug-fixes should be to apply the patch file: ftp://emboss.open-bio.org/pub/EMBOSS/fixes/patches/patch-1-11.gz to a freshly extracted copy of the EMBOSS-6.4.0.tar.gz source code and recompiling/installing. (see ftp://emboss.open-bio.org/pub/EMBOSS/fixes/patches/README.patch for instructions on using 'patch'). Alternatively, you can individually copy the patched files from the ftp://emboss.open-bio.org/pub/EMBOSS/fixes/ directory if your system does not support 'patch'. 2) mEMBOSS The new version incorporates all the bug-fixes listed below. Uninstall your previous mEMBOSS installation and download and install the new setup file from: ftp://emboss.open-bio.org/pub/EMBOSS/windows/mEMBOSS-6.4.0.2-setup.exe Alan ----------------------------------------------------------------------- Fix 1. EMBOSS-6.4.0/emboss/dbiflat.c EMBOSS-6.4.0/emboss/dbxflat.c 10 Aug 2011: The SwissProt description line format includes additional tags which interfere with the EMBL parser used in previous releases. The fix replaces this with a SwissProt parser that strips out the extra tags. After patching the release, any existing SwissProt description index files should be reindexed. Other indexes are unchanged. 
Fix 2. EMBOSS-6.4.0/ajax/core/ajquery.c

10 Aug 2011: For databases with more than one valid format (examples
include the EBI dbfetch server) this fix allows the format to be
specified with a qualifier on the command line. In the original
release, only a format in the query string was used.

Fix 3. EMBOSS-6.4.0/ajax/core/ajfeatread.c

10 Aug 2011: When parsing GFF3 format input, long feature tags (for
example extremely long translations) exceeded limits in regular
expression parsing. This fix decouples testing for escaped quotes from
the main task of finding quoted strings.

Fix 4. EMBOSS-6.4.0/emboss/data/Etcode.dat

10 Aug 2011: The local data file used by application tcode had a
missing parameter line.

Fix 5. EMBOSS-6.4.0/ajax/core/ajrange.c

10 Aug 2011: When sequence ranges (and possible highlighting for
showalign) were in a list file, the parser overwrote string values.

Fix 5. EMBOSS-6.4.0/ajax/core/ajseqabi.c

10 Aug 2011: Sample names in ABI format files were stored in
incompletely defined strings. This fix corrects the string object. The
sample name is also used as the sequence name.

Fix 6. EMBOSS-6.4.0/emboss/dbxresource.c

10 Aug 2011: A future change to the format of Data Resource Catalogue
entries in DRCAT.dat requires an update to the parsing of category
lines. The current version is not affected.

Fix 7. EMBOSS-6.4.0/emboss/server.ensemblgenomes
       EMBOSS-6.4.0/emboss/cacheensembl.c
       EMBOSS-6.4.0/ajax/ensembl/ensregistry.c
       EMBOSS-6.4.0/ajax/ensembl/ensregistry.c
       EMBOSS-6.4.0/ajax/ensembl/ensdatabaseadaptor.c
       EMBOSS-6.4.0/ajax/ensembl/ensdatabaseadaptor.h

10 Aug 2011: Microbial genomes use an enumerated species code which
must be added to the query for data retrieval. This fix adds the
species code to the comment field. In the next release a more complete
solution will be implemented.

Fix 8. EMBOSS-6.4.0/ajax/core/ajarch.h

10-Aug-2011: Corrects the size of long integers on Windows systems only.

Fix 9.
EMBOSS-6.4.0/emboss/cirdna.c 10-Aug-2011: Cirdna prints text inside solid blocks invisibly. When printed outside the text scaling was too small. The text scale is now adjusted for the radius and sequence length so that labels should be readable outside the box. Fix 10. EMBOSS-6.4.0/ajax/core/ajpat.c 10-Aug-2011: Fuzznuc, fuzzpro and fuzztran using a pattern file ignored the command line -mismatch qualifier for the first pattern. The default mismatch is now set to this value at the start of the pattern matching loop in the library. Fix 11. EMBOSS-6.4.0/ajax/core/ajfmt.c 11-Aug-2011: The function ajFmtScanF() handled va_list incorrectly. Only potentially affected code developers. From ajb at ebi.ac.uk Thu Aug 11 15:58:25 2011 From: ajb at ebi.ac.uk (ajb at ebi.ac.uk) Date: Thu, 11 Aug 2011 16:58:25 +0100 (BST) Subject: [emboss-dev] EMBOSS and mEMBOSS bug-fixes for 6.4.0 released Message-ID: <49005.82.26.12.214.1313078305.squirrel@imap04.ebi.ac.uk> UNIX users who downloaded the bug-fix patch file for EMBOSS earlier this afternoon may have found that there were compilation problems on a limited number of architectures. The patch has been amended slightly to hopefully fix this problem so please download it again if you were affected. If anyone continues to experience compilation problems then please let me know. Alan From p.j.a.cock at googlemail.com Tue Aug 16 15:03:26 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 16 Aug 2011 16:03:26 +0100 Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO) Message-ID: Dear Peter R. (et al.), I recall from one of our chats in person that EMBOSS has some mapping tables to convert the various different data file format's feature names into a common standard (the Sequence Ontology?), for the purpose of inter-converting files. e.g. Converting a UniProt/ SwissProt plain text protein file into a GenPept protein file or GFF3 Is that a fair summary? 
It seems to match the minutes of this meeting (found with Google) http://emboss.sourceforge.net/meetings/2009-02-16.html > DASGFF requires a sequence ontology (or BioSapiens > ontology) tag for protein features. Peter has updated the > Efeatures definitions for proteins to use GFF3 sequence > ontology codes as internal identifiers, and to use GFF3 > as the principle definitions for all protein features. All > SwissProt feature types (36 in the current Swissprot > release) are also defined with the closest possible match > to the sequence ontology. Where there is no exact match, > an EMBOSS internal type is defined using the closets SO > code and the original feature type as a suffix. For SwissProt > output this is converted back to the swissprot feature type. > For GFF3 output the internal type is an alias for the closest > (more general) SO term. Can you point me at these mapping tables in the EMBOSS source code please? I'm particularly interested in the SwissProt to SO mapping right now. Thanks. Peter C. From pmr at ebi.ac.uk Tue Aug 16 15:26:51 2011 From: pmr at ebi.ac.uk (Peter Rice) Date: Tue, 16 Aug 2011 16:26:51 +0100 Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO) In-Reply-To: <4E4A8BF7.4020106@ebi.ac.uk> References: <4E4A8BF7.4020106@ebi.ac.uk> Message-ID: <4E4A8C3B.8030306@ebi.ac.uk> On 08/16/2011 04:03 PM, Peter Cock wrote: > Dear Peter R. (et al.), > > I recall from one of our chats in person that EMBOSS has some > mapping tables to convert the various different data file format's > feature names into a common standard (the Sequence Ontology?), > for the purpose of inter-converting files. e.g. Converting a UniProt/ > SwissProt plain text protein file into a GenPept protein file or GFF3 > > Is that a fair summary? Yes, We needed an internal identifier for feature types, and picked SO for nucleotides - and then were able to add the protein terms when they became available. 
There are a few made up internal names, with _text after the SO term, that were needed in the early days of the BioSapiens Ontology and some dodgy mapping between SO and EMBL/GenBank for immunoglobulin gene regions, but I believe are no longer used. The first term in the file is defined as the default if nothing is recognized (region or misc_feature) > Can you point me at these mapping tables in the EMBOSS > source code please? emboss/data/Efeatures.embl emboss/data/Efeatures.swiss > I'm particularly interested in the SwissProt to SO mapping > right now. That was originally done by the BioSapiens "Network of excellence" for annotating ENCODE data. They developed the protein features which were then added to the sequence ontology. You can look at SO terms in EMBOSS with: ontoget so:0001094 or ontoget -filter -oformat excel so:0001094 (Hmmm, should do something better for a missing namespace - it was defined as a format for EDAM) Let me know if you spot anything in need of updating. We also have (especially for EMBL) equivalent Etags files listing the available feature qualifiers. regards, Peter Rice EMBOSS Team From p.j.a.cock at googlemail.com Tue Aug 16 15:36:24 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 16 Aug 2011 16:36:24 +0100 Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO) In-Reply-To: <4E4A8C3B.8030306@ebi.ac.uk> References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk> Message-ID: On Tue, Aug 16, 2011 at 4:26 PM, Peter Rice wrote: > Yes, We needed an internal identifier for feature types, and picked SO > for nucleotides - and then were able to add the protein terms when they > became available. > ... Thanks! > > Let me know if you spot anything in need of updating. > I have found three protein features which have been renamed, and one which appears to be wrong... see below. I recently noticed that the UniProt provide GFF3 files, e.g. 
http://www.uniprot.org/uniprot/P99999.gff ======================================== ##gff-version 3 ##sequence-region P99999 1 105 P99999 UniProtKB Initiator methionine 1 1 . . . Note=Removed P99999 UniProtKB Chain 2 105 . . . ID=PRO_0000108218;Note=Cytochrome c P99999 UniProtKB Metal binding 19 19 . . . Note=Iron (heme axial ligand) P99999 UniProtKB Metal binding 81 81 . . . Note=Iron (heme axial ligand) P99999 UniProtKB Binding site 15 15 . . . Note=Heme (covalent) P99999 UniProtKB Binding site 18 18 . . . Note=Heme (covalent) P99999 UniProtKB Modified residue 2 2 . . . Note=N-acetylglycine P99999 UniProtKB Modified residue 49 49 . . . Note=Phosphotyrosine;Status=By similarity P99999 UniProtKB Modified residue 98 98 . . . Note=Phosphotyrosine;Status=By similarity P99999 UniProtKB Natural variant 42 42 . . . ID=VAR_044450;Note=In THC4%3B increases the pro-apoptotic function by triggering caspase activation more efficiently than wild-type%3B does not affect the redox function. P99999 UniProtKB Natural variant 56 56 . . . ID=VAR_048850 P99999 UniProtKB Natural variant 66 66 . . . ID=VAR_002204;Note=In 10%25 of the molecules. P99999 UniProtKB Sequence conflict 18 18 . . . . P99999 UniProtKB Sequence conflict 41 41 . . . . P99999 UniProtKB Helix 4 14 . . . . P99999 UniProtKB Turn 16 18 . . . . P99999 UniProtKB Beta strand 23 25 . . . . P99999 UniProtKB Beta strand 28 30 . . . . P99999 UniProtKB Turn 36 38 . . . . P99999 UniProtKB Helix 51 56 . . . . P99999 UniProtKB Helix 62 70 . . . . P99999 UniProtKB Helix 72 75 . . . . P99999 UniProtKB Helix 89 102 . . . . ======================================== However, they are not using Sequence Ontology terms in column three and so fail the online GFF3 validator http://modencode.oicr.on.ca/cgi-bin/validate_gff3_online listed in http://www.sequenceontology.org/gff3.shtml (GFF3 specification currently at v1.20). 
Additionally that UniProt GFF3 uses an upper case reserved tag, "Status" rather than perhaps "status", in the modified residue features. I will report this to UniProt later. However, first I thought I would try converting one of the other files provided into GFF3 using EMBOSS seqret for an alternative, e.g. the plain text "swiss" format: http://www.uniprot.org/uniprot/P99999.txt I can convert this using seqret as follows: ======================================== $ seqret -feature -osformat=gff3 -sformat=swiss -sequence P99999.txt -stdout -auto ##gff-version 3 ##sequence-region CYC_HUMAN 1 105 #!Date 2011-08-16 #!Type Protein #!Source-version EMBOSS 6.4.0.0 CYC_HUMAN SWISSPROT cleaved_initiator_methionine 1 1 . + . ID=CYC_HUMAN.1;note=Removed CYC_HUMAN SWISSPROT mature_protein_region 2 105 . + . ID=CYC_HUMAN.2;note=Cytochrome c;ftid=PRO_0000108218 CYC_HUMAN SWISSPROT metal_binding 19 19 . + . ID=CYC_HUMAN.3;note=Iron;comment=heme axial ligand CYC_HUMAN SWISSPROT metal_binding 81 81 . + . ID=CYC_HUMAN.4;note=Iron;comment=heme axial ligand CYC_HUMAN SWISSPROT binding_site 15 15 . + . ID=CYC_HUMAN.5;note=Heme;comment=covalent CYC_HUMAN SWISSPROT binding_site 18 18 . + . ID=CYC_HUMAN.6;note=Heme;comment=covalent CYC_HUMAN SWISSPROT protein_modification_categorized_by_chemical_process 2 2 . + . ID=CYC_HUMAN.7;note=N-acetylglycine CYC_HUMAN SWISSPROT protein_modification_categorized_by_chemical_process 49 49 . + . ID=CYC_HUMAN.8;note=Phosphotyrosine;comment=By similarity CYC_HUMAN SWISSPROT protein_modification_categorized_by_chemical_process 98 98 . + . ID=CYC_HUMAN.9;note=Phosphotyrosine;comment=By similarity CYC_HUMAN SWISSPROT natural_variant 42 42 . + . ID=CYC_HUMAN.10;note=G -> S;comment=in THC4%3B increases the pro- apoptotic function by triggering caspase activation more efficiently than wild- type%3B does not affect the redox function;ftid=VAR_044450 CYC_HUMAN SWISSPROT natural_variant 56 56 . + . 
ID=CYC_HUMAN.11;note=K -> R;comment=in dbSNP:rs11548795;ftid=VAR_048850 CYC_HUMAN SWISSPROT natural_variant 66 66 . + . ID=CYC_HUMAN.12;note=M -> L;comment=in 10%25 of the molecules;ftid=VAR_002204 CYC_HUMAN SWISSPROT sequence_conflict 18 18 . + . ID=CYC_HUMAN.13;note=C -> Y;comment=in Ref. 8%3B AAH15130 CYC_HUMAN SWISSPROT sequence_conflict 41 41 . + . ID=CYC_HUMAN.14;note=T -> I;comment=in Ref. 8%3B AAH68464 CYC_HUMAN SWISSPROT alpha_helix 4 14 . + . ID=CYC_HUMAN.15 CYC_HUMAN SWISSPROT turn 16 18 . + . ID=CYC_HUMAN.16 CYC_HUMAN SWISSPROT beta_strand 23 25 . + . ID=CYC_HUMAN.17 CYC_HUMAN SWISSPROT beta_strand 28 30 . + . ID=CYC_HUMAN.18 CYC_HUMAN SWISSPROT turn 36 38 . + . ID=CYC_HUMAN.19 CYC_HUMAN SWISSPROT alpha_helix 51 56 . + . ID=CYC_HUMAN.20 CYC_HUMAN SWISSPROT alpha_helix 62 70 . + . ID=CYC_HUMAN.21 CYC_HUMAN SWISSPROT alpha_helix 72 75 . + . ID=CYC_HUMAN.22 CYC_HUMAN SWISSPROT alpha_helix 89 102 . + . ID=CYC_HUMAN.23 ##FASTA >CYC_HUMAN P99999 Cytochrome c MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIW GEDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE ======================================== Interestingly EMBOSS includes the sequence at the bottom (using the FASTA directive) and has generated unique ID tags for each feature. It has also added more note tags. Unfortunately this also failed the GFF3 validation. The EMBOSS output does a lot better (e.g. "cleaved_initiator_methionine" is valid while "Initiator methionine" in the UniProt file was not) However, some of the terms in column 3 are apparently out of date - but http://www.sequenceontology.org does list them as synonyms: * metal_binding -> polypeptide_metal_contact * natural_variant -> natural_variant_site * turn -> polypeptide_turn_motif It looks like the EMBOSS sequence ontology table may need updating for at least these three cases. Finally protein_modification_categorized_by_chemical_process does not seem to be valid (I failed to find it in the ontology). 
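The three renames listed above can be sketched as a small post-processing step in Python. This is only an illustration, not how EMBOSS maintains its Efeatures tables: the names SO_RENAMES and update_type are made up for this example, and it assumes tab-separated nine-column GFF3 feature lines.

```python
# Hypothetical helper: replace the three outdated column 3 terms named in
# this message with the current Sequence Ontology names.
SO_RENAMES = {
    "metal_binding": "polypeptide_metal_contact",
    "natural_variant": "natural_variant_site",
    "turn": "polypeptide_turn_motif",
}


def update_type(line):
    """Return a GFF3 line with column 3 renamed where a newer term is known."""
    # Directives, comments and blank lines are passed through untouched.
    if line.startswith("#") or not line.strip():
        return line
    cols = line.split("\t")
    # Only genuine nine-column feature lines are edited, so embedded FASTA
    # records after ##FASTA are left alone.
    if len(cols) == 9:
        cols[2] = SO_RENAMES.get(cols[2], cols[2])
    return "\t".join(cols)


if __name__ == "__main__":
    example = "CYC_HUMAN\tSWISSPROT\tturn\t16\t18\t.\t+\t.\tID=CYC_HUMAN.16"
    print(update_type(example))
```

Terms that are not in the table, such as alpha_helix, pass through unchanged, so the sketch is safe to run over a whole file line by line.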
Additionally the validator complained about some of the note in Line 15, probably due to the %3B escaped semi-colon, but that may be a bug in the validator. Peter C. From p.j.a.cock at googlemail.com Tue Aug 16 18:39:05 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 16 Aug 2011 19:39:05 +0100 Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO) In-Reply-To: References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk> Message-ID: On Tue, Aug 16, 2011 at 4:36 PM, Peter Cock wrote: > > I recently noticed that the UniProt provide GFF3 files, > e.g. http://www.uniprot.org/uniprot/P99999.gff > > ... > http://www.uniprot.org/uniprot/P99999.txt > ... > > $ seqret -feature -osformat=gff3 -sformat=swiss -sequence P99999.txt I also noticed the seqret GFF3 output is using "+" as the strand, which is wrong for a protein reference like this. It should be using "." (period) as the features on a protein are strand-less (as done in the UniProt GFF3 file). Regards, Peter C. From p.j.a.cock at googlemail.com Wed Aug 17 10:37:06 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 17 Aug 2011 11:37:06 +0100 Subject: [emboss-dev] Problems with EMBOSS seqret GenBank to GFF3 Message-ID: Hi again Peter R. (et al.), Following yesterday's discussion about GFF3 files from UniProt, I'm trying seqret to produce GFF3 from GenBank files. 
I'd already found the NCBI currently provides some very broken GFF3 files: http://blastedbio.blogspot.com/2011/08/why-are-ncbi-gff3-files-still-broken.html $ wget ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans_Kin4_M_uid58009/NC_005213.gff $ wget ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans_Kin4_M_uid58009/NC_005213.gbk $ seqret --version EMBOSS:6.4.0.0 $ seqret -filter -feature -sequence NC_005213.gbk -sformat=genbank -osformat=gff3 | head -n 20 ##gff-version 3 ##sequence-region NC_005213 1 490885 #!Date 2011-08-17 #!Type DNA #!Source-version EMBOSS 6.4.0.0 NC_005213 EMBL databank_entry 1 490885 . + . ID=NC_005213.1;organism=Nanoarchaeum equitans Kin4-M;mol_type=genomic DNA;strain=Kin4-M;db_xref=taxon:228908 NC_005213 EMBL gene 3254 35301 . + . ID=NC_005213.2;locus_tag=NEQ_t01;experiment=experimental evidence%2C no additional details recorded;trans_splicing=true;db_xref=GeneID:3362429 NC_005213 EMBL gene 35233 35301 . + . Parent=NC_005213.2 NC_005213 EMBL gene 3254 3289 . + . Parent=NC_005213.2 NC_005213 EMBL tRNA 3254 35287 . + . ID=NC_005213.5;locus_tag=NEQ_t01;product=tRNA-Met;experiment=experimental evidence%2C no additional details recorded;trans_splicing=true;db_xref=GeneID:3362429 NC_005213 EMBL tRNA 35249 35287 . + . Parent=NC_005213.5 NC_005213 EMBL tRNA 3254 3289 . + . Parent=NC_005213.5 NC_005213 EMBL gene 1 490885 . - . ID=NC_005213.8;locus_tag=NEQ001;db_xref=GeneID:2732620 NC_005213 EMBL gene 490883 490885 . - . Parent=NC_005213.8 NC_005213 EMBL gene 1 879 . - . Parent=NC_005213.8 NC_005213 EMBL CDS 1 490885 . 
- 0 ID=NC_005213.11;locus_tag=NEQ001;note=conserved hypothetical [Methanococcus jannaschii]%3B COG1583:Uncharacterized ACR%3B IPR001472:Bipartite nuclear localization signal%3B IPR002743: Protein of unknown function DUF57;codon_start=1;transl_table=11;product=hypothetical protein;protein_id=NP_963295.1;db_xref=GI:41614797;db_xref=GeneID:2732620;translation=MRLLLELKALNSIDKKQLSNYLIQGFIYNILKNTEYSWLHNWKKEKYFNFTLIPKKDIIENKRYYLIISSPDKRFIEVLHNKIKDLDIITIGLAQFQLRKTKKFDPKLRFPWVTITPIVLREGKIVILKGDKYYKVFVKRLEELKKYNLIKKKEPILEEPIEISLNQIKDGWKIIDVKDRYYDFRNKSFSAFSNWLRDLKEQSLRKYNNFCGKNFYFEEAIFEGFTFYKTVSIRIRINRGEAVYIGTLWKELNVYRKLDKEEREFYKFLYDCGLGSLNSMGFGFVNTKKNSAR NC_005213 EMBL CDS 490883 490885 . - 0 Parent=NC_005213.11 NC_005213 EMBL CDS 1 879 . - 0 Parent=NC_005213.11 NC_005213 EMBL sequence_feature 7 879 . - . ID=NC_005213.14;locus_tag=NEQ001;note=CRISPR/Cas system-associated RAMP superfamily protein Cas6%3B Region: Cas6-I-III%3B cl11443;db_xref=CDD:196236 NC_005213 EMBL gene 883 2691 . + . ID=NC_005213.15;locus_tag=NEQ003;db_xref=GeneID:2654355 I've deliberately cut the example here to include all of NEQ_t01, and interesting trans-spliced tRNA, and all of NEQ001, an interesting gene because it spans the origin of this circular genome. I use these examples in the blog post and discuss them again below. Given some of the points below, I suspect EMBOSS is producing GFF3 prior to the additions made in v1.18 (24 June 2010) regarding circular genomes. The following numbering reflects the issues listed on my blog post about the NCBI version of the GFF3 file (link given above). ------------------------------------------ Problem One - Invalid Feature Types EMBOSS looks OK here, you're converting the GenBank feature types source and misc_feature into databank_entry and sequence_feature respectively. ------------------------------------------ Problem Two - Circular features not marked EMBOSS is also lacking in this area. 
EMBOSS has used feature type databank_entry and generated feature ID NC_005213.1 for the landmark. However, this should include the special tag entry Is_circular=true, since this is the landmark feature for the whole circular chromosome. ------------------------------------------ Problem Three - Missing ID tags on multi-location features Unlike the NCBI file which fails to cross link multi-location features like trans-spliced NEQ_t01, EMBOSS looks better. However, I don't think you are following the expected pattern as used in the canonical GFF3 examples. In the GenBank file, this tRNA is join(35233..35301,3254..3289) For the gene and tRNA features for NEQ_t01, EMBOSS is generating three GFF3 lines. First a very broad parent feature 3254 to 35301, then two children 35233 to 35301 and 3254 to 3289. I would expect two GFF3 lines (for each of gene and tRNA), just 35233 to 35301 and 3254 to 3289 which would be linked by virtue of having the same ID. The online GFF3 validator would seem to support my interpretation, reporting errors like this: 8 [ERROR] invalid type pair - check all parents (at line 7; gene to gene) 11 [ERROR] invalid type pair - check all parents (at line 10; tRNA to tRNA) 14 [ERROR] invalid type pair - check all parents (at line 13; gene to gene) 17 [ERROR] invalid type pair - check all parents (at line 16; CDS to CDS) 28 [ERROR] invalid type pair - check all parents (at line 27; sequence_feature to sequence_feature) This is related to "Problem Six" and "Problem Seven" below. ------------------------------------------ Problem Four - Wrong tag for database cross references I had noticed the NCBI using a local tag (lower case) db_xref rather than the standard (upper case = reserved) tag Dbxref. EMBOSS does the same - is this deliberate and if so why? 
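To illustrate what using the reserved tag could look like, here is a rough Python sketch that folds the db_xref entries of a column 9 attribute string into a single Dbxref tag. It is not EMBOSS code: the function name convert_db_xref is hypothetical, and it assumes that multiple values of one tag are joined with commas, as the GFF3 specification describes for multi-valued attributes.

```python
def convert_db_xref(attributes):
    """Fold every db_xref=... entry into one Dbxref=... entry (sketch only)."""
    pairs = []
    for item in attributes.split(";"):
        if not item:
            continue
        # partition() tolerates an entry with no '=' (value becomes "").
        key, _, value = item.partition("=")
        pairs.append((key, value))

    xrefs = [value for key, value in pairs if key == "db_xref"]
    kept = ["{}={}".format(key, value) for key, value in pairs if key != "db_xref"]
    if xrefs:
        # Assumption: values such as GeneID:2732620 contain no commas, so
        # they can be joined directly without further escaping.
        kept.append("Dbxref=" + ",".join(xrefs))
    return ";".join(kept)


if __name__ == "__main__":
    print(convert_db_xref("ID=NC_005213.8;locus_tag=NEQ001;db_xref=GeneID:2732620"))
```

Reading would need the reverse mapping, splitting the Dbxref value on commas to recover one db_xref qualifier per cross reference.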
------------------------------------------ Problem Five - Missing stop codon in CDS features EMBOSS looks OK here. ------------------------------------------ Problem Six - Features wrapping the origin of a circular genome Related to the landmark feature lacking the Is_circular=true tag, the gene and CDS features for origin-wrapping NEQ001 look funny to me. EMBOSS seems to be generating three GFF3 lines for the gene and CDS for NEQ001, a surprisingly broad entry 1 to 490885 and two children 490883 to 490885 and 1 to 879 (which do look sensible). This is essentially the same point I raised above with NEQ_t01, but with the added complication of spanning the origin. Based on the old specification, I had expected two GFF3 lines each for the gene and CDS, giving the regions 490883 to 490885 and 1 to 879, linked by virtue of having the same ID. Thankfully this potential confusion has been addressed in the updated specification, so I would expect a single GFF3 line for each of the gene and CDS for NEQ001, using start 490883 and end of 879+490885=491764. ------------------------------------------ Problem Seven - No parent/child relationships The NCBI GFF3 file had no parent/child relationships at all. The EMBOSS 6.4.0 GFF3 file does use parent/child relationships but not in the way I expected (and not in a way the validator likes). As discussed above, for the GenBank join locations EMBOSS seems to create broad parent features with children for each sub-location (parent/child relations of the same type = bad). What I'm expecting instead is parent/child relationships between the CDS and gene features, between tRNA and gene features, etc. Note that these relationships are implicit in the GenBank (and EMBL) flat files, so I accept trying to deduce them might be hard (and perhaps best not doing immediately - the other issues are more pressing).
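For the origin-wrapping case in Problem Six, the coordinate arithmetic can be sketched in Python as follows. This only illustrates the convention from the updated GFF3 specification (add the landmark length to the end so that start <= end still holds); it is not EMBOSS code, and the function name wrap_origin is invented for this example.

```python
def wrap_origin(first, second, seq_length):
    """Collapse a two-part origin-spanning join into one (start, end) pair.

    first  -- (start, end) of the part that runs up to the end of the sequence
    second -- (start, end) of the part that restarts at position 1
    """
    first_start, first_end = first
    second_start, second_end = second
    # Sanity check: the two parts must actually meet at the origin.
    if first_end != seq_length or second_start != 1:
        raise ValueError("parts do not meet at the origin")
    # The end coordinate is pushed past the sequence length.
    return first_start, second_end + seq_length


if __name__ == "__main__":
    # Regions 490883..490885 and 1..879 on the 490885 bp chromosome.
    print(wrap_origin((490883, 490885), (1, 879), 490885))
```

With the regions quoted above this gives start 490883 and end 491764, matching the 879+490885 sum. A reader of such a file would subtract the landmark length again from any end coordinate that exceeds it.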
------------------------------------------ Problem Eight - Invalid tags The online validator complains that EMBOSS too is using EC_number (tags beginning with an uppercase letter are reserved). ------------------------------------------ So my conclusion is that while the EMBOSS generated GFF3 is better than those produced by the NCBI, it still is invalid and needs some work. As usual, I am of course happy to help with testing fixes. And if there are any mistakes in my understanding of the GFF3 spec, please tell me ;) Regards, Peter C. From pmr at ebi.ac.uk Wed Aug 17 15:38:23 2011 From: pmr at ebi.ac.uk (Peter Rice) Date: Wed, 17 Aug 2011 16:38:23 +0100 Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO) In-Reply-To: References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk> Message-ID: <4E4BE06F.9040503@ebi.ac.uk> On 16/08/2011 16:36, Peter Cock wrote: > Interestingly EMBOSS includes the sequence at the bottom > (using the FASTA directive) and has generated unique ID tags > for each feature. It has also added more note tags. The sequence is included if you are writing sequence data. GFF3 allows sequence to be included, so we add it. Using a separate feature file is always awkward for users, but is supported. > Unfortunately this also failed the GFF3 validation. The EMBOSS > output does a lot better (e.g. "cleaved_initiator_methionine" is > valid while "Initiator methionine" in the UniProt file was not) > > However, some of the terms in column 3 are apparently out of > date - but http://www.sequenceontology.org does list them as > synonyms: Thanks. I'll update the table, but synonyms should be acceptable. > Finally protein_modification_categorized_by_chemical_process > does not seem to be valid (I failed to find it in the ontology). Not in SO, but in a separate ontology (MOD). Should also be valid in GFF I believe, but perhaps the parser insists on using SO and excluding related ontologies.
> Additionally the validator complained about some of the note > in Line 15, probably due to the %3B escaped semi-colon, > but that may be a bug in the validator. Interesting. Let me know if we are not escaping the right characters, but I believe we are supposed to escape ';' in those positions. regards, Peter Rice EMBOSS Team From pmr at ebi.ac.uk Wed Aug 17 15:39:39 2011 From: pmr at ebi.ac.uk (Peter Rice) Date: Wed, 17 Aug 2011 16:39:39 +0100 Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO) In-Reply-To: References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk> Message-ID: <4E4BE0BB.70302@ebi.ac.uk> On 16/08/2011 19:39, Peter Cock wrote: > I also noticed the seqret GFF3 output is using "+" as the strand, > which is wrong for a protein reference like this. It should be using > "." (period) as the features on a protein are strand-less (as done > in the UniProt GFF3 file). Thanks. We'll fix it for the next release, but my understanding is it should be acceptable to most parsers. regards, Peter Rice EMBOSS Team From p.j.a.cock at googlemail.com Wed Aug 17 15:48:32 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 17 Aug 2011 16:48:32 +0100 Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO) In-Reply-To: <4E4BE06F.9040503@ebi.ac.uk> References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk> <4E4BE06F.9040503@ebi.ac.uk> Message-ID: On Wed, Aug 17, 2011 at 4:38 PM, Peter Rice wrote: > On 16/08/2011 16:36, Peter Cock wrote: >> >> Interestingly EMBOSS includes the sequence at the bottom >> (using the FASTA directive) and has generated unique ID tags >> for each feature. It has also added more note tags. > > The sequence is included if you are writing sequence data. GFF3 allows > sequence to be included, so we add it. Using a separate feature file is > always awkward for users, but is supported. 
See also the discussion today on gmod-gbrowse / song-devel where it sounds like GFF3 should have a single block of FASTA embedded sequence at the end of the file, rather than interleaved. As I suggest on that thread, the practical solution for EMBOSS seqret might be to omit the FASTA sequence altogether. Or cache them in memory/on disk to write out at the very end of all the features? http://generic-model-organism-system-database.450254.n5.nabble.com/Mailing-list-for-GFF3-specification-discussion-td4707740.html >> Unfortunately this also failed the GFF3 validation. The EMBOSS >> output does a lot better (e.g. "cleaved_initiator_methionine" is >> valid while "Initiator methionine" in the UniProt file was not) >> >> However, some of the terms in column 3 are apparently out of >> date - but http://www.sequenceontology.org does list them as >> synonyms: > > Thanks. I'll update the table, but synonyms should be acceptable. I can see plus points for either view; certainly the validator could downgrade that error to a warning. >> Finally protein_modification_categorized_by_chemical_process >> does not seem to be valid (I failed to find it in the ontology). > > Not in SO, but in a separate ontology (MOD). Should also be valid > in GFF I believe, but perhaps the parser insists on using SO and > excluding related ontologies. OK, but in that case shouldn't you then be declaring this with a ##feature-ontology directive? >> Additionally the validator complained about some of the note >> in Line 15, probably due to the %3B escaped semi-colon, >> but that may be a bug in the validator. > > Interesting. Let me know if we are not escaping the right characters, but I > believe we are supposed to escape ';' in those positions. I haven't checked this aspect carefully (since this is fiddly).
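For reference, the column 9 escaping under discussion can be sketched in Python as below. This reflects a reading of the GFF3 specification (percent-encode ';', '=', '&' and ',' in attribute values, plus '%' itself and control characters such as tab and newline) rather than the EMBOSS implementation, and the function name escape_gff3_value is hypothetical.

```python
# Characters with a reserved meaning inside column 9 attribute values.
RESERVED = ";=&,"


def escape_gff3_value(text):
    """Percent-encode the characters that may not appear raw in a value."""
    out = []
    for ch in text:
        code = ord(ch)
        # '%' must be escaped too, otherwise escapes could not be decoded;
        # control characters (including tab and newline) are escaped as well.
        if ch == "%" or ch in RESERVED or code < 0x20 or code == 0x7F:
            out.append("%{:02X}".format(code))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(escape_gff3_value("In THC4; increases the pro-apoptotic function"))
    print(escape_gff3_value("In 10% of the molecules"))
```

Both example strings come out with the same %3B and %25 escapes seen in the UniProt and seqret output earlier in this thread, which suggests the escaping itself is as intended.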
Peter From p.j.a.cock at googlemail.com Wed Aug 17 15:50:57 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 17 Aug 2011 16:50:57 +0100 Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO) In-Reply-To: <4E4BE0BB.70302@ebi.ac.uk> References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk> <4E4BE0BB.70302@ebi.ac.uk> Message-ID: On Wed, Aug 17, 2011 at 4:39 PM, Peter Rice wrote: > On 16/08/2011 19:39, Peter Cock wrote: >> >> I also noticed the seqret GFF3 output is using "+" as the strand, >> which is wrong for a protein reference like this. It should be using >> "." (period) as the features on a protein are strand-less (as done >> in the UniProt GFF3 file). > > Thanks. > > We'll fix it for the next release, but my understanding is it should be > acceptable to most parsers. > I agree this is pretty harmless - in practice all that really matters is if the strand is "-" or not. Still, it should be straight forward to fix. Peter From pmr at ebi.ac.uk Wed Aug 17 15:52:21 2011 From: pmr at ebi.ac.uk (Peter Rice) Date: Wed, 17 Aug 2011 16:52:21 +0100 Subject: [emboss-dev] Problems with EMBOSS seqret GenBank to GFF3 In-Reply-To: References: Message-ID: <4E4BE3B5.4080601@ebi.ac.uk> On 17/08/2011 11:37, Peter Cock wrote: > Hi again Peter R. (et al.), > > Following yesterday's discussion about GFF3 files from UniProt, > I'm trying seqret to produce GFF3 from GenBank files. I'd already > found the NCBI currently provides some very broken GFF3 files: > > ------------------------------------------ > > Problem Two - Circular features not marked > > EMBOSS is also lacking in this area. > > EMBOSS has used feature type databank_entry and generated feature ID > NC_005213.1 for the landmark. However, this should include the special > tag entry Is_circular=true, since this is the landmark feature for the whole > circular chromosome. Thanks. I'll make sure we add it for the next release. 
> ------------------------------------------ > > Problem Three - Missing ID tags on multi-location features > > Unlike the NCBI file which fails to cross link multi-location features like > trans-spliced NEQ_t01, EMBOSS looks better. However, I don't think > you are following the expected pattern as used in the canonical GFF3 > examples. > > In the GenBank file, this tRNA is join(35233..35301,3254..3289) > > For the gene and tRNA features for NEQ_t01, EMBOSS is generating > three GFF3 lines. First a very broad parent feature 3254 to 35301, > then two children 35233 to 35301 and 3254 to 3289. > > I would expect two GFF3 lines (for each of gene and tRNA), just > 35233 to 35301 and 3254 to 3289 which would be linked by virtue > of having the same ID. EMBOSS is reporting what is stored internally (feature and subfeatures for the exons). Looks like we should skip reporting the parent feature. I'll check what that means for the IDs. > This is related to "Problem Six" and "Problem Seven" below. > > ------------------------------------------ > > Problem Four - Wrong tag for database cross references > > I had noticed the NCBI using a local tag (lower case) db_xref rather > than the standard (upper case = reserved) tag Dbxref. EMBOSS > does the same - is this deliberate and if so why? It is deliberate - we are using the db_xref tag from the EMBL/GenBank feature table. But we could convert to the GFF3 tag (and back again on reading). I'll have a look at how easy that would be. > ------------------------------------------ > > Problem Six - Features wrapping the origin of a circular genome > > Related to the landmark feature lacking the Is_circular=true tag, the > gene and CDS features for origin-wrapping NEQ001 look funny to me. > EMBOSS seems to be generating three GFF3 lines for the gene and CDS > for NEQ001, a surprisingly broad entry 1 to 490885 and two children > 490883 to 490885 and 1 to 879 (which do look sensible). 
> > This is essentially the same point I raised above with NEQ_t01, but > with the added complication of spanning the origin. Ah, something to do with the way start and end positions are stored internally. I'll fix that along with other circular feature issues. > Thankfully this potential confusion has been addressed in the updated > specification, so I would expect a single GFF3 line for each of the gene > and CDS for NEQ001, using start 490883 and end of 879+490885=491764. I'll try to write (and read) that way too. > ------------------------------------------ > > Problem Seven - No parent/child relationships > > The NCBI GFF3 file had no parent/child relationships at all. > > The EMBOSS 6.4.0 GFF3 file does use parent/child relationships > but not in the way I expected (and not in a way the validator likes). > As discussed above, for the GenBank join locations EMBOSS > seems to create broad parent features with children for each > sub-location (parent/child relations of the same type = bad). > > What I'm expecting instead is parent/child relationships between > the CDS and gene features, between tRNA and gene features, etc. > Note that these relationships are implicit in the GenBank (and EMBL) > flat files, so I accept trying to deduce them might be hard (and > perhaps best not doing immediately - the other issues are more > pressing). Could be possible by matching common exons (stored internally as subfeatures). I'll have a look. > ------------------------------------------ > > Problem Eight - Invalid tags > > The online validator complains that EMBOSS too is using EC_number > (tags beginning with an uppercase letter are reserved). Pah! We use the EMBL/GenBank tag names.
Looks like we will have to convert to lower case, so we may as well include that with the db_xref/Dbxref conversion in GFF3 writing and reading. > ------------------------------------------ > > So my conclusion is that while the EMBOSS generated GFF3 is > better than those produced by the NCBI, it still is invalid and needs > some work. > > As usual, I am of course happy to help with testing fixes. And if > there are any mistakes in my understanding of the GFF3 spec, > please tell me ;) Many, many thanks for finding these. EMBOSS feature internals had a major rewrite in 6.4.0 to store exons as subfeatures, which makes all this much easier to handle. regards, Peter Rice EMBOSS Team From pmr at ebi.ac.uk Wed Aug 17 15:55:53 2011 From: pmr at ebi.ac.uk (Peter Rice) Date: Wed, 17 Aug 2011 16:55:53 +0100 Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO) In-Reply-To: References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk> <4E4BE06F.9040503@ebi.ac.uk> Message-ID: <4E4BE489.3040703@ebi.ac.uk> On 17/08/2011 16:48, Peter Cock wrote: > On Wed, Aug 17, 2011 at 4:38 PM, Peter Rice wrote: >> On 16/08/2011 16:36, Peter Cock wrote: >>> >>> Interestingly EMBOSS includes the sequence at the bottom >>> (using the FASTA directive) and has generated unique ID tags >>> for each feature. It has also added more note tags. >> >> The sequence is included if you are writing sequence data. GFF3 allows >> sequence to be included, so we add it. Using a separate feature file is >> always awkward for users, but is supported. > > See also the discussion today on gmod-gbrowse / song-devel where > it sounds like GFF3 should have a single block of FASTA embedded > sequence at the end of the file, rather than interleaved. As I suggest > on that thread, the practical solution for EMBOSS seqret might be to > omit the FASTA sequence altogether. Or cache them in memory/on > disk to write out at the very end of all the features? Thanks.
We already save sequences and write at the end for some formats so I'll add it for GFF3. We will need more work for reading GFF3 input though, but it may not be too bad. If we are reading it as feature input, we don't look for the sequence. If we are reading as sequence input, we need to read all the sequences into memory and then go back to read the features. For streamed input we can buffer to make the rewind work. regards, Peter Rice EMBOSS Team From p.j.a.cock at googlemail.com Wed Aug 17 16:05:13 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 17 Aug 2011 17:05:13 +0100 Subject: [emboss-dev] Problems with EMBOSS seqret GenBank to GFF3 In-Reply-To: <4E4BE3B5.4080601@ebi.ac.uk> References: <4E4BE3B5.4080601@ebi.ac.uk> Message-ID: On Wed, Aug 17, 2011 at 4:52 PM, Peter Rice wrote: > On 17/08/2011 11:37, Peter Cock wrote: >> ------------------------------------------ >> >> Problem Four - Wrong tag for database cross references >> >> I had noticed the NCBI using a local tag (lower case) db_xref rather >> than the standard (upper case = reserved) tag Dbxref. EMBOSS >> does the same - is this deliberate and if so why? > > It is deliberate - we are using the db_xref tag from the EMBL/GenBank > feature table. > > But we could convert to the GFF3 tag (and back again on reading). I'll > have a look at how easy that would be. Do you want to check this one with Lincoln on the song-devel mailing list first - after all, using a lower case tag is quite allowable and valid GFF3. My point is it does seem to be exactly what the reserved tag Dbxref is intended for. >> ------------------------------------------ >> >> Problem Seven - No parent/child relationships >> >> The NCBI GFF3 file had no parent/child relationships at all. >> >> The EMBOSS 6.4.0 GFF3 file does use parent/child relationships >> but not in the way I expected (and not in a way the validator likes). 
>> As discussed above, for the GenBank join locations EMBOSS
>> seems to create broad parent features with children for each
>> sub-location (parent/child relations of the same type = bad).
>>
>> What I'm expecting instead is parent/child relationships between
>> the CDS and gene features, between tRNA and gene features, etc.
>> Note that these relationships are implicit in the GenBank (and EMBL)
>> flat files, so I accept trying to deduce them might be hard (and
>> perhaps best not doing immediately - the other issues are more
>> pressing).
>
> Could be possible by matching common exons (stored internally as
> subfeatures). I'll have a look.

Usually yes, but not all the time. I've seen GenBank files where the gene and CDS features have slightly different locations, which makes doing this automatically hard. Off the top of my head this was a programmed frame shift example... I'll see if I can find you a specific example.

>> ------------------------------------------
>>
>> So my conclusion is that while the EMBOSS generated GFF3 is
>> better than those produced by the NCBI, it still is invalid and needs
>> some work.
>>
>> As usual, I am of course happy to help with testing fixes. And if
>> there are any mistakes in my understanding of the GFF3 spec,
>> please tell me ;)
>
> Many, many thanks for finding these.

I've come to value NC_005213.gbk as a reasonably small circular genome with some rather complicated annotation - it's one of my favourite test cases.

> EMBOSS feature internals had a major rewrite in 6.4.0 to store exons as
> subfeatures, which makes all this much easier to handle.

Oh good - that restructuring should now pay dividends :)

Peter C.
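
A minimal sketch of the db_xref/Dbxref conversion discussed under Problem Four, written in Python rather than EMBOSS C. It assumes column 9 is a plain tag=value;tag=value string whose values are already escaped, and that several Dbxref values are joined with commas as the GFF3 specification describes. The function names are hypothetical; this is not the EMBOSS implementation.

```python
def to_gff3_dbxref(attributes):
    """Collect EMBL/GenBank-style db_xref tags into one GFF3 Dbxref tag."""
    xrefs = []
    others = []
    for field in attributes.split(";"):
        if not field:
            continue
        tag, _, value = field.partition("=")
        if tag == "db_xref":
            xrefs.append(value)
        else:
            others.append(field)
    if xrefs:
        # One reserved tag, comma-separated values, appended after the other tags.
        others.append("Dbxref=" + ",".join(xrefs))
    return ";".join(others)


def from_gff3_dbxref(attributes):
    """Reverse step on reading: expand Dbxref back into repeated db_xref tags."""
    fields = []
    for field in attributes.split(";"):
        if not field:
            continue
        tag, _, value = field.partition("=")
        if tag == "Dbxref":
            fields.extend("db_xref=" + v for v in value.split(","))
        else:
            fields.append(field)
    return ";".join(fields)


if __name__ == "__main__":
    embl_style = "ID=NC_005213.11;db_xref=GI:41614797;db_xref=GeneID:2732620"
    gff3_style = to_gff3_dbxref(embl_style)
    print(gff3_style)
    print(from_gff3_dbxref(gff3_style))
```

Note that the sketch moves the cross references to the end of the attribute list, so a round trip only reproduces the original tag order when db_xref was already last.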
From p.j.a.cock at googlemail.com Wed Aug 17 16:07:54 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 17 Aug 2011 17:07:54 +0100
Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO)
In-Reply-To: <4E4BE489.3040703@ebi.ac.uk>
References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk> <4E4BE06F.9040503@ebi.ac.uk> <4E4BE489.3040703@ebi.ac.uk>
Message-ID:

On Wed, Aug 17, 2011 at 4:55 PM, Peter Rice wrote:
> On 17/08/2011 16:48, Peter Cock wrote:
>> See also the discussion today on gmod-gbrowse / song-devel where
>> it sounds like GFF3 should have a single block of FASTA embedded
>> sequence at the end of the file, rather than interleaved. As I suggest
>> on that thread, the practical solution for EMBOSS seqret might be to
>> omit the FASTA sequence altogether. Or cache them in memory/on
>> disk to write out at the very end of all the features?
>
> Thanks. We already save sequences and write at the end for some
> formats so I'll add it for GFF3. We will need more work for reading
> GFF3 input though, but it may not be too bad.
>
> If we are reading it as feature input, we don't look for the sequence.
>
> If we are reading as sequence input, we need to read all the sequences
> into memory and then go back to read the features. For streamed input
> we can buffer to make the rewind work.

I'm curious what other file formats needed this kind of work. But it is good that you've already got some buffer/cache infrastructure in place. Does it boil down to writing temp files in /tmp ?

Peter C.
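
The "save sequences and write at the end" idea can be sketched as follows. This is an illustration in Python, not EMBOSS code: the class name Gff3Writer and its methods are hypothetical, sequences are simply held in memory, and the 60-column FASTA wrapping is an arbitrary choice.

```python
import io


class Gff3Writer:
    """Write feature lines immediately; hold sequences for one ##FASTA block."""

    def __init__(self, handle):
        self.handle = handle
        self.saved = []  # (identifier, sequence) pairs kept in memory
        self.handle.write("##gff-version 3\n")

    def write_record(self, seq_id, sequence, feature_lines):
        # Features go straight to the output ...
        for line in feature_lines:
            self.handle.write(line + "\n")
        # ... while the sequence is only remembered for later.
        self.saved.append((seq_id, sequence))

    def close(self):
        # A single FASTA section at the very end, after all the features.
        if self.saved:
            self.handle.write("##FASTA\n")
            for seq_id, sequence in self.saved:
                self.handle.write(">" + seq_id + "\n")
                for i in range(0, len(sequence), 60):
                    self.handle.write(sequence[i:i + 60] + "\n")
        self.saved = []


if __name__ == "__main__":
    out = io.StringIO()
    writer = Gff3Writer(out)
    writer.write_record("seq1", "ACGT", ["seq1\t.\tgene\t1\t4\t.\t+\t.\tID=g1"])
    writer.write_record("seq2", "TTGA", ["seq2\t.\tgene\t1\t4\t.\t+\t.\tID=g2"])
    writer.close()
    print(out.getvalue(), end="")
```

For very large inputs the in-memory list could be swapped for a temporary file, which is the trade-off raised in the question about /tmp.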
From pmr at ebi.ac.uk Wed Aug 17 16:14:15 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 17 Aug 2011 17:14:15 +0100
Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO)
In-Reply-To:
References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk> <4E4BE06F.9040503@ebi.ac.uk> <4E4BE489.3040703@ebi.ac.uk>
Message-ID: <4E4BE8D7.4010203@ebi.ac.uk>

On 17/08/2011 17:07, Peter Cock wrote:
> On Wed, Aug 17, 2011 at 4:55 PM, Peter Rice wrote:
>> If we are reading as sequence input, we need to read all the sequences
>> into memory and then go back to read the features. For streamed input
>> we can buffer to make the rewind work.
>
> I'm curious what other file formats needed this kind of work. But it
> is good that you've already got some buffer/cache infrastructure
> in place. Does it boil down to writing temp files in /tmp ?

MSF (checksum at the top), Phylip (number of sequences at the top).

In ajseqwrite.c these are the ones with the Save attribute set true.

We keep them in memory and write them when the output file is closed.

regards,

Peter Rice
EMBOSS Team

From p.j.a.cock at googlemail.com Wed Aug 17 16:33:29 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 17 Aug 2011 17:33:29 +0100
Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO)
In-Reply-To: <4E4BE8D7.4010203@ebi.ac.uk>
References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk> <4E4BE06F.9040503@ebi.ac.uk> <4E4BE489.3040703@ebi.ac.uk> <4E4BE8D7.4010203@ebi.ac.uk>
Message-ID:

On Wed, Aug 17, 2011 at 5:14 PM, Peter Rice wrote:
> On 17/08/2011 17:07, Peter Cock wrote:
>> I'm curious what other file formats needed this kind of work. But it
>> is good that you've already got some buffer/cache infrastructure
>> in place. Does it boil down to writing temp files in /tmp ?
>
> MSF (checksum at the top), Phylip (number of sequences at the top).
>
> In ajseqwrite.c these are the ones with the Save attribute set true.
> > We keep them in memory and write them when the output file is closed. I wasn't thinking of alignments, but that makes perfect sense. Thanks, Peter C. From p.j.a.cock at googlemail.com Wed Aug 17 16:54:10 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 17 Aug 2011 17:54:10 +0100 Subject: [emboss-dev] Moving EMBOSS from OBF hosted CVS to git on github Message-ID: Dear EMBOSS team, Have you made any decisions regarding the proposal to move the EMBOSS repository from CVS hosted by the OBF to git hosted on github (where most of the other OBF backed projects are now)? I see this made it to the minutes of the 27 June 2011 meeting: http://emboss.sourceforge.net/meetings/2011-06-27.html As I recall from talking to Peter Rice at BOSC/ISMB 2011 in Vienna last month, EMBOSS currently uses a single branch in CVS (like Biopython used to), so migrating the repository to git shouldn't be too complicated. I recommend in the short term maintaining a git mirror of the CVS repository on github.com, which can be kept current via a cron job running on the OBF server. You can then treat this git repository as a read only mirror and continue to make all commits via CVS. During this interim period, external contributors can make their own branches etc (without touching the official EMBOSS repository) and send you patches. The internal developers can also try this out as a way to get familiar with git gradually. This is what we did with Biopython, and it worked very well. I am happy to assist with this if you want. I think I made this offer in person in Vienna, but I'm repeating it publicly now. You might also be able to adopt the existing mirror maintained by Pjotr Prins (CC'd), although that does include a branch with BioLib work in it: https://github.com/pjotrp/EMBOSS/ Regards, Peter C. P.S. You'll need to have a different project name on github since emboss was used by Martin Bosslet back in Nov 2010. How about emboss-prj or even open-bio for this? P.P.S. 
This page seems to be missing: http://emboss.sourceforge.net/meetings/2011-07-04.html It is linked to from at least these two pages: http://emboss.sourceforge.net/meetings/ http://emboss.sourceforge.net/meetings/2011-07-11.html From pmr at ebi.ac.uk Thu Aug 18 12:28:28 2011 From: pmr at ebi.ac.uk (Peter Rice) Date: Thu, 18 Aug 2011 13:28:28 +0100 Subject: [emboss-dev] Mapping feature types to Sequence Ontology (SO) In-Reply-To: References: <4E4A8BF7.4020106@ebi.ac.uk> <4E4A8C3B.8030306@ebi.ac.uk> Message-ID: <4E4D056C.5050508@ebi.ac.uk> On 08/16/2011 04:36 PM, Peter Cock wrote: > I will report this to UniProt later. However, first I thought > I would try converting one of the other files provided into > GFF3 using EMBOSS seqret for an alternative, e.g. the > plain text "swiss" format: http://www.uniprot.org/uniprot/P99999.txt > > I can convert this using seqret as follows: > > ======================================== > $ seqret -feature -osformat=gff3 -sformat=swiss -sequence P99999.txt > However, some of the terms in column 3 are apparently out of > date - but http://www.sequenceontology.org does list them as > synonyms: > > It looks like the EMBOSS sequence ontology table may need > updating for at least these three cases. > > Finally protein_modification_categorized_by_chemical_process > does not seem to be valid (I failed to find it in the ontology). That was a name from the MOD ontology. GFF3 output now uses an SO term (but SO is lacking detail for MOD_RES, having only: id: SO:0001089 name: post_translationally_modified_region and id: SO:0001700 name: histone_modification ... and then more descendant of histone modification. Still showing its DNA_only roots. EMBOSS internally uses MOD terms for MOD_RES features. The details are in the note tag in GFF3 output. > Additionally the validator complained about some of the note > in Line 15, probably due to the %3B escaped semi-colon, > but that may be a bug in the validator. Worked for me. 
Perhaps it was confused by the term name errors (or perhaps the validator has been fixed).

However, one nasty bug ... EMBOSS was so careful to only read real GFF3 format that the EMBOSS comment "#!Type Protein" was ignored and features were read into EMBOSS as nucleotide.

I suspect there is no way in GFF3 to identify a protein file. In the next patch we can parse the EMBOSS comment again but that will not help with non-EMBOSS protein GFF3 files.

Is there some official distinction between protein and nucleotide GFF3 files?

regards,

Peter Rice
EMBOSS Team

From pmr at ebi.ac.uk Wed Aug 24 10:36:34 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 24 Aug 2011 11:36:34 +0100
Subject: [emboss-dev] Problems with EMBOSS seqret GenBank to GFF3
In-Reply-To:
References:
Message-ID: <4E54D432.8030309@ebi.ac.uk>

On 08/17/2011 11:37 AM, Peter Cock wrote:
> Hi again Peter R. (et al.),
>
> Following yesterday's discussion about GFF3 files from UniProt,
> I'm trying seqret to produce GFF3 from GenBank files.
>
> ------------------------------------------
>
> Problem Two - Circular features not marked
>
> EMBOSS is also lacking in this area.

Current status: circular tags will be passed better in the next EMBOSS release. Sequence inputs will have a new -scircular qualifier and feature inputs will have -fcircular to cover cases where the input format does not define a circular sequence (but if it does, these will not turn it off).

We will tag a feature with Is_circular in the output, even if we have to make one up.

> ------------------------------------------
>
> Problem Six - Features wrapping the origin of a circular genome
>
> Related to the landmark feature lacking the Is_circular=true tag, the
> gene and CDS features for origin wrapping NEQ003 look funny to me.
> EMBOSS seems to be generating three GFF3 lines for the gene and CDS
> for NEQ003, a surprisingly broad entry 1 to 490885 and two children
> 490883 to 490885 and 1 to 879 (which do look sensible).
> > Based on the old specification, I had expected two GFF3 lines each for the > gene and CDS, giving the regions 490883 to 490885 and 1 to 879, linked > by virtue of the having the same ID. > > Thankfully this potential confusion has been address in the updated > specification, so I would expect a single GFF3 line for each of the gene > and CDS for NEQ003, using start 490883 and end of 879+490885=491764. Unfortunately GFF3 is sadly lacking in details on how to define the sequence length. It appears there is no standard for defining the length, yet it is critical to interpreting a circular feature that goes across the origin as GFF3 makes the end position greater than the length. We will make a best guess but cannot guarantee we get the right answer. > ------------------------------------------ > > Problem Seven - No parent/child relationships > > The EMBOSS 6.4.0 GFF3 file does use parent/child relationships > but not in the way I expected (and not in a way the validator likes). > As discussed above, for the GenBank join locations EMBOSS > seems to create broad parent features with children for each > sub-location (parent/child relations of the same type = bad). > > What I'm expecting instead is parent child relationships between > the CDS and gene features, between tRNA and gene features, etc. > Note that these relationships are implicit in the GenBank (and EMBL) > flat files, so I accept trying to deduce them might be hard (and > perhaps best not doing immediately - the other issues are more > pressing). The obvious fix is to lie about the feature types of the exons so the validator is happy. We could call them exons, but "region" would be safer. But there is a silly complication with CDS features: we could keep the CDS parent record and have it as a parent of a group of "regions" for the processed exons. But GFF3 wants the exons to be type "CDS" so what do we call the parent? 
So in the cobbled together example below, ignoring the circular aspects, we would want to keep the CDS on the parent (ID=NC_005213.11) record where all the annotation tags are, but I suspect GFF3 wants that to be something else. We could of course specifically lie about CDS features for EMBOSS generated GFF3 files (we tag the header) so we can restore the correct internal structure on input. NC_005213 EMBL CDS 490883 491764 . - 0 ID=NC_005213.11;locus_tag=NEQ001;note=conserved hypothetical [Methanococcus jannaschii]%3B COG1583:Uncharacterized ACR%3B IPR001472:Bipartite nuclear localization signal%3B IPR002743: Protein of unknown function DUF57;codon_start=1;transl_table=11;product=hypothetical protein;protein_id=NP_963295.1;db_xref=GI:41614797;db_xref=GeneID:2732620;translation=MRLLLELKALNSIDKKQLSNYLIQGFIYNILKNTEYSWLHNWKKEKYFNFTLIPKKDIIENKRYYLIISSPDKRFIEVLHNKIKDLDIITIGLAQFQLRKTKKFDPKLRFPWVTITPIVLREGKIVILKGDKYYKVFVKRLEELKKYNLIKKKEPILEEPIEISLNQIKDGWKIIDVKDRYYDFRNKSFSAFSNWLRDLKEQSLRKYNNFCGKNFYFEEAIFEGFTFYKTVSIRIRINRGEAVYIGTLWKELNVYRKLDKEEREFYKFLYDCGLGSLNSMGFGFVNTKKNSAR NC_005213 EMBL CDS 490883 490885 . - 0 ID=NC_005213.12;Parent=NC_005213.11 NC_005213 EMBL CDS 1 879 . - 0 ID=NC_005213.13;Parent=NC_005213.11 > ------------------------------------------ > > Problem Eight - Invalid tags > > The online validator complains that EMBOSS too is using EC_number > (uppercase tags are reserved Fixed and we can patch the release. Making all tags lower case is trivial - they are automatically converted on input to the internal mixed case. > ------------------------------------------ > > So my conclusion is that while the EMBOSS generated GFF3 is > better than those produced by the NCBI, it still is invalid and needs > some work. > > As usual, I am of course happy to help with testing fixes. And if > there are any mistakes in my understanding of the GFF3 spec, > please tell me ;) Hope this helps. Progress is being made. 
However, as GFF3 is such a pain, I am wondering whether to switch the default feature format to something else - back to GFF2 or maybe to use GTF. Does anyone have a preference? regards, Peter Rice EMBOSS Team From pmr at ebi.ac.uk Wed Aug 24 14:45:33 2011 From: pmr at ebi.ac.uk (Peter Rice) Date: Wed, 24 Aug 2011 15:45:33 +0100 Subject: [emboss-dev] Problems with EMBOSS seqret GenBank to GFF3 In-Reply-To: <4E54D432.8030309@ebi.ac.uk> References: <4E54D432.8030309@ebi.ac.uk> Message-ID: <4E550E8D.8010506@ebi.ac.uk> On 08/24/2011 11:36 AM, Peter Rice wrote: > On 08/17/2011 11:37 AM, Peter Cock wrote: > >> ------------------------------------------ >> >> Problem Seven - No parent/child relationships >> >> The EMBOSS 6.4.0 GFF3 file does use parent/child relationships >> but not in the way I expected (and not in a way the validator likes). As a first attempt, using the EMBL entry v00508 in the EMBOSS test set, I can make the CDS "parent" feature change its type to "biological_region" and add a featflags tag with the true type. Code (not yet checked in) can reconstruct the EMBL feature table from this GFF. However, the EMBL tags are all on the parent (now biological_region) feature. Any suggestions where I should stick them for them to be useful in GFF3? EMBL feature table: FT source 1..3919 FT /organism="Homo sapiens" FT /mol_type="genomic DNA" FT /db_xref="taxon:9606" FT CDS join(2079..2171,2294..2515,3371..3499) FT /db_xref="GDB:119299" FT /db_xref="GOA:P02100" FT /db_xref="HGNC:4830" FT /db_xref="InterPro:IPR000971" FT /db_xref="InterPro:IPR002337" FT /db_xref="InterPro:IPR009050" FT /db_xref="InterPro:IPR012292" FT /db_xref="PDB:1A9W" FT /db_xref="UniProtKB/Swiss-Prot:P02100" FT /protein_id="CAA23766.1" FT /translation="MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDS FT FGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENF FT KLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH" proposed GFF3 version V00508 EMBL databank_entry 1 3919 . + . 
ID=V00508.1;organism=Homo sapiens;mol_type=genomic DNA;db_xref=taxon:9606 V00508 EMBL biological_region 2079 3499 . + 0 ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_x ref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLV VYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH V00508 EMBL CDS 2079 2171 . + 0 Parent=V00508.2 V00508 EMBL CDS 2294 2515 . + 0 Parent=V00508.2 V00508 EMBL CDS 3371 3499 . + 0 Parent=V00508.2 regards, Peter Rice EMBOSS Team From p.j.a.cock at googlemail.com Thu Aug 25 00:44:47 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 25 Aug 2011 01:44:47 +0100 Subject: [emboss-dev] Problems with EMBOSS seqret GenBank to GFF3 In-Reply-To: <4E550E8D.8010506@ebi.ac.uk> References: <4E54D432.8030309@ebi.ac.uk> <4E550E8D.8010506@ebi.ac.uk> Message-ID: On Wed, Aug 24, 2011 at 11:36 AM, Peter Rice wrote: > > However, as GFF3 is such a pain, I am wondering whether to switch the > default feature format to something else - back to GFF2 or maybe to use GTF. > Sadly I have to agree with you - the current version of the GFF3 spec leaves far too much open to multiple interpretation, as we have been discussing on the song-devel mailing lists. I'm not sure that GFF2 or GTF are any better though. On Wed, Aug 24, 2011 at 3:45 PM, Peter Rice wrote: > On 08/24/2011 11:36 AM, Peter Rice wrote: >> >> On 08/17/2011 11:37 AM, Peter Cock wrote: >> >>> ------------------------------------------ >>> >>> Problem Seven - No parent/child relationships >>> >>> The EMBOSS 6.4.0 GFF3 file does use parent/child relationships >>> but not in the way I expected (and not in a way the validator likes). 
>
> As a first attempt, using the EMBL entry v00508 in the EMBOSS test set, I
> can make the CDS "parent" feature change its type to "biological_region" and
> add a featflags tag with the true type. Code (not yet checked in) can
> reconstruct the EMBL feature table from this GFF.
>
> However, the EMBL tags are all on the parent (now biological_region)
> feature.
>
> Any suggestions where I should stick them for them to be useful in GFF3?
>
> EMBL feature table:
>
> FT   source          1..3919
> FT                   /organism="Homo sapiens"
> FT                   /mol_type="genomic DNA"
> FT                   /db_xref="taxon:9606"
> FT   CDS             join(2079..2171,2294..2515,3371..3499)
> FT                   /db_xref="GDB:119299"
> FT                   /db_xref="GOA:P02100"
> FT                   /db_xref="HGNC:4830"
> FT                   /db_xref="InterPro:IPR000971"
> FT                   /db_xref="InterPro:IPR002337"
> FT                   /db_xref="InterPro:IPR009050"
> FT                   /db_xref="InterPro:IPR012292"
> FT                   /db_xref="PDB:1A9W"
> FT                   /db_xref="UniProtKB/Swiss-Prot:P02100"
> FT                   /protein_id="CAA23766.1"
> FT                   /translation="MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDS
> FT                   FGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENF
> FT                   KLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH"
>
> proposed GFF3 version
>
> V00508  EMBL    databank_entry  1       3919    .       +       .
> ID=V00508.1;organism=Homo sapiens;mol_type=genomic DNA;db_xref=taxon:9606
> V00508  EMBL    biological_region       2079    3499    .       +       0
> ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_x
> ref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLV
> VYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH
> V00508  EMBL    CDS     2079    2171    .       +       0
> Parent=V00508.2
> V00508  EMBL    CDS     2294    2515    .       +       0
> Parent=V00508.2
> V00508  EMBL    CDS     3371    3499    .       +       0
> Parent=V00508.2
>

I was expecting something like this (done by hand) where we follow the example on http://www.sequenceontology.org/gff3.shtml and have a single GFF gene feature represented by three lines linked by virtue of having the same ID:

V00508  EMBL    databank_entry  1       3919    .       +       .
ID=V00508.1;organism=Homo sapiens;mol_type=genomic DNA;db_xref=taxon:9606
V00508  EMBL    CDS     2079    2171    .       +       0
ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_xref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH
V00508  EMBL    CDS     2294    2515    .       +       0
ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_xref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH
V00508  EMBL    CDS     3371    3499    .       +       0
ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_xref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH

On the downside, I have repeated all the annotation three times - but that is what was done in the GFF3 example in the spec.

Perhaps this should be raised on the song-devel mailing list along with our other GFF3 queries.

Regards,

Peter C.

From pmr at ebi.ac.uk Thu Aug 25 13:52:30 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 25 Aug 2011 14:52:30 +0100
Subject: [emboss-dev] Problems with EMBOSS seqret GenBank to GFF3
In-Reply-To:
References: <4E54D432.8030309@ebi.ac.uk> <4E550E8D.8010506@ebi.ac.uk>
Message-ID: <4E56539E.6030400@ebi.ac.uk>

On 25/08/2011 01:44, Peter Cock wrote:
> On Wed, Aug 24, 2011 at 11:36 AM, Peter Rice wrote:
>>
>> However, as GFF3 is such a pain, I am wondering whether to switch the
>> default feature format to something else - back to GFF2 or maybe to use GTF.
>>
>
> Sadly I have to agree with you - the current version of the GFF3
> spec leaves far too much open to multiple interpretation, as we
> have been discussing on the song-devel mailing lists. I'm not
> sure that GFF2 or GTF are any better though.

GTF is no good for EMBOSS ...
way too picky about start and stop codons If pushed we could read it in using a version of the GTF parser but I see no point trying to write it using data from any source > I was expecting something like this (done by hand) where we follow the > example on http://www.sequenceontology.org/gff3.shtml and have a > single GFF gene feature represented by three lines linked by virtue of > having the same ID: > > > V00508 EMBL databank_entry 1 3919 . + . > ID=V00508.1;organism=Homo sapiens;mol_type=genomic > DNA;db_xref=taxon:9606 > V00508 EMBL CDS 2079 2171 . + 0 > ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_xref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH > V00508 EMBL CDS 2294 2515 . + 0 > ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_xref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH > V00508 EMBL CDS 3371 3499 . 
+ 0 > ID=V00508.2;featflags=type:CDS;db_xref=GDB:119299;db_xref=GOA:P02100;db_xref=HGNC:4830;db_xref=InterPro:IPR000971;db_xref=InterPro:IPR002337;db_xref=InterPro:IPR009050;db_xref=InterPro:IPR012292;db_xref=PDB:1A9W;db_xref=UniProtKB/Swiss-Prot:P02100;protein_id=CAA23766.1;translation=MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLTSFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAIALAHKYH > > On the downside, I have repeated all the annotation three times - but > that is what was done in the GFF3 example in the spec. Urgh. How about a gene with 80 exons? That's what I was trying to avoid. How would you plan to read it back in? Transferring all features to the parent perhaps, with checks every time for an existing exact copy? I am less impressed with GFF3 each time I look. I think we'll go with the annotation of the "biological_region" parent and wait for anyone with a use case that actually requires massively replicated annotation. regards, Peter Rice EMBOSS Team From p.j.a.cock at googlemail.com Fri Aug 26 02:27:31 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 26 Aug 2011 03:27:31 +0100 Subject: [emboss-dev] Problems with EMBOSS seqret GenBank to GFF3 In-Reply-To: <4E56539E.6030400@ebi.ac.uk> References: <4E54D432.8030309@ebi.ac.uk> <4E550E8D.8010506@ebi.ac.uk> <4E56539E.6030400@ebi.ac.uk> Message-ID: On Thu, Aug 25, 2011 at 2:52 PM, Peter Rice wrote: > On 25/08/2011 01:44, Peter Cock wrote: >> >> On Wed, Aug 24, 2011 at 11:36 AM, Peter Rice ?wrote: >>> >>> However, as GFF3 is such a pain, I am wondering whether to switch the >>> default feature format to something else - back to GFF2 or maybe to use >>> GTF. >>> >> >> Sadly I have to agree with you - the current version of the GFF3 >> spec leaves far too much open to multiple interpretation, as we >> have been discussing on the song-devel mailing lists. I'm not >> sure that GFF2 or GTF are any better though. > > GTF is no good for EMBOSS ... 
way too picky about start and stop codons
>
> If pushed we could read it in using a version of the GTF parser but I see no
> point trying to write it using data from any source
>
>
>> I was expecting something like this (done by hand) where we follow the
>> example on http://www.sequenceontology.org/gff3.shtml and have a
>> single GFF gene feature represented by three lines linked by virtue of
>> having the same ID:
>>
>> ...
>>
>> On the downside, I have repeated all the annotation three times - but
>> that is what was done in the GFF3 example in the spec.
>
> Urgh. How about a gene with 80 exons? That's what I was trying to avoid.
>
> How would you plan to read it back in? Transferring all features to the
> parent perhaps, with checks every time for an existing exact copy?
>

It would make sense to propose that the first line has all the annotation, and the subsequent lines from the same feature just need the ID and, if it is adopted, the part tag recently discussed on the song-devel list to make the order of the sub-parts explicit.
http://sourceforge.net/mailarchive/message.php?msg_id=27960475

>
> I am less impressed with GFF3 each time I look.
>

Me too.

>
> I think we'll go with the annotation of the "biological_region" parent and
> wait for anyone with a use case that actually requires massively replicated
> annotation.
>

Have you looked at the BioPerl GenBank to GFF3 conversion? I understand GBrowse recommends this as a way to get GenBank format data into GBrowse. I'm also pretty sure that this is being used inside TogoWS for GenBank/EMBL to GFF3:

http://togows.dbcls.jp/entry/embl/V00508 <-- original EMBL
http://togows.dbcls.jp/entry/embl/V00508.gff <-- as GFF3

Interestingly their GFF3 output is pretty close to your proposed EMBOSS output, only they've got a "region" rather than "biological_region" for the parent meta-feature.
However, I think introducing extra biological_region features to act as the parent of multi-location features would run counter to the canonical gene model given in the GFF3 specification (which appears to be just a suggestion rather than a requirement). Also, introducing this meta-feature would complicate any future wish to try to express explicit parent/child relationships between operon, gene, mRNA and CDS features. Of course, as we've discussed, these biological relationships are only implicit in the GenBank/EMBL feature table. This is probably a good example to discuss on the GFF3 song-devel mailing list - small and apparently very simple except for how to represent the (forward strand) join location. Peter C. From pmr at ebi.ac.uk Tue Aug 30 15:48:25 2011 From: pmr at ebi.ac.uk (Peter Rice) Date: Tue, 30 Aug 2011 16:48:25 +0100 Subject: [emboss-dev] Problems with EMBOSS seqret GenBank to GFF3 In-Reply-To: References: <4E54D432.8030309@ebi.ac.uk> <4E550E8D.8010506@ebi.ac.uk> <4E56539E.6030400@ebi.ac.uk> Message-ID: <4E5D0649.3010905@ebi.ac.uk> On 08/26/2011 03:27 AM, Peter Cock wrote: > On Thu, Aug 25, 2011 at 2:52 PM, Peter Rice wrote: >> On 25/08/2011 01:44, Peter Cock wrote: > It would make sense to propose that the first line has all the annotation, > and the subsequence lines from the same feature just need the ID, > and if it is adopted the part tag recently discussed on the song-devel > list to make the order of the sub-parts explicit. > http://sourceforge.net/mailarchive/message.php?msg_id=27960475 The part tag is interesting and would map to the internal "exon" attribute in EMBOSS which we reserve for sorting. >> I think we'll go with the annotation of the "biological_region" parent and >> wait for anyone with a use case that actually requires massively replicated >> annotation. >> > > Have you looked at the BioPerl GenBank to GFF3 conversion? > I understand GBrowse recommends this as a way to get > GenBank format data into GBrowse. 
> I'm also pretty sure that this is being used inside TogoWS
> for GenBank/EMBL to GFF3:
>
> http://togows.dbcls.jp/entry/embl/V00508 <-- original EMBL
> http://togows.dbcls.jp/entry/embl/V00508.gff <-- as GFF3

Hmmm .... the GFF3 has Parent references to the protein_id, but it
doesn't appear as an ID. I do not like using a second region to put the
description line in. Using the organism as the ID for the source line
also looks odd.

> Interestingly their GFF3 output is pretty close to your proposed
> EMBOSS output, only they've got a "region" rather than
> "biological_region" for the parent meta-feature.

I don't see a parent meta-feature there.

> However, I think introducing extra biological_region features to
> act as the parent of multi-location features would run counter to
> the canonical gene model given in the GFF3 specification (which
> appears to be just a suggestion rather than a requirement).
>
> Also, introducing this meta-feature would complicate any
> future wish to try to express explicit parent/child relationships
> between operon, gene, mRNA and CDS features. Of course, as
> we've discussed, these biological relationships are only implicit
> in the GenBank/EMBL feature table.

I tried the canonical gene example:

##gff-version 3
##sequence-region ctg123 1 9000
ctg123	.	gene	1000	9000	.	+	.	ID=gene00001;Name=EDEN
ctg123	.	TF_binding_site	1000	1012	.	+	.	ID=tfbs00001;Parent=gene00001
ctg123	.	mRNA	1050	9000	.	+	.	ID=mRNA00001;Parent=gene00001;Name=EDEN.1
ctg123	.	five_prime_UTR	1050	1200	.	+	.	Parent=mRNA00001
ctg123	.	CDS	1201	1500	.	+	0	ID=cds00001;Parent=mRNA00001
ctg123	.	CDS	3000	3902	.	+	0	ID=cds00001;Parent=mRNA00001
ctg123	.	CDS	5000	5500	.	+	0	ID=cds00001;Parent=mRNA00001
ctg123	.	CDS	7000	7600	.	+	0	ID=cds00001;Parent=mRNA00001
ctg123	.	three_prime_UTR	7601	9000	.	+	.	Parent=mRNA00001
ctg123	.	cDNA_match	1050	1500	5.8e-42	+	.	ID=match00001;Target=cdna0123+12+462
ctg123	.	cDNA_match	5000	5500	8.1e-43	+	.	ID=match00001;Target=cdna0123+463+963
ctg123	.	cDNA_match	7000	9000	1.4e-40	+	.	ID=match00001;Target=cdna0123+964+2964
##FASTA
>ctg123
cttctgggcgtacccgattctcggagaacttgccgcaccattccgccttg
tgttcattgctgcctgcatgttcattgtctacctcggctacgtgtggcta
tctttcctcggtgccctcgtgcacggagtcgagaaaccaaagaacaaaaa
aagaaattaaaatatttattttgctgtggtttttgatgtgtgttttttat
aatgatttttgatgtgaccaattgtacttttcctttaaatgaaatgtaat
cttaaatgtatttccgacgaattcgaggcctgaaaagtgtgacgccattc
gtatttgatttgggtttactatcgaataatgagaattttcaggcttaggc
ttaggcttaggcttaggcttaggcttaggcttaggcttaggcttaggctt
aggcttaggcttaggcttaggcttaggcttaggcttaggcttaggcttag
aatctagctagctatccgaaattcgaggcctgaaaagtgtgacgccattc
>cdna0123
ttcaagtgctcagtcaatgtgattcacagtatgtcaccaaatattttggc
agctttctcaagggatcaaaattatggatcattatggaatacctcggtgg
aggctcagcgctcgatttaactaaaagtggaaagctggacgaaagtcata
tcgctgtgattcttcgcgaaattttgaaaggtctcgagtatctgcatagt
gaaagaaaaatccacagagatattaaaggagccaacgttttgttggaccg
tcaaacagcggctgtaaaaatttgtgattatggttaaagg

I can now (code not yet checked in) reproduce this, subject to the
sequence being too short.

Internally, EMBOSS generates parent features for CDS and cDNA_match
(where several features share an ID), and the parent structure is
preserved. On output, the generated features are not reported, so the
GFF3 output is identical to the input.

If we read EMBL/GenBank entries then we will generate a parent feature
with type "biological_region" to attach the annotation from the join.

Reproducing the "parent" relationships is a separate exercise that could
be a separate application. In terms of reading one format and writing
another I prefer to not generate any GFF3-specific extras.

> This is probably a good example to discuss on the GFF3
> song-devel mailing list - small and apparently very simple
> except for how to represent the (forward strand) join location.
We could propose something for the
http://www.sequenceontology.org/wiki/index.php/GFF3_best_practices
page to describe how to represent EMBL/GenBank entries in GFF3
(after due discussion on the SONG-devel list).

regards,

Peter Rice
EMBOSS Team
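
The grouping described in this message, where several GFF3 lines that
share one ID are treated as parts of a single feature under a generated
parent, can be sketched in Python. This is only an illustration under
stated assumptions, not EMBOSS code: the function names
(parse_attributes, group_by_id, parent_span) are hypothetical, and
giving the generated parent the overall min/max span of its parts is an
assumption made for the example.

```python
from collections import OrderedDict


def parse_attributes(column9):
    """Split a GFF3 attribute column such as 'ID=x;Parent=y' into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if field:
            key, _, value = field.partition("=")
            attrs[key] = value
    return attrs


def group_by_id(gff3_text):
    """Collect the (start, end) parts recorded for each feature ID, in file order.

    Lines without an ID attribute are skipped; parsing stops at ##FASTA.
    """
    groups = OrderedDict()
    for line in gff3_text.splitlines():
        if line.startswith("##FASTA"):
            break
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue
        feature_id = parse_attributes(cols[8]).get("ID")
        if feature_id is None:
            continue
        groups.setdefault(feature_id, []).append((int(cols[3]), int(cols[4])))
    return groups


def parent_span(parts):
    """Hypothetical choice: a generated parent covers all of its parts."""
    return min(start for start, _ in parts), max(end for _, end in parts)


# A few lines taken from the canonical gene example, tab-separated.
EXAMPLE = "\n".join([
    "##gff-version 3",
    "ctg123\t.\tgene\t1000\t9000\t.\t+\t.\tID=gene00001;Name=EDEN",
    "ctg123\t.\tfive_prime_UTR\t1050\t1200\t.\t+\t.\tParent=mRNA00001",
    "ctg123\t.\tCDS\t1201\t1500\t.\t+\t0\tID=cds00001;Parent=mRNA00001",
    "ctg123\t.\tCDS\t3000\t3902\t.\t+\t0\tID=cds00001;Parent=mRNA00001",
    "ctg123\t.\tCDS\t5000\t5500\t.\t+\t0\tID=cds00001;Parent=mRNA00001",
    "ctg123\t.\tCDS\t7000\t7600\t.\t+\t0\tID=cds00001;Parent=mRNA00001",
])

groups = group_by_id(EXAMPLE)
for feature_id, parts in groups.items():
    if len(parts) > 1:
        # Several lines share this ID, so they form one multi-part feature.
        print(feature_id, "parts:", parts, "parent span:", parent_span(parts))
```

Because the parent here is derived purely from the shared ID, dropping
it again on output leaves the original lines untouched, which matches
the round-trip behaviour described in the message.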