From biopython at maubp.freeserve.co.uk Fri Jan 8 12:33:02 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 8 Jan 2010 17:33:02 +0000
Subject: [Open-bio-l] GenBank and EMBL - join(complement(...)) vs complement(join(...))
Message-ID: <320fb6e01001080933w383767c1ya45e0a6a891308d6@mail.gmail.com>

Hi all,

Currently Biopython reads both GenBank and EMBL files, and writes GenBank.
I'm looking at writing EMBL files too - and wanted to see if any of you knew
anything definitive on join(complement(...)) vs complement(join(...)) in
feature location strings.

References:
http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html
http://www.genbank.lipi.go.id/docs/FTv6_2.html

Both give this example, two ways of writing the same location:

complement(join(2691..4571,4918..5163))
Joins regions 2691 to 4571 and 4918 to 5163, then
complements the joined segments (the feature is
on the strand complementary to the presented strand)

join(complement(4918..5163),complement(2691..4571))
Complements regions 4918 to 5163 and 2691 to 4571,
then joins the complemented segments (the feature is
on the strand complementary to the presented strand)

This suggests that either form is valid in both GenBank and EMBL
format files.

Anecdotally, I have observed GenBank uses the first form (which is
shorter) while EMBL seems to use the second form (which to me is
logical, if you consider how to represent mixed strand features).
This seems to fit with this BioPerl wiki page:

http://www.bioperl.org/wiki/BioPerl_Locations

Is there any official documentation regarding this discrepancy that
I have overlooked? Am I right to think that GenBank and EMBL do
still use these different forms (any word on if they might
standardise one way or the other in future?)

What do EMBOSS, BioPerl, etc do in this situation? Do you treat
these two examples the same on parsing, and use one layout
when writing GenBank and the other for writing EMBL files?

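For illustration only, here is a minimal Python sketch of why the two forms can be treated the same on parsing. It is not the Biopython, BioPerl or EMBOSS parser: the function name parse_location and the (start, end, strand) tuples are made up for this example, and it only handles plain start..end spans nested inside join(...) and complement(...). Under those assumptions both spellings reduce to the same list of parts in biological order.

```python
import re


def _split_top_level(text):
    """Split on commas that are not nested inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def _parse(text):
    if text.startswith("complement(") and text.endswith(")"):
        inner = _parse(text[len("complement("):-1])
        # Complementing reverses the order of the parts and flips each strand.
        return [(start, end, -strand) for start, end, strand in reversed(inner)]
    if text.startswith("join(") and text.endswith(")"):
        parts = []
        for piece in _split_top_level(text[len("join("):-1]):
            parts.extend(_parse(piece))
        return parts
    match = re.fullmatch(r"(\d+)\.\.(\d+)", text)
    if match is None:
        raise ValueError("unsupported location: %r" % text)
    return [(int(match.group(1)), int(match.group(2)), +1)]


def parse_location(text):
    """Hypothetical helper: list of (start, end, strand) in biological order."""
    return _parse("".join(text.split()))  # drop spaces left by line wrapping


form_one = parse_location("complement(join(2691..4571,4918..5163))")
form_two = parse_location("join(complement(4918..5163),complement(2691..4571))")
print(form_one == form_two)
```

The choice between the two spellings then only matters on output, not in the parsed representation.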
Peter

From cjfields at illinois.edu Fri Jan 8 21:54:41 2010
From: cjfields at illinois.edu (Chris Fields)
Date: Fri, 8 Jan 2010 20:54:41 -0600
Subject: [Open-bio-l] GenBank and EMBL - join(complement(...)) vs complement(join(...))
In-Reply-To: <320fb6e01001080933w383767c1ya45e0a6a891308d6@mail.gmail.com>
References: <320fb6e01001080933w383767c1ya45e0a6a891308d6@mail.gmail.com>
Message-ID: <2184EB34-B947-44F3-A0D4-BC55CFF404B4@illinois.edu>

On Jan 8, 2010, at 11:33 AM, Peter wrote:

> Hi all,
>
> Currently Biopython reads both GenBank and EMBL files, and write GenBank.
> I'm looking at writing EMBL files too - and wanted to see if any of you knew
> anything definitive on join(complement(...)) vs complement(join(...)) in
> feature location strings.
>
> References:
> http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html
> http://www.genbank.lipi.go.id/docs/FTv6_2.html
>
> Both give this in example, two ways for writing the same location:
>
> complement(join(2691..4571,4918..5163)
> Joins regions 2691 to 4571 and 4918 to 5163, then
> complements the joined segments (the feature is
> on the strand complementary to the presented strand)
>
> join(complement(4918..5163),complement(2691..4571))
> Complements regions 4918 to 5163 and 2691 to 4571,
> then joins the complemented segments (the feature is
> on the strand complementary to the presented strand)
>
> This suggests that either form is valid in both GenBank and EMBL
> format files.
>
> Anecdotally, I have observed GenBank uses the first form (which is
> shorter) while EMBL seems to use the second form (which to me is
> logical, if you consider how to represent mixed strand features).
> This seems to fit with this BioPerl wiki page:
>
> http://www.bioperl.org/wiki/BioPerl_Locations
>
> Is there any official documentation regarding this discrepancy that
> I have overlooked?
> Am I right to think that GenBank and EMBL do
> still use these different forms (any word on if they might
> standardised one way or the other in future?)
>
> What do EMBOSS, BioPerl, etc do in this situation? Do you treat
> these two examples the same on parsing, and use one layout
> when writing GenBank and the other for writing EMBL files?
>
> Peter

I can't recall which of the two BioPerl uses, but if it helps it standardizes
on one of them for output but parses both. I think GenBank and EMBL
have converged on using the same format, but I'm not absolutely sure
on that.

Ironic actually that I can't remember, as I'm the author of the above page
and started a discussion about this very subject a while back on the list
(in an effort to sort out some issues with BioPerl locations).

chris

From biopython at maubp.freeserve.co.uk Mon Jan 11 05:42:52 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 11 Jan 2010 10:42:52 +0000
Subject: [Open-bio-l] GenBank and EMBL - join(complement(...)) vs complement(join(...))
In-Reply-To: <2184EB34-B947-44F3-A0D4-BC55CFF404B4@illinois.edu>
References: <320fb6e01001080933w383767c1ya45e0a6a891308d6@mail.gmail.com> <2184EB34-B947-44F3-A0D4-BC55CFF404B4@illinois.edu>
Message-ID: <320fb6e01001110242l4f26f80l465c61c7f1b97dd5@mail.gmail.com>

On Sat, Jan 9, 2010 at 2:54 AM, Chris Fields wrote:
>
> I can't recall which of the two BioPerl uses, but if it helps it standardizes
> on one of them for output but parses both. I think GenBank and EMBL
> have converged on using the same format, but I'm not absolutely sure
> on that.
>
> Ironic actually that I can't remember, as I'm the author of the above page
> and started a discussion about this very subject a while back on the list
> (in an effort to sort out some issues with BioPerl locations).
>
> chris

Thanks Chris,

I'm glad my email made sense - on re-reading I had made more typos
than usual :(

As to the BioPerl behaviour, I think I know enough to get BioPerl
to convert GenBank files into EMBL or vice versa, and thus find
out what it does...

I hope you are right that GenBank and EMBL have converged on
using the same format - any confirmation of this (and which format)
would be very welcome.

Peter (Rice), do you have any input? I noticed some work in the
latest EMBOSS patch last month that touches on this issue:
http://lists.open-bio.org/pipermail/emboss-announce/2009-December/000016.html

Peter

From biopython at maubp.freeserve.co.uk Mon Jan 11 09:42:51 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 11 Jan 2010 14:42:51 +0000
Subject: [Open-bio-l] GenBank and EMBL - join(complement(...)) vs complement(join(...))
In-Reply-To: <320fb6e01001110242l4f26f80l465c61c7f1b97dd5@mail.gmail.com>
References: <320fb6e01001080933w383767c1ya45e0a6a891308d6@mail.gmail.com> <2184EB34-B947-44F3-A0D4-BC55CFF404B4@illinois.edu> <320fb6e01001110242l4f26f80l465c61c7f1b97dd5@mail.gmail.com>
Message-ID: <320fb6e01001110642p777d5b65la2e97b5767bd3ae5@mail.gmail.com>

On Mon, Jan 11, 2010 at 10:42 AM, Peter wrote:
> On Sat, Jan 9, 2010 at 2:54 AM, Chris Fields wrote:
>>
>> I can't recall which of the two BioPerl uses, but if it helps it standardizes
>> on one of them for output but parses both. I think GenBank and EMBL
>> have converged on using the same format, but I'm not absolutely sure
>> on that.
>>
>> Ironic actually that I can't remember, as I'm the author of the above page
>> and started a discussion about this very subject a while back on the list
>> (in an effort to sort out some issues with BioPerl locations).
>>
>> chris
>
> Thanks Chris,
>
> I'm glad my email made sense - on re-reading I had made more typos
> than usual :(
>
> As to the BioPerl behaviour, I think I know enough to get BioPerl
> to convert GenBank files into EMBL or vice versa, and thus find
> out what it does...

After stumbling over this issue, I made some progress:
http://lists.open-bio.org/pipermail/bioperl-l/2010-January/031889.html

> I hope you are right that GenBank and EMBL have converged on
> using the same format - any confirmation of this (and which format)
> would be very welcome.

I took the Arabidopsis thaliana chloroplast complete genome as an
example. This is AP000423 in EMBL, NC_000932 in GenBank (although
there are some minor differences in the annotation). Looking at these
files (both from early 2009), they seem to use the same feature
location styles,

e.g. for reverse strand joins:

complement(join(97999..98793,69611..69724))

e.g. for mixed strand features:

join(complement(69611..69724),139856..140087,140625..140650)

I'm going to assume that this is what both EMBL and GenBank will
be using in future.

I have confirmed that BioPerl 1.6.x preserves these location styles
when converting EMBL/GenBank to EMBL/GenBank. I need to find a
reverse strand "join(complement(..." example to test with now...

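Both styles seen in these files can be written from one internal representation. The following is a minimal Python sketch under stated assumptions: the parts are held as (start, end, strand) tuples in biological order, and format_location is a hypothetical helper, not the Biopython or BioPerl writer. Features lying wholly on the reverse strand get the complement(join(...)) spelling; anything else gets join(...) with complement(...) wrapped around individual parts.

```python
def format_location(parts):
    """Hypothetical helper: parts are (start, end, strand) tuples in
    biological order, with strand +1 or -1."""
    if not parts:
        raise ValueError("a location needs at least one part")

    def span(start, end):
        return "%i..%i" % (start, end)

    if all(strand == -1 for _, _, strand in parts):
        # Wholly reverse strand: list the spans in reversed order and
        # wrap the whole thing in a single complement(...).
        spans = [span(start, end) for start, end, _ in reversed(parts)]
        inner = spans[0] if len(spans) == 1 else "join(%s)" % ",".join(spans)
        return "complement(%s)" % inner

    # Forward or mixed strand: keep biological order and complement
    # only the reverse strand parts.
    pieces = [
        span(start, end) if strand == 1 else "complement(%s)" % span(start, end)
        for start, end, strand in parts
    ]
    return pieces[0] if len(pieces) == 1 else "join(%s)" % ",".join(pieces)


reverse_join = format_location([(69611, 69724, -1), (97999, 98793, -1)])
mixed_join = format_location(
    [(69611, 69724, -1), (139856, 140087, 1), (140625, 140650, 1)]
)
print(reverse_join)
print(mixed_join)
```

With the part lists above, this should reproduce the two location strings quoted from the Arabidopsis chloroplast records.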
Peter

From mark.schreiber at novartis.com Wed Jan 13 00:37:57 2010
From: mark.schreiber at novartis.com (mark.schreiber at novartis.com)
Date: Wed, 13 Jan 2010 13:37:57 +0800
Subject: [Open-bio-l] [Bosc] BOSC 2010 Request for Input
In-Reply-To: <57BC6503-C655-431C-88BB-0F63E1004BD6@drycafe.net>
Message-ID:

I might be coming to ISMB/BOSC this year

hand ++

- Mark

Hilmar Lapp
Sent by: open-bio-l-bounces at lists.open-bio.org
12/22/2009 07:39 AM
To Brad Chapman
cc bosc at open-bio.org, community at cloudbiolinux.com, open-bio-l at lists.open-bio.org
Subject Re: [Open-bio-l] [Bosc] BOSC 2010 Request for Input

On Dec 21, 2009, at 4:32 PM, Brad Chapman wrote:

> I can help organize a hackathon at Mass General Hospital, which is
> located fairly to close to Hynes Convention Center, where the
> conference
> is on. I booked conferences rooms for dates surrounding BOSC (July
> 6-8th and 11th) that will accommodate up to 30 developers from various
> projects.

Awesome!! Thanks so much Brad for stepping up so quickly and putting
your foot down too!

> If we could get a show of hands from people who might be interested,
> this would help gauge the size and scope.

:)

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
===========================================================

_______________________________________________
Open-Bio-l mailing list
Open-Bio-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/open-bio-l

From biopython at maubp.freeserve.co.uk Thu Jan 21 07:33:53 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 21 Jan 2010 12:33:53 +0000
Subject: [Open-bio-l] SwissProt DE lines and UniProt XML / TagTree as XML in BioSQL
Message-ID: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com>

Hi all,

This is cross posted to try and ensure relevant people see it.
I suggest we continue the discussion on the BioSQL list
(for how to serialise structured annotation to BioSQL), and/or
the OpenBio list (for things like file format naming conventions).

I am hoping we (Bio*) can be consistent in how we parse and load
into BioSQL the SwissProt DE lines (known as "swiss" format in
both BioPerl and Biopython's SeqIO, and by EMBOSS) or the
equivalent UniProt XML tags (which we are tentatively going to
call the "uniprot" format in Biopython's SeqIO - comments?).

Like BioPerl (etc), Biopython can parse plain text SwissProt ("swiss")
files and load them into BioSQL. Biopython currently treats the DE
comment lines as a long string, as BioPerl used to:

http://lists.open-bio.org/pipermail/bioperl-l/2009-May/030041.html
http://lists.open-bio.org/pipermail/biosql-l/2009-May/001514.html

I understand that BioPerl now turns the SwissProt DE lines into a
TagTree, and for storing this in BioSQL this gets serialised as XML.
I would like Biopython to handle this the same way (although rather
than a Perl TagTree, we'd use a Python structure of course), and
would appreciate clarification of what exactly was implemented
(e.g.
which bit of the BioPerl source code should we look at,
and could you show a worked example?).

Andrea Pierleoni (CC'd - not sure if he is on the BioSQL or
Open-Bio lists yet) has started work on parsing UniProt XML
files for Biopython. Here the DE comment lines are already
provided broken up with XML markup. Hopefully their nested
structure matches what BioPerl was doing with the SwissProt
DE lines.

Regards,

Peter

From cjfields at illinois.edu Thu Jan 21 08:34:12 2010
From: cjfields at illinois.edu (Chris Fields)
Date: Thu, 21 Jan 2010 07:34:12 -0600
Subject: [Open-bio-l] SwissProt DE lines and UniProt XML / TagTree as XML in BioSQL
In-Reply-To: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com>
References: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com>
Message-ID:

Peter,

The relevant code is in Bio::Annotation::TagTree in bioperl-live, which
is a decorator for Data::Stag:

http://search.cpan.org/~cmungall/Data-Stag-0.11/Data/Stag.pm

This is where the text output is derived from. It's a bit of a
heavyweight solution to the problem, but it's capable of round-tripping
the DE data and parses out the data in a way that's approachable. We
could probably abstract out the serialization backend there and allow a
pure bioperl solution (or the current solution) as a fallback.

If the plain-text DE info is represented in a hierarchy already in
UniProt XML, we should probably conform as closely as possible to that
(using a standard format like XML, JSON, etc.).

chris

On Jan 21, 2010, at 6:33 AM, Peter wrote:

> Hi all,
>
> This is cross posted to try and ensure relevant people see it.
> I suggest we continue the discussion on the BioSQL list
> (for how to serialise structured annotation to BioSQL), and/or
> the OpenBio list (for things like file format naming conventions).
>
> I am hoping we (Bio*) can be consistent in how we parse and load
> into BioSQL the SwissProt DE lines (known as "swiss" format in
> both BioPerl and Biopython's SeqIO, and by EMBOSS) or the
> equivalent UniProt XML tags (which we are tentatively going to
> call the "uniprot" format in Biopython's SeqIO - comments?).
>
> Like BioPerl (etc), Biopython can parse plain text SwissProt ("swiss")
> files and load them into BioSQL. Biopython currently treats the DE
> comment lines as a long string, as BioPerl used to:
>
> http://lists.open-bio.org/pipermail/bioperl-l/2009-May/030041.html
> http://lists.open-bio.org/pipermail/biosql-l/2009-May/001514.html
>
> I understand that BioPerl now turns the SwissProt DE lines into a
> TagTree, and for storing this in BioSQL this gets serialised as XML.
> I would like Biopython to handle this the same way (although rather
> than a Perl TagTree, we'd use a Python structure of course), and
> would appreciate clarification of what exactly was implemented
> (e.g. which bit of the BioPerl source code should be look at,
> and could you show a worked example?).
>
> Andrea Pierlenoin (CC'd - not sure if he is on the BioSQL or
> Open-Bio lists yet) has started work on parsing UniProt XML
> files for Biopython. Here the DE comment lines are already
> provided broken up with XML markup. Hopefully their nested
> structure matches what BioPerl was doing with the SwissProt
> DE lines.
>
> Regards,
>
> Peter

From holland at eaglegenomics.com Fri Jan 22 05:51:52 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Fri, 22 Jan 2010 10:51:52 +0000
Subject: [Open-bio-l] [BioSQL-l] SwissProt DE lines and UniProt XML / TagTree as XML in BioSQL
In-Reply-To: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com>
References: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com>
Message-ID: <8FECCBDE-2DE1-40EE-B5A4-73BDAC893E2D@eaglegenomics.com>

Nice idea.
Currently, BioJava just stores the complete section as a string
without parsing it, but it provides a parser module for converting it
into useful tag/value format within a user's program (but not to be
stored in BioSQL).

On 21 Jan 2010, at 12:33, Peter wrote:

> Hi all,
>
> This is cross posted to try and ensure relevant people see it.
> I suggest we continue the discussion on the BioSQL list
> (for how to serialise structured annotation to BioSQL), and/or
> the OpenBio list (for things like file format naming conventions).
>
> I am hoping we (Bio*) can be consistent in how we parse and load
> into BioSQL the SwissProt DE lines (known as "swiss" format in
> both BioPerl and Biopython's SeqIO, and by EMBOSS) or the
> equivalent UniProt XML tags (which we are tentatively going to
> call the "uniprot" format in Biopython's SeqIO - comments?).
>
> Like BioPerl (etc), Biopython can parse plain text SwissProt ("swiss")
> files and load them into BioSQL. Biopython currently treats the DE
> comment lines as a long string, as BioPerl used to:
>
> http://lists.open-bio.org/pipermail/bioperl-l/2009-May/030041.html
> http://lists.open-bio.org/pipermail/biosql-l/2009-May/001514.html
>
> I understand that BioPerl now turns the SwissProt DE lines into a
> TagTree, and for storing this in BioSQL this gets serialised as XML.
> I would like Biopython to handle this the same way (although rather
> than a Perl TagTree, we'd use a Python structure of course), and
> would appreciate clarification of what exactly was implemented
> (e.g. which bit of the BioPerl source code should be look at,
> and could you show a worked example?).
>
> Andrea Pierlenoin (CC'd - not sure if he is on the BioSQL or
> Open-Bio lists yet) has started work on parsing UniProt XML
> files for Biopython. Here the DE comment lines are already
> provided broken up with XML markup. Hopefully their nested
> structure matches what BioPerl was doing with the SwissProt
> DE lines.
>
> Regards,
>
> Peter
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biosql-l

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From andrea at biocomp.unibo.it Fri Jan 22 07:18:32 2010
From: andrea at biocomp.unibo.it (Andrea Pierleoni)
Date: Fri, 22 Jan 2010 13:18:32 +0100 (CET)
Subject: [Open-bio-l] SwissProt DE lines and UniProt XML / TagTree as XML in BioSQL
In-Reply-To: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com>
References: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com>
Message-ID: <2b6e30c4628585042366646a7b46386e.squirrel@lipid.biocomp.unibo.it>

I think that the point here can be a little broader, since the
SwissProt DE lines are not the only ones that carry complex and
structured data.
Defining a common, language-independent way to store structured data
in the comment and *_qualifier_value tables of the current BioSQL
schema could be very useful.
XML looks like a good candidate to me, and the UniProt XML format can
be used as a reference or as a template to start from.
Each Bio* project would then parse and report this structured data in
its own programming language's data structures.

Andrea

> Hi all,
>
> This is cross posted to try and ensure relevant people see it.
> I suggest we continue the discussion on the BioSQL list
> (for how to serialise structured annotation to BioSQL), and/or
> the OpenBio list (for things like file format naming conventions).
>
> I am hoping we (Bio*) can be consistent in how we parse and load
> into BioSQL the SwissProt DE lines (known as "swiss" format in
> both BioPerl and Biopython's SeqIO, and by EMBOSS) or the
> equivalent UniProt XML tags (which we are tentatively going to
> call the "uniprot" format in Biopython's SeqIO - comments?).
>
> Like BioPerl (etc), Biopython can parse plain text SwissProt ("swiss")
> files and load them into BioSQL. Biopython currently treats the DE
> comment lines as a long string, as BioPerl used to:
>
> http://lists.open-bio.org/pipermail/bioperl-l/2009-May/030041.html
> http://lists.open-bio.org/pipermail/biosql-l/2009-May/001514.html
>
> I understand that BioPerl now turns the SwissProt DE lines into a
> TagTree, and for storing this in BioSQL this gets serialised as XML.
> I would like Biopython to handle this the same way (although rather
> than a Perl TagTree, we'd use a Python structure of course), and
> would appreciate clarification of what exactly was implemented
> (e.g. which bit of the BioPerl source code should be look at,
> and could you show a worked example?).
>
> Andrea Pierlenoin (CC'd - not sure if he is on the BioSQL or
> Open-Bio lists yet) has started work on parsing UniProt XML
> files for Biopython. Here the DE comment lines are already
> provided broken up with XML markup. Hopefully their nested
> structure matches what BioPerl was doing with the SwissProt
> DE lines.
>
> Regards,
>
> Peter
>

From biopython at maubp.freeserve.co.uk Fri Jan 29 05:36:40 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 29 Jan 2010 10:36:40 +0000
Subject: [Open-bio-l] [Bioperl-l] [MOBY-dev] OpenBio solution challenge: Project updates at BOSC 2010
In-Reply-To:
References: <20100128203505.GG40046@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e01001290236l1ad02515w403a19f94dbb6d15@mail.gmail.com>

Hi all,

This is a great topic but should we continue it on just the one
mailing list? Is there a suitable BOSC list, or how about the
general Open Bio list?

On Thu, Jan 28, 2010 at 9:17 PM, Mark Wilkinson wrote:
>
> Brad, this sounds exciting!
>
> One thing strikes me, though - by asking for the sub-projects to propose
> the "grand challenge" themselves the one thing you can guarantee is that
> the "grand challenge" is solvable (or more likely, already solved!)
>
> Other "grand challenge" kinds of meetings have an independent third party
> pose the problem that has to be solved, and then all groups work toward a
> solution and compare their results. This would, IMO, be more revealing of
> the "state of the art" in each Open-Bio project, and point out where the
> weaknesses are that we should be focusing on... Someone (for example,
> you!) could act as the moderator to ensure that the "grand challenge" was
> at least a reasonable one, within the scope of what an Open-Bio project
> *should* be able to solve...
>
> Just my CAD $0.02
>
> Mark

One possible problem with having Brad act as moderator is his ties
to Biopython (plus it would be a shame if we'd be one man down for
trying to solve the challenges - grin). Having a project representative
"sign off" on the challenge might work - or simply the whole of the
BOSC committee, which is quite balanced. Alternatively some kind of
panel of challenges does seem a good way to reduce individual project
bias (as suggested by Scooter), but there will still need to be a
judging committee.

I'm curious what kind of challenges the BOSC committee had in
mind - would something like taking a newly sequenced bacterium and
producing an automated annotation as a GenBank, EMBL, or GFF
file be too ambitious for example? There are already several major
projects to do this e.g.
RAST http://rast.nmpdr.org/ Peter (@Biopython) From biopython at maubp.freeserve.co.uk Fri Jan 8 17:33:02 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 8 Jan 2010 17:33:02 +0000 Subject: [Open-bio-l] GenBank and EMBL - join(complement(...)) vs complement(join(...)) Message-ID: <320fb6e01001080933w383767c1ya45e0a6a891308d6@mail.gmail.com> Hi all, Currently Biopython reads both GenBank and EMBL files, and write GenBank. I'm looking at writing EMBL files too - and wanted to see if any of you knew anything definitive on join(complement(...)) vs complement(join(...)) in feature location strings. References: http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html http://www.genbank.lipi.go.id/docs/FTv6_2.html Both give this in example, two ways for writing the same location: complement(join(2691..4571,4918..5163) Joins regions 2691 to 4571 and 4918 to 5163, then complements the joined segments (the feature is on the strand complementary to the presented strand) join(complement(4918..5163),complement(2691..4571)) Complements regions 4918 to 5163 and 2691 to 4571, then joins the complemented segments (the feature is on the strand complementary to the presented strand) This suggests that either form is valid in both GenBank and EMBL format files. Anecdotally, I have observed GenBank uses the first form (which is shorter) while EMBL seems to use the second form (which to me is logical, if you consider how to represent mixed strand features). This seems to fit with this BioPerl wiki page: http://www.bioperl.org/wiki/BioPerl_Locations Is there any official documentation regarding this discrepancy that I have overlooked? Am I right to think that GenBank and EMBL do still use these different forms (any word on if they might standardised one way or the other in future?) What do EMBOSS, BioPerl, etc do in this situation? 
Do you treat these two examples the same on parsing, and use one layout when writing GenBank and the other for writing EMBL files? Peter From cjfields at illinois.edu Sat Jan 9 02:54:41 2010 From: cjfields at illinois.edu (Chris Fields) Date: Fri, 8 Jan 2010 20:54:41 -0600 Subject: [Open-bio-l] GenBank and EMBL - join(complement(...)) vs complement(join(...)) In-Reply-To: <320fb6e01001080933w383767c1ya45e0a6a891308d6@mail.gmail.com> References: <320fb6e01001080933w383767c1ya45e0a6a891308d6@mail.gmail.com> Message-ID: <2184EB34-B947-44F3-A0D4-BC55CFF404B4@illinois.edu> On Jan 8, 2010, at 11:33 AM, Peter wrote: > Hi all, > > Currently Biopython reads both GenBank and EMBL files, and write GenBank. > I'm looking at writing EMBL files too - and wanted to see if any of you knew > anything definitive on join(complement(...)) vs complement(join(...)) in > feature location strings. > > References: > http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html > http://www.genbank.lipi.go.id/docs/FTv6_2.html > > Both give this in example, two ways for writing the same location: > > complement(join(2691..4571,4918..5163) > Joins regions 2691 to 4571 and 4918 to 5163, then > complements the joined segments (the feature is > on the strand complementary to the presented strand) > > join(complement(4918..5163),complement(2691..4571)) > Complements regions 4918 to 5163 and 2691 to 4571, > then joins the complemented segments (the feature is > on the strand complementary to the presented strand) > > This suggests that either form is valid in both GenBank and EMBL > format files. > > Anecdotally, I have observed GenBank uses the first form (which is > shorter) while EMBL seems to use the second form (which to me is > logical, if you consider how to represent mixed strand features). 
> This seems to fit with this BioPerl wiki page: > > http://www.bioperl.org/wiki/BioPerl_Locations > > Is there any official documentation regarding this discrepancy that > I have overlooked? Am I right to think that GenBank and EMBL do > still use these different forms (any word on if they might > standardised one way or the other in future?) > > What do EMBOSS, BioPerl, etc do in this situation? Do you treat > these two examples the same on parsing, and use one layout > when writing GenBank and the other for writing EMBL files? > > Peter I can't recall which of the two BioPerl uses, but if it helps it standardizes on one of them for output but parses both. I think GenBank and EMBL have converged on using the same format, but I'm not absolutely sure on that. Ironic actually that I can't remember, as I'm the author of the above page and started a discussion about this very subject a while back on the list (in an effort to sort out some issues with BioPerl locations). chris From biopython at maubp.freeserve.co.uk Mon Jan 11 10:42:52 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 11 Jan 2010 10:42:52 +0000 Subject: [Open-bio-l] GenBank and EMBL - join(complement(...)) vs complement(join(...)) In-Reply-To: <2184EB34-B947-44F3-A0D4-BC55CFF404B4@illinois.edu> References: <320fb6e01001080933w383767c1ya45e0a6a891308d6@mail.gmail.com> <2184EB34-B947-44F3-A0D4-BC55CFF404B4@illinois.edu> Message-ID: <320fb6e01001110242l4f26f80l465c61c7f1b97dd5@mail.gmail.com> On Sat, Jan 9, 2010 at 2:54 AM, Chris Fields wrote: > > I can't recall which of the two BioPerl uses, but if it helps it standardizes > on one of them for output but parses both. ?I think GenBank and EMBL > have converged on using the same format, but I'm not absolutely sure > on that. > > Ironic actually that I can't remember, as I'm the author of the above page > and started a discussion about this very subject a while back on the list > (in an effort to sort out some issues with BioPerl locations). 
> > chris Thanks Chris, I'm glad my email made sense - on re-reading I had made more typos than usual :( As to the BioPerl behaviour, I think I know enough to get BioPerl to convert GenBank files into EMBL or vice versa, and thus find out what it does... I hope you are right that GenBank and EMBL have converged on using the same format - any confirmation of this (and which format) would be very welcome. Peter (Rice), do you have any input? I noticed some work in the latest EMBOSS patch last month that touches on this issue: http://lists.open-bio.org/pipermail/emboss-announce/2009-December/000016.html Peter From biopython at maubp.freeserve.co.uk Mon Jan 11 14:42:51 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 11 Jan 2010 14:42:51 +0000 Subject: [Open-bio-l] GenBank and EMBL - join(complement(...)) vs complement(join(...)) In-Reply-To: <320fb6e01001110242l4f26f80l465c61c7f1b97dd5@mail.gmail.com> References: <320fb6e01001080933w383767c1ya45e0a6a891308d6@mail.gmail.com> <2184EB34-B947-44F3-A0D4-BC55CFF404B4@illinois.edu> <320fb6e01001110242l4f26f80l465c61c7f1b97dd5@mail.gmail.com> Message-ID: <320fb6e01001110642p777d5b65la2e97b5767bd3ae5@mail.gmail.com> On Mon, Jan 11, 2010 at 10:42 AM, Peter wrote: > On Sat, Jan 9, 2010 at 2:54 AM, Chris Fields wrote: >> >> I can't recall which of the two BioPerl uses, but if it helps it standardizes >> on one of them for output but parses both. ?I think GenBank and EMBL >> have converged on using the same format, but I'm not absolutely sure >> on that. >> >> Ironic actually that I can't remember, as I'm the author of the above page >> and started a discussion about this very subject a while back on the list >> (in an effort to sort out some issues with BioPerl locations). 
>> >> chris > > Thanks Chris, > > I'm glad my email made sense - on re-reading I had made more typos > than usual :( > > As to the BioPerl behaviour, I think I know enough to get BioPerl > to convert GenBank files into EMBL or vice versa, and thus find > out what it does... After stumbling over this issue, I made some progress: http://lists.open-bio.org/pipermail/bioperl-l/2010-January/031889.html > I hope you are right that GenBank and EMBL have converged on > using the same format - any confirmation of this (and which format) > would be very welcome. I took the Arabidopsis thaliana chloroplast complete genome as an example. This is AP000423 in EMBL, NC_000932 in GenBank (although there are some minor differences in the annotation). Looking at these files (both from early 2009), they seem to use the same feature location styles, e.g. for reverse strand joins: complement(join(97999..98793,69611..69724)) e.g. for mixed strand features: join(complement(69611..69724),139856..140087, 140625..140650) I'm going to assume that this is what both EMBL and GenBank will be using in future. I have confirmed that BioPerl 1.6.x preserves these style locations on converting EMBL/GenBank to EMBL/GenBank. I need to find a reverse strand "join(complement(..." example to test with now... 
Peter

From mark.schreiber at novartis.com Wed Jan 13 05:37:57 2010
From: mark.schreiber at novartis.com (mark.schreiber at novartis.com)
Date: Wed, 13 Jan 2010 13:37:57 +0800
Subject: [Open-bio-l] [Bosc] BOSC 2010 Request for Input
In-Reply-To: <57BC6503-C655-431C-88BB-0F63E1004BD6@drycafe.net>
Message-ID:

I might be coming to ISMB/BOSC this year

hand ++

- Mark

Hilmar Lapp
Sent by: open-bio-l-bounces at lists.open-bio.org
12/22/2009 07:39 AM
To Brad Chapman
cc bosc at open-bio.org, community at cloudbiolinux.com, open-bio-l at lists.open-bio.org
Subject Re: [Open-bio-l] [Bosc] BOSC 2010 Request for Input

On Dec 21, 2009, at 4:32 PM, Brad Chapman wrote:

> I can help organize a hackathon at Mass General Hospital, which is
> located fairly close to Hynes Convention Center, where the conference
> is on. I booked conference rooms for dates surrounding BOSC (July
> 6-8th and 11th) that will accommodate up to 30 developers from various
> projects.

Awesome!! Thanks so much Brad for stepping up so quickly and putting your foot down too!

> If we could get a show of hands from people who might be interested,
> this would help gauge the size and scope.

:)

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
===========================================================

From biopython at maubp.freeserve.co.uk Thu Jan 21 12:33:53 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 21 Jan 2010 12:33:53 +0000
Subject: [Open-bio-l] SwissProt DE lines and UniProt XML / TagTree as XML in BioSQL
Message-ID: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com>

Hi all,

This is cross posted to try and ensure relevant people see it.
I suggest we continue the discussion on the BioSQL list
(for how to serialise structured annotation to BioSQL), and/or
the OpenBio list (for things like file format naming conventions).

I am hoping we (Bio*) can be consistent in how we parse and load
into BioSQL the SwissProt DE lines (known as "swiss" format in
both BioPerl and Biopython's SeqIO, and by EMBOSS) or the
equivalent UniProt XML tags (which we are tentatively going to
call the "uniprot" format in Biopython's SeqIO - comments?).

Like BioPerl (etc), Biopython can parse plain text SwissProt ("swiss")
files and load them into BioSQL. Biopython currently treats the DE
comment lines as a long string, as BioPerl used to:

http://lists.open-bio.org/pipermail/bioperl-l/2009-May/030041.html
http://lists.open-bio.org/pipermail/biosql-l/2009-May/001514.html

I understand that BioPerl now turns the SwissProt DE lines into a
TagTree, and for storing this in BioSQL this gets serialised as XML.
I would like Biopython to handle this the same way (although rather
than a Perl TagTree, we'd use a Python structure of course), and
would appreciate clarification of what exactly was implemented
(e.g.
which bit of the BioPerl source code should we look at,
and could you show a worked example?).

Andrea Pierleoni (CC'd - not sure if he is on the BioSQL or
Open-Bio lists yet) has started work on parsing UniProt XML
files for Biopython. Here the DE comment lines are already
provided broken up with XML markup. Hopefully their nested
structure matches what BioPerl was doing with the SwissProt
DE lines.

Regards,

Peter

From cjfields at illinois.edu Thu Jan 21 13:34:12 2010
From: cjfields at illinois.edu (Chris Fields)
Date: Thu, 21 Jan 2010 07:34:12 -0600
Subject: [Open-bio-l] SwissProt DE lines and UniProt XML / TagTree as XML in BioSQL
In-Reply-To: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com>
References: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com>
Message-ID:

Peter,

The relevant code is in Bio::Annotation::TagTree in bioperl-live, which is a decorator for Data::Stag:

http://search.cpan.org/~cmungall/Data-Stag-0.11/Data/Stag.pm

This is where the text output is derived from. It's a bit of a heavyweight solution to the problem, but it's capable of round-tripping the DE data and parses out the data in a way that's approachable. We could probably abstract out the serialization backend there and allow a pure bioperl solution (or the current solution) as a fallback.

If the plain-text DE info is represented in a hierarchy already in UniProt XML, we should probably conform as closely as possible to that (using a standard format like XML, JSON, etc.).

chris

On Jan 21, 2010, at 6:33 AM, Peter wrote:

> Hi all,
>
> This is cross posted to try and ensure relevant people see it.
> I suggest we continue the discussion on the BioSQL list
> (for how to serialise structured annotation to BioSQL), and/or
> the OpenBio list (for things like file format naming conventions).
>
> I am hoping we (Bio*) can be consistent in how we parse and load
> into BioSQL the SwissProt DE lines (known as "swiss" format in
> both BioPerl and Biopython's SeqIO, and by EMBOSS) or the
> equivalent UniProt XML tags (which we are tentatively going to
> call the "uniprot" format in Biopython's SeqIO - comments?).
>
> Like BioPerl (etc), Biopython can parse plain text SwissProt ("swiss")
> files and load them into BioSQL. Biopython currently treats the DE
> comment lines as a long string, as BioPerl used to:
>
> http://lists.open-bio.org/pipermail/bioperl-l/2009-May/030041.html
> http://lists.open-bio.org/pipermail/biosql-l/2009-May/001514.html
>
> I understand that BioPerl now turns the SwissProt DE lines into a
> TagTree, and for storing this in BioSQL this gets serialised as XML.
> I would like Biopython to handle this the same way (although rather
> than a Perl TagTree, we'd use a Python structure of course), and
> would appreciate clarification of what exactly was implemented
> (e.g. which bit of the BioPerl source code should we look at,
> and could you show a worked example?).
>
> Andrea Pierleoni (CC'd - not sure if he is on the BioSQL or
> Open-Bio lists yet) has started work on parsing UniProt XML
> files for Biopython. Here the DE comment lines are already
> provided broken up with XML markup. Hopefully their nested
> structure matches what BioPerl was doing with the SwissProt
> DE lines.
>
> Regards,
>
> Peter

From holland at eaglegenomics.com Fri Jan 22 10:51:52 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Fri, 22 Jan 2010 10:51:52 +0000
Subject: [Open-bio-l] [BioSQL-l] SwissProt DE lines and UniProt XML / TagTree as XML in BioSQL
In-Reply-To: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com>
References: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com>
Message-ID: <8FECCBDE-2DE1-40EE-B5A4-73BDAC893E2D@eaglegenomics.com>

Nice idea.

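To make the proposal concrete, here is a minimal Python sketch of splitting structured SwissProt DE lines into a nested structure and serialising that as XML. It is only an illustration under stated assumptions: it is not what Bio::Annotation::TagTree or Data::Stag emit, the XML element names (description, value) are invented here and do not follow the UniProt XML schema, the helper names parse_de and to_xml are hypothetical, and the sample record is a made-up DE block in the RecName/AltName/Flags layout. Includes:/Contains: sub-sections and evidence tags are not handled.

```python
import re
import xml.etree.ElementTree as ET

# Illustrative DE block in the structured (RecName/AltName/Flags) layout.
DE_LINES = """\
DE   RecName: Full=Annexin A5;
DE            Short=Annexin-5;
DE   AltName: Full=Annexin V;
DE   Flags: Precursor;
"""


def parse_de(text):
    """Return [(category, [(key, value), ...]), ...] from DE lines."""
    entries = []
    for raw in text.splitlines():
        if not raw.startswith("DE"):
            continue
        body = raw[2:].strip().rstrip(";")
        match = re.match(r"^(\w+):\s*(.*)$", body)
        if match:
            # A "Category:" prefix starts a new group, e.g. RecName or AltName.
            category, body = match.groups()
            entries.append((category, []))
        if not entries:
            continue  # continuation line with no category seen yet; skip it
        if "=" in body:
            key, value = body.split("=", 1)
        else:
            key, value = "value", body  # e.g. "Flags: Precursor"
        entries[-1][1].append((key, value))
    return entries


def to_xml(entries):
    """Serialise the nested structure as an XML string (made-up element names)."""
    root = ET.Element("description")
    for category, pairs in entries:
        node = ET.SubElement(root, category)
        for key, value in pairs:
            ET.SubElement(node, key).text = value
    return ET.tostring(root, encoding="unicode")


entries = parse_de(DE_LINES)
print(entries)
print(to_xml(entries))
```

A string like this could be stored in a single text column, with each Bio* project mapping it back to its own native structure on retrieval.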
Currently, BioJava just stores the complete section as a string without parsing it, but it provides a parser module for converting it into a useful tag/value format within a user's program (but not to be stored in BioSQL).

On 21 Jan 2010, at 12:33, Peter wrote:

> Hi all,
>
> This is cross posted to try and ensure relevant people see it.
> I suggest we continue the discussion on the BioSQL list
> (for how to serialise structured annotation to BioSQL), and/or
> the OpenBio list (for things like file format naming conventions).
>
> I am hoping we (Bio*) can be consistent in how we parse and load
> into BioSQL the SwissProt DE lines (known as "swiss" format in
> both BioPerl and Biopython's SeqIO, and by EMBOSS) or the
> equivalent UniProt XML tags (which we are tentatively going to
> call the "uniprot" format in Biopython's SeqIO - comments?).
>
> Like BioPerl (etc), Biopython can parse plain text SwissProt ("swiss")
> files and load them into BioSQL. Biopython currently treats the DE
> comment lines as a long string, as BioPerl used to:
>
> http://lists.open-bio.org/pipermail/bioperl-l/2009-May/030041.html
> http://lists.open-bio.org/pipermail/biosql-l/2009-May/001514.html
>
> I understand that BioPerl now turns the SwissProt DE lines into a
> TagTree, and for storing this in BioSQL this gets serialised as XML.
> I would like Biopython to handle this the same way (although rather
> than a Perl TagTree, we'd use a Python structure of course), and
> would appreciate clarification of what exactly was implemented
> (e.g. which bit of the BioPerl source code should we look at,
> and could you show a worked example?).
>
> Andrea Pierleoni (CC'd - not sure if he is on the BioSQL or
> Open-Bio lists yet) has started work on parsing UniProt XML
> files for Biopython. Here the DE comment lines are already
> provided broken up with XML markup. Hopefully their nested
> structure matches what BioPerl was doing with the SwissProt
> DE lines.
>
> Regards,
>
> Peter

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/

From andrea at biocomp.unibo.it Fri Jan 22 12:18:32 2010
From: andrea at biocomp.unibo.it (Andrea Pierleoni)
Date: Fri, 22 Jan 2010 13:18:32 +0100 (CET)
Subject: [Open-bio-l] SwissProt DE lines and UniProt XML / TagTree as XML in BioSQL
In-Reply-To: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com>
References: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com>
Message-ID: <2b6e30c4628585042366646a7b46386e.squirrel@lipid.biocomp.unibo.it>

I think that the point here can be a little broader, since the SwissProt DE lines are not the only ones that carry complex, structured data. Defining a common, language-independent way to store structured data in the comment and *_qualifier_value tables of the current BioSQL schema could be very useful. XML looks like a good candidate to me, and the UniprotXML format can be used as a reference or as a template to start from. Each Bio* project will then parse this structured data and report it in its own programming language's data structures.

Andrea

> Hi all,
>
> This is cross posted to try and ensure relevant people see it.
> I suggest we continue the discussion on the BioSQL list
> (for how to serialise structured annotation to BioSQL), and/or
> the OpenBio list (for things like file format naming conventions).
>
> I am hoping we (Bio*) can be consistent in how we parse and load
> into BioSQL the SwissProt DE lines (known as "swiss" format in
> both BioPerl and Biopython's SeqIO, and by EMBOSS) or the
> equivalent UniProt XML tags (which we are tentatively going to
> call the "uniprot" format in Biopython's SeqIO - comments?).
>
> Like BioPerl (etc), Biopython can parse plain text SwissProt ("swiss")
> files and load them into BioSQL. Biopython currently treats the DE
> comment lines as a long string, as BioPerl used to:
>
> http://lists.open-bio.org/pipermail/bioperl-l/2009-May/030041.html
> http://lists.open-bio.org/pipermail/biosql-l/2009-May/001514.html
>
> I understand that BioPerl now turns the SwissProt DE lines into a
> TagTree, and for storing this in BioSQL this gets serialised as XML.
> I would like Biopython to handle this the same way (although rather
> than a Perl TagTree, we'd use a Python structure of course), and
> would appreciate clarification of what exactly was implemented
> (e.g. which bit of the BioPerl source code should we look at,
> and could you show a worked example?).
>
> Andrea Pierleoni (CC'd - not sure if he is on the BioSQL or
> Open-Bio lists yet) has started work on parsing UniProt XML
> files for Biopython. Here the DE comment lines are already
> provided broken up with XML markup. Hopefully their nested
> structure matches what BioPerl was doing with the SwissProt
> DE lines.
>
> Regards,
>
> Peter
>

From biopython at maubp.freeserve.co.uk Fri Jan 29 10:36:40 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 29 Jan 2010 10:36:40 +0000
Subject: [Open-bio-l] [Bioperl-l] [MOBY-dev] OpenBio solution challenge: Project updates at BOSC 2010
In-Reply-To:
References: <20100128203505.GG40046@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e01001290236l1ad02515w403a19f94dbb6d15@mail.gmail.com>

Hi all,

This is a great topic but should we continue it on just the one mailing list? Is there a suitable BOSC list, or how about the general Open Bio list?

On Thu, Jan 28, 2010 at 9:17 PM, Mark Wilkinson wrote:
>
> Brad, this sounds exciting!
>
> One thing strikes me, though - by asking for the sub-projects to propose
> the "grand challenge" themselves the one thing you can guarantee is that
> the "grand challenge" is solvable (or more likely, already solved!)
>
> Other "grand challenge" kinds of meetings have an independent third party
> pose the problem that has to be solved, and then all groups work toward a
> solution and compare their results. This would, IMO, be more revealing of
> the "state of the art" in each Open-Bio project, and point out where the
> weaknesses are that we should be focusing on... Someone (for example,
> you!) could act as the moderator to ensure that the "grand challenge" was
> at least a reasonable one, within the scope of what an Open-Bio project
> *should* be able to solve...
>
> Just my CAD $0.02
>
> Mark

One possible problem with having Brad act as moderator is his ties to Biopython (plus it would be a shame if we were one man down when trying to solve the challenges - grin). Having a project representative "sign off" on the challenge might work - or simply the whole of the BOSC committee, which is quite balanced.

Alternatively some kind of panel of challenges does seem a good way to reduce individual project bias (as suggested by Scooter), but there will still need to be a judging committee.

I'm curious what kind of challenges the BOSC committee had in mind - would something like taking a newly sequenced bacterium and producing an automated annotation as a GenBank, EMBL, or GFF file be too ambitious, for example? There are already several major projects to do this, e.g. RAST http://rast.nmpdr.org/

Peter (@Biopython)