From biopython at maubp.freeserve.co.uk  Fri Jan  8 12:33:02 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 8 Jan 2010 17:33:02 +0000
Subject: [Open-bio-l] GenBank and EMBL - join(complement(...)) vs
	complement(join(...))
Message-ID: <320fb6e01001080933w383767c1ya45e0a6a891308d6@mail.gmail.com>

Hi all,

Currently Biopython reads both GenBank and EMBL files, and writes GenBank.
I'm looking at writing EMBL files too - and wanted to see if any of you knew
anything definitive on join(complement(...)) vs complement(join(...)) in
feature location strings.

References:
http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html
http://www.genbank.lipi.go.id/docs/FTv6_2.html

Both give this as an example, two ways of writing the same location:

complement(join(2691..4571,4918..5163))
                          Joins regions 2691 to 4571 and 4918 to 5163, then
                          complements the joined segments (the feature is
                          on the strand complementary to the presented strand)

join(complement(4918..5163),complement(2691..4571))
                          Complements regions 4918 to 5163 and 2691 to 4571,
                          then joins the complemented segments (the feature is
                          on the strand complementary to the presented strand)

This suggests that either form is valid in both GenBank and EMBL
format files.
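
To make the claimed equivalence concrete, here is a minimal Python sketch
(not Biopython's actual parser) that reduces either form to the same list of
(start, end, strand) parts in reading order. The function names
parse_location and split_top_level are hypothetical, and the sketch assumes
only plain start..end ranges, with no fuzzy positions or remote references.

```python
import re


def split_top_level(text):
    """Split on commas that are not nested inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def parse_location(text):
    """Return [(start, end, strand), ...] in the order the feature is read."""
    expr = "".join(text.split())  # tolerate spaces after commas
    if expr.startswith("complement(") and expr.endswith(")"):
        inner = parse_location(expr[len("complement("):-1])
        # Complementing flips every strand and reverses the reading order.
        return [(start, end, -strand) for start, end, strand in reversed(inner)]
    if expr.startswith("join(") and expr.endswith(")"):
        parts = []
        for piece in split_top_level(expr[len("join("):-1]):
            parts.extend(parse_location(piece))
        return parts
    match = re.fullmatch(r"(\d+)\.\.(\d+)", expr)
    if match is None:
        raise ValueError("unsupported location: %r" % text)
    return [(int(match.group(1)), int(match.group(2)), +1)]


first = parse_location("complement(join(2691..4571,4918..5163))")
second = parse_location("join(complement(4918..5163),complement(2691..4571))")
print(first == second)
```

Under these assumptions both strings give the parts 4918..5163 and then
2691..4571, each on the reverse strand, so a parser built this way would
treat the two layouts identically.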

Anecdotally, I have observed GenBank uses the first form (which is
shorter) while EMBL seems to use the second form (which to me is
logical, if you consider how to represent mixed strand features).
This seems to fit with this BioPerl wiki page:

http://www.bioperl.org/wiki/BioPerl_Locations

Is there any official documentation regarding this discrepancy that
I have overlooked? Am I right to think that GenBank and EMBL do
still use these different forms (any word on whether they might
standardise one way or the other in future)?

What do EMBOSS, BioPerl, etc do in this situation? Do you treat
these two examples the same on parsing, and use one layout
when writing GenBank and the other for writing EMBL files?

Peter

From cjfields at illinois.edu  Fri Jan  8 21:54:41 2010
From: cjfields at illinois.edu (Chris Fields)
Date: Fri, 8 Jan 2010 20:54:41 -0600
Subject: [Open-bio-l] GenBank and EMBL - join(complement(...)) vs
	complement(join(...))
In-Reply-To: <320fb6e01001080933w383767c1ya45e0a6a891308d6@mail.gmail.com>
References: <320fb6e01001080933w383767c1ya45e0a6a891308d6@mail.gmail.com>
Message-ID: <2184EB34-B947-44F3-A0D4-BC55CFF404B4@illinois.edu>

On Jan 8, 2010, at 11:33 AM, Peter wrote:

> Hi all,
> 
> Currently Biopython reads both GenBank and EMBL files, and writes GenBank.
> I'm looking at writing EMBL files too - and wanted to see if any of you knew
> anything definitive on join(complement(...)) vs complement(join(...)) in
> feature location strings.
> 
> References:
> http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html
> http://www.genbank.lipi.go.id/docs/FTv6_2.html
> 
> Both give this as an example, two ways of writing the same location:
> 
> complement(join(2691..4571,4918..5163))
>                          Joins regions 2691 to 4571 and 4918 to 5163, then
>                          complements the joined segments (the feature is
>                          on the strand complementary to the presented strand)
> 
> join(complement(4918..5163),complement(2691..4571))
>                          Complements regions 4918 to 5163 and 2691 to 4571,
>                          then joins the complemented segments (the feature is
>                          on the strand complementary to the presented strand)
> 
> This suggests that either form is valid in both GenBank and EMBL
> format files.
> 
> Anecdotally, I have observed GenBank uses the first form (which is
> shorter) while EMBL seems to use the second form (which to me is
> logical, if you consider how to represent mixed strand features).
> This seems to fit with this BioPerl wiki page:
> 
> http://www.bioperl.org/wiki/BioPerl_Locations
> 
> Is there any official documentation regarding this discrepancy that
> I have overlooked? Am I right to think that GenBank and EMBL do
> still use these different forms (any word on whether they might
> standardise one way or the other in future)?
> 
> What do EMBOSS, BioPerl, etc do in this situation? Do you treat
> these two examples the same on parsing, and use one layout
> when writing GenBank and the other for writing EMBL files?
> 
> Peter

I can't recall which of the two BioPerl uses, but if it helps it standardizes on one of them for output but parses both.  I think GenBank and EMBL have converged on using the same format, but I'm not absolutely sure on that.

Ironic actually that I can't remember, as I'm the author of the above page and started a discussion about this very subject a while back on the list (in an effort to sort out some issues with BioPerl locations).

chris


From biopython at maubp.freeserve.co.uk  Mon Jan 11 05:42:52 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 11 Jan 2010 10:42:52 +0000
Subject: [Open-bio-l] GenBank and EMBL - join(complement(...)) vs
	complement(join(...))
In-Reply-To: <2184EB34-B947-44F3-A0D4-BC55CFF404B4@illinois.edu>
References: <320fb6e01001080933w383767c1ya45e0a6a891308d6@mail.gmail.com>
	<2184EB34-B947-44F3-A0D4-BC55CFF404B4@illinois.edu>
Message-ID: <320fb6e01001110242l4f26f80l465c61c7f1b97dd5@mail.gmail.com>

On Sat, Jan 9, 2010 at 2:54 AM, Chris Fields <cjfields at illinois.edu> wrote:
>
> I can't recall which of the two BioPerl uses, but if it helps it standardizes
> on one of them for output but parses both. I think GenBank and EMBL
> have converged on using the same format, but I'm not absolutely sure
> on that.
>
> Ironic actually that I can't remember, as I'm the author of the above page
> and started a discussion about this very subject a while back on the list
> (in an effort to sort out some issues with BioPerl locations).
>
> chris

Thanks Chris,

I'm glad my email made sense - on re-reading I had made more typos
than usual :(

As to the BioPerl behaviour, I think I know enough to get BioPerl
to convert GenBank files into EMBL or vice versa, and thus find
out what it does...

I hope you are right that GenBank and EMBL have converged on
using the same format - any confirmation of this (and which format)
would be very welcome.

Peter (Rice), do you have any input? I noticed some work in the
latest EMBOSS patch last month that touches on this issue:
http://lists.open-bio.org/pipermail/emboss-announce/2009-December/000016.html

Peter


From biopython at maubp.freeserve.co.uk  Mon Jan 11 09:42:51 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 11 Jan 2010 14:42:51 +0000
Subject: [Open-bio-l] GenBank and EMBL - join(complement(...)) vs
	complement(join(...))
In-Reply-To: <320fb6e01001110242l4f26f80l465c61c7f1b97dd5@mail.gmail.com>
References: <320fb6e01001080933w383767c1ya45e0a6a891308d6@mail.gmail.com>
	<2184EB34-B947-44F3-A0D4-BC55CFF404B4@illinois.edu>
	<320fb6e01001110242l4f26f80l465c61c7f1b97dd5@mail.gmail.com>
Message-ID: <320fb6e01001110642p777d5b65la2e97b5767bd3ae5@mail.gmail.com>

On Mon, Jan 11, 2010 at 10:42 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Sat, Jan 9, 2010 at 2:54 AM, Chris Fields <cjfields at illinois.edu> wrote:
>>
>> I can't recall which of the two BioPerl uses, but if it helps it standardizes
>> on one of them for output but parses both. I think GenBank and EMBL
>> have converged on using the same format, but I'm not absolutely sure
>> on that.
>>
>> Ironic actually that I can't remember, as I'm the author of the above page
>> and started a discussion about this very subject a while back on the list
>> (in an effort to sort out some issues with BioPerl locations).
>>
>> chris
>
> Thanks Chris,
>
> I'm glad my email made sense - on re-reading I had made more typos
> than usual :(
>
> As to the BioPerl behaviour, I think I know enough to get BioPerl
> to convert GenBank files into EMBL or vice versa, and thus find
> out what it does...

After stumbling over this issue, I made some progress:
http://lists.open-bio.org/pipermail/bioperl-l/2010-January/031889.html

> I hope you are right that GenBank and EMBL have converged on
> using the same format - any confirmation of this (and which format)
> would be very welcome.

I took the Arabidopsis thaliana chloroplast complete genome as
an example. This is AP000423 in EMBL, NC_000932 in GenBank
(although there are some minor differences in the annotation).
Looking at these files (both from early 2009), they seem to use the
same feature location styles, e.g. for reverse strand joins:

complement(join(97999..98793,69611..69724))

e.g. for mixed strand features:

join(complement(69611..69724),139856..140087, 140625..140650)

I'm going to assume that this is what both EMBL and GenBank
will be using in future.
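
For a writer that follows this convention, a minimal sketch might look like
the following. It is not Biopython's or BioPerl's implementation. The name
format_location is hypothetical, and it assumes the feature is held as a list
of (start, end, strand) parts in reading order.

```python
def format_location(parts):
    """Format [(start, end, strand), ...] (reading order) as a location string."""

    def span(start, end):
        return "%d..%d" % (start, end)

    if len(parts) == 1:
        start, end, strand = parts[0]
        text = span(start, end)
        return "complement(%s)" % text if strand < 0 else text

    if all(strand < 0 for _, _, strand in parts):
        # All parts on the reverse strand: wrap a single complement() around
        # the join, listing the parts in the opposite of reading order.
        inner = ",".join(span(s, e) for s, e, _ in reversed(parts))
        return "complement(join(%s))" % inner

    # Forward-only or mixed strand: complement individual parts as needed.
    pieces = []
    for start, end, strand in parts:
        text = span(start, end)
        pieces.append("complement(%s)" % text if strand < 0 else text)
    return "join(%s)" % ",".join(pieces)


print(format_location([(69611, 69724, -1), (97999, 98793, -1)]))
print(format_location([(69611, 69724, -1), (139856, 140087, 1), (140625, 140650, 1)]))
```

With the two chloroplast examples above as input, this prints
complement(join(97999..98793,69611..69724)) and
join(complement(69611..69724),139856..140087,140625..140650).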

I have confirmed that BioPerl 1.6.x preserves these style locations
on converting EMBL/GenBank to EMBL/GenBank. I need to find
a reverse strand "join(complement(..." example to test with now...

Peter


From mark.schreiber at novartis.com  Wed Jan 13 00:37:57 2010
From: mark.schreiber at novartis.com (mark.schreiber at novartis.com)
Date: Wed, 13 Jan 2010 13:37:57 +0800
Subject: [Open-bio-l] [Bosc] BOSC 2010 Request for Input
In-Reply-To: <57BC6503-C655-431C-88BB-0F63E1004BD6@drycafe.net>
Message-ID: <OF320EDFA1.882A9074-ON482576AA.001EE24B-482576AA.001EF698@ah.novartis.com>

I might be coming to ISMB/BOSC this year

hand ++

- Mark


Hilmar Lapp <hlapp at drycafe.net> 
Sent by: open-bio-l-bounces at lists.open-bio.org
12/22/2009 07:39 AM

To
Brad Chapman <chapmanb at 50mail.com>
cc
bosc at open-bio.org, community at cloudbiolinux.com, 
open-bio-l at lists.open-bio.org
Subject
Re: [Open-bio-l] [Bosc] BOSC 2010 Request for Input


On Dec 21, 2009, at 4:32 PM, Brad Chapman wrote:

> I can help organize a hackathon at Mass General Hospital, which is
> located fairly close to Hynes Convention Center, where the conference
> is on. I booked conference rooms for dates surrounding BOSC (July
> 6-8th and 11th) that will accommodate up to 30 developers from various
> projects.

Awesome!! Thanks so much Brad for stepping up so quickly and putting 
your foot down too!

> If we could get a show of hands from people who might be interested, 
> this would help gauge the size and scope.


<shows_both_hands/> :)

                 -hilmar

-- 
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
===========================================================



From biopython at maubp.freeserve.co.uk  Thu Jan 21 07:33:53 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 21 Jan 2010 12:33:53 +0000
Subject: [Open-bio-l] SwissProt DE lines and UniProt XML / TagTree as XML in
	BioSQL
Message-ID: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com>

Hi all,

This is cross posted to try and ensure relevant people see it.
I suggest we continue the discussion on the BioSQL list
(for how to serialise structured annotation to BioSQL), and/or
the OpenBio list (for things like file format naming conventions).

I am hoping we (Bio*) can be consistent in how we parse and load
into BioSQL the SwissProt DE lines (known as "swiss" format in
both BioPerl and Biopython's SeqIO, and by EMBOSS) or the
equivalent UniProt XML tags (which we are tentatively going to
call the "uniprot" format in Biopython's SeqIO - comments?).

Like BioPerl (etc), Biopython can parse plain text SwissProt ("swiss")
files and load them into BioSQL. Biopython currently treats the DE
comment lines as a long string, as BioPerl used to:

http://lists.open-bio.org/pipermail/bioperl-l/2009-May/030041.html
http://lists.open-bio.org/pipermail/biosql-l/2009-May/001514.html

I understand that BioPerl now turns the SwissProt DE lines into a
TagTree, and for storing this in BioSQL this gets serialised as XML.
I would like Biopython to handle this the same way (although rather
than a Perl TagTree, we'd use a Python structure of course), and
would appreciate clarification of what exactly was implemented
(e.g. which bit of the BioPerl source code should we look at,
and could you show a worked example?).

Andrea Pierleoni (CC'd - not sure if he is on the BioSQL or
Open-Bio lists yet) has started work on parsing UniProt XML
files for Biopython. Here the DE comment lines are already
provided broken up with XML markup. Hopefully their nested
structure matches what BioPerl was doing with the SwissProt
DE lines.
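
As a rough illustration of the kind of nested structure under discussion,
here is a minimal Python sketch that breaks structured DE lines into
category/field pairs and serialises them as XML. This is not BioPerl's
TagTree output and not the UniProt XML schema: the function names and the
"description" root element are hypothetical, and the sketch assumes every
block starts with a category such as RecName or AltName and ignores
Includes/Contains sections and Flags.

```python
import xml.etree.ElementTree as ET


def parse_de_lines(lines):
    """Parse DE lines into [(category, [(key, value), ...]), ...]."""
    names = []
    for line in lines:
        body = line[2:].strip()  # drop the leading "DE" line code
        if not body:
            continue
        if ": " in body.split("=", 1)[0]:
            # A new category, e.g. "RecName: Full=..."
            category, body = body.split(": ", 1)
            names.append((category, []))
        key, value = body.rstrip(";").split("=", 1)
        names[-1][1].append((key, value))
    return names


def de_to_xml(names):
    """Serialise the nested structure as an XML string (element names assumed)."""
    root = ET.Element("description")
    for category, fields in names:
        node = ET.SubElement(root, category)
        for key, value in fields:
            ET.SubElement(node, key).text = value
    return ET.tostring(root, encoding="unicode")


example = [
    "DE   RecName: Full=Annexin A5;",
    "DE            Short=Annexin-5;",
    "DE   AltName: Full=Annexin V;",
]
print(de_to_xml(parse_de_lines(example)))
```

Whatever the final element names are, the point is that each Bio* project
could rebuild its own native structure from the same serialised string.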

Regards,

Peter

From cjfields at illinois.edu  Thu Jan 21 08:34:12 2010
From: cjfields at illinois.edu (Chris Fields)
Date: Thu, 21 Jan 2010 07:34:12 -0600
Subject: [Open-bio-l] SwissProt DE lines and UniProt XML / TagTree as
	XML in BioSQL
In-Reply-To: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com>
References: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com>
Message-ID: <A6F5F623-2750-4BB0-91F7-5A87BABE367B@illinois.edu>

Peter,

The relevant code is in Bio::Annotation::TagTree in bioperl-live, which is a decorator for Data::Stag:

http://search.cpan.org/~cmungall/Data-Stag-0.11/Data/Stag.pm

This is where the text output is derived from.  It's a bit of a heavyweight solution to the problem, but it's capable of round-tripping the DE data and parses out the data in a way that's approachable.  We could probably abstract out the serialization backend there and allow a pure bioperl solution (or the current solution) as a fallback. 

If the plain-text DE info is represented in a hierarchy already in UniProt XML, we should probably conform as closely as possible to that (using a standard format like XML, JSON, etc.).  

chris

On Jan 21, 2010, at 6:33 AM, Peter wrote:

> Hi all,
> 
> This is cross posted to try and ensure relevant people see it.
> I suggest we continue the discussion on the BioSQL list
> (for how to serialise structured annotation to BioSQL), and/or
> the OpenBio list (for things like file format naming conventions).
> 
> I am hoping we (Bio*) can be consistent in how we parse and load
> into BioSQL the SwissProt DE lines (known as "swiss" format in
> both BioPerl and Biopython's SeqIO, and by EMBOSS) or the
> equivalent UniProt XML tags (which we are tentatively going to
> call the "uniprot" format in Biopython's SeqIO - comments?).
> 
> Like BioPerl (etc), Biopython can parse plain text SwissProt ("swiss")
> files and load them into BioSQL. Biopython currently treats the DE
> comment lines as a long string, as BioPerl used to:
> 
> http://lists.open-bio.org/pipermail/bioperl-l/2009-May/030041.html
> http://lists.open-bio.org/pipermail/biosql-l/2009-May/001514.html
> 
> I understand that BioPerl now turns the SwissProt DE lines into a
> TagTree, and for storing this in BioSQL this gets serialised as XML.
> I would like Biopython to handle this the same way (although rather
> than a Perl TagTree, we'd use a Python structure of course), and
> would appreciate clarification of what exactly was implemented
> (e.g. which bit of the BioPerl source code should we look at,
> and could you show a worked example?).
> 
> Andrea Pierleoni (CC'd - not sure if he is on the BioSQL or
> Open-Bio lists yet) has started work on parsing UniProt XML
> files for Biopython. Here the DE comment lines are already
> provided broken up with XML markup. Hopefully their nested
> structure matches what BioPerl was doing with the SwissProt
> DE lines.
> 
> Regards,
> 
> Peter


From holland at eaglegenomics.com  Fri Jan 22 05:51:52 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Fri, 22 Jan 2010 10:51:52 +0000
Subject: [Open-bio-l] [BioSQL-l] SwissProt DE lines and UniProt XML /
	TagTree as XML in BioSQL
In-Reply-To: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com>
References: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com>
Message-ID: <8FECCBDE-2DE1-40EE-B5A4-73BDAC893E2D@eaglegenomics.com>

Nice idea. Currently, BioJava just stores the complete section as a string without parsing it, but it provides a parser module for converting it into a useful tag/value format within a user's program (though not for storage in BioSQL).

On 21 Jan 2010, at 12:33, Peter wrote:

> Hi all,
> 
> This is cross posted to try and ensure relevant people see it.
> I suggest we continue the discussion on the BioSQL list
> (for how to serialise structured annotation to BioSQL), and/or
> the OpenBio list (for things like file format naming conventions).
> 
> I am hoping we (Bio*) can be consistent in how we parse and load
> into BioSQL the SwissProt DE lines (known as "swiss" format in
> both BioPerl and Biopython's SeqIO, and by EMBOSS) or the
> equivalent UniProt XML tags (which we are tentatively going to
> call the "uniprot" format in Biopython's SeqIO - comments?).
> 
> Like BioPerl (etc), Biopython can parse plain text SwissProt ("swiss")
> files and load them into BioSQL. Biopython currently treats the DE
> comment lines as a long string, as BioPerl used to:
> 
> http://lists.open-bio.org/pipermail/bioperl-l/2009-May/030041.html
> http://lists.open-bio.org/pipermail/biosql-l/2009-May/001514.html
> 
> I understand that BioPerl now turns the SwissProt DE lines into a
> TagTree, and for storing this in BioSQL this gets serialised as XML.
> I would like Biopython to handle this the same way (although rather
> than a Perl TagTree, we'd use a Python structure of course), and
> would appreciate clarification of what exactly was implemented
> (e.g. which bit of the BioPerl source code should we look at,
> and could you show a worked example?).
> 
> Andrea Pierleoni (CC'd - not sure if he is on the BioSQL or
> Open-Bio lists yet) has started work on parsing UniProt XML
> files for Biopython. Here the DE comment lines are already
> provided broken up with XML markup. Hopefully their nested
> structure matches what BioPerl was doing with the SwissProt
> DE lines.
> 
> Regards,
> 
> Peter

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


From andrea at biocomp.unibo.it  Fri Jan 22 07:18:32 2010
From: andrea at biocomp.unibo.it (Andrea Pierleoni)
Date: Fri, 22 Jan 2010 13:18:32 +0100 (CET)
Subject: [Open-bio-l] SwissProt DE lines and UniProt XML / TagTree as
	XML in BioSQL
In-Reply-To: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com>
References: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com>
Message-ID: <2b6e30c4628585042366646a7b46386e.squirrel@lipid.biocomp.unibo.it>

I think the point here is a little broader, since the SwissProt DE
lines are not the only ones that carry complex, structured data.
Defining a common, language-independent way to store structured data in
the comment and *_qualifier_value tables of the current BioSQL schema
would be very useful.
XML looks like a good candidate to me, and the UniProt XML format can be
used as a reference or as a template to start from.
Each Bio* project would then parse and report this structured data in its
own programming language's data structures.

Andrea


> Hi all,
>
> This is cross posted to try and ensure relevant people see it.
> I suggest we continue the discussion on the BioSQL list
> (for how to serialise structured annotation to BioSQL), and/or
> the OpenBio list (for things like file format naming conventions).
>
> I am hoping we (Bio*) can be consistent in how we parse and load
> into BioSQL the SwissProt DE lines (known as "swiss" format in
> both BioPerl and Biopython's SeqIO, and by EMBOSS) or the
> equivalent UniProt XML tags (which we are tentatively going to
> call the "uniprot" format in Biopython's SeqIO - comments?).
>
> Like BioPerl (etc), Biopython can parse plain text SwissProt ("swiss")
> files and load them into BioSQL. Biopython currently treats the DE
> comment lines as a long string, as BioPerl used to:
>
> http://lists.open-bio.org/pipermail/bioperl-l/2009-May/030041.html
> http://lists.open-bio.org/pipermail/biosql-l/2009-May/001514.html
>
> I understand that BioPerl now turns the SwissProt DE lines into a
> TagTree, and for storing this in BioSQL this gets serialised as XML.
> I would like Biopython to handle this the same way (although rather
> than a Perl TagTree, we'd use a Python structure of course), and
> would appreciate clarification of what exactly was implemented
> (e.g. which bit of the BioPerl source code should we look at,
> and could you show a worked example?).
>
> Andrea Pierleoni (CC'd - not sure if he is on the BioSQL or
> Open-Bio lists yet) has started work on parsing UniProt XML
> files for Biopython. Here the DE comment lines are already
> provided broken up with XML markup. Hopefully their nested
> structure matches what BioPerl was doing with the SwissProt
> DE lines.
>
> Regards,
>
> Peter
>


From biopython at maubp.freeserve.co.uk  Fri Jan 29 05:36:40 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 29 Jan 2010 10:36:40 +0000
Subject: [Open-bio-l] [Bioperl-l] [MOBY-dev] OpenBio solution challenge:
	Project updates at BOSC 2010
In-Reply-To: <op.u69hfujinbznux@dd0710001l.icapture.ubc.ca>
References: <20100128203505.GG40046@sobchak.mgh.harvard.edu>
	<op.u69hfujinbznux@dd0710001l.icapture.ubc.ca>
Message-ID: <320fb6e01001290236l1ad02515w403a19f94dbb6d15@mail.gmail.com>

Hi all,

This is a great topic, but should we continue it on just the one mailing list?
Is there a suitable BOSC list, or how about the general Open Bio list?

On Thu, Jan 28, 2010 at 9:17 PM, Mark Wilkinson <markw at illuminae.com> wrote:
>
> Brad, this sounds exciting!
>
> One thing strikes me, though - by asking for the sub-projects to propose
> the "grand challenge" themselves the one thing you can guarantee is that
> the "grand challenge" is solvable (or more likely, already solved!)
>
> Other "grand challenge" kinds of meetings have an independent third party
> pose the problem that has to be solved, and then all groups work toward a
> solution and compare their results. This would, IMO, be more revealing of
> the "state of the art" in each Open-Bio project, and point out where the
> weaknesses are that we should be focusing on... Someone (for example,
> you!) could act as the moderator to ensure that the "grand challenge" was
> at least a reasonable one, within the scope of what an Open-Bio project
> *should* be able to solve...
>
> Just my CAD $0.02
>
> Mark

One possible problem with having Brad act as moderator is his ties to
Biopython (plus it would be a shame if we'd be one man down for trying
to solve the challenges - grin). Having a project representative "sign off"
on the challenge might work - or simply the whole of the BOSC committee
which is quite balanced. Alternatively some kind of panel of challenges does
seem a good way to reduce individual project bias (as suggested by Scooter),
but there will still need to be a judging committee.

I'm curious what kind of challenges the BOSC committee had in mind -
would something like taking a newly sequenced bacterium and producing
an automated annotation as a GenBank, EMBL, or GFF file be too
ambitious, for example? There are already several major projects
that do this, e.g. RAST http://rast.nmpdr.org/

Peter
(@Biopython)


From biopython at maubp.freeserve.co.uk  Fri Jan  8 17:33:02 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 8 Jan 2010 17:33:02 +0000
Subject: [Open-bio-l] GenBank and EMBL - join(complement(...)) vs
	complement(join(...))
Message-ID: <320fb6e01001080933w383767c1ya45e0a6a891308d6@mail.gmail.com>

Hi all,

Currently Biopython reads both GenBank and EMBL files, and write GenBank.
I'm looking at writing EMBL files too - and wanted to see if any of you knew
anything definitive on join(complement(...)) vs complement(join(...)) in
feature location strings.

References:
http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html
http://www.genbank.lipi.go.id/docs/FTv6_2.html

Both give this in example, two ways for writing the same location:

complement(join(2691..4571,4918..5163)
                          Joins regions 2691 to 4571 and 4918 to 5163, then
                          complements the joined segments (the feature is
                          on the strand complementary to the presented strand)

join(complement(4918..5163),complement(2691..4571))
                          Complements regions 4918 to 5163 and 2691 to 4571,
                          then joins the complemented segments (the feature is
                          on the strand complementary to the presented strand)

This suggests that either form is valid in both GenBank and EMBL
format files.

Anecdotally, I have observed GenBank uses the first form (which is
shorter) while EMBL seems to use the second form (which to me is
logical, if you consider how to represent mixed strand features).
This seems to fit with this BioPerl wiki page:

http://www.bioperl.org/wiki/BioPerl_Locations

Is there any official documentation regarding this discrepancy that
I have overlooked? Am I right to think that GenBank and EMBL do
still use these different forms (any word on if they might
standardised one way or the other in future?)

What do EMBOSS, BioPerl, etc do in this situation? Do you treat
these two examples the same on parsing, and use one layout
when writing GenBank and the other for writing EMBL files?

Peter


From cjfields at illinois.edu  Sat Jan  9 02:54:41 2010
From: cjfields at illinois.edu (Chris Fields)
Date: Fri, 8 Jan 2010 20:54:41 -0600
Subject: [Open-bio-l] GenBank and EMBL - join(complement(...)) vs
	complement(join(...))
In-Reply-To: <320fb6e01001080933w383767c1ya45e0a6a891308d6@mail.gmail.com>
References: <320fb6e01001080933w383767c1ya45e0a6a891308d6@mail.gmail.com>
Message-ID: <2184EB34-B947-44F3-A0D4-BC55CFF404B4@illinois.edu>

On Jan 8, 2010, at 11:33 AM, Peter wrote:

> Hi all,
> 
> Currently Biopython reads both GenBank and EMBL files, and write GenBank.
> I'm looking at writing EMBL files too - and wanted to see if any of you knew
> anything definitive on join(complement(...)) vs complement(join(...)) in
> feature location strings.
> 
> References:
> http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html
> http://www.genbank.lipi.go.id/docs/FTv6_2.html
> 
> Both give this in example, two ways for writing the same location:
> 
> complement(join(2691..4571,4918..5163)
>                          Joins regions 2691 to 4571 and 4918 to 5163, then
>                          complements the joined segments (the feature is
>                          on the strand complementary to the presented strand)
> 
> join(complement(4918..5163),complement(2691..4571))
>                          Complements regions 4918 to 5163 and 2691 to 4571,
>                          then joins the complemented segments (the feature is
>                          on the strand complementary to the presented strand)
> 
> This suggests that either form is valid in both GenBank and EMBL
> format files.
> 
> Anecdotally, I have observed GenBank uses the first form (which is
> shorter) while EMBL seems to use the second form (which to me is
> logical, if you consider how to represent mixed strand features).
> This seems to fit with this BioPerl wiki page:
> 
> http://www.bioperl.org/wiki/BioPerl_Locations
> 
> Is there any official documentation regarding this discrepancy that
> I have overlooked? Am I right to think that GenBank and EMBL do
> still use these different forms (any word on if they might
> standardised one way or the other in future?)
> 
> What do EMBOSS, BioPerl, etc do in this situation? Do you treat
> these two examples the same on parsing, and use one layout
> when writing GenBank and the other for writing EMBL files?
> 
> Peter

I can't recall which of the two BioPerl uses, but if it helps it standardizes on one of them for output but parses both.  I think GenBank and EMBL have converged on using the same format, but I'm not absolutely sure on that.

Ironic actually that I can't remember, as I'm the author of the above page and started a discussion about this very subject a while back on the list (in an effort to sort out some issues with BioPerl locations).

chris


From biopython at maubp.freeserve.co.uk  Mon Jan 11 10:42:52 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 11 Jan 2010 10:42:52 +0000
Subject: [Open-bio-l] GenBank and EMBL - join(complement(...)) vs
	complement(join(...))
In-Reply-To: <2184EB34-B947-44F3-A0D4-BC55CFF404B4@illinois.edu>
References: <320fb6e01001080933w383767c1ya45e0a6a891308d6@mail.gmail.com>
	<2184EB34-B947-44F3-A0D4-BC55CFF404B4@illinois.edu>
Message-ID: <320fb6e01001110242l4f26f80l465c61c7f1b97dd5@mail.gmail.com>

On Sat, Jan 9, 2010 at 2:54 AM, Chris Fields <cjfields at illinois.edu> wrote:
>
> I can't recall which of the two BioPerl uses, but if it helps it standardizes
> on one of them for output but parses both. ?I think GenBank and EMBL
> have converged on using the same format, but I'm not absolutely sure
> on that.
>
> Ironic actually that I can't remember, as I'm the author of the above page
> and started a discussion about this very subject a while back on the list
> (in an effort to sort out some issues with BioPerl locations).
>
> chris

Thanks Chris,

I'm glad my email made sense - on re-reading I had made more typos
than usual :(

As to the BioPerl behaviour, I think I know enough to get BioPerl
to convert GenBank files into EMBL or vice versa, and thus find
out what it does...

I hope you are right that GenBank and EMBL have converged on
using the same format - any confirmation of this (and which format)
would be very welcome.

Peter (Rice), do you have any input? I noticed some work in the
latest EMBOSS patch last month that touches on this issue:
http://lists.open-bio.org/pipermail/emboss-announce/2009-December/000016.html

Peter


From biopython at maubp.freeserve.co.uk  Mon Jan 11 14:42:51 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 11 Jan 2010 14:42:51 +0000
Subject: [Open-bio-l] GenBank and EMBL - join(complement(...)) vs
	complement(join(...))
In-Reply-To: <320fb6e01001110242l4f26f80l465c61c7f1b97dd5@mail.gmail.com>
References: <320fb6e01001080933w383767c1ya45e0a6a891308d6@mail.gmail.com>
	<2184EB34-B947-44F3-A0D4-BC55CFF404B4@illinois.edu>
	<320fb6e01001110242l4f26f80l465c61c7f1b97dd5@mail.gmail.com>
Message-ID: <320fb6e01001110642p777d5b65la2e97b5767bd3ae5@mail.gmail.com>

On Mon, Jan 11, 2010 at 10:42 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Sat, Jan 9, 2010 at 2:54 AM, Chris Fields <cjfields at illinois.edu> wrote:
>>
>> I can't recall which of the two BioPerl uses, but if it helps it standardizes
>> on one of them for output but parses both. ?I think GenBank and EMBL
>> have converged on using the same format, but I'm not absolutely sure
>> on that.
>>
>> Ironic actually that I can't remember, as I'm the author of the above page
>> and started a discussion about this very subject a while back on the list
>> (in an effort to sort out some issues with BioPerl locations).
>>
>> chris
>
> Thanks Chris,
>
> I'm glad my email made sense - on re-reading I had made more typos
> than usual :(
>
> As to the BioPerl behaviour, I think I know enough to get BioPerl
> to convert GenBank files into EMBL or vice versa, and thus find
> out what it does...

After stumbling over this issue, I made some progress:
http://lists.open-bio.org/pipermail/bioperl-l/2010-January/031889.html

> I hope you are right that GenBank and EMBL have converged on
> using the same format - any confirmation of this (and which format)
> would be very welcome.

I took the Arabidopsis thaliana chloroplast complete genome as
an example. This is AP000423 in EMBL, NC_000932 in GenBank
(although there are some minor differences in the annotation).
Looking at these files (both from early 2009), they seem to use the
same feature location styles, e.g. for reverse strand joins:

complement(join(97999..98793,69611..69724))

e.g. for mixed strand features:

join(complement(69611..69724),139856..140087, 140625..140650)

I'm going to assume that this is what both EMBL and GenBank
will be using in future.

I have confirmed that BioPerl 1.6.x preserves these style locations
on converting EMBL/GenBank to EMBL/GenBank. I need to find
a reverse strand "join(complement(..." example to test with now...
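
To make the equivalence concrete, here is a minimal Python sketch (not
Biopython or BioPerl code; the function name normalise_location is made
up for illustration). It rewrites an all-reverse-strand
join(complement(...),...) location into the complement(join(...)) style
seen above and leaves mixed strand locations alone. It assumes a flat
join of simple parts, with the parts listed in reverse order in the
join(complement(...)) form, as in the feature table documentation's
example.

```python
import re

# Matches a single complemented part such as "complement(4918..5163)".
_COMPLEMENT = re.compile(r"^complement\((.+)\)$")


def normalise_location(loc):
    """Rewrite join(complement(a),complement(b)) as complement(join(b,a)).

    Hypothetical helper: only a flat join of simple parts is handled.
    Anything else is returned with spaces removed but otherwise unchanged.
    """
    loc = loc.replace(" ", "")
    if not (loc.startswith("join(") and loc.endswith(")")):
        return loc
    parts = loc[len("join("):-1].split(",")
    matches = [_COMPLEMENT.match(part) for part in parts]
    if not all(matches):
        # Mixed strand (or forward strand) join: keep the per-part form.
        return loc
    # Every part is on the reverse strand: reverse the order and wrap once.
    inner = ",".join(m.group(1) for m in reversed(matches))
    return "complement(join(%s))" % inner


print(normalise_location("join(complement(4918..5163),complement(2691..4571))"))
# prints complement(join(2691..4571,4918..5163))
```

A mixed strand location such as the second example above comes back
unchanged apart from the stray space, since it cannot be expressed as a
single complement(join(...)).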

Peter


From mark.schreiber at novartis.com  Wed Jan 13 05:37:57 2010
From: mark.schreiber at novartis.com (mark.schreiber at novartis.com)
Date: Wed, 13 Jan 2010 13:37:57 +0800
Subject: [Open-bio-l] [Bosc] BOSC 2010 Request for Input
In-Reply-To: <57BC6503-C655-431C-88BB-0F63E1004BD6@drycafe.net>
Message-ID: <OF320EDFA1.882A9074-ON482576AA.001EE24B-482576AA.001EF698@ah.novartis.com>

I might be coming to ISMB/BOSC this year

hand ++

- Mark


Hilmar Lapp <hlapp at drycafe.net> 
Sent by: open-bio-l-bounces at lists.open-bio.org
12/22/2009 07:39 AM

To
Brad Chapman <chapmanb at 50mail.com>
cc
bosc at open-bio.org, community at cloudbiolinux.com, 
open-bio-l at lists.open-bio.org
Subject
Re: [Open-bio-l] [Bosc] BOSC 2010 Request for Input


On Dec 21, 2009, at 4:32 PM, Brad Chapman wrote:

> I can help organize a hackathon at Mass General Hospital, which is
> located fairly close to Hynes Convention Center, where the conference
> is on. I booked conference rooms for dates surrounding BOSC (July
> 6-8th and 11th) that will accommodate up to 30 developers from various
> projects.

Awesome!! Thanks so much Brad for stepping up so quickly and putting 
your foot down too!

> If we could get a show of hands from people who might be interested, 
> this would help gauge the size and scope.


<shows_both_hands/> :)

                 -hilmar

-- 
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
===========================================================


_______________________________________________
Open-Bio-l mailing list
Open-Bio-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/open-bio-l




From biopython at maubp.freeserve.co.uk  Thu Jan 21 12:33:53 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 21 Jan 2010 12:33:53 +0000
Subject: [Open-bio-l] SwissProt DE lines and UniProt XML / TagTree as XML in
	BioSQL
Message-ID: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com>

Hi all,

This is cross posted to try and ensure relevant people see it.
I suggest we continue the discussion on the BioSQL list
(for how to serialise structured annotation to BioSQL), and/or
the OpenBio list (for things like file format naming conventions).

I am hoping we (Bio*) can be consistent in how we parse and load
into BioSQL the SwissProt DE lines (known as "swiss" format in
both BioPerl and Biopython's SeqIO, and by EMBOSS) or the
equivalent UniProt XML tags (which we are tentatively going to
call the "uniprot" format in Biopython's SeqIO - comments?).

Like BioPerl (etc), Biopython can parse plain text SwissProt ("swiss")
files and load them into BioSQL. Biopython currently treats the DE
comment lines as a long string, as BioPerl used to:

http://lists.open-bio.org/pipermail/bioperl-l/2009-May/030041.html
http://lists.open-bio.org/pipermail/biosql-l/2009-May/001514.html

I understand that BioPerl now turns the SwissProt DE lines into a
TagTree, and for storing this in BioSQL this gets serialised as XML.
I would like Biopython to handle this the same way (although rather
than a Perl TagTree, we'd use a Python structure of course), and
would appreciate clarification of what exactly was implemented
(e.g. which bit of the BioPerl source code should we look at,
and could you show a worked example?).
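
As a starting point for discussion, here is a rough Python sketch of
what "a Python structure" for the DE lines might look like. The function
name parse_de_lines and the list-of-tuples layout are assumptions for
illustration, not what BioPerl's TagTree or Biopython actually does, and
the sample lines are only illustrative. It handles just the
RecName/AltName/SubName lines with key=value fields. Flags: lines are
skipped, and Includes:/Contains: sections are not tracked (names inside
them would be flattened into the top level list).

```python
# Categories that open a new name block in the structured DE format.
NAME_CATEGORIES = ("RecName", "AltName", "SubName")


def parse_de_lines(lines):
    """Group DE lines into a list of (category, {key: [values]}) pairs."""
    names = []
    for line in lines:
        if not line.startswith("DE "):
            continue
        text = line[2:].strip().rstrip(";")
        head, sep, rest = text.partition(":")
        if sep and head in NAME_CATEGORIES:
            # e.g. "RecName: Full=..." starts a new block.
            names.append((head, {}))
            text = rest.strip()
        key, sep, value = text.partition("=")
        if sep and names:
            # Continuation lines (e.g. "Short=...") attach to the last block.
            names[-1][1].setdefault(key, []).append(value)
    return names


example = [
    "DE   RecName: Full=Annexin A5;",
    "DE            Short=Annexin-5;",
    "DE   AltName: Full=Annexin V;",
    "DE   AltName: Full=Lipocortin V;",
]
for category, fields in parse_de_lines(example):
    print(category, fields)
```

Keeping a list of pairs rather than a plain dictionary preserves the
order of the names and allows several AltName entries.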

Andrea Pierleoni (CC'd - not sure if he is on the BioSQL or
Open-Bio lists yet) has started work on parsing UniProt XML
files for Biopython. Here the DE comment lines are already
provided broken up with XML markup. Hopefully their nested
structure matches what BioPerl was doing with the SwissProt
DE lines.

Regards,

Peter


From cjfields at illinois.edu  Thu Jan 21 13:34:12 2010
From: cjfields at illinois.edu (Chris Fields)
Date: Thu, 21 Jan 2010 07:34:12 -0600
Subject: [Open-bio-l] SwissProt DE lines and UniProt XML / TagTree as
	XML in BioSQL
In-Reply-To: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com>
References: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com>
Message-ID: <A6F5F623-2750-4BB0-91F7-5A87BABE367B@illinois.edu>

Peter,

The relevant code is in Bio::Annotation::TagTree in bioperl-live, which is a decorator for Data::Stag:

http://search.cpan.org/~cmungall/Data-Stag-0.11/Data/Stag.pm

This is where the text output is derived from.  It's a bit of a heavyweight solution to the problem, but it's capable of round-tripping the DE data and parses out the data in a way that's approachable.  We could probably abstract out the serialization backend there and allow a pure bioperl solution (or the current solution) as a fallback. 

If the plain-text DE info is represented in a hierarchy already in UniProt XML, we should probably conform as closely as possible to that (using a standard format like XML, JSON, etc.).  

chris



From holland at eaglegenomics.com  Fri Jan 22 10:51:52 2010
From: holland at eaglegenomics.com (Richard Holland)
Date: Fri, 22 Jan 2010 10:51:52 +0000
Subject: [Open-bio-l] [BioSQL-l] SwissProt DE lines and UniProt XML /
	TagTree as XML in BioSQL
In-Reply-To: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com>
References: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com>
Message-ID: <8FECCBDE-2DE1-40EE-B5A4-73BDAC893E2D@eaglegenomics.com>

Nice idea. Currently, BioJava just stores the complete section as a string without parsing it, but it provides a parser module for converting it into a useful tag/value format within a user's program (though the parsed form is not stored in BioSQL).


--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


From andrea at biocomp.unibo.it  Fri Jan 22 12:18:32 2010
From: andrea at biocomp.unibo.it (Andrea Pierleoni)
Date: Fri, 22 Jan 2010 13:18:32 +0100 (CET)
Subject: [Open-bio-l] SwissProt DE lines and UniProt XML / TagTree as
	XML in BioSQL
In-Reply-To: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com>
References: <320fb6e01001210433n6f42e617g6519ee2790d6add5@mail.gmail.com>
Message-ID: <2b6e30c4628585042366646a7b46386e.squirrel@lipid.biocomp.unibo.it>

I think the point here is a little broader, since the SwissProt DE lines
are not the only place that carries complex, structured data.
Defining a common, language-independent way to store structured data in
the comment and *_qualifier_value tables of the current BioSQL schema could
be very useful.
XML looks like a good candidate to me, and the UniProt XML format can be
used as a reference or as a template to start from.
Each Bio* project would then parse and report this structured data in its
own programming language's data structures.
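
For illustration, a minimal sketch of such a round trip in Python,
using the standard xml.etree.ElementTree module. The element names
(description, RecName, Full, ...) simply mirror the flat file keys and
are placeholders: they are not the UniProt XML schema, and nothing here
is an agreed BioSQL convention. The function names are made up for this
example.

```python
import xml.etree.ElementTree as ET


def names_to_xml(names):
    """Serialise [(category, {key: [values]}), ...] to an XML string."""
    root = ET.Element("description")  # placeholder root element name
    for category, fields in names:
        node = ET.SubElement(root, category)
        for key, values in fields.items():
            for value in values:
                ET.SubElement(node, key).text = value
    return ET.tostring(root, encoding="unicode")


def xml_to_names(xml_text):
    """Rebuild the nested Python structure from the XML string."""
    names = []
    for node in ET.fromstring(xml_text):
        fields = {}
        for child in node:
            fields.setdefault(child.tag, []).append(child.text)
        names.append((node.tag, fields))
    return names


names = [
    ("RecName", {"Full": ["Annexin A5"], "Short": ["Annexin-5"]}),
    ("AltName", {"Full": ["Annexin V"]}),
]
xml_text = names_to_xml(names)  # the string a loader might store
print(xml_text)
print(xml_to_names(xml_text) == names)
```

The stored value is just a string, so each project could put it in an
existing text column and rebuild its own native structure on retrieval.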

Andrea




From biopython at maubp.freeserve.co.uk  Fri Jan 29 10:36:40 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 29 Jan 2010 10:36:40 +0000
Subject: [Open-bio-l] [Bioperl-l] [MOBY-dev] OpenBio solution challenge:
	Project updates at BOSC 2010
In-Reply-To: <op.u69hfujinbznux@dd0710001l.icapture.ubc.ca>
References: <20100128203505.GG40046@sobchak.mgh.harvard.edu>
	<op.u69hfujinbznux@dd0710001l.icapture.ubc.ca>
Message-ID: <320fb6e01001290236l1ad02515w403a19f94dbb6d15@mail.gmail.com>

Hi all,

This is a great topic, but should we continue it on just one mailing list?
Is there a suitable BOSC list, or how about the general Open Bio list?

On Thu, Jan 28, 2010 at 9:17 PM, Mark Wilkinson <markw at illuminae.com> wrote:
>
> Brad, this sounds exciting!
>
> One thing strikes me, though - by asking for the sub-projects to propose
> the "grand challenge" themselves the one thing you can guarantee is that
> the "grand challenge" is solvable (or more likely, already solved!)
>
> Other "grand challenge" kinds of meetings have an independent third party
> pose the problem that has to be solved, and then all groups work toward a
> solution and compare their results. This would, IMO, be more revealing of
> the "state of the art" in each Open-Bio project, and point out where the
> weaknesses are that we should be focusing on... Someone (for example,
> you!) could act as the moderator to ensure that the "grand challenge" was
> at least a reasonable one, within the scope of what an Open-Bio project
> *should* be able to solve...
>
> Just my CAD $0.02
>
> Mark

One possible problem with having Brad act as moderator is his ties to
Biopython (plus it would be a shame if we'd be one man down for trying
to solve the challenges - grin). Having a project representative "sign off"
on the challenge might work - or simply the whole of the BOSC committee
which is quite balanced. Alternatively some kind of panel of challenges does
seem a good way to reduce individual project bias (as suggested by Scooter),
but there will still need to be a judging committee.

I'm curious what kind of challenges the BOSC committee had in mind -
would something like taking a newly sequenced bacterium and producing
an automated annotation as a GenBank, EMBL, or GFF file be too
ambitious for example? There are already several major projects
to do this e.g. RAST http://rast.nmpdr.org/

Peter
(@Biopython)