From dan.bolser at gmail.com  Fri May 28 12:29:10 2010
From: dan.bolser at gmail.com (Dan Bolser)
Date: Fri, 28 May 2010 17:29:10 +0100
Subject: [Open-bio-l] Best practice for modelling data in GFF
Message-ID: <AANLkTimuku_x7V9RSAERK4f4QxIc4od2zw3qZtdaL4oH@mail.gmail.com>

Hi guys,

Not sure if this is the right forum, but I just thought I'd ask...

Where can I find information on 'best practices' for modelling
biological data in GFF?

For example, I'd like to model paired-end sequence alignments in GFF.
One suggestion was to use match/match_part to link each end into a
pair. Another option is to use 'read_pair' with 'contig' for the
parent feature...

Should I just be using SAM/BAM?

Seems a shame not to have a standard way to do this in GFF...


Cheers,
Dan.

From dalloliogm at gmail.com  Fri May 28 12:35:49 2010
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Fri, 28 May 2010 18:35:49 +0200
Subject: [Open-bio-l] Best practice for modelling data in GFF
In-Reply-To: <AANLkTimuku_x7V9RSAERK4f4QxIc4od2zw3qZtdaL4oH@mail.gmail.com>
References: <AANLkTimuku_x7V9RSAERK4f4QxIc4od2zw3qZtdaL4oH@mail.gmail.com>
Message-ID: <AANLkTimVYbTCOWPplJNsAti90YQNY6xYsISaR5gJWfAR@mail.gmail.com>

Hi Dan,
Why don't you ask this here:
- http://biostar.stackexchange.com/questions

This mailing list is just for discussing topics related to the OpenBio
projects, like BioPerl, BioPython, etc.
You will find more people on BioStar.




-- 
Giovanni Dall'Olio, phd student
Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain)

My blog on bioinformatics: http://bioinfoblog.it

From lpritc at scri.ac.uk  Fri May 28 12:59:13 2010
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Fri, 28 May 2010 17:59:13 +0100
Subject: [Open-bio-l] Best practice for modelling data in GFF
In-Reply-To: <AANLkTimuku_x7V9RSAERK4f4QxIc4od2zw3qZtdaL4oH@mail.gmail.com>
Message-ID: <C825B4F1.36548%lpritc@scri.ac.uk>

Hi Dan,

On 28/05/2010 17:29, "Dan Bolser" <dan.bolser at gmail.com> wrote:

> Not sure if this is the right forum, but I just thought I'd ask...
> 
> Where can I find information on 'best practices' for modelling
> biological data in GFF?

The specification is a good place to start:

http://www.sequenceontology.org/gff3.shtml

> For example, I'd like to model paired-end sequence alignments in GFF.
> One suggestion was to use match/match_part to link each end into a
> pair. Another option is to use 'read_pair' with 'contig' for the
> parent feature...

I'm not sure it's an issue with GFF as much as it is just working out where
your data fits in the Sequence Ontology model.

If your read pairs have been used to assemble the larger contig sequence
that you're modelling them as part_of, then read_pair would seem to be
exactly what you're looking for:

http://www.sequenceontology.org/miso/current_release/term/SO:0000007

However, if your read pair comes from a different contig, or exists in some
abstract sense, not associated with the assembly of the contig, and you're
just *aligning them to another sequence*, then a match, with (at least) two
match_part children corresponding to the regions that each read matches
could be more appropriate.
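
A minimal sketch of that second option, assuming made-up names throughout
(the `chr1` sequence ID, the `aligner` source, the `pair1` ID and the
coordinates are hypothetical, not taken from the spec or any real dataset).
It writes one `match` parent spanning both reads and two `match_part`
children that point back to it through the `Parent` attribute:

```python
def gff3_line(seqid, source, ftype, start, end, strand, attrs):
    """Format one nine-column GFF3 line; score and phase are left as '.'."""
    attr_text = ";".join(f"{key}={value}" for key, value in attrs.items())
    columns = [seqid, source, ftype, str(start), str(end), ".", strand, ".", attr_text]
    return "\t".join(columns)


def pair_as_match(seqid, pair_id, read1, read2, source="aligner"):
    """Model an aligned read pair as a match with two match_part children.

    read1 and read2 are (start, end, strand) tuples using 1-based,
    inclusive coordinates on the reference sequence.
    """
    start = min(read1[0], read2[0])
    end = max(read1[1], read2[1])
    # The parent covers both reads; its strand is left unspecified ('.').
    lines = [gff3_line(seqid, source, "match", start, end, ".", {"ID": pair_id})]
    for n, (r_start, r_end, r_strand) in enumerate((read1, read2), start=1):
        attrs = {"ID": f"{pair_id}.{n}", "Parent": pair_id}
        lines.append(gff3_line(seqid, source, "match_part", r_start, r_end, r_strand, attrs))
    return lines


if __name__ == "__main__":
    for line in pair_as_match("chr1", "pair1", (100, 135, "+"), (300, 335, "-")):
        print(line)
```

Whether `match`/`match_part` or `read_pair` is the right type for column 3
still depends on the question above; only the ID/Parent wiring is shown here.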

Which of these options best reflects your data?

Cheers,

L.


-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405



From chapmanb at 50mail.com  Thu May 27 14:51:46 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 27 May 2010 14:51:46 -0400
Subject: [Open-bio-l] Codefest 2010 plans: July 7 and 8th
Message-ID: <20100527185146.GJ1054@sobchak.mgh.harvard.edu>

Hi all;
Things are moving forward smoothly with plans for Codefest 2010,
scheduled for July 7 and 8th -- the Wednesday and Thursday before
BOSC and ISMB:

http://www.open-bio.org/wiki/Codefest_2010

The ISMB deadline for early registration is tomorrow, so it's a
great opportunity to start getting ourselves organized for the two
days of coding.

The focus is on two broad areas: cloud computing and semantic web.
On the cloud computing side, I've been putting together a general 
configuration environment we can work collaboratively on, and 
wrote up the current state here:

http://bcbio.wordpress.com/2010/05/08/automated-build-environment-for-bioinformatics-cloud-images/

This framework could provide us with plenty to do for the two days:
extend the library and program coverage, provide Debian/Ubuntu ports
for packages, write new user documentation, start thinking about
common EBS stores for re-usable data, and much more.

For semantic web work, the plan is to extend work done at
BioHackathon 2010:

http://hackathon3.dbcls.jp/

Toshiaki and Mitsuteru -- feel free to chime in with specific ideas
and thoughts for the two days. Edit away on the wiki to provide more
information.

This is meant to be fun and fit your interests, so feel free to
suggest other areas you think would be cool and try and build
support. To avoid having to e-mail tons of people, I suggest
discussing things on the OpenBio general mailing list:

http://lists.open-bio.org/pipermail/open-bio-l/

Beyond coding, there will be backyard BBQ and beers on Thursday 
night so don't plan a 7am talk on Friday.

Let us know definitely if you are coming, and chime in with 
thoughts and ideas. The more organized we can get, the more we'll 
be able to accomplish. Looking forward to seeing everyone in a month,
Brad

From jason at bioperl.org  Fri May 28 13:06:06 2010
From: jason at bioperl.org (Jason Stajich)
Date: Fri, 28 May 2010 10:06:06 -0700
Subject: [Open-bio-l] Best practice for modelling data in GFF
In-Reply-To: <AANLkTimuku_x7V9RSAERK4f4QxIc4od2zw3qZtdaL4oH@mail.gmail.com>
References: <AANLkTimuku_x7V9RSAERK4f4QxIc4od2zw3qZtdaL4oH@mail.gmail.com>
Message-ID: <4BFFF7FE.1030004@bioperl.org>

It's covered in the GFF3 spec as match_part if that helps.
http://song.sourceforge.net/gff3.shtml


From cjfields at illinois.edu  Fri May 28 13:49:45 2010
From: cjfields at illinois.edu (Chris Fields)
Date: Fri, 28 May 2010 12:49:45 -0500
Subject: [Open-bio-l] Best practice for modelling data in GFF
In-Reply-To: <4BFFF7FE.1030004@bioperl.org>
References: <AANLkTimuku_x7V9RSAERK4f4QxIc4od2zw3qZtdaL4oH@mail.gmail.com>
	<4BFFF7FE.1030004@bioperl.org>
Message-ID: <685C1029-0E3A-45DD-BBA9-FFA7747810D5@illinois.edu>

All,

Appears that link isn't up to date.  Current GFF3 spec (v. 1.16, updated May 25) here:

http://www.sequenceontology.org/gff3.shtml

chris



From dan.bolser at gmail.com  Fri May 28 19:08:50 2010
From: dan.bolser at gmail.com (Dan Bolser)
Date: Sat, 29 May 2010 00:08:50 +0100
Subject: [Open-bio-l] Best practice for modelling data in GFF
In-Reply-To: <685C1029-0E3A-45DD-BBA9-FFA7747810D5@illinois.edu>
References: <AANLkTimuku_x7V9RSAERK4f4QxIc4od2zw3qZtdaL4oH@mail.gmail.com>
	<4BFFF7FE.1030004@bioperl.org>
	<685C1029-0E3A-45DD-BBA9-FFA7747810D5@illinois.edu>
Message-ID: <AANLkTinrTVYODrcY9TUxhznkZKunNRQCvs-bnwgbHdNA@mail.gmail.com>

Thanks all for replies.

I'm aware of the GFF spec and the SO terms. The issue here
(as I understand it) is that the feature isn't 'flat', but is a
combination of two matching 'reads' that are grouped into a mate-pair
depending on their proximity and orientation. As pointed out, not
every pair is successfully mapped: one read may be
'missing' from the pair, the pair may span two reference sequences, or
the proximity or orientation of the pair may be incorrect.
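
To make those cases concrete, here is a rough sketch of how a pair could be
sorted into them before deciding how to write it out. Everything in it is an
assumption for illustration: the `Alignment` record, the category names, the
500 bp `max_span` cutoff, and the rule that a correct pair has the leftmost
read on '+' and the rightmost on '-' (inward-facing paired ends). A real
library type may need different rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alignment:
    """One aligned read: reference ID, 1-based inclusive coordinates, strand."""
    seqid: str
    start: int
    end: int
    strand: str


def classify_pair(read1: Optional[Alignment],
                  read2: Optional[Alignment],
                  max_span: int = 500) -> str:
    """Return a (hypothetical) category label for a read pair."""
    if read1 is None and read2 is None:
        return "unmapped"
    if read1 is None or read2 is None:
        # One read is 'missing' from the pair.
        return "mate_missing"
    if read1.seqid != read2.seqid:
        # The pair spans two reference sequences.
        return "split_across_references"
    left, right = sorted((read1, read2), key=lambda aln: aln.start)
    if not (left.strand == "+" and right.strand == "-"):
        return "wrong_orientation"
    if right.end - left.start + 1 > max_span:
        return "wrong_distance"
    return "proper_pair"
```

Only the "proper_pair" case fits neatly under a single parent feature on one
reference sequence; the other categories are the ones where the choice of
GFF representation is less obvious.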

Strictly speaking, this can be handled by match and match_part (or
read_pair and part_of) terms. However, the question is: does this
reflect the biology adequately? (And specifically, which terms should
be used?)

There is a canonical way to model a gene, so I was wondering if it
makes sense to describe similar 'biology' (or in this case molecular
biology) in standard ways (when the feature isn't simply described by
a single line of GFF)?

Perhaps I've not understood SO properly, but I'm not sure how its
structure is translated into GFF structure ... is there a 1 to 1
mapping?
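
For reference, my reading of the GFF3 spec is that the `Parent` attribute is
the way a part_of relationship gets written down, while column 3 carries the
SO term. Under that assumption, the structure can be recovered from a file
by following ID/Parent links, as in this sketch (the helper names are mine,
and the attribute parsing deliberately skips URL-unescaping):

```python
from collections import defaultdict


def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict (no URL-unescaping here)."""
    attrs = {}
    for field in column9.strip().split(";"):
        if field:
            key, _, value = field.partition("=")
            attrs[key] = value
    return attrs


def children_by_parent(gff3_lines):
    """Map each Parent ID to the (type, start, end) of its child features."""
    children = defaultdict(list)
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        attrs = parse_attributes(cols[8])
        # A feature may list several parents, separated by commas.
        for parent in filter(None, attrs.get("Parent", "").split(",")):
            children[parent].append((cols[2], int(cols[3]), int(cols[4])))
    return dict(children)
```

Features without a `Parent` attribute simply do not appear as children, so
top-level features such as a `match` show up only as keys.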


Cheers,
Dan.


