From dan.bolser at gmail.com Fri May 28 12:29:10 2010
From: dan.bolser at gmail.com (Dan Bolser)
Date: Fri, 28 May 2010 17:29:10 +0100
Subject: [Open-bio-l] Best practice for modelling data in GFF
Message-ID:

Hi guys,

Not sure if this is the right forum, but I just thought I'd ask...

Where can I find information on 'best practices' for modelling
biological data in GFF?

For example, I'd like to model paired-end sequence alignments in GFF.
One suggestion was to use match/match_part to link each end into a
pair. Another option is to use 'read_pair' with 'contig' for the
parent feature...

Should I just be using SAM/BAM?

Seems a shame not to have a standard way to do this in GFF...

Cheers,
Dan.

From dalloliogm at gmail.com Fri May 28 12:35:49 2010
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Fri, 28 May 2010 18:35:49 +0200
Subject: [Open-bio-l] Best practice for modelling data in GFF
In-Reply-To:
References:
Message-ID:

Hi Dan,

why don't you ask this here:
- http://biostar.stackexchange.com/questions

this mailing list is just to discuss topics related to the OpenBio
projects, like BioPerl, BioPython, etc.
You will find more people in biostars.

On Fri, May 28, 2010 at 6:29 PM, Dan Bolser wrote:
> Hi guys,
>
> Not sure if this is the right forum, but I just thought I'd ask...
>
> Where can I find information on 'best practices' for modelling
> biological data in GFF?
>
> For example, I'd like to model paired-end sequence alignments in GFF.
> One suggestion was to use match/match_part to link each end into a
> pair. Another option is to use 'read_pair' with 'contig' for the
> parent feature...
>
> Should I just be using SAM/BAM?
>
> Seems a shame not to have a standard way to do this in GFF...
>
>
> Cheers,
> Dan.
> _______________________________________________
> Open-Bio-l mailing list
> Open-Bio-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/open-bio-l
>

--
Giovanni Dall'Olio, phd student
Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain)

My blog on bioinformatics: http://bioinfoblog.it

From lpritc at scri.ac.uk Fri May 28 12:59:13 2010
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Fri, 28 May 2010 17:59:13 +0100
Subject: [Open-bio-l] Best practice for modelling data in GFF
In-Reply-To:
Message-ID:

Hi Dan,

On 28/05/2010 Friday, May 28, 17:29, "Dan Bolser" wrote:

> Not sure if this is the right forum, but I just thought I'd ask...
>
> Where can I find information on 'best practices' for modelling
> biological data in GFF?

The specification is a good place to start:

http://www.sequenceontology.org/gff3.shtml

> For example, I'd like to model paired-end sequence alignments in GFF.
> One suggestion was to use match/match_part to link each end into a
> pair. Another option is to use 'read_pair' with 'contig' for the
> parent feature...

I'm not sure it's an issue with GFF as much as it is just working out
where your data fits in the Sequence Ontology model.

If your read pairs have been used to assemble the larger contig sequence
that you're modelling them as part_of, then read_pair would seem to be
exactly what you're looking for:

http://www.sequenceontology.org/miso/current_release/term/SO:0000007

However, if your read pair comes from a different contig, or exists in
some abstract sense, not associated with the assembly of the contig, and
you're just *aligning them to another sequence*, then a match, with (at
least) two match_part children corresponding to the regions that each
read matches could be more appropriate.

Which of these options best reflects your data?

Cheers,

L.
--
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405
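
As a rough illustration of the second option described above (a match feature with two match_part children, one per read), here is a minimal Python sketch that writes the three GFF3 lines for one aligned pair. Nothing in it comes from the thread itself: the function names, the `example_aligner` source label, the IDs and the coordinates are all made up for the example, and it assumes both reads align to the same reference sequence.

```python
def gff3_line(seqid, source, ftype, start, end, strand, attrs):
    """Format one nine-column GFF3 line; score and phase are left as '.'."""
    attr_text = ";".join(f"{key}={value}" for key, value in attrs.items())
    columns = [seqid, source, ftype, str(start), str(end), ".", strand, ".", attr_text]
    return "\t".join(columns)


def read_pair_as_match(pair_id, seqid, read1, read2, source="example_aligner"):
    """Return GFF3 lines for one aligned read pair (hypothetical helper).

    read1 and read2 are (start, end, strand) tuples in 1-based, inclusive
    coordinates, both on the same reference sequence (an assumption).
    """
    span_start = min(read1[0], read2[0])
    span_end = max(read1[1], read2[1])
    # Parent feature: spans both reads; strand '.' because the two ends differ.
    lines = [gff3_line(seqid, source, "match", span_start, span_end, ".", {"ID": pair_id})]
    # One match_part child per read, linked back through the Parent attribute.
    for index, (start, end, strand) in enumerate((read1, read2), start=1):
        attrs = {"ID": f"{pair_id}.{index}", "Parent": pair_id}
        lines.append(gff3_line(seqid, source, "match_part", start, end, strand, attrs))
    return lines


if __name__ == "__main__":
    print("##gff-version 3")
    for line in read_pair_as_match("pair0001", "contig1", (100, 135, "+"), (380, 415, "-")):
        print(line)
```

The same layout would apply to the read_pair option by swapping the feature types; which set of terms is appropriate depends on the data, as discussed in the message above.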
From chapmanb at 50mail.com Thu May 27 14:51:46 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 27 May 2010 14:51:46 -0400
Subject: [Open-bio-l] Codefest 2010 plans: July 7 and 8th
Message-ID: <20100527185146.GJ1054@sobchak.mgh.harvard.edu>

Hi all;
Things are moving forward smoothly with plans for Codefest 2010,
scheduled for July 7 and 8th -- the Wednesday and Thursday before
BOSC and ISMB:

http://www.open-bio.org/wiki/Codefest_2010

The ISMB deadline for early registration is tomorrow, so it's a
great opportunity to start getting ourselves organized for the two
days of coding. The focus is on two broad areas: cloud computing and
semantic web.

On the cloud computing side, I've been putting together a general
configuration environment we can work collaboratively on, and wrote
up the current state here:

http://bcbio.wordpress.com/2010/05/08/automated-build-environment-for-bioinformatics-cloud-images/

This framework could provide us with plenty to do for the two days:
extend the library and program coverage, provide Debian/Ubuntu ports
for packages, write new user documentation, start thinking about
common EBS stores for re-usable data, and much more.

For semantic web work, the plan is to extend work done at
BioHackathon 2010:

http://hackathon3.dbcls.jp/

Toshiaki and Mitsuteru -- feel free to chime in with specific ideas
and thoughts for the two days.

Edit away on the wiki to provide more information. This is meant to
be fun and fit your interests, so feel free to suggest other areas
you think would be cool and try and build support. To avoid having
to e-mail tons of people, I suggest discussing things on the OpenBio
general mailing list:

http://lists.open-bio.org/pipermail/open-bio-l/

Beyond coding, there will be backyard BBQ and beers on Thursday
night so don't plan a 7am talk on Friday.

Let us know definitely if you are coming, and chime in with thoughts
and ideas.
The more organized we can get, the more we'll be able to accomplish.
Looking forward to seeing everyone in a month,
Brad

From jason at bioperl.org Fri May 28 13:06:06 2010
From: jason at bioperl.org (Jason Stajich)
Date: Fri, 28 May 2010 10:06:06 -0700
Subject: [Open-bio-l] Best practice for modelling data in GFF
In-Reply-To:
References:
Message-ID: <4BFFF7FE.1030004@bioperl.org>

It's covered in the GFF3 spec as match_part if that helps.
http://song.sourceforge.net/gff3.shtml

Dan Bolser wrote, On 5/28/10 9:29 AM:
> Hi guys,
>
> Not sure if this is the right forum, but I just thought I'd ask...
>
> Where can I find information on 'best practices' for modelling
> biological data in GFF?
>
> For example, I'd like to model paired-end sequence alignments in GFF.
> One suggestion was to use match/match_part to link each end into a
> pair. Another option is to use 'read_pair' with 'contig' for the
> parent feature...
>
> Should I just be using SAM/BAM?
>
> Seems a shame not to have a standard way to do this in GFF...
>
>
> Cheers,
> Dan.

From cjfields at illinois.edu Fri May 28 13:49:45 2010
From: cjfields at illinois.edu (Chris Fields)
Date: Fri, 28 May 2010 12:49:45 -0500
Subject: [Open-bio-l] Best practice for modelling data in GFF
In-Reply-To: <4BFFF7FE.1030004@bioperl.org>
References: <4BFFF7FE.1030004@bioperl.org>
Message-ID: <685C1029-0E3A-45DD-BBA9-FFA7747810D5@illinois.edu>

All,

Appears that link isn't up to date. Current GFF3 spec (v. 1.16, updated May 25) here:

http://www.sequenceontology.org/gff3.shtml

chris

On May 28, 2010, at 12:06 PM, Jason Stajich wrote:

> It's covered in the GFF3 spec as match_part if that helps.
> http://song.sourceforge.net/gff3.shtml
>
> Dan Bolser wrote, On 5/28/10 9:29 AM:
>> Hi guys,
>>
>> Not sure if this is the right forum, but I just thought I'd ask...
>>
>> Where can I find information on 'best practices' for modelling
>> biological data in GFF?
>>
>> For example, I'd like to model paired-end sequence alignments in GFF.
>> One suggestion was to use match/match_part to link each end into a
>> pair. Another option is to use 'read_pair' with 'contig' for the
>> parent feature...
>>
>> Should I just be using SAM/BAM?
>>
>> Seems a shame not to have a standard way to do this in GFF...
>>
>>
>> Cheers,
>> Dan.

From dan.bolser at gmail.com Fri May 28 19:08:50 2010
From: dan.bolser at gmail.com (Dan Bolser)
Date: Sat, 29 May 2010 00:08:50 +0100
Subject: [Open-bio-l] Best practice for modelling data in GFF
In-Reply-To: <685C1029-0E3A-45DD-BBA9-FFA7747810D5@illinois.edu>
References: <4BFFF7FE.1030004@bioperl.org> <685C1029-0E3A-45DD-BBA9-FFA7747810D5@illinois.edu>
Message-ID:

Thanks all for replies.

I'm aware of the GFF spec, and the SO ontology terms. The issue here
(as I understand it) is that the feature isn't 'flat', but is a
combination of two matching 'reads' that are grouped into a mate-pair
depending on their proximity and orientation.

As pointed out, not every pair is successfully mapped, specifically
one read may be 'missing' from the pair, the pair may span two
reference sequences, or the proximity or orientation of the pair may
be incorrect.

Strictly speaking this can be handled by match and match_part (or
read_pair and part_of) terms, however, the question is, does this
reflect the biology adequately? (And specifically which terms should
be used?)
There is a canonical way to model a gene, so I was wondering if it
makes sense to describe similar 'biology' (or in this case molecular
biology) in standard ways (when the feature isn't simply described by
a single line of GFF)?

Perhaps I've not understood SO properly, but I'm not sure how its
structure is translated into GFF structure ... is there a 1 to 1
mapping?

Cheers,
Dan.

On 28 May 2010 18:49, Chris Fields wrote:
> All,
>
> Appears that link isn't up to date. Current GFF3 spec (v. 1.16, updated May 25) here:
>
> http://www.sequenceontology.org/gff3.shtml
>
> chris
>
> On May 28, 2010, at 12:06 PM, Jason Stajich wrote:
>
>> It's covered in the GFF3 spec as match_part if that helps.
>> http://song.sourceforge.net/gff3.shtml
>>
>> Dan Bolser wrote, On 5/28/10 9:29 AM:
>>> Hi guys,
>>>
>>> Not sure if this is the right forum, but I just thought I'd ask...
>>>
>>> Where can I find information on 'best practices' for modelling
>>> biological data in GFF?
>>>
>>> For example, I'd like to model paired-end sequence alignments in GFF.
>>> One suggestion was to use match/match_part to link each end into a
>>> pair. Another option is to use 'read_pair' with 'contig' for the
>>> parent feature...
>>>
>>> Should I just be using SAM/BAM?
>>>
>>> Seems a shame not to have a standard way to do this in GFF...
>>>
>>>
>>> Cheers,
>>> Dan.
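
The cases raised in the follow-up above (one read missing from the pair, a pair spanning two reference sequences, unexpected proximity or orientation) can be sketched as a small classifier. This is only an illustration under stated assumptions, not an answer to which SO terms to use: `ReadHit`, `classify_pair`, the label strings and the `max_span` threshold are all hypothetical, and the orientation check assumes a library where the left read is on `+` and the right read on `-`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReadHit:
    """One aligned read: reference name, 1-based inclusive span, strand."""
    seqid: str
    start: int
    end: int
    strand: str  # "+" or "-"


def classify_pair(read1: Optional[ReadHit], read2: Optional[ReadHit],
                  max_span: int = 1000) -> str:
    """Label a read pair with one of the cases raised in the thread.

    max_span is a made-up threshold; a real value depends on the library.
    """
    if read1 is None or read2 is None:
        return "mate_missing"
    if read1.seqid != read2.seqid:
        return "split_across_references"
    left, right = sorted((read1, read2), key=lambda hit: hit.start)
    # Assumed layout: left read on the + strand, right read on the - strand.
    if (left.strand, right.strand) != ("+", "-"):
        return "unexpected_orientation"
    # Outer distance from the start of the left read to the end of the right read.
    if right.end - left.start + 1 > max_span:
        return "unexpected_distance"
    return "proper_pair"
```

Only a pair labelled `proper_pair` maps cleanly onto a single parent feature on one reference sequence; how the other cases should be written in GFF is the question the thread leaves open.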