From biopython at maubp.freeserve.co.uk Thu Jul 15 07:01:52 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Jul 2010 12:01:52 +0100
Subject: [emboss-dev] EMBOSS 6.3.0 released - SAM/BAM
In-Reply-To:
References:
Message-ID:

Trying again, sending from the right email address...

---------- Forwarded message ----------
From: Peter
To: ajb at ebi.ac.uk, Peter Rice
Date: Thu, 15 Jul 2010 11:45:21 +0100
Subject: EMBOSS 6.3.0 released - SAM/BAM

On Thu, Jul 15, 2010 at 11:20 AM, wrote:
> EMBOSS 6.3.0 is now available and can be downloaded from
> our ftp server:

Congratulations on the latest release.

> Some highlights include:
>
>    ...
>    Support for BAM/SAM files
>    ...
>

Cool. I should take a look at this before (if) merging SAM/BAM
support into Biopython. The use case I had in mind was for
conversion to FASTQ (discarding any alignment information).

What do you do about naming for paired reads? I was appending
/1 or /2 to match the Illumina convention. Doing nothing means
the paired reads will have the same names.

What do you do about the strand issue? SAM/BAM stores reads
which map onto the reverse strand in reverse complement. If
you want to get back to the original orientation for output as
FASTQ you must apply the reverse complement (plus reverse
the quality scores too of course).

Do you support writing SAM/BAM files? If so, would this be
for aligned reads or unaligned reads only?

Assuming you do write BAM files, do you support the recent
convention to use a single BGZF block, and that where possible
reads should not span a BGZF block boundary?
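To make the naming and strand questions above concrete, here is a minimal sketch of turning one SAM alignment line back into a FASTQ record. It is not taken from EMBOSS or Biopython: the function name `sam_line_to_fastq` is hypothetical, and it assumes the standard tab-separated SAM columns with the FLAG bits 0x10 (reverse strand), 0x40 (first read of a pair) and 0x80 (second read of a pair).

```python
# Sketch only: recover the original read from one SAM alignment line.
# The function name is hypothetical; column order and FLAG bits follow
# the SAM format (QNAME, FLAG, ..., SEQ in column 10, QUAL in column 11).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

FLAG_REVERSE = 0x10  # sequence is stored reverse-complemented
FLAG_READ1 = 0x40    # first read of a pair
FLAG_READ2 = 0x80    # second read of a pair


def sam_line_to_fastq(line):
    """Return a four-line FASTQ record for one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    name, flag = fields[0], int(fields[1])
    seq, qual = fields[9], fields[10]
    if flag & FLAG_REVERSE:
        # Undo the reverse complement and put the qualities back in read order.
        seq = seq.translate(COMPLEMENT)[::-1]
        qual = qual[::-1]
    if flag & FLAG_READ1:
        name += "/1"  # Illumina-style suffix for the first read
    elif flag & FLAG_READ2:
        name += "/2"  # and for the second read
    return "@%s\n%s\n+\n%s\n" % (name, seq, qual)
```

For a record with FLAG 83 (paired, mapped in a proper pair, reverse strand, first in pair) the stored sequence is reverse-complemented and the qualities reversed before the `/1` suffix is added.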
(I'm assuming some of the EMBOSS team must be on the
samtools-devel mailing list which is where most SAM/BAM
format discussion seems to take place)

Regards,

Peter C (@Biopython)

From pmr at ebi.ac.uk Thu Jul 15 07:12:02 2010
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 15 Jul 2010 12:12:02 +0100
Subject: [emboss-dev] EMBOSS 6.3.0 released - SAM/BAM
In-Reply-To:
References:
Message-ID: <4C3EED02.7080507@ebi.ac.uk>

On 15/07/2010 12:01, Peter C. wrote:
> Congratulations on the latest release.
>
>> Some highlights include:
>>
>> ...
>> Support for BAM/SAM files
>> ...
>>
>
> Cool. I should take a look at this before (if) merging SAM/BAM
> support into Biopython. The use case I had in mind was for
> conversion to FASTQ (discarding any alignment information).
>
> What do you do about naming for paired reads? I was appending
> /1 or /2 to match the Illumina convention. Doing nothing means
> the paired reads will have the same names.

Not addressed yet - let's look into a common approach though. We would
also have to look into what the '/' character does to EMBOSS's handling
of sequence names.

> What do you do about the strand issue? SAM/BAM stores reads
> which map onto the reverse strand in reverse complement. If
> you want to get back to the original orientation for output as
> FASTQ you must apply the reverse complement (plus reverse
> the quality scores too of course).

So far we read as sequences. Reading as mapped reads (very large
alignments) is planned for the very near future so it can appear in the
next release.

> Do you support writing SAM/BAM files? If so, would this be
> for aligned reads or unaligned reads only?

Yes we do write them - so far unaligned but we will add aligned reads
when we can treat that as an input type.

> Assuming you do write BAM files, do you support the recent
> convention to use a single BGZF block, and that where possible
> reads should not span a BGZF block boundary?

We looked at samtools 1.7 to get things working.
We still need to look at issues such as using the index for access to
remote BAM files, and various flavours of blocks. I was not aware of
the single block version. Again, we should compare notes.

> (I'm assuming some of the EMBOSS team must be on the
> samtools-devel mailing list which is where most SAM/BAM
> format discussion seems to take place)

Actually no, but I will join it ASAP and catch up.

regards,

Peter Rice

From biopython at maubp.freeserve.co.uk Thu Jul 15 07:36:11 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Jul 2010 12:36:11 +0100
Subject: [emboss-dev] EMBOSS 6.3.0 released - SAM/BAM
In-Reply-To: <4C3EED02.7080507@ebi.ac.uk>
References: <4C3EED02.7080507@ebi.ac.uk>
Message-ID:

On Thu, Jul 15, 2010 at 12:12 PM, Peter Rice wrote:
>
>> What do you do about naming for paired reads? I was appending
>> /1 or /2 to match the Illumina convention. Doing nothing means
>> the paired reads will have the same names.
>
> Not addressed yet - let's look into a common approach though.
> We would also have to look into what the '/' character does to EMBOSS's
> handling of sequence names.

My rationale for appending the /1 and /2 is that in a typical workflow
you might take Illumina paired end data as FASTQ and map it onto a
genome with BWA giving SAM/BAM. You might then want to reverse this
(e.g. if given a SAM/BAM file by a collaborator, and you want to try an
alternative mapping tool or reference genome, first you must recover
the raw reads again, e.g. as FASTQ files).

>> What do you do about the strand issue? SAM/BAM stores reads
>> which map onto the reverse strand in reverse complement. If
>> you want to get back to the original orientation for output as
>> FASTQ you must apply the reverse complement (plus reverse
>> the quality scores too of course).
>
> So far we read as sequences. Reading as mapped reads (very large
> alignments) is planned for the very near future so it can appear in the
> next release.
Given the use case of going from (aligned) SAM/BAM back to the original
FASTQ, for a round trip you *must* undo the reverse complementation.
This is important even for single reads, as quality scores tend to
trail off in the (original) read direction so some algorithms may treat
a reversed version of the read differently.

>> Do you support writing SAM/BAM files? If so, would this be
>> for aligned reads or unaligned reads only?
>
> Yes we do write them - so far unaligned but we will add aligned reads
> when we can treat that as an input type.

I was thinking about this for my experimental SAM/BAM support for
Biopython - doing unaligned output only is much more straightforward
for a stream based writer (no seeks) as you don't have to worry about
header information like reference sequences. Although not as useful as
writing aligned SAM/BAM, some people are already using unaligned
SAM/BAM for storing read data - e.g. GATK
http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit

>> Assuming you do write BAM files, do you support the recent
>> convention to use a single BGZF block [for the header], and
>> that where possible reads should not span a BGZF block boundary?
>
> We looked at samtools 1.7 to get things working. We still need to look at
> issues such as using the index for access to remote BAM files, and various
> flavours of blocks. I was not aware of the single block version. Again, we
> should compare notes.

Is here fine or on a more cross project list? I think BioPerl are
sticking to wrapping samtools rather than experimenting with a
reimplementation.

I corrected above inline - samtools now uses a single BGZF block for
the BAM *header*. This was done to make rewriting the header easier
(you don't need to decompress and re-compress any reads which happened
to be with the header in the first block).
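As an illustration of why a header-only first block helps, here is a rough sketch using only the Python standard library. It is not the samtools code: the function names are hypothetical, the payloads are placeholder bytes rather than a real binary BAM header, and it assumes the BGZF layout of a gzip member carrying a 'BC' extra subfield whose value is the total block length minus one.

```python
import struct
import zlib


def make_bgzf_block(data):
    """Compress data into one BGZF block (sketch; hypothetical helper)."""
    compressor = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate
    payload = compressor.compress(data) + compressor.flush()
    block_len = 12 + 6 + len(payload) + 8  # header + extra + data + trailer
    if block_len > 0x10000:
        raise ValueError("data too large for a single BGZF block")
    # gzip header: magic, deflate, FEXTRA flag, no mtime, OS unknown, XLEN=6
    header = struct.pack("<BBBBIBBH", 31, 139, 8, 4, 0, 0, 255, 6)
    # 'B', 'C' subfield holding the block length minus one
    extra = struct.pack("<BBHH", 66, 67, 2, block_len - 1)
    trailer = struct.pack("<II", zlib.crc32(data) & 0xFFFFFFFF, len(data))
    return header + extra + payload + trailer


def bgzf_blocks(raw):
    """Yield (offset, length) of each BGZF block without decompressing."""
    offset = 0
    while offset < len(raw):
        id1, id2, cm, flg, xlen = struct.unpack_from("<BBBB6xH", raw, offset)
        if (id1, id2, cm) != (31, 139, 8) or not flg & 4:
            raise ValueError("not a BGZF block at offset %d" % offset)
        pos, end, block_len = offset + 12, offset + 12 + xlen, None
        while pos < end:  # walk the gzip extra subfields looking for 'BC'
            si1, si2, slen = struct.unpack_from("<BBH", raw, pos)
            if (si1, si2, slen) == (66, 67, 2):
                block_len = struct.unpack_from("<H", raw, pos + 4)[0] + 1
            pos += 4 + slen
        if block_len is None:
            raise ValueError("missing BC subfield at offset %d" % offset)
        yield offset, block_len
        offset += block_len


def replace_first_block(raw, new_data):
    """Swap the first block's contents; later blocks are copied untouched."""
    _, first_len = next(bgzf_blocks(raw))
    return make_bgzf_block(new_data) + raw[first_len:]
```

When the header has a block to itself, `replace_first_block` never has to decompress or re-compress read data; if reads shared the first block they would have to be carried along in `new_data`.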
>> (I'm assuming some of the EMBOSS team must be on the
>> samtools-devel mailing list which is where most SAM/BAM
>> format discussion seems to take place)
>
> Actually no, but I will join it ASAP and catch up.

Excellent - there have been some interesting discussions about BAM v2
(e.g. moving the header block, handling indels better) and the
possibility of using HDF5 underneath rather than the in house gzip
variant.

Peter C.

From pmr at ebi.ac.uk Thu Jul 15 07:48:19 2010
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 15 Jul 2010 12:48:19 +0100
Subject: [emboss-dev] EMBOSS 6.3.0 released - SAM/BAM
In-Reply-To:
References: <4C3EED02.7080507@ebi.ac.uk>
Message-ID: <4C3EF583.90105@ebi.ac.uk>

On 15/07/2010 12:36, Peter C. wrote:
> Is here fine or on a more cross project list? I think BioPerl are sticking
> to wrapping samtools rather than experimenting with a reimplementation.

I think here is fine while it is just EMBOSS and BioPython, and I would
like some feedback from others on what is most needed.

If it becomes a cross-openbio effort like FASTQ then we can move to the
OBF lists to bring them in.

Heng Li asked us to look after format conversions to help take that
load off samtools, but that still requires us to support SAM and BAM as
fully as we can.

regards,

Peter

From biopython at maubp.freeserve.co.uk Thu Jul 15 09:36:23 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Jul 2010 14:36:23 +0100
Subject: [emboss-dev] EMBOSS 6.3.0 released - SAM/BAM
In-Reply-To: <4C3EF583.90105@ebi.ac.uk>
References: <4C3EED02.7080507@ebi.ac.uk> <4C3EF583.90105@ebi.ac.uk>
Message-ID:

On Thu, Jul 15, 2010 at 12:48 PM, Peter Rice wrote:
>
> On 15/07/2010 12:36, Peter C. wrote:
>
>> Is here fine or on a more cross project list? I think BioPerl are sticking
>> to wrapping samtools rather than experimenting with a reimplementation.
>
> I think here is fine while it is just EMBOSS and BioPython, and I would like
> some feedback from others on what is most needed.
>
> If it becomes a cross-openbio effort like FASTQ then we can move to the OBF
> lists to bring them in.

BioLib and BioRuby sound interested too.

> Heng Li asked us to look after format conversions to help take that load off
> samtools, but that still requires us to support SAM and BAM as fully as we
> can.

I don't quite see how it will take a load off samtools, but nice to be
asked.

Peter

From biopython at maubp.freeserve.co.uk Fri Jul 16 06:25:04 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 16 Jul 2010 11:25:04 +0100
Subject: [emboss-dev] EMBOSS 6.3.0 released - SAM/BAM
In-Reply-To:
References: <4C3EED02.7080507@ebi.ac.uk> <4C3EF583.90105@ebi.ac.uk>
Message-ID:

On Thu, Jul 15, 2010 at 2:36 PM, Peter wrote:
> On Thu, Jul 15, 2010 at 12:48 PM, Peter Rice wrote:
>>
>> On 15/07/2010 12:36, Peter C. wrote:
>>
>>> Is here fine or on a more cross project list? I think BioPerl are sticking
>>> to wrapping samtools rather than experimenting with a reimplementation.
>>
>> I think here is fine while it is just EMBOSS and BioPython, and I would like
>> some feedback from others on what is most needed.
>>
>> If it becomes a cross-openbio effort like FASTQ then we can move to the OBF
>> lists to bring them in.
>
> BioLib and BioRuby sound interested too.

There are some new threads on the biolib-dev mailing list now,
http://lists.open-bio.org/pipermail/biolib-dev/2010-July/thread.html

Peter

From biopython at maubp.freeserve.co.uk Fri Jul 16 07:08:11 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 16 Jul 2010 12:08:11 +0100
Subject: [emboss-dev] [BioLib-dev] R: EMBOSS 6.3.0 released - SAM/BAM
In-Reply-To:
References: <20100715114015.GB28639@thebird.nl> <20100716090126.GA7197@thebird.nl>
Message-ID:

On Fri, Jul 16, 2010 at 11:42 AM, Raoul Bonnal wrote:
>
> Probably considering our idea of interoperability it would be reasonable to
> focus on the emboss implementation.
>
> We can check the performances of the Emboss and then converge on it and
> help in development if needed.
>

I haven't looked at the EMBOSS code, but I assume it is currently
geared towards looking at the reads as individual sequences - much like
FASTQ and most other file formats supported by EMBOSS (and much like my
own experiments for Biopython). This makes perfect sense for extending
the existing EMBOSS tools to work with SAM/BAM. Is this a fair
description Peter R?

On the other hand, the samtools C API is very rich and allows a lot of
alignment based operations (e.g. access to reads based on mapped
position to a reference). Isn't the samtools C API a broader more
useful code base to wrap in the Bio* projects? It will also be kept up
to date with the expected file format changes.

Perhaps Pjotr could clarify what he has in mind for BioLib?

Thanks,

Peter C.

From biopython at maubp.freeserve.co.uk Fri Jul 16 09:15:54 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 16 Jul 2010 14:15:54 +0100
Subject: [emboss-dev] [BioLib-dev] R: EMBOSS 6.3.0 released - SAM/BAM
In-Reply-To:
References: <20100715114015.GB28639@thebird.nl> <20100716090126.GA7197@thebird.nl>
Message-ID:

On Fri, Jul 16, 2010 at 12:54 PM, Jan Aerts wrote:
> I would also favour to base biolib implementation on samtools API, as they
> are the ones who've been developing the whole SAM/BAM format anyway
> (Richard, Li, ...)
> jan.

What would biolib add to the samtools C API?

Peter

From naktinis at csc.fi Thu Jul 29 07:47:18 2010
From: naktinis at csc.fi (Rimvydas Naktinis)
Date: Thu, 29 Jul 2010 14:47:18 +0300
Subject: [emboss-dev] Specifying sequence lists for seqret
Message-ID: <4C516A46.3060006@csc.fi>

Hi,

I'm developing EMBOSS integration into Chipster project
(chipster.csc.fi). I was wondering if there's a way to specify sequence
list (for example, when calling seqret) without creating any extra files?
I know there's a way to do it like this:
> seqret @sequencelist
or
> seqret list:sequencelist

But what I would need would look something like:
> seqret "swiss:CASA1_RABIT,swiss:CASA1_HUMAN"

I've looked into USA format specification and it seems that there is
actually no way to do it without creating some temporary file. Or am I
missing something?

Regards,
Rimvydas Naktinis
CSC - IT Center for Science Ltd

From pmr at ebi.ac.uk Thu Jul 29 08:28:33 2010
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 29 Jul 2010 13:28:33 +0100
Subject: [emboss-dev] Specifying sequence lists for seqret
In-Reply-To: <4C516A46.3060006@csc.fi>
References: <4C516A46.3060006@csc.fi>
Message-ID: <4C5173F1.1070208@ebi.ac.uk>

On 29/07/10 12:47, Rimvydas Naktinis wrote:
> Hi,
>
> I'm developing EMBOSS integration into Chipster project
> (chipster.csc.fi). I was wondering if there's a way to specify sequence
> list (for example, when calling seqret) without creating any extra files?
>
> I know there's a way to do it like this:
>> seqret @sequencelist
> or
>> seqret list:sequencelist
>
> But what I would need would look something like:
>> seqret "swiss:CASA1_RABIT,swiss:CASA1_HUMAN"
>
> I've looked into USA format specification and it seems that there is
> actually no way to do it without creating some temporary file. Or am I
> missing something?

We can add that as an option ... but there is a very real danger that
the command line will be too long.

How many sequences will be on the command line (normal use, and worst
case)?
regards,

Peter Rice

From naktinis at csc.fi Thu Jul 29 09:28:57 2010
From: naktinis at csc.fi (Rimvydas Naktinis)
Date: Thu, 29 Jul 2010 16:28:57 +0300
Subject: [emboss-dev] Specifying sequence lists for seqret
In-Reply-To: <4C5173F1.1070208@ebi.ac.uk>
References: <4C516A46.3060006@csc.fi> <4C5173F1.1070208@ebi.ac.uk>
Message-ID: <4C518219.2090103@csc.fi>

On 07/29/2010 03:28 PM, Peter Rice wrote:
> On 29/07/10 12:47, Rimvydas Naktinis wrote:
>> Hi,
>>
>> I'm developing EMBOSS integration into Chipster project
>> (chipster.csc.fi). I was wondering if there's a way to specify sequence
>> list (for example, when calling seqret) without creating any extra files?
>>
>> I know there's a way to do it like this:
>>> seqret @sequencelist
>> or
>>> seqret list:sequencelist
>>
>> But what I would need would look something like:
>>> seqret "swiss:CASA1_RABIT,swiss:CASA1_HUMAN"
>>
>> I've looked into USA format specification and it seems that there is
>> actually no way to do it without creating some temporary file. Or am I
>> missing something?
>
> We can add that as an option ... but there is a very real danger that
> the command line will be too long.
>
> How many sequences will be on the command line (normal use, and worst case)?
>
> regards,
>
> Peter Rice

In the current use case the user enters the sequence names manually, so
the list should not be very long. However, we should also think about
the general case. As far as I know, starting from Linux kernel version
2.6.23 the argv size is limited to 1/4 of the stack size limit
(http://www.kernel.org/doc/man-pages/online/pages/man2/execve.2.html),
so argument length is basically limited only by available physical
memory. The situation might be different with other operating systems.
I guess in Windows the limit is 32 KB
(http://msdn.microsoft.com/en-us/library/ms682425%28VS.85%29.aspx).
I guess this could be left for programmers to deal with and users, who
use the command line themselves, would probably use the @seqlist option
for long lists anyway.

And thanks for quick response!

Regards,
Rimvydas Naktinis
CSC - IT Center for Science Ltd
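Until a comma-separated form exists, a calling program can hide the temporary file behind a small helper. The following is a sketch only: the function names are hypothetical, it assumes a list file holds one USA per line, and it only builds the `@listfile` argument mentioned above rather than running seqret.

```python
import os
import tempfile


def write_usa_list(usas, directory=None):
    """Write one USA per line to a temporary list file; return its path."""
    fd, path = tempfile.mkstemp(suffix=".list", dir=directory, text=True)
    with os.fdopen(fd, "w") as handle:
        for usa in usas:
            handle.write(usa + "\n")
    return path


def seqret_list_argument(usas, directory=None):
    """Return an '@listfile' argument naming a freshly written list file."""
    return "@" + write_usa_list(usas, directory)


def arg_max():
    """Return the system's ARG_MAX in bytes, or None where it is not exposed."""
    try:
        return os.sysconf("SC_ARG_MAX")
    except (AttributeError, ValueError, OSError):
        return None  # e.g. platforms without os.sysconf
```

The caller is responsible for deleting the list file once seqret has finished. `arg_max()` gives a rough idea of how much room a command line has on the current system, which is the concern raised in the reply above.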