From alvin at pasteur.edu.uy Tue Jun 1 08:37:20 2010 From: alvin at pasteur.edu.uy (Alvaro F Pena Perea) Date: Tue, 1 Jun 2010 09:37:20 -0300 Subject: [Biopython] Cross_match In-Reply-To: References: <66FAAF15-2DCF-4294-8D06-AA758D6659EA@illinois.edu> <455BEB45-23A7-49C7-86F6-83E6D05DAEF6@illinois.edu> Message-ID: Sorry about the mistake, I meant to say Biopython instead of BioPerl. Well, now I know that SearchIO supports cross_match in BioPerl. Anyway, I agree that it would be great if we could make some method for cross_match in Biopython. In my case I'm more interested in the hit information than the pairwise alignment. I'll take a look at Bio.AlignIO 2010/5/31 Peter > On Mon, May 31, 2010 at 11:08 PM, Chris Fields wrote: > > > > Yes, but only one file: > > > > > http://github.com/bioperl/bioperl-live/blob/master/t/data/testdata.crossmatch > > > > chris > > > > Thanks Chris - that saved me searching ;) > > Alvaro - would you just want the pairwise alignment, > or are you interested in the hit information (scores etc)? > I'm wondering if adding support in Bio.AlignIO would > be enough (similar to how we support FASTA -m 10 > output already). > > Peter > From biopython at maubp.freeserve.co.uk Tue Jun 1 08:57:53 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 1 Jun 2010 13:57:53 +0100 Subject: [Biopython] Cross_match In-Reply-To: References: <66FAAF15-2DCF-4294-8D06-AA758D6659EA@illinois.edu> <455BEB45-23A7-49C7-86F6-83E6D05DAEF6@illinois.edu> Message-ID: On Tue, Jun 1, 2010 at 1:37 PM, Alvaro F Pena Perea wrote: > Sorry about the mistake, I meant to say Biopython instead of BioPerl. Well, now > I know that SearchIO supports cross_match in BioPerl. > Anyway, I agree that it would be great if we could make some method for > cross_match in Biopython. In my case I'm more interested in the hit > information than the pairwise alignment. 
I'll take a look at Bio.AlignIO Just to clarify - Bio.AlignIO does not currently support cross_match, but if it did this would be focusing on the pairwise alignments. Peter From biopython at maubp.freeserve.co.uk Thu Jun 3 13:52:31 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 3 Jun 2010 18:52:31 +0100 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? Message-ID: Dear Biopythoneers, We've had several discussions (mostly on the development list) about extending the Bio.SeqIO.index() functionality. For a quick recap of what you can do with it right now, see: http://news.open-bio.org/news/2009/09/biopython-seqio-index/ http://news.open-bio.org/news/2010/04/partial-seq-files-biopython/ There are two major additions that have been discussed (and some code written too): gzip support and storing the index on disk. Currently Bio.SeqIO.index() has to re-index the sequence file each time you run your script. If you run the same script often, it would be useful to be able to save the index information to disk. The idea is that you can then load the index file and get almost immediate random access to the associated sequence file (without waiting to scan the file to rebuild the index). The old OBDA style indexes used by BioPerl, BioRuby etc are one possible file format we might use, but a simple SQLite database may be preferable. This also would give us a way to index really big files with many millions of reads without keeping the file offsets in memory. This is going to be important for random access to the latest massive sequencing data files. Next, support for indexing compressed files (initially concentrating on Unix style gzipped files, e.g. example.fasta.gz) without having to decompress the whole file. You can already parse these files with Bio.SeqIO in conjunction with the Python gzip module. It would be nice to be able to index them too. 
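To make the indexing idea concrete, here is a rough sketch of what such a lookup table of record identifiers to file offsets involves - an illustration only, not Biopython's actual implementation; the filename and helper names are made up:

```python
# Sketch of an in-memory id -> offset index for a FASTA file.
# Illustrative only - this mimics the idea behind Bio.SeqIO.index(),
# not its real code. "example.fasta" and both helpers are invented.

def build_fasta_index(path):
    # One scan of the file, recording where each record starts.
    offsets = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                # The identifier is the first word after ">".
                name = line[1:].split()[0].decode()
                offsets[name] = offset
    return offsets

def fetch_record(path, offsets, name):
    # Jump straight to the record and read until the next one.
    with open(path, "rb") as handle:
        handle.seek(offsets[name])
        lines = [handle.readline()]
        while True:
            line = handle.readline()
            if not line or line.startswith(b">"):
                break
            lines.append(line)
    return b"".join(lines)
```

The offsets dictionary here is the part that currently lives in memory; saving it to disk is what would avoid re-scanning the file on every run.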
Now ideally we'd be able to offer both of these features - but if you had to vote, which would be most important and why? Peter From rjalves at igc.gulbenkian.pt Fri Jun 4 04:10:22 2010 From: rjalves at igc.gulbenkian.pt (Renato Alves) Date: Fri, 04 Jun 2010 09:10:22 +0100 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: References: Message-ID: <4C08B4EE.4020508@igc.gulbenkian.pt> Hi Peter and all, Considering the fact that the first addition 'is' a potential problem, while the second is more of an optimization, I would put my vote on the first. In addition, an sqlite or similar solution would also allow one to use the indexing feature on short run applications where recalculating the index every time is a costly (sometimes too much) operation. Obviously the second would be of great use if put together with the first, but I'm a little bit biased on that since I was part of the group that raised the gzip question in the mailing list some time ago. Regards, Renato Quoting Peter on 06/03/2010 06:52 PM: > Dear Biopythoneers, > > We've had several discussions (mostly on the development list) about > extending the Bio.SeqIO.index() functionality. For a quick recap of what > you can do with it right now, see: > > http://news.open-bio.org/news/2009/09/biopython-seqio-index/ > http://news.open-bio.org/news/2010/04/partial-seq-files-biopython/ > > There are two major additions that have been discussed (and some > code written too): gzip support and storing the index on disk. > > Currently Bio.SeqIO.index() has to re-index the sequence file each > time you run your script. If you run the same script often, it would be > useful to be able to save the index information to disk. The idea is > that you can then load the index file and get almost immediate > random access to the associated sequence file (without waiting to scan > the file to rebuild the index). 
The old OBDA style indexes used by > BioPerl, BioRuby etc are one possible file format we might use, but > a simple SQLite database may be preferable. This also would give > us a way to index really big files with many millions of reads without > keeping the file offsets in memory. This is going to be important for > random access to the latest massive sequencing data files. > > Next, support for indexing compressed files (initially concentrating > on Unix style gzipped files, e.g. example.fasta.gz) without having > to decompress the whole file. You can already parse these files > with Bio.SeqIO in conjunction with the Python gzip module. It would > be nice to be able to index them too. > > Now ideally we'd be able to offer both of these features - but if > you had to vote, which would be most important and why? > > Peter > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From biopython at maubp.freeserve.co.uk Fri Jun 4 04:42:22 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 4 Jun 2010 09:42:22 +0100 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: <4C08B4EE.4020508@igc.gulbenkian.pt> References: <4C08B4EE.4020508@igc.gulbenkian.pt> Message-ID: On Fri, Jun 4, 2010 at 9:10 AM, Renato Alves wrote: > Hi Peter and all, > > Considering the fact that the first addition 'is' a potential problem, > while the second is more of an optimization, I would put my vote on the > first. In addition, an sqlite or similar solution would also allow one > to use the indexing feature on short run applications where > recalculating the index every time is a costly (sometimes too much) > operation. > > Obviously the second would be of great use if put together with the > first, but I'm a little bit biased on that since I was part of the group > that raised the gzip question in the mailing list some time ago. 
> > Regards, > Renato Hi Renato, Unfortunately I was inconsistent about which order I used in my email (gzip vs on disk indexes) so I'm not sure which you are talking about. Are you saying supporting on disk indexes would be your priority (even though you did ask about gzip support in the past)? Peter From lpritc at scri.ac.uk Fri Jun 4 04:49:06 2010 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Fri, 04 Jun 2010 09:49:06 +0100 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: Message-ID: Hi, On 03/06/2010 Thursday, June 3, 18:52, "Peter" wrote: > There are two major additions that have been discussed (and some > code written too): gzip support and storing the index on disk. [...] > Now ideally we'd be able to offer both of these features - but if > you had to vote, which would be most important and why? On-disk indexing. But does this not also lend itself (perhaps eventually...) also to storing the whole dataset in SQLite or similar to avoid syncing problems between the file and the index? Wasn't that also part of a discussion on the BIP list some time ago? I've not looked at how you're already parsing from gzip files, so I hope it's more time-efficient than what I used to do for bzip, which was to write a Pyrex wrapper to Flex, which was using the bzip2 library directly. This was not a speed improvement over uncompressing the file each time I needed to open it (and then using Flex). The same is true for Python's gzip module:

-rw-r--r--  1 lpritc  staff   110M 14 Apr 14:22 phytophthora_infestans_data.tar.gz

$ time gunzip phytophthora_infestans_data.tar.gz

real    0m18.359s
user    0m3.562s
sys     0m0.582s

Python 2.6 (trunk:66714:66715M, Oct 1 2008, 18:36:04)
[GCC 4.0.1 (Apple Computer, Inc. build 5370)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import time
>>> import gzip
>>> def gzip_time():
...     t0 = time.time()
...     f = gzip.open('phytophthora_infestans_data.tar.gz','rb')
...     f.read()
...     print time.time()-t0
...
>>> gzip_time()
19.2009749413

If you know where your data is, it can be quicker to get to, but you still need to uncompress each time, and it scales approximately linearly with the number of lines returned, as you'd expect:

>>> def read_lines(n):
...     t0 = time.time()
...     f = gzip.open('phytophthora_infestans_data.tar.gz', 'rb')
...     lset = [f.readline() for i in range(n)]
...     print time.time() - t0
...     return lset
...
>>> d = read_lines(1000)
0.0324518680573
>>> d = read_lines(10000)
0.11150097847
>>> d = read_lines(100000)
0.808992147446
>>> d = read_lines(1000000)
7.9017291069
>>> d = read_lines(2000000)
15.7361371517
>>> d = read_lines(3000000)
23.7589659691

The advantage to me was in the amount of disk space (and network transfer time/bandwidth) saved by dealing with a compressed file. In the end I decided that, where data access was likely to be frequent, buying more storage and handling uncompressed data would be a better option than dealing directly with the compressed file:

-rw-r--r--  1 lpritc  staff   410M 14 Apr 14:22 phytophthora_infestans_data.tar

>>> def read_file():
...     t0 = time.time()
...     d = open('phytophthora_infestans_data.tar','rb').read()
...     print time.time() - t0
...
>>> read_file()
0.620229959488

>>> def read_file_lines(n):
...     t0 = time.time()
...     f = open('phytophthora_infestans_data.1.tar', 'rb')
...     lset = [f.readline() for i in range(n)]
...     print time.time() - t0
...     return lset
...
>>> d = read_file_lines(100)
0.000148057937622
>>> d = read_file_lines(1000)
0.000863075256348
>>> d = read_file_lines(10000)
0.00704002380371
>>> d = read_file_lines(100000)
0.0780401229858
>>> d = read_file_lines(1000000)
0.804203033447
>>> d = read_file_lines(2000000)
1.71462202072
>>> d = read_file_lines(4000000)
3.55472993851

I don't see (though I'm happy to be shown) how you can efficiently index directly into the LZW/DEFLATE/BZIP compressed data. 
If you're not decompressing the whole thing in one go, I think you still have to partially decompress a section of the file (starting from the front of the file) to retrieve your sequence each time. Even if you index - say, by recording the required buffer size/number of buffer decompressions and the offset of your sequence in the output as the index. This could save memory if you discard, rather than cache, unwanted early output - but I'm not sure that it would be time-efficient to do it for more than one or two (on average) sequences in a compressed file. You'd likely be better off spending your time waiting for the file to decompress once and doing science with the time that's left over ;) I could be wrong, though... Cheers, L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. 
Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). ______________________________________________________ From biopython at maubp.freeserve.co.uk Fri Jun 4 05:16:19 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 4 Jun 2010 10:16:19 +0100 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: References: Message-ID: On Fri, Jun 4, 2010 at 9:49 AM, Leighton Pritchard wrote: > Hi, > > On 03/06/2010 Thursday, June 3, 18:52, "Peter" > wrote: > >> There are two major additions that have been discussed (and some >> code written too): gzip support and storing the index on disk. > > [...] > >> Now ideally we'd be able to offer both of these features - but if >> you had to vote, which would be most important and why? > > On-disk indexing. But does this not also lend itself (perhaps > eventually...) also to storing the whole dataset in SQLite or similar to > avoid syncing problems between the file and the index? Wasn't that also > part of a discussion on the BIP list some time ago? That is a much more complicated problem - serialising data from many different possible file formats. We have BioSQL which is pretty good for things like GenBank, EMBL, SwissProt etc but not suitable for FASTQ. I'd rather stick to the simpler task of recording a lookup table mapping record identifiers to file offsets. > I've not looked at how you're already parsing from gzip files, so I hope > it's more time-efficient than what I used to do for bzip, which was to write a > Pyrex wrapper to Flex, which was using the bzip2 library directly. This was > not a speed improvement over uncompressing the file each time I needed to > open it (and then using Flex). The same is true for Python's gzip module: > > -rw-r--r-- 1 lpritc staff 
110M 14 Apr 14:22 > phytophthora_infestans_data.tar.gz > > $ time gunzip phytophthora_infestans_data.tar.gz > > real 0m18.359s > user 0m3.562s > sys 0m0.582s > > Python 2.6 (trunk:66714:66715M, Oct 1 2008, 18:36:04) > [GCC 4.0.1 (Apple Computer, Inc. build 5370)] on darwin > Type "help", "copyright", "credits" or "license" for more information. >>>> import time >>>> import gzip >>>> def gzip_time(): > ... t0 = time.time() > ... f = gzip.open('phytophthora_infestans_data.tar.gz','rb') > ... f.read() > ... print time.time()-t0 > ... >>>> gzip_time() > 19.2009749413 > > If you know where your data is, it can be quicker to get to, but you still > need to uncompress each time, and it scales approximately linearly with > number of lines returned, as you'd expect: > >>>> def read_lines(n): > ... t0 = time.time() > ... f = gzip.open('phytophthora_infestans_data.tar.gz', 'rb') > ... lset = [f.readline() for i in range(n)] > ... print time.time() - t0 > ... return lset > ... >>>> d = read_lines(1000) > 0.0324518680573 >>>> d = read_lines(10000) > 0.11150097847 >>>> d = read_lines(100000) > 0.808992147446 >>>> d = read_lines(1000000) > 7.9017291069 >>>> d = read_lines(2000000) > 15.7361371517 >>>> d = read_lines(3000000) > 23.7589659691 > > The advantage to me was in the amount of disk space (and network transfer > time/bandwidth) saved by dealing with a compressed file. In the end I > decided that, where data access was likely to be frequent, buying more > storage and handling uncompressed data would be a better option than dealing > directly with the compressed file: > > -rw-r--r-- 1 lpritc staff 410M 14 Apr 14:22 > phytophthora_infestans_data.tar > >>>> def read_file(): > ... t0 = time.time() > ... d = open('phytophthora_infestans_data.tar','rb').read() > ... print time.time() - t0 > ... >>>> read_file() > 0.620229959488 > >>>> def read_file_lines(n): > ... t0 = time.time() > ... 
f = open('phytophthora_infestans_data.1.tar', 'rb') > ... lset = [f.readline() for i in range(n)] > ... print time.time() - t0 > ... return lset > ... >>>> d = read_file_lines(100) > 0.000148057937622 >>>> d = read_file_lines(1000) > 0.000863075256348 >>>> d = read_file_lines(10000) > 0.00704002380371 >>>> d = read_file_lines(100000) > 0.0780401229858 >>>> d = read_file_lines(1000000) > 0.804203033447 >>>> d = read_file_lines(2000000) > 1.71462202072 >>>> d = read_file_lines(4000000) > 3.55472993851 > > > I don't see (though I'm happy to be shown) how you can efficiently index > directly into the LZW/DEFLATE/BZIP compressed data. If you're not > decompressing the whole thing in one go, I think you still have to partially > decompress a section of the file (starting from the front of the file) to > retrieve your sequence each time. Even if you index - say, by recording the > required buffer size/number of buffer decompressions and the offset of your > sequence in the output as the index. This could save memory if you discard, > rather than cache, unwanted early output - but I'm not sure that it would be > time-efficient to do it for more than one or two (on average) sequences in a > compressed file. You'd likely be better off spending your time waiting for > the file to decompress once and doing science with the time that's left over > ;) > > I could be wrong, though... > The proof of concept support for gzip files in Bio.SeqIO.index() just called the Python gzip module. This gives us a file-like handle object supporting the usual methods like readline and iteration (used to scan the file looking for each record) and seek/tell (offsets for the decompressed stream). Here building the index must by its nature decompress the whole file once - there is no way round that. The interesting thing is how seeking to an offset and then reading a record performs - and I have not looked at the run time or memory usage for this. 
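Roughly, the approach amounts to the following sketch (an illustration only, not the actual proof of concept code; the function names are invented). Note that tell() and seek() on a GzipFile work in offsets of the decompressed stream, so a seek may force decompression from the start of the file up to that point:

```python
import gzip

def index_gzipped_fasta(path):
    # One full decompression pass, recording *decompressed* offsets.
    # Illustrative sketch - not the real Bio.SeqIO.index() code.
    offsets = {}
    with gzip.open(path, "rb") as handle:
        while True:
            offset = handle.tell()  # offset in the decompressed stream
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                offsets[line[1:].split()[0].decode()] = offset
    return offsets

def fetch(path, offsets, name):
    with gzip.open(path, "rb") as handle:
        # seek() takes a decompressed offset; gzip has to decompress
        # from the beginning of the file to reach it.
        handle.seek(offsets[name])
        return handle.readline()
```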
It works, but your measurements do suggest it will be much, much slower than using the original file. i.e. it looks like while the code to support gzip files in Bio.SeqIO.index() is quite short, the performance may be unimpressive for large archives. I doubt this can be worked around - it's the cost of saving disk space by compressing a whole file without taking any special care about putting different records into different blocks. Peter [As an aside, this is something I'm interested in for BAM file support - these are binary files which are gzip compressed.] From aboulia at gmail.com Fri Jun 4 06:53:14 2010 From: aboulia at gmail.com (Kevin) Date: Fri, 4 Jun 2010 18:53:14 +0800 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: References: Message-ID: I vote for sqlite index. Have been using bsddb to do the same but the db is inflated compared to plain text. Performance is not bad using btree. For gzip I feel it might be possible to gunzip into a stream which biopython can parse on the fly? Kev Sent from my iPod On 04-Jun-2010, at 1:52 AM, Peter wrote: > Dear Biopythoneers, > > We've had several discussions (mostly on the development list) about > extending the Bio.SeqIO.index() functionality. For a quick recap of > what > you can do with it right now, see: > > http://news.open-bio.org/news/2009/09/biopython-seqio-index/ > http://news.open-bio.org/news/2010/04/partial-seq-files-biopython/ > > There are two major additions that have been discussed (and some > code written too): gzip support and storing the index on disk. > > Currently Bio.SeqIO.index() has to re-index the sequence file each > time you run your script. If you run the same script often, it would > be > useful to be able to save the index information to disk. The idea is > that you can then load the index file and get almost immediate > random access to the associated sequence file (without waiting to scan > the file to rebuild the index). 
The old OBDA style indexes used by > BioPerl, BioRuby etc are one possible file format we might use, but > a simple SQLite database may be preferable. This also would give > us a way to index really big files with many millions of reads without > keeping the file offsets in memory. This is going to be important for > random access to the latest massive sequencing data files. > > Next, support for indexing compressed files (initially concentrating > on Unix style gzipped files, e.g. example.fasta.gz) without having > to decompress the whole file. You can already parse these files > with Bio.SeqIO in conjunction with the Python gzip module. It would > be nice to be able to index them too. > > Now ideally we'd be able to offer both of these features - but if > you had to vote, which would be most important and why? > > Peter > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From biopython at maubp.freeserve.co.uk Fri Jun 4 08:59:22 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 4 Jun 2010 13:59:22 +0100 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: References: Message-ID: On Fri, Jun 4, 2010 at 11:53 AM, Kevin wrote: > I vote for sqlite index. Have been using bsddb to do the same but the db > is inflated compared to plain text. Performance is not bad using btree The other major point against bsddb is that future versions of Python will not include it in the standard library - but Python 2.5+ does have sqlite3 included. > For gzip I feel it might be possible to gunzip into a stream which > biopython can parse on the fly? 
Yes of course, like this:

import gzip
from Bio import SeqIO
handle = gzip.open("uniprot_sprot.dat.gz")
for record in SeqIO.parse(handle, "swiss"): print record.id
handle.close()

Parsing is easy - the point of this discussion is random access to any record within the stream (which requires jumping to an offset). Peter From lgautier at gmail.com Fri Jun 4 14:25:42 2010 From: lgautier at gmail.com (Laurent) Date: Fri, 04 Jun 2010 20:25:42 +0200 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: References: Message-ID: <4C094526.9040508@gmail.com> On 04/06/10 18:00, biopython-request at lists.open-bio.org wrote: > > On Fri, Jun 4, 2010 at 11:53 AM, Kevin wrote: >> I vote for sqlite index. Have been using bsddb to do the same but the db >> is inflated compared to plain text. Performance is not bad using btree > > The other major point against bsddb is that future versions of Python > will not include it in the standard library - but Python 2.5+ does have > sqlite3 included. > >> For gzip I feel it might be possible to gunzip into a stream which >> biopython can parse on the fly? > > Yes of course, like this: > > import gzip > from Bio import SeqIO > handle = gzip.open("uniprot_sprot.dat.gz") > for record in SeqIO.parse(handle, "swiss"): print record.id > handle.close() > > Parsing is easy - the point of this discussion is random access to > any record within the stream (which requires jumping to an offset). > > Peter > One note of caution: Python's gzip module is slow, or so I experienced... to the point that I ended up wrapping the code into a function that gunzipped the file to a temporary location, parse and extract information, then delete the temporary file. Regarding random access in compressed file, there is the BGZF format but I am not familiar enough with it to tell whether it can be of use here. 
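The decompress-to-a-temporary-file workaround mentioned above can be sketched as follows (a generic illustration only; the parsing step is left as a caller-supplied function, and the helper name is made up):

```python
import gzip
import os
import shutil
import tempfile

def with_decompressed_copy(gzipped_path, parse):
    # Decompress once into a temporary file, run the parser over the
    # plain copy, then remove the temporary file again. Illustrative
    # sketch of the workaround described in the thread.
    fd, tmp_path = tempfile.mkstemp()
    os.close(fd)
    try:
        with gzip.open(gzipped_path, "rb") as src:
            with open(tmp_path, "wb") as dst:
                shutil.copyfileobj(src, dst)
        return parse(tmp_path)
    finally:
        os.remove(tmp_path)
```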
More generally, compression is part of the HDF5 format and this with chunks could prove the most battle-tested way to access entries randomly. L. From aboulia at gmail.com Fri Jun 4 14:35:05 2010 From: aboulia at gmail.com (Kevin Lam) Date: Sat, 5 Jun 2010 02:35:05 +0800 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: References: Message-ID: > > > Parsing is easy - the point of this discussion is random access to > any record within the stream (which requires jumping to an offset). > > Peter > apologies didn't follow the thread close enough. Now I understand why the two might be overlapping. I would still vote for sqlite3. Based on my short experience with next gen seq. there are these other benefits: 1) pairing of csfasta with qual files based on read name can be done easier + stored in same db 2) pairing of mate pair and paired end reads can be done easier + stored in same db 3) generation of fastq files from 1) can be done easier 4) double encoded fasta sequence and base space sequence can be stored in same db as well. I think the BWT method of indexing and compression used in bowtie and bwa for reference genomes might be a better way of going about the problem. That said, I think generally disk space is seldom an issue with lowering costs. Time / convenience is probably more important. The one time I wished for smaller NGS files is when I need to do transfers. Kevin From biopython at maubp.freeserve.co.uk Fri Jun 4 15:04:16 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 4 Jun 2010 20:04:16 +0100 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: <4C094526.9040508@gmail.com> References: <4C094526.9040508@gmail.com> Message-ID: On Fri, Jun 4, 2010 at 7:25 PM, Laurent wrote: > > One note of caution: Python's gzip module is slow, or so I experienced... 
to > the point that I ended up wrapping the code into a function that gunzipped > the file to a temporary location, parse and extract information, then delete > the temporary file. > That should be easy to benchmark - using Python's gzip to parse a file versus using the command line tool gzip to decompress and then parse the uncompressed file. > > Regarding random access in compressed file, there is the BGZF format but I > am not familiar enough with it to tell whether it can be of use here. > I've been looking at that this afternoon as it is used in BAM files. However, most gzip files (e.g. FASTA or FASTQ files) created with the gzip command line tools will NOT follow the BGZF convention. I personally have no need to have random access to gzipped general sequence files. However, I have some proof of concept code to exploit GZIP files using the BGZF structure which should give more efficient random access to any part of the file (compared to simply using the gzip module) but haven't yet done any benchmarking. The code is still very immature, but if you want a look see the _BgzfHandle class here: http://github.com/peterjc/biopython/commit/416a795ef618c937bf5d9acbd1ffdf33c4ae4767 > > More generally, compression is part of the HDF5 format and this with chunks > could prove the most battle-tested way to access entries randomly. > But (thus far) no sequence data is stored in HDF5 format (is it?). Peter From chapmanb at 50mail.com Fri Jun 4 15:33:58 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 4 Jun 2010 15:33:58 -0400 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: References: <4C094526.9040508@gmail.com> Message-ID: <20100604193358.GV1054@sobchak.mgh.harvard.edu> Peter and all; > > One note of caution: Python's gzip module is slow, or so I experienced... 
to > > the point that I ended up wrapping the code into a function that gunzipped > > the file to a temporary location, parse and extract information, then delete > > the temporary file. More generally, I find having files gzipped while doing analysis is not very helpful. The time to gunzip and feed them into programs doesn't end up being worth the space tradeoff. My only real use of gzip is when archiving something that I'm done with. > > Regarding random access in compressed file, there is the BGZF format but I > > am not familiar enough with it to tell whether it can be of use here. > > I've been looking at that this afternoon as it is used in BAM files. What Broad does internally is store Fastq files in BAM format. You can convert with this Picard tool: http://picard.sourceforge.net/command-line-overview.shtml#FastqToSam Originally when using their tools I thought this would be as annoying as gzipped files, but it is practically pretty nice since you can access them with pysam. Compression size is the same as if gzipped. What do you think about co-opting the SAM/BAM format for this? This would make it more specific for things that can go into BAM (so no GenBank and what not), but would have the advantage of working with existing workflows. Region based indexing is already implemented for BAM, but it would be really useful to also have ID based retrieval along the lines of what you are proposing. Brad From kevin at aitbiotech.com Fri Jun 4 16:21:05 2010 From: kevin at aitbiotech.com (Kevin Lam) Date: Sat, 5 Jun 2010 04:21:05 +0800 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: <20100604193358.GV1054@sobchak.mgh.harvard.edu> References: <4C094526.9040508@gmail.com> <20100604193358.GV1054@sobchak.mgh.harvard.edu> Message-ID: Just thinking out loud. 
would generating a fake region id (unique for each read id) and the corresponding index when creating the bam be a good quick fix to utilise bam format for ID based retrieval? Or would the double mapping slow things down considerably? Kevin > > What do you think about co-opting the SAM/BAM format for this? This > would make it more specific for things that can go into BAM (so no > GenBank and what not), but would have the advantage of working with > existing workflows. > > Region based indexing is already implemented for BAM, but it would > be really useful to also have ID based retrieval along the lines of > what you are proposing. > > Brad > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From peter at maubp.freeserve.co.uk Fri Jun 4 16:21:33 2010 From: peter at maubp.freeserve.co.uk (Peter) Date: Fri, 4 Jun 2010 21:21:33 +0100 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: <20100604193358.GV1054@sobchak.mgh.harvard.edu> References: <4C094526.9040508@gmail.com> <20100604193358.GV1054@sobchak.mgh.harvard.edu> Message-ID: On Fri, Jun 4, 2010 at 8:33 PM, Brad Chapman wrote: > Peter and all; > > More generally, I find having files gzipped while doing analysis is > not very helpful. The time to gunzip and feed them into programs > doesn't end up being worth the space tradeoff. My only real use of > gzip is when archiving something that I'm done with. It seems that in general support for random access to gzipped files is of niche interest. Avoiding this in Bio.SeqIO.index() will keep the API simple and I think will make the caching to disk stuff a bit easier too. >> > Regarding random access in compressed file, there is the BGZF >> > format but I am not familiar enough with it to tell whether it can be >> > of use here. >> >> I've been looking at that this afternoon as it is used in BAM files. 
> > What Broad does internally is store Fastq files in BAM format. You > can convert with this Picard tool: > > http://picard.sourceforge.net/command-line-overview.shtml#FastqToSam Thanks for the link - I knew I'd seen a FASTQ to unaligned SAM/BAM tool out there somewhere. > Originally when using their tools I thought this would be as annoying as > gzipped files, but it is practically pretty nice since you can access > them with pysam. Compression size is the same as if gzipped. BAM files are compressed with a variant of gzip (this BGZF sub-format), so that isn't a big surprise ;) > What do you think about co-opting the SAM/BAM format for this? This > would make it more specific for things that can go into BAM (so no > GenBank and what not), but would have the advantage of working with > existing workflows. I can see storing unmapped reads in BAM as a sensible alternative to FASTQ. Note that you lose any descriptions (not usually important) but more importantly BAM files do not store the sequence case information (which is often used to encode trimming points). Obviously we'd want to have SAM/BAM output support in Bio.SeqIO to fully take advantage of this (grin). I'm keeping this in mind while working on SAM/BAM parsing, but it would be a *lot* more work. > Region based indexing is already implemented for BAM, but it would > be really useful to also have ID based retrieval along the lines of > what you are proposing. > > Brad Yeah, I've been reading up on the BAM index format (BAI files) and they don't do anything about read lookup by ID at all. So I haven't been reinventing the wheel by trying to do Bio.SeqIO.index() support of BAM - it should be complementary to the pysam stuff. Anyway, even for BAM files we should be able to use the same scheme as all the other file formats supported in Bio.SeqIO.index(): use an SQLite database to hold the lookup table of read names to file offsets (rather than a Python dictionary in memory as now).
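A minimal sketch of such an SQLite-backed lookup table (standard library only, with made-up example records — not the actual Bio.SeqIO.index() code):

```python
import os
import sqlite3
import tempfile

# A tiny FASTA file to index (made-up example records).
fasta = b">read1\nACGT\n>read2\nGGCC\n>read3\nTTAA\n"
path = os.path.join(tempfile.mkdtemp(), "example.fasta")
with open(path, "wb") as out:
    out.write(fasta)

# Build the id -> file offset lookup table on disk with SQLite,
# instead of holding it in a Python dict in memory.
db = sqlite3.connect(os.path.join(os.path.dirname(path), "example.idx"))
db.execute("CREATE TABLE offsets (id TEXT PRIMARY KEY, offset INTEGER)")
with open(path, "rb") as handle:
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            name = line[1:].split()[0].decode()
            db.execute("INSERT INTO offsets VALUES (?, ?)", (name, offset))
db.commit()

# Random access: look up the offset, seek, and read one record.
(offset,) = db.execute(
    "SELECT offset FROM offsets WHERE id=?", ("read2",)
).fetchone()
with open(path, "rb") as handle:
    handle.seek(offset)
    header = handle.readline().decode().strip()
    seq = handle.readline().decode().strip()
print(header, seq)  # prints: >read2 GGCC
```

The index file persists between runs, so a second script could reopen `example.idx` and seek straight to any record without re-scanning the sequence file.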
Regards, Peter From lgautier at gmail.com Sat Jun 5 01:12:57 2010 From: lgautier at gmail.com (Laurent) Date: Sat, 05 Jun 2010 07:12:57 +0200 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: References: <4C094526.9040508@gmail.com> Message-ID: <4C09DCD9.60907@gmail.com> On 04/06/10 21:04, Peter wrote: > On Fri, Jun 4, 2010 at 7:25 PM, Laurent wrote: >> >> One note of caution: Python's gzip module is slow, or so I experienced... to >> the point that I ended up wrapping the code into a function that gunzipped >> the file to a temporary location, parsed and extracted the information, then deleted >> the temporary file. >> > > That should be easy to benchmark - using Python's gzip to parse a file > versus using the command line tool gzip to decompress and then parse > the uncompressed file. > >> >> Regarding random access in compressed files, there is the BGZF format but I >> am not familiar enough with it to tell whether it can be of use here. >> > > I've been looking at that this afternoon as it is used in BAM files. However, > most gzip files (e.g. FASTA or FASTQ files) created with the gzip command > line tools will NOT follow the BGZF convention. I personally have no need > to have random access to gzipped general sequence files. > > However, I have some proof of concept code to exploit GZIP files using the > BGZF structure which should give more efficient random access to any part > of the file (compared to simply using the gzip module) but haven't yet done > any benchmarking. The code is still very immature, but if you want a look > see the _BgzfHandle class here: > > http://github.com/peterjc/biopython/commit/416a795ef618c937bf5d9acbd1ffdf33c4ae4767 Are you using that obscure gzip option that inserts "ticks" throughout the file? If so, I remember reading that this could lead to problems (I just can't remember which ones... maybe it can be found on the web).
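For what it's worth, the trick BGZF relies on can be shown with plain Python: a file made of independently compressed gzip members is still one valid gzip stream, yet any member can be decompressed on its own once its byte offset is known (a simplified sketch of the idea, not Biopython's _BgzfHandle code):

```python
import gzip
import io

# Compress two chunks as separate gzip members (each with its own
# gzip header), then concatenate them - this mimics the block
# layout that BGZF uses.
block1 = gzip.compress(b"ACGTACGT")
block2 = gzip.compress(b"TTTTCCCC")
data = block1 + block2

# A standard gzip reader happily decompresses the whole
# concatenated stream as one file.
whole = gzip.GzipFile(fileobj=io.BytesIO(data)).read()
print(whole)   # b'ACGTACGTTTTTCCCC'

# Given the byte offset of the second member, it can be read in
# isolation - no need to decompress anything before it.
second = gzip.GzipFile(fileobj=io.BytesIO(data[len(block1):])).read()
print(second)  # b'TTTTCCCC'
```

Real BGZF additionally records each block's compressed size in the optional extra field of the gzip header, which is how the block offsets can be collected without decompressing the whole file.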
>> >> More generally, compression is part of the HDF5 format and this with chunks >> could prove the most battle-tested way to access entries randomly. >> > > But (thus far) no sequence data is stored in HDF5 format (is it?). Last year, in a SIG at the ISMB in Stockholm people showed that they have stored next-gen/short-reads using HDF5, and have demonstrated superior performances to BAM (not completely a surprise since to some BAM is reinventing some of the features in HDF5, and HDF5 has been developed for a longer time). I think that their slides are on slideshare (or similar place). Laurent > Peter From cjfields at illinois.edu Sat Jun 5 06:59:59 2010 From: cjfields at illinois.edu (Chris Fields) Date: Sat, 5 Jun 2010 05:59:59 -0500 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: References: <4C094526.9040508@gmail.com> Message-ID: On Jun 4, 2010, at 2:04 PM, Peter wrote: > On Fri, Jun 4, 2010 at 7:25 PM, Laurent wrote: > >> >> More generally, compression is part of the HDF5 format and this with chunks >> could prove the most battle-tested way to access entries randomly. >> > > But (thus far) no sequence data is stored in HDF5 format (is it?). > > Peter There will be a presentation this year at BOSC on BioHDF (HDF5 for bioinformatics). There is a website: http://www.hdfgroup.org/projects/biohdf/ chris From biopython at maubp.freeserve.co.uk Sat Jun 5 07:43:36 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 5 Jun 2010 12:43:36 +0100 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: <4C09DCD9.60907@gmail.com> References: <4C094526.9040508@gmail.com> <4C09DCD9.60907@gmail.com> Message-ID: On Sat, Jun 5, 2010 at 6:12 AM, Laurent wrote: >> >> I've been looking at that this afternoon as it is used in BAM files. >> However, most gzip files (e.g. FASTA or FASTQ files) created with >> the gzip command line tools will NOT follow the BGZF convention. 
>> I personally have no need to have random access to gzipped general >> sequence files. >> >> However, I have some proof of concept code to exploit GZIP files using >> the BGZF structure which should give more efficient random access to >> any part of the file (compared to simply using the gzip module) but >> haven't yet done any benchmarking. The code is still very immature, >> but if you want a look see the _BgzfHandle class here: >> >> http://github.com/peterjc/biopython/commit/416a795ef618c937bf5d9acbd1ffdf33c4ae4767 > > Are you using that obscure gzip option that inserts "ticks" throughout the > file? If so, I remember reading that this could lead to problems (I just > can't remember which ones... maybe it can be found on the web). I'm not sure what you are referring to - probably not. The way BGZF works is that it is a standard GZIP file, made up of multiple GZIP blocks which can be decompressed in isolation (each with their own GZIP header). For random access to any part of the file all you need is the block offset (raw bytes, non-negative) and then a relative offset from the start of that block after decompression (again, non-negative). The non-standard bit is they use the optional subfields in the GZIP header to record the block size - presumably this cannot be inferred any other way. This information gives you the block offsets which are used when constructing the index. >>> More generally, compression is part of the HDF5 format and this with >>> chunks could prove the most battle-tested way to access entries >>> randomly. >> >> But (thus far) no sequence data is stored in HDF5 format (is it?). > > Last year, in a SIG at the ISMB in Stockholm people showed that they have > stored next-gen/short-reads using HDF5, and have demonstrated superior > performance to BAM (not completely a surprise since to some extent BAM is > reinventing some of the features in HDF5, and HDF5 has been developed for a > longer time).
I think that their slides are on slideshare (or similar > place). There is some talk on the samtools mailing list about general improvements to the chunking in BAM, relocating the header information (and other very read specific things about representing error models, indels, etc). You may be right that HDF5 has technical advantages over BAM version 1, but currently my impression is that SAM/BAM is making good headway with becoming a defacto standard for next generation data, while HDF5 is not. Maybe someone should suggest they move to HDF5 internally for BAM version 2? Peter From biopython at maubp.freeserve.co.uk Sat Jun 5 07:51:10 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 5 Jun 2010 12:51:10 +0100 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: References: <4C094526.9040508@gmail.com> Message-ID: On Sat, Jun 5, 2010 at 11:59 AM, Chris Fields wrote: > On Jun 4, 2010, at 2:04 PM, Peter wrote: >> >> But (thus far) no sequence data is stored in HDF5 format (is it?). >> >> Peter > > There will be a presentation this year at BOSC on BioHDF (HDF5 for bioinformatics). > There is a website: > > http://www.hdfgroup.org/projects/biohdf/ It looks like they are making good progress - with SAM/BAM conversion to and from BioHDF in place. Still, as they say: >>> The current BioHDF distribution is a pipleline prototype designed to show >>> the suitability of HDF5 as a biological data store and to determine how to >>> best implement an HDF5-based bioinformatics pipeline. It is in source code >>> format only. The code builds a set of command-line tools which allow >>> uploading and extracting DNA/RNA sequence and alignment data from >>> next-generation gene sequencers. These files have been provided with the >>> same BSD license used by HDF5 >>> >>> ... >>> >>> Please be aware that the code contained in it will be in a high state of flux >>> in the immediate future. 
This certainly looks like something to keep an eye on. In any case, getting back to the thread's purpose - Bio.SeqIO.index() aims to give random access to sequences by their ID for many different file formats. There has been little interest in extending this to support gzipped files. However, extending the code to store the id/offset lookup table on disk with SQLite3 (rather than in memory as a Python dict) would seem welcome. I'll be refreshing the github branch where I was working on this earlier in the year... Peter From lgautier at gmail.com Sat Jun 5 08:06:00 2010 From: lgautier at gmail.com (Laurent) Date: Sat, 05 Jun 2010 14:06:00 +0200 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: References: <4C094526.9040508@gmail.com> <4C09DCD9.60907@gmail.com> Message-ID: <4C0A3DA8.1020009@gmail.com> On 05/06/10 13:43, Peter wrote: > On Sat, Jun 5, 2010 at 6:12 AM, Laurent wrote: >>> >>> I've been looking at that this afternoon as it is used in BAM files. >>> However, most gzip files (e.g. FASTA or FASTQ files) created with >>> the gzip command line tools will NOT follow the BGZF convention. >>> I personally have no need to have random access to gzipped general >>> sequence files files. >>> >>> However, I have some proof of concept code to exploit GZIP files using >>> the BGZF structure which should give more efficient random access to >>> any part of the file (compared to simply using the gzip module) but >>> haven't yet done any benchmarking. The code is still very immature, >>> but if you want a look see the _BgzfHandle class here: >>> >>> http://github.com/peterjc/biopython/commit/416a795ef618c937bf5d9acbd1ffdf33c4ae4767 >> >> Are you using that gzip obscure option that inserts "ticks" throughout the >> file ? If so, I remember reading that this could lead to problems (I just >> can't remember which ones... may be it can be found on the web). > > I'm not sure what you are refering to - probably not. 
The way BGZF works > is it is a standard GZIP file, made up of multiple GZIP blocks which can > be decompressed in isolation (each with their own GZIP header). For > random access to any part of the file all you need is the block offset > (raw bytes, non-negative) and then a relative offset from the start of that > block after decompression (again, non-negative). > > The non-standard bit is they use the optional subfields in the GZIP header > to record the block size - presumably this cannot be infered any other > way. This information gives you the block offsets which are used when > constructing the index. > >>>> More generally, compression is part of the HDF5 format and this with >>>> chunks could prove the most battle-tested way to access entries >>>> randomly. >>> >>> But (thus far) no sequence data is stored in HDF5 format (is it?). >> >> Last year, in a SIG at the ISMB in Stockholm people showed that they have >> stored next-gen/short-reads using HDF5, and have demonstrated superior >> performances to BAM (not completely a surprise since to some BAM is >> reinventing some of the features in HDF5, and HDF5 has been developed for a >> longer time). I think that their slides are on slideshare (or similar >> place). > > There is some talk on the samtools mailing list about general improvements > to the chunking in BAM, relocating the header information (and other very > read specific things about representing error models, indels, etc). You may > be right that HDF5 has technical advantages over BAM version 1, but currently > my impression is that SAM/BAM is making good headway with becoming > a defacto standard for next generation data, while HDF5 is not. Maybe > someone should suggest they move to HDF5 internally for BAM version 2? 
De-facto standards happen to become so because more people use them at some point (which may involve a step during which a lot of people /believe/ that most of the people are using a format over another ;-) ), but this is indeed not necessarily making them the best technical solutions. I do believe that building on HDF5 is a better approach: - better use of resources (do not reinvent completely what is already existing unless better) - HDF5 is designed as a rather general storage architecture, and will let one build tailored solutions when needed. I'd be surprised if the BAM/SAM developers do not know about the HDF formats, but I do not know for sure. Is there any BAM/SAM person reading? Laurent > Peter From biopython at maubp.freeserve.co.uk Sat Jun 5 08:25:52 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 5 Jun 2010 13:25:52 +0100 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: <4C0A3DA8.1020009@gmail.com> References: <4C094526.9040508@gmail.com> <4C09DCD9.60907@gmail.com> <4C0A3DA8.1020009@gmail.com> Message-ID: On Sat, Jun 5, 2010 at 1:06 PM, Laurent wrote: >> >> There is some talk on the samtools mailing list about general improvements >> to the chunking in BAM, relocating the header information (and other very >> read specific things about representing error models, indels, etc). You >> may be right that HDF5 has technical advantages over BAM version 1, >> but currently my impression is that SAM/BAM is making good headway >> with becoming a defacto standard for next generation data, while HDF5 is >> not. Maybe someone should suggest they move to HDF5 internally for BAM >> version 2? > > De-facto standards happen to become so because more people use them > at some point (which may involve a step during which a lot of people /believe/ > that most of the people are using a format over another ;-) ), but this is > indeed not necessarily making them the best technical solutions. Absolutely.
> I do believe that building on HDF5 is a better approach: > - better use of resources (do not reinvent completely what is already > existing unless better) > - HDF5 is designed as a rather general storage architecture, and will let > one build tailored solutions when needed. > > I'd be surprised the BAM/SAM do not know about HDF formats, but I do not > know for sure. Is there any BAM/SAM person reading ? I've been subscribed to the samtools mailing list for a few weeks now. I think we (or better yet the BioHDF team) should put this idea forward on their mailing list. As I said, they appear to be discussing some fairly dramatic changes to the internals of the BAM format (while intending to keep their API as close as possible), so now would be a good time to consider a switch from their blocked gzip system to something else like HDF instead. Chris has pointed out some BioHDF people will be at BOSC 2010. There is also a "HiTSeq: High Throughput Sequencing" ISMB 2010 SIG meeting at the same time as BOSC 2010, so there could be some SAM/BAM folk about in Boston to have some in person discussions with. Will you be there this year, Laurent (or at EuroSciPy or something else instead)? Regards, Peter From chapmanb at 50mail.com Sat Jun 5 08:51:08 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Sat, 5 Jun 2010 08:51:08 -0400 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: <4C0A3DA8.1020009@gmail.com> References: <4C094526.9040508@gmail.com> <4C09DCD9.60907@gmail.com> <4C0A3DA8.1020009@gmail.com> Message-ID: <20100605125108.GA1822@kunkel> Laurent and Peter; > I do believe that building on HDF5 is a better approach: > - better use of resources (do not reinvent completely what is > already existing unless better) > - HDF5 is designed as a rather general storage architecture, and > will let one build tailored solutions when needed.
HDF5 does have lots of good technical points, although as Peter mentions the lack of community uptake is a concern. To potentially explain this, here is my personal HDF5 usage story: I took an in-depth look at PyTables for some large data sets that were overwhelming SQLite: http://www.pytables.org/moin The data loaded quickly without any issues, but the most basic thing I needed was indexes to retrieve a subset of the data by chromosome and position. Unfortunately, you can't create indexes without buying the Pro edition: http://www.pytables.org/moin/HintsForSQLUsers#Creatinganindex That immediately killed my ability to share the script so I ended my HDF5 experiment and reworked my SQLite approach. Also, echoing Peter, the BioHDF download warns you that the code is not stable, tested, or supported: http://www.hdfgroup.org/projects/biohdf/biohdf_downloads.html BAM is widely used and has tools that are meant to work on it in production environments now, while HDF tool support still feels experimental. Sometimes it is best to be practical and keep an eye on other technical solutions as they evolve, Brad From cjfields at illinois.edu Sat Jun 5 08:52:25 2010 From: cjfields at illinois.edu (Chris Fields) Date: Sat, 5 Jun 2010 07:52:25 -0500 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: References: <4C094526.9040508@gmail.com> <4C09DCD9.60907@gmail.com> Message-ID: On Jun 5, 2010, at 6:43 AM, Peter wrote: > On Sat, Jun 5, 2010 at 6:12 AM, Laurent wrote: >>> >>> I've been looking at that this afternoon as it is used in BAM files. >>> However, most gzip files (e.g. FASTA or FASTQ files) created with >>> the gzip command line tools will NOT follow the BGZF convention. >>> I personally have no need to have random access to gzipped general >>> sequence files.
>>> >>> However, I have some proof of concept code to exploit GZIP files using >>> the BGZF structure which should give more efficient random access to >>> any part of the file (compared to simply using the gzip module) but >>> haven't yet done any benchmarking. The code is still very immature, >>> but if you want a look see the _BgzfHandle class here: >>> >>> http://github.com/peterjc/biopython/commit/416a795ef618c937bf5d9acbd1ffdf33c4ae4767 >> >> Are you using that gzip obscure option that inserts "ticks" throughout the >> file ? If so, I remember reading that this could lead to problems (I just >> can't remember which ones... may be it can be found on the web). > > I'm not sure what you are refering to - probably not. The way BGZF works > is it is a standard GZIP file, made up of multiple GZIP blocks which can > be decompressed in isolation (each with their own GZIP header). For > random access to any part of the file all you need is the block offset > (raw bytes, non-negative) and then a relative offset from the start of that > block after decompression (again, non-negative). > > The non-standard bit is they use the optional subfields in the GZIP header > to record the block size - presumably this cannot be infered any other > way. This information gives you the block offsets which are used when > constructing the index. > >>>> More generally, compression is part of the HDF5 format and this with >>>> chunks could prove the most battle-tested way to access entries >>>> randomly. >>> >>> But (thus far) no sequence data is stored in HDF5 format (is it?). >> >> Last year, in a SIG at the ISMB in Stockholm people showed that they have >> stored next-gen/short-reads using HDF5, and have demonstrated superior >> performances to BAM (not completely a surprise since to some BAM is >> reinventing some of the features in HDF5, and HDF5 has been developed for a >> longer time). I think that their slides are on slideshare (or similar >> place). 
> > There is some talk on the samtools mailing list about general improvements > to the chunking in BAM, relocating the header information (and other very > read specific things about representing error models, indels, etc). You may > be right that HDF5 has technical advantages over BAM version 1, but currently > my impression is that SAM/BAM is making good headway with becoming > a defacto standard for next generation data, while HDF5 is not. Maybe > someone should suggest they move to HDF5 internally for BAM version 2? > > Peter I have run into a few people (primarily those interested in mapping reads to genome seq) that have pointed out some problems with SAM/BAM, particularly the lack of more definitive, clear-cut definitions for regions of non-matching sequences (possibly due to many reasons, such as splice junctions, etc). Haven't actually looked at the SAM/BAM spec myself to see how correct this point is, but there are others either rolling their own solutions or threatening to. chris From lgautier at gmail.com Sat Jun 5 09:05:43 2010 From: lgautier at gmail.com (Laurent) Date: Sat, 05 Jun 2010 15:05:43 +0200 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: References: <4C094526.9040508@gmail.com> <4C09DCD9.60907@gmail.com> <4C0A3DA8.1020009@gmail.com> Message-ID: <4C0A4BA7.5090102@gmail.com> On 05/06/10 14:25, Peter wrote: > On Sat, Jun 5, 2010 at 1:06 PM, Laurent wrote: >>> >>> There is some talk on the samtools mailing list about general improvements >>> to the chunking in BAM, relocating the header information (and other very >>> read specific things about representing error models, indels, etc). You >>> may be right that HDF5 has technical advantages over BAM version 1, >>> but currently my impression is that SAM/BAM is making good headway >>> with becoming a defacto standard for next generation data, while HDF5 is >>> not. Maybe someone should suggest they move to HDF5 internally for BAM >>> version 2? 
>> >> De-facto standards happen to become so because more people use them >> at some point (which may involve step during which a lot of people /believe/ >> that most of the people are using a format over an other ;-) ), but this is >> indeed not necessarily making them the best technical solutions. > > Absolutley. > >> I do believe that building on HDF5 is a better approach: >> - better use of resources (do not reinvent completely what is already >> existing unless better) >> - HDF5 is designed as a rather general storage architecture, and will let >> one build tailored solutions when needed. >> >> I'd be surprised the BAM/SAM do not know about HDF formats, but I do not >> know for sure. Is there any BAM/SAM person reading ? > > I've been subscribed to the samtools mailing list for a few weeks now. I think > we (or better yet the BioHDF team) should put this idea forward on their > mailing list. As I said, they appear to be discussing some fairly dramatic > changes to the internals of the BAM format (while intending to keep their > API as close as possible), so now would be a good time to consider a > switch from their blocked gzip system to something else like HDF instead. > > Chris has pointed out some BioHDF people will be at BOSC 2010. There > is also a "HiTSeq: High Throughput Sequencing" ISMB 2010 SIG meeting > at the same time as BOSC 2010, so there could be some SAM/BAM > folk about in Boston to have some in person discussions with. Will you > be there is year Laurent (or at EuroSciPy or something else instead)? I'll be at BOSC / ISMB. Hopefully we will all stumble upon each other. Best, Laurent > Regards, > > Peter From cjfields at illinois.edu Sat Jun 5 08:56:13 2010 From: cjfields at illinois.edu (Chris Fields) Date: Sat, 5 Jun 2010 07:56:13 -0500 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? 
In-Reply-To: References: <4C094526.9040508@gmail.com> Message-ID: <3D71994B-1308-4D57-AD60-F9B66B77B063@illinois.edu> On Jun 5, 2010, at 6:51 AM, Peter wrote: > On Sat, Jun 5, 2010 at 11:59 AM, Chris Fields wrote: >> On Jun 4, 2010, at 2:04 PM, Peter wrote: >>> >>> But (thus far) no sequence data is stored in HDF5 format (is it?). >>> >>> Peter >> >> There will be a presentation this year at BOSC on BioHDF (HDF5 for bioinformatics). >> There is a website: >> >> http://www.hdfgroup.org/projects/biohdf/ > > It looks like they are making good progress - with SAM/BAM conversion to and > from BioHDF in place. Still, as they say: > >>>> The current BioHDF distribution is a pipleline prototype designed to show >>>> the suitability of HDF5 as a biological data store and to determine how to >>>> best implement an HDF5-based bioinformatics pipeline. It is in source code >>>> format only. The code builds a set of command-line tools which allow >>>> uploading and extracting DNA/RNA sequence and alignment data from >>>> next-generation gene sequencers. These files have been provided with the >>>> same BSD license used by HDF5 >>>> >>>> ... >>>> >>>> Please be aware that the code contained in it will be in a high state of flux >>>> in the immediate future. > > This certainly looks like something to keep an eye on. > > In any case, getting back to the thread's purpose - Bio.SeqIO.index() aims to > give random access to sequences by their ID for many different file formats. > There has been little interest in extending this to support gzipped > files. However, > extending the code to store the id/offset lookup table on disk with SQLite3 > (rather than in memory as a Python dict) would seem welcome. I'll be > refreshing the github branch where I was working on this earlier in the year... > > Peter We have seen (on the bioperl side) some interest in allowing gzip/bzip and others in via the PerlIO layer, and also AnyDBM using SQLite. 
Mark Jensen actually did a little work along these lines, though I'm not sure how clear-cut the support is at the moment. chris From cjfields at illinois.edu Sat Jun 5 09:31:37 2010 From: cjfields at illinois.edu (Chris Fields) Date: Sat, 5 Jun 2010 08:31:37 -0500 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: <20100605125108.GA1822@kunkel> References: <4C094526.9040508@gmail.com> <4C09DCD9.60907@gmail.com> <4C0A3DA8.1020009@gmail.com> <20100605125108.GA1822@kunkel> Message-ID: <51440B55-36B6-456F-9750-30AC67B95D48@illinois.edu> On Jun 5, 2010, at 7:51 AM, Brad Chapman wrote: > Laurent and Peter; > >> I do believe that building on HDF5 is a better approach: >> - better use of resources (do not reinvent completely what is >> already existing unless better) >> - HDF5 is designed as a rather general storage architecture, and >> will let one build tailored solutions when needed. > > HDF5 does has lots of good technical points, although as Peter mentions > the lack of community uptake is a concern. To potentially explain this, > here is my personal HDF5 usage story: I took an in depth look at PyTables > for some large data sets that were overwhelming SQLite: > > http://www.pytables.org/moin > > The data loaded quickly without any issues, but the most basic thing > I needed was indexes to retrieve a subset of the data by chromosome > and position. Unfortunately, you can't create indexes without > buying the Pro edition: > > http://www.pytables.org/moin/HintsForSQLUsers#Creatinganindex > > That immediately killed my ability to share the script so I ended > my HDF5 experiment and reworked my SQLite approach. 
> > Also, echoing Peter, the BioHDF download warns you that the code is > not stable, tested, or supported: > > http://www.hdfgroup.org/projects/biohdf/biohdf_downloads.html > > BAM is widely used and has tools that are meant to work on it > in production environments now, while HDF tool support still feels > experimental. Sometimes it is best to be practical and keep an eye > on other technical solutions as they evolve, > > Brad Yes, will be interesting to see how far along it is at BOSC. chris From lgautier at gmail.com Sat Jun 5 10:07:00 2010 From: lgautier at gmail.com (Laurent) Date: Sat, 05 Jun 2010 16:07:00 +0200 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: References: Message-ID: <4C0A5A04.1060207@gmail.com> On 05/06/10 15:02, biopython-request at lists.open-bio.org wrote: > Laurent and Peter; > >> I do believe that building on HDF5 is a better approach: >> - better use of resources (do not reinvent completely what is >> already existing unless better) >> - HDF5 is designed as a rather general storage architecture, and >> will let one build tailored solutions when needed. > > HDF5 does has lots of good technical points, although as Peter mentions > the lack of community uptake is a concern. To potentially explain this, > here is my personal HDF5 usage story: I took an in depth look at PyTables > for some large data sets that were overwhelming SQLite: > > http://www.pytables.org/moin > > The data loaded quickly without any issues, but the most basic thing > I needed was indexes to retrieve a subset of the data by chromosome > and position. Unfortunately, you can't create indexes without > buying the Pro edition: > > http://www.pytables.org/moin/HintsForSQLUsers#Creatinganindex > > That immediately killed my ability to share the script so I ended > my HDF5 experiment and reworked my SQLite approach. 
PyTables is already a dialect of HDF5 (not necessarily readable by other HDF5 software/libraries), and the "Pro Edition" adds indexing capabilities, I think. h5py is the alternative. Also, indexing (as in "hash function + tree") can be done using SQLite, and both (HDF5 and SQLite) can complement each other very efficiently. [I have designed and implemented ad-hoc hybrid solutions on several occasions, and never regretted it so far] > Also, echoing Peter, the BioHDF download warns you that the code is > not stable, tested, or supported: Not tested is not good, but that's mostly a matter of having unit tests. Also I am referring to using HDF5 (mature, tested), not necessarily BioHDF as a higher layer (which I have no experience at all with). Should BioHDF not have tests and release cycles, it will probably not be the answer for me either. Along those lines, a very recent post advertising for a position at FHRC (bioconductor's group) suggests that HDF5 (and netCDF) are directions considered over there as well. > http://www.hdfgroup.org/projects/biohdf/biohdf_downloads.html > > BAM is widely used and has tools that are meant to work on it > in production environments now, while HDF tool support still feels > experimental. I had that feeling with BAM/SAM tools at the time, and I already knew my way around HDF5 a bit. > Sometimes it is best to be practical and keep an eye > on other technical solutions as they evolve, I am reading otherwise that not everyone using BAM/SAM is happy with it (and some threatening to fork). I might well be wrong, but I don't think that BAM/SAM has (yet) a place so prominent that efforts should first go into converting to it. > Brad From chapmanb at 50mail.com Sat Jun 5 16:42:23 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Sat, 5 Jun 2010 16:42:23 -0400 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk?
In-Reply-To: <2133BED5-43A4-4909-87CF-ABF12AC63C9A@gmail.com> References: <4C094526.9040508@gmail.com> <4C09DCD9.60907@gmail.com> <4C0A3DA8.1020009@gmail.com> <20100605125108.GA1822@kunkel> <2133BED5-43A4-4909-87CF-ABF12AC63C9A@gmail.com> Message-ID: <20100605204135.GB1822@kunkel> Aaron and Laurent; Aaron: > I am facing a similar situation. Brad, out of curiosity did you also try h5py? > > http://code.google.com/p/h5py/ Yes, I think that's the right way to go. After I found out about the indexing I re-aligned my thinking around h5py, which is more hierarchical than table based. Ironically, this led me to a more compact binned solution which would work fine within SQLite, which is why I never got very far with h5py. I would start with this next time a need arises. Laurent: > Not tested is not good, but that's mostly a matter of having unit tests. I'm not knocking the code, only reading the warnings on the download page. Hopefully this will shape up to be something usable, and like others am looking forward to the BOSC presentation. > Also I am referring to using HDF5 (mature, tested), not necessarily > BioHDF as an higher layer (which I have no experience at all with). > Should BioHDF not have tests and release cycles, it will probably > not be the answer for me either. > > Along those lines, a very recent post advertising for a position at > FHRC (bioconductor's group) suggests that HDF5 (and netCDF) are > directions considered over there as well. That's good news. Essentially what I wanted was to build a data structure that I could sub-select out of into an R data.frame, ala sqldf: http://code.google.com/p/sqldf/ > I am reading otherwise that not everyone using BAM/SAM is happy with > it (and some threatening to fork). > I might well be wrong, but I don't think that BAM/SAM has (yet) a > place so prominent that efforts should first go into converting to > it. 
Oh please don't ruin my day by bringing up that possibility; BAM features pretty prominently in my daily work. Broad's Picard and GATK pipelines are based solely on BAM, so I might be biased due to my interactions with them. Hopefully if the community moves to something else for alignment representation a smooth transition is planned. Famous last words, Brad From gnd9 at cox.net Sun Jun 6 19:24:15 2010 From: gnd9 at cox.net (Gary) Date: Sun, 6 Jun 2010 18:24:15 -0500 Subject: [Biopython] Helpa Newbie Please.py Message-ID: <841F4789085F4EB9A6A353F8FD755D7C@garyPC> Subject: Helpa Newbie Please.py Just started into this fantastic project & can't get past Cookbook 2.4.2 parsing example! from Bio import SeqIO for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank"): print seq_record.id print repr(seq_record.seq) print len(seq_record) ALWAYS REPLIES: Traceback (most recent call last): File "<stdin>", line 1, in <module> File "C:\Python26\lib\site-packages\Bio\SeqIO\__init__.py", line 483, in parse IOError: [Errno 2] No such file or directory: 'ls_orchid.gbk' I do have the file; in fact I put it in several locations.... genbank folder, SeqIO folder, Python2.6 folder.... Does anyone know in which folder SeqIO is looking for the orchid file? I assume that's my problem Thanks in advance Helpa Newbie Please.py From jordan.r.willis at Vanderbilt.Edu Sun Jun 6 19:36:28 2010 From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R) Date: Sun, 6 Jun 2010 18:36:28 -0500 Subject: [Biopython] Helpa Newbie Please.py In-Reply-To: <841F4789085F4EB9A6A353F8FD755D7C@garyPC> Message-ID: Hi Gary, Python will always look for the file in whichever directory you started python in. I would go to the command line and start python in the same directory in which you have ls_orchid.gbk. On another note, I'm not sure when this changed but SeqIO.parse will only accept a file object contrary to the example. So try this.
from Bio import SeqIO for seq_record in SeqIO.parse(open("ls_orchid.gbk"), "genbank"): print seq_record.id print repr(seq_record.seq) print len(seq_record) Jordan On 6/6/10 6:24 PM, "Gary" wrote: Subject: Helpa Newbie Please.py Just started into this fantastic project & can't get past Cookbook 2.4.2 parsing example! from Bio import SeqIO for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank"): print seq_record.id print repr(seq_record.seq) print len(seq_record) ALWAYS REPLIES: Traceback (most recent call last): File "<stdin>", line 1, in <module> File "C:\Python26\lib\site-packages\Bio\SeqIO\__init__.py", line 483, in parse IOError: [Errno 2] No such file or directory: 'ls_orchid.gbk' I do have the file; in fact I put it in several locations.... genbank folder, SeqIO folder, Python2.6 folder.... Does anyone know in which folder SeqIO is looking for the orchid file? I assume that's my problem Thanks in advance Helpa Newbie Please.py _______________________________________________ Biopython mailing list - Biopython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From mjldehoon at yahoo.com Sun Jun 6 20:30:52 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sun, 6 Jun 2010 17:30:52 -0700 (PDT) Subject: [Biopython] Helpa Newbie Please.py In-Reply-To: <841F4789085F4EB9A6A353F8FD755D7C@garyPC> Message-ID: <352042.44056.qm@web62404.mail.re1.yahoo.com> The file ls_orchid.gbk should be in your current directory (the one in which you start python). --Michiel. --- On Sun, 6/6/10, Gary wrote: > From: Gary > Subject: [Biopython] Helpa Newbie Please.py > To: biopython at lists.open-bio.org > Date: Sunday, June 6, 2010, 7:24 PM > Subject: Helpa Newbie Please.py > > > Just started into this fantastic project & can't get > past Cookbook 2.4.2 parsing example! > > from Bio import SeqIO > for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank"): > print seq_record.id > print repr(seq_record.seq) >
print len(seq_record) > > ALWAYS REPLIES: > > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > File "C:\Python26\lib\site-packages\Bio\SeqIO\__init__.py", > line 483, in parse > IOError: [Errno 2] No such file or directory: > 'ls_orchid.gbk' > > > I do have the file; in fact I put it in several > locations.... genbank folder, SeqIO folder, Python2.6 > folder.... > Does anyone know in which folder SeqIO is looking for the > orchid file? > I assume that's my problem > > Thanks in advance > Helpa Newbie Please.py > > > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From rjalves at igc.gulbenkian.pt Mon Jun 7 04:01:25 2010 From: rjalves at igc.gulbenkian.pt (Renato Alves) Date: Mon, 07 Jun 2010 09:01:25 +0100 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: References: <4C08B4EE.4020508@igc.gulbenkian.pt> Message-ID: <4C0CA755.6060608@igc.gulbenkian.pt> Quoting Peter on 06/04/2010 09:42 AM: > Unfortunately I was inconsistent about which order I used in my email > (gzip vs on disk indexes) so I'm not sure which you are talking about. > Are you saying supporting on disk indexes would be your priority (even > though you did ask about gzip support in the past)? Yes exactly. The gzip support became a non-priority at least for our current local uses. On the other hand, disk support would be quite helpful. As a matter of fact we borrowed a little of your SeqIO.index() sqlite code you have on a github branch. Renato.
From biopython at maubp.freeserve.co.uk Mon Jun 7 04:39:50 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 7 Jun 2010 09:39:50 +0100 Subject: [Biopython] Helpa Newbie Please.py In-Reply-To: References: <841F4789085F4EB9A6A353F8FD755D7C@garyPC> Message-ID: On Mon, Jun 7, 2010 at 12:36 AM, Willis, Jordan R wrote: > Hi Gary, > > Python will always look for the file in whichever directory you started > python in. I would go to the command line and start python in the same > directory in which you have ls_orchid.gbk. > > On another note, I'm not sure when this changed but SeqIO.parse > will only accept a file object contrary to the example. So try this. If using Biopython 1.54 or later, you can use a filename or handle. http://news.open-bio.org/news/2010/04/biopython-seqio-and-alignio-easier/ If using Biopython 1.53 or older, you need a handle (file object) as in Jordan's example: > from Bio import SeqIO > for seq_record in SeqIO.parse(open("ls_orchid.gbk"), "genbank"): > print seq_record.id > print repr(seq_record.seq) > print len(seq_record) Gary - If you don't have the file in your current working directory, it might be simplest to give a full path, for example: from Bio import SeqIO filename = r"C:\My Documents\new work\ls_orchid.gbk" for seq_record in SeqIO.parse(filename, "genbank"): print seq_record.id print repr(seq_record.seq) print len(seq_record) When defining the filename, using r"text" with a leading r means a raw string. This means \n or \t etc won't be treated as a newline or a tab, which is what Python does normally - something to beware of as Windows uses \ in paths. Peter From e.picardi at unical.it Mon Jun 7 05:10:26 2010 From: e.picardi at unical.it (Ernesto) Date: Mon, 7 Jun 2010 11:10:26 +0200 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk?
In-Reply-To: <4C0CA755.6060608@igc.gulbenkian.pt> References: <4C08B4EE.4020508@igc.gulbenkian.pt> <4C0CA755.6060608@igc.gulbenkian.pt> Message-ID: <8E4EFAE6-63F8-4355-9514-24E8C14474F8@unical.it> Hi all, I followed the interesting discussion about indexing. I think that it is a hot point given the huge amount of data released by the new sequencing technologies. I have never used Bio.SeqIO.index() but I'd like to test it and I'd also like to know how to use it. Is there a simple tutorial? In the past I tried PyTables, based on the HDF5 library, and I was impressed by its high speed. However, the indexing is not supported, at least in the free version. Moreover, strings of variable length cannot be easily handled and stored. For example, in order to store EST sequences you need to know a priori the maximum length in order to optimize the storage. As an alternative, VLAs (variable length arrays) could be used, but the storing performance goes down quickly. A few days ago I tried to store millions of records using SQLite and I found it very slow, although my code is not optimized (I'm not a computer scientist but a biologist who likes Python and Biopython). However, as an alternative, I found the tokyocabinet library (http://1978th.net/tokyocabinet/) that is a modern implementation (in C) of DBM. There are a lot of python wrappers like tokyocabinet-python 0.5.0 (http://pypi.python.org/pypi/tokyocabinet-python/) that work efficiently and guarantee high speed and compression. Tokyocabinet implements hash databases, B-tree databases, and table databases, also giving the possibility to store info on disk or in memory. In the case of table databases it should be able to index specific columns. Hope this helps, Ernesto On 07/Jun/2010, at 10:01, Renato Alves wrote: > Quoting Peter on 06/04/2010 09:42 AM: >> Unfortunately I was inconsistent about which order I used in my email >> (gzip vs on disk indexes) so I'm not sure which you are talking about.
>> Are you saying supporting on disk indexes would be your priority (even >> though you did ask look at gzip support in the past)? > > Yes exactly. The gzip support became a non priority at least for our > current local uses. On the other hand, disk support would be quite helpful. > As a matter of fact we borrowed a little of your SeqIO.index() sqlite > code you have on a github branch. > > Renato. > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From biopython at maubp.freeserve.co.uk Mon Jun 7 05:49:17 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 7 Jun 2010 10:49:17 +0100 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: <8E4EFAE6-63F8-4355-9514-24E8C14474F8@unical.it> References: <4C08B4EE.4020508@igc.gulbenkian.pt> <4C0CA755.6060608@igc.gulbenkian.pt> <8E4EFAE6-63F8-4355-9514-24E8C14474F8@unical.it> Message-ID: On Mon, Jun 7, 2010 at 10:10 AM, Ernesto wrote: > Hi all, > > I followed the interesting discussion about indexing. I think that > it is a hot point given the huge amount of data released by the > new sequencing technologies. Yes - although the discussion has gone beyond just indexing to also cover storing data. > I never used the Bio.SeqIO.index() but I'd like to test it and I'd > like also to know how to use it. Is there a simple tutorial? Bio.SeqIO.index() is included in Biopython 1.52 onwards. It is covered in the main Biopython Tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf There are also a few blog posts about it (linked to at the start of this thread): http://news.open-bio.org/news/2009/09/biopython-seqio-index/ http://news.open-bio.org/news/2010/04/partial-seq-files-biopython/ > In the past I tried pytables based on HDF5 library and I was > impressed by its high speed. 
However, the indexing is not supported > at least for the free version. Yes, Brad wrote about this frustration earlier. > Moreover, strings of variable length cannot be easily handled and > stored. For example, in order to store EST sequences you need to > know a priori the maximum length in order to optimize the storage. > As an alternative, VLAs (variable length arrays) could be used but > the storing performance goes down quickly. The BioHDF project has probably thought about this kind of issue. However, for Bio.SeqIO.index() we don't store the sequences in a database - just the associated file offsets. The current Bio.SeqIO.index() code works by scanning a sequence file and storing a lookup table of record identifiers and file offsets in a Python dictionary. This works very well but once you get into tens of millions of records the memory requirements become a problem. For instance, running a 64bit Python can actually be important as you may need more than 4GB of RAM. Also, for very large files, the time taken to build the index gets longer - so having to reindex the file each time can become an issue. Saving the index to disk solves this, and can also let us avoid keeping the whole lookup table in memory. > A few days ago I tried to store millions of records using SQLite and I > found it very slow, although my code is not optimized (I'm not a > computer scientist but a biologist who likes Python and Biopython). If you search the Biopython development mailing list you'll see we've already done some work using SQLite to store the file offsets. There is an experimental branch on github here if you are curious BUT this is not ready for production use: http://github.com/peterjc/biopython/tree/index-sqlite > However, as an alternative, I found the tokyocabinet library > (http://1978th.net/tokyocabinet/) that is a modern implementation (in C) > of DBM.
There are a lot of python wrappers like tokyocabinet-python > 0.5.0 (http://pypi.python.org/pypi/tokyocabinet-python/) that work > efficiently and guarantee high speed and compression. Tokyocabinet > implements hash databases, B-tree databases, table databases > giving also the possibility to store info on disk or on memory. In > case of table databases it should be able to index specific columns. Tokyocabinet is certainly an interesting project, but this isn't the issue that Bio.SeqIO.index() is trying to solve. You might be interested in Brad's blog post from last year: http://bcbio.wordpress.com/2009/05/10/evaluating-key-value-and-document-stores-for-short-read-data/ Regards, Peter From aboulia at gmail.com Mon Jun 7 06:38:33 2010 From: aboulia at gmail.com (Kevin Lam) Date: Mon, 7 Jun 2010 18:38:33 +0800 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: References: <4C08B4EE.4020508@igc.gulbenkian.pt> <4C0CA755.6060608@igc.gulbenkian.pt> <8E4EFAE6-63F8-4355-9514-24E8C14474F8@unical.it> Message-ID: > > Tokyocabinet is certainly an interesting project, but this isn't the issue > that Bio.SeqIO.index() is trying to solve. You might be interested in > Brad's blog post from last year: > > > http://bcbio.wordpress.com/2009/05/10/evaluating-key-value-and-document-stores-for-short-read-data/ > > Regards, > > Peter > > Hi Peter > Can I summarise it? I think a lot of well-meaning people are pushing for their > fav db but > > using a disk based db like sqlite for Bio.SeqIO.index() for recording the > file offset is going to be the best way to do it versus trying to find > another suitable non-mysql db variant to 'databasify' the short read-data? > As the latter would be relatively easy for anyone else interested to > experiment to code their own scripts for their fav db > > :) > > > this post on MongoDB is interesting btw.
> http://blog.zawodny.com/2010/05/22/mongodb-early-impressions/ > > From biopython at maubp.freeserve.co.uk Mon Jun 7 07:02:12 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 7 Jun 2010 12:02:12 +0100 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: References: <4C08B4EE.4020508@igc.gulbenkian.pt> <4C0CA755.6060608@igc.gulbenkian.pt> <8E4EFAE6-63F8-4355-9514-24E8C14474F8@unical.it> Message-ID: On Mon, Jun 7, 2010 at 11:38 AM, Kevin Lam wrote: > > Hi Peter > Can I summarise it? I think a lot of well-meaning people are pushing for their > fav db but using a disk based db like sqlite for Bio.SeqIO.index() for > recording the file offset is going to be the best way to do it versus trying > to find another suitable non-mysql db variant to 'databasify' the short > read-data? As the latter would be relatively easy for anyone else > interested to experiment to code their own scripts for their fav db > > :) I think that is a good summary. Bio.SeqIO.index() is for random access to assorted existing file formats (e.g. FASTA, FASTQ, SFF) by record identifier string and works with a lookup table of offsets. We are going to try storing this lookup table in SQLite. The proof-of-concept code works, is cross-platform, adds no external dependencies - and seems fast enough too. As we add more file formats to Bio.SeqIO, in most cases we can add support for indexing them in the same way. Maybe one day this will include BioHDF as it matures? Peter From biopython at maubp.freeserve.co.uk Mon Jun 7 09:40:54 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 7 Jun 2010 14:40:54 +0100 Subject: [Biopython] EuroSciPy 2010 conference in Paris In-Reply-To: References: Message-ID: On Sat, Jun 5, 2010 at 3:49 PM, Peter wrote: > Hi all, > > Are any Biopython folk planning to be at the EuroSciPy > conference in Paris this year (July 2010)?
They are still > finalising the Scientific track, but the list of tutorials is > quite interesting already: > > http://www.euroscipy.org/conference/euroscipy2010 > > Peter Hi all, The track list for the EuroSciPy 2010 Scientific track has now been announced, and I'm delighted that I will be able to present a talk on Biopython (likely 4pm Saturday 10 July). While I hope there will be some other Biopython users there, this is a nice opportunity to meet the broader scientific python community. There are still places at the moment if you want to attend: http://www.euroscipy.org/conference/euroscipy2010 Unfortunately I will not be attending BOSC or ISMB this year. However Brad Chapman will be there to present the annual "Biopython Project Update" talk (as well as helping to organise this year's BOSC and the associated CodeFest event preceding it). I'd love to have been there too, but I'm sure everyone attending will have a great time. Again, registration is still open: http://www.open-bio.org/wiki/BOSC_2010 http://www.open-bio.org/wiki/Codefest_2010 Regards, Peter P.S. Those of you in North America you might also be interested in the main SciPy conference in Austin, Texas (28 June to 3 July 2010): http://conference.scipy.org/scipy2010/ From jordan.r.willis at Vanderbilt.Edu Mon Jun 7 23:36:05 2010 From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R) Date: Mon, 7 Jun 2010 22:36:05 -0500 Subject: [Biopython] Is this feasible? Message-ID: Hello I'm relatively new to both programming and bioinformatics. I wanted to know if anyone would knew how to do something like this: I have a list of sequences that have evolved away from a given germline sequence. I was going to use biopython to iteratively map the closest mutant to the germline and pull it out of the list. I would then align these two sequences and give a score. 
I would then take the remaining sequences and find the one that is closest to the one that was just taken out of the list and do the same thing until the list is empty. The output would look something like this: Prints to screen: -------------------------------------------------------------------------------------------------------------------- Round one: Seed(germline) Closest scoring sequence --> with a bit score of Round two: Closest scoring sequence to seed Next closest scoring sequence(s) ---> with a bit score of.... ... Round N: Seed Next to last closest scoring sequence Last place sequence(s) ---> with a bit score of.... -------------------------------------------------------------------------------------------------------------------- In a way it's sort of a tractable phylogeny tree but with simpler sequences. def run_blast(command): subprocess.call(str(command), shell=(sys.platform != "win32")) xml_return = 'tmp.xml' return xml_return def main(): Database = [seq_record for seq_record in seqIO.parse('Input.fasta', "fasta")] germline = seqIO.read('germline.fasta') while Database: cline = NcbiblastpCommandline(query=germline, db=Database, out='tmp.xml') blast_records = NCBIXML.parse(run_blast(cline)) print blast_records.alignment[0] print germline print "\n\n\n" germline = blast_records.alignment[0] Database.remove(germline) I guess my first question is: does this seem logical? Is blast the best algorithm to use for this scenario? The other problem is creating my own database. I read the documentation, and it said you could create your own database to run local blast (which I have), I just have no idea how to do that. The second thing is in blast_records.alignment[0], will this always give me the best scoring sequence? Any help would be much appreciated. Thanks for the help, Jordan From chapmanb at 50mail.com Tue Jun 8 10:17:54 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 8 Jun 2010 10:17:54 -0400 Subject: [Biopython] Is this feasible?
In-Reply-To: References: Message-ID: <20100608141754.GE2003@kunkel> Jordan; > Hello I'm relatively new to both programming and bioinformatics. Welcome. Happy to have you on board. > I have a list of sequences that have evolved away from a given > germline sequence. I was going to use biopython to iteratively map the > closest mutant to the germline and pull it out of the list. I would > then align these two sequences and give a score. I would then take the > remaining sequences and find the one that is closest to the one that > was just taken out of the list and do the same thing until the list is > empty. The BLASTing and finding the closest match makes good sense. Why do you need to remove the hit sequences from the reference database? Could sequences in your input list not match to the same gene or different sections of the same gene? This also biases the search depending on your input order. If you do need to BLAST without replacement, the approach I would take is to keep a list of hit IDs you've already used, and avoid using these alignments again when parsing subsequent searches. Some specific thoughts on your pseudocode: > Database = [ seq_record for seq_record in seqIO.parse('Input.fasta', "fasta")] Unless your database is small you probably don't want to do this as it loads the entire file into memory.
Instead you'd want to index this file with formatdb: formatdb -p T -i Input.fasta The subprocess module is the way to go: http://docs.python.org/library/subprocess.html subprocess.call(["formatdb", "-p", "T", "-i", "Input.fasta"]) > germline = seqIO.read('germline.fasta') > while Database: > cline = NcbiblastpCommandline(query=germline, db=Database out='tmp.xml') > blast_records = NCBIXML.parse(run_blast(cline)) > print blast_records.alignment[0] > print germline Here's what you could do for your removal search: keep a list of hits you've seen and pick the first one that is not in that list: hits_seen = [] for query in to_search: blast_rec = _do_your_blast_and_parse_it() new_hit = None for align in blast_rec.alignments: if align.title not in hits_seen: new_hit = align break if new_hit: hits_seen.append(new_hit.title) _output_what_you_want(new_hit) > I guess my first question is does this seem logical. Is blast the > best algorithm to use for this scenario? Hopefully this helps move you forward. > The other problem is > creating my own database. I read the documentation, and it said > you could create your own database to run local blast (which I > have), I just have no idea how do to that. That's the formatdb command listed above. There's plenty to read on the web about commandline options for DNA/protein databases. > The second thing is in > blast_records.alignment[0], will this always give me the best scoring > sequence? Yes, they are ordered by score just like in the raw BLAST file. Hope this helps, Brad From jordan.r.willis at Vanderbilt.Edu Thu Jun 10 18:47:36 2010 From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R) Date: Thu, 10 Jun 2010 17:47:36 -0500 Subject: [Biopython] SeqIO.dict Message-ID: Hello Community. I was wondering if you could convert a dictionary object back into a fasta file. Dictionary = SeqIO.to_dict(SeqIO.parse('my.file', "fasta")) Removed some objects from dictionary....
SeqIO.write(Dictionary, 'my.file', "fasta") This is how I have removed items from my.file but I need to convert it back into a fasta file so it can be read by blast. Thanks. From biopython at maubp.freeserve.co.uk Thu Jun 10 19:33:32 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 11 Jun 2010 00:33:32 +0100 Subject: [Biopython] SeqIO.dict In-Reply-To: References: Message-ID: On Thu, Jun 10, 2010 at 11:47 PM, Willis, Jordan R wrote: > Hello Community. > > I was wondering if you could convert a dictionary object back into a fasta file. > > > Dictionary = SeqIO.to_dict(SeqIO.parse('my.file', "fasta")) > > Removed some objects from dictionary.... > > SeqIO.write(Dictionary, 'my.file', "fasta") > > > This is how I have removed items from my.file but I > need to convert it back into a fasta file so it can be > read by blast. Doing Dictionary.values() will give a list of SeqRecord objects, which you can give to the SeqIO.write(...) function to save to a FASTA file. i.e. Dictionary = SeqIO.to_dict(SeqIO.parse('my.file', "fasta")) #edit dictionary... then: SeqIO.write(Dictionary.values(), "new.fas", "fasta") If you care about the order then it is a little more complicated. Peter From jordan.r.willis at Vanderbilt.Edu Thu Jun 10 22:12:45 2010 From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R) Date: Thu, 10 Jun 2010 21:12:45 -0500 Subject: [Biopython] SeqIO.dict In-Reply-To: Message-ID: Thanks Peter, One last question: During my blast runs, at about midway through it will truncate my sequences... For example: GYTFTNFA ----> Query GY FTNFA ----> Score of: 34.0 GYIFTNFA ----> Template Template becomes the query: GYIFTN ----> Query GYIFTN ----> Score of: 30.0 GYIFTN ----> Template All sequences are the exact same size and I can't figure out which blast parameters would show all 8 amino acids every time. Jordan On 6/10/10 6:33 PM, "Peter" wrote: On Thu, Jun 10, 2010 at 11:47 PM, Willis, Jordan R wrote: > Hello Community.
> > I was wondering if you could convert a dictionary object back into a fasta file. > > > Dictionary = SeqIO.to_dict(SeqIO.parse('my.file', "fasta")) > > Removed some objects from dictionary.... > > SeqIO.write(Dictionary, 'my.file', "fasta") > > > This is how I have removed items from my.file but I > need to convert it back into a fasta file so it can be > read by blast. Doing Dictionary.values() will give a list of SeqRecord objects, which you can give to the SeqIO.write(...) function to save to a FASTA file. i.e. Dictionary = SeqIO.to_dict(SeqIO.parse('my.file', "fasta")) #edit dictionary... then: SeqIO.write(Dictionary.values(), "new.fas", "fasta") If you care about the order then it is a little more complicated. Peter From biopython at maubp.freeserve.co.uk Fri Jun 11 05:14:56 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 11 Jun 2010 10:14:56 +0100 Subject: [Biopython] SeqIO.dict In-Reply-To: References: Message-ID: On Fri, Jun 11, 2010 at 3:12 AM, Willis, Jordan R wrote: > Thanks Peter, > > One last question: > > During my blast runs, at about midway through it will truncate > my sequences... For example: > > GYTFTNFA ----> Query > GY FTNFA ----> Score of: 34.0 > GYIFTNFA ----> Template > Template becomes the query: > GYIFTN ----> Query > GYIFTN ----> Score of: 30.0 > GYIFTN ----> Template > > > All sequences are the exact same size and I can't figure out which > blast parameters would show all 8 amino acids every time. > > Jordan Hi Jordan, I'm sorry but I don't understand what you are trying to describe. Note that BLAST finds local alignments not global alignments (it will not try and align all of your query to a match in the database). Peter From chapmanb at 50mail.com Fri Jun 11 07:16:37 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 11 Jun 2010 07:16:37 -0400 Subject: [Biopython] Is this feasible?
In-Reply-To: References: <20100608141754.GE2003@kunkel> Message-ID: <20100611111637.GA29480@sobchak.mgh.harvard.edu> Jordan; > Hi Brad thanks for the excellent feedback. I understand why you > think I wouldn't need to make a new database every time but here > is the thing. For my first hit, I want to use that sequence as a > query in a new blast run. So it must be removed and converted into a > fasta format to be used as the new input. So I have taken the id ( > blast_record.alignments[0].title) and I am trying to remove it from > input.fasta. Does the fasta parser have a remove feature based on ID? > That would be ideal. > > It would go something like this: > > Germline > match1 > > Next round: > > match1 > match2 > > Next round: > > match2: > match3: > > Where each time the hit becomes the next query. It sounds a bit like you are re-implementing the functionality of PSI-BLAST: http://en.wikipedia.org/wiki/BLAST#Program Have you given psiblast a try and found some type of issue with the approach? You can run and parse PSI-BLAST with Bio.Blast.Applications and the Bio.Blast parsers: http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#htoc93 If you have blast+ installed, you can get all the commandline options with: psiblast -help or with blast2, do: blastpgp - Hope this helps, Brad From sdavis2 at mail.nih.gov Fri Jun 11 10:37:50 2010 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Fri, 11 Jun 2010 10:37:50 -0400 Subject: [Biopython] [OT] Bioconductor Conference Message-ID: Sorry for the slightly off-topic message, but I think there are plenty of overlapping interests between biopython and bioconductor. I wanted to announce the upcoming Bioconductor 2010 conference. Besides the email below, if there are questions that I can answer, let me know. Thanks, Sean ------------------- BioC 2010 is coming up quickly, and we urge you to sign up today! We're meeting July 29-30 (Developer Day July 28) in Seattle. 
There is a terrific line-up of speakers (see below) and the workshops are being finalized (see https://secure.bioconductor.org/BioC2010/labs.php for a preliminary list, subject to change). Also, we have introduced a special 'Flow Cytometry' track. This provides access to the Friday afternoon practical sessions. These sessions will include, among others, practicals relevant to cytometry. Finally, the deadline for scholarship applications is June 15, just a few days away! Questions? Send email to biocworkshop at fhcrc.org Thursday 8:30 - 9:15 Atul Butte, Stanford Center for Biomedical Informatics Research. Exploring Genomic Medicine Through Integrative Bioinformatics. 9:15 - 10:00 Stephen Friend, SAGE Bionetworks. Risks and Opportunities for Disease Models based on Integrative Genomic Approaches. 10:30 - 11:15 Jay Shendure, University of Washington. Exome sequencing and human genetics. 11:15 - 12:00 To be confirmed. Friday 8:30 - 9:15 Simon Tavaré, University of Southern California. 9:15 - 10:00 Paul Flicek, European Bioinformatics Institute. Generation gap: How existing bioinformatics resources are adapting to high-throughput sequencing. 10:30 - 11:15 Lior Pachter, University of California, Berkeley. 11:15 - 12:00 Simon Anders, European Bioinformatics Institute. Inference of differential signal in high throughput sequencing count data. From biopython at maubp.freeserve.co.uk Wed Jun 16 06:04:41 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 16 Jun 2010 11:04:41 +0100 Subject: [Biopython] MutableSeq (reverse) complement methods Message-ID: Hi all, I've been meaning to discuss the following issue for a while - I find this to be an annoying difference between the Seq and MutableSeq objects: The Seq object's (reverse) complement method returns a new Seq object (it has to because we regard the Seq object as read-only). The MutableSeq object's (reverse) complement method instead currently modifies the object in place (and has no return value).
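(Core Python has the same split between the two conventions - list.reverse() mutates in place and returns None, while reversed() leaves the original alone and hands back something new. A toy illustration of why the mismatch bites:)

```python
# In-place convention (what MutableSeq's methods do today):
letters = list("ACGT")
result = letters.reverse()        # mutates letters, returns None
assert result is None
assert letters == list("TGCA")

# Return-a-new-object convention (what Seq's methods do):
letters = list("ACGT")
result = list(reversed(letters))  # letters untouched, new list back
assert letters == list("ACGT")
assert result == list("TGCA")
```

Generic code written for one convention silently does the wrong thing when handed an object following the other.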
Writing general code that expects a sequence object is difficult because this requires a special case for MutableSeq objects. I would therefore like to make the MutableSeq object's complement and reverse_complement methods act like those of the Seq object by default, and return a new object. This discrepancy was the main reason why I didn't add (back) transcribe and translate methods to the MutableSeq object when they were added to the Seq object. So, people who use the MutableSeq object, do you find the in situ complement and reverse_complement methods useful? If so, should we add an optional argument to the methods to control this (e.g. in_place or in_situ)? Via a warning mechanism for a few releases, we can then switch over to the new default behaviour being consistent with the Seq object. If, on the other hand, the in situ (reverse) complement methods are not seen as useful we can handle this with a simple change in behaviour (again with warning messages for a few releases). Peter From reece at berkeley.edu Thu Jun 17 19:13:17 2010 From: reece at berkeley.edu (Reece Hart) Date: Thu, 17 Jun 2010 16:13:17 -0700 Subject: [Biopython] sequence coordinate mapping Message-ID: <4C1AAC0D.5030208@berkeley.edu> Hi All- I'm looking for code in Python (preferably already in BioPython) to map between genomic, CDS, and protein coordinates. For example, map position 7872 of AB026906.1 to position 274 in the CDS and pos 92 in the protein. It's not difficult and I've already written a crude version, but I'm a little surprised that it's not there and I don't want to reinvent. I'm looking for something akin to Bio::Coordinate::GeneMapper, for those from BioPerl.
Thanks, Reece From biopython at maubp.freeserve.co.uk Fri Jun 18 06:01:56 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 18 Jun 2010 11:01:56 +0100 Subject: [Biopython] sequence coordinate mapping In-Reply-To: <4C1AAC0D.5030208@berkeley.edu> References: <4C1AAC0D.5030208@berkeley.edu> Message-ID: On Fri, Jun 18, 2010 at 12:13 AM, Reece Hart wrote: > Hi All- > > I'm looking for code in Python (preferably already in BioPython) to map > between genomic, CDS, and protein coordinates. For example, map position > 7872 of AB026906.1 to position 274 in the CDS and pos 92 in the protein. > > It's not difficult and I've already written a crude version, but I'm a > little surprised that it's not there and I don't want to reinvent. > > I'm looking for something akin to Bio::Coordinate::GeneMapper, for those > from BioPerl. > > Thanks, > Reece The Bio::Coordinate::GeneMapper stuff looks quite complicated just from the documentation - maybe I'm looking in the wrong place but some examples would help to understand the full scope of it. There isn't anything quite like this built into Biopython at the moment. Your question also sounds hard in general. What about where a single base on the genome maps to multiple genes (overlapping genes are common in bacteria and viruses). What about where a single base on the genome maps to an intron in a gene - would you want any values back? What about where a gene has a fuzzy boundary? What about a ribosomal slippage where a single bp ends up coding for two residues in the protein? It can be broken down into two steps: (1) finding a list of features covering a position on the genome, (2) for a CDS feature getting the amino acid position (which would require looking for the codon start position if specified in the annotation). Just thinking out loud, implementing "in" and/or sorting on our FeatureLocation (and perhaps SeqFeature) objects (i.e. 
implement the special __contains__ method, __lt__ method etc) could be useful syntactic sugar for this kind of work. Peter From biopython at maubp.freeserve.co.uk Fri Jun 18 08:00:04 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 18 Jun 2010 13:00:04 +0100 Subject: [Biopython] sequence coordinate mapping In-Reply-To: References: <4C1AAC0D.5030208@berkeley.edu> Message-ID: On Fri, Jun 18, 2010 at 11:01 AM, Peter wrote: > > Just thinking out loud, implementing "in" and/or sorting on our > FeatureLocation (and perhaps SeqFeature) objects (i.e. implement > the special __contains__ method, __lt__ method etc) could be > useful syntactic sugar for this kind of work. > Something like this? This implements __contains__ on the SeqFeature so that you can check if a simple location (integer) is within a feature. http://github.com/peterjc/biopython/tree/feature-in There is a docstring with examples, just look at the diff here: http://github.com/peterjc/biopython/commit/83c44e8f6ee62a9c5855b603cb3c080d367e23d6 Would that go part way to solving your task? Peter From chapmanb at 50mail.com Fri Jun 18 08:58:03 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 18 Jun 2010 08:58:03 -0400 Subject: [Biopython] sequence coordinate mapping In-Reply-To: References: <4C1AAC0D.5030208@berkeley.edu> Message-ID: <20100618125803.GV3415@sobchak.mgh.harvard.edu> Reece and Peter; > > I'm looking for code in Python (preferably already in BioPython) to map > > between genomic, CDS, and protein coordinates. For example, map position > > 7872 of AB026906.1 to position 274 in the CDS and pos 92 in the protein. > > > > It's not difficult and I've already written a crude version, but I'm a > > little surprised that it's not there and I don't want to reinvent. > > > > I'm looking for something akin to Bio::Coordinate::GeneMapper, for those > > from BioPerl. A general implementation like this would be really useful.
Here is a stab I took at the problem a while back: http://bitbucket.org/chapmanb/synbio/src/tip/SynBio/Codons/CodingRegion.py This represents a sequence as a set of Exon/Intron objects and then lets you manipulate directly on the coding region. Here's a representation inspired by that for SNP calling: http://github.com/chapmanb/bcbb/blob/master/biopython/CodingRegion.py The tricky part of this problem is getting exon/intron coordinates parsed and in the right format. I'm not sure I ever really got this correct, but hopefully those implementations help. > Your question also sounds hard in general. What about where a > single base on the genome maps to multiple genes (overlapping > genes are common in bacteria and viruses). What about where > a single base on the genome maps to an intron in a gene - would > you want any values back? What about where a gene has a fuzzy > boundary? What about a ribosomal slippage where a single bp > ends up coding for two residues in the protein? You'd want to catch and raise errors in pathological cases, but this would be useful for the standard cases. If the target is SNP calling, you'd want to be able to have a genomic coordinate and find out if it's in a gene; if so, is it in a coding region?; if so, what is the protein at that position? It is handy to be able to pull out each of those representations so you can ask questions about the location in the coding sequence or amino acid change caused by a SNP. > Something like this? This implements __contains__ on the SeqFeature > so that you can check if a simple location (integer) is within a feature. > http://github.com/peterjc/biopython/tree/feature-in > > There is a docstring with examples, just look at the diff here: > http://github.com/peterjc/biopython/commit/83c44e8f6ee62a9c5855b603cb3c080d367e23d6 That's nice. The next part would be remapping the coordinates so once you have the feature you can easily address the relative position you are interested in. 
Brad From biopython at maubp.freeserve.co.uk Fri Jun 18 09:39:04 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 18 Jun 2010 14:39:04 +0100 Subject: [Biopython] sequence coordinate mapping In-Reply-To: <20100618125803.GV3415@sobchak.mgh.harvard.edu> References: <4C1AAC0D.5030208@berkeley.edu> <20100618125803.GV3415@sobchak.mgh.harvard.edu> Message-ID: On Fri, Jun 18, 2010 at 1:58 PM, Brad Chapman wrote: > Reece and Peter; > > Peter wrote: >> Something like this? This implements __contains__ on the SeqFeature >> so that you can check if a simple location (integer) is within a feature. >> http://github.com/peterjc/biopython/tree/feature-in >> >> There is a docstring with examples, just look at the diff here: >> http://github.com/peterjc/biopython/commit/83c44e8f6ee62a9c5855b603cb3c080d367e23d6 > > That's nice. Nice enough to be worth committing in its own right? > The next part would be remapping the coordinates so > once you have the feature you can easily address the relative > position you are interested in. Perhaps one approach would be to do this in the SeqFeature. If we define a SeqFeature's length in the natural way, then we have len(SeqFeature) == len(SeqFeature.extract(parent_seq)). Now we have two coordinate systems, 0 to len(SeqFeature) and the regions it describes on the parent sequence. Then we could discuss a pair of methods on the SeqFeature for converting between the two coordinate systems. Once you have that, the special case of amino acid coordinates is much easier to do (account for where the start codon is, divide by three).
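As a rough illustration of such a conversion pair (the helper names are invented, and plain (start, end) tuples stand in for the real exon/sub-feature locations on the parent sequence):

```python
def local_to_parent(exons, local):
    """Map a 0-based feature-local coordinate to a parent coordinate."""
    for start, end in exons:
        if local < end - start:
            return start + local
        local -= end - start  # move on past this exon
    raise ValueError("local coordinate out of range")

def parent_to_local(exons, parent):
    """Map a parent coordinate back to a 0-based feature-local coordinate."""
    offset = 0
    for start, end in exons:
        if start <= parent < end:
            return offset + (parent - start)
        offset += end - start
    raise ValueError("parent coordinate not inside the feature")

exons = [(10, 16), (20, 26)]               # two exons on the parent
assert sum(e - s for s, e in exons) == 12  # plays the role of len(SeqFeature)
assert local_to_parent(exons, 7) == 21
assert parent_to_local(exons, 21) == 7
```

The reverse direction is where the pathological cases would surface: overlapping sub-features give two valid answers for one parent position, so the sketch above just returns the first match.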
I've made another commit on the __contains__ branch to also implement __len__ for the SeqFeature: http://github.com/peterjc/biopython/commit/74b264acacd228d64859d28d75e2c30a8030d03f Peter From cjfields at illinois.edu Fri Jun 18 10:08:21 2010 From: cjfields at illinois.edu (Chris Fields) Date: Fri, 18 Jun 2010 09:08:21 -0500 Subject: [Biopython] sequence coordinate mapping In-Reply-To: References: <4C1AAC0D.5030208@berkeley.edu> <20100618125803.GV3415@sobchak.mgh.harvard.edu> Message-ID: <051CA3A3-9ECB-40B7-B29C-BDEC69C450C1@illinois.edu> On Jun 18, 2010, at 8:39 AM, Peter wrote: > On Fri, Jun 18, 2010 at 1:58 PM, Brad Chapman wrote: >> Reece and Peter; >> >> Peter wrote: >>> Something like this? This implements __contains__ on the SeqFeature >>> so that you can check if a simple location (integer) is within a feature. >>> http://github.com/peterjc/biopython/tree/feature-in >>> >>> There is a docstring with examples, just look at the diff here: >>> http://github.com/peterjc/biopython/commit/83c44e8f6ee62a9c5855b603cb3c080d367e23d6 >> >> That's nice. > > Nice enough to be worth committing in its own right? > >> The next part would be remapping the coordinates so >> once you have the feature you can easily address the relative >> position you are interested in. > > Perhaps one approach would be to do this in the SeqFeature. If we > define a SeqFeature's length in the natural way, then we have > len(SeqFeature) == len(SeqFeature.extract(parent_seq)). > Now we have two coordinates systems, 0 to len(SeqFeature) and > the regions it describes on the parent sequence. Then we could > discuss a pair of methods on the SeqFeature for converting > between the two coordinate systems. Once you have that, the > special case of amino acid coordinates is much easier to do > (account for where the start codon is, divide by three). 
> > I've made another commit on the __contains__ branch to > also implement __len__ for the SeqFeature: > http://github.com/peterjc/biopython/commit/74b264acacd228d64859d28d75e2c30a8030d03f > > Peter We essentially do this with Bioperl features, locations, and ranges (in fact, the coordinate system previously mentioned uses these). Basically, anything that is-a Range can be compared to anything else that is-a Range (this has bitten us as well :). Beyond the module documentation, the test suite has a bit more on it, and Aaron Mackey has a presentation up on slideshare that touches upon the Bio::Coordinate implementation: http://www.slideshare.net/bosc_2008/mackey-bio-perl-bosc2008 chris From biopython at maubp.freeserve.co.uk Fri Jun 18 14:10:24 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 18 Jun 2010 19:10:24 +0100 Subject: [Biopython] sequence coordinate mapping In-Reply-To: <051CA3A3-9ECB-40B7-B29C-BDEC69C450C1@illinois.edu> References: <4C1AAC0D.5030208@berkeley.edu> <20100618125803.GV3415@sobchak.mgh.harvard.edu> <051CA3A3-9ECB-40B7-B29C-BDEC69C450C1@illinois.edu> Message-ID: On Fri, Jun 18, 2010 at 3:08 PM, Chris Fields wrote: > On Jun 18, 2010, at 8:39 AM, Peter wrote: >> Perhaps one approach would be to do this in the SeqFeature. If we >> define a SeqFeature's length in the natural way, then we have >> len(SeqFeature) == len(SeqFeature.extract(parent_seq)). >> Now we have two coordinates systems, 0 to len(SeqFeature) and >> the regions it describes on the parent sequence. Then we could >> discuss a pair of methods on the SeqFeature for converting >> between the two coordinate systems. Once you have that, the >> special case of amino acid coordinates is much easier to do >> (account for where the start codon is, divide by three).
>> >> I've made another commit on the __contains__ branch to >> also implement __len__ for the SeqFeature: >> http://github.com/peterjc/biopython/commit/74b264acacd228d64859d28d75e2c30a8030d03f >> >> Peter > > We essentially do this with Bioperl features, locations, and ranges > (in fact, the coordinate system previously mentioned use these). > Basically, anything that is-a Range can be compared to anything else > that is-a Range (this has bitten us a well :). Beyond the module > documentation the test suite has a bit more on it, and Aaron Mackey > has a presentation up on slideshare that touches upon the > Bio::Coordinate implementation: > > http://www.slideshare.net/bosc_2008/mackey-bio-perl-bosc2008 Thanks for the link - I didn't see much in the presentation that I hadn't seen in the documentation though. I guess the BioPerl unit tests would be worth checking out. Thanks Chris. Back in Biopython land (where we seem to have adopted similar but different names like locations and positions), I had a go at doing one of the mappings - from feature coordinates to parent coordinates (e.g. CDS back to genome, or PFAM domain back to protein if your SeqRecord is for a protein sequence): http://github.com/peterjc/biopython/tree/feature-coords As you can tell from the rather lengthy docstring and doctests, this is quite hairy and difficult to explain. The way Python sub-setting works also complicates how to translate the break point between two subfeatures (exons), since you may want to use the number as the end of the first exon or as the start of the second exon. I've implemented the mapping so that single letter access works as expected: http://github.com/peterjc/biopython/tree/74b264acacd228d64859d28d75e2c30a8030d03f I'm pretty sure we can do the reverse mapping from a parent sequence coordinate to a feature coordinate, although I can come up with pathological examples where this is not a one to one mapping, but one to many. e.g.
a ribosomal slippage where a base gets used twice. In this case we could raise an error, or maybe more simply take the first match. I'm not convinced about adding these methods just yet - but the relatively simple work to support "len(feature)" and "x in feature" looks like a useful addition. Peter From reece at berkeley.edu Fri Jun 18 14:00:20 2010 From: reece at berkeley.edu (Reece Hart) Date: Fri, 18 Jun 2010 11:00:20 -0700 Subject: [Biopython] sequence coordinate mapping In-Reply-To: <051CA3A3-9ECB-40B7-B29C-BDEC69C450C1@illinois.edu> References: <4C1AAC0D.5030208@berkeley.edu> <20100618125803.GV3415@sobchak.mgh.harvard.edu> <051CA3A3-9ECB-40B7-B29C-BDEC69C450C1@illinois.edu> Message-ID: <4C1BB434.5040103@berkeley.edu> Thanks, all, for feedback. I'm still digesting some of the previous comments. For the purposes of discussion, I've attached the crude (pre-crude, even) implementation that I mentioned. Caveats/ToDos: * The interface is sufficient for my needs, but for a large number of CDS subfeatures, it might make sense to change the implementation to use an index rather than a linear search. * I ignore strand for the moment. * I don't use SeqFeature.AbstractPosition and friends. -Reece -------------- next part -------------- A non-text attachment was scrubbed... Name: CoordinateMapper.py Type: text/x-python Size: 2501 bytes Desc: not available URL: From biopython at maubp.freeserve.co.uk Fri Jun 18 14:19:59 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 18 Jun 2010 19:19:59 +0100 Subject: [Biopython] sequence coordinate mapping In-Reply-To: <4C1BB434.5040103@berkeley.edu> References: <4C1AAC0D.5030208@berkeley.edu> <20100618125803.GV3415@sobchak.mgh.harvard.edu> <051CA3A3-9ECB-40B7-B29C-BDEC69C450C1@illinois.edu> <4C1BB434.5040103@berkeley.edu> Message-ID: On Fri, Jun 18, 2010 at 7:00 PM, Reece Hart wrote: > Thanks, all, for feedback. I'm still digesting some of the previous > comments.
For the purposes of discussion, I've attached the crude > (pre-crude, even) implementation that I mentioned. Thanks > Caveats/ToDos: > * The interface is sufficient for my needs, but for a large number of CDS > subfeatures, it might make sense to change the implementation index > rather than linear search. It looks like the core idea you are using is the same - loop over the exons (subfeatures) to keep track of where you are. > * I ignore strand for the moment. That makes life a bit more fun! I haven't tested my code on mixed strand features yet (e.g. some crazy tRNA annotation I've seen). > * I don't use SeqFeature.AbstractPosition and friends. Unfortunately they crop up in lots of real world GenBank/EMBL files, so anything we add to the SeqFeature object has to cope with them. Things like GFF3 files avoid this of course. Peter From biopython at maubp.freeserve.co.uk Mon Jun 21 13:59:59 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 21 Jun 2010 18:59:59 +0100 Subject: [Biopython] sequence coordinate mapping In-Reply-To: References: <4C1AAC0D.5030208@berkeley.edu> <20100618125803.GV3415@sobchak.mgh.harvard.edu> <051CA3A3-9ECB-40B7-B29C-BDEC69C450C1@illinois.edu> Message-ID: On Fri, Jun 18, 2010 at 7:10 PM, Peter wrote: > On Fri, Jun 18, 2010 at 3:08 PM, Chris Fields wrote: >> On Jun 18, 2010, at 8:39 AM, Peter wrote: >>> Perhaps one approach would be to do this in the SeqFeature. If we >>> define a SeqFeature's length in the natural way, then we have >>> len(SeqFeature) == len(SeqFeature.extract(parent_seq)). >>> Now we have two coordinates systems, 0 to len(SeqFeature) and >>> the regions it describes on the parent sequence. Then we could >>> discuss a pair of methods on the SeqFeature for converting >>> between the two coordinate systems. Once you have that, the >>> special case of amino acid coordinates is much easier to do >>> (account for where the start codon is, divide by three).
>>> >>> I've made another commit on the __contains__ branch to >>> also implement __len__ for the SeqFeature: >>> http://github.com/peterjc/biopython/commit/74b264acacd228d64859d28d75e2c30a8030d03f Note I found an off-by-one error with the end point, fixed now: http://github.com/peterjc/biopython/commit/dc18df0c5d9cc824ddd31c96d59ee2bf9c5c7fc2 With a few more unit tests, I think that could be merged to the trunk. > ... I had a go at > doing one of the mappings - from feature coordinates to parent > coordinates (e.g. CDS back to genome, or PFAM domain back > to protein if your SeqRecord is for a protein sequence): > > http://github.com/peterjc/biopython/tree/feature-coords > > As you can tell from the rather lengthy docstring and doctests, > this is quite hairy and difficult to explain. ... > > I'm pretty sure we can do the reverse mapping from a > parent sequence coordinate to a feature coordinate, > although I can come up with pathological examples where > this is not a one to one mapping, but one to many. e.g. a > ribosomal slippage where a base gets used twice. In this > case we could raise an error, or maybe more simply take > the first match. This second branch now implements two methods for mapping between feature coordinates and the parent sequence coordinates. http://github.com/peterjc/biopython/tree/feature-coords In the case where due to overlapping sub-features a parent letter has more than one possible feature coordinate, this returns the lowest feature coordinate. This is slightly faster since we don't have to check all the sub-features. However, perhaps doing so and raising an exception is preferable to avoid silent errors in this corner case? Note this does not handle the third case of amino acid coordinates (which only applies where the parent sequence is nucleotides and the feature is something like a CDS or mature peptide entry).
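That third step is mostly integer arithmetic once a CDS-local coordinate is in hand; a sketch (the helper is hypothetical, assuming a GenBank-style codon_start qualifier of 1, 2 or 3):

```python
def cds_to_protein(cds_pos, codon_start=1):
    """Map a 0-based position within a CDS to a 0-based amino acid index."""
    offset = codon_start - 1  # bases before the first complete codon
    if cds_pos < offset:
        raise ValueError("position falls before the first complete codon")
    return (cds_pos - offset) // 3

# Reece's example upthread: CDS position 274 (1-based) -> protein 92 (1-based)
assert cds_to_protein(274 - 1) + 1 == 92
assert cds_to_protein(4, codon_start=2) == 1  # first base of the 2nd codon
```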
Peter From jtomkins at ICR.org Tue Jun 22 15:55:52 2010 From: jtomkins at ICR.org (Jeff Tomkins) Date: Tue, 22 Jun 2010 14:55:52 -0500 Subject: [Biopython] Cogent package Message-ID: I have been working with the biopython package for about the past year (formerly worked with perl and dabbled in bioperl a bit) and somehow the pycogent package didn't even cross my radar. I just discovered it by accident in a code search on the web. So how does pycogent relate to and compare with biopython? It seems that the most recent release of cogent (1.4.1) is fairly mature and contains a large amount and diversity of code. -jeff From lgautier at gmail.com Tue Jun 22 16:49:41 2010 From: lgautier at gmail.com (Laurent) Date: Tue, 22 Jun 2010 22:49:41 +0200 Subject: [Biopython] sequence coordinate mapping In-Reply-To: References: Message-ID: <4C2121E5.2080203@gmail.com> On 22/06/10 18:00, biopython-request at lists.open-bio.org wrote: > On Fri, Jun 18, 2010 at 7:10 PM, Peter wrote: >> On Fri, Jun 18, 2010 at 3:08 PM, Chris Fields wrote: >>> On Jun 18, 2010, at 8:39 AM, Peter wrote: >>>> Perhaps one approach would be to do this in the SeqFeature. If we >>>> define a SeqFeature's length in the natural way, then we have >>>> len(SeqFeature) == len(SeqFeature.extract(parent_seq)). >>>> Now we have two coordinates systems, 0 to len(SeqFeature) and >>>> the regions it describes on the parent sequence. Then we could >>>> discuss a pair of methods on the SeqFeature for converting >>>> between the two coordinate systems. Once you have that, the >>>> special case of amino acid coordinates is much easier to do >>>> (account for where the start codon is, divide by three).
>>>> >>>> I've made another commit on the __contains__ branch to >>>> also implement __len__ for the SeqFeature: >>>> http://github.com/peterjc/biopython/commit/74b264acacd228d64859d28d75e2c30a8030d03f > > Note I found an off by one error with the end point, fixed now: > > http://github.com/peterjc/biopython/commit/dc18df0c5d9cc824ddd31c96d59ee2bf9c5c7fc2 > > With a few more unit tests, I think that could be merged to the trunk. > >> ... I had a go at >> doing one of the mappings - from feature coordinates to parent >> coordinates (e.g. CDS back to genome, or PFAM domain back >> to protein if your SeqRecord is for a protein sequence): >> >> http://github.com/peterjc/biopython/tree/feature-coords >> >> As you can tell from the rather lengthy docstring and doctests, >> this is quite hairy and difficult to explain. ... >> >> I'm pretty sure we can do the reverse mapping from a >> parent sequence coordinate to a feature coordinate, >> although I can come up with pathological examples where >> this is not a one to one mapping, but one to many. e.g. a >> ribosomal slippage where a base gets used twice. In this >> case we could raise an error, or maybe more simply take >> the first match. > > This second branch now implements two methods for mapping > between feature coordinates and the parent sequence coordinates. > > http://github.com/peterjc/biopython/tree/feature-coords > > In the case where due to overlapping sub-features a parent letter > has more than one possible feature coordindate, this returns the > lowest feature coordinate. This is slightly faster since we don't > have to check all the sub-features. However, perhaps doing so > and raising an exception is preferable to avoid silent errors in > this corner case? Exception is better. In the worst case raising an exception will take a split second to fix, while silent logic twists can in the best case take time and frustration to find and fix (in the worst case it can lead to wrong results undetected). 
> Note this does not handle the third case of amino acid coordinates > (which only applies where the parent sequence is nucleotides and > the feature is something like a CDS or mature peptide entry). Also, I followed that distantly but wouldn't it make sense to abstract everything into a system of nested relative coordinates? I guess that putting a bit of code together would be the easiest way to demonstrate what I have in mind (obviously after checking that other packages around do not already have something similar, and after I set some time aside for that). L. > Peter > > ------------------------------ > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > > End of Biopython Digest, Vol 90, Issue 15 > ***************************************** From biopython at maubp.freeserve.co.uk Wed Jun 23 05:16:38 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 23 Jun 2010 10:16:38 +0100 Subject: [Biopython] sequence coordinate mapping In-Reply-To: References: <4C1AAC0D.5030208@berkeley.edu> <20100618125803.GV3415@sobchak.mgh.harvard.edu> <051CA3A3-9ECB-40B7-B29C-BDEC69C450C1@illinois.edu> <4C1BB434.5040103@berkeley.edu> Message-ID: On Fri, Jun 18, 2010 at 7:19 PM, Peter wrote: > On Fri, Jun 18, 2010 at 7:00 PM, Reece Hart wrote: >> Thanks, all, for feedback. I'm still digesting some of the previous >> comments. For the purposes of discussion, I've attached the crude >> (pre-crude, even) implementation that I mentioned. > > Thanks > >> Caveats/ToDos: >> * The interface is sufficient for my needs, but for a large number of CDS >> subfeatures, it might make sense to change the implementation index >> rather than linear search. > > It looks like the core idea you are using is the same - loop over the exons > (subfeatures) to keep track of where you are. > >> * I ignore strand for the moment. > > That makes like a bit more fun!
I haven't tested my code on mixed > strand features yet (e.g. some crazy tRNA annotation I've seen). > >> * I don't use SeqFeature.AbstractPosition and friends. > > Unfortunately they crop up in lots of real world GenBank/EMBL files, > so anything we add to the SeqFeature object has to cope with them. > Things like GFF3 files avoid this of course. I should also point out that if accessing location positions in your own code, using nofuzzy_start and nofuzzy_end is better since they give the appropriate integer values. i.e. change this: sf.location.start.position,sf.location.end.position to: sf.location.nofuzzy_start,sf.location.nofuzzy_end That should then take care of the fuzzy locations as best as possible. Peter From chapmanb at 50mail.com Wed Jun 23 09:13:20 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 23 Jun 2010 09:13:20 -0400 Subject: [Biopython] Cogent package In-Reply-To: References: Message-ID: <20100623131320.GI26392@sobchak.mgh.harvard.edu> Jeff; > I have been working with with the biopython package for about the > past year (formerly worked with perl and dabbled in bioperl a bit) > and somehow the pycogent package didn?t even cross my radar. I just > discovered it by accident in a code search on the web. So how does > pycogent relate to and compare with biopython? It seems that the most > recent release of cogent (1.4.1) is fairly mature and contains a large > amount and diversity of code. PyCogent is definitely a useful library to have in your toolbelt. It focuses a bit more on evolutionary and phylogenetic work, so will give you extra functionality if you're working in those areas. We'd hoped to develop formal interoperability with PyCogent, and proposed a summer of code project along those lines: http://biopython.org/wiki/Google_Summer_of_Code#Biopython_and_PyCogent_interoperability but unfortunately didn't get funded this year. 
A few other useful Python biology projects are: bx-python: http://bitbucket.org/james_taylor/bx-python/wiki/Home DendroPy: http://packages.python.org/DendroPy/ Pygr: http://code.google.com/p/pygr/ It would be cool to see the Python bioinformatics community develop documentation and examples of using multiple toolkits together. Brad From biopython at maubp.freeserve.co.uk Wed Jun 23 10:00:26 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 23 Jun 2010 15:00:26 +0100 Subject: [Biopython] sequence coordinate mapping In-Reply-To: <4C2121E5.2080203@gmail.com> References: <4C2121E5.2080203@gmail.com> Message-ID: On Tue, Jun 22, 2010 at 9:49 PM, Laurent wrote: > > Peter wrote: >> >> This second branch now implements two methods for mapping >> between feature coordinates and the parent sequence coordinates. >> >> http://github.com/peterjc/biopython/tree/feature-coords >> >> In the case where due to overlapping sub-features a parent letter >> has more than one possible feature coordindate, this returns the >> lowest feature coordinate. This is slightly faster since we don't >> have to check all the sub-features. However, perhaps doing so >> and raising an exception is preferable to avoid silent errors in >> this corner case? > > Exception is better. > > In the worst case raising an exception will take a split second to fix, > while silent logic twists can in the best case take time and frustration to > find and fix (in the worst case it can lead to wrong results undetected). Agreed. I've made get_local_coord give an exception now for ambiguous mappings, and introduced get_local_coords (with a trailing s for plural) which gives a list of the local coordinates. That seems to cover the typical case nicely and makes dealing with the special case fairly easy. 
http://github.com/peterjc/biopython/tree/feature-coords >> Note this does not handle the third case of amino acid coordinates >> (which only applies where the parent sequence is nucleotides and >> the feature is something like a CDS or mature peptide entry). > > Also, I followed that distantly but wouldn't it make sense to abstract > everything into a system of nested relative coordinates ? > I guess that putting a bit of code together would be the easiest to > demonstrate what I have in mind (obviously after checking that other > packages around do not already have something similar, and after > I spare some time aside for that). That might be a good idea (I'm not quite sure what you are suggesting). Another related problem is going from gapped to ungapped coordinates (also described as padded and unpadded) when working with sequence alignments. Peter From pmr at ebi.ac.uk Wed Jun 23 12:50:38 2010 From: pmr at ebi.ac.uk (Peter Rice) Date: Wed, 23 Jun 2010 17:50:38 +0100 Subject: [Biopython] sequence coordinate mapping In-Reply-To: <051CA3A3-9ECB-40B7-B29C-BDEC69C450C1@illinois.edu> References: <4C1AAC0D.5030208@berkeley.edu> <20100618125803.GV3415@sobchak.mgh.harvard.edu> <051CA3A3-9ECB-40B7-B29C-BDEC69C450C1@illinois.edu> Message-ID: <4C223B5E.4080501@ebi.ac.uk> I have been following the discussion with interest. This is something we also want to implement in EMBOSS soon after the next release when we seriously tackle mapping and large alignments. It would be very nice to have a common approach across the Open-Bio projects. I will be at BOSC and ISMB in Boston so perhaps some of us can get together there and compare notes.
regards, Peter Rice From lthiberiol at gmail.com Wed Jun 23 17:09:31 2010 From: lthiberiol at gmail.com (Luiz Thiberio Rangel) Date: Wed, 23 Jun 2010 18:09:31 -0300 Subject: [Biopython] blast+ gilist Message-ID: Hi you all, I know this isn't the right place to ask it, but has anyone ever used the gilist parameter on blast+ or "-l" on blastall? I am trying to use it but it never works, can you show me some examples of what the gilist file should look like? best regards, -- Luiz Thibério Rangel From biopython at maubp.freeserve.co.uk Thu Jun 24 04:36:50 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 24 Jun 2010 09:36:50 +0100 Subject: [Biopython] sequence coordinate mapping In-Reply-To: <4C223B5E.4080501@ebi.ac.uk> References: <4C1AAC0D.5030208@berkeley.edu> <20100618125803.GV3415@sobchak.mgh.harvard.edu> <051CA3A3-9ECB-40B7-B29C-BDEC69C450C1@illinois.edu> <4C223B5E.4080501@ebi.ac.uk> Message-ID: On Wed, Jun 23, 2010 at 5:50 PM, Peter Rice wrote: > I have been following the discussion with interest. It's nice to have BioPerl and EMBOSS folk on the mailing list :) > This is something we > also want to implement in EMBOSS soon after the next release when we > seriously tackle mapping and large alignments. Are you thinking beyond the simple feature mapping which I've had in mind here (e.g. in GenBank or EMBL files)? > It would be very nice to have a common approach across the Open-Bio > projects. I will be at BOSC and ISMB in Boston so perhaps some of us > can get together there and compare notes. Sadly I won't be at the Boston BOSC/ISMB 2010, but Brad and others will be. Maybe next time I visit the Sanger Centre I'll try and drop by and visit you (Peter R) at the EBI? Regards, Peter C.
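On the gilist question above: as far as I can tell, the plain-text GI list is simply one NCBI GI number per line with no header (the file name, database, and GI numbers below are invented for illustration):

```shell
# A GI list file: one GI number per line, nothing else.
printf '15834432\n160332366\n' > my_gis.txt

# Legacy blastall (sketch, not run here):
#   blastall -p blastn -d nr -i query.fasta -l my_gis.txt
# BLAST+ equivalent:
#   blastn -db nr -query query.fasta -gilist my_gis.txt
```

If the restriction will be reused often, BLAST+ also ships a blastdb_aliastool that can, I believe, convert such a text list into a faster binary form.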
From pmr at ebi.ac.uk Thu Jun 24 04:47:34 2010 From: pmr at ebi.ac.uk (Peter Rice) Date: Thu, 24 Jun 2010 09:47:34 +0100 Subject: [Biopython] sequence coordinate mapping In-Reply-To: References: <4C1AAC0D.5030208@berkeley.edu> <20100618125803.GV3415@sobchak.mgh.harvard.edu> <051CA3A3-9ECB-40B7-B29C-BDEC69C450C1@illinois.edu> <4C223B5E.4080501@ebi.ac.uk> Message-ID: <4C231BA6.1090105@ebi.ac.uk> On 24/06/2010 09:36, Peter wrote: > On Wed, Jun 23, 2010 at 5:50 PM, Peter Rice wrote: >> I have been following the discussion with interest. > > It's nice to have BioPerl and EMBOSS folk on the mailing list :) > >> This is something we >> also want to implement in EMBOSS soon after the next release when we >> seriously tackle mapping and large alignments. > > Are you thinking beyond the simple feature mapping which I've had in > mind here (e.g. in GenBank or EMBL files)? Well, EMBOSS internals are identical for analysis results and EMBL/GenBank features so we would hope to cover anything we might want to do. A big effort after this release will include mapping to coordinate systems (especially reference sequences) so we could align an annotated sequence (e.g. an EMBL/GenBank entry) to a reference and aim to transfer the features, or to map features from the reference (using DAS or some similar protocol to extract just the region of interest) on to the user's own sequence. Anything that fails to map completely can be annotated e.g. with end or /note="some explanation" The naming of the reference sequences is also important so the mapping could hopefully be reversible. > Sadly I won't be at the Boston BOSC/ISMB 2010, but Brad and others > will be. Maybe next time I visit the Sanger Centre I'll try and drop by and > visit you (Peter R) at the EBI? Great, let me know when you can drop in. Always good to see you.
regards, Peter From biopython at maubp.freeserve.co.uk Thu Jun 24 13:32:36 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 24 Jun 2010 18:32:36 +0100 Subject: [Biopython] sequence coordinate mapping In-Reply-To: References: <4C2121E5.2080203@gmail.com> Message-ID: On Wed, Jun 23, 2010 at 3:00 PM, Peter wrote: >> >> In the worst case raising an exception will take a split second to fix, >> while silent logic twists can in the best case take time and frustration to >> find and fix (in the worst case it can lead to wrong results undetected). > > Agreed. > > I've made get_local_coord give an exception now for ambiguous > mappings, and introduced get_local_coords (with a trailing s for > plural) which gives a list of the local coordinates. That seems to > cover the typical case nicely and makes dealing with the special > case fairly easy. > > http://github.com/peterjc/biopython/tree/feature-coords I've just added an __iter__ method which gives the parent coordinates for each position in the feature (in the order of the local coordinates). I expect that would be useful for something... Peter From j.reid at mail.cryst.bbk.ac.uk Fri Jun 25 12:28:38 2010 From: j.reid at mail.cryst.bbk.ac.uk (John Reid) Date: Fri, 25 Jun 2010 17:28:38 +0100 Subject: [Biopython] Exon/intron locations for drosophila R4 Message-ID: Hi, What's the easiest way to get exon locations for release 4 of the melanogaster genome into my python script? I have been using a DAS interface to UCSC but it doesn't seem to supply this info. Thanks, John. From ratlaw at gmail.com Fri Jun 25 12:59:27 2010 From: ratlaw at gmail.com (Walter Scheper) Date: Fri, 25 Jun 2010 12:59:27 -0400 Subject: [Biopython] Problem building from PyPi: Affy package directory is missing Message-ID: Hey all, I generally install biopython either using easy_install or pip (if I need to enforce a particular version). 
Today I was rebuilding the dependencies for a project I'm working on, and when I went to build biopython I got the following error: error: package directory 'Bio/Affy' does not exist Looking in the Bio/ directory, sure enough there is no Affy/ directory. If I download the tarball available from www.biopython.org, then the Bio/Affy package directory does exist. Is there some issue with distributing the Affy submodule, or is the PyPi tarball incorrectly built? I'm fairly curious why I haven't seen any mention of this issue on the list, so perhaps there is something fishy with my process. Thanks, Walter Scheper From chapmanb at 50mail.com Fri Jun 25 14:32:59 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 25 Jun 2010 14:32:59 -0400 Subject: [Biopython] Problem building from PyPi: Affy package directory is missing In-Reply-To: References: Message-ID: <20100625183259.GM1227@sobchak.mgh.harvard.edu> Walter; > I generally install biopython either using easy_install or pip (if > I need to enforce a particular version). Today I was rebuilding the > dependencies for a project I'm working on and when I went to build > biopython I got the following error: > > error: package directory 'Bio/Affy' does not exist Thanks for the heads up. There were a couple of compounding issues that led to the easy_install problem. Normally, pypi pulls the source files directly from the Biopython server. However, it looks like a spammer added an index.html in biopython.org/DIST, which forced pypi to use the uploaded .tar.gz as a last resort. This file was missing the Bio/Affy directories and other information because the main biopython distribution didn't have the MANIFEST.in. Whew, so in summary: - Things are fixed now on all ends. The pypi upload is now correct, you can access the biopython files again, and the MANIFEST.in was updated so we shouldn't see this again in the future.
- You are apparently the first lucky person to see the problem since the biopython.org/DIST problem appears to be brand new. Thanks again. Let us know if you have any other problems, Brad From biopython at maubp.freeserve.co.uk Fri Jun 25 15:41:17 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 25 Jun 2010 20:41:17 +0100 Subject: [Biopython] Problem building from PyPi: Affy package directory is missing In-Reply-To: <20100625183259.GM1227@sobchak.mgh.harvard.edu> References: <20100625183259.GM1227@sobchak.mgh.harvard.edu> Message-ID: On Fri, Jun 25, 2010 at 7:32 PM, Brad Chapman wrote: > Walter; > >> I generally install biopython either using easy_install or pip (if >> I need to enforce a particular version). Today I was rebuilding the >> dependencies for a project I'm working on and when I went to build >> biopython I got the following error: >> >> error: package directory 'Bio/Affy' does not exist > > Thanks for the heads up. There were a couple of compounding issues > that led to the easy_install problem. Normally, pypi pulls the > source files directly from the Biopython server. However, it looks > like a spammer added an index.html in biopython.org/DIST, which > forced pypi to use the uploaded .tar.gz as a last resort. ... > > - Things are fixed now on all ends. The pypi upload is now correct, > you can access the biopython files again, and the MANIFEST.in was > updated so we shouldn't see this again in the future. Huh - so MANIFEST.in should include MANIFEST.in? Thanks for sorting that out Brad. > - You are apparently the first lucky person to see the problem > since the biopython.org/DIST problem appears to be brand new. I'd spotted the index.html thing yesterday and it was fixed this morning - you were unlucky with the timing. The OBF team are looking into this.
Peter From biopython at maubp.freeserve.co.uk Fri Jun 25 15:50:33 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 25 Jun 2010 20:50:33 +0100 Subject: [Biopython] Exon/intron locations for drosophila R4 In-Reply-To: References: Message-ID: On Fri, Jun 25, 2010 at 5:28 PM, John Reid wrote: > Hi, > > What's the easiest way to get exon locations for release 4 of the > melanogaster genome into my python script? I have been using a > DAS interface to UCSC but it doesn't seem to supply this info. > > Thanks, > John. You could download by FTP from here: ftp://ftp.ncbi.nih.gov/genomes/Drosophila_melanogaster/RELEASE_4.1/ The *.gbk or *.gff files should give you the exons. Peter From chapmanb at 50mail.com Sun Jun 27 21:26:57 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Sun, 27 Jun 2010 21:26:57 -0400 Subject: [Biopython] Problem building from PyPi: Affy package directory is missing In-Reply-To: References: <20100625183259.GM1227@sobchak.mgh.harvard.edu> Message-ID: <20100628012657.GA25674@kunkel> Peter; [PyPi upload builds] > Huh - so MANIFEST.in should include MANIFEST.in? Thanks > for sorting that out Brad. To build the PyPi distribution, I download the release tarball and then upload it to PyPi with: setup.py sdist upload So if the MANIFEST.in is in the release all of the right files will be pulled in. With it missing, the Affy and other bits get left out. So including MANIFEST.in ensures we can keep building identical distributions from the main tarball. Hope the website issues get sorted out so we don't have to worry too much about it. Thanks, Brad From msameet at gmail.com Mon Jun 28 15:20:49 2010 From: msameet at gmail.com (Sameet Mehta) Date: Mon, 28 Jun 2010 15:20:49 -0400 Subject: [Biopython] problem parsing embl file Message-ID: Hi, I am trying to parse an EMBL file created in 2004. The file contains a single record for the entire chromosome.
I have tried the following two approaches r = SeqIO.parse( file( "chromosome1.contig.embl" ), "embl" ).next() r = SeqIO.read( file( "chromosome1.contig.embl" ), "embl" ) I get the following error: ValueError Traceback (most recent call last) /home/sameet/NIH-work/downloads/2004_release/ in () /usr/lib64/python2.6/site-packages/Bio/SeqIO/__init__.pyc in read(handle, format, alphabet) 516 iterator = parse(handle, format, alphabet) 517 try: --> 518 first = iterator.next() 519 except StopIteration: 520 first = None /usr/lib64/python2.6/site-packages/Bio/GenBank/Scanner.pyc in parse_records(self, handle, do_features) 418 #This is a generator function 419 while True: --> 420 record = self.parse(handle, do_features) 421 if record is None : break 422 assert record.id is not None /usr/lib64/python2.6/site-packages/Bio/GenBank/Scanner.pyc in parse(self, handle, do_features) 401 feature_cleaner = FeatureValueCleaner()) 402 --> 403 if self.feed(handle, consumer, do_features): 404 return consumer.data 405 else: /usr/lib64/python2.6/site-packages/Bio/GenBank/Scanner.pyc in feed(self, handle, consumer, do_features) 383 consumer.sequence(sequence_string) 384 #Calls to consumer.base_number() do nothing anyway --> 385 consumer.record_end("//") 386 387 assert self.line == "//" /usr/lib64/python2.6/site-packages/Bio/GenBank/__init__.pyc in record_end(self, content) 1047 and self._expected_size != len(sequence): 1048 raise ValueError("Expected sequence length %i, found %i." \ -> 1049 % (self._expected_size, len(sequence))) 1050 1051 if self._seq_type: ValueError: Expected sequence length 666, found 5580032. Can you tell me if I am doing anything wrong? I am following the instructions as given in the Bio.SeqIO wiki page. Thanks for the help.
Sameet -- Sameet Mehta, Ph.D., Phone: (301) 842-4791 From biopython at maubp.freeserve.co.uk Mon Jun 28 15:56:42 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 28 Jun 2010 20:56:42 +0100 Subject: [Biopython] problem parsing embl file In-Reply-To: References: Message-ID: Hi Sameet, On Mon, Jun 28, 2010 at 8:20 PM, Sameet Mehta wrote: > Hi, > > I am trying to parse an EMBL file created in 2004. The file contains a > single record for the entire chromosome. I have tried the following > two approaches > > r = SeqIO.parse( file( "chromosome1.contig.embl" ), "embl" ).next() > r = SeqIO.read( file( "chromosome1.contig.embl" ), "embl" ) Those look fine - if you are using Biopython 1.54 you can just use the filename rather than opening it explicitly. > I get the following error: > ValueError Traceback (most recent call last) > ... > ValueError: Expected sequence length 666, found 5580032. > > Can you tell me if I am doing anything wrong? I am following the > instructions as given in the Bio.SeqIO wiki page. No, your code is fine. It looks like you have a broken EMBL file. Could you show me the first few lines of the EMBL file, and also have a look at it in a text editor to see if the sequence length really is 666 bp, or 5580032 as Biopython thinks? (Or send the whole EMBL file to me off list?) In any case, that check seemed a bit strict (I've seen several examples of unofficial GenBank or EMBL files where the sequence length didn't match the header) so I relaxed this check to a warning for Biopython 1.54. You could try updating your copy of Biopython and see if it will accept the file then? Regards, Peter From msameet at gmail.com Mon Jun 28 16:02:33 2010 From: msameet at gmail.com (Sameet Mehta) Date: Mon, 28 Jun 2010 16:02:33 -0400 Subject: [Biopython] problem parsing embl file In-Reply-To: References: Message-ID: Hi Peter, The Sequence length is 5580032; it's the first chromosome of yeast.
following are the first 10 lines of the file. ID c212 standard; DNA; FUN; 666 BP. AC c212; FH Key Location/Qualifiers FH FT CDS complement(1..5662) FT /gene="SPAC212.11" FT /partial FT /product="DNA helicase; no apparent orthologs" FT /note="possibly pseudo as has strange promoter region" FT misc_feature complement(1115..1339) Also I believe that I am using the latest BioPython on my laptop. I think I found the problem!! Indeed the first line is the problem. So how can I circumvent this? On Mon, Jun 28, 2010 at 3:56 PM, Peter wrote: > Hi Sameet, > > On Mon, Jun 28, 2010 at 8:20 PM, Sameet Mehta wrote: >> Hi, >> >> I am trying to parse an EMBL file created in 2004. The file contains a >> single record for the entire chromosome. I have tried the following >> two approaches >> >> r = SeqIO.parse( file( "chromosome1.contig.embl" ), "embl" ).next() >> r = SeqIO.read( file( "chromosome1.contig.embl" ), "embl" ) > > Those look fine - if you are using Biopython 1.54 you can just > use the filename rather than opening it explicitly. > >> I get the following error: >> ValueError Traceback (most recent call last) >> ... >> ValueError: Expected sequence length 666, found 5580032. >> >> Can you tell me if I am doing anything wrong? I am following the >> instructions as given in the Bio.SeqIO wiki page. > > No, your code is fine. It looks like you have a broken EMBL file. > Could you show me the first few lines of the EMBL file, and also > have a look at it in a text editor to see if the sequence length > really is 666 bp, or 5580032 as Biopython thinks? > > (Or send the whole EMBL file to me off list?) > > In any case, that check seemed a bit strict (I've seen several > examples of unofficial GenBank or EMBL files where the > sequence length didn't match the header) so I relaxed this > check to a warning for Biopython 1.54. You could try updating > your copy of Biopython and see if it will accept the file then?
> > Regards, > > Peter > -- Sameet Mehta, Ph.D., Phone: (301) 842-4791 From biopython at maubp.freeserve.co.uk Mon Jun 28 16:06:38 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 28 Jun 2010 21:06:38 +0100 Subject: [Biopython] problem parsing embl file In-Reply-To: References: Message-ID: On Mon, Jun 28, 2010 at 9:02 PM, Sameet Mehta wrote: > Hi Peter, > > The Sequence length is 5580032; it's the first chromosome of yeast. > following are the first 10 lines of the file. > > ID c212 standard; DNA; FUN; 666 BP. > ... > > Also I believe that I am using the latest BioPython on my laptop. > Could you check? Run Python then, import Bio print Bio.__version__ > I think I found the problem!! Indeed the first line is the problem. So > how can I circumvent this? Try editing the file to start: ID c212 standard; DNA; FUN; 5580032 BP. Peter From msameet at gmail.com Mon Jun 28 16:58:01 2010 From: msameet at gmail.com (Sameet Mehta) Date: Mon, 28 Jun 2010 16:58:01 -0400 Subject: [Biopython] problem parsing embl file In-Reply-To: References: Message-ID: Hi Peter, Yes I am indeed running v 1.53 >>> import Bio >>> print Bio.__version__ 1.53 >>> I will change that line and see if it works; it probably will. But I have multiple such files to deal with and I don't know the lengths of each of those. Is there any simpler way around this? Sorry to bother you this way. Sameet On Mon, Jun 28, 2010 at 4:06 PM, Peter wrote: > On Mon, Jun 28, 2010 at 9:02 PM, Sameet Mehta wrote: >> Hi Peter, >> >> The Sequence length is 5580032; it's the first chromosome of yeast. >> following are the first 10 lines of the file. >> >> ID c212 standard; DNA; FUN; 666 BP. >> ... >> >> Also I believe that I am using the latest BioPython on my laptop. >> > > Could you check? Run Python then, > > import Bio > print Bio.__version__ > >> I think I found the problem!! Indeed the first line is the problem. So >> how can I circumvent this? > > Try editing the file to start: > > ID c212 standard; DNA; FUN; 5580032 BP.
> > Peter > -- Sameet Mehta, Ph.D., Phone: (301) 842-4791 From biopython at maubp.freeserve.co.uk Mon Jun 28 18:06:55 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 28 Jun 2010 23:06:55 +0100 Subject: [Biopython] problem parsing embl file In-Reply-To: References: Message-ID: On Mon, Jun 28, 2010 at 9:58 PM, Sameet Mehta wrote: > Yes I am indeed running v 1.53 That isn't our latest release (although it may be the latest release available as a package from your Linux distribution). > I will change that line and see if it works; it probably will. But I > have multiple such files to deal with and I don't know the lengths of > each of those. Please try that and let us know if that works. > Is there any simpler way around this? Yes, upgrade to Biopython 1.54 and this length check will just issue a warning and carry on (unlike Biopython 1.53 which issues an exception and stops). Regards, Peter From msameet at gmail.com Mon Jun 28 19:43:30 2010 From: msameet at gmail.com (Sameet Mehta) Date: Mon, 28 Jun 2010 19:43:30 -0400 Subject: [Biopython] problem parsing embl file In-Reply-To: References: Message-ID: Hi Peter, I think I figured out a way around this. I will of course try with Biopython 1.54, but in the line that begins with SQ (the identifier for the region where the sequence starts), the length of the true sequence is given. Getting that information and replacing the one on the ID line was trivial. I have done it. Thanks for the help. Sameet On Mon, Jun 28, 2010 at 6:06 PM, Peter wrote: > On Mon, Jun 28, 2010 at 9:58 PM, Sameet Mehta wrote: >> Yes I am indeed running v 1.53 > > That isn't our latest release (although it may be the latest > release available as a package from your Linux distribution). > >> I will change that line and see if it works; it probably will. But I >> have multiple such files to deal with and I don't know the lengths of >> each of those. > > Please try that and let us know if that works. > >> Is there any simpler way around this?
> > Yes, upgrade to Biopython 1.54 and this length check will > just issue a warning and carry on (unlike Biopython 1.53 > which issues an exception and stops). > > Regards, > > Peter > -- Sameet Mehta, Ph.D., Phone: (301) 842-4791 From lunt at ctbp.ucsd.edu Tue Jun 29 02:36:58 2010 From: lunt at ctbp.ucsd.edu (Bryan Lunt) Date: Mon, 28 Jun 2010 23:36:58 -0700 Subject: [Biopython] BioJava-like seqres alignment for Bio.PDB Message-ID: Greetings All, Does anyone have any code for easy alignment between the SEQRES entry in a PDB file and the actual ATOM/HETATM entries in the chain? In biojava, this is just one of the options when you parse a PDB file; it would certainly be useful. Does anyone have any code for this? Shall I write it? Thanks! -Bryan Lunt From reece at berkeley.edu Tue Jun 29 19:04:41 2010 From: reece at berkeley.edu (Reece Hart) Date: Tue, 29 Jun 2010 16:04:41 -0700 Subject: [Biopython] BioJava-like seqres alignment for Bio.PDB In-Reply-To: References: Message-ID: <4C2A7C09.2020204@berkeley.edu> On 06/28/2010 11:36 PM, Bryan Lunt wrote: > Does anyone have any code for easy alignment between the SEQRES entry > in a PDB file and the actual ATOM/HETATM entries in the chain? > > In biojava, this is just one of the options when you parse a PDB file; > it would certainly be useful. > How does BioJava do this? RCSB added this mapping explicitly in the XML formatted files several years ago. It looks like this: SER 145 n 3 SER 145 A That is, sequence position 3 is resid position 145 in this protein. In any case, having a function that provides this mapping (both directions) in BioPython would be extremely useful.
Thanks, Reece From biopython at maubp.freeserve.co.uk Wed Jun 30 09:44:10 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 30 Jun 2010 14:44:10 +0100 Subject: [Biopython] BioJava-like seqres alignment for Bio.PDB In-Reply-To: <4C2A7C09.2020204@berkeley.edu> References: <4C2A7C09.2020204@berkeley.edu> Message-ID: On Wed, Jun 30, 2010 at 12:04 AM, Reece Hart wrote: > On 06/28/2010 11:36 PM, Bryan Lunt wrote: >> >> Does anyone have any code for easy alignment between the SEQRES entry >> in a PDB file and the actual ATOM/HETATM entries in the chain? >> >> In biojava, this is just one of the options when you parse a PDB file; >> it would certainly be useful. >> > > How does BioJava do this? > > RCSB added this mapping explicitly in the XML formatted files several years > ago. It looks like this: > > seq_id="3"> > SER > 145 > n > 3 > > SER > 145 > A > > That is, sequence position 3 is resid position 145 in this protein. That looks like a good reason to have a PDB XML parser (as trying to do this from the plain text PDB is probably fiddly). > In any case, having a function that provides this mapping (both directions) in > BioPython would be extremely useful. Maybe something for the GSoC project TODO list? ;) Peter From lunt at ctbp.ucsd.edu Wed Jun 30 14:13:37 2010 From: lunt at ctbp.ucsd.edu (Bryan Lunt) Date: Wed, 30 Jun 2010 11:13:37 -0700 Subject: [Biopython] Fwd: Re: BioJava-like seqres alignment for Bio.PDB Message-ID: It is indeed fiddly, but I needed it immediately. http://gist.github.com/459005 Basically you call mapAll( open("pdb1234.ent"), structureObjectPreviouslyReadWith-Bio.PDB) and get back a dictionary of chain IDs -> lists of tuples, where position in the list corresponds to position in the seqres. It _is_ ugly, and makes my eyes bleed. Sorry.
-Bryan From reece at berkeley.edu Wed Jun 30 20:05:05 2010 From: reece at berkeley.edu (Reece Hart) Date: Wed, 30 Jun 2010 17:05:05 -0700 Subject: [Biopython] BioJava-like seqres alignment for Bio.PDB In-Reply-To: References: <4C2A7C09.2020204@berkeley.edu> Message-ID: <4C2BDBB1.7070503@berkeley.edu> On 06/30/2010 06:44 AM, Peter wrote: > That looks like a good reason to have a PDB XML parser (as trying to do > this from the plain text PDB is probably fiddly). > The only way to do this from the PDB text file is to infer the mapping by aligning the seqres block and the ATOM residues. Fiddly is a generous description of the problems one will encounter. mmCIF and XML are the authoritative sources, AFAIK. -Reece From lunt at ctbp.ucsd.edu Wed Jun 30 23:06:41 2010 From: lunt at ctbp.ucsd.edu (Bryan Lunt) Date: Wed, 30 Jun 2010 20:06:41 -0700 Subject: [Biopython] (least) Favorite PDB models? Message-ID: Greetings All, So I have finished (for now) the section of my program that maps PDB model residues to SEQRES residues... And every programmer's favorite thing is QA, right? Does anyone have some suggestions of particularly ugly PDB models (discontinuous, strange residue numberings, etc) that I can test it on? Thanks All! -Bryan Lunt From biopython at maubp.freeserve.co.uk Thu Jun 3 17:52:31 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 3 Jun 2010 18:52:31 +0100 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? Message-ID: Dear Biopythoneers, We've had several discussions (mostly on the development list) about extending the Bio.SeqIO.index() functionality.
For a quick recap of what you can do with it right now, see: http://news.open-bio.org/news/2009/09/biopython-seqio-index/ http://news.open-bio.org/news/2010/04/partial-seq-files-biopython/ There are two major additions that have been discussed (and some code written too): gzip support and storing the index on disk. Currently Bio.SeqIO.index() has to re-index the sequence file each time you run your script. If you run the same script often, it would be useful to be able to save the index information to disk. The idea is that you can then load the index file and get almost immediate random access to the associated sequence file (without waiting to scan the file to rebuild the index). The old OBDA style indexes used by BioPerl, BioRuby etc are one possible file format we might use, but a simple SQLite database may be preferable. This also would give us a way to index really big files with many millions of reads without keeping the file offsets in memory. This is going to be important for random access to the latest massive sequencing data files. Next, support for indexing compressed files (initially concentrating on Unix style gzipped files, e.g. example.fasta.gz) without having to decompress the whole file. You can already parse these files with Bio.SeqIO in conjunction with the Python gzip module. It would be nice to be able to index them too. Now ideally we'd be able to offer both of these features - but if you had to vote, which would be most important and why? Peter From rjalves at igc.gulbenkian.pt Fri Jun 4 08:10:22 2010 From: rjalves at igc.gulbenkian.pt (Renato Alves) Date: Fri, 04 Jun 2010 09:10:22 +0100 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: References: Message-ID: <4C08B4EE.4020508@igc.gulbenkian.pt> Hi Peter and all, Considering the fact that the first addition 'is' a potential problem, while the second is more of an optimization, I would put my vote on the first.
In addition, an SQLite or similar solution would also allow one to use the indexing feature on short-run applications where recalculating the index every time is a costly (sometimes too costly) operation. Obviously the second would be of great use if put together with the first, but I'm a little bit biased on that since I was part of the group that raised the gzip question in the mailing list some time ago. Regards, Renato Quoting Peter on 06/03/2010 06:52 PM: > Dear Biopythoneers, > > We've had several discussions (mostly on the development list) about > extending the Bio.SeqIO.index() functionality. For a quick recap of what > you can do with it right now, see: > > http://news.open-bio.org/news/2009/09/biopython-seqio-index/ > http://news.open-bio.org/news/2010/04/partial-seq-files-biopython/ > > There are two major additions that have been discussed (and some > code written too): gzip support and storing the index on disk. > > Currently Bio.SeqIO.index() has to re-index the sequence file each > time you run your script. If you run the same script often, it would be > useful to be able to save the index information to disk. The idea is > that you can then load the index file and get almost immediate > random access to the associated sequence file (without waiting to scan > the file to rebuild the index). The old OBDA style indexes used by > BioPerl, BioRuby etc are one possible file format we might use, but > a simple SQLite database may be preferable. This also would give > us a way to index really big files with many millions of reads without > keeping the file offsets in memory. This is going to be important for > random access to the latest massive sequencing data files. > > Next, support for indexing compressed files (initially concentrating > on Unix style gzipped files, e.g. example.fasta.gz) without having > to decompress the whole file. You can already parse these files > with Bio.SeqIO in conjunction with the Python gzip module.
It would > be nice to be able to index them too. > > Now ideally we'd be able to offer both of these features - but if > you had to vote, which would be most important and why? > > Peter > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From biopython at maubp.freeserve.co.uk Fri Jun 4 08:42:22 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 4 Jun 2010 09:42:22 +0100 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: <4C08B4EE.4020508@igc.gulbenkian.pt> References: <4C08B4EE.4020508@igc.gulbenkian.pt> Message-ID: On Fri, Jun 4, 2010 at 9:10 AM, Renato Alves wrote: > Hi Peter and all, > > Considering the fact that the first addition 'is' a potential problem, > while the second is more of an optimization, I would put my vote on the > first. In addition, an sqlite or similar solution would also allow one > to use the indexing feature on short run applications where > recalculating the index every time is a costly (sometimes too much) > operation. > > Obviously the second would be of great use if put together with the > first, but I'm a little bit biased on that since I was part of the group > that raised the gzip question in the mailing list some time ago. > > Regards, > Renato Hi Renato, Unfortunately I was inconsistent about which order I used in my email (gzip vs on disk indexes) so I'm not sure which you are talking about. Are you saying supporting on disk indexes would be your priority (even though you did ask about gzip support in the past)? Peter From lpritc at scri.ac.uk Fri Jun 4 08:49:06 2010 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Fri, 04 Jun 2010 09:49:06 +0100 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk?
In-Reply-To: Message-ID: Hi, On 03/06/2010 Thursday, June 3, 18:52, "Peter" wrote:

> There are two major additions that have been discussed (and some
> code written too): gzip support and storing the index on disk.

[...]

> Now ideally we'd be able to offer both of these features - but if
> you had to vote, which would be most important and why?

On-disk indexing. But does this not also lend itself (perhaps eventually...) also to storing the whole dataset in SQLite or similar to avoid syncing problems between the file and the index? Wasn't that also part of a discussion on the BIP list some time ago?

I've not looked at how you're already parsing from gzip files, so I hope it's more time-efficient than what I used to do for bzip, which was write a Pyrex wrapper to Flex, which was using the bzip2 library directly. This was not a speed improvement over uncompressing the file each time I needed to open it (and then using Flex). The same is true for Python's gzip module:

-rw-r--r--  1 lpritc  staff   110M 14 Apr 14:22 phytophthora_infestans_data.tar.gz

$ time gunzip phytophthora_infestans_data.tar.gz

real    0m18.359s
user    0m3.562s
sys     0m0.582s

Python 2.6 (trunk:66714:66715M, Oct 1 2008, 18:36:04)
[GCC 4.0.1 (Apple Computer, Inc. build 5370)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import time
>>> import gzip
>>> def gzip_time():
...     t0 = time.time()
...     f = gzip.open('phytophthora_infestans_data.tar.gz','rb')
...     f.read()
...     print time.time()-t0
...
>>> gzip_time()
19.2009749413

If you know where your data is, it can be quicker to get to, but you still need to uncompress each time, and it scales approximately linearly with number of lines returned, as you'd expect:

>>> def read_lines(n):
...     t0 = time.time()
...     f = gzip.open('phytophthora_infestans_data.tar.gz', 'rb')
...     lset = [f.readline() for i in range(n)]
...     print time.time() - t0
...     return lset
...
>>> d = read_lines(1000)
0.0324518680573
>>> d = read_lines(10000)
0.11150097847
>>> d = read_lines(100000)
0.808992147446
>>> d = read_lines(1000000)
7.9017291069
>>> d = read_lines(2000000)
15.7361371517
>>> d = read_lines(3000000)
23.7589659691

The advantage to me was in the amount of disk space (and network transfer time/bandwidth) saved by dealing with a compressed file. In the end I decided that, where data access was likely to be frequent, buying more storage and handling uncompressed data would be a better option than dealing directly with the compressed file:

-rw-r--r--  1 lpritc  staff   410M 14 Apr 14:22 phytophthora_infestans_data.tar

>>> def read_file():
...     t0 = time.time()
...     d = open('phytophthora_infestans_data.tar','rb').read()
...     print time.time() - t0
...
>>> read_file()
0.620229959488

>>> def read_file_lines(n):
...     t0 = time.time()
...     f = open('phytophthora_infestans_data.1.tar', 'rb')
...     lset = [f.readline() for i in range(n)]
...     print time.time() - t0
...     return lset
...
>>> d = read_file_lines(100)
0.000148057937622
>>> d = read_file_lines(1000)
0.000863075256348
>>> d = read_file_lines(10000)
0.00704002380371
>>> d = read_file_lines(100000)
0.0780401229858
>>> d = read_file_lines(1000000)
0.804203033447
>>> d = read_file_lines(2000000)
1.71462202072
>>> d = read_file_lines(4000000)
3.55472993851

I don't see (though I'm happy to be shown) how you can efficiently index directly into the LZW/DEFLATE/BZIP compressed data. If you're not decompressing the whole thing in one go, I think you still have to partially decompress a section of the file (starting from the front of the file) to retrieve your sequence each time. Even if you index - say, by recording the required buffer size/number of buffer decompressions and the offset of your sequence in the output as the index.
This could save memory if you discard, rather than cache, unwanted early output - but I'm not sure that it would be time-efficient to do it for more than one or two (on average) sequences in a compressed file. You'd likely be better off spending your time waiting for the file to decompress once and doing science with the time that's left over ;) I could be wrong, though... Cheers, L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). 
______________________________________________________ From biopython at maubp.freeserve.co.uk Fri Jun 4 09:16:19 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 4 Jun 2010 10:16:19 +0100 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: References: Message-ID: On Fri, Jun 4, 2010 at 9:49 AM, Leighton Pritchard wrote:
> Hi,
>
> On 03/06/2010 Thursday, June 3, 18:52, "Peter" wrote:
>
>> There are two major additions that have been discussed (and some
>> code written too): gzip support and storing the index on disk.
>
> [...]
>
>> Now ideally we'd be able to offer both of these features - but if
>> you had to vote, which would be most important and why?
>
> On-disk indexing. But does this not also lend itself (perhaps
> eventually...) also to storing the whole dataset in SQLite or similar to
> avoid syncing problems between the file and the index? Wasn't that also
> part of a discussion on the BIP list some time ago?

That is a much more complicated problem - serialising data from many different possible file formats. We have BioSQL which is pretty good for things like GenBank, EMBL, SwissProt etc but not suitable for FASTQ. I'd rather stick to the simpler task of recording a lookup table mapping record identifiers to file offsets.

> I've not looked at how you're already parsing from gzip files, so I hope
> it's more time-efficient than what I used to do for bzip, which was write a
> Pyrex wrapper to Flex, which was using the bzip2 library directly. This was
> not a speed improvement over uncompressing the file each time I needed to
> open it (and then using Flex). The same is true for Python's gzip module:
>
> -rw-r--r--  1 lpritc  staff   110M 14 Apr 14:22
> phytophthora_infestans_data.tar.gz
>
> $ time gunzip phytophthora_infestans_data.tar.gz
>
> real    0m18.359s
> user    0m3.562s
> sys     0m0.582s
>
> Python 2.6 (trunk:66714:66715M, Oct 1 2008, 18:36:04)
> [GCC 4.0.1 (Apple Computer, Inc. build 5370)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
>>>> import time
>>>> import gzip
>>>> def gzip_time():
> ...     t0 = time.time()
> ...     f = gzip.open('phytophthora_infestans_data.tar.gz','rb')
> ...     f.read()
> ...     print time.time()-t0
> ...
>>>> gzip_time()
> 19.2009749413
>
> If you know where your data is, it can be quicker to get to, but you still
> need to uncompress each time, and it scales approximately linearly with
> number of lines returned, as you'd expect:
>
>>>> def read_lines(n):
> ...     t0 = time.time()
> ...     f = gzip.open('phytophthora_infestans_data.tar.gz', 'rb')
> ...     lset = [f.readline() for i in range(n)]
> ...     print time.time() - t0
> ...     return lset
> ...
>>>> d = read_lines(1000)
> 0.0324518680573
>>>> d = read_lines(10000)
> 0.11150097847
>>>> d = read_lines(100000)
> 0.808992147446
>>>> d = read_lines(1000000)
> 7.9017291069
>>>> d = read_lines(2000000)
> 15.7361371517
>>>> d = read_lines(3000000)
> 23.7589659691
>
> The advantage to me was in the amount of disk space (and network transfer
> time/bandwidth) saved by dealing with a compressed file. In the end I
> decided that, where data access was likely to be frequent, buying more
> storage and handling uncompressed data would be a better option than dealing
> directly with the compressed file:
>
> -rw-r--r--  1 lpritc  staff   410M 14 Apr 14:22
> phytophthora_infestans_data.tar
>
>>>> def read_file():
> ...     t0 = time.time()
> ...     d = open('phytophthora_infestans_data.tar','rb').read()
> ...     print time.time() - t0
> ...
>>>> read_file()
> 0.620229959488
>
>>>> def read_file_lines(n):
> ...     t0 = time.time()
> ...     f = open('phytophthora_infestans_data.1.tar', 'rb')
> ...     lset = [f.readline() for i in range(n)]
> ...     print time.time() - t0
> ...     return lset
> ...
>>>> d = read_file_lines(100)
> 0.000148057937622
>>>> d = read_file_lines(1000)
> 0.000863075256348
>>>> d = read_file_lines(10000)
> 0.00704002380371
>>>> d = read_file_lines(100000)
> 0.0780401229858
>>>> d = read_file_lines(1000000)
> 0.804203033447
>>>> d = read_file_lines(2000000)
> 1.71462202072
>>>> d = read_file_lines(4000000)
> 3.55472993851
>
> I don't see (though I'm happy to be shown) how you can efficiently index
> directly into the LZW/DEFLATE/BZIP compressed data. If you're not
> decompressing the whole thing in one go, I think you still have to partially
> decompress a section of the file (starting from the front of the file) to
> retrieve your sequence each time. Even if you index - say, by recording the
> required buffer size/number of buffer decompressions and the offset of your
> sequence in the output as the index. This could save memory if you discard,
> rather than cache, unwanted early output - but I'm not sure that it would be
> time-efficient to do it for more than one or two (on average) sequences in a
> compressed file. You'd likely be better off spending your time waiting for
> the file to decompress once and doing science with the time that's left over
> ;)
>
> I could be wrong, though...
>

The proof of concept support for gzip files in Bio.SeqIO.index() just called the Python gzip module. This gives us a file-like handle object supporting the usual methods like readline and iteration (used to scan the file looking for each record) and seek/tell (offsets for the decompressed stream). Here building the index must by its nature decompress the whole file once - there is no way round that. The interesting thing is how seeking to an offset and then reading a record performs - and I have not looked at the run time or memory usage for this. It works, but your measurements do suggest it will be much much slower than using the original file. i.e.
It looks like while the code to support gzip files in Bio.SeqIO.index() is quite short, the performance may be unimpressive for large archives. I doubt this can be worked around - it's the cost of saving disk space by compressing a whole file without taking any special care about putting different records into different blocks. Peter [As an aside, this is something I'm interested in for BAM file support, these are binary files which are gzip compressed.] From aboulia at gmail.com Fri Jun 4 10:53:14 2010 From: aboulia at gmail.com (Kevin) Date: Fri, 4 Jun 2010 18:53:14 +0800 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: References: Message-ID: I vote for sqlite index. Have been using bsddb to do the same but the db is inflated compared to plain text. Performance is not bad using btree. For gzip I feel it might be possible to gunzip into a stream which biopython can parse on the fly? Kev Sent from my iPod On 04-Jun-2010, at 1:52 AM, Peter wrote: > Dear Biopythoneers, > > We've had several discussions (mostly on the development list) about > extending the Bio.SeqIO.index() functionality. For a quick recap of > what > you can do with it right now, see: > > http://news.open-bio.org/news/2009/09/biopython-seqio-index/ > http://news.open-bio.org/news/2010/04/partial-seq-files-biopython/ > > There are two major additions that have been discussed (and some > code written too): gzip support and storing the index on disk. > > Currently Bio.SeqIO.index() has to re-index the sequence file each > time you run your script. If you run the same script often, it would > be > useful to be able to save the index information to disk. The idea is > that you can then load the index file and get almost immediate > random access to the associated sequence file (without waiting to scan > the file to rebuild the index).
The old OBDA style indexes used by > BioPerl, BioRuby etc are one possible file format we might use, but > a simple SQLite database may be preferable. This also would give > us a way to index really big files with many millions of reads without > keeping the file offsets in memory. This is going to be important for > random access to the latest massive sequencing data files. > > Next, support for indexing compressed files (initially concentrating > on Unix style gzipped files, e.g. example.fasta.gz) without having > to decompress the whole file. You can already parse these files > with Bio.SeqIO in conjunction with the Python gzip module. It would > be nice to be able to index them too. > > Now ideally we'd be able to offer both of these features - but if > you had to vote, which would be most important and why? > > Peter > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From biopython at maubp.freeserve.co.uk Fri Jun 4 12:59:22 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 4 Jun 2010 13:59:22 +0100 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: References: Message-ID: On Fri, Jun 4, 2010 at 11:53 AM, Kevin wrote: > I vote for sqlite index. Have been using bsddb to do the same but the db > is inflated compared to plain text. Performance is not bad using btree The other major point against bsddb is that future versions of Python will not include it in the standard library - but Python 2.5+ does have sqlite3 included. > For gzip I feel it might be possible to gunzip into a stream which > biopython can parse on the fly? 
Yes of course, like this:

import gzip
from Bio import SeqIO
handle = gzip.open("uniprot_sprot.dat.gz")
for record in SeqIO.parse(handle, "swiss"):
    print record.id
handle.close()

Parsing is easy - the point of this discussion is random access to any record within the stream (which requires jumping to an offset). Peter From lgautier at gmail.com Fri Jun 4 18:25:42 2010 From: lgautier at gmail.com (Laurent) Date: Fri, 04 Jun 2010 20:25:42 +0200 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: References: Message-ID: <4C094526.9040508@gmail.com> On 04/06/10 18:00, biopython-request at lists.open-bio.org wrote: > > On Fri, Jun 4, 2010 at 11:53 AM, Kevin wrote: >> I vote for sqlite index. Have been using bsddb to do the same but the db >> is inflated compared to plain text. Performance is not bad using btree > > The other major point against bsddb is that future versions of Python > will not include it in the standard library - but Python 2.5+ does have > sqlite3 included. > >> For gzip I feel it might be possible to gunzip into a stream which >> biopython can parse on the fly? > > Yes of course, like this: > > import gzip > from Bio import SeqIO > handle = gzip.open("uniprot_sprot.dat.gz") > for record in SeqIO.parse(handle, "swiss"): print record.id > handle.close() > > Parsing is easy - the point of this discussion is random access to > any record within the stream (which requires jumping to an offset). > > Peter > One note of caution: Python's gzip module is slow, or so I experienced... to the point that I ended up wrapping the code into a function that gunzipped the file to a temporary location, parsed and extracted information, then deleted the temporary file. Regarding random access in compressed files, there is the BGZF format but I am not familiar enough with it to tell whether it can be of use here.
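That temporary-file workaround can be sketched roughly as follows (an illustration only - the helper name is invented, and nothing like this exists in Biopython):

```python
import gzip
import os
import shutil
import tempfile


def with_decompressed_copy(gz_filename, work):
    """Decompress gz_filename to a temporary file, call work(path), tidy up."""
    tmp = tempfile.NamedTemporaryFile(delete=False)
    try:
        with gzip.open(gz_filename, "rb") as src:
            # Stream in chunks so the whole file is never held in memory.
            shutil.copyfileobj(src, tmp)
        tmp.close()
        return work(tmp.name)
    finally:
        tmp.close()
        os.unlink(tmp.name)
```

Here work() could be any parser that wants a real uncompressed file on disk; the temporary copy is always removed afterwards, even if the parse fails.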
More generally, compression is part of the HDF5 format and this with chunks could prove the most battle-tested way to access entries randomly. L. From aboulia at gmail.com Fri Jun 4 18:35:05 2010 From: aboulia at gmail.com (Kevin Lam) Date: Sat, 5 Jun 2010 02:35:05 +0800 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: References: Message-ID: > > > Parsing is easy - the point of this discussion is random access to > any record within the stream (which requires jumping to an offset). > > Peter > Apologies, I didn't follow the thread closely enough. Now I understand why the two might be overlapping. I would still vote for sqlite3. Based on my short experience with next-gen sequencing, there are these other benefits:
1) pairing of csfasta with qual files based on read name can be done more easily + stored in the same db
2) pairing of mate pair and paired end reads can be done more easily + stored in the same db
3) generation of fastq files from 1) can be done more easily
4) double encoded fasta sequence and base space sequence can be stored in the same db as well.
I think the BWT method of indexing and compression used in bowtie and bwa for reference genomes might be a better way of going about the problem. That said, I think generally disk space is seldom an issue with lowering costs. Time / convenience is probably more important. The one time I wished for smaller NGS files is when I need to do transfers. Kevin From biopython at maubp.freeserve.co.uk Fri Jun 4 19:04:16 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 4 Jun 2010 20:04:16 +0100 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: <4C094526.9040508@gmail.com> References: <4C094526.9040508@gmail.com> Message-ID: On Fri, Jun 4, 2010 at 7:25 PM, Laurent wrote: > > One note of caution: Python's gzip module is slow, or so I experienced...
to > the point that I ended up wrapping the code into a function that gunzipped > the file to a temporary location, parse and extract information, then delete > the temporary file. > That should be easy to benchmark - using Python's gzip to parse a file versus using the command line tool gzip to decompress and then parse the uncompressed file. > > Regarding random access in compressed file, there is the BGZF format but I > am not familiar enough with it to tell whether it can be of use here. > I've been looking at that this afternoon as it is used in BAM files. However, most gzip files (e.g. FASTA or FASTQ files) created with the gzip command line tools will NOT follow the BGZF convention. I personally have no need to have random access to gzipped general sequence files. However, I have some proof of concept code to exploit GZIP files using the BGZF structure which should give more efficient random access to any part of the file (compared to simply using the gzip module) but haven't yet done any benchmarking. The code is still very immature, but if you want a look see the _BgzfHandle class here: http://github.com/peterjc/biopython/commit/416a795ef618c937bf5d9acbd1ffdf33c4ae4767 > > More generally, compression is part of the HDF5 format and this with chunks > could prove the most battle-tested way to access entries randomly. > But (thus far) no sequence data is stored in HDF5 format (is it?). Peter From chapmanb at 50mail.com Fri Jun 4 19:33:58 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 4 Jun 2010 15:33:58 -0400 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: References: <4C094526.9040508@gmail.com> Message-ID: <20100604193358.GV1054@sobchak.mgh.harvard.edu> Peter and all; > > One note of caution: Python's gzip module is slow, or so I experienced...
to > > the point that I ended up wrapping the code into a function that gunzipped > > the file to a temporary location, parse and extract information, then delete > > the temporary file. More generally, I find having files gzipped while doing analysis is not very helpful. The time to gunzip and feed them into programs doesn't end up being worth the space tradeoff. My only real use of gzip is when archiving something that I'm done with. > > Regarding random access in compressed file, there is the BGZF format but I > > am not familiar enough with it to tell whether it can be of use here. > > I've been looking at that this afternoon as it is used in BAM files. What Broad does internally is store Fastq files in BAM format. You can convert with this Picard tool: http://picard.sourceforge.net/command-line-overview.shtml#FastqToSam Originally when using their tools I thought this would be as annoying as gzipped files, but it is practically pretty nice since you can access them with pysam. Compression size is the same as if gzipped. What do you think about co-opting the SAM/BAM format for this? This would make it more specific for things that can go into BAM (so no GenBank and what not), but would have the advantage of working with existing workflows. Region based indexing is already implemented for BAM, but it would be really useful to also have ID based retrieval along the lines of what you are proposing. Brad From kevin at aitbiotech.com Fri Jun 4 20:21:05 2010 From: kevin at aitbiotech.com (Kevin Lam) Date: Sat, 5 Jun 2010 04:21:05 +0800 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: <20100604193358.GV1054@sobchak.mgh.harvard.edu> References: <4C094526.9040508@gmail.com> <20100604193358.GV1054@sobchak.mgh.harvard.edu> Message-ID: Just thinking out loud. 
would generating a fake region id (unique for each read id) and the corresponding index when creating the bam be a good quick fix to utilise bam format for ID based retrieval? Or would the double mapping slow things down considerably? Kevin > > What do you think about co-opting the SAM/BAM format for this? This > would make it more specific for things that can go into BAM (so no > GenBank and what not), but would have the advantage of working with > existing workflows. > > Region based indexing is already implemented for BAM, but it would > be really useful to also have ID based retrieval along the lines of > what you are proposing. > > Brad > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From peter at maubp.freeserve.co.uk Fri Jun 4 20:21:33 2010 From: peter at maubp.freeserve.co.uk (Peter) Date: Fri, 4 Jun 2010 21:21:33 +0100 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: <20100604193358.GV1054@sobchak.mgh.harvard.edu> References: <4C094526.9040508@gmail.com> <20100604193358.GV1054@sobchak.mgh.harvard.edu> Message-ID: On Fri, Jun 4, 2010 at 8:33 PM, Brad Chapman wrote: > Peter and all; > > More generally, I find having files gzipped while doing analysis is > not very helpful. The time to gunzip and feed them into programs > doesn't end up being worth the space tradeoff. My only real use of > gzip is when archiving something that I'm done with. It seems that in general support for random access to gzipped files is of niche interest. Avoiding this in Bio.SeqIO.index() will keep the API simple and I think will make the caching to disk stuff a bit easier too. >> > Regarding random access in compressed file, there is the BGZF >> > format but I am not familiar enough with it to tell whether it can be >> > of use here. >> >> I've been looking at that this afternoon as it is used in BAM files. 
> > What Broad does internally is store Fastq files in BAM format. You > can convert with this Picard tool: > > http://picard.sourceforge.net/command-line-overview.shtml#FastqToSam Thanks for the link - I knew I'd seen a FASTQ to unaligned SAM/BAM tool out there somewhere. > Originally when using their tools I thought this would be as annoying as > gzipped files, but it is practically pretty nice since you can access > them with pysam. Compression size is the same as if gzipped. BAM files are compressed with a variant of gzip (this BGZF sub-format), so that isn't a big surprise ;) > What do you think about co-opting the SAM/BAM format for this? This > would make it more specific for things that can go into BAM (so no > GenBank and what not), but would have the advantage of working with > existing workflows. I can see storing unmapped reads in BAM as a sensible alternative to FASTQ. Note that you lose any descriptions (not usually important) but more importantly BAM files do not store the sequence case information (which is often used to encode trimming points). Obviously we'd want to have SAM/BAM output support in Bio.SeqIO to fully take advantage of this (grin). I'm keeping this in mind while working on SAM/BAM parsing, but it would be a *lot* more work. > Region based indexing is already implemented for BAM, but it would > be really useful to also have ID based retrieval along the lines of > what you are proposing. > > Brad Yeah, I've been reading up on the BAM index format (BAI files) and they don't do anything about read lookup by ID at all. So I haven't been reinventing the wheel by trying to do Bio.SeqIO.index() support of BAM - it should be complementary to the pysam stuff. Anyway, even for BAM files we should be able to use the same scheme as all the other file formats supported in Bio.SeqIO.index(), using an SQLite database to hold the lookup table of read names to file offsets (rather than a Python dictionary in memory as now).
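To illustrate that idea (a rough sketch only - the table layout and function names are invented, and this is not the actual Bio.SeqIO.index() code; FASTA is used just because record detection is trivial):

```python
import sqlite3


def build_fasta_offset_index(fasta_filename, index_filename):
    """Scan a FASTA file once, storing each record's start offset in SQLite."""
    con = sqlite3.connect(index_filename)
    con.execute("CREATE TABLE IF NOT EXISTS offsets "
                "(key TEXT PRIMARY KEY, offset INTEGER)")
    with open(fasta_filename, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                # Use the first word after '>' as the record identifier.
                key = line[1:].split(None, 1)[0].decode()
                con.execute("INSERT OR REPLACE INTO offsets VALUES (?, ?)",
                            (key, offset))
    con.commit()
    return con


def get_raw_record(fasta_filename, con, key):
    """Seek straight to a record using its stored offset."""
    (offset,) = con.execute("SELECT offset FROM offsets WHERE key=?",
                            (key,)).fetchone()
    with open(fasta_filename, "rb") as handle:
        handle.seek(offset)
        lines = [handle.readline()]
        while True:
            line = handle.readline()
            if not line or line.startswith(b">"):
                break
            lines.append(line)
        return b"".join(lines)
```

A real version would need per-format record detection, but this shows the principle: only the identifier/offset pairs live in the database, so millions of reads never have to be held in memory.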
Regards, Peter From lgautier at gmail.com Sat Jun 5 05:12:57 2010 From: lgautier at gmail.com (Laurent) Date: Sat, 05 Jun 2010 07:12:57 +0200 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: References: <4C094526.9040508@gmail.com> Message-ID: <4C09DCD9.60907@gmail.com> On 04/06/10 21:04, Peter wrote: > On Fri, Jun 4, 2010 at 7:25 PM, Laurent wrote: >> >> One note of caution: Python's gzip module is slow, or so I experienced... to >> the point that I ended up wrapping the code into a function that gunzipped >> the file to a temporary location, parse and extract information, then delete >> the temporary file. >> > > That should be easy to benchmark - using Python's gzip to parse a file > versus using the command line tool gzip to decompress and then parse > the uncompressed file. > >> >> Regarding random access in compressed file, there is the BGZF format but I >> am not familiar enough with it to tell whether it can be of use here. >> > > I've been looking at that this afternoon as it is used in BAM files. However, > most gzip files (e.g. FASTA or FASTQ files) created with the gzip command > line tools will NOT follow the BGZF convention. I personally have no need > to have random access to gzipped general sequence files. > > However, I have some proof of concept code to exploit GZIP files using the > BGZF structure which should give more efficient random access to any part > of the file (compared to simply using the gzip module) but haven't yet done > any benchmarking. The code is still very immature, but if you want a look > see the _BgzfHandle class here: > > http://github.com/peterjc/biopython/commit/416a795ef618c937bf5d9acbd1ffdf33c4ae4767 Are you using that obscure gzip option that inserts "ticks" throughout the file? If so, I remember reading that this could lead to problems (I just can't remember which ones... maybe it can be found on the web).
>> >> More generally, compression is part of the HDF5 format and this with chunks >> could prove the most battle-tested way to access entries randomly. >> > > But (thus far) no sequence data is stored in HDF5 format (is it?). Last year, in a SIG at the ISMB in Stockholm people showed that they have stored next-gen/short-reads using HDF5, and have demonstrated superior performance to BAM (not completely a surprise since to some BAM is reinventing some of the features in HDF5, and HDF5 has been developed for a longer time). I think that their slides are on slideshare (or similar place). Laurent > Peter From cjfields at illinois.edu Sat Jun 5 10:59:59 2010 From: cjfields at illinois.edu (Chris Fields) Date: Sat, 5 Jun 2010 05:59:59 -0500 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: References: <4C094526.9040508@gmail.com> Message-ID: On Jun 4, 2010, at 2:04 PM, Peter wrote: > On Fri, Jun 4, 2010 at 7:25 PM, Laurent wrote: > >> >> More generally, compression is part of the HDF5 format and this with chunks >> could prove the most battle-tested way to access entries randomly. >> > > But (thus far) no sequence data is stored in HDF5 format (is it?). > > Peter There will be a presentation this year at BOSC on BioHDF (HDF5 for bioinformatics). There is a website: http://www.hdfgroup.org/projects/biohdf/ chris From biopython at maubp.freeserve.co.uk Sat Jun 5 11:43:36 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 5 Jun 2010 12:43:36 +0100 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: <4C09DCD9.60907@gmail.com> References: <4C094526.9040508@gmail.com> <4C09DCD9.60907@gmail.com> Message-ID: On Sat, Jun 5, 2010 at 6:12 AM, Laurent wrote: >> >> I've been looking at that this afternoon as it is used in BAM files. >> However, most gzip files (e.g. FASTA or FASTQ files) created with >> the gzip command line tools will NOT follow the BGZF convention.
>> I personally have no need to have random access to gzipped general >> sequence files. >> >> However, I have some proof of concept code to exploit GZIP files using >> the BGZF structure which should give more efficient random access to >> any part of the file (compared to simply using the gzip module) but >> haven't yet done any benchmarking. The code is still very immature, >> but if you want a look see the _BgzfHandle class here: >> >> http://github.com/peterjc/biopython/commit/416a795ef618c937bf5d9acbd1ffdf33c4ae4767 > > Are you using that obscure gzip option that inserts "ticks" throughout the > file? If so, I remember reading that this could lead to problems (I just > can't remember which ones... maybe it can be found on the web). I'm not sure what you are referring to - probably not. The way BGZF works is that it is a standard GZIP file, made up of multiple GZIP blocks which can be decompressed in isolation (each with their own GZIP header). For random access to any part of the file all you need is the block offset (raw bytes, non-negative) and then a relative offset from the start of that block after decompression (again, non-negative). The non-standard bit is they use the optional subfields in the GZIP header to record the block size - presumably this cannot be inferred any other way. This information gives you the block offsets which are used when constructing the index. >>> More generally, compression is part of the HDF5 format and this with >>> chunks could prove the most battle-tested way to access entries >>> randomly. >> >> But (thus far) no sequence data is stored in HDF5 format (is it?). > > Last year, in a SIG at the ISMB in Stockholm people showed that they have > stored next-gen/short-reads using HDF5, and have demonstrated superior > performance to BAM (not completely a surprise since to some BAM is > reinventing some of the features in HDF5, and HDF5 has been developed for a > longer time).
I think that their slides > are on slideshare (or similar > place). There is some talk on the samtools mailing list about general improvements to the chunking in BAM, relocating the header information (and other very read-specific things about representing error models, indels, etc). You may be right that HDF5 has technical advantages over BAM version 1, but currently my impression is that SAM/BAM is making good headway with becoming a de facto standard for next-generation data, while HDF5 is not. Maybe someone should suggest they move to HDF5 internally for BAM version 2? Peter From biopython at maubp.freeserve.co.uk Sat Jun 5 11:51:10 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 5 Jun 2010 12:51:10 +0100 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: References: <4C094526.9040508@gmail.com> Message-ID: On Sat, Jun 5, 2010 at 11:59 AM, Chris Fields wrote: > On Jun 4, 2010, at 2:04 PM, Peter wrote: >> >> But (thus far) no sequence data is stored in HDF5 format (is it?). >> >> Peter > > There will be a presentation this year at BOSC on BioHDF (HDF5 for bioinformatics). > There is a website: > > http://www.hdfgroup.org/projects/biohdf/ It looks like they are making good progress - with SAM/BAM conversion to and from BioHDF in place. Still, as they say: >>> The current BioHDF distribution is a pipeline prototype designed to show >>> the suitability of HDF5 as a biological data store and to determine how to >>> best implement an HDF5-based bioinformatics pipeline. It is in source code >>> format only. The code builds a set of command-line tools which allow >>> uploading and extracting DNA/RNA sequence and alignment data from >>> next-generation gene sequencers. These files have been provided with the >>> same BSD license used by HDF5 >>> >>> ... >>> >>> Please be aware that the code contained in it will be in a high state of flux >>> in the immediate future.
This certainly looks like something to keep an eye on. In any case, getting back to the thread's purpose - Bio.SeqIO.index() aims to give random access to sequences by their ID for many different file formats. There has been little interest in extending this to support gzipped files. However, extending the code to store the id/offset lookup table on disk with SQLite3 (rather than in memory as a Python dict) would seem welcome. I'll be refreshing the github branch where I was working on this earlier in the year... Peter From lgautier at gmail.com Sat Jun 5 12:06:00 2010 From: lgautier at gmail.com (Laurent) Date: Sat, 05 Jun 2010 14:06:00 +0200 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: References: <4C094526.9040508@gmail.com> <4C09DCD9.60907@gmail.com> Message-ID: <4C0A3DA8.1020009@gmail.com> On 05/06/10 13:43, Peter wrote: > On Sat, Jun 5, 2010 at 6:12 AM, Laurent wrote: >>> >>> I've been looking at that this afternoon as it is used in BAM files. >>> However, most gzip files (e.g. FASTA or FASTQ files) created with >>> the gzip command line tools will NOT follow the BGZF convention. >>> I personally have no need to have random access to gzipped general >>> sequence files files. >>> >>> However, I have some proof of concept code to exploit GZIP files using >>> the BGZF structure which should give more efficient random access to >>> any part of the file (compared to simply using the gzip module) but >>> haven't yet done any benchmarking. The code is still very immature, >>> but if you want a look see the _BgzfHandle class here: >>> >>> http://github.com/peterjc/biopython/commit/416a795ef618c937bf5d9acbd1ffdf33c4ae4767 >> >> Are you using that gzip obscure option that inserts "ticks" throughout the >> file ? If so, I remember reading that this could lead to problems (I just >> can't remember which ones... may be it can be found on the web). > > I'm not sure what you are refering to - probably not. 
The way BGZF works > is it is a standard GZIP file, made up of multiple GZIP blocks which can > be decompressed in isolation (each with their own GZIP header). For > random access to any part of the file all you need is the block offset > (raw bytes, non-negative) and then a relative offset from the start of that > block after decompression (again, non-negative). > > The non-standard bit is they use the optional subfields in the GZIP header > to record the block size - presumably this cannot be infered any other > way. This information gives you the block offsets which are used when > constructing the index. > >>>> More generally, compression is part of the HDF5 format and this with >>>> chunks could prove the most battle-tested way to access entries >>>> randomly. >>> >>> But (thus far) no sequence data is stored in HDF5 format (is it?). >> >> Last year, in a SIG at the ISMB in Stockholm people showed that they have >> stored next-gen/short-reads using HDF5, and have demonstrated superior >> performances to BAM (not completely a surprise since to some BAM is >> reinventing some of the features in HDF5, and HDF5 has been developed for a >> longer time). I think that their slides are on slideshare (or similar >> place). > > There is some talk on the samtools mailing list about general improvements > to the chunking in BAM, relocating the header information (and other very > read specific things about representing error models, indels, etc). You may > be right that HDF5 has technical advantages over BAM version 1, but currently > my impression is that SAM/BAM is making good headway with becoming > a defacto standard for next generation data, while HDF5 is not. Maybe > someone should suggest they move to HDF5 internally for BAM version 2? 
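To make the quoted description concrete, here is a stdlib-only sketch of how BAM packs those two non-negative offsets into a single 64-bit "virtual offset" - 16 bits for the within-block position, which fits because each BGZF block holds under 64 KiB of uncompressed data. The helper names are invented for illustration; this is not the samtools or Biopython implementation:

```python
# Illustrative sketch of BGZF-style "virtual offsets" (invented helper
# names, not samtools code).  A virtual offset packs two numbers:
#   coffset - byte offset of a gzip block within the compressed file
#   uoffset - offset within that block after decompression (< 64 KiB)

def make_virtual_offset(coffset, uoffset):
    """Pack a block offset and a within-block offset into one integer."""
    assert 0 <= uoffset < 2 ** 16, "within-block offset must fit in 16 bits"
    assert 0 <= coffset < 2 ** 48, "block offset must fit in 48 bits"
    return (coffset << 16) | uoffset

def split_virtual_offset(voffset):
    """Recover (coffset, uoffset) from a packed virtual offset."""
    return voffset >> 16, voffset & 0xFFFF

# Seeking is then: raw-seek to coffset, decompress that single block,
# and skip uoffset bytes of the decompressed output.
assert split_virtual_offset(make_virtual_offset(123456, 42)) == (123456, 42)
```

Because the packed value is a plain integer, it can be stored directly in whatever index structure ends up holding the id-to-offset table.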
De-facto standards become so because more people use them at some point (which may involve a step during which a lot of people /believe/ that most people are using one format over another ;-) ), but that does not necessarily make them the best technical solution. I do believe that building on HDF5 is a better approach: - better use of resources (do not completely reinvent what already exists unless you can do better) - HDF5 is designed as a rather general storage architecture, and will let one build tailored solutions when needed. I'd be surprised if the BAM/SAM people do not know about the HDF formats, but I do not know for sure. Is there any BAM/SAM person reading? Laurent > Peter From biopython at maubp.freeserve.co.uk Sat Jun 5 12:25:52 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 5 Jun 2010 13:25:52 +0100 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: <4C0A3DA8.1020009@gmail.com> References: <4C094526.9040508@gmail.com> <4C09DCD9.60907@gmail.com> <4C0A3DA8.1020009@gmail.com> Message-ID: On Sat, Jun 5, 2010 at 1:06 PM, Laurent wrote: >> >> There is some talk on the samtools mailing list about general improvements >> to the chunking in BAM, relocating the header information (and other very >> read specific things about representing error models, indels, etc). You >> may be right that HDF5 has technical advantages over BAM version 1, >> but currently my impression is that SAM/BAM is making good headway >> with becoming a defacto standard for next generation data, while HDF5 is >> not. Maybe someone should suggest they move to HDF5 internally for BAM >> version 2? > > De-facto standards happen to become so because more people use them > at some point (which may involve step during which a lot of people /believe/ > that most of the people are using a format over an other ;-) ), but this is > indeed not necessarily making them the best technical solutions. Absolutely.
> I do believe that building on HDF5 is a better approach: > - better use of resources (do not reinvent completely what is already > existing unless better) > - HDF5 is designed as a rather general storage architecture, and will let > one build tailored solutions when needed. > > I'd be surprised the BAM/SAM do not know about HDF formats, but I do not > know for sure. Is there any BAM/SAM person reading ? I've been subscribed to the samtools mailing list for a few weeks now. I think we (or better yet the BioHDF team) should put this idea forward on their mailing list. As I said, they appear to be discussing some fairly dramatic changes to the internals of the BAM format (while intending to keep their API as close as possible), so now would be a good time to consider a switch from their blocked gzip system to something else like HDF instead. Chris has pointed out some BioHDF people will be at BOSC 2010. There is also a "HiTSeq: High Throughput Sequencing" ISMB 2010 SIG meeting at the same time as BOSC 2010, so there could be some SAM/BAM folk about in Boston to have some in-person discussions with. Will you be there this year, Laurent (or at EuroSciPy or something else instead)? Regards, Peter From chapmanb at 50mail.com Sat Jun 5 12:51:08 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Sat, 5 Jun 2010 08:51:08 -0400 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: <4C0A3DA8.1020009@gmail.com> References: <4C094526.9040508@gmail.com> <4C09DCD9.60907@gmail.com> <4C0A3DA8.1020009@gmail.com> Message-ID: <20100605125108.GA1822@kunkel> Laurent and Peter; > I do believe that building on HDF5 is a better approach: > - better use of resources (do not reinvent completely what is > already existing unless better) > - HDF5 is designed as a rather general storage architecture, and > will let one build tailored solutions when needed.
HDF5 does have lots of good technical points, although as Peter mentions the lack of community uptake is a concern. To potentially explain this, here is my personal HDF5 usage story: I took an in-depth look at PyTables for some large data sets that were overwhelming SQLite: http://www.pytables.org/moin The data loaded quickly without any issues, but the most basic thing I needed was indexes to retrieve a subset of the data by chromosome and position. Unfortunately, you can't create indexes without buying the Pro edition: http://www.pytables.org/moin/HintsForSQLUsers#Creatinganindex That immediately killed my ability to share the script so I ended my HDF5 experiment and reworked my SQLite approach. Also, echoing Peter, the BioHDF download warns you that the code is not stable, tested, or supported: http://www.hdfgroup.org/projects/biohdf/biohdf_downloads.html BAM is widely used and has tools that are meant to work on it in production environments now, while HDF tool support still feels experimental. Sometimes it is best to be practical and keep an eye on other technical solutions as they evolve, Brad From cjfields at illinois.edu Sat Jun 5 12:52:25 2010 From: cjfields at illinois.edu (Chris Fields) Date: Sat, 5 Jun 2010 07:52:25 -0500 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: References: <4C094526.9040508@gmail.com> <4C09DCD9.60907@gmail.com> Message-ID: On Jun 5, 2010, at 6:43 AM, Peter wrote: > On Sat, Jun 5, 2010 at 6:12 AM, Laurent wrote: >>> >>> I've been looking at that this afternoon as it is used in BAM files. >>> However, most gzip files (e.g. FASTA or FASTQ files) created with >>> the gzip command line tools will NOT follow the BGZF convention. >>> I personally have no need to have random access to gzipped general >>> sequence files files.
>>> >>> However, I have some proof of concept code to exploit GZIP files using >>> the BGZF structure which should give more efficient random access to >>> any part of the file (compared to simply using the gzip module) but >>> haven't yet done any benchmarking. The code is still very immature, >>> but if you want a look see the _BgzfHandle class here: >>> >>> http://github.com/peterjc/biopython/commit/416a795ef618c937bf5d9acbd1ffdf33c4ae4767 >> >> Are you using that gzip obscure option that inserts "ticks" throughout the >> file ? If so, I remember reading that this could lead to problems (I just >> can't remember which ones... may be it can be found on the web). > > I'm not sure what you are refering to - probably not. The way BGZF works > is it is a standard GZIP file, made up of multiple GZIP blocks which can > be decompressed in isolation (each with their own GZIP header). For > random access to any part of the file all you need is the block offset > (raw bytes, non-negative) and then a relative offset from the start of that > block after decompression (again, non-negative). > > The non-standard bit is they use the optional subfields in the GZIP header > to record the block size - presumably this cannot be infered any other > way. This information gives you the block offsets which are used when > constructing the index. > >>>> More generally, compression is part of the HDF5 format and this with >>>> chunks could prove the most battle-tested way to access entries >>>> randomly. >>> >>> But (thus far) no sequence data is stored in HDF5 format (is it?). >> >> Last year, in a SIG at the ISMB in Stockholm people showed that they have >> stored next-gen/short-reads using HDF5, and have demonstrated superior >> performances to BAM (not completely a surprise since to some BAM is >> reinventing some of the features in HDF5, and HDF5 has been developed for a >> longer time). I think that their slides are on slideshare (or similar >> place). 
> > There is some talk on the samtools mailing list about general improvements > to the chunking in BAM, relocating the header information (and other very > read specific things about representing error models, indels, etc). You may > be right that HDF5 has technical advantages over BAM version 1, but currently > my impression is that SAM/BAM is making good headway with becoming > a defacto standard for next generation data, while HDF5 is not. Maybe > someone should suggest they move to HDF5 internally for BAM version 2? > > Peter I have run into a few people (primarily those interested in mapping reads to genome seq) that have pointed out some problems with SAM/BAM, particularly the lack of more definitive, clear-cut definitions for regions of non-matching sequences (possibly due to many reasons, such as splice junctions, etc). Haven't actually looked at the SAM/BAM spec myself to see how correct this point is, but there are others either rolling their own solutions or threatening to. chris From lgautier at gmail.com Sat Jun 5 13:05:43 2010 From: lgautier at gmail.com (Laurent) Date: Sat, 05 Jun 2010 15:05:43 +0200 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: References: <4C094526.9040508@gmail.com> <4C09DCD9.60907@gmail.com> <4C0A3DA8.1020009@gmail.com> Message-ID: <4C0A4BA7.5090102@gmail.com> On 05/06/10 14:25, Peter wrote: > On Sat, Jun 5, 2010 at 1:06 PM, Laurent wrote: >>> >>> There is some talk on the samtools mailing list about general improvements >>> to the chunking in BAM, relocating the header information (and other very >>> read specific things about representing error models, indels, etc). You >>> may be right that HDF5 has technical advantages over BAM version 1, >>> but currently my impression is that SAM/BAM is making good headway >>> with becoming a defacto standard for next generation data, while HDF5 is >>> not. Maybe someone should suggest they move to HDF5 internally for BAM >>> version 2? 
>> >> De-facto standards happen to become so because more people use them >> at some point (which may involve step during which a lot of people /believe/ >> that most of the people are using a format over an other ;-) ), but this is >> indeed not necessarily making them the best technical solutions. > > Absolutley. > >> I do believe that building on HDF5 is a better approach: >> - better use of resources (do not reinvent completely what is already >> existing unless better) >> - HDF5 is designed as a rather general storage architecture, and will let >> one build tailored solutions when needed. >> >> I'd be surprised the BAM/SAM do not know about HDF formats, but I do not >> know for sure. Is there any BAM/SAM person reading ? > > I've been subscribed to the samtools mailing list for a few weeks now. I think > we (or better yet the BioHDF team) should put this idea forward on their > mailing list. As I said, they appear to be discussing some fairly dramatic > changes to the internals of the BAM format (while intending to keep their > API as close as possible), so now would be a good time to consider a > switch from their blocked gzip system to something else like HDF instead. > > Chris has pointed out some BioHDF people will be at BOSC 2010. There > is also a "HiTSeq: High Throughput Sequencing" ISMB 2010 SIG meeting > at the same time as BOSC 2010, so there could be some SAM/BAM > folk about in Boston to have some in person discussions with. Will you > be there is year Laurent (or at EuroSciPy or something else instead)? I'll be at BOSC / ISMB. Hopefully we will all stumble upon each other. Best, Laurent > Regards, > > Peter From cjfields at illinois.edu Sat Jun 5 12:56:13 2010 From: cjfields at illinois.edu (Chris Fields) Date: Sat, 5 Jun 2010 07:56:13 -0500 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? 
In-Reply-To: References: <4C094526.9040508@gmail.com> Message-ID: <3D71994B-1308-4D57-AD60-F9B66B77B063@illinois.edu> On Jun 5, 2010, at 6:51 AM, Peter wrote: > On Sat, Jun 5, 2010 at 11:59 AM, Chris Fields wrote: >> On Jun 4, 2010, at 2:04 PM, Peter wrote: >>> >>> But (thus far) no sequence data is stored in HDF5 format (is it?). >>> >>> Peter >> >> There will be a presentation this year at BOSC on BioHDF (HDF5 for bioinformatics). >> There is a website: >> >> http://www.hdfgroup.org/projects/biohdf/ > > It looks like they are making good progress - with SAM/BAM conversion to and > from BioHDF in place. Still, as they say: > >>>> The current BioHDF distribution is a pipleline prototype designed to show >>>> the suitability of HDF5 as a biological data store and to determine how to >>>> best implement an HDF5-based bioinformatics pipeline. It is in source code >>>> format only. The code builds a set of command-line tools which allow >>>> uploading and extracting DNA/RNA sequence and alignment data from >>>> next-generation gene sequencers. These files have been provided with the >>>> same BSD license used by HDF5 >>>> >>>> ... >>>> >>>> Please be aware that the code contained in it will be in a high state of flux >>>> in the immediate future. > > This certainly looks like something to keep an eye on. > > In any case, getting back to the thread's purpose - Bio.SeqIO.index() aims to > give random access to sequences by their ID for many different file formats. > There has been little interest in extending this to support gzipped > files. However, > extending the code to store the id/offset lookup table on disk with SQLite3 > (rather than in memory as a Python dict) would seem welcome. I'll be > refreshing the github branch where I was working on this earlier in the year... > > Peter We have seen (on the bioperl side) some interest in allowing gzip/bzip and others in via the PerlIO layer, and also AnyDBM using SQLite. 
Mark Jensen actually did a little work along these lines, though I'm not sure how clear-cut the support is at the moment. chris From cjfields at illinois.edu Sat Jun 5 13:31:37 2010 From: cjfields at illinois.edu (Chris Fields) Date: Sat, 5 Jun 2010 08:31:37 -0500 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: <20100605125108.GA1822@kunkel> References: <4C094526.9040508@gmail.com> <4C09DCD9.60907@gmail.com> <4C0A3DA8.1020009@gmail.com> <20100605125108.GA1822@kunkel> Message-ID: <51440B55-36B6-456F-9750-30AC67B95D48@illinois.edu> On Jun 5, 2010, at 7:51 AM, Brad Chapman wrote: > Laurent and Peter; > >> I do believe that building on HDF5 is a better approach: >> - better use of resources (do not reinvent completely what is >> already existing unless better) >> - HDF5 is designed as a rather general storage architecture, and >> will let one build tailored solutions when needed. > > HDF5 does has lots of good technical points, although as Peter mentions > the lack of community uptake is a concern. To potentially explain this, > here is my personal HDF5 usage story: I took an in depth look at PyTables > for some large data sets that were overwhelming SQLite: > > http://www.pytables.org/moin > > The data loaded quickly without any issues, but the most basic thing > I needed was indexes to retrieve a subset of the data by chromosome > and position. Unfortunately, you can't create indexes without > buying the Pro edition: > > http://www.pytables.org/moin/HintsForSQLUsers#Creatinganindex > > That immediately killed my ability to share the script so I ended > my HDF5 experiment and reworked my SQLite approach. 
> > Also, echoing Peter, the BioHDF download warns you that the code is > not stable, tested, or supported: > > http://www.hdfgroup.org/projects/biohdf/biohdf_downloads.html > > BAM is widely used and has tools that are meant to work on it > in production environments now, while HDF tool support still feels > experimental. Sometimes it is best to be practical and keep an eye > on other technical solutions as they evolve, > > Brad Yes, will be interesting to see how far along it is at BOSC. chris From lgautier at gmail.com Sat Jun 5 14:07:00 2010 From: lgautier at gmail.com (Laurent) Date: Sat, 05 Jun 2010 16:07:00 +0200 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: References: Message-ID: <4C0A5A04.1060207@gmail.com> On 05/06/10 15:02, biopython-request at lists.open-bio.org wrote: > Laurent and Peter; > >> I do believe that building on HDF5 is a better approach: >> - better use of resources (do not reinvent completely what is >> already existing unless better) >> - HDF5 is designed as a rather general storage architecture, and >> will let one build tailored solutions when needed. > > HDF5 does has lots of good technical points, although as Peter mentions > the lack of community uptake is a concern. To potentially explain this, > here is my personal HDF5 usage story: I took an in depth look at PyTables > for some large data sets that were overwhelming SQLite: > > http://www.pytables.org/moin > > The data loaded quickly without any issues, but the most basic thing > I needed was indexes to retrieve a subset of the data by chromosome > and position. Unfortunately, you can't create indexes without > buying the Pro edition: > > http://www.pytables.org/moin/HintsForSQLUsers#Creatinganindex > > That immediately killed my ability to share the script so I ended > my HDF5 experiment and reworked my SQLite approach. 
PyTables is already a dialect of HDF5 (not necessarily readable by other HDF5 software/libraries), and the "Pro Edition" adds indexing capabilities, I think. h5py is the alternative. Also, indexing (as in "hash function + tree") can be done using SQLite, and both (HDF5 and SQLite) can complement each other very efficiently. [I have designed and implemented ad-hoc hybrid solutions on several occasions, and never regretted it so far] > Also, echoing Peter, the BioHDF download warns you that the code is > not stable, tested, or supported: Not tested is not good, but that's mostly a matter of having unit tests. Also, I am referring to using HDF5 (mature, tested), not necessarily BioHDF as a higher layer (which I have no experience at all with). Should BioHDF not have tests and release cycles, it will probably not be the answer for me either. Along those lines, a very recent post advertising a position at the FHCRC (the Bioconductor group) suggests that HDF5 (and netCDF) are directions being considered over there as well. > http://www.hdfgroup.org/projects/biohdf/biohdf_downloads.html > > BAM is widely used and has tools that are meant to work on it > in production environments now, while HDF tool support still feels > experimental. I had that feeling with BAM/SAM tools at the time, and I knew my way around HDF5 a bit. > Sometimes it is best to be practical and keep an eye > on other technical solutions as they evolve, I am reading otherwise that not everyone using BAM/SAM is happy with it (and some are threatening to fork). I might well be wrong, but I don't think that BAM/SAM has (yet) a place so prominent that efforts should first go into converting to it. > Brad From chapmanb at 50mail.com Sat Jun 5 20:42:23 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Sat, 5 Jun 2010 16:42:23 -0400 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk?
In-Reply-To: <2133BED5-43A4-4909-87CF-ABF12AC63C9A@gmail.com> References: <4C094526.9040508@gmail.com> <4C09DCD9.60907@gmail.com> <4C0A3DA8.1020009@gmail.com> <20100605125108.GA1822@kunkel> <2133BED5-43A4-4909-87CF-ABF12AC63C9A@gmail.com> Message-ID: <20100605204135.GB1822@kunkel> Aaron and Laurent; Aaron: > I am facing a similar situation. Brad, out of curiosity did you also try h5py? > > http://code.google.com/p/h5py/ Yes, I think that's the right way to go. After I found out about the indexing I re-aligned my thinking around h5py, which is more hierarchical than table based. Ironically, this led me to a more compact binned solution which would work fine within SQLite, which is why I never got very far with h5py. I would start with this next time a need arises. Laurent: > Not tested is not good, but that's mostly a matter of having unit tests. I'm not knocking the code, only reading the warnings on the download page. Hopefully this will shape up to be something usable, and like others am looking forward to the BOSC presentation. > Also I am referring to using HDF5 (mature, tested), not necessarily > BioHDF as an higher layer (which I have no experience at all with). > Should BioHDF not have tests and release cycles, it will probably > not be the answer for me either. > > Along those lines, a very recent post advertising for a position at > FHRC (bioconductor's group) suggests that HDF5 (and netCDF) are > directions considered over there as well. That's good news. Essentially what I wanted was to build a data structure that I could sub-select out of into an R data.frame, ala sqldf: http://code.google.com/p/sqldf/ > I am reading otherwise that not everyone using BAM/SAM is happy with > it (and some threatening to fork). > I might well be wrong, but I don't think that BAM/SAM has (yet) a > place so prominent that efforts should first go into converting to > it. 
Oh please don't ruin my day by bringing up that possibility; BAM features pretty prominently in my daily work. Broad's Picard and GATK pipelines are based solely on BAM, so I might be biased due to my interactions with them. Hopefully if the community moves to something else for alignment representation a smooth transition is planned. Famous last words, Brad From gnd9 at cox.net Sun Jun 6 23:24:15 2010 From: gnd9 at cox.net (Gary) Date: Sun, 6 Jun 2010 18:24:15 -0500 Subject: [Biopython] Helpa Newbie Please.py Message-ID: <841F4789085F4EB9A6A353F8FD755D7C@garyPC> Subject: Helpa Newbie Please.py Just started into this fantastic project & can't get past Cookbook 2.4.2 parsing example!

from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank"):
    print seq_record.id
    print repr(seq_record.seq)
    print len(seq_record)

ALWAYS REPLIES:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python26\lib\site-packages\Bio\SeqIO\__init__.py", line 483, in parse
IOError: [Errno 2] No such file or directory: 'ls_orchid.gbk'

I do have the file - in fact I put it in several locations.... genbank folder, SeqIO folder, Python2.6 folder.... Does anyone know in which folder SeqIO is looking for the orchid file? I assume that's my problem. Thanks in advance Helpa Newbie Please.py From jordan.r.willis at Vanderbilt.Edu Sun Jun 6 23:36:28 2010 From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R) Date: Sun, 6 Jun 2010 18:36:28 -0500 Subject: [Biopython] Helpa Newbie Please.py In-Reply-To: <841F4789085F4EB9A6A353F8FD755D7C@garyPC> Message-ID: Hi Gary, Python will always look for the file in whichever directory you started Python in. I would go to the command line and start Python in the same directory in which you have ls_orchid.gbk. On another note, I'm not sure when this changed, but SeqIO.parse will only accept a file object, contrary to the example. So try this.
from Bio import SeqIO
for seq_record in SeqIO.parse(open("ls_orchid.gbk"), "genbank"):
    print seq_record.id
    print repr(seq_record.seq)
    print len(seq_record)

Jordan On 6/6/10 6:24 PM, "Gary" wrote: Subject: Helpa Newbie Please.py Just started into this fantastic project & can't get past Cookbook 2.4.2 parsing example!

from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank"):
    print seq_record.id
    print repr(seq_record.seq)
    print len(seq_record)

ALWAYS REPLIES:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python26\lib\site-packages\Bio\SeqIO\__init__.py", line 483, in parse
IOError: [Errno 2] No such file or directory: 'ls_orchid.gbk'

I do have the file - in fact I put it in several locations.... genbank folder, SeqIO folder, Python2.6 folder.... Does anyone know in which folder SeqIO is looking for the orchid file? I assume that's my problem. Thanks in advance Helpa Newbie Please.py _______________________________________________ Biopython mailing list - Biopython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From mjldehoon at yahoo.com Mon Jun 7 00:30:52 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sun, 6 Jun 2010 17:30:52 -0700 (PDT) Subject: [Biopython] Helpa Newbie Please.py In-Reply-To: <841F4789085F4EB9A6A353F8FD755D7C@garyPC> Message-ID: <352042.44056.qm@web62404.mail.re1.yahoo.com> The file ls_orchid.gbk should be in your current directory (the one in which you start python). --Michiel. --- On Sun, 6/6/10, Gary wrote: > From: Gary > Subject: [Biopython] Helpa Newbie Please.py > To: biopython at lists.open-bio.org > Date: Sunday, June 6, 2010, 7:24 PM > Subject: Helpa Newbie Please.py > > > Just started into this fantastic project & can't get > past Cookbook 2.4.2 parsing example! > > from Bio import SeqIO > for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank"): >     print seq_record.id >     print repr(seq_record.seq) >     
print len(seq_record) > > ALWAYS REPLIES: > > Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "C:\Python26\lib\site-packages\Bio\SeqIO\__init__.py", line 483, in parse
> IOError: [Errno 2] No such file or directory: 'ls_orchid.gbk'
> > I do have the file - in fact I put it in several > locations.... genbank folder, SeqIO folder, Python2.6 > folder.... > Does anyone know in which folder SeqIO is looking for the > orchid file? > I assume that's my problem > > Thanks in advance > Helpa Newbie Please.py > > > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From rjalves at igc.gulbenkian.pt Mon Jun 7 08:01:25 2010 From: rjalves at igc.gulbenkian.pt (Renato Alves) Date: Mon, 07 Jun 2010 09:01:25 +0100 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: References: <4C08B4EE.4020508@igc.gulbenkian.pt> Message-ID: <4C0CA755.6060608@igc.gulbenkian.pt> Quoting Peter on 06/04/2010 09:42 AM: > Unfortunately I was inconsistent about which order I used in my email > (gzip vs on disk indexes) so I'm not sure which you are talking about. > Are you saying supporting on disk indexes would be your priority (even > though you did ask look at gzip support in the past)? Yes, exactly. The gzip support became a non-priority, at least for our current local uses. On the other hand, disk support would be quite helpful. As a matter of fact we borrowed a little of your SeqIO.index() sqlite code you have on a github branch. Renato.
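As an illustration of the id-to-offset idea being discussed, here is a minimal stdlib-only sketch for FASTA files backed by sqlite3. The build_index() and fetch() names are invented, and this is not the code from Peter's github branch - just the general mechanism of recording where each record starts and seeking back to it later:

```python
# Sketch of an on-disk id -> file-offset index for a FASTA file, using
# only the standard library.  This is NOT the real Bio.SeqIO.index()
# implementation, merely the idea behind it.
import sqlite3

def build_index(fasta_path, index_path=":memory:"):
    """Scan a FASTA file once, recording each record's start offset."""
    con = sqlite3.connect(index_path)
    con.execute("CREATE TABLE offsets (id TEXT PRIMARY KEY, offset INTEGER)")
    with open(fasta_path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                seq_id = line[1:].split(None, 1)[0].decode()
                con.execute("INSERT INTO offsets VALUES (?, ?)",
                            (seq_id, offset))
    con.commit()
    return con

def fetch(con, fasta_path, seq_id):
    """Random access: seek to the stored offset and read one record."""
    (offset,) = con.execute(
        "SELECT offset FROM offsets WHERE id = ?", (seq_id,)).fetchone()
    lines = []
    with open(fasta_path, "rb") as handle:
        handle.seek(offset)
        lines.append(handle.readline())  # the ">id ..." title line
        for line in handle:
            if line.startswith(b">"):  # start of the next record
                break
            lines.append(line)
    return b"".join(lines)
```

Passing a filename instead of ":memory:" would put the lookup table on disk, so a later run could reuse it without re-scanning the sequence file - which is exactly the saving being discussed in this thread.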
From biopython at maubp.freeserve.co.uk Mon Jun 7 08:39:50 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 7 Jun 2010 09:39:50 +0100 Subject: [Biopython] Helpa Newbie Please.py In-Reply-To: References: <841F4789085F4EB9A6A353F8FD755D7C@garyPC> Message-ID: On Mon, Jun 7, 2010 at 12:36 AM, Willis, Jordan R wrote: > Hi Gary, > > Python will always look for the file in whichever directory you started > python in. I would go to the command line and start python in the same > directory in which you have ls_orchid.gbk. > > On another note, I'm not sure when this changed but SeqIO.parse > will only accept a file object contrary to the example. So try this. If using Biopython 1.54 or later, you can use a filename or handle. http://news.open-bio.org/news/2010/04/biopython-seqio-and-alignio-easier/ If using Biopython 1.53 or older, you need a handle (file object) as in Jordan's example: > from Bio import SeqIO > for seq_record in SeqIO.parse(open("ls_orchid.gbk"), "genbank"): > print seq_record.id > print repr(seq_record.seq) > print len(seq_record) Gary - If you don't have the file in your current working directory, it might be simplest to give a full path, for example:

from Bio import SeqIO
filename = r"C:\My Documents\new work\ls_orchid.gbk"
for seq_record in SeqIO.parse(filename, "genbank"):
    print seq_record.id
    print repr(seq_record.seq)
    print len(seq_record)

When defining the filename, using r"text" with a leading r means a raw string. This means \n or \t etc won't be treated as a newline or a tab (which is what Python does normally) - something to beware of as Windows uses \ in paths. Peter From e.picardi at unical.it Mon Jun 7 09:10:26 2010 From: e.picardi at unical.it (Ernesto) Date: Mon, 7 Jun 2010 11:10:26 +0200 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk?
In-Reply-To: <4C0CA755.6060608@igc.gulbenkian.pt> References: <4C08B4EE.4020508@igc.gulbenkian.pt> <4C0CA755.6060608@igc.gulbenkian.pt> Message-ID: <8E4EFAE6-63F8-4355-9514-24E8C14474F8@unical.it> Hi all, I followed the interesting discussion about indexing. I think that it is a hot topic given the huge amount of data released by the new sequencing technologies. I have never used Bio.SeqIO.index() but I'd like to test it and I'd also like to know how to use it. Is there a simple tutorial? In the past I tried pytables, based on the HDF5 library, and I was impressed by its high speed. However, the indexing is not supported, at least for the free version. Moreover, strings of non-fixed length cannot easily be handled and stored. For example, in order to store EST sequences you need to know a priori the maximum length in order to optimize the storage. As an alternative, VLAs (variable length arrays) could be used, but the storing performance goes down quickly. A few days ago I tried to store millions of records using SQLite and I found it very slow, although my code is not optimized (I'm not a computer scientist but a biologist who likes python and biopython). However, as an alternative, I found the tokyocabinet library (http://1978th.net/tokyocabinet/), which is a modern implementation (in C) of DBM. There are a lot of python wrappers, like tokyocabinet-python 0.5.0 (http://pypi.python.org/pypi/tokyocabinet-python/), that work efficiently and guarantee high speed and compression. Tokyocabinet implements hash databases, B-tree databases and table databases, also giving the possibility to store info on disk or in memory. In the case of table databases it should be able to index specific columns. Hope this helps, Ernesto On 07/06/2010, at 10:01, Renato Alves wrote: > Quoting Peter on 06/04/2010 09:42 AM: >> Unfortunately I was inconsistent about which order I used in my email >> (gzip vs on disk indexes) so I'm not sure which you are talking about.
>> Are you saying supporting on disk indexes would be your priority (even >> though you did ask about gzip support in the past)? > > Yes exactly. The gzip support became a non-priority, at least for our > current local uses. On the other hand, disk support would be quite helpful. > As a matter of fact we borrowed a little of your SeqIO.index() sqlite > code you have on a github branch. > > Renato. > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From biopython at maubp.freeserve.co.uk Mon Jun 7 09:49:17 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 7 Jun 2010 10:49:17 +0100 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: <8E4EFAE6-63F8-4355-9514-24E8C14474F8@unical.it> References: <4C08B4EE.4020508@igc.gulbenkian.pt> <4C0CA755.6060608@igc.gulbenkian.pt> <8E4EFAE6-63F8-4355-9514-24E8C14474F8@unical.it> Message-ID: On Mon, Jun 7, 2010 at 10:10 AM, Ernesto wrote: > Hi all, > > I followed the interesting discussion about indexing. I think that > it is a hot topic given the huge amount of data released by the > new sequencing technologies. Yes - although the discussion has gone beyond just indexing to also cover storing data. > I have never used Bio.SeqIO.index() but I'd like to test it and I'd > also like to know how to use it. Is there a simple tutorial? Bio.SeqIO.index() is included in Biopython 1.52 onwards. It is covered in the main Biopython Tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf There are also a few blog posts about it (linked to at the start of this thread): http://news.open-bio.org/news/2009/09/biopython-seqio-index/ http://news.open-bio.org/news/2010/04/partial-seq-files-biopython/ > In the past I tried pytables, based on the HDF5 library, and I was > impressed by its high speed. However, the indexing is not supported, > at least for the free version. Yes, Brad wrote about this frustration earlier. > Moreover, strings of non-fixed length cannot easily be handled and > stored. For example, in order to store EST sequences you need to > know a priori the maximum length in order to optimize the storage. > As an alternative, VLAs (variable length arrays) could be used, but > the storing performance goes down quickly. The BioHDF project has probably thought about this kind of issue. However, for Bio.SeqIO.index() we don't store the sequences in a database - just the associated file offsets. The current Bio.SeqIO.index() code works by scanning a sequence file and storing a lookup table of record identifiers and file offsets in a Python dictionary. This works very well, but once you get into tens of millions of records the memory requirements become a problem. For instance, running a 64bit Python can actually be important as you may need more than 4GB of RAM. Also, for very large files, the time taken to build the index gets longer - so having to reindex the file each time can become an issue. Saving the index to disk solves this, and can also let us avoid keeping the whole lookup table in memory. > A few days ago I tried to store millions of records using SQLite and I > found it very slow, although my code is not optimized (I'm not a > computer scientist but a biologist who likes python and biopython). If you search the Biopython development mailing list you'll see we've already done some work using SQLite to store the file offsets. There is an experimental branch on github here if you are curious, BUT this is not ready for production use: http://github.com/peterjc/biopython/tree/index-sqlite > However, as an alternative, I found the tokyocabinet library > (http://1978th.net/tokyocabinet/), which is a modern implementation (in C) > of DBM.
There are a lot of python wrappers, like tokyocabinet-python > 0.5.0 (http://pypi.python.org/pypi/tokyocabinet-python/), that work > efficiently and guarantee high speed and compression. Tokyocabinet > implements hash databases, B-tree databases and table databases, > also giving the possibility to store info on disk or in memory. In > the case of table databases it should be able to index specific columns. Tokyocabinet is certainly an interesting project, but this isn't the issue that Bio.SeqIO.index() is trying to solve. You might be interested in Brad's blog post from last year: http://bcbio.wordpress.com/2009/05/10/evaluating-key-value-and-document-stores-for-short-read-data/ Regards, Peter From aboulia at gmail.com Mon Jun 7 10:38:33 2010 From: aboulia at gmail.com (Kevin Lam) Date: Mon, 7 Jun 2010 18:38:33 +0800 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: References: <4C08B4EE.4020508@igc.gulbenkian.pt> <4C0CA755.6060608@igc.gulbenkian.pt> <8E4EFAE6-63F8-4355-9514-24E8C14474F8@unical.it> Message-ID: > > Tokyocabinet is certainly an interesting project, but this isn't the issue > that Bio.SeqIO.index() is trying to solve. You might be interested in > Brad's blog post from last year: > > > http://bcbio.wordpress.com/2009/05/10/evaluating-key-value-and-document-stores-for-short-read-data/ > > Regards, > > Peter > > Hi Peter > Can I summarise it? I think a lot of well-meaning people are pushing for their > fav db but > > using a disk-based db like sqlite for Bio.SeqIO.index() for recording the > file offset is going to be the best way to do it versus trying to find > another suitable non-mysql db variant to 'databasify' the short read-data? > As the latter would be relatively easy for anyone else interested to > experiment to code their own scripts for their fav db > > :) > > > This post on mongodb is interesting btw. > http://blog.zawodny.com/2010/05/22/mongodb-early-impressions/ > > From biopython at maubp.freeserve.co.uk Mon Jun 7 11:02:12 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 7 Jun 2010 12:02:12 +0100 Subject: [Biopython] Bio.SeqIO.index() - gzip support and/or index stored on disk? In-Reply-To: References: <4C08B4EE.4020508@igc.gulbenkian.pt> <4C0CA755.6060608@igc.gulbenkian.pt> <8E4EFAE6-63F8-4355-9514-24E8C14474F8@unical.it> Message-ID: On Mon, Jun 7, 2010 at 11:38 AM, Kevin Lam wrote: > > Hi Peter > Can I summarise it? I think a lot of well-meaning people are pushing for their > fav db but using a disk-based db like sqlite for Bio.SeqIO.index() for > recording the file offset is going to be the best way to do it versus trying > to find another suitable non-mysql db variant to 'databasify' the short > read-data? As the latter would be relatively easy for anyone else > interested to experiment to code their own scripts for their fav db > > :) I think that is a good summary. Bio.SeqIO.index() is for random access to assorted existing file formats (e.g. FASTA, FASTQ, SFF) by record identifier string and works with a lookup table of offsets. We are going to try storing this lookup table in SQLite. The proof of concept code works, is cross-platform, adds no external dependencies - and seems fast enough too. As we add more file formats to Bio.SeqIO, in most cases we can add support for indexing them in the same way. Maybe one day this will include BioHDF as it matures? Peter From biopython at maubp.freeserve.co.uk Mon Jun 7 13:40:54 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 7 Jun 2010 14:40:54 +0100 Subject: [Biopython] EuroSciPy 2010 conference in Paris In-Reply-To: References: Message-ID: On Sat, Jun 5, 2010 at 3:49 PM, Peter wrote: > Hi all, > > Are any Biopython folk planning to be at the EuroSciPy > conference in Paris this year (July 2010)?
They are still > finalising the Scientific track, but the list of tutorials is > quite interesting already: > > http://www.euroscipy.org/conference/euroscipy2010 > > Peter Hi all, The track list for the EuroSciPy 2010 Scientific track has now been announced, and I'm delighted that I will be able to present a talk on Biopython (likely 4pm Saturday 10 July). While I hope there will be some other Biopython users there, this is a nice opportunity to meet the broader scientific python community. There are still places at the moment if you want to attend: http://www.euroscipy.org/conference/euroscipy2010 Unfortunately I will not be attending BOSC or ISMB this year. However Brad Chapman will be there to present the annual "Biopython Project Update" talk (as well as helping to organise this year's BOSC and the associated CodeFest event preceding it). I'd love to have been there too, but I'm sure everyone attending will have a great time. Again, registration is still open: http://www.open-bio.org/wiki/BOSC_2010 http://www.open-bio.org/wiki/Codefest_2010 Regards, Peter P.S. Those of you in North America might also be interested in the main SciPy conference in Austin, Texas (28 June to 3 July 2010): http://conference.scipy.org/scipy2010/ From jordan.r.willis at Vanderbilt.Edu Tue Jun 8 03:36:05 2010 From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R) Date: Mon, 7 Jun 2010 22:36:05 -0500 Subject: [Biopython] Is this feasible? Message-ID: Hello, I'm relatively new to both programming and bioinformatics. I wanted to know if anyone would know how to do something like this: I have a list of sequences that have evolved away from a given germline sequence. I was going to use biopython to iteratively map the closest mutant to the germline and pull it out of the list. I would then align these two sequences and give a score.
I would then take the remaining sequences and find the one that is closest to the one that was just taken out of the list, and do the same thing until the list is empty. The output would look something like this: Prints to screen: -------------------------------------------------------------------------------------------------------------------- Round one: Seed(germline) Closest scoring sequence --> with a bit score of Round two: Closest scoring sequence to seed Next closest scoring sequence(s) ---> with a bit score of.... ... Round N: Seed Next to last closest scoring sequence Last place sequence(s) ---> with a bit score of.... -------------------------------------------------------------------------------------------------------------------- In a way it's sort of a tractable phylogeny tree but with simpler sequences.

def run_blast(command):
    subprocess.call(str(command), shell=(sys.platform != "win32"))
    xml_return = 'tmp.xml'
    return xml_return

def main():
    Database = [seq_record for seq_record in SeqIO.parse('Input.fasta', "fasta")]
    germline = SeqIO.read('germline.fasta', "fasta")
    while Database:
        cline = NcbiblastpCommandline(query=germline, db=Database, out='tmp.xml')
        blast_records = NCBIXML.parse(run_blast(cline))
        print blast_records.alignment[0]
        print germline
        print "\n\n\n"
        germline = blast_records.alignment[0]
        Database.remove(germline)

I guess my first question is: does this seem logical? Is blast the best algorithm to use for this scenario? The other problem is creating my own database. I read the documentation, and it said you could create your own database to run local blast (which I have), I just have no idea how to do that. The second thing is in blast_records.alignment[0], will this always give me the best scoring sequence? Any help would be much appreciated. Thanks for the help, Jordan From chapmanb at 50mail.com Tue Jun 8 14:17:54 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 8 Jun 2010 10:17:54 -0400 Subject: [Biopython] Is this feasible?
In-Reply-To: References: Message-ID: <20100608141754.GE2003@kunkel> Jordan; > Hello, I'm relatively new to both programming and bioinformatics. Welcome. Happy to have you on board. > I have a list of sequences that have evolved away from a given > germline sequence. I was going to use biopython to iteratively map the > closest mutant to the germline and pull it out of the list. I would > then align these two sequences and give a score. I would then take the > remaining sequences and find the one that is closest to the one that > was just taken out of the list and do the same thing until the list is > empty. The BLASTing and finding the closest match makes good sense. Why do you need to remove the hit sequences from the reference database? Could sequences in your input list not match to the same gene or different sections of the same gene? This also biases the search depending on your input order. If you do need to BLAST without replacement, the approach I would take is to keep a list of hit IDs you've already used, and avoid using these alignments again when parsing subsequent searches. Some specific thoughts on your pseudocode: > Database = [seq_record for seq_record in SeqIO.parse('Input.fasta', "fasta")] Unless your database is small you probably don't want to do this, as it loads the entire file into memory. Instead you'd want to index this file with formatdb: formatdb -p T -i Input.fasta The subprocess module is the way to go: http://docs.python.org/library/subprocess.html subprocess.call(["formatdb", "-p", "T", "-i", "Input.fasta"]) > germline = SeqIO.read('germline.fasta', "fasta") > while Database: > cline = NcbiblastpCommandline(query=germline, db=Database, out='tmp.xml') > blast_records = NCBIXML.parse(run_blast(cline)) > print blast_records.alignment[0] > print germline Here's what you could do for your removal search: keep a list of hits you've seen and pick the first one that is not in that list:

hits_seen = []
for query in to_search:
    blast_rec = _do_your_blast_and_parse_it()
    new_hit = None
    for align in blast_rec.alignments:
        if align.title not in hits_seen:
            new_hit = align
            break
    if new_hit:
        hits_seen.append(new_hit.title)
        _output_what_you_want(new_hit)

> I guess my first question is: does this seem logical? Is blast the > best algorithm to use for this scenario? Hopefully this helps move you forward. > The other problem is > creating my own database. I read the documentation, and it said > you could create your own database to run local blast (which I > have), I just have no idea how to do that. That's the formatdb command listed above. There's plenty to read on the web about commandline options for DNA/protein databases. > The second thing is in > blast_records.alignment[0], will this always give me the best scoring > sequence? Yes, they are ordered by score just like in the raw BLAST file. Hope this helps, Brad From jordan.r.willis at Vanderbilt.Edu Thu Jun 10 22:47:36 2010 From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R) Date: Thu, 10 Jun 2010 17:47:36 -0500 Subject: [Biopython] SeqIO.dict Message-ID: Hello Community. I was wondering if you could convert a dictionary object back into a fasta file. Dictionary = SeqIO.to_dict(SeqIO.parse('my.file', "fasta")) Removed some objects from dictionary....
SeqIO.write(Dictionary, 'my.file', "fasta") This is how I have removed items from my.file, but I need to convert it back into a fasta file so it can be read by blast. Thanks. From biopython at maubp.freeserve.co.uk Thu Jun 10 23:33:32 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 11 Jun 2010 00:33:32 +0100 Subject: [Biopython] SeqIO.dict In-Reply-To: References: Message-ID: On Thu, Jun 10, 2010 at 11:47 PM, Willis, Jordan R wrote: > Hello Community. > > I was wondering if you could convert a dictionary object back into a fasta file. > > > Dictionary = SeqIO.to_dict(SeqIO.parse('my.file', "fasta")) > > Removed some objects from dictionary.... > > SeqIO.write(Dictionary, 'my.file', "fasta") > > > This is how I have removed items from my.file, but I > need to convert it back into a fasta file so it can be > read by blast. Doing Dictionary.values() will give a list of SeqRecord objects, which you can give to the SeqIO.write(...) function to save to a FASTA file. i.e.

Dictionary = SeqIO.to_dict(SeqIO.parse('my.file', "fasta"))
# edit dictionary... then:
SeqIO.write(Dictionary.values(), "new.fas", "fasta")

If you care about the order then it is a little more complicated. Peter From jordan.r.willis at Vanderbilt.Edu Fri Jun 11 02:12:45 2010 From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R) Date: Thu, 10 Jun 2010 21:12:45 -0500 Subject: [Biopython] SeqIO.dict In-Reply-To: Message-ID: Thanks Peter, One last question: During my blast runs, at about midway through, it will truncate my sequences... For example:

GYTFTNFA ----> Query
GY FTNFA ----> Score of: 34.0
GYIFTNFA ----> Template

Template becomes the query:

GYIFTN ----> Query
GYIFTN ----> Score of: 30.0
GYIFTN ----> Template

All sequences are the exact same size and I can't figure out which blast parameters would show all 8 amino acids every time. Jordan On 6/10/10 6:33 PM, "Peter" wrote: On Thu, Jun 10, 2010 at 11:47 PM, Willis, Jordan R wrote: > Hello Community. > > I was wondering if you could convert a dictionary object back into a fasta file. > > > Dictionary = SeqIO.to_dict(SeqIO.parse('my.file', "fasta")) > > Removed some objects from dictionary.... > > SeqIO.write(Dictionary, 'my.file', "fasta") > > > This is how I have removed items from my.file, but I > need to convert it back into a fasta file so it can be > read by blast. Doing Dictionary.values() will give a list of SeqRecord objects, which you can give to the SeqIO.write(...) function to save to a FASTA file. i.e. Dictionary = SeqIO.to_dict(SeqIO.parse('my.file', "fasta")) #edit dictionary... then: SeqIO.write(Dictionary.values(), "new.fas", "fasta") If you care about the order then it is a little more complicated. Peter From biopython at maubp.freeserve.co.uk Fri Jun 11 09:14:56 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 11 Jun 2010 10:14:56 +0100 Subject: [Biopython] SeqIO.dict In-Reply-To: References: Message-ID: On Fri, Jun 11, 2010 at 3:12 AM, Willis, Jordan R wrote: > Thanks Peter, > > One last question: > > During my blast runs, at about midway through, it will truncate > my sequences... For example: > > GYTFTNFA ----> Query > GY FTNFA ----> Score of: 34.0 > GYIFTNFA ----> Template > Template becomes the query: > GYIFTN ----> Query > GYIFTN ----> Score of: 30.0 > GYIFTN ----> Template > > > All sequences are the exact same size and I can't figure out which > blast parameters would show all 8 amino acids every time. > > Jordan Hi Jordan, I'm sorry but I don't understand what you are trying to describe. Note that BLAST finds local alignments, not global alignments (it will not try and align all of your query to a match in the database). Peter From chapmanb at 50mail.com Fri Jun 11 11:16:37 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 11 Jun 2010 07:16:37 -0400 Subject: [Biopython] Is this feasible?
In-Reply-To: References: <20100608141754.GE2003@kunkel> Message-ID: <20100611111637.GA29480@sobchak.mgh.harvard.edu> Jordan; > Hi Brad thanks for the excellent feedback. I understand why you > think I wouldn't need to make a new database every time but here > is the thing. For my first hit, I want to use that sequence as a > query in a new blast run. So it must be removed and converted into a > fasta format to be used as the new input. So I have taken the id ( > blast_record.alignments[0].title) and I am trying to remove it from > input.fasta. Does the fasta parser have a remove feature based on ID? > That would be ideal. > > It would go something like this: > > Germline > match1 > > Next round: > > match1 > match2 > > Next round: > > match2: > match3: > > Where each time the hit becomes the next query. It sounds a bit like you are re-implementing the functionality of PSI-BLAST: http://en.wikipedia.org/wiki/BLAST#Program Have you given psiblast a try and found some type of issue with the approach? You can run and parse PSI-BLAST with Bio.Blast.Applications and the Bio.Blast parsers: http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#htoc93 If you have blast+ installed, you can get all the commandline options with: psiblast -help or with blast2, do: blastpgp - Hope this helps, Brad From sdavis2 at mail.nih.gov Fri Jun 11 14:37:50 2010 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Fri, 11 Jun 2010 10:37:50 -0400 Subject: [Biopython] [OT] Bioconductor Conference Message-ID: Sorry for the slightly off-topic message, but I think there are plenty of overlapping interests between biopython and bioconductor. I wanted to announce the upcoming Bioconductor 2010 conference. Besides the email below, if there are questions that I can answer, let me know. Thanks, Sean ------------------- BioC 2010 is coming up quickly, and we urge you to sign up today! We're meeting July 29-30 (Developer Day July 28) in Seattle. 
There is a terrific line-up of speakers (see below) and the workshops are being finalized (see https://secure.bioconductor.org/BioC2010/labs.php for a preliminary list, subject to change). Also, we have introduced a special 'Flow Cytometry' track. This provides access to the Friday afternoon practical sessions. These sessions will include, among others, practicals relevant to cytometry. Finally, the deadline for scholarship applications is June 15, just a few days away! Questions? Send email to biocworkshop at fhcrc.org Thursday 8:30 - 9:15 Atul Butte, Stanford Center for Biomedical Informatics Research. Exploring Genomic Medicine Through Integrative Bioinformatics. 9:15 - 10:00 Stephen Friend, SAGE Bionetworks. Risks and Opportunities for Disease Models based on Integrative Genomic Approaches. 10:30 - 11:15 Jay Shendure, University of Washington. Exome sequencing and human genetics. 11:15 - 12:00 To be confirmed. Friday 8:30 - 9:15 Simon Tavaré, University of Southern California. 9:15 - 10:00 Paul Flicek, European Bioinformatics Institute. Generation gap: How existing bioinformatics resources are adapting to high-throughput sequencing. 10:30 - 11:15 Lior Pachter, University of California, Berkeley. 11:15 - 12:00 Simon Anders, European Bioinformatics Institute. Inference of differential signal in high throughput sequencing count data. From biopython at maubp.freeserve.co.uk Wed Jun 16 10:04:41 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 16 Jun 2010 11:04:41 +0100 Subject: [Biopython] MutableSeq (reverse) complement methods Message-ID: Hi all, I've been meaning to discuss the following issue for a while - I find this to be an annoying difference between the Seq and MutableSeq objects: The Seq object's (reverse) complement method returns a new Seq object (it has to, because we regard the Seq object as read only). The MutableSeq object's (reverse) complement method instead currently modifies the object in place (and has no return value).
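To make the difference concrete, here is the pattern in miniature using toy stand-ins (plain Python classes invented for illustration, not the real Seq/MutableSeq code):

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

class ToySeq:
    """Read-only stand-in for Seq: reverse_complement returns a new object."""
    def __init__(self, data):
        self.data = data
    def reverse_complement(self):
        return ToySeq("".join(COMPLEMENT[b] for b in reversed(self.data)))

class ToyMutableSeq:
    """Stand-in for MutableSeq: reverse_complement acts in place, returns None."""
    def __init__(self, data):
        self.data = list(data)
    def reverse_complement(self):
        self.data = [COMPLEMENT[b] for b in reversed(self.data)]
```

Generic code calling x.reverse_complement() therefore gets back a new object in one case and None in the other - which is exactly the special case I'd like to remove.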
Writing general code that expects a sequence object is difficult because this requires a special case for MutableSeq objects. I would therefore like to make the MutableSeq object's complement and reverse_complement methods act like those of the Seq object by default, and return a new object. This discrepancy was the main reason why I didn't add (back) transcribe and translate methods to the MutableSeq object when they were added to the Seq object. So, people who use the MutableSeq object, do you find the in situ complement and reverse_complement methods useful? If so, should we add an optional argument to the methods to control this (e.g. in_place or in_situ)? Via a warning mechanism for a few releases, we can then switch over to the new default behaviour being consistent with the Seq object. If, on the other hand, the in situ (reverse) complement methods are not seen as useful, we can handle this with a simple change in behaviour (again with warning messages for a few releases). Peter From reece at berkeley.edu Thu Jun 17 23:13:17 2010 From: reece at berkeley.edu (Reece Hart) Date: Thu, 17 Jun 2010 16:13:17 -0700 Subject: [Biopython] sequence coordinate mapping Message-ID: <4C1AAC0D.5030208@berkeley.edu> Hi All- I'm looking for code in Python (preferably already in BioPython) to map between genomic, CDS, and protein coordinates. For example, map position 7872 of AB026906.1 to position 274 in the CDS and pos 92 in the protein. It's not difficult and I've already written a crude version, but I'm a little surprised that it's not there and I don't want to reinvent. I'm looking for something akin to Bio::Coordinate::GeneMapper, for those from BioPerl.
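For what it's worth, my crude version boils down to something like this (forward strand only, 0-based coordinates; the exon intervals below are made up for illustration, not the real AB026906.1 annotation):

```python
def genomic_to_cds(pos, exons):
    """Map a 0-based genomic position to a 0-based CDS position.

    exons: ordered (start, end) genomic intervals (end-exclusive) making
    up the CDS on the forward strand. Returns None for positions that
    fall in an intron or outside the gene.
    """
    offset = 0
    for start, end in exons:
        if start <= pos < end:
            return offset + (pos - start)
        offset += end - start
    return None

def cds_to_protein(cds_pos):
    """Map a 0-based CDS position to a 0-based protein residue index."""
    return cds_pos // 3
```

It handles none of the hard cases (reverse strand, fuzzy ends, ribosomal slippage), which is part of why I'd rather use something maintained.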
Thanks, Reece From biopython at maubp.freeserve.co.uk Fri Jun 18 10:01:56 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 18 Jun 2010 11:01:56 +0100 Subject: [Biopython] sequence coordinate mapping In-Reply-To: <4C1AAC0D.5030208@berkeley.edu> References: <4C1AAC0D.5030208@berkeley.edu> Message-ID: On Fri, Jun 18, 2010 at 12:13 AM, Reece Hart wrote: > Hi All- > > I'm looking for code in Python (preferably already in BioPython) to map > between genomic, CDS, and protein coordinates. For example, map position > 7872 of AB026906.1 to position 274 in the CDS and pos 92 in the protein. > > It's not difficult and I've already written a crude version, but I'm a > little surprised that it's not there and I don't want to reinvent. > > I'm looking for something akin to Bio::Coordinate::GeneMapper, for those > from BioPerl. > > Thanks, > Reece The Bio::Coordinate::GeneMapper stuff looks quite complicated just from the documentation - maybe I'm looking in the wrong place but some examples would help to understand the full scope of it. There isn't anything quite like this built into Biopython at the moment. Your question also sounds hard in general. What about where a single base on the genome maps to multiple genes (overlapping genes are common in bacteria and viruses). What about where a single base on the genome maps to an intron in a gene - would you want any values back? What about where a gene has a fuzzy boundary? What about a ribosomal slippage where a single bp ends up coding for two residues in the protein? It can be broken down into two steps: (1) finding a list of features covering a position on the genome, (2) for a CDS feature getting the amino acid position (which would require looking for the codon start position if specified in the annotation). Just thinking out loud, implementing "in" and/or sorting on our FeatureLocation (and perhaps SeqFeature) objects (i.e. 
implement the special __contains__ method, __lt__ method etc) could be useful syntactic sugar for this kind of work. Peter From biopython at maubp.freeserve.co.uk Fri Jun 18 12:00:04 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 18 Jun 2010 13:00:04 +0100 Subject: [Biopython] sequence coordinate mapping In-Reply-To: References: <4C1AAC0D.5030208@berkeley.edu> Message-ID: On Fri, Jun 18, 2010 at 11:01 AM, Peter wrote: > > Just thinking out loud, implementing "in" and/or sorting on our > FeatureLocation (and perhaps SeqFeature) objects (i.e. implement > the special __contains__ method, __lt__ method etc) could be > useful syntactic sugar for this kind of work. > Something like this? This implements __contains__ on the SeqFeature so that you can check if a simple location (integer) is within a feature. http://github.com/peterjc/biopython/tree/feature-in There is a docstring with examples, just look at the diff here: http://github.com/peterjc/biopython/commit/83c44e8f6ee62a9c5855b603cb3c080d367e23d6 Would that go part way to solving your task? Peter From chapmanb at 50mail.com Fri Jun 18 12:58:03 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 18 Jun 2010 08:58:03 -0400 Subject: [Biopython] sequence coordinate mapping In-Reply-To: References: <4C1AAC0D.5030208@berkeley.edu> Message-ID: <20100618125803.GV3415@sobchak.mgh.harvard.edu> Reece and Peter; > > I'm looking for code in Python (preferably already in BioPython) to map > between genomic, CDS, and protein coordinates. For example, map position > 7872 of AB026906.1 to position 274 in the CDS and pos 92 in the protein. > > It's not difficult and I've already written a crude version, but I'm a > little surprised that it's not there and I don't want to reinvent. > > I'm looking for something akin to Bio::Coordinate::GeneMapper, for those > from BioPerl. A general implementation like this would be really useful.
Here is a stab I took at the problem a while back: http://bitbucket.org/chapmanb/synbio/src/tip/SynBio/Codons/CodingRegion.py This represents a sequence as a set of Exon/Intron objects and then lets you manipulate directly on the coding region. Here's a representation inspired by that for SNP calling: http://github.com/chapmanb/bcbb/blob/master/biopython/CodingRegion.py The tricky part of this problem is getting exon/intron coordinates parsed and in the right format. I'm not sure I ever really got this correct, but hopefully those implementations help. > Your question also sounds hard in general. What about where a > single base on the genome maps to multiple genes (overlapping > genes are common in bacteria and viruses). What about where > a single base on the genome maps to an intron in a gene - would > you want any values back? What about where a gene has a fuzzy > boundary? What about a ribosomal slippage where a single bp > ends up coding for two residues in the protein? You'd want to catch and raise errors in pathological cases, but this would be useful for the standard cases. If the target is SNP calling, you'd want to be able to have a genomic coordinate and find out if it's in a gene; if so, is it in a coding region?; if so, what is the protein at that position? It is handy to be able to pull out each of those representations so you can ask questions about the location in the coding sequence or amino acid change caused by a SNP. > Something like this? This implements __contains__ on the SeqFeature > so that you can check if a simple location (integer) is within a feature. > http://github.com/peterjc/biopython/tree/feature-in > > There is a docstring with examples, just look at the diff here: > http://github.com/peterjc/biopython/commit/83c44e8f6ee62a9c5855b603cb3c080d367e23d6 That's nice. The next part would be remapping the coordinates so once you have the feature you can easily address the relative position you are interested in. 
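In toy form, combining the two steps might look like this (minimal made-up classes and coordinates, just to show the shape of the API rather than Biopython internals):

```python
class ToyFeature:
    """Minimal stand-in for a feature with one (start, end) location."""
    def __init__(self, start, end):  # 0-based, end-exclusive
        self.start, self.end = start, end
    def __contains__(self, pos):
        # Peter's proposed syntactic sugar: `pos in feature`
        return self.start <= pos < self.end
    def __len__(self):
        return self.end - self.start

cds = ToyFeature(1000, 1300)  # illustrative coordinates only
snp = 1123
if snp in cds:
    local = snp - cds.start   # remap to feature-relative coordinates
    residue = local // 3      # which amino acid the SNP falls in
```

The `in` check answers "is it in a coding region?" and the remapping answers "what protein position does it hit?" - the two SNP-calling questions above.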
Brad From biopython at maubp.freeserve.co.uk Fri Jun 18 13:39:04 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 18 Jun 2010 14:39:04 +0100 Subject: [Biopython] sequence coordinate mapping In-Reply-To: <20100618125803.GV3415@sobchak.mgh.harvard.edu> References: <4C1AAC0D.5030208@berkeley.edu> <20100618125803.GV3415@sobchak.mgh.harvard.edu> Message-ID: On Fri, Jun 18, 2010 at 1:58 PM, Brad Chapman wrote: > Reece and Peter; > > Peter wrote: >> Something like this? This implements __contains__ on the SeqFeature >> so that you can check if a simple location (integer) is within a feature. >> http://github.com/peterjc/biopython/tree/feature-in >> >> There is a docstring with examples, just look at the diff here: >> http://github.com/peterjc/biopython/commit/83c44e8f6ee62a9c5855b603cb3c080d367e23d6 > > That's nice. Nice enough to be worth committing in its own right? > The next part would be remapping the coordinates so > once you have the feature you can easily address the relative > position you are interested in. Perhaps one approach would be to do this in the SeqFeature. If we define a SeqFeature's length in the natural way, then we have len(SeqFeature) == len(SeqFeature.extract(parent_seq)). Now we have two coordinate systems, 0 to len(SeqFeature) and the regions it describes on the parent sequence. Then we could discuss a pair of methods on the SeqFeature for converting between the two coordinate systems. Once you have that, the special case of amino acid coordinates is much easier to do (account for where the start codon is, divide by three).
I've made another commit on the __contains__ branch to also implement __len__ for the SeqFeature: http://github.com/peterjc/biopython/commit/74b264acacd228d64859d28d75e2c30a8030d03f Peter From cjfields at illinois.edu Fri Jun 18 14:08:21 2010 From: cjfields at illinois.edu (Chris Fields) Date: Fri, 18 Jun 2010 09:08:21 -0500 Subject: [Biopython] sequence coordinate mapping In-Reply-To: References: <4C1AAC0D.5030208@berkeley.edu> <20100618125803.GV3415@sobchak.mgh.harvard.edu> Message-ID: <051CA3A3-9ECB-40B7-B29C-BDEC69C450C1@illinois.edu> On Jun 18, 2010, at 8:39 AM, Peter wrote: > On Fri, Jun 18, 2010 at 1:58 PM, Brad Chapman wrote: >> Reece and Peter; >> >> Peter wrote: >>> Something like this? This implements __contains__ on the SeqFeature >>> so that you can check if a simple location (integer) is within a feature. >>> http://github.com/peterjc/biopython/tree/feature-in >>> >>> There is a docstring with examples, just look at the diff here: >>> http://github.com/peterjc/biopython/commit/83c44e8f6ee62a9c5855b603cb3c080d367e23d6 >> >> That's nice. > > Nice enough to be worth committing in its own right? > >> The next part would be remapping the coordinates so >> once you have the feature you can easily address the relative >> position you are interested in. > > Perhaps one approach would be to do this in the SeqFeature. If we > define a SeqFeature's length in the natural way, then we have > len(SeqFeature) == len(SeqFeature.extract(parent_seq)). > Now we have two coordinates systems, 0 to len(SeqFeature) and > the regions it describes on the parent sequence. Then we could > discuss a pair of methods on the SeqFeature for converting > between the two coordinate systems. Once you have that, the > special case of amino acid coordinates is much easier to do > (account for where the start codon is, divide by three). 
> > I've made another commit on the __contains__ branch to > also implement __len__ for the SeqFeature: > http://github.com/peterjc/biopython/commit/74b264acacd228d64859d28d75e2c30a8030d03f > > Peter We essentially do this with Bioperl features, locations, and ranges (in fact, the coordinate system previously mentioned uses these). Basically, anything that is-a Range can be compared to anything else that is-a Range (this has bitten us as well :). Beyond the module documentation the test suite has a bit more on it, and Aaron Mackey has a presentation up on slideshare that touches upon the Bio::Coordinate implementation: http://www.slideshare.net/bosc_2008/mackey-bio-perl-bosc2008 chris From biopython at maubp.freeserve.co.uk Fri Jun 18 18:10:24 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 18 Jun 2010 19:10:24 +0100 Subject: [Biopython] sequence coordinate mapping In-Reply-To: <051CA3A3-9ECB-40B7-B29C-BDEC69C450C1@illinois.edu> References: <4C1AAC0D.5030208@berkeley.edu> <20100618125803.GV3415@sobchak.mgh.harvard.edu> <051CA3A3-9ECB-40B7-B29C-BDEC69C450C1@illinois.edu> Message-ID: On Fri, Jun 18, 2010 at 3:08 PM, Chris Fields wrote: > On Jun 18, 2010, at 8:39 AM, Peter wrote: >> Perhaps one approach would be to do this in the SeqFeature. If we >> define a SeqFeature's length in the natural way, then we have >> len(SeqFeature) == len(SeqFeature.extract(parent_seq)). >> Now we have two coordinates systems, 0 to len(SeqFeature) and >> the regions it describes on the parent sequence. Then we could >> discuss a pair of methods on the SeqFeature for converting >> between the two coordinate systems. Once you have that, the >> special case of amino acid coordinates is much easier to do >> (account for where the start codon is, divide by three).
>> >> I've made another commit on the __contains__ branch to >> also implement __len__ for the SeqFeature: >> http://github.com/peterjc/biopython/commit/74b264acacd228d64859d28d75e2c30a8030d03f >> >> Peter > > We essentially do this with Bioperl features, locations, and ranges > (in fact, the coordinate system previously mentioned uses these). > Basically, anything that is-a Range can be compared to anything else > that is-a Range (this has bitten us as well :). Beyond the module > documentation the test suite has a bit more on it, and Aaron Mackey > has a presentation up on slideshare that touches upon the > Bio::Coordinate implementation: > > http://www.slideshare.net/bosc_2008/mackey-bio-perl-bosc2008 Thanks for the link - I didn't see much in the presentation that I hadn't seen in the documentation though. I guess the BioPerl unit tests would be worth checking out. Thanks Chris. Back in Biopython land (where we seem to have adopted similar but different names like locations and positions), I had a go at doing one of the mappings - from feature coordinates to parent coordinates (e.g. CDS back to genome, or PFAM domain back to protein if your SeqRecord is for a protein sequence): http://github.com/peterjc/biopython/tree/feature-coords As you can tell from the rather lengthy docstring and doctests, this is quite hairy and difficult to explain. The way Python sub-setting works also complicates how to translate the break point between two subfeatures (exons), since you may want to use the number as the end of the first exon or as the start of the second exon. I've implemented the mapping so that single letter access works as expected: http://github.com/peterjc/biopython/tree/74b264acacd228d64859d28d75e2c30a8030d03f I'm pretty sure we can do the reverse mapping from a parent sequence coordinate to a feature coordinate, although I can come up with pathological examples where this is not a one to one mapping, but one to many. e.g.
a ribosomal slippage where a base gets used twice. In this case we could raise an error, or maybe more simply take the first match. I'm not convinced about adding these methods just yet - but the relatively simple work to support "len(feature)" and "x in feature" looks like a useful addition. Peter From reece at berkeley.edu Fri Jun 18 18:00:20 2010 From: reece at berkeley.edu (Reece Hart) Date: Fri, 18 Jun 2010 11:00:20 -0700 Subject: [Biopython] sequence coordinate mapping In-Reply-To: <051CA3A3-9ECB-40B7-B29C-BDEC69C450C1@illinois.edu> References: <4C1AAC0D.5030208@berkeley.edu> <20100618125803.GV3415@sobchak.mgh.harvard.edu> <051CA3A3-9ECB-40B7-B29C-BDEC69C450C1@illinois.edu> Message-ID: <4C1BB434.5040103@berkeley.edu> Thanks, all, for feedback. I'm still digesting some of the previous comments. For the purposes of discussion, I've attached the crude (pre-crude, even) implementation that I mentioned. Caveats/ToDos: * The interface is sufficient for my needs, but for a large number of CDS subfeatures, it might make sense to change the implementation to use an index rather than a linear search. * I ignore strand for the moment. * I don't use SeqFeature.AbstractPosition and friends. -Reece -------------- next part -------------- A non-text attachment was scrubbed... Name: CoordinateMapper.py Type: text/x-python Size: 2501 bytes Desc: not available URL: From biopython at maubp.freeserve.co.uk Fri Jun 18 18:19:59 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 18 Jun 2010 19:19:59 +0100 Subject: [Biopython] sequence coordinate mapping In-Reply-To: <4C1BB434.5040103@berkeley.edu> References: <4C1AAC0D.5030208@berkeley.edu> <20100618125803.GV3415@sobchak.mgh.harvard.edu> <051CA3A3-9ECB-40B7-B29C-BDEC69C450C1@illinois.edu> <4C1BB434.5040103@berkeley.edu> Message-ID: On Fri, Jun 18, 2010 at 7:00 PM, Reece Hart wrote: > Thanks, all, for feedback. I'm still digesting some of the previous > comments.
For the purposes of discussion, I've attached the crude > (pre-crude, even) implementation that I mentioned. Thanks > Caveats/ToDos: > * The interface is sufficient for my needs, but for a large number of CDS > subfeatures, it might make sense to change the implementation to use an index > rather than a linear search. It looks like the core idea you are using is the same - loop over the exons (subfeatures) to keep track of where you are. > * I ignore strand for the moment. That makes life a bit more fun! I haven't tested my code on mixed strand features yet (e.g. some crazy tRNA annotation I've seen). > * I don't use SeqFeature.AbstractPosition and friends. Unfortunately they crop up in lots of real world GenBank/EMBL files, so anything we add to the SeqFeature object has to cope with them. Things like GFF3 files avoid this of course. Peter From biopython at maubp.freeserve.co.uk Mon Jun 21 17:59:59 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 21 Jun 2010 18:59:59 +0100 Subject: [Biopython] sequence coordinate mapping In-Reply-To: References: <4C1AAC0D.5030208@berkeley.edu> <20100618125803.GV3415@sobchak.mgh.harvard.edu> <051CA3A3-9ECB-40B7-B29C-BDEC69C450C1@illinois.edu> Message-ID: On Fri, Jun 18, 2010 at 7:10 PM, Peter wrote: > On Fri, Jun 18, 2010 at 3:08 PM, Chris Fields wrote: >> On Jun 18, 2010, at 8:39 AM, Peter wrote: >>> Perhaps one approach would be to do this in the SeqFeature. If we >>> define a SeqFeature's length in the natural way, then we have >>> len(SeqFeature) == len(SeqFeature.extract(parent_seq)). >>> Now we have two coordinates systems, 0 to len(SeqFeature) and >>> the regions it describes on the parent sequence. Then we could >>> discuss a pair of methods on the SeqFeature for converting >>> between the two coordinate systems. Once you have that, the >>> special case of amino acid coordinates is much easier to do >>> (account for where the start codon is, divide by three).
>>> >>> I've made another commit on the __contains__ branch to >>> also implement __len__ for the SeqFeature: >>> http://github.com/peterjc/biopython/commit/74b264acacd228d64859d28d75e2c30a8030d03f Note I found an off by one error with the end point, fixed now: http://github.com/peterjc/biopython/commit/dc18df0c5d9cc824ddd31c96d59ee2bf9c5c7fc2 With a few more unit tests, I think that could be merged to the trunk. > ... I had a go at > doing one of the mappings - from feature coordinates to parent > coordinates (e.g. CDS back to genome, or PFAM domain back > to protein if your SeqRecord is for a protein sequence): > > http://github.com/peterjc/biopython/tree/feature-coords > > As you can tell from the rather lengthy docstring and doctests, > this is quite hairy and difficult to explain. ... > > I'm pretty sure we can do the reverse mapping from a > parent sequence coordinate to a feature coordinate, > although I can come up with pathological examples where > this is not a one to one mapping, but one to many. e.g. a > ribosomal slippage where a base gets used twice. In this > case we could raise an error, or maybe more simply take > the first match. This second branch now implements two methods for mapping between feature coordinates and the parent sequence coordinates. http://github.com/peterjc/biopython/tree/feature-coords In the case where due to overlapping sub-features a parent letter has more than one possible feature coordinate, this returns the lowest feature coordinate. This is slightly faster since we don't have to check all the sub-features. However, perhaps doing so and raising an exception is preferable to avoid silent errors in this corner case? Note this does not handle the third case of amino acid coordinates (which only applies where the parent sequence is nucleotides and the feature is something like a CDS or mature peptide entry).
Peter From jtomkins at ICR.org Tue Jun 22 19:55:52 2010 From: jtomkins at ICR.org (Jeff Tomkins) Date: Tue, 22 Jun 2010 14:55:52 -0500 Subject: [Biopython] Cogent package Message-ID: I have been working with the biopython package for about the past year (formerly worked with perl and dabbled in bioperl a bit) and somehow the pycogent package didn't even cross my radar. I just discovered it by accident in a code search on the web. So how does pycogent relate to and compare with biopython? It seems that the most recent release of cogent (1.4.1) is fairly mature and contains a large amount and diversity of code. -jeff From lgautier at gmail.com Tue Jun 22 20:49:41 2010 From: lgautier at gmail.com (Laurent) Date: Tue, 22 Jun 2010 22:49:41 +0200 Subject: [Biopython] sequence coordinate mapping In-Reply-To: References: Message-ID: <4C2121E5.2080203@gmail.com> On 22/06/10 18:00, biopython-request at lists.open-bio.org wrote: > On Fri, Jun 18, 2010 at 7:10 PM, Peter wrote: >> On Fri, Jun 18, 2010 at 3:08 PM, Chris Fields wrote: >>> On Jun 18, 2010, at 8:39 AM, Peter wrote: >>>> Perhaps one approach would be to do this in the SeqFeature. If we >>>> define a SeqFeature's length in the natural way, then we have >>>> len(SeqFeature) == len(SeqFeature.extract(parent_seq)). >>>> Now we have two coordinates systems, 0 to len(SeqFeature) and >>>> the regions it describes on the parent sequence. Then we could >>>> discuss a pair of methods on the SeqFeature for converting >>>> between the two coordinate systems. Once you have that, the >>>> special case of amino acid coordinates is much easier to do >>>> (account for where the start codon is, divide by three).
>>>> >>>> I've made another commit on the __contains__ branch to >>>> also implement __len__ for the SeqFeature: >>>> http://github.com/peterjc/biopython/commit/74b264acacd228d64859d28d75e2c30a8030d03f > > Note I found an off by one error with the end point, fixed now: > > http://github.com/peterjc/biopython/commit/dc18df0c5d9cc824ddd31c96d59ee2bf9c5c7fc2 > > With a few more unit tests, I think that could be merged to the trunk. > >> ... I had a go at >> doing one of the mappings - from feature coordinates to parent >> coordinates (e.g. CDS back to genome, or PFAM domain back >> to protein if your SeqRecord is for a protein sequence): >> >> http://github.com/peterjc/biopython/tree/feature-coords >> >> As you can tell from the rather lengthy docstring and doctests, >> this is quite hairy and difficult to explain. ... >> >> I'm pretty sure we can do the reverse mapping from a >> parent sequence coordinate to a feature coordinate, >> although I can come up with pathological examples where >> this is not a one to one mapping, but one to many. e.g. a >> ribosomal slippage where a base gets used twice. In this >> case we could raise an error, or maybe more simply take >> the first match. > > This second branch now implements two methods for mapping > between feature coordinates and the parent sequence coordinates. > > http://github.com/peterjc/biopython/tree/feature-coords > > In the case where due to overlapping sub-features a parent letter > has more than one possible feature coordindate, this returns the > lowest feature coordinate. This is slightly faster since we don't > have to check all the sub-features. However, perhaps doing so > and raising an exception is preferable to avoid silent errors in > this corner case? Exception is better. In the worst case raising an exception will take a split second to fix, while silent logic twists can in the best case take time and frustration to find and fix (in the worst case it can lead to wrong results undetected). 
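Laurent's preference can be made concrete: rather than silently picking one answer for an ambiguous position, collect every match and let the caller decide. This is an illustrative helper only - the name and the list-returning behaviour are assumptions of this sketch, not the branch's actual API.

```python
def get_all_local_coords(parts, parent_pos):
    """Return every feature-local coordinate for a parent position.

    With overlapping sub-features (e.g. a ribosomal slippage where one
    base is read twice) a parent letter can map to several local
    coordinates, so a list is returned. An empty list means the
    position is not inside the feature at all.
    """
    hits = []
    offset = 0
    for start, end in parts:
        if start <= parent_pos < end:
            hits.append(offset + parent_pos - start)
        offset += end - start
    return hits


# Parts overlap at base 19, mimicking a slippage-style annotation.
parts = [(10, 20), (19, 30)]
print(get_all_local_coords(parts, 19))  # [9, 10] - ambiguous, two answers
print(get_all_local_coords(parts, 5))   # [] - outside the feature
```

A single-answer variant could then raise an exception whenever the list has length other than one, which is exactly the silent-error-avoiding behaviour argued for above.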
> Note this does not handle the third case of amino acid coordinates > (which only applies where the parent sequence is nucleotides and > the feature is something like a CDS or mature peptide entry). Also, I followed that distantly but wouldn't it make sense to abstract everything into a system of nested relative coordinates? I guess that putting a bit of code together would be the easiest to demonstrate what I have in mind (obviously after checking that other packages around do not already have something similar, and after I set some time aside for that). L. > Peter > > > ------------------------------ > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > > End of Biopython Digest, Vol 90, Issue 15 > ***************************************** From biopython at maubp.freeserve.co.uk Wed Jun 23 09:16:38 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 23 Jun 2010 10:16:38 +0100 Subject: [Biopython] sequence coordinate mapping In-Reply-To: References: <4C1AAC0D.5030208@berkeley.edu> <20100618125803.GV3415@sobchak.mgh.harvard.edu> <051CA3A3-9ECB-40B7-B29C-BDEC69C450C1@illinois.edu> <4C1BB434.5040103@berkeley.edu> Message-ID: On Fri, Jun 18, 2010 at 7:19 PM, Peter wrote: > On Fri, Jun 18, 2010 at 7:00 PM, Reece Hart wrote: >> Thanks, all, for feedback. I'm still digesting some of the previous >> comments. For the purposes of discussion, I've attached the crude >> (pre-crude, even) implementation that I mentioned. > > Thanks > >> Caveats/ToDos: >> * The interface is sufficient for my needs, but for a large number of CDS >> subfeatures, it might make sense to change the implementation to use an index >> rather than a linear search. > > It looks like the core idea you are using is the same - loop over the exons > (subfeatures) to keep track of where you are. > >> * I ignore strand for the moment. > > That makes life a bit more fun!
I haven't tested my code on mixed > strand features yet (e.g. some crazy tRNA annotation I've seen). > >> * I don't use SeqFeature.AbstractPosition and friends. > > Unfortunately they crop up in lots of real world GenBank/EMBL files, > so anything we add to the SeqFeature object has to cope with them. > Things like GFF3 files avoid this of course. I should also point out that if accessing location positions in your own code, using nofuzzy_start and nofuzzy_end is better since they give the appropriate integer values. i.e. change this: sf.location.start.position,sf.location.end.position to: sf.location.nofuzzy_start,sf.location.nofuzzy_end That should then take care of the fuzzy locations as best as possible. Peter From chapmanb at 50mail.com Wed Jun 23 13:13:20 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 23 Jun 2010 09:13:20 -0400 Subject: [Biopython] Cogent package In-Reply-To: References: Message-ID: <20100623131320.GI26392@sobchak.mgh.harvard.edu> Jeff; > I have been working with with the biopython package for about the > past year (formerly worked with perl and dabbled in bioperl a bit) > and somehow the pycogent package didn?t even cross my radar. I just > discovered it by accident in a code search on the web. So how does > pycogent relate to and compare with biopython? It seems that the most > recent release of cogent (1.4.1) is fairly mature and contains a large > amount and diversity of code. PyCogent is definitely a useful library to have in your toolbelt. It focuses a bit more on evolutionary and phylogenetic work, so will give you extra functionality if you're working in those areas. We'd hoped to develop formal interoperability with PyCogent, and proposed a summer of code project along those lines: http://biopython.org/wiki/Google_Summer_of_Code#Biopython_and_PyCogent_interoperability but unfortunately didn't get funded this year. 
A few other useful Python biology projects are: bx-python: http://bitbucket.org/james_taylor/bx-python/wiki/Home DendroPy: http://packages.python.org/DendroPy/ Pygr: http://code.google.com/p/pygr/ It would be cool to see the Python bioinformatics community develop documentation and examples of using multiple toolkits together. Brad From biopython at maubp.freeserve.co.uk Wed Jun 23 14:00:26 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 23 Jun 2010 15:00:26 +0100 Subject: [Biopython] sequence coordinate mapping In-Reply-To: <4C2121E5.2080203@gmail.com> References: <4C2121E5.2080203@gmail.com> Message-ID: On Tue, Jun 22, 2010 at 9:49 PM, Laurent wrote: > > Peter wrote: >> >> This second branch now implements two methods for mapping >> between feature coordinates and the parent sequence coordinates. >> >> http://github.com/peterjc/biopython/tree/feature-coords >> >> In the case where due to overlapping sub-features a parent letter >> has more than one possible feature coordindate, this returns the >> lowest feature coordinate. This is slightly faster since we don't >> have to check all the sub-features. However, perhaps doing so >> and raising an exception is preferable to avoid silent errors in >> this corner case? > > Exception is better. > > In the worst case raising an exception will take a split second to fix, > while silent logic twists can in the best case take time and frustration to > find and fix (in the worst case it can lead to wrong results undetected). Agreed. I've made get_local_coord give an exception now for ambiguous mappings, and introduced get_local_coords (with a trailing s for plural) which gives a list of the local coordinates. That seems to cover the typical case nicely and makes dealing with the special case fairly easy. 
http://github.com/peterjc/biopython/tree/feature-coords >> Note this does not handle the third case of amino acid coordinates >> (which only applies where the parent sequence is nucleotides and >> the feature is something like a CDS or mature peptide entry). > > Also, I followed that distantly but wouldn't it make sense to abstract > everything into a system of nested relative coordinates? > I guess that putting a bit of code together would be the easiest to > demonstrate what I have in mind (obviously after checking that other > packages around do not already have something similar, and after > I set some time aside for that). That might be a good idea (I'm not quite sure what you are suggesting). Another related problem is going from gapped to ungapped coordinates (also described as padded and unpadded) when working with sequence alignments. Peter From pmr at ebi.ac.uk Wed Jun 23 16:50:38 2010 From: pmr at ebi.ac.uk (Peter Rice) Date: Wed, 23 Jun 2010 17:50:38 +0100 Subject: [Biopython] sequence coordinate mapping In-Reply-To: <051CA3A3-9ECB-40B7-B29C-BDEC69C450C1@illinois.edu> References: <4C1AAC0D.5030208@berkeley.edu> <20100618125803.GV3415@sobchak.mgh.harvard.edu> <051CA3A3-9ECB-40B7-B29C-BDEC69C450C1@illinois.edu> Message-ID: <4C223B5E.4080501@ebi.ac.uk> I have been following the discussion with interest. This is something we also want to implement in EMBOSS soon after the next release when we seriously tackle mapping and large alignments. It would be very nice to have a common approach across the Open-Bio projects. I will be at BOSC and ISMB in Boston so perhaps some of us can get together there and compare notes.
regards, Peter Rice From lthiberiol at gmail.com Wed Jun 23 21:09:31 2010 From: lthiberiol at gmail.com (Luiz Thiberio Rangel) Date: Wed, 23 Jun 2010 18:09:31 -0300 Subject: [Biopython] blast+ gilist Message-ID: Hi you all, I know this isn't the right place to ask it, but has anyone ever used the gilist parameter on blast+ or "-l" on blastall? I am trying to use it but it never works, can you show me some examples of what the gilist file should look like? best regards, -- Luiz Thibério Rangel From biopython at maubp.freeserve.co.uk Thu Jun 24 08:36:50 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 24 Jun 2010 09:36:50 +0100 Subject: [Biopython] sequence coordinate mapping In-Reply-To: <4C223B5E.4080501@ebi.ac.uk> References: <4C1AAC0D.5030208@berkeley.edu> <20100618125803.GV3415@sobchak.mgh.harvard.edu> <051CA3A3-9ECB-40B7-B29C-BDEC69C450C1@illinois.edu> <4C223B5E.4080501@ebi.ac.uk> Message-ID: On Wed, Jun 23, 2010 at 5:50 PM, Peter Rice wrote: > I have been following the discussion with interest. It's nice to have BioPerl and EMBOSS folk on the mailing list :) > This is something we > also want to implement in EMBOSS soon after the next release when we > seriously tackle mapping and large alignments. Are you thinking beyond the simple feature mapping which I've had in mind here (e.g. in GenBank or EMBL files)? > It would be very nice to have a common approach across the Open-Bio > projects. I will be at BOSC and ISMB in Boston so perhaps some of us > can get together there and compare notes. Sadly I won't be at the Boston BOSC/ISMB 2010, but Brad and others will be. Maybe next time I visit the Sanger Centre I'll try and drop by and visit you (Peter R) at the EBI? Regards, Peter C.
From pmr at ebi.ac.uk Thu Jun 24 08:47:34 2010 From: pmr at ebi.ac.uk (Peter Rice) Date: Thu, 24 Jun 2010 09:47:34 +0100 Subject: [Biopython] sequence coordinate mapping In-Reply-To: References: <4C1AAC0D.5030208@berkeley.edu> <20100618125803.GV3415@sobchak.mgh.harvard.edu> <051CA3A3-9ECB-40B7-B29C-BDEC69C450C1@illinois.edu> <4C223B5E.4080501@ebi.ac.uk> Message-ID: <4C231BA6.1090105@ebi.ac.uk> On 24/06/2010 09:36, Peter wrote: > On Wed, Jun 23, 2010 at 5:50 PM, Peter Rice wrote: >> I have been following the discussion with interest. > > It's nice to have BioPerl and EMBOSS folk on the mailing list :) > >> This is something we >> also want to implement in EMBOSS soon after the next release when we >> seriously tackle mapping and large alignments. > > Are you thinking beyond the simple feature mapping which I've had in > mind here (e.g. in GenBank or EMBL files)? Well, EMBOSS internals are identical for analysis results and EMBL/GenBank features so we would hope to cover anything we might want to do. A big effort after this release will include mapping to coordinate systems (especially reference sequences) so we could align an annotated sequence (e.g. an EMBL/GenBank entry) to a reference and aim to transfer the features, or to map features from the reference (using DAS or some similar protocol to extract just the region of interest) on to the user's own sequence. Anything that fails to map completely can be annotated e.g. with end or /note="some explanation" The naming of the reference sequences is also important so the mapping could hopefully be reversible. > Sadly I won't be at the Boston BOSC/ISMB 2010, but Brad and others > will be. Maybe next time I visit the Sanger Centre I'll try and drop by and > visit you (Peter R) at the EBI? Great, let me know when you can drop in. Always good to see you.
regards, Peter From biopython at maubp.freeserve.co.uk Thu Jun 24 17:32:36 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 24 Jun 2010 18:32:36 +0100 Subject: [Biopython] sequence coordinate mapping In-Reply-To: References: <4C2121E5.2080203@gmail.com> Message-ID: On Wed, Jun 23, 2010 at 3:00 PM, Peter wrote: >> >> In the worst case raising an exception will take a split second to fix, >> while silent logic twists can in the best case take time and frustration to >> find and fix (in the worst case it can lead to wrong results undetected). > > Agreed. > > I've made get_local_coord give an exception now for ambiguous > mappings, and introduced get_local_coords (with a trailing s for > plural) which gives a list of the local coordinates. That seems to > cover the typical case nicely and makes dealing with the special > case fairly easy. > > http://github.com/peterjc/biopython/tree/feature-coords I've just added an __iter__ method which gives the parent coordinates for each position in the feature (in the order of the local coordinates). I expect that would be useful for something... Peter From j.reid at mail.cryst.bbk.ac.uk Fri Jun 25 16:28:38 2010 From: j.reid at mail.cryst.bbk.ac.uk (John Reid) Date: Fri, 25 Jun 2010 17:28:38 +0100 Subject: [Biopython] Exon/intron locations for drosophila R4 Message-ID: Hi, What's the easiest way to get exon locations for release 4 of the melanogaster genome into my python script? I have been using a DAS interface to UCSC but it doesn't seem to supply this info. Thanks, John. From ratlaw at gmail.com Fri Jun 25 16:59:27 2010 From: ratlaw at gmail.com (Walter Scheper) Date: Fri, 25 Jun 2010 12:59:27 -0400 Subject: [Biopython] Problem building from PyPi: Affy package directory is missing Message-ID: Hey all, I generally install biopython either using easy_install or pip (if I need to enforce a particular version). 
Today I was rebuilding the dependencies for a project I'm working on and when I went to build biopython I got the following error: error: package directory 'Bio/Affy' does not exist Looking in the Bio/ directory, sure enough there is no Affy/ directory. If I download the tarball available from www.biopython.org, then the Bio/Affy package directory does exist. Is there some issue with distributing the Affy submodule, or is the PyPi tarball incorrectly built? I'm fairly curious why I haven't seen any mention of this issue on the list, so perhaps there is something fishy with my process. Thanks, Walter Scheper From chapmanb at 50mail.com Fri Jun 25 18:32:59 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 25 Jun 2010 14:32:59 -0400 Subject: [Biopython] Problem building from PyPi: Affy package directory is missing In-Reply-To: References: Message-ID: <20100625183259.GM1227@sobchak.mgh.harvard.edu> Walter; > I generally install biopython either using easy_install or pip (if > I need to enforce a particular version). Today I was rebuilding the > dependencies for a project I'm working on and when I went to build > biopython I got the following error: > > error: package directory 'Bio/Affy' does not exist Thanks for the heads up. There were a couple of compounding issues that led to the easy_install problem. Normally, pypi pulls the source files directly from the Biopython server. However, it looks like a spammer added an index.html in biopython.org/DIST, which forced pypi to use the uploaded .tar.gz as a last resort. This file was missing the Bio/Affy directories and other information because the main biopython distribution didn't have the MANIFEST.in. Whew, so in summary: - Things are fixed now on all ends. The pypi upload is now correct, you can access the biopython files again, and the MANIFEST.in was updated so we shouldn't see this again in the future.
- You are apparently the first lucky person to see the problem since the biopython.org/DIST problem appears to be brand new. Thanks again. Let us know if you have any other problems, Brad From biopython at maubp.freeserve.co.uk Fri Jun 25 19:41:17 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 25 Jun 2010 20:41:17 +0100 Subject: [Biopython] Problem building from PyPi: Affy package directory is missing In-Reply-To: <20100625183259.GM1227@sobchak.mgh.harvard.edu> References: <20100625183259.GM1227@sobchak.mgh.harvard.edu> Message-ID: On Fri, Jun 25, 2010 at 7:32 PM, Brad Chapman wrote: > Walter; > >> I generally install biopython either using easy_install or pip (if >> I need to enforce a particular version). Today I was rebuilding the >> dependencies for a project I'm working on and when I went to build >> biopython I got the following error: >> >> error: package directory 'Bio/Affy' does not exist > > Thanks for the heads up. There were a couple of compounding issues > that led to the easy_install problem. Normally, pypi pulls the > source files directly from the Biopython server. However, it looks > like a spammer added an index.html in biopython.org/DIST, which > forced pypi to use the uploaded .tar.gz as a last resort. ... > > - Things are fixed now on all ends. The pypi upload is now correct, > you can access the biopython files again, and the MANIFEST.in was > updated so we shouldn't see this again in the future. Huh - so MANIFEST.in should include MANIFEST.in? Thanks for sorting that out Brad. > - You are apparently the first lucky person to see the problem > since the biopython.org/DIST problem appears to be brand new. I'd spotted the index.html thing yesterday and it was fixed this morning - you were unlucky with the timing. The OBF team are looking into this.
Peter

From biopython at maubp.freeserve.co.uk Fri Jun 25 19:50:33 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 25 Jun 2010 20:50:33 +0100
Subject: [Biopython] Exon/intron locations for drosophila R4
In-Reply-To: 
References: 
Message-ID: 

On Fri, Jun 25, 2010 at 5:28 PM, John Reid wrote:
> Hi,
>
> What's the easiest way to get exon locations for release 4 of the
> melanogaster genome into my python script? I have been using a
> DAS interface to UCSC but it doesn't seem to supply this info.
>
> Thanks,
> John.

You could download by FTP from here:

ftp://ftp.ncbi.nih.gov/genomes/Drosophila_melanogaster/RELEASE_4.1/

The *.gbk or *.gff files should give you the exons.

Peter

From chapmanb at 50mail.com Mon Jun 28 01:26:57 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Sun, 27 Jun 2010 21:26:57 -0400
Subject: [Biopython] Problem building from PyPi: Affy package directory is missing
In-Reply-To: 
References: <20100625183259.GM1227@sobchak.mgh.harvard.edu>
Message-ID: <20100628012657.GA25674@kunkel>

Peter;
[PyPi upload builds]
> Huh - so MANIFEST.in should include MANIFEST.in? Thanks
> for sorting that out Brad.

To build the PyPi distribution, I download the release tarball and then upload it to PyPi with:

setup.py sdist upload

So if the MANIFEST.in is in the release, all of the right files will be pulled in. With it missing, the Affy and other bits get left out. So including MANIFEST.in ensures we can keep building identical distributions from the main tarball. Hope the website issues get sorted out so we don't have to worry too much about it.

Thanks,
Brad

From msameet at gmail.com Mon Jun 28 19:20:49 2010
From: msameet at gmail.com (Sameet Mehta)
Date: Mon, 28 Jun 2010 15:20:49 -0400
Subject: [Biopython] problem parsing embl file
Message-ID: 

Hi,

I am trying to parse an EMBL file created in 2004. The file contains a single record for the entire chromosome.
I have tried the following two approaches:

r = SeqIO.parse( file( "chromosome1.contig.embl" ), "embl" ).next()
r = SeqIO.read( file( "chromosome1.contig.embl" ), "embl" )

I get the following error:

ValueError                                Traceback (most recent call last)
/home/sameet/NIH-work/downloads/2004_release/ in ()

/usr/lib64/python2.6/site-packages/Bio/SeqIO/__init__.pyc in read(handle, format, alphabet)
    516     iterator = parse(handle, format, alphabet)
    517     try:
--> 518         first = iterator.next()
    519     except StopIteration:
    520         first = None

/usr/lib64/python2.6/site-packages/Bio/GenBank/Scanner.pyc in parse_records(self, handle, do_features)
    418         #This is a generator function
    419         while True:
--> 420             record = self.parse(handle, do_features)
    421             if record is None : break
    422             assert record.id is not None

/usr/lib64/python2.6/site-packages/Bio/GenBank/Scanner.pyc in parse(self, handle, do_features)
    401                 feature_cleaner = FeatureValueCleaner())
    402
--> 403         if self.feed(handle, consumer, do_features):
    404             return consumer.data
    405         else:

/usr/lib64/python2.6/site-packages/Bio/GenBank/Scanner.pyc in feed(self, handle, consumer, do_features)
    383             consumer.sequence(sequence_string)
    384             #Calls to consumer.base_number() do nothing anyway
--> 385             consumer.record_end("//")
    386
    387         assert self.line == "//"

/usr/lib64/python2.6/site-packages/Bio/GenBank/__init__.pyc in record_end(self, content)
   1047                 and self._expected_size != len(sequence):
   1048             raise ValueError("Expected sequence length %i, found %i." \
-> 1049                              % (self._expected_size, len(sequence)))
   1050
   1051         if self._seq_type:

ValueError: Expected sequence length 666, found 5580032.

Can you tell me if I am doing anything wrong? I am following the instructions as given in the Bio.SeqIO wiki page.

Thanks for the help.
Sameet
--
Sameet Mehta, Ph.D.,
Phone: (301) 842-4791

From biopython at maubp.freeserve.co.uk Mon Jun 28 19:56:42 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 28 Jun 2010 20:56:42 +0100
Subject: [Biopython] problem parsing embl file
In-Reply-To: 
References: 
Message-ID: 

Hi Sameet,

On Mon, Jun 28, 2010 at 8:20 PM, Sameet Mehta wrote:
> Hi,
>
> I am trying to parse an EMBL file created in 2004. The file contains a
> single record for the entire chromosome. I have tried the following
> two approaches
>
> r = SeqIO.parse( file( "chromosome1.contig.embl" ), "embl" ).next()
> r = SeqIO.read( file( "chromosome1.contig.embl" ), "embl" )

Those look fine - if you are using Biopython 1.54 you can just use the filename rather than opening it explicitly.

> I get the following error:
> ValueError                                Traceback (most recent call last)
> ...
> ValueError: Expected sequence length 666, found 5580032.
>
> Can you tell me if I am doing anything wrong? I am following the
> instructions as given in the Bio.SeqIO wiki page.

No, your code is fine. It looks like you have a broken EMBL file. Could you show me the first few lines of the EMBL file, and also have a look at it in a text editor to see if the sequence length really is 666bp, or 5580032 as Biopython thinks?

(Or send the whole EMBL file to me off list?)

In any case, that check seemed a bit strict (I've seen several examples of unofficial GenBank or EMBL files where the sequence length didn't match the header) so I relaxed this check to a warning for Biopython 1.54. You could try updating your copy of Biopython and see if it will accept the file then?

Regards,

Peter

From msameet at gmail.com Mon Jun 28 20:02:33 2010
From: msameet at gmail.com (Sameet Mehta)
Date: Mon, 28 Jun 2010 16:02:33 -0400
Subject: [Biopython] problem parsing embl file
In-Reply-To: 
References: 
Message-ID: 

Hi Peter,

The sequence length is 5580032; it's the first chromosome of yeast.
Following are the first 10 lines of the file:

ID   c212 standard; DNA; FUN; 666 BP.
AC   c212;
FH   Key             Location/Qualifiers
FH
FT   CDS             complement(1..5662)
FT                   /gene="SPAC212.11"
FT                   /partial
FT                   /product="DNA helicase; no apparent orthologs"
FT                   /note="possibly pseudo as has strange promoter region"
FT   misc_feature    complement(1115..1339)

Also I believe that I am using the latest BioPython on my laptop.

I think I found the problem!! Indeed the first line is the problem. So how can I circumvent this?

On Mon, Jun 28, 2010 at 3:56 PM, Peter wrote:
> Hi Sameet,
>
> On Mon, Jun 28, 2010 at 8:20 PM, Sameet Mehta wrote:
>> Hi,
>>
>> I am trying to parse an EMBL file created in 2004. The file contains a
>> single record for the entire chromosome. I have tried the following
>> two approaches
>>
>> r = SeqIO.parse( file( "chromosome1.contig.embl" ), "embl" ).next()
>> r = SeqIO.read( file( "chromosome1.contig.embl" ), "embl" )
>
> Those look fine - if you are using Biopython 1.54 you can just
> use the filename rather than opening it explicitly.
>
>> I get the following error:
>> ValueError                                Traceback (most recent call last)
>> ...
>> ValueError: Expected sequence length 666, found 5580032.
>>
>> Can you tell me if I am doing anything wrong? I am following the
>> instructions as given in the Bio.SeqIO wiki page.
>
> No, your code is fine. It looks like you have a broken EMBL file.
> Could you show me the first few lines of the EMBL file, and also
> have a look at it in a text editor to see if the sequence length
> really is 666bp, or 5580032 as Biopython thinks?
>
> (Or send the whole EMBL file to me off list?)
>
> In any case, that check seemed a bit strict (I've seen several
> examples of unofficial GenBank or EMBL files where the
> sequence length didn't match the header) so I relaxed this
> check to a warning for Biopython 1.54. You could try updating
> your copy of Biopython and see if it will accept the file then?
>
> Regards,
>
> Peter
>
--
Sameet Mehta, Ph.D.,
Phone: (301) 842-4791

From biopython at maubp.freeserve.co.uk Mon Jun 28 20:06:38 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 28 Jun 2010 21:06:38 +0100
Subject: [Biopython] problem parsing embl file
In-Reply-To: 
References: 
Message-ID: 

On Mon, Jun 28, 2010 at 9:02 PM, Sameet Mehta wrote:
> Hi Peter,
>
> The sequence length is 5580032; it's the first chromosome of yeast.
> Following are the first 10 lines of the file:
>
> ID   c212 standard; DNA; FUN; 666 BP.
> ...
>
> Also I believe that I am using the latest BioPython on my laptop.
>

Could you check? Run Python then:

import Bio
print Bio.__version__

> I think I found the problem!! Indeed the first line is the problem. So
> how can I circumvent this?

Try editing the file to start:

ID   c212 standard; DNA; FUN; 5580032 BP.

Peter

From msameet at gmail.com Mon Jun 28 20:58:01 2010
From: msameet at gmail.com (Sameet Mehta)
Date: Mon, 28 Jun 2010 16:58:01 -0400
Subject: [Biopython] problem parsing embl file
In-Reply-To: 
References: 
Message-ID: 

Hi Peter,

Yes, I am indeed running v 1.53:

>>> import Bio
>>> print Bio.__version__
1.53
>>>

I will change that line and see if it works; it probably will. But I have multiple such files to deal with and I don't know the lengths of each of those. Is there any other, simpler way around this? Sorry to bother you this way.

Sameet

On Mon, Jun 28, 2010 at 4:06 PM, Peter wrote:
> On Mon, Jun 28, 2010 at 9:02 PM, Sameet Mehta wrote:
>> Hi Peter,
>>
>> The sequence length is 5580032; it's the first chromosome of yeast.
>> Following are the first 10 lines of the file:
>>
>> ID   c212 standard; DNA; FUN; 666 BP.
>> ...
>>
>> Also I believe that I am using the latest BioPython on my laptop.
>>
>
> Could you check? Run Python then:
>
> import Bio
> print Bio.__version__
>
>> I think I found the problem!! Indeed the first line is the problem. So
>> how can I circumvent this?
>
> Try editing the file to start:
>
> ID   c212 standard; DNA; FUN; 5580032 BP.
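[Editorial aside: the ID-line edit Peter suggests above can be scripted when many files need fixing. Below is a rough standard-library sketch, not part of Biopython — the helper name `fix_embl_length` is made up here, and it assumes the usual EMBL layout where the SQ summary line carries the true length as "NNN BP".]

```python
import re

def fix_embl_length(in_path, out_path):
    """Copy an EMBL file, rewriting the 'NNN BP.' count on the ID line
    with the true length taken from the SQ summary line."""
    with open(in_path) as handle:
        lines = handle.readlines()
    # Find the true length, e.g. "SQ   Sequence 5580032 BP; 1397 A; ..."
    length = None
    for line in lines:
        if line.startswith("SQ"):
            match = re.search(r"(\d+)\s+BP", line)
            if match:
                length = match.group(1)
            break
    if length is None:
        raise ValueError("no SQ line with a BP count found in %s" % in_path)
    # Patch the ID line, e.g. "ID   c212 standard; DNA; FUN; 666 BP."
    for i, line in enumerate(lines):
        if line.startswith("ID"):
            lines[i] = re.sub(r"\d+\s+BP", length + " BP", line, count=1)
            break
    with open(out_path, "w") as handle:
        handle.writelines(lines)
```

The patched copies should then satisfy the length check in Biopython 1.53's EMBL parser.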
>
> Peter
>
--
Sameet Mehta, Ph.D.,
Phone: (301) 842-4791

From biopython at maubp.freeserve.co.uk Mon Jun 28 22:06:55 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 28 Jun 2010 23:06:55 +0100
Subject: [Biopython] problem parsing embl file
In-Reply-To: 
References: 
Message-ID: 

On Mon, Jun 28, 2010 at 9:58 PM, Sameet Mehta wrote:
> Yes, I am indeed running v 1.53

That isn't our latest release (although it may be the latest release available as a package from your Linux distribution).

> I will change that line and see if it works; it probably will. But I
> have multiple such files to deal with and I don't know the lengths of
> each of those.

Please try that and let us know if that works.

> Is there any other, simpler way around this?

Yes, upgrade to Biopython 1.54 and this length check will just issue a warning and carry on (unlike Biopython 1.53, which issues an exception and stops).

Regards,

Peter

From msameet at gmail.com Mon Jun 28 23:43:30 2010
From: msameet at gmail.com (Sameet Mehta)
Date: Mon, 28 Jun 2010 19:43:30 -0400
Subject: [Biopython] problem parsing embl file
In-Reply-To: 
References: 
Message-ID: 

Hi Peter,

I think I figured out a way around this. I will of course try with Biopython 1.54, but the line that begins with SQ (the identifier for the region where the sequence starts) gives the length of the true sequence. Getting that information and replacing the one on the ID line was trivial. I have done it.

Thanks for the help.

Sameet

On Mon, Jun 28, 2010 at 6:06 PM, Peter wrote:
> On Mon, Jun 28, 2010 at 9:58 PM, Sameet Mehta wrote:
>> Yes, I am indeed running v 1.53
>
> That isn't our latest release (although it may be the latest
> release available as a package from your Linux distribution).
>
>> I will change that line and see if it works; it probably will. But I
>> have multiple such files to deal with and I don't know the lengths of
>> each of those.
>
> Please try that and let us know if that works.
>
>> Is there any other, simpler way around this?
>
> Yes, upgrade to Biopython 1.54 and this length check will
> just issue a warning and carry on (unlike Biopython 1.53,
> which issues an exception and stops).
>
> Regards,
>
> Peter
>
--
Sameet Mehta, Ph.D.,
Phone: (301) 842-4791

From lunt at ctbp.ucsd.edu Tue Jun 29 06:36:58 2010
From: lunt at ctbp.ucsd.edu (Bryan Lunt)
Date: Mon, 28 Jun 2010 23:36:58 -0700
Subject: [Biopython] BioJava-like seqres alignment for Bio.PDB
Message-ID: 

Greetings All,

Does anyone have any code for easy alignment between the SEQRES entry in a pdb file and the actual ATOM/HETATM entries in the chain?

In BioJava, this is just one of the options when you parse a PDB file; it would certainly be useful. Does anyone have any code for this? Shall I write it?

Thanks!
-Bryan Lunt

From reece at berkeley.edu Tue Jun 29 23:04:41 2010
From: reece at berkeley.edu (Reece Hart)
Date: Tue, 29 Jun 2010 16:04:41 -0700
Subject: [Biopython] BioJava-like seqres alignment for Bio.PDB
In-Reply-To: 
References: 
Message-ID: <4C2A7C09.2020204@berkeley.edu>

On 06/28/2010 11:36 PM, Bryan Lunt wrote:
> Does anyone have any code for easy alignment between the SEQRES entry
> in a pdb file and the actual ATOM/HETATM entries in the chain?
>
> In BioJava, this is just one of the options when you parse a PDB file;
> it would certainly be useful.
>

How does BioJava do this?

RCSB added this mapping explicitly in the XML formatted files several years ago. It looks like this:

  <pdbx_poly_seq_scheme seq_id="3">
    <auth_mon_id>SER</auth_mon_id>
    <auth_seq_num>145</auth_seq_num>
    <hetero>n</hetero>
    <ndb_seq_num>3</ndb_seq_num>
    <pdb_mon_id>SER</pdb_mon_id>
    <pdb_seq_num>145</pdb_seq_num>
    <pdb_strand_id>A</pdb_strand_id>
  </pdbx_poly_seq_scheme>

That is, sequence position 3 is resid position 145 in this protein.

In any case, having a function that provides this mapping (both directions) in BioPython would be extremely useful.
Thanks,
Reece

From biopython at maubp.freeserve.co.uk Wed Jun 30 13:44:10 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 30 Jun 2010 14:44:10 +0100
Subject: [Biopython] BioJava-like seqres alignment for Bio.PDB
In-Reply-To: <4C2A7C09.2020204@berkeley.edu>
References: <4C2A7C09.2020204@berkeley.edu>
Message-ID: 

On Wed, Jun 30, 2010 at 12:04 AM, Reece Hart wrote:
> On 06/28/2010 11:36 PM, Bryan Lunt wrote:
>>
>> Does anyone have any code for easy alignment between the SEQRES entry
>> in a pdb file and the actual ATOM/HETATM entries in the chain?
>>
>> In BioJava, this is just one of the options when you parse a PDB file;
>> it would certainly be useful.
>>
>
> How does BioJava do this?
>
> RCSB added this mapping explicitly in the XML formatted files several years
> ago. It looks like this:
>
>   <pdbx_poly_seq_scheme seq_id="3">
>     <auth_mon_id>SER</auth_mon_id>
>     <auth_seq_num>145</auth_seq_num>
>     <hetero>n</hetero>
>     <ndb_seq_num>3</ndb_seq_num>
>     <pdb_mon_id>SER</pdb_mon_id>
>     <pdb_seq_num>145</pdb_seq_num>
>     <pdb_strand_id>A</pdb_strand_id>
>   </pdbx_poly_seq_scheme>
>
> That is, sequence position 3 is resid position 145 in this protein.

That looks like a good reason to have a PDB XML parser (as trying to do this from the plain text PDB is probably fiddly).

> In any case, having a function that provides this mapping (both directions) in
> BioPython would be extremely useful.

Maybe something for the GSoC project TODO list? ;)

Peter

From lunt at ctbp.ucsd.edu Wed Jun 30 18:13:37 2010
From: lunt at ctbp.ucsd.edu (Bryan Lunt)
Date: Wed, 30 Jun 2010 11:13:37 -0700
Subject: [Biopython] Fwd: Re: BioJava-like seqres alignment for Bio.PDB
Message-ID: 

It is indeed fiddly, but I needed it immediately.

http://gist.github.com/459005

Basically you call mapAll( open("pdb1234.ent"), structureObjectPreviouslyReadWith-Bio.PDB ) and get back a dictionary of chain IDs -> lists of tuples, where position in the list corresponds to position in the seqres.

It _is_ ugly, and makes my eyes bleed. Sorry.

-Bryan
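[Editorial aside: the per-residue records Reece describes can also be pulled from the XML with the standard library alone, without any sequence alignment. This is only a sketch: the element names (`pdbx_poly_seq_scheme`, `pdb_mon_id`, `pdb_seq_num`, `pdb_strand_id`) follow RCSB's PDBML naming, but the namespace URI below is an assumption — take the real one from the xmlns declaration of the file you are parsing.]

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: namespace URI varies by PDBML schema version; read it from
# the xmlns declaration of the actual file rather than hard-coding it.
NS = "{http://pdbml.pdb.org/schema/pdbx-v40.xsd}"

def seqres_to_pdb_mapping(xml_path):
    """Map SEQRES position (seq_id) -> (residue name, PDB residue number,
    chain ID) using the pdbx_poly_seq_scheme records of a PDBML file."""
    mapping = {}
    tree = ET.parse(xml_path)
    for rec in tree.iter(NS + "pdbx_poly_seq_scheme"):
        seq_id = int(rec.get("seq_id"))
        mapping[seq_id] = (
            rec.findtext(NS + "pdb_mon_id"),    # e.g. "SER"
            rec.findtext(NS + "pdb_seq_num"),   # e.g. "145"
            rec.findtext(NS + "pdb_strand_id"), # e.g. "A"
        )
    return mapping
```

For Reece's example record this would map sequence position 3 to SER 145 on chain A; inverting the dictionary gives the reverse (PDB numbering to SEQRES position) direction.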