From zhigangwu.bgi at gmail.com Wed May 1 10:17:14 2013 From: zhigangwu.bgi at gmail.com (Zhigang Wu) Date: Wed, 1 May 2013 07:17:14 -0700 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: Hi Peter and all, Thanks for the long explanation. I got much better understand of this project though I am still confusing on how to implement the lazy-loading parser for feature rich files (EMBL, GenBank, GFF3). Since the deadline is pretty close,I decided to post my premature of proposal for this project. It would be great if you all can given me some comments and suggestions. The proposal is available here. Thank you all in advance. Zhigang On Sat, Apr 27, 2013 at 1:40 PM, Peter Cock wrote: > On Sat, Apr 27, 2013 at 8:22 PM, Zhigang Wu > wrote: > > Peter, > > > > Thanks for the detailed explanation. It's very helpful. I am not quite > > sure about the goal of the lazy-loading parser. > > Let me try to summarize what are the goals of lazy-loading and how > > lazy-loading would work. Please correct me if necessary. Below I use > > fasta/fastq file as an example. The idea should generally applies to > > other format such as GenBank/EMBL as you mentioned. > > > > Lazy-loading is useful under the assumption that given a large file, > > we are interested in partial information of it but not all of them. > > For example a fasta file contains Arabidopsis genome, we only > > interested in the sequence of chr5 from index position from 2000-3000. > > Rather than parsing the whole file and storing each record in memory > > as most parsers will do, during the indexing step, lazy loading > > parser will only store a few position information, such as access > > positions (readily usable for seek) for all chromosomes (chr1, chr2, > > chr3, chr4, chr5, ...) and may be position index information such as > > the access positions for every 1000bp positions for each sequence in > > the given file. After indexing, we store these information in a > > dictionary like following {'chr1':{0:access_pos, 1000:access_pos, > > 2000:access_pos, ...}, 'chr2':{0:access_pos, 1000:access_pos, > > 2000:access_pos,}, 'chr3'...}. > > > > Compared to the usual parser which tends to parsing the whole file, we > > gain two benefits: speed, less memory usage and random access. Speed > > is gained because we skipped a lot during the parsing step. Go back to > > my example, once we have the dictionary, we can just seek to the > > access position of chr5:2000 and start reading and parsing from there. > > Less memory usage is due to we only stores access positions for each > > record as a dictionary in memory. > > > > > > Best, > > > > Zhigang > > Hi Zhigang, > > Yes - that's the basic idea of a disk based lazy loader. Here > the data stays on the disk until needed, so generally this is > very low memory but can be slow as it needs to read from > the disk. And existing example already in Biopython is our > BioSQL bindings which present a SeqRecord subclass which > only retrieves values from the database on demand. > > Note in the case of FASTA, we might want to use the existing > FAI index files from Heng Li's faidx tool (or another existing > index scheme). That relies on each record using a consistent > line wrapping length, so that seek offsets can be easily > calculated. 
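As a rough illustration of that offset arithmetic (a hypothetical helper, not faidx or Biopython code): if a record's sequence starts at a known byte offset and is wrapped at, say, 60 bases per line (61 bytes with the newline), the file position of any base follows directly, so only per-record offsets and line lengths need to be held in memory - much like the dictionary of access positions described earlier in this thread.

    def seq_offset(seq_start, bases_per_line, bytes_per_line, position):
        # 0-based base position within one record's sequence
        full_lines, remainder = divmod(position, bases_per_line)
        return seq_start + full_lines * bytes_per_line + remainder

    # e.g. seq_offset(75, 60, 61, 2000) == 75 + 33*61 + 20 == 2108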
> > An alternative idea is to load the data into memory (so that the > file is not touched again, useful for stream processing where > you cannot seek within the input data) but it is only parsed into > Python objects on demand. This would use a lot more memory, > but should be faster as there is no disk seeking and reading > (other than the one initial read). For FASTA this wouldn't help > much but it might work for EMBL/GenBank. > > Something to beware of with any lazy loading / lazy parsing is > what happens if the user tries to edit the record? Do you want > to allow this (it makes the code more complex) or not (simpler > and still very useful). > > In terms of usage examples, for things like raw NGS data this > is (currently) made up of lots and lots of short sequences (under > 1000bp). Lazy loading here is unlikely to be very helpful - unless > perhaps you can make the FASTQ parser faster this way? > (Once the reads are assembled or mapped to a reference, > random access to lookup reads by their mapped location is > very very important, thus the BAI indexing of BAM files). > > In terms of this project, I was thinking about a SeqRecord > style interface extending Bio.SeqIO (but you can suggest > something different for your project). > > What I saw as the main use case here is large datasets like > whole chromosomes in FASTA format or richly annotated > formats like EMBL, GenBank or GFF3. Right now if I am > doing something with (for example) the annotated human > chromosomes, loading these as GenBank files is quite > slow (it takes a far amount of memory too, but that isn't > my main worry). A lazy loading approach should let me > 'load' the GenBank files almost instantly, and delay > reading specific features or sequence from the disk > until needed. > > For example, I might have a list of genes for which I wish > to extract the annotation or sequence for - and there is no > need to load all the other features or the rest of the genome. > > (Note we can already do this by loading GenBank files > into a BioSQL database, and access them that way) > > Regards, > > Peter > From chris.mit7 at gmail.com Wed May 1 10:40:26 2013 From: chris.mit7 at gmail.com (Chris Mitchell) Date: Wed, 1 May 2013 10:40:26 -0400 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: Hi Zhigang, I throw some comments on your proposal. As i said there, I think you need to find & look at a variety of gff/gtf files to see where your implementation breaks down. Also, for parsing, I would focus on optimizing the speed the user can access attributes, they're the bits people care most about (where is gene X, what is the FPKM of isoform y?, etc.) Chris On Wed, May 1, 2013 at 10:17 AM, Zhigang Wu wrote: > Hi Peter and all, > Thanks for the long explanation. > I got much better understand of this project though I am still confusing on > how to implement the lazy-loading parser for feature rich files (EMBL, > GenBank, GFF3). > Since the deadline is pretty close,I decided to post my premature of > proposal for this project. It would be great if you all can given me some > comments and suggestions. The proposal is available > here< > https://docs.google.com/document/d/1BgPRKTq7HXq1K6fb9U2TnN7VvSDDlSTsQN991okekzk/edit?usp=sharing > >. > Thank you all in advance. 
> > > Zhigang > > > > On Sat, Apr 27, 2013 at 1:40 PM, Peter Cock >wrote: > > > On Sat, Apr 27, 2013 at 8:22 PM, Zhigang Wu > > wrote: > > > Peter, > > > > > > Thanks for the detailed explanation. It's very helpful. I am not quite > > > sure about the goal of the lazy-loading parser. > > > Let me try to summarize what are the goals of lazy-loading and how > > > lazy-loading would work. Please correct me if necessary. Below I use > > > fasta/fastq file as an example. The idea should generally applies to > > > other format such as GenBank/EMBL as you mentioned. > > > > > > Lazy-loading is useful under the assumption that given a large file, > > > we are interested in partial information of it but not all of them. > > > For example a fasta file contains Arabidopsis genome, we only > > > interested in the sequence of chr5 from index position from 2000-3000. > > > Rather than parsing the whole file and storing each record in memory > > > as most parsers will do, during the indexing step, lazy loading > > > parser will only store a few position information, such as access > > > positions (readily usable for seek) for all chromosomes (chr1, chr2, > > > chr3, chr4, chr5, ...) and may be position index information such as > > > the access positions for every 1000bp positions for each sequence in > > > the given file. After indexing, we store these information in a > > > dictionary like following {'chr1':{0:access_pos, 1000:access_pos, > > > 2000:access_pos, ...}, 'chr2':{0:access_pos, 1000:access_pos, > > > 2000:access_pos,}, 'chr3'...}. > > > > > > Compared to the usual parser which tends to parsing the whole file, we > > > gain two benefits: speed, less memory usage and random access. Speed > > > is gained because we skipped a lot during the parsing step. Go back to > > > my example, once we have the dictionary, we can just seek to the > > > access position of chr5:2000 and start reading and parsing from there. > > > Less memory usage is due to we only stores access positions for each > > > record as a dictionary in memory. > > > > > > > > > Best, > > > > > > Zhigang > > > > Hi Zhigang, > > > > Yes - that's the basic idea of a disk based lazy loader. Here > > the data stays on the disk until needed, so generally this is > > very low memory but can be slow as it needs to read from > > the disk. And existing example already in Biopython is our > > BioSQL bindings which present a SeqRecord subclass which > > only retrieves values from the database on demand. > > > > Note in the case of FASTA, we might want to use the existing > > FAI index files from Heng Li's faidx tool (or another existing > > index scheme). That relies on each record using a consistent > > line wrapping length, so that seek offsets can be easily > > calculated. > > > > An alternative idea is to load the data into memory (so that the > > file is not touched again, useful for stream processing where > > you cannot seek within the input data) but it is only parsed into > > Python objects on demand. This would use a lot more memory, > > but should be faster as there is no disk seeking and reading > > (other than the one initial read). For FASTA this wouldn't help > > much but it might work for EMBL/GenBank. > > > > Something to beware of with any lazy loading / lazy parsing is > > what happens if the user tries to edit the record? Do you want > > to allow this (it makes the code more complex) or not (simpler > > and still very useful). 
> > > > In terms of usage examples, for things like raw NGS data this > > is (currently) made up of lots and lots of short sequences (under > > 1000bp). Lazy loading here is unlikely to be very helpful - unless > > perhaps you can make the FASTQ parser faster this way? > > (Once the reads are assembled or mapped to a reference, > > random access to lookup reads by their mapped location is > > very very important, thus the BAI indexing of BAM files). > > > > In terms of this project, I was thinking about a SeqRecord > > style interface extending Bio.SeqIO (but you can suggest > > something different for your project). > > > > What I saw as the main use case here is large datasets like > > whole chromosomes in FASTA format or richly annotated > > formats like EMBL, GenBank or GFF3. Right now if I am > > doing something with (for example) the annotated human > > chromosomes, loading these as GenBank files is quite > > slow (it takes a far amount of memory too, but that isn't > > my main worry). A lazy loading approach should let me > > 'load' the GenBank files almost instantly, and delay > > reading specific features or sequence from the disk > > until needed. > > > > For example, I might have a list of genes for which I wish > > to extract the annotation or sequence for - and there is no > > need to load all the other features or the rest of the genome. > > > > (Note we can already do this by loading GenBank files > > into a BioSQL database, and access them that way) > > > > Regards, > > > > Peter > > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From eric.talevich at gmail.com Wed May 1 11:46:43 2013 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 1 May 2013 11:46:43 -0400 Subject: [Biopython-dev] gsoc phylo project questions In-Reply-To: References: Message-ID: On Tue, Apr 30, 2013 at 3:20 AM, Yanbo Ye wrote: > Hi Eric, > > Again, thanks for your comment. It might be better to discuss here. > https://github.com/lijax/gsoc/commit/e969c82a5a0aef45bba1277ce01d6dbee03e6a84#commitcomment-3096321 > > I have changed my proposal and timeline based on your advice. I think I > was too optimistic that I didn't consider about the compatibility with > existing code or other potential problem that may exist. After careful > consideration, I removed one task from the goal list to make the time more > relaxed, the tree comparison(seems > I miss understood this). I might be able to complete all of them. But it's > better to make it as an extra task, to make sure this coding experience is > not a burden. > I agree it's best to commit to a feasible timeline and then reserve a few "stretch goals". Dropping the tree distance function is fine, as there are currently some other students who might develop this small module as a course project, independently of GSoC. In any case that functionality is independent of the other tasks you've proposed. > According to your comment: > > 1. I didn't know PyCogent and DendroPy. I'll refer to them for useful > solutions. > 2. For distance-based tree and consensus tree, I think there is no need > to use NumPy. And for consensus tree, my original plan is to implement a > binary class to count the clade with the same leaves for performance. As > you suggest, I'll implement a class with the same API and improve the > performance later, so that I can pay more attention to the Strict and Adam > Consensus algorithms. > Sounds good. > 3. 
I didn't find the distance matrix method for MSA on Phylo Cookbook > page, only from existing tree. > Ah, I think I misunderstood you earlier. Yes, for the NJ method you'll need to use a substitution matrix to compute pairwise distances from a multiple sequence alignment. This shouldn't be too challenging, though you might find the need to add a new matrix to the Bio.SubsMat module if you want to let the user choose something other than BLOSUM or PAM. 4. For parsimony tree search, I have already know how several heuristic > search algorithms work. Do I need to implement them all? > No, just choose a well-established one that you feel comfortable implementing. 5. I'm not clear about the radial layout and Felsenstein's Equal Daylight > algorithm. Isn't this algorithm one way of showing the radial layout? I'm > sorry that I'm not familiar with this layout. Can you give some figure > examples and references? > For radial tree layout: https://en.wikipedia.org/wiki/Radial_tree http://www.infosun.fim.uni-passau.de/~chris/down/DrawingPhyloTreesEA.pdf The paper above also explains an "angle spreading" refinement step to improve the appearance of radial trees, which you could opt to implement instead of Equal Daylight. The Equal Daylight algorithm seems to only be documented fully in the book "Inferring Phylogenies" and implemented in the "drawtree" program in Phylip. In the Phylip documentation, the radial layout algorithm is called "Equal Arc", and the layout provided by that algorithm is the starting point for Equal Daylight: http://evolution.genetics.washington.edu/phylip/doc/drawtree.html Cheers, Eric From albl500 at york.ac.uk Wed May 1 18:56:12 2013 From: albl500 at york.ac.uk (Alex Leach) Date: Wed, 01 May 2013 23:56:12 +0100 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: Dear all, I also left some minor comments on the proposal; I hope they're helpful and I wish you every success! You should focus on the proposal for now, but I thought I'd share a more presentable version of the fasta lazy-loader I wrote a couple of years ago. The focus at the time was to minimise memory usage and increase the speed of random access to fasta-formatted sequences, stored on disk. Only sequence accessions and file locations are stored in-memory (in a dict). Once the index has been populated, it can 'pickle' the dictionary to a file on disk, for later re-use. It doesn't exactly fulfill all of your needs, but I hope it might help you in the right direction.. Also, were there plans for making the lazy loader thread-safe? I've done it in the past by passing a `multiprocessing.Pipe` instance to a method (`pipe_sequences`) of the lazy loader. If redesigning the code, I'd try to implement a callback scheme, but passing a Pipe did the job.. Maybe it's outside the current scope of the project, but anyway, I put the module up on github if you want to check it out[1]. Cheers, Alex [1] - https://github.com/alexleach/fasta_lazy_loader/blob/master/fasta_lazy_loader.py From zhigang.wu at email.ucr.edu Thu May 2 04:14:04 2013 From: zhigang.wu at email.ucr.edu (Zhigang Wu) Date: Thu, 2 May 2013 01:14:04 -0700 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: Hi Alex, The idea of taking advantage of multiprocessing is great. I haven't touched this kind of thing before and I think it's going to be cool to integrate into the project. 
Best, Zhigang On Wed, May 1, 2013 at 3:56 PM, Alex Leach wrote: > Dear all, > > I also left some minor comments on the proposal; I hope they're helpful > and I wish you every success! > > You should focus on the proposal for now, but I thought I'd share a more > presentable version of the fasta lazy-loader I wrote a couple of years ago. > The focus at the time was to minimise memory usage and increase the speed > of random access to fasta-formatted sequences, stored on disk. Only > sequence accessions and file locations are stored in-memory (in a dict). > Once the index has been populated, it can 'pickle' the dictionary to a file > on disk, for later re-use. > > It doesn't exactly fulfill all of your needs, but I hope it might help you > in the right direction.. > > Also, were there plans for making the lazy loader thread-safe? I've done > it in the past by passing a `multiprocessing.Pipe` instance to a method > (`pipe_sequences`) of the lazy loader. If redesigning the code, I'd try to > implement a callback scheme, but passing a Pipe did the job.. Maybe it's > outside the current scope of the project, but anyway, I put the module up > on github if you want to check it out[1]. > > > Cheers, > Alex > > > [1] - https://github.com/alexleach/**fasta_lazy_loader/blob/master/** > fasta_lazy_loader.py > > ______________________________**_________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.**org > http://lists.open-bio.org/**mailman/listinfo/biopython-dev > From albl500 at york.ac.uk Thu May 2 05:08:23 2013 From: albl500 at york.ac.uk (Alex Leach) Date: Thu, 02 May 2013 10:08:23 +0100 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: On Thu, 02 May 2013 09:14:04 +0100, Zhigang Wu wrote: > Hi Alex, > > The idea of taking advantage of multiprocessing is great. I haven't > touched this kind of thing before and I think >it's going to be cool to > integrate into the project. Pleasure. Multiprocessing is quite a large topic, and the relevant library documentation also rather large[1-2]. If you haven't worked with multiprocessing before, it will probably take a long while before you're comfortable using the libraries involved. So if you were to mention it in the proposal, I'd keep it out of the core objectives, as you have a lot else on your plate, already. Don't know if anyone else has any thoughts on this, though? I could potentially help to provide some pointers, so if you have any questions I might be able to help with, please feel free to ask. Kind regards, Alex [1] - http://docs.python.org/2/library/multiprocessing.html [2] - http://docs.python.org/2/library/threading.html -------------- next part -------------- An HTML attachment was scrubbed... URL: From p.j.a.cock at googlemail.com Thu May 2 05:52:19 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 2 May 2013 10:52:19 +0100 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: On Wed, May 1, 2013 at 3:17 PM, Zhigang Wu wrote: > Hi Peter and all, > Thanks for the long explanation. > I got much better understand of this project though I am still confusing on > how to implement the lazy-loading parser for feature rich files (EMBL, > GenBank, GFF3). > Since the deadline is pretty close,I decided to post my premature of > proposal for this project. It would be great if you all can given me some > comments and suggestions. 
The proposal is available here. > https://docs.google.com/document/d/1BgPRKTq7HXq1K6fb9U2TnN7VvSDDlSTsQN991okekzk/edit?usp=sharing > Thank you all in advance. > > Zhigang Hi Zhigang, I've posted a few comment there, but it would be a good idea to put the draft on Google Melange soon. I see you've posted the Google Doc on the NESCent Google+ as well, good. Looking at the current draft, you don't yet have a timeline. This is vital - and it should include writing tests (as you write code - not all at the end) and documentation (which can come after the code). In the community bonding period you could write that you plan to setup your development environment including multiple versions of Python (at least Python 2.6, Python 3, Jython 2.7, and PyPy 2.0 to cover the main variants). For instance, it would make sense to start with learning about faidx and how its indexing works, and trying to reproduce it in Python code, and then wrapping that in a SeqRecord style API. Include writing and evaluating some benchmarks too - you may need to learn how to profile Python code for this, since speed and performance is one the reasons for wanting lazy loading (lower memory usage is the other main driver). That could be the first few weeks perhaps? Regards, Peter From p.j.a.cock at googlemail.com Thu May 2 06:37:31 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 2 May 2013 11:37:31 +0100 Subject: [Biopython-dev] Fwd: [PhyloSoC] Application deadline fast approaching In-Reply-To: References: Message-ID: Hi all, I'm forwarding this for any potential Google Summer of Code 2013 students and mentors - note you should also be signed up to the NESCent "Phyloinformatics Summer of Code" mailing list to make sure you don't miss any important information. Thanks, Peter ---------- Forwarded message ---------- From: Karen Cranston Date: Thu, May 2, 2013 at 12:39 AM Subject: [PhyloSoC] Application deadline fast approaching To: Phyloinformatics Summer of Code The student application deadline for GSoC is this Friday, May 3 at 19:00 UTC! Thanks to everyone for their expertise and enthusiasm so far. Expect much traffic in Melange and on the G+ page between now and the deadline. Please do help students (for your projects or others) improve their applications - either on the G+ page or via a public comment on Melange. The most common issue is a lack of detail in the project plan. You can point students to the wiki for examples from previous years. Feel free to ask for help on this list. We will send out more about assigning mentors / scoring after the application deadline. Cheers, Karen & Jim -- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Karen Cranston, PhD Training Coordinator and Informatics Project Manager nescent.org @kcranstn ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ _______________________________________________ PhyloSoC mailing list PhyloSoC at nescent.org https://lists.nescent.org/mailman/listinfo/phylosoc UNSUBSCRIBE: https://lists.nescent.org/mailman/options/phylosoc/p.j.a.cock%40googlemail.com?unsub=1&unsubconfirm=1 From p.j.a.cock at googlemail.com Thu May 2 08:54:52 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 2 May 2013 13:54:52 +0100 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: On Wed, May 1, 2013 at 3:17 PM, Zhigang Wu wrote: > Hi Peter and all, > Thanks for the long explanation. 
> I got much better understand of this project though I am still confusing on > how to implement the lazy-loading parser for feature rich files (EMBL, > GenBank, GFF3). Hi Zhigang, I'd considered two ideas for GenBank/EMBL, Lazy parsing of the feature table: The existing iterator approach reads in a GenBank file record by record, and parses everything into objects (a SeqRecord object with the sequence as a Seq object and the features as a list of SeqFeature objects). I did some profiling a while ago, and of this the feature processing is quite slow, therefore during the initial parse the features could be stored in memory as a list of strings, and only parsed into SeqFeature objects if the user tries to access the SeqRecord's feature property. It would require a fairly simple subclassing of the SeqRecord to make the features list into a property in order to populate the list of SeqFeatures when first accessed. In the situation where the user never uses the features, this should be much faster, and save some memory as well (that would need to be confirmed by measurement - but a list of strings should take less RAM than a list of SeqFeature objects with all the sub-objects like the locations and annotations). In the situation where the use does access the features, the simplest behaviour would be to process the cached raw feature table into a list of SeqFeature objects. The overall runtime and memory usage would be about what we have now. This would not require any file seeking, and could be used within the existing SeqIO interface where we make a single pass though the file for parsing - this is vital in order to cope with handles like stdin and network handles where you cannot seek backwards in the file. That is the simpler idea, some real benefits, but not too ambitious. If you are already familiar with the GenBank/EMBL file format and our current parser and the SeqRecord object, then I think a week is reasonable. A full index based approach would mean scanning the GenBank, EMBL or GFF file and recording information about where each feature is on disk (file offset) and the feature location coordinates. This could be recorded in an efficient index structure (I was thinking something based on BAM's BAI or Heng Li's improved version CSI). The idea here is that when the user wants to look at features in a particular region of the genome (e.g. they have a mutation or SNP in region 1234567 on chr5) then only the annotation in that part of the genome needs to be loaded from the disk. This would likely require API changes or additions, for example the SeqRecord currently holds the SeqFeature objects as a simple list - with no build in co-ordinate access. As I wrote in the original outline email, there is scope for a very ambitious project working in this area - but some of these ideas would require more background knowledge or preparation: http://lists.open-bio.org/pipermail/biopython-dev/2013-March/010469.html Anything looking to work with GFF (in the broad sense of GFF3 and/or GTF) would ideal incorporate Brad Chapman's existing work: http://biopython.org/wiki/GFF_Parsing Regards, Peter From albl500 at york.ac.uk Thu May 2 09:54:37 2013 From: albl500 at york.ac.uk (Alex Leach) Date: Thu, 02 May 2013 14:54:37 +0100 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: Hi again, Thought I'd contribute some thoughts... Hope I'm not intruding too much on the discussion. 
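To make the "features as a property" idea described above concrete, here is a minimal sketch of what such a subclass might look like. Only Bio.SeqRecord.SeqRecord itself is real Biopython code; the subclass is hypothetical and parse_feature_table() is a placeholder for the actual feature-table parsing step:

    from Bio.SeqRecord import SeqRecord

    class LazyFeatureSeqRecord(SeqRecord):
        # Keeps the raw feature table as text and only parses it into
        # SeqFeature objects the first time .features is accessed (sketch).

        def __init__(self, seq, raw_feature_table, **kwargs):
            SeqRecord.__init__(self, seq, **kwargs)
            self._raw_feature_table = raw_feature_table  # unparsed lines
            self._parsed_features = None

        @property
        def features(self):
            if self._parsed_features is None:
                # placeholder for the slow parsing step being deferred
                self._parsed_features = parse_feature_table(self._raw_feature_table)
            return self._parsed_features

        @features.setter
        def features(self, value):
            # allowing assignment side-steps the "what if the user edits
            # the record" question, at the cost of a little extra code
            self._parsed_features = value

A record built this way costs only the raw strings until .features is first touched; after that, runtime and memory use fall back to what the current parser gives.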
On Thu, 02 May 2013 13:54:52 +0100, Peter Cock wrote: > > It would require a fairly simple subclassing of the SeqRecord to make > the features list into a property in order to populate the list of > SeqFeatures when first accessed. > Yes. You can turn a class property into a function quite easily, using decorators. Here[1] is a pretty good example, description and justification. [1] - http://stackoverflow.com/questions/6618002/python-property-versus-getters-and-setters > In the situation where the user never uses the features, this should > be much faster, and save some memory as well (that would need to > be confirmed by measurement - but a list of strings should take less > RAM than a list of SeqFeature objects with all the sub-objects like > the locations and annotations). > > In the situation where the use does access the features, the simplest > behaviour would be to process the cached raw feature table into a > list of SeqFeature objects. The overall runtime and memory usage > would be about what we have now. This would not require any > file seeking, and could be used within the existing SeqIO interface > where we make a single pass though the file for parsing - this is > vital in order to cope with handles like stdin and network handles > where you cannot seek backwards in the file. I think the Pythonic way here would be to follow the "Easier to Ask for Forgiveness than to ask for Permission" (EAFP) idiom[2]. i.e. Try to seek the file handle first, and if that raises an IOError, catch the exception and continue to cache the input stream data, perhaps writing it to a temporary file on disk. [2] - http://docs.python.org/2/glossary.html#term-eafp > > That is the simpler idea, some real benefits, but not too ambitious. > If you are already familiar with the GenBank/EMBL file format and > our current parser and the SeqRecord object, then I think a week > is reasonable. > > A full index based approach would mean scanning the GenBank, > EMBL or GFF file and recording information about where each > feature is on disk (file offset) and the feature location coordinates. > This could be recorded in an efficient index structure (I was thinking > something based on BAM's BAI or Heng Li's improved version CSI). > The idea here is that when the user wants to look at features in a > particular region of the genome (e.g. they have a mutation or SNP > in region 1234567 on chr5) then only the annotation in that part > of the genome needs to be loaded from the disk. Thought I'd add that Blast uses SQL tables (in ISAM format) for maintaining indexes to their databases[3]. I'm not familiar with BioPython's BioSQL module at all, but a nice feature of sqlite is that you can hold temporary databases in memory[4]. [3] - http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/seqdbisam_8hpp.html [4] - http://docs.python.org/2/library/sqlite3.html#using-sqlite3-efficiently Cheers, Alex > > This would likely require API changes or additions, for example > the SeqRecord currently holds the SeqFeature objects as a > simple list - with no build in co-ordinate access. 
> > As I wrote in the original outline email, there is scope for a very > ambitious project working in this area - but some of these ideas > would require more background knowledge or preparation: > http://lists.open-bio.org/pipermail/biopython-dev/2013-March/010469.html > > Anything looking to work with GFF (in the broad sense of GFF3 > and/or GTF) would ideal incorporate Brad Chapman's existing > work: http://biopython.org/wiki/GFF_Parsing > > Regards, > > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev -- --- Alex Leach. BSc, MRes PhD Student Chong & Redeker Labs Department of Biology University of York YO10 5DD Tel: 07940 480 771 EMAIL DISCLAIMER: http://www.york.ac.uk/docs/disclaimer/email.htm From idoerg at gmail.com Thu May 2 12:12:12 2013 From: idoerg at gmail.com (Iddo Friedberg) Date: Thu, 2 May 2013 12:12:12 -0400 Subject: [Biopython-dev] Uniprot-GOA parser Message-ID: Does anybody have a GOA parser in the works? Currently writing a simple parser for GAF, GPA and GPI formats. Can contribute if there is interest. More on GOA: http://www.ebi.ac.uk/GOA Cheers, Iddo -- Iddo Friedberg http://iddo-friedberg.net/contact.html ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>----.<--.>++++++.<<<<------------------------------------. From p.j.a.cock at googlemail.com Thu May 2 12:18:17 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 2 May 2013 17:18:17 +0100 Subject: [Biopython-dev] Uniprot-GOA parser In-Reply-To: References: Message-ID: On Thu, May 2, 2013 at 5:12 PM, Iddo Friedberg wrote: > Does anybody have a GOA parser in the works? Currently writing a simple > parser for GAF, GPA and GPI formats. Can contribute if there is interest. > > More on GOA: http://www.ebi.ac.uk/GOA > > Cheers, > > Iddo Hi Iddo, I see they're now offering GPAD1.1 format (as well? instead?). Does targeting that make more sense in the long run? I know a few people on the list are or were looking at ontology support for Biopython... it would be good to add this. Regards, Peter From idoerg at gmail.com Thu May 2 12:19:39 2013 From: idoerg at gmail.com (Iddo Friedberg) Date: Thu, 2 May 2013 12:19:39 -0400 Subject: [Biopython-dev] Uniprot-GOA parser In-Reply-To: References: Message-ID: Yes, will do GPAD as well. Need to preserve the others though, due to legacy. ./I On Thu, May 2, 2013 at 12:18 PM, Peter Cock wrote: > On Thu, May 2, 2013 at 5:12 PM, Iddo Friedberg wrote: > > Does anybody have a GOA parser in the works? Currently writing a simple > > parser for GAF, GPA and GPI formats. Can contribute if there is interest. > > > > More on GOA: http://www.ebi.ac.uk/GOA > > > > Cheers, > > > > Iddo > > Hi Iddo, > > I see they're now offering GPAD1.1 format (as well? instead?). > Does targeting that make more sense in the long run? > > I know a few people on the list are or were looking at ontology > support for Biopython... it would be good to add this. > > Regards, > > Peter > -- Iddo Friedberg http://iddo-friedberg.net/contact.html ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>----.<--.>++++++.<<<<------------------------------------. 
From zhigang.wu at email.ucr.edu Thu May 2 17:18:43 2013 From: zhigang.wu at email.ucr.edu (Zhigang Wu) Date: Thu, 2 May 2013 14:18:43 -0700 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: Hi Chris and All, In your comments to my proposal, you mentioned that some GFF files may have a size of GBs. After seeing that comment, I just want to roughly know how large is a gff file people are often working with? I mainly work on plants and I am not quite familiar with animals. Below I listed out a list of animals and plants, to my knowledge from reading papers, which most people are working with. organism(genome size) size of gff url to the ftp *folder*(not a huge file so feel free to click it) arabidopsis(~120MB) 44MB ftp://ftp.arabidopsis.org/Maps/gbrowse_data/TAIR10/ rice(~450MB) 77MB here corn(3GB) 87MB http://ftp.maizesequence.org/release-5b/filtered-set/ D. melanogaster 450MB ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r5.50_FB2013_02/gff/ C. elegans (site going down) http://wiki.wormbase.org/index.php/Downloads#GFF2 H. sapiens(3G) 170MB here My point is that caching gff files in memory wasn't as bad as we have thought. Any comments or suggestion are welcome. Best, Zhigang On Wed, May 1, 2013 at 7:40 AM, Chris Mitchell wrote: > Hi Zhigang, > > I throw some comments on your proposal. As i said there, I think you need > to find & look at a variety of gff/gtf files to see where your > implementation breaks down. Also, for parsing, I would focus on optimizing > the speed the user can access attributes, they're the bits people care most > about (where is gene X, what is the FPKM of isoform y?, etc.) > > Chris > > > On Wed, May 1, 2013 at 10:17 AM, Zhigang Wu wrote: > >> Hi Peter and all, >> Thanks for the long explanation. >> I got much better understand of this project though I am still confusing >> on >> how to implement the lazy-loading parser for feature rich files (EMBL, >> GenBank, GFF3). >> Since the deadline is pretty close,I decided to post my premature of >> proposal for this project. It would be great if you all can given me some >> comments and suggestions. The proposal is available >> here< >> https://docs.google.com/document/d/1BgPRKTq7HXq1K6fb9U2TnN7VvSDDlSTsQN991okekzk/edit?usp=sharing >> >. >> >> Thank you all in advance. >> >> >> Zhigang >> >> >> >> On Sat, Apr 27, 2013 at 1:40 PM, Peter Cock > >wrote: >> >> > On Sat, Apr 27, 2013 at 8:22 PM, Zhigang Wu >> > wrote: >> > > Peter, >> > > >> > > Thanks for the detailed explanation. It's very helpful. I am not quite >> > > sure about the goal of the lazy-loading parser. >> > > Let me try to summarize what are the goals of lazy-loading and how >> > > lazy-loading would work. Please correct me if necessary. Below I use >> > > fasta/fastq file as an example. The idea should generally applies to >> > > other format such as GenBank/EMBL as you mentioned. >> > > >> > > Lazy-loading is useful under the assumption that given a large file, >> > > we are interested in partial information of it but not all of them. >> > > For example a fasta file contains Arabidopsis genome, we only >> > > interested in the sequence of chr5 from index position from 2000-3000. 
>> > > Rather than parsing the whole file and storing each record in memory >> > > as most parsers will do, during the indexing step, lazy loading >> > > parser will only store a few position information, such as access >> > > positions (readily usable for seek) for all chromosomes (chr1, chr2, >> > > chr3, chr4, chr5, ...) and may be position index information such as >> > > the access positions for every 1000bp positions for each sequence in >> > > the given file. After indexing, we store these information in a >> > > dictionary like following {'chr1':{0:access_pos, 1000:access_pos, >> > > 2000:access_pos, ...}, 'chr2':{0:access_pos, 1000:access_pos, >> > > 2000:access_pos,}, 'chr3'...}. >> > > >> > > Compared to the usual parser which tends to parsing the whole file, we >> > > gain two benefits: speed, less memory usage and random access. Speed >> > > is gained because we skipped a lot during the parsing step. Go back to >> > > my example, once we have the dictionary, we can just seek to the >> > > access position of chr5:2000 and start reading and parsing from there. >> > > Less memory usage is due to we only stores access positions for each >> > > record as a dictionary in memory. >> > > >> > > >> > > Best, >> > > >> > > Zhigang >> > >> > Hi Zhigang, >> > >> > Yes - that's the basic idea of a disk based lazy loader. Here >> > the data stays on the disk until needed, so generally this is >> > very low memory but can be slow as it needs to read from >> > the disk. And existing example already in Biopython is our >> > BioSQL bindings which present a SeqRecord subclass which >> > only retrieves values from the database on demand. >> > >> > Note in the case of FASTA, we might want to use the existing >> > FAI index files from Heng Li's faidx tool (or another existing >> > index scheme). That relies on each record using a consistent >> > line wrapping length, so that seek offsets can be easily >> > calculated. >> > >> > An alternative idea is to load the data into memory (so that the >> > file is not touched again, useful for stream processing where >> > you cannot seek within the input data) but it is only parsed into >> > Python objects on demand. This would use a lot more memory, >> > but should be faster as there is no disk seeking and reading >> > (other than the one initial read). For FASTA this wouldn't help >> > much but it might work for EMBL/GenBank. >> > >> > Something to beware of with any lazy loading / lazy parsing is >> > what happens if the user tries to edit the record? Do you want >> > to allow this (it makes the code more complex) or not (simpler >> > and still very useful). >> > >> > In terms of usage examples, for things like raw NGS data this >> > is (currently) made up of lots and lots of short sequences (under >> > 1000bp). Lazy loading here is unlikely to be very helpful - unless >> > perhaps you can make the FASTQ parser faster this way? >> > (Once the reads are assembled or mapped to a reference, >> > random access to lookup reads by their mapped location is >> > very very important, thus the BAI indexing of BAM files). >> > >> > In terms of this project, I was thinking about a SeqRecord >> > style interface extending Bio.SeqIO (but you can suggest >> > something different for your project). >> > >> > What I saw as the main use case here is large datasets like >> > whole chromosomes in FASTA format or richly annotated >> > formats like EMBL, GenBank or GFF3. 
Right now if I am >> > doing something with (for example) the annotated human >> > chromosomes, loading these as GenBank files is quite >> > slow (it takes a far amount of memory too, but that isn't >> > my main worry). A lazy loading approach should let me >> > 'load' the GenBank files almost instantly, and delay >> > reading specific features or sequence from the disk >> > until needed. >> > >> > For example, I might have a list of genes for which I wish >> > to extract the annotation or sequence for - and there is no >> > need to load all the other features or the rest of the genome. >> > >> > (Note we can already do this by loading GenBank files >> > into a BioSQL database, and access them that way) >> > >> > Regards, >> > >> > Peter >> > >> _______________________________________________ >> Biopython-dev mailing list >> Biopython-dev at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython-dev >> > > From zhigang.wu at email.ucr.edu Thu May 2 20:18:03 2013 From: zhigang.wu at email.ucr.edu (Zhigang Wu) Date: Thu, 2 May 2013 17:18:03 -0700 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: On Thu, May 2, 2013 at 5:54 AM, Peter Cock wrote: > On Wed, May 1, 2013 at 3:17 PM, Zhigang Wu > wrote: > > Hi Peter and all, > > Thanks for the long explanation. > > I got much better understand of this project though I am still confusing > on > > how to implement the lazy-loading parser for feature rich files (EMBL, > > GenBank, GFF3). > > Hi Zhigang, > > I'd considered two ideas for GenBank/EMBL, > > Lazy parsing of the feature table: The existing iterator approach reads > in a GenBank file record by record, and parses everything into objects > (a SeqRecord object with the sequence as a Seq object and the > features as a list of SeqFeature objects). I did some profiling a while > ago, and of this the feature processing is quite slow, therefore during > the initial parse the features could be stored in memory as a list of > strings, and only parsed into SeqFeature objects if the user tries to > access the SeqRecord's feature property. > > It would require a fairly simple subclassing of the SeqRecord to make > the features list into a property in order to populate the list of > SeqFeatures when first accessed. > > In the situation where the user never uses the features, this should > be much faster, and save some memory as well (that would need to > be confirmed by measurement - but a list of strings should take less > RAM than a list of SeqFeature objects with all the sub-objects like > the locations and annotations). > I agree. This would save some memory. > In the situation where the use does access the features, the simplest > behaviour would be to process the cached raw feature table into a > list of SeqFeature objects. The overall runtime and memory usage > would be about what we have now. This would not require any > file seeking, and could be used within the existing SeqIO interface > where we make a single pass though the file for parsing - this is > vital in order to cope with handles like stdin and network handles > where you cannot seek backwards in the file. > > Yes, I agree. So in this sense, the name "lazy-loading" is a little misleading. Because, this would load everything into memory at the beginning, while just delay in parsing any feature until a specific one is requested. Seems like "lazy parsing" would be more appropriate. 
That is the simpler idea, some real benefits, but not too ambitious. > If you are already familiar with the GenBank/EMBL file format and > our current parser and the SeqRecord object, then I think a week > is reasonable. > > No, I am not quite familiar with these. > A full index based approach would mean scanning the GenBank, > EMBL or GFF file and recording information about where each > feature is on disk (file offset) and the feature location coordinates. > This could be recorded in an efficient index structure (I was thinking > something based on BAM's BAI or Heng Li's improved version CSI). > The idea here is that when the user wants to look at features in a > particular region of the genome (e.g. they have a mutation or SNP > in region 1234567 on chr5) then only the annotation in that part > of the genome needs to be loaded from the disk. > > This would likely require API changes or additions, for example > the SeqRecord currently holds the SeqFeature objects as a > simple list - with no build in co-ordinate access. > > As I wrote in the original outline email, there is scope for a very > ambitious project working in this area - but some of these ideas > would require more background knowledge or preparation: > http://lists.open-bio.org/pipermail/biopython-dev/2013-March/010469.html > > Hmm, this is actually INDEXing a big file. Don't you think a little bit off topic, "lazy-loading parser". But this seems interesting and challenging and definitely going to be useful. > Anything looking to work with GFF (in the broad sense of GFF3 > and/or GTF) would ideal incorporate Brad Chapman's existing > work: http://biopython.org/wiki/GFF_Parsing > > Yes, I definitely will take a Brad's GFF parser. > Regards, > > Peter > Thanks for the long explanation again. Zhigang From yeyanbo289 at gmail.com Thu May 2 22:19:07 2013 From: yeyanbo289 at gmail.com (Yanbo Ye) Date: Fri, 3 May 2013 10:19:07 +0800 Subject: [Biopython-dev] Biopython Phylo Proposal Message-ID: Hi everyone, I forget to post my gsoc proposal page here. Any comment? http://www.google-melange.com/gsoc/proposal/review/google/gsoc2013/yeyanbo/1# Thanks, Yanbo -- ??? ???????????????? Ye Yanbo Bioinformatics Group, Wuhan Institute Of Virology, Chinese Academy of Sciences From Markus.Piotrowski at ruhr-uni-bochum.de Fri May 3 02:32:43 2013 From: Markus.Piotrowski at ruhr-uni-bochum.de (Markus Piotrowski) Date: 3 May 2013 08:32:43 +0200 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: Hi Zhigang, Sequence read files from Next Generation Sequencing methods are several GB large. Don't know if they are regulary stored in GFF files, anyhow. Best, Markus Am 2013-05-02 23:18, schrieb Zhigang Wu: > Hi Chris and All, > > In your comments to my proposal, you mentioned that some GFF files > may have > a size of GBs. > After seeing that comment, I just want to roughly know how large is a > gff > file people are often working with? > I mainly work on plants and I am not quite familiar with animals. > Below I listed out a list of animals and plants, to my knowledge from > reading papers, which most people are working with. > > organism(genome size) size of gff url to > the > ftp *folder*(not a huge file so feel free to click it) > arabidopsis(~120MB) 44MB > ftp://ftp.arabidopsis.org/Maps/gbrowse_data/TAIR10/ > rice(~450MB) 77MB > > here > corn(3GB) 87MB > http://ftp.maizesequence.org/release-5b/filtered-set/ > D. 
melanogaster 450MB > > ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r5.50_FB2013_02/gff/ > C. elegans (site going down) > http://wiki.wormbase.org/index.php/Downloads#GFF2 > H. sapiens(3G) 170MB > > here > > My point is that caching gff files in memory wasn't as bad as we have > thought. Any comments or suggestion are welcome. > > Best, > > > Zhigang > > > > > On Wed, May 1, 2013 at 7:40 AM, Chris Mitchell > wrote: > >> Hi Zhigang, >> >> I throw some comments on your proposal. As i said there, I think >> you need >> to find & look at a variety of gff/gtf files to see where your >> implementation breaks down. Also, for parsing, I would focus on >> optimizing >> the speed the user can access attributes, they're the bits people >> care most >> about (where is gene X, what is the FPKM of isoform y?, etc.) >> >> Chris >> >> >> On Wed, May 1, 2013 at 10:17 AM, Zhigang Wu >> wrote: >> >>> Hi Peter and all, >>> Thanks for the long explanation. >>> I got much better understand of this project though I am still >>> confusing >>> on >>> how to implement the lazy-loading parser for feature rich files >>> (EMBL, >>> GenBank, GFF3). >>> Since the deadline is pretty close,I decided to post my premature >>> of >>> proposal for this project. It would be great if you all can given >>> me some >>> comments and suggestions. The proposal is available >>> here< >>> >>> https://docs.google.com/document/d/1BgPRKTq7HXq1K6fb9U2TnN7VvSDDlSTsQN991okekzk/edit?usp=sharing >>> >. >>> >>> Thank you all in advance. >>> >>> >>> Zhigang >>> >>> >>> >>> On Sat, Apr 27, 2013 at 1:40 PM, Peter Cock >>> >> >wrote: >>> >>> > On Sat, Apr 27, 2013 at 8:22 PM, Zhigang Wu >>> >>> > wrote: >>> > > Peter, >>> > > >>> > > Thanks for the detailed explanation. It's very helpful. I am >>> not quite >>> > > sure about the goal of the lazy-loading parser. >>> > > Let me try to summarize what are the goals of lazy-loading and >>> how >>> > > lazy-loading would work. Please correct me if necessary. Below >>> I use >>> > > fasta/fastq file as an example. The idea should generally >>> applies to >>> > > other format such as GenBank/EMBL as you mentioned. >>> > > >>> > > Lazy-loading is useful under the assumption that given a large >>> file, >>> > > we are interested in partial information of it but not all of >>> them. >>> > > For example a fasta file contains Arabidopsis genome, we only >>> > > interested in the sequence of chr5 from index position from >>> 2000-3000. >>> > > Rather than parsing the whole file and storing each record in >>> memory >>> > > as most parsers will do, during the indexing step, lazy >>> loading >>> > > parser will only store a few position information, such as >>> access >>> > > positions (readily usable for seek) for all chromosomes (chr1, >>> chr2, >>> > > chr3, chr4, chr5, ...) and may be position index information >>> such as >>> > > the access positions for every 1000bp positions for each >>> sequence in >>> > > the given file. After indexing, we store these information in a >>> > > dictionary like following {'chr1':{0:access_pos, >>> 1000:access_pos, >>> > > 2000:access_pos, ...}, 'chr2':{0:access_pos, 1000:access_pos, >>> > > 2000:access_pos,}, 'chr3'...}. >>> > > >>> > > Compared to the usual parser which tends to parsing the whole >>> file, we >>> > > gain two benefits: speed, less memory usage and random access. >>> Speed >>> > > is gained because we skipped a lot during the parsing step. 
Go >>> back to >>> > > my example, once we have the dictionary, we can just seek to >>> the >>> > > access position of chr5:2000 and start reading and parsing from >>> there. >>> > > Less memory usage is due to we only stores access positions for >>> each >>> > > record as a dictionary in memory. >>> > > >>> > > >>> > > Best, >>> > > >>> > > Zhigang >>> > >>> > Hi Zhigang, >>> > >>> > Yes - that's the basic idea of a disk based lazy loader. Here >>> > the data stays on the disk until needed, so generally this is >>> > very low memory but can be slow as it needs to read from >>> > the disk. And existing example already in Biopython is our >>> > BioSQL bindings which present a SeqRecord subclass which >>> > only retrieves values from the database on demand. >>> > >>> > Note in the case of FASTA, we might want to use the existing >>> > FAI index files from Heng Li's faidx tool (or another existing >>> > index scheme). That relies on each record using a consistent >>> > line wrapping length, so that seek offsets can be easily >>> > calculated. >>> > >>> > An alternative idea is to load the data into memory (so that the >>> > file is not touched again, useful for stream processing where >>> > you cannot seek within the input data) but it is only parsed into >>> > Python objects on demand. This would use a lot more memory, >>> > but should be faster as there is no disk seeking and reading >>> > (other than the one initial read). For FASTA this wouldn't help >>> > much but it might work for EMBL/GenBank. >>> > >>> > Something to beware of with any lazy loading / lazy parsing is >>> > what happens if the user tries to edit the record? Do you want >>> > to allow this (it makes the code more complex) or not (simpler >>> > and still very useful). >>> > >>> > In terms of usage examples, for things like raw NGS data this >>> > is (currently) made up of lots and lots of short sequences (under >>> > 1000bp). Lazy loading here is unlikely to be very helpful - >>> unless >>> > perhaps you can make the FASTQ parser faster this way? >>> > (Once the reads are assembled or mapped to a reference, >>> > random access to lookup reads by their mapped location is >>> > very very important, thus the BAI indexing of BAM files). >>> > >>> > In terms of this project, I was thinking about a SeqRecord >>> > style interface extending Bio.SeqIO (but you can suggest >>> > something different for your project). >>> > >>> > What I saw as the main use case here is large datasets like >>> > whole chromosomes in FASTA format or richly annotated >>> > formats like EMBL, GenBank or GFF3. Right now if I am >>> > doing something with (for example) the annotated human >>> > chromosomes, loading these as GenBank files is quite >>> > slow (it takes a far amount of memory too, but that isn't >>> > my main worry). A lazy loading approach should let me >>> > 'load' the GenBank files almost instantly, and delay >>> > reading specific features or sequence from the disk >>> > until needed. >>> > >>> > For example, I might have a list of genes for which I wish >>> > to extract the annotation or sequence for - and there is no >>> > need to load all the other features or the rest of the genome. 
>>> > >>> > (Note we can already do this by loading GenBank files >>> > into a BioSQL database, and access them that way) >>> > >>> > Regards, >>> > >>> > Peter >>> > >>> _______________________________________________ >>> Biopython-dev mailing list >>> Biopython-dev at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biopython-dev >>> >> >> > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From p.j.a.cock at googlemail.com Mon May 6 07:23:24 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 6 May 2013 12:23:24 +0100 Subject: [Biopython-dev] Abstract for "Biopython Project Update" at BOSC 2013 In-Reply-To: References: Message-ID: On Tue, Apr 16, 2013 at 9:47 AM, Peter Cock wrote: > On Tue, Apr 16, 2013 at 1:43 AM, Eric Talevich wrote: >> >> The abstract looks good to me. Which release was the first to include >> SearchIO, was that 1.61? If so, maybe it would be good to note that in >> addition to the smaller improvements, SearchIO specifically was (one of?) >> the new module(s) that introduced the beta designation. >> > > Yes, SearchIO was included in Biopython 1.61, but you're right that > could be made a bit clearer. > The Biopython update has been accepted for a 10 minute talk slot at BOSC (anyone else with an abstract submitted should have had an email by now), the reviewers' feedback was short and positive: (A) Keep it short and show the variety of active sub-projects and people involved and the presentaion will will be attractive to the audience. The last year's talk is a good example (based on the shared slides). (Last year it was Eric at BOSC 2012 in Long Beach, CA - well done) (B) Nice to see latest news on BioPython and future directions of one of the most popular OpenBio project. (C) This talk reports an update on the BioPython project (support for experimental codes, Python 3 compatibility, SearchIO and genomic variant formats). BioPython is one of the central projects of O.B.F and its update is worth getting some attention at BOSC. We have until June to revise our abstract - so perhaps we should do the next release this month in May ;) Peter From idoerg at gmail.com Tue May 7 12:24:00 2013 From: idoerg at gmail.com (Iddo Friedberg) Date: Tue, 7 May 2013 12:24:00 -0400 Subject: [Biopython-dev] uniprot-GOA parse Message-ID: hi, As promised, I have written a uniprot-goa parser. Very skeletal, has iterators for reading the three uniprot-GOA file types, a write function, and a couple of usage examples. No github write access, so attaching. Cheers, Iddo -- Iddo Friedberg http://iddo-friedberg.net/contact.html ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>----.<--.>++++++.<<<<------------------------------------. -------------- next part -------------- A non-text attachment was scrubbed... Name: upg_parser.py Type: application/octet-stream Size: 10344 bytes Desc: not available URL: From p.j.a.cock at googlemail.com Tue May 7 12:47:16 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 7 May 2013 17:47:16 +0100 Subject: [Biopython-dev] uniprot-GOA parse In-Reply-To: References: Message-ID: On Tue, May 7, 2013 at 5:24 PM, Iddo Friedberg wrote: > hi, > > As promised, I have written a uniprot-goa parser. 
Very skeletal, has
> iterators for reading the three uniprot-GOA file types, a write function,
> and a couple of usage examples.
>
> No github write access, so attaching.

The file arrived :)

Did you have any thoughts on where in the namespace to put this?

The idea with github is you'd register an account, say iddux (since
that's your Twitter username), and then fork the repository as
https://github.com/iddux/biopython - and make a new branch there with
your changes, and ask for feedback or make a pull request. All that can
be done without any write access to the main repository, and is
intended to lower the barrier to entry.

In your case, given you're a past project leader etc, drop me (or Brad
etc) an email once you've mastered the git basics and we can give you
direct access.

Regards,

Peter

From natemsutton at yahoo.com Tue May 7 17:12:59 2013
From: natemsutton at yahoo.com (Nate Sutton)
Date: Tue, 7 May 2013 14:12:59 -0700 (PDT)
Subject: [Biopython-dev] Progress with ticket 3336
Message-ID: <1367961179.88206.YahooMailNeo at web122603.mail.ne1.yahoo.com>

Hi,

Here is a progress follow up to
http://lists.open-bio.org/pipermail/biopython-dev/2013-April/010548.html .
I have added a commit to the github branch that adds an option to create
clade branch lines using LineCollection. The LineCollection objects are
stored in a tuple before adding them to the plot. It's in
Bio/Phylo/_utils.py. Is this what the last bullet point was requesting in
https://redmine.open-bio.org/issues/3336 ?

Thanks!

Nate

P.S. I used a tuple to store the LineCollection objects instead of a list
because that was mentioned in the ticket, but if that looks like it should
be different let me know. Also, I got some global variables to work with
the code, but I was only able to do that after declaring them as globals
twice. If there are suggestions on how to code that differently let me
know.

From idoerg at gmail.com Wed May 8 19:28:17 2013
From: idoerg at gmail.com (Iddo Friedberg)
Date: Wed, 8 May 2013 19:28:17 -0400
Subject: [Biopython-dev] UniProt GOA parser
Message-ID:

A new uniprot-GOA parser is available for you to poke around:

https://github.com/idoerg/biopython/tree/uniprot-goa/Bio/UniProtGOA

More on Uniprot-GOA: http://www.ebi.ac.uk/GOA

There are three file formats: GAF (gene association file), GPA (gene
product association) and GPI (gene product information) explained here:
http://www.ebi.ac.uk/GOA/downloads

Input GAF files can be very large, due to the growth of uniprot GOA. If
you would like to test in a timely fashion, I suggest you get historical
files, which are smaller. Once you get to the > 40 version numbers, the
runtime for the example code in UniProtGOA.py goes over 2 minutes (on my
i5 machine).

Old GAF files are available here:
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/old/UNIPROT/

Current GPI and GPA files are not very large.

Thanks to Peter for his help on this.

Best,

Iddo

--
Iddo Friedberg
http://iddo-friedberg.net/contact.html
++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.>
++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----.
.>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>>
>>----.<--.>++++++.<<<<------------------------------------.
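For readers following the GAF discussion in the messages below, a minimal
illustrative sketch (this is not the code in Iddo's branch): the column
names roughly follow the published GAF 2.0 file specification, the batching
mirrors the "group consecutive lines by DB_Object_ID" idea raised later in
the thread, the offset dictionary illustrates the record-ID-to-file-offset
index later suggested for random access, and the file name is made up.

from itertools import groupby

# Column names roughly as in the GAF 2.0 specification (17 columns).
GAF20FIELDS = ["DB", "DB_Object_ID", "DB_Object_Symbol", "Qualifier",
               "GO_ID", "DB_Reference", "Evidence", "With", "Aspect",
               "DB_Object_Name", "Synonym", "DB_Object_Type",
               "Taxon_ID", "Date", "Assigned_By",
               "Annotation_Extension", "Gene_Product_Form_ID"]

def gaf_line_iterator(handle):
    """Yield one annotation line at a time as a field name -> value dict."""
    for line in handle:
        if line.startswith("!") or not line.strip():
            continue  # skip the GAF header and blank lines
        values = line.rstrip("\n").split("\t")
        yield dict(zip(GAF20FIELDS, values))

def gaf_record_iterator(handle):
    """Batch consecutive annotation lines sharing a DB_Object_ID."""
    for object_id, group in groupby(gaf_line_iterator(handle),
                                    key=lambda ann: ann["DB_Object_ID"]):
        yield object_id, list(group)

def index_gaf_offsets(handle):
    """Map each DB_Object_ID to the file offset of its first line."""
    offsets = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith("!") or not line.strip():
            continue
        offsets.setdefault(line.split("\t")[1], offset)
    return offsets

if __name__ == "__main__":
    # The file name is just an example of a small, species specific GAF file.
    with open("gene_association.goa_yeast") as handle:
        for object_id, annotations in gaf_record_iterator(handle):
            print("%s\t%i" % (object_id, len(annotations)))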
From p.j.a.cock at googlemail.com Fri May 10 06:06:19 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 10 May 2013 11:06:19 +0100 Subject: [Biopython-dev] UniProt GOA parser In-Reply-To: References: Message-ID: On Thu, May 9, 2013 at 12:28 AM, Iddo Friedberg wrote: > A new uniprot-GOA parser is available for you to poke around: > > https://github.com/idoerg/biopython/tree/uniprot-goa/Bio/UniProtGOA > I think for the namespace, we might be better off using Bio.UniProt.GOA, where Iddo's parser would be in Bio/UniProt/GOA.py and any other UniProt specific code could also go under Bio/UniProt - for example a web API. Some of Bio.SwissProt might also migrate here over time. > More on Uniprot-GOA: http://www.ebi.ac.uk/GOA > > There are three file formats: GAF (gene association file) , GPA (gene > product association) and GPI (gene product information) explained here: > http://www.ebi.ac.uk/GOA/downloads > > Input GAF files can be very large, due to the growth of uniprot GOA. If you > would like to test in a timely fashion, I suggest you get historical files, > which are smaller. Once you get to the > 40 version numbers, the runtime > for the example code in UniProtGOA.py goes over 2 minutes (on my i5 > machine). Would it make sense to want random access to the GOA files based on the identifier (DB_Object_ID and/or DB_Object_Symbol)? That should be fairly straight forward to do building on the indexing code for Bio.SeqIO and SearchIO. Note here I am picturing combining all the (consecutive) lines for the same DB_Object_ID - currently the parser is line based, but batching by DB_Object_ID would be a straightforward change and may better suit some uses. > Old GAF files are available here: > ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/old/UNIPROT/ > > Current GPI and GPA files are not very large. > > Thanks to Peter for his help on this. > > Best, > > Iddo Peter From idoerg at gmail.com Fri May 10 12:20:16 2013 From: idoerg at gmail.com (Iddo Friedberg) Date: Fri, 10 May 2013 12:20:16 -0400 Subject: [Biopython-dev] UniProt GOA parser In-Reply-To: References: Message-ID: On Fri, May 10, 2013 at 6:06 AM, Peter Cock wrote: > On Thu, May 9, 2013 at 12:28 AM, Iddo Friedberg wrote: > > A new uniprot-GOA parser is available for you to poke around: > > > > https://github.com/idoerg/biopython/tree/uniprot-goa/Bio/UniProtGOA > > > > I think for the namespace, we might be better off using Bio.UniProt.GOA, > where Iddo's parser would be in Bio/UniProt/GOA.py and any other > UniProt specific code could also go under Bio/UniProt - for example > a web API. > OK. > > Some of Bio.SwissProt might also migrate here over time. > > > More on Uniprot-GOA: http://www.ebi.ac.uk/GOA > > > > There are three file formats: GAF (gene association file) , GPA (gene > > product association) and GPI (gene product information) explained here: > > http://www.ebi.ac.uk/GOA/downloads > > > > Input GAF files can be very large, due to the growth of uniprot GOA. If > you > > would like to test in a timely fashion, I suggest you get historical > files, > > which are smaller. Once you get to the > 40 version numbers, the runtime > > for the example code in UniProtGOA.py goes over 2 minutes (on my i5 > > machine). > > Would it make sense to want random access to the GOA files based > on the identifier (DB_Object_ID and/or DB_Object_Symbol)? That > should be fairly straight forward to do building on the indexing code > for Bio.SeqIO and SearchIO. > Would that require reading it all into memory? 
Uniprot_GOA files are huge, it is impractical to read them in fully. > > Note here I am picturing combining all the (consecutive) lines > for the same DB_Object_ID - currently the parser is line based, > but batching by DB_Object_ID would be a straightforward change > and may better suit some uses. > Perhaps only for organism specific file, which in some cases can be read fully into memory. > > > Old GAF files are available here: > > ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/old/UNIPROT/ > > > > Current GPI and GPA files are not very large. > > > > Thanks to Peter for his help on this. > > > > Best, > > > > Iddo > > Peter > -- Iddo Friedberg http://iddo-friedberg.net/contact.html ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>----.<--.>++++++.<<<<------------------------------------. From p.j.a.cock at googlemail.com Fri May 10 12:26:13 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 10 May 2013 17:26:13 +0100 Subject: [Biopython-dev] UniProt GOA parser In-Reply-To: References: Message-ID: On Fri, May 10, 2013 at 5:20 PM, Iddo Friedberg wrote: > On Fri, May 10, 2013 at 6:06 AM, Peter Cock wrote: >> >> Would it make sense to want random access to the GOA files based >> on the identifier (DB_Object_ID and/or DB_Object_Symbol)? That >> should be fairly straight forward to do building on the indexing code >> for Bio.SeqIO and SearchIO. > > > Would that require reading it all into memory? Uniprot_GOA files > are huge, it is impractical to read them in fully. Not at all - we'd record a dictionary mapping the record ID to an offset in the file on disk, or record this mapping in an SQLite index file. >> Note here I am picturing combining all the (consecutive) lines >> for the same DB_Object_ID - currently the parser is line based, >> but batching by DB_Object_ID would be a straightforward change >> and may better suit some uses. > > Perhaps only for organism specific file, which in some cases can > be read fully into memory. The examples I looked at only seemed to have a dozen or so lines for each DB_Object_ID - but perhaps these were easy cases? How many lines per DB_Object_ID in the worst cases? Peter From idoerg at gmail.com Fri May 10 12:32:43 2013 From: idoerg at gmail.com (Iddo Friedberg) Date: Fri, 10 May 2013 12:32:43 -0400 Subject: [Biopython-dev] UniProt GOA parser In-Reply-To: References: Message-ID: On Fri, May 10, 2013 at 12:26 PM, Peter Cock wrote: > On Fri, May 10, 2013 at 5:20 PM, Iddo Friedberg wrote: > > On Fri, May 10, 2013 at 6:06 AM, Peter Cock wrote: > >> > >> Would it make sense to want random access to the GOA files based > >> on the identifier (DB_Object_ID and/or DB_Object_Symbol)? That > >> should be fairly straight forward to do building on the indexing code > >> for Bio.SeqIO and SearchIO. > > > > > > Would that require reading it all into memory? Uniprot_GOA files > > are huge, it is impractical to read them in fully. > > Not at all - we'd record a dictionary mapping the record ID to an offset > in the file on disk, or record this mapping in an SQLite index file. > Ok, that's good then > >> Note here I am picturing combining all the (consecutive) lines > >> for the same DB_Object_ID - currently the parser is line based, > >> but batching by DB_Object_ID would be a straightforward change > >> and may better suit some uses. 
> > > > Perhaps only for organism specific file, which in some cases can > > be read fully into memory. > > The examples I looked at only seemed to have a dozen or so > lines for each DB_Object_ID - but perhaps these were easy > cases? How many lines per DB_Object_ID in the worst cases? > > Peter > I was actually thinking you are suggesting that the whole file should be read in memory, nit just buffer by DB-Object_ID. My mistake. -- Iddo Friedberg http://iddo-friedberg.net/contact.html ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>----.<--.>++++++.<<<<------------------------------------. From linxzh1989 at gmail.com Sun May 12 08:57:25 2013 From: linxzh1989 at gmail.com (=?GB2312?B?wdbQ0Nba?=) Date: Sun, 12 May 2013 20:57:25 +0800 Subject: [Biopython-dev] Errors about installing biopython 1.61 In-Reply-To: References: Message-ID: I am very Sorry about my mistake. I want to install biopython 1.61 in a local server(CentOS), python setup.py build python setup.py test and then showed some errors: ====================================================================== FAIL: Test an input file containing a single sequence. ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Clustalw_tool.py", line 166, in test_single_sequence self.assertTrue(str(err) == "No records found in handle") AssertionError ====================================================================== ERROR: Test Entrez.read from URL ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez_online.py", line 34, in test_read_from_url rec = Entrez.read(einfo) File "/share/fg2/Linxzh/biopython-1.61/build/lib.linux-x86_64-2.6/Bio/Entrez/__init__.py", line 362, in read record = handler.read(handle) File "/share/fg2/Linxzh/biopython-1.61/build/lib.linux-x86_64-2.6/Bio/Entrez/Parser.py", line 184, in read self.parser.ParseFile(handle) File "/share/fg2/Linxzh/biopython-1.61/build/lib.linux-x86_64-2.6/Bio/Entrez/Parser.py", line 322, in endElementHandler raise RuntimeError(value) RuntimeError: Unable to open connection to #DbInfo?dbaf= ====================================================================== ERROR: Run tutorial doctests. ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Tutorial.py", line 152, in test_doctests ValueError: 4 Tutorial doctests failed: test_from_line_05671, test_from_line_06030, test_from_line_06190, test_from_line_06479 ---------------------------------------------------------------------- Ran 213 tests in 1621.002 seconds FAILED (failures = 3) i use python 2.6.5 2013/5/12 ?????? : > I've run the From saketkc at gmail.com Sun May 12 14:11:46 2013 From: saketkc at gmail.com (Saket Choudhary) Date: Sun, 12 May 2013 23:41:46 +0530 Subject: [Biopython-dev] samtools threaded daemon In-Reply-To: References: <516653BE.8060509@brueffer.de> Message-ID: Just completed writing samtools wrapper : https://github.com/biopython/biopython/pull/180 Unit Tests pending. 
On 11 April 2013 23:51, Chris Mitchell wrote: > Here's the branch I'm starting with, including a working mpileup daemon for > those who want to use it: > > https://github.com/chrismit/biopython/tree/samtools > > sample usage: > from Bio.SamTools import SamTools > sTools = '/home/chris/bin/samtools' > hg19 = '/media/chris/ChrisSSD/ref/human/hg19.fa' > bamSource = '/media/chris/ChrisSSD/TH1Alignment/NK/accepted_hits.bam' > st = SamTools(bamSource,binary=sTools,threads=30) > > #now with a callback, which is advisable to use to process data as it is > generated > def processPileup(pileup): > print 'to process',pileup > > #st.mpileup(f=hg19,r=['chr1:%d-%d'%(i,i+1) for i in > xrange(2000001,2001001)],callback=processPileup) #with callback > #print st.mpileup(f=hg19,r=['chr1:%d-%d'%(i,i+1) for i in > xrange(2000001,2000101)]) #will just return as a list > > > On Thu, Apr 11, 2013 at 10:04 AM, Chris Mitchell wrote: > >> Given that we'd be chasing after the samtools development cycle, I think >> it's just easier to implement command line wrappers that are dynamic enough >> to handle future versions. For instance, some of the code doesn't seem too >> set in stone and appears empirical (the BAQ computation comes to mind) and >> therefore probable to change in future versions. I can package in my >> existing pileup parser, but in general I think most people will be using a >> callback routine to handle it themselves since use cases of the final >> output sort of vary project by project. >> >> Chris >> >> >> On Thu, Apr 11, 2013 at 9:54 AM, Peter Cock wrote: >> >>> On Thu, Apr 11, 2013 at 2:46 PM, Chris Mitchell >>> wrote: >>> > Also, if a binary can't be found, having it fallback to the future >>> > BioPython parser seems like it might be a good idea (provided it has >>> > similar functionality like creating pileups, does it?). >>> >>> It has the low level random access via the BAI index done, but >>> does not yet have a reimplementation of the mpileup code, no. >>> (Would that be useful compared to calling samtools and parsing >>> its output?) >>> >>> Peter >>> >> >> > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From linxzh1989 at gmail.com Sun May 12 21:41:30 2013 From: linxzh1989 at gmail.com (=?GB2312?B?wdbQ0Nba?=) Date: Mon, 13 May 2013 09:41:30 +0800 Subject: [Biopython-dev] Errors about installing biopython 1.61 In-Reply-To: References: Message-ID: 2013/5/13 Peter Cock : > On Sun, May 12, 2013 at 1:57 PM, ?????? wrote: >> I want to install biopython 1.61 in a local server(CentOS), >> python setup.py build >> python setup.py test >> and then showed some errors: >> >> ... >> >> i use python 2.6.5 >> > > Thank you for getting in touch, and including the important > information about the operating system, version of Python > and version of Biopython. > >> FAIL: Test an input file containing a single sequence. >> ---------------------------------------------------------------------- >> Traceback (most recent call last): >> File "test_Clustalw_tool.py", line 166, in test_single_sequence >> self.assertTrue(str(err) == "No records found in handle") >> AssertionError >> > > This test calls the command line tool clustalw. > > What version of clustalw do you have? 
> >> ERROR: Test Entrez.read from URL >> ---------------------------------------------------------------------- >> Traceback (most recent call last): >> File "test_Entrez_online.py", line 34, in test_read_from_url >> rec = Entrez.read(einfo) >> File "/share/fg2/Linxzh/biopython-1.61/build/lib.linux-x86_64-2.6/Bio/Entrez/__init__.py", >> line 362, in read >> record = handler.read(handle) >> File "/share/fg2/Linxzh/biopython-1.61/build/lib.linux-x86_64-2.6/Bio/Entrez/Parser.py", >> line 184, in read >> self.parser.ParseFile(handle) >> File "/share/fg2/Linxzh/biopython-1.61/build/lib.linux-x86_64-2.6/Bio/Entrez/Parser.py", >> line 322, in endElementHandler >> raise RuntimeError(value) >> RuntimeError: Unable to open connection to #DbInfo?dbaf= >> > > This test connects to the NCBI Entrez server over the internet. > This kind of error is usually a temporary network problem, and > will go away if you repeat the test later. > >> ERROR: Run tutorial doctests. >> ---------------------------------------------------------------------- >> Traceback (most recent call last): >> File "test_Tutorial.py", line 152, in test_doctests >> ValueError: 4 Tutorial doctests failed: test_from_line_05671, >> test_from_line_06030, test_from_line_06190, test_from_line_06479 > > Those four failing examples in the Tutorial seem to match this > commit, made just before the Biopython 1.61 release: > > https://github.com/biopython/biopython/commit/b84bda01bd22e93a1cf71613a55 February 2013 (Biopython 1.61)cfca876b7128d7#Doc/Tutorial.tex > > Where did you get the Biopython 1.61 files from? e.g. The zip file > or tar.gz file on our website? Perhaps I accidentally included an > older copy of the Doc/Tutorial.tex file? Could you look for the > "Late Update" line in your Tutorial.tex file for me - does it say: > > \date{Last Update -- 5 February 2013 (Biopython 1.61)} > > Thanks, > > Peter Hi??Peter?? Clustalw I am using is 1.83. I've found the 'Late Update' in Tutorial.tex, it's ' \date{Last Update -- 5 February 2013 (Biopython 1.61)}'. I downloaded the tar.gz from the biopython website. Thanks Lin From p.j.a.cock at googlemail.com Mon May 13 04:49:20 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 13 May 2013 09:49:20 +0100 Subject: [Biopython-dev] Errors about installing biopython 1.61 In-Reply-To: References: Message-ID: On Mon, May 13, 2013 at 2:41 AM, ??? wrote: > >> Where did you get the Biopython 1.61 files from? e.g. The zip file >> or tar.gz file on our website? Perhaps I accidentally included an >> older copy of the Doc/Tutorial.tex file? Could you look for the >> "Late Update" line in your Tutorial.tex file for me - does it say: >> >> \date{Last Update -- 5 February 2013 (Biopython 1.61)} >> >> Thanks, >> >> Peter > > Hi?Peter? > Clustalw I am using is 1.83. Hi Lin, I also have clustalw 1.83, so this isn't simply a version problem. It could be something subtle about the locale - what language is your CentOS running in (that can alter error messages etc)? > I've found the 'Late Update' in Tutorial.tex, it's ' \date{Last Update -- 5 > February 2013 (Biopython 1.61)}'. That's good - that's what it should say :) (Sorry late/last was my typing error). > > I downloaded the tar.gz from the biopython website. > Thanks. I could reproduce the test_Tutorial.py problem with that. This is easy to explain - I forgot to include the test file my_blast.xml when doing the release (and you are the first person to report this problem). 
I should have noticed this myself, sorry :( I've fixed this ready for the next release - thank you for reporting this: https://github.com/biopython/biopython/commit/c1b63b88dd5a50fa3f6f2aef840a51fe9092e0c5 If you want to, you can get the missing file from here: http://biopython.org/SRC/Doc/examples/my_blast.xml or: https://github.com/biopython/biopython/raw/master/Doc/examples/my_blast.xml If you save that in the Biopython 1.61 source under Doc/examples then the Tutorial test should pass. -- Did you retry the test_Entrez_online.py example to see if this was a temporary problem? -- The good news is these minor issues should not cause you any problems installing and using Biopython 1.61 - so you can go ahead and run 'python setup.py install. Thanks, Peter From linxzh1989 at gmail.com Mon May 13 10:34:31 2013 From: linxzh1989 at gmail.com (=?GB2312?B?wdbQ0Nba?=) Date: Mon, 13 May 2013 22:34:31 +0800 Subject: [Biopython-dev] Errors about installing biopython 1.61 In-Reply-To: References: Message-ID: 2013/5/13 Peter Cock > On Mon, May 13, 2013 at 2:41 AM, ?????? wrote: > > > >> Where did you get the Biopython 1.61 files from? e.g. The zip file > >> or tar.gz file on our website? Perhaps I accidentally included an > >> older copy of the Doc/Tutorial.tex file? Could you look for the > >> "Late Update" line in your Tutorial.tex file for me - does it say: > >> > >> \date{Last Update -- 5 February 2013 (Biopython 1.61)} > >> > >> Thanks, > >> > >> Peter > > > > Hi??Peter?? > > Clustalw I am using is 1.83. > > Hi Lin, > > I also have clustalw 1.83, so this isn't simply a version > problem. It could be something subtle about the locale - > what language is your CentOS running in (that can alter > error messages etc)? > > > I've found the 'Late Update' in Tutorial.tex, it's ' \date{Last Update > -- 5 > > February 2013 (Biopython 1.61)}'. > > That's good - that's what it should say :) > > (Sorry late/last was my typing error). > > > > > I downloaded the tar.gz from the biopython website. > > > > Thanks. I could reproduce the test_Tutorial.py problem with that. > This is easy to explain - I forgot to include the test file my_blast.xml > when doing the release (and you are the first person to report this > problem). I should have noticed this myself, sorry :( > > I've fixed this ready for the next release - thank you for reporting this: > > https://github.com/biopython/biopython/commit/c1b63b88dd5a50fa3f6f2aef840a51fe9092e0c5 > > If you want to, you can get the missing file from here: > http://biopython.org/SRC/Doc/examples/my_blast.xml > > or: > https://github.com/biopython/biopython/raw/master/Doc/examples/my_blast.xml > > If you save that in the Biopython 1.61 source under Doc/examples > then the Tutorial test should pass. > > -- > > Did you retry the test_Entrez_online.py example to see if > this was a temporary problem? > > -- > > The good news is these minor issues should not cause you > any problems installing and using Biopython 1.61 - so you > can go ahead and run 'python setup.py install. > > Thanks, > > Peter > Hi Peter I have run the locale in my serve $ locale LANG=en_US.UTF-8 LC_CTYPE=zh_CN.UTF-8 LC_NUMERIC=zh_CN.UTF-8 LC_TIME=zh_CN.UTF-8 LC_COLLATE="en_US.UTF-8" LC_MONETARY=zh_CN.UTF-8 LC_MESSAGES="en_US.UTF-8" LC_PAPER=zh_CN.UTF-8 LC_NAME=zh_CN.UTF-8 LC_ADDRESS=zh_CN.UTF-8 LC_TELEPHONE=zh_CN.UTF-8 LC_MEASUREMENT=zh_CN.UTF-8 LC_IDENTIFICATION=zh_CN.UTF-8 LC_ALL= Is that locale you want? I retryed the the test_Entrez_online.py, it's all right now. 
As you said, it should be a connection problem. I have put the file in the Doc/examples file, but the error still exists. And i find there is no my_blat.psl in Doc/examples comparing with the zip file i downloaded from github. After i put the my_blat.psi in the Doc/examples, the error did not show up again. Thanks Lin From p.j.a.cock at googlemail.com Mon May 13 11:50:26 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 13 May 2013 16:50:26 +0100 Subject: [Biopython-dev] Errors about installing biopython 1.61 In-Reply-To: References: Message-ID: On Mon, May 13, 2013 at 3:34 PM, ??? wrote: > > Hi Peter > I have run the locale in my serve > > $ locale > LANG=en_US.UTF-8 > LC_CTYPE=zh_CN.UTF-8 > LC_NUMERIC=zh_CN.UTF-8 > LC_TIME=zh_CN.UTF-8 > LC_COLLATE="en_US.UTF-8" > LC_MONETARY=zh_CN.UTF-8 > LC_MESSAGES="en_US.UTF-8" > LC_PAPER=zh_CN.UTF-8 > LC_NAME=zh_CN.UTF-8 > LC_ADDRESS=zh_CN.UTF-8 > LC_TELEPHONE=zh_CN.UTF-8 > LC_MEASUREMENT=zh_CN.UTF-8 > LC_IDENTIFICATION=zh_CN.UTF-8 > LC_ALL= > > Is that locale you want? Hi Lin, Thanks for checking that, but having looked in more detail I think this is not related to the locale settings. My first guess was wrong :( I think I may have solved this - my test machine has both clustalw 2.1 and clustalw 1.83, and they behave differently for this example. The old test only worked with v2.1, fixed: https://github.com/biopython/biopython/commit/859d07f3c5e8b789156a5ec2e98f4153ab896e00 If you want to verify this, you could update your copy of Tests/test_Clustalw_tool.py to that from github (or just tried installing the latest Biopython code from github?). Note the Clustal developers intended that clustalw 1 and 2 would behave the same as each other (Version 2 was a rewrite as a step towards version 3, no called ClustalOmega), but there are still some minor differences. > I retryed the the test_Entrez_online.py, it's all right now. As > you said, it should be a connection problem. OK, good. > I have put the file in the Doc/examples file, but the error still exists. > And i find there is no my_blat.psl in Doc/examples comparing with the zip > file i downloaded from github. After i put the my_blat.psi in the > Doc/examples, the error did not show up again. Thank you, that should be fixed in the next release: https://github.com/biopython/biopython/commit/a3bb49b56abb5cbb9a0a00accb57674115c7004d Your feedback has been very helpful, Thanks, Peter From linxzh1989 at gmail.com Mon May 13 21:32:23 2013 From: linxzh1989 at gmail.com (=?GB2312?B?wdbQ0Nba?=) Date: Tue, 14 May 2013 09:32:23 +0800 Subject: [Biopython-dev] Errors about installing biopython 1.61 In-Reply-To: References: Message-ID: Hi Peter I copy the test_Clustalw_tool.py from the github, now it does work. Thank you! Lin 2013/5/13 Peter Cock > On Mon, May 13, 2013 at 3:34 PM, ?????? wrote: > > > > Hi Peter > > I have run the locale in my serve > > > > $ locale > > LANG=en_US.UTF-8 > > LC_CTYPE=zh_CN.UTF-8 > > LC_NUMERIC=zh_CN.UTF-8 > > LC_TIME=zh_CN.UTF-8 > > LC_COLLATE="en_US.UTF-8" > > LC_MONETARY=zh_CN.UTF-8 > > LC_MESSAGES="en_US.UTF-8" > > LC_PAPER=zh_CN.UTF-8 > > LC_NAME=zh_CN.UTF-8 > > LC_ADDRESS=zh_CN.UTF-8 > > LC_TELEPHONE=zh_CN.UTF-8 > > LC_MEASUREMENT=zh_CN.UTF-8 > > LC_IDENTIFICATION=zh_CN.UTF-8 > > LC_ALL= > > > > Is that locale you want? > > Hi Lin, > > Thanks for checking that, but having looked in more detail > I think this is not related to the locale settings. 
My first guess > was wrong :( > > I think I may have solved this - my test machine has both > clustalw 2.1 and clustalw 1.83, and they behave differently > for this example. The old test only worked with v2.1, fixed: > > https://github.com/biopython/biopython/commit/859d07f3c5e8b789156a5ec2e98f4153ab896e00 > > If you want to verify this, you could update your copy of > Tests/test_Clustalw_tool.py to that from github (or just > tried installing the latest Biopython code from github?). > > Note the Clustal developers intended that clustalw 1 and 2 > would behave the same as each other (Version 2 was a > rewrite as a step towards version 3, no called ClustalOmega), > but there are still some minor differences. > > > I retryed the the test_Entrez_online.py, it's all right now. As > > you said, it should be a connection problem. > > OK, good. > > > I have put the file in the Doc/examples file, but the error still exists. > > And i find there is no my_blat.psl in Doc/examples comparing with the zip > > file i downloaded from github. After i put the my_blat.psi in the > > Doc/examples, the error did not show up again. > > Thank you, that should be fixed in the next release: > > https://github.com/biopython/biopython/commit/a3bb49b56abb5cbb9a0a00accb57674115c7004d > > Your feedback has been very helpful, > > Thanks, > > Peter > From idoerg at gmail.com Fri May 17 17:35:41 2013 From: idoerg at gmail.com (Iddo Friedberg) Date: Fri, 17 May 2013 17:35:41 -0400 Subject: [Biopython-dev] UniProt GOA parser In-Reply-To: References: Message-ID: OK. I added a few changes as suggested by Peter. There is a parser now to group GAF files by DB_Object_ID, and a write function to write them. Random access not implemented yet. On Fri, May 10, 2013 at 12:32 PM, Iddo Friedberg wrote: > > > On Fri, May 10, 2013 at 12:26 PM, Peter Cock wrote: > >> On Fri, May 10, 2013 at 5:20 PM, Iddo Friedberg wrote: >> > On Fri, May 10, 2013 at 6:06 AM, Peter Cock wrote: >> >> >> >> Would it make sense to want random access to the GOA files based >> >> on the identifier (DB_Object_ID and/or DB_Object_Symbol)? That >> >> should be fairly straight forward to do building on the indexing code >> >> for Bio.SeqIO and SearchIO. >> > >> > >> > Would that require reading it all into memory? Uniprot_GOA files >> > are huge, it is impractical to read them in fully. >> >> Not at all - we'd record a dictionary mapping the record ID to an offset >> in the file on disk, or record this mapping in an SQLite index file. >> > > Ok, that's good then > > >> >> Note here I am picturing combining all the (consecutive) lines >> >> for the same DB_Object_ID - currently the parser is line based, >> >> but batching by DB_Object_ID would be a straightforward change >> >> and may better suit some uses. >> > >> > Perhaps only for organism specific file, which in some cases can >> > be read fully into memory. >> >> The examples I looked at only seemed to have a dozen or so >> lines for each DB_Object_ID - but perhaps these were easy >> cases? How many lines per DB_Object_ID in the worst cases? >> >> Peter >> > > > I was actually thinking you are suggesting that the whole file should be > read in memory, nit just buffer by DB-Object_ID. My mistake. > > > -- > Iddo Friedberg > http://iddo-friedberg.net/contact.html > ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> > ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. 
> .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> > >>----.<--.>++++++.<<<<------------------------------------. > -- Iddo Friedberg http://iddo-friedberg.net/contact.html ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>----.<--.>++++++.<<<<------------------------------------. From p.j.a.cock at googlemail.com Mon May 20 09:16:45 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 20 May 2013 14:16:45 +0100 Subject: [Biopython-dev] UniProt GOA parser In-Reply-To: References: Message-ID: On Fri, May 17, 2013 at 10:35 PM, Iddo Friedberg wrote: > > > OK. I added a few changes as suggested by Peter. > > There is a parser now to group GAF files by DB_Object_ID, and a write > function to write them. Random access not implemented yet. > Hi Iddo, Over on this branch building on your work I moved things under Bio.UniProt.GOA, and got things a bit more in line with PEP8: https://github.com/peterjc/biopython/tree/uniprot-goa (Drop me an email off list if you need a hand pulling those changes into your branch) Do you want to have a go at re-using the index code in Bio.File (the back end for SeqIO and SearchIO's indexing)? Let me know if the current setup is too mysterious and I can try and document more of it and/or do this for the GOA module. Peter From redmine at redmine.open-bio.org Tue May 21 08:24:34 2013 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 21 May 2013 12:24:34 +0000 Subject: [Biopython-dev] [Biopython - Feature #3432] (New) Updated/Extended module MeltingTemp in Bio.SeqUtils Message-ID: Issue #3432 has been reported by Markus Piotrowski. ---------------------------------------- Feature #3432: Updated/Extended module MeltingTemp in Bio.SeqUtils https://redmine.open-bio.org/issues/3432 Author: Markus Piotrowski Status: New Priority: Normal Assignee: Category: Target version: URL: Dear Biopython developers, I updated/extended the MeltingTemp module of SeqUtils and would be happy if you would consider it for implementing. Please find the source code attached. Any feedback is appreciated. 'Old' module: One method, Tm_staluc, which calculates the melting temperature by the nearest neighbor method, using two different thermodynamic data sets for DNA and RNA. Fixed salt correction formula. 'Updated' module: 1. Three different Tm calculations: one 'rule of thumb' (Tm_Wallace), one using approximative formulas basing on GC content (Tm_GC) and one using nearest neighbor calculations (Tm_NN). 2. The new Tm_NN allows the usage of different thermodynamic datasets (8 tables are included for Watson-Crick base-pairing) and includes tables for mismatches (including inosine) and dangling ends. The datasets are Python dictionaries; the user can use his own datasets or change/update existing tables for his needs. 3. Seven different formulas to correct for salt concentration, including correction for Mg2+ ions (method salt_correction). 4. Method chem_correction which allows for Tm correction when using DMSO and formaldehyde. I haven't touched the old Tm_staluc method (except adding some comments [labelled 'MP'] and a deprecation warning). Actually, the method has two problems on the RNA side: The dataset for RNA is faulty and 'U' isn't considered as input. 
Of course this problems can easily be fixed, however, I would prefer (if it is decided to accept the updated module) to completely exchange the body of Tm_staluc for calls to Tm_NN (as outlined in the comments). There is one thing, that I'm uneasy with: For terminal mismatches, I used thermodynamic data from a patent application that has been withdrawn (http://patentscope.wipo.int/search/en/WO2001094611). Actually, I found the reference in the manual for Primer3 which also seems to use these data (http://primer3.sourceforge.net/primer3_manual.htm). Indeed, the Primer3 source (which is distributed under GPLv2) contains the data. Best wishes, Markus ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From p.j.a.cock at googlemail.com Wed May 22 09:45:00 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 22 May 2013 14:45:00 +0100 Subject: [Biopython-dev] UniProt GOA parser In-Reply-To: References: Message-ID: On Mon, May 20, 2013 at 7:09 PM, Iddo Friedberg wrote: >> Do you want to have a go at re-using the index code in Bio.File >> (the back end for SeqIO and SearchIO's indexing)? Let me know >> if the current setup is too mysterious and I can try and document >> more of it and/or do this for the GOA module. > > I'd like to have a go.. > > ./I Great - a few more details then, The second part of Bio/File.py has some private classes _IndexedSeqFileProxy and _IndexedSeqFileDict and _SQLiteManySeqFilesDict which can be used for any sequential record file format (meaning one after the other, not just biological sequences). These are used by the Bio.SeqIO.index() and index_db() functions, and their sisters in Bio.SearchIO. The idea is you write a subclass of _IndexedSeqFileProxy for your new file format, and then this gets used by either _IndexedSeqFileDict (in memory offset dictionary) or _SQLiteManySeqFilesDict (SQLite offset dictionary). Your _IndexedSeqFileProxy subclass has to define an __iter__ method which loops over the file giving a tuple for each record giving the identifier string and the start offset, and ideally the length in bytes. It must also define a get method which must seek to the offset and then parse the record. For the GOA files, the __iter__ loop will just spot batches of lines for the same identifier which together make up a single record. I managed to explain the setup to Bow, and he got it to work for SearchIO, but we were doing face to face video chats for that during GSoC last year. Fresh eyes will surely find some more rough edges in my docs ;) Regards, Peter From pgarland at gmail.com Sun May 26 22:27:05 2013 From: pgarland at gmail.com (Phillip Garland) Date: Sun, 26 May 2013 19:27:05 -0700 Subject: [Biopython-dev] test_SeqIO_online failure Message-ID: The fasta formatted record is fine, the problem seems to come after requesting and reading the genbank-formatted record for the protein with GI:16130152. It looks like the record was modified a few days ago: LOCUS NP_416719 367 aa linear CON 24-MAY-2013 and ends with CONTIG join(WP_000865568.1:1..367)\n//\n\n' instead of ORIGIN and the sequence data. 
Is this a problem with the genbank record that should be reported to NCBI, or is SeqIO supposed to handle the record as it is by fetching the sequence from the linked contig, or is the test doing the wrong thing by using rettype="gb" instead of rettype="gbwithparts"? Here's the test output: pgarland at cradle:~/Hacking/Source/Biology/biopython/Tests$ python run_tests.py test_SeqIO_online.py Python version: 2.7.5 (default, May 20 2013, 11:51:12) [GCC 4.7.3] Operating system: posix linux2 test_SeqIO_online ... FAIL ====================================================================== FAIL: test_protein_16130152 (test_SeqIO_online.EntrezTests) Bio.Entrez.efetch(protein, 16130152, ...) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/pgarland/Hacking/Source/Biology/biopython/Tests/test_SeqIO_online.py", line 77, in method = lambda x : x.simple(d, f, e, l, c) File "/home/pgarland/Hacking/Source/Biology/biopython/Tests/test_SeqIO_online.py", line 65, in simple self.assertEqual(seguid(record.seq), checksum) AssertionError: 'NT/aFiTXyD/7KixizZ9sq2FcniU' != 'fCjcjMFeGIrilHAn6h+yju267lg' ---------------------------------------------------------------------- Ran 1 test in 10.010 seconds FAILED (failures = 1) ~Phillip From kai.blin at biotech.uni-tuebingen.de Mon May 27 02:19:20 2013 From: kai.blin at biotech.uni-tuebingen.de (Kai Blin) Date: Mon, 27 May 2013 08:19:20 +0200 Subject: [Biopython-dev] SearchIO: Fix a bug in the HMMer2 text parser Message-ID: <51A2FAE8.1040408@biotech.uni-tuebingen.de> Hi folks, I've run into and fixed a bug in the hmmer2-text parser when parsing consensus lines. The pull request is at https://github.com/biopython/biopython/pull/182 Cheers, Kai -- Dipl.-Inform. Kai Blin kai.blin at biotech.uni-tuebingen.de Institute for Microbiology and Infection Medicine Division of Microbiology/Biotechnology Eberhard-Karls-Universit?t T?bingen Auf der Morgenstelle 28 Phone : ++49 7071 29-78841 D-72076 T?bingen Fax : ++49 7071 29-5979 Germany Homepage: http://www.mikrobio.uni-tuebingen.de/ag_wohlleben From p.j.a.cock at googlemail.com Mon May 27 05:05:44 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 27 May 2013 10:05:44 +0100 Subject: [Biopython-dev] test_SeqIO_online failure In-Reply-To: References: Message-ID: Hi Philip, On Mon, May 27, 2013 at 3:27 AM, Phillip Garland wrote: > The fasta formatted record is fine, the problem seems to come after > requesting and reading the genbank-formatted record for the protein > with GI:16130152. > > It looks like the record was modified a few days ago: > > LOCUS NP_416719 367 aa linear CON 24-MAY-2013 > > and ends with > > CONTIG join(WP_000865568.1:1..367)\n//\n\n' > > instead of > > ORIGIN and the sequence data. > > Is this a problem with the genbank record that should be reported to > NCBI, or is SeqIO supposed to handle the record as it is by fetching > the sequence from the linked contig, or is the test doing the wrong > thing by using rettype="gb" instead of rettype="gbwithparts"? Interesting - it looks like the NCBI made a change to Entrez and where previously this record had included the sequence with rettype="gb" now we have to ask for it explicitly with the longer rettype="gbwithparts" - my guess is this is now happening on more records. 
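For records affected in this way, a minimal sketch of the workaround the
rest of this thread converges on - asking Entrez explicitly for the full
flat file with rettype="gbwithparts" and parsing it with the existing
GenBank parser (the GI number is the one from the failing test above;
SeqIO itself only knows the "gb"/"genbank" format names):

from Bio import Entrez, SeqIO

Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
handle = Entrez.efetch(db="protein", id="16130152",
                       rettype="gbwithparts", retmode="text")
record = SeqIO.read(handle, "gb")  # SeqIO format name stays "gb"/"genbank"
handle.close()
print("%s is %i residues long" % (record.id, len(record)))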
Note it does not affect all records, consider this example in our Tutorial which seems unchanged: from Bio import Entrez Entrez.email = "A.N.Other at example.com" # Always tell NCBI who you are handle = Entrez.efetch(db="nucleotide", id="186972394", rettype="gb", retmode="text") print handle.read() Curious. > Here's the test output: > > pgarland at cradle:~/Hacking/Source/Biology/biopython/Tests$ python > run_tests.py test_SeqIO_online.py > Python version: 2.7.5 (default, May 20 2013, 11:51:12) > [GCC 4.7.3] > Operating system: posix linux2 > test_SeqIO_online ... FAIL > ====================================================================== > FAIL: test_protein_16130152 (test_SeqIO_online.EntrezTests) > Bio.Entrez.efetch(protein, 16130152, ...) > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "/home/pgarland/Hacking/Source/Biology/biopython/Tests/test_SeqIO_online.py", > line 77, in > method = lambda x : x.simple(d, f, e, l, c) > File "/home/pgarland/Hacking/Source/Biology/biopython/Tests/test_SeqIO_online.py", > line 65, in simple > self.assertEqual(seguid(record.seq), checksum) > AssertionError: 'NT/aFiTXyD/7KixizZ9sq2FcniU' != 'fCjcjMFeGIrilHAn6h+yju267lg' > > ---------------------------------------------------------------------- > Ran 1 test in 10.010 seconds > > FAILED (failures = 1) I'd noticed this on Friday but hadn't looked into why the sequence was different (and sometimes Entrez errors are transient). Thanks for exploring this :) Would you like to submit a pull request to update test_SeqIO_online.py or should I just go ahead and change the rettype? It would be sensible to review all the Entrez examples in the Tutorial, to perhaps make more use of 'gbwithparts' rather than 'gb'? Thanks, Peter From pgarland at gmail.com Mon May 27 17:38:30 2013 From: pgarland at gmail.com (Phillip Garland) Date: Mon, 27 May 2013 14:38:30 -0700 Subject: [Biopython-dev] test_SeqIO_online failure In-Reply-To: References: Message-ID: Hi Peter, On Mon, May 27, 2013 at 2:05 AM, Peter Cock wrote: > Hi Philip, > > On Mon, May 27, 2013 at 3:27 AM, Phillip Garland wrote: >> The fasta formatted record is fine, the problem seems to come after >> requesting and reading the genbank-formatted record for the protein >> with GI:16130152. >> >> It looks like the record was modified a few days ago: >> >> LOCUS NP_416719 367 aa linear CON 24-MAY-2013 >> >> and ends with >> >> CONTIG join(WP_000865568.1:1..367)\n//\n\n' >> >> instead of >> >> ORIGIN and the sequence data. >> >> Is this a problem with the genbank record that should be reported to >> NCBI, or is SeqIO supposed to handle the record as it is by fetching >> the sequence from the linked contig, or is the test doing the wrong >> thing by using rettype="gb" instead of rettype="gbwithparts"? > > Interesting - it looks like the NCBI made a change to Entrez and > where previously this record had included the sequence with > rettype="gb" now we have to ask for it explicitly with the longer > rettype="gbwithparts" - my guess is this is now happening on > more records. > > Note it does not affect all records, consider this example in our > Tutorial which seems unchanged: > > from Bio import Entrez > Entrez.email = "A.N.Other at example.com" # Always tell NCBI who you are > handle = Entrez.efetch(db="nucleotide", id="186972394", > rettype="gb", retmode="text") > print handle.read() > > Curious. 
> >> Here's the test output: >> >> pgarland at cradle:~/Hacking/Source/Biology/biopython/Tests$ python >> run_tests.py test_SeqIO_online.py >> Python version: 2.7.5 (default, May 20 2013, 11:51:12) >> [GCC 4.7.3] >> Operating system: posix linux2 >> test_SeqIO_online ... FAIL >> ====================================================================== >> FAIL: test_protein_16130152 (test_SeqIO_online.EntrezTests) >> Bio.Entrez.efetch(protein, 16130152, ...) >> ---------------------------------------------------------------------- >> Traceback (most recent call last): >> File "/home/pgarland/Hacking/Source/Biology/biopython/Tests/test_SeqIO_online.py", >> line 77, in >> method = lambda x : x.simple(d, f, e, l, c) >> File "/home/pgarland/Hacking/Source/Biology/biopython/Tests/test_SeqIO_online.py", >> line 65, in simple >> self.assertEqual(seguid(record.seq), checksum) >> AssertionError: 'NT/aFiTXyD/7KixizZ9sq2FcniU' != 'fCjcjMFeGIrilHAn6h+yju267lg' >> >> ---------------------------------------------------------------------- >> Ran 1 test in 10.010 seconds >> >> FAILED (failures = 1) > > I'd noticed this on Friday but hadn't looked into why the sequence was > different (and sometimes Entrez errors are transient). Thanks for > exploring this :) > > Would you like to submit a pull request to update test_SeqIO_online.py > or should I just go ahead and change the rettype? > > It would be sensible to review all the Entrez examples in the Tutorial, > to perhaps make more use of 'gbwithparts' rather than 'gb'? > > Thanks, > > Peter The slight problem with just replacing "gb" with "gbwithparts" is that SeqIO doesn't take "gbwithparts" as an option for the file format. So in test_SeqIO_online.py, you have this code: handle = Entrez.efetch(db=database, id=entry, rettype=f, retmode="text") record = SeqIO.read(handle, f) which is a natural way to write the test (because it tests fasta and genbank files), but will currently fail if f is "gbwithparts", b/c SeqIO doesn't accept "gbwithparts" as a file format specifier. My guess is that most existing code hardcodes the rettype and SeqIO file format specifier, so we could just test for gbwithparts prior to calling SeqIO.read: handle = Entrez.efetch(db=database, id=entry, rettype=f, retmode="text") if f == "gbwithparts": f = "gb" record = SeqIO.read(handle, f) I submitted a pull request with a minimal patch that does this. For code like this, it would be cleaner if SeqIO accepted, "gbwithparts" as an alias for "genbank", just like "gb" is, but I don't know if it's a common pattern enough to bother. If records like this are becoming more common, then "gbwithparts" should be clearly documented in the biopython tutorial, though "gbwithparts" isn't clearly explained in NCBI's Entrez docs AFAICT. It seems safer to always use "gbwithparts" at this point, at least when you want the sequence. ~Phillip From p.j.a.cock at googlemail.com Mon May 27 18:43:19 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 27 May 2013 23:43:19 +0100 Subject: [Biopython-dev] test_SeqIO_online failure In-Reply-To: References: Message-ID: On Mon, May 27, 2013 at 10:38 PM, Phillip Garland wrote: > Hi Peter, > >> I'd noticed this on Friday but hadn't looked into why the sequence was >> different (and sometimes Entrez errors are transient). Thanks for >> exploring this :) >> >> Would you like to submit a pull request to update test_SeqIO_online.py >> or should I just go ahead and change the rettype? 
>> >> It would be sensible to review all the Entrez examples in the Tutorial, >> to perhaps make more use of 'gbwithparts' rather than 'gb'? >> >> Thanks, >> >> Peter > > The slight problem with just replacing "gb" with "gbwithparts" is that > SeqIO doesn't take "gbwithparts" as an option for the file format. So > in test_SeqIO_online.py, you have this code: > > handle = Entrez.efetch(db=database, id=entry, rettype=f, > retmode="text") > record = SeqIO.read(handle, f) > > which is a natural way to write the test (because it tests fasta and > genbank files), but will currently fail if f is "gbwithparts", b/c > SeqIO doesn't accept "gbwithparts" as a file format specifier. My > guess is that most existing code hardcodes the rettype and SeqIO file > format specifier, so we could just test for gbwithparts prior to > calling SeqIO.read: > > handle = Entrez.efetch(db=database, id=entry, rettype=f, retmode="text") > if f == "gbwithparts": > f = "gb" > record = SeqIO.read(handle, f) > > I submitted a pull request with a minimal patch that does this. That's good for now :) > For code like this, it would be cleaner if SeqIO accepted, > "gbwithparts" as an alias for "genbank", just like "gb" is, but I > don't know if it's a common pattern enough to bother. That makes some sense for parsing files, but all those aliases would cause confusion with writing GenBank files. > If records like this are becoming more common, then "gbwithparts" > should be clearly documented in the biopython tutorial, though > "gbwithparts" isn't clearly explained in NCBI's Entrez docs AFAICT. It > seems safer to always use "gbwithparts" at this point, at least when > you want the sequence. Definitely - if the NCBI moves to using 'gb' as the light style without the sequence then many people will just want to use 'gbwithparts' as their default when scripting this sort of thing. Thanks, Peter From redmine at redmine.open-bio.org Tue May 28 03:50:41 2013 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 28 May 2013 07:50:41 +0000 Subject: [Biopython-dev] [Biopython - Bug #3433] (New) MMCIFParser fails on python3 for disordered atoms Message-ID: Issue #3433 has been reported by Alexander Campbell. ---------------------------------------- Bug #3433: MMCIFParser fails on python3 for disordered atoms https://redmine.open-bio.org/issues/3433 Author: Alexander Campbell Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: The new shlex based parser works under python3, but reveals that the changed comparison rules in python3 lead to unhandled exceptions when parsing disordered atoms. Furthermore, it reveals that occupancy and temperature factor attributes of Atom objects were never cast from str to float types when parsed from mmCIF files. The comparison code which raises the exception under python3 is at Atom.py line 333: @if occupancy>self.last_occupancy:@ . The exception can be prevented my modifying MMCIFParser.py to cast occupancy and temperature factor to float. The following patch is a basic copy of the equivalent code in PDBParser.py:
diff --git a/Bio/PDB/MMCIFParser.py b/Bio/PDB/MMCIFParser.py
index 64d16bc..4be6490 100644
--- a/Bio/PDB/MMCIFParser.py
+++ b/Bio/PDB/MMCIFParser.py
@@ -84,8 +84,15 @@ class MMCIFParser(object):
                 altloc=" "
             resseq=seq_id_list[i]
             name=atom_id_list[i]
-            tempfactor=b_factor_list[i]
-            occupancy=occupancy_list[i]
+            # occupancy & B factor
+            try:
+                tempfactor=float(b_factor_list[i])
+            except ValueError:
+                raise PDBConstructionException("Invalid or missing B factor")
+            try:
+                occupancy=float(occupancy_list[i])
+            except ValueError:
+                raise PDBConstructionException("Invalid or missing occupancy")
             fieldname=fieldname_list[i]
             if fieldname=="HETATM":
                 hetatm_flag="H"

This patch was tested with the "mmCIF file for PDB structure 3u8h":http://www.rcsb.org/pdb/download/downloadFile.do?fileFormat=cif&compression=NO&structureId=3U8H , which would cause the mmCIF parsing exception under python3.2. After the patch, there were no exceptions during parsing and the occupancy and bfactor attributes had the correct type (float). The patch was also tested under python2.7, which worked just fine and also showed the correct types. I haven't tested earlier versions of python2, but the simple syntax ought to work. Could a dev apply this patch? Or better yet, suggest a patch for casting the types at the StructureBuilder level, which would make such things independent of the specific parser used. This is just a minimal-quickfix patch, but I'm sure a better solution is possible. ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Tue May 28 03:50:41 2013 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 28 May 2013 07:50:41 +0000 Subject: [Biopython-dev] [Biopython - Bug #3433] (New) MMCIFParser fails on python3 for disordered atoms Message-ID: Issue #3433 has been reported by Alexander Campbell. ---------------------------------------- Bug #3433: MMCIFParser fails on python3 for disordered atoms https://redmine.open-bio.org/issues/3433 Author: Alexander Campbell Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: The new shlex based parser works under python3, but reveals that the changed comparison rules in python3 lead to unhandled exceptions when parsing disordered atoms. Furthermore, it reveals that occupancy and temperature factor attributes of Atom objects were never cast from str to float types when parsed from mmCIF files. The comparison code which raises the exception under python3 is at Atom.py line 333: @if occupancy>self.last_occupancy:@ . The exception can be prevented my modifying MMCIFParser.py to cast occupancy and temperature factor to float. The following patch is a basic copy of the equivalent code in PDBParser.py:
From tiagoantao at gmail.com Tue May 28 07:14:53 2013
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Tue, 28 May 2013 12:14:53 +0100
Subject: [Biopython-dev] Compiling on modern windows (recent mingw)
Message-ID:

Hi,

I have been trying to setup a windows 8 buildbot. For that purpose I have
installed a recent version of mingw on a new win8 machine.

It seems that one of the compiling options of biopython (-mno-cygwin) is
deprecated. See here for more details:
http://korbinin.blogspot.co.uk/2013/03/cython-mno-cygwin-problems.html

--
"Grant me chastity and continence, but not yet" - St Augustine

From p.j.a.cock at googlemail.com Tue May 28 07:21:13 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 28 May 2013 12:21:13 +0100
Subject: [Biopython-dev] Compiling on modern windows (recent mingw)
In-Reply-To:
References:
Message-ID:

On Tue, May 28, 2013 at 12:14 PM, Tiago Antão wrote:
> Hi,
>
> I have been trying to setup a windows 8 buildbot. For that purpose I have
> installed a recent version of mingw on a new win8 machine.
>
> It seems that one of the compiling options of biopython (-mno-cygwin) is
> deprecated. See here for more details:
> http://korbinin.blogspot.co.uk/2013/03/cython-mno-cygwin-problems.html

Looks like there's a confusing open bug about just removing this
argument from Python's distutils - http://bugs.python.org/issue12641

For now does the hack of editing Lib\distutils\cygwinccompiler.py yourself
get it to work? I could live with that on the build slave, coupled with a
warning in our install documentation for the brave people self-compiling
under Windows.

Peter

From tiagoantao at gmail.com Tue May 28 08:04:32 2013
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Tue, 28 May 2013 13:04:32 +0100
Subject: [Biopython-dev] Compiling on modern windows (recent mingw)
In-Reply-To:
References:
Message-ID:

Hi,

On Tue, May 28, 2013 at 12:21 PM, Peter Cock wrote:
> For now does the hack of editing Lib\distutils\cygwinccompiler.py yourself
> get it to work? I could live with that on the build slave, coupled with a
> warning in our install documentation for the brave people self-compiling
> under Windows.
>

I have hacked my distutils implementation. It compiled OK.

That being said, there seems to be some problems with Bio.Applications on
win8:
http://testing.open-bio.org/biopython/builders/Windows%208%20-%20Python%202.7/builds/12/steps/shell/logs/stdio

--
"Grant me chastity and continence, but not yet"
- St Augustine From p.j.a.cock at googlemail.com Tue May 28 10:09:40 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 28 May 2013 15:09:40 +0100 Subject: [Biopython-dev] Compiling on modern windows (recent mingw) In-Reply-To: References: Message-ID: On Tue, May 28, 2013 at 1:04 PM, Tiago Ant?o wrote: > Hi, > > > On Tue, May 28, 2013 at 12:21 PM, Peter Cock > wrote: >> >> For now does the hack of editing Lib\distutils\cygwinccompiler.py yourself >> get it to work? I could live with that on the build slave, coupled with a >> warning in our install documentation for the brave people self-compiling >> under Windows. > > I have hacked my distutils implementation. It compiled OK. That's encouraging. > That being said, there seems to be some problems with Bio.Applications on > win8: > http://testing.open-bio.org/biopython/builders/Windows%208%20-%20Python%202.7/builds/12/steps/shell/logs/stdio Could you confirm output sys.platform is "win32" still? I've got a hunch that spaces in the executable path might explain some of these failures - I'm trying a patch for that here. Some of the other failures appear to be down to newline differences (the \r in some of the output suggests this). Here we can probably use universal new lines mode for file input, but I am puzzled why these pass under Windows XP with an older mingw32 or the Intel compiler. Peter From tiagoantao at gmail.com Tue May 28 10:40:02 2013 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Tue, 28 May 2013 15:40:02 +0100 Subject: [Biopython-dev] Compiling on modern windows (recent mingw) In-Reply-To: References: Message-ID: On Tue, May 28, 2013 at 3:09 PM, Peter Cock wrote: > Could you confirm output sys.platform is "win32" still? > Yup T -- ?Grant me chastity and continence, but not yet? - St Augustine From p.j.a.cock at googlemail.com Tue May 28 12:36:20 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 28 May 2013 17:36:20 +0100 Subject: [Biopython-dev] Compiling on modern windows (recent mingw) In-Reply-To: References: Message-ID: On Tue, May 28, 2013 at 3:09 PM, Peter Cock wrote: > > I've got a hunch that spaces in the executable path might explain > some of these failures - I'm trying a patch for that here. Hi Tiago, Patch applied to master - this is essential for the rare case of calling a binary under Unix where the path/filename includes a space, but appears to be redundant under Windows XP: https://github.com/biopython/biopython/commit/815de571b623f1cd3659fe4c80e3917e1a437580 I'm curious if that matters under Windows 8 or not - trying the example in the commit comment at the command line might be illuminating. Peter P.S. Saket - You might remember I touched on this issue in our discussion on GitHub about your bwa/samtools wrappers, which led to this commit keeping self.program_name as the binary only: https://github.com/biopython/biopython/commit/ca93be741c8fd9bad67106acb455348251797f3a From tiagoantao at gmail.com Tue May 28 12:50:39 2013 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Tue, 28 May 2013 17:50:39 +0100 Subject: [Biopython-dev] Compiling on modern windows (recent mingw) In-Reply-To: References: Message-ID: On Tue, May 28, 2013 at 5:36 PM, Peter Cock wrote: > I'm curious if that matters under Windows 8 or not - trying > the example in the commit comment at the command line > might be illuminating. > I just re-scheduled a testing case and the results were not great... 
http://testing.open-bio.org/biopython/builders/Windows%208%20-%20Python%202.7/builds/13/steps/shell/logs/stdio I will test this manually and in deep when I arrive home today. -- ?Grant me chastity and continence, but not yet? - St Augustine From p.j.a.cock at googlemail.com Tue May 28 13:15:23 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 28 May 2013 18:15:23 +0100 Subject: [Biopython-dev] Compiling on modern windows (recent mingw) In-Reply-To: References: Message-ID: On Tue, May 28, 2013 at 5:50 PM, Tiago Ant?o wrote: > > On Tue, May 28, 2013 at 5:36 PM, Peter Cock > wrote: >> >> I'm curious if that matters under Windows 8 or not - trying >> the example in the commit comment at the command line >> might be illuminating. > > > I just re-scheduled a testing case and the results were not great... > http://testing.open-bio.org/biopython/builders/Windows%208%20-%20Python%202.7/builds/13/steps/shell/logs/stdio > > I will test this manually and in deep when I arrive home today. I think there are at just two classes of failure, calling applications: test_Application ... FAIL And indexing with Windows newlines (I wonder if the git setup on my Windows XP machine has a different default to yours, meaning I have Unix newlines and you have Windows newlines?): test_SearchIO_blast_tab_index ... FAIL test_SearchIO_blast_xml_index ... FAIL test_SearchIO_exonerate_text_index ... FAIL test_SearchIO_exonerate_vulgar_index ... FAIL test_SearchIO_fasta_m10_index ... FAIL test_SearchIO_hmmer2_text_index ... FAIL test_SearchIO_hmmer3_domtab_index ... FAIL test_SearchIO_hmmer3_tab_index ... FAIL test_SearchIO_hmmer3_text_index ... FAIL Bio.SeqIO docstring test ... FAIL Plus of course the minor issues which I just introduced with the escaping change (commits to follow). Peter From saketkc at gmail.com Tue May 28 13:20:36 2013 From: saketkc at gmail.com (Saket Choudhary) Date: Tue, 28 May 2013 22:50:36 +0530 Subject: [Biopython-dev] Compiling on modern windows (recent mingw) In-Reply-To: References: Message-ID: The constraint for me really is I do not have access to Windows/MAC machines here. Hunting for a Windows machine is possible, besides these I need to validate the _ArgumentList method for windows too On 28 May 2013 22:06, Peter Cock wrote: > On Tue, May 28, 2013 at 3:09 PM, Peter Cock > wrote: > > > > I've got a hunch that spaces in the executable path might explain > > some of these failures - I'm trying a patch for that here. > > Hi Tiago, > > Patch applied to master - this is essential for the rare case of > calling a binary under Unix where the path/filename includes > a space, but appears to be redundant under Windows XP: > > https://github.com/biopython/biopython/commit/815de571b623f1cd3659fe4c80e3917e1a437580 > > I'm curious if that matters under Windows 8 or not - trying > the example in the commit comment at the command line > might be illuminating. > > Peter > > P.S. 
Saket - You might remember I touched on this issue in our > discussion on GitHub about your bwa/samtools wrappers, which > led to this commit keeping self.program_name as the binary only: > > https://github.com/biopython/biopython/commit/ca93be741c8fd9bad67106acb455348251797f3a > From p.j.a.cock at googlemail.com Tue May 28 13:30:47 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 28 May 2013 18:30:47 +0100 Subject: [Biopython-dev] Compiling on modern windows (recent mingw) In-Reply-To: References: Message-ID: On Tue, May 28, 2013 at 6:20 PM, Saket Choudhary wrote: > The constraint for me really is I do not have access to Windows/MAC machines > here. > > Hunting for a Windows machine is possible, besides these I need to validate > the _ArgumentList method for windows too I sympathise - sorting out a (virtual) 64bit Windows machine has been on my TODO list for a while, since right now I don't have access to one. When I started doing Biopython my primary machine was Windows XP. That old laptop has retired and I now mainly use Mac OS X and Linux at work, but I made a point of getting a Windows XP machine setup for development (e.g. the Windows installers are build with this) and for use as one of our nightly build slaves: http://testing.open-bio.org/biopython/buildslaves Regards, Peter From redmine at redmine.open-bio.org Thu May 30 02:32:21 2013 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Thu, 30 May 2013 06:32:21 +0000 Subject: [Biopython-dev] [Biopython - Bug #3433] (Resolved) MMCIFParser fails on python3 for disordered atoms References: Message-ID: Issue #3433 has been updated by Michiel de Hoon. Status changed from New to Resolved % Done changed from 0 to 100 Patch applied, thanks. ---------------------------------------- Bug #3433: MMCIFParser fails on python3 for disordered atoms https://redmine.open-bio.org/issues/3433 Author: Alexander Campbell Status: Resolved Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: The new shlex based parser works under python3, but reveals that the changed comparison rules in python3 lead to unhandled exceptions when parsing disordered atoms. Furthermore, it reveals that occupancy and temperature factor attributes of Atom objects were never cast from str to float types when parsed from mmCIF files. The comparison code which raises the exception under python3 is at Atom.py line 333: @if occupancy>self.last_occupancy:@ . The exception can be prevented my modifying MMCIFParser.py to cast occupancy and temperature factor to float. The following patch is a basic copy of the equivalent code in PDBParser.py:
diff --git a/Bio/PDB/MMCIFParser.py b/Bio/PDB/MMCIFParser.py
index 64d16bc..4be6490 100644
--- a/Bio/PDB/MMCIFParser.py
+++ b/Bio/PDB/MMCIFParser.py
@@ -84,8 +84,15 @@ class MMCIFParser(object):
                 altloc=" "
             resseq=seq_id_list[i]
             name=atom_id_list[i]
-            tempfactor=b_factor_list[i]
-            occupancy=occupancy_list[i]
+            # occupancy & B factor
+            try:
+                tempfactor=float(b_factor_list[i])
+            except ValueError:
+                raise PDBConstructionException("Invalid or missing B factor")
+            try:
+                occupancy=float(occupancy_list[i])
+            except ValueError:
+                raise PDBConstructionException("Invalid or missing occupancy")
             fieldname=fieldname_list[i]
             if fieldname=="HETATM":
                 hetatm_flag="H"

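For anyone wanting to confirm the fix locally, a quick check along these lines should be enough (a sketch assuming the 3u8h.cif file used for testing below has been downloaded into the working directory, and using the standard Bio.PDB Atom accessors):

    from Bio.PDB.MMCIFParser import MMCIFParser

    parser = MMCIFParser()
    structure = parser.get_structure("3u8h", "3u8h.cif")
    for atom in structure.get_atoms():
        # with the patch applied these are floats, not strings
        assert isinstance(atom.get_occupancy(), float)
        assert isinstance(atom.get_bfactor(), float)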
This patch was tested with the "mmCIF file for PDB structure 3u8h":http://www.rcsb.org/pdb/download/downloadFile.do?fileFormat=cif&compression=NO&structureId=3U8H , which would cause the mmCIF parsing exception under python3.2. After the patch, there were no exceptions during parsing and the occupancy and bfactor attributes had the correct type (float). The patch was also tested under python2.7, which worked just fine and also showed the correct types. I haven't tested earlier versions of python2, but the simple syntax ought to work. Could a dev apply this patch? Or better yet, suggest a patch for casting the types at the StructureBuilder level, which would make such things independent of the specific parser used. This is just a minimal-quickfix patch, but I'm sure a better solution is possible. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From p.j.a.cock at googlemail.com Thu May 30 04:21:31 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 30 May 2013 09:21:31 +0100 Subject: [Biopython-dev] Python 3 and subprocess .communicate() bug Message-ID: Hi Tiago, We'd been talking briefly off-list about the recent buildbot failures under Python 3 where the recent change to using subprocess in the PopGen module was causing failures. Sadly while it seems to work on Python 3.1 and 3.2 my suggestion to try using bytes with the communicate call fails on Python 3.3 and under Windows: https://github.com/biopython/biopython/commit/912692ee2b57e8c075ba38bdf814c9dbe4f5cdb9 e.g. After the change to use bytes, http://testing.open-bio.org/biopython/builders/Linux%2064%20-%20Python%203.3/builds/202 http://testing.open-bio.org/biopython/builders/Windows%20XP%20-%20Python%203.1/builds/816 http://testing.open-bio.org/biopython/builders/Windows%20XP%20-%20Python%203.2/builds/680 http://testing.open-bio.org/biopython/builders/Windows%20XP%20-%20Python%203.3/builds/206 This appears to be a known bug in the subprocess module, http://bugs.python.org/issue16903 which should be fixed in Python 3.2.4 and Python 3.3. It appears not to have been fixed on Python 3.1. I see two options, Option One, revert that commit (i.e. send unicode strings as before, not bytes). This will work on Python 3.2.4+ onwards including Windows. It will fail on Python 3.1 and out of date Python 3.2 through 3.2.3 releases. Option Two, don't use universal_newlines=True which then requires us to use byte strings for all the stdin, stdout and stderr processing. More work, but it should in principle work on old and new Python 3 releases. Note that while we're not seeing any problems yet, I suspect this issue would affect our Bio.Application wrappers __call__ function as well when used to send data to stdin. Here again we could switch to using bytes and universal_newlines=False and do any bytes/unicode handling within the __call_ function, on just insist on a fixed version of Python. If we decide to recommend at least Python 3.2.4 (when using Python 3), then we could add a warning to the relevant modules to catch this issue? What do people think? Regards, Peter From tiagoantao at gmail.com Thu May 30 04:28:04 2013 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Thu, 30 May 2013 09:28:04 +0100 Subject: [Biopython-dev] Python 3 and subprocess .communicate() bug In-Reply-To: References: Message-ID: I was having a look at the issue precisely now. 
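As a concrete illustration of the second option above (drop universal_newlines=True and keep stdin/stdout as bytes, decoding explicitly), the pattern would be roughly the following sketch, using the sort command purely as a stand-in for a wrapped tool:

    import subprocess

    child = subprocess.Popen(["sort"], stdin=subprocess.PIPE,
                             stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # send and receive bytes; decoding is done by us rather than by
    # subprocess, so behaviour does not depend on the Python 3.x point release
    stdout, stderr = child.communicate(b"beta\nalpha\n")
    lines = stdout.decode("ascii").splitlines()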
I do not have a cast opinion on the issue, I think it all boils down on how many people are dependent on 3.2.3 and prior 3s. In theory I would prefer not to have workarounds for implementation bugs (as makes things more complex to manage in the long-run), but if many people are using buggy 3.x, I see no option... I simply do not have any view on how many people would be using these... From p.j.a.cock at googlemail.com Thu May 30 04:34:15 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 30 May 2013 09:34:15 +0100 Subject: [Biopython-dev] Python 3 and subprocess .communicate() bug In-Reply-To: References: Message-ID: On Thu, May 30, 2013 at 9:28 AM, Tiago Ant?o wrote: > I was having a look at the issue precisely now. > > I do not have a cast opinion on the issue, I think it all boils down on how > many people are dependent on 3.2.3 and prior 3s. > > In theory I would prefer not to have workarounds for implementation bugs (as > makes things more complex to manage in the long-run), but if many people are > using buggy 3.x, I see no option... > > I simply do not have any view on how many people would be using these... > Since till now we've not officially supported Python 3, but plan to start doing so for the forthcoming Biopython 1.62 release, so we could just set a minimum version of 3.2.4 (with Python 3.3 being our current recommendation). However, that may be a problem for some current Linux distributions still shipping older versions? Peter From tiagoantao at gmail.com Thu May 30 04:41:27 2013 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Thu, 30 May 2013 09:41:27 +0100 Subject: [Biopython-dev] Python 3 and subprocess .communicate() bug In-Reply-To: References: Message-ID: On Thu, May 30, 2013 at 9:34 AM, Peter Cock wrote: > However, that may be a problem for some current Linux > distributions still shipping older versions? > > > I suppose people could revert to Python 2 in that case? [Do not get me wrong, I really have no strong feelings either way] -- ?Grant me chastity and continence, but not yet? - St Augustine From p.j.a.cock at googlemail.com Thu May 30 07:37:51 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 30 May 2013 12:37:51 +0100 Subject: [Biopython-dev] Python 3 and subprocess .communicate() bug In-Reply-To: References: Message-ID: On Thu, May 30, 2013 at 9:41 AM, Tiago Ant?o wrote: > > On Thu, May 30, 2013 at 9:34 AM, Peter Cock > wrote: >> >> However, that may be a problem for some current Linux >> distributions still shipping older versions? > > I suppose people could revert to Python 2 in that case? [Do not get me > wrong, I really have no strong feelings either way] > I guess we should do a brief survey on the main list of Python 3 versions people have installed, if any. In the meantime, I reverted that commit so the tests should now pass under Python 3.2.4+ and Python 3.3. https://github.com/biopython/biopython/commit/285988b1b5227b591bd2fed379e36db3a157eca2 Peter From tiagoantao at gmail.com Thu May 30 07:40:27 2013 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Thu, 30 May 2013 12:40:27 +0100 Subject: [Biopython-dev] Python 3 and subprocess .communicate() bug In-Reply-To: References: Message-ID: > I guess we should do a brief survey on the main list of Python 3 versions > people have installed, if any. 
> > > +1 From p.j.a.cock at googlemail.com Thu May 30 07:47:33 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 30 May 2013 12:47:33 +0100 Subject: [Biopython-dev] Python 3 and subprocess .communicate() bug In-Reply-To: References: Message-ID: On Thu, May 30, 2013 at 12:40 PM, Tiago Ant?o wrote: > >> I guess we should do a brief survey on the main list of Python 3 versions >> people have installed, if any. >> >> > > +1 Agreed, http://lists.open-bio.org/pipermail/biopython/2013-May/008598.html Peter From p.j.a.cock at googlemail.com Thu May 30 09:33:22 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 30 May 2013 14:33:22 +0100 Subject: [Biopython-dev] Python 2 and 3 migration thoughts Message-ID: Splitting off from this thread: http://lists.open-bio.org/pipermail/biopython/2013-May/008601.html On Thu, May 30, 2013 at 2:13 PM, Peter Cock wrote: > Thank you for all the comments so far, don't stop yet :) > > On Thu, May 30, 2013 at 1:51 PM, Wibowo Arindrarto > wrote: >> Hi everyone, >> >> I'm leaning towards insisting on Python >=3.3 support (I'm running >> 3.3.2). I suppose that even if Python3.3 is not available on a machine >> or through the default package manager, it's always installable on its >> own. If that's not the case, I imagine Python2.x is most likely >> present in these machines (so Biopython can still be used). > > True. > > So far everyone who has replied (including some off list) have said > they are using Python 3.3 which is encouraging. Thank you for > the comments so far. > > It looks like we can forget about Python 3.1, and just need to > decide if it is worth including Python 3.2.5 in the short term. > >> On a related note, do we have a defined timeline on when we >> would drop support for Python2.x? Are there any plans to have >> our codebase written in Python3.x instead of Python2.x? > > Nothing concrete planned, no. I'll reply in more detail on the > biopython-dev list as I do have some thoughts about this. Good question Bow, I think people will still be using Python 2 a year or two from now, so we must support both for some time. Biopython 1.62 (next week perhaps?) - Final release with Python 2.5 support - Official support for Python 2.5, 2.6, 2.7 and 3.3 - Possibly official support for Python 3.2.5+ as well? (Exactly which versions of Python 3 we'll include to be decided, see the other thread for that discussion.) Short term we will continue with developing using Python 2 syntax and running 2to3 for Python 3. As far as I know, the reverse process with 3to2 is not well established. If anyone wants to investigate that would be useful as another option. However, dropping Python 2.5 support makes things more flexible... Medium term I believe it would be possible to have a single code base which is both valid Python 2 and 3 at the same time. This may require us to target 2.7 and 3.3+ only - we'll have to try it and see if Python 2.6 will hold us back. I've actually done this with lzma.backports, a small but non-trivial module with Python and C code: https://pypi.python.org/pypi/backports.lzma/ https://github.com/peterjc/backports.lzma Python 3.3 reintroduces some features designed to make this more straightforward, like unicode literals (missing in the early versions of Python 3). This is why I'd like to drop Python 3.2 as soon as possible. What I was thinking is we can start migrating modules on a case by case basis from "Python 2 syntax" to "Dual syntax" one by one, with a white-list in the do2to3.py script. 
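For illustration, "dual syntax" here just means code like the following sketch (not any specific Biopython module), which runs unchanged on Python 2.6+ and Python 3:

    from __future__ import print_function  # harmless no-op on Python 3

    def describe(record):
        # print() used as a function works on both Python 2 and 3,
        # so 2to3 no longer has to rewrite this module
        print("%s has length %i" % (record.id, len(record)))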
That way over time less and less modules need to be converted via 2to3, and "python3 setup.py install" will get faster, until eventually we can stop using 2to3 at all. This conversion could consider the code and doctests separately. However, using using print(example) we can hopefully get most of the doctests and Tutorial examples to work under both Python 2 and 3 at the same time. That's my current thinking anyway - and I think the fact that it would be a gradual migration from writing Python 2 specific code to writing dual 2/3 code makes it low risk (as long as we're continuing to run regular testing). Regards, Peter From p.j.a.cock at googlemail.com Thu May 30 10:23:01 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 30 May 2013 15:23:01 +0100 Subject: [Biopython-dev] HMMER3.1 beta test 1 released Message-ID: Hi Bow, Just FYI, see http://selab.janelia.org/people/eddys/blog/?p=759 "The programs phmmer, hmmsearch, and hmmscan offer a new tabular output format for easier automated parsing, --pfamtblout. his format is the one used internally by Pfam, but we make it more broadly available in case it is of use elsewhere. An analagous output format is available for nhmmer and nhmmscan, --dfamtblout." Something to consider for SearchIO later on... Regards, Peter From w.arindrarto at gmail.com Thu May 30 10:50:24 2013 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Thu, 30 May 2013 16:50:24 +0200 Subject: [Biopython-dev] HMMER3.1 beta test 1 released In-Reply-To: References: Message-ID: Hi Peter, Thanks for the heads-up. This just showed up in my feed as well. I've been waiting for the official release (since they first mentioned it some monts ago). I'll follow up on this slowly :).. Best regards, Bow On Thu, May 30, 2013 at 4:23 PM, Peter Cock wrote: > Hi Bow, > > Just FYI, see http://selab.janelia.org/people/eddys/blog/?p=759 > > "The programs phmmer, hmmsearch, and hmmscan offer a new > tabular output format for easier automated parsing, --pfamtblout. > his format is the one used internally by Pfam, but we make it more > broadly available in case it is of use elsewhere. An analagous > output format is available for nhmmer and nhmmscan, --dfamtblout." > > Something to consider for SearchIO later on... > > Regards, > > Peter From rz1991 at foxmail.com Thu May 30 11:37:00 2013 From: rz1991 at foxmail.com (=?gb18030?B?yO7vow==?=) Date: Thu, 30 May 2013 23:37:00 +0800 Subject: [Biopython-dev] GSoC 2013 Student Self-introduction Message-ID: Hi Everyone, This is Zheng Ruan, a first year graduate students at the University of Georgia. I'm happy to be chosen to participate in GSoC this year. My project is "Codon Alignment and Analysis in Biopython" and I will be working with Eric Talevich and Peter Cock during the summer. My undergraduate major is biotechnology and now seeking for a PhD in bioinformatics. I hope to improve my python programming skills during the project and make long term contribution to biopython. I will follow the timeline of my proposal in the Community Bounding Period these days (http://www.google-melange.com/gsoc/proposal/review/google/gsoc2013/rzzmh12345/1). Thanks! 
Best, Ruan From p.j.a.cock at googlemail.com Thu May 30 12:18:41 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 30 May 2013 17:18:41 +0100 Subject: [Biopython-dev] Biopython projects with NESCent for GSoC 2013 In-Reply-To: References: Message-ID: Dear all, After the disappointing news that the Open Bioinformatics Foundation (OBF) was not accepted as a Google Summer of Code (GSoC) organisation this year, Biopython was fortunate to once again offer some projects with the NESCent team: http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2013 As always the student proposals have been very competitive, and we've not been able to take on everyone. This year NESCent was fortunately to be able to accept seven students through GSoC and one through the GNOME Outreach Program for Women. Two of these GSoC projects are Biopython related: Codon Alignment and Analysis in Biopython Student: Zheng Ruan Mentors: Eric Talevich, Peter Cock http://www.google-melange.com/gsoc/project/google/gsoc2013/rzzmh12345/32001 Phylogenetics in Biopython: Filling in the gaps Student: Yanbo Ye http://www.google-melange.com/gsoc/project/google/gsoc2013/yeyanbo/45001 Mentors: Mark Holder, Jeet Sukumaran, Eric Talevich Thank you NESCent, and congratulations to Zheng Ruan and Yanbo Ye! I'm hoping you're already setting up a blog, which I hope you'll be able to use for roughly weekly progress reports during the summer - CC'd to the biopython-dev mailing list and the NESCent Phyloinformatics Summer of Code forum on Google+, http://lists.open-bio.org/mailman/listinfo/biopython-dev https://plus.google.com/communities/105828320619238393015 An introduction to your project would be a great idea for your first post - here's Bow's from last year as an example: http://bow.web.id/blog/2012/04/google-summer-of-code-is-on/ http://bow.web.id/blog/2012/08/summers-over/ http://bow.web.id/blog/tag/gsoc/ The idea here is to keep the wider community informed about how your project is going. On behalf of the Biopython developers, congratulations! We're looking forward to another productive Summer of Code :) Peter From p.j.a.cock at googlemail.com Fri May 31 05:04:28 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 31 May 2013 10:04:28 +0100 Subject: [Biopython-dev] Python 3 and subprocess .communicate() bug In-Reply-To: References: Message-ID: On Thu, May 30, 2013 at 9:34 AM, Peter Cock wrote: > On Thu, May 30, 2013 at 9:28 AM, Tiago Ant?o wrote: >> I was having a look at the issue precisely now. >> >> I do not have a cast opinion on the issue, I think it all boils down on how >> many people are dependent on 3.2.3 and prior 3s. >> >> In theory I would prefer not to have workarounds for implementation bugs (as >> makes things more complex to manage in the long-run), but if many people are >> using buggy 3.x, I see no option... >> >> I simply do not have any view on how many people would be using these... >> > > Since till now we've not officially supported Python 3, but > plan to start doing so for the forthcoming Biopython 1.62 > release, so we could just set a minimum version of 3.2.4 > (with Python 3.3 being our current recommendation). >From the discussion on the main list, requiring a recent version of Python 3 where this bug is fixed should be fine. 
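In a test script that sort of version gate is only a few lines - a sketch of the general shape (the actual commit linked below may differ in detail), using the exception that run_tests.py already treats as "skip this test":

    import sys
    from Bio import MissingExternalDependencyError

    if sys.version_info[0] == 3 and sys.version_info < (3, 2, 4):
        # subprocess.communicate() with universal_newlines=True is
        # unreliable here, see http://bugs.python.org/issue16903
        raise MissingExternalDependencyError(
            "Need Python 3.2.4+ for this test")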
For now I've added code to skip this test on the older Python 3 releases where the bug exists: https://github.com/biopython/biopython/commit/9c16c09806ca4af84f714662e54c9bd3057b0a52 Once we've settled on the versions to support with the next release we should review what versions we run on the buildbot. Regards, Peter
From eric.talevich at gmail.com Wed May 1 15:46:43 2013 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 1 May 2013 11:46:43 -0400 Subject: [Biopython-dev] gsoc phylo project questions In-Reply-To: References: Message-ID: On Tue, Apr 30, 2013 at 3:20 AM, Yanbo Ye wrote: > Hi Eric, > > Again, thanks for your comment. It might be better to discuss here. > https://github.com/lijax/gsoc/commit/e969c82a5a0aef45bba1277ce01d6dbee03e6a84#commitcomment-3096321 > > I have changed my proposal and timeline based on your advice. I think I > was too optimistic that I didn't consider about the compatibility with > existing code or other potential problem that may exist. After careful > consideration, I removed one task from the goal list to make the time more > relaxed, the tree comparison(seems > I miss understood this). I might be able to complete all of them. But it's > better to make it as an extra task, to make sure this coding experience is > not a burden. > I agree it's best to commit to a feasible timeline and then reserve a few "stretch goals". Dropping the tree distance function is fine, as there are currently some other students who might develop this small module as a course project, independently of GSoC. In any case that functionality is independent of the other tasks you've proposed. > According to your comment: > > 1. I didn't know PyCogent and DendroPy. I'll refer to them for useful > solutions. > 2. For distance-based tree and consensus tree, I think there is no need > to use NumPy. And for consensus tree, my original plan is to implement a > binary class to count the clade with the same leaves for performance.
As > you suggest, I'll implement a class with the same API and improve the > performance later, so that I can pay more attention to the Strict and Adam > Consensus algorithms. > Sounds good. > 3. I didn't find the distance matrix method for MSA on Phylo Cookbook > page, only from existing tree. > Ah, I think I misunderstood you earlier. Yes, for the NJ method you'll need to use a substitution matrix to compute pairwise distances from a multiple sequence alignment. This shouldn't be too challenging, though you might find the need to add a new matrix to the Bio.SubsMat module if you want to let the user choose something other than BLOSUM or PAM. 4. For parsimony tree search, I have already know how several heuristic > search algorithms work. Do I need to implement them all? > No, just choose a well-established one that you feel comfortable implementing. 5. I'm not clear about the radial layout and Felsenstein's Equal Daylight > algorithm. Isn't this algorithm one way of showing the radial layout? I'm > sorry that I'm not familiar with this layout. Can you give some figure > examples and references? > For radial tree layout: https://en.wikipedia.org/wiki/Radial_tree http://www.infosun.fim.uni-passau.de/~chris/down/DrawingPhyloTreesEA.pdf The paper above also explains an "angle spreading" refinement step to improve the appearance of radial trees, which you could opt to implement instead of Equal Daylight. The Equal Daylight algorithm seems to only be documented fully in the book "Inferring Phylogenies" and implemented in the "drawtree" program in Phylip. In the Phylip documentation, the radial layout algorithm is called "Equal Arc", and the layout provided by that algorithm is the starting point for Equal Daylight: http://evolution.genetics.washington.edu/phylip/doc/drawtree.html Cheers, Eric From albl500 at york.ac.uk Wed May 1 22:56:12 2013 From: albl500 at york.ac.uk (Alex Leach) Date: Wed, 01 May 2013 23:56:12 +0100 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: Dear all, I also left some minor comments on the proposal; I hope they're helpful and I wish you every success! You should focus on the proposal for now, but I thought I'd share a more presentable version of the fasta lazy-loader I wrote a couple of years ago. The focus at the time was to minimise memory usage and increase the speed of random access to fasta-formatted sequences, stored on disk. Only sequence accessions and file locations are stored in-memory (in a dict). Once the index has been populated, it can 'pickle' the dictionary to a file on disk, for later re-use. It doesn't exactly fulfill all of your needs, but I hope it might help you in the right direction.. Also, were there plans for making the lazy loader thread-safe? I've done it in the past by passing a `multiprocessing.Pipe` instance to a method (`pipe_sequences`) of the lazy loader. If redesigning the code, I'd try to implement a callback scheme, but passing a Pipe did the job.. Maybe it's outside the current scope of the project, but anyway, I put the module up on github if you want to check it out[1]. 
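The core of that kind of index is small - a rough sketch (one sequence per '>' header assumed, file names purely illustrative) of the accession-to-offset dictionary plus the pickling step:

    import pickle

    def index_fasta(path):
        # map each accession (first word after ">") to the byte offset of
        # its header line; only these offsets are kept in memory
        offsets = {}
        with open(path, "rb") as handle:
            while True:
                pos = handle.tell()
                line = handle.readline()
                if not line:
                    break
                if line.startswith(b">"):
                    offsets[line[1:].split()[0].decode()] = pos
        return offsets

    index = index_fasta("genome.fasta")
    with open("genome.fasta.idx", "wb") as out:
        pickle.dump(index, out)  # re-load with pickle.load on a later run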
Cheers, Alex [1] - https://github.com/alexleach/fasta_lazy_loader/blob/master/fasta_lazy_loader.py From zhigang.wu at email.ucr.edu Thu May 2 08:14:04 2013 From: zhigang.wu at email.ucr.edu (Zhigang Wu) Date: Thu, 2 May 2013 01:14:04 -0700 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: Hi Alex, The idea of taking advantage of multiprocessing is great. I haven't touched this kind of thing before and I think it's going to be cool to integrate into the project. Best, Zhigang On Wed, May 1, 2013 at 3:56 PM, Alex Leach wrote: > Dear all, > > I also left some minor comments on the proposal; I hope they're helpful > and I wish you every success! > > You should focus on the proposal for now, but I thought I'd share a more > presentable version of the fasta lazy-loader I wrote a couple of years ago. > The focus at the time was to minimise memory usage and increase the speed > of random access to fasta-formatted sequences, stored on disk. Only > sequence accessions and file locations are stored in-memory (in a dict). > Once the index has been populated, it can 'pickle' the dictionary to a file > on disk, for later re-use. > > It doesn't exactly fulfill all of your needs, but I hope it might help you > in the right direction.. > > Also, were there plans for making the lazy loader thread-safe? I've done > it in the past by passing a `multiprocessing.Pipe` instance to a method > (`pipe_sequences`) of the lazy loader. If redesigning the code, I'd try to > implement a callback scheme, but passing a Pipe did the job.. Maybe it's > outside the current scope of the project, but anyway, I put the module up > on github if you want to check it out[1]. > > > Cheers, > Alex > > > [1] - https://github.com/alexleach/**fasta_lazy_loader/blob/master/** > fasta_lazy_loader.py > > ______________________________**_________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.**org > http://lists.open-bio.org/**mailman/listinfo/biopython-dev > From albl500 at york.ac.uk Thu May 2 09:08:23 2013 From: albl500 at york.ac.uk (Alex Leach) Date: Thu, 02 May 2013 10:08:23 +0100 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: On Thu, 02 May 2013 09:14:04 +0100, Zhigang Wu wrote: > Hi Alex, > > The idea of taking advantage of multiprocessing is great. I haven't > touched this kind of thing before and I think >it's going to be cool to > integrate into the project. Pleasure. Multiprocessing is quite a large topic, and the relevant library documentation also rather large[1-2]. If you haven't worked with multiprocessing before, it will probably take a long while before you're comfortable using the libraries involved. So if you were to mention it in the proposal, I'd keep it out of the core objectives, as you have a lot else on your plate, already. Don't know if anyone else has any thoughts on this, though? I could potentially help to provide some pointers, so if you have any questions I might be able to help with, please feel free to ask. Kind regards, Alex [1] - http://docs.python.org/2/library/multiprocessing.html [2] - http://docs.python.org/2/library/threading.html -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From p.j.a.cock at googlemail.com Thu May 2 09:52:19 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 2 May 2013 10:52:19 +0100 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: On Wed, May 1, 2013 at 3:17 PM, Zhigang Wu wrote: > Hi Peter and all, > Thanks for the long explanation. > I got much better understand of this project though I am still confusing on > how to implement the lazy-loading parser for feature rich files (EMBL, > GenBank, GFF3). > Since the deadline is pretty close,I decided to post my premature of > proposal for this project. It would be great if you all can given me some > comments and suggestions. The proposal is available here. > https://docs.google.com/document/d/1BgPRKTq7HXq1K6fb9U2TnN7VvSDDlSTsQN991okekzk/edit?usp=sharing > Thank you all in advance. > > Zhigang Hi Zhigang, I've posted a few comment there, but it would be a good idea to put the draft on Google Melange soon. I see you've posted the Google Doc on the NESCent Google+ as well, good. Looking at the current draft, you don't yet have a timeline. This is vital - and it should include writing tests (as you write code - not all at the end) and documentation (which can come after the code). In the community bonding period you could write that you plan to setup your development environment including multiple versions of Python (at least Python 2.6, Python 3, Jython 2.7, and PyPy 2.0 to cover the main variants). For instance, it would make sense to start with learning about faidx and how its indexing works, and trying to reproduce it in Python code, and then wrapping that in a SeqRecord style API. Include writing and evaluating some benchmarks too - you may need to learn how to profile Python code for this, since speed and performance is one the reasons for wanting lazy loading (lower memory usage is the other main driver). That could be the first few weeks perhaps? Regards, Peter From p.j.a.cock at googlemail.com Thu May 2 10:37:31 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 2 May 2013 11:37:31 +0100 Subject: [Biopython-dev] Fwd: [PhyloSoC] Application deadline fast approaching In-Reply-To: References: Message-ID: Hi all, I'm forwarding this for any potential Google Summer of Code 2013 students and mentors - note you should also be signed up to the NESCent "Phyloinformatics Summer of Code" mailing list to make sure you don't miss any important information. Thanks, Peter ---------- Forwarded message ---------- From: Karen Cranston Date: Thu, May 2, 2013 at 12:39 AM Subject: [PhyloSoC] Application deadline fast approaching To: Phyloinformatics Summer of Code The student application deadline for GSoC is this Friday, May 3 at 19:00 UTC! Thanks to everyone for their expertise and enthusiasm so far. Expect much traffic in Melange and on the G+ page between now and the deadline. Please do help students (for your projects or others) improve their applications - either on the G+ page or via a public comment on Melange. The most common issue is a lack of detail in the project plan. You can point students to the wiki for examples from previous years. Feel free to ask for help on this list. We will send out more about assigning mentors / scoring after the application deadline. 
Cheers, Karen & Jim -- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Karen Cranston, PhD Training Coordinator and Informatics Project Manager nescent.org @kcranstn ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ _______________________________________________ PhyloSoC mailing list PhyloSoC at nescent.org https://lists.nescent.org/mailman/listinfo/phylosoc UNSUBSCRIBE: https://lists.nescent.org/mailman/options/phylosoc/p.j.a.cock%40googlemail.com?unsub=1&unsubconfirm=1 From p.j.a.cock at googlemail.com Thu May 2 12:54:52 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 2 May 2013 13:54:52 +0100 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: On Wed, May 1, 2013 at 3:17 PM, Zhigang Wu wrote: > Hi Peter and all, > Thanks for the long explanation. > I got much better understand of this project though I am still confusing on > how to implement the lazy-loading parser for feature rich files (EMBL, > GenBank, GFF3). Hi Zhigang, I'd considered two ideas for GenBank/EMBL, Lazy parsing of the feature table: The existing iterator approach reads in a GenBank file record by record, and parses everything into objects (a SeqRecord object with the sequence as a Seq object and the features as a list of SeqFeature objects). I did some profiling a while ago, and of this the feature processing is quite slow, therefore during the initial parse the features could be stored in memory as a list of strings, and only parsed into SeqFeature objects if the user tries to access the SeqRecord's feature property. It would require a fairly simple subclassing of the SeqRecord to make the features list into a property in order to populate the list of SeqFeatures when first accessed. In the situation where the user never uses the features, this should be much faster, and save some memory as well (that would need to be confirmed by measurement - but a list of strings should take less RAM than a list of SeqFeature objects with all the sub-objects like the locations and annotations). In the situation where the use does access the features, the simplest behaviour would be to process the cached raw feature table into a list of SeqFeature objects. The overall runtime and memory usage would be about what we have now. This would not require any file seeking, and could be used within the existing SeqIO interface where we make a single pass though the file for parsing - this is vital in order to cope with handles like stdin and network handles where you cannot seek backwards in the file. That is the simpler idea, some real benefits, but not too ambitious. If you are already familiar with the GenBank/EMBL file format and our current parser and the SeqRecord object, then I think a week is reasonable. A full index based approach would mean scanning the GenBank, EMBL or GFF file and recording information about where each feature is on disk (file offset) and the feature location coordinates. This could be recorded in an efficient index structure (I was thinking something based on BAM's BAI or Heng Li's improved version CSI). The idea here is that when the user wants to look at features in a particular region of the genome (e.g. they have a mutation or SNP in region 1234567 on chr5) then only the annotation in that part of the genome needs to be loaded from the disk. This would likely require API changes or additions, for example the SeqRecord currently holds the SeqFeature objects as a simple list - with no build in co-ordinate access. 
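To make the first idea a little more concrete, a rough sketch of such a SeqRecord subclass might look like this (not existing Biopython code - the helper standing in for the real GenBank/EMBL feature table parser is invented for illustration):

    from Bio.Seq import Seq
    from Bio.SeqRecord import SeqRecord
    from Bio.SeqFeature import SeqFeature, FeatureLocation

    def _build_features(raw_lines):
        # stand-in for the real feature table parser; just turns each
        # cached line into a dummy SeqFeature
        return [SeqFeature(FeatureLocation(0, 10), type=line.split()[0])
                for line in raw_lines if line.strip()]

    class LazyFeatureRecord(SeqRecord):
        """Cache the raw feature table; build SeqFeatures on first access."""

        def __init__(self, seq, raw_feature_lines=None, **kwargs):
            self._raw_feature_lines = raw_feature_lines or []
            SeqRecord.__init__(self, seq, **kwargs)

        @property
        def features(self):
            if self._raw_feature_lines:
                self._feature_list = _build_features(self._raw_feature_lines)
                self._raw_feature_lines = []
            return self._feature_list

        @features.setter
        def features(self, value):
            # SeqRecord.__init__ assigns an empty list through this setter
            self._feature_list = value

    record = LazyFeatureRecord(Seq("ACGT" * 250),
                               raw_feature_lines=["gene 1..10"], id="demo")
    print(len(record.features))  # SeqFeature objects are only built here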
As I wrote in the original outline email, there is scope for a very ambitious project working in this area - but some of these ideas would require more background knowledge or preparation: http://lists.open-bio.org/pipermail/biopython-dev/2013-March/010469.html Anything looking to work with GFF (in the broad sense of GFF3 and/or GTF) would ideal incorporate Brad Chapman's existing work: http://biopython.org/wiki/GFF_Parsing Regards, Peter From albl500 at york.ac.uk Thu May 2 13:54:37 2013 From: albl500 at york.ac.uk (Alex Leach) Date: Thu, 02 May 2013 14:54:37 +0100 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: Hi again, Thought I'd contribute some thoughts... Hope I'm not intruding too much on the discussion. On Thu, 02 May 2013 13:54:52 +0100, Peter Cock wrote: > > It would require a fairly simple subclassing of the SeqRecord to make > the features list into a property in order to populate the list of > SeqFeatures when first accessed. > Yes. You can turn a class property into a function quite easily, using decorators. Here[1] is a pretty good example, description and justification. [1] - http://stackoverflow.com/questions/6618002/python-property-versus-getters-and-setters > In the situation where the user never uses the features, this should > be much faster, and save some memory as well (that would need to > be confirmed by measurement - but a list of strings should take less > RAM than a list of SeqFeature objects with all the sub-objects like > the locations and annotations). > > In the situation where the use does access the features, the simplest > behaviour would be to process the cached raw feature table into a > list of SeqFeature objects. The overall runtime and memory usage > would be about what we have now. This would not require any > file seeking, and could be used within the existing SeqIO interface > where we make a single pass though the file for parsing - this is > vital in order to cope with handles like stdin and network handles > where you cannot seek backwards in the file. I think the Pythonic way here would be to follow the "Easier to Ask for Forgiveness than to ask for Permission" (EAFP) idiom[2]. i.e. Try to seek the file handle first, and if that raises an IOError, catch the exception and continue to cache the input stream data, perhaps writing it to a temporary file on disk. [2] - http://docs.python.org/2/glossary.html#term-eafp > > That is the simpler idea, some real benefits, but not too ambitious. > If you are already familiar with the GenBank/EMBL file format and > our current parser and the SeqRecord object, then I think a week > is reasonable. > > A full index based approach would mean scanning the GenBank, > EMBL or GFF file and recording information about where each > feature is on disk (file offset) and the feature location coordinates. > This could be recorded in an efficient index structure (I was thinking > something based on BAM's BAI or Heng Li's improved version CSI). > The idea here is that when the user wants to look at features in a > particular region of the genome (e.g. they have a mutation or SNP > in region 1234567 on chr5) then only the annotation in that part > of the genome needs to be loaded from the disk. Thought I'd add that Blast uses SQL tables (in ISAM format) for maintaining indexes to their databases[3]. I'm not familiar with BioPython's BioSQL module at all, but a nice feature of sqlite is that you can hold temporary databases in memory[4]. 
[3] - http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/seqdbisam_8hpp.html [4] - http://docs.python.org/2/library/sqlite3.html#using-sqlite3-efficiently Cheers, Alex > > This would likely require API changes or additions, for example > the SeqRecord currently holds the SeqFeature objects as a > simple list - with no build in co-ordinate access. > > As I wrote in the original outline email, there is scope for a very > ambitious project working in this area - but some of these ideas > would require more background knowledge or preparation: > http://lists.open-bio.org/pipermail/biopython-dev/2013-March/010469.html > > Anything looking to work with GFF (in the broad sense of GFF3 > and/or GTF) would ideal incorporate Brad Chapman's existing > work: http://biopython.org/wiki/GFF_Parsing > > Regards, > > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev -- --- Alex Leach. BSc, MRes PhD Student Chong & Redeker Labs Department of Biology University of York YO10 5DD Tel: 07940 480 771 EMAIL DISCLAIMER: http://www.york.ac.uk/docs/disclaimer/email.htm From idoerg at gmail.com Thu May 2 16:12:12 2013 From: idoerg at gmail.com (Iddo Friedberg) Date: Thu, 2 May 2013 12:12:12 -0400 Subject: [Biopython-dev] Uniprot-GOA parser Message-ID: Does anybody have a GOA parser in the works? Currently writing a simple parser for GAF, GPA and GPI formats. Can contribute if there is interest. More on GOA: http://www.ebi.ac.uk/GOA Cheers, Iddo -- Iddo Friedberg http://iddo-friedberg.net/contact.html ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>----.<--.>++++++.<<<<------------------------------------. From p.j.a.cock at googlemail.com Thu May 2 16:18:17 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 2 May 2013 17:18:17 +0100 Subject: [Biopython-dev] Uniprot-GOA parser In-Reply-To: References: Message-ID: On Thu, May 2, 2013 at 5:12 PM, Iddo Friedberg wrote: > Does anybody have a GOA parser in the works? Currently writing a simple > parser for GAF, GPA and GPI formats. Can contribute if there is interest. > > More on GOA: http://www.ebi.ac.uk/GOA > > Cheers, > > Iddo Hi Iddo, I see they're now offering GPAD1.1 format (as well? instead?). Does targeting that make more sense in the long run? I know a few people on the list are or were looking at ontology support for Biopython... it would be good to add this. Regards, Peter From idoerg at gmail.com Thu May 2 16:19:39 2013 From: idoerg at gmail.com (Iddo Friedberg) Date: Thu, 2 May 2013 12:19:39 -0400 Subject: [Biopython-dev] Uniprot-GOA parser In-Reply-To: References: Message-ID: Yes, will do GPAD as well. Need to preserve the others though, due to legacy. ./I On Thu, May 2, 2013 at 12:18 PM, Peter Cock wrote: > On Thu, May 2, 2013 at 5:12 PM, Iddo Friedberg wrote: > > Does anybody have a GOA parser in the works? Currently writing a simple > > parser for GAF, GPA and GPI formats. Can contribute if there is interest. > > > > More on GOA: http://www.ebi.ac.uk/GOA > > > > Cheers, > > > > Iddo > > Hi Iddo, > > I see they're now offering GPAD1.1 format (as well? instead?). > Does targeting that make more sense in the long run? > > I know a few people on the list are or were looking at ontology > support for Biopython... it would be good to add this. 
> > Regards, > > Peter > -- Iddo Friedberg http://iddo-friedberg.net/contact.html ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>----.<--.>++++++.<<<<------------------------------------. From zhigang.wu at email.ucr.edu Thu May 2 21:18:43 2013 From: zhigang.wu at email.ucr.edu (Zhigang Wu) Date: Thu, 2 May 2013 14:18:43 -0700 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: Hi Chris and All, In your comments to my proposal, you mentioned that some GFF files may have a size of GBs. After seeing that comment, I just want to roughly know how large is a gff file people are often working with? I mainly work on plants and I am not quite familiar with animals. Below I listed out a list of animals and plants, to my knowledge from reading papers, which most people are working with. organism(genome size) size of gff url to the ftp *folder*(not a huge file so feel free to click it) arabidopsis(~120MB) 44MB ftp://ftp.arabidopsis.org/Maps/gbrowse_data/TAIR10/ rice(~450MB) 77MB here corn(3GB) 87MB http://ftp.maizesequence.org/release-5b/filtered-set/ D. melanogaster 450MB ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r5.50_FB2013_02/gff/ C. elegans (site going down) http://wiki.wormbase.org/index.php/Downloads#GFF2 H. sapiens(3G) 170MB here My point is that caching gff files in memory wasn't as bad as we have thought. Any comments or suggestion are welcome. Best, Zhigang On Wed, May 1, 2013 at 7:40 AM, Chris Mitchell wrote: > Hi Zhigang, > > I throw some comments on your proposal. As i said there, I think you need > to find & look at a variety of gff/gtf files to see where your > implementation breaks down. Also, for parsing, I would focus on optimizing > the speed the user can access attributes, they're the bits people care most > about (where is gene X, what is the FPKM of isoform y?, etc.) > > Chris > > > On Wed, May 1, 2013 at 10:17 AM, Zhigang Wu wrote: > >> Hi Peter and all, >> Thanks for the long explanation. >> I got much better understand of this project though I am still confusing >> on >> how to implement the lazy-loading parser for feature rich files (EMBL, >> GenBank, GFF3). >> Since the deadline is pretty close,I decided to post my premature of >> proposal for this project. It would be great if you all can given me some >> comments and suggestions. The proposal is available >> here< >> https://docs.google.com/document/d/1BgPRKTq7HXq1K6fb9U2TnN7VvSDDlSTsQN991okekzk/edit?usp=sharing >> >. >> >> Thank you all in advance. >> >> >> Zhigang >> >> >> >> On Sat, Apr 27, 2013 at 1:40 PM, Peter Cock > >wrote: >> >> > On Sat, Apr 27, 2013 at 8:22 PM, Zhigang Wu >> > wrote: >> > > Peter, >> > > >> > > Thanks for the detailed explanation. It's very helpful. I am not quite >> > > sure about the goal of the lazy-loading parser. >> > > Let me try to summarize what are the goals of lazy-loading and how >> > > lazy-loading would work. Please correct me if necessary. Below I use >> > > fasta/fastq file as an example. The idea should generally applies to >> > > other format such as GenBank/EMBL as you mentioned. >> > > >> > > Lazy-loading is useful under the assumption that given a large file, >> > > we are interested in partial information of it but not all of them. 
>> > > For example a fasta file contains Arabidopsis genome, we only >> > > interested in the sequence of chr5 from index position from 2000-3000. >> > > Rather than parsing the whole file and storing each record in memory >> > > as most parsers will do, during the indexing step, lazy loading >> > > parser will only store a few position information, such as access >> > > positions (readily usable for seek) for all chromosomes (chr1, chr2, >> > > chr3, chr4, chr5, ...) and may be position index information such as >> > > the access positions for every 1000bp positions for each sequence in >> > > the given file. After indexing, we store these information in a >> > > dictionary like following {'chr1':{0:access_pos, 1000:access_pos, >> > > 2000:access_pos, ...}, 'chr2':{0:access_pos, 1000:access_pos, >> > > 2000:access_pos,}, 'chr3'...}. >> > > >> > > Compared to the usual parser which tends to parsing the whole file, we >> > > gain two benefits: speed, less memory usage and random access. Speed >> > > is gained because we skipped a lot during the parsing step. Go back to >> > > my example, once we have the dictionary, we can just seek to the >> > > access position of chr5:2000 and start reading and parsing from there. >> > > Less memory usage is due to we only stores access positions for each >> > > record as a dictionary in memory. >> > > >> > > >> > > Best, >> > > >> > > Zhigang >> > >> > Hi Zhigang, >> > >> > Yes - that's the basic idea of a disk based lazy loader. Here >> > the data stays on the disk until needed, so generally this is >> > very low memory but can be slow as it needs to read from >> > the disk. And existing example already in Biopython is our >> > BioSQL bindings which present a SeqRecord subclass which >> > only retrieves values from the database on demand. >> > >> > Note in the case of FASTA, we might want to use the existing >> > FAI index files from Heng Li's faidx tool (or another existing >> > index scheme). That relies on each record using a consistent >> > line wrapping length, so that seek offsets can be easily >> > calculated. >> > >> > An alternative idea is to load the data into memory (so that the >> > file is not touched again, useful for stream processing where >> > you cannot seek within the input data) but it is only parsed into >> > Python objects on demand. This would use a lot more memory, >> > but should be faster as there is no disk seeking and reading >> > (other than the one initial read). For FASTA this wouldn't help >> > much but it might work for EMBL/GenBank. >> > >> > Something to beware of with any lazy loading / lazy parsing is >> > what happens if the user tries to edit the record? Do you want >> > to allow this (it makes the code more complex) or not (simpler >> > and still very useful). >> > >> > In terms of usage examples, for things like raw NGS data this >> > is (currently) made up of lots and lots of short sequences (under >> > 1000bp). Lazy loading here is unlikely to be very helpful - unless >> > perhaps you can make the FASTQ parser faster this way? >> > (Once the reads are assembled or mapped to a reference, >> > random access to lookup reads by their mapped location is >> > very very important, thus the BAI indexing of BAM files). >> > >> > In terms of this project, I was thinking about a SeqRecord >> > style interface extending Bio.SeqIO (but you can suggest >> > something different for your project). 
>> > >> > What I saw as the main use case here is large datasets like >> > whole chromosomes in FASTA format or richly annotated >> > formats like EMBL, GenBank or GFF3. Right now if I am >> > doing something with (for example) the annotated human >> > chromosomes, loading these as GenBank files is quite >> > slow (it takes a far amount of memory too, but that isn't >> > my main worry). A lazy loading approach should let me >> > 'load' the GenBank files almost instantly, and delay >> > reading specific features or sequence from the disk >> > until needed. >> > >> > For example, I might have a list of genes for which I wish >> > to extract the annotation or sequence for - and there is no >> > need to load all the other features or the rest of the genome. >> > >> > (Note we can already do this by loading GenBank files >> > into a BioSQL database, and access them that way) >> > >> > Regards, >> > >> > Peter >> > >> _______________________________________________ >> Biopython-dev mailing list >> Biopython-dev at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython-dev >> > > From zhigang.wu at email.ucr.edu Fri May 3 00:18:03 2013 From: zhigang.wu at email.ucr.edu (Zhigang Wu) Date: Thu, 2 May 2013 17:18:03 -0700 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: On Thu, May 2, 2013 at 5:54 AM, Peter Cock wrote: > On Wed, May 1, 2013 at 3:17 PM, Zhigang Wu > wrote: > > Hi Peter and all, > > Thanks for the long explanation. > > I got much better understand of this project though I am still confusing > on > > how to implement the lazy-loading parser for feature rich files (EMBL, > > GenBank, GFF3). > > Hi Zhigang, > > I'd considered two ideas for GenBank/EMBL, > > Lazy parsing of the feature table: The existing iterator approach reads > in a GenBank file record by record, and parses everything into objects > (a SeqRecord object with the sequence as a Seq object and the > features as a list of SeqFeature objects). I did some profiling a while > ago, and of this the feature processing is quite slow, therefore during > the initial parse the features could be stored in memory as a list of > strings, and only parsed into SeqFeature objects if the user tries to > access the SeqRecord's feature property. > > It would require a fairly simple subclassing of the SeqRecord to make > the features list into a property in order to populate the list of > SeqFeatures when first accessed. > > In the situation where the user never uses the features, this should > be much faster, and save some memory as well (that would need to > be confirmed by measurement - but a list of strings should take less > RAM than a list of SeqFeature objects with all the sub-objects like > the locations and annotations). > I agree. This would save some memory. > In the situation where the use does access the features, the simplest > behaviour would be to process the cached raw feature table into a > list of SeqFeature objects. The overall runtime and memory usage > would be about what we have now. This would not require any > file seeking, and could be used within the existing SeqIO interface > where we make a single pass though the file for parsing - this is > vital in order to cope with handles like stdin and network handles > where you cannot seek backwards in the file. > > Yes, I agree. So in this sense, the name "lazy-loading" is a little misleading. 
Because, this would load everything into memory at the beginning, while just delay in parsing any feature until a specific one is requested. Seems like "lazy parsing" would be more appropriate. That is the simpler idea, some real benefits, but not too ambitious. > If you are already familiar with the GenBank/EMBL file format and > our current parser and the SeqRecord object, then I think a week > is reasonable. > > No, I am not quite familiar with these. > A full index based approach would mean scanning the GenBank, > EMBL or GFF file and recording information about where each > feature is on disk (file offset) and the feature location coordinates. > This could be recorded in an efficient index structure (I was thinking > something based on BAM's BAI or Heng Li's improved version CSI). > The idea here is that when the user wants to look at features in a > particular region of the genome (e.g. they have a mutation or SNP > in region 1234567 on chr5) then only the annotation in that part > of the genome needs to be loaded from the disk. > > This would likely require API changes or additions, for example > the SeqRecord currently holds the SeqFeature objects as a > simple list - with no build in co-ordinate access. > > As I wrote in the original outline email, there is scope for a very > ambitious project working in this area - but some of these ideas > would require more background knowledge or preparation: > http://lists.open-bio.org/pipermail/biopython-dev/2013-March/010469.html > > Hmm, this is actually INDEXing a big file. Don't you think a little bit off topic, "lazy-loading parser". But this seems interesting and challenging and definitely going to be useful. > Anything looking to work with GFF (in the broad sense of GFF3 > and/or GTF) would ideal incorporate Brad Chapman's existing > work: http://biopython.org/wiki/GFF_Parsing > > Yes, I definitely will take a Brad's GFF parser. > Regards, > > Peter > Thanks for the long explanation again. Zhigang From yeyanbo289 at gmail.com Fri May 3 02:19:07 2013 From: yeyanbo289 at gmail.com (Yanbo Ye) Date: Fri, 3 May 2013 10:19:07 +0800 Subject: [Biopython-dev] Biopython Phylo Proposal Message-ID: Hi everyone, I forget to post my gsoc proposal page here. Any comment? http://www.google-melange.com/gsoc/proposal/review/google/gsoc2013/yeyanbo/1# Thanks, Yanbo -- ??? ???????????????? Ye Yanbo Bioinformatics Group, Wuhan Institute Of Virology, Chinese Academy of Sciences From Markus.Piotrowski at ruhr-uni-bochum.de Fri May 3 06:32:43 2013 From: Markus.Piotrowski at ruhr-uni-bochum.de (Markus Piotrowski) Date: 3 May 2013 08:32:43 +0200 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: Hi Zhigang, Sequence read files from Next Generation Sequencing methods are several GB large. Don't know if they are regulary stored in GFF files, anyhow. Best, Markus Am 2013-05-02 23:18, schrieb Zhigang Wu: > Hi Chris and All, > > In your comments to my proposal, you mentioned that some GFF files > may have > a size of GBs. > After seeing that comment, I just want to roughly know how large is a > gff > file people are often working with? > I mainly work on plants and I am not quite familiar with animals. > Below I listed out a list of animals and plants, to my knowledge from > reading papers, which most people are working with. 
> > organism(genome size) size of gff url to > the > ftp *folder*(not a huge file so feel free to click it) > arabidopsis(~120MB) 44MB > ftp://ftp.arabidopsis.org/Maps/gbrowse_data/TAIR10/ > rice(~450MB) 77MB > > here > corn(3GB) 87MB > http://ftp.maizesequence.org/release-5b/filtered-set/ > D. melanogaster 450MB > > ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r5.50_FB2013_02/gff/ > C. elegans (site going down) > http://wiki.wormbase.org/index.php/Downloads#GFF2 > H. sapiens(3G) 170MB > > here > > My point is that caching gff files in memory wasn't as bad as we have > thought. Any comments or suggestion are welcome. > > Best, > > > Zhigang > > > > > On Wed, May 1, 2013 at 7:40 AM, Chris Mitchell > wrote: > >> Hi Zhigang, >> >> I throw some comments on your proposal. As i said there, I think >> you need >> to find & look at a variety of gff/gtf files to see where your >> implementation breaks down. Also, for parsing, I would focus on >> optimizing >> the speed the user can access attributes, they're the bits people >> care most >> about (where is gene X, what is the FPKM of isoform y?, etc.) >> >> Chris >> >> >> On Wed, May 1, 2013 at 10:17 AM, Zhigang Wu >> wrote: >> >>> Hi Peter and all, >>> Thanks for the long explanation. >>> I got much better understand of this project though I am still >>> confusing >>> on >>> how to implement the lazy-loading parser for feature rich files >>> (EMBL, >>> GenBank, GFF3). >>> Since the deadline is pretty close,I decided to post my premature >>> of >>> proposal for this project. It would be great if you all can given >>> me some >>> comments and suggestions. The proposal is available >>> here< >>> >>> https://docs.google.com/document/d/1BgPRKTq7HXq1K6fb9U2TnN7VvSDDlSTsQN991okekzk/edit?usp=sharing >>> >. >>> >>> Thank you all in advance. >>> >>> >>> Zhigang >>> >>> >>> >>> On Sat, Apr 27, 2013 at 1:40 PM, Peter Cock >>> >> >wrote: >>> >>> > On Sat, Apr 27, 2013 at 8:22 PM, Zhigang Wu >>> >>> > wrote: >>> > > Peter, >>> > > >>> > > Thanks for the detailed explanation. It's very helpful. I am >>> not quite >>> > > sure about the goal of the lazy-loading parser. >>> > > Let me try to summarize what are the goals of lazy-loading and >>> how >>> > > lazy-loading would work. Please correct me if necessary. Below >>> I use >>> > > fasta/fastq file as an example. The idea should generally >>> applies to >>> > > other format such as GenBank/EMBL as you mentioned. >>> > > >>> > > Lazy-loading is useful under the assumption that given a large >>> file, >>> > > we are interested in partial information of it but not all of >>> them. >>> > > For example a fasta file contains Arabidopsis genome, we only >>> > > interested in the sequence of chr5 from index position from >>> 2000-3000. >>> > > Rather than parsing the whole file and storing each record in >>> memory >>> > > as most parsers will do, during the indexing step, lazy >>> loading >>> > > parser will only store a few position information, such as >>> access >>> > > positions (readily usable for seek) for all chromosomes (chr1, >>> chr2, >>> > > chr3, chr4, chr5, ...) and may be position index information >>> such as >>> > > the access positions for every 1000bp positions for each >>> sequence in >>> > > the given file. After indexing, we store these information in a >>> > > dictionary like following {'chr1':{0:access_pos, >>> 1000:access_pos, >>> > > 2000:access_pos, ...}, 'chr2':{0:access_pos, 1000:access_pos, >>> > > 2000:access_pos,}, 'chr3'...}. 
>>> > > >>> > > Compared to the usual parser which tends to parsing the whole >>> file, we >>> > > gain two benefits: speed, less memory usage and random access. >>> Speed >>> > > is gained because we skipped a lot during the parsing step. Go >>> back to >>> > > my example, once we have the dictionary, we can just seek to >>> the >>> > > access position of chr5:2000 and start reading and parsing from >>> there. >>> > > Less memory usage is due to we only stores access positions for >>> each >>> > > record as a dictionary in memory. >>> > > >>> > > >>> > > Best, >>> > > >>> > > Zhigang >>> > >>> > Hi Zhigang, >>> > >>> > Yes - that's the basic idea of a disk based lazy loader. Here >>> > the data stays on the disk until needed, so generally this is >>> > very low memory but can be slow as it needs to read from >>> > the disk. And existing example already in Biopython is our >>> > BioSQL bindings which present a SeqRecord subclass which >>> > only retrieves values from the database on demand. >>> > >>> > Note in the case of FASTA, we might want to use the existing >>> > FAI index files from Heng Li's faidx tool (or another existing >>> > index scheme). That relies on each record using a consistent >>> > line wrapping length, so that seek offsets can be easily >>> > calculated. >>> > >>> > An alternative idea is to load the data into memory (so that the >>> > file is not touched again, useful for stream processing where >>> > you cannot seek within the input data) but it is only parsed into >>> > Python objects on demand. This would use a lot more memory, >>> > but should be faster as there is no disk seeking and reading >>> > (other than the one initial read). For FASTA this wouldn't help >>> > much but it might work for EMBL/GenBank. >>> > >>> > Something to beware of with any lazy loading / lazy parsing is >>> > what happens if the user tries to edit the record? Do you want >>> > to allow this (it makes the code more complex) or not (simpler >>> > and still very useful). >>> > >>> > In terms of usage examples, for things like raw NGS data this >>> > is (currently) made up of lots and lots of short sequences (under >>> > 1000bp). Lazy loading here is unlikely to be very helpful - >>> unless >>> > perhaps you can make the FASTQ parser faster this way? >>> > (Once the reads are assembled or mapped to a reference, >>> > random access to lookup reads by their mapped location is >>> > very very important, thus the BAI indexing of BAM files). >>> > >>> > In terms of this project, I was thinking about a SeqRecord >>> > style interface extending Bio.SeqIO (but you can suggest >>> > something different for your project). >>> > >>> > What I saw as the main use case here is large datasets like >>> > whole chromosomes in FASTA format or richly annotated >>> > formats like EMBL, GenBank or GFF3. Right now if I am >>> > doing something with (for example) the annotated human >>> > chromosomes, loading these as GenBank files is quite >>> > slow (it takes a far amount of memory too, but that isn't >>> > my main worry). A lazy loading approach should let me >>> > 'load' the GenBank files almost instantly, and delay >>> > reading specific features or sequence from the disk >>> > until needed. >>> > >>> > For example, I might have a list of genes for which I wish >>> > to extract the annotation or sequence for - and there is no >>> > need to load all the other features or the rest of the genome. 
>>> > >>> > (Note we can already do this by loading GenBank files >>> > into a BioSQL database, and access them that way) >>> > >>> > Regards, >>> > >>> > Peter >>> > >>> _______________________________________________ >>> Biopython-dev mailing list >>> Biopython-dev at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biopython-dev >>> >> >> > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From p.j.a.cock at googlemail.com Mon May 6 11:23:24 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 6 May 2013 12:23:24 +0100 Subject: [Biopython-dev] Abstract for "Biopython Project Update" at BOSC 2013 In-Reply-To: References: Message-ID: On Tue, Apr 16, 2013 at 9:47 AM, Peter Cock wrote: > On Tue, Apr 16, 2013 at 1:43 AM, Eric Talevich wrote: >> >> The abstract looks good to me. Which release was the first to include >> SearchIO, was that 1.61? If so, maybe it would be good to note that in >> addition to the smaller improvements, SearchIO specifically was (one of?) >> the new module(s) that introduced the beta designation. >> > > Yes, SearchIO was included in Biopython 1.61, but you're right that > could be made a bit clearer. > The Biopython update has been accepted for a 10 minute talk slot at BOSC (anyone else with an abstract submitted should have had an email by now), the reviewers' feedback was short and positive: (A) Keep it short and show the variety of active sub-projects and people involved and the presentaion will will be attractive to the audience. The last year's talk is a good example (based on the shared slides). (Last year it was Eric at BOSC 2012 in Long Beach, CA - well done) (B) Nice to see latest news on BioPython and future directions of one of the most popular OpenBio project. (C) This talk reports an update on the BioPython project (support for experimental codes, Python 3 compatibility, SearchIO and genomic variant formats). BioPython is one of the central projects of O.B.F and its update is worth getting some attention at BOSC. We have until June to revise our abstract - so perhaps we should do the next release this month in May ;) Peter From idoerg at gmail.com Tue May 7 16:24:00 2013 From: idoerg at gmail.com (Iddo Friedberg) Date: Tue, 7 May 2013 12:24:00 -0400 Subject: [Biopython-dev] uniprot-GOA parse Message-ID: hi, As promised, I have written a uniprot-goa parser. Very skeletal, has iterators for reading the three uniprot-GOA file types, a write function, and a couple of usage examples. No github write access, so attaching. Cheers, Iddo -- Iddo Friedberg http://iddo-friedberg.net/contact.html ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>----.<--.>++++++.<<<<------------------------------------. -------------- next part -------------- A non-text attachment was scrubbed... Name: upg_parser.py Type: application/octet-stream Size: 10344 bytes Desc: not available URL: From p.j.a.cock at googlemail.com Tue May 7 16:47:16 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 7 May 2013 17:47:16 +0100 Subject: [Biopython-dev] uniprot-GOA parse In-Reply-To: References: Message-ID: On Tue, May 7, 2013 at 5:24 PM, Iddo Friedberg wrote: > hi, > > As promised, I have written a uniprot-goa parser. 
Very skeletal, has > iterators for reading the three uniprot-GOA file types, a write function, > and a couple of usage examples. > > No github write access, so attaching. The file arrived :) Did you have any thoughts on where in the namespace to put this? The idea with github is you'd register an account, say iddux (since that's your Twitter username), and then fork the repository as https://github.com/iddux/biopython - and make a new branch there with your changes, and ask for feedback or make a pull request. All that can be done without any write access to the main repository, and is intended to lower the barrier to entry. In your case, given you're a past project leader etc, drop me (or Brad etc) an email once you've mastered the git basic and we can give you direct access. Regards, Peter From natemsutton at yahoo.com Tue May 7 21:12:59 2013 From: natemsutton at yahoo.com (Nate Sutton) Date: Tue, 7 May 2013 14:12:59 -0700 (PDT) Subject: [Biopython-dev] Progress with ticket 3336 Message-ID: <1367961179.88206.YahooMailNeo@web122603.mail.ne1.yahoo.com> Hi, Here is a progress follow up to http://lists.open-bio.org/pipermail/biopython-dev/2013-April/010548.html . ?I have added a commit to the github branch that adds an option to create claude branch lines using linecollection. ?The linecollection objects are stored in a tuple before adding them to the plot. ?It?s in Bio/Phylo/_utils.py. ?Is this what the last bullet point was requesting in https://redmine.open-bio.org/issues/3336 ? ? Thanks! Nate P. S. ?I used a tuple to store the linecollection objects instead of a list because that was mentioned in the ticket but if that looks like it should be different let me know. ?Also, I got some global variables to work with the code but I was only able to do that after declaring them as globals twice. ?If there are suggestions on how to code that differently let me know. From idoerg at gmail.com Wed May 8 23:28:17 2013 From: idoerg at gmail.com (Iddo Friedberg) Date: Wed, 8 May 2013 19:28:17 -0400 Subject: [Biopython-dev] UniProt GOA parser Message-ID: A new uniprot-GOA parser is available for you to poke around: https://github.com/idoerg/biopython/tree/uniprot-goa/Bio/UniProtGOA More on Uniprot-GOA: http://www.ebi.ac.uk/GOA There are three file formats: GAF (gene association file) , GPA (gene product association) and GPI (gene product information) explained here: http://www.ebi.ac.uk/GOA/downloads Input GAF files can be very large, due to the growth of uniprot GOA. If you would like to test in a timely fashion, I suggest you get historical files, which are smaller. Once you get to the > 40 version numbers, the runtime for the example code in UniProtGOA.py goes over 2 minutes (on my i5 machine). Old GAF files are available here: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/old/UNIPROT/ Current GPI and GPA files are not very large. Thanks to Peter for his help on this. Best, Iddo -- Iddo Friedberg http://iddo-friedberg.net/contact.html ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>----.<--.>++++++.<<<<------------------------------------. 
From p.j.a.cock at googlemail.com Fri May 10 10:06:19 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 10 May 2013 11:06:19 +0100 Subject: [Biopython-dev] UniProt GOA parser In-Reply-To: References: Message-ID: On Thu, May 9, 2013 at 12:28 AM, Iddo Friedberg wrote: > A new uniprot-GOA parser is available for you to poke around: > > https://github.com/idoerg/biopython/tree/uniprot-goa/Bio/UniProtGOA > I think for the namespace, we might be better off using Bio.UniProt.GOA, where Iddo's parser would be in Bio/UniProt/GOA.py and any other UniProt specific code could also go under Bio/UniProt - for example a web API. Some of Bio.SwissProt might also migrate here over time. > More on Uniprot-GOA: http://www.ebi.ac.uk/GOA > > There are three file formats: GAF (gene association file) , GPA (gene > product association) and GPI (gene product information) explained here: > http://www.ebi.ac.uk/GOA/downloads > > Input GAF files can be very large, due to the growth of uniprot GOA. If you > would like to test in a timely fashion, I suggest you get historical files, > which are smaller. Once you get to the > 40 version numbers, the runtime > for the example code in UniProtGOA.py goes over 2 minutes (on my i5 > machine). Would it make sense to want random access to the GOA files based on the identifier (DB_Object_ID and/or DB_Object_Symbol)? That should be fairly straight forward to do building on the indexing code for Bio.SeqIO and SearchIO. Note here I am picturing combining all the (consecutive) lines for the same DB_Object_ID - currently the parser is line based, but batching by DB_Object_ID would be a straightforward change and may better suit some uses. > Old GAF files are available here: > ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/old/UNIPROT/ > > Current GPI and GPA files are not very large. > > Thanks to Peter for his help on this. > > Best, > > Iddo Peter From idoerg at gmail.com Fri May 10 16:20:16 2013 From: idoerg at gmail.com (Iddo Friedberg) Date: Fri, 10 May 2013 12:20:16 -0400 Subject: [Biopython-dev] UniProt GOA parser In-Reply-To: References: Message-ID: On Fri, May 10, 2013 at 6:06 AM, Peter Cock wrote: > On Thu, May 9, 2013 at 12:28 AM, Iddo Friedberg wrote: > > A new uniprot-GOA parser is available for you to poke around: > > > > https://github.com/idoerg/biopython/tree/uniprot-goa/Bio/UniProtGOA > > > > I think for the namespace, we might be better off using Bio.UniProt.GOA, > where Iddo's parser would be in Bio/UniProt/GOA.py and any other > UniProt specific code could also go under Bio/UniProt - for example > a web API. > OK. > > Some of Bio.SwissProt might also migrate here over time. > > > More on Uniprot-GOA: http://www.ebi.ac.uk/GOA > > > > There are three file formats: GAF (gene association file) , GPA (gene > > product association) and GPI (gene product information) explained here: > > http://www.ebi.ac.uk/GOA/downloads > > > > Input GAF files can be very large, due to the growth of uniprot GOA. If > you > > would like to test in a timely fashion, I suggest you get historical > files, > > which are smaller. Once you get to the > 40 version numbers, the runtime > > for the example code in UniProtGOA.py goes over 2 minutes (on my i5 > > machine). > > Would it make sense to want random access to the GOA files based > on the identifier (DB_Object_ID and/or DB_Object_Symbol)? That > should be fairly straight forward to do building on the indexing code > for Bio.SeqIO and SearchIO. > Would that require reading it all into memory? 
Uniprot_GOA files are huge, it is impractical to read them in fully. > > Note here I am picturing combining all the (consecutive) lines > for the same DB_Object_ID - currently the parser is line based, > but batching by DB_Object_ID would be a straightforward change > and may better suit some uses. > Perhaps only for organism specific file, which in some cases can be read fully into memory. > > > Old GAF files are available here: > > ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/old/UNIPROT/ > > > > Current GPI and GPA files are not very large. > > > > Thanks to Peter for his help on this. > > > > Best, > > > > Iddo > > Peter > -- Iddo Friedberg http://iddo-friedberg.net/contact.html ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>----.<--.>++++++.<<<<------------------------------------. From p.j.a.cock at googlemail.com Fri May 10 16:26:13 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 10 May 2013 17:26:13 +0100 Subject: [Biopython-dev] UniProt GOA parser In-Reply-To: References: Message-ID: On Fri, May 10, 2013 at 5:20 PM, Iddo Friedberg wrote: > On Fri, May 10, 2013 at 6:06 AM, Peter Cock wrote: >> >> Would it make sense to want random access to the GOA files based >> on the identifier (DB_Object_ID and/or DB_Object_Symbol)? That >> should be fairly straight forward to do building on the indexing code >> for Bio.SeqIO and SearchIO. > > > Would that require reading it all into memory? Uniprot_GOA files > are huge, it is impractical to read them in fully. Not at all - we'd record a dictionary mapping the record ID to an offset in the file on disk, or record this mapping in an SQLite index file. >> Note here I am picturing combining all the (consecutive) lines >> for the same DB_Object_ID - currently the parser is line based, >> but batching by DB_Object_ID would be a straightforward change >> and may better suit some uses. > > Perhaps only for organism specific file, which in some cases can > be read fully into memory. The examples I looked at only seemed to have a dozen or so lines for each DB_Object_ID - but perhaps these were easy cases? How many lines per DB_Object_ID in the worst cases? Peter From idoerg at gmail.com Fri May 10 16:32:43 2013 From: idoerg at gmail.com (Iddo Friedberg) Date: Fri, 10 May 2013 12:32:43 -0400 Subject: [Biopython-dev] UniProt GOA parser In-Reply-To: References: Message-ID: On Fri, May 10, 2013 at 12:26 PM, Peter Cock wrote: > On Fri, May 10, 2013 at 5:20 PM, Iddo Friedberg wrote: > > On Fri, May 10, 2013 at 6:06 AM, Peter Cock wrote: > >> > >> Would it make sense to want random access to the GOA files based > >> on the identifier (DB_Object_ID and/or DB_Object_Symbol)? That > >> should be fairly straight forward to do building on the indexing code > >> for Bio.SeqIO and SearchIO. > > > > > > Would that require reading it all into memory? Uniprot_GOA files > > are huge, it is impractical to read them in fully. > > Not at all - we'd record a dictionary mapping the record ID to an offset > in the file on disk, or record this mapping in an SQLite index file. > Ok, that's good then > >> Note here I am picturing combining all the (consecutive) lines > >> for the same DB_Object_ID - currently the parser is line based, > >> but batching by DB_Object_ID would be a straightforward change > >> and may better suit some uses. 
> > > > Perhaps only for organism specific file, which in some cases can > > be read fully into memory. > > The examples I looked at only seemed to have a dozen or so > lines for each DB_Object_ID - but perhaps these were easy > cases? How many lines per DB_Object_ID in the worst cases? > > Peter > I was actually thinking you are suggesting that the whole file should be read in memory, nit just buffer by DB-Object_ID. My mistake. -- Iddo Friedberg http://iddo-friedberg.net/contact.html ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>----.<--.>++++++.<<<<------------------------------------. From linxzh1989 at gmail.com Sun May 12 12:57:25 2013 From: linxzh1989 at gmail.com (=?GB2312?B?wdbQ0Nba?=) Date: Sun, 12 May 2013 20:57:25 +0800 Subject: [Biopython-dev] Errors about installing biopython 1.61 In-Reply-To: References: Message-ID: I am very Sorry about my mistake. I want to install biopython 1.61 in a local server(CentOS), python setup.py build python setup.py test and then showed some errors: ====================================================================== FAIL: Test an input file containing a single sequence. ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Clustalw_tool.py", line 166, in test_single_sequence self.assertTrue(str(err) == "No records found in handle") AssertionError ====================================================================== ERROR: Test Entrez.read from URL ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez_online.py", line 34, in test_read_from_url rec = Entrez.read(einfo) File "/share/fg2/Linxzh/biopython-1.61/build/lib.linux-x86_64-2.6/Bio/Entrez/__init__.py", line 362, in read record = handler.read(handle) File "/share/fg2/Linxzh/biopython-1.61/build/lib.linux-x86_64-2.6/Bio/Entrez/Parser.py", line 184, in read self.parser.ParseFile(handle) File "/share/fg2/Linxzh/biopython-1.61/build/lib.linux-x86_64-2.6/Bio/Entrez/Parser.py", line 322, in endElementHandler raise RuntimeError(value) RuntimeError: Unable to open connection to #DbInfo?dbaf= ====================================================================== ERROR: Run tutorial doctests. ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Tutorial.py", line 152, in test_doctests ValueError: 4 Tutorial doctests failed: test_from_line_05671, test_from_line_06030, test_from_line_06190, test_from_line_06479 ---------------------------------------------------------------------- Ran 213 tests in 1621.002 seconds FAILED (failures = 3) i use python 2.6.5 2013/5/12 ??? : > I've run the From saketkc at gmail.com Sun May 12 18:11:46 2013 From: saketkc at gmail.com (Saket Choudhary) Date: Sun, 12 May 2013 23:41:46 +0530 Subject: [Biopython-dev] samtools threaded daemon In-Reply-To: References: <516653BE.8060509@brueffer.de> Message-ID: Just completed writing samtools wrapper : https://github.com/biopython/biopython/pull/180 Unit Tests pending. 
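A minimal usage sketch of the sort of interface I have in mind (the module path and keyword names are still provisional while the pull request is under review):

from Bio.Sequencing.Applications import SamtoolsViewCommandline

view_cmd = SamtoolsViewCommandline(input_file="accepted_hits.bam", h=True)
print(view_cmd)              # e.g. samtools view -h accepted_hits.bam
stdout, stderr = view_cmd()  # executes samtools in a subprocess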
On 11 April 2013 23:51, Chris Mitchell wrote: > Here's the branch I'm starting with, including a working mpileup daemon for > those who want to use it: > > https://github.com/chrismit/biopython/tree/samtools > > sample usage: > from Bio.SamTools import SamTools > sTools = '/home/chris/bin/samtools' > hg19 = '/media/chris/ChrisSSD/ref/human/hg19.fa' > bamSource = '/media/chris/ChrisSSD/TH1Alignment/NK/accepted_hits.bam' > st = SamTools(bamSource,binary=sTools,threads=30) > > #now with a callback, which is advisable to use to process data as it is > generated > def processPileup(pileup): > print 'to process',pileup > > #st.mpileup(f=hg19,r=['chr1:%d-%d'%(i,i+1) for i in > xrange(2000001,2001001)],callback=processPileup) #with callback > #print st.mpileup(f=hg19,r=['chr1:%d-%d'%(i,i+1) for i in > xrange(2000001,2000101)]) #will just return as a list > > > On Thu, Apr 11, 2013 at 10:04 AM, Chris Mitchell wrote: > >> Given that we'd be chasing after the samtools development cycle, I think >> it's just easier to implement command line wrappers that are dynamic enough >> to handle future versions. For instance, some of the code doesn't seem too >> set in stone and appears empirical (the BAQ computation comes to mind) and >> therefore probable to change in future versions. I can package in my >> existing pileup parser, but in general I think most people will be using a >> callback routine to handle it themselves since use cases of the final >> output sort of vary project by project. >> >> Chris >> >> >> On Thu, Apr 11, 2013 at 9:54 AM, Peter Cock wrote: >> >>> On Thu, Apr 11, 2013 at 2:46 PM, Chris Mitchell >>> wrote: >>> > Also, if a binary can't be found, having it fallback to the future >>> > BioPython parser seems like it might be a good idea (provided it has >>> > similar functionality like creating pileups, does it?). >>> >>> It has the low level random access via the BAI index done, but >>> does not yet have a reimplementation of the mpileup code, no. >>> (Would that be useful compared to calling samtools and parsing >>> its output?) >>> >>> Peter >>> >> >> > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From linxzh1989 at gmail.com Mon May 13 01:41:30 2013 From: linxzh1989 at gmail.com (=?GB2312?B?wdbQ0Nba?=) Date: Mon, 13 May 2013 09:41:30 +0800 Subject: [Biopython-dev] Errors about installing biopython 1.61 In-Reply-To: References: Message-ID: 2013/5/13 Peter Cock : > On Sun, May 12, 2013 at 1:57 PM, ??? wrote: >> I want to install biopython 1.61 in a local server(CentOS), >> python setup.py build >> python setup.py test >> and then showed some errors: >> >> ... >> >> i use python 2.6.5 >> > > Thank you for getting in touch, and including the important > information about the operating system, version of Python > and version of Biopython. > >> FAIL: Test an input file containing a single sequence. >> ---------------------------------------------------------------------- >> Traceback (most recent call last): >> File "test_Clustalw_tool.py", line 166, in test_single_sequence >> self.assertTrue(str(err) == "No records found in handle") >> AssertionError >> > > This test calls the command line tool clustalw. > > What version of clustalw do you have? 
> >> ERROR: Test Entrez.read from URL >> ---------------------------------------------------------------------- >> Traceback (most recent call last): >> File "test_Entrez_online.py", line 34, in test_read_from_url >> rec = Entrez.read(einfo) >> File "/share/fg2/Linxzh/biopython-1.61/build/lib.linux-x86_64-2.6/Bio/Entrez/__init__.py", >> line 362, in read >> record = handler.read(handle) >> File "/share/fg2/Linxzh/biopython-1.61/build/lib.linux-x86_64-2.6/Bio/Entrez/Parser.py", >> line 184, in read >> self.parser.ParseFile(handle) >> File "/share/fg2/Linxzh/biopython-1.61/build/lib.linux-x86_64-2.6/Bio/Entrez/Parser.py", >> line 322, in endElementHandler >> raise RuntimeError(value) >> RuntimeError: Unable to open connection to #DbInfo?dbaf= >> > > This test connects to the NCBI Entrez server over the internet. > This kind of error is usually a temporary network problem, and > will go away if you repeat the test later. > >> ERROR: Run tutorial doctests. >> ---------------------------------------------------------------------- >> Traceback (most recent call last): >> File "test_Tutorial.py", line 152, in test_doctests >> ValueError: 4 Tutorial doctests failed: test_from_line_05671, >> test_from_line_06030, test_from_line_06190, test_from_line_06479 > > Those four failing examples in the Tutorial seem to match this > commit, made just before the Biopython 1.61 release: > > https://github.com/biopython/biopython/commit/b84bda01bd22e93a1cf71613a55 February 2013 (Biopython 1.61)cfca876b7128d7#Doc/Tutorial.tex > > Where did you get the Biopython 1.61 files from? e.g. The zip file > or tar.gz file on our website? Perhaps I accidentally included an > older copy of the Doc/Tutorial.tex file? Could you look for the > "Late Update" line in your Tutorial.tex file for me - does it say: > > \date{Last Update -- 5 February 2013 (Biopython 1.61)} > > Thanks, > > Peter Hi?Peter? Clustalw I am using is 1.83. I've found the 'Late Update' in Tutorial.tex, it's ' \date{Last Update -- 5 February 2013 (Biopython 1.61)}'. I downloaded the tar.gz from the biopython website. Thanks Lin From p.j.a.cock at googlemail.com Mon May 13 08:49:20 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 13 May 2013 09:49:20 +0100 Subject: [Biopython-dev] Errors about installing biopython 1.61 In-Reply-To: References: Message-ID: On Mon, May 13, 2013 at 2:41 AM, ??? wrote: > >> Where did you get the Biopython 1.61 files from? e.g. The zip file >> or tar.gz file on our website? Perhaps I accidentally included an >> older copy of the Doc/Tutorial.tex file? Could you look for the >> "Late Update" line in your Tutorial.tex file for me - does it say: >> >> \date{Last Update -- 5 February 2013 (Biopython 1.61)} >> >> Thanks, >> >> Peter > > Hi?Peter? > Clustalw I am using is 1.83. Hi Lin, I also have clustalw 1.83, so this isn't simply a version problem. It could be something subtle about the locale - what language is your CentOS running in (that can alter error messages etc)? > I've found the 'Late Update' in Tutorial.tex, it's ' \date{Last Update -- 5 > February 2013 (Biopython 1.61)}'. That's good - that's what it should say :) (Sorry late/last was my typing error). > > I downloaded the tar.gz from the biopython website. > Thanks. I could reproduce the test_Tutorial.py problem with that. This is easy to explain - I forgot to include the test file my_blast.xml when doing the release (and you are the first person to report this problem). 
I should have noticed this myself, sorry :( I've fixed this ready for the next release - thank you for reporting this: https://github.com/biopython/biopython/commit/c1b63b88dd5a50fa3f6f2aef840a51fe9092e0c5 If you want to, you can get the missing file from here: http://biopython.org/SRC/Doc/examples/my_blast.xml or: https://github.com/biopython/biopython/raw/master/Doc/examples/my_blast.xml If you save that in the Biopython 1.61 source under Doc/examples then the Tutorial test should pass. -- Did you retry the test_Entrez_online.py example to see if this was a temporary problem? -- The good news is these minor issues should not cause you any problems installing and using Biopython 1.61 - so you can go ahead and run 'python setup.py install. Thanks, Peter From linxzh1989 at gmail.com Mon May 13 14:34:31 2013 From: linxzh1989 at gmail.com (=?GB2312?B?wdbQ0Nba?=) Date: Mon, 13 May 2013 22:34:31 +0800 Subject: [Biopython-dev] Errors about installing biopython 1.61 In-Reply-To: References: Message-ID: 2013/5/13 Peter Cock > On Mon, May 13, 2013 at 2:41 AM, ??? wrote: > > > >> Where did you get the Biopython 1.61 files from? e.g. The zip file > >> or tar.gz file on our website? Perhaps I accidentally included an > >> older copy of the Doc/Tutorial.tex file? Could you look for the > >> "Late Update" line in your Tutorial.tex file for me - does it say: > >> > >> \date{Last Update -- 5 February 2013 (Biopython 1.61)} > >> > >> Thanks, > >> > >> Peter > > > > Hi?Peter? > > Clustalw I am using is 1.83. > > Hi Lin, > > I also have clustalw 1.83, so this isn't simply a version > problem. It could be something subtle about the locale - > what language is your CentOS running in (that can alter > error messages etc)? > > > I've found the 'Late Update' in Tutorial.tex, it's ' \date{Last Update > -- 5 > > February 2013 (Biopython 1.61)}'. > > That's good - that's what it should say :) > > (Sorry late/last was my typing error). > > > > > I downloaded the tar.gz from the biopython website. > > > > Thanks. I could reproduce the test_Tutorial.py problem with that. > This is easy to explain - I forgot to include the test file my_blast.xml > when doing the release (and you are the first person to report this > problem). I should have noticed this myself, sorry :( > > I've fixed this ready for the next release - thank you for reporting this: > > https://github.com/biopython/biopython/commit/c1b63b88dd5a50fa3f6f2aef840a51fe9092e0c5 > > If you want to, you can get the missing file from here: > http://biopython.org/SRC/Doc/examples/my_blast.xml > > or: > https://github.com/biopython/biopython/raw/master/Doc/examples/my_blast.xml > > If you save that in the Biopython 1.61 source under Doc/examples > then the Tutorial test should pass. > > -- > > Did you retry the test_Entrez_online.py example to see if > this was a temporary problem? > > -- > > The good news is these minor issues should not cause you > any problems installing and using Biopython 1.61 - so you > can go ahead and run 'python setup.py install. > > Thanks, > > Peter > Hi Peter I have run the locale in my serve $ locale LANG=en_US.UTF-8 LC_CTYPE=zh_CN.UTF-8 LC_NUMERIC=zh_CN.UTF-8 LC_TIME=zh_CN.UTF-8 LC_COLLATE="en_US.UTF-8" LC_MONETARY=zh_CN.UTF-8 LC_MESSAGES="en_US.UTF-8" LC_PAPER=zh_CN.UTF-8 LC_NAME=zh_CN.UTF-8 LC_ADDRESS=zh_CN.UTF-8 LC_TELEPHONE=zh_CN.UTF-8 LC_MEASUREMENT=zh_CN.UTF-8 LC_IDENTIFICATION=zh_CN.UTF-8 LC_ALL= Is that locale you want? I retryed the the test_Entrez_online.py, it's all right now. As you said, it should be a connection problem. 
I have put the file in the Doc/examples file, but the error still exists. And i find there is no my_blat.psl in Doc/examples comparing with the zip file i downloaded from github. After i put the my_blat.psi in the Doc/examples, the error did not show up again. Thanks Lin From p.j.a.cock at googlemail.com Mon May 13 15:50:26 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 13 May 2013 16:50:26 +0100 Subject: [Biopython-dev] Errors about installing biopython 1.61 In-Reply-To: References: Message-ID: On Mon, May 13, 2013 at 3:34 PM, ??? wrote: > > Hi Peter > I have run the locale in my serve > > $ locale > LANG=en_US.UTF-8 > LC_CTYPE=zh_CN.UTF-8 > LC_NUMERIC=zh_CN.UTF-8 > LC_TIME=zh_CN.UTF-8 > LC_COLLATE="en_US.UTF-8" > LC_MONETARY=zh_CN.UTF-8 > LC_MESSAGES="en_US.UTF-8" > LC_PAPER=zh_CN.UTF-8 > LC_NAME=zh_CN.UTF-8 > LC_ADDRESS=zh_CN.UTF-8 > LC_TELEPHONE=zh_CN.UTF-8 > LC_MEASUREMENT=zh_CN.UTF-8 > LC_IDENTIFICATION=zh_CN.UTF-8 > LC_ALL= > > Is that locale you want? Hi Lin, Thanks for checking that, but having looked in more detail I think this is not related to the locale settings. My first guess was wrong :( I think I may have solved this - my test machine has both clustalw 2.1 and clustalw 1.83, and they behave differently for this example. The old test only worked with v2.1, fixed: https://github.com/biopython/biopython/commit/859d07f3c5e8b789156a5ec2e98f4153ab896e00 If you want to verify this, you could update your copy of Tests/test_Clustalw_tool.py to that from github (or just tried installing the latest Biopython code from github?). Note the Clustal developers intended that clustalw 1 and 2 would behave the same as each other (Version 2 was a rewrite as a step towards version 3, no called ClustalOmega), but there are still some minor differences. > I retryed the the test_Entrez_online.py, it's all right now. As > you said, it should be a connection problem. OK, good. > I have put the file in the Doc/examples file, but the error still exists. > And i find there is no my_blat.psl in Doc/examples comparing with the zip > file i downloaded from github. After i put the my_blat.psi in the > Doc/examples, the error did not show up again. Thank you, that should be fixed in the next release: https://github.com/biopython/biopython/commit/a3bb49b56abb5cbb9a0a00accb57674115c7004d Your feedback has been very helpful, Thanks, Peter From linxzh1989 at gmail.com Tue May 14 01:32:23 2013 From: linxzh1989 at gmail.com (=?GB2312?B?wdbQ0Nba?=) Date: Tue, 14 May 2013 09:32:23 +0800 Subject: [Biopython-dev] Errors about installing biopython 1.61 In-Reply-To: References: Message-ID: Hi Peter I copy the test_Clustalw_tool.py from the github, now it does work. Thank you! Lin 2013/5/13 Peter Cock > On Mon, May 13, 2013 at 3:34 PM, ??? wrote: > > > > Hi Peter > > I have run the locale in my serve > > > > $ locale > > LANG=en_US.UTF-8 > > LC_CTYPE=zh_CN.UTF-8 > > LC_NUMERIC=zh_CN.UTF-8 > > LC_TIME=zh_CN.UTF-8 > > LC_COLLATE="en_US.UTF-8" > > LC_MONETARY=zh_CN.UTF-8 > > LC_MESSAGES="en_US.UTF-8" > > LC_PAPER=zh_CN.UTF-8 > > LC_NAME=zh_CN.UTF-8 > > LC_ADDRESS=zh_CN.UTF-8 > > LC_TELEPHONE=zh_CN.UTF-8 > > LC_MEASUREMENT=zh_CN.UTF-8 > > LC_IDENTIFICATION=zh_CN.UTF-8 > > LC_ALL= > > > > Is that locale you want? > > Hi Lin, > > Thanks for checking that, but having looked in more detail > I think this is not related to the locale settings. 
My first guess > was wrong :( > > I think I may have solved this - my test machine has both > clustalw 2.1 and clustalw 1.83, and they behave differently > for this example. The old test only worked with v2.1, fixed: > > https://github.com/biopython/biopython/commit/859d07f3c5e8b789156a5ec2e98f4153ab896e00 > > If you want to verify this, you could update your copy of > Tests/test_Clustalw_tool.py to that from github (or just > tried installing the latest Biopython code from github?). > > Note the Clustal developers intended that clustalw 1 and 2 > would behave the same as each other (Version 2 was a > rewrite as a step towards version 3, no called ClustalOmega), > but there are still some minor differences. > > > I retryed the the test_Entrez_online.py, it's all right now. As > > you said, it should be a connection problem. > > OK, good. > > > I have put the file in the Doc/examples file, but the error still exists. > > And i find there is no my_blat.psl in Doc/examples comparing with the zip > > file i downloaded from github. After i put the my_blat.psi in the > > Doc/examples, the error did not show up again. > > Thank you, that should be fixed in the next release: > > https://github.com/biopython/biopython/commit/a3bb49b56abb5cbb9a0a00accb57674115c7004d > > Your feedback has been very helpful, > > Thanks, > > Peter > From idoerg at gmail.com Fri May 17 21:35:41 2013 From: idoerg at gmail.com (Iddo Friedberg) Date: Fri, 17 May 2013 17:35:41 -0400 Subject: [Biopython-dev] UniProt GOA parser In-Reply-To: References: Message-ID: OK. I added a few changes as suggested by Peter. There is a parser now to group GAF files by DB_Object_ID, and a write function to write them. Random access not implemented yet. On Fri, May 10, 2013 at 12:32 PM, Iddo Friedberg wrote: > > > On Fri, May 10, 2013 at 12:26 PM, Peter Cock wrote: > >> On Fri, May 10, 2013 at 5:20 PM, Iddo Friedberg wrote: >> > On Fri, May 10, 2013 at 6:06 AM, Peter Cock wrote: >> >> >> >> Would it make sense to want random access to the GOA files based >> >> on the identifier (DB_Object_ID and/or DB_Object_Symbol)? That >> >> should be fairly straight forward to do building on the indexing code >> >> for Bio.SeqIO and SearchIO. >> > >> > >> > Would that require reading it all into memory? Uniprot_GOA files >> > are huge, it is impractical to read them in fully. >> >> Not at all - we'd record a dictionary mapping the record ID to an offset >> in the file on disk, or record this mapping in an SQLite index file. >> > > Ok, that's good then > > >> >> Note here I am picturing combining all the (consecutive) lines >> >> for the same DB_Object_ID - currently the parser is line based, >> >> but batching by DB_Object_ID would be a straightforward change >> >> and may better suit some uses. >> > >> > Perhaps only for organism specific file, which in some cases can >> > be read fully into memory. >> >> The examples I looked at only seemed to have a dozen or so >> lines for each DB_Object_ID - but perhaps these were easy >> cases? How many lines per DB_Object_ID in the worst cases? >> >> Peter >> > > > I was actually thinking you are suggesting that the whole file should be > read in memory, nit just buffer by DB-Object_ID. My mistake. > > > -- > Iddo Friedberg > http://iddo-friedberg.net/contact.html > ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> > ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. 
> .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> > >>----.<--.>++++++.<<<<------------------------------------. > -- Iddo Friedberg http://iddo-friedberg.net/contact.html ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>----.<--.>++++++.<<<<------------------------------------. From p.j.a.cock at googlemail.com Mon May 20 13:16:45 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 20 May 2013 14:16:45 +0100 Subject: [Biopython-dev] UniProt GOA parser In-Reply-To: References: Message-ID: On Fri, May 17, 2013 at 10:35 PM, Iddo Friedberg wrote: > > > OK. I added a few changes as suggested by Peter. > > There is a parser now to group GAF files by DB_Object_ID, and a write > function to write them. Random access not implemented yet. > Hi Iddo, Over on this branch building on your work I moved things under Bio.UniProt.GOA, and got things a bit more in line with PEP8: https://github.com/peterjc/biopython/tree/uniprot-goa (Drop me an email off list if you need a hand pulling those changes into your branch) Do you want to have a go at re-using the index code in Bio.File (the back end for SeqIO and SearchIO's indexing)? Let me know if the current setup is too mysterious and I can try and document more of it and/or do this for the GOA module. Peter From redmine at redmine.open-bio.org Tue May 21 12:24:34 2013 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 21 May 2013 12:24:34 +0000 Subject: [Biopython-dev] [Biopython - Feature #3432] (New) Updated/Extended module MeltingTemp in Bio.SeqUtils Message-ID: Issue #3432 has been reported by Markus Piotrowski. ---------------------------------------- Feature #3432: Updated/Extended module MeltingTemp in Bio.SeqUtils https://redmine.open-bio.org/issues/3432 Author: Markus Piotrowski Status: New Priority: Normal Assignee: Category: Target version: URL: Dear Biopython developers, I updated/extended the MeltingTemp module of SeqUtils and would be happy if you would consider it for implementing. Please find the source code attached. Any feedback is appreciated. 'Old' module: One method, Tm_staluc, which calculates the melting temperature by the nearest neighbor method, using two different thermodynamic data sets for DNA and RNA. Fixed salt correction formula. 'Updated' module: 1. Three different Tm calculations: one 'rule of thumb' (Tm_Wallace), one using approximative formulas basing on GC content (Tm_GC) and one using nearest neighbor calculations (Tm_NN). 2. The new Tm_NN allows the usage of different thermodynamic datasets (8 tables are included for Watson-Crick base-pairing) and includes tables for mismatches (including inosine) and dangling ends. The datasets are Python dictionaries; the user can use his own datasets or change/update existing tables for his needs. 3. Seven different formulas to correct for salt concentration, including correction for Mg2+ ions (method salt_correction). 4. Method chem_correction which allows for Tm correction when using DMSO and formaldehyde. I haven't touched the old Tm_staluc method (except adding some comments [labelled 'MP'] and a deprecation warning). Actually, the method has two problems on the RNA side: The dataset for RNA is faulty and 'U' isn't considered as input. 
Of course this problems can easily be fixed, however, I would prefer (if it is decided to accept the updated module) to completely exchange the body of Tm_staluc for calls to Tm_NN (as outlined in the comments). There is one thing, that I'm uneasy with: For terminal mismatches, I used thermodynamic data from a patent application that has been withdrawn (http://patentscope.wipo.int/search/en/WO2001094611). Actually, I found the reference in the manual for Primer3 which also seems to use these data (http://primer3.sourceforge.net/primer3_manual.htm). Indeed, the Primer3 source (which is distributed under GPLv2) contains the data. Best wishes, Markus ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From p.j.a.cock at googlemail.com Wed May 22 13:45:00 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 22 May 2013 14:45:00 +0100 Subject: [Biopython-dev] UniProt GOA parser In-Reply-To: References: Message-ID: On Mon, May 20, 2013 at 7:09 PM, Iddo Friedberg wrote: >> Do you want to have a go at re-using the index code in Bio.File >> (the back end for SeqIO and SearchIO's indexing)? Let me know >> if the current setup is too mysterious and I can try and document >> more of it and/or do this for the GOA module. > > I'd like to have a go.. > > ./I Great - a few more details then, The second part of Bio/File.py has some private classes _IndexedSeqFileProxy and _IndexedSeqFileDict and _SQLiteManySeqFilesDict which can be used for any sequential record file format (meaning one after the other, not just biological sequences). These are used by the Bio.SeqIO.index() and index_db() functions, and their sisters in Bio.SearchIO. The idea is you write a subclass of _IndexedSeqFileProxy for your new file format, and then this gets used by either _IndexedSeqFileDict (in memory offset dictionary) or _SQLiteManySeqFilesDict (SQLite offset dictionary). Your _IndexedSeqFileProxy subclass has to define an __iter__ method which loops over the file giving a tuple for each record giving the identifier string and the start offset, and ideally the length in bytes. It must also define a get method which must seek to the offset and then parse the record. For the GOA files, the __iter__ loop will just spot batches of lines for the same identifier which together make up a single record. I managed to explain the setup to Bow, and he got it to work for SearchIO, but we were doing face to face video chats for that during GSoC last year. Fresh eyes will surely find some more rough edges in my docs ;) Regards, Peter From pgarland at gmail.com Mon May 27 02:27:05 2013 From: pgarland at gmail.com (Phillip Garland) Date: Sun, 26 May 2013 19:27:05 -0700 Subject: [Biopython-dev] test_SeqIO_online failure Message-ID: The fasta formatted record is fine, the problem seems to come after requesting and reading the genbank-formatted record for the protein with GI:16130152. It looks like the record was modified a few days ago: LOCUS NP_416719 367 aa linear CON 24-MAY-2013 and ends with CONTIG join(WP_000865568.1:1..367)\n//\n\n' instead of ORIGIN and the sequence data. 
Is this a problem with the genbank record that should be reported to NCBI, or is SeqIO supposed to handle the record as it is by fetching the sequence from the linked contig, or is the test doing the wrong thing by using rettype="gb" instead of rettype="gbwithparts"? Here's the test output: pgarland at cradle:~/Hacking/Source/Biology/biopython/Tests$ python run_tests.py test_SeqIO_online.py Python version: 2.7.5 (default, May 20 2013, 11:51:12) [GCC 4.7.3] Operating system: posix linux2 test_SeqIO_online ... FAIL ====================================================================== FAIL: test_protein_16130152 (test_SeqIO_online.EntrezTests) Bio.Entrez.efetch(protein, 16130152, ...) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/pgarland/Hacking/Source/Biology/biopython/Tests/test_SeqIO_online.py", line 77, in method = lambda x : x.simple(d, f, e, l, c) File "/home/pgarland/Hacking/Source/Biology/biopython/Tests/test_SeqIO_online.py", line 65, in simple self.assertEqual(seguid(record.seq), checksum) AssertionError: 'NT/aFiTXyD/7KixizZ9sq2FcniU' != 'fCjcjMFeGIrilHAn6h+yju267lg' ---------------------------------------------------------------------- Ran 1 test in 10.010 seconds FAILED (failures = 1) ~Phillip From kai.blin at biotech.uni-tuebingen.de Mon May 27 06:19:20 2013 From: kai.blin at biotech.uni-tuebingen.de (Kai Blin) Date: Mon, 27 May 2013 08:19:20 +0200 Subject: [Biopython-dev] SearchIO: Fix a bug in the HMMer2 text parser Message-ID: <51A2FAE8.1040408@biotech.uni-tuebingen.de> Hi folks, I've run into and fixed a bug in the hmmer2-text parser when parsing consensus lines. The pull request is at https://github.com/biopython/biopython/pull/182 Cheers, Kai -- Dipl.-Inform. Kai Blin kai.blin at biotech.uni-tuebingen.de Institute for Microbiology and Infection Medicine Division of Microbiology/Biotechnology Eberhard-Karls-Universit?t T?bingen Auf der Morgenstelle 28 Phone : ++49 7071 29-78841 D-72076 T?bingen Fax : ++49 7071 29-5979 Germany Homepage: http://www.mikrobio.uni-tuebingen.de/ag_wohlleben From p.j.a.cock at googlemail.com Mon May 27 09:05:44 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 27 May 2013 10:05:44 +0100 Subject: [Biopython-dev] test_SeqIO_online failure In-Reply-To: References: Message-ID: Hi Philip, On Mon, May 27, 2013 at 3:27 AM, Phillip Garland wrote: > The fasta formatted record is fine, the problem seems to come after > requesting and reading the genbank-formatted record for the protein > with GI:16130152. > > It looks like the record was modified a few days ago: > > LOCUS NP_416719 367 aa linear CON 24-MAY-2013 > > and ends with > > CONTIG join(WP_000865568.1:1..367)\n//\n\n' > > instead of > > ORIGIN and the sequence data. > > Is this a problem with the genbank record that should be reported to > NCBI, or is SeqIO supposed to handle the record as it is by fetching > the sequence from the linked contig, or is the test doing the wrong > thing by using rettype="gb" instead of rettype="gbwithparts"? Interesting - it looks like the NCBI made a change to Entrez and where previously this record had included the sequence with rettype="gb" now we have to ask for it explicitly with the longer rettype="gbwithparts" - my guess is this is now happening on more records. 
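As a rough sketch (not the actual test code), what now seems to be needed to pull down that record in full, sequence included, is along these lines:

from Bio import Entrez, SeqIO
Entrez.email = "A.N.Other at example.com"  # Always tell NCBI who you are
# "gbwithparts" asks Entrez for the full GenBank record, sequence included;
# plain "gb" can now come back with just a CONTIG line for this entry.
handle = Entrez.efetch(db="protein", id="16130152",
                       rettype="gbwithparts", retmode="text")
record = SeqIO.read(handle, "gb")  # Bio.SeqIO still takes "gb"/"genbank" here
handle.close()
print record.id, len(record)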
Note it does not affect all records, consider this example in our Tutorial which seems unchanged: from Bio import Entrez Entrez.email = "A.N.Other at example.com" # Always tell NCBI who you are handle = Entrez.efetch(db="nucleotide", id="186972394", rettype="gb", retmode="text") print handle.read() Curious. > Here's the test output: > > pgarland at cradle:~/Hacking/Source/Biology/biopython/Tests$ python > run_tests.py test_SeqIO_online.py > Python version: 2.7.5 (default, May 20 2013, 11:51:12) > [GCC 4.7.3] > Operating system: posix linux2 > test_SeqIO_online ... FAIL > ====================================================================== > FAIL: test_protein_16130152 (test_SeqIO_online.EntrezTests) > Bio.Entrez.efetch(protein, 16130152, ...) > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "/home/pgarland/Hacking/Source/Biology/biopython/Tests/test_SeqIO_online.py", > line 77, in > method = lambda x : x.simple(d, f, e, l, c) > File "/home/pgarland/Hacking/Source/Biology/biopython/Tests/test_SeqIO_online.py", > line 65, in simple > self.assertEqual(seguid(record.seq), checksum) > AssertionError: 'NT/aFiTXyD/7KixizZ9sq2FcniU' != 'fCjcjMFeGIrilHAn6h+yju267lg' > > ---------------------------------------------------------------------- > Ran 1 test in 10.010 seconds > > FAILED (failures = 1) I'd noticed this on Friday but hadn't looked into why the sequence was different (and sometimes Entrez errors are transient). Thanks for exploring this :) Would you like to submit a pull request to update test_SeqIO_online.py or should I just go ahead and change the rettype? It would be sensible to review all the Entrez examples in the Tutorial, to perhaps make more use of 'gbwithparts' rather than 'gb'? Thanks, Peter From pgarland at gmail.com Mon May 27 21:38:30 2013 From: pgarland at gmail.com (Phillip Garland) Date: Mon, 27 May 2013 14:38:30 -0700 Subject: [Biopython-dev] test_SeqIO_online failure In-Reply-To: References: Message-ID: Hi Peter, On Mon, May 27, 2013 at 2:05 AM, Peter Cock wrote: > Hi Philip, > > On Mon, May 27, 2013 at 3:27 AM, Phillip Garland wrote: >> The fasta formatted record is fine, the problem seems to come after >> requesting and reading the genbank-formatted record for the protein >> with GI:16130152. >> >> It looks like the record was modified a few days ago: >> >> LOCUS NP_416719 367 aa linear CON 24-MAY-2013 >> >> and ends with >> >> CONTIG join(WP_000865568.1:1..367)\n//\n\n' >> >> instead of >> >> ORIGIN and the sequence data. >> >> Is this a problem with the genbank record that should be reported to >> NCBI, or is SeqIO supposed to handle the record as it is by fetching >> the sequence from the linked contig, or is the test doing the wrong >> thing by using rettype="gb" instead of rettype="gbwithparts"? > > Interesting - it looks like the NCBI made a change to Entrez and > where previously this record had included the sequence with > rettype="gb" now we have to ask for it explicitly with the longer > rettype="gbwithparts" - my guess is this is now happening on > more records. > > Note it does not affect all records, consider this example in our > Tutorial which seems unchanged: > > from Bio import Entrez > Entrez.email = "A.N.Other at example.com" # Always tell NCBI who you are > handle = Entrez.efetch(db="nucleotide", id="186972394", > rettype="gb", retmode="text") > print handle.read() > > Curious. 
> >> Here's the test output: >> >> pgarland at cradle:~/Hacking/Source/Biology/biopython/Tests$ python >> run_tests.py test_SeqIO_online.py >> Python version: 2.7.5 (default, May 20 2013, 11:51:12) >> [GCC 4.7.3] >> Operating system: posix linux2 >> test_SeqIO_online ... FAIL >> ====================================================================== >> FAIL: test_protein_16130152 (test_SeqIO_online.EntrezTests) >> Bio.Entrez.efetch(protein, 16130152, ...) >> ---------------------------------------------------------------------- >> Traceback (most recent call last): >> File "/home/pgarland/Hacking/Source/Biology/biopython/Tests/test_SeqIO_online.py", >> line 77, in >> method = lambda x : x.simple(d, f, e, l, c) >> File "/home/pgarland/Hacking/Source/Biology/biopython/Tests/test_SeqIO_online.py", >> line 65, in simple >> self.assertEqual(seguid(record.seq), checksum) >> AssertionError: 'NT/aFiTXyD/7KixizZ9sq2FcniU' != 'fCjcjMFeGIrilHAn6h+yju267lg' >> >> ---------------------------------------------------------------------- >> Ran 1 test in 10.010 seconds >> >> FAILED (failures = 1) > > I'd noticed this on Friday but hadn't looked into why the sequence was > different (and sometimes Entrez errors are transient). Thanks for > exploring this :) > > Would you like to submit a pull request to update test_SeqIO_online.py > or should I just go ahead and change the rettype? > > It would be sensible to review all the Entrez examples in the Tutorial, > to perhaps make more use of 'gbwithparts' rather than 'gb'? > > Thanks, > > Peter The slight problem with just replacing "gb" with "gbwithparts" is that SeqIO doesn't take "gbwithparts" as an option for the file format. So in test_SeqIO_online.py, you have this code: handle = Entrez.efetch(db=database, id=entry, rettype=f, retmode="text") record = SeqIO.read(handle, f) which is a natural way to write the test (because it tests fasta and genbank files), but will currently fail if f is "gbwithparts", b/c SeqIO doesn't accept "gbwithparts" as a file format specifier. My guess is that most existing code hardcodes the rettype and SeqIO file format specifier, so we could just test for gbwithparts prior to calling SeqIO.read: handle = Entrez.efetch(db=database, id=entry, rettype=f, retmode="text") if f == "gbwithparts": f = "gb" record = SeqIO.read(handle, f) I submitted a pull request with a minimal patch that does this. For code like this, it would be cleaner if SeqIO accepted, "gbwithparts" as an alias for "genbank", just like "gb" is, but I don't know if it's a common pattern enough to bother. If records like this are becoming more common, then "gbwithparts" should be clearly documented in the biopython tutorial, though "gbwithparts" isn't clearly explained in NCBI's Entrez docs AFAICT. It seems safer to always use "gbwithparts" at this point, at least when you want the sequence. ~Phillip From p.j.a.cock at googlemail.com Mon May 27 22:43:19 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 27 May 2013 23:43:19 +0100 Subject: [Biopython-dev] test_SeqIO_online failure In-Reply-To: References: Message-ID: On Mon, May 27, 2013 at 10:38 PM, Phillip Garland wrote: > Hi Peter, > >> I'd noticed this on Friday but hadn't looked into why the sequence was >> different (and sometimes Entrez errors are transient). Thanks for >> exploring this :) >> >> Would you like to submit a pull request to update test_SeqIO_online.py >> or should I just go ahead and change the rettype? 
>> >> It would be sensible to review all the Entrez examples in the Tutorial, >> to perhaps make more use of 'gbwithparts' rather than 'gb'? >> >> Thanks, >> >> Peter > > The slight problem with just replacing "gb" with "gbwithparts" is that > SeqIO doesn't take "gbwithparts" as an option for the file format. So > in test_SeqIO_online.py, you have this code: > > handle = Entrez.efetch(db=database, id=entry, rettype=f, > retmode="text") > record = SeqIO.read(handle, f) > > which is a natural way to write the test (because it tests fasta and > genbank files), but will currently fail if f is "gbwithparts", b/c > SeqIO doesn't accept "gbwithparts" as a file format specifier. My > guess is that most existing code hardcodes the rettype and SeqIO file > format specifier, so we could just test for gbwithparts prior to > calling SeqIO.read: > > handle = Entrez.efetch(db=database, id=entry, rettype=f, retmode="text") > if f == "gbwithparts": > f = "gb" > record = SeqIO.read(handle, f) > > I submitted a pull request with a minimal patch that does this. That's good for now :) > For code like this, it would be cleaner if SeqIO accepted, > "gbwithparts" as an alias for "genbank", just like "gb" is, but I > don't know if it's a common pattern enough to bother. That makes some sense for parsing files, but all those aliases would cause confusion with writing GenBank files. > If records like this are becoming more common, then "gbwithparts" > should be clearly documented in the biopython tutorial, though > "gbwithparts" isn't clearly explained in NCBI's Entrez docs AFAICT. It > seems safer to always use "gbwithparts" at this point, at least when > you want the sequence. Definitely - if the NCBI moves to using 'gb' as the light style without the sequence then many people will just want to use 'gbwithparts' as their default when scripting this sort of thing. Thanks, Peter From redmine at redmine.open-bio.org Tue May 28 07:50:41 2013 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 28 May 2013 07:50:41 +0000 Subject: [Biopython-dev] [Biopython - Bug #3433] (New) MMCIFParser fails on python3 for disordered atoms Message-ID: Issue #3433 has been reported by Alexander Campbell. ---------------------------------------- Bug #3433: MMCIFParser fails on python3 for disordered atoms https://redmine.open-bio.org/issues/3433 Author: Alexander Campbell Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: The new shlex based parser works under python3, but reveals that the changed comparison rules in python3 lead to unhandled exceptions when parsing disordered atoms. Furthermore, it reveals that occupancy and temperature factor attributes of Atom objects were never cast from str to float types when parsed from mmCIF files. The comparison code which raises the exception under python3 is at Atom.py line 333: @if occupancy>self.last_occupancy:@ . The exception can be prevented my modifying MMCIFParser.py to cast occupancy and temperature factor to float. The following patch is a basic copy of the equivalent code in PDBParser.py:
diff --git a/Bio/PDB/MMCIFParser.py b/Bio/PDB/MMCIFParser.py
index 64d16bc..4be6490 100644
--- a/Bio/PDB/MMCIFParser.py
+++ b/Bio/PDB/MMCIFParser.py
@@ -84,8 +84,15 @@ class MMCIFParser(object):
                 altloc=" "
             resseq=seq_id_list[i]
             name=atom_id_list[i]
-            tempfactor=b_factor_list[i]
-            occupancy=occupancy_list[i]
+            # occupancy & B factor
+            try:
+                tempfactor=float(b_factor_list[i])
+            except ValueError:
+                raise PDBConstructionException("Invalid or missing B factor")
+            try:
+                occupancy=float(occupancy_list[i])
+            except ValueError:
+                raise PDBConstructionException("Invalid or missing occupancy")
             fieldname=fieldname_list[i]
             if fieldname=="HETATM":
                 hetatm_flag="H"

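Just to spell out the Python 3 behaviour this cast guards against (a minimal illustration, not part of the patch):

occupancy = "0.50"       # mmCIF fields arrive as strings
occupancy > 40.0         # Python 2: allowed (arbitrary but consistent result); Python 3: TypeError
float(occupancy) > 40.0  # fine on both once cast to float, as the patch does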
This patch was tested with the "mmCIF file for PDB structure 3u8h":http://www.rcsb.org/pdb/download/downloadFile.do?fileFormat=cif&compression=NO&structureId=3U8H , which would cause the mmCIF parsing exception under python3.2. After the patch, there were no exceptions during parsing and the occupancy and bfactor attributes had the correct type (float). The patch was also tested under python2.7, which worked just fine and also showed the correct types. I haven't tested earlier versions of python2, but the simple syntax ought to work. Could a dev apply this patch? Or better yet, suggest a patch for casting the types at the StructureBuilder level, which would make such things independent of the specific parser used. This is just a minimal-quickfix patch, but I'm sure a better solution is possible. ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org
From tiagoantao at gmail.com Tue May 28 11:14:53 2013 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Tue, 28 May 2013 12:14:53 +0100 Subject: [Biopython-dev] Compiling on modern windows (recent mingw) Message-ID: Hi, I have been trying to setup a windows 8 buildbot. For that purpose I have installed a recent version of mingw on a new win8 machine. It seems that one of the compiling options of biopython (-mno-cygwin) is deprecated. See here for more details: http://korbinin.blogspot.co.uk/2013/03/cython-mno-cygwin-problems.html -- ?Grant me chastity and continence, but not yet? - St Augustine From p.j.a.cock at googlemail.com Tue May 28 11:21:13 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 28 May 2013 12:21:13 +0100 Subject: [Biopython-dev] Compiling on modern windows (recent mingw) In-Reply-To: References: Message-ID: On Tue, May 28, 2013 at 12:14 PM, Tiago Ant?o wrote: > Hi, > > I have been trying to setup a windows 8 buildbot. For that purpose I have > installed a recent version of mingw on a new win8 machine. > > It seems that one of the compiling options of biopython (-mno-cygwin) is > deprecated. See here for more details: > http://korbinin.blogspot.co.uk/2013/03/cython-mno-cygwin-problems.html Looks like there's a confusing open bug about just removing this argument from Python's distutils - http://bugs.python.org/issue12641 For now does the hack of editing Lib\distutils\cygwinccompiler.py yourself get it to work? I could live with that on the build slave, coupled with a warning in our install documentation for the brave people self-compiling under Windows. Peter From tiagoantao at gmail.com Tue May 28 12:04:32 2013 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Tue, 28 May 2013 13:04:32 +0100 Subject: [Biopython-dev] Compiling on modern windows (recent mingw) In-Reply-To: References: Message-ID: Hi, On Tue, May 28, 2013 at 12:21 PM, Peter Cock wrote: > For now does the hack of editing Lib\distutils\cygwinccompiler.py yourself > get it to work? I could live with that on the build slave, coupled with a > warning in our install documentation for the brave people self-compiling > under Windows. > I have hacked my distutils implementation. It compiled OK. That being said, there seems to be some problems with Bio.Applications on win8: http://testing.open-bio.org/biopython/builders/Windows%208%20-%20Python%202.7/builds/12/steps/shell/logs/stdio -- ?Grant me chastity and continence, but not yet?
- St Augustine From p.j.a.cock at googlemail.com Tue May 28 14:09:40 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 28 May 2013 15:09:40 +0100 Subject: [Biopython-dev] Compiling on modern windows (recent mingw) In-Reply-To: References: Message-ID: On Tue, May 28, 2013 at 1:04 PM, Tiago Ant?o wrote: > Hi, > > > On Tue, May 28, 2013 at 12:21 PM, Peter Cock > wrote: >> >> For now does the hack of editing Lib\distutils\cygwinccompiler.py yourself >> get it to work? I could live with that on the build slave, coupled with a >> warning in our install documentation for the brave people self-compiling >> under Windows. > > I have hacked my distutils implementation. It compiled OK. That's encouraging. > That being said, there seems to be some problems with Bio.Applications on > win8: > http://testing.open-bio.org/biopython/builders/Windows%208%20-%20Python%202.7/builds/12/steps/shell/logs/stdio Could you confirm output sys.platform is "win32" still? I've got a hunch that spaces in the executable path might explain some of these failures - I'm trying a patch for that here. Some of the other failures appear to be down to newline differences (the \r in some of the output suggests this). Here we can probably use universal new lines mode for file input, but I am puzzled why these pass under Windows XP with an older mingw32 or the Intel compiler. Peter From tiagoantao at gmail.com Tue May 28 14:40:02 2013 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Tue, 28 May 2013 15:40:02 +0100 Subject: [Biopython-dev] Compiling on modern windows (recent mingw) In-Reply-To: References: Message-ID: On Tue, May 28, 2013 at 3:09 PM, Peter Cock wrote: > Could you confirm output sys.platform is "win32" still? > Yup T -- ?Grant me chastity and continence, but not yet? - St Augustine From p.j.a.cock at googlemail.com Tue May 28 16:36:20 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 28 May 2013 17:36:20 +0100 Subject: [Biopython-dev] Compiling on modern windows (recent mingw) In-Reply-To: References: Message-ID: On Tue, May 28, 2013 at 3:09 PM, Peter Cock wrote: > > I've got a hunch that spaces in the executable path might explain > some of these failures - I'm trying a patch for that here. Hi Tiago, Patch applied to master - this is essential for the rare case of calling a binary under Unix where the path/filename includes a space, but appears to be redundant under Windows XP: https://github.com/biopython/biopython/commit/815de571b623f1cd3659fe4c80e3917e1a437580 I'm curious if that matters under Windows 8 or not - trying the example in the commit comment at the command line might be illuminating. Peter P.S. Saket - You might remember I touched on this issue in our discussion on GitHub about your bwa/samtools wrappers, which led to this commit keeping self.program_name as the binary only: https://github.com/biopython/biopython/commit/ca93be741c8fd9bad67106acb455348251797f3a From tiagoantao at gmail.com Tue May 28 16:50:39 2013 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Tue, 28 May 2013 17:50:39 +0100 Subject: [Biopython-dev] Compiling on modern windows (recent mingw) In-Reply-To: References: Message-ID: On Tue, May 28, 2013 at 5:36 PM, Peter Cock wrote: > I'm curious if that matters under Windows 8 or not - trying > the example in the commit comment at the command line > might be illuminating. > I just re-scheduled a testing case and the results were not great... 
http://testing.open-bio.org/biopython/builders/Windows%208%20-%20Python%202.7/builds/13/steps/shell/logs/stdio I will test this manually and in deep when I arrive home today. -- ?Grant me chastity and continence, but not yet? - St Augustine From p.j.a.cock at googlemail.com Tue May 28 17:15:23 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 28 May 2013 18:15:23 +0100 Subject: [Biopython-dev] Compiling on modern windows (recent mingw) In-Reply-To: References: Message-ID: On Tue, May 28, 2013 at 5:50 PM, Tiago Ant?o wrote: > > On Tue, May 28, 2013 at 5:36 PM, Peter Cock > wrote: >> >> I'm curious if that matters under Windows 8 or not - trying >> the example in the commit comment at the command line >> might be illuminating. > > > I just re-scheduled a testing case and the results were not great... > http://testing.open-bio.org/biopython/builders/Windows%208%20-%20Python%202.7/builds/13/steps/shell/logs/stdio > > I will test this manually and in deep when I arrive home today. I think there are at just two classes of failure, calling applications: test_Application ... FAIL And indexing with Windows newlines (I wonder if the git setup on my Windows XP machine has a different default to yours, meaning I have Unix newlines and you have Windows newlines?): test_SearchIO_blast_tab_index ... FAIL test_SearchIO_blast_xml_index ... FAIL test_SearchIO_exonerate_text_index ... FAIL test_SearchIO_exonerate_vulgar_index ... FAIL test_SearchIO_fasta_m10_index ... FAIL test_SearchIO_hmmer2_text_index ... FAIL test_SearchIO_hmmer3_domtab_index ... FAIL test_SearchIO_hmmer3_tab_index ... FAIL test_SearchIO_hmmer3_text_index ... FAIL Bio.SeqIO docstring test ... FAIL Plus of course the minor issues which I just introduced with the escaping change (commits to follow). Peter From saketkc at gmail.com Tue May 28 17:20:36 2013 From: saketkc at gmail.com (Saket Choudhary) Date: Tue, 28 May 2013 22:50:36 +0530 Subject: [Biopython-dev] Compiling on modern windows (recent mingw) In-Reply-To: References: Message-ID: The constraint for me really is I do not have access to Windows/MAC machines here. Hunting for a Windows machine is possible, besides these I need to validate the _ArgumentList method for windows too On 28 May 2013 22:06, Peter Cock wrote: > On Tue, May 28, 2013 at 3:09 PM, Peter Cock > wrote: > > > > I've got a hunch that spaces in the executable path might explain > > some of these failures - I'm trying a patch for that here. > > Hi Tiago, > > Patch applied to master - this is essential for the rare case of > calling a binary under Unix where the path/filename includes > a space, but appears to be redundant under Windows XP: > > https://github.com/biopython/biopython/commit/815de571b623f1cd3659fe4c80e3917e1a437580 > > I'm curious if that matters under Windows 8 or not - trying > the example in the commit comment at the command line > might be illuminating. > > Peter > > P.S. 
Saket - You might remember I touched on this issue in our > discussion on GitHub about your bwa/samtools wrappers, which > led to this commit keeping self.program_name as the binary only: > > https://github.com/biopython/biopython/commit/ca93be741c8fd9bad67106acb455348251797f3a > From p.j.a.cock at googlemail.com Tue May 28 17:30:47 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 28 May 2013 18:30:47 +0100 Subject: [Biopython-dev] Compiling on modern windows (recent mingw) In-Reply-To: References: Message-ID: On Tue, May 28, 2013 at 6:20 PM, Saket Choudhary wrote: > The constraint for me really is I do not have access to Windows/MAC machines > here. > > Hunting for a Windows machine is possible, besides these I need to validate > the _ArgumentList method for windows too I sympathise - sorting out a (virtual) 64bit Windows machine has been on my TODO list for a while, since right now I don't have access to one. When I started doing Biopython my primary machine was Windows XP. That old laptop has retired and I now mainly use Mac OS X and Linux at work, but I made a point of getting a Windows XP machine setup for development (e.g. the Windows installers are build with this) and for use as one of our nightly build slaves: http://testing.open-bio.org/biopython/buildslaves Regards, Peter From redmine at redmine.open-bio.org Thu May 30 06:32:21 2013 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Thu, 30 May 2013 06:32:21 +0000 Subject: [Biopython-dev] [Biopython - Bug #3433] (Resolved) MMCIFParser fails on python3 for disordered atoms References: Message-ID: Issue #3433 has been updated by Michiel de Hoon. Status changed from New to Resolved % Done changed from 0 to 100 Patch applied, thanks. ---------------------------------------- Bug #3433: MMCIFParser fails on python3 for disordered atoms https://redmine.open-bio.org/issues/3433 Author: Alexander Campbell Status: Resolved Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: The new shlex based parser works under python3, but reveals that the changed comparison rules in python3 lead to unhandled exceptions when parsing disordered atoms. Furthermore, it reveals that occupancy and temperature factor attributes of Atom objects were never cast from str to float types when parsed from mmCIF files. The comparison code which raises the exception under python3 is at Atom.py line 333: @if occupancy>self.last_occupancy:@ . The exception can be prevented my modifying MMCIFParser.py to cast occupancy and temperature factor to float. The following patch is a basic copy of the equivalent code in PDBParser.py:
diff --git a/Bio/PDB/MMCIFParser.py b/Bio/PDB/MMCIFParser.py
index 64d16bc..4be6490 100644
--- a/Bio/PDB/MMCIFParser.py
+++ b/Bio/PDB/MMCIFParser.py
@@ -84,8 +84,15 @@ class MMCIFParser(object):
                 altloc=" "
             resseq=seq_id_list[i]
             name=atom_id_list[i]
-            tempfactor=b_factor_list[i]
-            occupancy=occupancy_list[i]
+            # occupancy & B factor
+            try:
+                tempfactor=float(b_factor_list[i])
+            except ValueError:
+                raise PDBConstructionException("Invalid or missing B factor")
+            try:
+                occupancy=float(occupancy_list[i])
+            except ValueError:
+                raise PDBConstructionException("Invalid or missing occupancy")
             fieldname=fieldname_list[i]
             if fieldname=="HETATM":
                 hetatm_flag="H"

This patch was tested with the "mmCIF file for PDB structure 3u8h":http://www.rcsb.org/pdb/download/downloadFile.do?fileFormat=cif&compression=NO&structureId=3U8H , which would cause the mmCIF parsing exception under python3.2. After the patch, there were no exceptions during parsing and the occupancy and bfactor attributes had the correct type (float). The patch was also tested under python2.7, which worked just fine and also showed the correct types. I haven't tested earlier versions of python2, but the simple syntax ought to work. Could a dev apply this patch? Or better yet, suggest a patch for casting the types at the StructureBuilder level, which would make such things independent of the specific parser used. This is just a minimal-quickfix patch, but I'm sure a better solution is possible. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From p.j.a.cock at googlemail.com Thu May 30 08:21:31 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 30 May 2013 09:21:31 +0100 Subject: [Biopython-dev] Python 3 and subprocess .communicate() bug Message-ID: Hi Tiago, We'd been talking briefly off-list about the recent buildbot failures under Python 3 where the recent change to using subprocess in the PopGen module was causing failures. Sadly while it seems to work on Python 3.1 and 3.2 my suggestion to try using bytes with the communicate call fails on Python 3.3 and under Windows: https://github.com/biopython/biopython/commit/912692ee2b57e8c075ba38bdf814c9dbe4f5cdb9 e.g. After the change to use bytes, http://testing.open-bio.org/biopython/builders/Linux%2064%20-%20Python%203.3/builds/202 http://testing.open-bio.org/biopython/builders/Windows%20XP%20-%20Python%203.1/builds/816 http://testing.open-bio.org/biopython/builders/Windows%20XP%20-%20Python%203.2/builds/680 http://testing.open-bio.org/biopython/builders/Windows%20XP%20-%20Python%203.3/builds/206 This appears to be a known bug in the subprocess module, http://bugs.python.org/issue16903 which should be fixed in Python 3.2.4 and Python 3.3. It appears not to have been fixed on Python 3.1. I see two options, Option One, revert that commit (i.e. send unicode strings as before, not bytes). This will work on Python 3.2.4+ onwards including Windows. It will fail on Python 3.1 and out of date Python 3.2 through 3.2.3 releases. Option Two, don't use universal_newlines=True which then requires us to use byte strings for all the stdin, stdout and stderr processing. More work, but it should in principle work on old and new Python 3 releases. Note that while we're not seeing any problems yet, I suspect this issue would affect our Bio.Application wrappers __call__ function as well when used to send data to stdin. Here again we could switch to using bytes and universal_newlines=False and do any bytes/unicode handling within the __call_ function, on just insist on a fixed version of Python. If we decide to recommend at least Python 3.2.4 (when using Python 3), then we could add a warning to the relevant modules to catch this issue? What do people think? Regards, Peter From tiagoantao at gmail.com Thu May 30 08:28:04 2013 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Thu, 30 May 2013 09:28:04 +0100 Subject: [Biopython-dev] Python 3 and subprocess .communicate() bug In-Reply-To: References: Message-ID: I was having a look at the issue precisely now. 
I do not have a cast opinion on the issue, I think it all boils down on how many people are dependent on 3.2.3 and prior 3s. In theory I would prefer not to have workarounds for implementation bugs (as makes things more complex to manage in the long-run), but if many people are using buggy 3.x, I see no option... I simply do not have any view on how many people would be using these... From p.j.a.cock at googlemail.com Thu May 30 08:34:15 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 30 May 2013 09:34:15 +0100 Subject: [Biopython-dev] Python 3 and subprocess .communicate() bug In-Reply-To: References: Message-ID: On Thu, May 30, 2013 at 9:28 AM, Tiago Ant?o wrote: > I was having a look at the issue precisely now. > > I do not have a cast opinion on the issue, I think it all boils down on how > many people are dependent on 3.2.3 and prior 3s. > > In theory I would prefer not to have workarounds for implementation bugs (as > makes things more complex to manage in the long-run), but if many people are > using buggy 3.x, I see no option... > > I simply do not have any view on how many people would be using these... > Since till now we've not officially supported Python 3, but plan to start doing so for the forthcoming Biopython 1.62 release, so we could just set a minimum version of 3.2.4 (with Python 3.3 being our current recommendation). However, that may be a problem for some current Linux distributions still shipping older versions? Peter From tiagoantao at gmail.com Thu May 30 08:41:27 2013 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Thu, 30 May 2013 09:41:27 +0100 Subject: [Biopython-dev] Python 3 and subprocess .communicate() bug In-Reply-To: References: Message-ID: On Thu, May 30, 2013 at 9:34 AM, Peter Cock wrote: > However, that may be a problem for some current Linux > distributions still shipping older versions? > > > I suppose people could revert to Python 2 in that case? [Do not get me wrong, I really have no strong feelings either way] -- ?Grant me chastity and continence, but not yet? - St Augustine From p.j.a.cock at googlemail.com Thu May 30 11:37:51 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 30 May 2013 12:37:51 +0100 Subject: [Biopython-dev] Python 3 and subprocess .communicate() bug In-Reply-To: References: Message-ID: On Thu, May 30, 2013 at 9:41 AM, Tiago Ant?o wrote: > > On Thu, May 30, 2013 at 9:34 AM, Peter Cock > wrote: >> >> However, that may be a problem for some current Linux >> distributions still shipping older versions? > > I suppose people could revert to Python 2 in that case? [Do not get me > wrong, I really have no strong feelings either way] > I guess we should do a brief survey on the main list of Python 3 versions people have installed, if any. In the meantime, I reverted that commit so the tests should now pass under Python 3.2.4+ and Python 3.3. https://github.com/biopython/biopython/commit/285988b1b5227b591bd2fed379e36db3a157eca2 Peter From tiagoantao at gmail.com Thu May 30 11:40:27 2013 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Thu, 30 May 2013 12:40:27 +0100 Subject: [Biopython-dev] Python 3 and subprocess .communicate() bug In-Reply-To: References: Message-ID: > I guess we should do a brief survey on the main list of Python 3 versions > people have installed, if any. 
> > > +1 From p.j.a.cock at googlemail.com Thu May 30 11:47:33 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 30 May 2013 12:47:33 +0100 Subject: [Biopython-dev] Python 3 and subprocess .communicate() bug In-Reply-To: References: Message-ID: On Thu, May 30, 2013 at 12:40 PM, Tiago Ant?o wrote: > >> I guess we should do a brief survey on the main list of Python 3 versions >> people have installed, if any. >> >> > > +1 Agreed, http://lists.open-bio.org/pipermail/biopython/2013-May/008598.html Peter From p.j.a.cock at googlemail.com Thu May 30 13:33:22 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 30 May 2013 14:33:22 +0100 Subject: [Biopython-dev] Python 2 and 3 migration thoughts Message-ID: Splitting off from this thread: http://lists.open-bio.org/pipermail/biopython/2013-May/008601.html On Thu, May 30, 2013 at 2:13 PM, Peter Cock wrote: > Thank you for all the comments so far, don't stop yet :) > > On Thu, May 30, 2013 at 1:51 PM, Wibowo Arindrarto > wrote: >> Hi everyone, >> >> I'm leaning towards insisting on Python >=3.3 support (I'm running >> 3.3.2). I suppose that even if Python3.3 is not available on a machine >> or through the default package manager, it's always installable on its >> own. If that's not the case, I imagine Python2.x is most likely >> present in these machines (so Biopython can still be used). > > True. > > So far everyone who has replied (including some off list) have said > they are using Python 3.3 which is encouraging. Thank you for > the comments so far. > > It looks like we can forget about Python 3.1, and just need to > decide if it is worth including Python 3.2.5 in the short term. > >> On a related note, do we have a defined timeline on when we >> would drop support for Python2.x? Are there any plans to have >> our codebase written in Python3.x instead of Python2.x? > > Nothing concrete planned, no. I'll reply in more detail on the > biopython-dev list as I do have some thoughts about this. Good question Bow, I think people will still be using Python 2 a year or two from now, so we must support both for some time. Biopython 1.62 (next week perhaps?) - Final release with Python 2.5 support - Official support for Python 2.5, 2.6, 2.7 and 3.3 - Possibly official support for Python 3.2.5+ as well? (Exactly which versions of Python 3 we'll include to be decided, see the other thread for that discussion.) Short term we will continue with developing using Python 2 syntax and running 2to3 for Python 3. As far as I know, the reverse process with 3to2 is not well established. If anyone wants to investigate that would be useful as another option. However, dropping Python 2.5 support makes things more flexible... Medium term I believe it would be possible to have a single code base which is both valid Python 2 and 3 at the same time. This may require us to target 2.7 and 3.3+ only - we'll have to try it and see if Python 2.6 will hold us back. I've actually done this with lzma.backports, a small but non-trivial module with Python and C code: https://pypi.python.org/pypi/backports.lzma/ https://github.com/peterjc/backports.lzma Python 3.3 reintroduces some features designed to make this more straightforward, like unicode literals (missing in the early versions of Python 3). This is why I'd like to drop Python 3.2 as soon as possible. What I was thinking is we can start migrating modules on a case by case basis from "Python 2 syntax" to "Dual syntax" one by one, with a white-list in the do2to3.py script. 
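To give a flavour of the "dual syntax" style (a throwaway sketch, not code from our tree), this sort of thing already runs unchanged on Python 2.6, 2.7 and 3.3:

from __future__ import print_function
try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3
handle = StringIO(">demo\nACGT\n")
for line in handle:
    print(line.strip())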
That way over time less and less modules need to be converted via 2to3, and "python3 setup.py install" will get faster, until eventually we can stop using 2to3 at all. This conversion could consider the code and doctests separately. However, using using print(example) we can hopefully get most of the doctests and Tutorial examples to work under both Python 2 and 3 at the same time. That's my current thinking anyway - and I think the fact that it would be a gradual migration from writing Python 2 specific code to writing dual 2/3 code makes it low risk (as long as we're continuing to run regular testing). Regards, Peter From p.j.a.cock at googlemail.com Thu May 30 14:23:01 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 30 May 2013 15:23:01 +0100 Subject: [Biopython-dev] HMMER3.1 beta test 1 released Message-ID: Hi Bow, Just FYI, see http://selab.janelia.org/people/eddys/blog/?p=759 "The programs phmmer, hmmsearch, and hmmscan offer a new tabular output format for easier automated parsing, --pfamtblout. his format is the one used internally by Pfam, but we make it more broadly available in case it is of use elsewhere. An analagous output format is available for nhmmer and nhmmscan, --dfamtblout." Something to consider for SearchIO later on... Regards, Peter From w.arindrarto at gmail.com Thu May 30 14:50:24 2013 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Thu, 30 May 2013 16:50:24 +0200 Subject: [Biopython-dev] HMMER3.1 beta test 1 released In-Reply-To: References: Message-ID: Hi Peter, Thanks for the heads-up. This just showed up in my feed as well. I've been waiting for the official release (since they first mentioned it some monts ago). I'll follow up on this slowly :).. Best regards, Bow On Thu, May 30, 2013 at 4:23 PM, Peter Cock wrote: > Hi Bow, > > Just FYI, see http://selab.janelia.org/people/eddys/blog/?p=759 > > "The programs phmmer, hmmsearch, and hmmscan offer a new > tabular output format for easier automated parsing, --pfamtblout. > his format is the one used internally by Pfam, but we make it more > broadly available in case it is of use elsewhere. An analagous > output format is available for nhmmer and nhmmscan, --dfamtblout." > > Something to consider for SearchIO later on... > > Regards, > > Peter From rz1991 at foxmail.com Thu May 30 15:37:00 2013 From: rz1991 at foxmail.com (=?gb18030?B?yO7vow==?=) Date: Thu, 30 May 2013 23:37:00 +0800 Subject: [Biopython-dev] GSoC 2013 Student Self-introduction Message-ID: Hi Everyone, This is Zheng Ruan, a first year graduate students at the University of Georgia. I'm happy to be chosen to participate in GSoC this year. My project is "Codon Alignment and Analysis in Biopython" and I will be working with Eric Talevich and Peter Cock during the summer. My undergraduate major is biotechnology and now seeking for a PhD in bioinformatics. I hope to improve my python programming skills during the project and make long term contribution to biopython. I will follow the timeline of my proposal in the Community Bounding Period these days (http://www.google-melange.com/gsoc/proposal/review/google/gsoc2013/rzzmh12345/1). Thanks! 
Best, Ruan From p.j.a.cock at googlemail.com Thu May 30 16:18:41 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 30 May 2013 17:18:41 +0100 Subject: [Biopython-dev] Biopython projects with NESCent for GSoC 2013 In-Reply-To: References: Message-ID: Dear all, After the disappointing news that the Open Bioinformatics Foundation (OBF) was not accepted as a Google Summer of Code (GSoC) organisation this year, Biopython was fortunate to once again offer some projects with the NESCent team: http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2013 As always the student proposals have been very competitive, and we've not been able to take on everyone. This year NESCent was fortunately to be able to accept seven students through GSoC and one through the GNOME Outreach Program for Women. Two of these GSoC projects are Biopython related: Codon Alignment and Analysis in Biopython Student: Zheng Ruan Mentors: Eric Talevich, Peter Cock http://www.google-melange.com/gsoc/project/google/gsoc2013/rzzmh12345/32001 Phylogenetics in Biopython: Filling in the gaps Student: Yanbo Ye http://www.google-melange.com/gsoc/project/google/gsoc2013/yeyanbo/45001 Mentors: Mark Holder, Jeet Sukumaran, Eric Talevich Thank you NESCent, and congratulations to Zheng Ruan and Yanbo Ye! I'm hoping you're already setting up a blog, which I hope you'll be able to use for roughly weekly progress reports during the summer - CC'd to the biopython-dev mailing list and the NESCent Phyloinformatics Summer of Code forum on Google+, http://lists.open-bio.org/mailman/listinfo/biopython-dev https://plus.google.com/communities/105828320619238393015 An introduction to your project would be a great idea for your first post - here's Bow's from last year as an example: http://bow.web.id/blog/2012/04/google-summer-of-code-is-on/ http://bow.web.id/blog/2012/08/summers-over/ http://bow.web.id/blog/tag/gsoc/ The idea here is to keep the wider community informed about how your project is going. On behalf of the Biopython developers, congratulations! We're looking forward to another productive Summer of Code :) Peter From p.j.a.cock at googlemail.com Fri May 31 09:04:28 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 31 May 2013 10:04:28 +0100 Subject: [Biopython-dev] Python 3 and subprocess .communicate() bug In-Reply-To: References: Message-ID: On Thu, May 30, 2013 at 9:34 AM, Peter Cock wrote: > On Thu, May 30, 2013 at 9:28 AM, Tiago Ant?o wrote: >> I was having a look at the issue precisely now. >> >> I do not have a cast opinion on the issue, I think it all boils down on how >> many people are dependent on 3.2.3 and prior 3s. >> >> In theory I would prefer not to have workarounds for implementation bugs (as >> makes things more complex to manage in the long-run), but if many people are >> using buggy 3.x, I see no option... >> >> I simply do not have any view on how many people would be using these... >> > > Since till now we've not officially supported Python 3, but > plan to start doing so for the forthcoming Biopython 1.62 > release, so we could just set a minimum version of 3.2.4 > (with Python 3.3 being our current recommendation). >From the discussion on the main list, requiring a recent version of Python 3 where this bug is fixed should be fine. 
For now I've added code to skip this test on the older Python 3 releases where the bug exists: https://github.com/biopython/biopython/commit/9c16c09806ca4af84f714662e54c9bd3057b0a52 Once we've settled on the versions to support with the next release we should review what versions we run on the buildbot. Regards, Peter