From zhigangwu.bgi at gmail.com Wed May 1 10:17:14 2013 From: zhigangwu.bgi at gmail.com (Zhigang Wu) Date: Wed, 1 May 2013 07:17:14 -0700 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: Hi Peter and all, Thanks for the long explanation. I got much better understand of this project though I am still confusing on how to implement the lazy-loading parser for feature rich files (EMBL, GenBank, GFF3). Since the deadline is pretty close,I decided to post my premature of proposal for this project. It would be great if you all can given me some comments and suggestions. The proposal is available here. Thank you all in advance. Zhigang On Sat, Apr 27, 2013 at 1:40 PM, Peter Cock wrote: > On Sat, Apr 27, 2013 at 8:22 PM, Zhigang Wu > wrote: > > Peter, > > > > Thanks for the detailed explanation. It's very helpful. I am not quite > > sure about the goal of the lazy-loading parser. > > Let me try to summarize what are the goals of lazy-loading and how > > lazy-loading would work. Please correct me if necessary. Below I use > > fasta/fastq file as an example. The idea should generally applies to > > other format such as GenBank/EMBL as you mentioned. > > > > Lazy-loading is useful under the assumption that given a large file, > > we are interested in partial information of it but not all of them. > > For example a fasta file contains Arabidopsis genome, we only > > interested in the sequence of chr5 from index position from 2000-3000. > > Rather than parsing the whole file and storing each record in memory > > as most parsers will do, during the indexing step, lazy loading > > parser will only store a few position information, such as access > > positions (readily usable for seek) for all chromosomes (chr1, chr2, > > chr3, chr4, chr5, ...) and may be position index information such as > > the access positions for every 1000bp positions for each sequence in > > the given file. After indexing, we store these information in a > > dictionary like following {'chr1':{0:access_pos, 1000:access_pos, > > 2000:access_pos, ...}, 'chr2':{0:access_pos, 1000:access_pos, > > 2000:access_pos,}, 'chr3'...}. > > > > Compared to the usual parser which tends to parsing the whole file, we > > gain two benefits: speed, less memory usage and random access. Speed > > is gained because we skipped a lot during the parsing step. Go back to > > my example, once we have the dictionary, we can just seek to the > > access position of chr5:2000 and start reading and parsing from there. > > Less memory usage is due to we only stores access positions for each > > record as a dictionary in memory. > > > > > > Best, > > > > Zhigang > > Hi Zhigang, > > Yes - that's the basic idea of a disk based lazy loader. Here > the data stays on the disk until needed, so generally this is > very low memory but can be slow as it needs to read from > the disk. And existing example already in Biopython is our > BioSQL bindings which present a SeqRecord subclass which > only retrieves values from the database on demand. > > Note in the case of FASTA, we might want to use the existing > FAI index files from Heng Li's faidx tool (or another existing > index scheme). That relies on each record using a consistent > line wrapping length, so that seek offsets can be easily > calculated. 
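As a rough illustration of that offset arithmetic (a hypothetical helper, not faidx or Biopython code): if a record's sequence starts at a known byte offset and is wrapped at, say, 60 bases per line (61 bytes with the newline), the file position of any base follows directly, so only per-record offsets and line lengths need to be held in memory - much like the dictionary of access positions described earlier in this thread.

    def seq_offset(seq_start, bases_per_line, bytes_per_line, position):
        # 0-based base position within one record's sequence
        full_lines, remainder = divmod(position, bases_per_line)
        return seq_start + full_lines * bytes_per_line + remainder

    # e.g. seq_offset(75, 60, 61, 2000) == 75 + 33*61 + 20 == 2108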
> > An alternative idea is to load the data into memory (so that the > file is not touched again, useful for stream processing where > you cannot seek within the input data) but it is only parsed into > Python objects on demand. This would use a lot more memory, > but should be faster as there is no disk seeking and reading > (other than the one initial read). For FASTA this wouldn't help > much but it might work for EMBL/GenBank. > > Something to beware of with any lazy loading / lazy parsing is > what happens if the user tries to edit the record? Do you want > to allow this (it makes the code more complex) or not (simpler > and still very useful). > > In terms of usage examples, for things like raw NGS data this > is (currently) made up of lots and lots of short sequences (under > 1000bp). Lazy loading here is unlikely to be very helpful - unless > perhaps you can make the FASTQ parser faster this way? > (Once the reads are assembled or mapped to a reference, > random access to lookup reads by their mapped location is > very very important, thus the BAI indexing of BAM files). > > In terms of this project, I was thinking about a SeqRecord > style interface extending Bio.SeqIO (but you can suggest > something different for your project). > > What I saw as the main use case here is large datasets like > whole chromosomes in FASTA format or richly annotated > formats like EMBL, GenBank or GFF3. Right now if I am > doing something with (for example) the annotated human > chromosomes, loading these as GenBank files is quite > slow (it takes a far amount of memory too, but that isn't > my main worry). A lazy loading approach should let me > 'load' the GenBank files almost instantly, and delay > reading specific features or sequence from the disk > until needed. > > For example, I might have a list of genes for which I wish > to extract the annotation or sequence for - and there is no > need to load all the other features or the rest of the genome. > > (Note we can already do this by loading GenBank files > into a BioSQL database, and access them that way) > > Regards, > > Peter > From chris.mit7 at gmail.com Wed May 1 10:40:26 2013 From: chris.mit7 at gmail.com (Chris Mitchell) Date: Wed, 1 May 2013 10:40:26 -0400 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: Hi Zhigang, I throw some comments on your proposal. As i said there, I think you need to find & look at a variety of gff/gtf files to see where your implementation breaks down. Also, for parsing, I would focus on optimizing the speed the user can access attributes, they're the bits people care most about (where is gene X, what is the FPKM of isoform y?, etc.) Chris On Wed, May 1, 2013 at 10:17 AM, Zhigang Wu wrote: > Hi Peter and all, > Thanks for the long explanation. > I got much better understand of this project though I am still confusing on > how to implement the lazy-loading parser for feature rich files (EMBL, > GenBank, GFF3). > Since the deadline is pretty close,I decided to post my premature of > proposal for this project. It would be great if you all can given me some > comments and suggestions. The proposal is available > here< > https://docs.google.com/document/d/1BgPRKTq7HXq1K6fb9U2TnN7VvSDDlSTsQN991okekzk/edit?usp=sharing > >. > Thank you all in advance. 
> > > Zhigang > > > > On Sat, Apr 27, 2013 at 1:40 PM, Peter Cock >wrote: > > > On Sat, Apr 27, 2013 at 8:22 PM, Zhigang Wu > > wrote: > > > Peter, > > > > > > Thanks for the detailed explanation. It's very helpful. I am not quite > > > sure about the goal of the lazy-loading parser. > > > Let me try to summarize what are the goals of lazy-loading and how > > > lazy-loading would work. Please correct me if necessary. Below I use > > > fasta/fastq file as an example. The idea should generally applies to > > > other format such as GenBank/EMBL as you mentioned. > > > > > > Lazy-loading is useful under the assumption that given a large file, > > > we are interested in partial information of it but not all of them. > > > For example a fasta file contains Arabidopsis genome, we only > > > interested in the sequence of chr5 from index position from 2000-3000. > > > Rather than parsing the whole file and storing each record in memory > > > as most parsers will do, during the indexing step, lazy loading > > > parser will only store a few position information, such as access > > > positions (readily usable for seek) for all chromosomes (chr1, chr2, > > > chr3, chr4, chr5, ...) and may be position index information such as > > > the access positions for every 1000bp positions for each sequence in > > > the given file. After indexing, we store these information in a > > > dictionary like following {'chr1':{0:access_pos, 1000:access_pos, > > > 2000:access_pos, ...}, 'chr2':{0:access_pos, 1000:access_pos, > > > 2000:access_pos,}, 'chr3'...}. > > > > > > Compared to the usual parser which tends to parsing the whole file, we > > > gain two benefits: speed, less memory usage and random access. Speed > > > is gained because we skipped a lot during the parsing step. Go back to > > > my example, once we have the dictionary, we can just seek to the > > > access position of chr5:2000 and start reading and parsing from there. > > > Less memory usage is due to we only stores access positions for each > > > record as a dictionary in memory. > > > > > > > > > Best, > > > > > > Zhigang > > > > Hi Zhigang, > > > > Yes - that's the basic idea of a disk based lazy loader. Here > > the data stays on the disk until needed, so generally this is > > very low memory but can be slow as it needs to read from > > the disk. And existing example already in Biopython is our > > BioSQL bindings which present a SeqRecord subclass which > > only retrieves values from the database on demand. > > > > Note in the case of FASTA, we might want to use the existing > > FAI index files from Heng Li's faidx tool (or another existing > > index scheme). That relies on each record using a consistent > > line wrapping length, so that seek offsets can be easily > > calculated. > > > > An alternative idea is to load the data into memory (so that the > > file is not touched again, useful for stream processing where > > you cannot seek within the input data) but it is only parsed into > > Python objects on demand. This would use a lot more memory, > > but should be faster as there is no disk seeking and reading > > (other than the one initial read). For FASTA this wouldn't help > > much but it might work for EMBL/GenBank. > > > > Something to beware of with any lazy loading / lazy parsing is > > what happens if the user tries to edit the record? Do you want > > to allow this (it makes the code more complex) or not (simpler > > and still very useful). 
> > > > In terms of usage examples, for things like raw NGS data this > > is (currently) made up of lots and lots of short sequences (under > > 1000bp). Lazy loading here is unlikely to be very helpful - unless > > perhaps you can make the FASTQ parser faster this way? > > (Once the reads are assembled or mapped to a reference, > > random access to lookup reads by their mapped location is > > very very important, thus the BAI indexing of BAM files). > > > > In terms of this project, I was thinking about a SeqRecord > > style interface extending Bio.SeqIO (but you can suggest > > something different for your project). > > > > What I saw as the main use case here is large datasets like > > whole chromosomes in FASTA format or richly annotated > > formats like EMBL, GenBank or GFF3. Right now if I am > > doing something with (for example) the annotated human > > chromosomes, loading these as GenBank files is quite > > slow (it takes a far amount of memory too, but that isn't > > my main worry). A lazy loading approach should let me > > 'load' the GenBank files almost instantly, and delay > > reading specific features or sequence from the disk > > until needed. > > > > For example, I might have a list of genes for which I wish > > to extract the annotation or sequence for - and there is no > > need to load all the other features or the rest of the genome. > > > > (Note we can already do this by loading GenBank files > > into a BioSQL database, and access them that way) > > > > Regards, > > > > Peter > > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From eric.talevich at gmail.com Wed May 1 11:46:43 2013 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 1 May 2013 11:46:43 -0400 Subject: [Biopython-dev] gsoc phylo project questions In-Reply-To: References: Message-ID: On Tue, Apr 30, 2013 at 3:20 AM, Yanbo Ye wrote: > Hi Eric, > > Again, thanks for your comment. It might be better to discuss here. > https://github.com/lijax/gsoc/commit/e969c82a5a0aef45bba1277ce01d6dbee03e6a84#commitcomment-3096321 > > I have changed my proposal and timeline based on your advice. I think I > was too optimistic that I didn't consider about the compatibility with > existing code or other potential problem that may exist. After careful > consideration, I removed one task from the goal list to make the time more > relaxed, the tree comparison(seems > I miss understood this). I might be able to complete all of them. But it's > better to make it as an extra task, to make sure this coding experience is > not a burden. > I agree it's best to commit to a feasible timeline and then reserve a few "stretch goals". Dropping the tree distance function is fine, as there are currently some other students who might develop this small module as a course project, independently of GSoC. In any case that functionality is independent of the other tasks you've proposed. > According to your comment: > > 1. I didn't know PyCogent and DendroPy. I'll refer to them for useful > solutions. > 2. For distance-based tree and consensus tree, I think there is no need > to use NumPy. And for consensus tree, my original plan is to implement a > binary class to count the clade with the same leaves for performance. As > you suggest, I'll implement a class with the same API and improve the > performance later, so that I can pay more attention to the Strict and Adam > Consensus algorithms. > Sounds good. > 3. 
I didn't find the distance matrix method for MSA on Phylo Cookbook > page, only from existing tree. > Ah, I think I misunderstood you earlier. Yes, for the NJ method you'll need to use a substitution matrix to compute pairwise distances from a multiple sequence alignment. This shouldn't be too challenging, though you might find the need to add a new matrix to the Bio.SubsMat module if you want to let the user choose something other than BLOSUM or PAM. 4. For parsimony tree search, I have already know how several heuristic > search algorithms work. Do I need to implement them all? > No, just choose a well-established one that you feel comfortable implementing. 5. I'm not clear about the radial layout and Felsenstein's Equal Daylight > algorithm. Isn't this algorithm one way of showing the radial layout? I'm > sorry that I'm not familiar with this layout. Can you give some figure > examples and references? > For radial tree layout: https://en.wikipedia.org/wiki/Radial_tree http://www.infosun.fim.uni-passau.de/~chris/down/DrawingPhyloTreesEA.pdf The paper above also explains an "angle spreading" refinement step to improve the appearance of radial trees, which you could opt to implement instead of Equal Daylight. The Equal Daylight algorithm seems to only be documented fully in the book "Inferring Phylogenies" and implemented in the "drawtree" program in Phylip. In the Phylip documentation, the radial layout algorithm is called "Equal Arc", and the layout provided by that algorithm is the starting point for Equal Daylight: http://evolution.genetics.washington.edu/phylip/doc/drawtree.html Cheers, Eric From albl500 at york.ac.uk Wed May 1 18:56:12 2013 From: albl500 at york.ac.uk (Alex Leach) Date: Wed, 01 May 2013 23:56:12 +0100 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: Dear all, I also left some minor comments on the proposal; I hope they're helpful and I wish you every success! You should focus on the proposal for now, but I thought I'd share a more presentable version of the fasta lazy-loader I wrote a couple of years ago. The focus at the time was to minimise memory usage and increase the speed of random access to fasta-formatted sequences, stored on disk. Only sequence accessions and file locations are stored in-memory (in a dict). Once the index has been populated, it can 'pickle' the dictionary to a file on disk, for later re-use. It doesn't exactly fulfill all of your needs, but I hope it might help you in the right direction.. Also, were there plans for making the lazy loader thread-safe? I've done it in the past by passing a `multiprocessing.Pipe` instance to a method (`pipe_sequences`) of the lazy loader. If redesigning the code, I'd try to implement a callback scheme, but passing a Pipe did the job.. Maybe it's outside the current scope of the project, but anyway, I put the module up on github if you want to check it out[1]. Cheers, Alex [1] - https://github.com/alexleach/fasta_lazy_loader/blob/master/fasta_lazy_loader.py From zhigang.wu at email.ucr.edu Thu May 2 04:14:04 2013 From: zhigang.wu at email.ucr.edu (Zhigang Wu) Date: Thu, 2 May 2013 01:14:04 -0700 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: Hi Alex, The idea of taking advantage of multiprocessing is great. I haven't touched this kind of thing before and I think it's going to be cool to integrate into the project. 
Best, Zhigang On Wed, May 1, 2013 at 3:56 PM, Alex Leach wrote: > Dear all, > > I also left some minor comments on the proposal; I hope they're helpful > and I wish you every success! > > You should focus on the proposal for now, but I thought I'd share a more > presentable version of the fasta lazy-loader I wrote a couple of years ago. > The focus at the time was to minimise memory usage and increase the speed > of random access to fasta-formatted sequences, stored on disk. Only > sequence accessions and file locations are stored in-memory (in a dict). > Once the index has been populated, it can 'pickle' the dictionary to a file > on disk, for later re-use. > > It doesn't exactly fulfill all of your needs, but I hope it might help you > in the right direction.. > > Also, were there plans for making the lazy loader thread-safe? I've done > it in the past by passing a `multiprocessing.Pipe` instance to a method > (`pipe_sequences`) of the lazy loader. If redesigning the code, I'd try to > implement a callback scheme, but passing a Pipe did the job.. Maybe it's > outside the current scope of the project, but anyway, I put the module up > on github if you want to check it out[1]. > > > Cheers, > Alex > > > [1] - https://github.com/alexleach/**fasta_lazy_loader/blob/master/** > fasta_lazy_loader.py > > ______________________________**_________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.**org > http://lists.open-bio.org/**mailman/listinfo/biopython-dev > From albl500 at york.ac.uk Thu May 2 05:08:23 2013 From: albl500 at york.ac.uk (Alex Leach) Date: Thu, 02 May 2013 10:08:23 +0100 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: On Thu, 02 May 2013 09:14:04 +0100, Zhigang Wu wrote: > Hi Alex, > > The idea of taking advantage of multiprocessing is great. I haven't > touched this kind of thing before and I think >it's going to be cool to > integrate into the project. Pleasure. Multiprocessing is quite a large topic, and the relevant library documentation also rather large[1-2]. If you haven't worked with multiprocessing before, it will probably take a long while before you're comfortable using the libraries involved. So if you were to mention it in the proposal, I'd keep it out of the core objectives, as you have a lot else on your plate, already. Don't know if anyone else has any thoughts on this, though? I could potentially help to provide some pointers, so if you have any questions I might be able to help with, please feel free to ask. Kind regards, Alex [1] - http://docs.python.org/2/library/multiprocessing.html [2] - http://docs.python.org/2/library/threading.html -------------- next part -------------- An HTML attachment was scrubbed... URL: From p.j.a.cock at googlemail.com Thu May 2 05:52:19 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 2 May 2013 10:52:19 +0100 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: On Wed, May 1, 2013 at 3:17 PM, Zhigang Wu wrote: > Hi Peter and all, > Thanks for the long explanation. > I got much better understand of this project though I am still confusing on > how to implement the lazy-loading parser for feature rich files (EMBL, > GenBank, GFF3). > Since the deadline is pretty close,I decided to post my premature of > proposal for this project. It would be great if you all can given me some > comments and suggestions. 
The proposal is available here. > https://docs.google.com/document/d/1BgPRKTq7HXq1K6fb9U2TnN7VvSDDlSTsQN991okekzk/edit?usp=sharing > Thank you all in advance. > > Zhigang Hi Zhigang, I've posted a few comment there, but it would be a good idea to put the draft on Google Melange soon. I see you've posted the Google Doc on the NESCent Google+ as well, good. Looking at the current draft, you don't yet have a timeline. This is vital - and it should include writing tests (as you write code - not all at the end) and documentation (which can come after the code). In the community bonding period you could write that you plan to setup your development environment including multiple versions of Python (at least Python 2.6, Python 3, Jython 2.7, and PyPy 2.0 to cover the main variants). For instance, it would make sense to start with learning about faidx and how its indexing works, and trying to reproduce it in Python code, and then wrapping that in a SeqRecord style API. Include writing and evaluating some benchmarks too - you may need to learn how to profile Python code for this, since speed and performance is one the reasons for wanting lazy loading (lower memory usage is the other main driver). That could be the first few weeks perhaps? Regards, Peter From p.j.a.cock at googlemail.com Thu May 2 06:37:31 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 2 May 2013 11:37:31 +0100 Subject: [Biopython-dev] Fwd: [PhyloSoC] Application deadline fast approaching In-Reply-To: References: Message-ID: Hi all, I'm forwarding this for any potential Google Summer of Code 2013 students and mentors - note you should also be signed up to the NESCent "Phyloinformatics Summer of Code" mailing list to make sure you don't miss any important information. Thanks, Peter ---------- Forwarded message ---------- From: Karen Cranston Date: Thu, May 2, 2013 at 12:39 AM Subject: [PhyloSoC] Application deadline fast approaching To: Phyloinformatics Summer of Code The student application deadline for GSoC is this Friday, May 3 at 19:00 UTC! Thanks to everyone for their expertise and enthusiasm so far. Expect much traffic in Melange and on the G+ page between now and the deadline. Please do help students (for your projects or others) improve their applications - either on the G+ page or via a public comment on Melange. The most common issue is a lack of detail in the project plan. You can point students to the wiki for examples from previous years. Feel free to ask for help on this list. We will send out more about assigning mentors / scoring after the application deadline. Cheers, Karen & Jim -- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Karen Cranston, PhD Training Coordinator and Informatics Project Manager nescent.org @kcranstn ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ _______________________________________________ PhyloSoC mailing list PhyloSoC at nescent.org https://lists.nescent.org/mailman/listinfo/phylosoc UNSUBSCRIBE: https://lists.nescent.org/mailman/options/phylosoc/p.j.a.cock%40googlemail.com?unsub=1&unsubconfirm=1 From p.j.a.cock at googlemail.com Thu May 2 08:54:52 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 2 May 2013 13:54:52 +0100 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: On Wed, May 1, 2013 at 3:17 PM, Zhigang Wu wrote: > Hi Peter and all, > Thanks for the long explanation. 
> I got much better understand of this project though I am still confusing on > how to implement the lazy-loading parser for feature rich files (EMBL, > GenBank, GFF3). Hi Zhigang, I'd considered two ideas for GenBank/EMBL, Lazy parsing of the feature table: The existing iterator approach reads in a GenBank file record by record, and parses everything into objects (a SeqRecord object with the sequence as a Seq object and the features as a list of SeqFeature objects). I did some profiling a while ago, and of this the feature processing is quite slow, therefore during the initial parse the features could be stored in memory as a list of strings, and only parsed into SeqFeature objects if the user tries to access the SeqRecord's feature property. It would require a fairly simple subclassing of the SeqRecord to make the features list into a property in order to populate the list of SeqFeatures when first accessed. In the situation where the user never uses the features, this should be much faster, and save some memory as well (that would need to be confirmed by measurement - but a list of strings should take less RAM than a list of SeqFeature objects with all the sub-objects like the locations and annotations). In the situation where the use does access the features, the simplest behaviour would be to process the cached raw feature table into a list of SeqFeature objects. The overall runtime and memory usage would be about what we have now. This would not require any file seeking, and could be used within the existing SeqIO interface where we make a single pass though the file for parsing - this is vital in order to cope with handles like stdin and network handles where you cannot seek backwards in the file. That is the simpler idea, some real benefits, but not too ambitious. If you are already familiar with the GenBank/EMBL file format and our current parser and the SeqRecord object, then I think a week is reasonable. A full index based approach would mean scanning the GenBank, EMBL or GFF file and recording information about where each feature is on disk (file offset) and the feature location coordinates. This could be recorded in an efficient index structure (I was thinking something based on BAM's BAI or Heng Li's improved version CSI). The idea here is that when the user wants to look at features in a particular region of the genome (e.g. they have a mutation or SNP in region 1234567 on chr5) then only the annotation in that part of the genome needs to be loaded from the disk. This would likely require API changes or additions, for example the SeqRecord currently holds the SeqFeature objects as a simple list - with no build in co-ordinate access. As I wrote in the original outline email, there is scope for a very ambitious project working in this area - but some of these ideas would require more background knowledge or preparation: http://lists.open-bio.org/pipermail/biopython-dev/2013-March/010469.html Anything looking to work with GFF (in the broad sense of GFF3 and/or GTF) would ideal incorporate Brad Chapman's existing work: http://biopython.org/wiki/GFF_Parsing Regards, Peter From albl500 at york.ac.uk Thu May 2 09:54:37 2013 From: albl500 at york.ac.uk (Alex Leach) Date: Thu, 02 May 2013 14:54:37 +0100 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: Hi again, Thought I'd contribute some thoughts... Hope I'm not intruding too much on the discussion. 
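To make the "features as a property" idea described above concrete, here is a minimal sketch of what such a subclass might look like. Only Bio.SeqRecord.SeqRecord itself is real Biopython code; the subclass is hypothetical and parse_feature_table() is a placeholder for the actual feature-table parsing step:

    from Bio.SeqRecord import SeqRecord

    class LazyFeatureSeqRecord(SeqRecord):
        # Keeps the raw feature table as text and only parses it into
        # SeqFeature objects the first time .features is accessed (sketch).

        def __init__(self, seq, raw_feature_table, **kwargs):
            SeqRecord.__init__(self, seq, **kwargs)
            self._raw_feature_table = raw_feature_table  # unparsed lines
            self._parsed_features = None

        @property
        def features(self):
            if self._parsed_features is None:
                # placeholder for the slow parsing step being deferred
                self._parsed_features = parse_feature_table(self._raw_feature_table)
            return self._parsed_features

        @features.setter
        def features(self, value):
            # allowing assignment side-steps the "what if the user edits
            # the record" question, at the cost of a little extra code
            self._parsed_features = value

A record built this way costs only the raw strings until .features is first touched; after that, runtime and memory use fall back to what the current parser gives.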
On Thu, 02 May 2013 13:54:52 +0100, Peter Cock wrote: > > It would require a fairly simple subclassing of the SeqRecord to make > the features list into a property in order to populate the list of > SeqFeatures when first accessed. > Yes. You can turn a class property into a function quite easily, using decorators. Here[1] is a pretty good example, description and justification. [1] - http://stackoverflow.com/questions/6618002/python-property-versus-getters-and-setters > In the situation where the user never uses the features, this should > be much faster, and save some memory as well (that would need to > be confirmed by measurement - but a list of strings should take less > RAM than a list of SeqFeature objects with all the sub-objects like > the locations and annotations). > > In the situation where the use does access the features, the simplest > behaviour would be to process the cached raw feature table into a > list of SeqFeature objects. The overall runtime and memory usage > would be about what we have now. This would not require any > file seeking, and could be used within the existing SeqIO interface > where we make a single pass though the file for parsing - this is > vital in order to cope with handles like stdin and network handles > where you cannot seek backwards in the file. I think the Pythonic way here would be to follow the "Easier to Ask for Forgiveness than to ask for Permission" (EAFP) idiom[2]. i.e. Try to seek the file handle first, and if that raises an IOError, catch the exception and continue to cache the input stream data, perhaps writing it to a temporary file on disk. [2] - http://docs.python.org/2/glossary.html#term-eafp > > That is the simpler idea, some real benefits, but not too ambitious. > If you are already familiar with the GenBank/EMBL file format and > our current parser and the SeqRecord object, then I think a week > is reasonable. > > A full index based approach would mean scanning the GenBank, > EMBL or GFF file and recording information about where each > feature is on disk (file offset) and the feature location coordinates. > This could be recorded in an efficient index structure (I was thinking > something based on BAM's BAI or Heng Li's improved version CSI). > The idea here is that when the user wants to look at features in a > particular region of the genome (e.g. they have a mutation or SNP > in region 1234567 on chr5) then only the annotation in that part > of the genome needs to be loaded from the disk. Thought I'd add that Blast uses SQL tables (in ISAM format) for maintaining indexes to their databases[3]. I'm not familiar with BioPython's BioSQL module at all, but a nice feature of sqlite is that you can hold temporary databases in memory[4]. [3] - http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/seqdbisam_8hpp.html [4] - http://docs.python.org/2/library/sqlite3.html#using-sqlite3-efficiently Cheers, Alex > > This would likely require API changes or additions, for example > the SeqRecord currently holds the SeqFeature objects as a > simple list - with no build in co-ordinate access. 
> > As I wrote in the original outline email, there is scope for a very > ambitious project working in this area - but some of these ideas > would require more background knowledge or preparation: > http://lists.open-bio.org/pipermail/biopython-dev/2013-March/010469.html > > Anything looking to work with GFF (in the broad sense of GFF3 > and/or GTF) would ideal incorporate Brad Chapman's existing > work: http://biopython.org/wiki/GFF_Parsing > > Regards, > > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev -- --- Alex Leach. BSc, MRes PhD Student Chong & Redeker Labs Department of Biology University of York YO10 5DD Tel: 07940 480 771 EMAIL DISCLAIMER: http://www.york.ac.uk/docs/disclaimer/email.htm From idoerg at gmail.com Thu May 2 12:12:12 2013 From: idoerg at gmail.com (Iddo Friedberg) Date: Thu, 2 May 2013 12:12:12 -0400 Subject: [Biopython-dev] Uniprot-GOA parser Message-ID: Does anybody have a GOA parser in the works? Currently writing a simple parser for GAF, GPA and GPI formats. Can contribute if there is interest. More on GOA: http://www.ebi.ac.uk/GOA Cheers, Iddo -- Iddo Friedberg http://iddo-friedberg.net/contact.html ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>----.<--.>++++++.<<<<------------------------------------. From p.j.a.cock at googlemail.com Thu May 2 12:18:17 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 2 May 2013 17:18:17 +0100 Subject: [Biopython-dev] Uniprot-GOA parser In-Reply-To: References: Message-ID: On Thu, May 2, 2013 at 5:12 PM, Iddo Friedberg wrote: > Does anybody have a GOA parser in the works? Currently writing a simple > parser for GAF, GPA and GPI formats. Can contribute if there is interest. > > More on GOA: http://www.ebi.ac.uk/GOA > > Cheers, > > Iddo Hi Iddo, I see they're now offering GPAD1.1 format (as well? instead?). Does targeting that make more sense in the long run? I know a few people on the list are or were looking at ontology support for Biopython... it would be good to add this. Regards, Peter From idoerg at gmail.com Thu May 2 12:19:39 2013 From: idoerg at gmail.com (Iddo Friedberg) Date: Thu, 2 May 2013 12:19:39 -0400 Subject: [Biopython-dev] Uniprot-GOA parser In-Reply-To: References: Message-ID: Yes, will do GPAD as well. Need to preserve the others though, due to legacy. ./I On Thu, May 2, 2013 at 12:18 PM, Peter Cock wrote: > On Thu, May 2, 2013 at 5:12 PM, Iddo Friedberg wrote: > > Does anybody have a GOA parser in the works? Currently writing a simple > > parser for GAF, GPA and GPI formats. Can contribute if there is interest. > > > > More on GOA: http://www.ebi.ac.uk/GOA > > > > Cheers, > > > > Iddo > > Hi Iddo, > > I see they're now offering GPAD1.1 format (as well? instead?). > Does targeting that make more sense in the long run? > > I know a few people on the list are or were looking at ontology > support for Biopython... it would be good to add this. > > Regards, > > Peter > -- Iddo Friedberg http://iddo-friedberg.net/contact.html ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>----.<--.>++++++.<<<<------------------------------------. 
From zhigang.wu at email.ucr.edu Thu May 2 17:18:43 2013 From: zhigang.wu at email.ucr.edu (Zhigang Wu) Date: Thu, 2 May 2013 14:18:43 -0700 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: Hi Chris and All, In your comments to my proposal, you mentioned that some GFF files may have a size of GBs. After seeing that comment, I just want to roughly know how large is a gff file people are often working with? I mainly work on plants and I am not quite familiar with animals. Below I listed out a list of animals and plants, to my knowledge from reading papers, which most people are working with. organism(genome size) size of gff url to the ftp *folder*(not a huge file so feel free to click it) arabidopsis(~120MB) 44MB ftp://ftp.arabidopsis.org/Maps/gbrowse_data/TAIR10/ rice(~450MB) 77MB here corn(3GB) 87MB http://ftp.maizesequence.org/release-5b/filtered-set/ D. melanogaster 450MB ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r5.50_FB2013_02/gff/ C. elegans (site going down) http://wiki.wormbase.org/index.php/Downloads#GFF2 H. sapiens(3G) 170MB here My point is that caching gff files in memory wasn't as bad as we have thought. Any comments or suggestion are welcome. Best, Zhigang On Wed, May 1, 2013 at 7:40 AM, Chris Mitchell wrote: > Hi Zhigang, > > I throw some comments on your proposal. As i said there, I think you need > to find & look at a variety of gff/gtf files to see where your > implementation breaks down. Also, for parsing, I would focus on optimizing > the speed the user can access attributes, they're the bits people care most > about (where is gene X, what is the FPKM of isoform y?, etc.) > > Chris > > > On Wed, May 1, 2013 at 10:17 AM, Zhigang Wu wrote: > >> Hi Peter and all, >> Thanks for the long explanation. >> I got much better understand of this project though I am still confusing >> on >> how to implement the lazy-loading parser for feature rich files (EMBL, >> GenBank, GFF3). >> Since the deadline is pretty close,I decided to post my premature of >> proposal for this project. It would be great if you all can given me some >> comments and suggestions. The proposal is available >> here< >> https://docs.google.com/document/d/1BgPRKTq7HXq1K6fb9U2TnN7VvSDDlSTsQN991okekzk/edit?usp=sharing >> >. >> >> Thank you all in advance. >> >> >> Zhigang >> >> >> >> On Sat, Apr 27, 2013 at 1:40 PM, Peter Cock > >wrote: >> >> > On Sat, Apr 27, 2013 at 8:22 PM, Zhigang Wu >> > wrote: >> > > Peter, >> > > >> > > Thanks for the detailed explanation. It's very helpful. I am not quite >> > > sure about the goal of the lazy-loading parser. >> > > Let me try to summarize what are the goals of lazy-loading and how >> > > lazy-loading would work. Please correct me if necessary. Below I use >> > > fasta/fastq file as an example. The idea should generally applies to >> > > other format such as GenBank/EMBL as you mentioned. >> > > >> > > Lazy-loading is useful under the assumption that given a large file, >> > > we are interested in partial information of it but not all of them. >> > > For example a fasta file contains Arabidopsis genome, we only >> > > interested in the sequence of chr5 from index position from 2000-3000. 
>> > > Rather than parsing the whole file and storing each record in memory >> > > as most parsers will do, during the indexing step, lazy loading >> > > parser will only store a few position information, such as access >> > > positions (readily usable for seek) for all chromosomes (chr1, chr2, >> > > chr3, chr4, chr5, ...) and may be position index information such as >> > > the access positions for every 1000bp positions for each sequence in >> > > the given file. After indexing, we store these information in a >> > > dictionary like following {'chr1':{0:access_pos, 1000:access_pos, >> > > 2000:access_pos, ...}, 'chr2':{0:access_pos, 1000:access_pos, >> > > 2000:access_pos,}, 'chr3'...}. >> > > >> > > Compared to the usual parser which tends to parsing the whole file, we >> > > gain two benefits: speed, less memory usage and random access. Speed >> > > is gained because we skipped a lot during the parsing step. Go back to >> > > my example, once we have the dictionary, we can just seek to the >> > > access position of chr5:2000 and start reading and parsing from there. >> > > Less memory usage is due to we only stores access positions for each >> > > record as a dictionary in memory. >> > > >> > > >> > > Best, >> > > >> > > Zhigang >> > >> > Hi Zhigang, >> > >> > Yes - that's the basic idea of a disk based lazy loader. Here >> > the data stays on the disk until needed, so generally this is >> > very low memory but can be slow as it needs to read from >> > the disk. And existing example already in Biopython is our >> > BioSQL bindings which present a SeqRecord subclass which >> > only retrieves values from the database on demand. >> > >> > Note in the case of FASTA, we might want to use the existing >> > FAI index files from Heng Li's faidx tool (or another existing >> > index scheme). That relies on each record using a consistent >> > line wrapping length, so that seek offsets can be easily >> > calculated. >> > >> > An alternative idea is to load the data into memory (so that the >> > file is not touched again, useful for stream processing where >> > you cannot seek within the input data) but it is only parsed into >> > Python objects on demand. This would use a lot more memory, >> > but should be faster as there is no disk seeking and reading >> > (other than the one initial read). For FASTA this wouldn't help >> > much but it might work for EMBL/GenBank. >> > >> > Something to beware of with any lazy loading / lazy parsing is >> > what happens if the user tries to edit the record? Do you want >> > to allow this (it makes the code more complex) or not (simpler >> > and still very useful). >> > >> > In terms of usage examples, for things like raw NGS data this >> > is (currently) made up of lots and lots of short sequences (under >> > 1000bp). Lazy loading here is unlikely to be very helpful - unless >> > perhaps you can make the FASTQ parser faster this way? >> > (Once the reads are assembled or mapped to a reference, >> > random access to lookup reads by their mapped location is >> > very very important, thus the BAI indexing of BAM files). >> > >> > In terms of this project, I was thinking about a SeqRecord >> > style interface extending Bio.SeqIO (but you can suggest >> > something different for your project). >> > >> > What I saw as the main use case here is large datasets like >> > whole chromosomes in FASTA format or richly annotated >> > formats like EMBL, GenBank or GFF3. 
Right now if I am >> > doing something with (for example) the annotated human >> > chromosomes, loading these as GenBank files is quite >> > slow (it takes a far amount of memory too, but that isn't >> > my main worry). A lazy loading approach should let me >> > 'load' the GenBank files almost instantly, and delay >> > reading specific features or sequence from the disk >> > until needed. >> > >> > For example, I might have a list of genes for which I wish >> > to extract the annotation or sequence for - and there is no >> > need to load all the other features or the rest of the genome. >> > >> > (Note we can already do this by loading GenBank files >> > into a BioSQL database, and access them that way) >> > >> > Regards, >> > >> > Peter >> > >> _______________________________________________ >> Biopython-dev mailing list >> Biopython-dev at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython-dev >> > > From zhigang.wu at email.ucr.edu Thu May 2 20:18:03 2013 From: zhigang.wu at email.ucr.edu (Zhigang Wu) Date: Thu, 2 May 2013 17:18:03 -0700 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: On Thu, May 2, 2013 at 5:54 AM, Peter Cock wrote: > On Wed, May 1, 2013 at 3:17 PM, Zhigang Wu > wrote: > > Hi Peter and all, > > Thanks for the long explanation. > > I got much better understand of this project though I am still confusing > on > > how to implement the lazy-loading parser for feature rich files (EMBL, > > GenBank, GFF3). > > Hi Zhigang, > > I'd considered two ideas for GenBank/EMBL, > > Lazy parsing of the feature table: The existing iterator approach reads > in a GenBank file record by record, and parses everything into objects > (a SeqRecord object with the sequence as a Seq object and the > features as a list of SeqFeature objects). I did some profiling a while > ago, and of this the feature processing is quite slow, therefore during > the initial parse the features could be stored in memory as a list of > strings, and only parsed into SeqFeature objects if the user tries to > access the SeqRecord's feature property. > > It would require a fairly simple subclassing of the SeqRecord to make > the features list into a property in order to populate the list of > SeqFeatures when first accessed. > > In the situation where the user never uses the features, this should > be much faster, and save some memory as well (that would need to > be confirmed by measurement - but a list of strings should take less > RAM than a list of SeqFeature objects with all the sub-objects like > the locations and annotations). > I agree. This would save some memory. > In the situation where the use does access the features, the simplest > behaviour would be to process the cached raw feature table into a > list of SeqFeature objects. The overall runtime and memory usage > would be about what we have now. This would not require any > file seeking, and could be used within the existing SeqIO interface > where we make a single pass though the file for parsing - this is > vital in order to cope with handles like stdin and network handles > where you cannot seek backwards in the file. > > Yes, I agree. So in this sense, the name "lazy-loading" is a little misleading. Because, this would load everything into memory at the beginning, while just delay in parsing any feature until a specific one is requested. Seems like "lazy parsing" would be more appropriate. 
That is the simpler idea, some real benefits, but not too ambitious. > If you are already familiar with the GenBank/EMBL file format and > our current parser and the SeqRecord object, then I think a week > is reasonable. > > No, I am not quite familiar with these. > A full index based approach would mean scanning the GenBank, > EMBL or GFF file and recording information about where each > feature is on disk (file offset) and the feature location coordinates. > This could be recorded in an efficient index structure (I was thinking > something based on BAM's BAI or Heng Li's improved version CSI). > The idea here is that when the user wants to look at features in a > particular region of the genome (e.g. they have a mutation or SNP > in region 1234567 on chr5) then only the annotation in that part > of the genome needs to be loaded from the disk. > > This would likely require API changes or additions, for example > the SeqRecord currently holds the SeqFeature objects as a > simple list - with no build in co-ordinate access. > > As I wrote in the original outline email, there is scope for a very > ambitious project working in this area - but some of these ideas > would require more background knowledge or preparation: > http://lists.open-bio.org/pipermail/biopython-dev/2013-March/010469.html > > Hmm, this is actually INDEXing a big file. Don't you think a little bit off topic, "lazy-loading parser". But this seems interesting and challenging and definitely going to be useful. > Anything looking to work with GFF (in the broad sense of GFF3 > and/or GTF) would ideal incorporate Brad Chapman's existing > work: http://biopython.org/wiki/GFF_Parsing > > Yes, I definitely will take a Brad's GFF parser. > Regards, > > Peter > Thanks for the long explanation again. Zhigang From yeyanbo289 at gmail.com Thu May 2 22:19:07 2013 From: yeyanbo289 at gmail.com (Yanbo Ye) Date: Fri, 3 May 2013 10:19:07 +0800 Subject: [Biopython-dev] Biopython Phylo Proposal Message-ID: Hi everyone, I forget to post my gsoc proposal page here. Any comment? http://www.google-melange.com/gsoc/proposal/review/google/gsoc2013/yeyanbo/1# Thanks, Yanbo -- ??? ???????????????? Ye Yanbo Bioinformatics Group, Wuhan Institute Of Virology, Chinese Academy of Sciences From Markus.Piotrowski at ruhr-uni-bochum.de Fri May 3 02:32:43 2013 From: Markus.Piotrowski at ruhr-uni-bochum.de (Markus Piotrowski) Date: 3 May 2013 08:32:43 +0200 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: Hi Zhigang, Sequence read files from Next Generation Sequencing methods are several GB large. Don't know if they are regulary stored in GFF files, anyhow. Best, Markus Am 2013-05-02 23:18, schrieb Zhigang Wu: > Hi Chris and All, > > In your comments to my proposal, you mentioned that some GFF files > may have > a size of GBs. > After seeing that comment, I just want to roughly know how large is a > gff > file people are often working with? > I mainly work on plants and I am not quite familiar with animals. > Below I listed out a list of animals and plants, to my knowledge from > reading papers, which most people are working with. > > organism(genome size) size of gff url to > the > ftp *folder*(not a huge file so feel free to click it) > arabidopsis(~120MB) 44MB > ftp://ftp.arabidopsis.org/Maps/gbrowse_data/TAIR10/ > rice(~450MB) 77MB > > here > corn(3GB) 87MB > http://ftp.maizesequence.org/release-5b/filtered-set/ > D. 
melanogaster 450MB > > ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r5.50_FB2013_02/gff/ > C. elegans (site going down) > http://wiki.wormbase.org/index.php/Downloads#GFF2 > H. sapiens(3G) 170MB > > here > > My point is that caching gff files in memory wasn't as bad as we have > thought. Any comments or suggestion are welcome. > > Best, > > > Zhigang > > > > > On Wed, May 1, 2013 at 7:40 AM, Chris Mitchell > wrote: > >> Hi Zhigang, >> >> I throw some comments on your proposal. As i said there, I think >> you need >> to find & look at a variety of gff/gtf files to see where your >> implementation breaks down. Also, for parsing, I would focus on >> optimizing >> the speed the user can access attributes, they're the bits people >> care most >> about (where is gene X, what is the FPKM of isoform y?, etc.) >> >> Chris >> >> >> On Wed, May 1, 2013 at 10:17 AM, Zhigang Wu >> wrote: >> >>> Hi Peter and all, >>> Thanks for the long explanation. >>> I got much better understand of this project though I am still >>> confusing >>> on >>> how to implement the lazy-loading parser for feature rich files >>> (EMBL, >>> GenBank, GFF3). >>> Since the deadline is pretty close,I decided to post my premature >>> of >>> proposal for this project. It would be great if you all can given >>> me some >>> comments and suggestions. The proposal is available >>> here< >>> >>> https://docs.google.com/document/d/1BgPRKTq7HXq1K6fb9U2TnN7VvSDDlSTsQN991okekzk/edit?usp=sharing >>> >. >>> >>> Thank you all in advance. >>> >>> >>> Zhigang >>> >>> >>> >>> On Sat, Apr 27, 2013 at 1:40 PM, Peter Cock >>> >> >wrote: >>> >>> > On Sat, Apr 27, 2013 at 8:22 PM, Zhigang Wu >>> >>> > wrote: >>> > > Peter, >>> > > >>> > > Thanks for the detailed explanation. It's very helpful. I am >>> not quite >>> > > sure about the goal of the lazy-loading parser. >>> > > Let me try to summarize what are the goals of lazy-loading and >>> how >>> > > lazy-loading would work. Please correct me if necessary. Below >>> I use >>> > > fasta/fastq file as an example. The idea should generally >>> applies to >>> > > other format such as GenBank/EMBL as you mentioned. >>> > > >>> > > Lazy-loading is useful under the assumption that given a large >>> file, >>> > > we are interested in partial information of it but not all of >>> them. >>> > > For example a fasta file contains Arabidopsis genome, we only >>> > > interested in the sequence of chr5 from index position from >>> 2000-3000. >>> > > Rather than parsing the whole file and storing each record in >>> memory >>> > > as most parsers will do, during the indexing step, lazy >>> loading >>> > > parser will only store a few position information, such as >>> access >>> > > positions (readily usable for seek) for all chromosomes (chr1, >>> chr2, >>> > > chr3, chr4, chr5, ...) and may be position index information >>> such as >>> > > the access positions for every 1000bp positions for each >>> sequence in >>> > > the given file. After indexing, we store these information in a >>> > > dictionary like following {'chr1':{0:access_pos, >>> 1000:access_pos, >>> > > 2000:access_pos, ...}, 'chr2':{0:access_pos, 1000:access_pos, >>> > > 2000:access_pos,}, 'chr3'...}. >>> > > >>> > > Compared to the usual parser which tends to parsing the whole >>> file, we >>> > > gain two benefits: speed, less memory usage and random access. >>> Speed >>> > > is gained because we skipped a lot during the parsing step. 
Go >>> back to >>> > > my example, once we have the dictionary, we can just seek to >>> the >>> > > access position of chr5:2000 and start reading and parsing from >>> there. >>> > > Less memory usage is due to we only stores access positions for >>> each >>> > > record as a dictionary in memory. >>> > > >>> > > >>> > > Best, >>> > > >>> > > Zhigang >>> > >>> > Hi Zhigang, >>> > >>> > Yes - that's the basic idea of a disk based lazy loader. Here >>> > the data stays on the disk until needed, so generally this is >>> > very low memory but can be slow as it needs to read from >>> > the disk. And existing example already in Biopython is our >>> > BioSQL bindings which present a SeqRecord subclass which >>> > only retrieves values from the database on demand. >>> > >>> > Note in the case of FASTA, we might want to use the existing >>> > FAI index files from Heng Li's faidx tool (or another existing >>> > index scheme). That relies on each record using a consistent >>> > line wrapping length, so that seek offsets can be easily >>> > calculated. >>> > >>> > An alternative idea is to load the data into memory (so that the >>> > file is not touched again, useful for stream processing where >>> > you cannot seek within the input data) but it is only parsed into >>> > Python objects on demand. This would use a lot more memory, >>> > but should be faster as there is no disk seeking and reading >>> > (other than the one initial read). For FASTA this wouldn't help >>> > much but it might work for EMBL/GenBank. >>> > >>> > Something to beware of with any lazy loading / lazy parsing is >>> > what happens if the user tries to edit the record? Do you want >>> > to allow this (it makes the code more complex) or not (simpler >>> > and still very useful). >>> > >>> > In terms of usage examples, for things like raw NGS data this >>> > is (currently) made up of lots and lots of short sequences (under >>> > 1000bp). Lazy loading here is unlikely to be very helpful - >>> unless >>> > perhaps you can make the FASTQ parser faster this way? >>> > (Once the reads are assembled or mapped to a reference, >>> > random access to lookup reads by their mapped location is >>> > very very important, thus the BAI indexing of BAM files). >>> > >>> > In terms of this project, I was thinking about a SeqRecord >>> > style interface extending Bio.SeqIO (but you can suggest >>> > something different for your project). >>> > >>> > What I saw as the main use case here is large datasets like >>> > whole chromosomes in FASTA format or richly annotated >>> > formats like EMBL, GenBank or GFF3. Right now if I am >>> > doing something with (for example) the annotated human >>> > chromosomes, loading these as GenBank files is quite >>> > slow (it takes a far amount of memory too, but that isn't >>> > my main worry). A lazy loading approach should let me >>> > 'load' the GenBank files almost instantly, and delay >>> > reading specific features or sequence from the disk >>> > until needed. >>> > >>> > For example, I might have a list of genes for which I wish >>> > to extract the annotation or sequence for - and there is no >>> > need to load all the other features or the rest of the genome. 
>>> > >>> > (Note we can already do this by loading GenBank files >>> > into a BioSQL database, and access them that way) >>> > >>> > Regards, >>> > >>> > Peter >>> > >>> _______________________________________________ >>> Biopython-dev mailing list >>> Biopython-dev at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biopython-dev >>> >> >> > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From p.j.a.cock at googlemail.com Mon May 6 07:23:24 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 6 May 2013 12:23:24 +0100 Subject: [Biopython-dev] Abstract for "Biopython Project Update" at BOSC 2013 In-Reply-To: References: Message-ID: On Tue, Apr 16, 2013 at 9:47 AM, Peter Cock wrote: > On Tue, Apr 16, 2013 at 1:43 AM, Eric Talevich wrote: >> >> The abstract looks good to me. Which release was the first to include >> SearchIO, was that 1.61? If so, maybe it would be good to note that in >> addition to the smaller improvements, SearchIO specifically was (one of?) >> the new module(s) that introduced the beta designation. >> > > Yes, SearchIO was included in Biopython 1.61, but you're right that > could be made a bit clearer. > The Biopython update has been accepted for a 10 minute talk slot at BOSC (anyone else with an abstract submitted should have had an email by now), the reviewers' feedback was short and positive: (A) Keep it short and show the variety of active sub-projects and people involved and the presentaion will will be attractive to the audience. The last year's talk is a good example (based on the shared slides). (Last year it was Eric at BOSC 2012 in Long Beach, CA - well done) (B) Nice to see latest news on BioPython and future directions of one of the most popular OpenBio project. (C) This talk reports an update on the BioPython project (support for experimental codes, Python 3 compatibility, SearchIO and genomic variant formats). BioPython is one of the central projects of O.B.F and its update is worth getting some attention at BOSC. We have until June to revise our abstract - so perhaps we should do the next release this month in May ;) Peter From idoerg at gmail.com Tue May 7 12:24:00 2013 From: idoerg at gmail.com (Iddo Friedberg) Date: Tue, 7 May 2013 12:24:00 -0400 Subject: [Biopython-dev] uniprot-GOA parse Message-ID: hi, As promised, I have written a uniprot-goa parser. Very skeletal, has iterators for reading the three uniprot-GOA file types, a write function, and a couple of usage examples. No github write access, so attaching. Cheers, Iddo -- Iddo Friedberg http://iddo-friedberg.net/contact.html ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>----.<--.>++++++.<<<<------------------------------------. -------------- next part -------------- A non-text attachment was scrubbed... Name: upg_parser.py Type: application/octet-stream Size: 10344 bytes Desc: not available URL: From p.j.a.cock at googlemail.com Tue May 7 12:47:16 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 7 May 2013 17:47:16 +0100 Subject: [Biopython-dev] uniprot-GOA parse In-Reply-To: References: Message-ID: On Tue, May 7, 2013 at 5:24 PM, Iddo Friedberg wrote: > hi, > > As promised, I have written a uniprot-goa parser. 
Very skeletal, has
> iterators for reading the three uniprot-GOA file types, a write function,
> and a couple of usage examples.
>
> No github write access, so attaching.

The file arrived :)

Did you have any thoughts on where in the namespace to put this?

The idea with github is you'd register an account, say iddux (since
that's your Twitter username), and then fork the repository as
https://github.com/iddux/biopython - and make a new branch there with
your changes, and ask for feedback or make a pull request. All that can
be done without any write access to the main repository, and is
intended to lower the barrier to entry.

In your case, given you're a past project leader etc, drop me (or Brad
etc) an email once you've mastered the git basics and we can give you
direct access.

Regards,

Peter

From natemsutton at yahoo.com Tue May 7 17:12:59 2013
From: natemsutton at yahoo.com (Nate Sutton)
Date: Tue, 7 May 2013 14:12:59 -0700 (PDT)
Subject: [Biopython-dev] Progress with ticket 3336
Message-ID: <1367961179.88206.YahooMailNeo at web122603.mail.ne1.yahoo.com>

Hi,

Here is a progress follow up to
http://lists.open-bio.org/pipermail/biopython-dev/2013-April/010548.html .
I have added a commit to the github branch that adds an option to create
clade branch lines using LineCollection. The LineCollection objects are
stored in a tuple before adding them to the plot. It's in
Bio/Phylo/_utils.py. Is this what the last bullet point was requesting in
https://redmine.open-bio.org/issues/3336 ?

Thanks!

Nate

P.S. I used a tuple to store the LineCollection objects instead of a list
because that was mentioned in the ticket, but if that looks like it should
be different let me know. Also, I got some global variables to work with
the code, but I was only able to do that after declaring them as globals
twice. If there are suggestions on how to code that differently let me
know.

From idoerg at gmail.com Wed May 8 19:28:17 2013
From: idoerg at gmail.com (Iddo Friedberg)
Date: Wed, 8 May 2013 19:28:17 -0400
Subject: [Biopython-dev] UniProt GOA parser
Message-ID:

A new uniprot-GOA parser is available for you to poke around:

https://github.com/idoerg/biopython/tree/uniprot-goa/Bio/UniProtGOA

More on Uniprot-GOA: http://www.ebi.ac.uk/GOA

There are three file formats: GAF (gene association file), GPA (gene
product association) and GPI (gene product information) explained here:
http://www.ebi.ac.uk/GOA/downloads

Input GAF files can be very large, due to the growth of uniprot GOA. If
you would like to test in a timely fashion, I suggest you get historical
files, which are smaller. Once you get to the > 40 version numbers, the
runtime for the example code in UniProtGOA.py goes over 2 minutes (on my
i5 machine).

Old GAF files are available here:
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/old/UNIPROT/

Current GPI and GPA files are not very large.

Thanks to Peter for his help on this.

Best,

Iddo

--
Iddo Friedberg
http://iddo-friedberg.net/contact.html
++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.>
++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----.
.>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>>
>>----.<--.>++++++.<<<<------------------------------------.
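For readers following the GAF discussion in the messages below, a minimal
illustrative sketch (this is not the code in Iddo's branch): the column
names roughly follow the published GAF 2.0 file specification, the batching
mirrors the "group consecutive lines by DB_Object_ID" idea raised later in
the thread, the offset dictionary illustrates the record-ID-to-file-offset
index later suggested for random access, and the file name is made up.

from itertools import groupby

# Column names roughly as in the GAF 2.0 specification (17 columns).
GAF20FIELDS = ["DB", "DB_Object_ID", "DB_Object_Symbol", "Qualifier",
               "GO_ID", "DB_Reference", "Evidence", "With", "Aspect",
               "DB_Object_Name", "Synonym", "DB_Object_Type",
               "Taxon_ID", "Date", "Assigned_By",
               "Annotation_Extension", "Gene_Product_Form_ID"]

def gaf_line_iterator(handle):
    """Yield one annotation line at a time as a field name -> value dict."""
    for line in handle:
        if line.startswith("!") or not line.strip():
            continue  # skip the GAF header and blank lines
        values = line.rstrip("\n").split("\t")
        yield dict(zip(GAF20FIELDS, values))

def gaf_record_iterator(handle):
    """Batch consecutive annotation lines sharing a DB_Object_ID."""
    for object_id, group in groupby(gaf_line_iterator(handle),
                                    key=lambda ann: ann["DB_Object_ID"]):
        yield object_id, list(group)

def index_gaf_offsets(handle):
    """Map each DB_Object_ID to the file offset of its first line."""
    offsets = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith("!") or not line.strip():
            continue
        offsets.setdefault(line.split("\t")[1], offset)
    return offsets

if __name__ == "__main__":
    # The file name is just an example of a small, species specific GAF file.
    with open("gene_association.goa_yeast") as handle:
        for object_id, annotations in gaf_record_iterator(handle):
            print("%s\t%i" % (object_id, len(annotations)))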
From p.j.a.cock at googlemail.com Fri May 10 06:06:19 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 10 May 2013 11:06:19 +0100 Subject: [Biopython-dev] UniProt GOA parser In-Reply-To: References: Message-ID: On Thu, May 9, 2013 at 12:28 AM, Iddo Friedberg wrote: > A new uniprot-GOA parser is available for you to poke around: > > https://github.com/idoerg/biopython/tree/uniprot-goa/Bio/UniProtGOA > I think for the namespace, we might be better off using Bio.UniProt.GOA, where Iddo's parser would be in Bio/UniProt/GOA.py and any other UniProt specific code could also go under Bio/UniProt - for example a web API. Some of Bio.SwissProt might also migrate here over time. > More on Uniprot-GOA: http://www.ebi.ac.uk/GOA > > There are three file formats: GAF (gene association file) , GPA (gene > product association) and GPI (gene product information) explained here: > http://www.ebi.ac.uk/GOA/downloads > > Input GAF files can be very large, due to the growth of uniprot GOA. If you > would like to test in a timely fashion, I suggest you get historical files, > which are smaller. Once you get to the > 40 version numbers, the runtime > for the example code in UniProtGOA.py goes over 2 minutes (on my i5 > machine). Would it make sense to want random access to the GOA files based on the identifier (DB_Object_ID and/or DB_Object_Symbol)? That should be fairly straight forward to do building on the indexing code for Bio.SeqIO and SearchIO. Note here I am picturing combining all the (consecutive) lines for the same DB_Object_ID - currently the parser is line based, but batching by DB_Object_ID would be a straightforward change and may better suit some uses. > Old GAF files are available here: > ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/old/UNIPROT/ > > Current GPI and GPA files are not very large. > > Thanks to Peter for his help on this. > > Best, > > Iddo Peter From idoerg at gmail.com Fri May 10 12:20:16 2013 From: idoerg at gmail.com (Iddo Friedberg) Date: Fri, 10 May 2013 12:20:16 -0400 Subject: [Biopython-dev] UniProt GOA parser In-Reply-To: References: Message-ID: On Fri, May 10, 2013 at 6:06 AM, Peter Cock wrote: > On Thu, May 9, 2013 at 12:28 AM, Iddo Friedberg wrote: > > A new uniprot-GOA parser is available for you to poke around: > > > > https://github.com/idoerg/biopython/tree/uniprot-goa/Bio/UniProtGOA > > > > I think for the namespace, we might be better off using Bio.UniProt.GOA, > where Iddo's parser would be in Bio/UniProt/GOA.py and any other > UniProt specific code could also go under Bio/UniProt - for example > a web API. > OK. > > Some of Bio.SwissProt might also migrate here over time. > > > More on Uniprot-GOA: http://www.ebi.ac.uk/GOA > > > > There are three file formats: GAF (gene association file) , GPA (gene > > product association) and GPI (gene product information) explained here: > > http://www.ebi.ac.uk/GOA/downloads > > > > Input GAF files can be very large, due to the growth of uniprot GOA. If > you > > would like to test in a timely fashion, I suggest you get historical > files, > > which are smaller. Once you get to the > 40 version numbers, the runtime > > for the example code in UniProtGOA.py goes over 2 minutes (on my i5 > > machine). > > Would it make sense to want random access to the GOA files based > on the identifier (DB_Object_ID and/or DB_Object_Symbol)? That > should be fairly straight forward to do building on the indexing code > for Bio.SeqIO and SearchIO. > Would that require reading it all into memory? 
Uniprot_GOA files are huge, it is impractical to read them in fully. > > Note here I am picturing combining all the (consecutive) lines > for the same DB_Object_ID - currently the parser is line based, > but batching by DB_Object_ID would be a straightforward change > and may better suit some uses. > Perhaps only for organism specific file, which in some cases can be read fully into memory. > > > Old GAF files are available here: > > ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/old/UNIPROT/ > > > > Current GPI and GPA files are not very large. > > > > Thanks to Peter for his help on this. > > > > Best, > > > > Iddo > > Peter > -- Iddo Friedberg http://iddo-friedberg.net/contact.html ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>----.<--.>++++++.<<<<------------------------------------. From p.j.a.cock at googlemail.com Fri May 10 12:26:13 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 10 May 2013 17:26:13 +0100 Subject: [Biopython-dev] UniProt GOA parser In-Reply-To: References: Message-ID: On Fri, May 10, 2013 at 5:20 PM, Iddo Friedberg wrote: > On Fri, May 10, 2013 at 6:06 AM, Peter Cock wrote: >> >> Would it make sense to want random access to the GOA files based >> on the identifier (DB_Object_ID and/or DB_Object_Symbol)? That >> should be fairly straight forward to do building on the indexing code >> for Bio.SeqIO and SearchIO. > > > Would that require reading it all into memory? Uniprot_GOA files > are huge, it is impractical to read them in fully. Not at all - we'd record a dictionary mapping the record ID to an offset in the file on disk, or record this mapping in an SQLite index file. >> Note here I am picturing combining all the (consecutive) lines >> for the same DB_Object_ID - currently the parser is line based, >> but batching by DB_Object_ID would be a straightforward change >> and may better suit some uses. > > Perhaps only for organism specific file, which in some cases can > be read fully into memory. The examples I looked at only seemed to have a dozen or so lines for each DB_Object_ID - but perhaps these were easy cases? How many lines per DB_Object_ID in the worst cases? Peter From idoerg at gmail.com Fri May 10 12:32:43 2013 From: idoerg at gmail.com (Iddo Friedberg) Date: Fri, 10 May 2013 12:32:43 -0400 Subject: [Biopython-dev] UniProt GOA parser In-Reply-To: References: Message-ID: On Fri, May 10, 2013 at 12:26 PM, Peter Cock wrote: > On Fri, May 10, 2013 at 5:20 PM, Iddo Friedberg wrote: > > On Fri, May 10, 2013 at 6:06 AM, Peter Cock wrote: > >> > >> Would it make sense to want random access to the GOA files based > >> on the identifier (DB_Object_ID and/or DB_Object_Symbol)? That > >> should be fairly straight forward to do building on the indexing code > >> for Bio.SeqIO and SearchIO. > > > > > > Would that require reading it all into memory? Uniprot_GOA files > > are huge, it is impractical to read them in fully. > > Not at all - we'd record a dictionary mapping the record ID to an offset > in the file on disk, or record this mapping in an SQLite index file. > Ok, that's good then > >> Note here I am picturing combining all the (consecutive) lines > >> for the same DB_Object_ID - currently the parser is line based, > >> but batching by DB_Object_ID would be a straightforward change > >> and may better suit some uses. 
> > > > Perhaps only for organism specific file, which in some cases can > > be read fully into memory. > > The examples I looked at only seemed to have a dozen or so > lines for each DB_Object_ID - but perhaps these were easy > cases? How many lines per DB_Object_ID in the worst cases? > > Peter > I was actually thinking you are suggesting that the whole file should be read in memory, nit just buffer by DB-Object_ID. My mistake. -- Iddo Friedberg http://iddo-friedberg.net/contact.html ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>----.<--.>++++++.<<<<------------------------------------. From linxzh1989 at gmail.com Sun May 12 08:57:25 2013 From: linxzh1989 at gmail.com (=?GB2312?B?wdbQ0Nba?=) Date: Sun, 12 May 2013 20:57:25 +0800 Subject: [Biopython-dev] Errors about installing biopython 1.61 In-Reply-To: References: Message-ID: I am very Sorry about my mistake. I want to install biopython 1.61 in a local server(CentOS), python setup.py build python setup.py test and then showed some errors: ====================================================================== FAIL: Test an input file containing a single sequence. ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Clustalw_tool.py", line 166, in test_single_sequence self.assertTrue(str(err) == "No records found in handle") AssertionError ====================================================================== ERROR: Test Entrez.read from URL ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez_online.py", line 34, in test_read_from_url rec = Entrez.read(einfo) File "/share/fg2/Linxzh/biopython-1.61/build/lib.linux-x86_64-2.6/Bio/Entrez/__init__.py", line 362, in read record = handler.read(handle) File "/share/fg2/Linxzh/biopython-1.61/build/lib.linux-x86_64-2.6/Bio/Entrez/Parser.py", line 184, in read self.parser.ParseFile(handle) File "/share/fg2/Linxzh/biopython-1.61/build/lib.linux-x86_64-2.6/Bio/Entrez/Parser.py", line 322, in endElementHandler raise RuntimeError(value) RuntimeError: Unable to open connection to #DbInfo?dbaf= ====================================================================== ERROR: Run tutorial doctests. ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Tutorial.py", line 152, in test_doctests ValueError: 4 Tutorial doctests failed: test_from_line_05671, test_from_line_06030, test_from_line_06190, test_from_line_06479 ---------------------------------------------------------------------- Ran 213 tests in 1621.002 seconds FAILED (failures = 3) i use python 2.6.5 2013/5/12 ?????? : > I've run the From saketkc at gmail.com Sun May 12 14:11:46 2013 From: saketkc at gmail.com (Saket Choudhary) Date: Sun, 12 May 2013 23:41:46 +0530 Subject: [Biopython-dev] samtools threaded daemon In-Reply-To: References: <516653BE.8060509@brueffer.de> Message-ID: Just completed writing samtools wrapper : https://github.com/biopython/biopython/pull/180 Unit Tests pending. 
On 11 April 2013 23:51, Chris Mitchell wrote: > Here's the branch I'm starting with, including a working mpileup daemon for > those who want to use it: > > https://github.com/chrismit/biopython/tree/samtools > > sample usage: > from Bio.SamTools import SamTools > sTools = '/home/chris/bin/samtools' > hg19 = '/media/chris/ChrisSSD/ref/human/hg19.fa' > bamSource = '/media/chris/ChrisSSD/TH1Alignment/NK/accepted_hits.bam' > st = SamTools(bamSource,binary=sTools,threads=30) > > #now with a callback, which is advisable to use to process data as it is > generated > def processPileup(pileup): > print 'to process',pileup > > #st.mpileup(f=hg19,r=['chr1:%d-%d'%(i,i+1) for i in > xrange(2000001,2001001)],callback=processPileup) #with callback > #print st.mpileup(f=hg19,r=['chr1:%d-%d'%(i,i+1) for i in > xrange(2000001,2000101)]) #will just return as a list > > > On Thu, Apr 11, 2013 at 10:04 AM, Chris Mitchell wrote: > >> Given that we'd be chasing after the samtools development cycle, I think >> it's just easier to implement command line wrappers that are dynamic enough >> to handle future versions. For instance, some of the code doesn't seem too >> set in stone and appears empirical (the BAQ computation comes to mind) and >> therefore probable to change in future versions. I can package in my >> existing pileup parser, but in general I think most people will be using a >> callback routine to handle it themselves since use cases of the final >> output sort of vary project by project. >> >> Chris >> >> >> On Thu, Apr 11, 2013 at 9:54 AM, Peter Cock wrote: >> >>> On Thu, Apr 11, 2013 at 2:46 PM, Chris Mitchell >>> wrote: >>> > Also, if a binary can't be found, having it fallback to the future >>> > BioPython parser seems like it might be a good idea (provided it has >>> > similar functionality like creating pileups, does it?). >>> >>> It has the low level random access via the BAI index done, but >>> does not yet have a reimplementation of the mpileup code, no. >>> (Would that be useful compared to calling samtools and parsing >>> its output?) >>> >>> Peter >>> >> >> > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From linxzh1989 at gmail.com Sun May 12 21:41:30 2013 From: linxzh1989 at gmail.com (=?GB2312?B?wdbQ0Nba?=) Date: Mon, 13 May 2013 09:41:30 +0800 Subject: [Biopython-dev] Errors about installing biopython 1.61 In-Reply-To: References: Message-ID: 2013/5/13 Peter Cock : > On Sun, May 12, 2013 at 1:57 PM, ?????? wrote: >> I want to install biopython 1.61 in a local server(CentOS), >> python setup.py build >> python setup.py test >> and then showed some errors: >> >> ... >> >> i use python 2.6.5 >> > > Thank you for getting in touch, and including the important > information about the operating system, version of Python > and version of Biopython. > >> FAIL: Test an input file containing a single sequence. >> ---------------------------------------------------------------------- >> Traceback (most recent call last): >> File "test_Clustalw_tool.py", line 166, in test_single_sequence >> self.assertTrue(str(err) == "No records found in handle") >> AssertionError >> > > This test calls the command line tool clustalw. > > What version of clustalw do you have? 
> >> ERROR: Test Entrez.read from URL >> ---------------------------------------------------------------------- >> Traceback (most recent call last): >> File "test_Entrez_online.py", line 34, in test_read_from_url >> rec = Entrez.read(einfo) >> File "/share/fg2/Linxzh/biopython-1.61/build/lib.linux-x86_64-2.6/Bio/Entrez/__init__.py", >> line 362, in read >> record = handler.read(handle) >> File "/share/fg2/Linxzh/biopython-1.61/build/lib.linux-x86_64-2.6/Bio/Entrez/Parser.py", >> line 184, in read >> self.parser.ParseFile(handle) >> File "/share/fg2/Linxzh/biopython-1.61/build/lib.linux-x86_64-2.6/Bio/Entrez/Parser.py", >> line 322, in endElementHandler >> raise RuntimeError(value) >> RuntimeError: Unable to open connection to #DbInfo?dbaf= >> > > This test connects to the NCBI Entrez server over the internet. > This kind of error is usually a temporary network problem, and > will go away if you repeat the test later. > >> ERROR: Run tutorial doctests. >> ---------------------------------------------------------------------- >> Traceback (most recent call last): >> File "test_Tutorial.py", line 152, in test_doctests >> ValueError: 4 Tutorial doctests failed: test_from_line_05671, >> test_from_line_06030, test_from_line_06190, test_from_line_06479 > > Those four failing examples in the Tutorial seem to match this > commit, made just before the Biopython 1.61 release: > > https://github.com/biopython/biopython/commit/b84bda01bd22e93a1cf71613a55 February 2013 (Biopython 1.61)cfca876b7128d7#Doc/Tutorial.tex > > Where did you get the Biopython 1.61 files from? e.g. The zip file > or tar.gz file on our website? Perhaps I accidentally included an > older copy of the Doc/Tutorial.tex file? Could you look for the > "Late Update" line in your Tutorial.tex file for me - does it say: > > \date{Last Update -- 5 February 2013 (Biopython 1.61)} > > Thanks, > > Peter Hi??Peter?? Clustalw I am using is 1.83. I've found the 'Late Update' in Tutorial.tex, it's ' \date{Last Update -- 5 February 2013 (Biopython 1.61)}'. I downloaded the tar.gz from the biopython website. Thanks Lin From p.j.a.cock at googlemail.com Mon May 13 04:49:20 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 13 May 2013 09:49:20 +0100 Subject: [Biopython-dev] Errors about installing biopython 1.61 In-Reply-To: References: Message-ID: On Mon, May 13, 2013 at 2:41 AM, ??? wrote: > >> Where did you get the Biopython 1.61 files from? e.g. The zip file >> or tar.gz file on our website? Perhaps I accidentally included an >> older copy of the Doc/Tutorial.tex file? Could you look for the >> "Late Update" line in your Tutorial.tex file for me - does it say: >> >> \date{Last Update -- 5 February 2013 (Biopython 1.61)} >> >> Thanks, >> >> Peter > > Hi?Peter? > Clustalw I am using is 1.83. Hi Lin, I also have clustalw 1.83, so this isn't simply a version problem. It could be something subtle about the locale - what language is your CentOS running in (that can alter error messages etc)? > I've found the 'Late Update' in Tutorial.tex, it's ' \date{Last Update -- 5 > February 2013 (Biopython 1.61)}'. That's good - that's what it should say :) (Sorry late/last was my typing error). > > I downloaded the tar.gz from the biopython website. > Thanks. I could reproduce the test_Tutorial.py problem with that. This is easy to explain - I forgot to include the test file my_blast.xml when doing the release (and you are the first person to report this problem). 
I should have noticed this myself, sorry :( I've fixed this ready for the next release - thank you for reporting this: https://github.com/biopython/biopython/commit/c1b63b88dd5a50fa3f6f2aef840a51fe9092e0c5 If you want to, you can get the missing file from here: http://biopython.org/SRC/Doc/examples/my_blast.xml or: https://github.com/biopython/biopython/raw/master/Doc/examples/my_blast.xml If you save that in the Biopython 1.61 source under Doc/examples then the Tutorial test should pass. -- Did you retry the test_Entrez_online.py example to see if this was a temporary problem? -- The good news is these minor issues should not cause you any problems installing and using Biopython 1.61 - so you can go ahead and run 'python setup.py install. Thanks, Peter From linxzh1989 at gmail.com Mon May 13 10:34:31 2013 From: linxzh1989 at gmail.com (=?GB2312?B?wdbQ0Nba?=) Date: Mon, 13 May 2013 22:34:31 +0800 Subject: [Biopython-dev] Errors about installing biopython 1.61 In-Reply-To: References: Message-ID: 2013/5/13 Peter Cock > On Mon, May 13, 2013 at 2:41 AM, ?????? wrote: > > > >> Where did you get the Biopython 1.61 files from? e.g. The zip file > >> or tar.gz file on our website? Perhaps I accidentally included an > >> older copy of the Doc/Tutorial.tex file? Could you look for the > >> "Late Update" line in your Tutorial.tex file for me - does it say: > >> > >> \date{Last Update -- 5 February 2013 (Biopython 1.61)} > >> > >> Thanks, > >> > >> Peter > > > > Hi??Peter?? > > Clustalw I am using is 1.83. > > Hi Lin, > > I also have clustalw 1.83, so this isn't simply a version > problem. It could be something subtle about the locale - > what language is your CentOS running in (that can alter > error messages etc)? > > > I've found the 'Late Update' in Tutorial.tex, it's ' \date{Last Update > -- 5 > > February 2013 (Biopython 1.61)}'. > > That's good - that's what it should say :) > > (Sorry late/last was my typing error). > > > > > I downloaded the tar.gz from the biopython website. > > > > Thanks. I could reproduce the test_Tutorial.py problem with that. > This is easy to explain - I forgot to include the test file my_blast.xml > when doing the release (and you are the first person to report this > problem). I should have noticed this myself, sorry :( > > I've fixed this ready for the next release - thank you for reporting this: > > https://github.com/biopython/biopython/commit/c1b63b88dd5a50fa3f6f2aef840a51fe9092e0c5 > > If you want to, you can get the missing file from here: > http://biopython.org/SRC/Doc/examples/my_blast.xml > > or: > https://github.com/biopython/biopython/raw/master/Doc/examples/my_blast.xml > > If you save that in the Biopython 1.61 source under Doc/examples > then the Tutorial test should pass. > > -- > > Did you retry the test_Entrez_online.py example to see if > this was a temporary problem? > > -- > > The good news is these minor issues should not cause you > any problems installing and using Biopython 1.61 - so you > can go ahead and run 'python setup.py install. > > Thanks, > > Peter > Hi Peter I have run the locale in my serve $ locale LANG=en_US.UTF-8 LC_CTYPE=zh_CN.UTF-8 LC_NUMERIC=zh_CN.UTF-8 LC_TIME=zh_CN.UTF-8 LC_COLLATE="en_US.UTF-8" LC_MONETARY=zh_CN.UTF-8 LC_MESSAGES="en_US.UTF-8" LC_PAPER=zh_CN.UTF-8 LC_NAME=zh_CN.UTF-8 LC_ADDRESS=zh_CN.UTF-8 LC_TELEPHONE=zh_CN.UTF-8 LC_MEASUREMENT=zh_CN.UTF-8 LC_IDENTIFICATION=zh_CN.UTF-8 LC_ALL= Is that locale you want? I retryed the the test_Entrez_online.py, it's all right now. 
As you said, it should be a connection problem. I have put the file in the Doc/examples file, but the error still exists. And i find there is no my_blat.psl in Doc/examples comparing with the zip file i downloaded from github. After i put the my_blat.psi in the Doc/examples, the error did not show up again. Thanks Lin From p.j.a.cock at googlemail.com Mon May 13 11:50:26 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 13 May 2013 16:50:26 +0100 Subject: [Biopython-dev] Errors about installing biopython 1.61 In-Reply-To: References: Message-ID: On Mon, May 13, 2013 at 3:34 PM, ??? wrote: > > Hi Peter > I have run the locale in my serve > > $ locale > LANG=en_US.UTF-8 > LC_CTYPE=zh_CN.UTF-8 > LC_NUMERIC=zh_CN.UTF-8 > LC_TIME=zh_CN.UTF-8 > LC_COLLATE="en_US.UTF-8" > LC_MONETARY=zh_CN.UTF-8 > LC_MESSAGES="en_US.UTF-8" > LC_PAPER=zh_CN.UTF-8 > LC_NAME=zh_CN.UTF-8 > LC_ADDRESS=zh_CN.UTF-8 > LC_TELEPHONE=zh_CN.UTF-8 > LC_MEASUREMENT=zh_CN.UTF-8 > LC_IDENTIFICATION=zh_CN.UTF-8 > LC_ALL= > > Is that locale you want? Hi Lin, Thanks for checking that, but having looked in more detail I think this is not related to the locale settings. My first guess was wrong :( I think I may have solved this - my test machine has both clustalw 2.1 and clustalw 1.83, and they behave differently for this example. The old test only worked with v2.1, fixed: https://github.com/biopython/biopython/commit/859d07f3c5e8b789156a5ec2e98f4153ab896e00 If you want to verify this, you could update your copy of Tests/test_Clustalw_tool.py to that from github (or just tried installing the latest Biopython code from github?). Note the Clustal developers intended that clustalw 1 and 2 would behave the same as each other (Version 2 was a rewrite as a step towards version 3, no called ClustalOmega), but there are still some minor differences. > I retryed the the test_Entrez_online.py, it's all right now. As > you said, it should be a connection problem. OK, good. > I have put the file in the Doc/examples file, but the error still exists. > And i find there is no my_blat.psl in Doc/examples comparing with the zip > file i downloaded from github. After i put the my_blat.psi in the > Doc/examples, the error did not show up again. Thank you, that should be fixed in the next release: https://github.com/biopython/biopython/commit/a3bb49b56abb5cbb9a0a00accb57674115c7004d Your feedback has been very helpful, Thanks, Peter From linxzh1989 at gmail.com Mon May 13 21:32:23 2013 From: linxzh1989 at gmail.com (=?GB2312?B?wdbQ0Nba?=) Date: Tue, 14 May 2013 09:32:23 +0800 Subject: [Biopython-dev] Errors about installing biopython 1.61 In-Reply-To: References: Message-ID: Hi Peter I copy the test_Clustalw_tool.py from the github, now it does work. Thank you! Lin 2013/5/13 Peter Cock > On Mon, May 13, 2013 at 3:34 PM, ?????? wrote: > > > > Hi Peter > > I have run the locale in my serve > > > > $ locale > > LANG=en_US.UTF-8 > > LC_CTYPE=zh_CN.UTF-8 > > LC_NUMERIC=zh_CN.UTF-8 > > LC_TIME=zh_CN.UTF-8 > > LC_COLLATE="en_US.UTF-8" > > LC_MONETARY=zh_CN.UTF-8 > > LC_MESSAGES="en_US.UTF-8" > > LC_PAPER=zh_CN.UTF-8 > > LC_NAME=zh_CN.UTF-8 > > LC_ADDRESS=zh_CN.UTF-8 > > LC_TELEPHONE=zh_CN.UTF-8 > > LC_MEASUREMENT=zh_CN.UTF-8 > > LC_IDENTIFICATION=zh_CN.UTF-8 > > LC_ALL= > > > > Is that locale you want? > > Hi Lin, > > Thanks for checking that, but having looked in more detail > I think this is not related to the locale settings. 
My first guess > was wrong :( > > I think I may have solved this - my test machine has both > clustalw 2.1 and clustalw 1.83, and they behave differently > for this example. The old test only worked with v2.1, fixed: > > https://github.com/biopython/biopython/commit/859d07f3c5e8b789156a5ec2e98f4153ab896e00 > > If you want to verify this, you could update your copy of > Tests/test_Clustalw_tool.py to that from github (or just > tried installing the latest Biopython code from github?). > > Note the Clustal developers intended that clustalw 1 and 2 > would behave the same as each other (Version 2 was a > rewrite as a step towards version 3, no called ClustalOmega), > but there are still some minor differences. > > > I retryed the the test_Entrez_online.py, it's all right now. As > > you said, it should be a connection problem. > > OK, good. > > > I have put the file in the Doc/examples file, but the error still exists. > > And i find there is no my_blat.psl in Doc/examples comparing with the zip > > file i downloaded from github. After i put the my_blat.psi in the > > Doc/examples, the error did not show up again. > > Thank you, that should be fixed in the next release: > > https://github.com/biopython/biopython/commit/a3bb49b56abb5cbb9a0a00accb57674115c7004d > > Your feedback has been very helpful, > > Thanks, > > Peter > From idoerg at gmail.com Fri May 17 17:35:41 2013 From: idoerg at gmail.com (Iddo Friedberg) Date: Fri, 17 May 2013 17:35:41 -0400 Subject: [Biopython-dev] UniProt GOA parser In-Reply-To: References: Message-ID: OK. I added a few changes as suggested by Peter. There is a parser now to group GAF files by DB_Object_ID, and a write function to write them. Random access not implemented yet. On Fri, May 10, 2013 at 12:32 PM, Iddo Friedberg wrote: > > > On Fri, May 10, 2013 at 12:26 PM, Peter Cock wrote: > >> On Fri, May 10, 2013 at 5:20 PM, Iddo Friedberg wrote: >> > On Fri, May 10, 2013 at 6:06 AM, Peter Cock wrote: >> >> >> >> Would it make sense to want random access to the GOA files based >> >> on the identifier (DB_Object_ID and/or DB_Object_Symbol)? That >> >> should be fairly straight forward to do building on the indexing code >> >> for Bio.SeqIO and SearchIO. >> > >> > >> > Would that require reading it all into memory? Uniprot_GOA files >> > are huge, it is impractical to read them in fully. >> >> Not at all - we'd record a dictionary mapping the record ID to an offset >> in the file on disk, or record this mapping in an SQLite index file. >> > > Ok, that's good then > > >> >> Note here I am picturing combining all the (consecutive) lines >> >> for the same DB_Object_ID - currently the parser is line based, >> >> but batching by DB_Object_ID would be a straightforward change >> >> and may better suit some uses. >> > >> > Perhaps only for organism specific file, which in some cases can >> > be read fully into memory. >> >> The examples I looked at only seemed to have a dozen or so >> lines for each DB_Object_ID - but perhaps these were easy >> cases? How many lines per DB_Object_ID in the worst cases? >> >> Peter >> > > > I was actually thinking you are suggesting that the whole file should be > read in memory, nit just buffer by DB-Object_ID. My mistake. > > > -- > Iddo Friedberg > http://iddo-friedberg.net/contact.html > ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> > ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. 
> .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> > >>----.<--.>++++++.<<<<------------------------------------. > -- Iddo Friedberg http://iddo-friedberg.net/contact.html ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>----.<--.>++++++.<<<<------------------------------------. From p.j.a.cock at googlemail.com Mon May 20 09:16:45 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 20 May 2013 14:16:45 +0100 Subject: [Biopython-dev] UniProt GOA parser In-Reply-To: References: Message-ID: On Fri, May 17, 2013 at 10:35 PM, Iddo Friedberg wrote: > > > OK. I added a few changes as suggested by Peter. > > There is a parser now to group GAF files by DB_Object_ID, and a write > function to write them. Random access not implemented yet. > Hi Iddo, Over on this branch building on your work I moved things under Bio.UniProt.GOA, and got things a bit more in line with PEP8: https://github.com/peterjc/biopython/tree/uniprot-goa (Drop me an email off list if you need a hand pulling those changes into your branch) Do you want to have a go at re-using the index code in Bio.File (the back end for SeqIO and SearchIO's indexing)? Let me know if the current setup is too mysterious and I can try and document more of it and/or do this for the GOA module. Peter From redmine at redmine.open-bio.org Tue May 21 08:24:34 2013 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 21 May 2013 12:24:34 +0000 Subject: [Biopython-dev] [Biopython - Feature #3432] (New) Updated/Extended module MeltingTemp in Bio.SeqUtils Message-ID: Issue #3432 has been reported by Markus Piotrowski. ---------------------------------------- Feature #3432: Updated/Extended module MeltingTemp in Bio.SeqUtils https://redmine.open-bio.org/issues/3432 Author: Markus Piotrowski Status: New Priority: Normal Assignee: Category: Target version: URL: Dear Biopython developers, I updated/extended the MeltingTemp module of SeqUtils and would be happy if you would consider it for implementing. Please find the source code attached. Any feedback is appreciated. 'Old' module: One method, Tm_staluc, which calculates the melting temperature by the nearest neighbor method, using two different thermodynamic data sets for DNA and RNA. Fixed salt correction formula. 'Updated' module: 1. Three different Tm calculations: one 'rule of thumb' (Tm_Wallace), one using approximative formulas basing on GC content (Tm_GC) and one using nearest neighbor calculations (Tm_NN). 2. The new Tm_NN allows the usage of different thermodynamic datasets (8 tables are included for Watson-Crick base-pairing) and includes tables for mismatches (including inosine) and dangling ends. The datasets are Python dictionaries; the user can use his own datasets or change/update existing tables for his needs. 3. Seven different formulas to correct for salt concentration, including correction for Mg2+ ions (method salt_correction). 4. Method chem_correction which allows for Tm correction when using DMSO and formaldehyde. I haven't touched the old Tm_staluc method (except adding some comments [labelled 'MP'] and a deprecation warning). Actually, the method has two problems on the RNA side: The dataset for RNA is faulty and 'U' isn't considered as input. 
Of course this problems can easily be fixed, however, I would prefer (if it is decided to accept the updated module) to completely exchange the body of Tm_staluc for calls to Tm_NN (as outlined in the comments). There is one thing, that I'm uneasy with: For terminal mismatches, I used thermodynamic data from a patent application that has been withdrawn (http://patentscope.wipo.int/search/en/WO2001094611). Actually, I found the reference in the manual for Primer3 which also seems to use these data (http://primer3.sourceforge.net/primer3_manual.htm). Indeed, the Primer3 source (which is distributed under GPLv2) contains the data. Best wishes, Markus ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From p.j.a.cock at googlemail.com Wed May 22 09:45:00 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 22 May 2013 14:45:00 +0100 Subject: [Biopython-dev] UniProt GOA parser In-Reply-To: References: Message-ID: On Mon, May 20, 2013 at 7:09 PM, Iddo Friedberg wrote: >> Do you want to have a go at re-using the index code in Bio.File >> (the back end for SeqIO and SearchIO's indexing)? Let me know >> if the current setup is too mysterious and I can try and document >> more of it and/or do this for the GOA module. > > I'd like to have a go.. > > ./I Great - a few more details then, The second part of Bio/File.py has some private classes _IndexedSeqFileProxy and _IndexedSeqFileDict and _SQLiteManySeqFilesDict which can be used for any sequential record file format (meaning one after the other, not just biological sequences). These are used by the Bio.SeqIO.index() and index_db() functions, and their sisters in Bio.SearchIO. The idea is you write a subclass of _IndexedSeqFileProxy for your new file format, and then this gets used by either _IndexedSeqFileDict (in memory offset dictionary) or _SQLiteManySeqFilesDict (SQLite offset dictionary). Your _IndexedSeqFileProxy subclass has to define an __iter__ method which loops over the file giving a tuple for each record giving the identifier string and the start offset, and ideally the length in bytes. It must also define a get method which must seek to the offset and then parse the record. For the GOA files, the __iter__ loop will just spot batches of lines for the same identifier which together make up a single record. I managed to explain the setup to Bow, and he got it to work for SearchIO, but we were doing face to face video chats for that during GSoC last year. Fresh eyes will surely find some more rough edges in my docs ;) Regards, Peter From pgarland at gmail.com Sun May 26 22:27:05 2013 From: pgarland at gmail.com (Phillip Garland) Date: Sun, 26 May 2013 19:27:05 -0700 Subject: [Biopython-dev] test_SeqIO_online failure Message-ID: The fasta formatted record is fine, the problem seems to come after requesting and reading the genbank-formatted record for the protein with GI:16130152. It looks like the record was modified a few days ago: LOCUS NP_416719 367 aa linear CON 24-MAY-2013 and ends with CONTIG join(WP_000865568.1:1..367)\n//\n\n' instead of ORIGIN and the sequence data. 
Is this a problem with the genbank record that should be reported to NCBI, or is SeqIO supposed to handle the record as it is by fetching the sequence from the linked contig, or is the test doing the wrong thing by using rettype="gb" instead of rettype="gbwithparts"? Here's the test output: pgarland at cradle:~/Hacking/Source/Biology/biopython/Tests$ python run_tests.py test_SeqIO_online.py Python version: 2.7.5 (default, May 20 2013, 11:51:12) [GCC 4.7.3] Operating system: posix linux2 test_SeqIO_online ... FAIL ====================================================================== FAIL: test_protein_16130152 (test_SeqIO_online.EntrezTests) Bio.Entrez.efetch(protein, 16130152, ...) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/pgarland/Hacking/Source/Biology/biopython/Tests/test_SeqIO_online.py", line 77, in method = lambda x : x.simple(d, f, e, l, c) File "/home/pgarland/Hacking/Source/Biology/biopython/Tests/test_SeqIO_online.py", line 65, in simple self.assertEqual(seguid(record.seq), checksum) AssertionError: 'NT/aFiTXyD/7KixizZ9sq2FcniU' != 'fCjcjMFeGIrilHAn6h+yju267lg' ---------------------------------------------------------------------- Ran 1 test in 10.010 seconds FAILED (failures = 1) ~Phillip From kai.blin at biotech.uni-tuebingen.de Mon May 27 02:19:20 2013 From: kai.blin at biotech.uni-tuebingen.de (Kai Blin) Date: Mon, 27 May 2013 08:19:20 +0200 Subject: [Biopython-dev] SearchIO: Fix a bug in the HMMer2 text parser Message-ID: <51A2FAE8.1040408@biotech.uni-tuebingen.de> Hi folks, I've run into and fixed a bug in the hmmer2-text parser when parsing consensus lines. The pull request is at https://github.com/biopython/biopython/pull/182 Cheers, Kai -- Dipl.-Inform. Kai Blin kai.blin at biotech.uni-tuebingen.de Institute for Microbiology and Infection Medicine Division of Microbiology/Biotechnology Eberhard-Karls-Universit?t T?bingen Auf der Morgenstelle 28 Phone : ++49 7071 29-78841 D-72076 T?bingen Fax : ++49 7071 29-5979 Germany Homepage: http://www.mikrobio.uni-tuebingen.de/ag_wohlleben From p.j.a.cock at googlemail.com Mon May 27 05:05:44 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 27 May 2013 10:05:44 +0100 Subject: [Biopython-dev] test_SeqIO_online failure In-Reply-To: References: Message-ID: Hi Philip, On Mon, May 27, 2013 at 3:27 AM, Phillip Garland wrote: > The fasta formatted record is fine, the problem seems to come after > requesting and reading the genbank-formatted record for the protein > with GI:16130152. > > It looks like the record was modified a few days ago: > > LOCUS NP_416719 367 aa linear CON 24-MAY-2013 > > and ends with > > CONTIG join(WP_000865568.1:1..367)\n//\n\n' > > instead of > > ORIGIN and the sequence data. > > Is this a problem with the genbank record that should be reported to > NCBI, or is SeqIO supposed to handle the record as it is by fetching > the sequence from the linked contig, or is the test doing the wrong > thing by using rettype="gb" instead of rettype="gbwithparts"? Interesting - it looks like the NCBI made a change to Entrez and where previously this record had included the sequence with rettype="gb" now we have to ask for it explicitly with the longer rettype="gbwithparts" - my guess is this is now happening on more records. 
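For records affected in this way, a minimal sketch of the workaround the
rest of this thread converges on - asking Entrez explicitly for the full
flat file with rettype="gbwithparts" and parsing it with the existing
GenBank parser (the GI number is the one from the failing test above;
SeqIO itself only knows the "gb"/"genbank" format names):

from Bio import Entrez, SeqIO

Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
handle = Entrez.efetch(db="protein", id="16130152",
                       rettype="gbwithparts", retmode="text")
record = SeqIO.read(handle, "gb")  # SeqIO format name stays "gb"/"genbank"
handle.close()
print("%s is %i residues long" % (record.id, len(record)))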
Note it does not affect all records, consider this example in our Tutorial which seems unchanged: from Bio import Entrez Entrez.email = "A.N.Other at example.com" # Always tell NCBI who you are handle = Entrez.efetch(db="nucleotide", id="186972394", rettype="gb", retmode="text") print handle.read() Curious. > Here's the test output: > > pgarland at cradle:~/Hacking/Source/Biology/biopython/Tests$ python > run_tests.py test_SeqIO_online.py > Python version: 2.7.5 (default, May 20 2013, 11:51:12) > [GCC 4.7.3] > Operating system: posix linux2 > test_SeqIO_online ... FAIL > ====================================================================== > FAIL: test_protein_16130152 (test_SeqIO_online.EntrezTests) > Bio.Entrez.efetch(protein, 16130152, ...) > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "/home/pgarland/Hacking/Source/Biology/biopython/Tests/test_SeqIO_online.py", > line 77, in > method = lambda x : x.simple(d, f, e, l, c) > File "/home/pgarland/Hacking/Source/Biology/biopython/Tests/test_SeqIO_online.py", > line 65, in simple > self.assertEqual(seguid(record.seq), checksum) > AssertionError: 'NT/aFiTXyD/7KixizZ9sq2FcniU' != 'fCjcjMFeGIrilHAn6h+yju267lg' > > ---------------------------------------------------------------------- > Ran 1 test in 10.010 seconds > > FAILED (failures = 1) I'd noticed this on Friday but hadn't looked into why the sequence was different (and sometimes Entrez errors are transient). Thanks for exploring this :) Would you like to submit a pull request to update test_SeqIO_online.py or should I just go ahead and change the rettype? It would be sensible to review all the Entrez examples in the Tutorial, to perhaps make more use of 'gbwithparts' rather than 'gb'? Thanks, Peter From pgarland at gmail.com Mon May 27 17:38:30 2013 From: pgarland at gmail.com (Phillip Garland) Date: Mon, 27 May 2013 14:38:30 -0700 Subject: [Biopython-dev] test_SeqIO_online failure In-Reply-To: References: Message-ID: Hi Peter, On Mon, May 27, 2013 at 2:05 AM, Peter Cock wrote: > Hi Philip, > > On Mon, May 27, 2013 at 3:27 AM, Phillip Garland wrote: >> The fasta formatted record is fine, the problem seems to come after >> requesting and reading the genbank-formatted record for the protein >> with GI:16130152. >> >> It looks like the record was modified a few days ago: >> >> LOCUS NP_416719 367 aa linear CON 24-MAY-2013 >> >> and ends with >> >> CONTIG join(WP_000865568.1:1..367)\n//\n\n' >> >> instead of >> >> ORIGIN and the sequence data. >> >> Is this a problem with the genbank record that should be reported to >> NCBI, or is SeqIO supposed to handle the record as it is by fetching >> the sequence from the linked contig, or is the test doing the wrong >> thing by using rettype="gb" instead of rettype="gbwithparts"? > > Interesting - it looks like the NCBI made a change to Entrez and > where previously this record had included the sequence with > rettype="gb" now we have to ask for it explicitly with the longer > rettype="gbwithparts" - my guess is this is now happening on > more records. > > Note it does not affect all records, consider this example in our > Tutorial which seems unchanged: > > from Bio import Entrez > Entrez.email = "A.N.Other at example.com" # Always tell NCBI who you are > handle = Entrez.efetch(db="nucleotide", id="186972394", > rettype="gb", retmode="text") > print handle.read() > > Curious. 
> >> Here's the test output: >> >> pgarland at cradle:~/Hacking/Source/Biology/biopython/Tests$ python >> run_tests.py test_SeqIO_online.py >> Python version: 2.7.5 (default, May 20 2013, 11:51:12) >> [GCC 4.7.3] >> Operating system: posix linux2 >> test_SeqIO_online ... FAIL >> ====================================================================== >> FAIL: test_protein_16130152 (test_SeqIO_online.EntrezTests) >> Bio.Entrez.efetch(protein, 16130152, ...) >> ---------------------------------------------------------------------- >> Traceback (most recent call last): >> File "/home/pgarland/Hacking/Source/Biology/biopython/Tests/test_SeqIO_online.py", >> line 77, in >> method = lambda x : x.simple(d, f, e, l, c) >> File "/home/pgarland/Hacking/Source/Biology/biopython/Tests/test_SeqIO_online.py", >> line 65, in simple >> self.assertEqual(seguid(record.seq), checksum) >> AssertionError: 'NT/aFiTXyD/7KixizZ9sq2FcniU' != 'fCjcjMFeGIrilHAn6h+yju267lg' >> >> ---------------------------------------------------------------------- >> Ran 1 test in 10.010 seconds >> >> FAILED (failures = 1) > > I'd noticed this on Friday but hadn't looked into why the sequence was > different (and sometimes Entrez errors are transient). Thanks for > exploring this :) > > Would you like to submit a pull request to update test_SeqIO_online.py > or should I just go ahead and change the rettype? > > It would be sensible to review all the Entrez examples in the Tutorial, > to perhaps make more use of 'gbwithparts' rather than 'gb'? > > Thanks, > > Peter The slight problem with just replacing "gb" with "gbwithparts" is that SeqIO doesn't take "gbwithparts" as an option for the file format. So in test_SeqIO_online.py, you have this code: handle = Entrez.efetch(db=database, id=entry, rettype=f, retmode="text") record = SeqIO.read(handle, f) which is a natural way to write the test (because it tests fasta and genbank files), but will currently fail if f is "gbwithparts", b/c SeqIO doesn't accept "gbwithparts" as a file format specifier. My guess is that most existing code hardcodes the rettype and SeqIO file format specifier, so we could just test for gbwithparts prior to calling SeqIO.read: handle = Entrez.efetch(db=database, id=entry, rettype=f, retmode="text") if f == "gbwithparts": f = "gb" record = SeqIO.read(handle, f) I submitted a pull request with a minimal patch that does this. For code like this, it would be cleaner if SeqIO accepted, "gbwithparts" as an alias for "genbank", just like "gb" is, but I don't know if it's a common pattern enough to bother. If records like this are becoming more common, then "gbwithparts" should be clearly documented in the biopython tutorial, though "gbwithparts" isn't clearly explained in NCBI's Entrez docs AFAICT. It seems safer to always use "gbwithparts" at this point, at least when you want the sequence. ~Phillip From p.j.a.cock at googlemail.com Mon May 27 18:43:19 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 27 May 2013 23:43:19 +0100 Subject: [Biopython-dev] test_SeqIO_online failure In-Reply-To: References: Message-ID: On Mon, May 27, 2013 at 10:38 PM, Phillip Garland wrote: > Hi Peter, > >> I'd noticed this on Friday but hadn't looked into why the sequence was >> different (and sometimes Entrez errors are transient). Thanks for >> exploring this :) >> >> Would you like to submit a pull request to update test_SeqIO_online.py >> or should I just go ahead and change the rettype? 
>> >> It would be sensible to review all the Entrez examples in the Tutorial, >> to perhaps make more use of 'gbwithparts' rather than 'gb'? >> >> Thanks, >> >> Peter > > The slight problem with just replacing "gb" with "gbwithparts" is that > SeqIO doesn't take "gbwithparts" as an option for the file format. So > in test_SeqIO_online.py, you have this code: > > handle = Entrez.efetch(db=database, id=entry, rettype=f, > retmode="text") > record = SeqIO.read(handle, f) > > which is a natural way to write the test (because it tests fasta and > genbank files), but will currently fail if f is "gbwithparts", b/c > SeqIO doesn't accept "gbwithparts" as a file format specifier. My > guess is that most existing code hardcodes the rettype and SeqIO file > format specifier, so we could just test for gbwithparts prior to > calling SeqIO.read: > > handle = Entrez.efetch(db=database, id=entry, rettype=f, retmode="text") > if f == "gbwithparts": > f = "gb" > record = SeqIO.read(handle, f) > > I submitted a pull request with a minimal patch that does this. That's good for now :) > For code like this, it would be cleaner if SeqIO accepted, > "gbwithparts" as an alias for "genbank", just like "gb" is, but I > don't know if it's a common pattern enough to bother. That makes some sense for parsing files, but all those aliases would cause confusion with writing GenBank files. > If records like this are becoming more common, then "gbwithparts" > should be clearly documented in the biopython tutorial, though > "gbwithparts" isn't clearly explained in NCBI's Entrez docs AFAICT. It > seems safer to always use "gbwithparts" at this point, at least when > you want the sequence. Definitely - if the NCBI moves to using 'gb' as the light style without the sequence then many people will just want to use 'gbwithparts' as their default when scripting this sort of thing. Thanks, Peter From redmine at redmine.open-bio.org Tue May 28 03:50:41 2013 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 28 May 2013 07:50:41 +0000 Subject: [Biopython-dev] [Biopython - Bug #3433] (New) MMCIFParser fails on python3 for disordered atoms Message-ID: Issue #3433 has been reported by Alexander Campbell. ---------------------------------------- Bug #3433: MMCIFParser fails on python3 for disordered atoms https://redmine.open-bio.org/issues/3433 Author: Alexander Campbell Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: The new shlex based parser works under python3, but reveals that the changed comparison rules in python3 lead to unhandled exceptions when parsing disordered atoms. Furthermore, it reveals that occupancy and temperature factor attributes of Atom objects were never cast from str to float types when parsed from mmCIF files. The comparison code which raises the exception under python3 is at Atom.py line 333: @if occupancy>self.last_occupancy:@ . The exception can be prevented my modifying MMCIFParser.py to cast occupancy and temperature factor to float. The following patch is a basic copy of the equivalent code in PDBParser.py:
diff --git a/Bio/PDB/MMCIFParser.py b/Bio/PDB/MMCIFParser.py
index 64d16bc..4be6490 100644
--- a/Bio/PDB/MMCIFParser.py
+++ b/Bio/PDB/MMCIFParser.py
@@ -84,8 +84,15 @@ class MMCIFParser(object):
                 altloc=" "
             resseq=seq_id_list[i]
             name=atom_id_list[i]
-            tempfactor=b_factor_list[i]
-            occupancy=occupancy_list[i]
+            # occupancy & B factor
+            try:
+                tempfactor=float(b_factor_list[i])
+            except ValueError:
+                raise PDBConstructionException("Invalid or missing B factor")
+            try:
+                occupancy=float(occupancy_list[i])
+            except ValueError:
+                raise PDBConstructionException("Invalid or missing occupancy")
             fieldname=fieldname_list[i]
             if fieldname=="HETATM":
                 hetatm_flag="H"

This patch was tested with the "mmCIF file for PDB structure 3u8h":http://www.rcsb.org/pdb/download/downloadFile.do?fileFormat=cif&compression=NO&structureId=3U8H , which would cause the mmCIF parsing exception under python3.2. After the patch, there were no exceptions during parsing and the occupancy and bfactor attributes had the correct type (float). The patch was also tested under python2.7, which worked just fine and also showed the correct types. I haven't tested earlier versions of python2, but the simple syntax ought to work. Could a dev apply this patch? Or better yet, suggest a patch for casting the types at the StructureBuilder level, which would make such things independent of the specific parser used. This is just a minimal-quickfix patch, but I'm sure a better solution is possible. ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From redmine at redmine.open-bio.org Tue May 28 03:50:41 2013 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 28 May 2013 07:50:41 +0000 Subject: [Biopython-dev] [Biopython - Bug #3433] (New) MMCIFParser fails on python3 for disordered atoms Message-ID: Issue #3433 has been reported by Alexander Campbell. ---------------------------------------- Bug #3433: MMCIFParser fails on python3 for disordered atoms https://redmine.open-bio.org/issues/3433 Author: Alexander Campbell Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: The new shlex based parser works under python3, but reveals that the changed comparison rules in python3 lead to unhandled exceptions when parsing disordered atoms. Furthermore, it reveals that occupancy and temperature factor attributes of Atom objects were never cast from str to float types when parsed from mmCIF files. The comparison code which raises the exception under python3 is at Atom.py line 333: @if occupancy>self.last_occupancy:@ . The exception can be prevented my modifying MMCIFParser.py to cast occupancy and temperature factor to float. The following patch is a basic copy of the equivalent code in PDBParser.py:
From tiagoantao at gmail.com Tue May 28 07:14:53 2013
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Tue, 28 May 2013 12:14:53 +0100
Subject: [Biopython-dev] Compiling on modern windows (recent mingw)
Message-ID:

Hi,

I have been trying to setup a windows 8 buildbot. For that purpose I have
installed a recent version of mingw on a new win8 machine.

It seems that one of the compiling options of biopython (-mno-cygwin) is
deprecated. See here for more details:
http://korbinin.blogspot.co.uk/2013/03/cython-mno-cygwin-problems.html

--
"Grant me chastity and continence, but not yet" - St Augustine

From p.j.a.cock at googlemail.com Tue May 28 07:21:13 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 28 May 2013 12:21:13 +0100
Subject: [Biopython-dev] Compiling on modern windows (recent mingw)
In-Reply-To:
References:
Message-ID:

On Tue, May 28, 2013 at 12:14 PM, Tiago Antão wrote:
> Hi,
>
> I have been trying to setup a windows 8 buildbot. For that purpose I have
> installed a recent version of mingw on a new win8 machine.
>
> It seems that one of the compiling options of biopython (-mno-cygwin) is
> deprecated. See here for more details:
> http://korbinin.blogspot.co.uk/2013/03/cython-mno-cygwin-problems.html

Looks like there's a confusing open bug about just removing this
argument from Python's distutils - http://bugs.python.org/issue12641

For now does the hack of editing Lib\distutils\cygwinccompiler.py yourself
get it to work? I could live with that on the build slave, coupled with a
warning in our install documentation for the brave people self-compiling
under Windows.

Peter

From tiagoantao at gmail.com Tue May 28 08:04:32 2013
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Tue, 28 May 2013 13:04:32 +0100
Subject: [Biopython-dev] Compiling on modern windows (recent mingw)
In-Reply-To:
References:
Message-ID:

Hi,

On Tue, May 28, 2013 at 12:21 PM, Peter Cock wrote:
> For now does the hack of editing Lib\distutils\cygwinccompiler.py yourself
> get it to work? I could live with that on the build slave, coupled with a
> warning in our install documentation for the brave people self-compiling
> under Windows.
>

I have hacked my distutils implementation. It compiled OK.

That being said, there seems to be some problems with Bio.Applications on
win8:
http://testing.open-bio.org/biopython/builders/Windows%208%20-%20Python%202.7/builds/12/steps/shell/logs/stdio

--
"Grant me chastity and continence, but not yet"
- St Augustine From p.j.a.cock at googlemail.com Tue May 28 10:09:40 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 28 May 2013 15:09:40 +0100 Subject: [Biopython-dev] Compiling on modern windows (recent mingw) In-Reply-To: References: Message-ID: On Tue, May 28, 2013 at 1:04 PM, Tiago Ant?o wrote: > Hi, > > > On Tue, May 28, 2013 at 12:21 PM, Peter Cock > wrote: >> >> For now does the hack of editing Lib\distutils\cygwinccompiler.py yourself >> get it to work? I could live with that on the build slave, coupled with a >> warning in our install documentation for the brave people self-compiling >> under Windows. > > I have hacked my distutils implementation. It compiled OK. That's encouraging. > That being said, there seems to be some problems with Bio.Applications on > win8: > http://testing.open-bio.org/biopython/builders/Windows%208%20-%20Python%202.7/builds/12/steps/shell/logs/stdio Could you confirm output sys.platform is "win32" still? I've got a hunch that spaces in the executable path might explain some of these failures - I'm trying a patch for that here. Some of the other failures appear to be down to newline differences (the \r in some of the output suggests this). Here we can probably use universal new lines mode for file input, but I am puzzled why these pass under Windows XP with an older mingw32 or the Intel compiler. Peter From tiagoantao at gmail.com Tue May 28 10:40:02 2013 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Tue, 28 May 2013 15:40:02 +0100 Subject: [Biopython-dev] Compiling on modern windows (recent mingw) In-Reply-To: References: Message-ID: On Tue, May 28, 2013 at 3:09 PM, Peter Cock wrote: > Could you confirm output sys.platform is "win32" still? > Yup T -- ?Grant me chastity and continence, but not yet? - St Augustine From p.j.a.cock at googlemail.com Tue May 28 12:36:20 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 28 May 2013 17:36:20 +0100 Subject: [Biopython-dev] Compiling on modern windows (recent mingw) In-Reply-To: References: Message-ID: On Tue, May 28, 2013 at 3:09 PM, Peter Cock wrote: > > I've got a hunch that spaces in the executable path might explain > some of these failures - I'm trying a patch for that here. Hi Tiago, Patch applied to master - this is essential for the rare case of calling a binary under Unix where the path/filename includes a space, but appears to be redundant under Windows XP: https://github.com/biopython/biopython/commit/815de571b623f1cd3659fe4c80e3917e1a437580 I'm curious if that matters under Windows 8 or not - trying the example in the commit comment at the command line might be illuminating. Peter P.S. Saket - You might remember I touched on this issue in our discussion on GitHub about your bwa/samtools wrappers, which led to this commit keeping self.program_name as the binary only: https://github.com/biopython/biopython/commit/ca93be741c8fd9bad67106acb455348251797f3a From tiagoantao at gmail.com Tue May 28 12:50:39 2013 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Tue, 28 May 2013 17:50:39 +0100 Subject: [Biopython-dev] Compiling on modern windows (recent mingw) In-Reply-To: References: Message-ID: On Tue, May 28, 2013 at 5:36 PM, Peter Cock wrote: > I'm curious if that matters under Windows 8 or not - trying > the example in the commit comment at the command line > might be illuminating. > I just re-scheduled a testing case and the results were not great... 
http://testing.open-bio.org/biopython/builders/Windows%208%20-%20Python%202.7/builds/13/steps/shell/logs/stdio I will test this manually and in deep when I arrive home today. -- ?Grant me chastity and continence, but not yet? - St Augustine From p.j.a.cock at googlemail.com Tue May 28 13:15:23 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 28 May 2013 18:15:23 +0100 Subject: [Biopython-dev] Compiling on modern windows (recent mingw) In-Reply-To: References: Message-ID: On Tue, May 28, 2013 at 5:50 PM, Tiago Ant?o wrote: > > On Tue, May 28, 2013 at 5:36 PM, Peter Cock > wrote: >> >> I'm curious if that matters under Windows 8 or not - trying >> the example in the commit comment at the command line >> might be illuminating. > > > I just re-scheduled a testing case and the results were not great... > http://testing.open-bio.org/biopython/builders/Windows%208%20-%20Python%202.7/builds/13/steps/shell/logs/stdio > > I will test this manually and in deep when I arrive home today. I think there are at just two classes of failure, calling applications: test_Application ... FAIL And indexing with Windows newlines (I wonder if the git setup on my Windows XP machine has a different default to yours, meaning I have Unix newlines and you have Windows newlines?): test_SearchIO_blast_tab_index ... FAIL test_SearchIO_blast_xml_index ... FAIL test_SearchIO_exonerate_text_index ... FAIL test_SearchIO_exonerate_vulgar_index ... FAIL test_SearchIO_fasta_m10_index ... FAIL test_SearchIO_hmmer2_text_index ... FAIL test_SearchIO_hmmer3_domtab_index ... FAIL test_SearchIO_hmmer3_tab_index ... FAIL test_SearchIO_hmmer3_text_index ... FAIL Bio.SeqIO docstring test ... FAIL Plus of course the minor issues which I just introduced with the escaping change (commits to follow). Peter From saketkc at gmail.com Tue May 28 13:20:36 2013 From: saketkc at gmail.com (Saket Choudhary) Date: Tue, 28 May 2013 22:50:36 +0530 Subject: [Biopython-dev] Compiling on modern windows (recent mingw) In-Reply-To: References: Message-ID: The constraint for me really is I do not have access to Windows/MAC machines here. Hunting for a Windows machine is possible, besides these I need to validate the _ArgumentList method for windows too On 28 May 2013 22:06, Peter Cock wrote: > On Tue, May 28, 2013 at 3:09 PM, Peter Cock > wrote: > > > > I've got a hunch that spaces in the executable path might explain > > some of these failures - I'm trying a patch for that here. > > Hi Tiago, > > Patch applied to master - this is essential for the rare case of > calling a binary under Unix where the path/filename includes > a space, but appears to be redundant under Windows XP: > > https://github.com/biopython/biopython/commit/815de571b623f1cd3659fe4c80e3917e1a437580 > > I'm curious if that matters under Windows 8 or not - trying > the example in the commit comment at the command line > might be illuminating. > > Peter > > P.S. 
Saket - You might remember I touched on this issue in our > discussion on GitHub about your bwa/samtools wrappers, which > led to this commit keeping self.program_name as the binary only: > > https://github.com/biopython/biopython/commit/ca93be741c8fd9bad67106acb455348251797f3a > From p.j.a.cock at googlemail.com Tue May 28 13:30:47 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 28 May 2013 18:30:47 +0100 Subject: [Biopython-dev] Compiling on modern windows (recent mingw) In-Reply-To: References: Message-ID: On Tue, May 28, 2013 at 6:20 PM, Saket Choudhary wrote: > The constraint for me really is I do not have access to Windows/MAC machines > here. > > Hunting for a Windows machine is possible, besides these I need to validate > the _ArgumentList method for windows too I sympathise - sorting out a (virtual) 64bit Windows machine has been on my TODO list for a while, since right now I don't have access to one. When I started doing Biopython my primary machine was Windows XP. That old laptop has retired and I now mainly use Mac OS X and Linux at work, but I made a point of getting a Windows XP machine setup for development (e.g. the Windows installers are build with this) and for use as one of our nightly build slaves: http://testing.open-bio.org/biopython/buildslaves Regards, Peter From redmine at redmine.open-bio.org Thu May 30 02:32:21 2013 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Thu, 30 May 2013 06:32:21 +0000 Subject: [Biopython-dev] [Biopython - Bug #3433] (Resolved) MMCIFParser fails on python3 for disordered atoms References: Message-ID: Issue #3433 has been updated by Michiel de Hoon. Status changed from New to Resolved % Done changed from 0 to 100 Patch applied, thanks. ---------------------------------------- Bug #3433: MMCIFParser fails on python3 for disordered atoms https://redmine.open-bio.org/issues/3433 Author: Alexander Campbell Status: Resolved Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: The new shlex based parser works under python3, but reveals that the changed comparison rules in python3 lead to unhandled exceptions when parsing disordered atoms. Furthermore, it reveals that occupancy and temperature factor attributes of Atom objects were never cast from str to float types when parsed from mmCIF files. The comparison code which raises the exception under python3 is at Atom.py line 333: @if occupancy>self.last_occupancy:@ . The exception can be prevented my modifying MMCIFParser.py to cast occupancy and temperature factor to float. The following patch is a basic copy of the equivalent code in PDBParser.py:
diff --git a/Bio/PDB/MMCIFParser.py b/Bio/PDB/MMCIFParser.py
index 64d16bc..4be6490 100644
--- a/Bio/PDB/MMCIFParser.py
+++ b/Bio/PDB/MMCIFParser.py
@@ -84,8 +84,15 @@ class MMCIFParser(object):
                 altloc=" "
             resseq=seq_id_list[i]
             name=atom_id_list[i]
-            tempfactor=b_factor_list[i]
-            occupancy=occupancy_list[i]
+            # occupancy & B factor
+            try:
+                tempfactor=float(b_factor_list[i])
+            except ValueError:
+                raise PDBConstructionException("Invalid or missing B factor")
+            try:
+                occupancy=float(occupancy_list[i])
+            except ValueError:
+                raise PDBConstructionException("Invalid or missing occupancy")
             fieldname=fieldname_list[i]
             if fieldname=="HETATM":
                 hetatm_flag="H"

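For anyone wanting to confirm the fix locally, a quick check along these lines should be enough (a sketch assuming the 3u8h.cif file used for testing below has been downloaded into the working directory, and using the standard Bio.PDB Atom accessors):

    from Bio.PDB.MMCIFParser import MMCIFParser

    parser = MMCIFParser()
    structure = parser.get_structure("3u8h", "3u8h.cif")
    for atom in structure.get_atoms():
        # with the patch applied these are floats, not strings
        assert isinstance(atom.get_occupancy(), float)
        assert isinstance(atom.get_bfactor(), float)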
This patch was tested with the "mmCIF file for PDB structure 3u8h":http://www.rcsb.org/pdb/download/downloadFile.do?fileFormat=cif&compression=NO&structureId=3U8H , which would cause the mmCIF parsing exception under python3.2. After the patch, there were no exceptions during parsing and the occupancy and bfactor attributes had the correct type (float). The patch was also tested under python2.7, which worked just fine and also showed the correct types. I haven't tested earlier versions of python2, but the simple syntax ought to work. Could a dev apply this patch? Or better yet, suggest a patch for casting the types at the StructureBuilder level, which would make such things independent of the specific parser used. This is just a minimal-quickfix patch, but I'm sure a better solution is possible. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From p.j.a.cock at googlemail.com Thu May 30 04:21:31 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 30 May 2013 09:21:31 +0100 Subject: [Biopython-dev] Python 3 and subprocess .communicate() bug Message-ID: Hi Tiago, We'd been talking briefly off-list about the recent buildbot failures under Python 3 where the recent change to using subprocess in the PopGen module was causing failures. Sadly while it seems to work on Python 3.1 and 3.2 my suggestion to try using bytes with the communicate call fails on Python 3.3 and under Windows: https://github.com/biopython/biopython/commit/912692ee2b57e8c075ba38bdf814c9dbe4f5cdb9 e.g. After the change to use bytes, http://testing.open-bio.org/biopython/builders/Linux%2064%20-%20Python%203.3/builds/202 http://testing.open-bio.org/biopython/builders/Windows%20XP%20-%20Python%203.1/builds/816 http://testing.open-bio.org/biopython/builders/Windows%20XP%20-%20Python%203.2/builds/680 http://testing.open-bio.org/biopython/builders/Windows%20XP%20-%20Python%203.3/builds/206 This appears to be a known bug in the subprocess module, http://bugs.python.org/issue16903 which should be fixed in Python 3.2.4 and Python 3.3. It appears not to have been fixed on Python 3.1. I see two options, Option One, revert that commit (i.e. send unicode strings as before, not bytes). This will work on Python 3.2.4+ onwards including Windows. It will fail on Python 3.1 and out of date Python 3.2 through 3.2.3 releases. Option Two, don't use universal_newlines=True which then requires us to use byte strings for all the stdin, stdout and stderr processing. More work, but it should in principle work on old and new Python 3 releases. Note that while we're not seeing any problems yet, I suspect this issue would affect our Bio.Application wrappers __call__ function as well when used to send data to stdin. Here again we could switch to using bytes and universal_newlines=False and do any bytes/unicode handling within the __call_ function, on just insist on a fixed version of Python. If we decide to recommend at least Python 3.2.4 (when using Python 3), then we could add a warning to the relevant modules to catch this issue? What do people think? Regards, Peter From tiagoantao at gmail.com Thu May 30 04:28:04 2013 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Thu, 30 May 2013 09:28:04 +0100 Subject: [Biopython-dev] Python 3 and subprocess .communicate() bug In-Reply-To: References: Message-ID: I was having a look at the issue precisely now. 
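As a concrete illustration of the second option above (drop universal_newlines=True and keep stdin/stdout as bytes, decoding explicitly), the pattern would be roughly the following sketch, using the sort command purely as a stand-in for a wrapped tool:

    import subprocess

    child = subprocess.Popen(["sort"], stdin=subprocess.PIPE,
                             stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # send and receive bytes; decoding is done by us rather than by
    # subprocess, so behaviour does not depend on the Python 3.x point release
    stdout, stderr = child.communicate(b"beta\nalpha\n")
    lines = stdout.decode("ascii").splitlines()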
I do not have a cast opinion on the issue, I think it all boils down on how many people are dependent on 3.2.3 and prior 3s. In theory I would prefer not to have workarounds for implementation bugs (as makes things more complex to manage in the long-run), but if many people are using buggy 3.x, I see no option... I simply do not have any view on how many people would be using these... From p.j.a.cock at googlemail.com Thu May 30 04:34:15 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 30 May 2013 09:34:15 +0100 Subject: [Biopython-dev] Python 3 and subprocess .communicate() bug In-Reply-To: References: Message-ID: On Thu, May 30, 2013 at 9:28 AM, Tiago Ant?o wrote: > I was having a look at the issue precisely now. > > I do not have a cast opinion on the issue, I think it all boils down on how > many people are dependent on 3.2.3 and prior 3s. > > In theory I would prefer not to have workarounds for implementation bugs (as > makes things more complex to manage in the long-run), but if many people are > using buggy 3.x, I see no option... > > I simply do not have any view on how many people would be using these... > Since till now we've not officially supported Python 3, but plan to start doing so for the forthcoming Biopython 1.62 release, so we could just set a minimum version of 3.2.4 (with Python 3.3 being our current recommendation). However, that may be a problem for some current Linux distributions still shipping older versions? Peter From tiagoantao at gmail.com Thu May 30 04:41:27 2013 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Thu, 30 May 2013 09:41:27 +0100 Subject: [Biopython-dev] Python 3 and subprocess .communicate() bug In-Reply-To: References: Message-ID: On Thu, May 30, 2013 at 9:34 AM, Peter Cock wrote: > However, that may be a problem for some current Linux > distributions still shipping older versions? > > > I suppose people could revert to Python 2 in that case? [Do not get me wrong, I really have no strong feelings either way] -- ?Grant me chastity and continence, but not yet? - St Augustine From p.j.a.cock at googlemail.com Thu May 30 07:37:51 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 30 May 2013 12:37:51 +0100 Subject: [Biopython-dev] Python 3 and subprocess .communicate() bug In-Reply-To: References: Message-ID: On Thu, May 30, 2013 at 9:41 AM, Tiago Ant?o wrote: > > On Thu, May 30, 2013 at 9:34 AM, Peter Cock > wrote: >> >> However, that may be a problem for some current Linux >> distributions still shipping older versions? > > I suppose people could revert to Python 2 in that case? [Do not get me > wrong, I really have no strong feelings either way] > I guess we should do a brief survey on the main list of Python 3 versions people have installed, if any. In the meantime, I reverted that commit so the tests should now pass under Python 3.2.4+ and Python 3.3. https://github.com/biopython/biopython/commit/285988b1b5227b591bd2fed379e36db3a157eca2 Peter From tiagoantao at gmail.com Thu May 30 07:40:27 2013 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Thu, 30 May 2013 12:40:27 +0100 Subject: [Biopython-dev] Python 3 and subprocess .communicate() bug In-Reply-To: References: Message-ID: > I guess we should do a brief survey on the main list of Python 3 versions > people have installed, if any. 
> > > +1 From p.j.a.cock at googlemail.com Thu May 30 07:47:33 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 30 May 2013 12:47:33 +0100 Subject: [Biopython-dev] Python 3 and subprocess .communicate() bug In-Reply-To: References: Message-ID: On Thu, May 30, 2013 at 12:40 PM, Tiago Ant?o wrote: > >> I guess we should do a brief survey on the main list of Python 3 versions >> people have installed, if any. >> >> > > +1 Agreed, http://lists.open-bio.org/pipermail/biopython/2013-May/008598.html Peter From p.j.a.cock at googlemail.com Thu May 30 09:33:22 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 30 May 2013 14:33:22 +0100 Subject: [Biopython-dev] Python 2 and 3 migration thoughts Message-ID: Splitting off from this thread: http://lists.open-bio.org/pipermail/biopython/2013-May/008601.html On Thu, May 30, 2013 at 2:13 PM, Peter Cock wrote: > Thank you for all the comments so far, don't stop yet :) > > On Thu, May 30, 2013 at 1:51 PM, Wibowo Arindrarto > wrote: >> Hi everyone, >> >> I'm leaning towards insisting on Python >=3.3 support (I'm running >> 3.3.2). I suppose that even if Python3.3 is not available on a machine >> or through the default package manager, it's always installable on its >> own. If that's not the case, I imagine Python2.x is most likely >> present in these machines (so Biopython can still be used). > > True. > > So far everyone who has replied (including some off list) have said > they are using Python 3.3 which is encouraging. Thank you for > the comments so far. > > It looks like we can forget about Python 3.1, and just need to > decide if it is worth including Python 3.2.5 in the short term. > >> On a related note, do we have a defined timeline on when we >> would drop support for Python2.x? Are there any plans to have >> our codebase written in Python3.x instead of Python2.x? > > Nothing concrete planned, no. I'll reply in more detail on the > biopython-dev list as I do have some thoughts about this. Good question Bow, I think people will still be using Python 2 a year or two from now, so we must support both for some time. Biopython 1.62 (next week perhaps?) - Final release with Python 2.5 support - Official support for Python 2.5, 2.6, 2.7 and 3.3 - Possibly official support for Python 3.2.5+ as well? (Exactly which versions of Python 3 we'll include to be decided, see the other thread for that discussion.) Short term we will continue with developing using Python 2 syntax and running 2to3 for Python 3. As far as I know, the reverse process with 3to2 is not well established. If anyone wants to investigate that would be useful as another option. However, dropping Python 2.5 support makes things more flexible... Medium term I believe it would be possible to have a single code base which is both valid Python 2 and 3 at the same time. This may require us to target 2.7 and 3.3+ only - we'll have to try it and see if Python 2.6 will hold us back. I've actually done this with lzma.backports, a small but non-trivial module with Python and C code: https://pypi.python.org/pypi/backports.lzma/ https://github.com/peterjc/backports.lzma Python 3.3 reintroduces some features designed to make this more straightforward, like unicode literals (missing in the early versions of Python 3). This is why I'd like to drop Python 3.2 as soon as possible. What I was thinking is we can start migrating modules on a case by case basis from "Python 2 syntax" to "Dual syntax" one by one, with a white-list in the do2to3.py script. 
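For illustration, "dual syntax" here just means code like the following sketch (not any specific Biopython module), which runs unchanged on Python 2.6+ and Python 3:

    from __future__ import print_function  # harmless no-op on Python 3

    def describe(record):
        # print() used as a function works on both Python 2 and 3,
        # so 2to3 no longer has to rewrite this module
        print("%s has length %i" % (record.id, len(record)))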
That way over time less and less modules need to be converted via 2to3, and "python3 setup.py install" will get faster, until eventually we can stop using 2to3 at all. This conversion could consider the code and doctests separately. However, using using print(example) we can hopefully get most of the doctests and Tutorial examples to work under both Python 2 and 3 at the same time. That's my current thinking anyway - and I think the fact that it would be a gradual migration from writing Python 2 specific code to writing dual 2/3 code makes it low risk (as long as we're continuing to run regular testing). Regards, Peter From p.j.a.cock at googlemail.com Thu May 30 10:23:01 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 30 May 2013 15:23:01 +0100 Subject: [Biopython-dev] HMMER3.1 beta test 1 released Message-ID: Hi Bow, Just FYI, see http://selab.janelia.org/people/eddys/blog/?p=759 "The programs phmmer, hmmsearch, and hmmscan offer a new tabular output format for easier automated parsing, --pfamtblout. his format is the one used internally by Pfam, but we make it more broadly available in case it is of use elsewhere. An analagous output format is available for nhmmer and nhmmscan, --dfamtblout." Something to consider for SearchIO later on... Regards, Peter From w.arindrarto at gmail.com Thu May 30 10:50:24 2013 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Thu, 30 May 2013 16:50:24 +0200 Subject: [Biopython-dev] HMMER3.1 beta test 1 released In-Reply-To: References: Message-ID: Hi Peter, Thanks for the heads-up. This just showed up in my feed as well. I've been waiting for the official release (since they first mentioned it some monts ago). I'll follow up on this slowly :).. Best regards, Bow On Thu, May 30, 2013 at 4:23 PM, Peter Cock wrote: > Hi Bow, > > Just FYI, see http://selab.janelia.org/people/eddys/blog/?p=759 > > "The programs phmmer, hmmsearch, and hmmscan offer a new > tabular output format for easier automated parsing, --pfamtblout. > his format is the one used internally by Pfam, but we make it more > broadly available in case it is of use elsewhere. An analagous > output format is available for nhmmer and nhmmscan, --dfamtblout." > > Something to consider for SearchIO later on... > > Regards, > > Peter From rz1991 at foxmail.com Thu May 30 11:37:00 2013 From: rz1991 at foxmail.com (=?gb18030?B?yO7vow==?=) Date: Thu, 30 May 2013 23:37:00 +0800 Subject: [Biopython-dev] GSoC 2013 Student Self-introduction Message-ID: Hi Everyone, This is Zheng Ruan, a first year graduate students at the University of Georgia. I'm happy to be chosen to participate in GSoC this year. My project is "Codon Alignment and Analysis in Biopython" and I will be working with Eric Talevich and Peter Cock during the summer. My undergraduate major is biotechnology and now seeking for a PhD in bioinformatics. I hope to improve my python programming skills during the project and make long term contribution to biopython. I will follow the timeline of my proposal in the Community Bounding Period these days (http://www.google-melange.com/gsoc/proposal/review/google/gsoc2013/rzzmh12345/1). Thanks! 
Best, Ruan From p.j.a.cock at googlemail.com Thu May 30 12:18:41 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 30 May 2013 17:18:41 +0100 Subject: [Biopython-dev] Biopython projects with NESCent for GSoC 2013 In-Reply-To: References: Message-ID: Dear all, After the disappointing news that the Open Bioinformatics Foundation (OBF) was not accepted as a Google Summer of Code (GSoC) organisation this year, Biopython was fortunate to once again offer some projects with the NESCent team: http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2013 As always the student proposals have been very competitive, and we've not been able to take on everyone. This year NESCent was fortunately to be able to accept seven students through GSoC and one through the GNOME Outreach Program for Women. Two of these GSoC projects are Biopython related: Codon Alignment and Analysis in Biopython Student: Zheng Ruan Mentors: Eric Talevich, Peter Cock http://www.google-melange.com/gsoc/project/google/gsoc2013/rzzmh12345/32001 Phylogenetics in Biopython: Filling in the gaps Student: Yanbo Ye http://www.google-melange.com/gsoc/project/google/gsoc2013/yeyanbo/45001 Mentors: Mark Holder, Jeet Sukumaran, Eric Talevich Thank you NESCent, and congratulations to Zheng Ruan and Yanbo Ye! I'm hoping you're already setting up a blog, which I hope you'll be able to use for roughly weekly progress reports during the summer - CC'd to the biopython-dev mailing list and the NESCent Phyloinformatics Summer of Code forum on Google+, http://lists.open-bio.org/mailman/listinfo/biopython-dev https://plus.google.com/communities/105828320619238393015 An introduction to your project would be a great idea for your first post - here's Bow's from last year as an example: http://bow.web.id/blog/2012/04/google-summer-of-code-is-on/ http://bow.web.id/blog/2012/08/summers-over/ http://bow.web.id/blog/tag/gsoc/ The idea here is to keep the wider community informed about how your project is going. On behalf of the Biopython developers, congratulations! We're looking forward to another productive Summer of Code :) Peter From p.j.a.cock at googlemail.com Fri May 31 05:04:28 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 31 May 2013 10:04:28 +0100 Subject: [Biopython-dev] Python 3 and subprocess .communicate() bug In-Reply-To: References: Message-ID: On Thu, May 30, 2013 at 9:34 AM, Peter Cock wrote: > On Thu, May 30, 2013 at 9:28 AM, Tiago Ant?o wrote: >> I was having a look at the issue precisely now. >> >> I do not have a cast opinion on the issue, I think it all boils down on how >> many people are dependent on 3.2.3 and prior 3s. >> >> In theory I would prefer not to have workarounds for implementation bugs (as >> makes things more complex to manage in the long-run), but if many people are >> using buggy 3.x, I see no option... >> >> I simply do not have any view on how many people would be using these... >> > > Since till now we've not officially supported Python 3, but > plan to start doing so for the forthcoming Biopython 1.62 > release, so we could just set a minimum version of 3.2.4 > (with Python 3.3 being our current recommendation). >From the discussion on the main list, requiring a recent version of Python 3 where this bug is fixed should be fine. 
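In a test script that sort of version gate is only a few lines - a sketch of the general shape (the actual commit linked below may differ in detail), using the exception that run_tests.py already treats as "skip this test":

    import sys
    from Bio import MissingExternalDependencyError

    if sys.version_info[0] == 3 and sys.version_info < (3, 2, 4):
        # subprocess.communicate() with universal_newlines=True is
        # unreliable here, see http://bugs.python.org/issue16903
        raise MissingExternalDependencyError(
            "Need Python 3.2.4+ for this test")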
For now I've added code to skip this test on the older Python 3 releases where the bug exists: https://github.com/biopython/biopython/commit/9c16c09806ca4af84f714662e54c9bd3057b0a52 Once we've settled on the versions to support with the next release we should review what versions we run on the buildbot. Regards, Peter
From eric.talevich at gmail.com Wed May 1 15:46:43 2013 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 1 May 2013 11:46:43 -0400 Subject: [Biopython-dev] gsoc phylo project questions In-Reply-To: References: Message-ID: On Tue, Apr 30, 2013 at 3:20 AM, Yanbo Ye wrote: > Hi Eric, > > Again, thanks for your comment. It might be better to discuss here. > https://github.com/lijax/gsoc/commit/e969c82a5a0aef45bba1277ce01d6dbee03e6a84#commitcomment-3096321 > > I have changed my proposal and timeline based on your advice. I think I > was too optimistic that I didn't consider about the compatibility with > existing code or other potential problem that may exist. After careful > consideration, I removed one task from the goal list to make the time more > relaxed, the tree comparison(seems > I miss understood this). I might be able to complete all of them. But it's > better to make it as an extra task, to make sure this coding experience is > not a burden. > I agree it's best to commit to a feasible timeline and then reserve a few "stretch goals". Dropping the tree distance function is fine, as there are currently some other students who might develop this small module as a course project, independently of GSoC. In any case that functionality is independent of the other tasks you've proposed. > According to your comment: > > 1. I didn't know PyCogent and DendroPy. I'll refer to them for useful > solutions. > 2. For distance-based tree and consensus tree, I think there is no need > to use NumPy. And for consensus tree, my original plan is to implement a > binary class to count the clade with the same leaves for performance.
As > you suggest, I'll implement a class with the same API and improve the > performance later, so that I can pay more attention to the Strict and Adam > Consensus algorithms. > Sounds good. > 3. I didn't find the distance matrix method for MSA on Phylo Cookbook > page, only from existing tree. > Ah, I think I misunderstood you earlier. Yes, for the NJ method you'll need to use a substitution matrix to compute pairwise distances from a multiple sequence alignment. This shouldn't be too challenging, though you might find the need to add a new matrix to the Bio.SubsMat module if you want to let the user choose something other than BLOSUM or PAM. 4. For parsimony tree search, I have already know how several heuristic > search algorithms work. Do I need to implement them all? > No, just choose a well-established one that you feel comfortable implementing. 5. I'm not clear about the radial layout and Felsenstein's Equal Daylight > algorithm. Isn't this algorithm one way of showing the radial layout? I'm > sorry that I'm not familiar with this layout. Can you give some figure > examples and references? > For radial tree layout: https://en.wikipedia.org/wiki/Radial_tree http://www.infosun.fim.uni-passau.de/~chris/down/DrawingPhyloTreesEA.pdf The paper above also explains an "angle spreading" refinement step to improve the appearance of radial trees, which you could opt to implement instead of Equal Daylight. The Equal Daylight algorithm seems to only be documented fully in the book "Inferring Phylogenies" and implemented in the "drawtree" program in Phylip. In the Phylip documentation, the radial layout algorithm is called "Equal Arc", and the layout provided by that algorithm is the starting point for Equal Daylight: http://evolution.genetics.washington.edu/phylip/doc/drawtree.html Cheers, Eric From albl500 at york.ac.uk Wed May 1 22:56:12 2013 From: albl500 at york.ac.uk (Alex Leach) Date: Wed, 01 May 2013 23:56:12 +0100 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: Dear all, I also left some minor comments on the proposal; I hope they're helpful and I wish you every success! You should focus on the proposal for now, but I thought I'd share a more presentable version of the fasta lazy-loader I wrote a couple of years ago. The focus at the time was to minimise memory usage and increase the speed of random access to fasta-formatted sequences, stored on disk. Only sequence accessions and file locations are stored in-memory (in a dict). Once the index has been populated, it can 'pickle' the dictionary to a file on disk, for later re-use. It doesn't exactly fulfill all of your needs, but I hope it might help you in the right direction.. Also, were there plans for making the lazy loader thread-safe? I've done it in the past by passing a `multiprocessing.Pipe` instance to a method (`pipe_sequences`) of the lazy loader. If redesigning the code, I'd try to implement a callback scheme, but passing a Pipe did the job.. Maybe it's outside the current scope of the project, but anyway, I put the module up on github if you want to check it out[1]. 
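The core of that kind of index is small - a rough sketch (one sequence per '>' header assumed, file names purely illustrative) of the accession-to-offset dictionary plus the pickling step:

    import pickle

    def index_fasta(path):
        # map each accession (first word after ">") to the byte offset of
        # its header line; only these offsets are kept in memory
        offsets = {}
        with open(path, "rb") as handle:
            while True:
                pos = handle.tell()
                line = handle.readline()
                if not line:
                    break
                if line.startswith(b">"):
                    offsets[line[1:].split()[0].decode()] = pos
        return offsets

    index = index_fasta("genome.fasta")
    with open("genome.fasta.idx", "wb") as out:
        pickle.dump(index, out)  # re-load with pickle.load on a later run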
Cheers, Alex [1] - https://github.com/alexleach/fasta_lazy_loader/blob/master/fasta_lazy_loader.py From zhigang.wu at email.ucr.edu Thu May 2 08:14:04 2013 From: zhigang.wu at email.ucr.edu (Zhigang Wu) Date: Thu, 2 May 2013 01:14:04 -0700 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: Hi Alex, The idea of taking advantage of multiprocessing is great. I haven't touched this kind of thing before and I think it's going to be cool to integrate into the project. Best, Zhigang On Wed, May 1, 2013 at 3:56 PM, Alex Leach wrote: > Dear all, > > I also left some minor comments on the proposal; I hope they're helpful > and I wish you every success! > > You should focus on the proposal for now, but I thought I'd share a more > presentable version of the fasta lazy-loader I wrote a couple of years ago. > The focus at the time was to minimise memory usage and increase the speed > of random access to fasta-formatted sequences, stored on disk. Only > sequence accessions and file locations are stored in-memory (in a dict). > Once the index has been populated, it can 'pickle' the dictionary to a file > on disk, for later re-use. > > It doesn't exactly fulfill all of your needs, but I hope it might help you > in the right direction.. > > Also, were there plans for making the lazy loader thread-safe? I've done > it in the past by passing a `multiprocessing.Pipe` instance to a method > (`pipe_sequences`) of the lazy loader. If redesigning the code, I'd try to > implement a callback scheme, but passing a Pipe did the job.. Maybe it's > outside the current scope of the project, but anyway, I put the module up > on github if you want to check it out[1]. > > > Cheers, > Alex > > > [1] - https://github.com/alexleach/**fasta_lazy_loader/blob/master/** > fasta_lazy_loader.py > > ______________________________**_________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.**org > http://lists.open-bio.org/**mailman/listinfo/biopython-dev > From albl500 at york.ac.uk Thu May 2 09:08:23 2013 From: albl500 at york.ac.uk (Alex Leach) Date: Thu, 02 May 2013 10:08:23 +0100 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: On Thu, 02 May 2013 09:14:04 +0100, Zhigang Wu wrote: > Hi Alex, > > The idea of taking advantage of multiprocessing is great. I haven't > touched this kind of thing before and I think >it's going to be cool to > integrate into the project. Pleasure. Multiprocessing is quite a large topic, and the relevant library documentation also rather large[1-2]. If you haven't worked with multiprocessing before, it will probably take a long while before you're comfortable using the libraries involved. So if you were to mention it in the proposal, I'd keep it out of the core objectives, as you have a lot else on your plate, already. Don't know if anyone else has any thoughts on this, though? I could potentially help to provide some pointers, so if you have any questions I might be able to help with, please feel free to ask. Kind regards, Alex [1] - http://docs.python.org/2/library/multiprocessing.html [2] - http://docs.python.org/2/library/threading.html -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From p.j.a.cock at googlemail.com Thu May 2 09:52:19 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 2 May 2013 10:52:19 +0100 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: On Wed, May 1, 2013 at 3:17 PM, Zhigang Wu wrote: > Hi Peter and all, > Thanks for the long explanation. > I got much better understand of this project though I am still confusing on > how to implement the lazy-loading parser for feature rich files (EMBL, > GenBank, GFF3). > Since the deadline is pretty close,I decided to post my premature of > proposal for this project. It would be great if you all can given me some > comments and suggestions. The proposal is available here. > https://docs.google.com/document/d/1BgPRKTq7HXq1K6fb9U2TnN7VvSDDlSTsQN991okekzk/edit?usp=sharing > Thank you all in advance. > > Zhigang Hi Zhigang, I've posted a few comment there, but it would be a good idea to put the draft on Google Melange soon. I see you've posted the Google Doc on the NESCent Google+ as well, good. Looking at the current draft, you don't yet have a timeline. This is vital - and it should include writing tests (as you write code - not all at the end) and documentation (which can come after the code). In the community bonding period you could write that you plan to setup your development environment including multiple versions of Python (at least Python 2.6, Python 3, Jython 2.7, and PyPy 2.0 to cover the main variants). For instance, it would make sense to start with learning about faidx and how its indexing works, and trying to reproduce it in Python code, and then wrapping that in a SeqRecord style API. Include writing and evaluating some benchmarks too - you may need to learn how to profile Python code for this, since speed and performance is one the reasons for wanting lazy loading (lower memory usage is the other main driver). That could be the first few weeks perhaps? Regards, Peter From p.j.a.cock at googlemail.com Thu May 2 10:37:31 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 2 May 2013 11:37:31 +0100 Subject: [Biopython-dev] Fwd: [PhyloSoC] Application deadline fast approaching In-Reply-To: References: Message-ID: Hi all, I'm forwarding this for any potential Google Summer of Code 2013 students and mentors - note you should also be signed up to the NESCent "Phyloinformatics Summer of Code" mailing list to make sure you don't miss any important information. Thanks, Peter ---------- Forwarded message ---------- From: Karen Cranston Date: Thu, May 2, 2013 at 12:39 AM Subject: [PhyloSoC] Application deadline fast approaching To: Phyloinformatics Summer of Code The student application deadline for GSoC is this Friday, May 3 at 19:00 UTC! Thanks to everyone for their expertise and enthusiasm so far. Expect much traffic in Melange and on the G+ page between now and the deadline. Please do help students (for your projects or others) improve their applications - either on the G+ page or via a public comment on Melange. The most common issue is a lack of detail in the project plan. You can point students to the wiki for examples from previous years. Feel free to ask for help on this list. We will send out more about assigning mentors / scoring after the application deadline. 
Cheers, Karen & Jim -- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Karen Cranston, PhD Training Coordinator and Informatics Project Manager nescent.org @kcranstn ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ _______________________________________________ PhyloSoC mailing list PhyloSoC at nescent.org https://lists.nescent.org/mailman/listinfo/phylosoc UNSUBSCRIBE: https://lists.nescent.org/mailman/options/phylosoc/p.j.a.cock%40googlemail.com?unsub=1&unsubconfirm=1 From p.j.a.cock at googlemail.com Thu May 2 12:54:52 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 2 May 2013 13:54:52 +0100 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: On Wed, May 1, 2013 at 3:17 PM, Zhigang Wu wrote: > Hi Peter and all, > Thanks for the long explanation. > I got much better understand of this project though I am still confusing on > how to implement the lazy-loading parser for feature rich files (EMBL, > GenBank, GFF3). Hi Zhigang, I'd considered two ideas for GenBank/EMBL, Lazy parsing of the feature table: The existing iterator approach reads in a GenBank file record by record, and parses everything into objects (a SeqRecord object with the sequence as a Seq object and the features as a list of SeqFeature objects). I did some profiling a while ago, and of this the feature processing is quite slow, therefore during the initial parse the features could be stored in memory as a list of strings, and only parsed into SeqFeature objects if the user tries to access the SeqRecord's feature property. It would require a fairly simple subclassing of the SeqRecord to make the features list into a property in order to populate the list of SeqFeatures when first accessed. In the situation where the user never uses the features, this should be much faster, and save some memory as well (that would need to be confirmed by measurement - but a list of strings should take less RAM than a list of SeqFeature objects with all the sub-objects like the locations and annotations). In the situation where the use does access the features, the simplest behaviour would be to process the cached raw feature table into a list of SeqFeature objects. The overall runtime and memory usage would be about what we have now. This would not require any file seeking, and could be used within the existing SeqIO interface where we make a single pass though the file for parsing - this is vital in order to cope with handles like stdin and network handles where you cannot seek backwards in the file. That is the simpler idea, some real benefits, but not too ambitious. If you are already familiar with the GenBank/EMBL file format and our current parser and the SeqRecord object, then I think a week is reasonable. A full index based approach would mean scanning the GenBank, EMBL or GFF file and recording information about where each feature is on disk (file offset) and the feature location coordinates. This could be recorded in an efficient index structure (I was thinking something based on BAM's BAI or Heng Li's improved version CSI). The idea here is that when the user wants to look at features in a particular region of the genome (e.g. they have a mutation or SNP in region 1234567 on chr5) then only the annotation in that part of the genome needs to be loaded from the disk. This would likely require API changes or additions, for example the SeqRecord currently holds the SeqFeature objects as a simple list - with no build in co-ordinate access. 
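To make the first idea a little more concrete, a rough sketch of such a SeqRecord subclass might look like this (not existing Biopython code - the helper standing in for the real GenBank/EMBL feature table parser is invented for illustration):

    from Bio.Seq import Seq
    from Bio.SeqRecord import SeqRecord
    from Bio.SeqFeature import SeqFeature, FeatureLocation

    def _build_features(raw_lines):
        # stand-in for the real feature table parser; just turns each
        # cached line into a dummy SeqFeature
        return [SeqFeature(FeatureLocation(0, 10), type=line.split()[0])
                for line in raw_lines if line.strip()]

    class LazyFeatureRecord(SeqRecord):
        """Cache the raw feature table; build SeqFeatures on first access."""

        def __init__(self, seq, raw_feature_lines=None, **kwargs):
            self._raw_feature_lines = raw_feature_lines or []
            SeqRecord.__init__(self, seq, **kwargs)

        @property
        def features(self):
            if self._raw_feature_lines:
                self._feature_list = _build_features(self._raw_feature_lines)
                self._raw_feature_lines = []
            return self._feature_list

        @features.setter
        def features(self, value):
            # SeqRecord.__init__ assigns an empty list through this setter
            self._feature_list = value

    record = LazyFeatureRecord(Seq("ACGT" * 250),
                               raw_feature_lines=["gene 1..10"], id="demo")
    print(len(record.features))  # SeqFeature objects are only built here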
As I wrote in the original outline email, there is scope for a very ambitious project working in this area - but some of these ideas would require more background knowledge or preparation: http://lists.open-bio.org/pipermail/biopython-dev/2013-March/010469.html Anything looking to work with GFF (in the broad sense of GFF3 and/or GTF) would ideal incorporate Brad Chapman's existing work: http://biopython.org/wiki/GFF_Parsing Regards, Peter From albl500 at york.ac.uk Thu May 2 13:54:37 2013 From: albl500 at york.ac.uk (Alex Leach) Date: Thu, 02 May 2013 14:54:37 +0100 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: Hi again, Thought I'd contribute some thoughts... Hope I'm not intruding too much on the discussion. On Thu, 02 May 2013 13:54:52 +0100, Peter Cock wrote: > > It would require a fairly simple subclassing of the SeqRecord to make > the features list into a property in order to populate the list of > SeqFeatures when first accessed. > Yes. You can turn a class property into a function quite easily, using decorators. Here[1] is a pretty good example, description and justification. [1] - http://stackoverflow.com/questions/6618002/python-property-versus-getters-and-setters > In the situation where the user never uses the features, this should > be much faster, and save some memory as well (that would need to > be confirmed by measurement - but a list of strings should take less > RAM than a list of SeqFeature objects with all the sub-objects like > the locations and annotations). > > In the situation where the use does access the features, the simplest > behaviour would be to process the cached raw feature table into a > list of SeqFeature objects. The overall runtime and memory usage > would be about what we have now. This would not require any > file seeking, and could be used within the existing SeqIO interface > where we make a single pass though the file for parsing - this is > vital in order to cope with handles like stdin and network handles > where you cannot seek backwards in the file. I think the Pythonic way here would be to follow the "Easier to Ask for Forgiveness than to ask for Permission" (EAFP) idiom[2]. i.e. Try to seek the file handle first, and if that raises an IOError, catch the exception and continue to cache the input stream data, perhaps writing it to a temporary file on disk. [2] - http://docs.python.org/2/glossary.html#term-eafp > > That is the simpler idea, some real benefits, but not too ambitious. > If you are already familiar with the GenBank/EMBL file format and > our current parser and the SeqRecord object, then I think a week > is reasonable. > > A full index based approach would mean scanning the GenBank, > EMBL or GFF file and recording information about where each > feature is on disk (file offset) and the feature location coordinates. > This could be recorded in an efficient index structure (I was thinking > something based on BAM's BAI or Heng Li's improved version CSI). > The idea here is that when the user wants to look at features in a > particular region of the genome (e.g. they have a mutation or SNP > in region 1234567 on chr5) then only the annotation in that part > of the genome needs to be loaded from the disk. Thought I'd add that Blast uses SQL tables (in ISAM format) for maintaining indexes to their databases[3]. I'm not familiar with BioPython's BioSQL module at all, but a nice feature of sqlite is that you can hold temporary databases in memory[4]. 
[3] - http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/doxyhtml/seqdbisam_8hpp.html [4] - http://docs.python.org/2/library/sqlite3.html#using-sqlite3-efficiently Cheers, Alex > > This would likely require API changes or additions, for example > the SeqRecord currently holds the SeqFeature objects as a > simple list - with no build in co-ordinate access. > > As I wrote in the original outline email, there is scope for a very > ambitious project working in this area - but some of these ideas > would require more background knowledge or preparation: > http://lists.open-bio.org/pipermail/biopython-dev/2013-March/010469.html > > Anything looking to work with GFF (in the broad sense of GFF3 > and/or GTF) would ideal incorporate Brad Chapman's existing > work: http://biopython.org/wiki/GFF_Parsing > > Regards, > > Peter > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev -- --- Alex Leach. BSc, MRes PhD Student Chong & Redeker Labs Department of Biology University of York YO10 5DD Tel: 07940 480 771 EMAIL DISCLAIMER: http://www.york.ac.uk/docs/disclaimer/email.htm From idoerg at gmail.com Thu May 2 16:12:12 2013 From: idoerg at gmail.com (Iddo Friedberg) Date: Thu, 2 May 2013 12:12:12 -0400 Subject: [Biopython-dev] Uniprot-GOA parser Message-ID: Does anybody have a GOA parser in the works? Currently writing a simple parser for GAF, GPA and GPI formats. Can contribute if there is interest. More on GOA: http://www.ebi.ac.uk/GOA Cheers, Iddo -- Iddo Friedberg http://iddo-friedberg.net/contact.html ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>----.<--.>++++++.<<<<------------------------------------. From p.j.a.cock at googlemail.com Thu May 2 16:18:17 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 2 May 2013 17:18:17 +0100 Subject: [Biopython-dev] Uniprot-GOA parser In-Reply-To: References: Message-ID: On Thu, May 2, 2013 at 5:12 PM, Iddo Friedberg wrote: > Does anybody have a GOA parser in the works? Currently writing a simple > parser for GAF, GPA and GPI formats. Can contribute if there is interest. > > More on GOA: http://www.ebi.ac.uk/GOA > > Cheers, > > Iddo Hi Iddo, I see they're now offering GPAD1.1 format (as well? instead?). Does targeting that make more sense in the long run? I know a few people on the list are or were looking at ontology support for Biopython... it would be good to add this. Regards, Peter From idoerg at gmail.com Thu May 2 16:19:39 2013 From: idoerg at gmail.com (Iddo Friedberg) Date: Thu, 2 May 2013 12:19:39 -0400 Subject: [Biopython-dev] Uniprot-GOA parser In-Reply-To: References: Message-ID: Yes, will do GPAD as well. Need to preserve the others though, due to legacy. ./I On Thu, May 2, 2013 at 12:18 PM, Peter Cock wrote: > On Thu, May 2, 2013 at 5:12 PM, Iddo Friedberg wrote: > > Does anybody have a GOA parser in the works? Currently writing a simple > > parser for GAF, GPA and GPI formats. Can contribute if there is interest. > > > > More on GOA: http://www.ebi.ac.uk/GOA > > > > Cheers, > > > > Iddo > > Hi Iddo, > > I see they're now offering GPAD1.1 format (as well? instead?). > Does targeting that make more sense in the long run? > > I know a few people on the list are or were looking at ontology > support for Biopython... it would be good to add this. 
> > Regards, > > Peter > -- Iddo Friedberg http://iddo-friedberg.net/contact.html ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>----.<--.>++++++.<<<<------------------------------------. From zhigang.wu at email.ucr.edu Thu May 2 21:18:43 2013 From: zhigang.wu at email.ucr.edu (Zhigang Wu) Date: Thu, 2 May 2013 14:18:43 -0700 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: Hi Chris and All, In your comments to my proposal, you mentioned that some GFF files may have a size of GBs. After seeing that comment, I just want to roughly know how large is a gff file people are often working with? I mainly work on plants and I am not quite familiar with animals. Below I listed out a list of animals and plants, to my knowledge from reading papers, which most people are working with. organism(genome size) size of gff url to the ftp *folder*(not a huge file so feel free to click it) arabidopsis(~120MB) 44MB ftp://ftp.arabidopsis.org/Maps/gbrowse_data/TAIR10/ rice(~450MB) 77MB here corn(3GB) 87MB http://ftp.maizesequence.org/release-5b/filtered-set/ D. melanogaster 450MB ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r5.50_FB2013_02/gff/ C. elegans (site going down) http://wiki.wormbase.org/index.php/Downloads#GFF2 H. sapiens(3G) 170MB here My point is that caching gff files in memory wasn't as bad as we have thought. Any comments or suggestion are welcome. Best, Zhigang On Wed, May 1, 2013 at 7:40 AM, Chris Mitchell wrote: > Hi Zhigang, > > I throw some comments on your proposal. As i said there, I think you need > to find & look at a variety of gff/gtf files to see where your > implementation breaks down. Also, for parsing, I would focus on optimizing > the speed the user can access attributes, they're the bits people care most > about (where is gene X, what is the FPKM of isoform y?, etc.) > > Chris > > > On Wed, May 1, 2013 at 10:17 AM, Zhigang Wu wrote: > >> Hi Peter and all, >> Thanks for the long explanation. >> I got much better understand of this project though I am still confusing >> on >> how to implement the lazy-loading parser for feature rich files (EMBL, >> GenBank, GFF3). >> Since the deadline is pretty close,I decided to post my premature of >> proposal for this project. It would be great if you all can given me some >> comments and suggestions. The proposal is available >> here< >> https://docs.google.com/document/d/1BgPRKTq7HXq1K6fb9U2TnN7VvSDDlSTsQN991okekzk/edit?usp=sharing >> >. >> >> Thank you all in advance. >> >> >> Zhigang >> >> >> >> On Sat, Apr 27, 2013 at 1:40 PM, Peter Cock > >wrote: >> >> > On Sat, Apr 27, 2013 at 8:22 PM, Zhigang Wu >> > wrote: >> > > Peter, >> > > >> > > Thanks for the detailed explanation. It's very helpful. I am not quite >> > > sure about the goal of the lazy-loading parser. >> > > Let me try to summarize what are the goals of lazy-loading and how >> > > lazy-loading would work. Please correct me if necessary. Below I use >> > > fasta/fastq file as an example. The idea should generally applies to >> > > other format such as GenBank/EMBL as you mentioned. >> > > >> > > Lazy-loading is useful under the assumption that given a large file, >> > > we are interested in partial information of it but not all of them. 
>> > > For example a fasta file contains Arabidopsis genome, we only >> > > interested in the sequence of chr5 from index position from 2000-3000. >> > > Rather than parsing the whole file and storing each record in memory >> > > as most parsers will do, during the indexing step, lazy loading >> > > parser will only store a few position information, such as access >> > > positions (readily usable for seek) for all chromosomes (chr1, chr2, >> > > chr3, chr4, chr5, ...) and may be position index information such as >> > > the access positions for every 1000bp positions for each sequence in >> > > the given file. After indexing, we store these information in a >> > > dictionary like following {'chr1':{0:access_pos, 1000:access_pos, >> > > 2000:access_pos, ...}, 'chr2':{0:access_pos, 1000:access_pos, >> > > 2000:access_pos,}, 'chr3'...}. >> > > >> > > Compared to the usual parser which tends to parsing the whole file, we >> > > gain two benefits: speed, less memory usage and random access. Speed >> > > is gained because we skipped a lot during the parsing step. Go back to >> > > my example, once we have the dictionary, we can just seek to the >> > > access position of chr5:2000 and start reading and parsing from there. >> > > Less memory usage is due to we only stores access positions for each >> > > record as a dictionary in memory. >> > > >> > > >> > > Best, >> > > >> > > Zhigang >> > >> > Hi Zhigang, >> > >> > Yes - that's the basic idea of a disk based lazy loader. Here >> > the data stays on the disk until needed, so generally this is >> > very low memory but can be slow as it needs to read from >> > the disk. And existing example already in Biopython is our >> > BioSQL bindings which present a SeqRecord subclass which >> > only retrieves values from the database on demand. >> > >> > Note in the case of FASTA, we might want to use the existing >> > FAI index files from Heng Li's faidx tool (or another existing >> > index scheme). That relies on each record using a consistent >> > line wrapping length, so that seek offsets can be easily >> > calculated. >> > >> > An alternative idea is to load the data into memory (so that the >> > file is not touched again, useful for stream processing where >> > you cannot seek within the input data) but it is only parsed into >> > Python objects on demand. This would use a lot more memory, >> > but should be faster as there is no disk seeking and reading >> > (other than the one initial read). For FASTA this wouldn't help >> > much but it might work for EMBL/GenBank. >> > >> > Something to beware of with any lazy loading / lazy parsing is >> > what happens if the user tries to edit the record? Do you want >> > to allow this (it makes the code more complex) or not (simpler >> > and still very useful). >> > >> > In terms of usage examples, for things like raw NGS data this >> > is (currently) made up of lots and lots of short sequences (under >> > 1000bp). Lazy loading here is unlikely to be very helpful - unless >> > perhaps you can make the FASTQ parser faster this way? >> > (Once the reads are assembled or mapped to a reference, >> > random access to lookup reads by their mapped location is >> > very very important, thus the BAI indexing of BAM files). >> > >> > In terms of this project, I was thinking about a SeqRecord >> > style interface extending Bio.SeqIO (but you can suggest >> > something different for your project). 
>> > >> > What I saw as the main use case here is large datasets like >> > whole chromosomes in FASTA format or richly annotated >> > formats like EMBL, GenBank or GFF3. Right now if I am >> > doing something with (for example) the annotated human >> > chromosomes, loading these as GenBank files is quite >> > slow (it takes a far amount of memory too, but that isn't >> > my main worry). A lazy loading approach should let me >> > 'load' the GenBank files almost instantly, and delay >> > reading specific features or sequence from the disk >> > until needed. >> > >> > For example, I might have a list of genes for which I wish >> > to extract the annotation or sequence for - and there is no >> > need to load all the other features or the rest of the genome. >> > >> > (Note we can already do this by loading GenBank files >> > into a BioSQL database, and access them that way) >> > >> > Regards, >> > >> > Peter >> > >> _______________________________________________ >> Biopython-dev mailing list >> Biopython-dev at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython-dev >> > > From zhigang.wu at email.ucr.edu Fri May 3 00:18:03 2013 From: zhigang.wu at email.ucr.edu (Zhigang Wu) Date: Thu, 2 May 2013 17:18:03 -0700 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: On Thu, May 2, 2013 at 5:54 AM, Peter Cock wrote: > On Wed, May 1, 2013 at 3:17 PM, Zhigang Wu > wrote: > > Hi Peter and all, > > Thanks for the long explanation. > > I got much better understand of this project though I am still confusing > on > > how to implement the lazy-loading parser for feature rich files (EMBL, > > GenBank, GFF3). > > Hi Zhigang, > > I'd considered two ideas for GenBank/EMBL, > > Lazy parsing of the feature table: The existing iterator approach reads > in a GenBank file record by record, and parses everything into objects > (a SeqRecord object with the sequence as a Seq object and the > features as a list of SeqFeature objects). I did some profiling a while > ago, and of this the feature processing is quite slow, therefore during > the initial parse the features could be stored in memory as a list of > strings, and only parsed into SeqFeature objects if the user tries to > access the SeqRecord's feature property. > > It would require a fairly simple subclassing of the SeqRecord to make > the features list into a property in order to populate the list of > SeqFeatures when first accessed. > > In the situation where the user never uses the features, this should > be much faster, and save some memory as well (that would need to > be confirmed by measurement - but a list of strings should take less > RAM than a list of SeqFeature objects with all the sub-objects like > the locations and annotations). > I agree. This would save some memory. > In the situation where the use does access the features, the simplest > behaviour would be to process the cached raw feature table into a > list of SeqFeature objects. The overall runtime and memory usage > would be about what we have now. This would not require any > file seeking, and could be used within the existing SeqIO interface > where we make a single pass though the file for parsing - this is > vital in order to cope with handles like stdin and network handles > where you cannot seek backwards in the file. > > Yes, I agree. So in this sense, the name "lazy-loading" is a little misleading. 
Because, this would load everything into memory at the beginning, while just delay in parsing any feature until a specific one is requested. Seems like "lazy parsing" would be more appropriate. That is the simpler idea, some real benefits, but not too ambitious. > If you are already familiar with the GenBank/EMBL file format and > our current parser and the SeqRecord object, then I think a week > is reasonable. > > No, I am not quite familiar with these. > A full index based approach would mean scanning the GenBank, > EMBL or GFF file and recording information about where each > feature is on disk (file offset) and the feature location coordinates. > This could be recorded in an efficient index structure (I was thinking > something based on BAM's BAI or Heng Li's improved version CSI). > The idea here is that when the user wants to look at features in a > particular region of the genome (e.g. they have a mutation or SNP > in region 1234567 on chr5) then only the annotation in that part > of the genome needs to be loaded from the disk. > > This would likely require API changes or additions, for example > the SeqRecord currently holds the SeqFeature objects as a > simple list - with no build in co-ordinate access. > > As I wrote in the original outline email, there is scope for a very > ambitious project working in this area - but some of these ideas > would require more background knowledge or preparation: > http://lists.open-bio.org/pipermail/biopython-dev/2013-March/010469.html > > Hmm, this is actually INDEXing a big file. Don't you think a little bit off topic, "lazy-loading parser". But this seems interesting and challenging and definitely going to be useful. > Anything looking to work with GFF (in the broad sense of GFF3 > and/or GTF) would ideal incorporate Brad Chapman's existing > work: http://biopython.org/wiki/GFF_Parsing > > Yes, I definitely will take a Brad's GFF parser. > Regards, > > Peter > Thanks for the long explanation again. Zhigang From yeyanbo289 at gmail.com Fri May 3 02:19:07 2013 From: yeyanbo289 at gmail.com (Yanbo Ye) Date: Fri, 3 May 2013 10:19:07 +0800 Subject: [Biopython-dev] Biopython Phylo Proposal Message-ID: Hi everyone, I forget to post my gsoc proposal page here. Any comment? http://www.google-melange.com/gsoc/proposal/review/google/gsoc2013/yeyanbo/1# Thanks, Yanbo -- ??? ???????????????? Ye Yanbo Bioinformatics Group, Wuhan Institute Of Virology, Chinese Academy of Sciences From Markus.Piotrowski at ruhr-uni-bochum.de Fri May 3 06:32:43 2013 From: Markus.Piotrowski at ruhr-uni-bochum.de (Markus Piotrowski) Date: 3 May 2013 08:32:43 +0200 Subject: [Biopython-dev] Lazy-loading parsers, was: Biopython GSoC 2013 applications via NESCent In-Reply-To: References: Message-ID: Hi Zhigang, Sequence read files from Next Generation Sequencing methods are several GB large. Don't know if they are regulary stored in GFF files, anyhow. Best, Markus Am 2013-05-02 23:18, schrieb Zhigang Wu: > Hi Chris and All, > > In your comments to my proposal, you mentioned that some GFF files > may have > a size of GBs. > After seeing that comment, I just want to roughly know how large is a > gff > file people are often working with? > I mainly work on plants and I am not quite familiar with animals. > Below I listed out a list of animals and plants, to my knowledge from > reading papers, which most people are working with. 
> > organism(genome size) size of gff url to > the > ftp *folder*(not a huge file so feel free to click it) > arabidopsis(~120MB) 44MB > ftp://ftp.arabidopsis.org/Maps/gbrowse_data/TAIR10/ > rice(~450MB) 77MB > > here > corn(3GB) 87MB > http://ftp.maizesequence.org/release-5b/filtered-set/ > D. melanogaster 450MB > > ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r5.50_FB2013_02/gff/ > C. elegans (site going down) > http://wiki.wormbase.org/index.php/Downloads#GFF2 > H. sapiens(3G) 170MB > > here > > My point is that caching gff files in memory wasn't as bad as we have > thought. Any comments or suggestion are welcome. > > Best, > > > Zhigang > > > > > On Wed, May 1, 2013 at 7:40 AM, Chris Mitchell > wrote: > >> Hi Zhigang, >> >> I throw some comments on your proposal. As i said there, I think >> you need >> to find & look at a variety of gff/gtf files to see where your >> implementation breaks down. Also, for parsing, I would focus on >> optimizing >> the speed the user can access attributes, they're the bits people >> care most >> about (where is gene X, what is the FPKM of isoform y?, etc.) >> >> Chris >> >> >> On Wed, May 1, 2013 at 10:17 AM, Zhigang Wu >> wrote: >> >>> Hi Peter and all, >>> Thanks for the long explanation. >>> I got much better understand of this project though I am still >>> confusing >>> on >>> how to implement the lazy-loading parser for feature rich files >>> (EMBL, >>> GenBank, GFF3). >>> Since the deadline is pretty close,I decided to post my premature >>> of >>> proposal for this project. It would be great if you all can given >>> me some >>> comments and suggestions. The proposal is available >>> here< >>> >>> https://docs.google.com/document/d/1BgPRKTq7HXq1K6fb9U2TnN7VvSDDlSTsQN991okekzk/edit?usp=sharing >>> >. >>> >>> Thank you all in advance. >>> >>> >>> Zhigang >>> >>> >>> >>> On Sat, Apr 27, 2013 at 1:40 PM, Peter Cock >>> >> >wrote: >>> >>> > On Sat, Apr 27, 2013 at 8:22 PM, Zhigang Wu >>> >>> > wrote: >>> > > Peter, >>> > > >>> > > Thanks for the detailed explanation. It's very helpful. I am >>> not quite >>> > > sure about the goal of the lazy-loading parser. >>> > > Let me try to summarize what are the goals of lazy-loading and >>> how >>> > > lazy-loading would work. Please correct me if necessary. Below >>> I use >>> > > fasta/fastq file as an example. The idea should generally >>> applies to >>> > > other format such as GenBank/EMBL as you mentioned. >>> > > >>> > > Lazy-loading is useful under the assumption that given a large >>> file, >>> > > we are interested in partial information of it but not all of >>> them. >>> > > For example a fasta file contains Arabidopsis genome, we only >>> > > interested in the sequence of chr5 from index position from >>> 2000-3000. >>> > > Rather than parsing the whole file and storing each record in >>> memory >>> > > as most parsers will do, during the indexing step, lazy >>> loading >>> > > parser will only store a few position information, such as >>> access >>> > > positions (readily usable for seek) for all chromosomes (chr1, >>> chr2, >>> > > chr3, chr4, chr5, ...) and may be position index information >>> such as >>> > > the access positions for every 1000bp positions for each >>> sequence in >>> > > the given file. After indexing, we store these information in a >>> > > dictionary like following {'chr1':{0:access_pos, >>> 1000:access_pos, >>> > > 2000:access_pos, ...}, 'chr2':{0:access_pos, 1000:access_pos, >>> > > 2000:access_pos,}, 'chr3'...}. 
>>> > > >>> > > Compared to the usual parser which tends to parsing the whole >>> file, we >>> > > gain two benefits: speed, less memory usage and random access. >>> Speed >>> > > is gained because we skipped a lot during the parsing step. Go >>> back to >>> > > my example, once we have the dictionary, we can just seek to >>> the >>> > > access position of chr5:2000 and start reading and parsing from >>> there. >>> > > Less memory usage is due to we only stores access positions for >>> each >>> > > record as a dictionary in memory. >>> > > >>> > > >>> > > Best, >>> > > >>> > > Zhigang >>> > >>> > Hi Zhigang, >>> > >>> > Yes - that's the basic idea of a disk based lazy loader. Here >>> > the data stays on the disk until needed, so generally this is >>> > very low memory but can be slow as it needs to read from >>> > the disk. And existing example already in Biopython is our >>> > BioSQL bindings which present a SeqRecord subclass which >>> > only retrieves values from the database on demand. >>> > >>> > Note in the case of FASTA, we might want to use the existing >>> > FAI index files from Heng Li's faidx tool (or another existing >>> > index scheme). That relies on each record using a consistent >>> > line wrapping length, so that seek offsets can be easily >>> > calculated. >>> > >>> > An alternative idea is to load the data into memory (so that the >>> > file is not touched again, useful for stream processing where >>> > you cannot seek within the input data) but it is only parsed into >>> > Python objects on demand. This would use a lot more memory, >>> > but should be faster as there is no disk seeking and reading >>> > (other than the one initial read). For FASTA this wouldn't help >>> > much but it might work for EMBL/GenBank. >>> > >>> > Something to beware of with any lazy loading / lazy parsing is >>> > what happens if the user tries to edit the record? Do you want >>> > to allow this (it makes the code more complex) or not (simpler >>> > and still very useful). >>> > >>> > In terms of usage examples, for things like raw NGS data this >>> > is (currently) made up of lots and lots of short sequences (under >>> > 1000bp). Lazy loading here is unlikely to be very helpful - >>> unless >>> > perhaps you can make the FASTQ parser faster this way? >>> > (Once the reads are assembled or mapped to a reference, >>> > random access to lookup reads by their mapped location is >>> > very very important, thus the BAI indexing of BAM files). >>> > >>> > In terms of this project, I was thinking about a SeqRecord >>> > style interface extending Bio.SeqIO (but you can suggest >>> > something different for your project). >>> > >>> > What I saw as the main use case here is large datasets like >>> > whole chromosomes in FASTA format or richly annotated >>> > formats like EMBL, GenBank or GFF3. Right now if I am >>> > doing something with (for example) the annotated human >>> > chromosomes, loading these as GenBank files is quite >>> > slow (it takes a far amount of memory too, but that isn't >>> > my main worry). A lazy loading approach should let me >>> > 'load' the GenBank files almost instantly, and delay >>> > reading specific features or sequence from the disk >>> > until needed. >>> > >>> > For example, I might have a list of genes for which I wish >>> > to extract the annotation or sequence for - and there is no >>> > need to load all the other features or the rest of the genome. 
>>> > >>> > (Note we can already do this by loading GenBank files >>> > into a BioSQL database, and access them that way) >>> > >>> > Regards, >>> > >>> > Peter >>> > >>> _______________________________________________ >>> Biopython-dev mailing list >>> Biopython-dev at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biopython-dev >>> >> >> > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From p.j.a.cock at googlemail.com Mon May 6 11:23:24 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 6 May 2013 12:23:24 +0100 Subject: [Biopython-dev] Abstract for "Biopython Project Update" at BOSC 2013 In-Reply-To: References: Message-ID: On Tue, Apr 16, 2013 at 9:47 AM, Peter Cock wrote: > On Tue, Apr 16, 2013 at 1:43 AM, Eric Talevich wrote: >> >> The abstract looks good to me. Which release was the first to include >> SearchIO, was that 1.61? If so, maybe it would be good to note that in >> addition to the smaller improvements, SearchIO specifically was (one of?) >> the new module(s) that introduced the beta designation. >> > > Yes, SearchIO was included in Biopython 1.61, but you're right that > could be made a bit clearer. > The Biopython update has been accepted for a 10 minute talk slot at BOSC (anyone else with an abstract submitted should have had an email by now), the reviewers' feedback was short and positive: (A) Keep it short and show the variety of active sub-projects and people involved and the presentaion will will be attractive to the audience. The last year's talk is a good example (based on the shared slides). (Last year it was Eric at BOSC 2012 in Long Beach, CA - well done) (B) Nice to see latest news on BioPython and future directions of one of the most popular OpenBio project. (C) This talk reports an update on the BioPython project (support for experimental codes, Python 3 compatibility, SearchIO and genomic variant formats). BioPython is one of the central projects of O.B.F and its update is worth getting some attention at BOSC. We have until June to revise our abstract - so perhaps we should do the next release this month in May ;) Peter From idoerg at gmail.com Tue May 7 16:24:00 2013 From: idoerg at gmail.com (Iddo Friedberg) Date: Tue, 7 May 2013 12:24:00 -0400 Subject: [Biopython-dev] uniprot-GOA parse Message-ID: hi, As promised, I have written a uniprot-goa parser. Very skeletal, has iterators for reading the three uniprot-GOA file types, a write function, and a couple of usage examples. No github write access, so attaching. Cheers, Iddo -- Iddo Friedberg http://iddo-friedberg.net/contact.html ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>----.<--.>++++++.<<<<------------------------------------. -------------- next part -------------- A non-text attachment was scrubbed... Name: upg_parser.py Type: application/octet-stream Size: 10344 bytes Desc: not available URL: From p.j.a.cock at googlemail.com Tue May 7 16:47:16 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 7 May 2013 17:47:16 +0100 Subject: [Biopython-dev] uniprot-GOA parse In-Reply-To: References: Message-ID: On Tue, May 7, 2013 at 5:24 PM, Iddo Friedberg wrote: > hi, > > As promised, I have written a uniprot-goa parser. 
Very skeletal, has > iterators for reading the three uniprot-GOA file types, a write function, > and a couple of usage examples. > > No github write access, so attaching. The file arrived :) Did you have any thoughts on where in the namespace to put this? The idea with github is you'd register an account, say iddux (since that's your Twitter username), and then fork the repository as https://github.com/iddux/biopython - and make a new branch there with your changes, and ask for feedback or make a pull request. All that can be done without any write access to the main repository, and is intended to lower the barrier to entry. In your case, given you're a past project leader etc, drop me (or Brad etc) an email once you've mastered the git basic and we can give you direct access. Regards, Peter From natemsutton at yahoo.com Tue May 7 21:12:59 2013 From: natemsutton at yahoo.com (Nate Sutton) Date: Tue, 7 May 2013 14:12:59 -0700 (PDT) Subject: [Biopython-dev] Progress with ticket 3336 Message-ID: <1367961179.88206.YahooMailNeo@web122603.mail.ne1.yahoo.com> Hi, Here is a progress follow up to http://lists.open-bio.org/pipermail/biopython-dev/2013-April/010548.html . ?I have added a commit to the github branch that adds an option to create claude branch lines using linecollection. ?The linecollection objects are stored in a tuple before adding them to the plot. ?It?s in Bio/Phylo/_utils.py. ?Is this what the last bullet point was requesting in https://redmine.open-bio.org/issues/3336 ? ? Thanks! Nate P. S. ?I used a tuple to store the linecollection objects instead of a list because that was mentioned in the ticket but if that looks like it should be different let me know. ?Also, I got some global variables to work with the code but I was only able to do that after declaring them as globals twice. ?If there are suggestions on how to code that differently let me know. From idoerg at gmail.com Wed May 8 23:28:17 2013 From: idoerg at gmail.com (Iddo Friedberg) Date: Wed, 8 May 2013 19:28:17 -0400 Subject: [Biopython-dev] UniProt GOA parser Message-ID: A new uniprot-GOA parser is available for you to poke around: https://github.com/idoerg/biopython/tree/uniprot-goa/Bio/UniProtGOA More on Uniprot-GOA: http://www.ebi.ac.uk/GOA There are three file formats: GAF (gene association file) , GPA (gene product association) and GPI (gene product information) explained here: http://www.ebi.ac.uk/GOA/downloads Input GAF files can be very large, due to the growth of uniprot GOA. If you would like to test in a timely fashion, I suggest you get historical files, which are smaller. Once you get to the > 40 version numbers, the runtime for the example code in UniProtGOA.py goes over 2 minutes (on my i5 machine). Old GAF files are available here: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/old/UNIPROT/ Current GPI and GPA files are not very large. Thanks to Peter for his help on this. Best, Iddo -- Iddo Friedberg http://iddo-friedberg.net/contact.html ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>----.<--.>++++++.<<<<------------------------------------. 
From p.j.a.cock at googlemail.com Fri May 10 10:06:19 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 10 May 2013 11:06:19 +0100 Subject: [Biopython-dev] UniProt GOA parser In-Reply-To: References: Message-ID: On Thu, May 9, 2013 at 12:28 AM, Iddo Friedberg wrote: > A new uniprot-GOA parser is available for you to poke around: > > https://github.com/idoerg/biopython/tree/uniprot-goa/Bio/UniProtGOA > I think for the namespace, we might be better off using Bio.UniProt.GOA, where Iddo's parser would be in Bio/UniProt/GOA.py and any other UniProt specific code could also go under Bio/UniProt - for example a web API. Some of Bio.SwissProt might also migrate here over time. > More on Uniprot-GOA: http://www.ebi.ac.uk/GOA > > There are three file formats: GAF (gene association file) , GPA (gene > product association) and GPI (gene product information) explained here: > http://www.ebi.ac.uk/GOA/downloads > > Input GAF files can be very large, due to the growth of uniprot GOA. If you > would like to test in a timely fashion, I suggest you get historical files, > which are smaller. Once you get to the > 40 version numbers, the runtime > for the example code in UniProtGOA.py goes over 2 minutes (on my i5 > machine). Would it make sense to want random access to the GOA files based on the identifier (DB_Object_ID and/or DB_Object_Symbol)? That should be fairly straight forward to do building on the indexing code for Bio.SeqIO and SearchIO. Note here I am picturing combining all the (consecutive) lines for the same DB_Object_ID - currently the parser is line based, but batching by DB_Object_ID would be a straightforward change and may better suit some uses. > Old GAF files are available here: > ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/old/UNIPROT/ > > Current GPI and GPA files are not very large. > > Thanks to Peter for his help on this. > > Best, > > Iddo Peter From idoerg at gmail.com Fri May 10 16:20:16 2013 From: idoerg at gmail.com (Iddo Friedberg) Date: Fri, 10 May 2013 12:20:16 -0400 Subject: [Biopython-dev] UniProt GOA parser In-Reply-To: References: Message-ID: On Fri, May 10, 2013 at 6:06 AM, Peter Cock wrote: > On Thu, May 9, 2013 at 12:28 AM, Iddo Friedberg wrote: > > A new uniprot-GOA parser is available for you to poke around: > > > > https://github.com/idoerg/biopython/tree/uniprot-goa/Bio/UniProtGOA > > > > I think for the namespace, we might be better off using Bio.UniProt.GOA, > where Iddo's parser would be in Bio/UniProt/GOA.py and any other > UniProt specific code could also go under Bio/UniProt - for example > a web API. > OK. > > Some of Bio.SwissProt might also migrate here over time. > > > More on Uniprot-GOA: http://www.ebi.ac.uk/GOA > > > > There are three file formats: GAF (gene association file) , GPA (gene > > product association) and GPI (gene product information) explained here: > > http://www.ebi.ac.uk/GOA/downloads > > > > Input GAF files can be very large, due to the growth of uniprot GOA. If > you > > would like to test in a timely fashion, I suggest you get historical > files, > > which are smaller. Once you get to the > 40 version numbers, the runtime > > for the example code in UniProtGOA.py goes over 2 minutes (on my i5 > > machine). > > Would it make sense to want random access to the GOA files based > on the identifier (DB_Object_ID and/or DB_Object_Symbol)? That > should be fairly straight forward to do building on the indexing code > for Bio.SeqIO and SearchIO. > Would that require reading it all into memory? 
Uniprot_GOA files are huge, it is impractical to read them in fully. > > Note here I am picturing combining all the (consecutive) lines > for the same DB_Object_ID - currently the parser is line based, > but batching by DB_Object_ID would be a straightforward change > and may better suit some uses. > Perhaps only for organism specific file, which in some cases can be read fully into memory. > > > Old GAF files are available here: > > ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/old/UNIPROT/ > > > > Current GPI and GPA files are not very large. > > > > Thanks to Peter for his help on this. > > > > Best, > > > > Iddo > > Peter > -- Iddo Friedberg http://iddo-friedberg.net/contact.html ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>----.<--.>++++++.<<<<------------------------------------. From p.j.a.cock at googlemail.com Fri May 10 16:26:13 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 10 May 2013 17:26:13 +0100 Subject: [Biopython-dev] UniProt GOA parser In-Reply-To: References: Message-ID: On Fri, May 10, 2013 at 5:20 PM, Iddo Friedberg wrote: > On Fri, May 10, 2013 at 6:06 AM, Peter Cock wrote: >> >> Would it make sense to want random access to the GOA files based >> on the identifier (DB_Object_ID and/or DB_Object_Symbol)? That >> should be fairly straight forward to do building on the indexing code >> for Bio.SeqIO and SearchIO. > > > Would that require reading it all into memory? Uniprot_GOA files > are huge, it is impractical to read them in fully. Not at all - we'd record a dictionary mapping the record ID to an offset in the file on disk, or record this mapping in an SQLite index file. >> Note here I am picturing combining all the (consecutive) lines >> for the same DB_Object_ID - currently the parser is line based, >> but batching by DB_Object_ID would be a straightforward change >> and may better suit some uses. > > Perhaps only for organism specific file, which in some cases can > be read fully into memory. The examples I looked at only seemed to have a dozen or so lines for each DB_Object_ID - but perhaps these were easy cases? How many lines per DB_Object_ID in the worst cases? Peter From idoerg at gmail.com Fri May 10 16:32:43 2013 From: idoerg at gmail.com (Iddo Friedberg) Date: Fri, 10 May 2013 12:32:43 -0400 Subject: [Biopython-dev] UniProt GOA parser In-Reply-To: References: Message-ID: On Fri, May 10, 2013 at 12:26 PM, Peter Cock wrote: > On Fri, May 10, 2013 at 5:20 PM, Iddo Friedberg wrote: > > On Fri, May 10, 2013 at 6:06 AM, Peter Cock wrote: > >> > >> Would it make sense to want random access to the GOA files based > >> on the identifier (DB_Object_ID and/or DB_Object_Symbol)? That > >> should be fairly straight forward to do building on the indexing code > >> for Bio.SeqIO and SearchIO. > > > > > > Would that require reading it all into memory? Uniprot_GOA files > > are huge, it is impractical to read them in fully. > > Not at all - we'd record a dictionary mapping the record ID to an offset > in the file on disk, or record this mapping in an SQLite index file. > Ok, that's good then > >> Note here I am picturing combining all the (consecutive) lines > >> for the same DB_Object_ID - currently the parser is line based, > >> but batching by DB_Object_ID would be a straightforward change > >> and may better suit some uses. 
> > > > Perhaps only for organism specific file, which in some cases can > > be read fully into memory. > > The examples I looked at only seemed to have a dozen or so > lines for each DB_Object_ID - but perhaps these were easy > cases? How many lines per DB_Object_ID in the worst cases? > > Peter > I was actually thinking you are suggesting that the whole file should be read in memory, nit just buffer by DB-Object_ID. My mistake. -- Iddo Friedberg http://iddo-friedberg.net/contact.html ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>----.<--.>++++++.<<<<------------------------------------. From linxzh1989 at gmail.com Sun May 12 12:57:25 2013 From: linxzh1989 at gmail.com (=?GB2312?B?wdbQ0Nba?=) Date: Sun, 12 May 2013 20:57:25 +0800 Subject: [Biopython-dev] Errors about installing biopython 1.61 In-Reply-To: References: Message-ID: I am very Sorry about my mistake. I want to install biopython 1.61 in a local server(CentOS), python setup.py build python setup.py test and then showed some errors: ====================================================================== FAIL: Test an input file containing a single sequence. ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Clustalw_tool.py", line 166, in test_single_sequence self.assertTrue(str(err) == "No records found in handle") AssertionError ====================================================================== ERROR: Test Entrez.read from URL ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Entrez_online.py", line 34, in test_read_from_url rec = Entrez.read(einfo) File "/share/fg2/Linxzh/biopython-1.61/build/lib.linux-x86_64-2.6/Bio/Entrez/__init__.py", line 362, in read record = handler.read(handle) File "/share/fg2/Linxzh/biopython-1.61/build/lib.linux-x86_64-2.6/Bio/Entrez/Parser.py", line 184, in read self.parser.ParseFile(handle) File "/share/fg2/Linxzh/biopython-1.61/build/lib.linux-x86_64-2.6/Bio/Entrez/Parser.py", line 322, in endElementHandler raise RuntimeError(value) RuntimeError: Unable to open connection to #DbInfo?dbaf= ====================================================================== ERROR: Run tutorial doctests. ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Tutorial.py", line 152, in test_doctests ValueError: 4 Tutorial doctests failed: test_from_line_05671, test_from_line_06030, test_from_line_06190, test_from_line_06479 ---------------------------------------------------------------------- Ran 213 tests in 1621.002 seconds FAILED (failures = 3) i use python 2.6.5 2013/5/12 ??? : > I've run the From saketkc at gmail.com Sun May 12 18:11:46 2013 From: saketkc at gmail.com (Saket Choudhary) Date: Sun, 12 May 2013 23:41:46 +0530 Subject: [Biopython-dev] samtools threaded daemon In-Reply-To: References: <516653BE.8060509@brueffer.de> Message-ID: Just completed writing samtools wrapper : https://github.com/biopython/biopython/pull/180 Unit Tests pending. 
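A minimal usage sketch of the sort of interface I have in mind (the module path and keyword names are still provisional while the pull request is under review):

from Bio.Sequencing.Applications import SamtoolsViewCommandline

view_cmd = SamtoolsViewCommandline(input_file="accepted_hits.bam", h=True)
print(view_cmd)              # e.g. samtools view -h accepted_hits.bam
stdout, stderr = view_cmd()  # executes samtools in a subprocess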
On 11 April 2013 23:51, Chris Mitchell wrote: > Here's the branch I'm starting with, including a working mpileup daemon for > those who want to use it: > > https://github.com/chrismit/biopython/tree/samtools > > sample usage: > from Bio.SamTools import SamTools > sTools = '/home/chris/bin/samtools' > hg19 = '/media/chris/ChrisSSD/ref/human/hg19.fa' > bamSource = '/media/chris/ChrisSSD/TH1Alignment/NK/accepted_hits.bam' > st = SamTools(bamSource,binary=sTools,threads=30) > > #now with a callback, which is advisable to use to process data as it is > generated > def processPileup(pileup): > print 'to process',pileup > > #st.mpileup(f=hg19,r=['chr1:%d-%d'%(i,i+1) for i in > xrange(2000001,2001001)],callback=processPileup) #with callback > #print st.mpileup(f=hg19,r=['chr1:%d-%d'%(i,i+1) for i in > xrange(2000001,2000101)]) #will just return as a list > > > On Thu, Apr 11, 2013 at 10:04 AM, Chris Mitchell wrote: > >> Given that we'd be chasing after the samtools development cycle, I think >> it's just easier to implement command line wrappers that are dynamic enough >> to handle future versions. For instance, some of the code doesn't seem too >> set in stone and appears empirical (the BAQ computation comes to mind) and >> therefore probable to change in future versions. I can package in my >> existing pileup parser, but in general I think most people will be using a >> callback routine to handle it themselves since use cases of the final >> output sort of vary project by project. >> >> Chris >> >> >> On Thu, Apr 11, 2013 at 9:54 AM, Peter Cock wrote: >> >>> On Thu, Apr 11, 2013 at 2:46 PM, Chris Mitchell >>> wrote: >>> > Also, if a binary can't be found, having it fallback to the future >>> > BioPython parser seems like it might be a good idea (provided it has >>> > similar functionality like creating pileups, does it?). >>> >>> It has the low level random access via the BAI index done, but >>> does not yet have a reimplementation of the mpileup code, no. >>> (Would that be useful compared to calling samtools and parsing >>> its output?) >>> >>> Peter >>> >> >> > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From linxzh1989 at gmail.com Mon May 13 01:41:30 2013 From: linxzh1989 at gmail.com (=?GB2312?B?wdbQ0Nba?=) Date: Mon, 13 May 2013 09:41:30 +0800 Subject: [Biopython-dev] Errors about installing biopython 1.61 In-Reply-To: References: Message-ID: 2013/5/13 Peter Cock : > On Sun, May 12, 2013 at 1:57 PM, ??? wrote: >> I want to install biopython 1.61 in a local server(CentOS), >> python setup.py build >> python setup.py test >> and then showed some errors: >> >> ... >> >> i use python 2.6.5 >> > > Thank you for getting in touch, and including the important > information about the operating system, version of Python > and version of Biopython. > >> FAIL: Test an input file containing a single sequence. >> ---------------------------------------------------------------------- >> Traceback (most recent call last): >> File "test_Clustalw_tool.py", line 166, in test_single_sequence >> self.assertTrue(str(err) == "No records found in handle") >> AssertionError >> > > This test calls the command line tool clustalw. > > What version of clustalw do you have? 
> >> ERROR: Test Entrez.read from URL >> ---------------------------------------------------------------------- >> Traceback (most recent call last): >> File "test_Entrez_online.py", line 34, in test_read_from_url >> rec = Entrez.read(einfo) >> File "/share/fg2/Linxzh/biopython-1.61/build/lib.linux-x86_64-2.6/Bio/Entrez/__init__.py", >> line 362, in read >> record = handler.read(handle) >> File "/share/fg2/Linxzh/biopython-1.61/build/lib.linux-x86_64-2.6/Bio/Entrez/Parser.py", >> line 184, in read >> self.parser.ParseFile(handle) >> File "/share/fg2/Linxzh/biopython-1.61/build/lib.linux-x86_64-2.6/Bio/Entrez/Parser.py", >> line 322, in endElementHandler >> raise RuntimeError(value) >> RuntimeError: Unable to open connection to #DbInfo?dbaf= >> > > This test connects to the NCBI Entrez server over the internet. > This kind of error is usually a temporary network problem, and > will go away if you repeat the test later. > >> ERROR: Run tutorial doctests. >> ---------------------------------------------------------------------- >> Traceback (most recent call last): >> File "test_Tutorial.py", line 152, in test_doctests >> ValueError: 4 Tutorial doctests failed: test_from_line_05671, >> test_from_line_06030, test_from_line_06190, test_from_line_06479 > > Those four failing examples in the Tutorial seem to match this > commit, made just before the Biopython 1.61 release: > > https://github.com/biopython/biopython/commit/b84bda01bd22e93a1cf71613a55 February 2013 (Biopython 1.61)cfca876b7128d7#Doc/Tutorial.tex > > Where did you get the Biopython 1.61 files from? e.g. The zip file > or tar.gz file on our website? Perhaps I accidentally included an > older copy of the Doc/Tutorial.tex file? Could you look for the > "Late Update" line in your Tutorial.tex file for me - does it say: > > \date{Last Update -- 5 February 2013 (Biopython 1.61)} > > Thanks, > > Peter Hi?Peter? Clustalw I am using is 1.83. I've found the 'Late Update' in Tutorial.tex, it's ' \date{Last Update -- 5 February 2013 (Biopython 1.61)}'. I downloaded the tar.gz from the biopython website. Thanks Lin From p.j.a.cock at googlemail.com Mon May 13 08:49:20 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 13 May 2013 09:49:20 +0100 Subject: [Biopython-dev] Errors about installing biopython 1.61 In-Reply-To: References: Message-ID: On Mon, May 13, 2013 at 2:41 AM, ??? wrote: > >> Where did you get the Biopython 1.61 files from? e.g. The zip file >> or tar.gz file on our website? Perhaps I accidentally included an >> older copy of the Doc/Tutorial.tex file? Could you look for the >> "Late Update" line in your Tutorial.tex file for me - does it say: >> >> \date{Last Update -- 5 February 2013 (Biopython 1.61)} >> >> Thanks, >> >> Peter > > Hi?Peter? > Clustalw I am using is 1.83. Hi Lin, I also have clustalw 1.83, so this isn't simply a version problem. It could be something subtle about the locale - what language is your CentOS running in (that can alter error messages etc)? > I've found the 'Late Update' in Tutorial.tex, it's ' \date{Last Update -- 5 > February 2013 (Biopython 1.61)}'. That's good - that's what it should say :) (Sorry late/last was my typing error). > > I downloaded the tar.gz from the biopython website. > Thanks. I could reproduce the test_Tutorial.py problem with that. This is easy to explain - I forgot to include the test file my_blast.xml when doing the release (and you are the first person to report this problem). 
I should have noticed this myself, sorry :( I've fixed this ready for the next release - thank you for reporting this: https://github.com/biopython/biopython/commit/c1b63b88dd5a50fa3f6f2aef840a51fe9092e0c5 If you want to, you can get the missing file from here: http://biopython.org/SRC/Doc/examples/my_blast.xml or: https://github.com/biopython/biopython/raw/master/Doc/examples/my_blast.xml If you save that in the Biopython 1.61 source under Doc/examples then the Tutorial test should pass. -- Did you retry the test_Entrez_online.py example to see if this was a temporary problem? -- The good news is these minor issues should not cause you any problems installing and using Biopython 1.61 - so you can go ahead and run 'python setup.py install. Thanks, Peter From linxzh1989 at gmail.com Mon May 13 14:34:31 2013 From: linxzh1989 at gmail.com (=?GB2312?B?wdbQ0Nba?=) Date: Mon, 13 May 2013 22:34:31 +0800 Subject: [Biopython-dev] Errors about installing biopython 1.61 In-Reply-To: References: Message-ID: 2013/5/13 Peter Cock > On Mon, May 13, 2013 at 2:41 AM, ??? wrote: > > > >> Where did you get the Biopython 1.61 files from? e.g. The zip file > >> or tar.gz file on our website? Perhaps I accidentally included an > >> older copy of the Doc/Tutorial.tex file? Could you look for the > >> "Late Update" line in your Tutorial.tex file for me - does it say: > >> > >> \date{Last Update -- 5 February 2013 (Biopython 1.61)} > >> > >> Thanks, > >> > >> Peter > > > > Hi?Peter? > > Clustalw I am using is 1.83. > > Hi Lin, > > I also have clustalw 1.83, so this isn't simply a version > problem. It could be something subtle about the locale - > what language is your CentOS running in (that can alter > error messages etc)? > > > I've found the 'Late Update' in Tutorial.tex, it's ' \date{Last Update > -- 5 > > February 2013 (Biopython 1.61)}'. > > That's good - that's what it should say :) > > (Sorry late/last was my typing error). > > > > > I downloaded the tar.gz from the biopython website. > > > > Thanks. I could reproduce the test_Tutorial.py problem with that. > This is easy to explain - I forgot to include the test file my_blast.xml > when doing the release (and you are the first person to report this > problem). I should have noticed this myself, sorry :( > > I've fixed this ready for the next release - thank you for reporting this: > > https://github.com/biopython/biopython/commit/c1b63b88dd5a50fa3f6f2aef840a51fe9092e0c5 > > If you want to, you can get the missing file from here: > http://biopython.org/SRC/Doc/examples/my_blast.xml > > or: > https://github.com/biopython/biopython/raw/master/Doc/examples/my_blast.xml > > If you save that in the Biopython 1.61 source under Doc/examples > then the Tutorial test should pass. > > -- > > Did you retry the test_Entrez_online.py example to see if > this was a temporary problem? > > -- > > The good news is these minor issues should not cause you > any problems installing and using Biopython 1.61 - so you > can go ahead and run 'python setup.py install. > > Thanks, > > Peter > Hi Peter I have run the locale in my serve $ locale LANG=en_US.UTF-8 LC_CTYPE=zh_CN.UTF-8 LC_NUMERIC=zh_CN.UTF-8 LC_TIME=zh_CN.UTF-8 LC_COLLATE="en_US.UTF-8" LC_MONETARY=zh_CN.UTF-8 LC_MESSAGES="en_US.UTF-8" LC_PAPER=zh_CN.UTF-8 LC_NAME=zh_CN.UTF-8 LC_ADDRESS=zh_CN.UTF-8 LC_TELEPHONE=zh_CN.UTF-8 LC_MEASUREMENT=zh_CN.UTF-8 LC_IDENTIFICATION=zh_CN.UTF-8 LC_ALL= Is that locale you want? I retryed the the test_Entrez_online.py, it's all right now. As you said, it should be a connection problem. 
I have put the file in the Doc/examples file, but the error still exists. And i find there is no my_blat.psl in Doc/examples comparing with the zip file i downloaded from github. After i put the my_blat.psi in the Doc/examples, the error did not show up again. Thanks Lin From p.j.a.cock at googlemail.com Mon May 13 15:50:26 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 13 May 2013 16:50:26 +0100 Subject: [Biopython-dev] Errors about installing biopython 1.61 In-Reply-To: References: Message-ID: On Mon, May 13, 2013 at 3:34 PM, ??? wrote: > > Hi Peter > I have run the locale in my serve > > $ locale > LANG=en_US.UTF-8 > LC_CTYPE=zh_CN.UTF-8 > LC_NUMERIC=zh_CN.UTF-8 > LC_TIME=zh_CN.UTF-8 > LC_COLLATE="en_US.UTF-8" > LC_MONETARY=zh_CN.UTF-8 > LC_MESSAGES="en_US.UTF-8" > LC_PAPER=zh_CN.UTF-8 > LC_NAME=zh_CN.UTF-8 > LC_ADDRESS=zh_CN.UTF-8 > LC_TELEPHONE=zh_CN.UTF-8 > LC_MEASUREMENT=zh_CN.UTF-8 > LC_IDENTIFICATION=zh_CN.UTF-8 > LC_ALL= > > Is that locale you want? Hi Lin, Thanks for checking that, but having looked in more detail I think this is not related to the locale settings. My first guess was wrong :( I think I may have solved this - my test machine has both clustalw 2.1 and clustalw 1.83, and they behave differently for this example. The old test only worked with v2.1, fixed: https://github.com/biopython/biopython/commit/859d07f3c5e8b789156a5ec2e98f4153ab896e00 If you want to verify this, you could update your copy of Tests/test_Clustalw_tool.py to that from github (or just tried installing the latest Biopython code from github?). Note the Clustal developers intended that clustalw 1 and 2 would behave the same as each other (Version 2 was a rewrite as a step towards version 3, no called ClustalOmega), but there are still some minor differences. > I retryed the the test_Entrez_online.py, it's all right now. As > you said, it should be a connection problem. OK, good. > I have put the file in the Doc/examples file, but the error still exists. > And i find there is no my_blat.psl in Doc/examples comparing with the zip > file i downloaded from github. After i put the my_blat.psi in the > Doc/examples, the error did not show up again. Thank you, that should be fixed in the next release: https://github.com/biopython/biopython/commit/a3bb49b56abb5cbb9a0a00accb57674115c7004d Your feedback has been very helpful, Thanks, Peter From linxzh1989 at gmail.com Tue May 14 01:32:23 2013 From: linxzh1989 at gmail.com (=?GB2312?B?wdbQ0Nba?=) Date: Tue, 14 May 2013 09:32:23 +0800 Subject: [Biopython-dev] Errors about installing biopython 1.61 In-Reply-To: References: Message-ID: Hi Peter I copy the test_Clustalw_tool.py from the github, now it does work. Thank you! Lin 2013/5/13 Peter Cock > On Mon, May 13, 2013 at 3:34 PM, ??? wrote: > > > > Hi Peter > > I have run the locale in my serve > > > > $ locale > > LANG=en_US.UTF-8 > > LC_CTYPE=zh_CN.UTF-8 > > LC_NUMERIC=zh_CN.UTF-8 > > LC_TIME=zh_CN.UTF-8 > > LC_COLLATE="en_US.UTF-8" > > LC_MONETARY=zh_CN.UTF-8 > > LC_MESSAGES="en_US.UTF-8" > > LC_PAPER=zh_CN.UTF-8 > > LC_NAME=zh_CN.UTF-8 > > LC_ADDRESS=zh_CN.UTF-8 > > LC_TELEPHONE=zh_CN.UTF-8 > > LC_MEASUREMENT=zh_CN.UTF-8 > > LC_IDENTIFICATION=zh_CN.UTF-8 > > LC_ALL= > > > > Is that locale you want? > > Hi Lin, > > Thanks for checking that, but having looked in more detail > I think this is not related to the locale settings. 
My first guess > was wrong :( > > I think I may have solved this - my test machine has both > clustalw 2.1 and clustalw 1.83, and they behave differently > for this example. The old test only worked with v2.1, fixed: > > https://github.com/biopython/biopython/commit/859d07f3c5e8b789156a5ec2e98f4153ab896e00 > > If you want to verify this, you could update your copy of > Tests/test_Clustalw_tool.py to that from github (or just > tried installing the latest Biopython code from github?). > > Note the Clustal developers intended that clustalw 1 and 2 > would behave the same as each other (Version 2 was a > rewrite as a step towards version 3, no called ClustalOmega), > but there are still some minor differences. > > > I retryed the the test_Entrez_online.py, it's all right now. As > > you said, it should be a connection problem. > > OK, good. > > > I have put the file in the Doc/examples file, but the error still exists. > > And i find there is no my_blat.psl in Doc/examples comparing with the zip > > file i downloaded from github. After i put the my_blat.psi in the > > Doc/examples, the error did not show up again. > > Thank you, that should be fixed in the next release: > > https://github.com/biopython/biopython/commit/a3bb49b56abb5cbb9a0a00accb57674115c7004d > > Your feedback has been very helpful, > > Thanks, > > Peter > From idoerg at gmail.com Fri May 17 21:35:41 2013 From: idoerg at gmail.com (Iddo Friedberg) Date: Fri, 17 May 2013 17:35:41 -0400 Subject: [Biopython-dev] UniProt GOA parser In-Reply-To: References: Message-ID: OK. I added a few changes as suggested by Peter. There is a parser now to group GAF files by DB_Object_ID, and a write function to write them. Random access not implemented yet. On Fri, May 10, 2013 at 12:32 PM, Iddo Friedberg wrote: > > > On Fri, May 10, 2013 at 12:26 PM, Peter Cock wrote: > >> On Fri, May 10, 2013 at 5:20 PM, Iddo Friedberg wrote: >> > On Fri, May 10, 2013 at 6:06 AM, Peter Cock wrote: >> >> >> >> Would it make sense to want random access to the GOA files based >> >> on the identifier (DB_Object_ID and/or DB_Object_Symbol)? That >> >> should be fairly straight forward to do building on the indexing code >> >> for Bio.SeqIO and SearchIO. >> > >> > >> > Would that require reading it all into memory? Uniprot_GOA files >> > are huge, it is impractical to read them in fully. >> >> Not at all - we'd record a dictionary mapping the record ID to an offset >> in the file on disk, or record this mapping in an SQLite index file. >> > > Ok, that's good then > > >> >> Note here I am picturing combining all the (consecutive) lines >> >> for the same DB_Object_ID - currently the parser is line based, >> >> but batching by DB_Object_ID would be a straightforward change >> >> and may better suit some uses. >> > >> > Perhaps only for organism specific file, which in some cases can >> > be read fully into memory. >> >> The examples I looked at only seemed to have a dozen or so >> lines for each DB_Object_ID - but perhaps these were easy >> cases? How many lines per DB_Object_ID in the worst cases? >> >> Peter >> > > > I was actually thinking you are suggesting that the whole file should be > read in memory, nit just buffer by DB-Object_ID. My mistake. > > > -- > Iddo Friedberg > http://iddo-friedberg.net/contact.html > ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> > ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. 
> .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> > >>----.<--.>++++++.<<<<------------------------------------. > -- Iddo Friedberg http://iddo-friedberg.net/contact.html ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>----.<--.>++++++.<<<<------------------------------------. From p.j.a.cock at googlemail.com Mon May 20 13:16:45 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 20 May 2013 14:16:45 +0100 Subject: [Biopython-dev] UniProt GOA parser In-Reply-To: References: Message-ID: On Fri, May 17, 2013 at 10:35 PM, Iddo Friedberg wrote: > > > OK. I added a few changes as suggested by Peter. > > There is a parser now to group GAF files by DB_Object_ID, and a write > function to write them. Random access not implemented yet. > Hi Iddo, Over on this branch building on your work I moved things under Bio.UniProt.GOA, and got things a bit more in line with PEP8: https://github.com/peterjc/biopython/tree/uniprot-goa (Drop me an email off list if you need a hand pulling those changes into your branch) Do you want to have a go at re-using the index code in Bio.File (the back end for SeqIO and SearchIO's indexing)? Let me know if the current setup is too mysterious and I can try and document more of it and/or do this for the GOA module. Peter From redmine at redmine.open-bio.org Tue May 21 12:24:34 2013 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 21 May 2013 12:24:34 +0000 Subject: [Biopython-dev] [Biopython - Feature #3432] (New) Updated/Extended module MeltingTemp in Bio.SeqUtils Message-ID: Issue #3432 has been reported by Markus Piotrowski. ---------------------------------------- Feature #3432: Updated/Extended module MeltingTemp in Bio.SeqUtils https://redmine.open-bio.org/issues/3432 Author: Markus Piotrowski Status: New Priority: Normal Assignee: Category: Target version: URL: Dear Biopython developers, I updated/extended the MeltingTemp module of SeqUtils and would be happy if you would consider it for implementing. Please find the source code attached. Any feedback is appreciated. 'Old' module: One method, Tm_staluc, which calculates the melting temperature by the nearest neighbor method, using two different thermodynamic data sets for DNA and RNA. Fixed salt correction formula. 'Updated' module: 1. Three different Tm calculations: one 'rule of thumb' (Tm_Wallace), one using approximative formulas basing on GC content (Tm_GC) and one using nearest neighbor calculations (Tm_NN). 2. The new Tm_NN allows the usage of different thermodynamic datasets (8 tables are included for Watson-Crick base-pairing) and includes tables for mismatches (including inosine) and dangling ends. The datasets are Python dictionaries; the user can use his own datasets or change/update existing tables for his needs. 3. Seven different formulas to correct for salt concentration, including correction for Mg2+ ions (method salt_correction). 4. Method chem_correction which allows for Tm correction when using DMSO and formaldehyde. I haven't touched the old Tm_staluc method (except adding some comments [labelled 'MP'] and a deprecation warning). Actually, the method has two problems on the RNA side: The dataset for RNA is faulty and 'U' isn't considered as input. 
Of course this problems can easily be fixed, however, I would prefer (if it is decided to accept the updated module) to completely exchange the body of Tm_staluc for calls to Tm_NN (as outlined in the comments). There is one thing, that I'm uneasy with: For terminal mismatches, I used thermodynamic data from a patent application that has been withdrawn (http://patentscope.wipo.int/search/en/WO2001094611). Actually, I found the reference in the manual for Primer3 which also seems to use these data (http://primer3.sourceforge.net/primer3_manual.htm). Indeed, the Primer3 source (which is distributed under GPLv2) contains the data. Best wishes, Markus ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From p.j.a.cock at googlemail.com Wed May 22 13:45:00 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 22 May 2013 14:45:00 +0100 Subject: [Biopython-dev] UniProt GOA parser In-Reply-To: References: Message-ID: On Mon, May 20, 2013 at 7:09 PM, Iddo Friedberg wrote: >> Do you want to have a go at re-using the index code in Bio.File >> (the back end for SeqIO and SearchIO's indexing)? Let me know >> if the current setup is too mysterious and I can try and document >> more of it and/or do this for the GOA module. > > I'd like to have a go.. > > ./I Great - a few more details then, The second part of Bio/File.py has some private classes _IndexedSeqFileProxy and _IndexedSeqFileDict and _SQLiteManySeqFilesDict which can be used for any sequential record file format (meaning one after the other, not just biological sequences). These are used by the Bio.SeqIO.index() and index_db() functions, and their sisters in Bio.SearchIO. The idea is you write a subclass of _IndexedSeqFileProxy for your new file format, and then this gets used by either _IndexedSeqFileDict (in memory offset dictionary) or _SQLiteManySeqFilesDict (SQLite offset dictionary). Your _IndexedSeqFileProxy subclass has to define an __iter__ method which loops over the file giving a tuple for each record giving the identifier string and the start offset, and ideally the length in bytes. It must also define a get method which must seek to the offset and then parse the record. For the GOA files, the __iter__ loop will just spot batches of lines for the same identifier which together make up a single record. I managed to explain the setup to Bow, and he got it to work for SearchIO, but we were doing face to face video chats for that during GSoC last year. Fresh eyes will surely find some more rough edges in my docs ;) Regards, Peter From pgarland at gmail.com Mon May 27 02:27:05 2013 From: pgarland at gmail.com (Phillip Garland) Date: Sun, 26 May 2013 19:27:05 -0700 Subject: [Biopython-dev] test_SeqIO_online failure Message-ID: The fasta formatted record is fine, the problem seems to come after requesting and reading the genbank-formatted record for the protein with GI:16130152. It looks like the record was modified a few days ago: LOCUS NP_416719 367 aa linear CON 24-MAY-2013 and ends with CONTIG join(WP_000865568.1:1..367)\n//\n\n' instead of ORIGIN and the sequence data. 
Is this a problem with the genbank record that should be reported to NCBI, or is SeqIO supposed to handle the record as it is by fetching the sequence from the linked contig, or is the test doing the wrong thing by using rettype="gb" instead of rettype="gbwithparts"? Here's the test output: pgarland at cradle:~/Hacking/Source/Biology/biopython/Tests$ python run_tests.py test_SeqIO_online.py Python version: 2.7.5 (default, May 20 2013, 11:51:12) [GCC 4.7.3] Operating system: posix linux2 test_SeqIO_online ... FAIL ====================================================================== FAIL: test_protein_16130152 (test_SeqIO_online.EntrezTests) Bio.Entrez.efetch(protein, 16130152, ...) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/pgarland/Hacking/Source/Biology/biopython/Tests/test_SeqIO_online.py", line 77, in method = lambda x : x.simple(d, f, e, l, c) File "/home/pgarland/Hacking/Source/Biology/biopython/Tests/test_SeqIO_online.py", line 65, in simple self.assertEqual(seguid(record.seq), checksum) AssertionError: 'NT/aFiTXyD/7KixizZ9sq2FcniU' != 'fCjcjMFeGIrilHAn6h+yju267lg' ---------------------------------------------------------------------- Ran 1 test in 10.010 seconds FAILED (failures = 1) ~Phillip From kai.blin at biotech.uni-tuebingen.de Mon May 27 06:19:20 2013 From: kai.blin at biotech.uni-tuebingen.de (Kai Blin) Date: Mon, 27 May 2013 08:19:20 +0200 Subject: [Biopython-dev] SearchIO: Fix a bug in the HMMer2 text parser Message-ID: <51A2FAE8.1040408@biotech.uni-tuebingen.de> Hi folks, I've run into and fixed a bug in the hmmer2-text parser when parsing consensus lines. The pull request is at https://github.com/biopython/biopython/pull/182 Cheers, Kai -- Dipl.-Inform. Kai Blin kai.blin at biotech.uni-tuebingen.de Institute for Microbiology and Infection Medicine Division of Microbiology/Biotechnology Eberhard-Karls-Universit?t T?bingen Auf der Morgenstelle 28 Phone : ++49 7071 29-78841 D-72076 T?bingen Fax : ++49 7071 29-5979 Germany Homepage: http://www.mikrobio.uni-tuebingen.de/ag_wohlleben From p.j.a.cock at googlemail.com Mon May 27 09:05:44 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 27 May 2013 10:05:44 +0100 Subject: [Biopython-dev] test_SeqIO_online failure In-Reply-To: References: Message-ID: Hi Philip, On Mon, May 27, 2013 at 3:27 AM, Phillip Garland wrote: > The fasta formatted record is fine, the problem seems to come after > requesting and reading the genbank-formatted record for the protein > with GI:16130152. > > It looks like the record was modified a few days ago: > > LOCUS NP_416719 367 aa linear CON 24-MAY-2013 > > and ends with > > CONTIG join(WP_000865568.1:1..367)\n//\n\n' > > instead of > > ORIGIN and the sequence data. > > Is this a problem with the genbank record that should be reported to > NCBI, or is SeqIO supposed to handle the record as it is by fetching > the sequence from the linked contig, or is the test doing the wrong > thing by using rettype="gb" instead of rettype="gbwithparts"? Interesting - it looks like the NCBI made a change to Entrez and where previously this record had included the sequence with rettype="gb" now we have to ask for it explicitly with the longer rettype="gbwithparts" - my guess is this is now happening on more records. 
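As a rough sketch (not the actual test code), what now seems to be needed to pull down that record in full, sequence included, is along these lines:

from Bio import Entrez, SeqIO
Entrez.email = "A.N.Other at example.com"  # Always tell NCBI who you are
# "gbwithparts" asks Entrez for the full GenBank record, sequence included;
# plain "gb" can now come back with just a CONTIG line for this entry.
handle = Entrez.efetch(db="protein", id="16130152",
                       rettype="gbwithparts", retmode="text")
record = SeqIO.read(handle, "gb")  # Bio.SeqIO still takes "gb"/"genbank" here
handle.close()
print record.id, len(record)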
Note it does not affect all records, consider this example in our Tutorial which seems unchanged: from Bio import Entrez Entrez.email = "A.N.Other at example.com" # Always tell NCBI who you are handle = Entrez.efetch(db="nucleotide", id="186972394", rettype="gb", retmode="text") print handle.read() Curious. > Here's the test output: > > pgarland at cradle:~/Hacking/Source/Biology/biopython/Tests$ python > run_tests.py test_SeqIO_online.py > Python version: 2.7.5 (default, May 20 2013, 11:51:12) > [GCC 4.7.3] > Operating system: posix linux2 > test_SeqIO_online ... FAIL > ====================================================================== > FAIL: test_protein_16130152 (test_SeqIO_online.EntrezTests) > Bio.Entrez.efetch(protein, 16130152, ...) > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "/home/pgarland/Hacking/Source/Biology/biopython/Tests/test_SeqIO_online.py", > line 77, in > method = lambda x : x.simple(d, f, e, l, c) > File "/home/pgarland/Hacking/Source/Biology/biopython/Tests/test_SeqIO_online.py", > line 65, in simple > self.assertEqual(seguid(record.seq), checksum) > AssertionError: 'NT/aFiTXyD/7KixizZ9sq2FcniU' != 'fCjcjMFeGIrilHAn6h+yju267lg' > > ---------------------------------------------------------------------- > Ran 1 test in 10.010 seconds > > FAILED (failures = 1) I'd noticed this on Friday but hadn't looked into why the sequence was different (and sometimes Entrez errors are transient). Thanks for exploring this :) Would you like to submit a pull request to update test_SeqIO_online.py or should I just go ahead and change the rettype? It would be sensible to review all the Entrez examples in the Tutorial, to perhaps make more use of 'gbwithparts' rather than 'gb'? Thanks, Peter From pgarland at gmail.com Mon May 27 21:38:30 2013 From: pgarland at gmail.com (Phillip Garland) Date: Mon, 27 May 2013 14:38:30 -0700 Subject: [Biopython-dev] test_SeqIO_online failure In-Reply-To: References: Message-ID: Hi Peter, On Mon, May 27, 2013 at 2:05 AM, Peter Cock wrote: > Hi Philip, > > On Mon, May 27, 2013 at 3:27 AM, Phillip Garland wrote: >> The fasta formatted record is fine, the problem seems to come after >> requesting and reading the genbank-formatted record for the protein >> with GI:16130152. >> >> It looks like the record was modified a few days ago: >> >> LOCUS NP_416719 367 aa linear CON 24-MAY-2013 >> >> and ends with >> >> CONTIG join(WP_000865568.1:1..367)\n//\n\n' >> >> instead of >> >> ORIGIN and the sequence data. >> >> Is this a problem with the genbank record that should be reported to >> NCBI, or is SeqIO supposed to handle the record as it is by fetching >> the sequence from the linked contig, or is the test doing the wrong >> thing by using rettype="gb" instead of rettype="gbwithparts"? > > Interesting - it looks like the NCBI made a change to Entrez and > where previously this record had included the sequence with > rettype="gb" now we have to ask for it explicitly with the longer > rettype="gbwithparts" - my guess is this is now happening on > more records. > > Note it does not affect all records, consider this example in our > Tutorial which seems unchanged: > > from Bio import Entrez > Entrez.email = "A.N.Other at example.com" # Always tell NCBI who you are > handle = Entrez.efetch(db="nucleotide", id="186972394", > rettype="gb", retmode="text") > print handle.read() > > Curious. 
> >> Here's the test output: >> >> pgarland at cradle:~/Hacking/Source/Biology/biopython/Tests$ python >> run_tests.py test_SeqIO_online.py >> Python version: 2.7.5 (default, May 20 2013, 11:51:12) >> [GCC 4.7.3] >> Operating system: posix linux2 >> test_SeqIO_online ... FAIL >> ====================================================================== >> FAIL: test_protein_16130152 (test_SeqIO_online.EntrezTests) >> Bio.Entrez.efetch(protein, 16130152, ...) >> ---------------------------------------------------------------------- >> Traceback (most recent call last): >> File "/home/pgarland/Hacking/Source/Biology/biopython/Tests/test_SeqIO_online.py", >> line 77, in >> method = lambda x : x.simple(d, f, e, l, c) >> File "/home/pgarland/Hacking/Source/Biology/biopython/Tests/test_SeqIO_online.py", >> line 65, in simple >> self.assertEqual(seguid(record.seq), checksum) >> AssertionError: 'NT/aFiTXyD/7KixizZ9sq2FcniU' != 'fCjcjMFeGIrilHAn6h+yju267lg' >> >> ---------------------------------------------------------------------- >> Ran 1 test in 10.010 seconds >> >> FAILED (failures = 1) > > I'd noticed this on Friday but hadn't looked into why the sequence was > different (and sometimes Entrez errors are transient). Thanks for > exploring this :) > > Would you like to submit a pull request to update test_SeqIO_online.py > or should I just go ahead and change the rettype? > > It would be sensible to review all the Entrez examples in the Tutorial, > to perhaps make more use of 'gbwithparts' rather than 'gb'? > > Thanks, > > Peter The slight problem with just replacing "gb" with "gbwithparts" is that SeqIO doesn't take "gbwithparts" as an option for the file format. So in test_SeqIO_online.py, you have this code: handle = Entrez.efetch(db=database, id=entry, rettype=f, retmode="text") record = SeqIO.read(handle, f) which is a natural way to write the test (because it tests fasta and genbank files), but will currently fail if f is "gbwithparts", b/c SeqIO doesn't accept "gbwithparts" as a file format specifier. My guess is that most existing code hardcodes the rettype and SeqIO file format specifier, so we could just test for gbwithparts prior to calling SeqIO.read: handle = Entrez.efetch(db=database, id=entry, rettype=f, retmode="text") if f == "gbwithparts": f = "gb" record = SeqIO.read(handle, f) I submitted a pull request with a minimal patch that does this. For code like this, it would be cleaner if SeqIO accepted, "gbwithparts" as an alias for "genbank", just like "gb" is, but I don't know if it's a common pattern enough to bother. If records like this are becoming more common, then "gbwithparts" should be clearly documented in the biopython tutorial, though "gbwithparts" isn't clearly explained in NCBI's Entrez docs AFAICT. It seems safer to always use "gbwithparts" at this point, at least when you want the sequence. ~Phillip From p.j.a.cock at googlemail.com Mon May 27 22:43:19 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 27 May 2013 23:43:19 +0100 Subject: [Biopython-dev] test_SeqIO_online failure In-Reply-To: References: Message-ID: On Mon, May 27, 2013 at 10:38 PM, Phillip Garland wrote: > Hi Peter, > >> I'd noticed this on Friday but hadn't looked into why the sequence was >> different (and sometimes Entrez errors are transient). Thanks for >> exploring this :) >> >> Would you like to submit a pull request to update test_SeqIO_online.py >> or should I just go ahead and change the rettype? 
>> >> It would be sensible to review all the Entrez examples in the Tutorial, >> to perhaps make more use of 'gbwithparts' rather than 'gb'? >> >> Thanks, >> >> Peter > > The slight problem with just replacing "gb" with "gbwithparts" is that > SeqIO doesn't take "gbwithparts" as an option for the file format. So > in test_SeqIO_online.py, you have this code: > > handle = Entrez.efetch(db=database, id=entry, rettype=f, > retmode="text") > record = SeqIO.read(handle, f) > > which is a natural way to write the test (because it tests fasta and > genbank files), but will currently fail if f is "gbwithparts", b/c > SeqIO doesn't accept "gbwithparts" as a file format specifier. My > guess is that most existing code hardcodes the rettype and SeqIO file > format specifier, so we could just test for gbwithparts prior to > calling SeqIO.read: > > handle = Entrez.efetch(db=database, id=entry, rettype=f, retmode="text") > if f == "gbwithparts": > f = "gb" > record = SeqIO.read(handle, f) > > I submitted a pull request with a minimal patch that does this. That's good for now :) > For code like this, it would be cleaner if SeqIO accepted, > "gbwithparts" as an alias for "genbank", just like "gb" is, but I > don't know if it's a common pattern enough to bother. That makes some sense for parsing files, but all those aliases would cause confusion with writing GenBank files. > If records like this are becoming more common, then "gbwithparts" > should be clearly documented in the biopython tutorial, though > "gbwithparts" isn't clearly explained in NCBI's Entrez docs AFAICT. It > seems safer to always use "gbwithparts" at this point, at least when > you want the sequence. Definitely - if the NCBI moves to using 'gb' as the light style without the sequence then many people will just want to use 'gbwithparts' as their default when scripting this sort of thing. Thanks, Peter From redmine at redmine.open-bio.org Tue May 28 07:50:41 2013 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Tue, 28 May 2013 07:50:41 +0000 Subject: [Biopython-dev] [Biopython - Bug #3433] (New) MMCIFParser fails on python3 for disordered atoms Message-ID: Issue #3433 has been reported by Alexander Campbell. ---------------------------------------- Bug #3433: MMCIFParser fails on python3 for disordered atoms https://redmine.open-bio.org/issues/3433 Author: Alexander Campbell Status: New Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: The new shlex based parser works under python3, but reveals that the changed comparison rules in python3 lead to unhandled exceptions when parsing disordered atoms. Furthermore, it reveals that occupancy and temperature factor attributes of Atom objects were never cast from str to float types when parsed from mmCIF files. The comparison code which raises the exception under python3 is at Atom.py line 333: @if occupancy>self.last_occupancy:@ . The exception can be prevented my modifying MMCIFParser.py to cast occupancy and temperature factor to float. The following patch is a basic copy of the equivalent code in PDBParser.py:
diff --git a/Bio/PDB/MMCIFParser.py b/Bio/PDB/MMCIFParser.py
index 64d16bc..4be6490 100644
--- a/Bio/PDB/MMCIFParser.py
+++ b/Bio/PDB/MMCIFParser.py
@@ -84,8 +84,15 @@ class MMCIFParser(object):
                 altloc=" "
             resseq=seq_id_list[i]
             name=atom_id_list[i]
-            tempfactor=b_factor_list[i]
-            occupancy=occupancy_list[i]
+            # occupancy & B factor
+            try:
+                tempfactor=float(b_factor_list[i])
+            except ValueError:
+                raise PDBConstructionException("Invalid or missing B factor")
+            try:
+                occupancy=float(occupancy_list[i])
+            except ValueError:
+                raise PDBConstructionException("Invalid or missing occupancy")
             fieldname=fieldname_list[i]
             if fieldname=="HETATM":
                 hetatm_flag="H"

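Just to spell out the Python 3 behaviour this cast guards against (a minimal illustration, not part of the patch):

occupancy = "0.50"       # mmCIF fields arrive as strings
occupancy > 40.0         # Python 2: allowed (arbitrary but consistent result); Python 3: TypeError
float(occupancy) > 40.0  # fine on both once cast to float, as the patch does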
This patch was tested with the "mmCIF file for PDB structure 3u8h":http://www.rcsb.org/pdb/download/downloadFile.do?fileFormat=cif&compression=NO&structureId=3U8H , which would cause the mmCIF parsing exception under python3.2. After the patch, there were no exceptions during parsing and the occupancy and bfactor attributes had the correct type (float). The patch was also tested under python2.7, which worked just fine and also showed the correct types. I haven't tested earlier versions of python2, but the simple syntax ought to work. Could a dev apply this patch? Or better yet, suggest a patch for casting the types at the StructureBuilder level, which would make such things independent of the specific parser used. This is just a minimal-quickfix patch, but I'm sure a better solution is possible. ---------------------------------------- You have received this notification because this email was added to the New Issue Alert plugin -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org
From tiagoantao at gmail.com Tue May 28 11:14:53 2013 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Tue, 28 May 2013 12:14:53 +0100 Subject: [Biopython-dev] Compiling on modern windows (recent mingw) Message-ID: Hi, I have been trying to setup a windows 8 buildbot. For that purpose I have installed a recent version of mingw on a new win8 machine. It seems that one of the compiling options of biopython (-mno-cygwin) is deprecated. See here for more details: http://korbinin.blogspot.co.uk/2013/03/cython-mno-cygwin-problems.html -- ?Grant me chastity and continence, but not yet? - St Augustine From p.j.a.cock at googlemail.com Tue May 28 11:21:13 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 28 May 2013 12:21:13 +0100 Subject: [Biopython-dev] Compiling on modern windows (recent mingw) In-Reply-To: References: Message-ID: On Tue, May 28, 2013 at 12:14 PM, Tiago Ant?o wrote: > Hi, > > I have been trying to setup a windows 8 buildbot. For that purpose I have > installed a recent version of mingw on a new win8 machine. > > It seems that one of the compiling options of biopython (-mno-cygwin) is > deprecated. See here for more details: > http://korbinin.blogspot.co.uk/2013/03/cython-mno-cygwin-problems.html Looks like there's a confusing open bug about just removing this argument from Python's distutils - http://bugs.python.org/issue12641 For now does the hack of editing Lib\distutils\cygwinccompiler.py yourself get it to work? I could live with that on the build slave, coupled with a warning in our install documentation for the brave people self-compiling under Windows. Peter From tiagoantao at gmail.com Tue May 28 12:04:32 2013 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Tue, 28 May 2013 13:04:32 +0100 Subject: [Biopython-dev] Compiling on modern windows (recent mingw) In-Reply-To: References: Message-ID: Hi, On Tue, May 28, 2013 at 12:21 PM, Peter Cock wrote: > For now does the hack of editing Lib\distutils\cygwinccompiler.py yourself > get it to work? I could live with that on the build slave, coupled with a > warning in our install documentation for the brave people self-compiling > under Windows. > I have hacked my distutils implementation. It compiled OK. That being said, there seems to be some problems with Bio.Applications on win8: http://testing.open-bio.org/biopython/builders/Windows%208%20-%20Python%202.7/builds/12/steps/shell/logs/stdio -- ?Grant me chastity and continence, but not yet?
- St Augustine From p.j.a.cock at googlemail.com Tue May 28 14:09:40 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 28 May 2013 15:09:40 +0100 Subject: [Biopython-dev] Compiling on modern windows (recent mingw) In-Reply-To: References: Message-ID: On Tue, May 28, 2013 at 1:04 PM, Tiago Ant?o wrote: > Hi, > > > On Tue, May 28, 2013 at 12:21 PM, Peter Cock > wrote: >> >> For now does the hack of editing Lib\distutils\cygwinccompiler.py yourself >> get it to work? I could live with that on the build slave, coupled with a >> warning in our install documentation for the brave people self-compiling >> under Windows. > > I have hacked my distutils implementation. It compiled OK. That's encouraging. > That being said, there seems to be some problems with Bio.Applications on > win8: > http://testing.open-bio.org/biopython/builders/Windows%208%20-%20Python%202.7/builds/12/steps/shell/logs/stdio Could you confirm output sys.platform is "win32" still? I've got a hunch that spaces in the executable path might explain some of these failures - I'm trying a patch for that here. Some of the other failures appear to be down to newline differences (the \r in some of the output suggests this). Here we can probably use universal new lines mode for file input, but I am puzzled why these pass under Windows XP with an older mingw32 or the Intel compiler. Peter From tiagoantao at gmail.com Tue May 28 14:40:02 2013 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Tue, 28 May 2013 15:40:02 +0100 Subject: [Biopython-dev] Compiling on modern windows (recent mingw) In-Reply-To: References: Message-ID: On Tue, May 28, 2013 at 3:09 PM, Peter Cock wrote: > Could you confirm output sys.platform is "win32" still? > Yup T -- ?Grant me chastity and continence, but not yet? - St Augustine From p.j.a.cock at googlemail.com Tue May 28 16:36:20 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 28 May 2013 17:36:20 +0100 Subject: [Biopython-dev] Compiling on modern windows (recent mingw) In-Reply-To: References: Message-ID: On Tue, May 28, 2013 at 3:09 PM, Peter Cock wrote: > > I've got a hunch that spaces in the executable path might explain > some of these failures - I'm trying a patch for that here. Hi Tiago, Patch applied to master - this is essential for the rare case of calling a binary under Unix where the path/filename includes a space, but appears to be redundant under Windows XP: https://github.com/biopython/biopython/commit/815de571b623f1cd3659fe4c80e3917e1a437580 I'm curious if that matters under Windows 8 or not - trying the example in the commit comment at the command line might be illuminating. Peter P.S. Saket - You might remember I touched on this issue in our discussion on GitHub about your bwa/samtools wrappers, which led to this commit keeping self.program_name as the binary only: https://github.com/biopython/biopython/commit/ca93be741c8fd9bad67106acb455348251797f3a From tiagoantao at gmail.com Tue May 28 16:50:39 2013 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Tue, 28 May 2013 17:50:39 +0100 Subject: [Biopython-dev] Compiling on modern windows (recent mingw) In-Reply-To: References: Message-ID: On Tue, May 28, 2013 at 5:36 PM, Peter Cock wrote: > I'm curious if that matters under Windows 8 or not - trying > the example in the commit comment at the command line > might be illuminating. > I just re-scheduled a testing case and the results were not great... 
http://testing.open-bio.org/biopython/builders/Windows%208%20-%20Python%202.7/builds/13/steps/shell/logs/stdio I will test this manually and in deep when I arrive home today. -- ?Grant me chastity and continence, but not yet? - St Augustine From p.j.a.cock at googlemail.com Tue May 28 17:15:23 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 28 May 2013 18:15:23 +0100 Subject: [Biopython-dev] Compiling on modern windows (recent mingw) In-Reply-To: References: Message-ID: On Tue, May 28, 2013 at 5:50 PM, Tiago Ant?o wrote: > > On Tue, May 28, 2013 at 5:36 PM, Peter Cock > wrote: >> >> I'm curious if that matters under Windows 8 or not - trying >> the example in the commit comment at the command line >> might be illuminating. > > > I just re-scheduled a testing case and the results were not great... > http://testing.open-bio.org/biopython/builders/Windows%208%20-%20Python%202.7/builds/13/steps/shell/logs/stdio > > I will test this manually and in deep when I arrive home today. I think there are at just two classes of failure, calling applications: test_Application ... FAIL And indexing with Windows newlines (I wonder if the git setup on my Windows XP machine has a different default to yours, meaning I have Unix newlines and you have Windows newlines?): test_SearchIO_blast_tab_index ... FAIL test_SearchIO_blast_xml_index ... FAIL test_SearchIO_exonerate_text_index ... FAIL test_SearchIO_exonerate_vulgar_index ... FAIL test_SearchIO_fasta_m10_index ... FAIL test_SearchIO_hmmer2_text_index ... FAIL test_SearchIO_hmmer3_domtab_index ... FAIL test_SearchIO_hmmer3_tab_index ... FAIL test_SearchIO_hmmer3_text_index ... FAIL Bio.SeqIO docstring test ... FAIL Plus of course the minor issues which I just introduced with the escaping change (commits to follow). Peter From saketkc at gmail.com Tue May 28 17:20:36 2013 From: saketkc at gmail.com (Saket Choudhary) Date: Tue, 28 May 2013 22:50:36 +0530 Subject: [Biopython-dev] Compiling on modern windows (recent mingw) In-Reply-To: References: Message-ID: The constraint for me really is I do not have access to Windows/MAC machines here. Hunting for a Windows machine is possible, besides these I need to validate the _ArgumentList method for windows too On 28 May 2013 22:06, Peter Cock wrote: > On Tue, May 28, 2013 at 3:09 PM, Peter Cock > wrote: > > > > I've got a hunch that spaces in the executable path might explain > > some of these failures - I'm trying a patch for that here. > > Hi Tiago, > > Patch applied to master - this is essential for the rare case of > calling a binary under Unix where the path/filename includes > a space, but appears to be redundant under Windows XP: > > https://github.com/biopython/biopython/commit/815de571b623f1cd3659fe4c80e3917e1a437580 > > I'm curious if that matters under Windows 8 or not - trying > the example in the commit comment at the command line > might be illuminating. > > Peter > > P.S. 
Saket - You might remember I touched on this issue in our > discussion on GitHub about your bwa/samtools wrappers, which > led to this commit keeping self.program_name as the binary only: > > https://github.com/biopython/biopython/commit/ca93be741c8fd9bad67106acb455348251797f3a > From p.j.a.cock at googlemail.com Tue May 28 17:30:47 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 28 May 2013 18:30:47 +0100 Subject: [Biopython-dev] Compiling on modern windows (recent mingw) In-Reply-To: References: Message-ID: On Tue, May 28, 2013 at 6:20 PM, Saket Choudhary wrote: > The constraint for me really is I do not have access to Windows/MAC machines > here. > > Hunting for a Windows machine is possible, besides these I need to validate > the _ArgumentList method for windows too I sympathise - sorting out a (virtual) 64bit Windows machine has been on my TODO list for a while, since right now I don't have access to one. When I started doing Biopython my primary machine was Windows XP. That old laptop has retired and I now mainly use Mac OS X and Linux at work, but I made a point of getting a Windows XP machine setup for development (e.g. the Windows installers are build with this) and for use as one of our nightly build slaves: http://testing.open-bio.org/biopython/buildslaves Regards, Peter From redmine at redmine.open-bio.org Thu May 30 06:32:21 2013 From: redmine at redmine.open-bio.org (redmine at redmine.open-bio.org) Date: Thu, 30 May 2013 06:32:21 +0000 Subject: [Biopython-dev] [Biopython - Bug #3433] (Resolved) MMCIFParser fails on python3 for disordered atoms References: Message-ID: Issue #3433 has been updated by Michiel de Hoon. Status changed from New to Resolved % Done changed from 0 to 100 Patch applied, thanks. ---------------------------------------- Bug #3433: MMCIFParser fails on python3 for disordered atoms https://redmine.open-bio.org/issues/3433 Author: Alexander Campbell Status: Resolved Priority: Normal Assignee: Biopython Dev Mailing List Category: Main Distribution Target version: URL: The new shlex based parser works under python3, but reveals that the changed comparison rules in python3 lead to unhandled exceptions when parsing disordered atoms. Furthermore, it reveals that occupancy and temperature factor attributes of Atom objects were never cast from str to float types when parsed from mmCIF files. The comparison code which raises the exception under python3 is at Atom.py line 333: @if occupancy>self.last_occupancy:@ . The exception can be prevented my modifying MMCIFParser.py to cast occupancy and temperature factor to float. The following patch is a basic copy of the equivalent code in PDBParser.py:
diff --git a/Bio/PDB/MMCIFParser.py b/Bio/PDB/MMCIFParser.py
index 64d16bc..4be6490 100644
--- a/Bio/PDB/MMCIFParser.py
+++ b/Bio/PDB/MMCIFParser.py
@@ -84,8 +84,15 @@ class MMCIFParser(object):
                 altloc=" "
             resseq=seq_id_list[i]
             name=atom_id_list[i]
-            tempfactor=b_factor_list[i]
-            occupancy=occupancy_list[i]
+            # occupancy & B factor
+            try:
+                tempfactor=float(b_factor_list[i])
+            except ValueError:
+                raise PDBConstructionException("Invalid or missing B factor")
+            try:
+                occupancy=float(occupancy_list[i])
+            except ValueError:
+                raise PDBConstructionException("Invalid or missing occupancy")
             fieldname=fieldname_list[i]
             if fieldname=="HETATM":
                 hetatm_flag="H"

This patch was tested with the "mmCIF file for PDB structure 3u8h":http://www.rcsb.org/pdb/download/downloadFile.do?fileFormat=cif&compression=NO&structureId=3U8H , which would cause the mmCIF parsing exception under python3.2. After the patch, there were no exceptions during parsing and the occupancy and bfactor attributes had the correct type (float). The patch was also tested under python2.7, which worked just fine and also showed the correct types. I haven't tested earlier versions of python2, but the simple syntax ought to work. Could a dev apply this patch? Or better yet, suggest a patch for casting the types at the StructureBuilder level, which would make such things independent of the specific parser used. This is just a minimal-quickfix patch, but I'm sure a better solution is possible. -- You have received this notification because you have either subscribed to it, or are involved in it. To change your notification preferences, please click here and login: http://redmine.open-bio.org From p.j.a.cock at googlemail.com Thu May 30 08:21:31 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 30 May 2013 09:21:31 +0100 Subject: [Biopython-dev] Python 3 and subprocess .communicate() bug Message-ID: Hi Tiago, We'd been talking briefly off-list about the recent buildbot failures under Python 3 where the recent change to using subprocess in the PopGen module was causing failures. Sadly while it seems to work on Python 3.1 and 3.2 my suggestion to try using bytes with the communicate call fails on Python 3.3 and under Windows: https://github.com/biopython/biopython/commit/912692ee2b57e8c075ba38bdf814c9dbe4f5cdb9 e.g. After the change to use bytes, http://testing.open-bio.org/biopython/builders/Linux%2064%20-%20Python%203.3/builds/202 http://testing.open-bio.org/biopython/builders/Windows%20XP%20-%20Python%203.1/builds/816 http://testing.open-bio.org/biopython/builders/Windows%20XP%20-%20Python%203.2/builds/680 http://testing.open-bio.org/biopython/builders/Windows%20XP%20-%20Python%203.3/builds/206 This appears to be a known bug in the subprocess module, http://bugs.python.org/issue16903 which should be fixed in Python 3.2.4 and Python 3.3. It appears not to have been fixed on Python 3.1. I see two options, Option One, revert that commit (i.e. send unicode strings as before, not bytes). This will work on Python 3.2.4+ onwards including Windows. It will fail on Python 3.1 and out of date Python 3.2 through 3.2.3 releases. Option Two, don't use universal_newlines=True which then requires us to use byte strings for all the stdin, stdout and stderr processing. More work, but it should in principle work on old and new Python 3 releases. Note that while we're not seeing any problems yet, I suspect this issue would affect our Bio.Application wrappers __call__ function as well when used to send data to stdin. Here again we could switch to using bytes and universal_newlines=False and do any bytes/unicode handling within the __call_ function, on just insist on a fixed version of Python. If we decide to recommend at least Python 3.2.4 (when using Python 3), then we could add a warning to the relevant modules to catch this issue? What do people think? Regards, Peter From tiagoantao at gmail.com Thu May 30 08:28:04 2013 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Thu, 30 May 2013 09:28:04 +0100 Subject: [Biopython-dev] Python 3 and subprocess .communicate() bug In-Reply-To: References: Message-ID: I was having a look at the issue precisely now. 
I do not have a cast opinion on the issue, I think it all boils down on how many people are dependent on 3.2.3 and prior 3s. In theory I would prefer not to have workarounds for implementation bugs (as makes things more complex to manage in the long-run), but if many people are using buggy 3.x, I see no option... I simply do not have any view on how many people would be using these... From p.j.a.cock at googlemail.com Thu May 30 08:34:15 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 30 May 2013 09:34:15 +0100 Subject: [Biopython-dev] Python 3 and subprocess .communicate() bug In-Reply-To: References: Message-ID: On Thu, May 30, 2013 at 9:28 AM, Tiago Ant?o wrote: > I was having a look at the issue precisely now. > > I do not have a cast opinion on the issue, I think it all boils down on how > many people are dependent on 3.2.3 and prior 3s. > > In theory I would prefer not to have workarounds for implementation bugs (as > makes things more complex to manage in the long-run), but if many people are > using buggy 3.x, I see no option... > > I simply do not have any view on how many people would be using these... > Since till now we've not officially supported Python 3, but plan to start doing so for the forthcoming Biopython 1.62 release, so we could just set a minimum version of 3.2.4 (with Python 3.3 being our current recommendation). However, that may be a problem for some current Linux distributions still shipping older versions? Peter From tiagoantao at gmail.com Thu May 30 08:41:27 2013 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Thu, 30 May 2013 09:41:27 +0100 Subject: [Biopython-dev] Python 3 and subprocess .communicate() bug In-Reply-To: References: Message-ID: On Thu, May 30, 2013 at 9:34 AM, Peter Cock wrote: > However, that may be a problem for some current Linux > distributions still shipping older versions? > > > I suppose people could revert to Python 2 in that case? [Do not get me wrong, I really have no strong feelings either way] -- ?Grant me chastity and continence, but not yet? - St Augustine From p.j.a.cock at googlemail.com Thu May 30 11:37:51 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 30 May 2013 12:37:51 +0100 Subject: [Biopython-dev] Python 3 and subprocess .communicate() bug In-Reply-To: References: Message-ID: On Thu, May 30, 2013 at 9:41 AM, Tiago Ant?o wrote: > > On Thu, May 30, 2013 at 9:34 AM, Peter Cock > wrote: >> >> However, that may be a problem for some current Linux >> distributions still shipping older versions? > > I suppose people could revert to Python 2 in that case? [Do not get me > wrong, I really have no strong feelings either way] > I guess we should do a brief survey on the main list of Python 3 versions people have installed, if any. In the meantime, I reverted that commit so the tests should now pass under Python 3.2.4+ and Python 3.3. https://github.com/biopython/biopython/commit/285988b1b5227b591bd2fed379e36db3a157eca2 Peter From tiagoantao at gmail.com Thu May 30 11:40:27 2013 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Thu, 30 May 2013 12:40:27 +0100 Subject: [Biopython-dev] Python 3 and subprocess .communicate() bug In-Reply-To: References: Message-ID: > I guess we should do a brief survey on the main list of Python 3 versions > people have installed, if any. 
> > > +1 From p.j.a.cock at googlemail.com Thu May 30 11:47:33 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 30 May 2013 12:47:33 +0100 Subject: [Biopython-dev] Python 3 and subprocess .communicate() bug In-Reply-To: References: Message-ID: On Thu, May 30, 2013 at 12:40 PM, Tiago Ant?o wrote: > >> I guess we should do a brief survey on the main list of Python 3 versions >> people have installed, if any. >> >> > > +1 Agreed, http://lists.open-bio.org/pipermail/biopython/2013-May/008598.html Peter From p.j.a.cock at googlemail.com Thu May 30 13:33:22 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 30 May 2013 14:33:22 +0100 Subject: [Biopython-dev] Python 2 and 3 migration thoughts Message-ID: Splitting off from this thread: http://lists.open-bio.org/pipermail/biopython/2013-May/008601.html On Thu, May 30, 2013 at 2:13 PM, Peter Cock wrote: > Thank you for all the comments so far, don't stop yet :) > > On Thu, May 30, 2013 at 1:51 PM, Wibowo Arindrarto > wrote: >> Hi everyone, >> >> I'm leaning towards insisting on Python >=3.3 support (I'm running >> 3.3.2). I suppose that even if Python3.3 is not available on a machine >> or through the default package manager, it's always installable on its >> own. If that's not the case, I imagine Python2.x is most likely >> present in these machines (so Biopython can still be used). > > True. > > So far everyone who has replied (including some off list) have said > they are using Python 3.3 which is encouraging. Thank you for > the comments so far. > > It looks like we can forget about Python 3.1, and just need to > decide if it is worth including Python 3.2.5 in the short term. > >> On a related note, do we have a defined timeline on when we >> would drop support for Python2.x? Are there any plans to have >> our codebase written in Python3.x instead of Python2.x? > > Nothing concrete planned, no. I'll reply in more detail on the > biopython-dev list as I do have some thoughts about this. Good question Bow, I think people will still be using Python 2 a year or two from now, so we must support both for some time. Biopython 1.62 (next week perhaps?) - Final release with Python 2.5 support - Official support for Python 2.5, 2.6, 2.7 and 3.3 - Possibly official support for Python 3.2.5+ as well? (Exactly which versions of Python 3 we'll include to be decided, see the other thread for that discussion.) Short term we will continue with developing using Python 2 syntax and running 2to3 for Python 3. As far as I know, the reverse process with 3to2 is not well established. If anyone wants to investigate that would be useful as another option. However, dropping Python 2.5 support makes things more flexible... Medium term I believe it would be possible to have a single code base which is both valid Python 2 and 3 at the same time. This may require us to target 2.7 and 3.3+ only - we'll have to try it and see if Python 2.6 will hold us back. I've actually done this with lzma.backports, a small but non-trivial module with Python and C code: https://pypi.python.org/pypi/backports.lzma/ https://github.com/peterjc/backports.lzma Python 3.3 reintroduces some features designed to make this more straightforward, like unicode literals (missing in the early versions of Python 3). This is why I'd like to drop Python 3.2 as soon as possible. What I was thinking is we can start migrating modules on a case by case basis from "Python 2 syntax" to "Dual syntax" one by one, with a white-list in the do2to3.py script. 
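To give a flavour of the "dual syntax" style (a throwaway sketch, not code from our tree), this sort of thing already runs unchanged on Python 2.6, 2.7 and 3.3:

from __future__ import print_function
try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3
handle = StringIO(">demo\nACGT\n")
for line in handle:
    print(line.strip())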
That way over time less and less modules need to be converted via 2to3, and "python3 setup.py install" will get faster, until eventually we can stop using 2to3 at all. This conversion could consider the code and doctests separately. However, using using print(example) we can hopefully get most of the doctests and Tutorial examples to work under both Python 2 and 3 at the same time. That's my current thinking anyway - and I think the fact that it would be a gradual migration from writing Python 2 specific code to writing dual 2/3 code makes it low risk (as long as we're continuing to run regular testing). Regards, Peter From p.j.a.cock at googlemail.com Thu May 30 14:23:01 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 30 May 2013 15:23:01 +0100 Subject: [Biopython-dev] HMMER3.1 beta test 1 released Message-ID: Hi Bow, Just FYI, see http://selab.janelia.org/people/eddys/blog/?p=759 "The programs phmmer, hmmsearch, and hmmscan offer a new tabular output format for easier automated parsing, --pfamtblout. his format is the one used internally by Pfam, but we make it more broadly available in case it is of use elsewhere. An analagous output format is available for nhmmer and nhmmscan, --dfamtblout." Something to consider for SearchIO later on... Regards, Peter From w.arindrarto at gmail.com Thu May 30 14:50:24 2013 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Thu, 30 May 2013 16:50:24 +0200 Subject: [Biopython-dev] HMMER3.1 beta test 1 released In-Reply-To: References: Message-ID: Hi Peter, Thanks for the heads-up. This just showed up in my feed as well. I've been waiting for the official release (since they first mentioned it some monts ago). I'll follow up on this slowly :).. Best regards, Bow On Thu, May 30, 2013 at 4:23 PM, Peter Cock wrote: > Hi Bow, > > Just FYI, see http://selab.janelia.org/people/eddys/blog/?p=759 > > "The programs phmmer, hmmsearch, and hmmscan offer a new > tabular output format for easier automated parsing, --pfamtblout. > his format is the one used internally by Pfam, but we make it more > broadly available in case it is of use elsewhere. An analagous > output format is available for nhmmer and nhmmscan, --dfamtblout." > > Something to consider for SearchIO later on... > > Regards, > > Peter From rz1991 at foxmail.com Thu May 30 15:37:00 2013 From: rz1991 at foxmail.com (=?gb18030?B?yO7vow==?=) Date: Thu, 30 May 2013 23:37:00 +0800 Subject: [Biopython-dev] GSoC 2013 Student Self-introduction Message-ID: Hi Everyone, This is Zheng Ruan, a first year graduate students at the University of Georgia. I'm happy to be chosen to participate in GSoC this year. My project is "Codon Alignment and Analysis in Biopython" and I will be working with Eric Talevich and Peter Cock during the summer. My undergraduate major is biotechnology and now seeking for a PhD in bioinformatics. I hope to improve my python programming skills during the project and make long term contribution to biopython. I will follow the timeline of my proposal in the Community Bounding Period these days (http://www.google-melange.com/gsoc/proposal/review/google/gsoc2013/rzzmh12345/1). Thanks! 
Best, Ruan From p.j.a.cock at googlemail.com Thu May 30 16:18:41 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 30 May 2013 17:18:41 +0100 Subject: [Biopython-dev] Biopython projects with NESCent for GSoC 2013 In-Reply-To: References: Message-ID: Dear all, After the disappointing news that the Open Bioinformatics Foundation (OBF) was not accepted as a Google Summer of Code (GSoC) organisation this year, Biopython was fortunate to once again offer some projects with the NESCent team: http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2013 As always the student proposals have been very competitive, and we've not been able to take on everyone. This year NESCent was fortunately to be able to accept seven students through GSoC and one through the GNOME Outreach Program for Women. Two of these GSoC projects are Biopython related: Codon Alignment and Analysis in Biopython Student: Zheng Ruan Mentors: Eric Talevich, Peter Cock http://www.google-melange.com/gsoc/project/google/gsoc2013/rzzmh12345/32001 Phylogenetics in Biopython: Filling in the gaps Student: Yanbo Ye http://www.google-melange.com/gsoc/project/google/gsoc2013/yeyanbo/45001 Mentors: Mark Holder, Jeet Sukumaran, Eric Talevich Thank you NESCent, and congratulations to Zheng Ruan and Yanbo Ye! I'm hoping you're already setting up a blog, which I hope you'll be able to use for roughly weekly progress reports during the summer - CC'd to the biopython-dev mailing list and the NESCent Phyloinformatics Summer of Code forum on Google+, http://lists.open-bio.org/mailman/listinfo/biopython-dev https://plus.google.com/communities/105828320619238393015 An introduction to your project would be a great idea for your first post - here's Bow's from last year as an example: http://bow.web.id/blog/2012/04/google-summer-of-code-is-on/ http://bow.web.id/blog/2012/08/summers-over/ http://bow.web.id/blog/tag/gsoc/ The idea here is to keep the wider community informed about how your project is going. On behalf of the Biopython developers, congratulations! We're looking forward to another productive Summer of Code :) Peter From p.j.a.cock at googlemail.com Fri May 31 09:04:28 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 31 May 2013 10:04:28 +0100 Subject: [Biopython-dev] Python 3 and subprocess .communicate() bug In-Reply-To: References: Message-ID: On Thu, May 30, 2013 at 9:34 AM, Peter Cock wrote: > On Thu, May 30, 2013 at 9:28 AM, Tiago Ant?o wrote: >> I was having a look at the issue precisely now. >> >> I do not have a cast opinion on the issue, I think it all boils down on how >> many people are dependent on 3.2.3 and prior 3s. >> >> In theory I would prefer not to have workarounds for implementation bugs (as >> makes things more complex to manage in the long-run), but if many people are >> using buggy 3.x, I see no option... >> >> I simply do not have any view on how many people would be using these... >> > > Since till now we've not officially supported Python 3, but > plan to start doing so for the forthcoming Biopython 1.62 > release, so we could just set a minimum version of 3.2.4 > (with Python 3.3 being our current recommendation). >From the discussion on the main list, requiring a recent version of Python 3 where this bug is fixed should be fine. 
For now I've added code to skip this test on the older Python 3 releases where the bug exists: https://github.com/biopython/biopython/commit/9c16c09806ca4af84f714662e54c9bd3057b0a52 Once we've settled on the versions to support with the next release we should review what versions we run on the buildbot. Regards, Peter