From bugzilla-daemon at portal.open-bio.org  Mon Feb  1 06:17:58 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 1 Feb 2010 06:17:58 -0500
Subject: [Biopython-dev] [Bug 3004] PSL alignment format parsing
In-Reply-To: <bug-3004-42@http.bugzilla.open-bio.org/>
Message-ID: <201002011117.o11BHwib023118@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3004


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |biopython-
                   |                            |bugzilla at maubp.freeserve.co.
                   |                            |uk
            Summary|PSL alignment format parsing|PSL alignment format parsing
                   |in Bio.AlignIO              |


------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk  2010-02-01 06:17 EST -------
(In reply to comment #2)
> Now on github:
> 
> http://github.com/vforget/PyBLATPSL
> 
> Vince
> 

Thanks for the link.

I don't see how this connects to sequence alignments for Bio.AlignIO as
suggested in your original comment (bug title edited accordingly). I see
you are parsing tabular output into an object, with additional methods for
scores etc. This looks fairly useful, but is not appropriate for the
Bio.AlignIO module. Maybe it can go under a new namespace instead, maybe
Bio.BLAT?

Peter


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org  Mon Feb  1 06:27:00 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 1 Feb 2010 06:27:00 -0500
Subject: [Biopython-dev] [Bug 3000] Could SeqIO.parse() store the whole,
	unparsed multiline entry?
In-Reply-To: <bug-3000-42@http.bugzilla.open-bio.org/>
Message-ID: <201002011127.o11BR0lp023326@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3000


------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk  2010-02-01 06:26 EST -------
(In reply to comment #1)
> (In reply to comment #0)
> > Still, I suspect this will
> > reformat the entry (currently I see trailing dot removed from KEYWORDS, no
> > REFERENCE, AUTHORS, TITLE, JOURNAL, PUBMED; and FEATURES.source being
> > re-ordered).
> 
> Yes, using Bio.SeqIO to read/write a GenBank record will give you (slightly)
> different output. We do not guarantee a 100% round trip (even on simpler
> formats like FASTA). Even little things like line wrapping would make this
> very difficult.
> 
> Regarding GenBank KEYWORDS, please file a bug.

Don't worry about reporting a bug for this, I've just fixed the missing period
for KEYWORDS:

http://github.com/biopython/biopython/commit/5a87b070fc1f4fb911d4cf8a2e53c330cd6bd83d

Peter



From bugzilla-daemon at portal.open-bio.org  Mon Feb  1 08:35:11 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 1 Feb 2010 08:35:11 -0500
Subject: [Biopython-dev] [Bug 2294] Writing GenBank files with Bio.SeqIO
In-Reply-To: <bug-2294-42@http.bugzilla.open-bio.org/>
Message-ID: <201002011335.o11DZBcJ029190@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2294


------- Comment #17 from biopython-bugzilla at maubp.freeserve.co.uk  2010-02-01 08:35 EST -------
(In reply to comment #16)
> 
> > * Writing references
> 
> Not done yet, but for my personal needs this is low priority.

Reference output in GenBank format from SeqIO just committed on github,
http://github.com/biopython/biopython/commit/42707bda738d0239a9ff85a39c39c89c8024549d

> > * Extending to cover writing EBML files
> 
> Not done yet, but should be comparatively straight forward. Let's track this
> possible enhancement on a separate bug.

EMBL output in SeqIO was done a while ago and was included in Biopython 1.52
(although we don't yet write references in EMBL output).

Things still to do on GenBank output include better handling of the LOCUS
line, such as the data division. See also Bug 2578 for the molecule type.



From bugzilla-daemon at portal.open-bio.org  Mon Feb  1 09:43:41 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 1 Feb 2010 09:43:41 -0500
Subject: [Biopython-dev] [Bug 2294] Writing GenBank files with Bio.SeqIO
In-Reply-To: <bug-2294-42@http.bugzilla.open-bio.org/>
Message-ID: <201002011443.o11EhfAT031724@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2294


------- Comment #18 from biopython-bugzilla at maubp.freeserve.co.uk  2010-02-01 09:43 EST -------
(In reply to comment #17)
> 
> EMBL output in SeqIO was done a while ago and was included in Biopython 1.52
> (although we don't yet write references in EMBL output).

References in EMBL output implemented now:
http://github.com/biopython/biopython/commit/370e02053a45aec6209bd826aebab7bfc29d7e84



From biopython at maubp.freeserve.co.uk  Tue Feb  2 13:37:25 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 2 Feb 2010 18:37:25 +0000
Subject: [Biopython-dev] Getting raw unparsed records with SeqIO?
Message-ID: <320fb6e01002021037p75d24f43w84837c426b057093@mail.gmail.com>

Hi all,

Over on enhancement Bug 3000, Martin was asking about
getting raw unparsed strings for each record in a sequence file:
http://bugzilla.open-bio.org/show_bug.cgi?id=3000

This makes sense for sequential files like FASTA and GenBank,
but not for interlaced files like PHYLIP, and has less obvious
uses when there is any kind of header or footer (e.g. XML or
SFF files).

The particular example Martin gave was selecting a subset of
records in a large GenBank file (I've done this myself in the past).
While this can be done via Bio.SeqIO, the process of parsing
the data into a SeqRecord and saving it again is lossy. There
is room for improvement here, but for this particular example I
suggested Martin use the "old" iterator class in Bio.GenBank.

In general things like white space and wrapping mean that a
SeqIO parse/write cannot guarantee a 100% unaltered round
trip, and will also be slower than using the raw record as a string.

Martin suggested adding an optional argument to the parse
function. I'm not sure this is a good API choice, as it would
dramatically alter the return values. Perhaps we could have
a new iterator function in Bio.SeqIO for suitable sequential
files only which returns a series of strings, one for each
record, unmodified?

Either way I don't see how this would be used - surely
the user would need to do some basic analysis of each
raw record to decide how to process it? In this example,
they would need to extract the ID/accession to see if they
want to output the record or not. While parsing the record
into a SeqRecord may not be needed, in most cases the
record identifier would be very useful - and this has some
big overlaps with the Bio.SeqIO.index() code which already
breaks up files into records and extracts their identifiers.

i.e. A top level Bio.SeqIO function to iterate over a file
returning tuples of the record identifier and the raw
record as strings *could* be useful. Implementing this
nicely would mean re-factoring Bio.SeqIO.index()
extensively.
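
To make the "tuples of identifier and raw record" idea concrete, here is a
minimal sketch for FASTA only. It is not Bio.SeqIO code: the function name
iter_raw_fasta is hypothetical, and taking the first word after ">" as the
identifier is an assumption.

```python
import io


def iter_raw_fasta(handle):
    """Yield (identifier, raw_text) for each record in a FASTA handle.

    Hypothetical helper, not part of Bio.SeqIO. The raw text is returned
    exactly as read, including the ">" line and any line wrapping.
    """
    identifier = None
    lines = []
    for line in handle:
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(lines)
            # Assumption: the identifier is the first word after ">".
            words = line[1:].split(None, 1)
            identifier = words[0] if words else ""
            lines = [line]
        elif identifier is not None:
            lines.append(line)
    if identifier is not None:
        yield identifier, "".join(lines)


# Select a subset of records without parsing them into SeqRecord objects.
example = io.StringIO(">alpha first record\nACGT\nAC\n>beta\nGGG\n")
wanted = {"beta"}
selected = [raw for name, raw in iter_raw_fasta(example) if name in wanted]
```

Because the records are never parsed, the selected text can be written
straight back out with no reformatting.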

Another solution to this task (extracting the raw GenBank
records from a large file) would seem to be to extend the
Bio.SeqIO.index functionality. The patch I'm about to
attach to Bug 3000 adds a new "get_raw" method to the
dictionary like object we return. Unlike the __getitem__
and get methods which return a SeqRecord this just gives
the raw string.

Note that I haven't implemented this for all the file formats
supported by the index function yet, and this has had only very
basic testing. Writing this email took longer than writing the
code. However, I hope it illustrates the idea enough for a
discussion. As an example of how the index function could
be used with this patch:

>>> from Bio import SeqIO
>>> data = SeqIO.index("cor6_6.gb", "gb")
>>> data.keys()
['L31939.1', 'AJ237582.1', 'X62281.1', 'AF297471.1', 'X55053.1', 'M81224.1']
>>> print data.get_raw("X62281.1")
LOCUS       ATKIN2        880 bp    DNA             PLN       23-JUL-1992
DEFINITION  A.thaliana kin2 gene.
ACCESSION   X62281
...
//
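
For readers who want to see the mechanism, the following is a rough
standalone sketch of the idea, not the attached patch. It assumes a GenBank
style file where each record starts at a LOCUS line and ends at "//", keys
records by the ACCESSION line (the session above shows versioned ids
instead), and uses a hypothetical class name and made-up toy records.

```python
import os
import tempfile


class RawGenBankIndex:
    """Hypothetical stand-in: map ACCESSION -> (offset, length) in the file."""

    def __init__(self, filename):
        self._filename = filename
        self._offsets = {}
        with open(filename, "rb") as handle:
            start = None
            key = None
            while True:
                position = handle.tell()
                line = handle.readline()
                if not line:
                    break
                if line.startswith(b"LOCUS "):
                    start, key = position, None
                elif line.startswith(b"ACCESSION") and start is not None:
                    fields = line.split()
                    if key is None and len(fields) > 1:
                        key = fields[1].decode("ascii")
                elif line.startswith(b"//") and start is not None:
                    if key is not None:
                        # Record runs from LOCUS up to and including "//".
                        self._offsets[key] = (start, handle.tell() - start)
                    start = None

    def keys(self):
        return list(self._offsets)

    def get_raw(self, key):
        """Return the record's bytes exactly as stored in the file."""
        offset, length = self._offsets[key]
        with open(self._filename, "rb") as handle:
            handle.seek(offset)
            return handle.read(length)


# Toy records, invented purely for this demonstration.
RECORD_ONE = b"LOCUS       TOY0001\nACCESSION   TOY0001\nORIGIN\n        1 acgt\n//\n"
RECORD_TWO = b"LOCUS       TOY0002\nACCESSION   TOY0002\nORIGIN\n        1 ggcc\n//\n"

with tempfile.NamedTemporaryFile(delete=False, suffix=".gb") as tmp:
    tmp.write(RECORD_ONE + RECORD_TWO)
    path = tmp.name

index = RawGenBankIndex(path)
found_keys = index.keys()
raw = index.get_raw("TOY0002")
os.remove(path)
```

Only the offsets are held in memory, so selecting a few records from a
large file means one scan to build the index and one seek per record.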

What are people's thoughts on this?

Peter

From bugzilla-daemon at portal.open-bio.org  Tue Feb  2 13:40:07 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 2 Feb 2010 13:40:07 -0500
Subject: [Biopython-dev] [Bug 3000] Could SeqIO.parse() store the whole,
	unparsed multiline entry?
In-Reply-To: <bug-3000-42@http.bugzilla.open-bio.org/>
Message-ID: <201002021840.o12Ie7pO015898@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3000


------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk  2010-02-02 13:40 EST -------
Created an attachment (id=1436)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1436&action=view)
Adds a get_raw method to the dictionaries returned by Bio.SeqIO.index()

Outline implementation of an alternative proposal, allowing access to the
raw text for each record via the Bio.SeqIO.index() dictionary like objects.
See discussion here:
http://lists.open-bio.org/pipermail/biopython-dev/2010-February/007301.html



From krother at rubor.de  Wed Feb  3 05:29:04 2010
From: krother at rubor.de (Kristian Rother)
Date: Wed, 3 Feb 2010 11:29:04 +0100
Subject: [Biopython-dev] report: what happens on 'from Bio import PDB'?
In-Reply-To: <201002021840.o12Ie7pO015898@portal.open-bio.org>
References: <201002021840.o12Ie7pO015898@portal.open-bio.org>
Message-ID: <18fbb8f40f6ec6efe3d5dffff68aaa57-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlhfRVlcWgFbWw==-webmailer2@server03.webmailer.hosteurope.de>


Hi,

I'm currently checking what my application is using its memory for
(because it uses way too much for non-Biopython related things). However,
as soon as the simple command

from Bio import PDB

is executed, these are the objects that Python has in memory after running
the gc:

1 <class 'codecs.CodecInfo'>
1 <class 'ctypes.PyDLL'>
1 <class 'ctypes._endian._swapped_meta'>
1 <class 'numpy.core.numeric._unspecified'>
1 <class 'numpy.lib._datasource._FileOpeners'>
1 <class 'numpy.lib.index_tricks.CClass'>
1 <class 'numpy.lib.index_tricks.RClass'>
1 <class 'numpy.ma.core.MaskedArray'>
1 <class 'numpy.ma.core._maximum_operation'>
1 <class 'numpy.ma.core._minimum_operation'>
1 <class 'numpy.ma.extras.mr_class'>
1 <class 'random.Random'>
1 <class 'site._Helper'>
1 <class 'string._TemplateMetaclass'>
1 <class 'unittest.TestLoader'>
1 <type 'NoneType'>
1 <type 'NotImplementedType'>
1 <type '_ctypes.ArrayType'>
1 <type '_ctypes.StructType'>
1 <type '_ctypes.UnionType'>
1 <type 'ellipsis'>
1 <type 'exceptions.MemoryError'>
1 <type 'exceptions.RuntimeError'>
1 <type 'mtrand.RandomState'>
1 <type 'numpy.ndarray'>
1 <type 'sys.flags'>
1 <type 'sys.floatinfo'>
1 <type 'thread.lock'>
1 <type 'unicode'>
1 from module:Bio.GenBank.utils
1 from module:Bio.PDB.PDBIO
1 from module:Bio.PropertyManager
1 from module:os
2 <class 'Bio.PropertyManager.CreateDict'>
2 <class 'ctypes.LibraryLoader'>
2 <class 'numpy.lib.index_tricks.IndexExpression'>
2 <class 'numpy.lib.index_tricks.nd_grid'>
2 <class 'site.Quitter'>
2 <type 'bool'>
2 <type 'numpy.bool_'>
2 <type 'numpy.float64'>
2 <type 'object'>
2 from module:Bio.GenBank.LocationParser
2 from module:xml.sax.handler
3 <class 'site._Printer'>
3 <type '_ctypes.PointerType'>
3 <type 'file'>
3 <type 'frame'>
4 <type 'PyCObject'>
5 <class 'ctypes.CFunctionType'>
6 <class 'numpy.core.numerictypes._typedict'>
6 <type 'complex'>
6 from module:numpy.ma.extras
7 <type 'imp.NullImporter'>
7 from module:Bio.Alphabet.IUPAC
7 from module:__future__
8 <type '_ctypes.CFuncPtrType'>
9 <class 'numpy.ma.core._arraymethod'>
10 <type 'classmethod_descriptor'>
13 <type 'Struct'>
14 <type 'frozenset'>
15 <class 'numpy.testing.nosetester.NoseTester'>
16 <class 'abc.ABCMeta'>
16 <type 'staticmethod'>
19 <type 'classmethod'>
27 <type '_ctypes.SimpleType'>
35 <type 'cell'>
35 <type 'operator.itemgetter'>
36 <type 'StgDict'>
38 <type '_sre.SRE_Pattern'>
49 <type 'set'>
56 <type 'long'>
56 from module:Bio.Alphabet
68 <type 'instancemethod'>
76 <type 'numpy.ufunc'>
91 <type 'property'>
95 from module:numpy.ma.core
201 <type 'member_descriptor'>
203 <type 'classobj'>
225 <type 'module'>
350 <type 'type'>
351 from module:Bio.Data.CodonTable
360 <type 'int'>
385 <type 'float'>
393 <type 'getset_descriptor'>
407 <type 'weakref'>
579 <type 'method_descriptor'>
837 <type 'builtin_function_or_method'>
1365 <type 'wrapper_descriptor'>
2073 <type 'dict'>
3191 <type 'function'>
3289 <type 'code'>
4099 <type 'list'>
11989 <type 'tuple'>
19718 <type 'str'>
total 50912
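
The tool that produced the listing above isn't named, so the following is
only one possible way to get a similar tally with the standard library,
written for a current Python 3. Note that gc.get_objects() only reports
objects the collector tracks (containers, functions, classes and so on),
so plain strings and numbers will not show up as they do above. The json
import is just a stand-in for the module being measured.

```python
import gc
from collections import Counter


def count_objects_by_type():
    """Tally the objects tracked by the garbage collector, keyed by type."""
    gc.collect()  # drop unreachable objects before counting
    counts = Counter()
    for obj in gc.get_objects():
        kind = type(obj)
        counts[f"{kind.__module__}.{kind.__qualname__}"] += 1
    return counts


before = count_objects_by_type()
import json  # stand-in for the import being measured, e.g. a Bio module

after = count_objects_by_type()

# Counter subtraction keeps only the types whose count went up.
growth = after - before
for name, extra in sorted(growth.items(), key=lambda item: item[1]):
    print(extra, name)
print("total", sum(growth.values()))
```

Comparing a snapshot taken before the import with one taken after
separates what the import added from what the interpreter already held.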

Hope this is useful ;-)

Best Regards,
   Kristian


From lplp90 at gmail.com  Wed Feb  3 06:35:49 2010
From: lplp90 at gmail.com (Laura Padioleu)
Date: Wed, 3 Feb 2010 12:35:49 +0100
Subject: [Biopython-dev]  Multiple alignment - Clustalw etc...
Message-ID: <771202501002030335r52f62fccta1b976f9e586d435@mail.gmail.com>

On Mon, Mar 30, 2009 at 12:42 PM, Cymon Cox <cy at cymon.org> wrote:
>
> Hi Folks,
>
> this is a demo that I use to create then align my fasta sequences
> using clustalw. Hope it helps.
> here's the code
>
>def clustal(list_struc):
>
>
>	hash_table={}
>	for i in range (len(list_struc)):
>		for j in range (i+1,len(list_struc)):
>			pair=(list_struc[i],list_struc[j])
>			hash_table[pair]=0
>
>
>	for pair in hash_table.keys():
>		fasta_fic=open("fasta.fasta",'w')
>		for ID in pair:
>			fasta_fic.write(">"+ID.get_id()+'\n')
>
>			# recuperation des sequences des acides amines
>			for chain in ID.get_chains():
>       			ppb = PPBuilder()
>
>        			pp = ppb.build_peptides(chain)
>				# l'ajout des sequences aux fichiers fasta
>				fasta_fic.write(pp[0].get_sequence().tostring())
>			fasta_fic.write('\n')
>		fasta_fic.close()
>		cline = ClustalwCommandline(cmd="clustalw", infile="file.fasta")
>		return_code = subprocess.call(str(cline), shell=(sys.platform!="win32"))
>		
>		alignment = AlignIO.read(open("file"+str(nb)+".aln"),"clustal")
>
>		
>		j=0
>		i=0
>		for record in alignment:
>   			for amino_acid in record.seq:
>       			if amino_acid == '-':
>          				pass
>        			else:
>            				if amino_acid == alignment[0].seq[j]:
>                				i += 1
>        			j += 1
>    			j = 0
>    			seq = str(record.seq)
>    			gap_strip = seq.replace('-', '')
>    			percent = 100.0*i/len(seq)
>    			
>    			i=0
>			hash_table[pair]=str(percent)+"\t"+str(percent2)
>
>		
>	return hash_table
>
>def csv_writer(list_struc):
>	hash_table=clustal(list_struc)
>	csv_fic=open("file.csv",'a')
>	for couple in hash_table.keys():
>		csv_fic.write(pari[0].get_id()+"\t"+str(hash_table[pair])+'\n')
>	csv_fic.close()


Hello,

im using python version 2.5 but i can't compile this code correctly
what version of python and biopython you are using ?
Thanks

From chapmanb at 50mail.com  Wed Feb  3 07:46:48 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 3 Feb 2010 07:46:48 -0500
Subject: [Biopython-dev] Multiple alignment - Clustalw etc...
In-Reply-To: <771202501002030335r52f62fccta1b976f9e586d435@mail.gmail.com>
References: <771202501002030335r52f62fccta1b976f9e586d435@mail.gmail.com>
Message-ID: <20100203124648.GC40046@sobchak.mgh.harvard.edu>

Hi Laura;

[clustalw example from Cymon]

> im using python version 2.5 but i can't compile this code correctly
> what version of python and biopython you are using ?

We could help more with some additional information. Could you copy
and paste the error message you are seeing?

Brad

From cy at cymon.org  Wed Feb  3 07:48:49 2010
From: cy at cymon.org (Cymon Cox)
Date: Wed, 3 Feb 2010 12:48:49 +0000
Subject: [Biopython-dev]  Multiple alignment - Clustalw etc...
In-Reply-To: <7265d4f1002030412l1258237jf50ff37845e7c5a5@mail.gmail.com>
References: <771202501002030335r52f62fccta1b976f9e586d435@mail.gmail.com> 
	<7265d4f1002030412l1258237jf50ff37845e7c5a5@mail.gmail.com>
Message-ID: <7265d4f1002030448n28065ea1ifc411cf0c7b462e8@mail.gmail.com>

---------- Forwarded message ----------
From: Cymon Cox <cy at cymon.org>
Date: 3 February 2010 12:12
Subject: Re: [Biopython-dev] Multiple alignment - Clustalw etc...
To: Laura Padioleu <lplp90 at gmail.com>


Hi Laura,

On 3 February 2010 11:35, Laura Padioleu <lplp90 at gmail.com> wrote:

> On Mon, Mar 30, 2009 at 12:42 PM, Cymon Cox <cy at cymon.org> wrote:
> >
> > Hi Folks,
>

Yes, I did write that...

<snip a bunch of code that I didn't write, nor can attribute to anyone...>

> Hello,
>
> im using python version 2.5 but i can't compile this code correctly
> what version of python and biopython you are using ?
>


How exactly are you using this code? What error do you get? Can you cut and
paste a session from the terminal?

Cheers, C.

--

From chapmanb at 50mail.com  Wed Feb  3 07:55:52 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 3 Feb 2010 07:55:52 -0500
Subject: [Biopython-dev] Getting raw unparsed records with SeqIO?
In-Reply-To: <320fb6e01002021037p75d24f43w84837c426b057093@mail.gmail.com>
References: <320fb6e01002021037p75d24f43w84837c426b057093@mail.gmail.com>
Message-ID: <20100203125552.GD40046@sobchak.mgh.harvard.edu>

Hi Peter;

> Another solution to this task (extracting the raw GenBank
> records from a large file) would seem to be to extend the
> Bio.SeqIO.index functionality. The patch I'm about to
> attach to Bug 3000 adds a new "get_raw" method to the
> dictionary like object we return. Unlike the __getitem__
> and get methods which return a SeqRecord this just gives
> the raw string.
[...]
> >>> from Bio import SeqIO
> >>> data = SeqIO.index("cor6_6.gb", "gb")
> >>> data.keys()
> ['L31939.1', 'AJ237582.1', 'X62281.1', 'AF297471.1', 'X55053.1', 'M81224.1']
> >>> print data.get_raw("X62281.1")
> LOCUS       ATKIN2        880 bp    DNA             PLN       23-JUL-1992
> DEFINITION  A.thaliana kin2 gene.
> ACCESSION   X62281
> ...
> //
> 
> What are people's thoughts on this?

Not much to add, but a +1 from me. This sounds like a solid solution
and makes sense for the use case I can think of, which is picking
out records of interest from a large file and re-writing them in a
smaller file.

Brad

From bugzilla-daemon at portal.open-bio.org  Wed Feb  3 16:44:14 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 3 Feb 2010 16:44:14 -0500
Subject: [Biopython-dev] [Bug 1999] new frame translation method
In-Reply-To: <bug-1999-42@http.bugzilla.open-bio.org/>
Message-ID: <201002032144.o13LiERA027299@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1999


------- Comment #3 from eric.talevich at gmail.com  2010-02-03 16:44 EST -------
Can we split this into two functions? I tried this function today, hoping it
would help me get a list of ORFs from a big contig -- but both
frameTranslations and six_frame_translation do two things without stopping in
between:

1. Translate the DNA or RNA sequence to amino acids in all six frames
2. Pretty-print the six-frame translation


So, how about factoring out just this piece (or similar):

def translate_six_frames(seq, genetic_code=1):
    """Dictionary of 6-frame translations."""
    anti = seq.reverse_complement()
    frames = {}
    for i in range(0,3):
        frames[i+1]  = seq[i:].translate(genetic_code)
        frames[-i-1] = SeqUtils.reverse(anti[i:].translate(genetic_code))
    return frames


Then either pretty-printer can call this internally, and the user also has
access to the individual translated sequences.



From biopython at maubp.freeserve.co.uk  Wed Feb  3 18:13:10 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 3 Feb 2010 23:13:10 +0000
Subject: [Biopython-dev] [Bug 3000] Could SeqIO.parse() store the whole,
	unparsed 	multiline entry?
In-Reply-To: <4B6995D0.3030405@fold.natur.cuni.cz>
References: <201002021840.o12Ie88i015906@portal.open-bio.org>
	<4B6995D0.3030405@fold.natur.cuni.cz>
Message-ID: <320fb6e01002031513r1faac5faicf027daf5da77d80@mail.gmail.com>

On Wed, Feb 3, 2010 at 3:27 PM, Martin MOKREJŠ
<mmokrejs at fold.natur.cuni.cz> wrote:
>
> Hi Peter,
>  thank you very much for all your efforts. I will try to get to testing the cvs
> code in few days. Definitely will keep you updated. ;)
> Martin
>
> bugzilla-daemon at portal.open-bio.org wrote:
>> http://bugzilla.open-bio.org/show_bug.cgi?id=3000
>> ...

The patch hasn't been checked in, but should apply to either the
master branch on github or (I expect) Biopython 1.53.

I'm looking forward to feedback.

Peter


From bugzilla-daemon at portal.open-bio.org  Thu Feb  4 10:20:51 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 4 Feb 2010 10:20:51 -0500
Subject: [Biopython-dev] [Bug 1999] new frame translation method
In-Reply-To: <bug-1999-42@http.bugzilla.open-bio.org/>
Message-ID: <201002041520.o14FKp9j000360@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1999


------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk  2010-02-04 10:20 EST -------
(In reply to comment #3)
> Can we split this into two functions? I tried this function today, hoping it
> would help me get a list of ORFs from a big contig -- but both
> frameTranslations and six_frame_translation do two things without stopping in
> between:
> 
> 1. Translate the DNA or RNA sequence to amino acids in all six frames

I'd wondered about this - possibly as a generator/iterator which always gives
back exactly six sequences - but don't really see much point. There is also
going to be some debate about how frames are labelled (especially the minus
frames).

> 2. Pretty-print the six-frame translation

Personally I don't see this as being very useful, but someone must like it.
I lean to just deprecating and removing this code.

> So, how about factoring out just this piece (or similar):
> 
> def translate_six_frames(seq, genetic_code=1):
>     """Dictionary of 6-frame translations."""
>     anti = seq.reverse_complement()
>     frames = {}
>     for i in range(0,3):
>         frames[i+1]  = seq[i:].translate(genetic_code)
>         frames[-i-1] = SeqUtils.reverse(anti[i:].translate(genetic_code))
>     return frames

You should be taking the reverse complement, not just the reverse. This
would just be seq[i:].reverse_complement() or seq.reverse_complement()[i:]
depending on how you label the reverse frames. 
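
As a self-contained illustration of the second labelling, here is a small
sketch that does not use Biopython at all. It assumes plain DNA strings,
the standard genetic code only, and that trailing partial codons are
dropped. Frames -1 to -3 are read along the reverse complement, and the
function names are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement each base, then reverse the string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate whole codons only; a trailing partial codon is ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def translate_six_frames(dna):
    """Return {frame: protein} for frames 1, 2, 3 and -1, -2, -3.

    Assumption: frame -n starts n-1 bases into the reverse complement.
    Letters other than A, C, G and T raise KeyError in this sketch.
    """
    dna = dna.upper()
    anti = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(anti[offset:])
    return frames


frames = translate_six_frames("ATGGCC")
```

Returning a dictionary keeps the translation separate from any
pretty-printing, which was the point of the request in comment #3.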

Peter



From biopython at maubp.freeserve.co.uk  Thu Feb  4 10:30:47 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 4 Feb 2010 15:30:47 +0000
Subject: [Biopython-dev] Getting raw unparsed records with SeqIO?
In-Reply-To: <20100203125552.GD40046@sobchak.mgh.harvard.edu>
References: <320fb6e01002021037p75d24f43w84837c426b057093@mail.gmail.com>
	<20100203125552.GD40046@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e01002040730q7835e1a1uc784dfaae5faaef2@mail.gmail.com>

On Wed, Feb 3, 2010 at 12:55 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
>
> Not much to add, but a +1 from me. This sounds like a solid solution
> and makes sense for the use case I can think of, which is picking
> out records of interest from a large file and re-writing them in a
> smaller file.
>

Let's give Martin a chance to test with the patch, and see how he gets on.

I'm curious if anyone can come up with other examples of how this could
be applied, which would help justify adding it to Bio.SeqIO.

Peter

From bugzilla-daemon at portal.open-bio.org  Mon Feb  8 12:08:33 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 8 Feb 2010 12:08:33 -0500
Subject: [Biopython-dev] [Bug 3006] New: esearch medline fails with xml
	format
Message-ID: <bug-3006-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=3006

           Summary: esearch medline fails with xml format
           Product: Biopython
           Version: Not Applicable
          Platform: PC
        OS/Version: Windows XP
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: georg.lipps at fhnw.ch


I used to retrieve PubMed records with Python 2.5.1; however, lately efetch
with XML output produces an error. The problem arose around the change of
year, and may be related to the DTD definition file.

Here is a short piece of code which reproduces the error:


from Bio import Entrez
from Bio import Medline


def retrieve_medline(doi):
    # Uses the doi to obtain the medline id and then retrieves the medline entry
    # Returns the medline entry as text and python object or an empty string
    print "...queing medline with DOI", doi
    handle = Entrez.esearch(db="pubmed", term=doi, retmode="XML")
    record=Entrez.read(handle)
    if record["Count"]<>"1":
        return None, None
    handle=Entrez.efetch(db="pubmed", id=record["IdList"], retmode="text",
                         rettype="medline")
    xml=Entrez.efetch(db="pubmed", id=record["IdList"], retmode="XML",
                      rettype="medline")
    return handle.read(), Entrez.read(xml)


doi='10.1038/nature07389'
article, xml=retrieve_medline(doi)
print article


OUTPUT:

Traceback (most recent call last):
  File "U:/Literatur/pdf to RM converter/test.py", line 24, in <module>
    article, xml=retrieve_medline(doi)
  File "U:/Literatur/pdf to RM converter/test.py", line 15, in retrieve_medline
    return handle.read(), Entrez.read(xml)
  File "C:\Program Files\python25\lib\site-packages\Bio\Entrez\__init__.py", line 283, in read
    record = handler.run(handle)
  File "C:\Program Files\python25\lib\site-packages\Bio\Entrez\Parser.py", line 95, in run
    self.parser.ParseFile(handle)
  File "C:\Program Files\python25\lib\site-packages\Bio\Entrez\Parser.py", line 131, in startElement
    return
UnboundLocalError: local variable 'object' referenced before assignment



From bugzilla-daemon at portal.open-bio.org  Mon Feb  8 18:26:38 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 8 Feb 2010 18:26:38 -0500
Subject: [Biopython-dev] [Bug 3006] esearch medline fails with xml format
In-Reply-To: <bug-3006-42@http.bugzilla.open-bio.org/>
Message-ID: <201002082326.o18NQcwP006902@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3006


------- Comment #1 from mdehoon at ims.u-tokyo.ac.jp  2010-02-08 18:26 EST -------
I was not able to replicate this bug. Your example code ran correctly with
Python 2.6, Biopython 1.53. Are you using the latest version of Biopython?



From sandford at ufl.edu  Mon Feb  8 16:49:20 2010
From: sandford at ufl.edu (Michael Sandford)
Date: Mon, 08 Feb 2010 16:49:20 -0500
Subject: [Biopython-dev] Where should feature intersection code go?
Message-ID: <4B7086E0.1090501@ufl.edu>

I'm working on a project that's looking for alternative splicing using 
solexa data instead of microarray data.  Basically we've got a GFF file 
containing all the genes, introns and exons and 35M reads that have been 
placed into one of the various chromosomes via the excellent bowtie 
application out of Maryland.

Bowtie output is documented here:
http://bowtie-bio.sourceforge.net/manual.shtml#default-bowtie-output

In summary it's roughly a cross between fastq and GFF.  It's got the 
read name, strand, sequence the read aligned to, position, sequence, 
quality, and a few others.  It seems like it could rather easily be 
coerced into a SeqRecord 
(http://biopython.org/DIST/docs/api/Bio.SeqRecord.SeqRecord-class.html).  
It might not get filled up completely, but it'd be better than handling 
things in a one-off way.

The FeatureLocation class provides for approximate and exact locations 
(both start and stop positions).  It seems like the correct location to 
put code that determines if two FeatureLocations overlap, or if one 
contains another, or is contained by another. 
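
As a rough illustration of the comparison code described above, here is a
minimal sketch. It assumes exact, half-open integer coordinates; the Span
class is a hypothetical stand-in, not Biopython's FeatureLocation, and
approximate (fuzzy) positions are not handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for an exact location: half-open [start, end)."""
    start: int
    end: int


def overlaps(a: Span, b: Span) -> bool:
    # Two half-open intervals overlap when each starts before the other ends.
    return a.start < b.end and b.start < a.end


def contains(outer: Span, inner: Span) -> bool:
    # Every position of inner lies within outer.
    return outer.start <= inner.start and inner.end <= outer.end


exon = Span(100, 200)   # made-up exon coordinates
read = Span(190, 225)   # made-up 35 bp read
print(overlaps(exon, read), contains(exon, read))
```

With half-open coordinates, two spans that merely touch (one ends where the
other starts) do not count as overlapping, which is usually what is wanted
for adjacent exons.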

Overall I'm talking about writing a bowtie .map parser and the 
comparison code for FeatureLocation.  Would these be welcome features?

Thanks,
Mike


From chapmanb at 50mail.com  Mon Feb  8 20:04:25 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 8 Feb 2010 20:04:25 -0500
Subject: [Biopython-dev] Where should feature intersection code go?
In-Reply-To: <4B7086E0.1090501@ufl.edu>
References: <4B7086E0.1090501@ufl.edu>
Message-ID: <20100209010425.GD2193@kunkel>

Mike;

> I'm working on a project that's looking for alternative splicing
> using solexa data instead of microarray data.  Basically we've got a
> GFF file containing all the genes, introns and exons and 35M reads
> that have been placed into one of the various chromosomes via the
> excellent bowtie application out of Maryland.

[...]

> Overall I'm talking about writing a bowtie .map parser and the
> comparison code for FeatureLocation.  Would these be welcome
> features?

A .map parser would definitely be useful. Another suggestion is to
get Bowtie to produce SAM format and use Pysam for parsing:

http://code.google.com/p/pysam/

The advantage of SAM is that it's an emerging standard and a lot of
downstream applications can use it. This way you can switch aligners
in your workflow without much disruption.

For doing feature overlaps, IntervalTree in bx-python is excellent:

http://bitbucket.org/james_taylor/bx-python/wiki/Home
http://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/intervals/intersection.pyx

See the doc string of the IntervalTree class for how to use it. My
normal workflow is to build an IntervalTree with the GFF features of
your genome, and then loop through the alignment file finding
features that each alignment intersects.
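
The workflow above might be sketched roughly as follows. To keep the example
self-contained, a plain dict of lists stands in for bx-python's IntervalTree,
so each lookup is a linear scan rather than a tree query. The function names
and the toy coordinates are made up for illustration.

```python
from collections import defaultdict


def build_index(features):
    """Group (chrom, start, end, name) features by chromosome.

    Stand-in for an interval tree per chromosome; a real tree would
    answer each query much faster than the scan in features_hit().
    """
    index = defaultdict(list)
    for chrom, start, end, name in features:
        index[chrom].append((start, end, name))
    return index


def features_hit(index, chrom, start, end):
    """Return names of features overlapping the half-open range [start, end)."""
    return [name for f_start, f_end, name in index.get(chrom, [])
            if f_start < end and start < f_end]


# Hypothetical toy data: two exons and three aligned reads.
features = [("chr1", 100, 200, "exon1"), ("chr1", 300, 400, "exon2")]
alignments = [("read1", "chr1", 150, 185),
              ("read2", "chr1", 250, 285),
              ("read3", "chr2", 150, 185)]

index = build_index(features)
hits = {read: features_hit(index, chrom, start, end)
        for read, chrom, start, end in alignments}
print(hits)
```

The index is built once from the annotation and then reused for every
alignment, which is the point of the approach when there are millions of reads.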

For alternative splicing, are you using the raw genome or a built
transcriptome for all possible combinations of exons? One practical
thing to consider is that a read will not be aligned to the genome
if it splits an exon/exon junction.

Hope this helps,
Brad

From bugzilla-daemon at portal.open-bio.org  Tue Feb  9 20:42:28 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 9 Feb 2010 20:42:28 -0500
Subject: [Biopython-dev] [Bug 2294] Writing GenBank files with Bio.SeqIO
In-Reply-To: <bug-2294-42@http.bugzilla.open-bio.org/>
Message-ID: <201002100142.o1A1gSJJ022517@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2294


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #19 from biopython-bugzilla at maubp.freeserve.co.uk  2010-02-09 20:42 EST -------
(In reply to comment #17)
> 
> Things still to do on GenBank output include better handling of the LOCUS
> line, such as the data division. See also Bug 2578 for the molecule type.
> 

I've added mappings for some EMBL divisions to suitable GenBank divisions.

I'm closing this bug now, as GenBank output does basically work now.



From rjalves at igc.gulbenkian.pt  Wed Feb 10 13:30:05 2010
From: rjalves at igc.gulbenkian.pt (Renato Alves)
Date: Wed, 10 Feb 2010 18:30:05 +0000
Subject: [Biopython-dev] KEGG support
Message-ID: <4B72FB2D.4070808@igc.gulbenkian.pt>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi everyone,

KEGG support in Biopython has been mostly untouched for the past 8 years
with only a few changes and test additions. There is code in the tree to
work with the Enzyme and Compound databases but not for others such as
GENES, ORTHOLOGY, DRUG, ...

Considering the fact that I will need to write some code to work with
other formats I was planning to contribute and integrate it with the
SeqIO interface. This will require some additional homework on my part.

KEGG also has a SOAP-based API [1]. Its functionality is in some
respects comparable to NCBI eutils. Using the Python SOAP library suds [2]
I had no problem interacting with it.

So just in case someone was already working on this secretly :) I would
like to know to make my life easier. If not I would also like to know if
you would be interested in the addition and finally what's your thought
about the SOAP interface and the suds (optional) dependency.

Just a word on suds. Even though the project has been around for a few
years now, it's still not available in most Linux distros. In my
personal experience it's probably the simplest and easiest to use
SOAP library for Python out there.

Cheers,
Renato

[1] - http://www.genome.jp/kegg/soap/doc/keggapi_manual.html
[2] - https://fedorahosted.org/suds/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkty+ygACgkQYh11EUYTX9Sb3wCgiQrS/HWOr96CEwHErx+RKBVQ
1VMAn1NOlNr/HZ/rmFuqKTlyOM/pZwqi
=zBxB
-----END PGP SIGNATURE-----

From kellrott at gmail.com  Wed Feb 10 15:12:10 2010
From: kellrott at gmail.com (Kyle)
Date: Wed, 10 Feb 2010 12:12:10 -0800
Subject: [Biopython-dev] KEGG support
In-Reply-To: <4B72FB2D.4070808@igc.gulbenkian.pt>
References: <4B72FB2D.4070808@igc.gulbenkian.pt>
Message-ID: <bb02be081002101212m381f1263r1922d145f61e26fa@mail.gmail.com>

I think external library dependencies should be avoided unless necessary.
Would a tool like wsdl2py produce code that isn't dependent on an installed
library? Alternatively, suds is LGPL-licensed; could we just cannibalize the
source code for the important classes?

Kyle



From dalloliogm at gmail.com  Wed Feb 10 17:13:04 2010
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Wed, 10 Feb 2010 23:13:04 +0100
Subject: [Biopython-dev] KEGG support
In-Reply-To: <4B72FB2D.4070808@igc.gulbenkian.pt>
References: <4B72FB2D.4070808@igc.gulbenkian.pt>
Message-ID: <5aa3b3571002101413o55d04432vc76c230aa9c43252@mail.gmail.com>

On Wed, Feb 10, 2010 at 7:30 PM, Renato Alves <rjalves at igc.gulbenkian.pt> wrote:

> KEGG support in Biopython has been mostly untouched for the past 8 years
> with only a few changes and test additions. There is code in the tree to
> work with the Enzyme and Compound databases but not for others such as
> GENES, ORTHOLOGY, DRUG, ...
>

Hi,
I had a terrible experience parsing KEGG pathway files: in the end I
discovered that the files stored on their FTP site don't correspond
exactly to the diagrams that you can find in the web interface. For
example, biochemical interactions don't have directionality, while if you
look at them on kegg/pathway you will see arrows.

Some time ago I proposed implementing something similar to what you have
described for kegg/pathway, but in the end I abandoned the effort, because I
had problems with both suds and SOAPpy, and I wasn't satisfied with the
annotations in KEGG.


>
> Considering the fact that I will need to write some code to work with
> other formats I was planning to contribute and integrate it with the
> SeqIO interface. This will require some additional homework on my part.
>
>
If you are serious about that I may help you, but I can only work on the
weekends and you should tell me exactly what I have to do :-)

> KEGG also has a SOAP based API [1]. Its functionality could be in some
> aspects compared to NCBI eutils. Using the python SOAP library suds [2]
> I had no problem interacting with it.
>


Are you sure? I tried it on KEGG a year ago and I had problems
executing slightly more complex queries. If you look at the suds bug tracker,
you will find some reports by me, like this one:
- https://fedorahosted.org/suds/ticket/213

I remember that I was looping between the KEGG support centre and the suds
bug tracker; both were very responsive to feedback and very keen to answer
me, but in the end they didn't speak to each other and the bug reports that
I have filed are still unfixed.

Which library can you use for the SOAP queries? I had the feeling that
SOAPpy (which I think is included in the standard lib) worked well with
KEGG; however, its development stopped many years ago (
http://sourceforge.net/projects/pywebsvcs/files/SOAP.py/), it is a mess if
you want to use it behind an http_proxy (I should have a patch somewhere if
you are interested) and I am sure it won't be kept compatible with
future versions of Python.

Another alternative may be beautiful soup, but I have never tried it. This
question on stackoverflow may give you some ideas:
- http://stackoverflow.com/questions/206154/whats-the-best-soap-client-library-for-python-and-where-is-the-documentation-fo

I am not sure which is the standard SOAP library for Python, or which
one is included in the standard lib. If you are going to use SOAPpy, it is a
bad bet for compatibility and maintenance in future releases. Suds
is the best option but it is not in the standard lib, and they still have to
fix the bugs I reported a year ago. I have the feeling that there is
no good alternative for Python.

Moreover, the WSDL functions that I have seen for KEGG are not especially
useful. They seem to allow for basic queries, but for most tasks
it is better to download the FTP data locally and work there.




-- 
Giovanni Dall'Olio, phd student
Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain)

My blog on bioinformatics: http://bioinfoblog.it

From bugzilla-daemon at portal.open-bio.org  Wed Feb 10 17:16:14 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 10 Feb 2010 17:16:14 -0500
Subject: [Biopython-dev] [Bug 3009] New: Check the FASTA m10 alignment
	parser works with FASTA36
Message-ID: <bug-3009-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=3009

           Summary: Check the FASTA m10 alignment parser works with FASTA36
           Product: Biopython
           Version: Not Applicable
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Unit Tests
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk


Bill Pearson has just announced the release of FASTA36:
http://faculty.virginia.edu/wrpearson/fasta/fasta36/

>From his email,
> This version is a major update from FASTA version 35.
> Its main new feature is the ability to report all
> statistically significant alignments between a query
> and library sequence (equivalent to BLAST's multiple
> HSPs).  All previous versions of the FASTA program
> reported only the best alignment between the query
> and library sequence, a serious shortcoming when
> comparing a query protein to a multi-exon gene or
> multi-domain protein.

We need to check the FASTA36 -m 10 output, add this to
our unit tests, and update our parser as required.



From dalloliogm at gmail.com  Wed Feb 10 17:26:08 2010
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Wed, 10 Feb 2010 23:26:08 +0100
Subject: [Biopython-dev] KEGG support
In-Reply-To: <bb02be081002101212m381f1263r1922d145f61e26fa@mail.gmail.com>
References: <4B72FB2D.4070808@igc.gulbenkian.pt>
	<bb02be081002101212m381f1263r1922d145f61e26fa@mail.gmail.com>
Message-ID: <5aa3b3571002101426p7b57f50aga270f0ea7eb8554f@mail.gmail.com>

On Wed, Feb 10, 2010 at 9:12 PM, Kyle <kellrott at gmail.com> wrote:

> I think external library dependencies should be avoided unless necessary.
> Would a tool like wsdl2py produce code that isn't dependent on an
> installed library? Alternatively, suds is LGPL based, could we just
> cannibalize the source code for the important classes?
>

Honestly, I think the best solution would be to make an external module
that extends the basic Biopython and to link to it from the Biopython web page.
The core of Biopython should provide objects and infrastructure for biological
data, while additional functionality should go into separate modules
linked from the Biopython web page, taking inspiration from BioConductor and
installed with easy_install or a derivative.
If we keep the constraint that all Biopython modules must have
the same dependencies, then it is impossible to make anything more complex
than the basic stuff, and Biopython will never be as useful as it could be.
You can't make a good library for using WSDL services with SOAPpy, or plot
nice graphics without matplotlib, or store data in HDF5 format, and there
are many other examples. Bioinformatics is a very broad term, people
working in it have a wide variety of needs, and it is difficult to meet
them all with few dependencies.

-- 
Giovanni Dall'Olio, phd student
Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain)

My blog on bioinformatics: http://bioinfoblog.it

From biopython at maubp.freeserve.co.uk  Wed Feb 10 17:27:07 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 10 Feb 2010 22:27:07 +0000
Subject: [Biopython-dev] KEGG support
In-Reply-To: <4B72FB2D.4070808@igc.gulbenkian.pt>
References: <4B72FB2D.4070808@igc.gulbenkian.pt>
Message-ID: <320fb6e01002101427o5efc003es3402308e44b2533a@mail.gmail.com>

On Wed, Feb 10, 2010 at 6:30 PM, Renato Alves <rjalves at igc.gulbenkian.pt> wrote:
>
> Hi everyone,
>
> KEGG support in Biopython has been mostly untouched for the past 8 years
> with only a few changes and test additions. There is code in the tree to
> work with the Enzyme and Compound databases but not for others such as
> GENES, ORTHOLOGY, DRUG, ...
>
> Considering the fact that I will need to write some code to work with
> other formats I was planning to contribute and integrate it with the
> SeqIO interface. This will require some additional homework on my part.

Excellent news. Have you looked at the existing KEGG parsers in
Biopython, and do you think the current style is suitable? (I haven't
looked at the code recently myself, but will do).

Regarding the SeqIO interface (for KEGG GENES only?), I would be
happy to advise. Initially I suggest you work on adding a parser much
like the other KEGG parsers, returning gene records. Then we can
add a Bio/SeqIO/KeggGeneIO.py wrapper to turn these into SeqRecord
objects.

> KEGG also has a SOAP based API [1]. Its functionality could be in some
> aspects compared to NCBI eutils. Using the python SOAP library suds [2]
> I had no problem interacting with it.

I have not used SOAP, and have a personal preference for REST style
APIs. However, if that is what KEGG offers, this is worth considering.
I think Brad has some experience with (other) SOAP services in Python.
Note the KEGG documentation suggests using SOAPpy for Python.

Interestingly, KEGG are however looking into providing RDF (and
perhaps one day SPARQL endpoints). I will try and find out what sort
of time scale they have in mind while I am at the BioHackathon 2010
this week - http://hackathon3.dbcls.jp/

For now, I would prioritise the KEGG flat file parsers.

Peter

From biopython at maubp.freeserve.co.uk  Wed Feb 10 17:37:03 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 10 Feb 2010 22:37:03 +0000
Subject: [Biopython-dev] KEGG support
In-Reply-To: <bb02be081002101212m381f1263r1922d145f61e26fa@mail.gmail.com>
References: <4B72FB2D.4070808@igc.gulbenkian.pt>
	<bb02be081002101212m381f1263r1922d145f61e26fa@mail.gmail.com>
Message-ID: <320fb6e01002101437p5f382ce0tb4cd864cec59a6c5@mail.gmail.com>

On Wed, Feb 10, 2010 at 8:12 PM, Kyle <kellrott at gmail.com> wrote:
> I think external library dependencies should be avoided unless necessary.
> Would a tool like wsdl2py produce code that isn't dependent on an installed
> library? Alternatively, suds is LGPL based, could we just cannibalize the
> source code for the important classes?

Working with SOAP is so complicated that using an external library
would be the sensible option. It would be an optional dependency
(and would not be an install time dependency like NumPy), much
like how we have an optional dependency on ReportLab just for
Bio.Graphics, and now also the option to use NetworkX with the
new Bio.Phylo code.

Package management (e.g. under Linux distros) can mark these
external modules as suggestions or soft requirements, making
this quite straightforward.

Regarding some of Giovanni's points, modularising the distribution
of Biopython (which can already be considered to be a core plus
assorted domain-specific modules like Bio.PDB, Bio.Cluster,
Bio.Graphics and so on) seems premature to me given the current
state of python distribution.

Peter

P.S. We can't take any GPL or LGPL code and incorporate it into
Biopython, due to the nature of those licences.


From anaryin at gmail.com  Wed Feb 10 17:52:53 2010
From: anaryin at gmail.com (João Rodrigues)
Date: Wed, 10 Feb 2010 14:52:53 -0800
Subject: [Biopython-dev] KEGG support
Message-ID: <b537e3711002101452i73a55414gd44e875261f8e67d@mail.gmail.com>

Hello all,

For what it's worth: I worked with KEGG about a year and a half ago, to do
some very basic things. I remember I tried using SOAPpy and ZSI. The first
is a pain to install on Windows (at least it was then), so I opted for the
second. However, it was quite outdated and I had some problems dealing
with complex data types.

Regarding modularising or not modularising the code, I guess that some
features will have dependencies that cannot be included in the core
distribution, and thus users should be warned that they need library X or
Y for them to work. In short, keeping the current structure seems the
wisest IMO. I don't see much need for creating outer modules.

Lastly, good luck with the speed of KEGG's services. That API is slower than a
turtle :x

From rjalves at igc.gulbenkian.pt  Wed Feb 10 19:44:59 2010
From: rjalves at igc.gulbenkian.pt (Renato Alves)
Date: Thu, 11 Feb 2010 00:44:59 +0000
Subject: [Biopython-dev] KEGG support
In-Reply-To: <320fb6e01002101427o5efc003es3402308e44b2533a@mail.gmail.com>
References: <4B72FB2D.4070808@igc.gulbenkian.pt>
	<320fb6e01002101427o5efc003es3402308e44b2533a@mail.gmail.com>
Message-ID: <4B73530B.7090203@igc.gulbenkian.pt>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

- From Peter on 02/10/2010 10:27 PM:
> Excellent news. Have you looked at the existing KEGG parsers in
> Biopython, and do you think the current style is suitable? (I haven't
> looked at the code recently myself, but will do).

The style seems good enough but I was thinking of having a more
functional approach, at least for the parser, to try to get away from the
massive if/elif/else cascades. The writer would come as a second priority
and would be similar, although I would also try to keep code duplication
lower than what we can see in the Enzyme/__init__.py file. I
would also consider using Genes.py instead of Genes/__init__.py ... I
don't see the need for packages here.
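
One way the "functional" idea above could look is a dispatch table mapping
each keyword to a small handler function. This is only a sketch, not the
existing Biopython KEGG code. It assumes a flat-file layout with a
12-character keyword column, continuation lines with a blank keyword column,
and "///" as the record terminator. The handler names and the sample record
are made up.

```python
def parse_entry(value, record):
    # Keep only the identifier, the first token of the ENTRY line.
    record["entry"] = value.split()[0]


def parse_name(value, record):
    # Assumed here: aliases are separated by commas.
    record["names"] = [n.strip() for n in value.split(",") if n.strip()]


def parse_definition(value, record):
    record["definition"] = value


# Dispatch table: one small function per keyword replaces an if/elif chain.
HANDLERS = {
    "ENTRY": parse_entry,
    "NAME": parse_name,
    "DEFINITION": parse_definition,
}


def parse_record(lines):
    """Parse one flat-file record (a list of lines) into a dict."""
    fields = []  # [keyword, value] pairs in file order
    for line in lines:
        if line.startswith("///"):  # assumed end-of-record marker
            break
        keyword, value = line[:12].strip(), line[12:].strip()
        if keyword:
            fields.append([keyword, value])
        elif fields:
            # Blank keyword column: continuation of the previous field.
            fields[-1][1] += " " + value
    record = {}
    for keyword, value in fields:
        handler = HANDLERS.get(keyword)
        if handler is not None:  # unknown keywords are skipped
            handler(value, record)
    return record


# Made-up record, laid out with a 12-character keyword column.
rows = [("ENTRY", "x0001 CDS"), ("NAME", "geneA, geneB"),
        ("DEFINITION", "made-up protein"), ("", "with a wrapped line"),
        ("ORGANISM", "ignored here")]
sample = ["{:<12}{}".format(k, v) for k, v in rows] + ["///"]
print(parse_record(sample))
```

Supporting a new keyword then means adding one function and one table entry,
and a writer could use a matching table in the other direction.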

> Regarding the SeqIO interface (for KEGG GENES only?), I would be
> happy to advise. Initially I suggest you work on adding a parser much
> like the other KEGG parsers, returning gene records. Then we can
> add a Bio/SeqIO/KeggGeneIO.py wrapper to turn these into SeqRecord
> objects.

Yes for now my main goal would be GENES. The other formats can probably
grow from there. Your suggestion on the SeqIO seems reasonable. I'll try
to have a prototype in the next days/weekend and we can discuss from there.

> I have not used SOAP, and have a personal preference for REST style
> APIs. However, if that is what KEGG offers, this is worth considering.
> I think Brad has some experience with (other) SOAP services in Python.
> Note the KEGG documentation suggests using SOAPpy for Python.

According to the http://www.genome.jp/kegg/docs/weblink.html page they
do mention a REST-like URL for generic entries, pathways and brite. But
it seems more useful for external linking than as an API. I couldn't
even figure out how to return the information in plaintext instead of
the default HTML. About SOAPpy, I've nothing against it besides the fact
that when I first tried it I had a few problems. Anyway it was a long time
ago... I've only played with suds since.

> Interestingly, KEGG are however looking into providing RDF (and
> perhaps one day SPARQL endpoints). I will try and find out what sort
> of time scale they have in mind while I am at the BioHackathon 2010
> this week - http://hackathon3.dbcls.jp/

We'll be waiting on your feedback on this :)

> For now, I would prioritise the KEGG flat file parsers.

Agreed.

> Peter
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)

iEYEARECAAYFAktzUwgACgkQYh11EUYTX9SPcwCfSrNkIovs1vnPinuAtMFZQJYn
pmAAnjHAAro2Ls/c1Nq4DCuliReaPm64
=Dohn
-----END PGP SIGNATURE-----

From rjalves at igc.gulbenkian.pt  Wed Feb 10 19:53:03 2010
From: rjalves at igc.gulbenkian.pt (Renato Alves)
Date: Thu, 11 Feb 2010 00:53:03 +0000
Subject: [Biopython-dev] KEGG support
In-Reply-To: <320fb6e01002101437p5f382ce0tb4cd864cec59a6c5@mail.gmail.com>
References: <4B72FB2D.4070808@igc.gulbenkian.pt>	
	<bb02be081002101212m381f1263r1922d145f61e26fa@mail.gmail.com>
	<320fb6e01002101437p5f382ce0tb4cd864cec59a6c5@mail.gmail.com>
Message-ID: <4B7354EF.8020703@igc.gulbenkian.pt>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

- From Peter on 02/10/2010 10:37 PM:
> On Wed, Feb 10, 2010 at 8:12 PM, Kyle <kellrott at gmail.com> wrote:
>> I think external library dependencies should be avoided unless necessary.
>>  Would a tool like wsdl2py produce code that isn't dependent on an installed
>> library? Alternatively, suds is LGPL based, could we just cannibalize the
>> source code for the important classes?
> 
> Working with SOAP is so complicated that using an external library
> would be the sensible option. It would be an optional dependency
> (and would not be an install time dependency like NumPy), much
> like how we have an optional dependency on ReportLab just for
> Bio.Graphics, and now also the option to use NetworkX with the
> new Bio.Phylo code.

Yes that would be my idea on the SOAP interface. If doable we could even
evaluate the possibility of having some abstraction layer that could
enable the use of SOAPpy or suds if either is already available on the
system.
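
A minimal sketch of how such a layer could detect which library is present,
without making either one a hard requirement. The function names are
hypothetical. Only detection is shown; wrapping the two libraries' different
client APIs behind one interface would be the larger part of the work.

```python
import importlib
import importlib.util


def pick_soap_backend(candidates=("suds", "SOAPpy")):
    """Return the first importable module name from candidates, or None."""
    for name in candidates:
        # find_spec() looks the module up without importing it.
        if importlib.util.find_spec(name) is not None:
            return name
    return None


def load_soap_backend(candidates=("suds", "SOAPpy")):
    """Import and return the first available backend module."""
    name = pick_soap_backend(candidates)
    if name is None:
        # Fail only when the SOAP feature is actually requested.
        raise ImportError(
            "KEGG SOAP support needs one of: " + ", ".join(candidates))
    return importlib.import_module(name)


print(pick_soap_backend())
```

Because the check runs at call time rather than at install time, the rest of
the package keeps working when neither library is installed.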

> Package management (e.g. under Linux distros) can mark these
> external modules as suggestions or soft requirements, making
> this quite straightforward.

The 'or' case for soap libraries would also fit in this scheme since
most package managers already support this kind of feature.

> Regarding some of Giovanni's points, modularising the distribution
> of Biopython (which can already be considered to be a core plus
> assorted domain-specific modules like Bio.PDB, Bio.Cluster,
> Bio.Graphics and so on) seems premature to me given the current
> state of python distribution.

Could you elaborate a little on what you mean by 'current state of
python...'. Are you referring to the python3 transition?

Renato
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)

iEYEARECAAYFAktzVO0ACgkQYh11EUYTX9S1ngCfYFiW7VeNu6atl0J1eViqquSo
PCIAn3KO2p//fRYpZVC0QSp2gITP/n2I
=uTTc
-----END PGP SIGNATURE-----

From chapmanb at 50mail.com  Wed Feb 10 19:56:00 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 10 Feb 2010 19:56:00 -0500
Subject: [Biopython-dev] KEGG support
In-Reply-To: <4B73530B.7090203@igc.gulbenkian.pt>
References: <4B72FB2D.4070808@igc.gulbenkian.pt>
	<320fb6e01002101427o5efc003es3402308e44b2533a@mail.gmail.com>
	<4B73530B.7090203@igc.gulbenkian.pt>
Message-ID: <20100211005600.GB1923@kunkel>

Renato;
Great idea to work with the KEGG parsers. Very happy to have someone
tackling this.

> According to the http://www.genome.jp/kegg/docs/weblink.html page they
> do mention a REST like URL for generic entries, pathways and brite. But
> it seems more useful for external linking than as an API. I couldn't
> even figure out how to return the information in plaintext instead of
> the default HTML. About SOAPpy, I've nothing against it besides the fact
> that when I first tried I had few problems. Anyway it was a long time
> ago... I've only played with suds since.

My suggestion would be to use the TogoWS REST interface

http://togows.dbcls.jp/site/en/rest.html

It makes getting records crazy easy. There are tons of examples,
but for GENES, here's how to get the plain text record:

http://togows.dbcls.jp/entry/gene/eco:b0002

If you really want to use SOAP, my experience has been best with
suds. However, the complexities of SOAP are really not worth it if
you can get REST approaches to do what you need.
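
As a rough sketch of the REST approach, the snippet below builds an entry URL
following the pattern of the example link above and fetches it with the
standard library. The helper names are made up, and it is an assumption that
the service still answers at this address and returns UTF-8 text. The fetch
itself needs network access and is not called here.

```python
from urllib.parse import quote
from urllib.request import urlopen

TOGOWS_ENTRY = "http://togows.dbcls.jp/entry"


def togows_entry_url(database, entry_id):
    # Keep the colon in IDs such as "eco:b0002" readable in the URL.
    return "/".join([TOGOWS_ENTRY, quote(database), quote(entry_id, safe=":")])


def fetch_entry(database, entry_id, timeout=30):
    """Download one record as text (needs network access)."""
    url = togows_entry_url(database, entry_id)
    with urlopen(url, timeout=timeout) as handle:
        return handle.read().decode("utf-8")


print(togows_entry_url("gene", "eco:b0002"))
```

The text returned by fetch_entry() could then be handed straight to a
flat-file parser, so no SOAP library is needed for simple record retrieval.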

Hope this helps,
Brad

From rjalves at igc.gulbenkian.pt  Wed Feb 10 20:14:52 2010
From: rjalves at igc.gulbenkian.pt (Renato Alves)
Date: Thu, 11 Feb 2010 01:14:52 +0000
Subject: [Biopython-dev] KEGG support
In-Reply-To: <5aa3b3571002101413o55d04432vc76c230aa9c43252@mail.gmail.com>
References: <4B72FB2D.4070808@igc.gulbenkian.pt>
	<5aa3b3571002101413o55d04432vc76c230aa9c43252@mail.gmail.com>
Message-ID: <4B735A0C.8070902@igc.gulbenkian.pt>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

- From Giovanni Marco Dall'Olio on 02/10/2010 10:13 PM:
> Hi,
> I had a terrible experience with parsing Kegg pathway's files: in the
> end I discovered that the files that are stored in their ftp don't
> correspond exactly to the diagrams that you can find in the web
> interface, as for example biochemical interactions don't have
> directionality while if you look at them on kegg/pathway you will see
> arrows.

I haven't used pathway files yet so I'll be careful when I reach them :)
Have you mentioned this aspect to the KEGG maintainers?

> Some time ago I proposed to implement something similar to what you have
> said for kegg/pathway, but in the end I abandoned the effort, because I
> had problem both with suds and SOAPpy, and I wasn't satisfied by the
> annotations in KEGG.
>  
> If you are serious about that I may help you, but I can only work on the
> weekends and you should tell me exactly what I have to do :-) 

Hehe, I can only tell you once I get my hands dirty. I'll keep my code
on github to maximize interaction. I'll get back to you when I have the
first working draft for GENES. Thanks for the hand ;)

> Are you sure? I tried it on KEGG an year ago and I was having problems
> to execute slightly more complex queries. If you look at suds's bug
> tracker, you will find some reports by me, like this one:
> - https://fedorahosted.org/suds/ticket/213

As of suds revision 658 I can no longer reproduce the error in the ticket.

> I remember that I was looping between the KEGG support centre and the
> suds bug tracker; both were very responsive to feedback and very keen to
> answer me, but in the end they didn't speak to each other and the bug
> reports that I have filed are still unfixed.
> 
> Which library can you use for the soap queries? I had the feeling that
> SOAPpy (which I think it is included in the standard lib) worked well
> with KEGG, however it development has stopped many years ago
> (http://sourceforge.net/projects/pywebsvcs/files/SOAP.py/), it is a mess
> if you want to use it behind an http_proxy (I should have a patch
> somewhere if you are interested) and I am sure it won't be kept
> compatible with the future versions of python.

SOAPpy doesn't seem to be in the standard lib, at least I don't have it
out of the box here. It's only available as an external package in the
repository.

> Another alternative may be beautiful soup, but I have never tried it.

I've only used beautiful soup as HTML cleaner/formatter, like HTML tidy.
I wasn't aware that it could be used for SOAP stuff. Are you sure about
this?

> This question on stackoverflow may provide you some ideas:
> http://stackoverflow.com/questions/206154/whats-the-best-soap-client-library-for-python-and-where-is-the-documentation-fo
> 
> I am not sure about which is the standard soap library for python, and
> which one is included in the standard lib. If you are going to use
> SOAPpy, it is a bad bet toward compatibility and maintenance for the
> future releases. Suds is the best option but it is not in the standard
> lib, and they still have to fix the bugs I have reported an year ago. I
> have the feeling that there is no good alternative for python.

I'll wait for your opinions. I don't want to sound religious about suds. :P

> Moreover, the WSDL functions that I have seen for KEGG are not
> especially useful. They seem to allow for the basic queries, but for
> most of the tasks it is better to download the ftp locally and work there.

Well if you just want a quick check on something the API still gives
better/quicker results than downloading the stuff via FTP. Given the
size, probably the load of the server and the fact that I'm on the other
side of the globe, I got an ETA of close to 20 hours when downloading
the genes.tar.gz file which is only a few GB in size.

Renato
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)

iEYEARECAAYFAktzWgoACgkQYh11EUYTX9Rp6QCfaHf6Ic3uT/npDw2o8l9F+8Kk
RtgAnjNXGxcrfvh48dcdFf6G4wK9+PNI
=vpUY
-----END PGP SIGNATURE-----

From biopython at maubp.freeserve.co.uk  Wed Feb 10 20:15:21 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 11 Feb 2010 01:15:21 +0000
Subject: [Biopython-dev] KEGG support
In-Reply-To: <4B7354EF.8020703@igc.gulbenkian.pt>
References: <4B72FB2D.4070808@igc.gulbenkian.pt>
	<bb02be081002101212m381f1263r1922d145f61e26fa@mail.gmail.com>
	<320fb6e01002101437p5f382ce0tb4cd864cec59a6c5@mail.gmail.com>
	<4B7354EF.8020703@igc.gulbenkian.pt>
Message-ID: <320fb6e01002101715n3ccb8894r155631a2c6e34cb6@mail.gmail.com>

Renato Alves <rjalves at igc.gulbenkian.pt> wrote:
>> Regarding some of Giovanni's points, modularising the distribution
>> of Biopython (which can already be considered to be a core plus
>> assorted domain-specific modules like Bio.PDB, Bio.Cluster,
>> Bio.Graphics and so on) seems premature to me given the current
>> state of python distribution.
>
> Could you elaborate a little on what you mean by 'current state of
> python...'. Are you referring to the python3 transition?

I didn't mean anything about Python 3 here. Just the current state
of python package management, with distutils vs setuptools,
easy_install, Distribute, etc. I am looking forward to an official
Python successor to distutils one day which will properly handle
dependencies (and hopefully uninstallation) nicely. However, for
now, a single monolithic Biopython released several times a
year works fine and I see no reason to change that.

Peter

From rjalves at igc.gulbenkian.pt  Wed Feb 10 20:46:59 2010
From: rjalves at igc.gulbenkian.pt (Renato Alves)
Date: Thu, 11 Feb 2010 01:46:59 +0000
Subject: [Biopython-dev] KEGG support
In-Reply-To: <20100211005600.GB1923@kunkel>
References: <4B72FB2D.4070808@igc.gulbenkian.pt>
	<320fb6e01002101427o5efc003es3402308e44b2533a@mail.gmail.com>
	<4B73530B.7090203@igc.gulbenkian.pt> <20100211005600.GB1923@kunkel>
Message-ID: <4B736193.9020801@igc.gulbenkian.pt>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

> Renato;
> Great idea to work with the KEGG parsers. Very happy to have someone
> tackling this.

Well as we say here, when the need comes we grab the bull by the horns.
:) (Small illustration even though I'm not a fan of the 'sport'
http://www.youtube.com/watch?v=OBORPnrm89I)

> My suggestion would be to use the TogoWS REST interface
> 
> http://togows.dbcls.jp/site/en/rest.html
> 
> It makes getting records crazy easy. There are tons of examples,
> but for GENES, here's how to get the plain text record:
> 
> http://togows.dbcls.jp/entry/gene/eco:b0002
> 
> If you really want to use SOAP, my experience has been best with
> suds. However, the complexities of SOAP are really not worth it if
> you can get REST approaches to do what you need.

Indeed, this is exactly the same, without the need for additional libraries.
If all the functionality available on the SOAP API is also here I agree
with you, the complexity of SOAP is unnecessary.
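
For illustration, fetching the plain text record from the example above
needs only the standard library. This is a minimal sketch, assuming the URL
layout shown in the example (.../entry/<database>/<entry id>) and that the
service is reachable; entry_url and fetch_entry are names made up for this
example, and the code is written for Python 3.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Base URL taken from the example in this thread.
TOGOWS_ENTRY = "http://togows.dbcls.jp/entry"


def entry_url(database, entry_id):
    """Build an entry URL such as .../entry/gene/eco:b0002 (hypothetical helper)."""
    # Keep ':' unescaped so the result matches the example URL.
    return "%s/%s/%s" % (TOGOWS_ENTRY, quote(database), quote(entry_id, safe=":"))


def fetch_entry(database, entry_id):
    """Return the plain text record; needs network access to the service."""
    handle = urlopen(entry_url(database, entry_id))
    try:
        return handle.read().decode("utf-8")
    finally:
        handle.close()


print(entry_url("gene", "eco:b0002"))
```

Calling fetch_entry("gene", "eco:b0002") should then return the text that
the URL above serves, provided the service still answers at that address.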

Renato
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)

iEYEARECAAYFAktzYZEACgkQYh11EUYTX9RMWQCeLOXZH5vBjxB7rgPjhS53Fx7Z
EuMAoItWzjJ1LEtV6T8NcDDqnoDyIyBS
=dPVp
-----END PGP SIGNATURE-----

From biopython at maubp.freeserve.co.uk  Thu Feb 11 00:29:15 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 11 Feb 2010 05:29:15 +0000
Subject: [Biopython-dev] Merging Bio.SeqIO SFF support?
In-Reply-To: <320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com>
References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com>
	<320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com>
Message-ID: <320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com>

On Mon, Jan 11, 2010 at 5:11 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> Hi all,
>
> I didn't want to rush the SFF support into Biopython 1.53, but it's been
> waiting "ready" for a while now. Any objections or comments about
> me merging this now?
>
> Thanks,
>
> Peter

There were no objections, and I ran this by Brad and Michiel and
have just merged this into the master branch. Time for some more
testing!

Peter

From krother at rubor.de  Thu Feb 11 07:31:58 2010
From: krother at rubor.de (Kristian Rother)
Date: Thu, 11 Feb 2010 13:31:58 +0100
Subject: [Biopython-dev] Bio.PDB.KDTree test for memory leak
Message-ID: <112c17235319b66a00eebc499294fb2b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlhfRVBcWQ1eVw==-webmailer2@server03.webmailer.hosteurope.de>


Hi,

I've encountered a problem with running KDTree: it leaks memory.
The code below fills 1GB memory within a minute.

Running the GC doesn't help (it slows the process down, but only because
the GC is much slower than KDTree).

I think the problem might be in the C code. I'd like to get this bug
sorted out, but I'm not very good at C. Is there anyone around who I could
check ideas with?

Best Regards,
   Kristian


----

from Bio.KDTree.KDTree import *
from numpy.random import random

nr_points=1000
dim=3
bucket_size=10
coords=(200*random((nr_points, dim)))

while 1:
    kdtree=KDTree(dim, bucket_size)
    kdtree.set_coords(coords)


From biopython at maubp.freeserve.co.uk  Fri Feb 12 01:10:13 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 12 Feb 2010 06:10:13 +0000
Subject: [Biopython-dev] Bio.PDB.KDTree test for memory leak
In-Reply-To: <112c17235319b66a00eebc499294fb2b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlhfRVBcWQ1eVw==-webmailer2@server03.webmailer.hosteurope.de>
References: <112c17235319b66a00eebc499294fb2b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlhfRVBcWQ1eVw==-webmailer2@server03.webmailer.hosteurope.de>
Message-ID: <320fb6e01002112210y10ad4670p7ac3e003b5976685@mail.gmail.com>

On Thu, Feb 11, 2010 at 12:31 PM, Kristian Rother <krother at rubor.de> wrote:
>
> Hi,
>
> I've encountered a problem with running KDTree: it leaks memory.
> The code below fills 1GB memory within a minute.
>
> Running the GC doesn't help (it slows the process down, but only because
> the GC is much slower than KDTree).

You mean something like this?

import gc
from Bio.KDTree.KDTree import *
from numpy.random import random

nr_points=1000
dim=3
bucket_size=10
coords=(200*random((nr_points, dim)))

while True:
   kdtree=KDTree(dim, bucket_size)
   kdtree.set_coords(coords)
   del kdtree #explicitly tell Python it can GC this object
   gc.collect() #force Python to run GC

I agree, this does seem to gradually consume more and more RAM.
Could you open a bug on bugzilla to track this please?
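
To put numbers on the growth, one option is to sample the peak resident set
size of the process between batches of iterations. This is a minimal sketch,
assuming a Unix platform (the resource module is not available on Windows);
track_growth and its arguments are hypothetical names, and step would be a
function that builds a KDTree and calls set_coords as in the loop above.

```python
import resource


def peak_rss():
    """Peak resident set size of this process so far.

    The unit is platform dependent: kilobytes on Linux, bytes on Mac OS X.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def track_growth(step, rounds=5, calls_per_round=1000):
    """Call step() repeatedly and record the peak RSS after each round."""
    samples = []
    for _ in range(rounds):
        for _ in range(calls_per_round):
            step()
        samples.append(peak_rss())
    return samples
```

If the samples keep climbing round after round for a step that should
release everything it allocates, that points to a leak rather than a
one-off allocation. Since this is a peak value it never decreases, so a
flat series is the healthy case.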

> I think the problem might be in the C code. I'd like to get this bug
> sorted out, but I'm not very good at C. Is there anyone around who
> I could check ideas with?

Have you ever used valgrind on a C tool? I'm not sure if it is easy
to use via Python, but it is my tool of choice for checking memory
leaks in C.

Peter

From bugzilla-daemon at portal.open-bio.org  Fri Feb 12 03:30:12 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 12 Feb 2010 03:30:12 -0500
Subject: [Biopython-dev] [Bug 3010] New: Bio.KDTree is leaking memory
Message-ID: <bug-3010-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=3010

           Summary: Bio.KDTree is leaking memory
           Product: Biopython
           Version: 1.53
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: krother at rubor.de


When I run KDTree on several of our PCs (Ubuntu, one with BioPython 1.53, one
with 1.51), it consumes memory that is never freed unless the process
terminates.

The code below fills 1GB memory within about a minute.

----
#!/usr/bin/env python

from Bio.KDTree.KDTree import *
from numpy.random import random

nr_points=1000
dim=3
bucket_size=10
coords=(200*random((nr_points, dim)))

while True:
   kdtree=KDTree(dim, bucket_size)
   kdtree.set_coords(coords)

----

Running the GC (via del kdtree; gc.collect() in the while loop) does not
help.

I think the problem might be the C code or the Python/C interaction. I checked
the sources of KDTree superficially (to see whether there is a free() for each
malloc()), but did not see anything unusual (I am not a C programmer though).

Peter proposed using valgrind to check for memory leaks in C. It may be
applicable to this problem.


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org  Fri Feb 12 07:31:13 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 12 Feb 2010 07:31:13 -0500
Subject: [Biopython-dev] [Bug 3006] esearch medline fails with xml format
In-Reply-To: <bug-3006-42@http.bugzilla.open-bio.org/>
Message-ID: <201002121231.o1CCVDlN010496@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3006


georg.lipps at fhnw.ch changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #2 from georg.lipps at fhnw.ch  2010-02-12 07:31 EST -------
I updated to Python 2.6.4 and Biopython 1.53 and can confirm that the problem
does not persist.

Thanks for checking.


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org  Fri Feb 12 11:23:17 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 12 Feb 2010 11:23:17 -0500
Subject: [Biopython-dev] [Bug 3010] Bio.KDTree is leaking memory
In-Reply-To: <bug-3010-42@http.bugzilla.open-bio.org/>
Message-ID: <201002121623.o1CGNHHd017669@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3010


------- Comment #1 from mdehoon at ims.u-tokyo.ac.jp  2010-02-12 11:23 EST -------
Does the memory leak occur also without the line kdtree.set_coords(coords)?


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org  Sun Feb 14 05:45:48 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 14 Feb 2010 05:45:48 -0500
Subject: [Biopython-dev] [Bug 3010] Bio.KDTree is leaking memory
In-Reply-To: <bug-3010-42@http.bugzilla.open-bio.org/>
Message-ID: <201002141045.o1EAjmV1029393@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3010


------- Comment #2 from krother at rubor.de  2010-02-14 05:45 EST -------
(In reply to comment #1)
> Does the memory leak occur also without the line kdtree.set_coords(coords)?
> 
No, I tried, and it doesn't.


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From MatatTHC at gmx.de  Tue Feb 16 04:48:25 2010
From: MatatTHC at gmx.de (Matthias Bernt)
Date: Tue, 16 Feb 2010 10:48:25 +0100
Subject: [Biopython-dev] derive from Seq
Message-ID: <20100216094825.25190@gmx.net>

Hi, 

I've implemented a class derived from Seq. Many of the Seq functions return Seq. Thus, I cannot use those functions because I need instances of the derived class.

This can easily be fixed by returning: 

self.__class__( .. ) 
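
As an illustration of the difference, here is a minimal sketch using a toy
class. BaseSeq and CircularSeq are made-up names for this example, not the
real Bio.Seq.Seq, and the two reversed_* methods exist only to contrast the
two ways of building the return value.

```python
class BaseSeq(object):
    """Toy stand-in for a sequence class (not the real Bio.Seq.Seq)."""

    def __init__(self, data):
        self._data = data

    def __str__(self):
        return self._data

    def reversed_hardcoded(self):
        # Always returns the base class, even when called on a subclass.
        return BaseSeq(self._data[::-1])

    def reversed_flexible(self):
        # Returns an instance of whatever class self actually is.
        return self.__class__(self._data[::-1])


class CircularSeq(BaseSeq):
    """Hypothetical subclass that adds nothing, for demonstration only."""


seq = CircularSeq("ACGT")
print(type(seq.reversed_hardcoded()).__name__)
print(type(seq.reversed_flexible()).__name__)
```

Note that the flexible form only works if the subclass constructor accepts
the same arguments as the base class constructor.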

Regards, 
Matthias

From chapmanb at 50mail.com  Tue Feb 16 08:09:45 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 16 Feb 2010 08:09:45 -0500
Subject: [Biopython-dev] derive from Seq
In-Reply-To: <20100216094825.25190@gmx.net>
References: <20100216094825.25190@gmx.net>
Message-ID: <20100216130945.GH64068@sobchak.mgh.harvard.edu>

Hi Matthias;

> I've implemented a class derived from Seq. Many of the Seq functions
> return Seq. Thus, I can not use those functions because I need
> instances of the derived class.
>
> This can easily be fixed by returning:
>
> self.__class__( .. ) 

Good catch. Would you be able to submit a patch for this to the bug
tracker?

More generally, it is interesting that you are subclassing Seq. Can
you describe your application for this? I was debating with Peter
and Michiel this week and arguing that the Seq class should be
switched to a standard string, with biological functions like
reverse_complement and the like moving to stand alone functions and
SeqRecord objects. I'd be interested in hearing the opposite case;
that additional functionality is needed on a Seq object.

Brad

From bugzilla-daemon at portal.open-bio.org  Tue Feb 16 12:53:29 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 16 Feb 2010 12:53:29 -0500
Subject: [Biopython-dev] [Bug 3013] New: import warnings missing in
	Bio/PDB/MMCIF2Dict.py
Message-ID: <bug-3013-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=3013

           Summary: import warnings missing in Bio/PDB/MMCIF2Dict.py
           Product: Biopython
           Version: 1.53
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: minor
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: macrozhu+biopy at gmail.com


python library >>warnings<< is not imported in Bio/PDB/MMCIF2Dict.py
Please import the library in the beginning of the source code.


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org  Tue Feb 16 20:24:39 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 16 Feb 2010 20:24:39 -0500
Subject: [Biopython-dev] [Bug 3013] import warnings missing in
	Bio/PDB/MMCIF2Dict.py
In-Reply-To: <bug-3013-42@http.bugzilla.open-bio.org/>
Message-ID: <201002170124.o1H1OdhE003209@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3013


mdehoon at ims.u-tokyo.ac.jp changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #1 from mdehoon at ims.u-tokyo.ac.jp  2010-02-16 20:24 EST -------
Fixed in the repository, thanks.


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From p.j.a.cock at googlemail.com  Tue Feb 16 21:48:01 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 17 Feb 2010 02:48:01 +0000
Subject: [Biopython-dev] derive from Seq
In-Reply-To: <20100216130945.GH64068@sobchak.mgh.harvard.edu>
References: <20100216094825.25190@gmx.net>
	<20100216130945.GH64068@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e01002161848s28e543a9jdac436976de3f279@mail.gmail.com>

On Tue, Feb 16, 2010 at 1:09 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Hi Matthias;
>
>> I've implemented a class derived from Seq. Many of the Seq functions
>> return Seq. Thus, I can not use those functions because I need
>> instances of the derived class.
>>
>> This can easily be fixed by returning:
>>
>> self.__class__( .. )

We debated this on the mailing list a while ago (I'd have to search
a little harder to find the thread). While switching to this form makes
subclassing easier in some cases, it doesn't in all.

> More generally, it is interesting that you are subclassing Seq. Can
> you describe your application for this? ... I'd be interested in
> hearing ... additional functionality is needed on a Seq object.
>
> Brad

Last time this (subclassing the Seq object) was mentioned, the
specific use was to change the equality operations to be string
like. This is a change we're considering making in Biopython itself
(and again was something Brad, Michiel and I chatted about
last week - I will be sending out an email about that next week,
I'm on holiday right now and haven't had internet access till
today).

But to echo Brad, use cases for subclassing the Seq are
of great interest.

Regards,

Peter

From MatatTHC at gmx.de  Wed Feb 17 03:33:11 2010
From: MatatTHC at gmx.de (Matthias Bernt)
Date: Wed, 17 Feb 2010 09:33:11 +0100
Subject: [Biopython-dev] derive from Seq
In-Reply-To: <320fb6e01002161848s28e543a9jdac436976de3f279@mail.gmail.com>
References: <20100216094825.25190@gmx.net>	
	<20100216130945.GH64068@sobchak.mgh.harvard.edu>
	<320fb6e01002161848s28e543a9jdac436976de3f279@mail.gmail.com>
Message-ID: <20100217083311.287840@gmx.net>

Hi, 

I'm dealing with circular sequences. Thus, I need some specialised functions (e.g. getting a subsequence). Furthermore, for me it seems to be the natural way to extend the functionality of Seq to my own needs. 
But, maybe this is not the best way. 
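
To make the subsequence case concrete, here is a minimal sketch of a
wrap-around slice written as a standalone helper on plain strings, so it
does not depend on how Seq is subclassed. circular_slice is a hypothetical
name, and the coordinate convention (0-based, half-open, reduced modulo the
length) is an assumption made for this example.

```python
def circular_slice(data, start, end):
    """Return data[start:end] on a circular string, wrapping past the end.

    Coordinates are 0-based and half-open, and are reduced modulo the
    length first. If the reduced end is not greater than the reduced
    start, the slice wraps around the origin; equal values therefore
    give the whole sequence rotated to begin at start.
    """
    n = len(data)
    if n == 0:
        return data
    start %= n
    end %= n
    if start < end:
        return data[start:end]
    return data[start:] + data[:end]


print(circular_slice("ACGTTG", 1, 3))
print(circular_slice("ACGTTG", 4, 2))
```

The first call is an ordinary slice; the second crosses the origin and
joins the tail of the string to its head.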

Matthias


> > Hi Matthias;
> >
> >> I've implemented a class derived from Seq. Many of the Seq functions
> >> return Seq. Thus, I can not use those functions because I need
> >> instances of the derived class.
> >>
> >> This can easily be fixed by returning:
> >>
> >> self.__class__( .. )
> 
> We debated this on the mailing list a while ago (I'd have to search
> a little harder to find the thread). While switching to this form makes
> subclassing easier in some cases, it doesn't in all.
> 
> > More generally, it is interesting that you are subclassing Seq. Can
> > you describe your application for this? ... I'd be interested in
> > hearing ... additional functionality is needed on a Seq object.
> >
> > Brad
> 
> Last time this (subclassing the Seq object) was mentioned, the
> specific use was to change the equality operations to be string
> like. This is a change we're considering making in Biopython itself
> (and again was something Brad, Michiel and I chatted about
> last week - I will be sending out an email about that next week,
> I'm on holiday right now and haven't had internet access till
> today).
> 
> But to echo Brad, use cases for subclassing the Seq are
> of great interest.



From bugzilla-daemon at portal.open-bio.org  Thu Feb 18 11:09:52 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 18 Feb 2010 11:09:52 -0500
Subject: [Biopython-dev] [Bug 3013] import warnings missing in
	Bio/PDB/MMCIF2Dict.py
In-Reply-To: <bug-3013-42@http.bugzilla.open-bio.org/>
Message-ID: <201002181609.o1IG9qth028156@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3013


------- Comment #2 from macrozhu+biopy at gmail.com  2010-02-18 11:09 EST -------
Can pychecker be of any use for detecting such minor bugs? It might be too
much, I guess.


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org  Sat Feb 20 13:40:59 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 20 Feb 2010 13:40:59 -0500
Subject: [Biopython-dev] [Bug 3013] import warnings missing in
	Bio/PDB/MMCIF2Dict.py
In-Reply-To: <bug-3013-42@http.bugzilla.open-bio.org/>
Message-ID: <201002201840.o1KIexYS017773@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3013


------- Comment #3 from eric.talevich at gmail.com  2010-02-20 13:40 EST -------
(In reply to comment #2)
> Can pychecker be of any use for detecting such minor bugs? It might be too
> much, I guess.
> 

I don't know about PyChecker, but PyLint will catch import errors and
undefined variables like this. For example, I just ran "pylint -e
Bio/PDB/*.py" on a branch that didn't have this fix in it yet, and it flagged
this bug:

E: 79:MMCIF2Dict._make_mmcif_dict: Undefined variable 'warnings'
E: 91:MMCIF2Dict._make_mmcif_dict: Undefined variable 'warnings'
E:107:MMCIF2Dict._make_mmcif_dict: Undefined variable 'warnings'


While I'm at it, here are the other errors in Bio.PDB that pylint caught in a
freshly updated master branch:


************* Module Chain
E: 79:Chain.__delitem__: Class 'Entity' has no '__delitem__' member

************* Module DSSP
E:101:make_dssp_dict: function already defined line 8
E:139:DSSP: class already defined line 8

************* Module Entity
E: 56:Entity.get_level: Instance of 'Entity' has no 'level' member

************* Module FragmentMapper
E:137:Fragment.add_residue: Undefined variable 'PDBException'
E:191:_make_fragment_list: Undefined variable 'PDBException'
E:193:_make_fragment_list: Undefined variable 'PDBException'
E:226:FragmentMapper: class already defined line 10
E:250:FragmentMapper.__init__: Undefined variable 'PDBException'

************* Module HSExposure
E: 67:_AbstractHSExposure.__init__: Instance of '_AbstractHSExposure' has no
'_get_cb' member
E:131:HSExposureCA: class already defined line 9
E:222:HSExposureCB: class already defined line 9
E:257:ExposureCN: class already defined line 9

************* Module MMCIF2Dict
E:  8: No name 'MMCIFlex' in module 'Bio.PDB.mmCIF'
E: 31:MMCIF2Dict.__init__: Module 'Bio.PDB.mmCIF' has no 'MMCIFlex' member
E: 33:MMCIF2Dict.__init__: Module 'Bio.PDB.mmCIF' has no 'MMCIFlex' member
E: 44:MMCIF2Dict._make_mmcif_dict: Module 'Bio.PDB.mmCIF' has no 'MMCIFlex'
member

************* Module NACCESS
E:183: Instance of 'NACCESS' has no 'get_iterator' member

************* Module PDBParser
E:159:PDBParser._parse_coordinates: Undefined variable 'PDBContructionError'

************* Module Polypeptide
E:276:_PPBuilder.build_peptides: Instance of '_PPBuilder' has no
'_is_connected' member

************* Module ResidueDepth
E: 65:get_surface: function already defined line 11
E:123:ResidueDepth: class already defined line 11

************* Module StructureAlignment
E: 14:StructureAlignment: class already defined line 6


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From eric.talevich at gmail.com  Sat Feb 20 14:01:42 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Sat, 20 Feb 2010 14:01:42 -0500
Subject: [Biopython-dev] derive from Seq
In-Reply-To: <20100216130945.GH64068@sobchak.mgh.harvard.edu>
References: <20100216094825.25190@gmx.net>
	<20100216130945.GH64068@sobchak.mgh.harvard.edu>
Message-ID: <3f6baf361002201101m78a9550csc6d371df251e4319@mail.gmail.com>

On Tue, Feb 16, 2010 at 8:09 AM, Brad Chapman <chapmanb at 50mail.com> wrote:

>
> More generally, it is interesting that you are subclassing Seq. Can
> you describe your application for this? I was debating with Peter
> and Michiel this week and arguing that the Seq class should be
> switched to a standard string, with biological functions like
> reverse_complement and the like moving to stand alone functions and
> SeqRecord objects. I'd be interested in hearing the opposite case;
> that additional functionality is needed on a Seq object.
>
>
I've seen a technique like this used to good effect:

# File: Seq.py

# Standalone functions all take a string-like first argument
def reverse_complement(seq): ...
def translate(seq, table=1): ...

class Seq(str):  # basestring itself cannot be subclassed
    def __init__(self, data, alphabet): ...
    # Then attach the above functions as methods here
    reverse_complement = reverse_complement
    translate = translate
    ...


The same functionality is then available in a functional or OO style, with
minimal code duplication. And for interactive sessions, where converting
strings to Seqs is a bit more of an inconvenience, "from Bio.Seq import *"
becomes quick and handy.
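
A runnable version of the idea, as a minimal sketch for Python 3: the
alphabet argument is left out, only unambiguous DNA letters are handled,
and the _COMPLEMENT table is a name made up for this example.

```python
# Translation table for unambiguous DNA only (an assumption of this sketch).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a string-like DNA sequence; returns a plain str."""
    return str(seq).translate(_COMPLEMENT)[::-1]


class Seq(str):
    # Reuse the module-level function as a method; self is passed as seq.
    reverse_complement = reverse_complement


print(reverse_complement("AACG"))
print(Seq("AACG").reverse_complement())
```

Both calls go through the same function body. In this sketch the method
returns a plain str rather than a Seq, so a real implementation would still
have to decide whether to wrap the result.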

-Eric

From biopython at maubp.freeserve.co.uk  Sun Feb 21 07:03:21 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 21 Feb 2010 12:03:21 +0000
Subject: [Biopython-dev] derive from Seq
In-Reply-To: <3f6baf361002201101m78a9550csc6d371df251e4319@mail.gmail.com>
References: <20100216094825.25190@gmx.net>
	<20100216130945.GH64068@sobchak.mgh.harvard.edu>
	<3f6baf361002201101m78a9550csc6d371df251e4319@mail.gmail.com>
Message-ID: <320fb6e01002210403i721f8f26l2e7d37aae0b13c35@mail.gmail.com>

On Sat, Feb 20, 2010 at 7:01 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
> I've seen a technique like this used to good effect:
>
> # File: Seq.py
>
> ...
>
> The same functionality is then available in a functional or OO style, with
> minimal code duplication. And for interactive sessions, where converting
> strings to Seqs is a bit more of an inconvenience, "from Bio.Seq import *"
> becomes quick and handy.

Doesn't that describe the Bio.Seq module as it is pretty well?
In addition to the Seq object methods, there are several functions
which can be used on strings or Seq (like) objects.

Peter

From eric.talevich at gmail.com  Sun Feb 21 11:36:13 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Sun, 21 Feb 2010 11:36:13 -0500
Subject: [Biopython-dev] derive from Seq
In-Reply-To: <320fb6e01002210403i721f8f26l2e7d37aae0b13c35@mail.gmail.com>
References: <20100216094825.25190@gmx.net>
	<20100216130945.GH64068@sobchak.mgh.harvard.edu> 
	<3f6baf361002201101m78a9550csc6d371df251e4319@mail.gmail.com> 
	<320fb6e01002210403i721f8f26l2e7d37aae0b13c35@mail.gmail.com>
Message-ID: <3f6baf361002210836of243016s8206035c1b89de24@mail.gmail.com>

On Sun, Feb 21, 2010 at 7:03 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> On Sat, Feb 20, 2010 at 7:01 PM, Eric Talevich <eric.talevich at gmail.com>
> wrote:
> > I've seen a technique like this used to good effect:
> >
> > ...
> >
> > The same functionality is then available in a functional or OO style,
> > with minimal code duplication. And for interactive sessions, where
> > converting strings to Seqs is a bit more of an inconvenience,
> > "from Bio.Seq import *" becomes quick and handy.
>
> Doesn't that describe the Bio.Seq module as it is pretty well?
> In addition to the Seq object methods, there are several functions
> which can be used on strings or Seq (like) objects.
>
> Peter
>

I'm not fully up to speed on the debate or the use cases that triggered it,
but I'm guessing the goal is better code flexibility without sacrificing
performance. Here's some code to consider:

def transcribe(dna, alphabet=None):
    """Transcribe a DNA sequence into RNA. Returns a string."""
    if isinstance(dna, Seq) or isinstance(dna, MutableSeq):
        # At first, maybe issue a warning here
        alphabet = dna.alphabet
        dna = str(dna)
    if alphabet is not None:
        # Validate
        base = Alphabet._get_base_alphabet(alphabet)
        if isinstance(base, Alphabet.ProteinAlphabet):
            raise ValueError("Proteins cannot be transcribed!")
        if isinstance(base, Alphabet.RNAAlphabet):
            raise ValueError("RNA cannot be transcribed!")
    return dna.replace('T','U').replace('t','u')

class Seq:
    # ...
    def transcribe(self):
        transcript = transcribe(self._data)
        # Rebuild the Seq object
        if self.alphabet==IUPAC.unambiguous_dna:
            alphabet = IUPAC.unambiguous_rna
        elif self.alphabet==IUPAC.ambiguous_dna:
            alphabet = IUPAC.ambiguous_rna
        else:
            alphabet = Alphabet.generic_rna
        return Seq(transcript, alphabet)


Notes:
 - The standalone function takes an optional 'alphabet' argument, and performs
validation if requested.
 - Since the standalone function now has the same functionality as the Seq
method, Seq can dispatch to the function -- rather than the other way
around, as it is currently -- and then just rebuild a Seq object.
 - The standalone function now always returns the same type (str). Since
this might break some existing code, a little shim and deprecation dance may
be needed in real life. But I think returning a plain string is the Right
Thing: there's "one obvious way" to work with Seq objects or plain strings.
 - If the grand proposal is to eventually move the alphabet attribute to
SeqRecord, this provides an intermediate step and a more convenient
foundation for testing the idea.

Best,
Eric

From biopython at maubp.freeserve.co.uk  Mon Feb 22 09:48:14 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 22 Feb 2010 14:48:14 +0000
Subject: [Biopython-dev] Changing Seq equality
In-Reply-To: <320fb6e00911270339s3354051cub0cc193466575f16@mail.gmail.com>
References: <C731B9BA.2C661%lpritc@scri.ac.uk>
	<200911250945.20870.jblanca@btc.upv.es>
	<320fb6e00911250226w4e86ea5cr4cdea4a424d32b7@mail.gmail.com>
	<200911251220.53881.jblanca@btc.upv.es>
	<320fb6e00911250348m249533d1g5e30b6c593769dd1@mail.gmail.com>
	<3f6baf360911252314u72ab5c19rbcb899e736117a4f@mail.gmail.com>
	<320fb6e00911260241j22fbee47ufaad13412c0ff580@mail.gmail.com>
	<3f6baf360911261213g2047607aw212215cce2b4fe82@mail.gmail.com>
	<320fb6e00911270339s3354051cub0cc193466575f16@mail.gmail.com>
Message-ID: <320fb6e01002220648n5d47f015r65f17a37f782fcde@mail.gmail.com>

Hi all,

I've just got back from Japan - Brad and I were fortunate to be
able to attend the DBCLS BioHackathon 2010 held in Tokyo,
http://hackathon3.dbcls.jp/

As Brad already mentioned in passing, we also managed to have
dinner one evening with Michiel, and had an informal chat about
Biopython plans. Expect a few more emails on other topics to
follow.

One of the short term aims we agreed on was to press ahead
with the Seq equality changes outlined on this thread late last
year. Mailing list archive link:
http://lists.open-bio.org/pipermail/biopython-dev/2009-November/007021.html

To recap, the agreed best behaviour was to make Seq equality
act like string equality, but to raise a Python warning when
incompatible alphabets are compared (e.g. DNA to Protein).
This also applies to all the other comparison operators:
not equal, less than, greater than, less than or equal, and
greater than or equal.
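
As a rough illustration of that target behaviour, here is a minimal sketch
using a toy class. ToySeq and its string-valued kind tag are stand-ins
invented for this example (the real Seq carries an Alphabet object), and
only ==, != and < are shown; the other comparisons would follow the same
pattern.

```python
import warnings


class ToySeq(object):
    """Toy sequence with a 'kind' tag standing in for an alphabet."""

    def __init__(self, data, kind=None):
        self._data = data
        self.kind = kind  # e.g. "DNA", "RNA", "protein", or None if unknown

    def __str__(self):
        return self._data

    def _check_kind(self, other):
        # Warn only when both sides declare a kind and the kinds differ.
        other_kind = getattr(other, "kind", None)
        if self.kind and other_kind and self.kind != other_kind:
            warnings.warn("Comparing incompatible sequence types: %s vs %s"
                          % (self.kind, other_kind))

    def __eq__(self, other):
        self._check_kind(other)
        return str(self) == str(other)

    def __ne__(self, other):
        return not self == other

    def __lt__(self, other):
        self._check_kind(other)
        return str(self) < str(other)

    def __hash__(self):
        # Hash like the underlying string, consistent with __eq__.
        return hash(str(self))
```

With this, ToySeq("ACGT", "DNA") == "ACGT" is true without any warning,
while comparing a "DNA" instance with a "protein" instance still compares
the letters but emits a warning first.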

This is my outline plan for the change:

For Biopython up to 1.53, Seq class uses object equality,
seq1==seq2 acts as id(seq1)==id(seq2)

For Biopython 1.54 (and perhaps a few more releases),
the Seq classes will still use object equality but will trigger
a warning suggesting explicit use of  id(seq1)==id(seq2)
or str(seq1)==str(seq2) as appropriate.

For Biopython 1.xx (maybe 1.55 or 1.56?) the Seq classes
will switch to using string equality (with an alphabet aware
warning for comparing DNA to RNA etc), but will also trigger
a warning that this is a change from previous releases, and
suggest in the short term the continued explicit use of either
id(seq1)==id(seq2) for object identity or str(seq1)==str(seq2)
for string identity.

For Biopython 1.yy (maybe 1.57?) the Seq classes will
use string equality (with an alphabet aware warning for
comparing DNA to RNA etc), without any warning about
this being a change from historic behaviour.

These warning messages could also point at a wiki page,
and we'd need a FAQ entry in the tutorial as well. The
aim of this slightly drawn out switch is to try and make
sure all users are aware of the change, even if they
only update their copy of Biopython every few releases.
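
As a rough illustration of the final behaviour described above, here is a
minimal sketch of string-like comparison with an alphabet warning. ToySeq
and its plain-string alphabet labels are hypothetical stand-ins invented
for this example; they are not Biopython's Seq or Alphabet classes.

```python
import warnings


class ToySeq:
    """Hypothetical stand-in for a Seq-like class (not the Biopython API)."""

    def __init__(self, data, alphabet):
        self.data = data          # the sequence letters as a plain string
        self.alphabet = alphabet  # assumed label such as "DNA" or "protein"

    def __str__(self):
        return self.data

    def _check_alphabet(self, other):
        # Warn, but still compare, when the alphabet labels differ.
        if isinstance(other, ToySeq) and other.alphabet != self.alphabet:
            warnings.warn(
                "Comparing %s to %s sequences" % (self.alphabet, other.alphabet),
                stacklevel=3,
            )

    def __eq__(self, other):
        self._check_alphabet(other)
        return str(self) == str(other)

    def __ne__(self, other):
        return not self == other

    def __lt__(self, other):
        self._check_alphabet(other)
        return str(self) < str(other)

    def __hash__(self):
        # Keep hashing consistent with the string-based equality above.
        return hash(str(self))
```

With this sketch, ToySeq("ACGT", "DNA") == ToySeq("ACGT", "protein") is
True but emits a warning, while code that wants object identity would
write id(seq1)==id(seq2) or "seq1 is seq2" explicitly.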

Does that all sound sensible? If so, we should probably
have an announcement on the main mailing list, in case
there are any other views.

Other more complex options include a flag for switching
between the modes - but that complexity doesn't seem
such a good idea to me. All my own code and most of
the unit tests use str(seq1)==str(seq2) explicitly anyway.
The only exception is some of the genetic algorithm unit
tests which do seem to want explicit object identity.

Regards,

Peter

From biopython at maubp.freeserve.co.uk  Tue Feb 23 06:31:35 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 23 Feb 2010 11:31:35 +0000
Subject: [Biopython-dev] Handles and/or filenames in Bio.SeqIO etc?
In-Reply-To: <20090728221726.GK68751@sobchak.mgh.harvard.edu>
References: <320fb6e00907280934i54f326a6r38325c05a314cdbc@mail.gmail.com>
	<20090728221726.GK68751@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e01002230331j5f5f87c5lf328d3bacc4a557b@mail.gmail.com>

Hi all,

As mentioned in another thread, Brad, Michiel and I had
an informal meeting earlier this month in Tokyo and
discussed some plans for Biopython. One of the short
term changes we agreed on was to push ahead with
the Seq object equality changes, see:
http://lists.open-bio.org/pipermail/biopython-dev/2010-February/007351.html

Another short term change we agreed was worthwhile
was to follow other Python libraries and allow handles
OR filenames in our parsers (starting with SeqIO and
AlignIO). This follows the discussion for the "TreeIO"
module (since renamed) and the Bio.SeqIO.convert
functions here on the mailing list last year, see:
http://lists.open-bio.org/pipermail/biopython-dev/2009-July/006503.html

I will tackle this shortly for Bio.SeqIO and Bio.AlignIO.
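
One possible shape for this, as a minimal sketch only: a small context
manager that opens a file when given a filename and passes an existing
handle straight through. The names open_if_filename and
count_fasta_records are hypothetical and chosen for illustration; this is
not a description of how Bio.SeqIO or Bio.AlignIO implement it.

```python
import contextlib


@contextlib.contextmanager
def open_if_filename(handle_or_filename, mode="r"):
    """Yield a file handle, opening (and later closing) it only if given a path."""
    if isinstance(handle_or_filename, str):
        with open(handle_or_filename, mode) as handle:
            yield handle
    else:
        # Assume the caller owns an already-open handle; leave it open.
        yield handle_or_filename


def count_fasta_records(handle_or_filename):
    """Toy parser: count lines starting with '>' from either kind of input."""
    with open_if_filename(handle_or_filename) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

The design point is that a handle passed in by the caller is never closed
by the parser, while a file opened from a filename always is.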

Peter

From bugzilla-daemon at portal.open-bio.org  Tue Feb 23 12:43:01 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 23 Feb 2010 12:43:01 -0500
Subject: [Biopython-dev] [Bug 3013] import warnings missing in
	Bio/PDB/MMCIF2Dict.py
In-Reply-To: <bug-3013-42@http.bugzilla.open-bio.org/>
Message-ID: <201002231743.o1NHh17v001826@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3013


------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk  2010-02-23 12:43 EST -------
Hi Eric,

I have fixed most (all?) of those problems reported by pylint - see mailing
list post.

Thanks!

Peter



From biopython at maubp.freeserve.co.uk  Tue Feb 23 12:43:31 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 23 Feb 2010 17:43:31 +0000
Subject: [Biopython-dev] Running pylint over Biopython
Message-ID: <320fb6e01002230943g474dc673x43f7df120e679d91@mail.gmail.com>

Hi all,

Those following @Biopython on twitter or subscribed to the github RSS
feed for our repository will know this already, but I've been using
pylint today to spot some errors in Biopython.
http://www.logilab.org/project/pylint

This was prompted by Eric trying this on Bio.PDB for Bug 3013 and
finding some issues - thanks Eric, this was a valuable suggestion.

With its default settings pylint is very very noisy, and in particular
doesn't like our naming conventions. However, with the following
command line you can focus in on the important stuff:

pylint --disable-msg-cat=CRW --include-ids=y
--disable-msg=E1101,E1103,E0102 -r n Bio BioSQL

Note that instead of module names, you can give filenames (e.g. *.py).
What that does is disable several categories of message (conventions,
possible refactorings, warnings) leaving just errors and fatal
messages. I turned on the message identifiers so that I have something
useful to stick into Google if need be, or to add to the ignore list
(currently three cases which looked like false positives). Then I turn
off the detailed report.

[Tip - don't run this from the Biopython source directory as then
importing our C code modules will fail]

As you will be able to tell from the recent flurry of git commits,
this highlighted some simple errors like missing imports or typos in
variable names.


Tiago, could you have a look at these possible problems in Bio.PopGen:

************* Module Bio.PopGen.Async
E0602: 78:Async.get_result: Undefined variable 'done'
E0602: 79:Async.get_result: Undefined variable 'done'
************* Module Bio.PopGen.GenePop
E0602:160:Record.split_in_pops: Undefined variable 'GenePop'
E0602:177:Record.split_in_loci: Undefined variable 'GenePop'
************* Module Bio.PopGen.GenePop.Controller
E0602: 41:_read_allele_freq_table: Undefined variable 'self'
E0602:133:_hw_func: Undefined variable 'self'
E0602:393:GenePopController.test_pop_hw_prob: Undefined variable 'ext'
E0602:458:GenePopController.test_ld.ld_pop_func: Undefined variable
'currrent_pop'
************* Module Bio.PopGen.SimCoal.Cache
E0602: 79:SimCoalCache.getSimulation: Undefined variable 'Config'
E0602: 88: Undefined variable 'Cache'
************* Module Bio.PopGen.SimCoal.Controller
E0602: 47:SimCoalController.run_simcoal: Undefined variable 'Config'


Eric, I don't have all the dependencies installed, but pylint does
appear to dislike a few things in Bio.Phylo on the trunk:

************* Module Bio.Phylo.BaseTree
E0203:521:TreeMixin.prune: Access to member 'root' before its
definition line 531
E0203:527:TreeMixin.prune: Access to member 'root' before its
definition line 531
E0202:672:Subtree.root: An attribute inherited from TreeMixin hide this method
************* Module Bio.Phylo.PhyloXML
E1120:182:Phylogeny.get_alignment: No value passed for parameter
'follow_attrs' in function call


One thing this exercise has shown is that we still need to do some
work on the unit test coverage.

Regards

Peter

From tiagoantao at gmail.com  Tue Feb 23 12:56:22 2010
From: tiagoantao at gmail.com (Tiago Antão)
Date: Tue, 23 Feb 2010 17:56:22 +0000
Subject: [Biopython-dev] Running pylint over Biopython
In-Reply-To: <320fb6e01002230943g474dc673x43f7df120e679d91@mail.gmail.com>
References: <320fb6e01002230943g474dc673x43f7df120e679d91@mail.gmail.com>
Message-ID: <6d941f121002230956h582acc11r59cb18bca8d2f727@mail.gmail.com>

This comes at a good time: I've actually been making changes to the
code (the genepop parser is not able to handle big files and I've
had quite a few complaints about that). It seems to be 2.6 related or
so, because I've detected the Config problem myself. I will correct
this next week (this week is _impossible_), along with an update to
the genepop parser to support big files.



-- 
"Pessimism of the Intellect; Optimism of the Will" -Antonio Gramsci


From bugzilla-daemon at portal.open-bio.org  Tue Feb 23 13:03:53 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 23 Feb 2010 13:03:53 -0500
Subject: [Biopython-dev] [Bug 3013] import warnings missing in
	Bio/PDB/MMCIF2Dict.py
In-Reply-To: <bug-3013-42@http.bugzilla.open-bio.org/>
Message-ID: <201002231803.o1NI3r10002509@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3013


macrozhu+biopy at gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |macrozhu+biopy at gmail.com


------- Comment #5 from macrozhu+biopy at gmail.com  2010-02-23 13:03 EST -------
wow, the developers really respond very quickly. 

How about running >>pylint<< or >>pychecker<< on all BioPython code to detect
potential problems?

cheers,



From bugzilla-daemon at portal.open-bio.org  Tue Feb 23 13:59:49 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 23 Feb 2010 13:59:49 -0500
Subject: [Biopython-dev] [Bug 3013] import warnings missing in
	Bio/PDB/MMCIF2Dict.py
In-Reply-To: <bug-3013-42@http.bugzilla.open-bio.org/>
Message-ID: <201002231859.o1NIxnJH004142@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3013


------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk  2010-02-23 13:59 EST -------
(In reply to comment #5)
> wow, the developers really respond very quickly. 
> 
> How about running >>pylint<< or >>pychecker<< on all BioPython code to detect
> potential problems?
> 
> cheers,
> 

Already tried with pylint earlier today ;)
http://lists.open-bio.org/pipermail/biopython-dev/2010-February/007354.html



From eric.talevich at gmail.com  Tue Feb 23 22:11:25 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 23 Feb 2010 22:11:25 -0500
Subject: [Biopython-dev] Running pylint over Biopython
In-Reply-To: <320fb6e01002230943g474dc673x43f7df120e679d91@mail.gmail.com>
References: <320fb6e01002230943g474dc673x43f7df120e679d91@mail.gmail.com>
Message-ID: <3f6baf361002231911g65b47cbfw3625f9aacf863abc@mail.gmail.com>

2010/2/23 Peter <biopython at maubp.freeserve.co.uk>

> Hi all,
>
> Those following @Biopython on twitter or subscribed to the github RSS
> feed for our repository will know this already, but I've been using
> pylint today to spot some errors in Biopython.
> http://www.logilab.org/project/pylint
>
> This was prompted by Eric trying this on Bio.PDB for Bug 3013 and
> finding some issues - thank Eric, this was a valuable suggestion.
>
Glad I could help. :)


> Eric, I don't have all the dependencies installed by pylint does
> appear to dislike a few things in Bio.Phylo on the trunk:
>

Pylint hates the way I wrote Bio.Phylo, in particular the way TreeMixin
assumes it will be mixed with a class that has 'root' and 'is_terminal'
attributes, and the __dict__ hack in the PhyloXML class __init__ methods --
it can't figure out where the attributes are coming from.

The last error was real, and I've pushed a fix to the trunk. Thanks for
catching it.

> One thing this exercise has shown is that we still need to do some
> work on the unit test coverage.
>

Agreed. I also added a unit test for get_alignment (finally), and should get
to TreeMixin.prune and .split soon. Then Bio.Phylo will have essentially
100% unit test coverage.

Cheers
-Eric

From biopython at maubp.freeserve.co.uk  Wed Feb 24 02:41:18 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 24 Feb 2010 07:41:18 +0000
Subject: [Biopython-dev] Running pylint over Biopython
In-Reply-To: <3f6baf361002231911g65b47cbfw3625f9aacf863abc@mail.gmail.com>
References: <320fb6e01002230943g474dc673x43f7df120e679d91@mail.gmail.com>
	<3f6baf361002231911g65b47cbfw3625f9aacf863abc@mail.gmail.com>
Message-ID: <320fb6e01002232341n3ee397basddde348df86d4871@mail.gmail.com>

On Wed, Feb 24, 2010 at 3:11 AM, Eric Talevich <eric.talevich at gmail.com> wrote:
> 2010/2/23 Peter <biopython at maubp.freeserve.co.uk>
>
>> Hi all,
>>
>> Those following @Biopython on twitter or subscribed to the github RSS
>> feed for our repository will know this already, but I've been using
>> pylint today to spot some errors in Biopython.
>> http://www.logilab.org/project/pylint
>>
>> This was prompted by Eric trying this on Bio.PDB for Bug 3013 and
>> finding some issues - thank Eric, this was a valuable suggestion.
>>
>> Glad I could help. :)

Re-reading Bug 3013, we might also want to try PyChecker
as suggested by Hongbo Zhu - I've not used that before.

>> Eric, I don't have all the dependencies installed by pylint does
>> appear to dislike a few things in Bio.Phylo on the trunk:
>
> Pylint hates the way I wrote Bio.Phylo, in particular the way TreeMixin
> assumes it will be mixed with a class that has 'root' and 'is_terminal'
> attributes, and the __dict__ hack in the PhyloXML class __init__ methods --
> it can't figure out where the attributes are coming from.

Some of the "apparent false positives" I was ignoring related to
the iterator classes in Bio.SeqIO, again this seems to be valid
code which pylint can't cope with. We may want to follow up
on this (it could be a bug in pylint?).

That said, if you can think of a cleaner way to code your bits
that might be advantageous for long term maintenance. Maybe
just add a TODO comment to consider using Abstract Base
Classes once we require Python 2.6+ for Biopython (if that
looks suitable)?
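
To make the Abstract Base Class idea concrete, here is a minimal sketch in
which the mixin declares the attributes it expects from the class it is
mixed into. TreeMixinSketch, Node and describe are hypothetical names for
illustration and do not reflect the actual Bio.Phylo code; the sketch is
also written in Python 3 syntax, whereas Python 2.6 would spell the
metaclass differently.

```python
import abc


class TreeMixinSketch(abc.ABC):
    """Hypothetical mixin that declares what it needs from the host class."""

    @property
    @abc.abstractmethod
    def root(self):
        """The root node of the tree."""

    @abc.abstractmethod
    def is_terminal(self):
        """Return True if this node has no children."""

    def describe(self):
        # Uses only the attributes declared as abstract above.
        return "leaf" if self.root.is_terminal() else "internal"


class Node(TreeMixinSketch):
    """Toy host class that provides both required attributes."""

    def __init__(self, children=()):
        self.children = list(children)

    @property
    def root(self):
        return self

    def is_terminal(self):
        return not self.children
```

A subclass that forgets to define root or is_terminal cannot be
instantiated (Python raises TypeError), so the mixin's assumptions are
stated in the code rather than left implicit.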

> The last error was real, and I've pushed a fix to the trunk.
> Thanks for catching it.

Cool.

>> One thing this exercise has shown is that we still need
>> to do some work on the unit test coverage.
>
> Agreed. I also added a unit test for get_alignment (finally),
> and should get to TreeMixin.prune and .split soon. Then
> Bio.Phylo will have essentially 100% unit test coverage.

I didn't mean to single out just Bio.Phylo - I meant the whole
of Biopython would benefit from more unit tests. In particular,
a lot of the "minor" errors pylint helped me fix were in error
messages (e.g. wrong variable name used). This means if
a user hit the error, rather than the exception we wanted to
raise they'd get an error about our message. So, not critical,
but it suggests we need more tests to cover the exceptions
(as well as the more important tests to cover typical usage).
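
As an illustration of the kind of test meant here, the following is a
minimal sketch that exercises an error path and checks the message.
parse_count is a hypothetical helper invented for this example, not a
Biopython function.

```python
import unittest


def parse_count(text):
    """Hypothetical helper: return an int, or raise ValueError with a clear message."""
    try:
        return int(text)
    except ValueError:
        # A typo here (say 'txt' instead of 'text') would raise NameError
        # instead of the ValueError we intend the user to see.
        raise ValueError("Could not parse count from %r" % text)


class ParseCountErrorTests(unittest.TestCase):
    def test_bad_input_raises_value_error(self):
        with self.assertRaises(ValueError) as context:
            parse_count("ten")
        # Checking the message exercises the string formatting as well.
        self.assertIn("'ten'", str(context.exception))


def run_tests():
    """Run the test case above and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParseCountErrorTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

If the message line used a misspelt variable name, this test would report
an error (NameError) rather than pass, which is exactly the class of
mistake pylint flagged.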

Peter

From p.j.a.cock at googlemail.com  Wed Feb 24 02:43:48 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 24 Feb 2010 07:43:48 +0000
Subject: [Biopython-dev] Medium/long term plans
Message-ID: <320fb6e01002232343s2df80990s96774b44f942e851@mail.gmail.com>

Hi all,

As mentioned in other recent threads, Brad and I were in Tokyo
earlier this month for the DBCLS BioHackathon 2010 (see
http://hackathon3.dbcls.jp/ for details). While there, we met up
with Michiel for an informal dinner meeting, and discussed some
possible plans for Biopython.

=== Short term action points ===

Seq object equality, see:
http://lists.open-bio.org/pipermail/biopython-dev/2010-February/007351.html

Filenames or handles in SeqIO, AlignIO, etc, see:
http://lists.open-bio.org/pipermail/biopython-dev/2010-February/007352.html

=== Medium term action points ===

Python 3 support. With NumPy starting to make serious plans for
supporting Python 3 this year, we should be able to look at doing this
too. Initially we will continue to focus on Python 2.x, but make more
effort to ensure that we can run without issues in the "Python 3
warning mode" available in Python 2.6 (or 2.7 once that is out).
Then start to put Biopython through 2to3, and see how we get on.

Name space reorganisation for sequences. It would be nice to
have the Seq objects, SeqFeature, SeqRecord and probably
SeqUtils and SeqIO all under one module name. We may be
able to handle this in the short term with two import routes with
the old module names discouraged and eventually deprecated.
See also the "Code review request for phyloxml branch" thread
which covered some of this:
http://lists.open-bio.org/pipermail/biopython-dev/2010-January/007215.html

=== Long term action points ===

There are things in Biopython that with hindsight we feel have not
worked out so well (module naming, alphabet objects) where
change may require a break, i.e. a Biopython version two. Should
we start a wiki to record points of debate, and get people to list their
niggles/faults for consideration?

Regarding Python 3.x support and a possible Biopython 2.x see
also Guido's blog post (there is probably an email version on one
of the python mailing lists too):
http://www.artima.com/weblogs/viewpost.jsp?thread=227041

Peter

From biopython at maubp.freeserve.co.uk  Wed Feb 24 06:52:55 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 24 Feb 2010 11:52:55 +0000
Subject: [Biopython-dev] [Biopython] SeqIO.index improvement suggestions
In-Reply-To: <320fb6e00912220808w53485af8s801e5a24666d9627@mail.gmail.com>
References: <4B2BB938.5030709@igc.gulbenkian.pt>
	<320fb6e00912181339o1a5c4100w6f1957fd4d78d20d@mail.gmail.com>
	<4B2C12B0.9060806@igc.gulbenkian.pt>
	<320fb6e00912190157m151c1b49t59b776c5130dad22@mail.gmail.com>
	<3f6baf360912191442m1ceb36afw824437f703dfaad0@mail.gmail.com>
	<320fb6e00912201006k5fbfebe4rb61e0538578e6ad@mail.gmail.com>
	<320fb6e00912220734r197e4baanac78c9188a33ddce@mail.gmail.com>
	<320fb6e00912220808w53485af8s801e5a24666d9627@mail.gmail.com>
Message-ID: <320fb6e01002240352p622f842bo140b1210d76fde4c@mail.gmail.com>

On Tue, Dec 22, 2009 at 4:08 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> The gzip mode issue is interesting... running on the Mac,
> Leopard 10.5, using the Apple provided Python 2.5.2,
> looking at a gzipped QUAL file everything is fine:
>
> Python 2.5.2 (r252:60911, Feb 22 2008, 07:57:53)
> [GCC 4.0.1 (Apple Computer, Inc. build 5363)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
>>>> import gzip
>>>> gzip.open("Quality/example.qual.gz", "r").read()
> ...
>
> Looking at a gzipped FASTA file everything is fine:
> ...
>
> But, there is a problem with my gzipped FASTQ file:
>
>>>> gzip.open("Quality/example.fastq.gz", "r").read()
> '@EAS54_6_R1_2_1_413_324\nCCCTTCTTGTCTTCAGCGTTTCTCC\n+\n;;3;;;;;;;;;;;;7;;;;;;;88\n@EAS54_6_R1_2_1_540_792\nTTGGCAGGCCAAGGCCGATGGATCA\n+\n;;;;;;;;;;;7;;;;;-;;;3;83\n@EAS54_6_R1_2_1_443_348\nGTTGCTTCTGGCGTGGGTGGGGGGG\n+\n;;;;;;;;;;;9;7;;.7;393333\n'
>>>> gzip.open("Quality/example.fastq.gz", "rb").read()
> '@EAS54_6_R1_2_1_413_324\nCCCTTCTTGTCTTCAGCGTTTCTCC\n+\n;;3;;;;;;;;;;;;7;;;;;;;88\n@EAS54_6_R1_2_1_540_792\nTTGGCAGGCCAAGGCCGATGGATCA\n+\n;;;;;;;;;;;7;;;;;-;;;3;83\n@EAS54_6_R1_2_1_443_348\nGTTGCTTCTGGCGTGGGTGGGGGGG\n+\n;;;;;;;;;;;9;7;;.7;393333\n'
>>>> gzip.open("Quality/example.fastq.gz", "rU").read()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/gzip.py",
> line 220, in read
>     self._read(readsize)
>   File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/gzip.py",
> line 292, in _read
>     self._read_eof()
>   File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/gzip.py",
> line 311, in _read_eof
>     raise IOError, "CRC check failed"
> IOError: CRC check failed
>
> I may have stumbled on a bug in the Python gzip library :(
>

Prompted by a thread on the BioPerl mailing list, I revisited this issue:
http://lists.open-bio.org/pipermail/bioperl-l/2010-February/032359.html

From some cross platform testing, I always seem to get the CRC error
when trying to open this gzipped FASTQ file in universal newlines read
mode. The FASTA and QUAL files seem fine.

According to the gzip python module's documentation, it uses the zlib
module, and you can find the underlying version number like this:

>>> import zlib
>>> zlib.ZLIB_VERSION
'1.2.3'

Results from some testing the simple examples above (using Python
and the gzip module only):

[1] Mac OS X 10.5, Python 2.5.2, GCC 4.0.1, zlib 1.2.3 - fails
[2] Linux, Python 2.4.3, GCC 3.4.5, zlib 1.2.1.2 - fails
[3] Linux, Python 2.3.4, GCC 3.4.6, zlib 1.2.1.2 - fails
[3] Linux, Python 2.6.1, GCC 3.4.6, zlib 1.2.1.2 - fails
[4] Linux, Python 2.4.3, GCC 4.1.2, zlib 1.2.3 - fails
[4] Linux, Python 2.6.1, GCC 3.4.6, zlib 1.2.3 - fails
[5] Windows XP 32bit, Python 2.7a1, MSC v.1500, zlib 1.2.3 - fails
[5] Windows XP 32bit, Python 2.6, MSC v.1500, zlib 1.2.3 - fails
[5] Windows XP 32bit, Python 2.5.2, MSC v.1310, zlib 1.2.3 - fails
[5] Windows XP 32bit, Python 2.4.4, MSC v.1310, zlib 1.2.3 - fails
[5] Windows XP 32bit, Python 2.3.5, MSC v.1200, zlib 1.1.4 - fails

[1] My mac, [2] Local server, [3] Cluster head, [4] Cluster node, [5]
My windows box

This tells me that the failure isn't OS specific, and isn't specific to a
particular version of Python or zlib. Note that on the Mac and Linux machines
where I get the CRC failure in python, the command line tool gunzip can
decompress the files fine.

If anyone else wants to test this (to confirm I'm not missing anything
obvious), you can download the gzipped files from github here:
wget http://github.com/peterjc/biopython/raw/index-zip/Tests/Quality/example.qual.gz
wget http://github.com/peterjc/biopython/raw/index-zip/Tests/Quality/example.fasta.gz
wget http://github.com/peterjc/biopython/raw/index-zip/Tests/Quality/example.fastq.gz

Maybe this mode isn't fully supported in gzip? I think that provided we
assume that any gzipped text file will use Unix new lines, we don't need
to worry about this.
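
For comparison, a minimal sketch of reading gzipped text line by line in
modern Python 3 (an assumption; the tests above were on Python 2.3 to
2.7a1). Here the bytes are decompressed first and newline handling is
applied afterwards by the text layer. The function name
iter_gzipped_lines is hypothetical.

```python
import gzip


def iter_gzipped_lines(path, encoding="ascii"):
    """Yield decoded text lines from a gzipped file."""
    # "rt" decompresses the bytes first and only then decodes them as text;
    # newline=None asks the text layer for universal newlines translation.
    with gzip.open(path, "rt", encoding=encoding, newline=None) as handle:
        for line in handle:
            yield line
```

Because translation happens after decompression, the CRC check sees the
untouched compressed bytes regardless of which newline style the original
text file used.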

Peter


From biopython at maubp.freeserve.co.uk  Wed Feb 24 07:00:18 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 24 Feb 2010 12:00:18 +0000
Subject: [Biopython-dev] Running pylint over Biopython
In-Reply-To: <6d941f121002230956h582acc11r59cb18bca8d2f727@mail.gmail.com>
References: <320fb6e01002230943g474dc673x43f7df120e679d91@mail.gmail.com>
	<6d941f121002230956h582acc11r59cb18bca8d2f727@mail.gmail.com>
Message-ID: <320fb6e01002240400k11764b2al2438d5381ed335c4@mail.gmail.com>

2010/2/23 Tiago Antão <tiagoantao at gmail.com>:
> This comes at a good time: I've actually been making changes to the
> code (the genepop parser is not able to handle big files and I've
> had quite a few complaints about that). It seems to be 2.6 related or
> so, because I've detected the Config problem myself. I will correct
> this next week (this week is _impossible_), along with an update to
> the genepop parser to support big files.

Sounds good :)

Peter


From biopython at maubp.freeserve.co.uk  Wed Feb 24 07:37:20 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 24 Feb 2010 12:37:20 +0000
Subject: [Biopython-dev] test_PhyloXML.py failing on Windows
Message-ID: <320fb6e01002240437s4be4aef1q82bee9141acfb501@mail.gmail.com>

Hi Eric,

Do you have access to a Windows machine for testing? There
seem to be two issues in the PhyloXML tests (tested on
Python 2.5, 2.6 and 2.7a1 on Windows XP):

Count and confirm the number of tags in each example XML file. ... FAIL
Round-trip parsing and serialization of apaf.xml. ... ERROR
Round-trip parsing and serialization of bcl_2.xml. ... ERROR
Round-trip parsing and serialization of o_tol_332_d_dollo.xml. ... ERROR
Round-trip parsing and serialization of made_up.xml. ... ERROR
Round-trip parsing and serialization of phyloxml_examples.xml. ... ERROR

The tag count error I don't immediately understand:

======================================================================
FAIL: Count and confirm the number of tags in each example XML file.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\repositories\biopython_official\Tests\test_PhyloXML.py",
line 56, in test_dump_tags
    self.assertEquals(len(output.readlines()), count)
AssertionError: 301 != 289

----------------------------------------------------------------------

The rest all fail in _stash_rewrite_and_call where something about
your file renaming is failing. It looks like you deliberately move some
of your example XML files to a temp filename during the test and
then move them back. This seems risky (e.g. if the test suite is
stopped mid way). Can you rework this to write the output to a
temp file or perhaps better yet a StringIO handle?
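
A minimal sketch of the in-memory alternative, using the standard
library's xml.etree.ElementTree as a stand-in for the PhyloXML parser and
writer (an assumption made so the example is self-contained; the function
names round_trip_in_memory and count_tags are hypothetical):

```python
import io
import xml.etree.ElementTree as ET


def round_trip_in_memory(source):
    """Parse XML from a filename or handle, re-serialise to memory, parse again."""
    original = ET.parse(source)
    buffer = io.BytesIO()
    original.write(buffer, encoding="utf-8")
    buffer.seek(0)  # rewind so the second parse starts at the beginning
    return original, ET.parse(buffer)


def count_tags(tree):
    """Return a dict mapping each tag name to how often it occurs in the tree."""
    counts = {}
    for element in tree.iter():
        counts[element.tag] = counts.get(element.tag, 0) + 1
    return counts
```

Nothing on disk is renamed or overwritten, so an interrupted test run
cannot leave the example XML files in a half-moved state.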

The errors look like this:

======================================================================
ERROR: Round-trip parsing and serialization of apaf.xml.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_PhyloXML.py", line 561, in test_apaf
    (TreeTests, ['test_DomainArchitecture']),
  File "test_PhyloXML.py", line 546, in _stash_rewrite_and_call
    os.rename(fname, fname + '~')
WindowsError: [Error 183] Cannot create a file when that file already exists

----------------------------------------------------------------------

Thanks,

Peter

From bugzilla-daemon at portal.open-bio.org  Wed Feb 24 10:38:13 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 24 Feb 2010 10:38:13 -0500
Subject: [Biopython-dev] [Bug 2998] Document need XCode with 10.4 SDK for
	Mac OS
In-Reply-To: <bug-2998-42@http.bugzilla.open-bio.org/>
Message-ID: <201002241538.o1OFcDJ4005667@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2998


------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk  2010-02-24 10:38 EST -------
Just to add a note, on Snow Leopard Apple provides python 2.5 (default, 32bit
only) and python 2.6 (supports 64 bit).

I suspect if you install Biopython under python 2.6 you won't need the 10.4
SDK... something to check?



From rjalves at igc.gulbenkian.pt  Wed Feb 24 11:07:01 2010
From: rjalves at igc.gulbenkian.pt (Renato Alves)
Date: Wed, 24 Feb 2010 16:07:01 +0000
Subject: [Biopython-dev] [Biopython] SeqIO.index improvement suggestions
In-Reply-To: <320fb6e01002240352p622f842bo140b1210d76fde4c@mail.gmail.com>
References: <4B2BB938.5030709@igc.gulbenkian.pt>	
	<320fb6e00912181339o1a5c4100w6f1957fd4d78d20d@mail.gmail.com>	
	<4B2C12B0.9060806@igc.gulbenkian.pt>	
	<320fb6e00912190157m151c1b49t59b776c5130dad22@mail.gmail.com>	
	<3f6baf360912191442m1ceb36afw824437f703dfaad0@mail.gmail.com>	
	<320fb6e00912201006k5fbfebe4rb61e0538578e6ad@mail.gmail.com>	
	<320fb6e00912220734r197e4baanac78c9188a33ddce@mail.gmail.com>	
	<320fb6e00912220808w53485af8s801e5a24666d9627@mail.gmail.com>
	<320fb6e01002240352p622f842bo140b1210d76fde4c@mail.gmail.com>
Message-ID: <4B854EA5.7050100@igc.gulbenkian.pt>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Quoting Peter on 02/24/2010 11:52 AM:

> Maybe this mode isn't fully supported in gzip? I think that provided we
> assume that any gzipped text file will use Unix new lines, we don't need
> to worry about this.

Your example puzzled me. I did a few more tests with the files you
pointed out. Turns out that the fastq file is 'badly' read even on
normal open 'Universal' mode. This doesn't happen on the other files:

Python 2.6.4 [GCC 4.4.1] Linux

>>> open('example.fastq.gz', 'rb').read() == open('example.fastq.gz',
'rU').read()
False
>>> open('example.fasta.gz', 'rb').read() == open('example.fasta.gz',
'rU').read()
True
>>> open('example.qual.gz', 'rb').read() == open('example.qual.gz',
'rU').read()
True

In particular the character at fault seems to be:

>>> (open('example.fastq.gz', 'rb').read()[145],
open('example.fastq.gz', 'rU').read()[145])
('\r', '\n')

This is the only thing that changed.
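
The effect can be reproduced without the example file. The following is a
minimal sketch (Python 3, with hypothetical helper names) that mimics
universal newlines translation on raw compressed bytes and checks whether
the result still decompresses:

```python
import gzip
import zlib


def translate_newlines(raw):
    """Mimic universal newlines on raw bytes: CRLF and lone CR both become LF."""
    return raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def survives_translation(compressed):
    """Return True if the gzip data still decompresses after the translation."""
    try:
        gzip.decompress(translate_newlines(compressed))
        return True
    except (OSError, EOFError, zlib.error):
        # Covers a failed CRC or length check, truncation, and corrupt data.
        return False
```

Compressed data that happens to contain no carriage return byte passes
through unchanged, which would explain why only some gzipped files are
affected.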

After going a little over the content of the file, I found this workaround:

$ gunzip example.fastq.gz && echo >> example.fastq && gzip example.fastq

Which simply adds a new empty line to the end of the file.

>>> open('example.fastq.gz', 'rb').read() == open('example.fastq.gz',
'rU').read()
True

After this I also looked into python3 (3.1.1) just in case they fixed it
already and apparently they did. See for yourself:

This was tested in Python-3.1.1 from within blender2.5, (apologies for
that, it was the only python3 version I had around).

>>> open('example.fastq.gz','rb').read() ==
open('example.fastq.gz','rU').read()
Traceback (most recent call last):
(...)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1:
unexpected code byte

Seems like I need to force binary mode...

>>> open('example.fastq.gz','rb').read() ==
open('example.fastq.gz','rbU').read()
True

Success!

>>> import gzip
>>> gzip.open('example.fastq.gz','rb').read()
b'@EAS54_6_R1_2_1_413_324\nCCCTTCTTGTCTTCAGCGTTTCTCC\n+\n;;3;;;;;;;;;;;;7;;;;;;;88\n at EAS54_6_R1_2_1_540_792\nTTGGCAGGCCAAGGCCGATGGATCA\n+\n;;;;;;;;;;;7;;;;;-;;;3;83\n at EAS54_6_R1_2_1_443_348\nGTTGCTTCTGGCGTGGGTGGGGGGG\n+\n;;;;;;;;;;;9;7;;.7;393333\n'

>>> gzip.open('example.fastq.gz','rU').read()
b'@EAS54_6_R1_2_1_413_324\nCCCTTCTTGTCTTCAGCGTTTCTCC\n+\n;;3;;;;;;;;;;;;7;;;;;;;88\n@EAS54_6_R1_2_1_540_792\nTTGGCAGGCCAAGGCCGATGGATCA\n+\n;;;;;;;;;;;7;;;;;-;;;3;83\n@EAS54_6_R1_2_1_443_348\nGTTGCTTCTGGCGTGGGTGGGGGGG\n+\n;;;;;;;;;;;9;7;;.7;393333\n'

>>> gzip.open('example.fastq.gz','rbU').read()
b'@EAS54_6_R1_2_1_413_324\nCCCTTCTTGTCTTCAGCGTTTCTCC\n+\n;;3;;;;;;;;;;;;7;;;;;;;88\n@EAS54_6_R1_2_1_540_792\nTTGGCAGGCCAAGGCCGATGGATCA\n+\n;;;;;;;;;;;7;;;;;-;;;3;83\n@EAS54_6_R1_2_1_443_348\nGTTGCTTCTGGCGTGGGTGGGGGGG\n+\n;;;;;;;;;;;9;7;;.7;393333\n'

And everything works as expected.

So unless the blender devs changed python to fix this bug, this has been
fixed in python3.

Should this go upstream?

- --
Renato
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkuFTqAACgkQYh11EUYTX9TXbgCgmBDKrrjL6Eue8qRfgs2ydAUQ
11kAnR0beVQDLP4ldBcd2RFfJ5Q+Opo6
=MLu3
-----END PGP SIGNATURE-----

From biopython at maubp.freeserve.co.uk  Wed Feb 24 11:48:58 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 24 Feb 2010 16:48:58 +0000
Subject: [Biopython-dev] [Biopython] SeqIO.index improvement suggestions
In-Reply-To: <4B854EA5.7050100@igc.gulbenkian.pt>
References: <4B2BB938.5030709@igc.gulbenkian.pt>
	<320fb6e00912181339o1a5c4100w6f1957fd4d78d20d@mail.gmail.com>
	<4B2C12B0.9060806@igc.gulbenkian.pt>
	<320fb6e00912190157m151c1b49t59b776c5130dad22@mail.gmail.com>
	<3f6baf360912191442m1ceb36afw824437f703dfaad0@mail.gmail.com>
	<320fb6e00912201006k5fbfebe4rb61e0538578e6ad@mail.gmail.com>
	<320fb6e00912220734r197e4baanac78c9188a33ddce@mail.gmail.com>
	<320fb6e00912220808w53485af8s801e5a24666d9627@mail.gmail.com>
	<320fb6e01002240352p622f842bo140b1210d76fde4c@mail.gmail.com>
	<4B854EA5.7050100@igc.gulbenkian.pt>
Message-ID: <320fb6e01002240848t37b0bea4jc8488440ab266ac6@mail.gmail.com>

On Wed, Feb 24, 2010 at 4:07 PM, Renato Alves <rjalves at igc.gulbenkian.pt> wrote:
>
> After this I also looked into python3 (3.1.1) just in case they fixed it
> already and apparently they did. See for yourself:

You seem to be right, I tried this on Windows using Python 3.0.1 and 3.1.1,

C:\repositories\biopython_pjc\Tests>c:\python30\python
Python 3.0.1 (r301:69561, Feb 13 2009, 20:04:18) [MSC v.1500 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import gzip
>>> gzip.open("Quality\example.fastq.gz", "r").read()
b'@EAS54_6_R1_2_1_413_324\nCCCTTCTTGTCTTCAGCGTTTCTCC\n+\n;;3;;;;;;;;;;;;7;;;;;;;
88\n@EAS54_6_R1_2_1_540_792\nTTGGCAGGCCAAGGCCGATGGATCA\n+\n;;;;;;;;;;;7;;;;;-;;;
3;83\n@EAS54_6_R1_2_1_443_348\nGTTGCTTCTGGCGTGGGTGGGGGGG\n+\n;;;;;;;;;;;9;7;;.7;
393333\n'
>>> gzip.open("Quality\example.fastq.gz", "rb").read()
b'@EAS54_6_R1_2_1_413_324\nCCCTTCTTGTCTTCAGCGTTTCTCC\n+\n;;3;;;;;;;;;;;;7;;;;;;;
88\n@EAS54_6_R1_2_1_540_792\nTTGGCAGGCCAAGGCCGATGGATCA\n+\n;;;;;;;;;;;7;;;;;-;;;
3;83\n@EAS54_6_R1_2_1_443_348\nGTTGCTTCTGGCGTGGGTGGGGGGG\n+\n;;;;;;;;;;;9;7;;.7;
393333\n'
>>> gzip.open("Quality\example.fastq.gz", "rU").read()
b'@EAS54_6_R1_2_1_413_324\nCCCTTCTTGTCTTCAGCGTTTCTCC\n+\n;;3;;;;;;;;;;;;7;;;;;;;
88\n@EAS54_6_R1_2_1_540_792\nTTGGCAGGCCAAGGCCGATGGATCA\n+\n;;;;;;;;;;;7;;;;;-;;;
3;83\n@EAS54_6_R1_2_1_443_348\nGTTGCTTCTGGCGTGGGTGGGGGGG\n+\n;;;;;;;;;;;9;7;;.7;
393333\n'


C:\repositories\biopython_pjc\Tests>c:\python31\python
Python 3.1.1 (r311:74483, Aug 17 2009, 17:02:12) [MSC v.1500 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import gzip
>>> gzip.open("Quality\example.fastq.gz", "r").read()
b'@EAS54_6_R1_2_1_413_324\nCCCTTCTTGTCTTCAGCGTTTCTCC\n+\n;;3;;;;;;;;;;;;7;;;;;;;
88\n@EAS54_6_R1_2_1_540_792\nTTGGCAGGCCAAGGCCGATGGATCA\n+\n;;;;;;;;;;;7;;;;;-;;;
3;83\n@EAS54_6_R1_2_1_443_348\nGTTGCTTCTGGCGTGGGTGGGGGGG\n+\n;;;;;;;;;;;9;7;;.7;
393333\n'
>>> gzip.open("Quality\example.fastq.gz", "rb").read()
b'@EAS54_6_R1_2_1_413_324\nCCCTTCTTGTCTTCAGCGTTTCTCC\n+\n;;3;;;;;;;;;;;;7;;;;;;;
88\n@EAS54_6_R1_2_1_540_792\nTTGGCAGGCCAAGGCCGATGGATCA\n+\n;;;;;;;;;;;7;;;;;-;;;
3;83\n@EAS54_6_R1_2_1_443_348\nGTTGCTTCTGGCGTGGGTGGGGGGG\n+\n;;;;;;;;;;;9;7;;.7;
393333\n'
>>> gzip.open("Quality\example.fastq.gz", "rU").read()
b'@EAS54_6_R1_2_1_413_324\nCCCTTCTTGTCTTCAGCGTTTCTCC\n+\n;;3;;;;;;;;;;;;7;;;;;;;
88\n@EAS54_6_R1_2_1_540_792\nTTGGCAGGCCAAGGCCGATGGATCA\n+\n;;;;;;;;;;;7;;;;;-;;;
3;83\n@EAS54_6_R1_2_1_443_348\nGTTGCTTCTGGCGTGGGTGGGGGGG\n+\n;;;;;;;;;;;9;7;;.7;
393333\n'

So this does look like a Python 2.x bug which has been fixed in Python
3.x, and we should probably report this (after searching to see if it
is a known issue).

However, even if it is fixed in Python 2.6.x and 2.7.x, it won't get
fixed in older versions like Python 2.4 or 2.5.

Peter

From biopython at maubp.freeserve.co.uk  Wed Feb 24 12:03:09 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 24 Feb 2010 17:03:09 +0000
Subject: [Biopython-dev] [Biopython] SeqIO.index improvement suggestions
In-Reply-To: <320fb6e01002240848t37b0bea4jc8488440ab266ac6@mail.gmail.com>
References: <4B2BB938.5030709@igc.gulbenkian.pt>
	<4B2C12B0.9060806@igc.gulbenkian.pt>
	<320fb6e00912190157m151c1b49t59b776c5130dad22@mail.gmail.com>
	<3f6baf360912191442m1ceb36afw824437f703dfaad0@mail.gmail.com>
	<320fb6e00912201006k5fbfebe4rb61e0538578e6ad@mail.gmail.com>
	<320fb6e00912220734r197e4baanac78c9188a33ddce@mail.gmail.com>
	<320fb6e00912220808w53485af8s801e5a24666d9627@mail.gmail.com>
	<320fb6e01002240352p622f842bo140b1210d76fde4c@mail.gmail.com>
	<4B854EA5.7050100@igc.gulbenkian.pt>
	<320fb6e01002240848t37b0bea4jc8488440ab266ac6@mail.gmail.com>
Message-ID: <320fb6e01002240903m52629576vf85f428f68d32d15@mail.gmail.com>

Hi all,

I've updated my branch to cope with gzipped FASTQ files, tested on
Windows XP, Mac OS X Snow Leopard, and Linux:

http://github.com/peterjc/biopython/tree/index-zip

This works by just opening gzipped files in default mode - which
seems to be fine with the examples (FASTA, QUAL and FASTQ)
where the text file in the archive uses Unix new line entries.

While this may be a good solution, we should test on gzipped
files containing Windows new lines too. Plus of course, try
non-gzipped compression. And very large files. etc.
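
A rough sketch of such a Windows new line test, assuming a Python 3
release where gzip.open() accepts the text mode "rt" (added in Python
3.3, so not available in the versions discussed in this thread). The
record content is made up for illustration:

```python
import gzip
import io

# Made-up FASTQ record using Windows (CRLF) line endings.
windows_text = b"@read1\r\nACGT\r\n+\r\n;;;;\r\n"

# Compress it into an in-memory buffer instead of a file on disk.
buffer = io.BytesIO()
with gzip.GzipFile(fileobj=buffer, mode="wb") as handle:
    handle.write(windows_text)

# Binary mode: the CRLF pairs come back exactly as stored.
buffer.seek(0)
with gzip.open(buffer, "rb") as handle:
    binary_data = handle.read()

# Text mode: the default newline handling translates CRLF to LF.
buffer.seek(0)
with gzip.open(buffer, "rt", encoding="ascii") as handle:
    text_data = handle.read()
```

In binary mode the caller still has to cope with the carriage returns;
in text mode they are translated away, which also shifts every
character offset after the first line.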

Peter

From eric.talevich at gmail.com  Wed Feb 24 12:03:31 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Wed, 24 Feb 2010 12:03:31 -0500
Subject: [Biopython-dev] test_PhyloXML.py failing on Windows
In-Reply-To: <320fb6e01002240437s4be4aef1q82bee9141acfb501@mail.gmail.com>
References: <320fb6e01002240437s4be4aef1q82bee9141acfb501@mail.gmail.com>
Message-ID: <3f6baf361002240903q2b395fd1qa5426a130b5b3d61@mail.gmail.com>

On Wed, Feb 24, 2010 at 7:37 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> Hi Eric,
>
> Do you have access to a Windows machine for testing? There
> seem to be two issues in the PhyloXML tests (tested on
> Python 2.5, 2.6 and 2.7a1 on Windows XP):
>

I'll have access to Windows XP this weekend, but I think I can probably fix
these tests before then.

======================================================================
> FAIL: Count and confirm the number of tags in each example XML file.
> ----------------------------------------------------------------------
>

This was an early sanity check for parsing XML with ElementTree, and while I
don't see a good reason for the number of lines to be different between OSes
(line endings?), the test isn't Biopython-specific anyway. I'll just delete
it.

======================================================================
> ERROR: Round-trip parsing and serialization of apaf.xml.
> ----------------------------------------------------------------------
>

Apparently Windows doesn't like renaming a file to replace another existing
file. To fix this error asap I'll call os.remove before the rename, but
you're right that these tests should be rewritten to use named temp files or
StringIO. (I needed to trick unittest into re-running the parser tests on
re-written files and this sufficed last summer)
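
A minimal sketch of the remove-then-rename workaround, assuming a
hypothetical helper called rename_over (this is not the code in
test_PhyloXML.py, and the file contents are placeholders):

```python
import os
import tempfile


def rename_over(src, dst):
    """Rename src to dst, deleting any existing dst first.

    On Windows os.rename() raises an error when dst already exists,
    so the destination is removed explicitly beforehand.
    """
    if os.path.exists(dst):
        os.remove(dst)
    os.rename(src, dst)


# Demonstration in a scratch directory with placeholder contents.
workdir = tempfile.mkdtemp()
stashed = os.path.join(workdir, "apaf.xml~")
original = os.path.join(workdir, "apaf.xml")
for path, text in ((stashed, "stashed copy"), (original, "rewritten copy")):
    with open(path, "w") as handle:
        handle.write(text)

rename_over(stashed, original)
with open(original) as handle:
    restored = handle.read()

# Clean up the scratch directory.
os.remove(original)
os.rmdir(workdir)
```

Note the two steps are not atomic: if the process dies between the
remove and the rename, the destination is gone. Later Python 3
releases (3.3 onwards) offer os.replace() for this.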

-Eric

From rjalves at igc.gulbenkian.pt  Wed Feb 24 12:13:41 2010
From: rjalves at igc.gulbenkian.pt (Renato Alves)
Date: Wed, 24 Feb 2010 17:13:41 +0000
Subject: [Biopython-dev] [Biopython] SeqIO.index improvement suggestions
In-Reply-To: <320fb6e01002240848t37b0bea4jc8488440ab266ac6@mail.gmail.com>
References: <4B2BB938.5030709@igc.gulbenkian.pt>	
	<320fb6e00912181339o1a5c4100w6f1957fd4d78d20d@mail.gmail.com>	
	<4B2C12B0.9060806@igc.gulbenkian.pt>	
	<320fb6e00912190157m151c1b49t59b776c5130dad22@mail.gmail.com>	
	<3f6baf360912191442m1ceb36afw824437f703dfaad0@mail.gmail.com>	
	<320fb6e00912201006k5fbfebe4rb61e0538578e6ad@mail.gmail.com>	
	<320fb6e00912220734r197e4baanac78c9188a33ddce@mail.gmail.com>	
	<320fb6e00912220808w53485af8s801e5a24666d9627@mail.gmail.com>	
	<320fb6e01002240352p622f842bo140b1210d76fde4c@mail.gmail.com>	
	<4B854EA5.7050100@igc.gulbenkian.pt>
	<320fb6e01002240848t37b0bea4jc8488440ab266ac6@mail.gmail.com>
Message-ID: <4B855E45.9080708@igc.gulbenkian.pt>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

>So this does look like a Python 2.x bug which has been fixed in Python
>3.x, and we should probably report this (after searching to see if it
>is a known issue).

The closest I could find is: http://bugs.python.org/issue5148

But it's also on gzip.open(), not plain open().

>However, even if it is fixed in Python 2.6.x and 2.7.x, it won't get
>fixed in older versions like Python 2.4 or 2.5.

Do you think raising a warning if the 'U' mode is explicitly passed would
be a reasonable solution for older python versions?

Renato
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkuFXkMACgkQYh11EUYTX9TKNACfXIj2p5OTRetf9cWU/ppV8oWb
CPcAoIJkkNfHj6AeLAxl2/FtSH3+7UR5
=W7wg
-----END PGP SIGNATURE-----

From biopython at maubp.freeserve.co.uk  Wed Feb 24 12:28:58 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 24 Feb 2010 17:28:58 +0000
Subject: [Biopython-dev] [Biopython] SeqIO.index improvement suggestions
In-Reply-To: <4B855E45.9080708@igc.gulbenkian.pt>
References: <4B2BB938.5030709@igc.gulbenkian.pt>
	<320fb6e00912190157m151c1b49t59b776c5130dad22@mail.gmail.com>
	<3f6baf360912191442m1ceb36afw824437f703dfaad0@mail.gmail.com>
	<320fb6e00912201006k5fbfebe4rb61e0538578e6ad@mail.gmail.com>
	<320fb6e00912220734r197e4baanac78c9188a33ddce@mail.gmail.com>
	<320fb6e00912220808w53485af8s801e5a24666d9627@mail.gmail.com>
	<320fb6e01002240352p622f842bo140b1210d76fde4c@mail.gmail.com>
	<4B854EA5.7050100@igc.gulbenkian.pt>
	<320fb6e01002240848t37b0bea4jc8488440ab266ac6@mail.gmail.com>
	<4B855E45.9080708@igc.gulbenkian.pt>
Message-ID: <320fb6e01002240928h54519628pfb91dd1bf8d9c1f7@mail.gmail.com>

On Wed, Feb 24, 2010 at 5:13 PM, Renato Alves <rjalves at igc.gulbenkian.pt> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
>>So this does look like a Python 2.x bug which has been fixed in Python
>>3.x, and we should probably report this (after searching to see if it
>>is a known issue).
>
> The closest I could find is: http://bugs.python.org/issue5148
>
> But it's also on gzip.open(), not plain open().

It is gzip.open() that we have a problem with, open() is fine.

It does look like http://bugs.python.org/issue5148 and/or the
linked bug http://bugs.python.org/issue6759 cover this issue.
Thanks for finding them.

>>However, even if it is fixed in Python 2.6.x and 2.7.x, it won't get
>>fixed in older versions like Python 2.4 or 2.5.
>
> Do you think raising a warning if the 'U' mode is explicitly passed
> would be a reasonable solution for older python versions?

Are you asking about what I would like Python to do?
I would like gzip.open() to support universal newline mode.

For Biopython's index function we currently don't allow the
user to specify the mode at all - the code decides this based
on the file format (SFF files must be binary, for text files I use
universal newline mode).
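
The mode choice described above might look roughly like this. The
names BINARY_FORMATS and choose_mode are hypothetical and are not
taken from Bio.SeqIO; "rU" is the Python 2 universal newline mode in
use at the time of this thread (it was removed from open() in later
Python 3 releases):

```python
# Hypothetical set of formats that must be read as raw bytes.
BINARY_FORMATS = frozenset(["sff"])


def choose_mode(format_name):
    """Pick the file mode from the format name; the caller has no say."""
    if format_name.lower() in BINARY_FORMATS:
        return "rb"
    # Everything else is treated as text with universal newlines.
    return "rU"


modes = dict((name, choose_mode(name)) for name in ("sff", "fasta", "fastq"))
```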

Peter

From rjalves at igc.gulbenkian.pt  Wed Feb 24 13:25:04 2010
From: rjalves at igc.gulbenkian.pt (Renato Alves)
Date: Wed, 24 Feb 2010 18:25:04 +0000
Subject: [Biopython-dev] [Biopython] SeqIO.index improvement suggestions
In-Reply-To: <320fb6e01002240928h54519628pfb91dd1bf8d9c1f7@mail.gmail.com>
References: <4B2BB938.5030709@igc.gulbenkian.pt>	
	<320fb6e00912190157m151c1b49t59b776c5130dad22@mail.gmail.com>	
	<3f6baf360912191442m1ceb36afw824437f703dfaad0@mail.gmail.com>	
	<320fb6e00912201006k5fbfebe4rb61e0538578e6ad@mail.gmail.com>	
	<320fb6e00912220734r197e4baanac78c9188a33ddce@mail.gmail.com>	
	<320fb6e00912220808w53485af8s801e5a24666d9627@mail.gmail.com>	
	<320fb6e01002240352p622f842bo140b1210d76fde4c@mail.gmail.com>	
	<4B854EA5.7050100@igc.gulbenkian.pt>	
	<320fb6e01002240848t37b0bea4jc8488440ab266ac6@mail.gmail.com>	
	<4B855E45.9080708@igc.gulbenkian.pt>
	<320fb6e01002240928h54519628pfb91dd1bf8d9c1f7@mail.gmail.com>
Message-ID: <4B856F00.7030201@igc.gulbenkian.pt>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

> For Biopython's index function we currently don't allow the
> user to specify the mode at all - the code decides this based
> on the file format (SFF files must be binary, for text files I use
> universal newline mode).

For some reason I thought the user could set the mode.
Anyway, thanks for the clarification.

Renato
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkuFbvwACgkQYh11EUYTX9QM6gCeK4aMVBoZWZmI+SNccwSd9qle
xv8AnA8gZLQn1m8bXMT9Dl5YIRM4akC2
=jQ9l
-----END PGP SIGNATURE-----

From bugzilla-daemon at portal.open-bio.org  Thu Feb 25 08:35:04 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 25 Feb 2010 08:35:04 -0500
Subject: [Biopython-dev] [Bug 2837] Reading Roche 454 SFF sequence read
	files in Bio.SeqIO
In-Reply-To: <bug-2837-42@http.bugzilla.open-bio.org/>
Message-ID: <201002251335.o1PDZ4qn013099@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2837


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #8 from biopython-bugzilla at maubp.freeserve.co.uk  2010-02-25 08:35 EST -------
Marking as fixed since I recently merged this code into the trunk.



From biopython at maubp.freeserve.co.uk  Thu Feb 25 09:29:19 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 25 Feb 2010 14:29:19 +0000
Subject: [Biopython-dev] test_PhyloXML.py failing on Windows
In-Reply-To: <3f6baf361002240903q2b395fd1qa5426a130b5b3d61@mail.gmail.com>
References: <320fb6e01002240437s4be4aef1q82bee9141acfb501@mail.gmail.com>
	<3f6baf361002240903q2b395fd1qa5426a130b5b3d61@mail.gmail.com>
Message-ID: <320fb6e01002250629te597954v46308838faca607e@mail.gmail.com>

On Wed, Feb 24, 2010 at 5:03 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>
> Apparently Windows doesn't like renaming a file to replace another existing
> file. To fix this error asap I'll call os.remove before the rename, ...

I had to add another similar check before it would run on my machine.

> but you're right that these tests should be rewritten to use named temp
> files or StringIO. (I needed to trick unittest into re-running the parser
> tests on re-written files and this sufficed last summer)

OK, something for the TODO list. Should we file a bug to remind us?

Peter

From biopython at maubp.freeserve.co.uk  Fri Feb 26 08:09:46 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 26 Feb 2010 13:09:46 +0000
Subject: [Biopython-dev] ImportWarning is new on Python 2.5
Message-ID: <320fb6e01002260509k65e9f9acm494c80a76af8d25e@mail.gmail.com>

Hi Eric,

I've just been running the test suite on Python 2.4 (on CentOS 5.4)
and noticed you use ImportWarning (which was added in Python 2.5) in
Bio/Phylo/PhyloXMLIO.py

Although we are going to phase out support for Python 2.4, we still
need to keep things compatible for now.

Are you happy to switch this to a different warning for now, and add a
TODO comment to put it back to an ImportWarning once we drop Python
2.4 support?

Thanks

Peter

From eric.talevich at gmail.com  Fri Feb 26 09:56:53 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Fri, 26 Feb 2010 09:56:53 -0500
Subject: [Biopython-dev] ImportWarning is new on Python 2.5
In-Reply-To: <320fb6e01002260509k65e9f9acm494c80a76af8d25e@mail.gmail.com>
References: <320fb6e01002260509k65e9f9acm494c80a76af8d25e@mail.gmail.com>
Message-ID: <3f6baf361002260656n581a526dtc4a5374640f546ed@mail.gmail.com>

On Fri, Feb 26, 2010 at 8:09 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> Hi Eric,
>
> I've just been running the test suite on Python 2.4 (on CentOS 5.4)
> and noticed you use ImportWarning (which was added in Python 2.5) in
> Bio/Phylo/PhyloXMLIO.py
>
> Although we are going to phase out support for Python 2.4, we still
> need to keep things compatible for now.
>
> Are you happy to switch this to a different warning for now, and add a
> TODO comment to put it back to an ImportWarning once we drop Python
> 2.4 support?
>

Sure, I'll switch it to a generic Warning for now and leave a comment. I
doubt the type of the warning is very important for most uses.
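
Something along these lines, as a sketch only: the function name and
message are made up here and are not the actual code in
Bio/Phylo/PhyloXMLIO.py.

```python
import warnings


def warn_missing_optional(module_name):
    """Warn that an optional module is unavailable (illustrative helper)."""
    # TODO: switch back to ImportWarning once Python 2.4 support is dropped
    warnings.warn(
        "Optional module %s could not be imported" % module_name,
        Warning,
    )


# Capture the warning so its category can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_missing_optional("cElementTree")
```

Code that filters on ImportWarning specifically would not match the
generic category, but anything filtering on Warning still would.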

-Eric

From bugzilla-daemon at portal.open-bio.org  Fri Feb 26 11:26:10 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 26 Feb 2010 11:26:10 -0500
Subject: [Biopython-dev] [Bug 2553] Adding SeqRecord objects to an alignment
	(append or extend)
In-Reply-To: <bug-2553-42@http.bugzilla.open-bio.org/>
Message-ID: <201002261626.o1QGQA1g028222@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2553


------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk  2010-02-26 11:26 EST -------
I've started a possible implementation of an improved multiple
sequence alignment object on a github branch:
http://github.com/peterjc/biopython/commits/alignment-obj

This already handles:
Bug 2553 - Adding SeqRecord objects to an alignment (append or extend)
Bug 2554 - Creating an Alignment from a list of SeqRecord objects

I also plan to cover:
Bug 2551 - Adding advanced __getitem__ e.g. align[1:2,5:-5]
Bug 2552 - Adding alignments 



From bugzilla-daemon at portal.open-bio.org  Fri Feb 26 11:26:43 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 26 Feb 2010 11:26:43 -0500
Subject: [Biopython-dev] [Bug 2554] Creating an Alignment from a list of
	SeqRecord objects
In-Reply-To: <bug-2554-42@http.bugzilla.open-bio.org/>
Message-ID: <201002261626.o1QGQhNF028283@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2554


------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk  2010-02-26 11:26 EST -------
I've started a possible implementation of an improved multiple
sequence alignment object on a github branch:
http://github.com/peterjc/biopython/commits/alignment-obj

This already handles:
Bug 2553 - Adding SeqRecord objects to an alignment (append or extend)
Bug 2554 - Creating an Alignment from a list of SeqRecord objects

I also plan to cover:
Bug 2551 - Adding advanced __getitem__ e.g. align[1:2,5:-5]
Bug 2552 - Adding alignments 



From bugzilla-daemon at portal.open-bio.org  Fri Feb 26 12:28:31 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 26 Feb 2010 12:28:31 -0500
Subject: [Biopython-dev] [Bug 2552] Adding alignments
In-Reply-To: <bug-2552-42@http.bugzilla.open-bio.org/>
Message-ID: <201002261728.o1QHSVob029960@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2552


------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk  2010-02-26 12:28 EST -------
I've started a possible implementation of an improved multiple
sequence alignment object on a github branch:
http://github.com/peterjc/biopython/commits/alignment-obj

This now handles:
Bug 2552 - Adding alignments (this bug)
Bug 2553 - Adding SeqRecord objects to an alignment (append or extend)
Bug 2554 - Creating an Alignment from a list of SeqRecord objects

I also plan to cover:
Bug 2551 - Adding advanced __getitem__ e.g. align[1:2,5:-5]



From bugzilla-daemon at portal.open-bio.org  Sat Feb 27 13:24:03 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 27 Feb 2010 13:24:03 -0500
Subject: [Biopython-dev] [Bug 3016] New: Change WriterTests in
	test_PhyloXML.py to use StringIO or temp files
Message-ID: <bug-3016-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=3016

           Summary: Change WriterTests in test_PhyloXML.py to use StringIO
                    or temp files
           Product: Biopython
           Version: Not Applicable
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: minor
          Priority: P2
         Component: Unit Tests
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: eric.talevich at gmail.com


The method _stash_rewrite_and_call currently parses each of the example
phyloXML files, renames the parsed file to [filename]~, writes out another copy
(from the parsed data structure) using the original filename, re-runs the suite
of parser tests on the rewritten files, and finally renames the stashed copies
back to the original filenames. This is protected by a try-finally clause, but
could still fail to restore the original test files if the Python interpreter
is interrupted/killed. Moreover, the design is a little pathological, and could
be hard to maintain or extend later.

Redesign the writer tests to rewrite and test a copy of each original at some
location other than the original filename. Ideally, use StringIO to store the
copy; a named temporary file (see tempfile module) is also acceptable.
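
A minimal sketch of the in-memory round trip, using the standard
library's ElementTree as a stand-in for the phyloXML parser and writer.
The sample document and the round_trip helper are assumptions for
illustration, not the planned test code:

```python
import io
import xml.etree.ElementTree as ET

# Tiny made-up document standing in for one of the example files.
SAMPLE = (
    b"<phyloxml><phylogeny><clade><name>A</name></clade>"
    b"</phylogeny></phyloxml>"
)


def round_trip(xml_bytes):
    """Parse, write to an in-memory buffer, and parse the copy again."""
    tree = ET.parse(io.BytesIO(xml_bytes))
    buffer = io.BytesIO()
    tree.write(buffer)  # nothing on disk is renamed or overwritten
    buffer.seek(0)
    return ET.parse(buffer)


reparsed = round_trip(SAMPLE)
tags = [element.tag for element in reparsed.iter()]
```

Because the copy only ever lives in the buffer, an interrupted test
run cannot leave the original example files renamed or missing.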



From bugzilla-daemon at portal.open-bio.org  Mon Feb  1 00:21:51 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 31 Jan 2010 19:21:51 -0500
Subject: [Biopython-dev] [Bug 3004] PSL alignment format parsing in
	Bio.AlignIO
In-Reply-To: <bug-3004-42@http.bugzilla.open-bio.org/>
Message-ID: <201002010021.o110Lp9e009311@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3004


------- Comment #2 from forgetta at gmail.com  2010-01-31 19:21 EST -------
Now on github:

http://github.com/vforget/PyBLATPSL

Vince




From bugzilla-daemon at portal.open-bio.org  Mon Feb  1 11:27:00 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 1 Feb 2010 06:27:00 -0500
Subject: [Biopython-dev] [Bug 3000] Could SeqIO.parse() store the whole,
	unparsed multiline entry?
In-Reply-To: <bug-3000-42@http.bugzilla.open-bio.org/>
Message-ID: <201002011127.o11BR0lp023326@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3000


------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk  2010-02-01 06:26 EST -------
(In reply to comment #1)
> (In reply to comment #0)
> > Still, I suspect this will
> > reformat the entry (currently I see trailing dot removed from KEYWORDS, no
> > REFERENCE, AUTHORS, TITLE, JOURNAL, PUBMED; and FEATURES.source being
> > re-ordered).
> 
> Yes, using Bio.SeqIO to read/write a GenBank record will give you (slightly)
> different output. We do not guarantee a 100% round trip (even on simpler
> formats like FASTA). Even little things like line wrapping would make this
> very difficult.
> 
> Regarding GenBank KEYWORDS, please file a bug.

Don't worry about reporting a bug for this, I've just fixed the missing period
for KEYWORDS:

http://github.com/biopython/biopython/commit/5a87b070fc1f4fb911d4cf8a2e53c330cd6bd83d

Peter




From bugzilla-daemon at portal.open-bio.org  Mon Feb  1 13:35:11 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 1 Feb 2010 08:35:11 -0500
Subject: [Biopython-dev] [Bug 2294] Writing GenBank files with Bio.SeqIO
In-Reply-To: <bug-2294-42@http.bugzilla.open-bio.org/>
Message-ID: <201002011335.o11DZBcJ029190@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2294


------- Comment #17 from biopython-bugzilla at maubp.freeserve.co.uk  2010-02-01 08:35 EST -------
(In reply to comment #16)
> 
> > * Writing references
> 
> Not done yet, but for my personal needs this is low priority.

Reference output in GenBank format from SeqIO just committed on github,
http://github.com/biopython/biopython/commit/42707bda738d0239a9ff85a39c39c89c8024549d

> > * Extending to cover writing EMBL files
> 
> Not done yet, but should be comparatively straightforward. Let's track this
> possible enhancement on a separate bug.

EMBL output in SeqIO was done a while ago and was included in Biopython 1.52
(although we don't yet write references in EMBL output).

Things still to do on GenBank output include better handling of the LOCUS
line, such as the data division. See also Bug 2578 for the molecule type.




From bugzilla-daemon at portal.open-bio.org  Mon Feb  1 14:43:41 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 1 Feb 2010 09:43:41 -0500
Subject: [Biopython-dev] [Bug 2294] Writing GenBank files with Bio.SeqIO
In-Reply-To: <bug-2294-42@http.bugzilla.open-bio.org/>
Message-ID: <201002011443.o11EhfAT031724@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2294


------- Comment #18 from biopython-bugzilla at maubp.freeserve.co.uk  2010-02-01 09:43 EST -------
(In reply to comment #17)
> 
> EMBL output in SeqIO was done a while ago and was included in Biopython 1.52
> (although we don't yet write references in EMBL output).

References in EMBL output implemented now:
http://github.com/biopython/biopython/commit/370e02053a45aec6209bd826aebab7bfc29d7e84




From biopython at maubp.freeserve.co.uk  Tue Feb  2 18:37:25 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 2 Feb 2010 18:37:25 +0000
Subject: [Biopython-dev] Getting raw unparsed records with SeqIO?
Message-ID: <320fb6e01002021037p75d24f43w84837c426b057093@mail.gmail.com>

Hi all,

Over on enhancement Bug 3000, Martin was asking about
getting raw unparsed strings for each record in a sequence file:
http://bugzilla.open-bio.org/show_bug.cgi?id=3000

This makes sense for sequential files like FASTA and GenBank,
but not for interlaced files like PHYLIP, and has less obvious
uses when there is any kind of header or footer (e.g. XML or
SFF files).

The particular example Martin gave was selecting a subset of
records in a large GenBank file (I've done this myself in the past).
While this can be done via Bio.SeqIO, the process of parsing
the data into a SeqRecord and saving it again is lossy, although
there is room for improvement. For this particular example, I
suggested Martin use the "old" iterator class in Bio.GenBank.

In general things like white space and wrapping mean that a
SeqIO parse/write cannot guarantee a 100% unaltered round
trip, and will also be slower than using the raw record as a string.

Martin suggested adding an optional argument to the parse
function. I'm not sure this is a good API choice, as it would
dramatically alter the return values. Perhaps we could have
a new iterator function in Bio.SeqIO for suitable sequential
files only which returns a series of strings, one for each
record, unmodified?

Either way I don't see how this would be used - surely
the user would need to do some basic analysis of each
raw record to decide how to process it? In this example,
they would need to extract the ID/accession to see if they
want to output the record or not. While parsing the record
into a SeqRecord may not be needed, in most cases the
record identifier would be very useful - and this has some
big overlaps with the Bio.SeqIO.index() code which already
breaks up files into records and extracts their identifiers.

i.e. A top level Bio.SeqIO function to iterate over a file
returning tuples of the record identifier and the raw
record as strings *could* be useful. Implementing this
nicely would mean re-factoring Bio.SeqIO.index()
extensively.
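
For a simple sequential format such an iterator could be sketched as
below. This is a FASTA-only toy with a hypothetical name,
iter_raw_fasta; it is not an existing Bio.SeqIO function and says
nothing about how GenBank or other formats would be handled:

```python
import io


def iter_raw_fasta(handle):
    """Yield (identifier, raw_record) tuples from a FASTA text handle."""
    identifier = None
    lines = []
    for line in handle:
        if line.startswith(">"):
            # A new header ends the previous record, if there was one.
            if identifier is not None:
                yield identifier, "".join(lines)
            # Use the first word after '>' as the identifier.
            words = line[1:].split(None, 1)
            identifier = words[0] if words else ""
            lines = [line]
        elif identifier is not None:
            lines.append(line)
    if identifier is not None:
        yield identifier, "".join(lines)


example = io.StringIO(">alpha first record\nACGT\nTTGA\n>beta\nGGCC\n")
records = list(iter_raw_fasta(example))
```

Each raw string is passed through unmodified, so a caller can pick
records by identifier and write them straight back out.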

Another solution to this task (extracting the raw GenBank
records from a large file) would seem to be to extend the
Bio.SeqIO.index functionality. The patch I'm about to
attach to Bug 3000 adds a new "get_raw" method to the
dictionary like object we return. Unlike the __getitem__
and get methods which return a SeqRecord this just gives
the raw string.

Note that I haven't implemented this for all the file formats
supported by the index function yet, and this has had only very
basic testing. Writing this email took longer than writing the
code. However, I hope it illustrates the idea enough for a
discussion. As an example of how the index function could
be used with this patch:

>>> from Bio import SeqIO
>>> data = SeqIO.index("cor6_6.gb", "gb")
>>> data.keys()
['L31939.1', 'AJ237582.1', 'X62281.1', 'AF297471.1', 'X55053.1', 'M81224.1']
>>> print data.get_raw("X62281.1")
LOCUS       ATKIN2        880 bp    DNA             PLN       23-JUL-1992
DEFINITION  A.thaliana kin2 gene.
ACCESSION   X62281
...
//
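
For anyone wondering how this can work without parsing, here is
a toy sketch of an offset based index. This is not the code in the
patch; the class name RawIndex is made up, and as a simplification
it keys records on the ACCESSION line rather than the versioned
identifier shown above.

```python
import io


class RawIndex(object):
    """Toy index mapping record keys to (offset, length) in a handle.

    Hypothetical illustration only. Expects a handle opened in
    binary mode containing GenBank-style records ending with "//".
    """

    def __init__(self, handle):
        self._handle = handle
        self._offsets = {}
        start = handle.tell()
        key = None
        while True:
            line = handle.readline()
            if not line:
                break
            if key is None and line.startswith(b"ACCESSION"):
                # Simplification: use the first accession as the key
                key = line.split()[1].decode("ascii")
            if line.startswith(b"//"):
                end = handle.tell()
                if key is not None:
                    self._offsets[key] = (start, end - start)
                start = end
                key = None

    def keys(self):
        return list(self._offsets)

    def get_raw(self, key):
        """Return the untouched bytes of one record."""
        start, length = self._offsets[key]
        self._handle.seek(start)
        return self._handle.read(length)


data = (b"LOCUS       A\nACCESSION   X1\nORIGIN\n//\n"
        b"LOCUS       B\nACCESSION   X2\n//\n")
index = RawIndex(io.BytesIO(data))
raw_record = index.get_raw("X2")
```

Only the offsets are held in memory, so this should scale to large
files in the same way the existing index code does.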

What are people's thoughts on this?

Peter


From bugzilla-daemon at portal.open-bio.org  Tue Feb  2 18:40:07 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 2 Feb 2010 13:40:07 -0500
Subject: [Biopython-dev] [Bug 3000] Could SeqIO.parse() store the whole,
	unparsed multiline entry?
In-Reply-To: <bug-3000-42@http.bugzilla.open-bio.org/>
Message-ID: <201002021840.o12Ie7pO015898@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3000


------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk  2010-02-02 13:40 EST -------
Created an attachment (id=1436)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=1436&action=view)
Adds a get_raw method to the dictionaries returned by Bio.SeqIO.index()

Outline implementation of an alternative proposal, allowing access to the
raw text for each record via the Bio.SeqIO.index() dictionary like objects.
See discussion here:
http://lists.open-bio.org/pipermail/biopython-dev/2010-February/007301.html


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From krother at rubor.de  Wed Feb  3 10:29:04 2010
From: krother at rubor.de (Kristian Rother)
Date: Wed, 3 Feb 2010 11:29:04 +0100
Subject: [Biopython-dev] report: what happens on 'from Bio import PDB'?
In-Reply-To: <201002021840.o12Ie7pO015898@portal.open-bio.org>
References: <201002021840.o12Ie7pO015898@portal.open-bio.org>
Message-ID: <18fbb8f40f6ec6efe3d5dffff68aaa57-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlhfRVlcWgFbWw==-webmailer2@server03.webmailer.hosteurope.de>


Hi,

I'm currently checking what my application is using its memory for
(because it uses way too much for non-Biopython related things). However,
as soon as the simple command

from Bio import PDB

is executed, these are the objects that Python has in memory after running
the gc:

1 <class 'codecs.CodecInfo'>
1 <class 'ctypes.PyDLL'>
1 <class 'ctypes._endian._swapped_meta'>
1 <class 'numpy.core.numeric._unspecified'>
1 <class 'numpy.lib._datasource._FileOpeners'>
1 <class 'numpy.lib.index_tricks.CClass'>
1 <class 'numpy.lib.index_tricks.RClass'>
1 <class 'numpy.ma.core.MaskedArray'>
1 <class 'numpy.ma.core._maximum_operation'>
1 <class 'numpy.ma.core._minimum_operation'>
1 <class 'numpy.ma.extras.mr_class'>
1 <class 'random.Random'>
1 <class 'site._Helper'>
1 <class 'string._TemplateMetaclass'>
1 <class 'unittest.TestLoader'>
1 <type 'NoneType'>
1 <type 'NotImplementedType'>
1 <type '_ctypes.ArrayType'>
1 <type '_ctypes.StructType'>
1 <type '_ctypes.UnionType'>
1 <type 'ellipsis'>
1 <type 'exceptions.MemoryError'>
1 <type 'exceptions.RuntimeError'>
1 <type 'mtrand.RandomState'>
1 <type 'numpy.ndarray'>
1 <type 'sys.flags'>
1 <type 'sys.floatinfo'>
1 <type 'thread.lock'>
1 <type 'unicode'>
1 from module:Bio.GenBank.utils
1 from module:Bio.PDB.PDBIO
1 from module:Bio.PropertyManager
1 from module:os
2 <class 'Bio.PropertyManager.CreateDict'>
2 <class 'ctypes.LibraryLoader'>
2 <class 'numpy.lib.index_tricks.IndexExpression'>
2 <class 'numpy.lib.index_tricks.nd_grid'>
2 <class 'site.Quitter'>
2 <type 'bool'>
2 <type 'numpy.bool_'>
2 <type 'numpy.float64'>
2 <type 'object'>
2 from module:Bio.GenBank.LocationParser
2 from module:xml.sax.handler
3 <class 'site._Printer'>
3 <type '_ctypes.PointerType'>
3 <type 'file'>
3 <type 'frame'>
4 <type 'PyCObject'>
5 <class 'ctypes.CFunctionType'>
6 <class 'numpy.core.numerictypes._typedict'>
6 <type 'complex'>
6 from module:numpy.ma.extras
7 <type 'imp.NullImporter'>
7 from module:Bio.Alphabet.IUPAC
7 from module:__future__
8 <type '_ctypes.CFuncPtrType'>
9 <class 'numpy.ma.core._arraymethod'>
10 <type 'classmethod_descriptor'>
13 <type 'Struct'>
14 <type 'frozenset'>
15 <class 'numpy.testing.nosetester.NoseTester'>
16 <class 'abc.ABCMeta'>
16 <type 'staticmethod'>
19 <type 'classmethod'>
27 <type '_ctypes.SimpleType'>
35 <type 'cell'>
35 <type 'operator.itemgetter'>
36 <type 'StgDict'>
38 <type '_sre.SRE_Pattern'>
49 <type 'set'>
56 <type 'long'>
56 from module:Bio.Alphabet
68 <type 'instancemethod'>
76 <type 'numpy.ufunc'>
91 <type 'property'>
95 from module:numpy.ma.core
201 <type 'member_descriptor'>
203 <type 'classobj'>
225 <type 'module'>
350 <type 'type'>
351 from module:Bio.Data.CodonTable
360 <type 'int'>
385 <type 'float'>
393 <type 'getset_descriptor'>
407 <type 'weakref'>
579 <type 'method_descriptor'>
837 <type 'builtin_function_or_method'>
1365 <type 'wrapper_descriptor'>
2073 <type 'dict'>
3191 <type 'function'>
3289 <type 'code'>
4099 <type 'list'>
11989 <type 'tuple'>
19718 <type 'str'>
total 50912
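
In case someone wants to repeat this, a tally like the one above
can be approximated with the standard gc module. This is only a
sketch and not necessarily how the list above was produced: the
function name is made up, and gc.get_objects() reports only the
container objects the collector tracks, so plain strings and
numbers will not show up.

```python
import gc
from collections import Counter


def count_live_objects():
    """Count gc-tracked objects by type (hypothetical helper)."""
    gc.collect()
    counts = Counter()
    for obj in gc.get_objects():
        kind = type(obj)
        counts["%s.%s" % (kind.__module__, kind.__name__)] += 1
    return counts


if __name__ == "__main__":
    before = count_live_objects()
    import json  # stand-in for the module under investigation
    after = count_live_objects()
    # Counter subtraction keeps only the types whose count grew
    growth = after - before
    for name, number in sorted(growth.items(), key=lambda item: item[1]):
        print("%i %s" % (number, name))
    print("total %i" % sum(growth.values()))
```

Comparing a snapshot taken before the import with one taken after
shows what the import itself added.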

Hope this is useful ;-)

Best Regards,
   Kristian


From lplp90 at gmail.com  Wed Feb  3 11:35:49 2010
From: lplp90 at gmail.com (Laura Padioleu)
Date: Wed, 3 Feb 2010 12:35:49 +0100
Subject: [Biopython-dev]  Multiple alignment - Clustalw etc...
Message-ID: <771202501002030335r52f62fccta1b976f9e586d435@mail.gmail.com>

On Mon, Mar 30, 2009 at 12:42 PM, Cymon Cox <cy at cymon.org
<http://lists.open-bio.org/mailman/listinfo/biopython-dev>> wrote:
>
> Hi Folks,
>
> this is a demo that I use to create and then align my FASTA sequences
> using clustalw. Hope it helps.
> Here's the code:
>
>def clustal(list_struc):
>
>
>	hash_table={}
>	for i in range (len(list_struc)):
>		for j in range (i+1,len(list_struc)):
>			pair=(list_struc[i],list_struc[j])
>			hash_table[pair]=0
>
>
>	for pair in hash_table.keys():
>		fasta_fic=open("fasta.fasta",'w')
>		for ID in pair:
>			fasta_fic.write(">"+ID.get_id()+'\n')
>
>			# recuperation des sequences des acides amines
>			for chain in ID.get_chains():
>       			ppb = PPBuilder()
>
>        			pp = ppb.build_peptides(chain)
>				# l'ajout des sequences aux fichiers fasta
>				fasta_fic.write(pp[0].get_sequence().tostring())
>			fasta_fic.write('\n')
>		fasta_fic.close()
>		cline = ClustalwCommandline(cmd="clustalw", infile="file.fasta")
>		return_code = subprocess.call(str(cline), shell=(sys.platform!="win32"))
>		
>		alignment = AlignIO.read(open("file"+str(nb)+".aln"),"clustal")
>
>		
>		j=0
>		i=0
>		for record in alignment:
>   			for amino_acid in record.seq:
>       			if amino_acid == '-':
>          				pass
>        			else:
>            				if amino_acid == alignment[0].seq[j]:
>                				i += 1
>        			j += 1
>    			j = 0
>    			
>    			seq = str(record.seq)
>    			gap_strip = seq.replace('-', '')
>    			percent = 100.0*i/len(seq)
>    			
>    			i=0
>			hash_table[pair]=str(percent)+"\t"+str(percent2)
>
>		
>	return hash_table
>
>def csv_writer(list_struc):
>	hash_table=clustal(list_struc)
>	csv_fic=open("file.csv",'a')
>	for couple in hash_table.keys():
>		csv_fic.write(pari[0].get_id()+"\t"+str(hash_table[pair])+'\n')
>	csv_fic.close()


Hello,

I'm using Python version 2.5 but I can't get this code to run correctly.
What versions of Python and Biopython are you using?
Thanks


From chapmanb at 50mail.com  Wed Feb  3 12:46:48 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 3 Feb 2010 07:46:48 -0500
Subject: [Biopython-dev] Multiple alignment - Clustalw etc...
In-Reply-To: <771202501002030335r52f62fccta1b976f9e586d435@mail.gmail.com>
References: <771202501002030335r52f62fccta1b976f9e586d435@mail.gmail.com>
Message-ID: <20100203124648.GC40046@sobchak.mgh.harvard.edu>

Hi Laura;

[clustalw example from Cymon]

> im using python version 2.5 but i can't compile this code correctly
> what version of python and biopython you are using ?

We could help more with some additional information. Could you copy
and paste the error message you are seeing?

Brad


From cy at cymon.org  Wed Feb  3 12:48:49 2010
From: cy at cymon.org (Cymon Cox)
Date: Wed, 3 Feb 2010 12:48:49 +0000
Subject: [Biopython-dev]  Multiple alignment - Clustalw etc...
In-Reply-To: <7265d4f1002030412l1258237jf50ff37845e7c5a5@mail.gmail.com>
References: <771202501002030335r52f62fccta1b976f9e586d435@mail.gmail.com> 
	<7265d4f1002030412l1258237jf50ff37845e7c5a5@mail.gmail.com>
Message-ID: <7265d4f1002030448n28065ea1ifc411cf0c7b462e8@mail.gmail.com>

---------- Forwarded message ----------
From: Cymon Cox <cy at cymon.org>
Date: 3 February 2010 12:12
Subject: Re: [Biopython-dev] Multiple alignment - Clustalw etc...
To: Laura Padioleu <lplp90 at gmail.com>


Hi Laura,

On 3 February 2010 11:35, Laura Padioleu <lplp90 at gmail.com> wrote:

> On Mon, Mar 30, 2009 at 12:42 PM, Cymon Cox <cy at cymon.org
> <http://lists.open-bio.org/mailman/listinfo/biopython-dev>> wrote:
> >*
> *>* Hi Folks,
>

Yes, I did write that...

<snip a bunch of code that I didn't write, nor can attribute to anyone...>

> Hello,
>
> im using python version 2.5 but i can't compile this code correctly
> what version of python and biopython you are using ?
>


How exactly are you using this code? What error do you get? Can you cut and
paste a session from the terminal?

Cheers, C.

--


From chapmanb at 50mail.com  Wed Feb  3 12:55:52 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 3 Feb 2010 07:55:52 -0500
Subject: [Biopython-dev] Getting raw unparsed records with SeqIO?
In-Reply-To: <320fb6e01002021037p75d24f43w84837c426b057093@mail.gmail.com>
References: <320fb6e01002021037p75d24f43w84837c426b057093@mail.gmail.com>
Message-ID: <20100203125552.GD40046@sobchak.mgh.harvard.edu>

Hi Peter;

> Another solution to this task (extracting the raw GenBank
> records from a large file) would seem to be to extend the
> Bio.SeqIO.index functionality. The patch I'm about to
> attach to Bug 3000 adds a new "get_raw" method to the
> dictionary like object we return. Unlike the __getitem__
> and get methods which return a SeqRecord this just gives
> the raw string.
[...]
> >>> from Bio import SeqIO
> >>> data = SeqIO.index("cor6_6.gb", "gb")
> >>> data.keys()
> ['L31939.1', 'AJ237582.1', 'X62281.1', 'AF297471.1', 'X55053.1', 'M81224.1']
> >>> print data.get_raw("X62281.1")
> LOCUS       ATKIN2        880 bp    DNA             PLN       23-JUL-1992
> DEFINITION  A.thaliana kin2 gene.
> ACCESSION   X62281
> ...
> //
> 
> What are people's thoughts on this?

Not much to add, but a +1 from me. This sounds like a solid solution
and makes sense for the use case I can think of, which is picking
out records of interest from a large file and re-writing them in a
smaller file.

Brad


From bugzilla-daemon at portal.open-bio.org  Wed Feb  3 21:44:14 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 3 Feb 2010 16:44:14 -0500
Subject: [Biopython-dev] [Bug 1999] new frame translation method
In-Reply-To: <bug-1999-42@http.bugzilla.open-bio.org/>
Message-ID: <201002032144.o13LiERA027299@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1999


------- Comment #3 from eric.talevich at gmail.com  2010-02-03 16:44 EST -------
Can we split this into two functions? I tried this function today, hoping it
would help me get a list of ORFs from a big contig -- but both
frameTranslations and six_frame_translation do two things without stopping in
between:

1. Translate the DNA or RNA sequence to amino acids in all six frames
2. Pretty-print the six-frame translation


So, how about factoring out just this piece (or similar):

def translate_six_frames(seq, genetic_code=1):
    """Dictionary of 6-frame translations."""
    anti = seq.reverse_complement()
    frames = {}
    for i in range(0,3):
        frames[i+1]  = seq[i:].translate(genetic_code)
        frames[-i-1] = SeqUtils.reverse(anti[i:].translate(genetic_code))
    return frames


Then either pretty-printer can call this internally, and the user also has
access to the individual translated sequences.
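
For illustration, here is a self-contained sketch of the same idea
using plain strings instead of Seq objects, so it runs without
Biopython. The assumptions are mine: only the standard genetic
code, only unambiguous DNA letters, and minus frames labelled by
their offset into the reverse complement (read 5' to 3', not
reversed back).

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = dict(
    ("".join(codon), amino_acid)
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only, dropping any trailing bases."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def translate_six_frames(dna):
    """Dictionary of 6-frame translations keyed 1, 2, 3, -1, -2, -3."""
    dna = dna.upper()
    anti = reverse_complement(dna)
    frames = {}
    for i in range(3):
        frames[i + 1] = translate(dna[i:])
        frames[-(i + 1)] = translate(anti[i:])
    return frames


frames = translate_six_frames("ATGGCCTAA")
```

A pretty-printer could then be a thin layer over the returned
dictionary, and an ORF search could split each value on "*".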


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From biopython at maubp.freeserve.co.uk  Wed Feb  3 23:13:10 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 3 Feb 2010 23:13:10 +0000
Subject: [Biopython-dev] [Bug 3000] Could SeqIO.parse() store the whole,
	unparsed 	multiline entry?
In-Reply-To: <4B6995D0.3030405@fold.natur.cuni.cz>
References: <201002021840.o12Ie88i015906@portal.open-bio.org>
	<4B6995D0.3030405@fold.natur.cuni.cz>
Message-ID: <320fb6e01002031513r1faac5faicf027daf5da77d80@mail.gmail.com>

On Wed, Feb 3, 2010 at 3:27 PM, Martin MOKREJŠ
<mmokrejs at fold.natur.cuni.cz> wrote:
>
> Hi Peter,
>  thank you very much for all your efforts. I will try to get to testing the cvs
> code in few days. Definitely will keep you updated. ;)
> Martin
>
> bugzilla-daemon at portal.open-bio.org wrote:
>> http://bugzilla.open-bio.org/show_bug.cgi?id=3000
>> ...

The patch hasn't been checked in, but should apply to either the
master branch in github or (I expect) Biopython 1.53

I'm looking forward to feedback.

Peter


From bugzilla-daemon at portal.open-bio.org  Thu Feb  4 15:20:51 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 4 Feb 2010 10:20:51 -0500
Subject: [Biopython-dev] [Bug 1999] new frame translation method
In-Reply-To: <bug-1999-42@http.bugzilla.open-bio.org/>
Message-ID: <201002041520.o14FKp9j000360@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1999


------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk  2010-02-04 10:20 EST -------
(In reply to comment #3)
> Can we split this into two functions? I tried this function today, hoping it
> would help me get a list of ORFs from a big contig -- but both
> frameTranslations and six_frame_translation do two things without stopping in
> between:
> 
> 1. Translate the DNA or RNA sequence to amino acids in all six frames

I'd wondered about this - possibly as a generator/iterator which always gives
back exactly six sequences - but don't really see much point. There is also
going to be some debate about how frames are labelled (especially the minus
frames).

> 2. Pretty-print the six-frame translation

Personally I don't see this as being very useful, but someone must like it.
I lean to just deprecating and removing this code.

> So, how about factoring out just this piece (or similar):
> 
> def translate_six_frames(seq, genetic_code=1):
>     """Dictionary of 6-frame translations."""
>     anti = seq.reverse_complement()
>     frames = {}
>     for i in range(0,3):
>         frames[i+1]  = seq[i:].translate(genetic_code)
>         frames[-i-1] = SeqUtils.reverse(anti[i:].translate(genetic_code))
>     return frames

You should be taking the reverse complement, not just the reverse. This
would just be seq[i:].reverse_complement() or seq.reverse_complement()[i:]
depending on how you label the reverse frames. 

Peter


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From biopython at maubp.freeserve.co.uk  Thu Feb  4 15:30:47 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 4 Feb 2010 15:30:47 +0000
Subject: [Biopython-dev] Getting raw unparsed records with SeqIO?
In-Reply-To: <20100203125552.GD40046@sobchak.mgh.harvard.edu>
References: <320fb6e01002021037p75d24f43w84837c426b057093@mail.gmail.com>
	<20100203125552.GD40046@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e01002040730q7835e1a1uc784dfaae5faaef2@mail.gmail.com>

On Wed, Feb 3, 2010 at 12:55 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
>
> Not much to add, but a +1 from me. This sounds like a solid solution
> and makes sense for the use case I can think of, which is picking
> out records of interest from a large file and re-writing them in a
> smaller file.
>

Let's give Martin a chance to test with the patch, and see how he gets on.

I'm curious if anyone can come up with other examples of how this could
be applied, which would help justify adding it to Bio.SeqIO.

Peter


From bugzilla-daemon at portal.open-bio.org  Mon Feb  8 17:08:33 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 8 Feb 2010 12:08:33 -0500
Subject: [Biopython-dev] [Bug 3006] New: esearch medline fails with xml
	format
Message-ID: <bug-3006-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=3006

           Summary: esearch medline fails with xml format
           Product: Biopython
           Version: Not Applicable
          Platform: PC
        OS/Version: Windows XP
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: georg.lipps at fhnw.ch


I used to retrieve PubMed records with Python 2.5.1; however, lately efetch
with XML output produces an error. The problem arose around the change of year
and may be related to the DTD definition file.

Here is a short script which reproduces the error:


from Bio import Entrez
from Bio import Medline


def retrieve_medline(doi):
    # Uses the doi to obtain the medline id and then retrieves the medline entry
    # Returns the medline entry as text and as a parsed Python object,
    # or (None, None) if the search does not give exactly one hit
    print "...querying medline with DOI", doi
    handle = Entrez.esearch(db="pubmed", term=doi, retmode="XML")
    record=Entrez.read(handle)
    if record["Count"]<>"1":
        return None, None
    handle=Entrez.efetch(db="pubmed", id=record["IdList"], retmode="text",
rettype="medline")
    xml=Entrez.efetch(db="pubmed", id=record["IdList"], retmode="XML",
rettype="medline")
    return handle.read(), Entrez.read(xml)


doi='10.1038/nature07389'
article, xml=retrieve_medline(doi)
print article


OUTPUT:

Traceback (most recent call last):
  File "U:/Literatur/pdf to RM converter/test.py", line 24, in <module>
    article, xml=retrieve_medline(doi)
  File "U:/Literatur/pdf to RM converter/test.py", line 15, in retrieve_medline
    return handle.read(), Entrez.read(xml)
  File "C:\Program Files\python25\lib\site-packages\Bio\Entrez\__init__.py", line 283, in read
    record = handler.run(handle)
  File "C:\Program Files\python25\lib\site-packages\Bio\Entrez\Parser.py", line 95, in run
    self.parser.ParseFile(handle)
  File "C:\Program Files\python25\lib\site-packages\Bio\Entrez\Parser.py", line 131, in startElement
    return
UnboundLocalError: local variable 'object' referenced before assignment


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at portal.open-bio.org  Mon Feb  8 23:26:38 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 8 Feb 2010 18:26:38 -0500
Subject: [Biopython-dev] [Bug 3006] esearch medline fails with xml format
In-Reply-To: <bug-3006-42@http.bugzilla.open-bio.org/>
Message-ID: <201002082326.o18NQcwP006902@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3006


------- Comment #1 from mdehoon at ims.u-tokyo.ac.jp  2010-02-08 18:26 EST -------
I was not able to replicate this bug. Your example code ran correctly with
Python 2.6, Biopython 1.53. Are you using the latest version of Biopython?


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From sandford at ufl.edu  Mon Feb  8 21:49:20 2010
From: sandford at ufl.edu (Michael Sandford)
Date: Mon, 08 Feb 2010 16:49:20 -0500
Subject: [Biopython-dev] Where should feature intersection code go?
Message-ID: <4B7086E0.1090501@ufl.edu>

I'm working on a project that's looking for alternative splicing using 
solexa data instead of microarray data.  Basically we've got a GFF file 
containing all the genes, introns and exons and 35M reads that have been 
placed into one of the various chromosomes via the excellent bowtie 
application out of Maryland.

Bowtie output is documented here:
http://bowtie-bio.sourceforge.net/manual.shtml#default-bowtie-output

In summary it's roughly a cross between fastq and GFF.  It's got the 
read name, strand, sequence the read aligned to, position, sequence, 
quality, and a few others.  It seems like it could rather easily be 
coerced into a SeqRecord 
(http://biopython.org/DIST/docs/api/Bio.SeqRecord.SeqRecord-class.html).  
It might not get filled up completely, but it'd be better than handling 
things in a one-off way.

The FeatureLocation class provides for approximate and exact locations 
(both start and stop positions).  It seems like the correct location to 
put code that determines if two FeatureLocations overlap, or if one 
contains another, or is contained by another. 
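
Roughly what I have in mind for the exact case is sketched below.
The Span tuple and the function names are placeholders, not existing
Biopython API, and I am assuming 0-based, half-open coordinates
(start included, end excluded). Approximate positions would need
extra rules on top of this.

```python
from collections import namedtuple

# Placeholder for an exact feature location, half-open [start, end)
Span = namedtuple("Span", ["start", "end"])


def overlaps(a, b):
    """True if the two spans share at least one position."""
    return a.start < b.end and b.start < a.end


def contains(outer, inner):
    """True if inner lies entirely within outer."""
    return outer.start <= inner.start and inner.end <= outer.end


exon = Span(100, 200)
read_across_boundary = Span(180, 215)   # a 35 bp read
read_inside = Span(120, 155)

hits_exon = overlaps(exon, read_across_boundary)
fully_inside = contains(exon, read_inside)
```

With half-open coordinates, two spans that merely touch (one ending
where the other starts) do not count as overlapping.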

Overall I'm talking about writing a bowtie .map parser and the 
comparison code for FeatureLocation.  Would these be welcome features?

Thanks,
Mike


From chapmanb at 50mail.com  Tue Feb  9 01:04:25 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 8 Feb 2010 20:04:25 -0500
Subject: [Biopython-dev] Where should feature intersection code go?
In-Reply-To: <4B7086E0.1090501@ufl.edu>
References: <4B7086E0.1090501@ufl.edu>
Message-ID: <20100209010425.GD2193@kunkel>

Mike;

> I'm working on a project that's looking for alternative splicing
> using solexa data instead of microarray data.  Basically we've got a
> GFF file containing all the genes, introns and exons and 35M reads
> that have been placed into one of the various chromosomes via the
> excellent bowtie application out of Maryland.

[...]

> Overall I'm talking about writing a bowtie .map parser and the
> comparison code for FeatureLocation.  Would these be welcome
> features?

A .map parser would definitely be useful. Another suggestion is to
get Bowtie to produce SAM format and use Pysam for parsing:

http://code.google.com/p/pysam/

The advantage of SAM is that it's an emerging standard and a lot of
downstream applications can use it. This way you can switch aligners
in your workflow without much disruption.

For doing feature overlaps, IntervalTree in bx-python is excellent:

http://bitbucket.org/james_taylor/bx-python/wiki/Home
http://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/intervals/intersection.pyx

See the doc string of the IntervalTree class for how to use it. My
normal workflow is to build an IntervalTree with the GFF features of
your genome, and then loop through the alignment file finding
features that each alignment intersects.
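
If you want to try that workflow before installing bx-python, the
sketch below shows the same two steps with a simple binning scheme
from the standard library. To be clear, this is not IntervalTree
and not its API; the function names and the bin size are made up,
and coordinates are assumed to be half-open with end > start.

```python
from collections import defaultdict

BIN_SIZE = 10000  # arbitrary choice for this sketch


def build_bins(features):
    """Index (start, end, name) features by every bin they touch."""
    bins = defaultdict(list)
    for start, end, name in features:
        for number in range(start // BIN_SIZE, (end - 1) // BIN_SIZE + 1):
            bins[number].append((start, end, name))
    return bins


def find_overlaps(bins, start, end):
    """Return the sorted features overlapping the span [start, end)."""
    hits = set()
    for number in range(start // BIN_SIZE, (end - 1) // BIN_SIZE + 1):
        for f_start, f_end, name in bins.get(number, ()):
            if f_start < end and start < f_end:
                hits.add((f_start, f_end, name))
    return sorted(hits)


# Step 1: index the GFF features for one chromosome
features = [(100, 200, "exon1"), (9990, 10050, "exon2"),
            (30000, 30100, "exon3")]
bins = build_bins(features)

# Step 2: loop through the alignments and look up each one
alignments = [(150, 185), (10040, 10075), (200, 235)]
results = [find_overlaps(bins, start, end) for start, end in alignments]
```

In practice you would keep one index per chromosome, and a proper
interval tree will cope better with very long features.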

For alternative splicing, are you using the raw genome or a built
transcriptome for all possible combinations of exons? One practical
thing to consider is that a read will not be aligned to the genome
if it splits an exon/exon junction.

Hope this helps,
Brad


From bugzilla-daemon at portal.open-bio.org  Wed Feb 10 01:42:28 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 9 Feb 2010 20:42:28 -0500
Subject: [Biopython-dev] [Bug 2294] Writing GenBank files with Bio.SeqIO
In-Reply-To: <bug-2294-42@http.bugzilla.open-bio.org/>
Message-ID: <201002100142.o1A1gSJJ022517@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2294


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #19 from biopython-bugzilla at maubp.freeserve.co.uk  2010-02-09 20:42 EST -------
(In reply to comment #17)
> 
> Things still to do on GenBank output include better handling of the LOCUS
> line, such as the data division. See also Bug 2578 for the molecule type.
> 

I've added mappings for some EMBL divisions to suitable GenBank divisions.

I'm closing this bug now, as GenBank output does basically work now.


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From rjalves at igc.gulbenkian.pt  Wed Feb 10 18:30:05 2010
From: rjalves at igc.gulbenkian.pt (Renato Alves)
Date: Wed, 10 Feb 2010 18:30:05 +0000
Subject: [Biopython-dev] KEGG support
Message-ID: <4B72FB2D.4070808@igc.gulbenkian.pt>

Hi everyone,

KEGG support in Biopython has been mostly untouched for the past 8 years
with only a few changes and test additions. There is code in the tree to
work with the Enzyme and Compound databases but not for others such as
GENES, ORTHOLOGY, DRUG, ...

Considering the fact that I will need to write some code to work with
other formats I was planning to contribute and integrate it with the
SeqIO interface. This will require some additional homework on my part.

KEGG also has a SOAP based API [1]. Its functionality is in some
respects comparable to NCBI eutils. Using the Python SOAP library suds [2]
I had no problem interacting with it.

So just in case someone is already working on this secretly :) I would
like to know, to make my life easier. If not, I would also like to know if
you would be interested in the addition, and finally what your thoughts are
about the SOAP interface and the (optional) suds dependency.

Just a word on suds. Even though the project has been around for a few
years now, it's still not available in most Linux distros. In my
personal experience it's probably the simplest and easiest-to-use
SOAP library for Python out there.

Cheers,
Renato

[1] - http://www.genome.jp/kegg/soap/doc/keggapi_manual.html
[2] - https://fedorahosted.org/suds/


From kellrott at gmail.com  Wed Feb 10 20:12:10 2010
From: kellrott at gmail.com (Kyle)
Date: Wed, 10 Feb 2010 12:12:10 -0800
Subject: [Biopython-dev] KEGG support
In-Reply-To: <4B72FB2D.4070808@igc.gulbenkian.pt>
References: <4B72FB2D.4070808@igc.gulbenkian.pt>
Message-ID: <bb02be081002101212m381f1263r1922d145f61e26fa@mail.gmail.com>

I think external library dependencies should be avoided unless necessary.
Would a tool like wsdl2py produce code that isn't dependent on an installed
library? Alternatively, suds is LGPL licensed; could we just cannibalize the
source code for the important classes?

Kyle


On Wed, Feb 10, 2010 at 10:30 AM, Renato Alves <rjalves at igc.gulbenkian.pt> wrote:

> Hi everyone,
>
> KEGG support in Biopython has been mostly untouched for the past 8 years
> with only a few changes and test additions. There is code in the tree to
> work with the Enzyme and Compound databases but not for others such as
> GENES, ORTHOLOGY, DRUG, ...
>
> Considering the fact that I will need to write some code to work with
> other formats I was planning to contribute and integrate it with the
> SeqIO interface. This will require some additional homework on my part.
>
> KEGG also has a SOAP based API [1]. It's functionality could be in some
> aspects compared to NCBI eutils. Using the python SOAP library suds [2]
> I had no problem interacting with it.
>
> So just in case someone was already working on this secretly :) I would
> like to know to make my life easier. If not I would also like to know if
> you would be interested in the addition and finally what's your thought
> about the SOAP interface and the suds (optional) dependency.
>
> Just a word on suds. Even though the project has been around for a few
> years now, it's still not available in most Linux distros. On my
> personal experience with it it's probably the simplest and easy to use
> SOAP library for python out there.
>
> Cheers,
> Renato
>
> [1] - http://www.genome.jp/kegg/soap/doc/keggapi_manual.html
> [2] - https://fedorahosted.org/suds/
>


From dalloliogm at gmail.com  Wed Feb 10 22:13:04 2010
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Wed, 10 Feb 2010 23:13:04 +0100
Subject: [Biopython-dev] KEGG support
In-Reply-To: <4B72FB2D.4070808@igc.gulbenkian.pt>
References: <4B72FB2D.4070808@igc.gulbenkian.pt>
Message-ID: <5aa3b3571002101413o55d04432vc76c230aa9c43252@mail.gmail.com>

On Wed, Feb 10, 2010 at 7:30 PM, Renato Alves <rjalves at igc.gulbenkian.pt>wrote:

> KEGG support in Biopython has been mostly untouched for the past 8 years
> with only a few changes and test additions. There is code in the tree to
> work with the Enzyme and Compound databases but not for others such as
> GENES, ORTHOLOGY, DRUG, ...
>

Hi,
I had a terrible experience parsing KEGG pathway files: in the end I
discovered that the files stored on their FTP server don't correspond
exactly to the diagrams that you can find in the web interface. For
example, biochemical interactions have no directionality in the files, while
if you look at them on kegg/pathway you will see arrows.

Some time ago I proposed to implement something similar to what you have
said for kegg/pathway, but in the end I abandoned the effort, because I had
problems with both suds and SOAPpy, and I wasn't satisfied with the
annotations in KEGG.


>
> Considering the fact that I will need to write some code to work with
> other formats I was planning to contribute and integrate it with the
> SeqIO interface. This will require some additional homework on my part.
>
>
If you are serious about that I may help you, but I can only work on the
weekends and you should tell me exactly what I have to do :-)

> KEGG also has a SOAP based API [1]. It's functionality could be in some
> aspects compared to NCBI eutils. Using the python SOAP library suds [2]
> I had no problem interacting with it.
>


Are you sure? I tried it on KEGG a year ago and I had problems executing
slightly more complex queries. If you look at the suds bug tracker,
you will find some reports by me, like this one:
- https://fedorahosted.org/suds/ticket/213

I remember that I was looping between the KEGG support centre and the suds
bug tracker; both were very responsive to feedback and very keen to answer
me, but in the end they didn't speak to each other and the bug reports that
I have filed are still unfixed.

Which library can you use for the SOAP queries? I had the feeling that
SOAPpy (which I think is included in the standard lib) worked well with
KEGG; however, its development stopped many years ago (
http://sourceforge.net/projects/pywebsvcs/files/SOAP.py/), it is a mess if
you want to use it behind an http_proxy (I should have a patch somewhere if
you are interested) and I am sure it won't be kept compatible with
future versions of Python.

Another alternative may be beautiful soup, but I have never tried it. This
question on stackoverflow may provide you some ideas:
-
http://stackoverflow.com/questions/206154/whats-the-best-soap-client-library-for-python-and-where-is-the-documentation-fo

I am not sure which is the standard SOAP library for Python, or which
one is included in the standard lib. SOAPpy is a bad bet for compatibility
and maintenance in future releases. Suds is the best option, but it is not
in the standard lib, and they still have to fix the bugs I reported a year
ago. I have the feeling that there is no good alternative for Python.

Moreover, the WSDL functions that I have seen for KEGG are not especially
useful. They seem to allow for the basic queries, but for most tasks
it is better to download the FTP files locally and work there.


> So just in case someone was already working on this secretly :) I would
> like to know to make my life easier. If not I would also like to know if
> you would be interested in the addition and finally what's your thought
> about the SOAP interface and the suds (optional) dependency.
>
> Just a word on suds. Even though the project has been around for a few
> years now, it's still not available in most Linux distros. On my
> personal experience with it it's probably the simplest and easy to use
> SOAP library for python out there.
>
> Cheers,
> Renato
>
> [1] - http://www.genome.jp/kegg/soap/doc/keggapi_manual.html
> [2] - https://fedorahosted.org/suds/


-- 
Giovanni Dall'Olio, phd student
Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain)

My blog on bioinformatics: http://bioinfoblog.it


From bugzilla-daemon at portal.open-bio.org  Wed Feb 10 22:16:14 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 10 Feb 2010 17:16:14 -0500
Subject: [Biopython-dev] [Bug 3009] New: Check the FASTA m10 alignment
	parser works with FASTA36
Message-ID: <bug-3009-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=3009

           Summary: Check the FASTA m10 alignment parser works with FASTA36
           Product: Biopython
           Version: Not Applicable
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Unit Tests
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk


Bill Pearson has just announced the release of FASTA36:
http://faculty.virginia.edu/wrpearson/fasta/fasta36/

From his email,
> This version is a major update from FASTA version 35.
> It's main new feature is the ability to report all
> statistically significant alignments between a query
> and library sequence (equivalent to BLAST's multiple
> HSPs).  All previous versions of the FASTA program
> reported only the best alignment between the query
> and library sequence, a serious shortcoming when
> comparing a query protein to a multi-exon gene or
> multi-domain protein.

We need to check the FASTA36 -m 10 output, add this to
our unit tests, and update our parser as required.


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From dalloliogm at gmail.com  Wed Feb 10 22:26:08 2010
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Wed, 10 Feb 2010 23:26:08 +0100
Subject: [Biopython-dev] KEGG support
In-Reply-To: <bb02be081002101212m381f1263r1922d145f61e26fa@mail.gmail.com>
References: <4B72FB2D.4070808@igc.gulbenkian.pt>
	<bb02be081002101212m381f1263r1922d145f61e26fa@mail.gmail.com>
Message-ID: <5aa3b3571002101426p7b57f50aga270f0ea7eb8554f@mail.gmail.com>

On Wed, Feb 10, 2010 at 9:12 PM, Kyle <kellrott at gmail.com> wrote:

> I think external library dependancies should be avoided unless necessary.
>  Would a tool like wsdl2py produce code that isn't dependent on an
> installed
> library? Alternatively, suds is LGPL based, could we just cannibalize the
> source code for the important classes?
>

Honestly I think that the best solution would be to make an external module
to extend the basic biopython and to link it on the biopython's web page.
The core biopython should provide objects and infrastructures for biological
data, but then the additional functionalities should go on separate modules
linked on the biopython's web page, taking inspiration from BioConductor and
installed with easy_install or a derivative.
If we keep the constraint that all biopython modules should have
the same dependencies, then it is impossible to make anything more complex
than the basic stuff, and biopython will never be as useful as it could be.
You can't make a good library for using WSDL services with SOAPpy, or plot
nice graphics without matplotlib, or store data in HDF5 format, and there
are many other examples. Bioinformatics is a very general word, people
working on it have a big variety of needs, and it is difficult to accomplish
it all with few dependencies.

-- 
Giovanni Dall'Olio, phd student
Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain)

My blog on bioinformatics: http://bioinfoblog.it


From biopython at maubp.freeserve.co.uk  Wed Feb 10 22:27:07 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 10 Feb 2010 22:27:07 +0000
Subject: [Biopython-dev] KEGG support
In-Reply-To: <4B72FB2D.4070808@igc.gulbenkian.pt>
References: <4B72FB2D.4070808@igc.gulbenkian.pt>
Message-ID: <320fb6e01002101427o5efc003es3402308e44b2533a@mail.gmail.com>

On Wed, Feb 10, 2010 at 6:30 PM, Renato Alves <rjalves at igc.gulbenkian.pt> wrote:
>
> Hi everyone,
>
> KEGG support in Biopython has been mostly untouched for the past 8 years
> with only a few changes and test additions. There is code in the tree to
> work with the Enzyme and Compound databases but not for others such as
> GENES, ORTHOLOGY, DRUG, ...
>
> Considering the fact that I will need to write some code to work with
> other formats I was planning to contribute and integrate it with the
> SeqIO interface. This will require some additional homework on my part.

Excellent news. Have you looked at the existing KEGG parsers in
Biopython, and do you think the current style is suitable? (I haven't
looked at the code recently myself, but will do).

Regarding the SeqIO interface (for KEGG GENES only?), I would be
happy to advise. Initially I suggest you work on adding a parser much
like the other KEGG parsers, returning gene records. Then we can
add a Bio/SeqIO/KeggGeneIO.py wrapper to turn these into SeqRecord
objects.

> KEGG also has a SOAP based API [1]. It's functionality could be in some
> aspects compared to NCBI eutils. Using the python SOAP library suds [2]
> I had no problem interacting with it.

I have not used SOAP, and have a personal preference for REST style
APIs. However, if that is what KEGG offers, this is worth considering.
I think Brad has some experience with (other) SOAP services in Python.
Note the KEGG documentation suggests using SOAPpy for Python.

Interestingly, KEGG are however looking into providing RDF (and
perhaps one day SPARQL endpoints). I will try and find out what sort
of time scale they have in mind while I am at the BioHackathon 2010
this week - http://hackathon3.dbcls.jp/

For now, I would prioritise the KEGG flat file parsers.

Peter


From biopython at maubp.freeserve.co.uk  Wed Feb 10 22:37:03 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 10 Feb 2010 22:37:03 +0000
Subject: [Biopython-dev] KEGG support
In-Reply-To: <bb02be081002101212m381f1263r1922d145f61e26fa@mail.gmail.com>
References: <4B72FB2D.4070808@igc.gulbenkian.pt>
	<bb02be081002101212m381f1263r1922d145f61e26fa@mail.gmail.com>
Message-ID: <320fb6e01002101437p5f382ce0tb4cd864cec59a6c5@mail.gmail.com>

On Wed, Feb 10, 2010 at 8:12 PM, Kyle <kellrott at gmail.com> wrote:
> I think external library dependancies should be avoided unless necessary.
>  Would a tool like wsdl2py produce code that isn't dependent on an installed
> library? Alternatively, suds is LGPL based, could we just cannibalize the
> source code for the important classes?

Working with SOAP is so complicated that using an external library
would be the sensible option. It would be an optional dependency
(and would not be an install time dependency like NumPy), much
like how we have an optional dependency on ReportLab just for
Bio.Graphics, and now also the option to use NetworkX with the
new Bio.Phylo code.

Package management (e.g. under Linux distros) can mark these
external modules as suggestions or soft requirements, making
this quite straight forward.

Regarding some of Giovanni's points, modularising the distribution
of Biopython (which can already be considered to be a core plus
assorted domain-specific modules like Bio.PDB, Bio.Cluster,
Bio.Graphics and so on) seems premature to me given the current
state of python distribution.

Peter

P.S. We can't take any GPL or LGPL code and incorporate it into
Biopython, due to the nature of those licences.


From anaryin at gmail.com  Wed Feb 10 22:52:53 2010
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Wed, 10 Feb 2010 14:52:53 -0800
Subject: [Biopython-dev] KEGG support
Message-ID: <b537e3711002101452i73a55414gd44e875261f8e67d@mail.gmail.com>

Hello all,

For what it's worth: I worked with KEGG about a year and a half ago, to do
some very basic things. I remember I tried using SOAPpy and ZSI. The first
is a pain to install in Windows (at least then it was), so I opted for the
second. However, it was quite outdated and I had some problems dealing
with complex data types.

Regarding modularising/non-modularising the code, I guess that some features
will have to have dependencies that cannot be included in the core
distribution, and thus the user should be warned that it needs library X or
Y to have them work. In short, keeping the current structure seems the
wisest IMO. I don't see such a need of creating outer-modules.

Lastly, good luck with KEGG's services' speed. That API is slower than a
turtle :x


From rjalves at igc.gulbenkian.pt  Thu Feb 11 00:44:59 2010
From: rjalves at igc.gulbenkian.pt (Renato Alves)
Date: Thu, 11 Feb 2010 00:44:59 +0000
Subject: [Biopython-dev] KEGG support
In-Reply-To: <320fb6e01002101427o5efc003es3402308e44b2533a@mail.gmail.com>
References: <4B72FB2D.4070808@igc.gulbenkian.pt>
	<320fb6e01002101427o5efc003es3402308e44b2533a@mail.gmail.com>
Message-ID: <4B73530B.7090203@igc.gulbenkian.pt>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

- From Peter on 02/10/2010 10:27 PM:
> Excellent news. Have you looked at the existing KEGG parsers in
> Biopython, and do you think the current style is suitable? (I haven't
> looked at the code recently myself, but will do).

The style seems good enough but I was thinking of having a more
functional approach, at least for the parser, to try to get away from the
massive if/elif/else cascades. The writer would come as second priority
and would be similar although I would also try to keep code duplication
at lower levels than what we can see in the Enzyme/__init__.py file. I
would also consider using Genes.py instead of Genes/__init__.py ... I
don't see the need of packages here.
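
To illustrate the kind of table-driven parser meant here, a minimal sketch
follows. It is not existing Biopython code: the handler names, the plain
dictionary used as a record and the sample entry are all made up for the
example. It only assumes the usual KEGG flat file layout, i.e. a keyword in
the first 12 columns, continuation lines indented by 12 spaces and "///"
closing each entry.

```python
# Sketch of a dispatch-table parser; every name below is hypothetical.

def parse_entry(value, record):
    # The first token of the ENTRY line is taken as the identifier.
    record["entry"] = value.split()[0]


def parse_name(value, record):
    # NAME holds a comma separated list of names.
    record["names"] = [n.strip() for n in value.split(",") if n.strip()]


def store_text(key):
    # Build a handler that stores the field text unchanged under `key`.
    def handler(value, record):
        record[key] = value
    return handler


# One handler per keyword replaces the if/elif/else cascade.
HANDLERS = {
    "ENTRY": parse_entry,
    "NAME": parse_name,
    "DEFINITION": store_text("definition"),
}


def parse_record(lines):
    """Parse the lines of a single entry into a dictionary."""
    fields = []  # [keyword, [text chunks]] in file order
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("///"):
            break  # end of this entry
        if not line.strip():
            continue  # ignore blank lines
        keyword, data = line[:12].strip(), line[12:].strip()
        if keyword:
            fields.append([keyword, [data]])
        elif fields:
            fields[-1][1].append(data)  # continuation of the previous field
    record = {}
    for keyword, chunks in fields:
        handler = HANDLERS.get(keyword)
        if handler is not None:  # unknown keywords are skipped
            handler(" ".join(chunks), record)
    return record


# Made-up entry, for illustration only (not real KEGG data).
SAMPLE = """\
ENTRY       x0001             CDS
NAME        geneA, altA
DEFINITION  hypothetical protein
            (continued line)
///
"""

record = parse_record(SAMPLE.splitlines())
```

Supporting another field would then mean adding one function and one
dictionary entry, rather than another elif branch.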

> Regarding the SeqIO interface (for KEGG GENES only?), I would be
> happy to advise. Initially I suggest you work on adding a parser much
> like the other KEGG parsers, returning gene records. Then we can
> add a Bio/SeqIO/KeggGeneIO.py wrapper to turn these into SeqRecord
> objects.

Yes for now my main goal would be GENES. The other formats can probably
grow from there. Your suggestion on the SeqIO seems reasonable. I'll try
to have a prototype in the next days/weekend and we can discuss from there.

> I have not used SOAP, and have a personal preference for REST style
> APIs. However, if that is what KEGG offers, this is worth considering.
> I think Brad has some experience with (other) SOAP services in Python.
> Note the KEGG documentation suggests using SOAPpy for Python.

According to the http://www.genome.jp/kegg/docs/weblink.html page they
do mention a REST like URL for generic entries, pathways and brite. But
it seems more useful for external linking than as an API. I couldn't
even figure out how to return the information in plaintext instead of
the default HTML. About SOAPpy, I've nothing against it besides the fact
that when I first tried it I had a few problems. Anyway it was a long time
ago... I've only played with suds since.

> Interestingly, KEGG are however looking into providing RDF (and
> perhaps one day SPARQL endpoints). I will try and find out what sort
> of time scale they have in mind while I am at the BioHackathon 2010
> this week - http://hackathon3.dbcls.jp/

We'll be waiting on your feedback on this :)

> For now, I would prioritise the KEGG flat file parsers.

Agreed.

> Peter
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)

iEYEARECAAYFAktzUwgACgkQYh11EUYTX9SPcwCfSrNkIovs1vnPinuAtMFZQJYn
pmAAnjHAAro2Ls/c1Nq4DCuliReaPm64
=Dohn
-----END PGP SIGNATURE-----


From rjalves at igc.gulbenkian.pt  Thu Feb 11 00:53:03 2010
From: rjalves at igc.gulbenkian.pt (Renato Alves)
Date: Thu, 11 Feb 2010 00:53:03 +0000
Subject: [Biopython-dev] KEGG support
In-Reply-To: <320fb6e01002101437p5f382ce0tb4cd864cec59a6c5@mail.gmail.com>
References: <4B72FB2D.4070808@igc.gulbenkian.pt>	
	<bb02be081002101212m381f1263r1922d145f61e26fa@mail.gmail.com>
	<320fb6e01002101437p5f382ce0tb4cd864cec59a6c5@mail.gmail.com>
Message-ID: <4B7354EF.8020703@igc.gulbenkian.pt>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

- From Peter on 02/10/2010 10:37 PM:
> On Wed, Feb 10, 2010 at 8:12 PM, Kyle <kellrott at gmail.com> wrote:
>> I think external library dependancies should be avoided unless necessary.
>>  Would a tool like wsdl2py produce code that isn't dependent on an installed
>> library? Alternatively, suds is LGPL based, could we just cannibalize the
>> source code for the important classes?
> 
> Working with SOAP is so complicated that using an external library
> would be the sensible option. It would be an optional dependency
> (and would not be an install time dependency like NumPy), much
> like how we have a optional dependency on ReportLab just for
> Bio.Graphics, and now also the option to use NetworkX with the
> new Bio.phylo code.

Yes that would be my idea on the SOAP interface. If doable we could even
evaluate the possibility of having some abstraction layer that could
enable the use of SOAPpy or suds if either is already available on the
system.
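
A rough sketch of what such a layer could start from, assuming nothing more
than "use whichever SOAP library can be imported". The helper names and the
preference order (suds first) are only assumptions for the example, and no
calls into either library are shown.

```python
import importlib


def first_available(module_names):
    """Return (name, module) for the first importable module.

    Returns (None, None) if none of the modules can be imported.
    """
    for name in module_names:
        try:
            return name, importlib.import_module(name)
        except ImportError:
            continue
    return None, None


# Assumed preference order; neither library is required at install time.
SOAP_BACKENDS = ("suds", "SOAPpy")
backend_name, backend = first_available(SOAP_BACKENDS)


def require_soap_backend():
    """Return the SOAP module found above, or fail with a clear message."""
    if backend is None:
        raise ImportError(
            "KEGG SOAP access needs one of: %s" % ", ".join(SOAP_BACKENDS)
        )
    return backend
```

Only the code that actually talks to the KEGG API would call
require_soap_backend(), so the flat file parsers keep working without
either library installed.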

> Package management (e.g. under Linux distros) can mark these
> external modules as suggestions or soft requirements, making
> this quite straight forward.

The 'or' case for soap libraries would also fit in this scheme since
most package managers already support this kind of feature.

> Regarding some of Giovanni's points, modularising the distribution
> of Biopython (which can already be considered to be a core plus
> assorted domain-specific modules like Bio.PDB, Bio.Cluster,
> Bio.Graphics and so on) seems premature to me give the current
> state of python distribution.

Could you elaborate a little on what you mean by 'current state of
python...'. Are you referring to the python3 transition?

Renato
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)

iEYEARECAAYFAktzVO0ACgkQYh11EUYTX9S1ngCfYFiW7VeNu6atl0J1eViqquSo
PCIAn3KO2p//fRYpZVC0QSp2gITP/n2I
=uTTc
-----END PGP SIGNATURE-----


From chapmanb at 50mail.com  Thu Feb 11 00:56:00 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 10 Feb 2010 19:56:00 -0500
Subject: [Biopython-dev] KEGG support
In-Reply-To: <4B73530B.7090203@igc.gulbenkian.pt>
References: <4B72FB2D.4070808@igc.gulbenkian.pt>
	<320fb6e01002101427o5efc003es3402308e44b2533a@mail.gmail.com>
	<4B73530B.7090203@igc.gulbenkian.pt>
Message-ID: <20100211005600.GB1923@kunkel>

Renato;
Great idea to work with the KEGG parsers. Very happy to have someone
tackling this.

> According to the http://www.genome.jp/kegg/docs/weblink.html page they
> do mention a REST like URL for generic entries, pathways and brite. But
> it seems more useful for external linking than as an API. I couldn't
> even figure out how to return the information in plaintext instead of
> the default HTML. About SOAPpy, I've nothing against it besides the fact
> that when I first tried I had few problems. Anyway it was a long time
> ago... I've only played with suds since.

My suggestion would be to use the TogoWS REST interface

http://togows.dbcls.jp/site/en/rest.html

It makes getting records crazy easy. There are tons of examples,
but for GENES, here's how to get the plain text record:

http://togows.dbcls.jp/entry/gene/eco:b0002

If you really want to use SOAP, my experience has been best with
suds. However, the complexities of SOAP are really not worth it if
you can get REST approaches to do what you need.
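
As a minimal sketch of how little code that takes, using only the standard
library: the URL layout below is simply copied from the example above, the
function names are made up, and it assumes the service still answers at
that address and returns plain text.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Base address taken from the example URL above.
TOGOWS_ENTRY = "http://togows.dbcls.jp/entry"


def togows_entry_url(database, entry_id):
    """Build the entry URL, e.g. .../entry/gene/eco:b0002."""
    # Keep the colon in identifiers such as "eco:b0002" unescaped.
    return "%s/%s/%s" % (TOGOWS_ENTRY, quote(database), quote(entry_id, safe=":"))


def fetch_entry(database, entry_id, timeout=30):
    """Download one record and return it as text (needs network access)."""
    with urlopen(togows_entry_url(database, entry_id), timeout=timeout) as handle:
        return handle.read().decode("utf-8", errors="replace")
```

Calling fetch_entry("gene", "eco:b0002") should then return the same text
as opening the URL above in a browser.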

Hope this helps,
Brad


From rjalves at igc.gulbenkian.pt  Thu Feb 11 01:14:52 2010
From: rjalves at igc.gulbenkian.pt (Renato Alves)
Date: Thu, 11 Feb 2010 01:14:52 +0000
Subject: [Biopython-dev] KEGG support
In-Reply-To: <5aa3b3571002101413o55d04432vc76c230aa9c43252@mail.gmail.com>
References: <4B72FB2D.4070808@igc.gulbenkian.pt>
	<5aa3b3571002101413o55d04432vc76c230aa9c43252@mail.gmail.com>
Message-ID: <4B735A0C.8070902@igc.gulbenkian.pt>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

- From Giovanni Marco Dall'Olio on 02/10/2010 10:13 PM:
> Hi,
> I had a terrible experience with parsing Kegg pathway's files: in the
> end I discovered that the files that are stored in their ftp don't
> correspond exactly to the diagrams that you can find in the web
> interface, as for example biochemical interactions don't have
> directionality while if you look at them on kegg/pathway you will see
> arrows.

I haven't used pathway files yet so I'll be careful when I reach them :)
Have you mentioned this aspect to the KEGG maintainers?

> Some time ago I proposed to implement something similar to what you have
> said for kegg/pathway, but in the end I abandoned the effort, because I
> had problem both with suds and SOAPpy, and I wasn't satisfied by the
> annotations in KEGG.
>  
> If you are serious about that I may help you, but I can only work on the
> weekends and you should tell me exactly what I have to do :-) 

Hehe, I can only tell you once I get my hands dirty. I'll keep my code
on github to maximize interaction. I'll get back to you when I have the
first working draft for GENES. Thanks for the hand ;)

> Are you sure? I tried it on KEGG an year ago and I was having problems
> to execute slightly more complex queries. If you look at suds's bug
> tracker, you will find some reports by me, like this one:
> - https://fedorahosted.org/suds/ticket/213

As of suds revision 658 I can no longer reproduce the error in the ticket.

> I remember that I was looping between the KEGG support centre and the
> suds bug tracker; both were very responsive to feedback and very keen to
> answer me, but in the end they didn't speak to each other and the bug
> reports that I have filed are still unfixed.
> 
> Which library can you use for the soap queries? I had the feeling that
> SOAPpy (which I think it is included in the standard lib) worked well
> with KEGG, however it development has stopped many years ago
> (http://sourceforge.net/projects/pywebsvcs/files/SOAP.py/), it is a mess
> if you want to use it behind an http_proxy (I should have a patch
> somewhere if you are interested) and I am sure it won't be kept
> compatible with the future versions of python.

SOAPpy doesn't seem to be in the standard lib, at least I don't have it
out of the box here. Only as external package in the repository.

> Another alternative may be beautiful soup, but I have never tried it.

I've only used beautiful soup as HTML cleaner/formatter, like HTML tidy.
I wasn't aware that it could be used for SOAP stuff. Are you sure about
this?

> This question on stackoverflow may provide you some ideas:
> http://stackoverflow.com/questions/206154/whats-the-best-soap-client-library-for-python-and-where-is-the-documentation-fo
> 
> I am not sure about which is the standard soap library for python, and
> which one is included in the standard lib. If you are going to use
> SOAPpy, it is a bad bet toward compatibility and maintenance for the
> future releases. Suds is the best option but it is not in the standard
> lib, and they still have to fix the bugs I have reported an year ago. I
> have the feeling that there is no good alternative for python.

I'll wait for your opinions. I don't want to sound religious about suds. :P

> Moreover, the WSDL functions that I have seen for KEGG are not
> especially useful. They seems to allow for the basic queries, but for
> most of the tasks it is better to download the ftp locally and work there.

Well if you just want a quick check on something the API still gives
better/quicker results than downloading the stuff via FTP. Given the
size, probably the load of the server and the fact that I'm on the other
side of the globe, I got an ETA of close to 20 hours when downloading
the genes.tar.gz file which is only a few GB in size.

Renato
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)

iEYEARECAAYFAktzWgoACgkQYh11EUYTX9Rp6QCfaHf6Ic3uT/npDw2o8l9F+8Kk
RtgAnjNXGxcrfvh48dcdFf6G4wK9+PNI
=vpUY
-----END PGP SIGNATURE-----


From biopython at maubp.freeserve.co.uk  Thu Feb 11 01:15:21 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 11 Feb 2010 01:15:21 +0000
Subject: [Biopython-dev] KEGG support
In-Reply-To: <4B7354EF.8020703@igc.gulbenkian.pt>
References: <4B72FB2D.4070808@igc.gulbenkian.pt>
	<bb02be081002101212m381f1263r1922d145f61e26fa@mail.gmail.com>
	<320fb6e01002101437p5f382ce0tb4cd864cec59a6c5@mail.gmail.com>
	<4B7354EF.8020703@igc.gulbenkian.pt>
Message-ID: <320fb6e01002101715n3ccb8894r155631a2c6e34cb6@mail.gmail.com>

Renato Alves <rjalves at igc.gulbenkian.pt> wrote:
>> Regarding some of Giovanni's points, modularising the distribution
>> of Biopython (which can already be considered to be a core plus
>> assorted domain-specific modules like Bio.PDB, Bio.Cluster,
>> Bio.Graphics and so on) seems premature to me give the current
>> state of python distribution.
>
> Could you elaborate a little on what you mean by 'current state of
> python...'. Are you referring to the python3 transition?

I didn't mean anything about Python 3 here. Just the current state
of python package management, with distutils vs setuptools,
easy_install, Distribute, etc. I am looking forward to an official
Python successor to distutils one day which will properly handle
dependencies (and hopefully uninstallation) nicely. However, for
now, a single monolithic Biopython released several times a
year works fine and I see no reason to change that.

Peter


From rjalves at igc.gulbenkian.pt  Thu Feb 11 01:46:59 2010
From: rjalves at igc.gulbenkian.pt (Renato Alves)
Date: Thu, 11 Feb 2010 01:46:59 +0000
Subject: [Biopython-dev] KEGG support
In-Reply-To: <20100211005600.GB1923@kunkel>
References: <4B72FB2D.4070808@igc.gulbenkian.pt>
	<320fb6e01002101427o5efc003es3402308e44b2533a@mail.gmail.com>
	<4B73530B.7090203@igc.gulbenkian.pt> <20100211005600.GB1923@kunkel>
Message-ID: <4B736193.9020801@igc.gulbenkian.pt>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

> Renato;
> Great idea to work with the KEGG parsers. Very happy to have someone
> tackling this.

Well as we say here, when the need comes we grab the bull by the horns.
:) (Small illustration even though I'm not a fan of the 'sport'
http://www.youtube.com/watch?v=OBORPnrm89I)

> My suggestion would be to use the TogoWS REST interface
> 
> http://togows.dbcls.jp/site/en/rest.html
> 
> It makes getting records crazy easy. There are tons of examples,
> but for GENES, here's how to get the plain text record:
> 
> http://togows.dbcls.jp/entry/gene/eco:b0002
> 
> If you really want to use SOAP, my experience has been best with
> suds. However, the complexities of SOAP are really not worth it if
> you can get REST approaches to do what you need.

Indeed, this does exactly the same without the need of additional libraries.
If all the functionality available on the SOAP API is also here I agree
with you, the complexity of SOAP is unnecessary.

Renato
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)

iEYEARECAAYFAktzYZEACgkQYh11EUYTX9RMWQCeLOXZH5vBjxB7rgPjhS53Fx7Z
EuMAoItWzjJ1LEtV6T8NcDDqnoDyIyBS
=dPVp
-----END PGP SIGNATURE-----


From biopython at maubp.freeserve.co.uk  Thu Feb 11 05:29:15 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 11 Feb 2010 05:29:15 +0000
Subject: [Biopython-dev] Merging Bio.SeqIO SFF support?
In-Reply-To: <320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com>
References: <320fb6e00911230643l611bb8f5i253630f3acabf438@mail.gmail.com>
	<320fb6e01001110911g2961a680qe95c01b14e8d23b3@mail.gmail.com>
Message-ID: <320fb6e01002102129r24e210e1qc070b40f7652fac8@mail.gmail.com>

On Mon, Jan 11, 2010 at 5:11 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> Hi all,
>
> I didn't want to rush the SFF support into Biopython 1.53, but its been
> waiting "ready" for a while now. Any objections or comments about
> me merging this now?
>
> Thanks,
>
> Peter

There were no objections, and I ran this by Brad and Michiel and
have just merged this into the master branch. Time for some more
testing!

Peter


From krother at rubor.de  Thu Feb 11 12:31:58 2010
From: krother at rubor.de (Kristian Rother)
Date: Thu, 11 Feb 2010 13:31:58 +0100
Subject: [Biopython-dev] Bio.PDB.KDTree test for memory leak
Message-ID: <112c17235319b66a00eebc499294fb2b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlhfRVBcWQ1eVw==-webmailer2@server03.webmailer.hosteurope.de>


Hi,

I've encountered a problem with running KDTree: it leaks memory.
The code below fills 1GB memory within a minute.

Running the GC doesn't help (it slows the process down, but only because
the GC is much slower than KDTree).

I think the problem might be in the C code. I'd like to get this bug
sorted out, but I'm not very good in C. Is there anyone around who I could
check ideas with?

Best Regards,
   Kristian


----

from Bio.KDTree.KDTree import *
from numpy.random import random

nr_points=1000
dim=3
bucket_size=10
coords=(200*random((nr_points, dim)))

while 1:
    kdtree=KDTree(dim, bucket_size)
    kdtree.set_coords(coords)


From biopython at maubp.freeserve.co.uk  Fri Feb 12 06:10:13 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 12 Feb 2010 06:10:13 +0000
Subject: [Biopython-dev] Bio.PDB.KDTree test for memory leak
In-Reply-To: <112c17235319b66a00eebc499294fb2b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlhfRVBcWQ1eVw==-webmailer2@server03.webmailer.hosteurope.de>
References: <112c17235319b66a00eebc499294fb2b-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FdQU1tXlhfRVBcWQ1eVw==-webmailer2@server03.webmailer.hosteurope.de>
Message-ID: <320fb6e01002112210y10ad4670p7ac3e003b5976685@mail.gmail.com>

On Thu, Feb 11, 2010 at 12:31 PM, Kristian Rother <krother at rubor.de> wrote:
>
> Hi,
>
> I've encountered a problem with running KDTree: it leaks memory.
> The code below fills 1GB memory within a minute.
>
> Running the GC doesn't help (it slows the process down, but only because
> the GC is much slower than KDTree.

You mean something like this?

import gc
from Bio.KDTree.KDTree import *
from numpy.random import random

nr_points=1000
dim=3
bucket_size=10
coords=(200*random((nr_points, dim)))

while True:
   kdtree=KDTree(dim, bucket_size)
   kdtree.set_coords(coords)
   del kdtree #explicitly tell Python it can GC this object
   gc.collect() #force Python to run GC

I agree, this does seem to gradually consume more and more RAM.
Could you open a bug on bugzilla to track this please?
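
For the bug report, a bounded version of the loop that prints a number may
be easier to compare between machines than watching top. A rough sketch
follows, assuming a Unix system (the resource module is not available on
Windows). The helper names are made up, and build_list is only a stand-in
so that the example runs without Biopython. For the real test one would
pass a function that creates KDTree(dim, bucket_size) and calls
set_coords(coords) as in the loop above.

```python
import gc
import resource  # Unix only


def peak_rss():
    """Peak resident set size so far (kilobytes on Linux, bytes on Mac OS X)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def rss_growth(make_object, iterations=1000):
    """Create and discard objects repeatedly; return the rise in peak RSS."""
    before = peak_rss()
    for _ in range(iterations):
        obj = make_object()
        del obj  # drop the only reference straight away
    gc.collect()  # rule out garbage that is merely uncollected
    return peak_rss() - before


def build_list():
    # Stand-in workload, used here instead of KDTree.
    return [0.0] * 3000


growth = rss_growth(build_list, iterations=200)
print("peak RSS grew by %d units" % growth)
```

Without a leak the figure should level off as the number of iterations
goes up, while a leak in the C code should make it grow roughly in
proportion.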

> I think the problem might be in the C code. I'd like to get this bug
> sorted out, but I'm not very good in C. Is there anyone around who
> I could check ideas with?

Have you ever used valgrind on a C tool? I'm not sure if it is easy
to use via Python, but it is my tool of choice for checking memory
leaks in C.

Peter


From bugzilla-daemon at portal.open-bio.org  Fri Feb 12 08:30:12 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 12 Feb 2010 03:30:12 -0500
Subject: [Biopython-dev] [Bug 3010] New: Bio.KDTree is leaking memory
Message-ID: <bug-3010-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=3010

           Summary: Bio.KDTree is leaking memory
           Product: Biopython
           Version: 1.53
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: krother at rubor.de


When I run KDTree on several of our PCs (Ubuntu, one with BioPython 1.53, one
with 1.51), it consumes memory that is never freed unless the process
terminates.

The code below fills 1GB memory within about a minute.

----
#!/usr/bin/env python

from Bio.KDTree.KDTree import *
from numpy.random import random

nr_points=1000
dim=3
bucket_size=10
coords=(200*random((nr_points, dim)))

while True:
   kdtree=KDTree(dim, bucket_size)
   kdtree.set_coords(coords)

----

Running the GC (via del kdtree; gc.collect() in the while loop) does not
help.

I think the problem might be the C code or the Python/C interaction. I checked
the sources of KDTree superficially (to see whether there is a free() for each
malloc()), but did not see anything unusual (I am not a C programmer though).

Peter proposed using valgrind to check for memory leaks in C. It may be
applicable to this problem.




From bugzilla-daemon at portal.open-bio.org  Fri Feb 12 12:31:13 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 12 Feb 2010 07:31:13 -0500
Subject: [Biopython-dev] [Bug 3006] esearch medline fails with xml format
In-Reply-To: <bug-3006-42@http.bugzilla.open-bio.org/>
Message-ID: <201002121231.o1CCVDlN010496@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3006


georg.lipps at fhnw.ch changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #2 from georg.lipps at fhnw.ch  2010-02-12 07:31 EST -------
I updated to Python 2.6.4 and Biopython 1.53 and can confirm that the problem
no longer occurs.

Thanks for checking.




From bugzilla-daemon at portal.open-bio.org  Fri Feb 12 16:23:17 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 12 Feb 2010 11:23:17 -0500
Subject: [Biopython-dev] [Bug 3010] Bio.KDTree is leaking memory
In-Reply-To: <bug-3010-42@http.bugzilla.open-bio.org/>
Message-ID: <201002121623.o1CGNHHd017669@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3010


------- Comment #1 from mdehoon at ims.u-tokyo.ac.jp  2010-02-12 11:23 EST -------
Does the memory leak occur also without the line kdtree.set_coords(coords)?




From bugzilla-daemon at portal.open-bio.org  Sun Feb 14 10:45:48 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 14 Feb 2010 05:45:48 -0500
Subject: [Biopython-dev] [Bug 3010] Bio.KDTree is leaking memory
In-Reply-To: <bug-3010-42@http.bugzilla.open-bio.org/>
Message-ID: <201002141045.o1EAjmV1029393@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3010


------- Comment #2 from krother at rubor.de  2010-02-14 05:45 EST -------
(In reply to comment #1)
> Does the memory leak occur also without the line kdtree.set_coords(coords)?
> 
No, I tried, and it doesn't.




From MatatTHC at gmx.de  Tue Feb 16 09:48:25 2010
From: MatatTHC at gmx.de (Matthias Bernt)
Date: Tue, 16 Feb 2010 10:48:25 +0100
Subject: [Biopython-dev] derive from Seq
Message-ID: <20100216094825.25190@gmx.net>

Hi, 

I've implemented a class derived from Seq. Many of the Seq functions return Seq. Thus, I can not use those functions because I need instances of the derived class.

This can easily be fixed by returning: 

self.__class__( .. ) 
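
To illustrate the difference with toy classes (ToySeq and MySeq are
invented names for this sketch, not the actual Bio.Seq code):

```python
class ToySeq(object):
    """Stand-in for a sequence class; not the real Bio.Seq.Seq."""

    def __init__(self, data):
        self.data = data

    def reverse_hardcoded(self):
        # Names the base class explicitly, so subclasses get a ToySeq back.
        return ToySeq(self.data[::-1])

    def reverse(self):
        # Uses the class of the instance, so subclasses get their own type.
        return self.__class__(self.data[::-1])


class MySeq(ToySeq):
    """Hypothetical user subclass."""


seq = MySeq("ACGT")
print(type(seq.reverse_hardcoded()).__name__)  # ToySeq
print(type(seq.reverse()).__name__)            # MySeq
```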

Regards, 
Matthias


From chapmanb at 50mail.com  Tue Feb 16 13:09:45 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 16 Feb 2010 08:09:45 -0500
Subject: [Biopython-dev] derive from Seq
In-Reply-To: <20100216094825.25190@gmx.net>
References: <20100216094825.25190@gmx.net>
Message-ID: <20100216130945.GH64068@sobchak.mgh.harvard.edu>

Hi Matthias;

> I've implemented a class derived from Seq. Many of the Seq functions
> return Seq. Thus, I can not use those functions because I need
> instances of the derived class.
>
> This can easily be fixed by returning:
>
> self.__class__( .. ) 

Good catch. Would you be able to submit a patch for this to the bug
tracker?

More generally, it is interesting that you are subclassing Seq. Can
you describe your application for this? I was debating with Peter
and Michiel this week and arguing that the Seq class should be
switched to a standard string, with biological functions like
reverse_complement and the like moving to stand alone functions and
SeqRecord objects. I'd be interested in hearing the opposite case;
that additional functionality is needed on a Seq object.

Brad


From bugzilla-daemon at portal.open-bio.org  Tue Feb 16 17:53:29 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 16 Feb 2010 12:53:29 -0500
Subject: [Biopython-dev] [Bug 3013] New: import warnings missing in
	Bio/PDB/MMCIF2Dict.py
Message-ID: <bug-3013-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=3013

           Summary: import warnings missing in Bio/PDB/MMCIF2Dict.py
           Product: Biopython
           Version: 1.53
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: minor
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: macrozhu+biopy at gmail.com


python library >>warnings<< is not imported in Bio/PDB/MMCIF2Dict.py
Please import the library in the beginning of the source code.




From bugzilla-daemon at portal.open-bio.org  Wed Feb 17 01:24:39 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 16 Feb 2010 20:24:39 -0500
Subject: [Biopython-dev] [Bug 3013] import warnings missing in
	Bio/PDB/MMCIF2Dict.py
In-Reply-To: <bug-3013-42@http.bugzilla.open-bio.org/>
Message-ID: <201002170124.o1H1OdhE003209@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3013


mdehoon at ims.u-tokyo.ac.jp changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #1 from mdehoon at ims.u-tokyo.ac.jp  2010-02-16 20:24 EST -------
Fixed in the repository, thanks.




From p.j.a.cock at googlemail.com  Wed Feb 17 02:48:01 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 17 Feb 2010 02:48:01 +0000
Subject: [Biopython-dev] derive from Seq
In-Reply-To: <20100216130945.GH64068@sobchak.mgh.harvard.edu>
References: <20100216094825.25190@gmx.net>
	<20100216130945.GH64068@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e01002161848s28e543a9jdac436976de3f279@mail.gmail.com>

On Tue, Feb 16, 2010 at 1:09 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Hi Matthias;
>
>> I've implemented a class derived from Seq. Many of the Seq functions
>> return Seq. Thus, I can not use those functions because I need
>> instances of the derived class.
>>
>> This can easily be fixed by returning:
>>
>> self.__class__( .. )

We debated this on the mailing list a while ago (I'd have to search
a little harder to find the thread). While switching to this form makes
subclassing easier in some cases, it doesn't in all.
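
One case where it does not help, sketched with toy classes (AnnotatedSeq
and its name argument are invented for illustration, nothing here is
Biopython code): a subclass whose __init__ needs extra arguments, so that
self.__class__(data) fails inside the inherited method.

```python
class ToySeq(object):
    """Stand-in for a sequence class; not the real Bio.Seq.Seq."""

    def __init__(self, data):
        self.data = data

    def reverse(self):
        # Assumes every subclass can be built from the data alone.
        return self.__class__(self.data[::-1])


class AnnotatedSeq(ToySeq):
    """Hypothetical subclass with an extra required constructor argument."""

    def __init__(self, data, name):
        ToySeq.__init__(self, data)
        self.name = name


def try_reverse(seq):
    """Return the reversed data, or the name of the exception raised."""
    try:
        return seq.reverse().data
    except TypeError as err:
        return type(err).__name__


print(try_reverse(ToySeq("ACGT")))                # TGCA
print(try_reverse(AnnotatedSeq("ACGT", "gene")))  # TypeError
```

Such a subclass still has to override every method that builds a new
object, whichever form the base class uses.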

> More generally, it is interesting that you are subclassing Seq. Can
> you describe your application for this? ... I'd be interested in
> hearing ... additional functionality is needed on a Seq object.
>
> Brad

Last time this (subclassing the Seq object) was mentioned, the
specific use was to change the equality operations to be string
like. This is a change we're considering making in Biopython itself
(and again was something Brad, Michiel and I chatted about
last week - I will be sending out an email about that next week,
I'm on holiday right now and haven't had internet access till
today).

But to echo Brad, use cases for subclassing the Seq are
of great interest.

Regards,

Peter


From MatatTHC at gmx.de  Wed Feb 17 08:33:11 2010
From: MatatTHC at gmx.de (Matthias Bernt)
Date: Wed, 17 Feb 2010 09:33:11 +0100
Subject: [Biopython-dev] derive from Seq
In-Reply-To: <320fb6e01002161848s28e543a9jdac436976de3f279@mail.gmail.com>
References: <20100216094825.25190@gmx.net>	
	<20100216130945.GH64068@sobchak.mgh.harvard.edu>
	<320fb6e01002161848s28e543a9jdac436976de3f279@mail.gmail.com>
Message-ID: <20100217083311.287840@gmx.net>

Hi, 

I'm dealing with circular sequences. Thus, I need some specialised functions (e.g. getting a subsequence). Furthermore, for me it seems to be the natural way to extend the functionality of Seq to my own needs. 
But, maybe this is not the best way. 

Matthias


> > Hi Matthias;
> >
> >> I've implemented a class derived from Seq. Many of the Seq functions
> >> return Seq. Thus, I can not use those functions because I need
> >> instances of the derived class.
> >>
> >> This can easily be fixed by returning:
> >>
> >> self.__class__( .. )
> 
> We debated this on the mailing list a while ago (I'd have to search
> a little harder to find the thread). While switching to this form makes
> subclassing easier in some cases, it doesn't in all.
> 
> > More generally, it is interesting that you are subclassing Seq. Can
> > you describe your application for this? ... I'd be interested in
> > hearing ... additional functionality is needed on a Seq object.
> >
> > Brad
> 
> Last time this (subclassing the Seq object) was mentioned, the
> specific use was to change the equality operations to be string
> like. This is a change we're considering making in Biopython itself
> (and again was something Brad, Michiel and I chatted about
> last week - I will be sending out an email about that next week,
> I'm on holiday right now and haven't had internet access till
> today).
> 
> But to echo Brad, use cases for subclassing the Seq are
> of great interest.




From bugzilla-daemon at portal.open-bio.org  Thu Feb 18 16:09:52 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 18 Feb 2010 11:09:52 -0500
Subject: [Biopython-dev] [Bug 3013] import warnings missing in
	Bio/PDB/MMCIF2Dict.py
In-Reply-To: <bug-3013-42@http.bugzilla.open-bio.org/>
Message-ID: <201002181609.o1IG9qth028156@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3013


------- Comment #2 from macrozhu+biopy at gmail.com  2010-02-18 11:09 EST -------
Can pychecker be of any use for detecting such minor bugs? It might be too
much, I guess.




From bugzilla-daemon at portal.open-bio.org  Sat Feb 20 18:40:59 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 20 Feb 2010 13:40:59 -0500
Subject: [Biopython-dev] [Bug 3013] import warnings missing in
	Bio/PDB/MMCIF2Dict.py
In-Reply-To: <bug-3013-42@http.bugzilla.open-bio.org/>
Message-ID: <201002201840.o1KIexYS017773@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3013


------- Comment #3 from eric.talevich at gmail.com  2010-02-20 13:40 EST -------
(In reply to comment #2)
> Can pychecker be of any use for detecting such minor bugs? It might be too
> much, I guess.
> 

I don't know about PyChecker, but PyLint will catch import errors and
undefined variables like this. For example, I just ran "pylint -e
Bio/PDB/*.py" on a branch that didn't have this fix in it yet, and it flagged
this bug:

E: 79:MMCIF2Dict._make_mmcif_dict: Undefined variable 'warnings'
E: 91:MMCIF2Dict._make_mmcif_dict: Undefined variable 'warnings'
E:107:MMCIF2Dict._make_mmcif_dict: Undefined variable 'warnings'


While I'm at it, here are the other errors in Bio.PDB that pylint caught in a
freshly updated master branch:


************* Module Chain
E: 79:Chain.__delitem__: Class 'Entity' has no '__delitem__' member

************* Module DSSP
E:101:make_dssp_dict: function already defined line 8
E:139:DSSP: class already defined line 8

************* Module Entity
E: 56:Entity.get_level: Instance of 'Entity' has no 'level' member

************* Module FragmentMapper
E:137:Fragment.add_residue: Undefined variable 'PDBException'
E:191:_make_fragment_list: Undefined variable 'PDBException'
E:193:_make_fragment_list: Undefined variable 'PDBException'
E:226:FragmentMapper: class already defined line 10
E:250:FragmentMapper.__init__: Undefined variable 'PDBException'

************* Module HSExposure
E: 67:_AbstractHSExposure.__init__: Instance of '_AbstractHSExposure' has no
'_get_cb' member
E:131:HSExposureCA: class already defined line 9
E:222:HSExposureCB: class already defined line 9
E:257:ExposureCN: class already defined line 9

************* Module MMCIF2Dict
E:  8: No name 'MMCIFlex' in module 'Bio.PDB.mmCIF'
E: 31:MMCIF2Dict.__init__: Module 'Bio.PDB.mmCIF' has no 'MMCIFlex' member
E: 33:MMCIF2Dict.__init__: Module 'Bio.PDB.mmCIF' has no 'MMCIFlex' member
E: 44:MMCIF2Dict._make_mmcif_dict: Module 'Bio.PDB.mmCIF' has no 'MMCIFlex'
member

************* Module NACCESS
E:183: Instance of 'NACCESS' has no 'get_iterator' member

************* Module PDBParser
E:159:PDBParser._parse_coordinates: Undefined variable 'PDBContructionError'

************* Module Polypeptide
E:276:_PPBuilder.build_peptides: Instance of '_PPBuilder' has no
'_is_connected' member

************* Module ResidueDepth
E: 65:get_surface: function already defined line 11
E:123:ResidueDepth: class already defined line 11

************* Module StructureAlignment
E: 14:StructureAlignment: class already defined line 6




From eric.talevich at gmail.com  Sat Feb 20 19:01:42 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Sat, 20 Feb 2010 14:01:42 -0500
Subject: [Biopython-dev] derive from Seq
In-Reply-To: <20100216130945.GH64068@sobchak.mgh.harvard.edu>
References: <20100216094825.25190@gmx.net>
	<20100216130945.GH64068@sobchak.mgh.harvard.edu>
Message-ID: <3f6baf361002201101m78a9550csc6d371df251e4319@mail.gmail.com>

On Tue, Feb 16, 2010 at 8:09 AM, Brad Chapman <chapmanb at 50mail.com> wrote:

>
> More generally, it is interesting that you are subclassing Seq. Can
> you describe your application for this? I was debating with Peter
> and Michiel this week and arguing that the Seq class should be
> switched to a standard string, with biological functions like
> reverse_complement and the like moving to stand alone functions and
> SeqRecord objects. I'd be interested in hearing the opposite case;
> that additional functionality is needed on a Seq object.
>
>
I've seen a technique like this used to good effect:

# File: Seq.py

# Standalone functions all take a string-like first argument
def reverse_complement(seq): ...
def translate(seq, table=1): ...

class Seq(basestring):  # or str
    def __init__(self, data, alphabet): ...
    # Then attach the above functions as methods here
    reverse_complement = reverse_complement
    translate = translate
    ...


The same functionality is then available in a functional or OO style, with
minimal code duplication. And for interactive sessions, where converting
strings to Seqs is a bit more of an inconvenience, "from Bio.Seq import *"
becomes quick and handy.

-Eric


From biopython at maubp.freeserve.co.uk  Sun Feb 21 12:03:21 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 21 Feb 2010 12:03:21 +0000
Subject: [Biopython-dev] derive from Seq
In-Reply-To: <3f6baf361002201101m78a9550csc6d371df251e4319@mail.gmail.com>
References: <20100216094825.25190@gmx.net>
	<20100216130945.GH64068@sobchak.mgh.harvard.edu>
	<3f6baf361002201101m78a9550csc6d371df251e4319@mail.gmail.com>
Message-ID: <320fb6e01002210403i721f8f26l2e7d37aae0b13c35@mail.gmail.com>

On Sat, Feb 20, 2010 at 7:01 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
> I've seen a technique like this used to good effect:
>
> # File: Seq.py
>
> ...
>
> The same functionality is then available in a functional or OO style, with
> minimal code duplication. And for interactive sessions, where converting
> strings to Seqs is a bit more of an inconvenience, "from Bio.Seq import *"
> becomes quick and handy.

Doesn't that describe the Bio.Seq module as it is pretty well?
In addition to the Seq object methods, there are several functions
which can be used on strings or Seq (like) objects.

Peter


From eric.talevich at gmail.com  Sun Feb 21 16:36:13 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Sun, 21 Feb 2010 11:36:13 -0500
Subject: [Biopython-dev] derive from Seq
In-Reply-To: <320fb6e01002210403i721f8f26l2e7d37aae0b13c35@mail.gmail.com>
References: <20100216094825.25190@gmx.net>
	<20100216130945.GH64068@sobchak.mgh.harvard.edu> 
	<3f6baf361002201101m78a9550csc6d371df251e4319@mail.gmail.com> 
	<320fb6e01002210403i721f8f26l2e7d37aae0b13c35@mail.gmail.com>
Message-ID: <3f6baf361002210836of243016s8206035c1b89de24@mail.gmail.com>

On Sun, Feb 21, 2010 at 7:03 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> On Sat, Feb 20, 2010 at 7:01 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
> > I've seen a technique like this used to good effect:
> >
> > ...
> >
> > The same functionality is then available in a functional or OO style, with
> > minimal code duplication. And for interactive sessions, where converting
> > strings to Seqs is a bit more of an inconvenience, "from Bio.Seq import *"
> > becomes quick and handy.
>
> Doesn't that describe the Bio.Seq module as it is pretty well?
> In addition to the Seq object methods, there are several functions
> which can be used on strings or Seq (like) objects.
>
> Peter
>

I'm not fully up to speed on the debate or the use cases that triggered it,
but I'm guessing the goal is better code flexibility without sacrificing
performance. Here's some code to consider:

def transcribe(dna, alphabet=None):
    """Transcribe a DNA sequence into RNA. Returns a string."""
    if isinstance(dna, Seq) or isinstance(dna, MutableSeq):
        # At first, maybe issue a warning here
        alphabet = dna.alphabet
        dna = str(dna)
    if alphabet is not None:
        # Validate
        base = Alphabet._get_base_alphabet(alphabet)
        if isinstance(base, Alphabet.ProteinAlphabet):
            raise ValueError("Proteins cannot be transcribed!")
        if isinstance(base, Alphabet.RNAAlphabet):
            raise ValueError("RNA cannot be transcribed!")
    return dna.replace('T','U').replace('t','u')

class Seq:
    # ...
    def transcribe(self):
        transcript = transcribe(self._data)
        # Rebuild the Seq object
        if self.alphabet==IUPAC.unambiguous_dna:
            alphabet = IUPAC.unambiguous_rna
        elif self.alphabet==IUPAC.ambiguous_dna:
            alphabet = IUPAC.ambiguous_rna
        else:
            alphabet = Alphabet.generic_rna
        return Seq(transcript, alphabet)


Notes:
 - The standalone takes an optional 'alphabet' argument, and performs
validation if requested.
 - Since the standalone function now has the same functionality as the Seq
method, Seq can dispatch to the function -- rather than the other way
around, as it is currently -- and then just rebuild a Seq object.
 - The standalone function now always returns the same type (str). Since
this might break some existing code, a little shim and deprecation dance may
be needed in real life. But I think returning a plain string is the Right
Thing: there's "one obvious way" to work with Seq objects or plain strings.
 - If the grand proposal is to eventually move the alphabet attribute to
SeqRecord, this provides an intermediate step and a more convenient
foundation for testing the idea.

Best,
Eric


From biopython at maubp.freeserve.co.uk  Mon Feb 22 14:48:14 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 22 Feb 2010 14:48:14 +0000
Subject: [Biopython-dev] Changing Seq equality
In-Reply-To: <320fb6e00911270339s3354051cub0cc193466575f16@mail.gmail.com>
References: <C731B9BA.2C661%lpritc@scri.ac.uk>
	<200911250945.20870.jblanca@btc.upv.es>
	<320fb6e00911250226w4e86ea5cr4cdea4a424d32b7@mail.gmail.com>
	<200911251220.53881.jblanca@btc.upv.es>
	<320fb6e00911250348m249533d1g5e30b6c593769dd1@mail.gmail.com>
	<3f6baf360911252314u72ab5c19rbcb899e736117a4f@mail.gmail.com>
	<320fb6e00911260241j22fbee47ufaad13412c0ff580@mail.gmail.com>
	<3f6baf360911261213g2047607aw212215cce2b4fe82@mail.gmail.com>
	<320fb6e00911270339s3354051cub0cc193466575f16@mail.gmail.com>
Message-ID: <320fb6e01002220648n5d47f015r65f17a37f782fcde@mail.gmail.com>

Hi all,

I've just got back from Japan - Brad and I were fortunate to be
able to attend the DBCLS BioHackathon 2010 held in Tokyo,
http://hackathon3.dbcls.jp/

As Brad already mentioned in passing, we also managed to have
dinner one evening with Michiel, and had an informal chat about
Biopython plans. Expect a few more emails on other topics to
follow.

One of the short term aims we agreed on was to press ahead
with the Seq equality changes outlined on this thread late last
year. Mailing list archive link:
http://lists.open-bio.org/pipermail/biopython-dev/2009-November/007021.html

To recap, the agreed best behaviour was to make Seq equality
act like string equality, but to raise a Python warning when
incompatible alphabets are compared (e.g. DNA to Protein).
This also applies to all the other comparison operators:
not equal, less than, greater than, less than or equal, and
greater than or equal.
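
As a rough sketch of that target behaviour (ToySeq is a made-up stand-in,
and the alphabet is reduced to a plain string label here, which is much
cruder than the real Bio.Alphabet compatibility rules):

```python
import warnings


class ToySeq(object):
    """Stand-in for Seq with a plain string label as its 'alphabet'."""

    def __init__(self, data, alphabet=None):
        self.data = data
        self.alphabet = alphabet  # e.g. "DNA", "RNA", "protein" or None

    def __str__(self):
        return self.data

    def _check_alphabet(self, other):
        # Warn only when both sides declare an alphabet and they differ.
        mine = self.alphabet
        theirs = getattr(other, "alphabet", None)
        if mine is not None and theirs is not None and mine != theirs:
            warnings.warn("Comparing %s to %s sequence" % (mine, theirs))

    def __eq__(self, other):
        self._check_alphabet(other)
        return str(self) == str(other)

    def __ne__(self, other):
        return not self == other

    def __lt__(self, other):
        self._check_alphabet(other)
        return str(self) < str(other)

    def __hash__(self):
        # Keep hashing consistent with string-like equality.
        return hash(str(self))


print(ToySeq("ACGT", "DNA") == "ACGT")  # True
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    print(ToySeq("ACGT", "DNA") == ToySeq("ACGT", "protein"))  # True
    print(len(caught))  # 1
```

The remaining operators (greater than, less than or equal, greater than
or equal) would follow the same pattern as __lt__.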

This is my outline plan for the change:

For Biopython up to 1.53, Seq class uses object equality,
seq1==seq2 acts as id(seq1)==id(seq2)

For Biopython 1.54 (and perhaps a few more releases),
the Seq classes will still use object equality but will trigger
a warning suggesting explicit use of  id(seq1)==id(seq2)
or str(seq1)==str(seq2) as appropriate.

For Biopython 1.xx (maybe 1.55 or 1.56?) the Seq classes
will switch to using string equality (with an alphabet aware
warning for comparing DNA to RNA etc), but will also trigger
a warning that this is a change from previous releases, and
suggest in the short term the continued explicit use of either
id(seq1)==id(seq2) for object identity or str(seq1)==str(seq2)
for string identity.

For Biopython 1.yy (maybe 1.57?) the Seq classes will
use string equality (with an alphabet aware warning for
comparing DNA to RNA etc), without any warning about
this being a change from historic behaviour.
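
For the transitional release, the idea might look something like this
sketch. ToySeq is a made-up stand-in for the Seq class, and both the
wording of the message and the use of FutureWarning are assumptions, not
anything decided:

```python
import warnings


class ToySeq(object):
    """Stand-in for Seq during the hypothetical transition release."""

    def __init__(self, data):
        self.data = data

    def __str__(self):
        return self.data

    def __eq__(self, other):
        # Behaviour is unchanged (object identity), but the caller is told
        # to say explicitly which comparison is wanted.
        warnings.warn(
            "Seq equality will change to string comparison in a future "
            "release; use id(seq1)==id(seq2) or str(seq1)==str(seq2) "
            "explicitly.",
            FutureWarning,
        )
        return id(self) == id(other)

    def __ne__(self, other):
        return not self == other

    def __hash__(self):
        return id(self)


a = ToySeq("ACGT")
b = ToySeq("ACGT")
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    print(a == b)            # False, still object identity
    print(str(a) == str(b))  # True, explicit string comparison
    print(len(caught))       # 1
```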

These warning messages could also point at a wiki page,
and we'd need a FAQ entry in the tutorial as well. The
aim of this slightly drawn out switch is to try and make
sure all users are aware of the change, even if they
only update their copy of Biopython every few releases.

Does that all sound sensible? If so, we should probably
have an announcement on the main mailing list, in case
there are any other views.

Other more complex options include a flag for switching
between the modes - but that complexity doesn't seem
such a good idea to me. All my own code and most of
the unit tests use str(seq1)==str(seq2) explicitly anyway.
The only exception is some of the genetic algorithm unit
tests which do seem to want explicit object identity.

Regards,

Peter


From biopython at maubp.freeserve.co.uk  Tue Feb 23 11:31:35 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 23 Feb 2010 11:31:35 +0000
Subject: [Biopython-dev] Handles and/or filenames in Bio.SeqIO etc?
In-Reply-To: <20090728221726.GK68751@sobchak.mgh.harvard.edu>
References: <320fb6e00907280934i54f326a6r38325c05a314cdbc@mail.gmail.com>
	<20090728221726.GK68751@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e01002230331j5f5f87c5lf328d3bacc4a557b@mail.gmail.com>

Hi all,

As mentioned in another thread, Brad, Michiel and I had
an informal meeting earlier this month in Tokyo and
discussed some plans for Biopython. One of the short
term changes we agreed on was to push ahead with
the Seq object equality changes, see:
http://lists.open-bio.org/pipermail/biopython-dev/2010-February/007351.html

Another short term change we agreed was worthwhile
was to follow other Python libraries and allow handles
OR filenames in our parsers (starting with SeqIO and
AlignIO). This follows the discussion for the "TreeIO"
module (since renamed) and the Bio.SeqIO.convert
functions here on the mailing list last year, see:
http://lists.open-bio.org/pipermail/biopython-dev/2009-July/006503.html

I will tackle this shortly for Bio.SeqIO and Bio.AlignIO.
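
A rough sketch of the idea, written for Python 3 (open_if_filename and
count_fasta_records are invented names for illustration, not existing
Biopython functions):

```python
import contextlib


@contextlib.contextmanager
def open_if_filename(handle_or_filename, mode="r"):
    """Yield a file handle given either an open handle or a filename.

    Files opened here are closed on exit; handles passed in are left
    open, since the caller owns them.
    """
    if isinstance(handle_or_filename, str):
        handle = open(handle_or_filename, mode)
        try:
            yield handle
        finally:
            handle.close()
    else:
        yield handle_or_filename


def count_fasta_records(handle_or_filename):
    """Toy 'parser': count the lines starting with '>'."""
    with open_if_filename(handle_or_filename) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

With this, count_fasta_records("example.fasta") and
count_fasta_records(open("example.fasta")) would both work, and only the
first form closes the file for you.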

Peter


From bugzilla-daemon at portal.open-bio.org  Tue Feb 23 17:43:01 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 23 Feb 2010 12:43:01 -0500
Subject: [Biopython-dev] [Bug 3013] import warnings missing in
	Bio/PDB/MMCIF2Dict.py
In-Reply-To: <bug-3013-42@http.bugzilla.open-bio.org/>
Message-ID: <201002231743.o1NHh17v001826@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3013


------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk  2010-02-23 12:43 EST -------
Hi Eric,

I have fixed most (all?) of those problems reported by pylint - see mailing
list post.

Thanks!

Peter




From biopython at maubp.freeserve.co.uk  Tue Feb 23 17:43:31 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 23 Feb 2010 17:43:31 +0000
Subject: [Biopython-dev] Running pylint over Biopython
Message-ID: <320fb6e01002230943g474dc673x43f7df120e679d91@mail.gmail.com>

Hi all,

Those following @Biopython on twitter or subscribed to the github RSS
feed for our repository will know this already, but I've been using
pylint today to spot some errors in Biopython.
http://www.logilab.org/project/pylint

This was prompted by Eric trying this on Bio.PDB for Bug 3013 and
finding some issues - thanks Eric, this was a valuable suggestion.

With its default settings pylint is very very noisy, and in particular
doesn't like our naming conventions. However, with the following
command line you can focus in on the important stuff:

pylint --disable-msg-cat=CRW --include-ids=y
--disable-msg=E1101,E1103,E0102 -r n Bio BioSQL

Note that instead of module names, you can give filenames (e.g. *.py).
What that does is disable several categories of message (conventions,
possible refactorings, warnings) leaving just errors and fatal
messages. I turned on the message identifiers so that I have something
useful to stick into Google if need be, or to add to the ignore list
(currently three cases which looked like false positives). Then I turn
off the detailed report.

[Tip - don't run this from the Biopython source directory as then
importing our C code modules will fail]

As you will be able to tell from the recent flurry of git commits,
this highlighted some simple errors like missing imports or typos in
variable names.


Tiago, could you have a look at these possible problems in Bio.PopGen:

************* Module Bio.PopGen.Async
E0602: 78:Async.get_result: Undefined variable 'done'
E0602: 79:Async.get_result: Undefined variable 'done'
************* Module Bio.PopGen.GenePop
E0602:160:Record.split_in_pops: Undefined variable 'GenePop'
E0602:177:Record.split_in_loci: Undefined variable 'GenePop'
************* Module Bio.PopGen.GenePop.Controller
E0602: 41:_read_allele_freq_table: Undefined variable 'self'
E0602:133:_hw_func: Undefined variable 'self'
E0602:393:GenePopController.test_pop_hw_prob: Undefined variable 'ext'
E0602:458:GenePopController.test_ld.ld_pop_func: Undefined variable
'currrent_pop'
************* Module Bio.PopGen.SimCoal.Cache
E0602: 79:SimCoalCache.getSimulation: Undefined variable 'Config'
E0602: 88: Undefined variable 'Cache'
************* Module Bio.PopGen.SimCoal.Controller
E0602: 47:SimCoalController.run_simcoal: Undefined variable 'Config'


Eric, I don't have all the dependencies installed, but pylint does
appear to dislike a few things in Bio.Phylo on the trunk:

************* Module Bio.Phylo.BaseTree
E0203:521:TreeMixin.prune: Access to member 'root' before its
definition line 531
E0203:527:TreeMixin.prune: Access to member 'root' before its
definition line 531
E0202:672:Subtree.root: An attribute inherited from TreeMixin hide this method
************* Module Bio.Phylo.PhyloXML
E1120:182:Phylogeny.get_alignment: No value passed for parameter
'follow_attrs' in function call


One thing this exercise has shown is that we still need to do some
work on the unit test coverage.

Regards

Peter


From tiagoantao at gmail.com  Tue Feb 23 17:56:22 2010
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Tue, 23 Feb 2010 17:56:22 +0000
Subject: [Biopython-dev] Running pylint over Biopython
In-Reply-To: <320fb6e01002230943g474dc673x43f7df120e679d91@mail.gmail.com>
References: <320fb6e01002230943g474dc673x43f7df120e679d91@mail.gmail.com>
Message-ID: <6d941f121002230956h582acc11r59cb18bca8d2f727@mail.gmail.com>

This comes at a good time. I've actually been making changes to the
code (the genepop parser is not able to handle big files and I've
had quite a few complaints about that). It seems to be 2.6 related or
so, because I've detected the Config problem myself. I will correct
this next week (this week is _impossible_), along with an update to
the genepop parser to support big files.

2010/2/23 Peter <biopython at maubp.freeserve.co.uk>:
> Hi all,
>
> Those following @Biopython on twitter or subscribed to the github RSS
> feed for our repository will know this already, but I've been using
> pylint today to spot some errors in Biopython.
> http://www.logilab.org/project/pylint
>
> This was prompted by Eric trying this on Bio.PDB for Bug 3013 and
> finding some issues - thanks Eric, this was a valuable suggestion.
>
> With its default settings pylint is very very noisy, and in particular
> doesn't like our naming conventions. However, with the following
> command line you can focus in on the important stuff:
>
> pylint --disable-msg-cat=CRW --include-ids=y
> --disable-msg=E1101,E1103,E0102 -r n Bio BioSQL
>
> Note that instead of module names, you can give filenames (e.g. *.py).
> What that does is disable several categories of message (conventions,
> possible refactorings, warnings) leaving just errors and fatal
> messages. I turned on the message identifiers so that I have something
> useful to stick into Google if need be, or to add to the ignore list
> (currently three cases which looked like false positives). Then I turn
> off the detailed report.
>
> [Tip - don't run this from the Biopython source directory as then
> importing our C code modules will fail]
>
> As you will be able to tell from the recent flurry of git commits,
> this highlighted some simple errors like missing imports or typos in
> variable names.
>
>
> Tiago, could you have a look at these possible problems in Bio.PopGen:
>
> ************* Module Bio.PopGen.Async
> E0602: 78:Async.get_result: Undefined variable 'done'
> E0602: 79:Async.get_result: Undefined variable 'done'
> ************* Module Bio.PopGen.GenePop
> E0602:160:Record.split_in_pops: Undefined variable 'GenePop'
> E0602:177:Record.split_in_loci: Undefined variable 'GenePop'
> ************* Module Bio.PopGen.GenePop.Controller
> E0602: 41:_read_allele_freq_table: Undefined variable 'self'
> E0602:133:_hw_func: Undefined variable 'self'
> E0602:393:GenePopController.test_pop_hw_prob: Undefined variable 'ext'
> E0602:458:GenePopController.test_ld.ld_pop_func: Undefined variable
> 'currrent_pop'
> ************* Module Bio.PopGen.SimCoal.Cache
> E0602: 79:SimCoalCache.getSimulation: Undefined variable 'Config'
> E0602: 88: Undefined variable 'Cache'
> ************* Module Bio.PopGen.SimCoal.Controller
> E0602: 47:SimCoalController.run_simcoal: Undefined variable 'Config'
>
>
> Eric, I don't have all the dependencies installed by pylint does
> appear to dislike a few things in Bio.Phylo on the trunk:
>
> ************* Module Bio.Phylo.BaseTree
> E0203:521:TreeMixin.prune: Access to member 'root' before its
> definition line 531
> E0203:527:TreeMixin.prune: Access to member 'root' before its
> definition line 531
> E0202:672:Subtree.root: An attribute inherited from TreeMixin hide this method
> ************* Module Bio.Phylo.PhyloXML
> E1120:182:Phylogeny.get_alignment: No value passed for parameter
> 'follow_attrs' in function call
>
>
> One thing this exercise has shown is that we still need to do some
> work on the unit test coverage.
>
> Regards
>
> Peter
>


-- 
"Pessimism of the Intellect; Optimism of the Will" -Antonio Gramsci


From bugzilla-daemon at portal.open-bio.org  Tue Feb 23 18:03:53 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 23 Feb 2010 13:03:53 -0500
Subject: [Biopython-dev] [Bug 3013] import warnings missing in
	Bio/PDB/MMCIF2Dict.py
In-Reply-To: <bug-3013-42@http.bugzilla.open-bio.org/>
Message-ID: <201002231803.o1NI3r10002509@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3013


macrozhu+biopy at gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |macrozhu+biopy at gmail.com


------- Comment #5 from macrozhu+biopy at gmail.com  2010-02-23 13:03 EST -------
wow, the developers really respond very quickly. 

How about running >>pylint<< or >>pychecker<< on all BioPython code to detect
potential problems?

cheers,




From bugzilla-daemon at portal.open-bio.org  Tue Feb 23 18:59:49 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 23 Feb 2010 13:59:49 -0500
Subject: [Biopython-dev] [Bug 3013] import warnings missing in
	Bio/PDB/MMCIF2Dict.py
In-Reply-To: <bug-3013-42@http.bugzilla.open-bio.org/>
Message-ID: <201002231859.o1NIxnJH004142@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=3013


------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk  2010-02-23 13:59 EST -------
(In reply to comment #5)
> wow, the developers really respond very quickly. 
> 
> How about running >>pylint<< or >>pychecker<< on all BioPython code to detect
> potential problems?
> 
> cheers,
> 

Already tried with pylint earlier today ;)
http://lists.open-bio.org/pipermail/biopython-dev/2010-February/007354.html




From eric.talevich at gmail.com  Wed Feb 24 03:11:25 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 23 Feb 2010 22:11:25 -0500
Subject: [Biopython-dev] Running pylint over Biopython
In-Reply-To: <320fb6e01002230943g474dc673x43f7df120e679d91@mail.gmail.com>
References: <320fb6e01002230943g474dc673x43f7df120e679d91@mail.gmail.com>
Message-ID: <3f6baf361002231911g65b47cbfw3625f9aacf863abc@mail.gmail.com>

2010/2/23 Peter <biopython at maubp.freeserve.co.uk>

> Hi all,
>
> Those following @Biopython on twitter or subscribed to the github RSS
> feed for our repository will know this already, but I've been using
> pylint today to spot some errors in Biopython.
> http://www.logilab.org/project/pylint
>
> This was prompted by Eric trying this on Bio.PDB for Bug 3013 and
> finding some issues - thank Eric, this was a valuable suggestion.
>
Glad I could help. :)


> Eric, I don't have all the dependencies installed by pylint does
> appear to dislike a few things in Bio.Phylo on the trunk:
>

Pylint hates the way I wrote Bio.Phylo, in particular the way TreeMixin
assumes it will be mixed with a class that has 'root' and 'is_terminal'
attributes, and the __dict__ hack in the PhyloXML class __init__ methods --
it can't figure out where the attributes are coming from.

The last error was real, and I've pushed a fix to the trunk. Thanks for
catching it.

> One thing this exercise has shown is that we still need to do some
> work on the unit test coverage.
>

Agreed. I also added a unit test for get_alignment (finally), and should get
to TreeMixin.prune and .split soon. Then Bio.Phylo will have essentially
100% unit test coverage.

Cheers
-Eric


From biopython at maubp.freeserve.co.uk  Wed Feb 24 07:41:18 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 24 Feb 2010 07:41:18 +0000
Subject: [Biopython-dev] Running pylint over Biopython
In-Reply-To: <3f6baf361002231911g65b47cbfw3625f9aacf863abc@mail.gmail.com>
References: <320fb6e01002230943g474dc673x43f7df120e679d91@mail.gmail.com>
	<3f6baf361002231911g65b47cbfw3625f9aacf863abc@mail.gmail.com>
Message-ID: <320fb6e01002232341n3ee397basddde348df86d4871@mail.gmail.com>

On Wed, Feb 24, 2010 at 3:11 AM, Eric Talevich <eric.talevich at gmail.com> wrote:
> 2010/2/23 Peter <biopython at maubp.freeserve.co.uk>
>
>> Hi all,
>>
>> Those following @Biopython on twitter or subscribed to the github RSS
>> feed for our repository will know this already, but I've been using
>> pylint today to spot some errors in Biopython.
>> http://www.logilab.org/project/pylint
>>
>> This was prompted by Eric trying this on Bio.PDB for Bug 3013 and
>> finding some issues - thank Eric, this was a valuable suggestion.
>>
>> Glad I could help. :)

Re-reading Bug 3013, we might also want to try PyChecker
as suggested by Hongbo Zhu - I've not used that before.

>> Eric, I don't have all the dependencies installed by pylint does
>> appear to dislike a few things in Bio.Phylo on the trunk:
>
> Pylint hates the way I wrote Bio.Phylo, in particular the way TreeMixin
> assumes it will be mixed with a class that has 'root' and 'is_terminal'
> attributes, and the __dict__ hack in the PhyloXML class __init__ methods --
> it can't figure out where the attributes are coming from.

Some of the "apparent false positives" I was ignoring related to
the iterator classes in Bio.SeqIO, again this seems to be valid
code which pylint can't cope with. We may want to follow up
on this (it could be a bug in pylint?).

That said, if you can think of a cleaner way to code your bits
that might be advantageous for long term maintenance. Maybe
just add a TODO comment to consider using Abstract Base
Classes once we require Python 2.6+ for Biopython (if that
looks suitable)?
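
A minimal sketch of what the Abstract Base Class idea could look like.
Everything here is hypothetical (the class names, the 'children'
attribute, the count_terminals method) and it is not the Bio.Phylo code;
it also uses the Python 3 spelling (abc.ABC) rather than the
__metaclass__ form that Python 2.6 would need. The point is only that
the mixin declares the 'root' and 'is_terminal' members it relies on, so
both readers and static checkers can see where they come from.

```python
import abc


class TreeMixinSketch(abc.ABC):
    """Hypothetical stand-in for a tree mixin; not the Bio.Phylo code."""

    @property
    @abc.abstractmethod
    def root(self):
        """The node that tree-walking methods start from."""

    @abc.abstractmethod
    def is_terminal(self):
        """Return True if this node has no children."""

    def count_terminals(self):
        # Shared behaviour built on root and is_terminal; the 'children'
        # list is an extra assumption made for this sketch.
        stack, total = [self.root], 0
        while stack:
            node = stack.pop()
            if node.is_terminal():
                total += 1
            else:
                stack.extend(node.children)
        return total


class Node(TreeMixinSketch):
    """Concrete class supplying what the mixin declares."""

    def __init__(self, children=()):
        self.children = list(children)

    @property
    def root(self):
        return self

    def is_terminal(self):
        return not self.children


tree = Node([Node(), Node([Node(), Node()])])
print(tree.count_terminals())
```

A class that forgets to provide 'root' or 'is_terminal' then fails at
instantiation time with a TypeError instead of at the first method call.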

> The last error was real, and I've pushed a fix to the trunk.
> Thanks for catching it.

Cool.

>> One thing this exercise has shown is that we still need
>> to do some work on the unit test coverage.
>
> Agreed. I also added a unit test for get_alignment (finally),
> and should get to TreeMixin.prune and .split soon. Then
> Bio.Phylo will have essentially 100% unit test coverage.

I didn't mean to single out just Bio.Phylo - I meant the whole
of Biopython would benefit from more unit tests. In particular,
a lot of the "minor" errors pylint helped me fix were in error
messages (e.g. wrong variable name used). This means if
a user hit the error, rather than the exception we wanted to
raise they'd get an error about our message. So, not critical,
but it suggests we need more tests to cover the exceptions
(as well as the more important tests to cover typical usage).
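
As an illustration of the kind of test meant here, a small sketch using
the standard unittest module. The parse_count function is made up for
the example and is not part of Biopython; the tests simply force each
error path to run, so a typo inside a message (such as a misspelt
variable name) would surface as a test failure rather than in a user's
session.

```python
import unittest


def parse_count(text):
    """Hypothetical helper: parse a non-negative integer from text."""
    try:
        value = int(text)
    except ValueError:
        # A wrong variable name in this message would raise NameError
        # here instead of the intended ValueError.
        raise ValueError("Expected an integer, got %r" % text)
    if value < 0:
        raise ValueError("Count must not be negative, got %i" % value)
    return value


class ParseCountErrors(unittest.TestCase):
    """Exercise the error paths, including the message formatting."""

    def test_not_an_integer(self):
        with self.assertRaises(ValueError) as caught:
            parse_count("abc")
        self.assertIn("'abc'", str(caught.exception))

    def test_negative(self):
        with self.assertRaises(ValueError) as caught:
            parse_count("-3")
        self.assertIn("-3", str(caught.exception))


# Run the tests without calling sys.exit(), so this works in a session.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParseCountErrors)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```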

Peter


From p.j.a.cock at googlemail.com  Wed Feb 24 07:43:48 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 24 Feb 2010 07:43:48 +0000
Subject: [Biopython-dev] Medium/long term plans
Message-ID: <320fb6e01002232343s2df80990s96774b44f942e851@mail.gmail.com>

Hi all,

As mentioned in other recent threads, Brad and I were in Tokyo
earlier this month for the DBCLS BioHackathon 2010 (see
http://hackathon3.dbcls.jp/ for details). While there, we met up
with Michiel for an informal dinner meeting, and discussed some
possible plans for Biopython.

=== Short term action points ===

Seq object equality, see:
http://lists.open-bio.org/pipermail/biopython-dev/2010-February/007351.html

Filenames or handles in SeqIO, AlignIO, etc, see:
http://lists.open-bio.org/pipermail/biopython-dev/2010-February/007352.html

=== Medium term action points ===

Python 3 support. With NumPy starting to make serious plans for
supporting Python 3 this year, we should be able to look at doing this
too. Initially we will continue to focus on Python 2.x, but make more
effort to ensure that we can run without issues in the "Python 3
warning mode" available in Python 2.6 (or 2.7 once that is out).
Then start to put Biopython through 2to3, and see how we get on.

Name space reorganisation for sequences. It would be nice to
have the Seq objects, SeqFeature, SeqRecord and probably
SeqUtils and SeqIO all under one module name. We may be
able to handle this in the short term with two import routes with
the old module names discouraged and eventually deprecated.
See also the "Code review request for phyloxml branch" thread
which covered some of this:
http://lists.open-bio.org/pipermail/biopython-dev/2010-January/007215.html
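
One way the "two import routes" idea might be sketched, shown at the
level of a single function rather than a whole module. All names below
(reverse_complement, OldModule, NewModule, _deprecated_alias) are
placeholders invented for the example, not a proposal for actual
Biopython names: the old route keeps working but emits a
DeprecationWarning pointing at the new one.

```python
import warnings


def reverse_complement(seq):
    """Stand-in for a function living under the new module name."""
    table = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(table[base] for base in reversed(seq))


def _deprecated_alias(func, old_name, new_name):
    """Wrap func so calls via the old route emit a DeprecationWarning."""
    def wrapper(*args, **kwargs):
        warnings.warn(
            "%s is deprecated; use %s instead" % (old_name, new_name),
            DeprecationWarning,
            stacklevel=2,
        )
        return func(*args, **kwargs)
    wrapper.__name__ = func.__name__
    wrapper.__doc__ = func.__doc__
    return wrapper


# What a module at the old location might export during the transition.
old_reverse_complement = _deprecated_alias(
    reverse_complement,
    "OldModule.reverse_complement",
    "NewModule.reverse_complement",
)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = old_reverse_complement("AACG")

print(result, [str(w.message) for w in caught])
```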

=== Long term action points ===

There are things in Biopython that with hindsight we feel have not
worked out so well (module naming, alphabet objects) where
change may require a break, i.e. a Biopython version two. Should
we start a wiki to record points of debate, and get people to list their
niggles/faults for consideration?

Regarding Python 3.x support and a possible Biopython 2.x see
also Guido's blog post (there is probably an email version on one
of the python mailing lists too):
http://www.artima.com/weblogs/viewpost.jsp?thread=227041

Peter


From biopython at maubp.freeserve.co.uk  Wed Feb 24 11:52:55 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 24 Feb 2010 11:52:55 +0000
Subject: [Biopython-dev] [Biopython] SeqIO.index improvement suggestions
In-Reply-To: <320fb6e00912220808w53485af8s801e5a24666d9627@mail.gmail.com>
References: <4B2BB938.5030709@igc.gulbenkian.pt>
	<320fb6e00912181339o1a5c4100w6f1957fd4d78d20d@mail.gmail.com>
	<4B2C12B0.9060806@igc.gulbenkian.pt>
	<320fb6e00912190157m151c1b49t59b776c5130dad22@mail.gmail.com>
	<3f6baf360912191442m1ceb36afw824437f703dfaad0@mail.gmail.com>
	<320fb6e00912201006k5fbfebe4rb61e0538578e6ad@mail.gmail.com>
	<320fb6e00912220734r197e4baanac78c9188a33ddce@mail.gmail.com>
	<320fb6e00912220808w53485af8s801e5a24666d9627@mail.gmail.com>
Message-ID: <320fb6e01002240352p622f842bo140b1210d76fde4c@mail.gmail.com>

On Tue, Dec 22, 2009 at 4:08 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>
> The gzip mode issue is interesting... running on the Mac,
> Leopard 10.5, using the Apple provided Python 2.5.2,
> looking at a gzipped QUAL file everything is fine:
>
> Python 2.5.2 (r252:60911, Feb 22 2008, 07:57:53)
> [GCC 4.0.1 (Apple Computer, Inc. build 5363)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
>>>> import gzip
>>>> gzip.open("Quality/example.qual.gz", "r").read()
> ...
>
> Looking at a gzipped FASTA file everything is fine:
> ...
>
> But, there is a problem with my gzipped FASTQ file:
>
>>>> gzip.open("Quality/example.fastq.gz", "r").read()
> '@EAS54_6_R1_2_1_413_324\nCCCTTCTTGTCTTCAGCGTTTCTCC\n+\n;;3;;;;;;;;;;;;7;;;;;;;88\n@EAS54_6_R1_2_1_540_792\nTTGGCAGGCCAAGGCCGATGGATCA\n+\n;;;;;;;;;;;7;;;;;-;;;3;83\n@EAS54_6_R1_2_1_443_348\nGTTGCTTCTGGCGTGGGTGGGGGGG\n+\n;;;;;;;;;;;9;7;;.7;393333\n'
>>>> gzip.open("Quality/example.fastq.gz", "rb").read()
> '@EAS54_6_R1_2_1_413_324\nCCCTTCTTGTCTTCAGCGTTTCTCC\n+\n;;3;;;;;;;;;;;;7;;;;;;;88\n@EAS54_6_R1_2_1_540_792\nTTGGCAGGCCAAGGCCGATGGATCA\n+\n;;;;;;;;;;;7;;;;;-;;;3;83\n@EAS54_6_R1_2_1_443_348\nGTTGCTTCTGGCGTGGGTGGGGGGG\n+\n;;;;;;;;;;;9;7;;.7;393333\n'
>>>> gzip.open("Quality/example.fastq.gz", "rU").read()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/gzip.py",
> line 220, in read
>     self._read(readsize)
>   File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/gzip.py",
> line 292, in _read
>     self._read_eof()
>   File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/gzip.py",
> line 311, in _read_eof
>     raise IOError, "CRC check failed"
> IOError: CRC check failed
>
> I may have stumbled on a bug in the Python gzip library :(
>

Prompted by a thread on the BioPerl mailing list, I revisited this issue:
http://lists.open-bio.org/pipermail/bioperl-l/2010-February/032359.html

In some cross-platform testing, I always seem to get the CRC error
when trying to open this gzipped FASTQ file in universal newlines mode.
The FASTA and QUAL files seem fine.

According to the gzip python module's documentation, it uses the zlib
module, and you can find the underlying version number like this:

>>> import zlib
>>> zlib.ZLIB_VERSION
'1.2.3'

Results from testing the simple examples above (using Python
and the gzip module only):

[1] Mac OS X 10.5, Python 2.5.2, GCC 4.0.1, zlib 1.2.3 - fails
[2] Linux, Python 2.4.3, GCC 3.4.5, zlib 1.2.1.2 - fails
[3] Linux, Python 2.3.4, GCC 3.4.6, zlib 1.2.1.2 - fails
[3] Linux, Python 2.6.1, GCC 3.4.6, zlib 1.2.1.2 - fails
[4] Linux, Python 2.4.3, GCC 4.1.2, zlib 1.2.3 - fails
[4] Linux, Python 2.6.1, GCC 3.4.6, zlib 1.2.3 - fails
[5] Windows XP 32bit, Python 2.7a1, MSC v.1500, zlib 1.2.3 - fails
[5] Windows XP 32bit, Python 2.6, MSC v.1500, zlib 1.2.3 - fails
[5] Windows XP 32bit, Python 2.5.2, MSC v.1310, zlib 1.2.3 - fails
[5] Windows XP 32bit, Python 2.4.4, MSC v.1310, zlib 1.2.3 - fails
[5] Windows XP 32bit, Python 2.3.5, MSC v.1200, zlib 1.1.4 - fails

[1] My mac, [2] Local server, [3] Cluster head, [4] Cluster node, [5]
My windows box

This tells me that the failure isn't OS specific, and isn't specific to
a particular version of Python or zlib. Note that on the Mac and Linux
machines where I get the CRC failure in python, the command line tool
gunzip can decompress the files fine.

If anyone else wants to test this (to confirm I'm not missing anything
obvious), you can download the gzipped files from github here:
wget http://github.com/peterjc/biopython/raw/index-zip/Tests/Quality/example.qual.gz
wget http://github.com/peterjc/biopython/raw/index-zip/Tests/Quality/example.fasta.gz
wget http://github.com/peterjc/biopython/raw/index-zip/Tests/Quality/example.fastq.gz

Maybe this mode isn't fully supported in gzip? I think that provided we
assume that any gzipped text file will use Unix new lines, we don't need
to worry about this.

Peter


From biopython at maubp.freeserve.co.uk  Wed Feb 24 12:00:18 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 24 Feb 2010 12:00:18 +0000
Subject: [Biopython-dev] Running pylint over Biopython
In-Reply-To: <6d941f121002230956h582acc11r59cb18bca8d2f727@mail.gmail.com>
References: <320fb6e01002230943g474dc673x43f7df120e679d91@mail.gmail.com>
	<6d941f121002230956h582acc11r59cb18bca8d2f727@mail.gmail.com>
Message-ID: <320fb6e01002240400k11764b2al2438d5381ed335c4@mail.gmail.com>

2010/2/23 Tiago Antão <tiagoantao at gmail.com>:
> This comes at a good time: I've actually been making changes to the
> code (the genepop parser is not able to handle big files and I've
> had quite a few complaints about that). It seems to be 2.6 related or
> so, because I've detected the Config problem myself. I will correct
> this next week (this week is _impossible_), along with an update to
> the genepop parser to support big files.

Sounds good :)

Peter


From biopython at maubp.freeserve.co.uk  Wed Feb 24 12:37:20 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 24 Feb 2010 12:37:20 +0000
Subject: [Biopython-dev] test_PhyloXML.py failing on Windows
Message-ID: <320fb6e01002240437s4be4aef1q82bee9141acfb501@mail.gmail.com>

Hi Eric,

Do you have access to a Windows machine for testing? There
seem to be two issues in the PhyloXML tests (tested on
Python 2.5, 2.6 and 2.7a1 on Windows XP):

Count and confirm the number of tags in each example XML file. ... FAIL
Round-trip parsing and serialization of apaf.xml. ... ERROR
Round-trip parsing and serialization of bcl_2.xml. ... ERROR
Round-trip parsing and serialization of o_tol_332_d_dollo.xml. ... ERROR
Round-trip parsing and serialization of made_up.xml. ... ERROR
Round-trip parsing and serialization of phyloxml_examples.xml. ... ERROR

The tag count error I don't immediately understand:

======================================================================
FAIL: Count and confirm the number of tags in each example XML file.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\repositories\biopython_official\Tests\test_PhyloXML.py",
line 56, in test_dump_tags
    self.assertEquals(len(output.readlines()), count)
AssertionError: 301 != 289

----------------------------------------------------------------------

The rest all fail in _stash_rewrite_and_call where something about
your file renaming is failing. It looks like you deliberately move some
of your example XML files to a temp filename during the test and
then move them back. This seems risky (e.g. if the test suite is
stopped mid way). Can you rework this to write the output to a
temp file or perhaps better yet a StringIO handle?
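
A rough sketch of the in-memory round trip suggested here. It uses the
standard library ElementTree directly rather than the Bio.Phylo parser
and writer, and the tiny XML document is made up (real phyloXML files
carry a namespace, omitted for brevity). The serialized output goes to
a BytesIO buffer, so no example file is renamed or overwritten and an
interrupted test run leaves the data files untouched.

```python
import io
import xml.etree.ElementTree as ET

# Made-up miniature document standing in for an example XML file.
EXAMPLE_XML = b"""<phyloxml>
  <phylogeny rooted="true">
    <clade><name>A</name></clade>
  </phylogeny>
</phyloxml>"""


def round_trip(xml_bytes):
    """Parse XML, serialize it to memory, and parse that output again."""
    first = ET.parse(io.BytesIO(xml_bytes))
    buffer = io.BytesIO()
    first.write(buffer)  # nothing on disk is touched
    buffer.seek(0)
    second = ET.parse(buffer)
    return first, second


def tag_list(tree):
    """List element tags in document order."""
    return [element.tag for element in tree.iter()]


first, second = round_trip(EXAMPLE_XML)
print(tag_list(first) == tag_list(second))
```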

The errors look like this:

======================================================================
ERROR: Round-trip parsing and serialization of apaf.xml.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_PhyloXML.py", line 561, in test_apaf
    (TreeTests, ['test_DomainArchitecture']),
  File "test_PhyloXML.py", line 546, in _stash_rewrite_and_call
    os.rename(fname, fname + '~')
WindowsError: [Error 183] Cannot create a file when that file already exists

----------------------------------------------------------------------

Thanks,

Peter


From bugzilla-daemon at portal.open-bio.org  Wed Feb 24 15:38:13 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 24 Feb 2010 10:38:13 -0500
Subject: [Biopython-dev] [Bug 2998] Document need XCode with 10.4 SDK for
	Mac OS
In-Reply-To: <bug-2998-42@http.bugzilla.open-bio.org/>
Message-ID: <201002241538.o1OFcDJ4005667@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2998


------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk  2010-02-24 10:38 EST -------
Just to add a note, on Snow Leopard Apple provides python 2.5 (default, 32bit
only) and python 2.6 (supports 64 bit).

I suspect if you install Biopython under python 2.6 you won't need the 10.4
SDK... something to check?




From rjalves at igc.gulbenkian.pt  Wed Feb 24 16:07:01 2010
From: rjalves at igc.gulbenkian.pt (Renato Alves)
Date: Wed, 24 Feb 2010 16:07:01 +0000
Subject: [Biopython-dev] [Biopython] SeqIO.index improvement suggestions
In-Reply-To: <320fb6e01002240352p622f842bo140b1210d76fde4c@mail.gmail.com>
References: <4B2BB938.5030709@igc.gulbenkian.pt>	
	<320fb6e00912181339o1a5c4100w6f1957fd4d78d20d@mail.gmail.com>	
	<4B2C12B0.9060806@igc.gulbenkian.pt>	
	<320fb6e00912190157m151c1b49t59b776c5130dad22@mail.gmail.com>	
	<3f6baf360912191442m1ceb36afw824437f703dfaad0@mail.gmail.com>	
	<320fb6e00912201006k5fbfebe4rb61e0538578e6ad@mail.gmail.com>	
	<320fb6e00912220734r197e4baanac78c9188a33ddce@mail.gmail.com>	
	<320fb6e00912220808w53485af8s801e5a24666d9627@mail.gmail.com>
	<320fb6e01002240352p622f842bo140b1210d76fde4c@mail.gmail.com>
Message-ID: <4B854EA5.7050100@igc.gulbenkian.pt>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Quoting Peter on 02/24/2010 11:52 AM:

> Maybe this mode isn't fully supported in gzip? I think that provided we
> assume that any gzipped text file will use Unix new lines, we don't need
> to worry about this.

Your example puzzled me, so I did a few more tests with the files you
pointed out. It turns out that the fastq file is read 'badly' even with
the plain open() in 'Universal' mode. This doesn't happen with the other
files:

Python 2.6.4 [GCC 4.4.1] Linux

>>> open('example.fastq.gz', 'rb').read() == open('example.fastq.gz',
'rU').read()
False
>>> open('example.fasta.gz', 'rb').read() == open('example.fasta.gz',
'rU').read()
True
>>> open('example.qual.gz', 'rb').read() == open('example.qual.gz',
'rU').read()
True

In particular the character at fault seems to be:

>>> (open('example.fastq.gz', 'rb').read()[145],
open('example.fastq.gz', 'rU').read()[145])
('\r', '\n')

This is the only thing that changed.
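
One possible reading of that byte difference, offered as an assumption
rather than something established in this thread: if newline
translation is applied to the raw compressed bytes, a single '\r'
becoming '\n' is enough to make a checksum over the data disagree. The
sketch below only simulates this on an arbitrary byte string (it does
not read example.fastq.gz, and gzip's own CRC is computed over the
decompressed data), so it illustrates the mechanism, not the actual
failure.

```python
import zlib

# Arbitrary binary payload containing one carriage return byte; it stands
# in for compressed data and is not the real file content.
payload = bytes(range(256))


def translate_newlines(data):
    """Mimic universal newline translation applied to raw bytes."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


translated = translate_newlines(payload)

# Positions where the translated copy differs from the original.
changed = [i for i, (a, b) in enumerate(zip(payload, translated)) if a != b]

print(changed)
print(zlib.crc32(payload) == zlib.crc32(translated))
```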

After looking through the content of the file a little, I found this workaround:

$ gunzip example.fastq.gz && echo >> example.fastq && gzip example.fastq

Which simply adds a new empty line to the end of the file.

>>> open('example.fastq.gz', 'rb').read() == open('example.fastq.gz',
'rU').read()
True

After this I also looked into python3 (3.1.1) just in case they fixed it
already and apparently they did. See for yourself:

This was tested in Python-3.1.1 from within blender2.5, (apologies for
that, it was the only python3 version I had around).

>>> open('example.fastq.gz','rb').read() ==
open('example.fastq.gz','rU').read()
Traceback (most recent call last):
(...)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1:
unexpected code byte

Seems like I need to force binary mode...

>>> open('example.fastq.gz','rb').read() ==
open('example.fastq.gz','rbU').read()
True

Success!

>>> import gzip
>>> gzip.open('example.fastq.gz','rb').read()
b'@EAS54_6_R1_2_1_413_324\nCCCTTCTTGTCTTCAGCGTTTCTCC\n+\n;;3;;;;;;;;;;;;7;;;;;;;88\n@EAS54_6_R1_2_1_540_792\nTTGGCAGGCCAAGGCCGATGGATCA\n+\n;;;;;;;;;;;7;;;;;-;;;3;83\n@EAS54_6_R1_2_1_443_348\nGTTGCTTCTGGCGTGGGTGGGGGGG\n+\n;;;;;;;;;;;9;7;;.7;393333\n'

>>> gzip.open('example.fastq.gz','rU').read()
b'@EAS54_6_R1_2_1_413_324\nCCCTTCTTGTCTTCAGCGTTTCTCC\n+\n;;3;;;;;;;;;;;;7;;;;;;;88\n@EAS54_6_R1_2_1_540_792\nTTGGCAGGCCAAGGCCGATGGATCA\n+\n;;;;;;;;;;;7;;;;;-;;;3;83\n@EAS54_6_R1_2_1_443_348\nGTTGCTTCTGGCGTGGGTGGGGGGG\n+\n;;;;;;;;;;;9;7;;.7;393333\n'

>>> gzip.open('example.fastq.gz','rbU').read()
b'@EAS54_6_R1_2_1_413_324\nCCCTTCTTGTCTTCAGCGTTTCTCC\n+\n;;3;;;;;;;;;;;;7;;;;;;;88\n@EAS54_6_R1_2_1_540_792\nTTGGCAGGCCAAGGCCGATGGATCA\n+\n;;;;;;;;;;;7;;;;;-;;;3;83\n@EAS54_6_R1_2_1_443_348\nGTTGCTTCTGGCGTGGGTGGGGGGG\n+\n;;;;;;;;;;;9;7;;.7;393333\n'

And everything works as expected.

So unless the blender devs changed python to fix this bug, this has been
fixed in python3.

Should this go upstream?

- --
Renato
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkuFTqAACgkQYh11EUYTX9TXbgCgmBDKrrjL6Eue8qRfgs2ydAUQ
11kAnR0beVQDLP4ldBcd2RFfJ5Q+Opo6
=MLu3
-----END PGP SIGNATURE-----


From biopython at maubp.freeserve.co.uk  Wed Feb 24 16:48:58 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 24 Feb 2010 16:48:58 +0000
Subject: [Biopython-dev] [Biopython] SeqIO.index improvement suggestions
In-Reply-To: <4B854EA5.7050100@igc.gulbenkian.pt>
References: <4B2BB938.5030709@igc.gulbenkian.pt>
	<320fb6e00912181339o1a5c4100w6f1957fd4d78d20d@mail.gmail.com>
	<4B2C12B0.9060806@igc.gulbenkian.pt>
	<320fb6e00912190157m151c1b49t59b776c5130dad22@mail.gmail.com>
	<3f6baf360912191442m1ceb36afw824437f703dfaad0@mail.gmail.com>
	<320fb6e00912201006k5fbfebe4rb61e0538578e6ad@mail.gmail.com>
	<320fb6e00912220734r197e4baanac78c9188a33ddce@mail.gmail.com>
	<320fb6e00912220808w53485af8s801e5a24666d9627@mail.gmail.com>
	<320fb6e01002240352p622f842bo140b1210d76fde4c@mail.gmail.com>
	<4B854EA5.7050100@igc.gulbenkian.pt>
Message-ID: <320fb6e01002240848t37b0bea4jc8488440ab266ac6@mail.gmail.com>

On Wed, Feb 24, 2010 at 4:07 PM, Renato Alves <rjalves at igc.gulbenkian.pt> wrote:
>
> After this I also looked into python3 (3.1.1) just in case they fixed it
> already and apparently they did. See for yourself:

You seem to be right, I tried this on Windows using Python 3.0.1 and 3.1.1,

C:\repositories\biopython_pjc\Tests>c:\python30\python
Python 3.0.1 (r301:69561, Feb 13 2009, 20:04:18) [MSC v.1500 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import gzip
>>> gzip.open("Quality\example.fastq.gz", "r").read()
b'@EAS54_6_R1_2_1_413_324\nCCCTTCTTGTCTTCAGCGTTTCTCC\n+\n;;3;;;;;;;;;;;;7;;;;;;;
88\n@EAS54_6_R1_2_1_540_792\nTTGGCAGGCCAAGGCCGATGGATCA\n+\n;;;;;;;;;;;7;;;;;-;;;
3;83\n@EAS54_6_R1_2_1_443_348\nGTTGCTTCTGGCGTGGGTGGGGGGG\n+\n;;;;;;;;;;;9;7;;.7;
393333\n'
>>> gzip.open("Quality\example.fastq.gz", "rb").read()
b'@EAS54_6_R1_2_1_413_324\nCCCTTCTTGTCTTCAGCGTTTCTCC\n+\n;;3;;;;;;;;;;;;7;;;;;;;
88\n@EAS54_6_R1_2_1_540_792\nTTGGCAGGCCAAGGCCGATGGATCA\n+\n;;;;;;;;;;;7;;;;;-;;;
3;83\n@EAS54_6_R1_2_1_443_348\nGTTGCTTCTGGCGTGGGTGGGGGGG\n+\n;;;;;;;;;;;9;7;;.7;
393333\n'
>>> gzip.open("Quality\example.fastq.gz", "rU").read()
b'@EAS54_6_R1_2_1_413_324\nCCCTTCTTGTCTTCAGCGTTTCTCC\n+\n;;3;;;;;;;;;;;;7;;;;;;;
88\n@EAS54_6_R1_2_1_540_792\nTTGGCAGGCCAAGGCCGATGGATCA\n+\n;;;;;;;;;;;7;;;;;-;;;
3;83\n@EAS54_6_R1_2_1_443_348\nGTTGCTTCTGGCGTGGGTGGGGGGG\n+\n;;;;;;;;;;;9;7;;.7;
393333\n'


C:\repositories\biopython_pjc\Tests>c:\python31\python
Python 3.1.1 (r311:74483, Aug 17 2009, 17:02:12) [MSC v.1500 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import gzip
>>> gzip.open("Quality\example.fastq.gz", "r").read()
b'@EAS54_6_R1_2_1_413_324\nCCCTTCTTGTCTTCAGCGTTTCTCC\n+\n;;3;;;;;;;;;;;;7;;;;;;;
88\n@EAS54_6_R1_2_1_540_792\nTTGGCAGGCCAAGGCCGATGGATCA\n+\n;;;;;;;;;;;7;;;;;-;;;
3;83\n@EAS54_6_R1_2_1_443_348\nGTTGCTTCTGGCGTGGGTGGGGGGG\n+\n;;;;;;;;;;;9;7;;.7;
393333\n'
>>> gzip.open("Quality\example.fastq.gz", "rb").read()
b'@EAS54_6_R1_2_1_413_324\nCCCTTCTTGTCTTCAGCGTTTCTCC\n+\n;;3;;;;;;;;;;;;7;;;;;;;
88\n@EAS54_6_R1_2_1_540_792\nTTGGCAGGCCAAGGCCGATGGATCA\n+\n;;;;;;;;;;;7;;;;;-;;;
3;83\n@EAS54_6_R1_2_1_443_348\nGTTGCTTCTGGCGTGGGTGGGGGGG\n+\n;;;;;;;;;;;9;7;;.7;
393333\n'
>>> gzip.open("Quality\example.fastq.gz", "rU").read()
b'@EAS54_6_R1_2_1_413_324\nCCCTTCTTGTCTTCAGCGTTTCTCC\n+\n;;3;;;;;;;;;;;;7;;;;;;;
88\n@EAS54_6_R1_2_1_540_792\nTTGGCAGGCCAAGGCCGATGGATCA\n+\n;;;;;;;;;;;7;;;;;-;;;
3;83\n@EAS54_6_R1_2_1_443_348\nGTTGCTTCTGGCGTGGGTGGGGGGG\n+\n;;;;;;;;;;;9;7;;.7;
393333\n'

So this does look like a Python 2.x bug which has been fixed in Python
3.x, and we should probably report this (after searching to see if it
is a known issue).

However, even if it is fixed in Python 2.6.x and 2.7.x, it won't get
fixed in older versions like Python 2.4 or 2.5.

Peter


From biopython at maubp.freeserve.co.uk  Wed Feb 24 17:03:09 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 24 Feb 2010 17:03:09 +0000
Subject: [Biopython-dev] [Biopython] SeqIO.index improvement suggestions
In-Reply-To: <320fb6e01002240848t37b0bea4jc8488440ab266ac6@mail.gmail.com>
References: <4B2BB938.5030709@igc.gulbenkian.pt>
	<4B2C12B0.9060806@igc.gulbenkian.pt>
	<320fb6e00912190157m151c1b49t59b776c5130dad22@mail.gmail.com>
	<3f6baf360912191442m1ceb36afw824437f703dfaad0@mail.gmail.com>
	<320fb6e00912201006k5fbfebe4rb61e0538578e6ad@mail.gmail.com>
	<320fb6e00912220734r197e4baanac78c9188a33ddce@mail.gmail.com>
	<320fb6e00912220808w53485af8s801e5a24666d9627@mail.gmail.com>
	<320fb6e01002240352p622f842bo140b1210d76fde4c@mail.gmail.com>
	<4B854EA5.7050100@igc.gulbenkian.pt>
	<320fb6e01002240848t37b0bea4jc8488440ab266ac6@mail.gmail.com>
Message-ID: <320fb6e01002240903m52629576vf85f428f68d32d15@mail.gmail.com>

Hi all,

I've updated my branch to cope with gzipped FASTQ files, tested on
Windows XP, Mac OS X Snow Leopard, and Linux:

http://github.com/peterjc/biopython/tree/index-zip

This works by just opening gzipped files in default mode - which
seems to be fine with the examples (FASTA, QUAL and FASTQ)
where the text file in the archive uses Unix new line entries.

While this may be a good solution, we should test on gzipped
files containing Windows new lines too. Plus of course, try
non-gzipped compression. And very large files. etc.
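
A small sketch of the Windows new line check mentioned above. It
assumes a current Python 3 (where gzip.open accepts "rt"), not the
Python 2.x versions tested in this thread, and the FASTQ record is made
up. The archive is written with "\r\n" line endings, then read back
both as raw bytes and as text, where the translation happens only after
decompression.

```python
import gzip
import os
import tempfile

# Made-up FASTQ record with Windows (CRLF) line endings.
RECORD = "@read1\r\nACGT\r\n+\r\n;;;;\r\n"

fd, path = tempfile.mkstemp(suffix=".fastq.gz")
os.close(fd)
try:
    # Write the bytes exactly as given so the archive contains "\r\n".
    with gzip.open(path, "wb") as handle:
        handle.write(RECORD.encode("ascii"))

    # Binary mode returns the decompressed bytes untouched.
    with gzip.open(path, "rb") as handle:
        raw = handle.read()

    # Text mode decompresses first and only then translates line endings.
    with gzip.open(path, "rt", encoding="ascii") as handle:
        lines = handle.read().splitlines()
finally:
    os.remove(path)

print(raw.count(b"\r\n"), lines)
```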

Peter


From eric.talevich at gmail.com  Wed Feb 24 17:03:31 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Wed, 24 Feb 2010 12:03:31 -0500
Subject: [Biopython-dev] test_PhyloXML.py failing on Windows
In-Reply-To: <320fb6e01002240437s4be4aef1q82bee9141acfb501@mail.gmail.com>
References: <320fb6e01002240437s4be4aef1q82bee9141acfb501@mail.gmail.com>
Message-ID: <3f6baf361002240903q2b395fd1qa5426a130b5b3d61@mail.gmail.com>

On Wed, Feb 24, 2010 at 7:37 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> Hi Eric,
>
> Do you have access to a Windows machine for testing? There
> seem to be two issues in the PhyloXML tests (tested on
> Python 2.5, 2.6 and 2.7a1 on Windows XP):
>

I'll have access to Windows XP this weekend, but I think I can probably fix
these tests before then.

> ======================================================================
> FAIL: Count and confirm the number of tags in each example XML file.
> ----------------------------------------------------------------------
>

This was an early sanity check for parsing XML with ElementTree, and while I
don't see a good reason for the number of lines to be different between OSes
(line endings?), the test isn't Biopython-specific anyway. I'll just delete
it.

> ======================================================================
> ERROR: Round-trip parsing and serialization of apaf.xml.
> ----------------------------------------------------------------------
>

Apparently Windows doesn't like renaming a file to replace another existing
file. To fix this error asap I'll call os.remove before the rename, but
you're right that these tests should be rewritten to use named temp files or
StringIO. (I needed to trick unittest into re-running the parser tests on
re-written files and this sufficed last summer)
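
A minimal sketch of that stop-gap, with a hypothetical helper name and
throwaway files standing in for the real test data:

```python
import os
import tempfile


def rename_over(src, dst):
    """Rename src to dst, replacing dst if it already exists."""
    # On Windows os.rename() fails when dst exists, so remove it first.
    if os.path.exists(dst):
        os.remove(dst)
    os.rename(src, dst)


# Two throwaway files: an "original" and a stashed "rewritten" copy.
tmp_dir = tempfile.mkdtemp()
old = os.path.join(tmp_dir, "apaf.xml")
new = os.path.join(tmp_dir, "apaf.xml~")
for name, text in ((old, "original"), (new, "rewritten")):
    with open(name, "w") as handle:
        handle.write(text)

rename_over(new, old)
with open(old) as handle:
    content = handle.read()
```

The remove-then-rename pair is not atomic. On Python 3.3 and later,
os.replace() does the same job in one call.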

-Eric


From rjalves at igc.gulbenkian.pt  Wed Feb 24 17:13:41 2010
From: rjalves at igc.gulbenkian.pt (Renato Alves)
Date: Wed, 24 Feb 2010 17:13:41 +0000
Subject: [Biopython-dev] [Biopython] SeqIO.index improvement suggestions
In-Reply-To: <320fb6e01002240848t37b0bea4jc8488440ab266ac6@mail.gmail.com>
References: <4B2BB938.5030709@igc.gulbenkian.pt>	
	<320fb6e00912181339o1a5c4100w6f1957fd4d78d20d@mail.gmail.com>	
	<4B2C12B0.9060806@igc.gulbenkian.pt>	
	<320fb6e00912190157m151c1b49t59b776c5130dad22@mail.gmail.com>	
	<3f6baf360912191442m1ceb36afw824437f703dfaad0@mail.gmail.com>	
	<320fb6e00912201006k5fbfebe4rb61e0538578e6ad@mail.gmail.com>	
	<320fb6e00912220734r197e4baanac78c9188a33ddce@mail.gmail.com>	
	<320fb6e00912220808w53485af8s801e5a24666d9627@mail.gmail.com>	
	<320fb6e01002240352p622f842bo140b1210d76fde4c@mail.gmail.com>	
	<4B854EA5.7050100@igc.gulbenkian.pt>
	<320fb6e01002240848t37b0bea4jc8488440ab266ac6@mail.gmail.com>
Message-ID: <4B855E45.9080708@igc.gulbenkian.pt>

>So this does look like a Python 2.x bug which has been fixed in Python
>3.x, and we should probably report this (after searching to see if it
>is a known issue).

The closest I could find is: http://bugs.python.org/issue5148

But it's also on gzip.open(), not plain open().

>However, even if it is fixed in Python 2.6.x and 2.7.x, it won't get
>fixed in older versions like Python 2.4 or 2.5.

Do you think raising a warning if the 'U' mode is explicitly passed would
be a reasonable solution for older Python versions?

Renato


From biopython at maubp.freeserve.co.uk  Wed Feb 24 17:28:58 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 24 Feb 2010 17:28:58 +0000
Subject: [Biopython-dev] [Biopython] SeqIO.index improvement suggestions
In-Reply-To: <4B855E45.9080708@igc.gulbenkian.pt>
References: <4B2BB938.5030709@igc.gulbenkian.pt>
	<320fb6e00912190157m151c1b49t59b776c5130dad22@mail.gmail.com>
	<3f6baf360912191442m1ceb36afw824437f703dfaad0@mail.gmail.com>
	<320fb6e00912201006k5fbfebe4rb61e0538578e6ad@mail.gmail.com>
	<320fb6e00912220734r197e4baanac78c9188a33ddce@mail.gmail.com>
	<320fb6e00912220808w53485af8s801e5a24666d9627@mail.gmail.com>
	<320fb6e01002240352p622f842bo140b1210d76fde4c@mail.gmail.com>
	<4B854EA5.7050100@igc.gulbenkian.pt>
	<320fb6e01002240848t37b0bea4jc8488440ab266ac6@mail.gmail.com>
	<4B855E45.9080708@igc.gulbenkian.pt>
Message-ID: <320fb6e01002240928h54519628pfb91dd1bf8d9c1f7@mail.gmail.com>

On Wed, Feb 24, 2010 at 5:13 PM, Renato Alves <rjalves at igc.gulbenkian.pt> wrote:
>>So this does look like a Python 2.x bug which has been fixed in Python
>>3.x, and we should probably report this (after searching to see if it
>>is a known issue).
>
> The closest I could find is: http://bugs.python.org/issue5148
>
> But it's also on gzip.open(), not plain open().

It is gzip.open() that we have a problem with, open() is fine.

It does look like http://bugs.python.org/issue5148 and/or the
linked bug http://bugs.python.org/issue6759 cover this issue.
Thanks for finding them.

>>However, even if it is fixed in Python 2.6.x and 2.7.x, it won't get
>>fixed in older versions like Python 2.4 or 2.5.
>
> Do you think raising a warning if the 'U' mode is explicitly passed
> would be a reasonable solution for older Python versions?

Are you asking about what I would like Python to do?
I would like gzip.open() to support universal newline mode.

For Biopython's index function we currently don't allow the
user to specify the mode at all - the code decides this based
on the file format (SFF files must be binary, for text files I use
universal newline mode).
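
Roughly, the idea is something like the sketch below. The names are
hypothetical and this is not the actual Biopython code; it only shows
a mode being chosen from the format name:

```python
# Assumption: SFF is the only binary format mentioned in this thread.
BINARY_FORMATS = {"sff"}


def choose_mode(format_name):
    """Pick a file mode from the format name (hypothetical helper)."""
    if format_name.lower() in BINARY_FORMATS:
        return "rb"
    # "rU" is the Python 2 spelling of universal newline mode. Python 3
    # text mode translates new lines by default and later dropped the flag.
    return "rU"


sff_mode = choose_mode("sff")
fasta_mode = choose_mode("fasta")
```

Keeping the choice inside the library means the caller cannot pick a
mode that the parser for that format cannot handle.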

Peter


From rjalves at igc.gulbenkian.pt  Wed Feb 24 18:25:04 2010
From: rjalves at igc.gulbenkian.pt (Renato Alves)
Date: Wed, 24 Feb 2010 18:25:04 +0000
Subject: [Biopython-dev] [Biopython] SeqIO.index improvement suggestions
In-Reply-To: <320fb6e01002240928h54519628pfb91dd1bf8d9c1f7@mail.gmail.com>
References: <4B2BB938.5030709@igc.gulbenkian.pt>	
	<320fb6e00912190157m151c1b49t59b776c5130dad22@mail.gmail.com>	
	<3f6baf360912191442m1ceb36afw824437f703dfaad0@mail.gmail.com>	
	<320fb6e00912201006k5fbfebe4rb61e0538578e6ad@mail.gmail.com>	
	<320fb6e00912220734r197e4baanac78c9188a33ddce@mail.gmail.com>	
	<320fb6e00912220808w53485af8s801e5a24666d9627@mail.gmail.com>	
	<320fb6e01002240352p622f842bo140b1210d76fde4c@mail.gmail.com>	
	<4B854EA5.7050100@igc.gulbenkian.pt>	
	<320fb6e01002240848t37b0bea4jc8488440ab266ac6@mail.gmail.com>	
	<4B855E45.9080708@igc.gulbenkian.pt>
	<320fb6e01002240928h54519628pfb91dd1bf8d9c1f7@mail.gmail.com>
Message-ID: <4B856F00.7030201@igc.gulbenkian.pt>

> For Biopython's index function we currently don't allow the
> user to specify the mode at all - the code decides this based
> on the file format (SFF files must be binary, for text files I use
> universal newline mode).

For some reason I thought the user could set the mode.
Anyway, thanks for the clarification.

Renato


From bugzilla-daemon at portal.open-bio.org  Thu Feb 25 13:35:04 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 25 Feb 2010 08:35:04 -0500
Subject: [Biopython-dev] [Bug 2837] Reading Roche 454 SFF sequence read
	files in Bio.SeqIO
In-Reply-To: <bug-2837-42@http.bugzilla.open-bio.org/>
Message-ID: <201002251335.o1PDZ4qn013099@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2837


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #8 from biopython-bugzilla at maubp.freeserve.co.uk  2010-02-25 08:35 EST -------
Marking as fixed since I recently merged this code into the trunk.




From biopython at maubp.freeserve.co.uk  Thu Feb 25 14:29:19 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 25 Feb 2010 14:29:19 +0000
Subject: [Biopython-dev] test_PhyloXML.py failing on Windows
In-Reply-To: <3f6baf361002240903q2b395fd1qa5426a130b5b3d61@mail.gmail.com>
References: <320fb6e01002240437s4be4aef1q82bee9141acfb501@mail.gmail.com>
	<3f6baf361002240903q2b395fd1qa5426a130b5b3d61@mail.gmail.com>
Message-ID: <320fb6e01002250629te597954v46308838faca607e@mail.gmail.com>

On Wed, Feb 24, 2010 at 5:03 PM, Eric Talevich <eric.talevich at gmail.com> wrote:
>
> Apparently Windows doesn't like renaming a file to replace another existing
> file. To fix this error asap I'll call os.remove before the rename, ...

I had to add another similar check before it would run on my machine.

> but you're right that these tests should be rewritten to use named temp
> files or StringIO. (I needed to trick unittest into re-running the parser
> tests on re-written files and this sufficed last summer)

OK, something for the TODO list. Should we file a bug to remind us?

Peter


From biopython at maubp.freeserve.co.uk  Fri Feb 26 13:09:46 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 26 Feb 2010 13:09:46 +0000
Subject: [Biopython-dev] ImportWarning is new on Python 2.5
Message-ID: <320fb6e01002260509k65e9f9acm494c80a76af8d25e@mail.gmail.com>

Hi Eric,

I've just been running the test suite on Python 2.4 (on CentOS 5.4)
and noticed you use ImportWarning (which was added in Python 2.5) in
Bio/Phylo/PhyloXMLIO.py

Although we are going to phase out support for Python 2.4, we still
need to keep things compatible for now.

Are you happy to switch this to a different warning for now, and add a
TODO comment to put it back to an ImportWarning once we drop Python
2.4 support?

Thanks

Peter


From eric.talevich at gmail.com  Fri Feb 26 14:56:53 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Fri, 26 Feb 2010 09:56:53 -0500
Subject: [Biopython-dev] ImportWarning is new on Python 2.5
In-Reply-To: <320fb6e01002260509k65e9f9acm494c80a76af8d25e@mail.gmail.com>
References: <320fb6e01002260509k65e9f9acm494c80a76af8d25e@mail.gmail.com>
Message-ID: <3f6baf361002260656n581a526dtc4a5374640f546ed@mail.gmail.com>

On Fri, Feb 26, 2010 at 8:09 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> Hi Eric,
>
> I've just been running the test suite on Python 2.4 (on CentOS 5.4)
> and noticed you use ImportWarning (which was added in Python 2.5) in
> Bio/Phylo/PhyloXMLIO.py
>
> Although we are going to phase out support for Python 2.4, we still
> need to keep things compatible for now.
>
> Are you happy to switch this to a different warning for now, and add a
> TODO comment to put it back to an ImportWarning once we drop Python
> 2.4 support?
>

Sure, I'll switch it to a generic Warning for now and leave a comment. I
doubt the type of the warning is very important for most uses.
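
A sketch of what that change could look like. The function name and the
message text are invented here and are not taken from PhyloXMLIO.py:

```python
import warnings


def warn_missing_parser():
    """Emit the fallback warning (hypothetical name and message)."""
    # TODO: change Warning back to ImportWarning once Python 2.4 support
    # is dropped (ImportWarning only exists from Python 2.5 onwards).
    warnings.warn("Optional XML parser not found; using a fallback.", Warning)


# Record the warning instead of printing it, so it can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_missing_parser()
```

One visible difference: ImportWarning is ignored by the default warning
filters, while a generic Warning is shown.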

-Eric


From bugzilla-daemon at portal.open-bio.org  Fri Feb 26 16:26:10 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 26 Feb 2010 11:26:10 -0500
Subject: [Biopython-dev] [Bug 2553] Adding SeqRecord objects to an alignment
	(append or extend)
In-Reply-To: <bug-2553-42@http.bugzilla.open-bio.org/>
Message-ID: <201002261626.o1QGQA1g028222@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2553


------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk  2010-02-26 11:26 EST -------
I've started a possible implementation of an improved multiple
sequence alignment object on a github branch:
http://github.com/peterjc/biopython/commits/alignment-obj

This already handles:
Bug 2553 - Adding SeqRecord objects to an alignment (append or extend)
Bug 2554 - Creating an Alignment from a list of SeqRecord objects

I also plan to cover:
Bug 2551 - Adding advanced __getitem__ e.g. align[1:2,5:-5]
Bug 2552 - Adding alignments 




From bugzilla-daemon at portal.open-bio.org  Fri Feb 26 16:26:43 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 26 Feb 2010 11:26:43 -0500
Subject: [Biopython-dev] [Bug 2554] Creating an Alignment from a list of
	SeqRecord objects
In-Reply-To: <bug-2554-42@http.bugzilla.open-bio.org/>
Message-ID: <201002261626.o1QGQhNF028283@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2554


------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk  2010-02-26 11:26 EST -------
I've started a possible implementation of an improved multiple
sequence alignment object on a github branch:
http://github.com/peterjc/biopython/commits/alignment-obj

This already handles:
Bug 2553 - Adding SeqRecord objects to an alignment (append or extend)
Bug 2554 - Creating an Alignment from a list of SeqRecord objects

I also plan to cover:
Bug 2551 - Adding advanced __getitem__ e.g. align[1:2,5:-5]
Bug 2552 - Adding alignments 




From bugzilla-daemon at portal.open-bio.org  Fri Feb 26 17:28:31 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 26 Feb 2010 12:28:31 -0500
Subject: [Biopython-dev] [Bug 2552] Adding alignments
In-Reply-To: <bug-2552-42@http.bugzilla.open-bio.org/>
Message-ID: <201002261728.o1QHSVob029960@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2552


------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk  2010-02-26 12:28 EST -------
I've started a possible implementation of an improved multiple
sequence alignment object on a github branch:
http://github.com/peterjc/biopython/commits/alignment-obj

This now handles:
Bug 2552 - Adding alignments (this bug)
Bug 2553 - Adding SeqRecord objects to an alignment (append or extend)
Bug 2554 - Creating an Alignment from a list of SeqRecord objects

I also plan to cover:
Bug 2551 - Adding advanced __getitem__ e.g. align[1:2,5:-5]
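
To illustrate the intended behaviour only, here is a toy class that uses
plain strings instead of SeqRecord objects. It is hypothetical and is not
the code on the branch. It also assumes that adding two alignments joins
them column-wise, and it supports only a (row slice, column slice) pair
in __getitem__:

```python
class ToyAlignment:
    """Toy stand-in for an alignment: equal-length rows of text."""

    def __init__(self, rows=()):
        # Creating an alignment from a list of rows (compare Bug 2554).
        self.rows = []
        self.extend(rows)

    def append(self, row):
        # Every row must match the alignment length (compare Bug 2553).
        if self.rows and len(row) != len(self.rows[0]):
            raise ValueError("Sequences must all be the same length")
        self.rows.append(row)

    def extend(self, rows):
        for row in rows:
            self.append(row)

    def __add__(self, other):
        # Assumed meaning of "adding alignments": join row by row.
        if len(self.rows) != len(other.rows):
            raise ValueError("Alignments must have the same number of rows")
        return ToyAlignment(a + b for a, b in zip(self.rows, other.rows))

    def __getitem__(self, index):
        # align[rows, cols]: slice the rows, then the columns of each row.
        row_index, col_index = index
        return ToyAlignment(row[col_index] for row in self.rows[row_index])


align = ToyAlignment(["ACGTACGTACGTAC", "ACGAACGTACGAAC"])
align.append("ACGTTCGTACGTTC")
sub = align[1:2, 5:-5]   # second row only, without the first and last 5 columns
joined = align + align
```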




From bugzilla-daemon at portal.open-bio.org  Sat Feb 27 18:24:03 2010
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 27 Feb 2010 13:24:03 -0500
Subject: [Biopython-dev] [Bug 3016] New: Change WriterTests in
	test_PhyloXML.py to use StringIO or temp files
Message-ID: <bug-3016-42@http.bugzilla.open-bio.org/>

http://bugzilla.open-bio.org/show_bug.cgi?id=3016

           Summary: Change WriterTests in test_PhyloXML.py to use StringIO
                    or temp files
           Product: Biopython
           Version: Not Applicable
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: minor
          Priority: P2
         Component: Unit Tests
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: eric.talevich at gmail.com


The method _stash_rewrite_and_call currently parses each of the example
phyloXML files, renames the parsed file to [filename]~, writes out another copy
(from the parsed data structure) using the original filename, re-runs the suite
of parser tests on the rewritten files, and finally renames the stashed copies
back to the original filenames. This is protected by a try-finally clause, but
could still fail to restore the original test files if the Python interpreter
is interrupted/killed. Moreover, the design is a little pathological, and could
be hard to maintain or extend later.

Redesign the writer tests to rewrite and test a copy of each original at some
location other than the original filename. Ideally, use StringIO to store the
copy; a named temporary file (see tempfile module) is also acceptable.
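
A sketch of the in-memory variant. The standard library's ElementTree
stands in for the Bio.Phylo parser and writer (the real tests would call
those instead), and the tiny XML sample is invented for illustration:

```python
import io
import xml.etree.ElementTree as ET

EXAMPLE = b"<phyloxml><phylogeny><clade><name>A</name></clade></phylogeny></phyloxml>"


def round_trip(source_bytes):
    """Parse XML, write it to an in-memory buffer, and parse that copy."""
    tree = ET.parse(io.BytesIO(source_bytes))
    buffer = io.BytesIO()
    tree.write(buffer)   # serialise into memory, not over the original file
    buffer.seek(0)       # rewind so the copy can be read back
    return tree, ET.parse(buffer)


original, copy = round_trip(EXAMPLE)
tags_before = [elem.tag for elem in original.iter()]
tags_after = [elem.tag for elem in copy.iter()]
```

Because nothing is renamed on disk, an interrupted test run cannot leave
the example files in a modified state.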

