From hoffman at ebi.ac.uk  Mon Mar  1 08:27:19 2004
From: hoffman at ebi.ac.uk (Michael Hoffman)
Date: Sat Mar  5 14:43:31 2005
Subject: [Biopython-dev] Bio.SeqIO.FASTA fix
Message-ID:

Bio.SeqIO.FASTA.FastaReader.next() will return a SeqRecord that has the id attribute set to either a list or a string, depending on how many words are in the definition line (a list if there is one word; a string if there is more than one word!). This is a fix so that it will always be a string. OK to check in?

Index: FASTA.py
===================================================================
RCS file: /home/repository/biopython/biopython/Bio/SeqIO/FASTA.py,v
retrieving revision 1.6
diff -u -r1.6 FASTA.py
--- FASTA.py	11 Apr 2003 20:04:54 -0000	1.6
+++ FASTA.py	1 Mar 2004 13:34:58 -0000
@@ -32,7 +32,7 @@
         # description. If there's only one word, it's the id.
         x = string.split(line[1:].rstrip(), None, 1)
         if len(x) == 1:
-            id = x
+            id = x[0]
             desc = ""
         else:
             id, desc = x

-- 
Michael Hoffman
European Bioinformatics Institute

From chapmanb at uga.edu  Mon Mar  1 13:52:13 2004
From: chapmanb at uga.edu (Brad Chapman)
Date: Sat Mar  5 14:43:31 2005
Subject: [Biopython-dev] Bio.SeqIO.FASTA fix
In-Reply-To:
References:
Message-ID: <20040301185213.GO24150@evostick.agtec.uga.edu>

Hi Michael;

> Bio.SeqIO.FASTA.FastaReader.next() will return a SeqRecord that has
> the id attribute set to either a list or a string, depending on how
> many words are in the definition line (a list if there is one word; a
> string if there is more than one word!). This is a fix so that it
> will always be a string.

Cool. I'm glad that you are using the SeqIO stuff -- I must admit that I don't use it much myself, and it's great to have someone looking after it. Please go ahead and check-in away (now, that can't be proper English phrasing) with impunity. Thanks -- always so glad to have the fixes.
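The list-versus-string behavior being fixed here is easy to see with plain string splitting; a minimal standalone sketch (the helper name `parse_defline` is invented for illustration, not part of Bio.SeqIO):

```python
# Mimic the definition-line parsing in Bio.SeqIO.FASTA.FastaReader.
# split(None, 1) always returns a LIST; the bug was assigning the whole
# one-element list to `id` instead of its first element.

def parse_defline(line):
    """Split a FASTA '>' line into (id, description) -- both strings."""
    x = line[1:].rstrip().split(None, 1)
    if len(x) == 1:
        return x[0], ""   # x[0], not x: keep id a plain string
    return x[0], x[1]

print(parse_defline(">seq1"))               # one word on the line
print(parse_defline(">seq1 some protein"))  # id plus description
```

With the unpatched `id = x`, the one-word case would hand back `["seq1"]` instead of `"seq1"`, so downstream code comparing or printing ids would see a list.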
Brad

From mcolosimo at mitre.org  Wed Mar 10 15:12:47 2004
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Sat Mar  5 14:43:31 2005
Subject: [Biopython-dev] Problem with MutableSeq
Message-ID: <4C6EBA4C-72CF-11D8-8096-000A95A5D8B2@mitre.org>

I've been banging my head against my monitor over this for a while. Here is the problem (using stuff from ).

I want to reverse my DNA Seq object, so I did this:

mut_seq = my_seq.tomutable()
mut_seq.reverse()
my_seq = mut_seq

I thought these behaved the same (silly me). Later on I translate it; however, I get a TypeError! I had to pull out the code to see what the hell was going on, because print my_seq looks fine.

The problem is that MutableSeq.data is an array whereas Seq.data is a real string. So when you do this:

s = my_seq.data
n = len(s)
for i in range(0, n-n%3, 3):
    print s[i:i+3]

for Seq it prints codons like CGC, but for MutableSeq it prints things like array('c', 'ACG') -- which is what causes the TypeError!!! It should be put in the docs that you need to call the lonely method .toseq() to get back a real sequence. Or change MutableSeq.data to MutableSeq.array_data and make MutableSeq.data a string.

From idoerg at burnham.org  Fri Mar 12 16:45:57 2004
From: idoerg at burnham.org (Iddo Friedberg)
Date: Sat Mar  5 14:43:31 2005
Subject: [Biopython-dev] reduced alphabets
Message-ID: <40522F95.40703@burnham.org>

Hi all,

I am thinking of incorporating reduced alphabets into biopython. Reduced (or redundant) alphabets are used to represent protein sequences in an alternative alphabet which lumps together several amino acids into one letter, based on physico-chemical traits. For example, all the aliphatics (I, L, V) are usually quite interchangeable, so many sequence studies lump them into one letter. We don't have that, do we?

This can also be applied to DNA, although I have only heard of a 4->2 reduction (to purines & pyrimidines), and it is usually less useful.
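The I/L/V lumping described above amounts to a simple letter-translation table; a toy sketch on plain strings (the grouping letter "J" is invented for illustration, not one of the published alphabets):

```python
# A toy reduced-alphabet translation: I, L and V collapse to one letter.
# The table below is illustrative only, not a published reduction scheme.
reduction_table = {"I": "J", "L": "J", "V": "J"}  # J = "aliphatic"

def reduce_string(seq, table):
    """Map each residue through the table; unlisted letters pass through."""
    return "".join(table.get(letter, letter) for letter in seq)

print(reduce_string("MIVLK", reduction_table))  # -> "MJJJK"
```

A real implementation would carry an Alphabet object along with the reduced string, which is what the proposed `reduce_sequence` does.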
You can see examples of reduced alphabets here:

http://viscose.ifg.uni-muenster.de/html/alphabets.html

I was thinking of making additions in two places:

1) In util.py I will add a function "reduce_sequence":

def reduce_sequence(seq, reduction_table, new_alphabet=None):
    """Given an amino-acid sequence, return it in reduced alphabet form,
    based on the letter-translation table passed.

    seq: a Seq.Seq type sequence
    reduction_table: a dictionary whose keys are the "from" alphabet,
    and values are the "to" alphabet"""
    if new_alphabet is None:
        new_alphabet = Alphabet.single_letter_alphabet
        new_alphabet.letters = ''
        for letter in reduction_table:
            new_alphabet.letters += letter
        new_alphabet.size = len(new_alphabet.letters)
    new_seq = Seq.Seq('', new_alphabet)
    for letter in seq:
        new_seq += reduction_table[letter]
    return new_seq

******************

2) In Bio.Alphabets I will:

2.1) add a module with some dictionaries mapping the 20- and 23-letter amino acid alphabets to "brand name" reduced alphabets, and

2.2) add another module, along the lines of IUPAC.py, with the brand-name alphabets as instances of SingleLetterAlphabet.

Comments, suggestions?

Thanks,

./I

-- 
Iddo Friedberg, Ph.D.
The Burnham Institute
10901 N. Torrey Pines Rd.
La Jolla, CA 92037 USA
Tel: +1 (858) 646 3100 x3516
Fax: +1 (858) 713 9930
http://ffas.ljcrf.edu/~iddo

From idoerg at burnham.org  Mon Mar 15 17:37:28 2004
From: idoerg at burnham.org (Iddo Friedberg)
Date: Sat Mar  5 14:43:31 2005
Subject: [Biopython-dev] why buffer in utils.count_monomers?
Message-ID: <40563028.2010905@burnham.org>

Hi,

I tried the following (CVS version from last Friday, Python 2.2):

>>> l = Seq.Seq('MINAIRTPDQRFSNLDQYPFSPNYLDDLPGYPGLRAHYLDEGNSDAEDVFLCLHGEPTWS', IUPAC.protein)
>>> utils.count_monomers(l)
Traceback (most recent call last):
  File "", line 1, in ?
  File "/home/iddo/biopy_cvs/biopython/Bio/utils.py", line 64, in count_monomers
    dict[c] = string.count(s, c)
  File "/usr/lib/python2.2/string.py", line 161, in count
    return s.count(*args)
AttributeError: 'buffer' object has no attribute 'count'

When I replaced the variable "s" in line 64 of utils.py with "seq.data", everything worked fine. In line 62, "s" is defined as:

s = buffer(seq.data)

Does that serve a purpose? Can we do without it (meaning I deposit the bugfix), or is it important?

thanks,

Iddo

-- 
Iddo Friedberg, Ph.D.
The Burnham Institute
10901 N. Torrey Pines Rd.
La Jolla, CA 92037 USA
Tel: +1 (858) 646 3100 x3516
Fax: +1 (858) 713 9930
http://ffas.ljcrf.edu/~iddo

From hoffman at ebi.ac.uk  Mon Mar 15 17:55:22 2004
From: hoffman at ebi.ac.uk (Michael Hoffman)
Date: Sat Mar  5 14:43:31 2005
Subject: [Biopython-dev] why buffer in utils.count_monomers?
In-Reply-To: <40563028.2010905@burnham.org>
References: <40563028.2010905@burnham.org>
Message-ID:

On Mon, 15 Mar 2004, Iddo Friedberg wrote:

> when I replaced the variable "s" in line 64 in utils.py with "seq.data"
> everything worked fine. In line 62 "s" is defined as:
> s=buffer(seq.data)
>
> Does that serve a purpose? Can we do without it, (meaning I deposit the
> bugfix) or is it important?

There is a comment that says that using a buffer makes the function work for both strings and arrays. Of course right now it works for neither...
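The fix Iddo describes -- counting on seq.data directly instead of through a buffer() wrapper -- reduces to ordinary str.count; a standalone sketch:

```python
def count_monomers(data, letters):
    """Tally each letter of the alphabet in the raw sequence string.

    Operating on the plain string (seq.data) sidesteps the buffer()
    wrapper, which has no .count() method and raised the AttributeError.
    """
    return dict((c, data.count(c)) for c in letters)

print(count_monomers("GATTACA", "ACGT"))
```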
-- Michael Hoffman European Bioinformatics Institute From idoerg at burnham.org Tue Mar 16 17:53:42 2004 From: idoerg at burnham.org (Iddo Friedberg) Date: Sat Mar 5 14:43:31 2005 Subject: [Biopython-dev] Reduced alphabets Message-ID: <40578576.60309@burnham.org> Hi, Thanks to overwhelming demand (well, nobody really objected ;) biopython now has the rudimentaries for handling reduced alphabets. I committed the following changes: 1) in Bio.utils I added reduce_sequence(seq, reduction_table, new_alphabet=None) 2) in Alphabet, I added Reduce.py, which has reduction tables, and reduced alphabet definitions + literature citations Enjoy, Iddo -- Iddo Friedberg, Ph.D. The Burnham Institute 10901 N. Torrey Pines Rd. La Jolla, CA 92037 USA Tel: +1 (858) 646 3100 x3516 Fax: +1 (858) 713 9930 http://ffas.ljcrf.edu/~iddo From chapmanb at uga.edu Wed Mar 17 20:13:41 2004 From: chapmanb at uga.edu (Brad Chapman) Date: Sat Mar 5 14:43:31 2005 Subject: [Biopython-dev] Problem with MutableSeq In-Reply-To: <4C6EBA4C-72CF-11D8-8096-000A95A5D8B2@mitre.org> References: <4C6EBA4C-72CF-11D8-8096-000A95A5D8B2@mitre.org> Message-ID: <20040318011341.GB99271@evostick.agtec.uga.edu> Hi Marc; Thanks for the feedback on the documentation and MutableSeq. Sorry for the delay in responding -- I've been out of town and am just getting myself back together. > I've been banging my head against my monitor over this for awhile. Here > is the problem (using stuff from > ) > > I want to reverse my DNA Seq object, so I did this: > > mut_seq = my_seq.tomutable() > mut_seq.reverse() > my_seq = mut_seq > > I thought these behaved the same (silly me). Later on I translate it, > however, I get a TypeError! > > I had to pull out the code to see what the hell was going on because > print my_seq looks fine. > > The problem is that MutableSeq.data is an array whereas Seq.data is > real data. So when you do this: [...] > which is a TypeError!!! 
This should be put in the doc that you need to > call the lonely method .toseq to get back a real sequence. Or change > MutableSeq.data to MutableSeq.array_data and make MutableSeq.data a > string. Yes, I definitely agree that this is confusing. When Andrew implemented MutableSeq it uses the array to represent the sequence instead of strings, as you correctly point out. This does confuse things because many of the functions that deal with sequences aren't set up to deal with both arrays and strings for the data object. I do think the real answer to fix the problem is to adjust the docs so they make this clear in the transition between the mutable seq part and the translation part. I am currently trying to re-do documentation into a more small sized cookbook format, to make it easier to maintain and update (and for people to contribute :-). I will try to put it on my list to pull out this section and update it to avoid this kind of problem. Sorry for the frustration and thanks for sharing your experience so we can make the documentation better. Brad From chapmanb at uga.edu Wed Mar 17 20:25:46 2004 From: chapmanb at uga.edu (Brad Chapman) Date: Sat Mar 5 14:43:31 2005 Subject: [Biopython-dev] Reduced alphabets In-Reply-To: <40578576.60309@burnham.org> References: <40578576.60309@burnham.org> Message-ID: <20040318012546.GD99271@evostick.agtec.uga.edu> Hi Iddo; > Thanks to overwhelming demand (well, nobody really objected ;) biopython > now has the rudimentaries for handling reduced alphabets. I committed > the following changes: > > 1) in Bio.utils I added > > reduce_sequence(seq, reduction_table, new_alphabet=None) > > 2) in Alphabet, I added Reduce.py, which has reduction tables, and > reduced alphabet definitions + literature citations Thanks for this. Sorry I didn't have a chance to weigh in earlier, but I was out of town (actually on Biopython business). 
Everything looks good -- my only suggestion would be to add a bit more documentation to the modules, specifically Alphabet/Reduced.py. I think just copying and pasting the relevant bits from your original e-mail to a doc-string at the top would be a real help for someone searching around and saying "wellllll...what do we have here." Other than that, all good. Thanks for the fix on count_monomers -- I do think that's the right thing to do. We should really discourage using MutableSeqs (which is where the array stuff comes from) on for anything besides, well, mutating them -- so this fix is fine. Thanks for the contribution! Brad From idoerg at burnham.org Wed Mar 17 20:52:08 2004 From: idoerg at burnham.org (Iddo Friedberg) Date: Sat Mar 5 14:43:31 2005 Subject: [Biopython-dev] Reduced alphabets In-Reply-To: <20040318012546.GD99271@evostick.agtec.uga.edu> References: <40578576.60309@burnham.org> <20040318012546.GD99271@evostick.agtec.uga.edu> Message-ID: <405900C8.6070106@burnham.org> Hi Brad, Document my code? Why do you think they call it "code"? OK, I'll do it. Tomorrow. After I recover from the green beer we are about to consume as part of a farewell party+St. Patrick's day lab night on the town.... ./I Brad Chapman wrote: > Hi Iddo; > > >>Thanks to overwhelming demand (well, nobody really objected ;) biopython >>now has the rudimentaries for handling reduced alphabets. I committed >>the following changes: >> >>1) in Bio.utils I added >> >> reduce_sequence(seq, reduction_table, new_alphabet=None) >> >>2) in Alphabet, I added Reduce.py, which has reduction tables, and >>reduced alphabet definitions + literature citations > > > Thanks for this. Sorry I didn't have a chance to weigh in earlier, > but I was out of town (actually on Biopython business). > > Everything looks good -- my only suggestion would be to add a bit > more documentation to the modules, specifically Alphabet/Reduced.py. 
> I think just copying and pasting the relevant bits from your > original e-mail to a doc-string at the top would be a real help for > someone searching around and saying "wellllll...what do we have > here." > > Other than that, all good. Thanks for the fix on count_monomers -- I > do think that's the right thing to do. We should really discourage > using MutableSeqs (which is where the array stuff comes from) on > for anything besides, well, mutating them -- so this fix is fine. > > Thanks for the contribution! > Brad > _______________________________________________ > Biopython-dev mailing list > Biopython-dev@biopython.org > http://biopython.org/mailman/listinfo/biopython-dev > > -- Iddo Friedberg, Ph.D. The Burnham Institute 10901 N. Torrey Pines Rd. La Jolla, CA 92037 USA Tel: +1 (858) 646 3100 x3516 Fax: +1 (858) 713 9930 http://ffas.ljcrf.edu/~iddo From jeffrey_chang at stanfordalumni.org Mon Mar 22 11:10:40 2004 From: jeffrey_chang at stanfordalumni.org (Jeffrey Chang) Date: Sat Mar 5 14:43:31 2005 Subject: [Biopython-dev] Fwd: Auto-discard notification Message-ID: <768C25C8-7C1B-11D8-B2FD-000A956845CE@stanfordalumni.org> Forwarding a message that was discarded by the spam filter... Jeff > From: Yair Benita > Date: March 22, 2004 9:35:50 AM EST > To: > Subject: Saving an HMM > > Hi All, > I have been playing with the HMM in biopython and I am happy to say it > actually works. However, it takes me about 30 minutes to train the > model and > I would like to save the trained HMM so that I can use it to predict > whenever I need to. Any idea how I can save the trained model? 
> Thanks, > Yair > -- > Yair Benita > Pharmaceutical Proteomics > Faculty of Pharmacy > Utrecht University From dyoo at hkn.eecs.berkeley.edu Mon Mar 22 21:18:22 2004 From: dyoo at hkn.eecs.berkeley.edu (Danny Yoo) Date: Sat Mar 5 14:43:31 2005 Subject: [Biopython-dev] Fwd: Auto-discard notification In-Reply-To: <768C25C8-7C1B-11D8-B2FD-000A956845CE@stanfordalumni.org> Message-ID: > > I have been playing with the HMM in biopython and I am happy to say it > > actually works. However, it takes me about 30 minutes to train the > > model and I would like to save the trained HMM so that I can use it to > > predict whenever I need to. Any idea how I can save the trained model? Hi Yair, I haven't tested this yet, but if the HMM is in pure Python (as it appears to be!), then the 'pickle' or 'shelve' modules from the Standard Library may be able to store the HMM. Here's a link to the documentation: http://www.python.org/doc/lib/module-shelve.html http://www.python.org/doc/lib/module-pickle.html Good luck! From bugzilla-daemon at portal.open-bio.org Mon Mar 22 21:55:24 2004 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org) Date: Sat Mar 5 14:43:31 2005 Subject: [Biopython-dev] [Bug 1605] New: kMeans.py should be deprecated Message-ID: <200403230255.i2N2tO12025207@portal.open-bio.org> http://bugzilla.bioperl.org/show_bug.cgi?id=1605 Summary: kMeans.py should be deprecated Product: Biopython Version: 1.24 Platform: All OS/Version: All Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev@biopython.org ReportedBy: jchang@biopython.org The functionality is present in Bio.Cluster, so this is now duplicated code. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
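Danny's pickle suggestion a little further up can be sketched as follows (TrainedModel here is a stand-in class, not the Biopython HMM type):

```python
import os
import pickle
import tempfile

class TrainedModel:
    """Stand-in for an expensively trained object such as an HMM."""
    def __init__(self, params):
        self.params = params

model = TrainedModel({"transition_prob": 0.9})
path = os.path.join(tempfile.gettempdir(), "hmm.pickle")

# Save once, right after the long training run...
with open(path, "wb") as f:
    pickle.dump(model, f)

# ...then reload instantly whenever predictions are needed.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored.params)
```

One caveat worth noting: a pickle stores a reference to the class, not the class itself, so the module defining the trained object must still be importable when the model is loaded back.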
From mcolosimo at mitre.org  Wed Mar 24 15:28:29 2004
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Sat Mar  5 14:43:31 2005
Subject: [Biopython-dev] BioSQL bugs
Message-ID:

First, I've added support for pgdb to DBUtils and did some testing; the diff is at the end.

Second, the fix for taxon doesn't work. The problem is that it tries to enter NULLs for fields that are required to be unique.

BioSQL.Loader, line 188:

parent_taxon_id = None
for taxon in lineage:
    self.adaptor.execute(
        "INSERT INTO taxon(parent_taxon_id, ncbi_taxon_id, node_rank,"\
        " left_value, right_value)" \
        " VALUES (%s, %s, %s, %s, %s)", (parent_taxon_id, taxon[0],
                                         taxon[1], left_value,
                                         right_value))

This might work the first time, but since parent_taxon and others need to be unique, this fails. I don't know a simple solution for this, except to give up and not put in a taxon_id (which isn't required for a bioentry).

Index: DBUtils.py
===================================================================
RCS file: /home/repository/biopython/biopython/BioSQL/DBUtils.py,v
retrieving revision 1.2
diff -r1.2 DBUtils.py
37c37
< class Pg_dbutils(Generic_dbutils):
---
> class Psycopg_dbutils(Generic_dbutils):
54c54,75
< _dbutils["psycopg"] = Pg_dbutils
---
> _dbutils["psycopg"] = Psycopg_dbutils
> 
> class Pgdb_dbutils(Generic_dbutils):
>     def next_id(self, cursor, table):
>         table = self.tname(table)
>         sql = r"select nextval('%s_pk_seq')" % table
>         cursor.execute(sql)
>         rv = cursor.fetchone()
>         return rv[0]
> 
>     def last_id(self, cursor, table):
>         table = self.tname(table)
>         sql = r"select currval('%s_pk_seq')" % table
>         cursor.execute(sql)
>         rv = cursor.fetchone()
>         return rv[0]
> 
>     def autocommit(self, conn, y = True):
>         raise NotImplementedError("pgdb does not support this!")
> 
> _dbutils["pgdb"] = Pgdb_dbutils

From chapmanb at uga.edu  Wed Mar 24 16:35:53 2004
From: chapmanb at uga.edu (Brad Chapman)
Date: Sat Mar  5 14:43:31 2005
Subject: [Biopython-dev] BioSQL bugs
In-Reply-To:
References:
Message-ID:
<20040324213553.GL22666@evostick.agtec.uga.edu> Hi Marc; > First, I've added support for pgdb to DBUtils and did some testing the > diff is at the end. Thanks. I've just checked your patch in. The only problem I have is with the autocommit functionality. I dug around on mailing lists and the like and do see that PyGreSQL doesn't support anything like this -- however, do you have any ideas to make the Tests work without this type of functionality. The problem (as far as I can see it right now) is that if a connection is opened then you can't do DROPs (or CREATE?). However if you don't have an open connection, then you can't execute SQL so you can't do the DROPs either. So I guess maybe it's a catch 22 that really only affects the tests (where we need to do this annoying dropping and creating automatically), but do you (or anything) have any clever ideas to work around this so that the Tests will work? > Second the fix for taxon doesn't work. The problem > is that it tries to enter NULLs for fields that are required to be > unique. > > BioSQL.Loader > line 188 parent_taxon_id = None > for taxon in lineage: > self.adaptor.execute( > "INSERT INTO taxon(parent_taxon_id, ncbi_taxon_id, > node_rank,"\ > " left_value, right_value)" \ > " VALUES (%s, %s, %s, %s, %s)", (parent_taxon_id, > taxon[0], > taxon[1], > left_value, > right_value)) > > This might work the first time, but since parent_taxon and other need > to be unique this fails. I don't know a simple solution for this, > except to give up and not put in a taxon_id (which isn't required for a > bioentry). Okay, I was playing around with this and fixed it for a problem I was having (with non-unique right_values) in an ugly way which I'm sure is not right. 
My real problem is I don't understand the table:

CREATE TABLE taxon (
       taxon_id          INT(10) UNSIGNED NOT NULL auto_increment,
       ncbi_taxon_id     INT(10),
       parent_taxon_id   INT(10) UNSIGNED,
       node_rank         VARCHAR(32),
       genetic_code      TINYINT UNSIGNED,
       mito_genetic_code TINYINT UNSIGNED,
       left_value        INT(10) UNSIGNED,
       right_value       INT(10) UNSIGNED,
       PRIMARY KEY (taxon_id),
       UNIQUE (ncbi_taxon_id),
       UNIQUE (left_value),
       UNIQUE (right_value)
) TYPE=INNODB;

Okay, so the problem is that I have no idea what parent_taxon_id, left_value and right_value are. I assume that they are supposed to represent some kind of hierarchy of taxonomy. As near as I can figure, if you have a tree like:

A -> B -> C -> D
          |
          --> E

then this table would be filled in for C with parent_taxon_id being B's taxon_id, left_value being D's taxon_id and right_value being E's taxon_id.

Is this right at all, or am I completely confused? I can take a stab at this, but without really getting the table I've been stumped so far and just stare at it scratching my head.

Thanks for the work on BioSQL. Sorry if I am a bit (a lot) confused about things at the moment.

Brad

From mcolosimo at mitre.org  Fri Mar 26 10:48:07 2004
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Sat Mar  5 14:43:31 2005
Subject: [Biopython-dev] BioSQL bugs
Message-ID: <406450B7.6060401@mitre.org>

>Hi Marc;
>
>> First, I've added support for pgdb to DBUtils and did some testing;
>> the diff is at the end.
>
>Thanks. I've just checked your patch in. The only problem I have is
>with the autocommit functionality. I dug around on mailing lists and
>the like and do see that PyGreSQL doesn't support anything like this
>-- however, do you have any ideas to make the Tests work without
>this type of functionality.
>
>The problem (as far as I can see it right now) is that if a
>connection is opened then you can't do DROPs (or CREATE?). However
>if you don't have an open connection, then you can't execute SQL so
>you can't do the DROPs either.
So I guess maybe it's a catch 22 that >really only affects the tests (where we need to do this annoying >dropping and creating automatically), but do you (or anything) have >any clever ideas to work around this so that the Tests will work? > > I think bioperl makes use of functions (see the biosqldb-pg.sql). I was thinking about adding some of these function calls to the DBUtils section to speed up the transactions. Removing some of the constraints will increase the speed as the database grows. This code works fine for small sets, but it quickly slows down (probably because of the checks). >>/ Second the fix for taxon doesn't work. The problem >/>/ is that it tries to enter NULLs for fields that are required to be >/>/ unique. >/>/ >/>/ BioSQL.Loader >/>/ line 188 parent_taxon_id = None >/>/ for taxon in lineage: >/>/ self.adaptor.execute( >/>/ "INSERT INTO taxon(parent_taxon_id, ncbi_taxon_id, >/>/ node_rank,"\ >/>/ " left_value, right_value)" \ >/>/ " VALUES (%s, %s, %s, %s, %s)", (parent_taxon_id, >/>/ taxon[0], >/>/ taxon[1], >/>/ left_value, >/>/ right_value)) >/>/ >/>/ This might work the first time, but since parent_taxon and other need >/>/ to be unique this fails. I don't know a simple solution for this, >/>/ except to give up and not put in a taxon_id (which isn't required for a >/>/ bioentry). >/ >Okay, I was playing around with this and fixed it for a problem I >was having (with non-unique right_values) in an ugly way which I'm >sure is not right. 
> >My real problem is I don't understand the table: > >CREATE TABLE taxon ( > taxon_id INT(10) UNSIGNED NOT NULL auto_increment, > ncbi_taxon_id INT(10), > parent_taxon_id INT(10) UNSIGNED, > node_rank VARCHAR(32), > genetic_code TINYINT UNSIGNED, > mito_genetic_code TINYINT UNSIGNED, > left_value INT(10) UNSIGNED, > right_value INT(10) UNSIGNED, > PRIMARY KEY (taxon_id), > UNIQUE (ncbi_taxon_id), > UNIQUE (left_value), > UNIQUE (right_value) >) TYPE=INNODB; > >Okay, so the problem is that I have no idea what parent_taxon_id, >left_value and right_value are. I assume that they are supposed to >represent some kind of heirarchy of taxonomy. As near as I can >figure if you have a tree like: > > These values are needed for nested-set representation . They are used to quickly limit a branch of a tree. Selecting on the values >= the left and <= the right gives you all the elements under that part of the tree. I don't think it would be easy to add a new element to the tree with out rebuilding the whole representation. Therefore, I just skip it and put in a null (and print out that it wasn't known). This needs to be fixed in the source of the data. >A -> B -> C -> D > | > --> E > >Then this table would be filled for C with parent_taxon_id to be B's >taxon_id, left_value to be D's taxon_id and right_value to be E's >taxon_id. > >Is this right at all or am I completely confused? I can take a hit >at this, but without really getting the table I've been stumped so >far and just stare at it scratching my head. > >Thanks for the work on BioSQL. Sorry if I am a bit (a lot) confused >about things at the moment. 
>Brad > From bugzilla-daemon at portal.open-bio.org Fri Mar 26 11:03:47 2004 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org) Date: Sat Mar 5 14:43:31 2005 Subject: [Biopython-dev] [Bug 1608] NCBIStandalone.py dies parsing long blastpgp -m6 output Message-ID: <200403261603.i2QG3l8t011835@portal.open-bio.org> http://bugzilla.bioperl.org/show_bug.cgi?id=1608 ------- Additional Comments From j.a.casbon@qmul.ac.uk 2004-03-26 11:03 ------- Created an attachment (id=121) --> (http://bugzilla.bioperl.org/attachment.cgi?id=121&action=view) sample blast output that causes crash This blast output causes the crash for me. ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Fri Mar 26 11:00:58 2004 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org) Date: Sat Mar 5 14:43:31 2005 Subject: [Biopython-dev] [Bug 1608] New: NCBIStandalone.py dies parsing long blastpgp -m6 output Message-ID: <200403261600.i2QG0wcF011745@portal.open-bio.org> http://bugzilla.bioperl.org/show_bug.cgi?id=1608 Summary: NCBIStandalone.py dies parsing long blastpgp -m6 output Product: Biopython Version: 1.24 Platform: All OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev@biopython.org ReportedBy: j.a.casbon@qmul.ac.uk When using: blastout = os.popen("zcat %s" % file) b_parser = NCBIStandalone.PSIBlastParser() b_iterator = NCBIStandalone.Iterator(blastout, b_parser) for b_record in b_iterator: to parse blast output in multiple alignment format (-m6), the parser dies on some files, and not others. It seems not to like longer files - but this is just my feeling. This is biopython 2.4 on debian linux unstable. 
I have blast output that produces this bug available at:

http://compbio.mds.qmw.ac.uk/~james/78700.blo.gz

Here is the stack trace:

  File "/home/james/exp/iss/shared_sequences", line 63, in get_seqs
    for b_record in b_iterator:
  File "/home/james/biopython-1.24/build/lib.linux-i686-2.3/Bio/Blast/NCBIStandalone.py", line 1332, in next
    return self._parser.parse(File.StringHandle(data))
  File "/home/james/biopython-1.24/build/lib.linux-i686-2.3/Bio/Blast/NCBIStandalone.py", line 571, in parse
    self._scanner.feed(handle, self._consumer)
  File "/home/james/biopython-1.24/build/lib.linux-i686-2.3/Bio/Blast/NCBIStandalone.py", line 97, in feed
    self._scan_rounds(uhandle, consumer)
  File "/home/james/biopython-1.24/build/lib.linux-i686-2.3/Bio/Blast/NCBIStandalone.py", line 152, in _scan_rounds
    self._scan_descriptions(uhandle, consumer)
  File "/home/james/biopython-1.24/build/lib.linux-i686-2.3/Bio/Blast/NCBIStandalone.py", line 272, in _scan_descriptions
    read_and_call_until(uhandle, consumer.description, blank=1)
  File "/home/james/biopython-1.24/build/lib.linux-i686-2.3/Bio/ParserSupport.py", line 340, in read_and_call_until
    method(line)
  File "/home/james/biopython-1.24/build/lib.linux-i686-2.3/Bio/Blast/NCBIStandalone.py", line 637, in description
    dh = self._parse(line)
  File "/home/james/biopython-1.24/build/lib.linux-i686-2.3/Bio/Blast/NCBIStandalone.py", line 694, in _parse
    dh.score = _safe_int(dh.score)
  File "/home/james/biopython-1.24/build/lib.linux-i686-2.3/Bio/Blast/NCBIStandalone.py", line 1602, in _safe_int
    return long(float(str))
ValueError: invalid literal for float(): RPVV----RDD-----R-------P----D----L-I-Y--R-----------T---MEG

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From iliketobicycle at yahoo.ca  Sat Mar 27 09:50:41 2004
From: iliketobicycle at yahoo.ca (Harry Zuzan)
Date: Sat Mar  5 14:43:31 2005
Subject: [Biopython-dev] documentation for modules: happydoc/epydoc ?
Message-ID: <20040327145041.96064.qmail@web21403.mail.yahoo.com>

Hi,

I'm trying to put some code in shape for submission to BioPython. It's for handling data from Affymetrix GeneChips. It efficiently handles both the DAT image files and the probe cell data, including the probe sequences.

The first thing that I want to do is take care of documentation. I'm not sure if I should be using happydoc or epydoc. I'm also not sure how to get both HTML and PDF documentation from the same source. Is the code in C and C++ modules documented in a separate way?

Best,
Harry Zuzan

From chapmanb at uga.edu  Tue Mar 30 18:06:45 2004
From: chapmanb at uga.edu (Brad Chapman)
Date: Sat Mar  5 14:43:31 2005
Subject: [Biopython-dev] documentation for modules: happydoc/epydoc ?
In-Reply-To: <20040327145041.96064.qmail@web21403.mail.yahoo.com>
References: <20040327145041.96064.qmail@web21403.mail.yahoo.com>
Message-ID: <20040330230645.GD29401@evostick.agtec.uga.edu>

Hi Harry;

> I'm trying to put some code in shape for submission to BioPython. It's
> for handling data from Affymetrix GeneChips. It efficiently handles
> both the DAT image files and the probe cell data, including the probe
> sequences.

Great! We could definitely use something like this. Thanks for writing it and thinking of submitting it.

> The first thing that I want to do is take care of
> documentation.
> I'm not sure if I should be using happydoc or epydoc.

Well, you don't really need to use either. These are just automated ways to extract documentation from the source code (module, class and function documentation) so that it is readable on the web. The important thing is to have good documentation in your code, and the tool will then extract it.
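Concretely, what these tools extract is just well-placed docstrings; a skeleton in the style pydoc/epydoc pick up (module and class names are invented for illustration):

```python
"""Handle Affymetrix GeneChip data files (illustrative skeleton only).

A module docstring like this one, plus the class and function
docstrings below, is exactly what tools like epydoc and pydoc
extract into browsable API documentation.
"""


class DatImage:
    """One DAT image file from a GeneChip scan."""

    def __init__(self, filename):
        """Initialize from the path to a .DAT file on disk."""
        self.filename = filename

    def dimensions(self):
        """Return the (rows, columns) of the image grid.

        This stub returns a fixed size; a real parser would read
        the DAT header.
        """
        return (0, 0)


print(DatImage.__doc__)  # the same text pydoc would display
```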
At the bottom of our guide for contributing:

http://www.biopython.org/docs/developer/contrib.html

there are some hints about writing documentation strings so that epydoc (which is what we are using now) can deal with them best.

> I'm also not
> sure how to get both html and pdf documentation from the same source.

This is a bit of a different question. If you'd like to write separate cookbook-style documents for how to use your code (which is definitely encouraged), you can write them in any format you like which is displayable on the web. Plain text or HTML (preferably simple, hand-editable HTML -- not the MS Word generated type) is fine. The PDF/HTML documentation is generated using LaTeX, along with HeVeA to make the HTML. But if you don't know LaTeX -- any way to write it that you like is great.

> Is the code in C and C++ modules documented in a separate way?

I'd just encourage good C commenting here. Not knowing how your modules work, I'd assume you have some Python code wrappers around the C code so that people don't access the C-written code directly. In that case, it's most important that the code itself is documented so that others can understand what you wrote and how to find or fix bugs, if any come up.

Thanks again and be sure to write if you have other questions!

Brad

From chapmanb at uga.edu  Tue Mar 30 19:31:37 2004
From: chapmanb at uga.edu (Brad Chapman)
Date: Sat Mar  5 14:43:31 2005
Subject: [Biopython-dev] BioSQL bugs
In-Reply-To: <406450B7.6060401@mitre.org>
References: <406450B7.6060401@mitre.org>
Message-ID: <20040331003137.GH29401@evostick.agtec.uga.edu>

Hi Marc;

[I ask how we can make tests work without autocommit]

> I think bioperl makes use of functions (see the biosqldb-pg.sql). I was
> thinking about adding some of these function calls to the DBUtils
> section to speed up the transactions. Removing some of the constraints
> will increase the speed as the database grows.
> This code works fine for small sets, but it quickly slows down
> (probably because of the checks).

That would be great -- honestly, I am not a database expert at all (as
you can probably tell from my mails). This seems like a good place to
start. I'd definitely appreciate more contributions along this line
from you, if you'd be willing to do more work on it.

[I'm confused about the taxon table as well]

> These values are needed for nested-set representation.
> They are used to quickly limit a branch of a tree. Selecting on the
> values >= the left and <= the right gives you all the elements under
> that part of the tree. I don't think it would be easy to add a new
> element to the tree without rebuilding the whole representation.
> Therefore, I just skip it and put in a null (and print out that it
> wasn't known). This needs to be fixed in the source of the data.

Thanks for the link. That makes good sense now -- it seems the intent
is to have the taxonomy information pre-loaded from taxon tables, and
then to link to the taxon table when loading records. I agree with you
-- I think the best way to handle it is to add functionality (maybe to
a mixin class that DBServer can derive from) to load taxon table
information into a database. Then, if this taxon information exists,
link to it; otherwise add nulls as you suggest.

Thanks for the explanations!
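[The left/right range check described above is easy to demonstrate
outside BioSQL. A toy sketch using Python's sqlite3 -- the table name,
columns, and values are made up for illustration, not the real BioSQL
taxon schema.]

```python
import sqlite3

# Toy taxon table with nested-set (left, right) values:
#   root(1,8) -> mammals(2,5) -> human(3,4); birds(6,7)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE taxon (name TEXT, left_value INT, right_value INT)")
rows = [("root", 1, 8), ("mammals", 2, 5), ("human", 3, 4), ("birds", 6, 7)]
conn.executemany("INSERT INTO taxon VALUES (?, ?, ?)", rows)

def subtree(conn, name):
    """All taxa under `name` (inclusive), via the left/right range check."""
    left, right = conn.execute(
        "SELECT left_value, right_value FROM taxon WHERE name = ?",
        (name,)).fetchone()
    return [n for (n,) in conn.execute(
        "SELECT name FROM taxon WHERE left_value >= ? AND right_value <= ?",
        (left, right))]

print(subtree(conn, "mammals"))  # contains 'mammals' and 'human' only
```

Note that this also shows why inserts are expensive: adding a node
means renumbering the left/right values of much of the table.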
Brad

From bugzilla-daemon at portal.open-bio.org  Tue Mar 30 20:04:39 2004
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Sat Mar  5 14:43:31 2005
Subject: [Biopython-dev] [Bug 1608] NCBIStandalone.py dies parsing long blastpgp -m6 output
Message-ID: <200403310104.i2V14dmk029750@portal.open-bio.org>

http://bugzilla.bioperl.org/show_bug.cgi?id=1608

------- Additional Comments From chapmanb@arches.uga.edu  2004-03-30 20:04 -------

Created an attachment (id=126)
 --> (http://bugzilla.bioperl.org/attachment.cgi?id=126&action=view)

Fix for the problem committed to revision 1.52 of NCBIStandalone

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org  Tue Mar 30 20:07:14 2004
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Sat Mar  5 14:43:31 2005
Subject: [Biopython-dev] [Bug 1608] NCBIStandalone.py dies parsing long blastpgp -m6 output
Message-ID: <200403310107.i2V17EfI029794@portal.open-bio.org>

http://bugzilla.bioperl.org/show_bug.cgi?id=1608

chapmanb@arches.uga.edu changed:

           What        |Removed     |Added
----------------------------------------------------------------------------
           Status      |NEW         |RESOLVED
           Resolution  |            |FIXED

------- Additional Comments From chapmanb@arches.uga.edu  2004-03-30 20:07 -------

James, thanks for the report. The problem was that sometimes the line:

    Sequences not found previously or not previously below threshold:

which the parser expected to be followed by one or more descriptions,
actually contains no descriptions (i.e. there are no more sequences to
find). In this case the parser kept trying to get descriptions, and it
hit the error you noted by trying to parse an alignment line as a
description. Fixes are checked into CVS and a patch is attached to the
bug. Thanks!

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
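[The failure mode above -- a section header that may be followed by
zero items -- is guarded against by letting the read loop return an
empty list. A sketch of the idea; this is not the actual
NCBIStandalone code, and the line-classification heuristics are
illustrative only.]

```python
def read_descriptions(lines):
    """Collect description lines that follow a section header.

    Returns as soon as a line no longer looks like a description, so a
    header followed immediately by an alignment block (i.e. zero
    descriptions) yields [] instead of a parse error.
    """
    descriptions = []
    for line in lines:
        # Heuristic for illustration: alignment records start with '>'
        # and sections are separated by blank lines.
        if line.startswith(">") or not line.strip():
            break
        descriptions.append(line.rstrip())
    return descriptions

# Zero descriptions: the header was immediately followed by an alignment.
assert read_descriptions([">gi|123|some_hit ..."]) == []
```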
From miehe at mail.ipk-gatersleben.de  Wed Mar 31 11:41:51 2004
From: miehe at mail.ipk-gatersleben.de (miehe@mail.ipk-gatersleben.de)
Date: Sat Mar  5 14:43:31 2005
Subject: [Biopython-dev] (no subject)
Message-ID: <1080751311.406af4cfe2d54@webmail.ipk-gatersleben.de>

Hello,

The parser expression for blastn defined in
Bio.expressions.blast.ncbiblast.py is broken for BLASTN 2.2.8
[Jan-05-2004] (and even for the older BLASTN 2.2.6 [Apr-09-2003]). The
output contains an additional 'hsp_info' section, which was not defined
in the blastn expression. This can be patched in one line. Here is the
diff against the current CVS version:

***************
*** 441,444 ****
--- 441,445 ----
      gap_penalties_stats +
      generic_info1 +
+     Opt(hsp_info) +
      generic_info2 +
      t_info +

With best regards,
Heiko
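[The fix wraps the new section in Opt(), Martel's way of marking a
sub-expression as optional (zero or one occurrences). The same idea in
plain regular expressions -- the line texts below are illustrative
stand-ins, not exact BLAST output.]

```python
import re

# The hsp_info block appears in newer BLASTN output but not in older
# versions; marking it optional lets one pattern parse both formats.
hsp_info = r"(?:Number of HSPs better than \d+ without gapping: \d+\n)?"

pattern = re.compile(
    r"Gap Penalties: Existence: \d+, Extension: \d+\n"
    + hsp_info
    + r"Number of Sequences: \d+\n"
)

old_output = ("Gap Penalties: Existence: 5, Extension: 2\n"
              "Number of Sequences: 10\n")
new_output = ("Gap Penalties: Existence: 5, Extension: 2\n"
              "Number of HSPs better than 10 without gapping: 0\n"
              "Number of Sequences: 10\n")

assert pattern.match(old_output) is not None
assert pattern.match(new_output) is not None
```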