From biopython-dev at maubp.freeserve.co.uk  Sat Dec 10 13:39:13 2005
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Sat Dec 10 14:12:22 2005
Subject: [Biopython-dev] Bio.Geo for NCBI's GEO microarray SOFT files
Message-ID: <439B20D1.2020707@maubp.freeserve.co.uk>

I've just been looking at the Bio.Geo module by Katharine Lindner,
contributed back in 2002, which should parse the NCBI's Gene Expression
Omnibus (GEO) microarray data files.

http://www.ncbi.nlm.nih.gov/geo/

Is anyone using Bio.Geo at the moment?

The NCBI seem to call these SOFT files (*.soft), and the format is
documented here:

http://www.ncbi.nlm.nih.gov/projects/geo/info/soft2.html#SOFTformat

Apparently in 2005 they began switching to a revised file format. New
format files are here:

ftp://ftp.ncbi.nih.gov/pub/geo/data/gds/soft_gz/

Old format files here:

ftp://ftp.ncbi.nih.gov/pub/geo/data/gds/soft_old/
ftp://ftp.ncbi.nih.gov/pub/geo/data/gds/soft_old_gz/

As far as I can tell, neither the "old" nor the "new" versions work in
Bio.Geo, so there may have been another format change between 2002 and 2005.

In addition the 2005 change introduces new lines, before and after the 
actual data:

!dataset_table_begin
!dataset_table_end

These are definitely not supported in the current Martel grammar for GEO 
files.
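
To illustrate what a parser has to do with these two marker lines, here is
a minimal sketch in plain Python. It is not the Martel grammar used by
Bio.Geo. It assumes the table rows sit between the two markers and are
tab-separated; the function name and the sample text are made up for
illustration.

```python
def extract_dataset_table(lines):
    """Return the rows found between the begin and end marker lines."""
    rows = []
    in_table = False
    for line in lines:
        line = line.rstrip("\r\n")
        if line == "!dataset_table_begin":
            in_table = True
        elif line == "!dataset_table_end":
            in_table = False
        elif in_table:
            rows.append(line.split("\t"))  # assumption: tab-separated columns
    return rows


# Made-up sample, not taken from a real GEO file.
sample = [
    "^DATASET = GDS_example",
    "!dataset_table_begin",
    "ID_REF\tIDENTIFIER\tVALUE",
    "1\tgeneA\t0.5",
    "!dataset_table_end",
]
table = extract_dataset_table(sample)
print(table)
```

Everything outside the markers is ignored here, whereas a real parser would
also need to handle the header lines.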

Peter

From ivan at biodec.com  Tue Dec 13 15:35:05 2005
From: ivan at biodec.com (Ivan Rossi)
Date: Tue Dec 13 15:39:46 2005
Subject: [Biopython-dev] tiny Align.AlignInfo patch
In-Reply-To: <Pine.LNX.4.61.0512131822270.8417@gorby.bo.biodec.com>
References: <Pine.LNX.4.61.0512131822270.8417@gorby.bo.biodec.com>
Message-ID: <Pine.LNX.4.61.0512132133100.10358@gorby.bo.biodec.com>


Dear BioPythoneers,
   I am submitting a tiny patch to the pos_specific_score_matrix method of 
Bio.Align.AlignInfo

It allows for the generation of PSSMs composed of the "alphabet+gap" symbols.
I use it all the time to generate 21-symbol PSSMs for proteins, which we use
as inputs for neural networks and HMMs.

The patch is not invasive at all and it preserves the default behavior of 
AlignInfo.pos_specific_score_matrix()
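
As a rough illustration of the idea (not the AlignInfo code itself), here is
a sketch in plain Python that counts symbols per alignment column with and
without the gap symbol. Apart from drop_gap_char, the function and argument
names are hypothetical, and the alignment is a toy example.

```python
def column_counts(sequences, alphabet, gap_char="-", drop_gap_char=True):
    """Count symbols per alignment column, optionally counting the gap."""
    symbols = alphabet if drop_gap_char else alphabet + gap_char
    counts = []
    for column in zip(*sequences):
        counts.append({s: column.count(s) for s in symbols})
    return counts


aligned = ["AC-G", "A--G", "ACTG"]  # toy alignment, three rows
without_gap = column_counts(aligned, "ACGT")
with_gap = column_counts(aligned, "ACGT", drop_gap_char=False)
print(sorted(with_gap[1].items()))
```

With drop_gap_char left at True the gap never appears as a key, which
mirrors the default behavior the patch preserves.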

I hope it will be considered for inclusion in the CVS.

Ivan

--
  Ivan Rossi, Ph.D. - ivan AT biodec dot com OR ivan dot rossi3 AT unibo dot it
  BioDec s.r.l., Via Fanin 48, I-40127 Bologna (Italy)
  Phone: +39-051-4200321 - fax: +39-051-4200317 - web: www.biodec.com
-------------- next part --------------
*** AlignInfo.py.orig	Tue Dec 13 18:09:22 2005
--- AlignInfo.py	Tue Dec 13 18:18:40 2005
***************
*** 335,341 ****
  
  
      def pos_specific_score_matrix(self, axis_seq = None,
!                                   chars_to_ignore = []):
          """Create a position specific score matrix object for the alignment.
  
          This creates a position specific score matrix (pssm) which is an
--- 335,342 ----
  
  
      def pos_specific_score_matrix(self, axis_seq = None,
!                                   chars_to_ignore = [],
!                                   drop_gap_char = True):
          """Create a position specific score matrix object for the alignment.
  
          This creates a position specific score matrix (pssm) which is an
***************
*** 348,353 ****
--- 349,357 ----
          put on the axis of the PSSM. This should be a Seq object. If nothing
          is specified, the consensus sequence, calculated with default
          parameters, will be used.
+         o drop_gap_char - An optional boolean parameter to specify if the gap
+         symbol has to be accounted for in the pssm. Useful to generate the
+         "alphabet+gap" PSSMs used by some remote-homology detection codes.
  
          Returns:
          o A PSSM (position specific score matrix) object.
***************
*** 355,363 ****
          # determine all of the letters we have to deal with
          all_letters = self.alignment._alphabet.letters
  
!         # if we have a gap char, add it to stuff to ignore
!         if isinstance(self.alignment._alphabet, Alphabet.Gapped):
!             chars_to_ignore.append(self.alignment._alphabet.gap_char)
          
          for char in chars_to_ignore:
              all_letters = string.replace(all_letters, char, '')
--- 359,368 ----
          # determine all of the letters we have to deal with
          all_letters = self.alignment._alphabet.letters
  
!         if drop_gap_char:
!             # if we have a gap char, add it to stuff to ignore
!             if isinstance(self.alignment._alphabet, Alphabet.Gapped):
!                 chars_to_ignore.append(self.alignment._alphabet.gap_char)
          
          for char in chars_to_ignore:
              all_letters = string.replace(all_letters, char, '')
From mdehoon at c2b2.columbia.edu  Tue Dec 13 15:43:40 2005
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Tue Dec 13 15:49:13 2005
Subject: [Biopython-dev] tiny Align.AlignInfo patch
Message-ID: <6CA15ADD82E5724F88CB53D50E61C9AE9ECDC5@cgcmail.cgc.cpmc.columbia.edu>

Hi Ivan,

Thanks for the patch. But could you submit it through bugzilla? Patches
posted to mailing lists tend to get lost. (They shouldn't, but it happens a
lot in practice).

Thanks again,

--Michiel.

Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032




From biopython-dev at maubp.freeserve.co.uk  Tue Dec 13 17:23:11 2005
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Tue Dec 13 17:20:29 2005
Subject: [Biopython-dev] Updates to the tutorial for parsing GenBank files
In-Reply-To: <43737060.4070006@maubp.freeserve.co.uk>
References: <43737060.4070006@maubp.freeserve.co.uk>
Message-ID: <439F49CF.7030006@maubp.freeserve.co.uk>

Are there any others on the list interested in parsing GenBank files who 
wouldn't mind proofreading/commenting on this change to the 
Tutorial/Cookbook?

i.e. Changes to this document, section 3.4 GenBank:

http://www.biopython.org/docs/tutorial/Tutorial004.html#toc13
http://www.biopython.org/docs/tutorial/Tutorial.pdf

The patch is on the mailing list archive here:

http://www.biopython.org/pipermail/biopython-dev/2005-November/002193.html

Or I could log a bug & attach the patch to it.

Would I be better off asking on the Discussion List, rather than the 
Development List for this sort of question?

Bonus question: where could I find multi-record GenBank files?

Peter

On 10 Nov 2005, I wrote:
> There should be a patch attached for Biopython Doc/Tutorial.tex which 
> tries to clarify GenBank parsing.
> 
> Created on Windows using:-
> 
> diff cvs_Tutorial.tex new_Tutorial.tex -E -Naur > patch.txt
> 
> In particular, I have tried to make it clear that GenBank.Iterator() and
> GenBank.index_file() are overkill/unnecessary when dealing with GenBank
> files which contain only a single record (which is the typical case in my
> personal experience).
> 
> My changes add an introductory example: parsing a small bacterial genome 
> (a single large GenBank record), before moving on to the 
> GenBank.Iterator() and GenBank.index_file() examples.
> 
> I have also pointed out that the multi-record example GenBank file used 
> in these examples (cor6_6.gb) is included in the downloadable BioPython 
> source code.
> 
> Plus there is a minor correction to the GenBank.index_file example, 
> len(gb_dict) gives 6, not 7.
> 
> Peter

From biopython-dev at maubp.freeserve.co.uk  Wed Dec 14 13:33:15 2005
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Wed Dec 14 13:39:36 2005
Subject: [Biopython-dev] Updates to the tutorial for parsing GenBank files
In-Reply-To: <FBD19F44-0D8B-439D-A184-4F21D5F7BE21@mitre.org>
References: <43737060.4070006@maubp.freeserve.co.uk>
	<439F49CF.7030006@maubp.freeserve.co.uk>
	<FBD19F44-0D8B-439D-A184-4F21D5F7BE21@mitre.org>
Message-ID: <43A0656B.8060200@maubp.freeserve.co.uk>

Marc Colosimo wrote:
> The patch looks good to me, but I could have missed something there. I
> forgot about the Discussion List. I really should join that list.

Motion seconded - any developer want to accept this?

> Also, I will probably be filing a bug on the Bio.Fasta documentation.
> There are two basic doc changes that should be made:
> 
> Under the doc for Fasta:
> RecordParser  Parses FASTA sequence data into a Record object <-  change 
> to a Fasta.Record object which is not the same as a Seq.Record

Sounds sensible

> Cookbooks:
> 
> Then maybe in the Cookbook, give an example of using
> Fasta.SequenceParser with title2ids. Without title2ids, you don't get a
> name or id. You only get the description, which is the title. Fasta.Record
> only has title, which maybe should be renamed to (deprecated in favour of)
> description to match the default behavior of SequenceParser.

I don't usually bother with the title2ids function either.

I agree that the fact that it's .title or .description depending on the
parser used (Fasta.RecordParser or Fasta.SequenceParser) is odd.

> It seems odd that the Fasta stuff is buried within Chapter 2 (2.4.3  
> Making it easier - plus it is missing "import string").

Yes, but I think it would be better to avoid using the string module 
completely, and use the split method of the string object instead:

from Bio import Fasta

def parseTitle2Ids(title):
      return title.split("|")[:3]

parser = Fasta.SequenceParser(title2ids = parseTitle2Ids)
file = open("ls_orchid.fasta")
iterator = Fasta.Iterator(file, parser)
...
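
To see what that split expression returns on its own, here is a small
standalone sketch that does not need Bio.Fasta. The title is an illustrative
string in the NCBI "gi|...|emb|..." style, which is an assumption about how
the titles in your file look.

```python
def parse_title_to_ids(title):
    # Same expression as above: keep the first three "|"-separated fields.
    return title.split("|")[:3]


# Illustrative title only (an assumption about the title layout).
title = "gi|2765658|emb|Z78533.1|CIZ78533 example description"
fields = parse_title_to_ids(title)
print(fields)
```

With a title in this style the first three fields are the database tags and
the GI number, so a different slice may be needed depending on which fields
you want as id, name and description.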


Peter

From mcolosimo at mitre.org  Wed Dec 14 14:01:22 2005
From: mcolosimo at mitre.org (Colosimo, Marc E.)
Date: Wed Dec 14 16:01:51 2005
Subject: [Biopython-dev] Updates to the tutorial for parsing GenBank files
In-Reply-To: <43A0656B.8060200@maubp.freeserve.co.uk>
Message-ID: <BFC5D632.4B8B%mcolosimo@mitre.org>


On 12/14/05 1:33 PM, "Peter" <biopython-dev@maubp.freeserve.co.uk> wrote:

> Marc Colosimo wrote:
> 
>> It seems odd that the Fasta stuff is buried within Chapter 2 (2.4.3
>> Making it easier - plus it is missing "import string").
> 
> Yes, but I think it would be better to avoid using the string module
> completely, and use the split method of the string object instead:
> 

I totally agree with you on this. I was just following the coding style used
in the cookbook and not my own.

> from Bio import Fasta
> 
> def parseTitle2Ids(title):
>       return title.split("|")[:3]
> 
> parser = Fasta.SequenceParser(title2ids = parseTitle2Ids)
> file = open("ls_orchid.fasta")
> iterator = Fasta.Iterator(file, parser)
> ...
> 
Marc 

From bugzilla-daemon at portal.open-bio.org  Thu Dec 15 14:51:14 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Thu Dec 15 14:58:17 2005
Subject: [Biopython-dev] [Bug 1919]  New: Transcribe DNA
Message-ID: <200512151951.jBFJpEEK012122@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1919

           Summary: Transcribe DNA
           Product: Biopython
           Version: Not Applicable
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Main Distribution
        AssignedTo: biopython-dev@biopython.org
        ReportedBy: tmagalhaes@dcc.fc.up.pt


I was reading some examples in the Biopython tutorial and cookbook and, even
though I had already read it many times, for the first time I got confused...
Does transcribing the DNA sequence ATCG produce the RNA sequence AUCG or UAGC?
Biopython does the first, but until today I was completely sure that the
second was correct.
This is probably Tania's bug :) and not a Biopython bug, and this is probably
not the right place for this kind of question, but right now I really
don't know how transcription works. I'm confused because on the
internet I found sites that do it the way I thought it worked (or at least it
seems to me to be the same thing)...


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org  Thu Dec 15 16:55:57 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Thu Dec 15 16:58:17 2005
Subject: [Biopython-dev] [Bug 1919] Transcribe DNA
Message-ID: <200512152155.jBFLtvG6014663@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1919


------- Comment #1 from biopython-bugzilla@maubp.freeserve.co.uk  2005-12-15 16:55 -------
Transcription:
DNA {using A,T,C and G} --> mRNA {using A,U,C and G}

Translation:
mRNA {using A,U,C and G} --> Protein {Amino Acids}

Note that the BioPython Translation object can be used to go directly from DNA
{ATCG} to Protein {Amino Acids}, which may be helpful.

Are you asking about the effect of complementation that also happens as part of
transcription in biology?  Because your example was just the four
nucleotides, I wasn't entirely clear on what you meant.
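
The two answers in the report can be reproduced with a short sketch in plain
Python (no Biopython calls). It assumes the first answer treats the input as
the coding strand and the second treats it as the template strand, pairing
each base without reversing the sequence. The function names are made up for
illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def transcribe(coding_dna):
    """Coding-strand DNA to mRNA: replace T with U."""
    return coding_dna.replace("T", "U")


def complement(dna):
    """Base-pair every nucleotide, keeping the same order."""
    return "".join(COMPLEMENT[base] for base in dna)


def transcribe_template(template_dna):
    """Pair each template base, then write the result as RNA."""
    return transcribe(complement(template_dna))


print(transcribe("ATCG"))           # input read as the coding strand
print(transcribe_template("ATCG"))  # input read as the template strand
```

Which of the two is "right" depends only on which strand the input sequence
is taken to be.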


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From mhampton at d.umn.edu  Thu Dec 15 15:06:46 2005
From: mhampton at d.umn.edu (Marshall Hampton)
Date: Thu Dec 15 23:05:42 2005
Subject: [Biopython-dev] Re: [BioPython] blastx works fine?
Message-ID: <Pine.GSO.4.52.0512151405510.11766@bulldog.d.umn.edu>


Hi,

I am a new user of biopython - I like it a lot, thanks for all
those contributions! - and I have been wondering about this too.  It would
help me a lot to automate some blastx searches.  What is the best way to
do this?

Thanks,
Marshall Hampton
Dept. Mathematics & Statistics
University of Minnesota, Duluth

Frank Kauff wrote:

>Hi all,
>
>qblast currently says it works only for blastp and blastn. Actually it
>seems to work fine with blastx as well - xml output parses well with
>NCBIXML. Or am I missing something?
>
>Frank
>
>
>--
>Frank Kauff
>Dept. of Biology
>Duke University

From bugzilla-daemon at portal.open-bio.org  Sun Dec 18 15:45:26 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Sun Dec 18 15:58:18 2005
Subject: [Biopython-dev] [Bug 1920] Bio.Geo does not support recent GEO files
Message-ID: <200512182045.jBIKjQrE015541@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1920


------- Comment #1 from biopython-bugzilla@maubp.freeserve.co.uk  2005-12-18 15:45 -------
Created an attachment (id=260)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=260&action=view)
Patch for Bio/Geo/*.py

Changes to the Martel format definition in Bio/Geo/geo_format.py

Changes to the Geo.Iterator in Bio/Geo/__init__.py


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org  Sun Dec 18 15:50:10 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Sun Dec 18 15:58:20 2005
Subject: [Biopython-dev] [Bug 1920] Bio.Geo does not support recent GEO files
Message-ID: <200512182050.jBIKoAcG015567@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1920


------- Comment #2 from biopython-bugzilla@maubp.freeserve.co.uk  2005-12-18 15:50 -------
Created an attachment (id=261)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=261&action=view)
ZIP file containing revised test_geo.py and five test files

The example files are from the NCBI webpage. They are examples of valid GEO
SOFTtext submission files, but it's the closest they offered.

* a single Platform submission.
* three dual channel Sample submissions.
* a single Series submission.
* a family (Platform, Samples and Series) submission.
* three Affymetrix Sample submissions. 

http://www.ncbi.nlm.nih.gov/projects/geo/info/soft2.html#SOFTsubmissionexamples


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org  Sun Dec 18 15:43:31 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Sun Dec 18 15:58:21 2005
Subject: [Biopython-dev] [Bug 1920] New: Bio.Geo does not support recent GEO
	files
Message-ID: <200512182043.jBIKhVM5015510@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1920

           Summary: Bio.Geo does not support recent GEO files
           Product: Biopython
           Version: Not Applicable
          Platform: PC
               URL: http://www.ncbi.nlm.nih.gov/projects/geo/info/soft2.html
                    #SOFTformat
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Martel/Mindy
        AssignedTo: biopython-dev@biopython.org
        ReportedBy: biopython-bugzilla@maubp.freeserve.co.uk


The NCBI tweaked their GEO SOFT file format this year (2005) and the old
Bio.Geo parser can't cope.

I have fixed the Martel format definition to support this (and the old test
cases).

I have also changed the Geo.Iterator as it didn't seem to work (it seemed to be
doing an entire file at a time).

Patch to follow, along with new test cases from the NCBI.


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org  Mon Dec 19 06:43:00 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Mon Dec 19 06:59:23 2005
Subject: [Biopython-dev] [Bug 1921] BioSeqDatabase.load() method fails
Message-ID: <200512191143.jBJBh0kH029607@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1921


------- Comment #1 from lpritc@scri.sari.ac.uk  2005-12-19 06:42 -------
Created an attachment (id=262)
 --> (http://bugzilla.open-bio.org/attachment.cgi?id=262&action=view)
Patch to BioSQL/Loader.py fixing problem with bioentry.taxon_id field

I'm not sure if this is a fix or a workaround, as I'm not confident that it has
no unfortunate downstream effects.


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org  Mon Dec 19 06:38:29 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Mon Dec 19 06:59:25 2005
Subject: [Biopython-dev] [Bug 1921] New: BioSeqDatabase.load() method fails
Message-ID: <200512191138.jBJBcTPM029522@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1921

           Summary: BioSeqDatabase.load() method fails
           Product: Biopython
           Version: Not Applicable
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: major
          Priority: P2
         Component: BioSQL
        AssignedTo: biopython-dev@biopython.org
        ReportedBy: lpritc@scri.sari.ac.uk


Using Fedora Core 3, MySQL Ver 14.7 Distrib 4.1.11 with BioPython from CVS
head.

On attempting to follow the documentation code at
http://www.biopython.org/docs/biosql/python_biosql_basic.html#htoc10 to
populate a BioSQL database from the example GenBank file, an error was thrown,
with traceback:

  File "/usr/lib/python2.3/site-packages/BioSQL/BioSeqDatabase.py", line 414, in load
    db_loader.load_seqrecord(cur_record)
  File "/usr/lib/python2.3/site-packages/BioSQL/Loader.py", line 37, in load_seqrecord
    bioentry_id = self._load_bioentry_table(record)
  File "/usr/lib/python2.3/site-packages/BioSQL/Loader.py", line 251, in _load_bioentry_table
    self.adaptor.execute(sql, (self.dbid,
  File "/usr/lib/python2.3/site-packages/BioSQL/BioSeqDatabase.py", line 277, in execute
    self.cursor.execute(sql, args or ())
  File "/usr/lib/python2.3/site-packages/MySQLdb/cursors.py", line 95, in execute
    return self._execute(query, args)
  File "/usr/lib/python2.3/site-packages/MySQLdb/cursors.py", line 114, in _execute
    self.errorhandler(self, exc, value)
  File "/usr/lib/python2.3/site-packages/MySQLdb/connections.py", line 33, in defaulterrorhandler
    raise errorclass, errorvalue
OperationalError: (1216, 'Cannot add or update a child row: a foreign key constraint fails')

This problem had previously been reported under a different configuration on
the BioPython discussion mailing list at
http://www.biopython.org/pipermail/biopython/2005-July/002716.html

The test_BioSQL.py script with the CVS BioPython failed with the same error:

[lpritc@lplinuxdev Tests]$ python test_BioSQL.py
Load SeqRecord objects into a BioSQL database. ... ERROR
Get a list of all items in the database. ... ERROR
Test retrieval of items using various ids. ... ERROR
Make sure Seqs from BioSQL implement the right interface. ... ERROR
Check SeqFeatures of a sequence. ... ERROR
Make sure SeqRecords from BioSQL implement the right interface. ... ERROR
Check that slices of sequences are retrieved properly. ... ERROR
Make sure all records are correctly loaded. ... ERROR
Indepth check that SeqFeatures are transmitted through the db. ... ERROR

======================================================================
ERROR: Load SeqRecord objects into a BioSQL database.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 316, in t_load_database
    self.db.load(self.iterator)
  File "/usr/lib/python2.3/site-packages/BioSQL/BioSeqDatabase.py", line 414, in load
    db_loader.load_seqrecord(cur_record)
  File "/usr/lib/python2.3/site-packages/BioSQL/Loader.py", line 37, in load_seqrecord
    bioentry_id = self._load_bioentry_table(record)
  File "/usr/lib/python2.3/site-packages/BioSQL/Loader.py", line 251, in _load_bioentry_table
    self.adaptor.execute(sql, (self.dbid,
  File "/usr/lib/python2.3/site-packages/BioSQL/BioSeqDatabase.py", line 277, in execute
    self.cursor.execute(sql, args or ())
  File "/usr/lib/python2.3/site-packages/MySQLdb/cursors.py", line 95, in execute
    return self._execute(query, args)
  File "/usr/lib/python2.3/site-packages/MySQLdb/cursors.py", line 114, in _execute
    self.errorhandler(self, exc, value)
  File "/usr/lib/python2.3/site-packages/MySQLdb/connections.py", line 33, in defaulterrorhandler
    raise errorclass, errorvalue
OperationalError: (1216, 'Cannot add or update a child row: a foreign key constraint fails')

The problem seems to stem from the DatabaseLoader._load_bioentry_table() method
in Loader.py: an earlier fix tries to work around a problem with the
population of the bioentry.taxon_id field by assigning it the value "0" in the
INSERT SQL statement.  Attempting to do this in a database where the taxon table
is unpopulated violates a foreign key constraint in both the current
BioSQL schema and the one that ships with BioPython, and throws the error
seen.
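
The constraint violation described above can be reproduced in miniature.
This is only a sketch: it uses SQLite from the Python standard library
rather than MySQL, and a hypothetical two-table schema that is far simpler
than the real BioSQL one. It also shows that leaving the column out, so
that it defaults to NULL, does not violate the constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce keys
conn.execute("CREATE TABLE taxon (taxon_id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE bioentry ("
    "bioentry_id INTEGER PRIMARY KEY, "
    "taxon_id INTEGER REFERENCES taxon (taxon_id), "
    "name TEXT NOT NULL)"
)


def try_insert(sql, args):
    """Run one INSERT; return None on success or the error text."""
    try:
        conn.execute(sql, args)
    except sqlite3.IntegrityError as err:
        return str(err)
    return None


# The taxon table is empty, so taxon_id = 0 has no parent row.
error_with_zero = try_insert(
    "INSERT INTO bioentry (taxon_id, name) VALUES (?, ?)", (0, "demo"))
# Omitting taxon_id lets it default to NULL, which the constraint allows.
error_with_null = try_insert(
    "INSERT INTO bioentry (name) VALUES (?)", ("demo",))
print(error_with_zero, error_with_null)
```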

I modified the code in DatabaseLoader._load_bioentry_table() so that the INSERT
statement no longer attempts to populate the bioentry.taxon_id field, which is
left to take the default value of NULL.  The diff is below:

226c226
<         taxon_id = "0" # inserted this because the taxon population code is out of date
---
>         #taxon_id = "0" # inserted this because the taxon population code is out of date
231a232,234
>       # removed taxon_id field, as it was causing difficulties with the
>       # schema  - not inserting a value allows it to default to NULL,
>       # avoiding violation of the foreign key constraint.
235d237
<          taxon_id,
249d250
<          %s,
252d252
<                                    taxon_id,


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org  Tue Dec 20 07:32:41 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Tue Dec 20 07:58:21 2005
Subject: [Biopython-dev] [Bug 1909] Format issue with GenBank with segmented
	BACs (eg GI:55276707)
Message-ID: <200512201232.jBKCWfFw021417@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1909


biopython-bugzilla@maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |INVALID


------- Comment #2 from biopython-bugzilla@maubp.freeserve.co.uk  2005-12-20 07:32 -------
A GenBank format entry for GI:55276707 can be downloaded from here:

http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=55276707

It's a 401 kb GenBank file, containing THREE separate GenBank records (three
segments), starting:

LOCUS       AY643842S1             12998 bp    DNA     linear   PLN 17-NOV-2004
DEFINITION  Hordeum vulgare subsp. vulgare clone BAC 519K7 hardness locus
            region.
ACCESSION   AY643842
VERSION     AY643842.1  GI:55276708
KEYWORDS    .
SEGMENT     1 of 3
..

Using the old Martel GenBank parser (e.g. BioPython 1.41) the following works
perfectly:

print "Method 1 - Using for record in Iterator"
from Bio import GenBank
gbk_filename = "AY643842.gbk"
input_file  = open(gbk_filename, "r")
for gb_record in GenBank.Iterator(input_file, GenBank.RecordParser()) :
    print "Loaded GenBank record %s" % gb_record.locus
print "Done"
input_file.close()

Or:

print "Method 2 - Using Iterator.next()"
from Bio import GenBank
gbk_filename = "AY643842.gbk"
input_file  = open(gbk_filename, "r")
gb_iterator = GenBank.Iterator(input_file, GenBank.RecordParser())
while True:
    gb_record = gb_iterator.next()
    if gb_record is None : break
    print "Loaded GenBank record %s" % gb_record.locus
print "Done"
input_file.close()

This bit of code will reproduce the error reported:

print "Method 3 - No Iterator object, this fails"
from Bio import GenBank
gbk_filename = "AY643842.gbk"
input_file  = open(gbk_filename, "r")
gb_record = GenBank.RecordParser().parse(input_file)
..

The error message says "unparsed text remains" beyond position 18263 because
there are actually two more records in the file.

Your text editor may have a "goto character" command (TextPad does, available
to try from www.textpad.com but it does cost money).

The following snippet of code is another way to find out where a Martel parser
is failing from a position in a file, in this case 18263:

print "Debug:"
input_file = open(gbk_filename, "r")
raw_text = "".join(input_file.readlines())
input_file.close()
print raw_text[18263:18263+100] + "..."

Debug:
LOCUS       AY643842S2            129099 bp    DNA     linear   PLN 17-NOV-2004
DEFINITION  Hordeum ...

i.e. It's complaining about the presence of a second record (i.e. LOCUS line
onwards) in the GenBank file.
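
A quick way to check in advance how many records a file holds is to count
the LOCUS lines. The sketch below is plain Python and does not use the
GenBank parser. It assumes every record starts with a line beginning
"LOCUS"; the helper names and the tiny sample text are made up.

```python
def count_records(text):
    """Count lines that start a GenBank record (lines beginning 'LOCUS')."""
    return sum(1 for line in text.splitlines() if line.startswith("LOCUS"))


def context_at(text, position, width=100):
    """Return the text found at a character position, for debugging."""
    return text[position:position + width]


# Made-up two-record sample, heavily shortened.
sample = (
    "LOCUS       EXAMPLE1\n"
    "ORIGIN\n"
    "//\n"
    "LOCUS       EXAMPLE2\n"
    "ORIGIN\n"
    "//\n"
)
second_start = sample.find("LOCUS", 1)
print(count_records(sample))
print(context_at(sample, second_start, 20))
```

If the count is greater than one, a single-record parse is likely to stop
at, or complain about, the second LOCUS line.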

Resolution
==========
If you can't be sure in advance that there is only one record, always use the
GenBank.Iterator object.

Note
====
Using the current version of the GenBank parser (in CVS, not yet released),
method 3 above will work and give you just the first record.  It does
not warn you in any way that there is a second or third record available.

P.S.
====
My testing and the original report were done on Windows.  If you run this on
unix, then because of the different line endings, the exact position of the
second record will change slightly.
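
A small sketch of why the position moves, using a made-up two-record text:
with two-character (CRLF) line endings, every line before the second record
adds one extra character to its offset. How a given parser reports positions
may differ, so treat this only as an illustration of the arithmetic.

```python
lf_text = "LOCUS       EXAMPLE1\n//\nLOCUS       EXAMPLE2\n//\n"
crlf_text = lf_text.replace("\n", "\r\n")  # same text with CRLF endings

lf_offset = lf_text.find("LOCUS", 1)      # start of the second record
crlf_offset = crlf_text.find("LOCUS", 1)
shift = crlf_offset - lf_offset           # one extra character per line
print(lf_offset, crlf_offset, shift)
```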


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From e.picardi at unical.it  Thu Dec 22 03:58:52 2005
From: e.picardi at unical.it (Ernesto)
Date: Thu Dec 22 07:28:32 2005
Subject: [Biopython-dev] simple class to generate random trees
Message-ID: <005101c606d5$f7de0080$572561a0@mirko84cf0g99i>

Skipped content of type multipart/alternative
-------------- next part --------------
"""
RandomTree is a simple class to generate random rooted trees.
Clock-like trees are generated according to the methodology
of Kuhner and Felsenstein (1994) Mol. Biol. Evol. 11: 459-468,
whereas non-clock-like trees are created following Guindon and
Gascuel (2002) Mol. Biol. Evol. 19: 534-543.
Once a clock-like tree is generated, each branch length is
multiplied by a gamma distributed factor. If the mean of this
distribution is equal to 1 and the shape is fixed to 0.5, then the
departure from the molecular clock is strong. The opposite situation
is when the gamma shape is fixed to 2.0.

When the RandomTree class is invoked a simple object is created.
It contains:
ntips: number of tips for tree --> default is 10
nobr: 1 for trees without branch lengths --> default is 0
pm: probability of change per unit time --> default is 0.03
shape: gamma shape for variable trees --> default is 0.5
mean: mean of gamma distribution for variable trees --> default is 1

USING this class:
>>> from RandomTree import RandomTree
>>> tree = RandomTree()
>>> tree.nobr
0
>>> tree.ntips=4
>>> tree.constant_tree()
'((T3:0.01516,T4:0.01516):0.01332,(T1:0.02643,T2:0.02643):0.00205);'
>>> tree.variable_tree()
'(((T4:0.00050,T3:0.00954):0.00009,T2:0.01591):0.00595,T1:0.00531);'
>>> tree.nobr=1
>>> tree.constant_tree()
'((T2,(T1,T4)),T3);'
>>> tree.variable_tree()
'((T1,(T3,T2)),T4);'

Copyright (c) 2004-2005, Ernesto Picardi.
This class comes with ABSOLUTELY NO WARRANTY.
"""

import math,random,re,sys  # import of standard modules

class RandomTree:

    def __init__(self,alltips=10,nobr=0,pm=0.03,shape=0.5,mean=1):
        self.alltips=alltips   # number of tips
        self.nobr=nobr     # if true, omit branch lengths from the tree
        self.pm=pm         # probability of change per unit time
        self.shape=shape   # gamma shape parameter
        self.mean=mean     # mean of gamma distribution

    def constant_tree(self):  # function to generate a clock-like tree
        if self.alltips<=2:
            sys.exit('At least three tips. Bye.')
        tips=[]
        for i in range(1, self.alltips+1):
            tips.append("T"+str(i))
        Lb=[]   # branch length accumulated so far by each lineage in tips
        for i in range(len(tips)):
            Lb.append(0)
        n=1
        dictionary={}
        while len(tips)!=1:
            R=1.0-random.random()   # in (0,1], so math.log(R) is defined
            tyme=(-(math.log(R))/len(tips))*self.pm
            brlens=float('%.5f' % tyme)   # keep 5 decimal places
            for i in range(len(tips)):
                Lb[i]=Lb[i]+brlens
            nodeName='@node%04i@' % n
            # pick two lineages at random and join them under a new node
            s1=random.choice(tips)
            i1='%.5f' % Lb[tips.index(s1)]   # fixed notation, never 1e-05
            del Lb[tips.index(s1)]
            tips.remove(s1)
            s2=random.choice(tips)
            i2='%.5f' % Lb[tips.index(s2)]
            del Lb[tips.index(s2)]
            tips.remove(s2)
            if self.nobr:
                nodo="("+s1+","+s2+")"
            else:
                nodo="("+s1+":"+i1+","+s2+":"+i2+")"
            dictionary[nodeName]=nodo
            tips.append(nodeName)
            Lb.append(0)
            n+=1
        findNodes=re.compile(r"@node.*?@", re.I) #to identify a node name
        lastNode=max(dictionary.keys())
        treestring=lastNode
        # expand node placeholders until only tip names remain
        while 1:
            nodeList=findNodes.findall(treestring)
            if nodeList==[]: break
            for element in nodeList:
                treestring=treestring.replace(element, dictionary[element])
        return treestring+';'

    def variable_tree(self):  # function to generate a variable tree
        treestring=self.constant_tree()
        # a branch length such as ":0.01234" followed by "," or ")"
        findbr=re.compile(r":([0-9]+\.[0-9]+)([),])")
        # random.gammavariate takes (shape, scale) and has mean shape*scale,
        # so scale=mean/shape gives a distribution with mean self.mean
        scale=float(self.mean)/self.shape
        def rescale(match):
            # multiply each branch length by its own gamma-distributed factor
            newbr=float(match.group(1))*random.gammavariate(self.shape,scale)
            return ':%.5f%s' % (newbr, match.group(2))
        return findbr.sub(rescale,treestring)
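
A note on the gamma draw used for the variable tree: Python's
random.gammavariate(alpha, beta) takes a shape and a scale, so the mean of
the distribution is alpha*beta. The sketch below is only an illustration
(the helper name gamma_factor is made up here and is not part of the class);
it picks scale = mean/shape and checks the sample mean empirically.

```python
import random

def gamma_factor(shape, mean, rng=random):
    # Hypothetical helper, not part of RandomTree.
    # random.gammavariate(alpha, beta): alpha is the shape, beta the scale,
    # so the expected value is alpha*beta; scale = mean/shape yields `mean`.
    scale = float(mean) / shape
    return rng.gammavariate(shape, scale)

rng = random.Random(42)   # fixed seed so the check is repeatable
draws = [gamma_factor(0.5, 1.0, rng) for _ in range(200000)]
sample_mean = sum(draws) / len(draws)
print("sample mean: %.3f" % sample_mean)   # expected to be close to 1.0
```

With shape/mean used as the second argument instead, the same check gives a
sample mean near shape*shape/mean rather than the requested mean.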


From bugzilla-daemon at portal.open-bio.org  Sat Dec 24 06:39:52 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Sat Dec 24 07:00:24 2005
Subject: [Biopython-dev] [Bug 1920] Bio.Geo does not support recent GEO files
Message-ID: <200512241139.jBOBdqIL008387@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1920


------- Comment #3 from biopython-bugzilla@maubp.freeserve.co.uk  2005-12-24 06:39 -------
The current patch (patch 260) doesn't cope with current GPL files (GEO
platforms/annotation files) where the "before/after table comments" are
slightly different.

Also, the GEO Record object's __str__ method will attempt to show all the rows
in a data table, and for large GDS or GPL files this is a very bad idea -
Python seems to lock up my computer as a result.

I propose to only print the first 20, then a ..., and the final record.  20 is
a reasonably low number and will not affect the existing test cases.
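
A minimal sketch of that truncation rule, assuming the table rows are held
in a plain list (the function name format_rows and its limit argument are
made up for illustration; this is not the Bio.Geo code or the patch):

```python
def format_rows(rows, limit=20):
    """Show at most `limit` leading rows, then '...', then the last row."""
    rows = list(rows)
    if len(rows) <= limit + 1:
        # Short tables are shown in full, so their output is unchanged.
        shown = rows
    else:
        shown = rows[:limit] + ["..."] + rows[-1:]
    return "\n".join(str(row) for row in shown)

short = format_rows(["row%d" % i for i in range(5)])
long_ = format_rows(["row%d" % i for i in range(1000)])
print(long_)
```

Tables of up to 20 rows print exactly as before, which is why small test
cases would be unaffected.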

I have a revised patch prepared that tackles these two issues, but won't have
direct internet access until the New Year.  If anyone with CVS access feels the
urge, don't let this stop you from checking in the current patch and the
additional test cases.


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org  Sat Dec 24 07:18:47 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Sat Dec 24 07:58:45 2005
Subject: [Biopython-dev] [Bug 1680] Problems with the GenBank indexing
Message-ID: <200512241218.jBOCIlWh008813@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1680


------- Comment #4 from biopython-bugzilla@maubp.freeserve.co.uk  2005-12-24 07:18 -------
I think (after following references through several files) that we need to
focus on Bio/expressions/genbank.py

The "record" definition appears to allow multiple trailing blank lines at the
end of a record (see "record_end"), i.e. it looks for // and then one or more
newlines.

However, the "format" definition which appears to be used to build the index is
this:

format = Martel.ParseRecords("genbank", {"format" : "genbank"},
                             record, RecordReader.EndsWith, ("//",))

If I am not mistaken, for files with blank lines between records (as reported
in this bug) this will yield a first record with no trailing blank lines,
followed by records that each begin with leading blank lines.
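
The effect can be reproduced in plain Python. This is only a sketch of the
behaviour described above, not Martel's RecordReader.EndsWith implementation;
the function name and the sample text are made up for illustration.

```python
def split_ends_with(text, terminator="//"):
    """Cut a record immediately after each line equal to `terminator`."""
    records, current = [], []
    for line in text.splitlines(True):   # True keeps the line endings
        current.append(line)
        if line.rstrip("\r\n") == terminator:
            records.append("".join(current))
            current = []
    if "".join(current).strip():
        # Keep any non-blank leftover after the last terminator.
        records.append("".join(current))
    return records

# Three tiny mock records separated by blank lines.
sample = "LOCUS       AAA\n//\n\nLOCUS       BBB\n//\n\nLOCUS       CCC\n//\n"
records = split_ends_with(sample)
for rec in records:
    print(repr(rec))
```

Only the first chunk starts with LOCUS; the later ones start with the blank
line that followed the previous //.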

So, my suggestions are:

(a) Allow blank lines at the start of a genbank record (before the LOCUS line)

Or:

(b) we could try this:

format = Martel.ParseRecords("genbank", {"format" : "genbank"},
                             record, RecordReader.StartsWith, ("LOCUS ",))


Making this change seems to fix this bug (indexing the small 6 KB GenBank file
with three entries takes under a second).

As the GenBank.Iterator code works by looking for records that start LOCUS,
this seems like a more consistent approach.
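
For comparison, a plain Python sketch of splitting on the start of each
record. It is an illustration of the idea only, not the Martel
RecordReader.StartsWith code; the function name and sample text are made up.

```python
def split_starts_with(text, prefix="LOCUS "):
    """Begin a new record at every line that starts with `prefix`."""
    records, current = [], []
    for line in text.splitlines(True):   # True keeps the line endings
        if line.startswith(prefix) and current:
            records.append("".join(current))
            current = []
        current.append(line)
    if current:
        records.append("".join(current))
    return records

# Three tiny mock records separated by blank lines.
sample = "LOCUS       AAA\n//\n\nLOCUS       BBB\n//\n\nLOCUS       CCC\n//\n"
records = split_starts_with(sample)
for rec in records:
    print(repr(rec))
```

Here the blank lines end up after the // of the preceding record, which is
the position the "record" definition already allows. Anything before the
first LOCUS line would still be attached to the first record.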

NOTE - I have not run the full test suite to look for any side effects.


From bugzilla-daemon at portal.open-bio.org  Sat Dec 24 07:28:45 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Sat Dec 24 07:58:47 2005
Subject: [Biopython-dev] [Bug 1680] Problems with the GenBank indexing
Message-ID: <200512241228.jBOCSjLU008952@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1680


biopython-bugzilla@maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |marc.saric@gmx.de


------- Comment #5 from biopython-bugzilla@maubp.freeserve.co.uk  2005-12-24 07:28 -------
*** Bug 1773 has been marked as a duplicate of this bug. ***


From bugzilla-daemon at portal.open-bio.org  Sat Dec 24 07:28:43 2005
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon@portal.open-bio.org)
Date: Sat Dec 24 07:58:48 2005
Subject: [Biopython-dev] [Bug 1773] Martel.Parser.ParserPositionException
Message-ID: <200512241228.jBOCShrh008947@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1773


biopython-bugzilla@maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |DUPLICATE


------- Comment #2 from biopython-bugzilla@maubp.freeserve.co.uk  2005-12-24 07:28 -------
Having investigated bug 1680 further, I'm sure that your issue with the
trailing blank lines is the same problem, so I'm marking this as a duplicate.

However, as far as I can tell, your example GenBank file only has one "genbank
record" in it (i.e. it only has one LOCUS line).

*This means that indexing this particular file is rather pointless*

Indexing the features within this single GenBank record might be more useful;
there is an in-memory approach to this here:

http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/python/genbank/#indexing_features

*** This bug has been marked as a duplicate of 1680 ***

