From jdiggans at gmail.com  Fri Sep  8 15:26:19 2006
From: jdiggans at gmail.com (James Diggans)
Date: Fri, 8 Sep 2006 15:26:19 -0400
Subject: [Biopython-dev] Parsing PubMed XML records
In-Reply-To: <bd07b9280609081139p22654e89o7efb022986574d73@mail.gmail.com>
References: <bd07b9280609081139p22654e89o7efb022986574d73@mail.gmail.com>
Message-ID: <bd07b9280609081226h7fbdeb95g64372023026762b3@mail.gmail.com>

I just began a small project to parse records from a few PubMed searches
and am using the Bio.Pubmed and Bio.Medline packages. The method used
(once patched acc. to the link below) in the documentation seems to
use the plain-text Medline format which doesn't seem to include
Affiliation, a field in which I'm interested.

The XML parsers *do* include this field in their parse but it doesn't
look as if they were ever finished (e.g. NLMMedlineXML.py has a
'Citation' object while PubMed.py uses a 'Record' object; I don't see
any hierarchical relationships between the two). Can someone provide a
brief overview as to the status of this package? Is the XML interface
usable (even if I have to write a new format perhaps?)?
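
For readers who only need the Affiliation field, a minimal sketch with the
standard library's xml.etree.ElementTree is shown here. It does not use the
Bio.PubMed or Bio.Medline parsers at all. The XML fragment is hand-written
for illustration, and both the PMID value and the position of <Affiliation>
inside <Article> are assumptions, so check them against a real record.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration only. Real PubMed XML carries many
# more elements; the placeholder PMID and the location of <Affiliation>
# are assumptions.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000000</PMID>
      <Article>
        <ArticleTitle>An example title</ArticleTitle>
        <Affiliation>Example Institute, Example City</Affiliation>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def affiliations(xml_text):
    """Return (pmid, affiliation) pairs; affiliation is None when absent."""
    root = ET.fromstring(xml_text)
    pairs = []
    for citation in root.iter("MedlineCitation"):
        pmid = citation.findtext("PMID")
        # ".//Affiliation" matches the element at any depth below the citation.
        affiliation = citation.findtext(".//Affiliation")
        pairs.append((pmid, affiliation))
    return pairs


print(affiliations(SAMPLE))
```

Searching with ".//Affiliation" rather than a fixed path is deliberate: it
keeps the sketch working if the element sits at a different depth in the
records actually returned.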

Regards,
James

http://lists.open-bio.org/pipermail/biopython-dev/2003-July/001348.html

From biopython-dev at maubp.freeserve.co.uk  Wed Sep 13 14:23:56 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Wed, 13 Sep 2006 19:23:56 +0100
Subject: [Biopython-dev] Fasta.SequenceParser slower on python 2.4 than 2.3
Message-ID: <45084CBC.9080103@maubp.freeserve.co.uk>

I've been looking at sequence parsing again, and was a little puzzled to 
notice that the stock Fasta.SequenceParser (which uses Martel 
internally) is about three to four times slower on Python 2.4 than on 
Python 2.3 (on my Windows XP laptop).

Has anyone else noticed this?

For comparison, SeqIO.FASTA.FastaReader is about the same (maybe even a 
fraction faster).

I've been using rat.protein.faa as a test case, a 22 MB file with approx 
36000 entries.  The sequences are split into 80 character lines. 
Available here:

ftp://ftp.ncbi.nlm.nih.gov/refseq/R_norvegicus/mRNA_Prot/rat.protein.faa.gz

On python 2.3.3 the attached script takes about 12s to parse, on python 
2.4.3 it takes about 56s.  Explicitly caching the file using cStringIO 
makes no real difference.  Using SeqIO.FASTA.FastaReader takes about 10s 
or 11s (regardless of the version of python).
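
The attached script was scrubbed from the archive, so the following is only a
sketch of how such a timing could be reproduced, not the original
simple_no_cache.py. It uses a plain-Python FASTA reader (iter_fasta and
time_parse are hypothetical helper names, not Biopython functions) and is
written for current Python rather than 2.3 or 2.4. Swapping in a different
parser inside time_parse would give a like-for-like comparison.

```python
import io
import time


def iter_fasta(handle):
    """Yield (title, sequence) pairs from a FASTA-format text handle."""
    title, chunks = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


def time_parse(handle):
    """Return (record_count, elapsed_seconds) for one pass over the handle."""
    start = time.perf_counter()
    count = sum(1 for _ in iter_fasta(handle))
    return count, time.perf_counter() - start


# Tiny in-memory stand-in for rat.protein.faa; pass an open file instead.
demo = io.StringIO(">seq1 demo\nMKV\nLLA\n>seq2 demo\nMTT\n")
count, seconds = time_parse(demo)
print(count, "records in %.4fs" % seconds)
```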

It is possible that this "slow down" is Windows only - I know the Windows 
builds of Python switched from MSVC version 6 to version 7 (or something), 
which may be to blame.

Peter
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: simple_no_cache.py
Url: http://lists.open-bio.org/pipermail/biopython-dev/attachments/20060913/a69fedfd/attachment.pl 

From biopython-dev at maubp.freeserve.co.uk  Sun Sep 17 07:05:14 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Sun, 17 Sep 2006 12:05:14 +0100
Subject: [Biopython-dev] Bio.GenBank FeatureParser vs RecordParser
In-Reply-To: <450C8966.3030106@maubp.freeserve.co.uk>
References: <450C8966.3030106@maubp.freeserve.co.uk>
Message-ID: <450D2BEA.6040903@maubp.freeserve.co.uk>

Peter wrote:
> I've been looking at some timings for parsing GenBank files, in 
> particular FeatureParser vs RecordParser
> 
> The test file I'm using is one of the largest bacterial genomes, the 
> GenBank file is almost 24MB:
> 
> ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Streptomyces_coelicolor/NC_003888.gbk
> 
> On my nice new desktop:
> 
> RecordParser takes about 5s to return a Bio.GenBank.Record object.
> 
> FeatureParser takes about 45 to 50s to return a SeqRecord object.
> 
> ...
> 
> The other option (which I do plan to look into) is improving the 
> location parser so that it doesn't cause such a slow down.
> 

I started this thread on the discussion list, but this follow up is 
probably better off on the development list...

With the following fairly small change to Bio/GenBank/LocationParser.py 
the time taken by the FeatureParser is almost halved (from about 45 to 
50s to about 27 or 28s).

Old code:

def scan(input):
     scanner = LocationScanner()
     return scanner.tokenize(input)

def parse(tokens):
     #print "I have", tokens
     parser = LocationParser()
     return parser.parse(tokens)


New code:

_cached_scanner = LocationScanner()
def scan(input):
     return _cached_scanner.tokenize(input)

_cached_parser = LocationParser()
def parse(tokens):
     #print "I have", tokens
     return _cached_parser.parse(tokens)


These two functions are called for every feature by the location method 
of the _FeatureConsumer class in Bio/GenBank/__init__.py

I checked that test_GenBank and test_GenBankFormat still pass.

My change means the LocationScanner() and LocationParser() objects are 
created once and then reused - rather than being recreated for each feature.

Alternatively, the _FeatureConsumer could create its own copies of these 
objects (once) and call them directly instead of using the scan and 
parse functions.  This also works and takes a similar amount of time.
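
The effect of building the objects once can be seen in isolation with a
self-contained sketch. SlowToBuild is a hypothetical stand-in for
LocationScanner and LocationParser (its constructor cost is simulated), so
the numbers it prints say nothing about the real classes; it only shows the
two calling patterns side by side.

```python
import time


class SlowToBuild:
    """Hypothetical stand-in for LocationScanner / LocationParser."""

    def __init__(self):
        # Simulate costly one-off setup, such as building lookup tables.
        self.table = {str(i): i for i in range(20000)}

    def tokenize(self, text):
        return text.split("..")


def scan_fresh(text):
    # Old pattern: a new object is constructed on every call.
    return SlowToBuild().tokenize(text)


_cached = SlowToBuild()


def scan_cached(text):
    # New pattern: the module-level object is built once and reused.
    return _cached.tokenize(text)


def timed(func, repeats=200):
    """Return the seconds taken to call func(text) `repeats` times."""
    start = time.perf_counter()
    for _ in range(repeats):
        func("123..456")
    return time.perf_counter() - start


print("fresh:  %.4fs" % timed(scan_fresh))
print("cached: %.4fs" % timed(scan_cached))
```

Sharing one instance like this is only safe when the object keeps no state
between calls, which is worth confirming for the real scanner and parser.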

If no one objects, I'll double check this works (and is worthwhile) on 
my older slower windows machine, and check it in at some point next week.

Peter


From biopython-dev at maubp.freeserve.co.uk  Sun Sep 17 18:06:32 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Sun, 17 Sep 2006 23:06:32 +0100
Subject: [Biopython-dev] Bio.GenBank FeatureParser vs RecordParser
In-Reply-To: <450D2BEA.6040903@maubp.freeserve.co.uk>
References: <450C8966.3030106@maubp.freeserve.co.uk>
	<450D2BEA.6040903@maubp.freeserve.co.uk>
Message-ID: <450DC6E8.5030100@maubp.freeserve.co.uk>

Peter wrote:
> Peter wrote:
>> I've been looking at some timings for parsing GenBank files, in 
>> particular FeatureParser vs RecordParser
>>
>> The test file I'm using is one of the largest bacterial genomes, the 
>> GenBank file is almost 24MB:
>>
>> ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Streptomyces_coelicolor/NC_003888.gbk
>>
>> On my nice new desktop:
>>
>> RecordParser takes about 5s to return a Bio.GenBank.Record object.
>>
>> FeatureParser takes about 45 to 50s to return a SeqRecord object.
>>
>> ...
>>
>> The other option (which I do plan to look into) is improving the 
>> location parser so that it doesn't cause such a slow down.
>>
> 
> I started this thread on the discussion list, but this follow up is 
> probably better off on the development list...
> 
> With the following fairly small change to Bio/GenBank/LocationParser.py 
> the time taken by the FeatureParser is almost halved (from about 45 to 
> 50s to about 27 or 28s).
> 
> Old code:
> 
> def scan(input):
>      scanner = LocationScanner()
>      return scanner.tokenize(input)
> 
> def parse(tokens):
>      #print "I have", tokens
>      parser = LocationParser()
>      return parser.parse(tokens)
> 
> 
> New code:
> 
> _cached_scanner = LocationScanner()
> def scan(input):
>      return _cached_scanner.tokenize(input)
> 
> _cached_parser = LocationParser()
> def parse(tokens):
>      #print "I have", tokens
>      return _cached_parser.parse(tokens)
> 
> 
> These two functions are called for every feature by the location method 
> of the _FeatureConsumer class in Bio/GenBank/__init__.py
> 
> I checked that test_GenBank and test_GenBankFormat still pass.
> 
> My change means the LocationScanner() and LocationParser() objects are 
> created once and then reused - rather than being recreated for each feature.
> 
> Alternatively, the _FeatureConsumer could create its own copies of these 
> objects (once) and call them directly instead of using the scan and 
> parse functions.  This also works and takes a similar amount of time.
> 
> If no one objects, I'll double check this works (and is worthwhile) on
> my older slower windows machine, and check it in at some point next week.

I still plan to check in the above fairly minor change.

I've also looked deeper, and I have tweaked LocationParser.py to handle 
the typical (exact) cases using regular expressions as special cases 
(falling back on the existing spark parser otherwise):

"123..456"
"function(123..456)" e.g. "complement(123..456)"

The above are enough for most bacteria. I then added:

"function(123..456,789..1066,1999..2006)" to cover joins,

and:

"function(function(123..456,789..1066,1999..2006))"

to cover the complement of joins for non-bacteria. With this in place 
the parsing time for the large example falls from about 27s to about 7s 
(compared to the 45s or more taken by the CVS edition of the parser).
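
A rough sketch of the fast-path idea follows. It is not the code described
above, which was never posted: the pattern names, the parse_location
function, and its (functions, ranges) return value are all made up for
illustration, and slow_parse merely marks where the existing spark parser
would be called.

```python
import re

_RANGE = r"\d+\.\.\d+"
_RANGE_LIST = r"%s(?:,%s)*" % (_RANGE, _RANGE)
_SIMPLE = re.compile(r"^(%s)$" % _RANGE)
_ONE_FUNC = re.compile(r"^(\w+)\((%s)\)$" % _RANGE_LIST)
_TWO_FUNC = re.compile(r"^(\w+)\((\w+)\((%s)\)\)$" % _RANGE_LIST)


def _ranges(text):
    """Turn "1..5,9..12" into [(1, 5), (9, 12)]."""
    return [tuple(int(n) for n in part.split("..")) for part in text.split(",")]


def slow_parse(location):
    # Placeholder for the general-purpose parser (hypothetical hook).
    raise NotImplementedError("needs the full parser: %r" % location)


def parse_location(location, fallback=slow_parse):
    """Return (functions, ranges), with functions listed outermost first."""
    m = _SIMPLE.match(location)
    if m:
        return [], _ranges(m.group(1))
    m = _ONE_FUNC.match(location)
    if m:
        return [m.group(1)], _ranges(m.group(2))
    m = _TWO_FUNC.match(location)
    if m:
        return [m.group(1), m.group(2)], _ranges(m.group(3))
    # Anything inexact (e.g. "<123..456") goes to the general parser.
    return fallback(location)


print(parse_location("complement(join(123..456,789..1066,1999..2006))"))
```

Because every pattern is anchored at both ends and accepts only exact
integer ranges, fuzzy or otherwise unusual locations never match and always
reach the fallback.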

I'm not ready to check in this hybrid regular expressions/spark parser, 
as I think it could be done more cleanly...

Peter


From idoerg at burnham.org  Mon Sep 18 18:09:08 2006
From: idoerg at burnham.org (Iddo Friedberg)
Date: Mon, 18 Sep 2006 15:09:08 -0700
Subject: [Biopython-dev] Biopython for Ubuntu
Message-ID: <450F1904.3070601@burnham.org>

Apparently we have a Debian / Ubuntu package for Biopython. If there was 
an announcement here then I am sorry, but it went past me. Anyhow, 
thanks very much to Philipp Benner for creating the Ubuntu package. 
Currently Biopython 1.41, and you need to add the universe repository to 
get it. It's in the universe/python section.

I'll add something to the Wiki

Iddo


-- 
Iddo Friedberg, Ph.D.
Burnham Institute for Medical Research
10901 N. Torrey Pines Rd.
La Jolla, CA 92037, USA
T: +1 858 646 3100 x3516
http://iddo-friedberg.org
http://BioFunctionPrediction.org


From bugzilla-daemon at portal.open-bio.org  Mon Sep 25 10:53:40 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 25 Sep 2006 10:53:40 -0400
Subject: [Biopython-dev] [Bug 2076] EMBL to GenBank converter should fix
	unterminated lines
In-Reply-To: <bug-2076-42@http.bugzilla.open-bio.org/>
Message-ID: <200609251453.k8PEredO017998@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2076


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |INVALID


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org  Mon Sep 25 11:02:10 2006
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 25 Sep 2006 11:02:10 -0400
Subject: [Biopython-dev] [Bug 2035] fast/approximate clustalw parameter set
	incorrectly
In-Reply-To: <bug-2035-42@http.bugzilla.open-bio.org/>
Message-ID: <200609251502.k8PF2A1j018510@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=2035


biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|                            |FIXED


------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk  2006-09-25 11:02 -------
Fix checked in, revision 1.14

http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Clustalw/__init__.py?cvsroot=biopython



From kvddrift at earthlink.net  Tue Sep 26 14:32:56 2006
From: kvddrift at earthlink.net (Koen van der Drift)
Date: Tue, 26 Sep 2006 14:32:56 -0400 (GMT-04:00)
Subject: [Biopython-dev] biopython instructions for Mac OS X
Message-ID: <19508413.1159295576644.JavaMail.root@elwamui-wigeon.atl.sa.earthlink.net>

Hi,

I was reading your wiki page and noticed that the instructions for installing biopython on Mac OS X are quite elaborate. I would like to bring to your attention that it is very easy to install the package using the fink package manager (similar to debian, see also http://fink.sf.net). Fink will take care of getting the source tarballs and installing all additional packages needed for biopython. If you would like to add this to your wiki page, I can write a few sentences for this.

Also, does the most recent version of biopython work with python 2.5?

thanks,

- Koen.


From mdehoon at c2b2.columbia.edu  Tue Sep 26 21:02:58 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Tue, 26 Sep 2006 21:02:58 -0400
Subject: [Biopython-dev] biopython instructions for Mac OS X
In-Reply-To: <19508413.1159295576644.JavaMail.root@elwamui-wigeon.atl.sa.earthlink.net>
References: <19508413.1159295576644.JavaMail.root@elwamui-wigeon.atl.sa.earthlink.net>
Message-ID: <4519CDC2.3080805@c2b2.columbia.edu>

Koen van der Drift wrote:
> If you would like to add this to your wiki page, I can write a few
> sentences for this.

You can make an account to be able to edit the wiki page by going to 
"Log in / create account" at the top of the biopython home page. Let me 
know if this doesn't work for you.

> Also, does the most recent version of biopython work with python 2.5?

Yes, as far as I can tell. At least I didn't experience any problems 
with Biopython with python 2.5 on Cygwin or Mac OS X. Some deprecation 
warnings (which should be fixed for the next release), but nothing serious.

--Michiel.

From kvddrift at earthlink.net  Tue Sep 26 22:25:55 2006
From: kvddrift at earthlink.net (Koen van der Drift)
Date: Tue, 26 Sep 2006 22:25:55 -0400
Subject: [Biopython-dev] biopython instructions for Mac OS X
In-Reply-To: <4519CDC2.3080805@c2b2.columbia.edu>
References: <19508413.1159295576644.JavaMail.root@elwamui-wigeon.atl.sa.earthlink.net>
	<4519CDC2.3080805@c2b2.columbia.edu>
Message-ID: <8B1F1014-DF4E-4F0A-A6E7-5172C6404FF1@earthlink.net>


On Sep 26, 2006, at 9:02 PM, Michiel de Hoon wrote:

> You can make an account to be able to edit the wiki page by going  
> to "Log in / create account" at the top of the biopython home page.  
> Let me know if this doesn't work for you.

Thanks, I was able to create an account. However, I just noticed that  
the install instructions are only linked from the wiki page, and are  
on an external HTML document created by Brad Chapman. I will email  
him and ask him to update the instructions.

FYI, I was also able to build biopython 1.42 with python 2.5 on Mac  
OS X.

- Koen.

From mdehoon at c2b2.columbia.edu  Tue Sep 26 22:55:50 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Tue, 26 Sep 2006 22:55:50 -0400
Subject: [Biopython-dev] biopython instructions for Mac OS X
In-Reply-To: <8B1F1014-DF4E-4F0A-A6E7-5172C6404FF1@earthlink.net>
References: <19508413.1159295576644.JavaMail.root@elwamui-wigeon.atl.sa.earthlink.net>
	<4519CDC2.3080805@c2b2.columbia.edu>
	<8B1F1014-DF4E-4F0A-A6E7-5172C6404FF1@earthlink.net>
Message-ID: <4519E836.3030204@c2b2.columbia.edu>

Koen van der Drift wrote:
> Thanks, I was able to create an account. However, I just noticed that 
> the install instructions are only linked from the wiki page, and are on 
> an external HTML document created by Brad Chapman. I will email him and 
> ask him to update the instructions.

I'm not sure if Brad is still very actively involved with Biopython 
(sorry Brad if this statement is incorrect). But, we can also help you 
with fixing the install instructions. The TeX source for this is in CVS; 
you can access it here:

http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Doc/install/?cvsroot=biopython

The simplest thing may be to add your instructions to the .tex file, 
send me (not the mailing list) the result, and then I'll upload the new 
tex, pdf, html to CVS and the website.

It makes sense though for these installation instructions to be part of 
the wiki. So:

Who prefers the current tex/pdf/html form to having a wiki for the 
installation instructions?


--Michiel.


From lpritc at scri.sari.ac.uk  Wed Sep 27 04:36:40 2006
From: lpritc at scri.sari.ac.uk (Leighton Pritchard)
Date: Wed, 27 Sep 2006 09:36:40 +0100
Subject: [Biopython-dev] biopython instructions for Mac OS X
In-Reply-To: <19508413.1159295576644.JavaMail.root@elwamui-wigeon.atl.sa.earthlink.net>
References: <19508413.1159295576644.JavaMail.root@elwamui-wigeon.atl.sa.earthlink.net>
Message-ID: <1159346200.4794.38.camel@lplinuxdev>

Hi Koen,

I notice you're maintaining the fink distribution of biopython ;)
Thanks for doing that.

On Tue, 2006-09-26 at 14:32 -0400, Koen van der Drift wrote:
> Also, does the most recent version of biopython work with python 2.5?

It works for me with Python2.5 on OS X, as do all the dependencies.  

(Getting matplotlib/pylab installed correctly with 2.5 was a much more
involved matter, but that story belongs on a different mailing list.)

L.

-- 
Dr Leighton Pritchard AMRSC
D131, Plant-Pathogen Interactions, Scottish Crop Research Institute
Invergowrie, Dundee, Scotland, DD2 5DA, UK
T: +44 (0)1382 562731 x2405 F: +44 (0)1382 568578
E: lpritc at scri.sari.ac.uk   W: http://bioinf.scri.sari.ac.uk/lp
GPG/PGP: FEFC205C E58BA41B  http://www.keyserver.net             


you can access it here:

http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Doc/install/?cvsroot=biopython

The simplest thing may be to add your instructions to the .tex file, 
send me (not the mailing list) the result, and then I'll upload the new 
tex, pdf, html to CVS and the website.

It makes sense, though, for these installation instructions to be part of 
the wiki. So:

Who prefers the current tex/pdf/html form to having a wiki for the 
installation instructions?


--Michiel.


From lpritc at scri.sari.ac.uk  Wed Sep 27 08:36:40 2006
From: lpritc at scri.sari.ac.uk (Leighton Pritchard)
Date: Wed, 27 Sep 2006 09:36:40 +0100
Subject: [Biopython-dev] biopython instructions for Mac OS X
In-Reply-To: <19508413.1159295576644.JavaMail.root@elwamui-wigeon.atl.sa.earthlink.net>
References: <19508413.1159295576644.JavaMail.root@elwamui-wigeon.atl.sa.earthlink.net>
Message-ID: <1159346200.4794.38.camel@lplinuxdev>

Hi Koen,

I notice you're maintaining the fink distribution of biopython ;)
Thanks for doing that.

On Tue, 2006-09-26 at 14:32 -0400, Koen van der Drift wrote:
> Also, does the most recent version of biopython work with python 2.5?

It works for me with Python2.5 on OS X, as do all the dependencies.  

(Getting matplotlib/pylab installed correctly with 2.5 was a much more
involved matter, but that story belongs on a different mailing list.)

L.

-- 
Dr Leighton Pritchard AMRSC
D131, Plant-Pathogen Interactions, Scottish Crop Research Institute
Invergowrie, Dundee, Scotland, DD2 5DA, UK
T: +44 (0)1382 562731 x2405 F: +44 (0)1382 568578
E: lpritc at scri.sari.ac.uk   W: http://bioinf.scri.sari.ac.uk/lp
GPG/PGP: FEFC205C E58BA41B  http://www.keyserver.net             
(If the signature does not verify, please remove the SCRI disclaimer)