From mdehoon at c2b2.columbia.edu  Sat Jul  1 17:47:28 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sat, 01 Jul 2006 17:47:28 -0400
Subject: [Biopython-dev] Fasta parser
Message-ID: <44A6ED70.9080204@c2b2.columbia.edu>

Hi everybody,

The Biopython documentation shows the following approach to parsing a Fasta file:

 >>> from Bio import Fasta
 >>> parser = Fasta.RecordParser()
 >>> file = open("ls_orchid.fasta")
 >>> iterator = Fasta.Iterator(file, parser)
 >>> cur_record = iterator.next()

But for large Fasta files, it's very slow, compared to file.read(), 
which may be due to going through Martel (I believe the same was true 
for large GenBank files).

So I'm thinking about writing a simple-minded Fasta parser for better 
performance with large files. What I'm wondering about:
1) Is there some advantage that I overlooked of using Martel for parsing 
Fasta files?
2) Why is it necessary to create a parser first and pass it to 
Fasta.Iterator? Are there any cases where Fasta.Iterator uses something 
other than a Fasta.RecordParser?
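
For illustration, a simple-minded parser of the kind described above might look roughly like the sketch below. This is not code from Bio.Fasta or Bio.SeqIO.FASTA: the name iter_fasta is made up, it is written for current Python rather than the Python 2.x used elsewhere in this thread, and it assumes that plain (title, sequence) string tuples are an acceptable result.

```python
import io


def iter_fasta(handle):
    """Yield (title, sequence) tuples of plain strings from a FASTA handle.

    Hypothetical helper for illustration; not part of Biopython.
    """
    title = None
    chunks = []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if title is not None:
                yield title, "".join(chunks)
            title = line[1:].strip()
            chunks = []
        elif title is not None:
            # Sequence lines are joined without the line breaks.
            chunks.append(line.strip())
    if title is not None:
        yield title, "".join(chunks)


# Usage with an in-memory file; the record contents are made up.
example = io.StringIO(">seq1 first record\nACGT\nTT\n>seq2\nGG\n")
records = list(iter_fasta(example))
```

Because it reads the handle line by line and builds nothing but strings, a loop like this should stay close to the cost of reading the file itself, which is the comparison made above.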

--Michiel.

From idoerg at burnham.org  Sat Jul  1 18:52:43 2006
From: idoerg at burnham.org (Iddo Friedberg)
Date: Sat, 1 Jul 2006 15:52:43 -0700
Subject: [Biopython-dev] Fasta parser
References: <44A6ED70.9080204@c2b2.columbia.edu>
Message-ID: <1F97379A556D0946AAEFE3F63FD6F5744D46A5@MAIL.burnham.org>

Michiel,

There is actually a simple-minded Fasta reader/writer that does not use Martel: Bio.SeqIO.FASTA

./I

--
Iddo Friedberg, PhD
Burnham Institute for Medical Research
10901 N. Torrey Pines Rd.
La Jolla, CA 92037 USA
T: +1 858 646 3100 x3516
http://iddo-friedberg.org
http://BioFunctionPrediction.org



From mdehoon at c2b2.columbia.edu  Sun Jul  2 00:43:47 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sun, 02 Jul 2006 00:43:47 -0400
Subject: [Biopython-dev] Fasta parser
In-Reply-To: <1F97379A556D0946AAEFE3F63FD6F5744D46A5@MAIL.burnham.org>
References: <44A6ED70.9080204@c2b2.columbia.edu>
	<1F97379A556D0946AAEFE3F63FD6F5744D46A5@MAIL.burnham.org>
Message-ID: <44A74F03.8020801@c2b2.columbia.edu>

Thanks Iddo!
I tried the parser in Bio.SeqIO.FASTA and it is indeed a lot faster than 
the Martel-based one in Bio.Fasta.

It would be nice to merge these two modules. However, it raises a bunch 
of design questions (such as Fasta.Record versus SeqRecord, and Seq 
versus string), so it's probably better to wait with that until after 
the next Biopython release. Which, by the way, will be coming up soon.

Thanks,

--Michiel.


From idoerg at burnham.org  Sun Jul  2 00:48:50 2006
From: idoerg at burnham.org (Iddo Friedberg)
Date: Sat, 1 Jul 2006 21:48:50 -0700
Subject: [Biopython-dev] Fasta parser
References: <44A6ED70.9080204@c2b2.columbia.edu>
	<1F97379A556D0946AAEFE3F63FD6F5744D46A5@MAIL.burnham.org>
	<44A74F03.8020801@c2b2.columbia.edu>
Message-ID: <1F97379A556D0946AAEFE3F63FD6F5744D46A6@MAIL.burnham.org>

By (lack of?) design, my own Biopython-using code seems to use both the Martel and non-Martel parsers. I imagine others may be in the same position. Point being: any design change should make sure that we stay backward compatible.

Thanks very much for your work on the Biopython release.

Cheers,

./I

--
Iddo Friedberg, PhD
Burnham Institute for Medical Research
10901 N. Torrey Pines Rd.
La Jolla, CA 92037 USA
T: +1 858 646 3100 x3516
http://iddo-friedberg.org
http://BioFunctionPrediction.org



From mdehoon at c2b2.columbia.edu  Sun Jul  2 10:58:35 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sun, 02 Jul 2006 10:58:35 -0400
Subject: [Biopython-dev] New Biopython release coming up
Message-ID: <44A7DF1B.1000008@c2b2.columbia.edu>

Hi everybody,

The next Biopython release (1.42, code-named "Brooklyn") is coming up. 
I'm planning to finish this release about two weeks from now. The tests 
of Biopython in CVS all pass, so we are doing well. However, there are 
25 bugs listed in Bugzilla, so please have a look to see if there's 
something we can do about them. If you have some code sitting around, 
now would be a good time to commit it to CVS. However, if you are not 
sure if your code is ready for prime time, please hold off until after 
this release. Also, if you have a cvs checkout of Biopython, please make 
sure to update it before doing any commits to avoid overwriting.

Thanks everybody for your contributions to Biopython.

--Michiel.

From biopython-dev at maubp.freeserve.co.uk  Sun Jul  2 14:11:47 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Sun, 02 Jul 2006 19:11:47 +0100
Subject: [Biopython-dev] New Biopython release coming up
In-Reply-To: <44A7DF1B.1000008@c2b2.columbia.edu>
References: <44A7DF1B.1000008@c2b2.columbia.edu>
Message-ID: <44A80C63.7060809@maubp.freeserve.co.uk>

Michiel de Hoon wrote:
> Hi everybody,
> 
> The next Biopython release (1.42, code-named "Brooklyn") is coming up. 
> I'm planning to finish this release about two weeks from now. The tests 
> of Biopython in CVS all pass, so we are doing well. However, there are 
> 25 bugs listed in Bugzilla, so please have a look to see if there's 
> something we can do about them. If you have some code sitting around, 
> now would be a good time to commit it to CVS. However, if you are not 
> sure if your code is ready for prime time, please hold off until after 
> this release. Also, if you have a cvs checkout of Biopython, please make 
> sure to update it before doing any commits to avoid overwriting.
> 
> Thanks everybody for your contributions to Biopython.
> 
> --Michiel.

Sounds like a good plan, Michiel.

Did anyone get back to you about the NCBI BLAST XML format?  I would say 
parsing BLAST output is a fairly important feature to a lot of users (I 
may of course be biased)...

Getting down to specifics:

Bugzilla Bug 1997 VARCHAR too small in SCOP tables
http://bugzilla.open-bio.org/show_bug.cgi?id=1997
Suggested fix looked OK to me, but as I've never used SCOP, a second 
opinion would be wise.

Bugzilla Bug 1987 Alphabet.Gapped does not retain gap character
http://bugzilla.open-bio.org/show_bug.cgi?id=1987
I have attached a suggested patch, second opinion welcome

Bugzilla Bug 1981 GenBank parser generates unusual feature qualifiers.
http://bugzilla.open-bio.org/show_bug.cgi?id=1981
A question about the white space in GenBank comments etc.  Changing this 
is probably harmless, but we are already making a big change internally 
with the move away from Martel, so I would rather postpone any further 
change until after the next release.

Bugzilla Bug 1936 Bio.PDB.PDBParser throws PDBException, 'No parent' in 
error
http://bugzilla.open-bio.org/show_bug.cgi?id=1936
One for Thomas Hamelryck which on the face of it looks fairly simple.

Bugzilla Bug 1946 Parsing GenBank Files - unknown line type PROJECT
Does anyone use the new project line?  Would a simple string be enough 
to store this?

Peter


From mcolosimo at mitre.org  Sun Jul  2 14:36:22 2006
From: mcolosimo at mitre.org (Colosimo, Marc E.)
Date: Sun, 02 Jul 2006 14:36:22 -0400
Subject: [Biopython-dev] Fasta parser
In-Reply-To: <44A6ED70.9080204@c2b2.columbia.edu>
Message-ID: <C0CD8A66.8A18%mcolosimo@mitre.org>


On 7/1/06 5:47 PM, "Michiel de Hoon" <mdehoon at c2b2.columbia.edu> wrote:

> Hi everybody,
> 
> The Biopython shows the following approach to parsing a Fasta file:
> 
> >>> from Bio import Fasta
> >>> parser = Fasta.RecordParser()
> >>> file = open("ls_orchid.fasta")
> >>> iterator = Fasta.Iterator(file, parser)
> >>> cur_record = iterator.next()
> 
> But for large Fasta files, it's very slow, compared to file.read(),
> which may be due to going through Martel (I believe the same was true
> for large GenBank files).
> 
> So I'm thinking about writing a simple-minded Fasta parser for better
> performance with large files. What I'm wondering about:
> 1) Is there some advantage that I overlooked of using Martel for parsing
> Fasta files?
> 2) Why is it necessary to create a parser first and passing it to
> Fasta.Iterator? Are there any cases where Fasta.Iterator uses something
> other than a Fasta.RecordParser?

Yes!!!! I use Fasta.SequenceParser which gives me a SeqRecord Object
(Bio.SeqRecord) not some odd Fasta.Record Object that I would have to then
remap into a SeqRecord.
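
To make the point about pluggable parsers concrete, here is a rough sketch of the iterator-plus-parser split. All names in it (iterate, split_records, parse_raw, parse_rich, RawRecord, RichRecord) are made up for illustration; it is not the Bio.Fasta implementation, and the two record types are only stand-ins for Fasta.Record and SeqRecord. It is written for current Python.

```python
import io
from collections import namedtuple

# Hypothetical stand-ins for Fasta.Record and SeqRecord.
RawRecord = namedtuple("RawRecord", "title sequence")
RichRecord = namedtuple("RichRecord", "id description seq")


def split_records(handle):
    """Yield the raw text of one FASTA record at a time."""
    lines = []
    for line in handle:
        if line.startswith(">") and lines:
            yield "".join(lines)
            lines = []
        if lines or line.startswith(">"):
            lines.append(line)
    if lines:
        yield "".join(lines)


def parse_raw(text):
    """Parser returning the title and sequence as they appear in the file."""
    header, *body = text.splitlines()
    return RawRecord(header[1:], "".join(s.strip() for s in body))


def parse_rich(text):
    """Parser returning a richer record with a separate identifier."""
    header, *body = text.splitlines()
    title = header[1:]
    identifier = title.split(None, 1)[0] if title.strip() else ""
    return RichRecord(identifier, title, "".join(s.strip() for s in body))


def iterate(handle, parser):
    """The iterator only splits records; the parser decides the result type."""
    for text in split_records(handle):
        yield parser(text)


data = ">seq1 orchid\nACGT\nAC\n>seq2\nGG\n"
raw = list(iterate(io.StringIO(data), parse_raw))
rich = list(iterate(io.StringIO(data), parse_rich))
```

The same file yields two different record types depending only on which parser is passed in, which is the flexibility the two-step construction buys.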

Also, could someone re-run epydoc! My changes in the code have not made it
to the on-line API docs.

Marc


From mcolosimo at mitre.org  Sun Jul  2 15:12:23 2006
From: mcolosimo at mitre.org (Colosimo, Marc E.)
Date: Sun, 02 Jul 2006 15:12:23 -0400
Subject: [Biopython-dev] BioPython Design
In-Reply-To: <44A74F03.8020801@c2b2.columbia.edu>
Message-ID: <C0CD92D7.8A1B%mcolosimo@mitre.org>

Michiel,

When will this next release be made and what is going into it?

Since you brought up the issue of design questions, I'll have my little rant
now. But first, I would like to say that I think it is great that people
contribute code and, more importantly, their time to this project. Without
all of the core developers there would be no BioPython. So, kudos to anyone
who has contributed code. Now on to my rant....

<rant>
I'm not a big user of either BioPerl or BioJava. However, they are well
structured and more consistent than BioPython. This FastaIO issue is one of
several design issues that really need to be addressed.

For example, both BioPerl and BioJava use a SeqIO object structure. Our
SeqIO module is heavily underused. For example, we have Fasta, GenBank,
LocusLink, NBRF, SwissProt, and UniGene as main modules. Interestingly, there
is a writers.SeqRecord.embl but I can't quickly find something to read in an
EMBL file!

Just look at what BioPerl can read in
<http://www.bioperl.org/wiki/HOWTO:SeqIO> and how easy it is to find this
out (even without the doc page, all of these are listed under
Bio::SeqIO::*).

There is a very short "Coding Convention"
<http://biopython.org/wiki/Contributing#Coding_conventions>, which doesn't
seem to be followed all that well.

My suggestion is if enough people are going to ISMB this year (which I am
not), that time should be made to think about a road map for BioPython.

My suggestions are:
1) split off a branch for ver 2.0 that supports Python 2.4 only (this would
suck for Mac people, like me, but it's time to move on)
2) clean house - remove deprecated items, restructure IO, etc...
3) move to SciPy/NumPy instead of Numeric (could try "numpy/lib/convertcode.py")
4) use Cheese Shop for missing modules
5) documentation

</rant>

marc


From mdehoon at c2b2.columbia.edu  Sun Jul  2 16:54:27 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sun, 02 Jul 2006 16:54:27 -0400
Subject: [Biopython-dev] Fasta parser
In-Reply-To: <C0CD8A66.8A18%mcolosimo@mitre.org>
References: <C0CD8A66.8A18%mcolosimo@mitre.org>
Message-ID: <44A83283.4060401@c2b2.columbia.edu>

>> 2) Why is it necessary to create a parser first and passing it to
>> Fasta.Iterator? Are there any cases where Fasta.Iterator uses something
>> other than a Fasta.RecordParser?
> 
> Yes!!!! I use Fasta.SequenceParser which gives me a SeqRecord Object
> (Bio.SeqRecord) not some odd Fasta.Record Object that I would have to then
> remap into a SeqRecord.

I see. This is one of the design issues I ran into when comparing 
Bio.Fasta and Bio.SeqIO.FASTA: Whether parsing a Fasta file should 
result in a Fasta.Record object or a SeqRecord.

> Also, could someone re-run epydoc! My changes in the code have not made it
> to the on-line API docs.

Done.

--Michiel.

From mdehoon at c2b2.columbia.edu  Sun Jul  2 17:19:46 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sun, 02 Jul 2006 17:19:46 -0400
Subject: [Biopython-dev] BioPython Design
In-Reply-To: <C0CD92D7.8A1B%mcolosimo@mitre.org>
References: <C0CD92D7.8A1B%mcolosimo@mitre.org>
Message-ID: <44A83872.4070209@c2b2.columbia.edu>

Colosimo, Marc E. wrote:
> When will this next release be made ...
I'm planning for the weekend of 15/16 July.

> ... and what is going into it?
Whatever is in CVS at that time. So essentially today's CVS plus as many 
bug fixes as possible. I'd hold off on any major changes until after 
the release.

> <rant>
> </rant>

I pretty much agree with Marc here.

 > My suggestion is if enough people are going to ISMB this year
 > (which I am not), that time should be made to think about a
 > road map for BioPython.

Unfortunately, I won't be going either. A Biopython road map seems like 
a good idea though.

 > My suggestions are:
 > 1) split off a branch for ver 2.0 that supports Python 2.4 only
 > (this would suck for Mac people, like me, but its time to move on)

Is there something essential in 2.4 that's missing in 2.3? Not that I 
object to supporting 2.4 only, I'm just wondering. Though I'd be 
hesitant to split off a separate branch, since Biopython is confusing 
enough already as it is.

Btw, I am running Python 2.4 on Mac OS X, and AFAICT supporting 2.4 only 
would not be a problem for Mac users.

 > 2) clean house - remove depreciated items, restructure IO, etc...

I totally agree.

 > 3) move to SciPy/NumPy instead of Numeric (could try
 > "numpy/lib/convertcode.py")

Here, I'm a bit hesitant. SciPy does not have a good track record in 
terms of portability. The latest version of numpy looks better though 
(it compiled without problems on all platforms I tried). But I don't 
really want to pay $40 for the documentation.

 > 4) use Cheese Shop for missing modules
 > 5) documentation

My guess is that maintaining the documentation will be easier once we 
have cleaned up Biopython.

--Michiel.


From mdehoon at c2b2.columbia.edu  Sun Jul  2 21:21:00 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sun, 02 Jul 2006 21:21:00 -0400
Subject: [Biopython-dev] New Biopython release coming up
In-Reply-To: <44A80C63.7060809@maubp.freeserve.co.uk>
References: <44A7DF1B.1000008@c2b2.columbia.edu>
	<44A80C63.7060809@maubp.freeserve.co.uk>
Message-ID: <44A870FC.4060909@c2b2.columbia.edu>

Peter wrote:
> Did anyone get back to you about the NBCI Blast XML format?  I would say 
> parsing blast output is a fairly important feature to a lot of users (I 
> may of course be biased)...
No response yet, but I'll ask them again before the upcoming release. 
The existing XML parser still works as advertised for single blast 
searches. For multiple blast searches, people will have to run a 
previous version of blast locally.

> Bugzilla Bug 1997 VARCHAR too small in SCOP tables
> http://bugzilla.open-bio.org/show_bug.cgi?id=1997
> Suggested fix looked OK to me, but as I've never used SCOP as second 
> opinion would be wise.

This one looks fine to me, but I'm not a SCOP user either.

> Bugzilla Bug 1987 Alphabet.Gapped does not retain gap character
> http://bugzilla.open-bio.org/show_bug.cgi?id=1987
> I have attached a suggested patch, second opinion welcome

While the patch looks fine, I have no idea what this code is supposed 
to do, or why it needs to be so complicated.

> Bugzilla Bug 1946 Parsing GenBank Files - unknown line type PROJECT
> Does anyone use the new project line?  Would a simple string be enough 
> to store this?
> 
 From NCBI's description, it appears they're not quite sure yet what 
this project line should look like (note that the project line in the 
description is different from the project line in the GenBank file: 
GenomeProject vs. GENOME_PROJECT). I would just store the line in a 
simple string, and do something more fancy once we know the proper format.
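
As a rough sketch of what "store the line in a simple string" could mean in practice (this is not the Bio.GenBank parser; the function name read_header and the sample values are invented for illustration, and it is written for current Python):

```python
def read_header(lines):
    """Collect GenBank-style header lines into a dict of plain strings.

    Hypothetical helper: every keyword line, PROJECT included, is stored
    verbatim without interpreting its contents.
    """
    header = {}
    last = None
    for line in lines:
        if not line.strip():
            continue
        if line[0].isspace():
            # Indented lines continue the previous keyword's value.
            if last is not None:
                header[last] += " " + line.strip()
            continue
        keyword, _, value = line.rstrip().partition(" ")
        header[keyword] = value.strip()
        last = keyword
    return header


# Made-up sample lines; the PROJECT value is a placeholder, not real data.
sample = [
    "LOCUS       EXAMPLE01\n",
    "DEFINITION  first part\n",
    "            second part\n",
    "PROJECT     GenomeProject:0000\n",
]
header = read_header(sample)
```

Keeping the value as an uninterpreted string means nothing breaks if the format of the line changes later; a structured representation can be layered on top once the format settles.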

My 2¢.

--Michiel.

From idoerg at burnham.org  Mon Jul  3 13:52:44 2006
From: idoerg at burnham.org (Iddo Friedberg)
Date: Mon, 03 Jul 2006 10:52:44 -0700
Subject: [Biopython-dev] [Fwd: [OBF] Call For Birds of a Feather Suggestions]
Message-ID: <44A9596C.90208@burnham.org>


The BOSC organizing committee is currently seeking suggestions for Birds
of a Feather meeting ideas. Birds of a Feather meetings are one of the
more popular activities at BOSC, occurring at the end of each day's
session. These are free-form meetings organized by the attendees
themselves to discuss one or a few topics of interest in greater detail.
BOFs have been formed to allow developers and users of individual OBF
software to meet each other face-to-face to discuss the project, or to
discuss completely new ideas, and even start new software development
projects. These meetings offer a unique opportunity for individuals to
explore more about the activities of the various Open Source Projects,
and, in some cases, even take an active role influencing the future of
Open Source Software development. If you would like to create a BOF,
just sign up for a wiki account, log in, and edit the BOSC 2006 Birds of
a Feather page:

http://www.open-bio.org/wiki/BOSC_2006/Birds-of-a-Feather
_______________________________________________
Open-Bioinformatics-Foundation mailing list
Open-Bioinformatics-Foundation at lists.open-bio.org

This is a broadcast-only announce list used to distribute emails to people who subscribe to OBF hosted email discussion or announce lists. To prevent our most active members from getting many duplicate copies of important announcements we created this list today so that only one email gets sent to each subscribed email address. You do not need to subscribe/unsubscribe from this list. Problems or Concerns? -- send an email to the OBF mailteam at: mailteam at open-bio.org 


-- 
Iddo Friedberg, Ph.D.
Burnham Institute for Medical Research
10901 N. Torrey Pines Rd.
La Jolla, CA 92037 USA
Tel: +1 (858) 646 3100 x3516
Fax: +1 (858) 713 9949
http://iddo-friedberg.org
http://BioFunctionPrediction.org


From biopython-dev at maubp.freeserve.co.uk  Thu Jul  6 05:06:07 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Thu, 06 Jul 2006 10:06:07 +0100
Subject: [Biopython-dev] New Biopython release coming up / Alphabets
In-Reply-To: <44A870FC.4060909@c2b2.columbia.edu>
References: <44A7DF1B.1000008@c2b2.columbia.edu>	<44A80C63.7060809@maubp.freeserve.co.uk>
	<44A870FC.4060909@c2b2.columbia.edu>
Message-ID: <44ACD27F.90906@maubp.freeserve.co.uk>

Peter wrote:
>> Bugzilla Bug 1987 Alphabet.Gapped does not retain gap character
>> http://bugzilla.open-bio.org/show_bug.cgi?id=1987
>> I have attached a suggested patch, second opinion welcome

Michiel de Hoon wrote:
> Whereas the patch looks fine, I have no idea what this code is supposed 
> to do, or why it needs to be so complicated.

I'm not the person to ask.

The whole Alphabet is something that confused me a little when first 
using BioPython.  I see why a special class for sequences is a nice 
idea, and that handling the different variants of RNA, DNA and proteins 
is a good idea.

But to be honest, I have generally used plain strings in my own 
programs, and meddled with alphabets only when needed (e.g. for 
translating from DNA to protein sequences).

Peter


From hoffman at ebi.ac.uk  Thu Jul  6 06:36:53 2006
From: hoffman at ebi.ac.uk (Michael Hoffman)
Date: Thu, 6 Jul 2006 11:36:53 +0100 (BST)
Subject: [Biopython-dev] New Biopython release coming up / Alphabets
In-Reply-To: <44ACD27F.90906@maubp.freeserve.co.uk>
References: <44A7DF1B.1000008@c2b2.columbia.edu>
	<44A80C63.7060809@maubp.freeserve.co.uk>
	<44A870FC.4060909@c2b2.columbia.edu>
	<44ACD27F.90906@maubp.freeserve.co.uk>
Message-ID: <Pine.LNX.4.64.0607061135180.9068@qnzvnan.rov.np.hx>

[Peter]

> The whole Alphabet is something that confused me a little when first
> using BioPython.  I see why a special class for sequences is a nice
> idea, and that handling the different variants of RNA, DNA and proteins
> is a good idea.
>
> But to be honest, I have generally used plain strings in my own
> programs, and meddled with alphabets only when needed (e.g. for
> translating from DNA to protein sequences).

I agree. In general, I think that the alphabet stuff adds unnecessary
complexity to perhaps 95% of the sort of things I would do with
Biopython. As it stands, I usually just use strs myself instead.
-- 
Michael Hoffman <hoffman at ebi.ac.uk>
European Bioinformatics Institute

From Leighton.Pritchard at scri.ac.uk  Thu Jul  6 06:34:46 2006
From: Leighton.Pritchard at scri.ac.uk (Leighton Pritchard)
Date: Thu, 6 Jul 2006 11:34:46 +0100
Subject: [Biopython-dev] New Biopython release coming up / Alphabets
In-Reply-To: <44ACD27F.90906@maubp.freeserve.co.uk>
References: <44A7DF1B.1000008@c2b2.columbia.edu>
	<44A80C63.7060809@maubp.freeserve.co.uk>
	<44A870FC.4060909@c2b2.columbia.edu>
	<44ACD27F.90906@maubp.freeserve.co.uk>
Message-ID: <1152182087.4828.96.camel@lplinuxdev>

An embedded and charset-unspecified text was scrubbed...
Name: not available
Url: http://lists.open-bio.org/pipermail/biopython-dev/attachments/20060706/8ff09d1a/attachment-0002.pl 
-------------- next part --------------
An embedded message was scrubbed...
From: "Leighton Pritchard" <Leighton.Pritchard at scri.ac.uk>
Subject: Re: [Biopython-dev] New Biopython release coming up / Alphabets
Date: Thu, 6 Jul 2006 11:34:46 +0100
Size: 4250
Url: http://lists.open-bio.org/pipermail/biopython-dev/attachments/20060706/8ff09d1a/attachment-0002.mht 

From mdehoon at c2b2.columbia.edu  Thu Jul  6 12:39:09 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Thu, 06 Jul 2006 12:39:09 -0400
Subject: [Biopython-dev] New Biopython release coming up / Alphabets
In-Reply-To: <Pine.LNX.4.64.0607061135180.9068@qnzvnan.rov.np.hx>
References: <44A7DF1B.1000008@c2b2.columbia.edu>	<44A80C63.7060809@maubp.freeserve.co.uk>	<44A870FC.4060909@c2b2.columbia.edu>	<44ACD27F.90906@maubp.freeserve.co.uk>
	<Pine.LNX.4.64.0607061135180.9068@qnzvnan.rov.np.hx>
Message-ID: <44AD3CAD.8030504@c2b2.columbia.edu>

Michael Hoffman wrote:
> [Peter]
>> But to be honest, I have generally used plain strings in my own
>> programs, and meddled with alphabets only when needed (e.g. for
>> translating from DNA to protein sequences).

Note that there is a function "translate" in Bio.Seq that translates DNA 
to protein using plain strings.
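
For illustration, translating a plain string needs nothing more than a codon lookup. The sketch below is not the Bio.Seq implementation: translate_dna is a made-up name, it is written for current Python, it only covers the standard genetic code, it maps unrecognised codons to "X", and it ignores trailing bases that do not form a full codon.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_dna(dna):
    """Translate a DNA string into a protein string ('*' marks a stop codon)."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # drop an incomplete trailing codon
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3)
    )


protein = translate_dna("ATGGCCTAA")
```

Input and output are both ordinary str objects, so no Alphabet is involved at any point.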
> 
> I agree. In general, I think that the alphabet stuff adds unnecessary
> complexity to perhaps 95 % of the sort of things I would do with
> Biopython. But as it stands I usually use strs myself instead.

It appears that most people (myself included) use plain strings instead 
of Seq objects (= string + Alphabet). We should check on the biopython 
mailing list if anybody really needs alphabets, and if not get rid of 
them (after the upcoming Brooklyn-release (1.42) though).

--Michiel.


-- 
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032

From fkauff at duke.edu  Thu Jul  6 13:53:23 2006
From: fkauff at duke.edu (Frank Kauff)
Date: Thu, 06 Jul 2006 13:53:23 -0400
Subject: [Biopython-dev] New Biopython release coming up / Alphabets
In-Reply-To: <44AD3CAD.8030504@c2b2.columbia.edu>
References: <44A7DF1B.1000008@c2b2.columbia.edu>
	<44A80C63.7060809@maubp.freeserve.co.uk>
	<44A870FC.4060909@c2b2.columbia.edu>	<44ACD27F.90906@maubp.freeserve.co.uk>
	<Pine.LNX.4.64.0607061135180.9068@qnzvnan.rov.np.hx>
	<44AD3CAD.8030504@c2b2.columbia.edu>
Message-ID: <1152208403.2487.36.camel@osiris.biology.duke.edu>

On Thu, 2006-07-06 at 12:39 -0400, Michiel Jan Laurens de Hoon wrote:
> Michael Hoffman wrote:
> > [Peter]
> >> But to be honest, I have generally used plain strings in my own
> >> programs, and meddled with alphabets only when needed (e.g. for
> >> translating from DNA to protein sequences).
> 
> Note that there is a function "translate" in Bio.Seq that translates DNA 
> to protein using plain strings.
> > 
> > I agree. In general, I think that the alphabet stuff adds unnecessary
> > complexity to perhaps 95 % of the sort of things I would do with
> > Biopython. But as it stands I usually use strs myself instead.
> 
> It appears that most people (myself included) use plain strings instead 
> of Seq objects (= string + Alphabet). We should check on the biopython 
> mailing list if anybody really needs alphabets, and if not get rid of 
> them (after the upcoming Brooklyn-release (1.42) though).
> 
I use seq objects and the alphabet stuff in the nexus parser, but I
don't really know why and wouldn't mind at all to get rid of them. 

Frank


> --Michiel.
> 
> 
-- 
Frank Kauff
Dept. of Biology
Duke University
Box 90338
Durham, NC 27708
USA

Phone 919-660-7382
Fax 919-660-7293
Web http://www.lutzonilab.net


From thamelry at binf.ku.dk  Fri Jul  7 06:44:24 2006
From: thamelry at binf.ku.dk (Thomas Hamelryck)
Date: Fri, 7 Jul 2006 12:44:24 +0200 (CEST)
Subject: [Biopython-dev] New Biopython release coming up
In-Reply-To: <44A80C63.7060809@maubp.freeserve.co.uk>
References: <44A7DF1B.1000008@c2b2.columbia.edu>
	<44A80C63.7060809@maubp.freeserve.co.uk>
Message-ID: <38979.129.11.36.25.1152269064.squirrel@secure.binf.ku.dk>

Hi,

> Bugzilla Bug 1936 Bio.PDB.PDBParser throws PDBException, 'No parent' in
> error
> http://bugzilla.open-bio.org/show_bug.cgi?id=1936
> One for Thomas Hamelryck which on the face of it looks fairly simple.

Won't have time to work on biopython before august I'm afraid (CASP+
articles that need to be finished, etc.). Sorry!

Best regards,

-Thomas


From mcolosimo at mitre.org  Tue Jul 11 12:01:15 2006
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Tue, 11 Jul 2006 12:01:15 -0400
Subject: [Biopython-dev] New Biopython release coming up / Alphabets
In-Reply-To: <44AD3CAD.8030504@c2b2.columbia.edu>
References: <44A7DF1B.1000008@c2b2.columbia.edu>	<44A80C63.7060809@maubp.freeserve.co.uk>	<44A870FC.4060909@c2b2.columbia.edu>	<44ACD27F.90906@maubp.freeserve.co.uk>
	<Pine.LNX.4.64.0607061135180.9068@qnzvnan.rov.np.hx>
	<44AD3CAD.8030504@c2b2.columbia.edu>
Message-ID: <B4886B3D-E885-450D-9376-3EBE93859F7A@mitre.org>


On Jul 6, 2006, at 12:39 PM, Michiel Jan Laurens de Hoon wrote:

> Michael Hoffman wrote:
>> [Peter]
>>> But to be honest, I have generally used plain strings in my own
>>> programs, and meddled with alphabets only when needed (e.g. for
>>> translating from DNA to protein sequences).
>
> Note that there is a function "translate" in Bio.Seq that  
> translates DNA
> to protein using plain strings.
>>
>> I agree. In general, I think that the alphabet stuff adds unnecessary
>> complexity to perhaps 95 % of the sort of things I would do with
>> Biopython. But as it stands I usually use strs myself instead.
>
> It appears that most people (myself included) use plain strings  
> instead
> of Seq objects (= string + Alphabet). We should check on the biopython
> mailing list if anybody really needs alphabets, and if not get rid of
> them (after the upcoming Brooklyn-release (1.42) though).
>
> --Michiel.

I am strongly arguing against removing the alphabets. You would lose  
all of the cool features of Seq Objects (complement,  
reverse_complement).  There are similar functions under Bio.SeqUtils  
but those are "Deprecated". From just looking around, I think this  
would break many things.

Having said that, I do find them a pain to deal with, but that might  
have more to do with the structure/layout of the classes. My simple  
suggestion is to fix/change the base Alphabet classes in  
Bio.Alphabet.__init__. I am trying to think of a way that we can have  
a "true" GenericAlphabet class (not generic_alphabet = Alphabet() )  
and using just strings. The problem is that I don't know if just  
using letters = None (or letters = []) will cause problems down the  
road (things like "if x in alphabet.letters" are used in many classes).

Also, I'm really confused as to what is going on in IUPAC.py with the  
default_manager stuff and _bootstrap.

Marc

From mcolosimo at mitre.org  Tue Jul 11 13:29:52 2006
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Tue, 11 Jul 2006 13:29:52 -0400
Subject: [Biopython-dev] BioPython Design
In-Reply-To: <44A83872.4070209@c2b2.columbia.edu>
References: <C0CD92D7.8A1B%mcolosimo@mitre.org>
	<44A83872.4070209@c2b2.columbia.edu>
Message-ID: <7C24AEA4-68EC-4517-9391-C07512CDD146@mitre.org>


On Jul 2, 2006, at 5:19 PM, Michiel de Hoon wrote:

>
> > My suggestions are:
> > 1) split off a branch for ver 2.0 that supports Python 2.4 only
> > (this would suck for Mac people, like me, but its time to move on)
>
> Is there something essential in 2.4 that's missing in 2.3? Not that  
> I object against supporting 2.4 only, I'm just wondering. Though  
> I'd be hesitant to split off a separate branch, since Biopython is  
> confusing enough already as it is.
>
> Btw, I am running Python 2.4 on Mac OS X, and AFAICT there is no  
> problem   for Mac users to support 2.4 only.

There are two off the top of my head:

Generator expressions (PEP 289, <http://www.python.org/doc/peps/pep-0289>).
This could be very useful in cleaning up the old code.
Decorators for functions (PEP 318, <http://www.python.org/dev/peps/pep-0318>).
I like the idea of using staticmethod and classmethod. The accepts and
returns decorators are also interesting. I wish I could find a list of
all possible decorators.
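
To make the two features concrete, here is a minimal sketch. The names
gc_fraction and SeqTools are made up for this example and are not part of
Biopython; it only shows the syntax that Python 2.4 added.

```python
def gc_fraction(sequence):
    """Return the fraction of G/C letters in a plain string.

    Uses a generator expression (PEP 289): sum() consumes the values
    one by one, so no intermediate list is built.
    """
    if not sequence:
        return 0.0
    gc = sum(1 for letter in sequence.upper() if letter in "GC")
    return gc / float(len(sequence))


class SeqTools(object):
    """Hypothetical helper class showing the decorator syntax (PEP 318)."""

    @staticmethod
    def complement(sequence):
        # No self argument: callable as SeqTools.complement("ACGT").
        pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
        return "".join(pairs[letter] for letter in sequence)

    @classmethod
    def name(cls):
        # Receives the class itself rather than an instance.
        return cls.__name__
```

Before 2.4 the same effect needed "complement = staticmethod(complement)"
after the function body, which is easy to overlook in long classes.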

In any case, some clean up of the code is needed because people have  
used the string "Decorator" (Alphabet.__init__.py and NeCatch.py)

>
> > 2) clean house - remove deprecated items, restructure IO, etc...
>
> I totally agree.
>
> > 3) move to SciPy/NumPy versus Numeric (could try
> "numpy/lib/convertcode.py")
>
> Here, I'm a bit hesitant. SciPy does not have a good track record  
> in terms of portability. The latest version of numpy looks better  
> though (it compiled without problems on all platforms I tried). But  
> I don't really want to pay $40 for the documentation.


I saw this, but didn't know it was the only documentation. However,  
as far as I can tell Numeric is dead: <http://numeric.scipy.org/> is  
now NumPy!

Marc

From krewink at inb.uni-luebeck.de  Tue Jul 11 17:23:14 2006
From: krewink at inb.uni-luebeck.de (Albert Krewinkel)
Date: Tue, 11 Jul 2006 23:23:14 +0200
Subject: [Biopython-dev] BioPython Design
Message-ID: <20060711212314.GA31351@pc06.inb.mu-luebeck.de>

On 11.07.2006 at 18:01, Marc Colosimo wrote:

> It appears that most people (myself included) use plain strings instead
> of Seq objects (= string + Alphabet). We should check on the biopython
> mailing list if anybody really needs alphabets, and if not get rid of
> them (after the upcoming Brooklyn-release (1.42) though).

There are some good points about Seq objects in the discussion
last year:
http://lists.open-bio.org/pipermail/biopython-dev/2005-April/002074.html

Personally, I would prefer to keep Alphabets as a part of Seq,
but make it behave more like python strings, i.e.:
str(seq_obj) == seq_obj.data == seq_obj.tostring() == seq_obj[:]

Furthermore, alphabets could be more useful with an __init__
method looking like

def __init__(self, data, alphabet, validate=False)

This way, sequences could be checked for consistency on demand.

To make Alphabets more usable, it would be nice to have some kind
of dictionary interface to map different alphabets:
e.g. Alphabet.Alphabets['protein'] == Bio.Alphabet.IUPAC.protein
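
Put together, the three ideas could look roughly like the sketch below.
This is not the current Bio.Seq code: the Alphabet and Seq classes here
are simplified stand-ins, and the ALPHABETS registry is a hypothetical
name for the dictionary interface.

```python
class Alphabet(object):
    """Simplified stand-in for an alphabet: a name plus allowed letters."""

    def __init__(self, name, letters):
        self.name = name
        self.letters = letters


# Hypothetical registry mapping short names to alphabet objects.
ALPHABETS = {
    "dna": Alphabet("dna", "GATC"),
    "protein": Alphabet("protein", "ACDEFGHIKLMNPQRSTVWY"),
}


class Seq(object):
    """String-like sequence with optional validation on creation."""

    def __init__(self, data, alphabet, validate=False):
        if validate:
            # Only pay for the check when the caller asks for it.
            bad = sorted(set(data) - set(alphabet.letters))
            if bad:
                raise ValueError("letters %r are not in the %s alphabet"
                                 % (bad, alphabet.name))
        self.data = data
        self.alphabet = alphabet

    def __str__(self):
        return self.data

    def tostring(self):
        return self.data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        # Indexing and slicing return plain strings, like str does.
        return self.data[index]
```

With this, Seq("GATTACA", ALPHABETS["dna"], validate=True) is accepted,
while the same call with "GATTAXA" raises ValueError. Whether slicing
should return a plain string or a new Seq is a design choice; the sketch
picks the string so that the equality chain above holds.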

Cheers,
Albert

-- 
Albert Krewinkel
University of Luebeck
phone: +49 (451) 500 5516
email: krewink at inb.uni-luebeck.de

From f.schlesinger at iu-bremen.de  Wed Jul 12 09:25:43 2006
From: f.schlesinger at iu-bremen.de (Felix Schlesinger)
Date: Wed, 12 Jul 2006 15:25:43 +0200
Subject: [Biopython-dev] BioPython Design
In-Reply-To: <20060711212314.GA31351@pc06.inb.mu-luebeck.de>
References: <20060711212314.GA31351@pc06.inb.mu-luebeck.de>
Message-ID: <7317d50c0607120625x7e76008fo961814b280dbad51@mail.gmail.com>

> Personally, I would prefer to keep Alphabets as a part of Seq,
> but make it behave more like python strings, i.e.:
> str(seq_obj) == seq_obj.data == seq_obj.tostring() == seq_obj[:]

Isn't the whole alphabet thing just a type information in the end?
(I.e. "This string is of type protein") And if it is, shouldn't we let
the python type system handle it via a class hierarchy? Or use the
python concept of duck typing and assume the string has whatever type
is needed at the moment until it fails?
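
A rough sketch of both options, just to show what I mean. The class
names are invented for this example and nothing like this exists in
Biopython today.

```python
class Sequence(str):
    """Base type: behaves exactly like a plain Python string."""


class DNASequence(Sequence):
    """The type itself says "this string is DNA"."""

    def reverse_complement(self):
        pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
        return DNASequence("".join(pairs[base] for base in reversed(self)))


class ProteinSequence(Sequence):
    """The type itself says "this string is protein"."""


def gc_count(sequence):
    # Duck typing: anything with a count() method works here,
    # whether it is a str, a DNASequence or something else.
    return sequence.count("G") + sequence.count("C")
```

With the class hierarchy, isinstance(seq, DNASequence) replaces the
alphabet check. With duck typing, gc_count("GGCA") works on a bare
string and only fails if the object lacks the needed method.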

Felix Schlesinger

From mdehoon at c2b2.columbia.edu  Wed Jul 26 13:39:46 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Wed, 26 Jul 2006 13:39:46 -0400
Subject: [Biopython-dev] New Biopython release coming up / Alphabets
In-Reply-To: <B4886B3D-E885-450D-9376-3EBE93859F7A@mitre.org>
References: <44A7DF1B.1000008@c2b2.columbia.edu>	<44A80C63.7060809@maubp.freeserve.co.uk>	<44A870FC.4060909@c2b2.columbia.edu>	<44ACD27F.90906@maubp.freeserve.co.uk>
	<Pine.LNX.4.64.0607061135180.9068@qnzvnan.rov.np.hx>
	<44AD3CAD.8030504@c2b2.columbia.edu>
	<B4886B3D-E885-450D-9376-3EBE93859F7A@mitre.org>
Message-ID: <44C7A8E2.2050100@c2b2.columbia.edu>

Marc Colosimo wrote:
>> [Michiel]
>> It appears that most people (myself included) use plain strings instead
>> of Seq objects (= string + Alphabet). We should check on the biopython
>> mailing list if anybody really needs alphabets, and if not get rid of
>> them (after the upcoming Brooklyn-release (1.42) though).
 >
 > [Marc]
> I am strongly arguing against removing the alphabets. You would lose 
> all of the cool features of Seq Objects (complement, 
> reverse_complement).  There are similar functions under Bio.SeqUtils but 
> those are "Deprecated". From just looking around, I think this would 
> break many things.

There is a function reverse_complement in Bio.Seq that works on plain 
strings. (If you need the complement instead, you can of course reverse 
the result). So can you be more specific on which features of Seq 
objects are actually needed? While I can see the intuitive appeal of 
having a Seq class, I cannot think of any practical cases where a simple 
string wouldn't do.
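
To show how little is needed for plain strings, here is a stand-alone
sketch. It is not the Bio.Seq implementation: it assumes Python 3
(str.maketrans) and handles only the unambiguous DNA letters.

```python
# Translation table for unambiguous DNA letters, upper and lower case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(sequence):
    """Return the complement of a DNA string, in the same order."""
    return sequence.translate(_COMPLEMENT)


def reverse_complement(sequence):
    """Return the reverse complement of a DNA string."""
    return complement(sequence)[::-1]
```

For example, reverse_complement("AACG") gives "CGTT", and reversing
that result gives the plain complement "TTGC".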

--Michiel.


-- 
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032

From biopython-dev at maubp.freeserve.co.uk  Fri Jul 28 09:50:39 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Fri, 28 Jul 2006 14:50:39 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
Message-ID: <44CA162F.1040604@maubp.freeserve.co.uk>

This follows on from the discussion last month started by Marc Colosimo, 
  but I want to focus just on reading in sequence files:

http://lists.open-bio.org/pipermail/biopython-dev/2006-July/002386.html

There was also a thread back a few years ago where Michael Hoffman was 
looking at timings for parsing Fasta files.

http://www.biopython.org/pipermail/biopython-dev/2003-October/001480.html

Jeffrey Chang wrote:
> That is a nice implementation.  However, Biopython already has at least 
> 3 Fasta parsers!
>    Bio/Fasta
>    Bio/SeqIO/FASTA
>    Bio/expressions/fasta
> 
> Bio/Fasta, the one you compared against, is easily the slowest one.  
> Bio/SeqIO/FASTA is very similar to your implementation and not likely 
> to be significantly faster or slower.  Bio/expressions/fasta uses 
> Martel.  I don't know how well that will perform.  The parsing part 
> should be blazingly fast (since it is mostly in C), but building the 
> object will be slow.  It might be a wash.
> 
> Jeff

Clearly we could try and consolidate these (while making things as nice 
as possible with deprecation warnings etc for existing code).

I've had a little read on the BioPerl SeqIO system:
http://www.bioperl.org/wiki/HOWTO:SeqIO

I agree with Marc that what we have in BioPython could (and should) be 
more organised.

Ideally (in my opinion) BioPython should be able to read sequences from 
multiple sequence file formats (e.g. Fasta, GenBank, EMBL, ...)
* using a standard interface
* into a standard object
* do this quickly

The resulting object should be able to hold additional information like 
annotation and (sub)features, the Bio.SeqRecord.SeqRecord object seems 
ideal.

It looks like we have:

(1) We have a number of format specific sequence reading modules (in 
particular Fasta and GenBank) which can read their particular file 
format into one or more different object representations.  These seem to 
be the best documented (in my opinion).

(2) We have a fairly generic (but relatively slow) framework in the 
Bio.FormatIO system which uses Martel expressions internally.  I have 
found Martel frustrating to debug, and especially slow with large 
individual records (like genomic GenBank files).  There is some 
documentation on this, e.g.

http://www.biopython.org/DIST/docs/cookbook/genbank_to_fasta.html

(3) We have the start of a generic "pure python" framework in the 
Bio.SeqIO system, but it needs some work (e.g. some doc strings, fixing 
the LargeFastaFormat class, GenBank support).

QUESTION: What do you all tend to use?  Should I draft a "questionnaire" 
to be posted on the main discussion list (and the announcements?).

Personally, I have been using Bio.Fasta and Bio.GenBank to read 
sequences.  I tend to only output Fasta files, and usually do this "by 
hand" as they are so simple and I want full control over the description 
lines.

Peter

From biopython-dev at maubp.freeserve.co.uk  Fri Jul 28 11:05:21 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Fri, 28 Jul 2006 16:05:21 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44CA162F.1040604@maubp.freeserve.co.uk>
References: <44CA162F.1040604@maubp.freeserve.co.uk>
Message-ID: <44CA27B1.30107@maubp.freeserve.co.uk>

Jeffrey Chang wrote:
> ...  However, Biopython already has at least 
> 3 Fasta parsers!
>    Bio/Fasta
>    Bio/SeqIO/FASTA
>    Bio/expressions/fasta
> 
> Bio/Fasta, the one you compared against, is easily the slowest one.  
> Bio/SeqIO/FASTA is very similar to your implementation and not likely 
> to be significantly faster or slower.  Bio/expressions/fasta uses 
> Martel.  I don't know how well that will perform.  The parsing part 
> should be blazingly fast (since it is mostly in C), but building the 
> object will be slow.  It might be a wash.

The following timings are for iterating over a large fasta file 
(Escherichia_coli_K12, NC_000913.ffn, with 5254 nucleotide CDS sequences).

The test script is attached, the test input is available here:

ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Escherichia_coli_K12/NC_000913.ffn

I used BioPython 1.42 with Python 2.3 on Windows XP on a laptop computer.

Apart from Fasta.RecordParser, these all return a SeqRecord object with 
a generic alphabet:

0.89s SeqIO.FASTA.FastaReader (for record in interator)
0.88s SeqIO.FASTA.FastaReader (iterator.next)
0.88s SeqIO.FASTA.FastaReader (iterator[i])

5.52s FormatIO/SeqRecord (for record in interator)
5.41s FormatIO/SeqRecord (iterator.next)

6.06s Fasta.RecordParser (for record in interator)
6.10s Fasta.SequenceParser (for record in interator)
6.27s Fasta.SequenceParser (iterator.next)

As you can see, SeqIO.FASTA.FastaReader (written in simple Python) is 
about six times faster than both of the Martel-based parsers.

I have tried this on a file with 2000 records and see a similar scaling.
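
For anyone wanting to repeat this kind of comparison without the
attachment, the sketch below shows the general shape. It is not the
attached test script and not the SeqIO.FASTA code: iter_fasta and
time_parser are names invented here, and the parser yields plain
(title, sequence) tuples rather than SeqRecord objects.

```python
import io
import time


def iter_fasta(handle):
    """Yield (title, sequence) tuples from a FASTA file handle."""
    title, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif title is not None:
            # Sequence lines are joined; surrounding whitespace is dropped.
            chunks.append(line.strip())
    if title is not None:
        yield title, "".join(chunks)


def time_parser(parser, text):
    """Return (record count, elapsed seconds) for one pass over text."""
    start = time.time()
    count = sum(1 for _ in parser(io.StringIO(text)))
    return count, time.time() - start


example = ">seq1 first\nACGT\nAC\n>seq2 second\nGGTT\n"
```

Calling time_parser(iter_fasta, example) returns the number of records
and the elapsed wall-clock time. For a real comparison the text would
come from a large file such as NC_000913.ffn and each parser would be
timed several times.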

Peter
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: test_fasta_methods.py
Url: http://lists.open-bio.org/pipermail/biopython-dev/attachments/20060728/93dbbbb7/attachment.pl 

From mdehoon at c2b2.columbia.edu  Sun Jul 30 21:20:50 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sun, 30 Jul 2006 21:20:50 -0400
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44CA162F.1040604@maubp.freeserve.co.uk>
References: <44CA162F.1040604@maubp.freeserve.co.uk>
Message-ID: <44CD5AF2.10708@c2b2.columbia.edu>

Thanks Peter.

Peter (BioPython Dev) wrote:
> QUESTION: What do you all tend to use?

I use the stuff in Bio.Fasta, but actually just because it's in the 
documentation. From your timings, and also because I'm not smart enough 
to be able to understand Martel, let alone maintain Martel-based 
parsers, I'm pretty much in favor of Bio.SeqIO.

>  Should I draft a "questionnaire" 
> to be posted on the main discussion list (and the announcements?).

By all means, yes. In the questionnaire, be sure to separate the issue 
of parser internals (Martel vs. pure Python) from the issue of how the 
results should be formatted (Fasta.Record or SeqRecord).

--Michiel

From lpritc at scri.sari.ac.uk  Mon Jul 31 05:59:47 2006
From: lpritc at scri.sari.ac.uk (Leighton Pritchard)
Date: Mon, 31 Jul 2006 10:59:47 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44CA27B1.30107@maubp.freeserve.co.uk>
References: <44CA162F.1040604@maubp.freeserve.co.uk>
	<44CA27B1.30107@maubp.freeserve.co.uk>
Message-ID: <1154339988.1490.81.camel@lplinuxdev>

Hi all,

On Fri, 2006-07-28 at 16:05 +0100, Peter (BioPython Dev) wrote:
> Jeffrey Chang wrote:
> > ...  However, Biopython already has at least 
> > 3 Fasta parsers!
> >    Bio/Fasta
> >    Bio/SeqIO/FASTA
> >    Bio/expressions/fasta
> > 
> > Bio/Fasta, the one you compared against, is easily the slowest one.  
> > Bio/SeqIO/FASTA is very similar to your implementation and not likely 
> > to be significantly faster or slower.  Bio/expressions/fasta uses 
> > Martel.  I don't know how well that will perform.  The parsing part 
> > should be blazingly fast (since it is mostly in C), but building the 
> > object will be slow.  It might be a wash.

Just to add to the confusion, when parsing large FASTA sequence files, I
have been using a home-rolled Flex/Pyrex parser (if you'd like a copy,
drop me a line).  I've used Peter's test framework on the same input
file (NC_000913.ffn), using BioPython 1.41, with Python 2.4 on Fedora
Core 3 (up-to-date, eh? ;) ) to get the following typical results:

4.07s FormatIO/SeqRecord (for record in interator)
4.05s FormatIO/SeqRecord (iterator.next)
0.32s SeqIO.FASTA.FastaReader (for record in interator)
0.30s SeqIO.FASTA.FastaReader (iterator.next)
0.31s SeqIO.FASTA.FastaReader (iterator[i])
5.53s Fasta.RecordParser (for record in interator)
5.00s Fasta.SequenceParser (for record in interator)
4.80s Fasta.SequenceParser (iterator.next)
0.18s SeqUtils/quick_FASTA_reader
0.11s pyfastaseqlexer/next_record
0.09s pyfastaseqlexer/quick_FASTA_reader
0.19s SeqUtils/quick_FASTA_reader (conversion to Seq)
0.14s pyfastaseqlexer/next_record (conversion to Seq)
0.11s pyfastaseqlexer/quick_FASTA_reader (conversion to Seq)
0.32s SeqUtils/quick_FASTA_reader (conversion to SeqRecord)
0.17s pyfastaseqlexer/next_record (conversion to SeqRecord)
0.16s pyfastaseqlexer/quick_FASTA_reader (conversion to SeqRecord)

pyfastaseqlexer is my Flex/Pyrex combination, which has a number of
methods for reading in FASTA sequences.  Here I've used the two that
correspond to the Bio.SeqUtils.quick_FASTA_reader method (overlooked in
the original list, but also included here for comparison), and Peter's
iterator method for his tests.  Since these extra methods don't return
Bio.Seq or Bio.SeqRecord objects, but instead lists of (name, sequence)
tuples, I've also included test functions that carry out the conversion
in Python, and their timings.

It's probably not a surprise that a dedicated Flex-based parser shows
such a dramatic speed improvement over the Martel-based parsers.  The
improvement over SeqIO.FASTA.FastaReader and SeqUtils.quick_FASTA_reader
is only marginal, though (a factor of approximately two when conversion
to SeqRecord is taken into account).  

Since we've been discussing the need to use only strings to represent
sequences recently, it's interesting to note that
SeqUtils.quick_FASTA_reader is about twice as fast as
SeqIO.FASTA.FastaReader if there is no conversion of sequences from
strings to Seq or SeqRecord objects.

While the Flex-based parser is the fastest in these tests, the time
saved is marginal unless a large FASTA file is being parsed.  Using a
file with over 72000 entries (Phytophthora infestans ESTs), my typical
timings become:

51.22s FormatIO/SeqRecord (for record in interator)
45.64s FormatIO/SeqRecord (iterator.next)
4.26s SeqIO.FASTA.FastaReader (for record in interator)
4.10s SeqIO.FASTA.FastaReader (iterator.next)
4.30s SeqIO.FASTA.FastaReader (iterator[i])
58.39s Fasta.RecordParser (for record in interator)
59.97s Fasta.SequenceParser (for record in interator)
58.70s Fasta.SequenceParser (iterator.next)
2.20s SeqUtils/quick_FASTA_reader
1.13s pyfastaseqlexer/next_record
0.56s pyfastaseqlexer/quick_FASTA_reader
2.20s SeqUtils/quick_FASTA_reader (conversion to Seq)
1.53s pyfastaseqlexer/next_record (conversion to Seq)
0.84s pyfastaseqlexer/quick_FASTA_reader (conversion to Seq)
2.97s SeqUtils/quick_FASTA_reader (conversion to SeqRecord)
2.11s pyfastaseqlexer/next_record (conversion to SeqRecord)
1.35s pyfastaseqlexer/quick_FASTA_reader (conversion to SeqRecord)

The Martel-based parsers become almost unworkable when dealing with
files of this size.  Note that the conversion of strings to SeqRecord
objects is pretty much a constant overhead for the Bio.SeqUtils and
pyfastaseqlexer methods (taking around 1s), but that there are
apparently additional overheads in the SeqIO.FASTA.FastaReader method.
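
The arithmetic behind that remark, as a small sketch using only the
numbers from the 72000-entry run above (the variable names are just
for this example):

```python
# Seconds from the 72000-entry run:
# (plain (name, sequence) tuples, after conversion to SeqRecord)
timings = {
    "SeqUtils/quick_FASTA_reader": (2.20, 2.97),
    "pyfastaseqlexer/next_record": (1.13, 2.11),
    "pyfastaseqlexer/quick_FASTA_reader": (0.56, 1.35),
}

# Cost of wrapping the strings in SeqRecord objects, per method.
overheads = {}
for name, (plain, as_seqrecord) in timings.items():
    overheads[name] = round(as_seqrecord - plain, 2)

# SeqIO.FASTA.FastaReader builds SeqRecords itself ("for record in" run).
fasta_reader = 4.26
extra = round(fasta_reader - timings["SeqUtils/quick_FASTA_reader"][1], 2)

for name in sorted(overheads):
    print("%-36s %.2fs" % (name, overheads[name]))
print("FastaReader beyond SeqUtils + SeqRecord: %.2fs" % extra)
```

The conversion costs come out at 0.77s, 0.98s and 0.79s, and
FastaReader takes a further 1.29s beyond the SeqUtils reader with
SeqRecord conversion.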

Of course, the hassles of including a Flex-based parser in a general
BioPython release probably outweigh the marginal time-saving benefits
(see MMCIFlex for details ;) ).  I think SeqIO.FASTA.FastaReader and
SeqUtils.quick_FASTA_reader do a good, quick job as it stands, and beat
the inclusion of a Flex-based parser hands-down in terms of
maintainability and portability.

L.

-- 
Dr Leighton Pritchard AMRSC
D131, Plant-Pathogen Interactions, Scottish Crop Research Institute
Invergowrie, Dundee, Scotland, DD2 5DA, UK
T: +44 (0)1382 562731 x2405 F: +44 (0)1382 568578
E: lpritc at scri.sari.ac.uk   W: http://bioinf.scri.sari.ac.uk/lp
GPG/PGP: FEFC205C E58BA41B  http://www.keyserver.net             


From biopython-dev at maubp.freeserve.co.uk  Mon Jul 31 06:36:00 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Mon, 31 Jul 2006 11:36:00 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44CD5AF2.10708@c2b2.columbia.edu>
References: <44CA162F.1040604@maubp.freeserve.co.uk>
	<44CD5AF2.10708@c2b2.columbia.edu>
Message-ID: <44CDDD10.4020904@maubp.freeserve.co.uk>

Michiel de Hoon wrote:
> Thanks Peter.
> 
> Peter wrote:
>>QUESTION: What do you all tend to use?
> 
> I use the stuff in Bio.Fasta, but actually just because it's in the 
> documentation.

Me too.

 > From your timings, and also because I'm not smart enough
> to be able to understand Martel, let alone maintain Martel-based 
> parsers, I'm pretty much in favor of Bio.SeqIO.

That was my gut instinct too.

Starting with Bio.SeqIO as a base, I've been "playing" with the code and 
have a rough "Sequence Iterator" class that supports iteration (provides 
a next() and __iter__() method),  as well as strictly increasing index 
access.
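
In outline the interface looks something like the sketch below. This is
a simplified illustration rather than the code I have been playing with:
it wraps any iterable of records, and the class and attribute names are
placeholders.

```python
class SequenceIterator(object):
    """Iterate over records, also allowing strictly increasing indexes."""

    def __init__(self, records):
        self._records = iter(records)
        self._index = -1  # index of the last record handed out

    def __iter__(self):
        return self

    def __next__(self):
        record = next(self._records)
        self._index += 1
        return record

    next = __next__  # older spelling, so iterator.next() also works

    def __getitem__(self, index):
        if index <= self._index:
            raise IndexError("index access must be strictly increasing")
        try:
            # Skip forward, discarding records before the requested one.
            while self._index < index:
                record = self.__next__()
        except StopIteration:
            raise IndexError("index %i is past the last record" % index)
        return record
```

So iterator[0] followed by iterator[2] works (record 1 is skipped and
lost), but asking for iterator[1] afterwards raises IndexError because
the file has already been read past that point.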

At the moment I have iterators returning SeqRecords for:
- Fasta Files
- GenBank features (returns the CDS features and their translations)
- GenBank files (with the features as SeqFeature objects)

There is code in Bio/SeqIO/general.py for a few more file formats which 
I haven't used yet.

This new GenBank iterator actually uses the current Bio.GenBank parser 
(with a slight tweak to how it acts once it reaches the end of a record).

Michiel de Hoon wrote:
>
 >Peter wrote:
>> Should I draft a "questionnaire" 
>> to be posted on the main discussion list (and the announcements?).
> 
> By all means, yes. In the questionnaire, be sure to separate the issue 
> of parser internals (Martel vs. pure Python) from the issue of how the 
> results should be formatted (Fasta.Record or SeqRecord).
> 

Draft questionnaire follows, I have included my comments for the record. 
  Too long?  Missing any important questions?

Peter

--

Introduction
============
There is some discussion on the Developer's Mailing list about 
BioPython's sequence input/output routines.

For example, it's a bit silly that there are three different Fasta 
reading routines in BioPython (even if only one of them, Bio.Fasta, is 
properly documented).

Note that we are not going to "just remove" any of the current 
functionality.  Some existing code may be re-written internally, while 
other code might be marked with a DeprecationWarning.

If you could answer the following questions that would help guide our 
choices.

Question One
============
Is reading sequence files an important function to you, and if so which 
file formats in particular (e.g. Fasta, GenBank, ...)

If you have had to write your own code to read a "common" file format 
which BioPython doesn't support, please get in touch.

Peter's answer:
 > I read Fasta and GenBank files mostly.  Also Clustalw alignments,
 > and Stockholm alignments.

Question Two - Reading Fasta Files
==================================
Which of the following do you currently use (and why)?:

(a) Bio.Fasta with the RecordParser (giving FastaRecord objects with a 
title, and the sequence as a string)
(b) Bio.Fasta with the SequenceParser (giving SeqRecord objects)
(c) Bio.Fasta with your own parser (Could you tell us more?)
(d) Bio.SeqIO.FASTA.FastaReader (giving SeqRecord objects)
(e) Bio.FormatIO (giving SeqRecord objects)
(f) Other (Could you tell us more?)

Peter's answer:
 > In most of my scripts I use Bio.Fasta with either the RecordParser or
 > SequenceParser.  I did look at Bio.FormatIO when I started but found
 > Bio.Fasta was much better documented (and a similar speed).  I have
 > only recently looked at Bio.SeqIO (hence this entire thread).

Question Three - index_file based dictionaries
==============================================
Do you use any of the following:
(a) Bio.Fasta.Dictionary
(b) Bio.GenBank.Dictionary
(c) Any other "Martel/Mindy" based dictionary which first requires 
creation of an index using the index_file function

If so, do you have any comments?

Peter's answer:
 > I do not use multi-record Genbank files (mine are single chromosomes).
 >
 > I have used Bio.Fasta.Dictionary but found dealing with the indexes
 > created by index_file to be annoying - especially when re-indexing
 > Fasta files which change often.
 >
 > I now use a simple wrapper function to load a Fasta file with an
 > iterator and build the dictionary in memory.  For me this is much
 > less hassle and the memory demands are not too great.

Question Four - Record Access...
================================
When loading a file with multiple sequences do you use:

(a) An iterator interface(e.g. Bio.Fasta.Iterator) to give you the 
records one by one in the order from the file.

(b) A dictionary interface (e.g. Bio.Fasta.Dictionary) to give you 
random access to the records using their identifier.

(c) A list giving random access by index number (e.g. load the records 
using an iterator but saving them in a list).

Do you have any additional comments on this?  For example, flexibility 
versus memory requirements.

For example, when I need random access to a Fasta file, I build a 
dictionary in memory (using an iterator) rather than messing about with 
the index_file based dictionary.

Peter's answer:
 > I usually deal with each record sequentially using an iterator.
 >
 > However, I often need random access using the record identifier and
 > for this I use a dictionary which I create in memory using an iterator.
 >
 > As stated in the question, I had tried using Bio.Fasta.Dictionary but
 > found dealing with the indexes created by index_file to be annoying,
 > especially having to re-index Fasta files which change often.
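
As an illustration of the in-memory approach, a wrapper along these
lines is enough. This is only a sketch: records_to_dict is a made-up
name, and it assumes the iterator yields (title, sequence) tuples with
non-empty titles whose first word is the identifier.

```python
def records_to_dict(records, key_function=None):
    """Build an identifier -> record dictionary from an iterator."""
    if key_function is None:
        # Default: the first word of the title line is the identifier.
        key_function = lambda record: record[0].split(None, 1)[0]
    lookup = {}
    for record in records:
        key = key_function(record)
        if key in lookup:
            # Silently overwriting would hide problems in the input file.
            raise ValueError("duplicate identifier %r" % key)
        lookup[key] = record
    return lookup
```

The whole file ends up in memory, which is the trade-off against an
index_file based dictionary, but nothing needs re-indexing when the
Fasta file changes.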


Question Five - Fasta files: FastaRecord or SeqRecord
=====================================================
If you use Fasta files, do you want records returned as FastaRecords 
or as SeqRecords?  If SeqRecords, do you use your own title2ids mapping?

For example,

 >name text text text
ACGTACACGT

As a FastaRecord this would have:

FastaRecord.title = "name text text text" (string)
FastaRecord.sequence = "ACGTACACGT" (string)

As a SeqRecord (with the default title2ids mapping):

SeqRecord.id = (default string)
SeqRecord.name = (default string)
SeqRecord.description = "name text text text" (string)
SeqRecord.seq = Seq("ACGTACACGT", alphabet)

Peter's answer:
 > For FASTA files I have usually used FastaRecord objects (with the
 > sequence as a string) but I have no strong preference.  Thinking of
 > the big picture it would be better to have every parser return
 > SeqRecords by default.

Question Six - GenBank files: GenbankRecord or SeqRecord
========================================================
If you use GenBank files, do you use:
(a) Bio.GenBank.FeatureParser which returns SeqRecord objects
(b) Bio.GenBank.RecordParser which returns Bio.GenBank.Record objects

Do you care much either way?  For me the only significant difference is 
that feature locations are held as objects in the SeqRecord, and as the 
raw string in the Record.

Peter's answer:
 > I have no strong preference - unless I wanted to manipulate the
 > feature locations.  I think there might be a performance difference...

Question Seven - Martel, Scanners and Consumers
===============================================
Some of BioPython's existing parsers (e.g. those using Martel) use an 
event/callback model, where the scanner component generates parsing 
events which are dealt with by the consumer component.

Do any of you use this system to modify existing parser behaviour, or 
use it as part of your own personal file parser?

(a) I don't know, or don't care.  I just use the parsers provided.
(b) I use this framework to modify a parser in order to do ... (please 
provide details).

Peter's answer
 > As a user I don't care about the internals.  I do care about what
 > gets used as the name/id/description for SeqRecords but that level
 > of flexibility is enough.
 >
 > As a BioPython contributor: Martel is scary.  I think I understand
 > the whole scanner/consumer model but don't see the point (unless
 > using an event-based scanner like Martel).  I suspect all the
 > function callbacks are one reason Martel parsers are slow.
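
For readers unfamiliar with the scanner/consumer split, the following is a minimal sketch of the idea for FASTA. It is not the Martel or Bio.Fasta code: the class names and the event names (start_record, title, sequence, end_record) are made up for illustration. It does show why the model costs at least one Python function call per line of input.

```python
import io


class FastaScanner:
    """Reads lines and reports parsing events to a consumer via callbacks."""

    def feed(self, handle, consumer):
        in_record = False
        for line in handle:
            line = line.rstrip("\r\n")
            if line.startswith(">"):
                if in_record:
                    consumer.end_record()
                consumer.start_record()
                consumer.title(line[1:])
                in_record = True
            elif line and in_record:
                consumer.sequence(line)
        if in_record:
            consumer.end_record()


class ListConsumer:
    """Builds (title, sequence) tuples from the scanner's events."""

    def __init__(self):
        self.records = []
        self._title = None
        self._chunks = []

    def start_record(self):
        self._title = None
        self._chunks = []

    def title(self, text):
        self._title = text

    def sequence(self, text):
        self._chunks.append(text)

    def end_record(self):
        self.records.append((self._title, "".join(self._chunks)))


consumer = ListConsumer()
FastaScanner().feed(io.StringIO(">name text\nACGT\nACGT\n>second\nTT\n"), consumer)
print(consumer.records)
```

Swapping in a different consumer changes what gets built from the same events, which is the flexibility the framework offers.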

Peter

From biopython-dev at maubp.freeserve.co.uk  Mon Jul 31 08:12:26 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Mon, 31 Jul 2006 13:12:26 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <1154339988.1490.81.camel@lplinuxdev>
References: <44CA162F.1040604@maubp.freeserve.co.uk>	<44CA27B1.30107@maubp.freeserve.co.uk>
	<1154339988.1490.81.camel@lplinuxdev>
Message-ID: <44CDF3AA.2020308@maubp.freeserve.co.uk>

Leighton Pritchard wrote:
> Just to add to the confusion, when parsing large FASTA sequence files, I
> have been using a home-rolled Flex/Pyrex parser (if you'd like a copy,
> drop me a line).  I've used Peter's test framework on the same input
> file (NC_000913.ffn), using BioPython 1.41, with Python 2.4 on Fedora
> Core 3 (up-to-date, eh? ;) ) to get the following typical results:

Times for NC_000913.ffn when returning SeqRecord objects:
> 4.07s FormatIO/SeqRecord (for record in interator)
> 4.05s FormatIO/SeqRecord (iterator.next)
> 5.00s Fasta.SequenceParser (for record in interator)
> 4.80s Fasta.SequenceParser (iterator.next)

 > 0.32s SeqIO.FASTA.FastaReader (for record in interator)
 > 0.30s SeqIO.FASTA.FastaReader (iterator.next)
 > 0.31s SeqIO.FASTA.FastaReader (iterator[i])

> 0.32s SeqUtils/quick_FASTA_reader (conversion to SeqRecord)
> 0.17s pyfastaseqlexer/next_record (conversion to SeqRecord)
> 0.16s pyfastaseqlexer/quick_FASTA_reader (conversion to SeqRecord)

And again, but for Phytophthora infestans ESTs with 72000 entries
 > 51.22s FormatIO/SeqRecord (for record in interator)
 > 45.64s FormatIO/SeqRecord (iterator.next)
 > 59.97s Fasta.SequenceParser (for record in interator)
 > 58.70s Fasta.SequenceParser (iterator.next)

 > 4.26s SeqIO.FASTA.FastaReader (for record in interator)
 > 4.10s SeqIO.FASTA.FastaReader (iterator.next)
 > 4.30s SeqIO.FASTA.FastaReader (iterator[i])

 > 2.97s SeqUtils/quick_FASTA_reader (conversion to SeqRecord)
 > 2.11s pyfastaseqlexer/next_record (conversion to SeqRecord)
 > 1.35s pyfastaseqlexer/quick_FASTA_reader (conversion to SeqRecord)

I imagine this file is much, much larger than what most of our users work 
with - but it does clearly show that the Martel parsers do not scale well.

Out of interest, are the sequences in this file split into multiple 
lines (e.g. max length 80) or are they all single (long) lines?  I would 
expect the latter to be quicker to load due to fewer string operations.

 > Of course, the hassles of including a Flex-based parser in a general
 > BioPython release probably outweigh the marginal time-saving benefits
 > (see MMCIFlex for details ;) ).  I think SeqIO.FASTA.FastaReader and
 > SeqUtils.quick_FASTA_reader do a good, quick job as it stands, and
 > beat the inclusion of a Flex-based parser hands-down in terms of
 > maintainability and portability.

I agree with you completely that we should avoid the Flex parser on 
those grounds, as we can get "close enough" with pure Python, 
especially if we do something about the overhead of Seq and SeqRecord 
objects.

I did some work on a brand new SeqIO over the weekend, and I got the 
Fasta iterator slightly quicker too.

The SeqUtils/quick_FASTA_reader is interesting in that it loads the 
entire file into memory in one go, and then parses it.  On the other 
hand it's not perfect: I would use "\n>" as the split marker rather than 
">", which could appear in the description of a sequence.

The iterator approach is probably slower but requires much less memory. 
  How big is your 72,000 entry file in MB?  Do we need to worry about 
the size of the raw file in memory - allowing the parsers to load it 
into memory could make things much faster...
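
The two approaches being compared can be sketched as follows. This is an illustration only, not the SeqUtils or SeqIO code: split_fasta_text and iter_fasta are hypothetical names. The first function holds the whole file in memory and splits on "\n>", and the second reads one line at a time and holds only the current record.

```python
import io


def split_fasta_text(text):
    """Whole-file approach: parse FASTA text that is already in memory."""
    text = text.lstrip()
    if not text.startswith(">"):
        return []
    records = []
    # Drop the first ">" and split only where ">" starts a line, so a ">"
    # inside a description does not start a new record.
    for chunk in text[1:].split("\n>"):
        lines = chunk.split("\n")
        title = lines[0].strip()
        sequence = "".join(line.strip() for line in lines[1:])
        records.append((title, sequence))
    return records


def iter_fasta(handle):
    """Iterator approach: yield one (title, sequence) pair at a time."""
    title = None
    chunks = []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title = line[1:]
            chunks = []
        elif title is not None:
            chunks.append(line.strip())
    if title is not None:
        yield title, "".join(chunks)


data = ">seq1 ratio a>b\nACGT\nAC\n>seq2\nGGTT\n"
print(split_fasta_text(data))
print(list(iter_fasta(io.StringIO(data))))
```

Both calls give the same two records for this input, whereas a plain data.split(">") would cut the first title in half at "a>b".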

Peter

From lpritc at scri.sari.ac.uk  Mon Jul 31 10:15:54 2006
From: lpritc at scri.sari.ac.uk (Leighton Pritchard)
Date: Mon, 31 Jul 2006 15:15:54 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44CDF3AA.2020308@maubp.freeserve.co.uk>
References: <44CA162F.1040604@maubp.freeserve.co.uk>
	<44CA27B1.30107@maubp.freeserve.co.uk>
	<1154339988.1490.81.camel@lplinuxdev>
	<44CDF3AA.2020308@maubp.freeserve.co.uk>
Message-ID: <1154355358.1490.116.camel@lplinuxdev>

On Mon, 2006-07-31 at 13:12 +0100, Peter (BioPython Dev) wrote:
> I imagine this file is much, much larger than what most of our users work 
> with - but it does clearly show that the Martel parsers do not scale well.

I noticed the scaling problem mostly for GenBank files.  Your new
GenBank parser is a welcome improvement in speed.

> Out of interest, are the sequences in this file split into multiple 
> lines (e.g. max length 80) or are they all single (long) lines?  I would 
> expect the latter to be quicker to load due to fewer string operations.

They're multiple lines with max length 50, and the whole file is 33 MB.
It's not the largest FASTA sequence file I'm working with; that one is
353 MB (530801 sequences; it's most of a eukaryotic genome, with
sequences split into multiple lines), so I ran your test script on it,
just to see what happened:

419.42s FormatIO/SeqRecord (for record in interator)
389.05s FormatIO/SeqRecord (iterator.next)
35.46s SeqIO.FASTA.FastaReader (for record in interator)
33.73s SeqIO.FASTA.FastaReader (iterator.next)
36.19s SeqIO.FASTA.FastaReader (iterator[i])
490.19s Fasta.RecordParser (for record in interator)
555.43s Fasta.SequenceParser (for record in interator)
546.87s Fasta.SequenceParser (iterator.next)
37.94s SeqUtils/quick_FASTA_reader
12.84s pyfastaseqlexer/next_record
6.06s pyfastaseqlexer/quick_FASTA_reader
24.08s SeqUtils/quick_FASTA_reader (conversion to Seq)
12.27s pyfastaseqlexer/next_record (conversion to Seq)
8.71s pyfastaseqlexer/quick_FASTA_reader (conversion to Seq)
24.20s SeqUtils/quick_FASTA_reader (conversion to SeqRecord)
18.10s pyfastaseqlexer/next_record (conversion to SeqRecord)
13.45s pyfastaseqlexer/quick_FASTA_reader (conversion to SeqRecord)

This is only one run - my patience has limits <grin>  Again, scaling is
a big problem for some methods.
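
Since a single run can be noisy, a small harness that repeats each measurement and keeps the fastest time is one way to firm up numbers like these. The sketch below is not the test script used above: best_time and count_records are hypothetical names, and the workload is a stand-in for a real parser.

```python
import time


def best_time(func, repeats=3):
    """Return the fastest wall-clock time, in seconds, over several calls."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


def count_records(text):
    """Stand-in workload: count FASTA title lines in a string."""
    return sum(1 for line in text.split("\n") if line.startswith(">"))


sample = ">seq\nACGT\n" * 10000
elapsed = best_time(lambda: count_records(sample))
print("%.4fs count_records" % elapsed)
```

Taking the minimum rather than the mean reduces the influence of disk caching and other processes on the first run.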

> The SeqUtils/quick_FASTA_reader is interesting in that it loads the 
> entire file into memory in one go, and then parses it.  On the other 
> hand it's not perfect: I would use "\n>" as the split marker rather than 
> ">" which could appear in the description of a sequence.

I agree (not that it's bitten me, yet), but I'd be inclined to go with
"%s>" % os.linesep as the split marker, just in case.

> Do we need to worry about the size of the raw file in memory - allowing the parsers to load it 
> into memory could make things much faster...

I use very few FASTA files where that would be a problem, so long as the
sequences remain as strings - when they're converted to
SeqRecords/SeqFeatures is where I start to get nervous about memory use.

L.

-- 
Dr Leighton Pritchard AMRSC
D131, Plant-Pathogen Interactions, Scottish Crop Research Institute
Invergowrie, Dundee, Scotland, DD2 5DA, UK
T: +44 (0)1382 562731 x2405 F: +44 (0)1382 568578
E: lpritc at scri.sari.ac.uk   W: http://bioinf.scri.sari.ac.uk/lp
GPG/PGP: FEFC205C E58BA41B  http://www.keyserver.net             
(If the signature does not verify, please remove the SCRI disclaimer)

From biopython-dev at maubp.freeserve.co.uk  Mon Jul 31 11:14:04 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Mon, 31 Jul 2006 16:14:04 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <1154355358.1490.116.camel@lplinuxdev>
References: <44CA162F.1040604@maubp.freeserve.co.uk>	
	<44CA27B1.30107@maubp.freeserve.co.uk>	
	<1154339988.1490.81.camel@lplinuxdev>	
	<44CDF3AA.2020308@maubp.freeserve.co.uk>
	<1154355358.1490.116.camel@lplinuxdev>
Message-ID: <44CE1E3C.2050502@maubp.freeserve.co.uk>

> 
> They're multiple lines with max length 50, and the whole file is 33Mb.
> It's not the largest FASTA sequence file I'm working with, that's 353Mb
> (530801 sequences, it's most of a eukaryotic genome with sequences split
> into multiple lines), so I ran your test script on it, just to see what
> happened:
> 
> 419.42s FormatIO/SeqRecord (for record in interator)
> 389.05s FormatIO/SeqRecord (iterator.next)
> 35.46s SeqIO.FASTA.FastaReader (for record in interator)
> 33.73s SeqIO.FASTA.FastaReader (iterator.next)
> 36.19s SeqIO.FASTA.FastaReader (iterator[i])
> 490.19s Fasta.RecordParser (for record in interator)
> 555.43s Fasta.SequenceParser (for record in interator)
> 546.87s Fasta.SequenceParser (iterator.next)
> 37.94s SeqUtils/quick_FASTA_reader
> 12.84s pyfastaseqlexer/next_record
> 6.06s pyfastaseqlexer/quick_FASTA_reader
> 24.08s SeqUtils/quick_FASTA_reader (conversion to Seq)
> 12.27s pyfastaseqlexer/next_record (conversion to Seq)
> 8.71s pyfastaseqlexer/quick_FASTA_reader (conversion to Seq)
> 24.20s SeqUtils/quick_FASTA_reader (conversion to SeqRecord)
> 18.10s pyfastaseqlexer/next_record (conversion to SeqRecord)
> 13.45s pyfastaseqlexer/quick_FASTA_reader (conversion to SeqRecord)
> 
> This is only one run - my patience has limits <grin>  Again, scaling is
> a big problem for some methods.

Interesting - but no big surprises, except maybe just how slow Martel 
is.  Did you notice whether it ran out of memory and had to page to the 
hard disk?

>>The SeqUtils/quick_FASTA_reader is interesting in that it loads the 
>>entire file into memory in one go, and then parses it.  On the other 
>>hand it's not perfect: I would use "\n>" as the split marker rather than 
>>">" which could appear in the description of a sequence.
> 
> I agree (not that it's bitten me, yet), but I'd be inclined to go with
> "%s>" % os.linesep as the split marker, just in case.

Good point.  I wonder how many people even know this function exists?

>>Do we need to worry about the size of the raw file in memory - allowing
>>the parsers to load it into memory could make things much faster...
> 
> I use very few FASTA files where that would be a problem, so long as the
> sequences remain as strings - when they're converted to
> SeqRecords/SeqFeatures is where I start to get nervous about memory use.

Maybe we should avoid loading entire files into memory while parsing - 
except for those formats like Clustal alignments where there is no real 
choice.

Have you got a feeling for the difference in memory required for a large 
Fasta file in memory as:
* Title string, sequence string
* Title string, sequence as Seq object
* SeqRecords (which include the sequence as a Seq object)

While it's overkill for simple file formats like FASTA, I think we do 
need a fairly high level object like the SeqRecord when dealing with 
things like Genbank/EMBL to hold the basic annotation and identifiers 
(id/name/description).

I am thinking that we should have a set of sequence parsers that all 
return SeqRecord objects (with format specific options in some cases to 
control the exact mapping of the data, e.g. title2ids for Fasta files).

And a matching set of sequence writers that take SeqRecord object(s) and 
write them to a file.
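
As a rough sketch of what the writer half of that pairing might look like for FASTA: the code below uses a small namedtuple as a stand-in for SeqRecord, and write_fasta with its width parameter is a hypothetical interface, not an existing Biopython function.

```python
import io
from collections import namedtuple

# Stand-in for a SeqRecord, holding only the fields a FASTA writer needs.
SimpleRecord = namedtuple("SimpleRecord", ["id", "description", "seq"])


def write_fasta(records, handle, width=60):
    """Write records to handle as FASTA, wrapping sequence lines at width."""
    count = 0
    for record in records:
        title = record.id
        if record.description:
            title = "%s %s" % (record.id, record.description)
        handle.write(">%s\n" % title)
        for start in range(0, len(record.seq), width):
            handle.write(record.seq[start:start + width] + "\n")
        count += 1
    return count


out = io.StringIO()
write_fasta([SimpleRecord("seq1", "example", "ACGTACGTAC")], out, width=4)
print(out.getvalue())
```

Because the writer only iterates over its input, it can be fed directly from a parser that yields records one at a time, without holding the whole file in memory.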

Such a mapping won't be perfect, so maybe there is still a place for 
"format specific representations" like the Record object in 
Bio.GenBank.Record

In the short term maybe we should just replace the internals of the 
current Bio.Fasta module with a pure Python implementation like that in 
Bio.SeqIO.FASTA - good idea?  Bad idea?

Peter

From f.schlesinger at iu-bremen.de  Mon Jul 31 12:07:08 2006
From: f.schlesinger at iu-bremen.de (Felix Schlesinger)
Date: Mon, 31 Jul 2006 18:07:08 +0200
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44CE1E3C.2050502@maubp.freeserve.co.uk>
References: <44CA162F.1040604@maubp.freeserve.co.uk>
	<44CA27B1.30107@maubp.freeserve.co.uk>
	<1154339988.1490.81.camel@lplinuxdev>
	<44CDF3AA.2020308@maubp.freeserve.co.uk>
	<1154355358.1490.116.camel@lplinuxdev>
	<44CE1E3C.2050502@maubp.freeserve.co.uk>
Message-ID: <7317d50c0607310907sc468843nfe3945225d2ace76@mail.gmail.com>

> Have you got a feeling for the difference in memory required for a large
> Fasta file in memory as:
> * Title string, sequence string
> * Title string, sequence as Seq object
> * SeqRecords (which include the sequence as a Seq object)

From looking at the code, the only difference should be one instance of
alphabet and one reference to it per sequence.
  The main difference is that Seq.data.method involves some Python,
while string.method is pure C code.

Felix

From mcolosimo at mitre.org  Mon Jul 31 12:08:50 2006
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Mon, 31 Jul 2006 12:08:50 -0400
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44CE1E3C.2050502@maubp.freeserve.co.uk>
References: <44CA162F.1040604@maubp.freeserve.co.uk>	
	<44CA27B1.30107@maubp.freeserve.co.uk>	
	<1154339988.1490.81.camel@lplinuxdev>	
	<44CDF3AA.2020308@maubp.freeserve.co.uk>
	<1154355358.1490.116.camel@lplinuxdev>
	<44CE1E3C.2050502@maubp.freeserve.co.uk>
Message-ID: <BB4CD5A6-B1C1-4F1B-B66C-B03763419D6D@mitre.org>


On Jul 31, 2006, at 11:14 AM, Peter (BioPython Dev) wrote:


>
>>> The SeqUtils/quick_FASTA_reader is interesting in that it loads the
>>> entire file into memory in one go, and then parses it.  On the other
>>> hand it's not perfect: I would use "\n>" as the split marker  
>>> rather than
>>> ">" which could appear in the description of a sequence.
>>
>> I agree (not that it's bitten me, yet), but I'd be inclined to go  
>> with
>> "%s>" % os.linesep as the split marker, just in case.
>
> Good point.  I wonder how many people even know this function exists?
>

The only problem with this is what happens if someone sends you a file  
that was not created on your system. I remember huge problems 5 or so  
years ago in BioPerl dealing with the Mac, Unix, and Windows  
line-ending issues. This has mostly simplified down to two - Unix and  
Windows - unless the person uses a Mac GUI app, some of which use \r  
(CR) instead of \n (LF), where Windows uses \r\n (CRLF). I think the  
standard Python distro comes with crlf.py and lfcr.py that can convert  
the line endings.
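
One way around the foreign-file problem is to normalise all three conventions to "\n" before splitting, so the split marker no longer depends on the local platform. A minimal sketch, with hypothetical function names:

```python
def normalize_newlines(text):
    """Convert Windows (CRLF) and old Mac (CR) line endings to Unix (LF)."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def split_records(text):
    """Split FASTA text into raw record chunks, whatever the line endings."""
    text = normalize_newlines(text).lstrip()
    if not text.startswith(">"):
        return []
    return text[1:].rstrip("\n").split("\n>")


samples = (
    ">a\nAC\n>b\nGT\n",          # Unix
    ">a\r\nAC\r\n>b\r\nGT\r\n",  # Windows
    ">a\rAC\r>b\rGT\r",          # old Mac
)
for raw in samples:
    print(split_records(raw))
```

All three inputs give the same two chunks. The order of the two replace calls matters: CRLF has to be handled first, otherwise each Windows line ending would turn into two newlines.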

> Maybe we should avoid loading entire files into memory while parsing -
> except for those formats like Clustal alignments where there is no  
> real
> choice.
>
> Have you got a feeling for the difference in memory required for a  
> large
> Fasta file in memory as:
> * Title string, sequence string
> * Title string, sequence as Seq object
> * SeqRecords (which include the sequence as a Seq object)
>
> While its overkill for simple file formats like FASTA, I think we do
> need a fairly high level object like the SeqRecord when dealing with
> things like Genbank/EMBL to hold the basic annotation and identifiers
> (id/name/description).
>
> I am thinking that we should have a set of sequence parsers that all
> return SeqRecord objects (with format specific options in some  
> cases to
> control the exact mapping of the data, e.g. title2ids for Fasta  
> files).
>
> And a matching set of sequence writers that take SeqRecord object 
> (s) and
> write them to a file.
>
> Such a mapping won't be perfect, so maybe there is still a place for
> "format specific representations" like the Record object in
> Bio.GenBank.Record
>
> In the short term maybe we should just replace the internals of the
> current Bio.Fasta module with a pure python implementation like  
> that in
> Bio.SeqIO.FASTA - good idea?  Bad idea?


I would keep them separate but change the documentation on the how-to  
site to point to using the Bio.SeqIO.FASTA since that is where I  
think we want people to start going. The code change to Bio.Fasta  
should be to add a deprecation warning.
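
For reference, a warning of that kind can be issued with the standard warnings module. The sketch below uses a hypothetical old_fasta_iterator function as the legacy entry point; it is not the actual Bio.Fasta code, and the message text is only an example.

```python
import warnings


def old_fasta_iterator(handle):
    """Hypothetical legacy entry point that still works but warns on use."""
    warnings.warn(
        "this function is deprecated; please use the newer FASTA reader",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller's line
    )
    return iter(handle)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    lines = list(old_fasta_iterator([">seq1\n", "ACGT\n"]))

print(caught[0].category.__name__)  # DeprecationWarning
```

Existing scripts keep running unchanged, and users only see the message pointing them at the replacement.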


Marc

From mdehoon at c2b2.columbia.edu  Mon Jul 31 13:34:41 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Mon, 31 Jul 2006 13:34:41 -0400
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <BB4CD5A6-B1C1-4F1B-B66C-B03763419D6D@mitre.org>
References: <44CA162F.1040604@maubp.freeserve.co.uk>		<44CA27B1.30107@maubp.freeserve.co.uk>		<1154339988.1490.81.camel@lplinuxdev>		<44CDF3AA.2020308@maubp.freeserve.co.uk>	<1154355358.1490.116.camel@lplinuxdev>	<44CE1E3C.2050502@maubp.freeserve.co.uk>
	<BB4CD5A6-B1C1-4F1B-B66C-B03763419D6D@mitre.org>
Message-ID: <44CE3F31.2080404@c2b2.columbia.edu>

Marc Colosimo wrote:
> On Jul 31, 2006, at 11:14 AM, Peter (BioPython Dev) wrote:
>> In the short term maybe we should just replace the internals of the
>> current Bio.Fasta module with a pure python implementation like  
>> that in
>> Bio.SeqIO.FASTA - good idea?  Bad idea?
> 
> I would keep them separate but change the documentation on the how-to  
> site to point to using the Bio.SeqIO.FASTA since that is where I  
> think we want people to start going. The code change to Bio.Fasta  
> should be to add a deprecation warning.

I agree with Marc here. No need to modify Bio.Fasta if it's on its way out.

--Michiel.

-- 
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032

From biopython-dev at maubp.freeserve.co.uk  Mon Jul 31 13:41:49 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Mon, 31 Jul 2006 18:41:49 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <BB4CD5A6-B1C1-4F1B-B66C-B03763419D6D@mitre.org>
References: <44CA162F.1040604@maubp.freeserve.co.uk>		<44CA27B1.30107@maubp.freeserve.co.uk>		<1154339988.1490.81.camel@lplinuxdev>		<44CDF3AA.2020308@maubp.freeserve.co.uk>	<1154355358.1490.116.camel@lplinuxdev>	<44CE1E3C.2050502@maubp.freeserve.co.uk>
	<BB4CD5A6-B1C1-4F1B-B66C-B03763419D6D@mitre.org>
Message-ID: <44CE40DD.3010101@maubp.freeserve.co.uk>

Peter wrote:
>>In the short term maybe we should just replace the internals of the
>>current Bio.Fasta module with a pure python implementation like  
>>that in Bio.SeqIO.FASTA - good idea?  Bad idea?

Marc wrote:
> I would keep them separate but change the documentation on the how-to  
> site to point to using the Bio.SeqIO.FASTA since that is where I  
> think we want people to start going. The code change to Bio.Fasta  
> should be to add a deprecation warning.

Certainly long term we could do that.  There may be advantages to the 
current very flexible Bio.Fasta code that the SeqIO replacement may not 
offer (e.g. if we focus on just parsing into SeqRecords).

Short Term
----------
Right now I guess most people dealing with Fasta files will be using 
Bio.Fasta, and it is very slow, hence bug 2058:

http://bugzilla.open-bio.org/show_bug.cgi?id=2058

My patch makes Bio.Fasta almost as fast as Bio.SeqIO.FASTA according to 
my tests (modest sized files).

Could any of you try this patch on your machines, on the off chance 
that it causes problems for any existing code?  It does pass 
test_Fasta.py and test_Fasta2.py on Windows at least.

Medium/Long Term
----------------
We need to sort out what to do with Bio.SeqIO, as currently the existing 
code in Bio/SeqIO/generic.py and Bio/SeqIO/FASTA.py uses different 
interfaces.  But I do agree that something like that should be OK.

I have been working on a possible replacement (but it doesn't seem to 
have made it to the mailing list yet - must check my recent email).

Peter

From mdehoon at c2b2.columbia.edu  Sun Jul  2 04:43:47 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sun, 02 Jul 2006 00:43:47 -0400
Subject: [Biopython-dev] Fasta parser
In-Reply-To: <1F97379A556D0946AAEFE3F63FD6F5744D46A5@MAIL.burnham.org>
References: <44A6ED70.9080204@c2b2.columbia.edu>
	<1F97379A556D0946AAEFE3F63FD6F5744D46A5@MAIL.burnham.org>
Message-ID: <44A74F03.8020801@c2b2.columbia.edu>

Thanks Iddo!
I tried the parser in Bio.SeqIO.FASTA and it is indeed a lot faster than 
the Martel-based one in Bio.Fasta.

It would be nice to merge these two modules. However, it raises a bunch 
of design questions (such as Fasta.Record versus SeqRecord, and Seq 
versus string), so it's probably better to wait with that until after 
the next Biopython release. Which, by the way, will be coming up soon.

Thanks,

--Michiel.

Iddo Friedberg wrote:
> Michiel,
> 
> There is actually a simple minded fasta reader/writer  that does not use 
> Martel. Bio.SeqIO.FASTA
> 
> ./I
> 
> --
> Iddo Friedberg, PhD
> Burnham Institute for Medical Research
> 10901 N. Torrey Pines Rd.
> La Jolla, CA 92037 USA
> T: +1 858 646 3100 x3516
> http://iddo-friedberg.org
> http://BioFunctionPrediction.org
> 
> 
> 
> -----Original Message-----
> From: biopython-dev-bounces at lists.open-bio.org on behalf of Michiel de Hoon
> Sent: Sat 7/1/2006 2:47 PM
> To: biopython-dev at biopython.org
> Subject: [Biopython-dev] Fasta parser
> 
> Hi everybody,
> 
> The Biopython shows the following approach to parsing a Fasta file:
> 
>  >>> from Bio import Fasta
>  >>> parser = Fasta.RecordParser()
>  >>> file = open("ls_orchid.fasta")
>  >>> iterator = Fasta.Iterator(file, parser)
>  >>> cur_record = iterator.next()
> 
> But for large Fasta files, it's very slow, compared to file.read(),
> which may be due to going through Martel (I believe the same was true
> for large GenBank files).
> 
> So I'm thinking about writing a simple-minded Fasta parser for better
> performance with large files. What I'm wondering about:
> 1) Is there some advantage that I overlooked of using Martel for parsing
> Fasta files?
> 2) Why is it necessary to create a parser first and passing it to
> Fasta.Iterator? Are there any cases where Fasta.Iterator uses something
> other than a Fasta.RecordParser?
> 
> --Michiel.


From idoerg at burnham.org  Sun Jul  2 04:48:50 2006
From: idoerg at burnham.org (Iddo Friedberg)
Date: Sat, 1 Jul 2006 21:48:50 -0700
Subject: [Biopython-dev] Fasta parser
References: <44A6ED70.9080204@c2b2.columbia.edu>
	<1F97379A556D0946AAEFE3F63FD6F5744D46A5@MAIL.burnham.org>
	<44A74F03.8020801@c2b2.columbia.edu>
Message-ID: <1F97379A556D0946AAEFE3F63FD6F5744D46A6@MAIL.burnham.org>

By (lack of?) design, my own Biopython-using code seems to use both the Martel and non-Martel parsers. I imagine others may be in the same position. Point being: any design change should make sure that we stay backward compatible.

Thanks very much for your work on the Biopython release.

Cheers,

./I

--
Iddo Friedberg, PhD
Burnham Institute for Medical Research
10901 N. Torrey Pines Rd.
La Jolla, CA 92037 USA
T: +1 858 646 3100 x3516
http://iddo-friedberg.org
http://BioFunctionPrediction.org


-----Original Message-----
From: Michiel de Hoon [mailto:mdehoon at c2b2.columbia.edu]
Sent: Sat 7/1/2006 9:43 PM
To: Iddo Friedberg
Cc: biopython-dev at biopython.org
Subject: Re: [Biopython-dev] Fasta parser
 
Thanks Iddo!
I tried the parser in Bio.SeqIO.FASTA and it is indeed a lot faster than 
the Martel-based one in Bio.Fasta.

It would be nice to merge these two modules. However, it raises a bunch 
of design questions (such as Fasta.Record versus SeqRecord, and Seq 
versus string), so it's probably better to wait with that until after 
the next Biopython release. Which, by the way, will be coming up soon.

Thanks,

--Michiel.

Iddo Friedberg wrote:
> Michiel,
> 
> There is actually a simple minded fasta reader/writer  that does not use 
> Martel. Bio.SeqIO.FASTA
> 
> ./I
> 
> --
> Iddo Friedberg, PhD
> Burnham Institute for Medical Research
> 10901 N. Torrey Pines Rd.
> La Jolla, CA 92037 USA
> T: +1 858 646 3100 x3516
> http://iddo-friedberg.org
> http://BioFunctionPrediction.org
> 
> 
> 
> -----Original Message-----
> From: biopython-dev-bounces at lists.open-bio.org on behalf of Michiel de Hoon
> Sent: Sat 7/1/2006 2:47 PM
> To: biopython-dev at biopython.org
> Subject: [Biopython-dev] Fasta parser
> 
> Hi everybody,
> 
> The Biopython shows the following approach to parsing a Fasta file:
> 
>  >>> from Bio import Fasta
>  >>> parser = Fasta.RecordParser()
>  >>> file = open("ls_orchid.fasta")
>  >>> iterator = Fasta.Iterator(file, parser)
>  >>> cur_record = iterator.next()
> 
> But for large Fasta files, it's very slow, compared to file.read(),
> which may be due to going through Martel (I believe the same was true
> for large GenBank files).
> 
> So I'm thinking about writing a simple-minded Fasta parser for better
> performance with large files. What I'm wondering about:
> 1) Is there some advantage that I overlooked of using Martel for parsing
> Fasta files?
> 2) Why is it necessary to create a parser first and passing it to
> Fasta.Iterator? Are there any cases where Fasta.Iterator uses something
> other than a Fasta.RecordParser?
> 
> --Michiel.


From mdehoon at c2b2.columbia.edu  Sun Jul  2 14:58:35 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sun, 02 Jul 2006 10:58:35 -0400
Subject: [Biopython-dev] New Biopython release coming up
Message-ID: <44A7DF1B.1000008@c2b2.columbia.edu>

Hi everybody,

The next Biopython release (1.42, code-named "Brooklyn") is coming up. 
I'm planning to finish this release about two weeks from now. The tests 
of Biopython in CVS all pass, so we are doing well. However, there are 
25 bugs listed in Bugzilla, so please have a look to see if there's 
something we can do about them. If you have some code sitting around, 
now would be a good time to commit it to CVS. However, if you are not 
sure if your code is ready for prime time, please hold off until after 
this release. Also, if you have a cvs checkout of Biopython, please make 
sure to update it before doing any commits to avoid overwriting.

Thanks everybody for your contributions to Biopython.

--Michiel.


From biopython-dev at maubp.freeserve.co.uk  Sun Jul  2 18:11:47 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Sun, 02 Jul 2006 19:11:47 +0100
Subject: [Biopython-dev] New Biopython release coming up
In-Reply-To: <44A7DF1B.1000008@c2b2.columbia.edu>
References: <44A7DF1B.1000008@c2b2.columbia.edu>
Message-ID: <44A80C63.7060809@maubp.freeserve.co.uk>

Michiel de Hoon wrote:
> Hi everybody,
> 
> The next Biopython release (1.42, code-named "Brooklyn") is coming up. 
> I'm planning to finish this release about two weeks from now. The tests 
> of Biopython in CVS all pass, so we are doing well. However, there are 
> 25 bugs listed in Bugzilla, so please have a look to see if there's 
> something we can do about them. If you have some code sitting around, 
> now would be a good time to commit it to CVS. However, if you are not 
> sure if your code is ready for prime time, please hold off until after 
> this release. Also, if you have a cvs checkout of Biopython, please make 
> sure to update it before doing any commits to avoid overwriting.
> 
> Thanks everybody for your contributions to Biopython.
> 
> --Michiel.

Sounds like a good plan Michiel

Did anyone get back to you about the NCBI BLAST XML format?  I would say 
parsing BLAST output is a fairly important feature to a lot of users (I 
may of course be biased)...

Getting down to specifics:

Bugzilla Bug 1997 VARCHAR too small in SCOP tables
http://bugzilla.open-bio.org/show_bug.cgi?id=1997
Suggested fix looked OK to me, but as I've never used SCOP a second 
opinion would be wise.

Bugzilla Bug 1987 Alphabet.Gapped does not retain gap character
http://bugzilla.open-bio.org/show_bug.cgi?id=1987
I have attached a suggested patch, second opinion welcome

Bugzilla Bug 1981 GenBank parser generates unusual feature qualifiers.
http://bugzilla.open-bio.org/show_bug.cgi?id=1981
A question about the white space in GenBank comments etc.  Changing this 
is probably harmless, but we are already making a big change internally 
with the move away from Martel, so I would rather postpone any further 
change until after the next release.

Bugzilla Bug 1936 Bio.PDB.PDBParser throws PDBException, 'No parent' in 
error
http://bugzilla.open-bio.org/show_bug.cgi?id=1936
One for Thomas Hamelryck which on the face of it looks fairly simple.

Bugzilla Bug 1946 Parsing GenBank Files - unknown line type PROJECT
Does anyone use the new project line?  Would a simple string be enough 
to store this?

Peter


From mcolosimo at mitre.org  Sun Jul  2 18:36:22 2006
From: mcolosimo at mitre.org (Colosimo, Marc E.)
Date: Sun, 02 Jul 2006 14:36:22 -0400
Subject: [Biopython-dev] Fasta parser
In-Reply-To: <44A6ED70.9080204@c2b2.columbia.edu>
Message-ID: <C0CD8A66.8A18%mcolosimo@mitre.org>


On 7/1/06 5:47 PM, "Michiel de Hoon" <mdehoon at c2b2.columbia.edu> wrote:

> Hi everybody,
> 
> The Biopython shows the following approach to parsing a Fasta file:
> 
>>>> from Bio import Fasta
>>>> parser = Fasta.RecordParser()
>>>> file = open("ls_orchid.fasta")
>>>> iterator = Fasta.Iterator(file, parser)
>>>> cur_record = iterator.next()
> 
> But for large Fasta files, it's very slow, compared to file.read(),
> which may be due to going through Martel (I believe the same was true
> for large GenBank files).
> 
> So I'm thinking about writing a simple-minded Fasta parser for better
> performance with large files. What I'm wondering about:
> 1) Is there some advantage that I overlooked of using Martel for parsing
> Fasta files?
> 2) Why is it necessary to create a parser first and passing it to
> Fasta.Iterator? Are there any cases where Fasta.Iterator uses something
> other than a Fasta.RecordParser?

Yes!!!! I use Fasta.SequenceParser which gives me a SeqRecord Object
(Bio.SeqRecord) not some odd Fasta.Record Object that I would have to then
remap into a SeqRecord.

Also, could someone re-run epydoc! My changes in the code have not made it
to the on-line API docs.

Marc


From mcolosimo at mitre.org  Sun Jul  2 19:12:23 2006
From: mcolosimo at mitre.org (Colosimo, Marc E.)
Date: Sun, 02 Jul 2006 15:12:23 -0400
Subject: [Biopython-dev] BioPython Design
In-Reply-To: <44A74F03.8020801@c2b2.columbia.edu>
Message-ID: <C0CD92D7.8A1B%mcolosimo@mitre.org>

Michiel,

When will this next release be made and what is going into it?

Since you brought up the issue of design questions, I'll have my little rant
now. But first, I would like to say that I think it is great that people
contribute code and more importantly their time to this project. Without
all of the core developers there would be no BioPython. So, kudos to anyone
who has contributed code. Now on to my rant....

<rant>
I'm not a big user of either BioPerl or BioJava. However, they are well
structured and more consistent than BioPython. This FastaIO issue is one of
several design issues that really need to be addressed.

For example, both BioPerl and BioJava use a SeqIO object structure. Our
SeqIO module is heavily underused. For example, we have Fasta, GenBank,
LocusLink, NBRF, SwissProt, and UniGene as separate main modules.
Interestingly, there is a writers.SeqRecord.embl but I can't quickly find
something to read in an embl file!

Just look at what BioPerl can read in
<http://www.bioperl.org/wiki/HOWTO:SeqIO> and how easy it is to find this
out (even without the doc page, all of these are listed under
Bio::SeqIO::*)

There is a very short "Coding Convention"
<http://biopython.org/wiki/Contributing#Coding_conventions>, which doesn't
seem to be followed all that well.

My suggestion is if enough people are going to ISMB this year (which I am
not), that time should be made to think about a road map for BioPython.

My suggestions are:
1) split off a branch for ver 2.0 that supports Python 2.4 only (this would
suck for Mac people, like me, but it's time to move on)
2) clean house - remove deprecated items, restructure IO, etc...
3) move to SciPy/NumPy versus Numeric (could try "numpy/lib/convertcode.py")
4) use Cheese Shop for missing modules
5) documentation

</rant>

marc

On 7/2/06 12:43 AM, "Michiel de Hoon" <mdehoon at c2b2.columbia.edu> wrote:

> Thanks Iddo!
> I tried the parser in Bio.SeqIO.FASTA and it is indeed a lot faster than
> the Martel-based one in Bio.Fasta.
> 
> It would be nice to merge these two modules. However, it raises a bunch
> of design questions (such as Fasta.Record versus SeqRecord, and Seq
> versus string), so it's probably better to wait with that until after
> the next Biopython release. Which, by the way, will be coming up soon.
> 
> Thanks,
> 
> --Michiel.
> 
> Iddo Friedberg wrote:
>> Michiel,
>> 
>> There is actually a simple minded fasta reader/writer  that does not use
>> Martel. Bio.SeqIO.FASTA
>> 
>> ./I
>> 
>> --
>> Iddo Friedberg, PhD
>> Burnham Institute for Medical Research
>> 10901 N. Torrey Pines Rd.
>> La Jolla, CA 92037 USA
>> T: +1 858 646 3100 x3516
>> http://iddo-friedberg.org
>> http://BioFunctionPrediction.org
>> 
>> 
>> 
>> -----Original Message-----
>> From: biopython-dev-bounces at lists.open-bio.org on behalf of Michiel de Hoon
>> Sent: Sat 7/1/2006 2:47 PM
>> To: biopython-dev at biopython.org
>> Subject: [Biopython-dev] Fasta parser
>> 
>> Hi everybody,
>> 
>> The Biopython shows the following approach to parsing a Fasta file:
>> 
>>>>> from Bio import Fasta
>>>>> parser = Fasta.RecordParser()
>>>>> file = open("ls_orchid.fasta")
>>>>> iterator = Fasta.Iterator(file, parser)
>>>>> cur_record = iterator.next()
>> 
>> But for large Fasta files, it's very slow, compared to file.read(),
>> which may be due to going through Martel (I believe the same was true
>> for large GenBank files).
>> 
>> So I'm thinking about writing a simple-minded Fasta parser for better
>> performance with large files. What I'm wondering about:
>> 1) Is there some advantage that I overlooked of using Martel for parsing
>> Fasta files?
>> 2) Why is it necessary to create a parser first and passing it to
>> Fasta.Iterator? Are there any cases where Fasta.Iterator uses something
>> other than a Fasta.RecordParser?
>> 
>> --Michiel.
>> _______________________________________________
>> Biopython-dev mailing list
>> Biopython-dev at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>> 
> 
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev


From mdehoon at c2b2.columbia.edu  Sun Jul  2 20:54:27 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sun, 02 Jul 2006 16:54:27 -0400
Subject: [Biopython-dev] Fasta parser
In-Reply-To: <C0CD8A66.8A18%mcolosimo@mitre.org>
References: <C0CD8A66.8A18%mcolosimo@mitre.org>
Message-ID: <44A83283.4060401@c2b2.columbia.edu>

>> 2) Why is it necessary to create a parser first and passing it to
>> Fasta.Iterator? Are there any cases where Fasta.Iterator uses something
>> other than a Fasta.RecordParser?
> 
> Yes!!!! I use Fasta.SequenceParser which gives me a SeqRecord Object
> (Bio.SeqRecord) not some odd Fasta.Record Object that I would have to then
> remap into a SeqRecord.

I see. This is one of the design issues I ran into when comparing 
Bio.Fasta and Bio.SeqIO.FASTA: Whether parsing a Fasta file should 
result in a Fasta.Record object or a SeqRecord.
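
The design question above (which object a Fasta parser should hand back) can be separated from the parsing itself. The following is only a minimal sketch of a "simple-minded" parser in plain Python, not the code in Bio.Fasta or Bio.SeqIO.FASTA; the names iter_fasta and make_record are made up for illustration. The caller passes in whatever record factory it wants, so the same loop could produce a Fasta.Record-like object, a SeqRecord-like object, or a plain tuple.

```python
from io import StringIO


def iter_fasta(handle, make_record=None):
    """Yield one record per FASTA entry read from an open text handle.

    make_record is a hypothetical hook: a callable taking (title, sequence)
    and returning whatever record object the caller prefers.
    """
    if make_record is None:
        make_record = lambda title, sequence: (title, sequence)
    title = None
    chunks = []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new title line closes the previous record, if there was one.
            if title is not None:
                yield make_record(title, "".join(chunks))
            title = line[1:]
            chunks = []
        elif title is not None:
            chunks.append(line.strip())
    # Emit the final record once the file is exhausted.
    if title is not None:
        yield make_record(title, "".join(chunks))


example = ">seq1 first\nACGT\nTTGA\n>seq2\nGGCC\n"
as_tuples = list(iter_fasta(StringIO(example)))
as_dicts = list(iter_fasta(StringIO(example),
                           lambda t, s: {"title": t, "seq": s}))
```

Because no intermediate parse tree is built, a loop like this does little more work than reading the lines, which is presumably why the non-Martel reader is faster.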

> Also, could someone re-run epydoc! My changes in the code have not made it
> to the on-line API docs.

Done.

--Michiel.


From mdehoon at c2b2.columbia.edu  Sun Jul  2 21:19:46 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sun, 02 Jul 2006 17:19:46 -0400
Subject: [Biopython-dev] BioPython Design
In-Reply-To: <C0CD92D7.8A1B%mcolosimo@mitre.org>
References: <C0CD92D7.8A1B%mcolosimo@mitre.org>
Message-ID: <44A83872.4070209@c2b2.columbia.edu>

Colosimo, Marc E. wrote:
> When will this next release be made ...
I'm planning for the weekend of 15/16 July.

> ... and what is going into it?
Whatever is in CVS at that time. So essentially today's CVS plus as many 
bug fixes as possible. I'd hold off on any major changes until after 
the release.

> <rant>
> </rant>

I pretty much agree with Marc here.

 > My suggestion is if enough people are going to ISMB this year
 > (which I am not), that time should be made to think about a
 > road map for BioPython.

Unfortunately, I won't be going either. A Biopython road map seems like 
a good idea though.

 > My suggestions are:
 > 1) split off a branch for ver 2.0 that supports Python 2.4 only
 > (this would suck for Mac people, like me, but its time to move on)

Is there something essential in 2.4 that's missing in 2.3? Not that I 
object to supporting 2.4 only, I'm just wondering. Though I'd be 
hesitant to split off a separate branch, since Biopython is confusing 
enough already as it is.

Btw, I am running Python 2.4 on Mac OS X, and AFAICT there is no problem 
for Mac users if we support 2.4 only.

 > 2) clean house - remove depreciated items, restructure IO, etc...

I totally agree.

 > 3) move to SciPy/NumPy versus Numeric
 > (could try "numpy/lib/convertcode.py")

Here, I'm a bit hesitant. SciPy does not have a good track record in 
terms of portability. The latest version of numpy looks better though 
(it compiled without problems on all platforms I tried). But I don't 
really want to pay $40 for the documentation.

 > 4) use Cheese Shop for missing modules
 > 5) documentation

My guess is that maintaining the documentation will be easier once we 
have cleaned up Biopython.

--Michiel.


From mdehoon at c2b2.columbia.edu  Mon Jul  3 01:21:00 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sun, 02 Jul 2006 21:21:00 -0400
Subject: [Biopython-dev] New Biopython release coming up
In-Reply-To: <44A80C63.7060809@maubp.freeserve.co.uk>
References: <44A7DF1B.1000008@c2b2.columbia.edu>
	<44A80C63.7060809@maubp.freeserve.co.uk>
Message-ID: <44A870FC.4060909@c2b2.columbia.edu>

Peter wrote:
> Did anyone get back to you about the NBCI Blast XML format?  I would say 
> parsing blast output is a fairly important feature to a lot of users (I 
> may of course be biased)...
No response yet, but I'll ask them again before the upcoming release. 
The existing XML parser still works as advertised for single blast 
searches. For multiple blast searches, people will have to run a 
previous version of blast locally.

> Bugzilla Bug 1997 VARCHAR too small in SCOP tables
> http://bugzilla.open-bio.org/show_bug.cgi?id=1997
> Suggested fix looked OK to me, but as I've never used SCOP as second 
> opinion would be wise.

This one looks fine to me, but I'm not a SCOP user either.

> Bugzilla Bug 1987 Alphabet.Gapped does not retain gap character
> http://bugzilla.open-bio.org/show_bug.cgi?id=1987
> I have attached a suggested patch, second opinion welcome

While the patch looks fine, I have no idea what this code is supposed 
to do, or why it needs to be so complicated.

> Bugzilla Bug 1946 Parsing GenBank Files - unknown line type PROJECT
> Does anyone use the new project line?  Would a simple string be enough 
> to store this?
> 
 From NCBI's description, it appears they're not quite sure yet what 
this project line should look like (note that the project line in the 
description is different from the project line in the GenBank file: 
GenomeProject vs. GENOME_PROJECT). I would just store the line in a 
simple string, and do something more fancy once we know the proper format.
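
A rough sketch of what "just store the line in a simple string" could look like. This is not the GenBank parser's actual code: the function name, the list of known keywords, and the example header values are all invented for illustration. Unknown line types are kept verbatim in a separate dictionary instead of raising an error.

```python
def parse_header_lines(lines, known=("LOCUS", "DEFINITION", "ACCESSION")):
    """Split header lines into parsed fields and raw, not-yet-understood ones.

    Hypothetical helper: the keyword is everything up to the first space,
    and the rest of the line is stored as a plain string.
    """
    known_fields = {}
    raw_fields = {}
    for line in lines:
        keyword, _, value = line.partition(" ")
        target = known_fields if keyword in known else raw_fields
        target[keyword] = value.strip()
    return known_fields, raw_fields


# Made-up header lines; the PROJECT value is only a placeholder.
header = ["LOCUS       AB000001", "PROJECT     GenomeProject:12345"]
known_fields, raw_fields = parse_header_lines(header)
```

Once the format settles, the raw string can be replaced by something more structured without changing how the rest of the record is read.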

My 2¢.

--Michiel.


From idoerg at burnham.org  Mon Jul  3 17:52:44 2006
From: idoerg at burnham.org (Iddo Friedberg)
Date: Mon, 03 Jul 2006 10:52:44 -0700
Subject: [Biopython-dev] [Fwd: [OBF] Call For Birds of a Feather Suggestions]
Message-ID: <44A9596C.90208@burnham.org>


The BOSC organizing committee is currently seeking suggestions for Birds
of a Feather meeting ideas. Birds of a Feather meetings are one of the
more popular activities at BOSC, occurring at the end of each day's
session. These are free-form meetings organized by the attendees
themselves to discuss one or a few topics of interest in greater detail.
BOFs have been formed to allow developers and users of individual OBF
software to meet each other face-to-face to discuss the project, or to
discuss completely new ideas, and even start new software development
projects. These meetings offer a unique opportunity for individuals to
explore more about the activities of the various Open Source Projects,
and, in some cases, even take an active role influencing the future of
Open Source Software development. If you would like to create a BOF,
just sign up for a wiki account, log in, and edit the BOSC 2006 Birds of
a Feather page:

<http://www.open-bio.org/wiki/BOSC_2006/Birds-of-a-Feather>
_______________________________________________
Open-Bioinformatics-Foundation mailing list
Open-Bioinformatics-Foundation at lists.open-bio.org

This is a broadcast-only announce list used to distribute emails to people who subscribe to OBF hosted email discussion or announce lists. To prevent our most active members from getting many duplicate copies of important announcements we created this list today so that only one email gets sent to each subscribed email address. You do not need to subscribe/unsubscribe from this list. Problems or Concerns? -- send an email to the OBF mailteam at: mailteam at open-bio.org 


-- 
Iddo Friedberg, Ph.D.
Burnham Institute for Medical Research
10901 N. Torrey Pines Rd.
La Jolla, CA 92037 USA
Tel: +1 (858) 646 3100 x3516
Fax: +1 (858) 713 9949
http://iddo-friedberg.org
http://BioFunctionPrediction.org


From biopython-dev at maubp.freeserve.co.uk  Thu Jul  6 09:06:07 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Thu, 06 Jul 2006 10:06:07 +0100
Subject: [Biopython-dev] New Biopython release coming up / Alphabets
In-Reply-To: <44A870FC.4060909@c2b2.columbia.edu>
References: <44A7DF1B.1000008@c2b2.columbia.edu>	<44A80C63.7060809@maubp.freeserve.co.uk>
	<44A870FC.4060909@c2b2.columbia.edu>
Message-ID: <44ACD27F.90906@maubp.freeserve.co.uk>

Peter wrote:
>> Bugzilla Bug 1987 Alphabet.Gapped does not retain gap character
>> http://bugzilla.open-bio.org/show_bug.cgi?id=1987
>> I have attached a suggested patch, second opinion welcome

Michiel de Hoon wrote:
> Whereas the patch looks fine, I have no idea what this code is supposed 
> to do, or why it needs to be so complicated.

I'm not the person to ask.

The whole Alphabet is something that confused me a little when first 
using BioPython.  I see why a special class for sequences is a nice 
idea, and that handling the different variants of RNA, DNA and proteins 
is a good idea.

But to be honest, I have generally used plain strings in my own 
programs, and meddled with alphabets only when needed (e.g. for 
translating from DNA to protein sequences).

Peter


From hoffman at ebi.ac.uk  Thu Jul  6 10:36:53 2006
From: hoffman at ebi.ac.uk (Michael Hoffman)
Date: Thu, 6 Jul 2006 11:36:53 +0100 (BST)
Subject: [Biopython-dev] New Biopython release coming up / Alphabets
In-Reply-To: <44ACD27F.90906@maubp.freeserve.co.uk>
References: <44A7DF1B.1000008@c2b2.columbia.edu>
	<44A80C63.7060809@maubp.freeserve.co.uk>
	<44A870FC.4060909@c2b2.columbia.edu>
	<44ACD27F.90906@maubp.freeserve.co.uk>
Message-ID: <Pine.LNX.4.64.0607061135180.9068@qnzvnan.rov.np.hx>

[Peter]

> The whole Alphabet is something that confused me a little when first
> using BioPython.  I see why a special class for sequences is a nice
> idea, and that handling the different variants of RNA, DNA and proteins
> is a good idea.
>
> But to be honest, I have generally used plain strings in my own
> programs, and meddled with alphabets only when needed (e.g. for
> translating from DNA to protein sequences).

I agree. In general, I think that the alphabet stuff adds unnecessary
complexity to perhaps 95 % of the sort of things I would do with
Biopython. But as it stands I usually use strs myself instead.
-- 
Michael Hoffman <hoffman at ebi.ac.uk>
European Bioinformatics Institute


From Leighton.Pritchard at scri.ac.uk  Thu Jul  6 10:34:46 2006
From: Leighton.Pritchard at scri.ac.uk (Leighton Pritchard)
Date: Thu, 6 Jul 2006 11:34:46 +0100
Subject: [Biopython-dev] New Biopython release coming up / Alphabets
In-Reply-To: <44ACD27F.90906@maubp.freeserve.co.uk>
References: <44A7DF1B.1000008@c2b2.columbia.edu>
	<44A80C63.7060809@maubp.freeserve.co.uk>
	<44A870FC.4060909@c2b2.columbia.edu>
	<44ACD27F.90906@maubp.freeserve.co.uk>
Message-ID: <1152182087.4828.96.camel@lplinuxdev>

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20060706/8ff09d1a/attachment.ksh>
-------------- next part --------------
An embedded message was scrubbed...
From: "Leighton Pritchard" <Leighton.Pritchard at scri.ac.uk>
Subject: Re: [Biopython-dev] New Biopython release coming up / Alphabets
Date: Thu, 6 Jul 2006 11:34:46 +0100
Size: 4250
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20060706/8ff09d1a/attachment.eml>

From mdehoon at c2b2.columbia.edu  Thu Jul  6 16:39:09 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Thu, 06 Jul 2006 12:39:09 -0400
Subject: [Biopython-dev] New Biopython release coming up / Alphabets
In-Reply-To: <Pine.LNX.4.64.0607061135180.9068@qnzvnan.rov.np.hx>
References: <44A7DF1B.1000008@c2b2.columbia.edu>	<44A80C63.7060809@maubp.freeserve.co.uk>	<44A870FC.4060909@c2b2.columbia.edu>	<44ACD27F.90906@maubp.freeserve.co.uk>
	<Pine.LNX.4.64.0607061135180.9068@qnzvnan.rov.np.hx>
Message-ID: <44AD3CAD.8030504@c2b2.columbia.edu>

Michael Hoffman wrote:
> [Peter]
>> But to be honest, I have generally used plain strings in my own
>> programs, and meddled with alphabets only when needed (e.g. for
>> translating from DNA to protein sequences).

Note that there is a function "translate" in Bio.Seq that translates DNA 
to protein using plain strings.
> 
> I agree. In general, I think that the alphabet stuff adds unnecessary
> complexity to perhaps 95 % of the sort of things I would do with
> Biopython. But as it stands I usually use strs myself instead.

It appears that most people (myself included) use plain strings instead 
of Seq objects (= string + Alphabet). We should check on the biopython 
mailing list if anybody really needs alphabets, and if not get rid of 
them (after the upcoming Brooklyn-release (1.42) though).

--Michiel.


-- 
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032


From fkauff at duke.edu  Thu Jul  6 17:53:23 2006
From: fkauff at duke.edu (Frank Kauff)
Date: Thu, 06 Jul 2006 13:53:23 -0400
Subject: [Biopython-dev] New Biopython release coming up / Alphabets
In-Reply-To: <44AD3CAD.8030504@c2b2.columbia.edu>
References: <44A7DF1B.1000008@c2b2.columbia.edu>
	<44A80C63.7060809@maubp.freeserve.co.uk>
	<44A870FC.4060909@c2b2.columbia.edu>	<44ACD27F.90906@maubp.freeserve.co.uk>
	<Pine.LNX.4.64.0607061135180.9068@qnzvnan.rov.np.hx>
	<44AD3CAD.8030504@c2b2.columbia.edu>
Message-ID: <1152208403.2487.36.camel@osiris.biology.duke.edu>

On Thu, 2006-07-06 at 12:39 -0400, Michiel Jan Laurens de Hoon wrote:
> Michael Hoffman wrote:
> > [Peter]
> >> But to be honest, I have generally used plain strings in my own
> >> programs, and meddled with alphabets only when needed (e.g. for
> >> translating from DNA to protein sequences).
> 
> Note that there is a function "translate" in Bio.Seq that translates DNA 
> to protein using plain strings.
> > 
> > I agree. In general, I think that the alphabet stuff adds unnecessary
> > complexity to perhaps 95 % of the sort of things I would do with
> > Biopython. But as it stands I usually use strs myself instead.
> 
> It appears that most people (myself included) use plain strings instead 
> of Seq objects (= string + Alphabet). We should check on the biopython 
> mailing list if anybody really needs alphabets, and if not get rid of 
> them (after the upcoming Brooklyn-release (1.42) though).
> 
I use Seq objects and the alphabet stuff in the nexus parser, but I
don't really know why and wouldn't mind at all getting rid of them. 

Frank


> --Michiel.
> 
> 
-- 
Frank Kauff
Dept. of Biology
Duke University
Box 90338
Durham, NC 27708
USA

Phone 919-660-7382
Fax 919-660-7293
Web http://www.lutzonilab.net


From thamelry at binf.ku.dk  Fri Jul  7 10:44:24 2006
From: thamelry at binf.ku.dk (Thomas Hamelryck)
Date: Fri, 7 Jul 2006 12:44:24 +0200 (CEST)
Subject: [Biopython-dev] New Biopython release coming up
In-Reply-To: <44A80C63.7060809@maubp.freeserve.co.uk>
References: <44A7DF1B.1000008@c2b2.columbia.edu>
	<44A80C63.7060809@maubp.freeserve.co.uk>
Message-ID: <38979.129.11.36.25.1152269064.squirrel@secure.binf.ku.dk>

Hi,

> Bugzilla Bug 1936 Bio.PDB.PDBParser throws PDBException, 'No parent' in
> error
> http://bugzilla.open-bio.org/show_bug.cgi?id=1936
> One for Thomas Hamelryck which on the face of it looks fairly simple.

Won't have time to work on Biopython before August I'm afraid (CASP +
articles that need to be finished, etc.). Sorry!

Best regards,

-Thomas


From mcolosimo at mitre.org  Tue Jul 11 16:01:15 2006
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Tue, 11 Jul 2006 12:01:15 -0400
Subject: [Biopython-dev] New Biopython release coming up / Alphabets
In-Reply-To: <44AD3CAD.8030504@c2b2.columbia.edu>
References: <44A7DF1B.1000008@c2b2.columbia.edu>	<44A80C63.7060809@maubp.freeserve.co.uk>	<44A870FC.4060909@c2b2.columbia.edu>	<44ACD27F.90906@maubp.freeserve.co.uk>
	<Pine.LNX.4.64.0607061135180.9068@qnzvnan.rov.np.hx>
	<44AD3CAD.8030504@c2b2.columbia.edu>
Message-ID: <B4886B3D-E885-450D-9376-3EBE93859F7A@mitre.org>


On Jul 6, 2006, at 12:39 PM, Michiel Jan Laurens de Hoon wrote:

> Michael Hoffman wrote:
>> [Peter]
>>> But to be honest, I have generally used plain strings in my own
>>> programs, and meddled with alphabets only when needed (e.g. for
>>> translating from DNA to protein sequences).
>
> Note that there is a function "translate" in Bio.Seq that  
> translates DNA
> to protein using plain strings.
>>
>> I agree. In general, I think that the alphabet stuff adds unnecessary
>> complexity to perhaps 95 % of the sort of things I would do with
>> Biopython. But as it stands I usually use strs myself instead.
>
> It appears that most people (myself included) use plain strings  
> instead
> of Seq objects (= string + Alphabet). We should check on the biopython
> mailing list if anybody really needs alphabets, and if not get rid of
> them (after the upcoming Brooklyn-release (1.42) though).
>
> --Michiel.

I am strongly arguing against removing the alphabets. You would lose
all of the cool features of Seq objects (complement,
reverse_complement). There are similar functions under Bio.SeqUtils,
but those are "Deprecated". From just looking around, I think this
would break many things.

Having said that, I do find them a pain to deal with, but that might
have more to do with the structure/layout of the classes. My simple
suggestion is to fix/change the base Alphabet classes in
Bio.Alphabet.__init__. I am trying to think of a way that we can have
a "true" GenericAlphabet class (not generic_alphabet = Alphabet())
and use just strings. The problem is that I don't know if just
using letters = None (or letters = []) will cause problems down the
road (things like "if x in alphabet.letters" are used in many classes).

Also, I'm really confused as to what is going on in IUPAC.py with the  
default_manager stuff and _bootstrap.

Marc


From mcolosimo at mitre.org  Tue Jul 11 17:29:52 2006
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Tue, 11 Jul 2006 13:29:52 -0400
Subject: [Biopython-dev] BioPython Design
In-Reply-To: <44A83872.4070209@c2b2.columbia.edu>
References: <C0CD92D7.8A1B%mcolosimo@mitre.org>
	<44A83872.4070209@c2b2.columbia.edu>
Message-ID: <7C24AEA4-68EC-4517-9391-C07512CDD146@mitre.org>


On Jul 2, 2006, at 5:19 PM, Michiel de Hoon wrote:

>
> > My suggestions are:
> > 1) split off a branch for ver 2.0 that supports Python 2.4 only
> > (this would suck for Mac people, like me, but its time to move on)
>
> Is there something essential in 2.4 that's missing in 2.3? Not that  
> I object against supporting 2.4 only, I'm just wondering. Though  
> I'd be hesitant to split off a separate branch, since Biopython is  
> confusing enough already as it is.
>
> Btw, I am running Python 2.4 on Mac OS X, and AFAICT there is no  
> problem   for Mac users to support 2.4 only.

There are two off the top of my head:

1) Generator expressions (PEP 289,
<http://www.python.org/doc/peps/pep-0289>). These could be very useful
in cleaning up the old code.
2) Decorators for functions (PEP 318,
<http://www.python.org/dev/peps/pep-0318>). I like the idea of using
staticmethod and classmethod. The accepts and returns decorators are
also interesting. I wish I could find a list of all possible decorators.
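
For anyone who has not tried the two 2.4 features yet, here is a small sketch of both. The Codon class and its methods are invented for illustration and are not Biopython code; only the generator expression syntax and the @staticmethod/@classmethod decorator syntax are the point.

```python
# Generator expression (PEP 289): no temporary list is built for the sum.
sequences = ["ACGT", "GGCC", "AT"]
total_length = sum(len(s) for s in sequences)


class Codon(object):
    """Hypothetical class, used only to show the decorator syntax."""

    def __init__(self, letters):
        self.letters = letters

    @staticmethod
    def is_valid(letters):
        # No access to the instance or the class is needed here.
        return len(letters) == 3 and set(letters) <= set("ACGT")

    @classmethod
    def from_rna(cls, letters):
        # Alternative constructor; cls is Codon or any subclass of it.
        return cls(letters.replace("U", "T"))


start = Codon.from_rna("AUG")
```

Before 2.4 the same effect needed a line such as is_valid = staticmethod(is_valid) after the method body, which is easy to overlook when reading a long class.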

In any case, some clean up of the code is needed because people have  
used the string "Decorator" (Alphabet.__init__.py and NeCatch.py)

>
> > 2) clean house - remove depreciated items, restructure IO, etc...
>
> I totally agree.
>
> > 3) move to SciPy/NumPy versus Numeric (could try
> > "numpy/lib/convertcode.py")
>
> Here, I'm a bit hesitant. SciPy does not have a good track record  
> in terms of portability. The latest version of numpy looks better  
> though (it compiled without problems on all platforms I tried). But  
> I don't really want to pay $40 for the documentation.


I saw this, but didn't know it was the only documentation. However,
as far as I can tell Numeric is dead: <http://numeric.scipy.org/> is
now NumPy!

Marc


From krewink at inb.uni-luebeck.de  Tue Jul 11 21:23:14 2006
From: krewink at inb.uni-luebeck.de (Albert Krewinkel)
Date: Tue, 11 Jul 2006 23:23:14 +0200
Subject: [Biopython-dev] BioPython Design
Message-ID: <20060711212314.GA31351@pc06.inb.mu-luebeck.de>

On 11.07.2006 at 18:01, Marc Colosimo wrote:

> It appears that most people (myself included) use plain strings instead
> of Seq objects (= string + Alphabet). We should check on the biopython
> mailing list if anybody really needs alphabets, and if not get rid of
> them (after the upcoming Brooklyn-release (1.42) though).

There are some good points about Seq objects in the discussion
last year:
http://lists.open-bio.org/pipermail/biopython-dev/2005-April/002074.html

Personally, I would prefer to keep Alphabets as a part of Seq,
but make it behave more like Python strings, i.e.:
str(seq_obj) == seq_obj.data == seq_obj.tostring() == seq_obj[:]

Furthermore, alphabets could be more useful with an __init__
method looking like

def __init__(self, data, alphabet, validate=False)

This way, sequences could be checked for consistency on demand.

To make Alphabets more usable, it would be nice to have some kind
of dictionary interface to map different alphabets:
e.g. Alphabet.Alphabets['protein'] == Bio.Alphabet.IUPAC.protein
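
To make the three suggestions above concrete, here is a minimal sketch under stated assumptions. The classes below are stand-ins written for this example, not the real Bio.Seq or Bio.Alphabet classes, and the ALPHABETS dictionary and its keys are hypothetical. In particular, slicing returns a plain string here only so that the string-like equality above holds.

```python
class Alphabet(object):
    """Stand-in alphabet; letters = None means no restriction."""
    letters = None


class DNAAlphabet(Alphabet):
    letters = "ACGT"


class ProteinAlphabet(Alphabet):
    letters = "ACDEFGHIKLMNPQRSTVWY"


# Hypothetical dictionary interface for looking alphabets up by name.
ALPHABETS = {
    "generic": Alphabet(),
    "dna": DNAAlphabet(),
    "protein": ProteinAlphabet(),
}


class Seq(object):
    def __init__(self, data, alphabet, validate=False):
        # Checking is opt-in, so the default path costs nothing extra.
        if validate and alphabet.letters is not None:
            unexpected = sorted(set(data) - set(alphabet.letters))
            if unexpected:
                raise ValueError("letters not in alphabet: "
                                 + "".join(unexpected))
        self.data = data
        self.alphabet = alphabet

    def __str__(self):
        return self.data

    def tostring(self):
        return self.data

    def __getitem__(self, index):
        return self.data[index]


seq_obj = Seq("ACGT", ALPHABETS["dna"], validate=True)
```

With validate left at False, Seq("ACGX", ALPHABETS["dna"]) is accepted silently; with validate=True the same call raises ValueError.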

Cheers,
Albert

-- 
Albert Krewinkel
University of Luebeck
phone: +49 (451) 500 5516
email: krewink at inb.uni-luebeck.de


From f.schlesinger at iu-bremen.de  Wed Jul 12 13:25:43 2006
From: f.schlesinger at iu-bremen.de (Felix Schlesinger)
Date: Wed, 12 Jul 2006 15:25:43 +0200
Subject: [Biopython-dev] BioPython Design
In-Reply-To: <20060711212314.GA31351@pc06.inb.mu-luebeck.de>
References: <20060711212314.GA31351@pc06.inb.mu-luebeck.de>
Message-ID: <7317d50c0607120625x7e76008fo961814b280dbad51@mail.gmail.com>

> Personaly, I would prefere to keep Alphabets as a part of Seq,
> but make it behave more like python strings, i.e.:
> str(seq_obj) == seq_obj.data == seq_obj.tostring() == seq_obj[:]

Isn't the whole alphabet thing just type information in the end?
(I.e. "This string is of type protein") And if it is, shouldn't we let
the Python type system handle it via a class hierarchy? Or use the
Python concept of duck typing and assume the string has whatever type
is needed at the moment until it fails?
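
A sketch of both options, with class and function names invented for illustration (none of this is existing Biopython code). The first part puts the type information into str subclasses; the second part is a duck-typed function that accepts anything string-like.

```python
class Sequence(str):
    """Base class: behaves exactly like a plain string."""


class DNA(Sequence):
    def complement(self):
        table = str.maketrans("ACGT", "TGCA")
        # str.translate returns a plain str, so wrap it to keep the type.
        return DNA(self.translate(table))


class Protein(Sequence):
    pass


def gc_count(seq):
    # Duck typing: works on any object with a string-like count() method.
    return seq.count("G") + seq.count("C")


dna = DNA("GGCA")
```

One catch with subclassing str is that inherited methods such as upper() or slicing return plain str objects, so the type information is lost unless each method is wrapped.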

Felix Schlesinger


From mdehoon at c2b2.columbia.edu  Wed Jul 26 17:39:46 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Wed, 26 Jul 2006 13:39:46 -0400
Subject: [Biopython-dev] New Biopython release coming up / Alphabets
In-Reply-To: <B4886B3D-E885-450D-9376-3EBE93859F7A@mitre.org>
References: <44A7DF1B.1000008@c2b2.columbia.edu>	<44A80C63.7060809@maubp.freeserve.co.uk>	<44A870FC.4060909@c2b2.columbia.edu>	<44ACD27F.90906@maubp.freeserve.co.uk>
	<Pine.LNX.4.64.0607061135180.9068@qnzvnan.rov.np.hx>
	<44AD3CAD.8030504@c2b2.columbia.edu>
	<B4886B3D-E885-450D-9376-3EBE93859F7A@mitre.org>
Message-ID: <44C7A8E2.2050100@c2b2.columbia.edu>

Marc Colosimo wrote:
>> [Michiel]
>> It appears that most people (myself included) use plain strings instead
>> of Seq objects (= string + Alphabet). We should check on the biopython
>> mailing list if anybody really needs alphabets, and if not get rid of
>> them (after the upcoming Brooklyn-release (1.42) though).
>
> [Marc]
> I am strongly arguing against removing the alphabets. You would lose
> all of the cool features of Seq Objects (complement,
> reverse_complement).  There are similar functions under Bio.SeqUtils but
> those are "Deprecated". From just looking around, I think this would
> break many things.

There is a function reverse_complement in Bio.Seq that works on plain 
strings. (If you need the complement instead, you can of course reverse 
the result). So can you be more specific on which features of Seq 
objects are actually needed? While I can see the intuitive appeal of 
having a Seq class, I cannot think of any practical cases where a simple 
string wouldn't do.
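
For reference, this is roughly what such plain-string functions look like. It is a self-contained sketch, not the code in Bio.Seq; it handles only unambiguous DNA letters, which is an assumption made to keep it short.

```python
# Translation table for unambiguous DNA, upper and lower case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence given as a string."""
    return sequence.translate(_COMPLEMENT)[::-1]


def complement(sequence):
    """Return the complement by reversing the reverse complement."""
    return reverse_complement(sequence)[::-1]


rc = reverse_complement("AACG")
c = complement("AACG")
```

Nothing here knows whether the string is really DNA; calling it on a protein string simply gives a meaningless result, which is the kind of mistake an alphabet check could catch.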

--Michiel.


-- 
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032


From biopython-dev at maubp.freeserve.co.uk  Fri Jul 28 13:50:39 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Fri, 28 Jul 2006 14:50:39 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
Message-ID: <44CA162F.1040604@maubp.freeserve.co.uk>

This follows on from the discussion last month started by Marc Colosimo, 
  but I want to focus just on reading in sequence files:

http://lists.open-bio.org/pipermail/biopython-dev/2006-July/002386.html

There was also a thread back a few years ago where Michael Hoffman was 
looking at timings for parsing Fasta files.

http://www.biopython.org/pipermail/biopython-dev/2003-October/001480.html

Jeffrey Chang wrote:
> That is a nice implementation.  However, Biopython already has at least 
> 3 Fasta parsers!
>    Bio/Fasta
>    Bio/SeqIO/FASTA
>    Bio/expressions/fasta
> 
> Bio/Fasta, the one you compared against, is easily the slowest one.  
> Bio/SeqIO/FASTA is very similar to your implementation and not likely 
> to be significantly faster or slower.  Bio/expressions/fasta uses 
> Martel.  I don't know how well that will perform.  The parsing part 
> should be blazingly fast (since it is mostly in C), but building the 
> object will be slow.  It might be a wash.
> 
> Jeff

Clearly we could try and consolidate these (while making things as nice 
as possible with deprecation warnings etc. for existing code).

I've had a little read on the BioPerl SeqIO system:
http://www.bioperl.org/wiki/HOWTO:SeqIO

I agree with Marc that what we have in BioPython could (and should) be 
more organised.

Ideally (in my opinion) BioPython should be able to read sequences from 
multiple sequence file formats (e.g. Fasta, GenBank, EMBL, ...)
* using a standard interface
* into a standard object
* do this quickly

The resulting object should be able to hold additional information like 
annotation and (sub)features, the Bio.SeqRecord.SeqRecord object seems 
ideal.
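
One possible shape for such a standard interface is a single entry point that dispatches on a format name. This is a sketch under assumptions: the names read_sequences and _READERS are hypothetical, only Fasta is registered, and plain (title, sequence) tuples stand in for SeqRecord objects.

```python
import io


def _fasta_records(handle):
    """Yield (title, sequence) pairs from a FASTA handle, one at a time."""
    title, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line.strip())
    if title is not None:
        yield title, "".join(chunks)


# Hypothetical registry: format name -> function returning an iterator.
_READERS = {"fasta": _fasta_records}


def read_sequences(handle, format_name):
    """Return a record iterator for the named format."""
    try:
        reader = _READERS[format_name.lower()]
    except KeyError:
        raise ValueError("Unsupported format: %r" % format_name)
    return reader(handle)


handle = io.StringIO(">a first\nACGT\nAC\n>b\nGG\n")
for title, seq in read_sequences(handle, "fasta"):
    print(title, seq)
```

Adding another format would then mean registering one more function, while calling code stays the same.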

It looks like we have:

(1) We have a number of format specific sequence reading modules (in 
particular Fasta and GenBank) which can read their particular file 
format into one or more different object representations.  These seem to 
be the best documented (in my opinion).

(2) We have a fairly generic (but relatively slow) framework in the 
Bio.FormatIO system which uses Martel expressions internally.  I have 
found Martel frustrating to debug, and especially slow with large 
individual records (like genomic GenBank files).  There is some 
documentation on this, e.g.

http://www.biopython.org/DIST/docs/cookbook/genbank_to_fasta.html

(3) We have the start of a generic "pure python" framework in the 
Bio.SeqIO system, but it needs some work (e.g. some doc strings, fixing 
the LargeFastaFormat class, GenBank support).

QUESTION: What do you all tend to use?  Should I draft a "questionnaire" 
to be posted on the main discussion list (and the announcements?).

Personally, I have been using Bio.Fasta and Bio.GenBank to read 
sequences.  I tend to only output Fasta files, and usually do this "by 
hand" as they are so simple and I want full control over the description 
lines.

Peter


From biopython-dev at maubp.freeserve.co.uk  Fri Jul 28 15:05:21 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Fri, 28 Jul 2006 16:05:21 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44CA162F.1040604@maubp.freeserve.co.uk>
References: <44CA162F.1040604@maubp.freeserve.co.uk>
Message-ID: <44CA27B1.30107@maubp.freeserve.co.uk>

Jeffrey Chang wrote:
> ...  However, Biopython already has at least 
> 3 Fasta parsers!
>    Bio/Fasta
>    Bio/SeqIO/FASTA
>    Bio/expressions/fasta
> 
> Bio/Fasta, the one you compared against, is easily the slowest one.  
> Bio/SeqIO/FASTA is very similar to your implementation and not likely 
> to be significantly faster or slower.  Bio/expressions/fasta uses 
> Martel.  I don't know how well that will perform.  The parsing part 
> should be blazingly fast (since it is mostly in C), but building the 
> object will be slow.  It might be a wash.

The following timings are for iterating over a large fasta file 
(Escherichia_coli_K12, NC_000913.ffn, with 5254 nucleotide CDS sequences).

The test script is attached, the test input is available here:

ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Escherichia_coli_K12/NC_000913.ffn

I used BioPython 1.42 with Python 2.3 on Windows XP on a laptop computer.

Apart from Fasta.RecordParser, these all return a SeqRecord object with 
a generic alphabet:

0.89s SeqIO.FASTA.FastaReader (for record in interator)
0.88s SeqIO.FASTA.FastaReader (iterator.next)
0.88s SeqIO.FASTA.FastaReader (iterator[i])

5.52s FormatIO/SeqRecord (for record in interator)
5.41s FormatIO/SeqRecord (iterator.next)

6.06s Fasta.RecordParser (for record in interator)
6.10s Fasta.SequenceParser (for record in interator)
6.27s Fasta.SequenceParser (iterator.next)

As you can see, SeqIO.FASTA.FastaReader (written in plain Python) is 
about six times faster than both of the Martel-based parsers.

I have tried this on a file with 2000 records and see a similar scaling.
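
For anyone wanting to repeat this kind of comparison, a timing harness can be as small as the sketch below. It is not the attached test script: time_call and count_headers are hypothetical names, and the input is synthetic rather than NC_000913.ffn.

```python
import time


def time_call(label, func, repeats=3):
    """Print and return the best wall-clock time of func() over several runs."""
    best = None
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        elapsed = time.perf_counter() - start
        if best is None or elapsed < best:
            best = elapsed
    print("%.2fs %s" % (best, label))
    return best


def count_headers(text):
    """Stand-in workload: count FASTA header lines in a string."""
    return sum(1 for line in text.splitlines() if line.startswith(">"))


# Synthetic input with 5000 short records (an assumption for illustration).
data = "".join(">seq%d\nACGTACGTAC\n" % i for i in range(5000))
time_call("count_headers (splitlines)", lambda: count_headers(data))
```

Taking the best of several runs reduces noise from other processes; each parser under test would be wrapped in its own function and passed to time_call.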

Peter
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: test_fasta_methods.py
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20060728/93dbbbb7/attachment.ksh>

From mdehoon at c2b2.columbia.edu  Mon Jul 31 01:20:50 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sun, 30 Jul 2006 21:20:50 -0400
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44CA162F.1040604@maubp.freeserve.co.uk>
References: <44CA162F.1040604@maubp.freeserve.co.uk>
Message-ID: <44CD5AF2.10708@c2b2.columbia.edu>

Thanks Peter.

Peter (BioPython Dev) wrote:
> QUESTION: What do you all tend to use?

I use the stuff in Bio.Fasta, but actually just because it's in the 
documentation. From your timings, and also because I'm not smart enough 
to be able to understand Martel, let alone maintain Martel-based 
parsers, I'm pretty much in favor of Bio.SeqIO.

>  Should I draft a "questionnaire" 
> to be posted on the main discussion list (and the announcements?).

By all means, yes. In the questionnaire, be sure to separate the issue 
of parser internals (Martel vs. pure Python) from the issue of how the 
results should be formatted (Fasta.Record or SeqRecord).

--Michiel


From lpritc at scri.sari.ac.uk  Mon Jul 31 09:59:47 2006
From: lpritc at scri.sari.ac.uk (Leighton Pritchard)
Date: Mon, 31 Jul 2006 10:59:47 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44CA27B1.30107@maubp.freeserve.co.uk>
References: <44CA162F.1040604@maubp.freeserve.co.uk>
	<44CA27B1.30107@maubp.freeserve.co.uk>
Message-ID: <1154339988.1490.81.camel@lplinuxdev>

Hi all,

On Fri, 2006-07-28 at 16:05 +0100, Peter (BioPython Dev) wrote:
> Jeffrey Chang wrote:
> > ...  However, Biopython already has at least 
> > 3 Fasta parsers!
> >    Bio/Fasta
> >    Bio/SeqIO/FASTA
> >    Bio/expressions/fasta
> > 
> > Bio/Fasta, the one you compared against, is easily the slowest one.  
> > Bio/SeqIO/FASTA is very similar to your implementation and not likely 
> > to be significantly faster or slower.  Bio/expressions/fasta uses 
> > Martel.  I don't know how well that will perform.  The parsing part 
> > should be blazingly fast (since it is mostly in C), but building the 
> > object will be slow.  It might be a wash.

Just to add to the confusion, when parsing large FASTA sequence files, I
have been using a home-rolled Flex/Pyrex parser (if you'd like a copy,
drop me a line).  I've used Peter's test framework on the same input
file (NC_000913.ffn), using BioPython 1.41, with Python 2.4 on Fedora
Core 3 (up-to-date, eh? ;) ) to get the following typical results:

4.07s FormatIO/SeqRecord (for record in interator)
4.05s FormatIO/SeqRecord (iterator.next)
0.32s SeqIO.FASTA.FastaReader (for record in interator)
0.30s SeqIO.FASTA.FastaReader (iterator.next)
0.31s SeqIO.FASTA.FastaReader (iterator[i])
5.53s Fasta.RecordParser (for record in interator)
5.00s Fasta.SequenceParser (for record in interator)
4.80s Fasta.SequenceParser (iterator.next)
0.18s SeqUtils/quick_FASTA_reader
0.11s pyfastaseqlexer/next_record
0.09s pyfastaseqlexer/quick_FASTA_reader
0.19s SeqUtils/quick_FASTA_reader (conversion to Seq)
0.14s pyfastaseqlexer/next_record (conversion to Seq)
0.11s pyfastaseqlexer/quick_FASTA_reader (conversion to Seq)
0.32s SeqUtils/quick_FASTA_reader (conversion to SeqRecord)
0.17s pyfastaseqlexer/next_record (conversion to SeqRecord)
0.16s pyfastaseqlexer/quick_FASTA_reader (conversion to SeqRecord)

pyfastaseqlexer is my Flex/Pyrex combination, which has a number of
methods for reading in FASTA sequences.  Here I've used the two that
correspond to the Bio.SeqUtils.quick_FASTA_reader method (overlooked in
the original list, but also included here for comparison), and Peter's
iterator method for his tests.  Since these extra methods don't return
Bio.Seq or Bio.SeqRecord objects, but instead lists of (name, sequence)
tuples, I've also included test functions that carry out the conversion
in Python, and their timings.

It's probably not a surprise that a dedicated Flex-based parser shows
such a dramatic speed improvement over the Martel-based parsers.  The
improvement over SeqIO.FASTA.FastaReader and SeqUtils.quick_FASTA_reader
is only marginal, though (a factor of approximately two when conversion
to SeqRecord is taken into account).  

Since we've been discussing the need to use only strings to represent
sequences recently, it's interesting to note that
SeqUtils.quick_FASTA_reader is about twice as fast as
SeqIO.FASTA.FastaReader if there is no conversion of sequences from
strings to Seq or SeqRecord objects.

While the Flex-based parser is the fastest in these tests, the time
saved is marginal unless a large FASTA file is being parsed.  Using a
file with over 72000 entries (Phytophthora infestans ESTs), my typical
timings become:

51.22s FormatIO/SeqRecord (for record in interator)
45.64s FormatIO/SeqRecord (iterator.next)
4.26s SeqIO.FASTA.FastaReader (for record in interator)
4.10s SeqIO.FASTA.FastaReader (iterator.next)
4.30s SeqIO.FASTA.FastaReader (iterator[i])
58.39s Fasta.RecordParser (for record in interator)
59.97s Fasta.SequenceParser (for record in interator)
58.70s Fasta.SequenceParser (iterator.next)
2.20s SeqUtils/quick_FASTA_reader
1.13s pyfastaseqlexer/next_record
0.56s pyfastaseqlexer/quick_FASTA_reader
2.20s SeqUtils/quick_FASTA_reader (conversion to Seq)
1.53s pyfastaseqlexer/next_record (conversion to Seq)
0.84s pyfastaseqlexer/quick_FASTA_reader (conversion to Seq)
2.97s SeqUtils/quick_FASTA_reader (conversion to SeqRecord)
2.11s pyfastaseqlexer/next_record (conversion to SeqRecord)
1.35s pyfastaseqlexer/quick_FASTA_reader (conversion to SeqRecord)

The Martel-based parsers become almost unworkable when dealing with
files of this size.  Note that the conversion of strings to SeqRecord
objects is pretty much a constant overhead for the Bio.SeqUtils and
pyfastaseqlexer methods (taking around 1s), but that there are
apparently additional overheads in the SeqIO.FASTA.FastaReader method.

Of course, the hassles of including a Flex-based parser in a general
BioPython release probably outweigh the marginal time-saving benefits
(see MMCIFlex for details ;) ).  I think SeqIO.FASTA.FastaReader and
SeqUtils.quick_FASTA_reader do a good, quick job as it stands, and beat
the inclusion of a Flex-based parser hands-down in terms of
maintainability and portability.

L.

-- 
Dr Leighton Pritchard AMRSC
D131, Plant-Pathogen Interactions, Scottish Crop Research Institute
Invergowrie, Dundee, Scotland, DD2 5DA, UK
T: +44 (0)1382 562731 x2405 F: +44 (0)1382 568578
E: lpritc at scri.sari.ac.uk   W: http://bioinf.scri.sari.ac.uk/lp
GPG/PGP: FEFC205C E58BA41B  http://www.keyserver.net             
(If the signature does not verify, please remove the SCRI disclaimer)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views 
expressed by the sender are not necessarily the views of SCRI and its 
subsidiaries.  This email and any files transmitted with it are confidential 
to the intended recipient at the e-mail address to which it has been 
addressed.  It may not be disclosed or used by any other than that addressee.
If you are not the intended recipient you are requested to preserve this 
confidentiality and you must not use, disclose, copy, print or rely on this 
e-mail in any way. Please notify postmaster at scri.sari.ac.uk quoting the 
name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are 
present in this email, neither the Institute nor the sender accepts any 
responsibility for any viruses, and it is your responsibility to scan the email 
and the attachments (if any).


From biopython-dev at maubp.freeserve.co.uk  Mon Jul 31 10:36:00 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Mon, 31 Jul 2006 11:36:00 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44CD5AF2.10708@c2b2.columbia.edu>
References: <44CA162F.1040604@maubp.freeserve.co.uk>
	<44CD5AF2.10708@c2b2.columbia.edu>
Message-ID: <44CDDD10.4020904@maubp.freeserve.co.uk>

Michiel de Hoon wrote:
> Thanks Peter.
> 
> Peter wrote:
>>QUESTION: What do you all tend to use?
> 
> I use the stuff in Bio.Fasta, but actually just because it's in the 
> documentation.

Me too.

 > From your timings, and also because I'm not smart enough
> to be able to understand Martel, let alone maintain Martel-based 
> parsers, I'm pretty much in favor of Bio.SeqIO.

That was my gut instinct too.

Starting with Bio.SeqIO as a base, I've been "playing" with the code and 
have a rough "Sequence Iterator" class that supports iteration (provides 
a next() and __iter__() method),  as well as strictly increasing index 
access.
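
The interface described above could look roughly like this. It is only a sketch, assuming the iterator wraps some other iterable of records; it is not the code in Bio.SeqIO, and the class name is hypothetical.

```python
class SequenceIterator:
    """Wrap any iterable of records; allow forward-only index access."""

    def __init__(self, records):
        self._records = iter(records)
        self._next_index = 0  # index of the record that next() will return

    def __iter__(self):
        return self

    def __next__(self):
        record = next(self._records)
        self._next_index += 1
        return record

    def __getitem__(self, index):
        if index < self._next_index:
            raise IndexError("index %d has already been consumed" % index)
        try:
            # Skip forward, discarding records before the requested one.
            while self._next_index < index:
                next(self)
            return next(self)
        except StopIteration:
            raise IndexError("index %d is out of range" % index) from None


it = SequenceIterator(["rec0", "rec1", "rec2", "rec3"])
print(it[0])     # rec0
print(it[2])     # rec2 (rec1 is skipped and cannot be revisited)
print(next(it))  # rec3
```

Because nothing is cached, memory use stays at one record at a time; the price is that an index can never go backwards.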

At the moment I have iterators returning SeqRecords for:
- Fasta Files
- GenBank features (returns the CDS features and their translations)
- GenBank files (with the features as SeqFeature objects)

There is code in Bio/SeqIO/general.py for a few more file formats which 
I haven't used yet.

This new GenBank iterator actually uses the current Bio.GenBank parser 
(with a slight tweak to how it acts once it reaches the end of a record).

Michiel de Hoon wrote:
>
 >Peter wrote:
>> Should I draft a "questionnaire" 
>> to be posted on the main discussion list (and the announcements?).
> 
> By all means, yes. In the questionnaire, be sure to separate the issue 
> of parser internals (Martel vs. pure Python) from the issue of how the 
> results should be formatted (Fasta.Record or SeqRecord).
> 

Draft questionnaire follows; I have included my comments for the record. 
  Too long?  Missing any important questions?

Peter

--

Introduction
============
There is some discussion on the Developer's Mailing list about 
BioPython's sequence input/output routines.

For example, it's a bit silly that there are three different Fasta 
reading routines in BioPython (even if only one of them, Bio.Fasta, is 
properly documented).

Note that we are not going to "just remove" any of the current 
functionality.  Some existing code may be re-written internally, while 
other code might be marked with a DeprecationWarning.

If you could answer the following questions that would help guide our 
choices.

Question One
============
Is reading sequence files an important function to you, and if so which 
file formats in particular (e.g. Fasta, GenBank, ...)

If you have had to write your own code to read a "common" file format 
which BioPython doesn't support, please get in touch.

Peter's answer:
 > I read Fasta and GenBank files mostly.  Also Clustalw alignments,
 > and Stockholm alignments.

Question Two - Reading Fasta Files
==================================
Which of the following do you currently use (and why)?:

(a) Bio.Fasta with the RecordParser (giving FastaRecord objects with a 
title, and the sequence as a string)
(b) Bio.Fasta with the SequenceParser (giving SeqRecord objects)
(c) Bio.Fasta with your own parser (Could you tell us more?)
(d) Bio.SeqIO.FASTA.FastaReader (giving SeqRecord objects)
(e) Bio.FormatIO (giving SeqRecord objects)
(f) Other (Could you tell us more?)

Peter's answer:
 > In most of my scripts I use Bio.Fasta with either the RecordParser or
 > SequenceParser.  I did look at Bio.FormatIO when I started but found
 > Bio.Fasta was much better documented (and a similar speed).  I have
 > only recently looked at Bio.SeqIO (hence this entire thread).

Question Three - index_file based dictionaries
==============================================
Do you use any of the following:
(a) Bio.Fasta.Dictionary
(b) Bio.GenBank.Dictionary
(c) Any other "Martel/Mindy" based dictionary which first requires 
creation of an index using the index_file function

If so, do you have any comments?

Peter's answer:
 > I do not use multi-record Genbank files (mine are single chromosomes).
 >
 > I have used Bio.Fasta.Dictionary but found dealing with the indexes
 > created by index_file to be annoying - especially when re-indexing
 > Fasta files which change often.
 >
 > I now use a simple wrapper function to load a Fasta file with an
 > iterator and build the dictionary in memory.  For me this is much
 > less hassle and the memory demands are not too great.
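
A wrapper of that kind might look like the following sketch. It assumes the iterator yields (title, sequence) pairs and that the identifier is the first word of the title; both are assumptions made for illustration, and to_dict is a hypothetical name.

```python
def _first_word(record):
    """Assumed default key: first whitespace-delimited word of the title."""
    return record[0].split(None, 1)[0]


def to_dict(records, key_function=_first_word):
    """Build {identifier: record} in memory from an iterable of records."""
    index = {}
    for record in records:
        key = key_function(record)
        if key in index:
            raise ValueError("Duplicate identifier: %r" % key)
        index[key] = record
    return index


records = [("gi|1 first", "ACGT"), ("gi|2 second", "GGCC")]
by_id = to_dict(records)
print(by_id["gi|2"][1])  # GGCC
```

Nothing is written to disk, so there is no index to rebuild when the Fasta file changes; the whole file does have to fit in memory.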

Question Four - Record Access...
================================
When loading a file with multiple sequences do you use:

(a) An iterator interface (e.g. Bio.Fasta.Iterator) to give you the 
records one by one in the order from the file.

(b) A dictionary interface (e.g. Bio.Fasta.Dictionary) to give you 
random access to the records using their identifier.

(c) A list giving random access by index number (e.g. load the records 
using an iterator but saving them in a list).

Do you have any additional comments on this?  For example, flexibility 
versus memory requirements.

For example, when I need random access to a Fasta file, I build a 
dictionary in memory (using an iterator) rather than messing about with 
the index_file based dictionary.

Peter's answer:
 > I usually deal with each record sequentially using an iterator.
 >
 > However, I often need random access using the record identifier and
 > for this I use a dictionary which I create in memory using an iterator.
 >
 > As stated in the question, I had tried using Bio.Fasta.Dictionary but
 > found dealing with the indexes created by index_file to be annoying,
 > especially having to re-index Fasta files which change often.


Question Five - Fasta files: FastaRecord or SeqRecord
=====================================================
If you use Fasta files, do you want to get records returned as 
FastaRecords or as SeqRecords?  If SeqRecords, do you use your own 
title2ids mapping?

For example,

 >name text text text
ACGTACACGT

As a FastaRecord this would have:

FastaRecord.title = "name text text text" (string)
FastaRecord.sequence = "ACGTACACGT" (string)

As a SeqRecord (with the default title2ids mapping):

SeqRecord.id = (default string)
SeqRecord.name = (default string)
SeqRecord.description = "name text text text" (string)
SeqRecord.seq = Seq("ACGTACACGT", alphabet)

Peter's answer:
 > For FASTA files I have usually used FastaRecord objects (with the
 > sequence as a string) but I have no strong preference.  Thinking of
 > the big picture it would be better to have every parser return
 > SeqRecords by default.

Question Six - GenBank files: GenbankRecord or SeqRecord
========================================================
If you use GenBank files, do you use:
(a) Bio.GenBank.FeatureParser which returns SeqRecord objects
(b) Bio.GenBank.RecordParser which returns Bio.GenBank.Record objects

Do you care much either way?  For me the only significant difference is 
that feature locations are held as objects in the SeqRecord, and as the 
raw string in the Record.

Peter's answer:
 > I have no strong preference - unless I wanted to manipulate the
 > feature locations.  I think there might be a performance difference...

Question Seven - Martel, Scanners and Consumers
===============================================
Some of BioPython's existing parsers (e.g. those using Martel) use an 
event/callback model, where the scanner component generates parsing 
events which are dealt with by the consumer component.
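
For readers unfamiliar with the pattern, here is a toy sketch of the scanner/consumer split. It is not the Martel or Biopython API; the event names (start_record, title, sequence, end_record) and both classes are hypothetical.

```python
import io


class TitleCollector:
    """Hypothetical consumer: reacts to events and keeps only the titles."""

    def __init__(self):
        self.titles = []

    def start_record(self):
        pass

    def title(self, text):
        self.titles.append(text)

    def sequence(self, text):
        pass

    def end_record(self):
        pass


def scan_fasta(handle, consumer):
    """Hypothetical scanner: make one callback per parsing event."""
    in_record = False
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if in_record:
                consumer.end_record()
            consumer.start_record()
            consumer.title(line[1:])
            in_record = True
        elif in_record and line:
            consumer.sequence(line)
    if in_record:
        consumer.end_record()


collector = TitleCollector()
scan_fasta(io.StringIO(">a one\nACGT\n>b two\nGG\n"), collector)
print(collector.titles)  # ['a one', 'b two']
```

The scanner knows the file format and the consumer decides what to build, so swapping the consumer changes the output object without touching the scanner. Every line of input costs at least one method call.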

Do any of you use this system to modify existing parser behaviour, or 
use it as part of your own personal file parser?

(a) I don't know, or don't care.  I just use the parsers provided.
(b) I use this framework to modify a parser in order to do ... (please 
provide details).

Peter's answer
 > As a user I don't care about the internals.  I do care about what
 > gets used as the name/id/description for SeqRecords but that level
 > of flexibility is enough.
 >
 > As a BioPython contributor: Martel is scary.  I think I understand
 > the whole scanner/consumer model but don't see the point (unless
 > using an event-based scanner like Martel).  I suspect all the
 > function callbacks are one reason Martel parsers are slow.

Peter


From biopython-dev at maubp.freeserve.co.uk  Mon Jul 31 12:12:26 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Mon, 31 Jul 2006 13:12:26 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <1154339988.1490.81.camel@lplinuxdev>
References: <44CA162F.1040604@maubp.freeserve.co.uk>	<44CA27B1.30107@maubp.freeserve.co.uk>
	<1154339988.1490.81.camel@lplinuxdev>
Message-ID: <44CDF3AA.2020308@maubp.freeserve.co.uk>

Leighton Pritchard wrote:
> Just to add to the confusion, when parsing large FASTA sequence files, I
> have been using a home-rolled Flex/Pyrex parser (if you'd like a copy,
> drop me a line).  I've used Peter's test framework on the same input
> file (NC_000913.ffn), using BioPython 1.41, with Python 2.4 on Fedora
> Core 3 (up-to-date, eh? ;) ) to get the following typical results:

Times for NC_000913.ffn when returning SeqRecord objects:
> 4.07s FormatIO/SeqRecord (for record in interator)
> 4.05s FormatIO/SeqRecord (iterator.next)
> 5.00s Fasta.SequenceParser (for record in interator)
> 4.80s Fasta.SequenceParser (iterator.next)

 > 0.32s SeqIO.FASTA.FastaReader (for record in interator)
 > 0.30s SeqIO.FASTA.FastaReader (iterator.next)
 > 0.31s SeqIO.FASTA.FastaReader (iterator[i])

> 0.32s SeqUtils/quick_FASTA_reader (conversion to SeqRecord)
> 0.17s pyfastaseqlexer/next_record (conversion to SeqRecord)
> 0.16s pyfastaseqlexer/quick_FASTA_reader (conversion to SeqRecord)

And again, but for Phytophthora infestans ESTs with 72000 entries
 > 51.22s FormatIO/SeqRecord (for record in interator)
 > 45.64s FormatIO/SeqRecord (iterator.next)
 > 59.97s Fasta.SequenceParser (for record in interator)
 > 58.70s Fasta.SequenceParser (iterator.next)

 > 4.26s SeqIO.FASTA.FastaReader (for record in interator)
 > 4.10s SeqIO.FASTA.FastaReader (iterator.next)
 > 4.30s SeqIO.FASTA.FastaReader (iterator[i])

 > 2.97s SeqUtils/quick_FASTA_reader (conversion to SeqRecord)
 > 2.11s pyfastaseqlexer/next_record (conversion to SeqRecord)
 > 1.35s pyfastaseqlexer/quick_FASTA_reader (conversion to SeqRecord)

I imagine this file is much, much larger than what most of our users work 
with - but it does clearly show that the Martel parsers do not scale well.

Out of interest, are the sequences in this file split into multiple 
lines (e.g. max length 80) or are they all single (long) lines?  I would 
expect the latter to be quicker to load due to fewer string operations.

 > Of course, the hassles of including a Flex-based parser in a general
 > BioPython release probably outweigh the marginal time-saving benefits
 > (see MMCIFlex for details ;) ).  I think SeqIO.FASTA.FastaReader and
 > SeqUtils.quick_FASTA_reader do a good, quick job as it stands, and
 > beat the inclusion of a Flex-based parser hands-down in terms of
 > maintainability and portability.

I agree with you completely that we should avoid the Flex parser based 
on those grounds, as we can get "close enough" with pure python. 
Especially if we do something about the overhead of Seq and SeqRecord 
objects.

I did some work on a brand new SeqIO over the weekend. I had got the 
fasta iterator slightly quicker too.

The SeqUtils/quick_FASTA_reader is interesting in that it loads the 
entire file into memory in one go, and then parses it.  On the other 
hand it's not perfect: I would use "\n>" as the split marker rather than 
">" which could appear in the description of a sequence.

The iterator approach is probably slower but requires much less memory. 
  How big is your 72,000 entry file in MB?  Do we need to worry about 
the size of the raw file in memory - allowing the parsers to load it 
into memory could make things much faster...

Peter


From lpritc at scri.sari.ac.uk  Mon Jul 31 14:15:54 2006
From: lpritc at scri.sari.ac.uk (Leighton Pritchard)
Date: Mon, 31 Jul 2006 15:15:54 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44CDF3AA.2020308@maubp.freeserve.co.uk>
References: <44CA162F.1040604@maubp.freeserve.co.uk>
	<44CA27B1.30107@maubp.freeserve.co.uk>
	<1154339988.1490.81.camel@lplinuxdev>
	<44CDF3AA.2020308@maubp.freeserve.co.uk>
Message-ID: <1154355358.1490.116.camel@lplinuxdev>

On Mon, 2006-07-31 at 13:12 +0100, Peter (BioPython Dev) wrote:
> I imagine this file is much, much larger than what most of our users work 
> with - but it does clearly show that the Martel parsers do not scale well.

I noticed the scaling problem mostly for GenBank files.  Your new
GenBank parser is a welcome improvement in speed.

> Out of interest, are the sequences in this file split into multiple 
> lines (e.g. max length 80) or are they all single (long) lines?  I would 
> expect the latter to be quicker to load due to fewer string operations.

They're multiple lines with max length 50, and the whole file is 33Mb.
It's not the largest FASTA sequence file I'm working with, that's 353Mb
(530801 sequences, it's most of a eukaryotic genome with sequences split
into multiple lines), so I ran your test script on it, just to see what
happened:

419.42s FormatIO/SeqRecord (for record in interator)
389.05s FormatIO/SeqRecord (iterator.next)
35.46s SeqIO.FASTA.FastaReader (for record in interator)
33.73s SeqIO.FASTA.FastaReader (iterator.next)
36.19s SeqIO.FASTA.FastaReader (iterator[i])
490.19s Fasta.RecordParser (for record in interator)
555.43s Fasta.SequenceParser (for record in interator)
546.87s Fasta.SequenceParser (iterator.next)
37.94s SeqUtils/quick_FASTA_reader
12.84s pyfastaseqlexer/next_record
6.06s pyfastaseqlexer/quick_FASTA_reader
24.08s SeqUtils/quick_FASTA_reader (conversion to Seq)
12.27s pyfastaseqlexer/next_record (conversion to Seq)
8.71s pyfastaseqlexer/quick_FASTA_reader (conversion to Seq)
24.20s SeqUtils/quick_FASTA_reader (conversion to SeqRecord)
18.10s pyfastaseqlexer/next_record (conversion to SeqRecord)
13.45s pyfastaseqlexer/quick_FASTA_reader (conversion to SeqRecord)

This is only one run - my patience has limits <grin>  Again, scaling is
a big problem for some methods.

> The SeqUtils/quick_FASTA_reader is interesting in that it loads the 
> entire file into memory in one go, and then parses it.  On the other 
> hand it's not perfect: I would use "\n>" as the split marker rather than 
> ">" which could appear in the description of a sequence.

I agree (not that it's bitten me, yet), but I'd be inclined to go with
"%s>" % os.linesep as the split marker, just in case.
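
Which marker is right depends on how the file was read: in text mode Python already translates the platform line ending to "\n", so os.linesep would not match on Windows there. One way to sidestep the question is to normalise line endings before splitting. The sketch below does that on an in-memory string; split_fasta_text is a hypothetical name and is not the Bio.SeqUtils function.

```python
def split_fasta_text(text):
    """Split an in-memory FASTA string into (title, sequence) pairs."""
    # Normalise CRLF and CR line endings so a single marker works everywhere.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    entries = []
    # Prefix a newline so the first record's ">" is matched like the others;
    # a ">" inside a description line is left alone.
    for chunk in ("\n" + text).split("\n>")[1:]:
        title, _, body = chunk.partition("\n")
        entries.append((title, body.replace("\n", "")))
    return entries


raw = ">seq1 ratio A>B\r\nACGT\r\nAC\r\n>seq2\r\nGG\r\n"
print(split_fasta_text(raw))
# [('seq1 ratio A>B', 'ACGTAC'), ('seq2', 'GG')]
```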

> Do we need to worry about the size of the raw file in memory - allowing the parsers to load it 
> into memory could make things much faster...

I use very few FASTA files where that would be a problem, so long as the
sequences remain as strings - when they're converted to
SeqRecords/SeqFeatures is where I start to get nervous about memory use.

L.



From biopython-dev at maubp.freeserve.co.uk  Mon Jul 31 15:14:04 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Mon, 31 Jul 2006 16:14:04 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <1154355358.1490.116.camel@lplinuxdev>
References: <44CA162F.1040604@maubp.freeserve.co.uk>	
	<44CA27B1.30107@maubp.freeserve.co.uk>	
	<1154339988.1490.81.camel@lplinuxdev>	
	<44CDF3AA.2020308@maubp.freeserve.co.uk>
	<1154355358.1490.116.camel@lplinuxdev>
Message-ID: <44CE1E3C.2050502@maubp.freeserve.co.uk>

> 
> They're multiple lines with max length 50, and the whole file is 33Mb.
> It's not the largest FASTA sequence file I'm working with, that's 353Mb
> (530801 sequences, it's most of a eukaryotic genome with sequences split
> into multiple lines), so I ran your test script on it, just to see what
> happened:
> 
> 419.42s FormatIO/SeqRecord (for record in interator)
> 389.05s FormatIO/SeqRecord (iterator.next)
> 35.46s SeqIO.FASTA.FastaReader (for record in interator)
> 33.73s SeqIO.FASTA.FastaReader (iterator.next)
> 36.19s SeqIO.FASTA.FastaReader (iterator[i])
> 490.19s Fasta.RecordParser (for record in interator)
> 555.43s Fasta.SequenceParser (for record in interator)
> 546.87s Fasta.SequenceParser (iterator.next)
> 37.94s SeqUtils/quick_FASTA_reader
> 12.84s pyfastaseqlexer/next_record
> 6.06s pyfastaseqlexer/quick_FASTA_reader
> 24.08s SeqUtils/quick_FASTA_reader (conversion to Seq)
> 12.27s pyfastaseqlexer/next_record (conversion to Seq)
> 8.71s pyfastaseqlexer/quick_FASTA_reader (conversion to Seq)
> 24.20s SeqUtils/quick_FASTA_reader (conversion to SeqRecord)
> 18.10s pyfastaseqlexer/next_record (conversion to SeqRecord)
> 13.45s pyfastaseqlexer/quick_FASTA_reader (conversion to SeqRecord)
> 
> This is only one run - my patience has limits <grin>  Again, scaling is
> a big problem for some methods.

Interesting - but no big surprises, except maybe just how slow Martel 
is.  Did you notice if it ran out of memory and had to page to the 
hard disk?

>>The SeqUtils/quick_FASTA_reader is interesting in that it loads the 
>>entire file into memory in one go, and then parses it.  On the other 
>>hand it's not perfect: I would use "\n>" as the split marker rather than 
>>">" which could appear in the description of a sequence.
> 
> I agree (not that it's bitten me, yet), but I'd be inclined to go with
> "%s>" % os.linesep as the split marker, just in case.

Good point.  I wonder how many people even know this function exists?

>>Do we need to worry about the size of the raw file in memory - allowing
 >>the parsers to load it into memory could make things much faster...
> 
> I use very few FASTA files where that would be a problem, so long as the
> sequences remain as strings - when they're converted to
> SeqRecords/SeqFeatures is where I start to get nervous about memory use.

Maybe we should avoid loading entire files into memory while parsing - 
except for those formats like Clustal alignments where there is no real 
choice.
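
As a rough sketch of what parsing without loading the whole file could
look like (iter_fasta is a hypothetical name, not the Bio.SeqIO.FASTA
code), a generator can hold just one record in memory at a time:

```python
import io


def iter_fasta(handle):
    """Yield (title, sequence) tuples from an open FASTA file handle.

    Hypothetical sketch: only the current record is kept in memory.
    """
    title = None
    seq_lines = []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new title line closes the previous record, if any.
            if title is not None:
                yield title, "".join(seq_lines)
            title = line[1:]
            seq_lines = []
        elif title is not None:
            seq_lines.append(line.strip())
    # Emit the final record once the file is exhausted.
    if title is not None:
        yield title, "".join(seq_lines)


# An in-memory handle for demonstration; a handle from open() works the same.
example = io.StringIO(">seq1 first\nACGT\nTTGA\n>seq2\nGG\n")
records = list(iter_fasta(example))
```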

Have you got a feeling for the difference in memory required for a large 
Fasta file in memory as:
* Title string, sequence string
* Title string, sequence as Seq object
* SeqRecords (which include the sequence as a Seq object)

While it's overkill for simple file formats like FASTA, I think we do
need a fairly high level object like the SeqRecord when dealing with 
things like Genbank/EMBL to hold the basic annotation and identifiers 
(id/name/description).

I am thinking that we should have a set of sequence parsers that all 
return SeqRecord objects (with format specific options in some cases to 
control the exact mapping of the data, e.g. title2ids for Fasta files).

And a matching set of sequence writers that take SeqRecord object(s) and 
write them to a file.
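
A minimal sketch of that parser/writer pairing might look like this.
Everything named here is an assumption for illustration: SimpleRecord is a
stand-in for SeqRecord, and the title2ids callback is assumed to take the
title line and return (id, name, description):

```python
import io
from collections import namedtuple

# Hypothetical stand-in for SeqRecord, so the sketch needs no Biopython.
SimpleRecord = namedtuple("SimpleRecord", ["id", "name", "description", "seq"])


def default_title2ids(title):
    """Assumed mapping: first word is the id and name, full title the description."""
    words = title.split(None, 1)
    first = words[0] if words else ""
    return first, first, title


def records_from_pairs(pairs, title2ids=default_title2ids):
    """Turn (title, sequence) pairs into record objects, one at a time."""
    for title, sequence in pairs:
        id_, name, description = title2ids(title)
        yield SimpleRecord(id_, name, description, sequence)


def write_fasta(records, handle, width=60):
    """Write records to a handle as FASTA, wrapping sequence lines at width."""
    for record in records:
        handle.write(">%s\n" % record.description)
        for start in range(0, len(record.seq), width):
            handle.write(record.seq[start:start + width] + "\n")


pairs = [("gi|1 example protein", "MKV" * 25)]
records = list(records_from_pairs(pairs))
out = io.StringIO()
write_fasta(records, out)
```

Keeping the full title in the description means writing a record back out
reproduces the original title line.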

Such a mapping won't be perfect, so maybe there is still a place for 
"format specific representations" like the Record object in 
Bio.GenBank.Record

In the short term maybe we should just replace the internals of the 
current Bio.Fasta module with a pure python implementation like that in 
Bio.SeqIO.FASTA - good idea?  Bad idea?

Peter


From f.schlesinger at iu-bremen.de  Mon Jul 31 16:07:08 2006
From: f.schlesinger at iu-bremen.de (Felix Schlesinger)
Date: Mon, 31 Jul 2006 18:07:08 +0200
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44CE1E3C.2050502@maubp.freeserve.co.uk>
References: <44CA162F.1040604@maubp.freeserve.co.uk>
	<44CA27B1.30107@maubp.freeserve.co.uk>
	<1154339988.1490.81.camel@lplinuxdev>
	<44CDF3AA.2020308@maubp.freeserve.co.uk>
	<1154355358.1490.116.camel@lplinuxdev>
	<44CE1E3C.2050502@maubp.freeserve.co.uk>
Message-ID: <7317d50c0607310907sc468843nfe3945225d2ace76@mail.gmail.com>

> Have you got a feeling for the difference in memory required for a large
> Fasta file in memory as:
> * Title string, sequence string
> * Title string, sequence as Seq object
> * SeqRecords (which include the sequence as a Seq object)

From looking at the code, the only difference should be one instance of
alphabet and one reference to it per sequence.
The main practical difference is that Seq.data.method involves some
Python, while string.method is pure C code.

Felix


From mcolosimo at mitre.org  Mon Jul 31 16:08:50 2006
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Mon, 31 Jul 2006 12:08:50 -0400
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44CE1E3C.2050502@maubp.freeserve.co.uk>
References: <44CA162F.1040604@maubp.freeserve.co.uk>	
	<44CA27B1.30107@maubp.freeserve.co.uk>	
	<1154339988.1490.81.camel@lplinuxdev>	
	<44CDF3AA.2020308@maubp.freeserve.co.uk>
	<1154355358.1490.116.camel@lplinuxdev>
	<44CE1E3C.2050502@maubp.freeserve.co.uk>
Message-ID: <BB4CD5A6-B1C1-4F1B-B66C-B03763419D6D@mitre.org>


On Jul 31, 2006, at 11:14 AM, Peter (BioPython Dev) wrote:


>
>>> The SeqUtils/quick_FASTA_reader is interesting in that it loads the
>>> entire file into memory in one go, and then parses it.  On the other
>>> hand it's not perfect: I would use "\n>" as the split marker rather
>>> than ">" which could appear in the description of a sequence.
>>
>> I agree (not that it's bitten me, yet), but I'd be inclined to go with
>> "%s>" % os.linesep as the split marker, just in case.
>
> Good point.  I wonder how many people even know this function exists?
>

The only problem with this arises if someone sends you a file that was
not created on your system. I remember huge problems 5 or so years ago in
BioPerl with the Mac, Unix, and Windows line-ending issues.
This has mostly simplified down to two (Unix and Windows), unless the
person uses a Mac GUI app, some of which use \r (CR) instead of \n
(LF), where Windows uses \r\n (CRLF). I think the standard Python
distro comes with crlf.py and lfcr.py, which can convert the line endings.
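
For what it's worth, a parser could also sidestep the issue by normalising
line endings itself before splitting.  A minimal sketch, assuming the raw
file contents are available as bytes (normalize_newlines is a hypothetical
name, not an existing Biopython function):

```python
def normalize_newlines(data):
    """Convert CRLF (Windows) and bare CR (old Mac) line endings to LF.

    Hypothetical helper; expects a bytes object read in binary mode.
    """
    # Replace CRLF first, so its CR is not converted a second time.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```

After this step a "\n>" split marker works regardless of which system
produced the file.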

> Maybe we should avoid loading entire files into memory while parsing -
> except for those formats like Clustal alignments where there is no
> real choice.
>
> Have you got a feeling for the difference in memory required for a
> large Fasta file in memory as:
> * Title string, sequence string
> * Title string, sequence as Seq object
> * SeqRecords (which include the sequence as a Seq object)
>
> While it's overkill for simple file formats like FASTA, I think we do
> need a fairly high level object like the SeqRecord when dealing with
> things like Genbank/EMBL to hold the basic annotation and identifiers
> (id/name/description).
>
> I am thinking that we should have a set of sequence parsers that all
> return SeqRecord objects (with format specific options in some cases
> to control the exact mapping of the data, e.g. title2ids for Fasta
> files).
>
> And a matching set of sequence writers that take SeqRecord object(s)
> and write them to a file.
>
> Such a mapping won't be perfect, so maybe there is still a place for
> "format specific representations" like the Record object in
> Bio.GenBank.Record
>
> In the short term maybe we should just replace the internals of the
> current Bio.Fasta module with a pure python implementation like that
> in Bio.SeqIO.FASTA - good idea?  Bad idea?


I would keep them separate but change the documentation on the how-to
site to point to Bio.SeqIO.FASTA, since that is where I think we want
people to start going. The code change to Bio.Fasta should be to add a
deprecation warning.
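
Something along these lines is what I have in mind; this is only a sketch
using the standard warnings module, and both the function name and the
message wording are placeholders:

```python
import warnings


def warn_fasta_deprecated():
    """Emit a deprecation warning pointing users at the replacement module.

    Hypothetical helper; the message text is a placeholder.
    """
    warnings.warn(
        "Bio.Fasta is deprecated; please use Bio.SeqIO.FASTA instead.",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller's line
    )
```

Existing scripts would keep running unchanged; users only see the message
if DeprecationWarning is enabled in their warning filters.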


Marc


From mdehoon at c2b2.columbia.edu  Mon Jul 31 17:34:41 2006
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Mon, 31 Jul 2006 13:34:41 -0400
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <BB4CD5A6-B1C1-4F1B-B66C-B03763419D6D@mitre.org>
References: <44CA162F.1040604@maubp.freeserve.co.uk>		<44CA27B1.30107@maubp.freeserve.co.uk>		<1154339988.1490.81.camel@lplinuxdev>		<44CDF3AA.2020308@maubp.freeserve.co.uk>	<1154355358.1490.116.camel@lplinuxdev>	<44CE1E3C.2050502@maubp.freeserve.co.uk>
	<BB4CD5A6-B1C1-4F1B-B66C-B03763419D6D@mitre.org>
Message-ID: <44CE3F31.2080404@c2b2.columbia.edu>

Marc Colosimo wrote:
> On Jul 31, 2006, at 11:14 AM, Peter (BioPython Dev) wrote:
>> In the short term maybe we should just replace the internals of the
>> current Bio.Fasta module with a pure python implementation like that
>> in Bio.SeqIO.FASTA - good idea?  Bad idea?
> 
> I would keep them separate but change the documentation on the how-to
> site to point to using the Bio.SeqIO.FASTA since that is where I
> think we want people to start going. The code change to Bio.Fasta
> should be to add a deprecation warning.

I agree with Marc here. No need to modify Bio.Fasta if it's on its way out.

--Michiel.

-- 
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032


From biopython-dev at maubp.freeserve.co.uk  Mon Jul 31 17:41:49 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Mon, 31 Jul 2006 18:41:49 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <BB4CD5A6-B1C1-4F1B-B66C-B03763419D6D@mitre.org>
References: <44CA162F.1040604@maubp.freeserve.co.uk>		<44CA27B1.30107@maubp.freeserve.co.uk>		<1154339988.1490.81.camel@lplinuxdev>		<44CDF3AA.2020308@maubp.freeserve.co.uk>	<1154355358.1490.116.camel@lplinuxdev>	<44CE1E3C.2050502@maubp.freeserve.co.uk>
	<BB4CD5A6-B1C1-4F1B-B66C-B03763419D6D@mitre.org>
Message-ID: <44CE40DD.3010101@maubp.freeserve.co.uk>

Peter wrote:
>>In the short term maybe we should just replace the internals of the
>>current Bio.Fasta module with a pure python implementation like  
>>that in Bio.SeqIO.FASTA - good idea?  Bad idea?

Marc wrote:
> I would keep them separate but change the documentation on the how-to
> site to point to using the Bio.SeqIO.FASTA since that is where I
> think we want people to start going. The code change to Bio.Fasta
> should be to add a deprecation warning.

Certainly long term we could do that.  There may be advantages to the 
current very flexible Bio.Fasta code that the SeqIO replacement may not 
offer (e.g. if we focus on just parsing into SeqRecords).

Short Term
----------
Right now I guess most people dealing with Fasta files will be using 
Bio.Fasta, and it is very slow, hence bug 2058:

http://bugzilla.open-bio.org/show_bug.cgi?id=2058

My patch makes Bio.Fasta almost as fast as Bio.SeqIO.FASTA according to 
my tests (modest sized files).

Could any of you try this patch on your machines, on the off chance
that it causes problems for any existing code?  It does pass
test_Fasta.py and test_Fasta2.py on Windows at least.

Medium/Long Term
----------------
We need to sort out what to do with Bio.SeqIO, as currently the existing
code in Bio/SeqIO/generic.py and Bio/SeqIO/FASTA.py uses different
interfaces.  But I do agree that something like that should be OK.

I have been working on a possible replacement (but it doesn't seem to 
have made it to the mailing list yet - must check my recent email).

Peter