From lpritc at scri.sari.ac.uk  Tue Aug  1 06:42:37 2006
From: lpritc at scri.sari.ac.uk (Leighton Pritchard)
Date: Tue, 01 Aug 2006 11:42:37 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <BB4CD5A6-B1C1-4F1B-B66C-B03763419D6D@mitre.org>
References: <44CA162F.1040604@maubp.freeserve.co.uk>
	<44CA27B1.30107@maubp.freeserve.co.uk>
	<1154339988.1490.81.camel@lplinuxdev>
	<44CDF3AA.2020308@maubp.freeserve.co.uk>
	<1154355358.1490.116.camel@lplinuxdev>
	<44CE1E3C.2050502@maubp.freeserve.co.uk>
	<BB4CD5A6-B1C1-4F1B-B66C-B03763419D6D@mitre.org>
Message-ID: <1154428959.4871.11.camel@lplinuxdev>

On Mon, 2006-07-31 at 12:08 -0400, Marc Colosimo wrote: 
> On Jul 31, 2006, at 11:14 AM, Peter (BioPython Dev) wrote:
> >>> The SeqUtils/quick_FASTA_reader is interesting in that it loads the
> >>> entire file into memory in one go, and then parses it.  On the other
> >>> hand it's not perfect: I would use "\n>" as the split marker rather
> >>> than ">" which could appear in the description of a sequence.
> >>
> >> I agree (not that it's bitten me, yet), but I'd be inclined to go with
> >> "%s>" % os.linesep as the split marker, just in case.
> >
> > Good point.  I wonder how many people even know this function exists?
> >
> 
> The only problem with this arises if someone sends you a file not
> created on your system. [...]
> This has mostly simplified down to two - Unix and Windows - unless the
> person uses a Mac GUI app, some of which use \r (CR) instead of \n
> (LF), where Windows uses \r\n (CRLF). I think the standard Python
> distro comes with crlf.py and lfcr.py that can convert the line endings.

Also a good point.  I had a play about with regular expression
splitting/substitution and the SeqUtils.quick_FASTA_reader method to see
if I could capture this variability in line-endings:

import re

def method_quick_FASTA_reader3(filename):
    # Read the whole file, then split on ">" at the start of any line
    with open(filename) as handle:
        txt = handle.read()
    entries = []
    split_marker = re.compile(r'^>', re.M)
    for entry in split_marker.split(txt)[1:]:
        # The title is everything up to the first CR or LF
        name, seq = re.split(r'[\r\n]', entry, maxsplit=1)
        seq = re.sub(r'\s', '', seq).upper()
        entries.append((name, seq))
    return "SeqUtils/quick_FASTA_reader (import re)", len(entries)

Using regular expressions in this way seems to slow things down to about
the same speed as the SeqIO parser, with the disadvantage of still
having to process the entries into SeqRecord objects (if that's what you
want to do with them).  quick_FASTA_reader is a bit of a misnomer in
this case, I guess ;)

4.15s SeqIO.FASTA.FastaReader (for record in iterator)
3.95s SeqIO.FASTA.FastaReader (iterator.next)
4.13s SeqIO.FASTA.FastaReader (iterator[i])
1.89s SeqUtils/quick_FASTA_reader
1.03s pyfastaseqlexer/next_record
0.52s pyfastaseqlexer/quick_FASTA_reader
4.44s SeqUtils/quick_FASTA_reader (import re)

Results are typical for the 72000 record set, and this doesn't look to
be a promising route.

L.

-- 
Dr Leighton Pritchard AMRSC
D131, Plant-Pathogen Interactions, Scottish Crop Research Institute
Invergowrie, Dundee, Scotland, DD2 5DA, UK
T: +44 (0)1382 562731 x2405 F: +44 (0)1382 568578
E: lpritc at scri.sari.ac.uk   W: http://bioinf.scri.sari.ac.uk/lp
GPG/PGP: FEFC205C E58BA41B  http://www.keyserver.net             
(If the signature does not verify, please remove the SCRI disclaimer)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views 
expressed by the sender are not necessarily the views of SCRI and its 
subsidiaries.  This email and any files transmitted with it are confidential 
to the intended recipient at the e-mail address to which it has been 
addressed.  It may not be disclosed or used by any other than that addressee.
If you are not the intended recipient you are requested to preserve this 
confidentiality and you must not use, disclose, copy, print or rely on this 
e-mail in any way. Please notify postmaster at scri.sari.ac.uk quoting the 
name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are 
present in this email, neither the Institute nor the sender accepts any 
responsibility for any viruses, and it is your responsibility to scan the email 
and the attachments (if any).


From pfefferp at staff.uni-marburg.de  Tue Aug  1 08:02:25 2006
From: pfefferp at staff.uni-marburg.de (Patrick Pfeffer)
Date: Tue, 01 Aug 2006 14:02:25 +0200
Subject: [Biopython-dev] GAs in Biopython
Message-ID: <44CF42D1.8090209@staff.uni-marburg.de>

Hi there,

is there any documentation available for using the genetic algorithm
code in the package?

Thanks for any kind of help,
Patrick

-- 
*************************************
Dipl. Bioinf. Patrick Pfeffer
Arbeitskreis Prof. Dr. G. Klebe
Institut für Pharmazeutische Chemie
Raum A116a
Fachbereich Pharmazie
Philipps-Universität Marburg
Marbacher Weg 6
35032 Marburg  Germany
Fon.: 06421/2825908
http://www.agklebe.de
e-mail: pfefferp at staff.uni-marburg.de
************************************* 


From biopython-dev at maubp.freeserve.co.uk  Tue Aug  1 16:53:08 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Tue, 01 Aug 2006 21:53:08 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <1154428959.4871.11.camel@lplinuxdev>
References: <44CA162F.1040604@maubp.freeserve.co.uk>	<44CA27B1.30107@maubp.freeserve.co.uk>	<1154339988.1490.81.camel@lplinuxdev>	<44CDF3AA.2020308@maubp.freeserve.co.uk>	<1154355358.1490.116.camel@lplinuxdev>	<44CE1E3C.2050502@maubp.freeserve.co.uk>	<BB4CD5A6-B1C1-4F1B-B66C-B03763419D6D@mitre.org>
	<1154428959.4871.11.camel@lplinuxdev>
Message-ID: <44CFBF34.7080106@maubp.freeserve.co.uk>

Peter wrote:
>>> The SeqUtils/quick_FASTA_reader is interesting in that it loads the
>>> entire file into memory in one go, and then parses it.  On the other
>>> hand it's not perfect: I would use "\n>" as the split marker
>>> rather than ">" which could appear in the description of a sequence.

Leighton Pritchard replied:
>> I agree (not that it's bitten me, yet), but I'd be inclined to go  
>> with "%s>" % os.linesep as the split marker, just in case.

Peter then wrote:
> Good point.

I take that back - I was right the first time ;)

You are right to worry about the line separator changing from platform
to platform, but you shouldn't use "%s>" % os.linesep as the split
marker.

On Windows os.linesep is "\r\n".  However, when reading Windows style
files on Windows in text mode, the newlines appear in Python as just \n
(as do newlines from Unix files read on Windows), so a "\r\n>" marker
would never match.

Similarly, when writing text files on Windows, \n gets turned into CR LF
on the disk.

Just using "\n>" would work on any platform reading a FASTA file with
the expected newlines.  As a bonus it would work on Windows when reading
unix style newlines.

To get any platform to read newlines from any other platform what I
suggest is using "\n>" as the split string, but open the file in
universal text mode - this seems to work fine on Python 2.3, but I'm not
sure when universal newline reading was introduced.

For example, I created simple files using each of the three newline
conventions (using TextPad on Windows).

>>> import os, sys
>>> sys.platform
'win32'
>>> os.linesep
'\r\n'

>>> open("c:/temp/windows.txt","r").read()
'line\nline\n'
>>> open("c:/temp/mac.txt","r").read()
'line\rline\r'
>>> open("c:/temp/unix.txt","r").read()
'line\nline\n'

(Notice that using "\n>" wouldn't work when reading a Mac style file on
Windows)

>>> open("c:/temp/windows.txt","rU").read()
'line\nline\n'
>>> open("c:/temp/mac.txt","rU").read()
'line\nline\n'
>>> open("c:/temp/unix.txt","rU").read()
'line\nline\n'
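
A minimal sketch of the same idea for current Python versions, where text
mode translates all three newline conventions by default (newline=None)
and the old "rU" flag is no longer needed.  The record contents and the
helper names split_fasta and read_fasta_any_newline are made up for
illustration; this is not the SeqUtils code.

```python
import os
import tempfile

# Hypothetical two-record FASTA text; note the ">" inside the first title.
RECORDS = [">seq1 length>10", "ACGT", "ACGT", ">seq2", "GGCC"]


def split_fasta(text):
    """Split FASTA text on newline-plus-">" and return (title, sequence) pairs."""
    # Prepend a newline so the first record is preceded by the marker too.
    chunks = ("\n" + text).split("\n>")[1:]
    entries = []
    for chunk in chunks:
        title, _, body = chunk.partition("\n")
        entries.append((title, "".join(body.split()).upper()))
    return entries


def read_fasta_any_newline(path):
    # newline=None (the Python 3 text-mode default) translates "\r\n" and
    # "\r" to "\n" on reading, which is what the old "rU" mode did.
    with open(path, "r", newline=None) as handle:
        return split_fasta(handle.read())


results = {}
with tempfile.TemporaryDirectory() as tmp:
    for label, eol in [("unix", "\n"), ("windows", "\r\n"), ("mac", "\r")]:
        path = os.path.join(tmp, label + ".fasta")
        # Write in binary mode so the chosen line ending reaches the disk unchanged.
        with open(path, "wb") as handle:
            handle.write((eol.join(RECORDS) + eol).encode("ascii"))
        results[label] = read_fasta_any_newline(path)
```

All three files should give the same entries, and the ">" inside the
first title is left alone because it does not follow a newline.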


Peter


From lpritc at scri.sari.ac.uk  Wed Aug  2 05:25:27 2006
From: lpritc at scri.sari.ac.uk (Leighton Pritchard)
Date: Wed, 02 Aug 2006 10:25:27 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44CDDD10.4020904@maubp.freeserve.co.uk>
References: <44CA162F.1040604@maubp.freeserve.co.uk>
	<44CD5AF2.10708@c2b2.columbia.edu>
	<44CDDD10.4020904@maubp.freeserve.co.uk>
Message-ID: <1154510728.4871.66.camel@lplinuxdev>

On Mon, 2006-07-31 at 11:36 +0100, Peter (BioPython Dev) wrote:
> Question One
> ============
> Is reading sequence files an important function to you, and if so which 
> file formats in particular (e.g. Fasta, GenBank, ...)

Yes.  FASTA (sequence), GenBank, GFF, PTT, EMBL, ClustalW

> If you have had to write your own code to read a "common" file format 
> which BioPython doesn't support, please get in touch.

EMBL and PTT (though PTT is pretty trivial, and my EMBL parser is not
pretty).

> Question Two - Reading Fasta Files
> ==================================
> Which of the following do you currently use (and why)?:
> 
> (a) Bio.Fasta with the RecordParser (giving FastaRecord objects with a 
> title, and the sequence as a string)
> (b) Bio.Fasta with the FeatureParser (giving SeqRecord objects)
> (c) Bio.Fasta with your own parser (Could you tell us more?)
> (d) Bio.SeqIO.FASTA.FastaReader (giving SeqRecord objects)
> (e) Bio.FormatIO (giving SeqRecord objects)
> (f) Other (Could you tell us more?)

Mostly (f), a homegrown Pyrex/Flex parser.

> Question Three - index_file based dictionaries
> ==============================================
> Do you use any of the following:
> (a) Bio.Fasta.Dictionary
> (b) Bio.Genbank.Dictionary
> (c) Any other "Martel/Mindy" based dictionary which first requires 
> creation of an index using the index_file function

No, but I do create dictionaries on-the-fly from (name, sequence)
tuples, where necessary.

> Question Four - Record Access...
> ================================
> When loading a file with multiple sequences do you use:
> 
> (a) An iterator interface (e.g. Bio.Fasta.Iterator) to give you the 
> records one by one in the order from the file.
> 
> (b) A dictionary interface (e.g. Bio.Fasta.Dictionary) to give you 
> random access to the records using their identifier.
> 
> (c) A list giving random access by index number (e.g. load the records 
> using an iterator but saving them in a list).
> 
> Do you have any additional comments on this?  For example, flexibility 
> versus memory requirements.

Depending on what I need to do, I might use different approaches.  If
I'm filtering sequences on, say, sequence composition, I'll use an
iterator.  If I need to cross-reference sequences from the file to some
other set of sequences by ID, I'll use a dictionary.  In each case, I
will generally either use a for loop or build a dictionary on-the-fly.
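
As a rough sketch of the two access patterns just described, assuming a
parser that yields (name, sequence) tuples: iter_records, gc_fraction and
the data below are hypothetical stand-ins, not the Pyrex/Flex parser.

```python
def iter_records(pairs):
    """Stand-in for a parser: yield (name, sequence) tuples one at a time."""
    for name, seq in pairs:
        yield name, seq


def gc_fraction(seq):
    """Fraction of G and C symbols in a sequence (0.0 for an empty one)."""
    if not seq:
        return 0.0
    return sum(seq.upper().count(base) for base in "GC") / len(seq)


# Hypothetical data standing in for the contents of a FASTA file.
DATA = [("a", "GGCC"), ("b", "ATAT"), ("c", "GCAT")]

# Filtering on composition needs only one pass, so an iterator is enough.
gc_rich = [name for name, seq in iter_records(DATA) if gc_fraction(seq) >= 0.5]

# Cross-referencing by ID needs random access, so build a dictionary on the fly.
by_name = dict(iter_records(DATA))
```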

> Question Four - Fasta files: FastaRecord or SeqRecord
> =====================================================
> If you use Fasta files, do you want to get records returned as FastaRecords 
> or as SeqRecords?  If SeqRecords, do you use your own title2ids mapping?

I'd rather have SeqRecords.  SeqRecords are particularly useful for
annotations and attaching data to the sequence which, later, gets
written out in some format other than FASTA sequence format.  For
operations where no further information is associated with the sequence,
they offer equivalent functionality to FastaRecords.  

Currently I default to (name, seq) tuples, and only create SeqRecords
when necessary, but this is only out of convenience for the parser I
use.

> Question Five - GenBank files: GenbankRecord or SeqRecord
> ==========================================================
> If you use GenBank files, do you use:
> (a) Bio.Genbank.FeatureParser which returns SeqRecord objects
> (b) Bio.Genbank.RecordParser which returns Bio.GenBank.Record objects
> 
> Do you care much either way?  For me the only significant difference is 
> that feature locations are held as objects in the SeqRecord, and as the 
> raw string in the Record.

I use Bio.GenBank.FeatureParser because I prefer the storage of features
(which are what I'm generally interested in) as SeqFeature objects.

> Question Six - Martel, Scanners and Consumers
> ==============================================
> Some of BioPython's existing parsers (e.g. those using Martel) use an 
> event/callback model, where the scanner component generates parsing 
> events which are dealt with by the consumer component.
> 
> Do any of you use this system to modify existing parser behaviour, or 
> use it as part of your own personal file parser?
> 
> (a) I don't know, or don't care.  I just use the parsers provided.
> (b) I use this framework to modify a parser in order to do ... (please 
> provide details).

I care mostly about performance on large files and the convenient
representation of sequences and features.  Where parsers have not been
available (or quickly locatable) for file formats, such as EMBL, I have
sometimes used the Bio.ParserSupport classes and the Scanner/Consumer
pattern.  

L.



From biopython-dev at maubp.freeserve.co.uk  Wed Aug  2 06:45:34 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Wed, 02 Aug 2006 11:45:34 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <1154510728.4871.66.camel@lplinuxdev>
References: <44CA162F.1040604@maubp.freeserve.co.uk>	<44CD5AF2.10708@c2b2.columbia.edu>	<44CDDD10.4020904@maubp.freeserve.co.uk>
	<1154510728.4871.66.camel@lplinuxdev>
Message-ID: <44D0824E.30808@maubp.freeserve.co.uk>

Leighton Pritchard wrote:
> On Mon, 2006-07-31 at 11:36 +0100, Peter (BioPython Dev) wrote:
> 
>>Question One
>>============
>>Is reading sequence files an important function to you, and if so which 
>>file formats in particular (e.g. Fasta, GenBank, ...)
> 
> Yes.  FASTA (sequence), GenBank, GFF, PTT, EMBL, ClustalW
> 

PTT (Protein table files)

http://www.ibt.unam.mx/biocomputo/hom_make_db.html
(Anyone got an NCBI link for the file format?)

GFF (General Feature Format)

http://genome.ucsc.edu/goldenPath/help/customTrack.html#GFF
http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml

GFF and PTT aren't exactly what I would call sequence files, in that
they don't contain any sequence data.  But thinking about it, maybe
those files could be turned into SeqRecords or SeqFeatures (with empty
sequences).

> 
>>If you have had to write your own code to read a "common" file format 
>>which BioPython doesn't support, please get in touch.
> 
> EMBL and PTT (though PTT is pretty trivial, and my EMBL parser is not
> pretty).
> 

It looks like there is enough overlap between the EMBL and GenBank
formats to make sharing code between them a good idea.  Certainly EMBL
was a file format I was thinking we should try to support.

Reading your other comments, it looks like you wouldn't miss FastaRecord
or GenBank records if they were phased out.

Personally, I'm suggesting that any Sequence IO framework should
standardise on returning SeqRecord objects.

Does anyone know if SeqIO stood for Sequence or Sequential Input/Output?

I think we should have a generic "Sequence Iterator" object to do this
which takes a file handle, subclassed for each file format - giving a
"Fasta Iterator", a "Genbank Iterator", a "Clustal Iterator" etc.

I'm inclined not to give any choice of parser object (e.g.
Bio.Fasta.SequenceParser vs Bio.Fasta.RecordParser), and always return a
SeqRecord.

The individual readers should offer some level of control, for example
the title2ids function for Fasta files lets the user decide how the
title line should be broken up into id/name/description.  Also for some
file formats the user should be able to specify the alphabet.
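
One possible shape for this, as a sketch only: a base iterator that
takes a handle, with a Fasta subclass that accepts a title2ids function.
SimpleRecord is a hypothetical stand-in for SeqRecord, and the class and
function names here are assumptions rather than existing Biopython code.

```python
import io
from dataclasses import dataclass


@dataclass
class SimpleRecord:
    """Hypothetical stand-in for SeqRecord; not the Biopython class."""
    id: str
    name: str
    description: str
    seq: str


def default_title2ids(title):
    """Use the first word of the title as both id and name."""
    first_word = title.split(None, 1)[0] if title.strip() else ""
    return first_word, first_word, title


class SequenceIterator:
    """Generic base: wraps a file handle, subclasses implement parse()."""

    def __init__(self, handle):
        self.handle = handle
        self._records = self.parse()

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._records)

    def parse(self):
        raise NotImplementedError("Subclass this for each file format")


class FastaIterator(SequenceIterator):
    def __init__(self, handle, title2ids=default_title2ids):
        self.title2ids = title2ids
        super().__init__(handle)

    def parse(self):
        # Generator: records are built lazily as the caller iterates.
        title, chunks = None, []
        for line in self.handle:
            line = line.rstrip("\r\n")
            if line.startswith(">"):
                if title is not None:
                    yield self._build(title, chunks)
                title, chunks = line[1:], []
            elif title is not None:
                chunks.append(line.strip())
        if title is not None:
            yield self._build(title, chunks)

    def _build(self, title, chunks):
        rec_id, name, description = self.title2ids(title)
        return SimpleRecord(rec_id, name, description, "".join(chunks))


records = list(FastaIterator(io.StringIO(">name text text text\nACGTA\nCACGT\n")))
```

A "Genbank Iterator" or "Clustal Iterator" would then only need to
override parse(), and callers would see the same record type either way.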

Peter


From hoffman at ebi.ac.uk  Wed Aug  2 07:00:46 2006
From: hoffman at ebi.ac.uk (Michael Hoffman)
Date: Wed, 2 Aug 2006 12:00:46 +0100 (BST)
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44CDDD10.4020904@maubp.freeserve.co.uk>
References: <44CA162F.1040604@maubp.freeserve.co.uk>
	<44CD5AF2.10708@c2b2.columbia.edu>
	<44CDDD10.4020904@maubp.freeserve.co.uk>
Message-ID: <Pine.LNX.4.64.0608021154490.27323@qnzvnan.rov.np.hx>

> Question One
> ============
> Is reading sequence files an important function to you, and if so which
> file formats in particular (e.g. Fasta, GenBank, ...)

Yes. FASTA.

> Question Two - Reading Fasta Files
> ==================================
> Which of the following do you currently use (and why)?:
>
> (f) Other (Could you tell us more?)

I have written my own short iterator so that my code is portable
without requiring Biopython to be installed.

> Question Three - index_file based dictionaries
> ==============================================
> Do you use any of the following:

No.

> Question Four - Record Access...
> ================================
> When loading a file with multiple sequences do you use:
>
> (a) An iterator interface (e.g. Bio.Fasta.Iterator) to give you the
> records one by one in the order from the file.

Yes.

> Question Four - Fasta files: FastaRecord or SeqRecord
> =====================================================
> If you use Fasta files, do you want to get records returned as FastaRecords
> or as SeqRecords?  If SeqRecords, do you use your own title2ids mapping?

SeqRecords. I hate it when an interface tries to parse the definition
line for me. Perhaps a set of standard definition line parsers should
be provided so that one can choose, but usually I would rather have
plain text and parse it myself.
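
For illustration, a dependency-free iterator of this kind might look
roughly like the sketch below.  It is not the actual code referred to
above; fasta_records and the example identifiers are hypothetical.  The
title line is handed back as plain text and the caller parses it.

```python
import io


def fasta_records(handle):
    """Yield (title, sequence) pairs; the title line is left unparsed."""
    title, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif line and title is not None:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


# Hypothetical input; the caller decides how to interpret the title text.
example = io.StringIO(
    ">gi|12345|ref|XX_000001.1| example protein\nMKV\nLAA\n>second\nGG\n"
)
pairs = list(fasta_records(example))
accession = pairs[0][0].split("|")[3]
```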

> Question Six - Martel, Scanners and Consumers
> ==============================================
> Some of BioPython's existing parsers (e.g. those using Martel) use an
> event/callback model, where the scanner component generates parsing
> events which are dealt with by the consumer component.
>
> Do any of you use this system to modify existing parser behaviour, or
> use it as part of your own personal file parser?

No.
-- 
Michael Hoffman <hoffman at ebi.ac.uk>
European Bioinformatics Institute

From lpritc at scri.sari.ac.uk  Wed Aug  2 07:23:27 2006
From: lpritc at scri.sari.ac.uk (Leighton Pritchard)
Date: Wed, 02 Aug 2006 12:23:27 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44D0824E.30808@maubp.freeserve.co.uk>
References: <44CA162F.1040604@maubp.freeserve.co.uk>
	<44CD5AF2.10708@c2b2.columbia.edu>	<44CDDD10.4020904@maubp.freeserve.co.uk>
	<1154510728.4871.66.camel@lplinuxdev>
	<44D0824E.30808@maubp.freeserve.co.uk>
Message-ID: <1154517808.4871.93.camel@lplinuxdev>

On Wed, 2006-08-02 at 11:45 +0100, Peter (BioPython Dev) wrote:
> GFF (General Feature Format)
> 
> http://genome.ucsc.edu/goldenPath/help/customTrack.html#GFF
> http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml
> 
> GFF and PTT aren't exactly what I would call sequence files, in that
> they don't contain any sequence data.  

Fair point, but GFF3 (see below) can optionally carry sequence data, and
I use them for exactly what you say here:

> those files could be turned into SeqRecords or SeqFeatures (with empty
> sequences).

I was thinking that GFF3 would be more useful than GFF:

http://song.sourceforge.net/gff3.shtml

NCBI have already gone over to this on bacterial genomes, at least (e.g.
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans/NC_005213.gff),
and it's a much richer format than the original specification.  Andrew
Dalke has already written a GFF3 parser/writer, which is available at

http://www.dalkescientific.com/PyGFF3-0.5.tar.gz

I've not used this in anger, yet...
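
To show what "optionally carry sequence data" means in practice, here is
a small sketch that reads the nine tab-separated GFF3 columns and
collects anything after a "##FASTA" directive.  It is not PyGFF3; the
parse_gff3 name and the example text are hypothetical, and the
attributes column is left as a raw string.

```python
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3(text):
    """Return (features, fasta_text) from GFF3 text.

    Features are dicts keyed by the nine GFF3 column names; fasta_text is
    whatever follows an optional "##FASTA" directive (empty if absent).
    """
    features, fasta_lines, in_fasta = [], [], False
    for line in text.splitlines():
        if in_fasta:
            fasta_lines.append(line)
        elif line.strip() == "##FASTA":
            in_fasta = True
        elif line and not line.startswith("#"):
            feature = dict(zip(GFF3_COLUMNS, line.split("\t")))
            feature["start"] = int(feature["start"])
            feature["end"] = int(feature["end"])
            features.append(feature)
    return features, "\n".join(fasta_lines)


# Hypothetical example, not taken from the NCBI file mentioned above.
EXAMPLE = (
    "##gff-version 3\n"
    "chr1\texample\tgene\t1\t9\t.\t+\t.\tID=gene1\n"
    "##FASTA\n"
    ">chr1\n"
    "ATGAAATAG\n"
)
features, fasta_text = parse_gff3(EXAMPLE)
```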

> It looks like there is enough overlap between the EMBL and GenBank
> formats to make sharing code between them a good idea.  Certainly EMBL
> was a file format I was thinking we should try to support.

In a scanner/consumer pattern it's easy enough.  I've not looked under
the hood of the new GenBank parser yet, to see what you've done.  Most
of my contact with EMBL format is with headerless feature tables and
Artemis, which aren't directly similar to GenBank entries. 

> Reading your other comments, it looks like you wouldn't miss FastaRecord
> or GenBank records if they were phased out.

Not personally, but others may have strong opinions and breakable code,
yet.

> Personally, I'm suggesting that any Sequence IO framework should
> standardise on returning SeqRecord objects.
> 
> I think we should have a generic "Sequence Iterator" object to do this
> which takes a file handle, subclassed for each file format - giving a
> "Fasta Iterator", a "Genbank Iterator", a "Clustal Iterator" etc.

> I'm inclined not to give any choice of parser object (e.g.
> Bio.Fasta.SequenceParser vs Bio.Fasta.RecordParser), and always return a
> SeqRecord.

It may be a side-issue, but should a Clustal parser return an Alignment
object or iterate over SeqRecord objects?  And for that matter, what
about other MSA files in FASTA format?  I think we ought to allow
parsers to return an Alignment where the user requests it, which is a
functionality I'm not currently aware of in the FASTA sequence parsers.

> The individual readers should offer some level of control, for example
> the title2ids function for Fasta files lets the user decide how the
> title line should be broken up into id/name/description.  Also for some
> file formats the user should be able to specify the alphabet.

Could the alphabet be optionally specified by the user on parsing, with
the parser returning a warning or error if there are non-compliant
symbols in the file?  That would act as a quick validator for bad
sequences, or a reminder to the occasionally forgetful that, for
example, they're not working with nucleotide sequences today <cough,
embarrassed glance at floor> ;)

L.



From biopython-dev at maubp.freeserve.co.uk  Wed Aug  2 08:56:23 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Wed, 02 Aug 2006 13:56:23 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <1154517808.4871.93.camel@lplinuxdev>
References: <44CA162F.1040604@maubp.freeserve.co.uk>	<44CD5AF2.10708@c2b2.columbia.edu>	<44CDDD10.4020904@maubp.freeserve.co.uk>	<1154510728.4871.66.camel@lplinuxdev>	<44D0824E.30808@maubp.freeserve.co.uk>
	<1154517808.4871.93.camel@lplinuxdev>
Message-ID: <44D0A0F7.1020402@maubp.freeserve.co.uk>

Leighton Pritchard wrote:
> Fair point, but GFF3 (see below) can optionally carry sequence data,
> and I use them for exactly what you say here:
> 
>> maybe those files could be turned into SeqRecords or SeqFeatures 
>> (with empty sequences).
> 
> I was thinking that GFF3 would be more useful than GFF:
> 
> http://song.sourceforge.net/gff3.shtml
> 

Thanks for the links... interesting that GFF3 allows embedding Fasta
sequences.

>> Reading your other comments, it looks like you wouldn't miss 
>> FastaRecord or GenBank records if they were phased out.
> 
> Not personally, but others may have strong opinions and breakable 
> code, yet.

There is no need to remove the current modules, just mark them as
deprecated.  Of course, if there is some strong support for these
objects then we might not want to be so harsh...

> It may be a side-issue, but should a Clustal parser return an 
> Alignment object or iterate over SeqRecord objects?  And for that 
> matter, what about other MSA files in FASTA format?  I think we ought to
> allow parsers to return an Alignment where the user requests it, 
> which is a functionality I'm not currently aware of in the FASTA 
> sequence parsers.

In my opinion we should offer both.  I would go for loading
clustal/fasta alignments as sequence iterators (as part of the new SeqIO
code) and make it very easy to turn ANY sequence iterator returning
SeqRecords into an alignment.

The current alignment object stores its sequences as SeqRecords
internally but doesn't (yet) allow simple addition of SeqRecords - that
would have to be fixed but it looks easy enough.  Accepting a
SequenceIterator for __init__ would also be nice.
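
A sketch of what that could look like, assuming records are simple
(id, sequence) pairs: SimpleAlignment is a hypothetical container, not
the existing Alignment class, and the gapped sequences are made up.

```python
class SimpleAlignment:
    """Hypothetical alignment container; not Biopython's Alignment class."""

    def __init__(self, records=()):
        self._records = []
        for record in records:  # any iterable, including a parser iterator
            self.add_record(record)

    def add_record(self, record):
        """Append an (id, sequence) pair, enforcing equal aligned lengths."""
        rec_id, seq = record
        if self._records and len(seq) != len(self._records[0][1]):
            raise ValueError("Sequence %r has a different length" % rec_id)
        self._records.append((rec_id, seq))

    def __len__(self):
        return len(self._records)

    def column(self, index):
        """Return the symbols found at one alignment column as a string."""
        return "".join(seq[index] for _, seq in self._records)


# Hypothetical gapped sequences, e.g. as yielded by a Clustal or FASTA iterator.
rows = iter([("a", "AC-GT"), ("b", "ACGGT"), ("c", "A--GT")])
alignment = SimpleAlignment(rows)
```

Because __init__ only loops over its argument, the same constructor
works for a list, a generator, or any sequence iterator.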

>> The individual readers should offer some level of control, for 
>> example the title2ids function for Fasta files lets the user decide
>> how the title line should be broken up into id/name/description. 
>> Also for some file formats the user should be able to specify the 
>> alphabet.
> 
> Could the alphabet be optionally specified by the user on parsing, 
> and maybe return a warning or error if there are non-compliant 
> symbols in the file, as a quick validator for bad sequences, or 
> reminder to the occasionally forgetful that, for example, they're not
> working with nucleotide sequences, today <cough, embarrassed glance 
> at floor> ;)

For some file formats the parser should be able to deduce the alphabet,
but for others, like Fasta, it must be specified.  I like the idea of
optionally checking the alphabet - but it would impose a speed penalty.

Do you think this should be done by the SeqRecord object (on request)?
Each parser could simply ask the SeqRecord object to verify itself
before returning it.
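
As a sketch of an on-request check, assuming an alphabet is just a set
of allowed letters: the letter sets and the verify function below are
illustrative assumptions, not the Bio.Alphabet API.

```python
import warnings

# Assumed symbol sets for illustration only; real alphabets may include more codes.
DNA_LETTERS = frozenset("ACGT")
PROTEIN_LETTERS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def invalid_symbols(seq, letters):
    """Return the sorted list of symbols in seq that are not in letters."""
    return sorted(set(seq.upper()) - letters)


def verify(seq, letters, strict=False):
    """Warn (or raise, if strict) when seq contains non-compliant symbols."""
    bad = invalid_symbols(seq, letters)
    if bad:
        message = "Unexpected symbols for this alphabet: %s" % "".join(bad)
        if strict:
            raise ValueError(message)
        warnings.warn(message)
    return not bad
```

Building the set of symbols costs one pass over the sequence, which is
the speed penalty in question; leaving the check optional means callers
who trust their input do not pay for it.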

Peter

From Leighton.Pritchard at scri.ac.uk  Wed Aug  2 05:00:20 2006
From: Leighton.Pritchard at scri.ac.uk (Leighton Pritchard)
Date: Wed, 2 Aug 2006 10:00:20 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44CFBF34.7080106@maubp.freeserve.co.uk>
References: <44CA162F.1040604@maubp.freeserve.co.uk>
	<44CA27B1.30107@maubp.freeserve.co.uk>
	<1154339988.1490.81.camel@lplinuxdev>
	<44CDF3AA.2020308@maubp.freeserve.co.uk>
	<1154355358.1490.116.camel@lplinuxdev>
	<44CE1E3C.2050502@maubp.freeserve.co.uk>
	<BB4CD5A6-B1C1-4F1B-B66C-B03763419D6D@mitre.org>
	<1154428959.4871.11.camel@lplinuxdev>
	<44CFBF34.7080106@maubp.freeserve.co.uk>
Message-ID: <1154509221.4871.40.camel@lplinuxdev>

An embedded and charset-unspecified text was scrubbed...
Name: not available
Url: http://lists.open-bio.org/pipermail/biopython-dev/attachments/20060802/605b8b80/attachment.pl 
-------------- next part --------------
An embedded message was scrubbed...
From: "Leighton Pritchard" <Leighton.Pritchard at scri.ac.uk>
Subject: Re: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
Date: Wed, 2 Aug 2006 10:00:20 +0100
Size: 4641
Url: http://lists.open-bio.org/pipermail/biopython-dev/attachments/20060802/605b8b80/attachment.mht 

From lpritc at scri.sari.ac.uk  Wed Aug  2 05:02:03 2006
From: lpritc at scri.sari.ac.uk (Leighton Pritchard)
Date: Wed, 02 Aug 2006 10:02:03 +0100
Subject: [Biopython-dev] [Fwd: Re:  Reading sequences: FormatIO, SeqIO, etc]
Message-ID: <1154509323.4871.42.camel@lplinuxdev>

(this time without the signature)

-------------- next part --------------
An embedded message was scrubbed...
From: Leighton Pritchard <lpritc at scri.sari.ac.uk>
Subject: Re: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
Date: Wed, 02 Aug 2006 10:00:20 +0100
Size: 3943
Url: http://lists.open-bio.org/pipermail/biopython-dev/attachments/20060802/5c8fac79/attachment.mht 

From mdehoon at c2b2.columbia.edu  Thu Aug  3 23:20:18 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Thu, 03 Aug 2006 23:20:18 -0400
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
Message-ID: <44D2BCF2.9010500@c2b2.columbia.edu>

> Question One
> ============
> Is reading sequence files an important function to you, and if so which
> file formats in particular (e.g. Fasta, GenBank, ...)
>
I use Fasta, GenBank, and occasionally clustalw.
> 
> Question Two - Reading Fasta Files
> ==================================
> Which of the following do you currently use (and why)?:
>
> (a) Bio.Fasta with the RecordParser (giving FastaRecord objects with a
> title, and the sequence as a string)
> (b) Bio.Fasta with the FeatureParser (giving SeqRecord objects)
> (c) Bio.Fasta with your own parser (Could you tell us more?)
> (d) Bio.SeqIO.FASTA.FastaReader (giving SeqRecord objects)
> (e) Bio.FormatIO (giving SeqRecord objects)
> (f) Other (Could you tell us more?)

I use Bio.Fasta with the RecordParser, but just because it's easy to 
find in the documentation. As a user, I think Bio.Fasta requires too 
many steps to be typed in; I would prefer something more 
straightforward. For the output format, I don't care so much, but for 
the sake of consistency a SeqRecord may be preferable.

> 
> Question Three - index_file based dictionaries
> ==============================================
> Do you use any of the following:
>
> (a) Bio.Fasta.Dictionary
> (b) Bio.GenBank.Dictionary
> (c) Any other "Martel/Mindy" based dictionary which first requires
>     creation of an index using the index_file function
> 

No. I never really understood index files.

> 
> Question Four - Record Access...
> ================================ 
> When loading a file with multiple sequences do you use:
> 
> (a) An iterator interface(e.g. Bio.Fasta.Iterator) to give you the
> records one by one in the order from the file.
> 
> (b) A dictionary interface (e.g. Bio.Fasta.Dictionary) to give you
> random access to the records using their identifier.
> 
> (c) A list giving random access by index number (e.g. load the
> records using an iterator but saving them in a list).
I use (a). It's easy to create (b) or (c), if needed, if (a) is available.
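
A minimal sketch of that point, assuming a hypothetical `iter_fasta` generator that yields `(title, sequence)` tuples (it is not the Bio.Fasta.Iterator API): once an iterator exists, the list and dictionary views each take one line.

```python
import io

def iter_fasta(handle):
    """Yield (title, sequence) tuples from a FASTA handle, one record at a time.

    Hypothetical helper for illustration; not part of Biopython.
    """
    title, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line.strip())
    if title is not None:
        yield title, "".join(chunks)

example = ">alpha first record\nACGT\nACGT\n>beta second record\nTTGA\n"

# (c) a list, giving random access by index number
records = list(iter_fasta(io.StringIO(example)))

# (b) a dictionary keyed on the first word of the title
by_id = dict((title.split()[0], seq)
             for title, seq in iter_fasta(io.StringIO(example)))
```

The dictionary key used here (first word of the title) is only one possible choice; a real interface would presumably let the caller supply the mapping.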
> 
> Question Four - Fasta files: FastaRecord or SeqRecord
> =====================================================
> If you use Fasta files, do you want to get records returned as
> FastaRecords or as SeqRecords?  If SeqRecords, do you use your own
> title2ids mapping?
> 
> For example,
> 
> >name text text text
> ACGTACACGT
> 
> As a FastaRecord this would have:
> 
> FastaRecord.title = "name text text text" (string)
> FastaRecord.sequence = "ACGTACACGT" (string)
>
> As a SeqRecord (with the default title2ids mapping):
>
> SeqRecord.id = (default string)
> SeqRecord.name = (default string)
> SeqRecord.description = "name text text text" (string)
> SeqRecord.seq = Seq("ACGTACACGT", alphabet)
I use the FastaRecord, but again for no particular reason. I have not 
experienced an advantage of Seq objects over simple strings, so for me 
the fact that FastaRecord contains a simple string is more convenient. 
But it doesn't matter much.

> Question Five - GenBank files: GenbankRecord or SeqRecord
> =========================================================
> If you use GenBank files, do you use:
>
> (a) Bio.GenBank.FeatureParser which returns SeqRecord objects
> (b) Bio.GenBank.RecordParser which returns Bio.GenBank.Record objects
> 
I don't care so much, but I think that having two record types is 
confusing, so it would be better if we could decide on one. A SeqRecord
is more general than a Bio.GenBank.Record, so I have a slight preference 
for a SeqRecord.

> 
> Question Six - Martel, Scanners and Consumers
> =============================================
> Some of BioPython's existing parsers (e.g. those using Martel) use an
> event/callback model, where the scanner component generates parsing
> events which are dealt with by the consumer component.
> 
> Do any of you use this system to modify existing parser behaviour, or
> use it as part of your own personal file parser?
> 
> (a) I don't know, or don't care.  I just use the parsers provided.
> (b) I use this framework to modify a parser in order to do ...
> (please provide details).
> 
(a). Often, I'm just at the Python prompt typing away. What I like about 
Python and Numerical Python is that the commands are often obvious and 
easy to remember. With the parser framework, on the other hand, I always 
need to look up in the documentation how to use them.

--Michiel

From dag at sonsorol.org  Fri Aug  4 06:38:52 2006
From: dag at sonsorol.org (Chris Dagdigian)
Date: Fri, 4 Aug 2006 06:38:52 -0400
Subject: [Biopython-dev] Fwd: contributing comparative genomics tools
References: <22DA57C5-461D-48BE-B524-47108330CD80@chem.ucla.edu>
Message-ID: <9AFBA2D3-B8DF-4337-A54A-019F6EAFFC38@sonsorol.org>


Begin forwarded message:

> From: Christopher Lee <leec at chem.ucla.edu>
> Date: August 3, 2006 9:11:42 PM EDT
> To: biopython-dev-owner at lists.open-bio.org
> Subject: Fwd: contributing comparative genomics tools
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hi,
> there appears to be an error in your code submission instructions  
> on the biopython.org/wiki, or in the configuration of the biopython- 
> dev list server.  The code submission instructions tell me to  
> submit my proposal by email to biopython-dev at biopython.org, but the  
> list server responds by saying that all mail will automatically be  
> rejected!  Please forward this proposal to the appropriate people  
> (presumably biopython-dev?), and let me know that you have done  
> so.  Otherwise I won't have any way of knowing whether anyone even  
> reads this email address...
>
> Yours with thanks,
>
> Chris Lee, Dept. of Chemistry & Biochemistry, UCLA
>
> Begin forwarded message:
>
>> You are not allowed to post to this mailing list, and your message  
>> has
>> been automatically rejected.  If you think that your messages are
>> being rejected in error, contact the mailing list owner at
>> biopython-dev-owner at lists.open-bio.org.
>>
>>
>> From: Christopher Lee <leec at chem.ucla.edu>
>> Date: August 3, 2006 3:55:52 PM PDT
>> To: biopython-dev at biopython.org
>> Cc: Namshin Kim <deepreds at gmail.com>
>> Subject: contributing comparative genomics tools
>>
>>
>> Hi Biopython developers,
>> I'd like to contribute some Python tools that my lab has been  
>> developing for large-scale comparative genomics database query.   
>> These tools make it easy to work with huge multigenome alignment  
>> databases (e.g. the UCSC Genome Browser multigenome alignments)  
>> using a new disk-based interval indexing algorithm that gives very  
>> high performance with minimal memory usage.  e.g. whereas queries  
>> of the UCSC 17genome alignment typically take about 30 sec. per  
>> query using MySQL, the same query takes about 200 microsec. per  
>> query, making it possible to run huge numbers of queries for  
>> genome-wide studies.
>>
>> Here's an example usage (click the URL or just look at the code  
>> below)
>> http://bioinfo.mbi.ucla.edu/pygr_0_5_0/seq-align.html#SECTION000125000000000000000
>>
>> We've tested this code very extensively in our own research, and  
>> it has had four open source releases so far.  At this point the  
>> code is in production use.  All the code is compatible back to  
>> Python version 2.2, but not 2.1 or before (we use generators).   
>> There is C code (accessed as Python classes) for the high- 
>> performance interval database index.  For details of history see  
>> the website
>> http://www.bioinformatics.ucla.edu/pygr
>>
>> There is also extensive tutorial and reference documentation:
>> http://bioinfo.mbi.ucla.edu/pygr_0_5_0/
>>
>> Let me know what questions you have, and what process we would  
>> need to follow to contribute this code.
>>
>> Yours with best wishes,
>>
>> Chris Lee, Dept. of Chemistry & Biochemistry, UCLA
>>
>>
>> ####### EXAMPLE USAGE
>> from pygr import cnestedlist
>> msa=cnestedlist.NLMSA('/usr/tmp/ucscDB/mafdb','r') # OPEN THE ALIGNMENT DB
>>
>> def printResults(prefix,msa,site,altID='NULL',cluster_id='NULL',seqNames=None):
>>     'get alignment of each genome to site, print %identity and %aligned'
>>     for src,dest,edge in msa[site].edges(mergeMost=True): # ALIGNMENT QUERY!
>>         print '%s\t%s\t%s\t%s\t%2.1f\t%2.1f\t%s\t%s' \
>>               %(altID,cluster_id,prefix,seqNames[dest],
>>                 100.*edge.pIdentity(),100.*edge.pAligned(),src[:2],dest[:2])
>>
>> def getAlt3Conservation(msa,gene,start1,start2,stop,**kwargs):
>>     'gene must be a slice of a sequence in our genome alignment msa'
>>     ss1=gene[start1-2:start1] # USE SPLICE SITE COORDINATES
>>     ss2=gene[start2-2:start2]
>>     ss3=gene[stop:stop+2]
>>     e1=ss1+ss2 # GET INTERVAL BETWEEN PAIR OF SPLICE SITES
>>     e2=gene[max(start1,start2):stop] # GET INTERVAL BETWEEN e1 AND stop
>>     zone=e1+ss3 # USE zone AS COVERING INTERVAL TO BUNDLE fastacmd REQUESTS
>>     cache=msa[zone].keys(mergeMost=True) # PYGR BUNDLES REQUESTS TO MINIMIZE TRAFFIC
>>     for prefix,site in [('ss1',ss1),('ss2',ss2),('ss3',ss3),('e1',e1),('e2',e2)]:
>>         printResults(prefix,msa,site,seqNames=~(msa.seqDict),**kwargs)
>>
>> # RUN A QUERY LIKE THIS...
>> # getAlt3Conservation(msa,some_gene,some_start,other_start,stop)
>>
>> ############ EXPLANATION & NOTES
>> David Haussler's group has constructed alignments of multiple  
>> genomes. These alignments are extremely useful and interesting,  
>> but so large that it is cumbersome to work with the dataset using  
>> conventional methods. For example, for the 8-genome alignment you  
>> have to work simultaneously with the individual genome datasets  
>> for human, chimp, mouse, rat, dog, chicken, fugu and zebrafish, as  
>> well as the huge alignment itself. Pygr makes this quite easy.  
>> Here we illustrate an example of mapping an alternative 3' exon,  
>> which has two alternative splice sites (start1 and start2) and a  
>> single terminal splice site (stop). We use the alignment database  
>> to map each of these splice sites onto all the aligned genomes,  
>> and to print the percent-identity and percent-aligned for each  
>> genome, as well as the two nucleotides constituting the splice site
>> itself. To examine the conservation of the two exonic regions
>> (between start1 and start2, and the adjacent region terminated by
>> stop), we print the same information for each genome's alignment to
>> these two regions as well. The code first opens the alignment  
>> database. The function (getAlt3Conservation) obtains sequence  
>> slice objects representing the various ``sites'' to be queried.  
>> The actual alignment database query is performed in printResults:
>>
>>     * The alignment database query is in the first line of  
>> printResults(). msa is the database; site is the interval query;  
>> and the edges method iterates over the results, returning a tuple
>> for each, consisting of a source sequence interval (i.e. an  
>> interval of site), a destination sequence interval (i.e. an  
>> interval in an aligned genome), and an edge object describing that  
>> alignment. We are taking advantage of Pygr's group-by operator  
>> mergeMost, which will cause multiple intervals in a given sequence  
>> to be merged into a single interval that constitutes their  
>> ``union''. Thus, for each aligned genome, the edges iterator will  
>> return a single aligned interval. The alignment edge object  
>> provides some useful conveniences, such as calculating the percent- 
>> identity between src and dest automatically for you. pIdentity()  
>> computes the fraction of identical residues; pAligned computes the  
>> fraction of aligned residues (allowing you to see if there are big  
>> gaps or insertions in the alignment of this interval). If we had  
>> wanted to inspect the detailed alignment letter by letter, we  
>> would just iterate over the letters attribute instead of the edges  
>> method. (See the NLMSASlice documentation for further information).
>>
>>     * src[:2] and dest[:2] print the first two nucleotides of the  
>> site in gene and in the aligned genome.
>>
>>     * it's worth noting that the actual sequence string  
>> comparisons are being done using a completely different database  
>> mechanism (formerly NCBI's fastacmd, now our own (much faster)  
>> pureseq text format), not the cnestedlist database. Basically,  
>> each genome is being queried as a separate BLAST formatted  
>> database, represented in Pygr by the BlastDB class. Pygr makes  
>> this complex set of multi-database operations more or less  
>> transparent to the user. For further information, see the BlastDB  
>> documentation.
>>
>>     * The other operations here are entirely vanilla: mainly  
>> slicing a gene sequence to obtain the specific sites that we want  
>> to query. Note: gene must itself be a slice of a sequence in our  
>> alignment, or the alignment query msa[site] will raise an  
>> IndexError informing the user that the sequence site is not in the  
>> alignment.
>>
>>     * The only slightly interesting operation here is the use of  
>> interval addition to obtain the ``union'' of two intervals, e.g.  
>> e1=ss1+ss2. This obtains a single interval that contains both of  
>> the input intervals.
>>
>>     * When the print statement requests str() representations of  
>> these sequence objects, Pygr uses fastacmd -L to extract just the  
>> right piece of the corresponding chromosomes from the eight BLAST  
>> databases.
>>
>> (Actually, because of Pygr's caching / optimizations, considerably  
>> more is going on than indicated in this simplified sketch. But you  
>> get the idea: Pygr makes it relatively effortless to work with a  
>> variety of disparate (and large) resources in an integrated way.)
>>
>> Here is some example output:
>>
>> 1       Mm.99996        ss1     hg17    50.0    100.0   AG      GG
>> 1       Mm.99996        ss1     canFam1 50.0    100.0   AG      GG
>> 1       Mm.99996        ss1     panTro1 50.0    100.0   AG      GG
>> 1       Mm.99996        ss1     rn3     100.0   100.0   AG      AG
>> 1       Mm.99996        ss2     hg17    100.0   100.0   AG      AG
>> 1       Mm.99996        ss2     canFam1 100.0   100.0   AG      AG
>> 1       Mm.99996        ss2     panTro1 100.0   100.0   AG      AG
>> 1       Mm.99996        ss2     rn3     100.0   100.0   AG      AG
>> 1       Mm.99996        ss3     hg17    100.0   100.0   GT      GT
>> 1       Mm.99996        ss3     canFam1 100.0   100.0   GT      GT
>> 1       Mm.99996        ss3     panTro1 100.0   100.0   GT      GT
>> 1       Mm.99996        ss3     rn3     100.0   100.0   GT      GT
>> 1       Mm.99996        e1      hg17    78.9    100.0   AG      GG
>> 1       Mm.99996        e1      canFam1 84.2    100.0   AG      GG
>> 1       Mm.99996        e1      panTro1 77.6    100.0   AG      GG
>> 1       Mm.99996        e1      rn3     97.4    98.7    AG      AG
>> 1       Mm.99996        e2      hg17    91.6    99.1    CC      CC
>> 1       Mm.99996        e2      canFam1 88.8    99.1    CC      CC
>> 1       Mm.99996        e2      panTro1 91.6    99.1    CC      CC
>> 1       Mm.99996        e2      rn3     97.2    100.0   CC      CC
>>
>>
>>
>>
>> -----BEGIN PGP SIGNATURE-----
>> Version: GnuPG v1.4.2.2 (Darwin)
>>
>> iD8DBQFE0n8GLQ4dB3bqQz4RApcxAKCIHdZ9mttB1uC4HkY3xXEw1cWYswCeIg4i
>> xhxE2zrffLaiCjSiEp4Eo6k=
>> =BeOe
>> -----END PGP SIGNATURE-----
>>
>>
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.2.2 (Darwin)
>
> iD8DBQFE0p7iLQ4dB3bqQz4RAkzJAJ4wxiZqi7lZGBUMTFwyquGOCajiKQCfUDBm
> Wx/4AIstFjb+rbqY2QBppLg=
> =fghY
> -----END PGP SIGNATURE-----
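
The interval addition described in the forwarded notes (e1=ss1+ss2 giving a single interval that contains both inputs) can be sketched with a toy class. This is an illustration under stated assumptions only: the `Interval` class and its attribute names are hypothetical and are not pygr's sequence-slice implementation.

```python
class Interval(object):
    """Toy half-open interval [start, stop) on a named sequence.

    Illustration only; this is not pygr's sequence-slice class.
    """
    def __init__(self, seq_id, start, stop):
        self.seq_id, self.start, self.stop = seq_id, start, stop

    def __add__(self, other):
        # "Union" in the covering sense: the smallest single interval
        # that contains both operands, including any gap between them.
        if self.seq_id != other.seq_id:
            raise ValueError("intervals are on different sequences")
        return Interval(self.seq_id,
                        min(self.start, other.start),
                        max(self.stop, other.stop))

    def __repr__(self):
        return "Interval(%r, %d, %d)" % (self.seq_id, self.start, self.stop)

# Two splice sites, each the two nucleotides before a start coordinate.
start1, start2 = 100, 160
ss1 = Interval("gene", start1 - 2, start1)
ss2 = Interval("gene", start2 - 2, start2)
e1 = ss1 + ss2   # covers everything from ss1 through ss2
```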


From biopython-dev at maubp.freeserve.co.uk  Sat Aug 12 04:25:41 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Sat, 12 Aug 2006 09:25:41 +0100
Subject: [Biopython-dev]  Reading sequences: FormatIO, SeqIO, etc
Message-ID: <320fb6e00608120125t53f11b6et2bde5154e5bc19ec@mail.gmail.com>

I'm having a few issues with my email setup which is why I haven't
replied recently.

A week ago I filed bug 2059 for this discussion, and attached some code:

http://bugzilla.open-bio.org/show_bug.cgi?id=2059

I'm interested in your feedback, from the overall framework down to
details such as class names you don't like.

Peter

From krewink at inb.uni-luebeck.de  Wed Aug 16 08:44:07 2006
From: krewink at inb.uni-luebeck.de (Albert Krewinkel)
Date: Wed, 16 Aug 2006 14:44:07 +0200
Subject: [Biopython-dev] Fwd: contributing comparative genomics tools
Message-ID: <20060816124407.GF12386@pc09.inb.uni-luebeck.de>

Hello,

I read Peter's SeqIO/__init__.py replacement and if I may say so: I
love it.  Thanks a lot for this!  Still, there are some things I'd
like to talk about.

The _parse_genbank_features function could also be used to parse EMBL
or DDBJ features, therefore I think it should be named differently.


Since there is a lot of clean up effort right now: How about moving
the SeqRecord and SeqFeature objects into the Bio.Seq module?  They
are closely related and separate modules only clutter the namespace.

To me, this seems to be a general problem. It's very difficult to find
a tool to use for a certain problem if one doesn't already know what
to look for.  I'd very much favour creating modules like
Bio.structure to group modules like Bio.PDB and Bio.NMR etc.  This is
a very big change, and therefore I'd like to follow Marc's suggestion
of splitting off a branch.  In general, I pretty much agree with what
Marc said in his <rant />.

I cannot estimate how much work it would be to maintain two separate
biopython distributions, so please forgive me if I re-suggest
something completely idiotic here.  I just don't believe there is much
that could be lost that way.

Cheers,
Albert

-- 
Albert Krewinkel <krewink at inb.uni-luebeck.de>
University of Luebeck, Institute for Neuro- and Bioinformatics

From biopython-dev at maubp.freeserve.co.uk  Wed Aug 16 10:00:36 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Wed, 16 Aug 2006 15:00:36 +0100
Subject: [Biopython-dev]  Reading sequences: FormatIO, SeqIO, etc
Message-ID: <320fb6e00608160700t33ed0f4ds7311124d70781a64@mail.gmail.com>

(I changed the subject to that of the previous discussion, as this
isn't really about "contributing comparative genomics tools")

Albert Krewinkel wrote:
> Hello,
>
> I read Peter's SeqIO/__init__.py replacement and if I may say so: I
> love it.  Thanks a lot for this!  Still, there are some things I'd
> like to talk about.

Thank you :) The code is on Bug 2059 for anyone who hasn't looked yet.

http://bugzilla.open-bio.org/show_bug.cgi?id=2059

> The _parse_genbank_features function could also be used to parse EMBL
> or DDBJ features, therefore I think it should be named differently.

First of all, that bit of code is for a new feature which I personally
wanted - to be able to iterate over CDS features in a genbank file.

But yes, I did have in mind that it (and the GenBank parser) could be
re-used to deal with EMBL files.  I have not yet taken the time to
learn the EMBL file format and how it corresponds to the GenBank file
format - but I agree a lot of the code could be shared.

> Since there is a lot of clean up effort right now: How about moving
> the SeqRecord and SeqFeature objects into the Bio.Seq module?  They
> are closely related and separate modules only clutter the namespace.

What real benefit does that give us?  It will cause a certain amount
of upheaval in the short term as people will have to change their
import statements on existing scripts.  If we do start a new branch
for "big changes" then I have no real problem with this suggestion.

> To me, this seems to be a general problem. It's very difficult to find
> a tool to use for a certain problem if one doesn't already know what
> to look for.  I'd very much favour creating modules like
> Bio.structure to group modules like Bio.PDB and Bio.NMR etc.  This is
> a very big change, and therefore I'd like to follow Marc's suggestion
> of splitting off a branch.  In general, I pretty much agree with what
> Marc said in his <rant />.
>
> I cannot estimate how much work it would be to maintain two separate
> biopython distributions, so please forgive me if I re-suggest
> something completely idiotic here.  I just don't believe there is much
> that could be lost that way.

BioPython probably would benefit from a little reorganising - and for
anything drastic like moving entire modules about, a new branch makes
sense.  On the other hand, do we have the man-power to do it?  Are any
of the developers familiar with all of (or even most of) the existing
modules?  I would guess I have used less than half of the modules - I
have looked at the very basics of Bio.PDB for example, but have never
tried Bio.NMR

I would favour gradual incremental (and backwards compatible) changes,
such as adding a new sequence reading module and then marking the old
code as deprecated.
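
As a rough sketch of that incremental route, the old entry point can stay in place as a thin wrapper that warns and delegates to the new code. The names `old_reader` and `new_reader` are hypothetical stand-ins, not existing Biopython functions; only the standard library `warnings` module is assumed.

```python
import warnings

def new_reader(lines):
    """Stand-in for a new sequence-reading function (hypothetical name)."""
    return [line.rstrip("\n") for line in lines]

def old_reader(lines):
    """Deprecated entry point kept so that existing scripts continue to run."""
    # stacklevel=2 makes the warning point at the caller's line, not this one.
    warnings.warn("old_reader is deprecated; use new_reader instead",
                  DeprecationWarning, stacklevel=2)
    return new_reader(lines)
```

Existing scripts keep working unchanged, and anyone running with warnings enabled sees which call to update.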

As examples of some small changes, have any of you looked at:

Bug 2057 - SeqRecord has no __str__ or __repr__
http://bugzilla.open-bio.org/show_bug.cgi?id=2057

Bug 1963 - Adding __str__ method to codon tables and translators
http://bugzilla.open-bio.org/show_bug.cgi?id=1963

Little things in themselves that I think would help.

Peter

From krewink at inb.uni-luebeck.de  Wed Aug 16 10:44:36 2006
From: krewink at inb.uni-luebeck.de (Albert Krewinkel)
Date: Wed, 16 Aug 2006 16:44:36 +0200
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <320fb6e00608160700t33ed0f4ds7311124d70781a64@mail.gmail.com>
References: <320fb6e00608160700t33ed0f4ds7311124d70781a64@mail.gmail.com>
Message-ID: <20060816144436.GG12386@pc09.inb.uni-luebeck.de>

On Wed, Aug 16, 2006 at 03:00:36PM +0100, Peter wrote:
> Albert Krewinkel wrote:
> >The _parse_genbank_features function could also be used to parse EMBL
> >or DDBJ features, therefore I think it should be named differently.
> 
> First of all, that bit of code is for a new feature which I personally
> wanted - to be able to iterate over CDS features in a genbank file.
> 
> But yes, I did have in mind that it (and the GenBank parser) could be
> re-used to deal with EMBL files.  I have not yet taken the time to
> learn the EMBL file format and how it corresponds to the GenBank file
> format - but I agree a lot of the code could be shared.

I will try to build something similar for EMBL files within the next
days.  This should be easy, since features really should look the same
in both formats:

http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html
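
Because both formats follow the shared feature table definition linked above, one way to reuse a single feature parser is to blank out the EMBL "FT" line prefix before parsing. The sketch below assumes the column layout of that definition (feature key in columns 6 to 20, location from column 22) and uses hypothetical helper names; it is not the code attached to bug 2059.

```python
def normalise_feature_line(line):
    """Return a feature-table line with any EMBL 'FT' prefix blanked out."""
    if line.startswith("FT"):
        return "  " + line[2:]
    return line

def split_feature_line(line):
    """Split a feature-table line into (key, rest).

    The key is an empty string on continuation and qualifier lines.
    """
    line = normalise_feature_line(line.rstrip())
    return line[5:20].strip(), line[21:].strip()

# Build example lines with explicit padding so the columns are unambiguous:
# 5 leading characters, a 16-character key field, then the location.
genbank_line = "     %-16s%s" % ("CDS", "1..206")
embl_line = "FT   %-16s%s" % ("CDS", "1..206")
```

With the prefix removed, both example lines give the same `(key, location)` pair, so everything downstream of `split_feature_line` could be shared.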

> >Since there is a lot of clean up effort right now: How about moving
> >the SeqRecord and SeqFeature objects into the Bio.Seq module?  They
> >are closely related and separate modules only clutter the namespace.
> 
> What real benefit does that give us?  It will cause a certain amount
> of upheaval in the short term as people will have to change their
> import statements on existing scripts.  If we do start a new branch
> for "big changes" then I have no real problem with this suggestion.

Agree.

> >To me, this seems to be a general problem. It's very difficult to find
> >a tool to use for a certain problem if one doesn't already know what
> >to look for.  I'd very much favour creating modules like
> >Bio.structure to group modules like Bio.PDB and Bio.NMR etc.  This is
> >a very big change, and therefore I'd like to follow Marc's suggestion
> >of splitting off a branch.  In general, I pretty much agree with what
> >Marc said in his <rant />.
> >
> >I cannot estimate how much work it would be to maintain two separate
> >biopython distributions, so please forgive me if I re-suggest
> >something completely idiotic here.  I just don't believe there is much
> >that could be lost that way.
> 
> BioPython probably would benefit from a little reorganising - and for
> anything drastic like moving entire modules about, a new branch makes
> sense.  On the other hand, do we have the man-power to do it?  Are any
> of the developers familiar with all of (or even most of) the existing
> modules?  I would guess I have used less than half of the modules - I
> have looked at the very basics of Bio.PDB for example, but have never
> tried Bio.NMR

I attached a file which I created when I was teaching myself
biopython. It provides a basic grouping for the current biopython
modules.  Naturally, it's by no means complete and probably wrong in
some places.

> I would favour gradual incremental (and backwards compatible) changes.
> Such as adding a new sequence reading module and then marking the old
> code as deprecated.

I think we could do both: A new branch might make it easier to see
which modules are useful the way they are and which are not.  Even if
this separate branch is never released itself, it would still be handy
for coordinating the reorganisation.

> As examples of some small changes, have any of you looked at:
> 
> Bug 2057 - SeqRecord has no __str__ or __repr__
> http://bugzilla.open-bio.org/show_bug.cgi?id=2057
> 
> Bug 1963 - Adding __str__ method to codon tables and translators
> http://bugzilla.open-bio.org/show_bug.cgi?id=1963
> 
> Little things in themselves that I think would help.

True.  My (naive) hope is that such things would be by-products of a
new branch.  I have to admit that this is probably not possible
without doing a code sprint.

Albert

-- 
Albert Krewinkel <krewink at inb.uni-luebeck.de>
University of Luebeck, Institute for Neuro- and Bioinformatics
-------------- next part --------------
Databases:
 o NCBI
   - UniGene
   - GenBank
   - PubMed
   - Entrez
   - LocusLink
   - Geo
 o Kabat
 o KEGG
 o SwissProt
 o Medline
 o biblio (pywebsvcs dependency is mentioned only in the module itself)
 o dbdefs
 o InterPro
 o Gobase
 o Enzyme
 o Rebase

Models and Simulations:
 o Ais
 o MetaTool
 o Pathway
 o ECell

Algorithms, Machine Learning and Pattern Recognition:
 o HMM
 o NeuralNetwork
 o Cluster
 o LogisticRegression, Statistics
 o GA
 o MarkovModel
 o pairwise2
 o NaiveBayes
 o MaxEntropy

Alignments:
 o Align
 o Blast
 o AlignAce
 o Clustalw
 o Fasta
 o FSSP
 o SubsMat
 o Search (WUBLAST output)
 o Saf
 o IntelliGenetics

Applications:
 o Application
 o Emboss
 o Nexus
 o AlignAce
 o Blast
 o MEME
 o Sequencing
 o Wise

Data Structures:
 o KDTree
 o trie

Sequences:
 o GFF
 o Seq
 o SeqUtils
 o SeqFeature
 o SeqRecord
 o Alphabet
 o Transcribe
 o Translate
 o lcc
 o Encodings
 o Data
 o NBRF

SeqIO:
 o writers
 o Writer
 o SeqIO
 o builders
 o Fasta
 o Index

Utilities:
 o utils.py
 o ParserSupport
 o File
 o Tools
 o Mindy
 o HotRand
 o config
 o formatdefs
 o MarkupEditor
 o DocSQL (wouldn't usage of SQL-Object be nicer? (if possible))
 o EUtils.ReseekFile
 o Std, StdHandler
 o PropertyManager
 o MultiProc
 o Decode
 o FilteredReader

Graphics:
 o Graphics

Web-Based:
 o GenBank
 o NetCache
 o EUtils
 o WWW

Microarrays:
 o Affy

Structure:
 o NMR
 o PDB
 o Crystal
 o Ndb
 o SCOP
 o SVDSuperimposer

Motifs:
 o MEME
 o Prosite
 o CDD
 o Compass

References:
 o Medline, PubMed
 o DBXref

Restriction:
 o Restriction
 o CAPS


From biopython-dev at maubp.freeserve.co.uk  Wed Aug 16 12:05:12 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Wed, 16 Aug 2006 17:05:12 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <20060816144436.GG12386@pc09.inb.uni-luebeck.de>
References: <320fb6e00608160700t33ed0f4ds7311124d70781a64@mail.gmail.com>
	<20060816144436.GG12386@pc09.inb.uni-luebeck.de>
Message-ID: <44E34238.2010508@maubp.freeserve.co.uk>

Albert Krewinkel wrote:
>>> The _parse_genbank_features function could also be used to parse EMBL
>>> or DDBJ features, therefore I think it should be named differently.

Peter wrote:
>> First of all, that bit of code is for a new feature which I personally
>> wanted - to be able to iterate over CDS features in a genbank file.
>>
>> But yes, I did have in mind that it (and the GenBank parser) could be
>> re-used to deal with EMBL files.  I have not yet taken the time to
>> learn the EMBL file format and how it corresponds to the GenBank file
>> format - but I agree a lot of the code could be shared.

Albert Krewinkel wrote:
> I will try to build something similar for EMBL files within the next
> days.  This should be easy, since features really should look the same
> in both formats:
> 
> http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html
> 

Oh - you meant just adding EMBL feature iteration.  I was thinking
about the larger task of full EMBL file reading.

Doing just the features is very easy, here you go:

http://bugzilla.open-bio.org/show_bug.cgi?id=2059#c2

Any more feedback is very welcome.  Are you using the iterators 
directly, or via the helper function File2SequenceIterator?

Are you using just the sequence iterators, or the dictionary and list 
versions too?

Peter

From biopython-dev at maubp.freeserve.co.uk  Wed Aug 16 18:20:28 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Wed, 16 Aug 2006 23:20:28 +0100
Subject: [Biopython-dev] Tweaking the SeqRecord class
Message-ID: <320fb6e00608161520j5fb6b4fejd7aa8cc839989423@mail.gmail.com>

In the spirit of gradual improvements, I had a look at the SeqRecord class.

First of all, is there any comment on my suggestion to add __str__ and
__repr__ methods to the SeqRecord object, bug 2057:

http://bugzilla.open-bio.org/show_bug.cgi?id=2057
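
For readers who have not opened the bug, this is roughly the kind of thing meant. It is a sketch on a toy class with assumed output formats, not the patch attached to bug 2057: `__repr__` gives an unambiguous form for the interactive prompt, and `__str__` gives a short human-readable summary.

```python
class Record(object):
    """Toy stand-in for SeqRecord; the attribute names follow this thread."""
    def __init__(self, seq, id="<unknown id>", description=""):
        self.seq, self.id, self.description = seq, id, description

    def __repr__(self):
        # Unambiguous, developer-facing form shown at the Python prompt.
        return "%s(seq=%r, id=%r, description=%r)" % (
            self.__class__.__name__, self.seq, self.id, self.description)

    def __str__(self):
        # Human-readable summary; long sequences are truncated.
        seq = self.seq if len(self.seq) <= 20 else self.seq[:17] + "..."
        return "ID: %s\nDescription: %s\nSeq: %s" % (
            self.id, self.description, seq)

rec = Record("ACGTACACGT", id="name", description="text text text")
```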

Next, I'd like to check in some basic __doc__ strings for the
SeqRecord class, e.g. something like this:

>>> from Bio.SeqRecord import SeqRecord
>>> print SeqRecord.__doc__
The SeqRecord object is designed to hold a sequence and information about it.

    Main properties:
    id          - Identifier such as a locus tag (string)
    seq         - The sequence itself (Seq object)

    Additional properties:
    name        - Sequence name, e.g. gene name (string)
    description - Additional text (string)
    dbxrefs     - List of database cross references (list of strings)
    features    - Any (sub)features defined (list of SeqFeature objects)
    annotations - Further information (dictionary)

I would also like to add doc strings to the id, seq, name, ...
themselves.  However, they are currently stored as attributes so this
isn't possible.  See PEP 0224,
http://www.python.org/dev/peps/pep-0224/

However, we could use the Python 2.2 "property" function to implement
these as properties.  The code might be clearer using the Python 2.4
"decorator" syntax, but I don't think we should depend on such a
recent version of python yet.

Using properties would allow this usage:

>>> print SeqRecord.features.__doc__
Annotations about parts of the sequence (list of SeqFeatures)

It would also mean that these properties show up in dir(SeqRecord) and
help(SeqRecord), which all in all should make the object slightly
easier to use.
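
A minimal sketch of the property approach, written without decorator syntax so it would also suit older Python versions. The class below is a toy with hypothetical private names, not the attached SeqRecord.py; the fourth argument to `property` is the docstring.

```python
class Record(object):
    """Toy record showing a documented property (not the real SeqRecord)."""
    def __init__(self, features=None):
        self._features = features if features is not None else []

    def _get_features(self):
        return self._features

    def _set_features(self, value):
        self._features = value

    # property(fget, fset, fdel, doc): the doc string becomes
    # Record.features.__doc__ and is shown by help(Record).
    features = property(_get_features, _set_features, None,
                        "Annotations about parts of the sequence "
                        "(list of SeqFeatures)")
```

Because the property lives on the class, it is listed by `dir(Record)` even before any instance exists, which a plain instance attribute set in `__init__` is not.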

Finally, using get/set property functions allows us to postpone
creation of string/list/dict objects for unused properties.  This does
actually seem to bring a slight improvement to the timings for Fasta
file parsing discussed last month.
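
The postponed creation could look something like the following sketch. `LazyRecord` and its `_annotations` slot are hypothetical names for illustration; the idea is simply that the dictionary is only allocated the first time someone asks for it.

```python
class LazyRecord(object):
    """Toy record whose annotations dictionary is created on first use."""
    def __init__(self, seq):
        self.seq = seq
        self._annotations = None  # nothing allocated yet

    def _get_annotations(self):
        # Create the dictionary lazily, on first access.
        if self._annotations is None:
            self._annotations = {}
        return self._annotations

    def _set_annotations(self, value):
        self._annotations = value

    annotations = property(_get_annotations, _set_annotations, None,
                           "Further information (dictionary)")
```

A parser that only ever touches `seq` never pays for the unused containers, which would be the source of any saving on large Fasta files.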

If you recall, for the fastest parsers turning the data into SeqRecord
and Seq objects imposed a fairly large overhead (compared to just
using strings):

http://lists.open-bio.org/pipermail/biopython-dev/2006-July/002407.html

I would be interested to see how those numbers change with the
attached implementation - if you wouldn't mind please Leighton... ;)

I have attached a version of SeqRecord.py which implements the changes
I have described.  The backwards compatibility if statement is a bit
ugly - can we just assume Python 2.2 or later?

Peter
-------------- next part --------------
A non-text attachment was scrubbed...
Name: SeqRecord.py
Type: text/x-script.phyton
Size: 9367 bytes
Desc: not available
Url : http://lists.open-bio.org/pipermail/biopython-dev/attachments/20060816/9e2f173c/attachment-0001.bin 

From mdehoon at c2b2.columbia.edu  Wed Aug 16 21:39:12 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Wed, 16 Aug 2006 21:39:12 -0400
Subject: [Biopython-dev] Tweaking the SeqRecord class
In-Reply-To: <320fb6e00608161520j5fb6b4fejd7aa8cc839989423@mail.gmail.com>
References: <320fb6e00608161520j5fb6b4fejd7aa8cc839989423@mail.gmail.com>
Message-ID: <44E3C8C0.5070200@c2b2.columbia.edu>

Peter wrote:
> First of all, is there any comment on my suggestion to add __str__ and
> __repr__ methods to the SeqRecord object, bug 2057:
> 
> http://bugzilla.open-bio.org/show_bug.cgi?id=2057
Here's a thought:
What if Seq were to inherit from str, and SeqRecord from Seq?
Then, you get these for free.
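
A small sketch of the first half of that idea, using a toy class rather than the real Bio.Seq.Seq. The string behaviour (str(), len(), comparison, count() and so on) is inherited; one caveat is that inherited operations such as slicing return plain strings, so the alphabet is lost unless those methods are overridden.

```python
class Seq(str):
    """Toy string subclass carrying an alphabet; not the Bio.Seq.Seq class."""
    def __new__(cls, data, alphabet=None):
        # str is immutable, so the value is set in __new__, not __init__.
        self = str.__new__(cls, data)
        self.alphabet = alphabet
        return self

s = Seq("ACGTACACGT", alphabet="DNA")
first_four = s[:4]  # a plain str: the alphabet does not survive slicing
```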

> Next, I'd like to check in some basic __doc__ strings for the
> SeqRecord class, e.g. something like this:
Sounds good to me. Pretty amazing, actually, that SeqRecord doesn't have 
documentation.

> If you recall, for the fastest parsers turning the data into SeqRecord
> and Seq objects imposed a fairly large overhead (compared to just
> using strings):
> 
> http://lists.open-bio.org/pipermail/biopython-dev/2006-July/002407.html
I wonder if this is still true if a Seq object and a SeqRecord object 
inherit from string. From the code, I don't see where the overhead comes 
from.


> The backwards compatibility if statement is a bit
> ugly - can we just assume Python 2.2 or later?
Biopython currently requires Python 2.3 or later.

--Michiel.

From krewink at inb.uni-luebeck.de  Thu Aug 17 03:25:34 2006
From: krewink at inb.uni-luebeck.de (Albert Krewinkel)
Date: Thu, 17 Aug 2006 09:25:34 +0200
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44E34238.2010508@maubp.freeserve.co.uk>
References: <320fb6e00608160700t33ed0f4ds7311124d70781a64@mail.gmail.com>
	<20060816144436.GG12386@pc09.inb.uni-luebeck.de>
	<44E34238.2010508@maubp.freeserve.co.uk>
Message-ID: <20060817072534.GH12386@pc09.inb.uni-luebeck.de>

Peter wrote:
> Oh - you meant just adding EMBL feature iteration.  I was thinking 
> about the larger task of full EMBL file reading.

I started working on that, but I'm not very far yet.

> Doing just the features is very easy, here you go:
> 
> http://bugzilla.open-bio.org/show_bug.cgi?id=2059#c2

Wow, that was quick. And it works almost perfectly. One exception:
In _parse_embl_or_genbank_feature(), when parsing the location, it
should say something like

<code>
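from string import digits
# Keep reading continuation lines until the location ends in ')' or a digit.
# ')' + digits is one string of allowed final characters; the tuple
# (')', digits) would only ever match ')' for a single character.
while feature_location[-1] not in ')' + digits:
    line = iterator.next()
    feature_location += line[FEATURE_QUALIFIER_INDENT:].strip()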
</code>

This way, features may have multiline join(...) positions.


> Any more feedback is very welcome.  Are you using the iterators 
> directly, or via the helper function File2SequenceIterator?

I'm using iterators directly, out of old habit.  But most likely I
will finally get addicted to your nice helper function.

> Are you using just the sequence iterators, or the dictionary and list 
> versions too?

I don't use those yet.

Albert

-- 
Albert Krewinkel <krewink at inb.uni-luebeck.de>
University of Luebeck, Institute for Neuro- and Bioinformatics

From mcolosimo at mitre.org  Thu Aug 17 08:08:24 2006
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Thu, 17 Aug 2006 08:08:24 -0400
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <320fb6e00608120125t53f11b6et2bde5154e5bc19ec@mail.gmail.com>
References: <320fb6e00608120125t53f11b6et2bde5154e5bc19ec@mail.gmail.com>
Message-ID: <FECC9CE1-422B-4506-AEDE-D5391854E7B0@mitre.org>

Peter,

Nice quick work on that. For Clustal, I think it should NOT be an  
Iterator, but there should be SequenceDict or SequenceList for it.  
There are other alignment filetypes out there that could use a  
SequenceIterator (those that are not interlaced).  From looking over  
your code, it seems like it would be easy to add a check in  
File2SequenceDict/List to check for Clustal types and do something  
"special".

Marc


On Aug 12, 2006, at 4:25 AM, Peter wrote:

> I'm having a few issues with my email setup which is why I haven't
> replied recently.
>
> A week ago I filed bug 2059 for this discussion, and attached some  
> code:
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2059
>
> I'm interested in your feedback - from the framework down to if you
> don't like the class names for example.
>
> Peter


From biopython-dev at maubp.freeserve.co.uk  Thu Aug 17 09:25:07 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Thu, 17 Aug 2006 14:25:07 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <FECC9CE1-422B-4506-AEDE-D5391854E7B0@mitre.org>
References: <320fb6e00608120125t53f11b6et2bde5154e5bc19ec@mail.gmail.com>
	<FECC9CE1-422B-4506-AEDE-D5391854E7B0@mitre.org>
Message-ID: <44E46E33.3090001@maubp.freeserve.co.uk>

Marc Colosimo wrote:
> Peter,
> 
> Nice quick work on that. For Clustal, I think it should NOT be an  
> Iterator, but there should be SequenceDict or SequenceList for it.  
> There are other alignment filetypes out there that could use a  
> SequenceIterator (those that are not interlaced).  From looking over  
> your code, it seems like it would be easy to add a check in  
> File2SequenceDict/List to check for Clustal types and do something  
> "special".

Yes, I was wondering about that too.

For interlaced file formats (such as clustalw, NEXUS multiple alignment 
format) we have to load the whole file into memory anyway - so using a 
SequenceIterator was a bit odd.

What I was trying to do was use a SequenceIterator as the lowest common 
denominator - the ClustalIterator shows that this can be done for 
interlaced files, and seems to work.

It's trivial to "upgrade" the ClustalIterator to a SequenceDict or 
SequenceList if that's what is needed.

The way I wrote the ClustalIterator it actually reads the whole file and 
stores a list of IDs and a dictionary mapping the ID to the sequence 
string.  It creates SeqRecord objects only on request.  This should use 
less memory than a full list of every SeqRecord (but I have not measured 
this).
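
Roughly, the idea is something like the following sketch.  This is not the
code attached to the bug: the class name LazyClustalIterator and the Record
stand-in for SeqRecord are made up here to keep the example self-contained,
and the parsing is deliberately simplistic.

```python
import io
from collections import namedtuple

# Stand-in for Bio.SeqRecord.SeqRecord, to keep the sketch self-contained.
Record = namedtuple("Record", ["id", "seq"])


class LazyClustalIterator:
    """Hypothetical sketch: parse everything up front, build records lazily."""

    def __init__(self, handle):
        self.ids = []    # identifiers in file order
        self.seqs = {}   # identifier -> concatenated sequence string
        for line in handle:
            # Skip blank lines, consensus lines (which start with
            # whitespace) and the CLUSTAL header line.
            if (not line.strip() or line[0].isspace()
                    or line.startswith("CLUSTAL")):
                continue
            fields = line.split()
            name, chunk = fields[0], fields[1]
            if name not in self.seqs:
                self.ids.append(name)
                self.seqs[name] = ""
            self.seqs[name] += chunk

    def __iter__(self):
        # Record objects are only created when they are asked for.
        for name in self.ids:
            yield Record(name, self.seqs[name])


text = """CLUSTAL W (1.83) multiple sequence alignment

seq1      ACGT-ACGT
seq2      ACGTTACGT
          **** ****

seq1      GGCC
seq2      GGCA
          ***
"""
for record in LazyClustalIterator(io.StringIO(text)):
    print(record.id, record.seq)
```

Only the ID list and the ID-to-string dictionary are held for the whole
file; each record object exists just for as long as the caller keeps it.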

Note that I would also want to add an easy way to turn any 
SequenceIterator, SequenceList or SequenceDict into a multiple alignment 
object.

Out of interest, what are the largest alignments you deal with?

I was planning to add a Stockholm parser (where the sequences themselves 
are non-interleaved).  The PFAM database alignments use this, and are 
the largest alignments I am aware of.

However, the format supports per sequence annotation information and 
this information can be rather spread out.  Looking at a real example 
from PFAM, there were blocks of such data both before and after the 
sequences.  The format suggests that such annotation might also be found 
next to each sequence.

i.e. An annotation free Stockholm iterator would be easy, but including 
the meta data would in general require loading the whole file.

http://www.cgb.ki.se/cgb/groups/sonnhammer/Stockholm.html
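
To illustrate the "annotation free" case, here is a minimal sketch.  It
assumes one line per sequence (non-interleaved), and the function name
stockholm_sequences is made up for this example; all the '#' annotation
lines are simply skipped.

```python
import io


def stockholm_sequences(handle):
    """Yield (identifier, sequence) pairs, ignoring all '#' lines.

    Assumes the sequences are not interleaved, i.e. one line each.
    """
    for line in handle:
        line = line.strip()
        if line == "//":
            break            # end of the alignment
        if not line or line.startswith("#"):
            continue         # header, #=GF, #=GS, #=GR and #=GC lines
        name, seq = line.split(None, 1)
        yield name, seq


text = """# STOCKHOLM 1.0
#=GS seqA/1-8 AC P12345
seqA/1-8    ACDE.FGH
seqB/1-8    ACDEYFGH
#=GC SS_cons HHHH.HHH
//
"""
for name, seq in stockholm_sequences(io.StringIO(text)):
    print(name, seq)
```

Because nothing is remembered between lines, this version can stream a
large file; attaching the #=GS or #=GR data to each sequence is what
would force the whole file to be read first.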

It looks like a subclassed version could be written to handle the PFAM 
annotations nicely.

Peter

From mcolosimo at mitre.org  Thu Aug 17 08:24:24 2006
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Thu, 17 Aug 2006 08:24:24 -0400
Subject: [Biopython-dev] Fwd: contributing comparative genomics tools
In-Reply-To: <20060816124407.GF12386@pc09.inb.uni-luebeck.de>
References: <20060816124407.GF12386@pc09.inb.uni-luebeck.de>
Message-ID: <9A739306-6B91-4E43-87F8-EC464784B4B2@mitre.org>

On Aug 16, 2006, at 8:44 AM, Albert Krewinkel wrote:
> Hello,
>
> I read Peter's SeqIO/__init__.py replacement and if I may say so: I
> love it.  Thanks a lot for this!  Still, there are some things I'd
> like to talk about.
>
> The _parse_genbank_features function could also be used to parse embl
> or ddjb features, therefore I think it should be named differently.
>
>
> Since there is a lot of clean up effort right now: How about moving
> the SeqRecord and SeqFeature objects into the Bio.Seq module?  They
> are closely related and separate modules only clutter the namespace.
>
The top namespace is sort of a mess of things.
> To me, this seems to be a general problem. It's very difficult to find
> a tool to use for a certain problem if one doesn't already know what
> to look for.  I'd pretty much favour to create modules like
> Bio.structure to group modules like Bio.PDB and Bio.NMR etc.

I second this.
> This is
> a very big change, and therefore I'd like to follow Marc's suggestion
> of splitting off a branch.  In general, I pretty much agree with what
> Marc said in his <rant />.
>
> I cannot estimate how much work it would be to maintain two separate
> biopython distributions, so please forgive me if I re-suggest
> something completely idiotic here.  I just don't believe there is much
> that could be lost that way.
I've done this for my internal work, but I never went back to see how  
to check out the other branch (I had no need). CVS is sometimes a  
bear to work with. SVN is supposed to handle branches much better, but  
I can't access SVN repositories that are not through HTTPS (SSL).  
Stupid corporate proxy is currently not set up to handle external  
WebDAV.

This might be a pain for a little while until the next full version  
is released, but I think the benefits of doing this now far outweigh  
the short term pain (of course I'm not an admin who has to build the  
releases).

Marc

From biopython-dev at maubp.freeserve.co.uk  Thu Aug 17 11:13:40 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Thu, 17 Aug 2006 16:13:40 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <20060817072534.GH12386@pc09.inb.uni-luebeck.de>
References: <320fb6e00608160700t33ed0f4ds7311124d70781a64@mail.gmail.com>	<20060816144436.GG12386@pc09.inb.uni-luebeck.de>	<44E34238.2010508@maubp.freeserve.co.uk>
	<20060817072534.GH12386@pc09.inb.uni-luebeck.de>
Message-ID: <44E487A4.8040106@maubp.freeserve.co.uk>

Albert Krewinkel wrote:
> Peter wrote:
> 
>>Oh - you meant just adding EMBL feature iteration.  I was thinking 
>>about the larger task of full EMBL file reading.
> 
> I started working on that, but I'm not very far yet.

Are you starting from Bio.GenBank or from scratch?  I would point out 
that the code in Bio.GenBank was inserted into what was once a Martel 
based parser, and designed to be a transparent change for the end user.

What I would like to do is recycle that code into a new far simpler 
SeqIO GenBank parser which would only return SeqRecords.  In particular 
I would get rid of all the scanner/consumer model with all its function 
callbacks.

At this point I would try and handle both GenBank and EMBL files together.

I expect this to be faster, and easier to understand.  It would be a lot 
less flexible for the "power user", but then so is all the new SeqIO 
code I have been writing.

>>Doing just the features is very easy, here you go:
>>
>>http://bugzilla.open-bio.org/show_bug.cgi?id=2059#c2
> 
> Wow, that was quick.

Well, I did have something along these lines planned in advance - that's 
why my parse function was outside the GenbankCdsFeatureIterator class.

> And it works almost perfectly. One exception:
> In _parse_embl_or_genbank_feature(), when parsing the location, it
> should say something like
> 
> <code>
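> from string import digits
> # ')' + digits is one string of allowed final characters; the tuple
> # (')', digits) would only ever match ')' for a single character.
> while feature_location[-1] not in ')' + digits:
>     line = iterator.next()
>     feature_location += line[FEATURE_QUALIFIER_INDENT:].strip()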
> </code>
> 
> This way, features may have multiline join(...) positions.

Good point, something I was aware of and coped with in Bio.GenBank but 
hadn't done in the CDS iterator.  Thanks for pointing this out.

This affects both GenBank and EMBL files by the way.  My code is very 
similar but I included an assert to check the indent, and I only check 
for a trailing comma.  This works on all the files I have tried.
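
A rough sketch of that variant follows.  It is not the code attached to
the bug: the function name read_location and the indent value of 21 are
assumptions made for illustration, and it assumes GenBank-style lines
(EMBL lines carry an "FT" prefix that would need to be allowed for).

```python
FEATURE_QUALIFIER_INDENT = 21  # assumed width of the feature-key column


def read_location(first_part, line_iter):
    """Append continuation lines while the location ends with a comma."""
    location = first_part.strip()
    while location.endswith(","):
        line = next(line_iter)
        # Continuation lines should be blank up to the qualifier column.
        assert line[:FEATURE_QUALIFIER_INDENT].strip() == "", line
        location += line[FEATURE_QUALIFIER_INDENT:].strip()
    return location


lines = iter([
    " " * 21 + "300..400,",
    " " * 21 + "500..600)",
    " " * 21 + '/gene="abc"',
])
print(read_location("join(1..100,", lines))
```

The loop stops as soon as a line does not end in a comma, so the first
qualifier line is left in the iterator for the caller to handle.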

Peter

From krewink at inb.uni-luebeck.de  Thu Aug 17 13:41:06 2006
From: krewink at inb.uni-luebeck.de (Albert Krewinkel)
Date: Thu, 17 Aug 2006 19:41:06 +0200
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44E487A4.8040106@maubp.freeserve.co.uk>
References: <320fb6e00608160700t33ed0f4ds7311124d70781a64@mail.gmail.com>
	<20060816144436.GG12386@pc09.inb.uni-luebeck.de>
	<44E34238.2010508@maubp.freeserve.co.uk>
	<20060817072534.GH12386@pc09.inb.uni-luebeck.de>
	<44E487A4.8040106@maubp.freeserve.co.uk>
Message-ID: <20060817174106.GI12386@pc09.inb.uni-luebeck.de>

Peter wrote:
> > Peter wrote:
> >>Oh - you meant just adding EMBL feature iteration.  I was thinking 
> >>about the larger task of full EMBL file reading.
> >
> Albert wrote:
> >I started working on that, but I'm not very far yet.
> 
> Are you starting from Bio.GenBank or from scratch?  I would point out 
> that the code in Bio.GenBank was inserted into what was once a Martel 
> based parser, and designed to be a transparent change for the end user.
>
> What I would like to do is recycle that code into a new far simpler 
> SeqIO GenBank parser which would only return SeqRecords.  In particular 
> I would get rid of all the scanner/consumer model with all its function 
> callbacks.
> 
> At this point I would try and handle both GenBank and EMBL files together.

I didn't do much more than to play with current code and add some
methods to parse EMBL specific things.  The results can be found here:

http://www.inb.uni-luebeck.de/~krewink/embl.py

It's ugly, and doesn't provide much functionality, but could be a
starting point.

Albert

-- 
Albert Krewinkel <krewink at inb.uni-luebeck.de>
University of Luebeck, Institute for Neuro- and Bioinformatics

From biopython-dev at maubp.freeserve.co.uk  Thu Aug 17 16:09:20 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Thu, 17 Aug 2006 21:09:20 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44E46E33.3090001@maubp.freeserve.co.uk>
References: <320fb6e00608120125t53f11b6et2bde5154e5bc19ec@mail.gmail.com>	<FECC9CE1-422B-4506-AEDE-D5391854E7B0@mitre.org>
	<44E46E33.3090001@maubp.freeserve.co.uk>
Message-ID: <44E4CCF0.7090607@maubp.freeserve.co.uk>

Marc Colosimo wrote:
>> Nice quick work on that. For Clustal, I think it should NOT be an  
>> Iterator, but there should be SequenceDict or SequenceList for it.  
>> There are other alignment filetypes out there that could use a  
>> SequenceIterator (those that are not interlaced).  From looking over  
>> your code, it seems like it would be easy to add a check in  
>> File2SequenceDict/List to check for Clustal types and do something  
>> "special".

Peter (BioPython Dev) wrote:
> Yes, I was wondering about that too.
> 
> For interlaced file formats (such as clustalw, NEXUS multiple alignment 
> format) we have to load the whole file into memory anyway - so using a 
> SequenceIterator was a bit odd.
> 
> What I was trying to do was use a SequenceIterator as the lowest common 
> denominator - the ClustalIterator shows that this can be done for 
> interlaced files, and seems to work.

There are two and a half examples done this way now...

> I was planning to add a Stockholm parser (where the sequences themselves 
> are non-interleaved).  The PFAM database alignments use this, and are 
> the largest alignments I am aware of.
> 
> ...
> 
> It looks like a subclassed version could be written to handle the PFAM 
> annotations nicely.

http://bugzilla.open-bio.org/show_bug.cgi?id=2059#c3

Changes to the clustal parser, and addition of a parser for Stockholm
alignments, and a subclassed version to handle the PFAM style
annotations strings.

I have included basic handling of the sequence specific meta-data [I
need to have a look at real PFAM data to sort out the database cross
references still], but currently ignore the whole file level information
(#=GF lines) and the per column information (#=GC lines).

Maybe reading sequences out of multiple alignment files should be done
as a special case of loading multiple alignments?  Is this what you
meant by "something special" Marc?

Peter


From biopython-dev at maubp.freeserve.co.uk  Mon Aug 21 15:26:06 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Mon, 21 Aug 2006 20:26:06 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44E4D2B4.3000600@maubp.freeserve.co.uk>
References: <44E4D2B4.3000600@maubp.freeserve.co.uk>
Message-ID: <44EA08CE.5070802@maubp.freeserve.co.uk>

You probably noticed I sent out a "Dealing with sequence files"
questionnaire on the main discussion list:

http://lists.open-bio.org/pipermail/biopython/2006-August/003171.html

I've had four replies to date (off the list), and with the previous list
discussion and counting myself that makes eight views.  Not a very big
sample I know.

> Question One
> ============
> Is reading sequence files an important function to you, and if so which
> file formats in particular (e.g. Fasta, GenBank, ...)

Fasta was very popular, with GenBank also scoring highly.  Michiel and I
both use clustalw.  Apart from EMBL (next question) there wasn't any
other popular file format given.

I'm tempted to ask again regarding multiple alignment formats.

> Question Two
> ============
> Are there any sequence formats you would like to be able to read using 
> BioPython that are not currently supported (e.g. EMBL, ...)

It may have been a leading question, but several respondents would like
to be able to read in EMBL format.

Other requests included:

XML based 454 sequence files
UniGene sequence cluster format

Leighton mentioned:

PTT (Protein table files)
GFF (General Feature Format)

And I wanted to be able to read Stockholm alignments.

> Question Three - Reading Fasta Files
> ====================================
> Which of the following do you currently use (and why)?:
> 
> (a) Bio.Fasta with the RecordParser (giving FastaRecord objects)
> (b) Bio.Fasta with the FeatureParser (giving SeqRecord objects)
> (c) Bio.Fasta with your own parser (Could you tell us more?)
> (d) Bio.SeqIO.FASTA.FastaReader (giving SeqRecord objects)
> (e) Bio.FormatIO (giving SeqRecord objects)
> (f) Other (Could you tell us more?)

A range covering (a), (b) and (d) plus DIY parsers.

> Question Four - Reading GenBank Files
> =====================================
> Which of the following do you currently use (and why)?:
> 
> (a) Bio.GenBank with the FeatureParser (giving SeqRecord objects)
> (b) Bio.GenBank with the RecordParser (giving GenBank Record objects)
> (c) Other (Could you tell us more?)

Both (a) and (b) with no clear majority.

> Question Five - Record Access...
> ================================
> When loading a file with multiple sequences do you use:
> 
> (a) An iterator interface (e.g. Bio.Fasta.Iterator) to give you the
> records one by one in the order from the file.
> 
> (b) A dictionary interface (e.g. Bio.Fasta.Dictionary) to give you
> random access to the records using their identifier.
> 
> (c) A list giving random access by index number (e.g. load the records
> using an iterator but save them in a list).

Most of you use iterators, storing records in memory as required.
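
For reference, the three access styles differ only in what is kept in
memory.  A small sketch, using a made-up records() generator and a Record
stand-in for SeqRecord so that it runs on its own:

```python
from collections import namedtuple

Record = namedtuple("Record", ["id", "seq"])  # stand-in for SeqRecord


def records():
    """Hypothetical iterator; a real one would read from a file."""
    yield Record("alpha", "ACGT")
    yield Record("beta", "GGCC")


# (a) iterate once, in file order, without keeping the records
for rec in records():
    print(rec.id, len(rec.seq))

# (b) dictionary keyed on identifier, for random access by name
by_id = dict((rec.id, rec) for rec in records())

# (c) list, for random access by index number
as_list = list(records())
```

Since (b) and (c) can both be built from (a) in one line, an iterator is
the natural lowest common denominator for a parser to provide.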

> Question Six - Martel, Scanners and Consumers
> =============================================
> Some of BioPython's existing parsers (e.g. those using Martel) use an
> event/callback model, where the scanner component generates parsing
> events which are dealt with by the consumer component.
> 
> Do any of you use this system to modify existing parser behaviour, or
> use it as part of your own personal file parser?
> 
> (a) I don't know, or don't care.  I just use the parsers provided.
> (b) I use this framework to modify a parser in order to do ... (please
> provide details).

Almost everyone said (a) which I think is a good thing if we are going
to try and re-work BioPython's sequence reading.

> And finally...
> ==============
> Do you have any general questions or comments?

Several people have commented that BioPerl has a nice unified system
with good documentation.

-----------------------------------------------------------------------

Where next...

I think my code could be included "in parallel" with the existing
parsers, without the upheaval of creating a new branch etc.

I have started thinking about writing files too.

Part of this will involve trying to be as consistent as possible about
mapping annotations from different file formats to the SeqRecord
object's annotations dictionary.

http://bugzilla.open-bio.org/show_bug.cgi?id=2059

My code currently on bug 2059 is written as a single python file,
provisionally Bio/SeqIO/__init__.py but this is clearly not a good idea
long term as more file formats are supported.

If we use Bio.SeqIO then the prior existence of Bio/SeqIO/FASTA.py is a
slight annoyance in that I can't use Bio/SeqIO/Fasta.py because the
filenames would clash on Windows.  Some people are using the code in
Bio.SeqIO.FASTA, but I suppose the file could contain both the old code,
and my new fasta interface.

Alternatively, the new system could be put in Bio.SequenceIO or are
there any other suggestions?

Peter


From krewink at inb.uni-luebeck.de  Tue Aug 22 09:43:56 2006
From: krewink at inb.uni-luebeck.de (Albert Krewinkel)
Date: Tue, 22 Aug 2006 15:43:56 +0200
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44EA08CE.5070802@maubp.freeserve.co.uk>
References: <44E4D2B4.3000600@maubp.freeserve.co.uk>
	<44EA08CE.5070802@maubp.freeserve.co.uk>
Message-ID: <20060822134356.GO12386@pc09.inb.uni-luebeck.de>

I'd like to seriously start working on an EMBL parser, but there are
some things I'm concerned about: It surely would be a good thing to
build the SequenceIO and Parser stuff upon some base classes and agree
on using certain tools which are (or will be) used in the whole
project.  Since I never received any education/training on software
development, I would appreciate it if someone could tell me what the
code's structure should look like -- the current Scanner/Consumer code isn't
any help.

> Several people have commented that BioPerl has a nice unified system
> with good documentation.

How about using reStructuredText in docstrings?  IMO it leaves the
.__doc__ string very readable but improves epydoc generated
descriptions.

Albert

-- 
Albert Krewinkel <krewink at inb.uni-luebeck.de>
University of Luebeck, Institute for Neuro- and Bioinformatics

From bsouthey at gmail.com  Tue Aug 22 09:52:10 2006
From: bsouthey at gmail.com (Bruce Southey)
Date: Tue, 22 Aug 2006 08:52:10 -0500
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44EA08CE.5070802@maubp.freeserve.co.uk>
References: <44E4D2B4.3000600@maubp.freeserve.co.uk>
	<44EA08CE.5070802@maubp.freeserve.co.uk>
Message-ID: <bbcd77d00608220652s5b9a5cc3i7f7999b9d27ed18b@mail.gmail.com>

Hi,
To date I have only used SwissProt code from BioPython so I am really
only lurking. But here are some responses.

Bruce

On 8/21/06, Peter (BioPython Dev) <biopython-dev at maubp.freeserve.co.uk> wrote:
> You probably noticed I sent out a "Dealing with sequence files"
> questionnaire on the main discussion list:
>
> http://lists.open-bio.org/pipermail/biopython/2006-August/003171.html
>
> I've had four replies to date (off the list), and with the previous list
> discussion and counting myself that makes eight views.  Not a very big
> sample I know.
>
> > Question One
> > ============
> > Is reading sequence files an important function to you, and if so which
> > file formats in particular (e.g. Fasta, GenBank, ...)
>
> Fasta was very popular, with GenBank also scoring highly.  Michiel and I
> both use clustalw.  Apart from EMBL (next question) there wasn't any
> other popular file format given.

Well, this is not a surprise because most apps around also use FASTA
as their default format, although most do not accept a comment line. Thus,
FASTA is the most important format.


>
> I'm tempted to ask again regarding multiple alignment formats.
>
> > Question Two
> > ============
> > Are there any sequence formats you would like to be able to read using
> > BioPython that are not currently supported (e.g. EMBL, ...)
>
> It may have been a leading question, but several respondents would like
> to be able to read in EMBL format.
>
> Other requests included:
>
> XML based 454 sequence files
> UniGene sequence cluster format
>
> Leighton mentioned:
>
> PTT (Protein table files)
> GFF (General Feature Format)
>
> And I wanted to be able to read Stockholm alignments.

I would like to be able to use a custom format that is based on the
FASTA format, that is, one allowing non-standard characters to be
included as part of the sequence, which I later remove. Perhaps this is
just a matter of being able to do subclassing.


>
> > Question Three - Reading Fasta Files
> > ====================================
> > Which of the following do you currently use (and why)?:
> >
> > (a) Bio.Fasta with the RecordParser (giving FastaRecord objects)
> > (b) Bio.Fasta with the FeatureParser (giving SeqRecord objects)
> > (c) Bio.Fasta with your own parser (Could you tell us more?)
> > (d) Bio.SeqIO.FASTA.FastaReader (giving SeqRecord objects)
> > (e) Bio.FormatIO (giving SeqRecord objects)
> > (f) Other (Could you tell us more?)
>
> A range covering (a), (b) and (d) plus DIY parsers.
>
> > Question Four - Reading GenBank Files
> > =====================================
> > Which of the following do you currently use (and why)?:
> >
> > (a) Bio.GenBank with the FeatureParser (giving SeqRecord objects)
> > (b) Bio.GenBank with the RecordParser (giving GenBank Record objects)
> > (c) Other (Could you tell us more?)
>
> Both (a) and (b) with no clear majority.
>
> > Question Five - Record Access...
> > ================================
> > When loading a file with multiple sequences do you use:
> >
> > (a) An iterator interface (e.g. Bio.Fasta.Iterator) to give you the
> > records one by one in the order from the file.
> >
> > (b) A dictionary interface (e.g. Bio.Fasta.Dictionary) to give you
> > random access to the records using their identifier.
> >
> > (c) A list giving random access by index number (e.g. load the records
> > using an iterator but save them in a list).
>
> Most of you use iterators, storing records in memory as required.

a

>
> > Question Six - Martel, Scanners and Consumers
> > =============================================
> > Some of BioPython's existing parsers (e.g. those using Martel) use an
> > event/callback model, where the scanner component generates parsing
> > events which are dealt with by the consumer component.
> >
> > Do any of you use this system to modify existing parser behaviour, or
> > use it as part of your own personal file parser?
> >
> > (a) I don't know, or don't care.  I just use the parsers provided.
> > (b) I use this framework to modify a parser in order to do ... (please
> > provide details).
>
> Almost everyone said (a) which I think is a good thing if we are going
> to try and re-work BioPython's sequence reading.

a

>
> > And finally...
> > ==============
> > Do you have any general questions or comments?
>
> Several people have commented that BioPerl has a nice unified system
> with good documentation.
>
> -----------------------------------------------------------------------
>
> Where next...
>
> I think my code could be included "in parallel" with the existing
> parsers, without the upheaval of creating a new branch etc.
>
> I have started thinking about writing files too.
>
> Part of this will involve trying to be as consistent as possible about
> mapping annotations from different file formats to the SeqRecord
> object's annotations dictionary.
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2059
>
> My code currently on bug 2059 is written as a single python file,
> provisionally Bio/SeqIO/__init__.py but this is clearly not a good idea
> long term as more file formats are supported.
>
> If we use Bio.SeqIO then the prior existence of Bio/SeqIO/FASTA.py is a
> slight annoyance in that I can't use Bio/SeqIO/Fasta.py because the
> filenames would clash on Windows.  Some people are using the code in
> Bio.SeqIO.FASTA, but I suppose the file could contain both the old code,
> and my new fasta interface.
>
> Alternatively, the new system could be put in Bio.SequenceIO or are
> there any other suggestions?
>
> Peter

From biopython-dev at maubp.freeserve.co.uk  Tue Aug 22 12:46:39 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Tue, 22 Aug 2006 17:46:39 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <20060822134356.GO12386@pc09.inb.uni-luebeck.de>
References: <44E4D2B4.3000600@maubp.freeserve.co.uk>	<44EA08CE.5070802@maubp.freeserve.co.uk>
	<20060822134356.GO12386@pc09.inb.uni-luebeck.de>
Message-ID: <44EB34EF.1050901@maubp.freeserve.co.uk>

Albert Krewinkel wrote:
> I'd like to seriously start working on an EMBL parser, but ...

As the de-facto GenBank module owner, I'm also interested in getting EMBL 
and GenBank working nicely together.  The big question BEFORE you/we 
start any serious coding on EMBL support is how it fits into BioPython.

Do we (a) add a new module like the existing Bio.Fasta and Bio.GenBank, 
or (b) use a new framework like the one I've put forward here:

http://bugzilla.open-bio.org/show_bug.cgi?id=2059

> ... there are some things I'm concerned about: It surely would be a
> good thing to build the SequenceIO and Parser stuff upon some base
> classes and agree on using certain tools which are (or will be) used
> in the whole project.

What I was proposing was that all the new sequence file format parsers
should be implemented as subclasses of my SequenceIterator class - 
either directly (e.g. FastaIterator) or indirectly (e.g. the 
PfamStockholmIterator) and they should return SeqRecord objects.
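
To make the proposal concrete, here is a cut-down sketch of the pattern.
It is not the code on bug 2059: the next_record() method name and the
Record stand-in for SeqRecord are assumptions made for this example.

```python
import io
from collections import namedtuple

# Stand-in for SeqRecord, to keep the sketch self-contained.
Record = namedtuple("Record", ["id", "description", "seq"])


class SequenceIterator:
    """Hypothetical base class: subclasses implement next_record()."""

    def __init__(self, handle):
        self.handle = handle

    def next_record(self):
        """Return the next record, or None at the end of the file."""
        raise NotImplementedError

    def __iter__(self):
        # Keep yielding records until the subclass reports the end.
        while True:
            record = self.next_record()
            if record is None:
                return
            yield record


class FastaIterator(SequenceIterator):
    def __init__(self, handle):
        SequenceIterator.__init__(self, handle)
        self._pending = None  # title line read ahead by the last call

    def next_record(self):
        title = self._pending or self.handle.readline()
        # Skip anything before the first '>' line.
        while title and not title.startswith(">"):
            title = self.handle.readline()
        if not title:
            return None
        self._pending = None
        chunks = []
        for line in iter(self.handle.readline, ""):
            if line.startswith(">"):
                self._pending = line  # start of the next record
                break
            chunks.append(line.strip())
        title = title[1:].strip()
        ident = title.split(None, 1)[0] if title else ""
        return Record(ident, title, "".join(chunks))


handle = io.StringIO(">a first\nACGT\nAC\n>b\nGG\n")
for record in FastaIterator(handle):
    print(record.id, record.seq)
```

A new format then only needs its own next_record(); the iteration
protocol and any list or dictionary helpers are shared by every parser.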

I am open to discussion about how interlaced file formats should be
handled, but I think I have shown how the SequenceIterator based scheme 
could work using the Clustalw and Stockholm formats as examples.

> Since I never received any education/training on software 
> development, I would appreciate it if someone could tell me what the code's
> structure should look like -- the current Scanner/Consumer code
> isn't any help.

I agree that the current Scanner/Consumer code won't be much help.

The fact that the current Bio.GenBank parser uses the Scanner/Consumer 
model reflects the fact that I rewrote (in Python) what had been done 
using Martel/Mindy.  This is one excuse for the state of that code of 
mine ;)

I don't think the flexibility of the Scanner/Consumer model is needed
just to turn Embl/GenBank data into SeqRecord objects (and only into 
SeqRecord objects).

> How about using reStructuredText in docstrings?  IMO it leaves the 
> .__doc__ string very readable but improves epydoc generated 
> descriptions.

I'm not familiar with how any existing API documentation is extracted
from the source code...

Peter


From biopython-dev at maubp.freeserve.co.uk  Thu Aug 17 04:28:19 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Thu, 17 Aug 2006 09:28:19 +0100
Subject: [Biopython-dev] Tweaking the SeqRecord class
In-Reply-To: <44E3C8C0.5070200@c2b2.columbia.edu>
References: <320fb6e00608161520j5fb6b4fejd7aa8cc839989423@mail.gmail.com>
	<44E3C8C0.5070200@c2b2.columbia.edu>
Message-ID: <44E428A3.70103@maubp.freeserve.co.uk>

Michiel de Hoon wrote:
> Peter wrote:
> 
>>First of all, is there any comment on my suggestion to add __str__ and
>>__repr__ methods to the SeqRecord object, bug 2057:
>>
>>http://bugzilla.open-bio.org/show_bug.cgi?id=2057
> 
> Here's a thought:
> What if Seq were to inherit from str, and SeqRecord from Seq?
> Then, you get these for free.

This wouldn't automatically show any id/name/description/annotation in the
__str__ and __repr__ methods, so I would want to override these methods
anyway.

We would still need to create and provide a Seq object on request as the
record.seq attribute/property (for backwards compatibility).

I also think we should change the Seq object's __str__, __repr__
functionality (while preserving the .tostring() method for some
backwards compatibility).  It might have been Marc who raised this point
- shouldn't __str__ turn the data into a string, and __repr__ return a
string that you could type into python to recreate the object?  This
would mean we would have to stop truncating the sequence data at 60
characters.
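
In other words, something along these lines.  This is only a sketch on a
made-up minimal class; the real Seq constructor and alphabet handling are
not reproduced here.

```python
class Seq:
    """Hypothetical minimal Seq, only to illustrate __str__ and __repr__."""

    def __init__(self, data, alphabet="generic"):
        self.data = data
        self.alphabet = alphabet

    def __str__(self):
        # The full sequence, never truncated.
        return self.data

    def __repr__(self):
        # An expression that recreates the object when evaluated.
        return "Seq(%r, %r)" % (self.data, self.alphabet)

    def tostring(self):
        # Kept for backwards compatibility.
        return self.data


s = Seq("ACGT" * 20)
assert str(s) == s.tostring()
clone = eval(repr(s))
assert str(clone) == str(s) and clone.alphabet == s.alphabet
```

The cost is that repr() of a whole chromosome becomes very long, which is
presumably why the output was truncated in the first place.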

>>Next, I'd like to check in some basic __doc__ strings for the
>>SeqRecord class, e.g. something like this:
> 
> Sounds good to me. Pretty amazing, actually, that SeqRecord doesn't have 
> documentation.

OK, basic __doc__ strings checked in: Bio/SeqRecord.py revision 1.9

The Seq object also needs some love and attention in this area.

>>If you recall, for the fastest parsers turning the data into SeqRecord
>>and Seq objects imposed a fairly large overhead (compared to just
>>using strings):
>>
>>http://lists.open-bio.org/pipermail/biopython-dev/2006-July/002407.html
> 
> I wonder if this is still true if a Seq object and a SeqRecord object 
> inherit from string. From the code, I don't see where the overhead comes 
> from.

I was wondering what the overhead was too.

It could just be creating objects (Seq and SeqRecord) plus their
associated strings/list/dictionary (compared with just two strings, the
fasta title string and the sequence).

My property change should reduce this a little bit as for Fasta files
there is no need to create the dbxrefs list or the annotations
dictionary (unless or until the user records some information here after
creating the SeqRecord object).
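
As a rough illustration of the idea (a sketch only, using a hypothetical
LazyRecord class rather than the actual SeqRecord code), properties can
delay creating the list and dictionary until something first asks for
them:

```python
class LazyRecord(object):
    """Hypothetical record class; not the real Bio.SeqRecord.SeqRecord."""

    def __init__(self, seq, id="<unknown id>"):
        self.seq = seq
        self.id = id
        # Nothing is allocated for these until they are first accessed.
        self._dbxrefs = None
        self._annotations = None

    def _get_dbxrefs(self):
        if self._dbxrefs is None:
            self._dbxrefs = []
        return self._dbxrefs

    def _set_dbxrefs(self, value):
        self._dbxrefs = value

    dbxrefs = property(_get_dbxrefs, _set_dbxrefs,
                       doc="Database cross-references (a list).")

    def _get_annotations(self):
        if self._annotations is None:
            self._annotations = {}
        return self._annotations

    def _set_annotations(self, value):
        self._annotations = value

    annotations = property(_get_annotations, _set_annotations,
                           doc="Free-form annotations (a dictionary).")


record = LazyRecord("ACGT", id="example")
record.annotations["source"] = "made-up example"  # dictionary created here
```

From the caller's point of view record.dbxrefs and record.annotations
still look like ordinary attributes.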

Making SeqRecord subclass Seq might help here if only one object needs
to be created.

>>The backwards compatibility if statement is a bit
>>ugly - can we just assume Python 2.2 or later?
> 
> Biopython currently requires Python 2.3 or later.

Great - I'll ditch that nasty big if and just re-write the class to use
properties.

Revised version attached - should be functionally identical.

Peter

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: SeqRecord.py
Url: http://lists.open-bio.org/pipermail/biopython-dev/attachments/20060817/79dd5fca/attachment-0001.pl 

From biopython-dev at maubp.freeserve.co.uk  Wed Aug 30 06:22:52 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Wed, 30 Aug 2006 11:22:52 +0100
Subject: [Biopython-dev] Recent bug reports not making it to the mailing list
Message-ID: <44F566FC.30407@maubp.freeserve.co.uk>

Once upon a time (early 2006?) whenever a bug was filed on the BugZilla, 
a copy was sent to the mailing list.

Not any more... and in the last month or so there have been several bugs 
filed which have been ignored.

Does anyone get automatic email notification?

Who should I ask to be included in any default email notification?

Thanks

Peter


From lpritc at scri.sari.ac.uk  Tue Aug  1 10:42:37 2006
From: lpritc at scri.sari.ac.uk (Leighton Pritchard)
Date: Tue, 01 Aug 2006 11:42:37 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <BB4CD5A6-B1C1-4F1B-B66C-B03763419D6D@mitre.org>
References: <44CA162F.1040604@maubp.freeserve.co.uk>
	<44CA27B1.30107@maubp.freeserve.co.uk>
	<1154339988.1490.81.camel@lplinuxdev>
	<44CDF3AA.2020308@maubp.freeserve.co.uk>
	<1154355358.1490.116.camel@lplinuxdev>
	<44CE1E3C.2050502@maubp.freeserve.co.uk>
	<BB4CD5A6-B1C1-4F1B-B66C-B03763419D6D@mitre.org>
Message-ID: <1154428959.4871.11.camel@lplinuxdev>

On Mon, 2006-07-31 at 12:08 -0400, Marc Colosimo wrote: 
> On Jul 31, 2006, at 11:14 AM, Peter (BioPython Dev) wrote:
> >>> The SeqUtils/quick_FASTA_reader is interesting in that it loads the
> >>> entire file into memory in one go, and then parses it.  On the other
> >>> hand it's not perfect: I would use "\n>" as the split marker rather
> >>> than ">" which could appear in the description of a sequence.
> >>
> >> I agree (not that it's bitten me, yet), but I'd be inclined to go
> >> with "%s>" % os.linesep as the split marker, just in case.
> >
> > Good point.  I wonder how many people even know this function exists?
> >
> 
> The only problem with this is when someone sends you a file not
> created on your system. [...]
> This has mostly simplified down to two - Unix and Windows - unless the
> person uses a Mac GUI app, some of which use \r (CR) instead of \n
> (LF), where Windows uses \r\n (CRLF). I think the standard python
> distro comes with crlf.py and lfcr.py that can convert the line endings.

Also a good point.  I had a play about with regular expression
splitting/substitution and the SeqUtils.quick_FASTA_reader method to see
if I could capture this variability in line-endings:

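import re

def method_quick_FASTA_reader3(filename):
    # Read the whole file into memory in one go
    txt = open(filename).read()
    entries = []
    # '>' only counts as a record marker at the start of a line
    split_marker = re.compile(r'^>', re.M)
    for entry in re.split(split_marker, txt)[1:]:
        # The title ends at the first CR or LF, whichever comes first
        name, seq = re.split(r'[\r\n]', entry, maxsplit=1)
        # Strip all whitespace, including any leftover line endings
        seq = re.sub(r'\s', '', seq).upper()
        entries.append((name, seq))
    return "SeqUtils/quick_FASTA_reader (import re)", len(entries)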

Using regular expressions in this way seems to slow things down to about
the same speed as the SeqIO parser, with the disadvantage of still
having to process the entries into SeqRecord objects (if that's what you
want to do with them).  quick_FASTA_reader is a bit of a misnomer in
this case, I guess ;)

4.15s SeqIO.FASTA.FastaReader (for record in iterator)
3.95s SeqIO.FASTA.FastaReader (iterator.next)
4.13s SeqIO.FASTA.FastaReader (iterator[i])
1.89s SeqUtils/quick_FASTA_reader
1.03s pyfastaseqlexer/next_record
0.52s pyfastaseqlexer/quick_FASTA_reader
4.44s SeqUtils/quick_FASTA_reader (import re)

Results are typical for the 72000 record set, and this doesn't look to
be a promising route.

L.

-- 
Dr Leighton Pritchard AMRSC
D131, Plant-Pathogen Interactions, Scottish Crop Research Institute
Invergowrie, Dundee, Scotland, DD2 5DA, UK
T: +44 (0)1382 562731 x2405 F: +44 (0)1382 568578
E: lpritc at scri.sari.ac.uk   W: http://bioinf.scri.sari.ac.uk/lp
GPG/PGP: FEFC205C E58BA41B  http://www.keyserver.net             
(If the signature does not verify, please remove the SCRI disclaimer)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

DISCLAIMER:

This email is from the Scottish Crop Research Institute, but the views 
expressed by the sender are not necessarily the views of SCRI and its 
subsidiaries.  This email and any files transmitted with it are confidential 
to the intended recipient at the e-mail address to which it has been 
addressed.  It may not be disclosed or used by any other than that addressee.
If you are not the intended recipient you are requested to preserve this 
confidentiality and you must not use, disclose, copy, print or rely on this 
e-mail in any way. Please notify postmaster at scri.sari.ac.uk quoting the 
name of the sender and delete the email from your system.

Although SCRI has taken reasonable precautions to ensure no viruses are 
present in this email, neither the Institute nor the sender accepts any 
responsibility for any viruses, and it is your responsibility to scan the email 
and the attachments (if any).


From pfefferp at staff.uni-marburg.de  Tue Aug  1 12:02:25 2006
From: pfefferp at staff.uni-marburg.de (Patrick Pfeffer)
Date: Tue, 01 Aug 2006 14:02:25 +0200
Subject: [Biopython-dev] GAs in Biopython
Message-ID: <44CF42D1.8090209@staff.uni-marburg.de>

Hi there,

isn't there any documentation available for using the genetic algorithm 
available in the package?

Thanks for any kind of help,
Patrick

-- 
*************************************
Dipl. Bioinf. Patrick Pfeffer
Arbeitskreis Prof. Dr. G. Klebe
Institut für Pharmazeutische Chemie
Raum A116a
Fachbereich Pharmazie
Philipps-Universität Marburg
Marbacher Weg 6
35032 Marburg  Germany
Fon.: 06421/2825908
http://www.agklebe.de
e-mail: pfefferp at staff.uni-marburg.de
************************************* 


From biopython-dev at maubp.freeserve.co.uk  Tue Aug  1 20:53:08 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Tue, 01 Aug 2006 21:53:08 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <1154428959.4871.11.camel@lplinuxdev>
References: <44CA162F.1040604@maubp.freeserve.co.uk>	<44CA27B1.30107@maubp.freeserve.co.uk>	<1154339988.1490.81.camel@lplinuxdev>	<44CDF3AA.2020308@maubp.freeserve.co.uk>	<1154355358.1490.116.camel@lplinuxdev>	<44CE1E3C.2050502@maubp.freeserve.co.uk>	<BB4CD5A6-B1C1-4F1B-B66C-B03763419D6D@mitre.org>
	<1154428959.4871.11.camel@lplinuxdev>
Message-ID: <44CFBF34.7080106@maubp.freeserve.co.uk>

Peter wrote:
>>> The SeqUtils/quick_FASTA_reader is interesting in that it loads the
>>> entire file into memory in one go, and then parses it.  On the other
>>> hand its not perfect: I would use "\n>" as the split marker  
>>> rather than ">" which could appear in the description of a sequence.

Leighton Pritchard replied:
>> I agree (not that it's bitten me, yet), but I'd be inclined to go  
>> with "%s>" % os.linesep as the split marker, just in case.

Peter then wrote:
> Good point.

I take that back - I was right the first time ;)

You are right to worry about the line sep changing from platform to
platform, but you shouldn't use "%s>" % os.linesep

On Windows os.linesep is "\r\n" (CR LF).  However, when reading windows
style files on windows in text mode, the newlines appear in python as
just \n (as do newlines from unix files read on windows).

When writing text files on windows, again \n gets turned into CR LF on
the disk.

Just using "\n>" would work on any platform reading a FASTA file with
the expected newlines.  As a bonus it would work on Windows when reading
unix style newlines.

To get any platform to read newlines from any other platform what I
suggest is using "\n>" as the split string, but open the file in
universal text mode - this seems to work fine on Python 2.3, but I'm not
sure when universal newline reading was introduced.

For example, I created a simple file in each of the three newline
conventions (using TextPad on Windows).

>>> import os, sys
>>> sys.platform
'win32'
>>> os.linesep
'\r\n'

>>> open("c:/temp/windows.txt","r").read()
'line\nline\n'
>>> open("c:/temp/mac.txt","r").read()
'line\rline\r'
>>> open("c:/temp/unix.txt","r").read()
'line\nline\n'

(Notice that using "\n>" wouldn't work when reading a Mac style file on
Windows)

>>> open("c:/temp/windows.txt","rU").read()
'line\nline\n'
>>> open("c:/temp/mac.txt","rU").read()
'line\nline\n'
>>> open("c:/temp/unix.txt","rU").read()
'line\nline\n'
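
Putting the two pieces together, here is a minimal sketch of a
whole-file reader that splits on "\n>" after universal newline
translation.  It is written for Python 3, where text mode with
newline=None does the translation that "rU" does above; the function name
quick_fasta_reader is made up for this example and is not the SeqUtils
code.

```python
def quick_fasta_reader(filename):
    """Return a list of (title, sequence) tuples from a FASTA file."""
    # newline=None asks for universal newlines: "\r\n" and "\r" are
    # both translated to "\n" as the file is read.
    with open(filename, "r", newline=None) as handle:
        txt = handle.read()
    entries = []
    # The extra leading "\n" lets the first record match "\n>" as well.
    for entry in ("\n" + txt).split("\n>")[1:]:
        title, _, seq = entry.partition("\n")
        # Drop every kind of whitespace from the sequence lines.
        entries.append((title, "".join(seq.split()).upper()))
    return entries
```

Because the split marker includes the newline, a ">" inside a description
line is left alone.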


Peter


From lpritc at scri.sari.ac.uk  Wed Aug  2 09:25:27 2006
From: lpritc at scri.sari.ac.uk (Leighton Pritchard)
Date: Wed, 02 Aug 2006 10:25:27 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44CDDD10.4020904@maubp.freeserve.co.uk>
References: <44CA162F.1040604@maubp.freeserve.co.uk>
	<44CD5AF2.10708@c2b2.columbia.edu>
	<44CDDD10.4020904@maubp.freeserve.co.uk>
Message-ID: <1154510728.4871.66.camel@lplinuxdev>

On Mon, 2006-07-31 at 11:36 +0100, Peter (BioPython Dev) wrote:
> Question One
> ============
> Is reading sequence files an important function to you, and if so which 
> file formats in particular (e.g. Fasta, GenBank, ...)

Yes.  FASTA (sequence), GenBank, GFF, PTT, EMBL, ClustalW

> If you have had to write your own code to read a "common" file format 
> which BioPython doesn't support, please get in touch.

EMBL and PTT (though PTT is pretty trivial, and my EMBL parser is not
pretty).

> Question Two - Reading Fasta Files
> ==================================
> Which of the following do you currently use (and why)?:
> 
> (a) Bio.Fasta with the RecordParser (giving FastaRecord objects with a 
> title, and the sequence as a string)
> (b) Bio.Fasta with the FeatureParser (giving SeqRecord objects)
> (c) Bio.Fasta with your own parser (Could you tell us more?)
> (d) Bio.SeqIO.FASTA.FastaReader (giving SeqRecord objects)
> (e) Bio.FormatIO (giving SeqRecord objects)
> (f) Other (Could you tell us more?)

Mostly (f), a homegrown Pyrex/Flex parser.

> Question Three - index_file based dictionaries
> ==============================================
> Do you use any of the following:
> (a) Bio.Fasta.Dictionary
> (b) Bio.Genbank.Dictionary
> (c) Any other "Martel/Mindy" based dictionary which first requires 
> creation of an index using the index_file function

No, but I do create dictionaries on-the-fly from (name, sequence)
tuples, where necessary.
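
Something along these lines, as a sketch (to_dict is a made-up helper
name, and refusing duplicate names is an assumption about what is wanted
rather than a description of any existing Biopython behaviour):

```python
def to_dict(pairs):
    """Build {name: sequence} from (name, sequence) tuples."""
    records = {}
    for name, seq in pairs:
        # Silently overwriting a repeated name would hide a problem.
        if name in records:
            raise ValueError("Duplicate record name: %r" % name)
        records[name] = seq
    return records


pairs = [("seq1", "ACGT"), ("seq2", "GGCC")]
lookup = to_dict(pairs)
```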

> Question Four - Record Access...
> ================================
> When loading a file with multiple sequences do you use:
> 
> (a) An iterator interface(e.g. Bio.Fasta.Iterator) to give you the 
> records one by one in the order from the file.
> 
> (b) A dictionary interface (e.g. Bio.Fasta.Dictionary) to give you 
> random access to the records using their identifier.
> 
> (c) A list giving random access by index number (e.g. load the records 
> using an iterator but saving them in a list).
> 
> Do you have any additional comments on this?  For example, flexibility 
> versus memory requirements.

Depending on what I need to do, I might use different approaches.  If
I'm filtering sequences on, say, sequence composition, I'll use an
iterator.  If I need to cross-reference sequences from the file to some
other set of sequences by ID, I'll use a dictionary.  In each case, I
will generally either use a for loop or build a dictionary on-the-fly.
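
For the filtering case, a sketch of what the iterator approach looks like
with (name, seq) tuples; gc_fraction and filter_by_gc are hypothetical
helper names, and GC content is just one example of a composition test:

```python
def gc_fraction(seq):
    """Fraction of G and C letters in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / float(len(seq))


def filter_by_gc(records, minimum):
    """Yield the (name, seq) tuples whose GC fraction is >= minimum."""
    # Only one record is held at a time, so memory use stays flat
    # however large the input is.
    for name, seq in records:
        if gc_fraction(seq) >= minimum:
            yield name, seq


records = iter([("a", "AAAA"), ("b", "GGCC"), ("c", "ACGT")])
kept = list(filter_by_gc(records, 0.5))
```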

> Question Four - Fasta files: FastaRecord or SeqRecord
> =====================================================
> If you use Fasta files, do you want to get records returned as FastaRecords 
> or as SeqRecords?  If SeqRecords, do you use your own title2ids mapping?

I'd rather have SeqRecords.  SeqRecords are particularly useful for
annotations and attaching data to the sequence which, later, gets
written out in some format other than FASTA sequence format.  For
operations where no further information is associated with the sequence,
they offer equivalent functionality to FastaRecords.  

Currently I default to (name, seq) tuples, and only create SeqRecords
when necessary, but this is only out of convenience for the parser I
use.

> Question Five - GenBank files: GenbankRecord or SeqRecord
> ==========================================================
> If you use GenBank files, do you use:
> (a) Bio.Genbank.FeatureParser which returns SeqRecord objects
> (b) Bio.Genbank.RecordParser which returns Bio.GenBank.Record objects
> 
> Do you care much either way?  For me the only significant difference is 
> that feature locations are held as objects in the SeqRecord, and as the 
> raw string in the Record.

I use Bio.GenBank.FeatureParser because I prefer the storage of features
(which are what I'm generally interested in) as SeqFeature objects.

> Question Six - Martel, Scanners and Consumers
> ==============================================
> Some of BioPython's existing parsers (e.g. those using Martel) use an 
> event/callback model, where the scanner component generates parsing 
> events which are dealt with by the consumer component.
> 
> Do any of you use this system to modify existing parser behaviour, or 
> use it as part of your own personal file parser?
> 
> (a) I don't know, or don't care.  I just use the parsers provided.
> (b) I use this framework to modify a parser in order to do ... (please 
> provide details).

I care mostly about performance on large files and the convenient
representation of sequences and features.  Where parsers have not been
available (or quickly locatable) for file formats, such as EMBL, I have
sometimes used the Bio.ParserSupport classes and the Scanner/Consumer
pattern.  

L.

-- 
Dr Leighton Pritchard AMRSC
D131, Plant-Pathogen Interactions, Scottish Crop Research Institute
Invergowrie, Dundee, Scotland, DD2 5DA, UK
T: +44 (0)1382 562731 x2405 F: +44 (0)1382 568578
E: lpritc at scri.sari.ac.uk   W: http://bioinf.scri.sari.ac.uk/lp
GPG/PGP: FEFC205C E58BA41B  http://www.keyserver.net             
(If the signature does not verify, please remove the SCRI disclaimer)


From biopython-dev at maubp.freeserve.co.uk  Wed Aug  2 10:45:34 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Wed, 02 Aug 2006 11:45:34 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <1154510728.4871.66.camel@lplinuxdev>
References: <44CA162F.1040604@maubp.freeserve.co.uk>	<44CD5AF2.10708@c2b2.columbia.edu>	<44CDDD10.4020904@maubp.freeserve.co.uk>
	<1154510728.4871.66.camel@lplinuxdev>
Message-ID: <44D0824E.30808@maubp.freeserve.co.uk>

Leighton Pritchard wrote:
> On Mon, 2006-07-31 at 11:36 +0100, Peter (BioPython Dev) wrote:
> 
>>Question One
>>============
>>Is reading sequence files an important function to you, and if so which 
>>file formats in particular (e.g. Fasta, GenBank, ...)
> 
> Yes.  FASTA (sequence), GenBank, GFF, PTT, EMBL, ClustalW
> 

PTT (Protein table files)

http://www.ibt.unam.mx/biocomputo/hom_make_db.html
(Anyone got an NCBI link for the file format?)

GFF (General Feature Format)

http://genome.ucsc.edu/goldenPath/help/customTrack.html#GFF
http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml

GFF and PTT aren't exactly what I would call sequence files, in that
they don't contain any sequence data.  But thinking about it, maybe
those files could be turned into SeqRecords or SeqFeatures (with empty
sequences).

> 
>>If you have had to write your own code to read a "common" file format 
>>which BioPython doesn't support, please get in touch.
> 
> EMBL and PTT (though PTT is pretty trivial, and my EMBL parser is not
> pretty).
> 

It looks like there is enough overlap between EMBL and GenBank to
make sharing code between them a good idea.  Certainly EMBL was a file
format I was thinking we should try to support.

Reading your other comments, it looks like you wouldn't miss FastaRecord
or GenBank records if they were phased out.

Personally, I'm suggesting that any Sequence IO framework should
standardise on returning SeqRecord objects.

Does anyone know if SeqIO stood for Sequence or Sequential Input/Output?

I think we should have a generic "Sequence Iterator" object to do this
which takes a file handle, subclassed for each file format - giving a
"Fasta Iterator", a "Genbank Iterator", a "Clustal Iterator" etc.

I'm inclined not to give any choice of parser object (e.g.
Bio.Fasta.SequenceParser vs Bio.Fasta.RecordParser), and always return a
SeqRecord.

The individual readers should offer some level of control, for example
the title2ids function for Fasta files lets the user decide how the
title line should be broken up into id/name/description.  Also for some
file formats the user should be able to specify the alphabet.
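
To make the proposal concrete, here is a rough sketch (Python 3) of a
base iterator with a Fasta subclass.  Everything in it is hypothetical:
the class names, the default_title2ids helper, and the namedtuple Record
that stands in for SeqRecord so that the example runs without Biopython.

```python
import io
from collections import namedtuple

# Hypothetical minimal record; real code would build a SeqRecord.
Record = namedtuple("Record", "id name description seq")


def default_title2ids(title):
    """Use the first word of the title as both id and name."""
    words = title.split(None, 1)
    first = words[0] if words else ""
    return first, first, title


class SequenceIterator(object):
    """Base class: wraps a handle, subclasses implement parse()."""

    def __init__(self, handle):
        self.handle = handle

    def __iter__(self):
        return self.parse()

    def parse(self):
        raise NotImplementedError("Subclasses must implement parse()")


class FastaIterator(SequenceIterator):
    def __init__(self, handle, title2ids=default_title2ids):
        SequenceIterator.__init__(self, handle)
        self.title2ids = title2ids

    def parse(self):
        title, chunks = None, []
        for line in self.handle:
            line = line.rstrip("\r\n")
            if line.startswith(">"):
                if title is not None:
                    yield self._build(title, chunks)
                title, chunks = line[1:], []
            elif title is not None:
                chunks.append(line.strip())
        if title is not None:
            yield self._build(title, chunks)  # last record in the file

    def _build(self, title, chunks):
        id_, name, description = self.title2ids(title)
        return Record(id_, name, description, "".join(chunks))


handle = io.StringIO(">gi|1 first\nACGT\nAC\n>gi|2 second\nGG\n")
records = list(FastaIterator(handle))
```

A "Genbank Iterator" or "Clustal Iterator" would be further subclasses
overriding parse(), and a user who wants the title line untouched can pass
their own title2ids function.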

Peter


From hoffman at ebi.ac.uk  Wed Aug  2 11:00:46 2006
From: hoffman at ebi.ac.uk (Michael Hoffman)
Date: Wed, 2 Aug 2006 12:00:46 +0100 (BST)
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44CDDD10.4020904@maubp.freeserve.co.uk>
References: <44CA162F.1040604@maubp.freeserve.co.uk>
	<44CD5AF2.10708@c2b2.columbia.edu>
	<44CDDD10.4020904@maubp.freeserve.co.uk>
Message-ID: <Pine.LNX.4.64.0608021154490.27323@qnzvnan.rov.np.hx>

> Question One
> ============
> Is reading sequence files an important function to you, and if so which
> file formats in particular (e.g. Fasta, GenBank, ...)

Yes. FASTA.

> Question Two - Reading Fasta Files
> ==================================
> Which of the following do you currently use (and why)?:
>
> (f) Other (Could you tell us more?)

I have written my own short iterator so that my code is portable
without requiring Biopython to be installed.

> Question Three - index_file based dictionaries
> ==============================================
> Do you use any of the following:

No.

> Question Four - Record Access...
> ================================
> When loading a file with multiple sequences do you use:
>
> (a) An iterator interface(e.g. Bio.Fasta.Iterator) to give you the
> records one by one in the order from the file.

Yes.

> Question Four - Fasta files: FastaRecord or SeqRecord
> =====================================================
> If you use Fasta files, do you want to get records returned as FastaRecords
> or as SeqRecords?  If SeqRecords, do you use your own title2ids mapping?

SeqRecords. I hate it when an interface tries to parse the definition
line for me. Perhaps a set of standard definition line parsers should
be provided so that one can choose, but usually I would rather have
plain text and parse it myself.

> Question Six - Martel, Scanners and Consumers
> ==============================================
> Some of BioPython's existing parsers (e.g. those using Martel) use an
> event/callback model, where the scanner component generates parsing
> events which are dealt with by the consumer component.
>
> Do any of you use this system to modify existing parser behaviour, or
> use it as part of your own personal file parser?

No.
-- 
Michael Hoffman <hoffman at ebi.ac.uk>
European Bioinformatics Institute


From lpritc at scri.sari.ac.uk  Wed Aug  2 11:23:27 2006
From: lpritc at scri.sari.ac.uk (Leighton Pritchard)
Date: Wed, 02 Aug 2006 12:23:27 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44D0824E.30808@maubp.freeserve.co.uk>
References: <44CA162F.1040604@maubp.freeserve.co.uk>
	<44CD5AF2.10708@c2b2.columbia.edu>	<44CDDD10.4020904@maubp.freeserve.co.uk>
	<1154510728.4871.66.camel@lplinuxdev>
	<44D0824E.30808@maubp.freeserve.co.uk>
Message-ID: <1154517808.4871.93.camel@lplinuxdev>

On Wed, 2006-08-02 at 11:45 +0100, Peter (BioPython Dev) wrote:
> GFF (General Feature Format)
> 
> http://genome.ucsc.edu/goldenPath/help/customTrack.html#GFF
> http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml
> 
> GFF and PTT aren't exactly what I would call sequence files, in that
> they don't contain any sequence data.  

Fair point, but GFF3 (see below) can optionally carry sequence data, and
I use them for exactly what you say here:

> those files could be turned into SeqRecords or SeqFeatures (with empty
> sequences).

I was thinking that GFF3 would be more useful than GFF:

http://song.sourceforge.net/gff3.shtml

NCBI have already gone over to this on bacterial genomes, at least (e.g.
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans/NC_005213.gff),
and it's a much richer format than the original specification.  Andrew
Dalke has already written a GFF3 parser/writer, which is available at

http://www.dalkescientific.com/PyGFF3-0.5.tar.gz

I've not used this in anger, yet...
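
For anyone who hasn't looked at the format: a GFF3 feature line is nine
tab-separated columns, the last being a list of key=value attributes.  The
sketch below (Python 3) splits a single line by hand.  It is not the
PyGFF3 API, parse_gff3_line is a made-up name, the example line is
invented data, and it ignores directives and any embedded FASTA section.

```python
from urllib.parse import unquote

GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Return a dict for one feature line, or None for comments/blanks."""
    line = line.rstrip("\r\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError("Expected 9 columns, got %d" % len(fields))
    feature = dict(zip(GFF3_COLUMNS, fields))
    feature["start"] = int(feature["start"])
    feature["end"] = int(feature["end"])
    attributes = {}
    if feature["attributes"] != ".":
        for pair in feature["attributes"].split(";"):
            if pair:
                key, _, value = pair.partition("=")
                # A value may be a comma-separated list; special
                # characters are URL-escaped in the file.
                attributes[unquote(key)] = [unquote(v)
                                            for v in value.split(",")]
    feature["attributes"] = attributes
    return feature


feature = parse_gff3_line(
    "chr1\texample\tgene\t100\t200\t.\t+\t.\tID=gene0;Name=demo%20gene\n")
```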

> It looks like there is enough overlap between EMBL and GenBank to
> make sharing code between them a good idea.  Certainly EMBL was a file
> format I was thinking we should try to support.

In a scanner/consumer pattern it's easy enough.  I've not looked under
the hood of the new GenBank parser yet, to see what you've done.  Most
of my contact with EMBL format is with headerless feature tables and
Artemis, which aren't directly similar to GenBank entries. 

> Reading your other comments, it looks like you wouldn't miss FastaRecord
> or GenBank records if they were phased out.

Not personally, but others may have strong opinions and breakable code,
yet.

> Personally, I'm suggesting that any Sequence IO framework should
> standardise on returning SeqRecord objects.
> 
> I think we should have a generic "Sequence Iterator" object to do this
> which takes a file handle, subclassed for each file format - giving a
> "Fasta Iterator", a "Genbank Iterator", a "Clustal Iterator" etc.

> I'm inclined not to give any choice of parser object (e.g.
> Bio.Fasta.SequenceParser vs Bio.Fasta.RecordParser), and always return a
> SeqRecord.

It may be a side-issue, but should a Clustal parser return an Alignment
object or iterate over SeqRecord objects?  And for that matter, what
about other MSA files in FASTA format?  I think we ought to allow parsers
to return an Alignment where the user requests it, which is a
functionality I'm not currently aware of in the FASTA sequence parsers.

> The individual readers should offer some level of control, for example
> the title2ids function for Fasta files lets the user decide how the
> title line should be broken up into id/name/description.  Also for some
> file formats the user should be able to specify the alphabet.

Could the alphabet be optionally specified by the user on parsing, and
maybe return a warning or error if there are non-compliant symbols in
the file, as a quick validator for bad sequences, or reminder to the
occasionally forgetful that, for example, they're not working with
nucleotide sequences, today <cough, embarrassed glance at floor> ;)

L.

-- 
Dr Leighton Pritchard AMRSC
D131, Plant-Pathogen Interactions, Scottish Crop Research Institute
Invergowrie, Dundee, Scotland, DD2 5DA, UK
T: +44 (0)1382 562731 x2405 F: +44 (0)1382 568578
E: lpritc at scri.sari.ac.uk   W: http://bioinf.scri.sari.ac.uk/lp
GPG/PGP: FEFC205C E58BA41B  http://www.keyserver.net             
(If the signature does not verify, please remove the SCRI disclaimer)


From biopython-dev at maubp.freeserve.co.uk  Wed Aug  2 12:56:23 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Wed, 02 Aug 2006 13:56:23 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <1154517808.4871.93.camel@lplinuxdev>
References: <44CA162F.1040604@maubp.freeserve.co.uk>	<44CD5AF2.10708@c2b2.columbia.edu>	<44CDDD10.4020904@maubp.freeserve.co.uk>	<1154510728.4871.66.camel@lplinuxdev>	<44D0824E.30808@maubp.freeserve.co.uk>
	<1154517808.4871.93.camel@lplinuxdev>
Message-ID: <44D0A0F7.1020402@maubp.freeserve.co.uk>

Leighton Pritchard wrote:
> Fair point, but GFF3 (see below) can optionally carry sequence data,
> and I use them for exactly what you say here:
> 
>> maybe those files could be turned into SeqRecords or SeqFeatures 
>> (with empty sequences).
> 
> I was thinking that GFF3 would be more useful than GFF:
> 
> http://song.sourceforge.net/gff3.shtml
> 

Thanks for the links... interesting that GFF3 allows embedding Fasta
sequences.

>> Reading your other comments, it looks like you wouldn't miss 
>> FastaRecord or GenBank records if they were phased out.
> 
> Not personally, but others may have strong opinions and breakable 
> code, yet.

There is no need to remove the current modules, just mark them as
deprecated.  Of course, if there is some strong support for these
objects then we might not want to be so harsh...

> It may be a side-issue, but should a Clustal parser return an 
> Alignment object or iterate over SeqRecord objects?  And for that 
> matter, what about other MSA files in FASTA format?  I think we ought
> to allow parsers to return an Alignment where the user requests it, 
> which is a functionality I'm not currently aware of in the FASTA 
> sequence parsers.

In my opinion we should offer both.  I would go for loading
clustal/fasta alignments as sequence iterators (as part of the new SeqIO
code) and make it very easy to turn ANY sequence iterator returning
SeqRecords into an alignment.

The current alignment object stores its sequences as SeqRecords
internally but doesn't (yet) allow simple addition of SeqRecords - that
would have to be fixed but it looks easy enough.  Accepting a
SequenceIterator for __init__ would also be nice.
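
Roughly what I have in mind, as a sketch with a hypothetical
SimpleAlignment class (not the existing Alignment object) and plain
(name, seq) tuples standing in for SeqRecords:

```python
class SimpleAlignment(object):
    """Hypothetical alignment container, for illustration only."""

    def __init__(self, records=None):
        self._records = []
        # Accept any iterable of records, including an iterator.
        if records is not None:
            for record in records:
                self.add_record(record)

    def add_record(self, record):
        name, seq = record
        # Every row of an alignment must have the same length.
        if self._records and len(seq) != self.get_alignment_length():
            raise ValueError("Sequence %r has the wrong length" % name)
        self._records.append((name, seq))

    def get_alignment_length(self):
        if not self._records:
            return 0
        return len(self._records[0][1])

    def get_column(self, index):
        return "".join([seq[index] for name, seq in self._records])


alignment = SimpleAlignment(iter([("a", "AC-T"), ("b", "ACGT")]))
```

With that in place the parser only ever needs to produce an iterator, and
turning it into an alignment is one call.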

>> The individual readers should offer some level of control, for 
>> example the title2ids function for Fasta files lets the user decide
>> how the title line should be broken up into id/name/description. 
>> Also for some file formats the user should be able to specify the 
>> alphabet.
> 
> Could the alphabet be optionally specified by the user on parsing, 
> and maybe return a warning or error if there are non-compliant 
> symbols in the file, as a quick validator for bad sequences, or 
> reminder to the occasionally forgetful that, for example, they're not
> working with nucleotide sequences, today <cough, embarrassed glance 
> at floor> ;)

For some file formats the parser should be able to deduce the alphabet,
but for others, like Fasta, it must be specified.  I like the idea of
optionally checking the alphabet - but it would impose a speed penalty.

Do you think this should be done by the SeqRecord object (on request)?
Each parser could simply ask the SeqRecord object to verify itself
before returning it.
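
The check itself is cheap to write.  A sketch, assuming the alphabet can
be reduced to a plain string of allowed letters (DNA_LETTERS,
find_invalid_letters and verify are made-up names, not existing Biopython
functions):

```python
DNA_LETTERS = "ACGT"  # assumed unambiguous DNA alphabet for this sketch


def find_invalid_letters(seq, letters=DNA_LETTERS):
    """Return the sorted, distinct symbols in seq not found in letters."""
    # Building a set is a single pass over the sequence.
    return sorted(set(seq.upper()) - set(letters.upper()))


def verify(seq, letters=DNA_LETTERS):
    """Raise ValueError if seq contains symbols outside letters."""
    bad = find_invalid_letters(seq, letters)
    if bad:
        raise ValueError("Symbols not in alphabet: %s" % "".join(bad))
    return True


invalid = find_invalid_letters("ACGTMKL")
```

Keeping it as a separate call means parsers only pay the extra pass when
the user asks for verification.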

Peter


From Leighton.Pritchard at scri.ac.uk  Wed Aug  2 09:00:20 2006
From: Leighton.Pritchard at scri.ac.uk (Leighton Pritchard)
Date: Wed, 2 Aug 2006 10:00:20 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44CFBF34.7080106@maubp.freeserve.co.uk>
References: <44CA162F.1040604@maubp.freeserve.co.uk>
	<44CA27B1.30107@maubp.freeserve.co.uk>
	<1154339988.1490.81.camel@lplinuxdev>
	<44CDF3AA.2020308@maubp.freeserve.co.uk>
	<1154355358.1490.116.camel@lplinuxdev>
	<44CE1E3C.2050502@maubp.freeserve.co.uk>
	<BB4CD5A6-B1C1-4F1B-B66C-B03763419D6D@mitre.org>
	<1154428959.4871.11.camel@lplinuxdev>
	<44CFBF34.7080106@maubp.freeserve.co.uk>
Message-ID: <1154509221.4871.40.camel@lplinuxdev>

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20060802/605b8b80/attachment.ksh>
-------------- next part --------------
An embedded message was scrubbed...
From: "Leighton Pritchard" <Leighton.Pritchard at scri.ac.uk>
Subject: Re: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
Date: Wed, 2 Aug 2006 10:00:20 +0100
Size: 4641
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20060802/605b8b80/attachment.eml>

From lpritc at scri.sari.ac.uk  Wed Aug  2 09:02:03 2006
From: lpritc at scri.sari.ac.uk (Leighton Pritchard)
Date: Wed, 02 Aug 2006 10:02:03 +0100
Subject: [Biopython-dev] [Fwd: Re:  Reading sequences: FormatIO, SeqIO, etc]
Message-ID: <1154509323.4871.42.camel@lplinuxdev>

(this time without the signature)

-- 
Dr Leighton Pritchard AMRSC
D131, Plant-Pathogen Interactions, Scottish Crop Research Institute
Invergowrie, Dundee, Scotland, DD2 5DA, UK
T: +44 (0)1382 562731 x2405 F: +44 (0)1382 568578
E: lpritc at scri.sari.ac.uk   W: http://bioinf.scri.sari.ac.uk/lp
GPG/PGP: FEFC205C E58BA41B  http://www.keyserver.net             
(If the signature does not verify, please remove the SCRI disclaimer)
-------------- next part --------------
An embedded message was scrubbed...
From: Leighton Pritchard <lpritc at scri.sari.ac.uk>
Subject: Re: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
Date: Wed, 02 Aug 2006 10:00:20 +0100
Size: 3943
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20060802/5c8fac79/attachment-0002.mht>

From mdehoon at c2b2.columbia.edu  Fri Aug  4 03:20:18 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Thu, 03 Aug 2006 23:20:18 -0400
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
Message-ID: <44D2BCF2.9010500@c2b2.columbia.edu>

> Question One
> ============
>
> Is reading sequence files an important
> function to you, and if so which file formats in particular (e.g.
> Fasta, GenBank, ...)
> 
I use Fasta, GenBank, and occasionally clustalw.
> 
> Question Two - Reading Fasta Files
> ==================================
> Which of the following do you currently use (and why)?:
> 
> (a) Bio.Fasta with the RecordParser (giving FastaRecord objects with
>     a title, and the sequence as a string)
> (b) Bio.Fasta with the FeatureParser (giving SeqRecord objects)
> (c) Bio.Fasta with your own parser (Could you tell us more?)
> (d) Bio.SeqIO.FASTA.FastaReader (giving SeqRecord objects)
> (e) Bio.FormatIO (giving SeqRecord objects)
> (f) Other (Could you tell us more?)
I use Bio.Fasta with the RecordParser, but just because it's easy to 
find in the documentation. As a user, I think Bio.Fasta requires too 
many steps to be typed in; I would prefer something more 
straightforward. For the output format, I don't care so much, but for 
the sake of consistency a SeqRecord may be preferable.

> 
> Question Three - index_file based dictionaries
> ==============================================
> Do you use any of the following:
> (a) Bio.Fasta.Dictionary
> (b) Bio.GenBank.Dictionary
> (c) Any other "Martel/Mindy" based dictionary which first requires
>     creation of an index using the index_file function
> 

No. I never really understood index files.

> 
> Question Four - Record Access...
> ================================ 
> When loading a file with multiple sequences do you use:
> 
> (a) An iterator interface(e.g. Bio.Fasta.Iterator) to give you the
> records one by one in the order from the file.
> 
> (b) A dictionary interface (e.g. Bio.Fasta.Dictionary) to give you
> random access to the records using their identifier.
> 
> (c) A list giving random access by index number (e.g. load the
> records using an iterator but saving them in a list).
I use (a). It's easy to create (b) or (c), if needed, if (a) is available.
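
To illustrate the point that (b) and (c) fall out of (a), here is a
minimal sketch. The iterate_fasta generator is a hypothetical stand-in
written for this example, not one of the Biopython parsers, and keying
the dictionary on the first word of the title is only an assumption:

```python
from io import StringIO


def iterate_fasta(handle):
    """Yield (title, sequence) tuples one record at a time, in file order."""
    title, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line.strip())
    if title is not None:
        yield title, "".join(chunks)


example = ">alpha first record\nACGT\nACGT\n>beta second record\nTTGA\n"

# (c) random access by index: collect the iterator into a list
records = list(iterate_fasta(StringIO(example)))

# (b) random access by identifier: key on the first word of the title
by_id = dict(
    (title.split()[0], seq) for title, seq in iterate_fasta(StringIO(example))
)
```

Going the other way (getting a memory-friendly iterator out of a
dictionary-only interface) is not possible without loading everything,
which is why the iterator is the natural building block.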
> 
> Question Four - Fasta files: FastaRecord or SeqRecord
> =====================================================
> If you use Fasta files, do you want to get records returned as
> FastaRecords or as SeqRecords?  If SeqRecords, do you use your own
> title2ids mapping?
> 
> For example,
> 
> >name text text text
> ACGTACACGT
> 
> As a FastaRecord this would have:
> 
> FastaRecord.title = "name text text text" (string)
> FastaRecord.sequence = "ACGTACACGT" (string)
> 
> As a SeqRecord (with the default title2ids mapping):
> 
> SeqRecord.id = (default string)
> SeqRecord.name = (default string)
> SeqRecord.description = "name text text text" (string)
> SeqRecord.seq = Seq("ACGTACACGT", alphabet)
I use the FastaRecord, but again for no particular reason. I have not 
experienced an advantage of Seq objects over simple strings, so for me 
the fact that FastaRecord contains a simple string is more convenient. 
But it doesn't matter much.
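
For reference, a custom title2ids mapping of the kind the question
mentions could be as small as the sketch below. It assumes the mapping
is a function that takes the title (without the leading ">") and
returns an (id, name, description) tuple; using the first word as both
id and name is just one possible convention:

```python
def title2ids(title):
    """Split a FASTA title line into (id, name, description).

    Assumption for this sketch: the first whitespace-separated word is
    used as both the id and the name, and the full title is kept as the
    description.
    """
    words = title.split(None, 1)
    identifier = words[0] if words else ""
    return identifier, identifier, title


ids = title2ids("name text text text")
```

With the example record from the question, this would give "name" as
the id and name, and keep "name text text text" as the description.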

> Question Five - GenBank files: GenbankRecord or SeqRecord
> ==========================================================
> If you use GenBank files, do you use:
> (a) Bio.GenBank.FeatureParser which returns SeqRecord objects
> (b) Bio.GenBank.RecordParser which returns Bio.GenBank.Record objects
> 
I don't care so much, but I think that having two record types is 
confusing, so it would be better if we could decide on one. A SeqRecord
is more general than a Bio.GenBank.Record, so I have a slight preference 
for a SeqRecord.

> 
> Question Six - Martel, Scanners and Consumers
> ==============================================
> Some of BioPython's existing parsers (e.g. those using Martel) use an
> event/callback model, where the scanner component generates parsing
> events which are dealt with by the consumer component.
> 
> Do any of you use this system to modify existing parser behaviour, or
> use it as part of your own personal file parser?
> 
> (a) I don't know, or don't care.  I just use the parsers provided.
> (b) I use this framework to modify a parser in order to do ...
> (please provide details).
> 
(a). Often, I'm just at the Python prompt typing away. What I like about 
Python and Numerical Python is that the commands are often obvious and 
easy to remember. With the parser framework, on the other hand, I always 
need to look up in the documentation how to use them.
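
For anyone unfamiliar with the scanner/consumer model the question
refers to, a toy version might look like the following. All names here
(scan_fasta, TitleCollector and its methods) are hypothetical and much
simpler than Biopython's actual Martel-based classes; the sketch only
shows the division of labour between the two components:

```python
class TitleCollector(object):
    """Consumer: reacts to the parsing events raised by the scanner."""

    def __init__(self):
        self.titles = []

    def title(self, text):
        # Called once for every ">" line
        self.titles.append(text)

    def sequence(self, text):
        # This particular consumer ignores sequence lines
        pass


def scan_fasta(lines, consumer):
    """Scanner: walks the input and fires one event per non-blank line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            consumer.title(line[1:])
        else:
            consumer.sequence(line)


collector = TitleCollector()
scan_fasta([">alpha", "ACGT", "", ">beta", "TTGA"], collector)
```

Changing parser behaviour then means swapping in a different consumer,
which is flexible but does mean more to remember at the prompt.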

--Michiel


From dag at sonsorol.org  Fri Aug  4 10:38:52 2006
From: dag at sonsorol.org (Chris Dagdigian)
Date: Fri, 4 Aug 2006 06:38:52 -0400
Subject: [Biopython-dev] Fwd: contributing comparative genomics tools
References: <22DA57C5-461D-48BE-B524-47108330CD80@chem.ucla.edu>
Message-ID: <9AFBA2D3-B8DF-4337-A54A-019F6EAFFC38@sonsorol.org>


Begin forwarded message:

> From: Christopher Lee <leec at chem.ucla.edu>
> Date: August 3, 2006 9:11:42 PM EDT
> To: biopython-dev-owner at lists.open-bio.org
> Subject: Fwd: contributing comparative genomics tools
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hi,
> there appears to be an error in your code submission instructions  
> on the biopython.org/wiki, or in the configuration of the biopython- 
> dev list server.  The code submission instructions tell me to  
> submit my proposal by email to biopython-dev at biopython.org, but the  
> list server responds by saying that all mail will automatically be  
> rejected!  Please forward this proposal to the appropriate people  
> (presumably biopython-dev?), and let me know that you have done  
> so.  Otherwise I won't have any way of knowing whether anyone even  
> reads this email address...
>
> Yours with thanks,
>
> Chris Lee, Dept. of Chemistry & Biochemistry, UCLA
>
> Begin forwarded message:
>
>> You are not allowed to post to this mailing list, and your message  
>> has
>> been automatically rejected.  If you think that your messages are
>> being rejected in error, contact the mailing list owner at
>> biopython-dev-owner at lists.open-bio.org.
>>
>>
>> From: Christopher Lee <leec at chem.ucla.edu>
>> Date: August 3, 2006 3:55:52 PM PDT
>> To: biopython-dev at biopython.org
>> Cc: Namshin Kim <deepreds at gmail.com>
>> Subject: contributing comparative genomics tools
>>
>>
>> Hi Biopython developers,
>> I'd like to contribute some Python tools that my lab has been  
>> developing for large-scale comparative genomics database query.   
>> These tools make it easy to work with huge multigenome alignment  
>> databases (e.g. the UCSC Genome Browser multigenome alignments)  
>> using a new disk-based interval indexing algorithm that gives very  
>> high performance with minimal memory usage.  For example, whereas
>> queries of the UCSC 17-genome alignment typically take about 30 sec.
>> per query using MySQL, the same query takes about 200 microsec. with
>> these tools, making it possible to run huge numbers of queries for
>> genome-wide studies.
>>
>> Here's an example usage (click the URL or just look at the code  
>> below)
>> http://bioinfo.mbi.ucla.edu/pygr_0_5_0/seq-align.html#SECTION000125000000000000000
>>
>> We've tested this code very extensively in our own research, and  
>> it has had four open source releases so far.  At this point the  
>> code is in production use.  All the code is compatible back to  
>> Python version 2.2, but not 2.1 or before (we use generators).   
>> There is C code (accessed as Python classes) for the
>> high-performance interval database index.  For details of history see
>> the website
>> http://www.bioinformatics.ucla.edu/pygr
>>
>> There is also extensive tutorial and reference documentation:
>> http://bioinfo.mbi.ucla.edu/pygr_0_5_0/
>>
>> Let me know what questions you have, and what process we would  
>> need to follow to contribute this code.
>>
>> Yours with best wishes,
>>
>> Chris Lee, Dept. of Chemistry & Biochemistry, UCLA
>>
>>
>> ####### EXAMPLE USAGE
>> from pygr import cnestedlist
>> msa=cnestedlist.NLMSA('/usr/tmp/ucscDB/mafdb','r') # OPEN THE ALIGNMENT DB
>>
>> def printResults(prefix,msa,site,altID='NULL',cluster_id='NULL',seqNames=None):
>>     'get alignment of each genome to site, print %identity and %aligned'
>>     for src,dest,edge in msa[site].edges(mergeMost=True): # ALIGNMENT QUERY!
>>         print '%s\t%s\t%s\t%s\t%2.1f\t%2.1f\t%s\t%s' \
>>               %(altID,cluster_id,prefix,seqNames[dest],
>>                 100.*edge.pIdentity(),100.*edge.pAligned(),src[:2],dest[:2])
>>
>> def getAlt3Conservation(msa,gene,start1,start2,stop,**kwargs):
>>     'gene must be a slice of a sequence in our genome alignment msa'
>>     ss1=gene[start1-2:start1] # USE SPLICE SITE COORDINATES
>>     ss2=gene[start2-2:start2]
>>     ss3=gene[stop:stop+2]
>>     e1=ss1+ss2 # GET INTERVAL BETWEEN PAIR OF SPLICE SITES
>>     e2=gene[max(start1,start2):stop] # GET INTERVAL BETWEEN e1 AND stop
>>     zone=e1+ss3 # USE zone AS COVERING INTERVAL TO BUNDLE fastacmd REQUESTS
>>     cache=msa[zone].keys(mergeMost=True) # PYGR BUNDLES REQUESTS TO MINIMIZE TRAFFIC
>>     for prefix,site in [('ss1',ss1),('ss2',ss2),('ss3',ss3),('e1',e1),('e2',e2)]:
>>         printResults(prefix,msa,site,seqNames=~(msa.seqDict),**kwargs)
>>
>> # RUN A QUERY LIKE THIS...
>> # getAlt3Conservation(msa,some_gene,some_start,other_start,stop)
>>
>> ############ EXPLANATION & NOTES
>> David Haussler's group has constructed alignments of multiple  
>> genomes. These alignments are extremely useful and interesting,  
>> but so large that it is cumbersome to work with the dataset using  
>> conventional methods. For example, for the 8-genome alignment you  
>> have to work simultaneously with the individual genome datasets  
>> for human, chimp, mouse, rat, dog, chicken, fugu and zebrafish, as  
>> well as the huge alignment itself. Pygr makes this quite easy.  
>> Here we illustrate an example of mapping an alternative 3' exon,  
>> which has two alternative splice sites (start1 and start2) and a  
>> single terminal splice site (stop). We use the alignment database  
>> to map each of these splice sites onto all the aligned genomes,  
>> and to print the percent-identity and percent-aligned for each  
>> genome, as well as the two nucleotides constituting the splice site
>> itself. To examine the conservation of the two exonic regions
>> (between start1 and start2, and the adjacent region terminated by
>> stop), we print the same information for each genome's alignment to
>> these two regions as well. The code first opens the alignment  
>> database. The function (getAlt3Conservation) obtains sequence  
>> slice objects representing the various ``sites'' to be queried.  
>> The actual alignment database query is performed in printResults:
>>
>>     * The alignment database query is in the first line of  
>> printResults(). msa is the database; site is the interval query;  
>> and the edges method iterates over the results, returning a tuple
>> for each, consisting of a source sequence interval (i.e. an  
>> interval of site), a destination sequence interval (i.e. an  
>> interval in an aligned genome), and an edge object describing that  
>> alignment. We are taking advantage of Pygr's group-by operator  
>> mergeMost, which will cause multiple intervals in a given sequence  
>> to be merged into a single interval that constitutes their  
>> ``union''. Thus, for each aligned genome, the edges iterator will  
>> return a single aligned interval. The alignment edge object  
>> provides some useful conveniences, such as calculating the
>> percent-identity between src and dest automatically for you. pIdentity()
>> computes the fraction of identical residues; pAligned() computes the
>> fraction of aligned residues (allowing you to see if there are big  
>> gaps or insertions in the alignment of this interval). If we had  
>> wanted to inspect the detailed alignment letter by letter, we  
>> would just iterate over the letters attribute instead of the edges  
>> method. (See the NLMSASlice documentation for further information).
>>
>>     * src[:2] and dest[:2] print the first two nucleotides of the  
>> site in gene and in the aligned genome.
>>
>>     * it's worth noting that the actual sequence string  
>> comparisons are being done using a completely different database  
>> mechanism (formerly NCBI's fastacmd, now our own (much faster)  
>> pureseq text format), not the cnestedlist database. Basically,  
>> each genome is being queried as a separate BLAST formatted  
>> database, represented in Pygr by the BlastDB class. Pygr makes  
>> this complex set of multi-database operations more or less  
>> transparent to the user. For further information, see the BlastDB  
>> documentation.
>>
>>     * The other operations here are entirely vanilla: mainly  
>> slicing a gene sequence to obtain the specific sites that we want  
>> to query. Note: gene must itself be a slice of a sequence in our  
>> alignment, or the alignment query msa[site] will raise an  
>> IndexError informing the user that the sequence site is not in the  
>> alignment.
>>
>>     * The only slightly interesting operation here is the use of  
>> interval addition to obtain the ``union'' of two intervals, e.g.  
>> e1=ss1+ss2. This obtains a single interval that contains both of  
>> the input intervals.
>>
>>     * When the print statement requests str() representations of  
>> these sequence objects, Pygr uses fastacmd -L to extract just the  
>> right piece of the corresponding chromosomes from the eight BLAST  
>> databases.
>>
>> (Actually, because of Pygr's caching / optimizations, considerably  
>> more is going on than indicated in this simplified sketch. But you  
>> get the idea: Pygr makes it relatively effortless to work with a  
>> variety of disparate (and large) resources in an integrated way.)
>>
>> Here is some example output:
>>
>> 1       Mm.99996        ss1     hg17    50.0    100.0   AG      GG
>> 1       Mm.99996        ss1     canFam1 50.0    100.0   AG      GG
>> 1       Mm.99996        ss1     panTro1 50.0    100.0   AG      GG
>> 1       Mm.99996        ss1     rn3     100.0   100.0   AG      AG
>> 1       Mm.99996        ss2     hg17    100.0   100.0   AG      AG
>> 1       Mm.99996        ss2     canFam1 100.0   100.0   AG      AG
>> 1       Mm.99996        ss2     panTro1 100.0   100.0   AG      AG
>> 1       Mm.99996        ss2     rn3     100.0   100.0   AG      AG
>> 1       Mm.99996        ss3     hg17    100.0   100.0   GT      GT
>> 1       Mm.99996        ss3     canFam1 100.0   100.0   GT      GT
>> 1       Mm.99996        ss3     panTro1 100.0   100.0   GT      GT
>> 1       Mm.99996        ss3     rn3     100.0   100.0   GT      GT
>> 1       Mm.99996        e1      hg17    78.9    100.0   AG      GG
>> 1       Mm.99996        e1      canFam1 84.2    100.0   AG      GG
>> 1       Mm.99996        e1      panTro1 77.6    100.0   AG      GG
>> 1       Mm.99996        e1      rn3     97.4    98.7    AG      AG
>> 1       Mm.99996        e2      hg17    91.6    99.1    CC      CC
>> 1       Mm.99996        e2      canFam1 88.8    99.1    CC      CC
>> 1       Mm.99996        e2      panTro1 91.6    99.1    CC      CC
>> 1       Mm.99996        e2      rn3     97.2    100.0   CC      CC
>>
>>
>>
>>
>> -----BEGIN PGP SIGNATURE-----
>> Version: GnuPG v1.4.2.2 (Darwin)
>>
>> iD8DBQFE0n8GLQ4dB3bqQz4RApcxAKCIHdZ9mttB1uC4HkY3xXEw1cWYswCeIg4i
>> xhxE2zrffLaiCjSiEp4Eo6k=
>> =BeOe
>> -----END PGP SIGNATURE-----
>>
>>
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.2.2 (Darwin)
>
> iD8DBQFE0p7iLQ4dB3bqQz4RAkzJAJ4wxiZqi7lZGBUMTFwyquGOCajiKQCfUDBm
> Wx/4AIstFjb+rbqY2QBppLg=
> =fghY
> -----END PGP SIGNATURE-----
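
The interval addition described in the forwarded message (e1=ss1+ss2
giving a single interval that contains both inputs) can be pictured
with a small stand-alone sketch. The Interval class below is purely
hypothetical and is not Pygr's implementation; it only mimics the
"covering interval" behaviour of "+" using plain integer coordinates:

```python
class Interval(object):
    """A half-open interval [start, stop) on a named sequence."""

    def __init__(self, seq_id, start, stop):
        self.seq_id = seq_id
        self.start = start
        self.stop = stop

    def __add__(self, other):
        # The "union" here is the smallest interval covering both inputs
        if self.seq_id != other.seq_id:
            raise ValueError("intervals lie on different sequences")
        return Interval(self.seq_id,
                        min(self.start, other.start),
                        max(self.stop, other.stop))

    def __repr__(self):
        return "Interval(%r, %d, %d)" % (self.seq_id, self.start, self.stop)


# Made-up coordinates standing in for start1, start2 and stop
start1, start2, stop = 100, 160, 300
ss1 = Interval("gene", start1 - 2, start1)
ss2 = Interval("gene", start2 - 2, start2)
ss3 = Interval("gene", stop, stop + 2)

e1 = ss1 + ss2     # spans both alternative splice sites
zone = e1 + ss3    # covering interval for the whole region
```

Note that the result also includes everything between the two inputs,
which is what makes zone usable as a single covering request.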


From biopython-dev at maubp.freeserve.co.uk  Sat Aug 12 08:25:41 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Sat, 12 Aug 2006 09:25:41 +0100
Subject: [Biopython-dev]  Reading sequences: FormatIO, SeqIO, etc
Message-ID: <320fb6e00608120125t53f11b6et2bde5154e5bc19ec@mail.gmail.com>

I'm having a few issues with my email setup which is why I haven't
replied recently.

A week ago I filed bug 2059 for this discussion, and attached some code:

http://bugzilla.open-bio.org/show_bug.cgi?id=2059

I'm interested in your feedback on anything from the overall framework
down to the class names, if you don't like them.

Peter


From krewink at inb.uni-luebeck.de  Wed Aug 16 12:44:07 2006
From: krewink at inb.uni-luebeck.de (Albert Krewinkel)
Date: Wed, 16 Aug 2006 14:44:07 +0200
Subject: [Biopython-dev] Fwd: contributing comparative genomics tools
Message-ID: <20060816124407.GF12386@pc09.inb.uni-luebeck.de>

Hello,

I read Peter's SeqIO/__init__.py replacement and if I may say so: I
love it.  Thanks a lot for this!  Still, there are some things I'd
like to talk about.

The _parse_genbank_features function could also be used to parse EMBL
or DDBJ features, therefore I think it should be named differently.


Since there is a lot of clean up effort right now: How about moving
the SeqRecord and SeqFeature objects into the Bio.Seq module?  They
are closely related and separate modules only clutter the namespace.

To me, this seems to be a general problem. It's very difficult to find
a tool to use for a certain problem if one doesn't already know what
to look for.  I'd very much favour creating modules like
Bio.structure to group modules like Bio.PDB and Bio.NMR etc.  This is
a very big change, and therefore I'd like to follow Marc's suggestion
of splitting off a branch.  In general, I pretty much agree with what
Marc said in his <rant />.

I cannot estimate how much work it would be to maintain two separate
biopython distributions, so please forgive me if I re-suggest
something completely idiotic here.  I just don't believe there is much
that could be lost that way.

Cheers,
Albert

-- 
Albert Krewinkel <krewink at inb.uni-luebeck.de>
University of Luebeck, Institute for Neuro- and Bioinformatics


From biopython-dev at maubp.freeserve.co.uk  Wed Aug 16 14:00:36 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Wed, 16 Aug 2006 15:00:36 +0100
Subject: [Biopython-dev]  Reading sequences: FormatIO, SeqIO, etc
Message-ID: <320fb6e00608160700t33ed0f4ds7311124d70781a64@mail.gmail.com>

(I changed the subject to that of the previous discussion, as this
isn't really about "contributing comparative genomics tools")

Albert Krewinkel wrote:
> Hello,
>
> I read Peter's SeqIO/__init__.py replacement and if I may say so: I
> love it.  Thanks a lot for this!  Still, there are some things I'd
> like to talk about.

Thank you :) The code is on Bug 2059 for anyone who hasn't looked yet.

http://bugzilla.open-bio.org/show_bug.cgi?id=2059

> The _parse_genbank_features function could also be used to parse EMBL
> or DDBJ features, therefore I think it should be named differently.

First of all, that bit of code is for a new feature which I personally
wanted - to be able to iterate over CDS features in a genbank file.

But yes, I did have in mind that it (and the GenBank parser) could be
re-used to deal with EMBL files.  I have not yet taken the time to
learn the EMBL file format and how it corresponds to the GenBank file
format - but I agree a lot of the code could be shared.

> Since there is a lot of clean up effort right now: How about moving
> the SeqRecord and SeqFeature objects into the Bio.Seq module?  They
> are closely related and separate modules only clutter the namespace.

What real benefit does that give us?  It will cause a certain amount
of upheaval in the short term as people will have to change their
import statements on existing scripts.  If we do start a new branch
for "big changes" then I have no real problem with this suggestion.

> To me, this seems to be a general problem. It's very difficult to find
> a tool to use for a certain problem if one doesn't already know what
> to look for.  I'd very much favour creating modules like
> Bio.structure to group modules like Bio.PDB and Bio.NMR etc.  This is
> a very big change, and therefore I'd like to follow Marc's suggestion
> of splitting off a branch.  In general, I pretty much agree with what
> Marc said in his <rant />.
>
> I cannot estimate how much work it would be to maintain two separate
> biopython distributions, so please forgive me if I re-suggest
> something completely idiotic here.  I just don't believe there is much
> that could be lost that way.

BioPython probably would benefit from a little reorganising - and for
anything drastic like moving entire modules about, a new branch makes
sense.  On the other hand, do we have the man-power to do it?  Are any
of the developers familiar with all of (or even most of) the existing
modules?  I would guess I have used less than half of the modules - I
have looked at the very basics of Bio.PDB for example, but have never
tried Bio.NMR

I would favour gradual, incremental (and backwards compatible) changes,
such as adding a new sequence reading module and then marking the old
code as deprecated.
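
As a rough sketch of what that gradual route could look like, the old
entry point can stay in place and simply warn before delegating to the
new code. The names old_reader and new_reader are placeholders made up
for this example, not existing Biopython functions:

```python
import warnings


def new_reader(handle):
    """Stand-in for a reader in the new sequence-reading module."""
    return [line.strip() for line in handle if line.strip()]


def old_reader(handle):
    """Old entry point, kept so that existing scripts keep working."""
    warnings.warn("old_reader is deprecated; use new_reader instead",
                  DeprecationWarning, stacklevel=2)
    return new_reader(handle)
```

Existing scripts keep running unchanged, and users only see a warning
pointing them at the replacement.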

For examples of some small changes, have any of you looked at:

Bug 2057 - SeqRecord has no __str__ or __repr__
http://bugzilla.open-bio.org/show_bug.cgi?id=2057

Bug 1963 - Adding __str__ method to codon tables and translators
http://bugzilla.open-bio.org/show_bug.cgi?id=1963

Little things in themselves that I think would help.

Peter


From krewink at inb.uni-luebeck.de  Wed Aug 16 14:44:36 2006
From: krewink at inb.uni-luebeck.de (Albert Krewinkel)
Date: Wed, 16 Aug 2006 16:44:36 +0200
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <320fb6e00608160700t33ed0f4ds7311124d70781a64@mail.gmail.com>
References: <320fb6e00608160700t33ed0f4ds7311124d70781a64@mail.gmail.com>
Message-ID: <20060816144436.GG12386@pc09.inb.uni-luebeck.de>

On Wed, Aug 16, 2006 at 03:00:36PM +0100, Peter wrote:
> Albert Krewinkel wrote:
> >The _parse_genbank_features function could also be used to parse EMBL
> >or DDBJ features, therefore I think it should be named differently.
> 
> First of all, that bit of code is for a new feature which I personally
> wanted - to be able to iterate over CDS features in a genbank file.
> 
> But yes, I did have in mind that it (and the GenBank parser) could be
> re-used to deal with EMBL files.  I have not yet taken the time to
> learn the EMBL file format and how it corresponds to the GenBank file
> format - but I agree a lot of the code could be shared.

I will try to build something similar for EMBL files within the next
days.  This should be easy, since features really should look the same
in both formats:

http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html
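
To make the "features look the same" point concrete, here is a rough
sketch that walks the feature-table lines of either format. It relies
on the shared feature table layout (feature key starting in column 6,
location starting in column 22) and assumes the caller passes in only
the feature-table lines, not the whole file. The function and helper
names are invented for this example and are not taken from the code on
Bug 2059:

```python
GENBANK_PREFIX = " " * 5   # GenBank feature lines start with five spaces
EMBL_PREFIX = "FT   "      # EMBL feature lines start with "FT" + 3 spaces


def iter_feature_keys(lines, prefix):
    """Yield (key, location) for each feature in a feature table.

    Qualifier and continuation lines have a blank key column, so they
    are skipped in this simplified sketch.
    """
    for line in lines:
        if not line.startswith(prefix):
            continue
        key = line[5:20].strip()        # columns 6-20
        location = line[21:].strip()    # column 22 onwards
        if key:
            yield key, location


def make_line(prefix, key, text):
    """Build one feature-table line with text starting at column 22."""
    return (prefix + key).ljust(21) + text


def example_table(prefix):
    """A tiny made-up feature table in the requested flavour."""
    return [
        make_line(prefix, "source", "1..120"),
        make_line(prefix, "", '/organism="Example organism"'),
        make_line(prefix, "CDS", "10..99"),
        make_line(prefix, "", '/gene="abc"'),
    ]


genbank_features = list(
    iter_feature_keys(example_table(GENBANK_PREFIX), GENBANK_PREFIX))
embl_features = list(
    iter_feature_keys(example_table(EMBL_PREFIX), EMBL_PREFIX))
```

Only the line prefix differs between the two calls, which is why one
shared helper with a format-neutral name seems plausible.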

> >Since there is a lot of clean up effort right now: How about moving
> >the SeqRecord and SeqFeature objects into the Bio.Seq module?  They
> >are closely related and separate modules only clutter the namespace.
> 
> What real benefit does that give us?  It will cause a certain amount
> of upheaval in the short term as people will have to change their
> import statements on existing scripts.  If we do start a new branch
> for "big changes" then I have no real problem with this suggestion.

Agree.

> >To me, this seems to be a general problem. It's very difficult to find
> >a tool to use for a certain problem if one doesn't already know what
> >to look for.  I'd very much favour creating modules like
> >Bio.structure to group modules like Bio.PDB and Bio.NMR etc.  This is
> >a very big change, and therefore I'd like to follow Marc's suggestion
> >of splitting off a branch.  In general, I pretty much agree with what
> >Marc said in his <rant />.
> >
> >I cannot estimate how much work it would be to maintain two separate
> >biopython distributions, so please forgive me if I re-suggest
> >something completely idiotic here.  I just don't believe there is much
> >that could be lost that way.
> 
> BioPython probably would benefit from a little reorganising - and for
> anything drastic like moving entire modules about, a new branch makes
> sense.  On the other hand, do we have the man-power to do it?  Are any
> of the developers familiar with all of (or even most of) the existing
> modules?  I would guess I have used less than half of the modules - I
> have looked at the very basics of Bio.PDB for example, but have never
> tried Bio.NMR

I attached a file which I created when I was teaching myself
biopython. It provides a basic grouping for the current biopython
modules.  Naturally, it's by no means complete and probably wrong in
some places.

> I would favour gradual incremental (and backwards compatible) changes.
> Such as adding a new sequence reading module and then marking the old
> code as deprecated.

I think we could do both: A new branch might make it easier to see
which modules are useful the way they are and which are not.  Even if
this separate branch is never released itself, it would still be handy
for coordinating the reorganisation.

> For example of some small changes, have any of you looked at:
> 
> Bug 2057 - SeqRecord has no __str__ or __repr__
> http://bugzilla.open-bio.org/show_bug.cgi?id=2057
> 
> Bug 1963 - Adding __str__ method to codon tables and translators
> http://bugzilla.open-bio.org/show_bug.cgi?id=1963
> 
> Little things in themselves that I think would help.

True.  My (naive) hope is that such things would be by-products of a
new branch.  I have to admit that this is probably not possible
without doing a code sprint.

Albert

-- 
Albert Krewinkel <krewink at inb.uni-luebeck.de>
University of Luebeck, Institute for Neuro- and Bioinformatics
-------------- next part --------------
Databases:
 o NCBI
   - UniGene
   - GenBank
   - PubMed
   - Entrez
   - LocusLink
   - Geo
 o Kabat
 o KEGG
 o SwissProt
 o Medline
 o biblio (pywebsvcs dependency is mentioned only in the module itself)
 o dbdefs
 o InterPro
 o Gobase
 o Enzyme
 o Rebase

Models and Simulations:
 o Ais
 o MetaTool
 o Pathway
 o ECell

Algorithms, Machine Learning and Pattern Recognition:
 o HMM
 o NeuralNetwork
 o Cluster
 o LogisticRegression, Statistics
 o GA
 o MarkovModel
 o pairwise2
 o NaiveBayes
 o MaxEntropy

Alignments:
 o Align
 o Blast
 o AlignAce
 o Clustalw
 o Fasta
 o FSSP
 o SubsMat
 o Search (WUBLAST output)
 o Saf
 o IntelliGenetics

Applications:
 o Application
 o Emboss
 o Nexus
 o AlignAce
 o Blast
 o MEME
 o Sequencing
 o Wise

Data Structures:
 o KDTree
 o trie

Sequences:
 o GFF
 o Seq
 o SeqUtils
 o SeqFeature
 o SeqRecord
 o Alphabet
 o Transcribe
 o Translate
 o lcc
 o Encodings
 o Data
 o NBRF

SeqIO:
 o writers
 o Writer
 o SeqIO
 o builders
 o Fasta
 o Index

Utilities:
 o utils.py
 o ParserSupport
 o File
 o Tools
 o Mindy
 o HotRand
 o config
 o formatdefs
 o MarkupEditor
 o DocSQL (wouldn't usage of SQL-Object be nicer? (if possible))
 o EUtils.ReseekFile
 o Std, StdHandler
 o PropertyManager
 o MultiProc
 o Decode
 o FilteredReader

Graphics:
 o Graphics

Web-Based:
 o GenBank
 o NetCache
 o EUtils
 o WWW

Microarrays:
 o Affy

Structure:
 o NMR
 o PDB
 o Crystal
 o Ndb
 o SCOP
 o SVDSuperimposer

Motifs:
 o MEME
 o Prosite
 o CDD
 o Compass

References:
 o Medline, PubMed
 o DBXref

Restriction:
 o Restriction
 o CAPS


From biopython-dev at maubp.freeserve.co.uk  Wed Aug 16 16:05:12 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Wed, 16 Aug 2006 17:05:12 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <20060816144436.GG12386@pc09.inb.uni-luebeck.de>
References: <320fb6e00608160700t33ed0f4ds7311124d70781a64@mail.gmail.com>
	<20060816144436.GG12386@pc09.inb.uni-luebeck.de>
Message-ID: <44E34238.2010508@maubp.freeserve.co.uk>

Albert Krewinkel wrote:
>>> The _parse_genbank_features function could also be used to parse EMBL
>>> or DDBJ features, therefore I think it should be named differently.

Peter wrote:
>> First of all, that bit of code is for a new feature which I personally
>> wanted - to be able to iterate over CDS features in a genbank file.
>>
>> But yes, I did have in mind that it (and the GenBank parser) could be
>> re-used to deal with EMBL files.  I have not yet taken the time to
>> learn the EMBL file format and how it corresponds to the GenBank file
>> format - but I agree a lot of the code could be shared.

Albert Krewinkel wrote:
> I will try to build something similar for EMBL files within the next
> days.  This should be easy, since features really should look the same
> in both formats:
> 
> http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html
> 

Oh - you meant just adding EMBL feature iteration.  I was thinking
about the larger task of full EMBL file reading.

Doing just the features is very easy, here you go:

http://bugzilla.open-bio.org/show_bug.cgi?id=2059#c2

Any more feedback is very welcome.  Are you using the iterators 
directly, or via the helper function File2SequenceIterator?

Are you using just the sequence iterators, or the dictionary and list 
versions too?

Peter


From biopython-dev at maubp.freeserve.co.uk  Wed Aug 16 22:20:28 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Wed, 16 Aug 2006 23:20:28 +0100
Subject: [Biopython-dev] Tweaking the SeqRecord class
Message-ID: <320fb6e00608161520j5fb6b4fejd7aa8cc839989423@mail.gmail.com>

In the spirit of gradual improvements, I had a look at the SeqRecord class.

First of all, is there any comment on my suggestion to add __str__ and
__repr__ methods to the SeqRecord object, bug 2057:

http://bugzilla.open-bio.org/show_bug.cgi?id=2057
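
To give a feel for what is being suggested, here is a minimal sketch on
a stand-in class. PrintableRecord is hypothetical and the exact output
layout is only an assumption for illustration; it is not the patch
attached to the bug:

```python
class PrintableRecord(object):
    """Minimal stand-in for a sequence record (not Bio.SeqRecord itself)."""

    def __init__(self, seq, id="<unknown id>", description=""):
        self.seq = seq
        self.id = id
        self.description = description

    def __repr__(self):
        # Unambiguous form, handy at the interactive prompt
        return "PrintableRecord(seq=%r, id=%r, description=%r)" % (
            self.seq, self.id, self.description)

    def __str__(self):
        # Human-readable form, used by print
        return "ID: %s\nDescription: %s\nSequence: %s" % (
            self.id, self.description, self.seq)


record = PrintableRecord("ACGT", id="alpha", description="demo")
```

Without these methods, printing a record at the prompt only shows the
default object representation with a memory address.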

Next, I'd like to check in some basic __doc__ strings for the
SeqRecord class, e.g. something like this:

>>> from Bio.SeqRecord import SeqRecord
>>> print SeqRecord.__doc__
The SeqRecord object is designed to hold a sequence and information about it.

    Main properties:
    id          - Identifier such as a locus tag (string)
    seq         - The sequence itself (Seq object)

    Additional properties:
    name        - Sequence name, e.g. gene name (string)
    description - Additional text (string)
    dbxrefs     - List of database cross references (list of strings)
    features    - Any (sub)features defined (list of SeqFeature objects)
    annotations - Further information (dictionary)

I would also like to add doc strings to the id, seq, name, ...
themselves.  However, they are currently stored as attributes so this
isn't possible.  See PEP 0224,
http://www.python.org/dev/peps/pep-0224/

However, we could use the Python 2.2 "property" function to implement
these as properties.  The code might be clearer using the Python 2.4
"decorator" syntax, but I don't think we should depend on such a
recent version of python yet.

Using properties would allow this usage:

>>> print SeqRecord.features.__doc__
Annotations about parts of the sequence (list of SeqFeatures)

It would also mean that these properties show up in dir(SeqRecord) and
help(SeqRecord), which all in all should make the object slightly
easier to use.

Finally, using get/set property functions allows us to postpone
creation of string/list/dict objects for unused properties.  This does
actually seem to bring a slight improvement to the timings for Fasta
file parsing discussed last month.
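
A minimal sketch of the property idea, for illustration only.  The class
name LazyRecord and the private attribute _features are assumptions made
for this example, not the contents of the attached SeqRecord.py.  It shows
a property that carries its own doc string and only creates its list on
first access:

```python
class LazyRecord(object):
    """Hypothetical stand-in for SeqRecord (illustration only)."""

    def __init__(self, seq, id="<unknown id>"):
        self.seq = seq
        self.id = id
        self._features = None  # not created until someone asks for it

    def _get_features(self):
        # Create the list lazily, on first access.
        if self._features is None:
            self._features = []
        return self._features

    def _set_features(self, value):
        self._features = value

    # Plain property() call rather than decorator syntax, so nothing
    # newer than Python 2.2 is needed.
    features = property(
        _get_features, _set_features,
        doc="Annotations about parts of the sequence (list of SeqFeatures)")
```

With this layout LazyRecord.features.__doc__ returns the doc string, and
"features" appears in dir(LazyRecord).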

If you recall, for the fastest parsers turning the data into SeqRecord
and Seq objects imposed a fairly large overhead (compared to just
using strings):

http://lists.open-bio.org/pipermail/biopython-dev/2006-July/002407.html

I would be interested to see how those numbers change with the
attached implementation - if you wouldn't mind please Leighton... ;)

I have attached a version of SeqRecord.py which implements the changes
I have described.  The backwards compatibility if statement is a bit
ugly - can we just assume Python 2.2 or later?

Peter
-------------- next part --------------
A non-text attachment was scrubbed...
Name: SeqRecord.py
Type: text/x-script.phyton
Size: 9367 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20060816/9e2f173c/attachment-0002.bin>

From mdehoon at c2b2.columbia.edu  Thu Aug 17 01:39:12 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Wed, 16 Aug 2006 21:39:12 -0400
Subject: [Biopython-dev] Tweaking the SeqRecord class
In-Reply-To: <320fb6e00608161520j5fb6b4fejd7aa8cc839989423@mail.gmail.com>
References: <320fb6e00608161520j5fb6b4fejd7aa8cc839989423@mail.gmail.com>
Message-ID: <44E3C8C0.5070200@c2b2.columbia.edu>

Peter wrote:
> First of all, is there any comment on my suggestion to add __str__ and
> __repr__ methods to the SeqRecord object, bug 2057:
> 
> http://bugzilla.open-bio.org/show_bug.cgi?id=2057
Here's a thought:
What if Seq were to inherit from str, and SeqRecord from Seq?
Then, you get these for free.

> Next, I'd like to check in some basic __doc__ strings for the
> SeqRecord class, e.g. something like this:
Sounds good to me. Pretty amazing, actually, that SeqRecord doesn't have 
documentation.

> If you recall, for the fastest parsers turning the data into SeqRecord
> and Seq objects imposed a fairly large overhead (compared to just
> using strings):
> 
> http://lists.open-bio.org/pipermail/biopython-dev/2006-July/002407.html
I wonder if this is still true if a Seq object and a SeqRecord object 
inherit from string. From the code, I don't see where the overhead comes 
from.


> The backwards compatibility if statement is a bit
> ugly - can we just assume Python 2.2 or later?
Biopython currently requires Python 2.3 or later.

--Michiel.


From krewink at inb.uni-luebeck.de  Thu Aug 17 07:25:34 2006
From: krewink at inb.uni-luebeck.de (Albert Krewinkel)
Date: Thu, 17 Aug 2006 09:25:34 +0200
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44E34238.2010508@maubp.freeserve.co.uk>
References: <320fb6e00608160700t33ed0f4ds7311124d70781a64@mail.gmail.com>
	<20060816144436.GG12386@pc09.inb.uni-luebeck.de>
	<44E34238.2010508@maubp.freeserve.co.uk>
Message-ID: <20060817072534.GH12386@pc09.inb.uni-luebeck.de>

Peter wrote:
> Oh - you meant just adding EMBL feature iteration.  I was thinking
> about the larger task of full EMBL file reading.

I started working on that, but I'm not very far yet.

> Doing just the features is very easy, here you go:
> 
> http://bugzilla.open-bio.org/show_bug.cgi?id=2059#c2

Wow, that was quick. And it works almost perfectly. One exception:
In _parse_embl_or_genbank_feature(), when parsing the location, it
should say something like

<code>
from string import digits
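# A location is complete once it ends in ')' or a digit.  digits is a
# string, so test membership in ')' + digits rather than in a tuple.
while feature_location[-1] not in ')' + digits: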
    line = iterator.next()
    feature_location += line[FEATURE_QUALIFIER_INDENT:].strip()
</code>

This way, features may have multiline join(...) positions.
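
A self-contained sketch of the same idea, so it can be tried outside the
parser.  The function name read_location and the indent value of 21 are
assumptions for illustration; the continuation test is written as
membership in the string ')' + digits:

```python
from string import digits

# Assumed column at which location/qualifier text starts in a feature table.
FEATURE_QUALIFIER_INDENT = 21


def read_location(first_part, lines):
    """Join continuation lines until the location looks complete.

    first_part is the location text from the feature's first line and
    lines is an iterator over the following raw lines.
    """
    location = first_part.strip()
    # Complete locations end in ')' or a digit; anything else (such as
    # a trailing comma inside join(...)) means the next line continues it.
    while location[-1] not in ")" + digits:
        line = next(lines)
        location += line[FEATURE_QUALIFIER_INDENT:].strip()
    return location
```

Only the continuation lines are consumed, so the caller's iterator is left
at the first qualifier line.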


> Any more feedback is very welcome.  Are you using the iterators 
> directly, or via the helper function File2SequenceIterator?

I'm using iterators directly, out of old habits.  But most likely I
will finally get addicted to your nice helper function.

> Are you using just the sequence iterators, or the dictionary and list 
> versions too?

I don't use those yet.

Albert

-- 
Albert Krewinkel <krewink at inb.uni-luebeck.de>
University of Luebeck, Institute for Neuro- and Bioinformatics


From mcolosimo at mitre.org  Thu Aug 17 12:08:24 2006
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Thu, 17 Aug 2006 08:08:24 -0400
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <320fb6e00608120125t53f11b6et2bde5154e5bc19ec@mail.gmail.com>
References: <320fb6e00608120125t53f11b6et2bde5154e5bc19ec@mail.gmail.com>
Message-ID: <FECC9CE1-422B-4506-AEDE-D5391854E7B0@mitre.org>

Peter,

Nice quick work on that. For Clustal, I think it should NOT be an  
Iterator, but there should be SequenceDict or SequenceList for it.  
There are other alignment filetypes out there that could use a  
SequenceIterator (those that are not interlaced).  From looking over  
your code, it seems like it would be easy to add a check in
File2SequenceDict/List to check for Clustal types and do something  
"special"

Marc


On Aug 12, 2006, at 4:25 AM, Peter wrote:

> I've been having a few issues with my email setup, which is why I haven't
> replied recently.
>
> A week ago I filed bug 2059 for this discussion, and attached some  
> code:
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2059
>
> I'm interested in your feedback - from the framework down to if you
> don't like the class names for example.
>
> Peter


From biopython-dev at maubp.freeserve.co.uk  Thu Aug 17 13:25:07 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Thu, 17 Aug 2006 14:25:07 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <FECC9CE1-422B-4506-AEDE-D5391854E7B0@mitre.org>
References: <320fb6e00608120125t53f11b6et2bde5154e5bc19ec@mail.gmail.com>
	<FECC9CE1-422B-4506-AEDE-D5391854E7B0@mitre.org>
Message-ID: <44E46E33.3090001@maubp.freeserve.co.uk>

Marc Colosimo wrote:
> Peter,
> 
> Nice quick work on that. For Clustal, I think it should NOT be an  
> Iterator, but there should be SequenceDict or SequenceList for it.  
> There are other alignment filetypes out there that could use a  
> SequenceIterator (those that are not interlaced).  From looking over  
> your code, it seems like it would be easy to add a check in
> File2SequenceDict/List to check for Clustal types and do something  
> "special"

Yes, I was wondering about that too.

For interlaced file formats (such as clustalw, NEXUS multiple alignment 
format) we have to load the whole file into memory anyway - so using a 
SequenceIterator was a bit odd.

What I was trying to do was use a SequenceIterator as the lowest common 
denominator - the ClustalIterator shows that this can be done for 
interlaced files, and seems to work.

It's trivial to "upgrade" the ClustalIterator to a SequenceDict or
SequenceList if that's what is needed.

The way I wrote the ClustalIterator it actually reads the whole file and 
stores a list of IDs and a dictionary mapping the ID to the sequence 
string.  It creates SeqRecord objects only on request.  This should use 
less memory than a full list of every SeqRecord (but I have not measured 
this).
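
A rough sketch of that design, under simplifying assumptions: the class
name LazyClustalIterator is hypothetical, the parsing is deliberately
minimal, and it yields (id, sequence) pairs where the real code would
build SeqRecord objects:

```python
class LazyClustalIterator(object):
    """Hypothetical sketch: read everything, build records on request."""

    def __init__(self, handle):
        self.ids = []    # identifiers, in file order
        self.seqs = {}   # identifier -> sequence string
        for line in handle:
            # Skip the header, blank lines and consensus lines
            # (consensus lines start with a space).
            if (line.startswith("CLUSTAL") or not line.strip()
                    or line[0] == " "):
                continue
            fields = line.split()
            ident, chunk = fields[0], fields[1]
            if ident not in self.seqs:
                self.ids.append(ident)
                self.seqs[ident] = ""
            self.seqs[ident] += chunk
        self._next_index = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self._next_index >= len(self.ids):
            raise StopIteration
        ident = self.ids[self._next_index]
        self._next_index += 1
        # A real implementation would create the SeqRecord here.
        return ident, self.seqs[ident]

    next = __next__  # Python 2 spelling
```

In this sketch the "upgrade" is just list(iterator) for a list, or
dict(iterator) for a dictionary keyed on the identifier.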

Note that I would also want to add an easy way to turn any 
SequenceIterator, SequenceList or SequenceDict into a multiple alignment 
object.

Out of interest, what are the largest alignments you deal with?

I was planning to add a Stockholm parser (where the sequences themselves 
are non-interleaved).  The PFAM database alignments use this, and are 
the largest alignments I am aware of.

However, the format supports per sequence annotation information and 
this information can be rather spread out.  Looking at a real example 
from PFAM, there were blocks of such data both before and after the 
sequences.  The format suggests that such annotation might also be found
next to each sequence.

i.e. An annotation free Stockholm iterator would be easy, but including 
the meta data would in general require loading the whole file.
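
To illustrate the easy case, here is a sketch of an annotation-free
reader.  The function name stockholm_sequences is made up for this
example, it assumes non-interleaved sequence lines, and it simply skips
every mark-up line rather than collecting the meta data:

```python
def stockholm_sequences(handle):
    """Yield (id, sequence) pairs from one Stockholm alignment.

    All lines starting with '#' are ignored: the '# STOCKHOLM 1.0'
    header as well as the #=GF, #=GS, #=GR and #=GC mark-up.
    """
    for line in handle:
        line = line.strip()
        if line == "//":
            break  # end of this alignment
        if not line or line.startswith("#"):
            continue
        ident, seq = line.split(None, 1)
        yield ident, seq.strip()
```

Because nothing is stored between lines, this works one record at a time;
it is attaching the scattered #=GS/#=GR data to each record that would
force the whole file to be read first.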

http://www.cgb.ki.se/cgb/groups/sonnhammer/Stockholm.html

It looks like a subclassed version could be written to handle the PFAM 
annotations nicely.

Peter


From mcolosimo at mitre.org  Thu Aug 17 12:24:24 2006
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Thu, 17 Aug 2006 08:24:24 -0400
Subject: [Biopython-dev] Fwd: contributing comparative genomics tools
In-Reply-To: <20060816124407.GF12386@pc09.inb.uni-luebeck.de>
References: <20060816124407.GF12386@pc09.inb.uni-luebeck.de>
Message-ID: <9A739306-6B91-4E43-87F8-EC464784B4B2@mitre.org>

On Aug 16, 2006, at 8:44 AM, Albert Krewinkel wrote:
> Hello,
>
> I read Peter's SeqIO/__init__.py replacement and if I may say so: I
> love it.  Thanks a lot for this!  Still, there are some things I'd
> like to talk about.
>
> The _parse_genbank_features function could also be used to parse embl
> or DDBJ features, therefore I think it should be named differently.
>
>
> Since there is a lot of clean up effort right now: How about moving
> the SeqRecord and SeqFeature objects into the Bio.Seq module?  They
> are closely related and separate modules only clutter the namespace.
>
The top namespace is sort of a mess of things.
> To me, this seems to be a general problem. It's very difficult to find
> a tool to use for a certain problem if one doesn't already know what
> to look for.  I'd pretty much favour to create modules like
> Bio.structure to group modules like Bio.PDB and Bio.NMR etc.

I second this.
> This is
> a very big change, and therefore I'd like to follow Marc's suggestion
> of splitting off a branch.  In general, I pretty much agree with what
> Marc said in his <rant />.
>
> I cannot estimate how much work it would be to maintain two separate
> biopython distributions, so please forgive me if I re-suggest
> something completely idiotic here.  I just don't believe there is much
> that could be lost that way.
I've done this for my internal work, but I never went back to see how
to check out the other branch (I had no need). CVS is sometimes a
bear to work with. SVN is supposed to handle branches much better, but
I can't access SVN repositories that are not served over HTTPS (SSL).
Stupid corporate proxy is currently not set up to handle external
WebDAV.

This might be a pain for a little while until the next full version  
is released, but I think the benefits of doing this now far outweigh
the short term pain (of course I'm not an admin who has to build the  
releases).

Marc


From biopython-dev at maubp.freeserve.co.uk  Thu Aug 17 15:13:40 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Thu, 17 Aug 2006 16:13:40 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <20060817072534.GH12386@pc09.inb.uni-luebeck.de>
References: <320fb6e00608160700t33ed0f4ds7311124d70781a64@mail.gmail.com>	<20060816144436.GG12386@pc09.inb.uni-luebeck.de>	<44E34238.2010508@maubp.freeserve.co.uk>
	<20060817072534.GH12386@pc09.inb.uni-luebeck.de>
Message-ID: <44E487A4.8040106@maubp.freeserve.co.uk>

Albert Krewinkel wrote:
> Peter wrote:
> 
>>Oh - you meant just adding EMBL feature iteration.  I was thinking 
>>about the larger task of full EMBL file reading.
> 
> I started working on that, but I'm not very far yet.

Are you starting from Bio.GenBank or from scratch?  I would point out 
that the code in Bio.GenBank was inserted into what was once a Martel 
based parser, and designed to be a transparent change for the end user.

What I would like to do is recycle that code into a new far simpler 
SeqIO GenBank parser which would only return SeqRecords.  In particular 
I would get rid of all the scanner/consumer model with all its function
callbacks.

At this point I would try and handle both GenBank and EMBL files together.

I expect this to be faster, and easier to understand.  It would be a lot 
less flexible for the "power user", but then so is all the new SeqIO 
code I have been writing.

>>Doing just the features is very easy, here you go:
>>
>>http://bugzilla.open-bio.org/show_bug.cgi?id=2059#c2
> 
> Wow, that was quick.

Well, I did have something along these lines planned in advance - that's 
why my parse function was outside the GenbankCdsFeatureIterator class.

> And it works almost perfectly. One exception:
> In _parse_embl_or_genbank_feature(), when parsing the location, it
> should say something like
> 
> <code>
> from string import digits
> while feature_location[-1] not in ')' + digits:
>     line = iterator.next()
>     feature_location += line[FEATURE_QUALIFIER_INDENT:].strip()
> </code>
> 
> This way, features may have multiline join(...) positions.

Good point, something I was aware of and coped with in Bio.GenBank but 
hadn't done in the CDS iterator.  Thanks for pointing this out.

This affects both GenBank and EMBL files by the way.  My code is very 
similar but I included an assert to check the indent, and I only check 
for a trailing comma.  This works on all the files I have tried.

Peter


From krewink at inb.uni-luebeck.de  Thu Aug 17 17:41:06 2006
From: krewink at inb.uni-luebeck.de (Albert Krewinkel)
Date: Thu, 17 Aug 2006 19:41:06 +0200
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44E487A4.8040106@maubp.freeserve.co.uk>
References: <320fb6e00608160700t33ed0f4ds7311124d70781a64@mail.gmail.com>
	<20060816144436.GG12386@pc09.inb.uni-luebeck.de>
	<44E34238.2010508@maubp.freeserve.co.uk>
	<20060817072534.GH12386@pc09.inb.uni-luebeck.de>
	<44E487A4.8040106@maubp.freeserve.co.uk>
Message-ID: <20060817174106.GI12386@pc09.inb.uni-luebeck.de>

Peter wrote:
> > Peter wrote:
> >>Oh - you meant just adding EMBL feature iteration.  I was thinking 
> >>about the larger task of full EMBL file reading.
> >
> Albert wrote:
> >I started working on that, but I'm not very far yet.
> 
> Are you starting from Bio.GenBank or from scratch?  I would point out 
> that the code in Bio.GenBank was inserted into what was once a Martel 
> based parser, and designed to be a transparent change for the end user.
>
> What I would like to do is recycle that code into a new far simpler 
> SeqIO GenBank parser which would only return SeqRecords.  In particular 
> I would get rid of all the scanner/consumer model with all its function
> callbacks.
> 
> At this point I would try and handle both GenBank and EMBL files together.

I didn't do much more than to play with current code and add some
methods to parse EMBL specific things.  The results can be found here:

http://www.inb.uni-luebeck.de/~krewink/embl.py

It's ugly, and doesn't provide much functionality, but could be a
starting point.

Albert

-- 
Albert Krewinkel <krewink at inb.uni-luebeck.de>
University of Luebeck, Institute for Neuro- and Bioinformatics


From biopython-dev at maubp.freeserve.co.uk  Thu Aug 17 20:09:20 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Thu, 17 Aug 2006 21:09:20 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44E46E33.3090001@maubp.freeserve.co.uk>
References: <320fb6e00608120125t53f11b6et2bde5154e5bc19ec@mail.gmail.com>	<FECC9CE1-422B-4506-AEDE-D5391854E7B0@mitre.org>
	<44E46E33.3090001@maubp.freeserve.co.uk>
Message-ID: <44E4CCF0.7090607@maubp.freeserve.co.uk>

Marc Colosimo wrote:
>> Nice quick work on that. For Clustal, I think it should NOT be an  
>> Iterator, but there should be SequenceDict or SequenceList for it.  
>> There are other alignment filetypes out there that could use a  
>> SequenceIterator (those that are not interlaced).  From looking over  
>> your code, it seems like it would be easy to add a check in
>> File2SequenceDict/List to check for Clustal types and do something  
>> "special"

Peter (BioPython Dev) wrote:
> Yes, I was wondering about that too.
> 
> For interlaced file formats (such as clustalw, NEXUS multiple alignment 
> format) we have to load the whole file into memory anyway - so using a 
> SequenceIterator was a bit odd.
> 
> What I was trying to do was use a SequenceIterator as the lowest common 
> denominator - the ClustalIterator shows that this can be done for 
> interlaced files, and seems to work.

There are two and a half examples done this way now...

> I was planning to add a Stockholm parser (where the sequences themselves 
> are non-interleaved).  The PFAM database alignments use this, and are 
> the largest alignments I am aware of.
> 
> ...
> 
> It looks like a subclassed version could be written to handle the PFAM 
> annotations nicely.

http://bugzilla.open-bio.org/show_bug.cgi?id=2059#c3

Changes to the clustal parser, and addition of a parser for Stockholm
alignments, and a subclassed version to handle the PFAM style
annotations strings.

I have included basic handling of the sequence specific meta-data [I
need to have a look at real PFAM data to sort out the database cross
references still], but currently ignore the whole file level information
(#=GF lines) and the per column information (#=GC lines).

Maybe reading sequences out of multiple alignment files should be done
as a special case of loading multiple alignments?  Is this what you
meant by "something special" Marc?

Peter


From biopython-dev at maubp.freeserve.co.uk  Mon Aug 21 19:26:06 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Mon, 21 Aug 2006 20:26:06 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44E4D2B4.3000600@maubp.freeserve.co.uk>
References: <44E4D2B4.3000600@maubp.freeserve.co.uk>
Message-ID: <44EA08CE.5070802@maubp.freeserve.co.uk>

You probably noticed I sent out a "Dealing with sequence files"
questionnaire on the main discussion list:

http://lists.open-bio.org/pipermail/biopython/2006-August/003171.html

I've had four replies to date (off the list), and with the previous list
discussion and counting myself that makes eight views.  Not a very big
sample I know.

> Question One
> ============
> Is reading sequence files an important function to you, and if so which
> file formats in particular (e.g. Fasta, GenBank, ...)

Fasta very popular, with GenBank also scoring highly.  Michiel and I
both use clustalw.  Apart from EMBL (next question) there wasn't any
other popular file format given.

I'm tempted to ask again regarding multiple alignment formats.

> Question Two
> ============
> Are there any sequence formats you would like to be able to read using 
> BioPython that are not currently supported (e.g. EMBL, ...)

It may have been a leading question, but several respondents would like
to be able to read in EMBL format.

Other requests included:

XML based 454 sequence files
UniGene sequence cluster format

Leighton mentioned:

PTT (Protein table files)
GFF (General Feature Format)

And I wanted to be able to read Stockholm alignments.

> Question Three - Reading Fasta Files
> ====================================
> Which of the following do you currently use (and why)?:
> 
> (a) Bio.Fasta with the RecordParser (giving FastaRecord objects)
> (b) Bio.Fasta with the FeatureParser (giving SeqRecord objects)
> (c) Bio.Fasta with your own parser (Could you tell us more?)
> (d) Bio.SeqIO.FASTA.FastaReader (giving SeqRecord objects)
> (e) Bio.FormatIO (giving SeqRecord objects)
> (f) Other (Could you tell us more?)

A range covering (a), (b) and (d) plus DIY parsers.

> Question Four - Reading GenBank Files
> =====================================
> Which of the following do you currently use (and why)?:
> 
> (a) Bio.GenBank with the FeatureParser (giving SeqRecord objects)
> (b) Bio.GenBank with the RecordParser (giving GenBank Record objects)
> (c) Other (Could you tell us more?)

Both (a) and (b) with no clear majority.

> Question Five - Record Access...
> ================================
> When loading a file with multiple sequences do you use:
> 
> (a) An iterator interface (e.g. Bio.Fasta.Iterator) to give you the
> records one by one in the order from the file.
> 
> (b) A dictionary interface (e.g. Bio.Fasta.Dictionary) to give you
> random access to the records using their identifier.
> 
> (c) A list giving random access by index number (e.g. load the records
> using an iterator but save them in a list).

Most of you use iterators, storing records in memory as required.

> Question Six - Martel, Scanners and Consumers
> =============================================
> Some of BioPython's existing parsers (e.g. those using Martel) use an
> event/callback model, where the scanner component generates parsing
> events which are dealt with by the consumer component.
> 
> Do any of you use this system to modify existing parser behaviour, or
> use it as part of your own personal file parser?
> 
> (a) I don't know, or don't care.  I just use the parsers provided.
> (b) I use this framework to modify a parser in order to do ... (please
> provide details).

Almost everyone said (a) which I think is a good thing if we are going
to try and re-work the BioPython's sequence reading.

> And finally...
> ==============
> Do you have any general questions or comments?

Several people have commented that BioPerl has a nice unified system
with good documentation.

-----------------------------------------------------------------------

Where next...

I think my code could be included "in parallel" with the existing
parsers, without the upheaval of creating a new branch etc.

I have started thinking about writing files too.

Part of this will involve trying to be as consistent as possible about
mapping annotations from different file formats to the SeqRecord
object's annotations dictionary.

http://bugzilla.open-bio.org/show_bug.cgi?id=2059

My code currently on bug 2059 is written as a single python file,
provisionally Bio/SeqIO/__init__.py but this is clearly not a good idea
long term as more file formats are supported.

If we use Bio.SeqIO then the prior existence of Bio/SeqIO/FASTA.py is a
slight annoyance in that I can't use Bio/SeqIO/Fasta.py because the
filenames would clash on Windows.  Some people are using the code in
Bio.SeqIO.FASTA, but I suppose the file could contain both the old code,
and my new fasta interface.
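
For anyone wanting to check a source tree for this kind of problem, a
small sketch follows.  The helper name case_clashes is made up for
illustration; it reports file names that differ only by case, which is
what collides on a case-insensitive file system:

```python
def case_clashes(filenames):
    """Return pairs of names that differ only in upper/lower case."""
    seen = {}      # lower-cased name -> first spelling encountered
    clashes = []
    for name in filenames:
        key = name.lower()
        if key in seen and seen[key] != name:
            clashes.append((seen[key], name))
        else:
            seen[key] = name
    return clashes
```

Given ["FASTA.py", "Fasta.py", "GenBank.py"] it reports the single pair
("FASTA.py", "Fasta.py").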

Alternatively, the new system could be put in Bio.SequenceIO or are
there any other suggestions?

Peter


From krewink at inb.uni-luebeck.de  Tue Aug 22 13:43:56 2006
From: krewink at inb.uni-luebeck.de (Albert Krewinkel)
Date: Tue, 22 Aug 2006 15:43:56 +0200
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44EA08CE.5070802@maubp.freeserve.co.uk>
References: <44E4D2B4.3000600@maubp.freeserve.co.uk>
	<44EA08CE.5070802@maubp.freeserve.co.uk>
Message-ID: <20060822134356.GO12386@pc09.inb.uni-luebeck.de>

I'd like to seriously start working on an EMBL parser, but there are
some things I'm concerned about: It surely would be a good thing to
build the SequenceIO and Parser stuff upon some base classes and agree
on using certain tools which are (or will be) used in the whole
project.  Since I never received any education/training on software
development, I would appreciate it if someone could tell me what the code's
structure should look like -- the current Scanner/Consumer code isn't
any help.

> Several people have commented that BioPerl has a nice unified system
> with good documentation.

How about using reStructuredText in docstrings?  IMO it leaves the
.__doc__ string very readable but improves epydoc generated
descriptions.

Albert

-- 
Albert Krewinkel <krewink at inb.uni-luebeck.de>
University of Luebeck, Institute for Neuro- and Bioinformatics


From bsouthey at gmail.com  Tue Aug 22 13:52:10 2006
From: bsouthey at gmail.com (Bruce Southey)
Date: Tue, 22 Aug 2006 08:52:10 -0500
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44EA08CE.5070802@maubp.freeserve.co.uk>
References: <44E4D2B4.3000600@maubp.freeserve.co.uk>
	<44EA08CE.5070802@maubp.freeserve.co.uk>
Message-ID: <bbcd77d00608220652s5b9a5cc3i7f7999b9d27ed18b@mail.gmail.com>

Hi,
To date I have only used SwissProt code from BioPython so I am really
only lurking. But here are some responses.

Bruce

On 8/21/06, Peter (BioPython Dev) <biopython-dev at maubp.freeserve.co.uk> wrote:
> You probably noticed I sent out a "Dealing with sequence files"
> questionnaire on the main discussion list:
>
> http://lists.open-bio.org/pipermail/biopython/2006-August/003171.html
>
> I've had four replies to date (off the list), and with the previous list
> discussion and counting myself that makes eight views.  Not a very big
> sample I know.
>
> > Question One
> > ============
> > Is reading sequence files an important function to you, and if so which
> > file formats in particular (e.g. Fasta, GenBank, ...)
>
> Fasta very popular, with GenBank also scoring highly.  Michiel and I
> both use clustalw.  Apart from EMBL (next question) there wasn't any
> other popular file format given.

Well, this is not a surprise because most apps around also use FASTA
as default format. Although most do not accept a comment line. Thus,
FASTA is the most important format.


>
> I'm tempted to ask again regarding multiple alignment formats.
>
> > Question Two
> > ============
> > Are there any sequence formats you would like to be able to read using
> > BioPython that are not currently supported (e.g. EMBL, ...)
>
> It may have been a leading question, but several respondents would like
> to be able to read in EMBL format.
>
> Other requests included:
>
> XML based 454 sequence files
> UniGene sequence cluster format
>
> Leighton mentioned:
>
> PTT (Protein table files)
> GFF (General Feature Format)
>
> And I wanted to be able to read Stockholm alignments.

I would like to be able to use a custom format that is based on the
FASTA format. That is, allowing non-standard characters to be included as
part of the sequence that I later remove. Perhaps this is just being
able to do subclassing.


>
> > Question Three - Reading Fasta Files
> > ====================================
> > Which of the following do you currently use (and why)?:
> >
> > (a) Bio.Fasta with the RecordParser (giving FastaRecord objects)
> > (b) Bio.Fasta with the FeatureParser (giving SeqRecord objects)
> > (c) Bio.Fasta with your own parser (Could you tell us more?)
> > (d) Bio.SeqIO.FASTA.FastaReader (giving SeqRecord objects)
> > (e) Bio.FormatIO (giving SeqRecord objects)
> > (f) Other (Could you tell us more?)
>
> A range covering (a), (b) and (d) plus DIY parsers.
>
> > Question Four - Reading GenBank Files
> > =====================================
> > Which of the following do you currently use (and why)?:
> >
> > (a) Bio.GenBank with the FeatureParser (giving SeqRecord objects)
> > (b) Bio.GenBank with the RecordParser (giving GenBank Record objects)
> > (c) Other (Could you tell us more?)
>
> Both (a) and (b) with no clear majority.
>
> > Question Five - Record Access...
> > ================================
> > When loading a file with multiple sequences do you use:
> >
> > (a) An iterator interface (e.g. Bio.Fasta.Iterator) to give you the
> > records one by one in the order from the file.
> >
> > (b) A dictionary interface (e.g. Bio.Fasta.Dictionary) to give you
> > random access to the records using their identifier.
> >
> > (c) A list giving random access by index number (e.g. load the records
> > using an iterator but save them in a list).
>
> Most of you use iterators, storing records in memory as required.

a

>
> > Question Six - Martel, Scanners and Consumers
> > =============================================
> > Some of BioPython's existing parsers (e.g. those using Martel) use an
> > event/callback model, where the scanner component generates parsing
> > events which are dealt with by the consumer component.
> >
> > Do any of you use this system to modify existing parser behaviour, or
> > use it as part of your own personal file parser?
> >
> > (a) I don't know, or don't care.  I just use the parsers provided.
> > (b) I use this framework to modify a parser in order to do ... (please
> > provide details).
>
> Almost everyone said (a) which I think is a good thing if we are going
> to try and re-work the BioPython's sequence reading.

a

>
> > And finally...
> > ==============
> > Do you have any general questions or comments?
>
> Several people have commented that BioPerl has a nice unified system
> with good documentation.
>
> -----------------------------------------------------------------------
>
> Where next...
>
> I think my code could be included "in parallel" with the existing
> parsers, without the upheaval of creating a new branch etc.
>
> I have started thinking about writing files too.
>
> Part of this will involve trying to be as consistent as possible about
> mapping annotations from different file formats to the SeqRecord
> object's annotations dictionary.
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2059
>
> My code currently on bug 2059 is written as a single python file,
> provisionally Bio/SeqIO/__init__.py but this is clearly not a good idea
> long term as more file formats are supported.
>
> If we use Bio.SeqIO then the prior existence of Bio/SeqIO/FASTA.py is a
> slight annoyance in that I can't use Bio/SeqIO/Fasta.py because the
> filenames would clash on Windows.  Some people are using the code in
> Bio.SeqIO.FASTA, but I suppose the file could contain both the old code,
> and my new fasta interface.
>
> Alternatively, the new system could be put in Bio.SequenceIO or are
> there any other suggestions?
>
> Peter


From biopython-dev at maubp.freeserve.co.uk  Tue Aug 22 16:46:39 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Tue, 22 Aug 2006 17:46:39 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <20060822134356.GO12386@pc09.inb.uni-luebeck.de>
References: <44E4D2B4.3000600@maubp.freeserve.co.uk>	<44EA08CE.5070802@maubp.freeserve.co.uk>
	<20060822134356.GO12386@pc09.inb.uni-luebeck.de>
Message-ID: <44EB34EF.1050901@maubp.freeserve.co.uk>

Albert Krewinkel wrote:
> I'd like to seriously start working on an EMBL parser, but ...

As the de-facto GenBank module owner, I'm also interested in getting EMBL
and GenBank working nicely together.  The big question BEFORE you/we 
start any serious coding on EMBL support is how it fits into BioPython.

Do we (a) add a new module like the existing Bio.Fasta and Bio.GenBank, 
or (b) use a new framework like the one I've put forward here:

http://bugzilla.open-bio.org/show_bug.cgi?id=2059

 > ... there are  some things I'm concerned about: It surely would be a
 > good thing to  build the SequenceIO and Parser stuff upon some base
 > classes and agree on using certain tools which are (or will be) used
 > in the whole project.

What I was proposing was that all the new sequence file format parsers
should be implemented as subclasses of my SequenceIterator class - 
either directly (e.g. FastaIterator) or indirectly (e.g. the 
PfamStockholmIterator) and they should return SeqRecord objects.
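
As a sketch of what this could look like (not the code attached to bug
2059): the method name next_record and the use of plain tuples in place
of SeqRecord objects are assumptions made to keep the example
self-contained.

```python
class SequenceIterator(object):
    """Assumed shape of the base class; subclasses supply next_record()."""

    def __init__(self, handle):
        self.handle = handle

    def next_record(self):
        """Return the next record, or None at the end of the file."""
        raise NotImplementedError("Subclasses should implement this")

    def __iter__(self):
        return self

    def __next__(self):
        record = self.next_record()
        if record is None:
            raise StopIteration
        return record

    next = __next__  # Python 2 spelling


class FastaIterator(SequenceIterator):
    """Yields (id, description, sequence) tuples as stand-in records."""

    def __init__(self, handle):
        SequenceIterator.__init__(self, handle)
        self._title = None
        # Skip anything before the first '>' line.
        for line in self.handle:
            if line.startswith(">"):
                self._title = line[1:].rstrip()
                break

    def next_record(self):
        if self._title is None:
            return None
        title, chunks = self._title, []
        self._title = None
        for line in self.handle:
            if line.startswith(">"):
                # Remember the next record's title for the next call.
                self._title = line[1:].rstrip()
                break
            chunks.append(line.strip())
        ident = ""
        if title:
            ident = title.split(None, 1)[0]
        return ident, title, "".join(chunks)
```

A new format then only needs its own next_record(); the iteration
protocol is inherited from the base class.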

I am open to discussion about how interlaced file formats should be
handled, but I think I have shown how the SequenceIterator based scheme 
could work using the Clustalw and Stockholm formats as examples.

> Since I never received any education/training on software 
> development, I would appreciate if someone can tell me what the code's
> structure should look like -- the current Scanner/Consumer code
> isn't any help.

I agree that the current Scanner/Consumer code won't be much help.

The fact that the current Bio.GenBank parser uses the Scanner/Consumer 
model reflects the fact that I rewrote (in Python) what had been done 
using Martel/Mindy.  This is one excuse for the state of that code of 
mine ;)

I don't think the flexibility of the Scanner/Consumer model is needed
just to turn Embl/GenBank data into SeqRecord objects (and only into 
SeqRecord objects).

> How about using reStructuredText in docstrings?  IMO it leaves the 
> .__doc__ string very readable but improves epydoc generated 
> descriptions.

I'm not familiar with how any existing API documentation is extracted
from the source code...

Peter


From biopython-dev at maubp.freeserve.co.uk  Thu Aug 17 08:28:19 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Thu, 17 Aug 2006 09:28:19 +0100
Subject: [Biopython-dev] Tweaking the SeqRecord class
In-Reply-To: <44E3C8C0.5070200@c2b2.columbia.edu>
References: <320fb6e00608161520j5fb6b4fejd7aa8cc839989423@mail.gmail.com>
	<44E3C8C0.5070200@c2b2.columbia.edu>
Message-ID: <44E428A3.70103@maubp.freeserve.co.uk>

Michiel de Hoon wrote:
> Peter wrote:
> 
>>First of all, is there any comment on my suggestion to add __str__ and
>>__repr__ methods to the SeqRecord object, bug 2057:
>>
>>http://bugzilla.open-bio.org/show_bug.cgi?id=2057
> 
> Here's a thought:
> What if Seq were to inherit from str, and SeqRecord from Seq?
> Then, you get these for free.

This wouldn't automatically show any id/name/descr/annotation in the
__str__ and __repr__ methods, so I would want to override these methods
anyway.

We would still need to create and provide a Seq object on request as the
record.seq attribute/property (for backwards compatibility).

I also think we should change the Seq object's __str__ and __repr__
functionality (while preserving the .tostring() method for some
backwards compatibility).  It might have been Marc who raised this point
- shouldn't __str__ turn the data into a string, and __repr__ return a
string that you could type into python to recreate the object?  This
would mean we would have to stop truncating the sequence data at 60
characters.
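
To make that convention concrete, here is a minimal sketch using a
hypothetical cut-down Seq class.  The real Bio.Seq.Seq takes an alphabet
object; the plain string used for the alphabet here is an assumption made
to keep the example short.

```python
class Seq(object):
    """Hypothetical minimal Seq, only to illustrate __str__ and __repr__."""

    def __init__(self, data, alphabet="generic"):
        self.data = data
        self.alphabet = alphabet

    def __str__(self):
        # The full sequence as a plain string, with no truncation.
        return self.data

    def __repr__(self):
        # A string that can be typed back into Python to rebuild the object.
        return "Seq(%r, %r)" % (self.data, self.alphabet)

    def tostring(self):
        # Kept for backwards compatibility; same result as str(seq).
        return self.data


long_seq = Seq("ACGT" * 30, "dna")
rebuilt = eval(repr(long_seq))
print(len(str(long_seq)), rebuilt.alphabet)
```

Note that the round trip through eval(repr(...)) only works if __repr__
includes every character of the sequence, which is why truncation at 60
characters would have to go.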

>>Next, I'd like to check in some basic __doc__ strings for the
>>SeqRecord class, e.g. something like this:
> 
> Sounds good to me. Pretty amazing, actually, that SeqRecord doesn't have 
> documentation.

OK, basic __doc__ strings checked in,  Bio/SeqRecord.py revision 1.9

The Seq object also needs some love and attention in this area.

>>If you recall, for the fastest parsers turning the data into SeqRecord
>>and Seq objects imposed a fairly large overhead (compared to just
>>using strings):
>>
>>http://lists.open-bio.org/pipermail/biopython-dev/2006-July/002407.html
> 
> I wonder if this is still true if a Seq object and a SeqRecord object 
> inherit from string. From the code, I don't see where the overhead comes 
> from.

I was wondering what the overhead was too.

It could just be creating objects (Seq and SeqRecord) plus their
associated strings/list/dictionary (compared with just two strings, the
fasta title string and the sequence).
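
One way to check this guess would be to time object creation on its own,
away from any file parsing.  The sketch below uses hypothetical stand-in
Seq and SeqRecord classes (not the Biopython ones) and the standard
timeit module; the helper names are assumptions and no particular timings
are implied.

```python
import timeit


class Seq(object):
    """Hypothetical stand-in for Bio.Seq.Seq."""

    def __init__(self, data):
        self.data = data


class SeqRecord(object):
    """Hypothetical stand-in that always builds its containers."""

    def __init__(self, seq, id, description):
        self.seq = seq
        self.id = id
        self.description = description
        self.dbxrefs = []
        self.annotations = {}


def as_strings(title, sequence):
    # Baseline: just the two strings a fast FASTA parser would return.
    return (title, sequence)


def as_objects(title, sequence):
    # Same data wrapped in Seq and SeqRecord objects.
    return SeqRecord(Seq(sequence), title.split(None, 1)[0], title)


title, sequence = "gi|12345 example", "ACGT" * 250
t_strings = timeit.timeit(lambda: as_strings(title, sequence), number=100000)
t_objects = timeit.timeit(lambda: as_objects(title, sequence), number=100000)
print("strings: %.3fs  objects: %.3fs" % (t_strings, t_objects))
```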

My property change should reduce this a little bit, as for Fasta files
there is no need to create the dbxrefs list or the annotations
dictionary (unless or until the user records some information here after
creating the SeqRecord object).
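
The idea can be sketched as follows.  This is a hypothetical cut-down
class, not the attached SeqRecord.py, and the private attribute and
helper names are assumptions for illustration.

```python
class SeqRecord(object):
    """Hypothetical sketch; the real Bio.SeqRecord class differs."""

    def __init__(self, seq, id="<unknown id>", description=""):
        self.seq = seq
        self.id = id
        self.description = description
        # The containers are created only when first accessed.
        self._annotations = None
        self._dbxrefs = None

    def _get_annotations(self):
        if self._annotations is None:
            self._annotations = {}
        return self._annotations

    def _set_annotations(self, value):
        self._annotations = value

    def _get_dbxrefs(self):
        if self._dbxrefs is None:
            self._dbxrefs = []
        return self._dbxrefs

    def _set_dbxrefs(self, value):
        self._dbxrefs = value

    annotations = property(_get_annotations, _set_annotations)
    dbxrefs = property(_get_dbxrefs, _set_dbxrefs)


record = SeqRecord("ACGT", id="example")
untouched = record._annotations is None   # nothing allocated yet
record.annotations["source"] = "test"     # dictionary created on demand
print(untouched, record.annotations, record._dbxrefs)
```

From the caller's side record.annotations and record.dbxrefs still behave
like ordinary attributes, so existing code should not notice the change.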

Making SeqRecord subclass Seq might help here if only one object needs
to be created.

>>The backwards compatibility if statement is a bit
>>ugly - can we just assume Python 2.2 or later?
> 
> Biopython currently requires Python 2.3 or later.

Great - I'll ditch that nasty big if and just re-write the class to use
properties.

Revised version attached - should be functionally identical.

Peter

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: SeqRecord.py
URL: <http://lists.open-bio.org/pipermail/biopython-dev/attachments/20060817/79dd5fca/attachment.ksh>

From biopython-dev at maubp.freeserve.co.uk  Wed Aug 30 10:22:52 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Wed, 30 Aug 2006 11:22:52 +0100
Subject: [Biopython-dev] Recent bug reports not making it to the mailing list
Message-ID: <44F566FC.30407@maubp.freeserve.co.uk>

Once upon a time (early 2006?) whenever a bug was filed on the BugZilla, 
a copy was sent to the mailing list.

Not any more... and in the last month or so there have been several bugs 
filed which have been ignored.

Does anyone get automatic email notification?

Who should I ask to be included in any default email notification?

Thanks

Peter