From lpritc at scri.sari.ac.uk Tue Aug 1 06:42:37 2006
From: lpritc at scri.sari.ac.uk (Leighton Pritchard)
Date: Tue, 01 Aug 2006 11:42:37 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To:
References: <44CA162F.1040604@maubp.freeserve.co.uk> <44CA27B1.30107@maubp.freeserve.co.uk> <1154339988.1490.81.camel@lplinuxdev> <44CDF3AA.2020308@maubp.freeserve.co.uk> <1154355358.1490.116.camel@lplinuxdev> <44CE1E3C.2050502@maubp.freeserve.co.uk>
Message-ID: <1154428959.4871.11.camel@lplinuxdev>

On Mon, 2006-07-31 at 12:08 -0400, Marc Colosimo wrote:
> On Jul 31, 2006, at 11:14 AM, Peter (BioPython Dev) wrote:
>
> >>> The SeqUtils/quick_FASTA_reader is interesting in that it loads the
> >>> entire file into memory in one go, and then parses it. On the other
> >>> hand it's not perfect: I would use "\n>" as the split marker
> >>> rather than ">" which could appear in the description of a sequence.
> >>
> >> I agree (not that it's bitten me, yet), but I'd be inclined to go
> >> with "%s>" % os.linesep as the split marker, just in case.
> >
> > Good point. I wonder how many people even know this function exists?
> >
>
> The only problem with this is that if someone sends you a file not
> created on your system. [...]
> This has mostly simplified down to two - Unix and Windows - unless the
> person uses a Mac GUI app some of which use \r (CR) instead of \n
> (LF) where Windows uses \r\n (CRLF). I think the standard python
> distro comes with crlf.py and lfcr.py that can convert the line endings.

Also a good point.
I had a play about with regular expression splitting/substitution and the
SeqUtils.quick_FASTA_reader method to see if I could capture this
variability in line-endings:

def method_quick_FASTA_reader3(filename):
    txt = file(filename).read()
    entries = []
    split_marker = re.compile('^>', re.M)
    for entry in re.split(split_marker, txt)[1:]:
        name,seq= re.split('[\r\n]', entry, 1)
        seq = re.sub('\s', '', seq).upper()
        entries.append((name, seq))
    return "SeqUtils/quick_FASTA_reader (import re)", len(entries)

Using regular expressions in this way seems to slow things down to about
the same speed as the SeqIO parser, with the disadvantage of still having
to process the entries into SeqRecord objects (if that's what you want to
do with them). quick_FASTA_reader is a bit of a misnomer in this case, I
guess ;)

4.15s SeqIO.FASTA.FastaReader (for record in iterator)
3.95s SeqIO.FASTA.FastaReader (iterator.next)
4.13s SeqIO.FASTA.FastaReader (iterator[i])
1.89s SeqUtils/quick_FASTA_reader
1.03s pyfastaseqlexer/next_record
0.52s pyfastaseqlexer/quick_FASTA_reader
4.44s SeqUtils/quick_FASTA_reader (import re)

Results are typical for the 72000 record set, and this doesn't look to be
a promising route.

L.

--
Dr Leighton Pritchard AMRSC
D131, Plant-Pathogen Interactions, Scottish Crop Research Institute
Invergowrie, Dundee, Scotland, DD2 5DA, UK
T: +44 (0)1382 562731 x2405 F: +44 (0)1382 568578
E: lpritc at scri.sari.ac.uk W: http://bioinf.scri.sari.ac.uk/lp
GPG/PGP: FEFC205C E58BA41B http://www.keyserver.net
(If the signature does not verify, please remove the SCRI disclaimer)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed.
It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.sari.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). From pfefferp at staff.uni-marburg.de Tue Aug 1 08:02:25 2006 From: pfefferp at staff.uni-marburg.de (Patrick Pfeffer) Date: Tue, 01 Aug 2006 14:02:25 +0200 Subject: [Biopython-dev] GAs in Biopython Message-ID: <44CF42D1.8090209@staff.uni-marburg.de> Hi there, isn't there any documentation available for using the genetic algorithm available in the package? Thanks for any kind of help, Patrick -- ************************************* Dipl. Bioinf. Patrick Pfeffer Arbeitskreis Prof. Dr. G. 
Klebe
Institut für Pharmazeutische Chemie
Raum A116a
Fachbereich Pharmazie
Philipps-Universität Marburg
Marbacher Weg 6
35032 Marburg
Germany
Fon.: 06421/2825908
http://www.agklebe.de
e-mail: pfefferp at staff.uni-marburg.de
*************************************

From biopython-dev at maubp.freeserve.co.uk Tue Aug 1 16:53:08 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Tue, 01 Aug 2006 21:53:08 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <1154428959.4871.11.camel@lplinuxdev>
References: <44CA162F.1040604@maubp.freeserve.co.uk> <44CA27B1.30107@maubp.freeserve.co.uk> <1154339988.1490.81.camel@lplinuxdev> <44CDF3AA.2020308@maubp.freeserve.co.uk> <1154355358.1490.116.camel@lplinuxdev> <44CE1E3C.2050502@maubp.freeserve.co.uk> <1154428959.4871.11.camel@lplinuxdev>
Message-ID: <44CFBF34.7080106@maubp.freeserve.co.uk>

Peter wrote:
>>> The SeqUtils/quick_FASTA_reader is interesting in that it loads the
>>> entire file into memory in one go, and then parses it. On the other
>>> hand it's not perfect: I would use "\n>" as the split marker
>>> rather than ">" which could appear in the description of a sequence.

Leighton Pritchard replied:
>> I agree (not that it's bitten me, yet), but I'd be inclined to go
>> with "%s>" % os.linesep as the split marker, just in case.

Peter then wrote:
> Good point.

I take that back - I was right the first time ;)

You are right to worry about the line sep changing from platform to
platform, but you shouldn't use "%s>" % os.linesep

However, when reading windows style files on windows, the newlines appear
in python as just \n (as do newlines from unix files read on windows).
When writing text files on windows, again \n gets turned into CR LF on
the disk.

Just using "\n>" would work on any platform reading a FASTA file with the
expected newlines. As a bonus it would work on Windows when reading unix
style newlines.
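The "\n>" split suggested above can be illustrated with a minimal sketch. This is not the SeqUtils implementation; the function name split_fasta_text is made up for illustration, and the sketch assumes the text already uses "\n" line endings and returns plain (title, sequence) tuples.

```python
def split_fasta_text(text):
    """Split FASTA text into (title, sequence) tuples at newline-plus-'>'."""
    # Prepend a newline so the first record is split like all the others.
    chunks = ("\n" + text).split("\n>")[1:]
    entries = []
    for chunk in chunks:
        # The first line is the title; the rest is sequence data.
        title, _, body = chunk.partition("\n")
        # Drop all whitespace (including line breaks) from the sequence.
        entries.append((title, "".join(body.split())))
    return entries


# A ">" inside the title line does not start a new record,
# because only a ">" at the start of a line is a split point.
example = ">seq1 contains > in the title\nACGT\nTTGA\n>seq2\nGGCC\n"
records = split_fasta_text(example)
```

Splitting on a bare ">" instead would break the first title in this example into two pieces, which is the failure the thread is worried about.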
To get any platform to read newlines from any other platform what I
suggest is using "\n>" as the split string, but open the file in
universal text mode - this seems to work fine on Python 2.3, but I'm not
sure when universal newline reading was introduced.

For example, I created a simple file using the three newline conventions
(using the TextPad on Windows).

>>> import os
>>> import sys
>>> sys.platform
'win32'
>>> os.linesep
'\r\n'
>>> open("c:/temp/windows.txt","r").read()
'line\nline\n'
>>> open("c:/temp/mac.txt","r").read()
'line\rline\r'
>>> open("c:/temp/unix.txt","r").read()
'line\nline\n'

(Notice that using "\n>" wouldn't work when reading a Mac style file on
Windows)

>>> open("c:/temp/windows.txt","rU").read()
'line\nline\n'
>>> open("c:/temp/mac.txt","rU").read()
'line\nline\n'
>>> open("c:/temp/unix.txt","rU").read()
'line\nline\n'

Peter

From lpritc at scri.sari.ac.uk Wed Aug 2 05:25:27 2006
From: lpritc at scri.sari.ac.uk (Leighton Pritchard)
Date: Wed, 02 Aug 2006 10:25:27 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44CDDD10.4020904@maubp.freeserve.co.uk>
References: <44CA162F.1040604@maubp.freeserve.co.uk> <44CD5AF2.10708@c2b2.columbia.edu> <44CDDD10.4020904@maubp.freeserve.co.uk>
Message-ID: <1154510728.4871.66.camel@lplinuxdev>

On Mon, 2006-07-31 at 11:36 +0100, Peter (BioPython Dev) wrote:
> Question One
> ============
> Is reading sequence files an important function to you, and if so which
> file formats in particular (e.g. Fasta, GenBank, ...)

Yes. FASTA (sequence), GenBank, GFF, PTT, EMBL, ClustalW

> If you have had to write your own code to read a "common" file format
> which BioPython doesn't support, please get in touch.

EMBL and PTT (though PTT is pretty trivial, and my EMBL parser is not
pretty).
> Question Two - Reading Fasta Files > ================================== > Which of the following do you currently use (and why)?: > > (a) Bio.Fasta with the RecordParser (giving FastaRecord objects with a > title, and the sequence as a string) > (b) Bio.Fasta with the FeatureParser (giving SeqRecord objects) > (c) Bio.Fasta with your own parser (Could you tell us more?) > (d) Bio.SeqIO.FASTA.FastaReader (giving SeqRecord objects) > (e) Bio.FormatIO (giving SeqRecord objects) > (f) Other (Could you tell us more?) Mostly (f), a homegrown Pyrex/Flex parser. > Question Three - index_file based dictionaries > ============================================== > Do you use any of the following: > (a) Bio.Fasta.Dictionary > (b) Bio.Genbank.Dictionary > (c) Any other "Martel/Mindy" based dictionary which first requires > creation of an index using the index_file function No, but I do create dictionaries on-the-fly from (name, sequence) tuples, where necessary. > Question Four - Record Access... > ================================ > When loading a file with multiple sequences do you use: > > (a) An iterator interface(e.g. Bio.Fasta.Iterator) to give you the > records one by one in the order from the file. > > (b) A dictionary interface (e.g. Bio.Fasta.Dictionary) to give you > random access to the records using their identifier. > > (c) A list giving random access by index number (e.g. load the records > using an iterator but saving them in a list). > > Do you have any additional comments on this? For example, flexibility > versus memory requirements. Depending on what I need to do, I might use different approaches. If I'm filtering sequences on, say, sequence composition, I'll use an iterator. If I need to cross-reference sequences from the file to some other set of sequences by ID, I'll use a dictionary. In each case, I will generally either use a for loop or build a dictionary on-the-fly. 
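The two access styles described in this answer (filtering with an iterator, cross-referencing with a dictionary built on the fly) can be sketched as follows. The record data and the helper gc_fraction are invented for illustration; the sketch assumes records arrive as (name, sequence) tuples.

```python
def gc_fraction(seq):
    """Return the fraction of G and C letters in a sequence string."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


# Hypothetical parser output: (name, sequence) tuples.
records = [("seqA", "ATGC"), ("seqB", "GGCC"), ("seqC", "ATAT")]

# Iterator style: filter on sequence composition, one record at a time.
gc_rich = [name for name, seq in iter(records) if gc_fraction(seq) > 0.5]

# Dictionary style: build the lookup on the fly for access by identifier.
by_name = dict(records)
```

The dictionary costs memory proportional to the whole file, while the iterator only ever holds one record, which is the flexibility versus memory trade-off the question asks about.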
> Question Four - Fasta files: FastaRecord or SeqRecord > ===================================================== > If you use Fasta files, do you want get records returned as FastaRecords > or as SeqRecords? If SeqRecords, do you use your own title2ids mapping? I'd rather have SeqRecords. SeqRecords are particularly useful for annotations and attaching data to the sequence which, later, gets written out in some format other than FASTA sequence format. For operations where no further information is associated with the sequence, they offer equivalent functionality to FastaRecords. Currently I default to (name, seq) tuples, and only create SeqRecords when necessary, but this is only out of convenience for the parser I use. > Question Five - GenBank files: GenbankRecord or SeqRecord > ========================================================== > If you use GenBank files, do you use: > (a) Bio.Genbank.FeatureParser which returns SeqRecord objects > (b) Bio.Genbank.RecordParser which returns Bio.GenBank.Record objects > > Do you care much either way? For me the only significant difference is > that feature locations are held as objects in the SeqRecord, and as the > raw string in the Record. I use Bio.GenBank.FeatureParser because I prefer the storage of features (which are what I'm generally interested in) as SeqFeature objects. > Question Six - Martel, Scanners and Consumers > ============================================== > Some of BioPython's existing parsers (e.g. those using Martel) use an > event/callback model, where the scanner component generates parsing > events which are dealt with by the consumer component. > > Do any of you use this system to modify existing parser behaviour, or > use it as part of your own personal file parser? > > (a) I don't know, or don't care. I just the the parsers provided. > (b) I use this framework to modify a parser in order to do ... (please > provide details). 
I care mostly about performance on large files and the convenient representation of sequences and features. Where parsers have not been available (or quickly locatable) for file formats, such as EMBL, I have sometimes used the Bio.ParserSupport classes and the Scanner/Consumer pattern.

L.
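For readers unfamiliar with the Scanner/Consumer pattern mentioned in this message, here is a generic sketch of the event/callback idea. It does not use the Bio.ParserSupport classes; TupleConsumer, scan_fasta and the callback names are all hypothetical.

```python
import io


class TupleConsumer:
    """Consumer: turns parsing events into (title, sequence) tuples."""

    def __init__(self):
        self.records = []
        self._title = None
        self._chunks = []

    def start_record(self, title):
        self._title = title
        self._chunks = []

    def sequence_line(self, line):
        self._chunks.append(line)

    def end_record(self):
        self.records.append((self._title, "".join(self._chunks)))


def scan_fasta(handle, consumer):
    """Scanner: reads lines and fires events; it builds no objects itself."""
    in_record = False
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if in_record:
                consumer.end_record()
            consumer.start_record(line[1:])
            in_record = True
        elif in_record:
            consumer.sequence_line(line)
    if in_record:
        consumer.end_record()


consumer = TupleConsumer()
scan_fasta(io.StringIO(">a first\nACGT\nAC\n>b\nGG\n"), consumer)
```

Changing what the parser produces only means swapping the consumer; the scanner stays untouched.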
From biopython-dev at maubp.freeserve.co.uk Wed Aug 2 06:45:34 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Wed, 02 Aug 2006 11:45:34 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <1154510728.4871.66.camel@lplinuxdev>
References: <44CA162F.1040604@maubp.freeserve.co.uk> <44CD5AF2.10708@c2b2.columbia.edu> <44CDDD10.4020904@maubp.freeserve.co.uk> <1154510728.4871.66.camel@lplinuxdev>
Message-ID: <44D0824E.30808@maubp.freeserve.co.uk>

Leighton Pritchard wrote:
> On Mon, 2006-07-31 at 11:36 +0100, Peter (BioPython Dev) wrote:
>
>>Question One
>>============
>>Is reading sequence files an important function to you, and if so which
>>file formats in particular (e.g. Fasta, GenBank, ...)
>
> Yes. FASTA (sequence), GenBank, GFF, PTT, EMBL, ClustalW
>

PTT (Protein table files)
http://www.ibt.unam.mx/biocomputo/hom_make_db.html
(Anyone got an NCBI link for the file format?)

GFF (General Feature Format)
http://genome.ucsc.edu/goldenPath/help/customTrack.html#GFF
http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml

GFF and PTT aren't exactly what I would call sequence files, in that
they don't contain any sequence data. But thinking about it, maybe those
files could be turned into SeqRecords or SeqFeatures (with empty
sequences).

>
>>If you have had to write your own code to read a "common" file format
>>which BioPython doesn't support, please get in touch.
>
> EMBL and PTT (though PTT is pretty trivial, and my EMBL parser is not
> pretty).
>

It looks like there is enough overlap between EMBL and GenBank to make
sharing code between them a good idea. Certainly EMBL was a file format
I was thinking we should try to support.

Reading your other comments, it looks like you wouldn't miss FastaRecord
or GenBank records if they were phased out.

Personally, I'm suggesting that any Sequence IO framework standardise on
returning SeqRecord objects.
Does anyone know if SeqIO stood for Sequence or Sequential Input/Output?

I think we should have a generic "Sequence Iterator" object to do this
which takes a file handle, subclassed for each file format - giving a
"Fasta Iterator", a "Genbank Iterator", a "Clustal Iterator" etc.

I'm inclined not to give any choice of parser object (e.g.
Bio.Fasta.SequenceParser vs Bio.Fasta.RecordParser), and always return a
SeqRecord.

The individual readers should offer some level of control, for example
the title2ids function for Fasta files lets the user decide how the
title line should be broken up into id/name/description. Also for some
file formats the user should be able to specify the alphabet.

Peter

From hoffman at ebi.ac.uk Wed Aug 2 07:00:46 2006
From: hoffman at ebi.ac.uk (Michael Hoffman)
Date: Wed, 2 Aug 2006 12:00:46 +0100 (BST)
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44CDDD10.4020904@maubp.freeserve.co.uk>
References: <44CA162F.1040604@maubp.freeserve.co.uk> <44CD5AF2.10708@c2b2.columbia.edu> <44CDDD10.4020904@maubp.freeserve.co.uk>
Message-ID:

> Question One
> ============
> Is reading sequence files an important function to you, and if so which
> file formats in particular (e.g. Fasta, GenBank, ...)

Yes. FASTA.

> Question Two - Reading Fasta Files
> ==================================
> Which of the following do you currently use (and why)?:
>
> (f) Other (Could you tell us more?)

I have written my own short iterator so that my code is portable without
requiring Biopython to be installed.

> Question Three - index_file based dictionaries
> ==============================================
> Do you use any of the following:

No.

> Question Four - Record Access...
> ================================
> When loading a file with multiple sequences do you use:
>
> (a) An iterator interface (e.g. Bio.Fasta.Iterator) to give you the
> records one by one in the order from the file.

Yes.
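A short standalone iterator of the kind described in this reply might look like the sketch below. It is not the respondent's actual code, which was not posted; iter_fasta is a hypothetical name, and the sketch assumes a text-mode file handle and yields plain (title, sequence) tuples with the title line left unparsed.

```python
import io


def iter_fasta(handle):
    """Yield (title, sequence) tuples from a FASTA file handle, in file order."""
    title, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new title line closes the previous record, if any.
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line.strip())
    # Emit the final record once the handle is exhausted.
    if title is not None:
        yield title, "".join(chunks)


handle = io.StringIO(">x desc\nAC\nGT\n>y\nTT\n")
parsed = list(iter_fasta(handle))
```

Because it is a generator, only one record is held in memory at a time; wrapping it in list() or dict() gives the list and dictionary access styles from the survey.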
> Question Four - Fasta files: FastaRecord or SeqRecord > ===================================================== > If you use Fasta files, do you want get records returned as FastaRecords > or as SeqRecords? If SeqRecords, do you use your own title2ids mapping? SeqRecords. I hate it when an interface tries to parse the definition line for me. Perhaps a set of standard definition line parsers should be provided so that one can choose, but usually I would rather have plain text and parse it myself. > Question Six - Martel, Scanners and Consumers > ============================================== > Some of BioPython's existing parsers (e.g. those using Martel) use an > event/callback model, where the scanner component generates parsing > events which are dealt with by the consumer component. > > Do any of you use this system to modify existing parser behaviour, or > use it as part of your own personal file parser? No. -- Michael Hoffman European Bioinformatics Institute From lpritc at scri.sari.ac.uk Wed Aug 2 07:23:27 2006 From: lpritc at scri.sari.ac.uk (Leighton Pritchard) Date: Wed, 02 Aug 2006 12:23:27 +0100 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc In-Reply-To: <44D0824E.30808@maubp.freeserve.co.uk> References: <44CA162F.1040604@maubp.freeserve.co.uk> <44CD5AF2.10708@c2b2.columbia.edu> <44CDDD10.4020904@maubp.freeserve.co.uk> <1154510728.4871.66.camel@lplinuxdev> <44D0824E.30808@maubp.freeserve.co.uk> Message-ID: <1154517808.4871.93.camel@lplinuxdev> On Wed, 2006-08-02 at 11:45 +0100, Peter (BioPython Dev) wrote: > GFF (General Feature Format) > > http://genome.ucsc.edu/goldenPath/help/customTrack.html#GFF > http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml > > GFF and PTT aren't exactly what I would call sequence files, in that > they don't contain any sequence data. 
Fair point, but GFF3 (see below) can optionally carry sequence data, and I use them for exactly what you say here: > those files could be turned into SeqRecords or SeqFeatures (with empty > sequences). I was thinking that GFF3 would be more useful than GFF: http://song.sourceforge.net/gff3.shtml NCBI have already gone over to this on bacterial genomes, at least, (e.g. ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans/NC_005213.gff), and it's a much richer format than the original specification. Andrew Dalke has already written a GFF3 parser/writer, which is available at http://www.dalkescientific.com/PyGFF3-0.5.tar.gz I've not used this in anger, yet... > Its looks like there is enough overlap between the EMBL and Genbank to > make sharing code between them a good idea. Certainly EMBL was a file > format I was thinking we should try to support. In a scanner/consumer pattern it's easy enough. I've not looked under the hood of the new GenBank parser yet, to see what you've done. Most of my contact with EMBL format is with headerless feature tables and Artemis, which aren't directly similar to GenBank entries. > Reading your other comments, it looks like you wouldn't miss FastaRecord > or GenBank records if they were phased out. Not personally, but others may have strong opinions and breakable code, yet. > Personally, I'm suggesting we try and standardise on having any Sequence > IO framework standardize on returning SeqRecord objects. > > I think we should have a generic "Sequence Iterator" object to do this > which takes a file handle, subclassed for each file format - giving a > "Fasta Iterator", a "Genbank Iterator", a "Clustal Iterator" etc. > I'm inclined not to give any choice of parser object (e.g. > Bio.Fasta.SequenceParser vs Bio.Fasta.RecordParser), and always return a > SeqRecord. It may be a side-issue, but should a Clustal parser return an Alignment object or iterate over SeqRecord objects? 
And for that matter, what about other MSA files in FASTA format? I think we ought to allow parsers to return an Alignment where the user requests it, which is a functionality I'm not currently aware of in the FASTA sequence parsers.

> The individual readers should offer some level of control, for example
> the title2ids function for Fasta files lets the user decide how the
> title line should be broken up into id/name/description. Also for some
> file formats the user should be able to specify the alphabet.

Could the alphabet be optionally specified by the user on parsing, and
maybe return a warning or error if there are non-compliant symbols in
the file, as a quick validator for bad sequences, or reminder to the
occasionally forgetful that, for example, they're not working with
nucleotide sequences, today ;)

L.

From biopython-dev at maubp.freeserve.co.uk Wed Aug 2 08:56:23 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Wed, 02 Aug 2006 13:56:23 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <1154517808.4871.93.camel@lplinuxdev>
References: <44CA162F.1040604@maubp.freeserve.co.uk> <44CD5AF2.10708@c2b2.columbia.edu> <44CDDD10.4020904@maubp.freeserve.co.uk> <1154510728.4871.66.camel@lplinuxdev> <44D0824E.30808@maubp.freeserve.co.uk> <1154517808.4871.93.camel@lplinuxdev>
Message-ID: <44D0A0F7.1020402@maubp.freeserve.co.uk>

Leighton Pritchard wrote:
> Fair point, but GFF3 (see below) can optionally carry sequence data,
> and I use them for exactly what you say here:
>
>> maybe those files could be turned into SeqRecords or SeqFeatures
>> (with empty sequences).
>
> I was thinking that GFF3 would be more useful than GFF:
>
> http://song.sourceforge.net/gff3.shtml
>

Thanks for the links... interesting that GFF3 allows embedding Fasta
sequences.

>> Reading your other comments, it looks like you wouldn't miss
>> FastaRecord or GenBank records if they were phased out.
>
> Not personally, but others may have strong opinions and breakable
> code, yet.

There is no need to remove the current modules, just mark them as
deprecated. Of course, if there is some strong support for these objects
then we might not want to be so harsh...

> It may be a side-issue, but should a Clustal parser return an
> Alignment object or iterate over SeqRecord objects? And for that
> matter, what about other MSA files in FASTA format?
I think we ought to
> allow parsers to return an Alignment where the user requests it,
> which is a functionality I'm not currently aware of in the FASTA
> sequence parsers.

In my opinion we should offer both.

I would go for loading clustal/fasta alignments as sequence iterators
(as part of the new SeqIO code) and make it very easy to turn ANY
sequence iterator returning SeqRecords into an alignment.

The current alignment object stores its sequences as SeqRecords
internally but doesn't (yet) allow simple addition of SeqRecords - that
would have to be fixed but it looks easy enough. Accepting a
SequenceIterator for __init__ would also be nice.

>> The individual readers should offer some level of control, for
>> example the title2ids function for Fasta files lets the user decide
>> how the title line should be broken up into id/name/description.
>> Also for some file formats the user should be able to specify the
>> alphabet.
>
> Could the alphabet be optionally specified by the user on parsing,
> and maybe return a warning or error if there are non-compliant
> symbols in the file, as a quick validator for bad sequences, or
> reminder to the occasionally forgetful that, for example, they're not
> working with nucleotide sequences, today ;)

For some file formats the parser should be able to deduce the alphabet,
but for others, like Fasta, it must be specified.

I like the idea of optionally checking the alphabet - but it would
impose a speed penalty. Do you think this should be done by the
SeqRecord object (on request)? Each parser could simply ask the
SeqRecord object to verify itself before returning it.
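The optional alphabet check discussed here could be sketched along these lines. This is only an illustration of the idea, not Biopython's Alphabet or SeqRecord API; DNA_LETTERS, invalid_symbols and check_sequence are hypothetical names, and the sequence is assumed to be a plain string.

```python
# Hypothetical letter set; a real alphabet would likely also
# include the full set of IUPAC ambiguity codes.
DNA_LETTERS = frozenset("ACGTN")


def invalid_symbols(seq, letters):
    """Return the sorted symbols in seq that are not in the allowed letters."""
    return sorted(set(seq.upper()) - set(letters))


def check_sequence(name, seq, letters, strict=False):
    """Check a sequence on request; raise only when strict is set."""
    bad = invalid_symbols(seq, letters)
    if bad and strict:
        raise ValueError("%s contains non-alphabet symbols: %s"
                         % (name, "".join(bad)))
    return bad


# Protein-like letters in a supposedly nucleotide sequence are reported.
problems = check_sequence("seq1", "ACGTMKLV", DNA_LETTERS)
```

Keeping the check as a separate call, made only on request, avoids the speed penalty for users who trust their input.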
Peter

From Leighton.Pritchard at scri.ac.uk Wed Aug 2 05:00:20 2006
From: Leighton.Pritchard at scri.ac.uk (Leighton Pritchard)
Date: Wed, 2 Aug 2006 10:00:20 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44CFBF34.7080106@maubp.freeserve.co.uk>
References: <44CA162F.1040604@maubp.freeserve.co.uk> <44CA27B1.30107@maubp.freeserve.co.uk> <1154339988.1490.81.camel@lplinuxdev> <44CDF3AA.2020308@maubp.freeserve.co.uk> <1154355358.1490.116.camel@lplinuxdev> <44CE1E3C.2050502@maubp.freeserve.co.uk> <1154428959.4871.11.camel@lplinuxdev> <44CFBF34.7080106@maubp.freeserve.co.uk>
Message-ID: <1154509221.4871.40.camel@lplinuxdev>

An embedded and charset-unspecified text was scrubbed...
Name: not available
Url: http://lists.open-bio.org/pipermail/biopython-dev/attachments/20060802/605b8b80/attachment.pl
-------------- next part --------------
An embedded message was scrubbed...
From: "Leighton Pritchard"
Subject: Re: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
Date: Wed, 2 Aug 2006 10:00:20 +0100
Size: 4641
Url: http://lists.open-bio.org/pipermail/biopython-dev/attachments/20060802/605b8b80/attachment.mht

From lpritc at scri.sari.ac.uk Wed Aug 2 05:02:03 2006
From: lpritc at scri.sari.ac.uk (Leighton Pritchard)
Date: Wed, 02 Aug 2006 10:02:03 +0100
Subject: [Biopython-dev] [Fwd: Re: Reading sequences: FormatIO, SeqIO, etc]
Message-ID: <1154509323.4871.42.camel@lplinuxdev>

(this time without the signature)

-------------- next part --------------
An embedded message was scrubbed...
From: Leighton Pritchard
Subject: Re: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
Date: Wed, 02 Aug 2006 10:00:20 +0100
Size: 3943
Url: http://lists.open-bio.org/pipermail/biopython-dev/attachments/20060802/5c8fac79/attachment.mht

From mdehoon at c2b2.columbia.edu Thu Aug 3 23:20:18 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Thu, 03 Aug 2006 23:20:18 -0400
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
Message-ID: <44D2BCF2.9010500@c2b2.columbia.edu>

> Question One
> ============
>
> Is reading sequence files an important
> function to you, and if so which file formats in particular (e.g.
> Fasta, GenBank, ...)
>

I use Fasta, GenBank, and occasionally clustalw.
> > Question Two - Reading Fasta Files > ================================== > Which of the following do you currently use (and why)?: > > (a) Bio.Fasta with the RecordParser (giving FastaRecord objects with > a title, and the sequence as a string) (b) Bio.Fasta with the > FeatureParser (giving SeqRecord objects) (c) Bio.Fasta with your own > parser (Could you tell us more?) (d) Bio.SeqIO.FASTA.FastaReader > (giving SeqRecord objects) (e) Bio.FormatIO (giving SeqRecord > objects) (f) Other (Could you tell us more?) I use Bio.Fasta with the RecordParser, but just because it's easy to find in the documentation. As a user, I think Bio.Fasta requires too many steps to be typed in; I would prefer something more straightforward. For the output format, I don't care so much, but for the sake of consistency a SeqRecord may be preferable. > > Question Three - index_file based dictionaries > ============================================== Do you use any of the > following: (a) Bio.Fasta.Dictionary (b) Bio.Genbank.Dictionary (c) > Any other "Martel/Mindy" based dictionary which first requires > creation of an index using the index_file function > No. I never really understood index files. > > Question Four - Record Access... > ================================ > When loading a file with multiple sequences do you use: > > (a) An iterator interface(e.g. Bio.Fasta.Iterator) to give you the > records one by one in the order from the file. > > (b) A dictionary interface (e.g. Bio.Fasta.Dictionary) to give you > random access to the records using their identifier. > > (c) A list giving random access by index number (e.g. load the > records using an iterator but saving them in a list). I use (a). It's easy to create (b) or (c), if needed, if (a) is available. > > Question Four - Fasta files: FastaRecord or SeqRecord > ===================================================== If you use > Fasta files, do you want get records returned as FastaRecords or as > SeqRecords? 
If SeqRecords, do you use your own title2ids mapping? > > For example, > >> name text text text > ACGTACACGT > > As a FastaRecord this would have: > > FastaRecord.title = "name text text text" (string) > FastaRecord.sequence= "ACGTACACGT" (string) > > As a SeqRecord (with the default title2ids mapping): > > SeqRecord.id = (default string) SeqRecord.name = (default string) > SeqRecord.description = "name text text text" (string) SeqRecord.seq > = Seq("ACGTACACGT", alphabet) I use the FastaRecord, but again for no particular reason. I have not experienced an advantage of Seq objects over simple strings, so for me the fact that FastaRecord contains a simple string is more convenient. But it doesn't matter much. > Question Five - GenBank files: GenbankRecord or SeqRecord > ========================================================== If you use > GenBank files, do you use: (a) Bio.Genbank.FeatureParser which > returns SeqRecord objects (b) Bio.Genbank.RecordParser which returns > Bio.GenBank.Record objects > I don't care so much, but I think that having two record types is confusing, so it would be better if we could decide on one. A SeqRecord is more general than a Bio.GenBank.Record, so I have a slight preference for a SeqRecord. > > Question Six - Martel, Scanners and Consumers > ============================================== Some of BioPython's > existing parsers (e.g. those using Martel) use an event/callback > model, where the scanner component generates parsing events which are > dealt with by the consumer component. > > Do any of you use this system to modify existing parser behaviour, or > use it as part of your own personal file parser? > > (a) I don't know, or don't care. I just the the parsers provided. > (b) I use this framework to modify a parser in order to do ... > (please provide details). > (a). Often, I'm just at the Python prompt typing away. What I like about Python and Numerical Python is that the commands are often obvious and easy to remember. 
With the parser framework, on the other hand, I always need to look up in the documentation how to use them. --Michiel From dag at sonsorol.org Fri Aug 4 06:38:52 2006 From: dag at sonsorol.org (Chris Dagdigian) Date: Fri, 4 Aug 2006 06:38:52 -0400 Subject: [Biopython-dev] Fwd: contributing comparative genomics tools References: <22DA57C5-461D-48BE-B524-47108330CD80@chem.ucla.edu> Message-ID: <9AFBA2D3-B8DF-4337-A54A-019F6EAFFC38@sonsorol.org> Begin forwarded message: > From: Christopher Lee > Date: August 3, 2006 9:11:42 PM EDT > To: biopython-dev-owner at lists.open-bio.org > Subject: Fwd: contributing comparative genomics tools > > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Hi, > there appears to be an error in your code submission instructions > on the biopython.org/wiki, or in the configuration of the biopython- > dev list server. The code submission instructions tell me to > submit my proposal by email to biopython-dev at biopython.org, but the > list server responds by saying that all mail will automatically be > rejected! Please forward this proposal to the appropriate people > (presumably biopython-dev?), and let me know that you have done > so. Otherwise I won't have any way of knowing whether anyone even > reads this email address... > > Yours with thanks, > > Chris Lee, Dept. of Chemistry & Biochemistry, UCLA > > Begin forwarded message: > >> You are not allowed to post to this mailing list, and your message >> has >> been automatically rejected. If you think that your messages are >> being rejected in error, contact the mailing list owner at >> biopython-dev-owner at lists.open-bio.org. >> >> >> From: Christopher Lee >> Date: August 3, 2006 3:55:52 PM PDT >> To: biopython-dev at biopython.org >> Cc: Namshin Kim >> Subject: contributing comparative genomics tools >> >> >> Hi Biopython developers, >> I'd like to contribute some Python tools that my lab has been >> developing for large-scale comparative genomics database query. 
>> These tools make it easy to work with huge multigenome alignment
>> databases (e.g. the UCSC Genome Browser multigenome alignments)
>> using a new disk-based interval indexing algorithm that gives very
>> high performance with minimal memory usage. For example, whereas
>> queries of the UCSC 17-genome alignment typically take about 30 sec.
>> per query using MySQL, the same query takes about 200 microsec.
>> with these tools, making it possible to run huge numbers of queries
>> for genome-wide studies.
>>
>> Here's an example usage (click the URL or just look at the code
>> below)
>> http://bioinfo.mbi.ucla.edu/pygr_0_5_0/seq-align.html#SECTION000125000000000000000
>>
>> We've tested this code very extensively in our own research, and
>> it has had four open source releases so far. At this point the
>> code is in production use. All the code is compatible back to
>> Python version 2.2, but not 2.1 or before (we use generators).
>> There is C code (accessed as Python classes) for the
>> high-performance interval database index. For details of history see
>> the website
>> http://www.bioinformatics.ucla.edu/pygr
>>
>> There is also extensive tutorial and reference documentation:
>> http://bioinfo.mbi.ucla.edu/pygr_0_5_0/
>>
>> Let me know what questions you have, and what process we would
>> need to follow to contribute this code.
>>
>> Yours with best wishes,
>>
>> Chris Lee, Dept. of Chemistry & Biochemistry, UCLA
>>
>>
>> ####### EXAMPLE USAGE
>> from pygr import cnestedlist
>> msa=cnestedlist.NLMSA('/usr/tmp/ucscDB/mafdb','r') # OPEN THE ALIGNMENT DB
>>
>> def printResults(prefix,msa,site,altID='NULL',cluster_id='NULL',seqNames=None):
>>     'get alignment of each genome to site, print %identity and %aligned'
>>     for src,dest,edge in msa[site].edges(mergeMost=True): # ALIGNMENT QUERY!
>>         print '%s\t%s\t%s\t%s\t%2.1f\t%2.1f\t%s\t%s' \
>>               %(altID,cluster_id,prefix,seqNames[dest],
>>                 100.*edge.pIdentity(),100.*edge.pAligned(),src[:2],dest[:2])
>>
>> def getAlt3Conservation(msa,gene,start1,start2,stop,**kwargs):
>>     'gene must be a slice of a sequence in our genome alignment msa'
>>     ss1=gene[start1-2:start1] # USE SPLICE SITE COORDINATES
>>     ss2=gene[start2-2:start2]
>>     ss3=gene[stop:stop+2]
>>     e1=ss1+ss2 # GET INTERVAL BETWEEN PAIR OF SPLICE SITES
>>     e2=gene[max(start1,start2):stop] # GET INTERVAL BETWEEN e1 AND stop
>>     zone=e1+ss3 # USE zone AS COVERING INTERVAL TO BUNDLE fastacmd REQUESTS
>>     cache=msa[zone].keys(mergeMost=True) # PYGR BUNDLES REQUESTS TO MINIMIZE TRAFFIC
>>     for prefix,site in [('ss1',ss1),('ss2',ss2),('ss3',ss3),
>>                         ('e1',e1),('e2',e2)]:
>>         printResults(prefix,msa,site,seqNames=~(msa.seqDict),**kwargs)
>>
>> # RUN A QUERY LIKE THIS...
>> # getAlt3Conservation(msa,some_gene,some_start,other_start,stop)
>>
>> ############ EXPLANATION & NOTES
>> David Haussler's group has constructed alignments of multiple
>> genomes. These alignments are extremely useful and interesting,
>> but so large that it is cumbersome to work with the dataset using
>> conventional methods. For example, for the 8-genome alignment you
>> have to work simultaneously with the individual genome datasets
>> for human, chimp, mouse, rat, dog, chicken, fugu and zebrafish, as
>> well as the huge alignment itself. Pygr makes this quite easy.
>> Here we illustrate an example of mapping an alternative 3' exon,
>> which has two alternative splice sites (start1 and start2) and a
>> single terminal splice site (stop). We use the alignment database
>> to map each of these splice sites onto all the aligned genomes,
>> and to print the percent-identity and percent-aligned for each
>> genome, as well as the two nucleotides constituting the splice site
>> itself.
To examine the conservation of the two exonic regions
>> (between start1 and start2, and the adjacent region terminated by
>> stop), we print the same information for each genome's alignment to
>> these two regions as well. The code first opens the alignment
>> database. The function (getAlt3Conservation) obtains sequence
>> slice objects representing the various ``sites'' to be queried.
>> The actual alignment database query is performed in printResults:
>>
>>   * The alignment database query is in the first line of
>> printResults(). msa is the database; site is the interval query;
>> and the edges method iterates over the results, returning a tuple
>> for each, consisting of a source sequence interval (i.e. an
>> interval of site), a destination sequence interval (i.e. an
>> interval in an aligned genome), and an edge object describing that
>> alignment. We are taking advantage of Pygr's group-by operator
>> mergeMost, which will cause multiple intervals in a given sequence
>> to be merged into a single interval that constitutes their
>> ``union''. Thus, for each aligned genome, the edges iterator will
>> return a single aligned interval. The alignment edge object
>> provides some useful conveniences, such as calculating the
>> percent-identity between src and dest automatically for you.
>> pIdentity() computes the fraction of identical residues; pAligned()
>> computes the fraction of aligned residues (allowing you to see if
>> there are big gaps or insertions in the alignment of this interval).
>> If we had wanted to inspect the detailed alignment letter by letter,
>> we would just iterate over the letters attribute instead of the edges
>> method. (See the NLMSASlice documentation for further information.)
>>
>>   * src[:2] and dest[:2] print the first two nucleotides of the
>> site in gene and in the aligned genome.
>> >> * it's worth noting that the actual sequence string >> comparisons are being done using a completely different database >> mechanism (formerly NCBI's fastacmd, now our own (much faster) >> pureseq text format), not the cnestedlist database. Basically, >> each genome is being queried as a separate BLAST formatted >> database, represented in Pygr by the BlastDB class. Pygr makes >> this complex set of multi-database operations more or less >> transparent to the user. For further information, see the BlastDB >> documentation. >> >> * The other operations here are entirely vanilla: mainly >> slicing a gene sequence to obtain the specific sites that we want >> to query. Note: gene must itself be a slice of a sequence in our >> alignment, or the alignment query msa[site] will raise an >> IndexError informing the user that the sequence site is not in the >> alignment. >> >> * The only slightly interesting operation here is the use of >> interval addition to obtain the ``union'' of two intervals, e.g. >> e1=ss1+ss2. This obtains a single interval that contains both of >> the input intervals. >> >> * When the print statement requests str() representations of >> these sequence objects, Pygr uses fastacmd -L to extract just the >> right piece of the corresponding chromosomes from the eight BLAST >> databases. >> >> (Actually, because of Pygr's caching / optimizations, considerably >> more is going on than indicated in this simplified sketch. But you >> get the idea: Pygr makes it relatively effortless to work with a >> variety of disparate (and large) resources in an integrated way.) 
>>
>> Here is some example output:
>>
>> 1  Mm.99996  ss1  hg17     50.0   100.0  AG  GG
>> 1  Mm.99996  ss1  canFam1  50.0   100.0  AG  GG
>> 1  Mm.99996  ss1  panTro1  50.0   100.0  AG  GG
>> 1  Mm.99996  ss1  rn3      100.0  100.0  AG  AG
>> 1  Mm.99996  ss2  hg17     100.0  100.0  AG  AG
>> 1  Mm.99996  ss2  canFam1  100.0  100.0  AG  AG
>> 1  Mm.99996  ss2  panTro1  100.0  100.0  AG  AG
>> 1  Mm.99996  ss2  rn3      100.0  100.0  AG  AG
>> 1  Mm.99996  ss3  hg17     100.0  100.0  GT  GT
>> 1  Mm.99996  ss3  canFam1  100.0  100.0  GT  GT
>> 1  Mm.99996  ss3  panTro1  100.0  100.0  GT  GT
>> 1  Mm.99996  ss3  rn3      100.0  100.0  GT  GT
>> 1  Mm.99996  e1   hg17     78.9   100.0  AG  GG
>> 1  Mm.99996  e1   canFam1  84.2   100.0  AG  GG
>> 1  Mm.99996  e1   panTro1  77.6   100.0  AG  GG
>> 1  Mm.99996  e1   rn3      97.4   98.7   AG  AG
>> 1  Mm.99996  e2   hg17     91.6   99.1   CC  CC
>> 1  Mm.99996  e2   canFam1  88.8   99.1   CC  CC
>> 1  Mm.99996  e2   panTro1  91.6   99.1   CC  CC
>> 1  Mm.99996  e2   rn3      97.2   100.0  CC  CC
>>
>>
>> -----BEGIN PGP SIGNATURE-----
>> Version: GnuPG v1.4.2.2 (Darwin)
>>
>> iD8DBQFE0n8GLQ4dB3bqQz4RApcxAKCIHdZ9mttB1uC4HkY3xXEw1cWYswCeIg4i
>> xhxE2zrffLaiCjSiEp4Eo6k=
>> =BeOe
>> -----END PGP SIGNATURE-----
>>
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.2.2 (Darwin)
>
> iD8DBQFE0p7iLQ4dB3bqQz4RAkzJAJ4wxiZqi7lZGBUMTFwyquGOCajiKQCfUDBm
> Wx/4AIstFjb+rbqY2QBppLg=
> =fghY
> -----END PGP SIGNATURE-----

From biopython-dev at maubp.freeserve.co.uk Sat Aug 12 04:25:41 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Sat, 12 Aug 2006 09:25:41 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
Message-ID: <320fb6e00608120125t53f11b6et2bde5154e5bc19ec@mail.gmail.com>

I'm having a few issues with my email setup, which is why I haven't
replied recently.

A week ago I filed bug 2059 for this discussion, and attached some code:

http://bugzilla.open-bio.org/show_bug.cgi?id=2059

I'm interested in your feedback - from the framework down to whether
you dislike the class names, for example.

Peter

From krewink at inb.uni-luebeck.de Wed Aug 16 08:44:07 2006
From: krewink at inb.uni-luebeck.de (Albert Krewinkel)
Date: Wed, 16 Aug 2006 14:44:07 +0200
Subject: [Biopython-dev] Fwd: contributing comparative genomics tools
Message-ID: <20060816124407.GF12386@pc09.inb.uni-luebeck.de>

Hello,

I read Peter's SeqIO/__init__.py replacement and if I may say so: I
love it. Thanks a lot for this! Still, there are some things I'd
like to talk about.

The _parse_genbank_features function could also be used to parse EMBL
or DDBJ features, therefore I think it should be named differently.

Since there is a lot of clean up effort right now: How about moving
the SeqRecord and SeqFeature objects into the Bio.Seq module? They
are closely related and separate modules only clutter the namespace.

To me, this seems to be a general problem. It's very difficult to find
a tool to use for a certain problem if one doesn't already know what
to look for. I'd pretty much favour creating modules like
Bio.structure to group modules like Bio.PDB and Bio.NMR etc. This is
a very big change, and therefore I'd like to follow Marc's suggestion
of splitting off a branch. In general, I pretty much agree with what
Marc said in his .

I cannot estimate how much work it would be to maintain two separate
biopython distributions, so please forgive me if I re-suggest
something completely idiotic here. I just don't believe there is much
that could be lost that way.
Cheers, Albert -- Albert Krewinkel University of Luebeck, Institute for Neuro- and Bioinformatics From biopython-dev at maubp.freeserve.co.uk Wed Aug 16 10:00:36 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Wed, 16 Aug 2006 15:00:36 +0100 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc Message-ID: <320fb6e00608160700t33ed0f4ds7311124d70781a64@mail.gmail.com> (I changed the subject to that of the previous discussion, as this isn't really about "contributing comparative genomics tools") Albert Krewinkel wrote: > Hello, > > I read Peter's SeqIO/__init__.py replacement and if I may say so: I > love it. Thanks a lot for this! Still, there are some things I'd > like to talk about. Thank you :) The code is on Bug 2059 for anyone who hasn't looked yet. http://bugzilla.open-bio.org/show_bug.cgi?id=2059 > The _parse_genbank_features function could also be used to parse embl > or ddjb features, therefore I think it should be named differently. First of all, that bit of code is for a new feature which I personally wanted - to be able to iterate over CDS features in a genbank file. But yes, I did have in mind that it (and the GenBank parser) could be re-used to deal with EMBL files. I have not yet taken the time to learn the EMBL file format and how it corresponds to the GenBank file format - but I agree a lot of the code could be shared. > Since there is a lot of clean up effort right now: How about moving > the SeqRecord and SeqFeature objects into the Bio.Seq module? They > are closely related and seperate modules only clutter the namespace. What real benefit does that give us? It will cause a certain amount of upheaval in the short term as people will have to change their import statements on existing scripts. If we do start a new branch for "big changes" then I have no real problem with this suggest. > To me, this seems to be a general problem. 
It's very difficult to find
> a tool to use for a certain problem if one doesn't already know what
> to look for. I'd pretty much favour to create modules like
> Bio.structure to group modules like Bio.PDB and Bio.NMR etc. This is
> a very big change, and therefore I'd like to follow Marc's suggestion
> of splitting off a branch. In general, I pretty much agree with what
> Marc said in his .
>
> I cannot estimate how much work it would be to maintain two separate
> biopython distributions, so please forgive me if I re-suggest
> something completely idiotic here. I just don't believe there is much
> that could be lost that way.

BioPython probably would benefit from a little reorganising - and for
anything drastic like moving entire modules about, a new branch makes
sense. On the other hand, do we have the man-power to do it? Are any
of the developers familiar with all of (or even most of) the existing
modules? I would guess I have used less than half of the modules - I
have looked at the very basics of Bio.PDB for example, but have never
tried Bio.NMR.

I would favour gradual, incremental (and backwards compatible) changes,
such as adding a new sequence reading module and then marking the old
code as deprecated.

As examples of some small changes, have any of you looked at:

Bug 2057 - SeqRecord has no __str__ or __repr__
http://bugzilla.open-bio.org/show_bug.cgi?id=2057

Bug 1963 - Adding __str__ method to codon tables and translators
http://bugzilla.open-bio.org/show_bug.cgi?id=1963

Little things in themselves that I think would help.
Peter From krewink at inb.uni-luebeck.de Wed Aug 16 10:44:36 2006 From: krewink at inb.uni-luebeck.de (Albert Krewinkel) Date: Wed, 16 Aug 2006 16:44:36 +0200 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc In-Reply-To: <320fb6e00608160700t33ed0f4ds7311124d70781a64@mail.gmail.com> References: <320fb6e00608160700t33ed0f4ds7311124d70781a64@mail.gmail.com> Message-ID: <20060816144436.GG12386@pc09.inb.uni-luebeck.de> On Wed, Aug 16, 2006 at 03:00:36PM +0100, Peter wrote: > Albert Krewinkel wrote: > >The _parse_genbank_features function could also be used to parse embl > >or ddjb features, therefore I think it should be named differently. > > First of all, that bit of code is for a new feature which I personally > wanted - to be able to iterate over CDS features in a genbank file. > > But yes, I did have in mind that it (and the GenBank parser) could be > re-used to deal with EMBL files. I have not yet taken the time to > learn the EMBL file format and how it corresponds to the GenBank file > format - but I agree a lot of the code could be shared. I will try to build something similar for EMBL files within the next days. This should be easy, since features really should look the same in both formates: http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html > >Since there is a lot of clean up effort right now: How about moving > >the SeqRecord and SeqFeature objects into the Bio.Seq module? They > >are closely related and seperate modules only clutter the namespace. > > What real benefit does that give us? It will cause a certain amount > of upheaval in the short term as people will have to change their > import statements on existing scripts. If we do start a new branch > for "big changes" then I have no real problem with this suggest. Agree. > >To me, this seems to be a general problem. It's very difficult to find > >a tool to use for a certain problem if one doesn't allready know what > >to look for. 
I'd pretty much favour to create modules like
> >Bio.structure to group modules like Bio.PDB and Bio.NMR etc. This is
> >a very big change, and therefore I'd like to follow Marc's suggestion
> >of splitting off a branch. In general, I pretty much agree with what
> >Marc said in his .
> >
> >I cannot estimate how much work it would be to maintain two separate
> >biopython distributions, so please forgive me if I re-suggest
> >something completely idiotic here. I just don't believe there is much
> >that could be lost that way.
>
> BioPython probably would benefit from a little reorganising - and for
> anything drastic like moving entire modules about, a new branch makes
> sense. On the other hand, do we have the man-power to do it? Are any
> of the developers familiar with all of (or even most of) the existing
> modules? I would guess I have used less than half of the modules - I
> have looked at the very basics of Bio.PDB for example, but have never
> tried Bio.NMR

I attached a file which I created when I was teaching myself
biopython. It provides a basic grouping for the current biopython
modules. Naturally, it's by no means complete and probably wrong in
some places.

> I would favour gradual incremental (and backwards compatible) changes.
> Such as adding a new sequence reading module and then marking the old
> code as deprecated.

I think we could do both: A new branch might make it easier to see
which modules are useful the way they are and which are not. Even if
this separate branch is never released itself, it would still be handy
for coordinating the reorganisation.

> For example of some small changes, have any of you looked at:
>
> Bug 2057 - SeqRecord has no __str__ or __repr__
> http://bugzilla.open-bio.org/show_bug.cgi?id=2057
>
> Bug 1963 - Adding __str__ method to codon tables and translators
> http://bugzilla.open-bio.org/show_bug.cgi?id=1963
>
> Little things in themselves that I think would help.

True.
My (naive) hope is that such things would be by-products of
a new branch. I have to admit that this is probably not possible
without doing a code sprint.

Albert

--
Albert Krewinkel
University of Luebeck, Institute for Neuro- and Bioinformatics
-------------- next part --------------
Databases:
  o NCBI
    - UniGene
    - GenBank
    - PubMed
    - Entrez
    - LocusLink
    - Geo
  o Kabat
  o KEGG
  o SwissProt
  o Medline
  o biblio (pywebsvcs dependency is mentioned only in the module itself)
  o dbdefs
  o InterPro
  o Gobase
  o Enzyme
  o Rebase

Models and Simulations:
  o Ais
  o MetaTool
  o Pathway
  o ECell

Algorithms, Machine Learning and Pattern Recognition:
  o HMM
  o NeuralNetwork
  o Cluster
  o LogisticRegression, Statistics
  o GA
  o MarkovModel
  o pairwise2
  o NaiveBayes
  o MaxEntropy

Alignments:
  o Align
  o Blast
  o AlignAce
  o Clustalw
  o Fasta
  o FSSP
  o SubsMat
  o Search (WUBLAST output)
  o Saf
  o IntelliGenetics

Applications:
  o Application
  o Emboss
  o Nexus
  o AlignAce
  o Blast
  o MEME
  o Sequencing
  o Wise

Data Structures:
  o KDTree
  o trie

Sequences:
  o GFF
  o Seq
  o SeqUtils
  o SeqFeature
  o SeqRecord
  o Alphabet
  o Transcribe
  o Translate
  o lcc
  o Encodings
  o Data
  o NBRF

SeqIO:
  o writers
  o Writer
  o SeqIO
  o builders
  o Fasta
  o Index

Utilities:
  o utils.py
  o ParserSupport
  o File
  o Tools
  o Mindy
  o HotRand
  o config
  o formatdefs
  o MarkupEditor
  o DocSQL (wouldn't usage of SQL-Object be nicer? (if possible))
  o EUtils.ReseekFile
  o Std, StdHandler
  o PropertyManager
  o MultiProc
  o Decode
  o FilteredReader

Graphics:
  o Graphics

Web-Based:
  o GenBank
  o NetCache
  o EUtils
  o WWW

Microarrays:
  o Affy

Structure:
  o NMR
  o PDB
  o Crystal
  o Ndb
  o SCOP
  o SVDSuperimposer

Motifs:
  o MEME
  o Prosite
  o CDD
  o Compass

References:
  o Medline, PubMed
  o DBXref

Restriction:
  o Restriction
  o CAPS

From biopython-dev at maubp.freeserve.co.uk Wed Aug 16 12:05:12 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Wed, 16 Aug 2006 17:05:12 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <20060816144436.GG12386@pc09.inb.uni-luebeck.de>
References: <320fb6e00608160700t33ed0f4ds7311124d70781a64@mail.gmail.com> <20060816144436.GG12386@pc09.inb.uni-luebeck.de>
Message-ID: <44E34238.2010508@maubp.freeserve.co.uk>

Albert Krewinkel wrote:
>>> The _parse_genbank_features function could also be used to parse embl
>>> or ddbj features, therefore I think it should be named differently.

Peter wrote:
>> First of all, that bit of code is for a new feature which I personally
>> wanted - to be able to iterate over CDS features in a genbank file.
>>
>> But yes, I did have in mind that it (and the GenBank parser) could be
>> re-used to deal with EMBL files. I have not yet taken the time to
>> learn the EMBL file format and how it corresponds to the GenBank file
>> format - but I agree a lot of the code could be shared.

Albert Krewinkel wrote:
> I will try to build something similar for EMBL files within the next
> days. This should be easy, since features really should look the same
> in both formats:
>
> http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html
>

Oh - you meant just adding EMBL feature iteration. I was thinking
about the larger task of full EMBL file reading.

Doing just the features is very easy, here you go:

http://bugzilla.open-bio.org/show_bug.cgi?id=2059#c2

Any more feedback is very welcome.
Are you using the iterators directly, or via the helper function File2SequenceIterator? Are you using just the sequence iterators, or the dictionary and list versions too? Peter From biopython-dev at maubp.freeserve.co.uk Wed Aug 16 18:20:28 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Wed, 16 Aug 2006 23:20:28 +0100 Subject: [Biopython-dev] Tweaking the SeqRecord class Message-ID: <320fb6e00608161520j5fb6b4fejd7aa8cc839989423@mail.gmail.com> In the spirit of gradual improvements, I had a look at the SeqRecord class. First of all, is there any comment on my suggestion to add __str__ and __repr__ methods to the SeqRecord object, bug 2057: http://bugzilla.open-bio.org/show_bug.cgi?id=2057 Next, I'd like to check in some basic __doc__ strings for the SeqRecord class, e.g. something like this: >>> from Bio.SeqRecord import SeqRecord >>> print SeqRecord.__doc__ The SeqRecord object is designed to hold a sequence and information about it. Main properties: id - Identifier such as a locus tag (string) seq - The sequence itself (Seq object) Additional properties: name - Sequence name, e.g. gene name (string) description - Additional text (string) dbxrefs - List of database cross references (list of strings) features - Any (sub)features defined (list of SeqFeature objects) annotations - Further information (dictionary) I would also like to add doc strings to the id, seq, name, ... themselves. However, they are currently stored as attributes so this isn't possible. See PEP 0224, http://www.python.org/dev/peps/pep-0224/ However, we could use the Python 2.2 "property" function to implement these as properties. The code might be clearer using the Python 2.4 "decorator" syntax, but I don't think we should depend on such a recent version of python yet. 
Using properties would allow this usage: >>> print SeqRecord.features.__doc__ Annotations about parts of the sequence (list of SeqFeatures) It would also mean that these properties show up in dir(SeqRecord) and help(SeqRecord), which all in all should make the object slightly easier to use. Finally, using get/set property functions allows us to postpone creation of string/list/dict objects for unused properties. This does actually seem to bring a slight improvement to the timings for Fasta file parsing discussed last month. If you recall, for the fastest parsers turning the data into SeqRecord and Seq objects imposed a fairly large overhead (compared to just using strings): http://lists.open-bio.org/pipermail/biopython-dev/2006-July/002407.html I would be interested to see how those numbers change with the attached implementation - if you wouldn't mind please Leighton... ;) I have attached a version of SeqRecord.py which implements the changes I have described. The backwards compatibility if statement is a bit ugly - can we just assume Python 2.2 or later? Peter -------------- next part -------------- A non-text attachment was scrubbed... 
Name: SeqRecord.py Type: text/x-script.phyton Size: 9367 bytes Desc: not available Url : http://lists.open-bio.org/pipermail/biopython-dev/attachments/20060816/9e2f173c/attachment-0001.bin From mdehoon at c2b2.columbia.edu Wed Aug 16 21:39:12 2006 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Wed, 16 Aug 2006 21:39:12 -0400 Subject: [Biopython-dev] Tweaking the SeqRecord class In-Reply-To: <320fb6e00608161520j5fb6b4fejd7aa8cc839989423@mail.gmail.com> References: <320fb6e00608161520j5fb6b4fejd7aa8cc839989423@mail.gmail.com> Message-ID: <44E3C8C0.5070200@c2b2.columbia.edu> Peter wrote: > First of all, is there any comment on my suggestion to add __str__ and > __repr__ methods to the SeqRecord object, bug 2057: > > http://bugzilla.open-bio.org/show_bug.cgi?id=2057 Here's a thought: What if Seq were to inherit from str, and SeqRecord from Seq? Then, you get these for free. > Next, I'd like to check in some basic __doc__ strings for the > SeqRecord class, e.g. something like this: Sounds good to me. Pretty amazing, actually, that SeqRecord doesn't have documentation. > If you recall, for the fastest parsers turning the data into SeqRecord > and Seq objects imposed a fairly large overhead (compared to just > using strings): > > http://lists.open-bio.org/pipermail/biopython-dev/2006-July/002407.html I wonder if this is still true if a Seq object and a SeqRecord object inherit from string. From the code, I don't see where the overhead comes from. > The backwards compatibility if statement is a bit > ugly - can we just assume Python 2.2 or later? Biopython currently requires Python 2.3 or later. --Michiel. 
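
The property idea discussed in this thread can be sketched roughly as follows. This is only an illustration, not the attached SeqRecord.py: the class name Record and the helpers _get_features/_set_features are made up for the example, and the only thing assumed from Python itself is the built-in property() function (available since Python 2.2). The list behind features is created only when it is first requested, and the docstring becomes visible through the class.

```python
class Record(object):
    """Hypothetical stand-in for SeqRecord (not Biopython's real class)."""

    def __init__(self, id="<unknown id>", seq=None):
        self.id = id
        self.seq = seq
        self._features = None  # the list is not created until it is needed

    def _get_features(self):
        # Lazily create the list on first access.
        if self._features is None:
            self._features = []
        return self._features

    def _set_features(self, value):
        self._features = value

    # property() attaches a docstring, which a plain attribute cannot carry.
    features = property(
        _get_features,
        _set_features,
        doc="Annotations about parts of the sequence (list of SeqFeatures)",
    )


record = Record(id="example")
print(Record.features.__doc__)
```

Because features is defined on the class, it also shows up in dir(Record) and help(Record), which is the usability gain described in the original proposal.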

From krewink at inb.uni-luebeck.de Thu Aug 17 03:25:34 2006
From: krewink at inb.uni-luebeck.de (Albert Krewinkel)
Date: Thu, 17 Aug 2006 09:25:34 +0200
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44E34238.2010508@maubp.freeserve.co.uk>
References: <320fb6e00608160700t33ed0f4ds7311124d70781a64@mail.gmail.com> <20060816144436.GG12386@pc09.inb.uni-luebeck.de> <44E34238.2010508@maubp.freeserve.co.uk>
Message-ID: <20060817072534.GH12386@pc09.inb.uni-luebeck.de>

Peter wrote:
> Oh - you meant just adding EMBL feature iteration. I was thinking
> about the larger task of full EMBL file reading.

I started working on that, but I'm not very far yet.

> Doing just the features is very easy, here you go:
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2059#c2

Wow, that was quick. And it works almost perfectly. One exception:
In _parse_embl_or_genbank_feature(), when parsing the location, it
should say something like

    from string import digits
    # keep reading continuation lines until the location ends
    # with a closing bracket or a digit
    while feature_location[-1] not in ')' + digits:
        line = iterator.next()
        feature_location += line[FEATURE_QUALIFIER_INDENT:].strip()

This way, features may have multiline join(...) positions.

> Any more feedback is very welcome. Are you using the iterators
> directly, or via the helper function File2SequenceIterator?

I'm using iterators directly, out of old habits. But most likely I
will finally get addicted to your nice helper function.

> Are you using just the sequence iterators, or the dictionary and list
> versions too?

I don't use those yet.
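
The multi-line location handling suggested earlier in this message can be tried in isolation. The sketch below is an assumption-laden stand-in, not the code from bug 2059: read_location is a made-up helper, and FEATURE_QUALIFIER_INDENT is assumed here to be 21 (the column where location and qualifier text usually starts in GenBank/EMBL feature tables). It keeps appending continuation lines until the location ends in a closing bracket or a digit.

```python
from string import digits

# Assumed column at which location/qualifier text starts in the feature table.
FEATURE_QUALIFIER_INDENT = 21


def read_location(first_part, lines):
    """Join continuation lines until the location looks complete.

    A location is treated as complete once it ends with ')' or a digit;
    a trailing comma (as inside join(...)) means more text follows.
    """
    location = first_part.strip()
    lines = iter(lines)
    while location[-1] not in ")" + digits:
        line = next(lines)
        location += line[FEATURE_QUALIFIER_INDENT:].strip()
    return location


continuation = [
    " " * FEATURE_QUALIFIER_INDENT + "3000..3500,",
    " " * FEATURE_QUALIFIER_INDENT + "4000..4500)",
]
print(read_location("join(1000..2000,", continuation))
# prints: join(1000..2000,3000..3500,4000..4500)
```

Note the membership test against the concatenated string ")" + digits; testing against the tuple (')', digits) would compare the last character with the whole digits string and so never recognise a trailing digit.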
Albert -- Albert Krewinkel University of Luebeck, Institute for Neuro- and Bioinformatics From mcolosimo at mitre.org Thu Aug 17 08:08:24 2006 From: mcolosimo at mitre.org (Marc Colosimo) Date: Thu, 17 Aug 2006 08:08:24 -0400 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc In-Reply-To: <320fb6e00608120125t53f11b6et2bde5154e5bc19ec@mail.gmail.com> References: <320fb6e00608120125t53f11b6et2bde5154e5bc19ec@mail.gmail.com> Message-ID: Peter, Nice quick work on that. For Clustal, I think it should NOT be an Iterator, but there should be SequenceDict or SequenceList for it. There are other alignment filetypes out there that could use a SequenceIterator (those that are not interlaced). From looking over your code, it seem like it would be easy to add a check in File2SequenceDict/List to check for Clustal types and do something "special" Marc On Aug 12, 2006, at 4:25 AM, Peter wrote: > I've having a few issues with my email setup which is why I haven't > replied recently. > > A week ago I filed bug 2059 for this discussion, and attached some > code: > > http://bugzilla.open-bio.org/show_bug.cgi?id=2059 > > I'm interested in your feedback - from the framework down to if you > don't like the class names for example. > > Peter From biopython-dev at maubp.freeserve.co.uk Thu Aug 17 09:25:07 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev)) Date: Thu, 17 Aug 2006 14:25:07 +0100 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc In-Reply-To: References: <320fb6e00608120125t53f11b6et2bde5154e5bc19ec@mail.gmail.com> Message-ID: <44E46E33.3090001@maubp.freeserve.co.uk> Marc Colosimo wrote: > Peter, > > Nice quick work on that. For Clustal, I think it should NOT be an > Iterator, but there should be SequenceDict or SequenceList for it. > There are other alignment filetypes out there that could use a > SequenceIterator (those that are not interlaced). 
From looking over
> your code, it seem like it would be easy to add a check in
> File2SequenceDict/List to check for Clustal types and do something
> "special"

Yes, I was wondering about that too.

For interlaced file formats (such as clustalw, NEXUS multiple alignment
format) we have to load the whole file into memory anyway - so using a
SequenceIterator was a bit odd.

What I was trying to do was use a SequenceIterator as the lowest common
denominator - the ClustalIterator shows that this can be done for
interlaced files, and seems to work.

It's trivial to "upgrade" the ClustalIterator to a SequenceDict or
SequenceList if that's what is needed.

The way I wrote the ClustalIterator it actually reads the whole file and
stores a list of IDs and a dictionary mapping the ID to the sequence
string. It creates SeqRecord objects only on request. This should use
less memory than a full list of every SeqRecord (but I have not measured
this).

Note that I would also want to add an easy way to turn any
SequenceIterator, SequenceList or SequenceDict into a multiple alignment
object.

Out of interest, what are the largest alignments you deal with?

I was planning to add a Stockholm parser (where the sequences themselves
are non-interleaved). The PFAM database alignments use this, and are
the largest alignments I am aware of.

However, the format supports per sequence annotation information and
this information can be rather spread out. Looking at a real example
from PFAM, there were blocks of such data both before and after the
sequences. The format suggests that such annotation might also be found
next to each sequence.

i.e. An annotation free Stockholm iterator would be easy, but including
the meta data would in general require loading the whole file.

http://www.cgb.ki.se/cgb/groups/sonnhammer/Stockholm.html

It looks like a subclassed version could be written to handle the PFAM
annotations nicely.
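To make the ClustalIterator description above a little more concrete, here is a rough sketch of the "read everything, build records only on request" approach. It is written for current Python, the class and the Record stand-in are made up for the example, and it is not the code attached to bug 2059.

```python
from collections import namedtuple

# Stand-in for SeqRecord so that the sketch has no dependencies.
Record = namedtuple("Record", ["id", "seq"])


class LazyClustalIterator:
    """Hypothetical sketch: read an interlaced alignment fully, build records lazily."""

    def __init__(self, handle):
        self.ids = []    # identifiers in file order
        self.seqs = {}   # identifier -> list of sequence fragments
        for line in handle:
            # Skip the header, blank lines and the consensus lines
            # (which start with whitespace).
            if line.startswith("CLUSTAL") or not line.strip() or line[0].isspace():
                continue
            fields = line.split()
            name, fragment = fields[0], fields[1]
            if name not in self.seqs:
                self.ids.append(name)
                self.seqs[name] = []
            self.seqs[name].append(fragment)
        self._position = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self._position >= len(self.ids):
            raise StopIteration
        name = self.ids[self._position]
        self._position += 1
        # The record object is only created here, on request.
        return Record(name, "".join(self.seqs[name]))
```

Because the ids list and the dictionary are already in memory, wrapping the same data as a list or dictionary of records would only need a different access method, not a different parser.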
Peter From mcolosimo at mitre.org Thu Aug 17 08:24:24 2006 From: mcolosimo at mitre.org (Marc Colosimo) Date: Thu, 17 Aug 2006 08:24:24 -0400 Subject: [Biopython-dev] Fwd: contributing comparative genomics tools In-Reply-To: <20060816124407.GF12386@pc09.inb.uni-luebeck.de> References: <20060816124407.GF12386@pc09.inb.uni-luebeck.de> Message-ID: <9A739306-6B91-4E43-87F8-EC464784B4B2@mitre.org> On Aug 16, 2006, at 8:44 AM, Albert Krewinkel wrote: > Hello, > > I read Peter's SeqIO/__init__.py replacement and if I may say so: I > love it. Thanks a lot for this! Still, there are some things I'd > like to talk about. > > The _parse_genbank_features function could also be used to parse embl > or ddjb features, therefore I think it should be named differently. > > > Since there is a lot of clean up effort right now: How about moving > the SeqRecord and SeqFeature objects into the Bio.Seq module? They > are closely related and seperate modules only clutter the namespace. > The top namespace is sort of a mess of things. > To me, this seems to be a general problem. It's very difficult to find > a tool to use for a certain problem if one doesn't allready know what > to look for. I'd pretty much favour to create modules like > Bio.structure to group modules like Bio.PDB and Bio.NMR etc. I second this. > This is > a very big change, and therefore I'd like to follow Marc's suggestion > of splitting off a branch. In general, I pretty much agree with what > Marc said in his . > > I cannot estimate how much work it would be to maintain two seperate > biopython distributions, so please forgive me if I re-suggest > something completely idiotic here. I just don't believe there is much > that could be lost that way. I've done this for my internal work, but I never went back to see how to check out the other branch (I had not need). CVS is sometimes a bear to work with. SVN is suppose to handle branches much better, but I can't access SVN repositories that are not through HTTPS (SSL). 
Stupid corporate proxy is currently not set up to handle external webDAV. This might be a pain for a little while until the next full version is released, but I think the benfits of doing this now far out weigh the short term pain (of course I'm not an admin who has to build the releases). Marc From biopython-dev at maubp.freeserve.co.uk Thu Aug 17 11:13:40 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev)) Date: Thu, 17 Aug 2006 16:13:40 +0100 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc In-Reply-To: <20060817072534.GH12386@pc09.inb.uni-luebeck.de> References: <320fb6e00608160700t33ed0f4ds7311124d70781a64@mail.gmail.com> <20060816144436.GG12386@pc09.inb.uni-luebeck.de> <44E34238.2010508@maubp.freeserve.co.uk> <20060817072534.GH12386@pc09.inb.uni-luebeck.de> Message-ID: <44E487A4.8040106@maubp.freeserve.co.uk> Albert Krewinkel wrote: > Peter wrote: > >>Oh - you meant just adding EMBL feature iteration. I was thinking >>about the larger task of full EMBL file reading. > > I started working on that, but I'm not very far yet. Are you starting from Bio.GenBank or from scratch? I would point out that the code in Bio.GenBank was inserted into what was once a Martel based parser, and designed to be a transparent change for the end user. What I would like to do is recycle that code into a new far simpler SeqIO GenBank parser which would only return SeqRecords. In particular I would get rid off all the scanner/consumer model with all its function callbacks. At this point I would try and handle both GenBank and EMBL files together. I expect this to be faster, and easier to understand. It would be a lot less flexible for the "power user", but then so is all the new SeqIO code I have been writing. >>Doing just the features is very easy, here you go: >> >>http://bugzilla.open-bio.org/show_bug.cgi?id=2059#c2 > > Wow, that was quick. 
Well, I did have something along these lines planned in advance - that's
why my parse function was outside the GenbankCdsFeatureIterator class.

> And it's works allmost perfectly. One exception:
> In _parse_embl_or_genbank_feature(), when parsing the location, it
> shoudl say something like
>
>
> from string import digits
> while feature_location[-1] not in (')', digits):
>     line = iterator.next()
>     feature_location += line[FEATURE_QUALIFIER_INDENT:].strip()
>
>
> This way, features may have multiline join(...) positions.

Good point, something I was aware of and coped with in Bio.GenBank but
hadn't done in the CDS iterator. Thanks for pointing this out. This
affects both GenBank and EMBL files by the way.

My code is very similar but I included an assert to check the indent,
and I only check for a trailing comma. This works on all the files I
have tried.

Peter

From krewink at inb.uni-luebeck.de Thu Aug 17 13:41:06 2006
From: krewink at inb.uni-luebeck.de (Albert Krewinkel)
Date: Thu, 17 Aug 2006 19:41:06 +0200
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44E487A4.8040106@maubp.freeserve.co.uk>
References: <320fb6e00608160700t33ed0f4ds7311124d70781a64@mail.gmail.com>
 <20060816144436.GG12386@pc09.inb.uni-luebeck.de>
 <44E34238.2010508@maubp.freeserve.co.uk>
 <20060817072534.GH12386@pc09.inb.uni-luebeck.de>
 <44E487A4.8040106@maubp.freeserve.co.uk>
Message-ID: <20060817174106.GI12386@pc09.inb.uni-luebeck.de>

Peter wrote:
> > Peter wrote:
> >>Oh - you meant just adding EMBL feature iteration. I was thinking
> >>about the larger task of full EMBL file reading.
> >
> Albert wrote:
> >I started working on that, but I'm not very far yet.
>
> Are you starting from Bio.GenBank or from scratch? I would point out
> that the code in Bio.GenBank was inserted into what was once a Martel
> based parser, and designed to be a transparent change for the end user.
> > What I would like to do is recycle that code into a new far simpler > SeqIO GenBank parser which would only return SeqRecords. In particular > I would get rid off all the scanner/consumer model with all its function > callbacks. > > At this point I would try and handle both GenBank and EMBL files together. I didn't do much more than to play with current code and add some methods to parse EMBL specific things. The results can be found here: http://www.inb.uni-luebeck.de/~krewink/embl.py It's ugly, and doesn't provide much functionality, but could be a starting point. Albert -- Albert Krewinkel University of Luebeck, Institute for Neuro- and Bioinformatics From biopython-dev at maubp.freeserve.co.uk Thu Aug 17 16:09:20 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev)) Date: Thu, 17 Aug 2006 21:09:20 +0100 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc In-Reply-To: <44E46E33.3090001@maubp.freeserve.co.uk> References: <320fb6e00608120125t53f11b6et2bde5154e5bc19ec@mail.gmail.com> <44E46E33.3090001@maubp.freeserve.co.uk> Message-ID: <44E4CCF0.7090607@maubp.freeserve.co.uk> Marc Colosimo wrote: >> Nice quick work on that. For Clustal, I think it should NOT be an >> Iterator, but there should be SequenceDict or SequenceList for it. >> There are other alignment filetypes out there that could use a >> SequenceIterator (those that are not interlaced). From looking over >> your code, it seem like it would be easy to add a check in >> File2SequenceDict/List to check for Clustal types and do something >> "special" Peter (BioPython Dev) wrote: > Yes, I was thinking wondering about that too. > > For interlaced file formats (such as clustalw, NEXUS multiple alignment > format) we have to load the whole file into memory anyway - so using a > SequenceIterator was a bit odd. 
> > What I was trying to do was use a SequenceIterator as the lowest common > denominator - the ClustalIterator shows that this can be done for > interlaced files, and seems to work. There are two and a half examples done this way now... > I was planning to add a Stockholm parser (where the sequences themselves > are non-interleaved). The PFAM database alignments use this, and are > the largest alignments I am aware of. > > ... > > It looks like a subclassed version could be written to handle the PFAM > annotations nicely. http://bugzilla.open-bio.org/show_bug.cgi?id=2059#c3 Changes to the clustal parser, and addition of a parser for Stockholm alignments, and a subclassed version to handle the PFAM style annotations strings. I have included basic handling of the sequence specific meta-data [I need to have a look at real PFAM data to sort of the database cross references still], but currently ignore the whole file level information (#=GF lines) and the per column information (#=GC lines). Maybe reading sequences out of multiple alignment files should be done as a special case of loading multiple alignments? Is this what you meant by "something special" Marc? Peter From biopython-dev at maubp.freeserve.co.uk Mon Aug 21 15:26:06 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev)) Date: Mon, 21 Aug 2006 20:26:06 +0100 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc In-Reply-To: <44E4D2B4.3000600@maubp.freeserve.co.uk> References: <44E4D2B4.3000600@maubp.freeserve.co.uk> Message-ID: <44EA08CE.5070802@maubp.freeserve.co.uk> You probably noticed I sent out a "Dealing with sequence files" questionnaire on the main discussion list: http://lists.open-bio.org/pipermail/biopython/2006-August/003171.html I've had four replies to date (off the list), and with the previous list discussion and counting myself that makes eight views. Not a very big sample I know. 
> Question One > ============ > Is reading sequence files an important function to you, and if so which > file formats in particular (e.g. Fasta, GenBank, ...) Fasta very popular, with GenBank also scoring highly. Michiel and I both use clustalw. Apart from EMBL (next question) there wasn't any other popular file format given. I'm tempted to ask again regarding multiple alignment formats. > Question Two > ============ > Are there any sequence formats you would like to be able to read using > BioPython that are not currently supported (e.g. EMBL, ...) It may have been a leading question, but several respondents would like to be able to read in EMBL format. Other requests included: XML based 454 sequence files UniGene sequence cluster format Leighton mentioned: PTT (Protein table files) GFF (General Feature Format) And I wanted to be able to read Stockholm alignments. > Question Three - Reading Fasta Files > ==================================== > Which of the following do you currently use (and why)?: > > (a) Bio.Fasta with the RecordParser (giving FastaRecord objects) > (b) Bio.Fasta with the FeatureParser (giving SeqRecord objects) > (c) Bio.Fasta with your own parser (Could you tell us more?) > (d) Bio.SeqIO.FASTA.FastaReader (giving SeqRecord objects) > (e) Bio.FormatIO (giving SeqRecord objects) > (f) Other (Could you tell us more?) A range covering (a), (b) and (d) plus DIY parsers. > Question Four - Reading GenBank Files > ===================================== > Which of the following do you currently use (and why)?: > > (a) Bio.GenBank with the FeatureParser (giving SeqRecord objects) > (b) Bio.GenBank with the RecordParser (giving GenBank Record objects) > (c) Other (Could you tell us more?) Both (a) and (b) with no clear majority. > Question Five - Record Access... > ================================ > When loading a file with multiple sequences do you use: > > (a) An iterator interface (e.g. 
Bio.Fasta.Iterator) to give you the > records one by one in the order from the file. > > (b) A dictionary interface (e.g. Bio.Fasta.Dictionary) to give you > random access to the records using their identifier. > > (c) A list giving random access by index number (e.g. load the records > using an iterator but save them in a list). Most of you use iterators, storing records in memory as required. > Question Six - Martel, Scanners and Consumers > ============================================= > Some of BioPython's existing parsers (e.g. those using Martel) use an > event/callback model, where the scanner component generates parsing > events which are dealt with by the consumer component. > > Do any of you use this system to modify existing parser behaviour, or > use it as part of your own personal file parser? > > (a) I don't know, or don't care. I just the the parsers provided. > (b) I use this framework to modify a parser in order to do ... (please > provide details). Almost everyone said (a) which I think is a good thing if we are going to try and re-work the BioPython's sequence reading. > And finally... > ============== > Do you have any general questions of comments. Several people have commented that BioPerl has a nice unified system with good documentation. ----------------------------------------------------------------------- Where next... I think my code could be included "in parallel" with the existing parsers, without the upheaval of creating a new branch etc. I have started thinking about writing files too. Part of this will involve trying to be as consistent as possible about mapping annotations from different file formats to the SeqRecord object's annotations dictionary. http://bugzilla.open-bio.org/show_bug.cgi?id=2059 My code currently on bug 2059 is written as a single python file, provisionally Bio/SeqIO/__init__.py but this is clearly not a good idea long term as more file formats are supported. 
If we use Bio.SeqIO then the prior existence of Bio/SeqIO/FASTA.py is a
slight annoyance in that I can't use Bio/SeqIO/Fasta.py because the
filenames would clash on Windows. Some people are using the code in
Bio.SeqIO.FASTA, but I suppose the file could contain both the old code,
and my new fasta interface.

Alternatively, the new system could be put in Bio.SequenceIO or are
there any other suggestions?

Peter

From krewink at inb.uni-luebeck.de Tue Aug 22 09:43:56 2006
From: krewink at inb.uni-luebeck.de (Albert Krewinkel)
Date: Tue, 22 Aug 2006 15:43:56 +0200
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44EA08CE.5070802@maubp.freeserve.co.uk>
References: <44E4D2B4.3000600@maubp.freeserve.co.uk>
 <44EA08CE.5070802@maubp.freeserve.co.uk>
Message-ID: <20060822134356.GO12386@pc09.inb.uni-luebeck.de>

I'd like to seriously start working on an EMBL parser, but there are
some things I'm concerned about: It surely would be a good thing to
build the SequenceIO and Parser stuff upon some base classes and agree
on using certain tools which are (or will be) used in the whole project.

Since I never received any education/training on software development,
I would appreciate it if someone could tell me what the code's structure
should look like -- the current Scanner/Consumer code isn't any help.

> Several people have commented that BioPerl has a nice unified system
> with good documentation.

How about using reStructuredText in docstrings? IMO it leaves the
.__doc__ string very readable but improves epydoc generated
descriptions.
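As an illustration of what such a docstring might look like, here is a small made-up function documented with reStructuredText fields. The function is not part of Biopython, and the module-level __docformat__ line is how epydoc is usually told which markup to expect.

```python
# Tells epydoc which markup the docstrings in this module use.
__docformat__ = "restructuredtext en"


def parse_location(text):
    """Split a simple ``start..end`` location into two integers.

    This function is a made-up example, not part of Biopython.

    :param text: location string such as ``"12..345"``
    :type text: str
    :returns: ``(start, end)`` as a tuple of integers
    :raises ValueError: if *text* is not of the form ``start..end``
    """
    start, end = text.split("..")
    return int(start), int(end)
```

The docstring stays readable with plain help() or print, while the field lists give a documentation generator something structured to work with.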
Albert -- Albert Krewinkel University of Luebeck, Institute for Neuro- and Bioinformatics From bsouthey at gmail.com Tue Aug 22 09:52:10 2006 From: bsouthey at gmail.com (Bruce Southey) Date: Tue, 22 Aug 2006 08:52:10 -0500 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc In-Reply-To: <44EA08CE.5070802@maubp.freeserve.co.uk> References: <44E4D2B4.3000600@maubp.freeserve.co.uk> <44EA08CE.5070802@maubp.freeserve.co.uk> Message-ID: Hi, To date I have only used SwissProt code from BioPython so I am really only lurking. But here are some responses. Bruce On 8/21/06, Peter (BioPython Dev) wrote: > You probably noticed I sent out a "Dealing with sequence files" > questionnaire on the main discussion list: > > http://lists.open-bio.org/pipermail/biopython/2006-August/003171.html > > I've had four replies to date (off the list), and with the previous list > discussion and counting myself that makes eight views. Not a very big > sample I know. > > > Question One > > ============ > > Is reading sequence files an important function to you, and if so which > > file formats in particular (e.g. Fasta, GenBank, ...) > > Fasta very popular, with GenBank also scoring highly. Michiel and I > both use clustalw. Apart from EMBL (next question) there wasn't any > other popular file format given. Well, this is not a surprise because most apps around also use FASTA as default format. Although most do not accept a comment line. Thus, FASTA is the most important format. > > I'm tempted to ask again regarding multiple alignment formats. > > > Question Two > > ============ > > Are there any sequence formats you would like to be able to read using > > BioPython that are not currently supported (e.g. EMBL, ...) > > It may have been a leading question, but several respondents would like > to be able to read in EMBL format. 
> > Other requests included: > > XML based 454 sequence files > UniGene sequence cluster format > > Leighton mentioned: > > PTT (Protein table files) > GFF (General Feature Format) > > And I wanted to be able to read Stockholm alignments. I would like to be able to use a custom format that is based on the FASTA format. That is allowing non-standard characters to included as part of the sequence that I later remove. Perhaps this is just being able to do subclassing. > > > Question Three - Reading Fasta Files > > ==================================== > > Which of the following do you currently use (and why)?: > > > > (a) Bio.Fasta with the RecordParser (giving FastaRecord objects) > > (b) Bio.Fasta with the FeatureParser (giving SeqRecord objects) > > (c) Bio.Fasta with your own parser (Could you tell us more?) > > (d) Bio.SeqIO.FASTA.FastaReader (giving SeqRecord objects) > > (e) Bio.FormatIO (giving SeqRecord objects) > > (f) Other (Could you tell us more?) > > A range covering (a), (b) and (d) plus DIY parsers. > > > Question Four - Reading GenBank Files > > ===================================== > > Which of the following do you currently use (and why)?: > > > > (a) Bio.GenBank with the FeatureParser (giving SeqRecord objects) > > (b) Bio.GenBank with the RecordParser (giving GenBank Record objects) > > (c) Other (Could you tell us more?) > > Both (a) and (b) with no clear majority. > > > Question Five - Record Access... > > ================================ > > When loading a file with multiple sequences do you use: > > > > (a) An iterator interface (e.g. Bio.Fasta.Iterator) to give you the > > records one by one in the order from the file. > > > > (b) A dictionary interface (e.g. Bio.Fasta.Dictionary) to give you > > random access to the records using their identifier. > > > > (c) A list giving random access by index number (e.g. load the records > > using an iterator but save them in a list). > > Most of you use iterators, storing records in memory as required. 
a > > > Question Six - Martel, Scanners and Consumers > > ============================================= > > Some of BioPython's existing parsers (e.g. those using Martel) use an > > event/callback model, where the scanner component generates parsing > > events which are dealt with by the consumer component. > > > > Do any of you use this system to modify existing parser behaviour, or > > use it as part of your own personal file parser? > > > > (a) I don't know, or don't care. I just the the parsers provided. > > (b) I use this framework to modify a parser in order to do ... (please > > provide details). > > Almost everyone said (a) which I think is a good thing if we are going > to try and re-work the BioPython's sequence reading. a > > > And finally... > > ============== > > Do you have any general questions of comments. > > Several people have commented that BioPerl has a nice unified system > with good documentation. > > ----------------------------------------------------------------------- > > Where next... > > I think my code could be included "in parallel" with the existing > parsers, without the upheaval of creating a new branch etc. > > I have started thinking about writing files too. > > Part of this will involve trying to be as consistent as possible about > mapping annotations from different file formats to the SeqRecord > object's annotations dictionary. > > http://bugzilla.open-bio.org/show_bug.cgi?id=2059 > > My code currently on bug 2059 is written as a single python file, > provisionally Bio/SeqIO/__init__.py but this is clearly not a good idea > long term as more file formats are supported. > > If we use Bio.SeqIO then the prior existence of Bio/SeqIO/FASTA.py is a > slight annoyance in that I can't use Bio/SeqIO/Fasta.py because the > filenames would clash on Windows. Some people are using the code in > Bio.SeqIO.FASTA, but I suppose the file could contain both the old code, > and my new fasta interface. 
> > Alternatively, the new system could be put in Bio.SequenceIO or are > there any other suggestions? > > Peter > > > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From biopython-dev at maubp.freeserve.co.uk Tue Aug 22 12:46:39 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev)) Date: Tue, 22 Aug 2006 17:46:39 +0100 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc In-Reply-To: <20060822134356.GO12386@pc09.inb.uni-luebeck.de> References: <44E4D2B4.3000600@maubp.freeserve.co.uk> <44EA08CE.5070802@maubp.freeserve.co.uk> <20060822134356.GO12386@pc09.inb.uni-luebeck.de> Message-ID: <44EB34EF.1050901@maubp.freeserve.co.uk> Albert Krewinkel wrote: > I'd like to seriously start working on an EMBL parser, but ... As the de-facto GenBank module owner, I'm also interested getting EMBL and GenBank working nicely together. The big question BEFORE you/we start any serious coding on EMBL support is how it fits into BioPython. Do we (a) add a new module like the existing Bio.Fasta and Bio.GenBank, or (b) use a new framework like the one I've put forward here: http://bugzilla.open-bio.org/show_bug.cgi?id=2059 > ... there are some things I'm concerned about: It surely would be a > good thing to build the SequenceIO and Parser stuff upon some base > classes and agree on using certain tools which are (or will be) used > in the hole project. What I was proposing was that all the new sequence file format parsers should be implemented as subclasses of my SequenceIterator class - either directly (e.g. FastaIterator) or indirectly (e.g. the PfamStockholmIterator) and they should return SeqRecord objects. I am open to discussion about how interlaced file formats should be handled, but I think I have shown how the SequenceIterator based scheme could work using the Clustalw and Stockholm formats as examples. 
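A rough sketch of how the subclassing scheme described above might look is given below. It is written for current Python, the Record tuple stands in for SeqRecord, and the class bodies are illustrative guesses, not the code attached to bug 2059.

```python
from collections import namedtuple

# Stand-in for SeqRecord so that the sketch has no dependencies.
Record = namedtuple("Record", ["id", "description", "seq"])


class SequenceIterator:
    """Hypothetical base class: subclasses only implement __next__."""

    def __init__(self, handle):
        self._lines = iter(handle)

    def __iter__(self):
        return self

    def __next__(self):
        raise NotImplementedError("subclasses must return the next record")


class FastaIterator(SequenceIterator):
    """Yield one Record per FASTA entry, reading the handle line by line."""

    def __init__(self, handle):
        SequenceIterator.__init__(self, handle)
        self._title = None  # title line already read for the next record

    def __next__(self):
        # Find the title line, unless the previous call already read it.
        while self._title is None:
            line = next(self._lines)  # StopIteration here ends the iteration
            if line.startswith(">"):
                self._title = line
        title = self._title[1:].rstrip()
        self._title = None
        fragments = []
        for line in self._lines:
            if line.startswith(">"):
                self._title = line  # belongs to the next record
                break
            fragments.append(line.strip())
        identifier = title.split(None, 1)[0] if title else ""
        return Record(identifier, title, "".join(fragments))
```

With the iteration protocol kept in the base class, a new format only has to supply its own __next__, and anything that accepts a SequenceIterator works with it unchanged.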
> Since I never received any education/training on software > development, I would appreciate if someone can tell me how the code's > structure should look like -- the current Scanner/Consumer code > isn't any help. I agree that the current Scanner/Consumer code won't be much help. The fact that the current Bio.GenBank parser uses the Scanner/Consumer model reflects the fact that I rewrote (in Python) what had been done using Martel/Mindy. This is one excuse for the state of that code of mine ;) I don't think the flexibility of the Scanner/Consumer model is needed just to turn Embl/GenBank data into SeqRecord objects (and only into SeqRecord objects). > How about using reStructuredText in docstrings? IMO it leaves the > .__doc__ string very readable but improves epydoc generated > descriptions. I'm not familiar with how any existing API documentation is extracted from the source code... Peter From biopython-dev at maubp.freeserve.co.uk Thu Aug 17 04:28:19 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev)) Date: Thu, 17 Aug 2006 09:28:19 +0100 Subject: [Biopython-dev] Tweaking the SeqRecord class In-Reply-To: <44E3C8C0.5070200@c2b2.columbia.edu> References: <320fb6e00608161520j5fb6b4fejd7aa8cc839989423@mail.gmail.com> <44E3C8C0.5070200@c2b2.columbia.edu> Message-ID: <44E428A3.70103@maubp.freeserve.co.uk> Michiel de Hoon wrote: > Peter wrote: > >>First of all, is there any comment on my suggestion to add __str__ and >>__repr__ methods to the SeqRecord object, bug 2057: >> >>http://bugzilla.open-bio.org/show_bug.cgi?id=2057 > > Here's a thought: > What if Seq were to inherit from str, and SeqRecord from Seq? > Then, you get these for free. This wouldn't automatically show any id/name/desrc/annotation in the __str__ and __repr__ methods, so I would want to override these methods anyway. We would still need to create and provide a Seq object on request as the record.seq attribute/property (for backwards compatibility). 
I also think we should change the Seq object's __str__ and __repr__
functionality (while preserving the .tostring() method for some
backwards compatibility).

It might have been Marc who raised this point - shouldn't __str__ turn
the data into a string, and __repr__ return a string that you could type
into python to recreate the object? This would mean we would have to
stop truncating the sequence data at 60 characters.

>>Next, I'd like to check in some basic __doc__ strings for the
>>SeqRecord class, e.g. something like this:
>
> Sounds good to me. Pretty amazing, actually, that SeqRecord doesn't have
> documentation.

OK, basic __doc__ strings checked in, Bio/SeqRecord.py revision 1.9

The Seq object also needs some love and attention in this area.

>>If you recall, for the fastest parsers turning the data into SeqRecord
>>and Seq objects imposed a fairly large overhead (compared to just
>>using strings):
>>
>>http://lists.open-bio.org/pipermail/biopython-dev/2006-July/002407.html
>
> I wonder if this is still true if a Seq object and a SeqRecord object
> inherit from string. From the code, I don't see where the overhead comes
> from.

I was wondering what the overhead was too. It could just be creating
objects (Seq and SeqRecord) plus their associated
strings/list/dictionary (compared with just two strings, the fasta
title string and the sequence).

My property change should reduce this a little bit as for Fasta files
there is no need to create the dbxrefs list or the annotations
dictionary (unless or until the user records some information here
after creating the SeqRecord object).

Making SeqRecord subclass Seq might help here if only one object needs
to be created.

>>The backwards compatibility if statement is a bit
>>ugly - can we just assume Python 2.2 or later?
>
> Biopython currently requires Python 2.3 or later.

Great - I'll ditch that nasty big if and just re-write the class to use
properties. Revised version attached - should be functionally identical.
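For readers without the attachment, the property idea might be sketched roughly as follows. The class name and details are made up for illustration and are not the attached SeqRecord.py; the only point is that the list and dictionary are not created until something asks for them.

```python
class LazyRecord(object):
    """Hypothetical stand-in for SeqRecord with lazily created containers."""

    def __init__(self, seq, id="<unknown id>"):
        self.seq = seq
        self.id = id
        self._annotations = None  # created on first access
        self._dbxrefs = None      # created on first access

    def _get_annotations(self):
        if self._annotations is None:
            self._annotations = {}
        return self._annotations

    def _set_annotations(self, value):
        self._annotations = value

    annotations = property(_get_annotations, _set_annotations,
                           doc="Dictionary of annotations, created when first used.")

    def _get_dbxrefs(self):
        if self._dbxrefs is None:
            self._dbxrefs = []
        return self._dbxrefs

    def _set_dbxrefs(self, value):
        self._dbxrefs = value

    dbxrefs = property(_get_dbxrefs, _set_dbxrefs,
                       doc="List of database cross references, created when first used.")
```

A record read from a Fasta file and never annotated then carries two None references instead of an empty list and an empty dictionary, while code that does use record.annotations or record.dbxrefs sees no difference.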
Peter -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: SeqRecord.py Url: http://lists.open-bio.org/pipermail/biopython-dev/attachments/20060817/79dd5fca/attachment-0001.pl From biopython-dev at maubp.freeserve.co.uk Wed Aug 30 06:22:52 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Wed, 30 Aug 2006 11:22:52 +0100 Subject: [Biopython-dev] Recent bug reports not making it to the mailing list Message-ID: <44F566FC.30407@maubp.freeserve.co.uk> Once upon a time (early 2006?) whenever a bug was filed on the BugZilla, a copy was sent to the mailing list. Not any more... and in the last month or so there have been several bugs filed which have been ignored. Does anyone get automatic email notification? Who should I ask to be included in any default email notification? Thanks Peter
From pfefferp at staff.uni-marburg.de Tue Aug 1 12:02:25 2006 From: pfefferp at staff.uni-marburg.de (Patrick Pfeffer) Date: Tue, 01 Aug 2006 14:02:25 +0200 Subject: [Biopython-dev] GAs in Biopython Message-ID: <44CF42D1.8090209@staff.uni-marburg.de> Hi there, isn't there any documentation available for using the genetic algorithm available in the package? Thanks for any kind of help, Patrick -- ************************************* Dipl. Bioinf. Patrick Pfeffer Arbeitskreis Prof. Dr. G.
Klebe
Institut für Pharmazeutische Chemie
Raum A116a
Fachbereich Pharmazie
Philipps-Universität Marburg
Marbacher Weg 6
35032 Marburg
Germany
Fon.: 06421/2825908
http://www.agklebe.de
e-mail: pfefferp at staff.uni-marburg.de
*************************************

From biopython-dev at maubp.freeserve.co.uk Tue Aug 1 20:53:08 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Tue, 01 Aug 2006 21:53:08 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <1154428959.4871.11.camel@lplinuxdev>
References: <44CA162F.1040604@maubp.freeserve.co.uk> <44CA27B1.30107@maubp.freeserve.co.uk> <1154339988.1490.81.camel@lplinuxdev> <44CDF3AA.2020308@maubp.freeserve.co.uk> <1154355358.1490.116.camel@lplinuxdev> <44CE1E3C.2050502@maubp.freeserve.co.uk> <1154428959.4871.11.camel@lplinuxdev>
Message-ID: <44CFBF34.7080106@maubp.freeserve.co.uk>

Peter wrote:
>>> The SeqUtils/quick_FASTA_reader is interesting in that it loads the
>>> entire file into memory in one go, and then parses it. On the other
>>> hand it's not perfect: I would use "\n>" as the split marker
>>> rather than ">" which could appear in the description of a sequence.

Leighton Pritchard replied:
>> I agree (not that it's bitten me, yet), but I'd be inclined to go
>> with "%s>" % os.linesep as the split marker, just in case.

Peter then wrote:
> Good point.

I take that back - I was right the first time ;)

You are right to worry about the line separator changing from platform to platform, but you shouldn't use "%s>" % os.linesep. When reading Windows-style files on Windows, the newlines appear in Python as just \n (as do newlines from Unix files read on Windows). When writing text files on Windows, \n again gets turned into CR LF on the disk.

Just using "\n>" would work on any platform reading a FASTA file with the expected newlines. As a bonus it would work on Windows when reading Unix-style newlines.
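A minimal sketch of the "\n>" split marker described above, written for current Python 3 rather than the Python 2.3 of this thread. The function name parse_fasta_text is hypothetical (it is not part of Biopython), and line endings are normalised by hand here only so that the example is self-contained:

```python
def parse_fasta_text(text):
    """Return (title, sequence) tuples from FASTA-formatted text.

    Hypothetical helper for illustration; not part of Biopython.
    """
    # Normalise CRLF (Windows) and bare CR (old Mac) line endings to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    records = []
    # Prepend a newline so the first record also begins with "\n>".
    # Splitting on "\n>" rather than ">" leaves any ">" inside a
    # description line untouched.
    for entry in ("\n" + text).split("\n>")[1:]:
        title, _, body = entry.partition("\n")
        # Drop all whitespace from the sequence lines and upper-case them.
        records.append((title, "".join(body.split()).upper()))
    return records


# One record with Windows line endings and a ">" in its description,
# one record with old Mac line endings.
sample = ">seq1 A>G variant\r\nacgt\r\nac\r\n>seq2\rGGTT\r"
records = parse_fasta_text(sample)
print(records)
```

Splitting on ">" alone would cut the first title in two at "A>G"; the "\n>" marker keeps it whole.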
To get any platform to read newlines from any other platform, what I suggest is using "\n>" as the split string, but opening the file in universal text mode - this seems to work fine on Python 2.3, but I'm not sure when universal newline reading was introduced.

For example, I created a simple file in each of the three newline conventions (using TextPad on Windows).

>>> import os
>>> import sys
>>> sys.platform
'win32'
>>> os.linesep
'\r\n'
>>> open("c:/temp/windows.txt","r").read()
'line\nline\n'
>>> open("c:/temp/mac.txt","r").read()
'line\rline\r'
>>> open("c:/temp/unix.txt","r").read()
'line\nline\n'

(Notice that using "\n>" wouldn't work when reading a Mac style file on Windows)

>>> open("c:/temp/windows.txt","rU").read()
'line\nline\n'
>>> open("c:/temp/mac.txt","rU").read()
'line\nline\n'
>>> open("c:/temp/unix.txt","rU").read()
'line\nline\n'

Peter

From lpritc at scri.sari.ac.uk Wed Aug 2 09:25:27 2006
From: lpritc at scri.sari.ac.uk (Leighton Pritchard)
Date: Wed, 02 Aug 2006 10:25:27 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44CDDD10.4020904@maubp.freeserve.co.uk>
References: <44CA162F.1040604@maubp.freeserve.co.uk> <44CD5AF2.10708@c2b2.columbia.edu> <44CDDD10.4020904@maubp.freeserve.co.uk>
Message-ID: <1154510728.4871.66.camel@lplinuxdev>

On Mon, 2006-07-31 at 11:36 +0100, Peter (BioPython Dev) wrote:
> Question One
> ============
> Is reading sequence files an important function to you, and if so which
> file formats in particular (e.g. Fasta, GenBank, ...)

Yes. FASTA (sequence), GenBank, GFF, PTT, EMBL, ClustalW

> If you have had to write your own code to read a "common" file format
> which BioPython doesn't support, please get in touch.

EMBL and PTT (though PTT is pretty trivial, and my EMBL parser is not pretty).
> Question Two - Reading Fasta Files > ================================== > Which of the following do you currently use (and why)?: > > (a) Bio.Fasta with the RecordParser (giving FastaRecord objects with a > title, and the sequence as a string) > (b) Bio.Fasta with the FeatureParser (giving SeqRecord objects) > (c) Bio.Fasta with your own parser (Could you tell us more?) > (d) Bio.SeqIO.FASTA.FastaReader (giving SeqRecord objects) > (e) Bio.FormatIO (giving SeqRecord objects) > (f) Other (Could you tell us more?) Mostly (f), a homegrown Pyrex/Flex parser. > Question Three - index_file based dictionaries > ============================================== > Do you use any of the following: > (a) Bio.Fasta.Dictionary > (b) Bio.Genbank.Dictionary > (c) Any other "Martel/Mindy" based dictionary which first requires > creation of an index using the index_file function No, but I do create dictionaries on-the-fly from (name, sequence) tuples, where necessary. > Question Four - Record Access... > ================================ > When loading a file with multiple sequences do you use: > > (a) An iterator interface(e.g. Bio.Fasta.Iterator) to give you the > records one by one in the order from the file. > > (b) A dictionary interface (e.g. Bio.Fasta.Dictionary) to give you > random access to the records using their identifier. > > (c) A list giving random access by index number (e.g. load the records > using an iterator but saving them in a list). > > Do you have any additional comments on this? For example, flexibility > versus memory requirements. Depending on what I need to do, I might use different approaches. If I'm filtering sequences on, say, sequence composition, I'll use an iterator. If I need to cross-reference sequences from the file to some other set of sequences by ID, I'll use a dictionary. In each case, I will generally either use a for loop or build a dictionary on-the-fly. 
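The two access patterns described above (filter with a for loop over an iterator, or build a dictionary on the fly) can be sketched as follows. This is an illustrative Python 3 example, not the Pyrex/Flex parser mentioned in the reply; iter_fasta and gc_fraction are hypothetical names, and the 0.5 GC threshold is arbitrary:

```python
import io


def iter_fasta(handle):
    """Yield (name, sequence) tuples from an open FASTA handle, one at a time."""
    name, chunks = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            # A new title line closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    # Emit the final record once the handle is exhausted.
    if name is not None:
        yield name, "".join(chunks)


def gc_fraction(seq):
    """Return the fraction of G and C symbols in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


data = ">a first\nACGT\nGG\n>b second\nATAT\n"

# Iterator use: filter on sequence composition without holding every record.
gc_rich = [name for name, seq in iter_fasta(io.StringIO(data))
           if gc_fraction(seq) > 0.5]

# Dictionary use: key on the first word of the title for cross-referencing by ID.
by_id = {name.split()[0]: seq for name, seq in iter_fasta(io.StringIO(data))}
```

Only the dictionary form keeps every sequence in memory, which is the flexibility versus memory trade-off the question asks about.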
> Question Four - Fasta files: FastaRecord or SeqRecord > ===================================================== > If you use Fasta files, do you want get records returned as FastaRecords > or as SeqRecords? If SeqRecords, do you use your own title2ids mapping? I'd rather have SeqRecords. SeqRecords are particularly useful for annotations and attaching data to the sequence which, later, gets written out in some format other than FASTA sequence format. For operations where no further information is associated with the sequence, they offer equivalent functionality to FastaRecords. Currently I default to (name, seq) tuples, and only create SeqRecords when necessary, but this is only out of convenience for the parser I use. > Question Five - GenBank files: GenbankRecord or SeqRecord > ========================================================== > If you use GenBank files, do you use: > (a) Bio.Genbank.FeatureParser which returns SeqRecord objects > (b) Bio.Genbank.RecordParser which returns Bio.GenBank.Record objects > > Do you care much either way? For me the only significant difference is > that feature locations are held as objects in the SeqRecord, and as the > raw string in the Record. I use Bio.GenBank.FeatureParser because I prefer the storage of features (which are what I'm generally interested in) as SeqFeature objects. > Question Six - Martel, Scanners and Consumers > ============================================== > Some of BioPython's existing parsers (e.g. those using Martel) use an > event/callback model, where the scanner component generates parsing > events which are dealt with by the consumer component. > > Do any of you use this system to modify existing parser behaviour, or > use it as part of your own personal file parser? > > (a) I don't know, or don't care. I just the the parsers provided. > (b) I use this framework to modify a parser in order to do ... (please > provide details). 
I care mostly about performance on large files and the convenient representation of sequences and features. Where parsers have not been available (or quickly locatable) for file formats, such as EMBL, I have sometimes used the Bio.ParserSupport classes and the Scanner/Consumer pattern.

L.

--
Dr Leighton Pritchard AMRSC
D131, Plant-Pathogen Interactions, Scottish Crop Research Institute
Invergowrie, Dundee, Scotland, DD2 5DA, UK
T: +44 (0)1382 562731 x2405 F: +44 (0)1382 568578
E: lpritc at scri.sari.ac.uk W: http://bioinf.scri.sari.ac.uk/lp
GPG/PGP: FEFC205C E58BA41B http://www.keyserver.net
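For readers unfamiliar with the Scanner/Consumer pattern mentioned above, here is a minimal Python 3 sketch of the idea: a scanner walks the file and emits events, and a consumer decides what to build from them. The class and method names (FastaScanner, TupleConsumer, title, sequence, end) are invented for illustration and are not the Bio.ParserSupport API:

```python
import io


class FastaScanner:
    """Walk over FASTA lines and report parsing events to a consumer."""

    def feed(self, handle, consumer):
        for line in handle:
            line = line.rstrip()
            if line.startswith(">"):
                consumer.title(line[1:])   # event: start of a new record
            elif line:
                consumer.sequence(line)    # event: one line of sequence data
        consumer.end()                     # event: no more input


class TupleConsumer:
    """Collect the scanner's events into (title, sequence) tuples."""

    def __init__(self):
        self.records = []
        self._title = None
        self._chunks = []

    def title(self, text):
        self._flush()
        self._title = text

    def sequence(self, text):
        self._chunks.append(text)

    def end(self):
        self._flush()

    def _flush(self):
        # Store the record in progress, if any, then reset the state.
        if self._title is not None:
            self.records.append((self._title, "".join(self._chunks)))
        self._title, self._chunks = None, []


consumer = TupleConsumer()
FastaScanner().feed(io.StringIO(">a first\nACGT\nGG\n\n>b\nAT\n"), consumer)
print(consumer.records)
```

Changing what the parser returns then means writing a different consumer, while the scanner stays untouched.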
From biopython-dev at maubp.freeserve.co.uk Wed Aug 2 10:45:34 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Wed, 02 Aug 2006 11:45:34 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <1154510728.4871.66.camel@lplinuxdev>
References: <44CA162F.1040604@maubp.freeserve.co.uk> <44CD5AF2.10708@c2b2.columbia.edu> <44CDDD10.4020904@maubp.freeserve.co.uk> <1154510728.4871.66.camel@lplinuxdev>
Message-ID: <44D0824E.30808@maubp.freeserve.co.uk>

Leighton Pritchard wrote:
> On Mon, 2006-07-31 at 11:36 +0100, Peter (BioPython Dev) wrote:
>
>>Question One
>>============
>>Is reading sequence files an important function to you, and if so which
>>file formats in particular (e.g. Fasta, GenBank, ...)
>
> Yes. FASTA (sequence), GenBank, GFF, PTT, EMBL, ClustalW
>

PTT (Protein table files)
http://www.ibt.unam.mx/biocomputo/hom_make_db.html
(Anyone got an NCBI link for the file format?)

GFF (General Feature Format)
http://genome.ucsc.edu/goldenPath/help/customTrack.html#GFF
http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml

GFF and PTT aren't exactly what I would call sequence files, in that they don't contain any sequence data. But thinking about it, maybe those files could be turned into SeqRecords or SeqFeatures (with empty sequences).

>
>>If you have had to write your own code to read a "common" file format
>>which BioPython doesn't support, please get in touch.
>
> EMBL and PTT (though PTT is pretty trivial, and my EMBL parser is not
> pretty).
>

It looks like there is enough overlap between the EMBL and GenBank formats to make sharing code between them a good idea. Certainly EMBL was a file format I was thinking we should try to support.

Reading your other comments, it looks like you wouldn't miss FastaRecord or GenBank records if they were phased out.

Personally, I'm suggesting that any Sequence IO framework should standardise on returning SeqRecord objects.
Does anyone know if SeqIO stood for Sequence or Sequential Input/Output?

I think we should have a generic "Sequence Iterator" object to do this which takes a file handle, subclassed for each file format - giving a "Fasta Iterator", a "Genbank Iterator", a "Clustal Iterator" etc.

I'm inclined not to give any choice of parser object (e.g. Bio.Fasta.SequenceParser vs Bio.Fasta.RecordParser), and always return a SeqRecord.

The individual readers should offer some level of control, for example the title2ids function for Fasta files lets the user decide how the title line should be broken up into id/name/description. Also for some file formats the user should be able to specify the alphabet.

Peter

From hoffman at ebi.ac.uk Wed Aug 2 11:00:46 2006
From: hoffman at ebi.ac.uk (Michael Hoffman)
Date: Wed, 2 Aug 2006 12:00:46 +0100 (BST)
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44CDDD10.4020904@maubp.freeserve.co.uk>
References: <44CA162F.1040604@maubp.freeserve.co.uk> <44CD5AF2.10708@c2b2.columbia.edu> <44CDDD10.4020904@maubp.freeserve.co.uk>
Message-ID:

> Question One
> ============
> Is reading sequence files an important function to you, and if so which
> file formats in particular (e.g. Fasta, GenBank, ...)

Yes. FASTA.

> Question Two - Reading Fasta Files
> ==================================
> Which of the following do you currently use (and why)?:
>
> (f) Other (Could you tell us more?)

I have written my own short iterator so that my code is portable without requiring Biopython to be installed.

> Question Three - index_file based dictionaries
> ==============================================
> Do you use any of the following:

No.

> Question Four - Record Access...
> ================================
> When loading a file with multiple sequences do you use:
>
> (a) An iterator interface(e.g. Bio.Fasta.Iterator) to give you the
> records one by one in the order from the file.

Yes.
> Question Four - Fasta files: FastaRecord or SeqRecord > ===================================================== > If you use Fasta files, do you want get records returned as FastaRecords > or as SeqRecords? If SeqRecords, do you use your own title2ids mapping? SeqRecords. I hate it when an interface tries to parse the definition line for me. Perhaps a set of standard definition line parsers should be provided so that one can choose, but usually I would rather have plain text and parse it myself. > Question Six - Martel, Scanners and Consumers > ============================================== > Some of BioPython's existing parsers (e.g. those using Martel) use an > event/callback model, where the scanner component generates parsing > events which are dealt with by the consumer component. > > Do any of you use this system to modify existing parser behaviour, or > use it as part of your own personal file parser? No. -- Michael Hoffman European Bioinformatics Institute From lpritc at scri.sari.ac.uk Wed Aug 2 11:23:27 2006 From: lpritc at scri.sari.ac.uk (Leighton Pritchard) Date: Wed, 02 Aug 2006 12:23:27 +0100 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc In-Reply-To: <44D0824E.30808@maubp.freeserve.co.uk> References: <44CA162F.1040604@maubp.freeserve.co.uk> <44CD5AF2.10708@c2b2.columbia.edu> <44CDDD10.4020904@maubp.freeserve.co.uk> <1154510728.4871.66.camel@lplinuxdev> <44D0824E.30808@maubp.freeserve.co.uk> Message-ID: <1154517808.4871.93.camel@lplinuxdev> On Wed, 2006-08-02 at 11:45 +0100, Peter (BioPython Dev) wrote: > GFF (General Feature Format) > > http://genome.ucsc.edu/goldenPath/help/customTrack.html#GFF > http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml > > GFF and PTT aren't exactly what I would call sequence files, in that > they don't contain any sequence data. 
Fair point, but GFF3 (see below) can optionally carry sequence data, and I use them for exactly what you say here: > those files could be turned into SeqRecords or SeqFeatures (with empty > sequences). I was thinking that GFF3 would be more useful than GFF: http://song.sourceforge.net/gff3.shtml NCBI have already gone over to this on bacterial genomes, at least, (e.g. ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans/NC_005213.gff), and it's a much richer format than the original specification. Andrew Dalke has already written a GFF3 parser/writer, which is available at http://www.dalkescientific.com/PyGFF3-0.5.tar.gz I've not used this in anger, yet... > Its looks like there is enough overlap between the EMBL and Genbank to > make sharing code between them a good idea. Certainly EMBL was a file > format I was thinking we should try to support. In a scanner/consumer pattern it's easy enough. I've not looked under the hood of the new GenBank parser yet, to see what you've done. Most of my contact with EMBL format is with headerless feature tables and Artemis, which aren't directly similar to GenBank entries. > Reading your other comments, it looks like you wouldn't miss FastaRecord > or GenBank records if they were phased out. Not personally, but others may have strong opinions and breakable code, yet. > Personally, I'm suggesting we try and standardise on having any Sequence > IO framework standardize on returning SeqRecord objects. > > I think we should have a generic "Sequence Iterator" object to do this > which takes a file handle, subclassed for each file format - giving a > "Fasta Iterator", a "Genbank Iterator", a "Clustal Iterator" etc. > I'm inclined not to give any choice of parser object (e.g. > Bio.Fasta.SequenceParser vs Bio.Fasta.RecordParser), and always return a > SeqRecord. It may be a side-issue, but should a Clustal parser return an Alignment object or iterate over SeqRecord objects? 
And for that matter, what about other MSA files in FASTA format? I think we ought to allow parsers to return an Alignment where the user requests it, which is a functionality I'm not currently aware of in the FASTA sequence parsers.

> The individual readers should offer some level of control, for example
> the title2ids function for Fasta files lets the user decide how the
> title line should be broken up into id/name/description. Also for some
> file formats the user should be able to specify the alphabet.

Could the alphabet be optionally specified by the user on parsing, and maybe return a warning or error if there are non-compliant symbols in the file, as a quick validator for bad sequences, or reminder to the occasionally forgetful that, for example, they're not working with nucleotide sequences, today ;)

L.

--
Dr Leighton Pritchard AMRSC
D131, Plant-Pathogen Interactions, Scottish Crop Research Institute
Invergowrie, Dundee, Scotland, DD2 5DA, UK
T: +44 (0)1382 562731 x2405 F: +44 (0)1382 568578
E: lpritc at scri.sari.ac.uk W: http://bioinf.scri.sari.ac.uk/lp
GPG/PGP: FEFC205C E58BA41B http://www.keyserver.net

From biopython-dev at maubp.freeserve.co.uk Wed Aug 2 12:56:23 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Wed, 02 Aug 2006 13:56:23 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <1154517808.4871.93.camel@lplinuxdev>
References: <44CA162F.1040604@maubp.freeserve.co.uk> <44CD5AF2.10708@c2b2.columbia.edu> <44CDDD10.4020904@maubp.freeserve.co.uk> <1154510728.4871.66.camel@lplinuxdev> <44D0824E.30808@maubp.freeserve.co.uk> <1154517808.4871.93.camel@lplinuxdev>
Message-ID: <44D0A0F7.1020402@maubp.freeserve.co.uk>

Leighton Pritchard wrote:
> Fair point, but GFF3 (see below) can optionally carry sequence data,
> and I use them for exactly what you say here:
>
>> maybe those files could be turned into SeqRecords or SeqFeatures
>> (with empty sequences).
>
> I was thinking that GFF3 would be more useful than GFF:
>
> http://song.sourceforge.net/gff3.shtml
>

Thanks for the links... interesting that GFF3 allows embedding Fasta sequences.

>> Reading your other comments, it looks like you wouldn't miss
>> FastaRecord or GenBank records if they were phased out.
>
> Not personally, but others may have strong opinions and breakable
> code, yet.

There is no need to remove the current modules, just mark them as deprecated. Of course, if there is some strong support for these objects then we might not want to be so harsh...

> It may be a side-issue, but should a Clustal parser return an
> Alignment object or iterate over SeqRecord objects? And for that
> matter, what about other MSA files in FASTA format?
> I think we ought to allow parsers to return an Alignment where the user
> requests it, which is a functionality I'm not currently aware of in the
> FASTA sequence parsers.

In my opinion we should offer both. I would go for loading clustal/fasta alignments as sequence iterators (as part of the new SeqIO code) and make it very easy to turn ANY sequence iterator returning SeqRecords into an alignment.

The current alignment object stores its sequences as SeqRecords internally but doesn't (yet) allow simple addition of SeqRecords - that would have to be fixed but it looks easy enough. Accepting a SequenceIterator for __init__ would also be nice.

>> The individual readers should offer some level of control, for
>> example the title2ids function for Fasta files lets the user decide
>> how the title line should be broken up into id/name/description.
>> Also for some file formats the user should be able to specify the
>> alphabet.
>
> Could the alphabet be optionally specified by the user on parsing,
> and maybe return a warning or error if there are non-compliant
> symbols in the file, as a quick validator for bad sequences, or
> reminder to the occasionally forgetful that, for example, they're not
> working with nucleotide sequences, today ;)

For some file formats the parser should be able to deduce the alphabet, but for others, like Fasta, it must be specified.

I like the idea of optionally checking the alphabet - but it would impose a speed penalty. Do you think this should be done by the SeqRecord object (on request)? Each parser could simply ask the SeqRecord object to verify itself before returning it.
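A rough Python 3 sketch of the optional alphabet check discussed above, using plain (name, sequence) tuples instead of SeqRecord objects so that it stands alone. DNA_LETTERS, invalid_symbols and checked are hypothetical names chosen for this example, and the five-letter DNA alphabet is an assumption made for illustration:

```python
DNA_LETTERS = frozenset("ACGTN")  # hypothetical alphabet, for illustration only


def invalid_symbols(sequence, letters):
    """Return the sorted symbols in sequence that are not in letters."""
    return sorted(set(sequence.upper()) - set(letters))


def checked(records, letters=None):
    """Yield (name, sequence) records, validating them only on request.

    With letters=None the records pass straight through, so callers who
    do not ask for the check do not pay its speed penalty.
    """
    for name, sequence in records:
        if letters is not None:
            bad = invalid_symbols(sequence, letters)
            if bad:
                raise ValueError(
                    "%s: symbols not in alphabet: %s" % (name, "".join(bad))
                )
        yield name, sequence


records = [("gene1", "acgtn"), ("prot1", "MKV")]

# No alphabet given: nothing is checked and both records come back.
unchecked = list(checked(records))

# Alphabet given: the protein record is reported as non-compliant.
try:
    list(checked(records, DNA_LETTERS))
    error = None
except ValueError as exc:
    error = str(exc)
```

Raising an error is only one option here; issuing a warning instead would match the other behaviour suggested in the quoted text.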
Peter

From Leighton.Pritchard at scri.ac.uk Wed Aug 2 09:00:20 2006
From: Leighton.Pritchard at scri.ac.uk (Leighton Pritchard)
Date: Wed, 2 Aug 2006 10:00:20 +0100
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44CFBF34.7080106@maubp.freeserve.co.uk>
References: <44CA162F.1040604@maubp.freeserve.co.uk> <44CA27B1.30107@maubp.freeserve.co.uk> <1154339988.1490.81.camel@lplinuxdev> <44CDF3AA.2020308@maubp.freeserve.co.uk> <1154355358.1490.116.camel@lplinuxdev> <44CE1E3C.2050502@maubp.freeserve.co.uk> <1154428959.4871.11.camel@lplinuxdev> <44CFBF34.7080106@maubp.freeserve.co.uk>
Message-ID: <1154509221.4871.40.camel@lplinuxdev>

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL:
-------------- next part --------------
An embedded message was scrubbed...
From: "Leighton Pritchard"
Subject: Re: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
Date: Wed, 2 Aug 2006 10:00:20 +0100
Size: 4641
URL:

From lpritc at scri.sari.ac.uk Wed Aug 2 09:02:03 2006
From: lpritc at scri.sari.ac.uk (Leighton Pritchard)
Date: Wed, 02 Aug 2006 10:02:03 +0100
Subject: [Biopython-dev] [Fwd: Re: Reading sequences: FormatIO, SeqIO, etc]
Message-ID: <1154509323.4871.42.camel@lplinuxdev>

(this time without the signature)

--
Dr Leighton Pritchard AMRSC
D131, Plant-Pathogen Interactions, Scottish Crop Research Institute
Invergowrie, Dundee, Scotland, DD2 5DA, UK
T: +44 (0)1382 562731 x2405 F: +44 (0)1382 568578
E: lpritc at scri.sari.ac.uk W: http://bioinf.scri.sari.ac.uk/lp
GPG/PGP: FEFC205C E58BA41B http://www.keyserver.net
-------------- next part --------------
An embedded message was scrubbed...
From: Leighton Pritchard
Subject: Re: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
Date: Wed, 02 Aug 2006 10:00:20 +0100
Size: 3943
URL:

From mdehoon at c2b2.columbia.edu Fri Aug 4 03:20:18 2006
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Thu, 03 Aug 2006 23:20:18 -0400
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
Message-ID: <44D2BCF2.9010500@c2b2.columbia.edu>

> Question One
> ============
>
> Is reading sequence files an important function to you, and if so
> which file formats in particular (e.g. Fasta, GenBank, ...)
>

I use Fasta, GenBank, and occasionally clustalw.

>
> Question Two - Reading Fasta Files
> ==================================
> Which of the following do you currently use (and why)?:
>
> (a) Bio.Fasta with the RecordParser (giving FastaRecord objects with
>     a title, and the sequence as a string)
> (b) Bio.Fasta with the FeatureParser (giving SeqRecord objects)
> (c) Bio.Fasta with your own parser (Could you tell us more?)
> (d) Bio.SeqIO.FASTA.FastaReader (giving SeqRecord objects)
> (e) Bio.FormatIO (giving SeqRecord objects)
> (f) Other (Could you tell us more?)
I use Bio.Fasta with the RecordParser, but just because it's easy to find in the documentation. As a user, I think Bio.Fasta requires too many steps to be typed in; I would prefer something more straightforward. For the output format, I don't care so much, but for the sake of consistency a SeqRecord may be preferable. > > Question Three - index_file based dictionaries > ============================================== Do you use any of the > following: (a) Bio.Fasta.Dictionary (b) Bio.Genbank.Dictionary (c) > Any other "Martel/Mindy" based dictionary which first requires > creation of an index using the index_file function > No. I never really understood index files. > > Question Four - Record Access... > ================================ > When loading a file with multiple sequences do you use: > > (a) An iterator interface(e.g. Bio.Fasta.Iterator) to give you the > records one by one in the order from the file. > > (b) A dictionary interface (e.g. Bio.Fasta.Dictionary) to give you > random access to the records using their identifier. > > (c) A list giving random access by index number (e.g. load the > records using an iterator but saving them in a list). I use (a). It's easy to create (b) or (c), if needed, if (a) is available. > > Question Four - Fasta files: FastaRecord or SeqRecord > ===================================================== If you use > Fasta files, do you want get records returned as FastaRecords or as > SeqRecords? If SeqRecords, do you use your own title2ids mapping? > > For example, > >> name text text text > ACGTACACGT > > As a FastaRecord this would have: > > FastaRecord.title = "name text text text" (string) > FastaRecord.sequence= "ACGTACACGT" (string) > > As a SeqRecord (with the default title2ids mapping): > > SeqRecord.id = (default string) SeqRecord.name = (default string) > SeqRecord.description = "name text text text" (string) SeqRecord.seq > = Seq("ACGTACACGT", alphabet) I use the FastaRecord, but again for no particular reason. 
I have not experienced an advantage of Seq objects over simple strings, so for me the fact that FastaRecord contains a simple string is more convenient. But it doesn't matter much. > Question Five - GenBank files: GenbankRecord or SeqRecord > ========================================================== If you use > GenBank files, do you use: (a) Bio.Genbank.FeatureParser which > returns SeqRecord objects (b) Bio.Genbank.RecordParser which returns > Bio.GenBank.Record objects > I don't care so much, but I think that having two record types is confusing, so it would be better if we could decide on one. A SeqRecord is more general than a Bio.GenBank.Record, so I have a slight preference for a SeqRecord. > > Question Six - Martel, Scanners and Consumers > ============================================== Some of BioPython's > existing parsers (e.g. those using Martel) use an event/callback > model, where the scanner component generates parsing events which are > dealt with by the consumer component. > > Do any of you use this system to modify existing parser behaviour, or > use it as part of your own personal file parser? > > (a) I don't know, or don't care. I just the the parsers provided. > (b) I use this framework to modify a parser in order to do ... > (please provide details). > (a). Often, I'm just at the Python prompt typing away. What I like about Python and Numerical Python is that the commands are often obvious and easy to remember. With the parser framework, on the other hand, I always need to look up in the documentation how to use them. 
--Michiel

From dag at sonsorol.org Fri Aug 4 10:38:52 2006
From: dag at sonsorol.org (Chris Dagdigian)
Date: Fri, 4 Aug 2006 06:38:52 -0400
Subject: [Biopython-dev] Fwd: contributing comparative genomics tools
References: <22DA57C5-461D-48BE-B524-47108330CD80@chem.ucla.edu>
Message-ID: <9AFBA2D3-B8DF-4337-A54A-019F6EAFFC38@sonsorol.org>

Begin forwarded message:

> From: Christopher Lee
> Date: August 3, 2006 9:11:42 PM EDT
> To: biopython-dev-owner at lists.open-bio.org
> Subject: Fwd: contributing comparative genomics tools
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hi,
> there appears to be an error in your code submission instructions
> on the biopython.org/wiki, or in the configuration of the
> biopython-dev list server. The code submission instructions tell me to
> submit my proposal by email to biopython-dev at biopython.org, but the
> list server responds by saying that all mail will automatically be
> rejected! Please forward this proposal to the appropriate people
> (presumably biopython-dev?), and let me know that you have done
> so. Otherwise I won't have any way of knowing whether anyone even
> reads this email address...
>
> Yours with thanks,
>
> Chris Lee, Dept. of Chemistry & Biochemistry, UCLA
>
> Begin forwarded message:
>
>> You are not allowed to post to this mailing list, and your message has
>> been automatically rejected. If you think that your messages are
>> being rejected in error, contact the mailing list owner at
>> biopython-dev-owner at lists.open-bio.org.
>>
>>
>> From: Christopher Lee
>> Date: August 3, 2006 3:55:52 PM PDT
>> To: biopython-dev at biopython.org
>> Cc: Namshin Kim
>> Subject: contributing comparative genomics tools
>>
>>
>> Hi Biopython developers,
>> I'd like to contribute some Python tools that my lab has been
>> developing for large-scale comparative genomics database query.
>> These tools make it easy to work with huge multigenome alignment
>> databases (e.g. the UCSC Genome Browser multigenome alignments)
>> using a new disk-based interval indexing algorithm that gives very
>> high performance with minimal memory usage. e.g. whereas queries
>> of the UCSC 17-genome alignment typically take about 30 sec. per
>> query using MySQL, the same query takes about 200 microsec. per
>> query, making it possible to run huge numbers of queries for
>> genome-wide studies.
>>
>> Here's an example usage (click the URL or just look at the code below)
>> http://bioinfo.mbi.ucla.edu/pygr_0_5_0/seq-align.html#SECTION000125000000000000000
>>
>> We've tested this code very extensively in our own research, and
>> it has had four open source releases so far. At this point the
>> code is in production use. All the code is compatible back to
>> Python version 2.2, but not 2.1 or before (we use generators).
>> There is C code (accessed as Python classes) for the
>> high-performance interval database index. For details of history see
>> the website
>> http://www.bioinformatics.ucla.edu/pygr
>>
>> There is also extensive tutorial and reference documentation:
>> http://bioinfo.mbi.ucla.edu/pygr_0_5_0/
>>
>> Let me know what questions you have, and what process we would
>> need to follow to contribute this code.
>>
>> Yours with best wishes,
>>
>> Chris Lee, Dept. of Chemistry & Biochemistry, UCLA
>>
>>
>> ####### EXAMPLE USAGE
>> from pygr import cnestedlist
>> msa=cnestedlist.NLMSA('/usr/tmp/ucscDB/mafdb','r') # OPEN THE ALIGNMENT DB
>>
>> def printResults(prefix,msa,site,altID='NULL',cluster_id='NULL',seqNames=None):
>>     'get alignment of each genome to site, print %identity and %aligned'
>>     for src,dest,edge in msa[site].edges(mergeMost=True): # ALIGNMENT QUERY!
>>         print '%s\t%s\t%s\t%s\t%2.1f\t%2.1f\t%s\t%s' \
>>               %(altID,cluster_id,prefix,seqNames[dest],
>>                 100.*edge.pIdentity(),100.*edge.pAligned(),src[:2],dest[:2])
>>
>> def getAlt3Conservation(msa,gene,start1,start2,stop,**kwargs):
>>     'gene must be a slice of a sequence in our genome alignment msa'
>>     ss1=gene[start1-2:start1] # USE SPLICE SITE COORDINATES
>>     ss2=gene[start2-2:start2]
>>     ss3=gene[stop:stop+2]
>>     e1=ss1+ss2 # GET INTERVAL BETWEEN PAIR OF SPLICE SITES
>>     e2=gene[max(start1,start2):stop] # GET INTERVAL BETWEEN e1 AND stop
>>     zone=e1+ss3 # USE zone AS COVERING INTERVAL TO BUNDLE fastacmd REQUESTS
>>     cache=msa[zone].keys(mergeMost=True) # PYGR BUNDLES REQUESTS TO MINIMIZE TRAFFIC
>>     for prefix,site in [('ss1',ss1),('ss2',ss2),('ss3',ss3),('e1',e1),('e2',e2)]:
>>         printResults(prefix,msa,site,seqNames=~(msa.seqDict),**kwargs)
>>
>> # RUN A QUERY LIKE THIS...
>> # getAlt3Conservation(msa,some_gene,some_start,other_start,stop)
>>
>> ############ EXPLANATION & NOTES
>> David Haussler's group has constructed alignments of multiple
>> genomes. These alignments are extremely useful and interesting,
>> but so large that it is cumbersome to work with the dataset using
>> conventional methods. For example, for the 8-genome alignment you
>> have to work simultaneously with the individual genome datasets
>> for human, chimp, mouse, rat, dog, chicken, fugu and zebrafish, as
>> well as the huge alignment itself. Pygr makes this quite easy.
>> Here we illustrate an example of mapping an alternative 3' exon,
>> which has two alternative splice sites (start1 and start2) and a
>> single terminal splice site (stop). We use the alignment database
>> to map each of these splice sites onto all the aligned genomes,
>> and to print the percent-identity and percent-aligned for each
>> genome, as well as the two nucleotides constituting the splice site
>> itself.
To examine the conservation of the two exonic regions >> (between start1 and start2, and the adjacent region terminated by >> stop, we print the same information for each genome's alignment to >> these two regions as well. The code first opens the alignment >> database. The function (getAlt3Conservation) obtains sequence >> slice objects representing the various ``sites'' to be queried. >> The actual alignment database query is performed in printResults: >> >> * The alignment database query is in the first line of >> printResults(). msa is the database; site is the interval query; >> and the edges methods iterates over the results, returning a tuple >> for each, consisting of a source sequence interval (i.e. an >> interval of site), a destination sequence interval (i.e. an >> interval in an aligned genome), and an edge object describing that >> alignment. We are taking advantage of Pygr's group-by operator >> mergeMost, which will cause multiple intervals in a given sequence >> to be merged into a single interval that constitutes their >> ``union''. Thus, for each aligned genome, the edges iterator will >> return a single aligned interval. The alignment edge object >> provides some useful conveniences, such as calculating the percent- >> identity between src and dest automatically for you. pIdentity() >> computes the fraction of identical residues; pAligned computes the >> fraction of aligned residues (allowing you to see if there are big >> gaps or insertions in the alignment of this interval). If we had >> wanted to inspect the detailed alignment letter by letter, we >> would just iterate over the letters attribute instead of the edges >> method. (See the NLMSASlice documentation for further information). >> >> * src[:2] and dest[:2] print the first two nucleotides of the >> site in gene and in the aligned genome. 
>> >> * it's worth noting that the actual sequence string >> comparisons are being done using a completely different database >> mechanism (formerly NCBI's fastacmd, now our own (much faster) >> pureseq text format), not the cnestedlist database. Basically, >> each genome is being queried as a separate BLAST formatted >> database, represented in Pygr by the BlastDB class. Pygr makes >> this complex set of multi-database operations more or less >> transparent to the user. For further information, see the BlastDB >> documentation. >> >> * The other operations here are entirely vanilla: mainly >> slicing a gene sequence to obtain the specific sites that we want >> to query. Note: gene must itself be a slice of a sequence in our >> alignment, or the alignment query msa[site] will raise an >> IndexError informing the user that the sequence site is not in the >> alignment. >> >> * The only slightly interesting operation here is the use of >> interval addition to obtain the ``union'' of two intervals, e.g. >> e1=ss1+ss2. This obtains a single interval that contains both of >> the input intervals. >> >> * When the print statement requests str() representations of >> these sequence objects, Pygr uses fastacmd -L to extract just the >> right piece of the corresponding chromosomes from the eight BLAST >> databases. >> >> (Actually, because of Pygr's caching / optimizations, considerably >> more is going on than indicated in this simplified sketch. But you >> get the idea: Pygr makes it relatively effortless to work with a >> variety of disparate (and large) resources in an integrated way.) 
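A minimal sketch of the interval "union" idea mentioned above, in plain Python rather than Pygr. The Interval class and the coordinates are made up for illustration and say nothing about how Pygr represents intervals internally:

```python
# Hypothetical stand-in for a sequence interval; this is NOT Pygr's class.
class Interval(object):
    def __init__(self, start, stop):
        if start > stop:
            raise ValueError("start must not exceed stop")
        self.start = start
        self.stop = stop

    def __add__(self, other):
        # "Union" in the sense used above: the smallest single interval
        # that contains both inputs (any gap between them is covered too).
        return Interval(min(self.start, other.start),
                        max(self.stop, other.stop))

    def __repr__(self):
        return "Interval(%d, %d)" % (self.start, self.stop)


ss1 = Interval(98, 100)   # two bases before a made-up start1 = 100
ss2 = Interval(173, 175)  # two bases before a made-up start2 = 175
e1 = ss1 + ss2            # covers both splice sites and everything between
```

Here e1 spans 98 to 175, which mirrors the e1=ss1+ss2 line in the example: one interval running from the first splice site to the second.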
>> >> Here is some example output: >> >> 1 Mm.99996 ss1 hg17 50.0 100.0 AG GG >> 1 Mm.99996 ss1 canFam1 50.0 100.0 AG GG >> 1 Mm.99996 ss1 panTro1 50.0 100.0 AG GG >> 1 Mm.99996 ss1 rn3 100.0 100.0 AG AG >> 1 Mm.99996 ss2 hg17 100.0 100.0 AG AG >> 1 Mm.99996 ss2 canFam1 100.0 100.0 AG AG >> 1 Mm.99996 ss2 panTro1 100.0 100.0 AG AG >> 1 Mm.99996 ss2 rn3 100.0 100.0 AG AG >> 1 Mm.99996 ss3 hg17 100.0 100.0 GT GT >> 1 Mm.99996 ss3 canFam1 100.0 100.0 GT GT >> 1 Mm.99996 ss3 panTro1 100.0 100.0 GT GT >> 1 Mm.99996 ss3 rn3 100.0 100.0 GT GT >> 1 Mm.99996 e1 hg17 78.9 100.0 AG GG >> 1 Mm.99996 e1 canFam1 84.2 100.0 AG GG >> 1 Mm.99996 e1 panTro1 77.6 100.0 AG GG >> 1 Mm.99996 e1 rn3 97.4 98.7 AG AG >> 1 Mm.99996 e2 hg17 91.6 99.1 CC CC >> 1 Mm.99996 e2 canFam1 88.8 99.1 CC CC >> 1 Mm.99996 e2 panTro1 91.6 99.1 CC CC >> 1 Mm.99996 e2 rn3 97.2 100.0 CC CC >> >> >> >> >> -----BEGIN PGP SIGNATURE----- >> Version: GnuPG v1.4.2.2 (Darwin) >> >> iD8DBQFE0n8GLQ4dB3bqQz4RApcxAKCIHdZ9mttB1uC4HkY3xXEw1cWYswCeIg4i >> xhxE2zrffLaiCjSiEp4Eo6k= >> =BeOe >> -----END PGP SIGNATURE----- >> >> > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.2.2 (Darwin) > > iD8DBQFE0p7iLQ4dB3bqQz4RAkzJAJ4wxiZqi7lZGBUMTFwyquGOCajiKQCfUDBm > Wx/4AIstFjb+rbqY2QBppLg= > =fghY > -----END PGP SIGNATURE----- From biopython-dev at maubp.freeserve.co.uk Sat Aug 12 08:25:41 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Sat, 12 Aug 2006 09:25:41 +0100 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc Message-ID: <320fb6e00608120125t53f11b6et2bde5154e5bc19ec@mail.gmail.com> I've having a few issues with my email setup which is why I haven't replied recently. A week ago I filed bug 2059 for this discussion, and attached some code: http://bugzilla.open-bio.org/show_bug.cgi?id=2059 I'm interested in your feedback - from the framework down to if you don't like the class names for example. 
Peter

From krewink at inb.uni-luebeck.de Wed Aug 16 12:44:07 2006
From: krewink at inb.uni-luebeck.de (Albert Krewinkel)
Date: Wed, 16 Aug 2006 14:44:07 +0200
Subject: [Biopython-dev] Fwd: contributing comparative genomics tools
Message-ID: <20060816124407.GF12386@pc09.inb.uni-luebeck.de>

Hello,

I read Peter's SeqIO/__init__.py replacement and if I may say so: I
love it. Thanks a lot for this! Still, there are some things I'd
like to talk about.

The _parse_genbank_features function could also be used to parse EMBL
or DDBJ features, therefore I think it should be named differently.

Since there is a lot of clean up effort right now: How about moving
the SeqRecord and SeqFeature objects into the Bio.Seq module? They
are closely related and separate modules only clutter the namespace.

To me, this seems to be a general problem. It's very difficult to find
a tool to use for a certain problem if one doesn't already know what
to look for. I'd very much favour creating modules like
Bio.structure to group modules like Bio.PDB and Bio.NMR etc. This is
a very big change, and therefore I'd like to follow Marc's suggestion
of splitting off a branch. In general, I pretty much agree with what
Marc said in his .

I cannot estimate how much work it would be to maintain two separate
biopython distributions, so please forgive me if I re-suggest
something completely idiotic here. I just don't believe there is much
that could be lost that way.
Cheers, Albert -- Albert Krewinkel University of Luebeck, Institute for Neuro- and Bioinformatics From biopython-dev at maubp.freeserve.co.uk Wed Aug 16 14:00:36 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Wed, 16 Aug 2006 15:00:36 +0100 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc Message-ID: <320fb6e00608160700t33ed0f4ds7311124d70781a64@mail.gmail.com> (I changed the subject to that of the previous discussion, as this isn't really about "contributing comparative genomics tools") Albert Krewinkel wrote: > Hello, > > I read Peter's SeqIO/__init__.py replacement and if I may say so: I > love it. Thanks a lot for this! Still, there are some things I'd > like to talk about. Thank you :) The code is on Bug 2059 for anyone who hasn't looked yet. http://bugzilla.open-bio.org/show_bug.cgi?id=2059 > The _parse_genbank_features function could also be used to parse embl > or ddjb features, therefore I think it should be named differently. First of all, that bit of code is for a new feature which I personally wanted - to be able to iterate over CDS features in a genbank file. But yes, I did have in mind that it (and the GenBank parser) could be re-used to deal with EMBL files. I have not yet taken the time to learn the EMBL file format and how it corresponds to the GenBank file format - but I agree a lot of the code could be shared. > Since there is a lot of clean up effort right now: How about moving > the SeqRecord and SeqFeature objects into the Bio.Seq module? They > are closely related and seperate modules only clutter the namespace. What real benefit does that give us? It will cause a certain amount of upheaval in the short term as people will have to change their import statements on existing scripts. If we do start a new branch for "big changes" then I have no real problem with this suggest. > To me, this seems to be a general problem. 
> It's very difficult to find
> a tool to use for a certain problem if one doesn't already know what
> to look for. I'd very much favour creating modules like
> Bio.structure to group modules like Bio.PDB and Bio.NMR etc. This is
> a very big change, and therefore I'd like to follow Marc's suggestion
> of splitting off a branch. In general, I pretty much agree with what
> Marc said in his .
>
> I cannot estimate how much work it would be to maintain two separate
> biopython distributions, so please forgive me if I re-suggest
> something completely idiotic here. I just don't believe there is much
> that could be lost that way.

BioPython probably would benefit from a little reorganising - and for
anything drastic like moving entire modules about, a new branch makes
sense. On the other hand, do we have the man-power to do it? Are any
of the developers familiar with all of (or even most of) the existing
modules? I would guess I have used less than half of the modules - I
have looked at the very basics of Bio.PDB for example, but have never
tried Bio.NMR.

I would favour gradual incremental (and backwards compatible) changes,
such as adding a new sequence reading module and then marking the old
code as deprecated.

For examples of some small changes, have any of you looked at:

Bug 2057 - SeqRecord has no __str__ or __repr__
http://bugzilla.open-bio.org/show_bug.cgi?id=2057

Bug 1963 - Adding __str__ method to codon tables and translators
http://bugzilla.open-bio.org/show_bug.cgi?id=1963

Little things in themselves that I think would help.
Peter From krewink at inb.uni-luebeck.de Wed Aug 16 14:44:36 2006 From: krewink at inb.uni-luebeck.de (Albert Krewinkel) Date: Wed, 16 Aug 2006 16:44:36 +0200 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc In-Reply-To: <320fb6e00608160700t33ed0f4ds7311124d70781a64@mail.gmail.com> References: <320fb6e00608160700t33ed0f4ds7311124d70781a64@mail.gmail.com> Message-ID: <20060816144436.GG12386@pc09.inb.uni-luebeck.de> On Wed, Aug 16, 2006 at 03:00:36PM +0100, Peter wrote: > Albert Krewinkel wrote: > >The _parse_genbank_features function could also be used to parse embl > >or ddjb features, therefore I think it should be named differently. > > First of all, that bit of code is for a new feature which I personally > wanted - to be able to iterate over CDS features in a genbank file. > > But yes, I did have in mind that it (and the GenBank parser) could be > re-used to deal with EMBL files. I have not yet taken the time to > learn the EMBL file format and how it corresponds to the GenBank file > format - but I agree a lot of the code could be shared. I will try to build something similar for EMBL files within the next days. This should be easy, since features really should look the same in both formates: http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html > >Since there is a lot of clean up effort right now: How about moving > >the SeqRecord and SeqFeature objects into the Bio.Seq module? They > >are closely related and seperate modules only clutter the namespace. > > What real benefit does that give us? It will cause a certain amount > of upheaval in the short term as people will have to change their > import statements on existing scripts. If we do start a new branch > for "big changes" then I have no real problem with this suggest. Agree. > >To me, this seems to be a general problem. It's very difficult to find > >a tool to use for a certain problem if one doesn't allready know what > >to look for. 
I'd pretty much favour to create modules like > >Bio.structure to group modules like Bio.PDB and Bio.NMR etc. This is > >a very big change, and therefore I'd like to follow Marc's suggestion > >of splitting off a branch. In general, I pretty much agree with what > >Marc said in his . > > > >I cannot estimate how much work it would be to maintain two separate > >biopython distributions, so please forgive me if I re-suggest > >something completely idiotic here. I just don't believe there is much > >that could be lost that way. > > BioPython probably would benefit from a little reorganising - and for > anything drastic like moving entire modules about, a new branch makes > sense. On the other hand, do we have the man-power to do it? Are any > of the developers familiar with all of (or even most of) the existing > modules? I would guess I have used less than half of the modules - I > have looked at the very basics of Bio.PDB for example, but have never > tried Bio.NMR I attached a file which I created when I was teaching myself biopython. It provides a basic grouping for the current biopython modules. Naturaly, it's by no means complete and probably wrong in some places. > I would favour gradual incremental (and backwards compatible) changes. > Such as adding a new sequence reading module and then marking the old > code as depreciated. I think we could do both: A new branch might make it easier to see which modules are usefull the way they are and which are not. Even if this seperate branch never is released itself, it still would be handy for reorganising coordination. > For example of some small changes, have any of you looked at: > > Bug 2057 - SeqRecord has no __str__ or __repr__ > http://bugzilla.open-bio.org/show_bug.cgi?id=2057 > > Bug 1963 - Adding __str__ method to codon tables and translators > http://bugzilla.open-bio.org/show_bug.cgi?id=1963 > > Little things in themselves that I think would help. True. 
My (naive) hope is, that such things would be by-products of a new branch. I have to admit, that this is probably not possible without doing a code sprint. Albert -- Albert Krewinkel University of Luebeck, Institute for Neuro- and Bioinformatics -------------- next part -------------- Databases: o NCBI - UniGene - GenBank - PubMed - Entrez - LocusLink - Geo o Kabat o KEGG o SwissProt o Medline o biblio (pywebsvcs dependency is mentioned only in the module itself) o dbdefs o InterPro o Gobase o Enzyme o Rebase Models and Simulations: o Ais o MetaTool o Pathway o ECell Algorigthms, Machine Learning and Pattern Recognition: o HMM o NeuralNetwork o Cluster o LogisticRegression, Statistics o GA o MarkovModel o pairwise2 o NaiveBayes o MaxEntropy Alignments: o Align o Blast o AlignAce o Clusalw o Fasta o FSSP o SubsMat o Search (WUBLAST output) o Saf o IntelliGenetics Applications: o Application o Emboss o Nexus o AlignAce o Blast o MEME o Sequencing o Wise Data Structures: o KDTree o trie Sequences: o GFF o Seq o SeqUtils o SeqFeature o SeqRecord o Alphabet o Transcribe o Translate o lcc o Encodings o Data o NBRF SeqIO: o writers o Writer o SeqIO o builders o Fasta o Index Utilities: o utils.py o ParserSupport o File o Tools o Mindy o HotRand o config o formatdefs o MarkupEditor o DocSQL (wouldn't usage of SQL-Object be nicer? 
(if possible)) o EUtils.ReseakFile o Std, StdHandler o PropertyManager o MultiProc o Decode o FilteredReader Graphics: o Graphics Web-Based: o GenBank o NetCache o EUtils o WWW Microarrays: o Affy Structure: o NMR o PDB o Crystal o Ndb o SCOP o SVDSuperimposer Motives: o MEME o Prosite o CDD o Compass References: o Medline, PubMed o DBXref Restriction: o Restriction o CAPS From biopython-dev at maubp.freeserve.co.uk Wed Aug 16 16:05:12 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev)) Date: Wed, 16 Aug 2006 17:05:12 +0100 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc In-Reply-To: <20060816144436.GG12386@pc09.inb.uni-luebeck.de> References: <320fb6e00608160700t33ed0f4ds7311124d70781a64@mail.gmail.com> <20060816144436.GG12386@pc09.inb.uni-luebeck.de> Message-ID: <44E34238.2010508@maubp.freeserve.co.uk> Albert Krewinkel wrote: >>> The _parse_genbank_features function could also be used to parse embl >>> or ddjb features, therefore I think it should be named differently. Peter wrote: >> First of all, that bit of code is for a new feature which I personally >> wanted - to be able to iterate over CDS features in a genbank file. >> >> But yes, I did have in mind that it (and the GenBank parser) could be >> re-used to deal with EMBL files. I have not yet taken the time to >> learn the EMBL file format and how it corresponds to the GenBank file >> format - but I agree a lot of the code could be shared. Albert Krewinkel wrote: > I will try to build something similar for EMBL files within the next > days. This should be easy, since features really should look the same > in both formates: > > http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html > Oh - you meant just adding EMBL feature iteration. I want thinking about the larger task of full EMBL file reading. Doing just the features is very easy, here you go: http://bugzilla.open-bio.org/show_bug.cgi?id=2059#c2 Any more feedback is very welcome. 
Are you using the iterators directly, or via the helper function
File2SequenceIterator?

Are you using just the sequence iterators, or the dictionary and list
versions too?

Peter

From biopython-dev at maubp.freeserve.co.uk Wed Aug 16 22:20:28 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Wed, 16 Aug 2006 23:20:28 +0100
Subject: [Biopython-dev] Tweaking the SeqRecord class
Message-ID: <320fb6e00608161520j5fb6b4fejd7aa8cc839989423@mail.gmail.com>

In the spirit of gradual improvements, I had a look at the SeqRecord
class.

First of all, is there any comment on my suggestion to add __str__ and
__repr__ methods to the SeqRecord object, bug 2057:

http://bugzilla.open-bio.org/show_bug.cgi?id=2057

Next, I'd like to check in some basic __doc__ strings for the
SeqRecord class, e.g. something like this:

>>> from Bio.SeqRecord import SeqRecord
>>> print SeqRecord.__doc__
The SeqRecord object is designed to hold a sequence and information
about it.

Main properties:
id          - Identifier such as a locus tag (string)
seq         - The sequence itself (Seq object)

Additional properties:
name        - Sequence name, e.g. gene name (string)
description - Additional text (string)
dbxrefs     - List of database cross references (list of strings)
features    - Any (sub)features defined (list of SeqFeature objects)
annotations - Further information (dictionary)

I would also like to add doc strings to the id, seq, name, ...
themselves. However, they are currently stored as attributes so this
isn't possible. See PEP 0224, http://www.python.org/dev/peps/pep-0224/

However, we could use the Python 2.2 "property" function to implement
these as properties. The code might be clearer using the Python 2.4
"decorator" syntax, but I don't think we should depend on such a
recent version of Python yet.
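As a rough illustration of the property idea (a sketch only, not the attached SeqRecord.py; the class name LazyRecord and the private _features attribute are made up for this example), a property can carry its own doc string and put off creating the list until it is first asked for:

```python
class LazyRecord(object):
    """Toy record; LazyRecord is a made-up name, not Bio.SeqRecord."""

    def __init__(self, seq, id="<unknown id>"):
        self.seq = seq
        self.id = id
        self._features = None  # nothing allocated until first access

    def _get_features(self):
        # Create the list only when somebody actually asks for it.
        if self._features is None:
            self._features = []
        return self._features

    def _set_features(self, value):
        self._features = value

    # property() has been a built-in since Python 2.2; no decorator needed.
    features = property(
        _get_features, _set_features,
        doc="Annotations about parts of the sequence (list of SeqFeatures)")


rec = LazyRecord("ACGT")
untouched = rec._features      # still None at this point
rec.features.append("cds")     # first access creates the list
```

Because features is defined on the class, it also turns up in dir(LazyRecord) and help(LazyRecord), and its doc string can be read from LazyRecord.features.__doc__.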
Using properties would allow this usage: >>> print SeqRecord.features.__doc__ Annotations about parts of the sequence (list of SeqFeatures) It would also mean that these properties show up in dir(SeqRecord) and help(SeqRecord), which all in all should make the object slightly easier to use. Finally, using get/set property functions allows us to postpone creation of string/list/dict objects for unused properties. This does actually seem to bring a slight improvement to the timings for Fasta file parsing discussed last month. If you recall, for the fastest parsers turning the data into SeqRecord and Seq objects imposed a fairly large overhead (compared to just using strings): http://lists.open-bio.org/pipermail/biopython-dev/2006-July/002407.html I would be interested to see how those numbers change with the attached implementation - if you wouldn't mind please Leighton... ;) I have attached a version of SeqRecord.py which implements the changes I have described. The backwards compatibility if statement is a bit ugly - can we just assume Python 2.2 or later? Peter -------------- next part -------------- A non-text attachment was scrubbed... Name: SeqRecord.py Type: text/x-script.phyton Size: 9367 bytes Desc: not available URL: From mdehoon at c2b2.columbia.edu Thu Aug 17 01:39:12 2006 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Wed, 16 Aug 2006 21:39:12 -0400 Subject: [Biopython-dev] Tweaking the SeqRecord class In-Reply-To: <320fb6e00608161520j5fb6b4fejd7aa8cc839989423@mail.gmail.com> References: <320fb6e00608161520j5fb6b4fejd7aa8cc839989423@mail.gmail.com> Message-ID: <44E3C8C0.5070200@c2b2.columbia.edu> Peter wrote: > First of all, is there any comment on my suggestion to add __str__ and > __repr__ methods to the SeqRecord object, bug 2057: > > http://bugzilla.open-bio.org/show_bug.cgi?id=2057 Here's a thought: What if Seq were to inherit from str, and SeqRecord from Seq? Then, you get these for free. 
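To make the idea concrete, here is a toy sketch of a Seq that subclasses str and a SeqRecord that subclasses Seq. These are not Biopython's actual classes, and the constructor arguments are assumptions chosen for illustration:

```python
class Seq(str):
    """Toy Seq: a string that also carries an alphabet (hypothetical)."""

    def __new__(cls, data, alphabet=None):
        # str is immutable, so extra state is attached in __new__.
        obj = str.__new__(cls, data)
        obj.alphabet = alphabet
        return obj


class SeqRecord(Seq):
    """Toy SeqRecord: a Seq with an identifier (hypothetical)."""

    def __new__(cls, data, id="<unknown id>", alphabet=None):
        obj = Seq.__new__(cls, data, alphabet)
        obj.id = id
        return obj


rec = SeqRecord("ATGGCC", id="demo1")
text = str(rec)        # inherited from str, no extra code needed
prefix = rec[:3]       # slicing is inherited too
```

With this, str(rec), repr(rec) and len(rec) all come from str. One caveat the sketch exposes: inherited operations such as slicing hand back a plain str, so the id and alphabet are lost unless those methods are overridden.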
> Next, I'd like to check in some basic __doc__ strings for the
> SeqRecord class, e.g. something like this:

Sounds good to me. Pretty amazing, actually, that SeqRecord doesn't have
documentation.

> If you recall, for the fastest parsers turning the data into SeqRecord
> and Seq objects imposed a fairly large overhead (compared to just
> using strings):
>
> http://lists.open-bio.org/pipermail/biopython-dev/2006-July/002407.html

I wonder if this is still true if a Seq object and a SeqRecord object
inherit from string. From the code, I don't see where the overhead comes
from.

> The backwards compatibility if statement is a bit
> ugly - can we just assume Python 2.2 or later?

Biopython currently requires Python 2.3 or later.

--Michiel.

From krewink at inb.uni-luebeck.de Thu Aug 17 07:25:34 2006
From: krewink at inb.uni-luebeck.de (Albert Krewinkel)
Date: Thu, 17 Aug 2006 09:25:34 +0200
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <44E34238.2010508@maubp.freeserve.co.uk>
References: <320fb6e00608160700t33ed0f4ds7311124d70781a64@mail.gmail.com> <20060816144436.GG12386@pc09.inb.uni-luebeck.de> <44E34238.2010508@maubp.freeserve.co.uk>
Message-ID: <20060817072534.GH12386@pc09.inb.uni-luebeck.de>

Peter wrote:
> Oh - you meant just adding EMBL feature iteration. I was thinking
> about the larger task of full EMBL file reading.

I started working on that, but I'm not very far yet.

> Doing just the features is very easy, here you go:
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2059#c2

Wow, that was quick. And it works almost perfectly. One exception:
In _parse_embl_or_genbank_feature(), when parsing the location, it
should say something like

    from string import digits
    # Keep reading while the location does not yet end in ')' or a digit.
    while feature_location[-1] not in ')' + digits:
        line = iterator.next()
        feature_location += line[FEATURE_QUALIFIER_INDENT:].strip()

This way, features may have multiline join(...) positions.

> Any more feedback is very welcome.
> Are you using the iterators
> directly, or via the helper function File2SequenceIterator?

I'm using iterators directly, out of old habits. But most likely I
will finally get addicted to your nice helper function.

> Are you using just the sequence iterators, or the dictionary and list
> versions too?

I don't use those yet.

Albert

--
Albert Krewinkel
University of Luebeck, Institute for Neuro- and Bioinformatics

From mcolosimo at mitre.org Thu Aug 17 12:08:24 2006
From: mcolosimo at mitre.org (Marc Colosimo)
Date: Thu, 17 Aug 2006 08:08:24 -0400
Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc
In-Reply-To: <320fb6e00608120125t53f11b6et2bde5154e5bc19ec@mail.gmail.com>
References: <320fb6e00608120125t53f11b6et2bde5154e5bc19ec@mail.gmail.com>
Message-ID:

Peter,

Nice quick work on that. For Clustal, I think it should NOT be an
Iterator, but there should be SequenceDict or SequenceList for it.
There are other alignment filetypes out there that could use a
SequenceIterator (those that are not interlaced). From looking over
your code, it seems like it would be easy to add a check in
File2SequenceDict/List to check for Clustal types and do something
"special".

Marc

On Aug 12, 2006, at 4:25 AM, Peter wrote:

> I'm having a few issues with my email setup which is why I haven't
> replied recently.
>
> A week ago I filed bug 2059 for this discussion, and attached some
> code:
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2059
>
> I'm interested in your feedback - from the framework down to if you
> don't like the class names for example.
> > Peter From biopython-dev at maubp.freeserve.co.uk Thu Aug 17 13:25:07 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev)) Date: Thu, 17 Aug 2006 14:25:07 +0100 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc In-Reply-To: References: <320fb6e00608120125t53f11b6et2bde5154e5bc19ec@mail.gmail.com> Message-ID: <44E46E33.3090001@maubp.freeserve.co.uk> Marc Colosimo wrote: > Peter, > > Nice quick work on that. For Clustal, I think it should NOT be an > Iterator, but there should be SequenceDict or SequenceList for it. > There are other alignment filetypes out there that could use a > SequenceIterator (those that are not interlaced). From looking over > your code, it seem like it would be easy to add a check in > File2SequenceDict/List to check for Clustal types and do something > "special" Yes, I was thinking wondering about that too. For interlaced file formats (such as clustalw, NEXUS multiple alignment format) we have to load the whole file into memory anyway - so using a SequenceIterator was a bit odd. What I was trying to do was use a SequenceIterator as the lowest common denominator - the ClustalIterator shows that this can be done for interlaced files, and seems to work. Its trivial to "upgrade" the ClustalIterator to a SequenceDict or SequenceList if that's what is needed. The way I wrote the ClustalIterator it actually reads the whole file and stores a list of IDs and a dictionary mapping the ID to the sequence string. It creates SeqRecord objects only on request. This should use less memory than a full list of every SeqRecord (but I have not measured this). Note that I would also want to add an easy way to turn any SequenceIterator, SequenceList or SequenceDict into a multiple alignment object. Out of interest, what are the largest alignments you deal with? I was planning to add a Stockholm parser (where the sequences themselves are non-interleaved). 
The PFAM database alignments use this, and are the largest alignments I am aware of. However, the format supports per sequence annotation information and this information can be rather spread out. Looking at a real example from PFAM, there were blocks of such data both before and after the sequences. The format suggest that such annotation might also be found next to each sequence. i.e. An annotation free Stockholm iterator would be easy, but including the meta data would in general require loading the whole file. http://www.cgb.ki.se/cgb/groups/sonnhammer/Stockholm.html It looks like a subclassed version could be written to handle the PFAM annotations nicely. Peter From mcolosimo at mitre.org Thu Aug 17 12:24:24 2006 From: mcolosimo at mitre.org (Marc Colosimo) Date: Thu, 17 Aug 2006 08:24:24 -0400 Subject: [Biopython-dev] Fwd: contributing comparative genomics tools In-Reply-To: <20060816124407.GF12386@pc09.inb.uni-luebeck.de> References: <20060816124407.GF12386@pc09.inb.uni-luebeck.de> Message-ID: <9A739306-6B91-4E43-87F8-EC464784B4B2@mitre.org> On Aug 16, 2006, at 8:44 AM, Albert Krewinkel wrote: > Hello, > > I read Peter's SeqIO/__init__.py replacement and if I may say so: I > love it. Thanks a lot for this! Still, there are some things I'd > like to talk about. > > The _parse_genbank_features function could also be used to parse embl > or ddjb features, therefore I think it should be named differently. > > > Since there is a lot of clean up effort right now: How about moving > the SeqRecord and SeqFeature objects into the Bio.Seq module? They > are closely related and seperate modules only clutter the namespace. > The top namespace is sort of a mess of things. > To me, this seems to be a general problem. It's very difficult to find > a tool to use for a certain problem if one doesn't allready know what > to look for. I'd pretty much favour to create modules like > Bio.structure to group modules like Bio.PDB and Bio.NMR etc. I second this. 
> This is > a very big change, and therefore I'd like to follow Marc's suggestion > of splitting off a branch. In general, I pretty much agree with what > Marc said in his . > > I cannot estimate how much work it would be to maintain two seperate > biopython distributions, so please forgive me if I re-suggest > something completely idiotic here. I just don't believe there is much > that could be lost that way. I've done this for my internal work, but I never went back to see how to check out the other branch (I had no need). CVS is sometimes a bear to work with. SVN is supposed to handle branches much better, but I can't access SVN repositories that are not through HTTPS (SSL). Stupid corporate proxy is currently not set up to handle external webDAV. This might be a pain for a little while until the next full version is released, but I think the benefits of doing this now far outweigh the short term pain (of course I'm not an admin who has to build the releases). Marc From biopython-dev at maubp.freeserve.co.uk Thu Aug 17 15:13:40 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev)) Date: Thu, 17 Aug 2006 16:13:40 +0100 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc In-Reply-To: <20060817072534.GH12386@pc09.inb.uni-luebeck.de> References: <320fb6e00608160700t33ed0f4ds7311124d70781a64@mail.gmail.com> <20060816144436.GG12386@pc09.inb.uni-luebeck.de> <44E34238.2010508@maubp.freeserve.co.uk> <20060817072534.GH12386@pc09.inb.uni-luebeck.de> Message-ID: <44E487A4.8040106@maubp.freeserve.co.uk> Albert Krewinkel wrote: > Peter wrote: > >>Oh - you meant just adding EMBL feature iteration. I was thinking >>about the larger task of full EMBL file reading. > > I started working on that, but I'm not very far yet. Are you starting from Bio.GenBank or from scratch? I would point out that the code in Bio.GenBank was inserted into what was once a Martel based parser, and designed to be a transparent change for the end user.
What I would like to do is recycle that code into a new far simpler SeqIO GenBank parser which would only return SeqRecords. In particular I would get rid of the whole scanner/consumer model with all its function callbacks. At this point I would try and handle both GenBank and EMBL files together. I expect this to be faster, and easier to understand. It would be a lot less flexible for the "power user", but then so is all the new SeqIO code I have been writing. >>Doing just the features is very easy, here you go: >> >>http://bugzilla.open-bio.org/show_bug.cgi?id=2059#c2 > > Wow, that was quick. Well, I did have something along these lines planned in advance - that's why my parse function was outside the GenbankCdsFeatureIterator class. > And it's works allmost perfectly. One exception: > In _parse_embl_or_genbank_feature(), when parsing the location, it > shoudl say something like > > > from string import digits > while feature_location[-1] not in (')', digits): > line = iterator.next() > feature_location += line[FEATURE_QUALIFIER_INDENT:].strip() > > > This way, features may have multiline join(...) positions. Good point, something I was aware of and coped with in Bio.GenBank but hadn't done in the CDS iterator. Thanks for pointing this out. This affects both GenBank and EMBL files by the way. My code is very similar but I included an assert to check the indent, and I only check for a trailing comma. This works on all the files I have tried.
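For anyone following along, here is a minimal sketch of the multi-line location handling described above. It is not the code attached to bug 2059: the function name `read_feature_location`, the indent value of 21 and the sample lines are all assumptions made for illustration, and it is written for a current Python rather than the 2.3 of this thread. One thing worth noting about the quoted loop: `feature_location[-1] not in (')', digits)` tests membership in a two-item tuple, so a single digit character never matches the whole `digits` string. Checking for a trailing comma (or unbalanced brackets) sidesteps that.

```python
FEATURE_QUALIFIER_INDENT = 21  # assumed column where location/qualifier text starts


def read_feature_location(first_location_text, line_iterator):
    """Join a feature location that is wrapped over several lines.

    Continuation lines are consumed while the text so far ends with a
    comma or still has an unclosed bracket.
    """
    location = first_location_text.strip()
    while location.endswith(",") or location.count("(") > location.count(")"):
        line = next(line_iterator)
        # Continuation lines should be blank up to the qualifier indent.
        assert line[:FEATURE_QUALIFIER_INDENT].strip() == "", line
        location += line[FEATURE_QUALIFIER_INDENT:].strip()
    return location


# Hypothetical wrapped join(...) location followed by the first qualifier line.
lines = iter([
    " " * 21 + "1200..1500,",
    " " * 21 + "1700..1800)",
    " " * 21 + '/gene="example"',
])
location = read_feature_location("join(100..200,300..400,", lines)
remaining = next(lines).strip()
```

The iterator is left positioned on the first qualifier line, so the caller can carry on parsing qualifiers from there.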
Peter From krewink at inb.uni-luebeck.de Thu Aug 17 17:41:06 2006 From: krewink at inb.uni-luebeck.de (Albert Krewinkel) Date: Thu, 17 Aug 2006 19:41:06 +0200 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc In-Reply-To: <44E487A4.8040106@maubp.freeserve.co.uk> References: <320fb6e00608160700t33ed0f4ds7311124d70781a64@mail.gmail.com> <20060816144436.GG12386@pc09.inb.uni-luebeck.de> <44E34238.2010508@maubp.freeserve.co.uk> <20060817072534.GH12386@pc09.inb.uni-luebeck.de> <44E487A4.8040106@maubp.freeserve.co.uk> Message-ID: <20060817174106.GI12386@pc09.inb.uni-luebeck.de> Peter wrote: > > Peter wrote: > >>Oh - you meant just adding EMBL feature iteration. I was thinking > >>about the larger task of full EMBL file reading. > > > Albert wrote: > >I started working on that, but I'm not very far yet. > > Are you starting from Bio.GenBank or from scratch? I would point out > that the code in Bio.GenBank was inserted into what was once a Martel > based parser, and designed to be a transparent change for the end user. > > What I would like to do is recycle that code into a new far simpler > SeqIO GenBank parser which would only return SeqRecords. In particular > I would get rid off all the scanner/consumer model with all its function > callbacks. > > At this point I would try and handle both GenBank and EMBL files together. I didn't do much more than to play with current code and add some methods to parse EMBL specific things. The results can be found here: http://www.inb.uni-luebeck.de/~krewink/embl.py It's ugly, and doesn't provide much functionality, but could be a starting point. 
Albert -- Albert Krewinkel University of Luebeck, Institute for Neuro- and Bioinformatics From biopython-dev at maubp.freeserve.co.uk Thu Aug 17 20:09:20 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev)) Date: Thu, 17 Aug 2006 21:09:20 +0100 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc In-Reply-To: <44E46E33.3090001@maubp.freeserve.co.uk> References: <320fb6e00608120125t53f11b6et2bde5154e5bc19ec@mail.gmail.com> <44E46E33.3090001@maubp.freeserve.co.uk> Message-ID: <44E4CCF0.7090607@maubp.freeserve.co.uk> Marc Colosimo wrote: >> Nice quick work on that. For Clustal, I think it should NOT be an >> Iterator, but there should be SequenceDict or SequenceList for it. >> There are other alignment filetypes out there that could use a >> SequenceIterator (those that are not interlaced). From looking over >> your code, it seem like it would be easy to add a check in >> File2SequenceDict/List to check for Clustal types and do something >> "special" Peter (BioPython Dev) wrote: > Yes, I was thinking wondering about that too. > > For interlaced file formats (such as clustalw, NEXUS multiple alignment > format) we have to load the whole file into memory anyway - so using a > SequenceIterator was a bit odd. > > What I was trying to do was use a SequenceIterator as the lowest common > denominator - the ClustalIterator shows that this can be done for > interlaced files, and seems to work. There are two and a half examples done this way now... > I was planning to add a Stockholm parser (where the sequences themselves > are non-interleaved). The PFAM database alignments use this, and are > the largest alignments I am aware of. > > ... > > It looks like a subclassed version could be written to handle the PFAM > annotations nicely. http://bugzilla.open-bio.org/show_bug.cgi?id=2059#c3 Changes to the clustal parser, and addition of a parser for Stockholm alignments, and a subclassed version to handle the PFAM style annotations strings. 
I have included basic handling of the sequence specific meta-data [I need to have a look at real PFAM data to sort out the database cross references still], but currently ignore the whole file level information (#=GF lines) and the per column information (#=GC lines). Maybe reading sequences out of multiple alignment files should be done as a special case of loading multiple alignments? Is this what you meant by "something special" Marc? Peter From biopython-dev at maubp.freeserve.co.uk Mon Aug 21 19:26:06 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev)) Date: Mon, 21 Aug 2006 20:26:06 +0100 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc In-Reply-To: <44E4D2B4.3000600@maubp.freeserve.co.uk> References: <44E4D2B4.3000600@maubp.freeserve.co.uk> Message-ID: <44EA08CE.5070802@maubp.freeserve.co.uk> You probably noticed I sent out a "Dealing with sequence files" questionnaire on the main discussion list: http://lists.open-bio.org/pipermail/biopython/2006-August/003171.html I've had four replies to date (off the list), and with the previous list discussion and counting myself that makes eight views. Not a very big sample I know. > Question One > ============ > Is reading sequence files an important function to you, and if so which > file formats in particular (e.g. Fasta, GenBank, ...) Fasta was very popular, with GenBank also scoring highly. Michiel and I both use clustalw. Apart from EMBL (next question) there wasn't any other popular file format given. I'm tempted to ask again regarding multiple alignment formats. > Question Two > ============ > Are there any sequence formats you would like to be able to read using > BioPython that are not currently supported (e.g. EMBL, ...) It may have been a leading question, but several respondents would like to be able to read in EMBL format.
Other requests included: XML based 454 sequence files UniGene sequence cluster format Leighton mentioned: PTT (Protein table files) GFF (General Feature Format) And I wanted to be able to read Stockholm alignments. > Question Three - Reading Fasta Files > ==================================== > Which of the following do you currently use (and why)?: > > (a) Bio.Fasta with the RecordParser (giving FastaRecord objects) > (b) Bio.Fasta with the FeatureParser (giving SeqRecord objects) > (c) Bio.Fasta with your own parser (Could you tell us more?) > (d) Bio.SeqIO.FASTA.FastaReader (giving SeqRecord objects) > (e) Bio.FormatIO (giving SeqRecord objects) > (f) Other (Could you tell us more?) A range covering (a), (b) and (d) plus DIY parsers. > Question Four - Reading GenBank Files > ===================================== > Which of the following do you currently use (and why)?: > > (a) Bio.GenBank with the FeatureParser (giving SeqRecord objects) > (b) Bio.GenBank with the RecordParser (giving GenBank Record objects) > (c) Other (Could you tell us more?) Both (a) and (b) with no clear majority. > Question Five - Record Access... > ================================ > When loading a file with multiple sequences do you use: > > (a) An iterator interface (e.g. Bio.Fasta.Iterator) to give you the > records one by one in the order from the file. > > (b) A dictionary interface (e.g. Bio.Fasta.Dictionary) to give you > random access to the records using their identifier. > > (c) A list giving random access by index number (e.g. load the records > using an iterator but save them in a list). Most of you use iterators, storing records in memory as required. > Question Six - Martel, Scanners and Consumers > ============================================= > Some of BioPython's existing parsers (e.g. those using Martel) use an > event/callback model, where the scanner component generates parsing > events which are dealt with by the consumer component. 
> > Do any of you use this system to modify existing parser behaviour, or > use it as part of your own personal file parser? > > (a) I don't know, or don't care. I just the the parsers provided. > (b) I use this framework to modify a parser in order to do ... (please > provide details). Almost everyone said (a) which I think is a good thing if we are going to try and re-work the BioPython's sequence reading. > And finally... > ============== > Do you have any general questions of comments. Several people have commented that BioPerl has a nice unified system with good documentation. ----------------------------------------------------------------------- Where next... I think my code could be included "in parallel" with the existing parsers, without the upheaval of creating a new branch etc. I have started thinking about writing files too. Part of this will involve trying to be as consistent as possible about mapping annotations from different file formats to the SeqRecord object's annotations dictionary. http://bugzilla.open-bio.org/show_bug.cgi?id=2059 My code currently on bug 2059 is written as a single python file, provisionally Bio/SeqIO/__init__.py but this is clearly not a good idea long term as more file formats are supported. If we use Bio.SeqIO then the prior existence of Bio/SeqIO/FASTA.py is a slight annoyance in that I can't use Bio/SeqIO/Fasta.py because the filenames would clash on Windows. Some people are using the code in Bio.SeqIO.FASTA, but I suppose the file could contain both the old code, and my new fasta interface. Alternatively, the new system could be put in Bio.SequenceIO or are there any other suggestions? 
Peter From krewink at inb.uni-luebeck.de Tue Aug 22 13:43:56 2006 From: krewink at inb.uni-luebeck.de (Albert Krewinkel) Date: Tue, 22 Aug 2006 15:43:56 +0200 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc In-Reply-To: <44EA08CE.5070802@maubp.freeserve.co.uk> References: <44E4D2B4.3000600@maubp.freeserve.co.uk> <44EA08CE.5070802@maubp.freeserve.co.uk> Message-ID: <20060822134356.GO12386@pc09.inb.uni-luebeck.de> I'd like to seriously start working on an EMBL parser, but there are some things I'm concerned about: It surely would be a good thing to build the SequenceIO and Parser stuff upon some base classes and agree on using certain tools which are (or will be) used in the whole project. Since I never received any education/training on software development, I would appreciate if someone can tell me what the code's structure should look like -- the current Scanner/Consumer code isn't any help. > Several people have commented that BioPerl has a nice unified system > with good documentation. How about using reStructuredText in docstrings? IMO it leaves the .__doc__ string very readable but improves epydoc generated descriptions. Albert -- Albert Krewinkel University of Luebeck, Institute for Neuro- and Bioinformatics From bsouthey at gmail.com Tue Aug 22 13:52:10 2006 From: bsouthey at gmail.com (Bruce Southey) Date: Tue, 22 Aug 2006 08:52:10 -0500 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc In-Reply-To: <44EA08CE.5070802@maubp.freeserve.co.uk> References: <44E4D2B4.3000600@maubp.freeserve.co.uk> <44EA08CE.5070802@maubp.freeserve.co.uk> Message-ID: Hi, To date I have only used SwissProt code from BioPython so I am really only lurking. But here are some responses.
Bruce On 8/21/06, Peter (BioPython Dev) wrote: > You probably noticed I sent out a "Dealing with sequence files" > questionnaire on the main discussion list: > > http://lists.open-bio.org/pipermail/biopython/2006-August/003171.html > > I've had four replies to date (off the list), and with the previous list > discussion and counting myself that makes eight views. Not a very big > sample I know. > > > Question One > > ============ > > Is reading sequence files an important function to you, and if so which > > file formats in particular (e.g. Fasta, GenBank, ...) > > Fasta very popular, with GenBank also scoring highly. Michiel and I > both use clustalw. Apart from EMBL (next question) there wasn't any > other popular file format given. Well, this is not a surprise because most apps around also use FASTA as the default format, although most do not accept a comment line. Thus, FASTA is the most important format. > > I'm tempted to ask again regarding multiple alignment formats. > > > Question Two > > ============ > > Are there any sequence formats you would like to be able to read using > > BioPython that are not currently supported (e.g. EMBL, ...) > > It may have been a leading question, but several respondents would like > to be able to read in EMBL format. > > Other requests included: > > XML based 454 sequence files > UniGene sequence cluster format > > Leighton mentioned: > > PTT (Protein table files) > GFF (General Feature Format) > > And I wanted to be able to read Stockholm alignments. I would like to be able to use a custom format that is based on the FASTA format. That is, allowing non-standard characters to be included as part of the sequence that I later remove. Perhaps this is just being able to do subclassing.
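To make the subclassing idea concrete, here is a rough sketch of a FASTA-like reader with an overridable clean-up hook. None of these names come from Biopython or from the code on bug 2059: `SimpleFastaIterator`, `LettersOnlyFastaIterator` and `clean_sequence` are hypothetical, the records are plain (title, sequence) tuples rather than SeqRecord objects, and the code targets a current Python.

```python
import io
import re


class SimpleFastaIterator:
    """Yield (title, sequence) tuples from a FASTA-like handle."""

    def __init__(self, handle):
        self.handle = handle

    def clean_sequence(self, text):
        # Default behaviour: only drop whitespace and line breaks.
        return "".join(text.split())

    def __iter__(self):
        title, chunks = None, []
        for line in self.handle:
            if line.startswith(">"):
                if title is not None:
                    yield title, self.clean_sequence("".join(chunks))
                title, chunks = line[1:].strip(), []
            elif title is not None:
                chunks.append(line)
        if title is not None:
            yield title, self.clean_sequence("".join(chunks))


class LettersOnlyFastaIterator(SimpleFastaIterator):
    """Variant for a custom format: discard anything that is not a letter."""

    def clean_sequence(self, text):
        return re.sub(r"[^A-Za-z]", "", text)


# Made-up record containing markers that a custom format might embed.
text = ">pep1 annotated peptide\nMKT*LL-V\nA|G 12\n"
plain_records = list(SimpleFastaIterator(io.StringIO(text)))
records = list(LettersOnlyFastaIterator(io.StringIO(text)))
```

With this layout the custom format only overrides one small method, and the line-splitting logic stays in the base class.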
> > > Question Three - Reading Fasta Files > > ==================================== > > Which of the following do you currently use (and why)?: > > > > (a) Bio.Fasta with the RecordParser (giving FastaRecord objects) > > (b) Bio.Fasta with the FeatureParser (giving SeqRecord objects) > > (c) Bio.Fasta with your own parser (Could you tell us more?) > > (d) Bio.SeqIO.FASTA.FastaReader (giving SeqRecord objects) > > (e) Bio.FormatIO (giving SeqRecord objects) > > (f) Other (Could you tell us more?) > > A range covering (a), (b) and (d) plus DIY parsers. > > > Question Four - Reading GenBank Files > > ===================================== > > Which of the following do you currently use (and why)?: > > > > (a) Bio.GenBank with the FeatureParser (giving SeqRecord objects) > > (b) Bio.GenBank with the RecordParser (giving GenBank Record objects) > > (c) Other (Could you tell us more?) > > Both (a) and (b) with no clear majority. > > > Question Five - Record Access... > > ================================ > > When loading a file with multiple sequences do you use: > > > > (a) An iterator interface (e.g. Bio.Fasta.Iterator) to give you the > > records one by one in the order from the file. > > > > (b) A dictionary interface (e.g. Bio.Fasta.Dictionary) to give you > > random access to the records using their identifier. > > > > (c) A list giving random access by index number (e.g. load the records > > using an iterator but save them in a list). > > Most of you use iterators, storing records in memory as required. a > > > Question Six - Martel, Scanners and Consumers > > ============================================= > > Some of BioPython's existing parsers (e.g. those using Martel) use an > > event/callback model, where the scanner component generates parsing > > events which are dealt with by the consumer component. > > > > Do any of you use this system to modify existing parser behaviour, or > > use it as part of your own personal file parser? 
> > > > (a) I don't know, or don't care. I just the the parsers provided. > > (b) I use this framework to modify a parser in order to do ... (please > > provide details). > > Almost everyone said (a) which I think is a good thing if we are going > to try and re-work the BioPython's sequence reading. a > > > And finally... > > ============== > > Do you have any general questions of comments. > > Several people have commented that BioPerl has a nice unified system > with good documentation. > > ----------------------------------------------------------------------- > > Where next... > > I think my code could be included "in parallel" with the existing > parsers, without the upheaval of creating a new branch etc. > > I have started thinking about writing files too. > > Part of this will involve trying to be as consistent as possible about > mapping annotations from different file formats to the SeqRecord > object's annotations dictionary. > > http://bugzilla.open-bio.org/show_bug.cgi?id=2059 > > My code currently on bug 2059 is written as a single python file, > provisionally Bio/SeqIO/__init__.py but this is clearly not a good idea > long term as more file formats are supported. > > If we use Bio.SeqIO then the prior existence of Bio/SeqIO/FASTA.py is a > slight annoyance in that I can't use Bio/SeqIO/Fasta.py because the > filenames would clash on Windows. Some people are using the code in > Bio.SeqIO.FASTA, but I suppose the file could contain both the old code, > and my new fasta interface. > > Alternatively, the new system could be put in Bio.SequenceIO or are > there any other suggestions? 
> > Peter > > > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From biopython-dev at maubp.freeserve.co.uk Tue Aug 22 16:46:39 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev)) Date: Tue, 22 Aug 2006 17:46:39 +0100 Subject: [Biopython-dev] Reading sequences: FormatIO, SeqIO, etc In-Reply-To: <20060822134356.GO12386@pc09.inb.uni-luebeck.de> References: <44E4D2B4.3000600@maubp.freeserve.co.uk> <44EA08CE.5070802@maubp.freeserve.co.uk> <20060822134356.GO12386@pc09.inb.uni-luebeck.de> Message-ID: <44EB34EF.1050901@maubp.freeserve.co.uk> Albert Krewinkel wrote: > I'd like to seriously start working on an EMBL parser, but ... As the de-facto GenBank module owner, I'm also interested in getting EMBL and GenBank working nicely together. The big question BEFORE you/we start any serious coding on EMBL support is how it fits into BioPython. Do we (a) add a new module like the existing Bio.Fasta and Bio.GenBank, or (b) use a new framework like the one I've put forward here: http://bugzilla.open-bio.org/show_bug.cgi?id=2059 > ... there are some things I'm concerned about: It surely would be a > good thing to build the SequenceIO and Parser stuff upon some base > classes and agree on using certain tools which are (or will be) used > in the hole project. What I was proposing was that all the new sequence file format parsers should be implemented as subclasses of my SequenceIterator class - either directly (e.g. FastaIterator) or indirectly (e.g. the PfamStockholmIterator) and they should return SeqRecord objects. I am open to discussion about how interlaced file formats should be handled, but I think I have shown how the SequenceIterator based scheme could work using the Clustalw and Stockholm formats as examples.
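As a rough illustration of that scheme (a sketch only, not the code on bug 2059), the outline below shows a base iterator class and a subclass for an interlaced format that reads everything up front but builds record objects only when asked. The `Record` tuple is a stand-in for SeqRecord so the example has no dependencies, `InterlacedIterator` is a hypothetical name, the "format" is a toy one (blocks of `id chunk` lines) rather than real Clustalw output, and the code is written for a current Python.

```python
from collections import namedtuple
import io

# Stand-in for Bio.SeqRecord.SeqRecord, used only to keep the sketch standalone.
Record = namedtuple("Record", ["id", "seq"])


class SequenceIterator:
    """Base class: subclasses implement __next__ and return one record."""

    def __init__(self, handle):
        self.handle = handle

    def __iter__(self):
        return self

    def __next__(self):
        raise NotImplementedError("subclasses should implement __next__")


class InterlacedIterator(SequenceIterator):
    """Toy interlaced format: repeated blocks of 'id  sequence-chunk' lines."""

    def __init__(self, handle):
        super().__init__(handle)
        self.ids = []        # identifiers in file order
        self.sequences = {}  # identifier -> list of sequence chunks
        for line in handle:
            parts = line.split()
            if len(parts) != 2:
                continue  # blank or unrecognised line between blocks
            name, chunk = parts
            if name not in self.sequences:
                self.ids.append(name)
                self.sequences[name] = []
            self.sequences[name].append(chunk)
        self._position = 0

    def __next__(self):
        if self._position >= len(self.ids):
            raise StopIteration
        name = self.ids[self._position]
        self._position += 1
        # The record object is only built at the moment it is requested.
        return Record(name, "".join(self.sequences[name]))


data = io.StringIO("alpha ACGT\nbeta  AC-T\n\nalpha GGA\nbeta  GCA\n")
records = list(InterlacedIterator(data))
```

Because the subclass keeps the ID list and the ID-to-sequence mapping, turning it into a list-like or dictionary-like object later would mostly be a matter of exposing those two attributes.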
> Since I never received any education/training on software > development, I would appreciate if someone can tell me how the code's > structure should look like -- the current Scanner/Consumer code > isn't any help. I agree that the current Scanner/Consumer code won't be much help. The fact that the current Bio.GenBank parser uses the Scanner/Consumer model reflects the fact that I rewrote (in Python) what had been done using Martel/Mindy. This is one excuse for the state of that code of mine ;) I don't think the flexibility of the Scanner/Consumer model is needed just to turn Embl/GenBank data into SeqRecord objects (and only into SeqRecord objects). > How about using reStructuredText in docstrings? IMO it leaves the > .__doc__ string very readable but improves epydoc generated > descriptions. I'm not familiar with how any existing API documentation is extracted from the source code... Peter From biopython-dev at maubp.freeserve.co.uk Thu Aug 17 08:28:19 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev)) Date: Thu, 17 Aug 2006 09:28:19 +0100 Subject: [Biopython-dev] Tweaking the SeqRecord class In-Reply-To: <44E3C8C0.5070200@c2b2.columbia.edu> References: <320fb6e00608161520j5fb6b4fejd7aa8cc839989423@mail.gmail.com> <44E3C8C0.5070200@c2b2.columbia.edu> Message-ID: <44E428A3.70103@maubp.freeserve.co.uk> Michiel de Hoon wrote: > Peter wrote: > >>First of all, is there any comment on my suggestion to add __str__ and >>__repr__ methods to the SeqRecord object, bug 2057: >> >>http://bugzilla.open-bio.org/show_bug.cgi?id=2057 > > Here's a thought: > What if Seq were to inherit from str, and SeqRecord from Seq? > Then, you get these for free. This wouldn't automatically show any id/name/desrc/annotation in the __str__ and __repr__ methods, so I would want to override these methods anyway. We would still need to create and provide a Seq object on request as the record.seq attribute/property (for backwards compatibility). 
I also think we should change the Seq object's __str__ and __repr__ functionality (while preserving the .tostring() method for some backwards compatibility). It might have been Marc who raised this point - shouldn't __str__ turn the data into a string, and __repr__ return a string that you could type into python to recreate the object? This would mean we would have to stop truncating the sequence data at 60 characters. >>Next, I'd like to check in some basic __doc__ strings for the >>SeqRecord class, e.g. something like this: > > Sounds good to me. Pretty amazing, actually, that SeqRecord doesn't have > documentation. OK, basic __doc__ strings checked in, Bio/SeqRecord.py revision 1.9 The Seq object also needs some love and attention in this area. >>If you recall, for the fastest parsers turning the data into SeqRecord >>and Seq objects imposed a fairly large overhead (compared to just >>using strings): >> >>http://lists.open-bio.org/pipermail/biopython-dev/2006-July/002407.html > > I wonder if this is still true if a Seq object and a SeqRecord object > inherit from string. From the code, I don't see where the overhead comes > from. I was wondering what the overhead was too. It could just be creating objects (Seq and SeqRecord) plus their associated strings/list/dictionary (compared with just two strings, the fasta title string and the sequence). My property change should reduce this a little bit as for Fasta files there is no need to create the dbxrefs list or the annotations dictionary (unless or until the user records some information here after creating the SeqRecord object). Making SeqRecord subclass Seq might help here if only one object needs to be created. >>The backwards compatibility if statement is a bit >>ugly - can we just assume Python 2.2 or later? > > Biopython currently requires Python 2.3 or later. Great - I'll ditch that nasty big if and just re-write the class to use properties. Revised version attached - should be functionally identical.
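For readers without the attachment, the property idea can be sketched roughly as follows. This is not the attached SeqRecord.py: `LazyRecord` is a hypothetical class, the sequence is a plain string rather than a Seq object, and the `@property` decorator form assumes a newer Python than the 2.3 mentioned above (there one would write `dbxrefs = property(_get_dbxrefs)` instead). The `__repr__` follows the suggestion of returning something that could be typed back in to recreate the object.

```python
class LazyRecord:
    """Toy record whose dbxrefs list and annotations dict are created lazily."""

    def __init__(self, seq, id="<unknown id>", description=""):
        self.seq = seq
        self.id = id
        self.description = description
        # Nothing is allocated for these until somebody asks for them.
        self._dbxrefs = None
        self._annotations = None

    @property
    def dbxrefs(self):
        if self._dbxrefs is None:
            self._dbxrefs = []
        return self._dbxrefs

    @property
    def annotations(self):
        if self._annotations is None:
            self._annotations = {}
        return self._annotations

    def __repr__(self):
        # A string that recreates the record when evaluated.
        return "%s(seq=%r, id=%r, description=%r)" % (
            type(self).__name__, self.seq, self.id, self.description)

    def __str__(self):
        # Human-readable summary, including the full sequence.
        return "ID: %s\nDescription: %s\nSequence: %s" % (
            self.id, self.description, self.seq)


record = LazyRecord("ACGT", id="x1")
untouched = record._annotations is None   # no dictionary created yet
record.annotations["source"] = "example"  # first access creates it
copy = eval(repr(record))
```

A Fasta parser that never touches `annotations` or `dbxrefs` then pays for neither the dictionary nor the list, which is the saving described above.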
Peter -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: SeqRecord.py URL: From biopython-dev at maubp.freeserve.co.uk Wed Aug 30 10:22:52 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Wed, 30 Aug 2006 11:22:52 +0100 Subject: [Biopython-dev] Recent bug reports not making it to the mailing list Message-ID: <44F566FC.30407@maubp.freeserve.co.uk> Once upon a time (early 2006?) whenever a bug was filed on the BugZilla, a copy was sent to the mailing list. Not any more... and in the last month or so there have been several bugs filed which have been ignored. Does anyone get automatic email notification? Who should I ask to be included in any default email notification? Thanks Peter