From p.j.a.cock at googlemail.com Fri Aug 1 05:20:01 2008 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 1 Aug 2008 10:20:01 +0100 Subject: [BioPython] SeqRecord to Genbank: use SeqIO? In-Reply-To: <0B3D0CC5-820B-4D98-8EB6-6ED6F7D2B6C8@u.washington.edu> References: <0B3D0CC5-820B-4D98-8EB6-6ED6F7D2B6C8@u.washington.edu> Message-ID: <320fb6e00808010220k647f7dcfw71e58ff09a457c32@mail.gmail.com> On Fri, Aug 1, 2008 at 2:45 AM, Cedar McKay wrote: > Hello, I have SeqRecord objects that I'd like to convert to a string that is > in Genbank format. That way I can do whatever with it, including write it to > a file. The only way I can see to do anything similar is using SeqIO and > doing something like: > > SeqIO.write(my_records, out_file_handle, "genbank") > which I found here: > http://biopython.org/DIST/docs/tutorial/Tutorial.html#chapter:Bio.SeqIO That code would work fine - once Bio.SeqIO supports output in the GenBank format. It's been on my "to do list" for a while, but since GenBank is such an annotation-rich format this is non-trivial once you start to use it with other file formats. http://bugzilla.open-bio.org/show_bug.cgi?id=2294 I've been thinking about writing a unit test using the EMBOSS seqret program for interconverting file formats, as a way of checking our conversions against a third party. > The problem is, it doesn't support something like: > SeqIO.write(seq_record, out_file_handle, "genbank") > Because it requires an iterable object I guess? Yes, you would have to use this: SeqIO.write([seq_record], out_file_handle, "genbank") Of course, if lots of people really want to have the flexibility to supply a SeqRecord or a SeqRecord list/iterator this would be possible. On the other hand, there is something to be said for a simple fixed interface. > And it has to write to a file handle for some reason, and > won't just give me the string to do whatever I want with. This is by design - the API uses handles and only handles. If you want a string containing the data, use StringIO (or cStringIO), something like this: from StringIO import StringIO handle = StringIO() SeqIO.write(seq_records, handle, "fasta") handle.seek(0) data = handle.read() This isn't in the tutorial or the wiki page (yet). http://biopython.org/wiki/SeqIO > I've done a lot of searching and mailing lists, and googling, and surely I > must be missing something? What is the simplest way to get a string > representing a genbank file, starting with a SeqRecord? > > I'm sort of shocked that there isn't some sort of SeqRecord.to_genbank() > method. We have discussed something like a SeqRecord.to_format() method (or similar name), which would call Bio.SeqIO internally using StringIO and return a string. This fits in nicely with the planned __format__ and format() functionality in Python 2.6 and 3.0 http://www.python.org/dev/peps/pep-3101/ See http://portal.open-bio.org/pipermail/biopython-dev/2008-June/003793.html Peter From biopython at maubp.freeserve.co.uk Fri Aug 1 05:43:31 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 1 Aug 2008 10:43:31 +0100 Subject: [BioPython] SeqRecord to Genbank: use SeqIO? In-Reply-To: <320fb6e00808010220k647f7dcfw71e58ff09a457c32@mail.gmail.com> References: <0B3D0CC5-820B-4D98-8EB6-6ED6F7D2B6C8@u.washington.edu> <320fb6e00808010220k647f7dcfw71e58ff09a457c32@mail.gmail.com> Message-ID: <320fb6e00808010243p66092123jbc914c68673e6468@mail.gmail.com> >> I've done a lot of searching and mailing lists, and googling, and surely I >> must be missing something?
What is the simplest way to get a string >> representing a genbank file, starting with a SeqRecord? >> >> I'm sort of shocked that there isn't some sort of SeqRecord.to_genbank() >> method. > > We have discussed something like a SeqRecord.to_format() method (or > similar name), which would call Bio.SeqIO internally using StringIO > and return a string. This fits in nicely with the planned __format__ > and format() functionality in Python 2.6 and 3.0 > http://www.python.org/dev/peps/pep-3101/ > > See http://portal.open-bio.org/pipermail/biopython-dev/2008-June/003793.html I've filed this enhancement as Bug 2561 - http://bugzilla.open-bio.org/show_bug.cgi?id=2561 Any comments or suggestions on the naming of this function could be recorded there, or discussed on the mailing list. Peter From biopython at maubp.freeserve.co.uk Fri Aug 1 06:21:06 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 1 Aug 2008 11:21:06 +0100 Subject: [BioPython] SeqRecord to Genbank: use SeqIO? In-Reply-To: <320fb6e00808010220k647f7dcfw71e58ff09a457c32@mail.gmail.com> References: <0B3D0CC5-820B-4D98-8EB6-6ED6F7D2B6C8@u.washington.edu> <320fb6e00808010220k647f7dcfw71e58ff09a457c32@mail.gmail.com> Message-ID: <320fb6e00808010321o2aaec1a3je6a3a8d263769931@mail.gmail.com> On Fri, Aug 1, 2008 at 10:20 AM, Peter Cock wrote: > On Fri, Aug 1, 2008 at 2:45 AM, Cedar McKay wrote: >> Hello, I have SeqRecord objects that I'd like to convert to a string ... >> And it [Bio.SeqIO.write()] has to write to a file handle for some reason, >> and won't just give me the string to do whatever I want with. > > This is by design - the API uses handles and only handles. If you > want a string containing the data, use StringIO (or cStringIO), > something like this: > > from StringIO import StringIO > handle = StringIO() > SeqIO.write(seq_records, handle "fasta") > handle.seek(0) > data = handle.read() > > This isn't in the tutorial or the wiki page (yet). > http://biopython.org/wiki/SeqIO I've just updated the tutorial in CVS to cover this (and a similar example for Bio.AlignIO). We don't normally update the public PDF and HTML versions of the tutorial between releases to avoid the documentation talking about unreleased features, so you won't see this change for a while. Could I ask why you want to get the SeqRecord as a string in GenBank format? My only guess is as part of some webservice where a string would be useful to embed within a page template. Getting a SeqRecord in FASTA format is probably a more common request, given many web tools will take this as input. Peter From mjldehoon at yahoo.com Sat Aug 2 06:35:54 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sat, 2 Aug 2008 03:35:54 -0700 (PDT) Subject: [BioPython] Bio.Medline parser Message-ID: <20101.29085.qm@web62411.mail.re1.yahoo.com> Hi everybody, For bug #2454: http://bugzilla.open-bio.org/show_bug.cgi?id=2454 I was looking at the parser in Bio.Medline, which can parse flat files in the Medline format. For an example, see Tests/Medline/pubmed_result2.txt: http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Tests/Medline/pubmed_result2.txt?rev=1.1&cvsroot=biopython&content-type=text/vnd.viewcvs-markup I would like to suggest some changes to this parser. 
Currently, it works as follows: >>> from Bio import Medline >>> parser = Medline.RecordParser() >>> handle = open("mymedlinefile.txt") >>> record = parser.parse(handle) or, to iterate over a bunch of Medline records: >>> from Bio import Medline >>> parser = Medline.RecordParser() >>> handle = open("mymedlinefile.txt") >>> records = Medline.Iterator(handle, parser) >>> for record in records: ... # do something with the record. I'd like to change these to >>> from Bio import Medline >>> handle = open("mymedlinefile.txt") >>> record = Medline.read(handle) and >>> from Bio import Medline >>> handle = open("mymedlinefile.txt") >>> records = Medline.parse(handle) >>> for record in records: ... # do something with the record. respectively. In addition, currently the fields in the Medline file are stored as attributes of the record. For example, if the file is PMID- 12230038 OWN - NLM STAT- MEDLINE DA - 20020916 ... then the corresponding record is record.pubmed_id = "12230038" record.owner = "NLM" record.status = "MEDLINE" record.entry_date = "20020916" I'd like to change two things here: 1) Use the key shown in the Medline file instead of the name to store each field. 2) Let the record class derive from a dictionary, and store each field as a key, value pair in this dictionary. record["PMID"] = "12230038" record["OWN"] = "NLM" record["STAT"] = "MEDLINE" record["DA"] = "20020916" ... This avoids the names that were rather arbitrarily chosen by ourselves, and greatly simplifies the parser. The parser will also be more robust if new fields are added to the Medline file format. Currently there is very little information on the Medline parser in the documentation, so I doubt it has many users. Nevertheless, I wanted to check if anybody has any objections or comments before I implement these changes. --Michiel From biopython at maubp.freeserve.co.uk Sat Aug 2 08:02:13 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 2 Aug 2008 13:02:13 +0100 Subject: [BioPython] Deprecating Bio.Saf (PredictProtein Simple Alignment Format) In-Reply-To: <320fb6e00807231512s507d2652jc0f26764a62b01d5@mail.gmail.com> References: <320fb6e00807231512s507d2652jc0f26764a62b01d5@mail.gmail.com> Message-ID: <320fb6e00808020502l64eca901gda132250fc85e5be@mail.gmail.com> On Wed, Jul 23, 2008 at 11:12 PM, Peter wrote: > Is anyone using Bio.Saf or PredictProtein's "Simple Alignment Format" (SAF)? > > Bio.Saf is one of the older parsers in Biopython. It parses the > PredictProtein "Simple Alignment Format" (SAF), a fairly free-format > multiple sequence alignment file format described here: > http://www.predictprotein.org/Dexa/optin_saf.html > > Potentially we could support this file format within Bio.AlignIO (if > there was any demand). However, as far as I can tell, this file > format is ONLY used for PredictProtein, and they will accept several > other more mainstream alignment file formats as alternatives. I got in touch with PredictProtein, and Burkhard Rost told me: >> SAF is a simplified subset of MSF, actually it is >> MSF - checksum AND + more flexibility in terms of line length asf. I also said he was aware of several groups who used it/adopted SAF, but didn't have any names to hand. > Bio.Saf uses Martel for parsing, which is not entirely compatible with > mxTextTools 3.0. If we did want to integrate SAF support into > Bio.AlignIO it might be best to reimplement the parser in plain > python. I suspect adding support for the MSF alignment format to Bio.AlignIO would be of more general interest. 
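If anyone does want to experiment, here is a very rough sketch of what a plain Python reader for this kind of free-format block alignment could look like. It assumes a simple "name, whitespace, sequence chunk" layout with "#" comment lines - that layout is an assumption on my part, and this is not a complete SAF or MSF parser:

def parse_saf_blocks(handle):
    """Crude sketch: collect sequences from a SAF-like block alignment."""
    sequences = {}
    order = []
    for line in handle:
        line = line.rstrip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # ignore anything that is not "name sequence"
        name, chunk = parts
        if name not in sequences:
            sequences[name] = []
            order.append(name)
        sequences[name].append(chunk.replace(" ", ""))
    return [(name, "".join(sequences[name])) for name in order]

Turning those (name, sequence) pairs into a proper Alignment object, and checking the lengths all agree, is deliberately left out of the sketch.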
> If no one is using it, I would like to deprecate Bio.Saf in the next > release of Biopython. Still no objections? Peter From biopython at maubp.freeserve.co.uk Sat Aug 2 08:18:11 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 2 Aug 2008 13:18:11 +0100 Subject: [BioPython] Bio.Medline parser In-Reply-To: <20101.29085.qm@web62411.mail.re1.yahoo.com> References: <20101.29085.qm@web62411.mail.re1.yahoo.com> Message-ID: <320fb6e00808020518t516b4ab8gb2b2b3efa85c04ec@mail.gmail.com> On Sat, Aug 2, 2008 at 11:35 AM, Michiel de Hoon wrote: > Hi everybody, > > For bug #2454: > > http://bugzilla.open-bio.org/show_bug.cgi?id=2454 > > I was looking at the parser in Bio.Medline, which can parse flat files > in the Medline format. For an example, see Tests/Medline/pubmed_result2.txt: > > http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Tests/Medline/pubmed_result2.txt?rev=1.1&cvsroot=biopython&content-type=text/vnd.viewcvs-markup > > I would like to suggest some changes to this parser. > > Currently, it works as follows: > >>>> from Bio import Medline >>>> parser = Medline.RecordParser() >>>> handle = open("mymedlinefile.txt") >>>> record = parser.parse(handle) > > or, to iterate over a bunch of Medline records: > >>>> from Bio import Medline >>>> parser = Medline.RecordParser() >>>> handle = open("mymedlinefile.txt") >>>> records = Medline.Iterator(handle, parser) >>>> for record in records: > ... # do something with the record. > > I'd like to change these to > >>>> from Bio import Medline >>>> handle = open("mymedlinefile.txt") >>>> record = Medline.read(handle) > > and > >>>> from Bio import Medline >>>> handle = open("mymedlinefile.txt") >>>> records = Medline.parse(handle) >>>> for record in records: > ... # do something with the record. > > respectively. +1 (I agree) That would fit with our recent parser changes, and consistency is good :) > In addition, currently the fields in the Medline file are stored as attributes of the record. For example, if the file is > > PMID- 12230038 > OWN - NLM > STAT- MEDLINE > DA - 20020916 > ... > > then the corresponding record is > > record.pubmed_id = "12230038" > record.owner = "NLM" > record.status = "MEDLINE" > record.entry_date = "20020916" > > I'd like to change two things here: > > 1) Use the key shown in the Medline file instead of the name to store each field. > 2) Let the record class derive from a dictionary, and store each field as a key, value pair in this dictionary. > > record["PMID"] = "12230038" > record["OWN"] = "NLM" > record["STAT"] = "MEDLINE" > record["DA"] = "20020916" > ... > > This avoids the names that were rather arbitrarily chosen by ourselves, > and greatly simplifies the parser. The parser will also be more robust if > new fields are added to the Medline file format. One downside of this is that the user then has to go and consult the file format documentation to discover "DA" is the entry date, etc. In some cases the abbreviations are probably a little unclear. I would find code using the current named properties easier to read than the suggested dictionary based approach which exposes the raw field names. Also, could you make the changes while leaving the older parser with the old record behaviour in place (with deprecation warnings) for a few releases? This would allow existing users' scripts to continue as is (but with a warning). > Currently there is very little information on the Medline parser in the > documentation, so I doubt it has many users.
Nevertheless, I wanted > to check if anybody has any objections or comments before I implement > these changes. I think the first addition (read and parse functions) is very sensisble, but I am not sure about the suggested change to the record behaviour. Peter From mjldehoon at yahoo.com Sat Aug 2 09:32:32 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sat, 2 Aug 2008 06:32:32 -0700 (PDT) Subject: [BioPython] Bio.Medline parser In-Reply-To: <320fb6e00808020518t516b4ab8gb2b2b3efa85c04ec@mail.gmail.com> Message-ID: <407661.78979.qm@web62412.mail.re1.yahoo.com> --- On Sat, 8/2/08, Peter wrote: > > 1) Use the key shown in the Medline file instead of > > the name to store each field. > > 2) Let the record class derive from a dictionary, and > > store each field as a key, value pair in this dictionary. .... > One downside of this is that the user then has to go and > consult the file format documentation to discover "DA" is the > entry date, etc. In some cases the abbrevations are probably > a little unclear. I would find code using the current named > properties easier to read than the suggested dictionary based > approach which exposes the raw field names. What I noticed when I was playing with this parser is that it is often unclear which (Biopython-chosen) name goes with which (NCBI-chosen) key. For example, PMID is the pubmed ID number in the flat file. Should I look under "pmid", "PMID", "PubmedID"? (the correct answer is "pubmed_id"). As you mention, the NCBI-chosen keys are often not very informative (who can guess that TT stands for "transliterated title"?). I was thinking to have a list of NCBI keys and their description in the docstring of Bio.Medline's Record class, so users can always find them without having to go into NCBI's documentation. Another possibility is to overload the dictionary class such that all keys are automatically mapped to their more descriptive names. So the parser only knows about the NCBI-defined keys, but if a user types record["Author"], then the Record class knows it should return record["AU"]. With a corresponding modification of record.keys(). > Also, could you make the changes whiling leaving the older > parser with the old record behaviour in place (with deprecation > warnings) for a few releases? Yes that is possible. Existing scripts will use the parser = RecordParser(); parser.parse(handle) approach. This approach can continue use the same Record class, basically ignoring the fact that it now derives from a dictionary. A deprecation warning is given when a user tries to create a RecordParser instance. --Michiel. From biopython at maubp.freeserve.co.uk Sat Aug 2 10:09:58 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 2 Aug 2008 15:09:58 +0100 Subject: [BioPython] Bio.Medline parser In-Reply-To: <407661.78979.qm@web62412.mail.re1.yahoo.com> References: <320fb6e00808020518t516b4ab8gb2b2b3efa85c04ec@mail.gmail.com> <407661.78979.qm@web62412.mail.re1.yahoo.com> Message-ID: <320fb6e00808020709v129cd4c2g7f56251235b6abe0@mail.gmail.com> On Sat, Aug 2, 2008 at 2:32 PM, Michiel de Hoon wrote: >> One downside of this is that the user then has to go and >> consult the file format documentation to discover "DA" is the >> entry date, etc. In some cases the abbreviations are probably >> a little unclear. I would find code using the current named >> properties easier to read than the suggested dictionary based >> approach which exposes the raw field names. 
> > What I noticed when I was playing with this parser is that it is often > unclear which (Biopython-chosen) name goes with which (NCBI-chosen) > key. For example, PMID is the pubmed ID number in the flat file. Should > I look under "pmid", "PMID", "PubmedID"? (the correct answer is "pubmed_id"). If you did dir(record) how many possible candidates would you see? > As you mention, the NCBI-chosen keys are often not very informative > (who can guess that TT stands for "transliterated title"?). I was thinking > to have a list of NCBI keys and their description in the docstring of > Bio.Medline's Record class, so users can always find them without > having to go into NCBI's documentation. That would help users - and also future developers trying to understand what the parser is doing! > Another possibility is to overload the dictionary class such that all keys > are automatically mapped to their more descriptive names. So the > parser only knows about the NCBI-defined keys, but if a user types > record["Author"], then the Record class knows it should return > record["AU"]. With a corresponding modification of record.keys(). The alias idea is nice but does mean there is more than one way to access the data (not encouraged in python). A related suggestion is to support the properties record.entry_date, record.author etc (what ever the current parser does) as alternatives to record["DA"], record["AU"], ... ? This would then be backwards compatible. This could probably be done with a private dictionary mapping keys ("DA") to property names ("entry_date"). When ever we add a new entry to the dictionary, also see if it has a named property to define too. >> Also, could you make the changes whiling leaving the older >> parser with the old record behaviour in place (with deprecation >> warnings) for a few releases? > > Yes that is possible. Existing scripts will use ... Good, we shouldn't break existing scripts during the deprecation transition period. Peter From mjldehoon at yahoo.com Sat Aug 2 21:57:18 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sat, 2 Aug 2008 18:57:18 -0700 (PDT) Subject: [BioPython] Bio.Medline parser In-Reply-To: <320fb6e00808020709v129cd4c2g7f56251235b6abe0@mail.gmail.com> Message-ID: <591490.20671.qm@web62411.mail.re1.yahoo.com> > The alias idea is nice but does mean there is more than one > way to access the data (not encouraged in python). A related > suggestion is to support the properties record.entry_date, > record.author etc (what ever the current parser does) as > alternatives to record["DA"], record["AU"], ... ? This would > then be backwards compatible. This could probably be done with > a private dictionary mapping keys ("DA") to property names > ("entry_date"). When ever we add a new entry to the > dictionary, also see if it has a named property to define > too. > Thinking it over, I think that having a key and an attribute mapping to the same value is not so clean. Alternatively we could add a .find(term) method to the Bio.Medline.Record class, which takes a term and returns the appropriate value. So record.find("author") returns record["AU"]. This gives a clear separation between the raw keys in the Medline file and the more descriptive names. Also, such a .find method can accept a wider variety of terms than an attribute name (e.g., "Full Author", "full_author", etc. all return record["FAU"]). 
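As a minimal sketch of that idea (the alias table, method name and example values below are only illustrative, not a finished Bio.Medline API):

class Record(dict):
    """Medline record keyed by the raw NCBI field codes (e.g. "PMID", "AU")."""

    # Illustrative subset only - a real table would list every NCBI key,
    # and its descriptions could double as the docstring documentation.
    _aliases = {
        "pubmed id": "PMID",
        "author": "AU",
        "full author": "FAU",
        "title": "TI",
        "transliterated title": "TT",
        "entry date": "DA",
    }

    def find(self, term):
        # Accept a descriptive term ("Full Author", "full_author") or a
        # raw key ("FAU"), and return the stored value for that field.
        key = self._aliases.get(term.lower().replace("_", " "), term)
        return self[key]

record = Record()
record["PMID"] = "12230038"
record["FAU"] = ["de Hoon, Michiel"]
assert record.find("Full Author") == record["FAU"]
assert record.find("PMID") == "12230038"

Keeping the aliases in a single table like this also gives a natural place to document what each NCBI key means.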
--Michiel From matzke at berkeley.edu Tue Aug 5 16:30:16 2008 From: matzke at berkeley.edu (Nick Matzke) Date: Tue, 05 Aug 2008 13:30:16 -0700 Subject: [BioPython] Clustalw.parse_file errors Message-ID: <4898B858.7040405@berkeley.edu> Hi all, I'm running through the excellent biopython tutorial here: http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#htoc100 I've hit an error here: 9.4.2 Creating your own substitution matrix from an alignment http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#htoc100 ...basically the Clustalw parser won't parse even the given example alignment file (protein.aln) or another example file from elsewhere (example.aln). ============ from Bio import Clustalw from Bio.Alphabet import IUPAC from Bio.Align import AlignInfo # get an alignment object from a Clustalw alignment output c_align = Clustalw.parse_file("protein.aln", IUPAC.protein) summary_align = AlignInfo.SummaryInfo(c_align) ============ this code doesn't work with the given protein.aln file, error message: Traceback (most recent call last): File "biopython_alignments.py", line 163, in ? c_align = Clustalw.parse_file(protein_align_file, IUPAC.protein) File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/Bio/Clustalw/__init__.py", line 60, in parse_file clustal_alignment._star_info = generic_alignment._star_info AttributeError: Alignment instance has no attribute '_star_info' It also doesn't work with the example.aln file here: http://www.pasteur.fr/recherche/unites/sis/formation/python/ch11s06.html http://www.pasteur.fr/recherche/unites/sis/formation/python/data/example.aln ...but throws a different error: code: c_align = Clustalw.parse_file('example.aln', alphabet=IUPAC.protein) ================================= Traceback (most recent call last): File "biopython_alignments.py", line 174, in ? c_align = Clustalw.parse_file('example.aln', alphabet=IUPAC.protein) File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/Bio/Clustalw/__init__.py", line 47, in parse_file generic_alignment = AlignIO.read(handle, "clustal") File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/Bio/AlignIO/__init__.py", line 299, in read first = iterator.next() File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/Bio/AlignIO/ClustalIO.py", line 169, in next raise ValueError("Could not parse line:\n%s" % line) ValueError: Could not parse line: *:::**:.**.** *.*** .:* *:******* ==================== I am running: wright:/bioinformatics/pyeg nick$ py -V Python 2.4.4 ...& biopython installed just last week... Any help appreciated, since I will have to use this module soon! Nick -- ==================================================== Nicholas J. Matzke Ph.D. student, Graduate Student Researcher Huelsenbeck Lab 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab website: http://ib.berkeley.edu/people/lab_detail.php?lab=54 Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/~edna/lab_test/members/matzke.html Lab phone: 510-643-6299 Dept. 
fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Office hours for Bio1B, Spring 2008: Biology: Plants, Evolution, Ecology VLSB 2013, Monday 1-1:30 (some TA there for all hours during work week) Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ==================================================== From biopython at maubp.freeserve.co.uk Tue Aug 5 16:52:01 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 5 Aug 2008 21:52:01 +0100 Subject: [BioPython] Clustalw.parse_file errors In-Reply-To: <4898B858.7040405@berkeley.edu> References: <4898B858.7040405@berkeley.edu> Message-ID: <320fb6e00808051352u1bd19467i914ddb48d1c8dde7@mail.gmail.com> Hi Nick, I'll take a look at the other problem, but I think I could diagnose the second one immediately... > It also doesn't work with the example.aln file here: > http://www.pasteur.fr/recherche/unites/sis/formation/python/ch11s06.html > http://www.pasteur.fr/recherche/unites/sis/formation/python/data/example.aln > > ...but throws a different error: > > code: > > c_align = Clustalw.parse_file('example.aln', alphabet=IUPAC.protein) > ================================= > Traceback (most recent call last): > ... > ValueError: Could not parse line: > *:::**:.**.** *.*** .:* *:******* > ==================== That looks like a bug in Biopython 1.47 (reported last month by Sebastian Bassi) where there was a problem parsing Clustal files where the first line of the consensus was blank (as here). It has been fixed in CVS... You should only need to replace the file Bio/AlignIO/ClustalIO.py with the latest version from CVS. If you would like guidance on how exactly to update your system please ask - one manual but fairly simple way is to backup and then replace this file: /Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/Bio/AlignIO/ClustalIO.py with the latest one from here: http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/AlignIO/ClustalIO.py?cvsroot=biopython If you are happy at the command line, then I would suggest get the latest version of Biopython from CVS and then re-install from source. Peter From biopython at maubp.freeserve.co.uk Tue Aug 5 17:08:45 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 5 Aug 2008 22:08:45 +0100 Subject: [BioPython] Clustalw.parse_file errors In-Reply-To: <4898B858.7040405@berkeley.edu> References: <4898B858.7040405@berkeley.edu> Message-ID: <320fb6e00808051408n25278f02ua89c02784f72dfc0@mail.gmail.com> On Tue, Aug 5, 2008 at 9:30 PM, Nick Matzke wrote: > Hi all, > > I'm running through the excellent biopython tutorial here: > http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#htoc100 I'm glad you are enjoying the Tutorial (apart from the parsing bug!). I can't take any credit for this bit ;) Seeing as you are trying to use the SummaryInfo class, I should mention that in Biopython 1.47 this doesn't work very well with generic alphabets. In some cases this means you have to supply some of the optional arguments like the characters to ignore (e.g. "-") which might otherwise be inferred from the alphabet. There have been some changes in CVS to try and address this. 
http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Align/AlignInfo.py?cvsroot=biopython > I've hit an error here: > 9.4.2 Creating your own substitution matrix from an alignment > http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#htoc100 > > ...basically the Clustalw parser won't parse even the given example > alignment file (protein.aln) or another example file from elsewhere > (example.aln). The good news is I've just checked protein.aln on my machine, and it can be parsed fine. This is using the CVS version of Biopython, but probably just updating the file .../Bio/AlignIO/ClustalIO.py as I suggested in my earlier email will fix this too. I've realised that our unit tests didn't include the example file protein.aln, otherwise we would have caught this earlier (when I made ClustalW parsing change). Its a bit late now, but I have just added protein.aln to the alignment parsing unit test for future validation. Peter From biopython at maubp.freeserve.co.uk Tue Aug 5 17:27:50 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 5 Aug 2008 22:27:50 +0100 Subject: [BioPython] Clustalw.parse_file errors In-Reply-To: <320fb6e00808051408n25278f02ua89c02784f72dfc0@mail.gmail.com> References: <4898B858.7040405@berkeley.edu> <320fb6e00808051408n25278f02ua89c02784f72dfc0@mail.gmail.com> Message-ID: <320fb6e00808051427x532f448r91aa8c35afe59541@mail.gmail.com> Peter wrote: > Nick Matzke wrote: >> Hi all, >> >> I'm running through the excellent biopython tutorial here: >> http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#htoc100 > > I'm glad you are enjoying the Tutorial (apart from the parsing bug!). > I can't take any credit for this bit ;) [I meant I can't take credit for this bit of the tutorial, however the bug was mine!] >> ...basically the Clustalw parser won't parse even the given example >> alignment file (protein.aln) or another example file from elsewhere >> (example.aln). I've also checked I can read this file too: http://www.pasteur.fr/recherche/unites/sis/formation/python/data/example.aln For example, >>> from Bio import AlignIO >>> a = AlignIO.read(open("/tmp/example.aln"), "clustal") >>> print a SingleLetterAlphabet() alignment with 12 rows and 1168 columns MESGHLLWALLFMQSLWPQLTDGATRVYYLGIRDVQWNYAPKGR...FKQ Q9C058 ... Currently Bio.AlignIO does not let you define the alphabet. See: http://bugzilla.open-bio.org/show_bug.cgi?id=2443 Alternatively, using Bio.Clustalw which does let you define an alphabet: >>> from Bio import Clustalw >>> from Bio.Alphabet import IUPAC, Gapped >>> a = Clustalw.parse_file("/tmp/example.aln", Gapped(IUPAC.protein,"-")) >>> print a Note that using Bio.Clustalw you get a sub-class of the generic alignment, which has a different str method (meaning "print a" will re-create the alignment in clustal format). Peter From matzke at berkeley.edu Tue Aug 5 17:45:56 2008 From: matzke at berkeley.edu (Nick Matzke) Date: Tue, 05 Aug 2008 14:45:56 -0700 Subject: [BioPython] biopython tutorial In-Reply-To: <4898B858.7040405@berkeley.edu> References: <4898B858.7040405@berkeley.edu> Message-ID: <4898CA14.7080400@berkeley.edu> Hi again, I just ran through the biopython tutorial, sections 1 through 9.5. It is really great, & thanks to the people who wrote it. While copying-pasting code etc. to try it on my own system I noticed a few typos & other minor issues which I figured I should make note of for Peter or whomever maintains it. Thanks again for the tutorial! Nick 1. 
my_blast_file = "m_cold.fasta" should be: my_blast_db = "m_cold.fasta" 2. record[0]["GBSeq_definition"] 'Opuntia subulata rpl16 gene, intron; chloroplast' ...should be (AFAICT): record['Bioseq-set_seq-set'][0]['Seq-entry_set']['Bioseq-set']['Bioseq-set_seq-set'][0]['Seq-entry_seq']['Bioseq']['Bioseq_descr']['Seq-descr'][2]['Seqdesc_title'] 3. >>> record[0]["GBSeq_source"] 'chloroplast Austrocylindropuntia subulata' ...the exact string 'chloroplast Austrocylindropuntia subulata' doesn't seem to exist in the downloaded data, so not sure what is meant... 4. the 814 hits are now 816 throughout 5. add links for prosite & swissprot db downloads 6. Nanoarchaeum equitans Kin4-M (RefSeq NC_005213, GenBank AE017199) which can be downloaded from the NCBI here (only 1.15 MB): link location is weird (only paren is linked) 7. ============ As the name suggests, this is a really simple consensus calculator, and will just add up all of the residues at each point in the consensus, and if the most common value is higher than some threshold value (the default is .3) will add the common residue to the consensus. If it doesn?t reach the threshold, it adds an ambiguity character to the consensus. The returned consensus object is Seq object whose alphabet is inferred from the alphabets of the sequences making up the consensus. So doing a print consensus would give: consensus Seq('TATACATNAAAGNAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGAAT ...', IUPACAmbiguousDNA()) You can adjust how dumb_consensus works by passing optional parameters: the threshold This is the threshold specifying how common a particular residue has to be at a position before it is added. The default is .7. ============ Is the default 0.3 or 0.7 -- I assume 0.7 for DNA. 8. info_content = summary_align.information_content(5, 30, log_base = 10 chars_to_ignore = ['N']) missing comma 9. 9.4.1 Using common substitution matrices blank 10. in PDB section: for model in structure.get_list() for chain in model.get_list(): for residue in chain.get_list(): ...first line needs colon (:) happens again lower down: for model in structure.get_list() for chain in model.get_list(): for residue in chain.get_list(): 11. from PDBParser import PDBParser should be: from Bio.PDB.PDBParser import PDBParser From biopython at maubp.freeserve.co.uk Tue Aug 5 17:52:36 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 5 Aug 2008 22:52:36 +0100 Subject: [BioPython] SeqRecord to Genbank: use SeqIO? In-Reply-To: <0C6CF316-7956-4C60-BAD8-F2962A1D6C60@u.washington.edu> References: <0B3D0CC5-820B-4D98-8EB6-6ED6F7D2B6C8@u.washington.edu> <320fb6e00808010220k647f7dcfw71e58ff09a457c32@mail.gmail.com> <320fb6e00808010321o2aaec1a3je6a3a8d263769931@mail.gmail.com> <0C6CF316-7956-4C60-BAD8-F2962A1D6C60@u.washington.edu> Message-ID: <320fb6e00808051452o1a3cf5bqbdccb97dd0fa7c9d@mail.gmail.com> Hi Cedar, Did you mean to send this to me personally? I hope you don't mind me sending this reply to the list too. > Thank you all for your replies. > >>> The problem is, it doesn't support something like: >>> SeqIO.write(seq_record, out_file_handle, "genbank") >>> Because it requires an iterable object I guess? 
> >> Yes, you would have to use this: >> SeqIO.write([seq_record], out_file_handle, "genbank") > > This suggestion makes sense, but when I try it, I get: > > File "downloader.py", line 40, in > SeqIO.write([record], out_file_handle, "genbank") > File "/sw/lib/python2.5/site-packages/Bio/SeqIO/__init__.py", line 238, in > write > raise ValueError("Unknown format '%s'" % format) > ValueError: Unknown format 'genbank' > > and here is the line 40 of code it refers to: > SeqIO.write([record], out_file_handle, "genbank") > > I'm running 1.47 installed via fink. Right - because in Biopython 1.47, Bio.SeqIO don't support GenBank output (as I had tried to make clear). Earlier this week I committed very preliminary support for writing GenBank files with Bio.SeqIO to CVS. Please add yourself as a CC on Bug 2294 if you want to be kept apprised of this. http://bugzilla.open-bio.org/show_bug.cgi?id=2294 Would it help if the error message for this situation was a little more precise? e.g. Rather than "Unknown format 'xxx'", perhaps "Writing 'xxx' format is not supported yet, only reading it". >> Could I ask why you want to get the SeqRecord as a string in GenBank >> format? > > Thanks for the tip for how to get a string. I want to be able to present a > genbank file inline in a webpage. Also during trouble shooting, I was trying > to read a genbank file in, then print it to the console, just to make sure > things were working. OK - wanting a SeqRecord as a string for embedding in a webpage this makes perfect sense. For debugging, "print record" should give you a human readable output (but it isn't in any particular format). You have explicitly asked about SeqRecord to GenBank, but as an aside, the Tutorial does (briefly) talk about using Bio.GenBank to get a "genbank record" rather than a SeqRecord object. This is a simple and direct representation of the raw GenBank fields, and it should be possible to use this to almost recreate the GenBank file. >>> from Bio import GenBank >>> gb_iterator = GenBank.Iterator(open("cor6_6.gb"), GenBank.RecordParser()) >>> for cur_record in gb_iterator : print cur_record This won't be 100% the same as the input file, but it is close. > I'm probably way out of line here, because frankly, I'm not the best python > coder, and I haven't contributed a thing to biopython, but here it is > anyway: > > I don't understand why SeqIO must write to a handle anyway. I think > something like: > > file_handle.write(SeqIO.to_string([record], "genbank")) > > is just as easy as the existing method, and has the advantage of giving us > the option of just getting a string like: > > genbank_string = SeqIO.to_string([record], "genbank") When we first discussed the proposed SeqIO interface, handles were seen as a sensible common abstraction. The desire to get a string was discussed but (as I recall) was not considered to be as common as wanting to write to a file. In fact web-server applications are still the only example I can think of right now, and the StringIO solution or the "to string method" discussed below cover this. > And while I'm at it, I think even easier would be: > > file_handle.write(record.to_format("genbank")) > and > genbank_string = record.to_format("genbank") > > would be even easier. If you have any preference on the precise function name, please add a comment on Bug 2561. http://bugzilla.open-bio.org/show_bug.cgi?id=2561 > In any case, biopython make my life much easier, and I appreciate it! 
> best, > Cedar Great :) Peter From biopython at maubp.freeserve.co.uk Tue Aug 5 17:57:45 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 5 Aug 2008 22:57:45 +0100 Subject: [BioPython] Clustalw.parse_file errors In-Reply-To: <4898C88D.6010507@berkeley.edu> References: <4898B858.7040405@berkeley.edu> <320fb6e00808051408n25278f02ua89c02784f72dfc0@mail.gmail.com> <4898C88D.6010507@berkeley.edu> Message-ID: <320fb6e00808051457h754ddebfm5e102a570c544cb2@mail.gmail.com> On Tue, Aug 5, 2008 at 10:39 PM, Nick Matzke wrote: > Thanks for the help Peter, it really is a great tutorial! > > I've replaced just the ClustalIO.py file as you suggested, and it parses > both the example.aln and protein.aln files. Good :) > However I tried an ClustalW-formatted alignment file I made awhile ago with > my own data and still got the star_info error: > > AttributeError: Alignment instance has no attribute '_star_info' > > But my file could be weird. Does the _star_info error indicate alphabet > issues or something? The _star_info is a nasty private variable used to store the ClustalW consensus, used if writing the file back out again in clustal format. The error suggests something else has gone wrong with the consensus parsing... (and shouldn't be anything to do with the alphabet). Could you file a bug, and (after filing the bug) could you upload one of these example files to the bug as an attachment please? Peter From peter at maubp.freeserve.co.uk Tue Aug 5 18:28:29 2008 From: peter at maubp.freeserve.co.uk (Peter) Date: Tue, 5 Aug 2008 23:28:29 +0100 Subject: [BioPython] biopython tutorial In-Reply-To: <4898CA14.7080400@berkeley.edu> References: <4898B858.7040405@berkeley.edu> <4898CA14.7080400@berkeley.edu> Message-ID: <320fb6e00808051528q3e1a8bccx14dce611a2e6cbba@mail.gmail.com> On Tue, Aug 5, 2008 at 10:45 PM, Nick Matzke wrote: > Hi again, > > I just ran through the biopython tutorial, sections 1 through 9.5. It is > really great, & thanks to the people who wrote it. On behalf of all the other authors, thank you :) > While copying-pasting code etc. to try it on my own system I noticed a few > typos & other minor issues which I figured I should make note of for Peter > or whomever maintains it. Although I have made plenty of changes and updates to the tutorial, its still a joint effort. I probably tend to make more little fixes than other people, which shows up more on the CVS history! Little things like this are always worth pointing out - and comments from new-comers and beginners can be extra helpful if they reveal assumptions or other things that could be clearer. > 1. > my_blast_file = "m_cold.fasta" > should be: > my_blast_db = "m_cold.fasta" I may have misunderstood you, but I think its correct. There are two important things for a BLAST search, the input file (here the FASTA file m_cold.fasta) and the database to search against (in the example b. subtilis sequences). > 2. > record[0]["GBSeq_definition"] > 'Opuntia subulata rpl16 gene, intron; chloroplast' > > ...should be (AFAICT): Something strange is going on - the NCBI didn't give me XML by default as I expected: from Bio import Entrez handle = Entrez.efetch(db="nucleotide", id="57240072", email="A.N.Other at example.com") data = handle.read() print data[:100] It looks like the NCBI may have changed something - Michiel? > 4. > the 814 hits are now 816 throughout That number is always going to increase - maybe we can reword things slightly to make it clear that may not be exactly what the user will see. > 5. 
> add links for prosite & swissprot db downloads Where would you add these, and which URLs did you have in mind? > 6. > Nanoarchaeum equitans Kin4-M (RefSeq NC_005213, GenBank AE017199) which can > be downloaded from the NCBI here (only 1.15 MB): > > link location is weird (only paren is linked) Whoops - both the PDF and HTML are like that... looks like a mix up in the LaTeX syntax. Fixed in CVS. > > 7. > ============ > As the name suggests, this is a really simple consensus calculator, and will > just add up all of the residues at each point in the consensus, and if the > most common value is higher than some threshold value (the default is .3) > will add the common residue to the consensus. If it doesn't reach the > threshold, it adds an ambiguity character to the consensus. The returned > consensus object is Seq object whose alphabet is inferred from the alphabets > of the sequences making up the consensus. So doing a print consensus would > give: > > consensus Seq('TATACATNAAAGNAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGAAT > ...', IUPACAmbiguousDNA()) > > You can adjust how dumb_consensus works by passing optional parameters: > > the threshold > This is the threshold specifying how common a particular residue has to > be at a position before it is added. The default is .7. > ============ > > Is the default 0.3 or 0.7 -- I assume 0.7 for DNA. The default is 0.7 for any sequence type (DNA, protein, etc). Do you mean which way round is the percentage counted (the letter has to be above 70% I think)? > 8. > info_content = summary_align.information_content(5, 30, log_base = 10 > chars_to_ignore = ['N']) > missing comma Fixed in CVS. > 9. > 9.4.1 Using common substitution matrices > > blank So it is - would anyone like to write something for this? > 10. > in PDB section: > > for model in structure.get_list() > for chain in model.get_list(): > for residue in chain.get_list(): > > ...first line needs colon (:) > > happens again lower down: > for model in structure.get_list() > for chain in model.get_list(): > for residue in chain.get_list(): > Fixed two of these in CVS. > 11. > from PDBParser import PDBParser > > should be: > > from Bio.PDB.PDBParser import PDBParser Fixed in CVS. Note that we don't normally update the online copies of the HTML and PDF tutorial between releases (so as to avoid talking about unreleased features). However, there have been a few updates to the Tutorial since Biopython 1.47 so maybe we should consider it? Thanks again Nick! Peter From matzke at berkeley.edu Tue Aug 5 18:41:46 2008 From: matzke at berkeley.edu (Nick Matzke) Date: Tue, 05 Aug 2008 15:41:46 -0700 Subject: [BioPython] biopython tutorial In-Reply-To: <320fb6e00808051528q3e1a8bccx14dce611a2e6cbba@mail.gmail.com> References: <4898B858.7040405@berkeley.edu> <4898CA14.7080400@berkeley.edu> <320fb6e00808051528q3e1a8bccx14dce611a2e6cbba@mail.gmail.com> Message-ID: <4898D72A.8030207@berkeley.edu> Peter wrote: > On Tue, Aug 5, 2008 at 10:45 PM, Nick Matzke wrote: >> Hi again, >> >> I just ran through the biopython tutorial, sections 1 through 9.5. It is >> really great, & thanks to the people who wrote it. > > On behalf of all the other authors, thank you :) > >> While copying-pasting code etc. to try it on my own system I noticed a few >> typos & other minor issues which I figured I should make note of for Peter >> or whomever maintains it. > > Although I have made plenty of changes and updates to the tutorial, > its still a joint effort. 
I probably tend to make more little fixes > than other people, which shows up more on the CVS history! > > Little things like this are always worth pointing out - and comments > from new-comers and beginners can be extra helpful if they reveal > assumptions or other things that could be clearer. > >> 1. >> my_blast_file = "m_cold.fasta" >> should be: >> my_blast_db = "m_cold.fasta" > > I may have misunderstood you, but I think its correct. There are two > important things for a BLAST search, the input file (here the FASTA > file m_cold.fasta) and the database to search against (in the example > b. subtilis sequences). Yeah sorry, I was confused there but forgot to fix my note after I figured it out! > >> 2. >> record[0]["GBSeq_definition"] >> 'Opuntia subulata rpl16 gene, intron; chloroplast' >> >> ...should be (AFAICT): > > Something strange is going on - the NCBI didn't give me XML by default > as I expected: > > from Bio import Entrez > handle = Entrez.efetch(db="nucleotide", id="57240072", > email="A.N.Other at example.com") > data = handle.read() > print data[:100] > > It looks like the NCBI may have changed something - Michiel? > >> 4. >> the 814 hits are now 816 throughout > > That number is always going to increase - maybe we can reword things > slightly to make it clear that may not be exactly what the user will > see. Yeah I figured it was this no worries. If you want to be OCD like I apparently am you could add a note to this effect. >> 5. >> add links for prosite & swissprot db downloads > > Where would you add these, and which URLs did you have in mind? I was thinking in this section: ======== To parse a file that contains more than one Swiss-Prot record, we use the parse function instead. This function allows us to iterate over the records in the file. For example, let?s parse the full Swiss-Prot database and collect all the descriptions. The full Swiss-Prot database, downloaded from ExPASy on 4 December 2007, contains 290484 Swiss-Prot records in a single gzipped-file uniprot_sprot.dat.gz. ======== ...it could link to: ftp://ca.expasy.org/databases/uniprot/current_release/knowledgebase/complete ...and in this section: ======== In general, a Prosite file can contain more than one Prosite records. For example, the full set of Prosite records, which can be downloaded as a single file (prosite.dat) from ExPASy, contains 2073 records in (version 20.24 released on 4 December 2007). To parse such a file, we again make use of an iterator: ======== ...it could link to: ftp://ftp.expasy.org/databases/prosite/ I found these without too much trouble on my own of course but might be handy for newbies. Also, the tutorial might give an estimate of how long it will take to parse the full Swiss-Prot DB, I waited a few minutes & then decided to move on. Maybe a smaller file or subset with just e.g. 100 records would be appropriate for the tutorial? > >> 6. >> Nanoarchaeum equitans Kin4-M (RefSeq NC_005213, GenBank AE017199) which can >> be downloaded from the NCBI here (only 1.15 MB): >> >> link location is weird (only paren is linked) > > Whoops - both the PDF and HTML are like that... looks like a mix up in > the LaTeX syntax. Fixed in CVS. > >> 7. >> ============ >> As the name suggests, this is a really simple consensus calculator, and will >> just add up all of the residues at each point in the consensus, and if the >> most common value is higher than some threshold value (the default is .3) >> will add the common residue to the consensus. 
If it doesn't reach the >> threshold, it adds an ambiguity character to the consensus. The returned >> consensus object is Seq object whose alphabet is inferred from the alphabets >> of the sequences making up the consensus. So doing a print consensus would >> give: >> >> consensus Seq('TATACATNAAAGNAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGAAT >> ...', IUPACAmbiguousDNA()) >> >> You can adjust how dumb_consensus works by passing optional parameters: >> >> the threshold >> This is the threshold specifying how common a particular residue has to >> be at a position before it is added. The default is .7. >> ============ >> >> Is the default 0.3 or 0.7 -- I assume 0.7 for DNA. > > The default is 0.7 for any sequence type (DNA, protein, etc). Do you > mean which way round is the percentage counted (the letter has to be > above 70% I think)? I meant that this sentence in the above para: "if the most common value is higher than some threshold value (the default is .3)" should probably just say 0.7 I think. Thanks! Nick > >> 8. >> info_content = summary_align.information_content(5, 30, log_base = 10 >> chars_to_ignore = ['N']) >> missing comma > > Fixed in CVS. > >> 9. >> 9.4.1 Using common substitution matrices >> >> blank > > So it is - would anyone like to write something for this? > >> 10. >> in PDB section: >> >> for model in structure.get_list() >> for chain in model.get_list(): >> for residue in chain.get_list(): >> >> ...first line needs colon (:) >> >> happens again lower down: >> for model in structure.get_list() >> for chain in model.get_list(): >> for residue in chain.get_list(): >> > > Fixed two of these in CVS. > >> 11. >> from PDBParser import PDBParser >> >> should be: >> >> from Bio.PDB.PDBParser import PDBParser > > Fixed in CVS. > > Note that we don't normally update the online copies of the HTML and > PDF tutorial between releases (so as to avoid talking about unreleased > features). However, there have been a few updates to the Tutorial > since Biopython 1.47 so maybe we should consider it? > > Thanks again Nick! > > Peter > -- ==================================================== Nicholas J. Matzke Ph.D. student, Graduate Student Researcher Huelsenbeck Lab 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab website: http://ib.berkeley.edu/people/lab_detail.php?lab=54 Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/~edna/lab_test/members/matzke.html Lab phone: 510-643-6299 Dept. fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Office hours for Bio1B, Spring 2008: Biology: Plants, Evolution, Ecology VLSB 2013, Monday 1-1:30 (some TA there for all hours during work week) Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ==================================================== From peter at maubp.freeserve.co.uk Tue Aug 5 18:55:30 2008 From: peter at maubp.freeserve.co.uk (Peter) Date: Tue, 5 Aug 2008 23:55:30 +0100 Subject: [BioPython] biopython tutorial In-Reply-To: <4898D72A.8030207@berkeley.edu> References: <4898B858.7040405@berkeley.edu> <4898CA14.7080400@berkeley.edu> <320fb6e00808051528q3e1a8bccx14dce611a2e6cbba@mail.gmail.com> <4898D72A.8030207@berkeley.edu> Message-ID: <320fb6e00808051555v5bca752bg49da9199d5c682c1@mail.gmail.com> >>> 4. 
>>> the 814 hits are now 816 throughout >> >> That number is always going to increase - maybe we can reword things >> slightly to make it clear that may not be exactly what the user will >> see. > > Yeah I figured it was this no worries. I might do that tomorrow (along with the links below...) > If you want to be OCD like I apparently am you could add a note to this effect. Having a perfectionist looking after documentation or a website can be a good thing. >>> 5. >>> add links for prosite & swissprot db downloads >> >> Where would you add these, and which URLs did you have in mind? > > > I was thinking in this section: > > ======== > To parse a file that contains more than one Swiss-Prot record, we use the > parse function instead. This function allows us to iterate over the records > in the file. For example, let's parse the full Swiss-Prot database and > collect all the descriptions. The full Swiss-Prot database, downloaded from > ExPASy on 4 December 2007, contains 290484 Swiss-Prot records in a single > gzipped-file uniprot_sprot.dat.gz. > ======== > > ...it could link to: > ftp://ca.expasy.org/databases/uniprot/current_release/knowledgebase/complete > > ...and in this section: > > ======== > In general, a Prosite file can contain more than one Prosite records. For > example, the full set of Prosite records, which can be downloaded as a > single file (prosite.dat) from ExPASy, contains 2073 records in (version > 20.24 released on 4 December 2007). To parse such a file, we again make use > of an iterator: > ======== > > ...it could link to: > ftp://ftp.expasy.org/databases/prosite/ > > I found these without too much trouble on my own of course but might be > handy for newbies. That looks sensible... > Also, the tutorial might give an estimate of how long it will take to parse > the full Swiss-Prot DB, I waited a few minutes & then decided to move on. > Maybe a smaller file or subset with just e.g. 100 records would be > appropriate for the tutorial? It will depend very much on the computer (hard drive mostly). As I recall somewhere between 2 and 10 minutes sounds about right. >>> 7. >>> ============ >>> As the name suggests, this is a really simple consensus calculator, and >>> will ... >> >> The default is 0.7 for any sequence type (DNA, protein, etc). Do you >> mean which way round is the percentage counted (the letter has to be >> above 70% I think)? > > I meant that this sentence in the above para: "if the most common value is > higher than some threshold value (the default is .3)" should probably just > say 0.7 I think. I see it now, fixed in CVS. Thanks! Peter From matzke at berkeley.edu Tue Aug 5 19:19:23 2008 From: matzke at berkeley.edu (Nick Matzke) Date: Tue, 05 Aug 2008 16:19:23 -0700 Subject: [BioPython] Clustalw.parse_file errors In-Reply-To: <320fb6e00808051457h754ddebfm5e102a570c544cb2@mail.gmail.com> References: <4898B858.7040405@berkeley.edu> <320fb6e00808051408n25278f02ua89c02784f72dfc0@mail.gmail.com> <4898C88D.6010507@berkeley.edu> <320fb6e00808051457h754ddebfm5e102a570c544cb2@mail.gmail.com> Message-ID: <4898DFFB.3010704@berkeley.edu> Never mind, it turns out my alignment file was missing a blank line after each section of the alignment. The .aln file doesn't have to have a consensus line with "*", ":" characters in it necessarily, but it does have to have at least a line of spaces of the length of the aligned block (this is what protein.aln has). I inserted a line of spaces after each chunk of the alignment and now it parses. 
(My alignment wasn't generated by Clustal anyway, so I also added this header line to make the parser happy: "CLUSTAL W (1.83) formatted alignment done with PROMALS3D") I.e. for future readers (truncating my .aln file)... ...this got the _star_info error: ================================ CLUSTAL W (1.83) formatted alignment done with PROMALS3D SctN_Salt ----------------MKNEL--------------------------------------- SctN_EHEC MISEHDSVLEKYPRIQKVLNST-------------------------------------- SctN_Chrm ---------MRLPDIRLIENTL-------------------------------------- SctN_Yers ---------MKLPDIARLTPRL-------------------------------------- SctN_Soda ----------MTCNSQRLASML-------------------------------------- SctN_Laws ----------------MALEYI-------------------------------------- SctN_Chl4 ----------------MEEITTE------------------------------------- SctN_Salt --------------------------MQRLRLKYPPP---------DGYCR--------W SctN_EHEC --------------------------VPALSLN-------------SSTRY--------E SctN_Chrm --------------------------RERLTLAPA---PPGQR---SGVEL--------F SctN_Yers --------------------------QQQLTRPSAPP---------EGLRY--------R SctN_Soda --------------------------AQHLTPVDEPP---------DGYRL--------T SctN_Laws --------------------------ASLLEEAVQNT---------SPVEV--------R SctN_Chl4 --------------------------FNTLMTELPDV---------QLTAV--------V =================================== ...but this parsed successfully: ================================ CLUSTAL W (1.83) formatted alignment done with PROMALS3D SctN_Salt ----------------MKNEL--------------------------------------- SctN_EHEC MISEHDSVLEKYPRIQKVLNST-------------------------------------- SctN_Chrm ---------MRLPDIRLIENTL-------------------------------------- SctN_Yers ---------MKLPDIARLTPRL-------------------------------------- SctN_Soda ----------MTCNSQRLASML-------------------------------------- SctN_Laws ----------------MALEYI-------------------------------------- SctN_Chl4 ----------------MEEITTE------------------------------------- SctN_Salt --------------------------MQRLRLKYPPP---------DGYCR--------W SctN_EHEC --------------------------VPALSLN-------------SSTRY--------E SctN_Chrm --------------------------RERLTLAPA---PPGQR---SGVEL--------F SctN_Yers --------------------------QQQLTRPSAPP---------EGLRY--------R SctN_Soda --------------------------AQHLTPVDEPP---------DGYRL--------T SctN_Laws --------------------------ASLLEEAVQNT---------SPVEV--------R SctN_Chl4 --------------------------FNTLMTELPDV---------QLTAV--------V =================================== ...the difference is that the first blank line after the block must be spaces (or consensus characters *:. etc.), not just a blank line. Thanks for the hints! Nick Peter wrote: > On Tue, Aug 5, 2008 at 10:39 PM, Nick Matzke wrote: >> Thanks for the help Peter, it really is a great tutorial! >> >> I've replaced just the ClustalIO.py file as you suggested, and it parses >> both the example.aln and protein.aln files. > > Good :) > >> However I tried an ClustalW-formatted alignment file I made awhile ago with >> my own data and still got the star_info error: >> >> AttributeError: Alignment instance has no attribute '_star_info' >> >> But my file could be weird. Does the _star_info error indicate alphabet >> issues or something? > > The _star_info is a nasty private variable used to store the ClustalW > consensus, used if writing the file back out again in clustal format. > The error suggests something else has gone wrong with the consensus > parsing... (and shouldn't be anything to do with the alphabet). 
> > Could you file a bug, and (after filing the bug) could you upload one > of these example files to the bug as an attachment please? > > Peter > -- ==================================================== Nicholas J. Matzke Ph.D. student, Graduate Student Researcher Huelsenbeck Lab 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab website: http://ib.berkeley.edu/people/lab_detail.php?lab=54 Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/~edna/lab_test/members/matzke.html Lab phone: 510-643-6299 Dept. fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Office hours for Bio1B, Spring 2008: Biology: Plants, Evolution, Ecology VLSB 2013, Monday 1-1:30 (some TA there for all hours during work week) Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ==================================================== From cmckay at u.washington.edu Tue Aug 5 18:47:53 2008 From: cmckay at u.washington.edu (Cedar McKay) Date: Tue, 5 Aug 2008 15:47:53 -0700 Subject: [BioPython] SeqRecord to Genbank: use SeqIO? In-Reply-To: <320fb6e00808051452o1a3cf5bqbdccb97dd0fa7c9d@mail.gmail.com> References: <0B3D0CC5-820B-4D98-8EB6-6ED6F7D2B6C8@u.washington.edu> <320fb6e00808010220k647f7dcfw71e58ff09a457c32@mail.gmail.com> <320fb6e00808010321o2aaec1a3je6a3a8d263769931@mail.gmail.com> <0C6CF316-7956-4C60-BAD8-F2962A1D6C60@u.washington.edu> <320fb6e00808051452o1a3cf5bqbdccb97dd0fa7c9d@mail.gmail.com> Message-ID: On Aug 5, 2008, at 2:52 PM, Peter wrote: > Right - because in Biopython 1.47, Bio.SeqIO don't support GenBank > output (as I had tried to make clear). Earlier this week I committed > very preliminary support for writing GenBank files with Bio.SeqIO to > CVS. Please add yourself as a CC on Bug 2294 if you want to be kept > apprised of this. > http://bugzilla.open-bio.org/show_bug.cgi?id=2294 > Aha! I see. The following is on the SeqIO wiki page (http://www.biopython.org/wiki/SeqIO ): "If you supply the sequences as a SeqRecord iterator, then for sequential file formats like Fasta or GenBank, the records can be written one by one" I think I wrongly thought this implied that Genbank Records can be written. But I see now that isn't the case, and the "Fasta or GenBank" files it references must be the input files that are parsed, not the format of the output. I'm looking forward to this functionality. > Would it help if the error message for this situation was a little > more precise? e.g. Rather than "Unknown format 'xxx'", perhaps > "Writing 'xxx' format is not supported yet, only reading it". > I think your new suggested message is more clear, but the existing one is clear enough. I simply thought there was a problem because I had it in my mind that genbank writing was now supported. > If you have any preference on the precise function name, please add a > comment on Bug 2561. > http://bugzilla.open-bio.org/show_bug.cgi?id=2561 I have no particular preference. Thanks again for the help. cedar From mjldehoon at yahoo.com Tue Aug 5 20:42:53 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 5 Aug 2008 17:42:53 -0700 (PDT) Subject: [BioPython] biopython tutorial In-Reply-To: <320fb6e00808051528q3e1a8bccx14dce611a2e6cbba@mail.gmail.com> Message-ID: <387391.14812.qm@web62412.mail.re1.yahoo.com> --- On Tue, 8/5/08, Peter wrote: > > 2. 
> > record[0]["GBSeq_definition"] > > 'Opuntia subulata rpl16 gene, intron; > chloroplast' > > > > ...should be (AFAICT): > > Something strange is going on - the NCBI didn't give me > XML by default > as I expected: > > from Bio import Entrez > handle = Entrez.efetch(db="nucleotide", > id="57240072", > email="A.N.Other at example.com") > data = handle.read() > print data[:100] > > It looks like the NCBI may have changed something - > Michiel? In this example, retmode='xml' was missing for the current efetch. With retmode='xml', the rest of the example in the Tutorial works correctly. Confusingly, if you use rettype='xml' you will get a different XML output (this is the XML output Nick was looking at). I fixed this section in CVS. --Michiel From cmckay at u.washington.edu Tue Aug 5 20:36:39 2008 From: cmckay at u.washington.edu (Cedar McKay) Date: Tue, 5 Aug 2008 17:36:39 -0700 Subject: [BioPython] Bio.SeqFeature.SeqFeature Message-ID: <0CDEE8AC-1271-4617-B333-C18615970E23@u.washington.edu> Hello. I just upgraded from 1.44 to 1.47 and one of my home-brew classes stopped working. This used to work: class BetterSeqFeature(Bio.SeqFeature.SeqFeature, fastaTools): """Extends the Bio.SeqFeature.SeqFeature class with several new methods and attributes. The Bio.SeqFeature.SeqFeature represents individual features of a genbank record, for example a CDS.""" But now after the upgrade to 1.47 my script throws this: File "/usr/local/python/Rocap.py", line 740, in class BetterSeqFeature(Bio.SeqFeature.SeqFeature, fastaTools): AttributeError: 'module' object has no attribute 'SeqFeature' FYI fastaTools is a class of my own. I looked at the changelog, and don't see anything obvious. Can anyone give me a pointer as to why this stopped working? best, Cedar From biopython at maubp.freeserve.co.uk Wed Aug 6 04:57:45 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 6 Aug 2008 09:57:45 +0100 Subject: [BioPython] biopython tutorial In-Reply-To: <387391.14812.qm@web62412.mail.re1.yahoo.com> References: <320fb6e00808051528q3e1a8bccx14dce611a2e6cbba@mail.gmail.com> <387391.14812.qm@web62412.mail.re1.yahoo.com> Message-ID: <320fb6e00808060157x2c495e55gcd4946f5b241d7fd@mail.gmail.com> >> Something strange is going on - the NCBI didn't give me >> XML by default as I expected: >> >> from Bio import Entrez >> handle = Entrez.efetch(db="nucleotide", >> id="57240072", >> email="A.N.Other at example.com") >> data = handle.read() >> print data[:100] >> >> It looks like the NCBI may have changed something - >> Michiel? > > In this example, retmode='xml' was missing for the current efetch. > With retmode='xml', the rest of the example in the Tutorial works correctly. Confusingly, if you use rettype='xml' you will get a different XML output (this is the XML output Nick was looking at). That sort of explains things - I had tried rettype="xml" last night, and got something in XML back which didn't look right. The NCBI have a table of retmode versus rettype which does point out there is more than one XML variant you can get back! 
http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html Peter From biopython at maubp.freeserve.co.uk Wed Aug 6 05:03:54 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 6 Aug 2008 10:03:54 +0100 Subject: [BioPython] Bio.SeqFeature.SeqFeature In-Reply-To: <0CDEE8AC-1271-4617-B333-C18615970E23@u.washington.edu> References: <0CDEE8AC-1271-4617-B333-C18615970E23@u.washington.edu> Message-ID: <320fb6e00808060203s7e386ed7t6d40cd4dff4c43bc@mail.gmail.com> On Wed, Aug 6, 2008 at 1:36 AM, Cedar McKay wrote: > Hello. I just upgraded from 1.44 to 1.47 and one of my home-brew classes > stopped working. This used to work: > > class BetterSeqFeature(Bio.SeqFeature.SeqFeature, fastaTools): > """Extends the Bio.SeqFeature.SeqFeature class with several new > methods and attributes. > The Bio.SeqFeature.SeqFeature represents individual features of a > genbank record, > for example a CDS.""" > > > But now after the upgrade to 1.47 my script throws this: > > File "/usr/local/python/Rocap.py", line 740, in > class BetterSeqFeature(Bio.SeqFeature.SeqFeature, fastaTools): > AttributeError: 'module' object has no attribute 'SeqFeature' > > > FYI fastaTools is a class of my own. > > I looked at the changelog, and don't see anything obvious. Can anyone give > me a pointer as to why this stopped working? How are you importing the SeqFeature? You would get that error message if you didn't import the SeqFeature at all. Perhaps something subtle has changed there (i.e. you may have been relying on another module importing it on your behalf). You could try: import Bio.SeqFeature class BetterSeqFeature(Bio.SeqFeature.SeqFeature, fastaTools): #... or, from Bio.SeqFeature import SeqFeature class BetterSeqFeature(SeqFeature, fastaTools): #... Peter From biopython at maubp.freeserve.co.uk Wed Aug 6 05:27:17 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 6 Aug 2008 10:27:17 +0100 Subject: [BioPython] SeqRecord to Genbank: use SeqIO? In-Reply-To: References: <0B3D0CC5-820B-4D98-8EB6-6ED6F7D2B6C8@u.washington.edu> <320fb6e00808010220k647f7dcfw71e58ff09a457c32@mail.gmail.com> <320fb6e00808010321o2aaec1a3je6a3a8d263769931@mail.gmail.com> <0C6CF316-7956-4C60-BAD8-F2962A1D6C60@u.washington.edu> <320fb6e00808051452o1a3cf5bqbdccb97dd0fa7c9d@mail.gmail.com> Message-ID: <320fb6e00808060227l25ccc76bs578f6b6635cdec65@mail.gmail.com> On Tue, Aug 5, 2008 at 11:47 PM, Cedar McKay wrote: > > Aha! I see. The following is on the SeqIO wiki page > (http://www.biopython.org/wiki/SeqIO): > > "If you supply the sequences as a SeqRecord iterator, then for sequential > file formats like Fasta or GenBank, the records can be written one by one" > > I think I wrongly thought this implied that Genbank Records can be written. > But I see now that isn't the case, and the "Fasta or GenBank" files it > references must be the input files that are parsed, not the format of the > output. I'm looking forward to this functionality. I agree with you - in hindsight that bit of the wiki is misleading. Sorry about that. I was using GenBank as an example of a sequential file format where the records can be written one by one (unlike for example Clustal or most multiple sequence alignment formats where the records are interleaved). This is true, and a valid example of what I meant by a "sequential file format" - as are SwissProt and EMBL. However, this wording did wrongly give the impression that Bio.SeqIO could write GenBank files (which Biopython 1.47 can't do). 
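To make the intended meaning concrete, here is a minimal sketch of
writing records one by one to a sequential output format that
Biopython 1.47 does support (FASTA); the file names are hypothetical:

from Bio import SeqIO
in_handle = open("example.gbk")        # hypothetical GenBank input file
out_handle = open("example.faa", "w")
for record in SeqIO.parse(in_handle, "genbank"):
    # each record is appended to the output handle individually
    SeqIO.write([record], out_handle, "fasta")
out_handle.close()
in_handle.close()

Once GenBank output support is finished, the same pattern should work
with "genbank" as the output format name.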
>> Would it help if the error message for this situation was a little
>> more precise? e.g. Rather than "Unknown format 'xxx'", perhaps
>> "Writing 'xxx' format is not supported yet, only reading it".
>>
> I think your new suggested message is more clear, but the existing one is
> clear enough. I simply thought there was a problem because I had it in my
> mind that genbank writing was now supported.

I've updated Bio.SeqIO and Bio.AlignIO, so that they will say:

ValueError: Reading format 'xxx' is supported, but not writing

rather than:

ValueError: Unknown format 'xxx'

when the format is known (but only as an input format). I think this is
more helpful and more accurate.

Peter

From biopython at maubp.freeserve.co.uk Wed Aug 6 13:10:02 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 6 Aug 2008 18:10:02 +0100
Subject: [BioPython] biopython tutorial
In-Reply-To: <320fb6e00808051555v5bca752bg49da9199d5c682c1@mail.gmail.com>
References: <4898B858.7040405@berkeley.edu> <4898CA14.7080400@berkeley.edu>
 <320fb6e00808051528q3e1a8bccx14dce611a2e6cbba@mail.gmail.com>
 <4898D72A.8030207@berkeley.edu>
 <320fb6e00808051555v5bca752bg49da9199d5c682c1@mail.gmail.com>
Message-ID: <320fb6e00808061010r3fa01e7m78284b3f54cf51bb@mail.gmail.com>

On Tue, Aug 5, 2008 at 11:55 PM, Peter wrote:
>>>> 4.
>>>> the 814 hits are now 816 throughout
>>>
>>> That number is always going to increase - maybe we can reword things
>>> slightly to make it clear that may not be exactly what the user will
>>> see.

Michiel has changed the wording slightly here in CVS.

>>>> 5.
>>>> add links for prosite & swissprot db downloads

I've added those links in CVS, and mentioned the SwissProt example
takes about seven minutes on my machine.

Peter

From biopython at maubp.freeserve.co.uk Thu Aug 7 06:11:22 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 7 Aug 2008 11:11:22 +0100
Subject: [BioPython] SeqRecord to file format as string
Message-ID: <320fb6e00808070311s40ff42e6tae265d2aa37ee684@mail.gmail.com>

Following discussion both here and on the development mailing list,
the SeqRecord and the Alignment objects are getting methods to present
the object as a string in a requested file format (using Bio.SeqIO and
Bio.AlignIO internally). This is enhancement Bug 2561,
http://bugzilla.open-bio.org/show_bug.cgi?id=2561

I've added a .format() method, which takes a format name used in
Bio.SeqIO or Bio.AlignIO. The name of this method is not final until it
is part of an official Biopython release, so if there are any strong
views on this please voice them sooner rather than later.
For an example of how this works, if your have a SeqRecord object in variable record, you could do: print record.format("fasta") print record.format("tab") (or any other output format name supported in Bio.SeqIO for a single record) Similarly, if you had an Alignment object in variable align, you could do: print align.format("fasta") print align.format("clustal") print align.format("stockholm") (or any other output format name supported in Bio.AlignIO) This functionality will also be available via the special format() function being added to Python 2.6 and 3.0, giving the alternative: print format(align, "fasta") See PEP 3101 for details about the format system, http://www.python.org/dev/peps/pep-3101/ Peter From biopython at maubp.freeserve.co.uk Thu Aug 7 08:06:08 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 7 Aug 2008 13:06:08 +0100 Subject: [BioPython] Clustalw.parse_file errors In-Reply-To: <4898DFFB.3010704@berkeley.edu> References: <4898B858.7040405@berkeley.edu> <320fb6e00808051408n25278f02ua89c02784f72dfc0@mail.gmail.com> <4898C88D.6010507@berkeley.edu> <320fb6e00808051457h754ddebfm5e102a570c544cb2@mail.gmail.com> <4898DFFB.3010704@berkeley.edu> Message-ID: <320fb6e00808070506w34446572nbdc131fa079e8773@mail.gmail.com> On Wed, Aug 6, 2008 at 12:19 AM, Nick Matzke wrote: > (My alignment wasn't generated by Clustal anyway, so I also added this > header line to make the parser happy: "CLUSTAL W (1.83) formatted alignment > done with PROMALS3D") I've just tried an alignment from PROMALS3D in their Clustal W output format: http://prodata.swmed.edu/promals3d/promals3d.php I tried their default settings, a wrap of 50, and unwrapped long lines - and with all of these the CVS Biopython Bio.AlignIO parser seems fine. However, as you point out, when using Bio.Clustalw there is a problem. The missing version number causes an error, which I regard as a bug worth fixing. http://bugzilla.open-bio.org/show_bug.cgi?id=2564 Peter From mjldehoon at yahoo.com Thu Aug 7 10:56:37 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 7 Aug 2008 07:56:37 -0700 (PDT) Subject: [BioPython] Bio.Medline parser In-Reply-To: <591490.20671.qm@web62411.mail.re1.yahoo.com> Message-ID: <642190.59135.qm@web62407.mail.re1.yahoo.com> If there are no further suggestions, I'll implement the .find() method as described below. --Michiel. --- On Sat, 8/2/08, Michiel de Hoon wrote: > From: Michiel de Hoon > Subject: Re: [BioPython] Bio.Medline parser > To: "Peter" > Cc: biopython at biopython.org > Date: Saturday, August 2, 2008, 9:57 PM > > The alias idea is nice but does mean there is more than > one > > way to access the data (not encouraged in python). A > related > > suggestion is to support the properties > record.entry_date, > > record.author etc (what ever the current parser does) > as > > alternatives to record["DA"], > record["AU"], ... ? This would > > then be backwards compatible. This could probably be > done with > > a private dictionary mapping keys ("DA") to > property names > > ("entry_date"). When ever we add a new > entry to the > > dictionary, also see if it has a named property to > define > > too. > > > Thinking it over, I think that having a key and an > attribute mapping to the same value is not so clean. > Alternatively we could add a .find(term) method to the > Bio.Medline.Record class, which takes a term and returns the > appropriate value. So record.find("author") > returns record["AU"]. 
This gives a clear > separation between the raw keys in the Medline file and the > more descriptive names. Also, such a .find method can accept > a wider variety of terms than an attribute name (e.g., > "Full Author", "full_author", etc. all > return record["FAU"]). > > --Michiel > > > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From biopython at maubp.freeserve.co.uk Thu Aug 7 14:15:08 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 7 Aug 2008 19:15:08 +0100 Subject: [BioPython] Bio.Medline parser In-Reply-To: <642190.59135.qm@web62407.mail.re1.yahoo.com> References: <591490.20671.qm@web62411.mail.re1.yahoo.com> <642190.59135.qm@web62407.mail.re1.yahoo.com> Message-ID: <320fb6e00808071115k5111d184v13cb37457c3fdbd2@mail.gmail.com> On Thu, Aug 7, 2008 at 3:56 PM, Michiel de Hoon wrote: > If there are no further suggestions, I'll implement the .find() method as described below. > > ... > >> Thinking it over, I think that having a key and an >> attribute mapping to the same value is not so clean. >> Alternatively we could add a .find(term) method to the >> Bio.Medline.Record class, which takes a term and returns the >> appropriate value. So record.find("author") >> returns record["AU"]. This gives a clear >> separation between the raw keys in the Medline file and the >> more descriptive names. Also, such a .find method can accept >> a wider variety of terms than an attribute name (e.g., >> "Full Author", "full_author", etc. all >> return record["FAU"]). >> >> --Michiel When would anyone use the .find() method? Perhaps if exploring at the command line. If you are writing a script, then once you know you that "FAU" means "Full Author" then you would always just use record["FAU"] directly. Maybe it would make sense just to describe the keys in the docstring, and that would be enough. On a related point, from the Entrez documentation can the MedLine records be accessed as either plain text, XML (or html or asn.1)/ How does the data structure from parsing the XML version with the Bio.Entrez.read() compare to your ideas for the MedLine plain text parser? Maybe we can just deprecate Bio.Medline (i.e. the plain text parser) in favour of Bio.Entrez (and its XML parser)? http://eutils.ncbi.nlm.nih.gov/entrez/query/static/efetchlit_help.html Peter From biopython at maubp.freeserve.co.uk Fri Aug 8 06:28:42 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 8 Aug 2008 11:28:42 +0100 Subject: [BioPython] Deprecating Bio.Saf (PredictProtein Simple Alignment Format) In-Reply-To: <320fb6e00808020502l64eca901gda132250fc85e5be@mail.gmail.com> References: <320fb6e00807231512s507d2652jc0f26764a62b01d5@mail.gmail.com> <320fb6e00808020502l64eca901gda132250fc85e5be@mail.gmail.com> Message-ID: <320fb6e00808080328v4c84639evb74e3c722d15658e@mail.gmail.com> I wrote: >> Is anyone using Bio.Saf or PredictProtein's "Simple Alignment Format" (SAF)? >> >> Bio.Saf is one of the older parsers in Biopython. It parses the >> PredictProtein "Simple Alignment Format" (SAF), a fairly free-format >> multiple sequence alignment file format described here: >> http://www.predictprotein.org/Dexa/optin_saf.html > >... > >> If no one is using it, I would like to deprecate Bio.Saf in the next >> release of Biopython. > > Still no objections? As no-one has objected, I have marked Bio.Saf as deprecated in CVS. 
As usual, the intention is to keep the deprecated module for a couple more releases before removing it. Peter From biopython at maubp.freeserve.co.uk Fri Aug 8 07:51:35 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 8 Aug 2008 12:51:35 +0100 Subject: [BioPython] Deprecating Bio.NBRF Message-ID: <320fb6e00808080451i2e53de52k551dfd455bcaf3e3@mail.gmail.com> Dear all, Is anyone using Bio.NBRF for reading NBRF/PIR files? Good news: I've just added support for reading NBRF/PIR files as SeqRecord objects to Bio.SeqIO, under the format name "pir" as used in EMBOSS and BioPerl. See enhancement Bug 2535, http://bugzilla.open-bio.org/show_bug.cgi?id=2535 Bad news: I would now like to deprecate the old Bio.NBRF module which was an NBRF/PIR parser which generated its own record objects (not SeqRecord objects). The main reason to drop this module is it relies on some of Biopython's older parsing infrastructure which depends on mxTextTools (and doesn't entirely work with mxTextTools 3.0). So, if anyone if using Bio.NBRF, please get in touch. Peter From biopython at maubp.freeserve.co.uk Fri Aug 8 13:32:58 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 8 Aug 2008 18:32:58 +0100 Subject: [BioPython] Bio.Medline parser In-Reply-To: <367259.79982.qm@web62403.mail.re1.yahoo.com> References: <320fb6e00808071115k5111d184v13cb37457c3fdbd2@mail.gmail.com> <367259.79982.qm@web62403.mail.re1.yahoo.com> Message-ID: <320fb6e00808081032t1bf11e6fv32ed32ca4669cae5@mail.gmail.com> On Fri, Aug 8, 2008 at 3:14 PM, Michiel de Hoon wrote: > >> Maybe it would make sense just to describe >> the keys in the docstring, and that would be enough. > > OK I'll go with just the docstring for now. If users ask for it, we can add a more > descriptive function later. > >> Maybe we can just deprecate Bio.Medline (i.e. the >> plain text parser) in favour of Bio.Entrez (and its XML parser)? > > Usually I am in favor of deprecating modules if their usefulness is not clear. In > this case, however, Medline is a major database, the Medline record format is > readily available from NCBI, it is human readable (more or less) and computer > readable, the resulting Bio.Medline.Record may be easier to deal with than the > record created from XML by Bio.Entrez, and the parser is straightforward but > not entirely trivial. Being able to parse such a file is something I'd expect from > Biopython. That sounds sensible. Maybe we should have an example in the Tutorial of using Bio.Entrez to download some data in the plain text MedLine format, and parsing it with Bio.MedLine? And perhaps also an equivalent using the XML Medline format parsed using Bio.Entrez? Peter From mjldehoon at yahoo.com Fri Aug 8 10:14:18 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 8 Aug 2008 07:14:18 -0700 (PDT) Subject: [BioPython] Bio.Medline parser In-Reply-To: <320fb6e00808071115k5111d184v13cb37457c3fdbd2@mail.gmail.com> Message-ID: <367259.79982.qm@web62403.mail.re1.yahoo.com> > Maybe it would make sense just to describe > the keys in the docstring, and that would be enough. OK I'll go with just the docstring for now. If users ask for it, we can add a more descriptive function later. > > Maybe we can just deprecate Bio.Medline (i.e. the > plain text parser) in favour of Bio.Entrez (and its XML parser)? Usually I am in favor of deprecating modules if their usefulness is not clear. 
In this case, however, Medline is a major database, the Medline record format is readily available from NCBI, it is human readable (more or less) and computer readable, the resulting Bio.Medline.Record may be easier to deal with than the record created from XML by Bio.Entrez, and the parser is straightforward but not entirely trivial. Being able to parse such a file is something I'd expect from Biopython. --Michiel From mjldehoon at yahoo.com Sat Aug 9 03:39:00 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sat, 9 Aug 2008 00:39:00 -0700 (PDT) Subject: [BioPython] Bio.Medline parser In-Reply-To: <320fb6e00808081032t1bf11e6fv32ed32ca4669cae5@mail.gmail.com> Message-ID: <576584.32208.qm@web62405.mail.re1.yahoo.com> Done -- see CVS. --Michiel. --- On Fri, 8/8/08, Peter wrote: > From: Peter > Subject: Re: [BioPython] Bio.Medline parser > To: mjldehoon at yahoo.com > Cc: biopython at biopython.org > Date: Friday, August 8, 2008, 1:32 PM > On Fri, Aug 8, 2008 at 3:14 PM, Michiel de Hoon wrote: > > > >> Maybe it would make sense just to describe > >> the keys in the docstring, and that would be > enough. > > > > OK I'll go with just the docstring for now. If > users ask for it, we can add a more > > descriptive function later. > > > >> Maybe we can just deprecate Bio.Medline (i.e. the > >> plain text parser) in favour of Bio.Entrez (and > its XML parser)? > > > > Usually I am in favor of deprecating modules if their > usefulness is not clear. In > > this case, however, Medline is a major database, the > Medline record format is > > readily available from NCBI, it is human readable > (more or less) and computer > > readable, the resulting Bio.Medline.Record may be > easier to deal with than the > > record created from XML by Bio.Entrez, and the parser > is straightforward but > > not entirely trivial. Being able to parse such a file > is something I'd expect from > > Biopython. > > That sounds sensible. Maybe we should have an example in > the Tutorial > of using Bio.Entrez to download some data in the plain text > MedLine > format, and parsing it with Bio.MedLine? And perhaps also > an > equivalent using the XML Medline format parsed using > Bio.Entrez? > > Peter From ochipepe at gmail.com Sun Aug 10 04:12:56 2008 From: ochipepe at gmail.com (Alexandre Santos) Date: Sun, 10 Aug 2008 10:12:56 +0200 Subject: [BioPython] (bio)python for vector cloning Message-ID: Hello, I'm currently evaluating the suitability of python and biopython for the planning of my molecular biology chores. In particular, I would like to use for instance the ipython shell to pick up my vectors of interest, the list of restriction enzymes I have in the shelf, and design cloning strategies, plot annotated vector maps, etc. My question is on whether anybody has experience doing this with python related tools? I had a look at some biopython documentation and tutorials (http://www.pasteur.fr/recherche/unites/sis/formation/python/apa.html#sol_digest), and it seems feasible, but if possible I would like some experience-based feedback. Cheers, Alex Santos From mjldehoon at yahoo.com Sun Aug 10 04:13:31 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sun, 10 Aug 2008 01:13:31 -0700 (PDT) Subject: [BioPython] Bio.Emboss Message-ID: <646424.85840.qm@web62402.mail.re1.yahoo.com> Hi everybody, I am looking at the remaining Biopython modules that use Martel for parsing. Most of these have already been deprecated or replaced by alternatives. 
Without Martel, we can drop the dependency on mxTextTools, and make the Biopython installation a bit easier. One of the remaining Martel-dependent parsers are in Bio.Emboss. There are two parsers, one for PrimerSearch and one for Primer3. Currently, both of these reside in Bio.Emboss.Primer; these are the classes in Bio.Emboss.Primer: class PrimerSearchInputRecord: class PrimerSearchParser: class PrimerSearchOutputRecord: class PrimerSearchAmplifier: class _PrimerSearchRecordConsumer(AbstractConsumer): class _PrimerSearchScanner: class Primer3Record: class Primer3Primers: class Primer3Parser: class _Primer3RecordConsumer(AbstractConsumer): class _Primer3Scanner: I'd like to split Bio.Emboss.Primer into a Bio.Emboss.PrimerSearch and a Bio.Emboss.Primer3 module, with an InputRecord and OutputRecord class in Bio.Emboss.PrimerSearch, a Record class in Bio.Emboss.Primer3, and a read() function in each. This function would then do the parsing, without using Martel. Any objections? --Michiel From biopython at maubp.freeserve.co.uk Sun Aug 10 09:04:20 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 10 Aug 2008 14:04:20 +0100 Subject: [BioPython] (bio)python for vector cloning In-Reply-To: References: Message-ID: <320fb6e00808100604n696efc29r8b04239cdbb7296@mail.gmail.com> On Sun, Aug 10, 2008, Alexandre Santos wrote: > Hello, > > I'm currently evaluating the suitability of python and biopython for > the planning of my molecular biology chores. > > In particular, I would like to use for instance the ipython shell to > pick up my vectors of interest, the list of restriction enzymes I have > in the shelf, and design cloning strategies, plot annotated vector > maps, etc. What format will you have your raw vector sequences in? Maybe FASTA? Biopython's Bio.Restriction module (contributed by Frederic Sohm) may be helpful. It is documented here (separate from the main tutorial at the moment), http://biopython.org/DIST/docs/cookbook/Restriction.html What exactly do you mean by plot annotated vector maps? There are some basic graphics capabilities in Biopython which use ReportLab. Depending on what you want to do, GenomeDiagram might be helpful too. http://bioinf.scri.ac.uk/lp/programs.php#genomediagram > My question is on whether anybody has experience doing this with > python related tools? I had a look at some biopython documentation and > tutorials (http://www.pasteur.fr/recherche/unites/sis/formation/python/apa.html#sol_digest), > and it seems feasible, but if possible I would like some > experience-based feedback. I have no personal experience of doing this kind of worth with Biopython, but it should be feasible. If you try this, and have suggestions for the Biopython documentation (or code) that would be great. Also please be aware that some bits of the Pasteur Biopython tutorial are out of date - I did try and get in touch with the authors about this via the help at pasteur.fr email address listed on the main page. Maybe I should try and contact the authors directly... http://www.pasteur.fr/recherche/unites/sis/formation/python/ Peter From biopython at maubp.freeserve.co.uk Tue Aug 12 08:10:47 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 12 Aug 2008 13:10:47 +0100 Subject: [BioPython] (bio)python for vector cloning In-Reply-To: References: <320fb6e00808100604n696efc29r8b04239cdbb7296@mail.gmail.com> Message-ID: <320fb6e00808120510s6d5d3725v8ffa2643a55f47ee@mail.gmail.com> Hi Alex, I hope you don't mind me CC'ing this back onto the mailing list. 
On Tue, Aug 12, 2008 at 10:56 AM, Alexandre Santos wrote: >> What format will you have your raw vector sequences in? Maybe FASTA? > Both FASTA and GenBank formats. I think this should not be a problem Good. >> Biopython's Bio.Restriction module (contributed by Frederic Sohm) may >> be helpful. It is documented here (separate from the main tutorial at >> the moment), http://biopython.org/DIST/docs/cookbook/Restriction.html > > I checked the documentation and it's exactly what I need! Good. >> What exactly do you mean by plot annotated vector maps? There are >> some basic graphics capabilities in Biopython which use ReportLab. >> Depending on what you want to do, GenomeDiagram might be helpful too. >> http://bioinf.scri.ac.uk/lp/programs.php#genomediagram > > I mean the typical vector graphic representation that gives you an > idea of the vector sequence structure (see for instance > http://www.addgene.org/pgvec1?f=d&vectorid=345&cmd=genvecmap&dim=800&format=html&mtime=1187931178). > I would use it for personal documentation, but also when I send the > plasmids to other people. > > It seems GenomeDiagram could be used for that job, but not without > some heavy customization... It would be nice to have something already > usable for this purpose. I have used GenomeDiagram for plasmid figures, for example showing the location of microarray probe target sequences. However, right now it does lack support for "arrowed features" on the circles, and the fancy labeling in that example. So I would agree, recreating that figure using Biopython and GenomeDiagram would need plenty of additional work. However, a simplified version would be fairly easy I think. >> Also please be aware that some bits of the Pasteur Biopython tutorial >> are out of date > > Thanks for the warning, I will mind it when I try biopython. > > Thanks for the help! > > Alex Sure, Peter From rik at cogsci.ucsd.edu Tue Aug 12 17:24:44 2008 From: rik at cogsci.ucsd.edu (richard k belew) Date: Tue, 12 Aug 2008 14:24:44 -0700 Subject: [BioPython] Bio.EUtils, MultiDict: getting all the authors? Message-ID: <48A1FF9C.4060206@cogsci.ucsd.edu> i am sure this has to have been addressed in a universe long ago and far away, but... i'm trying to use Bio.EUtils to access NCBI/Entrez, but seem unable to use its MultDict utilities as they are intended. i include a sample run below. i can contact NCBI and get the data just fine. and the AuthorList = is accessible. the summary() whines about finding "multiple Items named 'Author'!" is there some (recursive?) idiom that is typically used? i can explicitly make the hack for extracting from the AuthorList, but want to do something similar for any other OrderedMultiDicts, and would like it all to stay as close to the DTDs as possible! thanks for your help. rik > Python 2.4.4 (#1, Oct 18 2006, 10:34:39) > [GCC 4.0.1 (Apple Computer, Inc. build 5341)] on darwin > Type "help", "copyright", "credits" or "license" for more information. >>>> from Bio import EUtils >>>> from Bio.EUtils import DBIdsClient >>>> PMID = "17447753" >>>> idList = EUtils.DBIds("pubmed", PMID) >>>> result = DBIdsClient.from_dbids(idList) >>>> summary = result[0].summary() > Found multiple Items named 'Author'! > Found multiple Items named 'Author'! > Found multiple Items named 'Author'! 
>>>> data = summary.dataitems >>>> data > ), ('LastAuthor', u'Belew RK'), ('Title', u'Analysis of HIV wild-type and mutant structures via in silico docking against diverse ligand libraries.'), ('Volume', u'47'), ('Issue', u'3'), ('Pages', u'1258-62'), ('LangList', ), ('NlmUniqueID', u'101230060'), ('ISSN', u'1549-9596'), ('ESSN', u'1549-960X'), ('PubTypeList', ), ('RecordStatus', u'PubMed - indexed for MEDLINE'), ('PubStatus', u'ppublish+epublish'), ('ArticleIds', ), ('DOI', u'10.1021/ci700044s'), ('History', ), ('References', ), ('HasAbstract', 1), ('PmcRefCount', 0), ('FullJournalName', u'Journal of chemical information and modeling'), ('ELocationID', u''), ('SO', u'2007 May-Jun;47(3):1258-62')]> >>>> for k in keys: print k,data[k] > ... > DOI 10.1021/ci700044s > Title Analysis of HIV wild-type and mutant structures via in silico docking against diverse ligand libraries. > Source J Chem Inf Model > PmcRefCount 0 > Issue 3 > SO 2007 May-Jun;47(3):1258-62 > ISSN 1549-9596 > Volume 47 > FullJournalName Journal of chemical information and modeling > RecordStatus PubMed - indexed for MEDLINE > ESSN 1549-960X > ELocationID > Pages 1258-62 > PubStatus ppublish+epublish > AuthorList {'Author': u'Belew RK'} > EPubDate 2007/04/21 > PubDate 2007/05/01 > NlmUniqueID 101230060 > LastAuthor Belew RK > ArticleIds {'doi': u'10.1021/ci700044s', 'pubmed': u'17447753'} > HasAbstract 1 > History {'medline': Date(2007, 9, 6), 'pubmed': Date(2007, 4, 24), 'aheadofprint': Date(2007, 4, 21)} > LangList {'Lang': u'English'} > References {} > PubTypeList {'PubType': u'Journal Article'} >>>> alist = data.get('AuthorList') >>>> alist > >>>> for k,v in data.allitems(): print k,v > ... > PubDate 2007/05/01 > EPubDate 2007/04/21 > Source J Chem Inf Model > AuthorList {'Author': u'Belew RK'} > LastAuthor Belew RK > Title Analysis of HIV wild-type and mutant structures via in silico docking against diverse ligand libraries. > Volume 47 > Issue 3 > Pages 1258-62 > LangList {'Lang': u'English'} > NlmUniqueID 101230060 > ISSN 1549-9596 > ESSN 1549-960X > PubTypeList {'PubType': u'Journal Article'} > RecordStatus PubMed - indexed for MEDLINE > PubStatus ppublish+epublish > ArticleIds {'doi': u'10.1021/ci700044s', 'pubmed': u'17447753'} > DOI 10.1021/ci700044s > History {'medline': Date(2007, 9, 6), 'pubmed': Date(2007, 4, 24), 'aheadofprint': Date(2007, 4, 21)} > References {} > HasAbstract 1 > PmcRefCount 0 > FullJournalName Journal of chemical information and modeling > ELocationID > SO 2007 May-Jun;47(3):1258-62 From biopython at maubp.freeserve.co.uk Tue Aug 12 17:49:43 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 12 Aug 2008 22:49:43 +0100 Subject: [BioPython] Bio.EUtils, MultiDict: getting all the authors? In-Reply-To: <48A1FF9C.4060206@cogsci.ucsd.edu> References: <48A1FF9C.4060206@cogsci.ucsd.edu> Message-ID: <320fb6e00808121449j388e4cael490c7fa7ebce6d8e@mail.gmail.com> On Tue, Aug 12, 2008 at 10:24 PM, richard k belew wrote: > i am sure this has to have been addressed in a universe long ago and far > away, but... > > i'm trying to use Bio.EUtils to access NCBI/Entrez, but seem unable to > use its MultDict utilities as they are intended. > > i include a sample run below. > i can contact NCBI and get the data just fine. and the AuthorList = > is accessible. the summary() whines about > finding "multiple Items named 'Author'!" > > is there some (recursive?) idiom that is typically used? 
i can > explicitly make the hack for extracting from the AuthorList, but > want to do something similar for any other OrderedMultiDicts, and > would like it all to stay as close to the DTDs as possible! > > thanks for your help. Hi Rik, I don't know enough about Bio.EUtils to be able to help. This module is currently without an maintainer, and its deprecation has been suggested in favour of the much simpler Bio.Entrez module (which is covered pretty thoroughly in the documentation). I would suggest you try Bio.Entrez.efetch() to get the data as XML, and the Bio.Entrez.read() function to parse the XML. You'll get a nested structure of python dictionaries and lists. See "Chapter 7" of the Tutorial, http://www.biopython.org/DIST/docs/tutorial/Tutorial.html Was there anything particular piece of information you wanted to extract? Peter From biopython at maubp.freeserve.co.uk Tue Aug 12 18:19:52 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 12 Aug 2008 23:19:52 +0100 Subject: [BioPython] Bio.EUtils, MultiDict: getting all the authors? In-Reply-To: <320fb6e00808121449j388e4cael490c7fa7ebce6d8e@mail.gmail.com> References: <48A1FF9C.4060206@cogsci.ucsd.edu> <320fb6e00808121449j388e4cael490c7fa7ebce6d8e@mail.gmail.com> Message-ID: <320fb6e00808121519r3a2b2d53p8039abbc70a28369@mail.gmail.com> > I would suggest you try Bio.Entrez.efetch() to get the data as XML, > and the Bio.Entrez.read() function to parse the XML. You'll get a > nested structure of python dictionaries and lists. See "Chapter 7" of > the Tutorial, > http://www.biopython.org/DIST/docs/tutorial/Tutorial.html > > Was there anything particular piece of information you wanted to extract? Assuming it was just the author list, something like this might suit you: from Bio import Entrez PMIDs = "17447753,17447754" handle = Entrez.efetch(db="pubmed", id=PMIDs, retmode="XML") records = Entrez.read(handle) for record in records : print record['MedlineCitation']['Article']['ArticleTitle'] for author_dict in record['MedlineCitation']['Article']['AuthorList'] : print " - %(ForeName)s %(LastName)s" % author_dict handle.close() And the output, Analysis of HIV wild-type and mutant structures via in silico docking against diverse ligand libraries. - Max W Chang - William Lindstrom - Arthur J Olson - Richard K Belew Synthesis and spectroscopic characterization of copper(II)-nitrito complexes with hydrotris(pyrazolyl)borate and related coligands. - Nicolai Lehnert - Ursula Cornelissen - Frank Neese - Tetsuya Ono - Yuki Noguchi - Ken-Ichi Okamoto - Kiyoshi Fujisawa And done. The author's initial are also included in the dictionary (but not printed). If you are familar with the XML DTD, working out where the data you want is much easier! As you desired, the Bio.Entrez parser does stay close to the DTDs - both a blessing and a curse. Peter From biopython at maubp.freeserve.co.uk Tue Aug 12 18:22:53 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 12 Aug 2008 23:22:53 +0100 Subject: [BioPython] Bio.EUtils, MultiDict: getting all the authors? In-Reply-To: <48A20AE8.40900@cogsci.ucsd.edu> References: <48A1FF9C.4060206@cogsci.ucsd.edu> <320fb6e00808121449j388e4cael490c7fa7ebce6d8e@mail.gmail.com> <48A20AE8.40900@cogsci.ucsd.edu> Message-ID: <320fb6e00808121522yd221531pbb75ea484f14b2ea@mail.gmail.com> On Tue, Aug 12, 2008 at 11:12 PM, richard k belew wrote: > thanks Peter! depriction of EUtils seems right > (i was following stale pointers i guess). i'll > tryout Bio.Entrez. > > rik Can I ask you why you ended up at Bio.EUtils? 
If its documentation on 3rd party sites, there's not so much we can do about it. But if there is anything misleading in the tutorial or on the Biopython website we can fix that. Peter From lpritc at scri.ac.uk Wed Aug 13 05:17:49 2008 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Wed, 13 Aug 2008 10:17:49 +0100 Subject: [BioPython] (bio)python for vector cloning In-Reply-To: <320fb6e00808120510s6d5d3725v8ffa2643a55f47ee@mail.gmail.com> Message-ID: Hi, On 12/08/2008 13:10, "Peter" wrote: >>> What exactly do you mean by plot annotated vector maps? There are >>> some basic graphics capabilities in Biopython which use ReportLab. >>> Depending on what you want to do, GenomeDiagram might be helpful too. >>> http://bioinf.scri.ac.uk/lp/programs.php#genomediagram >> >> I mean the typical vector graphic representation that gives you an >> idea of the vector sequence structure (see for instance >> http://www.addgene.org/pgvec1?f=d&vectorid=345&cmd=genvecmap&dim=800&format=h >> tml&mtime=1187931178). >> I would use it for personal documentation, but also when I send the >> plasmids to other people. >> >> It seems GenomeDiagram could be used for that job, but not without >> some heavy customization... It would be nice to have something already >> usable for this purpose. > > I have used GenomeDiagram for plasmid figures, for example showing the > location of microarray probe target sequences. However, right now it > does lack support for "arrowed features" on the circles, and the fancy > labeling in that example. So I would agree, recreating that figure > using Biopython and GenomeDiagram would need plenty of additional > work. However, a simplified version would be fairly easy I think. Peter's correct: currently GenomeDiagram only has support for drawing arrow features in linear diagrams, and the labelling in the diagram that you link to is not achievable by the GenomeDiagram API. Features can be labelled individually, just not in the style shown. GenomeDiagram was designed for the presentation of hundreds of genomes, rather than single plasmids, so this use is a little out of its original scope ;) There is a package called Plasmidomics, written in Python, that I've never used, but which is designed for this kind of task: http://www.bioprocess.org/plasmid/ It might be what you need and, if the source code is available, you might be able to work it into your own code. Cheers, L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. 
Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). From biopython at maubp.freeserve.co.uk Wed Aug 13 17:26:52 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 13 Aug 2008 22:26:52 +0100 Subject: [BioPython] Deprecating Bio.EUtils in favour of Bio.Entrez? In-Reply-To: <320fb6e00808130200v6c69922au99b4623b67a2eb88@mail.gmail.com> References: <320fb6e00808130200v6c69922au99b4623b67a2eb88@mail.gmail.com> Message-ID: <320fb6e00808131426m6bb72b8fh6399734a8359d8a5@mail.gmail.com> Hello to all the NCBI fans... As you may know, the NCBI Entrez database has some "Entrez Programming Utilities" also known as EUtils, http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html As of Biopython 1.45, the module Bio.Entrez (originally Bio.WWW.NCBI) supported all the EUtils functions, and as of Biopython 1.46 could parse the NCBI XML output too. This code is well documented with a whole Bio.Entrez chapter in the tutorial, which means we are now in a position to retire the older unmaintained Bio.EUtils module. We'd like to propose the deprecation of Bio.EUtils in the next release of Biopython, in favour of Bio.Entrez. If anyone is currently using Bio.EUtils, then we'd like to hear from you. It should be possible to offer advice on migrating the code to Bio.Entrez, or we can reconsider deprecating Bio.EUtils if there is some major functionality that would be lost, or users that would be inconvenienced by its premature retirement. Thank you, Peter From biopython at maubp.freeserve.co.uk Sun Aug 17 09:05:52 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 17 Aug 2008 14:05:52 +0100 Subject: [BioPython] Bio.EUtils deprecated in favour of Bio.Entrez Message-ID: <320fb6e00808170605t5dd7b787i2dfcc4f5f4dde6ed@mail.gmail.com> Dear all, I've just deprecated Bio.EUtils in CVS, leaving Bio.Entrez as Biopython's preferred interface to the NCBI "Entrez Programming Utilities" also known as EUtils. http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html If anyone is currently using Bio.EUtils, then we'd like to hear from you. It should be possible to offer advice on migrating your code to Bio.Entrez. Also, up until the next Biopython release, we can still reconsider deprecating Bio.EUtils if there is some major functionality that would be lost, or users that would be inconvenienced by its premature retirement. Thank you, Peter From biopython at maubp.freeserve.co.uk Tue Aug 19 06:45:31 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 19 Aug 2008 11:45:31 +0100 Subject: [BioPython] Deprecating Bio.NBRF In-Reply-To: <320fb6e00808080451i2e53de52k551dfd455bcaf3e3@mail.gmail.com> References: <320fb6e00808080451i2e53de52k551dfd455bcaf3e3@mail.gmail.com> Message-ID: <320fb6e00808190345i7d1e2a47l98e4426f5ab87b68@mail.gmail.com> On Fri, Aug 8, 2008 at 12:51 PM, Peter wrote: > Dear all, > > Is anyone using Bio.NBRF for reading NBRF/PIR files? > > Good news: I've just added support for reading NBRF/PIR files as > SeqRecord objects to Bio.SeqIO, under the format name "pir" as used in > EMBOSS and BioPerl. 
See enhancement Bug 2535, > http://bugzilla.open-bio.org/show_bug.cgi?id=2535 > > Bad news: I would now like to deprecate the old Bio.NBRF module which > was an NBRF/PIR parser which generated its own record objects (not > SeqRecord objects). The main reason to drop this module is it relies > on some of Biopython's older parsing infrastructure which depends on > mxTextTools (and doesn't entirely work with mxTextTools 3.0). > > So, if anyone if using Bio.NBRF, please get in touch. > I have now deprecated Bio.NBRF in CVS. Its not to late to revert this change if anyone missed the last email warning people about this plan. (Like many other deprecations, even after we remove a module from the distribution, the old code is still there in CVS, and can be resurrected if someone really wanted it) Peter From agarbino at gmail.com Thu Aug 21 01:44:27 2008 From: agarbino at gmail.com (Alex Garbino) Date: Thu, 21 Aug 2008 00:44:27 -0500 Subject: [BioPython] Parsing BLAST for ClustalW Message-ID: <4cf37ad00808202244h6714929at5c85222ab0de406e@mail.gmail.com> Hello, I'm a new python and biophython user. I'm trying to pull a BLAST result, parse it into a csv with the following fields: protein name, organism, common name, protein length, and FASTA sequence The goal is to then feed the fasta sequences into ClustalW (to do a phylogeny tree, look for conserved regions, etc). I've managed to do the blast search, and parse the results into xml from python. However, I'm not sure how to grab the above information and put it together, so that I can save a csv and push it into clustalw. Could someone help? Thanks! Alex From allank at sanbi.ac.za Thu Aug 21 04:29:24 2008 From: allank at sanbi.ac.za (Allan Kamau) Date: Thu, 21 Aug 2008 10:29:24 +0200 Subject: [BioPython] Parsing BLAST for ClustalW In-Reply-To: <4cf37ad00808202244h6714929at5c85222ab0de406e@mail.gmail.com> References: <4cf37ad00808202244h6714929at5c85222ab0de406e@mail.gmail.com> Message-ID: <48AD2764.9010400@sanbi.ac.za> Hi Alex, I haven't yet used BioPython (therefore my suggestion may be quite wrong). To generate CSV from XML may require use of general XML SAX parser solution (unless BioPython has a package to output CSV from XML of that particular structure). I prefer to use SAX (in many cases) as opposed to other more memory resident XML parsing solutions (DOM etc) due to memory issues especially if your XML is large. Have a look at "http://www.devarticles.com/c/a/XML/Parsing-XML-with-SAX-and-Python/" Allan. Alex Garbino wrote: > Hello, > > I'm a new python and biophython user. > I'm trying to pull a BLAST result, parse it into a csv with the > following fields: > protein name, organism, common name, protein length, and FASTA sequence > The goal is to then feed the fasta sequences into ClustalW (to do a > phylogeny tree, look for conserved regions, etc). > > I've managed to do the blast search, and parse the results into xml > from python. However, I'm not sure how to grab the above information > and put it together, so that I can save a csv and push it into > clustalw. > > Could someone help? > > Thanks! 
> Alex > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From mjldehoon at yahoo.com Thu Aug 21 05:14:20 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 21 Aug 2008 02:14:20 -0700 (PDT) Subject: [BioPython] Parsing BLAST for ClustalW In-Reply-To: <4cf37ad00808202244h6714929at5c85222ab0de406e@mail.gmail.com> Message-ID: <862312.62767.qm@web62403.mail.re1.yahoo.com> Dear Alex, Did you look at section 6.4 in the Biopython tutorial? --Michiel. --- On Thu, 8/21/08, Alex Garbino wrote: > From: Alex Garbino > Subject: [BioPython] Parsing BLAST for ClustalW > To: biopython at lists.open-bio.org > Date: Thursday, August 21, 2008, 1:44 AM > Hello, > > I'm a new python and biophython user. > I'm trying to pull a BLAST result, parse it into a csv > with the > following fields: > protein name, organism, common name, protein length, and > FASTA sequence > The goal is to then feed the fasta sequences into ClustalW > (to do a > phylogeny tree, look for conserved regions, etc). > > I've managed to do the blast search, and parse the > results into xml > from python. However, I'm not sure how to grab the > above information > and put it together, so that I can save a csv and push it > into > clustalw. > > Could someone help? > > Thanks! > Alex > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From biopython at maubp.freeserve.co.uk Thu Aug 21 08:15:15 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 21 Aug 2008 13:15:15 +0100 Subject: [BioPython] Parsing BLAST for ClustalW In-Reply-To: <4cf37ad00808202244h6714929at5c85222ab0de406e@mail.gmail.com> References: <4cf37ad00808202244h6714929at5c85222ab0de406e@mail.gmail.com> Message-ID: <320fb6e00808210515j737f3374t457d09350ae2a674@mail.gmail.com> On Thu, Aug 21, 2008 at 6:44 AM, Alex Garbino wrote: > Hello, > > I'm a new python and biophython user. > I'm trying to pull a BLAST result, parse it into a csv with the > following fields: > protein name, organism, common name, protein length, and FASTA sequence > The goal is to then feed the fasta sequences into ClustalW (to do a > phylogeny tree, look for conserved regions, etc). > > I've managed to do the blast search, and parse the results into xml > from python. However, I'm not sure how to grab the above information > and put it together, so that I can save a csv and push it into > clustalw. > > Could someone help? Hi Alex, You said you are a Python and Biopython beginner - are you already familiar with BLAST and ClustalW? It sounds like you have a query sequence, and want to extract matching target sequences from a database using BLAST, and then build a multiple sequence alignment from them. If you just want the matching region of these other genes, then you can work from the BLAST output (just take the aligned sequence and remove the gaps). However, if you want the full gene sequences these are not in the BLAST output. You would have to take the target match ID, and look it up in the original database. As Michiel suggested, have a look over the BLAST chapter in the Biopython tutorial. http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf Writing a CVS file in python is simple enough, e.g. handle = open("example.txt","w") #some loop over the blast file to extract the fields... 
handle.write("%s, %s, %s, %i, %s\n" % (protein_name, organism, common_name, protein_length, sequence_string))
handle.close()

However, for input to ClustalW to build a tree you don't want a CSV
file, but a FASTA file containing the sequences without gaps. You
could write these out yourself, e.g.

handle = open("example.faa","w")
#some loop over the blast file to extract the fields...
handle.write(">%s\n%s\n" % (protein_name, sequence_string))
handle.close()

Peter

From bsouthey at gmail.com Thu Aug 21 09:27:34 2008
From: bsouthey at gmail.com (Bruce Southey)
Date: Thu, 21 Aug 2008 08:27:34 -0500
Subject: [BioPython] Parsing BLAST for ClustalW
In-Reply-To: <48AD2764.9010400@sanbi.ac.za>
References: <4cf37ad00808202244h6714929at5c85222ab0de406e@mail.gmail.com>
 <48AD2764.9010400@sanbi.ac.za>
Message-ID: <48AD6D46.1060601@gmail.com>

Allan Kamau wrote:
> Hi Alex,
> I haven't yet used BioPython (therefore my suggestion may be quite
> wrong).
> To generate CSV from XML may require use of general XML SAX parser
> solution (unless BioPython has a package to output CSV from XML of
> that particular structure).
> I prefer to use SAX (in many cases) as opposed to other more memory
> resident XML parsing solutions (DOM etc) due to memory issues
> especially if your XML is large.
> Have a look at
> "http://www.devarticles.com/c/a/XML/Parsing-XML-with-SAX-and-Python/"
>
Note that the Elementtree XML parser is now standard in Python 2.5, see
one of many examples:
http://www.learningpython.com/2008/05/07/elegant-xml-parsing-using-the-elementtree-module/

As I understand things, elementtree does not fit into BioPython since it
was not standard for earlier versions of Python supported by BioPython.
This may change once BioPython supports Python 3K.

Bruce

From sbassi at gmail.com Sun Aug 24 18:18:09 2008
From: sbassi at gmail.com (Sebastian Bassi)
Date: Sun, 24 Aug 2008 19:18:09 -0300
Subject: [BioPython] Problem with Entrez?
Message-ID: 

>>> handle = Entrez.efetch(db="nucleotide", id="326625")
>>> record = Entrez.read(handle)
Traceback (most recent call last):
  File "", line 1, in
    record = Entrez.read(handle)
  File "/mnt/hda2/py252/lib/python2.5/site-packages/Bio/Entrez/__init__.py", line 283, in read
    record = handler.run(handle)
  File "/mnt/hda2/py252/lib/python2.5/site-packages/Bio/Entrez/Parser.py", line 95, in run
    self.parser.ParseFile(handle)
ExpatError: syntax error: line 1, column 0

So I efetch it again just to show the format of handle:

>>> handle = Entrez.efetch(db="nucleotide", id="326625")
>>> print handle.read()[:200]
Seq-entry ::= seq { id { genbank { name "HIVED82FO" , accession "M77599" , version 1 } , gi 326625 } , descr { title "Human immunodeficiency virus type 1 gp120 (env)

Looks like ASN1 format, but according to the tutorial efetch should
return its output in XML format:
"By default you get the output in XML format, which you can parse
using the Bio.Entrez.read() function"

As a workaround I specify the format with rettype='gb'

From sbassi at gmail.com Sun Aug 24 18:46:10 2008
From: sbassi at gmail.com (Sebastian Bassi)
Date: Sun, 24 Aug 2008 19:46:10 -0300
Subject: [BioPython] Problem with Entrez?
In-Reply-To: 
References: 
Message-ID: 

On Sun, Aug 24, 2008 at 7:18 PM, Sebastian Bassi wrote:
> As a workaround I specify the format with rettype='gb'

Sorry, the workaround is to set retmode to xml:
handle = Entrez.efetch(db="nucleotide", id="326625", retmode='xml')
But I thought that this should be default behavior.
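For readers following this thread, the retmode workaround combined
with Bio.Entrez.read() looks roughly like the sketch below; the GI
number is the one from the example above, and the GBSeq_definition key
assumes the GBSeq flavour of XML which the NCBI returned for this kind
of query at the time:

from Bio import Entrez
handle = Entrez.efetch(db="nucleotide", id="326625", retmode="xml")
records = Entrez.read(handle)   # a list with one entry for this single GI
print records[0]["GBSeq_definition"]
handle.close()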
From sbassi at gmail.com Sun Aug 24 19:03:32 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Sun, 24 Aug 2008 20:03:32 -0300 Subject: [BioPython] Problem with Entrez? In-Reply-To: References: Message-ID: On Sun, Aug 24, 2008 at 8:00 PM, Sebastian Bassi wrote: > My proposed solution is: Sorry!!! Now I think that the way to force an option to be default is to declare a default value in function definition: def efetch(db, cgi=None, retmode='xml', **keywds): instead of: def efetch(db, cgi=None, **keywds): From sbassi at gmail.com Sun Aug 24 19:00:08 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Sun, 24 Aug 2008 20:00:08 -0300 Subject: [BioPython] Problem with Entrez? In-Reply-To: References: Message-ID: On Sun, Aug 24, 2008 at 7:46 PM, Sebastian Bassi wrote: >> As a workaround I specify the format with rettype='gb' > Sorry, the workaround is to set retmode to xml: > handle = Entrez.efetch(db="nucleotide", id="326625", retmode='xml') > But I thought that this should be default behivor. My proposed solution is: Change line variables = {'db' : db} To: variables = {'db' : db , 'retmode' : 'xml'} In Bio/Entrez/__init__.py Doing this, it work as expected, but I don't know if this breaks something else. From srbanator at heckler-koch.cz Wed Aug 27 04:41:18 2008 From: srbanator at heckler-koch.cz (Pavel SRB) Date: Wed, 27 Aug 2008 10:41:18 +0200 Subject: [BioPython] versions and doc's Message-ID: <48B5132E.6080601@heckler-koch.cz> hi all, i am very new to biopython. I am working with debian etch stable version. During reading tutorial i have found out, that the current documentation works with version 1.47-1 (in debian it is in unstable repository). Are there old tutorials related to 1.42-2 version (debian stable), or you believe i should rather start studying biopython in it's newest version? thank you for introduction pavel srb From biopython at maubp.freeserve.co.uk Wed Aug 27 05:12:17 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 27 Aug 2008 10:12:17 +0100 Subject: [BioPython] Problem with Entrez? In-Reply-To: References: Message-ID: <320fb6e00808270212y48f12111u16c2908337732ed3@mail.gmail.com> On Sun, Aug 24, 2008 at 11:18 PM, Sebastian Bassi wrote: > Looks like ASN1 format, but according to the tutorial efetch should > return its output in XML format: > "By default you get the output in XML format, which you can parse > using the Bio.Entrez.read() function " Its a documentation bug (my mistaken assumption), as the NCBI do not default to XML. The efetch doc string was fixed in CVS but I'll do the tutorial now... thanks for the report. > As a workaround I specify the format with rettype='gb' As you realized, efetch listens to the rettype and retmode arguments. Peter From biopython at maubp.freeserve.co.uk Wed Aug 27 05:19:02 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 27 Aug 2008 10:19:02 +0100 Subject: [BioPython] Problem with Entrez? 
In-Reply-To: <320fb6e00808270212y48f12111u16c2908337732ed3@mail.gmail.com> References: <320fb6e00808270212y48f12111u16c2908337732ed3@mail.gmail.com> Message-ID: <320fb6e00808270219k450d55d3g259c829de8593139@mail.gmail.com> On Wed, Aug 27, 2008 at 10:12 AM, Peter wrote: > On Sun, Aug 24, 2008 at 11:18 PM, Sebastian Bassi wrote: >> Looks like ASN1 format, but according to the tutorial efetch should >> return its output in XML format: >> "By default you get the output in XML format, which you can parse >> using the Bio.Entrez.read() function " > > Its a documentation bug (my mistaken assumption), as the NCBI do not > default to XML. The efetch doc string was fixed in CVS but I'll do the > tutorial now... thanks for the report. Already fixed in CVS, as of /biopython/Doc/Tutorial.tex revision 1.135 - but worth double checking. Peter From p.j.a.cock at googlemail.com Wed Aug 27 05:30:12 2008 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 27 Aug 2008 10:30:12 +0100 Subject: [BioPython] versions and doc's In-Reply-To: <48B5132E.6080601@heckler-koch.cz> References: <48B5132E.6080601@heckler-koch.cz> Message-ID: <320fb6e00808270230g34e97347h5d21d75154536614@mail.gmail.com> On Wed, Aug 27, 2008 at 9:41 AM, Pavel SRB wrote: > hi all, i am very new to biopython. I am working with debian etch stable > version. During reading tutorial i have found out, that the current > documentation works with version 1.47-1 (in debian it is in unstable > repository). Hi Pavel, Yes - the documentation on our website, in particular the tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf is for the current latest release of Biopython (i.e. Biopython 1.47 at this time) > Are there old tutorials related to 1.42-2 version (debian stable), or > you believe i should rather start studying biopython in it's newest version? Debian may have installed the Biopython 1.42 version of the tutorial for you (but if it did, I don't know where to look). If you want the old tutorial you can either get the LaTeX source from from CVS and recompile it, or more simply download the Biopython 1.42 source code release, which contain both the HTML and PDF versions of the tutorial under the Doc folder: http://biopython.org/DIST/biopython-1.42.tar.gz http://biopython.org/DIST/biopython-1.42.zip However, I would strongly encourage you to install the current version of Biopython instead - there have been a lot of bug fixes plus the addition modules like Bio.SeqIO, Bio.AlignIO and Bio.Entrez. In addition, several of the modules present in Biopython 1.42 have since been moved or deprecated and some have even been removed. For debian (or ubuntu), I would suggest you install Biopython from source. First uninstall the debian package for Biopython 1.42, then you should be able to install the build dependencies automatically using: sudo apt-get build-dep python-biopython Then you should be ready to install Biopython 1.47 from source. See http://biopython.org/wiki/Download for more details. Peter From agarbino at gmail.com Wed Aug 27 13:12:58 2008 From: agarbino at gmail.com (Alex Garbino) Date: Wed, 27 Aug 2008 12:12:58 -0500 Subject: [BioPython] Parsing BLAST Message-ID: <4cf37ad00808271012k39f795c1q36515207b4dca421@mail.gmail.com> Hello, I'm following the tutorials to do BLAST queries, however, I can't get the Blast object to work. I've downloaded the blast search, saved it in XML, etc, as the tutorial does. 
However, when I get to the step where I'm trying to get actual data out, it fails (the for loops part). Here is a simplified version that illustrates the problem: for x in blast_record.alignments: print alignment.title Traceback (most recent call last): File "", line 2, in NameError: name 'alignment' is not defined ----------------------- blast_record contains lots of data, I just can't seem to be able to get anything out of it... what am I doing wrong? Thanks, Alex From sbassi at gmail.com Wed Aug 27 13:33:52 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Wed, 27 Aug 2008 14:33:52 -0300 Subject: [BioPython] Parsing BLAST In-Reply-To: <4cf37ad00808271012k39f795c1q36515207b4dca421@mail.gmail.com> References: <4cf37ad00808271012k39f795c1q36515207b4dca421@mail.gmail.com> Message-ID: On Wed, Aug 27, 2008 at 2:12 PM, Alex Garbino wrote: > for x in blast_record.alignments: > print alignment.title > > Traceback (most recent call last): > File "", line 2, in > NameError: name 'alignment' is not defined You should do: print x.title Instead of: print alignment.title From cg5x6 at yahoo.com Wed Aug 27 13:27:54 2008 From: cg5x6 at yahoo.com (C. G.) Date: Wed, 27 Aug 2008 10:27:54 -0700 (PDT) Subject: [BioPython] Parsing BLAST In-Reply-To: <4cf37ad00808271012k39f795c1q36515207b4dca421@mail.gmail.com> Message-ID: <384507.82825.qm@web65609.mail.ac4.yahoo.com> --- On Wed, 8/27/08, Alex Garbino wrote: > From: Alex Garbino > Subject: [BioPython] Parsing BLAST > To: biopython at lists.open-bio.org > Date: Wednesday, August 27, 2008, 11:12 AM > Hello, > > I'm following the tutorials to do BLAST queries, > however, I can't get > the Blast object to work. > > I've downloaded the blast search, saved it in XML, etc, > as the > tutorial does. However, when I get to the step where > I'm trying to get > actual data out, it fails (the for loops part). > Here is a simplified version that illustrates the problem: > > for x in blast_record.alignments: > print alignment.title > > Traceback (most recent call last): > File "", line 2, in > NameError: name 'alignment' is not defined > > ----------------------- It's a Python coding error. Try: print x.title Or change the name of your variable in the "for" loop to "alignment". From srbanator at heckler-koch.cz Wed Aug 27 16:48:24 2008 From: srbanator at heckler-koch.cz (Pavel SRB) Date: Wed, 27 Aug 2008 22:48:24 +0200 Subject: [BioPython] sql create tables script in tutorial Message-ID: <48B5BD98.8050101@heckler-koch.cz> hi all, as i am reading through "Basic BioSQL with Biopython" http://biopython.org/DIST/docs/biosql/python_biosql_basic.html after executing >>> db = server.new_database("cold") i have got an "XXX.biodatabase' doesn't exist" error. At "BioSQL" http://www.biopython.org/wiki/BioSQL i have found out sql create table batch script mysql -u root bioseqdb < biosqldb-mysql.sql Maybe it should also be in the first tutorial, maybe not. Just mentioning it. pavel srb From agarbino at gmail.com Wed Aug 27 17:01:53 2008 From: agarbino at gmail.com (Alex Garbino) Date: Wed, 27 Aug 2008 16:01:53 -0500 Subject: [BioPython] Parsing BLAST In-Reply-To: <384507.82825.qm@web65609.mail.ac4.yahoo.com> References: <4cf37ad00808271012k39f795c1q36515207b4dca421@mail.gmail.com> <384507.82825.qm@web65609.mail.ac4.yahoo.com> Message-ID: <4cf37ad00808271401u36316d7eq431552ea3088646e@mail.gmail.com> Thanks for the help; that was a lot of time wasted for something simple.... I do have an additional request: once I parse these out, I only get 50 entries. 
however, if I do the same search online, I get 138... what accounts for the difference? This is my code: from Bio import SeqIO from Bio.Blast import NCBIWWW from Bio.Blast import NCBIXML record = SeqIO.read(open("protein_fasta.txt"), format="fasta") result_handle = NCBIWWW.qblast("blastp", "nr", record.seq.tostring()) blast_records = NCBIXML.parse(result_handle) blast_record = blast_records.next() for x in blast_record.alignments: print x.title, x.accession, x.length acc_list = [] for x in blast_record.alignments: acc_list.append(x.accession) len(acc_list) tells me 50... Is there a default limit somewhere? Thanks! Alex On Wed, Aug 27, 2008 at 12:27 PM, C. G. wrote: > > > > --- On Wed, 8/27/08, Alex Garbino wrote: > >> From: Alex Garbino >> Subject: [BioPython] Parsing BLAST >> To: biopython at lists.open-bio.org >> Date: Wednesday, August 27, 2008, 11:12 AM >> Hello, >> >> I'm following the tutorials to do BLAST queries, >> however, I can't get >> the Blast object to work. >> >> I've downloaded the blast search, saved it in XML, etc, >> as the >> tutorial does. However, when I get to the step where >> I'm trying to get >> actual data out, it fails (the for loops part). >> Here is a simplified version that illustrates the problem: >> >> for x in blast_record.alignments: >> print alignment.title >> >> Traceback (most recent call last): >> File "", line 2, in >> NameError: name 'alignment' is not defined >> >> ----------------------- > > It's a Python coding error. Try: > > print x.title > > Or change the name of your variable in the "for" loop to "alignment". > > > > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From biopython at maubp.freeserve.co.uk Wed Aug 27 17:44:47 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 27 Aug 2008 22:44:47 +0100 Subject: [BioPython] Parsing BLAST In-Reply-To: <4cf37ad00808271401u36316d7eq431552ea3088646e@mail.gmail.com> References: <4cf37ad00808271012k39f795c1q36515207b4dca421@mail.gmail.com> <384507.82825.qm@web65609.mail.ac4.yahoo.com> <4cf37ad00808271401u36316d7eq431552ea3088646e@mail.gmail.com> Message-ID: <320fb6e00808271444h27217aa3hb547a2701420cf4@mail.gmail.com> > I do have an additional request: once I parse these out, I only get 50 > entries. however, if I do the same search online, I get 138... what > accounts for the difference? > > This is my code: > > from Bio import SeqIO > from Bio.Blast import NCBIWWW > from Bio.Blast import NCBIXML > > record = SeqIO.read(open("protein_fasta.txt"), format="fasta") > result_handle = NCBIWWW.qblast("blastp", "nr", record.seq.tostring()) > > blast_records = NCBIXML.parse(result_handle) > blast_record = blast_records.next() > > for x in blast_record.alignments: > print x.title, x.accession, x.length > > acc_list = [] > for x in blast_record.alignments: > acc_list.append(x.accession) > > len(acc_list) tells me 50... > > Is there a default limit somewhere? Yes there is. 
At the python prompt (or in IDLE), try: >>> from Bio.Blast import NCBIWWW >>> help(NCBIWWW.qblast) (You can try this trick on all python objects and functions - although not everything as any help text defined) I think you probably want to override hitlist_size=50, so try changing: result_handle = NCBIWWW.qblast("blastp", "nr", record.seq.tostring()) to: result_handle = NCBIWWW.qblast("blastp", "nr", record.seq.tostring(), hitlist_size=200) Peter From srbanator at heckler-koch.cz Wed Aug 27 17:47:52 2008 From: srbanator at heckler-koch.cz (Pavel SRB) Date: Wed, 27 Aug 2008 23:47:52 +0200 Subject: [BioPython] Parsing BLAST In-Reply-To: <4cf37ad00808271401u36316d7eq431552ea3088646e@mail.gmail.com> References: <4cf37ad00808271012k39f795c1q36515207b4dca421@mail.gmail.com> <384507.82825.qm@web65609.mail.ac4.yahoo.com> <4cf37ad00808271401u36316d7eq431552ea3088646e@mail.gmail.com> Message-ID: <48B5CB88.6040303@heckler-koch.cz> hi alex, when i am accessing data like handle = Entrez.esearch(db="nucleotide", rettype="fasta", retmax=100, email=my_email) i get 100 results, but without i get only 20. When looking into /usr/share/python-support/python-biopython/Bio/EUtils/ThinClient.py there is defaul value for retmax set to 20. hope it helps pavel srb Alex Garbino wrote: > Thanks for the help; that was a lot of time wasted for something simple.... > > I do have an additional request: once I parse these out, I only get 50 > entries. however, if I do the same search online, I get 138... what > accounts for the difference? > > This is my code: > > from Bio import SeqIO > from Bio.Blast import NCBIWWW > from Bio.Blast import NCBIXML > > record = SeqIO.read(open("protein_fasta.txt"), format="fasta") > result_handle = NCBIWWW.qblast("blastp", "nr", record.seq.tostring()) > > blast_records = NCBIXML.parse(result_handle) > blast_record = blast_records.next() > > for x in blast_record.alignments: > print x.title, x.accession, x.length > > acc_list = [] > for x in blast_record.alignments: > acc_list.append(x.accession) > > len(acc_list) tells me 50... > > Is there a default limit somewhere? > > Thanks! > Alex > > On Wed, Aug 27, 2008 at 12:27 PM, C. G. wrote: > >> >> --- On Wed, 8/27/08, Alex Garbino wrote: >> >> >>> From: Alex Garbino >>> Subject: [BioPython] Parsing BLAST >>> To: biopython at lists.open-bio.org >>> Date: Wednesday, August 27, 2008, 11:12 AM >>> Hello, >>> >>> I'm following the tutorials to do BLAST queries, >>> however, I can't get >>> the Blast object to work. >>> >>> I've downloaded the blast search, saved it in XML, etc, >>> as the >>> tutorial does. However, when I get to the step where >>> I'm trying to get >>> actual data out, it fails (the for loops part). >>> Here is a simplified version that illustrates the problem: >>> >>> for x in blast_record.alignments: >>> print alignment.title >>> >>> Traceback (most recent call last): >>> File "", line 2, in >>> NameError: name 'alignment' is not defined >>> >>> ----------------------- >>> >> It's a Python coding error. Try: >> >> print x.title >> >> Or change the name of your variable in the "for" loop to "alignment". 
>> >> >> >> >> _______________________________________________ >> BioPython mailing list - BioPython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> >> > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From biopython at maubp.freeserve.co.uk Wed Aug 27 17:54:42 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 27 Aug 2008 22:54:42 +0100 Subject: [BioPython] sql create tables script in tutorial In-Reply-To: <48B5BD98.8050101@heckler-koch.cz> References: <48B5BD98.8050101@heckler-koch.cz> Message-ID: <320fb6e00808271454j4a00c2b4j19d89c8c5ec9d4c@mail.gmail.com> On Wed, Aug 27, 2008 at 9:48 PM, Pavel SRB wrote: > hi all, as i am reading through "Basic BioSQL with Biopython" > http://biopython.org/DIST/docs/biosql/python_biosql_basic.html Those are old and haven't been updated. See below... > after executing > >>>> db = server.new_database("cold") > > i have got an "XXX.biodatabase' doesn't exist" error. You must have skipped over section "3.1 Prerequisites" which does say it assumes you have installed a database, a python binding to this database, and loaded the BioSQL schema into the database. > At "BioSQL" http://www.biopython.org/wiki/BioSQL > i have found out sql create table batch script > > mysql -u root bioseqdb < biosqldb-mysql.sql > > Maybe it should also be in the first tutorial, maybe not. Just mentioning > it. I'm glad http://www.biopython.org/wiki/BioSQL is proving useful at least. Due to an accident of history, the source for python_biosql_basic.html and python_biosql_basic.pdf currently lives in BioSQL's SVN repository rather than in Biopython's (which has led to them being more complicated to update). Do you think it is worth trying to fully update those documents, or just add a link at the top of those documents directing people to http://biopython.org/wiki/BioSQL instead? Peter From p.j.a.cock at googlemail.com Wed Aug 27 18:03:44 2008 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 27 Aug 2008 23:03:44 +0100 Subject: [BioPython] Bio.Entrez and retmax Message-ID: <320fb6e00808271503ydf662c8p8e4755e59bb423f7@mail.gmail.com> Hi Pavel, I've change the subject/title as this isn't about BLAST anymore. On Wed, Aug 27, 2008 at 10:47 PM, Pavel SRB wrote: > hi alex, when i am accessing data like > > handle = Entrez.esearch(db="nucleotide", rettype="fasta", retmax=100, > email=my_email) > > i get 100 results, but without i get only 20. If you look in /usr/share/python-support/python-biopython/Bio/Entrez/__init__.py or online at http://biopython.org/SRC/biopython/Bio/Entrez/__init__.py you'll see the function esearch doesn't set a default value. I guess that means the NCBI defaults to giving you only 20 unless you ask for more. > When looking into > /usr/share/python-support/python-biopython/Bio/EUtils/ThinClient.py > there is defaul value for retmax set to 20. Bio.EUtils is separate from Bio.Entrez, but they both give access to the NCBI Entrez Utilities. You should ignore Bio.EUtils as it will be deprecated in the next release of Biopython. 
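A minimal sketch of the retmax point above (the search term here is just an arbitrary example; the default of roughly 20 IDs is applied by the NCBI, not by Biopython):

from Bio import Entrez

# Without retmax the NCBI typically returns only about 20 IDs,
# so asking for more has to be done explicitly.
handle = Entrez.esearch(db="nucleotide", term="Cypripedioideae", retmax=100)
result = Entrez.read(handle)
print "Total matches:", result["Count"]
print "IDs returned: ", len(result["IdList"])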
Peter From srbanator at heckler-koch.cz Wed Aug 27 18:16:10 2008 From: srbanator at heckler-koch.cz (Pavel SRB) Date: Thu, 28 Aug 2008 00:16:10 +0200 Subject: [BioPython] sql create tables script in tutorial In-Reply-To: <48B5BD98.8050101@heckler-koch.cz> References: <48B5BD98.8050101@heckler-koch.cz> Message-ID: <48B5D22A.2060806@heckler-koch.cz> as i was reading tutorial cookbook and i have reached sql, by some chance i have started by reading http://biopython.org/DIST/docs/biosql/python_biosql_basic.html do not know why, when there is direct link to http://www.biopython.org/wiki/BioSQL just did :o) thanks for explanation, On Wed, Aug 27, 2008 at 9:48 PM, Pavel SRB wrote: > hi all, as i am reading through "Basic BioSQL with Biopython" > http://biopython.org/DIST/docs/biosql/python_biosql_basic.html Those are old and haven't been updated. See below... > after executing > >>>> db = server.new_database("cold") > > i have got an "XXX.biodatabase' doesn't exist" error. You must have skipped over section "3.1 Prerequisites" which does say it assumes you have installed a database, a python binding to this database, and loaded the BioSQL schema into the database. > At "BioSQL" http://www.biopython.org/wiki/BioSQL > i have found out sql create table batch script > > mysql -u root bioseqdb < biosqldb-mysql.sql > > Maybe it should also be in the first tutorial, maybe not. Just mentioning > it. I'm glad http://www.biopython.org/wiki/BioSQL is proving useful at least. Due to an accident of history, the source for python_biosql_basic.html and python_biosql_basic.pdf currently lives in BioSQL's SVN repository rather than in Biopython's (which has led to them being more complicated to update). Do you think it is worth trying to fully update those documents, or just add a link at the top of those documents directing people to http://biopython.org/wiki/BioSQL instead? Peter From biopython at maubp.freeserve.co.uk Wed Aug 27 18:27:21 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 27 Aug 2008 23:27:21 +0100 Subject: [BioPython] sql create tables script in tutorial In-Reply-To: <320fb6e00808271454j4a00c2b4j19d89c8c5ec9d4c@mail.gmail.com> References: <48B5BD98.8050101@heckler-koch.cz> <320fb6e00808271454j4a00c2b4j19d89c8c5ec9d4c@mail.gmail.com> Message-ID: <320fb6e00808271527haf574fcv9adcd4ab4f964d87@mail.gmail.com> On Wed, Aug 27, 2008 at 10:54 PM, Peter wrote: >> hi all, as i am reading through "Basic BioSQL with Biopython" >> http://biopython.org/DIST/docs/biosql/python_biosql_basic.html > > Those are old and haven't been updated. ... > Due to an accident of history, the source for python_biosql_basic.html > and python_biosql_basic.pdf currently lives in BioSQL's SVN repository > rather than in Biopython's (which has led to them being more > complicated to update). I've just updated http://biopython.org/DIST/docs/biosql/python_biosql_basic.pdf and python_biosql_basic.html with the latest version from BioSQL's SVN repository - viewable here if anyone is interested: http://code.open-bio.org/svnweb/index.cgi/biosql/browse/biosql-schema/trunk/doc/biopython/ This does now include a link to the wiki page http://www.biopython.org/wiki/BioSQL but there are still several things I think need fixing or updating (e.g. using Bio.SeqIO instead of Bio.GenBank). 
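For anyone hitting the same "biodatabase doesn't exist" error, a minimal sketch of the setup order (the MySQL user, password and host below are placeholders, and biosqldb-mysql.sql ships with the BioSQL distribution):

# Shell, once per server: create the database and load the BioSQL schema
#   mysqladmin -u root create bioseqdb
#   mysql -u root bioseqdb < biosqldb-mysql.sql
# Python, once the tables exist:
from BioSQL import BioSeqDatabase

server = BioSeqDatabase.open_database(driver="MySQLdb", user="root",
                                      passwd="", host="localhost",
                                      db="bioseqdb")
db = server.new_database("cold")  # no longer fails once the schema is loaded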
Peter From srbanator at heckler-koch.cz Thu Aug 28 04:06:51 2008 From: srbanator at heckler-koch.cz (Pavel SRB) Date: Thu, 28 Aug 2008 10:06:51 +0200 Subject: [BioPython] development question In-Reply-To: <48B5BD98.8050101@heckler-koch.cz> References: <48B5BD98.8050101@heckler-koch.cz> Message-ID: <48B65C9B.4000407@heckler-koch.cz> hi all, please i have a question about your development settings. example: at my work we keep all code in svn repository. Each developer checkout the code, work on it, after every code edit i restart my apache-prefork and then see the results in browser, log or whatever. so now to biopython. On my system i have biopython from debian repository via apt-get. But i would like to have second version of biopython in system just to check, log and change the code to learn more. This can be done with removing sys.path.remove("/var/lib/python-support/python2.5") and importing Bio from some other development directory. But this way i loose all modules in direcotory mentioned above and i believe it can be done more clearly so how you are coding your biopython? thanks for advice pavel srb From biopython at maubp.freeserve.co.uk Thu Aug 28 05:53:18 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 28 Aug 2008 10:53:18 +0100 Subject: [BioPython] development question In-Reply-To: <48B65C9B.4000407@heckler-koch.cz> References: <48B5BD98.8050101@heckler-koch.cz> <48B65C9B.4000407@heckler-koch.cz> Message-ID: <320fb6e00808280253p5524a04aw2ab2e8791c0c5eec@mail.gmail.com> On Thu, Aug 28, 2008 at 9:06 AM, Pavel SRB wrote: > hi all, please i have a question about your development settings. > > example: at my work we keep all code in svn repository. Each developer > checkout the code, work on it, after every code edit i restart my > apache-prefork and then see the results in browser, log or whatever. Biopython currently uses CVS, but we will hopefully be transitioning to SVN shortly (most of the other Bio* projects have already moved over). > so now to biopython. On my system i have biopython from debian repository > via apt-get. But i would like to have second version of biopython in system > just to check, log and change the code to learn more. This can be done with > removing sys.path.remove("/var/lib/python-support/python2.5") > and importing Bio from some other development directory. But this way i > loose all modules in direcotory mentioned above and i believe it can be done > more clearly > > so how you are coding your biopython? Since you asked about Debian, I'll talk about my Linux machine which is currently running Ubuntu Dapper Drake (which I know is overdue for an update, but it works fine for me). The official Biopython packages were too out of date for me, so I uninstalled them and instead stay up to date with CVS which I install that under my home directory using "python setup.py install --prefix=/home/maubp". Then, to make sure my python packages (installed in my home directory) get priority over the system level packages, I set the PYTHONPATH envirnment variable. As I use bash, I just added this to my .bashrc file: # Tell Python about my locally installed Python Modules: export PYTHONPATH="/home/maubp/lib/python2.4/site-packages" (Getting IDLE to use my local packages is harder - I have a hack solution but its not very nice) Alternatively, in any individual python script you can do "import sys" and then manipulate sys.path before doing any "import Bio" statements. 
If you want to have both the Debian (old) Biopython and the latest CVS Biopython, I suggest you use apt-get or equivalent to install the official Debian Biopython AND install CVS biopython from source in your home directory (using something like "python setup.py --prefix=/home/pavel" according to your user name). You can then change your python path environment variable to switch between the two installations. However, having both an old and a new Biopython could be very confusing - so I personally wanted to avoid this. Peter From agarbino at gmail.com Thu Aug 28 14:51:47 2008 From: agarbino at gmail.com (Alex Garbino) Date: Thu, 28 Aug 2008 13:51:47 -0500 Subject: [BioPython] Parsing BLAST In-Reply-To: <320fb6e00808271444h27217aa3hb547a2701420cf4@mail.gmail.com> References: <4cf37ad00808271012k39f795c1q36515207b4dca421@mail.gmail.com> <384507.82825.qm@web65609.mail.ac4.yahoo.com> <4cf37ad00808271401u36316d7eq431552ea3088646e@mail.gmail.com> <320fb6e00808271444h27217aa3hb547a2701420cf4@mail.gmail.com> Message-ID: <4cf37ad00808281151t1024f4cdv840be0ff81fbcce8@mail.gmail.com> Thanks for all the help! I'm now almost done. My script is to take a fasta file, run blast, and output a comma-separated-values list in the following format: AccessionID, Source, Length, FASTA sequence. I have one last issue: How do I get the fasta sequence out? I can easily get the raw sequence, but I need it in fasta format. I left a couple of things I've tried from tutorials commented out at the bottom, in case it helps. My csv output may also need help, depending on how the Fasta output behaves in a csv... from Bio import SeqIO from Bio.Blast import NCBIWWW from Bio.Blast import NCBIXML from Bio import Entrez #Open file to blast file = "protein.txt" #In fasta format #Blast, save copy record = SeqIO.read(open(file), format="fasta") result_handle = NCBIWWW.qblast("blastp", "nr", record.seq.tostring(), hitlist_size=1) #Don't hit the servers hard until ready blast_results = result_handle.read() save_file = open(file[:-4]+".xml", "w") save_file.write(blast_results) save_file.close() result_handle = open(file[:-4]+".xml") #Load the blast record blast_records = NCBIXML.parse(result_handle) blast_record = blast_records.next() output = {} for x in blast_record.alignments: output[x.accession] = [x.length] for x in output: handle = Entrez.efetch(db="protein", id=x, rettype="genbank") record = SeqIO.parse(handle, "genbank") recurd = record.next() output[x].insert(0, recurd.id) output[x].insert(1, recurd.annotations["source"]) #SeqIO.write(recurd, output[x].extend, "fasta") """ handle2 = Entrez.efetch(db="protein", id=x, rettype="fasta") recurd2 = SeqIO.read(handle2, "fasta") output[x].extend = [recurd2.seq.tostring()] """ print output save_file = open(file[:-4]+".csv", "w") #Generate CSV for item in output: save_file.write('%s,%s,%s\n' % (output[item][0],output[item][1],output[item][2])) #save_file.write('%s,%s,%s\n' % (output[item][0],output[item][1],output[item][2],output[item][3]) (When Fasta works) save_file.close() -------------------------- Thanks! 
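One way to get the FASTA-formatted text that the commented-out lines in the script above are reaching for is to write the record to an in-memory handle; a sketch (the helper name record_to_fasta is made up):

from StringIO import StringIO
from Bio import SeqIO

def record_to_fasta(record):
    # Round-trip a single SeqRecord through an in-memory handle
    # and return the FASTA text as a plain string.
    handle = StringIO()
    SeqIO.write([record], handle, "fasta")
    return handle.getvalue()

Note the returned string still contains newlines, which is where the CSV question below comes in.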
Alex From biopython at maubp.freeserve.co.uk Fri Aug 29 11:04:58 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 29 Aug 2008 16:04:58 +0100 Subject: [BioPython] Parsing BLAST In-Reply-To: <4cf37ad00808281151t1024f4cdv840be0ff81fbcce8@mail.gmail.com> References: <4cf37ad00808271012k39f795c1q36515207b4dca421@mail.gmail.com> <384507.82825.qm@web65609.mail.ac4.yahoo.com> <4cf37ad00808271401u36316d7eq431552ea3088646e@mail.gmail.com> <320fb6e00808271444h27217aa3hb547a2701420cf4@mail.gmail.com> <4cf37ad00808281151t1024f4cdv840be0ff81fbcce8@mail.gmail.com> Message-ID: <320fb6e00808290804i6e552fcdk52db836a89eee946@mail.gmail.com> On Thu, Aug 28, 2008 at 7:51 PM, Alex Garbino wrote: > Thanks for all the help! > I'm now almost done. My script is to take a fasta file, run blast, and > output a comma-separated-values list in the following format: > AccessionID, Source, Length, FASTA sequence. FASTA sequence format looks like this: >name and description CATACGACTACGTCAACGATCCGAACT GACTACGATCAGCATCGACTAGCTGTG GTGTGGT >name2 and second sequence description AGCGACAGCGACGAGCAGCGACGAG AGCGAGC Its not something you can squeeze into a comma separared file. I think you might just mean getting the sequence itself - or have two files (one CVS, one FASTA). Peter From agarbino at gmail.com Fri Aug 29 11:39:22 2008 From: agarbino at gmail.com (Alex Garbino) Date: Fri, 29 Aug 2008 10:39:22 -0500 Subject: [BioPython] Parsing BLAST In-Reply-To: <320fb6e00808290804i6e552fcdk52db836a89eee946@mail.gmail.com> References: <4cf37ad00808271012k39f795c1q36515207b4dca421@mail.gmail.com> <384507.82825.qm@web65609.mail.ac4.yahoo.com> <4cf37ad00808271401u36316d7eq431552ea3088646e@mail.gmail.com> <320fb6e00808271444h27217aa3hb547a2701420cf4@mail.gmail.com> <4cf37ad00808281151t1024f4cdv840be0ff81fbcce8@mail.gmail.com> <320fb6e00808290804i6e552fcdk52db836a89eee946@mail.gmail.com> Message-ID: <4cf37ad00808290839o205a607k486ddd13a65ed276@mail.gmail.com> >> I'm now almost done. My script is to take a fasta file, run blast, and >> output a comma-separated-values list in the following format: >> AccessionID, Source, Length, FASTA sequence. > > FASTA sequence format looks like this: > >>name and description > CATACGACTACGTCAACGATCCGAACT > GACTACGATCAGCATCGACTAGCTGTG > GTGTGGT >>name2 and second sequence description > AGCGACAGCGACGAGCAGCGACGAG > AGCGAGC > > Its not something you can squeeze into a comma separared file. I > think you might just mean getting the sequence itself - or have two > files (one CVS, one FASTA). > > Peter > That's the problem I'm having... I want to keep FASTA format (so I can plug it into ClustalW, etc), which is difficult to do because of the newline after the fasta title. Manually in excel, I could fit the whole FASTA into a cell, I think it was converted to a string (when I copy-pasted it into clustalw, it would be in " "). Is there a way to ignore the newline between description and sequence? 
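If the whole FASTA entry really has to sit in one spreadsheet cell, Python's csv module will quote fields that contain newlines; a sketch, assuming a dictionary in the same layout as the script earlier in the thread, i.e. accession mapped to [record_id, source, length, sequence] (whether other tools read such a cell cleanly is another matter, and the two-file approach suggested below is simpler):

import csv

# Made-up example entry in the assumed layout:
output = {"ABC12345": ["ABC12345.1", "Homo sapiens", 30,
                       "MRVKEKYQHLWRWGWRWGTMLLGMLMICSA"]}

writer = csv.writer(open("results.csv", "wb"))
for acc in output:
    rec_id, source, length, seq = output[acc]
    fasta_text = ">%s %s\n%s" % (acc, source, seq)
    # Fields containing newlines are quoted, so the entry stays in one cell.
    writer.writerow([acc, source, length, fasta_text])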
Thanks, Alex From agarbino at gmail.com Fri Aug 29 12:10:00 2008 From: agarbino at gmail.com (Alex Garbino) Date: Fri, 29 Aug 2008 11:10:00 -0500 Subject: [BioPython] Parsing BLAST In-Reply-To: <4cf37ad00808290839o205a607k486ddd13a65ed276@mail.gmail.com> References: <4cf37ad00808271012k39f795c1q36515207b4dca421@mail.gmail.com> <384507.82825.qm@web65609.mail.ac4.yahoo.com> <4cf37ad00808271401u36316d7eq431552ea3088646e@mail.gmail.com> <320fb6e00808271444h27217aa3hb547a2701420cf4@mail.gmail.com> <4cf37ad00808281151t1024f4cdv840be0ff81fbcce8@mail.gmail.com> <320fb6e00808290804i6e552fcdk52db836a89eee946@mail.gmail.com> <4cf37ad00808290839o205a607k486ddd13a65ed276@mail.gmail.com> Message-ID: <4cf37ad00808290910i719aa046i13de5d5816e9a7e3@mail.gmail.com> Assuming I just stick to making the plain sequence the 4th variable (instead of in fasta format), how should I add it to my dictionary? Doing: output[x].extend(record.seq.tostring()) Will add each letter individually, so each entry has a few hundred elements, rather than the forth element being the full string. join() doesn't seem to be it... Thanks, Alex On Fri, Aug 29, 2008 at 10:39 AM, Alex Garbino wrote: >>> I'm now almost done. My script is to take a fasta file, run blast, and >>> output a comma-separated-values list in the following format: >>> AccessionID, Source, Length, FASTA sequence. >> >> FASTA sequence format looks like this: >> >>>name and description >> CATACGACTACGTCAACGATCCGAACT >> GACTACGATCAGCATCGACTAGCTGTG >> GTGTGGT >>>name2 and second sequence description >> AGCGACAGCGACGAGCAGCGACGAG >> AGCGAGC >> >> Its not something you can squeeze into a comma separared file. I >> think you might just mean getting the sequence itself - or have two >> files (one CVS, one FASTA). >> >> Peter >> > > That's the problem I'm having... I want to keep FASTA format (so I can > plug it into ClustalW, etc), which is difficult to do because of the > newline after the fasta title. > Manually in excel, I could fit the whole FASTA into a cell, I think it > was converted to a string (when I copy-pasted it into clustalw, it > would be in " "). > Is there a way to ignore the newline between description and sequence? > > Thanks, > Alex > From biopython at maubp.freeserve.co.uk Fri Aug 29 12:13:59 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 29 Aug 2008 17:13:59 +0100 Subject: [BioPython] Parsing BLAST In-Reply-To: <4cf37ad00808290839o205a607k486ddd13a65ed276@mail.gmail.com> References: <4cf37ad00808271012k39f795c1q36515207b4dca421@mail.gmail.com> <384507.82825.qm@web65609.mail.ac4.yahoo.com> <4cf37ad00808271401u36316d7eq431552ea3088646e@mail.gmail.com> <320fb6e00808271444h27217aa3hb547a2701420cf4@mail.gmail.com> <4cf37ad00808281151t1024f4cdv840be0ff81fbcce8@mail.gmail.com> <320fb6e00808290804i6e552fcdk52db836a89eee946@mail.gmail.com> <4cf37ad00808290839o205a607k486ddd13a65ed276@mail.gmail.com> Message-ID: <320fb6e00808290913j2ea5c48eo420ed8d2c0691e85@mail.gmail.com> Alex wrote: > That's the problem I'm having... I want to keep FASTA format (so I can > plug it into ClustalW, etc), which is difficult to do because of the > newline after the fasta title. If you want to put it into FASTA format, you can't use a CSV file (unless you use embedded \n notation but I don't see how that would help). You could record the name and the sequence in your CSV file and later extract these into a FASTA file for use with ClustalW. I do still suggest you write out two files, a FASTA file and a separate CSV file containing the sequence if you want this too. 
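And a minimal sketch of that two-file suggestion, with the same assumed [record_id, source, length, sequence] layout; results.faa (ungapped sequences) can then go straight into ClustalW:

# Made-up example entry in the assumed layout:
output = {"ABC12345": ["ABC12345.1", "Homo sapiens", 30,
                       "MRVKEKYQHLWRWGWRWGTMLLGMLMICSA"]}

csv_handle = open("results.csv", "w")
fasta_handle = open("results.faa", "w")
for acc in output:
    rec_id, source, length, seq = output[acc]
    csv_handle.write("%s,%s,%s,%s\n" % (acc, rec_id, source, length))
    fasta_handle.write(">%s %s\n%s\n" % (acc, source, seq))
csv_handle.close()
fasta_handle.close()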
Peter From freeman at stanfordalumni.org Fri Aug 29 17:33:09 2008 From: freeman at stanfordalumni.org (Ted Larson Freeman) Date: Fri, 29 Aug 2008 14:33:09 -0700 Subject: [BioPython] New user question: biopython compatability with EPD Message-ID: <5d8729f00808291433x4e29b980je7948a8a72238734@mail.gmail.com> I'm using the Enthought Python Distribution, which contains many libraries including numpy. Reading the requirements for biopython here: http://biopython.org/wiki/Download#Required_Software I see that biopython requires Numerical Python, an older version of numpy. Can I install Numerical Python alongside numpy and use them both? Thanks. Ted From matzke at berkeley.edu Fri Aug 29 18:21:48 2008 From: matzke at berkeley.edu (Nick Matzke) Date: Fri, 29 Aug 2008 15:21:48 -0700 Subject: [BioPython] New user question: biopython compatability with EPD In-Reply-To: <5d8729f00808291433x4e29b980je7948a8a72238734@mail.gmail.com> References: <5d8729f00808291433x4e29b980je7948a8a72238734@mail.gmail.com> Message-ID: <48B8767C.4000109@berkeley.edu> I did exactly this and it worked fine... (Enthought, then Numerical, then biopython) Ted Larson Freeman wrote: > I'm using the Enthought Python Distribution, which contains many > libraries including numpy. Reading the requirements for biopython > here: > http://biopython.org/wiki/Download#Required_Software > > I see that biopython requires Numerical Python, an older version of > numpy. Can I install Numerical Python alongside numpy and use them > both? > > Thanks. > > Ted > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- ==================================================== Nicholas J. Matzke Ph.D. student, Graduate Student Researcher Huelsenbeck Lab Center for Theoretical Evolutionary Genomics 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab websites: http://ib.berkeley.edu/people/lab_detail.php?lab=54 http://fisher.berkeley.edu/cteg/hlab.html Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html Lab phone: 510-643-6299 Dept. fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Office hours for Bio1B, Spring 2008: Biology: Plants, Evolution, Ecology VLSB 2013, Monday 1-1:30 (some TA there for all hours during work week) Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ==================================================== From mjldehoon at yahoo.com Fri Aug 29 22:45:31 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 29 Aug 2008 19:45:31 -0700 (PDT) Subject: [BioPython] Bio.MetaTool Message-ID: <46010.36121.qm@web62405.mail.re1.yahoo.com> Hi everybody, Is anybody using the Bio.MetaTool module? If not, can we deprecate it? The Bio.MetaTool tests suggest that this module was written for MetaTool version 3.5 (28.03.2001), while the most current MetaTool version is at 5.0. Since MetaTool is written for Matlab/Octave, and it seems to be out of data, I expect that few people are using it with Python. Currently, Bio.MetaTool is the only non-deprecated module in Biopython that uses Martel. If we can deprecate Bio.MetaTool, then (over time) we can deprecate Martel, which means that Biopython won't need the mxTextTools any more, making Biopython's installation a lot easier. --Michiel. 
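Coming back to the EPD question above, a quick sanity check that the old Numeric and the new numpy really are installed side by side (a sketch; it assumes both packages are already importable):

# The two packages are independent modules and can coexist in one Python install.
import Numeric
import numpy
print "Numeric (old):", Numeric.__version__
print "numpy   (new):", numpy.__version__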
From mjldehoon at yahoo.com Fri Aug 29 22:47:54 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 29 Aug 2008 19:47:54 -0700 (PDT) Subject: [BioPython] NumPy Message-ID: <128888.36737.qm@web62405.mail.re1.yahoo.com> Hi everybody, Previously we discussed on the developer's mailing list whether Biopython should adopt the "new" Numerical Python (aka NumPy, currently at version 1.1.1) instead of the "old" Numerical Python (version 24.2). My objections against NumPy were that its documentation is not freely available, it doesn't compile cleanly on all platforms, and some other scientific and computational biology libraries use the old Numerical Python. Last week, the NumPy documentation did become freely available. Compilation of NumPy is still not perfect on all platforms (e.g. on Cygwin it may fail), however recently I have also noticed that compilation of the "old" Numerical Python may fail on modern systems. As far as I can tell, MMTK and PyMOL are (still?) based on the "old" Numerical Python, but Matplotlib now relies on the "new" Numerical Python. In my opinion, the balance is now tilting in favor of the new NumPy, and we should consider transitioning Biopython to the new NumPy. Does anybody have a strong preference for the "old" Numerical Python? --Michiel. From freeman at stanfordalumni.org Sat Aug 30 12:50:17 2008 From: freeman at stanfordalumni.org (Ted Larson Freeman) Date: Sat, 30 Aug 2008 09:50:17 -0700 Subject: [BioPython] NumPy In-Reply-To: <128888.36737.qm@web62405.mail.re1.yahoo.com> References: <128888.36737.qm@web62405.mail.re1.yahoo.com> Message-ID: <5d8729f00808300950k469e4b08w978f5614e6fc3d00@mail.gmail.com> As a newcomer, I would be in favor of updating to numpy, because it would make biopython appear (to other newcomers) to be a current, active project. How much development work would be necessary to switch? Ted On Fri, Aug 29, 2008 at 7:47 PM, Michiel de Hoon wrote: > Hi everybody, > > Previously we discussed on the developer's mailing list whether Biopython should adopt the "new" Numerical Python (aka NumPy, currently at version 1.1.1) instead of the "old" Numerical Python (version 24.2). My objections against NumPy were that its documentation is not freely available, it doesn't compile cleanly on all platforms, and some other scientific and computational biology libraries use the old Numerical Python. > > Last week, the NumPy documentation did become freely available. Compilation of NumPy is still not perfect on all platforms (e.g. on Cygwin it may fail), however recently I have also noticed that compilation of the "old" Numerical Python may fail on modern systems. As far as I can tell, MMTK and PyMOL are (still?) based on the "old" Numerical Python, but Matplotlib now relies on the "new" Numerical Python. > > In my opinion, the balance is now tilting in favor of the new NumPy, and we should consider transitioning Biopython to the new NumPy. Does anybody have a strong preference for the "old" Numerical Python? > > --Michiel. 
> > > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > > From bsouthey at gmail.com Sat Aug 30 14:30:27 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Sat, 30 Aug 2008 13:30:27 -0500 Subject: [BioPython] NumPy In-Reply-To: <128888.36737.qm@web62405.mail.re1.yahoo.com> References: <128888.36737.qm@web62405.mail.re1.yahoo.com> Message-ID: On Fri, Aug 29, 2008 at 9:47 PM, Michiel de Hoon wrote: > Hi everybody, > > Previously we discussed on the developer's mailing list whether Biopython should adopt the "new" Numerical Python (aka NumPy, currently at version 1.1.1) instead of the "old" Numerical Python (version 24.2). My objections against NumPy were that its documentation is not freely available, it doesn't compile cleanly on all platforms, and some other scientific and computational biology libraries use the old Numerical Python. Actually NumPy is doing a 1.2 release and may be one to watch. Also NumPy will be using Nose for testing so if not installed you can not run the tests. I did not find Travis book that much different from existing documentation. In anay case, you can get it at: http://svn.scipy.org/svn/numpy/trunk/doc/numpybook/ http://www.tramy.us/numpybook.pdf I would also point out the huge NumPy 'Marathon': http://scipy.org/Developer_Zone/DocMarathon2008 http://sd-2116.dedibox.fr/pydocweb/wiki/Front%20Page/ > > Last week, the NumPy documentation did become freely available. Compilation of NumPy is still not perfect on all platforms (e.g. on Cygwin it may fail), however recently I have also noticed that compilation of the "old" Numerical Python may fail on modern systems. As far as I can tell, MMTK and PyMOL are (still?) based on the "old" Numerical Python, but Matplotlib now relies on the "new" Numerical Python. The Cygwin has a major bug that is not due to NumPy. But I am sure the NumPy developers would like to know any compilation problems. > > In my opinion, the balance is now tilting in favor of the new NumPy, and we should consider transitioning Biopython to the new NumPy. Does anybody have a strong preference for the "old" Numerical Python? > Actually, how critical is having Numerical Python in BioPython in the first place? Is there a case to remove functionality or have special sub-modules? One rather critical aspect is that both NumPy and Matplotlib also require Python 2.4. Regards Bruce From mjldehoon at yahoo.com Sun Aug 31 04:39:08 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sun, 31 Aug 2008 01:39:08 -0700 (PDT) Subject: [BioPython] NumPy In-Reply-To: <5d8729f00808300950k469e4b08w978f5614e6fc3d00@mail.gmail.com> Message-ID: <431678.62092.qm@web62405.mail.re1.yahoo.com> --- On Sat, 8/30/08, Ted Larson Freeman wrote: > As a newcomer, I would be in favor of updating to numpy, > because it would make biopython appear (to other > newcomers) to be a current, active project. > > How much development work would be necessary to switch? > It's not so bad. The biggest one is Bio.Cluster. This module is also available separate from Biopython as Pycluster. Its latest version already uses NumPy. Most of the other modules use Numerical Python at the Python level, which is much easier to fix. I am more worried about the portability of NumPy. A while back there were some installation problems with Numerical Python on some platforms. This caused a lot of user questions, since they weren't able to install it. 
Hence my comment about NumPy sometimes failing to build on Cygwin. --Michiel. From mjldehoon at yahoo.com Sun Aug 31 05:01:31 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sun, 31 Aug 2008 02:01:31 -0700 (PDT) Subject: [BioPython] NumPy In-Reply-To: Message-ID: <917875.96314.qm@web62408.mail.re1.yahoo.com> > Actually NumPy is doing a 1.2 release and may be one to > watch. OK maybe we should wait until release 1.2. Though from what I understood, NumPy-dependent code won't have to be changed going from Numpy 1.1 to 1.2. > Also NumPy will be using Nose for testing so if > not installed you can not run the tests. These kinds of things I find really annoying about NumPy. Such kind of basic libraries should only rely on run-of-the-mill Python. > The Cygwin has a major bug that is not due to NumPy. But I > am sure the NumPy developers would like to know any > compilation problems. I filed this bug report about 10 months ago: http://projects.scipy.org/scipy/numpy/ticket/612 As I just found out, this bug was fixed a while back. I should try and see if NumPy compiles correctly on Cygwin now. > Actually, how critical is having Numerical Python in > BioPython in the first place? Numerical Python is now used by the following modules: Bio.Affy Bio.Cluster Bio.MarkovModel Bio.distance Bio.KDTree Bio.kNN Bio.LogisticRegression Bio.MaxEntropy Bio.MetaTool Bio.NaiveBayes Bio.PDB Bio.Statistics Bio.SVDSuperimposer In this list, Bio.Cluster and Bio.PDB are the biggest ones. The other ones, IMHO, are not the core functionality of Biopython. While I wouldn't just want to get rid of them, we have some more flexibility there. Bio.Cluster also exists as a separate library (Pycluster), so it's not a complete disaster if Bio.Cluster disappears. However, Bio.PDB is a serious issue. One option to consider is to allow Numerical Python / NumPy only at the Python level, and not at the C level. Then these modules can be written such that they try to import (the new) NumPy first, and failing that, try to import (the old) Numerical Python instead. --Michiel. > Is there a case to remove functionality or have special > sub-modules? > > One rather critical aspect is that both NumPy and > Matplotlib also > require Python 2.4. > > Regards > Bruce From biopython at maubp.freeserve.co.uk Sun Aug 31 11:28:49 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 31 Aug 2008 16:28:49 +0100 Subject: [BioPython] Parsing BLAST In-Reply-To: <4cf37ad00808290910i719aa046i13de5d5816e9a7e3@mail.gmail.com> References: <4cf37ad00808271012k39f795c1q36515207b4dca421@mail.gmail.com> <384507.82825.qm@web65609.mail.ac4.yahoo.com> <4cf37ad00808271401u36316d7eq431552ea3088646e@mail.gmail.com> <320fb6e00808271444h27217aa3hb547a2701420cf4@mail.gmail.com> <4cf37ad00808281151t1024f4cdv840be0ff81fbcce8@mail.gmail.com> <320fb6e00808290804i6e552fcdk52db836a89eee946@mail.gmail.com> <4cf37ad00808290839o205a607k486ddd13a65ed276@mail.gmail.com> <4cf37ad00808290910i719aa046i13de5d5816e9a7e3@mail.gmail.com> Message-ID: <320fb6e00808310828t777f7768oc7a3a6616cc1ba75@mail.gmail.com> On Fri, Aug 29, 2008 at 5:10 PM, Alex Garbino wrote: > Assuming I just stick to making the plain sequence the 4th variable > (instead of in fasta format), how should I add it to my dictionary? > Doing: > > output[x].extend(record.seq.tostring()) > > Will add each letter individually, so each entry has a few hundred > elements, rather than the forth element being the full string. join() > doesn't seem to be it... 
Assuming output is a dictionary whose elements are lists, try output[x].append(record.seq.tostring()) You need to read about the difference between the append and extend methods of a list in python. Peter From biopython at maubp.freeserve.co.uk Sun Aug 31 11:34:46 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 31 Aug 2008 16:34:46 +0100 Subject: [BioPython] New user question: biopython compatability with EPD In-Reply-To: <48B8767C.4000109@berkeley.edu> References: <5d8729f00808291433x4e29b980je7948a8a72238734@mail.gmail.com> <48B8767C.4000109@berkeley.edu> Message-ID: <320fb6e00808310834j6166a176id05ce51e43973f9b@mail.gmail.com> Ted Larson Freeman wrote: >> I see that biopython requires Numerical Python, an older version of >> numpy. Yes, although we are planning to move - its just we have some C-code that uses numeric/numpy so this isn't a trivial switch. >> Can I install Numerical Python alongside numpy and use them both? Yes, I've got both on several machines and they co-incide happily (using standard python). Nick confirmed this worked for him using Enthought's bundle. Peter From krewink at inb.uni-luebeck.de Thu Aug 28 05:14:37 2008 From: krewink at inb.uni-luebeck.de (Albert Krewinkel) Date: Thu, 28 Aug 2008 09:14:37 -0000 Subject: [BioPython] development question In-Reply-To: <48B65C9B.4000407@heckler-koch.cz> References: <48B5BD98.8050101@heckler-koch.cz> <48B65C9B.4000407@heckler-koch.cz> Message-ID: <20080828090431.GD5801@inb.uni-luebeck.de> Hi Pavel, On Thu, Aug 28, 2008 at 10:06:51AM +0200, Pavel SRB wrote: > so now to biopython. On my system i have biopython from debian repository > via apt-get. But i would like to have second version of biopython in system > just to check, log and change the code to learn more. This can be done with > removing sys.path.remove("/var/lib/python-support/python2.5") > and importing Bio from some other development directory. But this way i > loose all modules in direcotory mentioned above and i believe it can be > done more clearly An easy way would be to just add the path to your biopython svn-version to the _front_ of the sys.path list: sys.path = ['/your/path/to/biopython/'] + sys.path Please note, however, that this isn't really a biopython related question, so you might be better off asking in a general python forum/newsgroup/mailing-list. Cheers, Albert -- Albert Krewinkel University of Luebeck, Institute for Neuro- and Bioinformatics http://www.inb.uni-luebeck.de/~krewink/ -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: Digital signature URL: From cmckay at u.washington.edu Fri Aug 1 01:45:15 2008 From: cmckay at u.washington.edu (Cedar McKay) Date: Thu, 31 Jul 2008 18:45:15 -0700 Subject: [BioPython] SeqRecord to Genbank: use SeqIO? Message-ID: <0B3D0CC5-820B-4D98-8EB6-6ED6F7D2B6C8@u.washington.edu> Hello, I have SeqRecord objects that I'd like to convert to a string that is in Genbank format. That way I can do whatever with it, including write it to a file. The only way I can see to do anything similar is using SeqIO and doing something like: SeqIO.write(my_records, out_file_handle, "genbank") which I found here: http://biopython.org/DIST/docs/tutorial/Tutorial.html#chapter:Bio.SeqIO The problem is, it doesn't support something like: SeqIO.write(seq_record, out_file_handle, "genbank") Because it requires an iterable object I guess? 
And it has to write to a file handle for some reason, and won't just give me the string to do whatever I want with. I've done a lot of searching and mailing lists, and googling, and surely I must be missing something? What is the simplest way to get a string representing a genbank file, starting with a SeqRecord? I'm sort of shocked that there isn't some sort of SeqRecord.to_genbank() method. Hope someone can point me in the right direction! best, Cedar From p.j.a.cock at googlemail.com Fri Aug 1 09:20:01 2008 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 1 Aug 2008 10:20:01 +0100 Subject: [BioPython] SeqRecord to Genbank: use SeqIO? In-Reply-To: <0B3D0CC5-820B-4D98-8EB6-6ED6F7D2B6C8@u.washington.edu> References: <0B3D0CC5-820B-4D98-8EB6-6ED6F7D2B6C8@u.washington.edu> Message-ID: <320fb6e00808010220k647f7dcfw71e58ff09a457c32@mail.gmail.com> On Fri, Aug 1, 2008 at 2:45 AM, Cedar McKay wrote: > Hello, I have SeqRecord objects that I'd like to convert to a string that is > in Genbank format. That way I can do whatever with it, including write it to > a file. The only way I can see to do anything similar is using SeqIO and > doing something like: > > SeqIO.write(my_records, out_file_handle, "genbank") > which I found here: > http://biopython.org/DIST/docs/tutorial/Tutorial.html#chapter:Bio.SeqIO That could would work fine - once Bio.SeqIO supports output in the GenBank format. Its been on my "to do list" for a while, but being annotation rich this is non-trivial one you start to use this with other file formats. http://bugzilla.open-bio.org/show_bug.cgi?id=2294 I've been thinking about writing a unit test using the EMBOSS seqret program for interconverting file formats, as a way of checking our conversions against a third party. > The problem is, it doesn't support something like: > SeqIO.write(seq_record, out_file_handle, "genbank") > Because it requires an iterable object I guess? Yes, you would have to use this: SeqIO.write([seq_record], out_file_handle, "genbank") Of course, if lots of people really want to have the flexibility to supply a SeqRecord or a SeqRecord list/iterator this would be possible. On the other hand, there is something to be said for a simple fixed interface. > And it has to write to a file handle for some reason, and > won't just give me the string to do whatever I want with. This is by design - the API uses handles and only handles. If you want a string containing the data, use StringIO (or cStringIO), something like this: from StringIO import StringIO handle = StringIO() SeqIO.write(seq_records, handle "fasta") handle.seek(0) data = handle.read() This isn't in the tutorial or the wiki page (yet). http://biopython.org/wiki/SeqIO > I've done a lot of searching and mailing lists, and googling, and surely I > must be missing something? What is the simplest way to get a string > representing a genbank file, starting with a SeqRecord? > > I'm sort of shocked that there isn't some sort of SeqRecord.to_genbank() > method. We have discussed something like a SeqRecord.to_format() method (or similar name), which would call Bio.SeqIO internally using StringIO and return a string. 
This fits in nicely with the planned __format__ and format() functionality in Python 2.6 and 3.0 http://www.python.org/dev/peps/pep-3101/ See http://portal.open-bio.org/pipermail/biopython-dev/2008-June/003793.html Peter From biopython at maubp.freeserve.co.uk Fri Aug 1 09:43:31 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 1 Aug 2008 10:43:31 +0100 Subject: [BioPython] SeqRecord to Genbank: use SeqIO? In-Reply-To: <320fb6e00808010220k647f7dcfw71e58ff09a457c32@mail.gmail.com> References: <0B3D0CC5-820B-4D98-8EB6-6ED6F7D2B6C8@u.washington.edu> <320fb6e00808010220k647f7dcfw71e58ff09a457c32@mail.gmail.com> Message-ID: <320fb6e00808010243p66092123jbc914c68673e6468@mail.gmail.com> >> I've done a lot of searching and mailing lists, and googling, and surely I >> must be missing something? What is the simplest way to get a string >> representing a genbank file, starting with a SeqRecord? >> >> I'm sort of shocked that there isn't some sort of SeqRecord.to_genbank() >> method. > > We have discussed something like a SeqRecord.to_format() method (or > similar name), which would call Bio.SeqIO internally using StringIO > and return a string. This fits in nicely with the planned __format__ > and format() functionality in Python 2.6 and 3.0 > http://www.python.org/dev/peps/pep-3101/ > > See http://portal.open-bio.org/pipermail/biopython-dev/2008-June/003793.html I've filed this enhancement as Bug 2561 - http://bugzilla.open-bio.org/show_bug.cgi?id=2561 Any comments or suggestions on the naming of this function could be recorded there, or discussed on the mailing list. Peter From biopython at maubp.freeserve.co.uk Fri Aug 1 10:21:06 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 1 Aug 2008 11:21:06 +0100 Subject: [BioPython] SeqRecord to Genbank: use SeqIO? In-Reply-To: <320fb6e00808010220k647f7dcfw71e58ff09a457c32@mail.gmail.com> References: <0B3D0CC5-820B-4D98-8EB6-6ED6F7D2B6C8@u.washington.edu> <320fb6e00808010220k647f7dcfw71e58ff09a457c32@mail.gmail.com> Message-ID: <320fb6e00808010321o2aaec1a3je6a3a8d263769931@mail.gmail.com> On Fri, Aug 1, 2008 at 10:20 AM, Peter Cock wrote: > On Fri, Aug 1, 2008 at 2:45 AM, Cedar McKay wrote: >> Hello, I have SeqRecord objects that I'd like to convert to a string ... >> And it [Bio.SeqIO.write()] has to write to a file handle for some reason, >> and won't just give me the string to do whatever I want with. > > This is by design - the API uses handles and only handles. If you > want a string containing the data, use StringIO (or cStringIO), > something like this: > > from StringIO import StringIO > handle = StringIO() > SeqIO.write(seq_records, handle "fasta") > handle.seek(0) > data = handle.read() > > This isn't in the tutorial or the wiki page (yet). > http://biopython.org/wiki/SeqIO I've just updated the tutorial in CVS to cover this (and a similar example for Bio.AlignIO). We don't normally update the public PDF and HTML versions of the tutorial between releases to avoid the documentation talking about unreleased features, so you won't see this change for a while. Could I ask why you want to get the SeqRecord as a string in GenBank format? My only guess is as part of some webservice where a string would be useful to embed within a page template. Getting a SeqRecord in FASTA format is probably a more common request, given many web tools will take this as input. 
From biopython at maubp.freeserve.co.uk Sat Aug 2 12:02:13 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 2 Aug 2008 13:02:13 +0100 Subject: [BioPython] Deprecating Bio.Saf (PredictProtein Simple Alignment Format) In-Reply-To: <320fb6e00807231512s507d2652jc0f26764a62b01d5@mail.gmail.com> References: <320fb6e00807231512s507d2652jc0f26764a62b01d5@mail.gmail.com> Message-ID: <320fb6e00808020502l64eca901gda132250fc85e5be@mail.gmail.com> On Wed, Jul 23, 2008 at 11:12 PM, Peter wrote: > Is anyone using Bio.Saf or PredictProtein's "Simple Alignment Format" (SAF)? > > Bio.Saf is one of the older parsers in Biopython. It parses the > PredictProtein "Simple Alignment Format" (SAF), a fairly free-format > multiple sequence alignment file format described here: > http://www.predictprotein.org/Dexa/optin_saf.html > > Potentially we could support this file format within Bio.AlignIO (if > there was any demand).
However, as far as I can tell, this file > format is ONLY used for PredictProtein, and they will accept several > other more mainstream alignment file formats as alternatives. I got in touch with PredictProtein, and Burkhard Rost told me: >> SAF is a simplified subset of MSF, actually it is >> MSF - checksum AND + more flexibility in terms of line length asf. I also said he was aware of several groups who used it/adopted SAF, but didn't have any names to hand. > Bio.Saf uses Martel for parsing, which is not entirely compatible with > mxTextTools 3.0. If we did want to integrate SAF support into > Bio.AlignIO it might be best to reimplement the parser in plain > python. I suspect adding support for the MSF alignment format to Bio.AlignIO would be of more general interest. > If no one is using it, I would like to deprecate Bio.Saf in the next > release of Biopython. Still no objections? Peter From biopython at maubp.freeserve.co.uk Sat Aug 2 12:18:11 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 2 Aug 2008 13:18:11 +0100 Subject: [BioPython] Bio.Medline parser In-Reply-To: <20101.29085.qm@web62411.mail.re1.yahoo.com> References: <20101.29085.qm@web62411.mail.re1.yahoo.com> Message-ID: <320fb6e00808020518t516b4ab8gb2b2b3efa85c04ec@mail.gmail.com> On Sat, Aug 2, 2008 at 11:35 AM, Michiel de Hoon wrote: > Hi everybody, > > For bug #2454: > > http://bugzilla.open-bio.org/show_bug.cgi?id=2454 > > I was looking at the parser in Bio.Medline, which can parse flat files > in the Medline format. For an example, see Tests/Medline/pubmed_result2.txt: > > http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Tests/Medline/pubmed_result2.txt?rev=1.1&cvsroot=biopython&content-type=text/vnd.viewcvs-markup > > I would like to suggest some changes to this parser. > > Currently, it works as follows: > >>>> from Bio import Medline >>>> parser = Medline.RecordParser() >>>> handle = open("mymedlinefile.txt") >>>> record = parser.parse(handle) > > or, to iterate over a bunch of Medline records: > >>>> from Bio import Medline >>>> parser = Medline.RecordParser() >>>> handle = open("mymedlinefile.txt") >>>> records = Medline.Iterator(handle, parser) >>>> for record in records: > ... # do something with the record. > > I'd like to change these to > >>>> from Bio import Medline >>>> handle = open("mymedlinefile.txt") >>>> record = Medline.read(handle) > > and > >>>> from Bio import Medline >>>> handle = open("mymedlinefile.txt") >>>> records = Medline.parse(handle) >>>> for record in records: > ... # do something with the record. > > respectively. +1 (I agree) That would fit with our recent parser changes, and consistency is good :) > In addition, currently the fields in the Medline file are stored as attributes of the record. For example, if the file is > > PMID- 12230038 > OWN - NLM > STAT- MEDLINE > DA - 20020916 > ... > > then the corresponding record is > > record.pubmed_id = "12230038" > record.owner = "NLM" > record.status = "MEDLINE" > record.entry_date = "20020916" > > I'd like to change two things here: > > 1) Use the key shown in the Medline file instead of the name to store each field. > 2) Let the record class derive from a dictionary, and store each field as a key, value pair in this dictionary. > > record["PMID"] = "12230038" > record["OWN"] = "NLM" > record["STAT"] = "MEDLINE" > record["DA"] = "20020916" > ... > > This avoids the names that were rather arbitrarily chosen by ourselves, > and greatly simplifies the parser. 
The parser will also be more robust if > new fields are added to the Medline file format. One downside of this is that the user then has to go and consult the file format documentation to discover "DA" is the entry date, etc. In some cases the abbreviations are probably a little unclear. I would find code using the current named properties easier to read than the suggested dictionary-based approach which exposes the raw field names. Also, could you make the changes while leaving the older parser with the old record behaviour in place (with deprecation warnings) for a few releases? This would allow existing users' scripts to continue as-is (but with a warning). > Currently there is very little information on the Medline parser in the > documentation, so I doubt it has many users. Nevertheless, I wanted > to check if anybody has any objections or comments before I implement > these changes. I think the first addition (read and parse functions) is very sensible, but I am not sure about the suggested change to the record behaviour. Peter From mjldehoon at yahoo.com Sat Aug 2 13:32:32 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sat, 2 Aug 2008 06:32:32 -0700 (PDT) Subject: [BioPython] Bio.Medline parser In-Reply-To: <320fb6e00808020518t516b4ab8gb2b2b3efa85c04ec@mail.gmail.com> Message-ID: <407661.78979.qm@web62412.mail.re1.yahoo.com> --- On Sat, 8/2/08, Peter wrote: > > 1) Use the key shown in the Medline file instead of > > the name to store each field. > > 2) Let the record class derive from a dictionary, and > > store each field as a key, value pair in this dictionary. .... > One downside of this is that the user then has to go and > consult the file format documentation to discover "DA" is the > entry date, etc. In some cases the abbreviations are probably > a little unclear. I would find code using the current named > properties easier to read than the suggested dictionary-based > approach which exposes the raw field names. What I noticed when I was playing with this parser is that it is often unclear which (Biopython-chosen) name goes with which (NCBI-chosen) key. For example, PMID is the PubMed ID number in the flat file. Should I look under "pmid", "PMID", "PubmedID"? (the correct answer is "pubmed_id"). As you mention, the NCBI-chosen keys are often not very informative (who can guess that TT stands for "transliterated title"?). I was thinking of having a list of NCBI keys and their descriptions in the docstring of Bio.Medline's Record class, so users can always find them without having to go into NCBI's documentation. Another possibility is to overload the dictionary class such that all keys are automatically mapped to their more descriptive names. So the parser only knows about the NCBI-defined keys, but if a user types record["Author"], then the Record class knows it should return record["AU"]. With a corresponding modification of record.keys(). > Also, could you make the changes while leaving the older > parser with the old record behaviour in place (with deprecation > warnings) for a few releases? Yes that is possible. Existing scripts will use the parser = RecordParser(); parser.parse(handle) approach. This approach can continue to use the same Record class, basically ignoring the fact that it now derives from a dictionary. A deprecation warning is given when a user tries to create a RecordParser instance. --Michiel.
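To make the proposal above concrete, here is a rough sketch (not actual Bio.Medline code) of a record class derived from a dictionary that stores the raw NCBI keys, with a small alias table so that more descriptive names also resolve; the alias entries shown are illustrative, not a complete key list.

class Record(dict):
    """Dictionary-style Medline record keyed by the raw NCBI field codes."""
    # Illustrative subset of a key -> descriptive name mapping; the full
    # list would live in the class docstring or module documentation.
    _aliases = {"Author": "AU", "Full Author": "FAU", "Title": "TI",
                "Entry Date": "DA", "PubMed ID": "PMID"}

    def __getitem__(self, key):
        # Fall back to the raw NCBI key if a descriptive name was used.
        if key not in self and key in self._aliases:
            key = self._aliases[key]
        return dict.__getitem__(self, key)

record = Record()
record["PMID"] = "12230038"
record["AU"] = "Smith J"
print record["PubMed ID"]   # "12230038", found via the alias lookup
print record["AU"]          # "Smith J", using the raw key directly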
From biopython at maubp.freeserve.co.uk Sat Aug 2 14:09:58 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 2 Aug 2008 15:09:58 +0100 Subject: [BioPython] Bio.Medline parser In-Reply-To: <407661.78979.qm@web62412.mail.re1.yahoo.com> References: <320fb6e00808020518t516b4ab8gb2b2b3efa85c04ec@mail.gmail.com> <407661.78979.qm@web62412.mail.re1.yahoo.com> Message-ID: <320fb6e00808020709v129cd4c2g7f56251235b6abe0@mail.gmail.com> On Sat, Aug 2, 2008 at 2:32 PM, Michiel de Hoon wrote: >> One downside of this is that the user then has to go and >> consult the file format documentation to discover "DA" is the >> entry date, etc. In some cases the abbreviations are probably >> a little unclear. I would find code using the current named >> properties easier to read than the suggested dictionary based >> approach which exposes the raw field names. > > What I noticed when I was playing with this parser is that it is often > unclear which (Biopython-chosen) name goes with which (NCBI-chosen) > key. For example, PMID is the pubmed ID number in the flat file. Should > I look under "pmid", "PMID", "PubmedID"? (the correct answer is "pubmed_id"). If you did dir(record) how many possible candidates would you see? > As you mention, the NCBI-chosen keys are often not very informative > (who can guess that TT stands for "transliterated title"?). I was thinking > to have a list of NCBI keys and their description in the docstring of > Bio.Medline's Record class, so users can always find them without > having to go into NCBI's documentation. That would help users - and also future developers trying to understand what the parser is doing! > Another possibility is to overload the dictionary class such that all keys > are automatically mapped to their more descriptive names. So the > parser only knows about the NCBI-defined keys, but if a user types > record["Author"], then the Record class knows it should return > record["AU"]. With a corresponding modification of record.keys(). The alias idea is nice but does mean there is more than one way to access the data (not encouraged in python). A related suggestion is to support the properties record.entry_date, record.author etc (what ever the current parser does) as alternatives to record["DA"], record["AU"], ... ? This would then be backwards compatible. This could probably be done with a private dictionary mapping keys ("DA") to property names ("entry_date"). When ever we add a new entry to the dictionary, also see if it has a named property to define too. >> Also, could you make the changes whiling leaving the older >> parser with the old record behaviour in place (with deprecation >> warnings) for a few releases? > > Yes that is possible. Existing scripts will use ... Good, we shouldn't break existing scripts during the deprecation transition period. Peter From mjldehoon at yahoo.com Sun Aug 3 01:57:18 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sat, 2 Aug 2008 18:57:18 -0700 (PDT) Subject: [BioPython] Bio.Medline parser In-Reply-To: <320fb6e00808020709v129cd4c2g7f56251235b6abe0@mail.gmail.com> Message-ID: <591490.20671.qm@web62411.mail.re1.yahoo.com> > The alias idea is nice but does mean there is more than one > way to access the data (not encouraged in python). A related > suggestion is to support the properties record.entry_date, > record.author etc (what ever the current parser does) as > alternatives to record["DA"], record["AU"], ... ? This would > then be backwards compatible. 
This could probably be done with > a private dictionary mapping keys ("DA") to property names > ("entry_date"). When ever we add a new entry to the > dictionary, also see if it has a named property to define > too. > Thinking it over, I think that having a key and an attribute mapping to the same value is not so clean. Alternatively we could add a .find(term) method to the Bio.Medline.Record class, which takes a term and returns the appropriate value. So record.find("author") returns record["AU"]. This gives a clear separation between the raw keys in the Medline file and the more descriptive names. Also, such a .find method can accept a wider variety of terms than an attribute name (e.g., "Full Author", "full_author", etc. all return record["FAU"]). --Michiel From matzke at berkeley.edu Tue Aug 5 20:30:16 2008 From: matzke at berkeley.edu (Nick Matzke) Date: Tue, 05 Aug 2008 13:30:16 -0700 Subject: [BioPython] Clustalw.parse_file errors Message-ID: <4898B858.7040405@berkeley.edu> Hi all, I'm running through the excellent biopython tutorial here: http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#htoc100 I've hit an error here: 9.4.2 Creating your own substitution matrix from an alignment http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#htoc100 ...basically the Clustalw parser won't parse even the given example alignment file (protein.aln) or another example file from elsewhere (example.aln). ============ from Bio import Clustalw from Bio.Alphabet import IUPAC from Bio.Align import AlignInfo # get an alignment object from a Clustalw alignment output c_align = Clustalw.parse_file("protein.aln", IUPAC.protein) summary_align = AlignInfo.SummaryInfo(c_align) ============ this code doesn't work with the given protein.aln file, error message: Traceback (most recent call last): File "biopython_alignments.py", line 163, in ? c_align = Clustalw.parse_file(protein_align_file, IUPAC.protein) File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/Bio/Clustalw/__init__.py", line 60, in parse_file clustal_alignment._star_info = generic_alignment._star_info AttributeError: Alignment instance has no attribute '_star_info' It also doesn't work with the example.aln file here: http://www.pasteur.fr/recherche/unites/sis/formation/python/ch11s06.html http://www.pasteur.fr/recherche/unites/sis/formation/python/data/example.aln ...but throws a different error: code: c_align = Clustalw.parse_file('example.aln', alphabet=IUPAC.protein) ================================= Traceback (most recent call last): File "biopython_alignments.py", line 174, in ? c_align = Clustalw.parse_file('example.aln', alphabet=IUPAC.protein) File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/Bio/Clustalw/__init__.py", line 47, in parse_file generic_alignment = AlignIO.read(handle, "clustal") File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/Bio/AlignIO/__init__.py", line 299, in read first = iterator.next() File "/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/Bio/AlignIO/ClustalIO.py", line 169, in next raise ValueError("Could not parse line:\n%s" % line) ValueError: Could not parse line: *:::**:.**.** *.*** .:* *:******* ==================== I am running: wright:/bioinformatics/pyeg nick$ py -V Python 2.4.4 ...& biopython installed just last week... Any help appreciated, since I will have to use this module soon! Nick -- ==================================================== Nicholas J. Matzke Ph.D. 
student, Graduate Student Researcher Huelsenbeck Lab 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab website: http://ib.berkeley.edu/people/lab_detail.php?lab=54 Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/~edna/lab_test/members/matzke.html Lab phone: 510-643-6299 Dept. fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Office hours for Bio1B, Spring 2008: Biology: Plants, Evolution, Ecology VLSB 2013, Monday 1-1:30 (some TA there for all hours during work week) Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ==================================================== From biopython at maubp.freeserve.co.uk Tue Aug 5 20:52:01 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 5 Aug 2008 21:52:01 +0100 Subject: [BioPython] Clustalw.parse_file errors In-Reply-To: <4898B858.7040405@berkeley.edu> References: <4898B858.7040405@berkeley.edu> Message-ID: <320fb6e00808051352u1bd19467i914ddb48d1c8dde7@mail.gmail.com> Hi Nick, I'll take a look at the other problem, but I think I could diagnose the second one immediately... > It also doesn't work with the example.aln file here: > http://www.pasteur.fr/recherche/unites/sis/formation/python/ch11s06.html > http://www.pasteur.fr/recherche/unites/sis/formation/python/data/example.aln > > ...but throws a different error: > > code: > > c_align = Clustalw.parse_file('example.aln', alphabet=IUPAC.protein) > ================================= > Traceback (most recent call last): > ... > ValueError: Could not parse line: > *:::**:.**.** *.*** .:* *:******* > ==================== That looks like a bug in Biopython 1.47 (reported last month by Sebastian Bassi) where there was a problem parsing Clustal files where the first line of the consensus was blank (as here). It has been fixed in CVS... You should only need to replace the file Bio/AlignIO/ClustalIO.py with the latest version from CVS. If you would like guidance on how exactly to update your system please ask - one manual but fairly simple way is to backup and then replace this file: /Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/Bio/AlignIO/ClustalIO.py with the latest one from here: http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/AlignIO/ClustalIO.py?cvsroot=biopython If you are happy at the command line, then I would suggest get the latest version of Biopython from CVS and then re-install from source. Peter From biopython at maubp.freeserve.co.uk Tue Aug 5 21:08:45 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 5 Aug 2008 22:08:45 +0100 Subject: [BioPython] Clustalw.parse_file errors In-Reply-To: <4898B858.7040405@berkeley.edu> References: <4898B858.7040405@berkeley.edu> Message-ID: <320fb6e00808051408n25278f02ua89c02784f72dfc0@mail.gmail.com> On Tue, Aug 5, 2008 at 9:30 PM, Nick Matzke wrote: > Hi all, > > I'm running through the excellent biopython tutorial here: > http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#htoc100 I'm glad you are enjoying the Tutorial (apart from the parsing bug!). I can't take any credit for this bit ;) Seeing as you are trying to use the SummaryInfo class, I should mention that in Biopython 1.47 this doesn't work very well with generic alphabets. In some cases this means you have to supply some of the optional arguments like the characters to ignore (e.g. 
"-") which might otherwise be inferred from the alphabet. There have been some changes in CVS to try and address this. http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Align/AlignInfo.py?cvsroot=biopython > I've hit an error here: > 9.4.2 Creating your own substitution matrix from an alignment > http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#htoc100 > > ...basically the Clustalw parser won't parse even the given example > alignment file (protein.aln) or another example file from elsewhere > (example.aln). The good news is I've just checked protein.aln on my machine, and it can be parsed fine. This is using the CVS version of Biopython, but probably just updating the file .../Bio/AlignIO/ClustalIO.py as I suggested in my earlier email will fix this too. I've realised that our unit tests didn't include the example file protein.aln, otherwise we would have caught this earlier (when I made ClustalW parsing change). Its a bit late now, but I have just added protein.aln to the alignment parsing unit test for future validation. Peter From biopython at maubp.freeserve.co.uk Tue Aug 5 21:27:50 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 5 Aug 2008 22:27:50 +0100 Subject: [BioPython] Clustalw.parse_file errors In-Reply-To: <320fb6e00808051408n25278f02ua89c02784f72dfc0@mail.gmail.com> References: <4898B858.7040405@berkeley.edu> <320fb6e00808051408n25278f02ua89c02784f72dfc0@mail.gmail.com> Message-ID: <320fb6e00808051427x532f448r91aa8c35afe59541@mail.gmail.com> Peter wrote: > Nick Matzke wrote: >> Hi all, >> >> I'm running through the excellent biopython tutorial here: >> http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#htoc100 > > I'm glad you are enjoying the Tutorial (apart from the parsing bug!). > I can't take any credit for this bit ;) [I meant I can't take credit for this bit of the tutorial, however the bug was mine!] >> ...basically the Clustalw parser won't parse even the given example >> alignment file (protein.aln) or another example file from elsewhere >> (example.aln). I've also checked I can read this file too: http://www.pasteur.fr/recherche/unites/sis/formation/python/data/example.aln For example, >>> from Bio import AlignIO >>> a = AlignIO.read(open("/tmp/example.aln"), "clustal") >>> print a SingleLetterAlphabet() alignment with 12 rows and 1168 columns MESGHLLWALLFMQSLWPQLTDGATRVYYLGIRDVQWNYAPKGR...FKQ Q9C058 ... Currently Bio.AlignIO does not let you define the alphabet. See: http://bugzilla.open-bio.org/show_bug.cgi?id=2443 Alternatively, using Bio.Clustalw which does let you define an alphabet: >>> from Bio import Clustalw >>> from Bio.Alphabet import IUPAC, Gapped >>> a = Clustalw.parse_file("/tmp/example.aln", Gapped(IUPAC.protein,"-")) >>> print a Note that using Bio.Clustalw you get a sub-class of the generic alignment, which has a different str method (meaning "print a" will re-create the alignment in clustal format). Peter From matzke at berkeley.edu Tue Aug 5 21:45:56 2008 From: matzke at berkeley.edu (Nick Matzke) Date: Tue, 05 Aug 2008 14:45:56 -0700 Subject: [BioPython] biopython tutorial In-Reply-To: <4898B858.7040405@berkeley.edu> References: <4898B858.7040405@berkeley.edu> Message-ID: <4898CA14.7080400@berkeley.edu> Hi again, I just ran through the biopython tutorial, sections 1 through 9.5. It is really great, & thanks to the people who wrote it. While copying-pasting code etc. 
to try it on my own system I noticed a few typos & other minor issues which I figured I should make note of for Peter or whomever maintains it. Thanks again for the tutorial! Nick 1. my_blast_file = "m_cold.fasta" should be: my_blast_db = "m_cold.fasta" 2. record[0]["GBSeq_definition"] 'Opuntia subulata rpl16 gene, intron; chloroplast' ...should be (AFAICT): record['Bioseq-set_seq-set'][0]['Seq-entry_set']['Bioseq-set']['Bioseq-set_seq-set'][0]['Seq-entry_seq']['Bioseq']['Bioseq_descr']['Seq-descr'][2]['Seqdesc_title'] 3. >>> record[0]["GBSeq_source"] 'chloroplast Austrocylindropuntia subulata' ...the exact string 'chloroplast Austrocylindropuntia subulata' doesn't seem to exist in the downloaded data, so not sure what is meant... 4. the 814 hits are now 816 throughout 5. add links for prosite & swissprot db downloads 6. Nanoarchaeum equitans Kin4-M (RefSeq NC_005213, GenBank AE017199) which can be downloaded from the NCBI here (only 1.15 MB): link location is weird (only paren is linked) 7. ============ As the name suggests, this is a really simple consensus calculator, and will just add up all of the residues at each point in the consensus, and if the most common value is higher than some threshold value (the default is .3) will add the common residue to the consensus. If it doesn?t reach the threshold, it adds an ambiguity character to the consensus. The returned consensus object is Seq object whose alphabet is inferred from the alphabets of the sequences making up the consensus. So doing a print consensus would give: consensus Seq('TATACATNAAAGNAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGAAT ...', IUPACAmbiguousDNA()) You can adjust how dumb_consensus works by passing optional parameters: the threshold This is the threshold specifying how common a particular residue has to be at a position before it is added. The default is .7. ============ Is the default 0.3 or 0.7 -- I assume 0.7 for DNA. 8. info_content = summary_align.information_content(5, 30, log_base = 10 chars_to_ignore = ['N']) missing comma 9. 9.4.1 Using common substitution matrices blank 10. in PDB section: for model in structure.get_list() for chain in model.get_list(): for residue in chain.get_list(): ...first line needs colon (:) happens again lower down: for model in structure.get_list() for chain in model.get_list(): for residue in chain.get_list(): 11. from PDBParser import PDBParser should be: from Bio.PDB.PDBParser import PDBParser From biopython at maubp.freeserve.co.uk Tue Aug 5 21:52:36 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 5 Aug 2008 22:52:36 +0100 Subject: [BioPython] SeqRecord to Genbank: use SeqIO? In-Reply-To: <0C6CF316-7956-4C60-BAD8-F2962A1D6C60@u.washington.edu> References: <0B3D0CC5-820B-4D98-8EB6-6ED6F7D2B6C8@u.washington.edu> <320fb6e00808010220k647f7dcfw71e58ff09a457c32@mail.gmail.com> <320fb6e00808010321o2aaec1a3je6a3a8d263769931@mail.gmail.com> <0C6CF316-7956-4C60-BAD8-F2962A1D6C60@u.washington.edu> Message-ID: <320fb6e00808051452o1a3cf5bqbdccb97dd0fa7c9d@mail.gmail.com> Hi Cedar, Did you mean to send this to me personally? I hope you don't mind me sending this reply to the list too. > Thank you all for your replies. > >>> The problem is, it doesn't support something like: >>> SeqIO.write(seq_record, out_file_handle, "genbank") >>> Because it requires an iterable object I guess? 
> >> Yes, you would have to use this: >> SeqIO.write([seq_record], out_file_handle, "genbank") > > This suggestion makes sense, but when I try it, I get: > > File "downloader.py", line 40, in > SeqIO.write([record], out_file_handle, "genbank") > File "/sw/lib/python2.5/site-packages/Bio/SeqIO/__init__.py", line 238, in > write > raise ValueError("Unknown format '%s'" % format) > ValueError: Unknown format 'genbank' > > and here is the line 40 of code it refers to: > SeqIO.write([record], out_file_handle, "genbank") > > I'm running 1.47 installed via fink. Right - because in Biopython 1.47, Bio.SeqIO don't support GenBank output (as I had tried to make clear). Earlier this week I committed very preliminary support for writing GenBank files with Bio.SeqIO to CVS. Please add yourself as a CC on Bug 2294 if you want to be kept apprised of this. http://bugzilla.open-bio.org/show_bug.cgi?id=2294 Would it help if the error message for this situation was a little more precise? e.g. Rather than "Unknown format 'xxx'", perhaps "Writing 'xxx' format is not supported yet, only reading it". >> Could I ask why you want to get the SeqRecord as a string in GenBank >> format? > > Thanks for the tip for how to get a string. I want to be able to present a > genbank file inline in a webpage. Also during trouble shooting, I was trying > to read a genbank file in, then print it to the console, just to make sure > things were working. OK - wanting a SeqRecord as a string for embedding in a webpage this makes perfect sense. For debugging, "print record" should give you a human readable output (but it isn't in any particular format). You have explicitly asked about SeqRecord to GenBank, but as an aside, the Tutorial does (briefly) talk about using Bio.GenBank to get a "genbank record" rather than a SeqRecord object. This is a simple and direct representation of the raw GenBank fields, and it should be possible to use this to almost recreate the GenBank file. >>> from Bio import GenBank >>> gb_iterator = GenBank.Iterator(open("cor6_6.gb"), GenBank.RecordParser()) >>> for cur_record in gb_iterator : print cur_record This won't be 100% the same as the input file, but it is close. > I'm probably way out of line here, because frankly, I'm not the best python > coder, and I haven't contributed a thing to biopython, but here it is > anyway: > > I don't understand why SeqIO must write to a handle anyway. I think > something like: > > file_handle.write(SeqIO.to_string([record], "genbank")) > > is just as easy as the existing method, and has the advantage of giving us > the option of just getting a string like: > > genbank_string = SeqIO.to_string([record], "genbank") When we first discussed the proposed SeqIO interface, handles were seen as a sensible common abstraction. The desire to get a string was discussed but (as I recall) was not considered to be as common as wanting to write to a file. In fact web-server applications are still the only example I can think of right now, and the StringIO solution or the "to string method" discussed below cover this. > And while I'm at it, I think even easier would be: > > file_handle.write(record.to_format("genbank")) > and > genbank_string = record.to_format("genbank") > > would be even easier. If you have any preference on the precise function name, please add a comment on Bug 2561. http://bugzilla.open-bio.org/show_bug.cgi?id=2561 > In any case, biopython make my life much easier, and I appreciate it! 
> best, > Cedar Great :) Peter From biopython at maubp.freeserve.co.uk Tue Aug 5 21:57:45 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 5 Aug 2008 22:57:45 +0100 Subject: [BioPython] Clustalw.parse_file errors In-Reply-To: <4898C88D.6010507@berkeley.edu> References: <4898B858.7040405@berkeley.edu> <320fb6e00808051408n25278f02ua89c02784f72dfc0@mail.gmail.com> <4898C88D.6010507@berkeley.edu> Message-ID: <320fb6e00808051457h754ddebfm5e102a570c544cb2@mail.gmail.com> On Tue, Aug 5, 2008 at 10:39 PM, Nick Matzke wrote: > Thanks for the help Peter, it really is a great tutorial! > > I've replaced just the ClustalIO.py file as you suggested, and it parses > both the example.aln and protein.aln files. Good :) > However I tried an ClustalW-formatted alignment file I made awhile ago with > my own data and still got the star_info error: > > AttributeError: Alignment instance has no attribute '_star_info' > > But my file could be weird. Does the _star_info error indicate alphabet > issues or something? The _star_info is a nasty private variable used to store the ClustalW consensus, used if writing the file back out again in clustal format. The error suggests something else has gone wrong with the consensus parsing... (and shouldn't be anything to do with the alphabet). Could you file a bug, and (after filing the bug) could you upload one of these example files to the bug as an attachment please? Peter From peter at maubp.freeserve.co.uk Tue Aug 5 22:28:29 2008 From: peter at maubp.freeserve.co.uk (Peter) Date: Tue, 5 Aug 2008 23:28:29 +0100 Subject: [BioPython] biopython tutorial In-Reply-To: <4898CA14.7080400@berkeley.edu> References: <4898B858.7040405@berkeley.edu> <4898CA14.7080400@berkeley.edu> Message-ID: <320fb6e00808051528q3e1a8bccx14dce611a2e6cbba@mail.gmail.com> On Tue, Aug 5, 2008 at 10:45 PM, Nick Matzke wrote: > Hi again, > > I just ran through the biopython tutorial, sections 1 through 9.5. It is > really great, & thanks to the people who wrote it. On behalf of all the other authors, thank you :) > While copying-pasting code etc. to try it on my own system I noticed a few > typos & other minor issues which I figured I should make note of for Peter > or whomever maintains it. Although I have made plenty of changes and updates to the tutorial, its still a joint effort. I probably tend to make more little fixes than other people, which shows up more on the CVS history! Little things like this are always worth pointing out - and comments from new-comers and beginners can be extra helpful if they reveal assumptions or other things that could be clearer. > 1. > my_blast_file = "m_cold.fasta" > should be: > my_blast_db = "m_cold.fasta" I may have misunderstood you, but I think its correct. There are two important things for a BLAST search, the input file (here the FASTA file m_cold.fasta) and the database to search against (in the example b. subtilis sequences). > 2. > record[0]["GBSeq_definition"] > 'Opuntia subulata rpl16 gene, intron; chloroplast' > > ...should be (AFAICT): Something strange is going on - the NCBI didn't give me XML by default as I expected: from Bio import Entrez handle = Entrez.efetch(db="nucleotide", id="57240072", email="A.N.Other at example.com") data = handle.read() print data[:100] It looks like the NCBI may have changed something - Michiel? > 4. > the 814 hits are now 816 throughout That number is always going to increase - maybe we can reword things slightly to make it clear that may not be exactly what the user will see. > 5. 
> add links for prosite & swissprot db downloads Where would you add these, and which URLs did you have in mind? > 6. > Nanoarchaeum equitans Kin4-M (RefSeq NC_005213, GenBank AE017199) which can > be downloaded from the NCBI here (only 1.15 MB): > > link location is weird (only paren is linked) Whoops - both the PDF and HTML are like that... looks like a mix up in the LaTeX syntax. Fixed in CVS. > > 7. > ============ > As the name suggests, this is a really simple consensus calculator, and will > just add up all of the residues at each point in the consensus, and if the > most common value is higher than some threshold value (the default is .3) > will add the common residue to the consensus. If it doesn't reach the > threshold, it adds an ambiguity character to the consensus. The returned > consensus object is Seq object whose alphabet is inferred from the alphabets > of the sequences making up the consensus. So doing a print consensus would > give: > > consensus Seq('TATACATNAAAGNAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGAAT > ...', IUPACAmbiguousDNA()) > > You can adjust how dumb_consensus works by passing optional parameters: > > the threshold > This is the threshold specifying how common a particular residue has to > be at a position before it is added. The default is .7. > ============ > > Is the default 0.3 or 0.7 -- I assume 0.7 for DNA. The default is 0.7 for any sequence type (DNA, protein, etc). Do you mean which way round is the percentage counted (the letter has to be above 70% I think)? > 8. > info_content = summary_align.information_content(5, 30, log_base = 10 > chars_to_ignore = ['N']) > missing comma Fixed in CVS. > 9. > 9.4.1 Using common substitution matrices > > blank So it is - would anyone like to write something for this? > 10. > in PDB section: > > for model in structure.get_list() > for chain in model.get_list(): > for residue in chain.get_list(): > > ...first line needs colon (:) > > happens again lower down: > for model in structure.get_list() > for chain in model.get_list(): > for residue in chain.get_list(): > Fixed two of these in CVS. > 11. > from PDBParser import PDBParser > > should be: > > from Bio.PDB.PDBParser import PDBParser Fixed in CVS. Note that we don't normally update the online copies of the HTML and PDF tutorial between releases (so as to avoid talking about unreleased features). However, there have been a few updates to the Tutorial since Biopython 1.47 so maybe we should consider it? Thanks again Nick! Peter From matzke at berkeley.edu Tue Aug 5 22:41:46 2008 From: matzke at berkeley.edu (Nick Matzke) Date: Tue, 05 Aug 2008 15:41:46 -0700 Subject: [BioPython] biopython tutorial In-Reply-To: <320fb6e00808051528q3e1a8bccx14dce611a2e6cbba@mail.gmail.com> References: <4898B858.7040405@berkeley.edu> <4898CA14.7080400@berkeley.edu> <320fb6e00808051528q3e1a8bccx14dce611a2e6cbba@mail.gmail.com> Message-ID: <4898D72A.8030207@berkeley.edu> Peter wrote: > On Tue, Aug 5, 2008 at 10:45 PM, Nick Matzke wrote: >> Hi again, >> >> I just ran through the biopython tutorial, sections 1 through 9.5. It is >> really great, & thanks to the people who wrote it. > > On behalf of all the other authors, thank you :) > >> While copying-pasting code etc. to try it on my own system I noticed a few >> typos & other minor issues which I figured I should make note of for Peter >> or whomever maintains it. > > Although I have made plenty of changes and updates to the tutorial, > its still a joint effort. 
I probably tend to make more little fixes > than other people, which shows up more on the CVS history! > > Little things like this are always worth pointing out - and comments > from new-comers and beginners can be extra helpful if they reveal > assumptions or other things that could be clearer. > >> 1. >> my_blast_file = "m_cold.fasta" >> should be: >> my_blast_db = "m_cold.fasta" > > I may have misunderstood you, but I think its correct. There are two > important things for a BLAST search, the input file (here the FASTA > file m_cold.fasta) and the database to search against (in the example > b. subtilis sequences). Yeah sorry, I was confused there but forgot to fix my note after I figured it out! > >> 2. >> record[0]["GBSeq_definition"] >> 'Opuntia subulata rpl16 gene, intron; chloroplast' >> >> ...should be (AFAICT): > > Something strange is going on - the NCBI didn't give me XML by default > as I expected: > > from Bio import Entrez > handle = Entrez.efetch(db="nucleotide", id="57240072", > email="A.N.Other at example.com") > data = handle.read() > print data[:100] > > It looks like the NCBI may have changed something - Michiel? > >> 4. >> the 814 hits are now 816 throughout > > That number is always going to increase - maybe we can reword things > slightly to make it clear that may not be exactly what the user will > see. Yeah I figured it was this no worries. If you want to be OCD like I apparently am you could add a note to this effect. >> 5. >> add links for prosite & swissprot db downloads > > Where would you add these, and which URLs did you have in mind? I was thinking in this section: ======== To parse a file that contains more than one Swiss-Prot record, we use the parse function instead. This function allows us to iterate over the records in the file. For example, let?s parse the full Swiss-Prot database and collect all the descriptions. The full Swiss-Prot database, downloaded from ExPASy on 4 December 2007, contains 290484 Swiss-Prot records in a single gzipped-file uniprot_sprot.dat.gz. ======== ...it could link to: ftp://ca.expasy.org/databases/uniprot/current_release/knowledgebase/complete ...and in this section: ======== In general, a Prosite file can contain more than one Prosite records. For example, the full set of Prosite records, which can be downloaded as a single file (prosite.dat) from ExPASy, contains 2073 records in (version 20.24 released on 4 December 2007). To parse such a file, we again make use of an iterator: ======== ...it could link to: ftp://ftp.expasy.org/databases/prosite/ I found these without too much trouble on my own of course but might be handy for newbies. Also, the tutorial might give an estimate of how long it will take to parse the full Swiss-Prot DB, I waited a few minutes & then decided to move on. Maybe a smaller file or subset with just e.g. 100 records would be appropriate for the tutorial? > >> 6. >> Nanoarchaeum equitans Kin4-M (RefSeq NC_005213, GenBank AE017199) which can >> be downloaded from the NCBI here (only 1.15 MB): >> >> link location is weird (only paren is linked) > > Whoops - both the PDF and HTML are like that... looks like a mix up in > the LaTeX syntax. Fixed in CVS. > >> 7. >> ============ >> As the name suggests, this is a really simple consensus calculator, and will >> just add up all of the residues at each point in the consensus, and if the >> most common value is higher than some threshold value (the default is .3) >> will add the common residue to the consensus. 
If it doesn't reach the >> threshold, it adds an ambiguity character to the consensus. The returned >> consensus object is Seq object whose alphabet is inferred from the alphabets >> of the sequences making up the consensus. So doing a print consensus would >> give: >> >> consensus Seq('TATACATNAAAGNAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGAAT >> ...', IUPACAmbiguousDNA()) >> >> You can adjust how dumb_consensus works by passing optional parameters: >> >> the threshold >> This is the threshold specifying how common a particular residue has to >> be at a position before it is added. The default is .7. >> ============ >> >> Is the default 0.3 or 0.7 -- I assume 0.7 for DNA. > > The default is 0.7 for any sequence type (DNA, protein, etc). Do you > mean which way round is the percentage counted (the letter has to be > above 70% I think)? I meant that this sentence in the above para: "if the most common value is higher than some threshold value (the default is .3)" should probably just say 0.7 I think. Thanks! Nick > >> 8. >> info_content = summary_align.information_content(5, 30, log_base = 10 >> chars_to_ignore = ['N']) >> missing comma > > Fixed in CVS. > >> 9. >> 9.4.1 Using common substitution matrices >> >> blank > > So it is - would anyone like to write something for this? > >> 10. >> in PDB section: >> >> for model in structure.get_list() >> for chain in model.get_list(): >> for residue in chain.get_list(): >> >> ...first line needs colon (:) >> >> happens again lower down: >> for model in structure.get_list() >> for chain in model.get_list(): >> for residue in chain.get_list(): >> > > Fixed two of these in CVS. > >> 11. >> from PDBParser import PDBParser >> >> should be: >> >> from Bio.PDB.PDBParser import PDBParser > > Fixed in CVS. > > Note that we don't normally update the online copies of the HTML and > PDF tutorial between releases (so as to avoid talking about unreleased > features). However, there have been a few updates to the Tutorial > since Biopython 1.47 so maybe we should consider it? > > Thanks again Nick! > > Peter > -- ==================================================== Nicholas J. Matzke Ph.D. student, Graduate Student Researcher Huelsenbeck Lab 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab website: http://ib.berkeley.edu/people/lab_detail.php?lab=54 Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/~edna/lab_test/members/matzke.html Lab phone: 510-643-6299 Dept. fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Office hours for Bio1B, Spring 2008: Biology: Plants, Evolution, Ecology VLSB 2013, Monday 1-1:30 (some TA there for all hours during work week) Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ==================================================== From peter at maubp.freeserve.co.uk Tue Aug 5 22:55:30 2008 From: peter at maubp.freeserve.co.uk (Peter) Date: Tue, 5 Aug 2008 23:55:30 +0100 Subject: [BioPython] biopython tutorial In-Reply-To: <4898D72A.8030207@berkeley.edu> References: <4898B858.7040405@berkeley.edu> <4898CA14.7080400@berkeley.edu> <320fb6e00808051528q3e1a8bccx14dce611a2e6cbba@mail.gmail.com> <4898D72A.8030207@berkeley.edu> Message-ID: <320fb6e00808051555v5bca752bg49da9199d5c682c1@mail.gmail.com> >>> 4. 
>>> the 814 hits are now 816 throughout >> >> That number is always going to increase - maybe we can reword things >> slightly to make it clear that may not be exactly what the user will >> see. > > Yeah I figured it was this no worries. I might do that tomorrow (along with the links below...) > If you want to be OCD like I apparently am you could add a note to this effect. Having a perfectionist looking after documentation or a website can be a good thing. >>> 5. >>> add links for prosite & swissprot db downloads >> >> Where would you add these, and which URLs did you have in mind? > > > I was thinking in this section: > > ======== > To parse a file that contains more than one Swiss-Prot record, we use the > parse function instead. This function allows us to iterate over the records > in the file. For example, let's parse the full Swiss-Prot database and > collect all the descriptions. The full Swiss-Prot database, downloaded from > ExPASy on 4 December 2007, contains 290484 Swiss-Prot records in a single > gzipped-file uniprot_sprot.dat.gz. > ======== > > ...it could link to: > ftp://ca.expasy.org/databases/uniprot/current_release/knowledgebase/complete > > ...and in this section: > > ======== > In general, a Prosite file can contain more than one Prosite records. For > example, the full set of Prosite records, which can be downloaded as a > single file (prosite.dat) from ExPASy, contains 2073 records in (version > 20.24 released on 4 December 2007). To parse such a file, we again make use > of an iterator: > ======== > > ...it could link to: > ftp://ftp.expasy.org/databases/prosite/ > > I found these without too much trouble on my own of course but might be > handy for newbies. That looks sensible... > Also, the tutorial might give an estimate of how long it will take to parse > the full Swiss-Prot DB, I waited a few minutes & then decided to move on. > Maybe a smaller file or subset with just e.g. 100 records would be > appropriate for the tutorial? It will depend very much on the computer (hard drive mostly). As I recall somewhere between 2 and 10 minutes sounds about right. >>> 7. >>> ============ >>> As the name suggests, this is a really simple consensus calculator, and >>> will ... >> >> The default is 0.7 for any sequence type (DNA, protein, etc). Do you >> mean which way round is the percentage counted (the letter has to be >> above 70% I think)? > > I meant that this sentence in the above para: "if the most common value is > higher than some threshold value (the default is .3)" should probably just > say 0.7 I think. I see it now, fixed in CVS. Thanks! Peter From matzke at berkeley.edu Tue Aug 5 23:19:23 2008 From: matzke at berkeley.edu (Nick Matzke) Date: Tue, 05 Aug 2008 16:19:23 -0700 Subject: [BioPython] Clustalw.parse_file errors In-Reply-To: <320fb6e00808051457h754ddebfm5e102a570c544cb2@mail.gmail.com> References: <4898B858.7040405@berkeley.edu> <320fb6e00808051408n25278f02ua89c02784f72dfc0@mail.gmail.com> <4898C88D.6010507@berkeley.edu> <320fb6e00808051457h754ddebfm5e102a570c544cb2@mail.gmail.com> Message-ID: <4898DFFB.3010704@berkeley.edu> Never mind, it turns out my alignment file was missing a blank line after each section of the alignment. The .aln file doesn't have to have a consensus line with "*", ":" characters in it necessarily, but it does have to have at least a line of spaces of the length of the aligned block (this is what protein.aln has). I inserted a line of spaces after each chunk of the alignment and now it parses. 
(My alignment wasn't generated by Clustal anyway, so I also added this header line to make the parser happy: "CLUSTAL W (1.83) formatted alignment done with PROMALS3D") I.e. for future readers (truncating my .aln file)... ...this got the _star_info error: ================================ CLUSTAL W (1.83) formatted alignment done with PROMALS3D SctN_Salt ----------------MKNEL--------------------------------------- SctN_EHEC MISEHDSVLEKYPRIQKVLNST-------------------------------------- SctN_Chrm ---------MRLPDIRLIENTL-------------------------------------- SctN_Yers ---------MKLPDIARLTPRL-------------------------------------- SctN_Soda ----------MTCNSQRLASML-------------------------------------- SctN_Laws ----------------MALEYI-------------------------------------- SctN_Chl4 ----------------MEEITTE------------------------------------- SctN_Salt --------------------------MQRLRLKYPPP---------DGYCR--------W SctN_EHEC --------------------------VPALSLN-------------SSTRY--------E SctN_Chrm --------------------------RERLTLAPA---PPGQR---SGVEL--------F SctN_Yers --------------------------QQQLTRPSAPP---------EGLRY--------R SctN_Soda --------------------------AQHLTPVDEPP---------DGYRL--------T SctN_Laws --------------------------ASLLEEAVQNT---------SPVEV--------R SctN_Chl4 --------------------------FNTLMTELPDV---------QLTAV--------V =================================== ...but this parsed successfully: ================================ CLUSTAL W (1.83) formatted alignment done with PROMALS3D SctN_Salt ----------------MKNEL--------------------------------------- SctN_EHEC MISEHDSVLEKYPRIQKVLNST-------------------------------------- SctN_Chrm ---------MRLPDIRLIENTL-------------------------------------- SctN_Yers ---------MKLPDIARLTPRL-------------------------------------- SctN_Soda ----------MTCNSQRLASML-------------------------------------- SctN_Laws ----------------MALEYI-------------------------------------- SctN_Chl4 ----------------MEEITTE------------------------------------- SctN_Salt --------------------------MQRLRLKYPPP---------DGYCR--------W SctN_EHEC --------------------------VPALSLN-------------SSTRY--------E SctN_Chrm --------------------------RERLTLAPA---PPGQR---SGVEL--------F SctN_Yers --------------------------QQQLTRPSAPP---------EGLRY--------R SctN_Soda --------------------------AQHLTPVDEPP---------DGYRL--------T SctN_Laws --------------------------ASLLEEAVQNT---------SPVEV--------R SctN_Chl4 --------------------------FNTLMTELPDV---------QLTAV--------V =================================== ...the difference is that the first blank line after the block must be spaces (or consensus characters *:. etc.), not just a blank line. Thanks for the hints! Nick Peter wrote: > On Tue, Aug 5, 2008 at 10:39 PM, Nick Matzke wrote: >> Thanks for the help Peter, it really is a great tutorial! >> >> I've replaced just the ClustalIO.py file as you suggested, and it parses >> both the example.aln and protein.aln files. > > Good :) > >> However I tried an ClustalW-formatted alignment file I made awhile ago with >> my own data and still got the star_info error: >> >> AttributeError: Alignment instance has no attribute '_star_info' >> >> But my file could be weird. Does the _star_info error indicate alphabet >> issues or something? > > The _star_info is a nasty private variable used to store the ClustalW > consensus, used if writing the file back out again in clustal format. > The error suggests something else has gone wrong with the consensus > parsing... (and shouldn't be anything to do with the alphabet). 
> > Could you file a bug, and (after filing the bug) could you upload one > of these example files to the bug as an attachment please? > > Peter > -- ==================================================== Nicholas J. Matzke Ph.D. student, Graduate Student Researcher Huelsenbeck Lab 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab website: http://ib.berkeley.edu/people/lab_detail.php?lab=54 Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/~edna/lab_test/members/matzke.html Lab phone: 510-643-6299 Dept. fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Office hours for Bio1B, Spring 2008: Biology: Plants, Evolution, Ecology VLSB 2013, Monday 1-1:30 (some TA there for all hours during work week) Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ==================================================== From cmckay at u.washington.edu Tue Aug 5 22:47:53 2008 From: cmckay at u.washington.edu (Cedar McKay) Date: Tue, 5 Aug 2008 15:47:53 -0700 Subject: [BioPython] SeqRecord to Genbank: use SeqIO? In-Reply-To: <320fb6e00808051452o1a3cf5bqbdccb97dd0fa7c9d@mail.gmail.com> References: <0B3D0CC5-820B-4D98-8EB6-6ED6F7D2B6C8@u.washington.edu> <320fb6e00808010220k647f7dcfw71e58ff09a457c32@mail.gmail.com> <320fb6e00808010321o2aaec1a3je6a3a8d263769931@mail.gmail.com> <0C6CF316-7956-4C60-BAD8-F2962A1D6C60@u.washington.edu> <320fb6e00808051452o1a3cf5bqbdccb97dd0fa7c9d@mail.gmail.com> Message-ID: On Aug 5, 2008, at 2:52 PM, Peter wrote: > Right - because in Biopython 1.47, Bio.SeqIO don't support GenBank > output (as I had tried to make clear). Earlier this week I committed > very preliminary support for writing GenBank files with Bio.SeqIO to > CVS. Please add yourself as a CC on Bug 2294 if you want to be kept > apprised of this. > http://bugzilla.open-bio.org/show_bug.cgi?id=2294 > Aha! I see. The following is on the SeqIO wiki page (http://www.biopython.org/wiki/SeqIO ): "If you supply the sequences as a SeqRecord iterator, then for sequential file formats like Fasta or GenBank, the records can be written one by one" I think I wrongly thought this implied that Genbank Records can be written. But I see now that isn't the case, and the "Fasta or GenBank" files it references must be the input files that are parsed, not the format of the output. I'm looking forward to this functionality. > Would it help if the error message for this situation was a little > more precise? e.g. Rather than "Unknown format 'xxx'", perhaps > "Writing 'xxx' format is not supported yet, only reading it". > I think your new suggested message is more clear, but the existing one is clear enough. I simply thought there was a problem because I had it in my mind that genbank writing was now supported. > If you have any preference on the precise function name, please add a > comment on Bug 2561. > http://bugzilla.open-bio.org/show_bug.cgi?id=2561 I have no particular preference. Thanks again for the help. cedar From mjldehoon at yahoo.com Wed Aug 6 00:42:53 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 5 Aug 2008 17:42:53 -0700 (PDT) Subject: [BioPython] biopython tutorial In-Reply-To: <320fb6e00808051528q3e1a8bccx14dce611a2e6cbba@mail.gmail.com> Message-ID: <387391.14812.qm@web62412.mail.re1.yahoo.com> --- On Tue, 8/5/08, Peter wrote: > > 2. 
> > record[0]["GBSeq_definition"] > > 'Opuntia subulata rpl16 gene, intron; > chloroplast' > > > > ...should be (AFAICT): > > Something strange is going on - the NCBI didn't give me > XML by default > as I expected: > > from Bio import Entrez > handle = Entrez.efetch(db="nucleotide", > id="57240072", > email="A.N.Other at example.com") > data = handle.read() > print data[:100] > > It looks like the NCBI may have changed something - > Michiel? In this example, retmode='xml' was missing for the current efetch. With retmode='xml', the rest of the example in the Tutorial works correctly. Confusingly, if you use rettype='xml' you will get a different XML output (this is the XML output Nick was looking at). I fixed this section in CVS. --Michiel From cmckay at u.washington.edu Wed Aug 6 00:36:39 2008 From: cmckay at u.washington.edu (Cedar McKay) Date: Tue, 5 Aug 2008 17:36:39 -0700 Subject: [BioPython] Bio.SeqFeature.SeqFeature Message-ID: <0CDEE8AC-1271-4617-B333-C18615970E23@u.washington.edu> Hello. I just upgraded from 1.44 to 1.47 and one of my home-brew classes stopped working. This used to work: class BetterSeqFeature(Bio.SeqFeature.SeqFeature, fastaTools): """Extends the Bio.SeqFeature.SeqFeature class with several new methods and attributes. The Bio.SeqFeature.SeqFeature represents individual features of a genbank record, for example a CDS.""" But now after the upgrade to 1.47 my script throws this: File "/usr/local/python/Rocap.py", line 740, in class BetterSeqFeature(Bio.SeqFeature.SeqFeature, fastaTools): AttributeError: 'module' object has no attribute 'SeqFeature' FYI fastaTools is a class of my own. I looked at the changelog, and don't see anything obvious. Can anyone give me a pointer as to why this stopped working? best, Cedar From biopython at maubp.freeserve.co.uk Wed Aug 6 08:57:45 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 6 Aug 2008 09:57:45 +0100 Subject: [BioPython] biopython tutorial In-Reply-To: <387391.14812.qm@web62412.mail.re1.yahoo.com> References: <320fb6e00808051528q3e1a8bccx14dce611a2e6cbba@mail.gmail.com> <387391.14812.qm@web62412.mail.re1.yahoo.com> Message-ID: <320fb6e00808060157x2c495e55gcd4946f5b241d7fd@mail.gmail.com> >> Something strange is going on - the NCBI didn't give me >> XML by default as I expected: >> >> from Bio import Entrez >> handle = Entrez.efetch(db="nucleotide", >> id="57240072", >> email="A.N.Other at example.com") >> data = handle.read() >> print data[:100] >> >> It looks like the NCBI may have changed something - >> Michiel? > > In this example, retmode='xml' was missing for the current efetch. > With retmode='xml', the rest of the example in the Tutorial works correctly. Confusingly, if you use rettype='xml' you will get a different XML output (this is the XML output Nick was looking at). That sort of explains things - I had tried rettype="xml" last night, and got something in XML back which didn't look right. The NCBI have a table of retmode versus rettype which does point out there is more than one XML variant you can get back! 
http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html Peter From biopython at maubp.freeserve.co.uk Wed Aug 6 09:03:54 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 6 Aug 2008 10:03:54 +0100 Subject: [BioPython] Bio.SeqFeature.SeqFeature In-Reply-To: <0CDEE8AC-1271-4617-B333-C18615970E23@u.washington.edu> References: <0CDEE8AC-1271-4617-B333-C18615970E23@u.washington.edu> Message-ID: <320fb6e00808060203s7e386ed7t6d40cd4dff4c43bc@mail.gmail.com> On Wed, Aug 6, 2008 at 1:36 AM, Cedar McKay wrote: > Hello. I just upgraded from 1.44 to 1.47 and one of my home-brew classes > stopped working. This used to work: > > class BetterSeqFeature(Bio.SeqFeature.SeqFeature, fastaTools): > """Extends the Bio.SeqFeature.SeqFeature class with several new > methods and attributes. > The Bio.SeqFeature.SeqFeature represents individual features of a > genbank record, > for example a CDS.""" > > > But now after the upgrade to 1.47 my script throws this: > > File "/usr/local/python/Rocap.py", line 740, in > class BetterSeqFeature(Bio.SeqFeature.SeqFeature, fastaTools): > AttributeError: 'module' object has no attribute 'SeqFeature' > > > FYI fastaTools is a class of my own. > > I looked at the changelog, and don't see anything obvious. Can anyone give > me a pointer as to why this stopped working? How are you importing the SeqFeature? You would get that error message if you didn't import the SeqFeature at all. Perhaps something subtle has changed there (i.e. you may have been relying on another module importing it on your behalf). You could try: import Bio.SeqFeature class BetterSeqFeature(Bio.SeqFeature.SeqFeature, fastaTools): #... or, from Bio.SeqFeature import SeqFeature class BetterSeqFeature(SeqFeature, fastaTools): #... Peter From biopython at maubp.freeserve.co.uk Wed Aug 6 09:27:17 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 6 Aug 2008 10:27:17 +0100 Subject: [BioPython] SeqRecord to Genbank: use SeqIO? In-Reply-To: References: <0B3D0CC5-820B-4D98-8EB6-6ED6F7D2B6C8@u.washington.edu> <320fb6e00808010220k647f7dcfw71e58ff09a457c32@mail.gmail.com> <320fb6e00808010321o2aaec1a3je6a3a8d263769931@mail.gmail.com> <0C6CF316-7956-4C60-BAD8-F2962A1D6C60@u.washington.edu> <320fb6e00808051452o1a3cf5bqbdccb97dd0fa7c9d@mail.gmail.com> Message-ID: <320fb6e00808060227l25ccc76bs578f6b6635cdec65@mail.gmail.com> On Tue, Aug 5, 2008 at 11:47 PM, Cedar McKay wrote: > > Aha! I see. The following is on the SeqIO wiki page > (http://www.biopython.org/wiki/SeqIO): > > "If you supply the sequences as a SeqRecord iterator, then for sequential > file formats like Fasta or GenBank, the records can be written one by one" > > I think I wrongly thought this implied that Genbank Records can be written. > But I see now that isn't the case, and the "Fasta or GenBank" files it > references must be the input files that are parsed, not the format of the > output. I'm looking forward to this functionality. I agree with you - in hindsight that bit of the wiki is misleading. Sorry about that. I was using GenBank as an example of a sequential file format where the records can be written one by one (unlike for example Clustal or most multiple sequence alignment formats where the records are interleaved). This is true, and a valid example of what I meant by a "sequential file format" - as are SwissProt and EMBL. However, this wording did wrongly give the impression that Bio.SeqIO could write GenBank files (which Biopython 1.47 can't do). 
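To spell out what does work in Biopython 1.47, here is a minimal sketch of the record-by-record conversion the wiki sentence was actually describing, assuming a hypothetical GenBank input file my_records.gb:

from Bio import SeqIO
# GenBank is fine as *input*; writing GenBank is the part still pending
input_handle = open("my_records.gb")
output_handle = open("my_records.faa", "w")
records = SeqIO.parse(input_handle, "genbank")
SeqIO.write(records, output_handle, "fasta")
output_handle.close()
input_handle.close()

Asking for "genbank" instead of "fasta" as the output format is exactly what triggers the ValueError discussed next.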
>> Would it help if the error message for this situation was a little >> more precise? e.g. Rather than "Unknown format 'xxx'", perhaps >> "Writing 'xxx' format is not supported yet, only reading it". >> > I think your new suggested message is more clear, but the existing one is > clear enough. I simply thought there was a problem because I had it in my > mind that genbank writing was now supported. I've updated Bio.SeqIO and Bio.AlignIO, so that they will say: ValueError: Reading format 'xxx' is supported, but not writing rather than: ValueError: Unknown format 'xxx' when the format is known (but only as an input format). I think this is more helpful and more accurate. Peter From biopython at maubp.freeserve.co.uk Wed Aug 6 17:10:02 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 6 Aug 2008 18:10:02 +0100 Subject: [BioPython] biopython tutorial In-Reply-To: <320fb6e00808051555v5bca752bg49da9199d5c682c1@mail.gmail.com> References: <4898B858.7040405@berkeley.edu> <4898CA14.7080400@berkeley.edu> <320fb6e00808051528q3e1a8bccx14dce611a2e6cbba@mail.gmail.com> <4898D72A.8030207@berkeley.edu> <320fb6e00808051555v5bca752bg49da9199d5c682c1@mail.gmail.com> Message-ID: <320fb6e00808061010r3fa01e7m78284b3f54cf51bb@mail.gmail.com> On Tue, Aug 5, 2008 at 11:55 PM, Peter wrote: >>>> 4. >>>> the 814 hits are now 816 throughout >>> >>> That number is always going to increase - maybe we can reword things >>> slightly to make it clear that may not be exactly what the user will >>> see. Michiel has changed the wording slightly here in CVS. >>>> 5. >>>> add links for prosite & swissprot db downloads I've added those links in CVS, and mentioned the SwissProt example takes about seven minutes on my machine. Peter From biopython at maubp.freeserve.co.uk Thu Aug 7 10:11:22 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 7 Aug 2008 11:11:22 +0100 Subject: [BioPython] SeqRecord to file format as string Message-ID: <320fb6e00808070311s40ff42e6tae265d2aa37ee684@mail.gmail.com> Following discussion both here and on the development mailing list, the SeqRecord and the Alignment objects are getting methods to present the object as a string in a requested file format (using Bio.SeqIO and Bio.AlignIO internally). This is enhancement Bug 2561, http://bugzilla.open-bio.org/show_bug.cgi?id=2561 I've added a .format() method, which takes a format name used in Bio.SeqIO or Bio.AlignIO. The name of this method final until it is part of an official Biopython release, so if there are any strong views on this please voice them sooner rather than later. 
For an example of how this works, if your have a SeqRecord object in variable record, you could do: print record.format("fasta") print record.format("tab") (or any other output format name supported in Bio.SeqIO for a single record) Similarly, if you had an Alignment object in variable align, you could do: print align.format("fasta") print align.format("clustal") print align.format("stockholm") (or any other output format name supported in Bio.AlignIO) This functionality will also be available via the special format() function being added to Python 2.6 and 3.0, giving the alternative: print format(align, "fasta") See PEP 3101 for details about the format system, http://www.python.org/dev/peps/pep-3101/ Peter From biopython at maubp.freeserve.co.uk Thu Aug 7 12:06:08 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 7 Aug 2008 13:06:08 +0100 Subject: [BioPython] Clustalw.parse_file errors In-Reply-To: <4898DFFB.3010704@berkeley.edu> References: <4898B858.7040405@berkeley.edu> <320fb6e00808051408n25278f02ua89c02784f72dfc0@mail.gmail.com> <4898C88D.6010507@berkeley.edu> <320fb6e00808051457h754ddebfm5e102a570c544cb2@mail.gmail.com> <4898DFFB.3010704@berkeley.edu> Message-ID: <320fb6e00808070506w34446572nbdc131fa079e8773@mail.gmail.com> On Wed, Aug 6, 2008 at 12:19 AM, Nick Matzke wrote: > (My alignment wasn't generated by Clustal anyway, so I also added this > header line to make the parser happy: "CLUSTAL W (1.83) formatted alignment > done with PROMALS3D") I've just tried an alignment from PROMALS3D in their Clustal W output format: http://prodata.swmed.edu/promals3d/promals3d.php I tried their default settings, a wrap of 50, and unwrapped long lines - and with all of these the CVS Biopython Bio.AlignIO parser seems fine. However, as you point out, when using Bio.Clustalw there is a problem. The missing version number causes an error, which I regard as a bug worth fixing. http://bugzilla.open-bio.org/show_bug.cgi?id=2564 Peter From mjldehoon at yahoo.com Thu Aug 7 14:56:37 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 7 Aug 2008 07:56:37 -0700 (PDT) Subject: [BioPython] Bio.Medline parser In-Reply-To: <591490.20671.qm@web62411.mail.re1.yahoo.com> Message-ID: <642190.59135.qm@web62407.mail.re1.yahoo.com> If there are no further suggestions, I'll implement the .find() method as described below. --Michiel. --- On Sat, 8/2/08, Michiel de Hoon wrote: > From: Michiel de Hoon > Subject: Re: [BioPython] Bio.Medline parser > To: "Peter" > Cc: biopython at biopython.org > Date: Saturday, August 2, 2008, 9:57 PM > > The alias idea is nice but does mean there is more than > one > > way to access the data (not encouraged in python). A > related > > suggestion is to support the properties > record.entry_date, > > record.author etc (what ever the current parser does) > as > > alternatives to record["DA"], > record["AU"], ... ? This would > > then be backwards compatible. This could probably be > done with > > a private dictionary mapping keys ("DA") to > property names > > ("entry_date"). When ever we add a new > entry to the > > dictionary, also see if it has a named property to > define > > too. > > > Thinking it over, I think that having a key and an > attribute mapping to the same value is not so clean. > Alternatively we could add a .find(term) method to the > Bio.Medline.Record class, which takes a term and returns the > appropriate value. So record.find("author") > returns record["AU"]. 
This gives a clear > separation between the raw keys in the Medline file and the > more descriptive names. Also, such a .find method can accept > a wider variety of terms than an attribute name (e.g., > "Full Author", "full_author", etc. all > return record["FAU"]). > > --Michiel > > > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From biopython at maubp.freeserve.co.uk Thu Aug 7 18:15:08 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 7 Aug 2008 19:15:08 +0100 Subject: [BioPython] Bio.Medline parser In-Reply-To: <642190.59135.qm@web62407.mail.re1.yahoo.com> References: <591490.20671.qm@web62411.mail.re1.yahoo.com> <642190.59135.qm@web62407.mail.re1.yahoo.com> Message-ID: <320fb6e00808071115k5111d184v13cb37457c3fdbd2@mail.gmail.com> On Thu, Aug 7, 2008 at 3:56 PM, Michiel de Hoon wrote: > If there are no further suggestions, I'll implement the .find() method as described below. > > ... > >> Thinking it over, I think that having a key and an >> attribute mapping to the same value is not so clean. >> Alternatively we could add a .find(term) method to the >> Bio.Medline.Record class, which takes a term and returns the >> appropriate value. So record.find("author") >> returns record["AU"]. This gives a clear >> separation between the raw keys in the Medline file and the >> more descriptive names. Also, such a .find method can accept >> a wider variety of terms than an attribute name (e.g., >> "Full Author", "full_author", etc. all >> return record["FAU"]). >> >> --Michiel When would anyone use the .find() method? Perhaps if exploring at the command line. If you are writing a script, then once you know you that "FAU" means "Full Author" then you would always just use record["FAU"] directly. Maybe it would make sense just to describe the keys in the docstring, and that would be enough. On a related point, from the Entrez documentation can the MedLine records be accessed as either plain text, XML (or html or asn.1)/ How does the data structure from parsing the XML version with the Bio.Entrez.read() compare to your ideas for the MedLine plain text parser? Maybe we can just deprecate Bio.Medline (i.e. the plain text parser) in favour of Bio.Entrez (and its XML parser)? http://eutils.ncbi.nlm.nih.gov/entrez/query/static/efetchlit_help.html Peter From biopython at maubp.freeserve.co.uk Fri Aug 8 10:28:42 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 8 Aug 2008 11:28:42 +0100 Subject: [BioPython] Deprecating Bio.Saf (PredictProtein Simple Alignment Format) In-Reply-To: <320fb6e00808020502l64eca901gda132250fc85e5be@mail.gmail.com> References: <320fb6e00807231512s507d2652jc0f26764a62b01d5@mail.gmail.com> <320fb6e00808020502l64eca901gda132250fc85e5be@mail.gmail.com> Message-ID: <320fb6e00808080328v4c84639evb74e3c722d15658e@mail.gmail.com> I wrote: >> Is anyone using Bio.Saf or PredictProtein's "Simple Alignment Format" (SAF)? >> >> Bio.Saf is one of the older parsers in Biopython. It parses the >> PredictProtein "Simple Alignment Format" (SAF), a fairly free-format >> multiple sequence alignment file format described here: >> http://www.predictprotein.org/Dexa/optin_saf.html > >... > >> If no one is using it, I would like to deprecate Bio.Saf in the next >> release of Biopython. > > Still no objections? As no-one has objected, I have marked Bio.Saf as deprecated in CVS. 
As usual, the intention is to keep the deprecated module for a couple more releases before removing it. Peter From biopython at maubp.freeserve.co.uk Fri Aug 8 11:51:35 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 8 Aug 2008 12:51:35 +0100 Subject: [BioPython] Deprecating Bio.NBRF Message-ID: <320fb6e00808080451i2e53de52k551dfd455bcaf3e3@mail.gmail.com> Dear all, Is anyone using Bio.NBRF for reading NBRF/PIR files? Good news: I've just added support for reading NBRF/PIR files as SeqRecord objects to Bio.SeqIO, under the format name "pir" as used in EMBOSS and BioPerl. See enhancement Bug 2535, http://bugzilla.open-bio.org/show_bug.cgi?id=2535 Bad news: I would now like to deprecate the old Bio.NBRF module which was an NBRF/PIR parser which generated its own record objects (not SeqRecord objects). The main reason to drop this module is it relies on some of Biopython's older parsing infrastructure which depends on mxTextTools (and doesn't entirely work with mxTextTools 3.0). So, if anyone if using Bio.NBRF, please get in touch. Peter From biopython at maubp.freeserve.co.uk Fri Aug 8 17:32:58 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 8 Aug 2008 18:32:58 +0100 Subject: [BioPython] Bio.Medline parser In-Reply-To: <367259.79982.qm@web62403.mail.re1.yahoo.com> References: <320fb6e00808071115k5111d184v13cb37457c3fdbd2@mail.gmail.com> <367259.79982.qm@web62403.mail.re1.yahoo.com> Message-ID: <320fb6e00808081032t1bf11e6fv32ed32ca4669cae5@mail.gmail.com> On Fri, Aug 8, 2008 at 3:14 PM, Michiel de Hoon wrote: > >> Maybe it would make sense just to describe >> the keys in the docstring, and that would be enough. > > OK I'll go with just the docstring for now. If users ask for it, we can add a more > descriptive function later. > >> Maybe we can just deprecate Bio.Medline (i.e. the >> plain text parser) in favour of Bio.Entrez (and its XML parser)? > > Usually I am in favor of deprecating modules if their usefulness is not clear. In > this case, however, Medline is a major database, the Medline record format is > readily available from NCBI, it is human readable (more or less) and computer > readable, the resulting Bio.Medline.Record may be easier to deal with than the > record created from XML by Bio.Entrez, and the parser is straightforward but > not entirely trivial. Being able to parse such a file is something I'd expect from > Biopython. That sounds sensible. Maybe we should have an example in the Tutorial of using Bio.Entrez to download some data in the plain text MedLine format, and parsing it with Bio.MedLine? And perhaps also an equivalent using the XML Medline format parsed using Bio.Entrez? Peter From mjldehoon at yahoo.com Fri Aug 8 14:14:18 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 8 Aug 2008 07:14:18 -0700 (PDT) Subject: [BioPython] Bio.Medline parser In-Reply-To: <320fb6e00808071115k5111d184v13cb37457c3fdbd2@mail.gmail.com> Message-ID: <367259.79982.qm@web62403.mail.re1.yahoo.com> > Maybe it would make sense just to describe > the keys in the docstring, and that would be enough. OK I'll go with just the docstring for now. If users ask for it, we can add a more descriptive function later. > > Maybe we can just deprecate Bio.Medline (i.e. the > plain text parser) in favour of Bio.Entrez (and its XML parser)? Usually I am in favor of deprecating modules if their usefulness is not clear. 
In this case, however, Medline is a major database, the Medline record format is readily available from NCBI, it is human readable (more or less) and computer readable, the resulting Bio.Medline.Record may be easier to deal with than the record created from XML by Bio.Entrez, and the parser is straightforward but not entirely trivial. Being able to parse such a file is something I'd expect from Biopython. --Michiel From mjldehoon at yahoo.com Sat Aug 9 07:39:00 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sat, 9 Aug 2008 00:39:00 -0700 (PDT) Subject: [BioPython] Bio.Medline parser In-Reply-To: <320fb6e00808081032t1bf11e6fv32ed32ca4669cae5@mail.gmail.com> Message-ID: <576584.32208.qm@web62405.mail.re1.yahoo.com> Done -- see CVS. --Michiel. --- On Fri, 8/8/08, Peter wrote: > From: Peter > Subject: Re: [BioPython] Bio.Medline parser > To: mjldehoon at yahoo.com > Cc: biopython at biopython.org > Date: Friday, August 8, 2008, 1:32 PM > On Fri, Aug 8, 2008 at 3:14 PM, Michiel de Hoon wrote: > > > >> Maybe it would make sense just to describe > >> the keys in the docstring, and that would be > enough. > > > > OK I'll go with just the docstring for now. If > users ask for it, we can add a more > > descriptive function later. > > > >> Maybe we can just deprecate Bio.Medline (i.e. the > >> plain text parser) in favour of Bio.Entrez (and > its XML parser)? > > > > Usually I am in favor of deprecating modules if their > usefulness is not clear. In > > this case, however, Medline is a major database, the > Medline record format is > > readily available from NCBI, it is human readable > (more or less) and computer > > readable, the resulting Bio.Medline.Record may be > easier to deal with than the > > record created from XML by Bio.Entrez, and the parser > is straightforward but > > not entirely trivial. Being able to parse such a file > is something I'd expect from > > Biopython. > > That sounds sensible. Maybe we should have an example in > the Tutorial > of using Bio.Entrez to download some data in the plain text > MedLine > format, and parsing it with Bio.MedLine? And perhaps also > an > equivalent using the XML Medline format parsed using > Bio.Entrez? > > Peter From ochipepe at gmail.com Sun Aug 10 08:12:56 2008 From: ochipepe at gmail.com (Alexandre Santos) Date: Sun, 10 Aug 2008 10:12:56 +0200 Subject: [BioPython] (bio)python for vector cloning Message-ID: Hello, I'm currently evaluating the suitability of python and biopython for the planning of my molecular biology chores. In particular, I would like to use for instance the ipython shell to pick up my vectors of interest, the list of restriction enzymes I have in the shelf, and design cloning strategies, plot annotated vector maps, etc. My question is on whether anybody has experience doing this with python related tools? I had a look at some biopython documentation and tutorials (http://www.pasteur.fr/recherche/unites/sis/formation/python/apa.html#sol_digest), and it seems feasible, but if possible I would like some experience-based feedback. Cheers, Alex Santos From mjldehoon at yahoo.com Sun Aug 10 08:13:31 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sun, 10 Aug 2008 01:13:31 -0700 (PDT) Subject: [BioPython] Bio.Emboss Message-ID: <646424.85840.qm@web62402.mail.re1.yahoo.com> Hi everybody, I am looking at the remaining Biopython modules that use Martel for parsing. Most of these have already been deprecated or replaced by alternatives. 
Without Martel, we can drop the dependency on mxTextTools, and make the Biopython installation a bit easier. One of the remaining Martel-dependent parsers are in Bio.Emboss. There are two parsers, one for PrimerSearch and one for Primer3. Currently, both of these reside in Bio.Emboss.Primer; these are the classes in Bio.Emboss.Primer: class PrimerSearchInputRecord: class PrimerSearchParser: class PrimerSearchOutputRecord: class PrimerSearchAmplifier: class _PrimerSearchRecordConsumer(AbstractConsumer): class _PrimerSearchScanner: class Primer3Record: class Primer3Primers: class Primer3Parser: class _Primer3RecordConsumer(AbstractConsumer): class _Primer3Scanner: I'd like to split Bio.Emboss.Primer into a Bio.Emboss.PrimerSearch and a Bio.Emboss.Primer3 module, with an InputRecord and OutputRecord class in Bio.Emboss.PrimerSearch, a Record class in Bio.Emboss.Primer3, and a read() function in each. This function would then do the parsing, without using Martel. Any objections? --Michiel From biopython at maubp.freeserve.co.uk Sun Aug 10 13:04:20 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 10 Aug 2008 14:04:20 +0100 Subject: [BioPython] (bio)python for vector cloning In-Reply-To: References: Message-ID: <320fb6e00808100604n696efc29r8b04239cdbb7296@mail.gmail.com> On Sun, Aug 10, 2008, Alexandre Santos wrote: > Hello, > > I'm currently evaluating the suitability of python and biopython for > the planning of my molecular biology chores. > > In particular, I would like to use for instance the ipython shell to > pick up my vectors of interest, the list of restriction enzymes I have > in the shelf, and design cloning strategies, plot annotated vector > maps, etc. What format will you have your raw vector sequences in? Maybe FASTA? Biopython's Bio.Restriction module (contributed by Frederic Sohm) may be helpful. It is documented here (separate from the main tutorial at the moment), http://biopython.org/DIST/docs/cookbook/Restriction.html What exactly do you mean by plot annotated vector maps? There are some basic graphics capabilities in Biopython which use ReportLab. Depending on what you want to do, GenomeDiagram might be helpful too. http://bioinf.scri.ac.uk/lp/programs.php#genomediagram > My question is on whether anybody has experience doing this with > python related tools? I had a look at some biopython documentation and > tutorials (http://www.pasteur.fr/recherche/unites/sis/formation/python/apa.html#sol_digest), > and it seems feasible, but if possible I would like some > experience-based feedback. I have no personal experience of doing this kind of worth with Biopython, but it should be feasible. If you try this, and have suggestions for the Biopython documentation (or code) that would be great. Also please be aware that some bits of the Pasteur Biopython tutorial are out of date - I did try and get in touch with the authors about this via the help at pasteur.fr email address listed on the main page. Maybe I should try and contact the authors directly... http://www.pasteur.fr/recherche/unites/sis/formation/python/ Peter From biopython at maubp.freeserve.co.uk Tue Aug 12 12:10:47 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 12 Aug 2008 13:10:47 +0100 Subject: [BioPython] (bio)python for vector cloning In-Reply-To: References: <320fb6e00808100604n696efc29r8b04239cdbb7296@mail.gmail.com> Message-ID: <320fb6e00808120510s6d5d3725v8ffa2643a55f47ee@mail.gmail.com> Hi Alex, I hope you don't mind me CC'ing this back onto the mailing list. 
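Since Bio.Restriction comes up in the quoted exchange below, here is a minimal sketch of the sort of digest the cookbook page covers. The toy sequence and the choice of enzyme are just an illustration, not anything from Alex's project:

from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
from Bio.Restriction import EcoRI, RestrictionBatch

# A made-up fragment containing a single EcoRI site (GAATTC)
my_seq = Seq("AAAAAAGAATTCAAAAAA", IUPAC.unambiguous_dna)
print EcoRI.search(my_seq)            # list of cut positions
batch = RestrictionBatch([EcoRI])
print batch.search(my_seq)            # dictionary mapping enzyme to positions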
On Tue, Aug 12, 2008 at 10:56 AM, Alexandre Santos wrote: >> What format will you have your raw vector sequences in? Maybe FASTA? > Both FASTA and GenBank formats. I think this should not be a problem Good. >> Biopython's Bio.Restriction module (contributed by Frederic Sohm) may >> be helpful. It is documented here (separate from the main tutorial at >> the moment), http://biopython.org/DIST/docs/cookbook/Restriction.html > > I checked the documentation and it's exactly what I need! Good. >> What exactly do you mean by plot annotated vector maps? There are >> some basic graphics capabilities in Biopython which use ReportLab. >> Depending on what you want to do, GenomeDiagram might be helpful too. >> http://bioinf.scri.ac.uk/lp/programs.php#genomediagram > > I mean the typical vector graphic representation that gives you an > idea of the vector sequence structure (see for instance > http://www.addgene.org/pgvec1?f=d&vectorid=345&cmd=genvecmap&dim=800&format=html&mtime=1187931178). > I would use it for personal documentation, but also when I send the > plasmids to other people. > > It seems GenomeDiagram could be used for that job, but not without > some heavy customization... It would be nice to have something already > usable for this purpose. I have used GenomeDiagram for plasmid figures, for example showing the location of microarray probe target sequences. However, right now it does lack support for "arrowed features" on the circles, and the fancy labeling in that example. So I would agree, recreating that figure using Biopython and GenomeDiagram would need plenty of additional work. However, a simplified version would be fairly easy I think. >> Also please be aware that some bits of the Pasteur Biopython tutorial >> are out of date > > Thanks for the warning, I will mind it when I try biopython. > > Thanks for the help! > > Alex Sure, Peter From rik at cogsci.ucsd.edu Tue Aug 12 21:24:44 2008 From: rik at cogsci.ucsd.edu (richard k belew) Date: Tue, 12 Aug 2008 14:24:44 -0700 Subject: [BioPython] Bio.EUtils, MultiDict: getting all the authors? Message-ID: <48A1FF9C.4060206@cogsci.ucsd.edu> i am sure this has to have been addressed in a universe long ago and far away, but... i'm trying to use Bio.EUtils to access NCBI/Entrez, but seem unable to use its MultDict utilities as they are intended. i include a sample run below. i can contact NCBI and get the data just fine. and the AuthorList = is accessible. the summary() whines about finding "multiple Items named 'Author'!" is there some (recursive?) idiom that is typically used? i can explicitly make the hack for extracting from the AuthorList, but want to do something similar for any other OrderedMultiDicts, and would like it all to stay as close to the DTDs as possible! thanks for your help. rik > Python 2.4.4 (#1, Oct 18 2006, 10:34:39) > [GCC 4.0.1 (Apple Computer, Inc. build 5341)] on darwin > Type "help", "copyright", "credits" or "license" for more information. >>>> from Bio import EUtils >>>> from Bio.EUtils import DBIdsClient >>>> PMID = "17447753" >>>> idList = EUtils.DBIds("pubmed", PMID) >>>> result = DBIdsClient.from_dbids(idList) >>>> summary = result[0].summary() > Found multiple Items named 'Author'! > Found multiple Items named 'Author'! > Found multiple Items named 'Author'! 
>>>> data = summary.dataitems >>>> data > ), ('LastAuthor', u'Belew RK'), ('Title', u'Analysis of HIV wild-type and mutant structures via in silico docking against diverse ligand libraries.'), ('Volume', u'47'), ('Issue', u'3'), ('Pages', u'1258-62'), ('LangList', ), ('NlmUniqueID', u'101230060'), ('ISSN', u'1549-9596'), ('ESSN', u'1549-960X'), ('PubTypeList', ), ('RecordStatus', u'PubMed - indexed for MEDLINE'), ('PubStatus', u'ppublish+epublish'), ('ArticleIds', ), ('DOI', u'10.1021/ci700044s'), ('History', ), ('References', ), ('HasAbstract', 1), ('PmcRefCount', 0), ('FullJournalName', u'Journal of chemical information and modeling'), ('ELocationID', u''), ('SO', u'2007 May-Jun;47(3):1258-62')]> >>>> for k in keys: print k,data[k] > ... > DOI 10.1021/ci700044s > Title Analysis of HIV wild-type and mutant structures via in silico docking against diverse ligand libraries. > Source J Chem Inf Model > PmcRefCount 0 > Issue 3 > SO 2007 May-Jun;47(3):1258-62 > ISSN 1549-9596 > Volume 47 > FullJournalName Journal of chemical information and modeling > RecordStatus PubMed - indexed for MEDLINE > ESSN 1549-960X > ELocationID > Pages 1258-62 > PubStatus ppublish+epublish > AuthorList {'Author': u'Belew RK'} > EPubDate 2007/04/21 > PubDate 2007/05/01 > NlmUniqueID 101230060 > LastAuthor Belew RK > ArticleIds {'doi': u'10.1021/ci700044s', 'pubmed': u'17447753'} > HasAbstract 1 > History {'medline': Date(2007, 9, 6), 'pubmed': Date(2007, 4, 24), 'aheadofprint': Date(2007, 4, 21)} > LangList {'Lang': u'English'} > References {} > PubTypeList {'PubType': u'Journal Article'} >>>> alist = data.get('AuthorList') >>>> alist > >>>> for k,v in data.allitems(): print k,v > ... > PubDate 2007/05/01 > EPubDate 2007/04/21 > Source J Chem Inf Model > AuthorList {'Author': u'Belew RK'} > LastAuthor Belew RK > Title Analysis of HIV wild-type and mutant structures via in silico docking against diverse ligand libraries. > Volume 47 > Issue 3 > Pages 1258-62 > LangList {'Lang': u'English'} > NlmUniqueID 101230060 > ISSN 1549-9596 > ESSN 1549-960X > PubTypeList {'PubType': u'Journal Article'} > RecordStatus PubMed - indexed for MEDLINE > PubStatus ppublish+epublish > ArticleIds {'doi': u'10.1021/ci700044s', 'pubmed': u'17447753'} > DOI 10.1021/ci700044s > History {'medline': Date(2007, 9, 6), 'pubmed': Date(2007, 4, 24), 'aheadofprint': Date(2007, 4, 21)} > References {} > HasAbstract 1 > PmcRefCount 0 > FullJournalName Journal of chemical information and modeling > ELocationID > SO 2007 May-Jun;47(3):1258-62 From biopython at maubp.freeserve.co.uk Tue Aug 12 21:49:43 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 12 Aug 2008 22:49:43 +0100 Subject: [BioPython] Bio.EUtils, MultiDict: getting all the authors? In-Reply-To: <48A1FF9C.4060206@cogsci.ucsd.edu> References: <48A1FF9C.4060206@cogsci.ucsd.edu> Message-ID: <320fb6e00808121449j388e4cael490c7fa7ebce6d8e@mail.gmail.com> On Tue, Aug 12, 2008 at 10:24 PM, richard k belew wrote: > i am sure this has to have been addressed in a universe long ago and far > away, but... > > i'm trying to use Bio.EUtils to access NCBI/Entrez, but seem unable to > use its MultDict utilities as they are intended. > > i include a sample run below. > i can contact NCBI and get the data just fine. and the AuthorList = > is accessible. the summary() whines about > finding "multiple Items named 'Author'!" > > is there some (recursive?) idiom that is typically used? 
i can > explicitly make the hack for extracting from the AuthorList, but > want to do something similar for any other OrderedMultiDicts, and > would like it all to stay as close to the DTDs as possible! > > thanks for your help. Hi Rik, I don't know enough about Bio.EUtils to be able to help. This module is currently without an maintainer, and its deprecation has been suggested in favour of the much simpler Bio.Entrez module (which is covered pretty thoroughly in the documentation). I would suggest you try Bio.Entrez.efetch() to get the data as XML, and the Bio.Entrez.read() function to parse the XML. You'll get a nested structure of python dictionaries and lists. See "Chapter 7" of the Tutorial, http://www.biopython.org/DIST/docs/tutorial/Tutorial.html Was there anything particular piece of information you wanted to extract? Peter From biopython at maubp.freeserve.co.uk Tue Aug 12 22:19:52 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 12 Aug 2008 23:19:52 +0100 Subject: [BioPython] Bio.EUtils, MultiDict: getting all the authors? In-Reply-To: <320fb6e00808121449j388e4cael490c7fa7ebce6d8e@mail.gmail.com> References: <48A1FF9C.4060206@cogsci.ucsd.edu> <320fb6e00808121449j388e4cael490c7fa7ebce6d8e@mail.gmail.com> Message-ID: <320fb6e00808121519r3a2b2d53p8039abbc70a28369@mail.gmail.com> > I would suggest you try Bio.Entrez.efetch() to get the data as XML, > and the Bio.Entrez.read() function to parse the XML. You'll get a > nested structure of python dictionaries and lists. See "Chapter 7" of > the Tutorial, > http://www.biopython.org/DIST/docs/tutorial/Tutorial.html > > Was there anything particular piece of information you wanted to extract? Assuming it was just the author list, something like this might suit you: from Bio import Entrez PMIDs = "17447753,17447754" handle = Entrez.efetch(db="pubmed", id=PMIDs, retmode="XML") records = Entrez.read(handle) for record in records : print record['MedlineCitation']['Article']['ArticleTitle'] for author_dict in record['MedlineCitation']['Article']['AuthorList'] : print " - %(ForeName)s %(LastName)s" % author_dict handle.close() And the output, Analysis of HIV wild-type and mutant structures via in silico docking against diverse ligand libraries. - Max W Chang - William Lindstrom - Arthur J Olson - Richard K Belew Synthesis and spectroscopic characterization of copper(II)-nitrito complexes with hydrotris(pyrazolyl)borate and related coligands. - Nicolai Lehnert - Ursula Cornelissen - Frank Neese - Tetsuya Ono - Yuki Noguchi - Ken-Ichi Okamoto - Kiyoshi Fujisawa And done. The author's initial are also included in the dictionary (but not printed). If you are familar with the XML DTD, working out where the data you want is much easier! As you desired, the Bio.Entrez parser does stay close to the DTDs - both a blessing and a curse. Peter From biopython at maubp.freeserve.co.uk Tue Aug 12 22:22:53 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 12 Aug 2008 23:22:53 +0100 Subject: [BioPython] Bio.EUtils, MultiDict: getting all the authors? In-Reply-To: <48A20AE8.40900@cogsci.ucsd.edu> References: <48A1FF9C.4060206@cogsci.ucsd.edu> <320fb6e00808121449j388e4cael490c7fa7ebce6d8e@mail.gmail.com> <48A20AE8.40900@cogsci.ucsd.edu> Message-ID: <320fb6e00808121522yd221531pbb75ea484f14b2ea@mail.gmail.com> On Tue, Aug 12, 2008 at 11:12 PM, richard k belew wrote: > thanks Peter! depriction of EUtils seems right > (i was following stale pointers i guess). i'll > tryout Bio.Entrez. > > rik Can I ask you why you ended up at Bio.EUtils? 
If its documentation on 3rd party sites, there's not so much we can do about it. But if there is anything misleading in the tutorial or on the Biopython website we can fix that. Peter From lpritc at scri.ac.uk Wed Aug 13 09:17:49 2008 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Wed, 13 Aug 2008 10:17:49 +0100 Subject: [BioPython] (bio)python for vector cloning In-Reply-To: <320fb6e00808120510s6d5d3725v8ffa2643a55f47ee@mail.gmail.com> Message-ID: Hi, On 12/08/2008 13:10, "Peter" wrote: >>> What exactly do you mean by plot annotated vector maps? There are >>> some basic graphics capabilities in Biopython which use ReportLab. >>> Depending on what you want to do, GenomeDiagram might be helpful too. >>> http://bioinf.scri.ac.uk/lp/programs.php#genomediagram >> >> I mean the typical vector graphic representation that gives you an >> idea of the vector sequence structure (see for instance >> http://www.addgene.org/pgvec1?f=d&vectorid=345&cmd=genvecmap&dim=800&format=h >> tml&mtime=1187931178). >> I would use it for personal documentation, but also when I send the >> plasmids to other people. >> >> It seems GenomeDiagram could be used for that job, but not without >> some heavy customization... It would be nice to have something already >> usable for this purpose. > > I have used GenomeDiagram for plasmid figures, for example showing the > location of microarray probe target sequences. However, right now it > does lack support for "arrowed features" on the circles, and the fancy > labeling in that example. So I would agree, recreating that figure > using Biopython and GenomeDiagram would need plenty of additional > work. However, a simplified version would be fairly easy I think. Peter's correct: currently GenomeDiagram only has support for drawing arrow features in linear diagrams, and the labelling in the diagram that you link to is not achievable by the GenomeDiagram API. Features can be labelled individually, just not in the style shown. GenomeDiagram was designed for the presentation of hundreds of genomes, rather than single plasmids, so this use is a little out of its original scope ;) There is a package called Plasmidomics, written in Python, that I've never used, but which is designed for this kind of task: http://www.bioprocess.org/plasmid/ It might be what you need and, if the source code is available, you might be able to work it into your own code. Cheers, L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. 
Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). From biopython at maubp.freeserve.co.uk Wed Aug 13 21:26:52 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 13 Aug 2008 22:26:52 +0100 Subject: [BioPython] Deprecating Bio.EUtils in favour of Bio.Entrez? In-Reply-To: <320fb6e00808130200v6c69922au99b4623b67a2eb88@mail.gmail.com> References: <320fb6e00808130200v6c69922au99b4623b67a2eb88@mail.gmail.com> Message-ID: <320fb6e00808131426m6bb72b8fh6399734a8359d8a5@mail.gmail.com> Hello to all the NCBI fans... As you may know, the NCBI Entrez database has some "Entrez Programming Utilities" also known as EUtils, http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html As of Biopython 1.45, the module Bio.Entrez (originally Bio.WWW.NCBI) supported all the EUtils functions, and as of Biopython 1.46 could parse the NCBI XML output too. This code is well documented with a whole Bio.Entrez chapter in the tutorial, which means we are now in a position to retire the older unmaintained Bio.EUtils module. We'd like to propose the deprecation of Bio.EUtils in the next release of Biopython, in favour of Bio.Entrez. If anyone is currently using Bio.EUtils, then we'd like to hear from you. It should be possible to offer advice on migrating the code to Bio.Entrez, or we can reconsider deprecating Bio.EUtils if there is some major functionality that would be lost, or users that would be inconvenienced by its premature retirement. Thank you, Peter From biopython at maubp.freeserve.co.uk Sun Aug 17 13:05:52 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 17 Aug 2008 14:05:52 +0100 Subject: [BioPython] Bio.EUtils deprecated in favour of Bio.Entrez Message-ID: <320fb6e00808170605t5dd7b787i2dfcc4f5f4dde6ed@mail.gmail.com> Dear all, I've just deprecated Bio.EUtils in CVS, leaving Bio.Entrez as Biopython's preferred interface to the NCBI "Entrez Programming Utilities" also known as EUtils. http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html If anyone is currently using Bio.EUtils, then we'd like to hear from you. It should be possible to offer advice on migrating your code to Bio.Entrez. Also, up until the next Biopython release, we can still reconsider deprecating Bio.EUtils if there is some major functionality that would be lost, or users that would be inconvenienced by its premature retirement. Thank you, Peter From biopython at maubp.freeserve.co.uk Tue Aug 19 10:45:31 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 19 Aug 2008 11:45:31 +0100 Subject: [BioPython] Deprecating Bio.NBRF In-Reply-To: <320fb6e00808080451i2e53de52k551dfd455bcaf3e3@mail.gmail.com> References: <320fb6e00808080451i2e53de52k551dfd455bcaf3e3@mail.gmail.com> Message-ID: <320fb6e00808190345i7d1e2a47l98e4426f5ab87b68@mail.gmail.com> On Fri, Aug 8, 2008 at 12:51 PM, Peter wrote: > Dear all, > > Is anyone using Bio.NBRF for reading NBRF/PIR files? > > Good news: I've just added support for reading NBRF/PIR files as > SeqRecord objects to Bio.SeqIO, under the format name "pir" as used in > EMBOSS and BioPerl. 
See enhancement Bug 2535, > http://bugzilla.open-bio.org/show_bug.cgi?id=2535 > > Bad news: I would now like to deprecate the old Bio.NBRF module which > was an NBRF/PIR parser which generated its own record objects (not > SeqRecord objects). The main reason to drop this module is it relies > on some of Biopython's older parsing infrastructure which depends on > mxTextTools (and doesn't entirely work with mxTextTools 3.0). > > So, if anyone if using Bio.NBRF, please get in touch. > I have now deprecated Bio.NBRF in CVS. Its not to late to revert this change if anyone missed the last email warning people about this plan. (Like many other deprecations, even after we remove a module from the distribution, the old code is still there in CVS, and can be resurrected if someone really wanted it) Peter From agarbino at gmail.com Thu Aug 21 05:44:27 2008 From: agarbino at gmail.com (Alex Garbino) Date: Thu, 21 Aug 2008 00:44:27 -0500 Subject: [BioPython] Parsing BLAST for ClustalW Message-ID: <4cf37ad00808202244h6714929at5c85222ab0de406e@mail.gmail.com> Hello, I'm a new python and biophython user. I'm trying to pull a BLAST result, parse it into a csv with the following fields: protein name, organism, common name, protein length, and FASTA sequence The goal is to then feed the fasta sequences into ClustalW (to do a phylogeny tree, look for conserved regions, etc). I've managed to do the blast search, and parse the results into xml from python. However, I'm not sure how to grab the above information and put it together, so that I can save a csv and push it into clustalw. Could someone help? Thanks! Alex From allank at sanbi.ac.za Thu Aug 21 08:29:24 2008 From: allank at sanbi.ac.za (Allan Kamau) Date: Thu, 21 Aug 2008 10:29:24 +0200 Subject: [BioPython] Parsing BLAST for ClustalW In-Reply-To: <4cf37ad00808202244h6714929at5c85222ab0de406e@mail.gmail.com> References: <4cf37ad00808202244h6714929at5c85222ab0de406e@mail.gmail.com> Message-ID: <48AD2764.9010400@sanbi.ac.za> Hi Alex, I haven't yet used BioPython (therefore my suggestion may be quite wrong). To generate CSV from XML may require use of general XML SAX parser solution (unless BioPython has a package to output CSV from XML of that particular structure). I prefer to use SAX (in many cases) as opposed to other more memory resident XML parsing solutions (DOM etc) due to memory issues especially if your XML is large. Have a look at "http://www.devarticles.com/c/a/XML/Parsing-XML-with-SAX-and-Python/" Allan. Alex Garbino wrote: > Hello, > > I'm a new python and biophython user. > I'm trying to pull a BLAST result, parse it into a csv with the > following fields: > protein name, organism, common name, protein length, and FASTA sequence > The goal is to then feed the fasta sequences into ClustalW (to do a > phylogeny tree, look for conserved regions, etc). > > I've managed to do the blast search, and parse the results into xml > from python. However, I'm not sure how to grab the above information > and put it together, so that I can save a csv and push it into > clustalw. > > Could someone help? > > Thanks! 
> Alex > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From mjldehoon at yahoo.com Thu Aug 21 09:14:20 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 21 Aug 2008 02:14:20 -0700 (PDT) Subject: [BioPython] Parsing BLAST for ClustalW In-Reply-To: <4cf37ad00808202244h6714929at5c85222ab0de406e@mail.gmail.com> Message-ID: <862312.62767.qm@web62403.mail.re1.yahoo.com> Dear Alex, Did you look at section 6.4 in the Biopython tutorial? --Michiel. --- On Thu, 8/21/08, Alex Garbino wrote: > From: Alex Garbino > Subject: [BioPython] Parsing BLAST for ClustalW > To: biopython at lists.open-bio.org > Date: Thursday, August 21, 2008, 1:44 AM > Hello, > > I'm a new python and biophython user. > I'm trying to pull a BLAST result, parse it into a csv > with the > following fields: > protein name, organism, common name, protein length, and > FASTA sequence > The goal is to then feed the fasta sequences into ClustalW > (to do a > phylogeny tree, look for conserved regions, etc). > > I've managed to do the blast search, and parse the > results into xml > from python. However, I'm not sure how to grab the > above information > and put it together, so that I can save a csv and push it > into > clustalw. > > Could someone help? > > Thanks! > Alex > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From biopython at maubp.freeserve.co.uk Thu Aug 21 12:15:15 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 21 Aug 2008 13:15:15 +0100 Subject: [BioPython] Parsing BLAST for ClustalW In-Reply-To: <4cf37ad00808202244h6714929at5c85222ab0de406e@mail.gmail.com> References: <4cf37ad00808202244h6714929at5c85222ab0de406e@mail.gmail.com> Message-ID: <320fb6e00808210515j737f3374t457d09350ae2a674@mail.gmail.com> On Thu, Aug 21, 2008 at 6:44 AM, Alex Garbino wrote: > Hello, > > I'm a new python and biophython user. > I'm trying to pull a BLAST result, parse it into a csv with the > following fields: > protein name, organism, common name, protein length, and FASTA sequence > The goal is to then feed the fasta sequences into ClustalW (to do a > phylogeny tree, look for conserved regions, etc). > > I've managed to do the blast search, and parse the results into xml > from python. However, I'm not sure how to grab the above information > and put it together, so that I can save a csv and push it into > clustalw. > > Could someone help? Hi Alex, You said you are a Python and Biopython beginner - are you already familiar with BLAST and ClustalW? It sounds like you have a query sequence, and want to extract matching target sequences from a database using BLAST, and then build a multiple sequence alignment from them. If you just want the matching region of these other genes, then you can work from the BLAST output (just take the aligned sequence and remove the gaps). However, if you want the full gene sequences these are not in the BLAST output. You would have to take the target match ID, and look it up in the original database. As Michiel suggested, have a look over the BLAST chapter in the Biopython tutorial. http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf Writing a CVS file in python is simple enough, e.g. handle = open("example.txt","w") #some loop over the blast file to extract the fields... 
handle.write("%s, %s, %s, %i, %s\n" % (protein_name, organism, common_name, protein_length, sequence_string) handle.close() However, for input to ClustalW to build a tree you don't want a CSV file, but a FASTA file containing the sequences without gaps. You could write these out yourself, e.g. handle = open("example.faa","w") #some loop over the blast file to extract the fields... handle.write(">%s\n%s\n" % (protein_name, sequence_string) handle.close() Peter From bsouthey at gmail.com Thu Aug 21 13:27:34 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Thu, 21 Aug 2008 08:27:34 -0500 Subject: [BioPython] Parsing BLAST for ClustalW In-Reply-To: <48AD2764.9010400@sanbi.ac.za> References: <4cf37ad00808202244h6714929at5c85222ab0de406e@mail.gmail.com> <48AD2764.9010400@sanbi.ac.za> Message-ID: <48AD6D46.1060601@gmail.com> Allan Kamau wrote: > Hi Alex, > I haven't yet used BioPython (therefore my suggestion may be quite > wrong). > To generate CSV from XML may require use of general XML SAX parser > solution (unless BioPython has a package to output CSV from XML of > that particular structure). > I prefer to use SAX (in many cases) as opposed to other more memory > resident XML parsing solutions (DOM etc) due to memory issues > especially if your XML is large. > Have a look at > "http://www.devarticles.com/c/a/XML/Parsing-XML-with-SAX-and-Python/" > Note that the Elementtree XML parser is now standard in Python 2.5 see one of many examples: http://www.learningpython.com/2008/05/07/elegant-xml-parsing-using-the-elementtree-module/ As I understand things, elementtree does not fit into BioPython since it was not standard for earlier versions of Python supported by BioPython. This may change once BioPython supports Python 3K. Bruce From sbassi at gmail.com Sun Aug 24 22:18:09 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Sun, 24 Aug 2008 19:18:09 -0300 Subject: [BioPython] Problem with Entrez? Message-ID: >>> handle = Entrez.efetch(db="nucleotide", id="326625") >>> record = Entrez.read(handle) Traceback (most recent call last): File "", line 1, in record = Entrez.read(handle) File "/mnt/hda2/py252/lib/python2.5/site-packages/Bio/Entrez/__init__.py", line 283, in read record = handler.run(handle) File "/mnt/hda2/py252/lib/python2.5/site-packages/Bio/Entrez/Parser.py", line 95, in run self.parser.ParseFile(handle) ExpatError: syntax error: line 1, column 0 So I efetch it again just to show the format of handle: >>> handle = Entrez.efetch(db="nucleotide", id="326625") >>> print handle.read()[:200] Seq-entry ::= seq { id { genbank { name "HIVED82FO" , accession "M77599" , version 1 } , gi 326625 } , descr { title "Human immunodeficiency virus type 1 gp120 (env) Looks like ASN1 format, but according to the tutorial efetch should return its output in XML format: "By default you get the output in XML format, which you can parse using the Bio.Entrez.read() function " As a workaround I specify the format with rettype='gb' From sbassi at gmail.com Sun Aug 24 22:46:10 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Sun, 24 Aug 2008 19:46:10 -0300 Subject: [BioPython] Problem with Entrez? In-Reply-To: References: Message-ID: On Sun, Aug 24, 2008 at 7:18 PM, Sebastian Bassi wrote: > As a workaround I specify the format with rettype='gb' Sorry, the workaround is to set retmode to xml: handle = Entrez.efetch(db="nucleotide", id="326625", retmode='xml') But I thought that this should be default behivor. 
From sbassi at gmail.com Sun Aug 24 23:03:32 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Sun, 24 Aug 2008 20:03:32 -0300 Subject: [BioPython] Problem with Entrez? In-Reply-To: References: Message-ID: On Sun, Aug 24, 2008 at 8:00 PM, Sebastian Bassi wrote: > My proposed solution is: Sorry!!! Now I think that the way to force an option to be default is to declare a default value in function definition: def efetch(db, cgi=None, retmode='xml', **keywds): instead of: def efetch(db, cgi=None, **keywds): From sbassi at gmail.com Sun Aug 24 23:00:08 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Sun, 24 Aug 2008 20:00:08 -0300 Subject: [BioPython] Problem with Entrez? In-Reply-To: References: Message-ID: On Sun, Aug 24, 2008 at 7:46 PM, Sebastian Bassi wrote: >> As a workaround I specify the format with rettype='gb' > Sorry, the workaround is to set retmode to xml: > handle = Entrez.efetch(db="nucleotide", id="326625", retmode='xml') > But I thought that this should be default behivor. My proposed solution is: Change line variables = {'db' : db} To: variables = {'db' : db , 'retmode' : 'xml'} In Bio/Entrez/__init__.py Doing this, it work as expected, but I don't know if this breaks something else. From srbanator at heckler-koch.cz Wed Aug 27 08:41:18 2008 From: srbanator at heckler-koch.cz (Pavel SRB) Date: Wed, 27 Aug 2008 10:41:18 +0200 Subject: [BioPython] versions and doc's Message-ID: <48B5132E.6080601@heckler-koch.cz> hi all, i am very new to biopython. I am working with debian etch stable version. During reading tutorial i have found out, that the current documentation works with version 1.47-1 (in debian it is in unstable repository). Are there old tutorials related to 1.42-2 version (debian stable), or you believe i should rather start studying biopython in it's newest version? thank you for introduction pavel srb From biopython at maubp.freeserve.co.uk Wed Aug 27 09:12:17 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 27 Aug 2008 10:12:17 +0100 Subject: [BioPython] Problem with Entrez? In-Reply-To: References: Message-ID: <320fb6e00808270212y48f12111u16c2908337732ed3@mail.gmail.com> On Sun, Aug 24, 2008 at 11:18 PM, Sebastian Bassi wrote: > Looks like ASN1 format, but according to the tutorial efetch should > return its output in XML format: > "By default you get the output in XML format, which you can parse > using the Bio.Entrez.read() function " Its a documentation bug (my mistaken assumption), as the NCBI do not default to XML. The efetch doc string was fixed in CVS but I'll do the tutorial now... thanks for the report. > As a workaround I specify the format with rettype='gb' As you realized, efetch listens to the rettype and retmode arguments. Peter From biopython at maubp.freeserve.co.uk Wed Aug 27 09:19:02 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 27 Aug 2008 10:19:02 +0100 Subject: [BioPython] Problem with Entrez? 
In-Reply-To: <320fb6e00808270212y48f12111u16c2908337732ed3@mail.gmail.com> References: <320fb6e00808270212y48f12111u16c2908337732ed3@mail.gmail.com> Message-ID: <320fb6e00808270219k450d55d3g259c829de8593139@mail.gmail.com> On Wed, Aug 27, 2008 at 10:12 AM, Peter wrote: > On Sun, Aug 24, 2008 at 11:18 PM, Sebastian Bassi wrote: >> Looks like ASN1 format, but according to the tutorial efetch should >> return its output in XML format: >> "By default you get the output in XML format, which you can parse >> using the Bio.Entrez.read() function " > > Its a documentation bug (my mistaken assumption), as the NCBI do not > default to XML. The efetch doc string was fixed in CVS but I'll do the > tutorial now... thanks for the report. Already fixed in CVS, as of /biopython/Doc/Tutorial.tex revision 1.135 - but worth double checking. Peter From p.j.a.cock at googlemail.com Wed Aug 27 09:30:12 2008 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 27 Aug 2008 10:30:12 +0100 Subject: [BioPython] versions and doc's In-Reply-To: <48B5132E.6080601@heckler-koch.cz> References: <48B5132E.6080601@heckler-koch.cz> Message-ID: <320fb6e00808270230g34e97347h5d21d75154536614@mail.gmail.com> On Wed, Aug 27, 2008 at 9:41 AM, Pavel SRB wrote: > hi all, i am very new to biopython. I am working with debian etch stable > version. During reading tutorial i have found out, that the current > documentation works with version 1.47-1 (in debian it is in unstable > repository). Hi Pavel, Yes - the documentation on our website, in particular the tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf is for the current latest release of Biopython (i.e. Biopython 1.47 at this time) > Are there old tutorials related to 1.42-2 version (debian stable), or > you believe i should rather start studying biopython in it's newest version? Debian may have installed the Biopython 1.42 version of the tutorial for you (but if it did, I don't know where to look). If you want the old tutorial you can either get the LaTeX source from from CVS and recompile it, or more simply download the Biopython 1.42 source code release, which contain both the HTML and PDF versions of the tutorial under the Doc folder: http://biopython.org/DIST/biopython-1.42.tar.gz http://biopython.org/DIST/biopython-1.42.zip However, I would strongly encourage you to install the current version of Biopython instead - there have been a lot of bug fixes plus the addition modules like Bio.SeqIO, Bio.AlignIO and Bio.Entrez. In addition, several of the modules present in Biopython 1.42 have since been moved or deprecated and some have even been removed. For debian (or ubuntu), I would suggest you install Biopython from source. First uninstall the debian package for Biopython 1.42, then you should be able to install the build dependencies automatically using: sudo apt-get build-dep python-biopython Then you should be ready to install Biopython 1.47 from source. See http://biopython.org/wiki/Download for more details. Peter From agarbino at gmail.com Wed Aug 27 17:12:58 2008 From: agarbino at gmail.com (Alex Garbino) Date: Wed, 27 Aug 2008 12:12:58 -0500 Subject: [BioPython] Parsing BLAST Message-ID: <4cf37ad00808271012k39f795c1q36515207b4dca421@mail.gmail.com> Hello, I'm following the tutorials to do BLAST queries, however, I can't get the Blast object to work. I've downloaded the blast search, saved it in XML, etc, as the tutorial does. 
However, when I get to the step where I'm trying to get actual data out, it fails (the for loops part). Here is a simplified version that illustrates the problem: for x in blast_record.alignments: print alignment.title Traceback (most recent call last): File "", line 2, in NameError: name 'alignment' is not defined ----------------------- blast_record contains lots of data, I just can't seem to be able to get anything out of it... what am I doing wrong? Thanks, Alex From sbassi at gmail.com Wed Aug 27 17:33:52 2008 From: sbassi at gmail.com (Sebastian Bassi) Date: Wed, 27 Aug 2008 14:33:52 -0300 Subject: [BioPython] Parsing BLAST In-Reply-To: <4cf37ad00808271012k39f795c1q36515207b4dca421@mail.gmail.com> References: <4cf37ad00808271012k39f795c1q36515207b4dca421@mail.gmail.com> Message-ID: On Wed, Aug 27, 2008 at 2:12 PM, Alex Garbino wrote: > for x in blast_record.alignments: > print alignment.title > > Traceback (most recent call last): > File "", line 2, in > NameError: name 'alignment' is not defined You should do: print x.title Instead of: print alignment.title From cg5x6 at yahoo.com Wed Aug 27 17:27:54 2008 From: cg5x6 at yahoo.com (C. G.) Date: Wed, 27 Aug 2008 10:27:54 -0700 (PDT) Subject: [BioPython] Parsing BLAST In-Reply-To: <4cf37ad00808271012k39f795c1q36515207b4dca421@mail.gmail.com> Message-ID: <384507.82825.qm@web65609.mail.ac4.yahoo.com> --- On Wed, 8/27/08, Alex Garbino wrote: > From: Alex Garbino > Subject: [BioPython] Parsing BLAST > To: biopython at lists.open-bio.org > Date: Wednesday, August 27, 2008, 11:12 AM > Hello, > > I'm following the tutorials to do BLAST queries, > however, I can't get > the Blast object to work. > > I've downloaded the blast search, saved it in XML, etc, > as the > tutorial does. However, when I get to the step where > I'm trying to get > actual data out, it fails (the for loops part). > Here is a simplified version that illustrates the problem: > > for x in blast_record.alignments: > print alignment.title > > Traceback (most recent call last): > File "", line 2, in > NameError: name 'alignment' is not defined > > ----------------------- It's a Python coding error. Try: print x.title Or change the name of your variable in the "for" loop to "alignment". From srbanator at heckler-koch.cz Wed Aug 27 20:48:24 2008 From: srbanator at heckler-koch.cz (Pavel SRB) Date: Wed, 27 Aug 2008 22:48:24 +0200 Subject: [BioPython] sql create tables script in tutorial Message-ID: <48B5BD98.8050101@heckler-koch.cz> hi all, as i am reading through "Basic BioSQL with Biopython" http://biopython.org/DIST/docs/biosql/python_biosql_basic.html after executing >>> db = server.new_database("cold") i have got an "XXX.biodatabase' doesn't exist" error. At "BioSQL" http://www.biopython.org/wiki/BioSQL i have found out sql create table batch script mysql -u root bioseqdb < biosqldb-mysql.sql Maybe it should also be in the first tutorial, maybe not. Just mentioning it. pavel srb From agarbino at gmail.com Wed Aug 27 21:01:53 2008 From: agarbino at gmail.com (Alex Garbino) Date: Wed, 27 Aug 2008 16:01:53 -0500 Subject: [BioPython] Parsing BLAST In-Reply-To: <384507.82825.qm@web65609.mail.ac4.yahoo.com> References: <4cf37ad00808271012k39f795c1q36515207b4dca421@mail.gmail.com> <384507.82825.qm@web65609.mail.ac4.yahoo.com> Message-ID: <4cf37ad00808271401u36316d7eq431552ea3088646e@mail.gmail.com> Thanks for the help; that was a lot of time wasted for something simple.... I do have an additional request: once I parse these out, I only get 50 entries. 
however, if I do the same search online, I get 138... what accounts for the difference? This is my code: from Bio import SeqIO from Bio.Blast import NCBIWWW from Bio.Blast import NCBIXML record = SeqIO.read(open("protein_fasta.txt"), format="fasta") result_handle = NCBIWWW.qblast("blastp", "nr", record.seq.tostring()) blast_records = NCBIXML.parse(result_handle) blast_record = blast_records.next() for x in blast_record.alignments: print x.title, x.accession, x.length acc_list = [] for x in blast_record.alignments: acc_list.append(x.accession) len(acc_list) tells me 50... Is there a default limit somewhere? Thanks! Alex On Wed, Aug 27, 2008 at 12:27 PM, C. G. wrote: > > > > --- On Wed, 8/27/08, Alex Garbino wrote: > >> From: Alex Garbino >> Subject: [BioPython] Parsing BLAST >> To: biopython at lists.open-bio.org >> Date: Wednesday, August 27, 2008, 11:12 AM >> Hello, >> >> I'm following the tutorials to do BLAST queries, >> however, I can't get >> the Blast object to work. >> >> I've downloaded the blast search, saved it in XML, etc, >> as the >> tutorial does. However, when I get to the step where >> I'm trying to get >> actual data out, it fails (the for loops part). >> Here is a simplified version that illustrates the problem: >> >> for x in blast_record.alignments: >> print alignment.title >> >> Traceback (most recent call last): >> File "", line 2, in >> NameError: name 'alignment' is not defined >> >> ----------------------- > > It's a Python coding error. Try: > > print x.title > > Or change the name of your variable in the "for" loop to "alignment". > > > > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From biopython at maubp.freeserve.co.uk Wed Aug 27 21:44:47 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 27 Aug 2008 22:44:47 +0100 Subject: [BioPython] Parsing BLAST In-Reply-To: <4cf37ad00808271401u36316d7eq431552ea3088646e@mail.gmail.com> References: <4cf37ad00808271012k39f795c1q36515207b4dca421@mail.gmail.com> <384507.82825.qm@web65609.mail.ac4.yahoo.com> <4cf37ad00808271401u36316d7eq431552ea3088646e@mail.gmail.com> Message-ID: <320fb6e00808271444h27217aa3hb547a2701420cf4@mail.gmail.com> > I do have an additional request: once I parse these out, I only get 50 > entries. however, if I do the same search online, I get 138... what > accounts for the difference? > > This is my code: > > from Bio import SeqIO > from Bio.Blast import NCBIWWW > from Bio.Blast import NCBIXML > > record = SeqIO.read(open("protein_fasta.txt"), format="fasta") > result_handle = NCBIWWW.qblast("blastp", "nr", record.seq.tostring()) > > blast_records = NCBIXML.parse(result_handle) > blast_record = blast_records.next() > > for x in blast_record.alignments: > print x.title, x.accession, x.length > > acc_list = [] > for x in blast_record.alignments: > acc_list.append(x.accession) > > len(acc_list) tells me 50... > > Is there a default limit somewhere? Yes there is. 
At the python prompt (or in IDLE), try: >>> from Bio.Blast import NCBIWWW >>> help(NCBIWWW.qblast) (You can try this trick on all python objects and functions - although not everything has any help text defined) I think you probably want to override hitlist_size=50, so try changing: result_handle = NCBIWWW.qblast("blastp", "nr", record.seq.tostring()) to: result_handle = NCBIWWW.qblast("blastp", "nr", record.seq.tostring(), hitlist_size=200) Peter From srbanator at heckler-koch.cz Wed Aug 27 21:47:52 2008 From: srbanator at heckler-koch.cz (Pavel SRB) Date: Wed, 27 Aug 2008 23:47:52 +0200 Subject: [BioPython] Parsing BLAST In-Reply-To: <4cf37ad00808271401u36316d7eq431552ea3088646e@mail.gmail.com> References: <4cf37ad00808271012k39f795c1q36515207b4dca421@mail.gmail.com> <384507.82825.qm@web65609.mail.ac4.yahoo.com> <4cf37ad00808271401u36316d7eq431552ea3088646e@mail.gmail.com> Message-ID: <48B5CB88.6040303@heckler-koch.cz> hi alex, when i am accessing data like handle = Entrez.esearch(db="nucleotide", rettype="fasta", retmax=100, email=my_email) i get 100 results, but without i get only 20. When looking into /usr/share/python-support/python-biopython/Bio/EUtils/ThinClient.py there is a default value for retmax set to 20. hope it helps pavel srb Alex Garbino wrote: > Thanks for the help; that was a lot of time wasted for something simple.... > > I do have an additional request: once I parse these out, I only get 50 > entries. however, if I do the same search online, I get 138... what > accounts for the difference? > > This is my code: > > from Bio import SeqIO > from Bio.Blast import NCBIWWW > from Bio.Blast import NCBIXML > > record = SeqIO.read(open("protein_fasta.txt"), format="fasta") > result_handle = NCBIWWW.qblast("blastp", "nr", record.seq.tostring()) > > blast_records = NCBIXML.parse(result_handle) > blast_record = blast_records.next() > > for x in blast_record.alignments: > print x.title, x.accession, x.length > > acc_list = [] > for x in blast_record.alignments: > acc_list.append(x.accession) > > len(acc_list) tells me 50... > > Is there a default limit somewhere? > > Thanks! > Alex > > On Wed, Aug 27, 2008 at 12:27 PM, C. G. wrote: > >> >> --- On Wed, 8/27/08, Alex Garbino wrote: >> >> >>> From: Alex Garbino >>> Subject: [BioPython] Parsing BLAST >>> To: biopython at lists.open-bio.org >>> Date: Wednesday, August 27, 2008, 11:12 AM >>> Hello, >>> >>> I'm following the tutorials to do BLAST queries, >>> however, I can't get >>> the Blast object to work. >>> >>> I've downloaded the blast search, saved it in XML, etc, >>> as the >>> tutorial does. However, when I get to the step where >>> I'm trying to get >>> actual data out, it fails (the for loops part). >>> Here is a simplified version that illustrates the problem: >>> >>> for x in blast_record.alignments: >>> print alignment.title >>> >>> Traceback (most recent call last): >>> File "", line 2, in >>> NameError: name 'alignment' is not defined >>> >>> ----------------------- >>> >> It's a Python coding error. Try: >> >> print x.title >> >> Or change the name of your variable in the "for" loop to "alignment".
>> >> >> >> >> _______________________________________________ >> BioPython mailing list - BioPython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> >> > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From biopython at maubp.freeserve.co.uk Wed Aug 27 21:54:42 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 27 Aug 2008 22:54:42 +0100 Subject: [BioPython] sql create tables script in tutorial In-Reply-To: <48B5BD98.8050101@heckler-koch.cz> References: <48B5BD98.8050101@heckler-koch.cz> Message-ID: <320fb6e00808271454j4a00c2b4j19d89c8c5ec9d4c@mail.gmail.com> On Wed, Aug 27, 2008 at 9:48 PM, Pavel SRB wrote: > hi all, as i am reading through "Basic BioSQL with Biopython" > http://biopython.org/DIST/docs/biosql/python_biosql_basic.html Those are old and haven't been updated. See below... > after executing > >>>> db = server.new_database("cold") > > i have got an "XXX.biodatabase' doesn't exist" error. You must have skipped over section "3.1 Prerequisites" which does say it assumes you have installed a database, a python binding to this database, and loaded the BioSQL schema into the database. > At "BioSQL" http://www.biopython.org/wiki/BioSQL > i have found out sql create table batch script > > mysql -u root bioseqdb < biosqldb-mysql.sql > > Maybe it should also be in the first tutorial, maybe not. Just mentioning > it. I'm glad http://www.biopython.org/wiki/BioSQL is proving useful at least. Due to an accident of history, the source for python_biosql_basic.html and python_biosql_basic.pdf currently lives in BioSQL's SVN repository rather than in Biopython's (which has led to them being more complicated to update). Do you think it is worth trying to fully update those documents, or just add a link at the top of those documents directing people to http://biopython.org/wiki/BioSQL instead? Peter From p.j.a.cock at googlemail.com Wed Aug 27 22:03:44 2008 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 27 Aug 2008 23:03:44 +0100 Subject: [BioPython] Bio.Entrez and retmax Message-ID: <320fb6e00808271503ydf662c8p8e4755e59bb423f7@mail.gmail.com> Hi Pavel, I've change the subject/title as this isn't about BLAST anymore. On Wed, Aug 27, 2008 at 10:47 PM, Pavel SRB wrote: > hi alex, when i am accessing data like > > handle = Entrez.esearch(db="nucleotide", rettype="fasta", retmax=100, > email=my_email) > > i get 100 results, but without i get only 20. If you look in /usr/share/python-support/python-biopython/Bio/Entrez/__init__.py or online at http://biopython.org/SRC/biopython/Bio/Entrez/__init__.py you'll see the function esearch doesn't set a default value. I guess that means the NCBI defaults to giving you only 20 unless you ask for more. > When looking into > /usr/share/python-support/python-biopython/Bio/EUtils/ThinClient.py > there is defaul value for retmax set to 20. Bio.EUtils is separate from Bio.Entrez, but they both give access to the NCBI Entrez Utilities. You should ignore Bio.EUtils as it will be deprecated in the next release of Biopython. 
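In practice the simplest fix is to always pass retmax explicitly. A rough sketch (untested, and the search term here is just a placeholder - use whatever you are actually searching for):

from Bio import Entrez
# Ask the NCBI for up to 100 IDs, rather than relying on their server-side default of 20:
handle = Entrez.esearch(db="nucleotide", term="Opuntia[orgn]", retmax=100)
record = Entrez.read(handle)
print "Total matches:", record["Count"]
print "IDs returned:", len(record["IdList"])

The same idea applies to the other EUtils functions - if the number of records returned matters to you, say so explicitly rather than depending on whatever the NCBI default happens to be.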
Peter From srbanator at heckler-koch.cz Wed Aug 27 22:16:10 2008 From: srbanator at heckler-koch.cz (Pavel SRB) Date: Thu, 28 Aug 2008 00:16:10 +0200 Subject: [BioPython] sql create tables script in tutorial In-Reply-To: <48B5BD98.8050101@heckler-koch.cz> References: <48B5BD98.8050101@heckler-koch.cz> Message-ID: <48B5D22A.2060806@heckler-koch.cz> as i was reading tutorial cookbook and i have reached sql, by some chance i have started by reading http://biopython.org/DIST/docs/biosql/python_biosql_basic.html do not know why, when there is direct link to http://www.biopython.org/wiki/BioSQL just did :o) thanks for explanation, On Wed, Aug 27, 2008 at 9:48 PM, Pavel SRB wrote: > hi all, as i am reading through "Basic BioSQL with Biopython" > http://biopython.org/DIST/docs/biosql/python_biosql_basic.html Those are old and haven't been updated. See below... > after executing > >>>> db = server.new_database("cold") > > i have got an "XXX.biodatabase' doesn't exist" error. You must have skipped over section "3.1 Prerequisites" which does say it assumes you have installed a database, a python binding to this database, and loaded the BioSQL schema into the database. > At "BioSQL" http://www.biopython.org/wiki/BioSQL > i have found out sql create table batch script > > mysql -u root bioseqdb < biosqldb-mysql.sql > > Maybe it should also be in the first tutorial, maybe not. Just mentioning > it. I'm glad http://www.biopython.org/wiki/BioSQL is proving useful at least. Due to an accident of history, the source for python_biosql_basic.html and python_biosql_basic.pdf currently lives in BioSQL's SVN repository rather than in Biopython's (which has led to them being more complicated to update). Do you think it is worth trying to fully update those documents, or just add a link at the top of those documents directing people to http://biopython.org/wiki/BioSQL instead? Peter From biopython at maubp.freeserve.co.uk Wed Aug 27 22:27:21 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 27 Aug 2008 23:27:21 +0100 Subject: [BioPython] sql create tables script in tutorial In-Reply-To: <320fb6e00808271454j4a00c2b4j19d89c8c5ec9d4c@mail.gmail.com> References: <48B5BD98.8050101@heckler-koch.cz> <320fb6e00808271454j4a00c2b4j19d89c8c5ec9d4c@mail.gmail.com> Message-ID: <320fb6e00808271527haf574fcv9adcd4ab4f964d87@mail.gmail.com> On Wed, Aug 27, 2008 at 10:54 PM, Peter wrote: >> hi all, as i am reading through "Basic BioSQL with Biopython" >> http://biopython.org/DIST/docs/biosql/python_biosql_basic.html > > Those are old and haven't been updated. ... > Due to an accident of history, the source for python_biosql_basic.html > and python_biosql_basic.pdf currently lives in BioSQL's SVN repository > rather than in Biopython's (which has led to them being more > complicated to update). I've just updated http://biopython.org/DIST/docs/biosql/python_biosql_basic.pdf and python_biosql_basic.html with the latest version from BioSQL's SVN repository - viewable here if anyone is interested: http://code.open-bio.org/svnweb/index.cgi/biosql/browse/biosql-schema/trunk/doc/biopython/ This does now include a link to the wiki page http://www.biopython.org/wiki/BioSQL but there are still several things I think need fixing or updating (e.g. using Bio.SeqIO instead of Bio.GenBank). 
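For example, the loading section could probably be brought in line with Bio.SeqIO along these lines (an untested sketch - the MySQL account details and the GenBank filename are just placeholders, and this assumes the BioSQL schema has already been loaded into the database as described on the wiki):

from Bio import SeqIO
from BioSQL import BioSeqDatabase
# Connect to an existing "bioseqdb" database which already has the BioSQL schema loaded:
server = BioSeqDatabase.open_database(driver="MySQLdb", user="root", passwd="", host="localhost", db="bioseqdb")
db = server.new_database("cold")
# Load SeqRecord objects straight from Bio.SeqIO rather than via the old Bio.GenBank iterator:
db.load(SeqIO.parse(open("my_records.gb"), "genbank"))
# (depending on your MySQL settings you may also need to commit the transaction here)

Something along those lines would at least stop the documents recommending the older parser.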
Peter From srbanator at heckler-koch.cz Thu Aug 28 08:06:51 2008 From: srbanator at heckler-koch.cz (Pavel SRB) Date: Thu, 28 Aug 2008 10:06:51 +0200 Subject: [BioPython] development question In-Reply-To: <48B5BD98.8050101@heckler-koch.cz> References: <48B5BD98.8050101@heckler-koch.cz> Message-ID: <48B65C9B.4000407@heckler-koch.cz> hi all, please i have a question about your development settings. example: at my work we keep all code in svn repository. Each developer checkout the code, work on it, after every code edit i restart my apache-prefork and then see the results in browser, log or whatever. so now to biopython. On my system i have biopython from debian repository via apt-get. But i would like to have second version of biopython in system just to check, log and change the code to learn more. This can be done with removing sys.path.remove("/var/lib/python-support/python2.5") and importing Bio from some other development directory. But this way i lose all modules in directory mentioned above and i believe it can be done more clearly so how you are coding your biopython? thanks for advice pavel srb From biopython at maubp.freeserve.co.uk Thu Aug 28 09:53:18 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 28 Aug 2008 10:53:18 +0100 Subject: [BioPython] development question In-Reply-To: <48B65C9B.4000407@heckler-koch.cz> References: <48B5BD98.8050101@heckler-koch.cz> <48B65C9B.4000407@heckler-koch.cz> Message-ID: <320fb6e00808280253p5524a04aw2ab2e8791c0c5eec@mail.gmail.com> On Thu, Aug 28, 2008 at 9:06 AM, Pavel SRB wrote: > hi all, please i have a question about your development settings. > > example: at my work we keep all code in svn repository. Each developer > checkout the code, work on it, after every code edit i restart my > apache-prefork and then see the results in browser, log or whatever. Biopython currently uses CVS, but we will hopefully be transitioning to SVN shortly (most of the other Bio* projects have already moved over). > so now to biopython. On my system i have biopython from debian repository > via apt-get. But i would like to have second version of biopython in system > just to check, log and change the code to learn more. This can be done with > removing sys.path.remove("/var/lib/python-support/python2.5") > and importing Bio from some other development directory. But this way i > lose all modules in directory mentioned above and i believe it can be done > more clearly > > so how you are coding your biopython? Since you asked about Debian, I'll talk about my Linux machine which is currently running Ubuntu Dapper Drake (which I know is overdue for an update, but it works fine for me). The official Biopython packages were too out of date for me, so I uninstalled them and instead stay up to date with CVS which I install under my home directory using "python setup.py install --prefix=/home/maubp". Then, to make sure my python packages (installed in my home directory) get priority over the system level packages, I set the PYTHONPATH environment variable. As I use bash, I just added this to my .bashrc file: # Tell Python about my locally installed Python Modules: export PYTHONPATH="/home/maubp/lib/python2.4/site-packages" (Getting IDLE to use my local packages is harder - I have a hack solution but it's not very nice) Alternatively, in any individual python script you can do "import sys" and then manipulate sys.path before doing any "import Bio" statements.
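For example, at the top of a quick test script you could do something like this (the path is just my own install location - adjust it to wherever you put the CVS version):

import sys
# Make sure the copy in my home directory is found before the system-wide package:
sys.path.insert(0, "/home/maubp/lib/python2.4/site-packages")
import Bio
print Bio.__file__  # sanity check which installation actually got imported

Printing Bio.__file__ makes it obvious whether you are really testing the copy you think you are.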
If you want to have both the Debian (old) Biopython and the latest CVS Biopython, I suggest you use apt-get or equivalent to install the official Debian Biopython AND install CVS biopython from source in your home directory (using something like "python setup.py --prefix=/home/pavel" according to your user name). You can then change your python path environment variable to switch between the two installations. However, having both an old and a new Biopython could be very confusing - so I personally wanted to avoid this. Peter From agarbino at gmail.com Thu Aug 28 18:51:47 2008 From: agarbino at gmail.com (Alex Garbino) Date: Thu, 28 Aug 2008 13:51:47 -0500 Subject: [BioPython] Parsing BLAST In-Reply-To: <320fb6e00808271444h27217aa3hb547a2701420cf4@mail.gmail.com> References: <4cf37ad00808271012k39f795c1q36515207b4dca421@mail.gmail.com> <384507.82825.qm@web65609.mail.ac4.yahoo.com> <4cf37ad00808271401u36316d7eq431552ea3088646e@mail.gmail.com> <320fb6e00808271444h27217aa3hb547a2701420cf4@mail.gmail.com> Message-ID: <4cf37ad00808281151t1024f4cdv840be0ff81fbcce8@mail.gmail.com> Thanks for all the help! I'm now almost done. My script is to take a fasta file, run blast, and output a comma-separated-values list in the following format: AccessionID, Source, Length, FASTA sequence. I have one last issue: How do I get the fasta sequence out? I can easily get the raw sequence, but I need it in fasta format. I left a couple of things I've tried from tutorials commented out at the bottom, in case it helps. My csv output may also need help, depending on how the Fasta output behaves in a csv... from Bio import SeqIO from Bio.Blast import NCBIWWW from Bio.Blast import NCBIXML from Bio import Entrez #Open file to blast file = "protein.txt" #In fasta format #Blast, save copy record = SeqIO.read(open(file), format="fasta") result_handle = NCBIWWW.qblast("blastp", "nr", record.seq.tostring(), hitlist_size=1) #Don't hit the servers hard until ready blast_results = result_handle.read() save_file = open(file[:-4]+".xml", "w") save_file.write(blast_results) save_file.close() result_handle = open(file[:-4]+".xml") #Load the blast record blast_records = NCBIXML.parse(result_handle) blast_record = blast_records.next() output = {} for x in blast_record.alignments: output[x.accession] = [x.length] for x in output: handle = Entrez.efetch(db="protein", id=x, rettype="genbank") record = SeqIO.parse(handle, "genbank") recurd = record.next() output[x].insert(0, recurd.id) output[x].insert(1, recurd.annotations["source"]) #SeqIO.write(recurd, output[x].extend, "fasta") """ handle2 = Entrez.efetch(db="protein", id=x, rettype="fasta") recurd2 = SeqIO.read(handle2, "fasta") output[x].extend = [recurd2.seq.tostring()] """ print output save_file = open(file[:-4]+".csv", "w") #Generate CSV for item in output: save_file.write('%s,%s,%s\n' % (output[item][0],output[item][1],output[item][2])) #save_file.write('%s,%s,%s\n' % (output[item][0],output[item][1],output[item][2],output[item][3]) (When Fasta works) save_file.close() -------------------------- Thanks! 
Alex From biopython at maubp.freeserve.co.uk Fri Aug 29 15:04:58 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 29 Aug 2008 16:04:58 +0100 Subject: [BioPython] Parsing BLAST In-Reply-To: <4cf37ad00808281151t1024f4cdv840be0ff81fbcce8@mail.gmail.com> References: <4cf37ad00808271012k39f795c1q36515207b4dca421@mail.gmail.com> <384507.82825.qm@web65609.mail.ac4.yahoo.com> <4cf37ad00808271401u36316d7eq431552ea3088646e@mail.gmail.com> <320fb6e00808271444h27217aa3hb547a2701420cf4@mail.gmail.com> <4cf37ad00808281151t1024f4cdv840be0ff81fbcce8@mail.gmail.com> Message-ID: <320fb6e00808290804i6e552fcdk52db836a89eee946@mail.gmail.com> On Thu, Aug 28, 2008 at 7:51 PM, Alex Garbino wrote: > Thanks for all the help! > I'm now almost done. My script is to take a fasta file, run blast, and > output a comma-separated-values list in the following format: > AccessionID, Source, Length, FASTA sequence. FASTA sequence format looks like this: >name and description CATACGACTACGTCAACGATCCGAACT GACTACGATCAGCATCGACTAGCTGTG GTGTGGT >name2 and second sequence description AGCGACAGCGACGAGCAGCGACGAG AGCGAGC It's not something you can squeeze into a comma separated file. I think you might just mean getting the sequence itself - or have two files (one CSV, one FASTA). Peter From agarbino at gmail.com Fri Aug 29 15:39:22 2008 From: agarbino at gmail.com (Alex Garbino) Date: Fri, 29 Aug 2008 10:39:22 -0500 Subject: [BioPython] Parsing BLAST In-Reply-To: <320fb6e00808290804i6e552fcdk52db836a89eee946@mail.gmail.com> References: <4cf37ad00808271012k39f795c1q36515207b4dca421@mail.gmail.com> <384507.82825.qm@web65609.mail.ac4.yahoo.com> <4cf37ad00808271401u36316d7eq431552ea3088646e@mail.gmail.com> <320fb6e00808271444h27217aa3hb547a2701420cf4@mail.gmail.com> <4cf37ad00808281151t1024f4cdv840be0ff81fbcce8@mail.gmail.com> <320fb6e00808290804i6e552fcdk52db836a89eee946@mail.gmail.com> Message-ID: <4cf37ad00808290839o205a607k486ddd13a65ed276@mail.gmail.com> >> I'm now almost done. My script is to take a fasta file, run blast, and >> output a comma-separated-values list in the following format: >> AccessionID, Source, Length, FASTA sequence. > > FASTA sequence format looks like this: > >>name and description > CATACGACTACGTCAACGATCCGAACT > GACTACGATCAGCATCGACTAGCTGTG > GTGTGGT >>name2 and second sequence description > AGCGACAGCGACGAGCAGCGACGAG > AGCGAGC > > Its not something you can squeeze into a comma separared file. I > think you might just mean getting the sequence itself - or have two > files (one CVS, one FASTA). > > Peter > That's the problem I'm having... I want to keep FASTA format (so I can plug it into ClustalW, etc), which is difficult to do because of the newline after the fasta title. Manually in excel, I could fit the whole FASTA into a cell, I think it was converted to a string (when I copy-pasted it into clustalw, it would be in " "). Is there a way to ignore the newline between description and sequence?
Thanks, Alex From agarbino at gmail.com Fri Aug 29 16:10:00 2008 From: agarbino at gmail.com (Alex Garbino) Date: Fri, 29 Aug 2008 11:10:00 -0500 Subject: [BioPython] Parsing BLAST In-Reply-To: <4cf37ad00808290839o205a607k486ddd13a65ed276@mail.gmail.com> References: <4cf37ad00808271012k39f795c1q36515207b4dca421@mail.gmail.com> <384507.82825.qm@web65609.mail.ac4.yahoo.com> <4cf37ad00808271401u36316d7eq431552ea3088646e@mail.gmail.com> <320fb6e00808271444h27217aa3hb547a2701420cf4@mail.gmail.com> <4cf37ad00808281151t1024f4cdv840be0ff81fbcce8@mail.gmail.com> <320fb6e00808290804i6e552fcdk52db836a89eee946@mail.gmail.com> <4cf37ad00808290839o205a607k486ddd13a65ed276@mail.gmail.com> Message-ID: <4cf37ad00808290910i719aa046i13de5d5816e9a7e3@mail.gmail.com> Assuming I just stick to making the plain sequence the 4th variable (instead of in fasta format), how should I add it to my dictionary? Doing: output[x].extend(record.seq.tostring()) Will add each letter individually, so each entry has a few hundred elements, rather than the forth element being the full string. join() doesn't seem to be it... Thanks, Alex On Fri, Aug 29, 2008 at 10:39 AM, Alex Garbino wrote: >>> I'm now almost done. My script is to take a fasta file, run blast, and >>> output a comma-separated-values list in the following format: >>> AccessionID, Source, Length, FASTA sequence. >> >> FASTA sequence format looks like this: >> >>>name and description >> CATACGACTACGTCAACGATCCGAACT >> GACTACGATCAGCATCGACTAGCTGTG >> GTGTGGT >>>name2 and second sequence description >> AGCGACAGCGACGAGCAGCGACGAG >> AGCGAGC >> >> Its not something you can squeeze into a comma separared file. I >> think you might just mean getting the sequence itself - or have two >> files (one CVS, one FASTA). >> >> Peter >> > > That's the problem I'm having... I want to keep FASTA format (so I can > plug it into ClustalW, etc), which is difficult to do because of the > newline after the fasta title. > Manually in excel, I could fit the whole FASTA into a cell, I think it > was converted to a string (when I copy-pasted it into clustalw, it > would be in " "). > Is there a way to ignore the newline between description and sequence? > > Thanks, > Alex > From biopython at maubp.freeserve.co.uk Fri Aug 29 16:13:59 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 29 Aug 2008 17:13:59 +0100 Subject: [BioPython] Parsing BLAST In-Reply-To: <4cf37ad00808290839o205a607k486ddd13a65ed276@mail.gmail.com> References: <4cf37ad00808271012k39f795c1q36515207b4dca421@mail.gmail.com> <384507.82825.qm@web65609.mail.ac4.yahoo.com> <4cf37ad00808271401u36316d7eq431552ea3088646e@mail.gmail.com> <320fb6e00808271444h27217aa3hb547a2701420cf4@mail.gmail.com> <4cf37ad00808281151t1024f4cdv840be0ff81fbcce8@mail.gmail.com> <320fb6e00808290804i6e552fcdk52db836a89eee946@mail.gmail.com> <4cf37ad00808290839o205a607k486ddd13a65ed276@mail.gmail.com> Message-ID: <320fb6e00808290913j2ea5c48eo420ed8d2c0691e85@mail.gmail.com> Alex wrote: > That's the problem I'm having... I want to keep FASTA format (so I can > plug it into ClustalW, etc), which is difficult to do because of the > newline after the fasta title. If you want to put it into FASTA format, you can't use a CSV file (unless you use embedded \n notation but I don't see how that would help). You could record the name and the sequence in your CSV file and later extract these into a FASTA file for use with ClustalW. I do still suggest you write out two files, a FASTA file and a separate CSV file containing the sequence if you want this too. 
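Something along these lines should do it (untested, and the filenames are made up - swap in your own):

from Bio import SeqIO
records = list(SeqIO.parse(open("my_hits.fasta"), "fasta"))
# One proper FASTA file, fine for ClustalW and friends:
fasta_handle = open("for_clustalw.fasta", "w")
SeqIO.write(records, fasta_handle, "fasta")
fasta_handle.close()
# Plus a simple CSV file with the sequence as a plain string in the last column:
csv_handle = open("summary.csv", "w")
for record in records:
    csv_handle.write("%s,%i,%s\n" % (record.id, len(record.seq), record.seq.tostring()))
csv_handle.close()

That way the FASTA file stays valid for other tools, while the CSV file still has everything in one row per record.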
Peter From freeman at stanfordalumni.org Fri Aug 29 21:33:09 2008 From: freeman at stanfordalumni.org (Ted Larson Freeman) Date: Fri, 29 Aug 2008 14:33:09 -0700 Subject: [BioPython] New user question: biopython compatability with EPD Message-ID: <5d8729f00808291433x4e29b980je7948a8a72238734@mail.gmail.com> I'm using the Enthought Python Distribution, which contains many libraries including numpy. Reading the requirements for biopython here: http://biopython.org/wiki/Download#Required_Software I see that biopython requires Numerical Python, an older version of numpy. Can I install Numerical Python alongside numpy and use them both? Thanks. Ted From matzke at berkeley.edu Fri Aug 29 22:21:48 2008 From: matzke at berkeley.edu (Nick Matzke) Date: Fri, 29 Aug 2008 15:21:48 -0700 Subject: [BioPython] New user question: biopython compatability with EPD In-Reply-To: <5d8729f00808291433x4e29b980je7948a8a72238734@mail.gmail.com> References: <5d8729f00808291433x4e29b980je7948a8a72238734@mail.gmail.com> Message-ID: <48B8767C.4000109@berkeley.edu> I did exactly this and it worked fine... (Enthought, then Numerical, then biopython) Ted Larson Freeman wrote: > I'm using the Enthought Python Distribution, which contains many > libraries including numpy. Reading the requirements for biopython > here: > http://biopython.org/wiki/Download#Required_Software > > I see that biopython requires Numerical Python, an older version of > numpy. Can I install Numerical Python alongside numpy and use them > both? > > Thanks. > > Ted > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- ==================================================== Nicholas J. Matzke Ph.D. student, Graduate Student Researcher Huelsenbeck Lab Center for Theoretical Evolutionary Genomics 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab websites: http://ib.berkeley.edu/people/lab_detail.php?lab=54 http://fisher.berkeley.edu/cteg/hlab.html Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html Lab phone: 510-643-6299 Dept. fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Office hours for Bio1B, Spring 2008: Biology: Plants, Evolution, Ecology VLSB 2013, Monday 1-1:30 (some TA there for all hours during work week) Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ==================================================== From mjldehoon at yahoo.com Sat Aug 30 02:45:31 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 29 Aug 2008 19:45:31 -0700 (PDT) Subject: [BioPython] Bio.MetaTool Message-ID: <46010.36121.qm@web62405.mail.re1.yahoo.com> Hi everybody, Is anybody using the Bio.MetaTool module? If not, can we deprecate it? The Bio.MetaTool tests suggest that this module was written for MetaTool version 3.5 (28.03.2001), while the most current MetaTool version is at 5.0. Since MetaTool is written for Matlab/Octave, and it seems to be out of data, I expect that few people are using it with Python. Currently, Bio.MetaTool is the only non-deprecated module in Biopython that uses Martel. If we can deprecate Bio.MetaTool, then (over time) we can deprecate Martel, which means that Biopython won't need the mxTextTools any more, making Biopython's installation a lot easier. --Michiel. 
From mjldehoon at yahoo.com Sat Aug 30 02:47:54 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 29 Aug 2008 19:47:54 -0700 (PDT) Subject: [BioPython] NumPy Message-ID: <128888.36737.qm@web62405.mail.re1.yahoo.com> Hi everybody, Previously we discussed on the developer's mailing list whether Biopython should adopt the "new" Numerical Python (aka NumPy, currently at version 1.1.1) instead of the "old" Numerical Python (version 24.2). My objections against NumPy were that its documentation is not freely available, it doesn't compile cleanly on all platforms, and some other scientific and computational biology libraries use the old Numerical Python. Last week, the NumPy documentation did become freely available. Compilation of NumPy is still not perfect on all platforms (e.g. on Cygwin it may fail), however recently I have also noticed that compilation of the "old" Numerical Python may fail on modern systems. As far as I can tell, MMTK and PyMOL are (still?) based on the "old" Numerical Python, but Matplotlib now relies on the "new" Numerical Python. In my opinion, the balance is now tilting in favor of the new NumPy, and we should consider transitioning Biopython to the new NumPy. Does anybody have a strong preference for the "old" Numerical Python? --Michiel. From freeman at stanfordalumni.org Sat Aug 30 16:50:17 2008 From: freeman at stanfordalumni.org (Ted Larson Freeman) Date: Sat, 30 Aug 2008 09:50:17 -0700 Subject: [BioPython] NumPy In-Reply-To: <128888.36737.qm@web62405.mail.re1.yahoo.com> References: <128888.36737.qm@web62405.mail.re1.yahoo.com> Message-ID: <5d8729f00808300950k469e4b08w978f5614e6fc3d00@mail.gmail.com> As a newcomer, I would be in favor of updating to numpy, because it would make biopython appear (to other newcomers) to be a current, active project. How much development work would be necessary to switch? Ted On Fri, Aug 29, 2008 at 7:47 PM, Michiel de Hoon wrote: > Hi everybody, > > Previously we discussed on the developer's mailing list whether Biopython should adopt the "new" Numerical Python (aka NumPy, currently at version 1.1.1) instead of the "old" Numerical Python (version 24.2). My objections against NumPy were that its documentation is not freely available, it doesn't compile cleanly on all platforms, and some other scientific and computational biology libraries use the old Numerical Python. > > Last week, the NumPy documentation did become freely available. Compilation of NumPy is still not perfect on all platforms (e.g. on Cygwin it may fail), however recently I have also noticed that compilation of the "old" Numerical Python may fail on modern systems. As far as I can tell, MMTK and PyMOL are (still?) based on the "old" Numerical Python, but Matplotlib now relies on the "new" Numerical Python. > > In my opinion, the balance is now tilting in favor of the new NumPy, and we should consider transitioning Biopython to the new NumPy. Does anybody have a strong preference for the "old" Numerical Python? > > --Michiel. 
> > > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > > From bsouthey at gmail.com Sat Aug 30 18:30:27 2008 From: bsouthey at gmail.com (Bruce Southey) Date: Sat, 30 Aug 2008 13:30:27 -0500 Subject: [BioPython] NumPy In-Reply-To: <128888.36737.qm@web62405.mail.re1.yahoo.com> References: <128888.36737.qm@web62405.mail.re1.yahoo.com> Message-ID: On Fri, Aug 29, 2008 at 9:47 PM, Michiel de Hoon wrote: > Hi everybody, > > Previously we discussed on the developer's mailing list whether Biopython should adopt the "new" Numerical Python (aka NumPy, currently at version 1.1.1) instead of the "old" Numerical Python (version 24.2). My objections against NumPy were that its documentation is not freely available, it doesn't compile cleanly on all platforms, and some other scientific and computational biology libraries use the old Numerical Python. Actually NumPy is doing a 1.2 release and may be one to watch. Also NumPy will be using Nose for testing so if not installed you can not run the tests. I did not find Travis book that much different from existing documentation. In anay case, you can get it at: http://svn.scipy.org/svn/numpy/trunk/doc/numpybook/ http://www.tramy.us/numpybook.pdf I would also point out the huge NumPy 'Marathon': http://scipy.org/Developer_Zone/DocMarathon2008 http://sd-2116.dedibox.fr/pydocweb/wiki/Front%20Page/ > > Last week, the NumPy documentation did become freely available. Compilation of NumPy is still not perfect on all platforms (e.g. on Cygwin it may fail), however recently I have also noticed that compilation of the "old" Numerical Python may fail on modern systems. As far as I can tell, MMTK and PyMOL are (still?) based on the "old" Numerical Python, but Matplotlib now relies on the "new" Numerical Python. The Cygwin has a major bug that is not due to NumPy. But I am sure the NumPy developers would like to know any compilation problems. > > In my opinion, the balance is now tilting in favor of the new NumPy, and we should consider transitioning Biopython to the new NumPy. Does anybody have a strong preference for the "old" Numerical Python? > Actually, how critical is having Numerical Python in BioPython in the first place? Is there a case to remove functionality or have special sub-modules? One rather critical aspect is that both NumPy and Matplotlib also require Python 2.4. Regards Bruce From mjldehoon at yahoo.com Sun Aug 31 08:39:08 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sun, 31 Aug 2008 01:39:08 -0700 (PDT) Subject: [BioPython] NumPy In-Reply-To: <5d8729f00808300950k469e4b08w978f5614e6fc3d00@mail.gmail.com> Message-ID: <431678.62092.qm@web62405.mail.re1.yahoo.com> --- On Sat, 8/30/08, Ted Larson Freeman wrote: > As a newcomer, I would be in favor of updating to numpy, > because it would make biopython appear (to other > newcomers) to be a current, active project. > > How much development work would be necessary to switch? > It's not so bad. The biggest one is Bio.Cluster. This module is also available separate from Biopython as Pycluster. Its latest version already uses NumPy. Most of the other modules use Numerical Python at the Python level, which is much easier to fix. I am more worried about the portability of NumPy. A while back there were some installation problems with Numerical Python on some platforms. This caused a lot of user questions, since they weren't able to install it. 
Hence my comment about NumPy sometimes failing to build on Cygwin. --Michiel. From mjldehoon at yahoo.com Sun Aug 31 09:01:31 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sun, 31 Aug 2008 02:01:31 -0700 (PDT) Subject: [BioPython] NumPy In-Reply-To: Message-ID: <917875.96314.qm@web62408.mail.re1.yahoo.com> > Actually NumPy is doing a 1.2 release and may be one to > watch. OK maybe we should wait until release 1.2. Though from what I understood, NumPy-dependent code won't have to be changed going from Numpy 1.1 to 1.2. > Also NumPy will be using Nose for testing so if > not installed you can not run the tests. These kinds of things I find really annoying about NumPy. Such kind of basic libraries should only rely on run-of-the-mill Python. > The Cygwin has a major bug that is not due to NumPy. But I > am sure the NumPy developers would like to know any > compilation problems. I filed this bug report about 10 months ago: http://projects.scipy.org/scipy/numpy/ticket/612 As I just found out, this bug was fixed a while back. I should try and see if NumPy compiles correctly on Cygwin now. > Actually, how critical is having Numerical Python in > BioPython in the first place? Numerical Python is now used by the following modules: Bio.Affy Bio.Cluster Bio.MarkovModel Bio.distance Bio.KDTree Bio.kNN Bio.LogisticRegression Bio.MaxEntropy Bio.MetaTool Bio.NaiveBayes Bio.PDB Bio.Statistics Bio.SVDSuperimposer In this list, Bio.Cluster and Bio.PDB are the biggest ones. The other ones, IMHO, are not the core functionality of Biopython. While I wouldn't just want to get rid of them, we have some more flexibility there. Bio.Cluster also exists as a separate library (Pycluster), so it's not a complete disaster if Bio.Cluster disappears. However, Bio.PDB is a serious issue. One option to consider is to allow Numerical Python / NumPy only at the Python level, and not at the C level. Then these modules can be written such that they try to import (the new) NumPy first, and failing that, try to import (the old) Numerical Python instead. --Michiel. > Is there a case to remove functionality or have special > sub-modules? > > One rather critical aspect is that both NumPy and > Matplotlib also > require Python 2.4. > > Regards > Bruce From biopython at maubp.freeserve.co.uk Sun Aug 31 15:28:49 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 31 Aug 2008 16:28:49 +0100 Subject: [BioPython] Parsing BLAST In-Reply-To: <4cf37ad00808290910i719aa046i13de5d5816e9a7e3@mail.gmail.com> References: <4cf37ad00808271012k39f795c1q36515207b4dca421@mail.gmail.com> <384507.82825.qm@web65609.mail.ac4.yahoo.com> <4cf37ad00808271401u36316d7eq431552ea3088646e@mail.gmail.com> <320fb6e00808271444h27217aa3hb547a2701420cf4@mail.gmail.com> <4cf37ad00808281151t1024f4cdv840be0ff81fbcce8@mail.gmail.com> <320fb6e00808290804i6e552fcdk52db836a89eee946@mail.gmail.com> <4cf37ad00808290839o205a607k486ddd13a65ed276@mail.gmail.com> <4cf37ad00808290910i719aa046i13de5d5816e9a7e3@mail.gmail.com> Message-ID: <320fb6e00808310828t777f7768oc7a3a6616cc1ba75@mail.gmail.com> On Fri, Aug 29, 2008 at 5:10 PM, Alex Garbino wrote: > Assuming I just stick to making the plain sequence the 4th variable > (instead of in fasta format), how should I add it to my dictionary? > Doing: > > output[x].extend(record.seq.tostring()) > > Will add each letter individually, so each entry has a few hundred > elements, rather than the forth element being the full string. join() > doesn't seem to be it... 
Assuming output is a dictionary whose elements are lists, try output[x].append(record.seq.tostring()) You need to read about the difference between the append and extend methods of a list in python. Peter From biopython at maubp.freeserve.co.uk Sun Aug 31 15:34:46 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 31 Aug 2008 16:34:46 +0100 Subject: [BioPython] New user question: biopython compatability with EPD In-Reply-To: <48B8767C.4000109@berkeley.edu> References: <5d8729f00808291433x4e29b980je7948a8a72238734@mail.gmail.com> <48B8767C.4000109@berkeley.edu> Message-ID: <320fb6e00808310834j6166a176id05ce51e43973f9b@mail.gmail.com> Ted Larson Freeman wrote: >> I see that biopython requires Numerical Python, an older version of >> numpy. Yes, although we are planning to move - it's just we have some C-code that uses numeric/numpy so this isn't a trivial switch. >> Can I install Numerical Python alongside numpy and use them both? Yes, I've got both on several machines and they coexist happily (using standard python). Nick confirmed this worked for him using Enthought's bundle. Peter From krewink at inb.uni-luebeck.de Thu Aug 28 09:14:37 2008 From: krewink at inb.uni-luebeck.de (Albert Krewinkel) Date: Thu, 28 Aug 2008 09:14:37 -0000 Subject: [BioPython] development question In-Reply-To: <48B65C9B.4000407@heckler-koch.cz> References: <48B5BD98.8050101@heckler-koch.cz> <48B65C9B.4000407@heckler-koch.cz> Message-ID: <20080828090431.GD5801@inb.uni-luebeck.de> Hi Pavel, On Thu, Aug 28, 2008 at 10:06:51AM +0200, Pavel SRB wrote: > so now to biopython. On my system i have biopython from debian repository > via apt-get. But i would like to have second version of biopython in system > just to check, log and change the code to learn more. This can be done with > removing sys.path.remove("/var/lib/python-support/python2.5") > and importing Bio from some other development directory. But this way i > lose all modules in directory mentioned above and i believe it can be > done more clearly An easy way would be to just add the path to your biopython svn-version to the _front_ of the sys.path list: sys.path = ['/your/path/to/biopython/'] + sys.path Please note, however, that this isn't really a biopython related question, so you might be better off asking in a general python forum/newsgroup/mailing-list. Cheers, Albert -- Albert Krewinkel University of Luebeck, Institute for Neuro- and Bioinformatics http://www.inb.uni-luebeck.de/~krewink/