From anaryin at gmail.com Sat Dec 1 15:02:53 2007 From: anaryin at gmail.com (João Rodrigues) Date: Sat, 1 Dec 2007 20:02:53 +0000 Subject: [BioPython] GenBank and raw_input() Message-ID: Hello all! I'm trying to code a small function that uses the GenBank.search_for() method but I can't get it to work with raw_input(). I tried using input and then converting to str, tried to create a raw string and then concatenate with my raw_input string, but nothing works. I keep having an error in urllib2 (probably because the link isn't properly built). Any ideas? Thanks in advance! João Rodrigues From biopython at maubp.freeserve.co.uk Mon Dec 3 07:26:34 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 3 Dec 2007 12:26:34 +0000 Subject: [BioPython] GenBank and raw_input() In-Reply-To: References: Message-ID: <320fb6e00712030426v74424b43w4925c814751c7431@mail.gmail.com> On Dec 1, 2007 8:02 PM, João Rodrigues wrote: > Hello all! > > I'm trying to code a small function that uses the GenBank.search_for() > method but I can't get it to work with raw_input(). I tried using input and > then converting to str, tried to create a raw string and then concatenate > with my raw_input string, nothing works.. I keep having an error in the > urllib2 (probably because the link isn't properly built). > > Any ideas? Can you get GenBank.search_for() to work fine with a predefined search term? When you are using raw_input() to get the user to type in some search terms, have you tried stripping off any whitespace (new lines, spaces), as that might cause problems? If you could show us a short example that doesn't work it would be easier to try and help. Peter From matthew.neilson at utoledo.edu Mon Dec 3 10:32:39 2007 From: matthew.neilson at utoledo.edu (Matthew Neilson) Date: Mon, 3 Dec 2007 10:32:39 -0500 Subject: [BioPython] Biopython and sequence trace files... 
Message-ID: <464c3d980712030732q5bb16ccas3927132668cc973f@mail.gmail.com> Hi, This question might be better suited for the development list, but here goes anyway. Are there any facilities in Biopython to read/write information from sequencing trace files (e.g., .abi, .scf, .ztr, etc)? I know that Bioperl has a way of utilizing the Staden io_lib, and I was hoping for the same thing in Python. Has anyone been able to convert io_lib into a Python module, or could someone point me towards resources that would help me to do this? Thanks in advance. -Matt -- Matt Neilson Graduate Research Assistant Great Lakes Genetics Lab Lake Erie Center-University of Toledo 6200 Bayshore Rd. Oregon, OH 43618 Lab: (419) 530-8370 Fax: (419) 530-8399 matthew.neilson at utoledo.edu From tiagoantao at gmail.com Mon Dec 3 16:48:15 2007 From: tiagoantao at gmail.com (Tiago Antao) Date: Mon, 3 Dec 2007 21:48:15 +0000 (WET) Subject: [BioPython] Population genetics code example application Message-ID: Hi, For anyone interested, we have developed a selection detection application based on the code currently available in the PopGen module. You can find it here: http://popgen.eu/soft/selwb/ It is actually a Jython application. In fact the code developed for this application served as the base for what is now the PopGen module (still a very small module, but coalescent simulation and basic statistics are on the way). If you have any problems with the application, just send me an email. Tiago From luca.beltrame at unimi.it Tue Dec 4 05:19:42 2007 From: luca.beltrame at unimi.it (Luca Beltrame) Date: Tue, 04 Dec 2007 11:19:42 +0100 Subject: [BioPython] Adding new database types to EUtils Message-ID: <200712041119.46997.luca.beltrame@unimi.it> Hello. 
I've been trying to use EUtils to run some queries through NCBI, but apparently GEO isn't present in the database list defined by EUtils: In [8]: EUtils.databases Out[8]: {'gene': , 'genome': , 'journals': , 'nucleotide': , 'omim': , 'popset': , 'protein': , 'pubmed': , 'sequences': , 'unigene': } Therefore queries using the DBIdsClient method search() trying to use GEO, such as this one: from Bio.EUtils import DBIdsClient client = DBIdsClient.DBIdsClient() test_search = client.search("GSE4830",db="geo") will fail with KeyError (because it's not defined). How can I extend EUtils.databases to add support for GEO? I've looked a bit at the class definitions in the API, and I'm not sure how to proceed. Any hints would be greatly appreciated. Thanks. -- Luca Beltrame, MSc. - Molecular Medicine PhD Student Dipartimento di Scienze e Tecnologie Biomediche - UniMI CNR - Institute of Biomedical Technologies Research Fellow -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part. Url : http://lists.open-bio.org/pipermail/biopython/attachments/20071204/fa18da1a/attachment.bin From biopython at maubp.freeserve.co.uk Tue Dec 4 06:16:36 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 4 Dec 2007 11:16:36 +0000 Subject: [BioPython] Adding new database types to EUtils In-Reply-To: <200712041119.46997.luca.beltrame@unimi.it> References: <200712041119.46997.luca.beltrame@unimi.it> Message-ID: <320fb6e00712040316i35978f0dlc6ec52c8a904f986@mail.gmail.com> > Hello. > I've been trying to use EUtils to run some queries through NCBI, but > apparently GEO isn't present in the database list defined by [Biopython's] EUtils: I guess the first thing to do is double check that the NCBI EUtils API will support GEO files, and then see if you can manage to fetch anything "by hand". 
It is very simple to construct a URL by hand to fetch a GEO file directly (bypassing EUtils). Once you have downloaded the GEO files, what do you plan to do with them? Biopython's GEO parser is very basic... Peter P.S. If you use R/BioConductor, I would recommend Sean Davis' GEOquery for this sort of thing. From luca.beltrame at unimi.it Tue Dec 4 06:21:12 2007 From: luca.beltrame at unimi.it (Luca Beltrame) Date: Tue, 04 Dec 2007 12:21:12 +0100 Subject: [BioPython] Adding new database types to EUtils In-Reply-To: <320fb6e00712040316i35978f0dlc6ec52c8a904f986@mail.gmail.com> References: <200712041119.46997.luca.beltrame@unimi.it> <320fb6e00712040316i35978f0dlc6ec52c8a904f986@mail.gmail.com> Message-ID: <200712041221.16990.luca.beltrame@unimi.it> On Tuesday 04 December 2007 12:16:36 Peter wrote: > Once you have downloaded the GEO files, what do you plan to do with them? > Biopython's GEO parser is very basic... It was mostly to check their basic description to see if they were feasible to be included in my current work. As I have a large list of accessions, fetching them all at once would reduce the time needed to go through them. To be more clear, downloading their summary. > P.S. If you use R/BioConductor, I would recommend Sean Davis' GEOquery > for this sort of thing. I mostly use it when I need to download data set information and expression levels. For this simpler task, I turned to Python first as GEOquery has some performance issues on my machine. I'll take a look at NCBI's EUtils and see if they support GEO. Thanks for the tip. -- Luca Beltrame, MSc. - Molecular Medicine PhD Student Dipartimento di Scienze e Tecnologie Biomediche - UniMI CNR - Institute of Biomedical Technologies Research Fellow E-mail: luca.beltrame at unimi.it - Phone: +39-02-50320924 -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part. 
Url : http://lists.open-bio.org/pipermail/biopython/attachments/20071204/29cf5f97/attachment.bin From sdavis2 at mail.nih.gov Tue Dec 4 08:35:13 2007 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Tue, 4 Dec 2007 08:35:13 -0500 Subject: [BioPython] Adding new database types to EUtils In-Reply-To: <200712041221.16990.luca.beltrame@unimi.it> References: <200712041119.46997.luca.beltrame@unimi.it> <320fb6e00712040316i35978f0dlc6ec52c8a904f986@mail.gmail.com> <200712041221.16990.luca.beltrame@unimi.it> Message-ID: <264855a00712040535l2911e009t48e4e6f483461ac5@mail.gmail.com> On Dec 4, 2007 6:21 AM, Luca Beltrame wrote: > Il Tuesday 04 December 2007 12:16:36 Peter ha scritto: > > > Once you have downloaded the GEO files, what do you plan to do with > them? > > Biopython's GEO parser is very basic... > > It was mostly to check their basic description to see if they were > feasible to > be included in my current work. As I have a large list of accessions, > fetching them all at once would reduce the time needed to go through them. > To > be more clear, downloading their summary. > > > P.S. If you use R/BioConductor, I would recommend Sean Davis' GEOquery > > for this sort of thing. > > I mostly use it when I need to download data set information and > expression > levels. For this simpler task, I turned to Python first as GEOquery has > some > performance issues on my machine. > > I'll take a look at NCBI's EUils and see if they support GEO. Thanks for > the > tip. Thought I would chime in here. GEOquery definitely does have some performance issues, some of which I have addressed in the most recent release. I have thought about making a python-based version, but I find R a much more compelling framework for statistical computing and array-based analyses, despite such tools as Rpy and numpy. Usage of GEOquery also requires a bit of understanding of the formats used by GEO, as some of them are monstrously large. 
My goal with GEOquery was to allow full parsing of even the monstrous files. However, GEO has recently released a GSEMatrix format (which GEOquery now handles) that is much faster and easier to parse (meant specifically for Excel to load), so the largest performance issue, parsing GSE SOFT files, is now pretty much gone. EUtils support is, as far as I know, pretty limited for GEO. Data download is best accomplished via ftp, generally. However, if one wants only Metadata (and not values), then URLs can be constructed against their web page to get back various formats, including SOFT and, in some cases, XML. I'm not sure that exactly the same functionality is available via Eutils, but I think not. Obviously, GEOquery is open-source and I continue to develop it if there is a need (and in response to changes by NCBI), so feedback is appreciated. Also, if there are improvements on the GEO side that would improve its utility, the folks at GEO do take comments and suggestions pretty seriously, so feel free to pass comments on to them (or to me and I will do the same). Sean From luca.beltrame at unimi.it Tue Dec 4 08:49:36 2007 From: luca.beltrame at unimi.it (Luca Beltrame) Date: Tue, 04 Dec 2007 14:49:36 +0100 Subject: [BioPython] Adding new database types to EUtils In-Reply-To: <264855a00712040535l2911e009t48e4e6f483461ac5@mail.gmail.com> References: <200712041119.46997.luca.beltrame@unimi.it> <200712041221.16990.luca.beltrame@unimi.it> <264855a00712040535l2911e009t48e4e6f483461ac5@mail.gmail.com> Message-ID: <200712041449.39100.luca.beltrame@unimi.it> On Tuesday 04 December 2007 14:35:13 you wrote: > release. I have thought about making a python-based version, but I find R > a much more compelling framework for statistical computing and array-based I think it is mostly a matter of personal preference. I turned to Python (but I have been using GEOquery in the past) because I like the language more than R. 
> Metadata (and not values), then URLs can be constructed against their web I guess I did not make the statement clear enough in my original mail. Yes, I meant to fetch only the metadata because I wanted to gather the experiment descriptions from all the accessions I had (a rather large number) in order to look through them without having to query for each one. I will try looking at the queries via web and see if I can write something useful (although I still think that, as basic as it is, it would be nice to have EUtils GEO support in Bio.EUtils, at least for the metadata). > I'm not sure that exactly the same functionality is available via Eutils, > but I think not. I have played a bit with EUtils, but I haven't yet been able to use esearch to work with a GEO accession. Since I have just looked at them briefly, I can't guarantee it was just a mistake on my part, though. -- Luca Beltrame, MSc. - Molecular Medicine PhD Student Dipartimento di Scienze e Tecnologie Biomediche - UniMI CNR - Institute of Biomedical Technologies Research Fellow -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part. 
Url : http://lists.open-bio.org/pipermail/biopython/attachments/20071204/efe6ed21/attachment-0001.bin From mdehoon at c2b2.columbia.edu Fri Dec 7 22:18:09 2007 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Sat, 08 Dec 2007 12:18:09 +0900 Subject: [BioPython] [Biopython-dev] Accessing ExPASy through Bio.SwissProt /Bio.SeqIO In-Reply-To: <320fb6e00712070246g53e8096ew156f4502791bce9b@mail.gmail.com> References: <6243BAA9F5E0D24DA41B27997D1FD14402B66F@mail2.exch.c2b2.columbia.edu> <320fb6e00712040226o7ecda7e2g9fb124b3a52de026@mail.gmail.com> <6243BAA9F5E0D24DA41B27997D1FD14402B670@mail2.exch.c2b2.columbia.edu> <475691C1.3020705@maubp.freeserve.co.uk> <6243BAA9F5E0D24DA41B27997D1FD14402B673@mail2.exch.c2b2.columbia.edu> <320fb6e00712070246g53e8096ew156f4502791bce9b@mail.gmail.com> Message-ID: <475A0CF1.1080802@c2b2.columbia.edu> Peter wrote: > I would add a note saying doing it this way gives > Bio.SwissProt.SProt.Record objects, > while you could alternatively get SeqRecord objects as described in > the SeqIO chapter > (use a reference). OK I will add that. > > I'd suggested a Bio.SeqIO function, with a name like parse1() or > parse_sole() etc which > would return a single SeqRecord - and raise an error if the handle > didn't contain one > and only one record. We could call this function read() if you prefer. > I'd prefer read() instead of parse1(), parse_sole() etc. for the following reasons: 1) Having two names that are clearly different emphasizes the fact that they return different things (parse() returns an iterator, read() a record). 2) Some modules deal with data that always consist of one record (for example, gene expression data in case of Bio.Cluster). Such modules can have a read() function but not a parse(). It would feel strange if a module has a parse1() function but not a parse(). --Michiel. 
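The parse()/read() distinction Michiel describes above can be sketched with the standard library alone. This is only an illustration under stated assumptions: `parse_records` and `read_record` are hypothetical names, the FASTA-like parser is a toy, and none of this is Biopython's actual implementation. It shows one function returning an iterator over records and another returning exactly one record, raising an error for zero or several.

```python
from io import StringIO


def parse_records(handle):
    """Yield (title, sequence) tuples from a FASTA-like handle (toy parser)."""
    title, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


def read_record(handle):
    """Return the sole record; raise ValueError for zero or several records."""
    records = parse_records(handle)
    try:
        first = next(records)
    except StopIteration:
        raise ValueError("No records found in handle") from None
    try:
        next(records)
    except StopIteration:
        return first
    raise ValueError("More than one record found in handle")


if __name__ == "__main__":
    # Hypothetical single-record input, purely for illustration.
    print(read_record(StringIO(">example\nACGT\nTTGA\n")))
```

Keeping the two names distinct, as argued in the message above, makes it hard to confuse the iterator-returning call with the single-record call.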
From p.j.a.cock at googlemail.com Sat Dec 8 08:10:33 2007 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 8 Dec 2007 13:10:33 +0000 Subject: [BioPython] [Biopython-dev] Bio.SeqIO function to read a single record Message-ID: <320fb6e00712080510k3d4e5148gb0ec332a0d745452@mail.gmail.com> Michiel de Hoon wrote: > > > > I'd suggested a Bio.SeqIO function, with a name like parse1() or > > parse_sole() etc which would return a single SeqRecord - and raise > > an error if the handle didn't contain one and only one record. We > > could call this function read() if you prefer. > > > I'd prefer read() instead of parse1(), parse_sole() etc. for the > following reasons: > > 1) Having two names that are clearly different emphasizes the fact that > they return different things (parse() returns an iterator, read() a record). > > 2) Some modules deal with data that always consist of one record (for > example, gene expression data in case of Bio.Cluster). Such modules can > have a read() function but not a parse(). It would feel strange if a > module has a parse1() function but not a parse(). OK. I've filed an enhancement bug, which I'll mention on the main mailing list, http://bugzilla.open-bio.org/show_bug.cgi?id=2417 Unless there is some negative feedback, I'll add that functionality shortly. 
Peter From biopython at maubp.freeserve.co.uk Sat Dec 8 08:20:35 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 8 Dec 2007 13:20:35 +0000 Subject: [BioPython] Bio.SeqIO and files with one record In-Reply-To: <4693E5FE.708@maubp.freeserve.co.uk> References: <4693E5FE.708@maubp.freeserve.co.uk> Message-ID: <320fb6e00712080520jdd1a06dka1a8bfe03d69a1fd@mail.gmail.com> In July 2007, Peter wrote: > Dear Biopython people, > > I'd like a little feedback on the Bio.SeqIO module - in particular, one > situation I think could be improved is when dealing with sequences files > which contain a single record - for example a very simple Fasta file, or > a chromosome in a GenBank file. > > http://www.biopython.org/wiki/SeqIO > > The shortest way to get this one record as a SeqRecord object is probably: > > from Bio import SeqIO > record = SeqIO.parse(open("example.gbk"), "genbank").next() > > This works, assuming there is at least one record, but will not trigger > any error if there was more than one record - something you may want to > check. > > Do any of you think this situation is common enough to warrant adding > another function to Bio.SeqIO to do this for you (raising errors for no > records or more than one record). My suggestions for possible names > include parse_single, parse_one, parse_sole, parse_individual and mono_parse We had a few other name suggestions including "parse_the_only_one" from Martin, which, while nice and clear, is very long. Over on the dev-mailing list, Michiel suggested we call this the "read" function, which seems sensible. I've filed an enhancement bug for this whole issue: Bugzilla Bug 2417 - Bio.SeqIO single SeqRecord read/parse function http://bugzilla.open-bio.org/show_bug.cgi?id=2417 I think the general consensus was this functionality could be useful, but perhaps not to everyone. In fact it turns out to be very helpful when parsing records downloaded from the internet - which I hadn't pointed out earlier. 
I plan to add this new functionality as a "read" function - unless anyone here wants to add anything... Thanks, Peter From e.picardi at unical.it Sat Dec 8 12:39:01 2007 From: e.picardi at unical.it (Ernesto) Date: Sat, 8 Dec 2007 18:39:01 +0100 Subject: [BioPython] GFF parser Message-ID: <13F18780-3F4D-4DC6-A830-F8751552BA53@unical.it> Hi all, can biopython handle GFF files? And GTFs? Many thanks, Ernesto -------------------------------------------------------- Dr Ernesto Picardi, PhD Dept. of Biochemistry and Molecular Biology University of Bari Italy E-mail: e.picardi at unical.it -------------------------------------------------------- From biopython at maubp.freeserve.co.uk Sun Dec 9 10:53:46 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 9 Dec 2007 15:53:46 +0000 Subject: [BioPython] GFF parser In-Reply-To: <13F18780-3F4D-4DC6-A830-F8751552BA53@unical.it> References: <13F18780-3F4D-4DC6-A830-F8751552BA53@unical.it> Message-ID: <320fb6e00712090753t219029c5lf1a7c820bff125fc@mail.gmail.com> Ernesto wrote: > Hi all, > > can biopython handle GFF files? And GTFs? > > Many thanks, > > Ernesto Hi Ernesto, The short answer is that no, Biopython does not (currently) handle GFF files. We do have a module, Bio.GFF, which is designed to work with a MySQL database containing GFF data, which you must first set up using BioPerl. However, Bio.GFF does not work with the GFF or GTF files directly. 
You are not alone in wanting this sort of functionality - for example earlier this year Giovanni Marco Dall'Olio asked about GFF files on this mailing list: http://lists.open-bio.org/pipermail/biopython/2007-June/003522.html Peter From sdavis2 at mail.nih.gov Sun Dec 9 13:42:58 2007 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Sun, 9 Dec 2007 13:42:58 -0500 Subject: [BioPython] GFF parser In-Reply-To: <320fb6e00712090753t219029c5lf1a7c820bff125fc@mail.gmail.com> References: <13F18780-3F4D-4DC6-A830-F8751552BA53@unical.it> <320fb6e00712090753t219029c5lf1a7c820bff125fc@mail.gmail.com> Message-ID: <264855a00712091042y541fb565sc5fd112948411ac8@mail.gmail.com> On Dec 9, 2007 10:53 AM, Peter wrote: > Ernesto wrote: > > Hi all, > > > > can biopython handle GFF files? And GTFs? > > > > Many thanks, > > > > Ernesto > > Hi Ernesto, > > The short answer is that no, Biopython does not (currently) handle GFF > files. > > We do have a module, Bio.GFF which is designed to work with an MySQL > database containing GFF data, which you must first setup using > BioPerl. However, Bio.GFF does not work with the GFF or GTF files > directly. > > You are not alone in wanting this sort of functionality - for example > earlier this year Giovanni Marco Dall'Olio asked about GFF files on > this mailing list: > > http://lists.open-bio.org/pipermail/biopython/2007-June/003522.html > You might take a look at this project (http://g2.trac.bx.psu.edu/ ). The code is available for download. It might have some GFF parsing abilities that could be hacked to do what you want. An alternative is to load the data using perl and access using Bio.GFF. If I recall, there is basically a single perl script for loading GFF data in to a database. Just some ideas. 
Sean From mmokrejs at ribosome.natur.cuni.cz Mon Dec 10 12:11:18 2007 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Mon, 10 Dec 2007 18:11:18 +0100 Subject: [BioPython] [Biopython-dev] Bio.SeqIO function to read a single record In-Reply-To: <320fb6e00712080510k3d4e5148gb0ec332a0d745452@mail.gmail.com> References: <320fb6e00712080510k3d4e5148gb0ec332a0d745452@mail.gmail.com> Message-ID: <475D7336.9070009@ribosome.natur.cuni.cz> Peter, so how about parse_one()? Or having a pair parse_file() and parse_entry()? M. Peter Cock wrote: > Michiel de Hoon wrote: >>> I'd suggested a Bio.SeqIO function, with a name like parse1() or >>> parse_sole() etc which would return a single SeqRecord - and raise >>> an error if the handle didn't contain one and only one record. We >>> could call this function read() if you prefer. >>> >> I'd prefer read() instead of parse1(), parse_sole() etc. for the >> following reasons: >> >> 1) Having two names that are clearly different emphasizes the fact that >> they return different things (parse() returns an iterator, read() a record). >> >> 2) Some modules deal with data that always consist of one record (for >> example, gene expression data in case of Bio.Cluster). Such modules can >> have a read() function but not a parse(). It would feel strange if a >> module has a parse1() function but not a parse(). > > OK. I've filed an enhancement bug, which I'll mention on the main mailing list, > http://bugzilla.open-bio.org/show_bug.cgi?id=2417 > Unless there is some negative feedback, I'll add that functionality shortly. From karin.lagesen at medisin.uio.no Thu Dec 13 07:30:43 2007 From: karin.lagesen at medisin.uio.no (Karin Lagesen) Date: Thu, 13 Dec 2007 13:30:43 +0100 Subject: [BioPython] alignment alphabet problem - upper/lower case? Message-ID: I have an alignment that I read in with : alignment = Clustalw.parse_file(infile, alphabet=IUPAC.IUPACAmbiguousDNA()) This works fine with upper case alignments. 
However, I am now working with alignments that are in lower case. I would still like to use the same alphabet, but I don't know how to get it to accept lower case? The error I get is: File "/usr/local/python//lib/python2.4/site-packages/Bio/Align/AlignInfo.py", line 389, in pos_specific_score_matrix raise ValueError("Residue %s not found in alphabet %s" ValueError: Residue c not found in alphabet Gapped(IUPACAmbiguousDNA(), '-') Thanks! Karin -- Karin Lagesen, PhD student karin.lagesen at medisin.uio.no http://folk.uio.no/karinlag From biopython at maubp.freeserve.co.uk Thu Dec 13 09:11:02 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 13 Dec 2007 14:11:02 +0000 Subject: [BioPython] alignment alphabet problem - upper/lower case? In-Reply-To: References: Message-ID: <320fb6e00712130611q4dd1ee63p9f29c2774f447eb2@mail.gmail.com> Karin Lagesen wrote: > I have an alignment that I read in with : > > alignment = Clustalw.parse_file(infile, alphabet=IUPAC.IUPACAmbiguousDNA()) > > This works fine with upper case alignments. > > However, I am now working with alignments that are in lower case. I > would still like to use the same alphabet, but I don't know how to get > it to accept lower case? You can't use the *same* alphabet, because IUPAC.IUPACAmbiguousDNA() is explicitly defined with upper case letters. You need to use a different alphabet - either a generic DNA alphabet where the letters are not specified, or create a lower case equivalent of IUPAC.IUPACAmbiguousDNA(). Peter From dtomso at athenixcorp.com Thu Dec 20 15:08:47 2007 From: dtomso at athenixcorp.com (Daniel Tomso) Date: Thu, 20 Dec 2007 15:08:47 -0500 Subject: [BioPython] BioSQL problems on load Message-ID: Hi- This is possibly a simple configuration problem, but I'm having some problems loading files into BioSQL. 
Here's the code I'm using:

from BioSQL import BioSeqDatabase
from Bio import SeqIO
import sys

sfilename = sys.argv[1]

server = BioSeqDatabase.open_database(driver = 'MySQLdb',
    user = 'pythonapi',
    passwd = 'xxxxxxx',
    host = 'localhost',
    db = 'bioseqdb',
    )

contigdb = server.new_database('test')
sfile = open(sfilename, 'rU')
contigdb.load(SeqIO.parse(sfile, 'genbank'))
sfile.close()

And here's the error message:

/usr/lib/python2.5/site-packages/Bio/crc.py:5: DeprecationWarning: Bio.crc is deprecated; use crc32 and crc64 in Bio.SeqUtils.CheckSum instead
  warnings.warn("Bio.crc is deprecated; use crc32 and crc64 in Bio.SeqUtils.CheckSum instead", DeprecationWarning)
Traceback (most recent call last):
  File "biosql_driver.py", line 26, in
    contigdb.load(SeqIO.parse(sfile, 'genbank'))
  File "/home/dtomso/repository/biopython/build/lib.linux-i686-2.5/BioSQL/BioSeqDatabase.py", line 412, in load
    db_loader.load_seqrecord(cur_record)
  File "/usr/lib/python2.5/site-packages/BioSQL/Loader.py", line 30, in load_seqrecord
    bioentry_id = self._load_bioentry_table(record)
  File "/usr/lib/python2.5/site-packages/BioSQL/Loader.py", line 253, in _load_bioentry_table
    version))
  File "/home/dtomso/repository/biopython/build/lib.linux-i686-2.5/BioSQL/BioSeqDatabase.py", line 277, in execute
    self.cursor.execute(sql, args or ())
  File "/usr/lib/python2.5/site-packages/MySQLdb/cursors.py", line 149, in execute
    query = query % db.literal(args)
TypeError: not all arguments converted during string formatting

This happens with standard genbank and fasta files pulled off of NCBI. Any suggestions?

There's another issue regarding standard parsing of accession numbers to get version IDs (the code doesn't like non-NCBI fasta headers, e.g. those produced by phrap), but that is pretty minor and doesn't seem to be related to this.

Thanks for any ideas!

Dan Tomso
----------------------------------------
Daniel J. 
Tomso Senior Scientist Athenix Corporation PO Box 110347 Research Triangle Park, NC 27709 ---------------------------------------- 919.328.4122 dtomso at athenixcorp.com www.athenixcorp.com ---------------------------------------- Disclaimer: This message (including any attachments) may contain confidential or privileged information and is intended only for the use of the addressee named above. If you are not the intended recipient of this message, you are hereby notified that you must not use, copy, disclose or take any action based on this message or information herein. If you have received this message in error, please advise the sender immediately and erase all copies of this message and any related attachments. Thank you. From biopython at maubp.freeserve.co.uk Fri Dec 21 12:56:08 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 21 Dec 2007 18:56:08 +0100 Subject: [BioPython] BioSQL problems on load In-Reply-To: References: Message-ID: <320fb6e00712210956l6541cf7o482e9c7bf4c7db85@mail.gmail.com> > This is possibly a simple configuration problem, but I'm having some > problems loading files into BioSQL. > > ... > > warnings.warn("Bio.crc is deprecated; use crc32 and crc64 in > Bio.SeqUtils.CheckSum instead", DeprecationWarning) We should fix that - I hadn't seen that warning being triggered on my machine. > File > "/home/dtomso/repository/biopython/build/lib.linux-i686-2.5/BioSQL/BioSe > qDatabase.py", line 277, in execute > > self.cursor.execute(sql, args or ()) > > File "/usr/lib/python2.5/site-packages/MySQLdb/cursors.py", line 149, > in execute > > query = query % db.literal(args) > > TypeError: not all arguments converted during string formatting > > This happens with standard genbank and fasta files pulled off of NCBI. > Any suggestions? That looks like Bug 2390, http://bugzilla.open-bio.org/show_bug.cgi?id=2390 You didn't say what version of Biopython you are using, but that has been fixed in CVS. 
Could you update to CVS, or wait for the next release (hopefully in January?). > There's another issue regarding standard parsing of accession numbers to > get version IDs (the code doesn't like non-NCBI fasta headers, e.g. > those produced by phrap), but that is pretty minor and doesn't seem to > be related to this. Perhaps the parser or BioSQL can be a bit more robust here. Could you file a bug on this issue, and then attach an example input file with some code showing the problem? Thanks Peter From vmatthewa at gmail.com Wed Dec 26 17:41:55 2007 From: vmatthewa at gmail.com (Matthew Abravanel) Date: Wed, 26 Dec 2007 15:41:55 -0700 Subject: [BioPython] script to extract records from nucleotide database In-Reply-To: <320fb6e00711151311q700a473age0e026b8fcfb5c5@mail.gmail.com> References: <8fc5e4c20711121319q5c1fd46exb584978ec0b9dd76@mail.gmail.com> <473A0849.2020904@biotec.tu-dresden.de> <473A2C5F.5060102@maubp.freeserve.co.uk> <8fc5e4c20711131546g111461bfrd210b68d03f8bcd8@mail.gmail.com> <473A3AE9.9030506@maubp.freeserve.co.uk> <473AC434.4030605@biotec.tu-dresden.de> <473B6C08.1060605@maubp.freeserve.co.uk> <8fc5e4c20711151250y286ff4d4se7875e029e12eceb@mail.gmail.com> <320fb6e00711151311q700a473age0e026b8fcfb5c5@mail.gmail.com> Message-ID: <8fc5e4c20712261441tf0d1887q3455bda40cead942@mail.gmail.com> Hi everyone, Sorry to keep going back to this script I am working on, but I was wondering going back to what Peter said about my installation of Bio-python that I think keeps messing up my code no matter what I do to attempt to fix it. Could I use the "PortsCollection" to install bio-python again since I am using BSD as my OS. I realize that you were only speculation as to the nature of the problem with my install but do you think that might work? Since I am not the one installing Bio-python and it is my system administrator that is doing the installing I should talk to them about it? Thanks. 
Matthew On Nov 15, 2007 2:11 PM, Peter wrote: > > Thanks for all the comments, so since you said you removed Bio.FormatIOin > > version 1.44 and replaced it with Bio.SeqIO do you think I can still > > successfully use that code I was given if I have 1.44 provided I watch > out > > for bugs and so on? > > Assuming you can apply the fix for Bug 2393, then that > Bio.GenBank.NCBIDictionary code should work fine with Biopython 1.44. > > There is also a related example in the SeqIO chapter of the tutorial > using the Bio.GenBank.download_many() function. > > > What is the difference between Bio.FormatIO and > > Bio.SeqIO, other then them describing file formats differently? > > In terms of typical end use, Bio.SeqIO and Bio.FormatIO provided > similar capabilities, but FormatIO wasn't very up to date in terms of > its format support. The big differences are internal. For any new > code, please try Bio.SeqIO (available in Biopython 1.43 onwards), > which is described in the tutorial and the wiki: > http://biopython.org/wiki/SeqIO > > > Also how exactly could one have a partial installation, some of the > package not > > installing? > > This was a guess - there is/was clearly something odd about your > install. If you installed from source, maybe some step failed part > way leaving you with only some parts installed. Another possibility > is on BSD is there is something different about the installation paths > which is confusing things. We haven't worked out what went wrong on > your system so I'm was just speculating. > > Peter > From dtomso at athenixcorp.com Fri Dec 28 10:50:25 2007 From: dtomso at athenixcorp.com (Daniel Tomso) Date: Fri, 28 Dec 2007 10:50:25 -0500 Subject: [BioPython] BioSQL problems on load In-Reply-To: <320fb6e00712210956l6541cf7o482e9c7bf4c7db85@mail.gmail.com> References: <320fb6e00712210956l6541cf7o482e9c7bf4c7db85@mail.gmail.com> Message-ID: Thanks-- I'll get the last bug logged ASAP. 
Regarding the main problem--I'm pretty sure I'm up-to-date via CVS--I have persistent configuration problems on this Ubuntu box. I suspect right now that something is digging down some alternate hierarchy and coming up with this error, even though the correct versions are lurking somewhere else. I'll dig it up eventually. Thanks again. DT -----Original Message----- From: p.j.a.cock at googlemail.com [mailto:p.j.a.cock at googlemail.com] On Behalf Of Peter Sent: Friday, December 21, 2007 12:56 PM To: Daniel Tomso Cc: biopython at biopython.org Subject: Re: [BioPython] BioSQL problems on load > This is possibly a simple configuration problem, but I'm having some > problems loading files into BioSQL. > > ... > > warnings.warn("Bio.crc is deprecated; use crc32 and crc64 in > Bio.SeqUtils.CheckSum instead", DeprecationWarning) We should fix that - I hadn't seen that warning being triggered on my machine. > File > "/home/dtomso/repository/biopython/build/lib.linux-i686-2.5/BioSQL/BioSe > qDatabase.py", line 277, in execute > > self.cursor.execute(sql, args or ()) > > File "/usr/lib/python2.5/site-packages/MySQLdb/cursors.py", line 149, > in execute > > query = query % db.literal(args) > > TypeError: not all arguments converted during string formatting > > This happens with standard genbank and fasta files pulled off of NCBI. > Any suggestions? That looks like Bug 2390, http://bugzilla.open-bio.org/show_bug.cgi?id=2390 You didn't say what version of Biopython you are using, but that has been fixed in CVS. Could you update to CVS, or wait for the next release (hopefully in January?). > There's another issue regarding standard parsing of accession numbers to > get version IDs (the code doesn't like non-NCBI fasta headers, e.g. > those produced by phrap), but that is pretty minor and doesn't seem to > be related to this. Perhaps the parser or BioSQL can be a bit more robust here. 
Could you file a bug on this issue, and then attach an example input file with some code showing the problem? Thanks Peter From anaryin at gmail.com Sat Dec 1 20:02:53 2007 From: anaryin at gmail.com (João Rodrigues) Date: Sat, 1 Dec 2007 20:02:53 +0000 Subject: [BioPython] GenBank and raw_input() Message-ID: Hello all! I'm trying to code a small function that uses the GenBank.search_for() method but I can't get it to work with raw_input(). I tried using input and then converting to str, tried to create a raw string and then concatenate with my raw_input string, nothing works.. I keep having an error in the urllib2 (probably because the link isn't properly built). Any ideas? Thanks in advance! João Rodrigues From biopython at maubp.freeserve.co.uk Mon Dec 3 12:26:34 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 3 Dec 2007 12:26:34 +0000 Subject: [BioPython] GenBank and raw_input() In-Reply-To: References: Message-ID: <320fb6e00712030426v74424b43w4925c814751c7431@mail.gmail.com> On Dec 1, 2007 8:02 PM, João Rodrigues wrote: > Hello all! > > I'm trying to code a small function that uses the GenBank.search_for() > method but I can't get it to work with raw_input(). I tried using input and > then converting to str, tried to create a raw string and then concatenate > with my raw_input string, nothing works.. I keep having an error in the > urllib2 (probably because the link isn't properly built). > > Any ideas? Can you get GenBank.search_for() to work fine with a predefined search term? When you are using raw_input() to get the user to type in some search terms, have you tried stripping off any whitespace (new lines, spaces) as that might cause problems. If you could show us a short example that doesn't work it would be easier to try and help.
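A minimal sketch of the whitespace suggestion above, written for Python 3 (where input() replaces raw_input()). The helper names clean_search_term and quote_search_term are hypothetical and not part of Biopython; the sketch only shows stripping and URL-quoting a typed term before it ends up in a query URL, and does not call GenBank.search_for() itself.

```python
from urllib.parse import quote_plus


def clean_search_term(raw):
    """Strip surrounding whitespace and collapse internal runs of whitespace."""
    return " ".join(raw.split())


def quote_search_term(raw):
    """Return the cleaned term escaped for use in a URL query string."""
    return quote_plus(clean_search_term(raw))


# Stand-in for what raw_input()/input() might hand back, including a
# trailing newline and stray spaces (illustrative value only).
typed = "  Opuntia AND  rpl16 \n"
cleaned = clean_search_term(typed)
quoted = quote_search_term(typed)
```

If the cleaned term works where the raw one failed, the stray whitespace was the likely cause of the urllib2 error.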
Peter From matthew.neilson at utoledo.edu Mon Dec 3 15:32:39 2007 From: matthew.neilson at utoledo.edu (Matthew Neilson) Date: Mon, 3 Dec 2007 10:32:39 -0500 Subject: [BioPython] Biopython and sequence trace files... Message-ID: <464c3d980712030732q5bb16ccas3927132668cc973f@mail.gmail.com> Hi, This question might be better suited for the development list, but here goes anyway. Are there any facilities in Biopython to read/write information from sequencing trace files (e.g., .abi, .scf, .ztr, etc). I know that Bioperl has a way of utilizing the Staden io_lib, and I was hoping for the same thing in Python. Has anyone been able to convert io_lib into Python module, or could someone point me towards resources that would help me to do this? Thanks in advance. -Matt -- Matt Neilson Graduate Research Assistant Great Lakes Genetics Lab Lake Erie Center-University of Toledo 6200 Bayshore Rd. Oregon, OH 43618 Lab: (419) 530-8370 Fax: (419) 530-8399 matthew.neilson at utoledo.edu From tiagoantao at gmail.com Mon Dec 3 21:48:15 2007 From: tiagoantao at gmail.com (Tiago Antao) Date: Mon, 3 Dec 2007 21:48:15 +0000 (WET) Subject: [BioPython] Population genetics code example application Message-ID: Hi, For anyone interested, we have developed a selection detection application based on the code that is currently available in the PopGen code. You can find it here: http://popgen.eu/soft/selwb/ It is actually a Jython application. In fact the code developed for this application served as the base for what is now the PopGen module (still, a very small module, but coalescent simulation and basic statistics are on the way). Any problems with the application, just send me an email, Tiago From luca.beltrame at unimi.it Tue Dec 4 10:19:42 2007 From: luca.beltrame at unimi.it (Luca Beltrame) Date: Tue, 04 Dec 2007 11:19:42 +0100 Subject: [BioPython] Adding new database types to EUtils Message-ID: <200712041119.46997.luca.beltrame@unimi.it> Hello. 
I've been trying to use EUtils to do run some queries through NCBI, but apparently GEO isn't present in the database list defined by EUtils: In [8]: EUtils.databases Out[8]: {'gene': , 'genome': , 'journals': , 'nucleotide': , 'omim': , 'popset': , 'protein': , 'pubmed': , 'sequences': , 'unigene': } Therefore queries using the DBIdsClient method search() trying to use GEO, such as this one: from Bio.EUtils import DBIdsClient client = DBIdsClient.DBIdsClient() test_search = client.search("GSE4830",db="geo") will fail with KeyError (because it's not defined). How can I extend EUtils.databases to add support for GEO? I've looked a bit at the class definitions in the API, and I'm not sure on how to proceed. Any hints would be greatly appreciated. Thanks. -- Luca Beltrame, MSc. - Molecular Medicine PhD Student Dipartimento di Scienze e Tecnologie Biomediche - UniMI CNR - Institute of Biomedical Technologies Research Fellow -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part. URL: From biopython at maubp.freeserve.co.uk Tue Dec 4 11:16:36 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 4 Dec 2007 11:16:36 +0000 Subject: [BioPython] Adding new database types to EUtils In-Reply-To: <200712041119.46997.luca.beltrame@unimi.it> References: <200712041119.46997.luca.beltrame@unimi.it> Message-ID: <320fb6e00712040316i35978f0dlc6ec52c8a904f986@mail.gmail.com> > Hello. > I've been trying to use EUtils to do run some queries through NCBI, but > apparently GEO isn't present in the database list defined by [Biopython's] EUtils: I guess the first thing to do is double check that the NCBI EUtils API will support GEO files, and then see if you can manage to fetch anything "by hand". It is very simple to construct a URL by hand to fetch a GEO file directly (bypassing EUtils). 
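A hedged sketch of building such a URL by hand. It assumes GEO's acc.cgi query page and its acc, targ, view and form parameters as described in NCBI's GEO documentation; please check these against the current documentation before relying on them. The helper name geo_metadata_url is hypothetical, and nothing here is part of Bio.EUtils.

```python
from urllib.parse import urlencode

# Assumed endpoint for GEO accession queries (not taken from this thread).
GEO_ACC_URL = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi"


def geo_metadata_url(accession, view="brief"):
    """Build a URL asking GEO for the text-format record of one accession."""
    params = [
        ("acc", accession.strip()),  # e.g. "GSE4830"
        ("targ", "self"),            # only the record itself, no sub-records
        ("view", view),              # "brief" should omit the data tables
        ("form", "text"),            # plain-text (SOFT-style) output
    ]
    return GEO_ACC_URL + "?" + urlencode(params)


url = geo_metadata_url("GSE4830")
```

The resulting string can then be passed to any URL opener; the sketch deliberately stops short of making a network request.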
Once you have downloaded the GEO files, what do you plan to do with them? Biopython's GEO parser is very basic... Peter P.S. If you use R/BioConductor, I would recommend Sean Davis' GEOquery for this sort of thing. From luca.beltrame at unimi.it Tue Dec 4 11:21:12 2007 From: luca.beltrame at unimi.it (Luca Beltrame) Date: Tue, 04 Dec 2007 12:21:12 +0100 Subject: [BioPython] Adding new database types to EUtils In-Reply-To: <320fb6e00712040316i35978f0dlc6ec52c8a904f986@mail.gmail.com> References: <200712041119.46997.luca.beltrame@unimi.it> <320fb6e00712040316i35978f0dlc6ec52c8a904f986@mail.gmail.com> Message-ID: <200712041221.16990.luca.beltrame@unimi.it> On Tuesday 04 December 2007 12:16:36 Peter wrote: > Once you have downloaded the GEO files, what do you plan to do with them? > Biopython's GEO parser is very basic... It was mostly to check their basic description to see if they were suitable for inclusion in my current work. As I have a large list of accessions, fetching them all at once would reduce the time needed to go through them. To be clearer, I mean downloading their summaries. > P.S. If you use R/BioConductor, I would recommend Sean Davis' GEOquery > for this sort of thing. I mostly use it when I need to download data set information and expression levels. For this simpler task, I turned to Python first as GEOquery has some performance issues on my machine. I'll take a look at NCBI's EUtils and see if they support GEO. Thanks for the tip. -- Luca Beltrame, MSc. - Molecular Medicine PhD Student Dipartimento di Scienze e Tecnologie Biomediche - UniMI CNR - Institute of Biomedical Technologies Research Fellow E-mail: luca.beltrame at unimi.it - Phone: +39-02-50320924 -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part.
URL: From sdavis2 at mail.nih.gov Tue Dec 4 13:35:13 2007 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Tue, 4 Dec 2007 08:35:13 -0500 Subject: [BioPython] Adding new database types to EUtils In-Reply-To: <200712041221.16990.luca.beltrame@unimi.it> References: <200712041119.46997.luca.beltrame@unimi.it> <320fb6e00712040316i35978f0dlc6ec52c8a904f986@mail.gmail.com> <200712041221.16990.luca.beltrame@unimi.it> Message-ID: <264855a00712040535l2911e009t48e4e6f483461ac5@mail.gmail.com> On Dec 4, 2007 6:21 AM, Luca Beltrame wrote: > Il Tuesday 04 December 2007 12:16:36 Peter ha scritto: > > > Once you have downloaded the GEO files, what do you plan to do with > them? > > Biopython's GEO parser is very basic... > > It was mostly to check their basic description to see if they were > feasible to > be included in my current work. As I have a large list of accessions, > fetching them all at once would reduce the time needed to go through them. > To > be more clear, downloading their summary. > > > P.S. If you use R/BioConductor, I would recommend Sean Davis' GEOquery > > for this sort of thing. > > I mostly use it when I need to download data set information and > expression > levels. For this simpler task, I turned to Python first as GEOquery has > some > performance issues on my machine. > > I'll take a look at NCBI's EUils and see if they support GEO. Thanks for > the > tip. Thought I would chime in here. GEOquery definitely does have some performance issues, some of which I have addressed in the most recent release. I have thought about making a python-based version, but I find R a much more compelling framework for statistical computing and array-based analyses, despite such tools as Rpy and numpy. Usage of GEOquery also requires a bit of understanding of the formats used by GEO, as some of them are monstrously large. My goal with GEOquery was to allow full parsing of even the monstrous files. 
However, GEO has recently released a GSEMatrix format (which GEOquery now handles) that is much faster and easier to parse (meant specifically for Excel to load), so the largest performance issue, parsing GSE SOFT files, is now pretty much gone. EUtils support is, as far as I know, pretty limited for GEO. Data download is best accomplished via ftp, generally. However, if one wants only Metadata (and not values), then URLs can be constructed against their web page to get back various formats, including SOFT and, in some cases, XML. I'm not sure that exactly the same functionality is available via Eutils, but I think not. Obviously, GEOquery is open-source and I continue to develop it if there is a need (and in response to changes by NCBI), so feedback is appreciated. Also, if there are improvements on the GEO side that would improve its utility, the folks at GEO do take comments and suggestions pretty seriously, so feel free to pass comments on to them (or to me and I will do the same). Sean From luca.beltrame at unimi.it Tue Dec 4 13:49:36 2007 From: luca.beltrame at unimi.it (Luca Beltrame) Date: Tue, 04 Dec 2007 14:49:36 +0100 Subject: [BioPython] Adding new database types to EUtils In-Reply-To: <264855a00712040535l2911e009t48e4e6f483461ac5@mail.gmail.com> References: <200712041119.46997.luca.beltrame@unimi.it> <200712041221.16990.luca.beltrame@unimi.it> <264855a00712040535l2911e009t48e4e6f483461ac5@mail.gmail.com> Message-ID: <200712041449.39100.luca.beltrame@unimi.it> Il Tuesday 04 December 2007 14:35:13 hai scritto: > release. I have thought about making a python-based version, but I find R > a much more compelling framework for statistical computing and array-based I think it is mostly a matter of personal preference. I turned to Python (but I have been using GEOquery in the past) because I like the language more than R. 
> Metadata (and not values), then URLs can be constructed against their web I guess I did not make the statement clear enough in my original mail. Yes, I meant to fetch only the metadata because I wanted to gather the experiment descriptions from all the accessions I had (a rather large number) in order to look through them without having to query for each one. I will try looking at the queries via web and see if I can write something useful (although I still think that, as basic as it is, it would be nice to have EUtils GEO support in Bio.EUtils, at least for the metadata). > I'm not sure that exactly the same functionality is available via Eutils, > but I think not. I have played a bit with EUtils, but I haven't yet been able to use esearch to work with a GEO accession. Since I have just looked at them briefly, I can't guarantee it was just a mistake on my part, though. -- Luca Beltrame, MSc. - Molecular Medicine PhD Student Dipartimento di Scienze e Tecnologie Biomediche - UniMI CNR - Institute of Biomedical Technologies Research Fellow -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part. 
URL: From mdehoon at c2b2.columbia.edu Sat Dec 8 03:18:09 2007 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Sat, 08 Dec 2007 12:18:09 +0900 Subject: [BioPython] [Biopython-dev] Accessing ExPASy through Bio.SwissProt /Bio.SeqIO In-Reply-To: <320fb6e00712070246g53e8096ew156f4502791bce9b@mail.gmail.com> References: <6243BAA9F5E0D24DA41B27997D1FD14402B66F@mail2.exch.c2b2.columbia.edu> <320fb6e00712040226o7ecda7e2g9fb124b3a52de026@mail.gmail.com> <6243BAA9F5E0D24DA41B27997D1FD14402B670@mail2.exch.c2b2.columbia.edu> <475691C1.3020705@maubp.freeserve.co.uk> <6243BAA9F5E0D24DA41B27997D1FD14402B673@mail2.exch.c2b2.columbia.edu> <320fb6e00712070246g53e8096ew156f4502791bce9b@mail.gmail.com> Message-ID: <475A0CF1.1080802@c2b2.columbia.edu> Peter wrote: > I would add a note saying doing it this way gives > Bio.SwissProt.SProt.Record objects, > while you could alternatively get SeqRecord objects as described in > the SeqIO chapter > (use a reference). OK I will add that. > > I'd suggested a Bio.SeqIO function, with a name like parse1() or > parse_sole() etc which > would return a single SeqRecord - and raise an error if the handle > didn't contain one > and only one record. We could call this function read() if you prefer. > I'd prefer read() instead of parse1(), parse_sole() etc. for the following reasons: 1) Having two names that are clearly different emphasizes the fact that they return different things (parse() returns an iterator, read() a record). 2) Some modules deal with data that always consist of one record (for example, gene expression data in case of Bio.Cluster). Such modules can have a read() function but not a parse(). It would feel strange if a module has a parse1() function but not a parse(). --Michiel. 
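The behaviour being discussed (return the one record, raise an error for none or for more than one) can be sketched without Biopython. The function name read_single is hypothetical and this is not the eventual Bio.SeqIO implementation; it accepts any iterator of records, such as the one returned by SeqIO.parse().

```python
def read_single(records):
    """Return the only item from an iterable of records.

    Raises ValueError if the iterable is empty or holds more than one item.
    """
    iterator = iter(records)
    try:
        first = next(iterator)
    except StopIteration:
        raise ValueError("No records found") from None
    try:
        next(iterator)
    except StopIteration:
        # Exactly one record was available, which is the expected case.
        return first
    raise ValueError("More than one record found")


only = read_single(iter(["record_1"]))
```

Keeping this separate from parse() means the iterator-returning function stays unchanged, while callers who expect a single record get an explicit error when that expectation is wrong.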
From p.j.a.cock at googlemail.com Sat Dec 8 13:10:33 2007 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 8 Dec 2007 13:10:33 +0000 Subject: [BioPython] [Biopython-dev] Bio.SeqIO function to read a single record Message-ID: <320fb6e00712080510k3d4e5148gb0ec332a0d745452@mail.gmail.com> Michiel de Hoon wrote: > > > > I'd suggested a Bio.SeqIO function, with a name like parse1() or > > parse_sole() etc which would return a single SeqRecord - and raise > > an error if the handle didn't contain one and only one record. We > > could call this function read() if you prefer. > > > I'd prefer read() instead of parse1(), parse_sole() etc. for the > following reasons: > > 1) Having two names that are clearly different emphasizes the fact that > they return different things (parse() returns an iterator, read() a record). > > 2) Some modules deal with data that always consist of one record (for > example, gene expression data in case of Bio.Cluster). Such modules can > have a read() function but not a parse(). It would feel strange if a > module has a parse1() function but not a parse(). OK. I've filed an enhancement bug, which I'll mention on the main mailing list, http://bugzilla.open-bio.org/show_bug.cgi?id=2417 Unless there is some negative feedback, I'll add that functionality shortly. 
Peter From biopython at maubp.freeserve.co.uk Sat Dec 8 13:20:35 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 8 Dec 2007 13:20:35 +0000 Subject: [BioPython] Bio.SeqIO and files with one record In-Reply-To: <4693E5FE.708@maubp.freeserve.co.uk> References: <4693E5FE.708@maubp.freeserve.co.uk> Message-ID: <320fb6e00712080520jdd1a06dka1a8bfe03d69a1fd@mail.gmail.com> In July 2007, Peter wrote: > Dear Biopython people, > > I'd like a little feedback on the Bio.SeqIO module - in particular, one > situation I think could be improved is when dealing with sequences files > which contain a single record - for example a very simple Fasta file, or > a chromosome in a GenBank file. > > http://www.biopython.org/wiki/SeqIO > > The shortest way to get this one record as a SeqRecord object is probably: > > from Bio import SeqIO > record = SeqIO.parse(open("example.gbk"), "genbank").next() > > This works, assuming there is at least one record, but will not trigger > any error if there was more than one record - something you may want to > check. > > Do any of you think this situation is common enough to warrant adding > another function to Bio.SeqIO to do this for you (raising errors for no > records or more than one record). My suggestions for possible names > include parse_single, parse_one, parse_sole, parse_individual and mono_parse We had a few other name suggestions including "parse_the_only_one" from Martin which while nice and clear is very long. Over on the dev-mailing list, Michiel suggested we call this the "read" function, which seems sensible. I've filed an enhancement bug for this whole issue: Bugzilla Bug 2417 - Bio.SeqIO single SeqRecord read/parse function http://bugzilla.open-bio.org/show_bug.cgi?id=2417 I think the general consensus was this functionality could be useful, but perhaps not to everyone. In fact it turns out to be very helpful when parsing records downloaded from the internet - which I hadn't pointed out earlier.
I plan to add this new functionality as a "read" function - unless anyone here wants to add anything... Thanks, Peter From e.picardi at unical.it Sat Dec 8 17:39:01 2007 From: e.picardi at unical.it (Ernesto) Date: Sat, 8 Dec 2007 18:39:01 +0100 Subject: [BioPython] GFF parser Message-ID: <13F18780-3F4D-4DC6-A830-F8751552BA53@unical.it> Hi all, can biopython handle GFF files? And GTFs? Many thanks, Ernesto -------------------------------------------------------- Dr Ernesto Picardi, PhD Dept. of Biochemistry and Molecular Biology University of Bari Italy E-mail: e.picardi at unical.it -------------------------------------------------------- From biopython at maubp.freeserve.co.uk Sun Dec 9 15:53:46 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 9 Dec 2007 15:53:46 +0000 Subject: [BioPython] GFF parser In-Reply-To: <13F18780-3F4D-4DC6-A830-F8751552BA53@unical.it> References: <13F18780-3F4D-4DC6-A830-F8751552BA53@unical.it> Message-ID: <320fb6e00712090753t219029c5lf1a7c820bff125fc@mail.gmail.com> Ernesto wrote: > Hi all, > > can biopython handle GFF files? And GTFs? > > Many thanks, > > Ernesto Hi Ernesto, The short answer is that no, Biopython does not (currently) handle GFF files. We do have a module, Bio.GFF which is designed to work with an MySQL database containing GFF data, which you must first setup using BioPerl. However, Bio.GFF does not work with the GFF or GTF files directly. 
You are not alone in wanting this sort of functionality - for example earlier this year Giovanni Marco Dall'Olio asked about GFF files on this mailing list: http://lists.open-bio.org/pipermail/biopython/2007-June/003522.html Peter From sdavis2 at mail.nih.gov Sun Dec 9 18:42:58 2007 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Sun, 9 Dec 2007 13:42:58 -0500 Subject: [BioPython] GFF parser In-Reply-To: <320fb6e00712090753t219029c5lf1a7c820bff125fc@mail.gmail.com> References: <13F18780-3F4D-4DC6-A830-F8751552BA53@unical.it> <320fb6e00712090753t219029c5lf1a7c820bff125fc@mail.gmail.com> Message-ID: <264855a00712091042y541fb565sc5fd112948411ac8@mail.gmail.com> On Dec 9, 2007 10:53 AM, Peter wrote: > Ernesto wrote: > > Hi all, > > > > can biopython handle GFF files? And GTFs? > > > > Many thanks, > > > > Ernesto > > Hi Ernesto, > > The short answer is that no, Biopython does not (currently) handle GFF > files. > > We do have a module, Bio.GFF which is designed to work with an MySQL > database containing GFF data, which you must first setup using > BioPerl. However, Bio.GFF does not work with the GFF or GTF files > directly. > > You are not alone in wanting this sort of functionality - for example > earlier this year Giovanni Marco Dall'Olio asked about GFF files on > this mailing list: > > http://lists.open-bio.org/pipermail/biopython/2007-June/003522.html > You might take a look at this project (http://g2.trac.bx.psu.edu/ ). The code is available for download. It might have some GFF parsing abilities that could be hacked to do what you want. An alternative is to load the data using perl and access using Bio.GFF. If I recall, there is basically a single perl script for loading GFF data in to a database. Just some ideas. 
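For simple cases, a GFF or GTF file can be read without a dedicated module, since each feature line has nine tab-separated columns. The following is only a sketch under that assumption: the names GFFRecord and parse_gff_line are hypothetical, not Biopython API, and the attributes column is left as a raw string because its syntax differs between GFF and GTF variants.

```python
from collections import namedtuple

# One feature line; column names follow the usual nine-column GFF layout.
GFFRecord = namedtuple(
    "GFFRecord",
    "seqid source feature start end score strand frame attributes",
)


def parse_gff_line(line):
    """Parse one GFF/GTF line; return None for blank lines and comments."""
    line = line.rstrip("\r\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError("Expected 9 tab-separated columns, got %d" % len(fields))
    seqid, source, feature, start, end, score, strand, frame, attributes = fields
    # Start and end are 1-based inclusive coordinates in GFF.
    return GFFRecord(seqid, source, feature, int(start), int(end),
                     score, strand, frame, attributes)


example = "chr1\texample\texon\t100\t200\t.\t+\t.\tgene_id \"g1\";"
record = parse_gff_line(example)
```

Iterating over a file and skipping the None results gives a list of records that can be filtered by feature type or strand.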
Sean From mmokrejs at ribosome.natur.cuni.cz Mon Dec 10 17:11:18 2007 From: mmokrejs at ribosome.natur.cuni.cz (Martin MOKREJŠ) Date: Mon, 10 Dec 2007 18:11:18 +0100 Subject: [BioPython] [Biopython-dev] Bio.SeqIO function to read a single record In-Reply-To: <320fb6e00712080510k3d4e5148gb0ec332a0d745452@mail.gmail.com> References: <320fb6e00712080510k3d4e5148gb0ec332a0d745452@mail.gmail.com> Message-ID: <475D7336.9070009@ribosome.natur.cuni.cz> Peter, so how about parse_one()? Or having a pair parse_file() and parse_entry()? M. Peter Cock wrote: > Michiel de Hoon wrote: >>> I'd suggested a Bio.SeqIO function, with a name like parse1() or >>> parse_sole() etc which would return a single SeqRecord - and raise >>> an error if the handle didn't contain one and only one record. We >>> could call this function read() if you prefer. >>> >> I'd prefer read() instead of parse1(), parse_sole() etc. for the >> following reasons: >> >> 1) Having two names that are clearly different emphasizes the fact that >> they return different things (parse() returns an iterator, read() a record). >> >> 2) Some modules deal with data that always consist of one record (for >> example, gene expression data in case of Bio.Cluster). Such modules can >> have a read() function but not a parse(). It would feel strange if a >> module has a parse1() function but not a parse(). > > OK. I've filed an enhancement bug, which I'll mention on the main mailing list, > http://bugzilla.open-bio.org/show_bug.cgi?id=2417 > Unless there is some negative feedback, I'll add that functionality shortly. From karin.lagesen at medisin.uio.no Thu Dec 13 12:30:43 2007 From: karin.lagesen at medisin.uio.no (Karin Lagesen) Date: Thu, 13 Dec 2007 13:30:43 +0100 Subject: [BioPython] alignment alphabet problem - upper/lower case? Message-ID: I have an alignment that I read in with: alignment = Clustalw.parse_file(infile, alphabet=IUPAC.IUPACAmbiguousDNA()) This works fine with upper case alignments.
However, I am now working with alignments that are in lower case. I would still like to use the same alphabet, but I don't know how to get it to accept lower case? The error I get is: File "/usr/local/python//lib/python2.4/site-packages/Bio/Align/AlignInfo.py", line 389, in pos_specific_score_matrix raise ValueError("Residue %s not found in alphabet %s" ValueError: Residue c not found in alphabet Gapped(IUPACAmbiguousDNA(), '-') Thanks! Karin -- Karin Lagesen, PhD student karin.lagesen at medisin.uio.no http://folk.uio.no/karinlag From biopython at maubp.freeserve.co.uk Thu Dec 13 14:11:02 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 13 Dec 2007 14:11:02 +0000 Subject: [BioPython] alignment alphabet problem - upper/lower case? In-Reply-To: References: Message-ID: <320fb6e00712130611q4dd1ee63p9f29c2774f447eb2@mail.gmail.com> Karin Lagesen wrote: > I have an alignment that I read in with : > > alignment = Clustalw.parse_file(infile, alphabet=IUPAC.IUPACAmbiguousDNA()) > > This works fine with upper case alignments. > > However, I am now working with alignments that are in lower case. I > would still like to use the same alphabet, but I don't know how to get > it to accept lower case? You can't use the *same* alphabet, because IUPAC.IUPACAmbiguousDNA() is explicitly defined with upper case letters. You need to use a different alphabet - either a generic DNA alphabet where the letters are not specified, or create a lower case equivalent of IUPAC.IUPACAmbiguousDNA(). Peter From dtomso at athenixcorp.com Thu Dec 20 20:08:47 2007 From: dtomso at athenixcorp.com (Daniel Tomso) Date: Thu, 20 Dec 2007 15:08:47 -0500 Subject: [BioPython] BioSQL problems on load Message-ID: Hi- This is possibly a simple configuration problem, but I'm having some problems loading files into BioSQL. 
Here's the code I'm using: from BioSQL import BioSeqDatabase from Bio import SeqIO import sys sfilename = sys.argv[1] server = BioSeqDatabase.open_database(driver = 'MySQLdb', user = 'pythonapi', passwd = 'xxxxxxx', host = 'localhost', db = 'bioseqdb', ) contigdb = server.new_database('test') sfile = open(sfilename, 'rU') contigdb.load(SeqIO.parse(sfile, 'genbank')) sfile.close() And here's the error message: /usr/lib/python2.5/site-packages/Bio/crc.py:5: DeprecationWarning: Bio.crc is deprecated; use crc32 and crc64 in Bio.SeqUtils.CheckSum instead warnings.warn("Bio.crc is deprecated; use crc32 and crc64 in Bio.SeqUtils.CheckSum instead", DeprecationWarning) Traceback (most recent call last): File "biosql_driver.py", line 26, in contigdb.load(SeqIO.parse(sfile, 'genbank')) File "/home/dtomso/repository/biopython/build/lib.linux-i686-2.5/BioSQL/BioSe qDatabase.py", line 412, in load db_loader.load_seqrecord(cur_record) File "/usr/lib/python2.5/site-packages/BioSQL/Loader.py", line 30, in load_seqrecord bioentry_id = self._load_bioentry_table(record) File "/usr/lib/python2.5/site-packages/BioSQL/Loader.py", line 253, in _load_bioentry_table version)) File "/home/dtomso/repository/biopython/build/lib.linux-i686-2.5/BioSQL/BioSe qDatabase.py", line 277, in execute self.cursor.execute(sql, args or ()) File "/usr/lib/python2.5/site-packages/MySQLdb/cursors.py", line 149, in execute query = query % db.literal(args) TypeError: not all arguments converted during string formatting This happens with standard genbank and fasta files pulled off of NCBI. Any suggestions? There's another issue regarding standard parsing of accession numbers to get version IDs (the code doesn't like non-NCBI fasta headers, e.g. those produced by phrap), but that is pretty minor and doesn't seem to be related to this. Thanks for any ideas! Dan Tomso ---------------------------------------- Daniel J. 
Tomso Senior Scientist Athenix Corporation PO Box 110347 Research Triangle Park, NC 27709 ---------------------------------------- 919.328.4122 dtomso at athenixcorp.com www.athenixcorp.com ---------------------------------------- Disclaimer: This message (including any attachments) may contain confidential or privileged information and is intended only for the use of the addressee named above. If you are not the intended recipient of this message, you are hereby notified that you must not use, copy, disclose or take any action based on this message or information herein. If you have received this message in error, please advise the sender immediately and erase all copies of this message and any related attachments. Thank you. From biopython at maubp.freeserve.co.uk Fri Dec 21 17:56:08 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 21 Dec 2007 18:56:08 +0100 Subject: [BioPython] BioSQL problems on load In-Reply-To: References: Message-ID: <320fb6e00712210956l6541cf7o482e9c7bf4c7db85@mail.gmail.com> > This is possibly a simple configuration problem, but I'm having some > problems loading files into BioSQL. > > ... > > warnings.warn("Bio.crc is deprecated; use crc32 and crc64 in > Bio.SeqUtils.CheckSum instead", DeprecationWarning) We should fix that - I hadn't seen that warning being triggered on my machine. > File > "/home/dtomso/repository/biopython/build/lib.linux-i686-2.5/BioSQL/BioSe > qDatabase.py", line 277, in execute > > self.cursor.execute(sql, args or ()) > > File "/usr/lib/python2.5/site-packages/MySQLdb/cursors.py", line 149, > in execute > > query = query % db.literal(args) > > TypeError: not all arguments converted during string formatting > > This happens with standard genbank and fasta files pulled off of NCBI. > Any suggestions? That looks like Bug 2390, http://bugzilla.open-bio.org/show_bug.cgi?id=2390 You didn't say what version of Biopython you are using, but that has been fixed in CVS. 
Could you update to CVS, or wait for the next release (hopefully in
January?).

> There's another issue regarding standard parsing of accession numbers to
> get version IDs (the code doesn't like non-NCBI fasta headers, e.g.
> those produced by phrap), but that is pretty minor and doesn't seem to
> be related to this.

Perhaps the parser or BioSQL can be a bit more robust here. Could you
file a bug on this issue, and then attach an example input file with
some code showing the problem?

Thanks

Peter

From vmatthewa at gmail.com Wed Dec 26 22:41:55 2007
From: vmatthewa at gmail.com (Matthew Abravanel)
Date: Wed, 26 Dec 2007 15:41:55 -0700
Subject: [BioPython] script to extract records from nucleotide database
In-Reply-To: <320fb6e00711151311q700a473age0e026b8fcfb5c5@mail.gmail.com>
References: <8fc5e4c20711121319q5c1fd46exb584978ec0b9dd76@mail.gmail.com>
 <473A0849.2020904@biotec.tu-dresden.de>
 <473A2C5F.5060102@maubp.freeserve.co.uk>
 <8fc5e4c20711131546g111461bfrd210b68d03f8bcd8@mail.gmail.com>
 <473A3AE9.9030506@maubp.freeserve.co.uk>
 <473AC434.4030605@biotec.tu-dresden.de>
 <473B6C08.1060605@maubp.freeserve.co.uk>
 <8fc5e4c20711151250y286ff4d4se7875e029e12eceb@mail.gmail.com>
 <320fb6e00711151311q700a473age0e026b8fcfb5c5@mail.gmail.com>
Message-ID: <8fc5e4c20712261441tf0d1887q3455bda40cead942@mail.gmail.com>

Hi everyone,

Sorry to keep coming back to this script I am working on, but I have
been thinking about what Peter said regarding my Biopython installation,
which I think keeps breaking my code no matter what I do to fix it.
Could I use the "Ports Collection" to install Biopython again, since I
am using BSD as my OS? I realize that you were only speculating about
the nature of the problem with my install, but do you think that might
work? Since my system administrator is the one doing the installing,
should I talk to them about it?

Thanks.

Matthew

On Nov 15, 2007 2:11 PM, Peter wrote:

> > Thanks for all the comments, so since you said you removed Bio.FormatIO in
> > version 1.44 and replaced it with Bio.SeqIO do you think I can still
> > successfully use that code I was given if I have 1.44 provided I watch out
> > for bugs and so on?
>
> Assuming you can apply the fix for Bug 2393, then that
> Bio.GenBank.NCBIDictionary code should work fine with Biopython 1.44.
>
> There is also a related example in the SeqIO chapter of the tutorial
> using the Bio.GenBank.download_many() function.
>
> > What is the difference between Bio.FormatIO and
> > Bio.SeqIO, other than them describing file formats differently?
>
> In terms of typical end use, Bio.SeqIO and Bio.FormatIO provided
> similar capabilities, but FormatIO wasn't very up to date in terms of
> its format support. The big differences are internal. For any new
> code, please try Bio.SeqIO (available in Biopython 1.43 onwards),
> which is described in the tutorial and the wiki:
> http://biopython.org/wiki/SeqIO
>
> > Also how exactly could one have a partial installation, some of the
> > package not installing?
>
> This was a guess - there is/was clearly something odd about your
> install. If you installed from source, maybe some step failed part
> way leaving you with only some parts installed. Another possibility
> on BSD is that there is something different about the installation
> paths which is confusing things. We haven't worked out what went wrong
> on your system so I was just speculating.
>
> Peter
>

From dtomso at athenixcorp.com Fri Dec 28 15:50:25 2007
From: dtomso at athenixcorp.com (Daniel Tomso)
Date: Fri, 28 Dec 2007 10:50:25 -0500
Subject: [BioPython] BioSQL problems on load
In-Reply-To: <320fb6e00712210956l6541cf7o482e9c7bf4c7db85@mail.gmail.com>
References: <320fb6e00712210956l6541cf7o482e9c7bf4c7db85@mail.gmail.com>
Message-ID:

Thanks-- I'll get the last bug logged ASAP.
Regarding the main problem--I'm pretty sure I'm up-to-date via CVS--I
have persistent configuration problems on this Ubuntu box. I suspect
right now that something is digging down some alternate hierarchy and
coming up with this error, even though the correct versions are lurking
somewhere else. I'll dig it up eventually.

Thanks again.

DT

-----Original Message-----
From: p.j.a.cock at googlemail.com [mailto:p.j.a.cock at googlemail.com] On Behalf Of Peter
Sent: Friday, December 21, 2007 12:56 PM
To: Daniel Tomso
Cc: biopython at biopython.org
Subject: Re: [BioPython] BioSQL problems on load

> This is possibly a simple configuration problem, but I'm having some
> problems loading files into BioSQL.
>
> ...
>
> warnings.warn("Bio.crc is deprecated; use crc32 and crc64 in
> Bio.SeqUtils.CheckSum instead", DeprecationWarning)

We should fix that - I hadn't seen that warning being triggered on my
machine.

> File
> "/home/dtomso/repository/biopython/build/lib.linux-i686-2.5/BioSQL/BioSeqDatabase.py",
> line 277, in execute
>
> self.cursor.execute(sql, args or ())
>
> File "/usr/lib/python2.5/site-packages/MySQLdb/cursors.py", line 149,
> in execute
>
> query = query % db.literal(args)
>
> TypeError: not all arguments converted during string formatting
>
> This happens with standard genbank and fasta files pulled off of NCBI.
> Any suggestions?

That looks like Bug 2390,
http://bugzilla.open-bio.org/show_bug.cgi?id=2390

You didn't say what version of Biopython you are using, but that has
been fixed in CVS. Could you update to CVS, or wait for the next release
(hopefully in January?).

> There's another issue regarding standard parsing of accession numbers to
> get version IDs (the code doesn't like non-NCBI fasta headers, e.g.
> those produced by phrap), but that is pretty minor and doesn't seem to
> be related to this.

Perhaps the parser or BioSQL can be a bit more robust here. Could you
file a bug on this issue, and then attach an example input file with
some code showing the problem?

Thanks

Peter
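
A note on the "alternate hierarchy" suspicion in this thread: the
traceback lists BioSeqDatabase.py under
/home/dtomso/repository/biopython/build/lib.linux-i686-2.5/BioSQL but
Loader.py under /usr/lib/python2.5/site-packages/BioSQL. That may only
reflect where the files were byte-compiled, or it may mean two copies are
in play. One way to check is to ask Python where each module was really
imported from. The following is a minimal sketch, not part of Biopython:
the helper name module_locations is hypothetical, and it assumes a
current Python 3 interpreter (importlib) rather than the Python 2.5 used
in the report.

```python
import importlib


def module_locations(names):
    """Return {module name: file it was imported from, or an error note}.

    Hypothetical helper for illustration only.
    """
    locations = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError as exc:
            # The module is not on sys.path at all (or failed to import).
            locations[name] = "not importable: %s" % exc
            continue
        # Built-in modules (and namespace packages) carry no usable __file__.
        locations[name] = getattr(module, "__file__", None) or "(no __file__)"
    return locations


if __name__ == "__main__":
    # Standard-library names are used here so the sketch runs anywhere.
    for name, path in sorted(module_locations(["json", "os", "sys"]).items()):
        print("%-6s %s" % (name, path))
```

On the machine from the report, one would presumably pass the module
names seen in the traceback instead, e.g. module_locations(["Bio",
"BioSQL.BioSeqDatabase", "BioSQL.Loader", "MySQLdb"]), and compare the
directories that come back. If they point at different trees, an older
copy is likely shadowing the CVS checkout.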