From p.j.a.cock at googlemail.com Tue May 5 12:41:31 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 5 May 2009 17:41:31 +0100
Subject: [Biopython] Dropping Python 2.3 support in Biopython
Message-ID: <320fb6e00905050941m37725eb5ibba02ca99236212e@mail.gmail.com>

Hello all,

This is a final warning that the next release of Biopython will not
support Python 2.3.

As far as we are aware, no-one has come forward with a need for
continued support for Python 2.3, so we will soon begin removing the
special case code needed to keep Biopython working on Python 2.3.
This will give us a simpler code base, fewer platforms to test on,
and we can also take advantage of various language features only
available in Python 2.4+ (e.g. generator expressions and decorators).

Any last minute requests to postpone this should be made to the main
Biopython mailing list by Friday 8 May.

Thank you,

Peter

From scheper at email.unc.edu Wed May 6 11:33:40 2009
From: scheper at email.unc.edu (Walter Scheper)
Date: Wed, 6 May 2009 11:33:40 -0400
Subject: [Biopython] [BioPython] Question about using Entrez.epost
Message-ID:

Hey folks,

I'm currently working on a project where we need to download large
numbers (1000s) of SNPs from NCBI's database. The documentation for
BioPython tells me I should be using Entrez.epost for this, and then
using the resulting search history to pull down my snps. However, I
find that epost itself has a maximum limit to the number of rs Ids I
can use in a single search, which roughly translates into about 700
rs Ids. Is this as intended, or am I not using epost correctly? If I
am using epost correctly, what's the best way to break this up so that
(a) I get my data and (b) don't overburden NCBI's system.
Here's how I'm calling epost, mostly this is straight out of the tutorial:

    search_results = Entrez.read(Entrez.epost(db='snp', id=id_string))

Thanks for any help,
Walter Scheper

From biopython at maubp.freeserve.co.uk Wed May 6 12:43:15 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 6 May 2009 17:43:15 +0100
Subject: [Biopython] [BioPython] Question about using Entrez.epost
In-Reply-To:
References:
Message-ID: <320fb6e00905060943p1e65890y325e65db6271501@mail.gmail.com>

On Wed, May 6, 2009 at 4:33 PM, Walter Scheper wrote:
> Hey folks,
>
> I'm currently working on a project where we need to download large numbers
> (1000s) of SNPs from NCBI's database. The documentation for BioPython tells
> me I should be using Entrez.epost for this, and then using the resulting
> search history to pull down my snps. However, I find that epost itself has a
> maximum limit to the number of rs Ids I can use in a single search, which
> roughly translates into about 700 rs Ids. Is this as intended, or am I not
> using epost correctly? If I am using epost correctly, what's the best way to
> break this up so that (a) I get my data and (b) don't overburden NCBI's
> system.
>
> Here's how I'm calling epost, mostly this is straight out of the tutorial:
>     search_results = Entrez.read(Entrez.epost(db='snp', id=id_string))
>
> Thanks for any help,
> Walter Scheper

Where is the 700 ID limit from? I don't recall any limits on the EPost
documentation. Was this found by experimentation?
http://www.ncbi.nlm.nih.gov/entrez/query/static/epost_help.html

Note that with such large numbers, once you have got EPost to work,
you should probably be using EFetch to download the results in batches.

This doesn't really answer your question, but it should be fairly
simple to use EPost and EFetch in batches of (say) 100 records, which
avoids the whole issue. How big a batch size you choose will depend
on how big the records are.
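
The batching suggested above might look roughly like the sketch below.
It is only an illustration, not code from this thread: the helper names
chunked and fetch_in_batches, the default batch size of 100 and the
placeholder email address are all assumptions. The Entrez calls need
Biopython and network access, so they sit inside a function and nothing
is contacted on import.

```python
def chunked(ids, size):
    """Yield successive lists of at most `size` IDs (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    ids = list(ids)
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fetch_in_batches(db, ids, batch_size=100, email="you@example.org"):
    """Post each batch with EPost, then download it with EFetch.

    Hypothetical helper: needs Biopython and network access, and yields
    the raw text returned by NCBI for each batch.
    """
    # Imported here so that chunked() can be used without Biopython.
    from Bio import Entrez

    Entrez.email = email  # placeholder; NCBI asks for a real contact address
    for batch in chunked(ids, batch_size):
        id_string = ",".join(str(i) for i in batch)
        posted = Entrez.read(Entrez.epost(db=db, id=id_string))
        handle = Entrez.efetch(db=db,
                               webenv=posted["WebEnv"],
                               query_key=posted["QueryKey"])
        try:
            yield handle.read()
        finally:
            handle.close()
```

Something like "for text in fetch_in_batches('snp', rs_ids): ..." would
then process one batch at a time, so each request stays small whatever
the total number of IDs.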
As far as I know the NCBI don't give any guidance on batch sizes.
However, do make sure you run this script at the weekend or outside
normal USA working hours:
http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html#UserSystemRequirements

Peter

From biopython at maubp.freeserve.co.uk Thu May 7 04:59:44 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 7 May 2009 09:59:44 +0100
Subject: [Biopython] [BioPython] Question about using Entrez.epost
In-Reply-To: <061B355A-D859-43F6-A10B-233043FB0858@email.unc.edu>
References: <320fb6e00905060943p1e65890y325e65db6271501@mail.gmail.com>
 <061B355A-D859-43F6-A10B-233043FB0858@email.unc.edu>
Message-ID: <320fb6e00905070159v5fe22d81ibec1611af2ee83e0@mail.gmail.com>

On Wed, May 6, 2009 at 6:07 PM, Walter Scheper wrote:
>
> On May 6, 2009, at 12:43 PM, Peter wrote:
>
>> Where is the 700 ID limit from?  I don't recall any limits on the EPost
>> documentation.  Was this found by experimentation?
>> http://www.ncbi.nlm.nih.gov/entrez/query/static/epost_help.html
>>
> Yes, that was found by experimentation with epost. I don't think the limit
> is really tied to the number of Ids, but to the length of the string.
> However, I thought the purpose of epost was that you didn't run into url
> length problems, or is it more a way to queue up a request?

I think I see what is going on now. Using an HTTP POST would avoid
long URL problems with an HTTP GET (something the NCBI documents
don't explain - at first reading the EPost command seems redundant).
However, from looking over our Bio.Entrez code more carefully it
seems we are not actually using a POST after all - just a plain HTTP
GET.

This would explain the limit you are seeing - you are either the first
person to try such a long ID list with Bio.Entrez (personally I have
only ever downloaded much smaller datasets in one go), or at least
you are the first person to actually report it is broken. So thank you!
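
To make the GET versus POST difference concrete, here is a small
standard-library sketch. It is not the change that went into
Bio.Entrez: the endpoint URL, the helper names build_get and build_post,
and the made-up ID list are assumptions for illustration, and no
request is actually sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint, used only to build (not send) the requests.
EPOST_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi"


def build_get(params):
    """GET: the parameters are appended to the URL itself."""
    return Request(EPOST_URL + "?" + urlencode(params))


def build_post(params):
    """POST: the parameters travel in the request body, not the URL."""
    return Request(EPOST_URL, data=urlencode(params).encode("ascii"))


# A made-up list of 2000 numeric IDs, joined the way EPost expects.
ids = ",".join(str(n) for n in range(1, 2001))
params = {"db": "snp", "id": ids}

get_req = build_get(params)
post_req = build_post(params)

# The GET URL grows with the ID list; the POST URL stays fixed.
print(get_req.get_method(), len(get_req.full_url))
print(post_req.get_method(), len(post_req.full_url))
```

With a GET the whole ID string ends up in the URL, which is where a
length limit can bite; with a POST the URL is the same length however
many IDs are sent.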
I've filed Bug 2824 on this issue:
http://bugzilla.open-bio.org/show_bug.cgi?id=2824

>> Note that with such large numbers, once you have got EPost to work, you
>> should probably be using EFetch to download the results in batches.
>
> Sure. I suppose I'll just have to break the whole thing, search and
> retrieval, into chunks.

In the short term yes, that is a practical solution. If you would be
happy to update your installation (or at least, update the
Bio/Entrez/__init__.py file) then you can help test any fix.

Peter

From biopython at maubp.freeserve.co.uk Thu May 7 06:24:26 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 7 May 2009 11:24:26 +0100
Subject: [Biopython] [BioPython] Question about using Entrez.epost
In-Reply-To: <320fb6e00905070159v5fe22d81ibec1611af2ee83e0@mail.gmail.com>
References: <320fb6e00905060943p1e65890y325e65db6271501@mail.gmail.com>
 <061B355A-D859-43F6-A10B-233043FB0858@email.unc.edu>
 <320fb6e00905070159v5fe22d81ibec1611af2ee83e0@mail.gmail.com>
Message-ID: <320fb6e00905070324j7eb96d0bs768f72769f75a9c7@mail.gmail.com>

On Thu, May 7, 2009 at 9:59 AM, Peter wrote:
> I think I see what is going on now.  Using an HTTP POST would avoid
> long URL problems with an HTTP GET (something the NCBI documents
> don't explain - at first reading the EPost command seems redundant).
> However, from looking over our Bio.Entrez code more carefully it
> seems we are not actually using a POST after all - just a plain HTTP
> GET.
>
> This would explain the limit you are seeing - you are either the first
> person to try such a long ID list with Bio.Entrez (personally I have
> only ever downloaded much smaller datasets in one go), or at least
> you are the first person to actually report it is broken.  So thank you!
>
> I've filed Bug 2824 on this issue:
> http://bugzilla.open-bio.org/show_bug.cgi?id=2824

Fixed in CVS (which will propagate to github shortly).
All you need to do to apply the fix is update your
.../site-packages/Bio/Entrez/__init__.py file with the new version,
which can be downloaded from here shortly:
http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Entrez/__init__.py?cvsroot=biopython

Or, just install all of Biopython from the latest source (fetched from
CVS or github). If you need more detailed instructions, let us know.

This should fix your EPost problem - if you can confirm this by email
that would be great.

Thank you.

Peter

From biopython at maubp.freeserve.co.uk Fri May 8 15:55:56 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 8 May 2009 20:55:56 +0100
Subject: [Biopython] [BioPython] Question about using Entrez.epost
In-Reply-To:
References: <320fb6e00905060943p1e65890y325e65db6271501@mail.gmail.com>
 <061B355A-D859-43F6-A10B-233043FB0858@email.unc.edu>
 <320fb6e00905070159v5fe22d81ibec1611af2ee83e0@mail.gmail.com>
 <320fb6e00905070324j7eb96d0bs768f72769f75a9c7@mail.gmail.com>
Message-ID: <320fb6e00905081255x611872f9pe48b1818aec0a930@mail.gmail.com>

On Fri, May 8, 2009 at 8:38 PM, Walter Scheper wrote:
>
> On May 7, 2009, at 6:24 AM, Peter wrote:
>
>> This should fix your EPost problem - if you can confirm this by email
>> that would be great.
>>
>> Thank you.
>>
>> Peter
>
> Hey Peter,
>
> Thanks for the help, and the quick fix. This does indeed fix the problem I
> was having. Now I can at least get NCBI to cache the whole set before I
> start downloading.
>
> Walter

Great - thanks again for reporting this problem.
Peter

From biopython at maubp.freeserve.co.uk Tue May 12 12:10:02 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 12 May 2009 17:10:02 +0100
Subject: [Biopython] Loading SeqRecords into BioSQL with NCBI taxon ID
In-Reply-To: <320fb6e00905120905l3d0e31b2p2a3d92f61096cbd5@mail.gmail.com>
References: <320fb6e00905120905l3d0e31b2p2a3d92f61096cbd5@mail.gmail.com>
Message-ID: <320fb6e00905120910v7bc942eai1e163d29b0d88f00@mail.gmail.com>

Sorry - I meant to post this to the main mailing list rather than the
dev list, as it is of general interest.

Peter

On Tue, May 12, 2009 at 5:05 PM, Peter wrote:
> Over on Bug 2826, David wrote:
> http://bugzilla.open-bio.org/show_bug.cgi?id=2826#c2
>
>> Thank you. I'm new to BioPython.
>>
>> The goal was to take some whole-genome sequence (which isn't in Genbank) and
>> attach a taxon to it, in order that it be written to a BioSQL database.
>
> You've talked about trying to parse WGS GenBank files on Bug 2825 but
> presumably if this new data isn't in GenBank, it is in another format.
>
> What format is your whole-genome sequence?  FASTA or something simple?
>
>> Other records in the BioSQL database derive from NCBI and so have taxon_ids,
>> so the additional WGS being in a similar format would make things simpler.
>
> I see. Basically you need to import a SeqRecord into BioSQL with an
> NCBI taxon ID.  You don't need to write out a GenBank file to do this.
>
> First create the SeqRecord, e.g.
>
> from Bio import SeqIO
> record = SeqIO.read(handle, format, alphabet)
>
> There are now two options - because the BioSQL loader will look for
> the NCBI taxon ID in two places:
>
> (Option 1) Record the NCBI taxon ID in the SeqRecord's annotation
> dictionary under the "ncbi_taxid" key.  This should work (untested):
>
> record.annotations["ncbi_taxid"] = 12345 #or single element list, [12345]
>
> (Option 2) Mimic a SeqRecord from parsing a GenBank file with a source
> feature containing the taxon ID.
> This should work (untested):
>
> #Create the SeqRecord:
> record = SeqIO.read(handle, format, alphabet)
> #Create the source features:
> from Bio.SeqFeature import SeqFeature, FeatureLocation
> f = SeqFeature(FeatureLocation(0, len(record)), strand=+1, type="source")
> f.qualifiers["db_xref"] = ["taxon:12345"]
> record.features = [f] #or insert at start
>
> If you don't really have a sequence, this second approach doesn't make
> so much sense.
>
> [Arguably there could be a third option via the dbxref's list]
>
> Then in either case, load the modified SeqRecord into the database.
> You may want to pre-load the NCBI taxonomy, see
> http://www.biopython.org/wiki/BioSQL
>
> Alternatively, using Biopython 1.49+ you can have this fetched from
> Entrez on demand with the fetch_NCBI_taxonomy=True option.  The BioSQL
> wiki page needs updating on this topic.
>
> Peter
>

From chapmanb at 50mail.com Thu May 14 08:23:19 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 14 May 2009 08:23:19 -0400
Subject: [Biopython] BOSC/ISMB conference reminder
Message-ID: <20090514122319.GD59158@sobchak.mgh.harvard.edu>

Hi all;
A friendly reminder to all of us Python bioinformatics folks;
the early registration deadline for the Bioinformatics Open Source
Conference (BOSC) is this Friday, May 15th:

http://www.open-bio.org/wiki/BOSC_2009

BOSC is on in Stockholm, Sweden from June 27th to 28th, and takes
place in conjunction with the ISMB meeting.

Peter will be there presenting an update on the latest and greatest
in Biopython. I will be giving a 10 minute talk during the Data
and Analysis Management section about ideas for publishing
biological data on the web with Python examples. Generally, BOSC is
a great chance to meet like-minded informatics folks and learn about
all the great tools being developed out there.

I mentioned previously the idea of an informal Biopython hackathon on
the Monday and Tuesday after BOSC.
Basically, this would be a chance to sit down and do some Biopython
programming while being able to physically talk. I will definitely be
there and am up for some coding with as many people as are inclined;
Peter and Tiago responded favorably to the idea earlier. If you are at
all interested send an e-mail. We can keep it informal with a few
people, and I will try to organize space if it looks like we will have
more.

Looking forward to seeing everyone there,
Brad

From p.j.a.cock at googlemail.com Thu May 14 08:47:55 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 14 May 2009 13:47:55 +0100
Subject: [Biopython] BOSC/ISMB conference reminder
In-Reply-To: <20090514122319.GD59158@sobchak.mgh.harvard.edu>
References: <20090514122319.GD59158@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00905140547x6a128212ne529fda93f84aa16@mail.gmail.com>

On Thu, May 14, 2009 at 1:23 PM, Brad Chapman wrote:
> Hi all;
> A friendly reminder to all of us Python bioinformatics folks;
> the early registration deadline for the Bioinformatics Open Source
> Conference (BOSC) is this Friday, May 15th:
>
> http://www.open-bio.org/wiki/BOSC_2009
>
> BOSC is on in Stockholm, Sweden from June 27th to 28th, and takes
> place in conjunction with the ISMB meeting.
>
> Peter will be there presenting an update on the latest and greatest
> in Biopython. I will be giving a 10 minute talk during the Data
> and Analysis Management section about ideas for publishing
> biological data on the web with Python examples. Generally, BOSC is
> a great chance to meet like-minded informatics folks and learn about
> all the great tools being developed out there.
>
> I mentioned previously the idea of an informal Biopython hackathon on
> the Monday and Tuesday after BOSC. Basically, this would be a chance
> to sit down and do some Biopython programming while being able to
> physically talk.
> I will definitely be there and am up for some coding
> with as many people as are inclined; Peter and Tiago responded favorably
> to the idea earlier. If you are at all interested send an e-mail. We can
> keep it informal with a few people, and I will try to organize space if
> it looks like we will have more.

You mean Monday June 29 and/or Tuesday June 30? These are the
first two days of the ISMB meeting, so my time would have to be
juggled between this and whatever conference sessions look most
relevant.

http://www.open-bio.org/wiki/BOSC_2009 says "In addition, we are
asking all speakers to come prepared to lead an informal tutorial on
their software during a Birds of a Feather/hackathon session."
So I will ask them to allocate us a slot on the Saturday/Sunday
during BOSC itself. They should take care of getting a room and
internet access too. This would be more of a basic introduction to
Biopython (hopefully we can generate some useful additions to the
cookbook from this) rather than a developmental session like Brad
is planning.

Peter

From tiagoantao at gmail.com Thu May 14 08:50:55 2009
From: tiagoantao at gmail.com (Tiago Antão)
Date: Thu, 14 May 2009 13:50:55 +0100
Subject: [Biopython] BOSC/ISMB conference reminder
In-Reply-To: <20090514122319.GD59158@sobchak.mgh.harvard.edu>
References: <20090514122319.GD59158@sobchak.mgh.harvard.edu>
Message-ID: <6d941f120905140550v4185ab73sdb2d3a8d47e80cea@mail.gmail.com>

On Thu, May 14, 2009 at 1:23 PM, Brad Chapman wrote:
> I mentioned previously the idea of an informal Biopython hackathon on
> the Monday and Tuesday after BOSC. Basically, this would be a chance
> to sit down and do some Biopython programming while being able to
> physically talk. I will definitely be there and am up for some coding
> with as many people as are inclined; Peter and Tiago responded favorably
> to the idea earlier. If you are at all interested send an e-mail.
> We can keep it informal with a few people, and I will try to organize
> space if it looks like we will have more.

I will only be going if this goes forward.
For me any group size is good (small, medium, big). But I would need
to know if this goes forward or not (doesn't need to be 100% sure,
just the committed intention to go ahead).

Regards,
Tiago

From bartek at rezolwenta.eu.org Thu May 14 08:58:09 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Thu, 14 May 2009 14:58:09 +0200
Subject: [Biopython] BOSC/ISMB conference reminder
In-Reply-To: <320fb6e00905140547x6a128212ne529fda93f84aa16@mail.gmail.com>
References: <20090514122319.GD59158@sobchak.mgh.harvard.edu>
 <320fb6e00905140547x6a128212ne529fda93f84aa16@mail.gmail.com>
Message-ID: <8b34ec180905140558v1cc3142fnb4bc300133e32076@mail.gmail.com>

On Thu, May 14, 2009 at 2:47 PM, Peter Cock wrote:
> On Thu, May 14, 2009 at 1:23 PM, Brad Chapman wrote:
> You mean Monday June 29 and/or Tuesday June 30?  These are the
> first two days of the ISMB meeting, so my time would have to be
> juggled between this and whatever conference sessions look most
> relevant.
>
> http://www.open-bio.org/wiki/BOSC_2009 says "In addition, we are
> asking all speakers to come prepared to lead an informal tutorial on
> their software during a Birds of a Feather/hackathon session."
> So I will ask them to allocate us a slot on the Saturday/Sunday
> during BOSC itself.  They should take care of getting a room and
> internet access too.  This would be more of a basic introduction to
> Biopython (hopefully we can generate some useful additions to the
> cookbook from this) rather than a developmental session like Brad
> is planning.
>

Hi,

I'll also be presenting at BOSC and I'd be glad to take part in the
meeting/hackathon. Unfortunately, I'm not staying for ISMB, so my vote
goes definitely on Saturday/Sunday timing.
Looking forward to seeing you in Stockholm

Bartek

From p.j.a.cock at googlemail.com Thu May 14 09:26:41 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 14 May 2009 14:26:41 +0100
Subject: [Biopython] BOSC/ISMB conference reminder
In-Reply-To: <320fb6e00905140547x6a128212ne529fda93f84aa16@mail.gmail.com>
References: <20090514122319.GD59158@sobchak.mgh.harvard.edu>
 <320fb6e00905140547x6a128212ne529fda93f84aa16@mail.gmail.com>
Message-ID: <320fb6e00905140626l82001bbw8ba7912d2b596d50@mail.gmail.com>

Brad wrote:
>> I mentioned previously the idea of an informal Biopython hackathon on
>> the Monday and Tuesday after BOSC. Basically, this would be a chance
>> to sit down and do some Biopython programming while being able to
>> physically talk. I will definitely be there and am up for some coding
>> with as many people as are inclined; Peter and Tiago responded
>> favorably to the idea earlier. If you are at all interested send an e-mail.
>> We can keep it informal with a few people, and I will try to organize
>> space if it looks like we will have more.

Peter wrote:
> You mean Monday June 29 and/or Tuesday June 30?  These are the
> first two days of the ISMB meeting, so my time would have to be
> juggled between this and whatever conference sessions look most
> relevant.
>
> http://www.open-bio.org/wiki/BOSC_2009 says "In addition, we are
> asking all speakers to come prepared to lead an informal tutorial on
> their software during a Birds of a Feather/hackathon session."
> So I will ask them to allocate us a slot on the Saturday/Sunday
> during BOSC itself.  They should take care of getting a room and
> internet access too.  This would be more of a basic introduction to
> Biopython (hopefully we can generate some useful additions to the
> cookbook from this) rather than a developmental session like Brad
> is planning.

I have emailed the BOSC organisers to ask for a Biopython "Birds of a
Feather/hackathon session" slot.
In previous years these sessions have been on the Sunday afternoon -
and didn't clash with any of the BOSC talks.

If we have me, Brad, Bartek and Tiago there we should be able to cover
most user questions, and maybe also get some development in too. If
it is just me and Brad staying on for the rest of the week, we can get
together informally (maybe an evening session in a hotel or something).

Peter

From kkelchev at bulsyst.com Thu May 14 09:44:53 2009
From: kkelchev at bulsyst.com (Kamen Kelchev)
Date: Thu, 14 May 2009 16:44:53 +0300
Subject: [Biopython] HMM example
Message-ID:

Hi to all,

I need a simple example of using the Bio.HMM library.

I read the documentation (cookbook) and tutorial, but section 15.x is
not filled in.

Thanks
Kamen

From biopython at maubp.freeserve.co.uk Thu May 14 09:57:23 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 14 May 2009 14:57:23 +0100
Subject: [Biopython] HMM example
In-Reply-To:
References:
Message-ID: <320fb6e00905140657m44cb32c7j897a8e43208c13f2@mail.gmail.com>

On Thu, May 14, 2009 at 2:44 PM, Kamen Kelchev wrote:
> Hi to all,
>
> I need a simple example of using the Bio.HMM library.
>
> I read the documentation (cookbook) and tutorial, but section 15.x is
> not filled in.
>
> Thanks
> Kamen

You are right, we don't have any proper documentation in the tutorial
for Bio.HMM yet. There is the code's built in documentation (python
docstrings) which you can access with the python help command, or read
online here:
http://biopython.org/DIST/docs/api/Bio.HMM-module.html

There are also some example scripts in our unit tests, files
test_HMMCasino.py and test_HMMGeneral.py, which should be a good
starting point. Hopefully you can make some progress from that....

If you would like to contribute some documentation that would be great!
Peter

From chapmanb at 50mail.com Thu May 14 10:13:17 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 14 May 2009 10:13:17 -0400
Subject: [Biopython] BOSC/ISMB conference reminder
In-Reply-To: <320fb6e00905140547x6a128212ne529fda93f84aa16@mail.gmail.com>
References: <20090514122319.GD59158@sobchak.mgh.harvard.edu>
 <320fb6e00905140547x6a128212ne529fda93f84aa16@mail.gmail.com>
Message-ID: <20090514141317.GI59158@sobchak.mgh.harvard.edu>

Hi all;

> > I mentioned previously the idea of an informal Biopython hackathon on
> > the Monday and Tuesday after BOSC.

Peter:
> http://www.open-bio.org/wiki/BOSC_2009 says "In addition, we are
> asking all speakers to come prepared to lead an informal tutorial on
> their software during a Birds of a Feather/hackathon session."
> So I will ask them to allocate us a slot on the Saturday/Sunday
> during BOSC itself. They should take care of getting a room and
> internet access too. This would be more of a basic introduction to
> Biopython (hopefully we can generate some useful additions to the
> cookbook from this) rather than a developmental session like Brad
> is planning.

Glad to see all the interest. We can be flexible in how we schedule
things but we will definitely have some time for coding. Tiago, you
should come; BOSC is well worth it in addition to the hacking.

Peter, thanks for organizing the BoF session. This will probably be
more of an introduction and question/answer period if my past
experience is any predictor, but we can certainly morph into a coding
session afterwards. In the past these were in the evening and we had
rooms for as long as we wanted, so that'll make the coordination easy.
It will also accommodate Bartek and anyone else who is around for the
weekend.

Peter:
> You mean Monday June 29 and/or Tuesday June 30? These are the
> first two days of the ISMB meeting, so my time would have to be
> juggled between this and whatever conference sessions look most
> relevant.
That was my initial idea since the BOSC schedule isn't all nailed
down, but fitting time in on Saturday and Sunday as available is also
great. I will be around for the days afterwards but not attending
ISMB, so can be flexible on times for anyone who wants to keep things
going.

I'm looking forward to it. As we get closer we should spend some time
brainstorming a list of things to do. I would like it if some
non-Biopython programming people jumped in as well; it would be a
great chance to discuss coordination with other projects.

Brad

From macrozhu at gmail.com Thu May 14 11:13:45 2009
From: macrozhu at gmail.com (Hongbo Zhu)
Date: Thu, 14 May 2009 17:13:45 +0200
Subject: [Biopython] Bio.PDB.PDBIO.save renumbers Atom Serial Number
Message-ID: <11b97ec0905140813p22216df3i72c11ad802d53a5e@mail.gmail.com>

Hi,

The >>Atom Serial Number<< values in the .pdb files from the Protein
Data Bank (PDB) are assigned consecutively from 1 to the ATOM/HETATM
records. This helps to identify atoms, though it may be risky to rely
on that number completely. Apparently, the function
Bio.PDB.PDBIO.save() renumbers the >>Atom Serial Number<< in the newly
generated PDB file. This is on the one hand consistent with the .pdb
file style, but on the other hand very confusing, as the same atom
gets different serial numbers. I suggest at least giving users the
choice to keep the original >>Atom Serial Number<< by adding a new
parameter to the save() function of class Bio.PDB.PDBIO.

cheers,
hongbo

From fahy at chapman.edu Thu May 14 13:07:42 2009
From: fahy at chapman.edu (Fahy, Michael)
Date: Thu, 14 May 2009 10:07:42 -0700
Subject: [Biopython] Sequence alignment with multiple proteins
Message-ID: <14308220CCA3654CBC504707EE8C65090BBD0001@ADAM.chapman.edu>

This is not strictly a BioPython question but I'm using BioPython for
the work.

I have a set of 45 proteins and 10 species. I have a representative
orthologous protein from each set for each of the 10 species.
I'm trying to build a phylogenetic tree by aligning the data from the
10 species. I've tried concatenating the 45 protein sequences for each
of the 10 species and aligning the concatenated sequences but this has
produced results that do not make sense. What do you recommend for
such a problem?

-------------------------------------------------------
Michael A. Fahy, PhD
Professor and Chair
Mathematics and Computer Science
Chapman University
One University Drive
Orange, CA 92866
(714) 997-6879
fahy at chapman.edu

From biopython at maubp.freeserve.co.uk Thu May 14 13:24:11 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 14 May 2009 18:24:11 +0100
Subject: [Biopython] Sequence alignment with multiple proteins
In-Reply-To: <14308220CCA3654CBC504707EE8C65090BBD0001@ADAM.chapman.edu>
References: <14308220CCA3654CBC504707EE8C65090BBD0001@ADAM.chapman.edu>
Message-ID: <320fb6e00905141024n623333a3p18ef45258ce1d554@mail.gmail.com>

On Thu, May 14, 2009 at 6:07 PM, Fahy, Michael wrote:
> This is not strictly a BioPython question but I'm using BioPython for
> the work.
>
> I have a set of 45 proteins and 10 species.  I have a representative
> orthologous protein from each set for each of the 10 species.  I'm
> trying to build a phylogenetic tree by aligning the data from the 10
> species.  I've tried concatenating the 45 protein sequences for each of
> the 10 species and aligning the concatenated sequences but this has
> produced results that do not make sense.  What do you recommend for such
> a problem?

Concatenating the sequences and then aligning them sounds like
asking for trouble. I would suggest taking each gene in isolation,
and making a protein sequence alignment. Then take the 45
alignments and concatenate them into one super-alignment [*].
Then make a tree.

There are things you should assess - for example, build trees from
each of the separate 45 protein alignments, and compare them - you may
find some of the genes are evolving at different rates etc.
Maybe only some of the 45 proteins are suitable. Perhaps looking at
the nucleotides would also be wise.

I'm sure an expert in phylogenetics (i.e. not me) could give much
more advice.

Peter

[*] This can be done in Biopython, but isn't that straightforward
at the moment, see this thread:
http://lists.open-bio.org/pipermail/biopython-dev/2009-May/006044.html
http://lists.open-bio.org/pipermail/biopython-dev/2009-May/006046.html

From cy at cymon.org Thu May 14 13:29:47 2009
From: cy at cymon.org (Cymon Cox)
Date: Thu, 14 May 2009 18:29:47 +0100
Subject: [Biopython] Sequence alignment with multiple proteins
In-Reply-To: <14308220CCA3654CBC504707EE8C65090BBD0001@ADAM.chapman.edu>
References: <14308220CCA3654CBC504707EE8C65090BBD0001@ADAM.chapman.edu>
Message-ID: <7265d4f0905141029r73a85d68ga4371c47cc361ba4@mail.gmail.com>

Hi Michael,

2009/5/14 Fahy, Michael

> This is not strictly a BioPython question but I'm using BioPython for
> the work.
>
> I have a set of 45 proteins and 10 species. I have a representative
> orthologous protein from each set for each of the 10 species. I'm
> trying to build a phylogenetic tree by aligning the data from the 10
> species. I've tried concatenating the 45 protein sequences for each of
> the 10 species and aligning the concatenated sequences but this has
> produced results that do not make sense. What do you recommend for such
> a problem?

The way I (and I suspect most others) approach this is to align each
protein data set individually (i.e. you'll have 45 separate protein
alignments) and then concatenate them into one super-matrix.

Currently, Bio.AlignIO does not support column to column concatenation
of data.
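
For readers who want to see what the column-wise concatenation amounts
to, here is a plain-Python sketch. It deliberately avoids Biopython and
is not the Bio.Nexus approach: the function name
concatenate_alignments, the dict-based alignment representation and the
toy sequences are all assumptions for illustration.

```python
def concatenate_alignments(alignments, gap="-"):
    """Join per-gene alignments column-wise into one supermatrix.

    Each alignment is a dict mapping a species name to its aligned
    sequence. A species missing from a gene is padded with gaps.
    (Hypothetical helper, not part of Biopython.)
    """
    species = sorted(set().union(*(aln.keys() for aln in alignments)))
    combined = {name: [] for name in species}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in an alignment must share one length")
        width = lengths.pop()
        for name in species:
            combined[name].append(aln.get(name, gap * width))
    return {name: "".join(parts) for name, parts in combined.items()}


# Toy data: two already-aligned genes, the second with an extra species.
gene1 = {"human": "MK-L", "mouse": "MKAL"}
gene2 = {"human": "GGT", "mouse": "G-T", "yeast": "GAT"}

supermatrix = concatenate_alignments([gene1, gene2])
for name, seq in sorted(supermatrix.items()):
    print(name, seq)
```

Each gene is aligned on its own first; the concatenation step only
glues the aligned columns together, which is why the per-gene
alignments must already be internally consistent in length.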
But by happy coincidence, David Winter posted today that he has
included a cookbook example of how to combine alignments using the
Bio.Nexus interface - you can find the example here:
http://biopython.org/wiki/Concatenate_nexus

If your alignment viewer does not support export in Nexus format, you
can use Bio.AlignIO to convert the alignment to Nexus.

Cheers, Cymon

--

From cjfields at illinois.edu Thu May 14 17:14:49 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Thu, 14 May 2009 16:14:49 -0500
Subject: [Biopython] [ANNOUNCEMENT] Google Summer of Code student Xin Shuai
Message-ID: <7A388EB0-E9B2-4579-80EA-83AC95817EF9@illinois.edu>

All,

I am proud to introduce Xin 'David' Shuai, my student for the Google
Summer of Code 2009, to the Open Bioinformatics community. David's
project centers on developing SWIG-based bindings to libsequence (a
population genetics library) for the BioLib project:

http://biolib.open-bio.org/wiki/Main_Page

Besides myself, David will be co-mentored by Mark Jensen and Pjotr
Prins.

As the BioLib project centers on creating common, maintainable
SWIG-based bindings to popular bioinformatics libraries for the
various Bio* toolkits, we will likely need input from the various
Open Bio communities at various stages in the project. At this time,
David's initial plans are to develop and test libsequence bindings for
Perl and Python.

David's proposal and project plan are available here:

http://biolib.open-bio.org/wiki/User:David

Congratulations David, and welcome to the Open-Bio community!
Sincerely,

Christopher Fields
University of Illinois Urbana-Champaign
Institute for Genomic Biology
Urbana, IL 61801

From hlapp at gmx.net Thu May 14 18:16:30 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 14 May 2009 18:16:30 -0400
Subject: [Biopython] [Bioperl-l] [ANNOUNCEMENT] Google Summer of Code student Xin Shuai
In-Reply-To: <7A388EB0-E9B2-4579-80EA-83AC95817EF9@illinois.edu>
References: <7A388EB0-E9B2-4579-80EA-83AC95817EF9@illinois.edu>
Message-ID: <4482F182-B0B0-4C71-B8DE-B5B7A4EC4D81@gmx.net>

Welcome David, good luck with your project, and I hope (actually, am
quite certain) that you'll enjoy your summer with us.

-hilmar

On May 14, 2009, at 5:14 PM, Chris Fields wrote:

> All,
>
> I am proud to introduce Xin 'David' Shuai, my student for the Google
> Summer of Code 2009, to the Open Bioinformatics community. David's
> project centers on developing SWIG-based bindings to libsequence (a
> population genetics library) for the BioLib project:
>
> http://biolib.open-bio.org/wiki/Main_Page
>
> Besides myself, David will be co-mentored by Mark Jensen and Pjotr
> Prins.
>
> As the BioLib project centers on creating common, maintainable SWIG-
> based bindings to popular bioinformatics libraries for the various
> Bio* toolkits, we will likely need input from the various Open Bio
> communities at various stages in the project. At this time, David's
> initial plans are to develop and test libsequence bindings for Perl
> and Python.
>
> David's proposal and project plan are available here:
>
> http://biolib.open-bio.org/wiki/User:David
>
> Congratulations David, and welcome to the Open-Bio community!
> > Sincerely, > > Christopher Fields > University of Illinois Urbana-Champaign > Institute for Genomic Biology > Urbana, IL 61801 > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From winda002 at student.otago.ac.nz Thu May 14 18:52:47 2009 From: winda002 at student.otago.ac.nz (David Winter) Date: Fri, 15 May 2009 10:52:47 +1200 Subject: [Biopython] Sequence alignment with multiple proteins In-Reply-To: <14308220CCA3654CBC504707EE8C65090BBD0001@ADAM.chapman.edu> References: <14308220CCA3654CBC504707EE8C65090BBD0001@ADAM.chapman.edu> Message-ID: <4A0CA0BF.5040609@student.otago.ac.nz> Fahy, Michael wrote: > This is not strictly a BioPython question but I'm using BioPython for > the work. > > > > I have a set of 45 proteins and 10 species. I have a representative > orthologous protein from each set for each of the 10 species. I'm > trying to build a phylogenetic tree by aligning the data from the 10 > species. I've tried concatenating the 45 protein sequences for each of > the 10 species and aligning the concatenated sequences but this has > produced results that do not make sense. What do you recommend for such > a problem? > Hi Michael, As you've heard the usual approach is to align the sequences individually first then make a supermatrix. Without knowing the details of the analysis you want to do I'd imagine with that many sequences for each taxon you're likely to have some protein-trees behaving differently than others (which might explain your unexpected results). 
There are ways of dealing with this (depending on your dedication to getting The One True Tree) like "gene jackknifing" (taking a set of proteins out and seeing how they affect the topology) and partition-based tests. Sadly these are frequently run on supercomputers...

Cheers,
david

From thomas.hamelryck at gmail.com Fri May 15 05:24:19 2009
From: thomas.hamelryck at gmail.com (Thomas Hamelryck)
Date: Fri, 15 May 2009 11:24:19 +0200
Subject: [Biopython] BOSC/ISMB conference reminder
In-Reply-To: <20090514122319.GD59158@sobchak.mgh.harvard.edu>
References: <20090514122319.GD59158@sobchak.mgh.harvard.edu>
Message-ID: <2d7c25310905150224s56b1ef75h40fc038e578f4570@mail.gmail.com>

Hi all,

> I mentioned previously the idea of an informal Biopython hackathon on
> the Monday and Tuesday after BOSC. Basically, this would be a chance
> to sit down and do some Biopython programming while being able to
> physically talk. I will definitely be there and am up for some coding
> with as many people as are inclined; Peter and Tiago responded favorably
> to the idea earlier. If you are at all interested send an e-mail. We can
> keep it informal with a few people, and I will try to organize space if
> it looks like we will have more.

I won't be at BOSC, because I'm going to the 3D-SIG, but a hackathon on Monday and/or Tuesday sounds fine! Who else is coming?
-Thomas From p.j.a.cock at googlemail.com Fri May 15 05:47:29 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 15 May 2009 10:47:29 +0100 Subject: [Biopython] BOSC/ISMB conference reminder In-Reply-To: <2d7c25310905150224s56b1ef75h40fc038e578f4570@mail.gmail.com> References: <20090514122319.GD59158@sobchak.mgh.harvard.edu> <2d7c25310905150224s56b1ef75h40fc038e578f4570@mail.gmail.com> Message-ID: <320fb6e00905150247u6038bf66s98575d84f4e5ae9a@mail.gmail.com> On Fri, May 15, 2009 at 10:24 AM, Thomas Hamelryck wrote: > > I won't be at BOSC, because I'm going to the 3D-SIG, but a hackathon on > Monday and/or Tuesday sounds fine! Who else is coming? > On the 3Dsig Satellite Meeting section, http://www.iscb.org/ismbeccb2009/registration.php says: "Satellite Meeting registration does not allow access to the SIG Meetings. Delegates wishing to attend portions of the Satellite Meeting and portions of one or more SIGs must register for both events." However, for the other SIG meetings, http://www.iscb.org/ismbeccb2009/registration.php says: "Registering for a SIG allows you to move freely between all SIGs that take place at the same time as the meeting for which you are registered, to the extent that the room capacities can accommodate. SIG registration does not allow access to the Satellite Meeting." The Birds of a Feather sessions at BOSC will probably be at the end of Sat and/or Sun, allowed to run on into the evening. Once all the talks are finished you could probably drop by and say hi. For BOSC itself (Sat/Sun), it looks like me, Brad, Tiago and Bartek (so far). For the Monday/Tuesday it looks like just Brad (and hopefully me - just looking at budgets now). 
Peter

From rcsqtc at iqac.csic.es Tue May 19 04:55:14 2009
From: rcsqtc at iqac.csic.es (Ramon Crehuet)
Date: Tue, 19 May 2009 10:55:14 +0200
Subject: [Biopython] Bio.PDB: removing disordred atoms
Message-ID: <4A1273F2.4050007@iqac.csic.es>

Dear all,
I'd like to save a pdb without the positions of alternative atoms, i.e., for disordered atoms keep only atom.altloc='A'. I thought of something like:

    all_atoms=[]
    for chain in structure[0]:
        for residue in chain.child_list:
            all_atoms=all_atoms+residue.get_unpacked_list()

    for atom in all_atoms:
        if atom.altloc=='B': del atom

    io=Bio.PDB.PDBIO()
    io.set_structure(structure)
    io.save('pdb_out_filename')

But that doesn't work. Disordered atoms are still there :-(

Thanks in advance,
Ramon

From p.j.a.cock at googlemail.com Tue May 19 05:39:07 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 19 May 2009 10:39:07 +0100
Subject: [Biopython] Bio.PDB: removing disordred atoms
In-Reply-To: <4A1273F2.4050007@iqac.csic.es>
References: <4A1273F2.4050007@iqac.csic.es>
Message-ID: <320fb6e00905190239md48cba4s62f78da26802a256@mail.gmail.com>

On Tue, May 19, 2009 at 9:55 AM, Ramon Crehuet wrote:
> Dear all,
> I'd like to save a pdb without the positions of alternative atoms,
> i.e., for disordered atoms keep only atom.altloc='A'.
> I thought of something like:
>
> all_atoms=[]
> for chain in structure[0]:
>     for residue in chain.child_list:
>         all_atoms=all_atoms+residue.get_unpacked_list()
>
> for atom in all_atoms:
>     if atom.altloc=='B': del atom
> ...

Doing "del atom" just deletes the local variable atom, i.e. it won't affect the PDB structure at all.
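To illustrate the point about "del", here is a minimal pure-Python sketch. It does not use Bio.PDB at all; the `atoms` list is a made-up stand-in for the atoms held by a structure. It suggests that `del` on the loop variable only removes that name and leaves the container untouched, whereas building a filtered list really does leave the unwanted items behind.

```python
# Stand-in data: (atom name, altloc) pairs. Hypothetical, not a Bio.PDB structure.
atoms = [("CA", "A"), ("CA", "B"), ("CB", " ")]

for atom in atoms:
    if atom[1] == "B":
        del atom  # unbinds the local name only; the list still holds the item

print(len(atoms))  # the list is unchanged by the loop above

# Building a new list is one way to actually leave items out.
kept = [atom for atom in atoms if atom[1] != "B"]
print(len(kept))
```

The same reasoning applies to any container: removing an element means changing the container itself (or writing a filtered copy), not deleting a name that happens to point at the element.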
I would suggest you look at pages 5 and 6 of the Bio.PDB documentation, the bit on the Select class:
http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf

You might also find this recent thread useful:
http://lists.open-bio.org/pipermail/biopython/2009-March/005005.html

Peter

From rcsqtc at iqac.csic.es Tue May 19 10:55:11 2009
From: rcsqtc at iqac.csic.es (Ramon Crehuet)
Date: Tue, 19 May 2009 16:55:11 +0200
Subject: [Biopython] Bio.PDB: removing disordred atoms
In-Reply-To: <320fb6e00905190239md48cba4s62f78da26802a256@mail.gmail.com>
References: <4A1273F2.4050007@iqac.csic.es> <320fb6e00905190239md48cba4s62f78da26802a256@mail.gmail.com>
Message-ID: <4A12C84F.5080808@iqac.csic.es>

Thanks,
The easiest way I found was defining a class to filter disordered atoms:

    from Bio.PDB.PDBIO import PDBIO, Select

    class NotDisordered(Select):
        def accept_atom(self, atom):
            if not atom.is_disordered():
                return 1
            elif atom.get_altloc()=='B':
                return 1
            else:
                return 0

    io=PDBIO()
    io.set_structure(s)
    io.save("1GS5-ord.pdb", select=NotDisordered())

Peter Cock wrote:
> On Tue, May 19, 2009 at 9:55 AM, Ramon Crehuet wrote:
>> Dear all,
>> I'd like to save a pdb without the positions of alternative atoms,
>> i.e., for disordered atoms keep only atom.altloc='A'.
>> I thought of something like:
>>
>> all_atoms=[]
>> for chain in structure[0]:
>>     for residue in chain.child_list:
>>         all_atoms=all_atoms+residue.get_unpacked_list()
>>
>> for atom in all_atoms:
>>     if atom.altloc=='B': del atom
>> ...
>
> Doing "del atom" just deletes the local variable atom.
> i.e. it won't affect the PDB structure at all.
> > I would suggest you look at pages 5 and 6 of the Bio.PDB > documentation, the bit on the Select class: > http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf > > You might also find this recent thread useful: > http://lists.open-bio.org/pipermail/biopython/2009-March/005005.html > > Peter > From p.j.a.cock at googlemail.com Tue May 19 12:32:04 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 19 May 2009 17:32:04 +0100 Subject: [Biopython] Bio.PDB: removing disordred atoms In-Reply-To: <4A12C84F.5080808@iqac.csic.es> References: <4A1273F2.4050007@iqac.csic.es> <320fb6e00905190239md48cba4s62f78da26802a256@mail.gmail.com> <4A12C84F.5080808@iqac.csic.es> Message-ID: <320fb6e00905190932n546e8134h9cceceb918dcaa94@mail.gmail.com> On Tue, May 19, 2009 at 3:55 PM, Ramon Crehuet wrote: > Thanks, > The easyiest way I found was defining a class to assert disordered atoms: > > class NotDisordered(Select): > ? ?def accept_atom(self, atom): > ? ? ? ?if not atom.is_disordered(): > ? ? ? ? ? ?return 1 > ? ? ? ?elif atom.get_altloc()=='B': > ? ? ? ? ? ?return 1 > ? ? ? ?else: > ? ? ? ? ? ?return 0 > > io=PDBIO() > io.set_structure(s) > io.save("1GS5-ord.pdb", select=NotDisordered()) Good - that's what you are expected to do. I'm glad it made sense. Peter P.S. Personally I would use True and False instead of 1 and 0. From jasperkoehorst at gmail.com Wed May 20 08:51:13 2009 From: jasperkoehorst at gmail.com (Jasper Koehorst) Date: Wed, 20 May 2009 14:51:13 +0200 Subject: [Biopython] NCBIWWW.qblast Expect value / E-value In-Reply-To: References: Message-ID: Im running against problems when i try to identify short peptide sequences of ?10 - 20 peptides. When i run this at the NCBI Blast website i get results, these all have high E-Values that i dont really care at the moment. The problem is when i do this in biopython as stated below, i will not get any results... I believe this is due to the fact that biopython will not show results with a "high" E-Value. 
Is there a way to change this? So it will allow results with an E-value of about 1500 or more?

I tried the script below but that does not quite work...

Does anybody have an idea?

        result_handle = NCBIWWW.qblast("blastp","nr",peptide[1],entrez_query=i, expect=2000, matrix_name='BLOSUM80')
        blast_records = NCBIXML.parse(result_handle)
        blast_record = blast_records.next()
        E_VALUE_TRESH = 10000000
        for alignment in blast_record.alignments:
            for hsp in alignment.hsps:

From p.j.a.cock at googlemail.com Wed May 20 09:26:08 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 20 May 2009 14:26:08 +0100
Subject: [Biopython] NCBIWWW.qblast Expect value / E-value
In-Reply-To:
References:
Message-ID: <320fb6e00905200626j37d1349dwafb1c5caa5dbd040@mail.gmail.com>

On Wed, May 20, 2009 at 1:51 PM, Jasper Koehorst wrote:
> I'm running against problems when I try to identify short peptide sequences
> of about 10 - 20 peptides. When I run this at the NCBI Blast website I get
> results, these all have
> high E-Values that I don't really care at the moment. The problem is when I
> do this in biopython as stated below, I will not get any results...
>
> I believe this is due to the fact that biopython will not show results with
> a "high" E-Value. Is there a way to change this? So it will allow results
> with an E-value of about 1500 or more?

Nothing in Biopython limits the expectation values - our qblast function defaults to 10, but you can set this to what you like. However, the NCBI may be imposing their own limit. Are you sure using anything more than 10 is actually meaningful?

> I tried the script below but that does not quite work...
>
> Does anybody have an idea?
>
>        result_handle =
> NCBIWWW.qblast("blastp","nr",peptide[1],entrez_query=i, expect=2000,
> matrix_name='BLOSUM80')
>        blast_records = NCBIXML.parse(result_handle)
>        blast_record = blast_records.next()
>        E_VALUE_TRESH = 10000000
>        for alignment in blast_record.alignments:
?for hsp in alignment.hsps: > I would guess (from previous examples) that this is due to the NCBI website and QBLAST API using different default parameters - the NCBI likes to change the defaults on the website from time to time, and these may differ from what you are getting via their QBLAST API. I would start by checking the gap parameters. See also: http://lists.open-bio.org/pipermail/biopython/2008-May/004252.html http://lists.open-bio.org/pipermail/biopython/2007-August/003679.html Peter From jasperkoehorst at gmail.com Wed May 20 09:52:22 2009 From: jasperkoehorst at gmail.com (Jasper Koehorst) Date: Wed, 20 May 2009 15:52:22 +0200 Subject: [Biopython] NCBIWWW.qblast Expect value / E-value In-Reply-To: <320fb6e00905200626j37d1349dwafb1c5caa5dbd040@mail.gmail.com> References: <320fb6e00905200626j37d1349dwafb1c5caa5dbd040@mail.gmail.com> Message-ID: Yes we have, at the moment are we trying to identify a lot of small peptides in several organisms for a project. What we would like to do is to ignore the e-value for this moment. But a solution has yet to be found. jasper koehorst 2009/5/20 Peter Cock > On Wed, May 20, 2009 at 1:51 PM, Jasper Koehorst > wrote: > > Im running against problems when i try to identify short peptide > sequences > > of ?10 - 20 peptides. When i run this at the NCBI Blast website i get > > results, these all have > > high E-Values that i dont really care at the moment. The problem is when > i > > do this in biopython as stated below, i will not get any results... > > > > I believe this is due to the fact that biopython will not show results > with > > a "high" E-Value. Is there a way to change this? So it will allow results > > with an E-value of ?1500 or more? > > Nothing in Biopython limits the expectation values - our qblast function > defaults to 10, but you can set this to what you like. However, the NCBI > may be imposing their own limit. Are you sure using anything more than > 10 is actually meaningful? 
> > > I tried the sript below but that does not quit work... > > > > Anybody has an idea? > > > > result_handle = > > NCBIWWW.qblast("blastp","nr",peptide[1],entrez_query=i, expect=2000, > > matrix_name='BLOSUM80') > > blast_records = NCBIXML.parse(result_handle) > > blast_record = blast_records.next() > > E_VALUE_TRESH = 10000000 > > for alignment in blast_record.alignments: > > for hsp in alignment.hsps: > > > > I would guess (from previous examples) that this is due to the NCBI > website and QBLAST API using different default parameters - the > NCBI likes to change the defaults on the website from time to time, > and these may differ from what you are getting via their QBLAST > API. I would start by checking the gap parameters. > > See also: > http://lists.open-bio.org/pipermail/biopython/2008-May/004252.html > http://lists.open-bio.org/pipermail/biopython/2007-August/003679.html > > Peter > From chapmanb at 50mail.com Thu May 21 08:09:26 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 21 May 2009 08:09:26 -0400 Subject: [Biopython] Bio.PDB: removing disordred atoms In-Reply-To: <4A12C84F.5080808@iqac.csic.es> References: <4A1273F2.4050007@iqac.csic.es> <320fb6e00905190239md48cba4s62f78da26802a256@mail.gmail.com> <4A12C84F.5080808@iqac.csic.es> Message-ID: <20090521120926.GL84112@sobchak.mgh.harvard.edu> Ramon; Great to hear you got this figured out with Peter's helpful direction. It would be very useful if you could contribute this as a cookbook example: http://biopython.org/wiki/Category:Cookbook with a short description of your motivation and the final code. This would make it accessible to others with a similar problem in the future. 
Brad > Thanks, > The easyiest way I found was defining a class to assert disordered atoms: > > > class NotDisordered(Select): > def accept_atom(self, atom): > if not atom.is_disordered(): > return 1 > elif atom.get_altloc()=='B': > return 1 > else: > return 0 > > io=PDBIO() > > io.set_structure(s) > io.save("1GS5-ord.pdb", select=NotDisordered()) > > > > Peter Cock wrote: > > On Tue, May 19, 2009 at 9:55 AM, Ramon Crehuet wrote: > >> Dear all, > >> I'd like to save a pdb without the positions of alternative atoms, > >> i.e, for disordered atoms keep only atom.altloc='A'. > >> I though of something like: > >> > >> all_atoms=[] > >> for chain in structure[0]: > >> for residue in chain.child_list: > >> all_atoms=all_atoms+residue.get_unpacked_list() > >> > >> for atom in all_atoms: > >> if atom.altloc=='B': del atom > >> ... > > > > Doing "del atom" just deletes the local variable atom. > > i.e. it won't affect the PDB structure at all. > > > > I would suggest you look at pages 5 and 6 of the Bio.PDB > > documentation, the bit on the Select class: > > http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf > > > > You might also find this recent thread useful: > > http://lists.open-bio.org/pipermail/biopython/2009-March/005005.html > > > > Peter > > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From matzke at berkeley.edu Thu May 21 15:00:11 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Thu, 21 May 2009 12:00:11 -0700 Subject: [Biopython] GSoC 2009/Matzke: BioGeographical Phylogenetics for Biopython Message-ID: <4A15A4BB.8040706@berkeley.edu> Hi all, I have been on the biopython list for some time, but as I am starting a Google Summer of Code project (hosted by NESCENT Phyloinformatics Summer of Code) involving this, I felt like I should introduce myself. Below are links to my set up/ grant proposal / project plan. 
I am open to comments/suggestions via email or the list. Cheers! Nick ============== Links: PhyloSoc Summer of Code 2009 summary: https://www.nescent.org/wg_phyloinformatics/Phyloinformatics_Summer_of_Code_2009#Biogeographical_Phylogenetics_for_BioPython Bio page: https://www.nescent.org/wg_phyloinformatics/User:Matzke BioPython wiki summary: http://biopython.org/wiki/Active_projects#Biogeography_.28GSoC.29 BioPython wiki work page, work plan: http://biopython.org/wiki/BioGeography Code repository (suggestions welcome from Brad et al. on the best way to do this): http://github.com/nmatzke/biopython/tree/master Comments welcome! I believe there is additional stuff to do, I will get on it tomorrow. Cheers! Nick -- ==================================================== Nicholas J. Matzke Ph.D. Candidate, Graduate Student Researcher Huelsenbeck Lab Center for Theoretical Evolutionary Genomics 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab websites: http://ib.berkeley.edu/people/lab_detail.php?lab=54 http://fisher.berkeley.edu/cteg/hlab.html Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html Lab phone: 510-643-6299 Dept. fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ----------------------------------------------------- "[W]hen people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989. 
http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm
====================================================

From lueck at ipk-gatersleben.de Tue May 26 03:34:50 2009
From: lueck at ipk-gatersleben.de (Stefanie Lück)
Date: Tue, 26 May 2009 09:34:50 +0200
Subject: [Biopython] blastall - query starts in xml
Message-ID: <004a01c9ddd4$73f12ac0$1022a8c0@ipkgatersleben.de>

Hi!

Is there a way to get the query start information of the hit in the xml output? Alternatively I can find the hit on the query.

Kind regards
Stefanie

From cmckay at u.washington.edu Tue May 26 15:20:47 2009
From: cmckay at u.washington.edu (Cedar McKay)
Date: Tue, 26 May 2009 12:20:47 -0700
Subject: [Biopython] SeqIO and fastq
Message-ID: <417AA1DD-2DE6-4EA4-BD1C-F6EDBCCC87CA@u.washington.edu>

I just used SeqIO to convert 10 million fastq reads to fasta. Fast and simple. Thanks for adding the functionality!

best,
Cedar
UW Oceanography

From winda002 at student.otago.ac.nz Tue May 26 21:18:57 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Wed, 27 May 2009 13:18:57 +1200
Subject: [Biopython] (no subject)
Message-ID: <4A1C9501.9000406@student.otago.ac.nz>

Stefanie Lück wrote:
> Hi!
>
> Is there a way to get the query start information of the hit in the xml output?
> Alternatively I can find the hit on the query

Hi Stefanie,

The query_start is in the "hsp" instance for each alignment in each blast record. If you have a record called b_record you can do this:

>>> for alignment in b_record.alignments:
...     for hsp in alignment.hsps:
...         print("hit '%s' matches query '%s' from query position %i" % (alignment.title, b_record.query, hsp.query_start))
...
hit 'gene1' matches query 'my_query1' from query position 841
hit 'gene2' matches query 'my_query1' from query position 190

There is a nice diagram of all the 'stuff' in the blast record class in the tutorial here:
http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#fig:blastrecord

Hope that helps you do what you want to,
David

From lueck at ipk-gatersleben.de Wed May 27 03:06:34 2009
From: lueck at ipk-gatersleben.de (Stefanie Lück)
Date: Wed, 27 May 2009 09:06:34 +0200
Subject: [Biopython] (no subject)
References: <4A1C9501.9000406@student.otago.ac.nz>
Message-ID: <002f01c9de99$abd511c0$1022a8c0@ipkgatersleben.de>

Hi David!

I found the problem. I always got the same start for hsp.query_start and was wondering what had happened. There was a mistake in my code. Sorry for wasting your time!

S.

----- Original Message -----
From: "David Winter"
To:
Sent: Wednesday, May 27, 2009 3:18 AM
Subject: [Biopython] (no subject)

Stefanie Lück wrote:
> Hi!
>
> Is there a way to get the query start information of the hit in the xml output?
> Alternatively I can find the hit on the query Hi Stefanie, The query_start is in the "hsp" instance for each alignment in each blast record, if you have a record called b_record you can do this: >>>for alignment in b_record.alignments: >>> for hsp in alignment.hsps: >>> print "hit '%s' matches query '%s' starting a query position %i" % (alignment.title, b_record.query, hsp.query_start ) hit 'gene1' matches query 'my_query1' from query position 841 hit 'gene2' matches query 'my_query1' from query position 190 There is a nice diagram of all the 'stuff' in the blast record class in the tutorial here: http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#fig:blastrecord Hope that helps you do what you want to, David _______________________________________________ Biopython mailing list - Biopython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From lueck at ipk-gatersleben.de Thu May 28 03:55:30 2009 From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=) Date: Thu, 28 May 2009 09:55:30 +0200 Subject: [Biopython] blastall - strange results References: <417AA1DD-2DE6-4EA4-BD1C-F6EDBCCC87CA@u.washington.edu> Message-ID: <004f01c9df69$ac13d1f0$1022a8c0@ipkgatersleben.de> Hi! The question is not really related to a biopython problem but nevertheless I want to be sure that I do everything correct. I get strange results with blast. My aim is to blast a query sequence, spitted to 21-mers, against a database. Since I need only 100 % matches of 21-mers, a set the word size parameter to 21. Now, as a positive control, I took one EST sequence and made a database of it. Then I took 100 bp of that sequence, spitted to 21-mers and blast each of them against my DB. Now I expect to get a full coverage (or better 80 hits because everything below 21 bp I don't blast) of hits because the sequence is fully present in the DB. Unfortunately blast finds much less (60-80 %, depending on the sequence). Is this normal? 
I would expect to find all 21-mers. Why only some? If I blast without to change the word size parameter its find all hits. But I would like to use this parameter because the blast is much faster and I don't need to take care about gaps etc. since I really need only 100 % 21 mer matches. Does someone have any ideas what could be the problem? Thanks in advance! Stefanie From chapmanb at 50mail.com Thu May 28 08:02:41 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 28 May 2009 08:02:41 -0400 Subject: [Biopython] blastall - strange results In-Reply-To: <004f01c9df69$ac13d1f0$1022a8c0@ipkgatersleben.de> References: <417AA1DD-2DE6-4EA4-BD1C-F6EDBCCC87CA@u.washington.edu> <004f01c9df69$ac13d1f0$1022a8c0@ipkgatersleben.de> Message-ID: <20090528120241.GG94873@sobchak.mgh.harvard.edu> Hi Stefanie; > I get strange results with blast. > My aim is to blast a query sequence, spitted to 21-mers, against a database. [...] > Is this normal? I would expect to find all 21-mers. Why only some? BLAST isn't the best tool for this sort of problem. For exhaustively aligning short sequences to a database of target sequences, you should think about using a short read aligner. This is a nice summary of available aligners: http://www.sanger.ac.uk/Users/lh3/NGSalign.shtml Personally, I have had good experiences using Mosaik and Bowtie. Hope this helps, Brad From biopythonlist at gmail.com Fri May 29 05:36:29 2009 From: biopythonlist at gmail.com (dr goettel) Date: Fri, 29 May 2009 11:36:29 +0200 Subject: [Biopython] searching for a human chromosome position Message-ID: <9b15d9f30905290236se3ff02flc9f441d7d46d3a2d@mail.gmail.com> Hello, I am new using biopython and after reading the documentation I'd like some guides to resolve one "simple" thing. 
I want to, given a number of a human chromosome, the position of the nucleotide and the nucleotide that should be in this position, search for that position and determine if there has been a mutation and if that mutation produces an aminoacid change or not. I supose that first of all I have to query genome database(?) using Entrez module and retrieve the sequence where this base is. Then I supose I have to look for translated sequences of this sequence and see what is the most probably frame of traduction for this sequence and then see if there is a change of aminoacid or not. Please could anybody send some clues for querying the database and find the most probably frame of traduction to protein (in case that this is a good workflow to solve this particular problem)?? Thankyou very much. d From biopythonlist at gmail.com Fri May 29 06:06:44 2009 From: biopythonlist at gmail.com (dr goettel) Date: Fri, 29 May 2009 12:06:44 +0200 Subject: [Biopython] searching for a human chromosome position Message-ID: <9b15d9f30905290306w58e227dew2cb164608acf4010@mail.gmail.com> Hello, I am new using biopython and after reading the documentation I'd like some guides to resolve one "simple" thing. I want to, given a number of a human chromosome, the position of the nucleotide and the nucleotide that should be in this position, search for that position and determine if there has been a mutation and if that mutation produces an aminoacid change or not. I supose that first of all I have to query genome database(?) using Entrez module and retrieve the sequence where this base is. Then I supose I have to look for translated sequences of this sequence and see what is the most probably frame of traduction for this sequence and then see if there is a change of aminoacid or not. Please could anybody send some clues for querying the database and find the most probably frame of traduction to protein (in case that this is a good workflow to solve this particular problem)?? Thankyou very much. 
d From rjalves at igc.gulbenkian.pt Sun May 31 13:16:27 2009 From: rjalves at igc.gulbenkian.pt (Renato Alves) Date: Sun, 31 May 2009 18:16:27 +0100 Subject: [Biopython] Entrez.esearch sort by publication date Message-ID: <4A22BB6B.8010305@igc.gulbenkian.pt> Hi everyone, I've been using Entrez.esearch for a while without problems but today I wanted to have the results sorted by publication date. According to the docs at: http://www.ncbi.nlm.nih.gov/corehtml/query/static/esearch_help.html#Sort I should use 'pub+date', however this doesn't work. If I use 'author' and 'journal' I have no problems but if I use 'last+author' or 'pub+date' I get an empty reply: >>>Entrez.esearch(db='pubmed', term=search, retmax=5, sort='pub+date').read() \n\n\n' Any suggestions on how to make this work? Thanks, Renato From p.j.a.cock at googlemail.com Tue May 5 16:41:31 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 5 May 2009 17:41:31 +0100 Subject: [Biopython] Dropping Python 2.3 support in Biopython Message-ID: <320fb6e00905050941m37725eb5ibba02ca99236212e@mail.gmail.com> Hello all, This is a final warning that the next release of Biopython will not support Python 2.3. As far as we are aware, no-one has come forward with a need for continued support for Python 2.3, so we will soon begin removing the special case code needed to keep Biopython working on Python 2.3. This will give us a simpler code base, less platforms to test on, and we can also take advantage of various language features only available in Python 2.4+ (e.g. generator expressions and decorators). Any last minute requests to postpone this should be made to the main Biopython mailing list by Friday 8 May. 
Thank you, Peter From scheper at email.unc.edu Wed May 6 15:33:40 2009 From: scheper at email.unc.edu (Walter Scheper) Date: Wed, 6 May 2009 11:33:40 -0400 Subject: [Biopython] [BioPython] Question about using Entrez.epost Message-ID: Hey folks, I'm currently working on a project where we need to download large numbers (1000s) of SNPs from NCBI's database. The documentation for BioPython tells me I should be using Entrez.epost for this, and then using the resulting search history to pull down my snps. However, I find that epost itself has a maxium limit to the number of rs Ids I can use in a single search, which roughly translates into about 700 rs Ids. Is this as intended, or am I not using epost correctly? If I am using epost correctly, what's the best way to break this up so that (a) I get my data and (b) don't overburden NCBI's system. Here's how I'm calling epost, mostly this is straight out of the tutorial: search_results = Entrez.read(Entrez.epost(db='snp', id=id_string)) Thanks for any help, Walter Scheper From biopython at maubp.freeserve.co.uk Wed May 6 16:43:15 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 6 May 2009 17:43:15 +0100 Subject: [Biopython] [BioPython] Question about using Entrez.epost In-Reply-To: References: Message-ID: <320fb6e00905060943p1e65890y325e65db6271501@mail.gmail.com> On Wed, May 6, 2009 at 4:33 PM, Walter Scheper wrote: > Hey folks, > > I'm currently working on a project where we need to download large numbers > (1000s) of SNPs from NCBI's database. The documentation for BioPython tells > me I should be using Entrez.epost for this, and then using the resulting > search history to pull down my snps. However, I find that epost itself has a > maxium limit to the number of rs Ids I can use in a single search, which > roughly translates into about 700 rs Ids. Is this as intended, or am I not > using epost correctly? 
If I am using epost correctly, what's the best way to > break this up so that (a) I get my data and (b) don't overburden NCBI's > system. > > Here's how I'm calling epost, mostly this is straight out of the tutorial: > ? ?search_results = Entrez.read(Entrez.epost(db='snp', id=id_string)) > > Thanks for any help, > Walter Scheper Where is the 700 ID limit from? I don't recall any limits on the EPost documentation. Was this found by experimentation? http://www.ncbi.nlm.nih.gov/entrez/query/static/epost_help.html Note that with such large numbers, once you have got EPost to work, you should probably be using EFetch to download the results in batches. This doesn't really answer your question, but it should be fairly simple to use EPost and EFetch in batches of (say) 100 records, which avoids the whole issue. How big a batch size you choose will depend on how big the records are. As far as I know the NCBI don't give any guidance on batch sizes. However, do make sure you run this script at the weekend or outside normal USA working hours: http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html#UserSystemRequirements Peter From biopython at maubp.freeserve.co.uk Thu May 7 08:59:44 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 7 May 2009 09:59:44 +0100 Subject: [Biopython] [BioPython] Question about using Entrez.epost In-Reply-To: <061B355A-D859-43F6-A10B-233043FB0858@email.unc.edu> References: <320fb6e00905060943p1e65890y325e65db6271501@mail.gmail.com> <061B355A-D859-43F6-A10B-233043FB0858@email.unc.edu> Message-ID: <320fb6e00905070159v5fe22d81ibec1611af2ee83e0@mail.gmail.com> On Wed, May 6, 2009 at 6:07 PM, Walter Scheper wrote: > > On May 6, 2009, at 12:43 PM, Peter wrote: > >> Where is the 700 ID limit from? ?I don't recall any limits on the EPost >> documentation. ?Was this found by experimentation? >> http://www.ncbi.nlm.nih.gov/entrez/query/static/epost_help.html >> > Yes, that was found by experimentation with epost. 
I don't think the limit > is really tied to the number of Ids, but to the length of the string. > However, I thought the purpose of epost was that you didn't run into url > length problems, or is it more a way to queue up a request? I think I see what is going on now. Using an HTTP POST would avoid long URL problems with an HTTP GET (something the NCBI documents don't explain - at first reading the EPost command seems redundant). However, from looking over our Bio.Entrez code more carefully it seems we are not actually using a POST after all - just a plain HTTP GET. This would explain the limit you are seeing - you are either the first person to try such a long ID list with Bio.Entrez (personally I have only ever downloaded much smaller datasets in one go), or at least you are the first person to actually report it is broken. So thank you! I've filed Bug 2824 on this issue: http://bugzilla.open-bio.org/show_bug.cgi?id=2824 >> Note that with such large numbers, once you have got EPost to work, you >> should probably be using EFetch to download the results in batches. > > Sure. I suppose I'll just have to break the whole thing, search and > retrieval, into chunks. In the short term yes, that is a practical solution. If you would be happy to update your installation (or at least, update the Bio/Entrez/__init__.py file) then you can help test any fix. 
Peter From biopython at maubp.freeserve.co.uk Thu May 7 10:24:26 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 7 May 2009 11:24:26 +0100 Subject: [Biopython] [BioPython] Question about using Entrez.epost In-Reply-To: <320fb6e00905070159v5fe22d81ibec1611af2ee83e0@mail.gmail.com> References: <320fb6e00905060943p1e65890y325e65db6271501@mail.gmail.com> <061B355A-D859-43F6-A10B-233043FB0858@email.unc.edu> <320fb6e00905070159v5fe22d81ibec1611af2ee83e0@mail.gmail.com> Message-ID: <320fb6e00905070324j7eb96d0bs768f72769f75a9c7@mail.gmail.com> On Thu, May 7, 2009 at 9:59 AM, Peter wrote: > I think I see what is going on now. Using an HTTP POST would avoid > long URL problems with an HTTP GET (something the NCBI documents > don't explain - at first reading the EPost command seems redundant). > However, from looking over our Bio.Entrez code more carefully it > seems we are not actually using a POST after all - just a plain HTTP > GET. > > This would explain the limit you are seeing - you are either the first > person to try such a long ID list with Bio.Entrez (personally I have > only ever downloaded much smaller datasets in one go), or at least > you are the first person to actually report it is broken. So thank you! > > I've filed Bug 2824 on this issue: > http://bugzilla.open-bio.org/show_bug.cgi?id=2824 Fixed in CVS (which will propagate to github shortly). All you need to do to apply the fix is update your .../site-packages/Bio/Entrez/__init__.py file with the new version, which can be downloaded from here shortly: http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Entrez/__init__.py?cvsroot=biopython Or, just install all of Biopython from the latest source (fetched from CVS or github). If you need more detailed instructions, let us know. This should fix your EPost problem - if you can confirm this by email that would be great. Thank you.
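To illustrate why a GET request runs into a length limit while a POST does not, the sketch below builds both kinds of request with the Python standard library and compares the URL lengths. Nothing is sent over the network. The endpoint URL is an assumption (the current public EPost address, which may differ from the one used in 2009), the rs IDs are made up, and this is not the actual Bio.Entrez fix.

```python
from urllib.parse import urlencode
from urllib.request import Request

EPOST_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi"  # assumed endpoint

# Made-up rs IDs, only to produce a long parameter string.
ids = ",".join(str(1000000 + i) for i in range(1000))
params = urlencode({"db": "snp", "id": ids})

# GET: every parameter is part of the URL, so the URL grows with the ID list.
get_request = Request(EPOST_URL + "?" + params)

# POST: the URL stays short and the parameters travel in the request body.
post_request = Request(EPOST_URL, data=params.encode("ascii"))

print(get_request.get_method(), len(get_request.full_url))
print(post_request.get_method(), len(post_request.full_url))
```

With a thousand IDs the GET URL is several thousand characters long, which many servers and proxies reject, while the POST URL is just the endpoint itself.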
Peter From biopython at maubp.freeserve.co.uk Fri May 8 19:55:56 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 8 May 2009 20:55:56 +0100 Subject: [Biopython] [BioPython] Question about using Entrez.epost In-Reply-To: References: <320fb6e00905060943p1e65890y325e65db6271501@mail.gmail.com> <061B355A-D859-43F6-A10B-233043FB0858@email.unc.edu> <320fb6e00905070159v5fe22d81ibec1611af2ee83e0@mail.gmail.com> <320fb6e00905070324j7eb96d0bs768f72769f75a9c7@mail.gmail.com> Message-ID: <320fb6e00905081255x611872f9pe48b1818aec0a930@mail.gmail.com> On Fri, May 8, 2009 at 8:38 PM, Walter Scheper wrote: > > On May 7, 2009, at 6:24 AM, Peter wrote: > >> This should fix your EPost problem - if you can confirm this by email >> that would be great. >> >> Thank you. >> >> Peter > > Hey Peter, > > Thanks for the help, and the quick fix. This does indeed fix the problem I > was having. Now I can at least get NCBI to cache the whole set before I > start downloading. > > Walter Great - thanks again for reporting this problem. Peter From biopython at maubp.freeserve.co.uk Tue May 12 16:10:02 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 12 May 2009 17:10:02 +0100 Subject: [Biopython] Loading SeqRecords into BioSQL with NCBI taxon ID In-Reply-To: <320fb6e00905120905l3d0e31b2p2a3d92f61096cbd5@mail.gmail.com> References: <320fb6e00905120905l3d0e31b2p2a3d92f61096cbd5@mail.gmail.com> Message-ID: <320fb6e00905120910v7bc942eai1e163d29b0d88f00@mail.gmail.com> Sorry - I meant to post this to the main mailing list rather than the dev list, as it is of general interest. Peter On Tue, May 12, 2009 at 5:05 PM, Peter wrote: > Over on Bug 2826, David wrote: > http://bugzilla.open-bio.org/show_bug.cgi?id=2826#c2 > >> Thank you. I'm new to BioPython. >> >> The goal was to take some whole-genome sequence (which isn't in Genbank) and >> attach a taxon to it, in order that it be written to a BioSQL database. 
> > You've talked about trying to parse WGS GenBank files on Bug 2825 but > presumably if this new data isn't in GenBank, it is in another format. > > What format is your whole-genome sequence? FASTA or something simple? > >> Other records in the BioSQL database derive from NCBI and so have taxon_ids, >> so the additional WGS being in a similar format would make things simpler. > > I see. Basically you need to import a SeqRecord into BioSQL with an > NCBI taxon ID. You don't need to write out a GenBank file to do this. > > First create the SeqRecord, e.g.
>
> from Bio import SeqIO
> record = SeqIO.read(handle, format, alphabet)
>
> There are now two options - because the BioSQL loader will look for > the NCBI taxon ID in two places: > > (Option 1) Record the NCBI taxon ID in the SeqRecord's annotation > dictionary under the "ncbi_taxid" key. This should work (untested):
>
> record.annotations["ncbi_taxid"] = 12345 #or single element list, [12345]
>
> (Option 2) Mimic a SeqRecord from parsing a GenBank file with a source > feature containing the taxon ID. This should work (untested):
>
> #Create the SeqRecord:
> record = SeqIO.read(handle, format, alphabet)
> #Create the source features:
> from Bio.SeqFeature import SeqFeature, FeatureLocation
> f = SeqFeature(FeatureLocation(0, len(record)), strand=+1, type="source")
> f.qualifiers["db_xref"] = ["taxon:12345"]
> record.features = [f] #or insert at start
>
> If you don't really have a sequence, this second approach doesn't make > so much sense. > > [Arguably there could be a third option via the dbxref's list] > > Then in either case, load the modified SeqRecord into the database. > You may want to pre-load the NCBI taxonomy, see > http://www.biopython.org/wiki/BioSQL > > Alternatively, using Biopython 1.49+ you can have this fetched from > Entrez on demand with the fetch_NCBI_taxonomy=True option. The BioSQL > wiki page needs updating on this topic.
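Putting Option 1 together with the loading step might look like the sketch below. It is untested against a real database and makes several assumptions: `tag_with_taxon` and `load_with_taxon` are hypothetical helper names, the input is assumed to be a single-record FASTA file, `server` is assumed to be an already opened BioSQL server object (for example from `BioSeqDatabase.open_database`), and `namespace` is assumed to name an existing sub-database.

```python
def tag_with_taxon(record, ncbi_taxid):
    """Store the NCBI taxon ID under the annotation key named in Option 1."""
    record.annotations["ncbi_taxid"] = ncbi_taxid
    return record


def load_with_taxon(fasta_path, ncbi_taxid, server, namespace):
    """Hypothetical helper: read one FASTA record and load it into BioSQL."""
    from Bio import SeqIO  # needs Biopython; imported lazily on purpose

    record = tag_with_taxon(SeqIO.read(fasta_path, "fasta"), ncbi_taxid)
    db = server[namespace]  # an existing BioSQL sub-database (assumed)
    # Ask the loader to pull the taxonomy from Entrez on demand.
    count = db.load([record], fetch_NCBI_taxonomy=True)
    server.commit()
    return count
```

Keeping the annotation step in its own small function makes it easy to apply the same taxon ID to many records before a single `load` call.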
> > Peter > From chapmanb at 50mail.com Thu May 14 12:23:19 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 14 May 2009 08:23:19 -0400 Subject: [Biopython] BOSC/ISMB conference reminder Message-ID: <20090514122319.GD59158@sobchak.mgh.harvard.edu> Hi all; A friendly reminder to all of us Python bioinformatics folks; the early registration deadline for the Bioinformatics Open Source Conference (BOSC) is this Friday, May 15th: http://www.open-bio.org/wiki/BOSC_2009 BOSC is on in Stockholm, Sweden from June 27th to 28th, and takes place in conjunction with the ISMB meeting. Peter will be there presenting an update on the latest and greatest in Biopython. I will be giving a 10 minute talk during the Data and Analysis Management section about ideas for publishing biological data on the web with Python examples. Generally, BOSC is a great chance to meet like-minded informatics folks and learn about all the great tools being developed out there. I mentioned previously the idea of an informal Biopython hackathon on the Monday and Tuesday after BOSC. Basically, this would be a chance to sit down and do some Biopython programming while being able to physically talk. I will definitely be there and am up for some coding with as many people as are inclined; Peter and Tiago responded favorably to the idea earlier. If you are at all interested send an e-mail. We can keep it informal with a few people, and I will try to organize space if it looks like we will have more. 
Looking forward to seeing everyone there, Brad From p.j.a.cock at googlemail.com Thu May 14 12:47:55 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 14 May 2009 13:47:55 +0100 Subject: [Biopython] BOSC/ISMB conference reminder In-Reply-To: <20090514122319.GD59158@sobchak.mgh.harvard.edu> References: <20090514122319.GD59158@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00905140547x6a128212ne529fda93f84aa16@mail.gmail.com> On Thu, May 14, 2009 at 1:23 PM, Brad Chapman wrote: > Hi all; > A friendly reminder to all of us Python bioinformatics folks; > the early registration deadline for the Bioinformatics Open Source > Conference (BOSC) is this Friday, May 15th: > > http://www.open-bio.org/wiki/BOSC_2009 > > BOSC is on in Stockholm, Sweden from June 27th to 28th, and takes > place in conjunction with the ISMB meeting. > > Peter will be there presenting an update on the latest and greatest > in Biopython. I will be giving a 10 minute talk during the Data > and Analysis Management section about ideas for publishing > biological data on the web with Python examples. Generally, BOSC is > a great chance to meet like-minded informatics folks and learn about > all the great tools being developed out there. > > I mentioned previously the idea of an informal Biopython hackathon on > the Monday and Tuesday after BOSC. Basically, this would be a chance > to sit down and do some Biopython programming while being able to > physically talk. I will definitely be there and am up for some coding > with as many people as are inclined; Peter and Tiago responded favorably > to the idea earlier. If you are at all interested send an e-mail. We can > keep it informal with a few people, and I will try to organize space if > it looks like we will have more. You mean Monday June 29 and/or Tuesday June 30? These are the first two days of the ISMB meeting, so my time would have to be juggled between this and whatever conference sessions look most relevant. 
http://www.open-bio.org/wiki/BOSC_2009 says "In addition, we are asking all speakers to come prepared to lead an informal tutorial on their software during a Birds of a Feather/hackathon session." So I will ask them to allocate us a slot on the Saturday/Sunday during BOSC itself. They should take care of getting a room and internet access too. This would be more of a basic introduction to Biopython (hopefully we can generate some useful additions to the cookbook from this) rather than a developmental session like Brad is planning. Peter From tiagoantao at gmail.com Thu May 14 12:50:55 2009 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Thu, 14 May 2009 13:50:55 +0100 Subject: [Biopython] BOSC/ISMB conference reminder In-Reply-To: <20090514122319.GD59158@sobchak.mgh.harvard.edu> References: <20090514122319.GD59158@sobchak.mgh.harvard.edu> Message-ID: <6d941f120905140550v4185ab73sdb2d3a8d47e80cea@mail.gmail.com> On Thu, May 14, 2009 at 1:23 PM, Brad Chapman wrote: > I mentioned previously the idea of an informal Biopython hackathon on > the Monday and Tuesday after BOSC. Basically, this would be a chance > to sit down and do some Biopython programming while being able to > physically talk. I will definitely be there and am up for some coding > with as many people as are inclined; Peter and Tiago responded favorably > to the idea earlier. If you are at all interested send an e-mail. We can > keep it informal with a few people, and I will try to organize space if > it looks like we will have more. I will only be going if this goes forward. For me any group size is good (small, medium, big). But I would need to know if this goes forward or not (doesn't need to be 100% sure, just the committed intention to go ahead). 
Regards, Tiago From bartek at rezolwenta.eu.org Thu May 14 12:58:09 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Thu, 14 May 2009 14:58:09 +0200 Subject: [Biopython] BOSC/ISMB conference reminder In-Reply-To: <320fb6e00905140547x6a128212ne529fda93f84aa16@mail.gmail.com> References: <20090514122319.GD59158@sobchak.mgh.harvard.edu> <320fb6e00905140547x6a128212ne529fda93f84aa16@mail.gmail.com> Message-ID: <8b34ec180905140558v1cc3142fnb4bc300133e32076@mail.gmail.com> On Thu, May 14, 2009 at 2:47 PM, Peter Cock wrote: > On Thu, May 14, 2009 at 1:23 PM, Brad Chapman wrote: > You mean Monday June 29 and/or Tuesday June 30? These are the > first two days of the ISMB meeting, so my time would have to be > juggled between this and whatever conference sessions look most > relevant. > > http://www.open-bio.org/wiki/BOSC_2009 says "In addition, we are > asking all speakers to come prepared to lead an informal tutorial on > their software during a Birds of a Feather/hackathon session." > So I will ask them to allocate us a slot on the Saturday/Sunday > during BOSC itself. They should take care of getting a room and > internet access too. This would be more of a basic introduction to > Biopython (hopefully we can generate some useful additions to the > cookbook from this) rather than a developmental session like Brad > is planning. > Hi, I'll also be presenting at BOSC and I'd be glad to take part in the meeting/hackathon. Unfortunately, I'm not staying for ISMB, so my vote goes definitely on Saturday/Sunday timing.
Looking forward to seeing you in Stockholm Bartek From p.j.a.cock at googlemail.com Thu May 14 13:26:41 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 14 May 2009 14:26:41 +0100 Subject: [Biopython] BOSC/ISMB conference reminder In-Reply-To: <320fb6e00905140547x6a128212ne529fda93f84aa16@mail.gmail.com> References: <20090514122319.GD59158@sobchak.mgh.harvard.edu> <320fb6e00905140547x6a128212ne529fda93f84aa16@mail.gmail.com> Message-ID: <320fb6e00905140626l82001bbw8ba7912d2b596d50@mail.gmail.com> Brad wrote: >> I mentioned previously the idea of an informal Biopython hackathon on >> the Monday and Tuesday after BOSC. Basically, this would be a chance >> to sit down and do some Biopython programming while being able to >> physically talk. I will definitely be there and am up for some coding >> with as many people as are inclined; Peter and Tiago responded >> favorably to the idea earlier. If you are at all interested send an e-mail. >> We can keep it informal with a few people, and I will try to organize >> space if it looks like we will have more. Peter wrote: > You mean Monday June 29 and/or Tuesday June 30? These are the > first two days of the ISMB meeting, so my time would have to be > juggled between this and whatever conference sessions look most > relevant. > > http://www.open-bio.org/wiki/BOSC_2009 says "In addition, we are > asking all speakers to come prepared to lead an informal tutorial on > their software during a Birds of a Feather/hackathon session." > So I will ask them to allocate us a slot on the Saturday/Sunday > during BOSC itself. They should take care of getting a room and > internet access too. This would be more of a basic introduction to > Biopython (hopefully we can generate some useful additions to the > cookbook from this) rather than a developmental session like Brad > is planning. I have emailed the BOSC organisers to ask for a Biopython "Birds of a Feather/hackathon session" slot.
In previous years these sessions have been on the Sunday afternoon - and didn't clash with any of the BOSC talks. If we have me, Brad, Bartek and Tiago there we should be able to cover most user questions, and maybe also get some development in too. If it is just me and Brad staying on for the rest of the week, we can get together informally (maybe an evening session in a hotel or something). Peter From kkelchev at bulsyst.com Thu May 14 13:44:53 2009 From: kkelchev at bulsyst.com (Kamen Kelchev) Date: Thu, 14 May 2009 16:44:53 +0300 Subject: [Biopython] HMM example Message-ID: Hi to all, I need a simple example of using the Bio.HMM library. I read the documentation (cookbook) and tutorial, but section 15.x is not filled in. Thanks Kamen From biopython at maubp.freeserve.co.uk Thu May 14 13:57:23 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 14 May 2009 14:57:23 +0100 Subject: [Biopython] HMM example In-Reply-To: References: Message-ID: <320fb6e00905140657m44cb32c7j897a8e43208c13f2@mail.gmail.com> On Thu, May 14, 2009 at 2:44 PM, Kamen Kelchev wrote: > Hi to all, > > I need a simple example of using the Bio.HMM library. > > I read the documentation (cookbook) and tutorial, but section 15.x is not > filled in. > > Thanks Kamen You are right, we don't have any proper documentation in the tutorial for Bio.HMM yet. There is the code's built-in documentation (Python docstrings) which you can access with the Python help command, or read online here: http://biopython.org/DIST/docs/api/Bio.HMM-module.html There are also some example scripts in our unit tests, files test_HMMCasino.py and test_HMMGeneral.py, which should be a good starting point. Hopefully you can make some progress from that.... If you would like to contribute some documentation that would be great!
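As a stopgap while the tutorial section is empty, here is a small dependency-free sketch of a "dishonest casino" style HMM, which is presumably the kind of model the test_HMMCasino.py script is built around. It deliberately does not use the Bio.HMM API, and every probability below is a made-up illustrative value. It only shows the idea of Viterbi decoding: given a series of die rolls, guess when a fair (F) or loaded (L) die was in use.

```python
import math

STATES = ("F", "L")  # fair die, loaded die
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.95, "L": 0.05},
         "L": {"F": 0.10, "L": 0.90}}
EMIT = {"F": {r: 1.0 / 6 for r in "123456"},
        "L": {r: 0.1 for r in "12345"}}
EMIT["L"]["6"] = 0.5  # the loaded die favours sixes


def viterbi(rolls):
    """Return the most probable state path for a non-empty string of rolls."""
    # scores[s] = best log-probability of any path ending in state s
    scores = {s: math.log(START[s]) + math.log(EMIT[s][rolls[0]])
              for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for roll in rolls[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES,
                       key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (scores[prev] + math.log(TRANS[prev][s])
                             + math.log(EMIT[s][roll]))
            pointers[s] = prev
        scores = new_scores
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi("3151162464466442453113216666666666666662"))
```

Working in log space avoids numerical underflow on long roll sequences. The Bio.HMM classes described in the docstrings linked above cover the same ground and add training.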
Peter From chapmanb at 50mail.com Thu May 14 14:13:17 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 14 May 2009 10:13:17 -0400 Subject: [Biopython] BOSC/ISMB conference reminder In-Reply-To: <320fb6e00905140547x6a128212ne529fda93f84aa16@mail.gmail.com> References: <20090514122319.GD59158@sobchak.mgh.harvard.edu> <320fb6e00905140547x6a128212ne529fda93f84aa16@mail.gmail.com> Message-ID: <20090514141317.GI59158@sobchak.mgh.harvard.edu> Hi all; > > I mentioned previously the idea of an informal Biopython hackathon on > > the Monday and Tuesday after BOSC. Peter: > http://www.open-bio.org/wiki/BOSC_2009 says "In addition, we are > asking all speakers to come prepared to lead an informal tutorial on > their software during a Birds of a Feather/hackathon session." > So I will ask them to allocate us a slot on the Saturday/Sunday > during BOSC itself. They should take care of getting a room and > internet access too. This would be more of a basic introduction to > Biopython (hopefully we can generate some useful additions to the > cookbook from this) rather than a developmental session like Brad > is planning. Glad to see all the interest. We can be flexible in how we schedule things but we will definitely have some time for coding. Tiago, you should come; BOSC is well worth it in addition to the hacking. Peter, thanks for organizing the BoF session. This will probably be more of an introduction and question/answer period if my past experience is any predictor, but we can certainly morph into a coding session afterwards. In the past these were in the evening and we had rooms for as long as we wanted, so that'll make the coordination easy. It will also accommodate Bartek and anyone else who is around for the weekend. Peter: > You mean Monday June 29 and/or Tuesday June 30? These are the > first two days of the ISMB meeting, so my time would have to be > juggled between this and whatever conference sessions look most > relevant. 
That was my initial idea since the BOSC schedule isn't all nailed down, but fitting time in on Saturday and Sunday as available is also great. I will be around for the days afterwards but not attending ISMB, so can be flexible on times for anyone who want to keep things going. I'm looking forward to it. As we get closer we should spend some time brainstorming a list of things to do. I would like it if some non-Biopython programming people jumped in as well; it would be a great chance to discuss coordination with other projects. Brad From macrozhu at gmail.com Thu May 14 15:13:45 2009 From: macrozhu at gmail.com (Hongbo Zhu) Date: Thu, 14 May 2009 17:13:45 +0200 Subject: [Biopython] Bio.PDB.PDBIO.save renumbers Atom Serial Number Message-ID: <11b97ec0905140813p22216df3i72c11ad802d53a5e@mail.gmail.com> Hi, The >>Atom Serial Number<< in the .pdb files from the Protein Data Bank (PDB) are assigned consecutively from 1 to ATOM/HETATM records. This helps to identify atoms, though it may be risky to rely on that number completely. Apparently, function Bio.PDB.PDBIO.save() renumbers the >>Atom Serial Number<< in newly generated PDB file. This is on one hand consistent with .pdb file style, on the other hand very confusing, as the same atom gets different serial numbers. I suggest at least give the users the choice to keep the original >>Atom Serial Number<< by adding a new parameter to the save() function of class Bio.PDB.PDBIO. cheers, hongbo From fahy at chapman.edu Thu May 14 17:07:42 2009 From: fahy at chapman.edu (Fahy, Michael) Date: Thu, 14 May 2009 10:07:42 -0700 Subject: [Biopython] Sequence alignment with multiple proteins Message-ID: <14308220CCA3654CBC504707EE8C65090BBD0001@ADAM.chapman.edu> This is not strictly a BioPython question but I'm using BioPython for the work. I have a set of 45 proteins and 10 species. I have a representative orthologous protein from each set for each of the 10 species. 
I'm trying to build a phylogenetic tree by aligning the data from the 10 species. I've tried concatenating the 45 protein sequences for each of the 10 species and aligning the concatenated sequences but this has produced results that do not make sense. What do you recommend for such a problem? ------------------------------------------------------- Michael A. Fahy, PhD Professor and Chair Mathematics and Computer Science Chapman University One University Drive Orange, CA 92866 (714) 997-6879 fahy at chapman.edu From biopython at maubp.freeserve.co.uk Thu May 14 17:24:11 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 14 May 2009 18:24:11 +0100 Subject: [Biopython] Sequence alignment with multiple proteins In-Reply-To: <14308220CCA3654CBC504707EE8C65090BBD0001@ADAM.chapman.edu> References: <14308220CCA3654CBC504707EE8C65090BBD0001@ADAM.chapman.edu> Message-ID: <320fb6e00905141024n623333a3p18ef45258ce1d554@mail.gmail.com> On Thu, May 14, 2009 at 6:07 PM, Fahy, Michael wrote: > This is not strictly a BioPython question but I'm using BioPython for > the work. > > I have a set of 45 proteins and 10 species. I have a representative > orthologous protein from each set for each of the 10 species. I'm > trying to build a phylogenetic tree by aligning the data from the 10 > species. I've tried concatenating the 45 protein sequences for each of > the 10 species and aligning the concatenated sequences but this has > produced results that do not make sense. What do you recommend for such > a problem? Concatenating the sequences and then aligning them sounds like asking for trouble. I would suggest taking each gene in isolation, and making a protein sequence alignment. Then take the 45 alignments and concatenate them into one super-alignment [*]. Then make a tree. There are things you should assess - for example do trees from each of the separate 45 protein alignments, and compare them - you may find some of the genes are evolving at different rates etc.
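The "align each gene separately, then concatenate" step can be sketched without any library support. The function below is an illustration only: `concatenate` is a hypothetical name, each alignment is assumed to be a plain dict mapping a species name to its aligned (gapped) sequence string, and species missing from one alignment are padded with gap characters, which is one common choice rather than the only one.

```python
def concatenate(alignments, missing="-"):
    """Join per-gene alignments column-wise into one super-alignment."""
    species = sorted({name for aln in alignments for name in aln})
    combined = {name: [] for name in species}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in an alignment must have equal length")
        width = lengths.pop()
        for name in species:
            # Pad with gap characters when a species lacks this gene.
            combined[name].append(aln.get(name, missing * width))
    return {name: "".join(parts) for name, parts in combined.items()}


gene1 = {"human": "MK-LV", "mouse": "MKALV", "chick": "MR-LV"}
gene2 = {"human": "GGT", "mouse": "GCT"}  # no chick sequence for this gene
supermatrix = concatenate([gene1, gene2])
for name, seq in sorted(supermatrix.items()):
    print(name, seq)
```

Every row of the result has the same length, so the super-alignment can be written out in whatever format the tree-building program expects.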
Maybe only some of the 45 proteins are suitable. Perhaps looking at the nucleotides would also be wise. I'm sure an expert in phylogenetics (i.e. not me) could give much more advice. Peter [*] This can be done in Biopython, but isn't that straightforward at the moment, see this thread: http://lists.open-bio.org/pipermail/biopython-dev/2009-May/006044.html http://lists.open-bio.org/pipermail/biopython-dev/2009-May/006046.html From cy at cymon.org Thu May 14 17:29:47 2009 From: cy at cymon.org (Cymon Cox) Date: Thu, 14 May 2009 18:29:47 +0100 Subject: [Biopython] Sequence alignment with multiple proteins In-Reply-To: <14308220CCA3654CBC504707EE8C65090BBD0001@ADAM.chapman.edu> References: <14308220CCA3654CBC504707EE8C65090BBD0001@ADAM.chapman.edu> Message-ID: <7265d4f0905141029r73a85d68ga4371c47cc361ba4@mail.gmail.com> Hi Michael, 2009/5/14 Fahy, Michael > This is not strictly a BioPython question but I'm using BioPython for > the work. > > I have a set of 45 proteins and 10 species. I have a representative > orthologous protein from each set for each of the 10 species. I'm > trying to build a phylogenetic tree by aligning the data from the 10 > species. I've tried concatenating the 45 protein sequences for each of > the 10 species and aligning the concatenated sequences but this has > produced results that do not make sense. What do you recommend for such > a problem? The way I (and I suspect most others) approach this is to align each protein's data individually (i.e. you'll have 45 separate protein alignments) and then concatenate them into one super-matrix. Currently, Bio.AlignIO does not support column to column concatenation of data.
But by happy coincidence, David Winter posted today that he has included a cookbook example of how to combine alignments using the Bio.Nexus interface - you can find the example here: http://biopython.org/wiki/Concatenate_nexus If your alignment viewer does not support export in Nexus format, you can use Bio.AlignIO to convert the alignment to Nexus. Cheers, Cymon -- From cjfields at illinois.edu Thu May 14 21:14:49 2009 From: cjfields at illinois.edu (Chris Fields) Date: Thu, 14 May 2009 16:14:49 -0500 Subject: [Biopython] [ANNOUNCEMENT] Google Summer of Code student Xin Shuai Message-ID: <7A388EB0-E9B2-4579-80EA-83AC95817EF9@illinois.edu> All, I am proud to introduce Xin 'David' Shuai, my student for the Google Summer of Code 2009, to the Open Bioinformatics community. David's project centers on developing SWIG-based bindings to libsequence (a population genetics library) for the BioLib project: http://biolib.open-bio.org/wiki/Main_Page Besides myself, David will be co-mentored by Mark Jensen and Pjotr Prins. As the BioLib project centers on creating common, maintainable SWIG-based bindings to popular bioinformatics libraries for the various Bio* toolkits, we will likely need input from the various Open Bio communities at various stages in the project. At this time, David's initial plans are to develop and test libsequence bindings for Perl and Python. David's proposal and project plan are available here: http://biolib.open-bio.org/wiki/User:David Congratulations David, and welcome to the Open-Bio community!
Sincerely, Christopher Fields University of Illinois Urbana-Champaign Institute for Genomic Biology Urbana, IL 61801 From hlapp at gmx.net Thu May 14 22:16:30 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 14 May 2009 18:16:30 -0400 Subject: [Biopython] [Bioperl-l] [ANNOUNCEMENT] Google Summer of Code student Xin Shuai In-Reply-To: <7A388EB0-E9B2-4579-80EA-83AC95817EF9@illinois.edu> References: <7A388EB0-E9B2-4579-80EA-83AC95817EF9@illinois.edu> Message-ID: <4482F182-B0B0-4C71-B8DE-B5B7A4EC4D81@gmx.net> Welcome David, good luck with your project, and I hope (actually, am quite certain) that you'll enjoy your summer with us. -hilmar On May 14, 2009, at 5:14 PM, Chris Fields wrote: > All, > > I am proud to introduce Xin 'David' Shuai, my student for the Google > Summer of Code 2009, to the Open Bioinformatics community. David's > project centers on developing SWIG-based bindings to libsequence (a > population genetics library) for the BioLib project: > > http://biolib.open-bio.org/wiki/Main_Page > > Besides myself, David will be co-mentored by Mark Jensen and Pjotr > Prins. > > As the BioLib project centers on creating common, maintainable SWIG- > based bindings to popular bioinformatics libraries for the various > Bio* toolkits, we will likely need input from the various Open Bio > communities at various stages in the project. At this time, David's > initial plans are to develop and test libsequence bindings for Perl > and Python. > > David's proposal and project plan are available here: > > http://biolib.open-bio.org/wiki/User:David > > Congratulations David, and welcome to the Open-Bio community! 
> > Sincerely, > > Christopher Fields > University of Illinois Urbana-Champaign > Institute for Genomic Biology > Urbana, IL 61801 > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From winda002 at student.otago.ac.nz Thu May 14 22:52:47 2009 From: winda002 at student.otago.ac.nz (David Winter) Date: Fri, 15 May 2009 10:52:47 +1200 Subject: [Biopython] Sequence alignment with multiple proteins In-Reply-To: <14308220CCA3654CBC504707EE8C65090BBD0001@ADAM.chapman.edu> References: <14308220CCA3654CBC504707EE8C65090BBD0001@ADAM.chapman.edu> Message-ID: <4A0CA0BF.5040609@student.otago.ac.nz> Fahy, Michael wrote: > This is not strictly a BioPython question but I'm using BioPython for > the work. > > > > I have a set of 45 proteins and 10 species. I have a representative > orthologous protein from each set for each of the 10 species. I'm > trying to build a phylogenetic tree by aligning the data from the 10 > species. I've tried concatenating the 45 protein sequences for each of > the 10 species and aligning the concatenated sequences but this has > produced results that do not make sense. What do you recommend for such > a problem? > Hi Michael, As you've heard the usual approach is to align the sequences individually first then make a supermatrix. Without knowing the details of the analysis you want to do I'd imagine with that many sequences for each taxon you're likely to have some protein-trees behaving differently than others (which might explain your unexpected results). 
There are ways of dealing with this (depending on your dedication to getting The One True Tree) like "gene jackknifing" (taking a set of proteins out and seeing how they affect the topology) and partition based tests. Sadly these are frequently run on supercomputers... Cheers, david From thomas.hamelryck at gmail.com Fri May 15 09:24:19 2009 From: thomas.hamelryck at gmail.com (Thomas Hamelryck) Date: Fri, 15 May 2009 11:24:19 +0200 Subject: [Biopython] BOSC/ISMB conference reminder In-Reply-To: <20090514122319.GD59158@sobchak.mgh.harvard.edu> References: <20090514122319.GD59158@sobchak.mgh.harvard.edu> Message-ID: <2d7c25310905150224s56b1ef75h40fc038e578f4570@mail.gmail.com> Hi all, > I mentioned previously the idea of an informal Biopython hackathon on > the Monday and Tuesday after BOSC. Basically, this would be a chance > to sit down and do some Biopython programming while being able to > physically talk. I will definitely be there and am up for some coding > with as many people as are inclined; Peter and Tiago responded favorably > to the idea earlier. If you are at all interested send an e-mail. We can > keep it informal with a few people, and I will try to organize space if > it looks like we will have more. I won't be at BOSC, because I'm going to the 3D-SIG, but a hackathon on Monday and/or Tuesday sounds fine! Who else is coming?
-Thomas From p.j.a.cock at googlemail.com Fri May 15 09:47:29 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 15 May 2009 10:47:29 +0100 Subject: [Biopython] BOSC/ISMB conference reminder In-Reply-To: <2d7c25310905150224s56b1ef75h40fc038e578f4570@mail.gmail.com> References: <20090514122319.GD59158@sobchak.mgh.harvard.edu> <2d7c25310905150224s56b1ef75h40fc038e578f4570@mail.gmail.com> Message-ID: <320fb6e00905150247u6038bf66s98575d84f4e5ae9a@mail.gmail.com> On Fri, May 15, 2009 at 10:24 AM, Thomas Hamelryck wrote: > > I won't be at BOSC, because I'm going to the 3D-SIG, but a hackathon on > Monday and/or Tuesday sounds fine! Who else is coming? > On the 3Dsig Satellite Meeting section, http://www.iscb.org/ismbeccb2009/registration.php says: "Satellite Meeting registration does not allow access to the SIG Meetings. Delegates wishing to attend portions of the Satellite Meeting and portions of one or more SIGs must register for both events." However, for the other SIG meetings, http://www.iscb.org/ismbeccb2009/registration.php says: "Registering for a SIG allows you to move freely between all SIGs that take place at the same time as the meeting for which you are registered, to the extent that the room capacities can accommodate. SIG registration does not allow access to the Satellite Meeting." The Birds of a Feather sessions at BOSC will probably be at the end of Sat and/or Sun, allowed to run on into the evening. Once all the talks are finished you could probably drop by and say hi. For BOSC itself (Sat/Sun), it looks like me, Brad, Tiago and Bartek (so far). For the Monday/Tuesday it looks like just Brad (and hopefully me - just looking at budgets now). 
Peter

From rcsqtc at iqac.csic.es Tue May 19 08:55:14 2009
From: rcsqtc at iqac.csic.es (Ramon Crehuet)
Date: Tue, 19 May 2009 10:55:14 +0200
Subject: [Biopython] Bio.PDB: removing disordred atoms
Message-ID: <4A1273F2.4050007@iqac.csic.es>

Dear all,
I'd like to save a pdb without the positions of alternative atoms, i.e., for disordered atoms keep only atom.altloc='A'. I thought of something like:

all_atoms=[]
for chain in structure[0]:
    for residue in chain.child_list:
        all_atoms=all_atoms+residue.get_unpacked_list()

for atom in all_atoms:
    if atom.altloc=='B': del atom

io=Bio.PDB.PDBIO()
io.set_structure(structure)
io.save('pdb_out_filename')

But that doesn't work. Disordered atoms are still there :-(
Thanks in advance,
Ramon

From p.j.a.cock at googlemail.com Tue May 19 09:39:07 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 19 May 2009 10:39:07 +0100
Subject: [Biopython] Bio.PDB: removing disordred atoms
In-Reply-To: <4A1273F2.4050007@iqac.csic.es>
References: <4A1273F2.4050007@iqac.csic.es>
Message-ID: <320fb6e00905190239md48cba4s62f78da26802a256@mail.gmail.com>

On Tue, May 19, 2009 at 9:55 AM, Ramon Crehuet wrote:
> Dear all,
> I'd like to save a pdb without the positions of alternative atoms,
> i.e., for disordered atoms keep only atom.altloc='A'.
> I thought of something like:
>
> all_atoms=[]
> for chain in structure[0]:
>     for residue in chain.child_list:
>         all_atoms=all_atoms+residue.get_unpacked_list()
>
> for atom in all_atoms:
>     if atom.altloc=='B': del atom
> ...

Doing "del atom" just deletes the local variable atom, i.e. it won't affect the PDB structure at all.
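The point about "del atom" can be checked in plain Python, without Bio.PDB. This is only a minimal sketch: the list of (name, altloc) tuples is a hypothetical stand-in for the atoms of a structure, not a Bio.PDB object. It shows that "del" inside the loop unbinds the loop variable and leaves the container untouched, whereas building a new list does drop the entries.

```python
# Stand-in data: (atom name, altloc) pairs. Hypothetical, not Bio.PDB objects.
atoms = [("CA", "A"), ("CA", "B"), ("CB", " ")]

for atom in atoms:
    if atom[1] == "B":
        del atom  # unbinds the local name only; the list still holds the tuple

count_after_del = len(atoms)  # still 3, nothing was removed

# Building a new list is what actually drops the unwanted entries.
kept = [atom for atom in atoms if atom[1] != "B"]
```

In Bio.PDB itself the equivalent of "building a new list" is filtering at output time, which is what the Select class mentioned in this thread is for.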
I would suggest you look at pages 5 and 6 of the Bio.PDB documentation, the bit on the Select class:
http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf

You might also find this recent thread useful:
http://lists.open-bio.org/pipermail/biopython/2009-March/005005.html

Peter

From rcsqtc at iqac.csic.es Tue May 19 14:55:11 2009
From: rcsqtc at iqac.csic.es (Ramon Crehuet)
Date: Tue, 19 May 2009 16:55:11 +0200
Subject: [Biopython] Bio.PDB: removing disordred atoms
In-Reply-To: <320fb6e00905190239md48cba4s62f78da26802a256@mail.gmail.com>
References: <4A1273F2.4050007@iqac.csic.es> <320fb6e00905190239md48cba4s62f78da26802a256@mail.gmail.com>
Message-ID: <4A12C84F.5080808@iqac.csic.es>

Thanks,
The easiest way I found was defining a class to filter out disordered atoms:

class NotDisordered(Select):
    def accept_atom(self, atom):
        if not atom.is_disordered():
            return 1
        elif atom.get_altloc()=='B':
            return 1
        else:
            return 0

io=PDBIO()
io.set_structure(s)
io.save("1GS5-ord.pdb", select=NotDisordered())

Peter Cock wrote:
> On Tue, May 19, 2009 at 9:55 AM, Ramon Crehuet wrote:
>> Dear all,
>> I'd like to save a pdb without the positions of alternative atoms,
>> i.e., for disordered atoms keep only atom.altloc='A'.
>> I thought of something like:
>>
>> all_atoms=[]
>> for chain in structure[0]:
>>     for residue in chain.child_list:
>>         all_atoms=all_atoms+residue.get_unpacked_list()
>>
>> for atom in all_atoms:
>>     if atom.altloc=='B': del atom
>> ...
>
> Doing "del atom" just deletes the local variable atom.
> i.e. it won't affect the PDB structure at all.
>
> I would suggest you look at pages 5 and 6 of the Bio.PDB
> documentation, the bit on the Select class:
> http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf
>
> You might also find this recent thread useful:
> http://lists.open-bio.org/pipermail/biopython/2009-March/005005.html
>
> Peter
>

From p.j.a.cock at googlemail.com Tue May 19 16:32:04 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 19 May 2009 17:32:04 +0100
Subject: [Biopython] Bio.PDB: removing disordred atoms
In-Reply-To: <4A12C84F.5080808@iqac.csic.es>
References: <4A1273F2.4050007@iqac.csic.es> <320fb6e00905190239md48cba4s62f78da26802a256@mail.gmail.com> <4A12C84F.5080808@iqac.csic.es>
Message-ID: <320fb6e00905190932n546e8134h9cceceb918dcaa94@mail.gmail.com>

On Tue, May 19, 2009 at 3:55 PM, Ramon Crehuet wrote:
> Thanks,
> The easiest way I found was defining a class to filter out disordered atoms:
>
> class NotDisordered(Select):
>     def accept_atom(self, atom):
>         if not atom.is_disordered():
>             return 1
>         elif atom.get_altloc()=='B':
>             return 1
>         else:
>             return 0
>
> io=PDBIO()
> io.set_structure(s)
> io.save("1GS5-ord.pdb", select=NotDisordered())

Good - that's what you are expected to do. I'm glad it made sense.

Peter

P.S. Personally I would use True and False instead of 1 and 0.

From jasperkoehorst at gmail.com Wed May 20 12:51:13 2009
From: jasperkoehorst at gmail.com (Jasper Koehorst)
Date: Wed, 20 May 2009 14:51:13 +0200
Subject: [Biopython] NCBIWWW.qblast Expect value / E-value
In-Reply-To:
References:
Message-ID:

I'm running into problems when I try to identify short peptide sequences of ~10 - 20 peptides. When I run this at the NCBI Blast website I get results; these all have high E-values that I don't really care about at the moment. The problem is that when I do this in Biopython as stated below, I do not get any results...

I believe this is due to the fact that Biopython will not show results with a "high" E-value.
Is there a way to change this, so that it will allow results with an E-value of ~1500 or more?

I tried the script below but that does not quite work...

Does anybody have an idea?

result_handle = NCBIWWW.qblast("blastp","nr",peptide[1],entrez_query=i, expect=2000, matrix_name='BLOSUM80')
blast_records = NCBIXML.parse(result_handle)
blast_record = blast_records.next()
E_VALUE_TRESH = 10000000
for alignment in blast_record.alignments:
    for hsp in alignment.hsps:

From p.j.a.cock at googlemail.com Wed May 20 13:26:08 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 20 May 2009 14:26:08 +0100
Subject: [Biopython] NCBIWWW.qblast Expect value / E-value
In-Reply-To:
References:
Message-ID: <320fb6e00905200626j37d1349dwafb1c5caa5dbd040@mail.gmail.com>

On Wed, May 20, 2009 at 1:51 PM, Jasper Koehorst wrote:
> I'm running into problems when I try to identify short peptide sequences
> of ~10 - 20 peptides. When I run this at the NCBI Blast website I get
> results; these all have high E-values that I don't really care about at
> the moment. The problem is that when I do this in Biopython as stated
> below, I do not get any results...
>
> I believe this is due to the fact that Biopython will not show results with
> a "high" E-value. Is there a way to change this, so that it will allow
> results with an E-value of ~1500 or more?

Nothing in Biopython limits the expectation values - our qblast function defaults to 10, but you can set this to what you like. However, the NCBI may be imposing their own limit. Are you sure using anything more than 10 is actually meaningful?

> I tried the script below but that does not quite work...
>
> Does anybody have an idea?
>
> result_handle = NCBIWWW.qblast("blastp","nr",peptide[1],entrez_query=i, expect=2000, matrix_name='BLOSUM80')
> blast_records = NCBIXML.parse(result_handle)
> blast_record = blast_records.next()
> E_VALUE_TRESH = 10000000
> for alignment in blast_record.alignments:
>     for hsp in alignment.hsps:
>

I would guess (from previous examples) that this is due to the NCBI website and QBLAST API using different default parameters - the NCBI likes to change the defaults on the website from time to time, and these may differ from what you are getting via their QBLAST API. I would start by checking the gap parameters.

See also:
http://lists.open-bio.org/pipermail/biopython/2008-May/004252.html
http://lists.open-bio.org/pipermail/biopython/2007-August/003679.html

Peter

From jasperkoehorst at gmail.com Wed May 20 13:52:22 2009
From: jasperkoehorst at gmail.com (Jasper Koehorst)
Date: Wed, 20 May 2009 15:52:22 +0200
Subject: [Biopython] NCBIWWW.qblast Expect value / E-value
In-Reply-To: <320fb6e00905200626j37d1349dwafb1c5caa5dbd040@mail.gmail.com>
References: <320fb6e00905200626j37d1349dwafb1c5caa5dbd040@mail.gmail.com>
Message-ID:

Yes, we are. At the moment we are trying to identify a lot of small peptides in several organisms for a project. What we would like to do is to ignore the E-value for the moment. But a solution has yet to be found.

jasper koehorst

2009/5/20 Peter Cock
> On Wed, May 20, 2009 at 1:51 PM, Jasper Koehorst wrote:
> > I'm running into problems when I try to identify short peptide sequences
> > of ~10 - 20 peptides. When I run this at the NCBI Blast website I get
> > results; these all have high E-values that I don't really care about at
> > the moment. The problem is that when I do this in Biopython as stated
> > below, I do not get any results...
> >
> > I believe this is due to the fact that Biopython will not show results with
> > a "high" E-value. Is there a way to change this, so that it will allow
> > results with an E-value of ~1500 or more?
>
> Nothing in Biopython limits the expectation values - our qblast function
> defaults to 10, but you can set this to what you like. However, the NCBI
> may be imposing their own limit. Are you sure using anything more than
> 10 is actually meaningful?
>
> > I tried the script below but that does not quite work...
> >
> > Does anybody have an idea?
> >
> > result_handle = NCBIWWW.qblast("blastp","nr",peptide[1],entrez_query=i, expect=2000, matrix_name='BLOSUM80')
> > blast_records = NCBIXML.parse(result_handle)
> > blast_record = blast_records.next()
> > E_VALUE_TRESH = 10000000
> > for alignment in blast_record.alignments:
> >     for hsp in alignment.hsps:
> >
>
> I would guess (from previous examples) that this is due to the NCBI
> website and QBLAST API using different default parameters - the
> NCBI likes to change the defaults on the website from time to time,
> and these may differ from what you are getting via their QBLAST
> API. I would start by checking the gap parameters.
>
> See also:
> http://lists.open-bio.org/pipermail/biopython/2008-May/004252.html
> http://lists.open-bio.org/pipermail/biopython/2007-August/003679.html
>
> Peter
>

From chapmanb at 50mail.com Thu May 21 12:09:26 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 21 May 2009 08:09:26 -0400
Subject: [Biopython] Bio.PDB: removing disordred atoms
In-Reply-To: <4A12C84F.5080808@iqac.csic.es>
References: <4A1273F2.4050007@iqac.csic.es> <320fb6e00905190239md48cba4s62f78da26802a256@mail.gmail.com> <4A12C84F.5080808@iqac.csic.es>
Message-ID: <20090521120926.GL84112@sobchak.mgh.harvard.edu>

Ramon;
Great to hear you got this figured out with Peter's helpful direction. It would be very useful if you could contribute this as a cookbook example:

http://biopython.org/wiki/Category:Cookbook

with a short description of your motivation and the final code. This would make it accessible to others with a similar problem in the future.
Brad

> Thanks,
> The easiest way I found was defining a class to filter out disordered atoms:
>
>
> class NotDisordered(Select):
>     def accept_atom(self, atom):
>         if not atom.is_disordered():
>             return 1
>         elif atom.get_altloc()=='B':
>             return 1
>         else:
>             return 0
>
> io=PDBIO()
>
> io.set_structure(s)
> io.save("1GS5-ord.pdb", select=NotDisordered())
>
>
>
> Peter Cock wrote:
> > On Tue, May 19, 2009 at 9:55 AM, Ramon Crehuet wrote:
> >> Dear all,
> >> I'd like to save a pdb without the positions of alternative atoms,
> >> i.e., for disordered atoms keep only atom.altloc='A'.
> >> I thought of something like:
> >>
> >> all_atoms=[]
> >> for chain in structure[0]:
> >>     for residue in chain.child_list:
> >>         all_atoms=all_atoms+residue.get_unpacked_list()
> >>
> >> for atom in all_atoms:
> >>     if atom.altloc=='B': del atom
> >> ...
> >
> > Doing "del atom" just deletes the local variable atom.
> > i.e. it won't affect the PDB structure at all.
> >
> > I would suggest you look at pages 5 and 6 of the Bio.PDB
> > documentation, the bit on the Select class:
> > http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf
> >
> > You might also find this recent thread useful:
> > http://lists.open-bio.org/pipermail/biopython/2009-March/005005.html
> >
> > Peter
> >
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From matzke at berkeley.edu Thu May 21 19:00:11 2009
From: matzke at berkeley.edu (Nick Matzke)
Date: Thu, 21 May 2009 12:00:11 -0700
Subject: [Biopython] GSoC 2009/Matzke: BioGeographical Phylogenetics for Biopython
Message-ID: <4A15A4BB.8040706@berkeley.edu>

Hi all,

I have been on the biopython list for some time, but as I am starting a Google Summer of Code project (hosted by NESCENT Phyloinformatics Summer of Code) involving this, I felt like I should introduce myself.

Below are links to my setup / grant proposal / project plan.
I am open to comments/suggestions via email or the list.

Cheers!
Nick

==============
Links:

PhyloSoc Summer of Code 2009 summary:
https://www.nescent.org/wg_phyloinformatics/Phyloinformatics_Summer_of_Code_2009#Biogeographical_Phylogenetics_for_BioPython

Bio page:
https://www.nescent.org/wg_phyloinformatics/User:Matzke

BioPython wiki summary:
http://biopython.org/wiki/Active_projects#Biogeography_.28GSoC.29

BioPython wiki work page, work plan:
http://biopython.org/wiki/BioGeography

Code repository (suggestions welcome from Brad et al. on the best way to do this):
http://github.com/nmatzke/biopython/tree/master

Comments welcome! I believe there is additional stuff to do, I will get on it tomorrow.

Cheers!
Nick

--
====================================================
Nicholas J. Matzke
Ph.D. Candidate, Graduate Student Researcher
Huelsenbeck Lab
Center for Theoretical Evolutionary Genomics
4151 VLSB (Valley Life Sciences Building)
Department of Integrative Biology
University of California, Berkeley

Lab websites:
http://ib.berkeley.edu/people/lab_detail.php?lab=54
http://fisher.berkeley.edu/cteg/hlab.html
Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370
Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html
Lab phone: 510-643-6299
Dept. fax: 510-643-6264
Cell phone: 510-301-0179
Email: matzke at berkeley.edu

Mailing address:
Department of Integrative Biology
3060 VLSB #3140
Berkeley, CA 94720-3140

-----------------------------------------------------
"[W]hen people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together."

Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989.
http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm
====================================================

From lueck at ipk-gatersleben.de Tue May 26 07:34:50 2009
From: lueck at ipk-gatersleben.de (Stefanie Lück)
Date: Tue, 26 May 2009 09:34:50 +0200
Subject: [Biopython] blastall - query starts in xml
Message-ID: <004a01c9ddd4$73f12ac0$1022a8c0@ipkgatersleben.de>

Hi!

Is there a way to get the query start information of the hit in the xml output?
Alternatively I can find the hit on the query.

Kind regards
Stefanie

From cmckay at u.washington.edu Tue May 26 19:20:47 2009
From: cmckay at u.washington.edu (Cedar McKay)
Date: Tue, 26 May 2009 12:20:47 -0700
Subject: [Biopython] SeqIO and fastq
Message-ID: <417AA1DD-2DE6-4EA4-BD1C-F6EDBCCC87CA@u.washington.edu>

I just used SeqIO to convert 10 million fastq reads to fasta. Fast and simple. Thanks for adding the functionality!

best,
Cedar
UW Oceanography

From winda002 at student.otago.ac.nz Wed May 27 01:18:57 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Wed, 27 May 2009 13:18:57 +1200
Subject: [Biopython] (no subject)
Message-ID: <4A1C9501.9000406@student.otago.ac.nz>

Stefanie Lück wrote:
> Hi!
>
> Is there a way to get the query start information of the hit in the xml output?
> Alternatively I can find the hit on the query

Hi Stefanie,

The query_start is in the "hsp" instance for each alignment in each blast record. If you have a record called b_record you can do this:

>>> for alignment in b_record.alignments:
...     for hsp in alignment.hsps:
...         print "hit '%s' matches query '%s' from query position %i" % (alignment.title, b_record.query, hsp.query_start)
...
hit 'gene1' matches query 'my_query1' from query position 841
hit 'gene2' matches query 'my_query1' from query position 190

There is a nice diagram of all the 'stuff' in the blast record class in the tutorial here:
http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#fig:blastrecord

Hope that helps you do what you want to,
David

From lueck at ipk-gatersleben.de Wed May 27 07:06:34 2009
From: lueck at ipk-gatersleben.de (Stefanie Lück)
Date: Wed, 27 May 2009 09:06:34 +0200
Subject: [Biopython] (no subject)
References: <4A1C9501.9000406@student.otago.ac.nz>
Message-ID: <002f01c9de99$abd511c0$1022a8c0@ipkgatersleben.de>

Hi David!

I found the problem. I always got the same start for hsp.query_start and was wondering what had happened. I had a mistake in my code. Sorry for wasting your time!

S.

----- Original Message -----
From: "David Winter"
To:
Sent: Wednesday, May 27, 2009 3:18 AM
Subject: [Biopython] (no subject)

Stefanie Lück wrote:
> Hi!
>
> Is there a way to get the query start information of the hit in the xml output?
> Alternatively I can find the hit on the query

Hi Stefanie,

The query_start is in the "hsp" instance for each alignment in each blast record. If you have a record called b_record you can do this:

>>> for alignment in b_record.alignments:
...     for hsp in alignment.hsps:
...         print "hit '%s' matches query '%s' from query position %i" % (alignment.title, b_record.query, hsp.query_start)
...
hit 'gene1' matches query 'my_query1' from query position 841
hit 'gene2' matches query 'my_query1' from query position 190

There is a nice diagram of all the 'stuff' in the blast record class in the tutorial here:
http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#fig:blastrecord

Hope that helps you do what you want to,
David

From lueck at ipk-gatersleben.de Thu May 28 07:55:30 2009
From: lueck at ipk-gatersleben.de (Stefanie Lück)
Date: Thu, 28 May 2009 09:55:30 +0200
Subject: [Biopython] blastall - strange results
References: <417AA1DD-2DE6-4EA4-BD1C-F6EDBCCC87CA@u.washington.edu>
Message-ID: <004f01c9df69$ac13d1f0$1022a8c0@ipkgatersleben.de>

Hi!

The question is not really related to a Biopython problem, but nevertheless I want to be sure that I am doing everything correctly.

I get strange results with blast. My aim is to blast a query sequence, split into 21-mers, against a database. Since I need only 100 % matches of 21-mers, I set the word size parameter to 21.

Now, as a positive control, I took one EST sequence and made a database of it. Then I took 100 bp of that sequence, split it into 21-mers and blasted each of them against my DB. Now I expect to get full coverage of hits (or rather 80 hits, because I don't blast anything below 21 bp) because the sequence is fully present in the DB. Unfortunately blast finds far fewer (60-80 %, depending on the sequence). Is this normal?
I would expect to find all 21-mers. Why only some?

If I blast without changing the word size parameter it finds all hits. But I would like to use this parameter because the blast is much faster and I don't need to take care of gaps etc., since I really need only 100 % 21-mer matches.

Does anyone have any idea what the problem could be?

Thanks in advance!
Stefanie

From chapmanb at 50mail.com Thu May 28 12:02:41 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 28 May 2009 08:02:41 -0400
Subject: [Biopython] blastall - strange results
In-Reply-To: <004f01c9df69$ac13d1f0$1022a8c0@ipkgatersleben.de>
References: <417AA1DD-2DE6-4EA4-BD1C-F6EDBCCC87CA@u.washington.edu> <004f01c9df69$ac13d1f0$1022a8c0@ipkgatersleben.de>
Message-ID: <20090528120241.GG94873@sobchak.mgh.harvard.edu>

Hi Stefanie;

> I get strange results with blast.
> My aim is to blast a query sequence, split into 21-mers, against a database.
[...]
> Is this normal? I would expect to find all 21-mers. Why only some?

BLAST isn't the best tool for this sort of problem. For exhaustively aligning short sequences to a database of target sequences, you should think about using a short read aligner. This is a nice summary of available aligners:

http://www.sanger.ac.uk/Users/lh3/NGSalign.shtml

Personally, I have had good experiences using Mosaik and Bowtie.

Hope this helps,
Brad

From biopythonlist at gmail.com Fri May 29 09:36:29 2009
From: biopythonlist at gmail.com (dr goettel)
Date: Fri, 29 May 2009 11:36:29 +0200
Subject: [Biopython] searching for a human chromosome position
Message-ID: <9b15d9f30905290236se3ff02flc9f441d7d46d3a2d@mail.gmail.com>

Hello, I am new to Biopython and after reading the documentation I'd like some guidance on solving one "simple" thing.
Given the number of a human chromosome, the position of a nucleotide and the nucleotide that should be at that position, I want to look up that position and determine whether there has been a mutation and whether that mutation produces an amino acid change or not.

I suppose that first of all I have to query the genome database(?) using the Entrez module and retrieve the sequence containing this base. Then I suppose I have to look at the translations of this sequence, work out the most probable reading frame, and then see whether there is an amino acid change or not.

Could anybody please send some clues for querying the database and finding the most probable reading frame for translation to protein (in case this is a good workflow for this particular problem)?

Thank you very much.

d

From biopythonlist at gmail.com Fri May 29 10:06:44 2009
From: biopythonlist at gmail.com (dr goettel)
Date: Fri, 29 May 2009 12:06:44 +0200
Subject: [Biopython] searching for a human chromosome position
Message-ID: <9b15d9f30905290306w58e227dew2cb164608acf4010@mail.gmail.com>

Hello, I am new to Biopython and after reading the documentation I'd like some guidance on solving one "simple" thing.

Given the number of a human chromosome, the position of a nucleotide and the nucleotide that should be at that position, I want to look up that position and determine whether there has been a mutation and whether that mutation produces an amino acid change or not.

I suppose that first of all I have to query the genome database(?) using the Entrez module and retrieve the sequence containing this base. Then I suppose I have to look at the translations of this sequence, work out the most probable reading frame, and then see whether there is an amino acid change or not.

Could anybody please send some clues for querying the database and finding the most probable reading frame for translation to protein (in case this is a good workflow for this particular problem)?

Thank you very much.
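The last step of the workflow described above (does a single-base change alter the amino acid?) can be sketched in plain Python. This is only an illustration under stated assumptions: it assumes you already have the coding sequence in the correct reading frame and a 0-based index into it, and it does not cover retrieving the sequence from Entrez or finding the frame. The function name substitution_effect and the example sequence are hypothetical; the codon table is the standard genetic code written out in the usual T, C, A, G order.

```python
from itertools import product

# Standard genetic code, codons enumerated as TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def substitution_effect(cds, index, new_base):
    """Return (reference_aa, mutated_aa) for a single-base substitution.

    cds is assumed to be an in-frame coding sequence (hypothetical input),
    index is a 0-based position within it.
    """
    start = index - index % 3            # first base of the affected codon
    codon = cds[start:start + 3].upper()
    offset = index - start               # position of the change inside the codon
    mutated = codon[:offset] + new_base.upper() + codon[offset + 1:]
    return CODON_TABLE[codon], CODON_TABLE[mutated]


# Toy sequence ATG GCC TGA: changing GCC to GTC replaces Ala with Val,
# while changing GCC to GCT leaves Ala unchanged.
missense = substitution_effect("ATGGCCTGA", 4, "T")
silent = substitution_effect("ATGGCCTGA", 5, "T")
```

Comparing the two returned letters tells you whether the change is silent or not; mapping a chromosome coordinate onto the coding sequence is the harder part and is left open here.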
d

From rjalves at igc.gulbenkian.pt Sun May 31 17:16:27 2009
From: rjalves at igc.gulbenkian.pt (Renato Alves)
Date: Sun, 31 May 2009 18:16:27 +0100
Subject: [Biopython] Entrez.esearch sort by publication date
Message-ID: <4A22BB6B.8010305@igc.gulbenkian.pt>

Hi everyone,

I've been using Entrez.esearch for a while without problems but today I wanted to have the results sorted by publication date.

According to the docs at:
http://www.ncbi.nlm.nih.gov/corehtml/query/static/esearch_help.html#Sort
I should use 'pub+date', however this doesn't work.

If I use 'author' and 'journal' I have no problems but if I use 'last+author' or 'pub+date' I get an empty reply:

>>> Entrez.esearch(db='pubmed', term=search, retmax=5, sort='pub+date').read()
\n\n\n'

Any suggestions on how to make this work?

Thanks,
Renato
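One possible explanation, offered as an assumption rather than a confirmed diagnosis: in the NCBI help page the "+" in "pub+date" may simply be the URL encoding of a space. If Bio.Entrez URL-encodes the keyword arguments before sending them (the standard library's urlencode behaves as shown below), a literal "+" would be escaped and the server would see a different sort key. The sketch uses only the standard library to show the two encodings; it does not call NCBI.

```python
from urllib.parse import urlencode

# A literal '+' is percent-encoded, so the server would receive "pub+date"
# with a real plus sign in it.
with_plus = urlencode({"db": "pubmed", "sort": "pub+date"})

# A space is encoded as '+', which matches the form shown in the NCBI docs.
with_space = urlencode({"db": "pubmed", "sort": "pub date"})
```

If that assumption holds, passing sort='pub date' (and 'last author') with a space instead of a plus might be worth trying.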