From mjldehoon at yahoo.com Mon Mar 1 04:40:25 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Mon, 1 Mar 2010 01:40:25 -0800 (PST)
Subject: [Biopython] Deprecating Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast functions?
In-Reply-To: <320fb6e01002271119me5db4ddud784bf1573fddfb@mail.gmail.com>
Message-ID: <180385.44498.qm@web62405.mail.re1.yahoo.com>

--- On Sat, 2/27/10, Peter wrote:
> I hadn't realised the NCBI had changed the XML. I
> wonder if multiple query PSI-BLAST output works
> nicely now?

The psiblast program as part of blast+ doesn't allow multiple queries,
so in that sense the problem has disappeared.

> If the existing NCBI XML parser can cover both variants,
> then it makes more sense to me to continue to use the
> existing read & parse functions under
> Bio.Blast.NCBIXML.

Well I was thinking that this is a good time to tackle all outstanding
Blast parser bugs & issues, which may break consistency with the
existing parsers. So I would prefer to copy the code in
Bio.Blast.NCBIXML, modify it as needed for blast+, and in some future
Biopython release (not anytime soon) to deprecate NCBIStandalone and
NCBIXML.

In any case, I think it is nicer to have a read() function directly
under Bio.Blast, so I don't have to remember and type in the names of
the submodules NCBIXML and NCBIStandalone (the name of the latter
doesn't make much sense anyway).

--Michiel.

From biopython at maubp.freeserve.co.uk Mon Mar 1 05:08:17 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 1 Mar 2010 10:08:17 +0000
Subject: [Biopython] Deprecating Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast functions?
In-Reply-To: <180385.44498.qm@web62405.mail.re1.yahoo.com>
References: <320fb6e01002271119me5db4ddud784bf1573fddfb@mail.gmail.com>
 <180385.44498.qm@web62405.mail.re1.yahoo.com>
Message-ID: <320fb6e01003010208n13d4de6au2299a2e3221fd2dc@mail.gmail.com>

On Mon, Mar 1, 2010 at 9:40 AM, Michiel de Hoon wrote:
> --- On Sat, 2/27/10, Peter wrote:
>> I hadn't realised the NCBI had changed the XML. I
>> wonder if multiple query PSI-BLAST output works
>> nicely now?
>
> The psiblast program as part of blast+ doesn't allow
> multiple queries, so in that sense the problem has
> disappeared.

That is a very practical solution to the problem. Chuckle.

>> If the existing NCBI XML parser can cover both variants,
>> then it makes more sense to me to continue to use the
>> existing read & parse functions under
>> Bio.Blast.NCBIXML.
>
> Well I was thinking that this is a good time to tackle all
> outstanding Blast parser bugs & issues, which may break
> consistency with the existing parsers. So I would prefer to
> copy the code in Bio.Blast.NCBIXML, modify it as needed
> for blast+, and in some future Biopython release (not anytime
> soon) to deprecate NCBIStandalone and NCBIXML.

Would you be thinking of having Bio.Blast.read() and parse() only
supporting NCBI BLAST+ XML files, or take a format argument like we
do for sequences and alignments? i.e. what about other formats like
the old NCBI XML (if it has changed), the assorted tabular BLAST
outputs, non-NCBI BLAST, and finally the still sometimes useful plain
text output (e.g. for use with third party tools like BLAT)?

> In any case, I think it is nicer to have a read() function directly
> under Bio.Blast, so I don't have to remember and type in the
> names of the submodules NCBIXML and NCBIStandalone
> (the name of the latter doesn't make much sense anyway).

The name of Bio.Blast.NCBIStandalone is a historical relic, and I
agree should be retired. Can we label the whole of this module as
obsolete?
As discussed earlier on this thread people are still using it
for calling BLAST so we won't deprecate it in the next release
(but likely the one after that).

Peter

From chapmanb at 50mail.com Mon Mar 1 08:09:33 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 1 Mar 2010 08:09:33 -0500
Subject: [Biopython] OT: biostar - bioinformatics questions and answers
In-Reply-To:
References:
Message-ID: <20100301130933.GA98028@sobchak.mgh.harvard.edu>

Istvan;

> Our bioinformatics question and answer site seems to be
> picking up steam lately:
>
> http://biostar.stackexchange.com/
>
> I dream of a bioinformatics forum where one can ask a generic
> bioinformatics question and get high quality responses in short order,
> but not just in one particular approach but everything that is
> applicable: perl, python, R, java, Galaxy etc

Thanks for setting this up and promoting it. I'm happy to hear you
have funding to continue it beyond the beta period.

I added links to BioStar and the main StackOverflow Biopython question
page from our discussion/mailing list page:

http://biopython.org/wiki/Mailing_lists

and am also redirecting the RSS feeds for new questions tagged with
'biopython' to the development list using Feed My Inbox
(http://www.feedmyinbox.com). So we will now get a daily e-mail
digest reminder to the list if any questions are posted.

Looking forward to using this. Thanks again,
Brad

From chapmanb at 50mail.com Mon Mar 1 08:19:40 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 1 Mar 2010 08:19:40 -0500
Subject: [Biopython] GFF parsing
In-Reply-To:
References: <20100226132834.GA66415@sobchak.mgh.harvard.edu>
Message-ID: <20100301131940.GB98028@sobchak.mgh.harvard.edu>

John;

[GFF parser testing]
> For my purposes the python csv module is doing the job. I would prefer
> to use a proper GFF parser but for the moment your parser is taking 100
> seconds to parse a 40Mb file and the csv reader is doing it in about 10
> seconds.
> Do you think this is reasonable or do you want to take a closer
> look?

The straight CSV module will always destroy a full featured parser,
but we may be able to get that 10x multiplier down. I'm happy to take
a look if you want to send a pointer to your GFF file; if it's not
publicly available feel free to send a representative subset of it to
me off list.

I'd be interested to hear your use case as well. Are there general
things you want to do for which you had to write code and a
supplemental GFF library would help? The trick with developing a GFF
parser is to provide useful high level functionality, since it is
relatively easy to split strings and write a one-off solution.

Thanks,
Brad

From istvan.albert at gmail.com Mon Mar 1 08:51:56 2010
From: istvan.albert at gmail.com (Istvan Albert)
Date: Mon, 1 Mar 2010 08:51:56 -0500
Subject: [Biopython] OT: biostar - bioinformatics questions and answers
In-Reply-To: <20100301130933.GA98028@sobchak.mgh.harvard.edu>
References: <20100301130933.GA98028@sobchak.mgh.harvard.edu>
Message-ID:

On Mon, Mar 1, 2010 at 8:09 AM, Brad Chapman wrote:

> and am also redirecting the RSS feeds for new questions tagged with
> 'biopython' to the development list using Feed My Inbox
> (http://www.feedmyinbox.com). So we will now get a daily e-mail
> digest reminder to the list if any questions are posted.

Thank you Brad,

I have been seriously considering finding interesting blog posts that
demonstrate the use of biopython and posting them on BioStar, the
advantage being that it is easier to interact, comment and evolve code
in the StackOverflow framework. Of course the original post owners
might not agree to that, so it would require contacting them.

On the other hand, if it is all right with everyone I would like to
take some examples inspired by the biopython cookbook and post those.
Very often I get questions such as how to do this or that in
biopython. For that type of question this platform is ideal.

best and thanks again,

Istvan

--
Istvan Albert
http://www.personal.psu.edu/iua1

From mjldehoon at yahoo.com Tue Mar 2 05:01:29 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 2 Mar 2010 02:01:29 -0800 (PST)
Subject: [Biopython] Deprecating Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast functions?
In-Reply-To: <320fb6e01003010208n13d4de6au2299a2e3221fd2dc@mail.gmail.com>
Message-ID: <293938.66771.qm@web62407.mail.re1.yahoo.com>

> Would you be thinking of having Bio.Blast.read() and
> parse() only supporting NCBI BLAST+ XML files, or take
> a format argument like we do for sequences and alignments?

I would support BLAST+ XML files only at first, and add parser
capability for other formats later if needed. If so, I would use a
format argument, same as how Bio.SeqIO works.

> The name of Bio.Blast.NCBIStandalone is a historical
> relic, and I agree should be retired. Can we label the
> whole of this module as obsolete?

This module also contains the parser for Blast text output, so I think
we cannot declare it obsolete just yet. However, if the XML output of
BLAST+ is complete, I don't see the need for such a plain-text Blast
parser any more.

--Michiel.

From biopython at maubp.freeserve.co.uk Tue Mar 2 05:14:25 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 2 Mar 2010 10:14:25 +0000
Subject: [Biopython] Deprecating Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast functions?
In-Reply-To: <293938.66771.qm@web62407.mail.re1.yahoo.com>
References: <320fb6e01003010208n13d4de6au2299a2e3221fd2dc@mail.gmail.com>
 <293938.66771.qm@web62407.mail.re1.yahoo.com>
Message-ID: <320fb6e01003020214k781eff0p77ba5002ca4786ab@mail.gmail.com>

On Tue, Mar 2, 2010 at 10:01 AM, Michiel de Hoon wrote:
>
>> Would you be thinking of having Bio.Blast.read() and
>> parse() only supporting NCBI BLAST+ XML files, or take
>> a format argument like we do for sequences and alignments?
>
> I would support BLAST+ XML files only at first, and add
> parser capability for other formats later if needed. If so,
> I would use a format argument, same as how Bio.SeqIO
> works.

Sounds sensible. Would you be using the existing Record
classes to hold the output?

>> The name of Bio.Blast.NCBIStandalone is a historical
>> relic, and I agree should be retired. Can we label the
>> whole of this module as obsolete?
>
> This module also contains the parser for Blast text output,
> so I think we cannot declare it obsolete just yet. However,
> if the XML output of BLAST+ is complete, I don't see the
> need for such a plain-text Blast parser any more.

We've been referring to the plain text BLAST parser as obsolete or
deprecated in the documentation for some time now (although there
isn't yet an actual deprecation warning issued). So I don't see a
problem with calling the whole of Bio.Blast.NCBIStandalone obsolete.

I don't think we can add deprecation warnings to the plain text parser
yet. While the XML format(s) are better for parsing, there are still
corner cases where the plain text has advantages (file size, BLAST
like output from non-NCBI tools like BLAT, NCBI psi-blast output
although they have apparently improved the XML here). We also should
worry about non-NCBI BLAST tools and their output.

Peter

From mjldehoon at yahoo.com Tue Mar 2 10:45:34 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 2 Mar 2010 07:45:34 -0800 (PST)
Subject: [Biopython] Deprecating Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast functions?
In-Reply-To: <320fb6e01003020214k781eff0p77ba5002ca4786ab@mail.gmail.com>
Message-ID: <12210.49894.qm@web62402.mail.re1.yahoo.com>

> Sounds sensible. Would you be using the existing Record
> classes to hold the output?

Probably not; I don't like the design of the existing Record classes
much (in particular, with the Record classes inheriting from Header,
DatabaseReport, and Parameters).
This is also a good opportunity to remove inconsistencies in attribute
names between the different parsers. The DTD of the blast XML output
can help us to decide on appropriate attribute names.

That said, I expect that from a user perspective there will be little
difference between an old-blast Record and a blast+ Record. For the
development, I was thinking of setting up the parser step by step, and
to discuss on the mailing list if any potential differences arise with
the existing parsers.

> We've been referring to the plain text BLAST parser as
> obsolete or deprecated in the documentation for some
> time now (although there isn't yet an actual deprecation
> warning issued). So I don't see a problem with calling
> the whole of Bio.Blast.NCBIStandalone obsolete.

I don't have any strong objections here, so as far as I am concerned
feel free to declare Bio.Blast.NCBIStandalone obsolete.

--Michiel

From biopython at maubp.freeserve.co.uk Wed Mar 3 08:15:32 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 3 Mar 2010 13:15:32 +0000
Subject: [Biopython] Deprecating Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast functions?
In-Reply-To: <12210.49894.qm@web62402.mail.re1.yahoo.com>
References: <320fb6e01003020214k781eff0p77ba5002ca4786ab@mail.gmail.com>
 <12210.49894.qm@web62402.mail.re1.yahoo.com>
Message-ID: <320fb6e01003030515o6e1b12bg768318f3d08fc1ef@mail.gmail.com>

On Tue, Mar 2, 2010 at 3:45 PM, Michiel de Hoon wrote:
>
>> Sounds sensible. Would you be using the existing Record
>> classes to hold the output?
>
> Probably not; I don't like the design of the existing Record
> classes much (in particular, with the Record classes inheriting
> from Header, DatabaseReport, and Parameters).

Yes, that is odd.

> This is also a good opportunity to remove inconsistencies
> in attribute names between the different parsers.
> The DTD of the blast XML output can help us to decide
> on appropriate attribute names.

Again, this is worthwhile (i.e.
fix Bug 2176).
http://bugzilla.open-bio.org/show_bug.cgi?id=2176

> That said, I expect that from a user perspective there will
> be little difference between an old-blast Record and a
> blast+ Record. For the development, I was thinking of
> setting up the parser step by step, and to discuss on the
> mailing list if any potential differences arise with the
> existing parsers.

Great :)

>> We've been referring to the plain text BLAST parser as
>> obsolete or deprecated in the documentation for some
>> time now (although there isn't yet an actual deprecation
>> warning issued). So I don't see a problem with calling
>> the whole of Bio.Blast.NCBIStandalone obsolete.
>
> I don't have any strong objections here, so as far as I am
> concerned feel free to declare Bio.Blast.NCBIStandalone
> obsolete.

Done in the repository.

Peter

From ap12 at sanger.ac.uk Thu Mar 4 08:31:33 2010
From: ap12 at sanger.ac.uk (Anne Pajon)
Date: Thu, 4 Mar 2010 13:31:33 +0000
Subject: [Biopython] Could Bio.SeqIO write EMBL file?
In-Reply-To: <320fb6e01001120433q72f577efg635ce232c666a46@mail.gmail.com>
References: <320fb6e01001060515y23804dd5xebc326647fa975d8@mail.gmail.com>
 <246181F8-03A2-49B2-960D-1FFBBC3E2865@sanger.ac.uk>
 <320fb6e01001070808k57311f43tb67c9cf916d27eab@mail.gmail.com>
 <320fb6e01001071014w50fa780ct2fdaccf2b9272cdc@mail.gmail.com>
 <320fb6e01001080448g24ff5308gb797f01cbab79196@mail.gmail.com>
 <320fb6e01001110822q60ac7103ndb3082cc3146735@mail.gmail.com>
 <320fb6e01001120227s7b9d9ac5pa8a11837e0ee620b@mail.gmail.com>
 <320fb6e01001120433q72f577efg635ce232c666a46@mail.gmail.com>
Message-ID: <7A5595B6-5507-4700-B3A1-56E70F1D61FC@sanger.ac.uk>

Dear Peter,

Sorry for taking so much time to come back to you.

I've managed to fork the biopython repository on github and I think I
am ready now to help writing improvements to Bio/SeqIO/InsdcIO.py by
adding missing fields on the ID line and adding a PR line. I may look
also at the SQ line.

Does this sound right to you? Thanks to let me know.

Kind regards,
Anne.

On 12 Jan 2010, at 12:33, Peter wrote:

> On Tue, Jan 12, 2010 at 10:27 AM, Peter wrote:
>> On Mon, Jan 11, 2010 at 5:32 PM, Anne Pajon wrote:
>>> Here is the diff between the EMBL output from Bio.SeqIO and the
>>> genbank output from Bio.SeqIO converted with the EMBOSS tool to
>>> an EMBL file:
>>>
>>> ...
>>>
>>> The main differences are on line breaks.
>>
>> I hadn't yet done a comparison against EMBOSS (what version do you
>> have), but yes, it looks like I am wrapping the feature tables
>> using a shorter line length - we should check that, and it would
>> be easy to adjust in Bio/SeqIO/InsdcIO.py
>
> The spec is pretty clear that the feature lines should be up to 80
> characters. The premature wrapping was because I had been
> testing length < 80 instead of <= 80, which is now fixed in git.
>
> Peter

--
Dr Anne Pajon - Pathogen Genomics, Team 81
Sanger Institute, Wellcome Trust Genome Campus, Hinxton
Cambridge CB10 1SA, United Kingdom
+44 (0)1223 494 798 (office) | +44 (0)7958 511 353 (mobile)

--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

From biopython at maubp.freeserve.co.uk Thu Mar 4 09:12:47 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 4 Mar 2010 14:12:47 +0000
Subject: [Biopython] Could Bio.SeqIO write EMBL file?
In-Reply-To: <7A5595B6-5507-4700-B3A1-56E70F1D61FC@sanger.ac.uk>
References: <246181F8-03A2-49B2-960D-1FFBBC3E2865@sanger.ac.uk>
 <320fb6e01001070808k57311f43tb67c9cf916d27eab@mail.gmail.com>
 <320fb6e01001071014w50fa780ct2fdaccf2b9272cdc@mail.gmail.com>
 <320fb6e01001080448g24ff5308gb797f01cbab79196@mail.gmail.com>
 <320fb6e01001110822q60ac7103ndb3082cc3146735@mail.gmail.com>
 <320fb6e01001120227s7b9d9ac5pa8a11837e0ee620b@mail.gmail.com>
 <320fb6e01001120433q72f577efg635ce232c666a46@mail.gmail.com>
 <7A5595B6-5507-4700-B3A1-56E70F1D61FC@sanger.ac.uk>
Message-ID: <320fb6e01003040612p43145476vefb0f8582e1d5759@mail.gmail.com>

On Thu, Mar 4, 2010 at 1:31 PM, Anne Pajon wrote:
>
> Dear Peter,
>
> Sorry for taking so much time to come back to you.
>
> I've managed to fork the biopython repository on github and I think I am
> ready now to help writing improvements to Bio/SeqIO/InsdcIO.py by adding
> missing fields on the ID line and adding a PR line. I may look also at the
> SQ line.
>
> Does this sound right to you? Thanks to let me know.
>
> Kind regards,
> Anne.

Hi Anne,

If you are happy working with git, then showing us fixes there is
great. Have a read of these pages before you get going - it should
help:
http://www.biopython.org/wiki/GitUsage

Otherwise patch files are OK - you can attach them to bugs on bugzilla
rather than on the mailing list.

Thanks.

Peter

From ap12 at sanger.ac.uk Thu Mar 4 09:26:37 2010
From: ap12 at sanger.ac.uk (Anne Pajon)
Date: Thu, 4 Mar 2010 14:26:37 +0000
Subject: [Biopython] Could Bio.SeqIO write EMBL file?
In-Reply-To: <320fb6e01003040612p43145476vefb0f8582e1d5759@mail.gmail.com>
References: <246181F8-03A2-49B2-960D-1FFBBC3E2865@sanger.ac.uk>
 <320fb6e01001070808k57311f43tb67c9cf916d27eab@mail.gmail.com>
 <320fb6e01001071014w50fa780ct2fdaccf2b9272cdc@mail.gmail.com>
 <320fb6e01001080448g24ff5308gb797f01cbab79196@mail.gmail.com>
 <320fb6e01001110822q60ac7103ndb3082cc3146735@mail.gmail.com>
 <320fb6e01001120227s7b9d9ac5pa8a11837e0ee620b@mail.gmail.com>
 <320fb6e01001120433q72f577efg635ce232c666a46@mail.gmail.com>
 <7A5595B6-5507-4700-B3A1-56E70F1D61FC@sanger.ac.uk>
 <320fb6e01003040612p43145476vefb0f8582e1d5759@mail.gmail.com>
Message-ID:

Hi Peter,

I'm happy to work with git. I've already read the biopython wiki page
on it so I'm hoping to do the right thing.

I'm going to commit and push the ID line fix as soon as I am happy
with it, and to see if I understood how it should be done. Looking
forward to get your feedback.

Thanks,
Anne.

On 4 Mar 2010, at 14:12, Peter wrote:

> On Thu, Mar 4, 2010 at 1:31 PM, Anne Pajon wrote:
>>
>> Dear Peter,
>>
>> Sorry for taking so much time to come back to you.
>>
>> I've managed to fork the biopython repository on github and I think
>> I am ready now to help writing improvements to Bio/SeqIO/InsdcIO.py
>> by adding missing fields on the ID line and adding a PR line. I may
>> look also at the SQ line.
>>
>> Does this sound right to you? Thanks to let me know.
>>
>> Kind regards,
>> Anne.
>
> Hi Anne,
>
> If you are happy working with git, then showing us fixes there is
> great.
> Have a read of these pages before you get going - it should help:
> http://www.biopython.org/wiki/GitUsage
>
> Otherwise patch files are OK - you can attach them to bugs on bugzilla
> rather than on the mailing list.
>
> Thanks.
>
> Peter

--
Dr Anne Pajon - Pathogen Genomics, Team 81
Sanger Institute, Wellcome Trust Genome Campus, Hinxton
Cambridge CB10 1SA, United Kingdom
+44 (0)1223 494 798 (office) | +44 (0)7958 511 353 (mobile)

From biopython at maubp.freeserve.co.uk Thu Mar 4 14:59:47 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 4 Mar 2010 19:59:47 +0000
Subject: [Biopython] Could Bio.SeqIO write EMBL file?
In-Reply-To:
References: <320fb6e01001071014w50fa780ct2fdaccf2b9272cdc@mail.gmail.com>
 <320fb6e01001080448g24ff5308gb797f01cbab79196@mail.gmail.com>
 <320fb6e01001110822q60ac7103ndb3082cc3146735@mail.gmail.com>
 <320fb6e01001120227s7b9d9ac5pa8a11837e0ee620b@mail.gmail.com>
 <320fb6e01001120433q72f577efg635ce232c666a46@mail.gmail.com>
 <7A5595B6-5507-4700-B3A1-56E70F1D61FC@sanger.ac.uk>
 <320fb6e01003040612p43145476vefb0f8582e1d5759@mail.gmail.com>
Message-ID: <320fb6e01003041159g3c57fd3eq1df1517faa1ec5e7@mail.gmail.com>

On Thu, Mar 4, 2010 at 2:26 PM, Anne Pajon wrote:
> Hi Peter,
>
> I'm happy to work with git. I've already read the biopython wiki page on it
> so I'm hoping to do the right thing.

OK - now you get to try doing a merge (grin), as I have committed the
SQ line change (with some minor changes, for example I changed your
variable names to keep the line length down).

> I'm going to commit and push the ID line fix as soon as I am happy with it,
> and to see if I understood how it should be done. Looking forward to get
> your feedback.

One minor issue is you accidentally checked in the BioSQL database
created by the unit tests. I've updated the .gitignore file to stop
this happening to someone else.

The EMBL data division stuff makes sense (I simply hadn't gotten round
to it when I was doing it for the GenBank output).

Some of your other changes need to be co-ordinated with the EMBL (and
GenBank) parser. See also Bug 2578,
http://bugzilla.open-bio.org/show_bug.cgi?id=2578

In the case of PR (project lines) I think we must be ignoring them at
the moment, but to match the GenBank parser the information should
be stored in the SeqRecord dbxrefs list not the annotations
dictionary.

Regards,

Peter

From biopython at maubp.freeserve.co.uk Thu Mar 4 15:04:55 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 4 Mar 2010 20:04:55 +0000
Subject: [Biopython] Could Bio.SeqIO write EMBL file?
In-Reply-To: <320fb6e01003041159g3c57fd3eq1df1517faa1ec5e7@mail.gmail.com>
References: <320fb6e01001080448g24ff5308gb797f01cbab79196@mail.gmail.com>
 <320fb6e01001110822q60ac7103ndb3082cc3146735@mail.gmail.com>
 <320fb6e01001120227s7b9d9ac5pa8a11837e0ee620b@mail.gmail.com>
 <320fb6e01001120433q72f577efg635ce232c666a46@mail.gmail.com>
 <7A5595B6-5507-4700-B3A1-56E70F1D61FC@sanger.ac.uk>
 <320fb6e01003040612p43145476vefb0f8582e1d5759@mail.gmail.com>
 <320fb6e01003041159g3c57fd3eq1df1517faa1ec5e7@mail.gmail.com>
Message-ID: <320fb6e01003041204m1de894dfic1518d7a1623b8a2@mail.gmail.com>

On Thu, Mar 4, 2010 at 7:59 PM, Peter wrote:
> On Thu, Mar 4, 2010 at 2:26 PM, Anne Pajon wrote:
>> Hi Peter,
>>
>> I'm happy to work with git. I've already read the biopython wiki page on it
>> so I'm hoping to do the right thing.
>
> OK - now you get to try doing a merge (grin), as I have committed the
> SQ line change (with some minor changes, for example I changed your
> variable names to keep the line length down).

I also noted in a comment that we should perhaps be writing out
GenBank and EMBL in lower case (as I noted a while ago on Bug
2999). It will also make counting the bases easier if we only need
to look at one case ;)

As an EMBL file user, does this seem like the right thing to do?
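
To illustrate the counting point, here is a minimal sketch, not the
code in Bio/SeqIO/InsdcIO.py: the helper name `sq_line` is hypothetical,
and the SQ header layout is assumed from typical EMBL flat files.
Folding the sequence to a single case first means only four letters
need to be looked up when counting the bases.

```python
from collections import Counter


def sq_line(sequence):
    """Build an EMBL-style SQ header line for a nucleotide sequence.

    Hypothetical helper for illustration only; the layout is assumed
    from typical EMBL flat files, not copied from Biopython.
    """
    # Fold to one case so that "A" and "a" are counted together.
    counts = Counter(sequence.lower())
    a, c, g, t = (counts[base] for base in "acgt")
    # Anything that is not a, c, g or t (e.g. n) is reported as "other".
    other = len(sequence) - (a + c + g + t)
    return "SQ   Sequence %i BP; %i A; %i C; %i G; %i T; %i other;" % (
        len(sequence), a, c, g, t, other)


print(sq_line("ACGTacgtNN"))
```

With mixed-case input the same counts come out as for an all lower case
or all upper case sequence, which is the convenience referred to above.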

Peter

From abumustafa3 at gmail.com Thu Mar 4 15:27:59 2010
From: abumustafa3 at gmail.com (Nizar Ghneim)
Date: Thu, 4 Mar 2010 14:27:59 -0600
Subject: [Biopython] Error with py2exe and Entrez functions
Message-ID:

Hello All,

I am writing a short script that others would like to use on their own
computers. I decided to use the py2exe tool to create an executable.
The script runs perfectly in Python, but in my exe file the first line
that accesses any Bio.Entrez function (such as esearch or efetch)
gives me the following error:

> File "Bio\Entrez\__init__.pyc", line 258, in read
> File "Bio\Entrez\Parser.pyc", line 108, in read
> File "Bio\Entrez\Parser.pyc", line 377, in externalEntityRefHandler
> RuntimeError: Unable to load DTD file eSearch_020511.dtd.
>
> Bio.Entrez uses NCBI's DTD files to parse XML files returned by NCBI
> Entrez.
> Though most of NCBI's DTD files are included in the Biopython distribution,
> sometimes you may find that a particular DTD file is missing. In such a
> case, you can download the DTD file from NCBI and install it manually.
>
> Usually, you can find missing DTD files at either
> http://www.ncbi.nlm.nih.gov/dtd/
> or
> http://eutils.ncbi.nlm.nih.gov/entrez/query/DTD/
> If you cannot find eSearch_020511.dtd there, you may also try to search
> for it with a search engine such as Google.
>
> Please save eSearch_020511.dtd in the directory
> C:\Python26\dist\library.zip\Bio\Entrez\DTDs
> in order for Bio.Entrez to find it.
> Alternatively, you can save eSearch_020511.dtd in the directory
> Bio/Entrez/DTDs in the Biopython distribution, and reinstall Biopython.
>
> Please also inform the Biopython developers by sending an email to
> biopython-dev at biopython.org to inform us about this missing DTD, so that
> we can include it with the next release of Biopython.
>

It seems to me that the py2exe compiler does not grab the necessary
DTD files. How can I solve this?

Thank you in advance,
Nizar Ghneim

From biopython at maubp.freeserve.co.uk Thu Mar 4 16:21:17 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 4 Mar 2010 21:21:17 +0000
Subject: [Biopython] Error with py2exe and Entrez functions
In-Reply-To:
References:
Message-ID: <320fb6e01003041321l23526902i55e60a1ef7a12cd3@mail.gmail.com>

On Thu, Mar 4, 2010 at 8:27 PM, Nizar Ghneim wrote:
> Hello All,
>
> I am writing a short script that others would like to use on their own
> computers. I decided to use the py2exe tool to create an executable.
> ...
> It seems to me that the py2exe compiler does not grab the necessary
> DTD files. How can I solve this?

That does sound like the problem. Have you searched the py2exe
documentation for how to specify extra files like this - other
tools must have similar needs.

Peter

From biopython at maubp.freeserve.co.uk Thu Mar 4 16:50:16 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 4 Mar 2010 21:50:16 +0000
Subject: [Biopython] Could Bio.SeqIO write EMBL file?
In-Reply-To:
References: <320fb6e01001071014w50fa780ct2fdaccf2b9272cdc@mail.gmail.com>
 <320fb6e01001080448g24ff5308gb797f01cbab79196@mail.gmail.com>
 <320fb6e01001110822q60ac7103ndb3082cc3146735@mail.gmail.com>
 <320fb6e01001120227s7b9d9ac5pa8a11837e0ee620b@mail.gmail.com>
 <320fb6e01001120433q72f577efg635ce232c666a46@mail.gmail.com>
 <7A5595B6-5507-4700-B3A1-56E70F1D61FC@sanger.ac.uk>
 <320fb6e01003040612p43145476vefb0f8582e1d5759@mail.gmail.com>
Message-ID: <320fb6e01003041350r127a463fga8e9a315d298a0bc@mail.gmail.com>

Hi Anne,

You had a comment in your change about the RA line, where you tried
to append the semi-colon on output. The reason this broke the unit
tests was that the EMBL parser was not removing the semi-colon.
I've fixed this now - thanks for flagging this issue.
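
The round trip behind this fix can be sketched as follows. This is an
illustration under assumptions, not the Biopython parser or writer:
`parse_ra` and `write_ra` are hypothetical names, and only a
single-line RA entry is handled. If the reader keeps the trailing
semi-colon and the writer appends one, the output gains a second
semi-colon; stripping it on input keeps the round trip stable.

```python
def parse_ra(line, strip_semicolon=True):
    """Return the author string from one EMBL RA line (hypothetical helper)."""
    if not line.startswith("RA"):
        raise ValueError("Not an RA line: %r" % line)
    # Drop the two-letter line code and the padding after it.
    authors = line[2:].strip()
    if strip_semicolon:
        # The terminating semi-colon is punctuation, not part of the data.
        authors = authors.rstrip(";")
    return authors


def write_ra(authors):
    """Write the author string back out, re-adding the terminator."""
    return "RA   %s;" % authors


original = "RA   Smith J., Jones K.;"

# Parser strips the semi-colon: output matches the input.
print(write_ra(parse_ra(original)))

# Parser keeps the semi-colon: the writer doubles it.
print(write_ra(parse_ra(original, strip_semicolon=False)))
```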

Peter

From ap12 at sanger.ac.uk Fri Mar 5 09:26:57 2010
From: ap12 at sanger.ac.uk (Anne Pajon)
Date: Fri, 5 Mar 2010 14:26:57 +0000
Subject: [Biopython] Could Bio.SeqIO write EMBL file?
In-Reply-To: <320fb6e01003041204m1de894dfic1518d7a1623b8a2@mail.gmail.com>
References: <320fb6e01001080448g24ff5308gb797f01cbab79196@mail.gmail.com>
 <320fb6e01001110822q60ac7103ndb3082cc3146735@mail.gmail.com>
 <320fb6e01001120227s7b9d9ac5pa8a11837e0ee620b@mail.gmail.com>
 <320fb6e01001120433q72f577efg635ce232c666a46@mail.gmail.com>
 <7A5595B6-5507-4700-B3A1-56E70F1D61FC@sanger.ac.uk>
 <320fb6e01003040612p43145476vefb0f8582e1d5759@mail.gmail.com>
 <320fb6e01003041159g3c57fd3eq1df1517faa1ec5e7@mail.gmail.com>
 <320fb6e01003041204m1de894dfic1518d7a1623b8a2@mail.gmail.com>
Message-ID: <1D4AAD34-592F-48D6-BCB1-79019DA5A5C9@sanger.ac.uk>

On 4 Mar 2010, at 20:04, Peter wrote:

> On Thu, Mar 4, 2010 at 7:59 PM, Peter wrote:
>> On Thu, Mar 4, 2010 at 2:26 PM, Anne Pajon wrote:
>>> Hi Peter,
>>>
>>> I'm happy to work with git. I've already read the biopython wiki
>>> page on it so I'm hoping to do the right thing.
>>
>> OK - now you get to try doing a merge (grin), as I have committed the
>> SQ line change (with some minor changes, for example I changed your
>> variable names to keep the line length down).
>
> I also noted in a comment that we should perhaps be writing out
> GenBank and EMBL in lower case (as I noted a while ago on Bug
> 2999). It will also make counting the bases easier if we only need
> to look at one case ;)
>
> As an EMBL file user, does this seem like the right thing to do?

I do not really mind one way or another. EMBOSS seems to write the
sequence all in upper case.

Anne.

>
> Peter

--
Dr Anne Pajon - Pathogen Genomics, Team 81
Sanger Institute, Wellcome Trust Genome Campus, Hinxton
Cambridge CB10 1SA, United Kingdom
+44 (0)1223 494 798 (office) | +44 (0)7958 511 353 (mobile)

From ap12 at sanger.ac.uk Fri Mar 5 09:28:44 2010
From: ap12 at sanger.ac.uk (Anne Pajon)
Date: Fri, 5 Mar 2010 14:28:44 +0000
Subject: [Biopython] Could Bio.SeqIO write EMBL file?
In-Reply-To: <320fb6e01003041350r127a463fga8e9a315d298a0bc@mail.gmail.com>
References: <320fb6e01001071014w50fa780ct2fdaccf2b9272cdc@mail.gmail.com>
 <320fb6e01001080448g24ff5308gb797f01cbab79196@mail.gmail.com>
 <320fb6e01001110822q60ac7103ndb3082cc3146735@mail.gmail.com>
 <320fb6e01001120227s7b9d9ac5pa8a11837e0ee620b@mail.gmail.com>
 <320fb6e01001120433q72f577efg635ce232c666a46@mail.gmail.com>
 <7A5595B6-5507-4700-B3A1-56E70F1D61FC@sanger.ac.uk>
 <320fb6e01003040612p43145476vefb0f8582e1d5759@mail.gmail.com>
 <320fb6e01003041350r127a463fga8e9a315d298a0bc@mail.gmail.com>
Message-ID:

Hi Peter,

I saw your fix. Maybe it would be better to have:

    self._write_multi_line("RA", "%s;" % ref.authors)

instead of

    self._write_multi_line("RA", ref.authors+";")

But it is a very minor detail. Thanks for fixing it.

Kind regards,
Anne.

On 4 Mar 2010, at 21:50, Peter wrote:

> Hi Anne,
>
> You had a comment in your change about the RA line, where you tried
> to append the semi-colon on output. The reason this broke the unit
> tests was that the EMBL parser was not removing the semi-colon.
> I've fixed this now - thanks for flagging this issue.
>
> Peter

--
Dr Anne Pajon - Pathogen Genomics, Team 81
Sanger Institute, Wellcome Trust Genome Campus, Hinxton
Cambridge CB10 1SA, United Kingdom
+44 (0)1223 494 798 (office) | +44 (0)7958 511 353 (mobile)

From ap12 at sanger.ac.uk Fri Mar 5 12:34:24 2010
From: ap12 at sanger.ac.uk (Anne Pajon)
Date: Fri, 5 Mar 2010 17:34:24 +0000
Subject: [Biopython] Could Bio.SeqIO write EMBL file?
In-Reply-To: <320fb6e01003050814nddcf6a2x3402c9d3f535eb10@mail.gmail.com>
References: <320fb6e01001120227s7b9d9ac5pa8a11837e0ee620b@mail.gmail.com>
 <320fb6e01001120433q72f577efg635ce232c666a46@mail.gmail.com>
 <7A5595B6-5507-4700-B3A1-56E70F1D61FC@sanger.ac.uk>
 <320fb6e01003040612p43145476vefb0f8582e1d5759@mail.gmail.com>
 <320fb6e01003041159g3c57fd3eq1df1517faa1ec5e7@mail.gmail.com>
 <3449050D-E60E-4FB5-AA82-E32A8F131DCA@sanger.ac.uk>
 <320fb6e01003050713w5e58073fydc2108af73d61dfb@mail.gmail.com>
 <0486B125-79A5-41BE-85E5-625903BFED04@sanger.ac.uk>
 <320fb6e01003050814nddcf6a2x3402c9d3f535eb10@mail.gmail.com>
Message-ID:

Hi Peter,

I've tested the PR line with dbxrefs and it works fine, thanks.

I've sent you a request for improving the writing of the references by
adding the RG line.

I've CC'd the list again... sorry for not having done so for two
replies.

Kind regards,
Anne.

On 5 Mar 2010, at 16:14, Peter wrote:

> On Fri, Mar 5, 2010 at 4:02 PM, Anne Pajon wrote:
>>
>> On 5 Mar 2010, at 15:13, Peter wrote:
>>
>>> On Fri, Mar 5, 2010 at 2:24 PM, Anne Pajon wrote:
>>>>>
>>>>> In the case of PR (project lines) I think we must be ignoring
>>>>> them at the moment, but to match the GenBank parser the
>>>>> information should be stored in the SeqRecord dbxrefs list
>>>>> not the annotations dictionary.
>>>> >>>> Would be great to have a place where to store the PR line. >>> >>> Perhaps I was unclear - we do have a place to store the PR line, the >>> SeqRecord's dbxrefs list (following how the GenBank parser stores >>> the project information). >> >> Sorry I did not understood that. Great if I could do it with >> dbxrefs. I'll >> try right now then. >> >>> >>> Getting the EMBL parser to do the same was trivial, although this >>> does make doing the output a tiny bit more complex. See github. >>> >> >> I will have a look. > > I meant I just did this and checked in the change to github ;) > > Thanks for the example - I'll take a look. > > Regarding the mailing list, you probably just clicked on "reply" > rather than "reply all" so it came to just me. > > Thanks, > > Peter -- Dr Anne Pajon - Pathogen Genomics, Team 81 Sanger Institute, Wellcome Trust Genome Campus, Hinxton Cambridge CB10 1SA, United Kingdom +44 (0)1223 494 798 (office) | +44 (0)7958 511 353 (mobile) -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From biopython at maubp.freeserve.co.uk Fri Mar 5 12:50:29 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 5 Mar 2010 17:50:29 +0000 Subject: [Biopython] Could Bio.SeqIO write EMBL file? 
In-Reply-To:
References: <7A5595B6-5507-4700-B3A1-56E70F1D61FC@sanger.ac.uk> <320fb6e01003040612p43145476vefb0f8582e1d5759@mail.gmail.com> <320fb6e01003041159g3c57fd3eq1df1517faa1ec5e7@mail.gmail.com> <3449050D-E60E-4FB5-AA82-E32A8F131DCA@sanger.ac.uk> <320fb6e01003050713w5e58073fydc2108af73d61dfb@mail.gmail.com> <0486B125-79A5-41BE-85E5-625903BFED04@sanger.ac.uk> <320fb6e01003050814nddcf6a2x3402c9d3f535eb10@mail.gmail.com>
Message-ID: <320fb6e01003050950l2fd94d4cx315ecc408ab2a577@mail.gmail.com>

On Fri, Mar 5, 2010 at 5:34 PM, Anne Pajon wrote:
> Hi Peter,
>
> I've tested the PR line with dbxrefs and it works fine, thanks.

Great.

> I've sent you a request for improving the writing of the references by
> adding the RG line.

I've merged that (using a git cherry-pick) and added support for
parsing the RG lines too.

I'm pleased you seem to be doing such a good job identifying these
little issues with the new EMBL code :)

Thank you,

Peter

From daniel at dim.fm.usp.br Fri Mar 5 16:35:57 2010
From: daniel at dim.fm.usp.br (Daniel Silvestre)
Date: Fri, 05 Mar 2010 18:35:57 -0300
Subject: [Biopython] SFF parser
Message-ID: <4B91793D.5060001@dim.fm.usp.br>

Hi biopythonists,

Does anyone have information about the status of the future SFF parser?

Att.
Daniel

--
+---------------------------------------+
Daniel de A. M. M. Silvestre
LIM01 - Laboratório de Informática Médica - HCFMUSP
Sala 1349 - Depto. de Patologia
Faculdade de Medicina
Universidade de São Paulo
Av. Dr. Arnaldo, 455        | e-mail: daniel at dim.fm.usp.br
Cerqueira César             | Tel: +55-11-3061-7381
01246-903 - São Paulo - SP  | Cel: +55-11-8042-9369
BRASIL                      | Skype: jarretinha
---------------------------------------------------------------------
This message may contain confidential information. If you are not the
addressee or authorized person to receive it for the addressee, you must
not use, copy, disclose or take any action based on this message or any
information herein. If you have received this message in error, please
advise the sender immediately by replying this e-mail message and delete
it. Thanks in advance for your cooperation.
----------------------------------------------------------------------
DIM Faculdade de Medicina USP
----------------------------------------------------------------------
-------------- next part --------------
A non-text attachment was scrubbed...
Name: daniel.vcf
Type: text/x-vcard
Size: 375 bytes
Desc: not available
URL:

From biopython at maubp.freeserve.co.uk Fri Mar 5 19:12:34 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 6 Mar 2010 00:12:34 +0000
Subject: [Biopython] SFF parser
In-Reply-To: <4B91793D.5060001@dim.fm.usp.br>
References: <4B91793D.5060001@dim.fm.usp.br>
Message-ID: <320fb6e01003051612p51c003f2g7ce154498c7fb97f@mail.gmail.com>

2010/3/5 Daniel Silvestre :
> Hi biopythonists,
>
> Does anyone have information about the status of the future SFF parser?
>
> Att.
> Daniel

Hi Daniel,

The code was recently merged into the master branch and will be
included with our next release (Biopython 1.54). There has been
discussion and some useful feedback already on the dev mailing
list - more would be great.

If you are happy to install from source, you can try it out now.
The latest version of the tutorial (with the source code, not yet
published) has a brief example in the cookbook chapter, but the
module docstrings are quite extensive.
Once installed, try: from Bio.SeqIO import SffIO help(SffIO) Or, just had a read of the code online on github or here: http://biopython.org/SRC/biopython/Bio/SeqIO/SffIO.py Peter P.S. The vcard attachment on your email (file daniel.vcf) seems to mean your emails get held in the moderation queue. From aloraine at gmail.com Sun Mar 7 09:55:19 2010 From: aloraine at gmail.com (Ann Loraine) Date: Sun, 7 Mar 2010 09:55:19 -0500 Subject: [Biopython] how to get the hit length from Bio.Blast.NCBIXML? Message-ID: <83722dde1003070655k7f328600w5320367a688112de@mail.gmail.com> Hello, I'm using Bio.Blast.NCBIXML to parse blastx results for an annotation project. I'm searching contig consensus sequences (assembled from 454 reads) against a protein database. Since these are assembled ESTs and may be incomplete, I need to know how much of a matched sequence was included in the alignment so that I can compute the percent coverage of both the hit and query. How do I retrieve the "hit length" from the objects returned by the parser? I couldn't find anything in the record and alignment objects that contains this information -- if it is not there, should it be added? The hit length appears in the XML: *cut* 3 lcl|3_0 Both_1_c25003 422 1 gnl|BL_ORD_ID|12864 gi|255551002|ref|XP_002516549.1| catalytic, putative [Ricinus communis] 12864 431 1 112.079 *paste* Best, Ann Loraine From p.j.a.cock at googlemail.com Sun Mar 7 11:06:22 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 7 Mar 2010 16:06:22 +0000 Subject: [Biopython] how to get the hit length from Bio.Blast.NCBIXML? In-Reply-To: <83722dde1003070655k7f328600w5320367a688112de@mail.gmail.com> References: <83722dde1003070655k7f328600w5320367a688112de@mail.gmail.com> Message-ID: <320fb6e01003070806o5918743bp72ecb0311cd6b2c2@mail.gmail.com> On Sun, Mar 7, 2010 at 2:55 PM, Ann Loraine wrote: > Hello, > > I'm using Bio.Blast.NCBIXML to parse blastx results for an annotation > project. 
> I'm searching contig consensus sequences (assembled from 454
> reads) against a protein database.
>
> Since these are assembled ESTs and may be incomplete, I need to know
> how much of a matched sequence was included in the alignment so that I
> can compute the percent coverage of both the hit and query.
>
> How do I retrieve the "hit length" from the objects returned by the parser?
>
> I couldn't find anything in the record and alignment objects that
> contains this information -- if it is not there, should it be added?

Hi Ann,

I think you are looking for the BLAST alignment's length attribute,
or perhaps the HSP's align_length attribute.

Peter

From anaryin at gmail.com Tue Mar 9 17:21:27 2010
From: anaryin at gmail.com (João Rodrigues)
Date: Tue, 9 Mar 2010 14:21:27 -0800
Subject: [Biopython] Bio.PDB Module - Atom.set_serial_number function - usage?
Message-ID:

Hello all,

Maybe I'm getting this wrong to begin with but bear with me. I'm trying to
renumber the atoms in a PDB file (atoms, not residues). I found a method
called get_serial_number that gives me back the atom number, and another
called set_serial_number that allows me to change this value. It works
wonders for the Structure object, but when I save it with the PDBIO module
to a PDB file, it resets the numbering.

I checked the code of PDBIO and apparently, it has hard-coded a resetting of
the atom number. My question is, what is this set_serial_number for then? Is
there a way for me to override this easily?

Regards,

João [...] Rodrigues
@ http://stanford.edu/~joaor/

From vincent at vincentdavis.net Tue Mar 9 22:46:56 2010
From: vincent at vincentdavis.net (Vincent Davis)
Date: Tue, 9 Mar 2010 20:46:56 -0700
Subject: [Biopython] matching sequences from fasta files
Message-ID: <77e831101003091946r630ad89bt2a9a462100db4cce@mail.gmail.com>

Let me first say that I am new to biopython and dna/fasta files.
I have been trying to use blastall to get the results I need but I am doing
most of my work in python so why use blastall if I can get the results using
python.

I need to check if any/all of the sequences from one fasta file are in another.
Looking through the docs I think I could do this.

I then want to find "close matches" and for me this means they differ by 1
snp and I need to know the location of this differing snp. How would I do
this?

*Vincent Davis
720-301-3003 *
vincent at vincentdavis.net
my blog | LinkedIn

From biopython at maubp.freeserve.co.uk Wed Mar 10 05:31:17 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 10 Mar 2010 10:31:17 +0000
Subject: [Biopython] matching sequences from fasta files
In-Reply-To: <77e831101003091946r630ad89bt2a9a462100db4cce@mail.gmail.com>
References: <77e831101003091946r630ad89bt2a9a462100db4cce@mail.gmail.com>
Message-ID: <320fb6e01003100231l2f8d16fl17ef4be94d4e093c@mail.gmail.com>

On Wed, Mar 10, 2010 at 3:46 AM, Vincent Davis wrote:
> Let me first say that I am new to biopython and dna/fasta files. I have been
> trying to use blastall to get the results I need but I am doing most of my
> work in python so why use blastall if I can get the results using python.
>
> I need to check if any/all of the sequences from one fasta file are in another.
> Looking through the docs I think I could do this.
>
> I then want to find "close matches" and for me this means they differ by 1
> snp and I need to know the location of this differing snp. How would I do
> this?

If you want "close matches", then using a command line tool like
BLAST (or FASTA, or needle etc) may be the fastest option. You can
call these tools from a Python script, and parse their output within
the script. (This is probably what you are already doing.)

If you want to, you can do pairwise sequence alignment from within
Biopython with the Bio.pairwise2 module (which uses C for speed).
This isn't covered in the tutorial, read the module documentation: http://www.biopython.org/DIST/docs/api/Bio.pairwise2-module.html For the special case of looking for perfect matches, you would be fine with just Python - depending on your data files, you may be able to match on the record identifiers or simply do string comparisons of the sequences. If you know in advance the pattern of SNPs, then you would be able to efficiently search for them using a regular expression. However, it sounds like you are doing SNP discovery. Here too there should be existing command line tools designed for just this task (and described in the literature). Regards, Peter From ivan at biodec.com Wed Mar 10 06:15:38 2010 From: ivan at biodec.com (Ivan Rossi) Date: Wed, 10 Mar 2010 12:15:38 +0100 (CET) Subject: [Biopython] matching sequences from fasta files In-Reply-To: <320fb6e01003100231l2f8d16fl17ef4be94d4e093c@mail.gmail.com> References: <77e831101003091946r630ad89bt2a9a462100db4cce@mail.gmail.com> <320fb6e01003100231l2f8d16fl17ef4be94d4e093c@mail.gmail.com> Message-ID: On Wed, 10 Mar 2010, Peter wrote: > For the special case of looking for perfect matches, you would be fine > with just Python - depending on your data files, you may be able to > match on the record identifiers Don't trust that. We have seen many many times the sequence change over time (in different releases of the databases) while keeping the same id. it is much more robust to compare SHA1 (or MD5) hashes of the sequence, or do string comparisons. > or simply do string comparisons of the sequences. This is OK. 
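For the exact-match case, the hash comparison suggested above could look
roughly like the sketch below. It is plain Python with only the standard
library; the names read_fasta, sequence_digest and shared_sequences are made
up for illustration and are not Biopython functions, and the tiny FASTA
reader assumes simple, well-formed input. Sequences are upper-cased before
hashing, which is an assumption about what "the same sequence" should mean.

```python
import hashlib


def read_fasta(lines):
    """Yield (identifier, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def sequence_digest(sequence):
    """Return a SHA1 hex digest of the upper-cased sequence."""
    return hashlib.sha1(sequence.upper().encode("ascii")).hexdigest()


def shared_sequences(query_lines, target_lines):
    """Map each query id to the target ids that carry an identical sequence."""
    targets_by_digest = {}
    for name, seq in read_fasta(target_lines):
        targets_by_digest.setdefault(sequence_digest(seq), []).append(name)
    matches = {}
    for name, seq in read_fasta(query_lines):
        key = sequence_digest(seq)
        if key in targets_by_digest:
            matches[name] = targets_by_digest[key]
    return matches
```

With real files you would pass open file handles instead of lists of lines.
Note that this only finds whole-record identity, not a probe that occurs
inside a longer sequence.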
--
Ivan Rossi, PhD - ivan AT biodec dot com OR ivan dot rossi3 AT unibo dot it
BioDec Srl, Via Calzavecchio 20/2, I-40033 Casalecchio di Reno (BO), Italy
Phone: (+39)-051-0548263 - Fax: (+39)-051-7459582 - http://www.biodec.com

From biopython at maubp.freeserve.co.uk Wed Mar 10 08:00:15 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 10 Mar 2010 13:00:15 +0000
Subject: [Biopython] matching sequences from fasta files
In-Reply-To:
References: <77e831101003091946r630ad89bt2a9a462100db4cce@mail.gmail.com> <320fb6e01003100231l2f8d16fl17ef4be94d4e093c@mail.gmail.com>
Message-ID: <320fb6e01003100500o78f6d622y49ee29e9f0deee01@mail.gmail.com>

On Wed, Mar 10, 2010 at 11:15 AM, Ivan Rossi wrote:
> On Wed, 10 Mar 2010, Peter wrote:
>
>> For the special case of looking for perfect matches, you would be fine
>> with just Python - depending on your data files, you may be able to
>> match on the record identifiers
>
> Don't trust that. We have seen many many times the sequence change
> over time (in different releases of the databases) while keeping the same id.

Yes, be cautious about blindly matching on just the identifier.
That's why I said "may" ;)

> it is much more robust to compare SHA1 (or MD5) hashes of the
> sequence, or do string comparisons.

MD5 is known to have collisions, but Sebastián Bassi added support
in Biopython for the GCG and SEGUID checksums, e.g. see:

from Bio.SeqUtils.CheckSum import seguid
help(seguid)

SHA1 is used by SEGUID internally, taking care of the case.
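To illustrate what "SHA1 internally, taking care of the case" means, here is
a standard-library sketch of a SEGUID-style checksum. The function name
seguid_like is hypothetical, and the recipe (upper-case the sequence, take
the SHA1 digest, base64-encode it, drop the trailing "=" padding) is my
understanding of the scheme rather than a copy of the Biopython code, so
check it against Bio.SeqUtils.CheckSum.seguid before relying on it.

```python
import base64
import hashlib


def seguid_like(sequence):
    """Return a case-insensitive, SEGUID-style checksum for a sequence string."""
    # Upper-casing first means "acgt" and "ACGT" get the same checksum.
    digest = hashlib.sha1(sequence.upper().encode("ascii")).digest()
    # Base64 of a 20-byte digest ends in one "=" pad character; strip it.
    return base64.b64encode(digest).decode("ascii").rstrip("=")
```

Two records can then be treated as carrying the same sequence when their
checksums are equal, whatever their identifiers say.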
Peter

From ismail.fsr at gmail.com Wed Mar 10 07:57:15 2010
From: ismail.fsr at gmail.com (ismail kaarouch)
Date: Wed, 10 Mar 2010 12:57:15 +0000
Subject: [Biopython] help
Message-ID: <2eb473a91003100457n2889f502x965af00cc259c63a@mail.gmail.com>

When I import the Translate class from the Bio module I get this message, and
I will be forced to reinstall the Biopython software, so I need your help.

IDLE 2.6.2
>>> from Bio import Translate

Warning (from warnings module):
  File "C:\Python26\lib\site-packages\Bio\Translate.py", line 23
    DeprecationWarning)
DeprecationWarning: Bio.Translate and Bio.Transcribe are deprecated, and
will be removed in a future release of Biopython. Please use the functions
or object methods defined in Bio.Seq instead (described in the tutorial). If
you want to continue to use this code, please get in contact with the
Biopython developers via the mailing lists to avoid its permanent removal
from Biopython.

From biopython at maubp.freeserve.co.uk Wed Mar 10 08:11:59 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 10 Mar 2010 13:11:59 +0000
Subject: [Biopython] help
In-Reply-To: <2eb473a91003100457n2889f502x965af00cc259c63a@mail.gmail.com>
References: <2eb473a91003100457n2889f502x965af00cc259c63a@mail.gmail.com>
Message-ID: <320fb6e01003100511j31db9127qe5fdfbe3a9725445@mail.gmail.com>

On Wed, Mar 10, 2010 at 12:57 PM, ismail kaarouch wrote:
> When I import the Translate class from the Bio module I get this message, and
> I will be forced to reinstall the Biopython software, so I need your help.
>
> IDLE 2.6.2
>>>> from Bio import Translate
>
> Warning (from warnings module):
>   File "C:\Python26\lib\site-packages\Bio\Translate.py", line 23
>     DeprecationWarning)
> DeprecationWarning: Bio.Translate and Bio.Transcribe are deprecated, and
> will be removed in a future release of Biopython. Please use the functions
> or object methods defined in Bio.Seq instead (described in the tutorial).
If > you want to continue to use this code, please get in contact with the > Biopython developers via the mailing lists to avoid its permanent removal > from Biopython. Hi Ismail, This warning is saying you shouldn't be using Bio.Translate (it will be removed from Biopython). Are you reading an out of date tutorial? The current tutorial is here: http://www.biopython.org/DIST/docs/tutorial/Tutorial.html http://www.biopython.org/DIST/docs/tutorial/Tutorial.pdf Peter From p.j.a.cock at googlemail.com Wed Mar 10 09:30:57 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 10 Mar 2010 14:30:57 +0000 Subject: [Biopython] Biopython & Google Summer of Code 2010 (GSoc) Message-ID: <320fb6e01003100630o6ec5f2aao5053c165f4504b89@mail.gmail.com> Dear Biopythoneers, The Open Bioinformatics Foundation (the Bio* umbrella organisation) is preparing an application for the 2010 Google Summer of Code (GSoC). http://code.google.com/soc/ If you are interested in becoming a mentor for a Biopython related project, you can join us in the application. If you are a student and are interested in a project (or would like to propose one), please take a look at these pages: http://www.open-bio.org/wiki/Google_Summer_of_Code http://biopython.org/wiki/Google_Summer_of_Code Regards, Brad & Peter From cjfields at illinois.edu Wed Mar 10 09:31:39 2010 From: cjfields at illinois.edu (Chris Fields) Date: Wed, 10 Mar 2010 08:31:39 -0600 Subject: [Biopython] matching sequences from fasta files In-Reply-To: References: <77e831101003091946r630ad89bt2a9a462100db4cce@mail.gmail.com> <320fb6e01003100231l2f8d16fl17ef4be94d4e093c@mail.gmail.com> Message-ID: On Mar 10, 2010, at 5:15 AM, Ivan Rossi wrote: > On Wed, 10 Mar 2010, Peter wrote: > >> For the special case of looking for perfect matches, you would be fine >> with just Python - depending on your data files, you may be able to >> match on the record identifiers > > Don't trust that. 
> We have seen many many times the sequence change over time (in different
> releases of the databases) while keeping the same id.

If the database has a proper versioning scheme or date information this should be detectable, otherwise I agree.

> it is much more robust to compare SHA1 (or MD5) hashes of the sequence, or do string comparisons.

Agreed there; it's probably the only foolproof way.

>> or simply do string comparisons of the sequences.
>
> This is OK.

chris (peeking in from bioperl ;)

From vincent at vincentdavis.net Wed Mar 10 10:19:00 2010
From: vincent at vincentdavis.net (Vincent Davis)
Date: Wed, 10 Mar 2010 08:19:00 -0700
Subject: [Biopython] matching sequences from fasta files
In-Reply-To: <320fb6e01003100500o78f6d622y49ee29e9f0deee01@mail.gmail.com>
References: <77e831101003091946r630ad89bt2a9a462100db4cce@mail.gmail.com> <320fb6e01003100231l2f8d16fl17ef4be94d4e093c@mail.gmail.com> <320fb6e01003100500o78f6d622y49ee29e9f0deee01@mail.gmail.com>
Message-ID: <77e831101003100719p1ecd6ea1td784c90acf20d0cd@mail.gmail.com>

I am considering just using Python and regular expressions. Blast is
great but I don't seem to be able to easily filter it to get only close
matches that differ at 1 snp.

I have a custom microarray and a list of the sequences it will bind. I need
to test if they are in the genome of toxoplasma gondii (just yes or no) and
if there are close matches (differ at 1 snp) and where the diff is in the
sequence.

So from reading the responses I should consider Python's re module, or look
more into FASTA or needle to see if I can get my version of a close match
from them. Is this right? Like I said I am very new to this, just got called
in to get this project done.
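For what it is worth, the "yes/no, or one base different and where" question
can be written down directly in plain Python. The sketch below is only an
illustration of the idea: find_near_matches is a made-up name, it slides the
probe along one strand of the target, counts substitutions only (no
insertions or deletions, no reverse complement), and it is far too slow for
a whole genome compared with BLAST or a short-read mapper.

```python
def find_near_matches(probe, target, max_mismatches=1):
    """Yield (start, mismatch_offsets) for each window of target that
    matches probe with at most max_mismatches substitutions.

    start is the 0-based position in target; mismatch_offsets lists the
    0-based positions within the probe that differ (empty for exact hits).
    """
    probe = probe.upper()
    target = target.upper()
    width = len(probe)
    if width == 0:
        return
    for start in range(len(target) - width + 1):
        offsets = []
        for i in range(width):
            if target[start + i] != probe[i]:
                offsets.append(i)
                if len(offsets) > max_mismatches:
                    break  # too many differences, give up on this window
        else:
            # The inner loop finished without breaking: acceptable window.
            yield start, offsets
```

For example, list(find_near_matches("ACGT", "TTACGTTTACCTAA")) gives one
exact hit at position 2 and one hit at position 8 that differs at probe
offset 2.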
*Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn On Wed, Mar 10, 2010 at 6:00 AM, Peter wrote: > On Wed, Mar 10, 2010 at 11:15 AM, Ivan Rossi wrote: > > On Wed, 10 Mar 2010, Peter wrote: > > > >> For the special case of looking for perfect matches, you would be fine > >> with just Python - depending on your data files, you may be able to > >> match on the record identifiers > > > > Don't trust that. We have seen many many times the sequence change > > over time (in different releases of the databases) while keeping the same > id. > > Yes, be cautious about blindly matching on just the identifier. > That's why I said "may" ;) > > > it is much more robust to compare SHA1 (or MD5) hashes of the > > sequence, or do string comparisons. > > MD5 is known to have collisions, but Sebasti?n Bassi added support > in Biopython for the GCG and SEGUID checksums, e.g. see: > > from Bio.SeqUtils.CheckSum import seguid > help(seguid) > > SHA1 is used by SEGUID internally, taking care of the case. > > Peter > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From biopython at maubp.freeserve.co.uk Wed Mar 10 11:29:17 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 10 Mar 2010 16:29:17 +0000 Subject: [Biopython] matching sequences from fasta files In-Reply-To: <77e831101003100719p1ecd6ea1td784c90acf20d0cd@mail.gmail.com> References: <77e831101003091946r630ad89bt2a9a462100db4cce@mail.gmail.com> <320fb6e01003100231l2f8d16fl17ef4be94d4e093c@mail.gmail.com> <320fb6e01003100500o78f6d622y49ee29e9f0deee01@mail.gmail.com> <77e831101003100719p1ecd6ea1td784c90acf20d0cd@mail.gmail.com> Message-ID: <320fb6e01003100829v764fc89am7350f40cc10b6936@mail.gmail.com> On Wed, Mar 10, 2010 at 3:19 PM, Vincent Davis wrote: > I am considering just using just python and regular expression. 
Blast is > great but I don't seem to be able to easily filter it to get only close > matched that differ at 1 snp. > I have a custom microarray and a list of the sequences it will bind. I need > to test if they are in the genome of toxoplasma gondii (just yes or no) and > if there are close matches (differ at 1 snp) and where the diff is in the > sequence. > > So from reading the responses I should consider python.re. or look more into > FASTA or needle. to see if i can get my version of a close match from them. > Is this right? Like I said I am very new to this, just got called in to get > this project done. Using BLAST / FASTA / needle / any pairwise alignment is going to boil down running the tool and parsing to filter out what you want. I don't think any of these general purpose tools allow for a "single base pair difference" threshold. This approach should work though. If you want to allow a single mis-match anywhere in the sequence, I'm not sure regular expressions are ideal either. If you wanted to look for matches with a single mis-match at a particular point (i.e. a know SNP) then a regular expression would work fine. However, you might have more success with software designed for second generation sequencing - there are certainly similarities to mapping short reads (e.g. Solexa/Illumina data) to a reference genome. You might also be able to use software designed to look for primer matches (again, these are short sequences). Just some ideas... Peter From lpritc at scri.ac.uk Wed Mar 10 10:53:45 2010 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Wed, 10 Mar 2010 15:53:45 +0000 Subject: [Biopython] matching sequences from fasta files In-Reply-To: <77e831101003091946r630ad89bt2a9a462100db4cce@mail.gmail.com> Message-ID: Hi, On 10/03/2010 Wednesday, March 10, 03:46, "Vincent Davis" wrote: > I need to check if any/all the sequence from one fasta file are in another. > Looking through the docs I think I could do this. 
As others have pointed out, a simple string comparison will do this.

> I then want to find "close matches" and for me this means they differ by 1
> snp and I need to know the location of this differing snp. How would I do
> this?

There are many ways in which this *could* be done. You probably want one
that is quite quick, though.

If I never needed to do this again, I would probably run BLAST or FASTA (or
my favourite search algorithm, running ungapped) using one set of sequences
as a query, and the other as the target database, using the program
parameters to report only one match each time. I'd then use Python to
parse the results, throwing away all those matches where

i) if the number of aligned bases is the same as the number of bases in the
query: the number of match identities differs from the number of aligned
bases by more than one
ii) if the number of aligned bases differs from the number of bases in the
query by exactly one: the number of match identities differs from the number
of aligned bases
iii) the number of aligned bases differs from the number of bases in the
query by two or more

The remainder should be your set of (almost) full-length 1/0 SNP matches,
and there should be enough data in your search program output to identify
the location of the SNP.

I think it would be faster to use something off-the-shelf like BLAST and
parse the output, than to write something to do the search. It will
probably work quicker, too.

Lots of ways to do this repeatably, including writing a generator function.

I hope this is useful,

L.

--
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405

______________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.
The Scottish Crop Research Institute is a charitable company limited by guarantee.
Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). ______________________________________________________ From dalloliogm at gmail.com Wed Mar 10 12:27:50 2010 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Wed, 10 Mar 2010 18:27:50 +0100 Subject: [Biopython] matching sequences from fasta files In-Reply-To: <77e831101003100719p1ecd6ea1td784c90acf20d0cd@mail.gmail.com> References: <77e831101003091946r630ad89bt2a9a462100db4cce@mail.gmail.com> <320fb6e01003100231l2f8d16fl17ef4be94d4e093c@mail.gmail.com> <320fb6e01003100500o78f6d622y49ee29e9f0deee01@mail.gmail.com> <77e831101003100719p1ecd6ea1td784c90acf20d0cd@mail.gmail.com> Message-ID: <5aa3b3571003100927l382ec1c1w3dd81a61372a7660@mail.gmail.com> On Wed, Mar 10, 2010 at 4:19 PM, Vincent Davis wrote: > I am considering just using just python and regular expression. Blast is > great but I don't seem to be able to easily filter it to get only close > matched that differ at 1 snp. 
I am not sure I followed all the discussion in this topic, but if you to find sequences that differ for one or two positions and you don't need to do it in any explicit biological context, you may look for algorithms that do fuzzy matching like agrep. One example may be this module: - http://www.personal.psu.edu/iua1/libs/apse.html which as you can read is outdated and probably won't work properly, but it is based on a C library which may have been implemented in other python modules. I would look for this and also do a google/yahoo/anyother search for 'string fuzzy matching python' or similar, I am sure you can find a lot of literature and modules about that. If you are comfortable with the unix shell, you may be probably be able to implement all your pipeline with some emboss tool to read the sequences and agrep for the matching. Anyway, I didn't understand your use case very well, and I am sure that if you look better on the Internet you can find some tool that does this already without having to write a new script and test it. If you do look for that it would be better, for you and for the people who will read your papers. -- Giovanni Dall'Olio, phd student Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain) My blog on bioinformatics: http://bioinfoblog.it From vincent at vincentdavis.net Wed Mar 10 13:10:20 2010 From: vincent at vincentdavis.net (Vincent Davis) Date: Wed, 10 Mar 2010 11:10:20 -0700 Subject: [Biopython] matching sequences from fasta files In-Reply-To: References: <77e831101003091946r630ad89bt2a9a462100db4cce@mail.gmail.com> Message-ID: <77e831101003101010i1e00c805jdf2269a9b86a9cf@mail.gmail.com> @Leighton "If I never needed to do this again, I would probably run BLAST or FASTA (or my favourite search algorithm, running ungapped) using one set of sequences as a query, and the other as the target database, using the program parameters to report only one match each time. 
I'd then use Python to parse the results, throwing away all those matches where" I don't have a favorite, I have only tried BLAST :) Is there an example of how to interface between python and BLAST. I have no idea where to start. I have never done anything similar. @ Leighton I think I will take your approach. Thanks for the input. *Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn On Wed, Mar 10, 2010 at 8:53 AM, Leighton Pritchard wrote: > Hi, > > On 10/03/2010 Wednesday, March 10, 03:46, "Vincent Davis" > wrote: > > > I need to check if any/all the sequence from one fasta file are in > another. > > Looking through the docs I think I could do this. > > As others have pointed out, a simple string comparison will do this. > > > I then what to find "close matches" and for me this means they differ by > 1 > > snp and I need to know the location of this differing snp. How would I do > > this? > > There are many ways in which this *could* be done. You probably want one > that is quite quick, though > > If I never needed to do this again, I would probably run BLAST or FASTA (or > my favourite search algorithm, running ungapped) using one set of sequences > as a query, and the other as the target database, using the program > parameters to report only one match each time. 
> I'd then use Python to
> parse the results, throwing away all those matches where
>
> i) if the number of aligned bases is the same as the number of bases in the
> query: the number of match identities differs from the number of aligned
> bases by more than one
> ii) if the number of aligned bases differs from the number of bases in the
> query by exactly one: the number of match identities differs from the
> number of aligned bases
> iii) the number of aligned bases differs from the number of bases in the
> query by two or more
>
> The remainder should be your set of (almost) full-length 1/0 SNP matches,
> and there should be enough data in your search program output to identify
> the location of the SNP.
>
> I think it would be faster to use something off-the-shelf like BLAST and
> parse the output, than to write something to do the search. It will
> probably work quicker, too.
>
> Lots of ways to do this repeatably, including writing a generator function.
>
> I hope this is useful,
>
> L.
>

From subhodeep.moitra at gmail.com Wed Mar 10 13:51:08 2010
From: subhodeep.moitra at gmail.com (subhodeep moitra)
Date: Wed, 10 Mar 2010 13:51:08 -0500
Subject: [Biopython] BioPython GSOC 2010
Message-ID: <6a2880081003101051h6bb3e732s5176c8c3c6a23c19@mail.gmail.com>

Hi

I am interested in applying for GSOC 2010.

Particularly liked the R and Python integration proposal. There are a lot of
other cool R packages too, such as Bio3d that one can think of.

Do you guys have an IRC channel?

Thanks
Subho

--
Subhodeep Moitra
First Year, Masters Student
Language Technologies Institute
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA, USA

From biopython at maubp.freeserve.co.uk Wed Mar 10 16:56:48 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 10 Mar 2010 21:56:48 +0000
Subject: [Biopython] matching sequences from fasta files
In-Reply-To: <77e831101003101010i1e00c805jdf2269a9b86a9cf@mail.gmail.com>
References: <77e831101003091946r630ad89bt2a9a462100db4cce@mail.gmail.com> <77e831101003101010i1e00c805jdf2269a9b86a9cf@mail.gmail.com>
Message-ID: <320fb6e01003101356m68bd6287k50145aeaf7cb624b@mail.gmail.com>

On Wed, Mar 10, 2010 at 6:10 PM, Vincent Davis wrote:
> I don't have a favorite, I have only tried BLAST :)
> Is there an example of how to interface between python and
> BLAST. I have no idea where to start.
I have never done > anything similar. There are examples of how to call BLAST and parse its (XML) output with Biopython in our tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf Peter P.S. I am reminded of the old saying, "When all you have is a hammer, everything looks like a nail." (by which I mean even if it is not the best tool for the job, you could do it with BLAST). From biopython at maubp.freeserve.co.uk Wed Mar 10 16:59:19 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 10 Mar 2010 21:59:19 +0000 Subject: [Biopython] BioPython GSOC 2010 In-Reply-To: <6a2880081003101051h6bb3e732s5176c8c3c6a23c19@mail.gmail.com> References: <6a2880081003101051h6bb3e732s5176c8c3c6a23c19@mail.gmail.com> Message-ID: <320fb6e01003101359n7fd4883fhd2ee2ec3a5b9d9d0@mail.gmail.com> On Wed, Mar 10, 2010 at 6:51 PM, subhodeep moitra wrote: > Hi > > I am interested in applying for GSOC 2010. > > Particularly liked the R and Python integration proposal. There are lot of > other cool R packages too, such as Bio3d that one can think of. > > Do you guys have an IRC channel ? No - one reason is the Biopython developers cover several timezones, so email is generally more useful. Brad is also in the USA, and he is the Biopython person to talk to about this suggested GSoC project. Peter From vincent at vincentdavis.net Wed Mar 10 19:47:49 2010 From: vincent at vincentdavis.net (Vincent Davis) Date: Wed, 10 Mar 2010 17:47:49 -0700 Subject: [Biopython] matching sequences from fasta files In-Reply-To: <320fb6e01003101356m68bd6287k50145aeaf7cb624b@mail.gmail.com> References: <77e831101003091946r630ad89bt2a9a462100db4cce@mail.gmail.com> <77e831101003101010i1e00c805jdf2269a9b86a9cf@mail.gmail.com> <320fb6e01003101356m68bd6287k50145aeaf7cb624b@mail.gmail.com> Message-ID: <77e831101003101647p4aac93eqd5ac464fb0f9acc6@mail.gmail.com> So I had an idea and wanted to get some feedback. 
I could make all possible single position mismatches for the sequences. I have 230,000 now and the would give me 17,250,000 (3 * 25 * 230,000). Then use BLAST to look for perfect matches. I would probably do this incrementally maybe even just blast for each sequence. The advantage I see in this is that BLAST can run multi core and I am running it on an 8core with 48gb of memory So it seems that this would be the fastest way to do this and very straight forward as there is very little parsing. There is either a match or not. I am purely guessing that generating the list if faster than parsing the results. *Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn On Wed, Mar 10, 2010 at 2:56 PM, Peter wrote: > On Wed, Mar 10, 2010 at 6:10 PM, Vincent Davis > wrote: > > I don't have a favorite, I have only tried BLAST :) > > Is there an example of how to interface between python and > > BLAST. I have no idea where to start. I have never done > > anything similar. > > There are examples of how to call BLAST and parse its > (XML) output with Biopython in our tutorial: > http://biopython.org/DIST/docs/tutorial/Tutorial.html > http://biopython.org/DIST/docs/tutorial/Tutorial.pdf > > Peter > > P.S. I am reminded of the old saying, "When all you have is > a hammer, everything looks like a nail." (by which I mean > even if it is not the best tool for the job, you could do it with > BLAST). > From mjldehoon at yahoo.com Wed Mar 10 20:19:09 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Wed, 10 Mar 2010 17:19:09 -0800 (PST) Subject: [Biopython] matching sequences from fasta files In-Reply-To: Message-ID: <224464.63537.qm@web62408.mail.re1.yahoo.com> > On 10/03/2010 Wednesday, March 10, 03:46, "Vincent Davis" > wrote: > > I then what to find "close matches" and for me this > > means they differ by 1 snp and I need to know the > > location of this differing snp. How would I do this? > You could use nexalign for that. 
http://genome.gsc.riken.jp/osc/english/dataresource/ --Michiel. From lpritc at scri.ac.uk Thu Mar 11 03:35:35 2010 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Thu, 11 Mar 2010 08:35:35 +0000 Subject: [Biopython] matching sequences from fasta files In-Reply-To: <77e831101003101010i1e00c805jdf2269a9b86a9cf@mail.gmail.com> Message-ID: Hi, On 10/03/2010 Wednesday, March 10, 18:10, "Vincent Davis" wrote: > I don't have a favorite, I have only tried BLAST :) > Is there an example of how to interface between python and BLAST. I have no > idea where to start. I have never done anything similar. For a one-off, I'd run BLAST from the command-line, and use Python to parse the results. http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#htoc82 Is the tutorial page that will be most help there, I think. > @ Leighton > I think I will take your approach. Thanks for the input. As with anything I suggest: treat with caution, and check for sanity at each step ;) L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405
From lpritc at scri.ac.uk Thu Mar 11 04:00:37 2010 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Thu, 11 Mar 2010 09:00:37 +0000 Subject: [Biopython] matching sequences from fasta files In-Reply-To: <77e831101003101647p4aac93eqd5ac464fb0f9acc6@mail.gmail.com> Message-ID: On 11/03/2010 Thursday, March 11, 00:47, "Vincent Davis" wrote: > So I had an idea and wanted to get some feedback. > I could make all possible single position mismatches for the sequences. I > have 230,000 now and the would give me 17,250,000 (3 * 25 * 230,000). Then > use BLAST to look for perfect matches. That doesn't sound very elegant (or like a good solution) to me, but if you wanted to do that you wouldn't necessarily need Python, except perhaps to generate all possible mismatches. You can restrict BLAST output to the best match, and match identities to 100% with the option (in BLAST+) -word_size 25 Which restricts BLAST to finding seed words of the same length (25) as your oligos. This would also speed up BLAST. You might also consider exploring other output formats, so you could process tabular output from the command line, for instance.
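To illustrate the tabular-output route mentioned above, here is a minimal sketch of filtering BLAST+ tabular output (-outfmt 6, default columns) in Python for full-length, ungapped hits with exactly one mismatch. The function name, the probe/reference identifiers and the sample rows are made up for illustration, and the fixed query length of 25 is an assumption taken from this thread. Note that the default tabular columns do not say where the mismatch sits, so locating the SNP would still need the alignment itself.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def one_mismatch_hits(handle, query_length=25):
    """Yield (query, subject) for ungapped full-length hits with one mismatch."""
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and the comment lines written by -outfmt 7.
        if not row or row[0].startswith("#"):
            continue
        hit = dict(zip(FIELDS, row))
        if (int(hit["length"]) == query_length
                and int(hit["gapopen"]) == 0
                and int(hit["mismatch"]) == 1):
            yield hit["qseqid"], hit["sseqid"]


# Hypothetical sample rows standing in for a real BLAST output file.
example = io.StringIO(
    "probe1\tref7\t96.00\t25\t1\t0\t1\t25\t101\t125\t2e-06\t41.0\n"
    "probe2\tref9\t100.00\t25\t0\t0\t1\t25\t11\t35\t1e-07\t46.4\n"
    "probe3\tref2\t95.83\t24\t1\t0\t1\t24\t5\t28\t8e-06\t39.2\n"
)
hits = list(one_mismatch_hits(example))
print(hits)
```

With a real run, the StringIO object would be replaced by an open file handle on the BLAST output.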
However, given the size of your data set, and the sizes of your sequences (neither of which were stated in the OP), I'd be inclined to bypass this altogether, and instead use one of the short-read sequence alignment packages such as SOAP or PASS, to see if it can be applied to your problem. Michiel's suggestion of NEXALIGN might be a good one - I've never used it, so can't say much about it. > I would probably do this > incrementally maybe even just blast for each sequence. The advantage I see > in this is that BLAST can run multi core and I am running it on an 8core > with 48gb of memory So it seems that this would be the fastest way to do > this and very straight forward as there is very little parsing. If you BLASTed each of 17m sequences individually, you would have to parse 17m output files. That sounds like a *lot* of parsing and file IO to me. ;) > There is > either a match or not. I am purely guessing that generating the list if > faster than parsing the results. You could try timing it with 10, 100 and 1000 sequences and see if you notice a trend. With your sequence set, I wouldn't bother - I'd jump straight to the next-gen sequence aligners. L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405
From biopython at maubp.freeserve.co.uk Thu Mar 11 06:06:24 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 11 Mar 2010 11:06:24 +0000 Subject: [Biopython] matching sequences from fasta files In-Reply-To: <77e831101003101647p4aac93eqd5ac464fb0f9acc6@mail.gmail.com> References: <77e831101003091946r630ad89bt2a9a462100db4cce@mail.gmail.com> <77e831101003101010i1e00c805jdf2269a9b86a9cf@mail.gmail.com> <320fb6e01003101356m68bd6287k50145aeaf7cb624b@mail.gmail.com> <77e831101003101647p4aac93eqd5ac464fb0f9acc6@mail.gmail.com> Message-ID: <320fb6e01003110306u1d3b8763j573fbb8200f9eeaf@mail.gmail.com> On Thu, Mar 11, 2010 at 12:47 AM, Vincent Davis wrote: > So I had an idea and wanted to get some feedback. > I could make all possible single position mismatches for the sequences. I > have 230,000 now and the would give me 17,250,000 (3 * 25 * 230,000). Then > use BLAST to look for perfect matches. I would probably do this > incrementally maybe even just blast for each sequence.
The advantage I see > in this is that BLAST can run multi core and I am running it on an 8core > with 48gb of memory So it seems that this would be the fastest way to do > this and very straight forward as there is very little parsing. There is > either a match or not. I am purely guessing that generating the list if > faster than parsing the results. The strengths of BLAST are in fast fuzzy matching. My instinct is that it would be silly to take your 230,000 queries, generate an extra 17,250,000 queries, and then run BLAST against your (organism specific?) database. Just run BLAST on your queries with some reasonably strict match parameters, then post-filter for your single base change. Now, if you really want to go for the brute force approach of looking for the perfect matches, what you could do is for each query of length 25, generate 25 simple regular expressions (e.g. using the "any letter" wild card in each position). You can do the regular expression matching within Python, or even with a command line tool like EMBOSS dreg. http://emboss.sourceforge.net/apps/release/6.2/emboss/apps/dreg.html Speaking of the EMBOSS tools, their fuzzy nucleotide search tool fuzznuc might be useful (you can specify the patterns using the IUPAC codes rather than regular expressions): http://emboss.sourceforge.net/apps/release/6.2/emboss/apps/fuzznuc.html As far as I know, EMBOSS doesn't have a tool/option for fuzzy matching where you can specify an allowed number of mismatches - unless one of the primer/vector tools can be used in this way? I'd suggest using primersearch but I think that only takes pairs of primers (not single probes). There is going to be more than one way to solve your problem. This will be a useful learning process for you.
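The wildcard-per-position idea above could be sketched roughly as follows, using only the standard library re module. The function names and the short example sequences are made up for illustration; a lookahead is used so that overlapping occurrences in the target are not skipped, and exact matches are filtered out because the wildcard also matches the original letter. This is a sketch of the approach, not a tuned solution for 230,000 queries.

```python
import re


def one_mismatch_patterns(query):
    """Return one compiled regex per position, with that position wildcarded."""
    return [
        re.compile("(?=(" + re.escape(query[:i]) + "." + re.escape(query[i + 1:]) + "))")
        for i in range(len(query))
    ]


def find_single_mismatches(query, target):
    """Yield (target_offset, mismatch_position) for one-base differences."""
    for pos, pattern in enumerate(one_mismatch_patterns(query)):
        for match in pattern.finditer(target):
            # The wildcard also matches the original letter, so skip exact hits.
            if match.group(1)[pos] != query[pos]:
                yield match.start(), pos


# Hypothetical 10-mer query and target, kept short for readability.
found = list(find_single_mismatches("ACGTACGTAC", "TTACGTACCTACGG"))
print(found)
```

Each result gives the offset in the target and the position within the query where the single differing base sits, which is the SNP location asked about earlier in the thread.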
Regards, Peter From chapmanb at 50mail.com Thu Mar 11 07:37:28 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 11 Mar 2010 07:37:28 -0500 Subject: [Biopython] BioPython GSOC 2010 In-Reply-To: <6a2880081003101051h6bb3e732s5176c8c3c6a23c19@mail.gmail.com> References: <6a2880081003101051h6bb3e732s5176c8c3c6a23c19@mail.gmail.com> Message-ID: <20100311123728.GB36200@sobchak.mgh.harvard.edu> Subho; > I am interested in applying for GSOC 2010. > > Particularly liked the R and Python integration proposal. There are lot of > other cool R packages too, such as Bio3d that one can think of. Great to hear you are interested. The official student application period will run from March 29th-April 9th; we will have more specifics about when and where to apply once the organizational application round in finished. There is plenty you can do in the meantime. The selection process for students is competitive, and some of the things that help give proposals an advantage are: - Demonstrating knowledge of the projects. For the R/python idea, this would involve digging into Rpy2, some R packages you would be interested in exposing, and Biopython to get a sense of what a compatible API would look like. - Demonstrating open source coding capabilities. If you've not already worked on an open source project, this could involve putting together working code demonstrating an aspect of your proposal and making it available on Bitbucket or GitHub. - Showing the ability to communicate effectively with the community. Once you have code available, write up some information about it on a blog, ask for feedback on mailing lists, or otherwise let people know it is out there and you want to talk about it. These tips are generally useful independent of what specific project you are applying for. 
Hope this helps, Brad From vincent at vincentdavis.net Thu Mar 11 08:42:40 2010 From: vincent at vincentdavis.net (Vincent Davis) Date: Thu, 11 Mar 2010 06:42:40 -0700 Subject: [Biopython] matching sequences from fasta files In-Reply-To: <320fb6e01003110306u1d3b8763j573fbb8200f9eeaf@mail.gmail.com> References: <77e831101003091946r630ad89bt2a9a462100db4cce@mail.gmail.com> <77e831101003101010i1e00c805jdf2269a9b86a9cf@mail.gmail.com> <320fb6e01003101356m68bd6287k50145aeaf7cb624b@mail.gmail.com> <77e831101003101647p4aac93eqd5ac464fb0f9acc6@mail.gmail.com> <320fb6e01003110306u1d3b8763j573fbb8200f9eeaf@mail.gmail.com> Message-ID: <77e831101003110542s270c2722w20970cf2fd278f9@mail.gmail.com> Thanks again for all the responses I'll let you know what I end up with. *Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn On Thu, Mar 11, 2010 at 4:06 AM, Peter wrote: > On Thu, Mar 11, 2010 at 12:47 AM, Vincent Davis > wrote: > > So I had an idea and wanted to get some feedback. > > I could make all possible single position mismatches for the sequences. I > > have 230,000 now and the would give me 17,250,000 (3 * 25 * 230,000). > Then > > use BLAST to look for perfect matches. I would probably do this > > incrementally maybe even just blast for each sequence. The advantage I > see > > in this is that BLAST can run multi core and I am running it on an 8core > > with 48gb of memory So it seems that this would be the fastest way to do > > this and very straight forward as there is very little parsing. There is > > either a match or not. I am purely guessing that generating the list if > > faster than parsing the results. > > The strengths of BLAST are in fast fuzzy matching. My instinct is is > would be silly to take your 230,000 queries, generate an extra queries > 17,250,000 queries, and then run BLAST against your (organism > specific?) database. 
Just run the BLAST on your queries with some > reasonably strict match parameters, then post filter for your single > base change. > > Now, if you really want to go for the brute force approach of looking > for the perfect matches, what you could do is for each query of length > 25, generate 25 simple regular expressions (e.g. using the "any letter" > wild card in each position). You can do the regular expression matching > within Python, or even with a command line tool like EMBOSS dreg. > http://emboss.sourceforge.net/apps/release/6.2/emboss/apps/dreg.html > > Speaking of the EMBOSS tools, their fuzzy nucleotide search tool > fuzznuc might be useful (you can specify the patterns using the > IUPAC codes rather than regular expressions): > http://emboss.sourceforge.net/apps/release/6.2/emboss/apps/fuzznuc.html > > As far as I know, EMBOSS don't have a tool/option for fuzzy matching > where you can specify a allowed number of miss-matches - unless one > of the primer/vector tools can be used in this way? I'd suggest using > primersearch but I think that only takes pairs of primers (not single > probes). > > There is going to more than one way to solve your problem. This > will be a useful learning process for you. > > Regards, > > Peter > From subhodeep.moitra at gmail.com Thu Mar 11 13:17:43 2010 From: subhodeep.moitra at gmail.com (subhodeep moitra) Date: Thu, 11 Mar 2010 13:17:43 -0500 Subject: [Biopython] BioPython GSOC 2010 In-Reply-To: <20100311123728.GB36200@sobchak.mgh.harvard.edu> References: <6a2880081003101051h6bb3e732s5176c8c3c6a23c19@mail.gmail.com> <20100311123728.GB36200@sobchak.mgh.harvard.edu> Message-ID: <6a2880081003111017p3bf02764wd9a1f3455dbb9b49@mail.gmail.com> Hi Brad All good advice. I've used BioPython and R for a few things, but am still new to it. I would like to start coding straightaway, and work on my favorite R package as you suggested. Will stay in touch. 
Thanks Subho On Thu, Mar 11, 2010 at 7:37 AM, Brad Chapman wrote: > Subho; > > > I am interested in applying for GSOC 2010. > > > > Particularly liked the R and Python integration proposal. There are lot > of > > other cool R packages too, such as Bio3d that one can think of. > > Great to hear you are interested. The official student application > period will run from March 29th-April 9th; we will have more > specifics about when and where to apply once the organizational > application round in finished. > > There is plenty you can do in the meantime. The selection process > for students is competitive, and some of the things that help give > proposals an advantage are: > > - Demonstrating knowledge of the projects. For the R/python idea, this > would involve digging into Rpy2, some R packages you would be > interested in exposing, and Biopython to get a sense of what a > compatible API would look like. > > - Demonstrating open source coding capabilities. If you've not > already worked on an open source project, this could involve > putting together working code demonstrating an aspect of your > proposal and making it available on Bitbucket or GitHub. > > - Showing the ability to communicate effectively with the community. > Once you have code available, write up some information about it > on a blog, ask for feedback on mailing lists, or otherwise let > people know it is out there and you want to talk about it. > > These tips are generally useful independent of what specific project > you are applying for. 
> > Hope this helps, > Brad > -- Subhodeep Moitra First Year, Masters Student Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh, PA , USA From mjldehoon at yahoo.com Thu Mar 11 19:36:07 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 11 Mar 2010 16:36:07 -0800 (PST) Subject: [Biopython] matching sequences from fasta files In-Reply-To: <320fb6e01003110306u1d3b8763j573fbb8200f9eeaf@mail.gmail.com> Message-ID: <790299.17186.qm@web62404.mail.re1.yahoo.com> --- On Thu, 3/11/10, Peter wrote: > From: Peter > Subject: Re: [Biopython] matching sequences from fasta files > To: "Vincent Davis" > Cc: "biopython" > Date: Thursday, March 11, 2010, 6:06 AM > On Thu, Mar 11, 2010 at 12:47 AM, > Vincent Davis > > wrote: > > So I had an idea and wanted to get some feedback. > > I could make all possible single position mismatches > for the sequences. I > > have 230,000 now and the would give me 17,250,000 (3 * > 25 * 230,000). Then > > use BLAST to look for perfect matches. I would > probably do this > > incrementally maybe even just blast for each sequence. > The advantage I see > > in this is that BLAST can run multi core and I am > running it on an 8core > > with 48gb of memory So it seems that this would be the > fastest way to do > > this and very straight forward as there is very little > parsing. There is > > either a match or not. I am purely guessing that > generating the list if > > faster than parsing the results. > Nexalign can do exactly what you are trying to do. See http://genome.gsc.riken.jp/osc/english/dataresource/. --Michiel. 
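For reference, the enumeration quoted above (3 alternative bases at each of 25 positions) can be sketched in a few lines of plain Python. The function name and the example 25-mer are made up, and a plain A/C/G/T alphabet is assumed; the count it produces is what gives the 17,250,000 figure for 230,000 sequences.

```python
def single_mismatch_variants(seq, alphabet="ACGT"):
    """Yield (position, variant) for each sequence differing at one base."""
    for pos, original in enumerate(seq):
        for base in alphabet:
            if base != original:
                yield pos, seq[:pos] + base + seq[pos + 1:]


probe = "ACGTACGTACGTACGTACGTACGTA"  # a made-up 25-mer
variants = list(single_mismatch_variants(probe))
print(len(variants))           # 3 * 25 variants per sequence
print(len(variants) * 230000)  # total for 230,000 sequences
```

Keeping the position alongside each variant means a later exact match can be traced straight back to the location of the differing base.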
From vincent at vincentdavis.net Thu Mar 11 22:08:23 2010 From: vincent at vincentdavis.net (Vincent Davis) Date: Thu, 11 Mar 2010 20:08:23 -0700 Subject: [Biopython] matching sequences from fasta files In-Reply-To: <790299.17186.qm@web62404.mail.re1.yahoo.com> References: <320fb6e01003110306u1d3b8763j573fbb8200f9eeaf@mail.gmail.com> <790299.17186.qm@web62404.mail.re1.yahoo.com> Message-ID: <77e831101003111908l4e75b898yfa045ccc96d1850@mail.gmail.com> @Michiel de Hoon Nexalign can do exactly what you are trying to do. See http://genome.gsc.riken.jp/osc/english/dataresource/. Thanks for the link to nexalign. It is perfect and fast. This is exactly what I needed. I already have the results I needed, 5 min from download to results. I need to spend a little time verifying I have what I want, but it looks right. Again thank you very much. *Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn On Thu, Mar 11, 2010 at 5:36 PM, Michiel de Hoon wrote: > --- On Thu, 3/11/10, Peter wrote: > > > From: Peter > > Subject: Re: [Biopython] matching sequences from fasta files > > To: "Vincent Davis" > > Cc: "biopython" > > Date: Thursday, March 11, 2010, 6:06 AM > > On Thu, Mar 11, 2010 at 12:47 AM, > > Vincent Davis > > > > wrote: > > > So I had an idea and wanted to get some feedback. > > > I could make all possible single position mismatches > > for the sequences. I > > > have 230,000 now and the would give me 17,250,000 (3 * > > 25 * 230,000). Then > > > use BLAST to look for perfect matches. I would > > probably do this > > > incrementally maybe even just blast for each sequence. > > The advantage I see > > > in this is that BLAST can run multi core and I am > > running it on an 8core > > > with 48gb of memory So it seems that this would be the > > fastest way to do > > > this and very straight forward as there is very little > > parsing. There is > > > either a match or not.
I am purely guessing that > > generating the list if > > > faster than parsing the results. > > > Nexalign can do exactly what you are trying to do. > See http://genome.gsc.riken.jp/osc/english/dataresource/. > > --Michiel. > > > > From sbassi at genesdigitales.com Sat Mar 13 09:41:09 2010 From: sbassi at genesdigitales.com (Sebastian Bassi) Date: Sat, 13 Mar 2010 11:41:09 -0300 Subject: [Biopython] Slides from Feb. 22 Biopython workshop In-Reply-To: <3f6baf361002241052w5f66fcffmdfe52cb671386505@mail.gmail.com> References: <3f6baf361002241052w5f66fcffmdfe52cb671386505@mail.gmail.com> Message-ID: On Wed, Feb 24, 2010 at 3:52 PM, Eric Talevich wrote: > On Monday I hosted a 2-hour programming workshop focusing on Biopython and > http://www.slideshare.net/etalevich/biopython-programming-workshop-at-uga > http://www.slideshare.net/etalevich/python-workshop-1-uga-bioinformatics > I hope others find these slides useful. Hello, I am about to give a talk about Biopython in a local event (1er Congreso Argentino de Bioinformatica y Biologia Computacional) and I think I could retrieve material from some of your slides (with attribution). What do you think? Best, SB. -- Curso de Python en un d?a: http://bit.ly/cursopython Non standard disclaimer: READ CAREFULLY. By reading this email, you agree, on behalf of your employer, to release me from all obligations and waivers arising from any and all NON-NEGOTIATED agreements, licenses, terms-of-service, shrinkwrap, clickwrap, browsewrap, confidentiality, non-disclosure, non-compete and acceptable use policies ("BOGUS AGREEMENTS") that I have entered into with your employer, its partners, licensors, agents and assigns, in perpetuity, without prejudice to my ongoing rights and privileges. You further represent that you have the authority to release me from any BOGUS AGREEMENTS on behalf of your employer. 
Google ads remover words: suicide, murder From etal at uga.edu Sat Mar 13 10:27:51 2010 From: etal at uga.edu (Eric Talevich) Date: Sat, 13 Mar 2010 10:27:51 -0500 Subject: [Biopython] Slides from Feb. 22 Biopython workshop In-Reply-To: References: <3f6baf361002241052w5f66fcffmdfe52cb671386505@mail.gmail.com> Message-ID: <3f6baf361003130727g1831dbb8gba7f9b63d160fd0b@mail.gmail.com> On Sat, Mar 13, 2010 at 9:41 AM, Sebastian Bassi wrote: > On Wed, Feb 24, 2010 at 3:52 PM, Eric Talevich wrote: > > On Monday I hosted a 2-hour programming workshop focusing on Biopython > and > > > http://www.slideshare.net/etalevich/biopython-programming-workshop-at-uga > > http://www.slideshare.net/etalevich/python-workshop-1-uga-bioinformatics > > I hope others find these slides useful. > > Hello, I am about to give a talk about Biopython in a local event (1er > Congreso Argentino de Bioinformatica y Biologia Computacional) and I > think I could retrieve material from some of your slides (with > attribution). What do you think? > Best, > SB. > Great, I'm glad you found the slides helpful. The Latex Beamer source isn't in a publishable state yet, but I can e-mail it to you if you'd like. -Eric From lgautier at gmail.com Sat Mar 13 13:42:47 2010 From: lgautier at gmail.com (Laurent Gautier) Date: Sat, 13 Mar 2010 19:42:47 +0100 Subject: [Biopython] Biopython Digest, Vol 87, Issue 12 In-Reply-To: References: Message-ID: <4B9BDCA7.8010302@gmail.com> When looking at rpy2, do consider the 2.1-dev version. 2.1 will be released before the SoC starts. L. On 3/12/10 6:00 PM, biopython-request at lists.open-bio.org wrote: > Hi Brad > > All good advice. > > I've used BioPython and R for a few things, but am still new to it. > I would like to start coding straightaway, and work on my favorite R package > as you suggested. > Will stay in touch. > > Thanks > Subho > > > On Thu, Mar 11, 2010 at 7:37 AM, Brad Chapman wrote: > >> Subho; >> >>> I am interested in applying for GSOC 2010. 
>>> >>> Particularly liked the R and Python integration proposal. There are lot >> of >>> other cool R packages too, such as Bio3d that one can think of. >> >> Great to hear you are interested. The official student application >> period will run from March 29th-April 9th; we will have more >> specifics about when and where to apply once the organizational >> application round in finished. >> >> There is plenty you can do in the meantime. The selection process >> for students is competitive, and some of the things that help give >> proposals an advantage are: >> >> - Demonstrating knowledge of the projects. For the R/python idea, this >> would involve digging into Rpy2, some R packages you would be >> interested in exposing, and Biopython to get a sense of what a >> compatible API would look like. >> >> - Demonstrating open source coding capabilities. If you've not >> already worked on an open source project, this could involve >> putting together working code demonstrating an aspect of your >> proposal and making it available on Bitbucket or GitHub. >> >> - Showing the ability to communicate effectively with the community. >> Once you have code available, write up some information about it >> on a blog, ask for feedback on mailing lists, or otherwise let >> people know it is out there and you want to talk about it. >> >> These tips are generally useful independent of what specific project >> you are applying for. >> >> Hope this helps, >> Brad >> > > > From sbassi at genesdigitales.com Sun Mar 14 01:53:45 2010 From: sbassi at genesdigitales.com (Sebastian Bassi) Date: Sun, 14 Mar 2010 03:53:45 -0300 Subject: [Biopython] Slides from Feb. 
22 Biopython workshop In-Reply-To: <3f6baf361003130727g1831dbb8gba7f9b63d160fd0b@mail.gmail.com> References: <3f6baf361002241052w5f66fcffmdfe52cb671386505@mail.gmail.com> <3f6baf361003130727g1831dbb8gba7f9b63d160fd0b@mail.gmail.com> Message-ID: On Sat, Mar 13, 2010 at 12:27 PM, Eric Talevich wrote: > Great, I'm glad you found the slides helpful. The Latex Beamer source isn't > in a publishable state yet, but I can e-mail it to you if you'd like. Don't worry, I don't need the source, I am planning to use some of the content to write my own. Thank you again, I will post them after the congress. Best, SB. From sbassi at gmail.com Mon Mar 15 02:35:16 2010 From: sbassi at gmail.com (Sebastian Bassi) Date: Mon, 15 Mar 2010 03:35:16 -0300 Subject: [Biopython] Bio.PDB Module - Atom.set_serial_number function - usage? In-Reply-To: References: Message-ID: On Tue, Mar 9, 2010 at 7:21 PM, João Rodrigues wrote: > I checked the code of PDBIO and apparently, it has hard-coded a resetting of > the atom number. My question is, what is this set_serial_number for then? Is > there a way for me to override this easily? It may be a bug. Could you post your code related to this? From biopython at maubp.freeserve.co.uk Mon Mar 15 04:34:41 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 15 Mar 2010 08:34:41 +0000 Subject: [Biopython] Bio.PDB Module - Atom.set_serial_number function - usage? In-Reply-To: References: Message-ID: <320fb6e01003150134i18a9fa40xf29fc327a992cd70@mail.gmail.com> On Mon, Mar 15, 2010 at 6:35 AM, Sebastian Bassi wrote: > On Tue, Mar 9, 2010 at 7:21 PM, João Rodrigues wrote: >> I checked the code of PDBIO and apparently, it has hard-coded a resetting of >> the atom number. My question is, what is this set_serial_number for then? Is >> there a way for me to override this easily? > > It may be a bug. Could you post your code related to this? PDBIO does explicitly just use an incremental counter for the atom number.
I don't know why for sure, but this is a simple way
to ensure the atoms are given unique identifiers on output. I
guess the serial_number is just set by the parser. I don't see an
easy way to override it - why do you want to change it?

Regarding the point of get_serial_number and set_serial_number,
they seem to be rather pointless methods - since you can just edit
the serial_number attribute directly. Maybe Thomas has been using
Java while writing this code? We have talked about deprecating the
pointless get/set functions to make the PDB API a little more transparent.

Peter

From anaryin at gmail.com Mon Mar 15 04:58:29 2010
From: anaryin at gmail.com (João Rodrigues)
Date: Mon, 15 Mar 2010 01:58:29 -0700
Subject: [Biopython] Bio.PDB Module - Atom.set_serial_number function - usage?
In-Reply-To: <320fb6e01003150134i18a9fa40xf29fc327a992cd70@mail.gmail.com>
References: <320fb6e01003150134i18a9fa40xf29fc327a992cd70@mail.gmail.com>
Message-ID:

Exactly my point. Those two functions are pretty much useless at the moment
since the PDBIO module ignores those values. I just changed the value of the
atom number in PDBIO for atom.get_serial_number() and it worked as I wanted,
so it isn't that hard.

I just wanted to ask if this had a particular reason or if it was some
forgotten old setting or bug.

João [...] Rodrigues
@ http://stanford.edu/~joaor/

2010/3/15 Peter

> On Mon, Mar 15, 2010 at 6:35 AM, Sebastian Bassi wrote:
> > On Tue, Mar 9, 2010 at 7:21 PM, João Rodrigues wrote:
> >> I checked the code of PDBIO and apparently, it has hard-coded a resetting of
> >> the atom number. My question is, what is this set_serial_number for then? Is
> >> there a way for me to override this easily?
> >
> > It may be a bug. Could you post your code related to this?
>
> PDBIO does explicitly just use an incremental counter for the
> atom number. I don't know why for sure, but this is a simple way
> to ensure the atoms are given unique identifiers on output.
> I guess the serial_number is just set by the parser. I don't see an
> easy way to override it - why do you want to change it?
>
> Regarding the point of get_serial_number and set_serial_number,
> they seem to be rather pointless methods - since you can just edit
> the serial_number attribute directly. Maybe Thomas has been using
> Java while writing this code? We have talked about deprecating the
> pointless get/set functions to make the PDB API a little more transparent.
>
> Peter
>

From biopython at maubp.freeserve.co.uk Mon Mar 15 05:29:09 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 15 Mar 2010 09:29:09 +0000
Subject: [Biopython] Bio.PDB Module - Atom.set_serial_number function - usage?
In-Reply-To:
References: <320fb6e01003150134i18a9fa40xf29fc327a992cd70@mail.gmail.com>
Message-ID: <320fb6e01003150229i763e203w9a8da06f3ebb0662@mail.gmail.com>

On Mon, Mar 15, 2010 at 8:58 AM, João Rodrigues wrote:
> Exactly my point. Those two functions are pretty much useless at the moment
> since the PDBIO module ignores those values. I just changed the value of the
> atom number in PDBIO for atom.get_serial_number() and it worked as I wanted,
> so it isn't that hard.
>
> I just wanted to ask if this had a particular reason or if it was some
> forgotten old setting or bug.

My guess is that if you have selected only part of a PDB file and written
it out to a new sub-file, then it is conventional to have the atoms
numbered sequentially from one. This is what the current code does, but
using the serial_number from the objects would result in irregular
numbering with gaps in it (not sure if that is against the PDB
specification, but it would not surprise me if third party tools don't
like it).

i.e. Not a bug, but a deliberate design choice. (We'd have to ask Thomas
what he was thinking to be sure.)

Again, why do you want to change the atom numbers on output?
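
As a rough illustration of the two behaviours discussed in this thread, the
sketch below contrasts renumbering with an incremental counter against
reusing each atom's own serial_number. It is plain Python and does not use
Bio.PDB: the Atom class, the write_atoms function and the
keep_serial_numbers flag are hypothetical names made up for this example;
only the serial_number attribute name comes from the discussion.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    """Hypothetical stand-in for a parsed PDB atom (not Bio.PDB's Atom)."""
    name: str
    serial_number: int


def write_atoms(atoms, keep_serial_numbers=False):
    """Return (number, name) pairs as an output writer might emit them.

    By default atoms are renumbered 1, 2, 3, ... with an incremental
    counter. With keep_serial_numbers=True (an assumed option, for
    illustration only) each atom's own serial_number is used instead,
    which can leave gaps in the numbering.
    """
    records = []
    for counter, atom in enumerate(atoms, start=1):
        number = atom.serial_number if keep_serial_numbers else counter
        records.append((number, atom.name))
    return records


# A selection taken from the middle of a larger structure.
selection = [Atom("N", 17), Atom("CA", 18), Atom("O", 21)]

renumbered = write_atoms(selection)
preserved = write_atoms(selection, keep_serial_numbers=True)
print(renumbered)
print(preserved)
```

With this selection the default numbering is 1, 2, 3, while
keep_serial_numbers=True gives 17, 18, 21: still unique, but with a gap.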

Peter

From vincent at vincentdavis.net Tue Mar 16 11:03:45 2010
From: vincent at vincentdavis.net (Vincent Davis)
Date: Tue, 16 Mar 2010 09:03:45 -0600
Subject: [Biopython] comparing micro array data
Message-ID: <77e831101003160803n24a4568aq68793a367059f956@mail.gmail.com>

I am very new to this, so please excuse my ignorance on this subject.

I have several microarray samples (~8) for each of 3 known genomes, so I
know which probes/sequences are a match and which have close matches. I
would like to identify which sequences exist in an unknown sample. The array
is custom and there is little to no overlap between probes.
What is the "standard" way of doing this? I don't care whether a SNP is
present, only whether the sequence is present.
Is this standard method available in Biopython?

Thanks

*Vincent Davis
720-301-3003 *
vincent at vincentdavis.net
my blog | LinkedIn

From biopython at maubp.freeserve.co.uk Tue Mar 16 11:15:27 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 16 Mar 2010 15:15:27 +0000
Subject: [Biopython] comparing micro array data
In-Reply-To: <77e831101003160803n24a4568aq68793a367059f956@mail.gmail.com>
References: <77e831101003160803n24a4568aq68793a367059f956@mail.gmail.com>
Message-ID: <320fb6e01003160815s1e051330ve62211d6c7843f64@mail.gmail.com>

On Tue, Mar 16, 2010 at 3:03 PM, Vincent Davis wrote:
> So I am very new to this so please accept my ignorance on this subject.
>
> I have several micro array samples ~ 8 for each of 3 known genomes. So I
> know which probes/sequences are a match and which have close matches. I
> would like to identify which sequences exist in an unknown sample. The array
> is custom and there is little to no overlap between probes.
> What is the "standard" way of doing this? I don't care to know if a SNP is
> present only if the sequence is present.
> Is this standard available in biopython ?
Hi Vincent, Biopython has only limited pairwise alignment built in - we normally just call specialised command line tools. In addition to classic microarray probe design tools, you *might* be able to exploit related tools for PCR primers or short read tools from next generation sequencing. However, these won't be specifically aware of microarray probe affinities and how to model them. For microarray work I would have to say using R/Bioconductor will probably be more sensible for the very practical reason that they have a much larger community using microarrays than Python does. http://www.bioconductor.org/ Peter P.S. You can call R from Python, see http://rpy.sourceforge.net/ From vincent at vincentdavis.net Tue Mar 16 11:30:42 2010 From: vincent at vincentdavis.net (Vincent Davis) Date: Tue, 16 Mar 2010 09:30:42 -0600 Subject: [Biopython] comparing micro array data In-Reply-To: <320fb6e01003160815s1e051330ve62211d6c7843f64@mail.gmail.com> References: <77e831101003160803n24a4568aq68793a367059f956@mail.gmail.com> <320fb6e01003160815s1e051330ve62211d6c7843f64@mail.gmail.com> Message-ID: <77e831101003160830m4e679fa0v21df651d79db582a@mail.gmail.com> > > @Peter For microarray work I would have to say using R/Bioconductor will probably be more sensible for the very practical reason that they have a much larger community using microarrays than Python does. http://www.bioconductor.org/ I am working at getting up to speed with R and bioconductor. I ask the question here as I got such a great answer for the last question I had and thought if the tool was available in biopython then I would try it. I don't know how this problem is normally solved. > *Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn On Tue, Mar 16, 2010 at 9:15 AM, Peter wrote: > On Tue, Mar 16, 2010 at 3:03 PM, Vincent Davis > wrote: > > So I am very new to this so please accept my ignorance on this subject. > > > > I have several micro array samples ~ 8 for each of 3 known genomes. 
So I > > know which probes/sequences are a match and which have close matches. I > > would like to identify which sequences exist in an unknown sample. The > array > > is custom and there is little to know overlap between probes. > > What is the "standard" way of doing this? I don't care to know if a SNP > is > > present only if the sequence is present. > > Is this standard available in biopython ? > > Hi Vincent, > > Biopython has only limited pairwise alignment built in - we normally just > call specialised command line tools. In addition to classic microarray > probe design tools, you *might* be able to exploit related tools for PCR > primers or short read tools from next generation sequencing. However, > these won't be specifically aware of microarray probe affinities and how > to model them. > > For microarray work I would have to say using R/Bioconductor will > probably be more sensible for the very practical reason that they > have a much larger community using microarrays than Python does. > http://www.bioconductor.org/ > > Peter > > P.S. You can call R from Python, see http://rpy.sourceforge.net/ > From lpritc at scri.ac.uk Tue Mar 16 12:03:06 2010 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Tue, 16 Mar 2010 16:03:06 +0000 Subject: [Biopython] comparing micro array data In-Reply-To: <77e831101003160830m4e679fa0v21df651d79db582a@mail.gmail.com> Message-ID: Hi Vincent, On 16/03/2010 Tuesday, March 16, 15:30, "Vincent Davis" wrote: > On Tue, Mar 16, 2010 at 9:15 AM, Peter wrote: > >> @Peter >> For microarray work I would have to say using R/Bioconductor will >> probably be more sensible for the very practical reason that they >> have a much larger community using microarrays than Python does. >> >> http://www.bioconductor.org/ > > I am working at getting up to speed with R and bioconductor. I ask the > question here as I got such a great answer for the last question I had and > thought if the tool was available in biopython then I would try it. 
I don't > know how this problem is normally solved. Peter's suggestion is a good one, in general. Biopython is lacking in support for microarray analysis - not least in part because there's already an adaptor to R, from which the mature and powerful Bioconductor libraries are available (not to mention that arrays are being superseded by sequencing, so now might not be the time to put too much effort in to that ;)). If you've got microarray issues, a Bioconductor mailing list might be a better first port of call. >> On Tue, Mar 16, 2010 at 3:03 PM, Vincent Davis >> wrote: >>> So I am very new to this so please accept my ignorance on this subject. >>> >>> I have several micro array samples ~ 8 for each of 3 known genomes. So I >>> know which probes/sequences are a match and which have close matches. I >>> would like to identify which sequences exist in an unknown sample. The >> array >>> is custom and there is little to know overlap between probes. >>> What is the "standard" way of doing this? I don't care to know if a SNP >> is >>> present only if the sequence is present. >>> Is this standard available in biopython ? It's not very clear to me what the problem is, from your description here. It sounds a bit like you are doing array CGH, starting with an array that was raised to species X, and you then have eight sets of array results (this wouldn't be two samples with three replicates, and a single sample with two replicates, would it?) from known species A, B, and C. Then it seems like you have a sample from species D, and you want to know - perhaps from the array hybridisation data, perhaps from the genome sequence, it's hard to tell - possibly one of two things: which probes will bind to species D; or how many genes from species D are similar to those in species X. These two questions would require quite different approaches; can you be clearer? Cheers, L. 
-- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). 
______________________________________________________ From sdavis2 at mail.nih.gov Tue Mar 16 12:38:31 2010 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Tue, 16 Mar 2010 12:38:31 -0400 Subject: [Biopython] comparing micro array data In-Reply-To: <77e831101003160803n24a4568aq68793a367059f956@mail.gmail.com> References: <77e831101003160803n24a4568aq68793a367059f956@mail.gmail.com> Message-ID: <264855a01003160938n579c42bek4c5122c6fa7b43aa@mail.gmail.com> On Tue, Mar 16, 2010 at 11:03 AM, Vincent Davis wrote: > So I am very new to this so please accept my ignorance on this subject. > > I have several micro array samples ~ 8 for each of 3 known genomes. So I > know which probes/sequences are a match and which have close matches. I > would like to identify which sequences exist in an unknown sample. The array > is custom and there is little to know overlap between probes. > What is the "standard" way of doing this? I don't care to know if a SNP is > present only if the sequence is present. > Is this standard available in biopython ? Hi, Vincent. I'm not clear on what the study is here. Could you explain a bit more what you are doing? I get the suggestion from your email that you want to do a cross-species comparison using microarrays. If this is the case, this is notoriously difficult to do, so, in addition to the comments here, I would suggest finding a local collaborator if you are relatively new to the microarray field. 

Sean

From vincent at vincentdavis.net Tue Mar 16 12:38:43 2010
From: vincent at vincentdavis.net (Vincent Davis)
Date: Tue, 16 Mar 2010 10:38:43 -0600
Subject: [Biopython] comparing micro array data
In-Reply-To:
References: <77e831101003160830m4e679fa0v21df651d79db582a@mail.gmail.com>
Message-ID: <77e831101003160938o10f53c15m51c1aa559def5513@mail.gmail.com>

> It sounds a bit like you are doing array CGH, starting with an array that
> was raised to species X,

Yes

> eight sets of array results (this
> wouldn't be two samples with three replicates, and a single sample with two
> replicates, would it?) from known species A, B, and C.

I have 3 known species, X (the one that matches the array), B and C, and
about 8 arrays/samples, and for each we know whether a probe/sequence
matches a sequence in the genome. I also have several different unknown
samples D, E, F, ... I want to know, for any given sequence/probe, whether
the unknown has that sequence, or some probability that it does, or which
probes are most likely to be different. B and C only help by allowing us
to test our method.

I also have close mismatch data for the known species; that is, I know
whether there is a single-mismatch match and the distance of that mismatch
from the center of the sequence.

*Vincent Davis
720-301-3003 *
vincent at vincentdavis.net
my blog | LinkedIn

On Tue, Mar 16, 2010 at 10:03 AM, Leighton Pritchard wrote:
> Hi Vincent,
>
> On 16/03/2010 Tuesday, March 16, 15:30, "Vincent Davis" wrote:
>
> > On Tue, Mar 16, 2010 at 9:15 AM, Peter wrote:
> >
> >> @Peter
> >> For microarray work I would have to say using R/Bioconductor will
> >> probably be more sensible for the very practical reason that they
> >> have a much larger community using microarrays than Python does.
> >>
> >> http://www.bioconductor.org/
> >
> > I am working at getting up to speed with R and bioconductor. I ask the
> > question here as I got such a great answer for the last question I had and
> > thought if the tool was available in biopython then I would try it.
I > don't > > know how this problem is normally solved. > > Peter's suggestion is a good one, in general. Biopython is lacking in > support for microarray analysis - not least in part because there's already > an adaptor to R, from which the mature and powerful Bioconductor libraries > are available (not to mention that arrays are being superseded by > sequencing, so now might not be the time to put too much effort in to that > ;)). If you've got microarray issues, a Bioconductor mailing list might be > a better first port of call. > > >> On Tue, Mar 16, 2010 at 3:03 PM, Vincent Davis < > vincent at vincentdavis.net> > >> wrote: > >>> So I am very new to this so please accept my ignorance on this subject. > >>> > >>> I have several micro array samples ~ 8 for each of 3 known genomes. So > I > >>> know which probes/sequences are a match and which have close matches. I > >>> would like to identify which sequences exist in an unknown sample. The > >> array > >>> is custom and there is little to know overlap between probes. > >>> What is the "standard" way of doing this? I don't care to know if a SNP > >> is > >>> present only if the sequence is present. > >>> Is this standard available in biopython ? > > It's not very clear to me what the problem is, from your description here. > It sounds a bit like you are doing array CGH, starting with an array that > was raised to species X, and you then have eight sets of array results > (this > wouldn't be two samples with three replicates, and a single sample with two > replicates, would it?) from known species A, B, and C. Then it seems like > you have a sample from species D, and you want to know - perhaps from the > array hybridisation data, perhaps from the genome sequence, it's hard to > tell - possibly one of two things: which probes will bind to species D; or > how many genes from species D are similar to those in species X. These two > questions would require quite different approaches; can you be clearer? 
>
> Cheers,
>
> L.
>

From vincent at vincentdavis.net Tue Mar 16 12:49:26 2010
From: vincent at vincentdavis.net (Vincent Davis)
Date: Tue, 16 Mar 2010 10:49:26 -0600
Subject: [Biopython] comparing micro array data
In-Reply-To: <264855a01003160938n579c42bek4c5122c6fa7b43aa@mail.gmail.com>
References: <77e831101003160803n24a4568aq68793a367059f956@mail.gmail.com>
 <264855a01003160938n579c42bek4c5122c6fa7b43aa@mail.gmail.com>
Message-ID: <77e831101003160949i5d9d9126vd28b257bab0cc685@mail.gmail.com>

@ Sean

> I would suggest finding a local collaborator if you are relatively new
> to the microarray field.

I actually was brought into this project by a team from a university. They
know lots, including that this is a difficult problem. They did not have any
references as to how others have solved this problem with whatever success
was possible. Since I know Python, Biopython has been my first choice to ask
other smart people :)

I am an economist. I am OK with the stats and data but don't know the
terminology well. It's been a 3-week crash course in my free time. I wrote
my own modules for reading in CEL and CDF files as Python objects. I know
there are existing solutions, but I would not have learned as much that way.
I used the nexalign program that was recommended on this list to get the
mismatch data. It's all coming along nicely and I am learning lots. The
project has been languishing for a list of reasons and now there is a push
to get it finished.

*Vincent Davis
720-301-3003 *
vincent at vincentdavis.net
my blog | LinkedIn

On Tue, Mar 16, 2010 at 10:38 AM, Sean Davis wrote:
> On Tue, Mar 16, 2010 at 11:03 AM, Vincent Davis wrote:
> > So I am very new to this so please accept my ignorance on this subject.
> >
> > I have several micro array samples ~ 8 for each of 3 known genomes. So I
> > know which probes/sequences are a match and which have close matches. I
> > would like to identify which sequences exist in an unknown sample.
The > array > > is custom and there is little to know overlap between probes. > > What is the "standard" way of doing this? I don't care to know if a SNP > is > > present only if the sequence is present. > > Is this standard available in biopython ? > > Hi, Vincent. I'm not clear on what the study is here. Could you > explain a bit more what you are doing? I get the suggestion from your > email that you want to do a cross-species comparison using > microarrays. If this is the case, this is notoriously difficult to > do, so, in addition to the comments here, I would suggest finding a > local collaborator if you are relatively new to the microarray field. > > Sean > From sdavis2 at mail.nih.gov Tue Mar 16 12:56:12 2010 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Tue, 16 Mar 2010 12:56:12 -0400 Subject: [Biopython] comparing micro array data In-Reply-To: <77e831101003160949i5d9d9126vd28b257bab0cc685@mail.gmail.com> References: <77e831101003160803n24a4568aq68793a367059f956@mail.gmail.com> <264855a01003160938n579c42bek4c5122c6fa7b43aa@mail.gmail.com> <77e831101003160949i5d9d9126vd28b257bab0cc685@mail.gmail.com> Message-ID: <264855a01003160956o1bc9b432qf46abbc001fcbde7@mail.gmail.com> On Tue, Mar 16, 2010 at 12:49 PM, Vincent Davis wrote: > @ Sean > > I would suggest finding a > > local collaborator if you are relatively new to the microarray field. > > > I actually was brought into this project by a team from an university. They > know lots including that this is a difficult problem. They did not have any > references as to how others have solved this problem with whatever success > was possible. Since I know python, biopython has been my first choice to ask > other smart people :) > > > I am an economist. I am ok with the stats and data but don't know the > terminology well, It's been a 3 week crash course in my free time. I wrote > my own modules for reading in CEL and CDF files as python objects. 
I know > there are existing solution but I would not learned as much that way. I used > the nexalign program that was recommended on this list to get the mismatch > data. It's all coming along nicely andI am learning lots. The prject has > been languishing for a list of reasons and now there is a push to get it > finished. > Perfect! A mathematician working with biologists--this is the way of the world these days. Given the issues that you describe, I would definitely suggest looking at R/bioconductor. That said, I'm not sure that there is a good answer to the problem, as you suggest. If you don't mind a couple of questions, for curiosity sake, how big is the genome of model organism? And what size are the arrays, in terms of probes? Sean > > *Vincent Davis > 720-301-3003 * > vincent at vincentdavis.net > my blog | LinkedIn > > > On Tue, Mar 16, 2010 at 10:38 AM, Sean Davis wrote: > >> On Tue, Mar 16, 2010 at 11:03 AM, Vincent Davis >> wrote: >> > So I am very new to this so please accept my ignorance on this subject. >> > >> > I have several micro array samples ~ 8 for each of 3 known genomes. So I >> > know which probes/sequences are a match and which have close matches. I >> > would like to identify which sequences exist in an unknown sample. The >> array >> > is custom and there is little to know overlap between probes. >> > What is the "standard" way of doing this? I don't care to know if a SNP >> is >> > present only if the sequence is present. >> > Is this standard available in biopython ? >> >> Hi, Vincent. I'm not clear on what the study is here. Could you >> explain a bit more what you are doing? I get the suggestion from your >> email that you want to do a cross-species comparison using >> microarrays. If this is the case, this is notoriously difficult to >> do, so, in addition to the comments here, I would suggest finding a >> local collaborator if you are relatively new to the microarray field. 
>>
>> Sean
>>
>
>

From subhodeep.moitra at gmail.com Tue Mar 16 12:56:06 2010
From: subhodeep.moitra at gmail.com (subhodeep moitra)
Date: Tue, 16 Mar 2010 12:56:06 -0400
Subject: [Biopython] comparing micro array data
Message-ID: <6a2880081003160956t21e30d1v35d9b9df240370c4@mail.gmail.com>

If you need to visualize the microarray data and also do some analysis for
interaction networks, then 'Cytoscape' is a good option to go for.

Thanks
Subho

--
Subhodeep Moitra
First Year, Masters Student
Language Technologies Institute
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA , USA

From biopython at maubp.freeserve.co.uk Tue Mar 16 13:29:17 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 16 Mar 2010 17:29:17 +0000
Subject: [Biopython] comparing micro array data
In-Reply-To: <264855a01003160956o1bc9b432qf46abbc001fcbde7@mail.gmail.com>
References: <77e831101003160803n24a4568aq68793a367059f956@mail.gmail.com>
 <264855a01003160938n579c42bek4c5122c6fa7b43aa@mail.gmail.com>
 <77e831101003160949i5d9d9126vd28b257bab0cc685@mail.gmail.com>
 <264855a01003160956o1bc9b432qf46abbc001fcbde7@mail.gmail.com>
Message-ID: <320fb6e01003161029i5feddf8ck76ba2b9ecd2056f2@mail.gmail.com>

On Tue, Mar 16, 2010 at 4:56 PM, Sean Davis wrote:
>
> If you don't mind a couple of questions, for curiosity sake, how big is the
> genome of model organism? And what size are the arrays, in terms of
> probes?

Also, what kind of organism? e.g. Plant, animal, bacteria?
This will make a difference for the number of papers you'll find doing this kind of thing in the literature.

On Tue, Mar 16, 2010 at 4:49 PM, Vincent Davis wrote:
> I actually was brought into this project by a team from an university. They
> know lots including that this is a difficult problem. They did not have any
> references as to how others have solved this problem with whatever success
> was possible. Since I know python, biopython has been my first choice to ask
> other smart people :)

For a recent example using microarrays for cross-species comparison (aka microarray comparative genomic hybridisation) in bacteria you might want to read Leighton's paper (and the references within - which include work on humans):

http://www.ncbi.nlm.nih.gov/pubmed/19696881

You can probably guess why he asked if you were doing array CGH ;)

Peter

From hlapp at drycafe.net Tue Mar 16 16:03:50 2010
From: hlapp at drycafe.net (Hilmar Lapp)
Date: Tue, 16 Mar 2010 16:03:50 -0400
Subject: [Biopython] [OT] Job opportunity: Training coordinator and Bioinformatics Project Manager
Message-ID: <0CDDCED9-266E-4CCE-8240-D7E2C8522784@drycafe.net>

Hi all - first off, sorry for the cross-posting, we're trying to advertise this as widely as possible. Second, apologies if this is committing an offense and considered spam. I thought though that there might be some people around here who may be interested and suitable.

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- informatics.nescent.org :
===========================================================

A unique position is available for a training coordinator and bioinformatics project manager at the U.S. National Evolutionary Synthesis Center in Durham, North Carolina (NESCent, http://nescent.org).
NESCent is a National Science Foundation funded research center managed by Duke University, the University of North Carolina at Chapel Hill and North Carolina State University on behalf of the international evolutionary biology community. NESCent facilitates synthetic research by bringing together diverse expertise, data, tools and concepts (Sidlauskas et al. 2009). In addition to a resident population of 20-30 scientists, the Center hosts over 800 visitors a year. An informatics staff is on-site to support resident and visiting scientists' needs in high-performance computing, electronic collaboration, scientific software and databases; this includes custom software development for a limited number of high-impact projects. NESCent's informatics training program includes a rotating series of open-application summer courses, ad-hoc short courses for resident scientists, and remote internships (including past participation in the Google Summer of Code).

The training coordinator and bioinformatics project manager will provide oversight to the Center's training activities. The incumbent will also serve as the interface between scientists and software developers at NESCent. The position provides extensive opportunities for collaboration and intellectual engagement with both NESCent-sponsored scientists and informatics staff; however, this is not an independent research position. The incumbent will report to the Director, while overseeing the work of a small informatics team and coordinating activities among the Center's science, education and informatics programs.

Responsibilities:
* 50% - Consult with sponsored scientists (including scientists in residence and working group participants) about informatics resources and needs.
  Manage software product development by gathering requirements from scientists, participating in conceptual design, monitoring implementation progress and product quality, facilitating communication between software developers and scientists, and researching software solutions.
* 25% - Oversee NESCent's course curriculum by identifying opportunities for onsite or online informatics courses that satisfy demand for advanced training of resident and visiting scientists, recruiting instructors, providing guidance to instructors in developing course syllabi, coordinating logistical and technical support requirements, conducting assessments, and serving as a liaison to course organizers at other institutions.
* 25% - Assist in the management of NESCent's summer informatics intern program, by coordinating the recruitment, application & review process for students, communicating expectations to students and mentors, monitoring student progress, documenting student outcomes, and performing assessments.

Education:
Required: M.S. in Biology, Bioinformatics, or a related field.
Preferred: Ph.D. and two years postdoctoral experience in evolutionary biology, or an equivalent combination of relevant education and/or experience.

Experience:
Required: Excellent communication, interpersonal, and organizational skills. Experience with computationally oriented scientific research.
Preferred: At least two years in development of databases and open source software. Organization, coordination, development and delivery of courses and workshops appropriate for graduate-level participants.

Terms of Employment:
Salary will be competitive and commensurate with experience. As a full-time employee, the incumbent will receive Duke University's benefits package (http://hr.duke.edu/benefits/main.html). The position is available immediately and will remain open until filled. The position is currently funded through November 2014, contingent on annual renewal of the Center by the NSF.
How to Apply:
Please send a C.V., including contact information for three references, and a brief statement of interest to Allen Rodrigo, Director, NESCent, at a.rodrigo at nescent.org. Inquiries about suitability for the position are welcome.

Duke University is an Equal Opportunity/Affirmative Action employer.

Additional information about NESCent: http://www.nescent.org

References:
Sidlauskas B, Ganapathy G, Hazkani-Covo E, Jenkins KP, Lapp H, McCall LW, Price S, Scherle R, Spaeth PA, Kidd DM (2009) Linking Big: The Continuing Promise of Evolutionary Synthesis. Evolution. http://dx.doi.org/10.1111/j.1558-5646.2009.00892.x

From lpritc at scri.ac.uk Wed Mar 17 04:20:30 2010
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Wed, 17 Mar 2010 08:20:30 +0000
Subject: [Biopython] comparing micro array data
In-Reply-To: <320fb6e01003161029i5feddf8ck76ba2b9ecd2056f2@mail.gmail.com>
Message-ID:

Hi,

On 16/03/2010 Tuesday, March 16, 17:29, "Peter" wrote:

> On Tue, Mar 16, 2010 at 4:56 PM, Sean Davis wrote:
>>
>> If you don't mind a couple of questions, for curiosity sake, how big is the
>> genome of model organism? And what size are the arrays, in terms of
>> probes?
>
> Also, what kind of organism? e.g. Plant, animal, bacteria? This will
> make a difference for the number of papers you'll find doing this kind
> of thing in the literature.

And the type of analysis that's being done, too: human aCGH (lots of references) tends to concentrate on copy number variation and SNP identification, while bacterial aCGH (not so many) focuses largely on presence/absence of putative orthologues.

> On Tue, Mar 16, 2010 at 4:49 PM, Vincent Davis wrote:
>> I actually was brought into this project by a team from an university. They
>> know lots including that this is a difficult problem. They did not have any
>> references as to how others have solved this problem with whatever success
>> was possible.
Since I know python, biopython has been my first choice to ask >> other smart people :) > > For a recent example using microarrays for cross-species comparison > (aka microarray comparative genomic hybridisation) in bacteria you > might want to read Leighton's paper (and the references within - which > include work on humans): > > http://www.ncbi.nlm.nih.gov/pubmed/19696881 > > You can probably guess why he asked if you were doing array CGH ;) And I was just about to blow my own trumpet, too ;) If you've got any questions that are specifically about the paper, I'm happy to take them off-list. L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. 
Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any).
______________________________________________________

From lpritc at scri.ac.uk Wed Mar 17 05:26:22 2010
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Wed, 17 Mar 2010 09:26:22 +0000
Subject: [Biopython] comparing micro array data
In-Reply-To:
Message-ID:

Hi Vincent,

I've not read this yet, but it might be useful to you:

http://zetoc.mimas.ac.uk/wzgw?db=etoc&terms=RN267048680&field=zid

L.

--
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405

From mitlox at op.pl Wed Mar 17 06:08:24 2010
From: mitlox at op.pl (xyz)
Date: Wed, 17 Mar 2010 20:08:24 +1000
Subject: [Biopython] sort fasta file
Message-ID: <20100317200824.5f363f77@wp01>

Hello,
I would like sort multiple fasta file depends on the sequence length, ie. from the read with longest sequence to the read with the shortest sequence.

I have tried to do it but I do not how to sort the records depends on the sequence length.

from Bio import SeqIO

handle = open("example.fasta", "rU")
records = list(SeqIO.parse(handle, "fasta"))
records.sort(reverse=True)

Thank you in advance.

Best regards,

From biopython at maubp.freeserve.co.uk Wed Mar 17 06:22:48 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 17 Mar 2010 10:22:48 +0000
Subject: [Biopython] sort fasta file
In-Reply-To: <20100317200824.5f363f77@wp01>
References: <20100317200824.5f363f77@wp01>
Message-ID: <320fb6e01003170322j6693e6b0n102190ae712b10ba@mail.gmail.com>

On Wed, Mar 17, 2010 at 10:08 AM, xyz wrote:
> Hello,
> I would like sort multiple fasta file depends on the sequence length,
> ie. from the read with longest sequence to the read with the shortest
> sequence.
>
> I have tried to do it but I do not how to sort the records depends on
> the sequence length.
>
> from Bio import SeqIO
>
> handle = open("example.fasta", "rU")
> records = list(SeqIO.parse(handle, "fasta"))
> records.sort(reverse=True)
>
> Thank you in advance.
>
> Best regards,

If you can hold all the records in memory at once (which it looks like you can) then this is pretty easy. You need to do a custom sort - the built in list help is a bit terse:

>>> help([].sort)
Help on built-in function sort:

sort(...)
    L.sort(cmp=None, key=None, reverse=False) -- stable sort *IN PLACE*;
    cmp(x, y) -> -1, 0, 1

You need to pass in a function as the cmp argument, which will take two objects (here SeqRecords) and return -1, 0 or 1. The concise way to do this is with a lambda, and reuse the built-in function cmp but acting on the length of the records.

For example,

from Bio import SeqIO
handle = open("example.fasta", "rU")
records = list(SeqIO.parse(handle, "fasta"))
handle.close()
records.sort(cmp=lambda x,y: cmp(len(x), len(y)))
#records.sort(reverse=True)
out_handle = open("sorted.fasta", "w")
SeqIO.write(records, out_handle, "fasta")
out_handle.close()

Peter

From mitlox at op.pl Wed Mar 17 08:01:35 2010
From: mitlox at op.pl (xyz)
Date: Wed, 17 Mar 2010 22:01:35 +1000
Subject: [Biopython] sort fasta file
In-Reply-To: <320fb6e01003170322j6693e6b0n102190ae712b10ba@mail.gmail.com>
References: <20100317200824.5f363f77@wp01> <320fb6e01003170322j6693e6b0n102190ae712b10ba@mail.gmail.com>
Message-ID: <20100317220135.3c12e3c4@wp01>

On Wed, 17 Mar 2010 10:22:48 +0000
Peter wrote:
> For example,
>
> handle = open("example.fasta", "rU")
> records = list(SeqIO.parse(handle, "fasta"))
> handle.close()
> records.sort(cmp=lambda x,y: cmp(len(x), len(y)))
> #records.sort(reverse=True)
> out_handle = open("sorted.fasta", "w")
> SeqIO.write(records, out_handle, "fasta")
> out_handle.close()
>
> Peter

Thank you for the code. I only changed this and it works.

records.sort(cmp=lambda x,y: len(y.seq) - len(x.seq))

If I could not hold all the records in memory at once what could I do?

From crosvera at gmail.com Wed Mar 17 13:08:21 2010
From: crosvera at gmail.com (Carlos Ríos V.)
Date: Wed, 17 Mar 2010 14:08:21 -0300
Subject: [Biopython] BioPython GSOC 2010
In-Reply-To: <6a2880081003101051h6bb3e732s5176c8c3c6a23c19@mail.gmail.com>
References: <6a2880081003101051h6bb3e732s5176c8c3c6a23c19@mail.gmail.com>
Message-ID: <1268845701.2161.10.camel@cabernet>

Hello people,

I'm very interesting in this idea:
http://biopython.org/wiki/Google_Summer_of_Code#PDB-Tidy:_command-line_tools_for_manipulating_PDB_files

I have some experience with the Bio.PDB Module, and I think that would be a very useful tool for labs.

Brad Chapman wrote an e-mail that said that we have to demonstrate our knowledge of the project and open source coding capabilities, where I have to show you that?

Regards.

--
http://crosvera.blogspot.com

Carlos Ríos V.
Estudiante de Ing. (E) en Computación e Informática.
Universidad del Bío-Bío
VIII Región, Chile

Linux user number 425502

From eric.talevich at gmail.com Wed Mar 17 14:32:44 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Wed, 17 Mar 2010 14:32:44 -0400
Subject: [Biopython] sort fasta file
Message-ID: <3f6baf361003171132s4ec12e4bw12d80e2a5edf6977@mail.gmail.com>

xyz wrote:
>
> Hello,
> I would like sort multiple fasta file depends on the sequence length,
> ie. from the read with longest sequence to the read with the shortest
> sequence.
>
> I have tried to do it but I do not how to sort the records depends on
> the sequence length.
>
> [...]
>
> If I could not hold all the records in memory at once what could I do?
>

There's also a program called uclust which can sort reads by sequence length very quickly:
http://www.drive5.com/uclust/

It's designed for clustering short reads, but it includes a feature to sort sequences by decreasing length. I think it can handle files larger than available RAM, too, though I haven't tested that.
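
For readers following this thread on current Python: the cmp argument to list.sort() used in the code above was removed in Python 3, so the same longest-first ordering is usually expressed with a key function instead. What follows is only a minimal sketch for the in-memory case. The helper names sort_longest_first and sort_fasta_by_length are hypothetical, and the second helper assumes a reasonably recent Biopython in which SeqIO.parse and SeqIO.write accept file names as well as handles.

```python
def sort_longest_first(records):
    """Return a new list of records ordered from longest to shortest.

    Works on anything that supports len(), such as SeqRecord objects
    or plain strings. Records of equal length keep their input order.
    """
    return sorted(records, key=len, reverse=True)


def sort_fasta_by_length(in_path, out_path):
    """Read a FASTA file into memory and write it back longest-first.

    Hypothetical helper: assumes Biopython is installed. The import is
    local so that sort_longest_first stays usable without it.
    """
    from Bio import SeqIO

    records = sort_longest_first(SeqIO.parse(in_path, "fasta"))
    # SeqIO.write returns the number of records written.
    return SeqIO.write(records, out_path, "fasta")


if __name__ == "__main__":
    # Plain strings stand in for sequence records in this demonstration.
    print(sort_longest_first(["ACGT", "A", "ACGTAC", "AC"]))
```

Using key=len avoids writing a comparison function at all, and reverse=True gives the decreasing order asked for in the original question. This still holds every record in memory, so it does not address the large-file case.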
-Eric

From biopython at maubp.freeserve.co.uk Thu Mar 18 06:44:09 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 18 Mar 2010 10:44:09 +0000
Subject: [Biopython] sort fasta file
In-Reply-To: <20100317220135.3c12e3c4@wp01>
References: <20100317200824.5f363f77@wp01> <320fb6e01003170322j6693e6b0n102190ae712b10ba@mail.gmail.com> <20100317220135.3c12e3c4@wp01>
Message-ID: <320fb6e01003180344n47fc9ba3y54c7284fc6747e25@mail.gmail.com>

On Wed, Mar 17, 2010 at 12:01 PM, xyz wrote:
> On Wed, 17 Mar 2010 10:22:48 +0000
> Peter wrote:
>> For example,
>>
>> handle = open("example.fasta", "rU")
>> records = list(SeqIO.parse(handle, "fasta"))
>> handle.close()
>> records.sort(cmp=lambda x,y: cmp(len(x), len(y)))
>> #records.sort(reverse=True)
>> out_handle = open("sorted.fasta", "w")
>> SeqIO.write(records, out_handle, "fasta")
>> out_handle.close()
>>
>> Peter
>
> Thank you for the code. I only changed this and it works.
>
> records.sort(cmp=lambda x,y: len(y.seq) - len(x.seq))
>
> If I could not hold all the records in memory at once what could I do?

I would use Bio.SeqIO.index() to give random access to the records. You would also need to load and sort the record identifiers and the lengths. Something like this:

from Bio import SeqIO
#Get the lengths and ids, and sort on length
len_and_ids = sorted((len(rec), rec.id) for rec in \
                     SeqIO.parse(open("ls_orchid.fasta"),"fasta"))
#Once sorted only need the ids, so can free some memory
ids = [id for (length, id) in len_and_ids]
del len_and_ids
#Now prepare the index
record_index = SeqIO.index("ls_orchid.fasta", "fasta")
#Now prepare a generator expression to give the
#records one-by-one for output
records = (record_index[id] for id in ids)
#Finally write these to a file
handle = open("sorted.fasta", "w")
count = SeqIO.write(records, handle, "fasta")
handle.close()
print("Sorted %i records" % count)

That code should work for any file format supported by the Bio.SeqIO parse, index and write functions (e.g.
GenBank files, FASTQ, etc).

Notice that it actually reads through the input file twice, once to get the ids and lengths, and once to build the index (getting the ids and file offsets). If you wanted to get a bit more low level you could do this in a single pass - but it would be more effort than using the SeqIO functions.

I wonder if this example is useful enough to go in the tutorial? What do you think?

Peter

From subhodeep.moitra at gmail.com Thu Mar 18 13:11:56 2010
From: subhodeep.moitra at gmail.com (subhodeep moitra)
Date: Thu, 18 Mar 2010 13:11:56 -0400
Subject: [Biopython] PDB Tidy
Message-ID: <6a2880081003181011j2bac6661gae5dbeec4a0eb7d5@mail.gmail.com>

Hi Carlos and BioPythoneers

Has anyone come across PDB-Tools : http://code.google.com/p/pdb-tools/

It's a python implementation to clean up pdbs and some other stuff. Might be useful for someone interested in the PDB-Tidy project. :) :)

Thanks
Subho

--
Subhodeep Moitra
First Year, Masters Student
Language Technologies Institute
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA , USA

From eric.talevich at gmail.com Thu Mar 18 15:25:01 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 18 Mar 2010 15:25:01 -0400
Subject: [Biopython] BioPython GSOC 2010
Message-ID: <3f6baf361003181225w8bce2fdg5bd7ba894a717ccf@mail.gmail.com>

On Wed, 17 Mar 2010 at 14:08:21 -0300, Carlos Rios "V." wrote:

> Hello people,
>
> I'm very interesting in this idea:
>
> http://biopython.org/wiki/Google_Summer_of_Code#PDB-Tidy:_command-line_tools_for_manipulating_PDB_files
>
> I have some experience with the Bio.PDB Module, and I think that would
> be a very useful tool for labs.
>
> Brad Chapman wrote an e-mail that said that we have to demonstrate our
> knowledge of the project and open source coding capabilities, where I
> have to show you that?
>

Well, OBF has been accepted as a mentoring organization now:
http://socghop.appspot.com/gsoc/program/accepted_orgs/google/gsoc2010

So I'd recommend getting yourself set up on GitHub -- other mentoring organizations use git too, and it helps your application to show that you're already familiar with the build tools.

Carlos, I see that you have plenty of code that you're willing to share, currently distributed as tarballs from your blog. You could start by publishing some selected projects on GitHub and playing around with it a little there, as well as making your own fork of Biopython.
For bonus points, once you have your own Biopython development branch, see if you can write a patch for any of the open issues on Bugzilla:
http://bugzilla.open-bio.org/buglist.cgi?product=Biopython&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED

This would all look great on your GSoC application.

Thanks for your interest, and best of luck!

-Eric

From p.j.a.cock at googlemail.com Thu Mar 18 18:03:09 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 18 Mar 2010 22:03:09 +0000
Subject: [Biopython] Google Summer of Code is *ON* for OBF projects!
In-Reply-To: <4BA29706.8040606@cornell.edu>
References: <4BA29706.8040606@cornell.edu>
Message-ID: <320fb6e01003181503j7e3030aao7bce7ebf4d8be06@mail.gmail.com>

Good news for GSoC 2010 :)

---------- Forwarded message ----------
From: Robert Buels
Date: Thu, Mar 18, 2010 at 9:11 PM
Subject: Google Summer of Code is *ON* for OBF projects!

Hi all,

Great news: Google announced today that the Open Bioinformatics Foundation has been accepted as a mentoring organization for this summer's Google Summer of Code!

GSoC is a Google-sponsored student internship program for open-source projects, open to students from around the world (not just US residents). Students are paid a $5000 USD stipend to work as a developer on an open-source project for the summer. For more on GSoC, see GSoC 2010 FAQ at http://tinyurl.com/yzemdfo

Student applications are due April 9, 2010 at 19:00 UTC. Students who are interested in participating should look at the OBF's GSoC page at http://open-bio.org/wiki/Google_Summer_of_Code, which lists project ideas, and who to contact about applying.

For current developers on OBF projects, please consider volunteering to be a mentor if you have not already, and contribute project ideas. Just list your name and project ideas on OBF wiki and on the relevant project's GSoC wiki page.

Thanks to all who helped make OBF's application to GSoC a success, and let's have a great, productive summer of code!
Rob Buels OBF GSoC 2010 Administrator From cjfields at illinois.edu Thu Mar 18 17:57:13 2010 From: cjfields at illinois.edu (Chris Fields) Date: Thu, 18 Mar 2010 16:57:13 -0500 Subject: [Biopython] Google Summer of Code is *ON* for OBF projects! References: <4BA29706.8040606@cornell.edu> Message-ID: <21A0665D-C3CA-4830-A8F7-A989C4D23627@illinois.edu> (forwarding to the BioPython list, as the original post is still clearing the OBF mail filters) Hi all, Great news: Google announced today that the Open Bioinformatics Foundation has been accepted as a mentoring organization for this summer's Google Summer of Code! GSoC is a Google-sponsored student internship program for open-source projects, open to students from around the world (not just US residents). Students are paid a $5000 USD stipend to work as a developer on an open-source project for the summer. For more on GSoC, see GSoC 2010 FAQ at http://tinyurl.com/yzemdfo Student applications are due April 9, 2010 at 19:00 UTC. Students who are interested in participating should look at the OBF's GSoC page at http://open-bio.org/wiki/Google_Summer_of_Code, which lists project ideas, and who to contact about applying. For current developers on OBF projects, please consider volunteering to be a mentor if you have not already, and contribute project ideas. Just list your name and project ideas on OBF wiki and on the relevant project's GSoC wiki page. Thanks to all who helped make OBF's application to GSoC a success, and let's have a great, productive summer of code! Rob Buels OBF GSoC 2010 Administrator From ap12 at sanger.ac.uk Fri Mar 19 15:19:05 2010 From: ap12 at sanger.ac.uk (Anne Pajon) Date: Fri, 19 Mar 2010 19:19:05 +0000 Subject: [Biopython] zero-length feature Message-ID: Dear, I am having trouble writing out EMBL file for feature of size one. I've modified InsdcIO.py to fit my need. Because when I try to submit my file to EMBL, it comes back with this comment: badly formatted -- you need a .. between locations. 
def _insdc_location_string_ignoring_strand_and_subfeatures(feature): if feature.ref: ref = "%s:" % feature.ref else: ref = "" assert not feature.ref_db if feature.location.start == feature.location.end \ and isinstance(feature.location.end, SeqFeature.ExactPosition): #Special case, 12^13 gets mapped to location 12:12 #(a zero length slice, meaning the point between two letters) return "%s%i..%i" % (ref, feature.location.end.position+1, feature.location.end.position+1) else: #Typical case, e.g. 12..15 gets mapped to 11:15 return ref \ + _insdc_feature_position_string(feature.location.start, +1) \ + ".." + \ _insdc_feature_position_string(feature.location.end) But of course I am getting errors when running the tests: ====================================================================== FAIL: GenBank file to BioSQL and back to a GenBank file, NC_005816. ---------------------------------------------------------------------- Traceback (most recent call last): File "test_BioSQL.py", line 419, in test_NC_005816 self.loop(os.path.join(os.getcwd(), "GenBank", "NC_005816.gb"), "gb") File "test_BioSQL.py", line 481, in loop self.assert_(compare_record(old, new)) File "seq_tests_common.py", line 261, in compare_record if not compare_features(old.features, new.features): File "seq_tests_common.py", line 243, in compare_features if not compare_feature(old_f, new_f): File "seq_tests_common.py", line 98, in compare_feature raise e AssertionError: [5933:5933] -> [5933:5934] ====================================================================== ERROR: Write and read back AE017046.embl ---------------------------------------------------------------------- Traceback (most recent call last): File "test_SeqIO_features.py", line 777, in test_AE017046 write_read(os.path.join("EMBL", "AE017046.embl"), "embl", "gb") File "test_SeqIO_features.py", line 32, in write_read compare_records(gb_records, gb_records2) File "test_SeqIO_features.py", line 99, in compare_records if not 
compare_record(old,new,expect_minor_diffs): File "test_SeqIO_features.py", line 50, in compare_record if not compare_features(old.features, new.features): File "test_SeqIO_features.py", line 149, in compare_features if not compare_feature(old,new,ignore_sub_features): File "test_SeqIO_features.py", line 110, in compare_feature % (old.location, new.location, str(old), str(new))) ValueError: [5933:5933] versus [5933:5934]: type: variation location: [5933:5933] ref: None:None strand: 1 qualifiers: Key: note, Value: ['compared to AL109969'] Key: replace, Value: ['a'] vs: type: variation location: [5933:5934] ref: None:None strand: 1 qualifiers: Key: note, Value: ['compared to AL109969'] Key: replace, Value: ['a'] ====================================================================== ERROR: Write and read back NC_005816.gb ---------------------------------------------------------------------- Traceback (most recent call last): File "test_SeqIO_features.py", line 702, in test_NC_005816 write_read(os.path.join("GenBank", "NC_005816.gb"), "gb", "gb") File "test_SeqIO_features.py", line 32, in write_read compare_records(gb_records, gb_records2) File "test_SeqIO_features.py", line 99, in compare_records if not compare_record(old,new,expect_minor_diffs): File "test_SeqIO_features.py", line 50, in compare_record if not compare_features(old.features, new.features): File "test_SeqIO_features.py", line 149, in compare_features if not compare_feature(old,new,ignore_sub_features): File "test_SeqIO_features.py", line 110, in compare_feature % (old.location, new.location, str(old), str(new))) ValueError: [5933:5933] versus [5933:5934]: type: variation location: [5933:5933] ref: None:None strand: 1 qualifiers: Key: note, Value: ['compared to AL109969'] Key: replace, Value: ['a'] vs: type: variation location: [5933:5934] ref: None:None strand: 1 qualifiers: Key: note, Value: ['compared to AL109969'] Key: replace, Value: ['a'] 
====================================================================== ERROR: Write and read back SC10H5.embl ---------------------------------------------------------------------- Traceback (most recent call last): File "test_SeqIO_features.py", line 792, in test_SC10H5 write_read(os.path.join("EMBL", "SC10H5.embl"), "embl", "gb") File "test_SeqIO_features.py", line 32, in write_read compare_records(gb_records, gb_records2) File "test_SeqIO_features.py", line 99, in compare_records if not compare_record(old,new,expect_minor_diffs): File "test_SeqIO_features.py", line 50, in compare_record if not compare_features(old.features, new.features): File "test_SeqIO_features.py", line 149, in compare_features if not compare_feature(old,new,ignore_sub_features): File "test_SeqIO_features.py", line 110, in compare_feature % (old.location, new.location, str(old), str(new))) ValueError: [1800:1800] versus [1800:1801]: type: misc_feature location: [1800:1800] ref: None:None strand: 1 qualifiers: Key: note, Value: ['Zero-length feature added to test Bioperl parsing'] vs: type: misc_feature location: [1800:1801] ref: None:None strand: 1 qualifiers: Key: note, Value: ['Zero-length feature added to test Bioperl parsing'] ====================================================================== FAIL: Features: write/read simple between locations. ---------------------------------------------------------------------- Traceback (most recent call last): File "test_SeqIO_features.py", line 373, in test_between "10^11") AssertionError: '11..11' != '10^11' ---------------------------------------------------------------------- Ran 144 tests in 226.037 seconds FAILED (failures = 2) What could be a better solution? Thanks to let me know. Kind regards, Anne. 
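The last failure above ('11..11' != '10^11') suggests that the zero-length case is expected to be written with the caret ("between") notation rather than as a one-base range. A minimal sketch of that mapping, assuming plain integer positions: insdc_location_string is a hypothetical stand-alone helper, not part of Bio.SeqIO, and it ignores refs, strands and fuzzy positions.

```python
def insdc_location_string(start, end):
    """Format a 0-based, end-exclusive slice as an INSDC location string.

    Hypothetical helper for illustration only; start and end are plain ints.
    """
    if start == end:
        # Zero-length slice: the site between bases `end` and `end + 1`
        # in 1-based counting, e.g. 12:12 -> "12^13".
        return "%i^%i" % (end, end + 1)
    # Typical case, e.g. 11:15 -> "12..15".
    return "%i..%i" % (start + 1, end)


print(insdc_location_string(12, 12))
print(insdc_location_string(11, 15))
```

This follows the two comments in the function quoted above (12^13 maps to 12:12, and 12..15 maps to 11:15); whether the real writer should behave this way for every feature type is a separate question.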
--
Dr Anne Pajon - Pathogen Genomics, Team 81
Sanger Institute, Wellcome Trust Genome Campus, Hinxton
Cambridge CB10 1SA, United Kingdom
+44 (0)1223 494 798 (office) | +44 (0)7958 511 353 (mobile)

--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

From barendt at mail.med.upenn.edu Fri Mar 19 22:53:07 2010
From: barendt at mail.med.upenn.edu (Gregory Barendt)
Date: Fri, 19 Mar 2010 22:53:07 -0400
Subject: [Biopython] RNA Secondary structure
Message-ID: <12DC2C15-ABE8-4E65-BB01-45A7388701DC@mail.med.upenn.edu>

Does anyone know of good libraries for looking at RNA secondary structure? I'm looking for particular stem loops in particular locations in lots (hundreds of thousands) of sequences. Right now, I'm pretty inelegantly parsing the .ct file generated by UNAfold. I need to modify my search to be a little more flexible, so I'd much rather use an existing tool than continue to reinvent the wheel.

Any advice would be greatly appreciated.

Thanks,
Greg

From vincent at vincentdavis.net Fri Mar 19 23:56:46 2010
From: vincent at vincentdavis.net (Vincent Davis)
Date: Fri, 19 Mar 2010 21:56:46 -0600
Subject: [Biopython] quantile normalization method
Message-ID: <77e831101003192056o18009bdejc80235aa36dc6d28@mail.gmail.com>

Is there a quantile normalization method in Biopython? I searched but did not find one. If not, it looks straightforward; would it be of any interest to the community for me to contribute a method?

1. given n arrays of length p, form X of dimension p × n where each array is a column;
2. sort each column of X to give X_sort;
3. take the means across rows of X_sort and assign this mean to each element in the row to get X'_sort;
4. get X_normalized by rearranging each column of X'_sort to have the same ordering as original X

From
A comparison of normalization methods for high density oligonucleotide array data based on variance and bias
B. M. Bolstad, R. A. Irizarry, M. Astrand and T. P. Speed

*Vincent Davis 720-301-3003 *
vincent at vincentdavis.net
my blog | LinkedIn

From bartek at rezolwenta.eu.org Sat Mar 20 03:55:20 2010
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Sat, 20 Mar 2010 08:55:20 +0100
Subject: [Biopython] quantile normalization method
In-Reply-To: <77e831101003192056o18009bdejc80235aa36dc6d28@mail.gmail.com>
References: <77e831101003192056o18009bdejc80235aa36dc6d28@mail.gmail.com>
Message-ID: <8b34ec181003200055j76230b4uc1b4e0707b24afd1@mail.gmail.com>

On Sat, Mar 20, 2010 at 4:56 AM, Vincent Davis wrote:
> Is there a quantile normalization method in Biopython? I searched but did not
> find one. If not, it looks straightforward; would it be of any interest to the
> community for me to contribute a method?
>
> 1. given n arrays of length p, form X of dimension
> p × n where each array is a column;
> 2. sort each column of X to give X_sort;
> 3. take the means across rows of X_sort and assign this
> mean to each element in the row to get X'_sort;
> 4. get X_normalized by rearranging each column of
> X'_sort to have the same ordering as original X
>
> From
> A comparison of normalization methods for high
> density oligonucleotide array data based on
> variance and bias
> B. M. Bolstad, R. A. Irizarry, M. Astrand and T. P. Speed
>

Hi,

I don't think there is such a method available.

I myself use the original R implementation by Bolstad et al. It requires rpy2 and R to be installed. It can be achieved in a few lines of code:
import numpy
import rpy2.robjects as robjects
# ll = list of concatenated values to normalize
v = robjects.FloatVector(ll)
# numrows = number of vectors that made up ll
m = robjects.r['matrix'](v, nrow=numrows, byrow=True)
robjects.r('require("preprocessCore")')
normq = robjects.r('normalize.quantiles')
norm_a = numpy.array(normq(m))
# norm_a = normalized array
 
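For comparison, the four steps quoted above can be sketched in pure NumPy. This is only a minimal sketch under stated assumptions, not the preprocessCore implementation: the name quantile_normalize is hypothetical, samples are assumed to be in columns and probes in rows, missing values are not handled, and tied values simply take whatever order argsort gives them.

```python
import numpy as np


def quantile_normalize(x):
    """Quantile-normalize the columns of a 2-D array (hypothetical helper).

    Assumes samples in columns, probes in rows, and no missing values.
    """
    # Step 1: the p x n matrix, converted to float so the means are not truncated.
    x = np.asarray(x, dtype=np.float64)
    # Step 2: sort each column, remembering where every value came from.
    order = np.argsort(x, axis=0)
    x_sorted = np.take_along_axis(x, order, axis=0)
    # Step 3: the mean across each row of the sorted matrix.
    row_means = x_sorted.mean(axis=1)
    # Step 4: write the k-th row mean back to wherever the k-th smallest
    # value of each column originally sat.
    out = np.empty_like(x)
    np.put_along_axis(out, order, row_means[:, np.newaxis], axis=0)
    return out


example = np.array([[5, 4, 3],
                    [2, 1, 4],
                    [3, 6, 6],
                    [4, 2, 8]])
print(quantile_normalize(example))
```

After normalization every column holds the same set of values (the row means of the sorted matrix), arranged in that column's original rank order.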
If your method is a pure Python implementation which is comparably fast, I think it would be worth having in Biopython, since the method is (in my opinion) quite useful and it would remove the dependency on R from some of my scripts.

cheers
Bartek

From vincent at vincentdavis.net Sat Mar 20 13:16:37 2010
From: vincent at vincentdavis.net (Vincent Davis)
Date: Sat, 20 Mar 2010 11:16:37 -0600
Subject: [Biopython] quantile normalization method
In-Reply-To: <8b34ec181003200055j76230b4uc1b4e0707b24afd1@mail.gmail.com>
References: <77e831101003192056o18009bdejc80235aa36dc6d28@mail.gmail.com> <8b34ec181003200055j76230b4uc1b4e0707b24afd1@mail.gmail.com>
Message-ID: <77e831101003201016u32b29872ic71ca87654c45215@mail.gmail.com>

@Bartek Wilczynski
Could you test the following code against R, for speed and accuracy? I am using numpy, so the code starts with "import numpy as np".

I did not find any clear documentation on whether the Bolstad method, or quantile normalization methods in general, drop outliers. Any input here would be great.

I also have to thank Anne Archibald on the scipy mailing list for the fancy array indexing help.

    import numpy as np

    def quantile_normalization(anarray):
        """
        anarray with samples in the columns and probes across the rows
        """
        A = anarray
        AA = np.zeros_like(A)
        I = np.argsort(A, axis=0)
        AA[I, np.arange(A.shape[1])] = np.mean(A[I, np.arange(A.shape[1])], axis=1)[:, np.newaxis]
        return AA

*Vincent Davis 720-301-3003 *
vincent at vincentdavis.net
my blog | LinkedIn

On Sat, Mar 20, 2010 at 1:55 AM, Bartek Wilczynski wrote:
> On Sat, Mar 20, 2010 at 4:56 AM, Vincent Davis wrote:
>
>> Is there a quantile normalization method in Biopython? I searched but did
>> not find one. If not, it looks straightforward; would it be of any interest
>> to the community for me to contribute a method?
>>
>> 1. given n arrays of length p, form X of dimension
>> p × n where each array is a column;
>> 2. sort each column of X to give X_sort;
>> 3.
take the means across rows of X sort and assign this >> mean to each element in the row to get X sort ; >> 4. get X normalized by rearranging each column of >> X sort to have the same ordering as original X >> >> From >> A comparison of normalization methods for high >> density oligonucleotide array data based on >> variance and bias >> B. M. Bolstad 1,?, R. A. Irizarry 2, M. Astrand 3 and T. P. Speed 4, 5 >> ? >> > > Hi, > > I don't think there is such a method available. > > I'm myself using the original R implementation by Bolstad et al. It > requires rPy and R installed. It can be achieved in a few lines of code: > >
> import rpy2.robjects as robjects
> #ll = list of concatenated values to normalize
> v = robjects.FloatVector(ll)
> #numrows=number of vectors that made up ll
> m = robjects.r['matrix'](v, nrow = numrows, byrow=True)
> robjects.r('require("preprocessCore")')
> normq=robjects.r('normalize.quantiles')
> norm_a=numpy.array(normq(m))
> #norm_a=normalized array
>  
> > If your method is a pure python implementation which is comparably fast I
> think it would be worth to have it in Biopython since the method is (in my
> opinion) quite useful and it would remove the dependency on R from some of
> my scripts.
>
> cheers
> Bartek
>

From lgautier at gmail.com Sat Mar 20 14:05:42 2010
From: lgautier at gmail.com (Laurent Gautier)
Date: Sat, 20 Mar 2010 19:05:42 +0100
Subject: [Biopython] quantile normalization method
In-Reply-To:
References:
Message-ID: <4BA50E76.5040304@gmail.com>

Hi Bartek and Vincent,

A few comments:

A/
The algorithm is fairly straightforward, as you noted, but beware of details such as missing values, the ability to normalize against a target distribution, or ties when ranking (although I'd have to check whether those receive special treatment). The quantile normalization code in the R package "preprocessCore" is in C and might outperform a pure Python implementation.

B/
There is a variety of normalization methods in Bioconductor, and it might make sense to embrace it as a dependency (rather than reimplement it). I have bindings for Bioconductor up my sleeve, about to be distributed to a few people for testing. The public release might be around ISMB, BOSC time.

C/
norm_a = numpy.array(normq(m))
can be replaced by
norm_a = numpy.asarray(normq(m))
to improve performance whenever m is of substantial size (as no copy is made - see http://rpy.sourceforge.net/rpy2/doc-2.1/html/numpy.html#from-rpy2-to-numpy )

Best,
Laurent

On 3/20/10 5:00 PM, biopython-request at lists.open-bio.org wrote:
>> > Is there a quantile normalization method in biopython, I search but did not
>> > find. If not it looks straight forward would it be of any interest to the
>> > community for me to contribute a method
>> >
>> > 1. given n arrays of length p, form X of dimension
>> > p × n where each array is a column;
>> > 2. sort each column of X to give X_sort;
>> > 3.
take the means across rows of X sort and assign this >> > mean to each element in the row to get X sort ; >> > 4. get X normalized by rearranging each column of >> > X sort to have the same ordering as original X >> > >> > From >> > A comparison of normalization methods for high >> > density oligonucleotide array data based on >> > variance and bias >> > B. M. Bolstad 1,?, R. A. Irizarry 2, M. Astrand 3 and T. P. Speed 4, 5 >> > ? >> > > Hi, > > I don't think there is such a method available. > > I'm myself using the original R implementation by Bolstad et al. It requires > rPy and R installed. It can be achieved in a few lines of code: > >
> import rpy2.robjects as robjects
> #ll = list of concatenated values to normalize
> v = robjects.FloatVector(ll)
> #numrows=number of vectors that made up ll
> m = robjects.r['matrix'](v, nrow = numrows, byrow=True)
> robjects.r('require("preprocessCore")')
> normq=robjects.r('normalize.quantiles')
> norm_a=numpy.array(normq(m))
> #norm_a=normalized array
>   
> > If your method is a pure python implementation which is comparably fast I > think it would be worth to have it in Biopython since the method is (in my > opinion) quite useful and it would remove the dependency on R from some of > my scripts. > > cheers > Bartek > From vincent at vincentdavis.net Sat Mar 20 14:26:27 2010 From: vincent at vincentdavis.net (Vincent Davis) Date: Sat, 20 Mar 2010 12:26:27 -0600 Subject: [Biopython] quantile normalization method In-Reply-To: <4BA50E76.5040304@gmail.com> References: <4BA50E76.5040304@gmail.com> Message-ID: <77e831101003201126i22f72970x189b9f7a9335fb33@mail.gmail.com> > > @Laurent Gautier The algorithm is fairly straightforward, as you noted it, but beware of > details such missing values, ability to normalize against a target > distribution, or ties when ranking (although I'd have to check if those > receive a special treatment).The quantile normalization code in the R > package "preprocessCore" is in C and might outperform a pure Python > implementation. Not sure about speed. I have 84 microarrays samples with ~190,000 probes and it normalizes in 7 sec. I have no idea how fast R is or how many arrays are common to normalize. There is a variety of normalization methods in bioconductor, and it might > make sense to embrace it as a dependency (rather than reimplement it). I > have bindings for Bioconductor up my sleeve about to be distributed to few > people for testing. The public release might be around ISMB, BOSC time. I considered this and in the long run you might be right. But I don't know R and I placed more value on understanding the normalization than learning R. This is in part because there is little advantage in using R in the next steps of my analysis. Bindings seem like a good idea but they would be a black box to me. 
I guess for me since most of this is new the value of implementing my own normalization in both learning more about python and understanding the normalization out ways the benefits of implementing it in R. As a side question, why use biopython, are there ways in which it is better than R ? For me it is purely that I know python (a little) and can nothing about R. Sure If I am just doing through step by step instruction from a bioconductor use manual I am fine but once I what to do something new am am lost. Not that I can't learn I am just prioritizing my learning. And thanks for this > norm_a = numpy.array(normq(m)) > > can be replaced by > > norm_a = numpy.as_array(normq(m)) > > to improve performances whenever m is of substantial size (as no copy is > made - see > http://rpy.sourceforge.net/rpy2/doc-2.1/html/numpy.html#from-rpy2-to-numpy > ) > *Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn On Sat, Mar 20, 2010 at 12:05 PM, Laurent Gautier wrote: > Hi Bartek and Vincent, > > Few comments: > > A/ > > The algorithm is fairly straightforward, as you noted it, but beware of > details such missing values, ability to normalize against a target > distribution, or ties when ranking (although I'd have to check if those > receive a special treatment). > The quantile normalization code in the R package "preprocessCore" is in C > and might outperform a pure Python implementation. > > B/ > > There is a variety of normalization methods in bioconductor, and it might > make sense to embrace it as a dependency (rather than reimplement it). I > have bindings for Bioconductor up my sleeve about to be distributed to few > people for testing. The public release might be around ISMB, BOSC time. 
> > C/ > > > norm_a = numpy.array(normq(m)) > > can be replaced by > > norm_a = numpy.as_array(normq(m)) > > to improve performances whenever m is of substantial size (as no copy is > made - see > http://rpy.sourceforge.net/rpy2/doc-2.1/html/numpy.html#from-rpy2-to-numpy) > > > > Best, > > > Laurent > > > > > On 3/20/10 5:00 PM, biopython-request at lists.open-bio.org wrote: > >> > Is there a quantile normalization method in biopython, I search but did >>> not >>> > find. If not it looks straight forward would it be of any interest to >>> the >>> > community for me to contribute a method >>> > >>> > 1. given n arrays of length p, form X of dimension >>> > p ? n where each array is a column; >>> > 2. sort each column of X to give X sort ; >>> > 3. take the means across rows of X sort and assign this >>> > mean to each element in the row to get X sort ; >>> > 4. get X normalized by rearranging each column of >>> > X sort to have the same ordering as original X >>> > >>> > From >>> > A comparison of normalization methods for high >>> > density oligonucleotide array data based on >>> > variance and bias >>> > B. M. Bolstad 1,?, R. A. Irizarry 2, M. Astrand 3 and T. P. Speed 4, 5 >>> > ? >>> > >>> >> Hi, >> >> I don't think there is such a method available. >> >> I'm myself using the original R implementation by Bolstad et al. It >> requires >> rPy and R installed. It can be achieved in a few lines of code: >> >>
>> import rpy2.robjects as robjects
>> #ll = list of concatenated values to normalize
>> v = robjects.FloatVector(ll)
>> #numrows=number of vectors that made up ll
>> m = robjects.r['matrix'](v, nrow = numrows, byrow=True)
>> robjects.r('require("preprocessCore")')
>> normq=robjects.r('normalize.quantiles')
>> norm_a=numpy.array(normq(m))
>> #norm_a=normalized array
>>  
>> >> If your method is a pure python implementation which is comparably fast I >> think it would be worth to have it in Biopython since the method is (in my >> opinion) quite useful and it would remove the dependency on R from some of >> my scripts. >> >> cheers >> Bartek >> >> > From lgautier at gmail.com Sat Mar 20 15:30:45 2010 From: lgautier at gmail.com (Laurent Gautier) Date: Sat, 20 Mar 2010 20:30:45 +0100 Subject: [Biopython] quantile normalization method In-Reply-To: <77e831101003201126i22f72970x189b9f7a9335fb33@mail.gmail.com> References: <4BA50E76.5040304@gmail.com> <77e831101003201126i22f72970x189b9f7a9335fb33@mail.gmail.com> Message-ID: <4BA52265.9060908@gmail.com> On 3/20/10 7:26 PM, Vincent Davis wrote: > @Laurent Gautier > > The algorithm is fairly straightforward, as you noted it, but beware > of details such missing values, ability to normalize against a > target distribution, or ties when ranking (although I'd have to > check if those receive a special treatment).The quantile > normalization code in the R package "preprocessCore" is in C and > might outperform a pure Python implementation. > > > Not sure about speed. I have 84 microarrays samples with ~190,000 probes > and it normalizes in 7 sec. I have no idea how fast R is or how many > arrays are common to normalize. So speed is not an issue for your use-case; even a 10x speedup might not justify the effort required to move to C, as this operation is performed once in a while (once per dataset mostly). I am not sure there is a "common" number. When still working with arrays, I can find myself with several hundred arrays with ~2 million probes each. > There is a variety of normalization methods in bioconductor, and it > might make sense to embrace it as a dependency (rather than > reimplement it). I have bindings for Bioconductor up my sleeve about > to be distributed to few people for testing. The public release > might be around ISMB, BOSC time. 
> > > I considered this and in the long run you might be right. But I don't > know R and I placed more value on understanding the normalization than > learning R. This is in part because there is little advantage in using R > in the next steps of my analysis. Surprising, but you'll know best. > Bindings seem like a good idea but > they would be a black box to me. I guess for me since most of this is > new the value of implementing my own normalization in both learning more > about python and understanding the normalization out ways the benefits > of implementing it in R. Everyone's mileage will vary. I often like building on existing libraries (although I frequently read how methods work): this makes my palette of tools richer than if I had to reimplement everything, and gives me time to create my own. Having this said, learning a language by implementing is a great way to go. > As a side question, why use biopython, are there ways in which it is > better than R ? In short (and therefore with some imprecision and/or distortion), Biopython is a "Python package" (i.e., collection of modules) for bioinformatics, with a forte in handling a number of bioinformatics file formats. R is a language for statistics, data analysis and graphics. > For me it is purely that I know python (a little) and can nothing about > R. Sure If I am just doing through step by step instruction from > a bioconductor use manual I am fine but once I what to do something new > am am lost. Not that I can't learn I am just prioritizing my learning. Then the idea is that you consider R/bioconductor as a Python library. Should you want something new, you can then implement it in Python. 
Laurent > > And thanks for this > > norm_a = numpy.array(normq(m)) > > can be replaced by > > norm_a = numpy.as_array(normq(m)) > > to improve performances whenever m is of substantial size (as no > copy is made - see > http://rpy.sourceforge.net/rpy2/doc-2.1/html/numpy.html#from-rpy2-to-numpy) > > > > > > *Vincent Davis > 720-301-3003 * > vincent at vincentdavis.net > > my blog | LinkedIn > > > > > On Sat, Mar 20, 2010 at 12:05 PM, Laurent Gautier > wrote: > > Hi Bartek and Vincent, > > Few comments: > > A/ > > The algorithm is fairly straightforward, as you noted it, but beware > of details such missing values, ability to normalize against a > target distribution, or ties when ranking (although I'd have to > check if those receive a special treatment). > The quantile normalization code in the R package "preprocessCore" is > in C and might outperform a pure Python implementation. > > B/ > > There is a variety of normalization methods in bioconductor, and it > might make sense to embrace it as a dependency (rather than > reimplement it). I have bindings for Bioconductor up my sleeve about > to be distributed to few people for testing. The public release > might be around ISMB, BOSC time. > > C/ > > > norm_a = numpy.array(normq(m)) > > can be replaced by > > norm_a = numpy.as_array(normq(m)) > > to improve performances whenever m is of substantial size (as no > copy is made - see > http://rpy.sourceforge.net/rpy2/doc-2.1/html/numpy.html#from-rpy2-to-numpy > ) > > > > Best, > > > Laurent > > > > > On 3/20/10 5:00 PM, biopython-request at lists.open-bio.org > wrote: > > > Is there a quantile normalization method in biopython, I > search but did not > > find. If not it looks straight forward would it be of > any interest to the > > community for me to contribute a method > > > > 1. given n arrays of length p, form X of dimension > > p ? n where each array is a column; > > 2. sort each column of X to give X sort ; > > 3. 
take the means across rows of X sort and assign this > > mean to each element in the row to get X sort ; > > 4. get X normalized by rearranging each column of > > X sort to have the same ordering as original X > > > > From > > A comparison of normalization methods for high > > density oligonucleotide array data based on > > variance and bias > > B. M. Bolstad 1,?, R. A. Irizarry 2, M. Astrand 3 and T. > P. Speed 4, 5 > > ? > > > > Hi, > > I don't think there is such a method available. > > I'm myself using the original R implementation by Bolstad et al. > It requires > rPy and R installed. It can be achieved in a few lines of code: > >
>         import rpy2.robjects as robjects
>         #ll = list of concatenated values to normalize
>         v = robjects.FloatVector(ll)
>         #numrows=number of vectors that made up ll
>         m = robjects.r['matrix'](v, nrow = numrows, byrow=True)
>         robjects.r('require("preprocessCore")')
>         normq=robjects.r('normalize.quantiles')
>         norm_a=numpy.array(normq(m))
>         #norm_a=normalized array
>         
> > If your method is a pure python implementation which is > comparably fast I > think it would be worth to have it in Biopython since the method > is (in my > opinion) quite useful and it would remove the dependency on R > from some of > my scripts. > > cheers > Bartek > > > From vincent at vincentdavis.net Sat Mar 20 15:35:33 2010 From: vincent at vincentdavis.net (Vincent Davis) Date: Sat, 20 Mar 2010 13:35:33 -0600 Subject: [Biopython] quantile normalization method In-Reply-To: <4BA52265.9060908@gmail.com> References: <4BA50E76.5040304@gmail.com> <77e831101003201126i22f72970x189b9f7a9335fb33@mail.gmail.com> <4BA52265.9060908@gmail.com> Message-ID: <77e831101003201235w41db2061q95871d17f797136@mail.gmail.com> @Laurent Gautier, I agree with everything you said :) What I could really use is some to test the python code against R Just to help very if that the results are not completely wrong. *Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn On Sat, Mar 20, 2010 at 1:30 PM, Laurent Gautier wrote: > On 3/20/10 7:26 PM, Vincent Davis wrote: > >> @Laurent Gautier >> >> The algorithm is fairly straightforward, as you noted it, but beware >> of details such missing values, ability to normalize against a >> target distribution, or ties when ranking (although I'd have to >> check if those receive a special treatment).The quantile >> normalization code in the R package "preprocessCore" is in C and >> might outperform a pure Python implementation. >> >> >> Not sure about speed. I have 84 microarrays samples with ~190,000 probes >> and it normalizes in 7 sec. I have no idea how fast R is or how many >> arrays are common to normalize. >> > > So speed is not an issue for your use-case; even a 10x speedup might not > justify the effort required to move to C, as this operation is performed > once in a while (once per dataset mostly). > > I am not sure there is a "common" number. 
When still working with arrays, I > can find myself with several hundred arrays with ~2 million probes each. > > > There is a variety of normalization methods in bioconductor, and it >> might make sense to embrace it as a dependency (rather than >> reimplement it). I have bindings for Bioconductor up my sleeve about >> to be distributed to few people for testing. The public release >> might be around ISMB, BOSC time. >> >> >> I considered this and in the long run you might be right. But I don't >> know R and I placed more value on understanding the normalization than >> learning R. This is in part because there is little advantage in using R >> in the next steps of my analysis. >> > > Surprising, but you'll know best. > > > Bindings seem like a good idea but >> they would be a black box to me. I guess for me since most of this is >> new the value of implementing my own normalization in both learning more >> about python and understanding the normalization out ways the benefits >> of implementing it in R. >> > > Everyone's mileage will vary. I often like building on existing libraries > (although I frequently read how methods work): this makes my palette of > tools richer than if I had to reimplement everything, and gives me time to > create my own. > Having this said, learning a language by implementing is a great way to go. > > > As a side question, why use biopython, are there ways in which it is >> better than R ? >> > > In short (and therefore with some imprecision and/or distortion), Biopython > is a "Python package" (i.e., collection of modules) for bioinformatics, with > a forte in handling a number of bioinformatics file formats. R is a language > for statistics, data analysis and graphics. > > > For me it is purely that I know python (a little) and can nothing about >> R. Sure If I am just doing through step by step instruction from >> a bioconductor use manual I am fine but once I what to do something new >> am am lost. 
Not that I can't learn I am just prioritizing my learning. >> > > Then the idea is that you consider R/bioconductor as a Python library. > Should you want something new, you can then implement it in Python. > > > > Laurent > > >> And thanks for this >> >> norm_a = numpy.array(normq(m)) >> >> can be replaced by >> >> norm_a = numpy.as_array(normq(m)) >> >> to improve performances whenever m is of substantial size (as no >> copy is made - see >> >> http://rpy.sourceforge.net/rpy2/doc-2.1/html/numpy.html#from-rpy2-to-numpy >> ) >> >> >> >> >> >> *Vincent Davis >> 720-301-3003 * >> vincent at vincentdavis.net >> >> my blog | LinkedIn >> >> >> >> >> >> On Sat, Mar 20, 2010 at 12:05 PM, Laurent Gautier > > wrote: >> >> Hi Bartek and Vincent, >> >> Few comments: >> >> A/ >> >> The algorithm is fairly straightforward, as you noted it, but beware >> of details such missing values, ability to normalize against a >> target distribution, or ties when ranking (although I'd have to >> check if those receive a special treatment). >> The quantile normalization code in the R package "preprocessCore" is >> in C and might outperform a pure Python implementation. >> >> B/ >> >> There is a variety of normalization methods in bioconductor, and it >> might make sense to embrace it as a dependency (rather than >> reimplement it). I have bindings for Bioconductor up my sleeve about >> to be distributed to few people for testing. The public release >> might be around ISMB, BOSC time. >> >> C/ >> >> >> norm_a = numpy.array(normq(m)) >> >> can be replaced by >> >> norm_a = numpy.as_array(normq(m)) >> >> to improve performances whenever m is of substantial size (as no >> copy is made - see >> >> http://rpy.sourceforge.net/rpy2/doc-2.1/html/numpy.html#from-rpy2-to-numpy >> ) >> >> >> >> Best, >> >> >> Laurent >> >> >> >> >> On 3/20/10 5:00 PM, biopython-request at lists.open-bio.org >> wrote: >> >> > Is there a quantile normalization method in biopython, I >> search but did not >> > find. 
If not it looks straight forward would it be of >> any interest to the >> > community for me to contribute a method >> > >> > 1. given n arrays of length p, form X of dimension >> > p ? n where each array is a column; >> > 2. sort each column of X to give X sort ; >> > 3. take the means across rows of X sort and assign this >> > mean to each element in the row to get X sort ; >> > 4. get X normalized by rearranging each column of >> > X sort to have the same ordering as original X >> > >> > From >> > A comparison of normalization methods for high >> > density oligonucleotide array data based on >> > variance and bias >> > B. M. Bolstad 1,?, R. A. Irizarry 2, M. Astrand 3 and T. >> P. Speed 4, 5 >> > ? >> > >> >> Hi, >> >> I don't think there is such a method available. >> >> I'm myself using the original R implementation by Bolstad et al. >> It requires >> rPy and R installed. It can be achieved in a few lines of code: >> >>
>>        import rpy2.robjects as robjects
>>        #ll = list of concatenated values to normalize
>>        v = robjects.FloatVector(ll)
>>        #numrows=number of vectors that made up ll
>>        m = robjects.r['matrix'](v, nrow = numrows, byrow=True)
>>        robjects.r('require("preprocessCore")')
>>        normq=robjects.r('normalize.quantiles')
>>        norm_a=numpy.array(normq(m))
>>        #norm_a=normalized array
>>        
>> >> If your method is a pure python implementation which is >> comparably fast I >> think it would be worth to have it in Biopython since the method >> is (in my >> opinion) quite useful and it would remove the dependency on R >> from some of >> my scripts. >> >> cheers >> Bartek >> >> >> >> > From anaryin at gmail.com Sat Mar 20 21:38:07 2010 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Sat, 20 Mar 2010 18:38:07 -0700 Subject: [Biopython] GSOC Bio.PDB Project Message-ID: Hello All, I've been using BioPython for a while now and I guess I'm a spoiled brat for never giving anything back :) Also, I've been wanting to participate in the last two years of GSOC but I've never found a project that I felt adequate to my knowledge (ie. usually too hard). Thus, with this year's Bio.PDB project, I guess I can give it a try to be accepted. I don't have that much experience with coding in collaborative environments, nor I have in big projects, but that's exactly what I'm looking forward to earn. I know my way around BioPython and the Bio.PDB module, and I've had enough headaches dealing with PDB files in the past couple of years to nurture hatred up to a certain level :) And I have a B.Sc in Biochem, which is a double-edged knife for comp. biology. With this said, I guess I have to wait for a reply. If you need extra info, feel free to email me. 
João Rodrigues @ http://stanford.edu/~joaor/

From vincent at vincentdavis.net Mon Mar 22 00:02:20 2010
From: vincent at vincentdavis.net (Vincent Davis)
Date: Sun, 21 Mar 2010 22:02:20 -0600
Subject: [Biopython] quantile normalization method
In-Reply-To: <77e831101003201235w41db2061q95871d17f797136@mail.gmail.com>
References: <4BA50E76.5040304@gmail.com>
 <77e831101003201126i22f72970x189b9f7a9335fb33@mail.gmail.com>
 <4BA52265.9060908@gmail.com>
 <77e831101003201235w41db2061q95871d17f797136@mail.gmail.com>
Message-ID: <77e831101003212102s3850b60au553d19a719b4742c@mail.gmail.com>

I found a mistake: the np.zeros_like(A) array needs to be set as float64,
otherwise it is assumed to be int, so the final results would have been
rounded to int.

import numpy as np

def quantile_normalization(anarray):
    """
    anarray with samples in the columns and probes across the rows
    """
    # Convert to float64; assigning to .dtype would reinterpret the raw
    # bytes of an integer array instead of converting its values.
    A = np.asarray(anarray, dtype=np.float64)
    AA = np.zeros_like(A, dtype=np.float64)
    # I[r, c] is the row index of the r-th smallest value in column c.
    I = np.argsort(A, axis=0)
    cols = np.arange(A.shape[1])
    # Mean of each row of the column-sorted matrix, written back at the
    # original positions of the sorted values.
    AA[I, cols] = np.mean(A[I, cols], axis=1)[:, np.newaxis]
    return AA

*Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn

On Sat, Mar 20, 2010 at 1:35 PM, Vincent Davis wrote:
> @Laurent Gautier, I agree with everything you said :)
>
> What I could really use is someone to test the python code against R
> Just to help verify that the results are not completely wrong.
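For checking an implementation without R, a tiny hand-computable case may help. The following is a minimal pure-Python sketch of the four steps quoted from Bolstad et al.; the function name `quantile_normalize_reference` is hypothetical, and breaking ties by original position is an assumption of this sketch, not necessarily what preprocessCore does. Missing values are not handled.

```python
def quantile_normalize_reference(columns):
    """Quantile-normalize a list of equal-length columns (one per array).

    Ties are broken by original position (an assumption of this sketch).
    """
    n_rows = len(columns[0])
    if any(len(col) != n_rows for col in columns):
        raise ValueError("all columns must have the same length")
    # Step 2: for each column, the original indices in ascending value order.
    orders = [sorted(range(n_rows), key=col.__getitem__) for col in columns]
    # Step 3: mean across each row of the column-sorted matrix.
    row_means = [
        sum(col[order[rank]] for col, order in zip(columns, orders))
        / float(len(columns))
        for rank in range(n_rows)
    ]
    # Step 4: write the rank-th mean back where the rank-th smallest value was.
    result = [[0.0] * n_rows for _ in columns]
    for out, order in zip(result, orders):
        for rank, original_index in enumerate(order):
            out[original_index] = row_means[rank]
    return result


if __name__ == "__main__":
    arrays = [[5, 2, 3, 4], [4, 1, 4, 2], [3, 4, 6, 8]]
    for column in quantile_normalize_reference(arrays):
        print(column)
```

After normalization every column contains the same set of values (the row means), only in a different order, which is an easy property to assert against a faster NumPy or R result.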
> > *Vincent Davis > 720-301-3003 * > vincent at vincentdavis.net > my blog | LinkedIn > > > On Sat, Mar 20, 2010 at 1:30 PM, Laurent Gautier wrote: > >> On 3/20/10 7:26 PM, Vincent Davis wrote: >> >>> @Laurent Gautier >>> >>> The algorithm is fairly straightforward, as you noted it, but beware >>> of details such missing values, ability to normalize against a >>> target distribution, or ties when ranking (although I'd have to >>> check if those receive a special treatment).The quantile >>> normalization code in the R package "preprocessCore" is in C and >>> might outperform a pure Python implementation. >>> >>> >>> Not sure about speed. I have 84 microarrays samples with ~190,000 probes >>> and it normalizes in 7 sec. I have no idea how fast R is or how many >>> arrays are common to normalize. >>> >> >> So speed is not an issue for your use-case; even a 10x speedup might not >> justify the effort required to move to C, as this operation is performed >> once in a while (once per dataset mostly). >> >> I am not sure there is a "common" number. When still working with arrays, >> I can find myself with several hundred arrays with ~2 million probes each. >> >> >> There is a variety of normalization methods in bioconductor, and it >>> might make sense to embrace it as a dependency (rather than >>> reimplement it). I have bindings for Bioconductor up my sleeve about >>> to be distributed to few people for testing. The public release >>> might be around ISMB, BOSC time. >>> >>> >>> I considered this and in the long run you might be right. But I don't >>> know R and I placed more value on understanding the normalization than >>> learning R. This is in part because there is little advantage in using R >>> in the next steps of my analysis. >>> >> >> Surprising, but you'll know best. >> >> >> Bindings seem like a good idea but >>> they would be a black box to me. 
I guess for me since most of this is >>> new the value of implementing my own normalization in both learning more >>> about python and understanding the normalization out ways the benefits >>> of implementing it in R. >>> >> >> Everyone's mileage will vary. I often like building on existing libraries >> (although I frequently read how methods work): this makes my palette of >> tools richer than if I had to reimplement everything, and gives me time to >> create my own. >> Having this said, learning a language by implementing is a great way to >> go. >> >> >> As a side question, why use biopython, are there ways in which it is >>> better than R ? >>> >> >> In short (and therefore with some imprecision and/or distortion), >> Biopython is a "Python package" (i.e., collection of modules) for >> bioinformatics, with a forte in handling a number of bioinformatics file >> formats. R is a language for statistics, data analysis and graphics. >> >> >> For me it is purely that I know python (a little) and can nothing about >>> R. Sure If I am just doing through step by step instruction from >>> a bioconductor use manual I am fine but once I what to do something new >>> am am lost. Not that I can't learn I am just prioritizing my learning. >>> >> >> Then the idea is that you consider R/bioconductor as a Python library. >> Should you want something new, you can then implement it in Python. 
>> >> >> >> Laurent >> >> >>> And thanks for this >>> >>> norm_a = numpy.array(normq(m)) >>> >>> can be replaced by >>> >>> norm_a = numpy.as_array(normq(m)) >>> >>> to improve performances whenever m is of substantial size (as no >>> copy is made - see >>> >>> http://rpy.sourceforge.net/rpy2/doc-2.1/html/numpy.html#from-rpy2-to-numpy >>> ) >>> >>> >>> >>> >>> >>> *Vincent Davis >>> 720-301-3003 * >>> vincent at vincentdavis.net >>> >>> my blog | LinkedIn >>> >>> >>> >>> >>> >>> On Sat, Mar 20, 2010 at 12:05 PM, Laurent Gautier >> > wrote: >>> >>> Hi Bartek and Vincent, >>> >>> Few comments: >>> >>> A/ >>> >>> The algorithm is fairly straightforward, as you noted it, but beware >>> of details such missing values, ability to normalize against a >>> target distribution, or ties when ranking (although I'd have to >>> check if those receive a special treatment). >>> The quantile normalization code in the R package "preprocessCore" is >>> in C and might outperform a pure Python implementation. >>> >>> B/ >>> >>> There is a variety of normalization methods in bioconductor, and it >>> might make sense to embrace it as a dependency (rather than >>> reimplement it). I have bindings for Bioconductor up my sleeve about >>> to be distributed to few people for testing. The public release >>> might be around ISMB, BOSC time. >>> >>> C/ >>> >>> >>> norm_a = numpy.array(normq(m)) >>> >>> can be replaced by >>> >>> norm_a = numpy.as_array(normq(m)) >>> >>> to improve performances whenever m is of substantial size (as no >>> copy is made - see >>> >>> http://rpy.sourceforge.net/rpy2/doc-2.1/html/numpy.html#from-rpy2-to-numpy >>> ) >>> >>> >>> >>> Best, >>> >>> >>> Laurent >>> >>> >>> >>> >>> On 3/20/10 5:00 PM, biopython-request at lists.open-bio.org >>> wrote: >>> >>> > Is there a quantile normalization method in biopython, I >>> search but did not >>> > find. 
If not it looks straight forward would it be of
>>> any interest to the
>>> > community for me to contribute a method
>>> >
>>> > 1. given n arrays of length p, form X of dimension
>>> >    p × n where each array is a column;
>>> > 2. sort each column of X to give X_sort;
>>> > 3. take the means across rows of X_sort and assign this
>>> >    mean to each element in the row to get X'_sort;
>>> > 4. get X_normalized by rearranging each column of
>>> >    X'_sort to have the same ordering as original X
>>> >
>>> > From
>>> > A comparison of normalization methods for high
>>> > density oligonucleotide array data based on
>>> > variance and bias
>>> > B. M. Bolstad 1,?, R. A. Irizarry 2, M. Astrand 3 and T.
>>> P. Speed 4, 5
>>> > ?
>>> >
>>>
>>> Hi,
>>>
>>> I don't think there is such a method available.
>>>
>>> I'm myself using the original R implementation by Bolstad et al.
>>> It requires
>>> rPy and R installed. It can be achieved in a few lines of code:
>>>
>>>
>>>        import numpy
>>>        import rpy2.robjects as robjects
>>>        #ll = list of concatenated values to normalize
>>>        v = robjects.FloatVector(ll)
>>>        #numrows=number of vectors that made up ll
>>>        m = robjects.r['matrix'](v, nrow = numrows, byrow=True)
>>>        robjects.r('require("preprocessCore")')
>>>        normq=robjects.r('normalize.quantiles')
>>>        norm_a=numpy.array(normq(m))
>>>        #norm_a=normalized array
>>>        
>>> >>> If your method is a pure python implementation which is >>> comparably fast I >>> think it would be worth to have it in Biopython since the method >>> is (in my >>> opinion) quite useful and it would remove the dependency on R >>> from some of >>> my scripts. >>> >>> cheers >>> Bartek >>> >>> >>> >>> >> > From eric.talevich at gmail.com Mon Mar 22 00:12:36 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 22 Mar 2010 00:12:36 -0400 Subject: [Biopython] GSOC Bio.PDB Project In-Reply-To: References: Message-ID: <3f6baf361003212112o3a2ebe5bq50a7d59eae06492c@mail.gmail.com> On Sat, Mar 20, 2010 at 9:38 PM, Jo?o Rodrigues wrote: > Hello All, > > I've been using BioPython for a while now and I guess I'm a spoiled brat > for > never giving anything back :) Also, I've been wanting to participate in the > last two years of GSOC but I've never found a project that I felt adequate > to my knowledge (ie. usually too hard). Thus, with this year's Bio.PDB > project, I guess I can give it a try to be accepted. > Sounds good to me! The GSoC projects are meant to be a stretch for students' skills; otherwise you wouldn't need mentors. I don't have that much experience with coding in collaborative environments, > nor I have in big projects, but that's exactly what I'm looking forward to > earn. I know my way around BioPython and the Bio.PDB module, and I've had > enough headaches dealing with PDB files in the past couple of years to > nurture hatred up to a certain level :) And I have a B.Sc in Biochem, which > is a double-edged knife for comp. biology. > Did you see my earlier e-mail about refining ideas for Bio.PDB? Looking at your webpage, I can definitely think of some more specific projects you could do for GSoC. If you don't want other potential students to read your ideas in the formative stages, you can of course e-mail me directly about planning a project. 
Also, I've been attempting to herd applicants toward our bug tracker: http://bugzilla.open-bio.org/buglist.cgi?product=Biopython&bug_status=NEW&bug_status=REOPENED Thanks for your interest, Eric From biopython at maubp.freeserve.co.uk Mon Mar 22 05:27:04 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 22 Mar 2010 09:27:04 +0000 Subject: [Biopython] RNA Secondary structure In-Reply-To: <12DC2C15-ABE8-4E65-BB01-45A7388701DC@mail.med.upenn.edu> References: <12DC2C15-ABE8-4E65-BB01-45A7388701DC@mail.med.upenn.edu> Message-ID: <320fb6e01003220227u19be07c8p6f19c1d28ad37bb2@mail.gmail.com> On Sat, Mar 20, 2010 at 2:53 AM, Gregory Barendt wrote: > Does anyone know of good libraries for looking at RNA secondary > structure? I'm looking for particular stem loops in particular locations > in lots (hundreds of thousands) of sequences. > > Right now, I'm pretty inelegantly parsing the .ct file generated by > UNAfold. I need to modify my search to be a little more flexible, so > I'd much rather use an existing tool than continue to reinvent the > wheel. Any advice would be greatly appreciated. > > Thanks, > Greg I think Kristian Rother was looking at RNA support in Biopython last year (CC'd). Peter From biopython at maubp.freeserve.co.uk Mon Mar 22 05:31:49 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 22 Mar 2010 09:31:49 +0000 Subject: [Biopython] zero-length feature In-Reply-To: References: Message-ID: <320fb6e01003220231u785b019dwf79eef7dfaeb8a67@mail.gmail.com> On Fri, Mar 19, 2010 at 7:19 PM, Anne Pajon wrote: > Dear, > > I am having trouble writing out EMBL file for feature of size one. > I've modified InsdcIO.py to fit my need. Because when I try to submit my > file to EMBL, it comes back with this comment: badly formatted -- you > need a .. between locations. Hi Anne, Could you show us the feature location string you are trying to achieve in the EMBL output? That would help me to follow - an example FT entry would be great. 
Peter From ap12 at sanger.ac.uk Mon Mar 22 07:24:43 2010 From: ap12 at sanger.ac.uk (Anne Pajon) Date: Mon, 22 Mar 2010 11:24:43 +0000 Subject: [Biopython] zero-length feature In-Reply-To: <320fb6e01003220231u785b019dwf79eef7dfaeb8a67@mail.gmail.com> References: <320fb6e01003220231u785b019dwf79eef7dfaeb8a67@mail.gmail.com> Message-ID: <85DB92FE-61E3-4A37-B052-6BF6D40ADCEC@sanger.ac.uk> Hi Peter, Here is the feature location string I would like to achieve in the EMBL output: FT gap 422950..422950 FT /estimated_length=1 Regards, Anne. On 22 Mar 2010, at 09:31, Peter wrote: > On Fri, Mar 19, 2010 at 7:19 PM, Anne Pajon wrote: >> Dear, >> >> I am having trouble writing out EMBL file for feature of size one. >> I've modified InsdcIO.py to fit my need. Because when I try to >> submit my >> file to EMBL, it comes back with this comment: badly formatted -- you >> need a .. between locations. > > Hi Anne, > > Could you show us the feature location string you are trying to > achieve in the EMBL output? That would help me to follow - > an example FT entry would be great. > > Peter -- Dr Anne Pajon - Pathogen Genomics, Team 81 Sanger Institute, Wellcome Trust Genome Campus, Hinxton Cambridge CB10 1SA, United Kingdom +44 (0)1223 494 798 (office) | +44 (0)7958 511 353 (mobile) -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. 
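The location requested above is written in EMBL's 1-based, inclusive coordinates, whereas Python slices are 0-based and half-open. A minimal sketch of the conversion follows; the helper names `to_embl_span` and `from_embl_span` are hypothetical and not part of Biopython, and the `collapse_single` option is an assumption added because a one-base span can also be written as a single number.

```python
def to_embl_span(start, end, collapse_single=False):
    """Convert a 0-based, half-open slice [start:end] to a 1-based,
    inclusive EMBL-style location string."""
    if not 0 <= start < end:
        raise ValueError("need 0 <= start < end for a non-empty feature")
    first, last = start + 1, end
    if collapse_single and first == last:
        # A one-base feature written as a single position.
        return str(first)
    return "%i..%i" % (first, last)


def from_embl_span(first, last):
    """Convert a 1-based, inclusive span to a (start, end) Python slice."""
    if not 1 <= first <= last:
        raise ValueError("need 1 <= first <= last")
    return first - 1, last


if __name__ == "__main__":
    print(to_embl_span(422949, 422950))
    print(from_embl_span(422950, 422950))
```

With these definitions a single base at 1-based position 422950 corresponds to the slice 422949:422950, so a feature of length one always has start + 1 == end on the Python side.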
From biopython at maubp.freeserve.co.uk Mon Mar 22 07:37:58 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 22 Mar 2010 11:37:58 +0000
Subject: [Biopython] zero-length feature
In-Reply-To: <85DB92FE-61E3-4A37-B052-6BF6D40ADCEC@sanger.ac.uk>
References: <320fb6e01003220231u785b019dwf79eef7dfaeb8a67@mail.gmail.com>
 <85DB92FE-61E3-4A37-B052-6BF6D40ADCEC@sanger.ac.uk>
Message-ID: <320fb6e01003220437w2f7b4d4ra9cfab1b66af2418@mail.gmail.com>

On Mon, Mar 22, 2010 at 11:24 AM, Anne Pajon wrote:
> Hi Peter,
>
> Here is the feature location string I would like to achieve in the EMBL
> output:
>
> FT   gap             422950..422950
> FT                   /estimated_length=1
>
>
> Regards,
> Anne.

Does your genome have a single N (or n) character at this point?

If so, it does make sense to use 422950..422950 to mean that
single letter - it really is a feature of length one. That should be
possible with the existing (unmodified) Biopython EMBL/GenBank
output. Note that in python notation this would be the region
[422949:422950], where start != end but instead start+1 == end.

If however the gap isn't explicitly in the genome string, I think you
should be using something like 422950^422951 to indicate the
gap is between bases 422950 and 422951. This is a zero length
feature.

Perhaps I have misunderstood your aim?

Peter

From biopython at maubp.freeserve.co.uk Mon Mar 22 07:41:52 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 22 Mar 2010 11:41:52 +0000
Subject: [Biopython] zero-length feature
In-Reply-To: <320fb6e01003220437w2f7b4d4ra9cfab1b66af2418@mail.gmail.com>
References: <320fb6e01003220231u785b019dwf79eef7dfaeb8a67@mail.gmail.com>
 <85DB92FE-61E3-4A37-B052-6BF6D40ADCEC@sanger.ac.uk>
 <320fb6e01003220437w2f7b4d4ra9cfab1b66af2418@mail.gmail.com>
Message-ID: <320fb6e01003220441n5867af3ei39a4a90fc8c53586@mail.gmail.com>

On Mon, Mar 22, 2010 at 11:37 AM, Peter wrote:
> Does your genome have a single N (or n) character at this point?
> > If so, it does make sense to use 422950..422950 to mean that > single letter - it really is a feature of length one. That should be > possible with the existing (unmodified) Biopython EMBL/GenBank > output. Note that in python notation this would be the region > [422949:422950], where start != end but instead start+1 == end. > > If however the gap isn't explicitly in the genome string, I think you > should be using something like 422950^422951 to indicate the > gap is between bases 422950 and 422951. This is a zero length > feature. > > Perhaps I have misunderstood your aim? I should perhaps include a quote from the EMBL documentation to explain my question a little further: http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html Feature Key gap Definition gap in the sequence Mandatory qualifiers /estimated_length=unknown or Optional qualifiers /experiment="text" /inference="TYPE[ (same species)][:EVIDENCE_BASIS]" /map="text" /note="text" Comment the location span of the gap feature for an unknown gap is 100 bp, with the 100 bp indicated as 100 "n"'s in the sequence. Where estimated length is indicated by an integer, this is indicated by the same number of "n"'s in the sequence. No upper or lower limit is set on the size of the gap. i.e. I think EMBL would want you to insert a string of n characters into the genome where you have a gap, and then the gap feature would describe this string of n characters. Peter From ap12 at sanger.ac.uk Mon Mar 22 07:44:00 2010 From: ap12 at sanger.ac.uk (Anne Pajon) Date: Mon, 22 Mar 2010 11:44:00 +0000 Subject: [Biopython] zero-length feature In-Reply-To: <320fb6e01003220437w2f7b4d4ra9cfab1b66af2418@mail.gmail.com> References: <320fb6e01003220231u785b019dwf79eef7dfaeb8a67@mail.gmail.com> <85DB92FE-61E3-4A37-B052-6BF6D40ADCEC@sanger.ac.uk> <320fb6e01003220437w2f7b4d4ra9cfab1b66af2418@mail.gmail.com> Message-ID: My genome has a single N character at this point. 
Here is the code I use to insert these gaps:

    # Add FT gap
    seq = record.seq
    in_N = False
    gap_features = []
    for i in range(len(seq)):
        if seq[i] == 'N' and not in_N:
            start_N = i
            in_N = True
        if in_N and not seq[i+1] == 'N':
            end_N = i
            if start_N == end_N:
                log.warning("gap of size 1 %s..%s" % (start_N, end_N))
            length = (end_N - start_N) + 1
            gap_feature = SeqFeature(FeatureLocation(start_N,end_N+1),
                                     strand=1, type="gap")
            gap_feature.qualifiers['estimated_length'] = [length]
            gap_features.append(gap_feature)
            in_N = False

What should I do to make it works with (unmodified) Biopython EMBL output?
Thanks in advance for your help.

Regards,
Anne.

On 22 Mar 2010, at 11:37, Peter wrote:

> On Mon, Mar 22, 2010 at 11:24 AM, Anne Pajon
> wrote:
>> Hi Peter,
>>
>> Here is the feature location string I would like to achieve in the
>> EMBL
>> output:
>>
>> FT   gap             422950..422950
>> FT                   /estimated_length=1
>>
>>
>> Regards,
>> Anne.
>
> Does your genome have a single N (or n) character at this point?
>
> If so, it does make sense to use 422950..422950 to mean that
> single letter - it really is a feature of length one. That should be
> possible with the existing (unmodified) Biopython EMBL/GenBank
> output. Note that in python notation this would be the region
> [422949:422950], where start != end but instead start+1 == end.
>
> If however the gap isn't explicitly in the genome string, I think you
> should be using something like 422950^422951 to indicate the
> gap is between bases 422950 and 422951. This is a zero length
> feature.
>
> Perhaps I have misunderstood your aim?
>
> Peter

--
Dr Anne Pajon - Pathogen Genomics, Team 81
Sanger Institute, Wellcome Trust Genome Campus, Hinxton
Cambridge CB10 1SA, United Kingdom
+44 (0)1223 494 798 (office) | +44 (0)7958 511 353 (mobile)

--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

From biopython at maubp.freeserve.co.uk Mon Mar 22 08:07:06 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 22 Mar 2010 12:07:06 +0000
Subject: [Biopython] zero-length feature
In-Reply-To:
References: <320fb6e01003220231u785b019dwf79eef7dfaeb8a67@mail.gmail.com>
 <85DB92FE-61E3-4A37-B052-6BF6D40ADCEC@sanger.ac.uk>
 <320fb6e01003220437w2f7b4d4ra9cfab1b66af2418@mail.gmail.com>
Message-ID: <320fb6e01003220507p416b831v571fae654c19bed7@mail.gmail.com>

On Mon, Mar 22, 2010 at 11:44 AM, Anne Pajon wrote:
> My genome has a single N character at this point.
>

OK - then the feature should be length one, describing this single
base region. i.e. Using python counting, start+1 == end

>
> Here is the code I use to insert these gaps:
>
>     # Add FT gap
>     seq = record.seq
>     in_N = False
>     gap_features = []
>     for i in range(len(seq)):
>         if seq[i] == 'N' and not in_N:
>             start_N = i
>             in_N = True
>         if in_N and not seq[i+1] == 'N':
>             end_N = i
>             if start_N == end_N:
>                 log.warning("gap of size 1 %s..%s" % (start_N, end_N))
>             length = (end_N - start_N) + 1
>             gap_feature = SeqFeature(FeatureLocation(start_N,end_N+1),
> strand=1, type="gap")
>             gap_feature.qualifiers['estimated_length'] = [length]
>             gap_features.append(gap_feature)
>             in_N = False
>
> What should I do to make it works with (unmodified) Biopython EMBL output?
> Thanks in advance for your help.
>
> Regards,
> Anne.
I think you have some out by one counting there (resulting in features
of length one shorted than they should have been). How does this self
contained example look?

from Bio.Alphabet import generic_dna
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation
seq = Seq("ANANNANNNANNNNNA", generic_dna)
record = SeqRecord(seq, id="Test")
print "Finding stretches of N in this:"
print seq
# TODO - Cope with a sequence which ends with N
assert seq[-1] != "N", "FIXME - seq ends with N"
in_N = False
for i in range(len(seq)):
    if seq[i] == 'N' and not in_N:
        start_N = i
        in_N = True
    if in_N and not seq[i+1] == 'N':
        end_N = i+1
        length = end_N - start_N
        assert length > 0
        assert str(seq[start_N:end_N]) == "N"*length
        print "."*start_N + seq[start_N:end_N] + "."*(len(seq)-end_N),
        print "gap of size %i, Python slicing %s:%s" % (length, start_N, end_N)
        gap_feature = SeqFeature(FeatureLocation(start_N,end_N),
                                 strand=1, type="gap")
        gap_feature.qualifiers['estimated_length'] = [length]
        record.features.append(gap_feature)
        in_N = False
print
print record.format("embl")

And the output, which looks fine to me (this is more readable if your
email client uses a fixed width font):

Finding stretches of N in this:
ANANNANNNANNNNNA
.N.............. gap of size 1, Python slicing 1:2
...NN........... gap of size 2, Python slicing 3:5
......NNN....... gap of size 3, Python slicing 6:9
..........NNNNN. gap of size 5, Python slicing 10:15

ID   Test; ; ; DNA; ; UNC; 16 BP.
XX
AC   Test;
XX
DE   .
XX
OS   .
OC   .
XX
FH   Key             Location/Qualifiers
FT   gap             2..2
FT                   /estimated_length=1
FT   gap             4..5
FT                   /estimated_length=2
FT   gap             7..9
FT                   /estimated_length=3
FT   gap             11..15
FT                   /estimated_length=5
SQ   Sequence 16 BP; 5 A; 0 C; 0 G; 0 T; 11 other;
     ANANNANNNA NNNNNA                                                        16
//

Regards,

Peter

From ap12 at sanger.ac.uk Mon Mar 22 10:52:53 2010
From: ap12 at sanger.ac.uk (Anne Pajon)
Date: Mon, 22 Mar 2010 14:52:53 +0000
Subject: [Biopython] zero-length feature
In-Reply-To: <320fb6e01003220507p416b831v571fae654c19bed7@mail.gmail.com>
References: <320fb6e01003220231u785b019dwf79eef7dfaeb8a67@mail.gmail.com>
 <85DB92FE-61E3-4A37-B052-6BF6D40ADCEC@sanger.ac.uk>
 <320fb6e01003220437w2f7b4d4ra9cfab1b66af2418@mail.gmail.com>
 <320fb6e01003220507p416b831v571fae654c19bed7@mail.gmail.com>
Message-ID: <799F5C22-4C09-4E8C-ADBE-88CC194B9515@sanger.ac.uk>

Brilliant! Thanks.

Regards,
Anne.

On 22 Mar 2010, at 12:07, Peter wrote:

> On Mon, Mar 22, 2010 at 11:44 AM, Anne Pajon
> wrote:
>> My genome has a single N character at this point.
>>
>
> OK - then the feature should be length one, describing this single
> base region. i.e. Using python counting, start+1 == end
>
>>
>> Here is the code I use to insert these gaps:
>>
>>     # Add FT gap
>>     seq = record.seq
>>     in_N = False
>>     gap_features = []
>>     for i in range(len(seq)):
>>         if seq[i] == 'N' and not in_N:
>>             start_N = i
>>             in_N = True
>>         if in_N and not seq[i+1] == 'N':
>>             end_N = i
>>             if start_N == end_N:
>>                 log.warning("gap of size 1 %s..%s" % (start_N, end_N))
>>             length = (end_N - start_N) + 1
>>             gap_feature = SeqFeature(FeatureLocation(start_N,end_N+1),
>> strand=1, type="gap")
>>             gap_feature.qualifiers['estimated_length'] = [length]
>>             gap_features.append(gap_feature)
>>             in_N = False
>>
>> What should I do to make it works with (unmodified) Biopython EMBL
>> output?
>> Thanks in advance for your help.
>>
>> Regards,
>> Anne.
>
> I think you have some out by one counting there (resulting in features
> of length one shorted than they should have been).
How does this self > contained example look? > > from Bio.Alphabet import generic_dna > from Bio.Seq import Seq > from Bio.SeqRecord import SeqRecord > from Bio.SeqFeature import SeqFeature, FeatureLocation > seq = Seq("ANANNANNNANNNNNA", generic_dna) > record = SeqRecord(seq, id="Test") > print "Finding stretches of N in this:" > print seq > # TODO - Cope with a sequence which ends with N > assert seq[-1] != "N", "FIXME - seq ends with N" > in_N = False > for i in range(len(seq)): > if seq[i] == 'N' and not in_N: > start_N = i > in_N = True > if in_N and not seq[i+1] == 'N': > end_N = i+1 > length = end_N - start_N > assert length > 0 > assert str(seq[start_N:end_N]) == "N"*length > print "."*start_N + seq[start_N:end_N] + "."*(len(seq)-end_N), > print "gap of size %i, Python slicing %s:%s" % (length, > start_N, end_N) > gap_feature = SeqFeature(FeatureLocation(start_N,end_N), > strand=1, type="gap") > gap_feature.qualifiers['estimated_length'] = [length] > record.features.append(gap_feature) > in_N = False > print > print record.format("embl") > > > And the output, which looks fine to me (this is more readable if your > email client uses a fixed width font): > > > Finding stretches of N in this: > ANANNANNNANNNNNA > .N.............. gap of size 1, Python slicing 1:2 > ...NN........... gap of size 2, Python slicing 3:5 > ......NNN....... gap of size 3, Python slicing 6:9 > ..........NNNNN. gap of size 5, Python slicing 10:15 > > ID Test; ; ; DNA; ; UNC; 16 BP. > XX > AC Test; > XX > DE . > XX > OS . > OC . 
> XX > FH Key Location/Qualifiers > FT gap 2..2 > FT /estimated_length=1 > FT gap 4..5 > FT /estimated_length=2 > FT gap 7..9 > FT /estimated_length=3 > FT gap 11..15 > FT /estimated_length=5 > SQ Sequence 16 BP; 5 A; 0 C; 0 G; 0 T; 11 other; > ANANNANNNA > NNNNNA 16 > // > > Regards, > > Peter -- Dr Anne Pajon - Pathogen Genomics, Team 81 Sanger Institute, Wellcome Trust Genome Campus, Hinxton Cambridge CB10 1SA, United Kingdom +44 (0)1223 494 798 (office) | +44 (0)7958 511 353 (mobile) -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From cjfields at illinois.edu Mon Mar 22 11:01:48 2010 From: cjfields at illinois.edu (Chris Fields) Date: Mon, 22 Mar 2010 10:01:48 -0500 Subject: [Biopython] zero-length feature In-Reply-To: <799F5C22-4C09-4E8C-ADBE-88CC194B9515@sanger.ac.uk> References: <320fb6e01003220231u785b019dwf79eef7dfaeb8a67@mail.gmail.com> <85DB92FE-61E3-4A37-B052-6BF6D40ADCEC@sanger.ac.uk> <320fb6e01003220437w2f7b4d4ra9cfab1b66af2418@mail.gmail.com> <320fb6e01003220507p416b831v571fae654c19bed7@mail.gmail.com> <799F5C22-4C09-4E8C-ADBE-88CC194B9515@sanger.ac.uk> Message-ID: <79EA9A47-2AFD-41B2-9A41-EEBD604B346A@illinois.edu> All, Just to make sure I'm sanity-checked here, the GenBank/EMBL/DDBJ location specs indicate a location of one nucleotide (inclusive) in length is to be characterized as one number, not a range at all: http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#3.5.3 2..2 should just be: 2 Or, did I miss something in the discussion? chris On Mar 22, 2010, at 9:52 AM, Anne Pajon wrote: > Brilliant! Thanks. > > Regards, > Anne. > > On 22 Mar 2010, at 12:07, Peter wrote: > >> On Mon, Mar 22, 2010 at 11:44 AM, Anne Pajon wrote: >>> My genome has a single N character at this point. 
>>> >> >> OK - then the feature should be length one, describing this single >> base region. i.e. Using python counting, start+1 == end >> >>> >>> Here is the code I use to insert these gaps: >>> >>> # Add FT gap >>> seq = record.seq >>> in_N = False >>> gap_features = [] >>> for i in range(len(seq)): >>> if seq[i] == 'N' and not in_N: >>> start_N = i >>> in_N = True >>> if in_N and not seq[i+1] == 'N': >>> end_N = i >>> if start_N == end_N: >>> log.warning("gap of size 1 %s..%s" % (start_N, end_N)) >>> length = (end_N - start_N) + 1 >>> gap_feature = SeqFeature(FeatureLocation(start_N,end_N+1), >>> strand=1, type="gap") >>> gap_feature.qualifiers['estimated_length'] = [length] >>> gap_features.append(gap_feature) >>> in_N = False >>> >>> What should I do to make it works with (unmodified) Biopython EMBL output? >>> Thanks in advance for your help. >>> >>> Regards, >>> Anne. >> >> I think you have some out by one counting there (resulting in features >> of length one shorted than they should have been). How does this self >> contained example look? 
>> >> from Bio.Alphabet import generic_dna >> from Bio.Seq import Seq >> from Bio.SeqRecord import SeqRecord >> from Bio.SeqFeature import SeqFeature, FeatureLocation >> seq = Seq("ANANNANNNANNNNNA", generic_dna) >> record = SeqRecord(seq, id="Test") >> print "Finding stretches of N in this:" >> print seq >> # TODO - Cope with a sequence which ends with N >> assert seq[-1] != "N", "FIXME - seq ends with N" >> in_N = False >> for i in range(len(seq)): >> if seq[i] == 'N' and not in_N: >> start_N = i >> in_N = True >> if in_N and not seq[i+1] == 'N': >> end_N = i+1 >> length = end_N - start_N >> assert length > 0 >> assert str(seq[start_N:end_N]) == "N"*length >> print "."*start_N + seq[start_N:end_N] + "."*(len(seq)-end_N), >> print "gap of size %i, Python slicing %s:%s" % (length, start_N, end_N) >> gap_feature = SeqFeature(FeatureLocation(start_N,end_N), >> strand=1, type="gap") >> gap_feature.qualifiers['estimated_length'] = [length] >> record.features.append(gap_feature) >> in_N = False >> print >> print record.format("embl") >> >> >> And the output, which looks fine to me (this is more readable if your >> email client uses a fixed width font): >> >> >> Finding stretches of N in this: >> ANANNANNNANNNNNA >> .N.............. gap of size 1, Python slicing 1:2 >> ...NN........... gap of size 2, Python slicing 3:5 >> ......NNN....... gap of size 3, Python slicing 6:9 >> ..........NNNNN. gap of size 5, Python slicing 10:15 >> >> ID Test; ; ; DNA; ; UNC; 16 BP. >> XX >> AC Test; >> XX >> DE . >> XX >> OS . >> OC . 
>> XX
>> FH   Key             Location/Qualifiers
>> FT   gap             2..2
>> FT                   /estimated_length=1
>> FT   gap             4..5
>> FT                   /estimated_length=2
>> FT   gap             7..9
>> FT                   /estimated_length=3
>> FT   gap             11..15
>> FT                   /estimated_length=5
>> SQ   Sequence 16 BP; 5 A; 0 C; 0 G; 0 T; 11 other;
>>      ANANNANNNA NNNNNA                                                        16
>> //
>>
>> Regards,
>>
>> Peter
>
> --
> Dr Anne Pajon - Pathogen Genomics, Team 81
> Sanger Institute, Wellcome Trust Genome Campus, Hinxton
> Cambridge CB10 1SA, United Kingdom
> +44 (0)1223 494 798 (office) | +44 (0)7958 511 353 (mobile)
>
>
>
> --
> The Wellcome Trust Sanger Institute is operated by Genome Research
> Limited, a charity registered in England with number 1021457 and a
> company registered in England with number 2742969, whose registered
> office is 215 Euston Road, London, NW1 2BE.
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From biopython at maubp.freeserve.co.uk Mon Mar 22 11:38:42 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 22 Mar 2010 15:38:42 +0000
Subject: [Biopython] zero-length feature
In-Reply-To: <79EA9A47-2AFD-41B2-9A41-EEBD604B346A@illinois.edu>
References: <320fb6e01003220231u785b019dwf79eef7dfaeb8a67@mail.gmail.com> <85DB92FE-61E3-4A37-B052-6BF6D40ADCEC@sanger.ac.uk> <320fb6e01003220437w2f7b4d4ra9cfab1b66af2418@mail.gmail.com> <320fb6e01003220507p416b831v571fae654c19bed7@mail.gmail.com> <799F5C22-4C09-4E8C-ADBE-88CC194B9515@sanger.ac.uk> <79EA9A47-2AFD-41B2-9A41-EEBD604B346A@illinois.edu>
Message-ID: <320fb6e01003220838x685f6815v7f2ba40983d7316a@mail.gmail.com>

On Mon, Mar 22, 2010 at 3:01 PM, Chris Fields wrote:
>
> All,
>
> Just to make sure I'm sanity-checked here, the GenBank/EMBL/DDBJ
> location specs indicate a location of one nucleotide (inclusive) in length
> is to be characterized as one number, not a range at all:
>
> http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#3.5.3
>
> 2..2
>
> should just be:
>
> 2
>
> Or, did I miss something in the discussion?
>
> chris

On the face of it, I think you are right Chris. Good point.

Peter

From biopython at maubp.freeserve.co.uk Tue Mar 23 08:43:57 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 23 Mar 2010 12:43:57 +0000
Subject: [Biopython] zero-length feature
In-Reply-To: <320fb6e01003220838x685f6815v7f2ba40983d7316a@mail.gmail.com>
References: <320fb6e01003220231u785b019dwf79eef7dfaeb8a67@mail.gmail.com> <85DB92FE-61E3-4A37-B052-6BF6D40ADCEC@sanger.ac.uk> <320fb6e01003220437w2f7b4d4ra9cfab1b66af2418@mail.gmail.com> <320fb6e01003220507p416b831v571fae654c19bed7@mail.gmail.com> <799F5C22-4C09-4E8C-ADBE-88CC194B9515@sanger.ac.uk> <79EA9A47-2AFD-41B2-9A41-EEBD604B346A@illinois.edu> <320fb6e01003220838x685f6815v7f2ba40983d7316a@mail.gmail.com>
Message-ID: <320fb6e01003230543u6f531f44w634f27b4498db937@mail.gmail.com>

On Mon, Mar 22, 2010 at 3:38 PM, Peter wrote:
> On Mon, Mar 22, 2010 at 3:01 PM, Chris Fields wrote:
>>
>> All,
>>
>> Just to make sure I'm sanity-checked here, the GenBank/EMBL/DDBJ
>> location specs indicate a location of one nucleotide (inclusive) in length
>> is to be characterized as one number, not a range at all:
>>
>> http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#3.5.3
>>
>> 2..2
>>
>> should just be:
>>
>> 2
>>
>> Or, did I miss something in the discussion?
>>
>> chris
>
> On the face of it, I think you are right Chris. Good point.
>
> Peter
>

Hi again,

I've updated the trunk to handle single letter features like that. This means the output of the example script I showed earlier is now:

ID   Test; ; ; DNA; ; UNC; 16 BP.
XX
AC   Test;
XX
DE   .
XX
OS   .
OC   .
XX
FH   Key             Location/Qualifiers
FT   gap             2
FT                   /estimated_length=1
FT   gap             4..5
FT                   /estimated_length=2
FT   gap             7..9
FT                   /estimated_length=3
FT   gap             11..15
FT                   /estimated_length=5
SQ   Sequence 16 BP; 5 A; 0 C; 0 G; 0 T; 11 other;
     ANANNANNNA NNNNNA                                                        16
//

Note the single gap feature now has a location "2" not "2..2"

Thanks Chris,

Peter

From biopython at maubp.freeserve.co.uk Wed Mar 24 10:58:07 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 24 Mar 2010 14:58:07 +0000
Subject: [Biopython] RNA Secondary structure
In-Reply-To: <320fb6e01003220227u19be07c8p6f19c1d28ad37bb2@mail.gmail.com>
References: <12DC2C15-ABE8-4E65-BB01-45A7388701DC@mail.med.upenn.edu> <320fb6e01003220227u19be07c8p6f19c1d28ad37bb2@mail.gmail.com>
Message-ID: <320fb6e01003240758h36a5eb3v91faa70faf8e0f@mail.gmail.com>

On Mon, Mar 22, 2010 at 9:27 AM, Peter wrote:
> On Sat, Mar 20, 2010 at 2:53 AM, Gregory Barendt
> wrote:
>> Does anyone know of good libraries for looking at RNA secondary
>> structure? I'm looking for particular stem loops in particular locations
>> in lots (hundreds of thousands) of sequences.
>>
>> Right now, I'm pretty inelegantly parsing the .ct file generated by
>> UNAfold. I need to modify my search to be a little more flexible, so
>> I'd much rather use an existing tool than continue to reinvent the
>> wheel. Any advice would be greatly appreciated.
>>
>> Thanks,
>> Greg
>
> I think Kristian Rother was looking at RNA support in Biopython last
> year (CC'd).
>

Hi again Greg,

In case you are not also on the dev mailing list, you might be interested to look at Kristian's code.
If you could help out with testing/feedback that would be great:
http://lists.open-bio.org/pipermail/biopython-dev/2010-March/007482.html

Peter

From richard_w_g_price at academia.edu Thu Mar 25 21:33:12 2010
From: richard_w_g_price at academia.edu (Richard Price)
Date: Thu, 25 Mar 2010 18:33:12 -0700
Subject: [Biopython] Recent Activity of the 15 Biopython members on Academia.edu
Message-ID:

Dear Biopython members,

We just wanted to let you know about some recent activity on the Biopython group on Academia.edu.

In the Biopython group on Academia.edu, there are now:

- 15 people (10 in the last month)
- 1 paper

Biopython members' pages have been viewed a total of 1,801 times, and their papers have been viewed a total of 4 times.

To see these people, papers and status updates, follow the link below:

http://lists.academia.edu/See-members-of-Biopython

Richard

Dr. Richard Price, post-doc, Philosophy Dept, Oxford University.
Founder of Academia.edu

From rmb32 at cornell.edu Fri Mar 26 13:14:32 2010
From: rmb32 at cornell.edu (Robert Buels)
Date: Fri, 26 Mar 2010 10:14:32 -0700
Subject: [Biopython] Announcing OBF Summer of Code - please forward!
Message-ID: <4BACEB78.3090600@cornell.edu>

Hi all,

Here's an advertising-ready announcement for OBF's Summer of Code, thanks to Christian Zmasek and Hilmar Lapp for their excellent writing. Student applications are due April 9! Please spread it widely, we need to reach lots of students with it!

Rob Buels
OBF GSoC 2010 Admin

============================================================
*** Please disseminate widely at your local institutions ***
*** including posting to message and job boards, so that ***
*** we reach as many students as possible.               ***
============================================================

OPEN BIOINFORMATICS FOUNDATION SUMMER OF CODE 2010

Applications due 19:00 UTC, April 9, 2010.
http://www.open-bio.org/wiki/Google_Summer_of_Code The Open Bioinformatics Foundation Summer of Code program provides a unique opportunity for undergraduate, masters, and PhD students to obtain hands-on experience writing and extending open-source software for bioinformatics under the mentorship of experienced developers from around the world. The program is the participation of the Open Bioinformatics Foundation (OBF) as a mentoring organization in the Google Summer of Code(tm) (http://code.google.com/soc/). Students successfully completing the 3 month program receive a $5,000 USD stipend, and may work entirely from their home or home institution. Participation is open to students from any country in the world except countries subject to US trade restrictions. Each student will have at least one dedicated mentor to show them the ropes and help them complete their project. The Open Bioinformatics Foundation is particularly seeking students interested in both bioinformatics (computational biology) and software development. Some initial project ideas are listed on the website. These range from Galaxy phylogenetics pipeline development in Biopython to lightweight sequence objects and lazy parsing in BioPerl, a DAS Server for large files on local filesystems, and mapping Java libraries to Perl/Ruby/Python using Biolib+SWIG+JNI. All project ideas are flexible and many can be adjusted in scope to match the skills of the student. We also welcome and encourage students proposing their own project ideas; historically some of the most successful Summer of Code projects are ones proposed by the students themselves. TO APPLY: Apply online at the Google Summer of Code website (http://socghop.appspot.com/), where you will also find GSoC program rules and eligibility requirements. The 12-day application period for students runs from Monday, March 29 through Friday, April 9th, 2010. 
INQUIRIES: We strongly encourage all interested students to get in touch with us with their ideas as early on as possible. See the OBF GSoC page for contact details. 2010 OBF Summer of Code: http://www.open-bio.org/wiki/Google_Summer_of_Code Google Summer of Code FAQ: http://socghop.appspot.com/document/show/program/google/gsoc2010/faqs From chintal at iitk.ac.in Fri Mar 26 20:36:57 2010 From: chintal at iitk.ac.in (Chintalagiri Shashank) Date: Sat, 27 Mar 2010 06:06:57 +0530 Subject: [Biopython] Introduction Message-ID: <201003270606.59717.chintal@iitk.ac.in> Hello, I'm an undergraduate student of Physics, from the Indian Institute of Technology, Kanpur, and am interested in applying to BioPython for this year's Google Summer of Code. My interest in biology in the context of my Major is a somewhat complex and long-winded explanation, the basis of which is that for the last couple of years I've been seriously looking into biology (specifically, Structural Biology within the context of elements from Bioinformatics and certain other fields) as a potentially interesting field of study, and have been doing courses about the same. I was initially toying with the idea of attempting to write a sequence analysis 'framework' of sorts, where I could have the scaffolding to play around with simple algorithms for structure prediction. In retrospect, I should have make a more thorough search which should have led to OBF and BioPython, but as it is the idea went into cold storage due to certain other pressing constraints on my time, specifically a time-bound institute project that was behind on its schedule. I found OBF soon after the initial GSoC organizations announcement, and have since been looking over various pieces of documentation on it. I did look at the bugtracker as well, as was suggested on list, but it seemed to me that a lot of the bugs listed there were patches awaiting review. 
I do intend to take another look at the list and see if there is anything I can do there, but I decided that I shouldn't wait any longer before introducing myself formally on the list. I'm interested in working on BioPython/PyCogent interop, because I see a lot of potential in tying the two toolkits together and doing so before more wheels are reinvented. The ability to look at evolutionary effects and structural effects simultaneously could be quite interesting. To be fair, I must note here that while I am quite at home with Python and have a working understanding of the elements that make up BioPython, I have no production experience with either toolkits, and do not have a theoretical understanding of the evolutionary algorithms behind pyCogent. However, I am confident that I will be able to pick up the necessary skills over time, atleast to a degree necessary to make interoperability possible. I also have a couple of ideas in mind for BioPython projects, which really aren't well fleshed out yet. I'll think about them, specifically, their need and feasibility, and send the details to the list in a few days. Please do let me know if you would like any more information in the meanwhile. I've been on the mailing list for a couple of weeks now, so you can just reply on-list unless there is a need for off-list communication. Regards Chintalagiri Shashank chintal at iitk.ac.in From chapmanb at 50mail.com Sat Mar 27 08:36:53 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Sat, 27 Mar 2010 08:36:53 -0400 Subject: [Biopython] Introduction In-Reply-To: <201003270606.59717.chintal@iitk.ac.in> References: <201003270606.59717.chintal@iitk.ac.in> Message-ID: <20100327123653.GA1959@kunkel> Chintalagiri; Thanks for the e-mail and introduction. It's great to have you interested in Biopython and GSoC. 
The path you took to Biopython definitely echos the experience of lots of us; first you try building everything yourself and then realize: there must be some code frameworks out there that make this easier. > I'm interested in working on BioPython/PyCogent interop, because I see a lot > of potential in tying the two toolkits together and doing so before more > wheels are reinvented. The ability to look at evolutionary effects and > structural effects simultaneously could be quite interesting. [...] > I also have a couple of ideas in mind for BioPython projects, which really > aren't well fleshed out yet. I'll think about them, specifically, their need > and feasibility, and send the details to the list in a few days. Great, it sounds like you've already given this a bit of thought. You're welcome to either build off of the Biopython/PyCogent project or develop one of your own ideas into a proposal. Either way, the first step is to start putting together your project proposal and sharing it with us (Google Docs is a good option) so we can offer specific feedback on the programming and science part of things. We can work on the proposals up until Friday, April 9th. If you haven't already it's worth taking a look at the GSoC timeline for all the major dates: http://socghop.appspot.com/document/show/gsoc_program/google/gsoc2010/timeline Generally, the proposal should contain: - A high level overview of what you hope to accomplish during the summer. - A week by week action plan for work to be done, including specific deliverables. This should be the bulk of the proposal. - A short section with relevant background and experience. We can work on this iteratively until the cutoff, and will be able to offer more specific feedback as we get an idea of your interests and directions. It would also be really useful to provide pointers to any open source code we could look at. 
If you don't have anything online now, uploading some relevant scripts to a GitHub or Bitbucket repository is a good start. Demonstrating bug fixing ability, as you mentioned, is also a helpful way to show off your programming skills to mentors.

Thanks again. Looking forward to working on the proposal with you,
Brad

From biopyuser at gmail.com Mon Mar 29 01:16:37 2010
From: biopyuser at gmail.com (Biopython User)
Date: Sun, 28 Mar 2010 22:16:37 -0700
Subject: [Biopython] KDTree with multidimensional radius?
Message-ID:

Hi all -

New to K-D Trees and biopython, and have a question regarding the feasibility of this setup:

Is it possible to create a 3-D tree of (X,Y,T=time) and do a search (node count) with a 2-D "radius" of (d,t) where d is the cartesian distance from a center point (x,y), and t is a temporal distance only on the T=time axis?

The problem class I'm trying to solve is as follows: Given a set of nodes (possibly as many as 10 million) in (X,Y,T), find all groups where the group is defined by a central node (x,y,t) and N or more nodes within d distance and t time from that center.

I've come to the conclusion that I can do this in a two-step process: that is, first search() on a 2-D (X,Y) tree, and then, for each of the arrays produced, do a 1-D (T) search - but given that the tree creation cost is high, this is potentially very inefficient, and I'm hoping there's a better way.

Ideas/feedback/other options greatly appreciated.

Kurt.

From crosvera at gmail.com Wed Mar 31 18:39:06 2010
From: crosvera at gmail.com (Carlos Ríos Vera)
Date: Wed, 31 Mar 2010 18:39:06 -0400
Subject: [Biopython] PDB-Tidy proposal
Message-ID:

Dear Biopythoners,

I'm Carlos Ríos, a student from Chile. As some of you may know, I'm very interested in apply to the Google Summer of Code with the PDB-Tidy idea. So, I wrote a draft that suppose to be my proposal.

I'm open to receive any comment, feedback, disagreement...
here is the link of the draft:
http://github.com/crosvera/pdbtidy_proposal/blob/master/proposal

Regards.

Ps: sorry if my English is not so good.

--
http://crosvera.blogspot.com

Carlos Ríos V.
Estudiante de Ing. (E) en Computación e Informática.
Universidad del Bío-Bío
VIII Región, Chile

Linux user number 425502

From mjldehoon at yahoo.com Mon Mar 1 09:40:25 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Mon, 1 Mar 2010 01:40:25 -0800 (PST)
Subject: [Biopython] Deprecating Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast functions?
In-Reply-To: <320fb6e01002271119me5db4ddud784bf1573fddfb@mail.gmail.com>
Message-ID: <180385.44498.qm@web62405.mail.re1.yahoo.com>

--- On Sat, 2/27/10, Peter wrote:
> I hadn't realised the NCBI had changed the XML. I
> wonder if multiple query PSI-BLAST output works
> nicely now?

The psiblast program as part of blast+ doesn't allow multiple queries, so in that sense the problem was disappeared.

> If the existing NCBI XML parser can cover both variants,
> then it makes more sense to me to continue to use the
> existing read & parse functions under
> Bio.Blast.NCBIXML.

Well I was thinking that this is a good time to tackle all outstanding Blast parser bugs & issues, which may break consistency with the existing parsers. So I would prefer to copy the code in Bio.Blast.NCBIXML, modify it as needed for blast+, and in some future Biopython release (not anytime soon) to deprecated NCBIStandalone and NCBIXML.

In any case, I think it is nicer to have a read() function directly under Bio.Blast, so I don't have to remember and type in the names of the submodules NCBIXML and NCBIStandalone (the name of the latter doesn't make much sense anyway).

--Michiel.

From biopython at maubp.freeserve.co.uk Mon Mar 1 10:08:17 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 1 Mar 2010 10:08:17 +0000
Subject: [Biopython] Deprecating Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast functions?
In-Reply-To: <180385.44498.qm@web62405.mail.re1.yahoo.com> References: <320fb6e01002271119me5db4ddud784bf1573fddfb@mail.gmail.com> <180385.44498.qm@web62405.mail.re1.yahoo.com> Message-ID: <320fb6e01003010208n13d4de6au2299a2e3221fd2dc@mail.gmail.com> On Mon, Mar 1, 2010 at 9:40 AM, Michiel de Hoon wrote: > --- On Sat, 2/27/10, Peter wrote: >> I hadn't realised the NCBI had changed the XML. I >> wonder if multiple query PSI-BLAST output works >> nicely now? > > The psiblast program as part of blast+ doesn't allow > multiple queries, so in that sense the problem was > disappeared. That is a very practical solution to the problem. Chuckle. >> If the existing NCBI XML parser can cover both variants, >> then it makes more sense to me to continue to use the >> existing read & parse functions under >> Bio.Blast.NCBIXML. > > Well I was thinking that this is a good time to tackle all > outstanding Blast parser bugs & issues, which may break > consistency with the existing parsers. So I would prefer to > copy the code in Bio.Blast.NCBIXML, modify it as needed > for blast+, and in some future Biopython release (not anytime > soon) to deprecated NCBIStandalone and NCBIXML. Would you be thinking of having Bio.Blast.read() and parse() only supporting NCBI BLAST+ XML files, or take a format argument like we do for sequences and alignments? i.e. What about other formats like the old NCBI XML (if it has changed), the assorted tabular BLAST outputs, non-NCBI BLAST, and finally the still sometimes useful plain text output (e.g. for use with third party tools like BLAT). > In any case, I think it is nicer to have a read() function directly > under Bio.Blast, so I don't have to remember and type in the > names of the submodules NCBIXML and NCBIStandalone > (the name of the latter doesn't make much sense anyway). The name of Bio.Blast.NCBIStandalone is a historical relic, and I agree should be retired. Can we label the whole of this module as obsolete? 
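The thread distinguishes between labelling a module obsolete (a documentation note only) and deprecating it (warning users at run time). As a rough illustration of the second step, here is a minimal sketch using Python's standard warnings module. The function name legacy_blastall and its return value are hypothetical stand-ins chosen for this example; this is not Biopython's code.

```python
import warnings


def legacy_blastall(query):
    """Hypothetical stand-in for an old wrapper function (not Biopython's API)."""
    # Deprecation: the function still works, but callers are warned when they use it.
    warnings.warn(
        "legacy_blastall is deprecated; please use the newer wrappers instead.",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller's line
    )
    return "would run BLAST on %s" % query


def call_and_collect(query):
    """Call the legacy function and return (result, list of warning categories)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure the warning is not filtered out
        result = legacy_blastall(query)
    return result, [w.category for w in caught]
```

An "obsolete" label, by contrast, would change only the docstring and documentation, so existing scripts keep running silently until the warning is added in a later release.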
As discussed earlier on this thread people are still using it for calling BLAST so we won't deprecate it in the next release (but likely the one after that). Peter From chapmanb at 50mail.com Mon Mar 1 13:09:33 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 1 Mar 2010 08:09:33 -0500 Subject: [Biopython] OT: biostar - bioinformatics questions and answers In-Reply-To: References: Message-ID: <20100301130933.GA98028@sobchak.mgh.harvard.edu> Istvan; > Our bioinformatics question and answer site seems to be > picking up steam lately: > > http://biostar.stackexchange.com/ > > I dream of a bioinformatics forum where one can ask a generic > bioinformatics question and get high quality responses in short order, > but not just in one particular approach but everything that is > applicable: perl, python, R, java, Galaxy etc Thanks for setting this up and promoting it. I'm happy to hear you have funding to continue it beyond the beta period. I added links to BioStar and the main StackOverflow Biopython question page from our discussion/mailing list page: http://biopython.org/wiki/Mailing_lists and am also redirecting the RSS feeds for new questions tagged with 'biopython' to the development list using Feed My Inbox (http://www.feedmyinbox.com). So we will now get a daily e-mail digest reminder to the list if any questions are posted. Looking forward to using this. Thanks again, Brad From chapmanb at 50mail.com Mon Mar 1 13:19:40 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 1 Mar 2010 08:19:40 -0500 Subject: [Biopython] GFF parsing In-Reply-To: References: <20100226132834.GA66415@sobchak.mgh.harvard.edu> Message-ID: <20100301131940.GB98028@sobchak.mgh.harvard.edu> John; [GFF parser testing] > For my purposes the python csv module is doing the job. I would prefer > to use a proper GFF parser but for the moment your parser is taking 100 > seconds to parse a 40Mb file and the csv reader is doing it in about 10 > seconds. 
Do you think this is reasonable or do you want to take a closer > look? The straight CSV module will always destroy a full featured parser, but we may be able to get that 10x multiplier down. I'm happy to take a look if you want to send a pointer to your GFF file; if it's not publicly available feel free to send a representative subset of it to me off list. I'd be interested to hear your use case as well. Are there general things you want to do for which you had to write code and a supplemental GFF library would help? The trick with developing a GFF parser is to provide useful high level functionality, since it is relatively easy to split strings and write a one-off solution. Thanks, Brad From istvan.albert at gmail.com Mon Mar 1 13:51:56 2010 From: istvan.albert at gmail.com (Istvan Albert) Date: Mon, 1 Mar 2010 08:51:56 -0500 Subject: [Biopython] OT: biostar - bioinformatics questions and answers In-Reply-To: <20100301130933.GA98028@sobchak.mgh.harvard.edu> References: <20100301130933.GA98028@sobchak.mgh.harvard.edu> Message-ID: On Mon, Mar 1, 2010 at 8:09 AM, Brad Chapman wrote: > and am also redirecting the RSS feeds for new questions tagged with > 'biopython' to the development list using Feed My Inbox > (http://www.feedmyinbox.com). So we will now get a daily e-mail > digest reminder to the list if any questions are posted. Thank you Brad, I have been seriously considering finding interesting blog posts that demonstrate the use of biopython and posting them on BioStar, the advantage being that it is easier to interact, comment and evolve code in the StackOverflow framework of course the original post owners might not agree to that, so it would require contacting them. On the other hand if it is all right with everyone I would like to take some examples inspired from the biopython cookbook and post those. Very often I get questions such as how to do this or that in biopython. For that type of questions this platform is ideal. 
best and thanks again,

Istvan

--
Istvan Albert
http://www.personal.psu.edu/iua1

From mjldehoon at yahoo.com Tue Mar 2 10:01:29 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 2 Mar 2010 02:01:29 -0800 (PST)
Subject: [Biopython] Deprecating Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast functions?
In-Reply-To: <320fb6e01003010208n13d4de6au2299a2e3221fd2dc@mail.gmail.com>
Message-ID: <293938.66771.qm@web62407.mail.re1.yahoo.com>

> Would you be thinking of having Bio.Blast.read() and
> parse() only supporting NCBI BLAST+ XML files, or take
> a format argument like we do for sequences and alignments?

I would support BLAST+ XML files only at first, and add parser capability for other formats later if needed. If so, I would use a format argument, same as how Bio.SeqIO works.

> The name of Bio.Blast.NCBIStandalone is a historical
> relic, and I agree should be retired. Can we label the
> whole of this module as obsolete?

This module also contains the parser for Blast text output, so I think we cannot declare it obsolete just yet. However, if the XML output of BLAST+ is complete, I don't see the need for such a plain-text Blast parser any more.

--Michiel.

From biopython at maubp.freeserve.co.uk Tue Mar 2 10:14:25 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 2 Mar 2010 10:14:25 +0000
Subject: [Biopython] Deprecating Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast functions?
In-Reply-To: <293938.66771.qm@web62407.mail.re1.yahoo.com>
References: <320fb6e01003010208n13d4de6au2299a2e3221fd2dc@mail.gmail.com> <293938.66771.qm@web62407.mail.re1.yahoo.com>
Message-ID: <320fb6e01003020214k781eff0p77ba5002ca4786ab@mail.gmail.com>

On Tue, Mar 2, 2010 at 10:01 AM, Michiel de Hoon wrote:
>
>> Would you be thinking of having Bio.Blast.read() and
>> parse() only supporting NCBI BLAST+ XML files, or take
>> a format argument like we do for sequences and alignments?
>
> I would support BLAST+ XML files only at first, and add
> parser capability for other formats later if needed. If so,
> I would use a format argument, same as how Bio.SeqIO
> works.

Sounds sensible. Would you be using the existing Record classes to hold the output?

>> The name of Bio.Blast.NCBIStandalone is a historical
>> relic, and I agree should be retired. Can we label the
>> whole of this module as obsolete?
>
> This module also contains the parser for Blast text output,
> so I think we cannot declare it obsolete just yet. However,
> if the XML output of BLAST+ is complete, I don't see the
> need for such a plain-text Blast parser any more.

We've been referring to the plain text BLAST parser as obsolete or deprecated in the documentation for some time now (although there isn't yet an actual deprecation warning issues). So I don't see a problem with calling the whole of Bio.Blast.NCBIStandalone obsolete.

I don't think we can add deprecation warnings to the plain text parser yet. While the XML format(s) are better for parsing, there are still corner cases where the plain text has advantages (file size, BLAST like output from non-NCBI tools like BLAT, NCBI psi-blast output although they have apparently improved the XML here). We also should worry about non-NCBI BLAST tools and their output.

Peter

From mjldehoon at yahoo.com Tue Mar 2 15:45:34 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 2 Mar 2010 07:45:34 -0800 (PST)
Subject: [Biopython] Deprecating Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast functions?
In-Reply-To: <320fb6e01003020214k781eff0p77ba5002ca4786ab@mail.gmail.com>
Message-ID: <12210.49894.qm@web62402.mail.re1.yahoo.com>

> Sounds sensible. Would you be using the existing Record
> classes to hold the output?

Probably not; I don't like the design of the existing Record classes much (in particular, with the Record classes inheriting from Header, DatabaseReport, and Parameters).
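One way to picture the objection to a Record that inherits from Header, DatabaseReport, and Parameters is to contrast it with composition, where a record holds those parts as attributes. The sketch below is only an assumption about what such an alternative might look like; every class and attribute name in it is hypothetical and chosen for illustration, and it is neither the existing Bio.Blast.Record API nor the design that was eventually adopted.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Header:
    # Hypothetical fields describing the program and query.
    application: str = ""
    version: str = ""
    query: str = ""


@dataclass
class DatabaseReport:
    # Hypothetical fields describing the searched database.
    database_name: str = ""
    num_sequences: int = 0


@dataclass
class Parameters:
    # Hypothetical fields describing the search settings.
    matrix: str = ""
    gap_open: int = 0
    gap_extend: int = 0


@dataclass
class Record:
    """A record *has* a header, a report and parameters; it *is* none of them."""
    header: Header = field(default_factory=Header)
    database_report: DatabaseReport = field(default_factory=DatabaseReport)
    parameters: Parameters = field(default_factory=Parameters)
    alignments: List[str] = field(default_factory=list)


record = Record(header=Header(application="BLASTP", query="example_query"))
```

With this layout the attribute path (record.header.application, record.parameters.matrix) shows which part of the report a value came from, and name clashes between the three parts cannot occur.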
This is also a good opportunity to remove inconsistencies between attribute names between the different parsers. The DTD of the blast XML output can help us to decide on appropriate attribute names. That said, I expect that from a user perspective there will be little difference between an old-blast Record and a blast+ Record. For the development, I was thinking of setting up the parser step by step, and to discuss on the mailing list if any potential differences arise with the existing parsers. > We've been referring to the plain text BLAST parser as > obsolete or deprecated in the documentation for some > time now (although there isn't yet an actual deprecation > warning issues). So I don't see a problem with calling > the whole of Bio.Blast.NCBIStandalone obsolete. I don't have any strong objections here, so as far as I am concerned feel free to declare Bio.Blast.NCBIStandalone obsolete. --Michiel From biopython at maubp.freeserve.co.uk Wed Mar 3 13:15:32 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 3 Mar 2010 13:15:32 +0000 Subject: [Biopython] Deprecating Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast functions? In-Reply-To: <12210.49894.qm@web62402.mail.re1.yahoo.com> References: <320fb6e01003020214k781eff0p77ba5002ca4786ab@mail.gmail.com> <12210.49894.qm@web62402.mail.re1.yahoo.com> Message-ID: <320fb6e01003030515o6e1b12bg768318f3d08fc1ef@mail.gmail.com> On Tue, Mar 2, 2010 at 3:45 PM, Michiel de Hoon wrote: > >> Sounds sensible. Would you be using the existing Record >> classes to hold the output? > > Probably not; I don't like the design of the existing Record > classes much (in particular, with the Record classes inheriting > from Header, DatabaseReport, and Parameters). Yes, that is odd. > This is also a good opportunity to remove inconsistencie > between attribute names between the different parsers. > The DTD of the blast XML output can help us to decide > on appropriate attribute names. Again, this is worthwhile (i.e. 
fix Bug 2176). http://bugzilla.open-bio.org/show_bug.cgi?id=2176 > That said, I expect that from a user perspective there will > be little difference between an old-blast Record and a > blast+ Record. For the development, I was thinking of > setting up the parser step by step, and to discuss on the > mailing list if any potential differences arise with the > existing parsers. Great :) >> We've been referring to the plain text BLAST parser as >> obsolete or deprecated in the documentation for some >> time now (although there isn't yet an actual deprecation >> warning issues). So I don't see a problem with calling >> the whole of Bio.Blast.NCBIStandalone obsolete. > > I don't have any strong objections here, so as far as I am > concerned feel free to declare Bio.Blast.NCBIStandalone > obsolete. Done in the repository. Peter From ap12 at sanger.ac.uk Thu Mar 4 13:31:33 2010 From: ap12 at sanger.ac.uk (Anne Pajon) Date: Thu, 4 Mar 2010 13:31:33 +0000 Subject: [Biopython] Could Bio.SeqIO write EMBL file? In-Reply-To: <320fb6e01001120433q72f577efg635ce232c666a46@mail.gmail.com> References: <320fb6e01001060515y23804dd5xebc326647fa975d8@mail.gmail.com> <246181F8-03A2-49B2-960D-1FFBBC3E2865@sanger.ac.uk> <320fb6e01001070808k57311f43tb67c9cf916d27eab@mail.gmail.com> <320fb6e01001071014w50fa780ct2fdaccf2b9272cdc@mail.gmail.com> <320fb6e01001080448g24ff5308gb797f01cbab79196@mail.gmail.com> <320fb6e01001110822q60ac7103ndb3082cc3146735@mail.gmail.com> <320fb6e01001120227s7b9d9ac5pa8a11837e0ee620b@mail.gmail.com> <320fb6e01001120433q72f577efg635ce232c666a46@mail.gmail.com> Message-ID: <7A5595B6-5507-4700-B3A1-56E70F1D61FC@sanger.ac.uk> Dear Peter, Sorry for taking so much time to come back to you. I've managed to fork the biopython repository on github and I think I am ready now to help writing improvements to Bio/SeqIO/InsdcIO.py by adding missing fields on the ID line and adding a PR line. I may look also at the SQ line. Does this sound right to you? Thanks to let me know. 
Kind regards, Anne. On 12 Jan 2010, at 12:33, Peter wrote: > On Tue, Jan 12, 2010 at 10:27 AM, Peter > wrote: >> On Mon, Jan 11, 2010 at 5:32 PM, Anne Pajon >> wrote: >>> Here is the diff between the EMBL output from Bio.SeqIO and the >>> genbank >>> output from Bio.SeqIO converted with the EMBOSS tool to an EMBL >>> file: >>> >>> ... >>> >>> The main differences are on line breaks. >> >> I hadn't yet done a comparison against EMBOSS (what version do you >> have), but yes, it looks like I am wrapping the feature tables >> using a >> shorter line length - we should check that, and it would be easy to >> adjust in Bio/SeqIO/InsdcIO.py > > The spec is pretty clear than the feature lines should be up to 80 > characters. The premature wrapping was because I had been > testing length < 80 instead of <= 80, which is now fixed in git. > > Peter -- Dr Anne Pajon - Pathogen Genomics, Team 81 Sanger Institute, Wellcome Trust Genome Campus, Hinxton Cambridge CB10 1SA, United Kingdom +44 (0)1223 494 798 (office) | +44 (0)7958 511 353 (mobile) -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From biopython at maubp.freeserve.co.uk Thu Mar 4 14:12:47 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 4 Mar 2010 14:12:47 +0000 Subject: [Biopython] Could Bio.SeqIO write EMBL file? 
In-Reply-To: <7A5595B6-5507-4700-B3A1-56E70F1D61FC@sanger.ac.uk> References: <246181F8-03A2-49B2-960D-1FFBBC3E2865@sanger.ac.uk> <320fb6e01001070808k57311f43tb67c9cf916d27eab@mail.gmail.com> <320fb6e01001071014w50fa780ct2fdaccf2b9272cdc@mail.gmail.com> <320fb6e01001080448g24ff5308gb797f01cbab79196@mail.gmail.com> <320fb6e01001110822q60ac7103ndb3082cc3146735@mail.gmail.com> <320fb6e01001120227s7b9d9ac5pa8a11837e0ee620b@mail.gmail.com> <320fb6e01001120433q72f577efg635ce232c666a46@mail.gmail.com> <7A5595B6-5507-4700-B3A1-56E70F1D61FC@sanger.ac.uk> Message-ID: <320fb6e01003040612p43145476vefb0f8582e1d5759@mail.gmail.com> On Thu, Mar 4, 2010 at 1:31 PM, Anne Pajon wrote: > > Dear Peter, > > Sorry for taking so much time to come back to you. > > I've managed to fork the biopython repository on github and I think I am > ready now to help writing improvements to Bio/SeqIO/InsdcIO.py by adding > missing fields on the ID line and adding a PR line. I may look also at the > SQ line. > > Does this sound right to you? Thanks to let me know. > > Kind regards, > Anne. Hi Anne, If you are happy working with git, then showing us fixes there is great. Have a read of these pages before you get going - it should help: http://www.biopython.org/wiki/GitUsage Otherwise patch files are OK - you can attach them to bugs on bugzilla rather than on the mailing list. Thanks. Peter From ap12 at sanger.ac.uk Thu Mar 4 14:26:37 2010 From: ap12 at sanger.ac.uk (Anne Pajon) Date: Thu, 4 Mar 2010 14:26:37 +0000 Subject: [Biopython] Could Bio.SeqIO write EMBL file? 
In-Reply-To: <320fb6e01003040612p43145476vefb0f8582e1d5759@mail.gmail.com> References: <246181F8-03A2-49B2-960D-1FFBBC3E2865@sanger.ac.uk> <320fb6e01001070808k57311f43tb67c9cf916d27eab@mail.gmail.com> <320fb6e01001071014w50fa780ct2fdaccf2b9272cdc@mail.gmail.com> <320fb6e01001080448g24ff5308gb797f01cbab79196@mail.gmail.com> <320fb6e01001110822q60ac7103ndb3082cc3146735@mail.gmail.com> <320fb6e01001120227s7b9d9ac5pa8a11837e0ee620b@mail.gmail.com> <320fb6e01001120433q72f577efg635ce232c666a46@mail.gmail.com> <7A5595B6-5507-4700-B3A1-56E70F1D61FC@sanger.ac.uk> <320fb6e01003040612p43145476vefb0f8582e1d5759@mail.gmail.com> Message-ID: Hi Peter, I'm happy to work with git. I've already read the bioython wiki page on it so I'm hoping to do the right thing. I'm going to commit and push the ID line fix as soon as I am happy with it, and to see if I understood how it should be done. Looking forward to get your feedback. Thanks, Anne. On 4 Mar 2010, at 14:12, Peter wrote: > On Thu, Mar 4, 2010 at 1:31 PM, Anne Pajon wrote: >> >> Dear Peter, >> >> Sorry for taking so much time to come back to you. >> >> I've managed to fork the biopython repository on github and I think >> I am >> ready now to help writing improvements to Bio/SeqIO/InsdcIO.py by >> adding >> missing fields on the ID line and adding a PR line. I may look also >> at the >> SQ line. >> >> Does this sound right to you? Thanks to let me know. >> >> Kind regards, >> Anne. > > Hi Anne, > > If you are happy working with git, then showing us fixes there is > great. > Have a read of these pages before you get going - it should help: > http://www.biopython.org/wiki/GitUsage > > Otherwise patch files are OK - you can attach them to bugs on bugzilla > rather than on the mailing list. > > Thanks. 
> > Peter -- Dr Anne Pajon - Pathogen Genomics, Team 81 Sanger Institute, Wellcome Trust Genome Campus, Hinxton Cambridge CB10 1SA, United Kingdom +44 (0)1223 494 798 (office) | +44 (0)7958 511 353 (mobile) -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From biopython at maubp.freeserve.co.uk Thu Mar 4 19:59:47 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 4 Mar 2010 19:59:47 +0000 Subject: [Biopython] Could Bio.SeqIO write EMBL file? In-Reply-To: References: <320fb6e01001071014w50fa780ct2fdaccf2b9272cdc@mail.gmail.com> <320fb6e01001080448g24ff5308gb797f01cbab79196@mail.gmail.com> <320fb6e01001110822q60ac7103ndb3082cc3146735@mail.gmail.com> <320fb6e01001120227s7b9d9ac5pa8a11837e0ee620b@mail.gmail.com> <320fb6e01001120433q72f577efg635ce232c666a46@mail.gmail.com> <7A5595B6-5507-4700-B3A1-56E70F1D61FC@sanger.ac.uk> <320fb6e01003040612p43145476vefb0f8582e1d5759@mail.gmail.com> Message-ID: <320fb6e01003041159g3c57fd3eq1df1517faa1ec5e7@mail.gmail.com> On Thu, Mar 4, 2010 at 2:26 PM, Anne Pajon wrote: > Hi Peter, > > I'm happy to work with git. I've already read the bioython wiki page on it > so I'm hoping to do the right thing. OK - now you get to try doing a merge (grin), as I have committed the SQ line change (with some minor changes, for example I changed your variable names to keep the line length down). > I'm going to commit and push the ID line fix as soon as I am happy with it, > and to see if I understood how it should be done. Looking forward to get > your feedback. One minor issue is you accidentally checked in the BioSQL database created by the unit tests. I've update the .gitignore file to stop this happening to someone else. 
The EMBL data division stuff makes sense (I simply hadn't gotten round to it when I was doing it for the GenBank output). Some of your other changes need to be co-ordinated with the EMBL (and GenBank) parser. See also Bug 2578, http://bugzilla.open-bio.org/show_bug.cgi?id=2578 In the case of PR (project lines) I think we must be ignoring them at the moment, but to match the GenBank parser the information should be stored in the SeqRecord dbxrefs list not the annotations dictionary. Regards, Peter From biopython at maubp.freeserve.co.uk Thu Mar 4 20:04:55 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 4 Mar 2010 20:04:55 +0000 Subject: [Biopython] Could Bio.SeqIO write EMBL file? In-Reply-To: <320fb6e01003041159g3c57fd3eq1df1517faa1ec5e7@mail.gmail.com> References: <320fb6e01001080448g24ff5308gb797f01cbab79196@mail.gmail.com> <320fb6e01001110822q60ac7103ndb3082cc3146735@mail.gmail.com> <320fb6e01001120227s7b9d9ac5pa8a11837e0ee620b@mail.gmail.com> <320fb6e01001120433q72f577efg635ce232c666a46@mail.gmail.com> <7A5595B6-5507-4700-B3A1-56E70F1D61FC@sanger.ac.uk> <320fb6e01003040612p43145476vefb0f8582e1d5759@mail.gmail.com> <320fb6e01003041159g3c57fd3eq1df1517faa1ec5e7@mail.gmail.com> Message-ID: <320fb6e01003041204m1de894dfic1518d7a1623b8a2@mail.gmail.com> On Thu, Mar 4, 2010 at 7:59 PM, Peter wrote: > On Thu, Mar 4, 2010 at 2:26 PM, Anne Pajon wrote: >> Hi Peter, >> >> I'm happy to work with git. I've already read the bioython wiki page on it >> so I'm hoping to do the right thing. > > OK - now you get to try doing a merge (grin), as I have committed the > SQ line change (with some minor changes, for example I changed your > variable names to keep the line length down). I also noted in a comment that we should perhaps be writing out GenBank and EMBL in lower case (as I noted a while ago on Bug 2999). It will also make counting the bases easier if we only need to look at one case ;) As an EMBL file user, does this seem like the right thing to do? 
Peter From abumustafa3 at gmail.com Thu Mar 4 20:27:59 2010 From: abumustafa3 at gmail.com (Nizar Ghneim) Date: Thu, 4 Mar 2010 14:27:59 -0600 Subject: [Biopython] Error with py2exe and Entrez functions Message-ID: Hello All, I am writing a short script that others would like to use on their own computers. I decided to use the py2exe tool to create an executable. The script runs perfectly in Python, but in my exe file the first line that calls any Bio.Entrez function (such as esearch or efetch) gives me the following error: File "Bio\Entrez\__init__.pyc", line 258, in read > File "Bio\Entrez\Parser.pyc", line 108, in read > File "Bio\Entrez\Parser.pyc", line 377, in externalEntityRefHandler > RuntimeError: Unable to load DTD file eSearch_020511.dtd. > > Bio.Entrez uses NCBI's DTD files to parse XML files returned by NCBI > Entrez. > Though most of NCBI's DTD files are included in the Biopython distribution, > sometimes you may find that a particular DTD file is missing. In such a > case, you can download the DTD file from NCBI and install it manually. > > Usually, you can find missing DTD files at either > http://www.ncbi.nlm.nih.gov/dtd/ > or > http://eutils.ncbi.nlm.nih.gov/entrez/query/DTD/ > If you cannot find eSearch_020511.dtd there, you may also try to search > for it with a search engine such as Google. > > Please save eSearch_020511.dtd in the directory > C:\Python26\dist\library.zip\Bio\Entrez\DTDs > in order for Bio.Entrez to find it. > Alternatively, you can save eSearch_020511.dtd in the directory > Bio/Entrez/DTDs in the Biopython distribution, and reinstall Biopython. > > Please also inform the Biopython developers by sending an email to > biopython-dev at biopython.org to inform us about this missing DTD, so that we > can include it with the next release of Biopython. > It seems to me that py2exe does not bundle the necessary DTD files into the executable. How can I solve this?
Thank you in advance, Nizar Ghneim From biopython at maubp.freeserve.co.uk Thu Mar 4 21:21:17 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 4 Mar 2010 21:21:17 +0000 Subject: [Biopython] Error with py2exe and Entrez functions In-Reply-To: References: Message-ID: <320fb6e01003041321l23526902i55e60a1ef7a12cd3@mail.gmail.com> On Thu, Mar 4, 2010 at 8:27 PM, Nizar Ghneim wrote: > Hello All, > > I am writing a short script that others would like to use on their own > computers. I decided to use the py2exe tool to create an executable. > ... > It seems to me that the py2exe compiler does not grab the necessary > DTD files. How can I solve this? That does sound like the problem. Have you searched the py2exe documentation for how to specify extra files like this - others tools must have similar needs. Peter From biopython at maubp.freeserve.co.uk Thu Mar 4 21:50:16 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 4 Mar 2010 21:50:16 +0000 Subject: [Biopython] Could Bio.SeqIO write EMBL file? In-Reply-To: References: <320fb6e01001071014w50fa780ct2fdaccf2b9272cdc@mail.gmail.com> <320fb6e01001080448g24ff5308gb797f01cbab79196@mail.gmail.com> <320fb6e01001110822q60ac7103ndb3082cc3146735@mail.gmail.com> <320fb6e01001120227s7b9d9ac5pa8a11837e0ee620b@mail.gmail.com> <320fb6e01001120433q72f577efg635ce232c666a46@mail.gmail.com> <7A5595B6-5507-4700-B3A1-56E70F1D61FC@sanger.ac.uk> <320fb6e01003040612p43145476vefb0f8582e1d5759@mail.gmail.com> Message-ID: <320fb6e01003041350r127a463fga8e9a315d298a0bc@mail.gmail.com> Hi Anne, You had a comment in your change about the RA line, where you tried to append the semi-colon on output. The reason this broke the unit tests was that the EMBL parser was not removing the semi-colon. I've fixed this now - thanks for flagging this issue. 
Peter From ap12 at sanger.ac.uk Fri Mar 5 14:26:57 2010 From: ap12 at sanger.ac.uk (Anne Pajon) Date: Fri, 5 Mar 2010 14:26:57 +0000 Subject: [Biopython] Could Bio.SeqIO write EMBL file? In-Reply-To: <320fb6e01003041204m1de894dfic1518d7a1623b8a2@mail.gmail.com> References: <320fb6e01001080448g24ff5308gb797f01cbab79196@mail.gmail.com> <320fb6e01001110822q60ac7103ndb3082cc3146735@mail.gmail.com> <320fb6e01001120227s7b9d9ac5pa8a11837e0ee620b@mail.gmail.com> <320fb6e01001120433q72f577efg635ce232c666a46@mail.gmail.com> <7A5595B6-5507-4700-B3A1-56E70F1D61FC@sanger.ac.uk> <320fb6e01003040612p43145476vefb0f8582e1d5759@mail.gmail.com> <320fb6e01003041159g3c57fd3eq1df1517faa1ec5e7@mail.gmail.com> <320fb6e01003041204m1de894dfic1518d7a1623b8a2@mail.gmail.com> Message-ID: <1D4AAD34-592F-48D6-BCB1-79019DA5A5C9@sanger.ac.uk> On 4 Mar 2010, at 20:04, Peter wrote: > On Thu, Mar 4, 2010 at 7:59 PM, Peter > wrote: >> On Thu, Mar 4, 2010 at 2:26 PM, Anne Pajon wrote: >>> Hi Peter, >>> >>> I'm happy to work with git. I've already read the bioython wiki >>> page on it >>> so I'm hoping to do the right thing. >> >> OK - now you get to try doing a merge (grin), as I have committed the >> SQ line change (with some minor changes, for example I changed your >> variable names to keep the line length down). > > I also noted in a comment that we should perhaps be writing out > GenBank and EMBL in lower case (as I noted a while ago on Bug > 2999). It will also make counting the bases easier if we only need > to look at one case ;) > > As an EMBL file user, does this seem like the right thing to do? I do not really mind one way or another. EMBOSS seems to write the sequence all in upper case. Anne. 
> > Peter -- Dr Anne Pajon - Pathogen Genomics, Team 81 Sanger Institute, Wellcome Trust Genome Campus, Hinxton Cambridge CB10 1SA, United Kingdom +44 (0)1223 494 798 (office) | +44 (0)7958 511 353 (mobile) -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From ap12 at sanger.ac.uk Fri Mar 5 14:28:44 2010 From: ap12 at sanger.ac.uk (Anne Pajon) Date: Fri, 5 Mar 2010 14:28:44 +0000 Subject: [Biopython] Could Bio.SeqIO write EMBL file? In-Reply-To: <320fb6e01003041350r127a463fga8e9a315d298a0bc@mail.gmail.com> References: <320fb6e01001071014w50fa780ct2fdaccf2b9272cdc@mail.gmail.com> <320fb6e01001080448g24ff5308gb797f01cbab79196@mail.gmail.com> <320fb6e01001110822q60ac7103ndb3082cc3146735@mail.gmail.com> <320fb6e01001120227s7b9d9ac5pa8a11837e0ee620b@mail.gmail.com> <320fb6e01001120433q72f577efg635ce232c666a46@mail.gmail.com> <7A5595B6-5507-4700-B3A1-56E70F1D61FC@sanger.ac.uk> <320fb6e01003040612p43145476vefb0f8582e1d5759@mail.gmail.com> <320fb6e01003041350r127a463fga8e9a315d298a0bc@mail.gmail.com> Message-ID: Hi Peter, I saw your fix. Maybe would be better to have: self._write_multi_line("RA", "%s;" % ref.authors) instead of self._write_multi_line("RA", ref.authors+";") But it is a very minor detail. Thanks for fixing it. Kind regards, Anne. On 4 Mar 2010, at 21:50, Peter wrote: > Hi Anne, > > You had a comment in your change about the RA line, where you tried > to append the semi-colon on output. The reason this broke the unit > tests was that the EMBL parser was not removing the semi-colon. > I've fixed this now - thanks for flagging this issue. 
> > Peter -- Dr Anne Pajon - Pathogen Genomics, Team 81 Sanger Institute, Wellcome Trust Genome Campus, Hinxton Cambridge CB10 1SA, United Kingdom +44 (0)1223 494 798 (office) | +44 (0)7958 511 353 (mobile) -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From ap12 at sanger.ac.uk Fri Mar 5 17:34:24 2010 From: ap12 at sanger.ac.uk (Anne Pajon) Date: Fri, 5 Mar 2010 17:34:24 +0000 Subject: [Biopython] Could Bio.SeqIO write EMBL file? In-Reply-To: <320fb6e01003050814nddcf6a2x3402c9d3f535eb10@mail.gmail.com> References: <320fb6e01001120227s7b9d9ac5pa8a11837e0ee620b@mail.gmail.com> <320fb6e01001120433q72f577efg635ce232c666a46@mail.gmail.com> <7A5595B6-5507-4700-B3A1-56E70F1D61FC@sanger.ac.uk> <320fb6e01003040612p43145476vefb0f8582e1d5759@mail.gmail.com> <320fb6e01003041159g3c57fd3eq1df1517faa1ec5e7@mail.gmail.com> <3449050D-E60E-4FB5-AA82-E32A8F131DCA@sanger.ac.uk> <320fb6e01003050713w5e58073fydc2108af73d61dfb@mail.gmail.com> <0486B125-79A5-41BE-85E5-625903BFED04@sanger.ac.uk> <320fb6e01003050814nddcf6a2x3402c9d3f535eb10@mail.gmail.com> Message-ID: Hi Peter, I've tested the PR line with dbxrefs and it works fine, thanks. I've sent you a request for improving the writing of the references by adding the RG line. I've CC the list again... sorry for not having done so for two replies. Kind regards, Anne. On 5 Mar 2010, at 16:14, Peter wrote: > On Fri, Mar 5, 2010 at 4:02 PM, Anne Pajon wrote: >> >> On 5 Mar 2010, at 15:13, Peter wrote: >> >>> On Fri, Mar 5, 2010 at 2:24 PM, Anne Pajon >>> wrote: >>>>> >>>>> In the case of PR (project lines) I think we must be ignoring >>>>> them at >>>>> the moment, but to match the GenBank parser the information should >>>>> be stored in the SeqRecord dbxrefs list not the annotations >>>>> dictionary. 
>>>> >>>> Would be great to have a place where to store the PR line. >>> >>> Perhaps I was unclear - we do have a place to store the PR line, the >>> SeqRecord's dbxrefs list (following how the GenBank parser stores >>> the project information). >> >> Sorry I did not understood that. Great if I could do it with >> dbxrefs. I'll >> try right now then. >> >>> >>> Getting the EMBL parser to do the same was trivial, although this >>> does make doing the output a tiny bit more complex. See github. >>> >> >> I will have a look. > > I meant I just did this and checked in the change to github ;) > > Thanks for the example - I'll take a look. > > Regarding the mailing list, you probably just clicked on "reply" > rather than "reply all" so it came to just me. > > Thanks, > > Peter -- Dr Anne Pajon - Pathogen Genomics, Team 81 Sanger Institute, Wellcome Trust Genome Campus, Hinxton Cambridge CB10 1SA, United Kingdom +44 (0)1223 494 798 (office) | +44 (0)7958 511 353 (mobile) -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From biopython at maubp.freeserve.co.uk Fri Mar 5 17:50:29 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 5 Mar 2010 17:50:29 +0000 Subject: [Biopython] Could Bio.SeqIO write EMBL file? 
In-Reply-To: References: <7A5595B6-5507-4700-B3A1-56E70F1D61FC@sanger.ac.uk> <320fb6e01003040612p43145476vefb0f8582e1d5759@mail.gmail.com> <320fb6e01003041159g3c57fd3eq1df1517faa1ec5e7@mail.gmail.com> <3449050D-E60E-4FB5-AA82-E32A8F131DCA@sanger.ac.uk> <320fb6e01003050713w5e58073fydc2108af73d61dfb@mail.gmail.com> <0486B125-79A5-41BE-85E5-625903BFED04@sanger.ac.uk> <320fb6e01003050814nddcf6a2x3402c9d3f535eb10@mail.gmail.com> Message-ID: <320fb6e01003050950l2fd94d4cx315ecc408ab2a577@mail.gmail.com> On Fri, Mar 5, 2010 at 5:34 PM, Anne Pajon wrote: > Hi Peter, > > I've tested the PR line with dbxrefs and it works fine, thanks. Great. > I've sent you a request for improving the writing of the references by > adding the RG line. I've merged that (using a git cherry-pick) and added support for parsing the RG lines too. I'm pleased you seem to be doing such a good job identifying these little issues with the new EMBL code :) Thank you, Peter From daniel at dim.fm.usp.br Fri Mar 5 21:35:57 2010 From: daniel at dim.fm.usp.br (Daniel Silvestre) Date: Fri, 05 Mar 2010 18:35:57 -0300 Subject: [Biopython] SFF parser Message-ID: <4B91793D.5060001@dim.fm.usp.br> Hi biopythonists, Does anyone have information about the status of the future SFF parser? Regards, Daniel -- +---------------------------------------+ Daniel de A. M. M. Silvestre LIM01 - Laboratório de Informática Médica - HCFMUSP Sala 1349 - Depto. de Patologia Faculdade de Medicina Universidade de São Paulo Av. Dr. Arnaldo, 455 | e-mail: daniel at dim.fm.usp.br Cerqueira César | Tel: +55-11-3061-7381 01246-903 - São Paulo - SP | Cel: +55-11-8042-9369 BRASIL | Skype: jarretinha ---------------------------------------------------------------------
This message may contain confidential information. If you are not the addressee or authorized person to receive it for the addressee, you must not use, copy, disclose or take any action based on this message or any information herein. If you have received this message in error, please advise the sender immediately by replying to this e-mail message and delete it. Thanks in advance for your cooperation. ---------------------------------------------------------------------- DIM Faculdade de Medicina USP ---------------------------------------------------------------------- -------------- next part -------------- A non-text attachment was scrubbed... Name: daniel.vcf Type: text/x-vcard Size: 375 bytes Desc: not available URL: From biopython at maubp.freeserve.co.uk Sat Mar 6 00:12:34 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 6 Mar 2010 00:12:34 +0000 Subject: [Biopython] SFF parser In-Reply-To: <4B91793D.5060001@dim.fm.usp.br> References: <4B91793D.5060001@dim.fm.usp.br> Message-ID: <320fb6e01003051612p51c003f2g7ce154498c7fb97f@mail.gmail.com> 2010/3/5 Daniel Silvestre : > Hi biopythonists, > > Does anyone have information about the status of the future SFF parser? > > Regards, > Daniel Hi Daniel, The code was recently merged into the master branch and will be included with our next release (Biopython 1.54). There has been discussion and some useful feedback already on the dev mailing list - more would be great. If you are happy to install from source, you can try it out now. The latest version of the tutorial (with the source code, not yet published) has a brief example in the cookbook chapter, but the module docstrings are quite extensive.
Once installed, try: from Bio.SeqIO import SffIO help(SffIO) Or, just had a read of the code online on github or here: http://biopython.org/SRC/biopython/Bio/SeqIO/SffIO.py Peter P.S. The vcard attachment on your email (file daniel.vcf) seems to mean your emails get held in the moderation queue. From aloraine at gmail.com Sun Mar 7 14:55:19 2010 From: aloraine at gmail.com (Ann Loraine) Date: Sun, 7 Mar 2010 09:55:19 -0500 Subject: [Biopython] how to get the hit length from Bio.Blast.NCBIXML? Message-ID: <83722dde1003070655k7f328600w5320367a688112de@mail.gmail.com> Hello, I'm using Bio.Blast.NCBIXML to parse blastx results for an annotation project. I'm searching contig consensus sequences (assembled from 454 reads) against a protein database. Since these are assembled ESTs and may be incomplete, I need to know how much of a matched sequence was included in the alignment so that I can compute the percent coverage of both the hit and query. How do I retrieve the "hit length" from the objects returned by the parser? I couldn't find anything in the record and alignment objects that contains this information -- if it is not there, should it be added? The hit length appears in the XML: *cut* 3 lcl|3_0 Both_1_c25003 422 1 gnl|BL_ORD_ID|12864 gi|255551002|ref|XP_002516549.1| catalytic, putative [Ricinus communis] 12864 431 1 112.079 *paste* Best, Ann Loraine From p.j.a.cock at googlemail.com Sun Mar 7 16:06:22 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 7 Mar 2010 16:06:22 +0000 Subject: [Biopython] how to get the hit length from Bio.Blast.NCBIXML? In-Reply-To: <83722dde1003070655k7f328600w5320367a688112de@mail.gmail.com> References: <83722dde1003070655k7f328600w5320367a688112de@mail.gmail.com> Message-ID: <320fb6e01003070806o5918743bp72ecb0311cd6b2c2@mail.gmail.com> On Sun, Mar 7, 2010 at 2:55 PM, Ann Loraine wrote: > Hello, > > I'm using Bio.Blast.NCBIXML to parse blastx results for an annotation > project. 
I'm searching contig consensus sequences (assembled from 454 > reads) against a protein database. > > Since these are assembled ESTs and may be incomplete, I need to know > how much of a matched sequence was included in the alignment so that I > can compute the percent coverage of both the hit and query. > > How do I retrieve the "hit length" from the objects returned by the parser? > > I couldn't find anything in the record and alignment objects that > contains this information -- if it is not there, should it be added? Hi Ann, I think you are looking for the BLAST alignment's length attribute, or perhaps the HSP's align_length attribute. Peter From anaryin at gmail.com Tue Mar 9 22:21:27 2010 From: anaryin at gmail.com (João Rodrigues) Date: Tue, 9 Mar 2010 14:21:27 -0800 Subject: [Biopython] Bio.PDB Module - Atom.set_serial_number function - usage? Message-ID: Hello all, Maybe I'm getting this wrong to begin with but bear with me. I'm trying to renumber the atoms in a PDB file (atoms, not residues). I found a method called get_serial_number that gives me back the atom number, and another called set_serial_number that allows me to change this value. It works wonders for the Structure object, but when I save it with the PDBIO module to a PDB file, it resets the numbering. I checked the code of PDBIO and apparently, it has hard-coded a resetting of the atom number. My question is, what is this set_serial_number for then? Is there a way for me to override this easily? Regards, João [...] Rodrigues @ http://stanford.edu/~joaor/ From vincent at vincentdavis.net Wed Mar 10 03:46:56 2010 From: vincent at vincentdavis.net (Vincent Davis) Date: Tue, 9 Mar 2010 20:46:56 -0700 Subject: [Biopython] matching sequences from fasta files Message-ID: <77e831101003091946r630ad89bt2a9a462100db4cce@mail.gmail.com> Let me first say that I am new to Biopython and DNA/FASTA files.
I have been trying to use blastall to get the results I need, but I am doing most of my work in Python, so why use blastall if I can get the results using Python? I need to check if any/all of the sequences from one FASTA file are in another. Looking through the docs I think I could do this. I then want to find "close matches", and for me this means they differ by 1 SNP and I need to know the location of this differing SNP. How would I do this? *Vincent Davis 720-301-3003 * vincent at vincentdavis.net From biopython at maubp.freeserve.co.uk Wed Mar 10 10:31:17 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 10 Mar 2010 10:31:17 +0000 Subject: [Biopython] matching sequences from fasta files In-Reply-To: <77e831101003091946r630ad89bt2a9a462100db4cce@mail.gmail.com> References: <77e831101003091946r630ad89bt2a9a462100db4cce@mail.gmail.com> Message-ID: <320fb6e01003100231l2f8d16fl17ef4be94d4e093c@mail.gmail.com> On Wed, Mar 10, 2010 at 3:46 AM, Vincent Davis wrote: > Let me first say that I am new to Biopython and DNA/FASTA files. I have been > trying to use blastall to get the results I need, but I am doing most of my > work in Python, so why use blastall if I can get the results using Python? > > I need to check if any/all of the sequences from one FASTA file are in another. > Looking through the docs I think I could do this. > > I then want to find "close matches", and for me this means they differ by 1 > SNP and I need to know the location of this differing SNP. How would I do > this? If you want "close matches", then using a command line tool like BLAST (or FASTA, or needle etc) may be the fastest option. You can call these tools from a Python script, and parse their output within the script. (This is probably what you are already doing.) If you want to, you can do pairwise sequence alignment from within Biopython with the Bio.pairwise2 module (it uses C for speed).
This isn't covered in the tutorial, read the module documentation: http://www.biopython.org/DIST/docs/api/Bio.pairwise2-module.html For the special case of looking for perfect matches, you would be fine with just Python - depending on your data files, you may be able to match on the record identifiers or simply do string comparisons of the sequences. If you know in advance the pattern of SNPs, then you would be able to efficiently search for them using a regular expression. However, it sounds like you are doing SNP discovery. Here too there should be existing command line tools designed for just this task (and described in the literature). Regards, Peter From ivan at biodec.com Wed Mar 10 11:15:38 2010 From: ivan at biodec.com (Ivan Rossi) Date: Wed, 10 Mar 2010 12:15:38 +0100 (CET) Subject: [Biopython] matching sequences from fasta files In-Reply-To: <320fb6e01003100231l2f8d16fl17ef4be94d4e093c@mail.gmail.com> References: <77e831101003091946r630ad89bt2a9a462100db4cce@mail.gmail.com> <320fb6e01003100231l2f8d16fl17ef4be94d4e093c@mail.gmail.com> Message-ID: On Wed, 10 Mar 2010, Peter wrote: > For the special case of looking for perfect matches, you would be fine > with just Python - depending on your data files, you may be able to > match on the record identifiers Don't trust that. We have seen many many times the sequence change over time (in different releases of the databases) while keeping the same id. it is much more robust to compare SHA1 (or MD5) hashes of the sequence, or do string comparisons. > or simply do string comparisons of the sequences. This is OK. 
-- Ivan Rossi, PhD - ivan AT biodec dot com OR ivan dot rossi3 AT unibo dot it BioDec Srl, Via Calzavecchio 20/2, I-40033 Casalecchio di Reno (BO), Italy Phone: (+39)-051-0548263 - Fax: (+39)-051-7459582 - http://www.biodec.com From biopython at maubp.freeserve.co.uk Wed Mar 10 13:00:15 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 10 Mar 2010 13:00:15 +0000 Subject: [Biopython] matching sequences from fasta files In-Reply-To: References: <77e831101003091946r630ad89bt2a9a462100db4cce@mail.gmail.com> <320fb6e01003100231l2f8d16fl17ef4be94d4e093c@mail.gmail.com> Message-ID: <320fb6e01003100500o78f6d622y49ee29e9f0deee01@mail.gmail.com> On Wed, Mar 10, 2010 at 11:15 AM, Ivan Rossi wrote: > On Wed, 10 Mar 2010, Peter wrote: > >> For the special case of looking for perfect matches, you would be fine >> with just Python - depending on your data files, you may be able to >> match on the record identifiers > > Don't trust that. We have seen many many times the sequence change > over time (in different releases of the databases) while keeping the same id. Yes, be cautious about blindly matching on just the identifier. That's why I said "may" ;) > it is much more robust to compare SHA1 (or MD5) hashes of the > sequence, or do string comparisons. MD5 is known to have collisions, but Sebastián Bassi added support in Biopython for the GCG and SEGUID checksums, e.g. see: from Bio.SeqUtils.CheckSum import seguid help(seguid) SEGUID uses SHA1 internally, and takes care of the letter case (upper versus lower) for you.
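A minimal sketch of the hash-based comparison discussed above, using only the standard library. It is not the SEGUID function itself: it takes a plain SHA-1 hex digest of the upper-cased sequence, and the dictionaries of sequences are hypothetical toy data standing in for two parsed FASTA files.

```python
import hashlib


def sequence_digest(sequence):
    """Return the SHA-1 hex digest of a sequence, ignoring letter case."""
    normalised = sequence.strip().upper()
    return hashlib.sha1(normalised.encode("ascii")).hexdigest()


def shared_sequences(first, second):
    """Map each identifier in `first` to the identifiers in `second`
    that carry exactly the same sequence (compared by digest)."""
    by_digest = {}
    for name, seq in second.items():
        by_digest.setdefault(sequence_digest(seq), []).append(name)
    return {name: by_digest.get(sequence_digest(seq), [])
            for name, seq in first.items()}


# Hypothetical toy data; real input would come from parsing the FASTA files.
probes = {"probe1": "ACGTACGT", "probe2": "TTTTGGGG"}
genome_fragments = {"fragA": "acgtacgt", "fragB": "CCCCAAAA"}

matches = shared_sequences(probes, genome_fragments)
print(matches)
```

Because the comparison is on the digest of the sequence and not on the record identifier, an entry whose sequence changed between database releases while keeping its id would simply stop matching.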
Peter From ismail.fsr at gmail.com Wed Mar 10 12:57:15 2010 From: ismail.fsr at gmail.com (ismail kaarouch) Date: Wed, 10 Mar 2010 12:57:15 +0000 Subject: [Biopython] help Message-ID: <2eb473a91003100457n2889f502x965af00cc259c63a@mail.gmail.com> When I import Translate from Bio I get this message, and I will be forced to reinstall the Biopython software, so I need your help. IDLE 2.6.2 >>> from Bio import Translate Warning (from warnings module): File "C:\Python26\lib\site-packages\Bio\Translate.py", line 23 DeprecationWarning) DeprecationWarning: Bio.Translate and Bio.Transcribe are deprecated, and will be removed in a future release of Biopython. Please use the functions or object methods defined in Bio.Seq instead (described in the tutorial). If you want to continue to use this code, please get in contact with the Biopython developers via the mailing lists to avoid its permanent removal from Biopython. From biopython at maubp.freeserve.co.uk Wed Mar 10 13:11:59 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 10 Mar 2010 13:11:59 +0000 Subject: [Biopython] help In-Reply-To: <2eb473a91003100457n2889f502x965af00cc259c63a@mail.gmail.com> References: <2eb473a91003100457n2889f502x965af00cc259c63a@mail.gmail.com> Message-ID: <320fb6e01003100511j31db9127qe5fdfbe3a9725445@mail.gmail.com> On Wed, Mar 10, 2010 at 12:57 PM, ismail kaarouch wrote: > When I import Translate from Bio I get this message, and I will be > forced to reinstall the Biopython software, > so I need your help. > > IDLE 2.6.2 >>>> from Bio import Translate > > Warning (from warnings module): > File "C:\Python26\lib\site-packages\Bio\Translate.py", line 23 > DeprecationWarning) > DeprecationWarning: Bio.Translate and Bio.Transcribe are deprecated, and > will be removed in a future release of Biopython. Please use the functions > or object methods defined in Bio.Seq instead (described in the tutorial).
If > you want to continue to use this code, please get in contact with the > Biopython developers via the mailing lists to avoid its permanent removal > from Biopython. Hi Ismail, This warning is saying you shouldn't be using Bio.Translate (it will be removed from Biopython). Are you reading an out of date tutorial? The current tutorial is here: http://www.biopython.org/DIST/docs/tutorial/Tutorial.html http://www.biopython.org/DIST/docs/tutorial/Tutorial.pdf Peter From p.j.a.cock at googlemail.com Wed Mar 10 14:30:57 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 10 Mar 2010 14:30:57 +0000 Subject: [Biopython] Biopython & Google Summer of Code 2010 (GSoc) Message-ID: <320fb6e01003100630o6ec5f2aao5053c165f4504b89@mail.gmail.com> Dear Biopythoneers, The Open Bioinformatics Foundation (the Bio* umbrella organisation) is preparing an application for the 2010 Google Summer of Code (GSoC). http://code.google.com/soc/ If you are interested in becoming a mentor for a Biopython related project, you can join us in the application. If you are a student and are interested in a project (or would like to propose one), please take a look at these pages: http://www.open-bio.org/wiki/Google_Summer_of_Code http://biopython.org/wiki/Google_Summer_of_Code Regards, Brad & Peter From cjfields at illinois.edu Wed Mar 10 14:31:39 2010 From: cjfields at illinois.edu (Chris Fields) Date: Wed, 10 Mar 2010 08:31:39 -0600 Subject: [Biopython] matching sequences from fasta files In-Reply-To: References: <77e831101003091946r630ad89bt2a9a462100db4cce@mail.gmail.com> <320fb6e01003100231l2f8d16fl17ef4be94d4e093c@mail.gmail.com> Message-ID: On Mar 10, 2010, at 5:15 AM, Ivan Rossi wrote: > On Wed, 10 Mar 2010, Peter wrote: > >> For the special case of looking for perfect matches, you would be fine >> with just Python - depending on your data files, you may be able to >> match on the record identifiers > > Don't trust that. 
> We have seen many many times the sequence change over time (in different releases of the databases) while keeping the same id.

If the database has a proper versioning scheme or date information this should be detectable, otherwise I agree.

> it is much more robust to compare SHA1 (or MD5) hashes of the sequence, or do string comparisons.

Agreed there; it's probably the only foolproof way.

>> or simply do string comparisons of the sequences.
>
> This is OK.
>
> --
> Ivan Rossi, PhD - ivan AT biodec dot com OR ivan dot rossi3 AT unibo dot it
> BioDec Srl, Via Calzavecchio 20/2, I-40033 Casalecchio di Reno (BO), Italy
> Phone: (+39)-051-0548263 - Fax: (+39)-051-7459582 - http://www.biodec.com

chris
(peeking in from bioperl ;)

From vincent at vincentdavis.net Wed Mar 10 15:19:00 2010
From: vincent at vincentdavis.net (Vincent Davis)
Date: Wed, 10 Mar 2010 08:19:00 -0700
Subject: [Biopython] matching sequences from fasta files
In-Reply-To: <320fb6e01003100500o78f6d622y49ee29e9f0deee01@mail.gmail.com>
References: <77e831101003091946r630ad89bt2a9a462100db4cce@mail.gmail.com>
 <320fb6e01003100231l2f8d16fl17ef4be94d4e093c@mail.gmail.com>
 <320fb6e01003100500o78f6d622y49ee29e9f0deee01@mail.gmail.com>
Message-ID: <77e831101003100719p1ecd6ea1td784c90acf20d0cd@mail.gmail.com>

I am considering using just Python and regular expressions. BLAST is great, but I don't seem to be able to easily filter its output to get only close matches that differ at 1 SNP.

I have a custom microarray and a list of the sequences it will bind. I need to test whether they are in the genome of Toxoplasma gondii (just yes or no), whether there are close matches (differing at 1 SNP), and where the difference is in the sequence.

So from reading the responses, I should consider Python's re module, or look more into FASTA or needle, to see if I can get my version of a close match from them. Is this right? Like I said, I am very new to this; I just got called in to get this project done.
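
For the whole-record case discussed above (is a sequence from one FASTA file present, identically, in another), the hash comparison could be sketched roughly as follows. This is only an illustration using the standard library: the function names are made up, the tiny FASTA reader stands in for a real parser such as Bio.SeqIO, and it assumes plain ASCII sequences where upper and lower case should count as equal.

```python
import hashlib


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def sequence_digest(seq):
    """SHA1 hex digest of the sequence, ignoring case."""
    return hashlib.sha1(seq.upper().encode("ascii")).hexdigest()


def shared_sequences(fasta_a, fasta_b):
    """Headers of records in fasta_a whose sequence also occurs in fasta_b."""
    digests_b = {sequence_digest(seq) for _, seq in read_fasta(fasta_b)}
    return [name for name, seq in read_fasta(fasta_a)
            if sequence_digest(seq) in digests_b]
```

Matching on the digest rather than the identifier sidesteps the "same id, changed sequence" problem raised earlier. It only answers the exact-match question, though; it says nothing about probes that occur inside a longer genome sequence or about near matches.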
*Vincent Davis
720-301-3003 *
vincent at vincentdavis.net
my blog | LinkedIn

On Wed, Mar 10, 2010 at 6:00 AM, Peter wrote:

> On Wed, Mar 10, 2010 at 11:15 AM, Ivan Rossi wrote:
> > On Wed, 10 Mar 2010, Peter wrote:
> >
> >> For the special case of looking for perfect matches, you would be fine
> >> with just Python - depending on your data files, you may be able to
> >> match on the record identifiers
> >
> > Don't trust that. We have seen many many times the sequence change
> > over time (in different releases of the databases) while keeping the same
> id.
>
> Yes, be cautious about blindly matching on just the identifier.
> That's why I said "may" ;)
>
> > it is much more robust to compare SHA1 (or MD5) hashes of the
> > sequence, or do string comparisons.
>
> MD5 is known to have collisions, but Sebastián Bassi added support
> in Biopython for the GCG and SEGUID checksums, e.g. see:
>
> from Bio.SeqUtils.CheckSum import seguid
> help(seguid)
>
> SHA1 is used by SEGUID internally, taking care of the case.
>
> Peter
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From biopython at maubp.freeserve.co.uk Wed Mar 10 16:29:17 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 10 Mar 2010 16:29:17 +0000
Subject: [Biopython] matching sequences from fasta files
In-Reply-To: <77e831101003100719p1ecd6ea1td784c90acf20d0cd@mail.gmail.com>
References: <77e831101003091946r630ad89bt2a9a462100db4cce@mail.gmail.com>
 <320fb6e01003100231l2f8d16fl17ef4be94d4e093c@mail.gmail.com>
 <320fb6e01003100500o78f6d622y49ee29e9f0deee01@mail.gmail.com>
 <77e831101003100719p1ecd6ea1td784c90acf20d0cd@mail.gmail.com>
Message-ID: <320fb6e01003100829v764fc89am7350f40cc10b6936@mail.gmail.com>

On Wed, Mar 10, 2010 at 3:19 PM, Vincent Davis wrote:
> I am considering just using just python and regular expression.
> Blast is
> great but I don't seem to be able to easily filter it to get only close
> matched that differ at 1 snp.
> I have a custom microarray and a list of the sequences it will bind. I need
> to test if they are in the genome of toxoplasma gondii (just yes or no) and
> if there are close matches (differ at 1 snp) and where the diff is in the
> sequence.
>
> So from reading the responses I should consider python.re. or look more into
> FASTA or needle. to see if i can get my version of a close match from them.
> Is this right? Like I said I am very new to this, just got called in to get
> this project done.

Using BLAST / FASTA / needle / any pairwise alignment tool is going to boil down to running the tool and parsing its output to filter out what you want. I don't think any of these general purpose tools allow for a "single base pair difference" threshold. This approach should work though.

If you want to allow a single mismatch anywhere in the sequence, I'm not sure regular expressions are ideal either. If you wanted to look for matches with a single mismatch at a particular point (i.e. a known SNP) then a regular expression would work fine.

However, you might have more success with software designed for second generation sequencing - there are certainly similarities to mapping short reads (e.g. Solexa/Illumina data) to a reference genome. You might also be able to use software designed to look for primer matches (again, these are short sequences).

Just some ideas...

Peter

From lpritc at scri.ac.uk Wed Mar 10 15:53:45 2010
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Wed, 10 Mar 2010 15:53:45 +0000
Subject: [Biopython] matching sequences from fasta files
In-Reply-To: <77e831101003091946r630ad89bt2a9a462100db4cce@mail.gmail.com>
Message-ID:

Hi,

On 10/03/2010 Wednesday, March 10, 03:46, "Vincent Davis" wrote:

> I need to check if any/all the sequence from one fasta file are in another.
> Looking through the docs I think I could do this.
As others have pointed out, a simple string comparison will do this. > I then what to find "close matches" and for me this means they differ by 1 > snp and I need to know the location of this differing snp. How would I do > this? There are many ways in which this *could* be done. You probably want one that is quite quick, though If I never needed to do this again, I would probably run BLAST or FASTA (or my favourite search algorithm, running ungapped) using one set of sequences as a query, and the other as the target database, using the program parameters to report only one match each time. I'd then use Python to parse the results, throwing away all those matches where i) if the number of aligned bases is the same as the number of bases in the query: the number of match identities differs from the number of aligned bases by more than one ii) if the number of aligned bases differs from the number of bases in the query by exactly one: the number of match identities differs from the number of aligned bases iii) the number of aligned bases differs from the number of bases in the query by two or more The remainder should be your set of (almost) full-length 1/0 SNP matches, and there should be enough data in your search program output to identify the location of the SNP. I think it would be faster to use something off-the-shelf like BLAST and parse the output, than to write something to do the search. It will probably work quicker, too. Lots of ways to do this repeatably, including writing a generator function. I hope this is useful, L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. 
Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). ______________________________________________________ From dalloliogm at gmail.com Wed Mar 10 17:27:50 2010 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Wed, 10 Mar 2010 18:27:50 +0100 Subject: [Biopython] matching sequences from fasta files In-Reply-To: <77e831101003100719p1ecd6ea1td784c90acf20d0cd@mail.gmail.com> References: <77e831101003091946r630ad89bt2a9a462100db4cce@mail.gmail.com> <320fb6e01003100231l2f8d16fl17ef4be94d4e093c@mail.gmail.com> <320fb6e01003100500o78f6d622y49ee29e9f0deee01@mail.gmail.com> <77e831101003100719p1ecd6ea1td784c90acf20d0cd@mail.gmail.com> Message-ID: <5aa3b3571003100927l382ec1c1w3dd81a61372a7660@mail.gmail.com> On Wed, Mar 10, 2010 at 4:19 PM, Vincent Davis wrote: > I am considering just using just python and regular expression. Blast is > great but I don't seem to be able to easily filter it to get only close > matched that differ at 1 snp. 
I am not sure I followed all the discussion in this topic, but if you want to find sequences that differ at one or two positions and you don't need to do it in any explicit biological context, you may look for algorithms that do fuzzy matching, like agrep.

One example may be this module:
- http://www.personal.psu.edu/iua1/libs/apse.html
which, as you can read, is outdated and probably won't work properly, but it is based on a C library which may have been implemented in other Python modules.
I would look for this and also do a Google/Yahoo/any other search for 'string fuzzy matching python' or similar; I am sure you can find a lot of literature and modules about that.

If you are comfortable with the Unix shell, you will probably be able to implement your whole pipeline with some EMBOSS tool to read the sequences and agrep for the matching.

Anyway, I didn't understand your use case very well, and I am sure that if you look harder on the Internet you can find some tool that does this already, without having to write a new script and test it. Using an existing tool would be better, for you and for the people who will read your papers.

--
Giovanni Dall'Olio, phd student
Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain)

My blog on bioinformatics: http://bioinfoblog.it

From vincent at vincentdavis.net Wed Mar 10 18:10:20 2010
From: vincent at vincentdavis.net (Vincent Davis)
Date: Wed, 10 Mar 2010 11:10:20 -0700
Subject: [Biopython] matching sequences from fasta files
In-Reply-To:
References: <77e831101003091946r630ad89bt2a9a462100db4cce@mail.gmail.com>
Message-ID: <77e831101003101010i1e00c805jdf2269a9b86a9cf@mail.gmail.com>

@Leighton
"If I never needed to do this again, I would probably run BLAST or FASTA (or my favourite search algorithm, running ungapped) using one set of sequences as a query, and the other as the target database, using the program parameters to report only one match each time.
I'd then use Python to parse the results, throwing away all those matches where" I don't have a favorite, I have only tried BLAST :) Is there an example of how to interface between python and BLAST. I have no idea where to start. I have never done anything similar. @ Leighton I think I will take your approach. Thanks for the input. *Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn On Wed, Mar 10, 2010 at 8:53 AM, Leighton Pritchard wrote: > Hi, > > On 10/03/2010 Wednesday, March 10, 03:46, "Vincent Davis" > wrote: > > > I need to check if any/all the sequence from one fasta file are in > another. > > Looking through the docs I think I could do this. > > As others have pointed out, a simple string comparison will do this. > > > I then what to find "close matches" and for me this means they differ by > 1 > > snp and I need to know the location of this differing snp. How would I do > > this? > > There are many ways in which this *could* be done. You probably want one > that is quite quick, though > > If I never needed to do this again, I would probably run BLAST or FASTA (or > my favourite search algorithm, running ungapped) using one set of sequences > as a query, and the other as the target database, using the program > parameters to report only one match each time. 
I'd then use Python to > parse the results, throwing away all those matches where > > i) if the number of aligned bases is the same as the number of bases in the > query: the number of match identities differs from the number of aligned > bases by more than one > ii) if the number of aligned bases differs from the number of bases in the > query by exactly one: the number of match identities differs from the > number > of aligned bases > iii) the number of aligned bases differs from the number of bases in the > query by two or more > > The remainder should be your set of (almost) full-length 1/0 SNP matches, > and there should be enough data in your search program output to identify > the location of the SNP. > > I think it would be faster to use something off-the-shelf like BLAST and > parse the output, than to write something to do the search. It will > probably work quicker, too. > > Lots of ways to do this repeatably, including writing a generator function. > > I hope this is useful, > > L. > > -- > Dr Leighton Pritchard MRSC > D131, Plant Pathology Programme, SCRI > Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA > e:lpritc at scri.ac.uk w: > http://www.scri.ac.uk/staff/leightonpritchard > gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 > > > ______________________________________________________ > SCRI, Invergowrie, Dundee, DD2 5DA. > The Scottish Crop Research Institute is a charitable company limited by > guarantee. > Registered in Scotland No: SC 29367. > Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. > > > DISCLAIMER: > > This email is from the Scottish Crop Research Institute, but the views > expressed by the sender are not necessarily the views of SCRI and its > subsidiaries. This email and any files transmitted with it are confidential > to the intended recipient at the e-mail address to which it has been > addressed. It may not be disclosed or used by any other than that > addressee. 
> If you are not the intended recipient you are requested to preserve this > confidentiality and you must not use, disclose, copy, print or rely on this > e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of > the sender and delete the email from your system. > > Although SCRI has taken reasonable precautions to ensure no viruses are > present in this email, neither the Institute nor the sender accepts any > responsibility for any viruses, and it is your responsibility to scan the > email and the attachments (if any). > ______________________________________________________ > From subhodeep.moitra at gmail.com Wed Mar 10 18:51:08 2010 From: subhodeep.moitra at gmail.com (subhodeep moitra) Date: Wed, 10 Mar 2010 13:51:08 -0500 Subject: [Biopython] BioPython GSOC 2010 Message-ID: <6a2880081003101051h6bb3e732s5176c8c3c6a23c19@mail.gmail.com> Hi I am interested in applying for GSOC 2010. Particularly liked the R and Python integration proposal. There are lot of other cool R packages too, such as Bio3d that one can think of. Do you guys have an IRC channel ? Thanks Subho -- Subhodeep Moitra First Year, Masters Student Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh, PA , USA From biopython at maubp.freeserve.co.uk Wed Mar 10 21:56:48 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 10 Mar 2010 21:56:48 +0000 Subject: [Biopython] matching sequences from fasta files In-Reply-To: <77e831101003101010i1e00c805jdf2269a9b86a9cf@mail.gmail.com> References: <77e831101003091946r630ad89bt2a9a462100db4cce@mail.gmail.com> <77e831101003101010i1e00c805jdf2269a9b86a9cf@mail.gmail.com> Message-ID: <320fb6e01003101356m68bd6287k50145aeaf7cb624b@mail.gmail.com> On Wed, Mar 10, 2010 at 6:10 PM, Vincent Davis wrote: > I don't have a favorite, I have only tried BLAST ?:) > Is there an example of how to interface between python and > BLAST. I have no idea where to start. 
I have never done > anything similar. There are examples of how to call BLAST and parse its (XML) output with Biopython in our tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf Peter P.S. I am reminded of the old saying, "When all you have is a hammer, everything looks like a nail." (by which I mean even if it is not the best tool for the job, you could do it with BLAST). From biopython at maubp.freeserve.co.uk Wed Mar 10 21:59:19 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 10 Mar 2010 21:59:19 +0000 Subject: [Biopython] BioPython GSOC 2010 In-Reply-To: <6a2880081003101051h6bb3e732s5176c8c3c6a23c19@mail.gmail.com> References: <6a2880081003101051h6bb3e732s5176c8c3c6a23c19@mail.gmail.com> Message-ID: <320fb6e01003101359n7fd4883fhd2ee2ec3a5b9d9d0@mail.gmail.com> On Wed, Mar 10, 2010 at 6:51 PM, subhodeep moitra wrote: > Hi > > I am interested in applying for GSOC 2010. > > Particularly liked the R and Python integration proposal. There are lot of > other cool R packages too, such as Bio3d that one can think of. > > Do you guys have an IRC channel ? No - one reason is the Biopython developers cover several timezones, so email is generally more useful. Brad is also in the USA, and he is the Biopython person to talk to about this suggested GSoC project. Peter From vincent at vincentdavis.net Thu Mar 11 00:47:49 2010 From: vincent at vincentdavis.net (Vincent Davis) Date: Wed, 10 Mar 2010 17:47:49 -0700 Subject: [Biopython] matching sequences from fasta files In-Reply-To: <320fb6e01003101356m68bd6287k50145aeaf7cb624b@mail.gmail.com> References: <77e831101003091946r630ad89bt2a9a462100db4cce@mail.gmail.com> <77e831101003101010i1e00c805jdf2269a9b86a9cf@mail.gmail.com> <320fb6e01003101356m68bd6287k50145aeaf7cb624b@mail.gmail.com> Message-ID: <77e831101003101647p4aac93eqd5ac464fb0f9acc6@mail.gmail.com> So I had an idea and wanted to get some feedback. 
I could generate all possible single-position mismatches for the sequences. I have 230,000 sequences now, and that would give me 17,250,000 (3 * 25 * 230,000). Then I would use BLAST to look for perfect matches. I would probably do this incrementally, maybe even just running BLAST for each sequence. The advantage I see in this is that BLAST can run multi-core, and I am running it on an 8-core machine with 48 GB of memory. So it seems that this would be the fastest way to do this, and very straightforward, as there is very little parsing: there is either a match or not. I am purely guessing that generating the list is faster than parsing the results.

*Vincent Davis
720-301-3003 *
vincent at vincentdavis.net
my blog | LinkedIn

On Wed, Mar 10, 2010 at 2:56 PM, Peter wrote:

> On Wed, Mar 10, 2010 at 6:10 PM, Vincent Davis
> wrote:
> > I don't have a favorite, I have only tried BLAST :)
> > Is there an example of how to interface between python and
> > BLAST. I have no idea where to start. I have never done
> > anything similar.
>
> There are examples of how to call BLAST and parse its
> (XML) output with Biopython in our tutorial:
> http://biopython.org/DIST/docs/tutorial/Tutorial.html
> http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
>
> Peter
>
> P.S. I am reminded of the old saying, "When all you have is
> a hammer, everything looks like a nail." (by which I mean
> even if it is not the best tool for the job, you could do it with
> BLAST).
>

From mjldehoon at yahoo.com Thu Mar 11 01:19:09 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Wed, 10 Mar 2010 17:19:09 -0800 (PST)
Subject: [Biopython] matching sequences from fasta files
In-Reply-To:
Message-ID: <224464.63537.qm@web62408.mail.re1.yahoo.com>

> On 10/03/2010 Wednesday, March 10, 03:46, "Vincent Davis"
> wrote:
> > I then what to find "close matches" and for me this
> > means they differ by 1 snp and I need to know the
> > location of this differing snp. How would I do this?
>

You could use nexalign for that.
http://genome.gsc.riken.jp/osc/english/dataresource/ --Michiel. From lpritc at scri.ac.uk Thu Mar 11 08:35:35 2010 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Thu, 11 Mar 2010 08:35:35 +0000 Subject: [Biopython] matching sequences from fasta files In-Reply-To: <77e831101003101010i1e00c805jdf2269a9b86a9cf@mail.gmail.com> Message-ID: Hi, On 10/03/2010 Wednesday, March 10, 18:10, "Vincent Davis" wrote: > I don't have a favorite, I have only tried BLAST :) > Is there an example of how to interface between python and BLAST. I have no > idea where to start. I have never done anything similar. For a one-off, I'd run BLAST from the command-line, and use Python to parse the results. http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#htoc82 Is the tutorial page that will be most help there, I think. > @ Leighton > I think I will take your approach. Thanks for the input. As with anything I suggest: treat with caution, and check for sanity at each step ;) L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. 
If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). ______________________________________________________ From lpritc at scri.ac.uk Thu Mar 11 09:00:37 2010 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Thu, 11 Mar 2010 09:00:37 +0000 Subject: [Biopython] matching sequences from fasta files In-Reply-To: <77e831101003101647p4aac93eqd5ac464fb0f9acc6@mail.gmail.com> Message-ID: On 11/03/2010 Thursday, March 11, 00:47, "Vincent Davis" wrote: > So I had an idea and wanted to get some feedback. > I could make all possible single position mismatches for the sequences. I > have 230,000 now and the would give me 17,250,000 (3 * 25 * 230,000). Then > use BLAST to look for perfect matches. That doesn't sound very elegant (or like a good solution) to me, but if you wanted to do that you wouldn't necessarily need Python, except perhaps to generate all possible mismatches. You can restrict BLAST output to the best match, and match identities to 100% with the option (in BLAST+) -word_size 25 Which restricts BLAST to finding seed words of the same length (25) as your oligos. This would also speed up BLAST. You might also consider exploring other output formats, so you could process tabular output from the command line, for instance. 
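
The one step above that would need a script, generating every single-position mismatch of a probe, could look something like this. It is only a sketch: the function name is invented for illustration, and it assumes probes over the plain ACGT alphabet with no ambiguity codes.

```python
def single_mismatch_variants(probe, alphabet="ACGT"):
    """Yield (position, variant) for every one-base substitution of probe."""
    probe = probe.upper()
    for i, base in enumerate(probe):
        for alt in alphabet:
            if alt != base:
                # Replace only the base at position i, leave the rest alone.
                yield i, probe[:i] + alt + probe[i + 1:]


if __name__ == "__main__":
    probe = "ACGTACGTACGTACGTACGTACGTA"  # an arbitrary 25-mer
    variants = list(single_mismatch_variants(probe))
    # 3 alternative bases at each of 25 positions
    print(len(variants), len(variants) * 230000)
```

Each 25-mer gives 3 * 25 = 75 variants, which is where the 17,250,000 figure for 230,000 probes comes from. Keeping the position alongside each variant means a perfect hit for a variant already tells you where the SNP is.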
However, given the size of your data set, and the sizes of your sequences (neither of which were stated in the OP), I'd be inclined to bypass this altogether, and instead use one of the short-read sequence alignment packages such as SOAP or PASS, to see if it can be applied to your problem. Michiel's suggestion of NEXALIGN might be a good one - I've never used it, so can't say much about it. > I would probably do this > incrementally maybe even just blast for each sequence. The advantage I see > in this is that BLAST can run multi core and I am running it on an 8core > with 48gb of memory So it seems that this would be the fastest way to do > this and very straight forward as there is very little parsing. If you BLASTed each of 17m sequences individually, you would have to parse 17m output files. That sounds like a *lot* of parsing and file IO to me. ;) > There is > either a match or not. I am purely guessing that generating the list if > faster than parsing the results. You could try timing it with 10, 100 and 1000 sequences and see if you notice a trend. With your sequence set, I wouldn't bother - I'd jump straight to the next-gen sequence aligners. L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. 
This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). ______________________________________________________ From biopython at maubp.freeserve.co.uk Thu Mar 11 11:06:24 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 11 Mar 2010 11:06:24 +0000 Subject: [Biopython] matching sequences from fasta files In-Reply-To: <77e831101003101647p4aac93eqd5ac464fb0f9acc6@mail.gmail.com> References: <77e831101003091946r630ad89bt2a9a462100db4cce@mail.gmail.com> <77e831101003101010i1e00c805jdf2269a9b86a9cf@mail.gmail.com> <320fb6e01003101356m68bd6287k50145aeaf7cb624b@mail.gmail.com> <77e831101003101647p4aac93eqd5ac464fb0f9acc6@mail.gmail.com> Message-ID: <320fb6e01003110306u1d3b8763j573fbb8200f9eeaf@mail.gmail.com> On Thu, Mar 11, 2010 at 12:47 AM, Vincent Davis wrote: > So I had an idea and wanted to get some feedback. > I could make all possible single position mismatches for the sequences. I > have 230,000 now and the would give me 17,250,000 (3 * 25 * 230,000). Then > use BLAST to look for perfect matches. I would probably do this > incrementally maybe even just blast for each sequence. 
> The advantage I see
> in this is that BLAST can run multi core and I am running it on an 8core
> with 48gb of memory So it seems that this would be the fastest way to do
> this and very straight forward as there is very little parsing. There is
> either a match or not. I am purely guessing that generating the list if
> faster than parsing the results.

The strengths of BLAST are in fast fuzzy matching. My instinct is that it would be silly to take your 230,000 queries, generate an extra 17,250,000 queries, and then run BLAST against your (organism specific?) database. Just run BLAST on your queries with some reasonably strict match parameters, then post-filter for your single base change.

Now, if you really want to go for the brute force approach of looking for the perfect matches, what you could do is for each query of length 25, generate 25 simple regular expressions (e.g. using the "any letter" wild card in each position). You can do the regular expression matching within Python, or even with a command line tool like EMBOSS dreg.
http://emboss.sourceforge.net/apps/release/6.2/emboss/apps/dreg.html

Speaking of the EMBOSS tools, their fuzzy nucleotide search tool fuzznuc might be useful (you can specify the patterns using the IUPAC codes rather than regular expressions):
http://emboss.sourceforge.net/apps/release/6.2/emboss/apps/fuzznuc.html

As far as I know, EMBOSS doesn't have a tool/option for fuzzy matching where you can specify an allowed number of mismatches - unless one of the primer/vector tools can be used in this way? I'd suggest using primersearch but I think that only takes pairs of primers (not single probes).

There is going to be more than one way to solve your problem. This will be a useful learning process for you.
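
A minimal sketch of the wildcard-per-position idea described above, using only Python's re module. It assumes the genome fits in memory as one plain string with no line breaks, and it searches one strand only; the function names are hypothetical.

```python
import re


def wildcard_patterns(probe):
    """Yield (position, regex) pairs with one 'any letter' wildcard each."""
    for i in range(len(probe)):
        body = re.escape(probe[:i]) + "." + re.escape(probe[i + 1:])
        # Wrap in a lookahead so overlapping hits are all reported.
        yield i, re.compile("(?=" + body + ")")


def single_mismatch_hits(probe, genome):
    """Return sorted (genome_start, probe_position, genome_base) tuples
    for every place the probe matches with exactly one mismatch."""
    probe, genome = probe.upper(), genome.upper()
    hits = []
    for i, regex in wildcard_patterns(probe):
        for match in regex.finditer(genome):
            base = genome[match.start() + i]
            if base != probe[i]:  # the wildcard also hits perfect matches
                hits.append((match.start(), i, base))
    return sorted(hits)


if __name__ == "__main__":
    print(single_mismatch_hits("ACGT", "TTACGTTTACCTTT"))
```

Each probe costs one pass over the genome per position, so for 230,000 probes of length 25 this is likely to be far slower than a dedicated aligner; it is probably only practical for spot checks on a handful of probes.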
Regards, Peter From chapmanb at 50mail.com Thu Mar 11 12:37:28 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 11 Mar 2010 07:37:28 -0500 Subject: [Biopython] BioPython GSOC 2010 In-Reply-To: <6a2880081003101051h6bb3e732s5176c8c3c6a23c19@mail.gmail.com> References: <6a2880081003101051h6bb3e732s5176c8c3c6a23c19@mail.gmail.com> Message-ID: <20100311123728.GB36200@sobchak.mgh.harvard.edu> Subho; > I am interested in applying for GSOC 2010. > > Particularly liked the R and Python integration proposal. There are lot of > other cool R packages too, such as Bio3d that one can think of. Great to hear you are interested. The official student application period will run from March 29th-April 9th; we will have more specifics about when and where to apply once the organizational application round in finished. There is plenty you can do in the meantime. The selection process for students is competitive, and some of the things that help give proposals an advantage are: - Demonstrating knowledge of the projects. For the R/python idea, this would involve digging into Rpy2, some R packages you would be interested in exposing, and Biopython to get a sense of what a compatible API would look like. - Demonstrating open source coding capabilities. If you've not already worked on an open source project, this could involve putting together working code demonstrating an aspect of your proposal and making it available on Bitbucket or GitHub. - Showing the ability to communicate effectively with the community. Once you have code available, write up some information about it on a blog, ask for feedback on mailing lists, or otherwise let people know it is out there and you want to talk about it. These tips are generally useful independent of what specific project you are applying for. 
Hope this helps, Brad From vincent at vincentdavis.net Thu Mar 11 13:42:40 2010 From: vincent at vincentdavis.net (Vincent Davis) Date: Thu, 11 Mar 2010 06:42:40 -0700 Subject: [Biopython] matching sequences from fasta files In-Reply-To: <320fb6e01003110306u1d3b8763j573fbb8200f9eeaf@mail.gmail.com> References: <77e831101003091946r630ad89bt2a9a462100db4cce@mail.gmail.com> <77e831101003101010i1e00c805jdf2269a9b86a9cf@mail.gmail.com> <320fb6e01003101356m68bd6287k50145aeaf7cb624b@mail.gmail.com> <77e831101003101647p4aac93eqd5ac464fb0f9acc6@mail.gmail.com> <320fb6e01003110306u1d3b8763j573fbb8200f9eeaf@mail.gmail.com> Message-ID: <77e831101003110542s270c2722w20970cf2fd278f9@mail.gmail.com> Thanks again for all the responses I'll let you know what I end up with. *Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn On Thu, Mar 11, 2010 at 4:06 AM, Peter wrote: > On Thu, Mar 11, 2010 at 12:47 AM, Vincent Davis > wrote: > > So I had an idea and wanted to get some feedback. > > I could make all possible single position mismatches for the sequences. I > > have 230,000 now and the would give me 17,250,000 (3 * 25 * 230,000). > Then > > use BLAST to look for perfect matches. I would probably do this > > incrementally maybe even just blast for each sequence. The advantage I > see > > in this is that BLAST can run multi core and I am running it on an 8core > > with 48gb of memory So it seems that this would be the fastest way to do > > this and very straight forward as there is very little parsing. There is > > either a match or not. I am purely guessing that generating the list if > > faster than parsing the results. > > The strengths of BLAST are in fast fuzzy matching. My instinct is is > would be silly to take your 230,000 queries, generate an extra queries > 17,250,000 queries, and then run BLAST against your (organism > specific?) database. 
Just run the BLAST on your queries with some > reasonably strict match parameters, then post filter for your single > base change. > > Now, if you really want to go for the brute force approach of looking > for the perfect matches, what you could do is for each query of length > 25, generate 25 simple regular expressions (e.g. using the "any letter" > wild card in each position). You can do the regular expression matching > within Python, or even with a command line tool like EMBOSS dreg. > http://emboss.sourceforge.net/apps/release/6.2/emboss/apps/dreg.html > > Speaking of the EMBOSS tools, their fuzzy nucleotide search tool > fuzznuc might be useful (you can specify the patterns using the > IUPAC codes rather than regular expressions): > http://emboss.sourceforge.net/apps/release/6.2/emboss/apps/fuzznuc.html > > As far as I know, EMBOSS don't have a tool/option for fuzzy matching > where you can specify a allowed number of miss-matches - unless one > of the primer/vector tools can be used in this way? I'd suggest using > primersearch but I think that only takes pairs of primers (not single > probes). > > There is going to more than one way to solve your problem. This > will be a useful learning process for you. > > Regards, > > Peter > From subhodeep.moitra at gmail.com Thu Mar 11 18:17:43 2010 From: subhodeep.moitra at gmail.com (subhodeep moitra) Date: Thu, 11 Mar 2010 13:17:43 -0500 Subject: [Biopython] BioPython GSOC 2010 In-Reply-To: <20100311123728.GB36200@sobchak.mgh.harvard.edu> References: <6a2880081003101051h6bb3e732s5176c8c3c6a23c19@mail.gmail.com> <20100311123728.GB36200@sobchak.mgh.harvard.edu> Message-ID: <6a2880081003111017p3bf02764wd9a1f3455dbb9b49@mail.gmail.com> Hi Brad All good advice. I've used BioPython and R for a few things, but am still new to it. I would like to start coding straightaway, and work on my favorite R package as you suggested. Will stay in touch. 
Thanks Subho On Thu, Mar 11, 2010 at 7:37 AM, Brad Chapman wrote: > Subho; > > > I am interested in applying for GSOC 2010. > > > > Particularly liked the R and Python integration proposal. There are lot > of > > other cool R packages too, such as Bio3d that one can think of. > > Great to hear you are interested. The official student application > period will run from March 29th-April 9th; we will have more > specifics about when and where to apply once the organizational > application round in finished. > > There is plenty you can do in the meantime. The selection process > for students is competitive, and some of the things that help give > proposals an advantage are: > > - Demonstrating knowledge of the projects. For the R/python idea, this > would involve digging into Rpy2, some R packages you would be > interested in exposing, and Biopython to get a sense of what a > compatible API would look like. > > - Demonstrating open source coding capabilities. If you've not > already worked on an open source project, this could involve > putting together working code demonstrating an aspect of your > proposal and making it available on Bitbucket or GitHub. > > - Showing the ability to communicate effectively with the community. > Once you have code available, write up some information about it > on a blog, ask for feedback on mailing lists, or otherwise let > people know it is out there and you want to talk about it. > > These tips are generally useful independent of what specific project > you are applying for. 
> > Hope this helps, > Brad > -- Subhodeep Moitra First Year, Masters Student Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh, PA , USA From mjldehoon at yahoo.com Fri Mar 12 00:36:07 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 11 Mar 2010 16:36:07 -0800 (PST) Subject: [Biopython] matching sequences from fasta files In-Reply-To: <320fb6e01003110306u1d3b8763j573fbb8200f9eeaf@mail.gmail.com> Message-ID: <790299.17186.qm@web62404.mail.re1.yahoo.com> --- On Thu, 3/11/10, Peter wrote: > From: Peter > Subject: Re: [Biopython] matching sequences from fasta files > To: "Vincent Davis" > Cc: "biopython" > Date: Thursday, March 11, 2010, 6:06 AM > On Thu, Mar 11, 2010 at 12:47 AM, > Vincent Davis > > wrote: > > So I had an idea and wanted to get some feedback. > > I could make all possible single position mismatches > for the sequences. I > > have 230,000 now and the would give me 17,250,000 (3 * > 25 * 230,000). Then > > use BLAST to look for perfect matches. I would > probably do this > > incrementally maybe even just blast for each sequence. > The advantage I see > > in this is that BLAST can run multi core and I am > running it on an 8core > > with 48gb of memory So it seems that this would be the > fastest way to do > > this and very straight forward as there is very little > parsing. There is > > either a match or not. I am purely guessing that > generating the list if > > faster than parsing the results. > Nexalign can do exactly what you are trying to do. See http://genome.gsc.riken.jp/osc/english/dataresource/. --Michiel. 
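The two brute-force ideas raised in this thread (enumerating every single-position mismatch of a 25-mer, and building one "any letter" regular expression per position) can be sketched in plain Python roughly as follows. This is only a minimal illustration under stated assumptions: the function names and the example probe are made up for this sketch and are not part of Biopython, EMBOSS, or Nexalign, and a dedicated aligner will be far faster on 230,000 probes.

```python
import re


def single_mismatch_variants(probe, alphabet="ACGT"):
    """List every sequence that differs from `probe` at exactly one position."""
    variants = []
    for i, base in enumerate(probe):
        for alt in alphabet:
            if alt != base:
                variants.append(probe[:i] + alt + probe[i + 1:])
    return variants


def wildcard_patterns(probe):
    """Build one regex per position, with '.' (any letter) at that position.

    Each pattern is wrapped in a lookahead so overlapping hits are reported.
    """
    patterns = []
    for i in range(len(probe)):
        body = re.escape(probe[:i]) + "." + re.escape(probe[i + 1:])
        patterns.append(re.compile("(?=" + body + ")"))
    return patterns


def find_hits(probe, target):
    """Return sorted start offsets in `target` matching `probe` with <= 1 mismatch."""
    starts = set()
    for pattern in wildcard_patterns(probe):
        for match in pattern.finditer(target):
            starts.add(match.start())
    return sorted(starts)


if __name__ == "__main__":
    probe = "ATGCGTACCTGAAGTCCATGGCTAA"  # hypothetical 25-mer, for illustration only
    # 3 alternative bases at each of 25 positions gives 75 variants per probe
    print(len(single_mismatch_variants(probe)))
    # 75 variants * 230,000 probes is the 17,250,000 figure quoted in the thread
    print(75 * 230000)
    mutated = probe[:12] + "C" + probe[13:]  # one substitution in the middle
    target = "TT" + probe + "GG" + mutated + "CC"
    print(find_hits(probe, target))
```

The wildcard approach needs only 25 patterns per probe instead of 75 explicit variants, and it also reports exact matches, since '.' matches the original base too.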
From vincent at vincentdavis.net Fri Mar 12 03:08:23 2010 From: vincent at vincentdavis.net (Vincent Davis) Date: Thu, 11 Mar 2010 20:08:23 -0700 Subject: [Biopython] matching sequences from fasta files In-Reply-To: <790299.17186.qm@web62404.mail.re1.yahoo.com> References: <320fb6e01003110306u1d3b8763j573fbb8200f9eeaf@mail.gmail.com> <790299.17186.qm@web62404.mail.re1.yahoo.com> Message-ID: <77e831101003111908l4e75b898yfa045ccc96d1850@mail.gmail.com> @Michiel de Hoon > Nexalign can do exactly what you are trying to do. > See http://genome.gsc.riken.jp/osc/english/dataresource/. Thanks for the link to Nexalign. It is perfect and fast. This is exactly what I needed. I already have the results I needed: 5 min from download to results. I need to spend a little time verifying I have what I want, but it looks right. Again, thank you very much. *Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn On Thu, Mar 11, 2010 at 5:36 PM, Michiel de Hoon wrote: > --- On Thu, 3/11/10, Peter wrote: > > > From: Peter > > Subject: Re: [Biopython] matching sequences from fasta files > > To: "Vincent Davis" > > Cc: "biopython" > > Date: Thursday, March 11, 2010, 6:06 AM > > On Thu, Mar 11, 2010 at 12:47 AM, > > Vincent Davis > > > > wrote: > > > So I had an idea and wanted to get some feedback. > > > I could make all possible single position mismatches > > for the sequences. I > > > have 230,000 now and the would give me 17,250,000 (3 * > > 25 * 230,000). Then > > > use BLAST to look for perfect matches. I would > > probably do this > > > incrementally maybe even just blast for each sequence. > > The advantage I see > > > in this is that BLAST can run multi core and I am > > running it on an 8core > > > with 48gb of memory So it seems that this would be the > > fastest way to do > > > this and very straight forward as there is very little > > parsing. There is > > > either a match or not.
I am purely guessing that > > generating the list if > > > faster than parsing the results. > > > Nexalign can do exactly what you are trying to do. > See http://genome.gsc.riken.jp/osc/english/dataresource/. > > --Michiel. > > > > From sbassi at genesdigitales.com Sat Mar 13 14:41:09 2010 From: sbassi at genesdigitales.com (Sebastian Bassi) Date: Sat, 13 Mar 2010 11:41:09 -0300 Subject: [Biopython] Slides from Feb. 22 Biopython workshop In-Reply-To: <3f6baf361002241052w5f66fcffmdfe52cb671386505@mail.gmail.com> References: <3f6baf361002241052w5f66fcffmdfe52cb671386505@mail.gmail.com> Message-ID: On Wed, Feb 24, 2010 at 3:52 PM, Eric Talevich wrote: > On Monday I hosted a 2-hour programming workshop focusing on Biopython and > http://www.slideshare.net/etalevich/biopython-programming-workshop-at-uga > http://www.slideshare.net/etalevich/python-workshop-1-uga-bioinformatics > I hope others find these slides useful. Hello, I am about to give a talk about Biopython in a local event (1er Congreso Argentino de Bioinformatica y Biologia Computacional) and I think I could retrieve material from some of your slides (with attribution). What do you think? Best, SB. -- Curso de Python en un d?a: http://bit.ly/cursopython Non standard disclaimer: READ CAREFULLY. By reading this email, you agree, on behalf of your employer, to release me from all obligations and waivers arising from any and all NON-NEGOTIATED agreements, licenses, terms-of-service, shrinkwrap, clickwrap, browsewrap, confidentiality, non-disclosure, non-compete and acceptable use policies ("BOGUS AGREEMENTS") that I have entered into with your employer, its partners, licensors, agents and assigns, in perpetuity, without prejudice to my ongoing rights and privileges. You further represent that you have the authority to release me from any BOGUS AGREEMENTS on behalf of your employer. 
Google ads remover words: suicide, murder From etal at uga.edu Sat Mar 13 15:27:51 2010 From: etal at uga.edu (Eric Talevich) Date: Sat, 13 Mar 2010 10:27:51 -0500 Subject: [Biopython] Slides from Feb. 22 Biopython workshop In-Reply-To: References: <3f6baf361002241052w5f66fcffmdfe52cb671386505@mail.gmail.com> Message-ID: <3f6baf361003130727g1831dbb8gba7f9b63d160fd0b@mail.gmail.com> On Sat, Mar 13, 2010 at 9:41 AM, Sebastian Bassi wrote: > On Wed, Feb 24, 2010 at 3:52 PM, Eric Talevich wrote: > > On Monday I hosted a 2-hour programming workshop focusing on Biopython > and > > > http://www.slideshare.net/etalevich/biopython-programming-workshop-at-uga > > http://www.slideshare.net/etalevich/python-workshop-1-uga-bioinformatics > > I hope others find these slides useful. > > Hello, I am about to give a talk about Biopython in a local event (1er > Congreso Argentino de Bioinformatica y Biologia Computacional) and I > think I could retrieve material from some of your slides (with > attribution). What do you think? > Best, > SB. > Great, I'm glad you found the slides helpful. The Latex Beamer source isn't in a publishable state yet, but I can e-mail it to you if you'd like. -Eric From lgautier at gmail.com Sat Mar 13 18:42:47 2010 From: lgautier at gmail.com (Laurent Gautier) Date: Sat, 13 Mar 2010 19:42:47 +0100 Subject: [Biopython] Biopython Digest, Vol 87, Issue 12 In-Reply-To: References: Message-ID: <4B9BDCA7.8010302@gmail.com> When looking at rpy2, do consider the 2.1-dev version. 2.1 will be released before the SoC starts. L. On 3/12/10 6:00 PM, biopython-request at lists.open-bio.org wrote: > Hi Brad > > All good advice. > > I've used BioPython and R for a few things, but am still new to it. > I would like to start coding straightaway, and work on my favorite R package > as you suggested. > Will stay in touch. > > Thanks > Subho > > > On Thu, Mar 11, 2010 at 7:37 AM, Brad Chapman wrote: > >> Subho; >> >>> I am interested in applying for GSOC 2010. 
>>> >>> Particularly liked the R and Python integration proposal. There are lot >> of >>> other cool R packages too, such as Bio3d that one can think of. >> >> Great to hear you are interested. The official student application >> period will run from March 29th-April 9th; we will have more >> specifics about when and where to apply once the organizational >> application round in finished. >> >> There is plenty you can do in the meantime. The selection process >> for students is competitive, and some of the things that help give >> proposals an advantage are: >> >> - Demonstrating knowledge of the projects. For the R/python idea, this >> would involve digging into Rpy2, some R packages you would be >> interested in exposing, and Biopython to get a sense of what a >> compatible API would look like. >> >> - Demonstrating open source coding capabilities. If you've not >> already worked on an open source project, this could involve >> putting together working code demonstrating an aspect of your >> proposal and making it available on Bitbucket or GitHub. >> >> - Showing the ability to communicate effectively with the community. >> Once you have code available, write up some information about it >> on a blog, ask for feedback on mailing lists, or otherwise let >> people know it is out there and you want to talk about it. >> >> These tips are generally useful independent of what specific project >> you are applying for. >> >> Hope this helps, >> Brad >> > > > From sbassi at genesdigitales.com Sun Mar 14 06:53:45 2010 From: sbassi at genesdigitales.com (Sebastian Bassi) Date: Sun, 14 Mar 2010 03:53:45 -0300 Subject: [Biopython] Slides from Feb. 
22 Biopython workshop In-Reply-To: <3f6baf361003130727g1831dbb8gba7f9b63d160fd0b@mail.gmail.com> References: <3f6baf361002241052w5f66fcffmdfe52cb671386505@mail.gmail.com> <3f6baf361003130727g1831dbb8gba7f9b63d160fd0b@mail.gmail.com> Message-ID: On Sat, Mar 13, 2010 at 12:27 PM, Eric Talevich wrote: > Great, I'm glad you found the slides helpful. The Latex Beamer source isn't > in a publishable state yet, but I can e-mail it to you if you'd like. Don't worry, I don't need the source, I am planning to use some of the content to write my own. Thank you again, I will post them after the congress. Best, SB. From sbassi at gmail.com Mon Mar 15 06:35:16 2010 From: sbassi at gmail.com (Sebastian Bassi) Date: Mon, 15 Mar 2010 03:35:16 -0300 Subject: [Biopython] Bio.PDB Module - Atom.set_serial_number function - usage? In-Reply-To: References: Message-ID: On Tue, Mar 9, 2010 at 7:21 PM, João Rodrigues wrote: > I checked the code of PDBIO and apparently, it has hard-coded a resetting of > the atom number. My question is, what is this set_serial_number for then? Is > there a way for me to override this easily? It may be a bug. Could you post your code related to this? From biopython at maubp.freeserve.co.uk Mon Mar 15 08:34:41 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 15 Mar 2010 08:34:41 +0000 Subject: [Biopython] Bio.PDB Module - Atom.set_serial_number function - usage? In-Reply-To: References: Message-ID: <320fb6e01003150134i18a9fa40xf29fc327a992cd70@mail.gmail.com> On Mon, Mar 15, 2010 at 6:35 AM, Sebastian Bassi wrote: > On Tue, Mar 9, 2010 at 7:21 PM, João Rodrigues wrote: >> I checked the code of PDBIO and apparently, it has hard-coded a resetting of >> the atom number. My question is, what is this set_serial_number for then? Is >> there a way for me to override this easily? > > It may be a bug. Could you post your code related to this? PDBIO does explicitly just use an incremental counter for the atom number.
I don't know why for sure, but this is a simple way to ensure the atoms are given unique identifiers on output. I guess the serial_number is just set by the parser. I don't see an easy way to override it - why do you want to change it? Regarding the point of get_serial_number and set_serial_number, they seem to be rather pointless methods - since you can just edit the serial_number attribute directly. Maybe Thomas has been using Java while writing this code? We have talked about deprecating the pointless get/set functions to make the PDB API a little more transparent. Peter From anaryin at gmail.com Mon Mar 15 08:58:29 2010 From: anaryin at gmail.com (João Rodrigues) Date: Mon, 15 Mar 2010 01:58:29 -0700 Subject: [Biopython] Bio.PDB Module - Atom.set_serial_number function - usage? In-Reply-To: <320fb6e01003150134i18a9fa40xf29fc327a992cd70@mail.gmail.com> References: <320fb6e01003150134i18a9fa40xf29fc327a992cd70@mail.gmail.com> Message-ID: Exactly my point. Those two functions are pretty much useless at the moment since the PDBIO module ignores those values. I just changed the value of the atom number in PDBIO for atom.get_serial_number() and it worked as I wanted, so it isn't that hard. I just wanted to ask if this had a particular reason or if it was some forgotten old setting or bug. João [...] Rodrigues @ http://stanford.edu/~joaor/ 2010/3/15 Peter > On Mon, Mar 15, 2010 at 6:35 AM, Sebastian Bassi wrote: > > On Tue, Mar 9, 2010 at 7:21 PM, João Rodrigues > wrote: > >> I checked the code of PDBIO and apparently, it has hard-coded a > resetting of > >> the atom number. My question is, what is this set_serial_number for > then? Is > >> there a way for me to override this easily? > > > > It may be a bug. Could you post your code related to this? > > PDBIO does explicitly just use an incremental counter for the > atom number. I don't know why for sure, but this is a simple way > to ensure the atoms are given unique identifiers on output.
I > guess the serial_number is just set by the parser. I don't see an > easy way to override it - why do you want to change it? > > Regarding the point of get_serial_number and set_serial_number, > they seem to be rather pointless methods - since you can just edit > the serial_number attribute directly. Maybe Thomas has been using > Java while writing this code? We have talked about deprecating the > pointless get/set functions to make the PDB API a little more transparent. > > Peter > From biopython at maubp.freeserve.co.uk Mon Mar 15 09:29:09 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 15 Mar 2010 09:29:09 +0000 Subject: [Biopython] Bio.PDB Module - Atom.set_serial_number function - usage? In-Reply-To: References: <320fb6e01003150134i18a9fa40xf29fc327a992cd70@mail.gmail.com> Message-ID: <320fb6e01003150229i763e203w9a8da06f3ebb0662@mail.gmail.com> On Mon, Mar 15, 2010 at 8:58 AM, João Rodrigues wrote: > Exactly my point. Those two functions are pretty much useless at the moment > since the PDBIO module ignores those values. I just changed the value of the > atom number in PDBIO for atom.get_serial_number() and it worked as I wanted, > so it isn't that hard. > > I just wanted to ask if this had a particular reason or if it was some > forgotten old setting or bug. My guess is that if you have selected only part of a PDB file, and written this out to a new sub-file, then it is conventional to have the atoms numbered sequentially from one. This is what the current code does, but using the serial_number from the objects would result in irregular numbering with gaps in it (not sure if that is against the PDB specification, but it would not surprise me if third party tools don't like it). i.e. Not a bug, but a deliberate design choice. (We'd have to ask Thomas what he was thinking to be sure.) Again, why do you want to change the atom numbers on output?
Peter From vincent at vincentdavis.net Tue Mar 16 15:03:45 2010 From: vincent at vincentdavis.net (Vincent Davis) Date: Tue, 16 Mar 2010 09:03:45 -0600 Subject: [Biopython] comparing micro array data Message-ID: <77e831101003160803n24a4568aq68793a367059f956@mail.gmail.com> So I am very new to this so please accept my ignorance on this subject. I have several micro array samples ~ 8 for each of 3 known genomes. So I know which probes/sequences are a match and which have close matches. I would like to identify which sequences exist in an unknown sample. The array is custom and there is little to know overlap between probes. What is the "standard" way of doing this? I don't care to know if a SNP is present only if the sequence is present. Is this standard available in biopython ? Thanks *Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn From biopython at maubp.freeserve.co.uk Tue Mar 16 15:15:27 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 16 Mar 2010 15:15:27 +0000 Subject: [Biopython] comparing micro array data In-Reply-To: <77e831101003160803n24a4568aq68793a367059f956@mail.gmail.com> References: <77e831101003160803n24a4568aq68793a367059f956@mail.gmail.com> Message-ID: <320fb6e01003160815s1e051330ve62211d6c7843f64@mail.gmail.com> On Tue, Mar 16, 2010 at 3:03 PM, Vincent Davis wrote: > So I am very new to this so please accept my ignorance on this subject. > > I have several micro array samples ~ 8 for each of 3 known genomes. So I > know which probes/sequences are a match and which have close matches. I > would like to identify which sequences exist in an unknown sample. The array > is custom and there is little to know overlap between probes. > What is the "standard" way of doing this? I don't care to know if a SNP is > present only if the sequence is present. > Is this standard available in biopython ? 
Hi Vincent, Biopython has only limited pairwise alignment built in - we normally just call specialised command line tools. In addition to classic microarray probe design tools, you *might* be able to exploit related tools for PCR primers or short read tools from next generation sequencing. However, these won't be specifically aware of microarray probe affinities and how to model them. For microarray work I would have to say using R/Bioconductor will probably be more sensible for the very practical reason that they have a much larger community using microarrays than Python does. http://www.bioconductor.org/ Peter P.S. You can call R from Python, see http://rpy.sourceforge.net/ From vincent at vincentdavis.net Tue Mar 16 15:30:42 2010 From: vincent at vincentdavis.net (Vincent Davis) Date: Tue, 16 Mar 2010 09:30:42 -0600 Subject: [Biopython] comparing micro array data In-Reply-To: <320fb6e01003160815s1e051330ve62211d6c7843f64@mail.gmail.com> References: <77e831101003160803n24a4568aq68793a367059f956@mail.gmail.com> <320fb6e01003160815s1e051330ve62211d6c7843f64@mail.gmail.com> Message-ID: <77e831101003160830m4e679fa0v21df651d79db582a@mail.gmail.com> > > @Peter For microarray work I would have to say using R/Bioconductor will probably be more sensible for the very practical reason that they have a much larger community using microarrays than Python does. http://www.bioconductor.org/ I am working at getting up to speed with R and bioconductor. I ask the question here as I got such a great answer for the last question I had and thought if the tool was available in biopython then I would try it. I don't know how this problem is normally solved. > *Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn On Tue, Mar 16, 2010 at 9:15 AM, Peter wrote: > On Tue, Mar 16, 2010 at 3:03 PM, Vincent Davis > wrote: > > So I am very new to this so please accept my ignorance on this subject. > > > > I have several micro array samples ~ 8 for each of 3 known genomes. 
So I > > know which probes/sequences are a match and which have close matches. I > > would like to identify which sequences exist in an unknown sample. The > array > > is custom and there is little to know overlap between probes. > > What is the "standard" way of doing this? I don't care to know if a SNP > is > > present only if the sequence is present. > > Is this standard available in biopython ? > > Hi Vincent, > > Biopython has only limited pairwise alignment built in - we normally just > call specialised command line tools. In addition to classic microarray > probe design tools, you *might* be able to exploit related tools for PCR > primers or short read tools from next generation sequencing. However, > these won't be specifically aware of microarray probe affinities and how > to model them. > > For microarray work I would have to say using R/Bioconductor will > probably be more sensible for the very practical reason that they > have a much larger community using microarrays than Python does. > http://www.bioconductor.org/ > > Peter > > P.S. You can call R from Python, see http://rpy.sourceforge.net/ > From lpritc at scri.ac.uk Tue Mar 16 16:03:06 2010 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Tue, 16 Mar 2010 16:03:06 +0000 Subject: [Biopython] comparing micro array data In-Reply-To: <77e831101003160830m4e679fa0v21df651d79db582a@mail.gmail.com> Message-ID: Hi Vincent, On 16/03/2010 Tuesday, March 16, 15:30, "Vincent Davis" wrote: > On Tue, Mar 16, 2010 at 9:15 AM, Peter wrote: > >> @Peter >> For microarray work I would have to say using R/Bioconductor will >> probably be more sensible for the very practical reason that they >> have a much larger community using microarrays than Python does. >> >> http://www.bioconductor.org/ > > I am working at getting up to speed with R and bioconductor. I ask the > question here as I got such a great answer for the last question I had and > thought if the tool was available in biopython then I would try it. 
I don't > know how this problem is normally solved. Peter's suggestion is a good one, in general. Biopython is lacking in support for microarray analysis - not least in part because there's already an adaptor to R, from which the mature and powerful Bioconductor libraries are available (not to mention that arrays are being superseded by sequencing, so now might not be the time to put too much effort in to that ;)). If you've got microarray issues, a Bioconductor mailing list might be a better first port of call. >> On Tue, Mar 16, 2010 at 3:03 PM, Vincent Davis >> wrote: >>> So I am very new to this so please accept my ignorance on this subject. >>> >>> I have several micro array samples ~ 8 for each of 3 known genomes. So I >>> know which probes/sequences are a match and which have close matches. I >>> would like to identify which sequences exist in an unknown sample. The >> array >>> is custom and there is little to know overlap between probes. >>> What is the "standard" way of doing this? I don't care to know if a SNP >> is >>> present only if the sequence is present. >>> Is this standard available in biopython ? It's not very clear to me what the problem is, from your description here. It sounds a bit like you are doing array CGH, starting with an array that was raised to species X, and you then have eight sets of array results (this wouldn't be two samples with three replicates, and a single sample with two replicates, would it?) from known species A, B, and C. Then it seems like you have a sample from species D, and you want to know - perhaps from the array hybridisation data, perhaps from the genome sequence, it's hard to tell - possibly one of two things: which probes will bind to species D; or how many genes from species D are similar to those in species X. These two questions would require quite different approaches; can you be clearer? Cheers, L. 
-- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405
From sdavis2 at mail.nih.gov Tue Mar 16 16:38:31 2010 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Tue, 16 Mar 2010 12:38:31 -0400 Subject: [Biopython] comparing micro array data In-Reply-To: <77e831101003160803n24a4568aq68793a367059f956@mail.gmail.com> References: <77e831101003160803n24a4568aq68793a367059f956@mail.gmail.com> Message-ID: <264855a01003160938n579c42bek4c5122c6fa7b43aa@mail.gmail.com> On Tue, Mar 16, 2010 at 11:03 AM, Vincent Davis wrote: > So I am very new to this so please accept my ignorance on this subject. > > I have several micro array samples ~ 8 for each of 3 known genomes. So I > know which probes/sequences are a match and which have close matches. I > would like to identify which sequences exist in an unknown sample. The array > is custom and there is little to know overlap between probes. > What is the "standard" way of doing this? I don't care to know if a SNP is > present only if the sequence is present. > Is this standard available in biopython ? Hi, Vincent. I'm not clear on what the study is here. Could you explain a bit more what you are doing? I get the suggestion from your email that you want to do a cross-species comparison using microarrays. If this is the case, this is notoriously difficult to do, so, in addition to the comments here, I would suggest finding a local collaborator if you are relatively new to the microarray field.
Sean From vincent at vincentdavis.net Tue Mar 16 16:38:43 2010 From: vincent at vincentdavis.net (Vincent Davis) Date: Tue, 16 Mar 2010 10:38:43 -0600 Subject: [Biopython] comparing micro array data In-Reply-To: References: <77e831101003160830m4e679fa0v21df651d79db582a@mail.gmail.com> Message-ID: <77e831101003160938o10f53c15m51c1aa559def5513@mail.gmail.com> > It sounds a bit like you are doing array CGH, starting with an array that > was raised to species X, Yes. > eight sets of array results (this > wouldn't be two samples with three replicates, and a single sample with two > replicates, would it?) from known species A, B, and C. I have 3 known species: X (the one that matches the array), B, and C, with about 8 arrays/samples for each, and for each we know whether a probe/sequence matches a sequence in the genome. There are also several different unknown samples D, E, F, ... We want to know, for any given sequence/probe, whether the unknown has that sequence, or some probability of it, or which probes are most likely to be different. B and C only help by allowing us to test our method. I also have close-mismatch data for the known species; that is, I know if there is a single-mismatch match and the distance of that mismatch from the center of the sequence. *Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn On Tue, Mar 16, 2010 at 10:03 AM, Leighton Pritchard wrote: > Hi Vincent, > > On 16/03/2010 Tuesday, March 16, 15:30, "Vincent Davis" > wrote: > > > On Tue, Mar 16, 2010 at 9:15 AM, Peter >wrote: > > > >> @Peter > >> For microarray work I would have to say using R/Bioconductor will > >> probably be more sensible for the very practical reason that they > >> have a much larger community using microarrays than Python does. > >> > >> http://www.bioconductor.org/ > > > > I am working at getting up to speed with R and bioconductor. I ask the > > question here as I got such a great answer for the last question I had > and > > thought if the tool was available in biopython then I would try it.
I > don't > > know how this problem is normally solved. > > Peter's suggestion is a good one, in general. Biopython is lacking in > support for microarray analysis - not least in part because there's already > an adaptor to R, from which the mature and powerful Bioconductor libraries > are available (not to mention that arrays are being superseded by > sequencing, so now might not be the time to put too much effort in to that > ;)). If you've got microarray issues, a Bioconductor mailing list might be > a better first port of call. > > >> On Tue, Mar 16, 2010 at 3:03 PM, Vincent Davis < > vincent at vincentdavis.net> > >> wrote: > >>> So I am very new to this so please accept my ignorance on this subject. > >>> > >>> I have several micro array samples ~ 8 for each of 3 known genomes. So > I > >>> know which probes/sequences are a match and which have close matches. I > >>> would like to identify which sequences exist in an unknown sample. The > >> array > >>> is custom and there is little to know overlap between probes. > >>> What is the "standard" way of doing this? I don't care to know if a SNP > >> is > >>> present only if the sequence is present. > >>> Is this standard available in biopython ? > > It's not very clear to me what the problem is, from your description here. > It sounds a bit like you are doing array CGH, starting with an array that > was raised to species X, and you then have eight sets of array results > (this > wouldn't be two samples with three replicates, and a single sample with two > replicates, would it?) from known species A, B, and C. Then it seems like > you have a sample from species D, and you want to know - perhaps from the > array hybridisation data, perhaps from the genome sequence, it's hard to > tell - possibly one of two things: which probes will bind to species D; or > how many genes from species D are similar to those in species X. These two > questions would require quite different approaches; can you be clearer? 
> > Cheers, > > L. > > -- > Dr Leighton Pritchard MRSC > D131, Plant Pathology Programme, SCRI > Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA > e:lpritc at scri.ac.uk w: > http://www.scri.ac.uk/staff/leightonpritchard > gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 > > > ______________________________________________________ > SCRI, Invergowrie, Dundee, DD2 5DA. > The Scottish Crop Research Institute is a charitable company limited by > guarantee. > Registered in Scotland No: SC 29367. > Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. > > > DISCLAIMER: > > This email is from the Scottish Crop Research Institute, but the views > expressed by the sender are not necessarily the views of SCRI and its > subsidiaries. This email and any files transmitted with it are confidential > to the intended recipient at the e-mail address to which it has been > addressed. It may not be disclosed or used by any other than that > addressee. > If you are not the intended recipient you are requested to preserve this > confidentiality and you must not use, disclose, copy, print or rely on this > e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of > the sender and delete the email from your system. > > Although SCRI has taken reasonable precautions to ensure no viruses are > present in this email, neither the Institute nor the sender accepts any > responsibility for any viruses, and it is your responsibility to scan the > email and the attachments (if any). 
> ______________________________________________________ > From vincent at vincentdavis.net Tue Mar 16 16:49:26 2010 From: vincent at vincentdavis.net (Vincent Davis) Date: Tue, 16 Mar 2010 10:49:26 -0600 Subject: [Biopython] comparing micro array data In-Reply-To: <264855a01003160938n579c42bek4c5122c6fa7b43aa@mail.gmail.com> References: <77e831101003160803n24a4568aq68793a367059f956@mail.gmail.com> <264855a01003160938n579c42bek4c5122c6fa7b43aa@mail.gmail.com> Message-ID: <77e831101003160949i5d9d9126vd28b257bab0cc685@mail.gmail.com> > > @ Sean I would suggest finding a local collaborator if you are relatively new to the microarray field. I actually was brought into this project by a team from a university. They know lots, including that this is a difficult problem. They did not have any references as to how others have solved this problem with whatever success was possible. Since I know python, biopython has been my first choice to ask other smart people :) I am an economist. I am ok with the stats and data but don't know the terminology well. It's been a 3 week crash course in my free time. I wrote my own modules for reading in CEL and CDF files as python objects. I know there are existing solutions but I would not have learned as much that way. I used the nexalign program that was recommended on this list to get the mismatch data. It's all coming along nicely and I am learning lots. The project has been languishing for a list of reasons and now there is a push to get it finished. *Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn On Tue, Mar 16, 2010 at 10:38 AM, Sean Davis wrote: > On Tue, Mar 16, 2010 at 11:03 AM, Vincent Davis > wrote: > > So I am very new to this so please accept my ignorance on this subject. > > > > I have several micro array samples ~ 8 for each of 3 known genomes. So I > > know which probes/sequences are a match and which have close matches. I > > would like to identify which sequences exist in an unknown sample.
The > array > > is custom and there is little to know overlap between probes. > > What is the "standard" way of doing this? I don't care to know if a SNP > is > > present only if the sequence is present. > > Is this standard available in biopython ? > > Hi, Vincent. I'm not clear on what the study is here. Could you > explain a bit more what you are doing? I get the suggestion from your > email that you want to do a cross-species comparison using > microarrays. If this is the case, this is notoriously difficult to > do, so, in addition to the comments here, I would suggest finding a > local collaborator if you are relatively new to the microarray field. > > Sean > From sdavis2 at mail.nih.gov Tue Mar 16 16:56:12 2010 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Tue, 16 Mar 2010 12:56:12 -0400 Subject: [Biopython] comparing micro array data In-Reply-To: <77e831101003160949i5d9d9126vd28b257bab0cc685@mail.gmail.com> References: <77e831101003160803n24a4568aq68793a367059f956@mail.gmail.com> <264855a01003160938n579c42bek4c5122c6fa7b43aa@mail.gmail.com> <77e831101003160949i5d9d9126vd28b257bab0cc685@mail.gmail.com> Message-ID: <264855a01003160956o1bc9b432qf46abbc001fcbde7@mail.gmail.com> On Tue, Mar 16, 2010 at 12:49 PM, Vincent Davis wrote: > @ Sean > > I would suggest finding a > > local collaborator if you are relatively new to the microarray field. > > > I actually was brought into this project by a team from an university. They > know lots including that this is a difficult problem. They did not have any > references as to how others have solved this problem with whatever success > was possible. Since I know python, biopython has been my first choice to ask > other smart people :) > > > I am an economist. I am ok with the stats and data but don't know the > terminology well, It's been a 3 week crash course in my free time. I wrote > my own modules for reading in CEL and CDF files as python objects. 
I know > there are existing solution but I would not learned as much that way. I used > the nexalign program that was recommended on this list to get the mismatch > data. It's all coming along nicely andI am learning lots. The prject has > been languishing for a list of reasons and now there is a push to get it > finished. > Perfect! A mathematician working with biologists--this is the way of the world these days. Given the issues that you describe, I would definitely suggest looking at R/bioconductor. That said, I'm not sure that there is a good answer to the problem, as you suggest. If you don't mind a couple of questions, for curiosity sake, how big is the genome of model organism? And what size are the arrays, in terms of probes? Sean > > *Vincent Davis > 720-301-3003 * > vincent at vincentdavis.net > my blog | LinkedIn > > > On Tue, Mar 16, 2010 at 10:38 AM, Sean Davis wrote: > >> On Tue, Mar 16, 2010 at 11:03 AM, Vincent Davis >> wrote: >> > So I am very new to this so please accept my ignorance on this subject. >> > >> > I have several micro array samples ~ 8 for each of 3 known genomes. So I >> > know which probes/sequences are a match and which have close matches. I >> > would like to identify which sequences exist in an unknown sample. The >> array >> > is custom and there is little to know overlap between probes. >> > What is the "standard" way of doing this? I don't care to know if a SNP >> is >> > present only if the sequence is present. >> > Is this standard available in biopython ? >> >> Hi, Vincent. I'm not clear on what the study is here. Could you >> explain a bit more what you are doing? I get the suggestion from your >> email that you want to do a cross-species comparison using >> microarrays. If this is the case, this is notoriously difficult to >> do, so, in addition to the comments here, I would suggest finding a >> local collaborator if you are relatively new to the microarray field. 
>> >> Sean >> > > From subhodeep.moitra at gmail.com Tue Mar 16 16:56:06 2010 From: subhodeep.moitra at gmail.com (subhodeep moitra) Date: Tue, 16 Mar 2010 12:56:06 -0400 Subject: [Biopython] comparing micro array data Message-ID: <6a2880081003160956t21e30d1v35d9b9df240370c4@mail.gmail.com> If you need to visualize the microarray data and also do some analysis for interaction networks, then 'Cytoscape' is a good option to go for. Thanks Subho -- Subhodeep Moitra First Year, Masters Student Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh, PA , USA From biopython at maubp.freeserve.co.uk Tue Mar 16 17:29:17 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 16 Mar 2010 17:29:17 +0000 Subject: [Biopython] comparing micro array data In-Reply-To: <264855a01003160956o1bc9b432qf46abbc001fcbde7@mail.gmail.com> References: <77e831101003160803n24a4568aq68793a367059f956@mail.gmail.com> <264855a01003160938n579c42bek4c5122c6fa7b43aa@mail.gmail.com> <77e831101003160949i5d9d9126vd28b257bab0cc685@mail.gmail.com> <264855a01003160956o1bc9b432qf46abbc001fcbde7@mail.gmail.com> Message-ID: <320fb6e01003161029i5feddf8ck76ba2b9ecd2056f2@mail.gmail.com> On Tue, Mar 16, 2010 at 4:56 PM, Sean Davis wrote: > > If you don't mind a couple of questions, for curiosity sake, how big is the > genome of model organism? And what size are the arrays, in terms of > probes? Also, what kind of organism? e.g. Plant, animal, bacteria?
This will make a difference for the number of papers you'll find doing this kind of thing in the literature. On Tue, Mar 16, 2010 at 4:49 PM, Vincent Davis wrote: > I actually was brought into this project by a team from an university. They > know lots including that this is a difficult problem. They did not have any > references as to how others have solved this problem with whatever success > was possible. Since I know python, biopython has been my first choice to ask > other smart people :) For a recent example using microarrays for cross-species comparison (aka microarray comparative genomic hybridisation) in bacteria you might want to read Leighton's paper (and the references within - which include work on humans): http://www.ncbi.nlm.nih.gov/pubmed/19696881 You can probably guess why he asked if you were doing array CGH ;) Peter From hlapp at drycafe.net Tue Mar 16 20:03:50 2010 From: hlapp at drycafe.net (Hilmar Lapp) Date: Tue, 16 Mar 2010 16:03:50 -0400 Subject: [Biopython] [OT] Job opportunity: Training coordinator and Bioinformatics Project Manager Message-ID: <0CDDCED9-266E-4CCE-8240-D7E2C8522784@drycafe.net> Hi all - first off, sorry for the cross-posting, we're trying to advertise this as widely as possible. Second, apologies if this is committing an offense and considered spam. I thought though that there might be some people around here who may be interested and suitable. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- informatics.nescent.org : =========================================================== A unique position is available for a training coordinator and bioinformatics project manager at the U.S. National Evolutionary Synthesis Center in Durham, North Carolina (NESCent, http://nescent.org).
NESCent is a National Science Foundation funded research center managed by Duke University, the University of North Carolina at Chapel Hill and North Carolina State University on behalf of the international evolutionary biology community. NESCent facilitates synthetic research by bringing together diverse expertise, data, tools and concepts (Sidlauskas et al. 2009). In addition to a resident population of 20-30 scientists, the Center hosts over 800 visitors a year. An informatics staff is on-site to support resident and visiting scientists' needs in high-performance computing, electronic collaboration, scientific software and databases; this includes custom software development for a limited number of high-impact projects. NESCent's informatics training program includes a rotating series of open-application summer courses, ad-hoc short courses for resident scientists, and remote internships (including past participation in the Google Summer of Code). The training coordinator and bioinformatics project manager will provide oversight to the Center's training activities. The incumbent will also serve as the interface between scientists and software developers at NESCent. The position provides extensive opportunities for collaboration and intellectual engagement with both NESCent-sponsored scientists and informatics staff; however, this is not an independent research position. The incumbent will report to the Director, while overseeing the work of a small informatics team and coordinating activities among the Center's science, education and informatics programs. Responsibilities: * 50% - Consult with sponsored scientists (including scientists in residence and working group participants) about informatics resources and needs.
Manage software product development by gathering requirements from scientists, participating in conceptual design, monitoring implementation progress and product quality, facilitating communication between software developers and scientists, and researching software solutions. * 25% - Oversee NESCent's course curriculum by identifying opportunities for onsite or online informatics courses that satisfy demand for advanced training of resident and visiting scientists, recruiting instructors, providing guidance to instructors in developing course syllabi, coordinating logistical and technical support requirements, conducting assessments, and serving as a liaison to course organizers at other institutions. * 25% - Assisting in the management of NESCent's summer informatics intern program, by coordinating the recruitment, application & review process for students, communicating expectations to students and mentors, monitoring student progress, documenting student outcomes, and performing assessments. Education: Required: M.S. in Biology, Bioinformatics, or a related field. Preferred: Ph.D. and two years postdoctoral experience in evolutionary biology, or an equivalent combination of relevant education and/or experience. Experience: Required: Excellent communication, interpersonal, and organizational skills. Experience with computationally oriented scientific research. Preferred: At least two years in development of databases and open source software. Organization, coordination, development and delivery of courses and workshops appropriate for graduate-level participants. Terms of Employment: Salary will be competitive and commensurate with experience. As a full-time employee, the incumbent will receive Duke University's benefits package (http://hr.duke.edu/benefits/main.html). The position is available immediately and will remain open until filled. The position is currently funded through November 2014, contingent on annual renewal of the Center by the NSF.
How to Apply: Please send a C.V., including contact information for three references, and a brief statement of interest to Allen Rodrigo, Director, NESCent, at a.rodrigo at nescent.org. Inquiries about suitability for the position are welcome. Duke University is an Equal Opportunity/Affirmative Action employer. Additional information about NESCent: http://www.nescent.org References: Sidlauskas B, Ganapathy G, Hazkani-Covo E, Jenkins KP, Lapp H, McCall LW, Price S, Scherle R, Spaeth PA, Kidd DM (2009) Linking Big: The Continuing Promise of Evolutionary Synthesis. Evolution. http://dx.doi.org/10.1111/j.1558-5646.2009.00892.x From lpritc at scri.ac.uk Wed Mar 17 08:20:30 2010 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Wed, 17 Mar 2010 08:20:30 +0000 Subject: [Biopython] comparing micro array data In-Reply-To: <320fb6e01003161029i5feddf8ck76ba2b9ecd2056f2@mail.gmail.com> Message-ID: Hi, On 16/03/2010 Tuesday, March 16, 17:29, "Peter" wrote: > On Tue, Mar 16, 2010 at 4:56 PM, Sean Davis wrote: >> >> If you don't mind a couple of questions, for curiosity sake, how big is the >> genome of model organism? And what size are the arrays, in terms of >> probes? > > Also, what kind of organism? e.g. Plant, animal, bacteria? This will > make a difference for the number of papers you'll find doing this kind > of thing in the literature. And the type of analysis that's being done, too: human aCGH (lots of references) tends to concentrate on copy number variation and SNP identification, while bacterial aCGH (not so many) focuses largely on presence/absence of putative orthologues. > On Tue, Mar 16, 2010 at 4:49 PM, Vincent Davis wrote: >> I actually was brought into this project by a team from an university. They >> know lots including that this is a difficult problem. They did not have any >> references as to how others have solved this problem with whatever success >> was possible.
Since I know python, biopython has been my first choice to ask >> other smart people :) > > For a recent example using microarrays for cross-species comparison > (aka microarray comparative genomic hybridisation) in bacteria you > might want to read Leighton's paper (and the references within - which > include work on humans): > > http://www.ncbi.nlm.nih.gov/pubmed/19696881 > > You can probably guess why he asked if you were doing array CGH ;) And I was just about to blow my own trumpet, too ;) If you've got any questions that are specifically about the paper, I'm happy to take them off-list. L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 From lpritc at scri.ac.uk Wed Mar 17 09:26:22 2010 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Wed, 17 Mar 2010 09:26:22 +0000 Subject: [Biopython] comparing micro array data In-Reply-To: Message-ID: Hi Vincent, I've not read this yet, but it might be useful to you: http://zetoc.mimas.ac.uk/wzgw?db=etoc&terms=RN267048680&field=zid L. -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 From mitlox at op.pl Wed Mar 17 10:08:24 2010 From: mitlox at op.pl (xyz) Date: Wed, 17 Mar 2010 20:08:24 +1000 Subject: [Biopython] sort fasta file Message-ID: <20100317200824.5f363f77@wp01> Hello, I would like sort multiple fasta file depends on the sequence length, ie. from the read with longest sequence to the read with the shortest sequence. I have tried to do it but I do not how to sort the records depends on the sequence length. from Bio import SeqIO handle = open("example.fasta", "rU") records = list(SeqIO.parse(handle, "fasta")) records.sort(reverse=True) Thank you in advance. Best regards, From biopython at maubp.freeserve.co.uk Wed Mar 17 10:22:48 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 17 Mar 2010 10:22:48 +0000 Subject: [Biopython] sort fasta file In-Reply-To: <20100317200824.5f363f77@wp01> References: <20100317200824.5f363f77@wp01> Message-ID: <320fb6e01003170322j6693e6b0n102190ae712b10ba@mail.gmail.com> On Wed, Mar 17, 2010 at 10:08 AM, xyz wrote: > Hello, > I would like sort multiple fasta file depends on the sequence length, > ie. from the read with longest sequence to the read with the shortest > sequence. > > I have tried to do it but I do not how to sort the records depends on > the sequence length. > > from Bio import SeqIO > > handle = open("example.fasta", "rU") > records = list(SeqIO.parse(handle, "fasta")) > records.sort(reverse=True) > > Thank you in advance. > > Best regards, If you can hold all the records in memory at once (which it looks like you can) then this is pretty easy. You need to do a custom search - the built in list help is a bit terse: >>> help([].sort) Help on built-in function sort: sort(...)
L.sort(cmp=None, key=None, reverse=False) -- stable sort *IN PLACE*; cmp(x, y) -> -1, 0, 1 You need to pass in a function as the cmp argument, which will take two objects (here SeqRecords) and return -1, 0 or 1. The concise way to do this is with a lambda, and reuse the built-in function cmp but acting on the length of the records. For example,

from Bio import SeqIO
handle = open("example.fasta", "rU")
records = list(SeqIO.parse(handle, "fasta"))
handle.close()
records.sort(cmp=lambda x,y: cmp(len(x), len(y)))
#records.sort(cmp=lambda x,y: cmp(len(x), len(y)), reverse=True)
out_handle = open("sorted.fasta", "w")
SeqIO.write(records, out_handle, "fasta")
out_handle.close()

Peter From mitlox at op.pl Wed Mar 17 12:01:35 2010 From: mitlox at op.pl (xyz) Date: Wed, 17 Mar 2010 22:01:35 +1000 Subject: [Biopython] sort fasta file In-Reply-To: <320fb6e01003170322j6693e6b0n102190ae712b10ba@mail.gmail.com> References: <20100317200824.5f363f77@wp01> <320fb6e01003170322j6693e6b0n102190ae712b10ba@mail.gmail.com> Message-ID: <20100317220135.3c12e3c4@wp01> On Wed, 17 Mar 2010 10:22:48 +0000 Peter wrote: > For example, > > handle = open("example.fasta", "rU") > records = list(SeqIO.parse(handle, "fasta")) > handle.close() > records.sort(cmp=lambda x,y: cmp(len(x), len(y))) > #records.sort(cmp=reverse=True) > out_handle = open("sorted.fasta", "w") > SeqIO.write(records, out_handle, "fasta") > out_handle.close() > > Peter Thank you for the code. I only changed this and it works. records.sort(cmp=lambda x,y: len(y.seq) - len(x.seq)) If I could not hold all the records in memory at once what could I do? From crosvera at gmail.com Wed Mar 17 17:08:21 2010 From: crosvera at gmail.com (Carlos Ríos V.)
Date: Wed, 17 Mar 2010 14:08:21 -0300 Subject: [Biopython] BioPython GSOC 2010 In-Reply-To: <6a2880081003101051h6bb3e732s5176c8c3c6a23c19@mail.gmail.com> References: <6a2880081003101051h6bb3e732s5176c8c3c6a23c19@mail.gmail.com> Message-ID: <1268845701.2161.10.camel@cabernet> Hello people, I'm very interested in this idea: http://biopython.org/wiki/Google_Summer_of_Code#PDB-Tidy:_command-line_tools_for_manipulating_PDB_files I have some experience with the Bio.PDB Module, and I think that would be a very useful tool for labs. Brad Chapman wrote an e-mail that said that we have to demonstrate our knowledge of the project and open source coding capabilities. Where should I show you that? Regards. -- http://crosvera.blogspot.com Carlos Ríos V. Estudiante de Ing. (E) en Computación e Informática. Universidad del Bío-Bío VIII Región, Chile Linux user number 425502 From eric.talevich at gmail.com Wed Mar 17 18:32:44 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 17 Mar 2010 14:32:44 -0400 Subject: [Biopython] sort fasta file Message-ID: <3f6baf361003171132s4ec12e4bw12d80e2a5edf6977@mail.gmail.com> xyz wrote: > > Hello, > I would like sort multiple fasta file depends on the sequence length, > ie. from the read with longest sequence to the read with the shortest > sequence. > > I have tried to do it but I do not how to sort the records depends on > the sequence length. > > [...] > > If I could not hold all the records in memory at once what could I do? > There's also a program called uclust which can sort reads by sequence length very quickly: http://www.drive5.com/uclust/ It's designed for clustering short reads, but it includes a feature to sort sequences by decreasing length. I think it can handle files larger than available RAM, too, though I haven't tested that.
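A side note on the in-memory sort quoted earlier in this thread: the cmp argument to list.sort() exists only in Python 2, and Python 3 removed it in favour of key. The following is a minimal sketch of the same longest-first sort written with key, not anyone's posted code. The function names sort_longest_first and sort_fasta_by_length are hypothetical, the file paths are whatever the caller supplies, and the second helper assumes Biopython is installed.

```python
def sort_longest_first(records):
    """Return a new list ordered from longest to shortest.

    Works for anything that supports len(), including Biopython
    SeqRecord objects, whose len() is the sequence length. sorted()
    is stable, so records of equal length keep their input order.
    """
    return sorted(records, key=len, reverse=True)


def sort_fasta_by_length(in_path, out_path):
    """Hypothetical helper: sort a FASTA file that fits in memory."""
    # Imported here so the function above runs without Biopython.
    from Bio import SeqIO

    records = sort_longest_first(SeqIO.parse(in_path, "fasta"))
    # SeqIO.write returns the number of records written.
    return SeqIO.write(records, out_path, "fasta")


if __name__ == "__main__":
    # Plain strings stand in for SeqRecord objects in this demo.
    print(sort_longest_first(["AC", "ACGT", "A", "GG"]))
```

Using key=len computes each length once per record, whereas a cmp function is called for every comparison, so this form is usually the simpler choice when the whole file fits in memory.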
-Eric From biopython at maubp.freeserve.co.uk Thu Mar 18 10:44:09 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Mar 2010 10:44:09 +0000 Subject: [Biopython] sort fasta file In-Reply-To: <20100317220135.3c12e3c4@wp01> References: <20100317200824.5f363f77@wp01> <320fb6e01003170322j6693e6b0n102190ae712b10ba@mail.gmail.com> <20100317220135.3c12e3c4@wp01> Message-ID: <320fb6e01003180344n47fc9ba3y54c7284fc6747e25@mail.gmail.com> On Wed, Mar 17, 2010 at 12:01 PM, xyz wrote: > On Wed, 17 Mar 2010 10:22:48 +0000 > Peter wrote: >> For example, >> >> handle = open("example.fasta", "rU") >> records = list(SeqIO.parse(handle, "fasta")) >> handle.close() >> records.sort(cmp=lambda x,y: cmp(len(x), len(y))) >> #records.sort(cmp=reverse=True) >> out_handle = open("sorted.fasta", "w") >> SeqIO.write(records, out_handle, "fasta") >> out_handle.close() >> >> Peter > > Thank you for the code. I only changed this and it works. > > records.sort(cmp=lambda x,y: len(y.seq) - len(x.seq)) > > If I could not hold all the records in memory at once what could I do? I would use Bio.SeqIO.index() to give random access to the records. You would also need to load and sort the record identifiers and the lengths. Something like this:

from Bio import SeqIO
#Get the lengths and ids, and sort on length
len_and_ids = sorted((len(rec), rec.id) for rec in \
                     SeqIO.parse(open("ls_orchid.fasta"),"fasta"))
#Once sorted only need the ids, so can free some memory
ids = [id for (length, id) in len_and_ids]
del len_and_ids
#Now prepare the index
record_index = SeqIO.index("ls_orchid.fasta", "fasta")
#Now prepare a generator expression to give the
#records one-by-one for output
records = (record_index[id] for id in ids)
#Finally write these to a file
handle = open("sorted.fasta", "w")
count = SeqIO.write(records, handle, "fasta")
handle.close()
print "Sorted %i records" % count

That code should work for any file format supported by the Bio.SeqIO parse, index and write functions (e.g.
GenBank files, FASTQ, etc). Notice that it actually reads through the input file twice, once to get the ids and lengths, and once to build the index (getting the ids and file offsets). If you wanted to get a bit more low level you could do this in a single pass - but it would be more effort than using the SeqIO functions. I wonder if this example is useful enough to go in the tutorial? What do you think? Peter From subhodeep.moitra at gmail.com Thu Mar 18 17:11:56 2010 From: subhodeep.moitra at gmail.com (subhodeep moitra) Date: Thu, 18 Mar 2010 13:11:56 -0400 Subject: [Biopython] PDB Tidy Message-ID: <6a2880081003181011j2bac6661gae5dbeec4a0eb7d5@mail.gmail.com> Hi Carlos and BioPythoneers, Has anyone come across PDB-Tools: http://code.google.com/p/pdb-tools/ It's a Python implementation for cleaning up PDB files, plus some other stuff. It might be useful for someone interested in the PDB-Tidy project. :) :) Thanks Subho > Message: 1 > Date: Wed, 17 Mar 2010 14:08:21 -0300 > From: Carlos Ríos "V." > Subject: Re: [Biopython] BioPython GSOC 2010 > To: biopython at lists.open-bio.org > Message-ID: <1268845701.2161.10.camel at cabernet> > Content-Type: text/plain; charset="UTF-8" > > Hello people, > > I'm very interested in this idea: > > http://biopython.org/wiki/Google_Summer_of_Code#PDB-Tidy:_command-line_tools_for_manipulating_PDB_files > > I have some experience with the Bio.PDB module, and I think that would > be a very useful tool for labs. > > Brad Chapman wrote an e-mail saying that we have to demonstrate our > knowledge of the project and our open source coding capabilities. Where > should I show you that? > > Regards. > > -- > http://crosvera.blogspot.com > > Carlos Ríos V. > Estudiante de Ing. (E) en Computación e Informática.
> [...]
> > Peter > > > ------------------------------ > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > > End of Biopython Digest, Vol 87, Issue 19 > ***************************************** > -- Subhodeep Moitra First Year, Masters Student Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh, PA , USA From eric.talevich at gmail.com Thu Mar 18 19:25:01 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 18 Mar 2010 15:25:01 -0400 Subject: [Biopython] BioPython GSOC 2010 Message-ID: <3f6baf361003181225w8bce2fdg5bd7ba894a717ccf@mail.gmail.com> On Wed, 17 Mar 2010 at 14:08:21 -0300, Carlos Rios "V." wrote: > Hello people, > > I'm very interesting in this idea: > > http://biopython.org/wiki/Google_Summer_of_Code#PDB-Tidy:_command-line_tools_for_manipulating_PDB_files > > I have some experience with the Bio.PDB Module, and I think that would > be a very useful tool for labs. > > Brad Chapman wrote an e-mail that said that we have to demonstrate our > knowledge of the project and open source coding capabilities, where I > have to show you that? > Well, OBF has been accepted as a mentoring organization now: http://socghop.appspot.com/gsoc/program/accepted_orgs/google/gsoc2010 So I'd recommend getting yourself set up on GitHub -- other mentoring organizations use git too, and it helps your application to show that you're already familiar with the build tools. Carlos, I see that you have plenty of code that you're willing to share, currently distributed as tarballs from your blog. You could start by publishing some selected projects on GitHub and playing around with it a little there, as well as making your own fork of Biopython. 
For bonus points, once you have your own Biopython development branch, see if you can write a patch for any of the open issues on Bugzilla: http://bugzilla.open-bio.org/buglist.cgi?product=Biopython&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED This would all look great on your GSoC application. Thanks for your interest, and best of luck! -Eric From p.j.a.cock at googlemail.com Thu Mar 18 22:03:09 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 18 Mar 2010 22:03:09 +0000 Subject: [Biopython] Google Summer of Code is *ON* for OBF projects! In-Reply-To: <4BA29706.8040606@cornell.edu> References: <4BA29706.8040606@cornell.edu> Message-ID: <320fb6e01003181503j7e3030aao7bce7ebf4d8be06@mail.gmail.com> Good news for GSoC 2010 :) ---------- Forwarded message ---------- From: Robert Buels Date: Thu, Mar 18, 2010 at 9:11 PM Subject: Google Summer of Code is *ON* for OBF projects! Hi all, Great news: Google announced today that the Open Bioinformatics Foundation has been accepted as a mentoring organization for this summer's Google Summer of Code! GSoC is a Google-sponsored student internship program for open-source projects, open to students from around the world (not just US residents). Students are paid a $5000 USD stipend to work as a developer on an open-source project for the summer. For more on GSoC, see GSoC 2010 FAQ at http://tinyurl.com/yzemdfo Student applications are due April 9, 2010 at 19:00 UTC. Students who are interested in participating should look at the OBF's GSoC page at http://open-bio.org/wiki/Google_Summer_of_Code, which lists project ideas, and who to contact about applying. For current developers on OBF projects, please consider volunteering to be a mentor if you have not already, and contribute project ideas. Just list your name and project ideas on OBF wiki and on the relevant project's GSoC wiki page. Thanks to all who helped make OBF's application to GSoC a success, and let's have a great, productive summer of code!
Rob Buels OBF GSoC 2010 Administrator From cjfields at illinois.edu Thu Mar 18 21:57:13 2010 From: cjfields at illinois.edu (Chris Fields) Date: Thu, 18 Mar 2010 16:57:13 -0500 Subject: [Biopython] Google Summer of Code is *ON* for OBF projects! References: <4BA29706.8040606@cornell.edu> Message-ID: <21A0665D-C3CA-4830-A8F7-A989C4D23627@illinois.edu> (forwarding to the BioPython list, as the original post is still clearing the OBF mail filters) Hi all, Great news: Google announced today that the Open Bioinformatics Foundation has been accepted as a mentoring organization for this summer's Google Summer of Code! GSoC is a Google-sponsored student internship program for open-source projects, open to students from around the world (not just US residents). Students are paid a $5000 USD stipend to work as a developer on an open-source project for the summer. For more on GSoC, see GSoC 2010 FAQ at http://tinyurl.com/yzemdfo Student applications are due April 9, 2010 at 19:00 UTC. Students who are interested in participating should look at the OBF's GSoC page at http://open-bio.org/wiki/Google_Summer_of_Code, which lists project ideas, and who to contact about applying. For current developers on OBF projects, please consider volunteering to be a mentor if you have not already, and contribute project ideas. Just list your name and project ideas on OBF wiki and on the relevant project's GSoC wiki page. Thanks to all who helped make OBF's application to GSoC a success, and let's have a great, productive summer of code! Rob Buels OBF GSoC 2010 Administrator From ap12 at sanger.ac.uk Fri Mar 19 19:19:05 2010 From: ap12 at sanger.ac.uk (Anne Pajon) Date: Fri, 19 Mar 2010 19:19:05 +0000 Subject: [Biopython] zero-length feature Message-ID: Dear, I am having trouble writing out EMBL file for feature of size one. I've modified InsdcIO.py to fit my need. Because when I try to submit my file to EMBL, it comes back with this comment: badly formatted -- you need a .. between locations. 
def _insdc_location_string_ignoring_strand_and_subfeatures(feature):
    if feature.ref:
        ref = "%s:" % feature.ref
    else:
        ref = ""
    assert not feature.ref_db
    if feature.location.start == feature.location.end \
    and isinstance(feature.location.end, SeqFeature.ExactPosition):
        #Special case, 12^13 gets mapped to location 12:12
        #(a zero length slice, meaning the point between two letters)
        return "%s%i..%i" % (ref, feature.location.end.position+1, feature.location.end.position+1)
    else:
        #Typical case, e.g. 12..15 gets mapped to 11:15
        return ref \
               + _insdc_feature_position_string(feature.location.start, +1) \
               + ".." + \
               _insdc_feature_position_string(feature.location.end)

But of course I am getting errors when running the tests:

======================================================================
FAIL: GenBank file to BioSQL and back to a GenBank file, NC_005816.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_BioSQL.py", line 419, in test_NC_005816
    self.loop(os.path.join(os.getcwd(), "GenBank", "NC_005816.gb"), "gb")
  File "test_BioSQL.py", line 481, in loop
    self.assert_(compare_record(old, new))
  File "seq_tests_common.py", line 261, in compare_record
    if not compare_features(old.features, new.features):
  File "seq_tests_common.py", line 243, in compare_features
    if not compare_feature(old_f, new_f):
  File "seq_tests_common.py", line 98, in compare_feature
    raise e
AssertionError: [5933:5933] -> [5933:5934]

======================================================================
ERROR: Write and read back AE017046.embl
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_SeqIO_features.py", line 777, in test_AE017046
    write_read(os.path.join("EMBL", "AE017046.embl"), "embl", "gb")
  File "test_SeqIO_features.py", line 32, in write_read
    compare_records(gb_records, gb_records2)
  File "test_SeqIO_features.py", line 99, in compare_records
    if not
compare_record(old,new,expect_minor_diffs): File "test_SeqIO_features.py", line 50, in compare_record if not compare_features(old.features, new.features): File "test_SeqIO_features.py", line 149, in compare_features if not compare_feature(old,new,ignore_sub_features): File "test_SeqIO_features.py", line 110, in compare_feature % (old.location, new.location, str(old), str(new))) ValueError: [5933:5933] versus [5933:5934]: type: variation location: [5933:5933] ref: None:None strand: 1 qualifiers: Key: note, Value: ['compared to AL109969'] Key: replace, Value: ['a'] vs: type: variation location: [5933:5934] ref: None:None strand: 1 qualifiers: Key: note, Value: ['compared to AL109969'] Key: replace, Value: ['a'] ====================================================================== ERROR: Write and read back NC_005816.gb ---------------------------------------------------------------------- Traceback (most recent call last): File "test_SeqIO_features.py", line 702, in test_NC_005816 write_read(os.path.join("GenBank", "NC_005816.gb"), "gb", "gb") File "test_SeqIO_features.py", line 32, in write_read compare_records(gb_records, gb_records2) File "test_SeqIO_features.py", line 99, in compare_records if not compare_record(old,new,expect_minor_diffs): File "test_SeqIO_features.py", line 50, in compare_record if not compare_features(old.features, new.features): File "test_SeqIO_features.py", line 149, in compare_features if not compare_feature(old,new,ignore_sub_features): File "test_SeqIO_features.py", line 110, in compare_feature % (old.location, new.location, str(old), str(new))) ValueError: [5933:5933] versus [5933:5934]: type: variation location: [5933:5933] ref: None:None strand: 1 qualifiers: Key: note, Value: ['compared to AL109969'] Key: replace, Value: ['a'] vs: type: variation location: [5933:5934] ref: None:None strand: 1 qualifiers: Key: note, Value: ['compared to AL109969'] Key: replace, Value: ['a'] 
====================================================================== ERROR: Write and read back SC10H5.embl ---------------------------------------------------------------------- Traceback (most recent call last): File "test_SeqIO_features.py", line 792, in test_SC10H5 write_read(os.path.join("EMBL", "SC10H5.embl"), "embl", "gb") File "test_SeqIO_features.py", line 32, in write_read compare_records(gb_records, gb_records2) File "test_SeqIO_features.py", line 99, in compare_records if not compare_record(old,new,expect_minor_diffs): File "test_SeqIO_features.py", line 50, in compare_record if not compare_features(old.features, new.features): File "test_SeqIO_features.py", line 149, in compare_features if not compare_feature(old,new,ignore_sub_features): File "test_SeqIO_features.py", line 110, in compare_feature % (old.location, new.location, str(old), str(new))) ValueError: [1800:1800] versus [1800:1801]: type: misc_feature location: [1800:1800] ref: None:None strand: 1 qualifiers: Key: note, Value: ['Zero-length feature added to test Bioperl parsing'] vs: type: misc_feature location: [1800:1801] ref: None:None strand: 1 qualifiers: Key: note, Value: ['Zero-length feature added to test Bioperl parsing'] ====================================================================== FAIL: Features: write/read simple between locations. ---------------------------------------------------------------------- Traceback (most recent call last): File "test_SeqIO_features.py", line 373, in test_between "10^11") AssertionError: '11..11' != '10^11' ---------------------------------------------------------------------- Ran 144 tests in 226.037 seconds FAILED (failures = 2) What could be a better solution? Thanks to let me know. Kind regards, Anne. 
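The mismatches reported above ([5933:5933] versus [5933:5934], and '11..11' versus '10^11') suggest that the two location forms are not interchangeable when read back. The following is a minimal sketch of that round trip, assuming only the 0-based, end-exclusive slice convention visible in the tracebacks. It is not Biopython's code: both function names are hypothetical, and only the plain N^M and N..M forms are handled.

```python
import re


def location_to_insdc(start, end):
    """Format a 0-based, end-exclusive slice [start:end] as a location string.

    Hypothetical helper for illustration only.
    """
    if start == end:
        # Zero-length slice: the site between two neighbouring bases.
        return "%i^%i" % (start, start + 1)
    # Ordinary span: 1-based, inclusive at both ends.
    return "%i..%i" % (start + 1, end)


def insdc_to_location(text):
    """Parse 'N^M' or 'N..M' back into a 0-based (start, end) pair.

    Hypothetical helper for illustration only.
    """
    between = re.fullmatch(r"(\d+)\^(\d+)", text)
    if between:
        left, right = int(between.group(1)), int(between.group(2))
        if right != left + 1:
            raise ValueError("between-positions must be adjacent: %r" % text)
        return (left, left)
    span = re.fullmatch(r"(\d+)\.\.(\d+)", text)
    if span:
        return (int(span.group(1)) - 1, int(span.group(2)))
    raise ValueError("unsupported location: %r" % text)


# A zero-length feature survives the round trip in the ^ form...
zero_length = insdc_to_location(location_to_insdc(10, 10))
# ...but written as 11..11 it reads back as a one-base feature.
one_base = insdc_to_location("11..11")
```

Under these assumptions `zero_length` is `(10, 10)` while `one_base` is `(10, 11)`, which is the same shape of difference the tests complain about.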
-- Dr Anne Pajon - Pathogen Genomics, Team 81 Sanger Institute, Wellcome Trust Genome Campus, Hinxton Cambridge CB10 1SA, United Kingdom +44 (0)1223 494 798 (office) | +44 (0)7958 511 353 (mobile) -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From barendt at mail.med.upenn.edu Sat Mar 20 02:53:07 2010 From: barendt at mail.med.upenn.edu (Gregory Barendt) Date: Fri, 19 Mar 2010 22:53:07 -0400 Subject: [Biopython] RNA Secondary structure Message-ID: <12DC2C15-ABE8-4E65-BB01-45A7388701DC@mail.med.upenn.edu> Does anyone know of good libraries for looking at RNA secondary structure? I'm looking for particular stem loops in particular locations in lots (hundreds of thousands) of sequences. Right now, I'm pretty inelegantly parsing the .ct file generated by UNAfold. I need to modify my search to be a little more flexible, so I'd much rather use an existing tool than continue to reinvent the wheel. Any advice would be greatly appreciated. Thanks, Greg From vincent at vincentdavis.net Sat Mar 20 03:56:46 2010 From: vincent at vincentdavis.net (Vincent Davis) Date: Fri, 19 Mar 2010 21:56:46 -0600 Subject: [Biopython] quantile normalization method Message-ID: <77e831101003192056o18009bdejc80235aa36dc6d28@mail.gmail.com> Is there a quantile normalization method in Biopython? I searched but did not find one. If not, it looks straightforward; would it be of any interest to the community for me to contribute one?

1. given n arrays of length p, form X of dimension p × n where each array is a column;
2. sort each column of X to give X_sort;
3. take the means across rows of X_sort and assign this mean to each element in the row to get X'_sort;
4. get X_normalized by rearranging each column of X'_sort to have the same ordering as original X

From "A comparison of normalization methods for high density oligonucleotide array data based on variance and bias", B. M. Bolstad, R. A. Irizarry, M. Astrand and T. P. Speed

*Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn From bartek at rezolwenta.eu.org Sat Mar 20 07:55:20 2010 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Sat, 20 Mar 2010 08:55:20 +0100 Subject: [Biopython] quantile normalization method In-Reply-To: <77e831101003192056o18009bdejc80235aa36dc6d28@mail.gmail.com> References: <77e831101003192056o18009bdejc80235aa36dc6d28@mail.gmail.com> Message-ID: <8b34ec181003200055j76230b4uc1b4e0707b24afd1@mail.gmail.com> On Sat, Mar 20, 2010 at 4:56 AM, Vincent Davis wrote:
> Is there a quantile normalization method in Biopython? I searched but did not
> find one. If not, it looks straightforward; would it be of any interest to the
> community for me to contribute one?
>
> 1. given n arrays of length p, form X of dimension
> p × n where each array is a column;
> 2. sort each column of X to give X_sort;
> 3. take the means across rows of X_sort and assign this
> mean to each element in the row to get X'_sort;
> 4. get X_normalized by rearranging each column of
> X'_sort to have the same ordering as original X
>
> From
> "A comparison of normalization methods for high
> density oligonucleotide array data based on
> variance and bias",
> B. M. Bolstad, R. A. Irizarry, M. Astrand and T. P. Speed
>

Hi, I don't think there is such a method available. I'm using the original R implementation by Bolstad et al. myself. It requires rpy2 and R to be installed. It can be achieved in a few lines of code:
import numpy
import rpy2.robjects as robjects
#ll = list of concatenated values to normalize
v = robjects.FloatVector(ll)
#numrows=number of vectors that made up ll
m = robjects.r['matrix'](v, nrow = numrows, byrow=True)
robjects.r('require("preprocessCore")')
normq=robjects.r('normalize.quantiles')
norm_a=numpy.array(normq(m))
#norm_a=normalized array
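
For comparison with the R call above, the four quoted steps can also be sketched directly in NumPy. This is only an illustration, not code from this thread or from Biopython: the function name `quantile_normalize` is hypothetical, it assumes there are no missing values, and it makes no special provision for ties (tied values in a column simply receive neighbouring row means).

```python
import numpy as np


def quantile_normalize(x):
    """Quantile-normalize the columns of a 2-D array.

    Rows are probes, columns are samples. Hypothetical helper for
    illustration; missing values and ties get no special treatment.
    """
    x = np.asarray(x, dtype=float)
    # Step 2: sort each column, remembering where each value came from.
    order = np.argsort(x, axis=0, kind="stable")
    x_sorted = np.take_along_axis(x, order, axis=0)
    # Step 3: mean across each row of the sorted matrix.
    row_means = x_sorted.mean(axis=1)
    # Step 4: write the row means back in each column's original ordering.
    result = np.empty_like(x)
    np.put_along_axis(result, order, row_means[:, np.newaxis], axis=0)
    return result
```

After normalization every column holds the same set of values (the row means of the sorted matrix), arranged according to that column's original ranking.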
 
If your method is a pure Python implementation which is comparably fast, I think it would be worth having it in Biopython, since the method is (in my opinion) quite useful and it would remove the dependency on R from some of my scripts. cheers Bartek From vincent at vincentdavis.net Sat Mar 20 17:16:37 2010 From: vincent at vincentdavis.net (Vincent Davis) Date: Sat, 20 Mar 2010 11:16:37 -0600 Subject: [Biopython] quantile normalization method In-Reply-To: <8b34ec181003200055j76230b4uc1b4e0707b24afd1@mail.gmail.com> References: <77e831101003192056o18009bdejc80235aa36dc6d28@mail.gmail.com> <8b34ec181003200055j76230b4uc1b4e0707b24afd1@mail.gmail.com> Message-ID: <77e831101003201016u32b29872ic71ca87654c45215@mail.gmail.com> @Bartek Wilczynski Could you test the following code against R, for speed and accuracy? I am using numpy, imported as np. I did not find any clear documentation as to whether the Bolstad method, or quantile normalization methods in general, drop outliers. Any input here would be great. I also have to thank Anne Archibald on the scipy mailing list for the fancy array indexing help.

import numpy as np

def quantile_normalization(anarray):
    """
    anarray with samples in the columns and probes across the rows
    """
    #Work in floats so the row means are not truncated for integer input
    A = np.asarray(anarray, dtype=float)
    AA = np.zeros_like(A)
    #I[i, j] is the row of the i-th smallest value in column j
    I = np.argsort(A, axis=0)
    cols = np.arange(A.shape[1])
    #Mean of each row of the sorted matrix, put back in the original order
    AA[I, cols] = np.mean(A[I, cols], axis=1)[:, np.newaxis]
    return AA

*Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn On Sat, Mar 20, 2010 at 1:55 AM, Bartek Wilczynski wrote: > On Sat, Mar 20, 2010 at 4:56 AM, Vincent Davis wrote: > >> Is there a quantile normalization method in biopython, I search but did >> not >> find. If not it looks straight forward would it be of any interest to the >> community for me to contribute a method >> >> 1. given n arrays of length p, form X of dimension >> p ? n where each array is a column; >> 2. sort each column of X to give X sort ; >> 3.
take the means across rows of X sort and assign this >> mean to each element in the row to get X sort ; >> 4. get X normalized by rearranging each column of >> X sort to have the same ordering as original X >> >> From >> A comparison of normalization methods for high >> density oligonucleotide array data based on >> variance and bias >> B. M. Bolstad 1,?, R. A. Irizarry 2, M. Astrand 3 and T. P. Speed 4, 5 >> ? >> > > Hi, > > I don't think there is such a method available. > > I'm myself using the original R implementation by Bolstad et al. It > requires rPy and R installed. It can be achieved in a few lines of code: > >
> import rpy2.robjects as robjects
> #ll = list of concatenated values to normalize
> v = robjects.FloatVector(ll)
> #numrows=number of vectors that made up ll
> m = robjects.r['matrix'](v, nrow = numrows, byrow=True)
> robjects.r('require("preprocessCore")')
> normq=robjects.r('normalize.quantiles')
> norm_a=numpy.array(normq(m))
> #norm_a=normalized array
>  
> > If your method is a pure python implementation which is comparably fast I > think it would be worth to have it in Biopython since the method is (in my > opinion) quite useful and it would remove the dependency on R from some of > my scripts. > > cheers > Bartek > From lgautier at gmail.com Sat Mar 20 18:05:42 2010 From: lgautier at gmail.com (Laurent Gautier) Date: Sat, 20 Mar 2010 19:05:42 +0100 Subject: [Biopython] quantile normalization method In-Reply-To: References: Message-ID: <4BA50E76.5040304@gmail.com> Hi Bartek and Vincent, A few comments:

A/ The algorithm is fairly straightforward, as you noted, but beware of details such as missing values, the ability to normalize against a target distribution, or ties when ranking (although I'd have to check if those receive a special treatment). The quantile normalization code in the R package "preprocessCore" is in C and might outperform a pure Python implementation.

B/ There is a variety of normalization methods in Bioconductor, and it might make sense to embrace it as a dependency (rather than reimplement it). I have bindings for Bioconductor up my sleeve, about to be distributed to a few people for testing. The public release might be around ISMB, BOSC time.

C/ norm_a = numpy.array(normq(m)) can be replaced by norm_a = numpy.asarray(normq(m)) to improve performance whenever m is of substantial size (as no copy is made - see http://rpy.sourceforge.net/rpy2/doc-2.1/html/numpy.html#from-rpy2-to-numpy )

Best, Laurent On 3/20/10 5:00 PM, biopython-request at lists.open-bio.org wrote: >> > Is there a quantile normalization method in biopython, I search but did not >> > find. If not it looks straight forward would it be of any interest to the >> > community for me to contribute a method >> > >> > 1. given n arrays of length p, form X of dimension >> > p ? n where each array is a column; >> > 2. sort each column of X to give X sort ; >> > 3.
take the means across rows of X sort and assign this >> > mean to each element in the row to get X sort ; >> > 4. get X normalized by rearranging each column of >> > X sort to have the same ordering as original X >> > >> > From >> > A comparison of normalization methods for high >> > density oligonucleotide array data based on >> > variance and bias >> > B. M. Bolstad 1,?, R. A. Irizarry 2, M. Astrand 3 and T. P. Speed 4, 5 >> > ? >> > > Hi, > > I don't think there is such a method available. > > I'm myself using the original R implementation by Bolstad et al. It requires > rPy and R installed. It can be achieved in a few lines of code: > >
> import rpy2.robjects as robjects
> #ll = list of concatenated values to normalize
> v = robjects.FloatVector(ll)
> #numrows=number of vectors that made up ll
> m = robjects.r['matrix'](v, nrow = numrows, byrow=True)
> robjects.r('require("preprocessCore")')
> normq=robjects.r('normalize.quantiles')
> norm_a=numpy.array(normq(m))
> #norm_a=normalized array
>   
> > If your method is a pure python implementation which is comparably fast I > think it would be worth to have it in Biopython since the method is (in my > opinion) quite useful and it would remove the dependency on R from some of > my scripts. > > cheers > Bartek > From vincent at vincentdavis.net Sat Mar 20 18:26:27 2010 From: vincent at vincentdavis.net (Vincent Davis) Date: Sat, 20 Mar 2010 12:26:27 -0600 Subject: [Biopython] quantile normalization method In-Reply-To: <4BA50E76.5040304@gmail.com> References: <4BA50E76.5040304@gmail.com> Message-ID: <77e831101003201126i22f72970x189b9f7a9335fb33@mail.gmail.com> > > @Laurent Gautier The algorithm is fairly straightforward, as you noted it, but beware of > details such missing values, ability to normalize against a target > distribution, or ties when ranking (although I'd have to check if those > receive a special treatment).The quantile normalization code in the R > package "preprocessCore" is in C and might outperform a pure Python > implementation. Not sure about speed. I have 84 microarrays samples with ~190,000 probes and it normalizes in 7 sec. I have no idea how fast R is or how many arrays are common to normalize. There is a variety of normalization methods in bioconductor, and it might > make sense to embrace it as a dependency (rather than reimplement it). I > have bindings for Bioconductor up my sleeve about to be distributed to few > people for testing. The public release might be around ISMB, BOSC time. I considered this and in the long run you might be right. But I don't know R and I placed more value on understanding the normalization than learning R. This is in part because there is little advantage in using R in the next steps of my analysis. Bindings seem like a good idea but they would be a black box to me. 
I guess for me, since most of this is new, the value of implementing my own normalization, both in learning more about Python and in understanding the normalization, outweighs the benefits of implementing it in R. As a side question, why use Biopython? Are there ways in which it is better than R? For me it is purely that I know Python (a little) and know nothing about R. Sure, if I am just going through step-by-step instructions from a Bioconductor user manual I am fine, but once I want to do something new I am lost. Not that I can't learn; I am just prioritizing my learning. And thanks for this: > norm_a = numpy.array(normq(m)) > > can be replaced by > > norm_a = numpy.asarray(normq(m)) > > to improve performances whenever m is of substantial size (as no copy is > made - see > http://rpy.sourceforge.net/rpy2/doc-2.1/html/numpy.html#from-rpy2-to-numpy > ) > *Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn On Sat, Mar 20, 2010 at 12:05 PM, Laurent Gautier wrote: > Hi Bartek and Vincent, > > Few comments: > > A/ > > The algorithm is fairly straightforward, as you noted it, but beware of > details such missing values, ability to normalize against a target > distribution, or ties when ranking (although I'd have to check if those > receive a special treatment). > The quantile normalization code in the R package "preprocessCore" is in C > and might outperform a pure Python implementation. > > B/ > > There is a variety of normalization methods in bioconductor, and it might > make sense to embrace it as a dependency (rather than reimplement it). I > have bindings for Bioconductor up my sleeve about to be distributed to few > people for testing. The public release might be around ISMB, BOSC time.
> > C/ > > > norm_a = numpy.array(normq(m)) > > can be replaced by > > norm_a = numpy.as_array(normq(m)) > > to improve performances whenever m is of substantial size (as no copy is > made - see > http://rpy.sourceforge.net/rpy2/doc-2.1/html/numpy.html#from-rpy2-to-numpy) > > > > Best, > > > Laurent > > > > > On 3/20/10 5:00 PM, biopython-request at lists.open-bio.org wrote: > >> > Is there a quantile normalization method in biopython, I search but did >>> not >>> > find. If not it looks straight forward would it be of any interest to >>> the >>> > community for me to contribute a method >>> > >>> > 1. given n arrays of length p, form X of dimension >>> > p ? n where each array is a column; >>> > 2. sort each column of X to give X sort ; >>> > 3. take the means across rows of X sort and assign this >>> > mean to each element in the row to get X sort ; >>> > 4. get X normalized by rearranging each column of >>> > X sort to have the same ordering as original X >>> > >>> > From >>> > A comparison of normalization methods for high >>> > density oligonucleotide array data based on >>> > variance and bias >>> > B. M. Bolstad 1,?, R. A. Irizarry 2, M. Astrand 3 and T. P. Speed 4, 5 >>> > ? >>> > >>> >> Hi, >> >> I don't think there is such a method available. >> >> I'm myself using the original R implementation by Bolstad et al. It >> requires >> rPy and R installed. It can be achieved in a few lines of code: >> >>
>> import rpy2.robjects as robjects
>> #ll = list of concatenated values to normalize
>> v = robjects.FloatVector(ll)
>> #numrows=number of vectors that made up ll
>> m = robjects.r['matrix'](v, nrow = numrows, byrow=True)
>> robjects.r('require("preprocessCore")')
>> normq=robjects.r('normalize.quantiles')
>> norm_a=numpy.array(normq(m))
>> #norm_a=normalized array
>>  
>> If your method is a pure Python implementation which is comparably fast, I
>> think it would be worth having it in Biopython, since the method is (in my
>> opinion) quite useful and it would remove the dependency on R from some of
>> my scripts.
>>
>> cheers
>> Bartek

From lgautier at gmail.com  Sat Mar 20 19:30:45 2010
From: lgautier at gmail.com (Laurent Gautier)
Date: Sat, 20 Mar 2010 20:30:45 +0100
Subject: [Biopython] quantile normalization method
In-Reply-To: <77e831101003201126i22f72970x189b9f7a9335fb33@mail.gmail.com>
References: <4BA50E76.5040304@gmail.com>
	<77e831101003201126i22f72970x189b9f7a9335fb33@mail.gmail.com>
Message-ID: <4BA52265.9060908@gmail.com>

On 3/20/10 7:26 PM, Vincent Davis wrote:
> @Laurent Gautier
>
>     The algorithm is fairly straightforward, as you noted, but beware
>     of details such as missing values, the ability to normalize against a
>     target distribution, or ties when ranking (although I'd have to
>     check if those receive a special treatment). The quantile
>     normalization code in the R package "preprocessCore" is in C and
>     might outperform a pure Python implementation.
>
> Not sure about speed. I have 84 microarray samples with ~190,000 probes
> and it normalizes in 7 sec. I have no idea how fast R is or how many
> arrays are common to normalize.

So speed is not an issue for your use-case; even a 10x speedup might not
justify the effort required to move to C, as this operation is performed
once in a while (once per dataset mostly).

I am not sure there is a "common" number. When still working with arrays, I
can find myself with several hundred arrays with ~2 million probes each.

>     There is a variety of normalization methods in Bioconductor, and it
>     might make sense to embrace it as a dependency (rather than
>     reimplement it). I have bindings for Bioconductor up my sleeve, about
>     to be distributed to a few people for testing. The public release
>     might be around ISMB, BOSC time.
>
> I considered this and in the long run you might be right. But I don't
> know R and I placed more value on understanding the normalization than
> learning R. This is in part because there is little advantage in using R
> in the next steps of my analysis.

Surprising, but you'll know best.

> Bindings seem like a good idea but
> they would be a black box to me. I guess for me, since most of this is
> new, the value of implementing my own normalization (both in learning more
> about Python and in understanding the normalization) outweighs the
> benefits of implementing it in R.

Everyone's mileage will vary. I often like building on existing libraries
(although I frequently read how methods work): this makes my palette of
tools richer than if I had to reimplement everything, and gives me time to
create my own.
Having said this, learning a language by implementing is a great way to go.

> As a side question, why use Biopython? Are there ways in which it is
> better than R?

In short (and therefore with some imprecision and/or distortion), Biopython
is a "Python package" (i.e., a collection of modules) for bioinformatics,
with a forte in handling a number of bioinformatics file formats. R is a
language for statistics, data analysis and graphics.

> For me it is purely that I know Python (a little) and know nothing about
> R. Sure, if I am just going through step-by-step instructions from
> a Bioconductor user manual I am fine, but once I want to do something new
> I am lost. Not that I can't learn; I am just prioritizing my learning.

Then the idea is that you consider R/Bioconductor as a Python library.
Should you want something new, you can then implement it in Python.

Laurent

From vincent at vincentdavis.net  Sat Mar 20 19:35:33 2010
From: vincent at vincentdavis.net (Vincent Davis)
Date: Sat, 20 Mar 2010 13:35:33 -0600
Subject: [Biopython] quantile normalization method
In-Reply-To: <4BA52265.9060908@gmail.com>
References: <4BA50E76.5040304@gmail.com>
	<77e831101003201126i22f72970x189b9f7a9335fb33@mail.gmail.com>
	<4BA52265.9060908@gmail.com>
Message-ID: <77e831101003201235w41db2061q95871d17f797136@mail.gmail.com>

@Laurent Gautier, I agree with everything you said :)

What I could really use is someone to test the Python code against R,
just to help verify that the results are not completely wrong.

*Vincent Davis
720-301-3003 *
vincent at vincentdavis.net
my blog | LinkedIn

From anaryin at gmail.com  Sun Mar 21 01:38:07 2010
From: anaryin at gmail.com (João Rodrigues)
Date: Sat, 20 Mar 2010 18:38:07 -0700
Subject: [Biopython] GSOC Bio.PDB Project
Message-ID:

Hello All,

I've been using BioPython for a while now and I guess I'm a spoiled brat for
never giving anything back :) Also, I've been wanting to participate in the
last two years of GSOC but I've never found a project that I felt was
adequate to my knowledge (i.e. usually too hard). Thus, with this year's
Bio.PDB project, I guess I can give it a try and hope to be accepted.

I don't have that much experience with coding in collaborative environments,
nor do I have any in big projects, but that's exactly what I'm looking
forward to gaining. I know my way around BioPython and the Bio.PDB module,
and I've had enough headaches dealing with PDB files in the past couple of
years to nurture hatred up to a certain level :) And I have a B.Sc in
Biochem, which is a double-edged knife for comp. biology.

With this said, I guess I have to wait for a reply. If you need extra info,
feel free to email me.

João Rodrigues

@ http://stanford.edu/~joaor/

From vincent at vincentdavis.net  Mon Mar 22 04:02:20 2010
From: vincent at vincentdavis.net (Vincent Davis)
Date: Sun, 21 Mar 2010 22:02:20 -0600
Subject: [Biopython] quantile normalization method
In-Reply-To: <77e831101003201235w41db2061q95871d17f797136@mail.gmail.com>
References: <4BA50E76.5040304@gmail.com>
	<77e831101003201126i22f72970x189b9f7a9335fb33@mail.gmail.com>
	<4BA52265.9060908@gmail.com>
	<77e831101003201235w41db2061q95871d17f797136@mail.gmail.com>
Message-ID: <77e831101003212102s3850b60au553d19a719b4742c@mail.gmail.com>

I found a mistake: the np.zeros_like(A) array needs to be float64;
otherwise it takes the (int) dtype of the input, so the final results would
have been rounded to int.

import numpy as np

def quantile_normalization(anarray):
    """
    anarray with samples in the columns and probes across the rows
    """
    # Convert to float64; assigning to .dtype would reinterpret the raw
    # bytes instead of converting the values.
    A = np.asarray(anarray, dtype=np.float64)
    AA = np.zeros_like(A)
    I = np.argsort(A, axis=0)
    cols = np.arange(A.shape[1])
    AA[I, cols] = np.mean(A[I, cols], axis=1)[:, np.newaxis]
    return AA

*Vincent Davis
720-301-3003 *
vincent at vincentdavis.net
my blog | LinkedIn

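
For readers following the thread, the four steps of quantile normalization discussed here can also be written out one step at a time in NumPy. This is only a minimal sketch under stated assumptions, not the preprocessCore implementation and not code posted by anyone in the thread: the name `quantile_normalize_sketch` and the toy matrix are made up for illustration, ties are simply broken by sort order, and missing values (the caveats Laurent raises) get no special treatment.

```python
import numpy as np


def quantile_normalize_sketch(x):
    """Quantile-normalize the columns of x (probes in rows, samples in columns).

    Hypothetical helper for illustration only: ties are broken by sort
    order and NaNs are not handled.
    """
    x = np.asarray(x, dtype=np.float64)
    # Step 2: sort each column independently, remembering the ordering.
    order = np.argsort(x, axis=0, kind="stable")
    x_sorted = np.take_along_axis(x, order, axis=0)
    # Step 3: take the mean across each row of the sorted matrix.
    row_means = x_sorted.mean(axis=1)
    # Step 4: write the means back in each column's original ordering.
    out = np.empty_like(x)
    np.put_along_axis(out, order, row_means[:, np.newaxis], axis=0)
    return out


# Toy data (assumed): 4 probes in rows, 3 samples in columns.
toy = np.array([[5, 4, 3],
                [2, 1, 4],
                [3, 4, 6],
                [4, 2, 8]])
normalized = quantile_normalize_sketch(toy)
print(normalized)
```

After normalization every column holds the same set of values (the row means of the sorted matrix), arranged in that column's original rank order. Running R's normalize.quantiles on the same matrix would be one way to do the cross-check asked for above; the two results may differ where a column contains tied values, if ties are treated differently there.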
From eric.talevich at gmail.com  Mon Mar 22 04:12:36 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 22 Mar 2010 00:12:36 -0400
Subject: [Biopython] GSOC Bio.PDB Project
In-Reply-To:
References:
Message-ID: <3f6baf361003212112o3a2ebe5bq50a7d59eae06492c@mail.gmail.com>

On Sat, Mar 20, 2010 at 9:38 PM, João Rodrigues wrote:

> Hello All,
>
> I've been using BioPython for a while now and I guess I'm a spoiled brat
> for never giving anything back :) Also, I've been wanting to participate
> in the last two years of GSOC but I've never found a project that I felt
> adequate to my knowledge (ie. usually too hard). Thus, with this year's
> Bio.PDB project, I guess I can give it a try to be accepted.

Sounds good to me! The GSoC projects are meant to be a stretch for students'
skills; otherwise you wouldn't need mentors.

> I don't have that much experience with coding in collaborative
> environments, nor I have in big projects, but that's exactly what I'm
> looking forward to earn. I know my way around BioPython and the Bio.PDB
> module, and I've had enough headaches dealing with PDB files in the past
> couple of years to nurture hatred up to a certain level :) And I have a
> B.Sc in Biochem, which is a double-edged knife for comp. biology.

Did you see my earlier e-mail about refining ideas for Bio.PDB? Looking at
your webpage, I can definitely think of some more specific projects you
could do for GSoC. If you don't want other potential students to read your
ideas in the formative stages, you can of course e-mail me directly about
planning a project.

Also, I've been attempting to herd applicants toward our bug tracker:
http://bugzilla.open-bio.org/buglist.cgi?product=Biopython&bug_status=NEW&bug_status=REOPENED

Thanks for your interest,
Eric

From biopython at maubp.freeserve.co.uk  Mon Mar 22 09:27:04 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 22 Mar 2010 09:27:04 +0000
Subject: [Biopython] RNA Secondary structure
In-Reply-To: <12DC2C15-ABE8-4E65-BB01-45A7388701DC@mail.med.upenn.edu>
References: <12DC2C15-ABE8-4E65-BB01-45A7388701DC@mail.med.upenn.edu>
Message-ID: <320fb6e01003220227u19be07c8p6f19c1d28ad37bb2@mail.gmail.com>

On Sat, Mar 20, 2010 at 2:53 AM, Gregory Barendt wrote:
> Does anyone know of good libraries for looking at RNA secondary
> structure? I'm looking for particular stem loops in particular locations
> in lots (hundreds of thousands) of sequences.
>
> Right now, I'm pretty inelegantly parsing the .ct file generated by
> UNAfold. I need to modify my search to be a little more flexible, so
> I'd much rather use an existing tool than continue to reinvent the
> wheel. Any advice would be greatly appreciated.
>
> Thanks,
> Greg

I think Kristian Rother was looking at RNA support in Biopython
last year (CC'd).

Peter

From biopython at maubp.freeserve.co.uk  Mon Mar 22 09:31:49 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 22 Mar 2010 09:31:49 +0000
Subject: [Biopython] zero-length feature
In-Reply-To:
References:
Message-ID: <320fb6e01003220231u785b019dwf79eef7dfaeb8a67@mail.gmail.com>

On Fri, Mar 19, 2010 at 7:19 PM, Anne Pajon wrote:
> Dear,
>
> I am having trouble writing out EMBL file for feature of size one.
> I've modified InsdcIO.py to fit my need. Because when I try to submit my
> file to EMBL, it comes back with this comment: badly formatted -- you
> need a .. between locations.

Hi Anne,

Could you show us the feature location string you are trying to
achieve in the EMBL output? That would help me to follow - an
example FT entry would be great.

Peter

From ap12 at sanger.ac.uk  Mon Mar 22 11:24:43 2010
From: ap12 at sanger.ac.uk (Anne Pajon)
Date: Mon, 22 Mar 2010 11:24:43 +0000
Subject: [Biopython] zero-length feature
In-Reply-To: <320fb6e01003220231u785b019dwf79eef7dfaeb8a67@mail.gmail.com>
References: <320fb6e01003220231u785b019dwf79eef7dfaeb8a67@mail.gmail.com>
Message-ID: <85DB92FE-61E3-4A37-B052-6BF6D40ADCEC@sanger.ac.uk>

Hi Peter,

Here is the feature location string I would like to achieve in the EMBL
output:

FT   gap             422950..422950
FT                   /estimated_length=1

Regards,
Anne.

--
Dr Anne Pajon - Pathogen Genomics, Team 81
Sanger Institute, Wellcome Trust Genome Campus, Hinxton
Cambridge CB10 1SA, United Kingdom
+44 (0)1223 494 798 (office) | +44 (0)7958 511 353 (mobile)

--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

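
The location string above uses EMBL numbering, which is 1-based and inclusive at both ends, whereas Python slices are 0-based and exclude the end. As an illustration only, one could scan a plain sequence string for runs of N and convert each Python-style span into a `start..end` location of the form shown above. This is a sketch under stated assumptions, not Biopython's InsdcIO code: the helper names `find_n_runs` and `embl_location` and the example sequence are made up for this note.

```python
import re


def find_n_runs(seq):
    """Return Python-style (start, end) spans, 0-based and end-exclusive,
    for every run of N/n in the sequence (hypothetical helper)."""
    return [match.span() for match in re.finditer(r"[Nn]+", str(seq))]


def embl_location(start, end):
    """Convert a 0-based, end-exclusive span to a 1-based, inclusive
    'start..end' location string (hypothetical helper)."""
    return "%i..%i" % (start + 1, end)


# Assumed toy sequence with a single-N gap and a three-N gap.
example = "ACGTNACGTNNNAC"
for start, end in find_n_runs(example):
    # The estimated length is simply the length of the run of N.
    print("gap %s  /estimated_length=%i" % (embl_location(start, end), end - start))
```

A gap made of a single N at Python index 422949 has the span (422949, 422950), which this conversion writes as 422950..422950, the string requested above.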
From biopython at maubp.freeserve.co.uk  Mon Mar 22 11:37:58 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 22 Mar 2010 11:37:58 +0000
Subject: [Biopython] zero-length feature
In-Reply-To: <85DB92FE-61E3-4A37-B052-6BF6D40ADCEC@sanger.ac.uk>
References: <320fb6e01003220231u785b019dwf79eef7dfaeb8a67@mail.gmail.com>
	<85DB92FE-61E3-4A37-B052-6BF6D40ADCEC@sanger.ac.uk>
Message-ID: <320fb6e01003220437w2f7b4d4ra9cfab1b66af2418@mail.gmail.com>

On Mon, Mar 22, 2010 at 11:24 AM, Anne Pajon wrote:
> Hi Peter,
>
> Here is the feature location string I would like to achieve in the EMBL
> output:
>
> FT   gap             422950..422950
> FT                   /estimated_length=1
>
> Regards,
> Anne.

Does your genome have a single N (or n) character at this point?

If so, it does make sense to use 422950..422950 to mean that
single letter - it really is a feature of length one. That should be
possible with the existing (unmodified) Biopython EMBL/GenBank
output. Note that in Python notation this would be the region
[422949:422950], where start != end but instead start+1 == end.

If however the gap isn't explicitly in the genome string, I think you
should be using something like 422950^422951 to indicate the
gap is between bases 422950 and 422951. This is a zero-length
feature.

Perhaps I have misunderstood your aim?

Peter

From biopython at maubp.freeserve.co.uk  Mon Mar 22 11:41:52 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 22 Mar 2010 11:41:52 +0000
Subject: [Biopython] zero-length feature
In-Reply-To: <320fb6e01003220437w2f7b4d4ra9cfab1b66af2418@mail.gmail.com>
References: <320fb6e01003220231u785b019dwf79eef7dfaeb8a67@mail.gmail.com>
	<85DB92FE-61E3-4A37-B052-6BF6D40ADCEC@sanger.ac.uk>
	<320fb6e01003220437w2f7b4d4ra9cfab1b66af2418@mail.gmail.com>
Message-ID: <320fb6e01003220441n5867af3ei39a4a90fc8c53586@mail.gmail.com>

On Mon, Mar 22, 2010 at 11:37 AM, Peter wrote:
> Does your genome have a single N (or n) character at this point?
>
> If so, it does make sense to use 422950..422950 to mean that
> single letter - it really is a feature of length one. That should be
> possible with the existing (unmodified) Biopython EMBL/GenBank
> output. Note that in python notation this would be the region
> [422949:422950], where start != end but instead start+1 == end.
>
> If however the gap isn't explicitly in the genome string, I think you
> should be using something like 422950^422951 to indicate the
> gap is between bases 422950 and 422951. This is a zero length
> feature.
>
> Perhaps I have misunderstood your aim?

I should perhaps include a quote from the EMBL documentation
to explain my question a little further:

http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html

Feature Key           gap

Definition            gap in the sequence

Mandatory qualifiers  /estimated_length=unknown or <integer>

Optional qualifiers   /experiment="text"
                      /inference="TYPE[ (same species)][:EVIDENCE_BASIS]"
                      /map="text"
                      /note="text"

Comment               the location span of the gap feature for an unknown
                      gap is 100 bp, with the 100 bp indicated as 100 "n"'s
                      in the sequence. Where estimated length is indicated
                      by an integer, this is indicated by the same number of
                      "n"'s in the sequence.
                      No upper or lower limit is set on the size of the gap.

i.e. I think EMBL would want you to insert a string of n characters
into the genome where you have a gap, and then the gap feature
would describe this string of n characters.

Peter

From ap12 at sanger.ac.uk  Mon Mar 22 11:44:00 2010
From: ap12 at sanger.ac.uk (Anne Pajon)
Date: Mon, 22 Mar 2010 11:44:00 +0000
Subject: [Biopython] zero-length feature
In-Reply-To: <320fb6e01003220437w2f7b4d4ra9cfab1b66af2418@mail.gmail.com>
References: <320fb6e01003220231u785b019dwf79eef7dfaeb8a67@mail.gmail.com>
	<85DB92FE-61E3-4A37-B052-6BF6D40ADCEC@sanger.ac.uk>
	<320fb6e01003220437w2f7b4d4ra9cfab1b66af2418@mail.gmail.com>
Message-ID:

My genome has a single N character at this point.
Here is the code I use to insert these gaps:

    # Add FT gap
    seq = record.seq
    in_N = False
    gap_features = []
    for i in range(len(seq)):
        if seq[i] == 'N' and not in_N:
            start_N = i
            in_N = True
        if in_N and not seq[i+1] == 'N':
            end_N = i
            if start_N == end_N:
                log.warning("gap of size 1 %s..%s" % (start_N, end_N))
            length = (end_N - start_N) + 1
            gap_feature = SeqFeature(FeatureLocation(start_N,end_N +1), strand=1, type="gap")
            gap_feature.qualifiers['estimated_length'] = [length]
            gap_features.append(gap_feature)
            in_N = False

What should I do to make it work with (unmodified) Biopython EMBL output?
Thanks in advance for your help.

Regards,
Anne.

--
Dr Anne Pajon - Pathogen Genomics, Team 81
Sanger Institute, Wellcome Trust Genome Campus, Hinxton
Cambridge CB10 1SA, United Kingdom
+44 (0)1223 494 798 (office) | +44 (0)7958 511 353 (mobile)

--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

From biopython at maubp.freeserve.co.uk Mon Mar 22 12:07:06 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 22 Mar 2010 12:07:06 +0000
Subject: [Biopython] zero-length feature
In-Reply-To:
References: <320fb6e01003220231u785b019dwf79eef7dfaeb8a67@mail.gmail.com>
    <85DB92FE-61E3-4A37-B052-6BF6D40ADCEC@sanger.ac.uk>
    <320fb6e01003220437w2f7b4d4ra9cfab1b66af2418@mail.gmail.com>
Message-ID: <320fb6e01003220507p416b831v571fae654c19bed7@mail.gmail.com>

On Mon, Mar 22, 2010 at 11:44 AM, Anne Pajon wrote:
> My genome has a single N character at this point.
>

OK - then the feature should be length one, describing this single
base region. i.e. Using python counting, start+1 == end

>
> Here is the code I use to insert these gaps:
>
>    # Add FT gap
>    seq = record.seq
>    in_N = False
>    gap_features = []
>    for i in range(len(seq)):
>        if seq[i] == 'N' and not in_N:
>            start_N = i
>            in_N = True
>        if in_N and not seq[i+1] == 'N':
>            end_N = i
>            if start_N == end_N:
>                log.warning("gap of size 1 %s..%s" % (start_N, end_N))
>            length = (end_N - start_N) + 1
>            gap_feature = SeqFeature(FeatureLocation(start_N,end_N+1),
> strand=1, type="gap")
>            gap_feature.qualifiers['estimated_length'] = [length]
>            gap_features.append(gap_feature)
>            in_N = False
>
> What should I do to make it work with (unmodified) Biopython EMBL output?
> Thanks in advance for your help.
>
> Regards,
> Anne.

I think you have some out by one counting there (resulting in features
of length one shorter than they should have been). How does this self
contained example look?

from Bio.Alphabet import generic_dna
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation
seq = Seq("ANANNANNNANNNNNA", generic_dna)
record = SeqRecord(seq, id="Test")
print "Finding stretches of N in this:"
print seq
# TODO - Cope with a sequence which ends with N
assert seq[-1] != "N", "FIXME - seq ends with N"
in_N = False
for i in range(len(seq)):
    if seq[i] == 'N' and not in_N:
        start_N = i
        in_N = True
    if in_N and not seq[i+1] == 'N':
        end_N = i+1
        length = end_N - start_N
        assert length > 0
        assert str(seq[start_N:end_N]) == "N"*length
        print "."*start_N + seq[start_N:end_N] + "."*(len(seq)-end_N),
        print "gap of size %i, Python slicing %s:%s" % (length, start_N, end_N)
        gap_feature = SeqFeature(FeatureLocation(start_N,end_N),
                                 strand=1, type="gap")
        gap_feature.qualifiers['estimated_length'] = [length]
        record.features.append(gap_feature)
        in_N = False
print
print record.format("embl")

And the output, which looks fine to me (this is more readable if your
email client uses a fixed width font):

Finding stretches of N in this:
ANANNANNNANNNNNA
.N.............. gap of size 1, Python slicing 1:2
...NN........... gap of size 2, Python slicing 3:5
......NNN....... gap of size 3, Python slicing 6:9
..........NNNNN. gap of size 5, Python slicing 10:15

ID   Test; ; ; DNA; ; UNC; 16 BP.
XX
AC   Test;
XX
DE   .
XX
OS   .
OC   .
XX
FH   Key             Location/Qualifiers
FT   gap             2..2
FT                   /estimated_length=1
FT   gap             4..5
FT                   /estimated_length=2
FT   gap             7..9
FT                   /estimated_length=3
FT   gap             11..15
FT                   /estimated_length=5
SQ   Sequence 16 BP; 5 A; 0 C; 0 G; 0 T; 11 other;
     ANANNANNNA NNNNNA                                                        16
//

Regards,

Peter

From ap12 at sanger.ac.uk Mon Mar 22 14:52:53 2010
From: ap12 at sanger.ac.uk (Anne Pajon)
Date: Mon, 22 Mar 2010 14:52:53 +0000
Subject: [Biopython] zero-length feature
In-Reply-To: <320fb6e01003220507p416b831v571fae654c19bed7@mail.gmail.com>
References: <320fb6e01003220231u785b019dwf79eef7dfaeb8a67@mail.gmail.com>
    <85DB92FE-61E3-4A37-B052-6BF6D40ADCEC@sanger.ac.uk>
    <320fb6e01003220437w2f7b4d4ra9cfab1b66af2418@mail.gmail.com>
    <320fb6e01003220507p416b831v571fae654c19bed7@mail.gmail.com>
Message-ID: <799F5C22-4C09-4E8C-ADBE-88CC194B9515@sanger.ac.uk>

Brilliant! Thanks.

Regards,
Anne.

From cjfields at illinois.edu Mon Mar 22 15:01:48 2010
From: cjfields at illinois.edu (Chris Fields)
Date: Mon, 22 Mar 2010 10:01:48 -0500
Subject: [Biopython] zero-length feature
In-Reply-To: <799F5C22-4C09-4E8C-ADBE-88CC194B9515@sanger.ac.uk>
References: <320fb6e01003220231u785b019dwf79eef7dfaeb8a67@mail.gmail.com>
    <85DB92FE-61E3-4A37-B052-6BF6D40ADCEC@sanger.ac.uk>
    <320fb6e01003220437w2f7b4d4ra9cfab1b66af2418@mail.gmail.com>
    <320fb6e01003220507p416b831v571fae654c19bed7@mail.gmail.com>
    <799F5C22-4C09-4E8C-ADBE-88CC194B9515@sanger.ac.uk>
Message-ID: <79EA9A47-2AFD-41B2-9A41-EEBD604B346A@illinois.edu>

All,

Just to make sure I'm sanity-checked here, the GenBank/EMBL/DDBJ
location specs indicate a location of one nucleotide (inclusive) in
length is to be characterized as one number, not a range at all:

http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#3.5.3

2..2

should just be:

2

Or, did I miss something in the discussion?

chris

From biopython at maubp.freeserve.co.uk Mon Mar 22 15:38:42 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 22 Mar 2010 15:38:42 +0000
Subject: [Biopython] zero-length feature
In-Reply-To: <79EA9A47-2AFD-41B2-9A41-EEBD604B346A@illinois.edu>
References: <320fb6e01003220231u785b019dwf79eef7dfaeb8a67@mail.gmail.com>
    <85DB92FE-61E3-4A37-B052-6BF6D40ADCEC@sanger.ac.uk>
    <320fb6e01003220437w2f7b4d4ra9cfab1b66af2418@mail.gmail.com>
    <320fb6e01003220507p416b831v571fae654c19bed7@mail.gmail.com>
    <799F5C22-4C09-4E8C-ADBE-88CC194B9515@sanger.ac.uk>
    <79EA9A47-2AFD-41B2-9A41-EEBD604B346A@illinois.edu>
Message-ID: <320fb6e01003220838x685f6815v7f2ba40983d7316a@mail.gmail.com>

On Mon, Mar 22, 2010 at 3:01 PM, Chris Fields wrote:
>
> All,
>
> Just to make sure I'm sanity-checked here, the GenBank/EMBL/DDBJ
> location specs indicate a location of one nucleotide (inclusive) in length
> is to be characterized as one number, not a range at all:
>
> http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#3.5.3
>
> 2..2
>
> should just be:
>
> 2
>
> Or, did I miss something in the discussion?
>
> chris

On the face of it, I think you are right Chris. Good point.

Peter

From biopython at maubp.freeserve.co.uk Tue Mar 23 12:43:57 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 23 Mar 2010 12:43:57 +0000
Subject: [Biopython] zero-length feature
In-Reply-To: <320fb6e01003220838x685f6815v7f2ba40983d7316a@mail.gmail.com>
References: <320fb6e01003220231u785b019dwf79eef7dfaeb8a67@mail.gmail.com>
    <85DB92FE-61E3-4A37-B052-6BF6D40ADCEC@sanger.ac.uk>
    <320fb6e01003220437w2f7b4d4ra9cfab1b66af2418@mail.gmail.com>
    <320fb6e01003220507p416b831v571fae654c19bed7@mail.gmail.com>
    <799F5C22-4C09-4E8C-ADBE-88CC194B9515@sanger.ac.uk>
    <79EA9A47-2AFD-41B2-9A41-EEBD604B346A@illinois.edu>
    <320fb6e01003220838x685f6815v7f2ba40983d7316a@mail.gmail.com>
Message-ID: <320fb6e01003230543u6f531f44w634f27b4498db937@mail.gmail.com>

On Mon, Mar 22, 2010 at 3:38 PM, Peter wrote:
> On Mon, Mar 22, 2010 at 3:01 PM, Chris Fields wrote:
>>
>> All,
>>
>> Just to make sure I'm sanity-checked here, the GenBank/EMBL/DDBJ
>> location specs indicate a location of one nucleotide (inclusive) in length
>> is to be characterized as one number, not a range at all:
>>
>> http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html#3.5.3
>>
>> 2..2
>>
>> should just be:
>>
>> 2
>>
>> Or, did I miss something in the discussion?
>>
>> chris
>
> On the face of it, I think you are right Chris. Good point.
>
> Peter
>

Hi again,

I've updated the trunk to handle single letter features like that.
This means the output of the example script I showed earlier is now:

ID   Test; ; ; DNA; ; UNC; 16 BP.
XX
AC   Test;
XX
DE   .
XX
OS   .
OC   .
XX
FH   Key             Location/Qualifiers
FT   gap             2
FT                   /estimated_length=1
FT   gap             4..5
FT                   /estimated_length=2
FT   gap             7..9
FT                   /estimated_length=3
FT   gap             11..15
FT                   /estimated_length=5
SQ   Sequence 16 BP; 5 A; 0 C; 0 G; 0 T; 11 other;
     ANANNANNNA NNNNNA                                                        16
//

Note the single gap feature now has a location "2" not "2..2"

Thanks Chris,

Peter

From biopython at maubp.freeserve.co.uk Wed Mar 24 14:58:07 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 24 Mar 2010 14:58:07 +0000
Subject: [Biopython] RNA Secondary structure
In-Reply-To: <320fb6e01003220227u19be07c8p6f19c1d28ad37bb2@mail.gmail.com>
References: <12DC2C15-ABE8-4E65-BB01-45A7388701DC@mail.med.upenn.edu>
    <320fb6e01003220227u19be07c8p6f19c1d28ad37bb2@mail.gmail.com>
Message-ID: <320fb6e01003240758h36a5eb3v91faa70faf8e0f@mail.gmail.com>

On Mon, Mar 22, 2010 at 9:27 AM, Peter wrote:
> On Sat, Mar 20, 2010 at 2:53 AM, Gregory Barendt
> wrote:
>> Does anyone know of good libraries for looking at RNA secondary
>> structure? I'm looking for particular stem loops in particular locations
>> in lots (hundreds of thousands) of sequences.
>>
>> Right now, I'm pretty inelegantly parsing the .ct file generated by
>> UNAfold. I need to modify my search to be a little more flexible, so
>> I'd much rather use an existing tool than continue to reinvent the
>> wheel. Any advice would be greatly appreciated.
>>
>> Thanks,
>> Greg
>
> I think Kristian Rother was looking at RNA support in Biopython last
> year (CC'd).
>

Hi again Greg,

In case you are not also on the dev mailing list, you might be
interested to look at Kristian's code.
If you could help out with testing/feedback
that would be great:

http://lists.open-bio.org/pipermail/biopython-dev/2010-March/007482.html

Peter

From richard_w_g_price at academia.edu Fri Mar 26 01:33:12 2010
From: richard_w_g_price at academia.edu (Richard Price)
Date: Thu, 25 Mar 2010 18:33:12 -0700
Subject: [Biopython] Recent Activity of the 15 Biopython members on Academia.edu
Message-ID:

Dear Biopython members,

We just wanted to let you know about some recent activity on the
Biopython group on Academia.edu.

In the Biopython group on Academia.edu, there are now:

- 15 people (10 in the last month)
- 1 paper

Biopython members' pages have been viewed a total of 1,801 times, and
their papers have been viewed a total of 4 times.

To see these people, papers and status updates, follow the link below:

http://lists.academia.edu/See-members-of-Biopython

Richard

Dr. Richard Price, post-doc, Philosophy Dept, Oxford University.
Founder of Academia.edu

From rmb32 at cornell.edu Fri Mar 26 17:14:32 2010
From: rmb32 at cornell.edu (Robert Buels)
Date: Fri, 26 Mar 2010 10:14:32 -0700
Subject: [Biopython] Announcing OBF Summer of Code - please forward!
Message-ID: <4BACEB78.3090600@cornell.edu>

Hi all,

Here's an advertising-ready announcement for OBF's Summer of Code,
thanks to Christian Zmasek and Hilmar Lapp for their excellent writing.
Student applications are due April 9! Please spread it widely, we need
to reach lots of students with it!

Rob Buels
OBF GSoC 2010 Admin

============================================================
*** Please disseminate widely at your local institutions ***
*** including posting to message and job boards, so that ***
*** we reach as many students as possible.               ***
============================================================

OPEN BIOINFORMATICS FOUNDATION SUMMER OF CODE 2010

Applications due 19:00 UTC, April 9, 2010.
http://www.open-bio.org/wiki/Google_Summer_of_Code

The Open Bioinformatics Foundation Summer of Code program provides a
unique opportunity for undergraduate, masters, and PhD students to
obtain hands-on experience writing and extending open-source software
for bioinformatics under the mentorship of experienced developers from
around the world. The program is the participation of the Open
Bioinformatics Foundation (OBF) as a mentoring organization in the
Google Summer of Code(tm) (http://code.google.com/soc/).

Students successfully completing the 3 month program receive a $5,000
USD stipend, and may work entirely from their home or home institution.
Participation is open to students from any country in the world except
countries subject to US trade restrictions. Each student will have at
least one dedicated mentor to show them the ropes and help them
complete their project.

The Open Bioinformatics Foundation is particularly seeking students
interested in both bioinformatics (computational biology) and software
development. Some initial project ideas are listed on the website.
These range from Galaxy phylogenetics pipeline development in Biopython
to lightweight sequence objects and lazy parsing in BioPerl, a DAS
Server for large files on local filesystems, and mapping Java libraries
to Perl/Ruby/Python using Biolib+SWIG+JNI. All project ideas are
flexible and many can be adjusted in scope to match the skills of the
student. We also welcome and encourage students proposing their own
project ideas; historically some of the most successful Summer of Code
projects are ones proposed by the students themselves.

TO APPLY: Apply online at the Google Summer of Code website
(http://socghop.appspot.com/), where you will also find GSoC program
rules and eligibility requirements. The 12-day application period for
students runs from Monday, March 29 through Friday, April 9th, 2010.

INQUIRIES: We strongly encourage all interested students to get in
touch with us with their ideas as early on as possible. See the OBF
GSoC page for contact details.

2010 OBF Summer of Code:
http://www.open-bio.org/wiki/Google_Summer_of_Code

Google Summer of Code FAQ:
http://socghop.appspot.com/document/show/program/google/gsoc2010/faqs

From chintal at iitk.ac.in Sat Mar 27 00:36:57 2010
From: chintal at iitk.ac.in (Chintalagiri Shashank)
Date: Sat, 27 Mar 2010 06:06:57 +0530
Subject: [Biopython] Introduction
Message-ID: <201003270606.59717.chintal@iitk.ac.in>

Hello,

I'm an undergraduate student of Physics, from the Indian Institute of
Technology, Kanpur, and am interested in applying to BioPython for this
year's Google Summer of Code.

My interest in biology in the context of my Major is a somewhat complex
and long-winded explanation, the basis of which is that for the last
couple of years I've been seriously looking into biology (specifically,
Structural Biology within the context of elements from Bioinformatics
and certain other fields) as a potentially interesting field of study,
and have been doing courses about the same.

I was initially toying with the idea of attempting to write a sequence
analysis 'framework' of sorts, where I could have the scaffolding to
play around with simple algorithms for structure prediction. In
retrospect, I should have made a more thorough search which should have
led to OBF and BioPython, but as it is the idea went into cold storage
due to certain other pressing constraints on my time, specifically a
time-bound institute project that was behind on its schedule.

I found OBF soon after the initial GSoC organizations announcement, and
have since been looking over various pieces of documentation on it. I
did look at the bugtracker as well, as was suggested on list, but it
seemed to me that a lot of the bugs listed there were patches awaiting
review.
I do intend to take another look at the list and see if there is
anything I can do there, but I decided that I shouldn't wait any longer
before introducing myself formally on the list.

I'm interested in working on BioPython/PyCogent interop, because I see
a lot of potential in tying the two toolkits together and doing so
before more wheels are reinvented. The ability to look at evolutionary
effects and structural effects simultaneously could be quite
interesting. To be fair, I must note here that while I am quite at home
with Python and have a working understanding of the elements that make
up BioPython, I have no production experience with either toolkit, and
do not have a theoretical understanding of the evolutionary algorithms
behind pyCogent. However, I am confident that I will be able to pick up
the necessary skills over time, at least to a degree necessary to make
interoperability possible.

I also have a couple of ideas in mind for BioPython projects, which
really aren't well fleshed out yet. I'll think about them,
specifically, their need and feasibility, and send the details to the
list in a few days.

Please do let me know if you would like any more information in the
meanwhile. I've been on the mailing list for a couple of weeks now, so
you can just reply on-list unless there is a need for off-list
communication.

Regards
Chintalagiri Shashank
chintal at iitk.ac.in

From chapmanb at 50mail.com Sat Mar 27 12:36:53 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Sat, 27 Mar 2010 08:36:53 -0400
Subject: [Biopython] Introduction
In-Reply-To: <201003270606.59717.chintal@iitk.ac.in>
References: <201003270606.59717.chintal@iitk.ac.in>
Message-ID: <20100327123653.GA1959@kunkel>

Chintalagiri;
Thanks for the e-mail and introduction. It's great to have you
interested in Biopython and GSoC.
The path you took to Biopython definitely echoes the experience of
lots of us; first you try building everything yourself and then
realize: there must be some code frameworks out there that make this
easier.

> I'm interested in working on BioPython/PyCogent interop, because I see a lot
> of potential in tying the two toolkits together and doing so before more
> wheels are reinvented. The ability to look at evolutionary effects and
> structural effects simultaneously could be quite interesting.
[...]
> I also have a couple of ideas in mind for BioPython projects, which really
> aren't well fleshed out yet. I'll think about them, specifically, their need
> and feasibility, and send the details to the list in a few days.

Great, it sounds like you've already given this a bit of thought.
You're welcome to either build off of the Biopython/PyCogent project or
develop one of your own ideas into a proposal. Either way, the first
step is to start putting together your project proposal and sharing it
with us (Google Docs is a good option) so we can offer specific
feedback on the programming and science part of things.

We can work on the proposals up until Friday, April 9th. If you haven't
already it's worth taking a look at the GSoC timeline for all the major
dates:

http://socghop.appspot.com/document/show/gsoc_program/google/gsoc2010/timeline

Generally, the proposal should contain:

- A high level overview of what you hope to accomplish during the
  summer.
- A week by week action plan for work to be done, including specific
  deliverables. This should be the bulk of the proposal.
- A short section with relevant background and experience.

We can work on this iteratively until the cutoff, and will be able to
offer more specific feedback as we get an idea of your interests and
directions.

It would also be really useful to provide pointers to any open source
code we could look at.
If you don't have anything online now, uploading some relevant scripts
to a GitHub or Bitbucket repository is a good start. Demonstrating bug
fixing ability, as you mentioned, is also a helpful way to show off
your programming skills to mentors.

Thanks again. Looking forward to working on the proposal with you,
Brad

From biopyuser at gmail.com Mon Mar 29 05:16:37 2010
From: biopyuser at gmail.com (Biopython User)
Date: Sun, 28 Mar 2010 22:16:37 -0700
Subject: [Biopython] KDTree with multidimensional radius?
Message-ID:

Hi all -

New to K-D Trees and biopython, and have a question regarding the
feasibility of this setup:

Is it possible to create a 3-D tree of (X,Y,T=time) and do a search
(node count) with a 2-D "radius" of (d,t) where d is the Cartesian
distance from a center point (x,y), and t is a temporal distance only
on the T=time axis?

The problem class I'm trying to solve is as follows: Given a set of
nodes (possibly as many as 10 million) in (X,Y,T), find all groups
where the group is defined by a central node (x,y,t) and N or more
nodes within d distance and t time from that center.

I've come to the conclusion that I can do this in a two-step process:
that is, first search() on a 2-D (X,Y) tree, and then, for each of the
arrays produced, do a 1-D (T) search - but given that the tree creation
cost is high, this is potentially very inefficient, and I'm hoping
there's a better way.

Ideas/feedback/other options greatly appreciated.

Kurt.

From crosvera at gmail.com Wed Mar 31 22:39:06 2010
From: crosvera at gmail.com (Carlos Ríos Vera)
Date: Wed, 31 Mar 2010 18:39:06 -0400
Subject: [Biopython] PDB-Tidy proposal
Message-ID:

Dear Biopythoners,

I'm Carlos Ríos, a student from Chile. As some of you may know, I'm
very interested in applying to the Google Summer of Code with the
PDB-Tidy idea. So, I wrote a draft that is supposed to be my proposal.
I'm open to receiving any comment, feedback, disagreement...
Here is the link of the draft:
http://github.com/crosvera/pdbtidy_proposal/blob/master/proposal

Regards.

PS: Sorry if my English is not so good.

--
http://crosvera.blogspot.com

Carlos Ríos V.
Student, Ing. (E) en Computación e Informática.
Universidad del Bío-Bío
VIII Región, Chile

Linux user number 425502
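
Regarding the "KDTree with multidimensional radius?" question above:
the search region described there is a cylinder (planar distance at
most d, time difference at most t) rather than a sphere. A minimal
sketch of one way to count neighbours in such a region follows. It is
an illustration under stated assumptions, not a reply from the list and
not Biopython's KDTree API: it swaps the K-D tree for a plain uniform
grid built with the standard library, the function name
find_group_centers and its parameters are hypothetical, and "N or more
nodes" is assumed to exclude the central node itself.

```python
from collections import defaultdict
from math import floor, hypot


def find_group_centers(points, d, t, min_neighbors):
    """Return indices of points that have at least `min_neighbors` other
    points within planar distance `d` and time difference `t`.

    `points` is a sequence of (x, y, time) tuples; `d` and `t` must be > 0.
    Hypothetical helper: a uniform grid stands in for a K-D tree here.
    """
    # Bucket every point into a cell of size d x d x t. Any neighbour that
    # satisfies both limits must then sit in one of the 27 surrounding cells.
    grid = defaultdict(list)
    for idx, (x, y, ts) in enumerate(points):
        grid[(floor(x / d), floor(y / d), floor(ts / t))].append(idx)

    centers = []
    for idx, (x, y, ts) in enumerate(points):
        cx, cy, ct = floor(x / d), floor(y / d), floor(ts / t)
        count = 0
        for i in (cx - 1, cx, cx + 1):
            for j in (cy - 1, cy, cy + 1):
                for k in (ct - 1, ct, ct + 1):
                    # .get() avoids creating empty cells while we iterate.
                    for other in grid.get((i, j, k), ()):
                        if other == idx:
                            continue  # assumption: the center is not counted
                        ox, oy, ots = points[other]
                        # Exact cylinder test: time window and planar radius.
                        if abs(ots - ts) <= t and hypot(ox - x, oy - y) <= d:
                            count += 1
        if count >= min_neighbors:
            centers.append(idx)
    return centers


# Three points close in space and time, one far away in space,
# and one far away in time.
example = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0), (0.0, 1.0, 2.0),
           (10.0, 10.0, 0.0), (0.5, 0.5, 50.0)]
print(find_group_centers(example, d=2.0, t=5.0, min_neighbors=2))  # [0, 1, 2]
```

Because each cell measures d by d by t, both limits are checked in a
single pass and no separate 1-D search on T is needed. For heavily
clustered data the cells fill up; in that case rescaling T by d/t,
running an ordinary 3-D radius search with radius d*sqrt(2) in a real
K-D tree, and then applying the same exact filter may be worth trying.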