From alaguraj.v at gmail.com Sat Jan 2 02:28:31 2010 From: alaguraj.v at gmail.com (Alaguraj Veluchamy) Date: Sat, 2 Jan 2010 12:58:31 +0530 Subject: [Biopython] PSI-BLAST help Message-ID: Dear all, I have a problem in database search using PSI-BLAST. I have to do PSI-BLAST against combined "nr" and "environmental sequences(env_nr)" databases. I need to iterate 10 rounds. Web services allow selecting one database at a time. Do Biopython offers search against multiple databases. I am unable to find any simple way to do this. Regards, Alaguraj.V From aboulia at gmail.com Sat Jan 2 05:44:29 2010 From: aboulia at gmail.com (Kevin Lam) Date: Sat, 2 Jan 2010 18:44:29 +0800 Subject: [Biopython] is there an updated tutorial on how to use the Wrappers for the new NCBI BLAST+ tools?
Message-ID: <5b6410e1001020244s7b702ca8n760352052dba77c9@mail.gmail.com> Hi all finally found biopython Wrappers for the new NCBI BLAST+ tools in Applications.py the question is do I still use NCBIstandalone to use with BLAST+ ? is there a new tutorial for this? Cheers Kevin From stran104 at chapman.edu Sat Jan 2 14:14:53 2010 From: stran104 at chapman.edu (Matthew Strand) Date: Sat, 2 Jan 2010 11:14:53 -0800 Subject: [Biopython] is there an updated tutorial on how to use the Wrappers for the new NCBI BLAST+ tools? In-Reply-To: <5b6410e1001020244s7b702ca8n760352052dba77c9@mail.gmail.com> References: <5b6410e1001020244s7b702ca8n760352052dba77c9@mail.gmail.com> Message-ID: <2a63cc351001021114o3e02122ak686bcbfb031bcc14@mail.gmail.com> I'm no expert here but unfortunately you'll probably have to build your own database to do that. It's not biopython's fault since it just wraps PSI-BLAST and as far as I know PSI-BLAST is only made to search against one database. Perhaps someone else will have a different solution. On Sat, Jan 2, 2010 at 2:44 AM, Kevin Lam wrote: > Hi all > finally found > biopython Wrappers for the new NCBI BLAST+ tools in Applications.py > > the question is do I still use NCBIstandalone to use with BLAST+ ? > > is there a new tutorial for this? > > > Cheers > Kevin > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- Matthew Strand stran104 at chapman.edu phone: (626) 524-4449 skype: matstrand From stran104 at chapman.edu Sat Jan 2 14:16:08 2010 From: stran104 at chapman.edu (Matthew Strand) Date: Sat, 2 Jan 2010 11:16:08 -0800 Subject: [Biopython] is there an updated tutorial on how to use the Wrappers for the new NCBI BLAST+ tools? 
In-Reply-To: <2a63cc351001021114o3e02122ak686bcbfb031bcc14@mail.gmail.com> References: <5b6410e1001020244s7b702ca8n760352052dba77c9@mail.gmail.com> <2a63cc351001021114o3e02122ak686bcbfb031bcc14@mail.gmail.com> Message-ID: <2a63cc351001021116x3501deebwde613bb8d4b110b9@mail.gmail.com> Shoot, I replied to the wrong thread. Ignore that response, it was supposed to go to a PSI-BLAST question. Sorry. On Sat, Jan 2, 2010 at 11:14 AM, Matthew Strand wrote: > I'm no expert here but unfortunately you'll probably have to build your own > database to do that. It's not biopython's fault since it just wraps > PSI-BLAST and as far as I know PSI-BLAST is only made to search against one > database. Perhaps someone else will have a different solution. > > > On Sat, Jan 2, 2010 at 2:44 AM, Kevin Lam wrote: > >> Hi all >> finally found >> biopython Wrappers for the new NCBI BLAST+ tools in Applications.py >> >> the question is do I still use NCBIstandalone to use with BLAST+ ? >> >> is there a new tutorial for this? >> >> >> Cheers >> Kevin >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> > > > > -- > Matthew Strand > stran104 at chapman.edu > phone: (626) 524-4449 > skype: matstrand > -- Matthew Strand stran104 at chapman.edu phone: (626) 524-4449 skype: matstrand From stran104 at chapman.edu Sat Jan 2 14:17:07 2010 From: stran104 at chapman.edu (Matthew Strand) Date: Sat, 2 Jan 2010 11:17:07 -0800 Subject: [Biopython] PSI-BLAST help In-Reply-To: References: Message-ID: <2a63cc351001021117y3d5fe4b7pb6635d592bb9809b@mail.gmail.com> I'm no expert here but unfortunately you'll probably have to build your own database to do that. It's not biopython's fault since it just wraps PSI-BLAST and as far as I know PSI-BLAST is only made to search against one database. Perhaps someone else will have a different solution. 
On Fri, Jan 1, 2010 at 11:28 PM, Alaguraj Veluchamy wrote: > Dear all, > I have a problem in database search using PSI-BLAST. > I have to do PSI-BLAST against combined "nr" and "environmental > sequences(env_nr)" databases. > I need to iterate 10 rounds. > Web services allow selecting one database at a time. > > Do Biopython offers search against multiple databases. > I am unable to find any simple way to do this. > > Regards, > Alaguraj.V > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- Matthew Strand stran104 at chapman.edu phone: (626) 524-4449 skype: matstrand From aboulia at gmail.com Sun Jan 3 03:09:26 2010 From: aboulia at gmail.com (Kevin Lam) Date: Sun, 3 Jan 2010 16:09:26 +0800 Subject: [Biopython] is there an updated tutorial on how to use the Wrappers for the new NCBI BLAST+ tools? In-Reply-To: <5b6410e1001020244s7b702ca8n760352052dba77c9@mail.gmail.com> References: <5b6410e1001020244s7b702ca8n760352052dba77c9@mail.gmail.com> Message-ID: <5b6410e1001030009v5420f146k26d723f33c4c5f59@mail.gmail.com> Hmmm found this in the blast+ manual is it possible to integrate this somewhere in biopython ?Cheers Kevin 3.1 For users of NCBI C Toolkit BLAST The easiest way to get started using these command line applications is by means of the legacy_blast.pl PERL script which is bundled along with the BLAST+ applications. To utilize this script, simply prefix it to the invocation of the C toolkit BLAST command line application and append the --path option pointing to the installation directory of the BLAST+ applications. For example, instead of using blastall -i query -d nr -o blast.out use legacy_blast.pl blastall -i query -d nr -o blast.out --path /opt/blast/bin For more details, refer to the section titled Backwards compatibility script . 
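As a rough illustration of how the prefix/append rule quoted above could be driven from Python, here is a minimal sketch. It only builds the argument list; the helper name wrap_legacy_command is hypothetical (it is not part of Biopython or BLAST+), and the paths are the ones from the manual excerpt.

```python
import shlex


def wrap_legacy_command(legacy_command, blast_plus_dir, script="legacy_blast.pl"):
    """Turn a C toolkit BLAST command line into a legacy_blast.pl invocation.

    Hypothetical helper: it prefixes the script name and appends the
    --path option, exactly as described in the BLAST+ manual excerpt.
    """
    return [script] + shlex.split(legacy_command) + ["--path", blast_plus_dir]


if __name__ == "__main__":
    args = wrap_legacy_command("blastall -i query -d nr -o blast.out", "/opt/blast/bin")
    print(" ".join(args))
    # To actually run it (requires BLAST+ and Perl to be installed):
    #   import subprocess
    #   subprocess.call(args)
```

Building a list rather than a single string avoids shell quoting problems when the list is later passed to the subprocess module.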
On Sat, Jan 2, 2010 at 6:44 PM, Kevin Lam wrote:

> Hi all
> finally found
> biopython Wrappers for the new NCBI BLAST+ tools in Applications.py
>
> the question is do I still use NCBIstandalone to use with BLAST+ ?
>
> is there a new tutorial for this?
>
>
> Cheers
> Kevin
>
>

From lueck at ipk-gatersleben.de Sun Jan 3 06:19:50 2010
From: lueck at ipk-gatersleben.de (lueck at ipk-gatersleben.de)
Date: Sun, 3 Jan 2010 12:19:50 +0100
Subject: [Biopython] Needs some NCBI recommendation
Message-ID: <20100103121950.2tzyaglxcq2o0w08@webmail.ipk-gatersleben.de>

Hello and a happy new year!

I'm currently writing a small software, which allows the users to perform a NCBI online BLAST and to download full records from NCBI via PubMed IDs in a batch mode. I would like to limit the BLAST and Download, in order not to abuse NCBI. What would you suggest for an input limitation for BLAST and Sequence Download (using efetch)?

In addition I want to ask, whether it's reasonable to use the efetch in a simple for statement

id_list = ["19304878", "18606172"]
for i in id_list:
    handle = Entrez.efetch(db="nucleotide", id=i, rettype="fasta")
    print handle.read()

since I need only the raw GenBank or FASTA files?

Thanks for any advice!
Stefanie

From mjldehoon at yahoo.com Sun Jan 3 09:44:04 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sun, 3 Jan 2010 06:44:04 -0800 (PST)
Subject: [Biopython] Needs some NCBI recommendation
In-Reply-To: <20100103121950.2tzyaglxcq2o0w08@webmail.ipk-gatersleben.de>
Message-ID: <142471.4083.qm@web62407.mail.re1.yahoo.com>

--- On Sun, 1/3/10, lueck at ipk-gatersleben.de wrote:
> In addition I want to ask, whether it's reasonable to use
> the efetch in a simple for statement
>
> id_list = ["19304878", "18606172"]
> for i in id_list:
>     handle = Entrez.efetch(db="nucleotide", id=i, rettype="fasta")
>     print handle.read()
>
> since I need only the raw GenBank or FASTA files?

The following needs only one call to efetch:

>>> from Bio import Entrez
>>> Entrez.email = "lueck at ipk-gatersleben.de"
>>> from Bio import SeqIO
>>> handle = Entrez.efetch(db='nucleotide', id="19304878,18606172", rettype='fasta')
>>> records = SeqIO.parse(handle, 'fasta')
>>> for record in records:
...     words = record.id.split("|")
...     i = words[1]
...     output = open(i+".fa", 'w')
...     SeqIO.write([record], output, 'fasta')
...     output.close()

--Michiel

From xuxiang086 at gmail.com Mon Jan 4 03:03:47 2010
From: xuxiang086 at gmail.com (xuxiang086)
Date: Mon, 4 Jan 2010 16:03:47 +0800
Subject: [Biopython] installing biopython1.53 failed
Message-ID: <201001041603446097700@gmail.com>

dear all,

I was trying to install biopython1.53 source on my computer whose OS is Ubuntu according to the installation instructions on biopython wiki, and when I input "python setup.py test" I got failure messages as follows:

======================================================================
FAIL: seqmatchall with pair output piped to stdout.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Emboss.py", line 661, in test_seqtmatchall_piped
    self.assertEqual(align.get_alignment_length(), 9)
AssertionError: 471 != 9
----------------------------------------------------------------------
Ran 140 tests in 102.013 seconds
FAILED (failures = 1)

Could you help me to figure out what's the problem? Thanks.

Sincerely,
Xiang

From chapmanb at 50mail.com Mon Jan 4 07:51:54 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 4 Jan 2010 07:51:54 -0500
Subject: [Biopython] need help! how to retrieve full text from Pubmed central ?
In-Reply-To: <90247fbe0912260654scd2b0ceyb37d54f36a3531fa@mail.gmail.com> References: <90247fbe0912260637n7553bdf7wbce10a627c0a124c@mail.gmail.com> <90247fbe0912260654scd2b0ceyb37d54f36a3531fa@mail.gmail.com> Message-ID: <20100104125154.GE80812@sobchak.mgh.harvard.edu> Ning; > From http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetchlit_help.html, > I can learn: > PubMed Central contains a number of articles classified as "open > access" for which you may download the full text as XML. For the > remaining articles in PMC you may download only the abstracts as XML. > > but when try to > handle=Entrez.efetch(db='pmc',id=idlist,rettype='full',retmode='xml') > record=Entrez.read(handle) > > got following errors: > Traceback (most recent call last): > File "", line 1, in > File "/usr/local/lib/python2.6/dist-packages/Bio/Entrez/__init__.py", > line 258, in read > record = handler.read(handle) > File "/usr/local/lib/python2.6/dist-packages/Bio/Entrez/Parser.py", > line 114, in read > raise CorruptedXMLError > Bio.Entrez.Parser.CorruptedXMLError: Failed to parse the XML data. > Please make sure that the input data are in XML format, and that the > data are not corrupted. > > the python version is 1.53 and my system is ubuntu 9.10. Following your example, doing: from Bio import Entrez Entrez.email = 'yours at blah.com' handle = Entrez.efetch(db='pmc', id=2747014, rettype='full', retmode='xml') handle.read() gives back the full XML text, as you wanted. Your next step, calling Entrez.read, asks Biopython to parse this into a record object. There isn't support in Biopython for this currently, and realistically that probably isn't what you want. If you are pulling down full text like this you are best served parsing the XML directly using something like ElementTree: http://docs.python.org/library/xml.etree.elementtree.html and pulling out the items you are interested in. 
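
A minimal sketch of the ElementTree approach suggested above. The sample XML is a made-up fragment, only loosely modelled on the NLM article format, and the tag names (article-title, p) and the helper name are assumptions for illustration; real PMC output is much larger and may need namespace handling.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for the text returned by
# handle.read() after Entrez.efetch(db='pmc', ..., retmode='xml').
SAMPLE_XML = """<pmc-articleset>
  <article>
    <front><article-meta><title-group>
      <article-title>An example title</article-title>
    </title-group></article-meta></front>
    <body>
      <sec><p>First paragraph.</p><p>Second <italic>styled</italic> paragraph.</p></sec>
    </body>
  </article>
</pmc-articleset>"""


def extract_title_and_paragraphs(xml_text):
    """Return (title, list of paragraph strings) from article XML text."""
    root = ET.fromstring(xml_text)
    title_elem = root.find(".//article-title")
    title = None
    if title_elem is not None:
        # itertext() also collects text inside nested markup such as <italic>.
        title = "".join(title_elem.itertext()).strip()
    paragraphs = ["".join(p.itertext()).strip() for p in root.iter("p")]
    return title, paragraphs


if __name__ == "__main__":
    title, paragraphs = extract_title_and_paragraphs(SAMPLE_XML)
    print(title)
    for text in paragraphs:
        print(text)
```

In practice the string passed in would be the raw XML from efetch rather than SAMPLE_XML.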
Hope this helps, Brad From chapmanb at 50mail.com Mon Jan 4 08:06:11 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 4 Jan 2010 08:06:11 -0500 Subject: [Biopython] installing biopython1.53 failed In-Reply-To: <201001041603446097700@gmail.com> References: <201001041603446097700@gmail.com> Message-ID: <20100104130611.GF80812@sobchak.mgh.harvard.edu> Xiang; > I was trying to install biopython1.53 source on my computer whose OS is Ubuntu according to the installation instructions on biopython wiki, and when I input "python setup.py test" I got failure messages as follows: > > ====================================================================== > FAIL: seqmatchall with pair output piped to stdout. > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "test_Emboss.py", line 661, in test_seqtmatchall_piped > self.assertEqual(align.get_alignment_length(), 9) > AssertionError: 471 != 9 > ---------------------------------------------------------------------- > Ran 140 tests in 102.013 seconds > FAILED (failures = 1) > > Could you help me to figure out what's the problem? Thanks. Biopython appears to be installed okay, and this is an issue with parsing EMBOSS output from the program seqmatchall. If you aren't planning on using EMBOSS, then you can go ahead and use the rest of Biopython without any worries. To figure out the issue, it would be useful to know the version of EMBOSS you are using: % embossversion Writes the current EMBOSS version number to a file 6.0.1 If it's an older one, a simple fix may be to upgrade. 
You should be able to run 'apt-get update emboss' on ubuntu:

http://packages.ubuntu.com/karmic/emboss

Hope this helps,
Brad

From xuxiang086 at gmail.com Mon Jan 4 08:15:39 2010
From: xuxiang086 at gmail.com (xuxiang086)
Date: Mon, 4 Jan 2010 21:15:39 +0800
Subject: [Biopython] installing biopython1.53 failed
References: <201001041603446097700@gmail.com>, <20100104130611.GF80812@sobchak.mgh.harvard.edu>
Message-ID: <201001042115373431286@gmail.com>

Hi Brad,

Thanks for your help.

Xiang
2010-01-04
xuxiang086

From: Brad Chapman
Sent: 2010-01-04 21:06:14
To: xuxiang086
Cc: BioPython
Subject: Re: [Biopython] installing biopython1.53 failed

Xiang;

> I was trying to install biopython1.53 source on my computer whose OS is Ubuntu according to the installation instructions on biopython wiki, and when I input "python setup.py test" I got failure messages as follows:
>
> ======================================================================
> FAIL: seqmatchall with pair output piped to stdout.
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "test_Emboss.py", line 661, in test_seqtmatchall_piped
>     self.assertEqual(align.get_alignment_length(), 9)
> AssertionError: 471 != 9
> ----------------------------------------------------------------------
> Ran 140 tests in 102.013 seconds
> FAILED (failures = 1)
>
> Could you help me to figure out what's the problem? Thanks.

Biopython appears to be installed okay, and this is an issue with parsing EMBOSS output from the program seqmatchall. If you aren't planning on using EMBOSS, then you can go ahead and use the rest of Biopython without any worries.

To figure out the issue, it would be useful to know the version of EMBOSS you are using:

% embossversion
Writes the current EMBOSS version number to a file
6.0.1

If it's an older one, a simple fix may be to upgrade.
You should be able to run 'apt-get update emboss' on ubuntu: http://packages.ubuntu.com/karmic/emboss Hope this helps, Brad From mjldehoon at yahoo.com Mon Jan 4 10:15:57 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Mon, 4 Jan 2010 07:15:57 -0800 (PST) Subject: [Biopython] need help! how to retrieve full text from Pubmed central ? In-Reply-To: <20100104125154.GE80812@sobchak.mgh.harvard.edu> Message-ID: <595436.42697.qm@web62403.mail.re1.yahoo.com> --- On Mon, 1/4/10, Brad Chapman wrote: > Following your example, doing: > > from Bio import Entrez > Entrez.email = 'yours at blah.com' > handle = Entrez.efetch(db='pmc', id=2747014, > rettype='full', retmode='xml') > handle.read() > > gives back the full XML text, as you wanted. Your next > step, calling > Entrez.read, asks Biopython to parse this into a record > object. > There isn't support in Biopython for this currently, This *is* supported by Biopython. In principle, Bio.Entrez can parse any XML generated by NCBI Entrez as long as the corresponding DTDs are available. In this case, the DTD included in Biopython 1.53 is corrupted, causing the error. Unfortunately, the correct DTD relies on a large number of other DTDs, so just replacing the one DTD is not sufficient. Hmm... maybe we should think of a more robust way of getting the DTDs without relying on their inclusion in the Biopython distribution ... --Michiel. From darnells at dnastar.com Mon Jan 4 11:15:38 2010 From: darnells at dnastar.com (Steve Darnell) Date: Mon, 4 Jan 2010 10:15:38 -0600 Subject: [Biopython] PSI-BLAST help In-Reply-To: <2a63cc351001021117y3d5fe4b7pb6635d592bb9809b@mail.gmail.com> References: <2a63cc351001021117y3d5fe4b7pb6635d592bb9809b@mail.gmail.com> Message-ID: Alaguraj, I am assuming you have already downloaded the nr and env_nr databases. You can create an alias database file that will tie individual databases together to form a larger virtual database. 

http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/formatdb_fastacmd.html#4.1.6

I have not personally used this approach, so I cannot offer more guidance than this. However, since biopython only provides a wrapper for the NCBI command line tools, I would expect this approach would work well with biopython scripting.

Regards,
Steve

--
Steve Darnell
DNASTAR, Inc.
Madison, WI USA

-----Original Message-----
From: biopython-bounces at lists.open-bio.org [mailto:biopython-bounces at lists.open-bio.org] On Behalf Of Matthew Strand
Sent: Saturday, January 02, 2010 1:17 PM
To: Alaguraj Veluchamy
Cc: biopython at lists.open-bio.org
Subject: Re: [Biopython] PSI-BLAST help

I'm no expert here but unfortunately you'll probably have to build your own database to do that. It's not biopython's fault since it just wraps PSI-BLAST and as far as I know PSI-BLAST is only made to search against one database. Perhaps someone else will have a different solution.

On Fri, Jan 1, 2010 at 11:28 PM, Alaguraj Veluchamy wrote:

> Dear all,
> I have a problem in database search using PSI-BLAST.
> I have to do PSI-BLAST against combined "nr" and "environmental
> sequences(env_nr)" databases.
> I need to iterate 10 rounds.
> Web services allow selecting one database at a time.
>
> Do Biopython offers search against multiple databases.
> I am unable to find any simple way to do this.
>
> Regards,
> Alaguraj.V

From biopython at maubp.freeserve.co.uk Tue Jan 5 05:47:38 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 5 Jan 2010 10:47:38 +0000
Subject: [Biopython] is there an updated tutorial on how to use the Wrappers for the new NCBI BLAST+ tools?
In-Reply-To: <5b6410e1001020244s7b702ca8n760352052dba77c9@mail.gmail.com>
References: <5b6410e1001020244s7b702ca8n760352052dba77c9@mail.gmail.com>
Message-ID: <320fb6e01001050247t29a1a57idd2fd22400e9c54a@mail.gmail.com>

On Sat, Jan 2, 2010 at 10:44 AM, Kevin Lam wrote:
> Hi all
> finally found
> biopython Wrappers for the new NCBI BLAST+ tools in Applications.py
>
> the question is do I still use NCBIstandalone to use with BLAST+ ?
>

No, use Bio.Blast.Applications with the subprocess module.

>
> is there a new tutorial for this?
>

Did you check the current Tutorial (as shipped with Biopython 1.53)?
http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf

There are wrappers for the new NCBI BLAST+ tools in the Bio.Blast.Applications module (recommended for future use).

There are wrappers for the "legacy" NCBI BLAST tools in the Bio.Blast.Applications module (along with the new BLAST+ wrappers), and the old rather inflexible "helper functions" in Bio.Blast.NCBIStandalone. These are all effectively obsolete (since the NCBI is phasing out the "legacy" BLAST tools), and will be deprecated in a future release of Biopython. This is in the DEPRECATED file, and the module docstrings.

Obviously the documentation wasn't as clear as it could have been (and the bit in the tutorial is a little short). Where did you look and can you make any suggestions for clarification or improvement?

Thanks,

Peter

From biopython at maubp.freeserve.co.uk Tue Jan 5 06:33:26 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 5 Jan 2010 11:33:26 +0000
Subject: [Biopython] need help! how to retrieve full text from Pubmed central ?
In-Reply-To: <90247fbe0912260637n7553bdf7wbce10a627c0a124c@mail.gmail.com>
References: <90247fbe0912260637n7553bdf7wbce10a627c0a124c@mail.gmail.com>
Message-ID: <320fb6e01001050333w3ca52399u565177c4d80a4724@mail.gmail.com>

On Sat, Dec 26, 2009 at 2:37 PM, ning luwen wrote:
> Dear everyone,
>    I need to download full text from Pubmed central. After see the
> Entrez manual, maybe Entrez(not the web interface) doesn't give a way
> to download .pdf full text file, is this true?
>

According to the EFetch help, for PMC you can only retrieve XML (although this does seem to give the full text):
http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetchlit_help.html

I had a look at the ELink documentation, and don't see any way to use it to get a PDF link (e.g. to the publisher's site). You could use the DOI, but that doesn't allow control over HTML vs PDF. I think you should email the Entrez support team for advice (and if you find out more, please let us know).

From playing with the PMC website, I eventually found a URL which will work to get a PDF file, both in my browser and via the command line tool wget:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2682512/pdf

However, it seems the default Python urllib useragent is blocked for some reason. A quick search online shows one way to over-ride the user-agent in Python, and if we pretend to be the Firefox browser this now works:

url = "http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2682512/pdf"
filename = "PMC2682512.pdf"
from urllib import FancyURLopener
class FakeMozilla(FancyURLopener):
    version = "Mozilla/5.0 (Windows; U; Windows NT 5.2; rv:1.9.2) Gecko/20100101 Firefox/3.6"
FakeMozilla().retrieve(url, filename)

So, while that does seem to work, it is *NOT* endorsed by the NCBI. If you just want to download a few files, it may do the trick, but I do think you should email the Entrez support team for advice on how this *should* be done.

Regards,

Peter

From biopython at maubp.freeserve.co.uk Tue Jan 5 06:46:34 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 5 Jan 2010 11:46:34 +0000
Subject: [Biopython] need help! how to retrieve full text from Pubmed central ?
In-Reply-To: <595436.42697.qm@web62403.mail.re1.yahoo.com> References: <20100104125154.GE80812@sobchak.mgh.harvard.edu> <595436.42697.qm@web62403.mail.re1.yahoo.com> Message-ID: <320fb6e01001050346l23a8920am4c5192831105371b@mail.gmail.com> On Mon, Jan 4, 2010 at 3:15 PM, Michiel de Hoon wrote: > > This *is* supported by Biopython. In principle, Bio.Entrez can parse any > XML generated by NCBI Entrez as long as the corresponding DTDs are > available. In this case, the DTD included in Biopython 1.53 is corrupted, > causing the error. Unfortunately, the correct DTD relies on a large number > of other DTDs, so just replacing the one DTD is not sufficient. > > Hmm... maybe we should think of a more robust way of getting the DTDs > without relying on their inclusion in the Biopython distribution ... Which DTD has a problem? I was aware an elink DTD was *missing* in Biopython 1.53 (adding in git), but not of any corrupted DTD files. In this particular example, it is the NCBI that have a problem - they are returning invalid XML which (understandably) our parser is rejecting. It could just be they haven't kept the XML output and the public DTD files in sync. For example, consider this Entrez URL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=2747014&retmode=xml According to both these validators this is not a valid XML file! http://www.validome.org/xml/validate/ http://validator.w3.org/ In Biopython when we try and parse this exact URL: >>> from Bio import Entrez >>> import urllib >>> record = Entrez.read(urllib.urlopen("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=2747014&retmode=xml")) Traceback (most recent call last): ... Bio.Entrez.Parser.CorruptedXMLError: Failed to parse the XML data. Please make sure that the input data are in XML format, and that the data are not corrupted. 
You get the same error using the Bio.Entrez.efetch function which will use an equivalent URL (but with the tool and email set): >>> from Bio import Entrez >>> Entrez.email = "your.name.here at example.com" >>> record = Entrez.read(Entrez.efetch(db="pmc", id="2747014", retmode="xml")) Traceback (most recent call last): ... Bio.Entrez.Parser.CorruptedXMLError: Failed to parse the XML data. Please make sure that the input data are in XML format, and that the data are not corrupted. Peter From mjldehoon at yahoo.com Tue Jan 5 07:17:33 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 5 Jan 2010 04:17:33 -0800 (PST) Subject: [Biopython] need help! how to retrieve full text from Pubmed central ? In-Reply-To: <320fb6e01001050346l23a8920am4c5192831105371b@mail.gmail.com> Message-ID: <95712.11972.qm@web62408.mail.re1.yahoo.com> There are multiple issues here. First, the CorruptedXMLError is caused by the corrupted DTD (which I fixed by now on github). Basically, the corrupted DTD inserts some gibberish into the XML, which is then no longer valid. If you replace the corrupted DTD by the correct one, the CorruptedXMLError goes away. But you'll find that a bunch of other DTDs are missing (these have now been uploaded to github). With the complete set of DTDs, you run into a new error: One of the tags in the XML file is not listed anywhere in any of the DTDs. This is probably the reason the XML validators show that it's not valid XML. I've notified NCBI that the XML output is not consistent with the DTDs for this case. --Michiel --- On Tue, 1/5/10, Peter wrote: > From: Peter > Subject: Re: [Biopython] need help! how to retrieve full text from Pubmed central ? > To: "Michiel de Hoon" > Cc: biopython at lists.open-bio.org, "Brad Chapman" > Date: Tuesday, January 5, 2010, 6:46 AM > On Mon, Jan 4, 2010 at 3:15 PM, > Michiel de Hoon > wrote: > > > > This *is* supported by Biopython. 
In principle, > Bio.Entrez can parse any > > XML generated by NCBI Entrez as long as the > corresponding DTDs are > > available. In this case, the DTD included in Biopython > 1.53 is corrupted, > > causing the error. Unfortunately, the correct DTD > relies on a large number > > of other DTDs, so just replacing the one DTD is not > sufficient. > > > > Hmm... maybe we should think of a more robust way of > getting the DTDs > > without relying on their inclusion in the Biopython > distribution ... > > Which DTD has a problem? I was aware an elink DTD was > *missing* in > Biopython 1.53 (adding in git), but not of any corrupted > DTD files. > > In this particular example, it is the NCBI that have a > problem - they are > returning invalid XML which (understandably) our parser is > rejecting. > It could just be they haven't kept the XML output and the > public DTD > files in sync. > > For example, consider this Entrez URL: > > http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=2747014&retmode=xml > > According to both these validators this is not a valid XML > file! > > http://www.validome.org/xml/validate/ > http://validator.w3.org/ > > In Biopython when we try and parse this exact URL: > > >>> from Bio import Entrez > >>> import urllib > >>> record = Entrez.read(urllib.urlopen("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=2747014&retmode=xml")) > Traceback (most recent call last): > ... > Bio.Entrez.Parser.CorruptedXMLError: Failed to parse the > XML data. > Please make sure that the input data are in XML format, and > that the > data are not corrupted. > > You get the same error using the Bio.Entrez.efetch function > which > will use an equivalent URL (but with the tool and email > set): > > >>> from Bio import Entrez > >>> Entrez.email = "your.name.here at example.com" > >>> record = Entrez.read(Entrez.efetch(db="pmc", > id="2747014", retmode="xml")) > Traceback (most recent call last): > ... 
> Bio.Entrez.Parser.CorruptedXMLError: Failed to parse the > XML data. > Please make sure that the input data are in XML format, and > that the > data are not corrupted. > > Peter > From biopython at maubp.freeserve.co.uk Tue Jan 5 07:42:10 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 5 Jan 2010 12:42:10 +0000 Subject: [Biopython] need help! how to retrieve full text from Pubmed central ? In-Reply-To: <95712.11972.qm@web62408.mail.re1.yahoo.com> References: <320fb6e01001050346l23a8920am4c5192831105371b@mail.gmail.com> <95712.11972.qm@web62408.mail.re1.yahoo.com> Message-ID: <320fb6e01001050442j11bf5959y5380a7fcd42e959e@mail.gmail.com> On Tue, Jan 5, 2010 at 12:17 PM, Michiel de Hoon wrote: > > There are multiple issues here. > > First, the CorruptedXMLError is caused by the corrupted DTD (which I fixed > by now on github). Basically, the corrupted DTD inserts some gibberish into > the XML, which is then no longer valid. If you replace the corrupted DTD by > the correct one, the CorruptedXMLError goes away. I see what you mean, our old copy of nlm-articleset-2.0.dtd was actually an HTML redirect message. Oops. Thanks for sorting out that glitch - my fault. > But you'll find that a bunch of other DTDs are missing (these have now been > uploaded to github). With the complete set of DTDs, you run into a new error: Do you get this: NotImplementedError: The Bio.Entrez parser cannot handle XML data that make use of XML namespaces > One of the tags in the XML file is not listed anywhere in any of the DTDs. > This is probably the reason the XML validators show that it's not valid XML. > I've notified NCBI that the XML output is not consistent with the DTDs for > this case. Excellent - thank you. Peter P.S. Last year (Sept 2009) I reported a similar problem with ELink XML failing to validate when the history was used (while working on the "Searching for citations" example in the tutorial). That seems to be resolved now so I can update the tutorial... 
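
The two failure modes discussed in this thread (a "DTD" that is really an HTML redirect page, and XML that uses namespaces) can both be spotted before handing data to a parser. The following is only a rough sketch using the Python standard library; the function names `looks_like_html` and `uses_namespaces` are made up for illustration and are not part of Bio.Entrez.

```python
import xml.etree.ElementTree as ET


def looks_like_html(text):
    """Heuristic check: does a supposed DTD file start like an HTML page?

    A real DTD normally opens with comments or <!ELEMENT/<!ENTITY
    declarations, whereas a saved redirect page opens with an HTML
    doctype or an <html> tag.
    """
    head = text.lstrip()[:200].lower()
    return head.startswith("<!doctype html") or head.startswith("<html")


def uses_namespaces(xml_text):
    """Return True if any element or attribute name is namespace-qualified.

    ElementTree expands qualified names to the form "{uri}local", so a
    leading "{" marks a namespaced tag or attribute.
    """
    root = ET.fromstring(xml_text)
    for element in root.iter():
        if isinstance(element.tag, str) and element.tag.startswith("{"):
            return True
        if any(key.startswith("{") for key in element.attrib):
            return True
    return False
```

Running `looks_like_html` over each file in a local DTD directory would be one way to catch a bad download early; `uses_namespaces` only inspects documents that ElementTree itself can parse, so it is a pre-check rather than a validator.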
From mjldehoon at yahoo.com Tue Jan 5 09:31:38 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 5 Jan 2010 06:31:38 -0800 (PST)
Subject: [Biopython] need help! how to retrieve full text from Pubmed central ?
In-Reply-To: <320fb6e01001050442j11bf5959y5380a7fcd42e959e@mail.gmail.com>
Message-ID: <179611.33586.qm@web62402.mail.re1.yahoo.com>

--- On Tue, 1/5/10, Peter wrote:
> Do you get this:
> NotImplementedError: The Bio.Entrez parser cannot handle
> XML data that make use of XML namespaces

I get that one too, but that is easy to fix once NCBI's DTD files are corrected.

--Michiel.

From biopython at maubp.freeserve.co.uk Tue Jan 5 12:20:18 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 5 Jan 2010 17:20:18 +0000
Subject: [Biopython] Remove hydrogens...
In-Reply-To: <20091229091838.fnyk66sayos8swww@correo.fenhi.uh.cu>
References: <20091229091838.fnyk66sayos8swww@correo.fenhi.uh.cu>
Message-ID: <320fb6e01001050920r4bdf627cg60e9bb84e004b4ec@mail.gmail.com>

2009/12/29 Yasser Almeida Hernández :
> Hi all...
> How can I remove hydrogen atoms from structure objects?
>
> Thanks
>
> --
> Lic. Yasser Almeida Hernández

Hi,

I would suggest you look at pages 5 and 6 of the Bio.PDB documentation,
the bit on the Select class:
http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf

See also related discussions on the mailing list:
http://lists.open-bio.org/pipermail/biopython/2009-March/005021.html
http://lists.open-bio.org/pipermail/biopython/2009-May/005172.html

Please let us know how you get on. If you would like to contribute to
the project, this seems like an excellent topic for a cookbook entry,
once you've got it working of course ;)
http://biopython.org/wiki/Category:Cookbook

Peter

From pedro.al at fenhi.uh.cu Tue Jan 5 13:50:14 2010
From: pedro.al at fenhi.uh.cu (Yasser Almeida Hernández)
Date: Tue, 05 Jan 2010 13:50:14 -0500
Subject: [Biopython] Save custom structure...
Message-ID: <20100105135014.4ovlg1l4hkwowo04@correo.fenhi.uh.cu>

Hi all...
I know I asked this question before, but I really need your help...
I've selected a residue and an atom and I want to save them as a new
.pdb file. How can I do that?

Thanks

--
Lic. Yasser Almeida Hernández
Center of Molecular Inmunology (CIM)
Nanobiology Group
P.O.Box 16040, Havana, Cuba
Phone: (537) 271-7933, ext. 221

----------------------------------------------------------------
Correo FENHI

From alexl at users.sourceforge.net Tue Jan 5 19:14:43 2010
From: alexl at users.sourceforge.net (Alex Lancaster)
Date: Tue, 05 Jan 2010 19:14:43 -0500
Subject: [Biopython] Fedora packages for 1.53 available (was Re: Biopython 1.53 released)
In-Reply-To: <320fb6e00912150901k138ae04bmc5d5af9c867340ec__41910.6228081093$1260896885$gmane$org@mail.gmail.com> (Peter's message of "Tue, 15 Dec 2009 17:01:38 +0000")
References: <320fb6e00912150901k138ae04bmc5d5af9c867340ec__41910.6228081093$1260896885$gmane$org@mail.gmail.com>
Message-ID:

>>>>> "P" == Peter writes:

P> Dear Biopythoneers, We are pleased to announce the availability of
P> Biopython 1.53, a new stable release of the Biopython library, three
P> months after the release of Biopython 1.52. This is our first release
P> since migrating from CVS to git for source code control.

Hi there,

For all Fedora users, new packages for biopython 1.53 are now available
in the "updates-testing" repository for F-11 and F-12.

To test them out simply run (as root):

yum --enablerepo=updates-testing install python-biopython

Please provide feedback on packages here:

F-11: https://admin.fedoraproject.org/updates/F11/FEDORA-2009-13353
F-12: https://admin.fedoraproject.org/updates/F12/FEDORA-2009-13326

(You can leave feedback anonymously, or using your Fedora account name
if you happen to be a Fedora contributor)

The more positive feedback from testing, the faster the packages can go
into the stable "updates" repo (or conversely if there are any problems,
they can be fixed before being pushed).

Thanks!
Alex

From biopython at maubp.freeserve.co.uk Wed Jan 6 06:35:32 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 6 Jan 2010 11:35:32 +0000
Subject: [Biopython] Save custom structure...
In-Reply-To: <20100105135014.4ovlg1l4hkwowo04@correo.fenhi.uh.cu>
References: <20100105135014.4ovlg1l4hkwowo04@correo.fenhi.uh.cu>
Message-ID: <320fb6e01001060335i16114ab0pc4a183540a29f244@mail.gmail.com>

2010/1/5 Yasser Almeida Hernández :
> Hi all...
> I know I asked this question before, but I really need your help...
> I've selected a residue and an atom and I want to save them
> as a new .pdb file. How can I do that?
>
> Thanks

You need a structure object, and then pass that to PDBIO.
I suggest you do this via a Select class - as in your related
question about removing hydrogen atoms:
http://lists.open-bio.org/pipermail/biopython/2009-December/006028.html
http://lists.open-bio.org/pipermail/biopython/2010-January/006064.html

If that doesn't make sense, then perhaps you could go into
more detail? e.g. tell us which PDB file you are working with,
and show us your code so far. You could use a similar example
if you'd prefer not to talk about the real research topic.
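
For the related hydrogen question, a Select subclass along the following lines might be a starting point. This is a sketch only: the class name `NonHydrogenSelect` is hypothetical, it assumes each atom exposes its element symbol as `atom.element` (as current Biopython releases do), and the small fallback `Select` class exists purely so the snippet loads on a machine without Biopython.

```python
try:
    from Bio.PDB import Select
except ImportError:
    class Select:
        """Minimal stand-in mirroring the default accept-everything hooks."""

        def accept_model(self, model):
            return True

        def accept_chain(self, chain):
            return True

        def accept_residue(self, residue):
            return True

        def accept_atom(self, atom):
            return True


class NonHydrogenSelect(Select):
    """Keep every atom except hydrogens when saving with PDBIO."""

    def accept_atom(self, atom):
        # Assumption: the element symbol is stored in atom.element.
        # Atoms with no element information are kept rather than dropped.
        element = (getattr(atom, "element", "") or "").strip().upper()
        return element != "H"


# Hypothetical usage with Biopython installed (file names are placeholders):
#     from Bio.PDB import PDBParser, PDBIO
#     structure = PDBParser().get_structure("1XYZ", "1XYZ.pdb")
#     io = PDBIO()
#     io.set_structure(structure)
#     io.save("1XYZ-noH.pdb", select=NonHydrogenSelect())
```

Only `accept_atom` is overridden, so models, chains and residues pass through unchanged and the filtering happens at the atom level.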
Peter From ap12 at sanger.ac.uk Wed Jan 6 07:24:15 2010 From: ap12 at sanger.ac.uk (Anne Pajon) Date: Wed, 6 Jan 2010 12:24:15 +0000 Subject: [Biopython] Could Bio.SeqIO write EMBL file? In-Reply-To: References: Message-ID: <47115A52-FC46-48A3-B3DF-EF012EEE520B@sanger.ac.uk> Sorry for the typo, Please read print embl_record.format("embl") instead of print embl_record.format("genbank") I was just testing if it was possible to write in another format. On 6 Jan 2010, at 12:20, Anne Pajon wrote: > Dear, > > I'm reading EMBL file with Bio.SeqIO for adding an extra feature > qualifier to each of the annotations, and would like to write the > modified annotated sequence back to an EMBL file. > > embl_record = SeqIO.read(open("Alistipes_shahii_WAL8301.embl"), > "embl") > addSystematicId(embl_record) > print embl_record.format("genbank") > > While running the above I'm getting this error: > Reading format 'embl' is supported, but not writing > > Is there a way around? I know from the documentation on the wiki > that biopython does not have a writer for EMBL format. Is there a > plan of having one in the future? I volunteer to test it, or if it > does not exist yet I may be able to contribute writing it... thanks > to let me know. > > Kind regards, > Anne. > -- > Dr Anne Pajon - Pathogen Genomics, Team 81 > Sanger Institute, Wellcome Trust Genome Campus, Hinxton > Cambridge CB10 1SA, United Kingdom > +44 (0)1223 494 798 (office) | +44 (0)7958 511 353 (mobile) > -- Dr Anne Pajon - Pathogen Genomics, Team 81 Sanger Institute, Wellcome Trust Genome Campus, Hinxton Cambridge CB10 1SA, United Kingdom +44 (0)1223 494 798 (office) | +44 (0)7958 511 353 (mobile) -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. 
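
Until an EMBL writer is available, one possible stopgap is to try the preferred format and fall back to GenBank when writing is refused. The sketch below is an illustration, not part of Biopython: the helper name `format_with_fallback` is made up, and it assumes the "supported, but not writing" failure surfaces as a `ValueError`, as the error message quoted in this thread suggests.

```python
def format_with_fallback(record, preferred="embl", fallback="genbank"):
    """Return (format_name, text), trying `preferred` before `fallback`.

    `record` is any object with a SeqRecord-style .format(name) method.
    Assumption: an unsupported output format raises ValueError.
    """
    try:
        return preferred, record.format(preferred)
    except ValueError:
        # The preferred writer is not available, so use the fallback format.
        return fallback, record.format(fallback)


# Hypothetical usage with Biopython installed (file name from the thread):
#     from Bio import SeqIO
#     record = SeqIO.read("Alistipes_shahii_WAL8301.embl", "embl")
#     name, text = format_with_fallback(record)
#     print("Wrote record as", name)
```

Returning the format name alongside the text lets the caller choose a matching file extension, and the same call starts producing EMBL without changes once a release can write that format.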
From ap12 at sanger.ac.uk Wed Jan 6 07:20:30 2010 From: ap12 at sanger.ac.uk (Anne Pajon) Date: Wed, 6 Jan 2010 12:20:30 +0000 Subject: [Biopython] Could Bio.SeqIO write EMBL file? Message-ID: Dear, I'm reading EMBL file with Bio.SeqIO for adding an extra feature qualifier to each of the annotations, and would like to write the modified annotated sequence back to an EMBL file. embl_record = SeqIO.read(open("Alistipes_shahii_WAL8301.embl"), "embl") addSystematicId(embl_record) print embl_record.format("genbank") While running the above I'm getting this error: Reading format 'embl' is supported, but not writing Is there a way around? I know from the documentation on the wiki that biopython does not have a writer for EMBL format. Is there a plan of having one in the future? I volunteer to test it, or if it does not exist yet I may be able to contribute writing it... thanks to let me know. Kind regards, Anne. -- Dr Anne Pajon - Pathogen Genomics, Team 81 Sanger Institute, Wellcome Trust Genome Campus, Hinxton Cambridge CB10 1SA, United Kingdom +44 (0)1223 494 798 (office) | +44 (0)7958 511 353 (mobile) -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From biopython at maubp.freeserve.co.uk Wed Jan 6 08:15:17 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 6 Jan 2010 13:15:17 +0000 Subject: [Biopython] Could Bio.SeqIO write EMBL file? In-Reply-To: References: Message-ID: <320fb6e01001060515y23804dd5xebc326647fa975d8@mail.gmail.com> On Wed, Jan 6, 2010 at 12:20 PM, Anne Pajon wrote: > Dear, > > I'm reading EMBL file with Bio.SeqIO for adding an extra feature qualifier > to each of the annotations, and would like to write the modified annotated > sequence back to an EMBL file. > ... 
> While running the above I'm getting this error: > Reading format 'embl' is supported, but not writing > > Is there a way around? I know from the documentation on the wiki that > biopython does not have a writer for EMBL format. Is there a plan of having > one in the future? I volunteer to test it, or if it does not exist yet I may > be able to contribute writing it... thanks to let me know. > > Kind regards, > Anne. Hello Anne, The intention was to eventually have both GenBank and EMBL output working in SeqIO - and they should be able to share a lot of code. However, out of practicality, GenBank output was prioritised (and bar a few bits of annotation, seems to be working nicely). There hadn't been much interest in EMBL output in comparison. Getting something basic working shouldn't be too hard (id, features and sequence), and having someone interested help test this would be very valuable. Did you install Biopython from source? Are you happy using git (to grab code for testing)? Neither is essential for trying out new Python code, but would make things a bit simpler. Also, what kind of organisms are you working with? What I'm getting at here is how complex are the feature locations going to be? Peter From ap12 at sanger.ac.uk Wed Jan 6 08:28:42 2010 From: ap12 at sanger.ac.uk (Anne Pajon) Date: Wed, 6 Jan 2010 13:28:42 +0000 Subject: [Biopython] Could Bio.SeqIO write EMBL file? In-Reply-To: <320fb6e01001060515y23804dd5xebc326647fa975d8@mail.gmail.com> References: <320fb6e01001060515y23804dd5xebc326647fa975d8@mail.gmail.com> Message-ID: <246181F8-03A2-49B2-960D-1FFBBC3E2865@sanger.ac.uk> Hi Peter, Thanks again for this fast answer. You've been fixing code for me recently on fasta-m10 al_start and al_end, so I am now working with the development version of biopython from git. I have no problem of updating it and testing it here. I am working with about 30 bacteria genomes from the human gut and waiting 100 more genomes to work with this year. 
I can send you one of the file if you wish. Just let me know. Kind regards, Anne. On 6 Jan 2010, at 13:15, Peter wrote: > On Wed, Jan 6, 2010 at 12:20 PM, Anne Pajon wrote: >> Dear, >> >> I'm reading EMBL file with Bio.SeqIO for adding an extra feature >> qualifier >> to each of the annotations, and would like to write the modified >> annotated >> sequence back to an EMBL file. >> ... >> While running the above I'm getting this error: >> Reading format 'embl' is supported, but not writing >> >> Is there a way around? I know from the documentation on the wiki that >> biopython does not have a writer for EMBL format. Is there a plan >> of having >> one in the future? I volunteer to test it, or if it does not exist >> yet I may >> be able to contribute writing it... thanks to let me know. >> >> Kind regards, >> Anne. > > Hello Anne, > > The intention was to eventually have both GenBank and EMBL output > working in SeqIO - and they should be able to share a lot of code. > However, out of practicality, GenBank output was prioritised (and > bar a few bits of annotation, seems to be working nicely). There > hadn't been much interest in EMBL output in comparison. > > Getting something basic working shouldn't be too hard (id, features > and > sequence), and having someone interested help test this would be very > valuable. Did you install Biopython from source? Are you happy using > git (to grab code for testing)? Neither is essential for trying out > new > Python code, but would make things a bit simpler. > > Also, what kind of organisms are you working with? What I'm getting > at here is how complex are the feature locations going to be? 
>
> Peter

--
Dr Anne Pajon - Pathogen Genomics, Team 81
Sanger Institute, Wellcome Trust Genome Campus, Hinxton
Cambridge CB10 1SA, United Kingdom
+44 (0)1223 494 798 (office) | +44 (0)7958 511 353 (mobile)

From pedro.al at fenhi.uh.cu Wed Jan 6 09:24:30 2010
From: pedro.al at fenhi.uh.cu (Yasser Almeida Hernández)
Date: Wed, 06 Jan 2010 09:24:30 -0500
Subject: [Biopython] Save custom structure...
Message-ID: <20100106092430.vz2h8pqo0w8k8gcc@correo.fenhi.uh.cu>

I used the "add" method in the Residue class to add the atom object to
the residue. It's true that I need a structure object, but how do I build
this object "de novo", and how do I add a new residue to it??? I used the
StructureBuilder class with the init_* methods (model, chain, residue
etc.) and then the get_structure method, but it doesn't work:

# EXPERIMENTAL CODE
res.add(contact)    # Add an atom to the residue of interest

output_structure = StructureBuilder.StructureBuilder()
output_structure.init_structure('OUT_STRUCT')

output_structure.init_model(0)
output_structure.init_chain('X')
output_structure.get_structure()
output_structure[0]['X'].add(res)
io = PDBIO()
io.set_structure(output_structure)
pdb_out_filename = "cont_res_plus_contact.pdb"
io.save(pdb_out_filename, output_structure)

I'm processing a hundred PDB files, and I need this code to write
residues and atoms in different conformational states...

I hope for your help...
Thanks

> You need a structure object, and then pass that to PDBIO.
> I suggest you do this via a Select class - as in your related
> question about removing hydrogen atoms:
> http://lists.open-bio.org/pipermail/biopython/2009-December/006028.html
> http://lists.open-bio.org/pipermail/biopython/2010-January/006064.html
>
> If that doesn't make sense, then perhaps you could go into
> more detail? e.g. tell us which PDB file you are working with,
> and show us your code so far. You could use a similar example
> if you'd prefer not to talk about the real research topic.
> Peter

--
Lic. Yasser Almeida Hernández
Center of Molecular Inmunology (CIM)
Nanobiology Group
P.O.Box 16040, Havana, Cuba
Phone: (537) 271-7933, ext. 221

----------------------------------------------------------------
Correo FENHI

From biopython at maubp.freeserve.co.uk Wed Jan 6 10:10:01 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 6 Jan 2010 15:10:01 +0000
Subject: [Biopython] Save custom structure...
In-Reply-To: <20100106092430.vz2h8pqo0w8k8gcc@correo.fenhi.uh.cu>
References: <20100106092430.vz2h8pqo0w8k8gcc@correo.fenhi.uh.cu>
Message-ID: <320fb6e01001060710g50a66b8k160d3f0a8a1883f9@mail.gmail.com>

2010/1/6 Yasser Almeida Hernández :
> I used the "add" method in the Residue class to add the atom object to the
> residue. It's true that I need a structure object, but how do I build this
> object "de novo", and how do I add a new residue to it???

I've not tried that myself, so I don't have any suggestions beyond
looking over the documentation - or even the Bio.PDB code itself.

> I used the StructureBuilder class with the init_* methods (model, chain,
> residue etc.) and then the get_structure method, but it doesn't work:
>
> # EXPERIMENTAL CODE
> res.add(contact)    # Add an atom to the residue of interest
>
> output_structure = StructureBuilder.StructureBuilder()
> output_structure.init_structure('OUT_STRUCT')
>
> output_structure.init_model(0)
> output_structure.init_chain('X')
> output_structure.get_structure()

Note that the get_structure() call returns a structure, but you
are ignoring the return value.

> output_structure[0]['X'].add(res)
> io = PDBIO()
> io.set_structure(output_structure)
> pdb_out_filename = "cont_res_plus_contact.pdb"
> io.save(pdb_out_filename, output_structure)

Your code snippet is incomplete - which makes it harder to try
to follow what you are doing. It is missing all the import statements
and the definition of the res variable.

> I'm processing a hundred PDB files, and I need this code to write
> residues and atoms in different conformational states...

Perhaps I had misunderstood - I thought you were starting
with a given PDB file, and wanted to select some particular
residues/atoms, and output a new partial PDB file with just
those bits. That should work using a Select class to create
a sub-structure from the original full structure from parsing
the original PDB file. i.e. you don't need to create a new
structure object "de novo".

Peter

From pedro.al at fenhi.uh.cu Wed Jan 6 10:26:24 2010
From: pedro.al at fenhi.uh.cu (Yasser Almeida Hernández)
Date: Wed, 06 Jan 2010 10:26:24 -0500
Subject: [Biopython] Save custom structure...
Message-ID: <20100106102624.3a1bn46ns4444oos@correo.fenhi.uh.cu>

You thought right!! But my big doubt is how to use the Select class. The
example with the Gly selection in the FAQ document is not clear enough
for me to apply it to my problem.
Let's say:
I've selected ASP 10 in chain 'A' and the atom O1 in a ligand, all in the
PDB file 1xyz. How do I use the Select class (syntax) to write a new PDB
file with only the residue/atom selected before? What would the code look like?
Thanks > Perhaps I had misunderstood - I thought you were starting > with a given PDB file, and wanted to select some particular > residues/atoms, and output a new partial PDB file with just > those bits. That should work using a Select class to create > a sub-structure from the original full structure from parsing > the original PDF file. i.e You don't need to create a new > structure object "de novo". -- Lic. Yasser Almeida Hern?ndez Center of Molecular Inmunology (CIM) Nanobiology Group P.O.Box 16040, Havana, Cuba Phone: (537) 271-7933, ext. 221 ---------------------------------------------------------------- Correo FENHI From biopython at maubp.freeserve.co.uk Wed Jan 6 10:49:24 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 6 Jan 2010 15:49:24 +0000 Subject: [Biopython] Save custom structure... In-Reply-To: <20100106102624.3a1bn46ns4444oos@correo.fenhi.uh.cu> References: <20100106102624.3a1bn46ns4444oos@correo.fenhi.uh.cu> Message-ID: <320fb6e01001060749x6e7d98bfnd138b20e5e8564ce@mail.gmail.com> 2010/1/6 Yasser Almeida Hern?ndez : > You thought right!! But my big doubt is how to use the Select class. The > example with the Gly selection in the FAQ document is not so clear to me to > apply on my problem. > Let's say: > I've selected the ASP 10 in the chain 'A' and the atom O1 in a ligand, all > in the pdb file 1xyz. How i use the Select class (sintaxis) to write a new > pdb file with only the residue/atom selected before? How would be the code? Have you got a real example? There is no Asp10 in PDB file 1xyz. 
But, if for the sake of argument you wanted Arg519 (in either chain)
in 1xyz you could do it like this - based on the following example I
pointed to earlier:

http://lists.open-bio.org/pipermail/biopython/2009-May/005174.html

from Bio.PDB import Select, PDBIO
from Bio.PDB.PDBParser import PDBParser

class MySelector(Select):
    def accept_residue(self, residue):
        # Only want Arg519 (in any chain)
        return residue.resname == "ARG" and residue.id[1] == 519

s = PDBParser().get_structure("1XYZ", "1XYZ.pdb")
io = PDBIO()
io.set_structure(s)
io.save("1XYZ-interesting.pdb", select=MySelector())
print("Done")

Peter

From pedro.al at fenhi.uh.cu Wed Jan 6 11:24:43 2010
From: pedro.al at fenhi.uh.cu (Yasser Almeida Hernández)
Date: Wed, 06 Jan 2010 11:24:43 -0500
Subject: [Biopython] Save custom structure...
Message-ID: <20100106112443.0ewqh402gws8sos4@correo.fenhi.uh.cu>

I just used Asp10 in 1xyz as a hypothetical residue in a hypothetical
structure.
One last thing: to select the CB atom in that Arg519 with the MySelector
class and return it with the residue, what would that look like...?

Thanks

> Have you got a real example? There is no Asp10 in PDB file 1xyz.
> But, if for the sake of argument you wanted Arg519 (in either chain)
> in 1xyz you could do it like this - based on the following example I
> pointed to earlier:
>
> http://lists.open-bio.org/pipermail/biopython/2009-May/005174.html
>
> from Bio.PDB import Select, PDBIO
> from Bio.PDB.PDBParser import PDBParser
>
> class MySelector(Select):
>     def accept_residue(self, residue):
>         # Only want Arg519 (in any chain)
>         return residue.resname == "ARG" and residue.id[1] == 519
>
> s = PDBParser().get_structure("1XYZ", "1XYZ.pdb")
> io = PDBIO()
> io.set_structure(s)
> io.save("1XYZ-interesting.pdb", select=MySelector())
> print("Done")
>
>
> Peter

--
Lic. Yasser Almeida Hernández
Center of Molecular Inmunology (CIM)
Nanobiology Group
P.O.Box 16040, Havana, Cuba
Phone: (537) 271-7933, ext. 221

----------------------------------------------------------------
Correo FENHI

From biopython at maubp.freeserve.co.uk Wed Jan 6 11:56:09 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 6 Jan 2010 16:56:09 +0000
Subject: [Biopython] Save custom structure...
In-Reply-To: <20100106112443.0ewqh402gws8sos4@correo.fenhi.uh.cu>
References: <20100106112443.0ewqh402gws8sos4@correo.fenhi.uh.cu>
Message-ID: <320fb6e01001060856g2a1fc7c4te23fa15041a86537@mail.gmail.com>

2010/1/6 Yasser Almeida Hernández :
> I just used Asp10 in 1xyz as a hypothetical residue in a hypothetical
> structure.

I thought so - but I was hoping for a concrete example, where
you can describe explicitly which bits you are trying to select.

> One last thing: to select the CB atom in that Arg519 with the MySelector
> class and return it with the residue, what would that look like...?

So out of the entire chain, you just want atom CB from residue Arg519?
Try this then, it will give you a tiny PDB file with just two atoms, the
CB from Arg519 in the two chains.

from Bio.PDB import Select, PDBIO
from Bio.PDB.PDBParser import PDBParser

class MySelector(Select):
    def accept_residue(self, residue):
        # Only want Arg519 (in any chain)
        return residue.resname == "ARG" and residue.id[1] == 519
    def accept_atom(self, atom):
        # Only want the CB atom (in residue Arg519)
        return atom.name == "CB"

s = PDBParser().get_structure("1XYZ", "1XYZ.pdb")
io = PDBIO()
io.set_structure(s)
io.save("1XYZ-interesting.pdb", select=MySelector())
print("Done")

Peter

From pedro.al at fenhi.uh.cu Thu Jan 7 09:14:28 2010
From: pedro.al at fenhi.uh.cu (Yasser Almeida Hernández)
Date: Thu, 07 Jan 2010 09:14:28 -0500
Subject: [Biopython] Save custom structure...
Message-ID: <20100107091428.o9i82r9lsg8kw8sk@correo.fenhi.uh.cu>

Yes, out of the entire chain. Here's the concrete example:

I have two PDB files. The first is ligand-bound (1BCX) and the other is
ligand-free (1BVV). In the first I want to save Tyr166 and the ligand
atom O3B, both in one PDB file. In the second I want to save the
equivalent Tyr166 together with the ligand atom from the first PDB file,
both in another PDB file...

I hope this is clearer...
Thanks...

> So out of the entire chain, you just want atom CB from residue Arg519?
> Try this then, it will give you a tiny PDB file with just two atoms, the
> CB from Arg519 in the two chains.
>
> from Bio.PDB import Select, PDBIO
> from Bio.PDB.PDBParser import PDBParser
>
> class MySelector(Select):
>     def accept_residue(self, residue):
>         # Only want Arg519 (in any chain)
>         return residue.resname == "ARG" and residue.id[1] == 519
>     def accept_atom(self, atom):
>         # Only want the CB atom (in residue Arg519)
>         return atom.name == "CB"
>
> s = PDBParser().get_structure("1XYZ", "1XYZ.pdb")
> io = PDBIO()
> io.set_structure(s)
> io.save("1XYZ-interesting.pdb", select=MySelector())
> print("Done")
>
> Peter

--
Lic. Yasser Almeida Hernández
Center of Molecular Inmunology (CIM)
Nanobiology Group
P.O.Box 16040, Havana, Cuba
Phone: (537) 271-7933, ext. 221

----------------------------------------------------------------
Correo FENHI

From biopython at maubp.freeserve.co.uk Thu Jan 7 11:08:59 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 7 Jan 2010 16:08:59 +0000
Subject: [Biopython] Could Bio.SeqIO write EMBL file?
In-Reply-To: <246181F8-03A2-49B2-960D-1FFBBC3E2865@sanger.ac.uk>
References: <320fb6e01001060515y23804dd5xebc326647fa975d8@mail.gmail.com> <246181F8-03A2-49B2-960D-1FFBBC3E2865@sanger.ac.uk>
Message-ID: <320fb6e01001070808k57311f43tb67c9cf916d27eab@mail.gmail.com>

On Wed, Jan 6, 2010 at 1:28 PM, Anne Pajon wrote:
> Hi Peter,
>
> Thanks again for this fast answer.
>
> You've been fixing code for me recently on fasta-m10 al_start and al_end, so
> I am now working with the development version of biopython from git. I have
> no problem of updating it and testing it here.

Great.
I've just committed very basic EMBL output support to our main branch on git. This is a stepping stone, deliberately a partial solution only for now, to make sure the basics seem to work (dealing with the sequence and identifiers, but nothing about the detailed annotation). In particular, I have deliberately not implemented feature support (yet - the existing code for writing a GenBank feature table will need to be tweaked to cover EMBL feature tables as well). I realise that in the current state this isn't going to be especially useful for you, but if you can have a look anyway and let me know if there is anything amiss that would be helpful. e.g. Make sure your favourite tools like the EMBL files Biopython produces. What do you use? Artemis? > I am working with about 30 bacteria genomes from the human gut and waiting > 100 more genomes to work with this year. I can send you one of the file if > you wish. Just let me know. You could send me one off list if you like - but its probably unnecessary for now. Regards, Peter From biopython at maubp.freeserve.co.uk Thu Jan 7 13:14:20 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 7 Jan 2010 18:14:20 +0000 Subject: [Biopython] Could Bio.SeqIO write EMBL file? In-Reply-To: <320fb6e01001070808k57311f43tb67c9cf916d27eab@mail.gmail.com> References: <320fb6e01001060515y23804dd5xebc326647fa975d8@mail.gmail.com> <246181F8-03A2-49B2-960D-1FFBBC3E2865@sanger.ac.uk> <320fb6e01001070808k57311f43tb67c9cf916d27eab@mail.gmail.com> Message-ID: <320fb6e01001071014w50fa780ct2fdaccf2b9272cdc@mail.gmail.com> On Thu, Jan 7, 2010 at 4:08 PM, Peter wrote: > > Great. I've just committed very basic EMBL output support to our main > branch on git. This is a stepping stone, deliberately a partial solution only > for now, to make sure the basics seem to work (dealing with the sequence > and identifiers, but nothing about the detailed annotation). 
In particular, > I have deliberately not implemented feature support (yet - the existing > code for writing a GenBank feature table will need to be tweaked to > cover EMBL feature tables as well). > > I realise that in the current state this isn't going to be especially useful > for you, but if you can have a look anyway and let me know if there is > anything amiss that would be helpful. e.g. Make sure your favourite > tools like the EMBL files Biopython produces. What do you use? > Artemis? I did some more work, including writing CO lines for CONTIG records, but when testing realised our EMBL parser doesn't (yet) cope with them: http://bugzilla.open-bio.org/show_bug.cgi?id=2980 Peter From daniel at dim.fm.usp.br Thu Jan 7 13:51:05 2010 From: daniel at dim.fm.usp.br (Daniel Silvestre) Date: Thu, 07 Jan 2010 16:51:05 -0200 Subject: [Biopython] Why so few recipes in the cookbook? In-Reply-To: <20091221131148.GB21580@sobchak.mgh.harvard.edu> (sfid-+20091221-111151-+000.00-1@spamfilter.osbf.lua) References: <4B2A8B48.50302@dim.fm.usp.br> <320fb6e00912171316y5e514052sabaf2a0104a558ac@mail.gmail.com> <4B2B6DE2.3080500@dim.fm.usp.br> <320fb6e00912180457x31b3c48bl680d48d6b95fdab0@mail.gmail.com> <4B2B8CC3.3090307@dim.fm.usp.br> <320fb6e00912180700w49d3be87r53b1a5201c84461b@mail.gmail.com> <4B2BAE35.2070404@dim.fm.usp.br> <320fb6e00912181442r60348fcwf15776a0451bc6a1@mail.gmail.com> <320fb6e00912210403q5dd4c0d7xf06c9a850ecde9db@mail.gmail.com> <20091221131148.GB21580@sobchak.mgh.harvard.edu> (sfid-+20091221-111151-+000.00-1@spamfilter.osbf.lua) Message-ID: <4B462D19.2050906@dim.fm.usp.br> Hi people, This year will be fantastic for bioinformaticians/biologist and hybrids like me !!! As a side product of my thesis, I'm preparing some courses in bioinformatics oriented to biologists and physicians (I work at a very large medical complex with lots of underused fine clusters). And, of course, I'll need help to shape up the examples in a more OO way. 
Most of my work is done in pure C99 (a lot of void pointers) and I mainly use python as an interface between databases and my small programs. But, for the moment, my thesis will hang up this project a little. Nevertheless, it's in the second place on my priority queue. By the way, the bloggers from Blue Collar, Programming for Scientists, Yokofakun and related are on the list? They have nice examples that really work. So, the cookbook will take off !!! This is a promise. I really want to use it on my classes. See you very soon, Daniel Brad Chapman wrote: > Peter and Daniel; > Really interesting discussion. Documentation is an area that can > always use more work to appeal to a wider audience. > > Daniel: >>> While this tutorial is enough to CS-oriented guys, it's a really big >>> step to grasp such information for people from other communities. >>> That's why I'm always a little confused about the idea behind bio >>> projects. If the idea is programming of scientists, the approach is >>> way too CS. > > This stresses why we actively encourage contributions from biologists > as well. Many of the contributors to Biopython tend more towards the > programming/bioinformatics side, since that experience helps in building > up and appreciating a re-usable toolkit. When those same people write > documentation, it is going to be naturally biased towards the sort of > work they do. > > I'd definitely encourage you, and anyone else who might be > interested, to build up examples that are more intuitive to those > coming at the work from a different starting point. This is exactly > the idea behind starting up the cookbook on the wiki; it's all > freely editable, so dig right in. > > Brad > -- +---------------------------------------+ Daniel de A. M. M. Silvestre LIM01 - Laborat?rio de Inform?tica M?dica - HCFMUSP Sala 1349 - Depto. de Patologia Faculdade de Medicina Universidade de S?o Paulo Av. Dr. 
Arnaldo, 455 | e-mail: daniel at dim.fm.usp.br Cerqueira César | Tel: +55-11-3061-7381 01246-903 - São Paulo - SP | Cel: +55-11-8042-9369 BRASIL | Skype: jarretinha From msameet at gmail.com Fri Jan 8 02:09:32 2010 From: msameet at gmail.com (Sameet Mehta) Date: Fri, 8 Jan 2010 12:39:32 +0530 Subject: [Biopython] how to obtain official Gene Symbols for a list of GeneNames Message-ID: <380bc9b31001072309s703b0d6dv6bfe00490f91f05f@mail.gmail.com> Hi All, I have a few lists of gene names/gene symbols for some old (5 year old) microarray experiments. I want to find out the official Gene Symbols for all of these genes. Is there a way to do it in Biopython.
regards Sameet -- Sameet Mehta, Ph.D., Research Associate, Chromatin Biology Laboratory, National Centre for Cell Science, NCCS Complex, University of Pune Campus, Pune 411007 Phone: +91-20-25708158 Other Email: sameet at nccs.res.in From p.j.a.cock at googlemail.com Fri Jan 8 05:12:27 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 8 Jan 2010 10:12:27 +0000 Subject: [Biopython] how to obtain official Gene Symbols for a list of GeneNames In-Reply-To: <380bc9b31001072309s703b0d6dv6bfe00490f91f05f@mail.gmail.com> References: <380bc9b31001072309s703b0d6dv6bfe00490f91f05f@mail.gmail.com> Message-ID: <320fb6e01001080212g1cacb48cm620b8c1374cf2e23@mail.gmail.com> On Fri, Jan 8, 2010 at 7:09 AM, Sameet Mehta wrote: > Hi All, > > I have a few lists of gene names/gene symbols for some old (5 year > old) microarray experiments. I want to find out the official Gene > Symbols for all of these genes. Is there a way to do it in Biopython. > > regards > Sameet I'd start by working out whose gene names/gene symbols they are. What kind of microarrays are you using? For a custom chip you may have to talk to whoever designed it, but for mainstream commercial chips there should be lookup tables, either on the manufacturer's website or perhaps in R/Bioconductor. Note you can actually combine R/Bioconductor with Python using rpy2 (or its predecessor, rpy).
For examples, see: http://bcbio.wordpress.com/2010/01/02/automated-retrieval-of-expression-data-with-python-and-r/ http://www.warwick.ac.uk/go/peter_cock/python/heatmap/ Peter From p.j.a.cock at googlemail.com Fri Jan 8 05:47:49 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 8 Jan 2010 10:47:49 +0000 Subject: [Biopython] how to obtain official Gene Symbols for a list of GeneNames In-Reply-To: <380bc9b31001080220q6ed2455s43f48f9830a05d93@mail.gmail.com> References: <380bc9b31001072309s703b0d6dv6bfe00490f91f05f@mail.gmail.com> <320fb6e01001080212g1cacb48cm620b8c1374cf2e23@mail.gmail.com> <380bc9b31001080220q6ed2455s43f48f9830a05d93@mail.gmail.com> Message-ID: <320fb6e01001080247o3890a75fyb6e9b6080dcb9fb2@mail.gmail.com> Please CC the mailing list in replies. On Fri, Jan 8, 2010 at 10:20 AM, Sameet Mehta wrote: > Hi, > Thanks for the reply. ?What I have are the old GeneSymbols. ?I have > already selected the genes of interest based on expression profiles. > But I need their current GeneSymbols, so that I can do GO-Term > enrichment. Yes, but which GeneSymbols do you have? There are lots of different ones (including different species - for human you would probably be talking about the HUGO Gene Nomenclature Committee assigned symbols). Assuming your particular gene symbols are covered, then using NCBI Entrez and the Gene database might work (try ELink?). 
Peter From dalloliogm at gmail.com Fri Jan 8 06:06:45 2010 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Fri, 8 Jan 2010 12:06:45 +0100 Subject: [Biopython] how to obtain official Gene Symbols for a list of GeneNames In-Reply-To: <320fb6e01001080247o3890a75fyb6e9b6080dcb9fb2@mail.gmail.com> References: <380bc9b31001072309s703b0d6dv6bfe00490f91f05f@mail.gmail.com> <320fb6e01001080212g1cacb48cm620b8c1374cf2e23@mail.gmail.com> <380bc9b31001080220q6ed2455s43f48f9830a05d93@mail.gmail.com> <320fb6e01001080247o3890a75fyb6e9b6080dcb9fb2@mail.gmail.com> Message-ID: <5aa3b3571001080306s7fa4bfe4x102e7cc58fe84b84@mail.gmail.com> On Fri, Jan 8, 2010 at 11:47 AM, Peter Cock wrote: > Please CC the mailing list in replies. > > On Fri, Jan 8, 2010 at 10:20 AM, Sameet Mehta wrote: >> Hi, >> Thanks for the reply. ?What I have are the old GeneSymbols. ?I have >> already selected the genes of interest based on expression profiles. >> But I need their current GeneSymbols, so that I can do GO-Term >> enrichment. I would do it with BioMart, as it already has all the datasets available and it makes it possible to do it without programming at all. I know you can do it with biopython, but this is just a one-time job, maybe it is not necessary... In any case, it is true that you can't do it without knowing which GeneSymbols you are using and with which version they were annotated. > > Yes, but which GeneSymbols do you have? There are lots of > different ones (including different species - for human you would > probably be talking about the HUGO Gene Nomenclature > Committee assigned symbols). > > Assuming your particular gene symbols are covered, then using > NCBI Entrez and the Gene database might work (try ELink?). 
> > Peter > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- Giovanni Dall'Olio, phd student Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain) My blog on bioinformatics: http://bioinfoblog.it From msameet at gmail.com Fri Jan 8 06:25:27 2010 From: msameet at gmail.com (Sameet Mehta) Date: Fri, 08 Jan 2010 16:55:27 +0530 Subject: [Biopython] how to obtain official Gene Symbols for a list of GeneNames In-Reply-To: <320fb6e01001080247o3890a75fyb6e9b6080dcb9fb2@mail.gmail.com> References: <380bc9b31001072309s703b0d6dv6bfe00490f91f05f@mail.gmail.com> <320fb6e01001080212g1cacb48cm620b8c1374cf2e23@mail.gmail.com> <380bc9b31001080220q6ed2455s43f48f9830a05d93@mail.gmail.com> <320fb6e01001080247o3890a75fyb6e9b6080dcb9fb2@mail.gmail.com> Message-ID: <4B471627.70406@gmail.com> Hi, I was wondering about using the NCBI Gene Database. I don't know where to begin. If you could help with some skeleton code, I could take it from there. regards Sameet On 01/08/2010 04:17 PM, Peter Cock wrote: > Please CC the mailing list in replies. > > On Fri, Jan 8, 2010 at 10:20 AM, Sameet Mehta wrote: > >> Hi, >> Thanks for the reply. What I have are the old GeneSymbols. I have >> already selected the genes of interest based on expression profiles. >> But I need their current GeneSymbols, so that I can do GO-Term >> enrichment. >> > Yes, but which GeneSymbols do you have? There are lots of > different ones (including different species - for human you would > probably be talking about the HUGO Gene Nomenclature > Committee assigned symbols). > > Assuming your particular gene symbols are covered, then using > NCBI Entrez and the Gene database might work (try ELink?).
> > Peter > > From p.j.a.cock at googlemail.com Fri Jan 8 06:35:57 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 8 Jan 2010 11:35:57 +0000 Subject: [Biopython] how to obtain official Gene Symbols for a list of GeneNames In-Reply-To: <4B471627.70406@gmail.com> References: <380bc9b31001072309s703b0d6dv6bfe00490f91f05f@mail.gmail.com> <320fb6e01001080212g1cacb48cm620b8c1374cf2e23@mail.gmail.com> <380bc9b31001080220q6ed2455s43f48f9830a05d93@mail.gmail.com> <320fb6e01001080247o3890a75fyb6e9b6080dcb9fb2@mail.gmail.com> <4B471627.70406@gmail.com> Message-ID: <320fb6e01001080335n40bd3041n71779cd5c1df83c1@mail.gmail.com> On Fri, Jan 8, 2010 at 11:25 AM, Sameet Mehta wrote: > Hi, > I was wondering about using the NCBI Gene Database. ?I dont know where > to begin. If you could help with some skeleton code, I could take it > from there. How about telling us two or three of your old gene symbols, what they are from, and the desired new gene symbols? If you can manage to do this manually via the Entrez website, that would also be very helpful for doing it automatically via a script. Peter From biopython at maubp.freeserve.co.uk Fri Jan 8 07:48:07 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 8 Jan 2010 12:48:07 +0000 Subject: [Biopython] Could Bio.SeqIO write EMBL file? 
In-Reply-To: <320fb6e01001071014w50fa780ct2fdaccf2b9272cdc@mail.gmail.com> References: <320fb6e01001060515y23804dd5xebc326647fa975d8@mail.gmail.com> <246181F8-03A2-49B2-960D-1FFBBC3E2865@sanger.ac.uk> <320fb6e01001070808k57311f43tb67c9cf916d27eab@mail.gmail.com> <320fb6e01001071014w50fa780ct2fdaccf2b9272cdc@mail.gmail.com> Message-ID: <320fb6e01001080448g24ff5308gb797f01cbab79196@mail.gmail.com> On Thu, Jan 7, 2010 at 6:14 PM, Peter wrote: > > I did some more work, including writing CO lines for CONTIG records, > but when testing realised our EMBL parser doesn't (yet) cope with them: > http://bugzilla.open-bio.org/show_bug.cgi?id=2980 > OK, now EMBL contig records seem to be working :) Peter From p.j.a.cock at googlemail.com Fri Jan 8 11:48:40 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 8 Jan 2010 16:48:40 +0000 Subject: [Biopython] how to obtain official Gene Symbols for a list of GeneNames In-Reply-To: <380bc9b31001080809l531a8c17v109191fc7783d0f5@mail.gmail.com> References: <380bc9b31001072309s703b0d6dv6bfe00490f91f05f@mail.gmail.com> <320fb6e01001080212g1cacb48cm620b8c1374cf2e23@mail.gmail.com> <380bc9b31001080220q6ed2455s43f48f9830a05d93@mail.gmail.com> <320fb6e01001080247o3890a75fyb6e9b6080dcb9fb2@mail.gmail.com> <4B471627.70406@gmail.com> <320fb6e01001080335n40bd3041n71779cd5c1df83c1@mail.gmail.com> <380bc9b31001080809l531a8c17v109191fc7783d0f5@mail.gmail.com> Message-ID: <320fb6e01001080848w3fcac1cg9250e85dc21c82dd@mail.gmail.com> Please CC the mailing list. On Fri, Jan 8, 2010 at 4:09 PM, Sameet Mehta wrote: > Hi, > My list contains gene names such as DKFZP586P0123 , RPL6, etc. ?What I > do is search this in the NCBI Gene database manually, and then i get > the official Gene Symbol. ?I want to automate this process. ?I am of > course interested only in official gene symbols from the Humans. > > Sameet OK, so via my browser using Entrez Gene, I used: DKFZP586P0123 "Homo sapiens"[orgn] This maps uniquely to C2CD3. 
However, RPL6 "Homo sapiens"[orgn] maps to several hits (some discontinued) included things like RPL6P13. Clearly we need to make the search a little more specific... we only want to search for a name or gene symbol (not the default search on all fields). It looks like searching on "gene" works nicely, see also: http://news.open-bio.org/news/2009/06/ncbi-einfo-biopython/ Entrez queries like these seem to give unique matches: DKFZP586P0123[gene] "Homo sapiens"[orgn] RPL6[gene] "Homo sapiens"[orgn] e.g. >>> from Bio import Entrez >>> Entrez.email = "Your.Name.Here at example.com" >>> search = Entrez.read(Entrez.esearch(db='gene', term='DKFZP586P0123[gene] "Homo sapiens"[orgn]', retmode='xml')) >>> print search["IdList"] ['26005'] That unique ID we got back (26005) is the UID for this gene, which you should be able to use with EFetch (or Elink?). e.g. You could download the whole record as XML, and parse that: >>> result = Entrez.read(Entrez.efetch(db='gene', id='26005', retmode='xml')) >>> result[0]['Entrezgene_gene']['Gene-ref']['Gene-ref_locus'] 'C2CD3' However, this next approach is a much quicker download, and so looks like a more efficient way to get the desired gene symbol: >>> print Entrez.efetch(db='gene', id='26005', retmode='text', rettype='brief').read() 1: C2CD3 C2 calcium-depend... [GeneID: 26005] Next read the Entrez chapter in the Biopython Tutorial, especially the bit about the history functionality for linking ESearch and EFetch. 
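For a whole list of old names, the search and fetch steps above could be wrapped roughly as follows. This is only a sketch: it assumes Python 3 and a recent Biopython rather than the Python 2 session shown above, the helper names build_gene_term and official_symbol are made up for illustration, and the XML path is simply the one from the example above.

```python
def build_gene_term(symbol, organism="Homo sapiens"):
    """Build an Entrez query limited to the gene-name field and one organism."""
    return '%s[gene] "%s"[orgn]' % (symbol, organism)


def official_symbol(symbol, email, organism="Homo sapiens"):
    """Return the official symbol, or None unless the search gives exactly one hit.

    Hypothetical helper: needs Biopython installed and network access to NCBI.
    """
    from Bio import Entrez  # imported here so the query builder works offline

    Entrez.email = email
    search = Entrez.read(
        Entrez.esearch(db="gene", term=build_gene_term(symbol, organism))
    )
    ids = search["IdList"]
    if len(ids) != 1:
        return None  # no match, or several: better checked by hand
    result = Entrez.read(Entrez.efetch(db="gene", id=ids[0], retmode="xml"))
    # Same path into the parsed record as in the interactive example above
    return result[0]["Entrezgene_gene"]["Gene-ref"]["Gene-ref_locus"]
```

Calling official_symbol("DKFZP586P0123", "Your.Name.Here@example.com") would be expected to follow the same route as the manual session. Returning None for zero or multiple hits is a deliberate choice so that ambiguous names are not silently mapped.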
Peter From msameet at gmail.com Fri Jan 8 12:13:24 2010 From: msameet at gmail.com (Sameet Mehta) Date: Fri, 8 Jan 2010 22:43:24 +0530 Subject: [Biopython] how to obtain official Gene Symbols for a list of GeneNames In-Reply-To: <320fb6e01001080848w3fcac1cg9250e85dc21c82dd@mail.gmail.com> References: <380bc9b31001072309s703b0d6dv6bfe00490f91f05f@mail.gmail.com> <320fb6e01001080212g1cacb48cm620b8c1374cf2e23@mail.gmail.com> <380bc9b31001080220q6ed2455s43f48f9830a05d93@mail.gmail.com> <320fb6e01001080247o3890a75fyb6e9b6080dcb9fb2@mail.gmail.com> <4B471627.70406@gmail.com> <320fb6e01001080335n40bd3041n71779cd5c1df83c1@mail.gmail.com> <380bc9b31001080809l531a8c17v109191fc7783d0f5@mail.gmail.com> <320fb6e01001080848w3fcac1cg9250e85dc21c82dd@mail.gmail.com> Message-ID: <380bc9b31001080913p15ba950xb787460b98ef76b9@mail.gmail.com> Thanks Peter, that is something i was looking for. thanks for the help. regards Sameet On Fri, Jan 8, 2010 at 10:18 PM, Peter Cock wrote: > Please CC the mailing list. > > On Fri, Jan 8, 2010 at 4:09 PM, Sameet Mehta wrote: >> Hi, >> My list contains gene names such as DKFZP586P0123 , RPL6, etc. ?What I >> do is search this in the NCBI Gene database manually, and then i get >> the official Gene Symbol. ?I want to automate this process. ?I am of >> course interested only in official gene symbols from the Humans. >> >> Sameet > > OK, so via my browser using Entrez Gene, I used: > > DKFZP586P0123 "Homo sapiens"[orgn] > > This maps uniquely to C2CD3. However, > > RPL6 "Homo sapiens"[orgn] > > maps to several hits (some discontinued) included things like > RPL6P13. Clearly we need to make the search a little more > specific... we only want to search for a name or gene symbol > (not the default search on all fields). 
> > It looks like searching on "gene" works nicely, see also: > http://news.open-bio.org/news/2009/06/ncbi-einfo-biopython/ > > Entrez queries like these seem to give unique matches: > > DKFZP586P0123[gene] "Homo sapiens"[orgn] > RPL6[gene] "Homo sapiens"[orgn] > > e.g. > >>>> from Bio import Entrez >>>> Entrez.email = "Your.Name.Here at example.com" >>>> search = Entrez.read(Entrez.esearch(db='gene', term='DKFZP586P0123[gene] "Homo sapiens"[orgn]', retmode='xml')) >>>> print search["IdList"] > ['26005'] > > That unique ID we got back (26005) is the UID for this gene, which > you should be able to use with EFetch (or Elink?). e.g. You could > download the whole record as XML, and parse that: > >>>> result = Entrez.read(Entrez.efetch(db='gene', id='26005', retmode='xml')) >>>> result[0]['Entrezgene_gene']['Gene-ref']['Gene-ref_locus'] > 'C2CD3' > > However, this next approach is a much quicker download, and so > looks like a more efficient way to get the desired gene symbol: > >>>> print Entrez.efetch(db='gene', id='26005', retmode='text', rettype='brief').read() > > 1: C2CD3 C2 calcium-depend... [GeneID: 26005] > > Next read the Entrez chapter in the Biopython Tutorial, especially > the bit about the history functionality for linking ESearch and EFetch. > > Peter > -- Sameet Mehta, Ph.D., Research Associate, Chromatin Biology Laboratory, National Centre for Cell Science, NCCS Complex, University of Pune Campus, Pune 411007 Phone: +91-20-25708158 Other Email: sameet at nccs.res.in From bnbowman at gmail.com Fri Jan 8 20:34:34 2010 From: bnbowman at gmail.com (Brett Bowman) Date: Fri, 8 Jan 2010 17:34:34 -0800 Subject: [Biopython] Organism specific NCBIWWW qblast Message-ID: <627d998d1001081734x16bd2325g60ae36b69fbe922d@mail.gmail.com> Hello gents, I'm trying to create a dataset of proteins that are both highly similar to, and from the same species as, my query sequence. 
Since I'm going to be doing this repeatedly, for many different query sequences, I'm trying to automate the process with Biopython, but I can't figure out how to enable organism-specific blasts with qblast. Any guidance be greatly appreciated. -Brett Bowman Woelk Lab UCSD School of Medicine UCSD/SDSU Joint Bioinformatics Program From bnbowman at gmail.com Sat Jan 9 17:53:31 2010 From: bnbowman at gmail.com (Brett Bowman) Date: Sat, 9 Jan 2010 14:53:31 -0800 Subject: [Biopython] Blank Returns from Entrez.efetch() Message-ID: <627d998d1001091453m60c78a8dy408e248049fd1b6e@mail.gmail.com> I'm trying to query Entrez for a series of protein IDs with Biopython, but not having much success. The sample code given in the tutorial works perfectly: >>> handle = Entrez.efetch(db="nucleotide", id="186972394", rettype="gb") >>> print handle.read() But when I change that to proteins and my IDs, I get an empty handle as a result: >>> handle = Entrez.efetch(db="protein", id="Q81T62.1", rettype="gb") I've tried this on Biopython 1.51 and 1.53, installed on Ubuntu 9.10, and I've tried it with every rettype imaginable, with no success. Any ideas as to where I am going wrong? -Brett Bowman Woelk Lab UCSD School of Medicine UCSD/SDSU Joint Program in Bioinformatics From mjldehoon at yahoo.com Sat Jan 9 21:52:36 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sat, 9 Jan 2010 18:52:36 -0800 (PST) Subject: [Biopython] Blank Returns from Entrez.efetch() In-Reply-To: <627d998d1001091453m60c78a8dy408e248049fd1b6e@mail.gmail.com> Message-ID: <951022.75982.qm@web62407.mail.re1.yahoo.com> Have you looked at the EUtils examples on the NCBI website? It shows one example for efetch from the protein database. --Michiel. 
--- On Sat, 1/9/10, Brett Bowman wrote: > From: Brett Bowman > Subject: [Biopython] Blank Returns from Entrez.efetch() > To: biopython at biopython.org > Date: Saturday, January 9, 2010, 5:53 PM > I'm trying to query Entrez for a > series of protein IDs with Biopython, > but not having much success.? The sample code given in > the tutorial > works perfectly: > > >>> handle = Entrez.efetch(db="nucleotide", > id="186972394", rettype="gb") > >>> print handle.read() > > But when I change that to proteins and my IDs, I get an > empty handle > as a result: > > >>> handle = Entrez.efetch(db="protein", > id="Q81T62.1", rettype="gb") > > I've tried this on Biopython 1.51 and 1.53, installed on > Ubuntu 9.10, > and I've tried it with every rettype imaginable, with no > success.? Any > ideas as to where I am going wrong? > > -Brett Bowman > Woelk Lab > UCSD School of Medicine > UCSD/SDSU Joint Program in Bioinformatics > _______________________________________________ > Biopython mailing list? -? Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From chapmanb at 50mail.com Sun Jan 10 08:01:32 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Sun, 10 Jan 2010 08:01:32 -0500 Subject: [Biopython] Blank Returns from Entrez.efetch() In-Reply-To: <951022.75982.qm@web62407.mail.re1.yahoo.com> References: <627d998d1001091453m60c78a8dy408e248049fd1b6e@mail.gmail.com> <951022.75982.qm@web62407.mail.re1.yahoo.com> Message-ID: <20100110130132.GF9694@sobchak.mgh.harvard.edu> Brett; Brett: > > But when I change that to proteins and my IDs, I get an empty handle > > as a result: > > > > >>> handle = Entrez.efetch(db="protein", id="Q81T62.1", rettype="gb") Michiel: > Have you looked at the EUtils examples on the NCBI website? It shows > one example for efetch from the protein database. 
According to the efetch help here: http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetchseq_help.html the id parameter should work okay with an accession.version. So your example should work but something is wrong with how NCBI handles this particular record. Other accession.version identifiers do work, and so does the accession alone: >>> handle = Entrez.efetch(db="protein", id="Q81T62", rettype="gb") The safest way to do this is to use GenBank identifiers (GIDs) as the id attribute. This requires one extra step to search for the record and get the ID: >>> handle = Entrez.esearch(db="protein", retmax=1, term="Q81T62.1") >>> rec = Entrez.read(handle) >>> rec {u'Count': '1', u'IdList': ['46395771'], u'QueryTranslation': 'Q81T62.1', u'RetMax': '1', u'RetStart': '0', u'TranslationSet': []} >>> handle = Entrez.efetch(db="protein", id=rec['IdList'][0], rettype="gb") >>> handle.readline() 'LOCUS Q81T62 429 aa linear BCT 15-DEC-2009\n' Brad From chapmanb at 50mail.com Sun Jan 10 08:17:26 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Sun, 10 Jan 2010 08:17:26 -0500 Subject: [Biopython] Organism specific NCBIWWW qblast In-Reply-To: <627d998d1001081734x16bd2325g60ae36b69fbe922d@mail.gmail.com> References: <627d998d1001081734x16bd2325g60ae36b69fbe922d@mail.gmail.com> Message-ID: <20100110131726.GG9694@sobchak.mgh.harvard.edu> Hi Brett; > I'm trying to create a dataset of proteins that are both highly > similar to, and from the same species as, my query sequence. Since > I'm going to be doing this repeatedly, for many different query > sequences, I'm trying to automate the process with Biopython, but I > can't figure out how to enable organism-specific blasts with qblast. > Any guidance be greatly appreciated.
You want to use the entrez_query argument to qblast: result_handle = NCBIWWW.qblast("blastn", "nr", record.format("fasta"), entrez_query="Mus musculus[orgn]") See these previous threads for more discussion: http://lists.open-bio.org/pipermail/biopython/2009-June/005215.html http://www.biopython.org/pipermail/biopython/2009-September/005616.html Once you've got a short example running it would be great if you could add it as an example to the online cookbook: http://biopython.org/wiki/Category:Cookbook A nice discussion there could help others in the future with the same issue. Thanks, Brad From biopython at maubp.freeserve.co.uk Mon Jan 11 11:22:56 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 11 Jan 2010 16:22:56 +0000 Subject: [Biopython] Could Bio.SeqIO write EMBL file? In-Reply-To: <320fb6e01001080448g24ff5308gb797f01cbab79196@mail.gmail.com> References: <320fb6e01001060515y23804dd5xebc326647fa975d8@mail.gmail.com> <246181F8-03A2-49B2-960D-1FFBBC3E2865@sanger.ac.uk> <320fb6e01001070808k57311f43tb67c9cf916d27eab@mail.gmail.com> <320fb6e01001071014w50fa780ct2fdaccf2b9272cdc@mail.gmail.com> <320fb6e01001080448g24ff5308gb797f01cbab79196@mail.gmail.com> Message-ID: <320fb6e01001110822q60ac7103ndb3082cc3146735@mail.gmail.com> Hi Anne, I've just checked in feature support to the new EMBL output in Bio.SeqIO (our main branch on git). If you could give that a test it would be very much appreciated. If you are on the dev mailing list, we can discuss issues there - otherwise we might as well continue on this thread. Thanks, Peter From biopython at maubp.freeserve.co.uk Mon Jan 11 11:38:33 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 11 Jan 2010 16:38:33 +0000 Subject: [Biopython] is there an updated tutorial on how to use the Wrappers for the new NCBI BLAST+ tools? 
In-Reply-To: <5b6410e1001030009v5420f146k26d723f33c4c5f59@mail.gmail.com> References: <5b6410e1001020244s7b702ca8n760352052dba77c9@mail.gmail.com> <5b6410e1001030009v5420f146k26d723f33c4c5f59@mail.gmail.com> Message-ID: <320fb6e01001110838n476bbca8q6d949162f62405bc@mail.gmail.com> On Sun, Jan 3, 2010 at 8:09 AM, Kevin Lam wrote: > Hmmm found this in the blast+ manual is it possible to integrate this > somewhere in biopython ?Cheers > Kevin > > > 3.1 For users of NCBI C Toolkit BLAST > > The easiest way to get started using these command line applications is by > means of the legacy_blast.pl PERL script which is bundled along with the > BLAST+ applications. To utilize this script, simply prefix it to the > invocation of the C toolkit BLAST command line application and append the > --path option pointing to the installation directory of the BLAST+ > applications. For example, instead of using > blastall -i query -d nr -o blast.out > > use > legacy_blast.pl blastall -i query -d nr -o blast.out --path /opt/blast/bin > > For more details, refer to the section titled Backwards compatibility > script > . Hi Kevin, I don't understand how you think the Biopython documentation should mention the legacy_blast.pl script. Could you explain? If someone has an existing Biopython script written to call "legacy" BLAST via the Bio.Blast.NCBIStandalone "helper" function then it would be quite tricky to get this to call BLAST+ via legacy_blast.pl to convert the arguments. These "helper" functions are just too inflexible (we would probably have deprecated them anyway, even without the introduction of BLAST+ by the NCBI). If someone was using the Bio.Blast.Applications wrapper to call "legacy" BLAST then they could do something like this: import subprocess from Bio.Blast.Applications import BlastallCommandline cline = BlastallCommandline(...) child = subprocess.Popen(str(cline), ...)
Then I guess they could make a hack like this in order to use BLAST+ via legacy_blast.pl without changing much code: import subprocess from Bio.Blast.Applications import BlastallCommandline cline = BlastallCommandline(...) hack_template = "legacy_blast.pl %s --path /opt/blast/bin" child = subprocess.Popen(hack_template % cline, ...) Peter From aboulia at gmail.com Mon Jan 11 11:46:22 2010 From: aboulia at gmail.com (Kevin) Date: Tue, 12 Jan 2010 00:46:22 +0800 Subject: [Biopython] is there an updated tutorial on how to use the Wrappers for the new NCBI BLAST+ tools? In-Reply-To: <320fb6e01001110838n476bbca8q6d949162f62405bc@mail.gmail.com> References: <5b6410e1001020244s7b702ca8n760352052dba77c9@mail.gmail.com> <5b6410e1001030009v5420f146k26d723f33c4c5f59@mail.gmail.com> <320fb6e01001110838n476bbca8q6d949162f62405bc@mail.gmail.com> Message-ID: Hi Peter, I was thinking of porting the legacy blast script to python as u r right about the helper script being inflexible. The documentation bit was actually about my first email about any updated doc on how to use blast+ with biopython Cheers Kevin Sent from my iPod On 12-Jan-2010, at 12:38 AM, Peter wrote: > On Sun, Jan 3, 2010 at 8:09 AM, Kevin Lam wrote: >> Hmmm found this in the blast+ manual is it possible to integrate this >> somewhere in biopython ?Cheers >> Kevin >> >> >> 3.1 For users of NCBI C Toolkit BLAST >> >> The easiest way to get started using these command line >> applications is by >> means of the legacy_blast.pl PERL script which is bundled along >> with the >> BLAST+ applications. To utilize this script, simply prefix it to the >> invocation of the C toolkit BLAST command line application and >> append the >> --path option pointing to the installation directory of the BLAST+ >> applications. 
For example, instead of using >> blastall -i query -d nr -o blast.out >> >> use >> legacy_blast.pl blastall -i query -d nr -o blast.out --path /opt/ >> blast/bin >> >> For more details, refer to the section titled Backwards compatibility >> script> > >> . > > Hi Kevin, > > I don't understand how you think the Biopython documentation > should mention the legacy_blast.pl script. Could you explain? > > If someone has an existing Biopython script written to call "legacy" > BLAST via the Bio.Blast.NCBIStandalone "helper" function then it > would be quite tricky to get this to call BLAST+ via legacy_blast.pl > to convert the arguments. These "helper" functions are just too > inflexible (we would probably have deprecated them anyway, even > without the introduction of BLAST+ by the NCBI). > > If someone was using the the Bio.Blast.Applications wrapper to > call "legacy" BLAST then they could do something like this: > > import subprocess > from Bio.Blast.Applications import BlastallCommandline > cline = BlastallCommandline(...) > child = subprocess.Popen(str(cline), ...) > > Then I guess they could make a hack like this in order to use > BLAST+ via legacy_blast.pl without changing much code: > > import subprocess > from Bio.Blast.Applications import BlastallCommandline > cline = BlastallCommandline(...) > hack_template = "legacy_blast.pl %s --path /opt/blast/bin" > child = subprocess.Popen(hack_template % cline, ...) > > Peter From biopython at maubp.freeserve.co.uk Mon Jan 11 12:08:55 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 11 Jan 2010 17:08:55 +0000 Subject: [Biopython] is there an updated tutorial on how to use the Wrappers for the new NCBI BLAST+ tools? 
In-Reply-To: References: <5b6410e1001020244s7b702ca8n760352052dba77c9@mail.gmail.com> <5b6410e1001030009v5420f146k26d723f33c4c5f59@mail.gmail.com> <320fb6e01001110838n476bbca8q6d949162f62405bc@mail.gmail.com> Message-ID: <320fb6e01001110908s4f19272bp71377c4f70004a24@mail.gmail.com> On Mon, Jan 11, 2010 at 4:46 PM, Kevin wrote: > Hi Peter, > I was thinking of porting the legacy blast script to python as u r right > about the helper script being inflexible. A python version of legacy_blast.pl isn't any more useful than the Perl version is it? Maybe I have misunderstood you. What would be nice is a way to help people update their old Biopython scripts which called legacy BLAST, so that they can be used on BLAST+ instead. I would expect in most cases this means scripts using the legacy BLAST "helper" functions in Bio.Blast.NCBIStandalone. One way to do this would be to add new BLAST+ versions of the "helper" functions (taking the same argument names as before), but that is just a stop gap (a temporary measure). We really want people using these old helper functions to switch to using the wrappers in Bio.Blast.Applications and subprocess instead. > The documentation bit was actually about my first email about any > updated doc on how to use blast+ with biopython I see. What do you think the current (Biopython 1.53) version of the tutorial needs in the BLAST chapter? http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf Thanks, Peter From ap12 at sanger.ac.uk Mon Jan 11 12:32:43 2010 From: ap12 at sanger.ac.uk (Anne Pajon) Date: Mon, 11 Jan 2010 17:32:43 +0000 Subject: [Biopython] Could Bio.SeqIO write EMBL file? 
In-Reply-To: <320fb6e01001110822q60ac7103ndb3082cc3146735@mail.gmail.com>
References: <320fb6e01001060515y23804dd5xebc326647fa975d8@mail.gmail.com>
 <246181F8-03A2-49B2-960D-1FFBBC3E2865@sanger.ac.uk>
 <320fb6e01001070808k57311f43tb67c9cf916d27eab@mail.gmail.com>
 <320fb6e01001071014w50fa780ct2fdaccf2b9272cdc@mail.gmail.com>
 <320fb6e01001080448g24ff5308gb797f01cbab79196@mail.gmail.com>
 <320fb6e01001110822q60ac7103ndb3082cc3146735@mail.gmail.com>
Message-ID:

Hi Peter,

Just tested now.

It worked fine. Thanks a lot.

Here is the diff between the EMBL output from Bio.SeqIO and the genbank
output from Bio.SeqIO converted with the EMBOSS tool to an EMBL file:

guest137:RAST ap12$ diff tmp.embl updated_files/Alistipes_shahii_WAL8301_uRAST.embl
1c1
< ID unknown; SV 1; ; DNA; ; ; 3763317 BP.
---
> ID unknown; SV 1; linear; unassigned DNA; STD; UNC; 3763317 BP.
5c5
< DE
---
> KW .
8c8
< OC .
---
> XX
10a11
> FH
1949,1950c1950
< FT /product="Peptidyl-prolyl cis-trans isomerase (EC
< FT 5.2.1.8)"
---
> FT /product="Peptidyl-prolyl cis-trans isomerase (EC 5.2.1.8)"
3346,3347c3346
< FT kinase/response regulator, hybrid ('one component
< FT system')"
---
> FT kinase/response regulator, hybrid ('one component system')"
3380,3381c3379
< FT /product="Iron-sulfur cluster assembly ATPase protein
< FT SufC"
---
> FT /product="Iron-sulfur cluster assembly ATPase protein SufC"
4811,4812c4809
< FT /product="Gamma-glutamyl phosphate reductase (EC
< FT 1.2.1.41)"
---
> FT /product="Gamma-glutamyl phosphate reductase (EC 1.2.1.41)"
5472,5473c5469
< FT /product="lipoprotein releasing system ATP-binding
< FT protein"
---
> FT /product="lipoprotein releasing system ATP-binding protein"
5881,5882c5877
< FT /product="NAD-dependent protein deacetylase of SIR2
< FT family"
---
> FT /product="NAD-dependent protein deacetylase of SIR2 family"
6032,6033c6027
< FT /product="Exodeoxyribonuclease V alpha chain (EC
< FT 3.1.11.5)"
---
> FT /product="Exodeoxyribonuclease V alpha chain (EC 3.1.11.5)"
6495,6496c6489
< FT /product="Pyrophosphate-energized proton pump (EC
< FT 3.6.1.1)"
---
> FT /product="Pyrophosphate-energized proton pump (EC 3.6.1.1)"
6946,6947c6939
< FT /product="Exodeoxyribonuclease V alpha chain (EC
< FT 3.1.11.5)"
---
> FT /product="Exodeoxyribonuclease V alpha chain (EC 3.1.11.5)"
7128,7129c7120
< FT /product="N-acyl-L-amino acid amidohydrolase (EC
< FT 3.5.1.14)"
---
> FT /product="N-acyl-L-amino acid amidohydrolase (EC 3.5.1.14)"
8035,8036c8026
< FT /product="D-3-phosphoglycerate dehydrogenase (EC
< FT 1.1.1.95)"
---
> FT /product="D-3-phosphoglycerate dehydrogenase (EC 1.1.1.95)"
8601,8602c8591
< FT /product="Acetolactate synthase small subunit (EC
< FT 2.2.1.6)"
---
> FT /product="Acetolactate synthase small subunit (EC 2.2.1.6)"
8608,8609c8597
< FT /product="Acetolactate synthase large subunit (EC
< FT 2.2.1.6)"
---
> FT /product="Acetolactate synthase large subunit (EC 2.2.1.6)"
9152,9153c9140
< FT /product="Exodeoxyribonuclease V alpha chain (EC
< FT 3.1.11.5)"
---
> FT /product="Exodeoxyribonuclease V alpha chain (EC 3.1.11.5)"
10659,10660c10646
< FT kinase/response regulator, hybrid ('one-component
< FT system')"
---
> FT kinase/response regulator, hybrid ('one-component system')"
12056,12057c12042
< FT /product="N-acetylmuramoyl-L-alanine amidase (EC
< FT 3.5.1.28)"
---
> FT /product="N-acetylmuramoyl-L-alanine amidase (EC 3.5.1.28)"
12957,12958c12942
< FT /product="Phosphatidate cytidylyltransferase (EC
< FT 2.7.7.41)"
---
> FT /product="Phosphatidate cytidylyltransferase (EC 2.7.7.41)"
13550,13551c13534
< FT /product="Glutamine synthetase type III, GlnN (EC
< FT 6.3.1.2)"
---
> FT /product="Glutamine synthetase type III, GlnN (EC 6.3.1.2)"
14344c14327,14328
< SQ
---
> XX
> SQ Sequence 3763317 BP; 772804 A; 1042979 C; 1057681 G; 776208 T; 113645 other;

The main differences are on line breaks.

Regards,
Anne.
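
A note on the wrapped /product lines in the diff above: where a long qualifier gets split depends on how the writer compares each candidate line with the maximum width. The sketch below is only an illustration and is not the code in Bio/SeqIO/InsdcIO.py; the helper name `wrap_qualifier`, the 21-character "FT" prefix and the 80-column limit are all assumptions made for the example.

```python
# Assumed layout: "FT" plus 19 spaces, so qualifier text starts in column 22.
FT_PREFIX = "FT" + " " * 19
# Assumed maximum line length, prefix included.
MAX_LEN = 80


def wrap_qualifier(text, max_len=MAX_LEN, prefix=FT_PREFIX):
    """Split qualifier text on spaces so each output line fits in max_len.

    A single word longer than the available width is kept whole on its
    own (over-long) line rather than being broken.
    """
    width = max_len - len(prefix)
    lines = []
    current = ""
    for word in text.split(" "):
        candidate = word if not current else current + " " + word
        # "<=" lets a line fill the last column; a strict "<" here would
        # wrap one column too early.
        if len(candidate) <= width or not current:
            current = candidate
        else:
            lines.append(prefix + current)
            current = word
    if current:
        lines.append(prefix + current)
    return lines


if __name__ == "__main__":
    example = '/product="Peptidyl-prolyl cis-trans isomerase (EC 5.2.1.8)"'
    for line in wrap_qualifier(example):
        print(line)
```

Under these assumptions the Peptidyl-prolyl qualifier from the diff is 59 characters long, so with the prefix it fills an 80-column line exactly and stays on one line. A strict `<` comparison would push the last word onto a continuation line, which is the pattern seen on the "<" side of the diff.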
On 11 Jan 2010, at 16:22, Peter wrote:

> Hi Anne,
>
> I've just checked in feature support to the new EMBL output in Bio.SeqIO
> (our main branch on git). If you could give that a test it would be very
> much appreciated. If you are on the dev mailing list, we can discuss
> issues there - otherwise we might as well continue on this thread.
>
> Thanks,
>
> Peter

--
Dr Anne Pajon - Pathogen Genomics, Team 81
Sanger Institute, Wellcome Trust Genome Campus, Hinxton
Cambridge CB10 1SA, United Kingdom
+44 (0)1223 494 798 (office) | +44 (0)7958 511 353 (mobile)

--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

From aboulia at gmail.com  Tue Jan 12 00:04:07 2010
From: aboulia at gmail.com (Kevin Lam)
Date: Tue, 12 Jan 2010 13:04:07 +0800
Subject: [Biopython] is there an updated tutorial on how to use the Wrappers for the new NCBI BLAST+ tools?
In-Reply-To: <320fb6e01001110908s4f19272bp71377c4f70004a24@mail.gmail.com>
References: <5b6410e1001020244s7b702ca8n760352052dba77c9@mail.gmail.com>
 <5b6410e1001030009v5420f146k26d723f33c4c5f59@mail.gmail.com>
 <320fb6e01001110838n476bbca8q6d949162f62405bc@mail.gmail.com>
 <320fb6e01001110908s4f19272bp71377c4f70004a24@mail.gmail.com>
Message-ID: <5b6410e1001112104y1ac0db9eoc565252710fc3334@mail.gmail.com>

On Tue, Jan 12, 2010 at 1:08 AM, Peter wrote:

> On Mon, Jan 11, 2010 at 4:46 PM, Kevin wrote:
> > Hi Peter,
> > I was thinking of porting the legacy blast script to python as u r right
> > about the helper script being inflexible.
>
> A python version of legacy_blast.pl isn't any more useful than the
> Perl version is it? Maybe I have misunderstood you.
>
> What would be nice is a way to help people update their old
> Biopython scripts which called legacy BLAST, so that they can
> be used on BLAST+ instead. I would expect in most cases this
> means scripts using the legacy BLAST "helper" functions in
> Bio.Blast.NCBIStandalone. One way to do this would be to
> add new BLAST+ versions of the "helper" functions (taking
> the same argument names as before), but that is just a stop
> gap (a temporary measure). We really want people using these
> old helper functions to switch to using the wrappers in
> Bio.Blast.Applications and subprocess instead.
>

Yes I was thinking of this when i meant porting/integrate. to integrate the
legacy blast perl script into Bio.Blast.NCBIStandalone

I didn't realise that Bio.Blast.Applications existed

> > The documentation bit was actually about my first email about any
> > updated doc on how to use blast+ with biopython
>
> I see. What do you think the current (Biopython 1.53) version
> of the tutorial needs in the BLAST chapter?
>
> http://biopython.org/DIST/docs/tutorial/Tutorial.html
> http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
>

http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc80
was exactly what I was looking for! Maybe i was looking at the wrong page
Thanks for pointing it out!

> Thanks,
>
> Peter
>

Cheers
Kevin

From biopython at maubp.freeserve.co.uk  Tue Jan 12 05:27:47 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 12 Jan 2010 10:27:47 +0000
Subject: [Biopython] Could Bio.SeqIO write EMBL file?
In-Reply-To:
References: <320fb6e01001060515y23804dd5xebc326647fa975d8@mail.gmail.com>
 <246181F8-03A2-49B2-960D-1FFBBC3E2865@sanger.ac.uk>
 <320fb6e01001070808k57311f43tb67c9cf916d27eab@mail.gmail.com>
 <320fb6e01001071014w50fa780ct2fdaccf2b9272cdc@mail.gmail.com>
 <320fb6e01001080448g24ff5308gb797f01cbab79196@mail.gmail.com>
 <320fb6e01001110822q60ac7103ndb3082cc3146735@mail.gmail.com>
Message-ID: <320fb6e01001120227s7b9d9ac5pa8a11837e0ee620b@mail.gmail.com>

On Mon, Jan 11, 2010 at 5:32 PM, Anne Pajon wrote:
> Hi Peter,
>
> Just tested now.
>
> It worked fine. Thanks a lot.

Great.

> Here is the diff between the EMBL output from Bio.SeqIO and the genbank
> output from Bio.SeqIO converted with the EMBOSS tool to an EMBL file:
>
> ...
>
> The main differences are on line breaks.

I hadn't yet done a comparison against EMBOSS (what version do you
have), but yes, it looks like I am wrapping the feature tables using a
shorter line length - we should check that, and it would be easy to
adjust in Bio/SeqIO/InsdcIO.py

Regarding the SQ line, that was on my "TODO" list. Including the
sequence length and base counts shouldn't be hard at all. If you want
to work on that it should just be a few lines in Bio/SeqIO/InsdcIO.py

Right now however, further testing of features would be my first
priority. See also:
http://lists.open-bio.org/pipermail/open-bio-l/2010-January/000604.html

There are other things still to do (e.g. missing fields on the ID line,
dates, and references).

Peter

From biopython at maubp.freeserve.co.uk  Tue Jan 12 07:33:35 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 12 Jan 2010 12:33:35 +0000
Subject: [Biopython] Could Bio.SeqIO write EMBL file?
In-Reply-To: <320fb6e01001120227s7b9d9ac5pa8a11837e0ee620b@mail.gmail.com>
References: <320fb6e01001060515y23804dd5xebc326647fa975d8@mail.gmail.com>
 <246181F8-03A2-49B2-960D-1FFBBC3E2865@sanger.ac.uk>
 <320fb6e01001070808k57311f43tb67c9cf916d27eab@mail.gmail.com>
 <320fb6e01001071014w50fa780ct2fdaccf2b9272cdc@mail.gmail.com>
 <320fb6e01001080448g24ff5308gb797f01cbab79196@mail.gmail.com>
 <320fb6e01001110822q60ac7103ndb3082cc3146735@mail.gmail.com>
 <320fb6e01001120227s7b9d9ac5pa8a11837e0ee620b@mail.gmail.com>
Message-ID: <320fb6e01001120433q72f577efg635ce232c666a46@mail.gmail.com>

On Tue, Jan 12, 2010 at 10:27 AM, Peter wrote:
> On Mon, Jan 11, 2010 at 5:32 PM, Anne Pajon wrote:
>> Here is the diff between the EMBL output from Bio.SeqIO and the genbank
>> output from Bio.SeqIO converted with the EMBOSS tool to an EMBL file:
>>
>> ...
>>
>> The main differences are on line breaks.
>
> I hadn't yet done a comparison against EMBOSS (what version do you
> have), but yes, it looks like I am wrapping the feature tables using a
> shorter line length - we should check that, and it would be easy to
> adjust in Bio/SeqIO/InsdcIO.py

The spec is pretty clear that the feature lines should be up to 80
characters. The premature wrapping was because I had been
testing length < 80 instead of <= 80, which is now fixed in git.

Peter

From p.j.a.cock at googlemail.com  Tue Jan 12 09:27:30 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 12 Jan 2010 14:27:30 +0000
Subject: [Biopython] Publication list
Message-ID: <320fb6e01001120627s268f0dd4k3a543e3b779507e6@mail.gmail.com>

Dear all,

We have a fairly extensive manually compiled list of over 150
publications citing, referencing or using Biopython on the wiki,
covering the first 10 years of Biopython:
http://biopython.org/wiki/Publications

*If your own Biopython related publications are missing from this list,
please add them. If they are listed in PubMed this is pretty easy.*

Keeping this up to date has been a tedious task, although now that we
have an up to date reference, which hopefully will get cited, this is a
little easier:
http://news.open-bio.org/news/2009/03/biopython-paper-published/

There is an example in the Biopython Tutorial of using Bio.Entrez and
PubMed Central (PMC) to find papers citing a reference, or you can just
use this URL:
http://www.ncbi.nlm.nih.gov/sites/entrez?db=pubmed&cmd=link&linkname=pubmed_pubmed_citedin&uid=19304878

Likewise, using Google Scholar also finds plenty of citations (although
I don't know if this URL will work long term):
http://scholar.google.com/scholar?cites=1800471218280477755&hl=en&as_sdt=2000

Perhaps just a few links like these will suffice for tracking future
publications? Or do people think we should continue to update the wiki
in the same style?
Regards,

Peter

From anaryin at gmail.com  Tue Jan 12 14:01:34 2010
From: anaryin at gmail.com (João Rodrigues)
Date: Tue, 12 Jan 2010 11:01:34 -0800
Subject: [Biopython] Biopython Digest, Vol 85, Issue 13
In-Reply-To:
References:
Message-ID:

Hello Peter,

Well, updating the wiki is cumbersome. Specially if done manually.
Why not update the wiki automatically with that link you just gave?

Regards,

João [...] Rodrigues
@ http://stanford.edu/~joaor/

From p.j.a.cock at googlemail.com  Tue Jan 12 16:48:27 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 12 Jan 2010 21:48:27 +0000
Subject: [Biopython] Publication list
Message-ID: <320fb6e01001121348h669f8ff9n28d805789e25931b@mail.gmail.com>

On Tue, Jan 12, 2010 at 7:01 PM, João Rodrigues wrote:
> Hello Peter,
>
> Well, updating the wiki is cumbersome. Specially if done manually.
> Why not update the wiki automatically with that link you just gave?
>
> Regards,
>
> João [...] Rodrigues
> @ http://stanford.edu/~joaor/

Yes, but how? The NCBI link could be used, or rather the Entrez API,
with a script to turn that into a list formatted for the wiki - which
could then be run every so often and manually pasted into the wiki.
Perhaps with a good understanding of PHP and mediawiki the whole
thing could be automated.
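
One way to picture the semi-automated route described above (a script around the Entrez API whose output is pasted into the wiki by hand) is sketched here. It is a minimal illustration using only the standard library, not an existing Biopython tool: `citedin_url` and `wiki_lines` are hypothetical helper names, the `* PMID n` list markup is an assumed wiki format, and fetching and parsing the XML reply is deliberately left out.

```python
from urllib.parse import urlencode

# NCBI E-utilities ELink endpoint.
EUTILS_ELINK = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def citedin_url(pmid):
    """Build an ELink URL asking for PubMed records that cite `pmid`."""
    params = {
        "dbfrom": "pubmed",
        "db": "pubmed",
        # Same link name as in the NCBI web URL quoted in this thread.
        "linkname": "pubmed_pubmed_citedin",
        "id": str(pmid),
    }
    return EUTILS_ELINK + "?" + urlencode(params)


def wiki_lines(pmids):
    """Format PubMed IDs as wiki list items (hypothetical markup).

    Duplicates are dropped and the IDs are sorted numerically.
    """
    return ["* PMID %s" % pmid for pmid in sorted(set(pmids), key=int)]


if __name__ == "__main__":
    print(citedin_url(19304878))
    for line in wiki_lines(["20", "3", "20"]):
        print(line)
```

The output of `wiki_lines` would still need to be pasted into the wiki manually, which is the part the thread describes as remaining work.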
However, citations via PubMed Central are a small subset (Google
scholar had about three times as many hits).

My point is even semi-automated, updating the wiki is still quite a bit
of work - and making it fully automated is also going to take some
effort. This is why I was suggesting the lazy option of providing a
few links on the publication list.

Peter

From p.j.a.cock at googlemail.com  Wed Jan 13 06:53:54 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 13 Jan 2010 11:53:54 +0000
Subject: [Biopython] Publication list
In-Reply-To: <4B4DB44A.2030202@dim.fm.usp.br>
References: <320fb6e01001121348h669f8ff9n28d805789e25931b@mail.gmail.com>
 <4B4DB44A.2030202@dim.fm.usp.br>
Message-ID: <320fb6e01001130353v36d846bfjca00ca447efcb100@mail.gmail.com>

2010/1/13 Daniel Silvestre :
> Hi people,
>
> It's possible to someone to keep the list in smth like Zotero/Mendeley
> or similar and then export is as Wiki citation templates with a wealth
> of information attached. It's not automated but is quite simple and
> fast. For instance Zotero can add a whole bunch of citations with just
> one click. Does have anyone try this option?
>
> Att.
> Daniel

Hi Daniel,

That sounds worth a try, although it still needs someone to keep
track of things. It may be a little easier than the current system
(people update the wiki manually, although it is usually me based
on running a PMC or Google Scholar search).

If we want to keep the wiki based list up to date in future, then
having a volunteer would be great. Other than that, we can try
and encourage people on the mailing list to add their own papers.

Peter

From p.j.a.cock at googlemail.com  Wed Jan 13 07:14:33 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 13 Jan 2010 12:14:33 +0000
Subject: [Biopython] Publication list
In-Reply-To: <4B4DB919.3090804@dim.fm.usp.br>
References: <320fb6e01001121348h669f8ff9n28d805789e25931b@mail.gmail.com>
 <4B4DB44A.2030202@dim.fm.usp.br>
 <320fb6e01001130353v36d846bfjca00ca447efcb100@mail.gmail.com>
 <4B4DB919.3090804@dim.fm.usp.br>
Message-ID: <320fb6e01001130414h31cccdc4nf983ac7915c83087@mail.gmail.com>

2010/1/13 Daniel Silvestre :
> Hi Peter,
>
> I think a mixed approach (i.e having a curator and stimulating people to
> add things) is the best option. I can easily create a database of
> citations in my Zotero. If you have a readable list of what you want to
> add, I can do it right now.
>
> Best,
> Daniel

If you want to cover everything in the database, can you work from
the wiki as it is? If you look at the wiki source, you should be able
to pull a PubMed ID for most cases (but not all, a few are not in
PubMed, or were done differently due to the wiki plugin not liking
accented characters in author names).
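
Pulling the PubMed IDs out of the wiki source could look roughly like the sketch below. The `pmid=NNN` markup is an assumption about how the wiki plugin stores references, the sample text is made up for the example, and `extract_pmids` is a hypothetical helper. Entries without a PubMed ID are simply skipped, in line with the "most cases, but not all" caveat.

```python
import re

# Assumed markup: each reference carries "pmid=<digits>" somewhere on its line.
PMID_PATTERN = re.compile(r"pmid\s*=\s*(\d+)", re.IGNORECASE)


def extract_pmids(wiki_source):
    """Return the unique PubMed IDs found, in order of first appearance."""
    seen = []
    for match in PMID_PATTERN.finditer(wiki_source):
        pmid = match.group(1)
        if pmid not in seen:
            seen.append(pmid)
    return seen


if __name__ == "__main__":
    # Made-up sample in the assumed markup; the third entry has no PubMed ID.
    sample = "\n".join([
        "#Cock2009 pmid=19304878",
        "#Example2008 pmid=123",
        "#NotInPubMed A manually typed reference",
        "#Repeat pmid=19304878",
    ])
    print(extract_pmids(sample))
```

The resulting list of IDs is the kind of input a reference manager or an Entrez lookup could start from; the manually typed entries would still need handling by hand.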
However, I would suggest just starting with 2010 papers onwards,
and trying to build a database automatically from citations of the
papers here:
http://biopython.org/wiki/Documentation#Papers

Peter

From daniel at dim.fm.usp.br  Wed Jan 13 07:14:17 2010
From: daniel at dim.fm.usp.br (Daniel Silvestre)
Date: Wed, 13 Jan 2010 10:14:17 -0200
Subject: [Biopython] Publication list
In-Reply-To: <320fb6e01001130353v36d846bfjca00ca447efcb100@mail.gmail.com>
 (sfid-+20100113-095359-+000.00-1@spamfilter.osbf.lua)
References: <320fb6e01001121348h669f8ff9n28d805789e25931b@mail.gmail.com>
 <4B4DB44A.2030202@dim.fm.usp.br>
 <320fb6e01001130353v36d846bfjca00ca447efcb100@mail.gmail.com>
 (sfid-+20100113-095359-+000.00-1@spamfilter.osbf.lua)
Message-ID: <4B4DB919.3090804@dim.fm.usp.br>

Hi Peter,

I think a mixed approach (i.e having a curator and stimulating people to
add things) is the best option. I can easily create a database of
citations in my Zotero. If you have a readable list of what you want to
add, I can do it right now.

Best,
Daniel

Peter Cock wrote:
> 2010/1/13 Daniel Silvestre :
>> Hi people,
>>
>> It's possible to someone to keep the list in smth like Zotero/Mendeley
>> or similar and then export is as Wiki citation templates with a wealth
>> of information attached. It's not automated but is quite simple and
>> fast. For instance Zotero can add a whole bunch of citations with just
>> one click. Does have anyone try this option?
>>
>> Att.
>> Daniel
>
> Hi Daniel,
>
> That sounds worth a try, although it still needs someone to keep
> track of things. It may be a little easier than the current system
> (people update the wiki manually, although it is usually me based
> on running a PMC or Google Scholar search).
>
> If we want to keep the wiki based list up to date in future, then
> having a volunteer would be great. Other than that, we can try
> and encourage people on the mailing list to add their own papers.
>
> Peter
>

---------------------------------------------------------------------
Esta mensagem pode conter informacao confidencial. Se voce nao for o
destinatario ou a pessoa autorizada a receber esta mensagem, nao podera
usar, copiar ou divulgar as informacoes nela contidas ou tomar qualquer
acao baseada nessas informacoes. Se voce recebeu esta mensagem por
engano, favor avisar imediatamente o remetente, respondendo o e-mail e,
em seguida, apague-o. Agradecemos sua cooperacao.

This message may contain confidential information. If you are not the
addressee or authorized person to receive it for the addressee, you
must not use, copy, disclose or take any action based on this message
or any information herein. If you have received this message in error,
please advise the sender immediately by replying this e-mail message
and delete it. Thanks in advance for your cooperation.
----------------------------------------------------------------------
DIM Faculdade de Medicina USP
----------------------------------------------------------------------

From daniel at dim.fm.usp.br  Wed Jan 13 07:59:32 2010
From: daniel at dim.fm.usp.br (Daniel Silvestre)
Date: Wed, 13 Jan 2010 10:59:32 -0200
Subject: [Biopython] Publication list
In-Reply-To: <4B4DB919.3090804@dim.fm.usp.br>
 (sfid-+20100113-103459-+000.00-1@spamfilter.osbf.lua)
References: <320fb6e01001121348h669f8ff9n28d805789e25931b@mail.gmail.com>
 <4B4DB44A.2030202@dim.fm.usp.br>
 <320fb6e01001130353v36d846bfjca00ca447efcb100@mail.gmail.com>
 <4B4DB919.3090804@dim.fm.usp.br>
 (sfid-+20100113-103459-+000.00-1@spamfilter.osbf.lua)
Message-ID: <4B4DC3B4.2080306@dim.fm.usp.br>

Hi again,

Zotero was able to retrieve 113 from the 152 citations in the wiki,
which means some of them are not properly formatted. So, I will rebuild
the list from scratch and test it on a wiki just to see what happens.

Att.
Daniel

From biopython at maubp.freeserve.co.uk Thu Jan 14 09:46:37 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 14 Jan 2010 14:46:37 +0000
Subject: [Biopython] Phasing out support for Python 2.4?
In-Reply-To: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com>
References: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com>
Message-ID: <320fb6e01001140646h2a576a31u747d946ffe3ec3f0@mail.gmail.com>

Hi all,

Biopython currently supports Python 2.4, 2.5 and 2.6 (and seems to work on the current Python 2.7 alpha), but it is probably time to start phasing out support for Python 2.4.

Reasons for encouraging Python 2.5+ include the built in support for sqlite3 (which we can use in the BioSQL wrapper) and ElementTree (which we use for the new phyloXML parser), both of which must currently be manually installed for Python 2.4.

There are other technical advantages, see this thread on our development mailing list:
http://lists.open-bio.org/pipermail/biopython-dev/2010-January/007236.html

We'd aim to follow our usual deprecation procedure, so at least two releases and one year before actually dropping support for Python 2.4. At that point older Linux distributions which ship with Python 2.4 probably won't be supported anyway.

Is dropping support for Python 2.4 going to cause anyone a problem? Please send any replies just to the main mailing list (not the announcement list).

Thanks,
Peter

From ivan at biodec.com Thu Jan 14 10:41:58 2010
From: ivan at biodec.com (Ivan Rossi)
Date: Thu, 14 Jan 2010 16:41:58 +0100 (CET)
Subject: [Biopython] Phasing out support for Python 2.4?
In-Reply-To: <320fb6e01001140736q2926afabv6b587ec73ddd8da2@mail.gmail.com>
References: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com> <320fb6e01001140646h2a576a31u747d946ffe3ec3f0@mail.gmail.com> <320fb6e01001140736q2926afabv6b587ec73ddd8da2@mail.gmail.com>
Message-ID:

On Thu, 14 Jan 2010, Peter wrote:
> Biopython currently supports Python 2.4, 2.5 and 2.6
> (and seems to work on the current Python 2.7 alpha),
> but it is probably time to start phasing out support for
> Python 2.4.
>
> ...
>
> We'd aim to follow our usual deprecation procedure,
> so at least two releases and one year before actually
> dropping support for Python 2.4. At that point older
> Linux distributions which ship with Python 2.4
> probably won't be supported anyway.
>
> Is dropping support for Python 2.4 going to cause anyone a problem?

Provided that the deprecation procedure above is followed it will be fine for us (BioDec). Otherwise it would have been a problem for plone4bio (http://plone4bio.org) since Plone3 only runs on Python 2.4. However Plone4, due in less than 6 months, runs on 2.6 and I am confident that in a year the transition of plone4bio to Plone4 will be finished.

If not, we will have to live with an older Biopython for some time...

Ivan

--
Ivan Rossi, PhD - ivan AT biodec dot com, ivan dot rossi3 AT unibo dot it
BioDec Srl, Via Calzavecchio 20/2, 40033 Casalecchio di Reno (BO), Italy
Phone: (+39)-051-0548263 - Fax: (+39)-051-7459582 - http://www.biodec.com

From biopython at maubp.freeserve.co.uk Thu Jan 14 12:05:24 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 14 Jan 2010 17:05:24 +0000
Subject: [Biopython] Phasing out support for Python 2.4?
In-Reply-To:
References: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com> <320fb6e01001140646h2a576a31u747d946ffe3ec3f0@mail.gmail.com> <320fb6e01001140736q2926afabv6b587ec73ddd8da2@mail.gmail.com>
Message-ID: <320fb6e01001140905g530a7f6cndcec86f7ea3b9576@mail.gmail.com>

On Thu, Jan 14, 2010 at 3:41 PM, Ivan Rossi wrote:
>
>>
>> Is dropping support for Python 2.4 going to cause anyone a problem?
>>
>
> Provided that the deprecation procedure above is followed it will be
> fine to us (BioDec). Otherwise it woud have been a problem to plone4bio
> (http://plone4bio.org) since Plone3 just runs on python 2.4. However
> Plone4, due in less than 6 months, runs on 2.6 and in a year I am
> confident that the transition of plone4bio to plone4 will be finished.
>
> On the contrary we will have to live with an older BioPy for some time...
>
> Ivan

OK - thanks for the heads up. We don't need to rush things, so if in six months time you really need us to keep Python 2.4 compatibility for a bit longer we can discuss that.

Peter

From biopython at maubp.freeserve.co.uk Thu Jan 14 12:32:22 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 14 Jan 2010 17:32:22 +0000
Subject: [Biopython] Phasing out support for Python 2.4?
In-Reply-To: <4B4F4071.7040601@fold.natur.cuni.cz>
References: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com> <320fb6e01001140646h2a576a31u747d946ffe3ec3f0@mail.gmail.com> <4B4F4071.7040601@fold.natur.cuni.cz>
Message-ID: <320fb6e01001140932t1bf9b62cse70d8c5ee69dc38a@mail.gmail.com>

On Thu, Jan 14, 2010 at 4:04 PM, Martin MOKREJŠ wrote:
>
> Hi Peter,
> I don't get this point much. What is the problem stating that with
> python 2.5+ one does not need to install an extra dependency while
> for 2.4 one needs _two_ modules?
> I don't think I want BioSQL nor sqlite so why would I have to upgrade.
> Would the requirement be in python language syntax incompatibility then
> I would NOT object, but in this situation ...
> Martin

Hi Martin,

This isn't just the issue of sqlite3 and ElementTree. There are several benefits to using more recent versions of Python, for example with an eye on the future for Python 3, and on a practical level it simplifies our testing to have one less version to worry about (especially once Python 2.7 is out, currently scheduled for June 2010).

We've already had minor issues with developers using Python 2.5+ syntax unwittingly which broke on Python 2.4 (nothing major, and it was easily fixed once the problem was spotted). If we continue to insist on Python 2.4 support, it may prove problematic if future potential contributors have existing code written for Python 2.5+ which would require significant re-factoring.

None of these concerns are pressing right now (and some are hypothetical), but I think you will agree that Python 2.4 is pretty old, and not widely used anymore. Having a clear plan in place for dropping it seems a sensible move, and once that happens we can start to take advantage of the language and library improvements Python 2.5 added.

Are you personally using Python 2.4? If so, could you tell us a little more - for example, is this a university server which would be difficult to update? Or do you require some other Python package which requires Python 2.4?

Thanks,
Peter

From mmokrejs at ribosome.natur.cuni.cz Thu Jan 14 12:52:13 2010
From: mmokrejs at ribosome.natur.cuni.cz (Martin Mokrejs)
Date: Thu, 14 Jan 2010 18:52:13 +0100
Subject: [Biopython] Phasing out support for Python 2.4?
In-Reply-To: <320fb6e01001140932t1bf9b62cse70d8c5ee69dc38a@mail.gmail.com>
References: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com> <320fb6e01001140646h2a576a31u747d946ffe3ec3f0@mail.gmail.com> <4B4F4071.7040601@fold.natur.cuni.cz> <320fb6e01001140932t1bf9b62cse70d8c5ee69dc38a@mail.gmail.com>
Message-ID: <4B4F59CD.5040006@ribosome.natur.cuni.cz>

Hi Peter,

I just had troubles with 2.5 to 2.6 move (mailman needed manual patches), and just envisioned that similarly 2.4 to 2.5 would be a trouble. So, personally I don't mind but I would prefer clear listings what modules require the newer features and having an option to skip them during install step them instead of having to blindly upgrade.

Personally I just use Bio.SeqIO and that is probably all I need. ^H^H^H^H^ and Entrez or PubMed or Efetch stuff, I got lost in the many biopython deprecations and module renames in the last years. I use the "latest" but forgot how is it currently named. ;-) ^H^H^H^H^H of course I know, efetch(). ;-)

Recently I had for example install some old Solaris 2.6 machine with some apps and imagine, was glad to have python 2.3 I think with gcc-3.x at all.

Martin

From mmokrejs at ribosome.natur.cuni.cz Thu Jan 14 12:51:58 2010
From: mmokrejs at ribosome.natur.cuni.cz (Martin Mokrejs)
Date: Thu, 14 Jan 2010 18:51:58 +0100
Subject: [Biopython] Phasing out support for Python 2.4?
In-Reply-To: <320fb6e01001140646h2a576a31u747d946ffe3ec3f0@mail.gmail.com>
References: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com> <320fb6e01001140646h2a576a31u747d946ffe3ec3f0@mail.gmail.com>
Message-ID: <4B4F59BE.40500@ribosome.natur.cuni.cz>

Hi Peter,

I don't get this point much. What is the problem stating that with python 2.5+ one does not need to install an extra dependency while for 2.4 one needs _two_ modules?
I don't think I want BioSQL nor sqlite so why would I have to upgrade. Would the requirement be in python language syntax incompatibility then I would NOT object, but in this situation ...

Martin

From biopython at maubp.freeserve.co.uk Thu Jan 14 13:05:37 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 14 Jan 2010 18:05:37 +0000
Subject: [Biopython] Phasing out support for Python 2.4?
In-Reply-To: <4B4F5993.9010600@fold.natur.cuni.cz>
References: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com> <320fb6e01001140646h2a576a31u747d946ffe3ec3f0@mail.gmail.com> <4B4F4071.7040601@fold.natur.cuni.cz> <320fb6e01001140932t1bf9b62cse70d8c5ee69dc38a@mail.gmail.com> <4B4F5993.9010600@fold.natur.cuni.cz>
Message-ID: <320fb6e01001141005s33cf2431xc32581a73540e080@mail.gmail.com>

On Thu, Jan 14, 2010 at 5:51 PM, Martin MOKREJŠ wrote:
>
> Hi Peter,
> I just had troubles with 2.5 to 2.6 move (mailman needed manual patches),
> and just envisioned that similarly 2.4 to 2.5 would be a trouble.
So, personally
> I don't mind but I would prefer clear listings what modules require the
> newer features and having an option to skip them during install step them
> instead of having to blindly upgrade. ...

It's a nice idea in theory, and I can see how it would be useful in some cases. However, it sounds quite complicated to implement, and very complex to keep up to date and tested properly. I don't think it's a good use of limited developer time.

> Recently I had for example install some old Solaris 2.6 machine with some
> apps and imagine, was glad to have python 2.3 I think with gcc-3.x at all.
> Martin

I sympathize - although we now have Python 2.6 installed, I think our cluster head node still has Python 2.3 as the default system Python (it's due for an upgrade, but systems administrators are rightly cautious).

Peter

From anbhat at utu.fi Fri Jan 15 11:07:51 2010
From: anbhat at utu.fi (Anirban Bhattachariya)
Date: Fri, 15 Jan 2010 18:07:51 +0200
Subject: [Biopython] codon usage
Message-ID:

Hi,

I found this script on "http://www.pasteur.fr/recherche/unites/sis/formation/python/exercises/seqrandom_count_codons_plot.py" which is supposed to count codon usage and plot it in a bar plot, but it is not working since some of the modules used in the script do not exist anymore. How can I modify the script to make it usable, or is there a better way to do that?
here is the code:

import Bio.Fasta
from sys import *
from string import *
from dna import codons
from mutateseq import mutateseq

file = argv[1]
handle = open(file)
it = Bio.Fasta.Iterator(handle, Bio.Fasta.SequenceParser())
count = {}
count_random = {}
seq = it.next()
while seq:
    for codon in codons(seq.seq.tostring()):
        if count.has_key(codon):
            count[codon] += 1
        else:
            count[codon] = 0
    mutableseq = seq.seq.tomutable()
    mutateseq(mutableseq,span=1000,p=0.1)
    for codon in codons(mutableseq.tostring()):
        if count_random.has_key(codon):
            count_random[codon] += 1
        else:
            count_random[codon] = 0
    seq = it.next()
handle.close()

#--------------------------------------------------------
# bar charts of codons frequencies
# - for legibility, 2 charts are built
# - both random and normal frequencies are dsplayed

from tkplot import *
from Numeric import *

def codon_sort(a,b):
    if a < b:
        return -1
    elif a > b:
        return 1
    else:
        return 0

for codon in count.keys():
    if not count_random.has_key(codon):
        count_random[codon] = 0
for codon in count_random.keys():
    if not count.has_key(codon):
        count[codon] = 0

labels=count.keys()
labels.sort(codon_sort)

w1=window(plot_title='Count codons',width=1000)
y=array(count.values())[:len(count)/2]
x=arange(len(y)+1)
w1.bar(y,x,label=labels[:len(count)/2])

w2=window(plot_title='Count codons(2)',width=1000)
y=array(count.values())[(len(count)/2)+1:]
x=arange(len(y)+1)
w2.bar(y,x,label=labels[(len(count)/2)+1:])

y=array(count_random.values())[:len(count_random)/2]
x=arange(len(y)+1)
w1.bar(y,x,label=labels[:len(count_random)/2])

y=array(count_random.values())[(len(count_random)/2)+1:]
x=arange(len(y)+1)
w2.bar(y,x,label=labels[(len(count_random)/2)+1:])

Regards,
Anirban

From biopython at maubp.freeserve.co.uk Fri Jan 15 13:07:43 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 15 Jan 2010 18:07:43 +0000
Subject: [Biopython] codon usage
In-Reply-To:
References:
Message-ID: <320fb6e01001151007r113fae98l1bac3fd21c3ac7f1@mail.gmail.com>

On Fri, Jan
15, 2010 at 4:07 PM, Anirban Bhattachariya wrote:
> Hi,
>
> I found this script on ... which is supposed to count codon usage and plot them
> in a bar plot but this not working since some of the modules used in the script
> does not exist anymore.

Hi Anirban,

Sadly that Pasteur Institute "Python course in Bioinformatics" is out of date. We have tried emailing the authors about this, and I offered to help update it - but so far I have had no reply. If anyone has current contact information please get in touch.
http://www.pasteur.fr/recherche/unites/sis/formation/python/

Looking at the code there are several issues:

The built in python module string still exists but is considered obsolete, string methods are generally preferred.

Bio.Fasta still exists but is obsolete, that bit can be replaced with Bio.SeqIO fairly easily. Not sure about the other bits (see below).

Numeric is also obsolete and no longer supported (it could use numpy instead). See http://numpy.scipy.org/

Then for the plotting itself I would suggest maybe matplotlib instead of tkplot (personal preference, I've never tried tkplot).
http://matplotlib.sourceforge.net/

There are examples of some simple plots using this in the current Biopython tutorial:
http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf

The relevant background to this example is here, with a small (compressed) example image where I can't read the captions:
http://www.pasteur.fr/recherche/unites/sis/formation/python/apas05.html#f_codon_freq

Have you seen a larger sample output image?

It should be pretty easy to recode this from scratch, but it would take a bit of "archaeology" to work out what exactly the old code did. It might be easier if you told us what you want to plot - a simple bar chart with an entry for each of the possible 64 codons (assuming non-ambiguous RNA or DNA is used)?
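
For reference, a minimal sketch of the counting step alone, written for current Python 3 rather than as a port of the Pasteur script. The helper name `count_codons` is hypothetical, the input is assumed to be unambiguous DNA read in frame 0, and in practice the string would presumably come from `str(record.seq)` for each record yielded by `Bio.SeqIO.parse(handle, "fasta")`. The random-mutation comparison in the old script is not reproduced here.

```python
from collections import Counter
from itertools import product


def count_codons(sequence):
    """Count non-overlapping codons in frame 0 of an unambiguous DNA string."""
    sequence = sequence.upper()
    # Start all 64 possible codons at zero so unused codons still get a bar.
    counts = Counter({"".join(c): 0 for c in product("ACGT", repeat=3)})
    # Step three letters at a time, ignoring a trailing partial codon.
    for i in range(0, len(sequence) - len(sequence) % 3, 3):
        codon = sequence[i:i + 3]
        if codon in counts:  # skip codons containing N or other ambiguity codes
            counts[codon] += 1
    return counts


example = count_codons("ATGGCCATGTAAGC")
for codon in sorted(c for c in example if example[c]):
    print(codon, example[codon])
```

The resulting mapping of 64 codons to counts can then be handed to any plotting library, for example as the heights of a matplotlib bar chart with the sorted codons as labels.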
Peter

From anbhat at utu.fi Sat Jan 16 03:38:28 2010
From: anbhat at utu.fi (Anirban Bhattachariya)
Date: Sat, 16 Jan 2010 10:38:28 +0200
Subject: [Biopython] Sequence annotation (Features)
Message-ID:

Hi,

I'm trying to download a protein sequence object (using ID or accession number) and then trying to print its variants (all variant sequences) from its features and annotations. I'm using pseudocholinesterase (http://www.uniprot.org/uniprot/P06276) as an example since it has a lot of natural variants. The problem is that when I try to access the features it says "0 features"; how can I access the features in a Swiss-Prot file as in the GenBank file format (as in section 4.6 of the tutorial)?

Here is my code:

from Bio import ExPASy
from Bio import SeqIO
from Bio import SeqFeature
handle =ExPASy.get_sprot_raw("P06276")
seq_record = SeqIO.read(handle, "swiss")
handle.close()
print seq_record.id
print seq_record.name
print seq_record.description
print repr(seq_record.seq)
print "Length %i" % len(seq_record)
print seq_record.annotations["keywords"]
print len(seq_record)
print "%i features" % (len(seq_record.features))

output:

P06276
CHLE_HUMAN
RecName: Full=Cholinesterase; EC=3.1.1.8; AltName: Full=Acylcholine acylhydrolase; AltName: Full=Choline esterase II; AltName: Full=Butyrylcholine esterase; AltName: Full=Pseudocholinesterase; Flags: Precursor;
Seq('MHSKVTIICIRFLFWFLLLCMLIGKSHTEDDIIIATKNGKVRGMNLTVFGGTVT...VGL', ProteinAlphabet())
Length 602
['3D-structure', 'Complete proteome', 'Direct protein sequencing', 'Disease mutation', 'Disulfide bond', 'Glycoprotein', 'Hydrolase', 'Polymorphism', 'Serine esterase', 'Signal']
602
0 features

Thanks in advance.
-Anirban

From biopython at maubp.freeserve.co.uk Sat Jan 16 06:21:52 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 16 Jan 2010 11:21:52 +0000
Subject: [Biopython] Sequence annotation (Features)
In-Reply-To:
References:
Message-ID: <320fb6e01001160321x5b425d4eqba4f752a1baa358d@mail.gmail.com>

On Sat, Jan 16, 2010 at 8:38 AM, Anirban Bhattachariya wrote:
> Hi,
>
> I'm trying to download a protein sequence object (using ID or
> accession number) and then trying to print its variants (all
> variant sequences) from its features and annotations.I'm using
> pseudocholinesterase (http://www.uniprot.org/uniprot/P06276
> as an example since it has lot of natural variants.
>
> The problem is when I'm trying to access the features its
> saying "0 features" ; how can I access the features in
> Swiss-Prot file like in genbank file format ( as in section
> 4.6 of the tutorial).

It's a known missing feature, although there is a patch here:
http://bugzilla.open-bio.org/show_bug.cgi?id=2235

You could help with testing/improving the patch in order to get Bio.SeqIO to do this in future, or in the short term use the underlying parser in Bio.SwissProt.

Regards,
Peter

From anbhat at utu.fi Sun Jan 17 15:11:45 2010
From: anbhat at utu.fi (Anirban Bhattachariya)
Date: Sun, 17 Jan 2010 22:11:45 +0200
Subject: [Biopython] How to print variants ?
Message-ID:

Hi,

I'm trying to download a protein sequence object (using ID or accession number) and then trying to print its variants (all variant sequences) from its features and annotations. My script works fine so far and it prints the number of sequence features. The problem is, how can I print its variants (it should work for any ID) and all variant sequences?

Here is my code so far:

from Bio import Entrez
from Bio import SeqIO
handle = Entrez.efetch(db=raw_input("What type of database? protein/nucleotide="),\
    rettype=raw_input("which datbase you want to use? For example:genbank(gb)="),\
    ID=raw_input("Enter the ID; for example human BChE contain lot of genetic \
varients,id is P06276="))
for seq_record in SeqIO.parse(handle, "gb") :
    print seq_record.id, seq_record.description[:50] + "..."
    print "Sequence length %i," % len(seq_record),
    print "%i features," % len(seq_record.features),
    print "from: %s" % seq_record.annotations["source"]
    print seq_record.annotations["keywords"]
    print repr(seq_record.seq)
    print "features:%i" % len(seq_record.features),
    # [ code for printing variants? ]
handle.close()

Thanks in advance.

-Anirban

From biopython at maubp.freeserve.co.uk Sun Jan 17 15:23:58 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 17 Jan 2010 20:23:58 +0000
Subject: [Biopython] How to print variants ?
In-Reply-To:
References:
Message-ID: <320fb6e01001171223g388ab1p18dcf8c9e7e3f581@mail.gmail.com>

On Sun, Jan 17, 2010 at 8:11 PM, Anirban Bhattachariya wrote:
> Hi,
>
> I'm trying to download a protein sequence object (using ID
> or accession number) and then trying to print its variants
> (all variant sequences) from its features and annotations.

I don't understand what you are asking for. Could you give us a specific worked example (an accession and what you want to print out)?

Peter

From anbhat at utu.fi Sun Jan 17 16:48:27 2010
From: anbhat at utu.fi (Anirban Bhattachariya)
Date: Sun, 17 Jan 2010 23:48:27 +0200
Subject: [Biopython] How to print variants ?
Message-ID:

Hi,

Suppose we want to study how mutations/SNPs affect binding or some other biochemical reaction. Let's also assume that we have a motif or motifs we want to test against. These variants are listed in sequence files, but only the original protein sequence is given there. To test motifs against the variants, we need the complete protein sequences. Let's say our protein has 75 variants, so we need the original + 75 protein sequences to test with motifs. My intention is to make a list of those 75 proteins.
For example, with slicing I can print:

print seq_record.features[5],
print seq_record.features[13],

Output:

location: [28:602]
ref: None:None
strand: None
qualifiers:
    Key: experiment, Value: ['experimental evidence, no additional details recorded']
    Key: gene, Value: ['BCHE']
    Key: gene_synonym, Value: ['CHE1']
    Key: note, Value: ['Cholinesterase. /FTId=PRO_0000008613.']
    Key: region_name, Value: ['Mature chain']
type: Region
location: [31:32]
ref: None:None
strand: None
qualifiers:
    Key: experiment, Value: ['experimental evidence, no additional details recorded']
    Key: gene, Value: ['BCHE']
    Key: gene_synonym, Value: ['CHE1']
    Key: note, Value: ['Missing (in BChE deficiency). /FTId=VAR_040011.']
    Key: region_name, Value: ['Variant']
Seq('I', IUPACProtein())
type: Region

Now I want to print only the features which are variants (in the example above, the second one, "print seq_record.features[13]"); in other words, I only want to print features with "Key: region_name, Value: ['Variant']" and ignore the other features.

For the final part, I want to print the sequence carrying each variant. For example:

location: [55:56]
ref: None:None
strand: None
qualifiers:
    Key: experiment, Value: ['experimental evidence, no additional details recorded']
    Key: gene, Value: ['BCHE']
    Key: gene_synonym, Value: ['CHE1']
    Key: note, Value: ['F -> I (in BChE deficiency). /FTId=VAR_040013.']
    Key: region_name, Value: ['Variant']
Seq('F', IUPACProtein())
type: Region

It says location: [55:56], and there is also this line:

Key: note, Value: ['F -> I (in BChE deficiency). /FTId=VAR_040013.']

This says that F in the original sequence has changed to I in the variant sequence. So I need the protein sequence in which position 55 is I instead of F.

Thanks,
Anirban

From biopython at maubp.freeserve.co.uk Mon Jan 18 05:05:05 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 18 Jan 2010 10:05:05 +0000
Subject: [Biopython] How to print variants ?
In-Reply-To:
References:
Message-ID: <320fb6e01001180205i524db7a6sfae8e42faaa6e281@mail.gmail.com>

On Sun, Jan 17, 2010 at 9:48 PM, Anirban Bhattachariya wrote:
> Hi ,
>
> Suppose we want to study how mutations/SNPs affect on binding or some other
> biochemical reaction. Let's also assume, that we have a motif or motifs we want
> to test against These variants are listed in sequence files, there is listed only the
> original protein sequence. For to test motives against variants, we need complete
> protein sequence. Let's say our protein has 75 variants, so we need original + 75
> protein sequences to test with motifs. My intention is to make a list of those 75
> proteins.

From your earlier emails you are working with a GenBank file for P06276:
http://lists.open-bio.org/pipermail/biopython/2010-January/006120.html
i.e. http://www.ncbi.nlm.nih.gov/protein/116353
or the original SwissProt/UniProt database, as a plain text "swiss" file:
http://www.uniprot.org/uniprot/P06276.txt

Now either the plain text GenBank or SwissProt files are going to force you to parse strings like "T -> M (in BChE deficiency; dbSNP:rs56309853)." to pull out this information in a usable form (whichever GenBank or SwissProt plain text parser you use). This is possible, but a bit fiddly.

Looking at the SwissProt page, they have a table of these variants:
http://www.uniprot.org/uniprot/P06276

UniProt also offer a GFF and FASTA file, neither of which are helpful here:
http://www.uniprot.org/uniprot/P06276.gff
http://www.uniprot.org/uniprot/P06276.fasta

However, the XML format looks much nicer:
http://www.uniprot.org/uniprot/P06276.xml

It has well tagged entries for each variant, e.g. T M -

Note there is some work in development to add parsing these UniProt XML files to SeqIO as a SeqRecord, but for your task it would probably be simpler to parse the XML yourself (using one of the standard Python XML libraries) to pull out just these variations.
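
As a rough sketch of the XML route, the example below uses the standard library's ElementTree on a small hand-written fragment. The fragment is hypothetical and only modelled on the general layout of a UniProt XML entry (a `feature` element of type "sequence variant" holding `original`, `variation` and a 1-based `position`); the namespace, element names, toy sequence, identifiers and the helper `variant_sequences` are all assumptions to check against the real P06276.xml file, not a Biopython API. Only single-site substitutions are handled; deletions ("Missing") are skipped.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed fragment modelled on a UniProt XML entry.
SAMPLE_XML = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <feature type="sequence variant" id="VAR_TOY1">
      <original>F</original>
      <variation>I</variation>
      <location><position position="3"/></location>
    </feature>
    <feature type="sequence variant" id="VAR_TOY2" description="Missing">
      <location><position position="5"/></location>
    </feature>
    <feature type="chain" id="PRO_TOY">
      <location><begin position="1"/><end position="8"/></location>
    </feature>
    <sequence>MHFKVTII</sequence>
  </entry>
</uniprot>"""

# Assumed namespace; adjust if the real file declares a different one.
NS = {"up": "http://uniprot.org/uniprot"}


def variant_sequences(xml_text):
    """Return {variant id: full sequence carrying that one substitution}."""
    entry = ET.fromstring(xml_text).find("up:entry", NS)
    wild_type = "".join(entry.find("up:sequence", NS).text.split())
    variants = {}
    for feature in entry.findall("up:feature", NS):
        if feature.get("type") != "sequence variant":
            continue
        original = feature.find("up:original", NS)
        variation = feature.find("up:variation", NS)
        position = feature.find("up:location/up:position", NS)
        if original is None or variation is None or position is None:
            continue  # e.g. "Missing" variants, which are not substitutions
        index = int(position.get("position")) - 1  # positions assumed 1-based
        if wild_type[index:index + len(original.text)] != original.text:
            continue  # annotation disagrees with the sequence; skip, don't guess
        variants[feature.get("id")] = (
            wild_type[:index] + variation.text
            + wild_type[index + len(original.text):]
        )
    return variants


print(variant_sequences(SAMPLE_XML))
```

Each value in the returned dictionary is a complete protein string, so the original plus one entry per substitution gives the list of sequences to scan with motifs.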
See also: http://lists.open-bio.org/pipermail/biopython-dev/2010-January/007244.html Which would you prefer? Working with XML or fuzzy string formats? Peter From aboulia at gmail.com Tue Jan 19 00:50:41 2010 From: aboulia at gmail.com (Kevin Lam) Date: Tue, 19 Jan 2010 13:50:41 +0800 Subject: [Biopython] SeqIO.index for csfasta files Message-ID: <5b6410e1001182150q36e01326mc218c06aaf780bab@mail.gmail.com> Hi all I know csfasta isn't listed in the SeqIO page but can I use index on it as well to retrieve subset of reads from csfasta ? (qual files are ok ) http://news.open-bio.org/news/2009/09/biopython-seqio-index/ Cheers Kevin From aboulia at gmail.com Tue Jan 19 03:31:43 2010 From: aboulia at gmail.com (Kevin Lam) Date: Tue, 19 Jan 2010 16:31:43 +0800 Subject: [Biopython] SeqIO.index for csfasta files memory issues Message-ID: <5b6410e1001190031k533f182fy939c7b4ecc22d72a@mail.gmail.com> What are the memory limitations for SeqIO.index? I am trying to create an index for a 4.5 gb csfasta file ~ 60 million reads but the script crashes at 5 Gb ram usage the machine has 31 Gb ram. #!/usr/bin/python from Bio import SeqIO data = SeqIO.index("Sample3.csfasta", "fasta") print data.keys()[:3] print data["853_15_296_F3"].seq Resource usage summary: CPU time : 381.24 sec. Max Memory : 5103 MB Max Swap : 5347 MB Max Processes : 4 Max Threads : 5 Traceback (most recent call last): File "./extractfasta.py", line 7, in ? 
data = SeqIO.index("Sample3.csfasta", "fasta") File "/home//biopython-1.53/build/lib.linux-x86_64-2.4/Bio/SeqIO/__init__.py", line 703, in index return indexer(filename, alphabet, key_function) File "/home//biopython-1.53/build/lib.linux-x86_64-2.4/Bio/SeqIO/_index.py", line 209, in __init__ "fasta", ">") File "/home//biopython-1.53/build/lib.linux-x86_64-2.4/Bio/SeqIO/_index.py", line 203, in __init__ self._record_key(line[marker_offset:].strip().split(None,1)[0], offset) File "/home//biopython-1.53/build/lib.linux-x86_64-2.4/Bio/SeqIO/_index.py", line 86, in _record_key dict.__setitem__(self, key, seek_position) MemoryError From biopython at maubp.freeserve.co.uk Tue Jan 19 04:32:45 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 19 Jan 2010 09:32:45 +0000 Subject: [Biopython] SeqIO.index for csfasta files In-Reply-To: <5b6410e1001182150q36e01326mc218c06aaf780bab@mail.gmail.com> References: <5b6410e1001182150q36e01326mc218c06aaf780bab@mail.gmail.com> Message-ID: <320fb6e01001190132v36bcfc91u8e61ed4c89c1af09@mail.gmail.com> On Tue, Jan 19, 2010 at 5:50 AM, Kevin Lam wrote: > Hi all > I know csfasta isn't listed in the SeqIO page but can I use index on it as > well to retrieve subset of reads from csfasta ? (qual files are ok ) > http://news.open-bio.org/news/2009/09/biopython-seqio-index/ > > Cheers > Kevin We don't explicitly support color space FASTA, but it should work. By that I mean the parser will just give you the sequences as is (e.g. A1231232) with a default generic alphabet object. Depending on the number of reads, and the size of the subset, you may find using Bio.SeqIO.parse and write together works better (lower memory requirements). I would suggest building a python set of the desired IDs, then using something like this: #Using set to test membership (hash based, faster than a list) wanted_ids = set(...) #This is a memory efficient generator expression: wanted = (rec for rec in SeqIO.parse(...) 
if rec.id in wanted_ids) handle = open(..., "w") count = SeqIO.write(wanted, handle, "fasta") handle.close() Peter From biopython at maubp.freeserve.co.uk Tue Jan 19 04:38:43 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 19 Jan 2010 09:38:43 +0000 Subject: [Biopython] SeqIO.index for csfasta files memory issues In-Reply-To: <5b6410e1001190031k533f182fy939c7b4ecc22d72a@mail.gmail.com> References: <5b6410e1001190031k533f182fy939c7b4ecc22d72a@mail.gmail.com> Message-ID: <320fb6e01001190138o191101a5h24cc9286a4e0ce6a@mail.gmail.com> On Tue, Jan 19, 2010 at 8:31 AM, Kevin Lam wrote: > What are the memory limitations for SeqIO.index? > I am trying to create an index for a 4.5 gb csfasta file > ~ 60 million reads > but the script crashes at 5 Gb ram usage > the machine has 31 Gb ram. What OS are you using (and is it 64bit)? What Python are you using (and is it 64bit)? What version of Biopython are you using? I've never tried a file with quite that many reads, but crashing at about 5GB is odd. I wonder if this is a 4GB limit somewhere in your system (e.g. running 32bit Python). Adding some debug statements we could see when it falls over (i.e. how many reads had been indexed). Long term, really really big indexes will be too big to hold in memory as a python dict (record IDs and file offsets). Therefore we have done a little work looking at disk based indexes, including sqlite3. This does make building the index much slower though. For your immediate task, try a simple iteration through the records, selecting the records of interest using Bio.SeqIO.parse and write as per my other email. 
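A minimal sketch of the streaming idea described above, written in plain Python 3 rather than with Bio.SeqIO, so it only illustrates the technique and is not Biopython's own API. The names iter_fasta and filter_fasta are hypothetical, the record ID is assumed to be the first word after ">", and the two demo reads are shortened, made-up examples in the style of a csfasta file.

```python
import io


def _record_id(header):
    """Return the first word after the '>' marker, or '' if there is none."""
    words = header[1:].split(None, 1)
    return words[0] if words else ""


def iter_fasta(handle):
    """Yield (record_id, header_line, sequence_lines) one record at a time."""
    header, lines = None, []
    for line in handle:
        if line.startswith(">"):
            if header is not None:
                yield _record_id(header), header, lines
            header, lines = line.rstrip("\n"), []
        elif header is not None:
            # Lines before the first '>' (e.g. '#' comments) are skipped.
            lines.append(line.rstrip("\n"))
    if header is not None:
        yield _record_id(header), header, lines


def filter_fasta(in_handle, out_handle, wanted_ids):
    """Copy only the wanted records; return how many were written."""
    wanted_ids = set(wanted_ids)  # hash based membership test
    count = 0
    for rec_id, header, seq_lines in iter_fasta(in_handle):
        if rec_id in wanted_ids:
            out_handle.write(header + "\n")
            for seq_line in seq_lines:
                out_handle.write(seq_line + "\n")
            count += 1
    return count


# Small in-memory demonstration with made-up, shortened reads.
demo_in = io.StringIO(">427_22_20_F3\nT3313\n>427_22_29_F3\nT3010\n")
demo_out = io.StringIO()
kept = filter_fasta(demo_in, demo_out, {"427_22_29_F3"})
print(kept)
print(demo_out.getvalue(), end="")
```

Only the set of wanted IDs and the record currently being read are held in memory, which is why this scales to files with tens of millions of reads.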
This way you'll only have to keep in memory one record at a time, and a list/set of the wanted IDs: http://lists.open-bio.org/pipermail/biopython/2010-January/006128.html Peter From alvin at pasteur.edu.uy Tue Jan 19 12:45:23 2010 From: alvin at pasteur.edu.uy (Alvaro F Pena Perea) Date: Tue, 19 Jan 2010 15:45:23 -0200 Subject: [Biopython] Subprocess:Clustalw Message-ID: <3d7a3fc11001190945s36c2b53m1fd60ad9c23862ff@mail.gmail.com> Hi all, I'm new to Biopython and I am trying to learn how to use Bio.Align. I have some questions about running Clustalw within a python script. I run this without problems: ### from Bio.Align.Applications import ClustalwCommandline import sys import subprocess STDO = open("stdo.txt", "w") STDE = open("stde.txt", "w") cline = ClustalwCommandline("clustalw2",infile="opuntia.fasta") return_code = subprocess.call(str(cline), stderr = STDE, shell=(sys.platform!="win32")) print return_code ### but the point is that I would like to choose my "infile" from argv. I mean, something like this: archive = open(sys.argv[1]) cline = ClustalwCommandline("clustalw2",infile=archive) I realized that "str" in subprocess doesn't allow this str(cline) I wonder if it could be possible to run the algorithm from argv or any handle. Thanks in advance Álvaro Pena From p.j.a.cock at googlemail.com Tue Jan 19 13:57:55 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 19 Jan 2010 18:57:55 +0000 Subject: [Biopython] Subprocess:Clustalw In-Reply-To: <3d7a3fc11001190945s36c2b53m1fd60ad9c23862ff@mail.gmail.com> References: <3d7a3fc11001190945s36c2b53m1fd60ad9c23862ff@mail.gmail.com> Message-ID: <320fb6e01001191057l4ffffa24k3b951817ca038d91@mail.gmail.com> Hi, I'm a little unclear what you are trying to do - clustalw doesn't let you send input via stdin or get the alignment by stdout. Other tools like muscle can do this and our tutorial has examples of this. 
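Since clustalw reads its input from a named file, one option is to take the path from the command line and pass it through as a plain string instead of an open handle. The following is only a sketch in Python 3 using the standard library: it assumes a clustalw2 executable on the PATH and the -infile= option that the ClustalwCommandline call in the question generates, and the helper names build_clustalw_args and run_clustalw are hypothetical.

```python
import os
import subprocess


def build_clustalw_args(fasta_path, exe="clustalw2"):
    """Return the argument list for aligning the named FASTA file."""
    if not isinstance(fasta_path, str):
        raise TypeError("expected a file name, not an open handle")
    if not os.path.isfile(fasta_path):
        raise IOError("input file not found: %s" % fasta_path)
    return [exe, "-infile=" + fasta_path]


def run_clustalw(fasta_path, stderr_path="stde.txt"):
    """Run the tool and return its exit code; stderr goes to a file."""
    args = build_clustalw_args(fasta_path)
    with open(stderr_path, "w") as err:
        # An argument list needs no shell, so quoting is not an issue.
        return subprocess.call(args, stderr=err)


# Typical use from a script (assumes the path is the first argument):
#     import sys
#     return_code = run_clustalw(sys.argv[1])
```

Passing a list of arguments avoids the shell=True switch used in the original script and behaves the same way on Windows and Unix.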
Peter From schafer at rostlab.org Tue Jan 19 16:56:00 2010 From: schafer at rostlab.org (=?ISO-8859-1?Q?Christian_Sch=E4fer?=) Date: Tue, 19 Jan 2010 16:56:00 -0500 Subject: [Biopython] Subprocess:Clustalw In-Reply-To: <3d7a3fc11001190945s36c2b53m1fd60ad9c23862ff@mail.gmail.com> References: <3d7a3fc11001190945s36c2b53m1fd60ad9c23862ff@mail.gmail.com> Message-ID: <4B562A70.6000201@rostlab.org> On 01/19/2010 12:45 PM, Alvaro F Pena Perea wrote: > but he point is that I would like to choose my "infile" from argv. I mean, > something like this: > > archive = open(sys.argv[1]) > cline = ClustalwCommandline("clustalw2",infile=archive) I'm not sure I understand the significance of your approach. If this is about reading the path to the fasta file from commandline, why don't you do just the following: """ Assuming, sys.argv[1] holds the path to the fasta file """ archive = sys.argv[1] #Instead of archive = open(sys.argv[1]) cline = ClustalwCommandline("clustalw2",infile=archive) Chris From aboulia at gmail.com Tue Jan 19 18:43:11 2010 From: aboulia at gmail.com (Kevin) Date: Wed, 20 Jan 2010 07:43:11 +0800 Subject: [Biopython] SeqIO.index for csfasta files memory issues In-Reply-To: <320fb6e01001190138o191101a5h24cc9286a4e0ce6a@mail.gmail.com> References: <5b6410e1001190031k533f182fy939c7b4ecc22d72a@mail.gmail.com> <320fb6e01001190138o191101a5h24cc9286a4e0ce6a@mail.gmail.com> Message-ID: Hi Peter It's 64 bit centos shared cluster I assumed all the rest of Python and such are the same as well but I may be wrong. It's version 1.53 I believe for biopython I wanted random access as I need half the reads separated this way and I think it is faster. Guess I have to do it the old way Thanks Kev Sent from my iPod On 19-Jan-2010, at 5:38 PM, Peter wrote: > On Tue, Jan 19, 2010 at 8:31 AM, Kevin Lam wrote: >> What are the memory limitations for SeqIO.index? 
>> I am trying to create an index for a 4.5 gb csfasta file >> ~ 60 million reads >> but the script crashes at 5 Gb ram usage >> the machine has 31 Gb ram. > > What OS are you using (and is it 64bit)? > What Python are you using (and is it 64bit)? > What version of Biopython are you using? > > I've never tried a file with quite that many reads, but > crashing at about 5GB is odd. I wonder if this is a 4GB > limit somewhere in your system (e.g. running 32bit > Python). Adding some debug statements we could > see when it falls over (i.e. how many reads had > been indexed). > > Long term, really really big indexes will be too big > to hold in memory as a python dict (record IDs and > file offsets). Therefore we have done a little work > looking at disk based indexes, including sqlite3. > This does make building the index much slower > though. > > For your immediate task, try a simple iteration > through the records, selecting the records of > interest using Bio.SeqIO.parse and write as per > my other email. This way you'll only have to keep > in memory one record at a time, and a list/set > of the wanted IDs: > http://lists.open-bio.org/pipermail/biopython/2010-January/006128.html > > Peter From biopython at maubp.freeserve.co.uk Wed Jan 20 06:14:26 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 20 Jan 2010 11:14:26 +0000 Subject: [Biopython] SeqIO.index for csfasta files memory issues In-Reply-To: References: <5b6410e1001190031k533f182fy939c7b4ecc22d72a@mail.gmail.com> <320fb6e01001190138o191101a5h24cc9286a4e0ce6a@mail.gmail.com> Message-ID: <320fb6e01001200314q1d7fb2b9i7bf6d2a5e1b82fdc@mail.gmail.com> On Tue, Jan 19, 2010 at 11:43 PM, Kevin wrote: > > Hi Peter > It's 64 bit centos shared cluster OK, good. > I assumed all the rest of Python and such are the same as well > but I may be wrong. It would be worth checking out - if the Python installed is just 32bit, then hitting a memory limit at 4GB would make sense. 
> It's version 1.53 I believe for biopython OK, good. > I wanted random access as I need half the reads separated this way > and I think it is faster. Guess I have to do it the old way. Could you show us a sample of the data - say just the first 20 reads? I could then generate a large test file in a similar style to see what happens if I try and index it on my machine. It would also be nice if you would allow us to use the sample for a Biopython unit test. Thanks, Peter From alvin at pasteur.edu.uy Wed Jan 20 07:07:28 2010 From: alvin at pasteur.edu.uy (Alvaro F Pena Perea) Date: Wed, 20 Jan 2010 10:07:28 -0200 Subject: [Biopython] Subprocess:Clustalw In-Reply-To: <320fb6e01001191057l4ffffa24k3b951817ca038d91@mail.gmail.com> References: <3d7a3fc11001190945s36c2b53m1fd60ad9c23862ff@mail.gmail.com> <320fb6e01001191057l4ffffa24k3b951817ca038d91@mail.gmail.com> Message-ID: <3d7a3fc11001200407j63d44f90t8408a19c1a071c11@mail.gmail.com> Ok. Thank you very much. I will try with muscle. ?lvaro Pena 2010/1/19 Peter Cock > Hi, > > I'm a little unclear what you are trying to do - clustalw doesn't let > you send input via stdin or get the alignment by stdout. Other tools > like muscle can do this and our tutorial has examples of this. 
> > Peter > From biopython at maubp.freeserve.co.uk Thu Jan 21 06:31:29 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 21 Jan 2010 11:31:29 +0000 Subject: [Biopython] SeqIO.index for csfasta files memory issues In-Reply-To: <5b6410e1001210110t39e8ef06ne8681e491cdbf82d@mail.gmail.com> References: <5b6410e1001190031k533f182fy939c7b4ecc22d72a@mail.gmail.com> <320fb6e01001190138o191101a5h24cc9286a4e0ce6a@mail.gmail.com> <320fb6e01001200314q1d7fb2b9i7bf6d2a5e1b82fdc@mail.gmail.com> <5b6410e1001210110t39e8ef06ne8681e491cdbf82d@mail.gmail.com> Message-ID: <320fb6e01001210331g1de02917qe8cfe72b13906e0e@mail.gmail.com> On Thu, Jan 21, 2010 at 9:10 AM, Kevin Lam wrote: > > Yups python is 64bit >>>> platform.architecture() > ('64bit', 'ELF') Hmm - I was hoping that wouldn't be the case. > the sample 1 file has > 48412673 reads > here's the top 20 reads > > head -n 20 Sample2.csfasta >>427_22_20_F3 > T33133100313302011000100000000000000000000010000000 >>427_22_29_F3 > T30101002122001001000300000200030000000002121003000 >>427_22_44_F3 > T12223211021010030202120002130211102100003002010303 >>427_22_52_F3 > T32031331333133301101223023301013011032103032122123 >>427_22_58_F3 > T23010130111130001000202232101031001010000000000000 >>427_22_66_F3 > T10303202110222020010200311000110011001001111000110 >>427_22_72_F3 > T23332102212232122131103321303322213023003233100320 >>427_22_87_F3 > T20112313302013303131123323002203111122211310000010 >>427_22_113_F3 > T32021321020200032003222000221030102023012000003013 >>427_22_169_F3 > T22012322202220000000100000100000000000000010100020 Thanks Kevin, I wrote a trivial script to generate a big fake Solid CSFASTA like this: import random total = 48412673 # 48 million count = 0 handle = open("big_fake_solid.csfasta", "w") for i in range(1000): for j in range(100): for k in range(1000): for h in range(256): nuc = random.choice("ACGT") #I could make the color sequence random, but #there is no real point for testing indexing: color_changes 
= "33133100313302011000100000000000000000000010000000" handle.write(">%03i_%02i_%02i_%02X\n%s%s\n" \ % (i,j,k,h, nuc, color_changes)) count += 1 if count >= total : break if count >= total : break #print "Done %i so far" % count if count >= total : break if count >= total : break handle.close() I then tried indexing with Bio.SeqIO.index("big_fake_solid.csfasta","fasta") using Biopython 1.53+ (latest code from git) on Mac OS X 10.5 Leopard with 12GB of RAM, using the Apple provided Python 2.5 installation. I watched the process in system monitor and it failed when memory consumption reached 4GB, with a repeated message: Python(608) malloc: *** mmap(size=262144) failed (error code=12) *** error: can't allocate region *** set a breakpoint in malloc_error_break to debug and traceback: Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/Bio/SeqIO/_index.py", line 262, in __init__ self._record_key(line[marker_offset:].strip().split(None,1)[0], offset) It turns out that my copy of Python (the default Apple provided one on Leopard) seems to be just 32bit, $ python Python 2.5.2 (r252:60911, Feb 22 2008, 07:57:53) [GCC 4.0.1 (Apple Computer, Inc. build 5363)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> import platform >>> platform.architecture() ('32bit', '') So *if* your system was running 32bit python, I would expect it to fail like this. I'd like to try a 64bit python locally - either I could install this manually, or look for a big memory Linux box to try. 
Or, If I updated my OS, it looks like Mac OS X 10.6 Snow Leopard includes 64bit Python 2.6, plus a Python 2.5 which is only 32bit: http://developer.apple.com/mac/library/documentation/Darwin/Reference/ManPages/man1/python.1.html Peter From biopython at maubp.freeserve.co.uk Thu Jan 21 06:58:12 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 21 Jan 2010 11:58:12 +0000 Subject: [Biopython] SeqIO.index for csfasta files memory issues In-Reply-To: <320fb6e01001210331g1de02917qe8cfe72b13906e0e@mail.gmail.com> References: <5b6410e1001190031k533f182fy939c7b4ecc22d72a@mail.gmail.com> <320fb6e01001190138o191101a5h24cc9286a4e0ce6a@mail.gmail.com> <320fb6e01001200314q1d7fb2b9i7bf6d2a5e1b82fdc@mail.gmail.com> <5b6410e1001210110t39e8ef06ne8681e491cdbf82d@mail.gmail.com> <320fb6e01001210331g1de02917qe8cfe72b13906e0e@mail.gmail.com> Message-ID: <320fb6e01001210358m3b274028re98c64b8e4fe772a@mail.gmail.com> On Thu, Jan 21, 2010 at 11:31 AM, Peter wrote: > It turns out that my copy of Python (the default Apple provided one > on Leopard) seems to be just 32bit, > > $ python > Python 2.5.2 (r252:60911, Feb 22 2008, 07:57:53) > [GCC 4.0.1 (Apple Computer, Inc. build 5363)] on darwin > Type "help", "copyright", "credits" or "license" for more information. 
>>>> import platform >>>> platform.architecture() > ('32bit', '') Another check, $ file /usr/bin/python /usr/bin/python: Mach-O universal binary with 2 architectures /usr/bin/python (for architecture ppc7400): Mach-O executable ppc /usr/bin/python (for architecture i386): Mach-O executable i386 $ which python /Library/Frameworks/Python.framework/Versions/Current/bin/python $ file /Library/Frameworks/Python.framework/Versions/Current/bin/python /Library/Frameworks/Python.framework/Versions/Current/bin/python: Mach-O universal binary with 2 architectures /Library/Frameworks/Python.framework/Versions/Current/bin/python (for architecture i386): Mach-O executable i386 /Library/Frameworks/Python.framework/Versions/Current/bin/python (for architecture ppc): Mach-O executable ppc > So *if* your system was running 32bit python, I would expect it to > fail like this. I'd like to try a 64bit python locally - either I could > install this manually, ... >From reading up, it seems that while python.org does have dmg installers for Mac OS X, currently they only support i386 and ppc (not 64bit). While in theory I could download an install Python from source, it sounds a little fiddly, and a don't want to mess up my machine. > ... or look for a big memory Linux box to try. This may be easier for me! 
Peter From biopython at maubp.freeserve.co.uk Thu Jan 21 08:03:33 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 21 Jan 2010 13:03:33 +0000 Subject: [Biopython] SeqIO.index for csfasta files memory issues In-Reply-To: <320fb6e01001210358m3b274028re98c64b8e4fe772a@mail.gmail.com> References: <5b6410e1001190031k533f182fy939c7b4ecc22d72a@mail.gmail.com> <320fb6e01001190138o191101a5h24cc9286a4e0ce6a@mail.gmail.com> <320fb6e01001200314q1d7fb2b9i7bf6d2a5e1b82fdc@mail.gmail.com> <5b6410e1001210110t39e8ef06ne8681e491cdbf82d@mail.gmail.com> <320fb6e01001210331g1de02917qe8cfe72b13906e0e@mail.gmail.com> <320fb6e01001210358m3b274028re98c64b8e4fe772a@mail.gmail.com> Message-ID: <320fb6e01001210503m59fb9d82pf4e6a25c8d86d1ce@mail.gmail.com> On Thu, Jan 21, 2010 at 11:58 AM, Peter wrote: > On Thu, Jan 21, 2010 at 11:31 AM, Peter wrote: >> ... or look for a big memory Linux box to try. > > This may be easier for me! That worked :) This was a 48 million entry ~3GB faked color space FASTA file. It took about 10 mins and about 7GB (I missed the final memory usage figure as I was only checking in top), using Biopython 1.53 on a 64bit installation of Python 2.4.3: $ python Python 2.4.3 (#1, Jan 21 2009, 01:11:33) [GCC 4.1.2 20071124 (Red Hat 4.1.2-42)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import platform >>> platform.architecture() ('64bit', 'ELF') Could you double check the version of Python on the nodes of your cluster (just in case the head node is using something different, or some of the nodes are 32bit and others are 64bit)? 
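A small script along these lines could be run on each node to compare interpreters. It is a sketch for Python 3 using only the standard library; the function names python_bitness and describe_interpreter are made up for illustration, and the pointer-size check is an extra cross-check alongside the platform.architecture() call used earlier in the thread.

```python
import platform
import struct
import sys


def python_bitness():
    """Return the pointer size of the running interpreter, in bits."""
    return struct.calcsize("P") * 8


def describe_interpreter():
    """Collect the details worth comparing between cluster nodes."""
    return {
        "node": platform.node(),
        "machine": platform.machine(),
        "executable": sys.executable,
        "version": platform.python_version(),
        "architecture": platform.architecture()[0],
        "bits": python_bitness(),
    }


for key, value in sorted(describe_interpreter().items()):
    print("%s: %s" % (key, value))
```

A 64bit operating system can still run a 32bit Python build, so the interpreter itself is what needs checking, not just the machine.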
Peter From mitlox at op.pl Sat Jan 23 21:14:53 2010 From: mitlox at op.pl (xyz) Date: Sun, 24 Jan 2010 12:14:53 +1000 Subject: [Biopython] BLAST database access Message-ID: <4B5BAD1D.2020004@op.pl> Hello, I have run MegaBlast and the results I can parse for example with: input_file = open("megablastres.txt","r") for line in input_file.readlines(): if line[0] == "#" : #header line, ignore else: parts = line.rstrip().split() print "Subject id = %s" % parts[1] How could I retrieve the sequence which belong to subject id from BLAST database with BioPython? Thank you in advance. Best regards From biopython at maubp.freeserve.co.uk Sun Jan 24 08:47:14 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 24 Jan 2010 13:47:14 +0000 Subject: [Biopython] BLAST database access In-Reply-To: <4B5BAD1D.2020004@op.pl> References: <4B5BAD1D.2020004@op.pl> Message-ID: <320fb6e01001240547n72ed3a0cx6ae8cdadfb80ff5a@mail.gmail.com> On Sun, Jan 24, 2010 at 2:14 AM, xyz wrote: > Hello, > I have run MegaBlast and the results I can parse for example with: > > input_file = open("megablastres.txt","r") > for line in input_file.readlines(): > if line[0] == "#" : > #header line, ignore > else: > parts = line.rstrip().split() > print "Subject id = %s" % parts[1] If all you want is the subject ID, that looks simple. I guess you are using one of the simple tabular output formats? > How could I retrieve the sequence which belong to subject id > from BLAST database with BioPython? Are you using a local BLAST database, or an online one? If online, I would try using the hit ID to search via the NCBI Entrez interface, see the Bio.Entrez chapter in our tutorial. If the database is local, then the NCBI provides a tool as part of the BLAST suite for this called fastacmd. 
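For collecting the subject IDs before looking them up, the parsing loop from the question can be written so that the comment-only branch is valid Python. This is a sketch in Python 3 that assumes whitespace-separated tabular output with the subject ID in the second column and header lines starting with "#", as in the question; the function name subject_ids and the sample lines are hypothetical.

```python
def subject_ids(lines):
    """Return the unique subject IDs in first-seen order."""
    seen = set()
    ordered = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # blank or header line, ignore
        parts = line.rstrip().split()
        if len(parts) < 2:
            continue  # not a hit line
        subject = parts[1]
        if subject not in seen:
            seen.add(subject)
            ordered.append(subject)
    return ordered


# Made-up lines in the style of tabular output, for illustration only.
sample = [
    "# Fields: Query id, Subject id, % identity\n",
    "query1\tsubjectA\t99.0\n",
    "query2\tsubjectA\t97.5\n",
    "query3\tsubjectB\t91.2\n",
]
for subject in subject_ids(sample):
    print("Subject id = %s" % subject)
```

With a real file, passing the open handle works the same way, for example subject_ids(open("megablastres.txt")). The resulting list could then be written one ID per line and handed to the retrieval tool.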
Peter From pedro.al at fenhi.uh.cu Mon Jan 25 11:08:21 2010 From: pedro.al at fenhi.uh.cu (Yasser Almeida Hernández) Date: Mon, 25 Jan 2010 11:08:21 -0500 Subject: [Biopython] Rename atoms Message-ID: <20100125110821.vesao6besg0wggcs@correo.fenhi.uh.cu> Hi all... Is it possible to rename atoms in .pdb files? Thanks -- Lic. Yasser Almeida Hernández Center of Molecular Inmunology (CIM) Nanobiology Group P.O.Box 16040, Havana, Cuba Phone: (537) 271-7933, ext. 246 ---------------------------------------------------------------- Correo FENHI From rafal.b.pawlak at gmail.com Mon Jan 25 16:38:17 2010 From: rafal.b.pawlak at gmail.com (x y) Date: Mon, 25 Jan 2010 22:38:17 +0100 Subject: [Biopython] GI number Message-ID: <586d60191001251338m42717464y6bcf0b2d411a62b0@mail.gmail.com> hello, how can I extract the GI number in this program? from Bio import SeqIO handle = open("xyz.fasta") for seq_record in SeqIO.parse(handle, "fasta"): print seq_record.description handle.close() ex. Osa_SPT6 gi|222632083|gb|EEE64215.1| hypothetical protein Os05g41510.1_ORYZA [Oryza sativa Japonica Group] rafal pawlak From p.j.a.cock at googlemail.com Mon Jan 25 18:41:46 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 25 Jan 2010 23:41:46 +0000 Subject: [Biopython] GI number In-Reply-To: <586d60191001251338m42717464y6bcf0b2d411a62b0@mail.gmail.com> References: <586d60191001251338m42717464y6bcf0b2d411a62b0@mail.gmail.com> Message-ID: <320fb6e01001251541i3842f640me86d4b88e064af86@mail.gmail.com> On Mon, Jan 25, 2010 at 9:38 PM, x y wrote: > hello, > how can I extract the GI number in this program? > > from Bio import SeqIO > handle = open("xyz.fasta") > for seq_record in SeqIO.parse(handle, "fasta"): > print seq_record.description > handle.close() > > ex. 
> Osa_SPT6 gi|222632083|gb|EEE64215.1| hypothetical protein Os05g41510.1_ORYZA > [Oryza sativa Japonica Group] > > rafal pawlak I would just the Python string split method on this string - assuming all your record use the same layout, e.g. Something like this: gi = record.description.split()[1].split("|")[1] There are related examples in the tutorial, search for "get_accession" which are a bit more robust because they check the string follows the expected format. You could alternatively use a regular expression. Peter From mitlox at op.pl Mon Jan 25 22:58:53 2010 From: mitlox at op.pl (xyz) Date: Tue, 26 Jan 2010 13:58:53 +1000 Subject: [Biopython] BLAST database access In-Reply-To: <320fb6e01001240547n72ed3a0cx6ae8cdadfb80ff5a@mail.gmail.com> References: <4B5BAD1D.2020004@op.pl> <320fb6e01001240547n72ed3a0cx6ae8cdadfb80ff5a@mail.gmail.com> Message-ID: <4B5E687D.7040905@op.pl> Peter wrote: >> How could I retrieve the sequence which belong to subject id >> from BLAST database with BioPython? >> > > Are you using a local BLAST database, or an online one? > If online, I would try using the hit ID to search via the NCBI > Entrez interface, see the Bio.Entrez chapter in our tutorial. > If the database is local, then the NCBI provides a tool as > part of the BLAST suite for this called fastacmd. > > Peter Thank you. I could retrieve the sequences from a local BlastDB with fastacmd, but I have some local BlastDBs which do not have any index, because they were created without using the -o T option in formatdb. How could I retrieve the sequences from local BlastDBs without index? Thank you in advance. 
From biopython at maubp.freeserve.co.uk Tue Jan 26 07:59:25 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 26 Jan 2010 12:59:25 +0000 Subject: [Biopython] BLAST database access In-Reply-To: <4B5E687D.7040905@op.pl> References: <4B5BAD1D.2020004@op.pl> <320fb6e01001240547n72ed3a0cx6ae8cdadfb80ff5a@mail.gmail.com> <4B5E687D.7040905@op.pl> Message-ID: <320fb6e01001260459m6c046e2bwf6b89be26cf0806f@mail.gmail.com> On Tue, Jan 26, 2010 at 3:58 AM, xyz wrote: > Peter wrote: >>> >>> How could I retrieve the sequence which belong to subject id >>> from BLAST database with BioPython? >>> >> >> Are you using a local BLAST database, or an online one? >> If online, I would try using the hit ID to search via the NCBI >> Entrez interface, see the Bio.Entrez chapter in our tutorial. >> If the database is local, then the NCBI provides a tool as >> part of the BLAST suite for this called fastacmd. >> >> Peter > > Thank you. I could retrieve the sequences from a local BlastDB with > fastacmd, but I have some local BlastDBs which do not have any index, > because they were created without using the -o T option in formatdb. > > How could I retrieve the sequences from local BlastDBs without index? > > Thank you in advance. That sounds harder... do you still have the original FASTA file used to build the BLASTDB? If so, just index that - for example using the Bio.SeqIO.convert() functionality in Biopython 1.52 or later. Peter From bouchard.lysiane at gmail.com Wed Jan 27 13:16:30 2010 From: bouchard.lysiane at gmail.com (Lysiane Bouchard) Date: Wed, 27 Jan 2010 13:16:30 -0500 Subject: [Biopython] NaN values, lowess Message-ID: Hi, I am using the lowess function in Bio.Statistics.lowess, version 1.53 When input array y is zero everywhere, I obtain yest=NaN everywhere. I wonder if I did something wrong and if other special cases might lead to NaN values. 
------------------------------ ------------------------------------------------------- >>ipython --pylab >>In [1]: import numpy >>In [2]: from Bio.Statistics.lowess import lowess >>In [3]: x = numpy.array(range(200))*1.0 >>In [4]: y = numpy.zeros([200,]) >>In [5]: yest = lowess(x,y) >>In [6]: all(isnan(yest)) >>Out[6]: True ----------------------------------------------------------------------------------- Thank you, Lysiane Bouchard From richard_w_g_price at academia.edu Wed Jan 27 15:41:00 2010 From: richard_w_g_price at academia.edu (Richard Price) Date: Wed, 27 Jan 2010 12:41:00 -0800 Subject: [Biopython] Recent Activity of the 11 Biopython members on Academia.edu Message-ID: Dear Biopython members, We just wanted to let you know about some recent activity on the Biopython group on Academia.edu. In the Biopython group on Academia.edu, there are now: - 11 people - 1 paper - 1 photo Biopython members? pages have been viewed a total of 1,494 times, and their papers have been viewed a total of 2 times. To see these people, papers and status updates, follow the link below: http://lists.academia.edu/See-members-of-Biopython Richard Dr. Richard Price, post-doc, Philosophy Dept, Oxford University. Founder of Academia.edu From s.schmeier at gmail.com Sat Jan 30 03:46:56 2010 From: s.schmeier at gmail.com (Sebastian Schmeier) Date: Sat, 30 Jan 2010 11:46:56 +0300 Subject: [Biopython] SeqIO.index() Message-ID: <679b68f21001300046h36ca77dfh93c07af9ccbbef83@mail.gmail.com> Dear community, I am new to the mailing list and have a problem/question regarding the SeqIO.index() method/module. Up to now, I usually used an home-brewed fasta-file parser. This time though I had a look at the SeqIO interface. I am especially interested in the index() method. The fasta-file I use have non-standardized (if this is even possible) headers. 
I found that the index method uses the first string after the marker up to a space as the identifier for the dictionary (I will call this ID in the text below). It is however a great idea to have a function argument "key_function" that allows for adjust the key values via a self implemented callback function. This is essential in my case because ID in my fasta-file are not unique per entry. I had a look at the source code of SeqIO/_index.py and I found that unfortunately in the current implementation the "key_function" only acts on ID. I think it would make more sense to allow to extract a key from the complete header. Is this somehow possible with the current implementation? I refer here to the code in SeqIO/_index.py: 188 class _SequentialSeqFileDict(_IndexedSeqFileDict) : . . . 200 if marker_re.match(line) : 201 #Here we can assume the record.id is the first word after the 202 #marker. This is generally fine... but not for GenBank, EMBL, Swiss 203 self._record_key(line[marker_offset:].strip().split(None,1)[0], offset) ##### here you define that the key_function only acts on the first split Thanks, Seb From p.j.a.cock at googlemail.com Sat Jan 30 09:08:57 2010 From: p.j.a.cock at googlemail.com (Peter) Date: Sat, 30 Jan 2010 14:08:57 +0000 Subject: [Biopython] SeqIO.index() In-Reply-To: <679b68f21001300046h36ca77dfh93c07af9ccbbef83@mail.gmail.com> References: <679b68f21001300046h36ca77dfh93c07af9ccbbef83@mail.gmail.com> Message-ID: Hi Your request makes perfect sense for FASTA files, but does not generalise to all the other supported file formats - hence the relatively limited callback support available in Bio.SeqIO.index. I would suggest you could subclass the FASTA indexer to do what you want. Or, for smaller files use Bio.SeqIO.to_dict instead. Regards Peter On 30 Jan 2010, at 08:46, Sebastian Schmeier wrote: > Dear community, > > I am new to the mailing list and have a problem/question regarding the > SeqIO.index() method/module. 
Up to now, I usually used an home-brewed > fasta-file parser. This time though I had a look at the SeqIO > interface. I am especially interested in the index() method. > > The fasta-file I use have non-standardized (if this is even possible) > headers. I found that the index method uses the first string after the > marker up to a space as the identifier for the dictionary (I will call > this ID in the text below). It is however a great idea to have a > function argument "key_function" that allows for adjust the key values > via a self implemented callback function. This is essential in my case > because ID in my fasta-file are not unique per entry. > > I had a look at the source code of SeqIO/_index.py and I found that > unfortunately in the current implementation the "key_function" only > acts on ID. I think it would make more sense to allow to extract a key > from the complete header. Is this somehow possible with the current > implementation? > > I refer here to the code in SeqIO/_index.py: > > > 188 class _SequentialSeqFileDict(_IndexedSeqFileDict) : > . > . > . > 200 if marker_re.match(line) : > 201 #Here we can assume the record.id is the first > word after the > 202 #marker. This is generally fine... but not for > GenBank, EMBL, Swiss > 203 > self._record_key(line[marker_offset:].strip().split(None,1)[0], > offset) ##### here you define that the key_function only acts > on the first split > > > > Thanks, > Seb > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From alaguraj.v at gmail.com Sat Jan 2 07:27:27 2010 From: alaguraj.v at gmail.com (Alaguraj Veluchamy) Date: Sat, 2 Jan 2010 12:57:27 +0530 Subject: [Biopython] PSI-BLAST help Message-ID: I have a problem in database search using PSI-BLAST. I have to do PSI-BLAST against combined "nr" and "environmental sequences(env_nr)" databases. I need to iterate 10 rounds. 
Web services allow selecting one database at a time. Do Biopython offers search against multiple databases. I am unable to find any simple way to do this. Regards, Alaguraj.V From alaguraj.v at gmail.com Sat Jan 2 07:28:31 2010 From: alaguraj.v at gmail.com (Alaguraj Veluchamy) Date: Sat, 2 Jan 2010 12:58:31 +0530 Subject: [Biopython] PSI-BLAST help Message-ID: Dear all, I have a problem in database search using PSI-BLAST. I have to do PSI-BLAST against combined "nr" and "environmental sequences(env_nr)" databases. I need to iterate 10 rounds. Web services allow selecting one database at a time. Do Biopython offers search against multiple databases. I am unable to find any simple way to do this. Regards, Alaguraj.V From aboulia at gmail.com Sat Jan 2 10:44:29 2010 From: aboulia at gmail.com (Kevin Lam) Date: Sat, 2 Jan 2010 18:44:29 +0800 Subject: [Biopython] is there an updated tutorial on how to use the Wrappers for the new NCBI BLAST+ tools? Message-ID: <5b6410e1001020244s7b702ca8n760352052dba77c9@mail.gmail.com> Hi all finally found biopython Wrappers for the new NCBI BLAST+ tools in Applications.py the question is do I still use NCBIstandalone to use with BLAST+ ? is there a new tutorial for this? Cheers Kevin From stran104 at chapman.edu Sat Jan 2 19:14:53 2010 From: stran104 at chapman.edu (Matthew Strand) Date: Sat, 2 Jan 2010 11:14:53 -0800 Subject: [Biopython] is there an updated tutorial on how to use the Wrappers for the new NCBI BLAST+ tools? 
In-Reply-To: <5b6410e1001020244s7b702ca8n760352052dba77c9@mail.gmail.com> References: <5b6410e1001020244s7b702ca8n760352052dba77c9@mail.gmail.com> Message-ID: <2a63cc351001021114o3e02122ak686bcbfb031bcc14@mail.gmail.com> I'm no expert here but unfortunately you'll probably have to build your own database to do that. It's not biopython's fault since it just wraps PSI-BLAST and as far as I know PSI-BLAST is only made to search against one database. Perhaps someone else will have a different solution. On Sat, Jan 2, 2010 at 2:44 AM, Kevin Lam wrote: > Hi all > finally found > biopython Wrappers for the new NCBI BLAST+ tools in Applications.py > > the question is do I still use NCBIstandalone to use with BLAST+ ? > > is there a new tutorial for this? > > > Cheers > Kevin > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- Matthew Strand stran104 at chapman.edu phone: (626) 524-4449 skype: matstrand From stran104 at chapman.edu Sat Jan 2 19:16:08 2010 From: stran104 at chapman.edu (Matthew Strand) Date: Sat, 2 Jan 2010 11:16:08 -0800 Subject: [Biopython] is there an updated tutorial on how to use the Wrappers for the new NCBI BLAST+ tools? In-Reply-To: <2a63cc351001021114o3e02122ak686bcbfb031bcc14@mail.gmail.com> References: <5b6410e1001020244s7b702ca8n760352052dba77c9@mail.gmail.com> <2a63cc351001021114o3e02122ak686bcbfb031bcc14@mail.gmail.com> Message-ID: <2a63cc351001021116x3501deebwde613bb8d4b110b9@mail.gmail.com> Shoot, I replied to the wrong thread. Ignore that response, it was supposed to go to a PSI-BLAST question. Sorry. On Sat, Jan 2, 2010 at 11:14 AM, Matthew Strand wrote: > I'm no expert here but unfortunately you'll probably have to build your own > database to do that. It's not biopython's fault since it just wraps > PSI-BLAST and as far as I know PSI-BLAST is only made to search against one > database. 
Perhaps someone else will have a different solution. > > > On Sat, Jan 2, 2010 at 2:44 AM, Kevin Lam wrote: > >> Hi all >> finally found >> biopython Wrappers for the new NCBI BLAST+ tools in Applications.py >> >> the question is do I still use NCBIstandalone to use with BLAST+ ? >> >> is there a new tutorial for this? >> >> >> Cheers >> Kevin >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> > > > > -- > Matthew Strand > stran104 at chapman.edu > phone: (626) 524-4449 > skype: matstrand > -- Matthew Strand stran104 at chapman.edu phone: (626) 524-4449 skype: matstrand From stran104 at chapman.edu Sat Jan 2 19:17:07 2010 From: stran104 at chapman.edu (Matthew Strand) Date: Sat, 2 Jan 2010 11:17:07 -0800 Subject: [Biopython] PSI-BLAST help In-Reply-To: References: Message-ID: <2a63cc351001021117y3d5fe4b7pb6635d592bb9809b@mail.gmail.com> I'm no expert here but unfortunately you'll probably have to build your own database to do that. It's not biopython's fault since it just wraps PSI-BLAST and as far as I know PSI-BLAST is only made to search against one database. Perhaps someone else will have a different solution. On Fri, Jan 1, 2010 at 11:28 PM, Alaguraj Veluchamy wrote: > Dear all, > I have a problem in database search using PSI-BLAST. > I have to do PSI-BLAST against combined "nr" and "environmental > sequences(env_nr)" databases. > I need to iterate 10 rounds. > Web services allow selecting one database at a time. > > Do Biopython offers search against multiple databases. > I am unable to find any simple way to do this. 
> > Regards, > Alaguraj.V > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- Matthew Strand stran104 at chapman.edu phone: (626) 524-4449 skype: matstrand From aboulia at gmail.com Sun Jan 3 08:09:26 2010 From: aboulia at gmail.com (Kevin Lam) Date: Sun, 3 Jan 2010 16:09:26 +0800 Subject: [Biopython] is there an updated tutorial on how to use the Wrappers for the new NCBI BLAST+ tools? In-Reply-To: <5b6410e1001020244s7b702ca8n760352052dba77c9@mail.gmail.com> References: <5b6410e1001020244s7b702ca8n760352052dba77c9@mail.gmail.com> Message-ID: <5b6410e1001030009v5420f146k26d723f33c4c5f59@mail.gmail.com> Hmmm found this in the blast+ manual is it possible to integrate this somewhere in biopython ?Cheers Kevin 3.1 For users of NCBI C Toolkit BLAST The easiest way to get started using these command line applications is by means of the legacy_blast.pl PERL script which is bundled along with the BLAST+ applications. To utilize this script, simply prefix it to the invocation of the C toolkit BLAST command line application and append the --path option pointing to the installation directory of the BLAST+ applications. For example, instead of using blastall -i query -d nr -o blast.out use legacy_blast.pl blastall -i query -d nr -o blast.out --path /opt/blast/bin For more details, refer to the section titled Backwards compatibility script . On Sat, Jan 2, 2010 at 6:44 PM, Kevin Lam wrote: > Hi all > finally found > biopython Wrappers for the new NCBI BLAST+ tools in Applications.py > > the question is do I still use NCBIstandalone to use with BLAST+ ? > > is there a new tutorial for this? 
> > > Cheers > Kevin > >

From lueck at ipk-gatersleben.de Sun Jan 3 11:19:50 2010
From: lueck at ipk-gatersleben.de (lueck at ipk-gatersleben.de)
Date: Sun, 3 Jan 2010 12:19:50 +0100
Subject: [Biopython] Needs some NCBI recommendation
Message-ID: <20100103121950.2tzyaglxcq2o0w08@webmail.ipk-gatersleben.de>

Hello and a happy new year!

I'm currently writing a small piece of software which allows users to perform an NCBI online BLAST and to download full records from NCBI via PubMed IDs in batch mode. I would like to limit the BLAST and download requests, in order not to abuse NCBI. What would you suggest as an input limit for BLAST and sequence download (using efetch)?

In addition I want to ask whether it's reasonable to use efetch in a simple for statement

id_list = ["19304878", "18606172"]
for i in id_list:
    handle = Entrez.efetch(db="nucleotide", id=i, rettype="fasta")
    print handle.read()

since I need only the raw GenBank or FASTA files?

Thanks for any advice!
Stefanie

From mjldehoon at yahoo.com Sun Jan 3 14:44:04 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sun, 3 Jan 2010 06:44:04 -0800 (PST)
Subject: [Biopython] Needs some NCBI recommendation
In-Reply-To: <20100103121950.2tzyaglxcq2o0w08@webmail.ipk-gatersleben.de>
Message-ID: <142471.4083.qm@web62407.mail.re1.yahoo.com>

--- On Sun, 1/3/10, lueck at ipk-gatersleben.de wrote:
> In addition I want to ask whether it's reasonable to use
> efetch in a simple for statement
>
> id_list = ["19304878", "18606172"]
> for i in id_list:
>     handle = Entrez.efetch(db="nucleotide", id=i, rettype="fasta")
>     print handle.read()
>
> since I need only the raw GenBank or FASTA files?

The following needs only one call to efetch:

>>> from Bio import Entrez
>>> Entrez.email = "lueck at ipk-gatersleben.de"
>>> from Bio import SeqIO
>>> handle = Entrez.efetch(db='nucleotide', id="19304878,18606172", rettype='fasta')
>>> records = SeqIO.parse(handle, 'fasta')
>>> for record in records:
...     words = record.id.split("|")
...     i = words[1]
...     output = open(i+".fa", 'w')
...     SeqIO.write([record], output, 'fasta')
...     output.close()

--Michiel

From xuxiang086 at gmail.com Mon Jan 4 08:03:47 2010
From: xuxiang086 at gmail.com (xuxiang086)
Date: Mon, 4 Jan 2010 16:03:47 +0800
Subject: [Biopython] installing biopython1.53 failed
Message-ID: <201001041603446097700@gmail.com>

Dear all,

I was trying to install Biopython 1.53 from source on my computer, whose OS is Ubuntu, according to the installation instructions on the Biopython wiki, and when I ran "python setup.py test" I got failure messages as follows:

======================================================================
FAIL: seqmatchall with pair output piped to stdout.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Emboss.py", line 661, in test_seqtmatchall_piped
    self.assertEqual(align.get_alignment_length(), 9)
AssertionError: 471 != 9
----------------------------------------------------------------------
Ran 140 tests in 102.013 seconds
FAILED (failures = 1)

Could you help me to figure out what the problem is? Thanks.

Sincerely,
Xiang

From chapmanb at 50mail.com Mon Jan 4 12:51:54 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 4 Jan 2010 07:51:54 -0500
Subject: [Biopython] need help! how to retrieve full text from Pubmed central ?
In-Reply-To: <90247fbe0912260654scd2b0ceyb37d54f36a3531fa@mail.gmail.com>
References: <90247fbe0912260637n7553bdf7wbce10a627c0a124c@mail.gmail.com> <90247fbe0912260654scd2b0ceyb37d54f36a3531fa@mail.gmail.com>
Message-ID: <20100104125154.GE80812@sobchak.mgh.harvard.edu>

Ning;
> From http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetchlit_help.html,
> I can learn:
> PubMed Central contains a number of articles classified as "open
> access" for which you may download the full text as XML. For the
> remaining articles in PMC you may download only the abstracts as XML.
> > but when try to > handle=Entrez.efetch(db='pmc',id=idlist,rettype='full',retmode='xml') > record=Entrez.read(handle) > > got following errors: > Traceback (most recent call last): > File "", line 1, in > File "/usr/local/lib/python2.6/dist-packages/Bio/Entrez/__init__.py", > line 258, in read > record = handler.read(handle) > File "/usr/local/lib/python2.6/dist-packages/Bio/Entrez/Parser.py", > line 114, in read > raise CorruptedXMLError > Bio.Entrez.Parser.CorruptedXMLError: Failed to parse the XML data. > Please make sure that the input data are in XML format, and that the > data are not corrupted. > > the python version is 1.53 and my system is ubuntu 9.10. Following your example, doing: from Bio import Entrez Entrez.email = 'yours at blah.com' handle = Entrez.efetch(db='pmc', id=2747014, rettype='full', retmode='xml') handle.read() gives back the full XML text, as you wanted. Your next step, calling Entrez.read, asks Biopython to parse this into a record object. There isn't support in Biopython for this currently, and realistically that probably isn't what you want. If you are pulling down full text like this you are best served parsing the XML directly using something like ElementTree: http://docs.python.org/library/xml.etree.elementtree.html and pulling out the items you are interested in. 
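
As a rough sketch of the ElementTree route suggested above: the snippet below parses a small, made-up stand-in for a PMC article and pulls out the title and paragraph text. The sample document, its element layout, and the helper name extract_article are assumptions for illustration only; real PMC records are much larger and may be structured differently, so treat this as a starting point rather than a definitive parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for the XML text that
# handle.read() returns; it is NOT a real PMC record.
SAMPLE_XML = """<pmc-articleset>
  <article>
    <front>
      <article-meta>
        <title-group><article-title>An example title</article-title></title-group>
      </article-meta>
    </front>
    <body>
      <sec><title>Introduction</title><p>First paragraph.</p></sec>
      <sec><title>Methods</title><p>Second <italic>styled</italic> paragraph.</p></sec>
    </body>
  </article>
</pmc-articleset>"""


def extract_article(xml_text):
    """Return (title, list of paragraph strings) from article XML text."""
    root = ET.fromstring(xml_text)
    # findtext returns the text of the first matching element, or None.
    title = root.findtext(".//article-title")
    # itertext() also collects text inside inline markup such as <italic>.
    paragraphs = ["".join(p.itertext()) for p in root.iter("p")]
    return title, paragraphs


title, paragraphs = extract_article(SAMPLE_XML)
print(title)
for text in paragraphs:
    print(text)
```

With real data, the string passed to extract_article would be the text returned by handle.read() in the example above.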
Hope this helps, Brad From chapmanb at 50mail.com Mon Jan 4 13:06:11 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 4 Jan 2010 08:06:11 -0500 Subject: [Biopython] installing biopython1.53 failed In-Reply-To: <201001041603446097700@gmail.com> References: <201001041603446097700@gmail.com> Message-ID: <20100104130611.GF80812@sobchak.mgh.harvard.edu> Xiang; > I was trying to install biopython1.53 source on my computer whose OS is Ubuntu according to the installation instructions on biopython wiki, and when I input "python setup.py test" I got failure messages as follows: > > ====================================================================== > FAIL: seqmatchall with pair output piped to stdout. > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "test_Emboss.py", line 661, in test_seqtmatchall_piped > self.assertEqual(align.get_alignment_length(), 9) > AssertionError: 471 != 9 > ---------------------------------------------------------------------- > Ran 140 tests in 102.013 seconds > FAILED (failures = 1) > > Could you help me to figure out what's the problem? Thanks. Biopython appears to be installed okay, and this is an issue with parsing EMBOSS output from the program seqmatchall. If you aren't planning on using EMBOSS, then you can go ahead and use the rest of Biopython without any worries. To figure out the issue, it would be useful to know the version of EMBOSS you are using: % embossversion Writes the current EMBOSS version number to a file 6.0.1 If it's an older one, a simple fix may be to upgrade. 
You should be able to run 'apt-get update' followed by 'apt-get install emboss' on Ubuntu:

http://packages.ubuntu.com/karmic/emboss

Hope this helps,
Brad

From xuxiang086 at gmail.com Mon Jan 4 13:15:39 2010
From: xuxiang086 at gmail.com (xuxiang086)
Date: Mon, 4 Jan 2010 21:15:39 +0800
Subject: [Biopython] installing biopython1.53 failed
References: <201001041603446097700@gmail.com>, <20100104130611.GF80812@sobchak.mgh.harvard.edu>
Message-ID: <201001042115373431286@gmail.com>

Hi Brad,

Thanks for your help.

Xiang

2010-01-04
xuxiang086

From: Brad Chapman
Sent: 2010-01-04 21:06:14
To: xuxiang086
Cc: BioPython
Subject: Re: [Biopython] installing biopython1.53 failed

Xiang;
> I was trying to install biopython1.53 source on my computer whose OS is Ubuntu according to the installation instructions on biopython wiki, and when I input "python setup.py test" I got failure messages as follows:
>
> ======================================================================
> FAIL: seqmatchall with pair output piped to stdout.
> ----------------------------------------------------------------------
> Traceback (most recent call last):
> File "test_Emboss.py", line 661, in test_seqtmatchall_piped
> self.assertEqual(align.get_alignment_length(), 9)
> AssertionError: 471 != 9
> ----------------------------------------------------------------------
> Ran 140 tests in 102.013 seconds
> FAILED (failures = 1)
>
> Could you help me to figure out what's the problem? Thanks.

Biopython appears to be installed okay, and this is an issue with parsing EMBOSS output from the program seqmatchall. If you aren't planning on using EMBOSS, then you can go ahead and use the rest of Biopython without any worries.

To figure out the issue, it would be useful to know the version of EMBOSS you are using:

% embossversion
Writes the current EMBOSS version number to a file
6.0.1

If it's an older one, a simple fix may be to upgrade. You should be able to run 'apt-get update' followed by 'apt-get install emboss' on Ubuntu:

http://packages.ubuntu.com/karmic/emboss

Hope this helps,
Brad

From mjldehoon at yahoo.com Mon Jan 4 15:15:57 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Mon, 4 Jan 2010 07:15:57 -0800 (PST)
Subject: [Biopython] need help! how to retrieve full text from Pubmed central ?
In-Reply-To: <20100104125154.GE80812@sobchak.mgh.harvard.edu>
Message-ID: <595436.42697.qm@web62403.mail.re1.yahoo.com>

--- On Mon, 1/4/10, Brad Chapman wrote:
> Following your example, doing:
>
> from Bio import Entrez
> Entrez.email = 'yours at blah.com'
> handle = Entrez.efetch(db='pmc', id=2747014, rettype='full', retmode='xml')
> handle.read()
>
> gives back the full XML text, as you wanted. Your next step, calling
> Entrez.read, asks Biopython to parse this into a record object.
> There isn't support in Biopython for this currently,

This *is* supported by Biopython. In principle, Bio.Entrez can parse any XML generated by NCBI Entrez as long as the corresponding DTDs are available. In this case, the DTD included in Biopython 1.53 is corrupted, causing the error. Unfortunately, the correct DTD relies on a large number of other DTDs, so just replacing the one DTD is not sufficient.

Hmm... maybe we should think of a more robust way of getting the DTDs without relying on their inclusion in the Biopython distribution ...

--Michiel.

From darnells at dnastar.com Mon Jan 4 16:15:38 2010
From: darnells at dnastar.com (Steve Darnell)
Date: Mon, 4 Jan 2010 10:15:38 -0600
Subject: [Biopython] PSI-BLAST help
In-Reply-To: <2a63cc351001021117y3d5fe4b7pb6635d592bb9809b@mail.gmail.com>
References: <2a63cc351001021117y3d5fe4b7pb6635d592bb9809b@mail.gmail.com>
Message-ID:

Alaguraj,

I am assuming you have already downloaded the nr and env_nr databases. You can create an alias database file that will tie individual databases together to form a larger virtual database.
http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/formatdb_fastacmd.html#4.1.6

I have not personally used this approach, so I cannot offer more guidance than this. However, since Biopython only provides a wrapper for the NCBI command line tools, I would expect this approach to work well with Biopython scripting.

Regards,
Steve

--
Steve Darnell
DNASTAR, Inc.
Madison, WI USA

-----Original Message-----
From: biopython-bounces at lists.open-bio.org [mailto:biopython-bounces at lists.open-bio.org] On Behalf Of Matthew Strand
Sent: Saturday, January 02, 2010 1:17 PM
To: Alaguraj Veluchamy
Cc: biopython at lists.open-bio.org
Subject: Re: [Biopython] PSI-BLAST help

I'm no expert here but unfortunately you'll probably have to build your own database to do that. It's not biopython's fault since it just wraps PSI-BLAST and as far as I know PSI-BLAST is only made to search against one database. Perhaps someone else will have a different solution.

On Fri, Jan 1, 2010 at 11:28 PM, Alaguraj Veluchamy wrote:
> Dear all,
> I have a problem in database search using PSI-BLAST.
> I have to do PSI-BLAST against combined "nr" and "environmental
> sequences(env_nr)" databases.
> I need to iterate 10 rounds.
> Web services allow selecting one database at a time.
>
> Do Biopython offers search against multiple databases.
> I am unable to find any simple way to do this.
>
> Regards,
> Alaguraj.V

From biopython at maubp.freeserve.co.uk Tue Jan 5 10:47:38 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 5 Jan 2010 10:47:38 +0000
Subject: [Biopython] is there an updated tutorial on how to use the Wrappers for the new NCBI BLAST+ tools?
In-Reply-To: <5b6410e1001020244s7b702ca8n760352052dba77c9@mail.gmail.com> References: <5b6410e1001020244s7b702ca8n760352052dba77c9@mail.gmail.com> Message-ID: <320fb6e01001050247t29a1a57idd2fd22400e9c54a@mail.gmail.com> On Sat, Jan 2, 2010 at 10:44 AM, Kevin Lam wrote: > Hi all > finally found > biopython Wrappers for the new NCBI BLAST+ tools in Applications.py > > the question is do I still use NCBIstandalone to use with BLAST+ ? > No, use Bio.Blast.Applications with the subprocess module. > > is there a new tutorial for this? > Did you check the current Tutorial (as shipped with Biopython 1.53)? http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf There are wrappers for the new NCBI BLAST+ tools in the Bio.Blast.Applications module (recommended for future use). There are wrappers for the "legacy" NCBI BLAST tools in the Bio.Blast.Applications module (along with the new BLAST+ wrappers), and the old rather inflexible "helper functions" in Bio.Blast.NCBIStandalone. These are all effectively obsolete (since the NCBI is phasing out the "legacy" BLAST tools), and will be deprecated in a future release of Biopython. This is in the DEPRECATED file, and the module docstrings. Obviously the documentation wasn't as clear as it could have been (and the bit in the tutorial is a little short). Where did you look and can you make any suggestions for clarification or improvement? Thanks, Peter From biopython at maubp.freeserve.co.uk Tue Jan 5 11:33:26 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 5 Jan 2010 11:33:26 +0000 Subject: [Biopython] need help! how to retrieve full text from Pubmed central ? In-Reply-To: <90247fbe0912260637n7553bdf7wbce10a627c0a124c@mail.gmail.com> References: <90247fbe0912260637n7553bdf7wbce10a627c0a124c@mail.gmail.com> Message-ID: <320fb6e01001050333w3ca52399u565177c4d80a4724@mail.gmail.com> On Sat, Dec 26, 2009 at 2:37 PM, ning luwen wrote: > Dear everyone, > ?? 
I need to download full text from Pubmed central. After see the
> Entrez manual, maybe Entrez(not the web interface) doesn't give a way
> to download .pdf full text file, is this true?
>

According to the EFetch help, for PMC you can only retrieve XML (although this does seem to give the full text):

http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetchlit_help.html

I had a look at the ELink documentation, and don't see any way to use it to get a PDF link (e.g. to the publisher's site). You could use the DOI, but that doesn't allow control over HTML vs PDF. I think you should email the Entrez support team for advice (and if you find out more, please let us know).

From playing with the PMC website, I eventually found a URL which will work to get a PDF file, both in my browser and via the command line tool wget:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2682512/pdf

However, it seems the default Python urllib useragent is blocked for some reason. A quick search online shows one way to over-ride the user-agent in Python, and if we pretend to be the Firefox browser this now works:

url = "http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2682512/pdf"
filename = "PMC2682512.pdf"

from urllib import FancyURLopener
class FakeMozilla(FancyURLopener):
    version = "Mozilla/5.0 (Windows; U; Windows NT 5.2; rv:1.9.2) Gecko/20100101 Firefox/3.6"

FakeMozilla().retrieve(url, filename)

So, while that does seem to work, it is *NOT* endorsed by the NCBI. If you just want to download a few files, it may do the trick, but I do think you should email the Entrez support team for advice on how this *should* be done.

Regards,
Peter

From biopython at maubp.freeserve.co.uk Tue Jan 5 11:46:34 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 5 Jan 2010 11:46:34 +0000
Subject: [Biopython] need help! how to retrieve full text from Pubmed central ?
In-Reply-To: <595436.42697.qm@web62403.mail.re1.yahoo.com> References: <20100104125154.GE80812@sobchak.mgh.harvard.edu> <595436.42697.qm@web62403.mail.re1.yahoo.com> Message-ID: <320fb6e01001050346l23a8920am4c5192831105371b@mail.gmail.com> On Mon, Jan 4, 2010 at 3:15 PM, Michiel de Hoon wrote: > > This *is* supported by Biopython. In principle, Bio.Entrez can parse any > XML generated by NCBI Entrez as long as the corresponding DTDs are > available. In this case, the DTD included in Biopython 1.53 is corrupted, > causing the error. Unfortunately, the correct DTD relies on a large number > of other DTDs, so just replacing the one DTD is not sufficient. > > Hmm... maybe we should think of a more robust way of getting the DTDs > without relying on their inclusion in the Biopython distribution ... Which DTD has a problem? I was aware an elink DTD was *missing* in Biopython 1.53 (adding in git), but not of any corrupted DTD files. In this particular example, it is the NCBI that have a problem - they are returning invalid XML which (understandably) our parser is rejecting. It could just be they haven't kept the XML output and the public DTD files in sync. For example, consider this Entrez URL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=2747014&retmode=xml According to both these validators this is not a valid XML file! http://www.validome.org/xml/validate/ http://validator.w3.org/ In Biopython when we try and parse this exact URL: >>> from Bio import Entrez >>> import urllib >>> record = Entrez.read(urllib.urlopen("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=2747014&retmode=xml")) Traceback (most recent call last): ... Bio.Entrez.Parser.CorruptedXMLError: Failed to parse the XML data. Please make sure that the input data are in XML format, and that the data are not corrupted. 
You get the same error using the Bio.Entrez.efetch function which will use an equivalent URL (but with the tool and email set): >>> from Bio import Entrez >>> Entrez.email = "your.name.here at example.com" >>> record = Entrez.read(Entrez.efetch(db="pmc", id="2747014", retmode="xml")) Traceback (most recent call last): ... Bio.Entrez.Parser.CorruptedXMLError: Failed to parse the XML data. Please make sure that the input data are in XML format, and that the data are not corrupted. Peter From mjldehoon at yahoo.com Tue Jan 5 12:17:33 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 5 Jan 2010 04:17:33 -0800 (PST) Subject: [Biopython] need help! how to retrieve full text from Pubmed central ? In-Reply-To: <320fb6e01001050346l23a8920am4c5192831105371b@mail.gmail.com> Message-ID: <95712.11972.qm@web62408.mail.re1.yahoo.com> There are multiple issues here. First, the CorruptedXMLError is caused by the corrupted DTD (which I fixed by now on github). Basically, the corrupted DTD inserts some gibberish into the XML, which is then no longer valid. If you replace the corrupted DTD by the correct one, the CorruptedXMLError goes away. But you'll find that a bunch of other DTDs are missing (these have now been uploaded to github). With the complete set of DTDs, you run into a new error: One of the tags in the XML file is not listed anywhere in any of the DTDs. This is probably the reason the XML validators show that it's not valid XML. I've notified NCBI that the XML output is not consistent with the DTDs for this case. --Michiel --- On Tue, 1/5/10, Peter wrote: > From: Peter > Subject: Re: [Biopython] need help! how to retrieve full text from Pubmed central ? > To: "Michiel de Hoon" > Cc: biopython at lists.open-bio.org, "Brad Chapman" > Date: Tuesday, January 5, 2010, 6:46 AM > On Mon, Jan 4, 2010 at 3:15 PM, > Michiel de Hoon > wrote: > > > > This *is* supported by Biopython. 
In principle, > Bio.Entrez can parse any > > XML generated by NCBI Entrez as long as the > corresponding DTDs are > > available. In this case, the DTD included in Biopython > 1.53 is corrupted, > > causing the error. Unfortunately, the correct DTD > relies on a large number > > of other DTDs, so just replacing the one DTD is not > sufficient. > > > > Hmm... maybe we should think of a more robust way of > getting the DTDs > > without relying on their inclusion in the Biopython > distribution ... > > Which DTD has a problem? I was aware an elink DTD was > *missing* in > Biopython 1.53 (adding in git), but not of any corrupted > DTD files. > > In this particular example, it is the NCBI that have a > problem - they are > returning invalid XML which (understandably) our parser is > rejecting. > It could just be they haven't kept the XML output and the > public DTD > files in sync. > > For example, consider this Entrez URL: > > http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=2747014&retmode=xml > > According to both these validators this is not a valid XML > file! > > http://www.validome.org/xml/validate/ > http://validator.w3.org/ > > In Biopython when we try and parse this exact URL: > > >>> from Bio import Entrez > >>> import urllib > >>> record = Entrez.read(urllib.urlopen("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=2747014&retmode=xml")) > Traceback (most recent call last): > ... > Bio.Entrez.Parser.CorruptedXMLError: Failed to parse the > XML data. > Please make sure that the input data are in XML format, and > that the > data are not corrupted. > > You get the same error using the Bio.Entrez.efetch function > which > will use an equivalent URL (but with the tool and email > set): > > >>> from Bio import Entrez > >>> Entrez.email = "your.name.here at example.com" > >>> record = Entrez.read(Entrez.efetch(db="pmc", > id="2747014", retmode="xml")) > Traceback (most recent call last): > ... 
> Bio.Entrez.Parser.CorruptedXMLError: Failed to parse the > XML data. > Please make sure that the input data are in XML format, and > that the > data are not corrupted. > > Peter > From biopython at maubp.freeserve.co.uk Tue Jan 5 12:42:10 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 5 Jan 2010 12:42:10 +0000 Subject: [Biopython] need help! how to retrieve full text from Pubmed central ? In-Reply-To: <95712.11972.qm@web62408.mail.re1.yahoo.com> References: <320fb6e01001050346l23a8920am4c5192831105371b@mail.gmail.com> <95712.11972.qm@web62408.mail.re1.yahoo.com> Message-ID: <320fb6e01001050442j11bf5959y5380a7fcd42e959e@mail.gmail.com> On Tue, Jan 5, 2010 at 12:17 PM, Michiel de Hoon wrote: > > There are multiple issues here. > > First, the CorruptedXMLError is caused by the corrupted DTD (which I fixed > by now on github). Basically, the corrupted DTD inserts some gibberish into > the XML, which is then no longer valid. If you replace the corrupted DTD by > the correct one, the CorruptedXMLError goes away. I see what you mean, our old copy of nlm-articleset-2.0.dtd was actually an HTML redirect message. Oops. Thanks for sorting out that glitch - my fault. > But you'll find that a bunch of other DTDs are missing (these have now been > uploaded to github). With the complete set of DTDs, you run into a new error: Do you get this: NotImplementedError: The Bio.Entrez parser cannot handle XML data that make use of XML namespaces > One of the tags in the XML file is not listed anywhere in any of the DTDs. > This is probably the reason the XML validators show that it's not valid XML. > I've notified NCBI that the XML output is not consistent with the DTDs for > this case. Excellent - thank you. Peter P.S. Last year (Sept 2009) I reported a similar problem with ELink XML failing to validate when the history was used (while working on the "Searching for citations" example in the tutorial). That seems to be resolved now so I can update the tutorial... 
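
On the XML namespaces error quoted in this message: as a small illustration, here is how namespaced attributes appear when such XML is read by hand with the standard library's ElementTree instead of Bio.Entrez. The sample document and its element names are invented for this sketch; only the xlink namespace URI is the standard W3C one.

```python
import xml.etree.ElementTree as ET

# Made-up fragment that uses the xlink namespace for a link attribute.
SAMPLE = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <body>
    <p>See <ext-link xlink:href="http://example.org/data">the data</ext-link>.</p>
  </body>
</article>"""

XLINK = "http://www.w3.org/1999/xlink"

root = ET.fromstring(SAMPLE)
# ElementTree expands a prefixed name such as xlink:href to "{namespace-uri}href",
# so the attribute has to be looked up under that expanded name.
links = [el.get("{%s}href" % XLINK) for el in root.iter("ext-link")]
print(links)
```

Looking the attribute up as plain "href" or as "xlink:href" would return None here, which is an easy mistake to make when parsing this kind of XML directly.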
From mjldehoon at yahoo.com  Tue Jan  5 14:31:38 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 5 Jan 2010 06:31:38 -0800 (PST)
Subject: [Biopython] need help! how to retrieve full text from Pubmed central ?
In-Reply-To: <320fb6e01001050442j11bf5959y5380a7fcd42e959e@mail.gmail.com>
Message-ID: <179611.33586.qm@web62402.mail.re1.yahoo.com>

--- On Tue, 1/5/10, Peter wrote:
> Do you get this:
> NotImplementedError: The Bio.Entrez parser cannot handle
> XML data that make use of XML namespaces

I get that one too, but that is easy to fix once NCBI's DTD files are
corrected.

--Michiel.


From biopython at maubp.freeserve.co.uk  Tue Jan  5 17:20:18 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 5 Jan 2010 17:20:18 +0000
Subject: [Biopython] Remove hydrogens...
In-Reply-To: <20091229091838.fnyk66sayos8swww@correo.fenhi.uh.cu>
References: <20091229091838.fnyk66sayos8swww@correo.fenhi.uh.cu>
Message-ID: <320fb6e01001050920r4bdf627cg60e9bb84e004b4ec@mail.gmail.com>

2009/12/29 Yasser Almeida Hernández:
> Hi all...
> How can i remove hydrogens atoms from the structures objects?
>
> Thanks
>
> --
> Lic. Yasser Almeida Hernández

Hi,

I would suggest you look at pages 5 and 6 of the Bio.PDB documentation,
the bit on the Select class:
http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf

See also related discussions on the mailing list:
http://lists.open-bio.org/pipermail/biopython/2009-March/005021.html
http://lists.open-bio.org/pipermail/biopython/2009-May/005172.html

Please let us know how you get on. If you would like to contribute
to the project, this seems like an excellent topic for a cookbook entry,
once you've got it working of course ;)
http://biopython.org/wiki/Category:Cookbook

Peter

From pedro.al at fenhi.uh.cu  Tue Jan  5 18:50:14 2010
From: pedro.al at fenhi.uh.cu (Yasser Almeida Hernández)
Date: Tue, 05 Jan 2010 13:50:14 -0500
Subject: [Biopython] Save custom structure...
Message-ID: <20100105135014.4ovlg1l4hkwowo04@correo.fenhi.uh.cu>

Hi all...
I know I asked this question before but i really need your help...
I've selected a residue and an atom and i want to save them as a new
.pdb file. How can i do that?

Thanks

--
Lic. Yasser Almeida Hernández
Center of Molecular Inmunology (CIM)
Nanobiology Group
P.O.Box 16040, Havana, Cuba
Phone: (537) 271-7933, ext. 221

----------------------------------------------------------------
Correo FENHI



From alexl at users.sourceforge.net  Wed Jan  6 00:14:43 2010
From: alexl at users.sourceforge.net (Alex Lancaster)
Date: Tue, 05 Jan 2010 19:14:43 -0500
Subject: [Biopython] Fedora packages for 1.53 available (was Re: Biopython 1.53 released)
In-Reply-To: <320fb6e00912150901k138ae04bmc5d5af9c867340ec__41910.6228081093$1260896885$gmane$org@mail.gmail.com>
	(Peter's message of "Tue, 15 Dec 2009 17:01:38 +0000")
References: <320fb6e00912150901k138ae04bmc5d5af9c867340ec__41910.6228081093$1260896885$gmane$org@mail.gmail.com>
Message-ID:

>>>>> "P" == Peter writes:

P> Dear Biopythoneers, We are pleased to announce the availability of
P> Biopython 1.53, a new stable release of the Biopython library, three
P> months after the release of Biopython 1.52. This is our first release
P> since migrating from CVS to git for source code control.

Hi there,

For all Fedora users, new packages for biopython 1.53 are now available
in the "updates-testing" repository for F-11 and F-12.

To test them out simply run (as root):

  yum --enablerepo=updates-testing install python-biopython

Please provide feedback on packages here:

F-11: https://admin.fedoraproject.org/updates/F11/FEDORA-2009-13353
F-12: https://admin.fedoraproject.org/updates/F12/FEDORA-2009-13326

(You can leave feedback anonymously, or using your Fedora account name
if you happen to be a Fedora contributor)

The more positive feedback from testing, the faster the packages can go
into the stable "updates" repo (or conversely if there are any problems,
they can be fixed before being pushed).

Thanks!
Alex

From biopython at maubp.freeserve.co.uk  Wed Jan  6 11:35:32 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 6 Jan 2010 11:35:32 +0000
Subject: [Biopython] Save custom structure...
In-Reply-To: <20100105135014.4ovlg1l4hkwowo04@correo.fenhi.uh.cu>
References: <20100105135014.4ovlg1l4hkwowo04@correo.fenhi.uh.cu>
Message-ID: <320fb6e01001060335i16114ab0pc4a183540a29f244@mail.gmail.com>

2010/1/5 Yasser Almeida Hernández:
> Hi all...
> I know I asked this question before but i really need your help...
> I've selected a residue and an atom and i want to save them
> as a new .pdb file. How can i do that?
>
> Thanks

You need a structure object, and then pass that to PDBIO.
I suggest you do this via a Select class - as in your related
question about removing hydrogen atoms:
http://lists.open-bio.org/pipermail/biopython/2009-December/006028.html
http://lists.open-bio.org/pipermail/biopython/2010-January/006064.html

If that doesn't make sense, then perhaps you could go into
more detail? e.g. tell us which PDB file you are working with,
and show us your code so far. You could use a similar example
if you'd prefer not to talk about the real research topic.
Peter From ap12 at sanger.ac.uk Wed Jan 6 12:24:15 2010 From: ap12 at sanger.ac.uk (Anne Pajon) Date: Wed, 6 Jan 2010 12:24:15 +0000 Subject: [Biopython] Could Bio.SeqIO write EMBL file? In-Reply-To: References: Message-ID: <47115A52-FC46-48A3-B3DF-EF012EEE520B@sanger.ac.uk> Sorry for the typo, Please read print embl_record.format("embl") instead of print embl_record.format("genbank") I was just testing if it was possible to write in another format. On 6 Jan 2010, at 12:20, Anne Pajon wrote: > Dear, > > I'm reading EMBL file with Bio.SeqIO for adding an extra feature > qualifier to each of the annotations, and would like to write the > modified annotated sequence back to an EMBL file. > > embl_record = SeqIO.read(open("Alistipes_shahii_WAL8301.embl"), > "embl") > addSystematicId(embl_record) > print embl_record.format("genbank") > > While running the above I'm getting this error: > Reading format 'embl' is supported, but not writing > > Is there a way around? I know from the documentation on the wiki > that biopython does not have a writer for EMBL format. Is there a > plan of having one in the future? I volunteer to test it, or if it > does not exist yet I may be able to contribute writing it... thanks > to let me know. > > Kind regards, > Anne. > -- > Dr Anne Pajon - Pathogen Genomics, Team 81 > Sanger Institute, Wellcome Trust Genome Campus, Hinxton > Cambridge CB10 1SA, United Kingdom > +44 (0)1223 494 798 (office) | +44 (0)7958 511 353 (mobile) > -- Dr Anne Pajon - Pathogen Genomics, Team 81 Sanger Institute, Wellcome Trust Genome Campus, Hinxton Cambridge CB10 1SA, United Kingdom +44 (0)1223 494 798 (office) | +44 (0)7958 511 353 (mobile) -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. 
From ap12 at sanger.ac.uk Wed Jan 6 12:20:30 2010 From: ap12 at sanger.ac.uk (Anne Pajon) Date: Wed, 6 Jan 2010 12:20:30 +0000 Subject: [Biopython] Could Bio.SeqIO write EMBL file? Message-ID: Dear, I'm reading EMBL file with Bio.SeqIO for adding an extra feature qualifier to each of the annotations, and would like to write the modified annotated sequence back to an EMBL file. embl_record = SeqIO.read(open("Alistipes_shahii_WAL8301.embl"), "embl") addSystematicId(embl_record) print embl_record.format("genbank") While running the above I'm getting this error: Reading format 'embl' is supported, but not writing Is there a way around? I know from the documentation on the wiki that biopython does not have a writer for EMBL format. Is there a plan of having one in the future? I volunteer to test it, or if it does not exist yet I may be able to contribute writing it... thanks to let me know. Kind regards, Anne. -- Dr Anne Pajon - Pathogen Genomics, Team 81 Sanger Institute, Wellcome Trust Genome Campus, Hinxton Cambridge CB10 1SA, United Kingdom +44 (0)1223 494 798 (office) | +44 (0)7958 511 353 (mobile) -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From biopython at maubp.freeserve.co.uk Wed Jan 6 13:15:17 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 6 Jan 2010 13:15:17 +0000 Subject: [Biopython] Could Bio.SeqIO write EMBL file? In-Reply-To: References: Message-ID: <320fb6e01001060515y23804dd5xebc326647fa975d8@mail.gmail.com> On Wed, Jan 6, 2010 at 12:20 PM, Anne Pajon wrote: > Dear, > > I'm reading EMBL file with Bio.SeqIO for adding an extra feature qualifier > to each of the annotations, and would like to write the modified annotated > sequence back to an EMBL file. > ... 
> While running the above I'm getting this error: > Reading format 'embl' is supported, but not writing > > Is there a way around? I know from the documentation on the wiki that > biopython does not have a writer for EMBL format. Is there a plan of having > one in the future? I volunteer to test it, or if it does not exist yet I may > be able to contribute writing it... thanks to let me know. > > Kind regards, > Anne. Hello Anne, The intention was to eventually have both GenBank and EMBL output working in SeqIO - and they should be able to share a lot of code. However, out of practicality, GenBank output was prioritised (and bar a few bits of annotation, seems to be working nicely). There hadn't been much interest in EMBL output in comparison. Getting something basic working shouldn't be too hard (id, features and sequence), and having someone interested help test this would be very valuable. Did you install Biopython from source? Are you happy using git (to grab code for testing)? Neither is essential for trying out new Python code, but would make things a bit simpler. Also, what kind of organisms are you working with? What I'm getting at here is how complex are the feature locations going to be? Peter From ap12 at sanger.ac.uk Wed Jan 6 13:28:42 2010 From: ap12 at sanger.ac.uk (Anne Pajon) Date: Wed, 6 Jan 2010 13:28:42 +0000 Subject: [Biopython] Could Bio.SeqIO write EMBL file? In-Reply-To: <320fb6e01001060515y23804dd5xebc326647fa975d8@mail.gmail.com> References: <320fb6e01001060515y23804dd5xebc326647fa975d8@mail.gmail.com> Message-ID: <246181F8-03A2-49B2-960D-1FFBBC3E2865@sanger.ac.uk> Hi Peter, Thanks again for this fast answer. You've been fixing code for me recently on fasta-m10 al_start and al_end, so I am now working with the development version of biopython from git. I have no problem of updating it and testing it here. I am working with about 30 bacteria genomes from the human gut and waiting 100 more genomes to work with this year. 
I can send you one of the file if you wish. Just let me know. Kind regards, Anne. On 6 Jan 2010, at 13:15, Peter wrote: > On Wed, Jan 6, 2010 at 12:20 PM, Anne Pajon wrote: >> Dear, >> >> I'm reading EMBL file with Bio.SeqIO for adding an extra feature >> qualifier >> to each of the annotations, and would like to write the modified >> annotated >> sequence back to an EMBL file. >> ... >> While running the above I'm getting this error: >> Reading format 'embl' is supported, but not writing >> >> Is there a way around? I know from the documentation on the wiki that >> biopython does not have a writer for EMBL format. Is there a plan >> of having >> one in the future? I volunteer to test it, or if it does not exist >> yet I may >> be able to contribute writing it... thanks to let me know. >> >> Kind regards, >> Anne. > > Hello Anne, > > The intention was to eventually have both GenBank and EMBL output > working in SeqIO - and they should be able to share a lot of code. > However, out of practicality, GenBank output was prioritised (and > bar a few bits of annotation, seems to be working nicely). There > hadn't been much interest in EMBL output in comparison. > > Getting something basic working shouldn't be too hard (id, features > and > sequence), and having someone interested help test this would be very > valuable. Did you install Biopython from source? Are you happy using > git (to grab code for testing)? Neither is essential for trying out > new > Python code, but would make things a bit simpler. > > Also, what kind of organisms are you working with? What I'm getting > at here is how complex are the feature locations going to be? 
>
> Peter

--
Dr Anne Pajon - Pathogen Genomics, Team 81
Sanger Institute, Wellcome Trust Genome Campus, Hinxton
Cambridge CB10 1SA, United Kingdom
+44 (0)1223 494 798 (office) | +44 (0)7958 511 353 (mobile)



--
 The Wellcome Trust Sanger Institute is operated by Genome Research
 Limited, a charity registered in England with number 1021457 and a
 company registered in England with number 2742969, whose registered
 office is 215 Euston Road, London, NW1 2BE.

From pedro.al at fenhi.uh.cu  Wed Jan  6 14:24:30 2010
From: pedro.al at fenhi.uh.cu (Yasser Almeida Hernández)
Date: Wed, 06 Jan 2010 09:24:30 -0500
Subject: [Biopython] Save custom structure...
Message-ID: <20100106092430.vz2h8pqo0w8k8gcc@correo.fenhi.uh.cu>

I used the "add" method in the Residue class for add the atom object
to the residue. That's right that i need a structure object, but how
i build this object "de novo" and a how add a new residue on it???

I used the StructureBuilder class with the init_* methods (model,
chain, residue etc.) and then the get_structure method, but it doesn't
work:

# EXPERIMENTAL CODE
res.add(contact)   # Add a atom to the residue of interest

output_structure = StructureBuilder.StructureBuilder()
output_structure.init_structure('OUT_STRUCT')

output_structure.init_model(0)
output_structure.init_chain('X')
output_structure.get_structure()
output_structure[0]['X'].add(res)
io = PDBIO()
io.set_structure(output_structure)
pdb_out_filename = "cont_res_plus_contact.pdb"
io.save(pdb_out_filename, output_structure)

I'm processing a hundred of pdb files, and i need this code for write
residues and atoms in different conformational states...

I hope for your help...
Thanks

> You need a structure object, and then pass that to PDBIO.
> I suggest you do this via a Select class - as in your related
> question about removing hydrogen atoms:
> http://lists.open-bio.org/pipermail/biopython/2009-December/006028.html
> http://lists.open-bio.org/pipermail/biopython/2010-January/006064.html
>
> If that doesn't make sense, then perhaps you could go into
> more detail? e.g. tell us which PDB file you are working with,
> and show us your code so far. You could use a similar example
> if you'd prefer not to talk about the real research topic.

> Peter

--
Lic. Yasser Almeida Hernández
Center of Molecular Inmunology (CIM)
Nanobiology Group
P.O.Box 16040, Havana, Cuba
Phone: (537) 271-7933, ext. 221

----------------------------------------------------------------
Correo FENHI



From biopython at maubp.freeserve.co.uk  Wed Jan  6 15:10:01 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 6 Jan 2010 15:10:01 +0000
Subject: [Biopython] Save custom structure...
In-Reply-To: <20100106092430.vz2h8pqo0w8k8gcc@correo.fenhi.uh.cu>
References: <20100106092430.vz2h8pqo0w8k8gcc@correo.fenhi.uh.cu>
Message-ID: <320fb6e01001060710g50a66b8k160d3f0a8a1883f9@mail.gmail.com>

2010/1/6 Yasser Almeida Hernández:
> I used the "add" method in the Residue class for add the atom object to the
> residue. That's right that i need a structure object, but how i build this
> object "de novo" and a how add a new residue on it???

I've not tried that myself, so I don't have any suggestions beyond
looking over the documentation - or even the Bio.PDB code itself.

> I used the StructureBuilder class with the init_* methods (model, chain,
> residue etc.) and then the get_structure method, but it doesn't work:
>
> # EXPERIMENTAL CODE
> res.add(contact)   # Add a atom to the residue of interest
>
> output_structure = StructureBuilder.StructureBuilder()
> output_structure.init_structure('OUT_STRUCT')
>
> output_structure.init_model(0)
> output_structure.init_chain('X')
> output_structure.get_structure()

Note that the get_structure() call returns a structure, but you
are ignoring the return value.

> output_structure[0]['X'].add(res)
> io = PDBIO()
> io.set_structure(output_structure)
> pdb_out_filename = "cont_res_plus_contact.pdb"
> io.save(pdb_out_filename, output_structure)

Your code snippet is incomplete - which makes it harder to try to
follow what you are doing. It is missing all the import statements
and the definition of the res variable.

> I'm processing a hundred of pdb files, and i need this code for write
> residues and atoms in different conformational states...

Perhaps I had misunderstood - I thought you were starting
with a given PDB file, and wanted to select some particular
residues/atoms, and output a new partial PDB file with just
those bits. That should work using a Select class to create
a sub-structure from the original full structure from parsing
the original PDB file. i.e. You don't need to create a new
structure object "de novo".

Peter

From pedro.al at fenhi.uh.cu  Wed Jan  6 15:26:24 2010
From: pedro.al at fenhi.uh.cu (Yasser Almeida Hernández)
Date: Wed, 06 Jan 2010 10:26:24 -0500
Subject: [Biopython] Save custom structure...
Message-ID: <20100106102624.3a1bn46ns4444oos@correo.fenhi.uh.cu>

You thought right!! But my big doubt is how to use the Select class.
The example with the Gly selection in the FAQ document is not so clear
to me to apply on my problem.
Let's say:
I've selected the ASP 10 in the chain 'A' and the atom O1 in a ligand,
all in the pdb file 1xyz. How i use the Select class (sintaxis) to
write a new pdb file with only the residue/atom selected before? How
would be the code?
Thanks > Perhaps I had misunderstood - I thought you were starting > with a given PDB file, and wanted to select some particular > residues/atoms, and output a new partial PDB file with just > those bits. That should work using a Select class to create > a sub-structure from the original full structure from parsing > the original PDF file. i.e You don't need to create a new > structure object "de novo". -- Lic. Yasser Almeida Hern?ndez Center of Molecular Inmunology (CIM) Nanobiology Group P.O.Box 16040, Havana, Cuba Phone: (537) 271-7933, ext. 221 ---------------------------------------------------------------- Correo FENHI From biopython at maubp.freeserve.co.uk Wed Jan 6 15:49:24 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 6 Jan 2010 15:49:24 +0000 Subject: [Biopython] Save custom structure... In-Reply-To: <20100106102624.3a1bn46ns4444oos@correo.fenhi.uh.cu> References: <20100106102624.3a1bn46ns4444oos@correo.fenhi.uh.cu> Message-ID: <320fb6e01001060749x6e7d98bfnd138b20e5e8564ce@mail.gmail.com> 2010/1/6 Yasser Almeida Hern?ndez : > You thought right!! But my big doubt is how to use the Select class. The > example with the Gly selection in the FAQ document is not so clear to me to > apply on my problem. > Let's say: > I've selected the ASP 10 in the chain 'A' and the atom O1 in a ligand, all > in the pdb file 1xyz. How i use the Select class (sintaxis) to write a new > pdb file with only the residue/atom selected before? How would be the code? Have you got a real example? There is no Asp10 in PDB file 1xyz. 
But, if for the sake of argument you wanted Arg519 (in either chain)
in 1xyz you could do it like this - based on the following example I
pointed to earlier:

http://lists.open-bio.org/pipermail/biopython/2009-May/005174.html

from Bio.PDB import Select, PDBIO
from Bio.PDB.PDBParser import PDBParser

class MySelector(Select):
    def accept_residue(self, residue):
        #Only want Arg519 (in any chain)
        return residue.resname=="ARG" and residue.id[1]==519

s=PDBParser().get_structure("1XYZ", "1XYZ.pdb")
io=PDBIO()
io.set_structure(s)
io.save("1XYZ-interesting.pdb", select=MySelector())
print("Done")


Peter

From pedro.al at fenhi.uh.cu  Wed Jan  6 16:24:43 2010
From: pedro.al at fenhi.uh.cu (Yasser Almeida Hernández)
Date: Wed, 06 Jan 2010 11:24:43 -0500
Subject: [Biopython] Save custom structure...
Message-ID: <20100106112443.0ewqh402gws8sos4@correo.fenhi.uh.cu>

I just set Asp10 in 1xyz as a hypothetical residue in a hypothetical
structure.
One last thing: to select the CB atom in that Arg519 with the
MySelector class and return it with the residue, how it would be...?

Thanks

> Have you got a real example? There is no Asp10 in PDB file 1xyz.
> But, if for the sake of argument you wanted Arg519 (in either chain)
> in 1xyz you could do it like this - based on the following example I
> pointed to earlier:
>
> http://lists.open-bio.org/pipermail/biopython/2009-May/005174.html
>
> from Bio.PDB import Select, PDBIO
> from Bio.PDB.PDBParser import PDBParser
>
> class MySelector(Select):
>     def accept_residue(self, residue):
>         #Only want Arg519 (in any chain)
>         return residue.resname=="ARG" and residue.id[1]==519
>
> s=PDBParser().get_structure("1XYZ", "1XYZ.pdb")
> io=PDBIO()
> io.set_structure(s)
> io.save("1XYZ-interesting.pdb", select=MySelector())
> print("Done")
>
>
> Peter

--
Lic. Yasser Almeida Hernández
Center of Molecular Inmunology (CIM)
Nanobiology Group
P.O.Box 16040, Havana, Cuba
Phone: (537) 271-7933, ext. 221

----------------------------------------------------------------
Correo FENHI



From biopython at maubp.freeserve.co.uk  Wed Jan  6 16:56:09 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 6 Jan 2010 16:56:09 +0000
Subject: [Biopython] Save custom structure...
In-Reply-To: <20100106112443.0ewqh402gws8sos4@correo.fenhi.uh.cu>
References: <20100106112443.0ewqh402gws8sos4@correo.fenhi.uh.cu>
Message-ID: <320fb6e01001060856g2a1fc7c4te23fa15041a86537@mail.gmail.com>

2010/1/6 Yasser Almeida Hernández:
> I just set Asp10 in 1xyz as a hypothetical residue in a hypothetical
> structure.

I thought so - but I was hoping for a concrete example, where you
can describe explicitly which bits you are trying to select.

> One last thing: to select the CB atom in that Arg519 with the MySelector
> class and return it with the residue, how it would be...?

So out of the entire chain, you just want atom CB from residue Arg519?
Try this then, it will give you a tiny PDB file with just two atoms, the
CB from Arg519 in the two chains.

from Bio.PDB import Select, PDBIO
from Bio.PDB.PDBParser import PDBParser

class MySelector(Select):
    def accept_residue(self, residue):
        #Only want Arg519 (in any chain)
        return residue.resname=="ARG" and residue.id[1]==519
    def accept_atom(self, atom):
        #Only want the CB atom (in residue Arg519)
        return atom.name == "CB"

s=PDBParser().get_structure("1XYZ", "1XYZ.pdb")
io=PDBIO()
io.set_structure(s)
io.save("1XYZ-interesting.pdb", select=MySelector())
print("Done")

Peter

From pedro.al at fenhi.uh.cu  Thu Jan  7 14:14:28 2010
From: pedro.al at fenhi.uh.cu (Yasser Almeida Hernández)
Date: Thu, 07 Jan 2010 09:14:28 -0500
Subject: [Biopython] Save custom structure...
Message-ID: <20100107091428.o9i82r9lsg8kw8sk@correo.fenhi.uh.cu>

Yes, out of entire chain.
Here's the concrete example:

I have two pdb. The first is ligand-bounded (1BCX) and the other is
ligand-free (1BVV).
In the first i want to save the Tyr166 and the ligand atom O3B, both
in a pdb file. In the second i want to save the same equivalent Tyr166
and the ligand atom of the first pdb file, both in other pdb file...

I hope this is more clear...
Thanks...

> So out of the entire chain, you just want atom CB from residue Arg519?
> Try this then, it will give you a tiny PDB file with just two atoms, the
> CB from Arg519 in the two chains.
>
> from Bio.PDB import Select, PDBIO
> from Bio.PDB.PDBParser import PDBParser
>
> class MySelector(Select):
>     def accept_residue(self, residue):
>         #Only want Arg519 (in any chain)
>         return residue.resname=="ARG" and residue.id[1]==519
>     def accept_atom(self, atom):
>         #Only want the CB atom (in residue Arg519)
>         return atom.name == "CB"
>
> s=PDBParser().get_structure("1XYZ", "1XYZ.pdb")
> io=PDBIO()
> io.set_structure(s)
> io.save("1XYZ-interesting.pdb", select=MySelector())
> print("Done")
>
> Peter

--
Lic. Yasser Almeida Hernández
Center of Molecular Inmunology (CIM)
Nanobiology Group
P.O.Box 16040, Havana, Cuba
Phone: (537) 271-7933, ext. 221

----------------------------------------------------------------
Correo FENHI



From biopython at maubp.freeserve.co.uk  Thu Jan  7 16:08:59 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 7 Jan 2010 16:08:59 +0000
Subject: [Biopython] Could Bio.SeqIO write EMBL file?
In-Reply-To: <246181F8-03A2-49B2-960D-1FFBBC3E2865@sanger.ac.uk>
References: <320fb6e01001060515y23804dd5xebc326647fa975d8@mail.gmail.com>
	<246181F8-03A2-49B2-960D-1FFBBC3E2865@sanger.ac.uk>
Message-ID: <320fb6e01001070808k57311f43tb67c9cf916d27eab@mail.gmail.com>

On Wed, Jan 6, 2010 at 1:28 PM, Anne Pajon wrote:
> Hi Peter,
>
> Thanks again for this fast answer.
>
> You've been fixing code for me recently on fasta-m10 al_start and al_end, so
> I am now working with the development version of biopython from git. I have
> no problem of updating it and testing it here.

Great.
I've just committed very basic EMBL output support to our main branch on git. This is a stepping stone, deliberately a partial solution only for now, to make sure the basics seem to work (dealing with the sequence and identifiers, but nothing about the detailed annotation). In particular, I have deliberately not implemented feature support (yet - the existing code for writing a GenBank feature table will need to be tweaked to cover EMBL feature tables as well). I realise that in the current state this isn't going to be especially useful for you, but if you can have a look anyway and let me know if there is anything amiss that would be helpful. e.g. Make sure your favourite tools like the EMBL files Biopython produces. What do you use? Artemis? > I am working with about 30 bacteria genomes from the human gut and waiting > 100 more genomes to work with this year. I can send you one of the file if > you wish. Just let me know. You could send me one off list if you like - but its probably unnecessary for now. Regards, Peter From biopython at maubp.freeserve.co.uk Thu Jan 7 18:14:20 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 7 Jan 2010 18:14:20 +0000 Subject: [Biopython] Could Bio.SeqIO write EMBL file? In-Reply-To: <320fb6e01001070808k57311f43tb67c9cf916d27eab@mail.gmail.com> References: <320fb6e01001060515y23804dd5xebc326647fa975d8@mail.gmail.com> <246181F8-03A2-49B2-960D-1FFBBC3E2865@sanger.ac.uk> <320fb6e01001070808k57311f43tb67c9cf916d27eab@mail.gmail.com> Message-ID: <320fb6e01001071014w50fa780ct2fdaccf2b9272cdc@mail.gmail.com> On Thu, Jan 7, 2010 at 4:08 PM, Peter wrote: > > Great. I've just committed very basic EMBL output support to our main > branch on git. This is a stepping stone, deliberately a partial solution only > for now, to make sure the basics seem to work (dealing with the sequence > and identifiers, but nothing about the detailed annotation). 
In particular, > I have deliberately not implemented feature support (yet - the existing > code for writing a GenBank feature table will need to be tweaked to > cover EMBL feature tables as well). > > I realise that in the current state this isn't going to be especially useful > for you, but if you can have a look anyway and let me know if there is > anything amiss that would be helpful. e.g. Make sure your favourite > tools like the EMBL files Biopython produces. What do you use? > Artemis? I did some more work, including writing CO lines for CONTIG records, but when testing realised our EMBL parser doesn't (yet) cope with them: http://bugzilla.open-bio.org/show_bug.cgi?id=2980 Peter From daniel at dim.fm.usp.br Thu Jan 7 18:51:05 2010 From: daniel at dim.fm.usp.br (Daniel Silvestre) Date: Thu, 07 Jan 2010 16:51:05 -0200 Subject: [Biopython] Why so few recipes in the cookbook? In-Reply-To: <20091221131148.GB21580@sobchak.mgh.harvard.edu> (sfid-+20091221-111151-+000.00-1@spamfilter.osbf.lua) References: <4B2A8B48.50302@dim.fm.usp.br> <320fb6e00912171316y5e514052sabaf2a0104a558ac@mail.gmail.com> <4B2B6DE2.3080500@dim.fm.usp.br> <320fb6e00912180457x31b3c48bl680d48d6b95fdab0@mail.gmail.com> <4B2B8CC3.3090307@dim.fm.usp.br> <320fb6e00912180700w49d3be87r53b1a5201c84461b@mail.gmail.com> <4B2BAE35.2070404@dim.fm.usp.br> <320fb6e00912181442r60348fcwf15776a0451bc6a1@mail.gmail.com> <320fb6e00912210403q5dd4c0d7xf06c9a850ecde9db@mail.gmail.com> <20091221131148.GB21580@sobchak.mgh.harvard.edu> (sfid-+20091221-111151-+000.00-1@spamfilter.osbf.lua) Message-ID: <4B462D19.2050906@dim.fm.usp.br> Hi people, This year will be fantastic for bioinformaticians/biologist and hybrids like me !!! As a side product of my thesis, I'm preparing some courses in bioinformatics oriented to biologists and physicians (I work at a very large medical complex with lots of underused fine clusters). And, of course, I'll need help to shape up the examples in a more OO way. 
Most of my work is done in pure C99 (a lot of void pointers) and I mainly use python as an interface between databases and my small programs. But, for the moment, my thesis will hang up this project a little. Nevertheless, it's in the second place on my priority queue. By the way, the bloggers from Blue Collar, Programming for Scientists, Yokofakun and related are on the list? They have nice examples that really work. So, the cookbook will take off !!! This is a promise. I really want to use it on my classes. See you very soon, Daniel Brad Chapman wrote: > Peter and Daniel; > Really interesting discussion. Documentation is an area that can > always use more work to appeal to a wider audience. > > Daniel: >>> While this tutorial is enough to CS-oriented guys, it's a really big >>> step to grasp such information for people from other communities. >>> That's why I'm always a little confused about the idea behind bio >>> projects. If the idea is programming of scientists, the approach is >>> way too CS. > > This stresses why we actively encourage contributions from biologists > as well. Many of the contributors to Biopython tend more towards the > programming/bioinformatics side, since that experience helps in building > up and appreciating a re-usable toolkit. When those same people write > documentation, it is going to be naturally biased towards the sort of > work they do. > > I'd definitely encourage you, and anyone else who might be > interested, to build up examples that are more intuitive to those > coming at the work from a different starting point. This is exactly > the idea behind starting up the cookbook on the wiki; it's all > freely editable, so dig right in. > > Brad > -- +---------------------------------------+ Daniel de A. M. M. Silvestre LIM01 - Laborat?rio de Inform?tica M?dica - HCFMUSP Sala 1349 - Depto. de Patologia Faculdade de Medicina Universidade de S?o Paulo Av. Dr. 
Arnaldo, 455 | e-mail: daniel at dim.fm.usp.br Cerqueira C?sar | Tel: +55-11-3061-7381 01246-903 - S?o Paulo - SP | Cel: +55-11-8042-9369 BRASIL | Skype: jarretinha --------------------------------------------------------------------- Esta mensagem pode conter informacao confidencial. Se voce nao for o destinatario ou a pessoa autorizada a receber esta mensagem, nao podera usar, copiar ou divulgar as informacoes nela contidas ou tomar qualquer acao baseada nessas informacoes. Se voce recebeu esta mensagem por engano, favor avisar imediatamente o remetente, respondendo o e-mail e, em seguida, apague-o. Agradecemos sua cooperacao. This message may contain confidential information. If you are not the addressee or authorized person to receive it for the addressee, you must not use, copy, disclose or take any action based on this message or any information herein. If you have received this message in error, please advise the sender immediately by replying this e-mail message and delete it. Thanks in advance for your cooperation. ---------------------------------------------------------------------- DIM Faculdade de Medicina USP ---------------------------------------------------------------------- -------------- next part -------------- A non-text attachment was scrubbed... Name: daniel.vcf Type: text/x-vcard Size: 375 bytes Desc: not available URL: From msameet at gmail.com Fri Jan 8 07:09:32 2010 From: msameet at gmail.com (Sameet Mehta) Date: Fri, 8 Jan 2010 12:39:32 +0530 Subject: [Biopython] how to obtain official Gene Symbols for a list of GeneNames Message-ID: <380bc9b31001072309s703b0d6dv6bfe00490f91f05f@mail.gmail.com> Hi All, I have a few lists of gene names/gene symbols for some old (5 year old) microarray experiments. I want to find out the official Gene Symbols for all of these genes. Is there a way to do it in Biopython. 
regards
Sameet

--
Sameet Mehta, Ph.D.,
Research Associate, Chromatin Biology Laboratory,
National Centre for Cell Science,
NCCS Complex, University of Pune Campus, Pune 411007
Phone: +91-20-25708158
Other Email: sameet at nccs.res.in

From p.j.a.cock at googlemail.com Fri Jan 8 10:12:27 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 8 Jan 2010 10:12:27 +0000
Subject: [Biopython] how to obtain official Gene Symbols for a list of GeneNames
In-Reply-To: <380bc9b31001072309s703b0d6dv6bfe00490f91f05f@mail.gmail.com>
References: <380bc9b31001072309s703b0d6dv6bfe00490f91f05f@mail.gmail.com>
Message-ID: <320fb6e01001080212g1cacb48cm620b8c1374cf2e23@mail.gmail.com>

On Fri, Jan 8, 2010 at 7:09 AM, Sameet Mehta wrote:
> Hi All,
>
> I have a few lists of gene names/gene symbols for some old (5 year
> old) microarray experiments. I want to find out the official Gene
> Symbols for all of these genes. Is there a way to do it in Biopython.
>
> regards
> Sameet

I'd start by working out whose gene names/gene symbols they are. What kind of microarrays are you using? For a custom chip you may have to talk to whoever designed it, but for mainstream commercial chips there should be lookup tables, either on the manufacturer's website or perhaps in R/Bioconductor. Note you can actually combine R/Bioconductor with Python using rpy2 (or its predecessor, rpy).
For examples, see: http://bcbio.wordpress.com/2010/01/02/automated-retrieval-of-expression-data-with-python-and-r/ http://www.warwick.ac.uk/go/peter_cock/python/heatmap/ Peter From p.j.a.cock at googlemail.com Fri Jan 8 10:47:49 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 8 Jan 2010 10:47:49 +0000 Subject: [Biopython] how to obtain official Gene Symbols for a list of GeneNames In-Reply-To: <380bc9b31001080220q6ed2455s43f48f9830a05d93@mail.gmail.com> References: <380bc9b31001072309s703b0d6dv6bfe00490f91f05f@mail.gmail.com> <320fb6e01001080212g1cacb48cm620b8c1374cf2e23@mail.gmail.com> <380bc9b31001080220q6ed2455s43f48f9830a05d93@mail.gmail.com> Message-ID: <320fb6e01001080247o3890a75fyb6e9b6080dcb9fb2@mail.gmail.com> Please CC the mailing list in replies. On Fri, Jan 8, 2010 at 10:20 AM, Sameet Mehta wrote: > Hi, > Thanks for the reply. ?What I have are the old GeneSymbols. ?I have > already selected the genes of interest based on expression profiles. > But I need their current GeneSymbols, so that I can do GO-Term > enrichment. Yes, but which GeneSymbols do you have? There are lots of different ones (including different species - for human you would probably be talking about the HUGO Gene Nomenclature Committee assigned symbols). Assuming your particular gene symbols are covered, then using NCBI Entrez and the Gene database might work (try ELink?). 
Peter From dalloliogm at gmail.com Fri Jan 8 11:06:45 2010 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Fri, 8 Jan 2010 12:06:45 +0100 Subject: [Biopython] how to obtain official Gene Symbols for a list of GeneNames In-Reply-To: <320fb6e01001080247o3890a75fyb6e9b6080dcb9fb2@mail.gmail.com> References: <380bc9b31001072309s703b0d6dv6bfe00490f91f05f@mail.gmail.com> <320fb6e01001080212g1cacb48cm620b8c1374cf2e23@mail.gmail.com> <380bc9b31001080220q6ed2455s43f48f9830a05d93@mail.gmail.com> <320fb6e01001080247o3890a75fyb6e9b6080dcb9fb2@mail.gmail.com> Message-ID: <5aa3b3571001080306s7fa4bfe4x102e7cc58fe84b84@mail.gmail.com> On Fri, Jan 8, 2010 at 11:47 AM, Peter Cock wrote: > Please CC the mailing list in replies. > > On Fri, Jan 8, 2010 at 10:20 AM, Sameet Mehta wrote: >> Hi, >> Thanks for the reply. ?What I have are the old GeneSymbols. ?I have >> already selected the genes of interest based on expression profiles. >> But I need their current GeneSymbols, so that I can do GO-Term >> enrichment. I would do it with BioMart, as it already has all the datasets available and it makes it possible to do it without programming at all. I know you can do it with biopython, but this is just a one-time job, maybe it is not necessary... In any case, it is true that you can't do it without knowing which GeneSymbols you are using and with which version they were annotated. > > Yes, but which GeneSymbols do you have? There are lots of > different ones (including different species - for human you would > probably be talking about the HUGO Gene Nomenclature > Committee assigned symbols). > > Assuming your particular gene symbols are covered, then using > NCBI Entrez and the Gene database might work (try ELink?). 
> > Peter > > _______________________________________________ > Biopython mailing list ?- ?Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- Giovanni Dall'Olio, phd student Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain) My blog on bioinformatics: http://bioinfoblog.it From msameet at gmail.com Fri Jan 8 11:25:27 2010 From: msameet at gmail.com (Sameet Mehta) Date: Fri, 08 Jan 2010 16:55:27 +0530 Subject: [Biopython] how to obtain official Gene Symbols for a list of GeneNames In-Reply-To: <320fb6e01001080247o3890a75fyb6e9b6080dcb9fb2@mail.gmail.com> References: <380bc9b31001072309s703b0d6dv6bfe00490f91f05f@mail.gmail.com> <320fb6e01001080212g1cacb48cm620b8c1374cf2e23@mail.gmail.com> <380bc9b31001080220q6ed2455s43f48f9830a05d93@mail.gmail.com> <320fb6e01001080247o3890a75fyb6e9b6080dcb9fb2@mail.gmail.com> Message-ID: <4B471627.70406@gmail.com> Hi, I was wondering about using the NCBI Gene Database. I dont know where to begin. If you could help with some skeleton code, I could take it from there. regards Sameet On 01/08/2010 04:17 PM, Peter Cock wrote: > Please CC the mailing list in replies. > > On Fri, Jan 8, 2010 at 10:20 AM, Sameet Mehta wrote: > >> Hi, >> Thanks for the reply. What I have are the old GeneSymbols. I have >> already selected the genes of interest based on expression profiles. >> But I need their current GeneSymbols, so that I can do GO-Term >> enrichment. >> > Yes, but which GeneSymbols do you have? There are lots of > different ones (including different species - for human you would > probably be talking about the HUGO Gene Nomenclature > Committee assigned symbols). > > Assuming your particular gene symbols are covered, then using > NCBI Entrez and the Gene database might work (try ELink?). 
> > Peter > > From p.j.a.cock at googlemail.com Fri Jan 8 11:35:57 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 8 Jan 2010 11:35:57 +0000 Subject: [Biopython] how to obtain official Gene Symbols for a list of GeneNames In-Reply-To: <4B471627.70406@gmail.com> References: <380bc9b31001072309s703b0d6dv6bfe00490f91f05f@mail.gmail.com> <320fb6e01001080212g1cacb48cm620b8c1374cf2e23@mail.gmail.com> <380bc9b31001080220q6ed2455s43f48f9830a05d93@mail.gmail.com> <320fb6e01001080247o3890a75fyb6e9b6080dcb9fb2@mail.gmail.com> <4B471627.70406@gmail.com> Message-ID: <320fb6e01001080335n40bd3041n71779cd5c1df83c1@mail.gmail.com> On Fri, Jan 8, 2010 at 11:25 AM, Sameet Mehta wrote: > Hi, > I was wondering about using the NCBI Gene Database. ?I dont know where > to begin. If you could help with some skeleton code, I could take it > from there. How about telling us two or three of your old gene symbols, what they are from, and the desired new gene symbols? If you can manage to do this manually via the Entrez website, that would also be very helpful for doing it automatically via a script. Peter From biopython at maubp.freeserve.co.uk Fri Jan 8 12:48:07 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 8 Jan 2010 12:48:07 +0000 Subject: [Biopython] Could Bio.SeqIO write EMBL file? 
In-Reply-To: <320fb6e01001071014w50fa780ct2fdaccf2b9272cdc@mail.gmail.com> References: <320fb6e01001060515y23804dd5xebc326647fa975d8@mail.gmail.com> <246181F8-03A2-49B2-960D-1FFBBC3E2865@sanger.ac.uk> <320fb6e01001070808k57311f43tb67c9cf916d27eab@mail.gmail.com> <320fb6e01001071014w50fa780ct2fdaccf2b9272cdc@mail.gmail.com> Message-ID: <320fb6e01001080448g24ff5308gb797f01cbab79196@mail.gmail.com> On Thu, Jan 7, 2010 at 6:14 PM, Peter wrote: > > I did some more work, including writing CO lines for CONTIG records, > but when testing realised our EMBL parser doesn't (yet) cope with them: > http://bugzilla.open-bio.org/show_bug.cgi?id=2980 > OK, now EMBL contig records seem to be working :) Peter From p.j.a.cock at googlemail.com Fri Jan 8 16:48:40 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 8 Jan 2010 16:48:40 +0000 Subject: [Biopython] how to obtain official Gene Symbols for a list of GeneNames In-Reply-To: <380bc9b31001080809l531a8c17v109191fc7783d0f5@mail.gmail.com> References: <380bc9b31001072309s703b0d6dv6bfe00490f91f05f@mail.gmail.com> <320fb6e01001080212g1cacb48cm620b8c1374cf2e23@mail.gmail.com> <380bc9b31001080220q6ed2455s43f48f9830a05d93@mail.gmail.com> <320fb6e01001080247o3890a75fyb6e9b6080dcb9fb2@mail.gmail.com> <4B471627.70406@gmail.com> <320fb6e01001080335n40bd3041n71779cd5c1df83c1@mail.gmail.com> <380bc9b31001080809l531a8c17v109191fc7783d0f5@mail.gmail.com> Message-ID: <320fb6e01001080848w3fcac1cg9250e85dc21c82dd@mail.gmail.com> Please CC the mailing list. On Fri, Jan 8, 2010 at 4:09 PM, Sameet Mehta wrote: > Hi, > My list contains gene names such as DKFZP586P0123 , RPL6, etc. ?What I > do is search this in the NCBI Gene database manually, and then i get > the official Gene Symbol. ?I want to automate this process. ?I am of > course interested only in official gene symbols from the Humans. > > Sameet OK, so via my browser using Entrez Gene, I used: DKFZP586P0123 "Homo sapiens"[orgn] This maps uniquely to C2CD3. 
However,

RPL6 "Homo sapiens"[orgn]

maps to several hits (some discontinued), including things like RPL6P13. Clearly we need to make the search a little more specific... we only want to search for a name or gene symbol (not the default search on all fields).

It looks like searching on "gene" works nicely, see also:
http://news.open-bio.org/news/2009/06/ncbi-einfo-biopython/

Entrez queries like these seem to give unique matches:

DKFZP586P0123[gene] "Homo sapiens"[orgn]
RPL6[gene] "Homo sapiens"[orgn]

e.g.

>>> from Bio import Entrez
>>> Entrez.email = "Your.Name.Here@example.com"
>>> search = Entrez.read(Entrez.esearch(db='gene', term='DKFZP586P0123[gene] "Homo sapiens"[orgn]', retmode='xml'))
>>> print(search["IdList"])
['26005']

That unique ID we got back (26005) is the UID for this gene, which you should be able to use with EFetch (or ELink?). e.g. You could download the whole record as XML, and parse that:

>>> result = Entrez.read(Entrez.efetch(db='gene', id='26005', retmode='xml'))
>>> result[0]['Entrezgene_gene']['Gene-ref']['Gene-ref_locus']
'C2CD3'

However, this next approach is a much quicker download, and so looks like a more efficient way to get the desired gene symbol:

>>> print(Entrez.efetch(db='gene', id='26005', retmode='text', rettype='brief').read())

1: C2CD3 C2 calcium-depend... [GeneID: 26005]

Next read the Entrez chapter in the Biopython Tutorial, especially the bit about the history functionality for linking ESearch and EFetch.
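
For anyone following along, here is a rough sketch of how the history functionality could tie ESearch and EFetch together for a whole list of old names. It is not from the Tutorial: the helper names (build_gene_query, build_batch_query, fetch_gene_records, official_symbols) are made up for illustration, and joining the names with OR in one query is an assumption about what suits your list. The record path used to pull out the symbol is the one from the XML example above.

```python
def build_gene_query(symbol, organism="Homo sapiens"):
    """Return an Entrez Gene query restricted to the gene-name field."""
    return '%s[gene] "%s"[orgn]' % (symbol, organism)


def build_batch_query(symbols, organism="Homo sapiens"):
    """OR several gene-name terms together, then restrict by organism."""
    names = " OR ".join("%s[gene]" % s for s in symbols)
    return '(%s) AND "%s"[orgn]' % (names, organism)


def official_symbols(records):
    """Pull the locus (official symbol) out of parsed Entrez Gene XML records."""
    return [r["Entrezgene_gene"]["Gene-ref"]["Gene-ref_locus"] for r in records]


def fetch_gene_records(symbols, email, organism="Homo sapiens"):
    """Search Entrez Gene once, then fetch every hit via the history server.

    Needs network access and Biopython. The import is done here (hypothetical
    design choice) so the query helpers above work without either.
    """
    from Bio import Entrez

    Entrez.email = email
    search = Entrez.read(
        Entrez.esearch(
            db="gene",
            term=build_batch_query(symbols, organism),
            usehistory="y",  # keep the hit list on the NCBI history server
        )
    )
    if int(search["Count"]) == 0:
        return []
    # Refer back to the stored search instead of passing a long list of IDs.
    handle = Entrez.efetch(
        db="gene",
        webenv=search["WebEnv"],
        query_key=search["QueryKey"],
        retmode="xml",
    )
    return Entrez.read(handle)
```

Something like official_symbols(fetch_gene_records(["DKFZP586P0123", "RPL6"], "you@example.com")) would then give the symbols in one round trip. Note that a batch query does not tell you which old name produced which record, so if you need that mapping it may be simpler to loop over build_gene_query one name at a time.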
Peter From msameet at gmail.com Fri Jan 8 17:13:24 2010 From: msameet at gmail.com (Sameet Mehta) Date: Fri, 8 Jan 2010 22:43:24 +0530 Subject: [Biopython] how to obtain official Gene Symbols for a list of GeneNames In-Reply-To: <320fb6e01001080848w3fcac1cg9250e85dc21c82dd@mail.gmail.com> References: <380bc9b31001072309s703b0d6dv6bfe00490f91f05f@mail.gmail.com> <320fb6e01001080212g1cacb48cm620b8c1374cf2e23@mail.gmail.com> <380bc9b31001080220q6ed2455s43f48f9830a05d93@mail.gmail.com> <320fb6e01001080247o3890a75fyb6e9b6080dcb9fb2@mail.gmail.com> <4B471627.70406@gmail.com> <320fb6e01001080335n40bd3041n71779cd5c1df83c1@mail.gmail.com> <380bc9b31001080809l531a8c17v109191fc7783d0f5@mail.gmail.com> <320fb6e01001080848w3fcac1cg9250e85dc21c82dd@mail.gmail.com> Message-ID: <380bc9b31001080913p15ba950xb787460b98ef76b9@mail.gmail.com> Thanks Peter, that is something i was looking for. thanks for the help. regards Sameet On Fri, Jan 8, 2010 at 10:18 PM, Peter Cock wrote: > Please CC the mailing list. > > On Fri, Jan 8, 2010 at 4:09 PM, Sameet Mehta wrote: >> Hi, >> My list contains gene names such as DKFZP586P0123 , RPL6, etc. ?What I >> do is search this in the NCBI Gene database manually, and then i get >> the official Gene Symbol. ?I want to automate this process. ?I am of >> course interested only in official gene symbols from the Humans. >> >> Sameet > > OK, so via my browser using Entrez Gene, I used: > > DKFZP586P0123 "Homo sapiens"[orgn] > > This maps uniquely to C2CD3. However, > > RPL6 "Homo sapiens"[orgn] > > maps to several hits (some discontinued) included things like > RPL6P13. Clearly we need to make the search a little more > specific... we only want to search for a name or gene symbol > (not the default search on all fields). 
> > It looks like searching on "gene" works nicely, see also: > http://news.open-bio.org/news/2009/06/ncbi-einfo-biopython/ > > Entrez queries like these seem to give unique matches: > > DKFZP586P0123[gene] "Homo sapiens"[orgn] > RPL6[gene] "Homo sapiens"[orgn] > > e.g. > >>>> from Bio import Entrez >>>> Entrez.email = "Your.Name.Here at example.com" >>>> search = Entrez.read(Entrez.esearch(db='gene', term='DKFZP586P0123[gene] "Homo sapiens"[orgn]', retmode='xml')) >>>> print search["IdList"] > ['26005'] > > That unique ID we got back (26005) is the UID for this gene, which > you should be able to use with EFetch (or Elink?). e.g. You could > download the whole record as XML, and parse that: > >>>> result = Entrez.read(Entrez.efetch(db='gene', id='26005', retmode='xml')) >>>> result[0]['Entrezgene_gene']['Gene-ref']['Gene-ref_locus'] > 'C2CD3' > > However, this next approach is a much quicker download, and so > looks like a more efficient way to get the desired gene symbol: > >>>> print Entrez.efetch(db='gene', id='26005', retmode='text', rettype='brief').read() > > 1: C2CD3 C2 calcium-depend... [GeneID: 26005] > > Next read the Entrez chapter in the Biopython Tutorial, especially > the bit about the history functionality for linking ESearch and EFetch. > > Peter > -- Sameet Mehta, Ph.D., Research Associate, Chromatin Biology Laboratory, National Centre for Cell Science, NCCS Complex, University of Pune Campus, Pune 411007 Phone: +91-20-25708158 Other Email: sameet at nccs.res.in From bnbowman at gmail.com Sat Jan 9 01:34:34 2010 From: bnbowman at gmail.com (Brett Bowman) Date: Fri, 8 Jan 2010 17:34:34 -0800 Subject: [Biopython] Organism specific NCBIWWW qblast Message-ID: <627d998d1001081734x16bd2325g60ae36b69fbe922d@mail.gmail.com> Hello gents, I'm trying to create a dataset of proteins that are both highly similar to, and from the same species as, my query sequence. 
Since I'm going to be doing this repeatedly, for many different query sequences, I'm trying to automate the process with Biopython, but I can't figure out how to enable organism-specific blasts with qblast. Any guidance be greatly appreciated. -Brett Bowman Woelk Lab UCSD School of Medicine UCSD/SDSU Joint Bioinformatics Program From bnbowman at gmail.com Sat Jan 9 22:53:31 2010 From: bnbowman at gmail.com (Brett Bowman) Date: Sat, 9 Jan 2010 14:53:31 -0800 Subject: [Biopython] Blank Returns from Entrez.efetch() Message-ID: <627d998d1001091453m60c78a8dy408e248049fd1b6e@mail.gmail.com> I'm trying to query Entrez for a series of protein IDs with Biopython, but not having much success. The sample code given in the tutorial works perfectly: >>> handle = Entrez.efetch(db="nucleotide", id="186972394", rettype="gb") >>> print handle.read() But when I change that to proteins and my IDs, I get an empty handle as a result: >>> handle = Entrez.efetch(db="protein", id="Q81T62.1", rettype="gb") I've tried this on Biopython 1.51 and 1.53, installed on Ubuntu 9.10, and I've tried it with every rettype imaginable, with no success. Any ideas as to where I am going wrong? -Brett Bowman Woelk Lab UCSD School of Medicine UCSD/SDSU Joint Program in Bioinformatics From mjldehoon at yahoo.com Sun Jan 10 02:52:36 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sat, 9 Jan 2010 18:52:36 -0800 (PST) Subject: [Biopython] Blank Returns from Entrez.efetch() In-Reply-To: <627d998d1001091453m60c78a8dy408e248049fd1b6e@mail.gmail.com> Message-ID: <951022.75982.qm@web62407.mail.re1.yahoo.com> Have you looked at the EUtils examples on the NCBI website? It shows one example for efetch from the protein database. --Michiel. 
--- On Sat, 1/9/10, Brett Bowman wrote:

> From: Brett Bowman
> Subject: [Biopython] Blank Returns from Entrez.efetch()
> To: biopython at biopython.org
> Date: Saturday, January 9, 2010, 5:53 PM
>
> I'm trying to query Entrez for a series of protein IDs with Biopython,
> but not having much success. The sample code given in the tutorial
> works perfectly:
>
> >>> handle = Entrez.efetch(db="nucleotide", id="186972394", rettype="gb")
> >>> print handle.read()
>
> But when I change that to proteins and my IDs, I get an empty handle
> as a result:
>
> >>> handle = Entrez.efetch(db="protein", id="Q81T62.1", rettype="gb")
>
> I've tried this on Biopython 1.51 and 1.53, installed on Ubuntu 9.10,
> and I've tried it with every rettype imaginable, with no success. Any
> ideas as to where I am going wrong?
>
> -Brett Bowman
> Woelk Lab
> UCSD School of Medicine
> UCSD/SDSU Joint Program in Bioinformatics
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From chapmanb at 50mail.com Sun Jan 10 13:01:32 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Sun, 10 Jan 2010 08:01:32 -0500
Subject: [Biopython] Blank Returns from Entrez.efetch()
In-Reply-To: <951022.75982.qm@web62407.mail.re1.yahoo.com>
References: <627d998d1001091453m60c78a8dy408e248049fd1b6e@mail.gmail.com> <951022.75982.qm@web62407.mail.re1.yahoo.com>
Message-ID: <20100110130132.GF9694@sobchak.mgh.harvard.edu>

Brett;

Brett:
> > But when I change that to proteins and my IDs, I get an empty handle
> > as a result:
> >
> > >>> handle = Entrez.efetch(db="protein", id="Q81T62.1", rettype="gb")

Michiel:
> Have you looked at the EUtils examples on the NCBI website? It shows
> one example for efetch from the protein database.

According to the efetch help here:
http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetchseq_help.html

the id parameter should work okay with an accession.version. So your example should work but something is wrong with how NCBI handles this particular record. Other accession.version identifiers do work, and so does the accession alone:

>>> handle = Entrez.efetch(db="protein", id="Q81T62", rettype="gb")

The safest way to do this is to use GenBank identifiers (GIDs) as the id argument. This requires one extra step to search for the record and get the ID. Note that the parsed search result is a dictionary, so the ID list is looked up directly by key:

>>> handle = Entrez.esearch(db="protein", retmax=1, term="Q81T62.1")
>>> rec = Entrez.read(handle)
>>> rec
{u'Count': '1', u'IdList': ['46395771'], u'QueryTranslation': 'Q81T62.1', u'RetMax': '1', u'RetStart': '0', u'TranslationSet': []}
>>> handle = Entrez.efetch(db="protein", id=rec['IdList'][0], rettype="gb")
>>> handle.readline()
'LOCUS Q81T62 429 aa linear BCT 15-DEC-2009\n'

Brad

From chapmanb at 50mail.com Sun Jan 10 13:17:26 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Sun, 10 Jan 2010 08:17:26 -0500
Subject: [Biopython] Organism specific NCBIWWW qblast
In-Reply-To: <627d998d1001081734x16bd2325g60ae36b69fbe922d@mail.gmail.com>
References: <627d998d1001081734x16bd2325g60ae36b69fbe922d@mail.gmail.com>
Message-ID: <20100110131726.GG9694@sobchak.mgh.harvard.edu>

Hi Brett;

> I'm trying to create a dataset of proteins that are both highly
> similar to, and from the same species as, my query sequence. Since
> I'm going to be doing this repeatedly, for many different query
> sequences, I'm trying to automate the process with Biopython, but I
> can't figure out how to enable organism-specific blasts with qblast.
> Any guidance be greatly appreciated.

You want to use the entrez_query argument to qblast: result_handle = NCBIWWW.qblast("blastn", "nr", record.format("fasta"), entrez_query="Mus musculus[orgn]") See these previous threads for more discussion: http://lists.open-bio.org/pipermail/biopython/2009-June/005215.html http://www.biopython.org/pipermail/biopython/2009-September/005616.html Once you've got a short example running it would be great if you could add it as an example to the online cookbook: http://biopython.org/wiki/Category:Cookbook A nice discussion there could help others in the future with the same issue. Thanks, Brad From biopython at maubp.freeserve.co.uk Mon Jan 11 16:22:56 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 11 Jan 2010 16:22:56 +0000 Subject: [Biopython] Could Bio.SeqIO write EMBL file? In-Reply-To: <320fb6e01001080448g24ff5308gb797f01cbab79196@mail.gmail.com> References: <320fb6e01001060515y23804dd5xebc326647fa975d8@mail.gmail.com> <246181F8-03A2-49B2-960D-1FFBBC3E2865@sanger.ac.uk> <320fb6e01001070808k57311f43tb67c9cf916d27eab@mail.gmail.com> <320fb6e01001071014w50fa780ct2fdaccf2b9272cdc@mail.gmail.com> <320fb6e01001080448g24ff5308gb797f01cbab79196@mail.gmail.com> Message-ID: <320fb6e01001110822q60ac7103ndb3082cc3146735@mail.gmail.com> Hi Anne, I've just checked in feature support to the new EMBL output in Bio.SeqIO (our main branch on git). If you could give that a test it would be very much appreciated. If you are on the dev mailing list, we can discuss issues there - otherwise we might as well continue on this thread. Thanks, Peter From biopython at maubp.freeserve.co.uk Mon Jan 11 16:38:33 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 11 Jan 2010 16:38:33 +0000 Subject: [Biopython] is there an updated tutorial on how to use the Wrappers for the new NCBI BLAST+ tools? 
In-Reply-To: <5b6410e1001030009v5420f146k26d723f33c4c5f59@mail.gmail.com>
References: <5b6410e1001020244s7b702ca8n760352052dba77c9@mail.gmail.com> <5b6410e1001030009v5420f146k26d723f33c4c5f59@mail.gmail.com>
Message-ID: <320fb6e01001110838n476bbca8q6d949162f62405bc@mail.gmail.com>

On Sun, Jan 3, 2010 at 8:09 AM, Kevin Lam wrote:
> Hmmm found this in the blast+ manual is it possible to integrate this
> somewhere in biopython? Cheers
> Kevin
>
>
> 3.1 For users of NCBI C Toolkit BLAST
>
> The easiest way to get started using these command line applications is by
> means of the legacy_blast.pl PERL script which is bundled along with the
> BLAST+ applications. To utilize this script, simply prefix it to the
> invocation of the C toolkit BLAST command line application and append the
> --path option pointing to the installation directory of the BLAST+
> applications. For example, instead of using
> blastall -i query -d nr -o blast.out
>
> use
> legacy_blast.pl blastall -i query -d nr -o blast.out --path /opt/blast/bin
>
> For more details, refer to the section titled Backwards compatibility
> script.

Hi Kevin,

I don't understand how you think the Biopython documentation should mention the legacy_blast.pl script. Could you explain?

If someone has an existing Biopython script written to call "legacy" BLAST via the Bio.Blast.NCBIStandalone "helper" function then it would be quite tricky to get this to call BLAST+ via legacy_blast.pl to convert the arguments. These "helper" functions are just too inflexible (we would probably have deprecated them anyway, even without the introduction of BLAST+ by the NCBI).

If someone was using the Bio.Blast.Applications wrapper to call "legacy" BLAST then they could do something like this:

import subprocess
from Bio.Blast.Applications import BlastallCommandline
cline = BlastallCommandline(...)
child = subprocess.Popen(str(cline), ...)

Then I guess they could make a hack like this in order to use BLAST+ via legacy_blast.pl without changing much code: import subprocess from Bio.Blast.Applications import BlastallCommandline cline = BlastallCommandline(...) hack_template = "legacy_blast.pl %s --path /opt/blast/bin" child = subprocess.Popen(hack_template % cline, ...) Peter From aboulia at gmail.com Mon Jan 11 16:46:22 2010 From: aboulia at gmail.com (Kevin) Date: Tue, 12 Jan 2010 00:46:22 +0800 Subject: [Biopython] is there an updated tutorial on how to use the Wrappers for the new NCBI BLAST+ tools? In-Reply-To: <320fb6e01001110838n476bbca8q6d949162f62405bc@mail.gmail.com> References: <5b6410e1001020244s7b702ca8n760352052dba77c9@mail.gmail.com> <5b6410e1001030009v5420f146k26d723f33c4c5f59@mail.gmail.com> <320fb6e01001110838n476bbca8q6d949162f62405bc@mail.gmail.com> Message-ID: Hi Peter, I was thinking of porting the legacy blast script to python as u r right about the helper script being inflexible. The documentation bit was actually about my first email about any updated doc on how to use blast+ with biopython Cheers Kevin Sent from my iPod On 12-Jan-2010, at 12:38 AM, Peter wrote: > On Sun, Jan 3, 2010 at 8:09 AM, Kevin Lam wrote: >> Hmmm found this in the blast+ manual is it possible to integrate this >> somewhere in biopython ?Cheers >> Kevin >> >> >> 3.1 For users of NCBI C Toolkit BLAST >> >> The easiest way to get started using these command line >> applications is by >> means of the legacy_blast.pl PERL script which is bundled along >> with the >> BLAST+ applications. To utilize this script, simply prefix it to the >> invocation of the C toolkit BLAST command line application and >> append the >> --path option pointing to the installation directory of the BLAST+ >> applications. 
For example, instead of using >> blastall -i query -d nr -o blast.out >> >> use >> legacy_blast.pl blastall -i query -d nr -o blast.out --path /opt/ >> blast/bin >> >> For more details, refer to the section titled Backwards compatibility >> script> > >> . > > Hi Kevin, > > I don't understand how you think the Biopython documentation > should mention the legacy_blast.pl script. Could you explain? > > If someone has an existing Biopython script written to call "legacy" > BLAST via the Bio.Blast.NCBIStandalone "helper" function then it > would be quite tricky to get this to call BLAST+ via legacy_blast.pl > to convert the arguments. These "helper" functions are just too > inflexible (we would probably have deprecated them anyway, even > without the introduction of BLAST+ by the NCBI). > > If someone was using the the Bio.Blast.Applications wrapper to > call "legacy" BLAST then they could do something like this: > > import subprocess > from Bio.Blast.Applications import BlastallCommandline > cline = BlastallCommandline(...) > child = subprocess.Popen(str(cline), ...) > > Then I guess they could make a hack like this in order to use > BLAST+ via legacy_blast.pl without changing much code: > > import subprocess > from Bio.Blast.Applications import BlastallCommandline > cline = BlastallCommandline(...) > hack_template = "legacy_blast.pl %s --path /opt/blast/bin" > child = subprocess.Popen(hack_template % cline, ...) > > Peter From biopython at maubp.freeserve.co.uk Mon Jan 11 17:08:55 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 11 Jan 2010 17:08:55 +0000 Subject: [Biopython] is there an updated tutorial on how to use the Wrappers for the new NCBI BLAST+ tools? 
In-Reply-To: References: <5b6410e1001020244s7b702ca8n760352052dba77c9@mail.gmail.com> <5b6410e1001030009v5420f146k26d723f33c4c5f59@mail.gmail.com> <320fb6e01001110838n476bbca8q6d949162f62405bc@mail.gmail.com> Message-ID: <320fb6e01001110908s4f19272bp71377c4f70004a24@mail.gmail.com> On Mon, Jan 11, 2010 at 4:46 PM, Kevin wrote: > Hi Peter, > I was thinking of porting the legacy blast script to python as u r right > about the helper script being inflexible. A python version of legacy_blast.pl isn't any more useful than the Perl version is it? Maybe I have misunderstood you. What would be nice is a way to help people update their old Biopython scripts which called legacy BLAST, so that they can be used on BLAST+ instead. I would expect in most cases this means scripts using the legacy BLAST "helper" functions in Bio.Blast.NCBIStandalone. One way to do this would be to add new BLAST+ versions of the "helper" functions (taking the same argument names as before), but that is just a stop gap (a temporary measure). We really want people using these old helper functions to switch to using the wrappers in Bio.Blast.Applications and subprocess instead. > The documentation bit was actually about my first email about any > updated doc on how to use blast+ with biopython I see. What do you think the current (Biopython 1.53) version of the tutorial needs in the BLAST chapter? http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf Thanks, Peter From ap12 at sanger.ac.uk Mon Jan 11 17:32:43 2010 From: ap12 at sanger.ac.uk (Anne Pajon) Date: Mon, 11 Jan 2010 17:32:43 +0000 Subject: [Biopython] Could Bio.SeqIO write EMBL file? 
In-Reply-To: <320fb6e01001110822q60ac7103ndb3082cc3146735@mail.gmail.com> References: <320fb6e01001060515y23804dd5xebc326647fa975d8@mail.gmail.com> <246181F8-03A2-49B2-960D-1FFBBC3E2865@sanger.ac.uk> <320fb6e01001070808k57311f43tb67c9cf916d27eab@mail.gmail.com> <320fb6e01001071014w50fa780ct2fdaccf2b9272cdc@mail.gmail.com> <320fb6e01001080448g24ff5308gb797f01cbab79196@mail.gmail.com> <320fb6e01001110822q60ac7103ndb3082cc3146735@mail.gmail.com> Message-ID: Hi Peter, Just tested now. It worked fine. Thanks a lot. Here is the diff between the EMBL output from Bio.SeqIO and the genbank output from Bio.SeqIO converted with the EMBOSS tool to an EMBL file: guest137:RAST ap12$ diff tmp.embl updated_files/ Alistipes_shahii_WAL8301_uRAST.embl 1c1 < ID unknown; SV 1; ; DNA; ; ; 3763317 BP. --- > ID unknown; SV 1; linear; unassigned DNA; STD; UNC; 3763317 BP. 5c5 < DE --- > KW . 8c8 < OC . --- > XX 10a11 > FH 1949,1950c1950 < FT /product="Peptidyl-prolyl cis-trans isomerase (EC < FT 5.2.1.8)" --- > FT /product="Peptidyl-prolyl cis-trans isomerase (EC 5.2.1.8)" 3346,3347c3346 < FT kinase/response regulator, hybrid ('one component < FT system')" --- > FT kinase/response regulator, hybrid ('one component system')" 3380,3381c3379 < FT /product="Iron-sulfur cluster assembly ATPase protein < FT SufC" --- > FT /product="Iron-sulfur cluster assembly ATPase protein SufC" 4811,4812c4809 < FT /product="Gamma-glutamyl phosphate reductase (EC < FT 1.2.1.41)" --- > FT /product="Gamma-glutamyl phosphate reductase (EC 1.2.1.41)" 5472,5473c5469 < FT /product="lipoprotein releasing system ATP- binding < FT protein" --- > FT /product="lipoprotein releasing system ATP- binding protein" 5881,5882c5877 < FT /product="NAD-dependent protein deacetylase of SIR2 < FT family" --- > FT /product="NAD-dependent protein deacetylase of SIR2 family" 6032,6033c6027 < FT /product="Exodeoxyribonuclease V alpha chain (EC < FT 3.1.11.5)" --- > FT /product="Exodeoxyribonuclease V alpha chain (EC 3.1.11.5)" 
6495,6496c6489 < FT /product="Pyrophosphate-energized proton pump (EC < FT 3.6.1.1)" --- > FT /product="Pyrophosphate-energized proton pump (EC 3.6.1.1)" 6946,6947c6939 < FT /product="Exodeoxyribonuclease V alpha chain (EC < FT 3.1.11.5)" --- > FT /product="Exodeoxyribonuclease V alpha chain (EC 3.1.11.5)" 7128,7129c7120 < FT /product="N-acyl-L-amino acid amidohydrolase (EC < FT 3.5.1.14)" --- > FT /product="N-acyl-L-amino acid amidohydrolase (EC 3.5.1.14)" 8035,8036c8026 < FT /product="D-3-phosphoglycerate dehydrogenase (EC < FT 1.1.1.95)" --- > FT /product="D-3-phosphoglycerate dehydrogenase (EC 1.1.1.95)" 8601,8602c8591 < FT /product="Acetolactate synthase small subunit (EC < FT 2.2.1.6)" --- > FT /product="Acetolactate synthase small subunit (EC 2.2.1.6)" 8608,8609c8597 < FT /product="Acetolactate synthase large subunit (EC < FT 2.2.1.6)" --- > FT /product="Acetolactate synthase large subunit (EC 2.2.1.6)" 9152,9153c9140 < FT /product="Exodeoxyribonuclease V alpha chain (EC < FT 3.1.11.5)" --- > FT /product="Exodeoxyribonuclease V alpha chain (EC 3.1.11.5)" 10659,10660c10646 < FT kinase/response regulator, hybrid ('one-component < FT system')" --- > FT kinase/response regulator, hybrid ('one- component system')" 12056,12057c12042 < FT /product="N-acetylmuramoyl-L-alanine amidase (EC < FT 3.5.1.28)" --- > FT /product="N-acetylmuramoyl-L-alanine amidase (EC 3.5.1.28)" 12957,12958c12942 < FT /product="Phosphatidate cytidylyltransferase (EC < FT 2.7.7.41)" --- > FT /product="Phosphatidate cytidylyltransferase (EC 2.7.7.41)" 13550,13551c13534 < FT /product="Glutamine synthetase type III, GlnN (EC < FT 6.3.1.2)" --- > FT /product="Glutamine synthetase type III, GlnN (EC 6.3.1.2)" 14344c14327,14328 < SQ --- > XX > SQ Sequence 3763317 BP; 772804 A; 1042979 C; 1057681 G; 776208 T; 113645 other; The main differences are on line breaks. Regards, Anne. 
On 11 Jan 2010, at 16:22, Peter wrote: > Hi Anne, > > I've just checked in feature support to the new EMBL output in > Bio.SeqIO > (our main branch on git). If you could give that a test it would be > very > much appreciated. If you are on the dev mailing list, we can discuss > issues there - otherwise we might as well continue on this thread. > > Thanks, > > Peter -- Dr Anne Pajon - Pathogen Genomics, Team 81 Sanger Institute, Wellcome Trust Genome Campus, Hinxton Cambridge CB10 1SA, United Kingdom +44 (0)1223 494 798 (office) | +44 (0)7958 511 353 (mobile) -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From aboulia at gmail.com Tue Jan 12 05:04:07 2010 From: aboulia at gmail.com (Kevin Lam) Date: Tue, 12 Jan 2010 13:04:07 +0800 Subject: [Biopython] is there an updated tutorial on how to use the Wrappers for the new NCBI BLAST+ tools? In-Reply-To: <320fb6e01001110908s4f19272bp71377c4f70004a24@mail.gmail.com> References: <5b6410e1001020244s7b702ca8n760352052dba77c9@mail.gmail.com> <5b6410e1001030009v5420f146k26d723f33c4c5f59@mail.gmail.com> <320fb6e01001110838n476bbca8q6d949162f62405bc@mail.gmail.com> <320fb6e01001110908s4f19272bp71377c4f70004a24@mail.gmail.com> Message-ID: <5b6410e1001112104y1ac0db9eoc565252710fc3334@mail.gmail.com> On Tue, Jan 12, 2010 at 1:08 AM, Peter wrote: > On Mon, Jan 11, 2010 at 4:46 PM, Kevin wrote: > > Hi Peter, > > I was thinking of porting the legacy blast script to python as u r right > > about the helper script being inflexible. > > A python version of legacy_blast.pl isn't any more useful than the > Perl version is it? Maybe I have misunderstood you. > > What would be nice is a way to help people update their old > Biopython scripts which called legacy BLAST, so that they can > be used on BLAST+ instead. 
I would expect in most cases this > means scripts using the legacy BLAST "helper" functions in > Bio.Blast.NCBIStandalone. One way to do this would be to > add new BLAST+ versions of the "helper" functions (taking > the same argument names as before), but that is just a stop > gap (a temporary measure). We really want people using these > old helper functions to switch to using the wrappers in > Bio.Blast.Applications and subprocess instead. > Yes I was thinking of this when i meant porting/integrate. to integrate the legacy blast perl script into Bio.Blast.NCBIStandalone I didn't realise that Bio.Blast.Applications existed > The documentation bit was actually about my first email about any > > updated doc on how to use blast+ with biopython > > I see. What do you think the current (Biopython 1.53) version > of the tutorial needs in the BLAST chapter? > > http://biopython.org/DIST/docs/tutorial/Tutorial.html > http://biopython.org/DIST/docs/tutorial/Tutorial.pdf > > http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc80 was exactly what I was looking for! Maybe i was looking at the wrong page Thanks for pointing it out! > Thanks, > > Peter > Cheers Kevin From biopython at maubp.freeserve.co.uk Tue Jan 12 10:27:47 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 12 Jan 2010 10:27:47 +0000 Subject: [Biopython] Could Bio.SeqIO write EMBL file? In-Reply-To: References: <320fb6e01001060515y23804dd5xebc326647fa975d8@mail.gmail.com> <246181F8-03A2-49B2-960D-1FFBBC3E2865@sanger.ac.uk> <320fb6e01001070808k57311f43tb67c9cf916d27eab@mail.gmail.com> <320fb6e01001071014w50fa780ct2fdaccf2b9272cdc@mail.gmail.com> <320fb6e01001080448g24ff5308gb797f01cbab79196@mail.gmail.com> <320fb6e01001110822q60ac7103ndb3082cc3146735@mail.gmail.com> Message-ID: <320fb6e01001120227s7b9d9ac5pa8a11837e0ee620b@mail.gmail.com> On Mon, Jan 11, 2010 at 5:32 PM, Anne Pajon wrote: > Hi Peter, > > Just tested now. > > It worked fine. Thanks a lot. Great. 
> Here is the diff between the EMBL output from Bio.SeqIO and the genbank > output from Bio.SeqIO converted with the EMBOSS tool to an EMBL file: > > ... > > The main differences are on line breaks. I hadn't yet done a comparison against EMBOSS (what version do you have), but yes, it looks like I am wrapping the feature tables using a shorter line length - we should check that, and it would be easy to adjust in Bio/SeqIO/InsdcIO.py Regarding the SQ line, that was on my "TODO" list. Including the sequence length and base counts shouldn't be hard at all. If you want to work on that it should just be a few lines in Bio/SeqIO/InsdcIO.py Right now however, further testing of features would be my first priority. See also: http://lists.open-bio.org/pipermail/open-bio-l/2010-January/000604.html There are other things still to do (e.g. missing fields on the ID line, dates, and references). Peter From biopython at maubp.freeserve.co.uk Tue Jan 12 12:33:35 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 12 Jan 2010 12:33:35 +0000 Subject: [Biopython] Could Bio.SeqIO write EMBL file? In-Reply-To: <320fb6e01001120227s7b9d9ac5pa8a11837e0ee620b@mail.gmail.com> References: <320fb6e01001060515y23804dd5xebc326647fa975d8@mail.gmail.com> <246181F8-03A2-49B2-960D-1FFBBC3E2865@sanger.ac.uk> <320fb6e01001070808k57311f43tb67c9cf916d27eab@mail.gmail.com> <320fb6e01001071014w50fa780ct2fdaccf2b9272cdc@mail.gmail.com> <320fb6e01001080448g24ff5308gb797f01cbab79196@mail.gmail.com> <320fb6e01001110822q60ac7103ndb3082cc3146735@mail.gmail.com> <320fb6e01001120227s7b9d9ac5pa8a11837e0ee620b@mail.gmail.com> Message-ID: <320fb6e01001120433q72f577efg635ce232c666a46@mail.gmail.com> On Tue, Jan 12, 2010 at 10:27 AM, Peter wrote: > On Mon, Jan 11, 2010 at 5:32 PM, Anne Pajon wrote: >> Here is the diff between the EMBL output from Bio.SeqIO and the genbank >> output from Bio.SeqIO converted with the EMBOSS tool to an EMBL file: >> >> ... >> >> The main differences are on line breaks.
> > I hadn't yet done a comparison against EMBOSS (what version do you > have), but yes, it looks like I am wrapping the feature tables using a > shorter line length - we should check that, and it would be easy to > adjust in Bio/SeqIO/InsdcIO.py The spec is pretty clear that the feature lines should be up to 80 characters. The premature wrapping was because I had been testing length < 80 instead of <= 80, which is now fixed in git. Peter From p.j.a.cock at googlemail.com Tue Jan 12 14:27:30 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 12 Jan 2010 14:27:30 +0000 Subject: [Biopython] Publication list Message-ID: <320fb6e01001120627s268f0dd4k3a543e3b779507e6@mail.gmail.com> Dear all, We have a fairly extensive manually compiled list of over 150 publications citing, referencing or using Biopython on the wiki, covering the first 10 years of Biopython: http://biopython.org/wiki/Publications *If your own Biopython related publications are missing from this list, please add them. If they are listed in PubMed this is pretty easy.* Keeping this up to date has been a tedious task, although now that we have an up to date reference, which hopefully will get cited, this is a little easier: http://news.open-bio.org/news/2009/03/biopython-paper-published/ There is an example in the Biopython Tutorial of using Bio.Entrez and PubMed Central (PMC) to find papers citing a reference, or you can just use this URL: http://www.ncbi.nlm.nih.gov/sites/entrez?db=pubmed&cmd=link&linkname=pubmed_pubmed_citedin&uid=19304878 Likewise, using Google Scholar also finds plenty of citations (although I don't know if this URL will work long term): http://scholar.google.com/scholar?cites=1800471218280477755&hl=en&as_sdt=2000 Perhaps just a few links like these will suffice for tracking future publications? Or do people think we should continue to update the wiki in the same style?
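For anyone who would rather script the "cited in" lookup than click the link, the same query can probably be expressed against the NCBI E-utilities ELink endpoint. This is only a sketch under assumptions: the helper name cited_in_url is invented here, the linkname is taken from the URL above, and the code merely builds the request URL without contacting NCBI or parsing any result.

```python
from urllib.parse import urlencode

# Public E-utilities ELink endpoint.
EUTILS_ELINK = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def cited_in_url(pmid):
    """Build an ELink URL listing PubMed records that cite the given PMID.

    Hypothetical helper: it only constructs the URL, it does not fetch it.
    """
    params = {
        "dbfrom": "pubmed",
        "db": "pubmed",
        "linkname": "pubmed_pubmed_citedin",  # same link name as the web URL above
        "id": str(pmid),
    }
    return EUTILS_ELINK + "?" + urlencode(params)


# PMID 19304878 is the uid used in the link above.
url = cited_in_url(19304878)
print(url)
```

The XML this URL returns would still need parsing and formatting before it could go on the wiki, and as with the web link it only covers citations that PubMed knows about.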
Regards, Peter From anaryin at gmail.com Tue Jan 12 19:01:34 2010 From: anaryin at gmail.com (João Rodrigues) Date: Tue, 12 Jan 2010 11:01:34 -0800 Subject: [Biopython] Biopython Digest, Vol 85, Issue 13 In-Reply-To: References: Message-ID: Hello Peter, Well, updating the wiki is cumbersome. Specially if done manually. Why not update the wiki automatically with that link you just gave? Regards, João [...] Rodrigues @ http://stanford.edu/~joaor/
From p.j.a.cock at googlemail.com Tue Jan 12 21:48:27 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 12 Jan 2010 21:48:27 +0000 Subject: [Biopython] Publication list Message-ID: <320fb6e01001121348h669f8ff9n28d805789e25931b@mail.gmail.com> On Tue, Jan 12, 2010 at 7:01 PM, João Rodrigues wrote: > Hello Peter, > > Well, updating the wiki is cumbersome. Specially if done manually. > Why not update the wiki automatically with that link you just gave? > > Regards, > > João [...] Rodrigues > @ http://stanford.edu/~joaor/ Yes, but how? The NCBI link could be used, or rather the Entrez API, with a script to turn that into a list formatted for the wiki - which could then be run every so often and manually pasted into the wiki. Perhaps with a good understanding of PHP and mediawiki the whole thing could be automated.
However, citations via PubMed Central are a small subset (Google scholar had about three times as many hits). My point is even semi-automated, updating the wiki is still quite a bit of work - and making it fully automated is also going to take some effort. This is why I was suggesting the lazy option of providing a few links on the publication list. Peter From p.j.a.cock at googlemail.com Wed Jan 13 11:53:54 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 13 Jan 2010 11:53:54 +0000 Subject: [Biopython] Publication list In-Reply-To: <4B4DB44A.2030202@dim.fm.usp.br> References: <320fb6e01001121348h669f8ff9n28d805789e25931b@mail.gmail.com> <4B4DB44A.2030202@dim.fm.usp.br> Message-ID: <320fb6e01001130353v36d846bfjca00ca447efcb100@mail.gmail.com> 2010/1/13 Daniel Silvestre : > Hi people, > > It's possible to someone to keep the list in smth like Zotero/Mendeley > or similar and then export is as Wiki citation templates with a wealth > of information attached. It's not automated but is quite simple and > fast. For instance Zotero can add a whole bunch of citations with just > one click. Does have anyone try this option? > > Att. > Daniel Hi Daniel, That sounds worth a try, although it still needs someone to keep track of things. It may be a little easier than the current system (people update the wiki manually, although it is usually me based on running a PMC or Google Scholar search). If we want to keep the wiki based list up to date in future, then having a volunteer would be great. Other than that, we can try and encourage people on the mailing list to add their own papers. 
Peter From p.j.a.cock at googlemail.com Wed Jan 13 12:14:33 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 13 Jan 2010 12:14:33 +0000 Subject: [Biopython] Publication list In-Reply-To: <4B4DB919.3090804@dim.fm.usp.br> References: <320fb6e01001121348h669f8ff9n28d805789e25931b@mail.gmail.com> <4B4DB44A.2030202@dim.fm.usp.br> <320fb6e01001130353v36d846bfjca00ca447efcb100@mail.gmail.com> <4B4DB919.3090804@dim.fm.usp.br> Message-ID: <320fb6e01001130414h31cccdc4nf983ac7915c83087@mail.gmail.com> 2010/1/13 Daniel Silvestre : > Hi Peter, > > I think a mixed approach (i.e having a curator and stimulating people to > add things) is the best option. I can easily create a database of > citations in my Zotero. If you have a readable list of what you want to > add, I can do it right now. > > Best, > Daniel If you want to cover everything in the database, can you work from the wiki as it is? If you look at the wiki source, you should be able to pull a PubMed ID for most cases (but not all, a few are not in PubMed, or were done differently due to the wiki plugin not liking accented characters in author names). 
However, I would suggest just starting with 2010 papers onwards, and trying to build a database automatically from citations of the papers here: http://biopython.org/wiki/Documentation#Papers Peter From daniel at dim.fm.usp.br Wed Jan 13 12:14:17 2010 From: daniel at dim.fm.usp.br (Daniel Silvestre) Date: Wed, 13 Jan 2010 10:14:17 -0200 Subject: [Biopython] Publication list In-Reply-To: <320fb6e01001130353v36d846bfjca00ca447efcb100@mail.gmail.com> (sfid-+20100113-095359-+000.00-1@spamfilter.osbf.lua) References: <320fb6e01001121348h669f8ff9n28d805789e25931b@mail.gmail.com> <4B4DB44A.2030202@dim.fm.usp.br> <320fb6e01001130353v36d846bfjca00ca447efcb100@mail.gmail.com> (sfid-+20100113-095359-+000.00-1@spamfilter.osbf.lua) Message-ID: <4B4DB919.3090804@dim.fm.usp.br> Hi Peter, I think a mixed approach (i.e having a curator and stimulating people to add things) is the best option. I can easily create a database of citations in my Zotero. If you have a readable list of what you want to add, I can do it right now. Best, Daniel Peter Cock wrote: > 2010/1/13 Daniel Silvestre : >> Hi people, >> >> It's possible to someone to keep the list in smth like Zotero/Mendeley >> or similar and then export is as Wiki citation templates with a wealth >> of information attached. It's not automated but is quite simple and >> fast. For instance Zotero can add a whole bunch of citations with just >> one click. Does have anyone try this option? >> >> Att. >> Daniel > > Hi Daniel, > > That sounds worth a try, although it still needs someone to keep > track of things. It may be a little easier than the current system > (people update the wiki manually, although it is usually me based > on running a PMC or Google Scholar search). > > If we want to keep the wiki based list up to date in future, then > having a volunteer would be great. Other than that, we can try > and encourage people on the mailing list to add their own papers. 
> > Peter > --------------------------------------------------------------------- Esta mensagem pode conter informacao confidencial. Se voce nao for o destinatario ou a pessoa autorizada a receber esta mensagem, nao podera usar, copiar ou divulgar as informacoes nela contidas ou tomar qualquer acao baseada nessas informacoes. Se voce recebeu esta mensagem por engano, favor avisar imediatamente o remetente, respondendo o e-mail e, em seguida, apague-o. Agradecemos sua cooperacao. This message may contain confidential information. If you are not the addressee or authorized person to receive it for the addressee, you must not use, copy, disclose or take any action based on this message or any information herein. If you have received this message in error, please advise the sender immediately by replying this e-mail message and delete it. Thanks in advance for your cooperation. ---------------------------------------------------------------------- DIM Faculdade de Medicina USP ---------------------------------------------------------------------- From daniel at dim.fm.usp.br Wed Jan 13 12:59:32 2010 From: daniel at dim.fm.usp.br (Daniel Silvestre) Date: Wed, 13 Jan 2010 10:59:32 -0200 Subject: [Biopython] Publication list In-Reply-To: <4B4DB919.3090804@dim.fm.usp.br> (sfid-+20100113-103459-+000.00-1@spamfilter.osbf.lua) References: <320fb6e01001121348h669f8ff9n28d805789e25931b@mail.gmail.com> <4B4DB44A.2030202@dim.fm.usp.br> <320fb6e01001130353v36d846bfjca00ca447efcb100@mail.gmail.com> <4B4DB919.3090804@dim.fm.usp.br> (sfid-+20100113-103459-+000.00-1@spamfilter.osbf.lua) Message-ID: <4B4DC3B4.2080306@dim.fm.usp.br> Hi again, Zotero was able to retrieve 113 of the 152 citations in the wiki, which means some of them are not properly formatted. So, I will rebuild the list from scratch and test it on a wiki just to see what happens. Att.
Daniel Daniel Silvestre wrote: > Hi Peter, > > I think a mixed approach (i.e having a curator and stimulating people to > add things) is the best option. I can easily create a database of > citations in my Zotero. If you have a readable list of what you want to > add, I can do it right now. > > Best, > Daniel > > > Peter Cock wrote: >> 2010/1/13 Daniel Silvestre : >>> Hi people, >>> >>> It's possible to someone to keep the list in smth like Zotero/Mendeley >>> or similar and then export is as Wiki citation templates with a wealth >>> of information attached. It's not automated but is quite simple and >>> fast. For instance Zotero can add a whole bunch of citations with just >>> one click. Does have anyone try this option? >>> >>> Att. >>> Daniel >> Hi Daniel, >> >> That sounds worth a try, although it still needs someone to keep >> track of things. It may be a little easier than the current system >> (people update the wiki manually, although it is usually me based >> on running a PMC or Google Scholar search). >> >> If we want to keep the wiki based list up to date in future, then >> having a volunteer would be great. Other than that, we can try >> and encourage people on the mailing list to add their own papers. >> >> Peter >>
From biopython at maubp.freeserve.co.uk Thu Jan 14 14:46:37 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 14 Jan 2010 14:46:37 +0000 Subject: [Biopython] Phasing out support for Python 2.4? In-Reply-To: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com> References: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com> Message-ID: <320fb6e01001140646h2a576a31u747d946ffe3ec3f0@mail.gmail.com> Hi all, Biopython currently supports Python 2.4, 2.5 and 2.6 (and seems to work on the current Python 2.7 alpha), but it is probably time to start phasing out support for Python 2.4. Reasons for encouraging Python 2.5+ include the built in support for sqlite3 (which we can use in the BioSQL wrapper) and ElementTree (which we use for the new phyloXML parser), both of which must currently be manually installed for Python 2.4. There are other technical advantages, see this thread on our development mailing list: http://lists.open-bio.org/pipermail/biopython-dev/2010-January/007236.html We'd aim to follow our usual deprecation procedure, so at least two releases and one year before actually dropping support for Python 2.4. At that point older Linux distributions which ship with Python 2.4 probably won't be supported anyway. Is dropping support for Python 2.4 going to cause anyone a problem? Please send any replies just to the main mailing list (not the announcement list). Thanks, Peter From ivan at biodec.com Thu Jan 14 15:41:58 2010 From: ivan at biodec.com (Ivan Rossi) Date: Thu, 14 Jan 2010 16:41:58 +0100 (CET) Subject: [Biopython] Phasing out support for Python 2.4?
In-Reply-To: <320fb6e01001140736q2926afabv6b587ec73ddd8da2@mail.gmail.com> References: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com> <320fb6e01001140646h2a576a31u747d946ffe3ec3f0@mail.gmail.com> <320fb6e01001140736q2926afabv6b587ec73ddd8da2@mail.gmail.com> Message-ID: On Thu, 14 Jan 2010, Peter wrote: > Biopython currently supports Python 2.4, 2.5 and 2.6 > (and seems to work on the current Python 2.7 alpha), > but it is probably time to start phasing out support for > Python 2.4. > > ... > > We'd aim to follow our usual deprecation procedure, > so at least two releases and one year before actually > dropping support for Python 2.4. At that point older > Linux distributions which ship with Python 2.4 > probably won't be supported anyway. > > Is dropping support for Python 2.4 going to cause anyone a problem? Provided that the deprecation procedure above is followed it will be fine to us (BioDec). Otherwise it woud have been a problem to plone4bio (http://plone4bio.org) since Plone3 just runs on python 2.4. However Plone4, due in less than 6 months, runs on 2.6 and in a year I am confident that the transition of plone4bio to plone4 will be finished. On the contrary we will have to live with an older BioPy for some time... Ivan -- Ivan Rossi, PhD - ivan AT biodec dot com, ivan dot rossi3 AT unibo dot it BioDec Srl, Via Calzavecchio 20/2, 40033 Casalecchio di Reno (BO), Italy Phone: (+39)-051-0548263 - Fax: (+39)-051-7459582 - http://www.biodec.com From biopython at maubp.freeserve.co.uk Thu Jan 14 17:05:24 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 14 Jan 2010 17:05:24 +0000 Subject: [Biopython] Phasing out support for Python 2.4? 
In-Reply-To: References: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com> <320fb6e01001140646h2a576a31u747d946ffe3ec3f0@mail.gmail.com> <320fb6e01001140736q2926afabv6b587ec73ddd8da2@mail.gmail.com> Message-ID: <320fb6e01001140905g530a7f6cndcec86f7ea3b9576@mail.gmail.com> On Thu, Jan 14, 2010 at 3:41 PM, Ivan Rossi wrote: > >> >> Is dropping support for Python 2.4 going to cause anyone a problem? >> > > Provided that the deprecation procedure above is followed it will be > fine to us (BioDec). Otherwise it woud have been a problem to plone4bio > (http://plone4bio.org) since Plone3 just runs on python 2.4. However > Plone4, due in less than 6 months, runs on 2.6 and in a year I am > confident that the transition of plone4bio to plone4 will be finished. > > On the contrary we will have to live with an older BioPy for some time... > > Ivan OK - thanks for the heads up. We don't need to rush things, so if in six months time you really need us to keep Python 2.4 compatibility for a bit longer we can discuss that. Peter From biopython at maubp.freeserve.co.uk Thu Jan 14 17:32:22 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 14 Jan 2010 17:32:22 +0000 Subject: [Biopython] Phasing out support for Python 2.4? In-Reply-To: <4B4F4071.7040601@fold.natur.cuni.cz> References: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com> <320fb6e01001140646h2a576a31u747d946ffe3ec3f0@mail.gmail.com> <4B4F4071.7040601@fold.natur.cuni.cz> Message-ID: <320fb6e01001140932t1bf9b62cse70d8c5ee69dc38a@mail.gmail.com> On Thu, Jan 14, 2010 at 4:04 PM, Martin MOKREJ? wrote: > > Hi Peter, > I don't get this point much. What is the problem stating that with > python 2.5+ one does not need to install an extra dependency while > for 2.4 one needs _two_ modules? > I don't think I want BioSQL nor sqlite so why would I have to upgrade. > Would the requirement be in python language syntax incompatibility then > I would NOT object, but in this situation ... 
> Martin Hi Martin, This isn't just the issue of sqlite3 and ElementTree. There are several benefits to using more recent versions of Python, for example with an eye on the future for Python 3, and on a practical level it simplifies our testing to have one less version to worry about (especially once Python 2.7 is out, currently scheduled for June 2010). We've already had minor issues with developers using Python 2.5+ syntax unwittingly which broke on Python 2.4 (nothing major, and it was easily fixed once the problem was spotted). If we continue to insist on Python 2.4 support, it may prove problematic for if future potential contributors have existing code written for Python 2.5+ which would require significant re-factoring. None of these concerns are pressing right now (and some are hypothetical), but I think you will agree that Python 2.4 is pretty old, and not widely used anymore. Having a clear plan in place for dropping it seems a sensible move, and once that happens we can start to take advantage of the language and library improvements Python 2.5 added. Are you personally using Python 2.4? If so, could you tell us a little more - for example, is this a university server which would be difficult to update? Or do you require some other Python package which requires Python 2.4? Thanks, Peter From mmokrejs at ribosome.natur.cuni.cz Thu Jan 14 17:52:13 2010 From: mmokrejs at ribosome.natur.cuni.cz (Martin Mokrejs) Date: Thu, 14 Jan 2010 18:52:13 +0100 Subject: [Biopython] Phasing out support for Python 2.4? 
In-Reply-To: <320fb6e01001140932t1bf9b62cse70d8c5ee69dc38a@mail.gmail.com> References: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com> <320fb6e01001140646h2a576a31u747d946ffe3ec3f0@mail.gmail.com> <4B4F4071.7040601@fold.natur.cuni.cz> <320fb6e01001140932t1bf9b62cse70d8c5ee69dc38a@mail.gmail.com> Message-ID: <4B4F59CD.5040006@ribosome.natur.cuni.cz> Hi Peter, I just had troubles with 2.5 to 2.6 move (mailman needed manual patches), and just envisioned that similarly 2.4 to 2.5 would be a trouble. So, personally I don't mind but I would prefer clear listings what modules require the newer features and having an option to skip them during install step them instead of having to blindly upgrade. Personally I just use Bio.SeqIO and that is probably all I need. ^H^H^H^H^ and Entrez or PubMed or Efetch stuff, I got lost in the many biopython deprecations and module renames in the last years. I use the "latest" but forgot how is it currently named. ;-) ^H^H^H^H^H of course I know, efetch(). ;-) Recently I had for example install some old Solaris 2.6 machine with some apps and imagine, was glad to have python 2.3 I think with gcc-3.x at all. Martin Peter wrote: > On Thu, Jan 14, 2010 at 4:04 PM, Martin MOKREJ? > wrote: >> Hi Peter, >> I don't get this point much. What is the problem stating that with >> python 2.5+ one does not need to install an extra dependency while >> for 2.4 one needs _two_ modules? >> I don't think I want BioSQL nor sqlite so why would I have to upgrade. >> Would the requirement be in python language syntax incompatibility then >> I would NOT object, but in this situation ... >> Martin > > Hi Martin, > > This isn't just the issue of sqlite3 and ElementTree. 
There > are several benefits to using more recent versions of Python, > for example with an eye on the future for Python 3, and on > a practical level it simplifies our testing to have one less > version to worry about (especially once Python 2.7 is out, > currently scheduled for June 2010). > > We've already had minor issues with developers using > Python 2.5+ syntax unwittingly which broke on Python > 2.4 (nothing major, and it was easily fixed once the > problem was spotted). If we continue to insist on Python > 2.4 support, it may prove problematic for if future potential > contributors have existing code written for Python 2.5+ > which would require significant re-factoring. > > None of these concerns are pressing right now (and > some are hypothetical), but I think you will agree that > Python 2.4 is pretty old, and not widely used anymore. > Having a clear plan in place for dropping it seems a > sensible move, and once that happens we can start > to take advantage of the language and library > improvements Python 2.5 added. > > Are you personally using Python 2.4? If so, could you > tell us a little more - for example, is this a university > server which would be difficult to update? Or do you > require some other Python package which requires > Python 2.4? From mmokrejs at ribosome.natur.cuni.cz Thu Jan 14 17:51:58 2010 From: mmokrejs at ribosome.natur.cuni.cz (Martin Mokrejs) Date: Thu, 14 Jan 2010 18:51:58 +0100 Subject: [Biopython] Phasing out support for Python 2.4? In-Reply-To: <320fb6e01001140646h2a576a31u747d946ffe3ec3f0@mail.gmail.com> References: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com> <320fb6e01001140646h2a576a31u747d946ffe3ec3f0@mail.gmail.com> Message-ID: <4B4F59BE.40500@ribosome.natur.cuni.cz> Hi Peter, I don't get this point much. What is the problem stating that with python 2.5+ one does not need to install an extra dependency while for 2.4 one needs _two_ modules? 
I don't think I want BioSQL nor sqlite so why would I have to upgrade. Would the requirement be in python language syntax incompatibility then I would NOT object, but in this situation ... Martin Peter wrote: > Hi all, > > Biopython currently supports Python 2.4, 2.5 and 2.6 > (and seems to work on the current Python 2.7 alpha), > but it is probably time to start phasing out support for > Python 2.4. > > Reasons for encouraging Python 2.5+ include the > built in support for sqlite3 (which we can use in the > BioSQL wrapper) and ElementTree (which we use > for the new phyloXML parser), both of which must > currently be manually installed for Python 2.4. > > There are other technical advantages, see this > thread on our development mailing list: > http://lists.open-bio.org/pipermail/biopython-dev/2010-January/007236.html > > We'd aim to follow our usual deprecation procedure, > so at least two releases and one year before actually > dropping support for Python 2.4. At that point older > Linux distributions which ship with Python 2.4 > probably won't be supported anyway. > > Is dropping support for Python 2.4 going to cause > anyone a problem? From biopython at maubp.freeserve.co.uk Thu Jan 14 18:05:37 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 14 Jan 2010 18:05:37 +0000 Subject: [Biopython] Phasing out support for Python 2.4? In-Reply-To: <4B4F5993.9010600@fold.natur.cuni.cz> References: <320fb6e01001130610y6de002d5ua6bbcca4cdf9385a@mail.gmail.com> <320fb6e01001140646h2a576a31u747d946ffe3ec3f0@mail.gmail.com> <4B4F4071.7040601@fold.natur.cuni.cz> <320fb6e01001140932t1bf9b62cse70d8c5ee69dc38a@mail.gmail.com> <4B4F5993.9010600@fold.natur.cuni.cz> Message-ID: <320fb6e01001141005s33cf2431xc32581a73540e080@mail.gmail.com> On Thu, Jan 14, 2010 at 5:51 PM, Martin MOKREJ? wrote: > > Hi Peter, > ?I just had troubles with 2.5 to 2.6 move (mailman needed manual patches), > and just envisioned that similarly 2.4 to 2.5 would be a trouble. 
So, personally > I don't mind but I would prefer clear listings what modules require the > newer features and having an option to skip them during install step them > instead of having to blindly upgrade. ... Its a nice idea in theory, and I can see how it would be useful in some case. However, it sounds quite complicated to implement, and very complex to keep up to date and tested properly. I don't think its a good use of limited developer time. > ?Recently I had for example install some old Solaris 2.6 machine with some > apps and imagine, was glad to have python 2.3 I think with gcc-3.x at all. > Martin I sympathize - although we now have Python 2.6 installed, I think our cluster head node still has Python 2.3 as the default system Python (its due for an upgrade, but systems administrators are rightly cautious). Peter From anbhat at utu.fi Fri Jan 15 16:07:51 2010 From: anbhat at utu.fi (Anirban Bhattachariya) Date: Fri, 15 Jan 2010 18:07:51 +0200 Subject: [Biopython] codon usage Message-ID: Hi, I found this script on "http://www.pasteur.fr/recherche/unites/sis/formation/python/exercises/seqrandom_count_codons_plot.py" which is supposed to count codon usage and plot them in a bar plot but this not working since some of the modules used in the script does not exist anymore. How can modify the script to make it usable or is there a better way to do that? 
here is the code:

import Bio.Fasta
from sys import *
from string import *
from dna import codons
from mutateseq import mutateseq

file = argv[1]
handle = open(file)
it = Bio.Fasta.Iterator(handle, Bio.Fasta.SequenceParser())

count = {}
count_random = {}
seq = it.next()
while seq:
    for codon in codons(seq.seq.tostring()):
        if count.has_key(codon):
            count[codon] += 1
        else:
            count[codon] = 0
    mutableseq = seq.seq.tomutable()
    mutateseq(mutableseq,span=1000,p=0.1)
    for codon in codons(mutableseq.tostring()):
        if count_random.has_key(codon):
            count_random[codon] += 1
        else:
            count_random[codon] = 0
    seq = it.next()
handle.close()

#--------------------------------------------------------
# bar charts of codons frequencies
# - for legibility, 2 charts are built
# - both random and normal frequencies are displayed

from tkplot import *
from Numeric import *

def codon_sort(a,b):
    if a < b:
        return -1
    elif a > b:
        return 1
    else:
        return 0

for codon in count.keys():
    if not count_random.has_key(codon):
        count_random[codon] = 0
for codon in count_random.keys():
    if not count.has_key(codon):
        count[codon] = 0

labels=count.keys()
labels.sort(codon_sort)

w1=window(plot_title='Count codons',width=1000)
y=array(count.values())[:len(count)/2]
x=arange(len(y)+1)
w1.bar(y,x,label=labels[:len(count)/2])

w2=window(plot_title='Count codons(2)',width=1000)
y=array(count.values())[(len(count)/2)+1:]
x=arange(len(y)+1)
w2.bar(y,x,label=labels[(len(count)/2)+1:])

y=array(count_random.values())[:len(count_random)/2]
x=arange(len(y)+1)
w1.bar(y,x,label=labels[:len(count_random)/2])

y=array(count_random.values())[(len(count_random)/2)+1:]
x=arange(len(y)+1)
w2.bar(y,x,label=labels[(len(count_random)/2)+1:])

Regards,
Anirban

From biopython at maubp.freeserve.co.uk Fri Jan 15 18:07:43 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 15 Jan 2010 18:07:43 +0000
Subject: [Biopython] codon usage
In-Reply-To:
References:
Message-ID: <320fb6e01001151007r113fae98l1bac3fd21c3ac7f1@mail.gmail.com>

On Fri, Jan
15, 2010 at 4:07 PM, Anirban Bhattachariya wrote: > Hi, > > I found this script on ... which is supposed to count codon usage and plot them > in a bar plot but this not working since some of the modules used in the script > does not exist anymore. Hi Anirban, Sadly that Pasteur Institute "Python course in Bioinformatics" is out of date. We have tried emailing the authors about this, and I offered to help update it - but so far I have had no reply. If anyone has current contact information please get in touch. http://www.pasteur.fr/recherche/unites/sis/formation/python/ Looking at the code there are several issues: The built in python module string still exists but is considered obsolete, string methods are generally preferred. Bio.Fasta still exists but is obsolete, that bit can be replaced with Bio.SeqIO fairly easily. Not sure about the other bits (see below). Numeric is also obsolete and no longer supported (it could use numpy instead). See http://numpy.scipy.org/ Then for the plotting itself I would suggest maybe matplotlib instead of tkplot (personal preference, I've never tried tkplot). http://matplotlib.sourceforge.net/ There are examples of some simple plots using this in the current Biopython tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf The relevant background to this example is here, with a small (compressed) example image where I can't read the captions: http://www.pasteur.fr/recherche/unites/sis/formation/python/apas05.html#f_codon_freq Have you seen a larger sample output image? It should be pretty easy to recode this from scratch, but it would take a bit of "archaeology" to work out what exactly the old code did. It might be easier if you told us what you want to plot - a simple bar chart with an entry for each of the possible 64 codons (assuming non-ambiguous RNA or DNA is used)? 
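As a starting point for the counting step only, here is a minimal sketch using just the standard library. It is not the original course code: it assumes plain non-overlapping codons in reading frame 0, and the function name count_codons is made up for illustration. Note that the posted script sets a codon's count to 0 the first time it is seen, so its totals would come out one too low; the sketch below counts from 1.

```python
from collections import Counter


def count_codons(sequence):
    """Count non-overlapping codons in reading frame 0 of a DNA/RNA string."""
    sequence = sequence.upper()
    # Ignore a trailing partial codon (one or two leftover letters).
    usable = len(sequence) - len(sequence) % 3
    return Counter(sequence[i:i + 3] for i in range(0, usable, 3))


# Toy sequence for illustration; real input would be str(record.seq)
# for each record read with Bio.SeqIO.parse(filename, "fasta").
counts = count_codons("ATGGCCATGTAA")
for codon, n in sorted(counts.items()):
    print(codon, n)
```

The resulting dictionary of counts could then be handed to whatever plotting library is preferred for the bar chart.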
Peter From anbhat at utu.fi Sat Jan 16 08:38:28 2010 From: anbhat at utu.fi (Anirban Bhattachariya) Date: Sat, 16 Jan 2010 10:38:28 +0200 Subject: [Biopython] Sequence annotation (Features) Message-ID: Hi, I'm trying to download a protein sequence object (using ID or accession number) and then trying to print its variants (all variant sequences) from its features and annotations.I'm using pseudocholinesterase (http://www.uniprot.org/uniprot/P06276 ) as an example since it has lot of natural variants. The problem is when I'm trying to access the features its saying "0 features" ; how can I access the features in Swiss-Prot file like in genbank file format ( as in section 4.6 of the tutorial). Here is my code: from Bio import ExPASy from Bio import SeqIO from Bio import SeqFeature handle =ExPASy.get_sprot_raw("P06276") seq_record = SeqIO.read(handle, "swiss") handle.close() print seq_record.id print seq_record.name print seq_record.description print repr(seq_record.seq) print "Length %i" % len(seq_record) print seq_record.annotations["keywords"] print len(seq_record) print "%i features" % (len(seq_record.features)) output: P06276 CHLE_HUMAN RecName: Full=Cholinesterase; EC=3.1.1.8; AltName: Full=Acylcholine acylhydrolase; AltName: Full=Choline esterase II; AltName: Full=Butyrylcholine esterase; AltName: Full=Pseudocholinesterase; Flags: Precursor; Seq('MHSKVTIICIRFLFWFLLLCMLIGKSHTEDDIIIATKNGKVRGMNLTVFGGTVT...VGL', ProteinAlphabet()) Length 602 ['3D-structure', 'Complete proteome', 'Direct protein sequencing', 'Disease mutation', 'Disulfide bond', 'Glycoprotein', 'Hydrolase', 'Polymorphism', 'Serine esterase', 'Signal'] 602 0 features Thanks in advance. 
-Anirban

From biopython at maubp.freeserve.co.uk Sat Jan 16 11:21:52 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 16 Jan 2010 11:21:52 +0000
Subject: [Biopython] Sequence annotation (Features)
In-Reply-To:
References:
Message-ID: <320fb6e01001160321x5b425d4eqba4f752a1baa358d@mail.gmail.com>

On Sat, Jan 16, 2010 at 8:38 AM, Anirban Bhattachariya wrote:
> Hi,
>
> I'm trying to download a protein sequence object (using ID or
> accession number) and then trying to print its variants (all
> variant sequences) from its features and annotations. I'm using
> pseudocholinesterase (http://www.uniprot.org/uniprot/P06276 )
> as an example since it has lot of natural variants.
>
> The problem is when I'm trying to access the features its
> saying "0 features" ; how can I access the features in
> Swiss-Prot file like in genbank file format ( as in section
> 4.6 of the tutorial).

It's a known missing feature, although there is a patch here:
http://bugzilla.open-bio.org/show_bug.cgi?id=2235

You could help with testing/improving the patch in order
to get Bio.SeqIO to do this in future, or in the short term
use the underlying parser in Bio.SwissProt.

Regards,

Peter

From anbhat at utu.fi Sun Jan 17 20:11:45 2010
From: anbhat at utu.fi (Anirban Bhattachariya)
Date: Sun, 17 Jan 2010 22:11:45 +0200
Subject: [Biopython] How to print variants ?
Message-ID:

Hi,

I'm trying to download a protein sequence object (using ID or accession number) and then trying to print its variants (all variant sequences) from its features and annotations. My script works fine so far and it prints the number of sequence features.

The problem is, how can I print its variants (should work for any ID) and all variant sequences?

Here is my code so far:

from Bio import Entrez
from Bio import SeqIO
handle = Entrez.efetch(db=raw_input("What type of database? protein/nucleotide="),\
                       rettype=raw_input("which database you want to use? For example:genbank(gb)="),\
                       ID=raw_input("Enter the ID; for example human BChE contain lot of genetic \
variants,id is P06276="))
for seq_record in SeqIO.parse(handle, "gb") :
    print seq_record.id, seq_record.description[:50] + "..."
    print "Sequence length %i," % len(seq_record),
    print "%i features," % len(seq_record.features),
    print "from: %s" % seq_record.annotations["source"]
    print seq_record.annotations["keywords"]
    print repr(seq_record.seq)
    print "features:%i" % len(seq_record.features),
    # [ code for printing variants? ]
handle.close()

Thanks in advance.

-Anirban

From biopython at maubp.freeserve.co.uk Sun Jan 17 20:23:58 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 17 Jan 2010 20:23:58 +0000
Subject: [Biopython] How to print variants ?
In-Reply-To:
References:
Message-ID: <320fb6e01001171223g388ab1p18dcf8c9e7e3f581@mail.gmail.com>

On Sun, Jan 17, 2010 at 8:11 PM, Anirban Bhattachariya wrote:
> Hi,
>
> I'm trying to download a protein sequence object (using ID
> or accession number) and then trying to print its variants
> (all variant sequences) from its features and annotations.

I don't understand what you are asking for. Could you give us
a specific worked example (an accession and what you want
to print out)?

Peter

From anbhat at utu.fi Sun Jan 17 21:48:27 2010
From: anbhat at utu.fi (Anirban Bhattachariya)
Date: Sun, 17 Jan 2010 23:48:27 +0200
Subject: [Biopython] How to print variants ?
Message-ID:

Hi,

Suppose we want to study how mutations/SNPs affect binding or some other
biochemical reaction. Let's also assume that we have a motif or motifs we want
to test against. These variants are listed in sequence files, where only the
original protein sequence is given. To test motifs against variants, we need the complete
protein sequence. Let's say our protein has 75 variants, so we need original + 75
protein sequences to test with motifs. My intention is to make a list of those 75 proteins.
For example if with slicing I can print : print seq_record.features[5], print seq_record.features[13], Output: location: [28:602] ref: None:None strand: None qualifiers: Key: experiment, Value: ['experimental evidence, no additional details recorded'] Key: gene, Value: ['BCHE'] Key: gene_synonym, Value: ['CHE1'] Key: note, Value: ['Cholinesterase. /FTId=PRO_0000008613.'] Key: region_name, Value: ['Mature chain'] type: Region location: [31:32] ref: None:None strand: None qualifiers: Key: experiment, Value: ['experimental evidence, no additional details recorded'] Key: gene, Value: ['BCHE'] Key: gene_synonym, Value: ['CHE1'] Key: note, Value: ['Missing (in BChE deficiency). /FTId=VAR_040011.'] Key: region_name, Value: ['Variant'] Seq('I', IUPACProtein()) type: Region Now I want to print the features which has 'variant' ( in above example the the second one " print seq_record.features[13]" in other words I only want to print features with " Key: region_name, Value: ['Variant']" and ignore other features. Now for the final part I want to print the sequence which has variant sequence. For example : location: [55:56] ref: None:None strand: None qualifiers: Key: experiment, Value: ['experimental evidence, no additional details recorded'] Key: gene, Value: ['BCHE'] Key: gene_synonym, Value: ['CHE1'] Key: note, Value: ['F -> I (in BChE deficiency). /FTId=VAR_040013.'] Key: region_name, Value: ['Variant'] Seq('F', IUPACProtein()) type: Region It says location: [55:56] also there is this line Key: note, Value: ['F -> I (in BChE deficiency). /FTId=VAR_040013.'] That says that F in original sequence has changed to I variant sequence So I need the protein sequence where there in position 55 is I instead of F. Thanks, Anirban From biopython at maubp.freeserve.co.uk Mon Jan 18 10:05:05 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 18 Jan 2010 10:05:05 +0000 Subject: [Biopython] How to print variants ? 
In-Reply-To: References: Message-ID: <320fb6e01001180205i524db7a6sfae8e42faaa6e281@mail.gmail.com> On Sun, Jan 17, 2010 at 9:48 PM, Anirban Bhattachariya wrote: > Hi , > > Suppose we want to study how mutations/SNPs affect on binding or some other > biochemical reaction. Let's also assume, that we have a motif or motifs we want > to test against These variants are listed in sequence files, there is listed only the > original protein sequence. For to test motives against variants, we need complete > protein sequence. Let's say our protein has 75 variants, so we need original + 75 > protein sequences to test with motifs. My intention is to make a list of those 75 > proteins. >From your earlier emails you are working with a GenBank file for P06276: http://lists.open-bio.org/pipermail/biopython/2010-January/006120.html i.e. http://www.ncbi.nlm.nih.gov/protein/116353 or the original SwissProt/UniProt database, as a plain test "swiss" file: http://www.uniprot.org/uniprot/P06276.txt Now either the plain text GenBank or SwissProt files are going to force you to parse strings like "T -> M (in BChE deficiency; dbSNP:rs56309853)." to pull out this information in a usable form (whichever GenBank or SwissProt plain text parser you use). This is possible, but a bit fiddly. Looking at the SwissProt page, they have a table of these variants: http://www.uniprot.org/uniprot/P06276 UniProt also offer a GFF and FASTA file, neither of which are helpful here: http://www.uniprot.org/uniprot/P06276.gff http://www.uniprot.org/uniprot/P06276.fasta However, the XML format looks much nicer: http://www.uniprot.org/uniprot/P06276.xml It has well tagged entries for each variant, e.g. T M - Note there is some work in development to add parsing these UniProt XML files to SeqIO as a SeqRecord, but for your task it would probably be simpler to parse the XML yourself (using one of the standard Python XML libraries) to pull out just these variations. 
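A minimal sketch of the XML route, using only the standard library. It assumes that each variant is a feature element with type "sequence variant" holding original, variation and location/position children in the http://uniprot.org/uniprot namespace; that layout is an assumption to check against the downloaded file. The sample XML, the VAR_EXAMPLE id, the toy sequence and both function names are made up for illustration, and only single-residue substitutions are handled ("Missing" variants and ranges are skipped).

```python
import xml.etree.ElementTree as ET

# Assumed namespace and element layout of UniProt XML; verify against a
# real download before relying on it.
NS = "{http://uniprot.org/uniprot}"

# Hypothetical, hand-written sample in the assumed layout.
SAMPLE = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <feature type="sequence variant" id="VAR_EXAMPLE">
      <original>F</original>
      <variation>I</variation>
      <location><position position="6"/></location>
    </feature>
    <feature type="chain">
      <location><begin position="1"/><end position="8"/></location>
    </feature>
  </entry>
</uniprot>"""


def substitutions(xml_text):
    """Yield (position, original, variation) for single-residue substitutions."""
    root = ET.fromstring(xml_text)
    for feature in root.iter(NS + "feature"):
        if feature.get("type") != "sequence variant":
            continue
        original = feature.find(NS + "original")
        variation = feature.find(NS + "variation")
        position = feature.find(NS + "location/" + NS + "position")
        # Skip deletions and multi-residue ranges in this sketch.
        if original is None or variation is None or position is None:
            continue
        yield int(position.get("position")), original.text, variation.text


def apply_substitution(sequence, position, original, variation):
    """Return the sequence with the 1-based position changed to the variant residue."""
    index = position - 1  # XML positions are 1-based, Python strings 0-based
    if sequence[index:index + len(original)] != original:
        raise ValueError("expected %s at position %i" % (original, position))
    return sequence[:index] + variation + sequence[index + len(original):]


wild_type = "MHSKVFII"  # toy sequence, not the real P06276 sequence
variants = [apply_substitution(wild_type, *sub) for sub in substitutions(SAMPLE)]
print(variants)
```

Checking the original residue before substituting guards against off-by-one mistakes between the 1-based positions in the file and 0-based Python slicing.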
See also:
http://lists.open-bio.org/pipermail/biopython-dev/2010-January/007244.html

Which would you prefer? Working with XML or fuzzy string formats?

Peter

From aboulia at gmail.com Tue Jan 19 05:50:41 2010
From: aboulia at gmail.com (Kevin Lam)
Date: Tue, 19 Jan 2010 13:50:41 +0800
Subject: [Biopython] SeqIO.index for csfasta files
Message-ID: <5b6410e1001182150q36e01326mc218c06aaf780bab@mail.gmail.com>

Hi all
I know csfasta isn't listed in the SeqIO page but can I use index on it as
well to retrieve subset of reads from csfasta ? (qual files are ok )
http://news.open-bio.org/news/2009/09/biopython-seqio-index/

Cheers
Kevin

From aboulia at gmail.com Tue Jan 19 08:31:43 2010
From: aboulia at gmail.com (Kevin Lam)
Date: Tue, 19 Jan 2010 16:31:43 +0800
Subject: [Biopython] SeqIO.index for csfasta files memory issues
Message-ID: <5b6410e1001190031k533f182fy939c7b4ecc22d72a@mail.gmail.com>

What are the memory limitations for SeqIO.index?
I am trying to create an index for a 4.5 gb csfasta file
~ 60 million reads
but the script crashes at 5 Gb ram usage
the machine has 31 Gb ram.

#!/usr/bin/python
from Bio import SeqIO
data = SeqIO.index("Sample3.csfasta", "fasta")
print data.keys()[:3]
print data["853_15_296_F3"].seq

Resource usage summary:

    CPU time      : 381.24 sec.
    Max Memory    : 5103 MB
    Max Swap      : 5347 MB
    Max Processes : 4
    Max Threads   : 5

Traceback (most recent call last):
  File "./extractfasta.py", line 7, in ?
    data = SeqIO.index("Sample3.csfasta", "fasta")
  File "/home//biopython-1.53/build/lib.linux-x86_64-2.4/Bio/SeqIO/__init__.py", line 703, in index
    return indexer(filename, alphabet, key_function)
  File "/home//biopython-1.53/build/lib.linux-x86_64-2.4/Bio/SeqIO/_index.py", line 209, in __init__
    "fasta", ">")
  File "/home//biopython-1.53/build/lib.linux-x86_64-2.4/Bio/SeqIO/_index.py", line 203, in __init__
    self._record_key(line[marker_offset:].strip().split(None,1)[0], offset)
  File "/home//biopython-1.53/build/lib.linux-x86_64-2.4/Bio/SeqIO/_index.py", line 86, in _record_key
    dict.__setitem__(self, key, seek_position)
MemoryError

From biopython at maubp.freeserve.co.uk Tue Jan 19 09:32:45 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 19 Jan 2010 09:32:45 +0000
Subject: [Biopython] SeqIO.index for csfasta files
In-Reply-To: <5b6410e1001182150q36e01326mc218c06aaf780bab@mail.gmail.com>
References: <5b6410e1001182150q36e01326mc218c06aaf780bab@mail.gmail.com>
Message-ID: <320fb6e01001190132v36bcfc91u8e61ed4c89c1af09@mail.gmail.com>

On Tue, Jan 19, 2010 at 5:50 AM, Kevin Lam wrote:
> Hi all
> I know csfasta isn't listed in the SeqIO page but can I use index on it as
> well to retrieve subset of reads from csfasta ? (qual files are ok )
> http://news.open-bio.org/news/2009/09/biopython-seqio-index/
>
> Cheers
> Kevin

We don't explicitly support color space FASTA, but it should work.
By that I mean the parser will just give you the sequences as is
(e.g. A1231232) with a default generic alphabet object.

Depending on the number of reads, and the size of the subset,
you may find using Bio.SeqIO.parse and write together works
better (lower memory requirements). I would suggest building
a python set of the desired IDs, then using something like this:

#Using set to test membership (hash based, faster than a list)
wanted_ids = set(...)
#This is a memory efficient generator expression:
wanted = (rec for rec in SeqIO.parse(...) if rec.id in wanted_ids)
handle = open(..., "w")
count = SeqIO.write(wanted, handle, "fasta")
handle.close()

Peter

From biopython at maubp.freeserve.co.uk Tue Jan 19 09:38:43 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 19 Jan 2010 09:38:43 +0000
Subject: [Biopython] SeqIO.index for csfasta files memory issues
In-Reply-To: <5b6410e1001190031k533f182fy939c7b4ecc22d72a@mail.gmail.com>
References: <5b6410e1001190031k533f182fy939c7b4ecc22d72a@mail.gmail.com>
Message-ID: <320fb6e01001190138o191101a5h24cc9286a4e0ce6a@mail.gmail.com>

On Tue, Jan 19, 2010 at 8:31 AM, Kevin Lam wrote:
> What are the memory limitations for SeqIO.index?
> I am trying to create an index for a 4.5 gb csfasta file
> ~ 60 million reads
> but the script crashes at 5 Gb ram usage
> the machine has 31 Gb ram.

What OS are you using (and is it 64bit)?
What Python are you using (and is it 64bit)?
What version of Biopython are you using?

I've never tried a file with quite that many reads, but
crashing at about 5GB is odd. I wonder if this is a 4GB
limit somewhere in your system (e.g. running 32bit
Python). Adding some debug statements we could
see when it falls over (i.e. how many reads had
been indexed).

Long term, really really big indexes will be too big
to hold in memory as a python dict (record IDs and
file offsets). Therefore we have done a little work
looking at disk based indexes, including sqlite3.
This does make building the index much slower
though.

For your immediate task, try a simple iteration
through the records, selecting the records of
interest using Bio.SeqIO.parse and write as per
my other email.
This way you'll only have to keep
in memory one record at a time, and a list/set
of the wanted IDs:
http://lists.open-bio.org/pipermail/biopython/2010-January/006128.html

Peter

From alvin at pasteur.edu.uy Tue Jan 19 17:45:23 2010
From: alvin at pasteur.edu.uy (Alvaro F Pena Perea)
Date: Tue, 19 Jan 2010 15:45:23 -0200
Subject: [Biopython] Subprocess:Clustalw
Message-ID: <3d7a3fc11001190945s36c2b53m1fd60ad9c23862ff@mail.gmail.com>

Hi all,
I'm new to Biopython and I am trying to learn how to use Bio.Align. I have
some doubts about running Clustalw within a python script.
I run this without problems:

###
from Bio.Align.Applications import ClustalwCommandline
import sys
import subprocess

STDO = open("stdo.txt", "w")
STDE = open("stde.txt", "w")
cline = ClustalwCommandline("clustalw2",infile="opuntia.fasta")
return_code = subprocess.call(str(cline), stderr = STDE,
                              shell=(sys.platform!="win32"))
print return_code
###

but the point is that I would like to choose my "infile" from argv. I mean,
something like this:

archive = open(sys.argv[1])
cline = ClustalwCommandline("clustalw2",infile=archive)

I realized that "str" in subprocess doesn't allow this

str(cline)

I wonder if it would be possible to run the algorithm from argv or from any
handle.

Thanks in advance

Álvaro Pena

From p.j.a.cock at googlemail.com Tue Jan 19 18:57:55 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 19 Jan 2010 18:57:55 +0000
Subject: [Biopython] Subprocess:Clustalw
In-Reply-To: <3d7a3fc11001190945s36c2b53m1fd60ad9c23862ff@mail.gmail.com>
References: <3d7a3fc11001190945s36c2b53m1fd60ad9c23862ff@mail.gmail.com>
Message-ID: <320fb6e01001191057l4ffffa24k3b951817ca038d91@mail.gmail.com>

Hi,

I'm a little unclear what you are trying to do - clustalw doesn't let you
send input via stdin or get the alignment by stdout. Other tools like
muscle can do this and our tutorial has examples of this.
Peter From schafer at rostlab.org Tue Jan 19 21:56:00 2010 From: schafer at rostlab.org (=?ISO-8859-1?Q?Christian_Sch=E4fer?=) Date: Tue, 19 Jan 2010 16:56:00 -0500 Subject: [Biopython] Subprocess:Clustalw In-Reply-To: <3d7a3fc11001190945s36c2b53m1fd60ad9c23862ff@mail.gmail.com> References: <3d7a3fc11001190945s36c2b53m1fd60ad9c23862ff@mail.gmail.com> Message-ID: <4B562A70.6000201@rostlab.org> On 01/19/2010 12:45 PM, Alvaro F Pena Perea wrote: > but he point is that I would like to choose my "infile" from argv. I mean, > something like this: > > archive = open(sys.argv[1]) > cline = ClustalwCommandline("clustalw2",infile=archive) I'm not sure I understand the significance of your approach. If this is about reading the path to the fasta file from commandline, why don't you do just the following: """ Assuming, sys.argv[1] holds the path to the fasta file """ archive = sys.argv[1] #Instead of archive = open(sys.argv[1]) cline = ClustalwCommandline("clustalw2",infile=archive) Chris From aboulia at gmail.com Tue Jan 19 23:43:11 2010 From: aboulia at gmail.com (Kevin) Date: Wed, 20 Jan 2010 07:43:11 +0800 Subject: [Biopython] SeqIO.index for csfasta files memory issues In-Reply-To: <320fb6e01001190138o191101a5h24cc9286a4e0ce6a@mail.gmail.com> References: <5b6410e1001190031k533f182fy939c7b4ecc22d72a@mail.gmail.com> <320fb6e01001190138o191101a5h24cc9286a4e0ce6a@mail.gmail.com> Message-ID: Hi Peter It's 64 bit centos shared cluster I assumed all the rest of Python and such are the same as well but I may be wrong. It's version 1.53 I believe for biopython I wanted random access as I need half the reads separated this way and I think it is faster. Guess I have to do it the old way Thanks Kev Sent from my iPod On 19-Jan-2010, at 5:38 PM, Peter wrote: > On Tue, Jan 19, 2010 at 8:31 AM, Kevin Lam wrote: >> What are the memory limitations for SeqIO.index? 
>> I am trying to create an index for a 4.5 gb csfasta file >> ~ 60 million reads >> but the script crashes at 5 Gb ram usage >> the machine has 31 Gb ram. > > What OS are you using (and is it 64bit)? > What Python are you using (and is it 64bit)? > What version of Biopython are you using? > > I've never tried a file with quite that many reads, but > crashing at about 5GB is odd. I wonder if this is a 4GB > limit somewhere in your system (e.g. running 32bit > Python). Adding some debug statements we could > see when it falls over (i.e. how many reads had > been indexed). > > Long term, really really big indexes will be too big > to hold in memory as a python dict (record IDs and > file offsets). Therefore we have done a little work > looking at disk based indexes, including sqlite3. > This does make building the index much slower > though. > > For your immediate task, try a simple iteration > through the records, selecting the records of > interest using Bio.SeqIO.parse and write as per > my other email. This way you'll only have to keep > in memory one record at a time, and a list/set > of the wanted IDs: > http://lists.open-bio.org/pipermail/biopython/2010-January/006128.html > > Peter From biopython at maubp.freeserve.co.uk Wed Jan 20 11:14:26 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 20 Jan 2010 11:14:26 +0000 Subject: [Biopython] SeqIO.index for csfasta files memory issues In-Reply-To: References: <5b6410e1001190031k533f182fy939c7b4ecc22d72a@mail.gmail.com> <320fb6e01001190138o191101a5h24cc9286a4e0ce6a@mail.gmail.com> Message-ID: <320fb6e01001200314q1d7fb2b9i7bf6d2a5e1b82fdc@mail.gmail.com> On Tue, Jan 19, 2010 at 11:43 PM, Kevin wrote: > > Hi Peter > It's 64 bit centos shared cluster OK, good. > I assumed all the rest of Python and such are the same as well > but I may be wrong. It would be worth checking out - if the Python installed is just 32bit, then hitting a memory limit at 4GB would make sense. 
> It's version 1.53 I believe for biopython OK, good. > I wanted random access as I need half the reads separated this way > and I think it is faster. Guess I have to do it the old way. Could you show us a sample of the data - say just the first 20 reads? I could then generate a large test file in a similar style to see what happens if I try and index it on my machine. It would also be nice if you would allow us to use the sample for a Biopython unit test. Thanks, Peter From alvin at pasteur.edu.uy Wed Jan 20 12:07:28 2010 From: alvin at pasteur.edu.uy (Alvaro F Pena Perea) Date: Wed, 20 Jan 2010 10:07:28 -0200 Subject: [Biopython] Subprocess:Clustalw In-Reply-To: <320fb6e01001191057l4ffffa24k3b951817ca038d91@mail.gmail.com> References: <3d7a3fc11001190945s36c2b53m1fd60ad9c23862ff@mail.gmail.com> <320fb6e01001191057l4ffffa24k3b951817ca038d91@mail.gmail.com> Message-ID: <3d7a3fc11001200407j63d44f90t8408a19c1a071c11@mail.gmail.com> Ok. Thank you very much. I will try with muscle. Álvaro Pena 2010/1/19 Peter Cock > Hi, > > I'm a little unclear what you are trying to do - clustalw doesn't let > you send input via stdin or get the alignment by stdout. Other tools > like muscle can do this and our tutorial has examples of this.
> > Peter > From biopython at maubp.freeserve.co.uk Thu Jan 21 11:31:29 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 21 Jan 2010 11:31:29 +0000 Subject: [Biopython] SeqIO.index for csfasta files memory issues In-Reply-To: <5b6410e1001210110t39e8ef06ne8681e491cdbf82d@mail.gmail.com> References: <5b6410e1001190031k533f182fy939c7b4ecc22d72a@mail.gmail.com> <320fb6e01001190138o191101a5h24cc9286a4e0ce6a@mail.gmail.com> <320fb6e01001200314q1d7fb2b9i7bf6d2a5e1b82fdc@mail.gmail.com> <5b6410e1001210110t39e8ef06ne8681e491cdbf82d@mail.gmail.com> Message-ID: <320fb6e01001210331g1de02917qe8cfe72b13906e0e@mail.gmail.com> On Thu, Jan 21, 2010 at 9:10 AM, Kevin Lam wrote: > > Yups python is 64bit >>>> platform.architecture() > ('64bit', 'ELF') Hmm - I was hoping that wouldn't be the case. > the sample 1 file has > 48412673 reads > here's the top 20 reads > > head -n 20 Sample2.csfasta >>427_22_20_F3 > T33133100313302011000100000000000000000000010000000 >>427_22_29_F3 > T30101002122001001000300000200030000000002121003000 >>427_22_44_F3 > T12223211021010030202120002130211102100003002010303 >>427_22_52_F3 > T32031331333133301101223023301013011032103032122123 >>427_22_58_F3 > T23010130111130001000202232101031001010000000000000 >>427_22_66_F3 > T10303202110222020010200311000110011001001111000110 >>427_22_72_F3 > T23332102212232122131103321303322213023003233100320 >>427_22_87_F3 > T20112313302013303131123323002203111122211310000010 >>427_22_113_F3 > T32021321020200032003222000221030102023012000003013 >>427_22_169_F3 > T22012322202220000000100000100000000000000010100020 Thanks Kevin, I wrote a trivial script to generate a big fake Solid CSFASTA like this: import random total = 48412673 # 48 million count = 0 handle = open("big_fake_solid.csfasta", "w") for i in range(1000): for j in range(100): for k in range(1000): for h in range(256): nuc = random.choice("ACGT") #I could make the color sequence random, but #there is no real point for testing indexing: color_changes 
= "33133100313302011000100000000000000000000010000000" handle.write(">%03i_%02i_%02i_%02X\n%s%s\n" \ % (i,j,k,h, nuc, color_changes)) count += 1 if count >= total : break if count >= total : break #print "Done %i so far" % count if count >= total : break if count >= total : break handle.close() I then tried indexing with Bio.SeqIO.index("big_fake_solid.csfasta","fasta") using Biopython 1.53+ (latest code from git) on Mac OS X 10.5 Leopard with 12GB of RAM, using the Apple provided Python 2.5 installation. I watched the process in system monitor and it failed when memory consumption reached 4GB, with a repeated message: Python(608) malloc: *** mmap(size=262144) failed (error code=12) *** error: can't allocate region *** set a breakpoint in malloc_error_break to debug and traceback: Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/Bio/SeqIO/_index.py", line 262, in __init__ self._record_key(line[marker_offset:].strip().split(None,1)[0], offset) It turns out that my copy of Python (the default Apple provided one on Leopard) seems to be just 32bit, $ python Python 2.5.2 (r252:60911, Feb 22 2008, 07:57:53) [GCC 4.0.1 (Apple Computer, Inc. build 5363)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> import platform >>> platform.architecture() ('32bit', '') So *if* your system was running 32bit python, I would expect it to fail like this. I'd like to try a 64bit python locally - either I could install this manually, or look for a big memory Linux box to try. 
Or, If I updated my OS, it looks like Mac OS X 10.6 Snow Leopard includes 64bit Python 2.6, plus a Python 2.5 which is only 32bit: http://developer.apple.com/mac/library/documentation/Darwin/Reference/ManPages/man1/python.1.html Peter From biopython at maubp.freeserve.co.uk Thu Jan 21 11:58:12 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 21 Jan 2010 11:58:12 +0000 Subject: [Biopython] SeqIO.index for csfasta files memory issues In-Reply-To: <320fb6e01001210331g1de02917qe8cfe72b13906e0e@mail.gmail.com> References: <5b6410e1001190031k533f182fy939c7b4ecc22d72a@mail.gmail.com> <320fb6e01001190138o191101a5h24cc9286a4e0ce6a@mail.gmail.com> <320fb6e01001200314q1d7fb2b9i7bf6d2a5e1b82fdc@mail.gmail.com> <5b6410e1001210110t39e8ef06ne8681e491cdbf82d@mail.gmail.com> <320fb6e01001210331g1de02917qe8cfe72b13906e0e@mail.gmail.com> Message-ID: <320fb6e01001210358m3b274028re98c64b8e4fe772a@mail.gmail.com> On Thu, Jan 21, 2010 at 11:31 AM, Peter wrote: > It turns out that my copy of Python (the default Apple provided one > on Leopard) seems to be just 32bit, > > $ python > Python 2.5.2 (r252:60911, Feb 22 2008, 07:57:53) > [GCC 4.0.1 (Apple Computer, Inc. build 5363)] on darwin > Type "help", "copyright", "credits" or "license" for more information. 
>>>> import platform >>>> platform.architecture() > ('32bit', '') Another check, $ file /usr/bin/python /usr/bin/python: Mach-O universal binary with 2 architectures /usr/bin/python (for architecture ppc7400): Mach-O executable ppc /usr/bin/python (for architecture i386): Mach-O executable i386 $ which python /Library/Frameworks/Python.framework/Versions/Current/bin/python $ file /Library/Frameworks/Python.framework/Versions/Current/bin/python /Library/Frameworks/Python.framework/Versions/Current/bin/python: Mach-O universal binary with 2 architectures /Library/Frameworks/Python.framework/Versions/Current/bin/python (for architecture i386): Mach-O executable i386 /Library/Frameworks/Python.framework/Versions/Current/bin/python (for architecture ppc): Mach-O executable ppc > So *if* your system was running 32bit python, I would expect it to > fail like this. I'd like to try a 64bit python locally - either I could > install this manually, ... From reading up, it seems that while python.org does have dmg installers for Mac OS X, currently they only support i386 and ppc (not 64bit). While in theory I could download and install Python from source, it sounds a little fiddly, and I don't want to mess up my machine. > ... or look for a big memory Linux box to try. This may be easier for me!
Peter From biopython at maubp.freeserve.co.uk Thu Jan 21 13:03:33 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 21 Jan 2010 13:03:33 +0000 Subject: [Biopython] SeqIO.index for csfasta files memory issues In-Reply-To: <320fb6e01001210358m3b274028re98c64b8e4fe772a@mail.gmail.com> References: <5b6410e1001190031k533f182fy939c7b4ecc22d72a@mail.gmail.com> <320fb6e01001190138o191101a5h24cc9286a4e0ce6a@mail.gmail.com> <320fb6e01001200314q1d7fb2b9i7bf6d2a5e1b82fdc@mail.gmail.com> <5b6410e1001210110t39e8ef06ne8681e491cdbf82d@mail.gmail.com> <320fb6e01001210331g1de02917qe8cfe72b13906e0e@mail.gmail.com> <320fb6e01001210358m3b274028re98c64b8e4fe772a@mail.gmail.com> Message-ID: <320fb6e01001210503m59fb9d82pf4e6a25c8d86d1ce@mail.gmail.com> On Thu, Jan 21, 2010 at 11:58 AM, Peter wrote: > On Thu, Jan 21, 2010 at 11:31 AM, Peter wrote: >> ... or look for a big memory Linux box to try. > > This may be easier for me! That worked :) This was a 48 million entry ~3GB faked color space FASTA file. It took about 10 mins and about 7GB (I missed the final memory usage figure as I was only checking in top), using Biopython 1.53 on a 64bit installation of Python 2.4.3: $ python Python 2.4.3 (#1, Jan 21 2009, 01:11:33) [GCC 4.1.2 20071124 (Red Hat 4.1.2-42)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import platform >>> platform.architecture() ('64bit', 'ELF') Could you double check the version of Python on the nodes of your cluster (just in case the head node is using something different, or some of the nodes are 32bit and others are 64bit)? 
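For the per-node check suggested above, a small sketch that could be run on each node is given here. It reports the pointer size of the running interpreter via `struct.calcsize("P")`, which does not depend on how the executable file is packaged; the Python documentation notes that `platform.architecture()` can be unreliable for universal binaries on Mac OS X. The function names are illustrative.

```python
import platform
import struct
import sys


def python_bitness():
    """Return the pointer size of the running interpreter in bits (32 or 64)."""
    # struct.calcsize("P") is the size in bytes of a C pointer (void *).
    return struct.calcsize("P") * 8


def describe_interpreter():
    """Return a one-line summary suitable for printing on each cluster node."""
    return "%s exe=%s python=%s bits=%d arch=%s" % (
        platform.node(),
        sys.executable,
        platform.python_version(),
        python_bitness(),
        platform.architecture()[0],
    )


print(describe_interpreter())
```

Submitting this as a job to each node, rather than running it only on the head node, shows whether any node picks up a different (for example 32bit) Python.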
Peter From mitlox at op.pl Sun Jan 24 02:14:53 2010 From: mitlox at op.pl (xyz) Date: Sun, 24 Jan 2010 12:14:53 +1000 Subject: [Biopython] BLAST database access Message-ID: <4B5BAD1D.2020004@op.pl> Hello, I have run MegaBlast and the results I can parse for example with: input_file = open("megablastres.txt","r") for line in input_file.readlines(): if line[0] == "#" : #header line, ignore else: parts = line.rstrip().split() print "Subject id = %s" % parts[1] How could I retrieve the sequence which belong to subject id from BLAST database with BioPython? Thank you in advance. Best regards From biopython at maubp.freeserve.co.uk Sun Jan 24 13:47:14 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 24 Jan 2010 13:47:14 +0000 Subject: [Biopython] BLAST database access In-Reply-To: <4B5BAD1D.2020004@op.pl> References: <4B5BAD1D.2020004@op.pl> Message-ID: <320fb6e01001240547n72ed3a0cx6ae8cdadfb80ff5a@mail.gmail.com> On Sun, Jan 24, 2010 at 2:14 AM, xyz wrote: > Hello, > I have run MegaBlast and the results I can parse for example with: > > input_file = open("megablastres.txt","r") > for line in input_file.readlines(): > if line[0] == "#" : > #header line, ignore > else: > parts = line.rstrip().split() > print "Subject id = %s" % parts[1] If all you want is the subject ID, that looks simple. I guess you are using one of the simple tabular output formats? > How could I retrieve the sequence which belong to subject id > from BLAST database with BioPython? Are you using a local BLAST database, or an online one? If online, I would try using the hit ID to search via the NCBI Entrez interface, see the Bio.Entrez chapter in our tutorial. If the database is local, then the NCBI provides a tool as part of the BLAST suite for this called fastacmd. 
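As posted, the parsing loop has an `if` branch that contains only a comment, which Python rejects once the indentation is restored. A runnable sketch of the same idea follows, assuming whitespace-separated tabular output with the subject ID in the second column, as in the question; the function names are illustrative.

```python
def subject_ids(lines):
    """Yield the subject ID (second column) from tabular BLAST output lines."""
    for line in lines:
        line = line.rstrip()
        if not line or line.startswith("#"):
            continue  # blank line or header/comment line, ignore
        parts = line.split()
        if len(parts) < 2:
            continue  # not a hit line in the expected layout
        yield parts[1]


def unique_in_order(ids):
    """Return the IDs with duplicates removed, keeping first-seen order."""
    seen = set()
    result = []
    for sid in ids:
        if sid not in seen:
            seen.add(sid)
            result.append(sid)
    return result


# Typical use (file name taken from the question):
# with open("megablastres.txt") as handle:
#     for sid in unique_in_order(subject_ids(handle)):
#         print("Subject id = %s" % sid)
```

Iterating over the handle directly, instead of calling `readlines()`, avoids loading the whole result file into memory, and the de-duplicated list is a convenient input for a later sequence lookup.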
Peter From pedro.al at fenhi.uh.cu Mon Jan 25 16:08:21 2010 From: pedro.al at fenhi.uh.cu (Yasser Almeida Hernández) Date: Mon, 25 Jan 2010 11:08:21 -0500 Subject: [Biopython] Rename atoms Message-ID: <20100125110821.vesao6besg0wggcs@correo.fenhi.uh.cu> Hi all... Is it possible to rename atoms in .pdb files? Thanks -- Lic. Yasser Almeida Hernández Center of Molecular Inmunology (CIM) Nanobiology Group P.O.Box 16040, Havana, Cuba Phone: (537) 271-7933, ext. 246 ---------------------------------------------------------------- Correo FENHI From rafal.b.pawlak at gmail.com Mon Jan 25 21:38:17 2010 From: rafal.b.pawlak at gmail.com (x y) Date: Mon, 25 Jan 2010 22:38:17 +0100 Subject: [Biopython] GI number Message-ID: <586d60191001251338m42717464y6bcf0b2d411a62b0@mail.gmail.com> hello, how can I extract the GI number in this program? from Bio import SeqIO handle = open("xyz.fasta") for seq_record in SeqIO.parse(handle, "fasta"): print seq_record.description handle.close() ex. Osa_SPT6 gi|222632083|gb|EEE64215.1| hypothetical protein Os05g41510.1_ORYZA [Oryza sativa Japonica Group] rafal pawlak From p.j.a.cock at googlemail.com Mon Jan 25 23:41:46 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 25 Jan 2010 23:41:46 +0000 Subject: [Biopython] GI number In-Reply-To: <586d60191001251338m42717464y6bcf0b2d411a62b0@mail.gmail.com> References: <586d60191001251338m42717464y6bcf0b2d411a62b0@mail.gmail.com> Message-ID: <320fb6e01001251541i3842f640me86d4b88e064af86@mail.gmail.com> On Mon, Jan 25, 2010 at 9:38 PM, x y wrote: > hello, > how can I extract the GI number in this program? > > from Bio import SeqIO > handle = open("xyz.fasta") > for seq_record in SeqIO.parse(handle, "fasta"): > print seq_record.description > handle.close() > > ex.
> Osa_SPT6 gi|222632083|gb|EEE64215.1| hypothetical protein Os05g41510.1_ORYZA > [Oryza sativa Japonica Group] > > rafal pawlak I would just use the Python string split method on this string - assuming all your records use the same layout, e.g. something like this: gi = record.description.split()[1].split("|")[1] There are related examples in the tutorial, search for "get_accession" which are a bit more robust because they check the string follows the expected format. You could alternatively use a regular expression. Peter From mitlox at op.pl Tue Jan 26 03:58:53 2010 From: mitlox at op.pl (xyz) Date: Tue, 26 Jan 2010 13:58:53 +1000 Subject: [Biopython] BLAST database access In-Reply-To: <320fb6e01001240547n72ed3a0cx6ae8cdadfb80ff5a@mail.gmail.com> References: <4B5BAD1D.2020004@op.pl> <320fb6e01001240547n72ed3a0cx6ae8cdadfb80ff5a@mail.gmail.com> Message-ID: <4B5E687D.7040905@op.pl> Peter wrote: >> How could I retrieve the sequence which belong to subject id >> from BLAST database with BioPython? >> > > Are you using a local BLAST database, or an online one? > If online, I would try using the hit ID to search via the NCBI > Entrez interface, see the Bio.Entrez chapter in our tutorial. > If the database is local, then the NCBI provides a tool as > part of the BLAST suite for this called fastacmd. > > Peter Thank you. I could retrieve the sequences from a local BlastDB with fastacmd, but I have some local BlastDBs which do not have any index, because they were created without using the -o T option in formatdb. How could I retrieve the sequences from local BlastDBs without an index? Thank you in advance.
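One possible workaround, offered as an untested sketch rather than a confirmed recipe: legacy fastacmd has a dump mode (`-D 1`) that writes a whole database out as FASTA, and that file can then be searched or indexed like any other FASTA file. Whether the dump works on a database built without `-o T` is an assumption worth checking on a small database first. The helper names and the database and file names below are hypothetical.

```python
import subprocess


def fastacmd_dump_command(db_name, out_fasta, executable="fastacmd"):
    """Build the argument list for dumping a BLAST database to FASTA.

    Assumed legacy fastacmd options: -d database, -D 1 (dump as FASTA),
    -o output file. Check the usage message of your own installation.
    """
    return [executable, "-d", db_name, "-D", "1", "-o", out_fasta]


def dump_database(db_name, out_fasta):
    """Run fastacmd and return its exit code (0 means success)."""
    return subprocess.call(fastacmd_dump_command(db_name, out_fasta))


# Example with hypothetical names (requires fastacmd on the PATH):
# dump_database("my_local_db", "my_local_db_dump.fasta")
```

The identifiers in the dumped title lines may differ from those in the original FASTA file, so it is worth comparing a few records against the BLAST hit IDs before relying on them.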
From biopython at maubp.freeserve.co.uk Tue Jan 26 12:59:25 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 26 Jan 2010 12:59:25 +0000 Subject: [Biopython] BLAST database access In-Reply-To: <4B5E687D.7040905@op.pl> References: <4B5BAD1D.2020004@op.pl> <320fb6e01001240547n72ed3a0cx6ae8cdadfb80ff5a@mail.gmail.com> <4B5E687D.7040905@op.pl> Message-ID: <320fb6e01001260459m6c046e2bwf6b89be26cf0806f@mail.gmail.com> On Tue, Jan 26, 2010 at 3:58 AM, xyz wrote: > Peter wrote: >>> >>> How could I retrieve the sequence which belong to subject id >>> from BLAST database with BioPython? >>> >> >> Are you using a local BLAST database, or an online one? >> If online, I would try using the hit ID to search via the NCBI >> Entrez interface, see the Bio.Entrez chapter in our tutorial. >> If the database is local, then the NCBI provides a tool as >> part of the BLAST suite for this called fastacmd. >> >> Peter > > Thank you. I could retrieve the sequences from a local BlastDB with > fastacmd, but I have some local BlastDBs which do not have any index, > because they were created without using the -o T option in formatdb. > > How could I retrieve the sequences from local BlastDBs without index? > > Thank you in advance. That sounds harder... do you still have the original FASTA file used to build the BLASTDB? If so, just index that - for example using the Bio.SeqIO.index() functionality in Biopython 1.52 or later. Peter From bouchard.lysiane at gmail.com Wed Jan 27 18:16:30 2010 From: bouchard.lysiane at gmail.com (Lysiane Bouchard) Date: Wed, 27 Jan 2010 13:16:30 -0500 Subject: [Biopython] NaN values, lowess Message-ID: Hi, I am using the lowess function in Bio.Statistics.lowess, version 1.53. When input array y is zero everywhere, I obtain yest=NaN everywhere. I wonder if I did something wrong and if other special cases might lead to NaN values.
------------------------------ ------------------------------------------------------- >>ipython --pylab >>In [1]: import numpy >>In [2]: from Bio.Statistics.lowess import lowess >>In [3]: x = numpy.array(range(200))*1.0 >>In [4]: y = numpy.zeros([200,]) >>In [5]: yest = lowess(x,y) >>In [6]: all(isnan(yest)) >>Out[6]: True ----------------------------------------------------------------------------------- Thank you, Lysiane Bouchard From richard_w_g_price at academia.edu Wed Jan 27 20:41:00 2010 From: richard_w_g_price at academia.edu (Richard Price) Date: Wed, 27 Jan 2010 12:41:00 -0800 Subject: [Biopython] Recent Activity of the 11 Biopython members on Academia.edu Message-ID: Dear Biopython members, We just wanted to let you know about some recent activity on the Biopython group on Academia.edu. In the Biopython group on Academia.edu, there are now: - 11 people - 1 paper - 1 photo Biopython members' pages have been viewed a total of 1,494 times, and their papers have been viewed a total of 2 times. To see these people, papers and status updates, follow the link below: http://lists.academia.edu/See-members-of-Biopython Richard Dr. Richard Price, post-doc, Philosophy Dept, Oxford University. Founder of Academia.edu From s.schmeier at gmail.com Sat Jan 30 08:46:56 2010 From: s.schmeier at gmail.com (Sebastian Schmeier) Date: Sat, 30 Jan 2010 11:46:56 +0300 Subject: [Biopython] SeqIO.index() Message-ID: <679b68f21001300046h36ca77dfh93c07af9ccbbef83@mail.gmail.com> Dear community, I am new to the mailing list and have a problem/question regarding the SeqIO.index() method/module. Up to now, I usually used a home-brewed fasta-file parser. This time though I had a look at the SeqIO interface. I am especially interested in the index() method. The fasta-files I use have non-standardized (if this is even possible) headers.
I found that the index method uses the first string after the marker up to a space as the identifier for the dictionary (I will call this ID in the text below). It is however a great idea to have a function argument "key_function" that allows for adjust the key values via a self implemented callback function. This is essential in my case because ID in my fasta-file are not unique per entry. I had a look at the source code of SeqIO/_index.py and I found that unfortunately in the current implementation the "key_function" only acts on ID. I think it would make more sense to allow to extract a key from the complete header. Is this somehow possible with the current implementation? I refer here to the code in SeqIO/_index.py: 188 class _SequentialSeqFileDict(_IndexedSeqFileDict) : . . . 200 if marker_re.match(line) : 201 #Here we can assume the record.id is the first word after the 202 #marker. This is generally fine... but not for GenBank, EMBL, Swiss 203 self._record_key(line[marker_offset:].strip().split(None,1)[0], offset) ##### here you define that the key_function only acts on the first split Thanks, Seb From p.j.a.cock at googlemail.com Sat Jan 30 14:08:57 2010 From: p.j.a.cock at googlemail.com (Peter) Date: Sat, 30 Jan 2010 14:08:57 +0000 Subject: [Biopython] SeqIO.index() In-Reply-To: <679b68f21001300046h36ca77dfh93c07af9ccbbef83@mail.gmail.com> References: <679b68f21001300046h36ca77dfh93c07af9ccbbef83@mail.gmail.com> Message-ID: Hi Your request makes perfect sense for FASTA files, but does not generalise to all the other supported file formats - hence the relatively limited callback support available in Bio.SeqIO.index. I would suggest you could subclass the FASTA indexer to do what you want. Or, for smaller files use Bio.SeqIO.to_dict instead. Regards Peter On 30 Jan 2010, at 08:46, Sebastian Schmeier wrote: > Dear community, > > I am new to the mailing list and have a problem/question regarding the > SeqIO.index() method/module. 
Up to now, I usually used an home-brewed > fasta-file parser. This time though I had a look at the SeqIO > interface. I am especially interested in the index() method. > > The fasta-file I use have non-standardized (if this is even possible) > headers. I found that the index method uses the first string after the > marker up to a space as the identifier for the dictionary (I will call > this ID in the text below). It is however a great idea to have a > function argument "key_function" that allows for adjust the key values > via a self implemented callback function. This is essential in my case > because ID in my fasta-file are not unique per entry. > > I had a look at the source code of SeqIO/_index.py and I found that > unfortunately in the current implementation the "key_function" only > acts on ID. I think it would make more sense to allow to extract a key > from the complete header. Is this somehow possible with the current > implementation? > > I refer here to the code in SeqIO/_index.py: > > > 188 class _SequentialSeqFileDict(_IndexedSeqFileDict) : > . > . > . > 200 if marker_re.match(line) : > 201 #Here we can assume the record.id is the first > word after the > 202 #marker. This is generally fine... but not for > GenBank, EMBL, Swiss > 203 > self._record_key(line[marker_offset:].strip().split(None,1)[0], > offset) ##### here you define that the key_function only acts > on the first split > > > > Thanks, > Seb > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython
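For the FASTA-only case discussed in this thread, a standard-library sketch of an offset index keyed on the complete title line is shown below. It is not the Biopython implementation and does not subclass the private `_index.py` classes; the function names are illustrative. For files small enough to hold in memory, `Bio.SeqIO.to_dict` with a `key_function` that reads `record.description` is the simpler route Peter mentions.

```python
def index_fasta_by_header(filename, key_function):
    """Map key_function(full_header) to the byte offset of each '>' line.

    The full header is everything after '>' on the title line, so the key
    can be taken from any part of it, not just the first word.
    """
    index = {}
    with open(filename, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                header = line[1:].decode("ascii", "replace").strip()
                key = key_function(header)
                if key in index:
                    raise ValueError("Duplicate key %r" % (key,))
                index[key] = offset
    return index


def read_record(filename, offset):
    """Return (header, sequence) for the record starting at offset."""
    with open(filename, "rb") as handle:
        handle.seek(offset)
        header = handle.readline()[1:].decode("ascii", "replace").strip()
        chunks = []
        for line in handle:
            if line.startswith(b">"):
                break  # start of the next record
            chunks.append(line.decode("ascii", "replace").strip())
    return header, "".join(chunks)
```

Only keys and offsets are kept in memory, and duplicate keys raise an error instead of silently overwriting an earlier record, which matters when the first word of the header is not unique.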