From p.j.a.cock at googlemail.com Thu Jul 1 05:33:04 2010
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 1 Jul 2010 10:33:04 +0100
Subject: [Biopython] (least) Favorite PDB models?
In-Reply-To:
References:
Message-ID:

On Thu, Jul 1, 2010 at 4:06 AM, Bryan Lunt wrote:
> Greetings All,
>
> So I have finished (for now) the section of my program that maps PDB
> model residues to SEQRES residues...
>
> And every programmer's favorite thing is QA, right?

It should be ;) Unit tests are good!

> Does anyone have some suggestions of particularly ugly PDB models
> (discontinuous, strange residue numberings, etc)
> that I can test it on?

There are a few odd files listed here,
http://www.warwick.ac.uk/go/peter_cock/python/ramachandran/top500/

For example 1DIN and 2HMZ showed some interesting behaviour with
fractional occupancy - these should be hard to map onto a single
sequence!

Peter

From anaryin at gmail.com Fri Jul 2 15:25:21 2010
From: anaryin at gmail.com (João Rodrigues)
Date: Fri, 2 Jul 2010 14:25:21 -0500
Subject: [Biopython] BioJava-like seqres alignment for Bio.PDB
In-Reply-To:
References: <4C2A7C09.2020204@berkeley.edu>
Message-ID:

Hey!

> >> Does anyone have any code for easy alignment between the SEQRES entry
> >> in a pdb file and the actual ATOM/HETATM entries in the chain?

There's no parsing for SEQRES yet in parse_header_pdb, but it wouldn't be
hard to add.

> >> In biojava, this is just one of the options when you parse a PDB file,
> >> it would certainly be useful.

Indeed it would.

> That looks like a good reason to have a PDB XML parser (as trying to do
> this from the plain text PDB is probably fiddly).

I don't know if people have started working on such a parser but I have some
sort of a head start.
Check here:

http://github.com/JoaoRodrigues/biopython/blob/GSOC2010/Bio/Struct/WWW/WHATIFXML.py

Warning, very ugly :)

> > In any case, having a function that provides this mapping (both
> > directions) in BioPython would be extremely useful.
>
> Maybe something for the GSoC project TODO list? ;)

Hmm, I was working on something more or less like this a while back and it
didn't work that well. But it might be a good idea. It seems however that
Bryan already did it :) No?

João

From rodrigo_faccioli at uol.com.br Fri Jul 2 17:50:58 2010
From: rodrigo_faccioli at uol.com.br (Rodrigo Faccioli)
Date: Fri, 2 Jul 2010 18:50:58 -0300
Subject: [Biopython] BioJava-like seqres alignment for Bio.PDB
In-Reply-To:
References: <4C2A7C09.2020204@berkeley.edu>

Message-ID:

Hi,

About the SEQRES implementation for BioPython, I've developed it. Please
see it in [1]. I hope this implementation can help you.

[1] http://github.com/rodrigofaccioli/ContributeToBioPython/blob/master/fcfrp/PDBParser.py

Best,

--
Rodrigo Antonio Faccioli
Ph.D Student in Electrical Engineering
University of Sao Paulo - USP
Engineering School of Sao Carlos - EESC
Department of Electrical Engineering - SEL
Intelligent System in Structure Bioinformatics
http://laips.sel.eesc.usp.br
Phone: 55 (16) 3373-9366 Ext 229
Curriculum Lattes - http://lattes.cnpq.br/1025157978990218
Public Profile - http://br.linkedin.com/pub/rodrigo-faccioli/7/589/a5

From biopython at maubp.freeserve.co.uk Mon Jul 5 13:53:41 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 5 Jul 2010 18:53:41 +0100
Subject: [Biopython] Is Bio.Crystal still useful (for NDB files)?
Message-ID:

Hello all,

Is anyone using Bio.Crystal? It is a parser for a subset of the PDB format
which used to be used by the Nucleic Acid Database Project. My impression
from their website is that they are moving all their data to the PDB
format - but still have some structures that are present only in the NDB
format: http://ndbserver.rutgers.edu/download_data/index.html

I'm asking because Bio.Crystal needs some updating to avoid deprecation
warnings in the latest versions of Python - and it might be simpler just
to deprecate it instead.

Peter

From rjalves at igc.gulbenkian.pt Mon Jul 5 16:00:17 2010
From: rjalves at igc.gulbenkian.pt (Renato Alves)
Date: Mon, 05 Jul 2010 21:00:17 +0100
Subject: [Biopython] Finding what is the most recent Pubmed ID or list of all valid PMIDs
Message-ID: <4C3239D1.7040708@igc.gulbenkian.pt>

Greetings All,

I'm trying to figure out a way to have a more or less up-to-date list of
Pubmed IDs for validation purposes. This has to be performed in a
programmatic way.

My first attempt was to look for this in NCBI's FTP. I could find
ftp://ftp.ncbi.nih.gov/pubmed/deleted_pmids.txt but not information
about the most recent PMID.

I also tried to use EInfo but the count section under PubMed seems
either outdated or completely unrelated to the total number of assigned
PMIDs. Even when adding the total of deleted_pmids + the number from
EInfo I couldn't get accurate information.

So my question is, does anyone know how to get either a list of all the
valid PMIDs or simply the most recent PMID?
Thanks,
Renato

From biopython at maubp.freeserve.co.uk Mon Jul 5 16:54:37 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 5 Jul 2010 21:54:37 +0100
Subject: [Biopython] Finding what is the most recent Pubmed ID or list of all valid PMIDs
In-Reply-To: <4C3239D1.7040708@igc.gulbenkian.pt>
References: <4C3239D1.7040708@igc.gulbenkian.pt>
Message-ID:

On Mon, Jul 5, 2010 at 9:00 PM, Renato Alves wrote:
> Greetings All,
>
> I'm trying to figure out a way to have a more or less up-to-date list of
> Pubmed IDs for validation purposes. This has to be performed in a
> programmatic way.
>
> My first attempt was to look for this in NCBI's FTP. I could find
> ftp://ftp.ncbi.nih.gov/pubmed/deleted_pmids.txt but not information
> about the most recent PMID.
>
> I also tried to use EInfo but the count section under PubMed seems
> either outdated or completely unrelated to the total number of assigned
> PMIDs. Even when adding the total of deleted_pmids + the number from
> EInfo I couldn't get accurate information.
>
> So my question is, does anyone know how to get either a list of all the
> valid PMIDs or simply the most recent PMID?

To try and work out the latest PMID, I'd start by trying a PubMed search
by date, using a recent threshold.

How many PMIDs are you trying to validate? Would it make
sense to use Entrez to do the validation (in batches)?
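
The date-based search suggested above could be sketched roughly as follows
(Python 3). This is only an illustration under stated assumptions: Biopython
is installed, the PubMed filter term "all[sb]" is assumed to match every
record, and the helper names fetch_recent_pmids and latest_pmid and the
e-mail address are hypothetical. PMIDs are not guaranteed to be assigned in
strict date order, so the largest ID in a recent window is only a rough
estimate of the newest PMID.

```python
def latest_pmid(id_list):
    """Return the numerically largest PMID from an iterable of ID strings.

    Returns None when the iterable is empty.
    """
    pmids = [int(pmid) for pmid in id_list]
    return max(pmids) if pmids else None


def fetch_recent_pmids(email, days=1, retmax=100):
    """Ask PubMed for records with an Entrez date in the last `days` days.

    Hypothetical helper: needs network access and Biopython. The term
    "all[sb]" is assumed to select all PubMed records.
    """
    from Bio import Entrez  # imported lazily so latest_pmid works without Biopython

    Entrez.email = email  # NCBI asks for a contact address
    handle = Entrez.esearch(db="pubmed", term="all[sb]",
                            reldate=days, datetype="edat", retmax=retmax)
    try:
        result = Entrez.read(handle)
    finally:
        handle.close()
    return result["IdList"]


# Example use (performs a network request, so it is left commented out):
# ids = fetch_recent_pmids("A.N.Other@example.com", days=2)
# print(latest_pmid(ids))
```

The estimate can then be combined with the deleted_pmids.txt file mentioned
earlier to rule out IDs that are too large or known to be deleted, without
querying Entrez for every single ID.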
Peter

From rjalves at igc.gulbenkian.pt Mon Jul 5 18:13:34 2010
From: rjalves at igc.gulbenkian.pt (Renato Alves)
Date: Mon, 05 Jul 2010 23:13:34 +0100
Subject: [Biopython] Finding what is the most recent Pubmed ID or list of all valid PMIDs
In-Reply-To:
References: <4C3239D1.7040708@igc.gulbenkian.pt>
Message-ID: <4C32590E.6020609@igc.gulbenkian.pt>

From Peter on 07/05/2010 09:54 PM:
> To try and work out the latest PMID, I'd start by trying a PubMed search
> by date, using a recent threshold.
>
> What number of PMIDs are you trying to validate? Would it make
> sense to use Entrez to do the validation (in batches)?
>
> Peter

I've thought of using Entrez but I was trying to avoid it by using
information available locally. I've no idea how many and what PMIDs will be
requested.

But indeed searching by date I can get a rough idea of what might be the
most recent PMID.

Thanks Peter

From biopython at maubp.freeserve.co.uk Tue Jul 6 04:23:28 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 6 Jul 2010 09:23:28 +0100
Subject: [Biopython] Is Bio.Crystal still useful (for NDB files)?
In-Reply-To: <20100706084942.10823rdn5pazwcg6@horde.genesilico.pl>
References: <20100706084942.10823rdn5pazwcg6@horde.genesilico.pl>
Message-ID:

On Tue, Jul 6, 2010 at 7:49 AM, Kristian Rother wrote:
>
> Hi Peter,
>
> none of us here is using Bio.Crystal - and we use NDB structures a lot.
>
> Best,
>    Kristian

Hi Kristian,

What kind of files are you using from the NDB? Standard PDB or mmCIF
maybe?

Peter

From biopython at maubp.freeserve.co.uk Tue Jul 6 04:43:18 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 6 Jul 2010 09:43:18 +0100
Subject: [Biopython] Is Bio.Crystal still useful (for NDB files)?
In-Reply-To: <20100706102515.18063fisv68bqesr@horde.genesilico.pl>
References: <20100706084942.10823rdn5pazwcg6@horde.genesilico.pl> <20100706102515.18063fisv68bqesr@horde.genesilico.pl>
Message-ID:

> Quoting Peter:
>> Hi Kristian,
>>
>> What kind of files are you using from the NDB? Standard PDB or mmCIF
>> maybe?
>>
>> Peter

On Tue, Jul 6, 2010 at 9:25 AM, Kristian Rother wrote:
> Hi Peter,
>
> Standard PDB.
> We're gradually shifting to the PDB database for queries though, because the
> XML-based API for querying subsets of structures is much better there.
>
> Kristian

Thanks - maybe we can deprecate Bio.Crystal then...

Peter

From biopython at maubp.freeserve.co.uk Tue Jul 6 06:36:39 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 6 Jul 2010 11:36:39 +0100
Subject: [Biopython] Deprecating Bio.Crystal in next release?
Message-ID:

Hi all,

Given recent discussion (and the lack of interest on the dev list on
previous occasions), is there any objection to deprecating Bio.Crystal
in the next release of Biopython?
http://lists.open-bio.org/pipermail/biopython/2010-July/006633.html
http://lists.open-bio.org/pipermail/biopython-dev/2008-October/004405.html
http://lists.open-bio.org/pipermail/biopython-dev/2007-July/002901.html

Thanks,

Peter

From biopython at maubp.freeserve.co.uk Tue Jul 6 06:46:12 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 6 Jul 2010 11:46:12 +0100
Subject: [Biopython] Is Bio.Crystal still useful (for NDB files)?
In-Reply-To: <20100706123045.12993suli0ec27dh@horde.genesilico.pl>
References: <20100706084942.10823rdn5pazwcg6@horde.genesilico.pl> <20100706123045.12993suli0ec27dh@horde.genesilico.pl>
Message-ID:

On Tue, Jul 6, 2010 at 11:30 AM, Kristian Rother wrote:
>
>> What kind of files are you using from the NDB? Standard PDB or mmCIF
>> maybe?
>
> Ah... sorry.. just PDB.
>
> Kristian

That's useful to know - thanks.

Peter

P.S. Please try to CC the mailing list ;)

From biopython at maubp.freeserve.co.uk Tue Jul 6 09:40:51 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 6 Jul 2010 14:40:51 +0100
Subject: [Biopython] Deprecating Bio.InterPro
Message-ID:

Hi all,

Another old module which hasn't been updated for some time is
Bio.InterPro, a parser for the HTML (webpages) at the EBI, e.g.
http://www.ebi.ac.uk/interpro/IEntry?ac=IPR001064

The parser doesn't work with the current website, and also uses a
Python library called sgmllib which was deprecated as of Python 2.6.
Website parsers are in general a bad idea because they tend to need
a lot of work to keep up to date. Perhaps in this case there are
suitable plain text files on the FTP site which might be used?

Unless anyone has a good reason not to, we are going to deprecate
the Bio.InterPro module in the next release of Biopython.

Peter

From biopython at maubp.freeserve.co.uk Tue Jul 6 11:20:38 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 6 Jul 2010 16:20:38 +0100
Subject: [Biopython] Deprecating Bio.Index?
Message-ID:

Hello all,

Is anyone using the Bio.Index module in Biopython in their own code?
This supported file indexing and was used in other parts of Biopython
which have all now been deprecated (e.g. Bio.SwissProt.SProt and
Bio.Prosite) or removed. The more recent Bio.SeqIO module provides
a general approach to indexing sequence files.

Would it inconvenience anyone if Bio.Index was deprecated in the next
release (triggering warnings when imported, but still functional), and
then removed later on?

Thanks,

Peter

From guyeakin at gmail.com Wed Jul 7 20:52:49 2010
From: guyeakin at gmail.com (Guy Eakin)
Date: Wed, 7 Jul 2010 20:52:49 -0400
Subject: [Biopython] Bio.Entrez/Medline DTD problems - missing DTD nlmmedlinecitationset_100301.dtd
Message-ID:

I am learning biopython and seem to be having trouble parsing efetch
generated XML.

Maybe I am confused here, but I can't for the life of me get my XML to parse
correctly, and it seems to be coming up with a missing DTD error using both
Medline.parse and Entrez.parse. (traceback for medline below)

nlmmedlinecitationset_100301.dtd and pubmed_100301.dtd seem to be missing
from my biopython installation, and unavailable from the following NCBI
sites:

http://www.ncbi.nlm.nih.gov/dtd/ or
http://eutils.ncbi.nlm.nih.gov/entrez/query/DTD/

My apologies if this is user error; I do not see reference to this DTD issue
in the archives so am posting the incident. Is this just bad luck during my
learning curve, or am I missing something conceptual here?
Thanks,
Guy

Traceback (most recent call last):
  File "C:\Python26\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 312, in RunScript
    exec codeObject in __main__.__dict__
  File "C:\ieye\ieye\citations\pubmed_search_fxn.py", line 36, in <module>
    parsed_results = Entrez.read(fetch_handle)
  File "C:\Python26\lib\site-packages\Bio\Entrez\__init__.py", line 262, in read
    record = handler.read(handle)
  File "C:\Python26\lib\site-packages\Bio\Entrez\Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "C:\Python26\lib\site-packages\Bio\Entrez\Parser.py", line 381, in externalEntityRefHandler
    parser.ParseFile(handle)
  File "C:\Python26\lib\site-packages\Bio\Entrez\Parser.py", line 377, in externalEntityRefHandler
    raise RuntimeError(message)
RuntimeError: Unable to load DTD file nlmmedlinecitationset_100301.dtd.

From biopython at maubp.freeserve.co.uk Thu Jul 8 03:42:32 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 8 Jul 2010 08:42:32 +0100
Subject: [Biopython] Bio.Entrez/Medline DTD problems - missing DTD nlmmedlinecitationset_100301.dtd
In-Reply-To:
References:
Message-ID:

On Thu, Jul 8, 2010 at 1:52 AM, Guy Eakin wrote:
> I am learning biopython and seem to be having trouble parsing efetch
> generated xml.
>
> Maybe I am confused here, but I can't for the life of me Get my xml to parse
> correctly, and it seems to be coming up with a missing dtd error using both
> Medline.parse and Entrez.parse. (traceback for medline below below)
>
> nlmmedlinecitationset_100301.dtd and pubmed_100301.dtd seem to be missing
> from my biopython installation, and unavailable from the following NCBI
> sites:
>
> http://www.ncbi.nlm.nih.gov/dtd/ or
> http://eutils.ncbi.nlm.nih.gov/entrez/query/DTD/
>
> My apologies if this is user error; i do not see reference to this DTD issue
> in the archives so am posting the incident. Is this just bad luck during my
> learning curve, or am I missing something conceptual here?
The problem is with the NCBI "hiding" the file by not showing the raw
contents of that folder, but just an HTML page with a partial list. You
need this file:

http://eutils.ncbi.nlm.nih.gov/corehtml/query/DTD/nlmmedlinecitationset_100301.dtd

I've added this to our repository so the next version of Biopython will
include it. Please let us know if anything else is missing - what was
the Entrez request you used to get the XML using this DTD file?

Regards,

Peter

From guyeakin at gmail.com Thu Jul 8 07:28:17 2010
From: guyeakin at gmail.com (Guy Eakin)
Date: Thu, 8 Jul 2010 07:28:17 -0400
Subject: [Biopython] Bio.Entrez/Medline DTD problems - missing DTD nlmmedlinecitationset_100301.dtd
In-Reply-To:
References:

Message-ID:

Peter,

Many thanks.

This is a query statement that generated the
nlmmedlinecitationset_100301.dtd error:

Entrez.esearch(db="pubmed",
               term=('glaucom*'),
               retmax=2, usehistory="y",
               reldate=7, datetype="edat")

fetch_handle = Entrez.efetch(db="pubmed", retmode="xml", rettype='medline',
                             webenv=webenv, query_key=query_key)

You will also want to add pubmed_100301.dtd to your repository. I do not
have the query that generated its dependent XML, but got a separate error
related to its absence yesterday. Oddly, I was able to download the "hidden"
pubmed_100301.dtd, but could not replicate the error. All following errors
focused on the nlmmedlinecitationset_100301.dtd file which I could not
locate until this morning. Perhaps it was just recently posted to the site.

Either way, thanks for the confirmation that I was on the right track.

regards,
guy

From biopython at maubp.freeserve.co.uk Thu Jul 8 07:53:30 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 8 Jul 2010 12:53:30 +0100
Subject: [Biopython] Bio.Entrez/Medline DTD problems - missing DTD nlmmedlinecitationset_100301.dtd
In-Reply-To:
References:

Message-ID:

On Thu, Jul 8, 2010 at 12:28 PM, Guy Eakin wrote:
> Peter,
>
> Many thanks.
>
> this is a query statement that generated the
> nlmmedlinecitationset_100301.dtd error:
>
> Entrez.esearch(db="pubmed",
>                term= ('glaucom*'),
>                retmax=2, usehistory="y",
>                reldate=7, datetype="edat")
>
> fetch_handle = Entrez.efetch(db="pubmed", retmode="xml",rettype='medline',
>                              webenv=webenv, query_key=query_key)

Great. A more complete version is:

from Bio import Entrez
Entrez.email = "A.N.Other@example.com"
s = Entrez.read(Entrez.esearch(db="pubmed", term=('glaucom*'), retmax=2,
                               usehistory="y", reldate=7, datetype="edat"))
r = Entrez.read(Entrez.efetch(db="pubmed", retmode="xml", rettype='medline',
                              webenv=s["WebEnv"], query_key=s["QueryKey"]))

> You will also want to add pubmed_100301.dtd to your repository. ...

Yes, we needed the new pubmed_100301.dtd and also bookdoc_100301.dtd

It looks like the NCBI did some updates recently - hopefully they'll update
that webpage soon and we can see if we need to add any more (other than
finding out from error messages).

Thanks for the report, and please let us know if you find any more missing
DTD files.

Peter

From guyeakin at gmail.com Wed Jul 14 16:48:41 2010
From: guyeakin at gmail.com (Guy Eakin)
Date: Wed, 14 Jul 2010 16:48:41 -0400
Subject: [Biopython] Pubmeddata XML parsing with Entrez .fetch and .read
Message-ID:

I am using Bio.Entrez.read to parse XML returned from pubmed.

This results in a dictionary for which one of the keys is ArticleIdList,
e.g.

u'PubmedData': {u'ArticleIdList': ['S0735-6757(09)00464-1',
'10.1016/j.ajem.2009.09.013', '20579576'], blah: blah, etc.}

In the original XML each <ArticleId> in <ArticleIdList> contains an IdType
attribute that names the ID.
For example:

<ArticleId IdType="doi">10.1016/j.ajem.2009.09.013</ArticleId>

The IdType is useful, but I can't find it in the Bio.Entrez.read output, so
I have no *easy* way of determining whether the ID# is
pii, pmc, pmid, etc.

Is there a better way to get the IdType attribute, or other XML tag
attributes from the Entrez.read output?

Thanks.
Guy

Code below
---------------

from Bio import Medline
from Bio import Entrez
# This is a separate .py file that I use to hold query terms, email address, etc.
import routine_pubmed_query_terms as pubmedterms

s = Entrez.read(Entrez.esearch(db="pubmed",
                               term=pubmedterms.entrezquery(program),
                               retmax=pubmedterms.maxlimit,
                               usehistory="y",
                               reldate=pubmedterms.datelimit,
                               datetype="edat"))
print "found %s records, returning %s" % (int(s["Count"]), len(s["IdList"]))
r = Entrez.read(Entrez.efetch(db="pubmed", retmode="xml", rettype='medline',
                              webenv=s["WebEnv"], query_key=s["QueryKey"]))

From biopython at maubp.freeserve.co.uk Wed Jul 14 17:18:34 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 14 Jul 2010 22:18:34 +0100
Subject: [Biopython] Pubmeddata XML parsing with Entrez .fetch and .read
In-Reply-To:
References:
Message-ID:

On Wed, Jul 14, 2010 at 9:48 PM, Guy Eakin wrote:
> I am using Bio.Entrez.read to parse XML returned from pubmed.
>
> This results in a dictionary for which one of the keys is ArticleIDList,
> e.g,
> PubmedData': {u'ArticleIdList': ['S0735-6757(09)00464-1',
> '10.1016/j.ajem.2009.09.013', '20579576'], blah: blah, etc.}
>
> In the original XML each <ArticleId> in <ArticleIdList> contains an IdType
> attribute that names the ID. for example
> <ArticleId IdType="doi">10.1016/j.ajem.2009.09.013</ArticleId>
>
> the IDtype is useful, but I can't find it in the Bio.Entrez.read output, so
> I have no *easy* way of determining whether the ID# is
> pii, pmc, pmid, etc.
>
> Is there a better way to get the IDtype attribute, or other XML tag
> attributes from the Entrez.read output?

Hi,

This information is in the tutorial, but could perhaps be clearer.
It might look like you get strings back, but in fact it is a subclass with
an attributes property (a dictionary). e.g.

from Bio import Entrez
handle = Entrez.efetch(db="pubmed", retmode="xml", rettype='medline', id='19304878')
r = Entrez.read(handle)
handle.close()
print r[0]['PubmedData']['ArticleIdList'][1]
print r[0]['PubmedData']['ArticleIdList'][1].attributes

Michiel - maybe we need to override the __repr__ method so it shows this
information?

Peter

From biopython at maubp.freeserve.co.uk Wed Jul 14 17:33:06 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 14 Jul 2010 22:33:06 +0100
Subject: [Biopython] Pubmeddata XML parsing with Entrez .fetch and .read
In-Reply-To:
References:

Message-ID:

On Wed, Jul 14, 2010 at 10:21 PM, Guy Eakin wrote:
> thanks. I understood that it had a .tag feature, but missed the
> .attributes!
> Awesome. And thank you for the quick reply.
>
> Guy

No problem. Now you know the answer, can you suggest
any clarifications to the documentation?

Peter

P.S. Try and CC the mailing list in replies.

From guyeakin at gmail.com Wed Jul 14 22:52:44 2010
From: guyeakin at gmail.com (Guy Eakin)
Date: Wed, 14 Jul 2010 22:52:44 -0400
Subject: [Biopython] Pubmeddata XML parsing with Entrez .fetch and .read
In-Reply-To:
References:

Message-ID:

Sure, I am new, so there are probably errors, but how about something like a
demonstration appended to the end of the tutorial section at
http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#htoc105

At core, a simple demonstration that type(record) returns a class object
rather than a list, and that foo.attributes and foo.tag exist, would be
helpful. I am not using any of the sequence utilities, so I admit that my
reading of those sections was brief. Reiteration in the Entrez parsing
sections is probably helpful for people like me.

A more verbose demonstration follows.

Again, thanks for the help.
Guy

8.11.1 Parsing Medline records [intervening text omitted]

At this point let's address what these elements contain. Consider the
information found in the following statement.

>>> records[0]['PubmedData']
{u'ArticleIdList': ['btp163', '10.1093/bioinformatics/btp163', '19304878',
'PMC2682512'], u'PublicationStatus': 'ppublish', u'History': [{u'Month':
'3', u'Day': '20', u'Year': '2009'}, {u'Minute': '0', u'Month': '3', u'Day':
'24', u'Hour': '9', u'Year': '2009'}, {u'Minute': '0', u'Month': '3',
u'Day': '24', u'Hour': '9', u'Year': '2009'}, {u'Minute': '0', u'Month':
'7', u'Day': '10', u'Hour': '9', u'Year': '2009'}]}

It is important to recall that each item is a Biopython class, rather than
simply a dictionary or list item. This can be verified by

>>> type(records[0]['PubmedData']['ArticleIdList'])

which returns <class 'Bio.Entrez.Parser.ListElement'> rather than
<type 'list'>.

This is important, as the class item contains additional auxiliary
information as noted earlier. One such piece of important auxiliary
information is the XML tag attributes from the parsed XML.
In this case, the original XML contained the following tags:

<ArticleIdList>
  <ArticleId IdType="pii">btp163</ArticleId>
  <ArticleId IdType="doi">10.1093/bioinformatics/btp163</ArticleId>
  <ArticleId IdType="pubmed">19304878</ArticleId>
  <ArticleId IdType="pmc">PMC2682512</ArticleId>
</ArticleIdList>

which have now been parsed into the u'ArticleIdList' dictionary key:

>>> L = records[0]['PubmedData']['ArticleIdList']
>>> L
['btp163', '10.1093/bioinformatics/btp163', '19304878', 'PMC2682512']

Viewed as a simple list, these elements appear to lack the IdType
information. However, the IdType attribute from the tag is stored in the
parsed data, and can be retrieved by reading the "attributes" property of
each Biopython class object.

>>> for item in L:
...     print "%s - %s" % (item, item.attributes)
...
btp163 - {u'IdType': u'pii'}
10.1093/bioinformatics/btp163 - {u'IdType': u'doi'}
19304878 - {u'IdType': u'pubmed'}
PMC2682512 - {u'IdType': u'pmc'}

From biopython at maubp.freeserve.co.uk Thu Jul 15 05:44:06 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Jul 2010 10:44:06 +0100
Subject: [Biopython] Pubmeddata XML parsing with Entrez .fetch and .read
In-Reply-To:
References:

Message-ID:

On Thu, Jul 15, 2010 at 3:52 AM, Guy Eakin wrote:
> Sure, I am new, so there are probably errors, but how about something like a
> demonstration appended to the end of the tutorial section at
> http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#htoc105
>
> At core, the simple demonstration that type(record) calls a class object
> rather than a list, and that foo.attributes, and foo.tag exist would be
> helpful. I am not using any of the sequence utilities, so admit that my
> reading of those sections was brief. Reiteration in the entrez parsing
> sections is probably helpful for people like me.
>
> A more verbose demonstration follows.
>
> Again, thanks for the help.
> Guy

Thank you for the detailed suggested text.

> 8.11.1 Parsing Medline records [intervening text omitted]
>
> At this point let's address what these elements contain. Consider
> information found in the following statement.
>
>>>> records[0]['PubmedData']
>
> {u'ArticleIdList': ['btp163', '10.1093/bioinformatics/btp163', '19304878',
> 'PMC2682512'], u'PublicationStatus': 'ppublish', u'History': [{u'Month':
> '3', u'Day': '20', u'Year': '2009'}, {u'Minute': '0', u'Month': '3', u'Day':
> '24', u'Hour': '9', u'Year': '2009'}, {u'Minute': '0', u'Month': '3',
> u'Day': '24', u'Hour': '9', u'Year': '2009'}, {u'Minute': '0', u'Month':
> '7', u'Day': '10', u'Hour': '9', u'Year': '2009'}]}
>
> It is important to recall that each item is a biopython class, rather than a
> simply a dictionary or list item. This can be verified by
>
>>>> type(records[0]['PubmedData']['ArticleIdList'])
>
> Which returns <class 'Bio.Entrez.Parser.ListElement'> rather than
> <type 'list'>

This is why I was suggesting to Michiel that we override the __repr__
for our subclassed objects, so that rather than seeing things like this:

['btp163', '10.1093/bioinformatics/btp163', '19304878', 'PMC2682512']

we get something like:

ListElement(['btp163', '10.1093/bioinformatics/btp163', '19304878',
'PMC2682512'], attributes={...})

On deeper reflection, the trouble with this is that all the children within
the list would get longer, so the full representation of a ListElement (or
any container) would become very very long - swamping the console
output. Even if we literally show the attributes with a dot dot dot :(

Maybe we'll have to settle for just documentation improvements.
Michiel - this is your code - what do you think?

Peter

From guyeakin at gmail.com Thu Jul 15 09:32:43 2010
From: guyeakin at gmail.com (Guy Eakin)
Date: Thu, 15 Jul 2010 09:32:43 -0400
Subject: [Biopython] Pubmeddata XML parsing with Entrez .fetch and .read
In-Reply-To:
References:

Message-ID:

From the naive first-time user perspective, the current implementation is
fine for the computer, but could benefit from a viewer output that creates a
more human readable representation.

I found myself cross referencing the console output to the original XML in
most cases. That says to me that there might be benefit to a function that
recapitulates the original XML's nested structure, listing attribute
values. I would think something along the following would be quite useful,
and if limited to a particular range of records would not necessarily be
unwieldy.

(apologies for the admittedly unwieldy markup)

>>> Bio.Entrez.viewer(recordlist, range=(0:(len(recordlist)),
ShowMedlineCitation = True, ShowPubmedData = True)

- (Attributes = Parent1.attribute) - \n #(80 characters/line)
..............value
..............indented text allows word wrap of entries > 80 char.
........1 - Attribute - \n
.......................value

That's an off the cuff representation before I dash to a meeting, but I
think you can see what I am suggesting.

Guy

From mjldehoon at yahoo.com Thu Jul 15 09:36:19 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Thu, 15 Jul 2010 06:36:19 -0700 (PDT)
Subject: [Biopython] Pubmeddata XML parsing with Entrez .fetch and .read
In-Reply-To:
Message-ID: <436965.79909.qm@web62402.mail.re1.yahoo.com>

--- On Thu, 7/15/10, Peter wrote:
> This is why I was suggesting to Michiel that we override
> the __repr__ for our subclassed objects, so that rather
> than seeing things like this:
>
> ['btp163', '10.1093/bioinformatics/btp163', '19304878',
> 'PMC2682512']
>
> we get something like:
>
> ListElement(['btp163', '10.1093/bioinformatics/btp163',
> '19304878', 'PMC2682512'], attributes={...})
>
> On deeper reflection, the trouble with this is that all the
> children within the list would get longer, so the full
> representation of a ListElement (or
> any container) would become very very long - swamping the
> console output.

The attributes are almost always only a small fraction of the Entrez XML
file. So while it's true that each element gets larger, it's a small
relative increase. The elements that are very long after adding the
attributes are also very long without the attributes. So I am in favor of
your original suggestion.

If there are no other suggestions, I'll make the change in Bio.Entrez over
the weekend (or feel free to do so before that).
Best, --Michiel From biopython at maubp.freeserve.co.uk Thu Jul 15 09:50:41 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 15 Jul 2010 14:50:41 +0100 Subject: [Biopython] Pubmeddata XML parsing with Entrez .fetch and .read In-Reply-To: <436965.79909.qm@web62402.mail.re1.yahoo.com> References: <436965.79909.qm@web62402.mail.re1.yahoo.com> Message-ID: On Thu, Jul 15, 2010 at 2:36 PM, Michiel de Hoon wrote: > > --- On Thu, 7/15/10, Peter wrote: >> This is why I was suggesting to Michiel that we override >> than seeing the __repr__ for our subclassed objects, so >> that rather things like this: >> >> ['btp163', '10.1093/bioinformatics/btp163', '19304878', >> 'PMC2682512'] >> >> we get something like: >> >> ListElement(['btp163', '10.1093/bioinformatics/btp163', >> '19304878', 'PMC2682512'], attributes={...}) >> >> On deeper reflection, the trouble with this is that all the >> children within the list would get longer, so the full >> representation of a ListElement (or any container) would >> become very very long - swamping the console output. > > The attributes are almost always only a small fraction of the Entrez XML file. > So while it's true that each element gets larger, it's a small relative increase. > The elements that are very long after adding the attributes are also very long > without the attributes. So I am in favor of your original suggestion. If there are > no other suggestions, I'll make the change in Bio.Entrez over the weekend > (or feel free to do so before that). Maybe you can keep the basic data type repr if there are no attributes, and only expand it if needed? It would be inconsistent but would keep the total string length down. 
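A minimal sketch of that idea, for illustration only. The ListElement class below is a hypothetical stand-in written from scratch (a list subclass carrying an attributes dict); it is not the actual Bio.Entrez code, and the "PubStatus" attribute is just example data.

```python
class ListElement(list):
    """Hypothetical stand-in for a parsed XML list element (not Bio.Entrez's class)."""

    def __init__(self, items=(), attributes=None):
        list.__init__(self, items)
        # XML attributes for this element; empty when the tag had none.
        self.attributes = dict(attributes or {})

    def __repr__(self):
        plain = list.__repr__(self)
        if not self.attributes:
            # No attributes: fall back to the ordinary list repr.
            return plain
        # Attributes present: show the expanded form.
        return "ListElement(%s, attributes=%r)" % (plain, self.attributes)


ids = ListElement(["btp163", "19304878"])
history = ListElement(["2009"], attributes={"PubStatus": "received"})
print(repr(ids))      # ['btp163', '19304878']
print(repr(history))  # ListElement(['2009'], attributes={'PubStatus': 'received'})
```

The short form looks exactly like a plain list, which is the inconsistency mentioned above: the repr alone no longer tells you the object is a ListElement.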
Peter

From bjorn_johansson at bio.uminho.pt Sat Jul 17 07:32:12 2010
From: bjorn_johansson at bio.uminho.pt (=?ISO-8859-1?Q?Bj=F6rn_Johansson?=)
Date: Sat, 17 Jul 2010 12:32:12 +0100
Subject: [Biopython] format fasta files to genbank: problem with too long Locus identifier
Message-ID:

Hi all, this is an example of parsing a fasta file and then trying to
convert it to genbank.
It seems that the fasta header file is not split between the "|", and all
that is in the fasta header ends up as "LOCUS" in the genbank file. Is this
the expected behavior? Can this be set somehow?

Thanks for any help on this!
/bjorn

>>> from Bio import SeqIO
>>> a=SeqIO.read("newfile.fasta", "fasta")
>>> a
SeqRecord(seq=Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...GGG', SingleLetterAlphabet()), id='gi|2765658|emb|Z78533.1|CIZ78533', name='gi|2765658|emb|Z78533.1|CIZ78533', description='gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA', dbxrefs=[])
>>> a.format('fasta')
'>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA\nCGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAA\nCGATCGAGTGAATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGT\nGACCCTGATTTGTTGTTGGG\n'
>>> a.format('genbank')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.6/dist-packages/Bio/SeqRecord.py", line 638, in format
    return self.__format__(format)
  File "/usr/local/lib/python2.6/dist-packages/Bio/SeqRecord.py", line 652, in __format__
    SeqIO.write([self], handle, format_spec)
  File "/usr/local/lib/python2.6/dist-packages/Bio/SeqIO/__init__.py", line 398, in write
    count = writer_class(handle).write_file(sequences)
  File "/usr/local/lib/python2.6/dist-packages/Bio/SeqIO/Interfaces.py", line 271, in write_file
    count = self.write_records(records)
  File "/usr/local/lib/python2.6/dist-packages/Bio/SeqIO/Interfaces.py", line 256, in write_records
    self.write_record(record)
  File "/usr/local/lib/python2.6/dist-packages/Bio/SeqIO/InsdcIO.py", line 628, in write_record
    self._write_the_first_line(record)
  File "/usr/local/lib/python2.6/dist-packages/Bio/SeqIO/InsdcIO.py", line 453, in _write_the_first_line
    raise ValueError("Locus identifier %s is too long" % repr(locus))
ValueError: Locus identifier 'gi|2765658|emb|Z78533.1|CIZ78533' is too long
>>>

From biopython at maubp.freeserve.co.uk Sat Jul 17 07:50:50 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 17 Jul 2010 12:50:50 +0100
Subject: [Biopython] format fasta files to genbank: problem with too long Locus identifier
In-Reply-To:
References:
Message-ID:

2010/7/17 Björn Johansson :
> Hi all, this is an example of parsing a fasta file and then trying to
> convert it to genbank.
> It seems that the fasta header file is not split between the "|", and all
> that is in the fasta header ends up as "LOCUS" in the genbank file. Is this
> the expected behavior? Can this be set somehow?
>
> Thanks for any help on this!
> /bjorn

Hi Bjorn,

Yes this is expected behaviour. There are no standards for FASTA
identifiers, the NCBI conventions are just one of dozens of styles.
Therefore we don't try and parse the identifiers in FASTA files (we
can't do it reliably).

Then for GenBank files, the identifier field in the LOCUS line is very
limited - you'll have to shorten your ID manually. Try something like this:

from Bio import SeqIO
a=SeqIO.read("newfile.fasta", "fasta")
a.id = a.id.split("|")[3]
print a.format('genbank')

(untested)

Peter

From biopython at maubp.freeserve.co.uk Sat Jul 17 15:59:46 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 17 Jul 2010 20:59:46 +0100
Subject: [Biopython] format fasta files to genbank: problem with too long Locus identifier
In-Reply-To:
References:

Message-ID: 2010/7/17 Bj?rn Johansson : > Thanks for the quick reply! > > it seems that it is the a.name field that needs to be shortened > (and ends up in the LOCUS field). the a.id seems to be the > entire fasta header and ends up as DEFINITION. > > /bjorn Sounds right - I should have tested it after all ;) Peter From mjldehoon at yahoo.com Sun Jul 18 04:43:45 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sun, 18 Jul 2010 01:43:45 -0700 (PDT) Subject: [Biopython] Pubmeddata XML parsing with Entrez .fetch and .read In-Reply-To: Message-ID: <190917.68159.qm@web62405.mail.re1.yahoo.com> > Maybe you can keep the basic data type repr if there are no > attributes, and only expand it if needed? It would be inconsistent > but would keep the total string length down. > Done. The code can be further simplified if we drop the .tag attribute on each XML element. If we drop .tag, then all elements that do not have attributes (which are most of them) can be presented as a simple list, dictionary, string, and so on instead of a ListElement, DictionaryElement, StringElement. Then the *Element classes are used only in those cases where there actually are attributes. Or is that too inconsistent? --Michiel. From biopython at maubp.freeserve.co.uk Sun Jul 18 06:59:02 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 18 Jul 2010 11:59:02 +0100 Subject: [Biopython] Pubmeddata XML parsing with Entrez .fetch and .read In-Reply-To: <190917.68159.qm@web62405.mail.re1.yahoo.com> References: <190917.68159.qm@web62405.mail.re1.yahoo.com> Message-ID: On Sun, Jul 18, 2010 at 9:43 AM, Michiel de Hoon wrote: >> Maybe you can keep the basic data type repr if there are no >> attributes, and only expand it if needed? It would be inconsistent >> but would keep the total string length down. >> > Done. > > The code can be further simplified if we drop the .tag attribute on > each XML element. 
If we drop .tag, then all elements that do not > have attributes (which are most of them) can be presented as a > simple list, dictionary, string, and so on instead of a ListElement, > DictionaryElement, StringElement. Then the *Element classes are > used only in those cases where there actually are attributes. Or is > that too inconsistent? I've not played with the code yet to see what you mean about the tag attribute. There is an inconsistency with DictionaryElement vs DictElement (deliberate for space?) and StructureElement vs DictElement (typo?). Peter From mjldehoon at yahoo.com Sun Jul 18 09:59:41 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sun, 18 Jul 2010 06:59:41 -0700 (PDT) Subject: [Biopython] Pubmeddata XML parsing with Entrez .fetch and .read In-Reply-To: Message-ID: <650816.80191.qm@web62406.mail.re1.yahoo.com> --- On Sun, 7/18/10, Peter wrote: > I've not played with the code yet to see what you mean > about the tag attribute. Here's an example: pubmed protein nucleotide nuccore nucgss .... >>> record['DbList'][0] 'pubmed' >>> record['DbList'][0].tag u'DbName' >>> record['DbList'][0].attributes Traceback (most recent call last): File "", line 1, in AttributeError: 'StringElement' object has no attribute 'attributes' >>> So currently record['DbList'] is a list of StringElements (actually, a ListElement of StringElements) where the XML tag name is stored in a .tag attribute. If we don't store the .tag attribute, then record['DbList'] would become a list of plain strings. --Michiel. From bala.biophysics at gmail.com Mon Jul 19 05:39:11 2010 From: bala.biophysics at gmail.com (Bala subramanian) Date: Mon, 19 Jul 2010 11:39:11 +0200 Subject: [Biopython] clustering pdb Message-ID: Friends, I have around 3000 pdb files obtained from MD simulations. I would like to know if it is possible to cluster the pdb files using any of the biopython's bio.PDB or other modules. If so, i would appreciate if you could provide me a sample code. 
Is there any utility in Biopython to handle MD trajectories obtained from common packages like gromacs, amber etc ? Thank you, Bala From biopython at maubp.freeserve.co.uk Mon Jul 19 05:49:26 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 19 Jul 2010 10:49:26 +0100 Subject: [Biopython] clustering pdb In-Reply-To: References: Message-ID: On Mon, Jul 19, 2010 at 10:39 AM, Bala subramanian wrote: > Friends, > I have around 3000 pdb files obtained from MD simulations. I would like to > know if it is possible to cluster the pdb files using any of the biopython's > bio.PDB or other modules. If so, i would appreciate if you could provide me > a sample code. What kind of clustering do you want to do? By folds/structure? By sequence? > Is there any utility in Biopython to handle MD trajectories obtained from > common packages like gromacs, amber etc ? No, we don't do any molecular dynamics - try The Molecular Modelling Toolkit (MMTK) by Konran Hinsen for a Python MD toolkit: http://dirac.cnrs-orleans.fr/MMTK/ Peter From schaefer at rostlab.org Mon Jul 19 05:56:45 2010 From: schaefer at rostlab.org (Christian Schaefer) Date: Mon, 19 Jul 2010 11:56:45 +0200 Subject: [Biopython] clustering pdb In-Reply-To: References: Message-ID: <4C44215D.5050401@rostlab.org> On 07/19/2010 11:39 AM, Bala subramanian wrote: > Friends, > I have around 3000 pdb files obtained from MD simulations. I would like to > know if it is possible to cluster the pdb files using any of the biopython's > bio.PDB or other modules. If so, i would appreciate if you could provide me > a sample code. It depends on what you mean by 'clustering'. If you refer to sequence clustering, i.e. redundancy reduction on sequence level, there are several tools out there (like CDHit) for that purpose. Not sure if there's a native implementation in BioPython for that. Chris -- Dipl.-Bioinf. 
Christian Schaefer Technical University Munich Department for Bioinformatics Faculty of Computer Science/I12 Boltzmannstr. 3 D-85748 Garching b. Muenchen Germany http://www.rostlab.org/~schaefer From biopython at maubp.freeserve.co.uk Mon Jul 19 06:07:49 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 19 Jul 2010 11:07:49 +0100 Subject: [Biopython] clustering pdb In-Reply-To: References: Message-ID: On Mon, Jul 19, 2010 at 10:59 AM, Bala subramanian wrote: > Hello peter, > Sorry for the incomplete information. I want to cluster the structures say > for example using rmsd between the pdb. > > Thanks, > Bala Provided you can specify the atomic mapping between the structures, you can calculate RMSD using Bio.PDB, see for example: http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ One way to do this would be pairwise sequence alignments. There are probably specialised tools out there for this sort of clustering... Peter P.S. Please try to CC the mailing list in replies. From anaryin at gmail.com Mon Jul 19 07:08:12 2010 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Mon, 19 Jul 2010 13:08:12 +0200 Subject: [Biopython] clustering pdb In-Reply-To: References: Message-ID: Hey, You can try this: http://sbcb.bioch.ox.ac.uk/oliver/software/GromacsWrapper/html/edpdb.html Best! Jo?o [...] Rodrigues @ http://doeidoei.wordpress.org On Mon, Jul 19, 2010 at 12:07 PM, Peter wrote: > On Mon, Jul 19, 2010 at 10:59 AM, Bala subramanian > wrote: > > Hello peter, > > Sorry for the incomplete information. I want to cluster the structures > say > > for example using rmsd between the pdb. > > > > Thanks, > > Bala > > Provided you can specify the atomic mapping between the structures, > you can calculate RMSD using Bio.PDB, see for example: > http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > One way to do this would be pairwise sequence alignments. 
There > are probably specialised tools out there for this sort of clustering... > > Peter > > P.S. Please try to CC the mailing list in replies. > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From schaefer at rostlab.org Mon Jul 19 09:22:09 2010 From: schaefer at rostlab.org (Christian Schaefer) Date: Mon, 19 Jul 2010 15:22:09 +0200 Subject: [Biopython] clustering pdb In-Reply-To: References: Message-ID: <4C445181.6070801@rostlab.org> On 07/19/2010 12:07 PM, Peter wrote: > Provided you can specify the atomic mapping between the structures, > you can calculate RMSD using Bio.PDB, see for example: > http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > One way to do this would be pairwise sequence alignments. There > are probably specialised tools out there for this sort of clustering... Thanks Peter for that. I didn't know that Biopython provides superimpositioning of two pdb structures. I'd need that badly for my current project, and I used ProFit [1] so far. A native python implementation is of course very convenient in my case. Chris [1] http://www.bioinf.org.uk/ -- Dipl.-Bioinf. Christian Schaefer Technical University Munich Department for Bioinformatics Faculty of Computer Science/I12 Boltzmannstr. 3 D-85748 Garching b. Muenchen Germany http://www.rostlab.org/~schaefer From calhoun.bradley at gmail.com Mon Jul 19 12:01:22 2010 From: calhoun.bradley at gmail.com (Brad Calhoun) Date: Mon, 19 Jul 2010 11:01:22 -0500 Subject: [Biopython] missing DTD file Message-ID: To whom it may concern, The function of Biopython (currently using version 1.52) is severely limited by a missing DTD file: "eLink_090910.dtd" which I am unable to locate at the suggested locations. Any instruction to resolve this issue would be greatly appreciated. 
Regards, BTC From biopython at maubp.freeserve.co.uk Mon Jul 19 12:21:46 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 19 Jul 2010 17:21:46 +0100 Subject: [Biopython] missing DTD file In-Reply-To: References: Message-ID: On Mon, Jul 19, 2010 at 5:01 PM, Brad Calhoun wrote: > To whom it may concern, > > The function of Biopython (currently using version 1.52) is severely limited > by a missing DTD file: "eLink_090910.dtd" ?which I am unable to locate at > the suggested locations. > > Any instruction to resolve this issue would be greatly appreciated. Hi Brad, This was included in Biopython 1.54 (it wasn't noticed quite in time for Biopython 1.53). You can update your Biopython or get this and other DTD files from the NCBI. In this case the file is here: http://eutils.ncbi.nlm.nih.gov/corehtml/query/DTD/eLink_090910.dtd Peter P.S. Googling for "biopython eLink_090910.dtd" worked pretty well ;) From fredgca at hotmail.com Mon Jul 19 12:37:30 2010 From: fredgca at hotmail.com (Frederico Arnoldi) Date: Mon, 19 Jul 2010 16:37:30 +0000 Subject: [Biopython] clustering pdb In-Reply-To: References: Message-ID: Dear Bala, As indicated by Rodrigues, you can use GromacsWrapper for peforming MD. I do not know any package for python that you could use to handle MD trajectories. However, it is possible to use RPy and Bio3D (http://mccammon.ucsd.edu/~bgrant/bio3d/index.html). Att., Frederico Arnoldi ? > Message: 1 > Date: Mon, 19 Jul 2010 11:39:11 +0200 > From: Bala subramanian > Subject: [Biopython] clustering pdb > To: biopython at lists.open-bio.org > Message-ID: > > Content-Type: text/plain; charset=ISO-8859-1 > > Friends, > I have around 3000 pdb files obtained from MD simulations. I would like to > know if it is possible to cluster the pdb files using any of the biopython's > bio.PDB or other modules. If so, i would appreciate if you could provide me > a sample code. 
> > Is there any utility in Biopython to handle MD trajectories obtained from > common packages like gromacs, amber etc ? > > Thank you, > Bala > From anaryin at gmail.com Mon Jul 19 12:47:56 2010 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Mon, 19 Jul 2010 18:47:56 +0200 Subject: [Biopython] clustering pdb In-Reply-To: References:

Message-ID:

edPDB actually allows some analysis of the trajectories (to some degree).
You might want to start looking there and then maybe bridge it with MMTK's
functionalities?

From pingou at pingoured.fr Wed Jul 21 04:47:53 2010
From: pingou at pingoured.fr (Pierre-Yves)
Date: Wed, 21 Jul 2010 10:47:53 +0200
Subject: [Biopython] AssertionError
Message-ID: <1279702073.26644.8.camel@lab.localdomain>

Hi,

I am running into a problem which I can't figure out.

I parse a fasta file using biopython:
for seq_record in SeqIO.parse(fastafile, "fasta"):
    if seq_record.id == name:
        print seq_record.id
        return seq_record

This seq_record becomes seq and I try to extract only a subpart of the
sequence:
s = Seq(seq.seq[start:stop], generic_dna)
seq_out = SeqRecord(s, id = row[col_name])

Print of start and stop shows me:
251944 253441
which is what I expect.

But the creation of the sequence object fails with the following error:
Traceback (most recent call last):
  File "FastaExtractor.py", line 91, in <module>
    s = Seq(seq.seq[start:stop], generic_dna)
  File "/usr/lib64/python2.6/site-packages/Bio/Seq.py", line 87, in __init__
    type(data) == type(u"")) # but can be a unicode string
AssertionError

When I try to create an object by hand using the python console, it works
perfectly well.

Any idea what could be wrong ?
Thanks,
Pierre

From pingou at pingoured.fr Wed Jul 21 05:10:35 2010
From: pingou at pingoured.fr (Pierre-Yves)
Date: Wed, 21 Jul 2010 11:10:35 +0200
Subject: [Biopython] AssertionError
In-Reply-To: <1279702073.26644.8.camel@lab.localdomain>
References: <1279702073.26644.8.camel@lab.localdomain>
Message-ID: <1279703435.26644.10.camel@lab.localdomain>

On Wed, 2010-07-21 at 10:47 +0200, Pierre-Yves wrote:
> s = Seq(seq.seq[start:stop], generic_dna)
> seq_out = SeqRecord(s, id = row[col_name])

I found the solution:
s = Seq(str(seq.seq[start:stop]), generic_dna)

Sorry for the noise,
Pierre

From biopython at maubp.freeserve.co.uk Wed Jul 21 05:17:21 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 21 Jul 2010 10:17:21 +0100
Subject: [Biopython] AssertionError
In-Reply-To: <1279702073.26644.8.camel@lab.localdomain>
References: <1279702073.26644.8.camel@lab.localdomain>
Message-ID:

On Wed, Jul 21, 2010 at 9:47 AM, Pierre-Yves wrote:
> Hi,
>
> I am running into a problem which I can't figure out.
>
> I parse a fasta file using biopython:
> for seq_record in SeqIO.parse(fastafile, "fasta"):
>     if seq_record.id == name:
>         print seq_record.id
>         return seq_record
>
> This seq_record becomes seq and I try to extract only a subpart of the
> sequence:
> s = Seq(seq.seq[start:stop], generic_dna)
> seq_out = SeqRecord(s, id = row[col_name])

If seq is a SeqRecord (a variable name I avoid), then seq.seq is
a Seq object, and slicing a Seq object gives another Seq object.
This means you shouldn't do this:

Seq(seq.seq[start:stop], generic_dna)

Just do this:

seq.seq[start:stop]

Your workaround seems overly complicated:

Seq(str(seq.seq[start:stop]), generic_dna)

If your reason for doing this is to specify the alphabet, just
tell Bio.SeqIO.parse() the alphabet instead.

You can also slice the original SeqRecord instead, to give a new
SeqRecord, and change its id to what you want.
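To make the point about slicing concrete, here is a toy sketch. MiniSeq is a hypothetical stand-in class invented for this illustration, not Biopython's Seq; it only mimics the two behaviours described above, namely that the constructor wants a plain string and that slicing already returns the same type.

```python
class MiniSeq(object):
    """Toy stand-in for a sequence class (hypothetical, not Bio.Seq.Seq)."""

    def __init__(self, data):
        # Reject anything that is not a plain string, with a clear message.
        if not isinstance(data, str):
            raise TypeError("expected a string, got %s" % type(data).__name__)
        self._data = data

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Slicing returns the same type, so no re-wrapping is needed.
            return MiniSeq(self._data[index])
        return self._data[index]

    def __str__(self):
        return self._data


seq = MiniSeq("ACGTACGTAC")
sub = seq[2:6]                    # already a MiniSeq
print(type(sub).__name__, str(sub))

try:
    MiniSeq(seq[2:6])             # wrapping a MiniSeq in a MiniSeq fails
except TypeError as err:
    print("TypeError:", err)
```

Going through str() first, as in the workaround, would also satisfy this toy constructor, but the slice on its own is already the object you want.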
>
> But the creation of the sequence object fails with the following error:
> Traceback (most recent call last):
>   File "FastaExtractor.py", line 91, in <module>
>     s = Seq(seq.seq[start:stop], generic_dna)
>   File "/usr/lib64/python2.6/site-packages/Bio/Seq.py", line 87, in
> __init__
>     type(data) == type(u"")) # but can be a unicode string
> AssertionError

That should be a clearer error message, the Seq object is not
expecting you to give it a Seq object - but a string or unicode.

Peter

From pingou at pingoured.fr Wed Jul 21 05:22:12 2010
From: pingou at pingoured.fr (Pierre-Yves)
Date: Wed, 21 Jul 2010 11:22:12 +0200
Subject: [Biopython] AssertionError
In-Reply-To:
References: <1279702073.26644.8.camel@lab.localdomain>
Message-ID: <1279704132.26644.13.camel@lab.localdomain>

On Wed, 2010-07-21 at 10:17 +0100, Peter wrote:
> If seq is a SeqRecord (a variable name I avoid),

Changed.

> then seq.seq is
> a Seq object, and slicing a Seq object gives another Seq object.
> This means you shouldn't do this:
>
> Seq(seq.seq[start:stop], generic_dna)
>
> Just do this:
>
> seq.seq[start:stop]

Indeed:
seq_out = SeqRecord( seq_record.seq[start:stop], id = row[col_name] )
just works.

> >
> > But the creation of the sequence object fails with the following error:
> > Traceback (most recent call last):
> >   File "FastaExtractor.py", line 91, in <module>
> >     s = Seq(seq.seq[start:stop], generic_dna)
> >   File "/usr/lib64/python2.6/site-packages/Bio/Seq.py", line 87, in
> > __init__
> >     type(data) == type(u"")) # but can be a unicode string
> > AssertionError
>
> That should be a clearer error message, the Seq object is not
> expecting you to give it a Seq object - but a string or unicode.

A clearer error message would indeed be nice.
Pierre

From biopython at maubp.freeserve.co.uk Wed Jul 21 05:46:18 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 21 Jul 2010 10:46:18 +0100
Subject: [Biopython] AssertionError
In-Reply-To: <1279704132.26644.13.camel@lab.localdomain>
References: <1279702073.26644.8.camel@lab.localdomain> <1279704132.26644.13.camel@lab.localdomain>
Message-ID:

On Wed, Jul 21, 2010 at 10:22 AM, Pierre-Yves wrote:
>>
>> That should be a clearer error message, the Seq object is not
>> expecting you to give it a Seq object - but a string or unicode.
>
> A clearer error message would indeed be nice.
>

It will give a TypeError in the next release. Thanks for the feedback.

Peter

From bala.biophysics at gmail.com Wed Jul 21 10:51:07 2010
From: bala.biophysics at gmail.com (Bala subramanian)
Date: Wed, 21 Jul 2010 16:51:07 +0200
Subject: [Biopython] running naccess
Message-ID:

Friends,
I am trying to run naccess as follows.

from Bio.PDB.PDBParser import PDBParser
par = PDBParser()
structure = par.get_structure('test','12_25.pdb')
import Bio.PDB.NACCESS as nac
data=nac.run_naccess(structure[0],'12_25.pdb')

It seems that there is an issue with tmp dir creation. I couldn't run
naccess even if I define *temp_path='.'* with the run_naccess function. Kindly
write me how to resolve the issue.
*I get the following error:*

IOError                                   Traceback (most recent call last)

/tmp/tmpqQLhRl/ in ()

/apps/py_modules/gcc/lib/python2.6/site-packages/Bio/PDB/NACCESS.pyc in run_naccess(model, pdb_file, probe_size, z_slice, naccess, temp_path)
     56     # get the output, then delete the temp directory
     57     rsa_file = tmp_pdb_file[:-4] + '.rsa'
---> 58     rf = open(rsa_file)
     59     rsa_data = rf.readlines()
     60     rf.close()

IOError: [Errno 2] No such file or directory: '/tmp/tmpqQLhRl/tmpaOkRlK.rsa'

From biopython at maubp.freeserve.co.uk Wed Jul 21 11:02:44 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 21 Jul 2010 16:02:44 +0100
Subject: [Biopython] running naccess
In-Reply-To:
References:
Message-ID:

On Wed, Jul 21, 2010 at 3:51 PM, Bala subramanian wrote:
> Friends,
> I am trying to run naccess as follows.
>
> from Bio.PDB.PDBParser import PDBParser
> par = PDBParser()
> structure = par.get_structure('test','12_25.pdb')
> import Bio.PDB.NACCESS as nac
> data=nac.run_naccess(structure[0],'12_25.pdb')
>
> It seems that there is an issue with tmp dir creation. I could nt run
> naccess even if i define *temp_path='.'* with run_naccess function. Kindly
> write me how to resolve the issue.
>
>
> *I get the following error:*
>
> IOError                                   Traceback (most recent call last)
>
> /tmp/tmpqQLhRl/ in ()
>
> /apps/py_modules/gcc/lib/python2.6/site-packages/Bio/PDB/NACCESS.pyc in
> run_naccess(model, pdb_file, probe_size, z_slice, naccess, temp_path)
>      56     # get the output, then delete the temp directory
>
>      57     rsa_file = tmp_pdb_file[:-4] + '.rsa'
> ---> 58     rf = open(rsa_file)
>      59     rsa_data = rf.readlines()
>      60     rf.close()
>
> IOError: [Errno 2] No such file or directory: '/tmp/tmpqQLhRl/tmpaOkRlK.rsa'

Try putting some debug print statements into NACCESS.py, in particular
see what stdout and stderr contain. My guess is there is a problem
running the tool.
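As a generic illustration of looking at stdout and stderr, here is a small sketch using the standard subprocess module. The helper name run_and_report is made up for this example, and the command shown simply calls the Python interpreter as a stand-in; how NACCESS.py itself launches the tool is not shown in this thread, so treat this as a pattern rather than a patch.

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd (a list of arguments); return (returncode, stdout, stderr) as text."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                            universal_newlines=True)
    out, err = proc.communicate()
    return proc.returncode, out, err


# Stand-in command for demonstration only: writes to stderr and exits with code 3.
demo = [sys.executable, "-c",
        "import sys; sys.stderr.write('cannot open input\\n'); sys.exit(3)"]
code, out, err = run_and_report(demo)
print("exit code:", code)
print("stdout:", repr(out))
print("stderr:", repr(err))
```

A non-zero exit code or any text on stderr would point at the external program rather than at the temporary directory handling.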
Also try running the NACCESS command line tool by hand at the command prompt, to make sure it is installed and working correctly. Peter From 2huggie at gmail.com Fri Jul 23 03:04:13 2010 From: 2huggie at gmail.com (Timothy Wu) Date: Fri, 23 Jul 2010 15:04:13 +0800 Subject: [Biopython] Gene ontology parsing Message-ID: Hi Is there a parser in BioPython for OBO v1.2? If there isn't (at least I couldn't find it a while back), maybe I could contribute a little. I've casually wrote an OntologyIO package with "OboRecord", a scanner and consumer files. I may have not handles all the stuff written on http://www.geneontology.org/GO.format.obo-1_2.shtml, but it did suit my own purpose before. I don't know if anyone in the project will be interested in it. I also do not know if my code will pass the quality standard set forth by BioPython. I've also wrote a GoGraph structure that allow me to look for stuff in the DAG, treating it much like a tree, looking for "common ancestor" given two nodes, parent, child, etc. I've never contributed code to any project before, so if these code can indeed be contributed, perhaps some help from someone in the project would be needed, thanks. Timothy From bartek at rezolwenta.eu.org Fri Jul 23 03:50:45 2010 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Fri, 23 Jul 2010 09:50:45 +0200 Subject: [Biopython] Gene ontology parsing In-Reply-To: References: Message-ID: hi Timothy, I, for one, would be interested in Gene Ontology parsing. I could also help with reviewing the code if you worry about whether it fits to the rest of biopython. I think the easiest for all would be if you could make a branch of biopython on github including your proposed enhancements. This would allow everybody to see what's in there and also make changes and finally mergi it into the trunk. thanks for your offer Bartek On Fri, Jul 23, 2010 at 9:04 AM, Timothy Wu <2huggie at gmail.com> wrote: > Hi > > Is there a parser in BioPython for OBO v1.2? 
> > If there isn't (at least I couldn't find it a while back), maybe I could > contribute a little. I've casually wrote an OntologyIO package with > "OboRecord", a scanner and consumer files. I may have not handles all the > stuff written on http://www.geneontology.org/GO.format.obo-1_2.shtml, but > it > did suit my own purpose before. I don't know if anyone in the project will > be interested in it. I also do not know if my code will pass the quality > standard set forth by BioPython. I've also wrote a GoGraph structure that > allow me to look for stuff in the DAG, treating it much like a tree, > looking > for "common ancestor" given two nodes, parent, child, etc. I've never > contributed code to any project before, so if these code can indeed be > contributed, perhaps some help from someone in the project would be needed, > thanks. > > Timothy > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > -- Bartek Wilczynski ================== Postdoctoral fellow EMBL, Furlong group Meyerhoffstrasse 1, 69012 Heidelberg, Germany tel: +49 6221 387 8433 From biopython at maubp.freeserve.co.uk Fri Jul 23 05:45:11 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 23 Jul 2010 10:45:11 +0100 Subject: [Biopython] Gene ontology parsing In-Reply-To: References: Message-ID: On Fri, Jul 23, 2010 at 8:50 AM, Bartek Wilczynski wrote: > hi Timothy, > > I, for one, would be interested in Gene Ontology parsing. I could also help > with reviewing the code if you worry about whether it fits to the rest of > biopython. > > I think the easiest for all would be if you could make a branch of biopython > on github including your proposed enhancements. This would allow everybody > to see what's in there and also make changes and finally mergi it into the > trunk. 
> > thanks for your offer > > Bartek Hi Timothy & Bartek, There are already several people working on GO stuff in branches on github, e.g. Chris Lasher, Kyle Ellrott, Tam?s Nepusz. I don't know if any of them are doing OBO v1.2, but it would be sensible to check and try and combine efforts. Peter From amaher at fas.harvard.edu Fri Jul 23 11:22:32 2010 From: amaher at fas.harvard.edu (Andrew Maher) Date: Fri, 23 Jul 2010 11:22:32 -0400 Subject: [Biopython] Using the Bio.SeqUtils.ProtParam module Message-ID: I'm a beginner with python, and I'm having trouble with something relatively simple: I'm trying to find the isoelectric point of a protein with the sequence: VPIQKVQDDTKTLIKTIVTRINDISHTQSVSSKQKVTGLDFIPGLHPILTLSKMDQTLAVYQQILTSMPSRNVIQISNDLENLRDLLHVLAFSKSCHLPEASGLETLDSLGGVLEASGYSTEVVALSRLQGSLQDMLWQLDLSPGC So, when I load python 2.6.2 with bipython v1.54 and then type in "from Bio.SeqUtils import ProtParam" on one line and then "X = ProteinAnalysis("VPIQKVQDDTKTLIKTIVTRINDISHTQSVSSKQKVTGLDFIPGLHPILTLSKMDQTLAVYQQILTSMPSRNVIQISNDLENLRDLLHVLAFSKSCHLPEASGLETLDSLGGVLEASGYSTEVVALSRLQGSLQDMLWQLDLSPGC") on the next line, I get the following error message: NameError: name 'ProteinAnalysis' is not defined But, looking at the source code for ProtParam ( http://biopython.org/DIST/docs/api/Bio.SeqUtils.ProtParam-pysrc.html#ProteinAnalysis), isn't the ProteinAnalysis class clearly defined? What am I doing wrong? From anaryin at gmail.com Fri Jul 23 11:25:47 2010 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Fri, 23 Jul 2010 17:25:47 +0200 Subject: [Biopython] Using the Bio.SeqUtils.ProtParam module In-Reply-To: References: Message-ID: Hello Andrew, Try this: ProtParam.ProteinAnalysis(Sequence) Best, Jo?o [...] 
Rodrigues @ http://doeidoei.wordpress.org On Fri, Jul 23, 2010 at 5:22 PM, Andrew Maher wrote: > I'm a beginner with python, and I'm having trouble with something > relatively > simple: > > I'm trying to find the isoelectric point of a protein with the > sequence: > VPIQKVQDDTKTLIKTIVTRINDISHTQSVSSKQKVTGLDFIPGLHPILTLSKMDQTLAVYQQILTSMPSRNVIQISNDLENLRDLLHVLAFSKSCHLPEASGLETLDSLGGVLEASGYSTEVVALSRLQGSLQDMLWQLDLSPGC > > So, when I load python 2.6.2 with bipython v1.54 and then type in "from > Bio.SeqUtils import ProtParam" on one line and then "X = > > ProteinAnalysis("VPIQKVQDDTKTLIKTIVTRINDISHTQSVSSKQKVTGLDFIPGLHPILTLSKMDQTLAVYQQILTSMPSRNVIQISNDLENLRDLLHVLAFSKSCHLPEASGLETLDSLGGVLEASGYSTEVVALSRLQGSLQDMLWQLDLSPGC") > on the next line, I get the following error message: > > NameError: name 'ProteinAnalysis' is not defined > > But, looking at the source code for ProtParam ( > > http://biopython.org/DIST/docs/api/Bio.SeqUtils.ProtParam-pysrc.html#ProteinAnalysis > ), > isn't the ProteinAnalysis class clearly defined? What am I doing wrong? > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From amaher at fas.harvard.edu Fri Jul 23 11:31:59 2010 From: amaher at fas.harvard.edu (Andrew Maher) Date: Fri, 23 Jul 2010 11:31:59 -0400 Subject: [Biopython] Using the Bio.SeqUtils.ProtParam module In-Reply-To: References:

Message-ID: Sorry, now I get this: -bash-3.1$ python Python 2.6.2 (r262:71600, Jul 27 2009, 17:05:24) [GCC 4.1.2 20070626 (Red Hat 4.1.2-14)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import Bio.SeqUtils.ProtParam >>> ProtParam.ProteinAnalysis('VPIQKVQDDTKTLIKTIVTRINDISHTQSVSSKQKVTGLDFIPGLHPILTLSKMDQTLAVYQQILTSMPSRNVIQISNDLENLRDLLHVLAFSKSCHLPEASGLETLDSLGGVLEASGYSTEVVALSRLQGSLQDMLWQLDLSPGC') Traceback (most recent call last): File "", line 1, in NameError: name 'ProtParam' is not defined >>> I wonder what's wrong... On Fri, Jul 23, 2010 at 11:25 AM, Jo?o Rodrigues wrote: > Hello Andrew, > > Try this: ProtParam.ProteinAnalysis(Sequence) > > Best, > > Jo?o [...] Rodrigues > @ http://doeidoei.wordpress.org > > > > On Fri, Jul 23, 2010 at 5:22 PM, Andrew Maher wrote: > >> I'm a beginner with python, and I'm having trouble with something >> relatively >> simple: >> >> I'm trying to find the isoelectric point of a protein with the >> sequence: >> VPIQKVQDDTKTLIKTIVTRINDISHTQSVSSKQKVTGLDFIPGLHPILTLSKMDQTLAVYQQILTSMPSRNVIQISNDLENLRDLLHVLAFSKSCHLPEASGLETLDSLGGVLEASGYSTEVVALSRLQGSLQDMLWQLDLSPGC >> >> So, when I load python 2.6.2 with bipython v1.54 and then type in "from >> Bio.SeqUtils import ProtParam" on one line and then "X = >> >> ProteinAnalysis("VPIQKVQDDTKTLIKTIVTRINDISHTQSVSSKQKVTGLDFIPGLHPILTLSKMDQTLAVYQQILTSMPSRNVIQISNDLENLRDLLHVLAFSKSCHLPEASGLETLDSLGGVLEASGYSTEVVALSRLQGSLQDMLWQLDLSPGC") >> on the next line, I get the following error message: >> >> NameError: name 'ProteinAnalysis' is not defined >> >> But, looking at the source code for ProtParam ( >> >> http://biopython.org/DIST/docs/api/Bio.SeqUtils.ProtParam-pysrc.html#ProteinAnalysis >> ), >> isn't the ProteinAnalysis class clearly defined? What am I doing wrong? 
>> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> > > From anaryin at gmail.com Fri Jul 23 11:34:49 2010 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Fri, 23 Jul 2010 17:34:49 +0200 Subject: [Biopython] Using the Bio.SeqUtils.ProtParam module In-Reply-To: References:

Message-ID:

No problem, you just mixed up the import statement and the name you called. Either import the module and qualify the class name:

from Bio.SeqUtils import ProtParam
sequence = "DTKTLIKTIVT"
ProtParam.ProteinAnalysis(sequence)

or import the class itself:

from Bio.SeqUtils.ProtParam import ProteinAnalysis
sequence = "DTKTLIKTIVT"
ProteinAnalysis(sequence)

It's a namespace issue :)

Best!

João

From biopython at maubp.freeserve.co.uk Fri Jul 23 11:39:40 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 23 Jul 2010 16:39:40 +0100 Subject: [Biopython] Using the Bio.SeqUtils.ProtParam module In-Reply-To: References:

Message-ID: On Fri, Jul 23, 2010 at 4:31 PM, Andrew Maher wrote: > Sorry, now I get this: Maybe reading up on Python's import statement and name spaces will help. You are close... One approach is to import the module: from Bio.SeqUtils import ProtParam x = ProtParam.ProteinAnalysis('VPIQKVQDDTKTLIKTIVTRINDISHTQSVSS') Or, you might prefer to import just the class - probably a good plan if you only need one or two things from this module: from Bio.SeqUtils.ProtParam import ProteinAnalysis x = ProteinAnalysis('VPIQKVQDDTKTLIKTIVTRINDISHTQSVSS') Peter P.S. You can even do this - but I don't think it is very clear: import Bio.SeqUtils.ProtParam x = Bio.SeqUtils.ProtParam.ProteinAnalysis('VPIQKVQDDTKTLIKTIVTRINDISHTQSVSS') From amaher at fas.harvard.edu Fri Jul 23 11:49:06 2010 From: amaher at fas.harvard.edu (Andrew Maher) Date: Fri, 23 Jul 2010 11:49:06 -0400 Subject: [Biopython] Using the Bio.SeqUtils.ProtParam module In-Reply-To: References:

Message-ID: Thanks a lot! Also, I've noticed that when using this module to get an isoelectric point, x.pi() does not work, but x.isoelectric_point() does. Perhaps someone should note that somewhere. -Andrew On Fri, Jul 23, 2010 at 11:39 AM, Peter wrote: > On Fri, Jul 23, 2010 at 4:31 PM, Andrew Maher > wrote: > > Sorry, now I get this: > > Maybe reading up on Python's import statement and name spaces will help. > You are close... > > One approach is to import the module: > > from Bio.SeqUtils import ProtParam > x = ProtParam.ProteinAnalysis('VPIQKVQDDTKTLIKTIVTRINDISHTQSVSS') > > Or, you might prefer to import just the class - probably a good plan if you > only need one or two things from this module: > > from Bio.SeqUtils.ProtParam import ProteinAnalysis > x = ProteinAnalysis('VPIQKVQDDTKTLIKTIVTRINDISHTQSVSS') > > Peter > > P.S. > > You can even do this - but I don't think it is very clear: > > import Bio.SeqUtils.ProtParam > x = > Bio.SeqUtils.ProtParam.ProteinAnalysis('VPIQKVQDDTKTLIKTIVTRINDISHTQSVSS') > > From biopython at maubp.freeserve.co.uk Fri Jul 23 11:52:53 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 23 Jul 2010 16:52:53 +0100 Subject: [Biopython] Using the Bio.SeqUtils.ProtParam module In-Reply-To: References:

Message-ID: On Fri, Jul 23, 2010 at 4:49 PM, Andrew Maher wrote: > Thanks a lot! > > Also, I've noticed that when using this module to get an isoelectric point, > x.pi() does not work, but x.isoelectric_point() does. Perhaps someone should > note that somewhere. That looks like a typo in the ProtParam module documentation - I'll fix that. Thanks! Peter From kellrott at gmail.com Fri Jul 23 12:17:19 2010 From: kellrott at gmail.com (Kyle) Date: Fri, 23 Jul 2010 09:17:19 -0700 Subject: [Biopython] Gene ontology parsing In-Reply-To: References: Message-ID: > There are already several people working on GO stuff in branches on github, > e.g. Chris Lasher, Kyle Ellrott, Tamás Nepusz. I don't know if any of them are > doing OBO v1.2, but it would be sensible to check and try and combine efforts. The branch at http://github.com/kellrott/biopython/tree/gosupport should parse most of the information held in OBO v1.2. Chris's original version was targeted only at the GO OBO file, as there was a type check to make sure the node IDs started with 'GO:'. That check is disabled in my branch, and I've used the package to parse a few of the other ontologies found at www.obofoundry.org. The module is currently called Bio.GO, but maybe it should be refactored to reflect the fact that it covers general OBO files, and not just the GO file specifically. The main things keeping it from merging into the main branch are proper documentation, complete unit tests, and making sure that it covers all of the standard usage practices. If you can try it out, and let me know which functions are missing (and maybe contribute some code), we can push this thing forward. Kyle From bouchard.lysiane at gmail.com Fri Jul 23 15:54:29 2010 From: bouchard.lysiane at gmail.com (Lysiane Bouchard) Date: Fri, 23 Jul 2010 15:54:29 -0400 Subject: [Biopython] NaN values, lowess Message-ID: Hi, I am experiencing problems with the lowess function, from the Bio.Statistics module, version 1.53.
It seems that when the data trajectory is fairly constant, NaN / infinite values are returned as estimated trend. I wonder if someone encountered a similar problem ? Thank you, Lysiane From pengyu.ut at gmail.com Wed Jul 28 00:08:37 2010 From: pengyu.ut at gmail.com (Peng Yu) Date: Tue, 27 Jul 2010 23:08:37 -0500 Subject: [Biopython] Where is alphebet of Bio.Seq documented Message-ID: Hi, Seq class has a member alphabet if I'm correct about the terminology. But I don't see it in help(Bio.Seq). Is it documented in python help()? from Bio.Seq import Seq my_seq = Seq("AGTACACTGGT") print my_seq print my_seq.alphabet -- Regards, Peng From biopython at maubp.freeserve.co.uk Wed Jul 28 06:51:26 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 28 Jul 2010 11:51:26 +0100 Subject: [Biopython] Where is alphebet of Bio.Seq documented In-Reply-To: References: Message-ID: On Wed, Jul 28, 2010 at 5:08 AM, Peng Yu wrote: > Hi, > > Seq class has a member alphabet if I'm correct about the terminology. > But I don't see it in help(Bio.Seq). Is it documented in python > help()? > > from Bio.Seq import Seq > my_seq = Seq("AGTACACTGGT") > print my_seq > print my_seq.alphabet Hi Peng, The main tutorial does explain this, but you are right that the Seq docstrings should describe the alphabet property/attribute of the Seq objects. It is just an Alphabet object as defined in the Bio.Alphabet module. Peter From Achim.Treumann at NEPAF.com Sat Jul 31 02:05:09 2010 From: Achim.Treumann at NEPAF.com (Achim Treumann) Date: Sat, 31 Jul 2010 07:05:09 +0100 Subject: [Biopython] problem with SeqIO.index() and get_raw Message-ID: <01798D2396253A449511F31F1CDE83550C3E6C@srv1.NEPAF.local> Dear all, I am new to biopython and trying to familiarise myself with its utilities. When I was trying to parse a swissprot.dat file to then copy the full data back using the get_raw attribute, I got stuck (Biopython 1.53, Python 2.6 on WinXP). I was using the following code:


from Bio import SeqIO
from Bio import SwissProt
# Raw string, so the backslashes in the Windows path are kept literally
InputFile = r"D:\data\uniprot_sprot.short.dat"

handle = open(InputFile)
# Index the file by accession for random access to the records
uniprot = SeqIO.index(InputFile, "swiss")
accList = []
for record in SwissProt.parse(handle):
    accList.append(record.accessions[0])
handle.close()
out_handle = open(r"D:\data\uniprot_sprot.temp.dat", "w")
for acc in accList:
    # get_raw needs Biopython 1.54 or later
    out_handle.write(uniprot.get_raw(acc))
out_handle.close()

The input file is a sprot-formatted file with 39 entries. When I try to run this, I get the following error: Traceback (most recent call last): File"", line 2, in handle.write(uniprot.getraw(acc)) Attribute error: 'SwissDict' object has no attribute 'get_raw' Where am I going wrong? I have tried to replicate the corresponding example from the tutorial and got stuck at the same point with the same error message. Any help would be greatly appreciated. As I am typing, I realise that get_raw might only have been implemented in Biopython 1.54... I will post this anyway, and if upgrading sorts it, I will send a reply. Many thanks for comments and advice, Achim From biopython at maubp.freeserve.co.uk Sat Jul 31 05:16:38 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 31 Jul 2010 10:16:38 +0100 Subject: [Biopython] problem with SeqIO.index() and get_raw In-Reply-To: <01798D2396253A449511F31F1CDE83550C3E6C@srv1.NEPAF.local> References: <01798D2396253A449511F31F1CDE83550C3E6C@srv1.NEPAF.local> Message-ID: On Sat, Jul 31, 2010 at 7:05 AM, Achim Treumann wrote: > Dear all, > > I am new to biopython and trying to familiarise myself with its utilities. > When I was trying to parse a swissprot.dat file to then copy the full > data back using the get_raw attribute, I got stuck (Biopython 1.53, > Python 2.6 on WinXP). > >... > > As I am typing, I realise that get_raw might only have been > implemented in Biopython 1.54... I will post this anyway, and if > upgrading sorts it, I will send a reply. Hi Achim, Well guessed. Yes, get_raw was added in Biopython 1.54, see http://news.open-bio.org/news/2010/05/biopython-release-154/ Are you reading the current tutorial on line (rather than the version shipped with Biopython 1.53)? It does mention this, although maybe it could be in the FAQ as well... would that help? 
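For scripts that may run against either release, the copy step could be guarded along these lines. This is only a minimal sketch: the helper name copy_raw_records and the FakeIndex stand-in are hypothetical and not part of Biopython; the only thing taken from this thread is that get_raw is missing before Biopython 1.54.

```python
import io


def copy_raw_records(index, keys, out_handle):
    """Write the raw text of each record named in keys to out_handle.

    index is assumed to be the dictionary-like object returned by
    SeqIO.index(); this helper itself is hypothetical.
    Returns the number of records written.
    """
    if not hasattr(index, "get_raw"):
        # Older releases (e.g. 1.53) fail here with an AttributeError instead.
        raise RuntimeError(
            "this index has no get_raw(); Biopython 1.54 or later is needed"
        )
    count = 0
    for key in keys:
        out_handle.write(index.get_raw(key))
        count += 1
    return count


class FakeIndex(dict):
    """Stand-in for a SeqIO.index() result, used only for this demo."""

    def get_raw(self, key):
        return self[key]


# Made-up records, just to exercise the helper without Biopython installed.
demo_index = FakeIndex({"P00001": "ID   A\n//\n", "P00002": "ID   B\n//\n"})
demo_out = io.StringIO()
copied = copy_raw_records(demo_index, ["P00002", "P00001"], demo_out)
```

With a real index this would be called as copy_raw_records(uniprot, accList, out_handle). Depending on the release, get_raw may return bytes rather than text, in which case the output handle would need to be opened in binary mode.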
Peter From Achim.Treumann at NEPAF.com Sat Jul 31 21:53:17 2010 From: Achim.Treumann at NEPAF.com (Achim Treumann) Date: Sun, 1 Aug 2010 02:53:17 +0100 Subject: [Biopython] problem with SeqIO.index() and get_raw References: <01798D2396253A449511F31F1CDE83550C3E6C@srv1.NEPAF.local> Message-ID: <01798D2396253A449511F31F1CDE83550C3E6D@srv1.NEPAF.local> Hi Peter, Jordan andd others, many thanks for the fast replies. You are obviously right - installing 1.54 got me rolling. I now noticed that the tutorial I was working with was indeed 1.54 (should have checked that when I downloaded it separately from the distribution). When I was going through the tutorial I did not realise that the version I was working with was not 1.54... Don't think that it would be essential to put this into the FAQ - neither google nor searching this discussion list brought up other people who were struggling with this, I must have been the only one too thick to check version numbers :-) Thanks a lot again for your rapid help, Achim -----Original Message----- From: p.j.a.cock at googlemail.com on behalf of Peter Sent: Sat 31/07/2010 18:16 To: Achim Treumann Cc: biopython at lists.open-bio.org Subject: Re: [Biopython] problem with SeqIO.index() and get_raw On Sat, Jul 31, 2010 at 7:05 AM, Achim Treumann wrote: > Dear all, > > I am new to biopython and trying to familiarise myself with its utilities. > When I was trying to parse a swissprot.dat file to then copy the full > data back using the get_raw attribute, I got stuck (Biopython 1.53, > Python 2.6 on WinXP). > >... > > As I am typing, I realise that get_raw might only have been > implemented in Biopython 1.54... I will post this anyway, and if > upgrading sorts it, I will send a reply. Hi Achim, Well guessed. Yes, get_raw was added in Biopython 1.54, see http://news.open-bio.org/news/2010/05/biopython-release-154/ Are you reading the current tutorial on line (rather than the version shipped with Biopython 1.53)? 
It does mention this, although maybe it could be in the FAQ as well... would that help? Peter From reece at berkeley.edu Thu Jul 1 00:05:05 2010 From: reece at berkeley.edu (Reece Hart) Date: Wed, 30 Jun 2010 17:05:05 -0700 Subject: [Biopython] BioJava-like seqres alignment for Bio.PDB In-Reply-To: References: <4C2A7C09.2020204@berkeley.edu> Message-ID: <4C2BDBB1.7070503@berkeley.edu> On 06/30/2010 06:44 AM, Peter wrote: > That looks like a good reason to have a PDB XML parser (as trying to do > this from the plain text PDB is probably fiddly). > The only way to do this from the PDB text file is to infer the mapping by aligning the seqres block and the ATOM residues. Fiddly is a generous description of the problems one will encounter. mmCIF and XML are the authoritative sources, AFAIK. -Reece From lunt at ctbp.ucsd.edu Thu Jul 1 03:06:41 2010 From: lunt at ctbp.ucsd.edu (Bryan Lunt) Date: Wed, 30 Jun 2010 20:06:41 -0700 Subject: [Biopython] (least) Favorite PDB models? Message-ID: Greetings All, So I have finished (for now) the section of my program that maps PDB model residues to SEQRES residues... And every programmer's favorite thing is QA, right? Does anyone have some suggestions of particularly ugly PDB models (discontinuous, strange residue numberings, etc) that I can test it on? Thanks All! -Bryan Lunt From p.j.a.cock at googlemail.com Thu Jul 1 09:33:04 2010 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 1 Jul 2010 10:33:04 +0100 Subject: [Biopython] (least) Favorite PDB models? In-Reply-To: References: Message-ID: On Thu, Jul 1, 2010 at 4:06 AM, Bryan Lunt wrote: > Greetings All, > > So I have finished (for now) the section of my program that maps PDB > model residues to SEQRES residues... > > And every programmer's favorite thing is QA, right? It should be ;) Unit tests are good! > Does anyone have some suggestions of particularly ugly PDB models > (discontinuous, strange residue numberings, etc) > that I can test it on? 
There are a few odd files listed here, http://www.warwick.ac.uk/go/peter_cock/python/ramachandran/top500/ For example 1DIN and 2HMZ showed some interesting behaviour with fractional occupancy - these should be hard to map onto a single sequence! Peter From anaryin at gmail.com Fri Jul 2 19:25:21 2010 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Fri, 2 Jul 2010 14:25:21 -0500 Subject: [Biopython] BioJava-like seqres alignment for Bio.PDB In-Reply-To: References: <4C2A7C09.2020204@berkeley.edu> Message-ID: Hey! >> Does anyone have any code for easy alignment between the SEQRES entry > >> in a pdb file and the actual ATOM/HETATM entries in the chain? > There's not parsing for SEQRES yet in parse_header_pdb but it wouldn't be hard to. > >> In biojava, this is just one of the options when you parse a PDB file, > >> it would certainly be useful. > Indeed it would. That looks like a good reason to have a PDB XML parser (as trying to do > this from the plain text PDB is probably fiddly). > I don't know if people have started working on such a parser but I have some sort of a head start. Check here: http://github.com/JoaoRodrigues/biopython/blob/GSOC2010/Bio/Struct/WWW/WHATIFXML.py Warning, very ugly :) > > In any case, having function that provides this mapping (both directions) > in > > BioPython would be extremely useful. > > Maybe something for the GSoC project TODO list? ;) > Hmm, I was working on something more or less like this a while back and it didn't work that well. But it might be a good idea. It seems however that Bryan already did it :) No? Jo?o From rodrigo_faccioli at uol.com.br Fri Jul 2 21:50:58 2010 From: rodrigo_faccioli at uol.com.br (Rodrigo Faccioli) Date: Fri, 2 Jul 2010 18:50:58 -0300 Subject: [Biopython] BioJava-like seqres alignment for Bio.PDB In-Reply-To: References: <4C2A7C09.2020204@berkeley.edu>

Message-ID: Hi, About the SEQRES implementation for BioPython, I've developed it. Please, see it in [1]. I hope this implementation could help you. [1] http://github.com/rodrigofaccioli/ContributeToBioPython/blob/master/fcfrp/PDBParser.py Best, -- Rodrigo Antonio Faccioli Ph.D Student in Electrical Engineering University of Sao Paulo - USP Engineering School of Sao Carlos - EESC Department of Electrical Engineering - SEL Intelligent System in Structure Bioinformatics http://laips.sel.eesc.usp.br Phone: 55 (16) 3373-9366 Ext 229 Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 Public Profile - http://br.linkedin.com/pub/rodrigo-faccioli/7/589/a5 On Fri, Jul 2, 2010 at 4:25 PM, Jo?o Rodrigues wrote: > Hey! > > >> Does anyone have any code for easy alignment between the SEQRES entry > > >> in a pdb file and the actual ATOM/HETATM entries in the chain? > > > > There's not parsing for SEQRES yet in parse_header_pdb but it wouldn't be > hard to. > > > > >> In biojava, this is just one of the options when you parse a PDB > file, > > >> it would certainly be useful. > > > > Indeed it would. > > > That looks like a good reason to have a PDB XML parser (as trying to do > > this from the plain text PDB is probably fiddly). > > > > I don't know if people have started working on such a parser but I have > some > sort of a head start. Check here: > > > http://github.com/JoaoRodrigues/biopython/blob/GSOC2010/Bio/Struct/WWW/WHATIFXML.py > > Warning, very ugly :) > > > > > > In any case, having function that provides this mapping (both > directions) > > in > > > BioPython would be extremely useful. > > > > Maybe something for the GSoC project TODO list? ;) > > > > Hmm, I was working on something more or less like this a while back and it > didn't work that well. But it might be a good idea. It seems however that > Bryan already did it :) No? 
> > Jo?o > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From biopython at maubp.freeserve.co.uk Mon Jul 5 17:53:41 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 5 Jul 2010 18:53:41 +0100 Subject: [Biopython] Is Bio.Crystal still useful (for NDB files)? Message-ID: Hello all, Is anyone using Bio.Crystal? It is a parser for a subset of the PDB format which used to be used by the The Nucleic Acid Database Project. My impression from their website is that they are moving all their data to the PDB format - but still have some structures that are present only in the NDB format: http://ndbserver.rutgers.edu/download_data/index.html I'm asking because Bio.Crystal needs some updating to avoid deprecation warnings in the latest versions of Python - and it might be simpler just to deprecate it instead. Peter From rjalves at igc.gulbenkian.pt Mon Jul 5 20:00:17 2010 From: rjalves at igc.gulbenkian.pt (Renato Alves) Date: Mon, 05 Jul 2010 21:00:17 +0100 Subject: [Biopython] Finding what is the most recent Pubmed ID or list of all valid PMIDs Message-ID: <4C3239D1.7040708@igc.gulbenkian.pt> Greetings All, I'm trying to figure out a way to have a more or less up-to-date list of Pubmed IDs for validation purposes. This has to be performed on a programmatic way. My first attempt was to look for this in NCBI's FTP. I could find ftp://ftp.ncbi.nih.gov/pubmed/deleted_pmids.txt but not information about the most recent PMID. I also tried to use EInfo but the count section under PubMed seems either outdated or completely unrelated to the total number of assigned PMIDs. Even when adding the total of deleted_pmids + the number from EInfo I couldn't get accurate information. So my question is, does anyone know how to get either a list of all the valid PMIDs or simply the most recent PMID? 
Thanks, Renato -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 262 bytes Desc: OpenPGP digital signature URL: From biopython at maubp.freeserve.co.uk Mon Jul 5 20:54:37 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 5 Jul 2010 21:54:37 +0100 Subject: [Biopython] Finding what is the most recent Pubmed ID or list of all valid PMIDs In-Reply-To: <4C3239D1.7040708@igc.gulbenkian.pt> References: <4C3239D1.7040708@igc.gulbenkian.pt> Message-ID: On Mon, Jul 5, 2010 at 9:00 PM, Renato Alves wrote: > Greetings All, > > I'm trying to figure out a way to have a more or less up-to-date list of > Pubmed IDs for validation purposes. This has to be performed on a > programmatic way. > > My first attempt was to look for this in NCBI's FTP. I could find > ftp://ftp.ncbi.nih.gov/pubmed/deleted_pmids.txt but not information > about the most recent PMID. > > I also tried to use EInfo but the count section under PubMed seems > either outdated or completely unrelated to the total number of assigned > PMIDs. Even when adding the total of deleted_pmids + the number from > EInfo I couldn't get accurate information. > > So my question is, does anyone know how to get either a list of all the > valid PMIDs or simply the most recent PMID? To try and work out the latest PMID, I'd start by trying a PubMed search by date, using a recent threshold. What number of PMIDs are you trying to validate? Would it make sense to use Entrez to do the validation (in batches)? 
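The date-based search suggested here could be sketched roughly as follows. The code only builds an ESearch URL and parses an ESearch-style XML reply with the standard library, so nothing is fetched. The helper names, the all[sb] search term, and the sample reply with made-up IDs are assumptions for illustration, and taking the largest returned ID gives only a rough estimate of the newest PMID.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def recent_pubmed_query(days=2, retmax=100):
    """Build an ESearch URL for PubMed records entered in the last `days` days."""
    params = [
        ("db", "pubmed"),
        ("term", "all[sb]"),   # assumed "match everything" term
        ("datetype", "edat"),  # filter on the Entrez date
        ("reldate", str(days)),
        ("retmax", str(retmax)),
    ]
    return ESEARCH_URL + "?" + urlencode(params)


def largest_pmid(esearch_xml):
    """Return the largest ID in an ESearch XML reply, or None if there are none."""
    root = ET.fromstring(esearch_xml)
    ids = [int(node.text) for node in root.iter("Id")]
    return max(ids) if ids else None


# Illustrative reply with made-up IDs, shaped like ESearch output.
sample_reply = (
    "<eSearchResult><Count>3</Count><IdList>"
    "<Id>101</Id><Id>205</Id><Id>150</Id>"
    "</IdList></eSearchResult>"
)
query_url = recent_pubmed_query(days=2, retmax=50)
newest = largest_pmid(sample_reply)
```

Any PMID above the estimate can then be rejected locally, while IDs below it still need the deleted_pmids.txt list or an Entrez lookup to be confirmed.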
Peter From rjalves at igc.gulbenkian.pt Mon Jul 5 22:13:34 2010 From: rjalves at igc.gulbenkian.pt (Renato Alves) Date: Mon, 05 Jul 2010 23:13:34 +0100 Subject: [Biopython] Finding what is the most recent Pubmed ID or list of all valid PMIDs In-Reply-To: References: <4C3239D1.7040708@igc.gulbenkian.pt> Message-ID: <4C32590E.6020609@igc.gulbenkian.pt> From Peter on 07/05/2010 09:54 PM: > On Mon, Jul 5, 2010 at 9:00 PM, Renato Alves wrote: >> Greetings All, >> >> I'm trying to figure out a way to have a more or less up-to-date list of >> Pubmed IDs for validation purposes. This has to be performed on a >> programmatic way. >> >> My first attempt was to look for this in NCBI's FTP. I could find >> ftp://ftp.ncbi.nih.gov/pubmed/deleted_pmids.txt but not information >> about the most recent PMID. >> >> I also tried to use EInfo but the count section under PubMed seems >> either outdated or completely unrelated to the total number of assigned >> PMIDs. Even when adding the total of deleted_pmids + the number from >> EInfo I couldn't get accurate information. >> >> So my question is, does anyone know how to get either a list of all the >> valid PMIDs or simply the most recent PMID? > > To try and work out the latest PMID, I'd start by trying a PubMed search > by date, using a recent threshold. > > What number of PMIDs are you trying to validate? Would it make > sense to use Entrez to do the validation (in batches)? > > Peter I've thought of using Entrez but I was trying to avoid it by using information available locally. I've no idea how many and what PMIDs will be requested. But indeed searching by date I can get a rough idea of what might be the most recent PMID. Thanks Peter -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 198 bytes Desc: OpenPGP digital signature URL: From biopython at maubp.freeserve.co.uk Tue Jul 6 08:23:28 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 6 Jul 2010 09:23:28 +0100 Subject: [Biopython] Is Bio.Crystal still useful (for NDB files)? In-Reply-To: <20100706084942.10823rdn5pazwcg6@horde.genesilico.pl> References: <20100706084942.10823rdn5pazwcg6@horde.genesilico.pl> Message-ID: On Tue, Jul 6, 2010 at 7:49 AM, Kristian Rother wrote: > > > Hi Peter, > > none of us here is using Bio.Crystal - and we use NDB structures a lot. > > Best, > ? Kristian Hi Kristian, What kind of files are you using from the NDB? Standard PDB or mmCIF maybe? Peter From biopython at maubp.freeserve.co.uk Tue Jul 6 08:43:18 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 6 Jul 2010 09:43:18 +0100 Subject: [Biopython] Is Bio.Crystal still useful (for NDB files)? In-Reply-To: <20100706102515.18063fisv68bqesr@horde.genesilico.pl> References: <20100706084942.10823rdn5pazwcg6@horde.genesilico.pl> <20100706102515.18063fisv68bqesr@horde.genesilico.pl> Message-ID: > Quoting Peter : >> Hi Kristian, >> >> What kind of files are you using from the NDB? Standard PDB or mmCIF >> maybe? >> >> Peter On Tue, Jul 6, 2010 at 9:25 AM, Kristian Rother wrote: > Hi Peter, > > Standard PDB. > We're gradually shifting to the PDB database for queries though, because the > XML-based API for querying subsets of structures is much better there. > > Kristian Thanks - maybe we can depreacte Bio.Crystal then... Peter From biopython at maubp.freeserve.co.uk Tue Jul 6 10:36:39 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 6 Jul 2010 11:36:39 +0100 Subject: [Biopython] Deprecating Bio.Crystal in next release? Message-ID: Hi all, Given recent discussion (and the lack of interest on the dev list on previous occasions), is there any objection to deprecating Bio.Crystal in the next release of Biopython? 
http://lists.open-bio.org/pipermail/biopython/2010-July/006633.html http://lists.open-bio.org/pipermail/biopython-dev/2008-October/004405.html http://lists.open-bio.org/pipermail/biopython-dev/2007-July/002901.html Thanks, Peter From biopython at maubp.freeserve.co.uk Tue Jul 6 10:46:12 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 6 Jul 2010 11:46:12 +0100 Subject: [Biopython] Is Bio.Crystal still useful (for NDB files)? In-Reply-To: <20100706123045.12993suli0ec27dh@horde.genesilico.pl> References: <20100706084942.10823rdn5pazwcg6@horde.genesilico.pl> <20100706123045.12993suli0ec27dh@horde.genesilico.pl> Message-ID: On Tue, Jul 6, 2010 at 11:30 AM, Kristian Rother wrote: > >> What kind of files are you using from the NDB? Standard PDB or mmCIF >> maybe? > > Ah... sorry.. just PDB. > > Kristian That's useful to know - thanks. Peter P.S. Please try to CC the mailing list ;) From biopython at maubp.freeserve.co.uk Tue Jul 6 13:40:51 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 6 Jul 2010 14:40:51 +0100 Subject: [Biopython] Deprecating Bio.InterPro Message-ID: Hi all, Another old module which hasn't been updated for some time is Bio.InterPro, a parser for the HTML (webpages) at the EBI, e.g. http://www.ebi.ac.uk/interpro/IEntry?ac=IPR001064 The parser doesn't work with the current website, and also uses a Python library called sgmllib which was deprecated as of Python 2.6. Website parsers are in general a bad idea because they tend to need a lot of work to keep up to date. Perhaps in this case there are suitable plain text files on the FTP site which might be used? Unless anyone has a good reason not to, we are going to deprecate the Bio.InterPro module in the next release of Biopython. Peter From biopython at maubp.freeserve.co.uk Tue Jul 6 15:20:38 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 6 Jul 2010 16:20:38 +0100 Subject: [Biopython] Deprecating Bio.Index?
Message-ID: Hello all, Is anyone using the Bio.Index module in Biopython in their own code? This supported file indexing and was used in other parts of Biopython which have all now been deprecated (e.g. Bio.SwissProt.SProt and Bio.Prosite) or removed. The more recent Bio.SeqIO module provides a general approach to indexing sequence files. Would it inconvenience anyone if Bio.Index was deprecated in the next release (triggering warnings when imported, but still functional), and then removed later on? Thanks, Peter From guyeakin at gmail.com Thu Jul 8 00:52:49 2010 From: guyeakin at gmail.com (Guy Eakin) Date: Wed, 7 Jul 2010 20:52:49 -0400 Subject: [Biopython] Bio.Entrez/Medline DTD problems - missing DTD nlmmedlinecitationset_100301.dtd Message-ID: I am learning biopython and seem to be having trouble parsing efetch generated xml. Maybe I am confused here, but I can't for the life of me Get my xml to parse correctly, and it seems to be coming up with a missing dtd error using both Medline.parse and Entrez.parse. (traceback for medline below below) nlmmedlinecitationset_100301. dtd and pubmed_100301.dtd seem to be missing from my biopython installation, and unavailable from the following NCBI sites: http://www.ncbi.nlm.nih.gov/dtd/ or http://eutils.ncbi.nlm.nih.gov/entrez/query/DTD/ My apologies if this is user error; i do not see reference to this DTD issue in the archives so am posting the incident. Is this just bad luck during my learning curve, or am I missing something conceptual here? 
Thanks,
Guy

Traceback (most recent call last):
  File "C:\Python26\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 312, in RunScript
    exec codeObject in __main__.__dict__
  File "C:\ieye\ieye\citations\pubmed_search_fxn.py", line 36, in <module>
    parsed_results = Entrez.read(fetch_handle)
  File "C:\Python26\lib\site-packages\Bio\Entrez\__init__.py", line 262, in read
    record = handler.read(handle)
  File "C:\Python26\lib\site-packages\Bio\Entrez\Parser.py", line 108, in read
    self.parser.ParseFile(handle)
  File "C:\Python26\lib\site-packages\Bio\Entrez\Parser.py", line 381, in externalEntityRefHandler
    parser.ParseFile(handle)
  File "C:\Python26\lib\site-packages\Bio\Entrez\Parser.py", line 377, in externalEntityRefHandler
    raise RuntimeError(message)
RuntimeError: Unable to load DTD file nlmmedlinecitationset_100301.dtd.

From biopython at maubp.freeserve.co.uk Thu Jul 8 07:42:32 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 8 Jul 2010 08:42:32 +0100
Subject: [Biopython] Bio.Entrez/Medline DTD problems - missing DTD nlmmedlinecitationset_100301.dtd
In-Reply-To:
References:
Message-ID:

On Thu, Jul 8, 2010 at 1:52 AM, Guy Eakin wrote:
> I am learning biopython and seem to be having trouble parsing efetch
> generated xml.
>
> Maybe I am confused here, but I can't for the life of me Get my xml to parse
> correctly, and it seems to be coming up with a missing dtd error using both
> Medline.parse and Entrez.parse. (traceback for medline below below)
>
> nlmmedlinecitationset_100301.dtd and pubmed_100301.dtd seem to be missing
> from my biopython installation, and unavailable from the following NCBI sites:
>
> http://www.ncbi.nlm.nih.gov/dtd/ or
> http://eutils.ncbi.nlm.nih.gov/entrez/query/DTD/
>
> My apologies if this is user error; i do not see reference to this DTD issue
> in the archives so am posting the incident. Is this just bad luck during my
> learning curve, or am I missing something conceptual here?

The problem is with the NCBI "hiding" the file by not showing the raw contents of that folder, but just an HTML page with a partial list. You need this file: http://eutils.ncbi.nlm.nih.gov/corehtml/query/DTD/nlmmedlinecitationset_100301.dtd I've added this to our repository so the next version of Biopython will include it. Please let us know if anything else is missing - what was the Entrez request you used to get the XML using this DTD file? Regards, Peter From guyeakin at gmail.com Thu Jul 8 11:28:17 2010 From: guyeakin at gmail.com (Guy Eakin) Date: Thu, 8 Jul 2010 07:28:17 -0400 Subject: [Biopython] Bio.Entrez/Medline DTD problems - missing DTD nlmmedlinecitationset_100301.dtd In-Reply-To: References:

Message-ID:

Peter,

Many thanks.

This is a query statement that generated the nlmmedlinecitationset_100301.dtd error:

Entrez.esearch(db="pubmed",
               term= ('glaucom*'),
               retmax=2, usehistory="y",
               reldate=7, datetype="edat")

fetch_handle = Entrez.efetch(db="pubmed", retmode="xml",rettype='medline',
                             webenv=webenv, query_key=query_key)

You will also want to add pubmed_100301.dtd to your repository. I do not have the query that generated its dependent XML, but got a separate error related to its absence yesterday. Oddly, I was able to download the "hidden" pubmed_100301.dtd, but could not replicate the error. All following errors focused on the nlmmedlinecitationset_100301.dtd file, which I could not locate until this morning. Perhaps it was just recently posted to the site. Either way, thanks for the confirmation that I was on the right track.

regards,
guy

On Thu, Jul 8, 2010 at 3:42 AM, Peter wrote:
> On Thu, Jul 8, 2010 at 1:52 AM, Guy Eakin wrote:
> > I am learning biopython and seem to be having trouble parsing efetch
> > generated xml.
> >
> > Maybe I am confused here, but I can't for the life of me Get my xml to
> parse
> > correctly, and it seems to be coming up with a missing dtd error using
> both
> > Medline.parse and Entrez.parse. (traceback for medline below below)
> >
> > nlmmedlinecitationset_100301.
> > dtd and pubmed_100301.dtd seem to be missing from my biopython
> > installation, and unavailable from the following NCBI sites:
> >
> > http://www.ncbi.nlm.nih.gov/dtd/ or
> > http://eutils.ncbi.nlm.nih.gov/entrez/query/DTD/
> >
> > My apologies if this is user error; i do not see reference to this DTD
> issue
> > in the archives so am posting the incident. Is this just bad luck during
> my
> > learning curve, or am I missing something conceptual here?
>
> The problem is with the NCBI "hiding" the file by not showing the raw
> contents of that folder, but just an HTML page with a partial list.
You > need this file: > > > http://eutils.ncbi.nlm.nih.gov/corehtml/query/DTD/nlmmedlinecitationset_100301.dtd > > I've added this to our repository so the next version of Biopython will > include it. Please let us know if anything else is missing - what was > the Entrez request you used to get the XML using this DTD file? > > Regards, > > Peter > From biopython at maubp.freeserve.co.uk Thu Jul 8 11:53:30 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 8 Jul 2010 12:53:30 +0100 Subject: [Biopython] Bio.Entrez/Medline DTD problems - missing DTD nlmmedlinecitationset_100301.dtd In-Reply-To: References:

Message-ID:

On Thu, Jul 8, 2010 at 12:28 PM, Guy Eakin wrote:
> Peter,
>
> Many thanks.
>
> this is a query statement that generated the
> nlmmedlinecitationset_100301.dtd error: Entrez.esearch(db="pubmed",
>                        term= ('glaucom*'),
>                        retmax=2, usehistory="y",
>                        reldate=7, datetype="edat")
>
>
> fetch_handle = Entrez.efetch(db="pubmed", retmode="xml",rettype='medline',
>                             webenv=webenv, query_key=query_key)
>

Great. A more complete version is:

from Bio import Entrez
# Tell the NCBI who is making the request
Entrez.email = "A.N.Other@example.com"
# Search PubMed, keeping the results on the NCBI history server
s = Entrez.read(Entrez.esearch(db="pubmed",term= ('glaucom*'),retmax=2,
                               usehistory="y",reldate=7, datetype="edat"))
# Fetch the matching records as XML via the history (WebEnv and QueryKey)
r = Entrez.read(Entrez.efetch(db="pubmed", retmode="xml",rettype='medline',
                              webenv=s["WebEnv"], query_key=s["QueryKey"]))

>
> You will also want to add pubmed_100301.dtd to your repository. ...
>

Yes, we needed the new pubmed_100301.dtd and also bookdoc_100301.dtd

It looks like the NCBI did some updates recently - hopefully they'll update that webpage soon and we can see if we need to add any more (other than finding out from error messages).

Thanks for the report, and please let us know if you find any more missing DTD files.

Peter

From guyeakin at gmail.com Wed Jul 14 20:48:41 2010
From: guyeakin at gmail.com (Guy Eakin)
Date: Wed, 14 Jul 2010 16:48:41 -0400
Subject: [Biopython] Pubmeddata XML parsing with Entrez .fetch and .read
Message-ID:

I am using Bio.Entrez.read to parse XML returned from PubMed.

This results in a dictionary for which one of the keys is ArticleIdList, e.g.

'PubmedData': {u'ArticleIdList': ['S0735-6757(09)00464-1', '10.1016/j.ajem.2009.09.013', '20579576'], blah: blah, etc.}

In the original XML, each <ArticleId> in <ArticleIdList> contains an IdType attribute that names the ID.
for example <ArticleId IdType="doi">10.1016/j.ajem.2009.09.013</ArticleId>

The IdType is useful, but I can't find it in the Bio.Entrez.read output, so I have no *easy* way of determining whether the ID# is pii, pmc, pmid, etc.

Is there a better way to get the IdType attribute, or other XML tag attributes, from the Entrez.read output?

Thanks.
Guy

Code below
---------------
from Bio import Medline
from Bio import Entrez
import routine_pubmed_query_terms as pubmedterms
#this is a separate .py file that I use to hold query terms, email address, etc.

s = Entrez.read(Entrez.esearch(db="pubmed",
                               term=pubmedterms.entrezquery(program),
                               retmax=pubmedterms.maxlimit,
                               usehistory="y",
                               reldate=pubmedterms.datelimit,
                               datetype="edat"))
print "found %s records, returning %s" % (int(s["Count"]), len(s["IdList"]))
r = Entrez.read(Entrez.efetch(db="pubmed",retmode="xml", rettype='medline',
                              webenv=s["WebEnv"], query_key=s["QueryKey"]))

From biopython at maubp.freeserve.co.uk Wed Jul 14 21:18:34 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 14 Jul 2010 22:18:34 +0100
Subject: [Biopython] Pubmeddata XML parsing with Entrez .fetch and .read
In-Reply-To:
References:
Message-ID:

On Wed, Jul 14, 2010 at 9:48 PM, Guy Eakin wrote:
> I am using Bio.Entrez.read to parse XML returned from pubmed.
>
> This results in a dictionary for which one of the keys is ArticleIDList,
> e.g,
> Example
> PubmedData': {u'ArticleIdList': ['S0735-6757(09)00464-1',
> '10.1016/j.ajem.2009.09.013', '20579576'], blah: blah, etc.}
>
> In the original XML Each <ArticleId> in <ArticleIdList> contains an IDtype
> attribute that names the ID. for example
> <ArticleId IdType="doi">10.1016/j.ajem.2009.09.013</ArticleId>
>
> the IDtype is useful, but I can't find it in the Bio.Entrez.read output, so
> I have no *easy* way of determining whether the ID# is
> pii, pmc, pmid, etc.
>
> Is there a better way to get the IDtype attribute, or other XML tag
> attributes from the Entrez.read output?
>

Hi,

This information is in the tutorial, but could perhaps be clearer.
It might look like you get strings back, but in fact it is a subclass with an attributes property (a dictionary). e.g.

from Bio import Entrez
handle = Entrez.efetch(db="pubmed",retmode="xml",rettype='medline',id='19304878')
r = Entrez.read(handle)
handle.close()
print r[0]['PubmedData']['ArticleIdList'][1]
print r[0]['PubmedData']['ArticleIdList'][1].attributes

Michiel - maybe we need to override the __repr__ method so it shows this information?

Peter

From biopython at maubp.freeserve.co.uk Wed Jul 14 21:33:06 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 14 Jul 2010 22:33:06 +0100
Subject: [Biopython] Pubmeddata XML parsing with Entrez .fetch and .read
In-Reply-To:
References:

Message-ID: On Wed, Jul 14, 2010 at 10:21 PM, Guy Eakin wrote: > thanks.? I understood that it had a? .tag feature,but missed the > .attributes! > Awesome. And thank you for the quick reply. > > Guy No problem. Now you know the answer, can you suggest any clarifications to the documentation? Peter P.S. Try and CC the mailing list in replies. From guyeakin at gmail.com Thu Jul 15 02:52:44 2010 From: guyeakin at gmail.com (Guy Eakin) Date: Wed, 14 Jul 2010 22:52:44 -0400 Subject: [Biopython] Pubmeddata XML parsing with Entrez .fetch and .read In-Reply-To: References:

Message-ID:

Sure, I am new, so there are probably errors, but how about something like a demonstration appended to the end of the tutorial section at http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#htoc105

At core, the simple demonstration that type(record) returns a class object rather than a list, and that foo.attributes and foo.tag exist, would be helpful. I am not using any of the sequence utilities, so admit that my reading of those sections was brief. Reiteration in the Entrez parsing sections is probably helpful for people like me.

A more verbose demonstration follows.

Again, thanks for the help.
Guy

8.11.1 Parsing Medline records [intervening text omitted]

At this point let's address what these elements contain. Consider information found in the following statement.

>>> records[0]['PubmedData']
{u'ArticleIdList': ['btp163', '10.1093/bioinformatics/btp163', '19304878', 'PMC2682512'],
 u'PublicationStatus': 'ppublish',
 u'History': [{u'Month': '3', u'Day': '20', u'Year': '2009'},
              {u'Minute': '0', u'Month': '3', u'Day': '24', u'Hour': '9', u'Year': '2009'},
              {u'Minute': '0', u'Month': '3', u'Day': '24', u'Hour': '9', u'Year': '2009'},
              {u'Minute': '0', u'Month': '7', u'Day': '10', u'Hour': '9', u'Year': '2009'}]}

It is important to recall that each item is a Biopython class, rather than simply a dictionary or list item. This can be verified by

>>> type(records[0]['PubmedData']['ArticleIdList'])

which returns a Biopython ListElement class rather than <type 'list'>.

This is important, as the class item contains additional auxiliary information as noted earlier. One such piece of important auxiliary information is the XML tag attributes from the parsed XML.
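
As a rough illustration of the idea only (this is not the Bio.Entrez code), the sketch below uses just the Python 3 standard library to show how a string subclass can carry the XML tag name and attributes along with the text. The names TaggedString and parse_article_ids are hypothetical, and the XML snippet is an assumed example modelled on the PubMed ArticleIdList layout.

```python
import xml.etree.ElementTree as ET


class TaggedString(str):
    """Hypothetical stand-in for a parsed XML text node: a str that also
    remembers the tag name and attributes of the element it came from."""

    def __new__(cls, text, tag=None, attributes=None):
        obj = super().__new__(cls, text)
        obj.tag = tag
        obj.attributes = dict(attributes or {})
        return obj


def parse_article_ids(xml_text):
    """Return one TaggedString per child element of the root element."""
    root = ET.fromstring(xml_text)
    return [TaggedString(child.text, child.tag, child.attrib) for child in root]


# Assumed example input, modelled on the PubMed ArticleIdList layout.
XML = """<ArticleIdList>
<ArticleId IdType="pii">btp163</ArticleId>
<ArticleId IdType="doi">10.1093/bioinformatics/btp163</ArticleId>
<ArticleId IdType="pubmed">19304878</ArticleId>
<ArticleId IdType="pmc">PMC2682512</ArticleId>
</ArticleIdList>"""

ids = parse_article_ids(XML)

# The items still behave like ordinary strings...
print(ids[1].upper())

# ...but each one also knows which kind of identifier it is, so a
# lookup table keyed on IdType is a one-liner.
by_type = {item.attributes["IdType"]: str(item) for item in ids}
print(by_type["doi"])
```

With the real parser the same dictionary comprehension should work on records[0]['PubmedData']['ArticleIdList'], assuming every item carries an IdType attribute.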
In this case, the original XML contained the following tags:

<ArticleIdList>
<ArticleId IdType="pii">btp163</ArticleId>
<ArticleId IdType="doi">10.1093/bioinformatics/btp163</ArticleId>
<ArticleId IdType="pubmed">19304878</ArticleId>
<ArticleId IdType="pmc">PMC2682512</ArticleId>
</ArticleIdList>

which have now been parsed into the u'ArticleIdList' dictionary key:

>>> L = records[0]['PubmedData']['ArticleIdList']
>>> L
['btp163', '10.1093/bioinformatics/btp163', '19304878', 'PMC2682512']

Viewed as a simple list, these elements appear to lack the IdType information. However, the IdType attribute from the <ArticleId> tag is stored in the parsed data, and can be retrieved by accessing 'attributes' on the Biopython class object.

>>> for item in L:
...     print "%s - %s" % (item, item.attributes)
...
btp163 - {u'IdType': u'pii'}
10.1093/bioinformatics/btp163 - {u'IdType': u'doi'}
19304878 - {u'IdType': u'pubmed'}
PMC2682512 - {u'IdType': u'pmc'}

On Wed, Jul 14, 2010 at 5:33 PM, Peter wrote:
> On Wed, Jul 14, 2010 at 10:21 PM, Guy Eakin wrote:
> > thanks. I understood that it had a .tag feature,but missed the
> > .attributes!
> > Awesome. And thank you for the quick reply.
> >
> > Guy
>
> No problem. Now you know the answer, can you suggest
> any clarifications to the documentation?
>
> Peter
>
> P.S. Try and CC the mailing list in replies.
>

From biopython at maubp.freeserve.co.uk Thu Jul 15 09:44:06 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Jul 2010 10:44:06 +0100
Subject: [Biopython] Pubmeddata XML parsing with Entrez .fetch and .read
In-Reply-To:
References:

Message-ID: On Thu, Jul 15, 2010 at 3:52 AM, Guy Eakin wrote: > Sure, I am new, so there are probably errors, but how about something like a > demonstration appended to the end of the tutorial section at > http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#htoc105 > > At core, the simple demonstration that type(record) calls a class object > rather than a list, and that foo.attributes, and foo.tag exist would be > helpful. ?I am not using any of the sequence utilities, so admit that my > reading of those sections was brief. ?Reiteration in the entrez parsing > sections is probably helpful for people like me. > > A more verbose demonstration follows. > > Again, thanks for the help. > Guy Thank you for the detailed suggested text. > 8.11.1 ?Parsing Medline records [intervening text omitted] > > At this point let?s address what these elements contain. ?Consider > information found in the following statement. > >>>> records[0]['PubmedData'] > > {u'ArticleIdList': ['btp163', '10.1093/bioinformatics/btp163', '19304878', > 'PMC2682512'], u'PublicationStatus': 'ppublish', u'History': [{u'Month': > '3', u'Day': '20', u'Year': '2009'}, {u'Minute': '0', u'Month': '3', u'Day': > '24', u'Hour': '9', u'Year': '2009'}, {u'Minute': '0', u'Month': '3', > u'Day': '24', u'Hour': '9', u'Year': '2009'}, {u'Minute': '0', u'Month': > '7', u'Day': '10', u'Hour': '9', u'Year': '2009'}]} > > It is important to recall that each item is a biopython class, rather than a > simply a dictionary or list item. 
?This can be verified by > >>>>type(records[0]['PubmedData']['ArticleIdList'] > > Which returns rather than 'list'> This is why I was suggesting to Michiel that we override the __repr__ for our subclassed objects, so that rather than seeing things like this: ['btp163', '10.1093/bioinformatics/btp163', '19304878', 'PMC2682512'] we get something like: ListElement(['btp163', '10.1093/bioinformatics/btp163', '19304878', 'PMC2682512'], attributes={...}) On deeper reflection, the trouble with this is that all the children within the list would get longer, so the full representation of a ListElement (or any container) would become very very long - swamping the console output. Even if we literally show the attributes with a dot dot dot :( Maybe we'll have to settle for just documentation improvements. Michiel - this is your code - what do you think? Peter From guyeakin at gmail.com Thu Jul 15 13:32:43 2010 From: guyeakin at gmail.com (Guy Eakin) Date: Thu, 15 Jul 2010 09:32:43 -0400 Subject: [Biopython] Pubmeddata XML parsing with Entrez .fetch and .read In-Reply-To: References:

Message-ID:

From the naive first-time user perspective, the current implementation is fine for the computer, but could benefit from a viewer output that creates a more human-readable representation. I found myself cross-referencing the console output to the original XML in most cases. That says to me that there might be benefit to a function that recapitulates the original XML's nested structure, listing attribute values.

I would think something along the following lines would be quite useful, and if limited to a particular range of records would not necessarily be unwieldy. (apologies for the admittedly unwieldy markup)

>>> Bio.Entrez.viewer(recordlist, range=(0:(len(recordlist)), ShowMedlineCitation = True, ShowPubmedData = True)

- (Attributes = Parent1.attribute) - \n #(80 characters/line)
..............value
..............indented text allows word wrap of entries > 80 char.
........1 - Attribute - \n
.......................value

That's an off-the-cuff representation before I dash to a meeting, but I think you can see what I am suggesting.

Guy

On Thu, Jul 15, 2010 at 5:44 AM, Peter wrote:
> On Thu, Jul 15, 2010 at 3:52 AM, Guy Eakin wrote:
> > Sure, I am new, so there are probably errors, but how about something
> like a
> > demonstration appended to the end of the tutorial section at
> > http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#htoc105
> >
> > At core, the simple demonstration that type(record) calls a class object
> > rather than a list, and that foo.attributes, and foo.tag exist would be
> > helpful. I am not using any of the sequence utilities, so admit that my
> > reading of those sections was brief. Reiteration in the entrez parsing
> > sections is probably helpful for people like me.
> >
> > A more verbose demonstration follows.
> >
> > Again, thanks for the help.
> > Guy
>
> Thank you for the detailed suggested text.
> > > 8.11.1 Parsing Medline records [intervening text omitted] > > > > At this point let?s address what these elements contain. Consider > > information found in the following statement. > > > >>>> records[0]['PubmedData'] > > > > {u'ArticleIdList': ['btp163', '10.1093/bioinformatics/btp163', > '19304878', > > 'PMC2682512'], u'PublicationStatus': 'ppublish', u'History': [{u'Month': > > '3', u'Day': '20', u'Year': '2009'}, {u'Minute': '0', u'Month': '3', > u'Day': > > '24', u'Hour': '9', u'Year': '2009'}, {u'Minute': '0', u'Month': '3', > > u'Day': '24', u'Hour': '9', u'Year': '2009'}, {u'Minute': '0', u'Month': > > '7', u'Day': '10', u'Hour': '9', u'Year': '2009'}]} > > > > It is important to recall that each item is a biopython class, rather > than a > > simply a dictionary or list item. This can be verified by > > > >>>>type(records[0]['PubmedData']['ArticleIdList'] > > > > Which returns rather than > 'list'> > > This is why I was suggesting to Michiel that we override the __repr__ > for our subclassed objects, so that rather than seeing things like this: > > ['btp163', '10.1093/bioinformatics/btp163', '19304878', 'PMC2682512'] > > we get something like: > > ListElement(['btp163', '10.1093/bioinformatics/btp163', '19304878', > 'PMC2682512'], attributes={...}) > > On deeper reflection, the trouble with this is that all the children within > the list would get longer, so the full representation of a ListElement (or > any container) would become very very long - swamping the console > output. Even if we literally show the attributes with a dot dot dot :( > > Maybe we'll have to settle for just documentation improvements. > Michiel - this is your code - what do you think? 
>
> Peter
>

From mjldehoon at yahoo.com Thu Jul 15 13:36:19 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Thu, 15 Jul 2010 06:36:19 -0700 (PDT)
Subject: [Biopython] Pubmeddata XML parsing with Entrez .fetch and .read
In-Reply-To:
Message-ID: <436965.79909.qm@web62402.mail.re1.yahoo.com>

--- On Thu, 7/15/10, Peter wrote:
> This is why I was suggesting to Michiel that we override
> the __repr__ for our subclassed objects, so that rather
> than seeing things like this:
>
> ['btp163', '10.1093/bioinformatics/btp163', '19304878',
> 'PMC2682512']
>
> we get something like:
>
> ListElement(['btp163', '10.1093/bioinformatics/btp163',
> '19304878', 'PMC2682512'], attributes={...})
>
> On deeper reflection, the trouble with this is that all the
> children within the list would get longer, so the full
> representation of a ListElement (or
> any container) would become very very long - swamping the
> console output.

The attributes are almost always only a small fraction of the Entrez XML file. So while it's true that each element gets larger, it's a small relative increase. The elements that are very long after adding the attributes are also very long without the attributes. So I am in favor of your original suggestion. If there are no other suggestions, I'll make the change in Bio.Entrez over the weekend (or feel free to do so before that).

Best,
--Michiel

From biopython at maubp.freeserve.co.uk Thu Jul 15 13:50:41 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Jul 2010 14:50:41 +0100
Subject: [Biopython] Pubmeddata XML parsing with Entrez .fetch and .read
In-Reply-To: <436965.79909.qm@web62402.mail.re1.yahoo.com>
References: <436965.79909.qm@web62402.mail.re1.yahoo.com>
Message-ID:

On Thu, Jul 15, 2010 at 2:36 PM, Michiel de Hoon wrote:
>
> --- On Thu, 7/15/10, Peter wrote:
>> This is why I was suggesting to Michiel that we override
>> the __repr__ for our subclassed objects, so that rather
>> than seeing things like this:
>>
>> ['btp163', '10.1093/bioinformatics/btp163', '19304878',
>> 'PMC2682512']
>>
>> we get something like:
>>
>> ListElement(['btp163', '10.1093/bioinformatics/btp163',
>> '19304878', 'PMC2682512'], attributes={...})
>>
>> On deeper reflection, the trouble with this is that all the
>> children within the list would get longer, so the full
>> representation of a ListElement (or any container) would
>> become very very long - swamping the console output.
>
> The attributes are almost always only a small fraction of the Entrez XML file.
> So while it's true that each element gets larger, it's a small relative increase.
> The elements that are very long after adding the attributes are also very long
> without the attributes. So I am in favor of your original suggestion. If there are
> no other suggestions, I'll make the change in Bio.Entrez over the weekend
> (or feel free to do so before that).

Maybe you can keep the basic data type repr if there are no attributes, and only expand it if needed? It would be inconsistent but would keep the total string length down.
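
A minimal sketch of that compromise, written from scratch for Python 3 and not taken from Bio.Entrez: the classes DemoString and DemoList are hypothetical stand-ins for the parser's string and list subclasses, and their constructors are an assumption made for the example. The repr falls back to the plain built-in form when there are no attributes and only expands when there is something to show.

```python
class DemoString(str):
    """Hypothetical str subclass carrying a dictionary of XML attributes."""

    def __new__(cls, text, attributes=None):
        obj = super().__new__(cls, text)
        obj.attributes = dict(attributes or {})
        return obj

    def __repr__(self):
        text = str.__repr__(self)
        if not self.attributes:
            # No attributes: look exactly like a plain string.
            return text
        return "DemoString(%s, attributes=%r)" % (text, self.attributes)


class DemoList(list):
    """Hypothetical list subclass carrying a dictionary of XML attributes."""

    def __init__(self, items=(), attributes=None):
        super().__init__(items)
        self.attributes = dict(attributes or {})

    def __repr__(self):
        # list.__repr__ calls repr() on each child, so children with
        # attributes expand themselves and the others stay short.
        text = list.__repr__(self)
        if not self.attributes:
            return text
        return "DemoList(%s, attributes=%r)" % (text, self.attributes)


ids = DemoList([
    DemoString("btp163", {"IdType": "pii"}),
    DemoString("19304878"),
])
print(repr(ids))
```

Only the first child expands here; the second child and the list itself print as ordinary built-ins, which is the inconsistency mentioned above.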

Peter

From bjorn_johansson at bio.uminho.pt Sat Jul 17 11:32:12 2010
From: bjorn_johansson at bio.uminho.pt (Björn Johansson)
Date: Sat, 17 Jul 2010 12:32:12 +0100
Subject: [Biopython] format fasta files to genbank: problem with too long Locus identifier
Message-ID:

Hi all, this is an example of parsing a fasta file and then trying to convert it to genbank.
It seems that the fasta header is not split at the "|", and all that is in the fasta header ends up as "LOCUS" in the genbank file. Is this the expected behavior? Can this be set somehow?

Thanks for any help on this!
/bjorn

>>> from Bio import SeqIO
>>> a=SeqIO.read("newfile.fasta", "fasta")
>>> a
SeqRecord(seq=Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...GGG', SingleLetterAlphabet()), id='gi|2765658|emb|Z78533.1|CIZ78533', name='gi|2765658|emb|Z78533.1|CIZ78533', description='gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA', dbxrefs=[])
>>> a.format('fasta')
'>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA\nCGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAA\nCGATCGAGTGAATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGT\nGACCCTGATTTGTTGTTGGG\n'
>>> a.format('genbank')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.6/dist-packages/Bio/SeqRecord.py", line 638, in format
    return self.__format__(format)
  File "/usr/local/lib/python2.6/dist-packages/Bio/SeqRecord.py", line 652, in __format__
    SeqIO.write([self], handle, format_spec)
  File "/usr/local/lib/python2.6/dist-packages/Bio/SeqIO/__init__.py", line 398, in write
    count = writer_class(handle).write_file(sequences)
  File "/usr/local/lib/python2.6/dist-packages/Bio/SeqIO/Interfaces.py", line 271, in write_file
    count = self.write_records(records)
  File "/usr/local/lib/python2.6/dist-packages/Bio/SeqIO/Interfaces.py", line 256, in write_records
    self.write_record(record)
  File "/usr/local/lib/python2.6/dist-packages/Bio/SeqIO/InsdcIO.py", line 628, in write_record
    self._write_the_first_line(record)
  File "/usr/local/lib/python2.6/dist-packages/Bio/SeqIO/InsdcIO.py", line 453, in _write_the_first_line
    raise ValueError("Locus identifier %s is too long" % repr(locus))
ValueError: Locus identifier 'gi|2765658|emb|Z78533.1|CIZ78533' is too long
>>>

From biopython at maubp.freeserve.co.uk Sat Jul 17 11:50:50 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 17 Jul 2010 12:50:50 +0100
Subject: [Biopython] format fasta files to genbank: problem with too long Locus identifier
In-Reply-To:
References:
Message-ID:

2010/7/17 Björn Johansson :
> Hi all, this is an example of parsing a fasta file and then trying to
> convert it to genbank.
> It seems that the fasta header file is not split between the "|", and all
> that is in the fasta header ends up as "LOCUS" in the genbank file. Is this
> the expected behavior? Can this be set somehow?
>
> Thanks for any help on this!
> /bjorn

Hi Bjorn,

Yes, this is expected behaviour. There are no standards for FASTA identifiers; the NCBI conventions are just one of dozens of styles. Therefore we don't try to parse the identifiers in FASTA files (we can't do it reliably).

Then for GenBank files, the identifier field in the LOCUS line is very limited - you'll have to shorten your ID manually. Try something like this:

from Bio import SeqIO
a=SeqIO.read("newfile.fasta", "fasta")
# Keep only the fourth "|" separated field, here the accession Z78533.1
a.id = a.id.split("|")[3]
print a.format('genbank')

(untested)

Peter

From biopython at maubp.freeserve.co.uk Sat Jul 17 19:59:46 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 17 Jul 2010 20:59:46 +0100
Subject: [Biopython] format fasta files to genbank: problem with too long Locus identifier
In-Reply-To:
References: