From krewink at inb.uni-luebeck.de Mon Apr 3 13:48:06 2006
From: krewink at inb.uni-luebeck.de (Albert Krewinkel)
Date: Mon, 3 Apr 2006 19:48:06 +0200
Subject: [Biopython-dev] EMBL flatfile parsing
Message-ID: <20060403174806.GA22937@pc06.inb.mu-luebeck.de>

Hello,

I am trying to parse an EMBL-formatted file with biopython, but I
couldn't find any working parser for this. When I try to use the
Martel-based parser as described in one of the mailing list threads, I
get the following error:

Python 2.4.1 (#1, Oct 22 2005, 16:20:11)
[GCC 4.0.0 20041026 (Apple Computer, Inc. build 4061)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> filename = '/Users/krewinkel/tmp/embltest.embl'
>>> from Bio.formatdefs.embl import embl65
>>> from xml.sax import saxutils
>>> parser = embl65.make_parser()
>>> parser.setContentHandler(saxutils.XMLGenerator())
>>> parser.parse(open(filename))
Traceback (most recent call last):
  File "", line 1, in ?
  File "/opt/local/lib/python2.4/site-packages/Martel/Parser.py", line 482, in parse
    self.parseFile(source.getCharacterStream() or source.getByteStream())
  File "/opt/local/lib/python2.4/site-packages/Martel/Parser.py", line 468, in parseFile
    self._err_handler.error(result)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/xml/sax/handler.py", line 34, in error
    raise exception
Martel.Parser.ParserPositionException: error parsing at or beyond character 0

The file itself appears to be okay, since it can be read by 'seqret'
and bioperl. This seems to be a parser problem -- or am I doing
something wrong?

Thanks in advance

Albert

--
Albert Krewinkel
University of Luebeck
phone: +49 (451) 500 5516
email: krewink at inb.uni-luebeck.de

From lpritc at scri.sari.ac.uk Tue Apr 4 10:00:35 2006
From: lpritc at scri.sari.ac.uk (Leighton Pritchard)
Date: Tue, 04 Apr 2006 15:00:35 +0100
Subject: [Biopython-dev] BioSQL Loader.py patch
Message-ID: <1144159235.4725.10.camel@lplinuxdev>

Hi all,

Apologies for the multiple postings - I'm getting warnings about
suspicious headers, and on the last repost I forgot the subject line...

I'm not sure if you want this here, or on the BioPython BugZilla, but
I've written a patch that modifies BioSQL/Loader.py to load db_xref
qualifier/value pairs, and also contains a patch to correct bug 1921
caused by the attempted default insertion of an invalid taxon_id.

The .diff is attached. I've been living with the patched code for a
month without any issues, so it's been stable as far as I've needed it
to be (and has worked with all bacterial GenBank .gbk files and the
BioPython GenBank parser).

There are two sections of the patched code that print/write to stdout.
Is this an acceptable way of reporting to the user in a BioPython style?

Cheers,

L.

--
Dr Leighton Pritchard AMRSC
D131, Plant-Pathogen Interactions, Scottish Crop Research Institute
Invergowrie, Dundee, Scotland, DD2 5DA, UK
T: +44 (0)1382 562731 x2405 F: +44 (0)1382 568578
E: lpritc at scri.sari.ac.uk W: http://bioinf.scri.sari.ac.uk/lp
GPG/PGP: FEFC205C E58BA41B http://www.keyserver.net
(If the signature does not verify, please remove the SCRI disclaimer)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
DISCLAIMER: This email is from the Scottish Crop Research Institute, but
the views expressed by the sender are not necessarily the views of SCRI
and its subsidiaries. This email and any files transmitted with it are
confidential to the intended recipient at the e-mail address to which it
has been addressed.
It may not be disclosed or used by any other than that addressee.
If you are not the intended recipient you are requested to preserve this
confidentiality and you must not use, disclose, copy, print or rely on
this e-mail in any way. Please notify postmaster at scri.sari.ac.uk quoting
the name of the sender and delete the email from your system.
Although SCRI has taken reasonable precautions to ensure no viruses are
present in this email, neither the Institute nor the sender accepts any
responsibility for any viruses, and it is your responsibility to scan
the email and the attachments (if any).

From lpritc at scri.sari.ac.uk Tue Apr 4 10:19:01 2006
From: lpritc at scri.sari.ac.uk (Leighton Pritchard)
Date: Tue, 04 Apr 2006 15:19:01 +0100
Subject: [Biopython-dev] BioSQL Loader.py patch .diff
Message-ID: <1144160341.4725.13.camel@lplinuxdev>

I'm not having much fun with mailing lists, this afternoon - my
attachment appears to have gone missing. The contents of the .diff are
inline below:

1c1
< """Load biopyton objects into a BioSQL database for persistant storage.
---
> """Load biopython objects into a BioSQL database for persistant storage.
9a10
> import sys
226c227
<         taxon_id = "0" # inserted this because the taxon population code is out of date
---
>         #taxon_id = "0" # inserted this because the taxon population code is out of date
231a233,235
>         # removed taxon_id field, as it was causing difficulties with the
>         # schema - not inserting a value allows it to default to NULL,
>         # avoiding the foreign key constraint.
235d238
<             taxon_id,
249d251
<             %s,
252d253
<             taxon_id,
450,451c451,543
<             qualifier_key_id = self._get_term_id(qualifier_key,
<                                                  ontology_id = tag_ontology_id)
---
>             # Treat db_xref qualifiers differently to sequence annotation
>             # qualifiers by populating the seqfeature_dbxref and dbxref
>             # tables. Other qualifiers go into the seqfeature_qualifier_value
>             # and (if new) term tables.
>             if qualifier_key != 'db_xref':
>                 qualifier_key_id = self._get_term_id(qualifier_key,
>                                                      ontology_id=tag_ontology_id)
>                 # now add all of the values to their table
>                 for qual_value_rank in range(len(qualifiers [qualifier_key])):
>                     qualifier_value = qualifiers [qualifier_key][qual_value_rank]
>                     sql = r"INSERT INTO seqfeature_qualifier_value VALUES" \
>                           r" (%s, %s, %s, %s)"
>                     self.adaptor.execute(sql, (seqfeature_id,
>                                                qualifier_key_id,
>                                                qual_value_rank + 1,
>                                                qualifier_value))
>             else:
>                 # The dbxref_id qualifier/value sets go into the dbxref table
>                 # as dbname, accession, version tuples, with dbxref.dbxref_id
>                 # being automatically assigned, and into the seqfeature_dbxref
>                 # table as seqfeature_id, dbxref_id, and rank tuples
>                 self._load_seqfeature_dbxref(qualifiers [qualifier_key],
>                                              seqfeature_id)
>
>
>     def _load_seqfeature_dbxref(self, dbxrefs, seqfeature_id):
>         """ _load_seqfeature_dbxref(self, dbxrefs, seqfeature_id)
>
>             o dbxrefs           List, dbxref data from the source file in the
>                                 format :
>
>             o seqfeature_id     Int, the identifier for the seqfeature in the
>                                 seqfeature table
>
>             Insert dbxref qualifier data for a seqfeature into the
>             seqfeature_dbxref and, if required, dbxref tables.
>             The dbxref_id qualifier/value sets go into the dbxref table
>             as dbname, accession, version tuples, with dbxref.dbxref_id
>             being automatically assigned, and into the seqfeature_dbxref
>             table as seqfeature_id, dbxref_id, and rank tuples
>         """
>         # Dictionary of database types, keyed by GenBank db_xref abbreviation
>         db_dict = {'GeneID': 'Entrez',
>                    'GI': 'GeneIndex',
>                    'COG': 'COG',
>                    'CDD': 'CDD',
>                    'DDBJ': 'DNA Databank of Japan',
>                    'Entrez': 'Entrez',
>                    'GeneIndex': 'GeneIndex',
>                    'PUBMED': 'PubMed',
>                    'taxon': 'Taxon',
>                    'ATCC': 'ATCC',
>                    'ISFinder': 'ISFinder',
>                    'GOA': 'Gene Ontology Annotation',
>                    'ASAP': 'ASAP',
>                    'PSEUDO': 'PSEUDO',
>                    'InterPro': 'InterPro',
>                    'GEO': 'Gene Expression Omnibus',
>                    'EMBL': 'EMBL',
>                    'UniProtKB/Swiss-Prot': 'UniProtKB/Swiss-Prot',
>                    'ECOCYC': 'EcoCyc',
>                    'UniProtKB/TrEMBL': 'UniProtKB/TrEMBL'
>                    }
>         for rank, value in enumerate(dbxrefs):
>             # Split the DB:accession format string at colons. We have to
>             # account for multiple-line and multiple-accession entries
>             try:
>                 dbxref_data = value.replace(' ','').replace ('\n','').split(':')
>                 key = dbxref_data[0]
>                 accessions = dbxref_data[1:]
>             except:
>                 # Parsing fails - return
>                 # !!!!!!!!!!!!!!!!!!!!!!
>                 # !! IMPLEMENT RAISE ERROR MESSAGE HERE
>                 # !!!!!!!!!!!!!!!!!!!!!!
>                 print "Parsing of db_xref failed:", key, accession
>             if key not in db_dict:
>                 # Database is currently unknown, so add it to the db_dict
>                 # temporarily and print a message to stdout
>                 sys.stdout.write("%s not recognised as database type: " % key+\
>                                  "temporarily accepting key")
>                 db_dict[key] = key
>             db = db_dict[key]
>             # Loop over all the grabbed accessions, and attempt to fill the
>             # table
>             for accession in accessions:
>                 # Get the dbxref_id value for the dbxref data
>                 dbxref_id = self._get_dbxref_id(db, accession)
>                 # Insert the seqfeature_dbxref data
>                 self._get_seqfeature_dbxref(seqfeature_id, dbxref_id, rank+1)
>
>     def _get_dbxref_id(self, db, accession):
>         """ _get_dbxref_id(self, db, accession) -> Int
453,460c545,590
<             # now add all of the values to their table
<             for qual_value_rank in range(len(qualifiers [qualifier_key])):
<                 qualifier_value = qualifiers [qualifier_key][qual_value_rank]
<                 sql = r"INSERT INTO seqfeature_qualifier_value VALUES" \
<                       r" (%s, %s, %s, %s)"
<                 self.adaptor.execute(sql, (seqfeature_id,
<                   qualifier_key_id, qual_value_rank + 1, qualifier_value))
<
---
>             o db                String, the name of the external database containing
>                                 the accession number
>
>             o accession         String, the accession of the dbxref data
>
>             Finds and returns the dbxref_id for the passed data. The method
>             attempts to find an existing record first, and inserts the data
>             if there is no record.
>         """
>         # Check for an existing record
>         sql = r'SELECT dbxref_id FROM dbxref WHERE dbname = %s ' \
>               r'AND accession = %s'
>         dbxref_id = self.adaptor.execute_and_fetch_col0(sql, (db, accession))
>         # If there was a record, return the dbxref_id, else create the
>         # record and return the created dbxref_id
>         if dbxref_id:
>             return dbxref_id[0]
>         return self._add_dbxref(db, accession, 0)
>
>     def _get_seqfeature_dbxref(self, seqfeature_id, dbxref_id, rank):
>         """ Check for a pre-existing seqfeature_dbxref entry with the passed
>             seqfeature_id and dbxref_id. If one does not exist, insert new
>             data
>
>         """
>         # Check for an existing record
>         sql = r'SELECT seqfeature_id, dbxref_id FROM seqfeature_dbxref ' \
>               r'WHERE seqfeature_id = "%s" AND dbxref_id = "%s"'
>         result = self.adaptor.execute_and_fetch_col0(sql, (seqfeature_id,
>                                                            dbxref_id))
>         # If there was a record, return without executing anything, else create
>         # the record and return
>         if result:
>             return result
>         return self._add_seqfeature_dbxref(seqfeature_id, dbxref_id, rank)
>
>     def _add_seqfeature_dbxref(self, seqfeature_id, dbxref_id, rank):
>         """ Insert a seqfeature_dbxref row and return the seqfeature_id and
>             dbxref_id
>         """
>         sql = r'INSERT INTO seqfeature_dbxref VALUES' \
>               r'(%s, %s, %s)'
>         self.adaptor.execute(sql, (seqfeature_id, dbxref_id, rank))
>         return (seqfeature_id, dbxref_id)
>
>

From lpritc at scri.sari.ac.uk Tue Apr 4 09:39:20 2006
From: lpritc at scri.sari.ac.uk (Leighton Pritchard)
Date: Tue, 04 Apr 2006 14:39:20 +0100
Subject: [Biopython-dev] BioSQL Loader.py patch
Message-ID: <1144157963.4725.5.camel@lplinuxdev>

Hi all,

I'm not sure if you want this here, or on the BioPython BugZilla, but
I've written a patch that modifies BioSQL/Loader.py to load db_xref
qualifier/value pairs, and also contains a patch to correct bug 1921
caused by the attempted default insertion of an invalid taxon_id.

The .diff is attached. I've been living with the patched code for a
month without any issues, so it's been stable as far as I've needed it
to be (and has worked with all bacterial GenBank .gbk files and the
BioPython GenBank parser).

There are two sections of the patched code that print/write to stdout.
Is this an acceptable way of reporting to the user in a BioPython style?

Cheers,

L.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Loader.diff
Type: text/x-patch
Size: 8677 bytes
Desc: not available
Url : http://lists.open-bio.org/pipermail/biopython-dev/attachments/20060404/11c98fbc/attachment-0001.bin

From biopython-dev at maubp.freeserve.co.uk Tue Apr 4 14:55:24 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Tue, 04 Apr 2006 19:55:24 +0100
Subject: [Biopython-dev] BioSQL Loader.py patch
In-Reply-To: <1144157963.4725.5.camel@lplinuxdev>
References: <1144157963.4725.5.camel@lplinuxdev>
Message-ID: <4432C11C.5000109@maubp.freeserve.co.uk>

Leighton Pritchard wrote:
> Hi all,
>
> I'm not sure if you want this here, or on the BioPython BugZilla, but

Using bugzilla is probably a better idea as:

(a) It will look after patches rather than fighting email attachments
(b) It's easier for the developers to see in one place what is
    outstanding/in need of attention.

> I've written a patch that modifies BioSQL/Loader.py to load db_xref
> qualifier/value pairs, and also contains a patch to correct bug 1921
> caused by the attempted default insertion of an invalid taxon_id.

I had noticed bug 1921 when you logged it, but having never dabbled with
MySQL I didn't want to touch it.

> The .diff is attached. I've been living with the patched code for a
> month without any issues, so it's been stable as far as I've needed it
> to be (and has worked with all bacterial GenBank .gbk files and the
> BioPython GenBank parser).

Out of interest, are you running the CVS GenBank parser?

> There are two sections of the patched code that print/write to stdout.
> Is this an acceptable way of reporting to the user in a BioPython style?

Good question. I have seen some parts of the code simply using "print"
to output warning messages. Some of the PDB code explicitly directs its
warnings to std error.

Anyone?

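One possible answer to the reporting question, sketched here with the
standard library warnings module instead of print or sys.stdout.write.
This is only a minimal Python 3 sketch under stated assumptions, not code
from Loader.py: DbxrefWarning and resolve_db_name are hypothetical names
introduced for illustration, and unlike the patch the helper does not add
the unknown key to the dictionary.

```python
import warnings


class DbxrefWarning(UserWarning):
    """Hypothetical warning category for unrecognised db_xref databases."""


def resolve_db_name(key, known):
    """Map a db_xref abbreviation to a database name.

    Unknown abbreviations are passed through unchanged, and the caller is
    told about it through the warnings machinery rather than stdout.
    """
    if key not in known:
        warnings.warn(
            "%s not recognised as database type; accepting it as-is" % key,
            DbxrefWarning,
        )
        return key
    return known[key]
```

A caller can then decide what to do with the message, for example
silencing it with warnings.simplefilter("ignore", DbxrefWarning) or
turning it into an exception with the "error" action. By default,
warnings are written to stderr, which matches what the PDB code is
described as doing.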
Peter

From lpritc at scri.sari.ac.uk Thu Apr 6 05:28:48 2006
From: lpritc at scri.sari.ac.uk (Leighton Pritchard)
Date: Thu, 06 Apr 2006 10:28:48 +0100
Subject: [Biopython-dev] BioSQL Loader.py patch
In-Reply-To: <4432C11C.5000109@maubp.freeserve.co.uk>
References: <1144157963.4725.5.camel@lplinuxdev> <4432C11C.5000109@maubp.freeserve.co.uk>
Message-ID: <1144315729.4725.32.camel@lplinuxdev>

Hi all,

On Tue, 2006-04-04 at 19:55 +0100, Peter (BioPython Dev) wrote:
> Leighton Pritchard wrote:
> > I'm not sure if you want this here, or on the BioPython BugZilla, but
>
> Using bugzilla is probably a better idea

Cheers, I've put the patch on there, now.

> > The .diff is attached. I've been living with the patched code for a
> > month without any issues, so it's been stable as far as I've needed it
> > to be (and has worked with all bacterial GenBank .gbk files and the
> > BioPython GenBank parser).
>
> Out of interest, are you running the CVS GenBank parser?

Not at the moment. I've just installed it now, and I'll let you know if
there are any clashes with my update to Loader.py

From biopython-dev at maubp.freeserve.co.uk Wed Apr 12 16:13:52 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython-dev))
Date: Wed, 12 Apr 2006 21:13:52 +0100
Subject: [Biopython-dev] EMBL flatfile parsing
In-Reply-To: <20060403174806.GA22937@pc06.inb.mu-luebeck.de>
References: <20060403174806.GA22937@pc06.inb.mu-luebeck.de>
Message-ID: <443D5F80.6030701@maubp.freeserve.co.uk>

Albert Krewinkel wrote:
> Hello,
>
> I am trying to parse an EMBL-formatted file with biopython, but I
> couldn't find any working parser for this. When I try to use the
> Martel-based parser as described in one of the mailing list threads, I
> get the following error:
...
> Martel.Parser.ParserPositionException: error parsing at or beyond character 0
>
> The file itself appears to be okay, since it can be read by 'seqret'
> and bioperl. This seems to be a parser problem -- or am I doing
> something wrong?

This does sound like there may have been a file format change, and the
file no longer matches what BioPython is expecting.

Could you log a bug (based on your previous email) and attach an example
EMBL file? Or email it directly to me.

Thanks

Peter

P.S. Sorry for the delay in my reply - I was hoping someone familiar
with EMBL would step forward...

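A quick way to test the "file format change" theory before logging a bug
is to look at the first line of the file. The sketch below is Python 3
and is not part of BioPython or Martel; id_line_style is a hypothetical
helper, and the two ID line layouts it distinguishes (an older one whose
first field holds the entry name and data class, and a newer one whose
second semicolon-separated field is "SV n") are assumptions modelled on
EMBL's published examples, so check them against the current EMBL
documentation.

```python
def id_line_style(first_line):
    """Guess which EMBL ID line layout a record uses.

    Returns "new" if the second semicolon-separated field looks like a
    sequence version ("SV 1"), "old" for any other ID line, and "unknown"
    if the line is not an ID line at all.
    """
    if not first_line.startswith("ID "):
        return "unknown"
    body = first_line[2:].rstrip().rstrip(".")
    fields = [field.strip() for field in body.split(";")]
    if len(fields) >= 2 and fields[1].startswith("SV "):
        return "new"
    return "old"


# Illustrative sample lines (assumed layouts, not taken from Albert's file).
old_style = "ID   X56734     standard; RNA; PLN; 1859 BP."
new_style = "ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP."
print(id_line_style(old_style), id_line_style(new_style))
```

If the example file reports "new" while the embl65 definition was written
for the older layout, a failure "at or beyond character 0" would be the
expected symptom, since the very first line would not match.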
From biopython-dev at maubp.freeserve.co.uk Wed Apr 19 07:24:04 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython-dev))
Date: Wed, 19 Apr 2006 12:24:04 +0100
Subject: [Biopython-dev] EMBL flatfile parsing
In-Reply-To: <20060403174806.GA22937@pc06.inb.mu-luebeck.de>
References: <20060403174806.GA22937@pc06.inb.mu-luebeck.de>
Message-ID: <44461DD4.80306@maubp.freeserve.co.uk>

Albert Krewinkel wrote:
> Hello,
>
> I am trying to parse an EMBL-formatted file with biopython, but I
> couldn't find any working parser for this. When I try to use the
> Martel-based parser as described in one of the mailing list threads, I
> get the following error...

OK, we have the following files in BioPython:

Bio/formatdefs/embl.py (wrapper)
Bio/expressions/embl/__init__.py (dummy file)
Bio/expressions/embl/embl65.py (contains Martel definition)

According to the comments, this should read EMBL files in the format
from EMBL Nucleotide Sequence Database Release 65, December 2000.

They are now on release 86, and there have been changes to the file
format:

http://www.ebi.ac.uk/embl/Documentation/changesdetails.html

For example, the ID lines have changed, and the SV (sequence version)
line has been removed.

>
> Python 2.4.1 (#1, Oct 22 2005, 16:20:11)
> [GCC 4.0.0 20041026 (Apple Computer, Inc. build 4061)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
>
> >>> filename = '/Users/krewinkel/tmp/embltest.embl'
> >>> from Bio.formatdefs.embl import embl65
> >>> from xml.sax import saxutils
> >>> parser = embl65.make_parser()
> >>> parser.setContentHandler(saxutils.XMLGenerator())
> >>> parser.parse(open(filename))

That looks like it's based on Jeff Chang's email dated 23 July 2003, one
of the only mentions of EMBL that I could spot in the archives.

http://lists.open-bio.org/pipermail/biopython-dev/2003-July/001351.html

>
> Traceback (most recent call last):
>   File "", line 1, in ?
>   File "/opt/local/lib/python2.4/site-packages/Martel/Parser.py", line 482, in parse
>     self.parseFile(source.getCharacterStream() or source.getByteStream())
>   File "/opt/local/lib/python2.4/site-packages/Martel/Parser.py", line 468, in parseFile
>     self._err_handler.error(result)
>   File "/opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/xml/sax/handler.py", line 34, in error
>     raise exception
> Martel.Parser.ParserPositionException: error parsing at or beyond character 0

Same here, using your example file. The fact that it seems to be failing
right at the beginning suggests it is the change to the ID line that is
causing the problem (line one in the example file).

> The file itself appears to be okay, since it can be read by 'seqret'
> and bioperl. This seems to be a parser problem -- or am I doing
> something wrong?

It does look like an out of date file format definition in BioPython
(assuming that example code from Jeff Chang is fine).

Peter

From mdehoon at c2b2.columbia.edu Wed Apr 19 13:58:23 2006
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Wed, 19 Apr 2006 13:58:23 -0400
Subject: [Biopython-dev] [BioPython] Need help parsing Blastoutput
Message-ID: <6CA15ADD82E5724F88CB53D50E61C9AE9ECEF9@cgcmail.cgc.cpmc.columbia.edu>

Peter wrote:
> Have you noticed that there are some slight differences between the XML
> parser and the text parser results (single values versus lists with one
> entry)?
>
> i.e. As it stands, the XML parser is not quite a drop in replacement for
> existing code.

No, I was not aware of that. Can you give an example where the two
parsers give a different result?

--Michiel.

From biopython-dev at maubp.freeserve.co.uk Wed Apr 19 13:47:54 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Wed, 19 Apr 2006 18:47:54 +0100
Subject: [Biopython-dev] [BioPython] Need help parsing Blastoutput
In-Reply-To: <6CA15ADD82E5724F88CB53D50E61C9AE9ECEF7@cgcmail.cgc.cpmc.columbia.edu>
References: <6CA15ADD82E5724F88CB53D50E61C9AE9ECEF7@cgcmail.cgc.cpmc.columbia.edu>
Message-ID: <444677CA.3020403@maubp.freeserve.co.uk>

On the discussion list Michiel De Hoon wrote:
> A general question is if anybody still needs the parser for Blast text
> output. Currently, we are confusing our users by having a Blast text parser
> that tends to break. A broken parser may be worse than no parser.

Michiel,

Have you noticed that there are some slight differences between the XML
parser and the text parser results (single values versus lists with one
entry)?

i.e. As it stands, the XML parser is not quite a drop in replacement for
existing code.

Peter

From biopython-dev at maubp.freeserve.co.uk Wed Apr 19 14:46:10 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Wed, 19 Apr 2006 19:46:10 +0100
Subject: [Biopython-dev] [BioPython] Need help parsing Blastoutput
In-Reply-To: <6CA15ADD82E5724F88CB53D50E61C9AE9ECEF9@cgcmail.cgc.cpmc.columbia.edu>
References: <6CA15ADD82E5724F88CB53D50E61C9AE9ECEF9@cgcmail.cgc.cpmc.columbia.edu>
Message-ID: <44468572.4070003@maubp.freeserve.co.uk>

Michiel De Hoon wrote:
> Peter wrote:
>
>>Have you noticed that there are some slight differences between the XML
>>parser and the text parser results (single values versus lists with one
>>entry)?
>>
>>i.e. As it stands, the XML parser is not quite a drop in replacement for
>>existing code.
>
>
> No, I was not aware of that. Can you give an example where the two parsers
> give a different result?
>
> --Michiel.

As I recall, it wasn't different data, just a slightly different
format...

I've just been trying to get a matched pair of both plain text and XML
output to demonstrate this.

The online qblast "Text" appears to be slightly different to what the
current parser is expecting.

For standalone blast I only have RPS-BLAST databases on my local
machine, and the text output from RPS-BLAST is very different and cannot
be parsed by the current Standalone Blast parser.

If anyone has a matched set of Blast output files which BioPython can
parse they could email me that would be great. Might even turn it into
a short addition to the test suite. i.e. same data, in both the XML and
plain text formats. Maybe blastp or blastn output?

According to my notes, I was getting lists for the following with the
plain text output, which are now integers using the XML parser:

hsp.gaps
hsp.positives
hsp.identities

The list behaviour may have been my own fault, as that code was written
to use my modified standalone NCBI parser for use with RPS-BLAST...

Peter

From mdehoon at c2b2.columbia.edu Wed Apr 19 16:14:00 2006
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Wed, 19 Apr 2006 16:14:00 -0400
Subject: [Biopython-dev] [BioPython] Need help parsing Blastoutput
Message-ID: <6CA15ADD82E5724F88CB53D50E61C9AE9ECEFA@cgcmail.cgc.cpmc.columbia.edu>

> Peter wrote:
> If anyone has a matched set of Blast output files which BioPython can
> parse they could email me that would be great. Might even turn it into
> a short addition to the test suite. i.e. same data, in both the XML and
> plain text formats.

> According to my notes, I was getting lists for the following with the
> plain text output, which are now integers using the XML parser:

> hsp.gaps
> hsp.positives
> hsp.identities

I took the query from the blast text output from the first Blast test in
the Biopython test suite and ran it with the online blast, generating
XML and plain text output. The text-based parser chokes on the blast
text output, but anyway we can see from the text output what the result
should have been. With the XML parser, you are right that hsp.gaps,
hsp.positives, and hsp.identities are integers now, while they are lists
with the text-based parser (running the text-based parser on the blast
text output in the test suite gives indeed lists). What happens is that
if the Blast output looks like this:

 Identities = 28/87 (32%), Positives = 44/87 (50%), Gaps = 12/87 (13%)

then the text-based parser returns

hsp.identities = (28, 87)
hsp.positives = (44, 87)
hsp.gaps = (12, 87)

while the XML parser returns

hsp.identities = 28
hsp.positives = 44
hsp.gaps = 12

We can get the 87 from len(hsp.query).

Actually, I like the XML parser output a bit better, but we can change
it to the text parser's output if preferred.

Do you know of any other inconsistencies between the parsers?
If not, I suggest raising a deprecation warning with the text-based
Blast parser, so users won't waste time trying to figure out why it
doesn't work.

--Michiel.

From sbassi at gmail.com Wed Apr 19 17:09:33 2006
From: sbassi at gmail.com (Sebastian Bassi)
Date: Wed, 19 Apr 2006 18:09:33 -0300
Subject: [Biopython-dev] [BioPython] Need help parsing Blastoutput
In-Reply-To: <6CA15ADD82E5724F88CB53D50E61C9AE9ECEFA@cgcmail.cgc.cpmc.columbia.edu>
References: <6CA15ADD82E5724F88CB53D50E61C9AE9ECEFA@cgcmail.cgc.cpmc.columbia.edu>
Message-ID:

On 4/19/06, Michiel De Hoon wrote:
> If not, I suggest raising a deprecation warning with the text-based Blast
> parser, so users won't waste time trying to figure out why it doesn't work.

I agree with it. What is also needed is a warning in the Biopython
Cookbook and tutorials hosted on the biopython.org site (since that is
the first place most people look for documentation).

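The difference between the two parsers described in this thread can be
bridged on the caller's side. The following is a minimal Python 3 sketch,
not part of Bio.Blast: as_count_and_length is a hypothetical helper that
accepts either the (count, length) pair reported for the text parser or
the bare integer reported for the XML parser. The alignment length has to
be supplied separately in the second case, for example from
len(hsp.query) as suggested above.

```python
def as_count_and_length(value, align_length=None):
    """Normalise an HSP statistic to a (count, length) pair.

    value may be a (count, length) tuple or list, as described for the
    text parser, or a single integer, as described for the XML parser.
    align_length is only used when value is a single integer.
    """
    if isinstance(value, (tuple, list)):
        count, length = value
        return int(count), int(length)
    return int(value), align_length


# Values taken from the example output quoted in the thread.
print(as_count_and_length((28, 87)))   # text parser style
print(as_count_and_length(28, 87))     # XML parser style plus known length
```

Old scripts that index into the pair, such as hsp.identities[0], could be
pointed at a wrapper like this while the two parsers still disagree.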
--
Bioinformatics news: http://www.bioinformatica.info
Lriser: http://www.linspire.com/lraiser_success.php?serial=318

From biopython-dev at maubp.freeserve.co.uk Thu Apr 20 08:34:25 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython-dev))
Date: Thu, 20 Apr 2006 13:34:25 +0100
Subject: [Biopython-dev] [BioPython] blast text vs XML, was: Need help parsing Blastoutput
In-Reply-To: <6CA15ADD82E5724F88CB53D50E61C9AE9ECEFA@cgcmail.cgc.cpmc.columbia.edu>
References: <6CA15ADD82E5724F88CB53D50E61C9AE9ECEFA@cgcmail.cgc.cpmc.columbia.edu>
Message-ID: <44477FD1.3020708@maubp.freeserve.co.uk>

Peter wrote:
>>According to my notes, I was getting lists for the following with the
>>plain text output, which are now integers using the XML parser:
>>
>>hsp.gaps
>>hsp.positives
>>hsp.identities

Thanks for confirming that.

Michiel De Hoon wrote:
> Actually, I like the XML parser output a bit better, but we can change it to
> the text parser's output if preferred.

I agree that the XML parser output is much simpler. However, my gut
instinct is to preserve the old behaviour so that anyone with an old
script can simply swap the parser from plain text to XML and have
everything else "just work".

> Do you know of any other inconsistencies between the parsers?

No - but unless someone sits down with a pair of matched files and
compares the resulting data structures, we don't know for sure.

> If not, I suggest raising a deprecation warning with the text-based Blast
> parser, so users won't waste time trying to figure out why it doesn't work.

Not a bad idea. In addition, it would be nice if the text parser could
also check the first line to see if it's actually XML output and issue a
helpful error message. Or maybe even handle this transparently for the
user with just a warning message?

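The first-line check suggested here could look roughly like the sketch
below. It is written in Python 3 and is not taken from the BioPython text
parser: looks_like_xml is a hypothetical helper, and treating a leading
"<?xml" or "<!DOCTYPE" as the sign of XML output is an assumption about
how BLAST XML files begin.

```python
import io


def looks_like_xml(handle):
    """Report whether the first non-blank line of handle starts like XML.

    The handle is read from its current position, so a caller that wants
    to parse the same data afterwards needs to rewind or reopen it.
    """
    for line in handle:
        stripped = line.strip()
        if stripped:
            return stripped.startswith("<?xml") or stripped.startswith("<!DOCTYPE")
    return False


# Illustrative inputs (made up for this sketch, not real BLAST output).
xml_handle = io.StringIO('<?xml version="1.0"?>\n<BlastOutput>\n')
text_handle = io.StringIO("BLASTP 2.2.13\n\nQuery= example\n")
print(looks_like_xml(xml_handle), looks_like_xml(text_handle))
```

A text parser could call such a check up front and raise an error that
names the XML parser, which covers the "helpful error message" option.
Switching parsers transparently would need more care, since the two
parsers return slightly different data structures.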
At some point we should also change the default parameters for the blast commands in Bio/Blast/NCBIStandalone.py to default to XML output (as I did with the rpsblast support, using the -m 7 command line option).

Peter
From lpritc at scri.sari.ac.uk Tue Apr 4 14:19:01 2006
From: lpritc at scri.sari.ac.uk (Leighton Pritchard)
Date: Tue, 04 Apr 2006 15:19:01 +0100
Subject: [Biopython-dev] BioSQL Loader.py patch .diff
Message-ID: <1144160341.4725.13.camel@lplinuxdev>

I'm not having much fun with mailing lists, this afternoon - my attachment appears to have gone missing. The contents of the .diff are inline below:

1c1
< """Load biopyton objects into a BioSQL database for persistant storage.
---
> """Load biopython objects into a BioSQL database for persistant storage.
9a10
> import sys
226c227
<         taxon_id = "0" # inserted this because the taxon population code is out of date
---
>         #taxon_id = "0" # inserted this because the taxon population code is out of date
231a233,235
>         # removed taxon_id field, as it was causing difficulties with the
>         # schema - not inserting a value allows it to default to NULL,
>         # avoiding the foreign key constraint.
235d238
<          taxon_id,
249d251
<          %s,
252d253
<                                    taxon_id,
450,451c451,543
<             qualifier_key_id = self._get_term_id(qualifier_key,
<                                                  ontology_id = tag_ontology_id)
---
>             # Treat db_xref qualifiers differently to sequence annotation
>             # qualifiers by populating the seqfeature_dbxref and dbxref
>             # tables. Other qualifiers go into the seqfeature_qualifier_value
>             # and (if new) term tables.
>             if qualifier_key != 'db_xref':
>                 qualifier_key_id = self._get_term_id(qualifier_key,
>                                                      ontology_id=tag_ontology_id)
>                 # now add all of the values to their table
>                 for qual_value_rank in range(len(qualifiers[qualifier_key])):
>                     qualifier_value = qualifiers[qualifier_key][qual_value_rank]
>                     sql = r"INSERT INTO seqfeature_qualifier_value VALUES" \
>                           r" (%s, %s, %s, %s)"
>                     self.adaptor.execute(sql, (seqfeature_id,
>                                                qualifier_key_id,
>                                                qual_value_rank + 1,
>                                                qualifier_value))
>             else:
>                 # The dbxref_id qualifier/value sets go into the dbxref table
>                 # as dbname, accession, version tuples, with dbxref.dbxref_id
>                 # being automatically assigned, and into the seqfeature_dbxref
>                 # table as seqfeature_id, dbxref_id, and rank tuples
>                 self._load_seqfeature_dbxref(qualifiers[qualifier_key],
>                                              seqfeature_id)
>
>
>     def _load_seqfeature_dbxref(self, dbxrefs, seqfeature_id):
>         """ _load_seqfeature_dbxref(self, dbxrefs, seqfeature_id)
>
>             o dbxrefs           List, dbxref data from the source file in the
>                                 format :
>
>             o seqfeature_id     Int, the identifier for the seqfeature in the
>                                 seqfeature table
>
>             Insert dbxref qualifier data for a seqfeature into the
>             seqfeature_dbxref and, if required, dbxref tables.
>             The dbxref_id qualifier/value sets go into the dbxref table
>             as dbname, accession, version tuples, with dbxref.dbxref_id
>             being automatically assigned, and into the seqfeature_dbxref
>             table as seqfeature_id, dbxref_id, and rank tuples
>         """
>         # Dictionary of database types, keyed by GenBank db_xref abbreviation
>         db_dict = {'GeneID': 'Entrez',
>                    'GI': 'GeneIndex',
>                    'COG': 'COG',
>                    'CDD': 'CDD',
>                    'DDBJ': 'DNA Databank of Japan',
>                    'Entrez': 'Entrez',
>                    'GeneIndex': 'GeneIndex',
>                    'PUBMED': 'PubMed',
>                    'taxon': 'Taxon',
>                    'ATCC': 'ATCC',
>                    'ISFinder': 'ISFinder',
>                    'GOA': 'Gene Ontology Annotation',
>                    'ASAP': 'ASAP',
>                    'PSEUDO': 'PSEUDO',
>                    'InterPro': 'InterPro',
>                    'GEO': 'Gene Expression Omnibus',
>                    'EMBL': 'EMBL',
>                    'UniProtKB/Swiss-Prot': 'UniProtKB/Swiss-Prot',
>                    'ECOCYC': 'EcoCyc',
>                    'UniProtKB/TrEMBL': 'UniProtKB/TrEMBL'
>                    }
>         for rank, value in enumerate(dbxrefs):
>             # Split the DB:accession format string at colons. We have to
>             # account for multiple-line and multiple-accession entries
>             try:
>                 dbxref_data = value.replace(' ','').replace('\n','').split(':')
>                 key = dbxref_data[0]
>                 accessions = dbxref_data[1:]
>             except:
>                 # Parsing fails - return
>                 # !!!!!!!!!!!!!!!!!!!!!!
>                 # !! IMPLEMENT RAISE ERROR MESSAGE HERE
>                 # !!!!!!!!!!!!!!!!!!!!!!
>                 print "Parsing of db_xref failed:", key, accession
>             if key not in db_dict:
>                 # Database is currently unknown, so add it to the db_dict
>                 # temporarily and print a message to stdout
>                 sys.stdout.write("%s not recognised as database type: " % key+\
>                                  "temporarily accepting key")
>                 db_dict[key] = key
>             db = db_dict[key]
>             # Loop over all the grabbed accessions, and attempt to fill the
>             # table
>             for accession in accessions:
>                 # Get the dbxref_id value for the dbxref data
>                 dbxref_id = self._get_dbxref_id(db, accession)
>                 # Insert the seqfeature_dbxref data
>                 self._get_seqfeature_dbxref(seqfeature_id, dbxref_id, rank+1)
>
>     def _get_dbxref_id(self, db, accession):
>         """ _get_dbxref_id(self, db, accession) -> Int
453,460c545,590
<             # now add all of the values to their table
<             for qual_value_rank in range(len(qualifiers[qualifier_key])):
<                 qualifier_value = qualifiers[qualifier_key][qual_value_rank]
<                 sql = r"INSERT INTO seqfeature_qualifier_value VALUES" \
<                       r" (%s, %s, %s, %s)"
<                 self.adaptor.execute(sql, (seqfeature_id,
<                   qualifier_key_id, qual_value_rank + 1, qualifier_value))
<
---
>             o db          String, the name of the external database containing
>                           the accession number
>
>             o accession   String, the accession of the dbxref data
>
>             Finds and returns the dbxref_id for the passed data. The method
>             attempts to find an existing record first, and inserts the data
>             if there is no record.
>         """
>         # Check for an existing record
>         sql = r'SELECT dbxref_id FROM dbxref WHERE dbname = %s ' \
>               r'AND accession = %s'
>         dbxref_id = self.adaptor.execute_and_fetch_col0(sql, (db, accession))
>         # If there was a record, return the dbxref_id, else create the
>         # record and return the created dbxref_id
>         if dbxref_id:
>             return dbxref_id[0]
>         return self._add_dbxref(db, accession, 0)
>
>     def _get_seqfeature_dbxref(self, seqfeature_id, dbxref_id, rank):
>         """ Check for a pre-existing seqfeature_dbxref entry with the passed
>             seqfeature_id and dbxref_id. If one does not exist, insert new
>             data
>
>         """
>         # Check for an existing record
>         sql = r'SELECT seqfeature_id, dbxref_id FROM seqfeature_dbxref ' \
>               r'WHERE seqfeature_id = "%s" AND dbxref_id = "%s"'
>         result = self.adaptor.execute_and_fetch_col0(sql, (seqfeature_id,
>                                                            dbxref_id))
>         # If there was a record, return without executing anything, else create
>         # the record and return
>         if result:
>             return result
>         return self._add_seqfeature_dbxref(seqfeature_id, dbxref_id, rank)
>
>     def _add_seqfeature_dbxref(self, seqfeature_id, dbxref_id, rank):
>         """ Insert a seqfeature_dbxref row and return the seqfeature_id and
>             dbxref_id
>         """
>         sql = r'INSERT INTO seqfeature_dbxref VALUES' \
>               r'(%s, %s, %s)'
>         self.adaptor.execute(sql, (seqfeature_id, dbxref_id, rank))
>         return (seqfeature_id, dbxref_id)
>
>

--
Dr Leighton Pritchard AMRSC
D131, Plant-Pathogen Interactions, Scottish Crop Research Institute
Invergowrie, Dundee, Scotland, DD2 5DA, UK
T: +44 (0)1382 562731 x2405 F: +44 (0)1382 568578
E: lpritc at scri.sari.ac.uk W: http://bioinf.scri.sari.ac.uk/lp
GPG/PGP: FEFC205C E58BA41B http://www.keyserver.net
(If the signature does not verify, please remove the SCRI disclaimer)
From lpritc at scri.sari.ac.uk Tue Apr 4 13:39:20 2006
From: lpritc at scri.sari.ac.uk (Leighton Pritchard)
Date: Tue, 04 Apr 2006 14:39:20 +0100
Subject: [Biopython-dev] BioSQL Loader.py patch
Message-ID: <1144157963.4725.5.camel@lplinuxdev>

Hi all,

I'm not sure if you want this here, or on the BioPython BugZilla, but I've written a patch that modifies BioSQL/Loader.py to load db_xref qualifier/value pairs, and also contains a patch to correct bug 1921 caused by the attempted default insertion of an invalid taxon_id.

The .diff is attached. I've been living with the patched code for a month without any issues, so it's been stable as far as I've needed it to be (and has worked with all bacterial GenBank .gbk files and the BioPython GenBank parser).

There are two sections of the patched code that print/write to stdout. Is this an acceptable way of reporting to the user in a BioPython style?

Cheers,

L.

--
Dr Leighton Pritchard AMRSC
D131, Plant-Pathogen Interactions, Scottish Crop Research Institute
Invergowrie, Dundee, Scotland, DD2 5DA, UK
T: +44 (0)1382 562731 x2405 F: +44 (0)1382 568578
E: lpritc at scri.sari.ac.uk W: http://bioinf.scri.sari.ac.uk/lp
GPG/PGP: FEFC205C E58BA41B http://www.keyserver.net
(If the signature does not verify, please remove the SCRI disclaimer)

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Loader.diff
Type: text/x-patch
Size: 8677 bytes
Desc: not available
URL:

From biopython-dev at maubp.freeserve.co.uk Tue Apr 4 18:55:24 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev))
Date: Tue, 04 Apr 2006 19:55:24 +0100
Subject: [Biopython-dev] BioSQL Loader.py patch
In-Reply-To: <1144157963.4725.5.camel@lplinuxdev>
References: <1144157963.4725.5.camel@lplinuxdev>
Message-ID: <4432C11C.5000109@maubp.freeserve.co.uk>

Leighton Pritchard wrote:
> Hi all,
>
> I'm not sure if you want this here, or on the BioPython BugZilla, but

Using bugzilla is probably a better idea as:

(a) It will look after patches rather than fighting email attachments
(b) It's easier for the developers to see in one place what is outstanding/in need of attention.

> I've written a patch that modifies BioSQL/Loader.py to load db_xref
> qualifier/value pairs, and also contains a patch to correct bug 1921
> caused by the attempted default insertion of an invalid taxon_id.

I had noticed bug 1921 when you logged it, but having never dabbled with MySQL I didn't want to touch it.

> The .diff is attached. I've been living with the patched code for a
> month without any issues, so it's been stable as far as I've needed it
> to be (and has worked with all bacterial GenBank .gbk files and the
> BioPython GenBank parser).

Out of interest, are you running the CVS GenBank parser?

> There are two sections of the patched code that print/write to stdout.
> Is this an acceptable way of reporting to the user in a BioPython style?

Good question. I have seen some parts of the code simply using "print" to output warning messages. Some of the PDB code explicitly directs its warnings to standard error. Anyone?
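One option for the reporting question, sketched here as an illustration rather than as an agreed Biopython convention, is Python's standard `warnings` module: its messages go to standard error by default and callers can filter or silence them. The names `DbxrefWarning`, `KNOWN_DATABASES` and `resolve_db_name` are hypothetical, and unlike the patch this sketch does not add the unknown key to the dictionary.

```python
import warnings


class DbxrefWarning(UserWarning):
    """Hypothetical warning category for unrecognised db_xref databases."""


# Small illustrative subset of the abbreviation table from the patch.
KNOWN_DATABASES = {"GeneID": "Entrez", "GI": "GeneIndex", "taxon": "Taxon"}


def resolve_db_name(key, known=KNOWN_DATABASES):
    """Map a db_xref abbreviation to a database name, warning if unknown."""
    if key not in known:
        warnings.warn(
            "%s not recognised as database type: temporarily accepting key" % key,
            DbxrefWarning,
            stacklevel=2,
        )
        return key  # fall back to the abbreviation itself
    return known[key]
```

A caller who does not want the messages could then use `warnings.simplefilter("ignore", DbxrefWarning)`, which is not possible with a bare `print`.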
Peter

From lpritc at scri.sari.ac.uk Thu Apr 6 09:28:48 2006
From: lpritc at scri.sari.ac.uk (Leighton Pritchard)
Date: Thu, 06 Apr 2006 10:28:48 +0100
Subject: [Biopython-dev] BioSQL Loader.py patch
In-Reply-To: <4432C11C.5000109@maubp.freeserve.co.uk>
References: <1144157963.4725.5.camel@lplinuxdev> <4432C11C.5000109@maubp.freeserve.co.uk>
Message-ID: <1144315729.4725.32.camel@lplinuxdev>

Hi all,

On Tue, 2006-04-04 at 19:55 +0100, Peter (BioPython Dev) wrote:
> Leighton Pritchard wrote:
> > I'm not sure if you want this here, or on the BioPython BugZilla, but
>
> Using bugzilla is probably a better idea

Cheers, I've put the patch on there, now.

> > The .diff is attached. I've been living with the patched code for a
> > month without any issues, so it's been stable as far as I've needed it
> > to be (and has worked with all bacterial GenBank .gbk files and the
> > BioPython GenBank parser).
>
> Out of interest, are you running the CVS GenBank parser?

Not at the moment. I've just installed it now, and I'll let you know if there are any clashes with my update to Loader.py

--
Dr Leighton Pritchard AMRSC
D131, Plant-Pathogen Interactions, Scottish Crop Research Institute
Invergowrie, Dundee, Scotland, DD2 5DA, UK
T: +44 (0)1382 562731 x2405 F: +44 (0)1382 568578
E: lpritc at scri.sari.ac.uk W: http://bioinf.scri.sari.ac.uk/lp
GPG/PGP: FEFC205C E58BA41B http://www.keyserver.net
(If the signature does not verify, please remove the SCRI disclaimer)

From biopython-dev at maubp.freeserve.co.uk Wed Apr 12 20:13:52 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython-dev))
Date: Wed, 12 Apr 2006 21:13:52 +0100
Subject: [Biopython-dev] EMBL flatfile parsing
In-Reply-To: <20060403174806.GA22937@pc06.inb.mu-luebeck.de>
References: <20060403174806.GA22937@pc06.inb.mu-luebeck.de>
Message-ID: <443D5F80.6030701@maubp.freeserve.co.uk>

Albert Krewinkel wrote:
> Hello,
>
> I am trying to parse a EMBL-formated file with biopython, but I
> couldn't find any working parser for this. When I try to use the
> Martel-based parser as described in one of the mailinglist-threads, I
> get the following error:
...
> Martel.Parser.ParserPositionException: error parsing at or beyond character 0
>
> The file itself appears to be okay, since it can be read by 'seqret'
> and bioperl. This seems to be a parser problem -- or am I doing
> something wrong?

This does sound like there may have been a file format change, so that the file no longer matches what BioPython is expecting.

Could you log a bug (based on your previous email), and attach an example EMBL file? Or email it directly to me.

Thanks

Peter

P.S. Sorry for the delay in my reply - I was hoping someone familiar with EMBL would step forward...
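A quick way to test the format-change theory is to look at the first line of the file. The sketch below is only a diagnostic aid and rests on an assumption that is not stated in this thread: that the older EMBL ID line has four semicolon-separated fields after the "ID" tag, while the newer layout has seven, with the sequence version in the second field. `classify_embl_id_line` is a hypothetical name, not a Biopython function.

```python
def classify_embl_id_line(line):
    """Guess whether an EMBL ID line uses the older or the newer layout.

    Assumed layouts (not taken from this thread):
        older: ID   X56734     standard; RNA; PLN; 1859 BP.
        newer: ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.
    """
    if not line.startswith("ID "):
        return "not-an-id-line"
    # Drop the "ID" tag and the trailing full stop, then split on semicolons.
    fields = [f.strip() for f in line[2:].strip().rstrip(".").split(";")]
    if len(fields) == 7 and fields[1].startswith("SV "):
        return "new"
    if len(fields) == 4:
        return "old"
    return "unknown"
```

If the example file is classified as "new" while the format definition in use only describes the older layout, a failure at character 0 would be the expected symptom.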
From biopython-dev at maubp.freeserve.co.uk Wed Apr 19 11:24:04 2006
From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython-dev))
Date: Wed, 19 Apr 2006 12:24:04 +0100
Subject: [Biopython-dev] EMBL flatfile parsing
In-Reply-To: <20060403174806.GA22937@pc06.inb.mu-luebeck.de>
References: <20060403174806.GA22937@pc06.inb.mu-luebeck.de>
Message-ID: <44461DD4.80306@maubp.freeserve.co.uk>

Albert Krewinkel wrote:
> Hello,
>
> I am trying to parse a EMBL-formated file with biopython, but I
> couldn't find any working parser for this. When I try to use the
> Martel-based parser as described in one of the mailinglist-threads, I
> get the following error...

OK, we have the following files in BioPython:

Bio/formatdefs/embl.py (wrapper)
Bio/expressions/embl/__init__.py (dummy file)
Bio/expressions/embl/embl65.py (contains Martel definition)

According to the comments, this should read EMBL files in the format from EMBL Nucleotide Sequence Database Release 65, December 2000.

They are now on release 86, and there have been changes to the file format:

http://www.ebi.ac.uk/embl/Documentation/changesdetails.html

For example, the ID lines have changed, and the SV (sequence version) line has been removed.

>
> Python 2.4.1 (#1, Oct 22 2005, 16:20:11)
> [GCC 4.0.0 20041026 (Apple Computer, Inc. build 4061)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
>
> >>> filename = '/Users/krewinkel/tmp/embltest.embl'
> >>> from Bio.formatdefs.embl import embl65
> >>> from xml.sax import saxutils
> >>> parser = embl65.make_parser()
> >>> parser.setContentHandler(saxutils.XMLGenerator())
> >>> parser.parse(open(filename))

That looks like it's based on Jeff Chang's email dated 23 July 2003, one of the few mentions of EMBL that I could spot in the archives.

http://lists.open-bio.org/pipermail/biopython-dev/2003-July/001351.html

>
> Traceback (most recent call last):
> File "", line 1, in ?
> File "/opt/local/lib/python2.4/site-packages/Martel/Parser.py", line 482, in parse
> self.parseFile(source.getCharacterStream() or source.getByteStream())
> File "/opt/local/lib/python2.4/site-packages/Martel/Parser.py", line 468, in parseFile
> self._err_handler.error(result)
> File "/opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/xml/sax/handler.py", line 34, in error
> raise exception
> Martel.Parser.ParserPositionException: error parsing at or beyond character 0

Same here, using your example file. The fact that it seems to be failing right at the beginning suggests it is the change to the ID line that is causing the problem (line one in the example file).

> The file itself appears to be okay, since it can be read by 'seqret'
> and bioperl. This seems to be a parser problem -- or am I doing
> something wrong?

It does look like an out of date file format definition in BioPython (assuming that example code from Jeff Chang is fine).

Peter

From mdehoon at c2b2.columbia.edu Wed Apr 19 17:58:23 2006
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Wed, 19 Apr 2006 13:58:23 -0400
Subject: [Biopython-dev] [BioPython] Need help parsing Blastoutput
Message-ID: <6CA15ADD82E5724F88CB53D50E61C9AE9ECEF9@cgcmail.cgc.cpmc.columbia.edu>

Peter wrote:
> Have you noticed that there are some slight differences between the XML
> parser and the text parser results (single values versus lists with one
> entry)?
>
> i.e. As it stands, the XML parser is not quite a drop in replacement for
> existing code.

No, I was not aware of that. Can you give an example where the two parsers give a different result?

--Michiel.
From biopython-dev at maubp.freeserve.co.uk Wed Apr 19 17:47:54 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev)) Date: Wed, 19 Apr 2006 18:47:54 +0100 Subject: [Biopython-dev] [BioPython] Need help parsing Blastoutput In-Reply-To: <6CA15ADD82E5724F88CB53D50E61C9AE9ECEF7@cgcmail.cgc.cpmc.columbia.edu> References: <6CA15ADD82E5724F88CB53D50E61C9AE9ECEF7@cgcmail.cgc.cpmc.columbia.edu> Message-ID: <444677CA.3020403@maubp.freeserve.co.uk> On the discussion list Michiel De Hoon wrote: > A general question is if anybody still needs the parser for Blast text > output. Currently, we are confusing our users by having a Blast text parser > that tends to break. A broken parser may be worse than no parser. Michiel, Have you noticed that there are some slight differences between the XML parser and the text parser results (single values versus lists with one entry)? i.e. As it stands, the XML parser is not quite a drop in replacement for existing code. Peter From biopython-dev at maubp.freeserve.co.uk Wed Apr 19 18:46:10 2006 From: biopython-dev at maubp.freeserve.co.uk (Peter (BioPython Dev)) Date: Wed, 19 Apr 2006 19:46:10 +0100 Subject: [Biopython-dev] [BioPython] Need help parsing Blastoutput In-Reply-To: <6CA15ADD82E5724F88CB53D50E61C9AE9ECEF9@cgcmail.cgc.cpmc.columbia.edu> References: <6CA15ADD82E5724F88CB53D50E61C9AE9ECEF9@cgcmail.cgc.cpmc.columbia.edu> Message-ID: <44468572.4070003@maubp.freeserve.co.uk> Michiel De Hoon wrote: > Peter wrote: > >>Have you noticed that there are some slight differences between the XML >>parser and the text parser results (single values versus lists with one >>entry)? >> >>i.e. As it stands, the XML parser is not quite a drop in replacement for >>existing code. > > > No, I was not aware of that. Can you give an example where the two parsers > give a different result? > > --Michiel. As I recall, it wasn't different data, just a slightly different format... 
I've just been trying to get a matched pair of both plain text and XML output to demonstrate this.

The online qblast "Text" appears to be slightly different to what the current parser is expecting.

For standalone blast I only have RPS-BLAST databases on my local machine, and the text output from RPS-BLAST is very different and cannot be parsed by the current Standalone Blast parser.

If anyone has a matched set of Blast output files which BioPython can parse and could email it to me, that would be great. Might even turn it into a short addition to the test suite. i.e. same data, in both the XML and plain text formats. Maybe blastp or blastn output?

According to my notes, I was getting lists for the following with the plain text output, which are now integers using the XML parser:

hsp.gaps
hsp.positives
hsp.identities

The list behaviour may have been my own fault, as that code was written to use my modified standalone NCBI parser for use with RPS-BLAST...

Peter

From mdehoon at c2b2.columbia.edu Wed Apr 19 20:14:00 2006
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Wed, 19 Apr 2006 16:14:00 -0400
Subject: [Biopython-dev] [BioPython] Need help parsing Blastoutput
Message-ID: <6CA15ADD82E5724F88CB53D50E61C9AE9ECEFA@cgcmail.cgc.cpmc.columbia.edu>

> Peter wrote:
> If anyone has a matched set of Blast output files which BioPython can
> parse they could email me that would be great. Might even turn it into
> a short addition to the test suite. i.e. same data, in both the XML and
> plain text formats.

> According to my notes, I was getting lists for the following with the
> plain text output, which are now integers using the XML parser:
> hsp.gaps
> hsp.positives
> hsp.identities

I took the query from the blast text output from the first Blast test in the Biopython test suite and ran it with the online blast, generating XML and plain text output.
The text-based parser chokes on the blast text output, but anyway we can see from the text output what the result should have been. With the XML parser, you are right that hsp.gaps, hsp.positives, and hsp.identities are integers now, while they are lists with the text-based parser (running the text-based parser on the blast text output in the test suite gives indeed lists). What happens is that if the Blast output looks like this: Identities = 28/87 (32%), Positives = 44/87 (50%), Gaps = 12/87 (13%) then the text-based parser returns hsp.identities = (28, 87) hsp.positives = (44, 87) hsp.gaps = (12, 87) while the XML parser returns hsp.identities = 28 hsp.positives = 44 hsp.gaps = 12 ; we can get the 87 from len(hsp.query). Actually, I like the XML parser output a bit better, but we can change it to the text parser's output if preferred. Do you know of any other inconsistencies between the parsers? If not, I suggest raising a deprecation warning with the text-based Blast parser, so users won't waste time trying to figure out why it doesn't work. --Michiel.
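For scripts that have to accept HSPs from either parser, the integer-versus-pair difference in hsp.identities, hsp.positives and hsp.gaps discussed in this thread could be smoothed over with a small helper. This is a minimal sketch under stated assumptions: `as_count_pair` is a hypothetical name, the HSP objects are stand-ins built with `SimpleNamespace` rather than real Biopython records, and the alignment length is taken from len(hsp.query) as suggested in the thread.

```python
from types import SimpleNamespace


def as_count_pair(value, hsp):
    """Return (count, alignment_length) whichever parser produced `value`."""
    if isinstance(value, (tuple, list)):
        # Text-parser style: already a (count, length) pair.
        count, length = value
        return int(count), int(length)
    # XML-parser style: a bare count; recover the length from the query string.
    return int(value), len(hsp.query)


# Stand-in objects mimicking the two result shapes described in the thread.
text_hsp = SimpleNamespace(identities=(28, 87), query="A" * 87)
xml_hsp = SimpleNamespace(identities=28, query="A" * 87)

for hsp in (text_hsp, xml_hsp):
    count, length = as_count_pair(hsp.identities, hsp)
    print("%d/%d (%d%%)" % (count, length, round(100.0 * count / length)))
```

Normalising in the calling script leaves both parsers untouched; changing the XML parser to return pairs, as proposed for backwards compatibility, would make such a helper unnecessary.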