From lpritc at scri.ac.uk Thu Nov 1 13:39:26 2007
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Thu, 01 Nov 2007 17:39:26 +0000
Subject: [BioPython] Bioinformatics/Computational Biology Post
Message-ID: <1193938766.12359.5.camel@lplinuxdev.scri.sari.ac.uk>

Dear all,

We have a vacancy in SCRI's Plant Pathology Programme, where we use Python/BioPython fairly extensively (including for the GenomeDiagram visualisation tool mentioned in the advert). The closing date for applicants is 8th December, and the details are available at the site below:

http://www.scri.ac.uk/careers/vacancies/bioinformaticsresearcher

Please share the link with anyone you think may be interested in applying for the post.

"""
Plant pathology at SCRI has an international reputation for excellence and innovation. This post offers the opportunity to join our bioinformatics team, which contributed to many notable firsts in plant pathology, including: annotation of the Pectobacterium atrosepticum and Phytophthora infestans genome sequences, large-scale whole-genome comparative genomics of bacteria and oomycetes, and development of the comparative genomics visualisation tool, GenomeDiagram.

You will be responsible for specific areas of bioinformatics research contributing to the Globodera pallida genome sequencing project, metagenomic investigations of viral populations in soils and the analysis of plant-pathogen interactions in several different systems. You will also be expected to develop your own line of bioinformatics research, and eventually to obtain independent funding.
"""

Best wishes,

--
Dr Leighton Pritchard B.Sc.(Hons) MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland DD2 5DA
e:lpritc at scri.ac.uk w:http://bioinf.scri.ac.uk/lp
gpg/pgp: 0xFEFC205C
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
SCRI, Invergowrie, Dundee, DD2 5DA.
The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.

From mlds at unimelb.edu.au Thu Nov 1 17:52:27 2007
From: mlds at unimelb.edu.au (Mike Dyall-Smith)
Date: Fri, 2 Nov 2007 08:52:27 +1100
Subject: [BioPython] Cookbook:RenameFasta
In-Reply-To:
References:
Message-ID: <09384C7F-B174-4A84-9E89-FC3C0C4CB8D5@unimelb.edu.au>

Njm revised the code of Humberto at:

http://bio.scipy.org/wiki/index.php/RenameFastaSequences

to make it more idiomatic Python. I altered my discussion according to the new code. However, when I tested it on a fasta file I get an error:

  File "REnameFasta2.py", line 6, in <module>
    filename, basename = sys.argv
ValueError: too many values to unpack

If I revert back to the old code for this line, ie. replace
' filename, basename = sys.argv'
with
'filename = sys.argv[1] '
'basename = sys.argv[2]'

it works fine.
I assume there is some subtle error in assigning two names to sys.argv

Regards,
Mike D-S

From mdehoon at c2b2.columbia.edu Thu Nov 1 20:22:18 2007
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Thu, 1 Nov 2007 20:22:18 -0400
Subject: [BioPython] Cookbook:RenameFasta
References: <09384C7F-B174-4A84-9E89-FC3C0C4CB8D5@unimelb.edu.au>
Message-ID: <6243BAA9F5E0D24DA41B27997D1FD14402B650@mail2.exch.c2b2.columbia.edu>

> If I revert back to the old code for this line, ie. replace
> ' filename, basename = sys.argv'
> with
> 'filename = sys.argv[1] '
> 'basename = sys.argv[2]'
>
> it works fine.

There is also a sys.argv[0], which holds the name of the script. So sys.argv is a list with (at least) three elements. If you do 'filename, basename = sys.argv', you have two variables on the left-hand side and three values on the right-hand side. Instead, you could do

    temp, filename, basename = sys.argv
    # And ignore temp subsequently

or

    filename, basename = sys.argv[1:]

--Michiel.

Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032

From ericgibert at yahoo.fr Wed Nov 7 20:50:06 2007
From: ericgibert at yahoo.fr (Eric Gibert)
Date: Thu, 8 Nov 2007 01:50:06 +0000 (GMT)
Subject: [BioPython] from Bio import db
Message-ID: <413801.21408.qm@web26511.mail.ukl.yahoo.com>

Hello,

I just upgraded to BioPython 1.44 and when I try to run my previous script, I have the error:

Traceback (most recent call last):
  File "/home/eric/workspace/PySeq/src/GenbankSearch.py", line 18, in <module>
    parser = record_parser)
  File "/usr/lib64/python2.5/site-packages/Bio/GenBank/__init__.py", line 1283, in __init__
    from Bio import db
ImportError: cannot import name db

(I am on Fedora 7 64bit)

Any suggestions?

Thank you
Eric

From mdehoon at c2b2.columbia.edu Wed Nov 7 21:19:30 2007
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Wed, 7 Nov 2007 21:19:30 -0500
Subject: [BioPython] from Bio import db
References: <413801.21408.qm@web26511.mail.ukl.yahoo.com>
Message-ID: <6243BAA9F5E0D24DA41B27997D1FD14402B657@mail2.exch.c2b2.columbia.edu>

Dear Eric,

Some significant changes were needed in Biopython release 1.44 for reasons of compatibility with the new version of mxTextTools. Unfortunately, as you found, some code may break as a result. In your code:

> File "/home/eric/workspace/PySeq/src/GenbankSearch.py", line 18, in
> parser = record_parser)
> File "/usr/lib64/python2.5/site-packages/Bio/GenBank/__init__.py",
> line 1283, in __init__
> from Bio import db
> ImportError: cannot import name db

It looks like we missed an "import db" statement in Bio/GenBank/__init__.py. Can you show us the code leading up to this point?
I'm guessing that you are trying to use NCBIDictionary, but it would be helpful to see how exactly you are trying to use it, so that we can come up with a solution.

Sorry for the trouble.

--Michiel.

Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032

From mdehoon at c2b2.columbia.edu Wed Nov 7 22:44:14 2007
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Wed, 7 Nov 2007 22:44:14 -0500
Subject: [BioPython] FW: Re : from Bio import db
References: <881492.27912.qm@web26503.mail.ukl.yahoo.com>
Message-ID: <6243BAA9F5E0D24DA41B27997D1FD14402B658@mail2.exch.c2b2.columbia.edu>

Eric's answer pasted below...

-----Original Message-----
From: Eric Gibert [mailto:ericgibert at yahoo.fr]
Sent: Wed 11/7/2007 10:09 PM
To: Michiel De Hoon
Subject: Re : [BioPython] from Bio import db

Dear Michiel,

Here is the code. Sorry for the trouble. BTW, your guess is correct: NCBIDictionary...
Here is my "debugging" script:

--------------------------------------------------------------
import Bio
from Bio import GenBank
from BioSQL import BioSeqDatabase
import BioSQL.BioSeq

list_to_load = []

def iterSeq():
    for s in list_to_load:
        yield s

if __name__ == '__main__':
    gi_list = GenBank.search_for("Archineura")
    record_parser = GenBank.FeatureParser()
    ncbi_dict = GenBank.NCBIDictionary("nucleotide", "genbank",
                                       parser = record_parser)  # <--- this is the line 18 causing the error
    biodb = BioSeqDatabase.open_database(driver = "MySQLdb", user = "yyyyy", passwd = "xxxxx", host = "localhost", db = "BioSQL")
    selectedDb = biodb["allOdonata"]
    # gi_list.append('57282195')
    # gi_list.append('AJ459224')
    for seq in gi_list:
        gb_record = ncbi_dict[seq]
        print gb_record.id,
        try:
            db_record = selectedDb.lookup(version = gb_record.id)
            print "already present"
        except IndexError:
            print "absent", seq
            list_to_load.append(gb_record)
    #print list_to_load
    itS = iterSeq()
    selectedDb.load(itS)
----------------------------------------------------------------

Looking at the previous version of the script did not help me: the "import" is not in the def but in the script header; apart from that, there is no difference in the 3 statements of the NCBIDictionary class.

I hope you will find a solution.

Thank you,
Eric

From ericgibert at yahoo.fr Thu Nov 8 06:07:02 2007
From: ericgibert at yahoo.fr (Eric Gibert)
Date: Thu, 8 Nov 2007 11:07:02 +0000 (GMT)
Subject: [BioPython] small "bug" correction in package BioSql
Message-ID: <823663.39006.qm@web26515.mail.ukl.yahoo.com>

Dear all,

In BioSQL/BioSeq.py, in the class DBSeq definition, we have the function:

def _retrieve_seq(adaptor, primary_id):
    seqs = adaptor.execute_and_fetchall(
        "SELECT alphabet, length(seq) FROM biosequence" \
        " WHERE bioentry_id = %s", (primary_id,))
    if seqs:
        moltype, length = seqs[0]
        moltype = moltype.lower()  # <-- EG as "DNA" is found in my database!
        from Bio.Alphabet import IUPAC
        if moltype == "dna":
            alphabet = IUPAC.unambiguous_dna
        elif moltype == "rna":
            alphabet = IUPAC.unambiguous_rna
        elif moltype == "protein":
            alphabet = IUPAC.protein
        else:
            raise AssertionError("Unknown moltype: %s" % moltype)
        seq = DBSeq(primary_id, adaptor, alphabet, 0, int(length))
        return seq
    else:
        return None

Please note my correction: force moltype to lower case, as my database has upper-case values, which otherwise raise the "Unknown moltype" error.

Alternatively, we could ask the SQL statement to return a lower-case version of "alphabet", but I do not know if this function is standard across all databases...

Might be good to add in the standard package.

Eric

From biopython at maubp.freeserve.co.uk Thu Nov 8 06:40:00 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 08 Nov 2007 11:40:00 +0000
Subject: [BioPython] small "bug" correction in package BioSql
In-Reply-To: <823663.39006.qm@web26515.mail.ukl.yahoo.com>
References: <823663.39006.qm@web26515.mail.ukl.yahoo.com>
Message-ID: <4732F590.6050505@maubp.freeserve.co.uk>

Eric Gibert wrote:
> Dear all,
>
> In BioSQL/BioSeq.py, in the class DBSeq definition, we have the
> function:
>
> ...
>
> Please note my correction: force moltype to lower case, as my
> database has upper-case values, which otherwise raise the
> "Unknown moltype" error.

Hi Eric, I've made your suggested change in CVS, biopython/BioSQL/BioSeq.py revision 1.13, thank you.

I would encourage you to investigate why some of the "alphabet" fields in the biosequence table are in upper case. There could be a bug elsewhere which is writing these entries with the wrong alphabet. Is this affecting all entries, or just some?

Peter

From ericgibert at yahoo.fr Thu Nov 8 06:49:12 2007
From: ericgibert at yahoo.fr (Eric Gibert)
Date: Thu, 8 Nov 2007 11:49:12 +0000 (GMT)
Subject: [BioPython] Re : small "bug" correction in package BioSql
Message-ID: <762277.43372.qm@web26507.mail.ukl.yahoo.com>

Dear Peter,

All the alphabet values are "DNA" (upper case) in my database. The sequences are taken from NCBI by a BioJava application. Thus it must be BioJava that inserts the records with "DNA", so there is no potential "hidden bug" in BioPython.

Maybe a point to share with the Open-Bio committee.

Eric

From biopython at maubp.freeserve.co.uk Thu Nov 8 07:14:18 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 08 Nov 2007 12:14:18 +0000
Subject: [BioPython] from Bio import db
In-Reply-To: <6243BAA9F5E0D24DA41B27997D1FD14402B657@mail2.exch.c2b2.columbia.edu>
References: <413801.21408.qm@web26511.mail.ukl.yahoo.com> <6243BAA9F5E0D24DA41B27997D1FD14402B657@mail2.exch.c2b2.columbia.edu>
Message-ID: <4732FD9A.3080404@maubp.freeserve.co.uk>

>> ImportError: cannot import name db
>
> It looks like we missed an "import db" statement in Bio/GenBank/__init__.py.
> Can you show us the code leading up to this point? I'm guessing that you are
> trying to use NCBIDictionary, but it would be helpful to see how exactly you
> are trying to use it, so that we can come up with a solution.
>
> Sorry for the trouble.

I think the problem was introduced in Biopython 1.44 by disabling some "magic" code in Bio/__init__.py which created Bio.db at runtime, which was then imported in Bio/GenBank/__init__.py

The following one line patch to the NCBIDictionary class in Bio/GenBank/__init__.py seems to fix this:

diff -r1.77 __init__.py
1283c1283
< from Bio import db
---
> from Bio.config.DBRegistry import db

i.e. change line 1293 from "from Bio import db" to "from Bio.config.DBRegistry import db" in the NCBIDictionary class.

Eric, could you try this on your setup?

Peter

Note to self: We need to add the NCBIDictionary to the GenBank unit test

From ericgibert at yahoo.fr Thu Nov 8 07:36:46 2007
From: ericgibert at yahoo.fr (Eric Gibert)
Date: Thu, 8 Nov 2007 12:36:46 +0000 (GMT)
Subject: [BioPython] Re : from Bio import db
Message-ID: <206354.77387.qm@web26512.mail.ukl.yahoo.com>

Dear Peter,

Yes, this fixes the error, *thank you*.
NB: line is 1283, not 1293 (a little typo, but maybe important for future reference, I do not know).

Then I have subsequent errors, unrelated to the current topic [which is fixed]. Let me first investigate before sending another mail, on another topic/bug report.

Eric

From ericgibert at yahoo.fr Thu Nov 8 09:21:31 2007
From: ericgibert at yahoo.fr (Eric Gibert)
Date: Thu, 8 Nov 2007 14:21:31 +0000 (GMT)
Subject: [BioPython] Bug in BioSQL/Loader.py
Message-ID: <353717.53467.qm@web26514.mail.ukl.yahoo.com>

Dear all,

Bug 1) I noticed that the SQL statement "INSERT INTO bioentry...." in line 229 is missing one %s.
I added it and it works fine... until the bug on the next command:

    bioentry_id = self.adaptor.last_id('bioentry')

which runs into the long-standing bug 2) in DBUtils.py line 34: MySQL is now using lastrowid and no longer insert_id()

Correction as per below:
---------------------------------------------
class Mysql_dbutils(Generic_dbutils):
    def last_id(self, cursor, table):
        return cursor.lastrowid  # <-- EG original command: cursor.insert_id()
_dbutils["MySQLdb"] = Mysql_dbutils
-----------------------------------------------

Then this leads me to the following bug 3) --- or maybe this is *not* a bug? --- I explain:

In my BioSQL database, the table 'seqfeature_qualifier_value' has the following schema:

    seqfeature_id   int(11)
    term_id         int(11)
    value           varchar(255)
    rank            int(11)

Note that first we have 'value', then we have 'rank'.

But the 'INSERT INTO seqfeature_qualifier_value' statement found in BioSQL/Loader.py line 453 is:

    qualifier_value = qualifiers[qualifier_key][qual_value_rank]
    sql = r"INSERT INTO seqfeature_qualifier_value VALUES" \
          r" (%s, %s, %s, %s)"
    self.adaptor.execute(sql, (seqfeature_id, qualifier_key_id,
                               qual_value_rank + 1, qualifier_value))

thus I need to swap the last two elements of the parameter tuple. As I said, I do not know if my BioSQL schema is correct or not. If my schema is correct then my correction is obvious:

    self.adaptor.execute(sql, (seqfeature_id, qualifier_key_id,
                               qualifier_value,  # EG invert the two last params
                               qual_value_rank + 1 ))

------------------

Finally, the script executes without error and .... nothing happens! It looks like there is no 'commit' anywhere, and so the new records are not inserted in the database.

Although the psycopg database enjoys a:

    def autocommit(self, conn, y = True):
        conn.autocommit(y)
_dbutils["psycopg"] = Psycopg_dbutils

MySQL does not have such an overload for 'autocommit' in DBUtils.py. Could this fix the problem?
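The two DB-API details raised in this message (reading the ID of a newly inserted row from cursor.lastrowid, and new rows staying invisible until a commit) can be illustrated with a small sketch. It is only an illustration under stated assumptions: it is written for Python 3 and uses the standard-library sqlite3 module as a stand-in for MySQLdb, and the table name bioentry_demo and the helper functions are made up for the example. It does not show how BioSQL itself handles commits.

```python
import os
import sqlite3
import tempfile


def count_rows(path):
    """Count rows as seen by a separate, freshly opened connection."""
    conn = sqlite3.connect(path)
    try:
        return conn.execute("SELECT COUNT(*) FROM bioentry_demo").fetchone()[0]
    finally:
        conn.close()


def demo_commit():
    """Return (new_id, rows_seen_before_commit, rows_seen_after_commit)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")
        conn = sqlite3.connect(path)
        try:
            # Hypothetical demo table, not part of the BioSQL schema.
            conn.execute(
                "CREATE TABLE bioentry_demo (id INTEGER PRIMARY KEY, name TEXT)"
            )
            conn.commit()
            cur = conn.cursor()
            # sqlite3 uses "?" placeholders where MySQLdb uses "%s".
            cur.execute(
                "INSERT INTO bioentry_demo (name) VALUES (?)", ("EF631597",)
            )
            new_id = cur.lastrowid      # ID generated by this cursor's INSERT
            before = count_rows(path)   # pending row, not yet committed
            conn.commit()               # make the INSERT permanent
            after = count_rows(path)
        finally:
            conn.close()
        return new_id, before, after


if __name__ == "__main__":
    print(demo_commit())
```

The second connection is opened only to show what another client of the database would see; whether a missing commit is what affects the script discussed here is not established by this sketch.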
In the file MySQLdb/connections.py, on line 213, we have:

    # PEP-249 requires autocommit to be initially off
    self.autocommit(False)

Therefore the source for the Mysql_dbutils class is now:

class Mysql_dbutils(Generic_dbutils):
    def last_id(self, cursor, table):
        return cursor.lastrowid  #EG original command: cursor.insert_id()
    def autocommit(self, conn, y = True):  # EG addition as by default it is set to False
        conn.autocommit(y)
_dbutils["MySQLdb"] = Mysql_dbutils

Unfortunately, this is *NOT* fixing the lack of 'commit'. I need your help...

Regards,
Eric

From hlapp at gmx.net Thu Nov 8 10:53:03 2007
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 8 Nov 2007 10:53:03 -0500
Subject: [BioPython] small "bug" correction in package BioSql
In-Reply-To: <762277.43372.qm@web26507.mail.ukl.yahoo.com>
References: <762277.43372.qm@web26507.mail.ukl.yahoo.com>
Message-ID:

Indeed Biojava uses uppercase for alphabet. In Bioperl-db, we explicitly lowercase the value found for alphabet, and the comment says why:

    # Note: Biojava uses upper-case terms for alphabet, so we
    # need to change to all-lower in case the sequence was
    # manipulated by Biojava.
    $obj->alphabet(lc($rows->[3])) if $rows->[3];

However, when inserting sequences, we leave the value as is in BioPerl (which is lowercase), leading to a potential problem for Biojava upon retrieval. Do the Biojava folks deal with that? Should this maybe be harmonized across the board?

-hilmar

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From hlapp at gmx.net Thu Nov 8 10:59:49 2007
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 8 Nov 2007 10:59:49 -0500
Subject: [BioPython] Bug in BioSQL/Loader.py
In-Reply-To: <353717.53467.qm@web26514.mail.ukl.yahoo.com>
References: <353717.53467.qm@web26514.mail.ukl.yahoo.com>
Message-ID:

On Nov 8, 2007, at 9:21 AM, Eric Gibert wrote:

> qualifier_value = qualifiers[qualifier_key][qual_value_rank]
> sql = r"INSERT INTO seqfeature_qualifier_value VALUES" \
>       r" (%s, %s, %s, %s)"

Not enumerating the columns in the INSERT clause is dangerous programming I think. This should be fixed, and should be fixed for all statements where it is an issue.

Though BioSQL has been (and will likely continue to be) extremely stable compared to many other schemas, it would be a shame if even adding columns isn't possible because some of the Bio* projects don't enumerate the columns explicitly.

My $0.02.

-hilmar

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From biopython at maubp.freeserve.co.uk Thu Nov 8 10:39:13 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 08 Nov 2007 15:39:13 +0000
Subject: [BioPython] Bug in BioSQL/Loader.py
In-Reply-To: <353717.53467.qm@web26514.mail.ukl.yahoo.com>
References: <353717.53467.qm@web26514.mail.ukl.yahoo.com>
Message-ID: <47332DA1.8060002@maubp.freeserve.co.uk>

Eric Gibert wrote:
> Dear all,
>
> ... in DBUtils.py line 34: MySQL is now using lastrowid and no longer insert_id()
>
> Correction as per below:
> ---------------------------------------------
> class Mysql_dbutils(Generic_dbutils):
>     def last_id(self, cursor, table):
>         return cursor.lastrowid  # <-- EG original command: cursor.insert_id()
> _dbutils["MySQLdb"] = Mysql_dbutils
> -----------------------------------------------

Sounds like one of the issues raised on bug 2390

http://bugzilla.open-bio.org/show_bug.cgi?id=2390

Nice to have my suggested fix confirmed; I'll probably check that in tonight. The auto-commit thing also looks relevant too...

Peter

From biopython at maubp.freeserve.co.uk Thu Nov 8 11:56:53 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 08 Nov 2007 16:56:53 +0000
Subject: [BioPython] Bug in BioSQL/Loader.py
In-Reply-To:
References: <353717.53467.qm@web26514.mail.ukl.yahoo.com>
Message-ID: <47333FD5.2030504@maubp.freeserve.co.uk>

Hilmar Lapp wrote:
> On Nov 8, 2007, at 9:21 AM, Eric Gibert wrote:
>
>> qualifier_value = qualifiers[qualifier_key][qual_value_rank]
>> sql = r"INSERT INTO seqfeature_qualifier_value VALUES" \
>>       r" (%s, %s, %s, %s)"
>
> Not enumerating the columns in the INSERT clause is dangerous
> programming I think. This should be fixed, and should be fixed for
> all statements where it is an issue.

I agree with you 100% on this issue.
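The point about enumerating columns can be made concrete with a small sketch. It is only an illustration under stated assumptions: it is written for Python 3 and uses the standard-library sqlite3 module rather than MySQLdb, the table is created with the column order reported earlier in this thread (value before rank), and the helper function names are made up for the example. It is not the patch attached to the bug report.

```python
import sqlite3

# Column order as reported earlier in the thread: "value" comes before "rank".
SCHEMA = """
CREATE TABLE seqfeature_qualifier_value (
    seqfeature_id INTEGER,
    term_id       INTEGER,
    value         TEXT,
    "rank"        INTEGER
)
"""


def insert_positional(conn, seqfeature_id, term_id, rank, value):
    """Fragile: the parameters must match the table's physical column order."""
    conn.execute(
        "INSERT INTO seqfeature_qualifier_value VALUES (?, ?, ?, ?)",
        (seqfeature_id, term_id, rank, value),
    )


def insert_named(conn, seqfeature_id, term_id, rank, value):
    """Robust: each parameter is bound to an explicitly named column."""
    conn.execute(
        "INSERT INTO seqfeature_qualifier_value"
        ' (seqfeature_id, term_id, "rank", value) VALUES (?, ?, ?, ?)',
        (seqfeature_id, term_id, rank, value),
    )


def stored_value_and_rank(insert):
    """Insert one qualifier with the given helper and read it back."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(SCHEMA)
        insert(conn, 1, 2, 1, "gene")
        return conn.execute(
            'SELECT value, "rank" FROM seqfeature_qualifier_value'
        ).fetchone()
    finally:
        conn.close()


if __name__ == "__main__":
    print("named columns:     ", stored_value_and_rank(insert_named))
    print("positional columns:", stored_value_and_rank(insert_positional))
```

With the named form the statement keeps working whichever way round the two columns were created, and it would also survive columns being added to the table later.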
As I still haven't made the time to set up a BioSQL database on my machine, I would be grateful if someone could check the patch on newly filed Bug 2394,

http://bugzilla.open-bio.org/show_bug.cgi?id=2394

Thanks

Peter

From ericgibert at yahoo.fr Thu Nov 8 12:50:44 2007
From: ericgibert at yahoo.fr (Eric Gibert)
Date: Thu, 8 Nov 2007 17:50:44 +0000 (GMT)
Subject: [BioPython] error on insert new sequences from GenBank: no annotations saved in BioSQL database
Message-ID: <499834.44468.qm@web26501.mail.ukl.yahoo.com>

Dear all,

When I retrieve a BioSQL.BioSeq.DBSeqRecord which was inserted previously by my BioJava application, I have:

    print "Debug on Seq:", Seq.id, "=", Seq.annotations.keys()

Debug on Seq: AJ459190.1 = ['ORIGIN', 'DIVISION', 'genbank_accessions', 'TITLE', 'cross_references', 'data_file_division', 'VERSION', 'references', 'KEYWORDS', 'GI', 'SIZE', 'DEFINITION', 'REFERENCE', 'MDAT', 'ORGANISM', 'JOURNAL', 'ACCESSION', 'LOCUS', 'SOURCE', 'PUBMED', 'AUTHORS', 'TYPE', 'CIRCULAR']

but a freshly inserted BioSeq by BioPython 1.44 only gives me:

Debug on Seq: EF631597.1 = ['cross_references', 'dates', 'references', 'gi', 'data_file_division']

Once I look in the table bioentry_qualifier_value

* 20 records for a Sequence imported by BioJava
* 1 only for a Sequence inserted by BioPython: the date which should be inserted by "_load_bioentry_date" in BioSQL/Loader.py

Quite a few annotations missing, no?

Any idea?

Eric

From biopython at maubp.freeserve.co.uk Thu Nov 8 14:18:47 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 08 Nov 2007 19:18:47 +0000
Subject: [BioPython] error on insert new sequences from GenBank: no annotations saved in BioSQL database
In-Reply-To: <499834.44468.qm@web26501.mail.ukl.yahoo.com>
References: <499834.44468.qm@web26501.mail.ukl.yahoo.com>
Message-ID: <47336117.2010102@maubp.freeserve.co.uk>

Eric Gibert wrote:
> Once I look in the table bioentry_qualifier_value
>
> * 20 records for a Sequence imported by BioJava
> * 1 only for a Sequence inserted by BioPython: the date which should be inserted by "_load_bioentry_date" in BioSQL/Loader.py
>
> Quite a few annotations missing, no?
>
> Any idea?

So Biopython is recording nothing in table bioentry_qualifier_value (apart from the date), but is recording other essential things (in other tables) like the sequence itself?

Could you double check your schema, as from the issue I filed as bug 2394 based on your earlier email, your schema doesn't seem to be up to date:

http://bugzilla.open-bio.org/show_bug.cgi?id=2394

Peter

From lists.steve at arachnedesign.net Thu Nov 8 14:32:09 2007
From: lists.steve at arachnedesign.net (Steve Lianoglou)
Date: Thu, 8 Nov 2007 14:32:09 -0500
Subject: [BioPython] Compiling from CVS on OS X
Message-ID:

Hi all,

I was having problems compiling biopython from source, specifically getting the Bio/Cluster/clustermodule.c file to compile well.

The problem was that the system wasn't finding the `Numeric/arrayobject.h` file for inclusion.

I "fixed" it by editing the setup.py file and adding '/opt/local/include/python2.4' to the include_dirs param on line 474 so it would pick up the files in my python install (that's just where the numeric header was installed from macports).

Is this the expected way to achieve this, or is there some envi-var, or site.cfg to tweak to do this correctly (or is my python install whacky from the get go?)
FWIW I'm using python 2.4 installed via macports.

Thanks,
-steve

From hlapp at gmx.net Thu Nov 8 15:28:19 2007
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 8 Nov 2007 15:28:19 -0500
Subject: [BioPython] error on insert new sequences from GenBank: no annotations saved in BioSQL database
In-Reply-To: <499834.44468.qm@web26501.mail.ukl.yahoo.com>
References: <499834.44468.qm@web26501.mail.ukl.yahoo.com>
Message-ID:

Maybe we need to hold some mini-hackathon to make the different
toolkits compatible in how they map annotation to the schema.

Obviously I don't know whether you have the latest Biojava setup
here, but I'll just comment how BioPerl/Bioperl-db would map this:

'ORIGIN' - if I'm not mistaken this is only a token that introduces the actual sequence. I'm not sure what Biojava is storing as value here.

'DIVISION' - this maps to column division in table bioentry (though I agree that if perfectly following the weak typing principle this should be tag/value association, but at present it's still an actual column)

'genbank_accessions' - secondary accession numbers indeed go into the qualifier value table. The primary accession maps to column accession in table bioentry

'TITLE' - this is part of a publication reference, and should map to column title in table reference (which it does in bioperl-db)

'cross_references' - not sure where these would be coming from in GenBank format; for EMBL this will map to the dbxref table

'data_file_division' - not sure what this is (same as DIVISION?)

'VERSION' - in BioPerl we parse this apart into a version for the accession (which is column version in table bioentry) and the GI number, which maps to column identifier in table bioentry

'references' - these map to table reference (and bioentry_reference for association with the bioentry)

'KEYWORDS' - indeed these map to bioentry_qualifier_value

'GI' - maps to column identifier in table bioentry

'SIZE' - not sure what size that is. If it is the length of the sequence, it should (and in BioPerl/bioperl-db does) map to column length in table biosequence

'DEFINITION' - maps to column description in table bioentry

'REFERENCE' - should be the same as for 'references'

'MDAT' - not sure what this is

'ORGANISM' - this is the organism and maps to the table taxon (and taxon_name), with a foreign key in bioentry pointing to the taxon

'JOURNAL' - this is part of a reference, see 'references'

'ACCESSION' - the primary accession, maps to column accession in table bioentry

'LOCUS' - in the file itself this is an entire line consisting of multiple fields; BioPerl/bioperl-db maps the locus name (the first token after the literal token LOCUS) to column name in table bioentry

'SOURCE' - this is the organism, see 'ORGANISM'

'PUBMED' - this is part of a literature reference, and maps to a foreign key in the reference table (reference.dbxref) to a dbxref entry with PUBMED or PMID as the database and the pubmed ID as the accession

'AUTHORS' - part of a literature reference, maps to column authors in table reference

'TYPE' - not sure what this is. If it's the alphabet, it maps to table biosequence, column alphabet

'CIRCULAR' - this at present indeed maps to bioentry_qualifier_value, though there have been plans to make it a column in table biosequence. Note that this could in fact be the way Biojava stores it too, but upon retrieval represents it in the way you are seeing it.
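The single-column entries in the list above can be collected into a small lookup table, which may be handy when comparing what two toolkits wrote to the same database. This is only a sketch of the mapping as described in this message for BioPerl/bioperl-db; the names GENBANK_TO_BIOSQL and describe_target are hypothetical, and keys whose target spans several tables (for example 'ORGANISM' or 'PUBMED') are left out.

```python
# (table, column) targets as described in the list above. A column of
# None means the value is stored as a tag/value row rather than in a
# dedicated column. 'SIZE' and 'TYPE' are included on the assumption
# that they are the sequence length and the alphabet respectively.
GENBANK_TO_BIOSQL = {
    "DIVISION": ("bioentry", "division"),
    "ACCESSION": ("bioentry", "accession"),
    "GI": ("bioentry", "identifier"),
    "DEFINITION": ("bioentry", "description"),
    "LOCUS": ("bioentry", "name"),
    "SIZE": ("biosequence", "length"),
    "TYPE": ("biosequence", "alphabet"),
    "TITLE": ("reference", "title"),
    "AUTHORS": ("reference", "authors"),
    "KEYWORDS": ("bioentry_qualifier_value", None),
    "CIRCULAR": ("bioentry_qualifier_value", None),
}


def describe_target(key):
    """Return a short human-readable description of where a key is stored."""
    table, column = GENBANK_TO_BIOSQL[key]
    if column is None:
        return "%s (tag/value row)" % table
    return "%s.%s" % (table, column)


gi_target = describe_target("GI")
keywords_target = describe_target("KEYWORDS")
```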
Hth,

	-hilmar

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================

From hlapp at gmx.net Thu Nov 8 15:30:29 2007
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 8 Nov 2007 15:30:29 -0500
Subject: [BioPython] [Biojava-l] small "bug" correction in package BioSql
In-Reply-To: <473336E6.6000100@ebi.ac.uk>
References: <762277.43372.qm@web26507.mail.ukl.yahoo.com> <473336E6.6000100@ebi.ac.uk>
Message-ID: <9FF48B4B-74F1-4371-BBB5-541F1A70D88F@gmx.net>

It seems BioPerl and Biopython both want (and have traditionally
used) lowercase - do you mind going with that for Biojava as well, or
alternatively, simply map upon insert/update and retrieve?

	-hilmar

On Nov 8, 2007, at 11:18 AM, Richard Holland wrote:

> we do need a consensus here.
>
> I'm happy to go with whatever value is chosen, as the BioJava code can
> easily be modified to suit.
>
> cheers,
> Richard
>
> Hilmar Lapp wrote:
>> Indeed Biojava uses uppercase for alphabet. In Bioperl-db, we
>> explicitly lowercase the value found for alphabet, and the comment
>> says why:
>>
>>     # Note: Biojava uses upper-case terms for alphabet, so we
>>     # need to change to all-lower in case the sequence was
>>     # manipulated by Biojava.
>>     $obj->alphabet(lc($rows->[3])) if $rows->[3];
>>
>> However, when inserting sequences, we leave the value as is in
>> BioPerl (which is lowercase), leading to a potential problem for
>> Biojava upon retrieval. Do the Biojava folks deal with that? Should
>> this maybe be harmonized across the board?
>>
>>     -hilmar
>>
>> On Nov 8, 2007, at 6:49 AM, Eric Gibert wrote:
>>
>>> Dear Peter,
>>>
>>> All the alphabets are "DNA" (upper case) in my database. The
>>> sequences are taken from NCBI by a BioJava application.
>>> Thus it should be that BioJava inserts the records with "DNA". Thus
>>> no potential "hidden bug" in BioPython.
>>>
>>> Maybe a point to share with the Open-Bio committee.
>>>
>>> Eric
>>>
>>> ----- Original Message ----
>>> From: Peter
>>> To: Eric Gibert
>>> Cc: biopython at lists.open-bio.org
>>> Sent: Thursday, 8 November 2007, 19:40:00
>>> Subject: Re: [BioPython] small "bug" correction in package BioSql
>>>
>>> Eric Gibert wrote:
>>>> Dear all,
>>>>
>>>> In BioSeq/BioSeq.py, in the class DBSeq definition, we have the
>>>> function:
>>>>
>>>> ...
>>>>
>>>> please note my correction: force moltype to be turned into lower
>>>> case, as my database has upper case values, which raise the
>>>> "Unknown moltype" error.
>>> Hi Eric, I've made your suggested change in CVS,
>>> biopython/BioSQL/BioSeq.py revision 1.13, thank you.
>>>
>>> I would encourage you to investigate why some of the "alphabet"
>>> fields in the biosequence table are in upper case. There could be
>>> a bug elsewhere which is writing these entries with the wrong
>>> alphabet. Is this affecting all entries, or just some?
>>>
>>> Peter

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================

From mdehoon at c2b2.columbia.edu Thu Nov 8 19:58:42 2007
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Thu, 8 Nov 2007 19:58:42 -0500
Subject: [BioPython] Compiling from CVS on OS X
References:
Message-ID: <6243BAA9F5E0D24DA41B27997D1FD14402B65A@mail2.exch.c2b2.columbia.edu>

> The problem was that the system wasn't finding the
> `Numeric/arrayobject.h` file for inclusion. I "fixed" it by editing the
> setup.py file and adding '/opt/local/include/python2.4' to the
> include_dirs param on line 474 so it would pick up the files in my
> python install (that's just where numeric header was installed from
> macports).

By default, the directory containing the Python include files is searched for
header files during compilation. If you install Numerical Python from source,
it will put its header files in the same location, and no editing of setup.py
is needed.

If, on the other hand, you use a precompiled package, it may be installed in
a different place. For some reason, the package from macports puts the
Numerical Python header files in a non-standard place, which is why they
could not be found. Apparently macports assumes that Python is installed in
/opt/local/; this may very well be the installation directory used by
macports for Python itself.

In other words, macports uses a non-standard directory; there is no way for
Python to know about it, so the header files cannot be found during
compilation.

--Michiel.

Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032

From ericgibert at yahoo.fr Fri Nov 9 08:35:12 2007
From: ericgibert at yahoo.fr (Eric Gibert)
Date: Fri, 9 Nov 2007 21:35:12 +0800
Subject: [BioPython] error on insert new sequences from GenBank: no annotations saved in BioSQL database
In-Reply-To:
References: <499834.44468.qm@web26501.mail.ukl.yahoo.com>
Message-ID: <000601c822d5$5d811c20$6400a8c0@Gecko>

Dear Hilmar,

Thank you for this reply. Now I would like to know where BioPython has
stored "SOURCE" or "ORGANISM" in BioSQL? I cannot find them.

Then, supposing they are somewhere, how can I get them back?

Thank you

Eric

From lists.steve at arachnedesign.net Fri Nov 9 09:43:15 2007
From: lists.steve at arachnedesign.net (Steve Lianoglou)
Date: Fri, 9 Nov 2007 09:43:15 -0500
Subject: [BioPython] Compiling from CVS on OS X
In-Reply-To: <6243BAA9F5E0D24DA41B27997D1FD14402B65A@mail2.exch.c2b2.columbia.edu>
References: <6243BAA9F5E0D24DA41B27997D1FD14402B65A@mail2.exch.c2b2.columbia.edu>
Message-ID: <2E1132E9-DB5C-4416-AC11-68FCE83EDF5C@arachnedesign.net>

Sorry ... didn't cc the list:

Hi Michiel,

> If, on the other hand, you use a precompiled package, it may be
> installed in a different place. For some reason, the package from
> macports puts the Numerical Python header files in a non-standard
> place, which is why they could not be found. Apparently macports
> assumes that Python is installed in /opt/local/; this may very well
> be the installation directory used by macports for Python itself.
>
> In other words, macports uses a non-standard directory, there is no
> way for Python to know about it, so the header files cannot be found
> during compilation.

I see.

I could imagine that many folks might be using macports (or similar)
to help them manage some of their dependencies. Do you think it's a
good idea if we add the ability to add some custom include paths in a
more non-intrusive n00b friendly way, like through a site.cfg file,
for example?
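
As a rough idea of what such a site.cfg mechanism could look like, here is a minimal sketch in Python 3 using the standard configparser module. Everything about it is an assumption for illustration: the section name, the option name, the colon-separated format, and the helper read_include_dirs are hypothetical and not part of Biopython's setup.py. The second path in the example is purely illustrative.

```python
import configparser

# Hypothetical site.cfg content. The first path is the macports
# location mentioned in this thread; the second is illustrative only.
EXAMPLE_SITE_CFG = """
[build_ext]
include_dirs = /opt/local/include/python2.4:/usr/local/include
"""


def read_include_dirs(cfg_text, section="build_ext", option="include_dirs"):
    """Return the extra include directories listed in the config text.

    Directories are separated by ':'; a missing section or option
    simply yields an empty list, so the file stays optional.
    """
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    if not parser.has_option(section, option):
        return []
    raw = parser.get(section, option)
    return [path.strip() for path in raw.split(":") if path.strip()]


extra_include_dirs = read_include_dirs(EXAMPLE_SITE_CFG)
```

A setup.py could then append extra_include_dirs to the include_dirs argument it already passes for the C extensions, so nobody would need to edit the script itself.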

Thanks,
-steve

From ericgibert at yahoo.fr Sat Nov 10 06:16:40 2007
From: ericgibert at yahoo.fr (Eric Gibert)
Date: Sat, 10 Nov 2007 19:16:40 +0800
Subject: [BioPython] error on insert new sequences from GenBank: no annotations saved in BioSQL database
In-Reply-To: <47336117.2010102@maubp.freeserve.co.uk>
References: <499834.44468.qm@web26501.mail.ukl.yahoo.com> <47336117.2010102@maubp.freeserve.co.uk>
Message-ID: <001c01c8238b$2ec64070$6400a8c0@Gecko>

Dear Peter,

My problem is not that I do not have entries in the tables, but that
no interpretation of the features is performed.

Example: In the tutorial and in BioJava, 'source' is an annotation:

# from the Biopython Tutorial and Cookbook
print "from: %s" % seq_record.annotations['source']

This returns a "KeyError: 'source'"

On the other hand, after some tinkering, I found that I can get a
feature from the list Seq.features with the type='source' which contains a
"qualifiers['organism']"... Quite cumbersome. But maybe there is another way,
more straightforward, that I did not find. Can you tell me?

-------- For your information, I went through the tables of my BioSQL database:

Here are my findings with BioPython insertion in BioSQL using
"myDataBase.load(list_of_seq)":
(note: one test sequence was fetched by GenBank.download_many() and the other
using GenBank.NCBIDictionary)

1) table bioentry: all columns populated except for 'taxon_id' which is NULL (maybe I need an extra call for populating the 'taxon' table before?) (FYI BioJava sequences are not filling all columns correctly)

2) table bioentry_dbxref: no data inserted (always empty, even with BioJava)

3) table bioentry_qualifier_value: One entry only, for the 'term_id' = 149, rank = 1, and value = '07-JUL-2005' or other 'DD-MMM-YYYY' dates (see my remarks below)

4) table bioentry_reference: two records per sequence with reference_id correctly mapping the 'reference' table, rank, start_pos and end_pos also correctly filled

5) table bioentry_relationships: no entry found (always empty, even with BioJava)

6) table biosequence: one entry per seq, the 'seq' field is correct. Note: the 'version' is set to 0 whereas it should be 1... (length is correct and we have "dna" in lower case :-) )

7) table comment: no entry found (always empty, even with BioJava)

8) table dbxref: some records are generated, for dbname 'PUBMED' and 'Taxon' with the correct value (FYI: I think that my BioJava is not managing this table...)

9) table dbxref_qualifier_value: always empty, even with BioJava

10) table location: all locations loaded correctly, note that 'term_id' and 'dbxref_id' remain NULL for these seq but I have values for other seq.

11) table location_qualifier_value: always empty, even with BioJava

12) table ontology: some rows but not related to the sequences

13) table reference: entries correct, note 'dbxref_id' remains NULL for these seq but I have values for other seq.

14) table seqfeature: entries are there (same as in table 'location'). FYI: 'display_name' is always NULL.

15) table seqfeature_dbxref: always empty, even with BioJava

16) table seqfeature_qualifier_value: filled correctly

17) table seqfeature_relationship: always empty, even with BioJava

18) table taxon: always empty, even with BioJava

19) table taxon_name: I have one but not from this test (I tried to tinker a little bit with taxon but stopped)

20) table term: always empty, even with BioJava

21) table term_dbxref: always empty, even with BioJava

22) table term_relationship_term: have some entries

23) table term_synonym: always empty, even with BioJava

------------

Thank you

Eric

From hlapp at gmx.net Sat Nov 10 15:42:45 2007
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 10 Nov 2007 15:42:45 -0500
Subject: [BioPython] error on insert new sequences from GenBank: no annotations saved in BioSQL database
In-Reply-To: <000601c822d5$5d811c20$6400a8c0@Gecko>
References: <499834.44468.qm@web26501.mail.ukl.yahoo.com> <000601c822d5$5d811c20$6400a8c0@Gecko>
Message-ID: <3F4D1638-5D86-4AC2-9D40-A77C4271B598@gmx.net>

On Nov 9, 2007, at 8:35 AM, Eric Gibert wrote:

> Thank you for this reply. Now I would like to know where BioPython has
> stored "SOURCE" or "ORGANISM" in BioSQL? I cannot find them.
>
> Then, supposing they are somewhere, how can I get them back?

Just to clarify, I'm not a Biopython developer. I was merely
commenting from the BioSQL perspective, with maybe the background of
how we use it in the BioPerl language binding.

	-hilmar

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================

From hlapp at gmx.net Sat Nov 10 15:38:17 2007
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 10 Nov 2007 15:38:17 -0500
Subject: [BioPython] error on insert new sequences from GenBank: no annotations saved in BioSQL database
In-Reply-To: <001c01c8238b$2ec64070$6400a8c0@Gecko>
References: <499834.44468.qm@web26501.mail.ukl.yahoo.com> <47336117.2010102@maubp.freeserve.co.uk> <001c01c8238b$2ec64070$6400a8c0@Gecko>
Message-ID: <5DDEBCDE-C8DA-4B2C-86F4-47FDB82CADAC@gmx.net>

Just a few comments below, specifically where no rows would in fact
be what I expect:

On Nov 10, 2007, at 6:16 AM, Eric Gibert wrote:

> [...]
> -------- For your information, I went through the tables of my BioSQL
> database:
> [...]
> 1) table bioentry: all columns populated except for 'taxon_id' which
> is NULL
> (maybe I need an extra call for populating the 'taxon' table before?)

Bioperl-db will try to look up (or create if necessary) the taxon
from the taxon information attached to the sequence, but for BioPerl
we actually recommend to pre-load the database with the NCBI
taxonomy, which can be comfortably done with the script
load_ncbi_taxonomy.pl that comes with BioSQL.

> 2) table bioentry_dbxref: no data inserted (always empty, even with
> BioJava)

This would mean that the sequence(s) have no dbxrefs. Note that for
GenBank sequences that would be expected, since unfortunately, and
unlike EMBL format, GenBank puts the dbxrefs into the feature table.

> 3) table bioentry_qualifier_value:
>
> One entry only, for the 'term_id' = 149, rank = 1, and value =
> '07-JUL-2005' or other 'DD-MMM-YYYY' dates (see my remarks below)

Below you say that your term table is empty, so I don't know why you
can have a value here at all.

> [...]
> 5) table bioentry_relationships: no entry found (always empty, even
> with BioJava)

If you load sequences, they won't have direct relationships to other
sequences (except dbxrefs, but those are rather 'pointers' and are
stored in their own table). In Bioperl-db, this table is used only if
you load sequence clusters through Bio::Cluster objects (such as
UniGene).

> [...]
> 7) table comment: no entry found (always empty, even with BioJava)

Again, this is expected with GenBank. AFAIK genbank format doesn't
allow for comments at the level of the sequence. You would (i.e.,
should) find entries here if you load UniProt entries.

> 8) table dbxref: some records are generated, for dbname 'PUBMED'
> and 'Taxon' with the correct value

Taxon obviously isn't really a dbxref, but rather a taxon (and hence
should go into that table).

> [...]
> 9) table dbxref_qualifier_value: (always empty, even with BioJava)

That's almost expected. There are rather few cases where dbxrefs have
additional attributes that the language can parse out from a source
(and then maps to the schema).

> [...]
> 10) table location: all locations loaded correctly, note that
> 'term_id' and
> 'dbxref_id' remain NULL for these seq but I have value for other seq.

Theoretically, the term_id should point to the term giving the type
of the location. If you (or Biopython) are only dealing with simple
('normal') locations, then it's not needed.

The dbxref_id gives the reference to the remote sequence if the
location for a feature refers to a different sequence than the
feature itself does (so-called 'remote locations'). If the sequences
you loaded don't have such locations, then this would be expected to
be empty (or if Biopython doesn't handle such locations).

> 11) table location_qualifier_value: always empty, even with BioJava

This is expected if Biopython doesn't support fuzzy locations, or if
none of the feature locations that you loaded are fuzzy.

> [...]
> 13) Table reference: entries correct, note 'dbxref_id' remains NULL
> for these seq but I have value for other seq.

It should point to the pubmed ID for the reference but only if there
was one.

> 14) table seqfeature: entries are there (same as in table 'location').
> FYI:'display_name is always NULL.

GenBank doesn't give names to features (and I think neither does
EMBL), so this is expected.

> 15) table seqfeature_dbxref: always empty, even with BioJava

That's likely more to do with your language object model than with
anything else. dbxref annotation for features is in tag/value pairs,
just as any other, so your language (Biopython in this case) will
have to do a lot of interpretation to tease out the semantics behind
each tag name and based on that decide what to do with the value.
Indeed, by default we don't even do this in BioPerl.

> [...]
> 17) table seqfeature_relationship: always empty, even with BioJava

GenBank (and EMBL) feature tables are flat, not hierarchical, so this
is expected.

> 18) table taxon: always empty, even with BioJava)

This is where the organism should go.

> 19) table taxon_name: I have one but not from this test (I tried to
> tinker a little bit with taxon but stopped)

That's odd that you can have an entry in taxon_name w/o a
corresponding one in taxon. Do you have foreign key checks disabled?

> 20) table term: always empty, even with BioJava

That's strange, since you say you do have rows in
bioentry_qualifier_value, which has an enforced foreign key to term.
Did you disable the foreign key checks?

> 21) table term_dbxref: always empty, even with BioJava

That's expected unless you loaded an ontology whose terms have
dbxrefs, and your language object model supports that.

> [...]
> 23) table term_synonym: always empty, even with BioJava

Same as for 21). Your terms would have to have synonyms, and your
language object model would have to support those, before you could
expect to get anything in here.

	-hilmar

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================

From ericgibert at yahoo.fr Sat Nov 10 21:11:54 2007
From: ericgibert at yahoo.fr (Eric Gibert)
Date: Sun, 11 Nov 2007 10:11:54 +0800
Subject: [BioPython] Taxon/organism/source in Biopython
In-Reply-To: <001c01c8238b$2ec64070$6400a8c0@Gecko>
References: <499834.44468.qm@web26501.mail.ukl.yahoo.com><47336117.2010102@maubp.freeserve.co.uk> <001c01c8238b$2ec64070$6400a8c0@Gecko>
Message-ID: <002a01c82408$3f386d20$6400a8c0@Gecko>

I found one answer to my question:

in BioSQL/Loader.py (Biopython version 1.44), inside the function
_load_bioentry_table, there is a *commented* call to

# taxon_id = self._get_taxon_id(record)

I uncommented it and also modified the "INSERT INTO bioentry" statement
slightly below to include the taxon_id column and value.

Thereafter, the call to self._get_taxon_id ensures that the taxon is handled
and inserted in the database.

The Sequence.annotations now contains:
- 'taxonomy': ['the Genus', 'theSpecies']
- 'ncbi_taxid' : 123456L
- 'organism' : 'thespecies'

Which is exactly what I was looking for :-)

Attention: inside the _get_taxon_id function, in the section starting with
the comment "# XXX -- Brad:......", inserts are performed without checking
prior existence of 'taxon': although it is clear that the lowest taxon is
not in the database already (or else we would have already returned from
the function), INSERT of a higher level without a prior existence check is
not correct: I have imported two sequences of the same genus but different
species and the GENUS has been created twice. This is also due to the fact
that the genus does not have an ncbi_taxon_id...

Thus I propose to first check whether the taxon is already in the table
before insertion, based on

SELECT taxon_id from taxon_name where name=%s and name_class='scientific name'

What do you think?
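
The proposed existence check can be sketched as a small "look up, then insert only if missing" helper. This is an illustration only, not the Loader.py code: it uses Python's built-in sqlite3 module (with "?" placeholders) in place of the real database driver, a cut-down version of the taxon and taxon_name tables, a hypothetical function name get_or_create_taxon, and a made-up genus name. It also assumes that a scientific name identifies a taxon uniquely.

```python
import sqlite3

# Simplified stand-ins for the BioSQL taxon tables (assumed columns).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE taxon (
        taxon_id INTEGER PRIMARY KEY,
        ncbi_taxon_id INTEGER,
        node_rank TEXT
    );
    CREATE TABLE taxon_name (
        taxon_id INTEGER NOT NULL REFERENCES taxon(taxon_id),
        name TEXT NOT NULL,
        name_class TEXT NOT NULL
    );
    """
)


def get_or_create_taxon(conn, name, node_rank=None, ncbi_taxon_id=None):
    """Return the taxon_id for a scientific name, inserting it if absent."""
    row = conn.execute(
        "SELECT taxon_id FROM taxon_name"
        " WHERE name = ? AND name_class = 'scientific name'",
        (name,),
    ).fetchone()
    if row is not None:
        # Already present: reuse it instead of creating a duplicate.
        return row[0]
    cur = conn.execute(
        "INSERT INTO taxon (ncbi_taxon_id, node_rank) VALUES (?, ?)",
        (ncbi_taxon_id, node_rank),
    )
    taxon_id = cur.lastrowid
    conn.execute(
        "INSERT INTO taxon_name (taxon_id, name, name_class)"
        " VALUES (?, ?, 'scientific name')",
        (taxon_id, name),
    )
    return taxon_id


# Loading two species of the same (made-up) genus asks for it twice.
first = get_or_create_taxon(conn, "Examplegenus", node_rank="genus")
second = get_or_create_taxon(conn, "Examplegenus", node_rank="genus")
taxon_count = conn.execute("SELECT COUNT(*) FROM taxon").fetchone()[0]
```

With the check in place, the second call returns the existing taxon_id and the taxon table still holds a single row for the genus.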
PS1: Hilmar, I was typing this mail when I received your mail commenting on the taxon manipulation. As you can see, this provides some answers. Thank you for taking the time to detail the tables content: as you guessed, I only access NCBI :-) PS2: Hilmar mentions that BioPerl has a function 'load_ncbi_taxonomy.pl'. Does BioPython has one too (I could not find one)? If there is none, shall we/I try to provide one? -----Original Message----- From: biopython-bounces at lists.open-bio.org [mailto:biopython-bounces at lists.open-bio.org] On Behalf Of Eric Gibert Sent: Saturday, November 10, 2007 7:17 PM To: biopython at lists.open-bio.org Subject: Re: [BioPython] error on insert new sequences from GenBank: noannotations saved in BioSQL database Dear Peter, My problem is not that I do not have entries in the tables but it is that no interpretation on the feature is perform. Example: In the tutorial and in BioJava, 'source' is an annotation: # from the Biopython Tutorial and Cookbook print "from: %s" % seq_record.annotations['source'] This returns a "KeyError: 'source'" On the other hand, after some tinkering, I found that I can have a a feature from the list Seq.features with the type='source' which contains a "qualifiers['organism']"... Quite cumbersome. But maybe there is another way, more straight forward, that I did not find. Can you tell me? From hlapp at gmx.net Sat Nov 10 22:41:15 2007 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 10 Nov 2007 22:41:15 -0500 Subject: [BioPython] Taxon/organism/source in Biopython In-Reply-To: <002a01c82408$3f386d20$6400a8c0@Gecko> References: <499834.44468.qm@web26501.mail.ukl.yahoo.com><47336117.2010102@maubp.freeserve.co.uk> <001c01c8238b$2ec64070$6400a8c0@Gecko> <002a01c82408$3f386d20$6400a8c0@Gecko> Message-ID: On Nov 10, 2007, at 9:11 PM, Eric Gibert wrote: > PS2: Hilmar mentions that BioPerl has a function > 'load_ncbi_taxonomy.pl'. It is BioSQL that has this script, not BioPerl. 
The script doesn't depend on BioPerl, only on Perl (which almost every system has installed). -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From vmatthewa at gmail.com Sun Nov 11 13:56:33 2007 From: vmatthewa at gmail.com (Matthew Abravanel) Date: Sun, 11 Nov 2007 11:56:33 -0700 Subject: [BioPython] posting to list Message-ID: <8fc5e4c20711111056n1c42d26cvae357df50d810dea@mail.gmail.com> my e-mail is vmatthewa at gmail.com From vmatthewa at gmail.com Sun Nov 11 14:07:33 2007 From: vmatthewa at gmail.com (Matthew Abravanel) Date: Sun, 11 Nov 2007 12:07:33 -0700 Subject: [BioPython] Writing a biopython script to download all Genbank records from Nucleotide database Message-ID: <8fc5e4c20711111107p5c3b7f94q7ccb3de7493a3279@mail.gmail.com> Hi everyone, Please ignore my last messages, I am still getting the hang of this e-mail list and everything. I am trying to write a bio-python script to download all Genbank records in the Nucleotide database and I know what I want to do just not how to go about doing it. I am using a Unix based system with bio-python 2.4 and I am using emacs editor, if someone could help me out I would really appreciate it with some sample code or something. I just started learning python and have tried to follow the documentation and cookbook without much success, my programming experience is virtually non-existent. Thanks. 
Matthew From winter at biotec.tu-dresden.de Sun Nov 11 14:37:02 2007 From: winter at biotec.tu-dresden.de (Christof Winter) Date: Sun, 11 Nov 2007 20:37:02 +0100 Subject: [BioPython] Writing a biopython script to download all Genbank records from Nucleotide database In-Reply-To: <8fc5e4c20711111107p5c3b7f94q7ccb3de7493a3279@mail.gmail.com> References: <8fc5e4c20711111107p5c3b7f94q7ccb3de7493a3279@mail.gmail.com> Message-ID: <473759DE.1010802@biotec.tu-dresden.de> Matthew Abravanel wrote: > Hi everyone, > > Please ignore my last messages, I am still getting the hang of this > e-mail list and everything. I am trying to write a bio-python script to > download all Genbank records in the Nucleotide database and I know what I > want to do just not how to go about doing it. I am using a Unix based system > with bio-python 2.4 and I am using emacs editor, if someone could help me > out I would really appreciate it with some sample code or something. I > just started learning python and have tried to follow the documentation and > cookbook without much success, my programming experience is virtually > non-existent. Thanks. > > Matthew Hi Matthew, I used the code below to retrieve some entries from the Nucleotide database. Since two entries already take a few seconds, it is probably a bad idea to download _all_ entries in that way. 
You might be better off downloading the data first:
ftp://ftp.ncbi.nih.gov/genbank/

HTH, cheers,
Christof

from Bio import GenBank

featureParser = GenBank.FeatureParser()
ncbiDict = GenBank.NCBIDictionary("nucleotide", "genbank", parser=featureParser)

accessionNumbers = ["BC063166", "NM_028459"]

for accessionNo in accessionNumbers:
    giList = GenBank.search_for(accessionNo)
    for gi in giList:
        record = ncbiDict[gi]  # parsing happens here
        for feature in record.features:
            # extract sequences
            if feature.type == "CDS":
                codingStart = feature.location._start.position
                codingEnd = feature.location._end.position
                completeSequence = record.seq.tostring()
                fiveUTRSequence = completeSequence[:codingStart]
                codingSequence = completeSequence[codingStart:codingEnd]
                threeUTRSequence = completeSequence[codingEnd:]
            # extract gene name
            if feature.type == "gene":
                geneName = feature.qualifiers['gene'][0]
        print "Found", gi, geneName, len(completeSequence)

From biopython at maubp.freeserve.co.uk Mon Nov 12 05:16:07 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 12 Nov 2007 10:16:07 +0000
Subject: [BioPython] Taxon/organism/source in Biopython
In-Reply-To: <002a01c82408$3f386d20$6400a8c0@Gecko>
References: <499834.44468.qm@web26501.mail.ukl.yahoo.com><47336117.2010102@maubp.freeserve.co.uk> <001c01c8238b$2ec64070$6400a8c0@Gecko> <002a01c82408$3f386d20$6400a8c0@Gecko>
Message-ID: <473827E7.2020907@maubp.freeserve.co.uk>

Eric Gibert wrote:
> I find out one answer to my question: in BioSQL/Loader.py (Biopython version
> 1.44), inside the function _load_bioentry_table, there is a *commented* call
> to
>
> # taxon_id = self._get_taxon_id(record)
>
> I uncommented it and also modified the "INSERT INTO bioentry" statement
> slightly below to include the taxon_id column and value.

See bug 1921, where this was done as a work around for a zero taxon id.
Clearly we should revisit this issue and fix this properly: http://bugzilla.open-bio.org/show_bug.cgi?id=1921 Peter From vmatthewa at gmail.com Mon Nov 12 16:19:27 2007 From: vmatthewa at gmail.com (Matthew Abravanel) Date: Mon, 12 Nov 2007 14:19:27 -0700 Subject: [BioPython] script to extract records from nucleotide database Message-ID: <8fc5e4c20711121319q5c1fd46exb584978ec0b9dd76@mail.gmail.com> Hi Christof, I tried out the code you sent me just to see if it would work but I get an attribute error or something? Here is the error I get: Traceback (most recent call last): File "./run", line 3, in ? from Bio import GenBank File "/usr/pkg/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 47, in ? File "/usr/pkg/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line 20, in ? from Bio.SeqRecord import SeqRecord File "/usr/pkg/lib/python2.4/site-packages/Bio/SeqRecord.py", line 11, in ? File "/usr/pkg/lib/python2.4/site-packages/Bio/FormatIO.py", line 55, in __init__ AttributeError: 'module' object has no attribute 'formats' Here is the code I have used: #!/usr/pkg/bin/python2.4 from Bio import GenBank featureParser = GenBank.FeatureParser() ncbiDict = GenBank.NCBIDictionary("nucleotide", "genbank",parser=featureParser) accessionNumbers=["BC063166", "NM_028459"] for accessionNo in accessionNumbers: giList = GenBank.search_for(accessionNo) for gi in giList: record = ncbiDict[gi] for feature in record.features: if feature.type =="CDS": codingStart = feature.location._start.position codingEnd = feature.location._end.position completeSequence = record.seq.tostring() fiveUTRSequence = completeSequence[:codingStart] codingSequence = completeSequence[codingStart:codingEnd] threeUTRSequence = completeSequence[codingEnd:] if feature.type=="gene": geneName=feature.qualifiers['gene'][0] print "Found",gi,geneName,len(completeSequence) I do not know if it is a difference in python2.4 version or not? Any help would be appreciate, thanks. 
Matthew From alexl at users.sourceforge.net Tue Nov 13 02:01:04 2007 From: alexl at users.sourceforge.net (Alex Lancaster) Date: Tue, 13 Nov 2007 00:01:04 -0700 Subject: [BioPython] Fedora packages for 1.44 (was Re: Biopython release 1.44 ready) In-Reply-To: <6243BAA9F5E0D24DA41B27997D1FD14402B645@mail2.exch.c2b2.columbia.edu> (Michiel De Hoon's message of "Sun\, 28 Oct 2007 02\:32\:40 -0400") References: <6243BAA9F5E0D24DA41B27997D1FD14402B645@mail2.exch.c2b2.columbia.edu> Message-ID: >>>>> "MDH" == Michiel De Hoon writes: MDH> Hi everybody, Biopython release 1.44 is now available for MDH> download from the Biopython website at http://biopython.org. For those that are using biopython on Fedora, I have updated the packages for biopython 1.44 in the "updates-testing" repository for F-7 and F-8 (the stable release is still on 1.43 until the updated packages get some testing, then I'll push them out to the stable "updates" repo). To test them out simply run (as root): yum --enablerepo=updates-testing install python-biopython If you have a Fedora account you can provide feedback directly here: https://admin.fedoraproject.org/updates/F7/FEDORA-2007-3266 https://admin.fedoraproject.org/updates/F8/FEDORA-2007-3198 otherwise, simply e-mail me, or fill out a bugzilla report on http://bugzilla.redhat.com/ (selecting the "Fedora" product and the "python-biopython" component). Thanks! Alex From winter at biotec.tu-dresden.de Tue Nov 13 15:25:45 2007 From: winter at biotec.tu-dresden.de (Christof Winter) Date: Tue, 13 Nov 2007 21:25:45 +0100 Subject: [BioPython] script to extract records from nucleotide database In-Reply-To: <8fc5e4c20711121319q5c1fd46exb584978ec0b9dd76@mail.gmail.com> References: <8fc5e4c20711121319q5c1fd46exb584978ec0b9dd76@mail.gmail.com> Message-ID: <473A0849.2020904@biotec.tu-dresden.de> Matthew Abravanel wrote: > Hi Christof, > > I tried out the code you sent me just to see if it would work but I get an > attribute error or something? 
Here is the error I get: > > > Traceback (most recent call last): > File "./run", line 3, in ? > from Bio import GenBank > File "/usr/pkg/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line > 47, in ? > File "/usr/pkg/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line > 20, in ? > from Bio.SeqRecord import SeqRecord > File "/usr/pkg/lib/python2.4/site-packages/Bio/SeqRecord.py", line 11, in > ? > File "/usr/pkg/lib/python2.4/site-packages/Bio/FormatIO.py", line 55, in > __init__ > AttributeError: 'module' object has no attribute 'formats' Hi Matthew, your import of the GenBank module fails. Most likely your BioPython installation is broken. Could you try to re-install it? On a Python (2.4) shell, this should work: Python 2.4.4 (#2, Apr 5 2007, 20:11:18) [GCC 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import Bio >>> import Bio.GenBank >>> HTH, Christof > > > Here is the code I have used: > > > > #!/usr/pkg/bin/python2.4 > > from Bio import GenBank > > > featureParser = GenBank.FeatureParser() > ncbiDict = GenBank.NCBIDictionary("nucleotide", > "genbank",parser=featureParser) > > accessionNumbers=["BC063166", "NM_028459"] > > > for accessionNo in accessionNumbers: > giList = GenBank.search_for(accessionNo) > for gi in giList: > record = ncbiDict[gi] > for feature in record.features: > if feature.type =="CDS": > codingStart = feature.location._start.position > codingEnd = feature.location._end.position > completeSequence = record.seq.tostring() > fiveUTRSequence = completeSequence[:codingStart] > codingSequence = completeSequence[codingStart:codingEnd] > threeUTRSequence = completeSequence[codingEnd:] > if feature.type=="gene": > geneName=feature.qualifiers['gene'][0] > > print "Found",gi,geneName,len(completeSequence) > > > I do not know if it is a difference in python2.4 version or not? Any help > would be appreciate, thanks. 
> > Matthew > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From biopython at maubp.freeserve.co.uk Tue Nov 13 17:59:43 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 13 Nov 2007 22:59:43 +0000 Subject: [BioPython] script to extract records from nucleotide database In-Reply-To: <473A0849.2020904@biotec.tu-dresden.de> References: <8fc5e4c20711121319q5c1fd46exb584978ec0b9dd76@mail.gmail.com> <473A0849.2020904@biotec.tu-dresden.de> Message-ID: <473A2C5F.5060102@maubp.freeserve.co.uk> Christof Winter wrote: > Matthew Abravanel wrote: >> Hi Christof, >> >> I tried out the code you sent me just to see if it would work but I get an >> attribute error or something? Here is the error I get: >> >> Traceback (most recent call last): >> File "./run", line 3, in ? >> from Bio import GenBank >> File "/usr/pkg/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line >> 47, in ? >> File "/usr/pkg/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line >> 20, in ? >> from Bio.SeqRecord import SeqRecord >> File "/usr/pkg/lib/python2.4/site-packages/Bio/SeqRecord.py", line 11, in >> ? >> File "/usr/pkg/lib/python2.4/site-packages/Bio/FormatIO.py", line 55, in >> __init__ >> AttributeError: 'module' object has no attribute 'formats' > > Hi Matthew, > > your import of the GenBank module fails. Most likely your BioPython installation > is broken. Could you try to re-install it? What version of Biopython do you have Matthew? I'm pretty sure it isn't the latest Biopython 1.44, it must be Biopython 1.43 or older... but even on Biopython 1.43 doing "from Bio import GenBank" should work. Odd. What OS are you using, and how and where did you install Biopython? 
Peter From vmatthewa at gmail.com Tue Nov 13 18:46:48 2007 From: vmatthewa at gmail.com (Matthew Abravanel) Date: Tue, 13 Nov 2007 16:46:48 -0700 Subject: [BioPython] script to extract records from nucleotide database In-Reply-To: <473A2C5F.5060102@maubp.freeserve.co.uk> References: <8fc5e4c20711121319q5c1fd46exb584978ec0b9dd76@mail.gmail.com> <473A0849.2020904@biotec.tu-dresden.de> <473A2C5F.5060102@maubp.freeserve.co.uk> Message-ID: <8fc5e4c20711131546g111461bfrd210b68d03f8bcd8@mail.gmail.com> Hi Christof and everyone, Thanks for all the comments and everything, the OS I am using is NetBSD 3.1 , and I think you were right Peter about my biopython version I think it was 1.43 instead of the latest 1.44 version. If I get the latest 1.44version of biopython do you think the code should work or do I need to think of something else? Sincerely, Matthew On Nov 13, 2007 3:59 PM, Peter wrote: > Christof Winter wrote: > > Matthew Abravanel wrote: > >> Hi Christof, > >> > >> I tried out the code you sent me just to see if it would work but I get > an > >> attribute error or something? Here is the error I get: > >> > >> Traceback (most recent call last): > >> File "./run", line 3, in ? > >> from Bio import GenBank > >> File "/usr/pkg/lib/python2.4/site-packages/Bio/GenBank/__init__.py", > line > >> 47, in ? > >> File "/usr/pkg/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", > line > >> 20, in ? > >> from Bio.SeqRecord import SeqRecord > >> File "/usr/pkg/lib/python2.4/site-packages/Bio/SeqRecord.py", line > 11, in > >> ? > >> File "/usr/pkg/lib/python2.4/site-packages/Bio/FormatIO.py", line 55, > in > >> __init__ > >> AttributeError: 'module' object has no attribute 'formats' > > > > Hi Matthew, > > > > your import of the GenBank module fails. Most likely your BioPython > installation > > is broken. Could you try to re-install it? > > What version of Biopython do you have Matthew? I'm pretty sure it isn't > the latest Biopython 1.44, it must be Biopython 1.43 or older... 
but > even on Biopython 1.43 doing "from Bio import GenBank" should work. > > Odd. What OS are you using, and how and where did you install Biopython? > > Peter > > From biopython at maubp.freeserve.co.uk Tue Nov 13 19:01:45 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 14 Nov 2007 00:01:45 +0000 Subject: [BioPython] script to extract records from nucleotide database In-Reply-To: <8fc5e4c20711131546g111461bfrd210b68d03f8bcd8@mail.gmail.com> References: <8fc5e4c20711121319q5c1fd46exb584978ec0b9dd76@mail.gmail.com> <473A0849.2020904@biotec.tu-dresden.de> <473A2C5F.5060102@maubp.freeserve.co.uk> <8fc5e4c20711131546g111461bfrd210b68d03f8bcd8@mail.gmail.com> Message-ID: <473A3AE9.9030506@maubp.freeserve.co.uk> Matthew Abravanel wrote: > Hi Christof and everyone, > > Thanks for all the comments and everything, the OS I am using is NetBSD > 3.1 , and I think you were right Peter about my biopython version I think it > was 1.43 instead of the latest 1.44 version. If I get the latest > 1.44v ersion of biopython do you think the code should work or do I > need to think of something else? Well, that example code looked like it should have worked on Biopython 1.43 so I am a little puzzled. I don't know if anyone else is using NetBSD with Biopython, but the import problem could be some sort of installation problem. This is why I was asking about how and where you installed Biopython (i.e. did you install from source?) 
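One way to answer the "how and where" question is to ask the interpreter which copy of the package it would pick up. The following is only a sketch and assumes a modern Python 3 interpreter rather than the Python 2.4 used in this thread; `describe_install` is a hypothetical helper name.

```python
import importlib.util


def describe_install(package_name):
    """Report where a package would be imported from, without importing it."""
    spec = importlib.util.find_spec(package_name)
    if spec is None:
        return "%s: not found on sys.path" % package_name
    # For a package, submodule_search_locations lists its directories.
    locations = list(spec.submodule_search_locations or [])
    return "%s: origin=%s, package dirs=%s" % (package_name, spec.origin, locations)


# "Bio" is the Biopython package; "json" is only here as a known-good control.
for name in ("Bio", "json"):
    print(describe_install(name))
```

If the reported directory is not the one the installer wrote to, or more than one copy of `Bio` exists on the path, that would be consistent with the mixed or partial installation suspected here.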
If you do try Biopython 1.44, watch out for bug 2393, we managed to break Bio.GenBank.NCBIDictionary - on the bright side its a one line fix: http://bugzilla.open-bio.org/show_bug.cgi?id=2393 Peter From biopython at maubp.freeserve.co.uk Tue Nov 13 04:15:57 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 13 Nov 2007 09:15:57 +0000 Subject: [BioPython] Writing a biopython script to download all Genbank records from Nucleotide database In-Reply-To: <473759DE.1010802@biotec.tu-dresden.de> References: <8fc5e4c20711111107p5c3b7f94q7ccb3de7493a3279@mail.gmail.com> <473759DE.1010802@biotec.tu-dresden.de> Message-ID: <47396B4D.6010400@maubp.freeserve.co.uk> Christof Winter wrote: > I used the code below to retrieve some entries from the Nucleotide database. > Since two entries already take a few seconds, it is probably a bad idea to > download _all_ entries in that way. > > You might be better off downloading the data first: > ftp://ftp.ncbi.nih.gov/genbank/ I would agree 100%. Another benefit is you can script an FTP download (e.g. using wget which can cope with an interrupted internet connection nicely). > from Bio import GenBank > > featureParser = GenBank.FeatureParser() > ncbiDict = GenBank.NCBIDictionary("nucleotide", "genbank", parser=featureParser) > ... Note that Bio.GenBank.NCBIDictionary won't work in Biopython 1.44, but its been fixed again in CVS - see bug 2393. http://bugzilla.open-bio.org/show_bug.cgi?id=2393 > accessionNumbers = ["BC063166", "NM_028459"] > > for accessionNo in accessionNumbers: > giList = GenBank.search_for(accessionNo) > for gi in giList: > record = ncbiDict[gi] # parsing happens here > ... I expect you can ask the NCBI for records by accession directly, rather than doing a search to get the GI number. 
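As a hedged illustration of asking for records by accession directly: the sketch below assumes NCBI's E-utilities `efetch` service, which this thread does not mention, and uses only the Python 3 standard library instead of Bio.GenBank. The function names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities efetch endpoint (an assumption of this sketch, not
# something taken from the thread).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, db="nucleotide"):
    """Build a URL requesting one GenBank flat-file record by accession."""
    params = {"db": db, "id": accession, "rettype": "gb", "retmode": "text"}
    return EFETCH_URL + "?" + urlencode(params)


def fetch_genbank_text(accession):
    """Download the record as text; needs network access, so it is not called here."""
    with urlopen(build_efetch_url(accession)) as handle:
        return handle.read().decode("utf-8")


for accession in ["BC063166", "NM_028459"]:
    print(build_efetch_url(accession))
```

This is only reasonable for a handful of records; for bulk data the FTP download suggested above remains the better route.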
Peter From winter at biotec.tu-dresden.de Wed Nov 14 04:47:32 2007 From: winter at biotec.tu-dresden.de (Christof Winter) Date: Wed, 14 Nov 2007 10:47:32 +0100 Subject: [BioPython] script to extract records from nucleotide database In-Reply-To: <473A3AE9.9030506@maubp.freeserve.co.uk> References: <8fc5e4c20711121319q5c1fd46exb584978ec0b9dd76@mail.gmail.com> <473A0849.2020904@biotec.tu-dresden.de> <473A2C5F.5060102@maubp.freeserve.co.uk> <8fc5e4c20711131546g111461bfrd210b68d03f8bcd8@mail.gmail.com> <473A3AE9.9030506@maubp.freeserve.co.uk> Message-ID: <473AC434.4030605@biotec.tu-dresden.de> Peter wrote: > Matthew Abravanel wrote: >> Hi Christof and everyone, >> >> Thanks for all the comments and everything, the OS I am using is NetBSD >> 3.1 , and I think you were right Peter about my biopython version I think it >> was 1.43 instead of the latest 1.44 version. If I get the latest >> 1.44v ersion of biopython do you think the code should work or do I >> need to think of something else? > > Well, that example code looked like it should have worked on Biopython > 1.43 so I am a little puzzled. I even run 1.42 (Debian package python-biopython 1.42-2), and it works fine. 
From idoerg at gmail.com Wed Nov 14 11:43:03 2007 From: idoerg at gmail.com (Iddo Friedberg) Date: Wed, 14 Nov 2007 08:43:03 -0800 Subject: [BioPython] script to extract records from nucleotide database In-Reply-To: <473AC434.4030605@biotec.tu-dresden.de> References: <8fc5e4c20711121319q5c1fd46exb584978ec0b9dd76@mail.gmail.com> <473A0849.2020904@biotec.tu-dresden.de> <473A2C5F.5060102@maubp.freeserve.co.uk> <8fc5e4c20711131546g111461bfrd210b68d03f8bcd8@mail.gmail.com> <473A3AE9.9030506@maubp.freeserve.co.uk> <473AC434.4030605@biotec.tu-dresden.de> Message-ID: The culprit code in Bio/FormatIO class FormatIO: def __init__(self, name, default_input_format = None, default_output_format = None, abbrev = None, registery = None): if abbrev is None: abbrev = name if registery is None: import Bio registery = Bio.formats seems like this class is being instantiated from SeqRecord.py using the following call: io = FormatIO.FormatIO("SeqRecord", default_input_format = "sequence", default_output_format = "fasta") Which causes 'registry' to be set to None, which causes a call to Bio.formats which does not exist. 1) Can I have the offending code that started all this? (I'm a bit late in the game, I know) 2) What is (was) Bio.formats? has it been replaced by something else? 3) What is this registry thingy? We need it? Iddo On Nov 14, 2007 1:47 AM, Christof Winter wrote: > Peter wrote: > > Matthew Abravanel wrote: > >> Hi Christof and everyone, > >> > >> Thanks for all the comments and everything, the OS I am using is NetBSD > >> 3.1 , and I think you were right Peter about my biopython version I > think it > >> was 1.43 instead of the latest 1.44 version. If I get the latest > >> 1.44v ersion of biopython do you think the code should work or do I > >> need to think of something else? > > > > Well, that example code looked like it should have worked on Biopython > > 1.43 so I am a little puzzled. > > I even run 1.42 (Debian package python-biopython 1.42-2), and it works > fine. 
> _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- I. Friedberg "The only problem with troubleshooting is that sometimes trouble shoots back." From biopython at maubp.freeserve.co.uk Wed Nov 14 16:43:36 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 14 Nov 2007 21:43:36 +0000 Subject: [BioPython] script to extract records from nucleotide database In-Reply-To: References: <8fc5e4c20711121319q5c1fd46exb584978ec0b9dd76@mail.gmail.com> <473A0849.2020904@biotec.tu-dresden.de> <473A2C5F.5060102@maubp.freeserve.co.uk> <8fc5e4c20711131546g111461bfrd210b68d03f8bcd8@mail.gmail.com> <473A3AE9.9030506@maubp.freeserve.co.uk> <473AC434.4030605@biotec.tu-dresden.de> Message-ID: <473B6C08.1060605@maubp.freeserve.co.uk> Iddo Friedberg wrote: > The culprit code in Bio/FormatIO ... We've removed Bio.FormatIO for Biopython 1.44 (in favour of Bio.SeqIO). It was a input/output framework based on Martel "regular expressions" to describe file formats. > 1) Can I have the offending code that started all this? (I'm a bit late in > the game, I know) I think it was an innocent looking "from Bio import GenBank", which I have never seen cause this error before. Hence my wondering if there was an installation problem (e.g. a partial installation). > 2) What is (was) Bio.formats? has it been replaced by something else? > 3) What is this registry thingy? We need it? I think Bio.formats and the registry thing are all tied together with code in Bio/formatdefs, Bio/config etc. Its all very complicated, and doesn't seem to be much documented. 
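The failure Iddo traced can be reproduced in miniature. This is a sketch of one way such an error can arise, not a diagnosis of Matthew's system: `fake_bio` is a made-up stand-in module, and the class below only imitates the lazy default lookup quoted from Bio/FormatIO.

```python
import types

# Stand-in for a partially installed package: the module object exists,
# but the "formats" attribute that other code expects was never set up.
fake_bio = types.ModuleType("fake_bio")


class FormatIO:
    def __init__(self, name, registry=None):
        if registry is None:
            # Lazy default, looked up only when the class is instantiated.
            registry = fake_bio.formats
        self.name = name
        self.registry = registry


try:
    FormatIO("SeqRecord")
    outcome = "ok"
except AttributeError as err:
    outcome = "AttributeError: %s" % err

print(outcome)
```

Because the lookup happens inside `__init__`, nothing fails until some module instantiates the class while it is being imported, which is why the error surfaces from an otherwise innocent import.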
Peter From vmatthewa at gmail.com Thu Nov 15 15:50:13 2007 From: vmatthewa at gmail.com (Matthew Abravanel) Date: Thu, 15 Nov 2007 13:50:13 -0700 Subject: [BioPython] script to extract records from nucleotide database In-Reply-To: <473B6C08.1060605@maubp.freeserve.co.uk> References: <8fc5e4c20711121319q5c1fd46exb584978ec0b9dd76@mail.gmail.com> <473A0849.2020904@biotec.tu-dresden.de> <473A2C5F.5060102@maubp.freeserve.co.uk> <8fc5e4c20711131546g111461bfrd210b68d03f8bcd8@mail.gmail.com> <473A3AE9.9030506@maubp.freeserve.co.uk> <473AC434.4030605@biotec.tu-dresden.de> <473B6C08.1060605@maubp.freeserve.co.uk> Message-ID: <8fc5e4c20711151250y286ff4d4se7875e029e12eceb@mail.gmail.com> Hi everyone, Thanks for all the comments, so since you said you removed Bio.FormatIO in version 1.44 and replaced it with Bio.SeqIO do you think I can still successfully use that code I was given if I have 1.44 provided I watch out for bugs and so on? What is the difference between Bio.FormatIO and Bio.SeqIO, other then them describing file formats differently? Also how exactly could one have a partial installation, some of the package not installing? Thanks again for the help. Sincerely, Matthew On Nov 14, 2007 2:43 PM, Peter wrote: > Iddo Friedberg wrote: > > The culprit code in Bio/FormatIO ... > > We've removed Bio.FormatIO for Biopython 1.44 (in favour of Bio.SeqIO). > It was a input/output framework based on Martel "regular expressions" to > describe file formats. > > > 1) Can I have the offending code that started all this? (I'm a bit late > in > > the game, I know) > > I think it was an innocent looking "from Bio import GenBank", which I > have never seen cause this error before. Hence my wondering if there was > an installation problem (e.g. a partial installation). > > > 2) What is (was) Bio.formats? has it been replaced by something else? > > 3) What is this registry thingy? We need it? 
> > I think Bio.formats and the registry thing are all tied together with > code in Bio/formatdefs, Bio/config etc. Its all very complicated, and > doesn't seem to be much documented. > > Peter > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From biopython at maubp.freeserve.co.uk Thu Nov 15 16:11:21 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 15 Nov 2007 21:11:21 +0000 Subject: [BioPython] script to extract records from nucleotide database In-Reply-To: <8fc5e4c20711151250y286ff4d4se7875e029e12eceb@mail.gmail.com> References: <8fc5e4c20711121319q5c1fd46exb584978ec0b9dd76@mail.gmail.com> <473A0849.2020904@biotec.tu-dresden.de> <473A2C5F.5060102@maubp.freeserve.co.uk> <8fc5e4c20711131546g111461bfrd210b68d03f8bcd8@mail.gmail.com> <473A3AE9.9030506@maubp.freeserve.co.uk> <473AC434.4030605@biotec.tu-dresden.de> <473B6C08.1060605@maubp.freeserve.co.uk> <8fc5e4c20711151250y286ff4d4se7875e029e12eceb@mail.gmail.com> Message-ID: <320fb6e00711151311q700a473age0e026b8fcfb5c5@mail.gmail.com> > Thanks for all the comments, so since you said you removed Bio.FormatIO in > version 1.44 and replaced it with Bio.SeqIO do you think I can still > successfully use that code I was given if I have 1.44 provided I watch out > for bugs and so on? Assuming you can apply the fix for Bug 2393, then that Bio.GenBank.NCBIDictionary code should work fine with Biopython 1.44. There is also a related example in the SeqIO chapter of the tutorial using the Bio.GenBank.download_many() function. > What is the difference between Bio.FormatIO and > Bio.SeqIO, other then them describing file formats differently? In terms of typical end use, Bio.SeqIO and Bio.FormatIO provided similar capabilities, but FormatIO wasn't very up to date in terms of its format support. The big differences are internal. 
For any new code, please try Bio.SeqIO (available in Biopython 1.43 onwards), which is described in the tutorial and the wiki: http://biopython.org/wiki/SeqIO > Also how exactly could one have a partial installation, some of the package not > installing? This was a guess - there is/was clearly something odd about your install. If you installed from source, maybe some step failed part way leaving you with only some parts installed. Another possibility is on BSD is there is something different about the installation paths which is confusing things. We haven't worked out what went wrong on your system so I'm was just speculating. Peter From holger.dinkel at gmail.com Fri Nov 16 05:38:56 2007 From: holger.dinkel at gmail.com (holger.dinkel at gmail.com) Date: Fri, 16 Nov 2007 11:38:56 +0100 Subject: [BioPython] Prosite / Prorule Message-ID: <20071116103856.GM8243@megaira.biochem.uni-erlangen.de> Hello List, I just stumbled upon an error with the parsing of a 'newer' (>20) version of Prosite: Prosite introduced a new field called ProRules which cause errors in parsing with Bio/Prosite/__init__.py / Bio/ParserSupport.py. I updated biopython to 1.44, but the error persists. 
Here is the Traceback:
----------------------------------------------------------------------------------------------------
  File "/usr/lib/python2.4/site-packages/Bio/Prosite/__init__.py", line 227, in next
    return self._parser.parse(File.StringHandle(data))
  File "/usr/lib/python2.4/site-packages/Bio/Prosite/__init__.py", line 349, in parse
    self._scanner.feed(handle, self._consumer)
  File "/usr/lib/python2.4/site-packages/Bio/Prosite/__init__.py", line 381, in feed
    self._scan_record(uhandle, consumer)
  File "/usr/lib/python2.4/site-packages/Bio/Prosite/__init__.py", line 396, in _scan_record
    fn(self, uhandle, consumer)
  File "/usr/lib/python2.4/site-packages/Bio/Prosite/__init__.py", line 477, in _scan_do
    self._scan_line('DO', uhandle, consumer.documentation, exactly_one=1)
  File "/usr/lib/python2.4/site-packages/Bio/Prosite/__init__.py", line 418, in _scan_line
    read_and_call(uhandle, event_fn, start=line_type)
  File "/usr/lib/python2.4/site-packages/Bio/ParserSupport.py", line 300, in read_and_call
    raise SyntaxError, errmsg
SyntaxError: Line does not start with 'DO': PR PRU00498;
----------------------------------------------------------------------------------------------------

I tried to figure out where the problem lies, but I do not really
understand the structure of the parsing modules in
'Bio/Prosite/__init__.py'.

I tried to create a new entry for the ProRule line: define a

    def _scan_pr(self, uhandle, consumer):
        self._scan_line('PR', uhandle, consumer.identification, up_to_one=1)

add that to '_scan_fns' and so on, but then the scanning seems to get
out of order, and I get a different
"SyntaxError: Line does not start with ..." error.

Is the parsing mechanism described anywhere, so I can look it up and fix
the error?

Regards,
Holger

From vmatthewa at gmail.com Fri Nov 16 16:26:10 2007
From: vmatthewa at gmail.com (Matthew Abravanel)
Date: Fri, 16 Nov 2007 14:26:10 -0700
Subject: [BioPython] Bug 2393 in bugzilla for the Bio.GenBank.NCBIDictionary code
Message-ID: <8fc5e4c20711161326q27723a16re5eca3de82573f8e@mail.gmail.com>

Hi all,

Not being familiar with Bugzilla, I was wondering: if I want to fix Bug
2393, there seems to be a patch with the .py extension that was created
for it. Do I need to download that file, or type it into my existing
code, or something like that? Please excuse my ignorance. Thanks.

Matthew

On Nov 15, 2007 2:11 PM, Peter wrote:

> > Thanks for all the comments, so since you said you removed Bio.FormatIO in
> > version 1.44 and replaced it with Bio.SeqIO do you think I can still
> > successfully use that code I was given if I have 1.44 provided I watch out
> > for bugs and so on?
>
> Assuming you can apply the fix for Bug 2393, then that
> Bio.GenBank.NCBIDictionary code should work fine with Biopython 1.44.
>
> There is also a related example in the SeqIO chapter of the tutorial
> using the Bio.GenBank.download_many() function.
>
> > What is the difference between Bio.FormatIO and
> > Bio.SeqIO, other then them describing file formats differently?
>
> In terms of typical end use, Bio.SeqIO and Bio.FormatIO provided
> similar capabilities, but FormatIO wasn't very up to date in terms of
> its format support. The big differences are internal. For any new
> code, please try Bio.SeqIO (available in Biopython 1.43 onwards),
> which is described in the tutorial and the wiki:
> http://biopython.org/wiki/SeqIO
>
> > Also how exactly could one have a partial installation, some of the
> > package not installing?
>
> This was a guess - there is/was clearly something odd about your
> install. If you installed from source, maybe some step failed part
> way leaving you with only some parts installed. Another possibility
> is that on BSD there is something different about the installation paths
> which is confusing things. We haven't worked out what went wrong on
> your system so I was just speculating.
>
> Peter
>

From biopython at maubp.freeserve.co.uk Fri Nov 16 16:38:14 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 16 Nov 2007 21:38:14 +0000
Subject: [BioPython] Bug 2393 in bugzilla for the Bio.GenBank.NCBIDictionary code
In-Reply-To: <8fc5e4c20711161326q27723a16re5eca3de82573f8e@mail.gmail.com>
References: <8fc5e4c20711161326q27723a16re5eca3de82573f8e@mail.gmail.com>
Message-ID: <320fb6e00711161338w73f8e7d1l964671208e263ba3@mail.gmail.com>

On Nov 16, 2007 9:26 PM, Matthew Abravanel wrote:
> Hi all,
>
> Not being familiar with the bugzilla feature, I was wondering: if I wanted
> to fix Bug 2393, there seems to be a patch with the .py extension that was
> created for it. Do I need to download that file or type it into my existing
> code or something like that? Please excuse my ignorance. Thanks.
>
> Matthew

http://bugzilla.open-bio.org/show_bug.cgi?id=2393

Right now, Bug 2393 has two attachments:
* Michiel's patch - which makes some drastic changes
* My suggested test case (a python file)

Ignore these for now. Instead, I suggest you make the "quick-and-dirty"
one line change described in comment 2, which has since been checked
into CVS (comment 5):

http://bugzilla.open-bio.org/show_bug.cgi?id=2393#c2
http://bugzilla.open-bio.org/show_bug.cgi?id=2393#c5

Or just download the latest Bio/GenBank/__init__.py from the web
interface to CVS:

http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/GenBank/__init__.py?cvsroot=biopython

Back up your old file, and replace it with the new one (assuming you
have sufficient admin rights).

Peter

From timmcilveen at talktalk.net Sat Nov 17 14:48:38 2007
From: timmcilveen at talktalk.net (tim)
Date: Sat, 17 Nov 2007 19:48:38 +0000
Subject: [BioPython] installing mxTextTools
In-Reply-To: <473B6C08.1060605@maubp.freeserve.co.uk>
References: <8fc5e4c20711121319q5c1fd46exb584978ec0b9dd76@mail.gmail.com> <473A0849.2020904@biotec.tu-dresden.de> <473A2C5F.5060102@maubp.freeserve.co.uk> <8fc5e4c20711131546g111461bfrd210b68d03f8bcd8@mail.gmail.com> <473A3AE9.9030506@maubp.freeserve.co.uk> <473AC434.4030605@biotec.tu-dresden.de> <473B6C08.1060605@maubp.freeserve.co.uk>
Message-ID: <1195328918.5259.10.camel@linux-qvtz.site>

Hi,

I am setting up Biopython on my Suse Linux 10.3. I have Python 2.5.1
installed on my system already. After downloading mxTextTools I unzip it
and start the install, but get the error:

invalid Python installation: unable to open
/usr/lib/python2.5/config/Makefile (No such file or directory)

Indeed when I browse to Python 2.5 in /usr/lib/python2.5/ , there is no
config folder. Python 2.5 works fine though, so what is going on here?
Any ideas anyone?

Thanks,
Tim

From timmcilveen at talktalk.net Sat Nov 17 14:40:57 2007
From: timmcilveen at talktalk.net (tim)
Date: Sat, 17 Nov 2007 19:40:57 +0000
Subject: [BioPython] installing mxtext tools
In-Reply-To: <473B6C08.1060605@maubp.freeserve.co.uk>
References: <8fc5e4c20711121319q5c1fd46exb584978ec0b9dd76@mail.gmail.com> <473A0849.2020904@biotec.tu-dresden.de> <473A2C5F.5060102@maubp.freeserve.co.uk> <8fc5e4c20711131546g111461bfrd210b68d03f8bcd8@mail.gmail.com> <473A3AE9.9030506@maubp.freeserve.co.uk> <473AC434.4030605@biotec.tu-dresden.de> <473B6C08.1060605@maubp.freeserve.co.uk>
Message-ID: <1195328457.5259.4.camel@linux-qvtz.site>

Hi,
On Suse Linux I am trying to install Biopython. I have Python 2.5.1
installed and I need to install mxtext tools into python.
When I perform :

I get the error message

invalid Python installation: unable to open
/usr/lib/python2.5/config/Makefile (no such file or directory)

Indeed when I browse the file system of Python 2.5.1, I find no such
file. Python is working fine though.

From biopython at maubp.freeserve.co.uk Sat Nov 17 15:10:51 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 17 Nov 2007 20:10:51 +0000
Subject: [BioPython] installing mxtext tools
In-Reply-To: <1195328457.5259.4.camel@linux-qvtz.site>
References: <8fc5e4c20711121319q5c1fd46exb584978ec0b9dd76@mail.gmail.com> <473A0849.2020904@biotec.tu-dresden.de> <473A2C5F.5060102@maubp.freeserve.co.uk> <8fc5e4c20711131546g111461bfrd210b68d03f8bcd8@mail.gmail.com> <473A3AE9.9030506@maubp.freeserve.co.uk> <473AC434.4030605@biotec.tu-dresden.de> <473B6C08.1060605@maubp.freeserve.co.uk> <1195328457.5259.4.camel@linux-qvtz.site>
Message-ID: <320fb6e00711171210g7fdcbf5cl477affd092cd7035@mail.gmail.com>

> Hi,
> On Suse Linux I am trying to install Biopython. I have Python 2.5.1.
> installed and I need to install mxtext tools into python.

My guess is you tried a standard "python setup.py install" which
won't work on mxTextTools (because egenix provide things pre-compiled).
On their webpage they suggest this for a system wide install:

sudo python setup.py build --skip install

If all else fails, I suggest you ask on the egenix mailing list:
http://www.egenix.com/support/mailing-lists/

Peter

From timmcilveen at talktalk.net Sat Nov 17 16:36:30 2007
From: timmcilveen at talktalk.net (tim)
Date: Sat, 17 Nov 2007 21:36:30 +0000
Subject: [BioPython] installing mxtext tools
In-Reply-To: <320fb6e00711171210g7fdcbf5cl477affd092cd7035@mail.gmail.com>
References: <8fc5e4c20711121319q5c1fd46exb584978ec0b9dd76@mail.gmail.com> <473A0849.2020904@biotec.tu-dresden.de> <473A2C5F.5060102@maubp.freeserve.co.uk> <8fc5e4c20711131546g111461bfrd210b68d03f8bcd8@mail.gmail.com> <473A3AE9.9030506@maubp.freeserve.co.uk> <473AC434.4030605@biotec.tu-dresden.de> <473B6C08.1060605@maubp.freeserve.co.uk> <1195328457.5259.4.camel@linux-qvtz.site> <320fb6e00711171210g7fdcbf5cl477affd092cd7035@mail.gmail.com>
Message-ID: <1195335390.4371.2.camel@linux-qvtz.site>

Hi Peter,

I did use

> sudo python setup.py build --skip install

but with no success. I'll try the support list at Egenix.

Thanks for the quick reply :-)

Tim

On Sat, 2007-11-17 at 20:10 +0000, Peter wrote:
> > Hi,
> > On Suse Linux I am trying to install Biopython. I have Python 2.5.1.
> > installed and I need to install mxtext tools into python.
>
> My guess is you tried a standard "python setup.py install" which
> won't work on mxTextTools (because egenix provide things pre-compiled).
> On their webpage they suggest this for a system wide install:
>
> sudo python setup.py build --skip install
>
> If all else fails, I suggest you ask on the egenix mailing list:
> http://www.egenix.com/support/mailing-lists/
>
> Peter

From biopython at maubp.freeserve.co.uk Sat Nov 17 16:44:22 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 17 Nov 2007 21:44:22 +0000
Subject: [BioPython] installing mxtext tools
In-Reply-To: <1195335390.4371.2.camel@linux-qvtz.site>
References: <8fc5e4c20711121319q5c1fd46exb584978ec0b9dd76@mail.gmail.com> <473A2C5F.5060102@maubp.freeserve.co.uk> <8fc5e4c20711131546g111461bfrd210b68d03f8bcd8@mail.gmail.com> <473A3AE9.9030506@maubp.freeserve.co.uk> <473AC434.4030605@biotec.tu-dresden.de> <473B6C08.1060605@maubp.freeserve.co.uk> <1195328457.5259.4.camel@linux-qvtz.site> <320fb6e00711171210g7fdcbf5cl477affd092cd7035@mail.gmail.com> <1195335390.4371.2.camel@linux-qvtz.site>
Message-ID: <320fb6e00711171344oa111afaj28f525b19dbfa34@mail.gmail.com>

tim wrote:
> I did use
>
>> sudo python setup.py build --skip install
>
> but with no success. I'll try the support list at Egenix.

By the way - Biopython had/has a few issues with mxTextTools 3.0, most
of which we have now worked around in Biopython 1.44.

I notice egenix now have mxTextTools 2.0 available on their website
once again - if you are only installing this for Biopython then I
would recommend you use mxTextTools 2.0 instead.

http://www.egenix.com/www2002/python/eGenix-mx-Extensions-v2.x.html/

I'm just going to update our wiki to mention this...

Peter

From anablopes at gmail.com Sun Nov 18 13:49:53 2007
From: anablopes at gmail.com (Ana Branca Lopes)
Date: Sun, 18 Nov 2007 18:49:53 +0000
Subject: [BioPython] number of enzimes in a single pdb file
Message-ID: <2489921e0711181049i1e11cef3ia6d02b62abad776e@mail.gmail.com>

Hello,

When a PDB file has more than one crystallographic unit (more than one
protein) in a single file, is there any way to know how many copies
there are?

Many thanks for any help

Ana

From mdehoon at c2b2.columbia.edu Sun Nov 18 20:20:35 2007
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Sun, 18 Nov 2007 20:20:35 -0500
Subject: [BioPython] Bug 2393 in bugzilla for theBio.GenBank.NCBIDictionary code
References: <8fc5e4c20711161326q27723a16re5eca3de82573f8e@mail.gmail.com> <320fb6e00711161338w73f8e7d1l964671208e263ba3@mail.gmail.com>
Message-ID: <6243BAA9F5E0D24DA41B27997D1FD14402B661@mail2.exch.c2b2.columbia.edu>

It's probably a good idea to make a new Biopython release in the near
future, after fixing this bug.

--Michiel.

Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032

From mdehoon at c2b2.columbia.edu Sun Nov 18 20:19:00 2007
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Sun, 18 Nov 2007 20:19:00 -0500
Subject: [BioPython] Prosite / Prorule
References: <20071116103856.GM8243@megaira.biochem.uni-erlangen.de>
Message-ID: <6243BAA9F5E0D24DA41B27997D1FD14402B660@mail2.exch.c2b2.columbia.edu>

Holger wrote:
> I just stumbled upon an error with the parsing of a 'newer' (>20) version of
> Prosite: Prosite introduced a new field called ProRules which cause errors
> in parsing with Bio/Prosite/__init__.py / Bio/ParserSupport.py.
...
> I tried to figure out, where the problem lies, but I do not really understand
> the structure of the parsing modules in 'Bio/Prosite/__init__.py'
...
> Is the parsing mechanism described anywhere, so I can look it up and fix the error?

The Prosite parser was written about five years ago, and it may very well be
that none of the currently active Biopython developers really know how this
parser works. In that case, one option may be to write a new Prosite parser
from scratch. That could even be an easier solution than trying to fix the
existing parser.
If you decide to go that way, it would be a good idea to
discuss the Prosite parser design beforehand on the development mailing list
(biopython-dev at biopython.org).

--Michiel

Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032

From holger.dinkel at gmail.com Mon Nov 19 04:24:24 2007
From: holger.dinkel at gmail.com (holger.dinkel at gmail.com)
Date: Mon, 19 Nov 2007 10:24:24 +0100
Subject: [BioPython] Prosite / Prorule
In-Reply-To: <6243BAA9F5E0D24DA41B27997D1FD14402B660@mail2.exch.c2b2.columbia.edu> <473DBF04.3070509@maubp.freeserve.co.uk>
References: <20071116103856.GM8243@megaira.biochem.uni-erlangen.de> <6243BAA9F5E0D24DA41B27997D1FD14402B660@mail2.exch.c2b2.columbia.edu> <20071116103856.GM8243@megaira.biochem.uni-erlangen.de> <473DBF04.3070509@maubp.freeserve.co.uk>
Message-ID: <20071119092424.GC6177@megaira.biochem.uni-erlangen.de>

Hello Peter and Michiel,

* Peter wrote:
> holger.dinkel at gmail.com wrote:
>
> Could you file a bug and attach a small recent Prosite file which has this problem?

I have created a bug report (#2403) and also attached two files (a script to show
the error and a Prosite file 'prosite_test.dat').

> Note that the order in _scan_fns does matter.

Are you sure about that?
The definition of the '_scan_fns' list, which holds all callbacks to
Prosite entries, shows some 'redundancy'. This makes me think that the entries
are handled sequentially:

------------------------------------------------------------------------
    _scan_fns = [
        _scan_id,
        _scan_ac,
        _scan_dt,
        _scan_de,
        _scan_pa,
        _scan_ma,
        _scan_ru,
        _scan_nr,
        _scan_cc,

        # This is a really dirty hack, and should be fixed properly at
        # some point. ZN2_CY6_FUNGAL_2, DNAJ_2 in Rel 15 and PS50309
        # in Rel 17 have lines out of order. Thus, I have to rescan
        # these, which decreases performance.
        _scan_ma,
        _scan_nr,
        _scan_cc,

        _scan_dr,
        _scan_3d,
        _scan_do,
        _scan_terminator
        ]
------------------------------------------------------------------------

And while scanning Prosite records the function '_scan_record' simply iterates over the
_scan_fns entries:

------------------------------------------------------------------------
    def _scan_record(self, uhandle, consumer):
        consumer.start_record()
        for fn in self._scan_fns:
            fn(self, uhandle, consumer)
------------------------------------------------------------------------

>
> Not that I am aware of, however the SwissProt parser looks very similar, so we should be able to fix this without too much hassle.
>
> Thanks
>
> Peter

* Michiel De Hoon wrote:
>
> The Prosite parser was written about five years ago, and it may very well be
> that none of the currently active Biopython developers really know how this
> parser works. In that case, one option may be to write a new Prosite parser
> from scratch. That could even be an easier solution than trying to fix the
> existing parser. If you decide to go that way, it would be a good idea to
> discuss the Prosite parser design beforehand on the development mailing list
> (biopython-dev at biopython.org).
>
> --Michiel

Re-writing the parser might be the best choice here. Unfortunately, I do not have
much experience in writing parsers and also had quite a hard time trying to
understand what was going on in the Prosite RecordParser... 8-/

The way I THINK this should be done is some event-driven mechanism, where the
first letters of the scanned line determine what kind of information follows,
as opposed to iterating over a list (like in the current _scan_fns) and trying
to match each entry with the line...

Could you point me to a parser implementation which functions as a 'template' of
good parser design? Maybe I can merge it with the existing Prosite parser...
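
A minimal sketch of the event-driven idea described above, written for
Python 3. It is not Biopython code: parse_record, the handler names and the
sample lines are made up for illustration, and it assumes the usual layout of
a two-letter line code followed by data starting at column 6.

```python
# Sketch only: handler names and the record layout are illustrative.
def parse_record(lines):
    """Build a dict from one record by dispatching on each line's two-letter code."""
    record = {"id": None, "accession": None, "rules": [], "unhandled": []}

    def on_id(data):
        # Keep the entry name, drop the entry type that follows the semicolon.
        record["id"] = data.split(";")[0]

    def on_ac(data):
        record["accession"] = data.rstrip(";")

    def on_pr(data):
        record["rules"].append(data.rstrip(";"))

    # The line code selects the handler, so the order of lines is irrelevant.
    handlers = {"ID": on_id, "AC": on_ac, "PR": on_pr}

    for line in lines:
        code, data = line[:2], line[5:].strip()
        handler = handlers.get(code)
        if handler is None:
            # Unknown codes are collected instead of raising an error.
            record["unhandled"].append(line)
        else:
            handler(data)
    return record


example = [
    "ID   EXAMPLE_SITE; PATTERN.",
    "AC   PS99999;",
    "PR   PRU00498;",
    "ZZ   a field this sketch does not know",
]
print(parse_record(example))
```

With this shape, a newly introduced line type only needs one more entry in
the handlers dictionary, and an unknown line type no longer stops the parse.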

Thanks to all of you,
Holger

From mdehoon at c2b2.columbia.edu Mon Nov 19 04:49:23 2007
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Mon, 19 Nov 2007 04:49:23 -0500
Subject: [BioPython] Prosite / Prorule
References: <20071116103856.GM8243@megaira.biochem.uni-erlangen.de><6243BAA9F5E0D24DA41B27997D1FD14402B660@mail2.exch.c2b2.columbia.edu><20071116103856.GM8243@megaira.biochem.uni-erlangen.de><473DBF04.3070509@maubp.freeserve.co.uk> <20071119092424.GC6177@megaira.biochem.uni-erlangen.de>
Message-ID: <6243BAA9F5E0D24DA41B27997D1FD14402B664@mail2.exch.c2b2.columbia.edu>

> Re-writing the parser might be the best choice here. Unfortunately, I have not
> much experience in writing parsers and also had quite a hard time trying to
> understand what was going on in the Prosite RecordParser... 8-/
>
> The way I THINK this should be done, is some event-driven mechanism, where the
> first letters of the scanned line determine what kind of information follows.
> As compared to iterating over a list (like in the current _scan_fns) and trying
> to match each entry with the line...
>
> Could you point me to a parser-implementation which functions as a 'template' of
> good parser design. Maybe I can merge it with the existing Prosite-Parser...

You could have a look at the function "parse" in Bio/KEGG/Enzyme/__init__.py.
This is something I wrote for Biopython release 1.44, when it turned out
that the new version of mxTextTools caused the previous Bio/KEGG/Enzyme
parser to fail. At that time, I decided to write the parser from scratch
instead of trying to fix the existing parser (mainly because I didn't
understand how the existing parser worked). The result is a rather
straightforward parser.
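
As a rough illustration of how small a from-scratch record reader can be,
here is a Python 3 sketch. It is not the Bio.KEGG.Enzyme code: iter_records
and the sample text are invented for this example, and it assumes only that
each record ends with a line containing "///".

```python
import io


def iter_records(handle, terminator="///"):
    """Yield one record (a list of lines) at a time from a line-oriented handle."""
    record = []
    for line in handle:
        line = line.rstrip("\r\n")
        if line == terminator:
            # End of the current record: hand it to the caller and start afresh.
            yield record
            record = []
        elif line:
            record.append(line)
    if record:
        # Tolerate a final record that lacks the terminator line.
        yield record


# Made-up sample input with two very short records.
sample = "ENTRY       EC 1.1.1.1\nNAME        first\n///\nENTRY       EC 1.1.1.2\n///\n"
for rec in iter_records(io.StringIO(sample)):
    print(rec)
```

Because the function is a generator, only one record is held in memory at a
time, which matters for files that contain many records.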

Now, for KEGG it is possible that one file contains several KEGG.Enzyme
records. The "parse" function pulls them out one by one (using an
iterator). This is why the function has a "yield", and no "return" at the
end. From the user perspective, it works as follows:

    from Bio.KEGG import Enzyme
    handle = open("my_kegg_file_containing_lots_of_enzymes.txt")
    records = Enzyme.parse(handle)
    for record in records:
        # record is now one Bio.KEGG.Enzyme.Record instance
        # Do something with the record
        print(record)

For Prosite, I don't know if you can have several Prosite records
concatenated in one file. If so, you can use the same approach as for
the KEGG parser. If not, I guess a Prosite "parse" function should just
return one record directly. As in:

    from Bio import Prosite
    handle = open("my_prosite_file.txt")
    record = Prosite.parse(handle)
    # record is now one Bio.Prosite.Record instance

--Michiel.

From srini_iyyer_bio at yahoo.com Mon Nov 19 17:34:53 2007
From: srini_iyyer_bio at yahoo.com (Srinivas Iyyer)
Date: Mon, 19 Nov 2007 14:34:53 -0800 (PST)
Subject: [BioPython] windows : reading local blast output
Message-ID: <904468.67404.qm@web38109.mail.mud.yahoo.com>

Dear group,

I am using Python (2.4) and Biopython (1.44) on Windows. I installed a
local BLAST version for Windows.

The following code breaks down and throws the error pasted below for
convenience. This part of the code works when used on Linux-based
BLAST output. Obviously I suspect the '\r\n' for Windows.

Code:

    from Bio import Blast
    from Bio.Blast import NCBIStandalone
    blast_out = open('C:\human\prb_blast.out','U')
    result = []
    b_parser = NCBIStandalone.BlastParser()
    b_iterator = NCBIStandalone.Iterator(blast_out,b_parser)
    b_record = b_iterator.next()

I tried opening the file handle with the 'r', 'rU' and 'U' options, without
success. Could you help me here? I never had this issue before because I
never used Windows for BLAST.

Thanks
Srini

Error report:

>>> Traceback (most recent call last):
  File "C:\Python24\blast_parser.py", line 8, in ?
    b_record = b_iterator.next()
  File "C:\Python24\Lib\site-packages\Bio\Blast\NCBIStandalone.py", line 1553, in next
    return self._parser.parse(File.StringHandle(data))
  File "C:\Python24\Lib\site-packages\Bio\Blast\NCBIStandalone.py", line 746, in parse
    self._scanner.feed(handle, self._consumer)
  File "C:\Python24\Lib\site-packages\Bio\Blast\NCBIStandalone.py", line 99, in feed
    self._scan_rounds(uhandle, consumer)
  File "C:\Python24\Lib\site-packages\Bio\Blast\NCBIStandalone.py", line 229, in _scan_rounds
    self._scan_alignments(uhandle, consumer)
  File "C:\Python24\Lib\site-packages\Bio\Blast\NCBIStandalone.py", line 363, in _scan_alignments
    self._scan_pairwise_alignments(uhandle, consumer)
  File "C:\Python24\Lib\site-packages\Bio\Blast\NCBIStandalone.py", line 373, in _scan_pairwise_alignments
    self._scan_one_pairwise_alignment(uhandle, consumer)
  File "C:\Python24\Lib\site-packages\Bio\Blast\NCBIStandalone.py", line 385, in _scan_one_pairwise_alignment
    self._scan_hsp(uhandle, consumer)
  File "C:\Python24\Lib\site-packages\Bio\Blast\NCBIStandalone.py", line 420, in _scan_hsp
    self._scan_hsp_alignment(uhandle, consumer)
  File "C:\Python24\Lib\site-packages\Bio\Blast\NCBIStandalone.py", line 454, in _scan_hsp_alignment
    read_and_call_while(uhandle, consumer.noevent, blank=1)
  File "C:\Python24\Lib\site-packages\Bio\ParserSupport.py", line 314, in read_and_call_while
    line = safe_readline(uhandle)
  File "C:\Python24\Lib\site-packages\Bio\ParserSupport.py", line 411, in safe_readline
    raise SyntaxError, "Unexpected end of stream."
SyntaxError: Unexpected end of stream.

From mdehoon at c2b2.columbia.edu Mon Nov 19 19:57:59 2007
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Mon, 19 Nov 2007 19:57:59 -0500
Subject: [BioPython] windows : reading local blast output
References: <904468.67404.qm@web38109.mail.mud.yahoo.com>
Message-ID: <6243BAA9F5E0D24DA41B27997D1FD14402B666@mail2.exch.c2b2.columbia.edu>

It looks like you are trying to parse Blast plain-text output. It is not
necessarily related to the \r\n problem, it may be that you are running a
different Blast version on Windows. Differences between Blast versions tend
to break the plain-text Blast output parser. How about trying to parse Blast
output in XML format? See the tutorial for more information.

--Michiel.

Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032

From biopython at maubp.freeserve.co.uk Mon Nov 19 18:26:08 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 19 Nov 2007 23:26:08 +0000
Subject: [BioPython] windows : reading local blast output
In-Reply-To: <904468.67404.qm@web38109.mail.mud.yahoo.com>
References: <904468.67404.qm@web38109.mail.mud.yahoo.com>
Message-ID: <47421B90.9000904@maubp.freeserve.co.uk>

Srinivas Iyyer wrote:
> Dear group,
>
> I am using Python (2.4) and biopython(1.44) in windows. I installed a
> local blast version for windows.

Did you install Biopython 1.44 using the Windows installer?

What version of stand alone blast are you using?

> The following code breaks down and throws the error pasted below for
> convenience: This part of the code works when used on Linux based
> blast output. Obviously I suspect the '\r\n' for windows.

Could you file a bug [with the full version information], and then
upload the problem output file C:\human\prb_blast.out please? Then we
can try and reproduce the problem.

http://bugzilla.open-bio.org/enter_bug.cgi?product=Biopython

As to the new lines, if that is the problem, I would have expected
opening the handle in universal mode should have fixed it. Have you
tried experimenting with dos2unix and unix2dos on the file?

Also - could you try XML output rather than plain text? See the
tutorial for examples.
http://biopython.org/DIST/docs/tutorial/Tutorial.html

Peter

From biopython at maubp.freeserve.co.uk Fri Nov 16 11:02:12 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 16 Nov 2007 16:02:12 +0000
Subject: [BioPython] Prosite / Prorule
In-Reply-To: <20071116103856.GM8243@megaira.biochem.uni-erlangen.de>
References: <20071116103856.GM8243@megaira.biochem.uni-erlangen.de>
Message-ID: <473DBF04.3070509@maubp.freeserve.co.uk>

holger.dinkel at gmail.com wrote:
> Hello List,
>
> I just stumbled upon an error with the parsing of a 'newer' (>20) version of
> Prosite: Prosite introduced a new field called ProRules which cause errors
> in parsing with Bio/Prosite/__init__.py / Bio/ParserSupport.py.
> I updated biopython to 1.44, but the error persists.

Could you file a bug and attach a small recent Prosite file which has
this problem?

> I tried to figure out, where the problem lies, but I do not really understand
> the structure of the parsing modules in 'Bio/Prosite/__init__.py'
> I tried to create a new entry for the prorule:
> define a
>
>     def _scan_pr(self, uhandle, consumer):
>         self._scan_line('PR', uhandle, consumer.identification, up_to_one=1)
>
> add that to the '_scan_fns' and so on, but then the scanning order seems to get
> out of order, and i get a different "SyntaxError: Line does not start with ..."
> error...

Note that the order in _scan_fns does matter.

> Is the parsing mechanism described anywhere, so I can look it up and fix the error?

Not that I am aware of, however the SwissProt parser looks very
similar, so we should be able to fix this without too much hassle.
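
To illustrate why the order in _scan_fns matters, here is a toy scanner in
Python 3. It is a simplified sketch and not the Bio.Prosite code:
scan_in_order and the sample record lines are invented for illustration, and
unlike the real scanner it treats every line type as optional.

```python
# Toy model of a sequential line scanner; all names here are hypothetical.
def scan_in_order(lines, expected_order):
    """Consume `lines` by walking `expected_order` once, front to back.

    Every line code is optional, but once the scanner has moved past a
    code it never goes back to it.
    """
    pos = 0
    seen = []
    for code in expected_order:
        # Take every consecutive line that starts with the current code.
        while pos < len(lines) and lines[pos].startswith(code):
            seen.append(lines[pos])
            pos += 1
    if pos < len(lines):
        raise SyntaxError("Unexpected line: %r" % lines[pos])
    return seen


record = ["ID   EXAMPLE; PATTERN.", "PR   PRU00498;", "DO   PDOC99999;"]

# 'PR' is listed before 'DO', matching the file, so the record scans cleanly.
assert len(scan_in_order(record, ["ID", "PR", "DO"])) == 3

# 'PR' is listed after 'DO': the scanner has already moved past 'DO' by the
# time it reaches the 'DO' line, so that line is reported as unexpected.
try:
    scan_in_order(record, ["ID", "DO", "PR"])
except SyntaxError as err:
    print(err)
```

The same lines are accepted or rejected depending only on where the new
scan function is placed in the list.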

Thanks

Peter

From biopython at maubp.freeserve.co.uk Tue Nov 20 05:18:31 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 20 Nov 2007 10:18:31 +0000
Subject: [BioPython] Prosite / Prorule
In-Reply-To: <473DBF04.3070509@maubp.freeserve.co.uk>
References: <20071116103856.GM8243@megaira.biochem.uni-erlangen.de> <473DBF04.3070509@maubp.freeserve.co.uk>
Message-ID: <320fb6e00711200218i3d446510l6cc7009d4c9ed08b@mail.gmail.com>

> Could you file a bug and attach a small recent Prosite file which has
> this problem?

Holger reported bug 2403, which I believe I have fixed (having worked
with our SwissProt parser before, I found this quite straightforward):
http://bugzilla.open-bio.org/show_bug.cgi?id=2403

Peter

From holger.dinkel at gmail.com Tue Nov 20 07:35:15 2007
From: holger.dinkel at gmail.com (holger.dinkel at gmail.com)
Date: Tue, 20 Nov 2007 13:35:15 +0100
Subject: [BioPython] Prosite / Prorule
In-Reply-To: <320fb6e00711200218i3d446510l6cc7009d4c9ed08b@mail.gmail.com>
References: <20071116103856.GM8243@megaira.biochem.uni-erlangen.de> <473DBF04.3070509@maubp.freeserve.co.uk> <320fb6e00711200218i3d446510l6cc7009d4c9ed08b@mail.gmail.com>
Message-ID: <20071120123515.GA8723@megaira.biochem.uni-erlangen.de>

Hello Peter,

thank you very much for your really quick help! That bug is fixed! ;->

But alas, there are still some errors thrown when scanning the whole
prosite_20.dat (they only show up now since the other errors were fixed).
Firstly, the Prosite team had also introduced a new field called
"postprocessing", so now the parser chokes on that.
And secondly, the parser breaks at some special comment lines with author
names in them, of the form "CC /AUTHOR=K_Hofmann; N_Hulo" (Prosite-Acc PS50293):
the comments are split into columns and then parsed into values at the "=" sign.
As Mr. Hulo does not have a "/AUTHOR=" prepended, an error is raised...

I was able to fix the first problem in a straightforward way, as Peter did,
by inserting a postprocessing entry.
I could also solve the second problem, but only with a hack which might
not suit everybody. First, I split the line

    qual, data = [word.lstrip() for word in col.split("=")]

into two to avoid ValueErrors when unpacking:

    qual = [word.lstrip() for word in col.split("=")][0]
    data = ''.join([word.lstrip() for word in col.split("=")][1:])

and then I introduced a hack to circumvent the aforementioned problem: I changed

    if qual == '/TAXO-RANGE':

to

    if qual == 'N_Hulo':
        continue
    elif qual == '/TAXO-RANGE':

I know this is far from excellent, but crude enough to work ;->

If you'd like to incorporate at least the first changes, you can find the
'new' __init__.py file attached to bug #2403, as a whole file as well as a
patch. It successfully scans Prosite versions 18 to 20 (others not checked).
I could also send it to the list, but I am not sure if mails with
attachments are allowed here?

* Peter wrote:
>
> Holger reported bug 2403, which I believe I have fixed (having worked
> with our SwissProt parser before I found this quite straight forward):
> http://bugzilla.open-bio.org/show_bug.cgi?id=2403
>
> Peter

best wishes,
Holger

From arareko at campus.iztacala.unam.mx Thu Nov 22 11:37:24 2007
From: arareko at campus.iztacala.unam.mx (Mauricio Herrera Cuadra)
Date: Thu, 22 Nov 2007 10:37:24 -0600
Subject: [BioPython] [BioSQL-l] BioSQL : GenBank db_xref names in dbxref table
In-Reply-To: <320fb6e00711201136i6b3ca41eo8f6718e98f79c531@mail.gmail.com>
References: <320fb6e00711201136i6b3ca41eo8f6718e98f79c531@mail.gmail.com>
Message-ID: <4745B044.5090102@campus.iztacala.unam.mx>

Hi Peter,

In BioPerl, there's no such mapping for db_xref's that I'm aware of.
Each parser handles db_xref records on its own.
Take a look at the
Bio::SeqIO::genbank code, inside the next_seq() method for example:

http://code.open-bio.org/cgi-bin/viewcvs/viewcvs.cgi/bioperl-live/Bio/SeqIO/genbank.pm?rev=HEAD&content-type=text/vnd.viewcvs-markup

Regards,
Mauricio.

Peter wrote:
> Dear all,
>
> I'm one of the Biopython developers. I've recently got going with
> BioSQL and have been getting to grips with the Biopython BioSQL
> interface. I'm aware that we need to try and be consistent with
> BioPerl and BioJava, so I'd like to pose my first question related to
> that.
>
> When loading GenBank records, many features have db_xref qualifiers,
> e.g. from a random CDS feature in E. coli K12:
>
> /db_xref="ASAP:1309"
> /db_xref="GI:16128366"
> /db_xref="ECOCYC:EG10213"
> /db_xref="GeneID:945313"
>
> Bioython attempts to translate the strings "ASAP", "GI", "ECOCYC",
> "GeneID" before using recording these entries in the seqfeature_dbxref
> and dbxref tables. For example, "GI" becomes "GeneIndex".
> Biopython's current mapping is as follows:
>
> # Dictionary of database types, keyed by GenBank db_xref abbreviation
> db_dict = {'GeneID': 'Entrez',
>            'GI': 'GeneIndex',
>            'COG': 'COG',
>            'CDD': 'CDD',
>            'DDBJ': 'DNA Databank of Japan',
>            'Entrez': 'Entrez',
>            'GeneIndex': 'GeneIndex',
>            'PUBMED': 'PubMed',
>            'taxon': 'Taxon',
>            'ATCC': 'ATCC',
>            'ISFinder': 'ISFinder',
>            'GOA': 'Gene Ontology Annotation',
>            'ASAP': 'ASAP',
>            'PSEUDO': 'PSEUDO',
>            'InterPro': 'InterPro',
>            'GEO': 'Gene Expression Omnibus',
>            'EMBL': 'EMBL',
>            'UniProtKB/Swiss-Prot': 'UniProtKB/Swiss-Prot',
>            'ECOCYC': 'EcoCyc',
>            'UniProtKB/TrEMBL': 'UniProtKB/TrEMBL'
>            }
>
> In my testing, I've found several GenBank db_xref abbreviation for
> which we don't have a mapping defined, such as "LocusID", "dbSNP",
> "MGD", "MIM", or from an EMBL file, "REMTREMBL".
>
> I'd like to know if BioPerl and/or BioJava and/or BioRuby define a
> similar mapping in their BioSQL code (or GenBank parser), so that
> Biopython can follow your example.
>
> Thank you,
>
> Peter
>
> P.S. See also Biopython bug 2405
> http://bugzilla.open-bio.org/show_bug.cgi?id=2405
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biosql-l
>

--
MAURICIO HERRERA CUADRA
arareko at campus.iztacala.unam.mx
Laboratorio de Genética
Unidad de Morfofisiología y Función
Facultad de Estudios Superiores Iztacala, UNAM

From cjfields at uiuc.edu Thu Nov 22 19:42:12 2007
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 22 Nov 2007 18:42:12 -0600
Subject: [BioPython] [BioSQL-l] BioSQL : GenBank db_xref names in dbxref table
In-Reply-To: <4745B044.5090102@campus.iztacala.unam.mx>
References: <320fb6e00711201136i6b3ca41eo8f6718e98f79c531@mail.gmail.com> <4745B044.5090102@campus.iztacala.unam.mx>
Message-ID: <47D0EC6F-C34A-4AA8-97EE-478F2A5ADF62@uiuc.edu>

I think SeqIO checks the name for parsing reasons only, in cases where
the format changes based on the source (such as GenPept DBSOURCE
data). I don't think we go beyond that in Bioperl, probably b/c
modifying or expanding names for data persistence would lead to
volatile coding issues (i.e. consistency between parsers, constant
updating to cover new crossrefs, etc).

I would definitely suggest retaining the original DB as it appears in
the dbxref for consistency/sanity; if needed return expanded names
using a different method if they are designated.

chris

On Nov 22, 2007, at 10:37 AM, Mauricio Herrera Cuadra wrote:

> Hi Peter,
>
> In BioPerl, there's no such mapping for db_xref's that I'm aware of.
> Each parser handles db_xref records on its own. Take a look at the
> Bio::SeqIO::genbank code, inside the next_seq() method for example:
>
> http://code.open-bio.org/cgi-bin/viewcvs/viewcvs.cgi/bioperl-live/
> Bio/SeqIO/genbank.pm?rev=HEAD&content-type=text/vnd.viewcvs-markup
>
> Regards,
> Mauricio.
>
> Peter wrote:
>> Dear all,
>>
>> I'm one of the Biopython developers.
>> I've recently got going with
>> BioSQL and have been getting to grips with the Biopython BioSQL
>> interface. I'm aware that we need to try and be consistent with
>> BioPerl and BioJava, so I'd like to pose my first question related to
>> that.
>>
>> When loading GenBank records, many features have db_xref qualifiers,
>> e.g. from a random CDS feature in E. coli K12:
>>
>> /db_xref="ASAP:1309"
>> /db_xref="GI:16128366"
>> /db_xref="ECOCYC:EG10213"
>> /db_xref="GeneID:945313"
>>
>> Bioython attempts to translate the strings "ASAP", "GI", "ECOCYC",
>> "GeneID" before using recording these entries in the
>> seqfeature_dbxref
>> and dbxref tables. For example, "GI" becomes "GeneIndex".
>> Biopython's current mapping is as follows:
>>
>> # Dictionary of database types, keyed by GenBank db_xref abbreviation
>> db_dict = {'GeneID': 'Entrez',
>>            'GI': 'GeneIndex',
>>            'COG': 'COG',
>>            'CDD': 'CDD',
>>            'DDBJ': 'DNA Databank of Japan',
>>            'Entrez': 'Entrez',
>>            'GeneIndex': 'GeneIndex',
>>            'PUBMED': 'PubMed',
>>            'taxon': 'Taxon',
>>            'ATCC': 'ATCC',
>>            'ISFinder': 'ISFinder',
>>            'GOA': 'Gene Ontology Annotation',
>>            'ASAP': 'ASAP',
>>            'PSEUDO': 'PSEUDO',
>>            'InterPro': 'InterPro',
>>            'GEO': 'Gene Expression Omnibus',
>>            'EMBL': 'EMBL',
>>            'UniProtKB/Swiss-Prot': 'UniProtKB/Swiss-Prot',
>>            'ECOCYC': 'EcoCyc',
>>            'UniProtKB/TrEMBL': 'UniProtKB/TrEMBL'
>>            }
>>
>> In my testing, I've found several GenBank db_xref abbreviation for
>> which we don't have a mapping defined, such as "LocusID", "dbSNP",
>> "MGD", "MIM", or from an EMBL file, "REMTREMBL".
>>
>> I'd like to know if BioPerl and/or BioJava and/or BioRuby define a
>> similar mapping in their BioSQL code (or GenBank parser), so that
>> Biopython can follow your example.
>>
>> Thank you,
>>
>> Peter
>>
>> P.S.
>> See also Biopython bug 2405
>> http://bugzilla.open-bio.org/show_bug.cgi?id=2405
>> _______________________________________________
>> BioSQL-l mailing list
>> BioSQL-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biosql-l
>>
>
> --
> MAURICIO HERRERA CUADRA
> arareko at campus.iztacala.unam.mx
> Laboratorio de Genética
> Unidad de Morfofisiología y Función
> Facultad de Estudios Superiores Iztacala, UNAM
>
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biosql-l

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From biopython at maubp.freeserve.co.uk Sat Nov 24 04:16:49 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 24 Nov 2007 09:16:49 +0000
Subject: [BioPython] [BioSQL-l] BioSQL : GenBank db_xref names in dbxref table
In-Reply-To: <47D0EC6F-C34A-4AA8-97EE-478F2A5ADF62@uiuc.edu>
References: <320fb6e00711201136i6b3ca41eo8f6718e98f79c531@mail.gmail.com> <4745B044.5090102@campus.iztacala.unam.mx> <47D0EC6F-C34A-4AA8-97EE-478F2A5ADF62@uiuc.edu>
Message-ID: <320fb6e00711240116g4819fc81g202fda35801f19f2@mail.gmail.com>

Thank you Chris and Mauricio,

On 11/23/07, Chris Fields wrote:
> I think [BioPerl's] SeqIO checks the name for parsing reasons only, in
> cases where the format changes based on the source (such as GenPept
> DBSOURCE data). I don't think we go beyond that in Bioperl, probably
> b/c modifying or expanding names for data persistence would lead to
> volatile coding issues (i.e. consistency between parsers, constant
> updating to cover new crossrefs, etc).

And in Biopython's case, we get annoying warnings if it hasn't seen
the term before!
Which is why I filed Biopython bug 2405 in the first place :)
http://bugzilla.open-bio.org/show_bug.cgi?id=2405

> I would definitely suggest retaining the original DB as it appears in
> the dbxref for consistency/sanity; if needed return expanded names
> using a different method if they are designated.

Sounds good to me.

Peter

From bjoern.thorwirth at uni-due.de Tue Nov 27 03:13:35 2007
From: bjoern.thorwirth at uni-due.de (Björn Thorwirth)
Date: Tue, 27 Nov 2007 09:13:35 +0100
Subject: [BioPython] NCBIXML
Message-ID: <1196151215.3128.1.camel@mistery>

Hello List!

Today I ran into some trouble with the NCBIXML module. I've tested my
code on a 32-bit machine with Biopython-1.43, where it worked
flawlessly. On 64-bit I got this error with Biopython-1.43 / 1.44:

  File "/usr/lib/python2.4/site-packages/twisted/internet/threads.py", line 25, in _putResultInDeferred
    result = f(*args, **kwargs)
  File "/home/user/workspace/PLGDaemon/src/mmCIF/blast_util.py", line 56, in getResults
    for record in records:
  File "/home/user/Desktop/biopython/biopython-1.43/build/lib.linux-x86_64-2.4/Bio/Blast/NCBIXML.py", line 625, in parse
    assert len(blast_parser._records) == 0
exceptions.UnboundLocalError: local variable 'blast_parser' referenced before assignment

And here is the code which calls the NCBIXML module:

def getResults(self,resultFileHandle,err_handle):
    try:
        records=NCBIXML.parse(resultFileHandle)
    except Exception,e:
        self.IoErrorHandler(e, resultFileHandle, err_handle)
        raise
    bestScore=None
    bestExpect=None
    bestRes=None
    results=[]
    if records:
        for record in records:
            for alignment in record.alignments:
                resRec={}
                resRec['title']=alignment.title
                resRec['length']=alignment.length
                for hsp in alignment.hsps:
                    resRec['score']=hsp.score
                    resRec['expect']=hsp.expect
                    resRec['subj']=hsp.sbjct
                    resRec['obj']=hsp.query
                    resRec['match']=hsp.match
                    if self.debug:
                        print 'alignment.hsp:'
                        print hsp.score,hsp.expect,hsp.sbjct, hsp.match, hsp.query
                    results.append(resRec)
                    if bestScore==None:
                        bestScore=hsp.score
                        bestExpect=hsp.expect
                        bestRes=len(results)-1
                    elif hsp.score>bestScore:
                        bestScore=hsp.score
                        bestExpect=hsp.expect
                        bestRes=len(results)-1

Does anyone have an idea? Should I just catch the error?

Best regards,
Björn Thorwirth

From mdehoon at c2b2.columbia.edu Tue Nov 27 07:25:00 2007
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Tue, 27 Nov 2007 21:25:00 +0900
Subject: [BioPython] NCBIXML
In-Reply-To: <1196151215.3128.1.camel@mistery>
References: <1196151215.3128.1.camel@mistery>
Message-ID: <474C0C9C.70204@c2b2.columbia.edu>

Can you create a minimal code example that shows this bug?
From the Python traceback, it appears that the error does not occur in
NCBIXML but some place else. It would be good to isolate this bug to
find out where exactly the problem lies.

--Michiel.

Björn Thorwirth wrote:
> Hello List!
>
> Today i got some trouble with the NCBIXML module. I've tested my code on
> a 32-Bit machine with Biopython-1.43 where it worked flawless. On 64-Bit
> i got the this error with Biopython-1.43 / 1.44:
>
>

From bjoern.thorwirth at uni-due.de Tue Nov 27 08:51:22 2007
From: bjoern.thorwirth at uni-due.de (Björn Thorwirth)
Date: Tue, 27 Nov 2007 14:51:22 +0100
Subject: [BioPython] NCBIXML - Blast in- and output
Message-ID: <1196171482.6683.19.camel@mistery>

Hi Michiel!

Thanks for your fast response! I've used NCBIXML together with a
Twisted server. That's why the backtrace is a bit bloated.
But I guess these are the important lines:

  File "/home/user/workspace/PLGDaemon/src/mmCIF/blast_util.py", line 56, in getResults
    for record in records:
  File "/home/user/Desktop/biopython/biopython-1.43/build/lib.linux-x86_64-2.4/Bio/Blast/NCBIXML.py", line 625, in parse
    assert len(blast_parser._records) == 0
exceptions.UnboundLocalError: local variable 'blast_parser' referenced before assignment.

Now I was able to trace where the problem comes from. It may not be
related to 32/64 bit. It happens when Blast's calculation
of the "Karlin-Altschul parameters" fails. This may happen due to low
complexity of the query sequence (see Blast FAQ).
I've attached a tar with the Blast output and the reference and input
files. I didn't stumble over this problem before on 32-bit,
because that was a smaller sequence DB for testing purposes.

Björn

On Tuesday, 27.11.2007, at 21:25 +0900, Michiel de Hoon wrote:
> Can you create a minimal code example that shows this bug?
> From the Python traceback, it appears that the error does not occur in
> NCBIXML but some place else. It would be good to isolate this bug to
> find out where exactly the problem lies.
>
> --Michiel.
>
> Björn Thorwirth wrote:
> > Hello List!
> >
> > Today i got some trouble with the NCBIXML module. I've tested my code on
> > a 32-Bit machine with Biopython-1.43 where it worked flawless. On 64-Bit
> > i got the this error with Biopython-1.43 / 1.44:
> >
> >
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: BlastError.tar.gz
Type: application/x-compressed-tar
Size: 630 bytes
Desc: not available
Url : http://lists.open-bio.org/pipermail/biopython/attachments/20071127/788cb293/attachment.bin

From biopython at maubp.freeserve.co.uk Tue Nov 27 11:59:18 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 27 Nov 2007 16:59:18 +0000
Subject: [BioPython] NCBIXML - Blast in- and output
In-Reply-To: <1196171482.6683.19.camel@mistery>
References: <1196171482.6683.19.camel@mistery>
Message-ID: <320fb6e00711270859y228cc731gc3ced0f21d9d624@mail.gmail.com>

Hi again Björn,

> Now i was able to backtrace, where the Problem comes from. It may not be
> related to 32/64 Bit. It happens when Blast's calculation
> of the "Karlin-Altschul parameters" fails. This may happen due low
> complexity of the Query sequence (see Blast FAQ).
> I've attached a tar with the Blast-output and the reference and input
> files. I didn't stumble over this problem before on 32 Bit,
> because that was a smaller Sequenze-DB for testing purposes

It does sound like it's nothing to do with 32-bit versus 64-bit.

Could you file a bug and then attach the following files:
- reference.fasta
- input2.fasta
- output XML file (or say if it's empty)

Using Blast 2.2.16, I got an empty output file (with messages "Could
not calculate ungapped Karlin-Altschul parameters due to an invalid
query sequence or its translation. Please verify the query sequence(s)
and/or filtering options" on the error stream).

Using Blast 2.2.10, I got XML output including messages like
"BlastKarlinBlkGappedCalc: Gap existence and extension values of -1
and -1 not supported for PAM250" (also on the error output).

This was on a 64-bit Linux machine.

You haven't said what version of standalone Blast you have - as you
can see it does make a difference.
Peter

From biopython at maubp.freeserve.co.uk Tue Nov 27 05:27:43 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 27 Nov 2007 10:27:43 +0000
Subject: [BioPython] NCBIXML
In-Reply-To: <1196151215.3128.1.camel@mistery>
References: <1196151215.3128.1.camel@mistery>
Message-ID: <474BF11F.9050802@maubp.freeserve.co.uk>

Björn Thorwirth wrote:
> Hello List!
>
> Today i got some trouble with the NCBIXML module. I've tested my code on
> a 32-Bit machine with Biopython-1.43 where it worked flawless. On 64-Bit
> i got this error with Biopython-1.43 / 1.44:

Hi Björn.

That problem does sound odd. The NCBI XML parser is pure Python, so I
wouldn't have expected any problems with 32 vs 64 bit.

What else is different between the machines? e.g. operating system,
version of Python. Also, where is the XML file coming from - if it's
standalone Blast, could you check and tell us the version on each
machine.

You could also try running the test suite (included in the source for
Biopython) to see if that shows any difference between the two machines.

Also, to try and reproduce your parsing error, could you supply an
example Blast XML file that illustrates the problem (works on the 32-bit
computer, but not on the 64-bit computer)? The best way would be to
file a bug, and then attach the test case (two steps), rather than
trying to send an attachment on the mailing list.

Thanks,

Peter

From mdehoon at c2b2.columbia.edu Wed Nov 28 06:54:39 2007
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Wed, 28 Nov 2007 20:54:39 +0900
Subject: [BioPython] NCBIXML - Blast in- and output
In-Reply-To: <1196171482.6683.19.camel@mistery>
References: <1196171482.6683.19.camel@mistery>
Message-ID: <474D56FF.4020807@c2b2.columbia.edu>

Dear Björn,

Did you look at the Blast output file you are trying to parse?
It consists of lines like:

blastall] WARNING: 1OR8:E|PDBID|CHAIN|SEQUENCE: Could not calculate
ungapped Karlin-Altschul parameters due to an invalid query sequence or
its translation.
Please verify the query sequence(s) and/or filtering
options

and there is no actual Blast output. So I am not surprised the Blast
parser fails...

--Michiel.

Björn Thorwirth wrote:
> Hi Michiel!
>
> Thanks for your fast respose! I've used the NCBIXML together with a
> Twisted server. That's why the backtrace is a bit bloatet.
> But I guess this are the important lines:
>
> File "/home/user/workspace/PLGDaemon/src/mmCIF/blast_util.py", line 56,
> in getResults
> for record in records:
> File
> "/home/user/Desktop/biopython/biopython-1.43/build/lib.linux-x86_64-2.4/Bio/Blast/NCBIXML.py",
> line 625, in parse
> assert len(blast_parser._records) == 0
> exceptions.UnboundLocalError: local variable 'blast_parser' referenced
> before assignment.
>
> Now i was able to backtrace, where the Problem comes from. It may not be
> related to 32/64 Bit. It happens when Blast's calculation
> of the "Karlin-Altschul parameters" fails. This may happen due low
> complexity of the Query sequence (see Blast FAQ).
> I've attached a tar with the Blast-output and the reference and input
> files. I didn't stumble over this problem before on 32 Bit,
> because that was a smaller Sequenze-DB for testing purposes
>
> Björn

From bjoern.thorwirth at uni-due.de Wed Nov 28 07:50:54 2007
From: bjoern.thorwirth at uni-due.de (Björn Thorwirth)
Date: Wed, 28 Nov 2007 13:50:54 +0100
Subject: [BioPython] NCBIXML - Blast in- and output
In-Reply-To: <474D56FF.4020807@c2b2.columbia.edu>
References: <1196171482.6683.19.camel@mistery> <474D56FF.4020807@c2b2.columbia.edu>
Message-ID: <1196254254.2387.4.camel@mistery>

Hi Michiel!

I've filed a bug (#2412) like Peter suggested.

On Wednesday, 28.11.2007, at 20:54 +0900, Michiel de Hoon wrote:
> Dear Björn,
>
> Did you look at the Blast output file you are trying to parse?
> It consists of lines like:

I did look at it. It's the error output; the XML file is just empty.
>
> blastall] WARNING: 1OR8:E|PDBID|CHAIN|SEQUENCE: Could not calculate
> ungapped Karlin-Altschul parameters due to an invalid query sequence or
> its translation. Please verify the query sequence(s) and/or filtering
> options
>
> and there is no actual Blast output. So I am not surprised the Blast
> parser fails...
>

But shouldn't that be caught? And by the way, in my first post to the
list, I asked whether I should just catch the exception.
In my code I called NCBIXML like this:

try:
    records=NCBIXML.parse(resultFileHandle)
except Exception,e:
    self.IoErrorHandler(e, resultFileHandle, err_handle)
    raise
...
if records:
    for record in records:

That's where I got the exception. For me that is OK. I've just added an
exception handler around the "for record" loop, and everything works.
But I thought I should get the exception at initialization.

Best regards, and sorry for any inconvenience

Björn

> --Michiel.
>
> Björn Thorwirth wrote:
> > Hi Michiel!
> >
> > Thanks for your fast respose! I've used the NCBIXML together with a
> > Twisted server. That's why the backtrace is a bit bloatet.
> > But I guess this are the important lines:
> >
> > File "/home/user/workspace/PLGDaemon/src/mmCIF/blast_util.py", line 56,
> > in getResults
> > for record in records:
> > File
> > "/home/user/Desktop/biopython/biopython-1.43/build/lib.linux-x86_64-2.4/Bio/Blast/NCBIXML.py",
> > line 625, in parse
> > assert len(blast_parser._records) == 0
> > exceptions.UnboundLocalError: local variable 'blast_parser' referenced
> > before assignment.
> >
> > Now i was able to backtrace, where the Problem comes from. It may not be
> > related to 32/64 Bit. It happens when Blast's calculation
> > of the "Karlin-Altschul parameters" fails. This may happen due low
> > complexity of the Query sequence (see Blast FAQ).
> > I've attached a tar with the Blast-output and the reference and input
> > files.
> > I didn't stumble over this problem before on 32 Bit,
> > because that was a smaller Sequenze-DB for testing purposes
> >
> > Björn

From mdehoon at c2b2.columbia.edu Wed Nov 28 19:02:17 2007
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Thu, 29 Nov 2007 09:02:17 +0900
Subject: [BioPython] NCBIXML - Blast in- and output
In-Reply-To: <1196254254.2387.4.camel@mistery>
References: <1196171482.6683.19.camel@mistery> <474D56FF.4020807@c2b2.columbia.edu> <1196254254.2387.4.camel@mistery>
Message-ID: <474E0189.6080005@c2b2.columbia.edu>

Björn Thorwirth wrote:
> But shouldn't that beeing catched? And btw. in my first post to the
> list, i asked, if I just should catch the Exception.
> in my code i called the NCBIXML like this:
> try:
>     records=NCBIXML.parse(resultFileHandle)
> except Exception,e:
>     self.IoErrorHandler(e, resultFileHandle, err_handle)
>     raise
> ...
> if records:
>     for record in records:
> Thats where i got the Exception. For me is it ok.. I've just added an
> Exception handler around the "for record"-Loop, and everything is done.
> But i thought i schould get the exception by initialization.
>

The NCBIXML.parse call does not actually parse the file, it just sets up
the parser. The actual parsing is done when you call records.next(),
which is done implicitly in your for-loop. This approach allows
NCBIXML.parse to be used also for very large Blast output files, which
cannot be kept in memory as a whole. So the Exception handler should be
around the for-loop, not the parse.

--Michiel.

From karin.lagesen at medisin.uio.no Fri Nov 30 06:57:04 2007
From: karin.lagesen at medisin.uio.no (Karin Lagesen)
Date: Fri, 30 Nov 2007 12:57:04 +0100
Subject: [BioPython] ambiguous alphabets and alignments
Message-ID:

Hello.

I have used biopython on and off, and found it very good. I have now
however encountered an odd problem which I hope you can help me with.
I am working with alignments, and I do this:

>>> from Bio import Clustalw
>>> from Bio.Align import AlignInfo
>>> from Bio.Alphabet import IUPAC
>>> alignment = Clustalw.parse_file("align16S/AE000511_16S.aln", alphabet=IUPAC.IUPACAmbiguousDNA)
>>> summary_aln = AlignInfo.SummaryInfo(alignment)
>>> pssm = summary_aln.pos_specific_score_matrix()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/local/python/lib/python2.4/site-packages/Bio/Align/AlignInfo.py", line 368, in pos_specific_score_matrix
  File "/usr/local/python/lib/python2.4/site-packages/Bio/Align/AlignInfo.py", line 111, in dumb_consensus
  File "/usr/local/python/lib/python2.4/site-packages/Bio/Align/AlignInfo.py", line 203, in _guess_consensus_alphabet
ValueError: Could not determine the type of alphabet.
>>>

Now, to test what alphabet I am dealing with, I use code from
SummaryInfo:

>>> from Bio import Alphabet
>>> from Bio.Alphabet import IUPAC
>>> from Bio.Seq import Seq
>>> isinstance(summary_aln.alignment._records[0].seq.alphabet.alphabet, Alphabet.DNAAlphabet)
False
>>> summary_aln.alignment._records[0].seq.alphabet.alphabet
>>>

However, when I check the Alphabet class:

class IUPACAmbiguousDNA(Alphabet.DNAAlphabet):
    letters = IUPACData.ambiguous_dna_letters

it seems that the alphabet I load the alignment with is an extension of
DNAAlphabet, yet the isinstance check still fails.

I am pretty sure that this is somehow a misunderstanding on my side,
but I cannot figure this one out.

Thank you for your help!
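The isinstance check above can be reproduced in isolation. This is a minimal sketch, not Biopython code: DNAAlphabet and AmbiguousDNA are hypothetical stand-in classes defined in the snippet itself, the letters value is only a placeholder, and the snippet targets a current Python rather than the 2.4 interpreter used in the session. It shows how isinstance behaves when it is handed a class object instead of an instance of that class.

```python
class DNAAlphabet(object):
    """Stand-in base class (hypothetical, not Bio.Alphabet.DNAAlphabet)."""


class AmbiguousDNA(DNAAlphabet):
    """Stand-in subclass, mirroring the class definition quoted above."""
    letters = "GATCN"  # placeholder letters, for illustration only


# Handing isinstance() the class itself: a class object is an instance
# of 'type', not of DNAAlphabet.
class_check = isinstance(AmbiguousDNA, DNAAlphabet)

# Handing isinstance() an instance created by calling the class.
instance_check = isinstance(AmbiguousDNA(), DNAAlphabet)

# issubclass() is the question to ask about the class object itself.
subclass_check = issubclass(AmbiguousDNA, DNAAlphabet)

print(class_check, instance_check, subclass_check)
```

If the alphabet stored on the sequence turns out to be a class object rather than an instance, a False result from isinstance is what one would expect, even though the class does derive from DNAAlphabet.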
Karin
--
Karin Lagesen, PhD student
karin.lagesen at medisin.uio.no
http://folk.uio.no/karinlag

From biopython at maubp.freeserve.co.uk Fri Nov 30 10:14:16 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 30 Nov 2007 15:14:16 +0000
Subject: [BioPython] ambiguous alphabets and alignments
In-Reply-To:
References:
Message-ID: <320fb6e00711300714n6c5afc8r78c5949ecfb3104b@mail.gmail.com>

> >>> from Bio import Clustalw
> >>> from Bio.Align import AlignInfo
> >>> from Bio.Alphabet import IUPAC
> >>> alignment = Clustalw.parse_file("align16S/AE000511_16S.aln", alphabet=IUPAC.IUPACAmbiguousDNA)
> >>> summary_aln = AlignInfo.SummaryInfo(alignment)
> >>> pssm = summary_aln.pos_specific_score_matrix()

I think the problem is you are giving Clustalw an alphabet class
rather than an instance of the class. I am not at a computer with
Biopython installed right now, but I would guess you need to change
one line subtly:

alignment = Clustalw.parse_file("align16S/AE000511_16S.aln",
                                alphabet=IUPAC.IUPACAmbiguousDNA())

Peter

From karin.lagesen at medisin.uio.no Fri Nov 30 10:41:28 2007
From: karin.lagesen at medisin.uio.no (Karin Lagesen)
Date: Fri, 30 Nov 2007 16:41:28 +0100
Subject: [BioPython] ambiguous alphabets and alignments
In-Reply-To: <320fb6e00711300714n6c5afc8r78c5949ecfb3104b@mail.gmail.com> (biopython@maubp.freeserve.co.uk's message of "Fri, 30 Nov 2007 15:14:16 +0000")
References: <320fb6e00711300714n6c5afc8r78c5949ecfb3104b@mail.gmail.com>
Message-ID:

Peter writes:

>> >>> from Bio import Clustalw
>> >>> from Bio.Align import AlignInfo
>> >>> from Bio.Alphabet import IUPAC
>> >>> alignment = Clustalw.parse_file("align16S/AE000511_16S.aln", alphabet=IUPAC.IUPACAmbiguousDNA)
>> >>> summary_aln = AlignInfo.SummaryInfo(alignment)
>> >>> pssm = summary_aln.pos_specific_score_matrix()
>
> I think the problem is you are giving Clustalw an alphabet class
> rather than an instance of the class.
> I am not at a computer with
> Biopython installed right now, but I would guess you need to change
> one line subtly:
>
> alignment = Clustalw.parse_file("align16S/AE000511_16S.aln",
> alphabet=IUPAC.IUPACAmbiguousDNA())

Thanks! That was exactly the problem. It just didn't strike me that
this could be the problem:)

Karin
--
Karin Lagesen, PhD student
karin.lagesen at medisin.uio.no
http://folk.uio.no/karinlag

From mlds at unimelb.edu.au Thu Nov 1 21:52:27 2007
From: mlds at unimelb.edu.au (Mike Dyall-Smith)
Date: Fri, 2 Nov 2007 08:52:27 +1100
Subject: [BioPython] Cookbook:RenameFasta
In-Reply-To:
References:
Message-ID: <09384C7F-B174-4A84-9E89-FC3C0C4CB8D5@unimelb.edu.au>

Njm revised the code of Humberto at:
http://bio.scipy.org/wiki/index.php/RenameFastaSequences
to make it more idiomatic Python. I altered my discussion according to
the new code.
However, when I tested it on a fasta file I get an error:

  File "REnameFasta2.py", line 6, in
    filename, basename = sys.argv
ValueError: too many values to unpack

If I revert back to the old code for this line, i.e. replace
' filename, basename = sys.argv'
with
'filename = sys.argv[1] '
'basename = sys.argv[2]'

it works fine. I assume there is some subtle error in assigning two
names to sys.argv

Regards,
Mike D-S

From mdehoon at c2b2.columbia.edu Fri Nov 2 00:22:18 2007
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Thu, 1 Nov 2007 20:22:18 -0400
Subject: [BioPython] Cookbook:RenameFasta
References: <09384C7F-B174-4A84-9E89-FC3C0C4CB8D5@unimelb.edu.au>
Message-ID: <6243BAA9F5E0D24DA41B27997D1FD14402B650@mail2.exch.c2b2.columbia.edu>

> If I revert back to the old code for this line, ie. replace
> ' filename, basename = sys.argv'
> with
> 'filename = sys.argv[1] '
> 'basename = sys.argv[2]'
>
> it works fine.

There is also a sys.argv[0]. So sys.argv is a list with (at least) three
elements. If you do 'filename, basename = sys.argv', you have two
variables on the left-hand side and three values on the right-hand side.
Instead, you could do

temp, filename, basename = sys.argv
# And ignore temp subsequently

or

filename, basename = sys.argv[1:]

--Michiel.

Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032

-----Original Message-----
From: biopython-bounces at lists.open-bio.org on behalf of Mike Dyall-Smith
Sent: Thu 11/1/2007 5:52 PM
To: biopython at lists.open-bio.org
Subject: [BioPython] Cookbook:RenameFasta

Njm revised the code of Humberto at:
http://bio.scipy.org/wiki/index.php/RenameFastaSequences
to make it more idiomatic Python. I altered my discussion according to
the new code.
However, when I tested it on a fasta file I get an error:

  File "REnameFasta2.py", line 6, in
    filename, basename = sys.argv
ValueError: too many values to unpack

If I revert back to the old code for this line, ie. replace
' filename, basename = sys.argv'
with
'filename = sys.argv[1] '
'basename = sys.argv[2]'

it works fine. I assume there is some subtle error in assigning two
names to sys.argv

Regards,
Mike D-S
_______________________________________________
BioPython mailing list - BioPython at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython

From ericgibert at yahoo.fr Thu Nov 8 01:50:06 2007
From: ericgibert at yahoo.fr (Eric Gibert)
Date: Thu, 8 Nov 2007 01:50:06 +0000 (GMT)
Subject: [BioPython] from Bio import db
Message-ID: <413801.21408.qm@web26511.mail.ukl.yahoo.com>

Hello,

I just upgraded to BioPython 1.44 and when I try to run my previous
script, I have the error:

Traceback (most recent call last):
  File "/home/eric/workspace/PySeq/src/GenbankSearch.py", line 18, in
    parser = record_parser)
  File "/usr/lib64/python2.5/site-packages/Bio/GenBank/__init__.py", line 1283, in __init__
    from Bio import db
ImportError: cannot import name db

(I am on Fedora 7 64bit)

Any suggestions?

Thank you
Eric

From mdehoon at c2b2.columbia.edu Thu Nov 8 02:19:30 2007
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Wed, 7 Nov 2007 21:19:30 -0500
Subject: [BioPython] from Bio import db
References: <413801.21408.qm@web26511.mail.ukl.yahoo.com>
Message-ID: <6243BAA9F5E0D24DA41B27997D1FD14402B657@mail2.exch.c2b2.columbia.edu>

Dear Eric,

Some significant changes were needed in Biopython release 1.44 for
reasons of compatibility with the new version of mxTextTools.
Unfortunately, as you found, some code may break as a result.
In your code:

> File "/home/eric/workspace/PySeq/src/GenbankSearch.py", line 18, in
> parser = record_parser)
> File "/usr/lib64/python2.5/site-packages/Bio/GenBank/__init__.py",
> line 1283, in __init__
> from Bio import db
> ImportError: cannot import name db

It looks like we missed an "import db" statement in Bio/GenBank/__init__.py. Can you show us the code leading up to this point? I'm guessing that you are trying to use NCBIDictionary, but it would be helpful to see exactly how you are trying to use it, so that we can come up with a solution.

Sorry for the trouble.

--Michiel.

Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032

From mdehoon at c2b2.columbia.edu Thu Nov 8 03:44:14 2007
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Wed, 7 Nov 2007 22:44:14 -0500
Subject: [BioPython] FW: Re : from Bio import db
References: <881492.27912.qm@web26503.mail.ukl.yahoo.com>
Message-ID: <6243BAA9F5E0D24DA41B27997D1FD14402B658@mail2.exch.c2b2.columbia.edu>

Eric's answer pasted below...

-----Original Message-----
From: Eric Gibert [mailto:ericgibert at yahoo.fr]
Sent: Wed 11/7/2007 10:09 PM
To: Michiel De Hoon
Subject: Re : [BioPython] from Bio import db

Dear Michiel,

Here is the code. Sorry for the trouble. BTW, your guess is correct: NCBIDictionary...

Here is my "debugging" script:
--------------------------------------------------------------
import Bio
from Bio import GenBank
from BioSQL import BioSeqDatabase
import BioSQL.BioSeq

list_to_load = []

def iterSeq():
    for s in list_to_load:
        yield s

if __name__ == '__main__':
    gi_list = GenBank.search_for("Archineura")
    record_parser = GenBank.FeatureParser()
    ncbi_dict = GenBank.NCBIDictionary("nucleotide", "genbank",
                                       parser = record_parser) # <--- this is the line 18 causing the error
    biodb = BioSeqDatabase.open_database(driver = "MySQLdb", user = "yyyyy",
                                         passwd = "xxxxx", host = "localhost", db = "BioSQL")
    selectedDb = biodb["allOdonata"]
    # gi_list.append('57282195')
    # gi_list.append('AJ459224')
    for seq in gi_list:
        gb_record = ncbi_dict[seq]
        print gb_record.id,
        try:
            db_record = selectedDb.lookup(version = gb_record.id)
            print "already present"
        except IndexError:
            print "absent", seq
            list_to_load.append(gb_record)
    #print list_to_load
    itS = iterSeq()
    selectedDb.load(itS)
----------------------------------------------------------------

Looking at the previous version of the script did not help me: there, the "import" is not in the def but in the script header; apart from that, there is no difference in the 3 statements of the NCBIDictionary class.

I hope you will find a solution.

Thank you,
Eric

From ericgibert at yahoo.fr Thu Nov 8 11:07:02 2007
From: ericgibert at yahoo.fr (Eric Gibert)
Date: Thu, 8 Nov 2007 11:07:02 +0000 (GMT)
Subject: [BioPython] small "bug" correction in package BioSql
Message-ID: <823663.39006.qm@web26515.mail.ukl.yahoo.com>

Dear all,

In BioSeq/BioSeq.py, in the class DBSeq definition, we have the function:

def _retrieve_seq(adaptor, primary_id):
    seqs = adaptor.execute_and_fetchall(
        "SELECT alphabet, length(seq) FROM biosequence" \
        " WHERE bioentry_id = %s", (primary_id,))
    if seqs:
        moltype, length = seqs[0]
        moltype = moltype.lower() # <-- EG as "DNA" is found in my database!
        from Bio.Alphabet import IUPAC
        if moltype == "dna":
            alphabet = IUPAC.unambiguous_dna
        elif moltype == "rna":
            alphabet = IUPAC.unambiguous_rna
        elif moltype == "protein":
            alphabet = IUPAC.protein
        else:
            raise AssertionError("Unknown moltype: %s" % moltype)
        seq = DBSeq(primary_id, adaptor, alphabet, 0, int(length))
        return seq
    else:
        return None

Please note my correction: force moltype to lower case, as my database has upper-case values, which raise the "Unknown moltype" error.

Alternatively, we could have the SQL statement return a lower-case version of "alphabet", but I do not know if such a function is standard across all databases...

It might be good to add this to the standard package.

Eric

From biopython at maubp.freeserve.co.uk Thu Nov 8 11:40:00 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 08 Nov 2007 11:40:00 +0000
Subject: [BioPython] small "bug" correction in package BioSql
In-Reply-To: <823663.39006.qm@web26515.mail.ukl.yahoo.com>
References: <823663.39006.qm@web26515.mail.ukl.yahoo.com>
Message-ID: <4732F590.6050505@maubp.freeserve.co.uk>

Eric Gibert wrote:
> Dear all,
>
> In BioSeq/BioSeq.py, in the class DBSeq definition, we have the
> function:
>
> ...
>
> please note my correction: force moltype to be turn in lower case as
> my database has upper case value! this raises the "Unknown moltype"
> error.

Hi Eric, I've made your suggested change in CVS, biopython/BioSQL/BioSeq.py revision 1.13, thank you.

I would encourage you to investigate why some of the "alphabet" fields in the biosequence table are in upper case. There could be a bug elsewhere which is writing these entries with the wrong alphabet. Is this affecting all entries, or just some?

Peter

From ericgibert at yahoo.fr Thu Nov 8 11:49:12 2007
From: ericgibert at yahoo.fr (Eric Gibert)
Date: Thu, 8 Nov 2007 11:49:12 +0000 (GMT)
Subject: [BioPython] Re : small "bug" correction in package BioSql
Message-ID: <762277.43372.qm@web26507.mail.ukl.yahoo.com>

Dear Peter,

All the alphabets are "DNA" (upper case) in my database. The sequences are taken from NCBI by a BioJava application. Thus it should be BioJava that inserts the records with "DNA", so there is no potential "hidden bug" in BioPython.

Maybe a point to share with the Open-Bio committee.

Eric

From ericgibert at yahoo.fr Thu Nov 8 12:36:46 2007
From: ericgibert at yahoo.fr (Eric Gibert)
Date: Thu, 8 Nov 2007 12:36:46 +0000 (GMT)
Subject: [BioPython] Re : from Bio import db
Message-ID: <206354.77387.qm@web26512.mail.ukl.yahoo.com>

Dear Peter,

Yes, this fixes the error, *thank you*.
NB: the line is 1283, not 1293 (a small typo, but maybe important for future reference, I do not know).

Then I have subsequent errors, unrelated to the current topic [which is fixed]. Let me first investigate before sending another mail, on another topic/bug report.

Eric

From biopython at maubp.freeserve.co.uk Thu Nov 8 12:14:18 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 08 Nov 2007 12:14:18 +0000
Subject: [BioPython] from Bio import db
In-Reply-To: <6243BAA9F5E0D24DA41B27997D1FD14402B657@mail2.exch.c2b2.columbia.edu>
References: <413801.21408.qm@web26511.mail.ukl.yahoo.com> <6243BAA9F5E0D24DA41B27997D1FD14402B657@mail2.exch.c2b2.columbia.edu>
Message-ID: <4732FD9A.3080404@maubp.freeserve.co.uk>

>> ImportError: cannot import name db
>
> It looks like we missed an "import db" statement in Bio/GenBank/__init__.py.
> Can you show us the code leading up to this point? I'm guessing that you are
> trying to use NCBIDictionary, but it would be helpful to see how exactly you
> are trying to use it, so that we can come up with a solution.
>
> Sorry for the trouble.

I think the problem was introduced in Biopython 1.44 by disabling some "magic" code in Bio/__init__.py which created Bio.db at runtime, which was then imported in Bio/GenBank/__init__.py

The following one-line patch to the NCBIDictionary class in Bio/GenBank/__init__.py seems to fix this:

diff -r1.77 __init__.py
1283c1283
< from Bio import db
---
> from Bio.config.DBRegistry import db

i.e. change line 1293 from "from Bio import db" to "from Bio.config.DBRegistry import db" in the NCBIDictionary class.

Eric, could you try this on your setup?

Peter

Note to self: We need to add the NCBIDictionary to the GenBank unit test

From ericgibert at yahoo.fr Thu Nov 8 14:21:31 2007
From: ericgibert at yahoo.fr (Eric Gibert)
Date: Thu, 8 Nov 2007 14:21:31 +0000 (GMT)
Subject: [BioPython] Bug in BioSQL/Loader.py
Message-ID: <353717.53467.qm@web26514.mail.ukl.yahoo.com>

Dear all,

Bug 1) I noticed that the SQL statement "INSERT INTO bioentry...." in line 229 is missing one %s. I added it and it works fine... until the bug on the next command:

    bioentry_id = self.adaptor.last_id('bioentry')

which triggers the long-standing bug 2 in DBUtils.py line 34: MySQL is now using lastrowid and no longer insert_id()

Correction as per below:
---------------------------------------------
class Mysql_dbutils(Generic_dbutils):
    def last_id(self, cursor, table):
        return cursor.lastrowid # <-- EG original command: cursor.insert_id()
_dbutils["MySQLdb"] = Mysql_dbutils
-----------------------------------------------

This then leads me to the following bug 3 --- or maybe this is *not* a bug? --- Let me explain:

In my BioSQL database, the table 'seqfeature_qualifier_value' has the following schema:

    seqfeature_id  int(11)
    term_id        int(11)
    value          varchar(255)
    rank           int(11)

Note that 'value' comes first, then 'rank'.

But the 'INSERT INTO seqfeature_qualifier_value' statement found in BioSQL/Loader.py line 453 is:

    qualifier_value = qualifiers[qualifier_key][qual_value_rank]
    sql = r"INSERT INTO seqfeature_qualifier_value VALUES" \
          r" (%s, %s, %s, %s)"
    self.adaptor.execute(sql, (seqfeature_id,
                               qualifier_key_id,
                               qual_value_rank + 1,
                               qualifier_value))

thus I need to swap the last two parameters. As I said, I do not know if my BioSQL schema is correct or not.
If my schema is correct then my correction is obvious:

    self.adaptor.execute(sql, (seqfeature_id,
                               qualifier_key_id,
                               qualifier_value, # EG invert the two last params
                               qual_value_rank + 1))

------------------

Finally, the script executes without error and .... nothing happens! It looks like there is no 'commit' anywhere, so the new records are not inserted in the database.

Although the psycopg database enjoys a:

    def autocommit(self, conn, y = True):
        conn.autocommit(y)
_dbutils["psycopg"] = Psycopg_dbutils

MySQL does not have such an overload for 'autocommit' in DBUtils.py. Could this fix the problem?

In the file MySQLdb/connections.py, on line 213, we have:

    # PEP-249 requires autocommit to be initially off
    self.autocommit(False)

Therefore the source for the Mysql_dbutils class is now:

class Mysql_dbutils(Generic_dbutils):
    def last_id(self, cursor, table):
        return cursor.lastrowid #EG original command: cursor.insert_id()
    def autocommit(self, conn, y = True): # EG addition as by default it is set to False
        conn.autocommit(y)
_dbutils["MySQLdb"] = Mysql_dbutils

Unfortunately, this is *NOT* fixing the lack of 'commit'. I need your help...

Best regards,
Eric

From hlapp at gmx.net Thu Nov 8 15:53:03 2007
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 8 Nov 2007 10:53:03 -0500
Subject: [BioPython] small "bug" correction in package BioSql
In-Reply-To: <762277.43372.qm@web26507.mail.ukl.yahoo.com>
References: <762277.43372.qm@web26507.mail.ukl.yahoo.com>
Message-ID:

Indeed Biojava uses uppercase for alphabet. In Bioperl-db, we explicitly lowercase the value found for alphabet, and the comment says why:

    # Note: Biojava uses upper-case terms for alphabet, so we
    # need to change to all-lower in case the sequence was
    # manipulated by Biojava.
    $obj->alphabet(lc($rows->[3])) if $rows->[3];

However, when inserting sequences, we leave the value as is in BioPerl (which is lowercase), leading to a potential problem for Biojava upon retrieval. Do the Biojava folks deal with that? Should this be harmonized across the board?

-hilmar

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From hlapp at gmx.net Thu Nov 8 15:59:49 2007
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 8 Nov 2007 10:59:49 -0500
Subject: [BioPython] Bug in BioSQL/Loader.py
In-Reply-To: <353717.53467.qm@web26514.mail.ukl.yahoo.com>
References: <353717.53467.qm@web26514.mail.ukl.yahoo.com>
Message-ID:

On Nov 8, 2007, at 9:21 AM, Eric Gibert wrote:

> qualifier_value = qualifiers[qualifier_key][qual_value_rank]
> sql = r"INSERT INTO seqfeature_qualifier_value VALUES" \
>       r" (%s, %s, %s, %s)"

Not enumerating the columns in the INSERT clause is dangerous programming, I think. This should be fixed, and should be fixed for all statements where it is an issue. Though BioSQL has been (and will likely continue to be) extremely stable compared to many other schemas, it would be a shame if even adding columns weren't possible because some of the Bio* projects don't enumerate the columns explicitly.

My $0.02.

-hilmar

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From biopython at maubp.freeserve.co.uk Thu Nov 8 15:39:13 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 08 Nov 2007 15:39:13 +0000
Subject: [BioPython] Bug in BioSQL/Loader.py
In-Reply-To: <353717.53467.qm@web26514.mail.ukl.yahoo.com>
References: <353717.53467.qm@web26514.mail.ukl.yahoo.com>
Message-ID: <47332DA1.8060002@maubp.freeserve.co.uk>

Eric Gibert wrote:
> Dear all,
>
> ...
> in DBUtils.py line 34: MySQL is now using lastrowid and no longer insert_id()
>
> Correction as per below:
> ---------------------------------------------
> class Mysql_dbutils(Generic_dbutils):
>     def last_id(self, cursor, table):
>         return cursor.lastrowid # <-- EG original command: cursor.insert_id()
> _dbutils["MySQLdb"] = Mysql_dbutils
> -----------------------------------------------

Sounds like one of the issues raised on bug 2390
http://bugzilla.open-bio.org/show_bug.cgi?id=2390

Nice to have my suggested fix confirmed; I'll probably check that in tonight. The auto-commit thing looks relevant too...

Peter

From biopython at maubp.freeserve.co.uk Thu Nov 8 16:56:53 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 08 Nov 2007 16:56:53 +0000
Subject: [BioPython] Bug in BioSQL/Loader.py
In-Reply-To:
References: <353717.53467.qm@web26514.mail.ukl.yahoo.com>
Message-ID: <47333FD5.2030504@maubp.freeserve.co.uk>

Hilmar Lapp wrote:
> On Nov 8, 2007, at 9:21 AM, Eric Gibert wrote:
>
>> qualifier_value = qualifiers[qualifier_key][qual_value_rank]
>> sql = r"INSERT INTO seqfeature_qualifier_value VALUES" \
>>       r" (%s, %s, %s, %s)"
>
> Not enumerating the columns in the INSERT clause is dangerous
> programming I think. This should be fixed, and should be fixed for
> all statements where it is an issue.

I agree with you 100% on this issue.
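[Editorial sketch] The point about enumerating columns can be illustrated with a small stand-alone example. This is only a sketch, not the BioSQL loader code: it uses Python's standard sqlite3 module instead of MySQL (so the placeholder is "?" rather than "%s"), the in-memory table merely mirrors the column order Eric reported, and the sample values are invented.

```python
import sqlite3

# In-memory stand-in for the table, with the column order from Eric's schema
# (value before rank). This is an assumption for illustration, not the
# official BioSQL schema definition.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE seqfeature_qualifier_value ("
    " seqfeature_id INTEGER, term_id INTEGER, value TEXT, rank INTEGER)"
)

# Naming the columns makes the statement independent of their physical order,
# so rank and value cannot be silently swapped.
sql = (
    "INSERT INTO seqfeature_qualifier_value"
    " (seqfeature_id, term_id, rank, value)"
    " VALUES (?, ?, ?, ?)"
)
conn.execute(sql, (1, 7, 1, "example qualifier"))  # invented sample values
conn.commit()

row = conn.execute(
    "SELECT seqfeature_id, term_id, value, rank FROM seqfeature_qualifier_value"
).fetchone()
print(row)
```

With the column list spelled out, the same statement would keep working whether a given database declares 'value' before 'rank' or the other way round.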
As I still haven't made the time to set up a BioSQL database on my machine, I would be grateful if someone could check the patch on newly filed Bug 2394, http://bugzilla.open-bio.org/show_bug.cgi?id=2394

Thanks

Peter

From ericgibert at yahoo.fr Thu Nov 8 17:50:44 2007
From: ericgibert at yahoo.fr (Eric Gibert)
Date: Thu, 8 Nov 2007 17:50:44 +0000 (GMT)
Subject: [BioPython] error on insert new sequences from GenBank: no annotations saved in BioSQL database
Message-ID: <499834.44468.qm@web26501.mail.ukl.yahoo.com>

Dear all,

When I retrieve a BioSQL.BioSeq.DBSeqRecord which was inserted previously by my BioJava application, I have:

    print "Debug on Seq:", Seq.id, "=", Seq.annotations.keys()

Debug on Seq: AJ459190.1 = ['ORIGIN', 'DIVISION', 'genbank_accessions', 'TITLE', 'cross_references', 'data_file_division', 'VERSION', 'references', 'KEYWORDS', 'GI', 'SIZE', 'DEFINITION', 'REFERENCE', 'MDAT', 'ORGANISM', 'JOURNAL', 'ACCESSION', 'LOCUS', 'SOURCE', 'PUBMED', 'AUTHORS', 'TYPE', 'CIRCULAR']

but a BioSeq freshly inserted by BioPython 1.44 only gives me:

Debug on Seq: EF631597.1 = ['cross_references', 'dates', 'references', 'gi', 'data_file_division']

When I look in the table bioentry_qualifier_value, I find:

* 20 records for a Sequence imported by BioJava
* only 1 for a Sequence inserted by BioPython: the date, which should be inserted by "_load_bioentry_date" in BioSQL/Loader.py

Quite a few annotations missing, no?

Any idea?

Eric

From biopython at maubp.freeserve.co.uk Thu Nov 8 19:18:47 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 08 Nov 2007 19:18:47 +0000
Subject: [BioPython] error on insert new sequences from GenBank: no annotations saved in BioSQL database
In-Reply-To: <499834.44468.qm@web26501.mail.ukl.yahoo.com>
References: <499834.44468.qm@web26501.mail.ukl.yahoo.com>
Message-ID: <47336117.2010102@maubp.freeserve.co.uk>

Eric Gibert wrote:
> Once I look in the table bioentry_qualifier_value
>
> * 20 records for a Sequence imported by BioJava
> * 1 only for a Sequence inserted by BioPython: the date which should be inserted by "_load_bioentry_date" in BioSQL/Loader.py
>
> Quite a few annotations missing, no?
>
> Any idea?

So Biopython is recording nothing in table bioentry_qualifier_value (apart from the date), but is recording other essential things (in other tables) like the sequence itself?

Could you double check your schema? Judging from the issue I filed as bug 2394 based on your earlier email, your schema doesn't seem to be up to date:
http://bugzilla.open-bio.org/show_bug.cgi?id=2394

Peter

From lists.steve at arachnedesign.net Thu Nov 8 19:32:09 2007
From: lists.steve at arachnedesign.net (Steve Lianoglou)
Date: Thu, 8 Nov 2007 14:32:09 -0500
Subject: [BioPython] Compiling from CVS on OS X
Message-ID:

Hi all,

I was having problems compiling biopython from source, specifically getting the Bio/Cluster/clustermodule.c file to compile.

The problem was that the system wasn't finding the `Numeric/arrayobject.h` file for inclusion. I "fixed" it by editing the setup.py file and adding '/opt/local/include/python2.4' to the include_dirs param on line 474 so it would pick up the files in my python install (that's just where the Numeric header was installed by macports).

Is this the expected way to achieve this, or is there some environment variable or site.cfg to tweak to do this correctly (or is my python install wacky from the get-go?)
FWIW I'm using python 2.4 installed via macports.

Thanks,
-steve

From hlapp at gmx.net Thu Nov 8 20:28:19 2007
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 8 Nov 2007 15:28:19 -0500
Subject: [BioPython] error on insert new sequences from GenBank: no annotations saved in BioSQL database
In-Reply-To: <499834.44468.qm@web26501.mail.ukl.yahoo.com>
References: <499834.44468.qm@web26501.mail.ukl.yahoo.com>
Message-ID:

Maybe we need to hold some mini-hackathon to make the different toolkits compatible in how they map annotation to the schema.

Obviously I don't know whether you have the latest Biojava setup here, but I'll just comment on how BioPerl/Bioperl-db would map this:

'ORIGIN' - if I'm not mistaken this is only a token that introduces the actual sequence. I'm not sure what Biojava is storing as value here.

'DIVISION' - this maps to column division in table bioentry (though I agree that if perfectly following the weak typing principle this should be a tag/value association, but at present it's still an actual column)

'genbank_accessions' - secondary accession numbers indeed go into the qualifier value table. The primary accession maps to column accession in table bioentry

'TITLE' - this is part of a publication reference, and should map to column title in table reference (which it does in bioperl-db)

'cross_references' - not sure where these would be coming from in GenBank format; for EMBL this will map to the dbxref table

'data_file_division' - not sure what this is (same as DIVISION?)

'VERSION' - in BioPerl we parse this apart into a version for the accession (which is column version in table bioentry) and the GI number, which maps to column identifier in table bioentry

'references' - these map to table reference (and bioentry_reference for association with the bioentry)

'KEYWORDS' - indeed these map to bioentry_qualifier_value

'GI' - maps to column identifier in table bioentry

'SIZE' - not sure what size that is. If it is the length of the sequence, it should (and in BioPerl/bioperl-db does) map to column length in table biosequence

'DEFINITION' - maps to column description in table bioentry

'REFERENCE' - should be the same as for 'references'

'MDAT' - not sure what this is

'ORGANISM' - this is the organism and maps to the table taxon (and taxon_name), with a foreign key in bioentry pointing to the taxon

'JOURNAL' - this is part of a reference, see 'references'

'ACCESSION' - the primary accession, maps to column accession in table bioentry

'LOCUS' - in the file itself this is an entire line consisting of multiple fields; BioPerl/bioperl-db maps the locus name (the first token after the literal token LOCUS) to column name in table bioentry

'SOURCE' - this is the organism, see 'ORGANISM'

'PUBMED' - this is part of a literature reference, and maps to a foreign key in the reference table (reference.dbxref) to a dbxref entry with PUBMED or PMID as the database and the pubmed ID as the accession

'AUTHORS' - part of a literature reference, maps to column authors in table reference

'TYPE' - not sure what this is. If it's the alphabet, it maps to table biosequence, column alphabet

'CIRCULAR' - this at present indeed maps to bioentry_qualifier_value, though there have been plans to make it a column in table biosequence. Note that this could in fact be the way Biojava stores it too, but upon retrieval represents it in the way you are seeing it.
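[Editorial sketch] The mapping described above can be condensed into a small lookup table. This only restates the description of BioPerl/bioperl-db behaviour given in the message and is not code from any of the Bio* projects; the names BIOSQL_TARGETS and describe_target are hypothetical, and keys whose meaning was uncertain are left out.

```python
# (table, column) targets as described for BioPerl/bioperl-db in the message.
# A column of None means the value is stored as rows in that table rather
# than in a single named column. The dictionary itself is illustrative only.
BIOSQL_TARGETS = {
    "DIVISION": ("bioentry", "division"),
    "ACCESSION": ("bioentry", "accession"),
    "GI": ("bioentry", "identifier"),
    "DEFINITION": ("bioentry", "description"),
    "LOCUS": ("bioentry", "name"),
    "TITLE": ("reference", "title"),
    "AUTHORS": ("reference", "authors"),
    "KEYWORDS": ("bioentry_qualifier_value", None),
    "genbank_accessions": ("bioentry_qualifier_value", None),
    "CIRCULAR": ("bioentry_qualifier_value", None),
    "ORGANISM": ("taxon", None),
    "SOURCE": ("taxon", None),
}


def describe_target(key):
    """Return a readable description of where an annotation key is stored."""
    if key not in BIOSQL_TARGETS:
        return "%s: not covered by this sketch" % key
    table, column = BIOSQL_TARGETS[key]
    if column is None:
        return "%s: rows in table %s" % (key, table)
    return "%s: column %s in table %s" % (key, column, table)


for key in ("GI", "KEYWORDS", "MDAT"):
    print(describe_target(key))
```

A table like this might serve as a starting point for comparing how each toolkit maps the same GenBank fields, which is the compatibility question raised at the top of the message.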
Hth, -hilmar On Nov 8, 2007, at 12:50 PM, Eric Gibert wrote: > Dear all, > > When I retrieve a BioSQL.BioSeq.DBSeqRecord which was inserted > previously by my BioJava application, I have: > > print "Debug on Seq:", Seq.id, "=", Seq.annotations.keys() > > Debug on Seq: AJ459190.1 = ['ORIGIN', 'DIVISION', > 'genbank_accessions', 'TITLE', 'cross_references', > 'data_file_division', 'VERSION', 'references', 'KEYWORDS', 'GI', > 'SIZE', 'DEFINITION', 'REFERENCE', 'MDAT', 'ORGANISM', 'JOURNAL', > 'ACCESSION', 'LOCUS', 'SOURCE', 'PUBMED', 'AUTHORS', 'TYPE', > 'CIRCULAR'] > > but a freshly inserted BioSeq by BioPython 1.44 only gives me: > Debug on Seq: EF631597.1 = ['cross_references', 'dates', > 'references', 'gi', 'data_file_division'] > > > Once I look in the table bioentry_qualifier_value > > * 20 records for a Sequence imported by BioJava > * 1 only for a Sequence inserted by BioPython: the date which > should be inserted by "_load_bioentry_date" in BioSQL/Loader.py > > Quite a few annotations missing, no? > > Any idea? > > Eric > > > > > > ______________________________________________________________________ > _______ > Ne gardez plus qu'une seule adresse mail ! Copiez vos mails vers > Yahoo! 
Mail > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Thu Nov 8 20:30:29 2007 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 8 Nov 2007 15:30:29 -0500 Subject: [BioPython] [Biojava-l] small "bug" correction in package BioSql In-Reply-To: <473336E6.6000100@ebi.ac.uk> References: <762277.43372.qm@web26507.mail.ukl.yahoo.com> <473336E6.6000100@ebi.ac.uk> Message-ID: <9FF48B4B-74F1-4371-BBB5-541F1A70D88F@gmx.net> It seems BioPerl and Biopython both want (and have traditionally used) lowercase - do you mind going with that for Biojava as well, or alternatively, simply map upon insert/update and retrieve? -hilmar On Nov 8, 2007, at 11:18 AM, Richard Holland wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > we do need a consensus here. > > I'm happy to go with whatever value is chosen, as the BioJava code can > easily be modified to suit. > > cheers, > Richard > > Hilmar Lapp wrote: >> Indeed Biojava uses uppercase for alphabet. In Bioperl-db, we >> explicitly lowercase the value found for alphabet, and the comment >> says why: >> >> # Note: Biojava uses upper-case terms for alphabet, so we >> # need to change to all-lower in case the sequence was >> # manipulated by Biojava. >> $obj->alphabet(lc($rows->[3])) if $rows->[3]; >> >> However, when inserting sequences, we leave the value as is in >> BioPerl (which is lowercase), leading to a potential problem for >> Biojava upon retrieval. Do the Biojava folks deal with that? Should >> this may harmonized across the board? >> >> -hilmar >> >> On Nov 8, 2007, at 6:49 AM, Eric Gibert wrote: >> >>> Dear Peter, >>> >>> All the alphabet are "DNA" (upper case) in my database. 
The >>> sequences are taken from NCBI by a BioJava application. >>> Thus it should be BioJava that inserts the records with "DNA". Thus >>> no potential "hidden bug" in BioPython. >>> >>> Maybe a point to share with the Open-Bio committee. >>> >>> Eric >>> >>> ----- Original Message ---- >>> From: Peter >>> To: Eric Gibert >>> Cc: biopython at lists.open-bio.org >>> Sent: Thursday, 8 November 2007, 19:40:00 >>> Subject: Re: [BioPython] small "bug" correction in package BioSql >>> >>> Eric Gibert wrote: >>>> Dear all, >>>> >>>> In BioSeq/BioSeq.py, in the class DBSeq definition, we have the >>>> function: >>>> >>>> ... >>>> >>>> please note my correction: force moltype to be turned into lower >>>> case as >>>> my database has upper case values! This raises the "Unknown moltype" >>>> error. >>> Hi Eric, I've made your suggested change in CVS, >>> biopython/BioSQL/BioSeq.py revision 1.13, thank you. >>> >>> I would encourage you to investigate why some of the "alphabet" >>> fields >>> in the biosequence table are in upper case. There could be a bug >>> elsewhere which is writing these entries with the wrong >>> alphabet. Is >>> this affecting all entries, or just some? >>> >>> Peter >>> >>> >>> >>> >>> >>> >>> >>> >>> ____________________________________________________________________ >>> __ >>> _______ >>> Keep just one email address! Copy your mail to >>> Yahoo! 
Mail >>> _______________________________________________ >>> BioPython mailing list - BioPython at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biopython >> > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.2.2 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org > > iD8DBQFHMzbm4C5LeMEKA/QRAtzGAJ98MKWg0uUOafDVVkihSzfSTwtfxACgi6q3 > 9x+CUHig3GfBCZ56rDb1ZG4= > =OJyB > -----END PGP SIGNATURE----- -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From mdehoon at c2b2.columbia.edu Fri Nov 9 00:58:42 2007 From: mdehoon at c2b2.columbia.edu (Michiel De Hoon) Date: Thu, 8 Nov 2007 19:58:42 -0500 Subject: [BioPython] Compiling from CVS on OS X References: Message-ID: <6243BAA9F5E0D24DA41B27997D1FD14402B65A@mail2.exch.c2b2.columbia.edu> > The problem was that the system wasn't finding the `Numeric/ > arrayobject.h` file for inclusion. I "fixed" it by editing the > setup.py file and adding '/opt/local/include/python2.4' to the > include_dirs param on line 474 so it would pick up the files in my > python install (that's just where numeric header was installed from > macports). By default, the directory containing the Python include files is searched for header files during compilation. If you install Numerical Python from source, it will put its header files in the same location, and no editing of setup.py is needed. If, on the other hand, you use a precompiled package, it may be installed in a different place. For some reason, the package from macports puts the Numerical Python header files in a non-standard place, which is why they could not be found. Apparently macports assumes that Python is installed in /opt/local/; this may very well be the installation directory used by macports for Python itself. 
In other words, macports uses a non-standard directory, there is no way for Python to know about it, so the header files cannot be found during compilation. --Michiel. Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1150 St Nicholas Avenue New York, NY 10032 -----Original Message----- From: biopython-bounces at lists.open-bio.org on behalf of Steve Lianoglou Sent: Thu 11/8/2007 2:32 PM To: BioPython at lists.open-bio.org Subject: [BioPython] Compiling from CVS on OS X Hi all, I was having problems compiling biopython from source, specifically getting the Bio/Cluster/clustermodule.c file to compile well. The problem was that the system wasn't finding the `Numeric/ arrayobject.h` file for inclusion. I "fixed" it by editing the setup.py file and adding '/opt/local/include/python2.4' to the include_dirs param on line 474 so it would pick up the files in my python install (that's just where numeric header was installed from macports). Is this the expected way to achieve this, or is there some envi-var, or site.cfg to tweak to do this correctly (or is my python install whacky from the get go?) FWIW I'm using python 2.4 installed via macports. Thanks, -steve _______________________________________________ BioPython mailing list - BioPython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From ericgibert at yahoo.fr Fri Nov 9 13:35:12 2007 From: ericgibert at yahoo.fr (Eric Gibert) Date: Fri, 9 Nov 2007 21:35:12 +0800 Subject: [BioPython] error on insert new sequences from GenBank: no annotations saved in BioSQL database In-Reply-To: References: <499834.44468.qm@web26501.mail.ukl.yahoo.com> Message-ID: <000601c822d5$5d811c20$6400a8c0@Gecko> Dear Hilmar, Thank you for this reply. Now I would like to know where BioPythin has stored "SOURCE" or "ORGANISM" in BioSQL? I cannot find them. Then, supposing they are somewhere, how can I get them back? 
Thank you Eric -----Original Message----- From: Hilmar Lapp [mailto:hlapp at gmx.net] Sent: Friday, November 09, 2007 4:28 AM To: Eric Gibert Cc: biopython at lists.open-bio.org; BioJava Subject: Re: [BioPython] error on insert new sequences from GenBank: no annotations saved in BioSQL database Maybe we need to hold some mini-hackathon to make the different toolkits compatible in how they map annotation to the schema. Obviously I don't know whether you have the latest Biojava setup here, but I'll just comment how BioPerl/Bioperl-db would map this: 'ORIGIN' - if I'm not mistaken this is only a token that introduces the actual sequence. I'm not sure what Biojava is storing as value here. 'DIVISION' - this maps to column division in table bioentry (though I agree that if perfectly following the weak typing principle this should be tag/value association, but at present it's still an actual column) 'genbank_accessions' - secondary accession numbers indeed go into the qualifier value table. The primary accession maps to column accession in table bioentry 'TITLE' - this is part of a publication reference, and should map to column title in table reference (which it does in bioperl-db) 'cross_references' - not sure where these would be coming from in GenBank format; for EMBL this will map to the dbxref table 'data_file_division' - not sure what this is (same as DIVISION?) 'VERSION' - in BioPerl we parse this apart into a version for the accession (which is column version in table bioentry) and the GI number, which maps to column identifier in table bioentry 'references' - these map to table reference (and bioentry_reference for association with the bioentry) 'KEYWORDS' - indeed these map to bioentry_qualifier_value 'GI' - maps to column identifier in table bioentry 'SIZE' - not sure what size that is. 
If it is the length of the sequence, it should (and in BioPerl/bioperl-db does) map to column length in table biosequence 'DEFINITION' - maps to column description in table bioentry 'REFERENCE' - should be the same as for 'references' 'MDAT' - not sure what this is 'ORGANISM' - this is the organism and maps to the table taxon (and taxon_name), with a foreign key in bioentry pointing to the taxon 'JOURNAL' - this is part of a reference, see 'references' 'ACCESSION' - the primary accession, maps to column accession in table bioentry 'LOCUS' - in the file itself this is an entire line consisting of multiple fields; BioPerl/bioperl-db maps the locus name (the first token after the literal token LOCUS) to column name in table bioentry 'SOURCE' - this is the organism, see 'ORGANISM' 'PUBMED' - this is part of a literature reference, and maps to a foreign key in the reference table (reference.dbxref) to a dbxref entry with PUBMED or PMID as the database and the pubmed ID as the accession 'AUTHORS' - part of a literature reference, maps to column authors in table reference 'TYPE' - not sure what this is. If it's the alphabet, it maps to table biosequence, column alphabet 'CIRCULAR' - this at present indeed maps to bioentry_qualifier_value, though there have been plans to make it a column in table biosequence. Note that this could in fact be the way Biojava stores it too, but upon retrieval represents it in the way you are seeing it. 
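The mapping listed above can be restated as a lookup table. This is only a sketch that summarises the table/column pairs named in the message; the names GENBANK_TO_BIOSQL and where_stored are hypothetical and do not exist in BioPerl, Biojava or Biopython, and entries Hilmar marked as uncertain ('ORIGIN', 'MDAT', 'data_file_division') are left out.

```python
# Hypothetical summary of where bioperl-db stores GenBank annotation,
# as described in the message above. Values are (table, column); a column
# of None means the value is spread over rows of that table.
GENBANK_TO_BIOSQL = {
    "DIVISION": ("bioentry", "division"),
    "ACCESSION": ("bioentry", "accession"),
    "VERSION": ("bioentry", "version"),      # the GI part goes to bioentry.identifier
    "GI": ("bioentry", "identifier"),
    "DEFINITION": ("bioentry", "description"),
    "LOCUS": ("bioentry", "name"),           # locus name only
    "SIZE": ("biosequence", "length"),       # if it is the sequence length
    "TYPE": ("biosequence", "alphabet"),     # if it is the alphabet
    "TITLE": ("reference", "title"),
    "AUTHORS": ("reference", "authors"),
    "PUBMED": ("reference", "dbxref_id"),    # foreign key to a dbxref row
    "ORGANISM": ("taxon", None),             # plus taxon_name
    "SOURCE": ("taxon", None),
    "KEYWORDS": ("bioentry_qualifier_value", "value"),
    "genbank_accessions": ("bioentry_qualifier_value", "value"),
    "CIRCULAR": ("bioentry_qualifier_value", "value"),
}


def where_stored(key):
    """Return 'table.column' (or just 'table') for an annotation key."""
    table, column = GENBANK_TO_BIOSQL[key]
    if column is None:
        return table
    return "%s.%s" % (table, column)


for key in ("GI", "ORGANISM", "KEYWORDS"):
    print(key, "->", where_stored(key))
```

Whether Biopython 1.44 follows the same mapping is exactly the open question in this thread, so the table describes the bioperl-db side only.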
Hth, -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From lists.steve at arachnedesign.net Fri Nov 9 14:43:15 2007 From: lists.steve at arachnedesign.net (Steve Lianoglou) Date: Fri, 9 Nov 2007 09:43:15 -0500 Subject: [BioPython] Compiling from CVS on OS X In-Reply-To: <6243BAA9F5E0D24DA41B27997D1FD14402B65A@mail2.exch.c2b2.columbia.edu> References: <6243BAA9F5E0D24DA41B27997D1FD14402B65A@mail2.exch.c2b2.columbia.edu> Message-ID: <2E1132E9-DB5C-4416-AC11-68FCE83EDF5C@arachnedesign.net> Sorry ... didn't cc the list: Hi Michiel, > If, on the other hand, you use a precompiled package, it may be > installed in > a different place. For some reason, the package from macports puts the > Numerical Python header files in a non-standard place, which is why > they > could not be found. Apparently macports assumes that Python is > installed in > /opt/local/; this may very well be the installation directory used by > macports for Python itself. > > In other words, macports uses a non-standard directory, there is no > way for > Python to know about it, so the header files cannot be found during > compilation. I see. I could imagine that many folks might be using macports (or similar) to help them manage some of their dependencies. Do you think it's a good idea if we add the ability to add some custom include paths in a more non-intrusive n00b friendly way, like through a site.cfg file, for example? 
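To make the site.cfg idea concrete, here is a rough sketch of how extra include paths could be read from such a file. It is not what Biopython's setup.py does; the section name [build_ext], the key include_dirs and the function extra_include_dirs are assumptions, chosen to resemble distutils conventions.

```python
import configparser
import os


def extra_include_dirs(cfg_text):
    """Return the include directories listed in a site.cfg-style string.

    Assumes a [build_ext] section with an include_dirs key whose value is
    a list of paths separated by os.pathsep (':' on Unix).
    """
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    raw = parser.get("build_ext", "include_dirs", fallback="")
    return [path.strip() for path in raw.split(os.pathsep) if path.strip()]


sample = "[build_ext]\ninclude_dirs = " + os.pathsep.join(
    ["/opt/local/include/python2.4", "/usr/local/include"]
) + "\n"

# These directories would be appended to the include_dirs of the
# extension being built, instead of hard-coding them in setup.py.
print(extra_include_dirs(sample))
```

A missing file or section simply yields an empty list, so users with a standard layout would not need to create the file at all.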
Thanks,
-steve

From ericgibert at yahoo.fr Sat Nov 10 11:16:40 2007
From: ericgibert at yahoo.fr (Eric Gibert)
Date: Sat, 10 Nov 2007 19:16:40 +0800
Subject: [BioPython] error on insert new sequences from GenBank: no annotations saved in BioSQL database
In-Reply-To: <47336117.2010102@maubp.freeserve.co.uk>
References: <499834.44468.qm@web26501.mail.ukl.yahoo.com> <47336117.2010102@maubp.freeserve.co.uk>
Message-ID: <001c01c8238b$2ec64070$6400a8c0@Gecko>

Dear Peter,

My problem is not that I have no entries in the tables, but that no interpretation of the features is performed. Example: in the tutorial and in BioJava, 'source' is an annotation:

# from the Biopython Tutorial and Cookbook
print "from: %s" % seq_record.annotations['source']

This returns a "KeyError: 'source'".

On the other hand, after some tinkering, I found that I can get a feature from the list Seq.features with type='source' which contains a "qualifiers['organism']"... Quite cumbersome. But maybe there is another, more straightforward way that I did not find. Can you tell me?

--------

For your information, I went through the tables of my BioSQL database. Here are my findings with BioPython insertion in BioSQL using "myDataBase.load(list_of_seq)" (note: one test sequence was fetched by GenBank.download_many() and the other using GenBank.NCBIDictionary):

1) table bioentry: all columns populated except for 'taxon_id' which is NULL (maybe I need an extra call for populating the 'taxon' table before?) 
(FYI BioJava sequences are not filling all columns correctly) 2) table bioentry_dbxref: no data inserted (always empty, even with BioJava) 3) table bioentry_qualifier_value: One entry only, for the 'term_id' = 149, rank = 1, and value = '07-JUL-2005' or other 'DD-MMM-YYYY' dates (see my remarks below) 4) table bioentry_reference: two records per sequence with reference _id correctly mapping the 'reference' table, rank, start_pos and end_pos also correctly filled 5) table bioentry_relationships: no entry found (always empty, even with BioJava) 6) table biosequence: one entry per seq, the 'seq' field is correct. Note: the 'version' is set to 0 whereas it should be 1... (length is correct and we have "dna" is lower case :-) ) 7) table comment: no entry found (always empty, even with BioJava) 8) table dbxref: some records are generated, for dbname 'PUBMED' and 'Taxon' with the correct value (FYI: I think that my BioJava is not managing this table...) 9) table dbxref_qualifier_value: (always empty, even with BioJava) 10) table location: all locations loaded correctly, note that 'term_id' and 'dbxref_id' remain NULL for these seq but I have value for other seq. 11) table location_qualifier_value: always empty, even with BioJava 12) table ontology: some rows but not related to the sequences 13) Table reference: entries correct, note 'dbxref_id' remains NULL for these seq but I have value for other seq. 14) table seqfeature: entries are there (same as in table 'location'). FYI:'display_name is always NULL. 
15) table seqfeature_dbxref: always empty, even with BioJava 16) table seqfeature_qualifier_value: filled correctly 17) table seqfeature_relationship: always empty, even with BioJava 18) table taxon: always empty, even with BioJava) 19) table taxon_name: I have one but not from this test (I tried to tinker a little bit with taxon but stopped) 20) table term: always empty, even with BioJava 21) table term_dbxref: always empty, even with BioJava 22) table term_relationship_term: have some entries 23) table term_synonym: always empty, even with BioJava ------------ Thank you Eric -----Original Message----- From: Peter [mailto:biopython at maubp.freeserve.co.uk] Sent: Friday, November 09, 2007 3:19 AM To: Eric Gibert Cc: biopython at lists.open-bio.org Subject: Re: [BioPython] error on insert new sequences from GenBank: no annotations saved in BioSQL database Eric Gibert wrote: > Once I look in the table bioentry_qualifier_value > > * 20 records for a Sequence imported by BioJava > * 1 only for a Sequence inserted by BioPython: the date which should be inserted by "_load_bioentry_date" in BioSQL/Loader.py > > Quite a few annotations missing, no? > > Any idea? So Biopython is recording nothing in table bioentry_qualifier_value (apart from the date), but is recording other essential things (in other tables) like the sequence itself? 
Could you double check your schema, as from the issue I filed as bug 2394 based on your earlier email, your schema doesn't seem to be up to date: http://bugzilla.open-bio.org/show_bug.cgi?id=2394 Peter From hlapp at gmx.net Sat Nov 10 20:42:45 2007 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 10 Nov 2007 15:42:45 -0500 Subject: [BioPython] error on insert new sequences from GenBank: no annotations saved in BioSQL database In-Reply-To: <000601c822d5$5d811c20$6400a8c0@Gecko> References: <499834.44468.qm@web26501.mail.ukl.yahoo.com> <000601c822d5$5d811c20$6400a8c0@Gecko> Message-ID: <3F4D1638-5D86-4AC2-9D40-A77C4271B598@gmx.net> On Nov 9, 2007, at 8:35 AM, Eric Gibert wrote: > Thank you for this reply. Now I would like to know where BioPythin has > stored "SOURCE" or "ORGANISM" in BioSQL? I cannot find them. > > Then, supposing they are somewhere, how can I get them back? Just to clarify, I'm not a Biopython developer. I was merely commenting from the BioSQL perspective, with maybe the background of we use it in the BioPerl language binding. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Sat Nov 10 20:38:17 2007 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 10 Nov 2007 15:38:17 -0500 Subject: [BioPython] error on insert new sequences from GenBank: no annotations saved in BioSQL database In-Reply-To: <001c01c8238b$2ec64070$6400a8c0@Gecko> References: <499834.44468.qm@web26501.mail.ukl.yahoo.com> <47336117.2010102@maubp.freeserve.co.uk> <001c01c8238b$2ec64070$6400a8c0@Gecko> Message-ID: <5DDEBCDE-C8DA-4B2C-86F4-47FDB82CADAC@gmx.net> Just a few comments below, specifically where no rows would in fact be what I expect: On Nov 10, 2007, at 6:16 AM, Eric Gibert wrote: > [...] > -------- For you information, I went thru the tables of my BioSQL > database: > [...] 
> 1) table bioentry: all column populated except for 'taxon_id' which > is NULL > (maybe I need an extra call for populating the 'taxon' table before?) Bioperl-db will try to look up (or create if necessary) the taxon from the taxon information attached to the sequence, but for BioPerl we actually recommend to pre-load the database with the NCBI taxonomy, which can be comfortably done with the script load_ncbi_taxonomy.pl that comes with BioSQL. > > 2) table bioentry_dbxref: no data inserted (always empty, even with > BioJava) This would mean that the sequence(s) have no dbxrefs. Note that for GenBank sequences that would be expected, since unfortunately, and unlike EMBL format, GenBank puts the dbxrefs into the feature table. > 3) table bioentry_qualifier_value: > > One entry only, for the 'term_id' = 149, rank = 1, and value = '07- > JUL-2005' > or other 'DD-MMM-YYYY' dates (see my remarks below) Below you say that your term table is empty, so I don't know why you can have value here at all. > [...] > 5) table bioentry_relationships: no entry found (always empty, even > with > BioJava) If you load sequences, they won't have direct relationships to other sequences (except dbxrefs, but those are rather 'pointers' and are stored in their own table). In Bioperl-db, this table is used only if you load sequence clusters through Bio::Cluster objects (such as UniGene). > [...] > 7) table comment: no entry found (always empty, even with BioJava) Again, this is expected with GenBank. AFAIK genbank format doesn't allow for comments at the level of the sequence. You would (i.e., should) find entries here if you load UniProt entries. > 8) table dbxref: some records are generated, for dbname 'PUBMED' > and 'Taxon' > with the correct value Taxon obviously isn't really a dbxref, but rather a taxon (and hence should go into that table). > [...] > 9) table dbxref_qualifier_value: (always empty, even with BioJava) That's almost expected. 
There's rather few cases where dbxrefs have additional attributes that the language can parse out from a source (and then maps to the schema). > [...] > 10) table location: all locations loaded correctly, note that > 'term_id' and > 'dbxref_id' remain NULL for these seq but I have value for other seq. Theoretically, the term_id should point to the term giving the type of the location. If you (or Biopython) are only dealing with simple ('normal') locations, then it's not needed. The dbxref_id gives the reference to the remote sequence if the location for a feature refers to a different sequence than the feature itself does (so-called 'remote locations'). If the sequences you loaded don't have such locations, there this would be expected to be empty (or if Biopython doesn't handle such locations). > 11) table location_qualifier_value: always empty, even with BioJava This is expected if Biopython doesn't support fuzzy locations, or if none of the feature locations that you loaded are fuzzy. > [...] > 13) Table reference: entries correct, note 'dbxref_id' remains NULL > for > these seq but I have value for other seq. It should point to the pubmed ID for the reference but only if there was one. > 14) table seqfeature: entries are there (same as in table 'location'). > FYI:'display_name is always NULL. GenBank doesn't give names to features (and I think EMBL does neither), so this is expected. > 15) table seqfeature_dbxref: always empty, even with BioJava That's likely more to do with your language object model than with anything else. dbxref annotation for features is in tag/value pairs, just as any other, so your language (Biopython in this case) will have to do a lot of interpretation to tease out the semantics behind each tag name and based on that decide what to do with the value. Indeed, by default we don't even do this in BioPerl. > [...] 
> 17) table seqfeature_relationship: always empty, even with BioJava GenBank (and EMBL) feature tables are flat, not hierarchical, so this is expected. > 18) table taxon: always empty, even with BioJava) This is where the organism should go. > 19) table taxon_name: I have one but not from this test (I tried to > tinker a > little bit with taxon but stopped) That's odd that you can have an entry in taxon_name w/o a corresponding one in taxon. Do you have foreign key checks disabled? > 20) table term: always empty, even with BioJava That's strange, since you say you do have rows in bioentry_qualifier_value, which has an enforced foreign key to term. Did you disable the foreign key checks? > 21) table term_dbxref: always empty, even with BioJava That's expected unless you loaded an ontology whose terms have dbxrefs, and your language object model supports that. > [...] > 23) table term_synonym: always empty, even with BioJava Same as for 21). Your terms would have to have synonyms, and your language object model would have to support those, before you could expect to get anything in here. 
-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From ericgibert at yahoo.fr Sun Nov 11 02:11:54 2007
From: ericgibert at yahoo.fr (Eric Gibert)
Date: Sun, 11 Nov 2007 10:11:54 +0800
Subject: [BioPython] Taxon/organism/source in Biopython
In-Reply-To: <001c01c8238b$2ec64070$6400a8c0@Gecko>
References: <499834.44468.qm@web26501.mail.ukl.yahoo.com><47336117.2010102@maubp.freeserve.co.uk> <001c01c8238b$2ec64070$6400a8c0@Gecko>
Message-ID: <002a01c82408$3f386d20$6400a8c0@Gecko>

I found one answer to my question: in BioSQL/Loader.py (Biopython version 1.44), inside the function _load_bioentry_table, there is a *commented* call to

# taxon_id = self._get_taxon_id(record)

I uncommented it and also modified the "INSERT INTO bioentry" statement slightly below to include the taxon_id column and value. Thereafter, the call to self._get_taxon_id ensures that the taxon is handled and inserted in the database. The Sequence.annotations now contains:

- 'taxonomy': ['the Genus', 'theSpecies']
- 'ncbi_taxoid' : 123456L
- 'organism' : 'thespecies'

Which is exactly what I was looking for :-)

Attention: inside the _get_taxon_id function, in the section starting with the comment "# XXX -- Brad:......", inserts are performed without checking whether the 'taxon' already exists. It is clear that the lowest taxon is not yet in the database (or else we would already have returned from the function), but inserting the higher levels without an existence check is not correct: I imported two sequences of the same genus but different species, and the GENUS was created twice. This is also due to the fact that the genus does not have an ncbi_taxon_id...

Thus I propose to first check whether the taxon is already in the table before insertion, based on

SELECT taxon_id from taxon_name where name=%s and name_class='scientific name'

What do you think? 
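The check-before-insert proposed above could look roughly like the sketch below. It is not the code in BioSQL/Loader.py: the function get_or_create_taxon is hypothetical, the two tables are cut down to the columns needed here (the real BioSQL taxon table has more), and sqlite3 with '?' placeholders stands in for the MySQL-style '%s' of the query above.

```python
import sqlite3


def get_or_create_taxon(conn, name, name_class="scientific name"):
    """Return the taxon_id for name, inserting rows only if none exist yet."""
    cur = conn.cursor()
    cur.execute(
        "SELECT taxon_id FROM taxon_name WHERE name = ? AND name_class = ?",
        (name, name_class),
    )
    row = cur.fetchone()
    if row is not None:
        # Already present (e.g. a genus loaded with an earlier species).
        return row[0]
    # Not found: create the taxon first, then its name.
    cur.execute("INSERT INTO taxon (ncbi_taxon_id) VALUES (NULL)")
    taxon_id = cur.lastrowid
    cur.execute(
        "INSERT INTO taxon_name (taxon_id, name, name_class) VALUES (?, ?, ?)",
        (taxon_id, name, name_class),
    )
    return taxon_id


# Simplified stand-in schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE taxon (taxon_id INTEGER PRIMARY KEY, ncbi_taxon_id INTEGER)")
conn.execute("CREATE TABLE taxon_name (taxon_id INTEGER, name TEXT, name_class TEXT)")

# Loading two species of the same genus now reuses the genus row.
genus_first = get_or_create_taxon(conn, "Globodera")
genus_second = get_or_create_taxon(conn, "Globodera")
print(genus_first == genus_second)
```

Matching on the scientific name alone can be ambiguous when two unrelated taxa share a name, so a real fix would probably prefer ncbi_taxon_id whenever one is available.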
PS1: Hilmar, I was typing this mail when I received your mail commenting on the taxon manipulation. As you can see, this provides some answers. Thank you for taking the time to detail the contents of the tables: as you guessed, I only access NCBI :-)

PS2: Hilmar mentions that BioPerl has a function 'load_ncbi_taxonomy.pl'. Does BioPython have one too (I could not find one)? If there is none, shall we/I try to provide one?

From hlapp at gmx.net Sun Nov 11 03:41:15 2007
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 10 Nov 2007 22:41:15 -0500
Subject: [BioPython] Taxon/organism/source in Biopython
In-Reply-To: <002a01c82408$3f386d20$6400a8c0@Gecko>
References: <499834.44468.qm@web26501.mail.ukl.yahoo.com><47336117.2010102@maubp.freeserve.co.uk> <001c01c8238b$2ec64070$6400a8c0@Gecko> <002a01c82408$3f386d20$6400a8c0@Gecko>
Message-ID:

On Nov 10, 2007, at 9:11 PM, Eric Gibert wrote:
> PS2: Hilmar mentions that BioPerl has a function
> 'load_ncbi_taxonomy.pl'.

It is BioSQL that has this script, not BioPerl. 
The script doesn't depend on BioPerl, only on Perl (which almost every system has installed). -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From vmatthewa at gmail.com Sun Nov 11 18:56:33 2007 From: vmatthewa at gmail.com (Matthew Abravanel) Date: Sun, 11 Nov 2007 11:56:33 -0700 Subject: [BioPython] posting to list Message-ID: <8fc5e4c20711111056n1c42d26cvae357df50d810dea@mail.gmail.com> my e-mail is vmatthewa at gmail.com From vmatthewa at gmail.com Sun Nov 11 19:07:33 2007 From: vmatthewa at gmail.com (Matthew Abravanel) Date: Sun, 11 Nov 2007 12:07:33 -0700 Subject: [BioPython] Writing a biopython script to download all Genbank records from Nucleotide database Message-ID: <8fc5e4c20711111107p5c3b7f94q7ccb3de7493a3279@mail.gmail.com> Hi everyone, Please ignore my last messages, I am still getting the hang of this e-mail list and everything. I am trying to write a bio-python script to download all Genbank records in the Nucleotide database and I know what I want to do just not how to go about doing it. I am using a Unix based system with bio-python 2.4 and I am using emacs editor, if someone could help me out I would really appreciate it with some sample code or something. I just started learning python and have tried to follow the documentation and cookbook without much success, my programming experience is virtually non-existent. Thanks. 
Matthew From winter at biotec.tu-dresden.de Sun Nov 11 19:37:02 2007 From: winter at biotec.tu-dresden.de (Christof Winter) Date: Sun, 11 Nov 2007 20:37:02 +0100 Subject: [BioPython] Writing a biopython script to download all Genbank records from Nucleotide database In-Reply-To: <8fc5e4c20711111107p5c3b7f94q7ccb3de7493a3279@mail.gmail.com> References: <8fc5e4c20711111107p5c3b7f94q7ccb3de7493a3279@mail.gmail.com> Message-ID: <473759DE.1010802@biotec.tu-dresden.de> Matthew Abravanel wrote: > Hi everyone, > > Please ignore my last messages, I am still getting the hang of this > e-mail list and everything. I am trying to write a bio-python script to > download all Genbank records in the Nucleotide database and I know what I > want to do just not how to go about doing it. I am using a Unix based system > with bio-python 2.4 and I am using emacs editor, if someone could help me > out I would really appreciate it with some sample code or something. I > just started learning python and have tried to follow the documentation and > cookbook without much success, my programming experience is virtually > non-existent. Thanks. > > Matthew Hi Matthew, I used the code below to retrieve some entries from the Nucleotide database. Since two entries already take a few seconds, it is probably a bad idea to download _all_ entries in that way. 
You might be better off downloading the data first: ftp://ftp.ncbi.nih.gov/genbank/

HTH, cheers,
Christof

from Bio import GenBank

featureParser = GenBank.FeatureParser()
ncbiDict = GenBank.NCBIDictionary("nucleotide", "genbank", parser=featureParser)

accessionNumbers = ["BC063166", "NM_028459"]

for accessionNo in accessionNumbers:
    giList = GenBank.search_for(accessionNo)
    for gi in giList:
        record = ncbiDict[gi]  # parsing happens here
        for feature in record.features:
            # extract sequences
            if feature.type == "CDS":
                codingStart = feature.location._start.position
                codingEnd = feature.location._end.position
                completeSequence = record.seq.tostring()
                fiveUTRSequence = completeSequence[:codingStart]
                codingSequence = completeSequence[codingStart:codingEnd]
                threeUTRSequence = completeSequence[codingEnd:]
            # extract gene name
            if feature.type == "gene":
                geneName = feature.qualifiers['gene'][0]

        print "Found", gi, geneName, len(completeSequence)

From biopython at maubp.freeserve.co.uk Mon Nov 12 10:16:07 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 12 Nov 2007 10:16:07 +0000
Subject: [BioPython] Taxon/organism/source in Biopython
In-Reply-To: <002a01c82408$3f386d20$6400a8c0@Gecko>
References: <499834.44468.qm@web26501.mail.ukl.yahoo.com><47336117.2010102@maubp.freeserve.co.uk> <001c01c8238b$2ec64070$6400a8c0@Gecko> <002a01c82408$3f386d20$6400a8c0@Gecko>
Message-ID: <473827E7.2020907@maubp.freeserve.co.uk>

Eric Gibert wrote:
> I find out one answer to my question: in BioSQL/Loader.py (Biopython version
> 1.44), inside the function _load_bioentry_table, there is a *commented* call
> to
>
> # taxon_id = self._get_taxon_id(record)
>
> I uncommented it and also modified the "INSERT INTO bioentry" statement
> slightly below to include the taxon_id column and value.

See bug 1921, where this was done as a work around for a zero taxon id. 
Clearly we should revisit this issue and fix this properly: http://bugzilla.open-bio.org/show_bug.cgi?id=1921

Peter

From vmatthewa at gmail.com Mon Nov 12 21:19:27 2007
From: vmatthewa at gmail.com (Matthew Abravanel)
Date: Mon, 12 Nov 2007 14:19:27 -0700
Subject: [BioPython] script to extract records from nucleotide database
Message-ID: <8fc5e4c20711121319q5c1fd46exb584978ec0b9dd76@mail.gmail.com>

Hi Christof,

I tried out the code you sent me just to see if it would work but I get an attribute error or something? Here is the error I get:

Traceback (most recent call last):
  File "./run", line 3, in ?
    from Bio import GenBank
  File "/usr/pkg/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line 47, in ?
  File "/usr/pkg/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line 20, in ?
    from Bio.SeqRecord import SeqRecord
  File "/usr/pkg/lib/python2.4/site-packages/Bio/SeqRecord.py", line 11, in ?
  File "/usr/pkg/lib/python2.4/site-packages/Bio/FormatIO.py", line 55, in __init__
AttributeError: 'module' object has no attribute 'formats'

Here is the code I have used:

#!/usr/pkg/bin/python2.4

from Bio import GenBank

featureParser = GenBank.FeatureParser()
ncbiDict = GenBank.NCBIDictionary("nucleotide", "genbank",parser=featureParser)

accessionNumbers=["BC063166", "NM_028459"]

for accessionNo in accessionNumbers:
    giList = GenBank.search_for(accessionNo)
    for gi in giList:
        record = ncbiDict[gi]
        for feature in record.features:
            if feature.type =="CDS":
                codingStart = feature.location._start.position
                codingEnd = feature.location._end.position
                completeSequence = record.seq.tostring()
                fiveUTRSequence = completeSequence[:codingStart]
                codingSequence = completeSequence[codingStart:codingEnd]
                threeUTRSequence = completeSequence[codingEnd:]
            if feature.type=="gene":
                geneName=feature.qualifiers['gene'][0]

        print "Found",gi,geneName,len(completeSequence)

I do not know if it is a difference in python2.4 version or not? Any help would be appreciated, thanks. 
Matthew From alexl at users.sourceforge.net Tue Nov 13 07:01:04 2007 From: alexl at users.sourceforge.net (Alex Lancaster) Date: Tue, 13 Nov 2007 00:01:04 -0700 Subject: [BioPython] Fedora packages for 1.44 (was Re: Biopython release 1.44 ready) In-Reply-To: <6243BAA9F5E0D24DA41B27997D1FD14402B645@mail2.exch.c2b2.columbia.edu> (Michiel De Hoon's message of "Sun\, 28 Oct 2007 02\:32\:40 -0400") References: <6243BAA9F5E0D24DA41B27997D1FD14402B645@mail2.exch.c2b2.columbia.edu> Message-ID: >>>>> "MDH" == Michiel De Hoon writes: MDH> Hi everybody, Biopython release 1.44 is now available for MDH> download from the Biopython website at http://biopython.org. For those that are using biopython on Fedora, I have updated the packages for biopython 1.44 in the "updates-testing" repository for F-7 and F-8 (the stable release is still on 1.43 until the updated packages get some testing, then I'll push them out to the stable "updates" repo). To test them out simply run (as root): yum --enablerepo=updates-testing install python-biopython If you have a Fedora account you can provide feedback directly here: https://admin.fedoraproject.org/updates/F7/FEDORA-2007-3266 https://admin.fedoraproject.org/updates/F8/FEDORA-2007-3198 otherwise, simply e-mail me, or fill out a bugzilla report on http://bugzilla.redhat.com/ (selecting the "Fedora" product and the "python-biopython" component). Thanks! Alex From winter at biotec.tu-dresden.de Tue Nov 13 20:25:45 2007 From: winter at biotec.tu-dresden.de (Christof Winter) Date: Tue, 13 Nov 2007 21:25:45 +0100 Subject: [BioPython] script to extract records from nucleotide database In-Reply-To: <8fc5e4c20711121319q5c1fd46exb584978ec0b9dd76@mail.gmail.com> References: <8fc5e4c20711121319q5c1fd46exb584978ec0b9dd76@mail.gmail.com> Message-ID: <473A0849.2020904@biotec.tu-dresden.de> Matthew Abravanel wrote: > Hi Christof, > > I tried out the code you sent me just to see if it would work but I get an > attribute error or something? 
Here is the error I get: > > > Traceback (most recent call last): > File "./run", line 3, in ? > from Bio import GenBank > File "/usr/pkg/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line > 47, in ? > File "/usr/pkg/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line > 20, in ? > from Bio.SeqRecord import SeqRecord > File "/usr/pkg/lib/python2.4/site-packages/Bio/SeqRecord.py", line 11, in > ? > File "/usr/pkg/lib/python2.4/site-packages/Bio/FormatIO.py", line 55, in > __init__ > AttributeError: 'module' object has no attribute 'formats' Hi Matthew, your import of the GenBank module fails. Most likely your BioPython installation is broken. Could you try to re-install it? On a Python (2.4) shell, this should work: Python 2.4.4 (#2, Apr 5 2007, 20:11:18) [GCC 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import Bio >>> import Bio.GenBank >>> HTH, Christof > > > Here is the code I have used: > > > > #!/usr/pkg/bin/python2.4 > > from Bio import GenBank > > > featureParser = GenBank.FeatureParser() > ncbiDict = GenBank.NCBIDictionary("nucleotide", > "genbank",parser=featureParser) > > accessionNumbers=["BC063166", "NM_028459"] > > > for accessionNo in accessionNumbers: > giList = GenBank.search_for(accessionNo) > for gi in giList: > record = ncbiDict[gi] > for feature in record.features: > if feature.type =="CDS": > codingStart = feature.location._start.position > codingEnd = feature.location._end.position > completeSequence = record.seq.tostring() > fiveUTRSequence = completeSequence[:codingStart] > codingSequence = completeSequence[codingStart:codingEnd] > threeUTRSequence = completeSequence[codingEnd:] > if feature.type=="gene": > geneName=feature.qualifiers['gene'][0] > > print "Found",gi,geneName,len(completeSequence) > > > I do not know if it is a difference in python2.4 version or not? Any help > would be appreciate, thanks. 
> > Matthew > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From biopython at maubp.freeserve.co.uk Tue Nov 13 22:59:43 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 13 Nov 2007 22:59:43 +0000 Subject: [BioPython] script to extract records from nucleotide database In-Reply-To: <473A0849.2020904@biotec.tu-dresden.de> References: <8fc5e4c20711121319q5c1fd46exb584978ec0b9dd76@mail.gmail.com> <473A0849.2020904@biotec.tu-dresden.de> Message-ID: <473A2C5F.5060102@maubp.freeserve.co.uk> Christof Winter wrote: > Matthew Abravanel wrote: >> Hi Christof, >> >> I tried out the code you sent me just to see if it would work but I get an >> attribute error or something? Here is the error I get: >> >> Traceback (most recent call last): >> File "./run", line 3, in ? >> from Bio import GenBank >> File "/usr/pkg/lib/python2.4/site-packages/Bio/GenBank/__init__.py", line >> 47, in ? >> File "/usr/pkg/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", line >> 20, in ? >> from Bio.SeqRecord import SeqRecord >> File "/usr/pkg/lib/python2.4/site-packages/Bio/SeqRecord.py", line 11, in >> ? >> File "/usr/pkg/lib/python2.4/site-packages/Bio/FormatIO.py", line 55, in >> __init__ >> AttributeError: 'module' object has no attribute 'formats' > > Hi Matthew, > > your import of the GenBank module fails. Most likely your BioPython installation > is broken. Could you try to re-install it? What version of Biopython do you have Matthew? I'm pretty sure it isn't the latest Biopython 1.44, it must be Biopython 1.43 or older... but even on Biopython 1.43 doing "from Bio import GenBank" should work. Odd. What OS are you using, and how and where did you install Biopython? 
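For the "how and where did you install" question, a minimal sketch that reports which file a module is actually loaded from. It is written for a current Python 3 interpreter rather than the Python 2.4 used in this thread, and describe_module is a hypothetical helper name, not part of Biopython. Older Biopython releases may not define __version__, so the helper falls back to None.

```python
import importlib


def describe_module(name):
    """Return (file path, version) for an importable module.

    Either element is None if the module does not expose it.
    """
    module = importlib.import_module(name)
    path = getattr(module, "__file__", None)
    version = getattr(module, "__version__", None)
    return path, version


if __name__ == "__main__":
    # "Bio" is the Biopython package; substitute any module name.
    for name in ("Bio", "Bio.GenBank"):
        try:
            print(name, describe_module(name))
        except Exception as exc:  # a broken install may raise more than ImportError
            print(name, "failed to import:", type(exc).__name__, exc)
```

If the reported path is not the directory you expected, an older or partial copy elsewhere on the module search path is one possible explanation.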
Peter From vmatthewa at gmail.com Tue Nov 13 23:46:48 2007 From: vmatthewa at gmail.com (Matthew Abravanel) Date: Tue, 13 Nov 2007 16:46:48 -0700 Subject: [BioPython] script to extract records from nucleotide database In-Reply-To: <473A2C5F.5060102@maubp.freeserve.co.uk> References: <8fc5e4c20711121319q5c1fd46exb584978ec0b9dd76@mail.gmail.com> <473A0849.2020904@biotec.tu-dresden.de> <473A2C5F.5060102@maubp.freeserve.co.uk> Message-ID: <8fc5e4c20711131546g111461bfrd210b68d03f8bcd8@mail.gmail.com> Hi Christof and everyone, Thanks for all the comments and everything, the OS I am using is NetBSD 3.1 , and I think you were right Peter about my biopython version I think it was 1.43 instead of the latest 1.44 version. If I get the latest 1.44version of biopython do you think the code should work or do I need to think of something else? Sincerely, Matthew On Nov 13, 2007 3:59 PM, Peter wrote: > Christof Winter wrote: > > Matthew Abravanel wrote: > >> Hi Christof, > >> > >> I tried out the code you sent me just to see if it would work but I get > an > >> attribute error or something? Here is the error I get: > >> > >> Traceback (most recent call last): > >> File "./run", line 3, in ? > >> from Bio import GenBank > >> File "/usr/pkg/lib/python2.4/site-packages/Bio/GenBank/__init__.py", > line > >> 47, in ? > >> File "/usr/pkg/lib/python2.4/site-packages/Bio/GenBank/Scanner.py", > line > >> 20, in ? > >> from Bio.SeqRecord import SeqRecord > >> File "/usr/pkg/lib/python2.4/site-packages/Bio/SeqRecord.py", line > 11, in > >> ? > >> File "/usr/pkg/lib/python2.4/site-packages/Bio/FormatIO.py", line 55, > in > >> __init__ > >> AttributeError: 'module' object has no attribute 'formats' > > > > Hi Matthew, > > > > your import of the GenBank module fails. Most likely your BioPython > installation > > is broken. Could you try to re-install it? > > What version of Biopython do you have Matthew? I'm pretty sure it isn't > the latest Biopython 1.44, it must be Biopython 1.43 or older... 
but > even on Biopython 1.43 doing "from Bio import GenBank" should work. > > Odd. What OS are you using, and how and where did you install Biopython? > > Peter > > From biopython at maubp.freeserve.co.uk Wed Nov 14 00:01:45 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 14 Nov 2007 00:01:45 +0000 Subject: [BioPython] script to extract records from nucleotide database In-Reply-To: <8fc5e4c20711131546g111461bfrd210b68d03f8bcd8@mail.gmail.com> References: <8fc5e4c20711121319q5c1fd46exb584978ec0b9dd76@mail.gmail.com> <473A0849.2020904@biotec.tu-dresden.de> <473A2C5F.5060102@maubp.freeserve.co.uk> <8fc5e4c20711131546g111461bfrd210b68d03f8bcd8@mail.gmail.com> Message-ID: <473A3AE9.9030506@maubp.freeserve.co.uk> Matthew Abravanel wrote: > Hi Christof and everyone, > > Thanks for all the comments and everything, the OS I am using is NetBSD > 3.1 , and I think you were right Peter about my biopython version I think it > was 1.43 instead of the latest 1.44 version. If I get the latest > 1.44v ersion of biopython do you think the code should work or do I > need to think of something else? Well, that example code looked like it should have worked on Biopython 1.43 so I am a little puzzled. I don't know if anyone else is using NetBSD with Biopython, but the import problem could be some sort of installation problem. This is why I was asking about how and where you installed Biopython (i.e. did you install from source?) 
If you do try Biopython 1.44, watch out for bug 2393: we managed to break
Bio.GenBank.NCBIDictionary. On the bright side, it's a one-line fix:
http://bugzilla.open-bio.org/show_bug.cgi?id=2393

Peter

From biopython at maubp.freeserve.co.uk Tue Nov 13 09:15:57 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 13 Nov 2007 09:15:57 +0000
Subject: [BioPython] Writing a biopython script to download all Genbank records from Nucleotide database
In-Reply-To: <473759DE.1010802@biotec.tu-dresden.de>
References: <8fc5e4c20711111107p5c3b7f94q7ccb3de7493a3279@mail.gmail.com> <473759DE.1010802@biotec.tu-dresden.de>
Message-ID: <47396B4D.6010400@maubp.freeserve.co.uk>

Christof Winter wrote:
> I used the code below to retrieve some entries from the Nucleotide database.
> Since two entries already take a few seconds, it is probably a bad idea to
> download _all_ entries in that way.
>
> You might be better off downloading the data first:
> ftp://ftp.ncbi.nih.gov/genbank/

I would agree 100%. Another benefit is that you can script an FTP download
(e.g. using wget, which copes nicely with an interrupted internet connection).

> from Bio import GenBank
>
> featureParser = GenBank.FeatureParser()
> ncbiDict = GenBank.NCBIDictionary("nucleotide", "genbank", parser=featureParser)
> ...

Note that Bio.GenBank.NCBIDictionary won't work in Biopython 1.44, but it's
been fixed again in CVS - see bug 2393.
http://bugzilla.open-bio.org/show_bug.cgi?id=2393

> accessionNumbers = ["BC063166", "NM_028459"]
>
> for accessionNo in accessionNumbers:
>     giList = GenBank.search_for(accessionNo)
>     for gi in giList:
>         record = ncbiDict[gi] # parsing happens here
> ...

I expect you can ask the NCBI for records by accession directly, rather than
doing a search to get the GI number.
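A rough sketch of the fetch-by-accession idea, assuming NCBI's E-utilities efetch endpoint and its db/id/rettype/retmode parameters. It uses only the Python 3 standard library rather than the Bio.GenBank calls discussed here, and the function names build_efetch_url and fetch_genbank_text are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: the NCBI E-utilities efetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, db="nucleotide"):
    """Build a URL asking for one record in GenBank flat-file format."""
    query = urlencode(
        {"db": db, "id": accession, "rettype": "gb", "retmode": "text"}
    )
    return EFETCH_URL + "?" + query


def fetch_genbank_text(accession):
    """Download the record as text (needs network access)."""
    with urlopen(build_efetch_url(accession)) as handle:
        return handle.read().decode("utf-8")
```

Calling fetch_genbank_text("BC063166") would return the flat file as one string, with no intermediate GI lookup. For anything beyond a handful of records the FTP route above remains the kinder option for the NCBI servers.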
Peter From winter at biotec.tu-dresden.de Wed Nov 14 09:47:32 2007 From: winter at biotec.tu-dresden.de (Christof Winter) Date: Wed, 14 Nov 2007 10:47:32 +0100 Subject: [BioPython] script to extract records from nucleotide database In-Reply-To: <473A3AE9.9030506@maubp.freeserve.co.uk> References: <8fc5e4c20711121319q5c1fd46exb584978ec0b9dd76@mail.gmail.com> <473A0849.2020904@biotec.tu-dresden.de> <473A2C5F.5060102@maubp.freeserve.co.uk> <8fc5e4c20711131546g111461bfrd210b68d03f8bcd8@mail.gmail.com> <473A3AE9.9030506@maubp.freeserve.co.uk> Message-ID: <473AC434.4030605@biotec.tu-dresden.de> Peter wrote: > Matthew Abravanel wrote: >> Hi Christof and everyone, >> >> Thanks for all the comments and everything, the OS I am using is NetBSD >> 3.1 , and I think you were right Peter about my biopython version I think it >> was 1.43 instead of the latest 1.44 version. If I get the latest >> 1.44v ersion of biopython do you think the code should work or do I >> need to think of something else? > > Well, that example code looked like it should have worked on Biopython > 1.43 so I am a little puzzled. I even run 1.42 (Debian package python-biopython 1.42-2), and it works fine. 
From idoerg at gmail.com Wed Nov 14 16:43:03 2007
From: idoerg at gmail.com (Iddo Friedberg)
Date: Wed, 14 Nov 2007 08:43:03 -0800
Subject: [BioPython] script to extract records from nucleotide database
In-Reply-To: <473AC434.4030605@biotec.tu-dresden.de>
References: <8fc5e4c20711121319q5c1fd46exb584978ec0b9dd76@mail.gmail.com> <473A0849.2020904@biotec.tu-dresden.de> <473A2C5F.5060102@maubp.freeserve.co.uk> <8fc5e4c20711131546g111461bfrd210b68d03f8bcd8@mail.gmail.com> <473A3AE9.9030506@maubp.freeserve.co.uk> <473AC434.4030605@biotec.tu-dresden.de>
Message-ID:

The culprit code in Bio/FormatIO

class FormatIO:
    def __init__(self, name,
                 default_input_format = None,
                 default_output_format = None,
                 abbrev = None,
                 registery = None):
        if abbrev is None:
            abbrev = name
        if registery is None:
            import Bio
            registery = Bio.formats

seems like this class is being instantiated from SeqRecord.py using the
following call:

io = FormatIO.FormatIO("SeqRecord",
                       default_input_format = "sequence",
                       default_output_format = "fasta")

Which causes 'registry' to be set to None, which causes a call to Bio.formats
which does not exist.

1) Can I have the offending code that started all this? (I'm a bit late in
the game, I know)
2) What is (was) Bio.formats? has it been replaced by something else?
3) What is this registry thingy? We need it?

Iddo

On Nov 14, 2007 1:47 AM, Christof Winter wrote:

> Peter wrote:
> > Matthew Abravanel wrote:
> >> Hi Christof and everyone,
> >>
> >> Thanks for all the comments and everything, the OS I am using is NetBSD
> >> 3.1 , and I think you were right Peter about my biopython version I
> think it
> >> was 1.43 instead of the latest 1.44 version. If I get the latest
> >> 1.44 version of biopython do you think the code should work or do I
> >> need to think of something else?
> >
> > Well, that example code looked like it should have worked on Biopython
> > 1.43 so I am a little puzzled.
>
> I even run 1.42 (Debian package python-biopython 1.42-2), and it works
> fine.
> _______________________________________________
> BioPython mailing list - BioPython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

--
I. Friedberg
"The only problem with troubleshooting is that sometimes trouble shoots back."

From biopython at maubp.freeserve.co.uk Wed Nov 14 21:43:36 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 14 Nov 2007 21:43:36 +0000
Subject: [BioPython] script to extract records from nucleotide database
In-Reply-To:
References: <8fc5e4c20711121319q5c1fd46exb584978ec0b9dd76@mail.gmail.com> <473A0849.2020904@biotec.tu-dresden.de> <473A2C5F.5060102@maubp.freeserve.co.uk> <8fc5e4c20711131546g111461bfrd210b68d03f8bcd8@mail.gmail.com> <473A3AE9.9030506@maubp.freeserve.co.uk> <473AC434.4030605@biotec.tu-dresden.de>
Message-ID: <473B6C08.1060605@maubp.freeserve.co.uk>

Iddo Friedberg wrote:
> The culprit code in Bio/FormatIO ...

We've removed Bio.FormatIO for Biopython 1.44 (in favour of Bio.SeqIO).
It was an input/output framework based on Martel "regular expressions" to
describe file formats.

> 1) Can I have the offending code that started all this? (I'm a bit late in
> the game, I know)

I think it was an innocent looking "from Bio import GenBank", which I
have never seen cause this error before. Hence my wondering if there was
an installation problem (e.g. a partial installation).

> 2) What is (was) Bio.formats? has it been replaced by something else?
> 3) What is this registry thingy? We need it?

I think Bio.formats and the registry thing are all tied together with
code in Bio/formatdefs, Bio/config etc. It's all very complicated, and
doesn't seem to be well documented.
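To illustrate the failure mode Iddo describes earlier in the thread, here is a small stand-alone sketch (Python 3, no Biopython required). FormatIOSketch and fake_bio are hypothetical stand-ins, not the real Bio.FormatIO code; the point is only that a default looked up lazily on a package attribute fails at instantiation time when that attribute was never set up.

```python
import types

# Stand-in for a "Bio" package whose "formats" attribute is missing,
# as it would be if that part of the package was never initialised.
fake_bio = types.ModuleType("Bio")


class FormatIOSketch:
    """Hypothetical, simplified mirror of the constructor quoted by Iddo."""

    def __init__(self, name, registry=None, package=fake_bio):
        if registry is None:
            # The default is resolved only here, so a missing attribute
            # surfaces when an instance is created, not at import of this class.
            registry = package.formats
        self.name = name
        self.registry = registry


message = None
try:
    FormatIOSketch("SeqRecord")
except AttributeError as exc:
    message = str(exc)
    print("AttributeError:", message)
```

Passing an explicit registry (for example FormatIOSketch("SeqRecord", registry={})) sidesteps the lookup, which is consistent with the error only appearing on the default path.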
Peter

From vmatthewa at gmail.com Thu Nov 15 20:50:13 2007
From: vmatthewa at gmail.com (Matthew Abravanel)
Date: Thu, 15 Nov 2007 13:50:13 -0700
Subject: [BioPython] script to extract records from nucleotide database
In-Reply-To: <473B6C08.1060605@maubp.freeserve.co.uk>
References: <8fc5e4c20711121319q5c1fd46exb584978ec0b9dd76@mail.gmail.com> <473A0849.2020904@biotec.tu-dresden.de> <473A2C5F.5060102@maubp.freeserve.co.uk> <8fc5e4c20711131546g111461bfrd210b68d03f8bcd8@mail.gmail.com> <473A3AE9.9030506@maubp.freeserve.co.uk> <473AC434.4030605@biotec.tu-dresden.de> <473B6C08.1060605@maubp.freeserve.co.uk>
Message-ID: <8fc5e4c20711151250y286ff4d4se7875e029e12eceb@mail.gmail.com>

Hi everyone,

Thanks for all the comments. Since you said you removed Bio.FormatIO in
version 1.44 and replaced it with Bio.SeqIO, do you think I can still
successfully use the code I was given if I have 1.44, provided I watch out
for bugs and so on? What is the difference between Bio.FormatIO and
Bio.SeqIO, other than them describing file formats differently? Also, how
exactly could one have a partial installation, with some of the package not
installing? Thanks again for the help.

Sincerely,
Matthew

On Nov 14, 2007 2:43 PM, Peter wrote:

> Iddo Friedberg wrote:
> > The culprit code in Bio/FormatIO ...
>
> We've removed Bio.FormatIO for Biopython 1.44 (in favour of Bio.SeqIO).
> It was an input/output framework based on Martel "regular expressions" to
> describe file formats.
>
> > 1) Can I have the offending code that started all this? (I'm a bit late
> in
> > the game, I know)
>
> I think it was an innocent looking "from Bio import GenBank", which I
> have never seen cause this error before. Hence my wondering if there was
> an installation problem (e.g. a partial installation).
>
> > 2) What is (was) Bio.formats? has it been replaced by something else?
> > 3) What is this registry thingy? We need it?
> > I think Bio.formats and the registry thing are all tied together with > code in Bio/formatdefs, Bio/config etc. Its all very complicated, and > doesn't seem to be much documented. > > Peter > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From biopython at maubp.freeserve.co.uk Thu Nov 15 21:11:21 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 15 Nov 2007 21:11:21 +0000 Subject: [BioPython] script to extract records from nucleotide database In-Reply-To: <8fc5e4c20711151250y286ff4d4se7875e029e12eceb@mail.gmail.com> References: <8fc5e4c20711121319q5c1fd46exb584978ec0b9dd76@mail.gmail.com> <473A0849.2020904@biotec.tu-dresden.de> <473A2C5F.5060102@maubp.freeserve.co.uk> <8fc5e4c20711131546g111461bfrd210b68d03f8bcd8@mail.gmail.com> <473A3AE9.9030506@maubp.freeserve.co.uk> <473AC434.4030605@biotec.tu-dresden.de> <473B6C08.1060605@maubp.freeserve.co.uk> <8fc5e4c20711151250y286ff4d4se7875e029e12eceb@mail.gmail.com> Message-ID: <320fb6e00711151311q700a473age0e026b8fcfb5c5@mail.gmail.com> > Thanks for all the comments, so since you said you removed Bio.FormatIO in > version 1.44 and replaced it with Bio.SeqIO do you think I can still > successfully use that code I was given if I have 1.44 provided I watch out > for bugs and so on? Assuming you can apply the fix for Bug 2393, then that Bio.GenBank.NCBIDictionary code should work fine with Biopython 1.44. There is also a related example in the SeqIO chapter of the tutorial using the Bio.GenBank.download_many() function. > What is the difference between Bio.FormatIO and > Bio.SeqIO, other then them describing file formats differently? In terms of typical end use, Bio.SeqIO and Bio.FormatIO provided similar capabilities, but FormatIO wasn't very up to date in terms of its format support. The big differences are internal. 
For any new code, please try Bio.SeqIO (available in Biopython 1.43 onwards),
which is described in the tutorial and the wiki:
http://biopython.org/wiki/SeqIO

> Also how exactly could one have a partial installation, some of the package not
> installing?

This was a guess - there is/was clearly something odd about your
install. If you installed from source, maybe some step failed part
way, leaving you with only some parts installed. Another possibility
is that on BSD there is something different about the installation paths
which is confusing things. We haven't worked out what went wrong on
your system, so I was just speculating.

Peter

From holger.dinkel at gmail.com Fri Nov 16 10:38:56 2007
From: holger.dinkel at gmail.com (holger.dinkel at gmail.com)
Date: Fri, 16 Nov 2007 11:38:56 +0100
Subject: [BioPython] Prosite / Prorule
Message-ID: <20071116103856.GM8243@megaira.biochem.uni-erlangen.de>

Hello List,

I just stumbled upon an error with the parsing of a 'newer' (>20) version of
Prosite: Prosite introduced a new field called ProRules which causes errors
in parsing with Bio/Prosite/__init__.py / Bio/ParserSupport.py.

I updated biopython to 1.44, but the error persists.
Here is the Traceback:
----------------------------------------------------------------------------------------------------
  File "/usr/lib/python2.4/site-packages/Bio/Prosite/__init__.py", line 227, in next
    return self._parser.parse(File.StringHandle(data))
  File "/usr/lib/python2.4/site-packages/Bio/Prosite/__init__.py", line 349, in parse
    self._scanner.feed(handle, self._consumer)
  File "/usr/lib/python2.4/site-packages/Bio/Prosite/__init__.py", line 381, in feed
    self._scan_record(uhandle, consumer)
  File "/usr/lib/python2.4/site-packages/Bio/Prosite/__init__.py", line 396, in _scan_record
    fn(self, uhandle, consumer)
  File "/usr/lib/python2.4/site-packages/Bio/Prosite/__init__.py", line 477, in _scan_do
    self._scan_line('DO', uhandle, consumer.documentation, exactly_one=1)
  File "/usr/lib/python2.4/site-packages/Bio/Prosite/__init__.py", line 418, in _scan_line
    read_and_call(uhandle, event_fn, start=line_type)
  File "/usr/lib/python2.4/site-packages/Bio/ParserSupport.py", line 300, in read_and_call
    raise SyntaxError, errmsg
SyntaxError: Line does not start with 'DO': PR PRU00498;
----------------------------------------------------------------------------------------------------

I tried to figure out, where the problem lies, but I do not really understand
the structure of the parsing modules in 'Bio/Prosite/__init__.py'

I tried to create a new entry for the prorule:
define a
    def _scan_pr(self, uhandle, consumer):
        self._scan_line('PR', uhandle, consumer.identification, up_to_one=1)
add that to the '_scan_fns' and so on, but then the scanning order seems to
get out of order, and i get a different
"SyntaxError: Line does not start with ..." error...

Is the parsing mechanism described anywhere, so I can look it up and fix the
error?

Regards,
Holger

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL:

From vmatthewa at gmail.com Fri Nov 16 21:26:10 2007
From: vmatthewa at gmail.com (Matthew Abravanel)
Date: Fri, 16 Nov 2007 14:26:10 -0700
Subject: [BioPython] Bug 2393 in bugzilla for the Bio.GenBank.NCBIDictionary code
Message-ID: <8fc5e4c20711161326q27723a16re5eca3de82573f8e@mail.gmail.com>

Hi all,

Not being familiar with the bugzilla feature, I was wondering: if I want
to fix Bug 2393, there seems to be a patch with the .py extension that was
created for it. Do I need to download that file or type it into my existing
code or something like that? Please excuse my ignorance. Thanks.

Matthew

On Nov 15, 2007 2:11 PM, Peter wrote:

> > Thanks for all the comments, so since you said you removed Bio.FormatIO in
> > version 1.44 and replaced it with Bio.SeqIO do you think I can still
> > successfully use that code I was given if I have 1.44 provided I watch
> out
> > for bugs and so on?
>
> Assuming you can apply the fix for Bug 2393, then that
> Bio.GenBank.NCBIDictionary code should work fine with Biopython 1.44.
>
> There is also a related example in the SeqIO chapter of the tutorial
> using the Bio.GenBank.download_many() function.
>
> > What is the difference between Bio.FormatIO and
> > Bio.SeqIO, other then them describing file formats differently?
>
> In terms of typical end use, Bio.SeqIO and Bio.FormatIO provided
> similar capabilities, but FormatIO wasn't very up to date in terms of
> its format support. The big differences are internal. For any new
> code, please try Bio.SeqIO (available in Biopython 1.43 onwards),
> which is described in the tutorial and the wiki:
> http://biopython.org/wiki/SeqIO
>
> > Also how exactly could one have a partial installation, some of the
> package not
> > installing?
>
> This was a guess - there is/was clearly something odd about your
> install.
> If you installed from source, maybe some step failed part
> way leaving you with only some parts installed. Another possibility
> is that on BSD there is something different about the installation paths
> which is confusing things. We haven't worked out what went wrong on
> your system so I was just speculating.
>
> Peter
>

From biopython at maubp.freeserve.co.uk Fri Nov 16 21:38:14 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 16 Nov 2007 21:38:14 +0000
Subject: [BioPython] Bug 2393 in bugzilla for the Bio.GenBank.NCBIDictionary code
In-Reply-To: <8fc5e4c20711161326q27723a16re5eca3de82573f8e@mail.gmail.com>
References: <8fc5e4c20711161326q27723a16re5eca3de82573f8e@mail.gmail.com>
Message-ID: <320fb6e00711161338w73f8e7d1l964671208e263ba3@mail.gmail.com>

On Nov 16, 2007 9:26 PM, Matthew Abravanel wrote:
> Hi all,
>
> I was wondering not being familiar with the bugzilla feature, if I wanted
> to fix Bug 2393 there seems to be a patch with the .py extension that was
> created for it. Do I need to download that file or type it into my existing
> code or something like that? Please excuse my ignorance. Thanks.
>
> Matthew

http://bugzilla.open-bio.org/show_bug.cgi?id=2393

Right now, Bug 2393 has two attachments:
* Michiel's patch - which makes some drastic changes
* My suggested test case (a python file)

Ignore these for now. Instead, I suggest you make the "quick-and-dirty"
one-line change described in comment 2, which has since been checked into
CVS (comment 5):

http://bugzilla.open-bio.org/show_bug.cgi?id=2393#c2
http://bugzilla.open-bio.org/show_bug.cgi?id=2393#c5

Or just download the latest Bio/GenBank/__init__.py from the web interface
to CVS:
http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/GenBank/__init__.py?cvsroot=biopython

Back up your old file, and replace it with the new one (assuming you have
sufficient admin rights).
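The back-up-then-replace step could be scripted along these lines. This is only a sketch for a current Python 3: replace_with_backup and the ".bak" suffix are illustrative choices, and the paths in the usage note are placeholders for wherever Biopython lives on your system.

```python
import shutil
from pathlib import Path


def replace_with_backup(installed, replacement, suffix=".bak"):
    """Copy `installed` to `installed + suffix`, then overwrite it with `replacement`.

    Returns the path of the backup copy.
    """
    installed = Path(installed)
    backup = installed.with_name(installed.name + suffix)
    shutil.copy2(installed, backup)       # keep the old file, timestamps included
    shutil.copy2(replacement, installed)  # put the downloaded file in its place
    return backup
```

Usage would look like replace_with_backup(".../site-packages/Bio/GenBank/__init__.py", "downloaded__init__.py"), run with whatever rights the install directory needs. Restoring is then a matter of copying the .bak file back.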
Peter From timmcilveen at talktalk.net Sat Nov 17 19:48:38 2007 From: timmcilveen at talktalk.net (tim) Date: Sat, 17 Nov 2007 19:48:38 +0000 Subject: [BioPython] installing mxTextTools In-Reply-To: <473B6C08.1060605@maubp.freeserve.co.uk> References: <8fc5e4c20711121319q5c1fd46exb584978ec0b9dd76@mail.gmail.com> <473A0849.2020904@biotec.tu-dresden.de> <473A2C5F.5060102@maubp.freeserve.co.uk> <8fc5e4c20711131546g111461bfrd210b68d03f8bcd8@mail.gmail.com> <473A3AE9.9030506@maubp.freeserve.co.uk> <473AC434.4030605@biotec.tu-dresden.de> <473B6C08.1060605@maubp.freeserve.co.uk> Message-ID: <1195328918.5259.10.camel@linux-qvtz.site> Hi, I am setting up biopython on my Suse Linux 10.3. I have Python 2.5.1 installed on my system already. After downloading mxTextTools i unzip it and start the install, but get the error: invalid Python installation: unable to open /usr/lib/python2.5/config/Makefile (No such file or directory) Indeed when I browse to python 2.5 in /usr/lib/python2.5/ , there is no config folder. Python 2.5 works fine though, so what is going on here? Any ideas anyone? Thanks, Tim From timmcilveen at talktalk.net Sat Nov 17 19:40:57 2007 From: timmcilveen at talktalk.net (tim) Date: Sat, 17 Nov 2007 19:40:57 +0000 Subject: [BioPython] installing mxtext tools In-Reply-To: <473B6C08.1060605@maubp.freeserve.co.uk> References: <8fc5e4c20711121319q5c1fd46exb584978ec0b9dd76@mail.gmail.com> <473A0849.2020904@biotec.tu-dresden.de> <473A2C5F.5060102@maubp.freeserve.co.uk> <8fc5e4c20711131546g111461bfrd210b68d03f8bcd8@mail.gmail.com> <473A3AE9.9030506@maubp.freeserve.co.uk> <473AC434.4030605@biotec.tu-dresden.de> <473B6C08.1060605@maubp.freeserve.co.uk> Message-ID: <1195328457.5259.4.camel@linux-qvtz.site> Hi, On Suse Linux I am trying to install Biopython. I have Python 2.5.1. installed and I need to install mxtext tools into python. 
When I perform : I get the error message invalid Python installation: unable to open /usr/lib/python2.5/config/Makefile (no such file or directory) Indeed when I browse the file system of python 2.5.1, I find no such file. Python is working fine though. From biopython at maubp.freeserve.co.uk Sat Nov 17 20:10:51 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 17 Nov 2007 20:10:51 +0000 Subject: [BioPython] installing mxtext tools In-Reply-To: <1195328457.5259.4.camel@linux-qvtz.site> References: <8fc5e4c20711121319q5c1fd46exb584978ec0b9dd76@mail.gmail.com> <473A0849.2020904@biotec.tu-dresden.de> <473A2C5F.5060102@maubp.freeserve.co.uk> <8fc5e4c20711131546g111461bfrd210b68d03f8bcd8@mail.gmail.com> <473A3AE9.9030506@maubp.freeserve.co.uk> <473AC434.4030605@biotec.tu-dresden.de> <473B6C08.1060605@maubp.freeserve.co.uk> <1195328457.5259.4.camel@linux-qvtz.site> Message-ID: <320fb6e00711171210g7fdcbf5cl477affd092cd7035@mail.gmail.com> > Hi, > On Suse Linux I am trying to install Biopython. I have Python 2.5.1. > installed and I need to install mxtext tools into python. My guess is you tried a standard "python setup.py install" which won't work on mxTextTools (because egenix provide things pre-compiled). 
On their webpage they suggest this for a system wide install: sudo python setup.py build --skip install If all else fails, I suggest you ask on the egenix mailing list: http://www.egenix.com/support/mailing-lists/ Peter From timmcilveen at talktalk.net Sat Nov 17 21:36:30 2007 From: timmcilveen at talktalk.net (tim) Date: Sat, 17 Nov 2007 21:36:30 +0000 Subject: [BioPython] installing mxtext tools In-Reply-To: <320fb6e00711171210g7fdcbf5cl477affd092cd7035@mail.gmail.com> References: <8fc5e4c20711121319q5c1fd46exb584978ec0b9dd76@mail.gmail.com> <473A0849.2020904@biotec.tu-dresden.de> <473A2C5F.5060102@maubp.freeserve.co.uk> <8fc5e4c20711131546g111461bfrd210b68d03f8bcd8@mail.gmail.com> <473A3AE9.9030506@maubp.freeserve.co.uk> <473AC434.4030605@biotec.tu-dresden.de> <473B6C08.1060605@maubp.freeserve.co.uk> <1195328457.5259.4.camel@linux-qvtz.site> <320fb6e00711171210g7fdcbf5cl477affd092cd7035@mail.gmail.com> Message-ID: <1195335390.4371.2.camel@linux-qvtz.site> Hi Peter, I did use > sudo python setup.py build --skip install but with no success. I'll try the support list at Egenix. Thanks for the quick reply :-) Tim On Sat, 2007-11-17 at 20:10 +0000, Peter wrote: > > Hi, > > On Suse Linux I am trying to install Biopython. I have Python 2.5.1. > > installed and I need to install mxtext tools into python. > > My guess is you tried a standard "python setup.py install" which > won't work on mxTextTools (because egenix provide things pre-compiled). 
> On their webpage they suggest this for a system wide install: > > sudo python setup.py build --skip install > > If all else fails, I suggest you ask on the egenix mailing list: > http://www.egenix.com/support/mailing-lists/ > > Peter From biopython at maubp.freeserve.co.uk Sat Nov 17 21:44:22 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 17 Nov 2007 21:44:22 +0000 Subject: [BioPython] installing mxtext tools In-Reply-To: <1195335390.4371.2.camel@linux-qvtz.site> References: <8fc5e4c20711121319q5c1fd46exb584978ec0b9dd76@mail.gmail.com> <473A2C5F.5060102@maubp.freeserve.co.uk> <8fc5e4c20711131546g111461bfrd210b68d03f8bcd8@mail.gmail.com> <473A3AE9.9030506@maubp.freeserve.co.uk> <473AC434.4030605@biotec.tu-dresden.de> <473B6C08.1060605@maubp.freeserve.co.uk> <1195328457.5259.4.camel@linux-qvtz.site> <320fb6e00711171210g7fdcbf5cl477affd092cd7035@mail.gmail.com> <1195335390.4371.2.camel@linux-qvtz.site> Message-ID: <320fb6e00711171344oa111afaj28f525b19dbfa34@mail.gmail.com> tim wrote: > I did use > >> sudo python setup.py build --skip install > > but with no success. I'll try the support list at Egenix. By the way - Biopython had/has a few issues with mxTextTools 3.0, most of which we have now worked around in Biopython 1.44. I notice egenix now have mxTextTools 2.0 available on their website once again - if you are only installing this for Biopython then I would recommend you use mxTextTools 2.0 instead. http://www.egenix.com/www2002/python/eGenix-mx-Extensions-v2.x.html/ I'm just going to update our wiki to mention this... 
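On the "unable to open /usr/lib/python2.5/config/Makefile" error reported at the start of this thread: that Makefile is read when Python builds extension modules, and on many Linux distributions it ships in a separate Python development package rather than with the interpreter itself. Whether that applies to this SUSE install is an assumption, not something confirmed here. A minimal check, written for a current Python 3 (on Python 2.x the equivalent call lives in distutils.sysconfig):

```python
import os
import sysconfig


def makefile_status():
    """Return (path, exists) for the Makefile this interpreter expects."""
    path = sysconfig.get_makefile_filename()
    return path, os.path.exists(path)


if __name__ == "__main__":
    path, exists = makefile_status()
    print(path, "found" if exists else "MISSING")
```

If the file is reported missing, installing the distribution's Python development package is the usual first thing to try before building mxTextTools from source.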
Peter From anablopes at gmail.com Sun Nov 18 18:49:53 2007 From: anablopes at gmail.com (Ana Branca Lopes) Date: Sun, 18 Nov 2007 18:49:53 +0000 Subject: [BioPython] number of enzimes in a single pdb file Message-ID: <2489921e0711181049i1e11cef3ia6d02b62abad776e@mail.gmail.com> Hello, When a PDB file has more than one crystallographic unit (more than one protein) in a single file, is there any way to know how many copies there are? Many thanks for any help Ana From mdehoon at c2b2.columbia.edu Mon Nov 19 01:20:35 2007 From: mdehoon at c2b2.columbia.edu (Michiel De Hoon) Date: Sun, 18 Nov 2007 20:20:35 -0500 Subject: [BioPython] Bug 2393 in bugzilla for theBio.GenBank.NCBIDictionary code References: <8fc5e4c20711161326q27723a16re5eca3de82573f8e@mail.gmail.com> <320fb6e00711161338w73f8e7d1l964671208e263ba3@mail.gmail.com> Message-ID: <6243BAA9F5E0D24DA41B27997D1FD14402B661@mail2.exch.c2b2.columbia.edu> It's probably a good idea to make a new Biopython release in the near future, after fixing this bug. --Michiel. Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1150 St Nicholas Avenue New York, NY 10032 -----Original Message----- From: biopython-bounces at lists.open-bio.org on behalf of Peter Sent: Fri 11/16/2007 4:38 PM To: Matthew Abravanel Cc: biopython at lists.open-bio.org Subject: Re: [BioPython] Bug 2393 in bugzilla for theBio.GenBank.NCBIDictionary code On Nov 16, 2007 9:26 PM, Matthew Abravanel wrote: > Hi all, > > I was wondering not being familiar with the bugzilla feature, if I wanted > to fix Bug 2393 there seems to be a patch with the .py extention that was > created for it. Do I need to download that file or type it into my existing > code or something like that? Please excuse my ignorence. Thanks. > > Matthew http://bugzilla.open-bio.org/show_bug.cgi?id=2393 Right now, Bug 2393 has two attachments: * Michiel's patch - which makes some drastic changes * My suggested test case (a python file) Ignore these for now. 
Instead, I suggest you make the "quick-and-dirty" one line change
described in comment 2, which has since been checked into CVS (comment 5)

http://bugzilla.open-bio.org/show_bug.cgi?id=2393#c2
http://bugzilla.open-bio.org/show_bug.cgi?id=2393#c5

Or just download the latest Bio/GenBank/__init__.py from the web interface to CVS:

http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/GenBank/__init__.py?cvsroot=biopython

Back up your old file, and replace it with the new one (assuming you
have sufficient admin rights).

Peter
_______________________________________________
BioPython mailing list - BioPython at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython

From mdehoon at c2b2.columbia.edu Mon Nov 19 01:19:00 2007
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Sun, 18 Nov 2007 20:19:00 -0500
Subject: [BioPython] Prosite / Prorule
References: <20071116103856.GM8243@megaira.biochem.uni-erlangen.de>
Message-ID: <6243BAA9F5E0D24DA41B27997D1FD14402B660@mail2.exch.c2b2.columbia.edu>

Holger wrote:
> I just stumbled upon an error with the parsing of a 'newer' (>20) version of
> Prosite: Prosite introduced a new field called ProRules which cause errors
> in parsing with Bio/Prosite/__init__.py / Bio/ParserSupport.py.
...
> I tried to figure out, where the problem lies, but I do not really understand
> the structure of the parsing modules in 'Bio/Prosite/__init__.py'
...
> Is the parsing mechanism described anywhere, so I can look it up and fix the error?

The Prosite parser was written about five years ago, and it may very well be
that none of the currently active Biopython developers really know how this
parser works. In that case, one option may be to write a new Prosite parser
from scratch. That could even be an easier solution than trying to fix the
existing parser.
If you decide to go that way, it would be a good idea to
discuss the Prosite parser design beforehand on the development mailing list
(biopython-dev at biopython.org).

--Michiel

Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032

From holger.dinkel at gmail.com Mon Nov 19 09:24:24 2007
From: holger.dinkel at gmail.com (holger.dinkel at gmail.com)
Date: Mon, 19 Nov 2007 10:24:24 +0100
Subject: [BioPython] Prosite / Prorule
In-Reply-To: <6243BAA9F5E0D24DA41B27997D1FD14402B660@mail2.exch.c2b2.columbia.edu> <473DBF04.3070509@maubp.freeserve.co.uk>
References: <20071116103856.GM8243@megaira.biochem.uni-erlangen.de> <6243BAA9F5E0D24DA41B27997D1FD14402B660@mail2.exch.c2b2.columbia.edu> <20071116103856.GM8243@megaira.biochem.uni-erlangen.de> <473DBF04.3070509@maubp.freeserve.co.uk>
Message-ID: <20071119092424.GC6177@megaira.biochem.uni-erlangen.de>

Hello Peter and Michiel,

* Peter wrote:
> holger.dinkel at gmail.com wrote:
>
> Could you file a bug and attach a small recent Prosite file which has this problem?

I have created a bugreport (#2403) and also attached two files (a script to show the error and a prosite-file 'prosite_test.dat')

> Note that the order in _scan_fns does matter.

are you sure about that?
The definition of the '_scan_fns'-List, which holds all callbacks to prosite-entries, shows some 'redundancy'.
This makes me think, that the entries are handled sequentially:

----------------------------------------------------------------------------------------------------
    _scan_fns = [
        _scan_id,
        _scan_ac,
        _scan_dt,
        _scan_de,
        _scan_pa,
        _scan_ma,
        _scan_ru,
        _scan_nr,
        _scan_cc,

        # This is a really dirty hack, and should be fixed properly at
        # some point.  ZN2_CY6_FUNGAL_2, DNAJ_2 in Rel 15 and PS50309
        # in Rel 17 have lines out of order.  Thus, I have to rescan
        # these, which decreases performance.
        _scan_ma,
        _scan_nr,
        _scan_cc,

        _scan_dr,
        _scan_3d,
        _scan_do,
        _scan_terminator
        ]
----------------------------------------------------------------------------------------------------

And while scanning prosite-records the function '_scan_record' simply iterates over the _scan_fns-entries:

----------------------------------------------------------------------------------------------------
    def _scan_record(self, uhandle, consumer):
        consumer.start_record()
        for fn in self._scan_fns:
            fn(self, uhandle, consumer)
----------------------------------------------------------------------------------------------------

>
> Not that I am aware of, however the SwissProt parser looks very similar, so we should be able to fix this without too much hassle.
>
> Thanks
>
> Peter

* Michiel De Hoon wrote:
>
> The Prosite parser was written about five years ago, and it may very well be
> that none of the currently active Biopython developers really know how this
> parser works. In that case, one option may be to write a new Prosite parser
> from scratch. That could even be an easier solution than trying to fix the
> existing parser. If you decide to go that way, it would be a good idea to
> discuss the Prosite parser design beforehand on the development mailing list
> (biopython-dev at biopython.org).
>
> --Michiel

Re-writing the parser might be the best choice here. Unfortunately, I have not
much experience in writing parsers and also had quite a hard time trying to
understand what was going on in the Prosite RecordParser... 8-/

The way I THINK this should be done, is some event-driven mechanism, where the
first letters of the scanned line determine what kind of information follows.
As compared to iterating over a list (like in the current _scan_fns) and trying
to match each entry with the line...

Could you point me to a parser-implementation which functions as a 'template' of
good parser design. Maybe I can merge it with the existing Prosite-Parser...
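
As a rough illustration of the event-driven idea described above, here is a minimal sketch in which the first two characters of each line decide where the data goes, so neither the order of the lines nor a newly introduced line code can break the scan. It is written for Python 3 with the standard library only. It is not the Biopython parser: the name parse_records is made up for this example, and the layout it assumes (two-letter code, three spaces, then data, with "//" closing a record) is the Swiss-Prot-style layout that Prosite files appear to follow.

```python
def parse_records(lines):
    """Yield one record at a time as a dict mapping line code -> list of values.

    Assumed layout (not taken from the Biopython source): a two-letter
    line code, three spaces, then the data; a line starting with '//'
    terminates the current record.
    """
    record = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # ignore blank lines
        code, text = line[:2], line[5:]
        if code == "//":
            # End of record: hand it to the caller and start a fresh one.
            yield record
            record = {}
        else:
            # Dispatch on the line code. Unknown codes are stored rather
            # than rejected, so a new field does not raise an error.
            record.setdefault(code, []).append(text)
    if record:
        # Tolerate a final record that lacks its '//' terminator.
        yield record
```

Because the function is a generator, a caller can loop over the records of a large file without holding all of them in memory; per-field handling would then be a lookup on the line code rather than a fixed sequence of scan functions.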
Thanks to all of you,

Holger

From mdehoon at c2b2.columbia.edu Mon Nov 19 09:49:23 2007
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Mon, 19 Nov 2007 04:49:23 -0500
Subject: [BioPython] Prosite / Prorule
References: <20071116103856.GM8243@megaira.biochem.uni-erlangen.de><6243BAA9F5E0D24DA41B27997D1FD14402B660@mail2.exch.c2b2.columbia.edu><20071116103856.GM8243@megaira.biochem.uni-erlangen.de><473DBF04.3070509@maubp.freeserve.co.uk> <20071119092424.GC6177@megaira.biochem.uni-erlangen.de>
Message-ID: <6243BAA9F5E0D24DA41B27997D1FD14402B664@mail2.exch.c2b2.columbia.edu>

> Re-writing the parser might be the best choice here. Unfortunately, I have not
> much experience in writing parsers and also had quite a hard time trying to
> understand what was going on in the Prosite RecordParser... 8-/
>
> The way I THINK this should be done, is some event-driven mechanism, where the
> first letters of the scanned line determine what kind of information follows.
> As compared to iterating over a list (like in the current _scan_fns) and trying
> to match each entry with the line...
>
> Could you point me to a parser-implementation which functions as a 'template' of
> good parser design. Maybe I can merge it with the existing Prosite-Parser...

You could have a look at the function "parse" in Bio/KEGG/Enzyme/__init__.py

This is something I wrote for Biopython release 1.44, when it turned out that the new version of mxTextTools caused the previous Bio/KEGG/Enzyme parser to fail. At that time, I decided to write the parser from scratch instead of trying to fix the existing parser (mainly because I didn't understand how the existing parser worked). The result is a rather straightforward parser.

Now, for KEGG it is possible that one file contains several KEGG.Enzyme records. The "parse" function pulls them out one by one (using an iterator). This is why the function has a "yield", and no "return" at the end. From the user perspective, it works as follows:

from Bio.KEGG import Enzyme
input = open("my_kegg_file_containing_lots_of_enzymes.txt")
records = Enzyme.parse(input)
for record in records:
    # record is now one Bio.KEGG.Enzyme.Record instance
    # Do something with the record
    print record

For Prosite, I don't know if you can have several Prosite records concatenated in one file. If you do, you can use the same approach as for the KEGG parser. If not, I guess a Prosite "parse" function should just return one record directly. As in:

from Bio import Prosite
input = open("my_prosite_file.txt")
record = Prosite.parse(input)
# record is now one Bio.Prosite.Record instance

--Michiel.

From srini_iyyer_bio at yahoo.com Mon Nov 19 22:34:53 2007
From: srini_iyyer_bio at yahoo.com (Srinivas Iyyer)
Date: Mon, 19 Nov 2007 14:34:53 -0800 (PST)
Subject: [BioPython] windows : reading local blast output
Message-ID: <904468.67404.qm@web38109.mail.mud.yahoo.com>

Dear group,

I am using Python (2.4) and biopython (1.44) on Windows. I installed a local blast version for Windows.

The following code breaks down and throws the error pasted below for convenience. This part of the code works when used on Linux-based blast output. Obviously I suspect the '\r\n' for Windows.

Code:

from Bio import Blast
from Bio.Blast import NCBIStandalone
blast_out = open('C:\human\prb_blast.out','U')
result = []
b_parser = NCBIStandalone.BlastParser()
b_iterator = NCBIStandalone.Iterator(blast_out,b_parser)
b_record = b_iterator.next()

I tried opening the file handle with the 'r', 'rU' and 'U' options, without success. Could you help me here? I never had this issue before because I never used Windows for blast.

Thanks
Srini

Error report:

>>>
Traceback (most recent call last):
  File "C:\Python24\blast_parser.py", line 8, in ?
    b_record = b_iterator.next()
  File "C:\Python24\Lib\site-packages\Bio\Blast\NCBIStandalone.py", line 1553, in next
    return self._parser.parse(File.StringHandle(data))
  File "C:\Python24\Lib\site-packages\Bio\Blast\NCBIStandalone.py", line 746, in parse
    self._scanner.feed(handle, self._consumer)
  File "C:\Python24\Lib\site-packages\Bio\Blast\NCBIStandalone.py", line 99, in feed
    self._scan_rounds(uhandle, consumer)
  File "C:\Python24\Lib\site-packages\Bio\Blast\NCBIStandalone.py", line 229, in _scan_rounds
    self._scan_alignments(uhandle, consumer)
  File "C:\Python24\Lib\site-packages\Bio\Blast\NCBIStandalone.py", line 363, in _scan_alignments
    self._scan_pairwise_alignments(uhandle, consumer)
  File "C:\Python24\Lib\site-packages\Bio\Blast\NCBIStandalone.py", line 373, in _scan_pairwise_alignments
    self._scan_one_pairwise_alignment(uhandle, consumer)
  File "C:\Python24\Lib\site-packages\Bio\Blast\NCBIStandalone.py", line 385, in _scan_one_pairwise_alignment
    self._scan_hsp(uhandle, consumer)
  File "C:\Python24\Lib\site-packages\Bio\Blast\NCBIStandalone.py", line 420, in _scan_hsp
    self._scan_hsp_alignment(uhandle, consumer)
  File "C:\Python24\Lib\site-packages\Bio\Blast\NCBIStandalone.py", line 454, in _scan_hsp_alignment
    read_and_call_while(uhandle, consumer.noevent, blank=1)
  File "C:\Python24\Lib\site-packages\Bio\ParserSupport.py", line 314, in read_and_call_while
    line = safe_readline(uhandle)
  File "C:\Python24\Lib\site-packages\Bio\ParserSupport.py", line 411, in safe_readline
    raise SyntaxError, "Unexpected end of stream."
SyntaxError: Unexpected end of stream.

From mdehoon at c2b2.columbia.edu Tue Nov 20 00:57:59 2007
From: mdehoon at c2b2.columbia.edu (Michiel De Hoon)
Date: Mon, 19 Nov 2007 19:57:59 -0500
Subject: [BioPython] windows : reading local blast output
References: <904468.67404.qm@web38109.mail.mud.yahoo.com>
Message-ID: <6243BAA9F5E0D24DA41B27997D1FD14402B666@mail2.exch.c2b2.columbia.edu>

It looks like you are trying to parse Blast plain-text output. It is not necessarily related to the \r\n problem, it may be that you are running a different Blast version on Windows. Differences between Blast versions tend to break the plain-text Blast output parser. How about trying to parse Blast output in XML format? See the tutorial for more information.

--Michiel.

Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1150 St Nicholas Avenue
New York, NY 10032
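
One quick way to rule the '\r\n' suspicion in or out before looking at Blast versions is to count the line endings in the output file directly. A minimal sketch for Python 3, standard library only; both function names are made up for this example and are not part of Biopython:

```python
def count_line_endings(data):
    """Count CRLF, lone LF and lone CR line endings in a bytes object."""
    crlf = data.count(b"\r\n")
    lone_lf = data.count(b"\n") - crlf  # every CRLF also contains one LF
    lone_cr = data.count(b"\r") - crlf  # every CRLF also contains one CR
    return {"crlf": crlf, "lf": lone_lf, "cr": lone_cr}


def count_line_endings_in_file(path):
    """Read a file in binary mode (no newline translation) and count its endings."""
    with open(path, "rb") as handle:
        return count_line_endings(handle.read())
```

If the Windows output turns out to contain only one kind of line ending and the file still fails to parse, the line endings are unlikely to be the cause, which would point towards the version difference mentioned above.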

From biopython at maubp.freeserve.co.uk Mon Nov 19 23:26:08 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 19 Nov 2007 23:26:08 +0000
Subject: [BioPython] windows : reading local blast output
In-Reply-To: <904468.67404.qm@web38109.mail.mud.yahoo.com>
References: <904468.67404.qm@web38109.mail.mud.yahoo.com>
Message-ID: <47421B90.9000904@maubp.freeserve.co.uk>

Srinivas Iyyer wrote:
> Dear group,
>
> I am using Python (2.4) and biopython(1.44) in windows. I installed a
> local blast version for windows.

Did you install Biopython 1.44 using the Windows installer?

What version of stand alone blast are you using?

> The following code breaks down and throws the error pasted below for
> convenience: This part of the code works when used on Linux based
> blast output. Obviously I suspect the '\r\n' for windows.

Could you file a bug [with the full version information], and then upload the problem output file C:\human\prb_blast.out please? Then we can try and reproduce the problem.

http://bugzilla.open-bio.org/enter_bug.cgi?product=Biopython

As to the new lines, if that is the problem, I would have expected opening the handle in universal mode should have fixed it. Have you tried experimenting with dos2unix and unix2dos on the file?

Also - could you try XML output rather than plain text? See the tutorial for examples.
http://biopython.org/DIST/docs/tutorial/Tutorial.html Peter From biopython at maubp.freeserve.co.uk Fri Nov 16 16:02:12 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 16 Nov 2007 16:02:12 +0000 Subject: [BioPython] Prosite / Prorule In-Reply-To: <20071116103856.GM8243@megaira.biochem.uni-erlangen.de> References: <20071116103856.GM8243@megaira.biochem.uni-erlangen.de> Message-ID: <473DBF04.3070509@maubp.freeserve.co.uk> holger.dinkel at gmail.com wrote: > Hello List, > > I just stumbled upon an error with the parsing of a 'newer' (>20) version of > Prosite: Prosite introduced a new field called ProRules which cause errors > in parsing with Bio/Prosite/__init__.py / Bio/ParserSupport.py. > I updated biopython to 1.44, but the error persists. Could you file a bug and attach a small recent Prosite file which has this problem? > I tried to figure out, where the problem lies, but I do not really understand > the structure of the parsing modules in 'Bio/Prosite/__init__.py' > I tried to create a new entry for the prorule: > define a > > def _scan_pr(self, uhandle, consumer): > self._scan_line('PR', uhandle, consumer.identification, up_to_one=1) > > add that to the '_scan_fns' and so on, but then the scanning order seems to get > out of order, and i get a different "SyntaxError: Line does not start with ..." > error... Note that the order in _scan_fns does matter. > Is the parsing mechanism described anywhere, so I can look it up and fix the error? Not that I am aware of, however the SwissProt parser looks very similar, so we should be able to fix this without too much hassle. 
Thanks

Peter

From biopython at maubp.freeserve.co.uk Tue Nov 20 10:18:31 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 20 Nov 2007 10:18:31 +0000
Subject: [BioPython] Prosite / Prorule
In-Reply-To: <473DBF04.3070509@maubp.freeserve.co.uk>
References: <20071116103856.GM8243@megaira.biochem.uni-erlangen.de> <473DBF04.3070509@maubp.freeserve.co.uk>
Message-ID: <320fb6e00711200218i3d446510l6cc7009d4c9ed08b@mail.gmail.com>

> Could you file a bug and attach a small recent Prosite file which has
> this problem?

Holger reported bug 2403, which I believe I have fixed (having worked
with our SwissProt parser before, I found this quite straightforward):

http://bugzilla.open-bio.org/show_bug.cgi?id=2403

Peter

From holger.dinkel at gmail.com Tue Nov 20 12:35:15 2007
From: holger.dinkel at gmail.com (holger.dinkel at gmail.com)
Date: Tue, 20 Nov 2007 13:35:15 +0100
Subject: [BioPython] Prosite / Prorule
In-Reply-To: <320fb6e00711200218i3d446510l6cc7009d4c9ed08b@mail.gmail.com>
References: <20071116103856.GM8243@megaira.biochem.uni-erlangen.de> <473DBF04.3070509@maubp.freeserve.co.uk> <320fb6e00711200218i3d446510l6cc7009d4c9ed08b@mail.gmail.com>
Message-ID: <20071120123515.GA8723@megaira.biochem.uni-erlangen.de>

Hello Peter,

thank you very much for your really quick help! That bug is fixed! ;->

But alas, there are still some errors thrown when scanning the whole prosite_20.dat (they only show up now since the other errors were fixed):

Firstly, the Prosite team has also introduced a new field called "postprocessing", so now the parser chokes on that.

Secondly, the parser breaks at some special comment lines containing author names, of the form "CC /AUTHOR=K_Hofmann; N_Hulo" (Prosite-Acc PS50293): the comments are split into columns, and each column is then split into qualifier and value at the "=" character. As Mr. Hulo does not have an "/AUTHOR=" prepended, an error is raised...

I was able to fix the first problem in the same straightforward way as Peter did, by inserting a postprocessing entry.
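
The second failure is easy to reproduce outside the parser: a column without "=" splits into a single piece, which cannot be unpacked into two names. Below is a minimal sketch of a more tolerant split, for Python 3 with the standard library only. parse_cc_fields is a hypothetical helper, not the Biopython code, and treating a bare column such as "N_Hulo" as a continuation of the previous value is only an assumption about what the author list is meant to express.

```python
def parse_cc_fields(text):
    """Split CC data such as '/AUTHOR=K_Hofmann; N_Hulo' into (qualifier, value) pairs."""
    pairs = []
    for col in text.split(";"):
        col = col.strip()
        if not col:
            continue  # trailing ';' leaves an empty column
        # partition() always returns three parts, so it never fails to unpack,
        # and it splits only at the first '='.
        qual, sep, data = col.partition("=")
        if sep:
            pairs.append((qual, data))
        elif pairs:
            # No '=' in this column: assume it continues the previous value.
            prev_qual, prev_data = pairs[-1]
            pairs[-1] = (prev_qual, prev_data + "; " + col)
        else:
            # A bare column with nothing before it: keep it with an empty value.
            pairs.append((col, ""))
    return pairs
```

With this approach no author name has to be special-cased, although the real fix would of course have to fit the consumer interface the existing parser uses.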

I could also solve the second problem, but only with a hack which might not suit everybody:

First, I split the "qual, data = [word.lstrip() for word in col.split("=")]" into two statements, to avoid the ValueError raised when a column contains no "=":

qual = [word.lstrip() for word in col.split("=")][0]
data = ''.join([word.lstrip() for word in col.split("=")][1:])

and then I introduced a hack to circumvent the aforementioned problem. I changed

if qual == '/TAXO-RANGE':

to

if qual == 'N_Hulo':
    continue
elif qual == '/TAXO-RANGE':

I know this is far from excellent, but crude enough to work ;->

If you'd like to incorporate at least the first changes, you can find the 'new' __init__.py file attached to bug #2403, as a whole file as well as a patch. It successfully scans Prosite versions 18 to 20 (others not checked).
I could also send it to the list, but I am not sure if mails with attachments are allowed here?

* Peter wrote:
>
> Holger reported bug 2403, which I believe I have fixed (having worked
> with our SwissProt parser before I found this quite straight forward):
> http://bugzilla.open-bio.org/show_bug.cgi?id=2403
>
> Peter

best wishes,
Holger

From arareko at campus.iztacala.unam.mx Thu Nov 22 16:37:24 2007
From: arareko at campus.iztacala.unam.mx (Mauricio Herrera Cuadra)
Date: Thu, 22 Nov 2007 10:37:24 -0600
Subject: [BioPython] [BioSQL-l] BioSQL : GenBank db_xref names in dbxref table
In-Reply-To: <320fb6e00711201136i6b3ca41eo8f6718e98f79c531@mail.gmail.com>
References: <320fb6e00711201136i6b3ca41eo8f6718e98f79c531@mail.gmail.com>
Message-ID: <4745B044.5090102@campus.iztacala.unam.mx>

Hi Peter,

In BioPerl, there's no such mapping for db_xref's that I'm aware of.
Each parser handles db_xref records on its own.
Take a look at the Bio::SeqIO::genbank code, inside the next_seq() method for example:

http://code.open-bio.org/cgi-bin/viewcvs/viewcvs.cgi/bioperl-live/Bio/SeqIO/genbank.pm?rev=HEAD&content-type=text/vnd.viewcvs-markup

Regards,
Mauricio.

Peter wrote:
> Dear all,
>
> I'm one of the Biopython developers. I've recently got going with
> BioSQL and have been getting to grips with the Biopython BioSQL
> interface. I'm aware that we need to try and be consistent with
> BioPerl and BioJava, so I'd like to pose my first question related to
> that.
>
> When loading GenBank records, many features have db_xref qualifiers,
> e.g. from a random CDS feature in E. coli K12:
>
> /db_xref="ASAP:1309"
> /db_xref="GI:16128366"
> /db_xref="ECOCYC:EG10213"
> /db_xref="GeneID:945313"
>
> Biopython attempts to translate the strings "ASAP", "GI", "ECOCYC",
> "GeneID" before recording these entries in the seqfeature_dbxref
> and dbxref tables. For example, "GI" becomes "GeneIndex".
> Biopython's current mapping is as follows:
>
> # Dictionary of database types, keyed by GenBank db_xref abbreviation
> db_dict = {'GeneID': 'Entrez',
>            'GI': 'GeneIndex',
>            'COG': 'COG',
>            'CDD': 'CDD',
>            'DDBJ': 'DNA Databank of Japan',
>            'Entrez': 'Entrez',
>            'GeneIndex': 'GeneIndex',
>            'PUBMED': 'PubMed',
>            'taxon': 'Taxon',
>            'ATCC': 'ATCC',
>            'ISFinder': 'ISFinder',
>            'GOA': 'Gene Ontology Annotation',
>            'ASAP': 'ASAP',
>            'PSEUDO': 'PSEUDO',
>            'InterPro': 'InterPro',
>            'GEO': 'Gene Expression Omnibus',
>            'EMBL': 'EMBL',
>            'UniProtKB/Swiss-Prot': 'UniProtKB/Swiss-Prot',
>            'ECOCYC': 'EcoCyc',
>            'UniProtKB/TrEMBL': 'UniProtKB/TrEMBL'
>            }
>
> In my testing, I've found several GenBank db_xref abbreviations for
> which we don't have a mapping defined, such as "LocusID", "dbSNP",
> "MGD", "MIM", or from an EMBL file, "REMTREMBL".
>
> I'd like to know if BioPerl and/or BioJava and/or BioRuby define a
> similar mapping in their BioSQL code (or GenBank parser), so that
> Biopython can follow your example.
>
> Thank you,
>
> Peter
>
> P.S. See also Biopython bug 2405
> http://bugzilla.open-bio.org/show_bug.cgi?id=2405
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biosql-l
>

--
MAURICIO HERRERA CUADRA
arareko at campus.iztacala.unam.mx
Laboratorio de Genética
Unidad de Morfofisiología y Función
Facultad de Estudios Superiores Iztacala, UNAM

From cjfields at uiuc.edu Fri Nov 23 00:42:12 2007
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 22 Nov 2007 18:42:12 -0600
Subject: [BioPython] [BioSQL-l] BioSQL : GenBank db_xref names in dbxref table
In-Reply-To: <4745B044.5090102@campus.iztacala.unam.mx>
References: <320fb6e00711201136i6b3ca41eo8f6718e98f79c531@mail.gmail.com> <4745B044.5090102@campus.iztacala.unam.mx>
Message-ID: <47D0EC6F-C34A-4AA8-97EE-478F2A5ADF62@uiuc.edu>

I think SeqIO checks the name for parsing reasons only, in cases where
the format changes based on the source (such as GenPept DBSOURCE
data). I don't think we go beyond that in Bioperl, probably b/c
modifying or expanding names for data persistence would lead to
volatile coding issues (i.e. consistency between parsers, constant
updating to cover new crossrefs, etc).

I would definitely suggest retaining the original DB as it appears in
the dbxref for consistency/sanity; if needed return expanded names
using a different method if they are designated.

chris

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From biopython at maubp.freeserve.co.uk Sat Nov 24 09:16:49 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 24 Nov 2007 09:16:49 +0000
Subject: [BioPython] [BioSQL-l] BioSQL : GenBank db_xref names in dbxref table
In-Reply-To: <47D0EC6F-C34A-4AA8-97EE-478F2A5ADF62@uiuc.edu>
References: <320fb6e00711201136i6b3ca41eo8f6718e98f79c531@mail.gmail.com> <4745B044.5090102@campus.iztacala.unam.mx> <47D0EC6F-C34A-4AA8-97EE-478F2A5ADF62@uiuc.edu>
Message-ID: <320fb6e00711240116g4819fc81g202fda35801f19f2@mail.gmail.com>

Thank you Chris and Mauricio,

On 11/23/07, Chris Fields wrote:
> I think [BioPerl's] SeqIO checks the name for parsing reasons only, in
> cases where the format changes based on the source (such as GenPept
> DBSOURCE data). I don't think we go beyond that in Bioperl, probably
> b/c modifying or expanding names for data persistence would lead to
> volatile coding issues (i.e. consistency between parsers, constant
> updating to cover new crossrefs, etc).

And in Biopython's case, we get annoying warnings if it hasn't seen
the term before!
Which is why I filed Biopython bug 2405 in the first place :)
http://bugzilla.open-bio.org/show_bug.cgi?id=2405

> I would definitely suggest retaining the original DB as it appears in
> the dbxref for consistency/sanity; if needed return expanded names
> using a different method if they are designated.

Sounds good to me.

Peter

From bjoern.thorwirth at uni-due.de Tue Nov 27 08:13:35 2007
From: bjoern.thorwirth at uni-due.de (Björn Thorwirth)
Date: Tue, 27 Nov 2007 09:13:35 +0100
Subject: [BioPython] NCBIXML
Message-ID: <1196151215.3128.1.camel@mistery>

Hello List!

Today I got some trouble with the NCBIXML module. I've tested my code on a 32-bit machine with Biopython-1.43, where it worked flawlessly. On 64-bit I got this error with Biopython-1.43 / 1.44:

  File "/usr/lib/python2.4/site-packages/twisted/internet/threads.py", line 25, in _putResultInDeferred
    result = f(*args, **kwargs)
  File "/home/user/workspace/PLGDaemon/src/mmCIF/blast_util.py", line 56, in getResults
    for record in records:
  File "/home/user/Desktop/biopython/biopython-1.43/build/lib.linux-x86_64-2.4/Bio/Blast/NCBIXML.py", line 625, in parse
    assert len(blast_parser._records) == 0
exceptions.UnboundLocalError: local variable 'blast_parser' referenced before assignment

And here is the code which calls the NCBIXML module:

def getResults(self,resultFileHandle,err_handle):
    try:
        records=NCBIXML.parse(resultFileHandle)
    except Exception,e:
        self.IoErrorHandler(e, resultFileHandle, err_handle)
        raise
    bestScore=None
    bestExpect=None
    bestRes=None
    results=[]
    if records:
        for record in records:
            for alignment in record.alignments:
                resRec={}
                resRec['title']=alignment.title
                resRec['length']=alignment.length
                for hsp in alignment.hsps:
                    resRec['score']=hsp.score
                    resRec['expect']=hsp.expect
                    resRec['subj']=hsp.sbjct
                    resRec['obj']=hsp.query
                    resRec['match']=hsp.match
                    if self.debug:
                        print 'alignment.hsp:'
                        print hsp.score,hsp.expect,hsp.sbjct, hsp.match, hsp.query
                    results.append(resRec)
                    if bestScore==None:
                        bestScore=hsp.score
                        bestExpect=hsp.expect
                        bestRes=len(results)-1
                    elif hsp.score>bestScore:
                        bestScore=hsp.score
                        bestExpect=hsp.expect
                        bestRes=len(results)-1

Does someone have an idea? Should I just catch the error?

Best regards,
Björn Thorwirth

From mdehoon at c2b2.columbia.edu Tue Nov 27 12:25:00 2007
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Tue, 27 Nov 2007 21:25:00 +0900
Subject: [BioPython] NCBIXML
In-Reply-To: <1196151215.3128.1.camel@mistery>
References: <1196151215.3128.1.camel@mistery>
Message-ID: <474C0C9C.70204@c2b2.columbia.edu>

Can you create a minimal code example that shows this bug?
From the Python traceback, it appears that the error does not occur in NCBIXML but some place else. It would be good to isolate this bug to find out where exactly the problem lies.

--Michiel.

Björn Thorwirth wrote:
> Hello List!
>
> Today I got some trouble with the NCBIXML module. I've tested my code on
> a 32-bit machine with Biopython-1.43, where it worked flawlessly. On 64-bit
> I got this error with Biopython-1.43 / 1.44:
>
>

From bjoern.thorwirth at uni-due.de Tue Nov 27 13:51:22 2007
From: bjoern.thorwirth at uni-due.de (Björn Thorwirth)
Date: Tue, 27 Nov 2007 14:51:22 +0100
Subject: [BioPython] NCBIXML - Blast in- and output
Message-ID: <1196171482.6683.19.camel@mistery>

Hi Michiel!

Thanks for your fast response!
I've used NCBIXML together with a Twisted server. That's why the backtrace is a bit bloated.
But I guess these are the important lines:

  File "/home/user/workspace/PLGDaemon/src/mmCIF/blast_util.py", line 56, in getResults
    for record in records:
  File "/home/user/Desktop/biopython/biopython-1.43/build/lib.linux-x86_64-2.4/Bio/Blast/NCBIXML.py", line 625, in parse
    assert len(blast_parser._records) == 0
exceptions.UnboundLocalError: local variable 'blast_parser' referenced before assignment.

I was now able to trace where the problem comes from. It may not be related to 32/64-bit. It happens when Blast's calculation of the "Karlin-Altschul parameters" fails. This may happen due to low complexity of the query sequence (see Blast FAQ). I've attached a tar with the Blast output and the reference and input files. I didn't stumble over this problem before on 32-bit, because that was a smaller sequence DB for testing purposes. Björn On Tuesday, 27.11.2007, at 21:25 +0900, Michiel de Hoon wrote: > Can you create a minimal code example that shows this bug? > From the Python traceback, it appears that the error does not occur in > NCBIXML but some place else. It would be good to isolate this bug to > find out where exactly the problem lies. > > --Michiel. > > Björn Thorwirth wrote: > > Hello List! > > > > Today I had some trouble with the NCBIXML module. I tested my code on > > a > > 32-bit machine with Biopython-1.43, where it worked flawlessly. On 64-bit > > I > > got this error with Biopython-1.43 / 1.44: > > > > > -------------- next part -------------- A non-text attachment was scrubbed...
Name: BlastError.tar.gz Type: application/x-compressed-tar Size: 630 bytes Desc: not available URL: From biopython at maubp.freeserve.co.uk Tue Nov 27 16:59:18 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 27 Nov 2007 16:59:18 +0000 Subject: [BioPython] NCBIXML - Blast in- and output In-Reply-To: <1196171482.6683.19.camel@mistery> References: <1196171482.6683.19.camel@mistery> Message-ID: <320fb6e00711270859y228cc731gc3ced0f21d9d624@mail.gmail.com> Hi again Björn, > I was now able to trace where the problem comes from. It may not be > related to 32/64-bit. It happens when Blast's calculation > of the "Karlin-Altschul parameters" fails. This may happen due to low > complexity of the query sequence (see Blast FAQ). > I've attached a tar with the Blast output and the reference and input > files. I didn't stumble over this problem before on 32-bit, > because that was a smaller sequence DB for testing purposes. It does sound like it's nothing to do with 32-bit versus 64-bit. Could you file a bug and then attach the following files:

- reference.fasta
- input2.fasta
- output XML file (or say if it's empty)

Using Blast 2.2.16, I got an empty output file (with messages "Could not calculate ungapped Karlin-Altschul parameters due to an invalid query sequence or its translation. Please verify the query sequence(s) and/or filtering options" on the error stream). Using Blast 2.2.10, I got XML output including messages like "BlastKarlinBlkGappedCalc: Gap existence and extension values of -1 and -1 not supported for PAM250" (also on the error output). This was on a 64-bit Linux machine. You haven't said what version of standalone Blast you have - as you can see, it does make a difference.
Peter From biopython at maubp.freeserve.co.uk Tue Nov 27 10:27:43 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 27 Nov 2007 10:27:43 +0000 Subject: [BioPython] NCBIXML In-Reply-To: <1196151215.3128.1.camel@mistery> References: <1196151215.3128.1.camel@mistery> Message-ID: <474BF11F.9050802@maubp.freeserve.co.uk> Björn Thorwirth wrote: > Hello List! > > Today I had some trouble with the NCBIXML module. I tested my code on > a 32-bit machine with Biopython-1.43, where it worked flawlessly. On 64-bit > I got this error with Biopython-1.43 / 1.44: Hi Björn. That problem does sound odd. The NCBI XML parser is pure Python, so I wouldn't have expected any problems with 32 vs 64 bit. What else is different between the machines? e.g. operating system, version of Python. Also, where is the XML file coming from? If it's standalone Blast, could you check and tell us the version on each machine? You could also try running the test suite (included in the source for Biopython) to see if that shows any difference between the two machines. Also, to try and reproduce your parsing error, could you supply an example Blast XML file that illustrates the problem (works on the 32-bit computer, but not on the 64-bit computer)? The best way would be to file a bug, and then attach the test case (two steps), rather than trying to send an attachment on the mailing list. Thanks, Peter From mdehoon at c2b2.columbia.edu Wed Nov 28 11:54:39 2007 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Wed, 28 Nov 2007 20:54:39 +0900 Subject: [BioPython] NCBIXML - Blast in- and output In-Reply-To: <1196171482.6683.19.camel@mistery> References: <1196171482.6683.19.camel@mistery> Message-ID: <474D56FF.4020807@c2b2.columbia.edu> Dear Björn, Did you look at the Blast output file you are trying to parse? It consists of lines like: [blastall] WARNING: 1OR8:E|PDBID|CHAIN|SEQUENCE: Could not calculate ungapped Karlin-Altschul parameters due to an invalid query sequence or its translation.
Please verify the query sequence(s) and/or filtering options and there is no actual Blast output. So I am not surprised the Blast parser fails... --Michiel. Björn Thorwirth wrote: > Hi Michiel! > > Thanks for your fast response! I've used NCBIXML together with a > Twisted server. That's why the backtrace is a bit bloated. > But I guess these are the important lines: > > File "/home/user/workspace/PLGDaemon/src/mmCIF/blast_util.py", line 56, in getResults > for record in records: > File "/home/user/Desktop/biopython/biopython-1.43/build/lib.linux-x86_64-2.4/Bio/Blast/NCBIXML.py", line 625, in parse > assert len(blast_parser._records) == 0 > exceptions.UnboundLocalError: local variable 'blast_parser' referenced > before assignment. > > I was now able to trace where the problem comes from. It may not be > related to 32/64-bit. It happens when Blast's calculation > of the "Karlin-Altschul parameters" fails. This may happen due to low > complexity of the query sequence (see Blast FAQ). > I've attached a tar with the Blast output and the reference and input > files. I didn't stumble over this problem before on 32-bit, > because that was a smaller sequence DB for testing purposes. > > Björn From bjoern.thorwirth at uni-due.de Wed Nov 28 12:50:54 2007 From: bjoern.thorwirth at uni-due.de (Björn Thorwirth) Date: Wed, 28 Nov 2007 13:50:54 +0100 Subject: [BioPython] NCBIXML - Blast in- and output In-Reply-To: <474D56FF.4020807@c2b2.columbia.edu> References: <1196171482.6683.19.camel@mistery> <474D56FF.4020807@c2b2.columbia.edu> Message-ID: <1196254254.2387.4.camel@mistery> Hi Michiel! I've filed a bug (#2412) as Peter suggested. On Wednesday, 28.11.2007, at 20:54 +0900, Michiel de Hoon wrote: > Dear Björn, > > Did you look at the Blast output file you are trying to parse? > It consists of lines like: I did look at it. That is the error output; the XML file is just empty. > > [blastall] WARNING: 1OR8:E|PDBID|CHAIN|SEQUENCE: Could not calculate > ungapped Karlin-Altschul parameters due to an invalid query sequence or > its translation. Please verify the query sequence(s) and/or filtering > options > > and there is no actual Blast output. So I am not surprised the Blast > parser fails... > But shouldn't that be caught? And by the way, in my first post to the list, I asked whether I should just catch the exception. In my code I called NCBIXML like this:

    try:
        records=NCBIXML.parse(resultFileHandle)
    except Exception,e:
        self.IoErrorHandler(e, resultFileHandle, err_handle)
        raise
    ...
    if records:
        for record in records:

That's where I got the exception. For me it is OK: I've just added an exception handler around the "for record" loop, and everything is fine. But I thought I should get the exception at initialization. Best regards, and sorry for any inconvenience, Björn > --Michiel. > > Björn Thorwirth wrote: > > Hi Michiel! > > > > Thanks for your fast response! I've used NCBIXML together with a > > Twisted server. That's why the backtrace is a bit bloated. > > But I guess these are the important lines: > > > > File "/home/user/workspace/PLGDaemon/src/mmCIF/blast_util.py", line 56, in getResults > > for record in records: > > File "/home/user/Desktop/biopython/biopython-1.43/build/lib.linux-x86_64-2.4/Bio/Blast/NCBIXML.py", line 625, in parse > > assert len(blast_parser._records) == 0 > > exceptions.UnboundLocalError: local variable 'blast_parser' referenced > > before assignment. > > > > I was now able to trace where the problem comes from. It may not be > > related to 32/64-bit. It happens when Blast's calculation > > of the "Karlin-Altschul parameters" fails. This may happen due to low > > complexity of the query sequence (see Blast FAQ). > > I've attached a tar with the Blast output and the reference and input > > files.
I didn't stumble over this problem before on 32-bit, > > because that was a smaller sequence DB for testing purposes. > > > > Björn From mdehoon at c2b2.columbia.edu Thu Nov 29 00:02:17 2007 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Thu, 29 Nov 2007 09:02:17 +0900 Subject: [BioPython] NCBIXML - Blast in- and output In-Reply-To: <1196254254.2387.4.camel@mistery> References: <1196171482.6683.19.camel@mistery> <474D56FF.4020807@c2b2.columbia.edu> <1196254254.2387.4.camel@mistery> Message-ID: <474E0189.6080005@c2b2.columbia.edu> Björn Thorwirth wrote: > But shouldn't that be caught? And by the way, in my first post to the > list, I asked whether I should just catch the exception. > In my code I called NCBIXML like this: > try: > records=NCBIXML.parse(resultFileHandle) > except Exception,e: > self.IoErrorHandler(e, resultFileHandle, err_handle) > raise > ... > if records: > for record in records: > That's where I got the exception. For me it is OK: I've just added an > exception handler around the "for record" loop, and everything is fine. > But I thought I should get the exception at initialization. > The NCBIXML.parse call does not actually parse the file; it just sets up the parser. The actual parsing is done when you call records.next(), which is done implicitly in your for-loop. This approach allows NCBIXML.parse to be used also for very large Blast output files, which cannot be kept in memory as a whole. So the exception handler should be around the for-loop, not around the parse call. --Michiel. From karin.lagesen at medisin.uio.no Fri Nov 30 11:57:04 2007 From: karin.lagesen at medisin.uio.no (Karin Lagesen) Date: Fri, 30 Nov 2007 12:57:04 +0100 Subject: [BioPython] ambiguous alphabets and alignments Message-ID: Hello. I have used Biopython on and off, and found it very good. I have now, however, encountered an odd problem which I hope you can help me with.
I am working with alignments, and I do this:

>>> from Bio import Clustalw
>>> from Bio.Align import AlignInfo
>>> from Bio.Alphabet import IUPAC
>>> alignment = Clustalw.parse_file("align16S/AE000511_16S.aln", alphabet=IUPAC.IUPACAmbiguousDNA)
>>> summary_aln = AlignInfo.SummaryInfo(alignment)
>>> pssm = summary_aln.pos_specific_score_matrix()
Traceback (most recent call last):
  File "", line 1, in ?
  File "/usr/local/python/lib/python2.4/site-packages/Bio/Align/AlignInfo.py", line 368, in pos_specific_score_matrix
  File "/usr/local/python/lib/python2.4/site-packages/Bio/Align/AlignInfo.py", line 111, in dumb_consensus
  File "/usr/local/python/lib/python2.4/site-packages/Bio/Align/AlignInfo.py", line 203, in _guess_consensus_alphabet
ValueError: Could not determine the type of alphabet.
>>>

Now, to test what alphabet I am dealing with, I use code from SummaryInfo:

>>> from Bio import Alphabet
>>> from Bio.Alphabet import IUPAC
>>> from Bio.Seq import Seq
>>> isinstance(summary_aln.alignment._records[0].seq.alphabet.alphabet, Alphabet.DNAAlphabet)
False
>>> summary_aln.alignment._records[0].seq.alphabet.alphabet
>>>

However, when I check the Alphabet class:

class IUPACAmbiguousDNA(Alphabet.DNAAlphabet):
    letters = IUPACData.ambiguous_dna_letters

it seems like the alphabet I load the alignment with is an extension of DNAAlphabet; however, the isinstance check still fails. I am pretty sure that this is somehow a misunderstanding on my side, but I cannot figure this one out. Thank you for your help!
Karin -- Karin Lagesen, PhD student karin.lagesen at medisin.uio.no http://folk.uio.no/karinlag From biopython at maubp.freeserve.co.uk Fri Nov 30 15:14:16 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 30 Nov 2007 15:14:16 +0000 Subject: [BioPython] ambiguous alphabets and alignments In-Reply-To: References: Message-ID: <320fb6e00711300714n6c5afc8r78c5949ecfb3104b@mail.gmail.com> > >>> from Bio import Clustalw > >>> from Bio.Align import AlignInfo > >>> from Bio.Alphabet import IUPAC > >>> alignment = Clustalw.parse_file("align16S/AE000511_16S.aln", alphabet=IUPAC.IUPACAmbiguousDNA) > >>> summary_aln = AlignInfo.SummaryInfo(alignment) > >>> pssm = summary_aln.pos_specific_score_matrix() I think the problem is you are giving Clustalw an alphabet class rather than an instance of the class. I am not at a computer with Biopython installed right now, but I would guess you need to change one line subtly: alignment = Clustalw.parse_file("align16S/AE000511_16S.aln", alphabet=IUPAC.IUPACAmbiguousDNA()) Peter From karin.lagesen at medisin.uio.no Fri Nov 30 15:41:28 2007 From: karin.lagesen at medisin.uio.no (Karin Lagesen) Date: Fri, 30 Nov 2007 16:41:28 +0100 Subject: [BioPython] ambiguous alphabets and alignments In-Reply-To: <320fb6e00711300714n6c5afc8r78c5949ecfb3104b@mail.gmail.com> (biopython@maubp.freeserve.co.uk's message of "Fri, 30 Nov 2007 15:14:16 +0000") References: <320fb6e00711300714n6c5afc8r78c5949ecfb3104b@mail.gmail.com> Message-ID: Peter writes: >> >>> from Bio import Clustalw >> >>> from Bio.Align import AlignInfo >> >>> from Bio.Alphabet import IUPAC >> >>> alignment = Clustalw.parse_file("align16S/AE000511_16S.aln", alphabet=IUPAC.IUPACAmbiguousDNA) >> >>> summary_aln = AlignInfo.SummaryInfo(alignment) >> >>> pssm = summary_aln.pos_specific_score_matrix() > > I think the problem is you are giving Clustalw an alphabet class > rather than an instance of the class. 
I am not at a computer with > Biopython installed right now, but I would guess you need to change > one line subtly: > > alignment = Clustalw.parse_file("align16S/AE000511_16S.aln", > alphabet=IUPAC.IUPACAmbiguousDNA()) Thanks! That was exactly the problem. It just didn't strike me that this could be the problem:) Karin -- Karin Lagesen, PhD student karin.lagesen at medisin.uio.no http://folk.uio.no/karinlag
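Note on the thread above: the class-versus-instance mix-up Peter describes can be reproduced without Biopython. The following is only a minimal sketch under stated assumptions: DNAAlphabet, AmbiguousDNA and is_dna are hypothetical stand-ins written for illustration, not the real Bio.Alphabet code, and the letters string is just sample data. It shows why an isinstance check returns False when the class object itself is passed, and True once the class is called to create an instance.

```python
# Hypothetical stand-ins for the alphabet classes discussed in the thread.
# These are NOT the real Biopython classes; they only mirror the inheritance.
class DNAAlphabet(object):
    """Stand-in for a generic DNA alphabet base class."""


class AmbiguousDNA(DNAAlphabet):
    """Stand-in for a subclass such as IUPAC.IUPACAmbiguousDNA."""
    letters = "GATCRYWSMKHBVDN"  # sample data for illustration only


def is_dna(alphabet):
    """Mimic a library check that expects an alphabet *instance*."""
    return isinstance(alphabet, DNAAlphabet)


# Passing the class object itself: it is not an instance of DNAAlphabet.
passed_class = is_dna(AmbiguousDNA)

# Passing an instance (note the parentheses): the check succeeds.
passed_instance = is_dna(AmbiguousDNA())

# The subclass relationship itself holds; issubclass tests that instead.
is_subclass = issubclass(AmbiguousDNA, DNAAlphabet)

print("class passed:    %s" % passed_class)     # False
print("instance passed: %s" % passed_instance)  # True
print("issubclass:      %s" % is_subclass)      # True
```

This matches what Karin observed: the class definition does derive from the DNA base class, yet isinstance fails until the alphabet is instantiated, as in alphabet=IUPAC.IUPACAmbiguousDNA().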