From idoerg at gmail.com Mon Mar 3 21:07:15 2008
From: idoerg at gmail.com (Iddo Friedberg)
Date: Mon, 3 Mar 2008 18:07:15 -0800
Subject: [BioPython] GenBank writer?
Message-ID:

Hi all,

Does Biopython have a GenBank writer? SeqIO does not seem to have one.

Forgive the basic question. I'm a structure person lately turned sequence
analyzer....

Iddo

--
Iddo Friedberg, Ph.D.
CALIT2, mail code 0440
University of California, San Diego
9500 Gilman Drive
La Jolla, CA 92093-0440, USA
T: +1 (858) 534-0570
T: +1 (858) 646-3100 x3516
http://iddo-friedberg.org

From biopython at maubp.freeserve.co.uk Tue Mar 4 04:30:21 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 4 Mar 2008 09:30:21 +0000
Subject: [BioPython] GenBank writer?
In-Reply-To:
References:
Message-ID: <320fb6e00803040130i419949f3od9b65bcecb9b6758@mail.gmail.com>

On Tue, Mar 4, 2008 at 2:07 AM, Iddo Friedberg wrote:
> Hi all,
>
> Does biopython have a GenBank writer? SeqIO does not seem to have that.
>
> Forgive the basic question. I'm a structure person lately turned sequence
> analyzer....
>
> Iddo

I've got bits of one for SeqIO, unfinished. There are a lot of tricky
little problems when trying to write a SeqRecord which didn't
originally come from a GenBank or EMBL file - e.g. identifiers that
are too long. I do still want to finish this, and I should have more
free time in a couple of months.

You could try using the GenBank Record objects (not SeqRecord
objects), and writing them to file.

Peter

From lueck at ipk-gatersleben.de Tue Mar 4 04:48:26 2008
From: lueck at ipk-gatersleben.de (Stefanie Lück)
Date: Tue, 4 Mar 2008 10:48:26 +0100
Subject: [BioPython] Genbank writer (code)
Message-ID: <003801c87ddc$e4f15f00$1022a8c0@ipkgatersleben.de>

Hi!

I wrote a small class which I use for writing GenBank files. If anyone is
interested, I can send it. At the moment it's adjusted for my purpose, but
you can change it. It's very simple line-by-line writing, but it works.
Since I'm a beginner, it would be helpful to get corrections from users.

Regards
Stefanie

From biopython at maubp.freeserve.co.uk Tue Mar 4 05:30:24 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 4 Mar 2008 10:30:24 +0000
Subject: [BioPython] Genbank writer (code)
In-Reply-To: <003801c87ddc$e4f15f00$1022a8c0@ipkgatersleben.de>
References: <003801c87ddc$e4f15f00$1022a8c0@ipkgatersleben.de>
Message-ID: <320fb6e00803040230y5d0831e6yf8f1a98a0a38ce3f@mail.gmail.com>

On Tue, Mar 4, 2008 at 9:48 AM, Stefanie Lück wrote:
> Hi!
>
> I wrote a small class, which I use for writing Genbank Files. If someone
> is interested, I can send it. At the moment it's adjusted for my purpose
> but you can change it. It's a very simple line by line writing, but it
> works. Since I'm a beginner, it would be helpful to get corrections from
> users.
>
> Regards
> Stefanie

Could you add the files to Bug 2294?
http://bugzilla.open-bio.org/show_bug.cgi?id=2294

I did make a start at integrating Howard's code - some of the issues
involve tweaking the parser to record more information in the
SeqRecord. Then there is the whole issue of mapping other file
formats to GenBank and vice versa, and validating our output in other
tools (e.g. EMBOSS or BioPerl).

Thanks,

Peter

From ericgibert at yahoo.fr Fri Mar 7 23:13:49 2008
From: ericgibert at yahoo.fr (Eric Gibert)
Date: Sat, 8 Mar 2008 12:13:49 +0800
Subject: [BioPython] Bio.EUtils
In-Reply-To: <8786.65209.qm@web62404.mail.re1.yahoo.com>
References: <8786.65209.qm@web62404.mail.re1.yahoo.com>
Message-ID: <009c01c880d2$e070b770$0202fea9@Gecko>

Dear Michiel,

I use DBIdsClient from Bio.EUtils to fetch data from the NCBI taxonomy
database, to update the tables "taxon" and "taxon_name" in my BioSQL
database with the NCBI data.

I added a one-line definition in /Bio/EUtils/Config.py to declare that db:

    Databases.TAXONOMY = _add_db(DatabaseInfo("taxonomy", 1))

Then I can use client.search and result.efetch without problem.

Eric

-----Original Message-----
From: biopython-bounces at lists.open-bio.org
[mailto:biopython-bounces at lists.open-bio.org] On Behalf Of Michiel de Hoon
Sent: Friday, January 25, 2008 9:05 PM
To: biopython at biopython.org
Subject: [BioPython] Bio.EUtils

Hello everybody,

I am looking at the various ways Biopython interacts with NCBI's Entrez
search engine, and would like, if possible, to organize and document this
a bit more. Currently there are several modules that interact with Entrez.
The most extensive one is Bio.EUtils, but there are also simpler modules
such as Bio.WWW.NCBI. I was wondering:

1) Is anybody using Bio.EUtils?
2) If so, could you give an example script that uses Bio.EUtils? That way
   we can get an idea of the amount of overlap between Bio.EUtils and
   Bio.WWW.NCBI and others.

Thanks!

--Michiel.

_______________________________________________
BioPython mailing list - BioPython at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython

From mjldehoon at yahoo.com Sun Mar 9 09:13:05 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sun, 9 Mar 2008 06:13:05 -0700 (PDT)
Subject: [BioPython] Bio.EUtils
In-Reply-To: <009c01c880d2$e070b770$0202fea9@Gecko>
Message-ID: <738226.56379.qm@web62406.mail.re1.yahoo.com>

Dear Eric,

Thank you for your reply.
For esearch and efetch, can you use Bio.WWW.NCBI instead? Something like:

>>> from Bio.WWW import NCBI
>>> handle = NCBI.esearch(db="taxonomy", term="your_search_term")
>>> result = handle.read() # Gets the search results in XML format.

and similar for NCBI.efetch. Or does Bio.EUtils parse the XML results?
One caveat: Bio.WWW.NCBI will be updated and renamed Bio.Entrez in
Biopython 1.45.

Thanks,

--Michiel.

From florian.koelling at tu-bs.de Mon Mar 10 13:15:58 2008
From: florian.koelling at tu-bs.de (Florian Koelling)
Date: Mon, 10 Mar 2008 18:15:58 +0100
Subject: [BioPython] Trying to extract a (single) ligand from a PDB file (dimere)
Message-ID: <47D56CCE.80207@tu-bs.de>

Hi folks!

I'm new to Biopython and tried to write a parser to extract a ligand from
its PDB file. It even works, but I've got a dimer with two ligands and I'd
like to have a new PDB file containing only one ligand. My attempts to
access the residue (the ligand) via the residue id failed...

Hope you can help me.

My code so far :-((((

#### LIGAND DETECTION

import Bio.PDB
from Bio.PDB import *
import Numeric

parser = PDBParser()
structure = parser.get_structure('s', '2F1G.pdb')
print "\n"
print "structure_object_created"

requirement_id_list = []
requirement_names_list = []
residue = structure.get_residues()

for i in residue:
    residue_id = i.get_id()
    if residue_id[0] != 'W':                            # parsing water
        if residue_id[0] != ' ':                        # parsing
            resname = (residue_id[0]).strip('H_')       # remove het_flag
            requirement_names_list.append(resname)      # RES NAME
            requirement_id_list.append(residue_id[1])   # RES ID

### Req_lists: requirements for writing
print '\n'
print requirement_names_list, 'ligands_Names-> Hetatms '
print requirement_id_list, 'ligands_IDs-> Hetatms '
print '\n'

if len(requirement_names_list) == 0:
    print 'no_hetatms found!'
    assert()

class HetatmSelect(Select):
    def accept_residue(self, residue):
        if residue.get_resname() in requirement_names_list[0]:  # writing 1st_element
            return 1
        else:
            return 0

io = PDBIO()   ### WRITING
io.set_structure(structure)
io.save('ligand_parsed.pdb', HetatmSelect())
print "writing parsed pdb file .... Done"

Thanks a lot!

Florian

From jswetnam at gmail.com Mon Mar 10 18:42:47 2008
From: jswetnam at gmail.com (James Swetnam)
Date: Mon, 10 Mar 2008 18:42:47 -0400
Subject: [BioPython] import error
Message-ID:

Hello. First of all, thank you to the developers for the great
resource. It's only with a sense of regret that I post my first
message to this list as a complaint!

I'm doing some rather complicated (to me, at least) database
manipulation with the help of the excellent ORM provided by the
Biopython project. I think this is unimportant, but I can provide the
full source code if asked.

The problem I have is with the following line I have included in one of
my source files, copied from one of the Biopython tutorials:

    ncbi_dict = GenBank.NCBIDictionary("nucleotide", "genbank")

This line, both in isolation and in the context of my script, produces
the rather terse import error:

cardozo13:Bio james$ python
Python 2.5.1 (r251:54863, Jan 10 2008, 15:27:44)
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import GenBank
>>> ncbi_dict = GenBank.NCBIDictionary("nucleotide","genbank")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/sw/lib/python2.5/site-packages/Bio/GenBank/__init__.py", line 1283, in __init__
    from Bio import db
ImportError: cannot import name db

The import error is actually occurring inside one of the stock files
provided by the Biopython distribution I am using (installed via fink).

Can anyone help me to resolve this?

-----------------------------------------------
James Swetnam
Research Technician
NYU School of Medicine
Pharmacology
-----------------------------------------------

From peter at maubp.freeserve.co.uk Mon Mar 10 19:54:27 2008
From: peter at maubp.freeserve.co.uk (Peter)
Date: Mon, 10 Mar 2008 23:54:27 +0000
Subject: [BioPython] import error
In-Reply-To:
References:
Message-ID: <320fb6e00803101654y2db4cf63lbc1c8a8a1c8e8f4@mail.gmail.com>

On Mon, Mar 10, 2008 at 10:42 PM, James Swetnam wrote:
> Hello. First of all, thank you to the developers for the great
> resource. It's only with a sense of regret that I post my first
> message to this list as a complaint!
>
> ncbi_dict = GenBank.NCBIDictionary("nucleotide", "genbank")
>
> This line, both in isolation and in the context of my script, produces
> the rather terse import error:

Unfortunately you have fallen foul of a known issue in Biopython 1.44
(bug 2393), which has since been fixed:
http://bugzilla.open-bio.org/show_bug.cgi?id=2393

If you can hold on a few weeks we do hope to have Biopython 1.45 out
"shortly". You could update to CVS, or perhaps most simply I can
suggest you use the one line fix in the NCBIDictionary class, see:
http://bugzilla.open-bio.org/show_bug.cgi?id=2393#c2

Peter

From peter at maubp.freeserve.co.uk Tue Mar 11 13:35:00 2008
From: peter at maubp.freeserve.co.uk (Peter)
Date: Tue, 11 Mar 2008 17:35:00 +0000
Subject: [BioPython] import error
In-Reply-To: <95AD2738-8BC6-4E62-B4F4-3E7E0877BFD2@gmail.com>
References: <320fb6e00803101654y2db4cf63lbc1c8a8a1c8e8f4@mail.gmail.com>
 <95AD2738-8BC6-4E62-B4F4-3E7E0877BFD2@gmail.com>
Message-ID: <320fb6e00803111035p1a5020f4n7dc71b3c036ff2a8@mail.gmail.com>

On Tue, Mar 11, 2008 at 4:27 PM, James Swetnam wrote:
> Would anyone have any problem with me updating the BioSQL tutorial to
> reflect the fact that BioSQL is on svn and not CVS? I got burned by a
> bug that had been fixed in both the stable 1.0 release and current svn
> version, but which lurked unsquashed in the (deprecated) CVS.

Updating the documentation is always a good idea - which bit of it did
you mean?

I've just updated the wiki page http://biopython.org/wiki/BioSQL as it
certainly shouldn't talk about BioSQL in CVS anymore. We could update
it to get their latest code from SVN, but I thought it would probably
be simpler to use the standard downloads now that BioSQL 1.0 has been
released. If you could proofread the new page, and fix any little
things, that would be great.

When Biopython moves from CVS to SVN (which is imminent) we'll have to
check all the documentation again...

Peter

From peter at maubp.freeserve.co.uk Tue Mar 11 14:37:12 2008
From: peter at maubp.freeserve.co.uk (Peter)
Date: Tue, 11 Mar 2008 18:37:12 +0000
Subject: [BioPython] import error
In-Reply-To: <77666E7A-7E6C-4150-9C24-DEC92D38272B@gmail.com>
References: <320fb6e00803101654y2db4cf63lbc1c8a8a1c8e8f4@mail.gmail.com>
 <95AD2738-8BC6-4E62-B4F4-3E7E0877BFD2@gmail.com>
 <320fb6e00803111035p1a5020f4n7dc71b3c036ff2a8@mail.gmail.com>
 <77666E7A-7E6C-4150-9C24-DEC92D38272B@gmail.com>
Message-ID: <320fb6e00803111137v1615f400g69159d04a6a9f3b4@mail.gmail.com>

On Tue, Mar 11, 2008 at 5:46 PM, James Swetnam wrote:
> Peter
>
> I agree with your reasoning-- a link to the 1.0 release is a better
> idea. The tutorial looks fine. Using a passwordless root login makes
> my heart skip a beat, but I suppose it's not your place to teach
> people mysql best practices, and for local testing and sandbox
> databases it's fine.
>
> James

I wasn't so sure about the blank password when I wrote that, but I was
working from what existing documentation I could find and wanted to
keep things simple. If you have some links to a good practice guide,
it wouldn't hurt to add a short passage to the wiki page.

> I should note for the record that the 'Basic BioSQL with Biopython'
> tutorial also mentions the CVS version of the BioSQL schema
>
> http://biopython.org/DIST/docs/biosql/python_biosql_basic.html

The LaTeX source for that document actually seems to live in the
BioSQL repository, for which I don't think I have access rights. I'll
have to raise that on the BioSQL mailing list... See e.g.
http://code.open-bio.org/svnweb/index.cgi/biosql/browse/biosql-schema/trunk/doc/biopython/

Peter

P.S. Did you mean to send your emails to me only, and not the mailing list?

From jswetnam at gmail.com Tue Mar 11 18:21:05 2008
From: jswetnam at gmail.com (James Swetnam)
Date: Tue, 11 Mar 2008 18:21:05 -0400
Subject: [BioPython] import error
In-Reply-To: <320fb6e00803111035p1a5020f4n7dc71b3c036ff2a8@mail.gmail.com>
References: <320fb6e00803101654y2db4cf63lbc1c8a8a1c8e8f4@mail.gmail.com>
 <95AD2738-8BC6-4E62-B4F4-3E7E0877BFD2@gmail.com>
 <320fb6e00803111035p1a5020f4n7dc71b3c036ff2a8@mail.gmail.com>
Message-ID: <7A87EC3E-67A9-4D15-897E-2B58A1B7EEFF@gmail.com>

Hello.

First off, apologies if my problem has been resolved in a previous
mailing; the archives search on the OBF wiki is disabled. Also, it's
quite possible I'm doing something boneheaded, as I still consider
myself a fairly novice Python programmer. So apologies if I make you
read through this just to correct an indentation error or something
similar!

I'm trying to use the Biopython BioSQL bindings to populate a locally
served MySQL database with what I like to call 'chimeric' SeqRecord
objects. I take as a starting point a large, FASTA formatted file of
short, translated (~35AA) protein sequences from the LANL HIV Sequence
Database.

Every one of these LANL protein sequences is a subset of a longer
sequence available in GenBank. Each of the sequences I download thus
has an associated GenBank accession number. I'd like to combine the
specificity afforded by the LANL sequences with the 'meta' information
given by the GenBank files into one record for each translated protein
sequence. Thus, in very broad pseudocode, my procedure is as follows:

for every sequence in fasta formatted lanl file
    get the genbank number
    grab the genbank file and parse into a SeqRecord
    replace the Seq object in the genbank SeqRecord with the LANL
        protein sequence
    let Biopython do its magic and populate my biosql database with
        my chimeric SeqRecord
...
Profit!

The entire procedure is rather short, thanks to the developers' hard
work and the magic of abstraction.
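
The pseudocode above can be sketched in plain Python roughly as follows.
This is only an illustration of the data flow and not the script behind
the pastebin link: the Record dataclass is a hypothetical stand-in for
Biopython's SeqRecord, GENBANK_STUB is an in-memory stand-in for fetching
and parsing GenBank files, the example data is made up, and the FASTA
header layout (accession as the first token) is an assumption.

```python
from dataclasses import dataclass, field, replace


@dataclass
class Record:
    """Hypothetical stand-in for a Biopython SeqRecord."""
    id: str
    seq: str
    description: str = ""
    annotations: dict = field(default_factory=dict)


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def make_chimeras(fasta_text, genbank_lookup):
    """Pair each short protein with the metadata of its parent record."""
    for header, protein in parse_fasta(fasta_text):
        accession = header.split()[0]  # assumed: accession is the first token
        parent = genbank_lookup[accession]
        # Keep the parent's metadata, swap in the short protein sequence.
        yield replace(parent, seq=protein,
                      annotations=dict(parent.annotations))


# Made-up example data, for illustration only.
GENBANK_STUB = {
    "ACC0001": Record("ACC0001.1", "ATGAAAGGGTTT", "parent record one",
                      {"organism": "example virus"}),
}
LANL_FASTA = ">ACC0001 short translated fragment\nMKTAYIAKQR\nQISFVK\n"

chimeras = list(make_chimeras(LANL_FASTA, GENBANK_STUB))
for rec in chimeras:
    print(rec.id, rec.seq, rec.annotations)
```

In the real script the resulting objects would be SeqRecord instances
handed to db.load(), and the parent records are left untouched so the
same GenBank entry can be reused for several fragments.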

Here's the actual code:

http://pastebin.com/m118199fe

OK. Fine. But I'm getting an error when I do this, which originates
deep in the bowels of the MySQLdb library, which I'd rather not touch
without a lot more coffee than I have available.

-----------------------------
degas:v3_sequence_browser james$ ipython populate_database.py
/sw/lib/python2.5/site-packages/Bio/config/DBRegistry.py:149:
DeprecationWarning: Concurrent behavior has been deprecated, as this
functionality needs Bio.MultiProc, which itself has been deprecated.
If you need the concurrent behavior, please let the Biopython
developers know by sending an email to biopython-dev at biopython.org to
avoid permanent removal of this feature.
  DeprecationWarning)
---------------------------------------------------------------------------
<type 'exceptions.TypeError'>             Traceback (most recent call last)

/Users/james/src/v3_sequence_browser/populate_database.py in <module>()
     35
     36 db = server.new_database("v3")
---> 37 db.load(v3prod)
     38 server.adaptor.commit()
     39

/sw/lib/python2.5/site-packages/BioSQL/BioSeqDatabase.py in load(self, record_iterator)
    412                 break
    413             num_records += 1
--> 414             db_loader.load_seqrecord(cur_record)
    415
    416         return num_records

/sw/lib/python2.5/site-packages/BioSQL/Loader.py in load_seqrecord(self, record)
     28         """Load a Biopython SeqRecord into the database.
     29         """
---> 30         bioentry_id = self._load_bioentry_table(record)
     31         self._load_bioentry_date(record, bioentry_id)
     32         self._load_biosequence(record, bioentry_id)

/sw/lib/python2.5/site-packages/BioSQL/Loader.py in _load_bioentry_table(self, record)
    248                                    division,
    249                                    description,
--> 250                                    version))
    251         # now retrieve the id for the bioentry
    252         bioentry_id = self.adaptor.last_id('bioentry')

/sw/lib/python2.5/site-packages/BioSQL/BioSeqDatabase.py in execute(self, sql, args)
    275         """Just execute an sql command.
    276         """
--> 277         self.cursor.execute(sql, args or ())
    278
    279     def get_subseq_as_string(self, seqid, start, end):

/sw/lib/python2.5/site-packages/MySQLdb/cursors.py in execute(self, query, args)
    149             query = query.encode(charset)
    150         if args is not None:
--> 151             query = query % db.literal(args)
    152         try:
    153             r = self._query(query)

<type 'exceptions.TypeError'>: not all arguments converted during string formatting
WARNING: Failure executing file: <populate_database.py>

Any direct help or references are much appreciated.

James Swetnam
Research Technician
Department of Pharmacology
NYU School of Medicine

From biopython at maubp.freeserve.co.uk Tue Mar 11 19:16:43 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 11 Mar 2008 23:16:43 +0000
Subject: [BioPython] import error
In-Reply-To: <7A87EC3E-67A9-4D15-897E-2B58A1B7EEFF@gmail.com>
References: <320fb6e00803101654y2db4cf63lbc1c8a8a1c8e8f4@mail.gmail.com>
 <95AD2738-8BC6-4E62-B4F4-3E7E0877BFD2@gmail.com>
 <320fb6e00803111035p1a5020f4n7dc71b3c036ff2a8@mail.gmail.com>
 <7A87EC3E-67A9-4D15-897E-2B58A1B7EEFF@gmail.com>
Message-ID: <320fb6e00803111616g644485afxbed7592441beb0ce@mail.gmail.com>

On Tue, Mar 11, 2008 at 10:21 PM, James Swetnam wrote:
> Hello.
>
> First off, apologies if my problem has been resolved in a previous
> mailing; the archives search on the OBF wiki is disabled.

Hmm. I don't know anything about that - I've always just used Google
to search the mailing list.

> Also, it's quite possible i'm doing something boneheaded, as I still
> consider myself a fairly novice python programmer. So apologies if I
> make you read through this just to correct an indentation error or
> somethinig similar!
> ...

I haven't tried to reproduce this (which would be tricky without the
FASTA file), but my initial guess would be duplicate identifiers.
i.e. perhaps Biopython is failing to add one of the records because
its id clashes with an existing record in the database. It should be
fairly easy for you to check this...
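
One way to run such a check, sketched in plain Python: collect the ids
of the records about to be loaded and report any that occur more than
once. The function name and the example ids are made up for
illustration; with Biopython the list would come from something like
[rec.id for rec in records]. Note that this only catches clashes within
one batch, not clashes with records already stored in the database.

```python
from collections import Counter


def find_duplicate_ids(ids):
    """Return the ids that occur more than once, in first-seen order."""
    counts = Counter(ids)
    seen = set()
    duplicates = []
    for identifier in ids:
        if counts[identifier] > 1 and identifier not in seen:
            seen.add(identifier)
            duplicates.append(identifier)
    return duplicates


# Made-up ids: two short proteins cut from the same parent record
# would both inherit the parent's id.
example_ids = ["ACC0001.1", "ACC0002.1", "ACC0001.1"]
duplicates = find_duplicate_ids(example_ids)
print(duplicates)
```

An empty result would suggest the problem lies elsewhere.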

If I'm right, then the fix could be to assign the modified SeqRecord a
new id, maybe some combination of the parent GenBank id and a unique
count - or the substring location?

Keep in mind that our BioSQL code is currently a little bit fussy
about the formatting of the record id - it expects it to end in a
".version", see bug 2425.
http://bugzilla.open-bio.org/show_bug.cgi?id=2425

Peter

From jswetnam at gmail.com Tue Mar 11 19:58:51 2008
From: jswetnam at gmail.com (James Swetnam)
Date: Tue, 11 Mar 2008 19:58:51 -0400
Subject: [BioPython] import error
In-Reply-To: <320fb6e00803111616g644485afxbed7592441beb0ce@mail.gmail.com>
References: <320fb6e00803101654y2db4cf63lbc1c8a8a1c8e8f4@mail.gmail.com>
 <95AD2738-8BC6-4E62-B4F4-3E7E0877BFD2@gmail.com>
 <320fb6e00803111035p1a5020f4n7dc71b3c036ff2a8@mail.gmail.com>
 <7A87EC3E-67A9-4D15-897E-2B58A1B7EEFF@gmail.com>
 <320fb6e00803111616g644485afxbed7592441beb0ce@mail.gmail.com>
Message-ID:

For the mailing list problem I was referring to, at least, go to

http://biopython.org/wiki/Mailing_lists

There's a link at the bottom with the text 'Search the mailing list
archives.' Clicking it brings you to this page:

http://www.open-bio.org/wiki/Unavailable_Search_Open_Bio_Org

Best,
James

From biopython at maubp.freeserve.co.uk Tue Mar 11 20:20:41 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 12 Mar 2008 00:20:41 +0000
Subject: [BioPython] Searching the mailing list
Message-ID: <320fb6e00803111720k79d990f6lb0de95399f5f52a8@mail.gmail.com>

On Tue, Mar 11, 2008 at 11:58 PM, James Swetnam wrote:
> For the mailing list problem I was referring to, at least, go to
>
> http://biopython.org/wiki/Mailing_lists
>
> There's a link at the bottom with the text 'Search the mailing list
> archives.' Clicking it brings you to this page:
>
> http://www.open-bio.org/wiki/Unavailable_Search_Open_Bio_Org
>
> Best,
> James

Sorry if I wasn't clear. I meant I didn't know who at the OBF was
dealing with this - although looking at the wiki page's history (two
years ago) they seem to have forgotten about it! I'll enquire...

Peter

From florian.koelling at tu-bs.de Wed Mar 12 08:23:57 2008
From: florian.koelling at tu-bs.de (Florian Koelling)
Date: Wed, 12 Mar 2008 13:23:57 +0100
Subject: [BioPython] -->Problem using Bio.PDB "Dice"
Message-ID: <47D7CB5D.1070400@tu-bs.de>

Hi Folks!

I tried to grab the following line from a PDB file (in order to create
a new one) using "dice":

HETATM 3427 O11 GNF 1001 36.457 27.002 14.788 1.00 27.60 O

The code I tried:

from Bio.PDB import *

parser = PDBParser()
structure = parser.get_structure('s', '2F1G.pdb')

y = extract(structure, " ", 1001, 1001, '2out.pdb')

I only receive an empty file.
Is there any possibility to use dice when a chain specification is
obviously missing?

Thanks a lot!

Florian

From biopython at maubp.freeserve.co.uk Wed Mar 12 14:42:50 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 12 Mar 2008 18:42:50 +0000
Subject: [BioPython] Searching the mailing list
In-Reply-To: <320fb6e00803111720k79d990f6lb0de95399f5f52a8@mail.gmail.com>
References: <320fb6e00803111720k79d990f6lb0de95399f5f52a8@mail.gmail.com>
Message-ID: <320fb6e00803121142w15aaa333te28adcfafcb038d9@mail.gmail.com>

Just to let everyone know, http://search.open-bio.org/ is up and
running again (using Google to do the hard work), and searching the
mailing list archives seems to work pretty well.

Peter

From olav.zimmermann at fz-juelich.de Wed Mar 12 15:28:49 2008
From: olav.zimmermann at fz-juelich.de (olav.zimmermann at fz-juelich.de)
Date: Wed, 12 Mar 2008 20:28:49 +0100
Subject: [BioPython] -->Problem using Bio.PDB "Dice"
Message-ID:

Hi Florian,

the accept_residue method of the ChainSelector class, which is used by
Dice.extract, deliberately skips het residues (see Dice.py line 37ff):

    def accept_residue(self, residue):
        # residue - between start and end
        hetatm_flag, resseq, icode=residue.get_id()
        if hetatm_flag!=" ":
            # skip HETATMS
            return 0
        if icode!=" ":
            print "WARNING: Icode at ", residue.get_id()
        if self.start<=resseq<=self.end:
            return 1
        return 0

As the selector is not a parameter for Dice.extract, you should write
your own override of ChainSelector and pass it to PDBIO.save.

Best regards
Olav

From jswetnam at gmail.com Wed Mar 12 18:26:36 2008
From: jswetnam at gmail.com (James Swetnam)
Date: Wed, 12 Mar 2008 18:26:36 -0400
Subject: [BioPython] import error
In-Reply-To: <320fb6e00803111616g644485afxbed7592441beb0ce@mail.gmail.com>
References: <320fb6e00803101654y2db4cf63lbc1c8a8a1c8e8f4@mail.gmail.com>
 <95AD2738-8BC6-4E62-B4F4-3E7E0877BFD2@gmail.com>
 <320fb6e00803111035p1a5020f4n7dc71b3c036ff2a8@mail.gmail.com>
 <7A87EC3E-67A9-4D15-897E-2B58A1B7EEFF@gmail.com>
 <320fb6e00803111616g644485afxbed7592441beb0ce@mail.gmail.com>
Message-ID: <9EF64FD3-5D2C-403D-8A63-1206CEE8EAE6@gmail.com>

Hi Peter, and subscribers:

The problem seems to vanish with the current CVS release of python. I
was using the fink package (which I believe is 1.44?).

Thank you for your help.

James

From mjldehoon at yahoo.com Wed Mar 12 22:15:03 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Wed, 12 Mar 2008 19:15:03 -0700 (PDT)
Subject: [BioPython] import error
In-Reply-To: <9EF64FD3-5D2C-403D-8A63-1206CEE8EAE6@gmail.com>
Message-ID: <668680.6630.qm@web62401.mail.re1.yahoo.com>

Good to hear! I guess you meant "the current CVS release of
*biopython*" (not python itself?).
Biopython currently in CVS will be released as version 1.45 in nine days.

--Michiel.

From biopython at maubp.freeserve.co.uk Mon Mar 17 03:55:56 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 17 Mar 2008 07:55:56 +0000
Subject: [BioPython] Old Biopython code for EBI Bibliographics services
In-Reply-To: <320fb6e00803170049g79960e14u8c1417fcdc99a0d5@mail.gmail.com>
References: <320fb6e00803140944v13f241b9icc0e911643f234cd@mail.gmail.com>
 <47DE0C22.9040202@netsys.co.za>
 <320fb6e00803170049g79960e14u8c1417fcdc99a0d5@mail.gmail.com>
Message-ID: <320fb6e00803170055n18457967n27d1b07eaa6cb522@mail.gmail.com>

Dear list,

We have an old module Bio/biblio.py written by Tiaan Wessels back in
2002 (during a South African hackathon). This is code to use some EBI
Bibliographics services, but it currently no longer works. At the very
least, the EBI have changed the URLs for their SOAP services. I got
in touch with the author by email, and he no longer uses the code and
thought we could remove it.

Does anyone on the list still use Bio/biblio.py?
Would anyone like to take a more in depth look at the code, and the current EBI web API, and see if there is anything in Bio.biblio worth keeping? If not, I'm proposing we mark this as deprecated for the next release of Biopython. Thanks, Peter From jswetnam at gmail.com Mon Mar 17 17:14:56 2008 From: jswetnam at gmail.com (James Swetnam) Date: Mon, 17 Mar 2008 17:14:56 -0400 Subject: [BioPython] Site nuked? In-Reply-To: <320fb6e00803170055n18457967n27d1b07eaa6cb522@mail.gmail.com> References: <320fb6e00803140944v13f241b9icc0e911643f234cd@mail.gmail.com> <47DE0C22.9040202@netsys.co.za> <320fb6e00803170049g79960e14u8c1417fcdc99a0d5@mail.gmail.com> <320fb6e00803170055n18457967n27d1b07eaa6cb522@mail.gmail.com> Message-ID: <571B6801-6413-4AAD-BAA4-6C478B17F3F0@gmail.com> From where I stand, I see nothing but white page when I go to: http://biopython.org/wiki/Main_Page Hopefully this is an issue on my end? James On Mar 17, 2008, at 3:55 AM, Peter wrote: > Dear list, > > We have an old module Bio/biblio.py written by Tiaan Wessels back in > 2002 (during a South African hackathon). This is code to use some EBI > Bibliographics services, but currently no longer works. At the very > least, the EBI have changed the URLs for their SOAP services. I got > in touch with the author by email, and he no longer uses the code and > thought we could remove it. > > Does anyone on the list still use Bio/biblio.py? > > Would anyone like to take a more in depth look at the code, and the > current EBI web API, and see if there is anything in Bio.biblio worth > keeping? > > If not, I'm proposing we mark this as deprecated for the next release > of Biopython. 
> > Thanks, > > Peter > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From idoerg at gmail.com Mon Mar 17 17:22:08 2008 From: idoerg at gmail.com (Iddo Friedberg) Date: Mon, 17 Mar 2008 14:22:08 -0700 Subject: [BioPython] Site nuked? In-Reply-To: <571B6801-6413-4AAD-BAA4-6C478B17F3F0@gmail.com> References: <320fb6e00803140944v13f241b9icc0e911643f234cd@mail.gmail.com> <47DE0C22.9040202@netsys.co.za> <320fb6e00803170049g79960e14u8c1417fcdc99a0d5@mail.gmail.com> <320fb6e00803170055n18457967n27d1b07eaa6cb522@mail.gmail.com> <571B6801-6413-4AAD-BAA4-6C478B17F3F0@gmail.com> Message-ID: Dang, no it isn't just on your end. Seems like the whole open-bio Wiki is SNAFU'd. cc'ing this to Chris Dagdigian, although I bet he is on it already... Thanks, Iddo On Mon, Mar 17, 2008 at 2:14 PM, James Swetnam wrote: > From where I stand, I see nothing but white page when I go to: > > http://biopython.org/wiki/Main_Page > > Hopefully this is an issue on my end? > > James > > > -- Iddo Friedberg, Ph.D. CALIT2, mail code 0440 University of California, San Diego 9500 Gilman Drive La Jolla, CA 92093-0440, USA T: +1 (858) 534-0570 T: +1 (858) 646-3100 x3516 http://iddo-friedberg.org From biopython at maubp.freeserve.co.uk Mon Mar 17 17:22:50 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Mar 2008 21:22:50 +0000 Subject: [BioPython] Site nuked? 
In-Reply-To: <571B6801-6413-4AAD-BAA4-6C478B17F3F0@gmail.com> References: <320fb6e00803140944v13f241b9icc0e911643f234cd@mail.gmail.com> <47DE0C22.9040202@netsys.co.za> <320fb6e00803170049g79960e14u8c1417fcdc99a0d5@mail.gmail.com> <320fb6e00803170055n18457967n27d1b07eaa6cb522@mail.gmail.com> <571B6801-6413-4AAD-BAA4-6C478B17F3F0@gmail.com> Message-ID: <320fb6e00803171422g6c9160e4k2bc073e105dd409a@mail.gmail.com> On Mon, Mar 17, 2008 at 9:14 PM, James Swetnam wrote: > From where I stand, I see nothing but white page when I go to: > > http://biopython.org/wiki/Main_Page > > Hopefully this is an issue on my end? > > James It looks like its all the OBF sites, for example www.bioperl.org and www.biosql.org are down too. Last time this happened, the webserver needed a reboot and it was all fine - I'll email them in case they haven't noticed. You can probably manage in the very short term using Google's cache. Regards, Peter From dag at sonsorol.org Mon Mar 17 17:37:53 2008 From: dag at sonsorol.org (Chris Dagdigian) Date: Mon, 17 Mar 2008 17:37:53 -0400 Subject: [BioPython] Site nuked? In-Reply-To: References: <320fb6e00803140944v13f241b9icc0e911643f234cd@mail.gmail.com> <47DE0C22.9040202@netsys.co.za> <320fb6e00803170049g79960e14u8c1417fcdc99a0d5@mail.gmail.com> <320fb6e00803170055n18457967n27d1b07eaa6cb522@mail.gmail.com> <571B6801-6413-4AAD-BAA4-6C478B17F3F0@gmail.com> Message-ID: <17743519-5396-4BB0-9668-8CF3ACD7944F@sonsorol.org> Hi folks, There is some sort of memory leak in our webserver/mediawiki or php extension framework. Even a cron job restarting the webserver hourly does not seem to be fixing things although it does seem to increase the period between failures. Anyway, the site is back. I just had to restart the webserver. In the future you can also report this to root-l at open-bio.org -- there are a number of us who have webserver restart access. Regards, Chris On Mar 17, 2008, at 5:22 PM, Iddo Friedberg wrote: > Dang, no it isn't just on your end. 
>
> Seems like the whole open-bio Wiki is SNAFU'd.
>
> cc'ing this to Chris Dagdigian, although I bet he is on it already...
>
> Thanks,
>
> Iddo
>
> On Mon, Mar 17, 2008 at 2:14 PM, James Swetnam wrote:
> From where I stand, I see nothing but white page when I go to:
>
> http://biopython.org/wiki/Main_Page
>
> Hopefully this is an issue on my end?
>
> James
>
> --
> Iddo Friedberg, Ph.D.
> CALIT2, mail code 0440
> University of California, San Diego
> 9500 Gilman Drive
> La Jolla, CA 92093-0440, USA
> T: +1 (858) 534-0570
> T: +1 (858) 646-3100 x3516
> http://iddo-friedberg.org

From lueck at ipk-gatersleben.de Wed Mar 19 03:39:58 2008
From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=)
Date: Wed, 19 Mar 2008 08:39:58 +0100
Subject: [BioPython] Alignment with Emboss water tool
Message-ID: <002401c88994$6ee2c6a0$1022a8c0@ipkgatersleben.de>

Hi!

I'm trying to make an alignment with Emboss water. Here's my code:

from Bio.Emboss.Applications import WaterCommandline
from Bio.Application import generic_run

water = WaterCommandline()
water.set_parameter("-asequence", "asis::atgctccg")
water.set_parameter("-bsequence", "asis::taggcgctcgctcgatgctccgatgctcgc")
water.set_parameter("-gapopen", "10")
water.set_parameter("-gapextend", "0.5")
water.set_parameter("-outfile", "test2.txt")
final = generic_run(water)

I get the following error:

Traceback (most recent call last):
  File "C:\Dokumente und Einstellungen\Administrator\Desktop\water.py", line 10, in <module>
    final = generic_run(water)
  File "C:\Python25\lib\site-packages\Bio\Application\__init__.py", line 18, in generic_run
    child = popen2.Popen3(str(commandline), 1)
AttributeError: 'module' object has no attribute 'Popen3'

Does anyone know how to fix this?

Thanks in advance!
Stefanie

From mjldehoon at yahoo.com Wed Mar 19 04:22:07 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Wed, 19 Mar 2008 01:22:07 -0700 (PDT)
Subject: [BioPython] Alignment with Emboss water tool
In-Reply-To: <002401c88994$6ee2c6a0$1022a8c0@ipkgatersleben.de>
Message-ID: <975333.80988.qm@web62414.mail.re1.yahoo.com>

It turns out that popen2.Popen3 is not available on Windows. See:
http://docs.python.org/lib/module-popen2.html

As a workaround, you can modify the file
C:\Python25\lib\site-packages\Bio\Application\__init__.py
and replace the call to popen2.Popen3 with a call to popen2.popen3 (some other small modifications will be necessary). The drawback of this approach is that we then cannot get the error code of the child process. Maybe we should modify Bio/Application/__init__.py to return a dummy error code on Windows.

--Michiel

Stefanie Lück wrote:

Hi!

I'm trying to make an alignment with Emboss water. Here's my code:

from Bio.Emboss.Applications import WaterCommandline
from Bio.Application import generic_run

water = WaterCommandline()
water.set_parameter("-asequence", "asis::atgctccg")
water.set_parameter("-bsequence", "asis::taggcgctcgctcgatgctccgatgctcgc")
water.set_parameter("-gapopen", "10")
water.set_parameter("-gapextend", "0.5")
water.set_parameter("-outfile", "test2.txt")
final = generic_run(water)

I get the following error:

Traceback (most recent call last):
  File "C:\Dokumente und Einstellungen\Administrator\Desktop\water.py", line 10, in <module>
    final = generic_run(water)
  File "C:\Python25\lib\site-packages\Bio\Application\__init__.py", line 18, in generic_run
    child = popen2.Popen3(str(commandline), 1)
AttributeError: 'module' object has no attribute 'Popen3'

Does anyone know how to fix this?

Thanks in advance!
Stefanie

From mjldehoon at yahoo.com Sat Mar 22 07:02:38 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sat, 22 Mar 2008 04:02:38 -0700 (PDT)
Subject: [BioPython] Biopython release 1.45
Message-ID: <901773.64728.qm@web62408.mail.re1.yahoo.com>

We are pleased to announce the release of Biopython 1.45.

This release includes numerous code improvements and fixes, including in Bio.Seq, Bio.SeqIO, Bio.Entrez, Bio.PopGen, Bio.SwissProt, Bio.Cluster, Bio.SCOP, Bio.InterPro, Bio.GenBank, Bio.ExPASy, BioSQL, and the Biopython documentation. Too many to list them all here!

Source distributions and Windows installers are available from the Biopython website at http://biopython.org.

My thanks to all code contributors who made this new release possible.

--Michiel
on behalf of the Biopython developers.

From lueck at ipk-gatersleben.de Tue Mar 25 05:24:01 2008
From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=)
Date: Tue, 25 Mar 2008 10:24:01 +0100
Subject: [BioPython] Emboss (water) alignment output
Message-ID: <006101c88e59$f64f8820$1022a8c0@ipkgatersleben.de>

Hi!

Is it possible to get just the alignment as output, instead of all the additional information?

Like

atgctctcc
IIIIIIIII II
atgctcgcc

Ideally without writing to a file, i.e. directly from Python.

Thanks in advance!
Stefanie

From p.j.a.cock at googlemail.com Tue Mar 25 06:25:17 2008
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 25 Mar 2008 10:25:17 +0000
Subject: [BioPython] Emboss (water) alignment output
In-Reply-To: <006101c88e59$f64f8820$1022a8c0@ipkgatersleben.de>
References: <006101c88e59$f64f8820$1022a8c0@ipkgatersleben.de>
Message-ID: <320fb6e00803250325u42dd61edvb7a2a4f673300211@mail.gmail.com>

On Tue, Mar 25, 2008 at 9:24 AM, Stefanie Lück wrote:
> Hi!
>
> Is it possible to get just the alignment as output, instead of all the additional information?
>
> Like
>
> atgctctcc
> IIIIIIIII II
> atgctcgcc

The EMBOSS tools usually have a choice of output formats. I do know that the EMBOSS water tool will output FASTA format, which is very simple. You could easily recreate the match/mismatch string from this.

I do have some experimental code for reading the EMBOSS pairwise alignment format, but it's not in Biopython yet.

> Ideally without writing to a file, i.e. directly from Python.

Do you mean calling EMBOSS water without first creating an input file for it? If your sequences are short, you can provide them as part of the command line (search for "asis" in the EMBOSS documentation). You can also pipe the input into water (see EMBOSS command line "-filter") ... this can probably also be done with Python but I've never tried.

If you mean getting the output on stdout instead of in a file, then the EMBOSS tools can do that too (EMBOSS command line option "-stdout"). You'd just need to open the output stream as a pipe in Python.

Peter

From bsantos at biocant.pt Tue Mar 25 10:23:23 2008
From: bsantos at biocant.pt (Bruno Santos)
Date: Tue, 25 Mar 2008 14:23:23 -0000
Subject: [BioPython] Module for ace assembly format
Message-ID: <002201c88e83$ca34b150$5e9e13f0$@pt>

Hi,

I'm currently working on an assembly project and I need to read some files in ACE assembly format. My question is whether Biopython has any module to do this. I searched Google and found some references to a BioPerl module, but on the Biopython website I didn't find any information about this format.
Thank you very much in advance; Bruno Santos From p.j.a.cock at googlemail.com Tue Mar 25 10:34:34 2008 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 25 Mar 2008 14:34:34 +0000 Subject: [BioPython] Module for ace assembly format In-Reply-To: <002201c88e83$ca34b150$5e9e13f0$@pt> References: <002201c88e83$ca34b150$5e9e13f0$@pt> Message-ID: <320fb6e00803250734uc750015q7f1d0d6782ff190c@mail.gmail.com> On Tue, Mar 25, 2008 at 2:23 PM, Bruno Santos wrote: > Hi, > > I'm currently working in an assembly project and I need to read some files > in ace assembly format, my question is if biopython have any module to do > this. I have done a search in google and I found some references to a > bioperl module, but in the biopython website I didn't find any information > about this format. Hi Bruno, Have a look at the Bio.Sequencing.Ace module for ACE files output by PHRAP. We don't seem to have anything in the main tutorial on this, so if you would like to write a short piece for the "cookbook" chapter that would be helpful Peter From ericgibert at yahoo.fr Tue Mar 25 10:54:57 2008 From: ericgibert at yahoo.fr (Eric Gibert) Date: Tue, 25 Mar 2008 14:54:57 +0000 (GMT) Subject: [BioPython] Concerns the update of BioSQL.taxon table Message-ID: <711039.40736.qm@web26505.mail.ukl.yahoo.com> Dear all, I moved to BioPython 1.45 and created a fresh BioSQL 1.0.0 database. Everything went smoothly except for one important point: the table 'taxon' defines the column ncbi_taxon_id with a unique index. But currently, when a BioSeq is created, the lineage records are all inserted as found in the GenBank data. At insertion, there is no problem since insertion set NULL for all ncbi_taxon_id but for the species one, no duplicate keys are found. On the other hand, when I run my script to update the 'taxon' table, some ranks are the same (like family or class or order). I obtain then a 'duplicate entry on key 2' SQL error. 
Before I did not have the problem because I did not have the ncbi_taxon_id associated to a UNIQUE index. Is this new in BioSQL 1.0.0? If the answer is YES then I guess that the reason behind is to avoid to repeat all ranks for each species but to define them once only. I understand that solution but then our BioPython INSERT of a new BioSeq is "incompatible" with this behavior. Thus I wonder if we should: a) remove the UNIQUE index on ncbi_taxon_id or b) rewrite the management of the 'taxon' table in BioPython with a control that we add records only for new rank, with a 'clever' parent linkage (then what about the right and left value fields?). Please let me know, Eric _____________________________________________________________________________ Envoyez avec Yahoo! Mail. Capacit? de stockage illimit?e pour vos emails. http://mail.yahoo.fr From biopython at maubp.freeserve.co.uk Tue Mar 25 11:53:24 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 25 Mar 2008 15:53:24 +0000 Subject: [BioPython] Concerns the update of BioSQL.taxon table In-Reply-To: <711039.40736.qm@web26505.mail.ukl.yahoo.com> References: <711039.40736.qm@web26505.mail.ukl.yahoo.com> Message-ID: <320fb6e00803250853i629e59aj310ddc5667ea57d@mail.gmail.com> Hi Eric, Your issue is almost certainly due to switching from Biopython 1.44 to 1.45, rather than from a prerelease BioSQL to the recently released BioSQL 1.0.0. For background, you should read Bug 2422 and the BioSQL thread it points to. http://bugzilla.open-bio.org/show_bug.cgi?id=2422 Biopython 1.44 never recorded the taxon id (and therefore didn't use the taxon/taxon_name tables) Biopython 1.45 does record the taxon id, and attempts to fill in missing taxon/taxon_name entries I'm a little unclear on what is going wrong for you. Did you pre-load the NCBI taxonomy for example? The script you are talking about, is this your own? 
Peter From biopython at maubp.freeserve.co.uk Tue Mar 25 11:56:16 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 25 Mar 2008 15:56:16 +0000 Subject: [BioPython] Concerns the update of BioSQL.taxon table In-Reply-To: <320fb6e00803250853i629e59aj310ddc5667ea57d@mail.gmail.com> References: <711039.40736.qm@web26505.mail.ukl.yahoo.com> <320fb6e00803250853i629e59aj310ddc5667ea57d@mail.gmail.com> Message-ID: <320fb6e00803250856n1001d74dxeb8560652f594e51@mail.gmail.com> On Tue, Mar 25, 2008 at 3:53 PM, Peter wrote: > Hi Eric, > > Your issue is almost certainly due to switching from Biopython 1.44 to > 1.45, rather than from a prerelease BioSQL to the recently released > BioSQL 1.0.0. > > For background, you should read Bug 2422 and the BioSQL thread it points to. > http://bugzilla.open-bio.org/show_bug.cgi?id=2422 > > Biopython 1.44 never recorded the taxon id (and therefore didn't use > the taxon/taxon_name tables) > Biopython 1.45 does record the taxon id, and attempts to fill in > missing taxon/taxon_name entries > > I'm a little unclear on what is going wrong for you. Did you pre-load > the NCBI taxonomy for example? The script you are talking about, is > this your own? > > Peter > P.S. Did you mean to send your original message to the BioSQL list as well Eric? You need biosql-l at lists.open-bio.org not biosql at lists.open-bio.org Peter From barry_finzel at yahoo.com Tue Mar 25 12:21:30 2008 From: barry_finzel at yahoo.com (Barry Finzel) Date: Tue, 25 Mar 2008 09:21:30 -0700 (PDT) Subject: [BioPython] Editing structure entities created by PDBParser Message-ID: <178200.99522.qm@web52102.mail.re2.yahoo.com> I'm relatively new to Biopython, but I've already plunged pretty deeply into Bio.PDB and am trying to write some tools to standardize non-standard PDB files - mostly an exercise in renaming and resorting residue objects (particularly HET groups). 
I'm hoping to find (or create) something like a chain.sort() method that reorders ALL residue children of a chain by the criteria described in the Bio.PDB.Chain.Chain API:

    def _sort(self, r1, r2):
        """Sort function for residues in a chain

        Residues are first sorted according to their hetatm records.
        Protein and nucleic acid residues first, hetatm residues next,
        and waters last. Within each group, the residues are sorted according
        to their resseq's (sequence identifiers). Finally, residues with the
        same resseq's are sorted according to icode.

        Arguments:
        o r1, r2 - Residue objects
        """

I use Bio.PDBIO to output a revised model in PDB format. As things are, I see no way to avoid writing output PDB files with objects in the same order as they were read - unless I recreate a duplicate structure in the order I want.

Any alternate suggestions?

Barry

From bsantos at biocant.pt Tue Mar 25 13:32:35 2008
From: bsantos at biocant.pt (Bruno Santos)
Date: Tue, 25 Mar 2008 17:32:35 -0000
Subject: [BioPython] Module for ace assembly format
In-Reply-To: <320fb6e00803250734uc750015q7f1d0d6782ff190c@mail.gmail.com>
References: <002201c88e83$ca34b150$5e9e13f0$@pt> <320fb6e00803250734uc750015q7f1d0d6782ff190c@mail.gmail.com>
Message-ID: <004601c88e9e$38bfa5c0$aa3ef140$@pt>

The ACE file in my case is the one produced by an assembly from a newer generation of sequencing machines, more precisely by the assembler of the GS FLX from Roche 454 Life Sciences, and it is working perfectly. So I will try to write some documentation during this week, and then I will contact you to find out how to upload the file so it can be used in the cookbook.
Thank you for your help.
Bruno Santos -----Mensagem original----- De: Peter Cock [mailto:p.j.a.cock at googlemail.com] Enviada: ter?a-feira, 25 de Mar?o de 2008 14:35 Para: Bruno Santos Cc: biopython at biopython.org Assunto: Re: [BioPython] Module for ace assembly format On Tue, Mar 25, 2008 at 2:23 PM, Bruno Santos wrote: > Hi, > > I'm currently working in an assembly project and I need to read some files > in ace assembly format, my question is if biopython have any module to do > this. I have done a search in google and I found some references to a > bioperl module, but in the biopython website I didn't find any information > about this format. Hi Bruno, Have a look at the Bio.Sequencing.Ace module for ACE files output by PHRAP. We don't seem to have anything in the main tutorial on this, so if you would like to write a short piece for the "cookbook" chapter that would be helpful Peter From p.j.a.cock at googlemail.com Tue Mar 25 13:43:50 2008 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 25 Mar 2008 17:43:50 +0000 Subject: [BioPython] Module for ace assembly format In-Reply-To: <004601c88e9e$38bfa5c0$aa3ef140$@pt> References: <002201c88e83$ca34b150$5e9e13f0$@pt> <320fb6e00803250734uc750015q7f1d0d6782ff190c@mail.gmail.com> <004601c88e9e$38bfa5c0$aa3ef140$@pt> Message-ID: <320fb6e00803251043p321f7109ib00bbba0b15cefa5@mail.gmail.com> On Tue, Mar 25, 2008 at 5:32 PM, Bruno Santos wrote: > The ACE file in my case is the one produced by an assembly of a newer > generation of sequencer machines, more precisely from the assembler of GS > FLX from Roche 454Life Sciences, and it is working perfectly. So I will try > to write some documentation during this week and then I contact you to know > how to upload the file so it can be used in the cookbook. > Thank you for your help. 
It sounds like you are probably talking about the Roche fasta and psuedo-fasta format files described on Bug 2382, http://bugzilla.open-bio.org/show_bug.cgi?id=2382 Peter From bsantos at biocant.pt Tue Mar 25 14:20:35 2008 From: bsantos at biocant.pt (Bruno Santos) Date: Tue, 25 Mar 2008 18:20:35 -0000 Subject: [BioPython] Module for ace assembly format In-Reply-To: <320fb6e00803251103s32f0b024t9416869a60917d0c@mail.gmail.com> References: <002201c88e83$ca34b150$5e9e13f0$@pt> <320fb6e00803250734uc750015q7f1d0d6782ff190c@mail.gmail.com> <004601c88e9e$38bfa5c0$aa3ef140$@pt> <320fb6e00803251043p321f7109ib00bbba0b15cefa5@mail.gmail.com> <004701c88ea0$abced6b0$036c8410$@pt> <320fb6e00803251103s32f0b024t9416869a60917d0c@mail.gmail.com> Message-ID: <004801c88ea4$ed094e90$c71bebb0$@pt> >Did you mean to send this email to me only? > >Peter Not really I made reply and I forget to change the mail address. I can get some documentation about that, but most of the files produced are already able to been parsed by the existing code, since the assembler as integration with other software packages like the Phred/Phrap/Consed Package, and it also output the results in fasta format and other standard format. So I don't think it is necessary to create a module to deal with all the output files, since many of them are already able to be parsed with the current versions of SeqIO Sequence modules. The main problem with the Roche files is that the primary files created by the Sequencer machine are proprietary files called sff, and they are binary files, but Roche provides a program to convert this output to standard formats. 
-----Mensagem original----- De: Peter Cock [mailto:p.j.a.cock at googlemail.com] Enviada: ter?a-feira, 25 de Mar?o de 2008 18:03 Para: Bruno Santos Assunto: Re: [BioPython] Module for ace assembly format On Tue, Mar 25, 2008 at 5:50 PM, Bruno Santos wrote: > I'm not sure if it's the same, for the bugzilla I wasn't able to understand > what is the objective of the bugzilla, if it is to create a parser for read > the custom FASTA > Files produced by the sequencer with the extension .fna or to create parsers > to all the files produced by the assembler. But for what I understand it's > just an improvement to the current FASTA parsers to deal with this specific > format. I think Jared wanted a fancy "fasta like" parser to cope with all sorts of files, included some produced by Roche. Michiel and I didn't regard these non-sequence files as FASTA files, but agreed a separate module to parse the Roche "fasta like" files might be useful. Without first hand experience of the various output files from Roche, or a good set of samples, its not clear to me how best to proceed. But I don't want to extend an existing fasta sequence parsing module to deal with these. If you could add example files and links to any Roche documentation that would help - even if its just what their terminology is for all the different file types. Did you mean to send this email to me only? 
Peter From p.j.a.cock at googlemail.com Tue Mar 25 14:30:51 2008 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 25 Mar 2008 18:30:51 +0000 Subject: [BioPython] Module for ace assembly format In-Reply-To: <004801c88ea4$ed094e90$c71bebb0$@pt> References: <002201c88e83$ca34b150$5e9e13f0$@pt> <320fb6e00803250734uc750015q7f1d0d6782ff190c@mail.gmail.com> <004601c88e9e$38bfa5c0$aa3ef140$@pt> <320fb6e00803251043p321f7109ib00bbba0b15cefa5@mail.gmail.com> <004701c88ea0$abced6b0$036c8410$@pt> <320fb6e00803251103s32f0b024t9416869a60917d0c@mail.gmail.com> <004801c88ea4$ed094e90$c71bebb0$@pt> Message-ID: <320fb6e00803251130ucf3f74cwc884e6c0114eccf8@mail.gmail.com> Forwarding this back to the mailing list in case anyone else was following this thread. Peter ---------- Forwarded message ---------- From: Bruno Santos Date: Tue, Mar 25, 2008 at 6:20 PM Subject: Re: [BioPython] Module for ace assembly format To: biopython at biopython.org >Did you mean to send this email to me only? > >Peter Not really I made reply and I forget to change the mail address. I can get some documentation about that, but most of the files produced are already able to been parsed by the existing code, since the assembler as integration with other software packages like the Phred/Phrap/Consed Package, and it also output the results in fasta format and other standard format. So I don't think it is necessary to create a module to deal with all the output files, since many of them are already able to be parsed with the current versions of SeqIO Sequence modules. The main problem with the Roche files is that the primary files created by the Sequencer machine are proprietary files called sff, and they are binary files, but Roche provides a program to convert this output to standard formats. 
-----Mensagem original----- De: Peter Enviada: ter?a-feira, 25 de Mar?o de 2008 18:03 Para: Bruno Santos Assunto: Re: [BioPython] Module for ace assembly format On Tue, Mar 25, 2008 at 5:50 PM, Bruno Santos wrote: > I'm not sure if it's the same, for the bugzilla I wasn't able to > understand what is the objective of the bugzilla, if it is to creat > a parser for read the custom FASTA Files produced by the > sequencer with the extension .fna or to create parsers > to all the files produced by the assembler. But for what I > understand it's just an improvement to the current FASTA > parsers to deal with this specific format. I think Jared wanted a fancy "fasta like" parser to cope with all sorts of files, included some produced by Roche. Michiel and I didn't regard these non-sequence files as FASTA files, but agreed a separate module to parse the Roche "fasta like" files might be useful. Without first hand experience of the various output files from Roche, or a good set of samples, its not clear to me how best to proceed. But I don't want to extend an existing fasta sequence parsing module to deal with these. If you could add example files and links to any Roche documentation that would help - even if its just what their terminology is for all the different file types. Did you mean to send this email to me only? Peter From rjalves at igc.gulbenkian.pt Tue Mar 25 17:00:02 2008 From: rjalves at igc.gulbenkian.pt (Renato Alves) Date: Tue, 25 Mar 2008 21:00:02 +0000 Subject: [BioPython] Bio.Cluster - Howto, Documentation, exporting results Message-ID: <47E967D2.6010602@igc.gulbenkian.pt> Hi, I'm having some troubles understanding how to export data processed using Bio.Cluster.distancematrix and Bio.Cluster.treecluster . I don't read the data from files, so I don't understand if DataFile class is what I need, and if it is, how do I make use of it. 
Additionally, as of BioPython-1.44, there are a couple of things mentioned in the documentation that are not available in Bio.Cluster. One of those is the Bio.Cluster.read function. I don't know if this is because it was not yet in BioPython-1.44 or if the documentation is outdated. The cookbook doesn't help either since there is nothing there regarding Bio.Cluster. What I'm trying to do is to calculate the distances between some multidimensional vectors and then cluster them. I managed to do that, but then I don't know what to do with the Tree object I get. It's also not obvious how do I keep track of which values in the Tree object correspond to which entries in the distance matrix or in the original data. So the questions are: Is it possible to pass text in the original data so that it is used as some sort of identifying header in later operations? How can I export the Tree object to something like the treeview format mentioned in the documentation? Is there any way to visualize the tree directly using ASCII or something more graphical? Thanks Renato Alves From mjldehoon at yahoo.com Wed Mar 26 00:19:13 2008 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 25 Mar 2008 21:19:13 -0700 (PDT) Subject: [BioPython] Bio.Cluster - Howto, Documentation, exporting results In-Reply-To: <47E967D2.6010602@igc.gulbenkian.pt> Message-ID: <720801.11321.qm@web62412.mail.re1.yahoo.com> > Additionally, as of BioPython-1.44, there are a couple of things > mentioned in the documentation that are not available in Bio.Cluster. > One of those is the Bio.Cluster.read function. I don't know if this is > because it was not yet in BioPython-1.44 or if the documentation is > outdated. Some changes were made in Bio.Cluster in Biopython 1.45. These are largely cosmetic to make Bio.Cluster more consistent with other modules in Biopython. One of them is the read() function, which was added in Biopython 1.45. 
I have now updated the documentation for Bio.Cluster on the Biopython website; it corresponds to Biopython 1.45. > I don't read the data from files, so I don't understand if DataFile > class is what I need, and if it is, how do I make use of it. > What I'm trying to do is to calculate the distances between some > multidimensional vectors and then cluster them. I managed to do that, > but then I don't know what to do with the Tree object I get. It's also > not obvious how do I keep track of which values in the Tree object > correspond to which entries in the distance matrix or in the original data. The values in the Tree object, if non-negative, simply correspond to the row number in the distance matrix. If negative, they correspond to a node number. So if the Tree object is [1, 2] --> This is Node # -1 [-1,0] --> This is node # -2 then first row 1 and row 2 in the distance matrix are joined, and then row 0 in the distance matrix is joined to the node [1,2]. > Is it possible to pass text in the original data so that it is used as > some sort of identifying header in later operations? Instead of relying on the row numbers, you can also create an empty Bio.Cluster.Record object and fill this object with the data you have. Bio.Cluster.Record is essentially the same as Bio.Cluster.DataFile, just the name was changed for consistency with other Biopython modules. It may be a good idea to look at the documentation of Cluster 3 at http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/manual/index.html to understand what all the fields in Bio.Cluster.Record are. Another way is to construct a file in memory and let Bio.Cluster.read parse it. 
>>> from Bio import Cluster
>>> lines = "Start\tCol0\tCol1\tCol2\nRow0\t2.0\t1.2\t3.4\nRow1\t5.0\t6.2\t7.1\nRow2\t2.3\t5.6\t1.2\n"
>>> print lines
Start   Col0    Col1    Col2
Row0    2.0     1.2     3.4
Row1    5.0     6.2     7.1
Row2    2.3     5.6     1.2

>>> import StringIO
>>> handle = StringIO.StringIO(lines)
>>> record = Cluster.read(handle)
>>> tree = record.treecluster()

> How can I export the Tree object to something like the treeview format
> mentioned in the documentation?

>>> record.save("myfilename", tree)

> Is there any way to visualize the tree directly using ASCII or something
> more graphical?

Currently, there is no ASCII-art-like representation to visualize the tree. So the easiest solution is to save the clustering solution in the treeview format, and use Java TreeView to visualize it.

--Michiel.

From ericgibert at yahoo.fr Wed Mar 26 07:29:24 2008
From: ericgibert at yahoo.fr (Eric Gibert)
Date: Wed, 26 Mar 2008 11:29:24 +0000 (GMT)
Subject: [BioPython] Concerns the update of BioSQL.taxon table
Message-ID: <290936.61510.qm@web26510.mail.ukl.yahoo.com>

Thank you Peter for the correct email of the BioSQL list.

No, it is not something linked to the BioPython 1.45 upgrade: same behavior as 1.44. My problem is linked to the fact that the BioSQL schema version 1.0.0 defines a *unique* index on taxon.ncbi_taxon_id. I did not have this index before.

I have written a script that connects to the taxonomy database of NCBI and gets the XML data for the species. Then it updates the taxon table, replacing the NULL ncbi_taxon_id and node_rank by their values for all the lineage. I call it after the loading of BioSeqs in the database.
Example:
I load a BioSeq for Nannophya pygmaea then I run my script to update the ncbi_taxon_id and rank:

+----------+---------------+-----------------+--------------+
| taxon_id | ncbi_taxon_id | parent_taxon_id | node_rank    |
+----------+---------------+-----------------+--------------+
|       13 |          2759 |            NULL | superkingdom |
|       14 |         33208 |              13 | kingdom      |
|       15 |          6656 |              14 | phylum       |
|       16 |          6960 |              15 | superclass   |
|       17 |         50557 |              16 | class        |
|       18 |          7496 |              17 | no rank      |
|       19 |         33339 |              18 | subclass     |
|       20 |          6961 |              19 | order        |
|       21 |          6962 |              20 | suborder     |
|       22 |          6964 |              21 | family       |
|       23 |        229390 |              22 | genus        |
|       24 |        229391 |              23 | species      |

No problem.

Now I insert/load another Libellulideae (Orthetrum sabina): 'empty/NULL' taxon records are inserted by the db.load() BioPython function:

|       25 |          NULL |            NULL | NULL         |
|       26 |          NULL |              25 | NULL         |
|       27 |          NULL |              26 | NULL         |
|       28 |          NULL |              27 | NULL         |
|       29 |          NULL |              28 | NULL         |
|       30 |          NULL |              29 | NULL         |
|       31 |          NULL |              30 | NULL         |
|       32 |          NULL |              31 | NULL         |
|       33 |          NULL |              32 | NULL         |
|       34 |          NULL |              33 | NULL         |
|       35 |          NULL |              34 | genus        |
|       36 |        320892 |              35 | species      |

Then I try to run my script: this time I have an update failure because record 34 is the SAME family, hence the same ncbi_taxon_id, as record 22: 'duplicate entry on key 2'.

Either this *unique* index is new and it is a BioSQL "issue" (as said, this index did not exist in my previous BioSQL db so I never encountered this issue before), OR the way BioPython "repeats" existing taxons is incorrect/not compatible. In that case, when inserting the second BioSeq, record 34 should not be created but record 35 (the genus) should "point" to the already existing family at record 22 as its parent.

So I would like confirmation from the BioSQL team that the unique index is valid. If that is the case, then we can have a separate BioPython discussion about how to improve the management of the taxon table.
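
To illustrate the second option, here is a minimal sketch only, not the actual BioPython loader code. It uses an in-memory SQLite database as a stand-in for BioSQL, with a much simplified taxon/taxon_name layout, and the helpers get_or_create_taxon and load_lineage are hypothetical names. The lineages are abbreviated for illustration. Rows are matched on scientific name plus parent, walking down from the root, so a shared family is reused instead of being inserted a second time:

```python
import sqlite3

# Simplified stand-in for the BioSQL tables (assumption: the real schema has more columns).
SCHEMA = """
CREATE TABLE taxon (
    taxon_id        INTEGER PRIMARY KEY,
    ncbi_taxon_id   INTEGER UNIQUE,      -- the unique index discussed above
    parent_taxon_id INTEGER,
    node_rank       TEXT
);
CREATE TABLE taxon_name (taxon_id INTEGER, name TEXT);
"""


def get_or_create_taxon(conn, name, parent_id):
    """Return the taxon_id for `name` under `parent_id`, inserting only if missing."""
    cur = conn.cursor()
    # SQLite's IS operator treats two NULLs as equal, so the root (no parent) matches too.
    cur.execute(
        "SELECT t.taxon_id FROM taxon t "
        "JOIN taxon_name n ON n.taxon_id = t.taxon_id "
        "WHERE n.name = ? AND t.parent_taxon_id IS ?",
        (name, parent_id),
    )
    row = cur.fetchone()
    if row is not None:
        return row[0]  # reuse the existing node
    cur.execute("INSERT INTO taxon (parent_taxon_id) VALUES (?)", (parent_id,))
    taxon_id = cur.lastrowid
    cur.execute("INSERT INTO taxon_name (taxon_id, name) VALUES (?, ?)", (taxon_id, name))
    return taxon_id


def load_lineage(conn, lineage):
    """Walk a root-to-species list of names; return the taxon_id of the last entry."""
    parent_id = None
    for name in lineage:
        parent_id = get_or_create_taxon(conn, name, parent_id)
    return parent_id


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Abbreviated lineages, for illustration only.
shared = ["Eukaryota", "Metazoa", "Arthropoda", "Insecta", "Odonata", "Libellulidae"]
first = load_lineage(conn, shared + ["Nannophya", "Nannophya pygmaea"])
second = load_lineage(conn, shared + ["Orthetrum", "Orthetrum sabina"])

n_rows = conn.execute("SELECT COUNT(*) FROM taxon").fetchone()[0]
family_rows = conn.execute(
    "SELECT COUNT(*) FROM taxon_name WHERE name = 'Libellulidae'"
).fetchone()[0]
```

With this approach the second load adds only the new genus and species rows, so a later update of ncbi_taxon_id cannot hit the unique index for the shared family. Matching on the parent as well as the name is a design choice meant to reduce (not eliminate) the risk of mixing up identical names from different parts of the tree.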
Best regards,
Eric

From biopython at maubp.freeserve.co.uk Wed Mar 26 08:30:50 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 26 Mar 2008 12:30:50 +0000
Subject: [BioPython] Concerns the update of BioSQL.taxon table
In-Reply-To: <290936.61510.qm@web26510.mail.ukl.yahoo.com>
References: <290936.61510.qm@web26510.mail.ukl.yahoo.com>
Message-ID: <320fb6e00803260530w72cca900mc19654798d5d7e13@mail.gmail.com>

On Wed, Mar 26, 2008 at 11:29 AM, Eric Gibert wrote:
> Thank you Peter for the correct email of the BioSQL list.
>
> No, it is not something linked to BioPython 1.45 upgrade: same behavior as 1.44.
> My problem is linked to the fact that the BioSQl schema version 1.0.0 defines a
> *unique* index on taxon.ncbi_taxon_id. I did not have this index before.
>
> I have written a script that connects to the taxonomy database of NCBI and get
> the XML data for the species. Then it updates the taxon table, replacing the
> ncbi_taxon_id and node_rank NULL by their values for all the lineage. I call it
> after the loading of BioSeqs in the database.

So you wrote your own version of the BioSQL perl script load_ncbi_taxonomy.pl?
> Example: > I load a BioSeq for Nannophya pygmaea then I run my script to update the ncbi_taxon_id and rank: > +----------+---------------+-----------------+--------------+ > | taxon_id | ncbi_taxon_id | parent_taxon_id | node_rank | > +----------+---------------+-----------------+--------------+ > | 13 | 2759 | NULL | superkingdom | > | 14 | 33208 | 13 | kingdom | > | 15 | 6656 | 14 | phylum | > | 16 | 6960 | 15 | superclass | > | 17 | 50557 | 16 | class | > | 18 | 7496 | 17 | no rank | > | 19 | 33339 | 18 | subclass | > | 20 | 6961 | 19 | order | > | 21 | 6962 | 20 | suborder | > | 22 | 6964 | 21 | family | > | 23 | 229390 | 22 | genus | > | 24 | 229391 | 23 | species | > > No problem. > > Now I insert/load another Libellulideae (Orthetrum sabina ): 'empty/NULL' > taxons records are inserted by the db.load() BioPython function: These records are "guess work" based on the lineage in the GenBank file - we don't know the NCBI taxon ids, so they are NULL, nor the rank, but there is a scientific name in the lined taxon_name table. I am open to the idea of not writing this guessed lineage, and just writing one entry for the species and the given NCBI taxon ID. However, as the new entry Orthetrum sabina should share some of its lineage with Nannophya pygmaea, then I agree Biopython *should* be re-using those existing taxon entries, if it can match them safely using the scientific name. Re-reading the relevant bit of old code, it doesn't seem to do this. I've file bug 2475: http://bugzilla.open-bio.org/show_bug.cgi?id=2475 This is actually a tricky problem, requiring some a 'clever' parent linkage as you said in your earlier email. Hilmar wrote this about the equivalent code in BioPerl: >> It's pretty unreliable actually. There is not only synonymy but also >> rampant homonymy in taxonomic names. There are plenty of examples >> for the same scientific name in use for a plant and for some animal, for >> example. 
So in order to be unambiguous you will need to know (and >> check) the kingdom. See http://lists.open-bio.org/pipermail/biosql-l/2008-March/001207.html Eric wrote: > then I try to run my script: this time I have an update failure because the > record 34 is the SAME family hence same ncbi_taxon_id as record 22: > 'duplicate entry on key 2'. > > Either this *unique* index is new and it is a BioSQL "issue" (as said, this index > did not exist in my previous BioSQL db so I never encountered this issue before), Hopefully Hilmar from BioSQL can answer this. > OR the way BioPython "repeats" existing taxons is incorrect/not compatible. > In that case, when inserting the second BioSeq, record 34 should not be created > but record 35 (the genus) should "point" to the already existing family at record > 22 as its father. This example might be easier to follow if the scientific names from the taxon_name were included. I would check the lineage but the NCBI wepage is being very slow for me right now. In the short term, as a quick fix, your script could first remove taxon entries with a blank NCBI taxon ID (and clear any keys pointing to them). Not elegent - but it would work. Thanks Eric Peter From hlapp at gmx.net Wed Mar 26 09:29:01 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 26 Mar 2008 09:29:01 -0400 Subject: [BioPython] [BioSQL-l] Concerns the update of BioSQL.taxon table In-Reply-To: <290936.61510.qm@web26510.mail.ukl.yahoo.com> References: <290936.61510.qm@web26510.mail.ukl.yahoo.com> Message-ID: On Mar 26, 2008, at 7:29 AM, Eric Gibert wrote: > Either this *unique* index is new and it is a BioSQL "issue" (as > said, this index did not exist in my previous BioSQL db so I never > encountered this issue before) The unique index has been there since Feb 2003 (the Singapore Biohackathon). I'm not sure how you got a version that doesn't have it. 
The unique key constraint on the identifier column is also necessary - otherwise you cannot guarantee lookups by the NCBI taxonID to return either one or zero rows. Like Peter and Richard, I also don't understand what the point would be in allowing the same taxon (which in essence is a node), as identified by taxonID, to exist more than once. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From rjalves at igc.gulbenkian.pt Wed Mar 26 10:11:49 2008 From: rjalves at igc.gulbenkian.pt (Renato Alves) Date: Wed, 26 Mar 2008 14:11:49 +0000 Subject: [BioPython] Bio.Cluster - Howto, Documentation, exporting results In-Reply-To: <720801.11321.qm@web62412.mail.re1.yahoo.com> References: <720801.11321.qm@web62412.mail.re1.yahoo.com> Message-ID: <47EA59A5.8010501@igc.gulbenkian.pt> Thanks again for all the clarifications. This was very helpful. Renato Michiel de Hoon wrote: > > Additionally, as of BioPython-1.44, there are a couple of things > > mentioned in the documentation that are not available in Bio.Cluster. > > One of those is the Bio.Cluster.read function. I don't know if this is > > because it was not yet in BioPython-1.44 or if the documentation is > > outdated. > > Some changes were made in Bio.Cluster in Biopython 1.45. These are > largely cosmetic to make Bio.Cluster more consistent with other > modules in Biopython. One of them is the read() function, which was > added in Biopython 1.45. I have now updated the documentation for > Bio.Cluster on the Biopython website; it corresponds to Biopython 1.45. > > > I don't read the data from files, so I don't understand if DataFile > > class is what I need, and if it is, how do I make use of it. > > > What I'm trying to do is to calculate the distances between some > > multidimensional vectors and then cluster them. 
I managed to do that, > > but then I don't know what to do with the Tree object I get. It's also > > not obvious how do I keep track of which values in the Tree object > > correspond to which entries in the distance matrix or in the > original data. > > The values in the Tree object, if non-negative, simply correspond to > the row number in the distance matrix. If negative, they correspond to > a node number. So if the Tree object is > [1, 2] --> This is Node # -1 > [-1,0] --> This is node # -2 > then first row 1 and row 2 in the distance matrix are joined, and then > row 0 in the distance matrix is joined to the node [1,2]. > > > Is it possible to pass text in the original data so that it is used as > > some sort of identifying header in later operations? > > Instead of relying on the row numbers, you can also create an empty > Bio.Cluster.Record object and fill this object with the data you have. > Bio.Cluster.Record is essentially the same as Bio.Cluster.DataFile, > just the name was changed for consistency with other Biopython > modules. It may be a good idea to look at the documentation of Cluster > 3 at > http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/manual/index.html > to understand what all the fields in Bio.Cluster.Record are. > > Another way is to construct a file in memory and let Bio.Cluster.read > parse it. > >>> lines = > "Start\tCol0\tCol1\tCol2\nRow0\t2.0\t1.2\t3.4\nRow1\t5.0\t6.2\t7.1\nRow2\t2.3\t5.6\t1.2\n" > >>> print lines > Start Col0 Col1 Col2 > Row0 2.0 1.2 3.4 > Row1 5.0 6.2 7.1 > Row2 2.3 5.6 1.2 > >>> import StringIO > >>> handle = StringIO.StringIO(lines) > >>> record = Cluster.read(handle) > >>> tree = record.treecluster() > > > How can I export the Tree object to something like the treeview format > > mentioned in the documentation? > > >>> record.save("myfilename", tree) > > > Is there any way to visualize the tree directly using ASCII or > something > > more graphical? 
> > Currently, there is no ASCII art -like representation to visualize the
> tree. So the easiest solution is to save the clustering solution in
> the treeview format, and use Java TreeView to visualize it.
>
> --Michiel.
>

From alexl at users.sourceforge.net Thu Mar 27 07:21:46 2008
From: alexl at users.sourceforge.net (Alex Lancaster)
Date: Thu, 27 Mar 2008 04:21:46 -0700
Subject: [BioPython] Fedora packages for 1.45 (was Re: Biopython release 1.45)
In-Reply-To: <901773.64728.qm@web62408.mail.re1.yahoo.com> (Michiel de Hoon's message of "Sat\, 22 Mar 2008 04\:02\:38 -0700 \(PDT\)")
References: <901773.64728.qm@web62408.mail.re1.yahoo.com>
Message-ID:

>>>>> "" == Michiel de Hoon writes:

> We are pleased to announce the release of Biopython 1.45. This
> release includes numerous code improvements and fixes, including
> in Bio.Seq, Bio.SeqIO, Bio.Entrez, Bio.PopGen, Bio.SwissProt,
> Bio.Cluster, Bio.SCOP, Bio.InterPro, Bio.GenBank, Bio.ExPASy,
> BioSQL, and the Biopython documentation. Too many to list them
> all here!

> Source distributions and Windows installers are available from
> the Biopython website at http://biopython.org. My thanks to all
> code contributers who made this new release possible.

Hi there,

For all Fedora users, new packages for biopython 1.45 are now available in the "updates-testing" repository for F-7 and F-8.

To test them out simply run (as root):

yum --enablerepo=updates-testing install python-biopython

Please provide feedback on packages here:

https://admin.fedoraproject.org/updates/F8/FEDORA-2008-2750
https://admin.fedoraproject.org/updates/F7/FEDORA-2008-2724

The more positive feedback from testing, the faster the packages can go into the stable "updates" repo (or conversely if there are any problems, they can be fixed before being pushed).

Thanks!
Alex From bsantos at biocant.pt Thu Mar 27 13:32:53 2008 From: bsantos at biocant.pt (Bruno Santos) Date: Thu, 27 Mar 2008 17:32:53 -0000 Subject: [BioPython] FW: How to Bio.cluster Module Message-ID: <002501c89030$982ec110$c88c4330$@pt> Hi, This question is a little bit more generic so I really don't know if anyone in the mailing may help me. I have a fasta file with thousands of reads obtained by a sequencing run, in this fasta file I know I have several copies of the same sequences but their size and some nucleotides inside it can change. So I need to group them together using clustering so then I can create a consensus sequence for each group. I am trying to achieve this by align all the sequences using clustalw-mpi and the I run dnadist from phylip to obtain a matrix of distances between the sequences. Now I need to use clustering to group the sequences based on these values and for that I am trying to use Bio.cluster to achieve this. Can anyone help me to choose the clustering method I should use and how can I submit this kind of data to that method? Sincerely, Bruno Santos From bsantos at biocant.pt Thu Mar 27 13:33:55 2008 From: bsantos at biocant.pt (Bruno Santos) Date: Thu, 27 Mar 2008 17:33:55 -0000 Subject: [BioPython] How to use Bio.cluster Module to assembly dna sequences Message-ID: <002a01c89030$bd4e14a0$37ea3de0$@pt> Hi, This question is a little bit more generic so I really don't know if anyone in the mailing may help me. I have a fasta file with thousands of reads obtained by a sequencing run, in this fasta file I know I have several copies of the same sequences but their size and some nucleotides inside it can change. So I need to group them together using clustering so then I can create a consensus sequence for each group. I am trying to achieve this by align all the sequences using clustalw-mpi and the I run dnadist from phylip to obtain a matrix of distances between the sequences. 
Now I need to use clustering to group the sequences based on these values and for that I am trying to use Bio.cluster to achieve this. Can anyone help me to choose the clustering method I should use and how can I submit this kind of data to that method? Sincerely, Bruno Santos From bsantos at biocant.pt Thu Mar 27 13:25:52 2008 From: bsantos at biocant.pt (Bruno Santos) Date: Thu, 27 Mar 2008 17:25:52 -0000 Subject: [BioPython] How to Bio.cluster Module Message-ID: <001f01c8902f$9d30e8b0$d792ba10$@pt> Hi, This question is a little bit more generic so I really don?t know if anyone in the mailing may help me. I have a fasta file with thousands of reads obtained by a sequencing run, in this fasta file I know I have several copies of the same sequences but their size and some nucleotides inside it can change. So I need to group them together using clustering so then I can create a consensus sequence for each group. I am trying to achieve this by align all the sequences using clustalw-mpi and the I run dnadist from phylip to obtain a matrix of distances between the sequences. Now I need to use clustering to group the sequences based on these values and for that I am trying to use Bio.cluster to achieve this. Can anyone help me to choose the clustering method I should use and how can I submit this kind of data to that method? Sincerely, Bruno Santos logo_biocant Bioinformatics Unit Biocantpark, N?cleo 04, Lote 3 3060-197 Cantanhede Tel: 231 410 892 E-mail: bsantos at biocant.pt http://bioinformatics.biocant.pt -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: image/gif Size: 2271 bytes Desc: not available URL: From jblanca at btc.upv.es Fri Mar 28 03:38:46 2008 From: jblanca at btc.upv.es (Jose Blanca) Date: Fri, 28 Mar 2008 08:38:46 +0100 Subject: [BioPython] How to use Bio.cluster Module to assembly dna sequences In-Reply-To: <002a01c89030$bd4e14a0$37ea3de0$@pt> References: <002a01c89030$bd4e14a0$37ea3de0$@pt> Message-ID: <200803280838.47015.jblanca@btc.upv.es> Hi, I don't know the details of your sequences, but for the assembly that you want to do there could be better methods. I have done this kind of assemblies with ESTs sequences and for that porpouse I have used cap3 or tgicl. Best regards, Jose Blanca On Thursday 27 March 2008 18:33:55 Bruno Santos wrote: > Hi, > > This question is a little bit more generic so I really don't know if anyone > in the mailing may help me. > > I have a fasta file with thousands of reads obtained by a sequencing run, > in this fasta file I know I have several copies of the same sequences but > their size and some nucleotides inside it can change. So I need to group > them together using clustering so then I can create a consensus sequence > for each group. > > I am trying to achieve this by align all the sequences using clustalw-mpi > and the I run dnadist from phylip to obtain a matrix of distances between > the sequences. Now I need to use clustering to group the sequences based on > these values and for that I am trying to use Bio.cluster to achieve this. > Can anyone help me to choose the clustering method I should use and how can > I submit this kind of data to that method? > > > > Sincerely, > > Bruno Santos > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython -- Jose M. 
Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From taleinat at gmail.com Sat Mar 29 08:38:39 2008 From: taleinat at gmail.com (Tal Einat) Date: Sat, 29 Mar 2008 15:38:39 +0300 Subject: [BioPython] How to test Sequence objects for equality? Message-ID: <7afdee2f0803290538p1ef1d0b2tbeaef9398362cc26@mail.gmail.com> Hello, I'm new to BioPython, but I've managed to stumble in my very first steps. Could someone help explain this behavior? >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> Seq('A', IUPAC.unambiguous_dna) == Seq('A', IUPAC.unambiguous_dna) False >>> My current goal is to search for (possibly ambiguous) matching sequences in an efficient manner, but I haven't found docs or a tutorial which cover this. (WinXP SP2, Python2.5, BioPython 1.45) - Tal From idoerg at gmail.com Sat Mar 29 09:24:03 2008 From: idoerg at gmail.com (Iddo Friedberg) Date: Sat, 29 Mar 2008 06:24:03 -0700 Subject: [BioPython] How to test Sequence objects for equality? In-Reply-To: <7afdee2f0803290538p1ef1d0b2tbeaef9398362cc26@mail.gmail.com> References: <7afdee2f0803290538p1ef1d0b2tbeaef9398362cc26@mail.gmail.com> Message-ID: Tal, Seq types do not support a comparison function. The reason is that it is not very common to perform a 100% identity on two sequences. You can just extract the strings and compare. The more common case is seqeunce alignment, and Biopython does support that. You can use Bio.pairwise2 (documentation in the module source code, not in the cookbook). Or for multiple alignments you can call ClustalX externally (the tutorial / cookbook explains that). HTH, Iddo On Sat, Mar 29, 2008 at 5:38 AM, Tal Einat wrote: > Hello, > > I'm new to BioPython, but I've managed to stumble in my very first > steps. Could someone help explain this behavior? 
> > >>> from Bio.Seq import Seq > >>> from Bio.Alphabet import IUPAC > >>> Seq('A', IUPAC.unambiguous_dna) == Seq('A', IUPAC.unambiguous_dna) > False > >>> > > My current goal is to search for (possibly ambiguous) matching > sequences in an efficient manner, but I haven't found docs or a > tutorial which cover this. > > (WinXP SP2, Python2.5, BioPython 1.45) > > - Tal > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- Iddo Friedberg, Ph.D. CALIT2, mail code 0440 University of California, San Diego 9500 Gilman Drive La Jolla, CA 92093-0440, USA T: +1 (858) 534-0570 T: +1 (858) 646-3100 x3516 http://iddo-friedberg.org From taleinat at gmail.com Sat Mar 29 10:06:59 2008 From: taleinat at gmail.com (Tal Einat) Date: Sat, 29 Mar 2008 17:06:59 +0300 Subject: [BioPython] How to test Sequence objects for equality? In-Reply-To: References: <7afdee2f0803290538p1ef1d0b2tbeaef9398362cc26@mail.gmail.com> Message-ID: <7afdee2f0803290706w657566eau6aaa1a32a01826a0@mail.gmail.com> Iddo Friedberg wrote: > > On Sat, Mar 29, 2008 at 5:38 AM, Tal Einat wrote: > > > > Hello, > > > > I'm new to BioPython, but I've managed to stumble in my very first > > steps. Could someone help explain this behavior? > > > > >>> from Bio.Seq import Seq > > >>> from Bio.Alphabet import IUPAC > > >>> Seq('A', IUPAC.unambiguous_dna) == Seq('A', IUPAC.unambiguous_dna) > > False > > >>> > > > > My current goal is to search for (possibly ambiguous) matching > > sequences in an efficient manner, but I haven't found docs or a > > tutorial which cover this. > > Seq types do not support a comparison function. The reason is that it is not > very common to perform a 100% identity on two sequences. You can just > extract the strings and compare. > > > The more common case is seqeunce alignment, and Biopython does support that. 
> You can use Bio.pairwise2 (documentation in the module source code, not in > the cookbook). Or for multiple alignments you can call ClustalX externally > (the tutorial / cookbook explains that). Hello Iddo, thank you for the quick response! Extracting the strings and comparing is good for exact matches, but I also need to match sequences with ambiguities. Is there no such function in BioPython? Unfortunately sequence alignment is not what I'm trying to do, so much so that I can't think of a way to transform my problem into a sequence alignment problem. I really do need to compare pairs of sequences one by one, as efficiently as possible. On a side note, I was surprised by having == return False for identical sequences. To make BioPython less confusing, may I suggest either disabling comparison of sequences or making such comparison do the Right Thing? - Tal From biopython at maubp.freeserve.co.uk Sat Mar 29 12:05:31 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 29 Mar 2008 16:05:31 +0000 Subject: [BioPython] How to test Sequence objects for equality? In-Reply-To: <7afdee2f0803290538p1ef1d0b2tbeaef9398362cc26@mail.gmail.com> References: <7afdee2f0803290538p1ef1d0b2tbeaef9398362cc26@mail.gmail.com> Message-ID: <320fb6e00803290905m223f4556x3cf857b7e257a44a@mail.gmail.com> On Sat, Mar 29, 2008 at 12:38 PM, Tal Einat wrote: > Hello, > > I'm new to BioPython, but I've managed to stumble in my very first > steps. Could someone help explain this behavior? > > >>> from Bio.Seq import Seq > >>> from Bio.Alphabet import IUPAC > >>> Seq('A', IUPAC.unambiguous_dna) == Seq('A', IUPAC.unambiguous_dna) > False This is a little tricky because Biopython would have to be able to decide sequence equality based on a combination of the sequence and the alphabets. 
For example, which of the following would you say are equal:

Seq('A', IUPAC.unambiguous_dna)
Seq('A', IUPAC.ambiguous_dna)
Seq('A', IUPAC.unambiguous_rna)
Seq('A', IUPAC.ambiguous_dna)
Seq('A', IUPAC.protein)
etc

In this sort of work, you probably won't be trying to compare DNA to RNA, or to proteins - all you care about is the sequence string itself. So compare that:

from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
alpha = Seq('ACG', IUPAC.unambiguous_dna)
beta = Seq('ACG', IUPAC.ambiguous_dna)
gamma = Seq('ACN', IUPAC.ambiguous_dna)
print str(alpha) == str(beta)
print str(beta) == str(gamma)

NOTE - If you are using an older version of Biopython, do this instead:

print alpha.tostring() == beta.tostring()
print beta.tostring() == gamma.tostring()

Peter

From biopython at maubp.freeserve.co.uk Sat Mar 29 12:13:22 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 29 Mar 2008 16:13:22 +0000
Subject: [BioPython] How to test Sequence objects for equality?
In-Reply-To: <7afdee2f0803290706w657566eau6aaa1a32a01826a0@mail.gmail.com>
References: <7afdee2f0803290538p1ef1d0b2tbeaef9398362cc26@mail.gmail.com> <7afdee2f0803290706w657566eau6aaa1a32a01826a0@mail.gmail.com>
Message-ID: <320fb6e00803290913j2295d4c1nae7af599818eb69c@mail.gmail.com>

> Hello Iddo, thank you for the quick response!
>
> Extracting the strings and comparing is good for exact matches, but I
> also need to match sequences with ambiguities. Is there no such
> function in BioPython?
>
> Unfortunately sequence alignment is not what I'm trying to do, so much
> so that I can't think of a way to transform my problem into a sequence
> alignment problem. I really do need to compare pairs of sequences one
> by one, as efficiently as possible.

So you want to know if two ambiguous sequences are "compatible"?
In some cases that looks simple and well defined:

ACT and ACA -> False
ACT and ACN -> True
ACY and ACN -> True
ACY and ACR -> False
ACY and ACM -> Maybe

That last example is about doubly ambiguous comparisons like Y (T or C) and M (A or C). If they both are really a C, then yes, ACY and ACM would be compatible. But they might not be.

> On a side note, I was surprised by having == return False for
> identical sequences. To make BioPython less confusing, may I suggest
> either disabling comparison of sequences or making such comparison do
> the Right Thing?

As I tried to explain in my other email - I don't think there is a clear "Right Thing" that would suit everyone. So maybe you are right - some sort of exception would make sense...

Peter

From biopython at maubp.freeserve.co.uk Sat Mar 29 12:31:38 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 29 Mar 2008 16:31:38 +0000
Subject: [BioPython] How to test Sequence objects for equality?
In-Reply-To: <320fb6e00803290913j2295d4c1nae7af599818eb69c@mail.gmail.com>
References: <7afdee2f0803290538p1ef1d0b2tbeaef9398362cc26@mail.gmail.com> <7afdee2f0803290706w657566eau6aaa1a32a01826a0@mail.gmail.com> <320fb6e00803290913j2295d4c1nae7af599818eb69c@mail.gmail.com>
Message-ID: <320fb6e00803290931j4efb90a6o42820ed0a6e49116@mail.gmail.com>

I wrote:
> So you want to know if two ambiguous sequences are "compatible"? ...

I realised I wasn't being entirely consistent:

ACT and ACT -> True
ACT and ACA -> False
ACT and ACR -> False (R = G or A)
ACT and ACN -> Maybe
ACY and ACN -> Maybe
ACY and ACR -> False (Y = T or C, R = G or A)
ACY and ACM -> Maybe

So a boolean function which returns True when two ambiguous sequences could be equal is actually possible.

On the implementation, rather than generating all possible non-ambiguous interpretations of the sequences and looking for a match, we'd just need to tabulate all the possible pairwise combinations of letters.
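
For the DNA case, that tabulation might look something like the following. This is a minimal sketch on plain strings, not existing Biopython code: the names IUPAC_DNA, COMPATIBLE and could_be_equal are made up for illustration, and the "Maybe" cases above come out as True because the function only asks whether the two sequences could be equal.

```python
# IUPAC nucleotide ambiguity codes (DNA): each letter mapped to the bases it can stand for.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Tabulate once every pair of letters that share at least one possible base.
COMPATIBLE = set(
    (a, b)
    for a in IUPAC_DNA
    for b in IUPAC_DNA
    if set(IUPAC_DNA[a]) & set(IUPAC_DNA[b])
)


def could_be_equal(seq1, seq2):
    """Return True if two (possibly ambiguous) DNA strings could be the same sequence."""
    if len(seq1) != len(seq2):
        return False
    for a, b in zip(seq1.upper(), seq2.upper()):
        # Letters outside the table are never in COMPATIBLE, so they count as a mismatch.
        if (a, b) not in COMPATIBLE:
            return False
    return True
```

To use this with Seq objects, pass str(my_seq) for each argument. The lookup is one set membership test per position, so comparing many pairs one by one stays cheap.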
We'd also need to do this for DNA vs DNA, RNA vs RNA and Protein vs Protein. This could probably live somewhere in Bio.SeqUtils

Peter

From mjldehoon at yahoo.com Sat Mar 29 23:01:33 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sat, 29 Mar 2008 20:01:33 -0700 (PDT)
Subject: [BioPython] FW: How to Bio.cluster Module
In-Reply-To: <002501c89030$982ec110$c88c4330$@pt>
Message-ID: <176093.41502.qm@web62410.mail.re1.yahoo.com>

I agree with Jose that Bio.Cluster is probably not the best tool for this kind of alignment. If you do want to use Bio.Cluster, first you have to decide how you want to define the distance or similarity between sequences. Then, create a distance matrix that stores all these distances, and apply Cluster.treecluster on this distance matrix to get a hierarchical clustering of the sequences:

>>> from Bio import Cluster
>>> d = [[],[2.0],[3.0,4.0]]   # Your distance matrix
# Distance between seq1 and seq2 is 2.0
# Distance between seq1 and seq3 is 3.0
# Distance between seq2 and seq3 is 4.0
>>> print Cluster.treecluster(distancematrix=d)
(1, 0): 2
(2, -1): 4
# First, join seq1 and seq2 at distance 2.0
# Then, join seq3 with the node (seq1, seq2).

--Michiel

> Hi,
> I don't know the details of your sequences, but for the assembly that you want
> to do there could be better methods. I have done this kind of assemblies with
> ESTs sequences and for that porpouse I have used cap3 or tgicl.
> Best regards,
> Jose Blanca

Bruno Santos wrote:

Hi,

This question is a little bit more generic so I really don't know if anyone in the mailing may help me.

I have a fasta file with thousands of reads obtained by a sequencing run, in this fasta file I know I have several copies of the same sequences but their size and some nucleotides inside it can change. So I need to group them together using clustering so then I can create a consensus sequence for each group.
I am trying to achieve this by align all the sequences using clustalw-mpi and the I run dnadist from phylip to obtain a matrix of distances between the sequences. Now I need to use clustering to group the sequences based on these values and for that I am trying to use Bio.cluster to achieve this. Can anyone help me to choose the clustering method I should use and how can I submit this kind of data to that method?

Sincerely,

Bruno Santos

_______________________________________________
BioPython mailing list - BioPython at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython

From sbassi at gmail.com Sun Mar 30 21:28:26 2008
From: sbassi at gmail.com (Sebastian Bassi)
Date: Sun, 30 Mar 2008 22:28:26 -0300
Subject: [BioPython] Why _num_letters_in_database XMLBLAST parser?
Message-ID:

Hello,

I am working on the XML formatter (to format BLAST XML to HTML).
I found that the num_letters_in_database data is stored as a "private" attribute.
When I want to retrieve the num_letters_in_database I have to do:

>>> from Bio.Blast import NCBIXML
>>> fr=NCBIXML.parse(open(f_in)).next()
>>> fr._num_letters_in_database # SEE THE _ before num_letters_in_database
4662239

Instead of:

>>> fr.num_letters_in_database
[]

Like this one:

>>> fr.num_sequences_in_database
400

Here is the code with highlighting:
http://www.pastecode.com.ar/f542368d4

Original sourcecode from here:
http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Blast/NCBIXML.py?rev=1.14&cvsroot=biopython&content-type=text/vnd.viewcvs-markup
(short URL http://tinyurl.com/3dk5mw)

--
Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6
Bioinformatics news: http://www.bioinformatica.info
Tutorial libre de Python: http://tinyurl.com/2az5d5

From sbassi at gmail.com Mon Mar 31 02:26:05 2008
From: sbassi at gmail.com (Sebastian Bassi)
Date: Mon, 31 Mar 2008 03:26:05 -0300
Subject: [BioPython] sc_match and sc_mismatch are empty.
Message-ID:

sc_match and sc_mismatch are empty when there should be: 1 and -3.

>>> fr=NCBIXML.parse(open(f_in)).next()
>>> fr.version
u'2.2.17'
>>> fr.application
u'BLASTN'
>>> fr.expect
u'10'
>>> fr.sc_match
>>> fr.sc_mismatch

The xml file is here:
http://www.pastecode.com.ar/f7a880f5a
(click here to download: http://www.pastecode.com.ar/pastebin.php?dl=f7a880f5a)

--
Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6
Bioinformatics news: http://www.bioinformatica.info
Tutorial libre de Python: http://tinyurl.com/2az5d5

From biopython at maubp.freeserve.co.uk Mon Mar 31 07:23:05 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 31 Mar 2008 12:23:05 +0100
Subject: [BioPython] Why _num_letters_in_database XMLBLAST parser?
In-Reply-To:
References:
Message-ID: <320fb6e00803310423q6d70eb1ft2fb13d4c78032a86@mail.gmail.com>

On Mon, Mar 31, 2008 at 2:28 AM, Sebastian Bassi wrote:
> Hello,
>
> I am working at the xml formater (to format BLAST xml to html).
> I found that the num_letters_in_database data is stored as a "private"
> attribute.
>
> When I want to retrieve the num_letters_in_database I have to do:
>
> >>> fr=NCBIXML.parse(open(f_in)).next()
> >>> fr._num_letters_in_database # SEE THE _ before num_letters_in_database
> 4662239

That looks like a bug to me. Both the old text parser in Bio.Blast.NCBIStandalone and Bio.Blast.Record use num_letters_in_database with no leading underscore.

If there are no objections, I'm happy to make that change.

Peter

From biopython at maubp.freeserve.co.uk Mon Mar 31 07:30:32 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 31 Mar 2008 12:30:32 +0100
Subject: [BioPython] sc_match and sc_mismatch are empty.
In-Reply-To:
References:
Message-ID: <320fb6e00803310430oc924a74j715999eea5ebff0d@mail.gmail.com>

Hi Sebastian,

That sounds like a bug - could you file this issue on Bugzilla please?

You could also try this with the example Tests/Blast/xbt002.xml (from Blast 2.2.12) in our unit tests, just to see if the XML format has changed.

Peter

On Mon, Mar 31, 2008 at 7:26 AM, Sebastian Bassi wrote:
> sc_match and sc_mismatch are empty when they should be 1 and -3.
>
> >>> fr=NCBIXML.parse(open(f_in)).next()
> >>> fr.version
> u'2.2.17'
> >>> fr.application
> u'BLASTN'
> >>> fr.expect
> u'10'
> >>> fr.sc_match
> >>> fr.sc_mismatch
>
> The xml file is here:
> http://www.pastecode.com.ar/f7a880f5a
> (click here to download: http://www.pastecode.com.ar/pastebin.php?dl=f7a880f5a)
>

From sdavis2 at mail.nih.gov Mon Mar 31 07:52:57 2008
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Mon, 31 Mar 2008 07:52:57 -0400
Subject: [BioPython] [Biopython-dev] Bio.Entrez XML parsing
In-Reply-To: <264855a00803310452k1e029031jd3847c52f082c26f@mail.gmail.com>
References: <864047.785.qm@web62410.mail.re1.yahoo.com> <264855a00803301751h270ee34dg86325eb1af298369@mail.gmail.com> <320fb6e00803310425u478fc938w2ff426c4eae32d99@mail.gmail.com> <264855a00803310452k1e029031jd3847c52f082c26f@mail.gmail.com>
Message-ID: <264855a00803310452v3b9425a0le3ec41de9d236210@mail.gmail.com>

On Mon, Mar 31, 2008 at 7:52 AM, Sean Davis wrote:
>
> On Mon, Mar 31, 2008 at 7:25 AM, Peter wrote:
> > On Mon, Mar 31, 2008 at 1:51 AM, Sean Davis wrote:
> > > This makes sense. However, it seems that there needs to be a way to
> > > "register" a parser with read() so that users can extend their local
> > > installation with a specialized parser. In other words, it seems that
> > > a way to dynamically register a parser with read() would be helpful.
> > > Or am I missing something?
> >
> > I like Michiel's plan. The mapping could be as simple as a (private)
> > dictionary in Bio.Entrez, mapping formats to parser objects/functions
> > - as done in Bio.SeqIO - which lets the user add new parsers or
> > override the built in ones should they so desire.
>
> That sounds like it would also work just fine.

Forgot to send to the list, also.

From mjldehoon at yahoo.com Mon Mar 31 07:54:39 2008
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Mon, 31 Mar 2008 04:54:39 -0700 (PDT)
Subject: [BioPython] Why _num_letters_in_database XMLBLAST parser?
In-Reply-To: <320fb6e00803310423q6d70eb1ft2fb13d4c78032a86@mail.gmail.com>
Message-ID: <364830.83096.qm@web62412.mail.re1.yahoo.com>

Also see bug report 2176 for other naming inconsistencies between the different Blast parsers.
http://bugzilla.open-bio.org/show_bug.cgi?id=2176

Peter wrote:
On Mon, Mar 31, 2008 at 2:28 AM, Sebastian Bassi wrote:
> Hello,
>
> I am working on the XML formatter (to format BLAST XML to HTML).
> I found that the num_letters_in_database data is stored as a "private"
> attribute.
>
> When I want to retrieve the num_letters_in_database I have to do:
>
> >>> fr=NCBIXML.parse(open(f_in)).next()
> >>> fr._num_letters_in_database # SEE THE _ before num_letters_in_database
> 4662239

That looks like a bug to me. Both the old text parser in Bio.Blast.NCBIStandalone and Bio.Blast.Record use num_letters_in_database with no leading underscore.

If there are no objections, I'm happy to make that change.

Peter

From biopython at maubp.freeserve.co.uk Mon Mar 31 10:13:06 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 31 Mar 2008 15:13:06 +0100
Subject: [BioPython] Why _num_letters_in_database XMLBLAST parser?
In-Reply-To: <320fb6e00803310423q6d70eb1ft2fb13d4c78032a86@mail.gmail.com>
References: <320fb6e00803310423q6d70eb1ft2fb13d4c78032a86@mail.gmail.com>
Message-ID: <320fb6e00803310713q521811cbv5490b4fa7f03bc42@mail.gmail.com>

On Mon, Mar 31, 2008 at 12:23 PM, Peter wrote:
> On Mon, Mar 31, 2008 at 2:28 AM, Sebastian Bassi wrote:
> > Hello,
> >
> > I am working on the XML formatter (to format BLAST XML to HTML).
> > I found that the num_letters_in_database data is stored as a "private"
> > attribute.
>
> That looks like a bug to me.
> The old text parser in
> Bio.Blast.NCBIStandalone and Bio.Blast.Record both use
> num_letters_in_database with no leading underscore.
>
> If there are no objections, I'm happy to make that change.

As Michiel didn't object, I've fixed this in CVS. Thanks for the alert, Sebastian.

Peter

From biopython at maubp.freeserve.co.uk Mon Mar 31 10:22:27 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 31 Mar 2008 15:22:27 +0100
Subject: [BioPython] sc_match and sc_mismatch are empty.
In-Reply-To: <320fb6e00803310430oc924a74j715999eea5ebff0d@mail.gmail.com>
References: <320fb6e00803310430oc924a74j715999eea5ebff0d@mail.gmail.com>
Message-ID: <320fb6e00803310722x6d3f47fdqd477b1b6856a19ab@mail.gmail.com>

On Mon, Mar 31, 2008 at 12:30 PM, Peter wrote:
> Hi Sebastian,
>
> That sounds like a bug - could you file this issue on Bugzilla please?
>
> You could also try this with the example Tests/Blast/xbt002.xml in our
> unit tests from Blast 2.2.12 just to see if the XML format has
> changed.

Never mind about filing the bug on Bugzilla; I've fixed this in CVS. It was a simple oversight in the code, which didn't copy the sc_match and sc_mismatch values over to the results.

Peter

From sbassi at gmail.com Mon Mar 31 10:39:03 2008
From: sbassi at gmail.com (Sebastian Bassi)
Date: Mon, 31 Mar 2008 11:39:03 -0300
Subject: [BioPython] Why _num_letters_in_database XMLBLAST parser?
In-Reply-To: <320fb6e00803310713q521811cbv5490b4fa7f03bc42@mail.gmail.com>
References: <320fb6e00803310423q6d70eb1ft2fb13d4c78032a86@mail.gmail.com> <320fb6e00803310713q521811cbv5490b4fa7f03bc42@mail.gmail.com>
Message-ID:

On Mon, Mar 31, 2008 at 11:13 AM, Peter wrote:
> As Michiel didn't object, I've fixed this in CVS. Thanks for the
> alert, Sebastian.

Could you please ADD the old code? I mean, keep both num_letters_in_database and _num_letters_in_database. My code has to run on the current versions (1.44 and 1.45, with _) but I don't want it to break in the next version.
I think that leaving the old code (with _) won't affect anybody, and you can mark it with a comment saying that it is kept for backward compatibility.

Best,
sB.

--
Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6
Bioinformatics news: http://www.bioinformatica.info
Tutorial libre de Python: http://tinyurl.com/2az5d5

From biopython at maubp.freeserve.co.uk Mon Mar 31 11:02:05 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 31 Mar 2008 16:02:05 +0100
Subject: [BioPython] Why _num_letters_in_database XMLBLAST parser?
In-Reply-To:
References: <320fb6e00803310423q6d70eb1ft2fb13d4c78032a86@mail.gmail.com> <320fb6e00803310713q521811cbv5490b4fa7f03bc42@mail.gmail.com>
Message-ID: <320fb6e00803310802u745d952ey5c878c7ddc1b955b@mail.gmail.com>

On Mon, Mar 31, 2008 at 3:39 PM, Sebastian Bassi wrote:
> On Mon, Mar 31, 2008 at 11:13 AM, Peter wrote:
> > As Michiel didn't object, I've fixed this in CVS. Thanks for the
> > alert Sebastian.
>
> Could you please ADD the old code? I mean, keep both
> num_letters_in_database and _num_letters_in_database. My code has to
> run in current version (1.44 and 1.45 with _) but don't want it to
> break in the next version. I think that leaving the old code (with _)
> won't affect anybody, and you can mark it with a comment that this
> code is kept for backward compatibility.

I'd rather not. Firstly, old code shouldn't be using "private" attributes anyway. Secondly, as far as I know, you are the only person this would affect, so it would be easier to fix it in your code. Something like:

    try:
        count = record.num_letters_in_database
    except AttributeError:
        # Hack for Biopython 1.45 or earlier
        count = record._num_letters_in_database

Of course, if there are lots of people out there also using _num_letters_in_database then I guess we could support both for the next release.
Peter

From sbassi at gmail.com Mon Mar 31 11:24:28 2008
From: sbassi at gmail.com (Sebastian Bassi)
Date: Mon, 31 Mar 2008 12:24:28 -0300
Subject: [BioPython] Why _num_letters_in_database XMLBLAST parser?
In-Reply-To: <320fb6e00803310802u745d952ey5c878c7ddc1b955b@mail.gmail.com>
References: <320fb6e00803310423q6d70eb1ft2fb13d4c78032a86@mail.gmail.com> <320fb6e00803310713q521811cbv5490b4fa7f03bc42@mail.gmail.com> <320fb6e00803310802u745d952ey5c878c7ddc1b955b@mail.gmail.com>
Message-ID:

On Mon, Mar 31, 2008 at 12:02 PM, Peter wrote:
> I'd rather not. Firstly, old code shouldn't be using "private"
> attributes anyway. Secondly, as far as I know, you are the only
> person this would affect, so it would be easier to fix it in your
> code. Something like:
> try:
>     count = record.num_letters_in_database

You are right; after sending my email I thought about this. I will use the try/except as you suggested.

Best,
SB.

--
Curso Biologia Molecular para programadores: http://tinyurl.com/2vv8w6
Bioinformatics news: http://www.bioinformatica.info
Tutorial libre de Python: http://tinyurl.com/2az5d5
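
For anyone reusing the fallback discussed in this thread, here is a minimal sketch of it wrapped in a helper. The function name get_num_letters and the two stand-in classes are hypothetical (they are not part of Biopython); only the attribute names num_letters_in_database and _num_letters_in_database come from the messages above. Because the transcript earlier in the thread shows fr.num_letters_in_database returning [] rather than raising an error, the sketch treats an empty list as "not set", in addition to a missing attribute. That handling is an assumption based on that one transcript, not on documented behaviour.

```python
def get_num_letters(record):
    """Return the database letter count from a parsed BLAST record.

    Prefers the public attribute and falls back to the
    underscore-prefixed name used by older releases.
    """
    value = getattr(record, "num_letters_in_database", None)
    # Assumption: an empty list means the public attribute was never filled in
    if value is None or value == []:
        value = getattr(record, "_num_letters_in_database", value)
    return value


# Hypothetical stand-ins for parsed records (NOT Biopython classes)
class NewStyleRecord:
    num_letters_in_database = 4662239


class OldStyleRecord:
    num_letters_in_database = []        # public name present but empty
    _num_letters_in_database = 4662239  # value stored under the private name


if __name__ == "__main__":
    for rec in (NewStyleRecord(), OldStyleRecord()):
        print(get_num_letters(rec))
```

Both stand-ins yield the same count through the helper, so calling code does not need to know which release produced the record. If neither attribute holds a value, the helper returns whatever the public attribute gave (None or []), which the caller should check for.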