From bugzilla-daemon at portal.open-bio.org Wed Jan 3 08:15:00 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 3 Jan 2007 08:15:00 -0500 Subject: [Biopython-dev] [Bug 2174] New: FDist Support in BioPython Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2174 Summary: FDist Support in BioPython Product: Biopython Version: 1.24 Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: tiagoantao at gmail.com This is an enhancement bug to submit code related to fdist2 http://www.rubic.rdg.ac.uk/~mab/software.html -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jan 3 08:15:18 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 3 Jan 2007 08:15:18 -0500 Subject: [Biopython-dev] [Bug 2174] FDist Support in BioPython In-Reply-To: Message-ID: <200701031315.l03DFIGn007058@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2174 tiagoantao at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Severity|normal |enhancement -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Wed Jan 3 08:16:06 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 3 Jan 2007 08:16:06 -0500 Subject: [Biopython-dev] [Bug 2174] FDist Support in BioPython In-Reply-To: Message-ID: <200701031316.l03DG6qL007102@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2174 ------- Comment #1 from tiagoantao at gmail.com 2007-01-03 08:16 ------- Created an attachment (id=532) --> (http://bugzilla.open-bio.org/attachment.cgi?id=532&action=view) Code support fdist From tiagoantao at gmail.com Wed Jan 3 08:16:30 2007 From: tiagoantao at gmail.com (Tiago Antão) Date: Wed, 3 Jan 2007 13:16:30 +0000 Subject: [Biopython-dev] FDist: more Population Genetics code Message-ID: <6d941f120701030516m1adb3daeh6e4645121ba8679d@mail.gmail.com> Hi! I have submitted another enhancement bug, with support for FDist. It can generate and parse Fdist files and control the fdist applications. There are also a couple of utility functions. FDist is a niche application (mainly used to detect selection in animal genetics). Not the most fundamental one to support, but it is currently one that I am working on, thus, the code. Regarding my submitted code for GenePop, I have submitted a different version on bugzilla. The main difference is that I moved everything from Bio to Bio.PopGen. Before I continue putting code on bugzilla I would like to know if it is worthwhile doing it... Any opinions on the code submitted or if any changes are required? I would really like to continue converting my code to BioPython, but only if it has any possibility of ending up being useful/included in distribution somewhere in the future...
;) I am currently working on code related to SimCoal2, Arlequin and general statistics (Fst, heterozygosity, ...). Which will probably be ready quite soon (ie, next two weeks). This is more mainstream than FDist I have some other code lying around mainly related to HapMap, but I will only submit it after reviewing and reusing it again. This is more distant future ... like a couple of months. Tiago From bugzilla-daemon at portal.open-bio.org Wed Jan 3 16:38:39 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 3 Jan 2007 16:38:39 -0500 Subject: [Biopython-dev] [Bug 2051] XML Blast parser unusable with multiple queries and recent (2.2.13) blast - patch attached In-Reply-To: Message-ID: <200701032138.l03Lcdji028402@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2051 ------- Comment #13 from mdehoon at ims.u-tokyo.ac.jp 2007-01-03 16:38 ------- > Regardless, I do still see a > number of inconsistencies. Please submit a separate bug report (including your patch) for these inconsistencies. The current bug report is titled "XML Blast parser unusable with multiple queries and recent (2.2.13) blast - patch attached" With Peter's patch, we can now parse multiple blast queries, so I'd like to close this bug report. For future bug reports and patches: Try to handle separate bugs in separate bug reports and patches. For developers, when looking at a patch handling several issues at the same time, it's difficult to understand which parts of the patch are essential, which are good but non-essential, and which are code cleanup. Speaking for myself, I would probably have considered this patch earlier if it had been less convoluted. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Wed Jan 3 16:48:45 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 3 Jan 2007 16:48:45 -0500 Subject: [Biopython-dev] [Bug 2176] New: XML Blast parser: miscellaneous bug fixes and cleanup Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2176 Summary: XML Blast parser: miscellaneous bug fixes and cleanup Product: Biopython Version: Not Applicable Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: jmjoseph at andrew.cmu.edu This follows the discussion started in bug 2051. The blast XML parser does now work (Thanks!), but could still use a little work. Here's a list of the issues I can see now. I'll follow with patches to correct a few. In Record.py, HSP.identities, HSP.gaps, and HSP.positives are still defined as (None,None) tuples. However, in NCBIXML.py, these variables are set as integers. I don't see a point of a tuple at all, at least for NCBIXML. (I realize it is used in NCBIStandalone.py). Most importantly, the inconsistency makes it difficult to handle cases when the parameter is not set. It seems easiest, though, to just retain the tuple format. In the past, I worried that the order of tuple building for self._blast.gap_penalties or ka_params could cause the tuple to have an incorrect ordering. I seem to remember hitting an issue where the tuple was built with the wrong length, but I can't be specific. In general, it remains odd to me to not just use a list and set each element respectively. If necessary, one could convert to a tuple when finished or use some other approach that does not rely upon order. Why not use query_len, as defined in the XML file, or query_length instead of query_letters as a variable name? In BlastParser._end_Iteration, self._blast.query_letters is set. This is not defined/documented in the Parameters class in Record.py. 
Rather, query_length is defined there. In the Header class, though, the name query_letters is used. There also seems to be some confusion between num_letters_in_database, num_sequences_in_database, database_letters, and database_sequences. Note that even if this naming is not corrected, NCBIXML.py:186 is wrong with "self._blast_query_letters" rather than "self._blast.query_letters". Similarly, why store the bit score and E-value as 'bits' and '_hsp.expect'/'descr.e' rather than just using bit_score and evalue, as in the blast XML output? I make use of <Hsp_align-len> in 2.2.13. This value is missing entirely. The parsing of <Hit_id> and <Hit_def> is confusing. For example, <Hit_num>1</Hit_num> <Hit_id>gnl|BL_ORD_ID|0</Hit_id> <Hit_def>3377250</Hit_def> ... results in _hit.title set to "gnl|BL_ORD_ID|0 3377250". I would rather they remain separate (or both methods be used). This is certainly not an exhaustive list. I'm happy to provide another patch correcting many of these inconsistencies. At the very least, the variable names defined in Record.py should be used in NCBIXML.py. May I modify at least the above names to correspond more closely to the names used in the XML? I know I've found this particularly confusing. -Jacob
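[Editorial sketch, not part of the original thread.] Until the tuple-versus-integer inconsistency in HSP.identities, HSP.gaps, and HSP.positives is settled, calling code could normalize the value itself. The helper below is hypothetical (normalize_hsp_count is not a Biopython function) and assumes that the plain-text parser's tuple means (count, total aligned), while the XML parser stores a bare integer count.

```python
def normalize_hsp_count(value):
    """Return a (count, total) pair for an HSP field of unknown shape.

    Assumption: a tuple means (count, total aligned), a bare integer
    is the count alone, and None or (None, None) means "not set".
    """
    if value is None:
        return (None, None)
    if isinstance(value, tuple):
        if len(value) != 2:
            raise ValueError("expected a 2-tuple, got %r" % (value,))
        return value
    # Bare integer, as set by the XML parser: the total is unknown here.
    return (int(value), None)


def is_set(value):
    """True when the field carries an actual count in either format."""
    return normalize_hsp_count(value)[0] is not None
```

With this, a check such as is_set(hsp.identities) behaves the same whichever parser produced the record.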
From bugzilla-daemon at portal.open-bio.org Wed Jan 3 16:50:33 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 3 Jan 2007 16:50:33 -0500 Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug fixes and cleanup In-Reply-To: Message-ID: <200701032150.l03LoXp4028921@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2176 ------- Comment #1 from jmjoseph at andrew.cmu.edu 2007-01-03 16:50 ------- Created an attachment (id=533) --> (http://bugzilla.open-bio.org/attachment.cgi?id=533&action=view) Patch to NCBIXML.py These patches to NCBIXML and Record: * replace query_letters with query_length, * use tuples for _hsp.identities, positives, and gaps * store _hsp.align_length * separate the hit id and hit def elements. title is retained -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jan 3 16:50:53 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 3 Jan 2007 16:50:53 -0500 Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug fixes and cleanup In-Reply-To: Message-ID: <200701032150.l03Lorvn028958@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2176 ------- Comment #2 from jmjoseph at andrew.cmu.edu 2007-01-03 16:50 ------- Created an attachment (id=534) --> (http://bugzilla.open-bio.org/attachment.cgi?id=534&action=view) Patch to Record.py -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Wed Jan 3 16:53:02 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 3 Jan 2007 16:53:02 -0500 Subject: [Biopython-dev] [Bug 2051] XML Blast parser unusable with multiple queries and recent (2.2.13) blast - patch attached In-Reply-To: Message-ID: <200701032153.l03Lr2Th029085@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2051 jmjoseph at andrew.cmu.edu changed: What |Removed |Added ---------------------------------------------------------------------------- Status|ASSIGNED |RESOLVED Resolution| |FIXED ------- Comment #14 from jmjoseph at andrew.cmu.edu 2007-01-03 16:53 ------- Michiel, I have started bug 2176. Thank you for your assistance. -Jacob From tiagoantao at gmail.com Fri Jan 5 05:35:59 2007 From: tiagoantao at gmail.com (Tiago Antão) Date: Fri, 5 Jan 2007 10:35:59 +0000 Subject: [Biopython-dev] biopython-dev In-Reply-To: References: Message-ID: <6d941f120701050235p437e9283sfad21772401baefa@mail.gmail.com> Hi Ralph, Thanks for the info, let me see if I can sum up what I have and what I am planning to do... I currently work with microsatellite and SNP data (already isolated ones, not retrieved from sequences that I have). I have code (parsers, controllers... it varies from case to case; the quality also varies) related to GenePop, fdist2, SimCoal2, Arlequin. I also have preliminary code to work with HapMap and the UCSC table browser. I have code implementing some statistics like Fst (Cockerham and Weir), expected/observed heterozygosity, ... I will be, in the medium term, quite interested in all the sequence part (Tajima's D, Fu and Li's D, and e.g. the new statistic in the Voight 2006 paper).
Also, linkage disequilibrium is very high on my priority list. I have been thinking quite a bit on representation of markers and populations (especially in a genomic context). e.g. I have noticed that you use a couple of arrays, one with names, the other with sequences, to represent population data. I am currently scratching my head with representation on a genomic scale (ie, multi-marker, mainly because of LD). But I think this will come smoothly when I really start to do LD studies... This is all in a context of detecting selection, disentangling selection from population structure, and hopefully, in the near future coevolution in the context of host/parasite (diseases...). I have set aside some time to assure that all the code that I am doing can be reused by the community. It is my plan to build and maintain this code during the next years (I am funded until 2010 with a PhD grant). Regards, Tiago On 1/4/07, Ralph Haygood wrote: > Tiago, > > Yes, I do still read biopython-dev. But at the moment, I have even > less time than usual, because I'm at a conference. If there's > something you want to ask me, go ahead, but unless the answer is > trivial, it may take me several days. > > You're right that my stuff is very sequence oriented. In fact, it's > very alignment oriented. It can analyze simple insertion/deletion as > well as single-nucleotide variation. 
Here's a typical use case, to
> give you the flavor:
>
> alignment = phylip_file_to_alignment("sm50PromoterSpurAfra.phy")
> populations = {'Spur': range(20), 'Afra': [20]}
> statistics = Statistics(alignment, populations)
> print "ungapped length: %d" % statistics.ungapped_length()
> print "K SNPs: %d" % statistics.get_K('Spur')
> print "K simple indels: %d" % statistics.get_K_simple_indel('Spur')
> print "theta_W SNPs: %g" % statistics.get_theta_W('Spur')
> print "theta_W simple indels: %g" % statistics.get_theta_W_simple_indel('Spur')
> print "pi SNPs: %g" % statistics.get_pi('Spur')
> print "pi simple indels: %g" % statistics.get_pi_simple_indel('Spur')
> print "D_T SNPs: %g" % statistics.get_D_T('Spur')
> print "D_T simple indels: %g" % statistics.get_D_T_simple_indel('Spur')
> print "D_FL SNPs: %g" % statistics.get_D_FL('Spur', 'Afra')
> print "D_FL simple indels: %g" % statistics.get_D_FL_simple_indel('Spur', 'Afra')
> etc.
>
> Spur is Strongylocentrotus purpuratus and Afra is Allocentrotus
> fragilis, two closely related species of sea urchin. In this example,
> I have 20 sequences of a certain region from Spur and one from Afra,
> so I'm analyzing the population genetics of the region within Spur,
> with Afra as an outgroup for doing things like inferring which allele
> is ancestral at a polymorphism within Spur. K is the number of
> polymorphisms, theta_W is Watterson's estimator of 4 x effective
> population size x neutral mutation rate, pi is the average number of
> pairwise differences between alleles, D_T is Tajima's D, D_FL is Fu
> and Li's D (which requires an outgroup), etc. The software can do
> more elaborate things like permutation tests for assessing whether a
> statistic differs between two alignments, which might be something
> like known transcription factor binding sites versus other nucleotide
> sites in a promoter. The canned software DnaSP can't do that, which
> is one of the reasons why I wrote my stuff.
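[Editorial sketch, not part of the original thread and not Ralph's code.] The three simplest statistics named above can be illustrated in a few lines. All function names here are hypothetical. The sketch assumes the sequences are already aligned, have equal length, contain no gaps, and come from a single population. It reports theta_W and pi for the whole region rather than per site.

```python
from itertools import combinations


def count_polymorphisms(seqs):
    """K: number of alignment columns with more than one distinct base."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def theta_w(seqs):
    """Watterson's estimator: K / a_n, with a_n = sum of 1/i for i < n."""
    n = len(seqs)
    a_n = sum(1.0 / i for i in range(1, n))
    return count_polymorphisms(seqs) / a_n


def pairwise_pi(seqs):
    """pi: average number of differences over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return total / float(len(pairs))


# Toy alignment of three sequences, invented for illustration only.
toy = ["AAAA", "AAAT", "AATT"]
k = count_polymorphisms(toy)
```

Tajima's D compares theta_W with pi after a variance correction, so these two functions are the building blocks for it. The correction itself is omitted here.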
> > Ralph > -- Good judgment comes from experience. Experience comes from bad judgment. - Unknown author From bugzilla-daemon at portal.open-bio.org Sun Jan 7 16:25:18 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 7 Jan 2007 16:25:18 -0500 Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug fixes and cleanup In-Reply-To: Message-ID: <200701072125.l07LPIiS032620@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2176 ------- Comment #3 from mdehoon at ims.u-tokyo.ac.jp 2007-01-07 16:25 ------- > In Record.py, HSP.identities, HSP.gaps, and HSP.positives are still > defined as (None,None) tuples. However, in NCBIXML.py, these > variables are set as integers. I don't see a point of a tuple at all, > at least for NCBIXML. (I realize it is used in NCBIStandalone.py). > Most importantly, the inconsistency makes it difficult to handle cases > when the parameter is not set. It seems easiest, though, to just > retain the tuple format. I don't see a good reason for a tuple either -- though it may have seemed like a good idea back in the days that Blast only produced plain-text output. Instead of making NCBIXML also use a tuple, I'd rather set HSP.identities|gaps|positives to None instead of (None, None) in Record.py. This may break some code for people using NCBIStandalone. On the other hand, it doesn't break Biopython's test suite. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Mon Jan 8 10:24:20 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 8 Jan 2007 10:24:20 -0500 Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug fixes and cleanup In-Reply-To: Message-ID: <200701081524.l08FOKFn008935@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2176 ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2007-01-08 10:24 ------- Regarding the inconsistent use of tuples for _hsp.identities, positives, and gaps - I would like all the parsers, NCBIStandalone and NCBIXML (and ideally the HTML parser too), to return identical record objects. To do this, we could either: (a) change NCBIXML to use tuples instead of integers (as suggested by Jacob) or, (b) change NCBIStandalone to use simple integers instead of tuples (is this what you meant in comment 3 Michiel?) Choice (b) would seem simpler in the long term - but would probably break more existing code. Also, users of NCBIXML are going to have to update their scripts anyway after bug 2051, so choice (a) would disrupt fewer people. From bugzilla-daemon at portal.open-bio.org Mon Jan 8 11:14:36 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 8 Jan 2007 11:14:36 -0500 Subject: [Biopython-dev] [Bug 2043] SProt.py fails to parse the current Swiss-Prot version (RX and OH lines are broken) In-Reply-To: Message-ID: <200701081614.l08GEaMm011511@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2043 ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2007-01-08 11:14 ------- Checked in support for Line type OH (Organism Host) for viral hosts based on code from Kristian Rother.
These lines were just being ignored. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jan 9 11:10:06 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 9 Jan 2007 11:10:06 -0500 Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug fixes and cleanup In-Reply-To: Message-ID: <200701091610.l09GA6Wm004669@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2176 ------- Comment #5 from mdehoon at ims.u-tokyo.ac.jp 2007-01-09 11:10 ------- > Regarding the inconsistent use tuples for _hsp.identities, positives, and gaps > - I would like all the parsers NCBIStandalone and NCBIXML (and ideally the > HTML parser too) to return identical record objects. Even after patch #2090, the NCBIStandalone parser is broken for multiple Blast records, and will probably be broken for single Blast records also when a new Blast version comes out. I haven't tried the HTML parser, but I'd be surprised if it can parse HTML output from recent versions of Blast. So whereas I agree in principle that the three parsers should return identical records objects, in practice it's hardly relevant given that two of the three parsers either don't work or cannot work reliably. > To do this, we could either: > > (a) change NCBIXML to use tuples instead of integers (as suggested by Jacob) All three of us agree that there's no good reason for tuples. Option (a) implies copying a bad design choice from a semi-broken parser to a functioning parser. > or, > > (b) change NCBIStandalone to use simple integers instead of tuples (is this > what you meant in comment 3 Michiel?) > > Choice (b) would seem simpler in the long term - but would probably break more > existing code. 
Also, users of NCBIXML are going to have to update their > scripts anyway after bug 2051, so choice (a) would distrupt less people. Both option (a) and (b) break existing code. So let me suggest option (c): (c) Don't do anything. This doesn't break any code. In the near term, people that use both the plain-text parser and the XML parser will have to deal with differences in the Blast record produced by the parser. But how many people are that anyway? Most likely, not enough to justify option (a). In the long term, assuming that both the plain-text parser and the HTML parser will be deprecated, there will be no more inconsistencies. My question to Jacob: Why do you need to use the plain-text Blast parser? Is there something it can do that the XML parser cannot? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jan 9 12:29:01 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 9 Jan 2007 12:29:01 -0500 Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug fixes and cleanup In-Reply-To: Message-ID: <200701091729.l09HT1Vi009189@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2176 ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2007-01-09 12:29 ------- Tuples/Integers for HSP.identities, HSP.gaps, and HSP.positives --------------------------------------------------------------- Michiel's option (c) of doing nothing is very pragmatic. If we go for this, I think we should at least update the record object's documentation to say it will be a tuple (when used with NCBIStandalone) or an integer (when used with NCBIXML). Perhaps we should also change the default in a new record object too... 
query_letters versus query_length --------------------------------- Another of Jacobs suggestions was to rename the record.query_letters (short for number of letters in query?) to something like query_length (which is closer to the actual text of query_len used in the XML file). I personally am not inclined to change this even though it would be slightly clearer. Note that I have corrected the error on line 186 of NCBIXML.py in CVS - well spotted Jacob. This mistake was my fault - recently introduced as part of the changes I made on bug 2051 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jan 10 15:47:59 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 10 Jan 2007 15:47:59 -0500 Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug fixes and cleanup In-Reply-To: Message-ID: <200701102047.l0AKlxX6027453@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2176 ------- Comment #7 from mdehoon at ims.u-tokyo.ac.jp 2007-01-10 15:47 ------- > Another of Jacobs suggestions was to rename the record.query_letters (short for > number of letters in query?) to something like query_length (which is closer to > the actual text of query_len used in the XML file). I personally am not > inclined to change this even though it would be slightly clearer. In principle I agree with Jacob on this one. But as Jacob also indicates, there are probably more variable names that are less than ideal. So if we change these variable names, it's better to change all of them at the same time. This, however, will break a lot of existing code. With all the other changes to the Blast parsers, now doesn't seem to be the best time for such a change. However, let's get back to this point once the dust settles with the Blast parsers. 
With hit_id, hit_def, and hsp.align_length, I see no problems with Jacob's suggestion. Objections, anybody? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jan 10 16:05:23 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 10 Jan 2007 16:05:23 -0500 Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug fixes and cleanup In-Reply-To: Message-ID: <200701102105.l0AL5Nht028299@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2176 ------- Comment #8 from jmjoseph at andrew.cmu.edu 2007-01-10 16:05 ------- > My question to Jacob: > Why do you need to use the plain-text Blast parser? Is there something it can > do that the XML parser cannot? I use only the XML parser. My greatest concern is not that the plain-text and XML parsers are different, but rather that the XML parser is not consistent with Record.py. An example that I consider completely broken is the definition of query_length in Record.py, but the use of self._blast.query_letters in NCBIXML.py. To avoid breaking the existing plain-text parser code, would it be too objectionable to use a new class, Record-XML.py, with definitions that exactly match the usage in NCBIXML.py? Since few people are likely to use both parsers, and any using the XML parser have required recent code updates anyway, perhaps this separation would be easiest. Thanks. -Jacob -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
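[Editorial sketch, not part of the original thread.] The mismatch Jacob describes, where the record class documents query_length but the parser sets query_letters, can be shown with a small stand-in. The classes and functions below are hypothetical simplifications, not the real Record.py or NCBIXML.py code.

```python
class Parameters(object):
    """Simplified stand-in for a record class with one documented field."""

    def __init__(self):
        self.query_length = None  # the documented attribute name


def parse_mismatched(record, query_len):
    """Mimic the reported problem: write to an undocumented name."""
    record.query_letters = query_len
    return record


def parse_consistent(record, query_len):
    """Write to the name the record class actually declares."""
    record.query_length = query_len
    return record


def undeclared_attributes(record, declared):
    """List attributes present on the record but absent from its docs."""
    return sorted(set(vars(record)) - set(declared))


bad = parse_mismatched(Parameters(), 120)
good = parse_consistent(Parameters(), 120)
```

In the mismatched case the documented field silently stays None, which is what a user reading only the record class would run into. A check like undeclared_attributes could be run in a test suite to catch such drift.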
From bugzilla-daemon at portal.open-bio.org Thu Jan 11 00:18:17 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 11 Jan 2007 00:18:17 -0500 Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug fixes and cleanup In-Reply-To: Message-ID: <200701110518.l0B5IHuD018624@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2176 ------- Comment #9 from mdehoon at ims.u-tokyo.ac.jp 2007-01-11 00:18 ------- > To avoid breaking the existing plain-text parser code, would it be too > objectionable to use a new class, Record-XML.py, with definitions that exactly > match the usage in NCBIXML.py? Go ahead, but don't add it to Biopython ;-). It would just add to the confusion, without a real benefit to users. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From kosa at genesilico.pl Thu Jan 11 03:52:27 2007 From: kosa at genesilico.pl (Jan Kosinski) Date: Thu, 11 Jan 2007 09:52:27 +0100 Subject: [Biopython-dev] powerful Alignment class in Biopython? Message-ID: <45A5FACB.40109@genesilico.pl> Hi, Is anyone going to develop or developing now an Alignment class in Biopython as powerful as for example SimpleAlign in Bioperl? Look here for instance for methods available in Bioperl http://doc.bioperl.org/releases/bioperl-1.2/Bio/SimpleAlign.html. The reason I am asking is that I do not know if I should start working on more functional subclass of biopython Alignment class (I do not want to come back to Perl ;-)... Regards, Janek From fkauff at duke.edu Thu Jan 11 04:27:25 2007 From: fkauff at duke.edu (Frank) Date: Thu, 11 Jan 2007 10:27:25 +0100 Subject: [Biopython-dev] powerful Alignment class in Biopython? 
In-Reply-To: <45A5FACB.40109@genesilico.pl> References: <45A5FACB.40109@genesilico.pl> Message-ID: <1168507645.2888.3.camel@osiris.biologie.uni-kl.de> Hi Janek, the Nexus parser in Biopython (for which I still haven't written any documentation...) basically holds an alignment, and has some methods that deal with basic alignment functionality. If you're going to work on a more sophisticated alignment class, maybe we should try to get the Nexus class and the alignment class to work smoothly together. Frank On Thu, 2007-01-11 at 09:52 +0100, Jan Kosinski wrote: > Hi, > > Is anyone going to develop or developing now an Alignment class in > Biopython as powerful as for example SimpleAlign in Bioperl? Look here > for instance for methods available in Bioperl > http://doc.bioperl.org/releases/bioperl-1.2/Bio/SimpleAlign.html. > > The reason I am asking is that I do not know if I should start working > on more functional subclass of biopython Alignment class (I do not want > to come back to Perl ;-)... > > Regards, > Janek > > > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From tiagoantao at gmail.com Thu Jan 11 05:36:35 2007 From: tiagoantao at gmail.com (Tiago Antão) Date: Thu, 11 Jan 2007 10:36:35 +0000 Subject: [Biopython-dev] PopGen code Message-ID: <6d941f120701110236h738799b0i662455bfc98256d1@mail.gmail.com> Hi, A couple of weeks ago I put code related to population genetics on bugzilla, namely parsing of GenePop and Fdist files, plus code to control fdist. I have had no feedback whatsoever, such as comments on the quality of the code or whether there is interest in adding it to BioPython in the future, etc... I have much more code that I could start converting to BioPython format, some of which is a bit more complicated to convert (e.g., an Arlequin format parser).
Before I start doing it I would like to know if there will be any feedback at all or if I am just losing my time... Regards, Tiago -- Good judgment comes from experience. Experience comes from bad judgment. - Unknown author From kosa at genesilico.pl Thu Jan 11 06:34:50 2007 From: kosa at genesilico.pl (Jan Kosinski) Date: Thu, 11 Jan 2007 12:34:50 +0100 Subject: [Biopython-dev] powerful Alignment class in Biopython? In-Reply-To: <1168507645.2888.3.camel@osiris.biologie.uni-kl.de> References: <45A5FACB.40109@genesilico.pl> <1168507645.2888.3.camel@osiris.biologie.uni-kl.de> Message-ID: <45A620DA.4000302@genesilico.pl> Hi, I have a feeling that it would be better to write all methods similar to the BioPerl ones directly for the BioPython Alignment class. The main reason is that this class is not tied to any format like Fasta, Clustal or Nexus. It stores SeqRecords, which are also not tied to Fasta or any other format. It would make many things easier. For instance, I can write all my functions which do something with alignments so that they accept general Alignment objects (and not necessarily FastaAlignment or ClustalAlignment objects). Would it not be better to write everything that does general things with alignments (column counting, column selection/removal etc.) so that it works with the general Alignment class rather than with a class for alignments in a specific biological format? Janek Frank wrote: > Hi Janek, > > then Nexus parser in Biopython (for which I still haven't written any > documentation yet...) basically holds an alignment, and has some methods > that deal with basic alignment functionality. If you're going to work on > a more sophisticated alignment class, maybe we should try to get Nexus > class and alignment class work smoothly together. > > Frank > > > On Thu, 2007-01-11 at 09:52 +0100, Jan Kosinski wrote: > >> Hi, >> >> Is anyone going to develop or developing now an Alignment class in >> Biopython as powerful as for example SimpleAlign in Bioperl?
Look here >> for instance for methods available in Bioperl >> http://doc.bioperl.org/releases/bioperl-1.2/Bio/SimpleAlign.html. >> >> The reason I am asking is that I do not know if I should start working >> on more functional subclass of biopython Alignment class (I do not want >> to come back to Perl ;-)... >> >> Regards, >> Janek >> >> >> >> _______________________________________________ >> Biopython-dev mailing list >> Biopython-dev at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython-dev >> From kosa at genesilico.pl Thu Jan 11 07:11:11 2007 From: kosa at genesilico.pl (Jan Kosinski) Date: Thu, 11 Jan 2007 13:11:11 +0100 Subject: [Biopython-dev] [BioPython] what to use for working with fasta sequences and alignments? In-Reply-To: <45A50D24.1090906@maubp.freeserve.co.uk> References: <45A500F3.9090001@genesilico.pl> <45A50D24.1090906@maubp.freeserve.co.uk> Message-ID: <45A6295F.5030103@genesilico.pl> Are you going to fix this in the new SeqIO?: When using Bio.SeqIO.FASTA.FastaReader the names of the sequences are stripped away after the first "space". Janek Peter (BioPython List) wrote: > Jan Kosinski wrote: >> Hi, >> >> I am quite new in BioPython and I am a little bit confused when >> trying to use BioPython for working with fasta sequences and alignments. >> >> For instance, I can read and parse fasta files with Bio.Fasta, return >> records (as Fasta.record class), iterate and so on. But then I am >> going to Bio.Fasta.FastaAlign module which offers FastaAlignment >> (subclass of Alignment class) class. However, this class has very >> limited methods and get_all_seqs and get_seq_by_num return SeqRecord >> object instead of Fasta.record (why??) what makes it hard to use >> Bio.Fasta.FastaAlign (with SeqRecord) for alignments with Bio.Fasta >> (with Fasta.record) for sequences. Maybe I am wrong but Biopython >> seems to be full of incompatibilities. Or one should know which >> modules and classes should not be used? 
>> >> Could you recommend me what should I use for my work with fasta >> sequences and alignments? Which BioPython modules and classes? > > You can use Bio.Fasta to read in files either as Fasta.Record objects, > or as SeqRecord objects. I would use SeqRecord objects - they are > more general should you ever want to use a different input file format > - plus as you have noticed, the alignment object also uses SeqRecord > objects to hold each (gapped) sequence. > > There are other options if you search the code - but Bio.Fasta is the > best documented and most used. > > If you are brave, then you might have a look at the new code in > Bio.SeqIO which you can get from CVS. This is still in a state of > flux however... but the Fasta parsing is much faster. See this page > and the mailing list archives for more: > > http://www.biopython.org/wiki/SeqIO > > > Or should I use other packages like CoreBio? > > You could do - it has the advantage of having started recently from a > clean slate, and having much less "old code". > >> Thank you in advance for any guidelines, >> Janek Kosinski > > Peter From biopython-dev at maubp.freeserve.co.uk Fri Jan 12 07:36:56 2007 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Fri, 12 Jan 2007 12:36:56 +0000 Subject: [Biopython-dev] PopGen code In-Reply-To: <6d941f120701110236h738799b0i662455bfc98256d1@mail.gmail.com> References: <6d941f120701110236h738799b0i662455bfc98256d1@mail.gmail.com> Message-ID: <45A780E8.2070803@maubp.freeserve.co.uk> Tiago Ant?o wrote: > Hi, > > A couple of weeks ago I have put on bugzilla code related to > population genetics, namely parsing of GenePop and Fdist files, plus > code to control fdist. > I have had no feedback whatsoever, namely comments to the quality of > the code, if there is interest in adding it in the future to > BioPython, etc... 
> > I have much more code that I could start converting to BioPython > format, some of which is a bit more complicated to convert (e.g., an > Arlequin format parser). Before I start doing it I would like to know > if there will be any feedback at all or if I am just loosing my > time... I suppose I/we would be able to read your code from a general perspective (coding style, clarity of comments, etc). I haven't made time for this. I suspect BioPython currently has no active developers who feel qualified to interpret your population genetics code. I was hoping that you and Ralph Haygood would combine forces - if you are both happy with some code that does bode well. Any comments Michiel? Regarding population genetic file formats - from a very quick search about Arlequin it sounds like this file format can hold lots of different types of data. I would encourage you to try and come up with a generic population record data object that could hold this or information from GenePop or Fdist as well. I have no idea how easy this would be... Peter From tiagoantao at gmail.com Fri Jan 12 09:16:53 2007 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Fri, 12 Jan 2007 14:16:53 +0000 Subject: [Biopython-dev] PopGen code In-Reply-To: <45A780E8.2070803@maubp.freeserve.co.uk> References: <6d941f120701110236h738799b0i662455bfc98256d1@mail.gmail.com> <45A780E8.2070803@maubp.freeserve.co.uk> Message-ID: <6d941f120701120616t61db8102o2f3eb3ef3da12fef@mail.gmail.com> Hi, Thanks for the answer. > I suspect BioPython currently has no active developers who feel > qualified to interpret your population genetics code. I was hoping that > you and Ralph Haygood would combine forces - if you are both happy with > some code that does bode well. Any comments Michiel? I think Ralph (who subscribes to this list, and thus can comment) has strong time constraints, and will probably have little available time in the near future... 
> Regarding population genetic file formats - from a very quick search
> about Arlequin it sounds like this file format can hold lots of
> different types of data. I would encourage you to try and come up with
> a generic population record data object that could hold this or
> information from GenePop or Fdist as well. I have no idea how easy this
> would be...

I have been thinking a lot about a generic data structure to hold
population genomic (i.e. not only genetic) data. I have, in fact,
implemented (in CAML, not Python) quite a few different data
representations. I was not happy with any of them. Different kinds of
markers (that sometimes overlap - e.g. sequences and SNPs), linkage
disequilibrium (thus relations between markers...), ploidy (no need to
think of different organisms, think mitochondria, nuclear chromosomes,
Y chromosome), ... make a general solution not trivial.

As I see it, there are a few options:
1. Have a grand, unified structure, but that will take time to mature
2. Assume that there will be different representations for different
   scopes, assume that that is a bad thing and live with that
3. Assume that there will be different representations, and that that
   is good, in the sense that a one-size-fits-all approach in this case
   has lots of problems

I think the pragmatic approach for now is not to have a generic
representation. I would lean more towards letting things mature (develop
statistics, parsers, ...) and, once there is more experience (and,
hopefully, user feedback), reassessing the issue of a general
representation. I am aware that this will entail each part of the code
having a different calling data structure, but I think that with care
and common sense that won't be very problematic.
I don't mind having the code on an alpha branch for as long as you see
fit; I just want to be sure that whatever effort I put into converting
my code to BioPython (or creating new code) is not lost. That is why I
would like feedback on what will happen to the code that I am
submitting. I am willing to accommodate any reasonable requirements
regarding code quality and development process...

Regards,
Tiago

-- 
Good judgment comes from experience.
Experience comes from bad judgment.
 - Unknown author

From bsouthey at gmail.com Fri Jan 12 09:30:26 2007
From: bsouthey at gmail.com (Bruce Southey)
Date: Fri, 12 Jan 2007 08:30:26 -0600
Subject: [Biopython-dev] PopGen code
In-Reply-To: <6d941f120701120616t61db8102o2f3eb3ef3da12fef@mail.gmail.com>
References: <6d941f120701110236h738799b0i662455bfc98256d1@mail.gmail.com>
 <45A780E8.2070803@maubp.freeserve.co.uk>
 <6d941f120701120616t61db8102o2f3eb3ef3da12fef@mail.gmail.com>
Message-ID:

Hi,

While I do have a remote interest in this, I do not have any time to
look at this at present. As I mentioned in a previous email, John Cole
is doing related work in Python, but not as part of BioPython. It would
probably be good to have some unified approach and direction because of
the overlaps that occur, and so that other pieces of code can be easily
added.

Regards
Bruce

On 1/12/07, Tiago Antão wrote:
> Hi,
>
> Thanks for the answer.
>
> > I suspect BioPython currently has no active developers who feel
> > qualified to interpret your population genetics code. I was hoping that
> > you and Ralph Haygood would combine forces - if you are both happy with
> > some code that does bode well. Any comments Michiel?
>
> I think Ralph (who subscribes to this list, and thus can comment) has
> strong time constraints, and will probably have little available time
> in the near future...
>
> > Regarding population genetic file formats - from a very quick search
> > about Arlequin it sounds like this file format can hold lots of
> > different types of data.
I would encourage you to try and come up with > > a generic population record data object that could hold this or > > information from GenePop or Fdist as well. I have no idea how easy this > > would be... > > I have been thinking a lot about a generic data structure to hold > population genomic (ie not only genetic) data. I have, in fact, > implemented (in CAML, not Python) quite a few different data > representations. I was not happy with none of them. Different kinds of > markers (that sometimes overlap - eg sequences and SNPs), linkage > disequilibrium (thus relations between markers...), ploidy (no need to > think on different organisms, think mitochondria, nuclear chromosomes, > Y chromosome), ... make a general solution not trivial. > As I see it, there are a few options: > 1. Have a grand, unified structure, but that will take time to mature > 2. Assume that there will be different representations for different > scopes, assume that that is a bad thing and live with that > 3. Assume that there will be different representations, and that that > is good, in the sense that a one size, fits all approach in this case > has lots of problems > > I think the pragmatic approach for now is not to have a generic > representation. I would lean more to let things mature (develop > statistics, parsers, ...) and after there is more experience (and, > hopefully, user feedback) then reassess the issue of a general > representation. I am aware that this will entail each part of code > having a different calling data structure, but I think that with care > and common sense that won't be very problematic. > > I don't mind having the code on an alpha branch for as long as you see > fit, I just want to be sure that whatever effort I put in converting > (or creating new) my code to BioPython is not lost, that is why I > would like feedback on what will happen to the code that I am > submitting. 
I am willing to accommodate any reasonable requirements > regarding code quality and development process... > > Regards, > Tiago > > -- > Good judgment comes from experience. > Experience comes from bad judgment. > - Unknown author > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From bugzilla-daemon at portal.open-bio.org Fri Jan 12 19:27:11 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 12 Jan 2007 19:27:11 -0500 Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug fixes and cleanup In-Reply-To: Message-ID: <200701130027.l0D0RBlR027978@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2176 ------- Comment #10 from mdehoon at ims.u-tokyo.ac.jp 2007-01-12 19:27 ------- I've committed the code to handle hit_id, hit_def, and hsp.align_length to CVS. Let's keep this bug report open for now to remind ourselves to revisit the issues with variable names at some point. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Mon Jan 15 06:04:24 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 15 Jan 2007 06:04:24 -0500 Subject: [Biopython-dev] [Bug 2043] SProt.py fails to parse the current Swiss-Prot version (RX and OH lines are broken) In-Reply-To: Message-ID: <200701151104.l0FB4OdQ015531@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2043 k_rother at yahoo.de changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |k_rother at yahoo.de -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jan 15 06:07:33 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 15 Jan 2007 06:07:33 -0500 Subject: [Biopython-dev] [Bug 1956] SwissProt release 49 - Support for new DT lines In-Reply-To: Message-ID: <200701151107.l0FB7Xdb015724@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=1956 ------- Comment #2 from k_rother at yahoo.de 2007-01-15 06:07 ------- Created an attachment (id=543) --> (http://bugzilla.open-bio.org/attachment.cgi?id=543&action=view) new date() method handling new style DT lines new to bugzilla. don't know whether this is the proper way to commit code. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Mon Jan 15 06:17:55 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 15 Jan 2007 06:17:55 -0500 Subject: [Biopython-dev] [Bug 1956] SwissProt release 49 - Support for new DT lines In-Reply-To: Message-ID: <200701151117.l0FBHtd1016589@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=1956 ------- Comment #3 from k_rother at yahoo.de 2007-01-15 06:17 ------- Created an attachment (id=544) --> (http://bugzilla.open-bio.org/attachment.cgi?id=544&action=view) SProt.py that digests all 250,000 Uniprot entries successfully. also checked the data record contents whether the dates and version numbers of the first few entries are correct. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jan 15 06:19:59 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 15 Jan 2007 06:19:59 -0500 Subject: [Biopython-dev] [Bug 1956] SwissProt release 49 - Support for new DT lines In-Reply-To: Message-ID: <200701151119.l0FBJxt9016846@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=1956 k_rother at yahoo.de changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |k_rother at yahoo.de Status|NEW |RESOLVED Resolution| |WORKSFORME ------- Comment #4 from k_rother at yahoo.de 2007-01-15 06:19 ------- i think this should finish the bug unless someone wants to beautify the code. KR -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 

From bugzilla-daemon at portal.open-bio.org Mon Jan 15 07:26:57 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 15 Jan 2007 07:26:57 -0500
Subject: [Biopython-dev] [Bug 1956] SwissProt release 49 - Support for new DT lines
In-Reply-To:
Message-ID: <200701151226.l0FCQvNE022159@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1956

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |biopython-bugzilla at maubp.freeserve.co.uk
             Status|RESOLVED                    |REOPENED
         Resolution|WORKSFORME                  |

------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2007-01-15 07:26 -------
Reopening - it's not fixed until we have updated the code in CVS.

However, I will try and have a look at your code. By the way - in
general developers prefer patches rather than chunks of code, or edited
copies of the original.

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From bugzilla-daemon at portal.open-bio.org Mon Jan 15 07:51:46 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 15 Jan 2007 07:51:46 -0500
Subject: [Biopython-dev] [Bug 1956] SwissProt release 49 - Support for new DT lines
In-Reply-To:
Message-ID: <200701151251.l0FCpk8H023891@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1956

biopython-bugzilla at maubp.freeserve.co.uk changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |RESOLVED
         Resolution|                            |FIXED

------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2007-01-15 07:51 -------
I have updated CVS with a slightly modified version of your code, Kristian.
See revision 1.36, web version here:
http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/SwissProt/SProt.py?cvsroot=biopython

It passes the old unit test, test_SProt.py, but if you could double
check this on the latest release that would be great.

Thanks very much.

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

From biopython-dev at maubp.freeserve.co.uk Mon Jan 15 15:04:34 2007
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Mon, 15 Jan 2007 20:04:34 +0000
Subject: [Biopython-dev] Bio.SeqIO
In-Reply-To: <45A94BFD.5080209@c2b2.columbia.edu>
References: <45A500F3.9090001@genesilico.pl>
 <45A50D24.1090906@maubp.freeserve.co.uk>
 <45A6295F.5030103@genesilico.pl>
 <45A6845A.1020903@maubp.freeserve.co.uk>
 <45A76B12.7000809@genesilico.pl>
 <45A77DDE.3070504@maubp.freeserve.co.uk>
 <45A94BFD.5080209@c2b2.columbia.edu>
Message-ID: <45ABDE52.1030300@maubp.freeserve.co.uk>

Michiel de Hoon wrote:
> In my opinion, the new Bio.SeqIO code is a huge improvement to
> Biopython, so I'd be happy to make a new release for it.
>
> ...
>
> For Bio.SeqIO, we're also in pretty good shape, as far as I can tell.
> From what I remember, the remaining issues were
> 1) Which functionality to include, in particular
> a) if functions should accept file names in addition to file handles;

I have decided to follow Michiel's stance on this issue: handles only.

> b) if functions should infer the file format from the file extension,
> the file content, or otherwise.

Right now the file format string is optional, and if omitted the file
extension (via handle.name) is used to try and guess. It would be
trivial to remove this functionality and make format a required
argument.

We could at a later date choose to add limited support for format
guessing based on file contents without altering the function
parameters (i.e. the API).
Both these features would be nice to have (speaking as a user) but then
again, am I prepared to support the headaches they may cause later on?
I'm wavering on this issue (having previously been in favour of
including the format guessing).

Item 1(c) on Michiel's list could have been whether we need the three
"helper functions" which turned a file into a SeqRecord list, dictionary
or alignment. Again, I have come round to Michiel's view and removed
these as they were just simple wrappers for list, SequencesToDictionary
and SequencesToAlignment.

> 2) What are the best names for the functions that the user will see.

The good news is that after that little spring clean there are fewer
functions to name - just these four really:

SequenceIterator, once known as FileToSequenceIterator and before that
File2SequenceIterator. Now takes just an input file handle and an
optional file format. Returns a SeqRecord iterator.

SequencesToDictionary - takes SeqRecord iterator or list, plus an
optional function to define the keys, and returns a dictionary.

SequencesToAlignment - takes SeqRecord iterator or list, and returns an
alignment object. Perhaps this functionality should be included in the
alignment class itself...

WriteSequences, once known as SequencesToFile - takes a SeqRecord
iterator or list, an output handle, and a format string. Intended for
use on a whole file at once (i.e. the general case where there may be
headers/footers etc).
This does not let you do incremental writes one for each record (which
would be possible for some formats like GenBank or fasta)

Peter

From mdehoon at c2b2.columbia.edu Mon Jan 15 18:00:16 2007
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Mon, 15 Jan 2007 18:00:16 -0500
Subject: [Biopython-dev] Bio.SeqIO
In-Reply-To: <45ABDE52.1030300@maubp.freeserve.co.uk>
References: <45A500F3.9090001@genesilico.pl>
 <45A50D24.1090906@maubp.freeserve.co.uk>
 <45A6295F.5030103@genesilico.pl>
 <45A6845A.1020903@maubp.freeserve.co.uk>
 <45A76B12.7000809@genesilico.pl>
 <45A77DDE.3070504@maubp.freeserve.co.uk>
 <45A94BFD.5080209@c2b2.columbia.edu>
 <45ABDE52.1030300@maubp.freeserve.co.uk>
Message-ID: <45AC0780.9060306@c2b2.columbia.edu>

Peter wrote:
> WriteSequences, once known as SequencesToFile - takes a SeqRecord
> iterator or list, and output handle, and a format string. Intended for
> use on a whole file at once (i.e. the general case where there may be
> headers/footers etc). This does not let you do incremental writes one
> for each record (which would be possible for some formats like GenBank
> or fasta)

At the end of WriteSequences, the file is closed:

def WriteSequences(sequences, handle, format) :
    ...
    handle.close() #just in case the writer object forgot

Why would it be a problem if the handle is not closed?

--Michiel.
-- 
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032

From biopython-dev at maubp.freeserve.co.uk Tue Jan 16 06:10:21 2007
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Tue, 16 Jan 2007 11:10:21 +0000
Subject: [Biopython-dev] Bio.SeqIO
In-Reply-To: <45AC0780.9060306@c2b2.columbia.edu>
References: <45A500F3.9090001@genesilico.pl>
 <45A50D24.1090906@maubp.freeserve.co.uk>
 <45A6295F.5030103@genesilico.pl>
 <45A6845A.1020903@maubp.freeserve.co.uk>
 <45A76B12.7000809@genesilico.pl>
 <45A77DDE.3070504@maubp.freeserve.co.uk>
 <45A94BFD.5080209@c2b2.columbia.edu>
 <45ABDE52.1030300@maubp.freeserve.co.uk>
 <45AC0780.9060306@c2b2.columbia.edu>
Message-ID: <45ACB29D.2010002@maubp.freeserve.co.uk>

Michiel Jan Laurens de Hoon wrote:
> Peter wrote:
>> WriteSequences, once known as SequencesToFile - takes a SeqRecord
>> iterator or list, and output handle, and a format string. Intended for
>> use on a whole file at once (i.e. the general case where there may be
>> headers/footers etc). This does not let you do incremental writes one
>> for each record (which would be possible for some formats like GenBank
>> or fasta)
>
> At the end of WriteSequences, the file is closed:
>
> def WriteSequences(sequences, handle, format) :
>     ...
>     handle.close() #just in case the writer object forgot
>
> Why would it be a problem if the handle is not closed?

OK, I've fixed that.

That issue was on my mind too - in particular it would stop Bio.SeqIO
from creating concatenated phylip alignments which are used in
bootstrapping.

Reading this sort of file is a different issue, which I am also
currently thinking about.
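To make the point about the handle concrete, here is a minimal sketch (not the Bio.SeqIO code itself). The function name `write_records` and the plain `(name, sequence)` tuples are hypothetical stand-ins for illustration; the only point is that a writer which leaves the handle open can be called several times to build one concatenated file, as would be needed for bootstrap replicates.

```python
from io import StringIO


def write_records(records, handle):
    """Write (name, sequence) pairs to an open handle in a FASTA-like layout.

    Hypothetical helper for illustration only. The caller owns the handle,
    so it is deliberately NOT closed here.
    """
    count = 0
    for name, sequence in records:
        handle.write(">%s\n%s\n" % (name, sequence))
        count += 1
    return count


# Two "replicates" appended to the same handle, one call per replicate.
handle = StringIO()
write_records([("alpha", "ACGT"), ("beta", "ACGA")], handle)
write_records([("alpha", "ACGT"), ("beta", "ACGG")], handle)
combined = handle.getvalue()
handle.close()  # closed once, by the caller, when it is done
```

Had `write_records` closed the handle itself, the second call would have failed on a closed file.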
Peter

From biopython-dev at maubp.freeserve.co.uk Tue Jan 16 07:48:49 2007
From: biopython-dev at maubp.freeserve.co.uk (Peter)
Date: Tue, 16 Jan 2007 12:48:49 +0000
Subject: [Biopython-dev] Bio.SeqIO - Output
In-Reply-To: <45ACB29D.2010002@maubp.freeserve.co.uk>
References: <45A500F3.9090001@genesilico.pl>
 <45A50D24.1090906@maubp.freeserve.co.uk>
 <45A6295F.5030103@genesilico.pl>
 <45A6845A.1020903@maubp.freeserve.co.uk>
 <45A76B12.7000809@genesilico.pl>
 <45A77DDE.3070504@maubp.freeserve.co.uk>
 <45A94BFD.5080209@c2b2.columbia.edu>
 <45ABDE52.1030300@maubp.freeserve.co.uk>
 <45AC0780.9060306@c2b2.columbia.edu>
 <45ACB29D.2010002@maubp.freeserve.co.uk>
Message-ID: <45ACC9B1.6010709@maubp.freeserve.co.uk>

I've been thinking about sequence output (i.e. writing sequence files),
and have come to the conclusion that my writer classes in
Bio/SeqIO/Interfaces.py are probably too complicated.

My current Bio.SeqIO output implementation tries to be very flexible -
if you look beyond the top level function WriteSequences (aka
SequencesToFile) then the individual writer classes have a confusing
range of capabilities.

New Idea
========

I was thinking that we should only support two cases for sequence
output:

(*) simple sequential file formats
    - record by record, or file at once
    - can use a SeqRecord iterator (or a list)

(*) all other file formats
    - file at once only
    - probably needs a list of SeqRecords (not an iterator)

For the sequential file formats such as fasta, genbank and swiss there
are no headers or footers - and a single sequence alone would be a valid
file.

For all other file formats (e.g. clustal, stockholm, phylip, anything in
XML, ...) we would only offer the "file at once" option.

When implementing a writer for a new file format, you just have to
implement a "write file" function or a "write record" function which
takes the record(s) and a handle. The implementation details are up to
you.
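The two cases above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the Bio.SeqIO implementation: the function names are hypothetical, records are plain `(name, sequence)` tuples rather than SeqRecord objects, and the PHYLIP layout is a simplified sequential variant (a count/length header, then names padded to ten characters).

```python
from io import StringIO


def write_fasta_record(record, handle):
    """'Write record' style: one call per record, no header or footer."""
    name, sequence = record
    handle.write(">%s\n%s\n" % (name, sequence))


def write_phylip_file(records, handle):
    """'Write file' style: the header needs the record count and length."""
    records = list(records)  # an iterator must be consumed before the header is known
    if not records:
        raise ValueError("need at least one record")
    length = len(records[0][1])
    if any(len(sequence) != length for _, sequence in records):
        raise ValueError("all sequences must have the same length")
    handle.write(" %d %d\n" % (len(records), length))
    for name, sequence in records:
        handle.write("%-10s%s\n" % (name[:10], sequence))


records = [("alpha", "ACGT"), ("beta", "ACGA")]

# Sequential format: records can be streamed one at a time from an iterator.
fasta = StringIO()
for record in iter(records):
    write_fasta_record(record, fasta)

# Non-sequential format: everything is handed over in one call.
phylip = StringIO()
write_phylip_file(iter(records), phylip)
```

The design difference is visible in the signatures: the record writer never needs to look ahead, while the file writer has to see every record before it can emit its first line.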

Drawbacks
=========

There are some sequential file formats where, under the scheme above,
you would be forced to write the file in one go... However, I can only
think of one irrelevant example, so this may not matter. Can anyone
suggest some other examples? Some sort of simple tabular file with a
header row maybe?

For example simple Stockholm files (if you ignore the PFAM style
annotation) have a generic header, followed by sequential records and a
generic footer. The point here is that the header does not contain
anything about the records which will follow it (e.g. the number of
records, or whether they are protein or nucleotides).

For files like this it would be possible to write the file record by
record given an iterator - provided you also write the header and
footer.

Right now this is the only file format I can think of that has this
property - and I don't currently even support this (instead like BioPerl
I create Stockholm files with PFAM style annotations).

Stockholm files with PFAM style annotation do not qualify, because the
header contains the number of records. Similarly for non-interlaced
PHYLIP.

Peter

From mdehoon at c2b2.columbia.edu Tue Jan 16 12:51:23 2007
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Tue, 16 Jan 2007 12:51:23 -0500
Subject: [Biopython-dev] [BioPython] Next release plans; was: what to use for working with fasta sequences and alignments?
In-Reply-To: <45AA34AE.5080100@maubp.freeserve.co.uk>
References: <45A500F3.9090001@genesilico.pl>
 <45A50D24.1090906@maubp.freeserve.co.uk>
 <45A6295F.5030103@genesilico.pl>
 <45A6845A.1020903@maubp.freeserve.co.uk>
 <45A76B12.7000809@genesilico.pl>
 <45A77DDE.3070504@maubp.freeserve.co.uk>
 <45A94BFD.5080209@c2b2.columbia.edu>
 <45AA34AE.5080100@maubp.freeserve.co.uk>
Message-ID: <45AD109B.4030009@c2b2.columbia.edu>

Peter wrote:
> Regarding the fix checked in on bug 1970 I still would prefer we call
> the new XML iterator NCBIXML.Iterator(handle) rather than
> NCBIXML.parse(handle) but I'll live ;)
>
I chose "parse" because it is used in the old (Biopython release 1.42)
Blast XML parser:

Old:
>>> from Bio.Blast import NCBIXML
>>> b_parser = NCBIXML.BlastParser()
>>> b_record = b_parser.parse(blast_out)

New:
>>> from Bio.Blast import NCBIXML
>>> b_records = NCBIXML.parse(blast_out)
>>> b_record = b_records.next()
# Repeat to get subsequent Blast records

Whereas I am not dead set on "parse", it agrees with similar functions
in Python:
1) Function name is a verb, not a noun
2) Function name describes what the function does, not what the function
   returns
3) Function names are short, and start with a lower case letter.

For example, to read a file line-by-line in Python:

>>> inputfile = open("somefunnyfile")
# "open"; not "Iterator", nor "FileToLineIterator",
# even though "open" returns an iterator:
>>> for line in inputfile:
...     print line

To read an image file with the Python Imaging Library:

>>> import Image
>>> im = Image.open("lena.ppm")
# "open"; not "Image", nor "FileNameToImage".

To read a Python object from a pickled file:

>>> import pickle
>>> inputfile = open("somepickledfile")
>>> myobject = pickle.load(inputfile)
# "load"; not "FileToObject".
>>> inputfile.close()

To parse an XML file with the sax parser framework in Python:

>>> from xml.sax.handler import ContentHandler
>>> from xml import sax
>>> handler = SomeSubclassOfContentHandler()
>>> inputfile = open("myxmlfile.xml")
>>> sax.parse(inputfile, handler)
# "parse", same as in the new Bio.Blast.NCBIXML
>>> inputfile.close()

So, for Bio.Blast.NCBIXML, good names would be "load", "read", "parse",
or something similar. "Iterator" would not be consistent; besides, until
recently I didn't know what an iterator is, so I doubt that new users
would know.

What we could do is to have two functions in Bio.Blast.NCBIXML, perhaps
one called "read" and the other "iterate", where the former returns a
single Blast record (for an XML file containing only one Blast result),
and the latter an iterator over multiple Blast records.

--Michiel.

-- 
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032

From mdehoon at c2b2.columbia.edu Tue Jan 16 12:47:17 2007
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Tue, 16 Jan 2007 12:47:17 -0500
Subject: [Biopython-dev] [BioPython] Next release plans; was: what to use for working with fasta sequences and alignments?
In-Reply-To: <45AA34AE.5080100@maubp.freeserve.co.uk>
References: <45A500F3.9090001@genesilico.pl>
 <45A50D24.1090906@maubp.freeserve.co.uk>
 <45A6295F.5030103@genesilico.pl>
 <45A6845A.1020903@maubp.freeserve.co.uk>
 <45A76B12.7000809@genesilico.pl>
 <45A77DDE.3070504@maubp.freeserve.co.uk>
 <45A94BFD.5080209@c2b2.columbia.edu>
 <45AA34AE.5080100@maubp.freeserve.co.uk>
Message-ID: <45AD0FA5.3060008@c2b2.columbia.edu>

Peter wrote:
> In general, I agree that the Blast XML parser in CVS looks in good shape
> - but we really need to update the documentation for using Blast for the
> next release.
>
Yeah I know, I've been holding off on updating the documentation so it
is consistent with the latest Biopython release 1.42.
I'll update it together with the next release.

--Michiel.

-- 
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032

From bugzilla-daemon at portal.open-bio.org Sat Jan 27 18:01:07 2007
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 27 Jan 2007 18:01:07 -0500
Subject: [Biopython-dev] [Bug 1963] Adding __str__ method to codon tables and translators
In-Reply-To:
Message-ID: <200701272301.l0RN17l6026463@portal.open-bio.org>

http://bugzilla.open-bio.org/show_bug.cgi?id=1963

------- Comment #2 from mdehoon at ims.u-tokyo.ac.jp 2007-01-27 18:01 -------
> Question One:
> Is this worth adding to BioPython or not?
Yes, definitely.

> Question Two:
> What is the preferred behaviour for ambiguous tables? Just a 4x4x4 table as
> for the unambiguous tables? Or the full 15x15x15 table? I have implemented
> both (see commented out code)
My feeling is that 15x15x15 would become too large to be clearly visible
on the screen. So I'd prefer 4x4x4, maybe with a reminder printed at the
end as to what each ambiguous codon may represent.

> Question Three:
> Is there a standard BioPython function to convert from one letter amino acid
> sequences into three letter names? i.e. like one_to_three from
> Bio.PDB.Polypeptide but more general. That function does not cope with
> ambiguous names.
There is the function seq3 in Bio/SeqUtils. If it is not complete, it
can be extended easily, and seems to be a better place for this general
function than Bio/PDB/Polypeptide.

-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
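As a rough illustration of the 4x4x4 layout discussed in the comment above, here is a self-contained sketch. It does not use Biopython's CodonTable classes and is not the patch attached to the bug. The names `STANDARD_TABLE` and `format_codon_table` are hypothetical, and the table is built from the standard genetic code (NCBI translation table 1) written out in TCAG order.

```python
BASES = "TCAG"

# Standard genetic code, one letter per codon, ordered by first, second
# and third base (each running through T, C, A, G). "*" marks a stop.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

STANDARD_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def format_codon_table(table):
    """Return a 4x4x4 text layout.

    Each block of four rows shares a first base, each row shares a third
    base, and the four columns run over the second base.
    """
    lines = []
    for b1 in BASES:
        for b3 in BASES:
            cells = ["%s %s" % (b1 + b2 + b3, table[b1 + b2 + b3]) for b2 in BASES]
            lines.append("  ".join(cells))
        lines.append("")  # blank line between first-base blocks
    return "\n".join(lines).rstrip()


text = format_codon_table(STANDARD_TABLE)
```

A `__str__` method on a table class could simply return the result of such a function; a legend for ambiguous codons, as suggested above, could be appended after the last block.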
From bugzilla-daemon at portal.open-bio.org Wed Jan 3 13:15:00 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 3 Jan 2007 08:15:00 -0500 Subject: [Biopython-dev] [Bug 2174] New: FDist Support in BioPython Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2174 Summary: FDist Support in BioPython Product: Biopython Version: 1.24 Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: tiagoantao at gmail.com This is an enhancement bug to submit code related to fdist2 http://www.rubic.rdg.ac.uk/~mab/software.html -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jan 3 13:15:18 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 3 Jan 2007 08:15:18 -0500 Subject: [Biopython-dev] [Bug 2174] FDist Support in BioPython In-Reply-To: Message-ID: <200701031315.l03DFIGn007058@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2174 tiagoantao at gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Severity|normal |enhancement -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Wed Jan 3 13:16:06 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 3 Jan 2007 08:16:06 -0500 Subject: [Biopython-dev] [Bug 2174] FDist Support in BioPython In-Reply-To: Message-ID: <200701031316.l03DG6qL007102@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2174 ------- Comment #1 from tiagoantao at gmail.com 2007-01-03 08:16 ------- Created an attachment (id=532) --> (http://bugzilla.open-bio.org/attachment.cgi?id=532&action=view) Code support fdist -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From tiagoantao at gmail.com Wed Jan 3 13:16:30 2007 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Wed, 3 Jan 2007 13:16:30 +0000 Subject: [Biopython-dev] FDist: more Population Genetics code Message-ID: <6d941f120701030516m1adb3daeh6e4645121ba8679d@mail.gmail.com> Hi! I have submitted another enhancement bug, with support for FDist. It allows to generate and parse Fdist files and to control fdist applications. There are also a couple of utility functions. FDist is a niche application (mainly used to detect selection in animal genetics). Not the most fundamental one to support, but it is currently one that I am working on, thus, the code. Regarding my summited code for GenePop, I have summited a different version on bugzilla. The main difference, is that I moved everything from Bio to Bio.PopGen. Before I continue putting code on bugzilla I would like to know if it is worthwhile doing it... Any opinions on the code submitted or if any changes are required? I would really like to continue converting my code to BioPython, but only if it has any possibility of ending up being useful/included in distribution somewhere in the future... 
;) I am currently working on code related to SimCoal2, Arlequin and general statistics (Fst, heterozygosity, ...). Which will probably be ready quite soon (ie, next two weeks). This is more mainstream than FDist I have some other code lying around mainly related to HapMap, but I will only submit it after reviewing and reusing it again. This is more distant future ... like a couple of months. Tiago From bugzilla-daemon at portal.open-bio.org Wed Jan 3 21:38:39 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 3 Jan 2007 16:38:39 -0500 Subject: [Biopython-dev] [Bug 2051] XML Blast parser unusable with multiple queries and recent (2.2.13) blast - patch attached In-Reply-To: Message-ID: <200701032138.l03Lcdji028402@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2051 ------- Comment #13 from mdehoon at ims.u-tokyo.ac.jp 2007-01-03 16:38 ------- > Regardless, I do still see a > number of inconsistencies. Please submit a separate bug report (including your patch) for these inconsistencies. The current bug report is titled "XML Blast parser unusable with multiple queries and recent (2.2.13) blast - patch attached" With Peter's patch, we can now parse multiple blast queries, so I'd like to close this bug report. For future bug reports and patches: Try to handle separate bugs in separate bug reports and patches. For developers, when looking at a patch handling several issues at the same time, it's difficult to understand which parts of the patch are essential, which are good but non-essential, and which are code cleanup. Speaking for myself, I would probably have considered this patch earlier if it had been less convoluted. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Wed Jan 3 21:48:45 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 3 Jan 2007 16:48:45 -0500 Subject: [Biopython-dev] [Bug 2176] New: XML Blast parser: miscellaneous bug fixes and cleanup Message-ID: http://bugzilla.open-bio.org/show_bug.cgi?id=2176 Summary: XML Blast parser: miscellaneous bug fixes and cleanup Product: Biopython Version: Not Applicable Platform: PC OS/Version: Linux Status: NEW Severity: normal Priority: P2 Component: Main Distribution AssignedTo: biopython-dev at biopython.org ReportedBy: jmjoseph at andrew.cmu.edu This follows the discussion started in bug 2051. The blast XML parser does now work (Thanks!), but could still use a little work. Here's a list of the issues I can see now. I'll follow with patches to correct a few. In Record.py, HSP.identities, HSP.gaps, and HSP.positives are still defined as (None,None) tuples. However, in NCBIXML.py, these variables are set as integers. I don't see a point of a tuple at all, at least for NCBIXML. (I realize it is used in NCBIStandalone.py). Most importantly, the inconsistency makes it difficult to handle cases when the parameter is not set. It seems easiest, though, to just retain the tuple format. In the past, I worried that the order of tuple building for self._blast.gap_penalties or ka_params could cause the tuple to have an incorrect ordering. I seem to remember hitting an issue where the tuple was built with the wrong length, but I can't be specific. In general, it remains odd to me to not just use a list and set each element respectively. If necessary, one could convert to a tuple when finished or use some other approach that does not rely upon order. Why not use query_len, as defined in the XML file, or query_length instead of query_letters as a variable name? In BlastParser._end_Iteration, self._blast.query_letters is set. This is not defined/documented in the Parameters class in Record.py. 
Rather, query_length is defined there. In the Header class, though, the name query_letters is used. There also seems to be some confusion between num_letters_in_database, num_sequences_in_database, database_letters, and database_sequences. Note that even if this naming is not corrected, NCBIXML.py:186 is wrong with "self._blast_query_letters" rather than "self._blast.query_letters". Similarly, why store the bit score and E-value as 'bits' and '_hsp.expect'/'descr.e' rather than just using bit_score and evalue, as in the blast XML output? I make use of <Hsp_align-len> in 2.2.13. This value is missing entirely. The parsing of <Hit_id> and <Hit_def> is confusing. For example, <Hit_num>1</Hit_num> <Hit_id>gnl|BL_ORD_ID|0</Hit_id> <Hit_def>3377250</Hit_def> ... results in _hit.title set to "gnl|BL_ORD_ID|0 3377250". I would rather they remain separate (or both methods be used). This is certainly not an exhaustive list. I'm happy to provide another patch correcting many of these inconsistencies. At the very least, the variable names defined in Record.py should be used in NCBIXML.py. May I modify at least the above names to correspond more closely to the names used in the XML? I know I've found this particularly confusing. -Jacob -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
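Keeping the hit id and the hit definition separate, while still offering the combined title, could look roughly like the sketch below. It is not the code in NCBIXML.py: the class name HitHandler and the function parse_hits are hypothetical, and the element names Hit, Hit_id and Hit_def are assumed to be those of the NCBI BLAST XML output.

```python
from xml import sax
from xml.sax.handler import ContentHandler


class HitHandler(ContentHandler):
    """Collect hit id and hit definition separately (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.hits = []
        self._text = []

    def startElement(self, name, attrs):
        # Start collecting character data afresh for every element.
        self._text = []
        if name == "Hit":
            self.hits.append({})

    def characters(self, content):
        # SAX may deliver the text of one element in several chunks.
        self._text.append(content)

    def endElement(self, name):
        value = "".join(self._text).strip()
        if name == "Hit_id":
            self.hits[-1]["hit_id"] = value
        elif name == "Hit_def":
            self.hits[-1]["hit_def"] = value
        elif name == "Hit":
            hit = self.hits[-1]
            # Keep the combined title as well, so both styles are available.
            hit["title"] = "%s %s" % (hit.get("hit_id", ""), hit.get("hit_def", ""))


def parse_hits(xml_bytes):
    """Parse a bytes string of BLAST-like XML and return a list of hit dicts."""
    handler = HitHandler()
    sax.parseString(xml_bytes, handler)
    return handler.hits
```

Storing both the separate fields and the joined title means scripts that rely on the title keep working while new code can use the individual values.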
From bugzilla-daemon at portal.open-bio.org Wed Jan 3 21:50:33 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 3 Jan 2007 16:50:33 -0500 Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug fixes and cleanup In-Reply-To: Message-ID: <200701032150.l03LoXp4028921@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2176 ------- Comment #1 from jmjoseph at andrew.cmu.edu 2007-01-03 16:50 ------- Created an attachment (id=533) --> (http://bugzilla.open-bio.org/attachment.cgi?id=533&action=view) Patch to NCBIXML.py These patches to NCBIXML and Record: * replace query_letters with query_length, * use tuples for _hsp.identities, positives, and gaps * store _hsp.align_length * separate the hit id and hit def elements. title is retained -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jan 3 21:50:53 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 3 Jan 2007 16:50:53 -0500 Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug fixes and cleanup In-Reply-To: Message-ID: <200701032150.l03Lorvn028958@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2176 ------- Comment #2 from jmjoseph at andrew.cmu.edu 2007-01-03 16:50 ------- Created an attachment (id=534) --> (http://bugzilla.open-bio.org/attachment.cgi?id=534&action=view) Patch to Record.py -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Wed Jan 3 21:53:02 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 3 Jan 2007 16:53:02 -0500 Subject: [Biopython-dev] [Bug 2051] XML Blast parser unusable with multiple queries and recent (2.2.13) blast - patch attached In-Reply-To: Message-ID: <200701032153.l03Lr2Th029085@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2051 jmjoseph at andrew.cmu.edu changed: What |Removed |Added ---------------------------------------------------------------------------- Status|ASSIGNED |RESOLVED Resolution| |FIXED ------- Comment #14 from jmjoseph at andrew.cmu.edu 2007-01-03 16:53 ------- Michiel, I have started bug 2176. Thank you for your assistance. -Jacob -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From tiagoantao at gmail.com Fri Jan 5 10:35:59 2007 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Fri, 5 Jan 2007 10:35:59 +0000 Subject: [Biopython-dev] biopython-dev In-Reply-To: References: Message-ID: <6d941f120701050235p437e9283sfad21772401baefa@mail.gmail.com> Hi Ralph, Thanks for the info, let me see if I can sum up what I have and what I am planning to do... I currently work with microsatellite and SNP data (already isolated ones, not retrieved from sequences that I have). I have code (parsers, controllers... it varies from case to case; the quality also varies) related to GenePop, fdist2, SimCoal2, Arlequin. I also have preliminary code to work with HapMap and the UCSC table browser. I have code implementing some statistics like Fst (Cockerham and Weir), expected/observed heterozygosity, ... I will be, in the medium term, quite interested in all the sequence part (Tajima's D, Fu and Li's, and e.g. the new statistic in the Voight 2006 paper). 
Also, linkage disequilibrium is very high on my priority list. I have been thinking quite a bit on representation of markers and populations (especially in a genomic context). e.g. I have noticed that you use a couple of arrays, one with names, the other with sequences, to represent population data. I am currently scratching my head with representation on a genomic scale (ie, multi-marker, mainly because of LD). But I think this will come smoothly when I really start to do LD studies... This is all in a context of detecting selection, disentangling selection from population structure, and hopefully, in the near future coevolution in the context of host/parasite (diseases...). I have set aside some time to assure that all the code that I am doing can be reused by the community. It is my plan to build and maintain this code during the next years (I am funded until 2010 with a PhD grant). Regards, Tiago On 1/4/07, Ralph Haygood wrote: > Tiago, > > Yes, I do still read biopython-dev. But at the moment, I have even > less time than usual, because I'm at a conference. If there's > something you want to ask me, go ahead, but unless the answer is > trivial, it may take me several days. > > You're right that my stuff is very sequence oriented. In fact, it's > very alignment oriented. It can analyze simple insertion/deletion as > well as single-nucleotide variation. 
Here's a typical use case, to
> give you the flavor:
>
> alignment = phylip_file_to_alignment("sm50PromoterSpurAfra.phy")
> populations = {'Spur': range(20), 'Afra': [20]}
> statistics = Statistics(alignment, populations)
> print "ungapped length: %d" % statistics.ungapped_length()
> print "K SNPs: %d" % statistics.get_K('Spur')
> print "K simple indels: %d" % statistics.get_K_simple_indel('Spur')
> print "theta_W SNPs: %g" % statistics.get_theta_W('Spur')
> print "theta_W simple indels: %g" % statistics.get_theta_W_simple_indel('Spur')
> print "pi SNPs: %g" % statistics.get_pi('Spur')
> print "pi simple indels: %g" % statistics.get_pi_simple_indel('Spur')
> print "D_T SNPs: %g" % statistics.get_D_T('Spur')
> print "D_T simple indels: %g" % statistics.get_D_T_simple_indel('Spur')
> print "D_FL SNPs: %g" % statistics.get_D_FL('Spur', 'Afra')
> print "D_FL simple indels: %g" % statistics.get_D_FL_simple_indel('Spur', 'Afra')
> etc.
>
> Spur is Strongylocentrotus purpuratus and Afra is Allocentrotus
> fragilis, two closely related species of sea urchin. In this example,
> I have 20 sequences of a certain region from Spur and one from Afra,
> so I'm analyzing the population genetics of the region within Spur,
> with Afra as an outgroup for doing things like inferring which allele
> is ancestral at a polymorphism within Spur. K is the number of
> polymorphisms, theta_W is Watterson's estimator of 4 x effective
> population size x neutral mutation rate, pi is the average number of
> pairwise differences between alleles, D_T is Tajima's D, D_FL is Fu
> and Li's D (which requires an outgroup), etc. The software can do
> more elaborate things like permutation tests for assessing whether a
> statistic differs between two alignments, which might be something
> like known transcription factor binding sites versus other nucleotide
> sites in a promoter. The canned software DnaSP can't do that, which
> is one of the reasons why I wrote my stuff.
> > Ralph > -- Good judgment comes from experience. Experience comes from bad judgment. - Unknown author From bugzilla-daemon at portal.open-bio.org Sun Jan 7 21:25:18 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sun, 7 Jan 2007 16:25:18 -0500 Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug fixes and cleanup In-Reply-To: Message-ID: <200701072125.l07LPIiS032620@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2176 ------- Comment #3 from mdehoon at ims.u-tokyo.ac.jp 2007-01-07 16:25 ------- > In Record.py, HSP.identities, HSP.gaps, and HSP.positives are still > defined as (None,None) tuples. However, in NCBIXML.py, these > variables are set as integers. I don't see a point of a tuple at all, > at least for NCBIXML. (I realize it is used in NCBIStandalone.py). > Most importantly, the inconsistency makes it difficult to handle cases > when the parameter is not set. It seems easiest, though, to just > retain the tuple format. I don't see a good reason for a tuple either -- though it may have seemed like a good idea back in the days that Blast only produced plain-text output. Instead of making NCBIXML also use a tuple, I'd rather set HSP.identities|gaps|positives to None instead of (None, None) in Record.py. This may break some code for people using NCBIStandalone. On the other hand, it doesn't break Biopython's test suite. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
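Until the record format settles, a script that must cope with both parsers could normalise the value itself. The following is a minimal sketch under stated assumptions: the helper name hsp_count is hypothetical, and the tuple layout (count, aligned length) for the plain-text parser is an assumption based on the (None, None) default mentioned above.

```python
def hsp_count(value):
    """Return an HSP count (identities, positives or gaps) as an int or None.

    Accepts the three shapes discussed in this bug:
      * None                   -> not set
      * (count, length) tuple  -> plain-text parser style, (None, None) if unset
      * plain integer          -> XML parser style
    """
    if value is None:
        return None
    if isinstance(value, tuple):
        # Assumed layout: first element is the count, second the aligned length.
        return value[0]
    return int(value)
```

With such a helper, code written against NCBIStandalone records and code written against NCBIXML records can share the same downstream logic.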
From bugzilla-daemon at portal.open-bio.org Mon Jan 8 15:24:20 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 8 Jan 2007 10:24:20 -0500 Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug fixes and cleanup In-Reply-To: Message-ID: <200701081524.l08FOKFn008935@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2176 ------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2007-01-08 10:24 ------- Regarding the inconsistent use of tuples for _hsp.identities, positives, and gaps - I would like all the parsers NCBIStandalone and NCBIXML (and ideally the HTML parser too) to return identical record objects. To do this, we could either: (a) change NCBIXML to use tuples instead of integers (as suggested by Jacob) or, (b) change NCBIStandalone to use simple integers instead of tuples (is this what you meant in comment 3 Michiel?) Choice (b) would seem simpler in the long term - but would probably break more existing code. Also, users of NCBIXML are going to have to update their scripts anyway after bug 2051, so choice (a) would disrupt fewer people. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jan 8 16:14:36 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 8 Jan 2007 11:14:36 -0500 Subject: [Biopython-dev] [Bug 2043] SProt.py fails to parse the current Swiss-Prot version (RX and OH lines are broken) In-Reply-To: Message-ID: <200701081614.l08GEaMm011511@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2043 ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2007-01-08 11:14 ------- Checked in support for line type OH (Organism Host) for viral hosts based on code from Kristian Rother. 
These lines were just being ignored. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jan 9 16:10:06 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 9 Jan 2007 11:10:06 -0500 Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug fixes and cleanup In-Reply-To: Message-ID: <200701091610.l09GA6Wm004669@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2176 ------- Comment #5 from mdehoon at ims.u-tokyo.ac.jp 2007-01-09 11:10 ------- > Regarding the inconsistent use tuples for _hsp.identities, positives, and gaps > - I would like all the parsers NCBIStandalone and NCBIXML (and ideally the > HTML parser too) to return identical record objects. Even after patch #2090, the NCBIStandalone parser is broken for multiple Blast records, and will probably be broken for single Blast records also when a new Blast version comes out. I haven't tried the HTML parser, but I'd be surprised if it can parse HTML output from recent versions of Blast. So whereas I agree in principle that the three parsers should return identical records objects, in practice it's hardly relevant given that two of the three parsers either don't work or cannot work reliably. > To do this, we could either: > > (a) change NCBIXML to use tuples instead of integers (as suggested by Jacob) All three of us agree that there's no good reason for tuples. Option (a) implies copying a bad design choice from a semi-broken parser to a functioning parser. > or, > > (b) change NCBIStandalone to use simple integers instead of tuples (is this > what you meant in comment 3 Michiel?) > > Choice (b) would seem simpler in the long term - but would probably break more > existing code. 
Also, users of NCBIXML are going to have to update their > scripts anyway after bug 2051, so choice (a) would distrupt less people. Both option (a) and (b) break existing code. So let me suggest option (c): (c) Don't do anything. This doesn't break any code. In the near term, people that use both the plain-text parser and the XML parser will have to deal with differences in the Blast record produced by the parser. But how many people are that anyway? Most likely, not enough to justify option (a). In the long term, assuming that both the plain-text parser and the HTML parser will be deprecated, there will be no more inconsistencies. My question to Jacob: Why do you need to use the plain-text Blast parser? Is there something it can do that the XML parser cannot? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Tue Jan 9 17:29:01 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Tue, 9 Jan 2007 12:29:01 -0500 Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug fixes and cleanup In-Reply-To: Message-ID: <200701091729.l09HT1Vi009189@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2176 ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2007-01-09 12:29 ------- Tuples/Integers for HSP.identities, HSP.gaps, and HSP.positives --------------------------------------------------------------- Michiel's option (c) of doing nothing is very pragmatic. If we go for this, I think we should at least update the record object's documentation to say it will be a tuple (when used with NCBIStandalone) or an integer (when used with NCBIXML). Perhaps we should also change the default in a new record object too... 
query_letters versus query_length --------------------------------- Another of Jacobs suggestions was to rename the record.query_letters (short for number of letters in query?) to something like query_length (which is closer to the actual text of query_len used in the XML file). I personally am not inclined to change this even though it would be slightly clearer. Note that I have corrected the error on line 186 of NCBIXML.py in CVS - well spotted Jacob. This mistake was my fault - recently introduced as part of the changes I made on bug 2051 -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jan 10 20:47:59 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 10 Jan 2007 15:47:59 -0500 Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug fixes and cleanup In-Reply-To: Message-ID: <200701102047.l0AKlxX6027453@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2176 ------- Comment #7 from mdehoon at ims.u-tokyo.ac.jp 2007-01-10 15:47 ------- > Another of Jacobs suggestions was to rename the record.query_letters (short for > number of letters in query?) to something like query_length (which is closer to > the actual text of query_len used in the XML file). I personally am not > inclined to change this even though it would be slightly clearer. In principle I agree with Jacob on this one. But as Jacob also indicates, there are probably more variable names that are less than ideal. So if we change these variable names, it's better to change all of them at the same time. This, however, will break a lot of existing code. With all the other changes to the Blast parsers, now doesn't seem to be the best time for such a change. However, let's get back to this point once the dust settles with the Blast parsers. 
With hit_id, hit_def, and hsp.align_length, I see no problems with Jacob's suggestion. Objections, anybody? -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Wed Jan 10 21:05:23 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Wed, 10 Jan 2007 16:05:23 -0500 Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug fixes and cleanup In-Reply-To: Message-ID: <200701102105.l0AL5Nht028299@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2176 ------- Comment #8 from jmjoseph at andrew.cmu.edu 2007-01-10 16:05 ------- > My question to Jacob: > Why do you need to use the plain-text Blast parser? Is there something it can > do that the XML parser cannot? I use only the XML parser. My greatest concern is not that the plain-text and XML parsers are different, but rather that the XML parser is not consistent with Record.py. An example that I consider completely broken is the definition of query_length in Record.py, but the use of self._blast.query_letters in NCBIXML.py. To avoid breaking the existing plain-text parser code, would it be too objectionable to use a new class, Record-XML.py, with definitions that exactly match the usage in NCBIXML.py? Since few people are likely to use both parsers, and any using the XML parser have required recent code updates anyway, perhaps this separation would be easiest. Thanks. -Jacob -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Thu Jan 11 05:18:17 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Thu, 11 Jan 2007 00:18:17 -0500 Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug fixes and cleanup In-Reply-To: Message-ID: <200701110518.l0B5IHuD018624@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2176 ------- Comment #9 from mdehoon at ims.u-tokyo.ac.jp 2007-01-11 00:18 ------- > To avoid breaking the existing plain-text parser code, would it be too > objectionable to use a new class, Record-XML.py, with definitions that exactly > match the usage in NCBIXML.py? Go ahead, but don't add it to Biopython ;-). It would just add to the confusion, without a real benefit to users. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From kosa at genesilico.pl Thu Jan 11 08:52:27 2007 From: kosa at genesilico.pl (Jan Kosinski) Date: Thu, 11 Jan 2007 09:52:27 +0100 Subject: [Biopython-dev] powerful Alignment class in Biopython? Message-ID: <45A5FACB.40109@genesilico.pl> Hi, Is anyone going to develop or developing now an Alignment class in Biopython as powerful as for example SimpleAlign in Bioperl? Look here for instance for methods available in Bioperl http://doc.bioperl.org/releases/bioperl-1.2/Bio/SimpleAlign.html. The reason I am asking is that I do not know if I should start working on more functional subclass of biopython Alignment class (I do not want to come back to Perl ;-)... Regards, Janek From fkauff at duke.edu Thu Jan 11 09:27:25 2007 From: fkauff at duke.edu (Frank) Date: Thu, 11 Jan 2007 10:27:25 +0100 Subject: [Biopython-dev] powerful Alignment class in Biopython? 
In-Reply-To: <45A5FACB.40109@genesilico.pl> References: <45A5FACB.40109@genesilico.pl> Message-ID: <1168507645.2888.3.camel@osiris.biologie.uni-kl.de> Hi Janek, the Nexus parser in Biopython (for which I still haven't written any documentation yet...) basically holds an alignment, and has some methods that deal with basic alignment functionality. If you're going to work on a more sophisticated alignment class, maybe we should try to get the Nexus class and the alignment class to work smoothly together. Frank On Thu, 2007-01-11 at 09:52 +0100, Jan Kosinski wrote: > Hi, > > Is anyone going to develop or developing now an Alignment class in > Biopython as powerful as for example SimpleAlign in Bioperl? Look here > for instance for methods available in Bioperl > http://doc.bioperl.org/releases/bioperl-1.2/Bio/SimpleAlign.html. > > The reason I am asking is that I do not know if I should start working > on more functional subclass of biopython Alignment class (I do not want > to come back to Perl ;-)... > > Regards, > Janek > > > > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev From tiagoantao at gmail.com Thu Jan 11 10:36:35 2007 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Thu, 11 Jan 2007 10:36:35 +0000 Subject: [Biopython-dev] PopGen code Message-ID: <6d941f120701110236h738799b0i662455bfc98256d1@mail.gmail.com> Hi, A couple of weeks ago I put code on bugzilla related to population genetics, namely parsing of GenePop and Fdist files, plus code to control fdist. I have had no feedback whatsoever, namely comments on the quality of the code, if there is interest in adding it in the future to BioPython, etc... I have much more code that I could start converting to BioPython format, some of which is a bit more complicated to convert (e.g., an Arlequin format parser). 
Before I start doing it I would like to know if there will be any feedback at all or if I am just losing my time... Regards, Tiago -- Good judgment comes from experience. Experience comes from bad judgment. - Unknown author From kosa at genesilico.pl Thu Jan 11 11:34:50 2007 From: kosa at genesilico.pl (Jan Kosinski) Date: Thu, 11 Jan 2007 12:34:50 +0100 Subject: [Biopython-dev] powerful Alignment class in Biopython? In-Reply-To: <1168507645.2888.3.camel@osiris.biologie.uni-kl.de> References: <45A5FACB.40109@genesilico.pl> <1168507645.2888.3.camel@osiris.biologie.uni-kl.de> Message-ID: <45A620DA.4000302@genesilico.pl> Hi, I have a feeling that it would be better to write all methods similar to the BioPerl ones directly for the BioPython Alignment class. The main reason is that this class is not tied to any format like Fasta, Clustal or Nexus. It stores SeqRecords, which are also not tied to Fasta or any other format. It would make many things easier. For instance, I can write all my functions which do something with alignments so that they accept general Alignment objects (and not necessarily FastaAlignment or ClustalAlignment objects). Wouldn't it be better to write all the code that does general things with alignments (column counting, column selection/removal etc.) so that it works with the general Alignment class rather than with a class for alignments in one specific biological format? Janek Frank wrote: > Hi Janek, > > then Nexus parser in Biopython (for which I still haven't written any > documentation yet...) basically holds an alignment, and has some methods > that deal with basic alignment functionality. If you're going to work on > a more sophisticated alignment class, maybe we should try to get Nexus > class and alignment class work smoothly together. > > Frank > > > On Thu, 2007-01-11 at 09:52 +0100, Jan Kosinski wrote: > >> Hi, >> >> Is anyone going to develop or developing now an Alignment class in >> Biopython as powerful as for example SimpleAlign in Bioperl? 
Look here >> for instance for methods available in Bioperl >> http://doc.bioperl.org/releases/bioperl-1.2/Bio/SimpleAlign.html. >> >> The reason I am asking is that I do not know if I should start working >> on more functional subclass of biopython Alignment class (I do not want >> to come back to Perl ;-)... >> >> Regards, >> Janek >> >> >> >> _______________________________________________ >> Biopython-dev mailing list >> Biopython-dev at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython-dev >> From kosa at genesilico.pl Thu Jan 11 12:11:11 2007 From: kosa at genesilico.pl (Jan Kosinski) Date: Thu, 11 Jan 2007 13:11:11 +0100 Subject: [Biopython-dev] [BioPython] what to use for working with fasta sequences and alignments? In-Reply-To: <45A50D24.1090906@maubp.freeserve.co.uk> References: <45A500F3.9090001@genesilico.pl> <45A50D24.1090906@maubp.freeserve.co.uk> Message-ID: <45A6295F.5030103@genesilico.pl> Are you going to fix this in the new SeqIO?: When using Bio.SeqIO.FASTA.FastaReader the names of the sequences are stripped away after the first "space". Janek Peter (BioPython List) wrote: > Jan Kosinski wrote: >> Hi, >> >> I am quite new in BioPython and I am a little bit confused when >> trying to use BioPython for working with fasta sequences and alignments. >> >> For instance, I can read and parse fasta files with Bio.Fasta, return >> records (as Fasta.record class), iterate and so on. But then I am >> going to Bio.Fasta.FastaAlign module which offers FastaAlignment >> (subclass of Alignment class) class. However, this class has very >> limited methods and get_all_seqs and get_seq_by_num return SeqRecord >> object instead of Fasta.record (why??) what makes it hard to use >> Bio.Fasta.FastaAlign (with SeqRecord) for alignments with Bio.Fasta >> (with Fasta.record) for sequences. Maybe I am wrong but Biopython >> seems to be full of incompatibilities. Or one should know which >> modules and classes should not be used? 
>> >> Could you recommend me what should I use for my work with fasta >> sequences and alignments? Which BioPython modules and classes? > > You can use Bio.Fasta to read in files either as Fasta.Record objects, > or as SeqRecord objects. I would use SeqRecord objects - they are > more general should you ever want to use a different input file format > - plus as you have noticed, the alignment object also uses SeqRecord > objects to hold each (gapped) sequence. > > There are other options if you search the code - but Bio.Fasta is the > best documented and most used. > > If you are brave, then you might have a look at the new code in > Bio.SeqIO which you can get from CVS. This is still in a state of > flux however... but the Fasta parsing is much faster. See this page > and the mailing list archives for more: > > http://www.biopython.org/wiki/SeqIO > > > Or should I use other packages like CoreBio? > > You could do - it has the advantage of having started recently from a > clean slate, and having much less "old code". > >> Thank you in advance for any guidelines, >> Janek Kosinski > > Peter From biopython-dev at maubp.freeserve.co.uk Fri Jan 12 12:36:56 2007 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Fri, 12 Jan 2007 12:36:56 +0000 Subject: [Biopython-dev] PopGen code In-Reply-To: <6d941f120701110236h738799b0i662455bfc98256d1@mail.gmail.com> References: <6d941f120701110236h738799b0i662455bfc98256d1@mail.gmail.com> Message-ID: <45A780E8.2070803@maubp.freeserve.co.uk> Tiago Ant?o wrote: > Hi, > > A couple of weeks ago I have put on bugzilla code related to > population genetics, namely parsing of GenePop and Fdist files, plus > code to control fdist. > I have had no feedback whatsoever, namely comments to the quality of > the code, if there is interest in adding it in the future to > BioPython, etc... 
> > I have much more code that I could start converting to BioPython > format, some of which is a bit more complicated to convert (e.g., an > Arlequin format parser). Before I start doing it I would like to know > if there will be any feedback at all or if I am just loosing my > time... I suppose I/we would be able to read your code from a general perspective (coding style, clarity of comments, etc). I haven't made time for this. I suspect BioPython currently has no active developers who feel qualified to interpret your population genetics code. I was hoping that you and Ralph Haygood would combine forces - if you are both happy with some code that does bode well. Any comments Michiel? Regarding population genetic file formats - from a very quick search about Arlequin it sounds like this file format can hold lots of different types of data. I would encourage you to try and come up with a generic population record data object that could hold this or information from GenePop or Fdist as well. I have no idea how easy this would be... Peter From tiagoantao at gmail.com Fri Jan 12 14:16:53 2007 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Fri, 12 Jan 2007 14:16:53 +0000 Subject: [Biopython-dev] PopGen code In-Reply-To: <45A780E8.2070803@maubp.freeserve.co.uk> References: <6d941f120701110236h738799b0i662455bfc98256d1@mail.gmail.com> <45A780E8.2070803@maubp.freeserve.co.uk> Message-ID: <6d941f120701120616t61db8102o2f3eb3ef3da12fef@mail.gmail.com> Hi, Thanks for the answer. > I suspect BioPython currently has no active developers who feel > qualified to interpret your population genetics code. I was hoping that > you and Ralph Haygood would combine forces - if you are both happy with > some code that does bode well. Any comments Michiel? I think Ralph (who subscribes to this list, and thus can comment) has strong time constraints, and will probably have little available time in the near future... 
> Regarding population genetic file formats - from a very quick search > about Arlequin it sounds like this file format can hold lots of > different types of data. I would encourage you to try and come up with > a generic population record data object that could hold this or > information from GenePop or Fdist as well. I have no idea how easy this > would be... I have been thinking a lot about a generic data structure to hold population genomic (i.e. not only genetic) data. I have, in fact, implemented (in CAML, not Python) quite a few different data representations. I was not happy with any of them. Different kinds of markers (that sometimes overlap - e.g. sequences and SNPs), linkage disequilibrium (thus relations between markers...), ploidy (no need to think of different organisms, think mitochondria, nuclear chromosomes, Y chromosome), ... make a general solution non-trivial. As I see it, there are a few options:
1. Have a grand, unified structure, but that will take time to mature
2. Assume that there will be different representations for different scopes, assume that that is a bad thing and live with that
3. Assume that there will be different representations, and that that is good, in the sense that a one-size-fits-all approach in this case has lots of problems
I think the pragmatic approach for now is not to have a generic representation. I would lean more towards letting things mature (develop statistics, parsers, ...) and, once there is more experience (and, hopefully, user feedback), reassessing the issue of a general representation. I am aware that this will entail each part of the code having a different calling data structure, but I think that with care and common sense that won't be very problematic.
I don't mind having the code on an alpha branch for as long as you see fit; I just want to be sure that whatever effort I put in converting (or creating new) my code to BioPython is not lost. That is why I would like feedback on what will happen to the code that I am submitting. I am willing to accommodate any reasonable requirements regarding code quality and development process... Regards, Tiago -- Good judgment comes from experience. Experience comes from bad judgment. - Unknown author From bsouthey at gmail.com Fri Jan 12 14:30:26 2007 From: bsouthey at gmail.com (Bruce Southey) Date: Fri, 12 Jan 2007 08:30:26 -0600 Subject: [Biopython-dev] PopGen code In-Reply-To: <6d941f120701120616t61db8102o2f3eb3ef3da12fef@mail.gmail.com> References: <6d941f120701110236h738799b0i662455bfc98256d1@mail.gmail.com> <45A780E8.2070803@maubp.freeserve.co.uk> <6d941f120701120616t61db8102o2f3eb3ef3da12fef@mail.gmail.com> Message-ID: Hi, While I do have a remote interest in this, I do not have any time to look at this at present. As I mentioned in a previous email, John Cole is doing related work in Python, but not as part of BioPython. It would probably be good to have some unified approach and direction because of the overlaps that occur and so other pieces of code can be easily added. Regards Bruce On 1/12/07, Tiago Antão wrote: > Hi, > > Thanks for the answer. > > > I suspect BioPython currently has no active developers who feel > > qualified to interpret your population genetics code. I was hoping that > > you and Ralph Haygood would combine forces - if you are both happy with > > some code that does bode well. Any comments Michiel? > > I think Ralph (who subscribes to this list, and thus can comment) has > strong time constraints, and will probably have little available time > in the near future... > > > Regarding population genetic file formats - from a very quick search > > about Arlequin it sounds like this file format can hold lots of > > different types of data.
I would encourage you to try and come up with > > a generic population record data object that could hold this or > > information from GenePop or Fdist as well. I have no idea how easy this > > would be... > > I have been thinking a lot about a generic data structure to hold > population genomic (ie not only genetic) data. I have, in fact, > implemented (in CAML, not Python) quite a few different data > representations. I was not happy with none of them. Different kinds of > markers (that sometimes overlap - eg sequences and SNPs), linkage > disequilibrium (thus relations between markers...), ploidy (no need to > think on different organisms, think mitochondria, nuclear chromosomes, > Y chromosome), ... make a general solution not trivial. > As I see it, there are a few options: > 1. Have a grand, unified structure, but that will take time to mature > 2. Assume that there will be different representations for different > scopes, assume that that is a bad thing and live with that > 3. Assume that there will be different representations, and that that > is good, in the sense that a one size, fits all approach in this case > has lots of problems > > I think the pragmatic approach for now is not to have a generic > representation. I would lean more to let things mature (develop > statistics, parsers, ...) and after there is more experience (and, > hopefully, user feedback) then reassess the issue of a general > representation. I am aware that this will entail each part of code > having a different calling data structure, but I think that with care > and common sense that won't be very problematic. > > I don't mind having the code on an alpha branch for as long as you see > fit, I just want to be sure that whatever effort I put in converting > (or creating new) my code to BioPython is not lost, that is why I > would like feedback on what will happen to the code that I am > submitting. 
I am willing to accommodate any reasonable requirements > regarding code quality and development process... > > Regards, > Tiago > > -- > Good judgment comes from experience. > Experience comes from bad judgment. > - Unknown author > _______________________________________________ > Biopython-dev mailing list > Biopython-dev at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython-dev > From bugzilla-daemon at portal.open-bio.org Sat Jan 13 00:27:11 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Fri, 12 Jan 2007 19:27:11 -0500 Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug fixes and cleanup In-Reply-To: Message-ID: <200701130027.l0D0RBlR027978@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2176 ------- Comment #10 from mdehoon at ims.u-tokyo.ac.jp 2007-01-12 19:27 ------- I've committed the code to handle hit_id, hit_def, and hsp.align_length to CVS. Let's keep this bug report open for now to remind ourselves to revisit the issues with variable names at some point. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Mon Jan 15 11:04:24 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 15 Jan 2007 06:04:24 -0500 Subject: [Biopython-dev] [Bug 2043] SProt.py fails to parse the current Swiss-Prot version (RX and OH lines are broken) In-Reply-To: Message-ID: <200701151104.l0FB4OdQ015531@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=2043 k_rother at yahoo.de changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |k_rother at yahoo.de -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jan 15 11:07:33 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 15 Jan 2007 06:07:33 -0500 Subject: [Biopython-dev] [Bug 1956] SwissProt release 49 - Support for new DT lines In-Reply-To: Message-ID: <200701151107.l0FB7Xdb015724@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=1956 ------- Comment #2 from k_rother at yahoo.de 2007-01-15 06:07 ------- Created an attachment (id=543) --> (http://bugzilla.open-bio.org/attachment.cgi?id=543&action=view) new date() method handling new style DT lines new to bugzilla. don't know whether this is the proper way to commit code. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Mon Jan 15 11:17:55 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 15 Jan 2007 06:17:55 -0500 Subject: [Biopython-dev] [Bug 1956] SwissProt release 49 - Support for new DT lines In-Reply-To: Message-ID: <200701151117.l0FBHtd1016589@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=1956 ------- Comment #3 from k_rother at yahoo.de 2007-01-15 06:17 ------- Created an attachment (id=544) --> (http://bugzilla.open-bio.org/attachment.cgi?id=544&action=view) SProt.py that digests all 250,000 Uniprot entries successfully. also checked the data record contents whether the dates and version numbers of the first few entries are correct. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jan 15 11:19:59 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 15 Jan 2007 06:19:59 -0500 Subject: [Biopython-dev] [Bug 1956] SwissProt release 49 - Support for new DT lines In-Reply-To: Message-ID: <200701151119.l0FBJxt9016846@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=1956 k_rother at yahoo.de changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |k_rother at yahoo.de Status|NEW |RESOLVED Resolution| |WORKSFORME ------- Comment #4 from k_rother at yahoo.de 2007-01-15 06:19 ------- i think this should finish the bug unless someone wants to beautify the code. KR -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. 
From bugzilla-daemon at portal.open-bio.org Mon Jan 15 12:26:57 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 15 Jan 2007 07:26:57 -0500 Subject: [Biopython-dev] [Bug 1956] SwissProt release 49 - Support for new DT lines In-Reply-To: Message-ID: <200701151226.l0FCQvNE022159@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=1956 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |biopython-bugzilla at maubp.freeserve.co.uk Status|RESOLVED |REOPENED Resolution|WORKSFORME | ------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2007-01-15 07:26 ------- Reopening - it's not fixed until we have updated the code in CVS. However, I will try and have a look at your code. By the way - in general developers prefer patches rather than chunks of code, or edited copies of the original. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From bugzilla-daemon at portal.open-bio.org Mon Jan 15 12:51:46 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Mon, 15 Jan 2007 07:51:46 -0500 Subject: [Biopython-dev] [Bug 1956] SwissProt release 49 - Support for new DT lines In-Reply-To: Message-ID: <200701151251.l0FCpk8H023891@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=1956 biopython-bugzilla at maubp.freeserve.co.uk changed: What |Removed |Added ---------------------------------------------------------------------------- Status|REOPENED |RESOLVED Resolution| |FIXED ------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2007-01-15 07:51 ------- I have updated CVS with a slightly modified version of your code, Kristian.
See revision 1.36, web version here: http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/SwissProt/SProt.py?cvsroot=biopython It passes the old unit test, test_SProt.py, but if you could double check this on the latest release that would be great. Thanks very much. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee. From biopython-dev at maubp.freeserve.co.uk Mon Jan 15 20:04:34 2007 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Mon, 15 Jan 2007 20:04:34 +0000 Subject: [Biopython-dev] Bio.SeqIO In-Reply-To: <45A94BFD.5080209@c2b2.columbia.edu> References: <45A500F3.9090001@genesilico.pl> <45A50D24.1090906@maubp.freeserve.co.uk> <45A6295F.5030103@genesilico.pl> <45A6845A.1020903@maubp.freeserve.co.uk> <45A76B12.7000809@genesilico.pl> <45A77DDE.3070504@maubp.freeserve.co.uk> <45A94BFD.5080209@c2b2.columbia.edu> Message-ID: <45ABDE52.1030300@maubp.freeserve.co.uk> Michiel de Hoon wrote: > In my opinion, the new Bio.SeqIO code is a huge improvement to > Biopython, so I'd be happy to make a new release for it. > > ... > > For Bio.SeqIO, we're also in pretty good shape, as far as I can tell. > From what I remember, the remaining issues were > 1) Which functionality to include, in particular > a) if functions should accept file names in addition to file handles; I have decided to follow Michiel's stance on this issue: handles only. > b) if functions should infer the file format from the file extension, > the file content, or otherwise. Right now the file format string is optional, and if it is omitted the file extension (via handle.name) is used to try to guess the format. It would be trivial to remove this functionality and make format a required argument. We could at a later date choose to add limited support for format guessing based on file contents without altering the function parameters (i.e. the API).
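As a rough illustration of the extension-based guessing described above, here is a minimal sketch. It is not the Bio.SeqIO code from CVS: the function name guess_format and the extension table are hypothetical and only assume a handle that may carry a name attribute.

```python
import io
import os

# Hypothetical extension table, for illustration only; the mapping actually
# used by Bio.SeqIO is not given in this thread.
EXTENSION_TO_FORMAT = {
    ".fasta": "fasta",
    ".fa": "fasta",
    ".gb": "genbank",
    ".gbk": "genbank",
    ".aln": "clustal",
    ".nex": "nexus",
    ".phy": "phylip",
}


def guess_format(handle, format=None):
    """Return the explicit format if given, else guess it from handle.name."""
    if format is not None:
        # An explicit format string always wins over guessing.
        return format.lower()
    name = getattr(handle, "name", None)
    if not isinstance(name, str):
        # In-memory handles and pipes may have no usable file name.
        raise ValueError("No format given and the handle has no file name")
    extension = os.path.splitext(name)[1].lower()
    if extension not in EXTENSION_TO_FORMAT:
        raise ValueError("Cannot guess a format from extension %r" % extension)
    return EXTENSION_TO_FORMAT[extension]


# Usage: an in-memory handle has no name, so the format must be passed in.
in_memory = io.StringIO(">seq1\nACGT\n")
explicit = guess_format(in_memory, "Fasta")
```

The failure cases (no name, unknown extension) are the kind of support headache that making format a required argument would avoid.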
Both these features would be nice to have (speaking as a user) but then again, am I prepared to support the headaches they may cause later on? I'm wavering on this issue (having previously been in favour of including the format guessing). Item 1(c) on Michiel's list could have been whether we need the three "helper functions" which turned a file into a SeqRecord list, dictionary or alignment. Again, I have come round to Michiel's view and removed these as they were just simple wrappers for list, SequencesToDictionary and SequencesToAlignment. > 2) What are the best names for the functions that the user will see. The good news is that after that little spring clean there are fewer functions to name - just these four really:
SequenceIterator, once known as FileToSequenceIterator and before that File2SequenceIterator. Now takes just an input file handle and an optional file format. Returns a SeqRecord iterator.
SequencesToDictionary - takes a SeqRecord iterator or list, plus an optional function to define the keys, and returns a dictionary.
SequencesToAlignment - takes a SeqRecord iterator or list, and returns an alignment object. Perhaps this functionality should be included in the alignment class itself...
WriteSequences, once known as SequencesToFile - takes a SeqRecord iterator or list, an output handle, and a format string. Intended for use on a whole file at once (i.e. the general case where there may be headers/footers etc).
This does not let you do incremental writes, one for each record (which would be possible for some formats like GenBank or fasta). Peter From mdehoon at c2b2.columbia.edu Mon Jan 15 23:00:16 2007 From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon) Date: Mon, 15 Jan 2007 18:00:16 -0500 Subject: [Biopython-dev] Bio.SeqIO In-Reply-To: <45ABDE52.1030300@maubp.freeserve.co.uk> References: <45A500F3.9090001@genesilico.pl> <45A50D24.1090906@maubp.freeserve.co.uk> <45A6295F.5030103@genesilico.pl> <45A6845A.1020903@maubp.freeserve.co.uk> <45A76B12.7000809@genesilico.pl> <45A77DDE.3070504@maubp.freeserve.co.uk> <45A94BFD.5080209@c2b2.columbia.edu> <45ABDE52.1030300@maubp.freeserve.co.uk> Message-ID: <45AC0780.9060306@c2b2.columbia.edu> Peter wrote: > WriteSequences, once known as SequencesToFile - takes a SeqRecord > iterator or list, and output handle, and a format string. Intended for > use on a whole file at once (i.e. the general case where there may be > headers/footers etc). This does not let you do incremental writes one > for each record (which would be possible for some formats like GenBank > or fasta) At the end of WriteSequences, the file is closed:
    def WriteSequences(sequences, handle, format) :
        ...
        handle.close() #just in case the writer object forgot
Why would it be a problem if the handle is not closed? --Michiel.
-- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 From biopython-dev at maubp.freeserve.co.uk Tue Jan 16 11:10:21 2007 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Tue, 16 Jan 2007 11:10:21 +0000 Subject: [Biopython-dev] Bio.SeqIO In-Reply-To: <45AC0780.9060306@c2b2.columbia.edu> References: <45A500F3.9090001@genesilico.pl> <45A50D24.1090906@maubp.freeserve.co.uk> <45A6295F.5030103@genesilico.pl> <45A6845A.1020903@maubp.freeserve.co.uk> <45A76B12.7000809@genesilico.pl> <45A77DDE.3070504@maubp.freeserve.co.uk> <45A94BFD.5080209@c2b2.columbia.edu> <45ABDE52.1030300@maubp.freeserve.co.uk> <45AC0780.9060306@c2b2.columbia.edu> Message-ID: <45ACB29D.2010002@maubp.freeserve.co.uk> Michiel Jan Laurens de Hoon wrote: > Peter wrote: >> WriteSequences, once known as SequencesToFile - takes a SeqRecord >> iterator or list, and output handle, and a format string. Intended for >> use on a whole file at once (i.e. the general case where there may be >> headers/footers etc). This does not let you do incremental writes one >> for each record (which would be possible for some formats like GenBank >> or fasta) > > At the end of WriteSequences, the file is closed: > > def WriteSequences(sequences, handle, format) : > ... > handle.close() #just in case the writer object forgot > > Why would it be a problem if the handle is not closed? OK, I've fixed that. That issue was on my mind too - in particular it would stop Bio.SeqIO from creating concatenated phylip alignments which are used in bootstrapping. Reading this sort of file is a different issue, which I am also currently thinking about. 
Peter From biopython-dev at maubp.freeserve.co.uk Tue Jan 16 12:48:49 2007 From: biopython-dev at maubp.freeserve.co.uk (Peter) Date: Tue, 16 Jan 2007 12:48:49 +0000 Subject: [Biopython-dev] Bio.SeqIO - Output In-Reply-To: <45ACB29D.2010002@maubp.freeserve.co.uk> References: <45A500F3.9090001@genesilico.pl> <45A50D24.1090906@maubp.freeserve.co.uk> <45A6295F.5030103@genesilico.pl> <45A6845A.1020903@maubp.freeserve.co.uk> <45A76B12.7000809@genesilico.pl> <45A77DDE.3070504@maubp.freeserve.co.uk> <45A94BFD.5080209@c2b2.columbia.edu> <45ABDE52.1030300@maubp.freeserve.co.uk> <45AC0780.9060306@c2b2.columbia.edu> <45ACB29D.2010002@maubp.freeserve.co.uk> Message-ID: <45ACC9B1.6010709@maubp.freeserve.co.uk> I've been thinking about sequence output (i.e. writing sequence files), and have come to the conclusion that my writer classes in Bio/SeqIO/Interfaces.py are probably too complicated. My current Bio.SeqIO output implementation tries to be very flexible - if you look beyond the top level function WriteSequences (aka SequencesToFile) then the individual writer classes have a confusing range of capabilities. New Idea ======== I was thinking that we should only support two cases for sequence output: (*) simple sequential file formats - record by record, or file at once - can use a SeqRecord iterator (or a list) (*) all other file formats - file at once only - probably needs a list of SeqRecords (not an iterator) For the sequential file formats such as fasta, genbank and swiss there are no headers or footers - and a single sequence alone would be a valid file. For all other file formats (e.g. clustal, stockholm, phylip, anything in XML, ...) we would only offer the "file at once" option. When implementing a writer for a new file format, you just have to implement a "write file" function or a "write record" function which takes the record(s) and a handle. The implementation details are up to you. 
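The two cases above could be sketched roughly as follows. This is only an illustration, not the code in Bio/SeqIO/Interfaces.py: the function names are hypothetical, records are plain (identifier, sequence) tuples standing in for SeqRecord objects, and the PHYLIP writer is a simplified sequential layout.

```python
import io


def write_fasta_record(record, handle):
    """Sequential format: one record at a time, no header or footer."""
    identifier, sequence = record
    handle.write(">%s\n" % identifier)
    for start in range(0, len(sequence), 60):
        handle.write(sequence[start:start + 60] + "\n")


def write_sequential(records, handle, write_record):
    """File at once for a sequential format; works with an iterator or a list."""
    count = 0
    for record in records:
        write_record(record, handle)
        count += 1
    # The handle is deliberately left open so the caller can keep writing.
    return count


def write_phylip_file(records, handle):
    """File at once only: the header needs the record count and length."""
    records = list(records)  # an iterator must be exhausted before the header
    if not records:
        raise ValueError("Need at least one record")
    length = len(records[0][1])
    for identifier, sequence in records:
        if len(sequence) != length:
            raise ValueError("All sequences must have the same length")
    handle.write(" %d %d\n" % (len(records), length))
    for identifier, sequence in records:
        # Simplified: names truncated or padded to 10 characters.
        handle.write("%s%s\n" % (identifier[:10].ljust(10), sequence))


# Usage with in-memory handles.
fasta_out = io.StringIO()
written = write_sequential(iter([("a", "ACGT"), ("b", "AC")]),
                           fasta_out, write_fasta_record)
phylip_out = io.StringIO()
write_phylip_file([("seq1", "ACGT"), ("seq2", "AC-T")], phylip_out)
```

Because neither writer closes the handle, calling write_phylip_file repeatedly on the same handle would give the concatenated alignments mentioned earlier in the thread.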
Drawbacks ========= There are some sequential file formats where, under the scheme above, you would be forced to write the file in one go... However, I can only think of one irrelevant example, so this may not matter. Can anyone suggest some other examples? Some sort of simple tabular file with a header row maybe? For example simple Stockholm files (if you ignore the PFAM style annotation) have a generic header, followed by sequential records and a generic footer. The point here is that the header does not contain anything about the records which will follow it. e.g. The number of records, or if they are protein or nucleotides. For files like this it would be possible to write the file record by record given an iterator - provided you also write the header and footer. Right now this is the only file format I can think of that has this property - and I don't currently even support this (instead like BioPerl I create Stockholm files with PFAM style annotations). Stockholm files with PFAM style annotation do not qualify, because the header contains the number of records. Similarly for non-interlaced PHYLIP. Peter From mdehoon at c2b2.columbia.edu Tue Jan 16 17:51:23 2007 From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon) Date: Tue, 16 Jan 2007 12:51:23 -0500 Subject: [Biopython-dev] [BioPython] Next release plans; was: what to use for working with fasta sequences and alignments? 
In-Reply-To: <45AA34AE.5080100@maubp.freeserve.co.uk> References: <45A500F3.9090001@genesilico.pl> <45A50D24.1090906@maubp.freeserve.co.uk> <45A6295F.5030103@genesilico.pl> <45A6845A.1020903@maubp.freeserve.co.uk> <45A76B12.7000809@genesilico.pl> <45A77DDE.3070504@maubp.freeserve.co.uk> <45A94BFD.5080209@c2b2.columbia.edu> <45AA34AE.5080100@maubp.freeserve.co.uk> Message-ID: <45AD109B.4030009@c2b2.columbia.edu> Peter wrote: > Regarding the fix checked in on bug 1970 I still would prefer we call > the new XML iterator NCBIXML.Iterator(handle) rather than > NCBIXML.parse(handle) but I'll live ;) > I chose "parse" because it is used in the old (Biopython release 1.42) Blast XML parser:
Old:
>>> from Bio.Blast import NCBIXML
>>> b_parser = NCBIXML.BlastParser()
>>> b_record = b_parser.parse(blast_out)
New:
>>> from Bio.Blast import NCBIXML
>>> b_records = NCBIXML.parse(blast_out)
>>> b_record = b_records.next()
# Repeat to get subsequent Blast records
Whereas I am not dead set on "parse", it agrees with similar functions in Python:
1) Function name is a verb, not a noun
2) Function name describes what the function does, not what the function returns
3) Function names are short, and start with a lower case letter.
For example, to read a file line-by-line in Python:
>>> inputfile = open("somefunnyfile")
# "open"; not "Iterator", nor "FileToLineIterator",
# even though "open" returns an iterator:
>>> for line in inputfile:
...     print line
To read an image file with the Python Imaging Library:
>>> import Image
>>> im = Image.open("lena.ppm")
# "open"; not "Image", nor "FileNameToImage".
To read a Python object from a pickled file:
>>> import pickle
>>> inputfile = open("somepickledfile")
>>> myobject = pickle.load(inputfile)
# "load"; not "FileToObject".
>>> inputfile.close()
To parse an XML file with the sax parser framework in Python:
>>> from xml.sax.handler import ContentHandler
>>> from xml import sax
>>> handler = SomeSubclassOfContentHandler()
>>> inputfile = open("myxmlfile.xml")
>>> sax.parse(inputfile, handler)
# "parse", same as in the new Bio.Blast.NCBIXML
>>> inputfile.close()
So, for Bio.Blast.NCBIXML, good names would be "load", "read", "parse", or something similar. "Iterator" would not be consistent; besides, until recently I didn't know what an iterator is, so I doubt that new users would know. What we could do is to have two functions in Bio.Blast.NCBIXML, perhaps one called "read" and the other "iterate", where the former returns a single Blast record (for an XML file containing only one Blast result), and the latter an iterator over multiple Blast records. --Michiel. -- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 From mdehoon at c2b2.columbia.edu Tue Jan 16 17:47:17 2007 From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon) Date: Tue, 16 Jan 2007 12:47:17 -0500 Subject: [Biopython-dev] [BioPython] Next release plans; was: what to use for working with fasta sequences and alignments? In-Reply-To: <45AA34AE.5080100@maubp.freeserve.co.uk> References: <45A500F3.9090001@genesilico.pl> <45A50D24.1090906@maubp.freeserve.co.uk> <45A6295F.5030103@genesilico.pl> <45A6845A.1020903@maubp.freeserve.co.uk> <45A76B12.7000809@genesilico.pl> <45A77DDE.3070504@maubp.freeserve.co.uk> <45A94BFD.5080209@c2b2.columbia.edu> <45AA34AE.5080100@maubp.freeserve.co.uk> Message-ID: <45AD0FA5.3060008@c2b2.columbia.edu> Peter wrote: > In general, I agree that the Blast XML parser in CVS looks in good shape > - but we really need to update the documentation for using Blast for the > next release. > Yeah I know, I've been holding off on updating the documentation so it is consistent with the latest Biopython release 1.42.
I'll update it together with the next release. --Michiel. -- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 From bugzilla-daemon at portal.open-bio.org Sat Jan 27 23:01:07 2007 From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org) Date: Sat, 27 Jan 2007 18:01:07 -0500 Subject: [Biopython-dev] [Bug 1963] Adding __str__ method to codon tables and translators In-Reply-To: Message-ID: <200701272301.l0RN17l6026463@portal.open-bio.org> http://bugzilla.open-bio.org/show_bug.cgi?id=1963 ------- Comment #2 from mdehoon at ims.u-tokyo.ac.jp 2007-01-27 18:01 ------- > Question One: > Is this worth adding to BioPython or not? Yes, definitely. > Question Two: > What is the preferred behaviour for ambiguous tables? Just a 4x4x4 table as > for the unambiguous tables? Or the full 15x15x15 table? I have implemented > both (see commented out code) My feeling is that 15x15x15 would become too large to be clearly visible on the screen. So I'd prefer 4x4x4, maybe with a reminder printed at the end as to what each ambiguous codon may represent. > Question Three: > Is there a standard BioPython function to convert from one letter amino acid > sequences into three letter names? i.e. like one_to_three from > Bio.PDB.Polypeptide but more general. That function does not cope with > ambigous names. There is the function seq3 in Bio/SeqUtils. If it is not complete, it can be extended easily, and seems to be a better place for this general function than Bio/PDB/Polypeptide. -- Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are the assignee for the bug, or are watching the assignee.
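The one-letter to three-letter conversion raised under Question Three of bug 1963 could look roughly like the sketch below. It is a standalone illustration rather than the seq3 function in Bio/SeqUtils: the name one_to_three and the fallback argument are hypothetical, and the table covers the 20 standard amino acids plus the ambiguity codes B, Z and X and the stop symbol.

```python
# Standard three-letter names plus the usual ambiguity codes:
# B = Asx (Asp or Asn), Z = Glx (Glu or Gln), X = Xaa (unknown), * = Ter (stop).
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "B": "Asx", "Z": "Glx", "X": "Xaa", "*": "Ter",
}


def one_to_three(sequence, unknown="Xaa"):
    """Convert a one-letter protein string to concatenated three-letter names.

    Letters missing from the table fall back to the unknown name instead of
    raising, which is an assumption made for this sketch.
    """
    return "".join(THREE_LETTER.get(letter, unknown)
                   for letter in sequence.upper())


# Usage: ambiguous B and the stop symbol are handled by the table.
converted = one_to_three("MAB*")
```

Keeping the table in one general place, as suggested in the comment above, would let both the codon table printing and the PDB code share it.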