From bugzilla-daemon at portal.open-bio.org Wed Apr 1 07:28:12 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Apr 2009 07:28:12 -0400
Subject: [Biopython-dev] [Bug 2802] New: Loader.py: load SeqRecord comments
as list
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2802
Summary: Loader.py: load SeqRecord comments as list
Product: Biopython
Version: 1.49b
Platform: All
OS/Version: All
Status: NEW
Severity: enhancement
Priority: P2
Component: BioSQL
AssignedTo: biopython-dev at biopython.org
ReportedBy: andrea at biodec.com
Loader.py version: 1.38 or below
python: any
Actually seqrecord.annotation['comment'] is a string. SProt parser and GenBank
parser parse comment as string. SProt record parser, instead, parse comment as
list, according to the "-!-" tag. I'm working on parsing comment as lists,
either for Uniprot and for GenBank (ncbi), and I need to have the possibility
to manage comment as lists.
The biosql schema, also, has in the table "comment", the field "rank" that
is suitable to be used for storing list entries. In this way the table is
ready and implemented to store list data.
The patch is retro-compatible, so the _load_comment function is able to
load either string or list entries, according to the data type.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Wed Apr 1 07:29:02 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Apr 2009 07:29:02 -0400
Subject: [Biopython-dev] [Bug 2802] Loader.py: load SeqRecord comments as
list
In-Reply-To:
Message-ID: <200904011129.n31BT23k007952@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2802
------- Comment #1 from andrea at biodec.com 2009-04-01 07:29 EST -------
Created an attachment (id=1270)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1270&action=view)
proposed Patch
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Wed Apr 1 07:48:15 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Apr 2009 07:48:15 -0400
Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug
fixes and cleanup
In-Reply-To:
Message-ID: <200904011148.n31BmFmX009292@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2176
------- Comment #14 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-01 07:48 EST -------
I've updated CVS as per comment 12 to also use record.query_length, and comment
13 to also use record.database_length.
Before:
>>> from Bio.Blast import NCBIXML
>>> for record in NCBIXML.parse(open("xbt007.xml")) :
... print record.query_id
... print record.query_letters, record.query_length
... print record.num_letters_in_database, record.database_letters,
record.database_length
...
gi|585505|sp|Q08386|MOPB_RHOCA
270 None
13958303 None None
gi|129628|sp|P07175.1|PARA_AGRTU
222 None
13958303 None None
Now, with Bio/Blast/NCBIXML.py CVS revision 1.20 or 1.21,
>>> from Bio.Blast import NCBIXML
>>> for record in NCBIXML.parse(open("xbt007.xml")) :
... print record.query_id
... print record.query_letters, record.query_length
... print record.num_letters_in_database, record.database_letters,
record.database_length
...
gi|585505|sp|Q08386|MOPB_RHOCA
270 270
13958303 None 13958303
gi|129628|sp|P07175.1|PARA_AGRTU
222 222
13958303 None 13958303
We could perhaps deprecate record.database_letters immediately, and at a later
point, record.query_letters
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Wed Apr 1 07:50:07 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Apr 2009 07:50:07 -0400
Subject: [Biopython-dev] [Bug 2802] Loader.py: load SeqRecord comments as
list
In-Reply-To:
Message-ID: <200904011150.n31Bo7ib009452@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2802
------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-01 07:50 EST -------
See also Bug 2235 for the SwissProt parsing into SeqRecord objects.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Wed Apr 1 08:33:37 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Apr 2009 08:33:37 -0400
Subject: [Biopython-dev] [Bug 2802] Loader.py: load SeqRecord comments as
list
In-Reply-To:
Message-ID: <200904011233.n31CXbuM012687@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2802
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-01 08:33 EST -------
Thanks for the report and suggested patch. This is now fixed in CVS (slightly
differently though). I'd be grateful if you could test the latest code. A
fresh CVS checkout would be easiest - you'll need to update several files as I
was working on another issue at the same time:
Checking in BioSQL/BioSeq.py;
/home/repository/biopython/biopython/BioSQL/BioSeq.py,v <-- BioSeq.py
new revision: 1.35; previous revision: 1.34
done
Checking in BioSQL/Loader.py;
/home/repository/biopython/biopython/BioSQL/Loader.py,v <-- Loader.py
new revision: 1.39; previous revision: 1.38
done
Checking in Tests/test_BioSQL_SeqIO.py;
/home/repository/biopython/biopython/Tests/test_BioSQL_SeqIO.py,v <--
test_BioSQL_SeqIO.py
new revision: 1.33; previous revision: 1.32
done
Checking in Tests/output/test_BioSQL_SeqIO;
/home/repository/biopython/biopython/Tests/output/test_BioSQL_SeqIO,v <--
test_BioSQL_SeqIO
new revision: 1.6; previous revision: 1.5
done
Thanks,
Peter
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From biopython at maubp.freeserve.co.uk Wed Apr 1 10:23:45 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 1 Apr 2009 15:23:45 +0100
Subject: [Biopython-dev] Testing Biopython with NumPy 1.3
In-Reply-To: <320fb6e00903310212o29bba163ma9d68a901eabc2c9@mail.gmail.com>
References: <320fb6e00903301535j21ae6659r931c9be0fd17faf3@mail.gmail.com>
<730606.962.qm@web62408.mail.re1.yahoo.com>
<320fb6e00903310212o29bba163ma9d68a901eabc2c9@mail.gmail.com>
Message-ID: <320fb6e00904010723j594bc958kc721a234c54d4ea5@mail.gmail.com>
On Tue, Mar 31, 2009 at 10:12 AM, Peter wrote:
> On Tue, Mar 31, 2009 at 1:08 AM, Michiel de Hoon wrote:
>>
>>> So, whatever is going wrong on test_Cluster.py seems to be
>>> specific to Windows (XP) and Python 2.6 - and possibly just
>>> my Windows development machine.
>>>
>> I believe that the problem is that msvcr90.dll is missing. This
>> is the C runtime from Microsoft. Earlier Pythons used
>> msvcr71.dll, if I'm not mistaken.
>
> You may be right - there is some stuff on the numpy mailing list
> about this and manifest files etc when using mingw32. ?It may
> be simplest to try the appropriate MS compiler instead...
OK, good news using the MS compiler:
I went to http://www.microsoft.com/express/download/ and installed the
free VC++ 2008 Express Edition (using the web install, unticking the
optional silverlight and sql server bits). Using the "Visual Studio
2008 Command Prompt" shortcut I was able to build, test, install
Biopython CVS fine. All this shortcut claims to do is setup suitable
environment variables first, so this last bit can probably be
simplified for every day use. This should mean we can include a
Biopython 1.50 (beta) installer for Windows on Python 2.6 using NumPy
1.3 :)
It would still be nice to resolve the mingw32 issue, but it isn't
critical right now.
Peter
From bugzilla-daemon at portal.open-bio.org Wed Apr 1 10:41:24 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Apr 2009 10:41:24 -0400
Subject: [Biopython-dev] [Bug 2803] New: Insure Alignment objects are passed
to AlignIO.write()
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2803
Summary: Insure Alignment objects are passed to AlignIO.write()
Product: Biopython
Version: Not Applicable
Platform: PC
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: cymon.cox at gmail.com
Insure Alignment objects are passed to AlignIO.write()
Stops this kind of abuse:
records = list(SeqIO.parse(open("Tests/NBRF/DMA_nuc.pir", "r"), "pir"))
AlignIO.write([records], open("alignIO.fasta", "w"), "fasta")
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Wed Apr 1 10:42:55 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Apr 2009 10:42:55 -0400
Subject: [Biopython-dev] [Bug 2803] Insure Alignment objects are passed to
AlignIO.write()
In-Reply-To:
Message-ID: <200904011442.n31EgtlQ023181@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2803
------- Comment #1 from cymon.cox at gmail.com 2009-04-01 10:42 EST -------
Created an attachment (id=1271)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1271&action=view)
nsure-Alignment-objects-are-passed-to-write-AlignIO
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Wed Apr 1 11:25:36 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Apr 2009 11:25:36 -0400
Subject: [Biopython-dev] [Bug 2803] Insure Alignment objects are passed to
AlignIO.write()
In-Reply-To:
Message-ID: <200904011525.n31FPa3V026200@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2803
------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-01 11:25 EST -------
Thanks for filing the bug (originally raised in our discussion on the mailing
list).
There is a major drawback to your proposed fix,
+ if isinstance(alignments, types.GeneratorType):
+ alignments = list(alignments)
This means if you gave the AlignIO.write function a generator returning
hundreds or large alignment objects, they would all get loaded into memory at
once. One of the big aims with Bio.SeqIO and AlignIO in using
generators/iterators is to allow memory efficient working where we try to keep
only one record/alignment in memory at a time.
Anyway, I'll take a look at this. I think we need to just check the case where
Bio.AlignIO.write uses Bio.SeqIO.write internally...
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Wed Apr 1 11:36:54 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Apr 2009 11:36:54 -0400
Subject: [Biopython-dev] [Bug 2803] Insure Alignment objects are passed to
AlignIO.write()
In-Reply-To:
Message-ID: <200904011536.n31Fasdu027053@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2803
------- Comment #3 from cymon.cox at gmail.com 2009-04-01 11:36 EST -------
(In reply to comment #2)
> Thanks for filing the bug (originally raised in our discussion on the mailing
> list).
>
> There is a major drawback to your proposed fix,
>
> + if isinstance(alignments, types.GeneratorType):
> + alignments = list(alignments)
>
> This means if you gave the AlignIO.write function a generator returning
> hundreds or large alignment objects, they would all get loaded into memory at
> once. One of the big aims with Bio.SeqIO and AlignIO in using
> generators/iterators is to allow memory efficient working where we try to keep
> only one record/alignment in memory at a time.
>
> Anyway, I'll take a look at this. I think we need to just check the case where
> Bio.AlignIO.write uses Bio.SeqIO.write internally...
>
Yes, I see. I had originally intended to check the type while looping through
the alignments before calling SeqIO.write, but thought better of it because
some alignments may get written before a error occurs, whereas it seems best
that either all or none at all get written from the call to AlignIO.write.
C.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Wed Apr 1 11:55:26 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Apr 2009 11:55:26 -0400
Subject: [Biopython-dev] [Bug 2803] Insure Alignment objects are passed to
AlignIO.write()
In-Reply-To:
Message-ID: <200904011555.n31FtQ9X028474@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2803
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-01 11:55 EST -------
(In reply to comment #3)
> > Anyway, I'll take a look at this. I think we need to just check the case
> > where Bio.AlignIO.write uses Bio.SeqIO.write internally...
That turned out to be the case, fixed in CVS. See Bio/AlignIO/__init__.py
revision 1.22 and Tests/test_AlignIO.py 1.19
> Yes, I see. I had originally intended to check the type while looping through
> the alignments before calling SeqIO.write, but thought better of it because
> some alignments may get written before a error occurs, whereas it seems best
> that either all or none at all get written from the call to AlignIO.write.
You are right, if we are given a list/iterator containing some real Alignments
but also some non-Alignments we have a problem. We can't pre-check all the
entries before writing without converting to a list (and this ruins the memory
benefits). We just catching the erroneous input when we reach it, even though
it may happen half way through writing to the file.
Marking as fixed.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Wed Apr 1 14:04:05 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Apr 2009 14:04:05 -0400
Subject: [Biopython-dev] [Bug 2804] New: Clustalw subprocess hangs when
large stdout returned
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2804
Summary: Clustalw subprocess hangs when large stdout returned
Product: Biopython
Version: Not Applicable
Platform: PC
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: cymon.cox at gmail.com
As noted on the mailing list, the following hangs waiting for a return:
from Bio import SeqIO
from Bio import Clustalw
from Bio.Clustalw import MultipleAlignCL
records = list(SeqIO.parse(open("Tests/NBRF/Cw_prot.pir", "r"), "pir"))
handle = open("temp.fasta", "w")
SeqIO.write(records, handle, "fasta")
handle.close()
cline = MultipleAlignCL("temp.fasta", command="clustalw")
align = Clustalw.do_alignment(cline)
This appears to be due to a known issue as documented here:
http://docs.python.org/library/subprocess.html#subprocess.Popen.wait
but wasnt being picked up by the tests - presumably because no test file is
large enough to trigger the problem.
Instead of using .wait() it suggests .communicate()
The attached patch works for me on Linux. But as noted in __init__.py this
maybe an issue for Windows:
#We don't need to supply any piped input, but we setup the
#standard input pipe anyway as a work around for a python
#bug if this is called from a Windows GUI program. For
#details, see http://bugs.python.org/issue1124861
Also subprocess.returncode is now /3 so moved "if status: value = status / 256
"so that only done if calling os.popen()
C.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Wed Apr 1 14:05:10 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Apr 2009 14:05:10 -0400
Subject: [Biopython-dev] [Bug 2804] Clustalw subprocess hangs when large
stdout returned
In-Reply-To:
Message-ID: <200904011805.n31I5ACv005787@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2804
------- Comment #1 from cymon.cox at gmail.com 2009-04-01 14:05 EST -------
Created an attachment (id=1272)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1272&action=view)
clustalw subprocess patch
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Wed Apr 1 18:05:40 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Apr 2009 18:05:40 -0400
Subject: [Biopython-dev] [Bug 2804] Clustalw subprocess hangs when large
stdout returned
In-Reply-To:
Message-ID: <200904012205.n31M5eDa024097@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2804
------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-01 18:05 EST -------
It is great that you've found a simple and reproduceable test case. I can
confirm this problem on a Linux machine with Python 2.4.3 (what version of
python do you have?)
(In reply to comment #1)
> Created an attachment (id=1272)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1272&action=view) [details]
> clustalw subprocess patch
Unfortunately the patch is flawed here:
status = child_process.communicate()[1]
We want to get the return code (a numerical error value), but the communicate
method returns two strings giving the contents of stdout and strerr, i.e.
...
CLUSTAL W (1.83) Multiple Sequence Alignments
...
Sequence format is Pearson
Sequence 1: HLA_HLA00401 366 aa
Sequence 2: HLA_HLA00402 366 aa
...
Group 109: Sequences: 3 Score:6519
Group 110: Sequences: 111 Score:4464
Alignment Score 8299041
CLUSTAL-Alignment file created [temp.aln]
for stdout, and an empty string for stderr. Doing this seems to work on Linux
with python 2.4.3,
child_process.communicate() #ignore the stdout and stderr data!
child_process.stdin.close()
child_process.stdout.close()
child_process.stderr.close()
status = child_process.returncode
However, I have only tested this one example far, and not on Windows or the Mac
yet. It would be a good idea to extend test_Clustalw_tool.py to cover some
deliberate failures to check we can read the error level (return code) ClustalW
gives back. Of course, this will need testing with both clustalw 1.x and 2.x
to be safe.
Note that the original code using os.popen still works fine for this example.
We switched to subprocess because os.popen* are being deprecated on Python 2.6,
and didn't work well with names with spaces as I recall.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Wed Apr 1 18:42:39 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Apr 2009 18:42:39 -0400
Subject: [Biopython-dev] [Bug 2804] Clustalw subprocess hangs when large
stdout returned
In-Reply-To:
Message-ID: <200904012242.n31MgdKd026637@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2804
------- Comment #3 from cymon.cox at gmail.com 2009-04-01 18:42 EST -------
(In reply to comment #2)
> It is great that you've found a simple and reproduceable test case. I can
> confirm this problem on a Linux machine with Python 2.4.3 (what version of
> python do you have?)
Python 2.5.2 (r252:60911, Oct 5 2008, 19:24:49)
[GCC 4.3.2] on linux2
on Ubuntu Intrepid
>
> (In reply to comment #1)
> > Created an attachment (id=1272)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1272&action=view) [details] [details]
> > clustalw subprocess patch
>
> Unfortunately the patch is flawed here:
>
> status = child_process.communicate()[1]
Actually, the 'whole' patch is good. Have a look at the second bit of the
patch, where I change my initial commit to my branch:
#Grab stderr
- status = child_process.communicate()[1]
+ child_process.communicate()
+ value = child_process.returncode
except ImportError :
etc...
I've been trying to get to grips with git - and clearly havent succeeded to
yet!
When you run the command "git format-patch" it creates a separate for each
commit to the branch, and I can't figure out how to just get the patch against
only the current version of the file. So git gave me two patches, which I
cat'ed together and submitted as a composite patch.
Sorry I didnt make that clear.
If anyone knows how to get the diff against only the current file version, I'd
appreciate the answer ;)
Cheers, C.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Apr 2 07:00:48 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Apr 2009 07:00:48 -0400
Subject: [Biopython-dev] [Bug 2804] Clustalw subprocess hangs when large
stdout returned
In-Reply-To:
Message-ID: <200904021100.n32B0mEZ014206@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2804
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Attachment #1272 is|0 |1
obsolete| |
------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-02 07:00 EST -------
Created an attachment (id=1273)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1273&action=view)
Patch to Bio/Clustalw/__init__.py
(In reply to comment #3)
>
> When you run the command "git format-patch" it creates a separate for each
> commit to the branch, and I can't figure out how to just get the patch against
> only the current version of the file. So git gave me two patches, which I
> cat'ed together and submitted as a composite patch.
>
I see - that odd looking patch had confused me. I think you want to look at
"giff diff ..." for this, it also can do things like show the diff between the
remote branches.
I have tested this new patch on both Linux and Mac now, using both ClustalW
1.83 and 2.0.10 - next up Windows, and extending the unit test.
Peter
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Apr 2 07:32:40 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Apr 2009 07:32:40 -0400
Subject: [Biopython-dev] [Bug 2804] Clustalw subprocess hangs when large
stdout returned
In-Reply-To:
Message-ID: <200904021132.n32BWdqU016365@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2804
------- Comment #5 from cymon.cox at gmail.com 2009-04-02 07:32 EST -------
(In reply to comment #4)
> Created an attachment (id=1273)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1273&action=view) [details]
> Patch to Bio/Clustalw/__init__.py
>
> (In reply to comment #3)
> >
> > When you run the command "git format-patch" it creates a separate for each
> > commit to the branch, and I can't figure out how to just get the patch against
> > only the current version of the file. So git gave me two patches, which I
> > cat'ed together and submitted as a composite patch.
> >
>
> I see - that odd looking patch had confused me. I think you want to look at
> "giff diff ..." for this, it also can do things like show the diff between the
> remote branches.
>
> I have tested this new patch on both Linux and Mac now, using both ClustalW
> 1.83 and 2.0.10 - next up Windows, and extending the unit test.
Your new patch doesnt indent the lines (as in my original patch):
113 value = 0
114 if status: value = status / 256
so that they only get executed when run_clust = os.popen(str(command_line))
The return code from child_process.communicate() is already /256
also assign value = child_process.returncode (the return code is 0 for success
and never "")
"""
child_process.communicate()
value = child_process.returncode
except ImportError :
#Fall back for python 2.3
run_clust = os.popen(str(command_line))
status = run_clust.close()
# The exit status is the second byte of the termination status
# TODO - Check this holds on win32...
value = 0
if status: value = status / 256
# check the return value for errors, as on 1.81 the return value
# from Clustalw is actually helpful for figuring out errors
# 1 => bad command line option
if value == 1:
raise ValueError("Bad command line option in the command: %s"
% str(command_line))
"""
C.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Apr 2 10:34:10 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Apr 2009 10:34:10 -0400
Subject: [Biopython-dev] [Bug 2804] Clustalw subprocess hangs when large
stdout returned
In-Reply-To:
Message-ID: <200904021434.n32EYApO032328@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2804
------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-02 10:34 EST -------
I've updated test_Clustalw_tool.py in CVS to catch this dead lock, and
confirmed the unit test will fail on Mac and Linux when using subprocess (on
the bright side, Python 2.3 should still work), but the test passes with the
fix outlined - or simply using the os.popen code instead.
Interestingly the lockup seems to happen more readily on Linux that on the Mac.
I've yet to test on Windows.
I also added three tests for standard error conditions - interestingly I don't
ever seem to get an error code back (either with subprocess or os.popen). What
about you? This makes testing these special cases for raising specific IOError
exceptions difficult.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Apr 2 11:19:04 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Apr 2009 11:19:04 -0400
Subject: [Biopython-dev] [Bug 2804] Clustalw subprocess hangs when large
stdout returned
In-Reply-To:
Message-ID: <200904021519.n32FJ4DC003715@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2804
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
OS/Version|Linux |All
Resolution| |FIXED
------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-02 11:19 EST -------
Hi Cymon,
I've updated the unit test for Windows on Python 2.3 through 2.6 (had to move
some file deletions to the end, and watch out for extra error message
variations).
Windows also deadlocks on this example when using subprocess - the test should
normally take about four seconds in total (depending on your computer's speed
of course). Using os.popen avoids the deadlock (but can't cope with file names
with spaces). Your fix in comment 5 also works :)
So, now we have a unit test which catches this deadlock on all three operating
systems, which confirms your fix which works on all three. I've checked it
into CVS, and marked this bug as fixed.
[I'm still not sure what is happening with the return values - if you look into
this further please raise a new bug for it.]
Thanks!
Peter
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Apr 2 11:32:39 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Apr 2009 11:32:39 -0400
Subject: [Biopython-dev] [Bug 2806] New: Possible deadlock (hang) in
Bio.Application using subprocess wait()
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2806
Summary: Possible deadlock (hang) in Bio.Application using
subprocess wait()
Product: Biopython
Version: Not Applicable
Platform: PC
OS/Version: All
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk
CC: cymon.cox at gmail.com
See Bug 2804 which demonstrated a reproducible hang on Windows, Linux and Mac
from the subprocess .wait() method, and a work around.
Bio.Application may suffer from the same problem, and could be fixed with the
same approach. Patch to follow ...
Ideally we'd have a suitable unit test covering this - perhaps using
Bio.EMBOSS?
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Apr 2 11:33:30 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Apr 2009 11:33:30 -0400
Subject: [Biopython-dev] [Bug 2806] Possible deadlock (hang) in
Bio.Application using subprocess wait()
In-Reply-To:
Message-ID: <200904021533.n32FXU67004756@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2806
------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-02 11:33 EST -------
Created an attachment (id=1274)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1274&action=view)
Patch to Bio/Application/__init__.py
Use the .communicate() method instead of .wait()
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Apr 2 15:18:56 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Apr 2009 15:18:56 -0400
Subject: [Biopython-dev] [Bug 2734] db.load problem with postgresql and
psycopg2
In-Reply-To:
Message-ID: <200904021918.n32JIuXc023154@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2734
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |INVALID
------- Comment #9 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-02 15:18 EST -------
As per comment 8, I'm going to assume Stephen had an old copy of Biopython on
his machine, which would explain the error. In the absence of any further
information there isn't anything we can do. Marking bug as invalid.
Stephen - if you do work out what was going on, or if you still have a problem
after sorting out any issue with multiple copies of Biopython installed, please
do reopen this report.
Thanks
Peter
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Apr 2 18:29:18 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Apr 2009 18:29:18 -0400
Subject: [Biopython-dev] [Bug 2807] New: Clustalw return codes
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2807
Summary: Clustalw return codes
Product: Biopython
Version: Not Applicable
Platform: PC
OS/Version: Linux
Status: NEW
Severity: enhancement
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: cymon.cox at gmail.com
see bug 2804
More on clustalw return codes:
Note return codes are the same whether using subprocess.returncode or
(os.popen().close() \3)
clustalw1.81 clustalw2.09
----------------- ------------------
error: Bad command line option in the command: clustalw_bogus
-INFILE=Fasta/f002
127 127
error: can't open sequence file: clustalw -INFILE=no_file_present
2 255
error: wrong format of input file: clustalw -INFILE=Phylip/hennigian.phy
3 255
error: only one sequence in input: clustalw -INFILE=Fasta/f001
4 0
=========================================================
Clustalw.__init__ tries to catch return codes 1, 2, 3, and 4, others get caught
generically.
I dont think it is possible to generate a return code 1 using 1.81 because
interface doesnt allow ad hoc options to be added to the command line. Invalid
values of options are just ignore by clustalw and it aligns the data anyway (ie
return code 0).
Return codes 127 and 255 could be caught for newer versions and a more
informative error returned. But given that there are 9 other clustalw versions
between 1.81 (June 2003) and the latest 2.0.10 (Oct 2008 the latest) for which
I havent checked the return codes, it might be better to just return a generic
command line error if the return value is > 0.
In the case where only one sequence is present, newer versions return code 0,
but throws a ValueError when trying to parse the non-existent output file (see
comment in test_Clustalw_tools.py).
C.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Fri Apr 3 05:50:44 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 3 Apr 2009 05:50:44 -0400
Subject: [Biopython-dev] [Bug 2807] Clustalw return codes
In-Reply-To:
Message-ID: <200904030950.n339oiIx019752@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2807
------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-03 05:50 EST -------
(In reply to comment #0)
> Clustalw.__init__ tries to catch return codes 1, 2, 3, and 4, others get
> caught generically.
With the CVS code, using clustalw1.81, is it definitely catching these errors
and raising specific IOErrors?
> I dont think it is possible to generate a return code 1 using 1.81 because
> interface doesnt allow ad hoc options to be added to the command line.
The Bio.Clustalw.do_alignment() function accepts any command line string, so
you should be able to feed it a clustalw command with invalid arguments.
> Invalid values of options are just ignore by clustalw and it aligns the
> data anyway (ie return code 0).
We'd have to look at the clustalw source code to confirm what should trigger an
return error code of 1.
> Return codes 127 and 255 could be caught for newer versions and a more
> informative error returned.
Yes, that sounds sensible.
> But given that there are 9 other clustalw versions
> between 1.81 (June 2003) and the latest 2.0.10 (Oct 2008 the latest) for which
> I havent checked the return codes, it might be better to just return a generic
> command line error if the return value is > 0.
That also sounds sensible.
> In the case where only one sequence is present, newer versions return code 0,
> but throws a ValueError when trying to parse the non-existent output file (see
> comment in test_Clustalw_tools.py).
Maybe we should report that as a bug, I think clustalw2.0 is intended to be API
compatible with clustalw1.x
Peter
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From thamelry at binf.ku.dk Fri Apr 3 09:31:05 2009
From: thamelry at binf.ku.dk (Thomas Hamelryck)
Date: Fri, 3 Apr 2009 15:31:05 +0200
Subject: [Biopython-dev] PDB tidy script
In-Reply-To: <320fb6e00903231405l479ddcc6of9cd0c1aa8fd98d4@mail.gmail.com>
References: <320fb6e00903231405l479ddcc6of9cd0c1aa8fd98d4@mail.gmail.com>
Message-ID: <2d7c25310904030631u56c642d6k83355d6bc4cd5d19@mail.gmail.com>
Hi everybody,
> I haven't been on this list long enough to know -- is Thomas still
> > supporting the PDB module?
Yes and no. First, I've been pretty busy with establishing a group here in
Copenhagen, but it looks like I will have time for Bio.PDB again in the
future. There's for example a set of classes dealing with RNA structure
coming up. Just have to submit it.
Second, I have no interest in doing anything beyond 3D stuff. I am not going
to implement header parsing for example. I know many people have donated
code, but in general this code is very messy and ad-hoc.
The PDB parser is pretty lean, fast and quite stable now - IMO parsing the
header should be the responsibility of a helper class, in order not to
overload the 3D code with a lot of stuff that most people will not use.
Also, the header info is for most purposes quite useless, especially in PDB
files. It makes no sense to parse the PDB header in fact - if you need
header info, use the MMCIF files.
> If so, would he give his blessing to some more
> > invasive changes to the PDB module, such as unifying PDBParser and
> > parse_pdb_header? That separation has always seemed curiously vestigal to
> > me.
You could provide a uniform interface, but please keep the 3D data
processing and the header processing in separate classes! The Structure
object has functionality to be 'annotated', so you could transfer data from
the header to the Structure object easily.
> If you look back over the history, there initially was no header parsing,
> it was a contribution from Kristian Rother, and I would agree, it is rather
> disjoint from the rest of the code. One thing I personally wanted last
> time I was working with PDB files was to have secondary structure
> information (for them alpha and beta sheet lines in the header)
> mapped onto the residue objects automatically.
This is a good example of why header parsing is something of a red herring.
You really want to recompute that using some decent program like DSSP or
PSEA, or even an internal Bio.PDB procedure. But it's fine of course if you
want to add this!
I would suggest you try and get Thomas involved now for his input
> on the design (before you start coding), but if need be press ahead
> anyway for your own use, and he can always comment on your
> public branch. I hope the two of you can work together on this, and
> if/when Thomas does stand down (or delagate), you could then be
> in an excellent position to take over as the Bio.PDB maintainer if
> that's what you wanted.
Sure, I'm open to this, but I'd like to stay involved if the 3D stuff is
altered, even just to discuss new designs.
Cheers,
-Thomas
From biopython at maubp.freeserve.co.uk Fri Apr 3 12:41:04 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 3 Apr 2009 17:41:04 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.50 (beta)
Message-ID: <320fb6e00904030941y5110cb95r8be1fb37b01b9b98@mail.gmail.com>
On Tue, Mar 31, 2009 at 10:38 PM, Peter wrote:
> Hi all,
>
> OK guys, after a brief chat off the mailing list, I'm hoping to do the
> Biopython 1.50 beta release roughly this weekend, somewhere between
> Friday 4 and Monday 6 April. ?Until then please consider CVS "frozen"
> for anything other that documentation changes or unit test additions,
> or at a push really tiny changes. ?Once I'm ready to actually do the
> release, I'll send out an email requesting no further CVS commits.
I'm going to try and do the release tonight (in the next few hours),
so please consider CVS frozen until further notice.
Thanks,
Peter
From biopython at maubp.freeserve.co.uk Fri Apr 3 14:07:58 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 3 Apr 2009 19:07:58 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.50 (beta)
In-Reply-To: <320fb6e00904030941y5110cb95r8be1fb37b01b9b98@mail.gmail.com>
References: <320fb6e00904030941y5110cb95r8be1fb37b01b9b98@mail.gmail.com>
Message-ID: <320fb6e00904031107q3df63df7q3569c22d4a521b7c@mail.gmail.com>
On Fri, Apr 3, 2009 at 5:41 PM, Peter wrote:
>
> I'm going to try and do the release tonight (in the next few hours),
> so please consider CVS frozen until further notice.
>
OK, its done - uploaded, and tagged in CVS. Could you all give it a
quick test now that would be great, especially the Windows installers
if possible as I currently only have ready access to the one Windows
machine which is where the installers were built.
I'll prepare the news entry and email announcement later on tonight,
based on the current NEWS file. If there is anything missing which
should be mentioned, please email me ASAP.
I'm happy for CVS to be used again to check in documentation changes,
but no code changes yet please.
Thanks
Peter
From tiagoantao at gmail.com Sat Apr 4 12:43:10 2009
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Sat, 4 Apr 2009 17:43:10 +0100
Subject: [Biopython-dev] Merging branches
Message-ID: <6d941f120904040943p60f4fb33m38826b2f72eedb97@mail.gmail.com>
Hi,
This might be a lame question but I am completely stuck and don't seem
to understand why.
I am trying to PARTIALLY merge 2 branches: my popgen branch with Giovanni's.
I want to import his changes to Bio/PopGen/Stats , but only that
(nothing on other Bio directories, and, above all not a new test).
This changes are not conflictual, so I have no warning and everything
gets in: If I do a git-merge I get the whole bang.
Is there any way to just get partial merge? In this case I only want
to merge a single sub dir (although, in general one might just want to
import a single file)
Of course I could do 2 checkouts and copy files across, on the local
filesystem, but is that not loosing the history of connections between
the files?
Many thanks,
Tiago
--
"A man who dares to waste one hour of time has not discovered the
value of life" - Charles Darwin
From biopython at maubp.freeserve.co.uk Sat Apr 4 13:01:53 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 4 Apr 2009 18:01:53 +0100
Subject: [Biopython-dev] Merging branches
In-Reply-To: <6d941f120904040943p60f4fb33m38826b2f72eedb97@mail.gmail.com>
References: <6d941f120904040943p60f4fb33m38826b2f72eedb97@mail.gmail.com>
Message-ID: <320fb6e00904041001q4c5b0067i7ca9eaf6a6e52b43@mail.gmail.com>
2009/4/4 Tiago Ant?o:
> Is there any way to just get partial merge? In this case I only want
> to merge a single sub dir (although, in general one might just want
> to import a single file)
Can you cherry pick the changes you want? Github's fork queue
provides another approach to the same issue. However, these both work
on patches (individual commits) rather than files/directories.
Peter
From tiagoantao at gmail.com Sat Apr 4 13:29:20 2009
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Sat, 4 Apr 2009 18:29:20 +0100
Subject: [Biopython-dev] Merging branches
In-Reply-To: <320fb6e00904041001q4c5b0067i7ca9eaf6a6e52b43@mail.gmail.com>
References: <6d941f120904040943p60f4fb33m38826b2f72eedb97@mail.gmail.com>
<320fb6e00904041001q4c5b0067i7ca9eaf6a6e52b43@mail.gmail.com>
Message-ID: <6d941f120904041029s46b6cb02x30c68885acbfb47f@mail.gmail.com>
Me thinks I need to get a book on git and understand, once and for
all, the basic concepts. I am getting merge conflicts with cherry
picking and I don't even understand why
Anyway it would be nice (but not fundamental) to merge just a single file.
2009/4/4 Peter :
> 2009/4/4 Tiago Ant?o:
>> Is there any way to just get partial merge? In this case I only want
>> to merge a single sub dir (although, in general one might just want
>> to import a single file)
>
> Can you cherry pick the changes you want? ?Github's fork queue
> provides another approach to the same issue. ?However, these both work
> on patches (individual commits) ?rather than files/directories.
>
> Peter
>
--
"A man who dares to waste one hour of time has not discovered the
value of life" - Charles Darwin
From biopython at maubp.freeserve.co.uk Sat Apr 4 15:06:57 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 4 Apr 2009 20:06:57 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.50 (beta)
In-Reply-To: <320fb6e00904031107q3df63df7q3569c22d4a521b7c@mail.gmail.com>
References: <320fb6e00904030941y5110cb95r8be1fb37b01b9b98@mail.gmail.com>
<320fb6e00904031107q3df63df7q3569c22d4a521b7c@mail.gmail.com>
Message-ID: <320fb6e00904041206yb0e4a29ja715a54faeeca28e@mail.gmail.com>
On Fri, Apr 3, 2009 at 7:07 PM, Peter wrote:
> I'm happy for CVS to be used again to check in documentation changes,
> but no code changes yet please.
Also I should have said before, those with CVS access, please feel
free to add more unit tests. I've started work on one using the
EMBOSS tools, to check both the command line wrappers in Bio.Emboss
but also our parsers.
I'm repeating myself but if you have some new code you'd like to check
in, while CVS is "frozen" for the release process, this is a nice
chance to try playing with git and github ;)
Peter
From bartek at rezolwenta.eu.org Sun Apr 5 05:49:14 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Sun, 5 Apr 2009 11:49:14 +0200
Subject: [Biopython-dev] Merging branches
In-Reply-To: <6d941f120904041029s46b6cb02x30c68885acbfb47f@mail.gmail.com>
References: <6d941f120904040943p60f4fb33m38826b2f72eedb97@mail.gmail.com>
<320fb6e00904041001q4c5b0067i7ca9eaf6a6e52b43@mail.gmail.com>
<6d941f120904041029s46b6cb02x30c68885acbfb47f@mail.gmail.com>
Message-ID: <8b34ec180904050249n156185abx4538e51876280266@mail.gmail.com>
Hi Tiago,
2009/4/4 Tiago Ant?o :
> Me thinks I need to get a book on git and understand, once and for
> all, the basic concepts. I am getting merge conflicts with cherry
> picking and I don't even understand why
>
If you could be a bit more specific (providing the files and revision numbers
would be great), than it would be easier to help. I know it is an extra work,
but we need some info, also to improve our wiki documents.
> Anyway it would be nice (but not fundamental) to merge just a single file.
>
This is one of the fundamentalo changes between CVS and git. CVS uses
files as the atomic piece of data, while git works with changesets (commits).
This means, that if you only need a part of what was committed as a
big changeset,
you will need to put an extra effort into selecting what you need.
>> 2009/4/4 Tiago Ant?o:
>>> Is there any way to just get partial merge? In this case I only want
>>> to merge a single sub dir (although, in general one might just want
>>> to import a single file)
Looking at specific files is not the default way things work in git.
The idea is that if
someone makes a single commit, it is an atomic contribution that is
either to be
accepted or not. You can of course create a diff file and then split
it into specific files.
I'll look into possible easier ways of doing it.
cheers
Bartek
From eric.talevich at gmail.com Sun Apr 5 12:47:39 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Sun, 5 Apr 2009 12:47:39 -0400
Subject: [Biopython-dev] Merging branches
In-Reply-To: <8b34ec180904050249n156185abx4538e51876280266@mail.gmail.com>
References: <6d941f120904040943p60f4fb33m38826b2f72eedb97@mail.gmail.com>
<320fb6e00904041001q4c5b0067i7ca9eaf6a6e52b43@mail.gmail.com>
<6d941f120904041029s46b6cb02x30c68885acbfb47f@mail.gmail.com>
<8b34ec180904050249n156185abx4538e51876280266@mail.gmail.com>
Message-ID: <3f6baf360904050947m5d9ec75eh18d64c53b8d9e2a6@mail.gmail.com>
2009/4/5 Bartek Wilczynski
> Hi Tiago,
>
> >> 2009/4/4 Tiago Ant?o:
> >>> Is there any way to just get partial merge? In this case I only want
> >>> to merge a single sub dir (although, in general one might just want
> >>> to import a single file)
>
> Looking at specific files is not the default way things work in git.
> The idea is that if
> someone makes a single commit, it is an atomic contribution that is
> either to be
> accepted or not. You can of course create a diff file and then split
> it into specific files.
> I'll look into possible easier ways of doing it.
>
> cheers
> Bartek
>
You can get a list of the changes that affected a single subdirectory by
giving the directory name to git log, e.g. "git log Bio/PopGen/Stats/".
Those commits don't necessarily just affect Bio/PopGen/Stats, but assuming
there aren't any single-commit code bombs, then it's probably a good idea to
take those associated modifications anyway. You can also give a range of
versions to git-log to get the commits that occurred since Gio's branch
diverged from yours -- it looks something like "git log [path] HEAD..[gio's
branch]", details are in the help page for git-rev-parse. Then you can use
that list of commits for cherry-picking, in the original order.
If it's essential to get just a specific file at a specific version, you can
find the SHA1 hash for that blob (probably easiest through github) and use
git-show with a redirect to the file in your tree, or a temporary filename.
This loses the history, though.
Cheers,
Eric
From tiagoantao at gmail.com Mon Apr 6 06:35:47 2009
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Mon, 6 Apr 2009 11:35:47 +0100
Subject: [Biopython-dev] Merging branches
In-Reply-To: <8b34ec180904050249n156185abx4538e51876280266@mail.gmail.com>
References: <6d941f120904040943p60f4fb33m38826b2f72eedb97@mail.gmail.com>
<320fb6e00904041001q4c5b0067i7ca9eaf6a6e52b43@mail.gmail.com>
<6d941f120904041029s46b6cb02x30c68885acbfb47f@mail.gmail.com>
<8b34ec180904050249n156185abx4538e51876280266@mail.gmail.com>
Message-ID: <6d941f120904060335m380ac820k97a558e8332fdf63@mail.gmail.com>
Hi,
2009/4/5 Bartek Wilczynski :
> If you could be a bit more specific (providing the files and revision numbers
> would be great), than it would be easier to help. I know it is an extra work,
> but we need some info, also to improve our wiki documents.
>
I would like to replace this:
http://github.com/tiagoantao/biopython-popgen-test/blob/fa5ebc23e7aaabce94ae594d9a4f83be9bf90215/Bio/PopGen/Stats/Simple.py
With this:
http://github.com/dalloliogm/biopython/blob/cbaf6249cb91ed505cb575f09c2eaef3809872b9/Bio/PopGen/Stats/Simple.py
It would be cool not to loose the history relationship (I suppose that
would be the good practice).
> This means, that if you only need a part of what was committed as a
> big changeset,
> you will need to put an extra effort into selecting what you need.
But how do you do that (other than manually copying files)? Cherry
pick seems to be commit based...
> Looking at specific files is not the default way things work in git.
> The idea is that if
> someone makes a single commit, it is an atomic contribution that is
> either to be
> accepted or not. You can of course create a diff file and then split
> it into specific files.
> I'll look into possible easier ways of doing it.
The point is: wanting to use part of a commit without loosing history.
In my case, I dont want to import a test_PopGen_Fst file that Gio has.
That being said, I dont think this is a big deal. I was just to
preserve the history connectivity between repositiories. I think we
can just use the old fashioned method of copying some files around.
But it would be good to know if there is a "best practice" (which, I
could not find out)
Tiago
PS - I might have to go under surgery this week, if I stop responding
for a long time, my apologies in advance but I am probably recovering.
From biopython at maubp.freeserve.co.uk Mon Apr 6 09:25:29 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 6 Apr 2009 14:25:29 +0100
Subject: [Biopython-dev] Bio.GFF and Brad's code
Message-ID: <320fb6e00904060625v4a49da2au76159eae18f707eb@mail.gmail.com>
Brad has been working on his GFF parsing code - see progress reports
on his blog http://bcbio.wordpress.com/ and his code on github,
http://github.com/chapmanb/bcbb/tree/master/gff
Potentially this could make it into Biopython 1.51, and I was just
thinking about where the code would go. Brad is supporting both GFF3
and the loosely defined GFF2 variants, so Bio.GFF seems a good place.
There would also be a wrapper under Bio.SeqIO for loading GFF files as
SeqRecord objects (I haven't played with Brad's code, but it can do
this already).
However, we already have a Bio.GFF module from Michael Hoffman created
back in 2002 which accesses MySQL General Feature Format (GFF)
databases created with BioPerl. Perhaps we should poll the main
discussion list now, and if there are no responses from people using
it, we could deprecate Bio.GFF for Biopython 1.50? Under our current
deprecation policy we shouldn't then remove Bio.GFF until Biopython
1.52 at the earliest, http://biopython.org/wiki/Deprecation_policy
What do you think Brad? How about using Bio.GFF3 instead?
Peter
From chapmanb at 50mail.com Mon Apr 6 18:08:26 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 6 Apr 2009 18:08:26 -0400
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <320fb6e00904060625v4a49da2au76159eae18f707eb@mail.gmail.com>
References: <320fb6e00904060625v4a49da2au76159eae18f707eb@mail.gmail.com>
Message-ID: <20090406220826.GH43636@sobchak.mgh.harvard.edu>
Peter;
Thanks for the plug. GFF parsing is moving along; the main feature
two things I would like to finish before proposing it for inclusion
are writing of GFF files and putting GFF into BioSQL with the nested
features. The code does work for parsing, and I've been using it for
some real projects; anyone who would like to test it is more than
welcome.
As far as the current Bio.GFF, that is a bit of a conundrum. The
current code does work and for some cases it would be nice of having
the utility of working with GFF from a database. Eventually BioSQL
from GFF may supplant that, but that should be finished and tested
first. I would argue for keeping it in.
However, it is a bit confusing if someone is looking for a parser. It
would make more sense if it lived under a namespace like Bio.GFF.DB.
What do you think about adding a warning that it is going to move to
a new namespace and then moving it there, if we don't hear any
complaints, for 1.51? This is less cumbersome than a removal for
users since it's just an import change.
Brad
> Brad has been working on his GFF parsing code - see progress reports
> on his blog http://bcbio.wordpress.com/ and his code on github,
> http://github.com/chapmanb/bcbb/tree/master/gff
>
> Potentially this could make it into Biopython 1.51, and I was just
> thinking about where the code would go. Brad is supporting both GFF3
> and the loosely defined GFF2 variants, so Bio.GFF seems a good place.
> There would also be a wrapper under Bio.SeqIO for loading GFF files as
> SeqRecord objects (I haven't played with Brad's code, but it can do
> this already).
>
> However, we already have a Bio.GFF module from Michael Hoffman created
> back in 2002 which accesses MySQL General Feature Format (GFF)
> databases created with BioPerl. Perhaps we should poll the main
> discussion list now, and if there are no responses from people using
> it, we could deprecate Bio.GFF for Biopython 1.50? Under our current
> deprecation policy we shouldn't then remove Bio.GFF until Biopython
> 1.52 at the earliest, http://biopython.org/wiki/Deprecation_policy
>
> What do you think Brad? How about using Bio.GFF3 instead?
>
> Peter
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
From mjldehoon at yahoo.com Tue Apr 7 07:32:52 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 7 Apr 2009 04:32:52 -0700 (PDT)
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <20090406220826.GH43636@sobchak.mgh.harvard.edu>
Message-ID: <316000.69837.qm@web62407.mail.re1.yahoo.com>
Hi Brad,
Thanks for your work on the GFF parser; I'm dealing with GFF files quite a lot. Could you maybe give a simple example of how to use your GFF parser, once it's included into Biopython?
--Michiel.
--- On Mon, 4/6/09, Brad Chapman wrote:
> From: Brad Chapman
> Subject: Re: [Biopython-dev] Bio.GFF and Brad's code
> To: biopython-dev at lists.open-bio.org
> Date: Monday, April 6, 2009, 6:08 PM
> Peter;
> Thanks for the plug. GFF parsing is moving along; the main
> feature
> two things I would like to finish before proposing it for
> inclusion
> are writing of GFF files and putting GFF into BioSQL with
> the nested
> features. The code does work for parsing, and I've been
> using it for
> some real projects; anyone who would like to test it is
> more than
> welcome.
>
> As far as the current Bio.GFF, that is a bit of a
> conundrum. The
> current code does work and for some cases it would be nice
> of having
> the utility of working with GFF from a database. Eventually
> BioSQL
> from GFF may supplant that, but that should be finished and
> tested
> first. I would argue for keeping it in.
>
> However, it is a bit confusing if someone is looking for a
> parser. It
> would make more sense if it lived under a namespace like
> Bio.GFF.DB.
> What do you think about adding a warning that it is going
> to move to
> a new namespace and then moving it there, if we don't
> hear any
> complaints, for 1.51? This is less cumbersome than a
> removal for
> users since it's just an import change.
>
> Brad
>
>
>
> > Brad has been working on his GFF parsing code - see
> progress reports
> > on his blog http://bcbio.wordpress.com/ and his code
> on github,
> > http://github.com/chapmanb/bcbb/tree/master/gff
> >
> > Potentially this could make it into Biopython 1.51,
> and I was just
> > thinking about where the code would go. Brad is
> supporting both GFF3
> > and the loosely defined GFF2 variants, so Bio.GFF
> seems a good place.
> > There would also be a wrapper under Bio.SeqIO for
> loading GFF files as
> > SeqRecord objects (I haven't played with
> Brad's code, but it can do
> > this already).
> >
> > However, we already have a Bio.GFF module from Michael
> Hoffman created
> > back in 2002 which accesses MySQL General Feature
> Format (GFF)
> > databases created with BioPerl. Perhaps we should
> poll the main
> > discussion list now, and if there are no responses
> from people using
> > it, we could deprecate Bio.GFF for Biopython 1.50?
> Under our current
> > deprecation policy we shouldn't then remove
> Bio.GFF until Biopython
> > 1.52 at the earliest,
> http://biopython.org/wiki/Deprecation_policy
> >
> > What do you think Brad? How about using Bio.GFF3
> instead?
> >
> > Peter
> > _______________________________________________
> > Biopython-dev mailing list
> > Biopython-dev at lists.open-bio.org
> >
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
From bartek at rezolwenta.eu.org Tue Apr 7 08:35:21 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Tue, 7 Apr 2009 14:35:21 +0200
Subject: [Biopython-dev] Merging branches
In-Reply-To: <6d941f120904060335m380ac820k97a558e8332fdf63@mail.gmail.com>
References: <6d941f120904040943p60f4fb33m38826b2f72eedb97@mail.gmail.com>
<320fb6e00904041001q4c5b0067i7ca9eaf6a6e52b43@mail.gmail.com>
<6d941f120904041029s46b6cb02x30c68885acbfb47f@mail.gmail.com>
<8b34ec180904050249n156185abx4538e51876280266@mail.gmail.com>
<6d941f120904060335m380ac820k97a558e8332fdf63@mail.gmail.com>
Message-ID: <8b34ec180904070535r3a6f23e8w9b917f7592930eda@mail.gmail.com>
Hi,
2009/4/6 Tiago Ant?o :
>> This means, that if you only need a part of what was committed as a
>> big changeset,
>> you will need to put an extra effort into selecting what you need.
>
> But how do you do that (other than manually copying files)?
I think that in this case you need to do this manually.
If you care only about one file, copying it is the easiest option.
> Cherry pick seems to be commit based...
In fact the whole git is commit based. It's not tracking files as
such, but blobs of data.
>I would like to replace this:
>http://github.com/tiagoantao/biopython-popgen-test/blob/fa5ebc23e7aaabce94ae594d9a4f83be9bf90215/Bio/PopGen/Stats/Simple.py
>With this:
>http://github.com/dalloliogm/biopython/blob/cbaf6249cb91ed505cb575f09c2eaef3809872b9/Bio/PopGen/Stats/Simple.py
>It would be cool not to loose the history relationship (I suppose that
>would be the good practice).
Indeed, keeping history is the right thing and it was one of the
reasons to switch to git.
It would be perfect if Giovanni could "redo" some of his commits and
split them into
smaller operations, so that cherry picking commits would be possible.
I know it's a pain...
> The point is: wanting to use part of a commit without loosing history.
> In my case, I dont want to import a test_PopGen_Fst file that Gio has.
> That being said, I dont think this is a big deal. I was just to
> preserve the history connectivity between repositiories. I think we
> can just use the old fashioned method of copying some files around.
> But it would be good to know if there is a "best practice" (which, I
> could not find out)
As far as I can tell, there is no way you could take only a part of a
commit. The best practice is to make smaller, atomic commits. It has
many advantages:
-it's easier to document a smaller change (I think it makes up for
potentially more work because of more commits)
-you can then "undo" small locally committed changes before pushing
them to public repo
-cherry picking of nicely documented small changes is an easy job
In this particular case of changes in tests, I think really changes to
one test should be committed separately from changes in other tests.
cheers
Bartek
From tiagoantao at gmail.com Tue Apr 7 12:43:49 2009
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Tue, 7 Apr 2009 17:43:49 +0100
Subject: [Biopython-dev] PopGen Stats
Message-ID: <6d941f120904070943n7de7afa7m262dd4f4c0149cb@mail.gmail.com>
Hi,
I've started a page documenting the effort to implement statstics here
http://biopython.org/wiki/PopGen_dev_Statistics
anyone is welcomed to participate.
I was expecting to have a personal hurdle during this week, which
didn't happen. So I expect to be working heavily on this (finally).
Tiago
--
"A man who dares to waste one hour of time has not discovered the
value of life" - Charles Darwin
From peter at maubp.freeserve.co.uk Tue Apr 7 15:38:50 2009
From: peter at maubp.freeserve.co.uk (Peter)
Date: Tue, 7 Apr 2009 20:38:50 +0100
Subject: [Biopython-dev] Fwd: [blast-announce] BLAST 2.2.20 now available
In-Reply-To: <9E9BBBAC-1E4D-4C2E-AD8E-7F6DF6F317A5@ncbi.nlm.nih.gov>
References: <9E9BBBAC-1E4D-4C2E-AD8E-7F6DF6F317A5@ncbi.nlm.nih.gov>
Message-ID: <320fb6e00904071238v7ed94541me4af7937fa7ae27b@mail.gmail.com>
Hi all,
There is a new version of BLAST out - we'll need to check if the
NCBI's online server has been updated (if so, our unit test
test_NCBI_qblast.py should catch any obvious issues).
We'll also want to check the standalone version of BLAST is OK.
Point (2) below sounds interesting, previously using BLAST databases
with spaces in the path on Windows was rather hairy.
Peter
---------- Forwarded message ----------
From: mcginnis
Date: Apr 7, 2009 1:50 PM
Subject: [blast-announce] BLAST 2.2.20 now available
To: blast-announce at ncbi.nlm.nih.gov
New BLAST binaries are available on the NCBI FTP site
(ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/)
The list of changes are:
1.) Ungapped blastn searches allow arbitrary reward/penalty scores.
2.) Spaces are allowed in database pathnames on windows
3.) Seedtop now has gilist support.
4.) Fix a bug that caused the number and order of queries to affect
blastx results.
5.) Modified the 2-hit blastn algorithm so that no overlap is allowed
between hits.
From jacobporter2002 at yahoo.com Tue Apr 7 22:27:21 2009
From: jacobporter2002 at yahoo.com (Jacob Porter)
Date: Tue, 7 Apr 2009 19:27:21 -0700 (PDT)
Subject: [Biopython-dev] Phylogeny modules for BioPython
Message-ID: <296822.1198.qm@web33706.mail.mud.yahoo.com>
Hi all,
My name is Jacob Porter, and I am a graduate student in the math department at UC Davis.? I've done work before on phylogeny inference using so-called "phylogenetic invariants" that can be found at the website: http://www.shsu.edu/~ldg005/small-trees/
It appears to me that BioPython doesn't have much support for phylogeny inference and tools related to phylogeny inference.
I have applied to the Google Summer of Code (12 weeks of working part-time on a programming assignment), and I am looking for a project that could work with BioPython as I see a lot of potential in it.? I can bring my expertise on phylogeny inference to this project to add some support for this.
I need three things from the community ASAP:
1) Ideas as to which of my several project ideas are the most useful to the BioPython community
2) Information as to what is already included in BioPython concerning phylogeny inference and related tools
3) A mentor that will help me with the project (and possibly work in conjunction with Nascent (https://www.nescent.org/wg_phyloinformatics/Main_Pagementors)? I?would need a 12 -week schedule of tasks for the project (TBD), and answers to questions related to developing for BioPython.? (I've worked with Python a lot before, so I shouldn't need much help with Python so much as I need help with BioPython).
Project?1:
Add support for popular phylogeny representation standards such as DND files.? Give the ability to read and write such files.? Convert between such files.? I need help in picking which standards to use and need help in picking which operations on these files is the most useful.
Project?2:
Add wrappers for modern (hopefully high throughput and accurate) phylogeny inference software written in C++/C.? Examples of such software include neighbor-joining, MJOIN software (similar to neighbor-joining) (http://bio.math.berkeley.edu/mjoin/), Garli (http://www.molecularevolution.org/si/software/garli/), treeSVD (http://www.stat.uchicago.edu/~eriksson/software.html), and maximum parsimony.? I would like to know which sort of phylogeny inference software is the most useful in your opinion.? I assume no wrappers for such software exist.
Project?3:
Add analytic algorithms that use phylogeny in some way.? Examples include bootstrapping and protein-protein interaction inference algorithms.? (i.e. "Inferring protein interactions from phylogenetic distance matrices" by Gertz et al.)? I need information as to what sort of algorithms would be useful.
Project 4:
Enhance phylogeny inference software further.? MJOIN has bugs (I think it returns negative distances in some cases, and some modifications to it that I developed using phylogenetic invariants are seg-faulting).
Not all of these ideas will probably be able to be developed, so I need information as to what might be the most useful.? I was thinking of focusing on Project 1 and Project 2 for the initial phase.
Any information will be appreciated, and any mentorship will be great.? I would like a response quickly, so that I can inform Nascent of my plans.
Thanks,
Jacob Porter
UC Davis
From p.j.a.cock at googlemail.com Wed Apr 8 04:54:35 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 8 Apr 2009 09:54:35 +0100
Subject: [Biopython-dev] Phylogeny modules for BioPython
In-Reply-To: <296822.1198.qm@web33706.mail.mud.yahoo.com>
References: <296822.1198.qm@web33706.mail.mud.yahoo.com>
Message-ID: <320fb6e00904080154j433ef69cmd99b847240ee5b39@mail.gmail.com>
On 4/8/09, Jacob Porter wrote:
>
> Hi all,
>
> My name is Jacob Porter, and I am a graduate student in the math
> department at UC Davis. I've done work before on phylogeny inference
> ...
> It appears to me that BioPython doesn't have much support for
> phylogeny inference and tools related to phylogeny inference.
I'm sure there is room for improvement.
> I have applied to the Google Summer of Code (12 weeks of
> working part-time on a programming assignment), and I am
> looking for a project that could work with BioPython as I see
> a lot of potential in it. I can bring my expertise on phylogeny
> inference to this project to add some support for this.
>
> I need three things from the community ASAP:
>
> 1) Ideas as to which of my several project ideas are the
> most useful to the BioPython community
Personally, I might pick command line wrappers for existing command
line tools. However, these don't actually make anything new possible,
as writting your own command line is already fairly easy. This in
itself wouldn't be that much work either.
> 2) Information as to what is already included in BioPython
> concerning phylogeny inference and related tools
Look at Bio.Nexus, plus somewhat related, Bio.AlignIO.
> 3) A mentor that will help me with the project (and
> possibly work in conjunction with Nascent
> (https://www.nescent.org/wg_phyloinformatics/Main_Pagementors)
> I would need a 12 -week schedule of tasks for the
> project (TBD), and answers to questions related to
> developing for BioPython. (I've worked with Python
> a lot before, so I shouldn't need much help with
> Python so much as I need help with BioPython).
Brad Chapman may be willing to mentor a GSoC student, have a look back
of the recent email discussions here. In particular, Nick Matzke has
already expressed some interest in Biogeographical and community
phylogenetics for Biopython (there is a wiki page on open-bio.org on
this).
> Project 1:
> Add support for popular phylogeny representation
> standards such as DND files. Give the ability to
> read and write such files. Convert between such
> files. I need help in picking which standards to use
> and need help in picking which operations on these
> files is the most useful.
We have this already in Bio.Nexus, but there is still room for
improvement - see Bug 2788 for example.
> Project 2:
> Add wrappers for modern (hopefully high throughput
> and accurate) phylogeny inference software written in
> C++/C. Examples of such software include
> neighbor-joining, MJOIN software (similar to
> neighbor-joining) (http://bio.math.berkeley.edu/mjoin/),
> Garli (http://www.molecularevolution.org/si/software/garli/),
> treeSVD (http://www.stat.uchicago.edu/~eriksson/software.html),
> and maximum parsimony. I would like to know which
> sort of phylogeny inference software is the most useful
> in your opinion. I assume no wrappers for such software
> exist.
Well, Bio.Nexus is a great help with certain tools. There is scope
for adding more command line wrappers though (I like quick-join and
and also quicktree for NJ tree building).
> Project 3:
> Add analytic algorithms that use phylogeny in some
> way. Examples include bootstrapping and protein-protein
> interaction inference algorithms. (i.e. "Inferring protein
> interactions from phylogenetic distance matrices" by
> Gertz et al.) I need information as to what sort of
> algorithms would be useful.
I feel that this is still very much an active area of research, and
there are no clear gold standards. However, perhaps some published
algorithms may be worth re-implementing in Biopython. I would still
tend to favour more general work for Biopython that would support
people implementing any/their own algorithm.
> Project 4:
> Enhance phylogeny inference software further.
> MJOIN has bugs (I think it returns negative distances
> in some cases, and some modifications to it that I
> developed using phylogenetic invariants are seg-faulting).
Fixing any bug in MJOIN sounds like a good idea - but doesn't really
affect Biopython directly.
> Not all of these ideas will probably be able to be
> developed, so I need information as to what might
> be the most useful. I was thinking of focusing on
> Project 1 and Project 2 for the initial phase.
>
> Any information will be appreciated, and any
> mentorship will be great. I would like a response
> quickly, so that I can inform Nascent of my plans.
Peter.
P.S. Its Biopython, not BioPython
From chapmanb at 50mail.com Wed Apr 8 08:32:26 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 8 Apr 2009 08:32:26 -0400
Subject: [Biopython-dev] Phylogeny modules for BioPython
In-Reply-To: <320fb6e00904080154j433ef69cmd99b847240ee5b39@mail.gmail.com>
References: <296822.1198.qm@web33706.mail.mud.yahoo.com>
<320fb6e00904080154j433ef69cmd99b847240ee5b39@mail.gmail.com>
Message-ID: <20090408123226.GL43636@sobchak.mgh.harvard.edu>
Jacob;
Thanks much for your interest in Biopython for Summer of Code; glad
to see a discussion here about your proposal.
Peter's comments are great; I will add to them from the SoC
perspective.
> > I have applied to the Google Summer of Code (12 weeks of
> > working part-time on a programming assignment)
SoC is a full time commitment for the summer. Your proposal also
lists some conflicts (classes, other research) for the summer
months. On your updated proposal you should be explicit about these
and describe how you plan to make up time you miss during the first
two weeks of the quarter.
More generally, your proposal needs a detailed plan of deliverables
on a week to week basis over the project timeline, starting with
coding on May 23rd:
http://socghop.appspot.com/document/show/program/google/gsoc2009/timeline
This is the last hour for refining proposals, so you will need to
update your proposal quickly for us to still have time to consider
it. I would recommend copying your current proposal to a Google Doc,
adding all of the specifics needed, and then submitting a link to
the open document as a comment to your initial proposal.
> Brad Chapman may be willing to mentor a GSoC student, have a look back
> of the recent email discussions here. In particular, Nick Matzke has
> already expressed some interest in Biogeographical and community
> phylogenetics for Biopython (there is a wiki page on open-bio.org on
> this).
I am definitely willing to help; spots will be very competitive
throughout the program.
Echoing Peter's comments, I would put together a project proposal
that tackles:
- Improving parsing support in Bio.Nexus, based on existing code and
bug reports, and other suggestions you might have.
- Providing code wrapping for other phylogeny software. Since the
usefulness of different algorithms depends heavily on the context
in which it is used, you will not find a consensus about which
program is most useful. My suggestion is to suggest wrappers for
several useful programs covering the spectrum of possibilities.
In additions to the ones you listed, a couple others are:
RAxML http://icwww.epfl.ch/~stamatak/index-Dateien/Page443.htm
FastTree http://www.microbesonline.org/fasttree/index.html
- A higher level API over the parsing and command line program support
that helps users with specific phylogenetic tasks. Based on your
experience and input from the Biopython community of users, this
would have the goal of providing a simple way to do common tasks.
This should be a combination of code to surround repetitive items,
and cookbook style documentation to help people with specific
phylogenetic problems.
Other general suggestions:
- Tests. Please describe your plans to write unit tests for all the
code your write.
- Documentation. Please do leave time in your project plan to fully
document using your proposed code.
- Projects 3 and 4, as Peter suggests, are out of the scope of GSoC.
3, specifically, is more of a research project.
Finally, a few meta-items from your e-mail meant as helpful advice:
> It appears to me that BioPython doesn't have much support for
> phylogeny inference and tools related to phylogeny inference.
I understand this is an attempt to provide motivation for your
proposal, but you should do so in a way that does not disparage the
work of the people you are soliciting advice from. Your request
would be better received if you described it in the context of
improving existing phylogenetic support in Biopython.
> I need three things from the community ASAP:
[...]
> I would like a response quickly
No one likes to be told what to do, much less a group your are
requesting help and hopefully a job from. Again, you should think
about how your phrasing will be interpreted by those reading it.
> Nascent
You twice misspelled this: NESCent. Mistakes happen, but it reflects
badly on your commitment to the project to not be able to spell the
name of the organization you would like to work with. These are the
small things you should be careful and double check.
Thanks again for your interest and looking forward to seeing your
revised project plan,
Brad
From chapmanb at 50mail.com Wed Apr 8 08:49:08 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 8 Apr 2009 08:49:08 -0400
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <316000.69837.qm@web62407.mail.re1.yahoo.com>
References: <20090406220826.GH43636@sobchak.mgh.harvard.edu>
<316000.69837.qm@web62407.mail.re1.yahoo.com>
Message-ID: <20090408124908.GN43636@sobchak.mgh.harvard.edu>
Hi Michiel;
> Thanks for your work on the GFF parser; I'm dealing with GFF files
> quite a lot. Could you maybe give a simple example of how to use your
> GFF parser, once it's included into Biopython?
Awesome; I'm glad it will be useful. I'd definitely welcome any
feedback you have on the API or implementation. At this stage we can
be flexible and hopefully get it finalized before it hits Biopython.
I will get some user documentation together soon, but here is some
basic usage.
To parse an entire GFF file, getting all features at once:
from BCBio.GFF.GFFParser import GFFAddingIterator
gff_iterator = GFFAddingIterator()
rec_dict = gff_iterator.get_all_features(gff_file)
The returned dictionary is like a dictionary from SeqIO.to_dict;
keys are ids and values are SeqRecords.
You can also seed the parser with an initial dictionary containing
sequences or other features, and the features from the GFF file will
be added to those records:
with open(seq_file) as seq_handle:
seq_dict = SeqIO.to_dict(SeqIO.parse(seq_handle, "fasta"))
gff_iterator = GFFAddingIterator(seq_dict)
If a file is very large, you have two ways of limiting the size of
items parsed. The first is to specify which items you are interested
and return only those. This code will parse out coding transcripts
on chromosome I:
cds_limit_info = dict(
gff_source_type = [('Coding_transcript', 'gene'),
('Coding_transcript', 'mRNA'),
('Coding_transcript', 'CDS')],
gff_id = ['I']
)
rec_dict = gff_iterator.get_all_features(gff_file, limit_info=cds_limit_info)
The second is to use an iterator over a section of the file:
for rec_dict in gff_iterator.get_features(gff_file, target_lines=1000000):
# handle partial rec dictionary of first 1000000 lines
Finally, there is an interface to examine a GFF file and figure out
useful ways to limit it. This will give you a dictionary of all
possible ways to limit a file along with the counts in each:
gff_examiner = GFFExaminer()
possible_limits = gff_examiner.available_limits(gff_file)
and this will give a dictionary of the parent-child relationships in
the file:
gff_examiner = GFFExaminer()
pc_map = gff_examiner.parent_child_map(gff_file)
Since GFF providers tend to differ in how they structure their
information, this helps get a quick overview of the file to
determine how to manage it.
Happy to hear about thoughts you might have. Thanks,
Brad
>
> --Michiel.
>
>
> --- On Mon, 4/6/09, Brad Chapman wrote:
>
> > From: Brad Chapman
> > Subject: Re: [Biopython-dev] Bio.GFF and Brad's code
> > To: biopython-dev at lists.open-bio.org
> > Date: Monday, April 6, 2009, 6:08 PM
> > Peter;
> > Thanks for the plug. GFF parsing is moving along; the main
> > feature
> > two things I would like to finish before proposing it for
> > inclusion
> > are writing of GFF files and putting GFF into BioSQL with
> > the nested
> > features. The code does work for parsing, and I've been
> > using it for
> > some real projects; anyone who would like to test it is
> > more than
> > welcome.
> >
> > As far as the current Bio.GFF, that is a bit of a
> > conundrum. The
> > current code does work and for some cases it would be nice
> > of having
> > the utility of working with GFF from a database. Eventually
> > BioSQL
> > from GFF may supplant that, but that should be finished and
> > tested
> > first. I would argue for keeping it in.
> >
> > However, it is a bit confusing if someone is looking for a
> > parser. It
> > would make more sense if it lived under a namespace like
> > Bio.GFF.DB.
> > What do you think about adding a warning that it is going
> > to move to
> > a new namespace and then moving it there, if we don't
> > hear any
> > complaints, for 1.51? This is less cumbersome than a
> > removal for
> > users since it's just an import change.
> >
> > Brad
> >
> >
> >
> > > Brad has been working on his GFF parsing code - see
> > progress reports
> > > on his blog http://bcbio.wordpress.com/ and his code
> > on github,
> > > http://github.com/chapmanb/bcbb/tree/master/gff
> > >
> > > Potentially this could make it into Biopython 1.51,
> > and I was just
> > > thinking about where the code would go. Brad is
> > supporting both GFF3
> > > and the loosely defined GFF2 variants, so Bio.GFF
> > seems a good place.
> > > There would also be a wrapper under Bio.SeqIO for
> > loading GFF files as
> > > SeqRecord objects (I haven't played with
> > Brad's code, but it can do
> > > this already).
> > >
> > > However, we already have a Bio.GFF module from Michael
> > Hoffman created
> > > back in 2002 which accesses MySQL General Feature
> > Format (GFF)
> > > databases created with BioPerl. Perhaps we should
> > poll the main
> > > discussion list now, and if there are no responses
> > from people using
> > > it, we could deprecate Bio.GFF for Biopython 1.50?
> > Under our current
> > > deprecation policy we shouldn't then remove
> > Bio.GFF until Biopython
> > > 1.52 at the earliest,
> > http://biopython.org/wiki/Deprecation_policy
> > >
> > > What do you think Brad? How about using Bio.GFF3
> > instead?
> > >
> > > Peter
> > > _______________________________________________
> > > Biopython-dev mailing list
> > > Biopython-dev at lists.open-bio.org
> > >
> > http://lists.open-bio.org/mailman/listinfo/biopython-dev
> > _______________________________________________
> > Biopython-dev mailing list
> > Biopython-dev at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biopython-dev
>
>
>
From bugzilla-daemon at portal.open-bio.org Wed Apr 8 18:55:59 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 8 Apr 2009 18:55:59 -0400
Subject: [Biopython-dev] [Bug 2808] New: Bio.SeqIO "ig" format parser
doesn't deal with optional 1 terminator
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2808
Summary: Bio.SeqIO "ig" format parser doesn't deal with optional
1 terminator
Product: Biopython
Version: Not Applicable
Platform: All
OS/Version: All
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk
While working on new unit test test_Emboss.py I noticed that EMBOSS seqret
creates ig files where the sequence includes a terminal digit one. Further
research online suggests this is an optional feature of the file format,
although not commonly used. See:
http://bmerc-www.bu.edu/needle-doc/latest/seq-formats.html#seq-file-format
The Bio.SeqIO "ig" parser should be aware of the (optional) terminal "1"
marker, and not include it in the returned sequence. Perhaps we should even
add this when writing the files.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From chapmanb at 50mail.com Fri Apr 10 09:10:34 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 10 Apr 2009 09:10:34 -0400
Subject: [Biopython-dev] Invitation for Biopython news coordinators
In-Reply-To: <49DD5575.4040901@student.otago.ac.nz>
References: <20090406230542.GK43636@sobchak.mgh.harvard.edu>
<49DD5575.4040901@student.otago.ac.nz>
Message-ID: <20090410131034.GH54672@sobchak.mgh.harvard.edu>
David;
Thanks for taking the time to write; it is great to hear that you are
interested. Copying this to the dev list so others can comment and you
can feel free to discuss as much as you want.
> I'd be keen to help spread the good word about bio-python, I'm a very
> novice programmer who has been using the tools to work on some 454
> transcriptome data. I will probably never be a good enough programmer to
> contribute code to the project so would see this as a way to "give
> something back".
Perfect. Getting involved is the first step; you'd be
surprised how much you can learn just by taking on new tasks. I
started helping with Biopython by writing documentation.
> For me as a n00b the most useful resource by far has been the cookbook -
> seeing some working scripts that I could change to suit my ends has
> helped me get to the point that I can write much more generalised code
> for my project 'from scratch'. To that end I think it would be really
> helpful to highlight work that other people have done, either published
> or made available by authors, with a little detail on the questions
> and the way BioPython was used to get at them. We could extend it to
> show some "use cases" for BioPython working with other programs or how
> new features can be used once they are included in the main release.
>
> To me the most obvious way of presenting such information would be a
> blog, we could invite authors and developers to make short posts and
> failing that I'd be happy write up posts summarising published research.
> We could also try an aggregate blogs from the devs and anyone else
> talking about biopython "in the wild".
This sounds great. You are welcome to use the twitter account, news
posts, the wiki, or a blog -- however you see fit. For your aggregation
idea, you might want to take a look at friendfeed. It's pretty simple
to set up a room and pull in RSS feeds, twitter postings, and what not.
There is a Python for Bioinformatics room:
http://friendfeed.com/rooms/python-for-bioinformatics
Most feeds come from general Python sources so it is a bit more
broad, but is a good starting place. I know some of the admins (Chris,
Paulo, Andrew) are around here, and may want to chime in.
For publications, Peter has done a lot of work on identifying papers
that use Biopython:
http://biopython.org/wiki/Publications
Building on this to include short reusable examples from the research
would be very useful.
> Anyway, those are a few ideas, I'm definitely keen to help out and to
> take on board any other ideas that are out there.
Great, let us know how you want to get started. Feel free to start
with something small and expand from there. Peter can help out
with account information for twitter; if you need other things just
ask away.
Brad
> Cheers,
> David
>
> Brad Chapman wrote:
> > Biopythonistas;
> > Communication is a key component of successful open source projects.
> > The challenges of distributed programming by volunteers can be
> > overcome by ensuring that the whole community is aware of
> > interesting discussions, new contributions, and development goals.
> > Traditionally, this communication has happened through our mailing
> > lists, wiki pages, and bug tracking system. While these will
> > continue to to be useful resources, new methods of disseminating
> > information are changing how we interact through the web.
> >
> > I'd like to issue an invitation for anyone interested in helping
> > revolutionize how Biopython news is disseminated. We are looking for
> > contributors from the community to brainstorm new ways to make the
> > discussions that happen at biopython.org accessible. You would
> > actively follow development here and on the development lists and
> > distill this information into useful quick bullet points for those
> > interested in Biopython but too busy to follow detailed discussions.
> >
> > We are proposing two ways to do this:
> >
> > - Monthly highlights on our news server:
> > http://news.open-bio.org/news/category/obf-projects/biopython/
> > The RSS feed from these posts are currently widely distributed around the
> > internet.
> >
> > - More frequent pointers to interesting discussions or other items
> > of interest happening in Biopython through our Twitter account:
> > http://twitter.com/biopython
> >
> > This is an opportunity for those of you who are looking to become
> > more involved, and would like to learn more about Biopython by
> > following all of the coding activity more closely. The position is
> > very flexible and we are happy to have one or more people take it
> > on; we would also encourage you to be as creative as you want in
> > doing so.
> >
> > I see this as an chance to both provide information and to highlight
> > the great work people do at Biopython. If you are interested in
> > taking on this role please respond with your ideas. Thanks for your
> > interest,
> >
> > Brad
> > _______________________________________________
> > BioPython mailing list - BioPython at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biopython
> >
From bugzilla-daemon at portal.open-bio.org Fri Apr 10 10:13:58 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 10 Apr 2009 10:13:58 -0400
Subject: [Biopython-dev] [Bug 2809] New: Adding startswith and endswith
methods to the Seq object
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2809
Summary: Adding startswith and endswith methods to the Seq object
Product: Biopython
Version: Not Applicable
Platform: All
OS/Version: All
Status: NEW
Severity: enhancement
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk
OtherBugsDependingO 2351
nThis:
As part of making the Seq object more like the Python string (Bug 2351), we
need alphabet aware startswith and endswith methods. Patch to follow.
There are many possible use cases for this. One example which prompted me to
work on this was taking SeqRecord objects from sequencing reads (a FASTQ file
read in with Bio.SeqIO) where some include a PCR primer associated
prefix/suffix which I want to strip off (by slicing the SeqRecord). To do this
I need to know if a given SeqRecord's sequence starts with (or ends with) a
given primer sequence (or tuple of primer sequences).
Current work around, str(record.seq).startswith(prefix)
Patch to follow, which will allow record.seq.startswith(prefix) directly.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Fri Apr 10 10:13:59 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 10 Apr 2009 10:13:59 -0400
Subject: [Biopython-dev] [Bug 2351] Make Seq more like a string,
even subclass string?
In-Reply-To:
Message-ID: <200904101413.n3AEDx5I004913@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2351
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
BugsThisDependsOn| |2809
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Fri Apr 10 10:15:27 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 10 Apr 2009 10:15:27 -0400
Subject: [Biopython-dev] [Bug 2809] Adding startswith and endswith methods
to the Seq object
In-Reply-To:
Message-ID: <200904101415.n3AEFRRb005139@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2809
------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-10 10:15 EST -------
Created an attachment (id=1275)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1275&action=view)
Patch to Bio/Seq.py and Tests/test_Seq_objs.py
Adds startswith and endswith methods to the Seq object, and tests these with
simple doctest and a longer separate unit test.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From biopython at maubp.freeserve.co.uk Fri Apr 10 10:46:02 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 10 Apr 2009 15:46:02 +0100
Subject: [Biopython-dev] Tutorial & Cookbook
Message-ID: <320fb6e00904100746q2478551dna253f1aa5acbcdb4@mail.gmail.com>
David wrote:
>> For me as a n00b the most useful resource by far has been the cookbook -
>> seeing some working scripts that I could change to suit my ends has
>> helped me get to the point that I can write much more generalised code
>> for my project 'from scratch'. ...
When you said "cookbook", did you mean the Biopython Tutorial & Cookbook?
http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
There are a couple of other documents under the "Cookbook" folder here:
http://biopython.org/DIST/docs/cookbook/Restriction.html
http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf
I have been wondering if the "Biopython Tutorial & Cookbook" should be
separated now - it is getting a bit long (which in some ways is a good
thing!). Maybe we should re-title it as just the "Biopython
Tutorial". Some bits of the current "Cookbook chapter" might be moved
into the main body of the tutorial (e.g. the alignment stuff), but
having the cookbook entries separate might be a good idea.
For a separate "Cookbook", we could again use LaTeX for another
HTML/PDF document (or set of documents) but perhaps just a series of
pages on the wiki would be more accessible - and much easier for
people to contribute to? We'd need to organize things (e.g. a
cookbook category on the wiki) to make sure everything is still
accessible. As a bonus, it would give us more hits on Google - which
is probably a good thing.
On the other hand, it would be very good if all our cookbook use cases
could be rolled into the unit test framework - which wouldn't be so
easy if they live on the wiki. Something based on doctests might
work...
Peter
From bugzilla-daemon at portal.open-bio.org Fri Apr 10 13:29:06 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 10 Apr 2009 13:29:06 -0400
Subject: [Biopython-dev] [Bug 2808] Bio.SeqIO "ig" format parser doesn't
deal with optional 1 terminator
In-Reply-To:
Message-ID: <200904101729.n3AHT6g0020169@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2808
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-10 13:29 EST -------
(In reply to comment #0)
>
> The Bio.SeqIO "ig" parser should be aware of the (optional) terminal "1"
> marker, and not include it in the returned sequence.
>
Fixed in CVS,
Bio/SeqIO/IgIO.p revision 1.5
Tests/test_Emboss.py revision 1.10
>
> Perhaps we should even add this when writing the files.
>
We don't write out ig files so this isn't an issue at the moment.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From biopython at maubp.freeserve.co.uk Fri Apr 10 14:12:12 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 10 Apr 2009 19:12:12 +0100
Subject: [Biopython-dev] Bio.EMBOSS wrappers
Message-ID: <320fb6e00904101112h544aba7fla94029f32bbdb01d@mail.gmail.com>
Hi
Those of you following the CVS RSS feed will have noticed a lot of
activity on my new unit test test_Emboss.py, which now works on
Windows, Linux and Mac OS (provided EMBOSS is installed), and does
four main tasks:
- runs needle, checks Bio.AlignIO can parse the output
- runs water, checks Bio.AlignIO can parse the output
- runs seqret to check Bio.SeqIO
- runs seqret to check Bio.AlignIO
It would probably be logical to also include tests for the EMBOSS
version of primer3 here too, but I am not familiar with this tool and
the Biopython parsers.
For now I build the command line strings for seqret and needle "by
hand", as Bio.EMBOSS doesn't have wrappers for them yet. I also note
that the existing wrappers in Bio.EMBOSS don't support the very handy
-auto and -filter command line arguments supported by all (or at least
most) of the EMBOSS command line tools. Using -auto turns off any
user prompting for missing arguments (very important for calling from
a script). Using -filter is useful for running the tools with pipes
(i.e. no output file is required as stdout can be used instead, and
potentially no input file if we write to stdin correctly).
Brad, seeing as Bio.EMBOSS is "your baby" how do you feel about adding
these features? The needle wrapper would make an excellent basis for
a new water wrapper. For adding -auto and -filter support, there is
probably a clever approach with a common EMBOSS specific subclass of
Bio.Application.AbstractCommandline, but I haven't tried.
Peter
From mjldehoon at yahoo.com Fri Apr 10 22:26:45 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Fri, 10 Apr 2009 19:26:45 -0700 (PDT)
Subject: [Biopython-dev] Tutorial & Cookbook
In-Reply-To: <320fb6e00904100746q2478551dna253f1aa5acbcdb4@mail.gmail.com>
Message-ID: <93403.18413.qm@web62406.mail.re1.yahoo.com>
--- On Fri, 4/10/09, Peter wrote:
> I have been wondering if the "Biopython Tutorial &
> Cookbook" should be separated now - it is getting
> a bit long (which in some ways is a good thing!).
In my opinion, it doesn't matter if the "Biopython Tutorial & Cookbook" is long. I guess that few people actually print this document anyway.
I am in favor of having one "official" documentation for Biopython. If we have one Tutorial and one Cookbook, we'll have lots of overlap between the two, it'll be unclear what should be in the Tutorial and what in the Cookbook, and we'll have to make sure the two are consistent.
A cookbook on the Wiki could be helpful though, and since the Wiki pages can be fixed easily we won't have to worry so much about inconsistencies with the official documentation.
> Maybe we should re-title it as just the "Biopython Tutorial".
That sounds like a good idea.
> Some bits of the current "Cookbook chapter" might be moved
> into the main body of the tutorial (e.g. the alignment
> stuff),
Yes. The cookbook chapter has the same problem as a cookbook document; it's not clear what should go there. A more logical place for cookbook-style examples is at the end of each chapter in the documentation. For example, Bio.Entrez has a bunch of cookbook-style examples at the end of its chapter in the Biopython Tutorial & Cookbook.
Currently, there are not so many sections left in the cookbook chapter; most of them have become full-fledged chapters and were moved out of the cookbook chapter.
> For a separate "Cookbook", we could again use LaTeX for another
> HTML/PDF document (or set of documents) but perhaps just a
> series of pages on the wiki would be more accessible - and much
> easier for people to contribute to?
+1 for the wiki, -1 for another HTML/PDF document.
> On the other hand, it would be very good if all our
> cookbook use cases
> could be rolled into the unit test framework - which
> wouldn't be so
> easy if they live on the wiki. Something based on doctests
> might work...
Whereas it can be useful if some cookbook examples are part of the unit tests, I don't think it's absolutely required. I see a wiki cookbook more as complementary to the unit tests.
--Michiel.
From mjldehoon at yahoo.com Sat Apr 11 07:29:47 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sat, 11 Apr 2009 04:29:47 -0700 (PDT)
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <20090408124908.GN43636@sobchak.mgh.harvard.edu>
Message-ID: <830379.9837.qm@web62402.mail.re1.yahoo.com>
Hi Brad,
Thanks for the examples; that clarified it a lot.
I have a couple of suggestions of how to make the GFF parser more generally usable, and more consistent with other parsers in Biopython.
Looking at your first example:
> from BCBio.GFF.GFFParser import GFFAddingIterator
>
> gff_iterator = GFFAddingIterator()
> rec_dict = gff_iterator.get_all_features(gff_file)
>
> The returned dictionary is like a dictionary from
> SeqIO.to_dict;
> keys are ids and values are SeqRecords.
It's not clear to me why we need an iterator for GFF files. Can't we just use Python's line iterator instead? I would expect code like this:
from Bio import GFF
handle = open("my_gff_file.gff")
for line in handle:
# call the appropriate GFF function on the line
The second point is about GFFAddingIterator.get_all_features. If this is essentially analogous to SeqIO.to_dict, how about calling it GFF.to_dict?
Then the code looks as follows:
from Bio import GFF
handle = open("my_gff_file.gff")
rec_dict = GFF.to_dict(handle)
Another thing to consider is that IDs in the GFF file do not need to be unique. For example, consider a GFF file that stores genome mapping locations for short sequences stored in a Fasta file. Since each sequence can have more than one mapping location, we can have multiple lines in the GFF file for one sequence ID.
The last point is about storing SeqRecords in rec_dict. A GFF file typically does not store sequences; if it does, it's not clear which field in the GFF file does. On the other hand, a SeqRecord often does not contain the chromosomal location, which is what the GFF file stores. So why use a SeqRecord for GFF information?
Sorry for bringing up lots of issues. But I think that a GFF parser will be heavily used, so we should optimize its design as much as possible.
Best,
--Michiel.
From biopython at maubp.freeserve.co.uk Sun Apr 12 09:16:58 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 12 Apr 2009 14:16:58 +0100
Subject: [Biopython-dev] Bio.EMBOSS wrappers
In-Reply-To: <320fb6e00904101112h544aba7fla94029f32bbdb01d@mail.gmail.com>
References: <320fb6e00904101112h544aba7fla94029f32bbdb01d@mail.gmail.com>
Message-ID: <320fb6e00904120616u390cfe56w3889804d2bffd385@mail.gmail.com>
On 4/10/09, Peter wrote:
> Hi
>
> Those of you following the CVS RSS feed will have noticed a lot of
> activity on my new unit test test_Emboss.py, which now works on
> Windows, Linux and Mac OS (provided EMBOSS is installed), and does
> four main tasks:
>
> - runs needle, checks Bio.AlignIO can parse the output
> - runs water, checks Bio.AlignIO can parse the output
> - runs seqret to check Bio.SeqIO
> - runs seqret to check Bio.AlignIO
It now also runs transeq to check the Bio.Seq translations on all
common tables. This has shown up some differences in our translations
for ambiguous sequences - I may have found a bug in EMBOSS...
Peter
From sbassi at clubdelarazon.org Sun Apr 12 21:57:52 2009
From: sbassi at clubdelarazon.org (Sebastian Bassi)
Date: Sun, 12 Apr 2009 22:57:52 -0300
Subject: [Biopython-dev] Fwd: [blast-announce] BLAST 2.2.20 now available
In-Reply-To: <320fb6e00904071238v7ed94541me4af7937fa7ae27b@mail.gmail.com>
References: <9E9BBBAC-1E4D-4C2E-AD8E-7F6DF6F317A5@ncbi.nlm.nih.gov>
<320fb6e00904071238v7ed94541me4af7937fa7ae27b@mail.gmail.com>
Message-ID: <9e2f512b0904121857o3f8da862ycf40a0510f9dbd51@mail.gmail.com>
On Tue, Apr 7, 2009 at 4:38 PM, Peter wrote:
> Hi all,
....
> We'll also want to check the standalone version of BLAST is OK.
I've made the following check:
Run a blast query (with blast 2.2.20) with output in xml. Run my
python script that converts XML to HTML using Biopython (under
Biopython 1.50beta) and it worked OK. The script deals with most
information bits found in an XML blast file so if there is any change
in the blast output, this program would crash.
From eric.talevich at gmail.com Sun Apr 12 23:13:32 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Sun, 12 Apr 2009 23:13:32 -0400
Subject: [Biopython-dev] PDB tidy script
In-Reply-To: <2d7c25310904030631u56c642d6k83355d6bc4cd5d19@mail.gmail.com>
References: <320fb6e00903231405l479ddcc6of9cd0c1aa8fd98d4@mail.gmail.com>
<2d7c25310904030631u56c642d6k83355d6bc4cd5d19@mail.gmail.com>
Message-ID: <3f6baf360904122013k21aa8efcm4aae0ac872e8e6af@mail.gmail.com>
Hi Thomas & everyone,
I've started a separate branch on GitHub for this work:
http://github.com/etal/biopython/tree/pdbtidy
I pushed one small change just now (partly to play with git branches), which
is basically the example code I gave earlier. It wraps the PDBLoader and
parse_pdb_header classes, and sticks a finger into PDBList too, so that
parsing and building a structure from a PDB file is a one-liner for both
local and RCSB-hosted files:
>>> from Bio import PDB
>>> prot = PDB.load('pdb2hmb.ent')
>>> dir(prot)
['__doc__', '__init__', '__module__', 'author', 'compound',
'deposition_date', 'head', 'journal', 'journal_reference', 'keywords',
'name', 'release_date', 'resolution', 'source', 'structure',
'structure_method', 'structure_reference']
Or:
>>> PDB.fetch('2hmb')
/usr/lib/python2.5/site-packages/Bio/PDB/PDBList.py:240: UserWarning:
Retrieving
ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided/pdb/hm/pdb2hmb.ent.gz
warn("Retrieving %s" % url)
(The warning is supposed to be a comment, but that cleanup is happening in
another branch: http://github.com/etal/biopython/tree/bug2754 ).
My idea is to pull all of the parse_pdb_header data out of the PDBParser and
Structure classes, and store it in the PDBLoader wrapper instead. The
existing "header" attributes can point to the PDBLoader parent if it exists,
or temporarily contain None or "" if necessary to avoid breaking scripts,
according to the deprecation plan. Annotations could either stay in
Structure or move to Loader. Then we'd have a fast, lean, consistent
hierarchy of classes for 3D structure work, and an easy API for loading and
exploring PDB files interactively.
Part of the pdbtidy concept is to check that the PDB header is consistent
with the structure it represents, so I'd like the API for metadata to be
just as nice as the existing one for 3D structure.
So, this is just a start, but I hope the intent is clear enough that someone
will tell me to stop if the whole idea is misguided.
Thanks,
Eric
From biopython at maubp.freeserve.co.uk Mon Apr 13 05:51:38 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 13 Apr 2009 10:51:38 +0100
Subject: [Biopython-dev] Fwd: [blast-announce] BLAST 2.2.20 now available
In-Reply-To: <9e2f512b0904121857o3f8da862ycf40a0510f9dbd51@mail.gmail.com>
References: <9E9BBBAC-1E4D-4C2E-AD8E-7F6DF6F317A5@ncbi.nlm.nih.gov>
<320fb6e00904071238v7ed94541me4af7937fa7ae27b@mail.gmail.com>
<9e2f512b0904121857o3f8da862ycf40a0510f9dbd51@mail.gmail.com>
Message-ID: <320fb6e00904130251k3e3e77f2x20e03fba19fd8ff7@mail.gmail.com>
On Mon, Apr 13, 2009 at 2:57 AM, Sebastian Bassi
wrote:
> On Tue, Apr 7, 2009 at 4:38 PM, Peter wrote:
>> Hi all,
> ....
>> We'll also want to check the standalone version of BLAST is OK.
>
> I've made the following check:
> Run a blast query (with blast 2.2.20) with output in xml. Run my
> python script that converts XML to HTML using Biopython (under
> Biopython 1.50beta) and it worked OK. The script deals with most
> information bits found in an XML blast file so if there is any change
> in the blast output, this program would crash.
Great - thanks for checking that :)
Peter
From biopython at maubp.freeserve.co.uk Mon Apr 13 06:44:29 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 13 Apr 2009 11:44:29 +0100
Subject: [Biopython-dev] BOSC 2009
Message-ID: <320fb6e00904130344l2dfbeb89je62cbe8410181fe2@mail.gmail.com>
Hello Biopythoneers,
Those of you following the dev-mailing list or the OBF news feed will
know that talk abstracts for BOSC 2009 are due in today, see
http://www.open-bio.org/wiki/BOSC_2009
I should to be able to attend and present the Biopython Project
Update, and a few other Biopython developers may also be around too,
so some sort of hackathon is in the air.
It is a bit unfortunate the deadline was scheduled on the Easter
break, as I'm sure quite a few of you will be on holiday, but here is
an outline abstract. If anyone has comments, please let me know (on
the list or directly) in the next couple of hours...
Biopython Project Update (draft abstract for BOSC 2009)
In this talk we present the current status of the Biopython project,
focusing on features developed in the last year, and future plans for
the project. The Oxford University Press journal Bioinformatics has
recently published an application note describing Biopython:
Cock PJ, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I,
Hamelryck T, Kauff F, Wilczynski B, and de Hoon MJ. Biopython: freely
available Python tools for computational molecular biology and
bioinformatics. Bioinformatics 2009 Mar 20.
doi:10.1093/bioinformatics/btp163
Since BOSC 2008, Biopython 1.49 has been released. This was an
important milestone in bringing support for Python 2.6, and in terms
of our dependence on Numerical Python as we made the transition from
the obsolete Numeric library to NumPy. Biopython 1.49 also added more
biological methods to our core sequence object.
April 2009 will see the release of Biopython 1.50 (at the time of
writing, a beta has already been released). Some of the new features
include:
1. GenomeDiagram by Leighton Pritchard has been integrated into
Biopython as the Bio.Graphics.GenomeDiagram module.
2. A new module Bio.Motif has been added, which is intended to replace
the existing Bio.AlignAce and Bio.MEME modules.
3. Bio.SeqIO can now read and write FASTQ and QUAL files used in
second generation sequencing work.
Biopython will celebrate its 10th Birthday later this year, we will
present a brief history of the project and current work. This
includes the evaluation of git (and github) as a possible distributed
version control system (DVCS) to replace our existing very stable CVS
server hosted by the Open Bioinformatics Foundation, which we hope
will encourage more participation in the project.
--
Thanks,
Peter
From biopython at maubp.freeserve.co.uk Mon Apr 13 08:16:10 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 13 Apr 2009 13:16:10 +0100
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <830379.9837.qm@web62402.mail.re1.yahoo.com>
References: <20090408124908.GN43636@sobchak.mgh.harvard.edu>
<830379.9837.qm@web62402.mail.re1.yahoo.com>
Message-ID: <320fb6e00904130516x25bf24d8ib76ec36171e93fd8@mail.gmail.com>
On Sat, Apr 11, 2009 at 12:29 PM, Michiel de Hoon wrote:
>
> Hi Brad,
>
> Thanks for the examples; that clarified it a lot.
I haven't tried the code yet, but I have a GFF file I need to convert
into FASTA format. Hopefully later this week I'll get to that...
There are a few things I can ask now through:
Why are the functions _gff_line_map() and _gff_line_reduce() private
(leading underscores)? I had thought you wanted to make the
map/reduce approach available to people trying to parse GFF files on
multiple threads (e.g. using disco) which would require them to use
these two functions, wouldn't it? If so, they should be part of the
public API.
I don't see any support for the optional FASTA block in a GFF file.
Is this something you intend to add later Brad? See also my thoughts
below for Bio.SeqIO integration.
> I have a couple of suggestions of how to make the GFF parser more generally usable, and more consistent with other parsers in Biopython.
> Looking at your first example:
>
>> from BCBio.GFF.GFFParser import GFFAddingIterator
>>
>> gff_iterator = GFFAddingIterator()
>> rec_dict = gff_iterator.get_all_features(gff_file)
>>
>> The returned dictionary is like a dictionary from
>> SeqIO.to_dict;
>> keys are ids and values are SeqRecords.
>
> It's not clear to me why we need an iterator for GFF files. Can't we just use Python's line iterator instead? I would expect code like this:
>
> from Bio import GFF
> handle = open("my_gff_file.gff")
> for line in handle:
> ? ?# call the appropriate GFF function on the line
I think the appropriate GFF function here might be Brad's
_gff_line_map(). This knows about different GFF line types (e.g. ##
header lines). I'm not sure if a line based approach like this can
cope with the optional ##FASTA block through.
> The second point is about GFFAddingIterator.get_all_features. If this
> is essentially analogous to SeqIO.to_dict, how about calling it GFF.to_dict?
> Then the code looks as follows:
>
> from Bio import GFF
> handle = open("my_gff_file.gff")
> rec_dict = GFF.to_dict(handle)
Well, the Bio.SeqIO.to_dict() function takes a SeqRecord list/iterator
rather than a handle, but that might make sense here.
> Another thing to consider is that IDs in the GFF file do not need to be unique.
> For example, consider a GFF file that stores genome mapping locations for
> short sequences stored in a Fasta file. Since each sequence can have more
> than one mapping location, we can have multiple lines in the GFF file for one
> sequence ID.
That sounds nasty. Do you have any example files of this we could use
for a test case?
> The last point is about storing SeqRecords in rec_dict. A GFF file typically
> does not store sequences; if it does, it's not clear which field in the GFF file
> does. On the other hand, a SeqRecord often does not contain the
> chromosomal location, which is what the GFF file stores. So why use a
> SeqRecord for GFF information?
I don't think the GFF parser should only return SeqRecord object, but
I do see a use for this (via Bio.SeqIO). GFF files could be
represented as a list of SeqFeature objects, and using a SeqRecord to
hold this seems very natural to me. It also means we could use
Bio.SeqIO to load a GFF file into SeqRecord objects for storage in a
BioSQL database.
If you look at the NCBI FTP site, they often provide genome sequences
in a range of file formats including GenBank and GFF.
e.g.
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/
The GenBank files contain the features plus the sequence,
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.gbk
Their GFF3 file only contains the features:
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.gff
Some GFF files will include the sequence too, in this case we can
fetch it in FASTA format:
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.fna
In principle, you could parse this FASTA file and the GFF3 file and
put together a GenBank file - or vice versa.
As an aside, I would also consider adding protein table support on the
same lines, look at this file:
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.ptt
The header information gives us the genome size, so Bio.SeqIO could
return a SeqRecord with lots of SeqFeature objects and for the
SeqRecord's seq property use a Bio.Seq.UnknownSeq of length 4639675bp.
This is something I might look at implementing myself after Biopython
1.50 is out. We should be able to read in a GenBank file and output a
PTT file, and verify it matches the NCBI provided version of the PTT
file.
Similarly, I would want Bio.SeqIO to be able to parse a GFF3 file, and
give me a SeqRecord with lots of SeqFeature objects. If the sequence
is present in the file, it should use that (not the case for these
NCBI GFF3 files). Otherwise, we wouldn't necessarily know the actual
sequence length which we'd need to use the new Bio.Seq.UnknownSeq
object. However, we can infer from the maximum feature coordinates a
minimum sequence length. For these NCBI GFF3 files, as there is a
source feature this does actually give use the genome length, so this
should work very nicely.
Peter
From chapmanb at 50mail.com Mon Apr 13 08:32:19 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 13 Apr 2009 08:32:19 -0400
Subject: [Biopython-dev] Bio.EMBOSS wrappers
In-Reply-To: <320fb6e00904101112h544aba7fla94029f32bbdb01d@mail.gmail.com>
References: <320fb6e00904101112h544aba7fla94029f32bbdb01d@mail.gmail.com>
Message-ID: <20090413123219.GB5429@sobchak.mgh.harvard.edu>
Hi Peter;
The tests from EMBOSS look great; thanks for putting this together.
> For now I build the command line strings for seqret and needle "by
> hand", as Bio.EMBOSS doesn't have wrappers for them yet. I also note
> that the existing wrappers in Bio.EMBOSS don't support the very handy
> -auto and -filter command line arguments supported by all (or at least
> most) of the EMBOSS command line tools. Using -auto turns off any
> user prompting for missing arguments (very important for calling from
> a script). Using -filter is useful for running the tools with pipes
> (i.e. no output file is required as stdout can be used instead, and
> potentially no input file if we write to stdin correctly).
>
> Brad, seeing as Bio.EMBOSS is "your baby" how do you feel about adding
> these features? The needle wrapper would make an excellent basis for
> a new water wrapper. For adding -auto and -filter support, there is
> probably a clever approach with a common EMBOSS specific subclass of
> Bio.Application.AbstractCommandline, but I haven't tried.
Definitely go for it. My approach on this has mostly been to add
command lines as they are requested, or if I need them for something
I am doing. Not ideal.
Having a subclass with -auto and -filter is a really good idea;
unfortunately nothing clever is designed into the command line builders
right now. Feel free to add away.
Brad
From chapmanb at 50mail.com Mon Apr 13 08:52:55 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 13 Apr 2009 08:52:55 -0400
Subject: [Biopython-dev] Tutorial & Cookbook
In-Reply-To: <93403.18413.qm@web62406.mail.re1.yahoo.com>
References: <320fb6e00904100746q2478551dna253f1aa5acbcdb4@mail.gmail.com>
<93403.18413.qm@web62406.mail.re1.yahoo.com>
Message-ID: <20090413125255.GC5429@sobchak.mgh.harvard.edu>
Hi all;
> > I have been wondering if the "Biopython Tutorial &
> > Cookbook" should be separated now - it is getting
> > a bit long (which in some ways is a good thing!).
>
> In my opinion, it doesn't matter if the "Biopython Tutorial &
> Cookbook" is long. I guess that few people actually print this
> document anyway.
>
> I am in favor of having one "official" documentation for Biopython.
> If we have one Tutorial and one Cookbook, we'll have lots of overlap
> between the two, it'll be unclear what should be in the Tutorial
> and what in the Cookbook, and we'll have to make sure the two are
> consistent.
I am for whatever is easiest to maintain. Being long isn't a problem
as people can just skip to whatever they need; reading things online
will be increasingly common.
Agreed with Michiel that minimizing overlap is key. It's the same as
maintaining code; if you have the same thing in multiple places it
is more likely to get out of sync and be confusing. There is a
pretty clear distinction between tutorial documentation and cookbook
examples, so...
> A cookbook on the Wiki could be helpful though, and since the Wiki
> pages can be fixed easily we won't have to worry so much about
> inconsistencies with the official documentation.
[...]
> +1 for the wiki, -1 for another HTML/PDF document.
Same vote for me. I am responsible for the LaTeX file, but if I were
starting it today would do things entirely on the web. The barrier
to contributing is much lower.
> > On the other hand, it would be very good if all our cookbook use cases
> > could be rolled into the unit test framework - which wouldn't be so
> > easy if they live on the wiki. Something based on doctests might work...
This is a good idea; broken examples in documentation are definitely
annoying. If we enforce a common format for cookbook items, then we
could scrape the wiki pages, extract the python code and run it as
part of the tests. The python cookbook could serve as some
inspiration:
http://code.activestate.com/recipes/langs/python/
Brad
From biopython at maubp.freeserve.co.uk Mon Apr 13 08:53:18 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 13 Apr 2009 13:53:18 +0100
Subject: [Biopython-dev] Bio.EMBOSS wrappers
In-Reply-To: <20090413123219.GB5429@sobchak.mgh.harvard.edu>
References: <320fb6e00904101112h544aba7fla94029f32bbdb01d@mail.gmail.com>
<20090413123219.GB5429@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00904130553n466972enf565ad3bed861b0e@mail.gmail.com>
On Mon, Apr 13, 2009 at 1:32 PM, Brad Chapman wrote:
>> Brad, seeing as Bio.EMBOSS is "your baby" how do you feel about adding
>> these features? ?The needle wrapper would make an excellent basis for
>> a new water wrapper. ?For adding -auto and -filter support, there is
>> probably a clever approach with a common EMBOSS specific subclass of
>> Bio.Application.AbstractCommandline, but I haven't tried.
>
> Definitely go for it. My approach on this has mostly been to add
> command lines as they are requested, or if I need them for something
> I am doing. Not ideal.
>
> Having a subclass with -auto and -filter is a really good idea;
> unfortunately nothing clever is designed into the command line builders
> right now. Feel free to add away.
I need to work on my delegation skills - that seems to have back fired ;)
Regarding adding -auto support, I have a question about the needle
wrapper and the gap parameters. Using the needle tool at the command
line will prompt for the gap parameters UNLESS the -auto argument has
been used. i.e. Without -auto, it makes sense to insist on the gap
parameters being included, which is what the current wrapper does.
However, if we add support for -auto, then these parameters can be
optional. We could handle this in the wrapper, but it would be messy
(and there may be similar questions with other EMBOSS tools). What do
you think - stick with the simple option of insisting the Biopython
user set the gap parameters, even if they are using -auto?
Peter
From biopython at maubp.freeserve.co.uk Mon Apr 13 09:16:51 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 13 Apr 2009 14:16:51 +0100
Subject: [Biopython-dev] Tutorial & Cookbook
In-Reply-To: <20090413125255.GC5429@sobchak.mgh.harvard.edu>
References: <320fb6e00904100746q2478551dna253f1aa5acbcdb4@mail.gmail.com>
<93403.18413.qm@web62406.mail.re1.yahoo.com>
<20090413125255.GC5429@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00904130616t2cd5f029i582f4a2488d1182c@mail.gmail.com>
Brad wrote:
>Michiel wrote:
>> A cookbook on the Wiki could be helpful though, and since the Wiki
>> pages can be fixed easily we won't have to worry so much about
>> inconsistencies with the official documentation.
>> [...]
>> +1 for the wiki, -1 for another HTML/PDF document.
>
> Same vote for me. I am responsible for the LaTeX file, but if I were
> starting it today would do things entirely on the web. The barrier
> to contributing is much lower.
One of the nice things about the current PDF (and HTML) file is we can
ship it with each release, meaning it can be used while offline. Also
it means we don't have to worry too much about having our online
documentation deal with older versions of Biopython.
But you are right that LaTeX is a slight barrier to contributing -
although it wasn't an issue for me personally as I learnt LaTeX during
my Maths/Physics undergraduate degree. In anycase, I've previously
said that if people have additions for the tutorial, I'll take plain
text and do the mark up for them.
>> > On the other hand, it would be very good if all our cookbook use cases
>> > could be rolled into the unit test framework - which wouldn't be so
>> > easy if they live on the wiki. ?Something based on doctests might work...
>
> This is a good idea; broken examples in documentation are definitely
> annoying. If we enforce a common format for cookbook items, then we
> could scrape the wiki pages, extract the python code and run it as
> part of the tests.
That sounds possible - we might be able to scrape the wiki page,
reformat it and feed it into doctests... although testing graphical
output will still be a problem.
Speaking of doctests, we should do more of those in our docstrings.
For our online API documentation at
http://biopython.org/DIST/docs/api/ it would be nice to have the
python examples within the docstrings (including the doctests) shown
with syntax colouring. See
http://epydoc.sourceforge.net/manual-epytext.html#doctest-blocks for
an example, and compare this to
http://biopython.org/DIST/docs/api/Bio.Seq-module.html - maybe we need
to adjust our indentation?
Peter
From biopython at maubp.freeserve.co.uk Mon Apr 13 09:33:03 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 13 Apr 2009 14:33:03 +0100
Subject: [Biopython-dev] BOSC 2009
In-Reply-To: <320fb6e00904130344l2dfbeb89je62cbe8410181fe2@mail.gmail.com>
References: <320fb6e00904130344l2dfbeb89je62cbe8410181fe2@mail.gmail.com>
Message-ID: <320fb6e00904130633k68fe32bdj3c0419afc5ada71a@mail.gmail.com>
On Mon, Apr 13, 2009 at 11:44 AM, Peter wrote:
> Hello Biopythoneers,
>
> Those of you following the dev-mailing list or the OBF news feed will
> know that talk abstracts for BOSC 2009 are due in today, see
> http://www.open-bio.org/wiki/BOSC_2009
> I should to be able to attend and present the Biopython Project
> Update, and a few other Biopython developers may also be
> around too, so some sort of hackathon is in the air.
>
> It is a bit unfortunate the deadline was scheduled on the Easter
> break, as I'm sure quite a few of you will be on holiday, but here
> is an outline abstract. ?If anyone has comments, please let me
> know (on the list or directly) in the next couple of hours...
That's been submitted now, although I can still make revisions at the
moment if anyone spots something worth adding/fixing. I did remember
to add the website and license information as BOSC request on their
instructions.
Peter
From chapmanb at 50mail.com Mon Apr 13 09:35:39 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 13 Apr 2009 09:35:39 -0400
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <320fb6e00904130516x25bf24d8ib76ec36171e93fd8@mail.gmail.com>
References: <20090408124908.GN43636@sobchak.mgh.harvard.edu>
<830379.9837.qm@web62402.mail.re1.yahoo.com>
<320fb6e00904130516x25bf24d8ib76ec36171e93fd8@mail.gmail.com>
Message-ID: <20090413133539.GD5429@sobchak.mgh.harvard.edu>
Michiel and Peter;
Thanks for your comments on this. I'm definitely open to modifying
the interface and am happy to y'all giving feedback.
In reading through your comments, there is a bit of a disconnect
between what you are expecting the parser to do and how it is
designed right now. You both are thinking of the GFF parser as a
line oriented parser that emits an object, like a SeqFeature, for
each line in the file. This one way to do it, but the downsides are:
- Many features, like coding regions, are actually represented over
multiple lines.
- As Michiel pointed out, almost all files have many replicating
IDs (the first column). Ideally you want all of these features
consolidated to a single SeqRecord.
So the parser now takes a higher level view and assumes that the
user will want those two things done for them. So it is designed as
an "adder," that puts features onto SeqRecord objects. A normal
use case would be:
- Use SeqIO to parse a FASTA file with the sequences => SeqRecords
- Use the GFFParser to add features from a separate GFF file to the
SeqRecords. These are SeqFeatures, added to the right records and
nested in a parent/child relationship as appropriate.
Ideally you would parse the entire GFF file and do all this feature
adding at once. For big files this fails due to memory issues, which
is why the filtering and iterating features were introduced.
Okay, so that is the top level view. I will try to hit some of the
specifics:
> Why are the functions _gff_line_map() and _gff_line_reduce() private
> (leading underscores)? I had thought you wanted to make the
> map/reduce approach available to people trying to parse GFF files on
> multiple threads (e.g. using disco) which would require them to use
> these two functions, wouldn't it? If so, they should be part of the
> public API.
I don't think a standard user would want to deal with these
directly. They just parse lines into their components and build an
intermediate dictionary object. To parallelize the job, the
GFFMapReduceFeatureAdder class has a 'disco_host' parameter which
then runs the job in parallel.
> I don't see any support for the optional FASTA block in a GFF file.
> Is this something you intend to add later Brad? See also my thoughts
> below for Bio.SeqIO integration.
I haven't added anything for parsing header and footer directives but
it is on the to do list and I have a good idea how to handle them. Definitely
pass along a file that uses these you want to parse and we can work on it.
> > I have a couple of suggestions of how to make the GFF parser more
> > generally usable, and more consistent with other parsers in Biopython.
[...]
> > It's not clear to me why we need an iterator for GFF files. Can't we
> > just use Python's line iterator instead? I would expect code like this:
> >
> > from Bio import GFF
> > handle = open("my_gff_file.gff")
> > for line in handle:
> > ? ?# call the appropriate GFF function on the line
Right, so this was tackled in the top level overview above. Michiel,
does the design make more sense now?
> > The second point is about GFFAddingIterator.get_all_features. If this
> > is essentially analogous to SeqIO.to_dict, how about calling it GFF.to_dict?
> > Then the code looks as follows:
> >
> > from Bio import GFF
> > handle = open("my_gff_file.gff")
> > rec_dict = GFF.to_dict(handle)
Yes, except in the more common cases you are adding to a dictionary
of records as opposed to generating one from scratch. My thought was
that copying the SeqIO behavior made it more confusing because it
doesn't do quite the same thing. After my explanation, what are your
thoughts?
> > Another thing to consider is that IDs in the GFF file do not need to be unique.
> > For example, consider a GFF file that stores genome mapping locations for
> > short sequences stored in a Fasta file. Since each sequence can have more
> > than one mapping location, we can have multiple lines in the GFF file for one
> > sequence ID.
Yes, this goes back to my explanation above and is why the
parser works differently than the standard SeqIO parsers. GFF ends
up being a different beast. I think it makes sense to copy useful
patterns we have already, but don't want to confuse users with close
by not the same functionality.
> > The last point is about storing SeqRecords in rec_dict. A GFF file typically
> > does not store sequences; if it does, it's not clear which field in the GFF file
> > does. On the other hand, a SeqRecord often does not contain the
> > chromosomal location, which is what the GFF file stores. So why use a
> > SeqRecord for GFF information?
Hopefully the SeqRecords make more sense now. What it is really doing is
adding SeqFeatures to SeqRecords. When the user doesn't provide one,
it creates an empty SeqRecord with the appropriate ID to use and
adds SeqFeatures to it.
> If you look at the NCBI FTP site, they often provide genome sequences
> in a range of file formats including GenBank and GFF.
[...]
> Their GFF3 file only contains the features:
> ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.gff
>
> Some GFF files will include the sequence too, in this case we can
> fetch it in FASTA format:
> ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.fna
Right on. So you would first parse the Fasta file with the SeqIO
parser to_dict functionality, and then feed this dictionary to the
GFF parser to add the features.
> In principle, you could parse this FASTA file and the GFF3 file and
> put together a GenBank file - or vice versa.
Yes.
> Similarly, I would want Bio.SeqIO to be able to parse a GFF3 file, and
> give me a SeqRecord with lots of SeqFeature objects. If the sequence
> is present in the file, it should use that (not the case for these
> NCBI GFF3 files). Otherwise, we wouldn't necessarily know the actual
> sequence length which we'd need to use the new Bio.Seq.UnknownSeq
> object. However, we can infer from the maximum feature coordinates a
> minimum sequence length. For these NCBI GFF3 files, as there is a
> source feature this does actually give use the genome length, so this
> should work very nicely.
Using UnknownSeq is a good idea, and I will do.
Whew. Michiel and Peter -- hopefully the high level intentions are a
bit more clear. Thanks for your input so far; let's hash this out so
it makes sense to everyone.
Brad
From chapmanb at 50mail.com Mon Apr 13 09:44:29 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 13 Apr 2009 09:44:29 -0400
Subject: [Biopython-dev] Bio.EMBOSS wrappers
In-Reply-To: <320fb6e00904130553n466972enf565ad3bed861b0e@mail.gmail.com>
References: <320fb6e00904101112h544aba7fla94029f32bbdb01d@mail.gmail.com>
<20090413123219.GB5429@sobchak.mgh.harvard.edu>
<320fb6e00904130553n466972enf565ad3bed861b0e@mail.gmail.com>
Message-ID: <20090413134429.GE5429@sobchak.mgh.harvard.edu>
Hi Peter;
> >> Brad, seeing as Bio.EMBOSS is "your baby" how do you feel about adding
> >> these features? ?The needle wrapper would make an excellent basis for
> >> a new water wrapper. ?For adding -auto and -filter support, there is
> >> probably a clever approach with a common EMBOSS specific subclass of
> >> Bio.Application.AbstractCommandline, but I haven't tried.
> >
> > Definitely go for it. My approach on this has mostly been to add
> > command lines as they are requested, or if I need them for something
> > I am doing. Not ideal.
> >
> > Having a subclass with -auto and -filter is a really good idea;
> > unfortunately nothing clever is designed into the command line builders
> > right now. Feel free to add away.
>
> I need to work on my delegation skills - that seems to have back fired ;)
Oops. I honestly read that as "do I have your permission?" I can of
course tackle this, but am a bit underwater now.
> Regarding adding -auto support, I have a question about the needle
> wrapper and the gap parameters. Using the needle tool at the command
> line will prompt for the gap parameters UNLESS the -auto argument has
> been used. i.e. Without -auto, it makes sense to insist on the gap
> parameters being included, which is what the current wrapper does.
> However, if we add support for -auto, then these parameters can be
> optional. We could handle this in the wrapper, but it would be messy
> (and there may be similar questions with other EMBOSS tools). What do
> you think - stick with the simple option of insisting the Biopython
> user set the gap parameters, even if they are using -auto?
I think we should stick with the simple option. These were meant to
be pretty dumb specifiers that help users write more modular code than
simply pasting in a raw string for the command line. Trying to get
too fancy is probably overkill.
Brad
From biopython at maubp.freeserve.co.uk Mon Apr 13 09:49:56 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 13 Apr 2009 14:49:56 +0100
Subject: [Biopython-dev] Bio.EMBOSS wrappers
In-Reply-To: <20090413134429.GE5429@sobchak.mgh.harvard.edu>
References: <320fb6e00904101112h544aba7fla94029f32bbdb01d@mail.gmail.com>
<20090413123219.GB5429@sobchak.mgh.harvard.edu>
<320fb6e00904130553n466972enf565ad3bed861b0e@mail.gmail.com>
<20090413134429.GE5429@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00904130649v699b16d1seea0c51172b9ab71@mail.gmail.com>
On Mon, Apr 13, 2009 at 2:44 PM, Brad Chapman wrote:
>> > ... Feel free to add away.
>>
>> I need to work on my delegation skills - that seems to have back fired ;)
>
> Oops. I honestly read that as "do I have your permission?" I can of
> course tackle this, but am a bit underwater now.
Looking back, I was a bit ambiguous. I don't mind who does it - let's
see who has time free first.
>> Regarding adding -auto support, I have a question about the needle
>> wrapper and the gap parameters. ?Using the needle tool at the command
>> line will prompt for the gap parameters UNLESS the -auto argument has
>> been used. ?i.e. Without -auto, it makes sense to insist on the gap
>> parameters being included, which is what the current wrapper does.
>> However, if we add support for -auto, then these parameters can be
>> optional. ?We could handle this in the wrapper, but it would be messy
>> (and there may be similar questions with other EMBOSS tools). ?What do
>> you think - stick with the simple option of insisting the Biopython
>> user set the gap parameters, even if they are using -auto?
>
> I think we should stick with the simple option. These were meant to
> be pretty dumb specifiers that help users write more modular code than
> simply pasting in a raw string for the command line. Trying to get
> too fancy is probably overkill.
Agreed.
Peter
From biopython at maubp.freeserve.co.uk Mon Apr 13 10:19:54 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 13 Apr 2009 15:19:54 +0100
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <20090413133539.GD5429@sobchak.mgh.harvard.edu>
References: <20090408124908.GN43636@sobchak.mgh.harvard.edu>
<830379.9837.qm@web62402.mail.re1.yahoo.com>
<320fb6e00904130516x25bf24d8ib76ec36171e93fd8@mail.gmail.com>
<20090413133539.GD5429@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00904130719l5b31f09eo39e0f867fbd6749@mail.gmail.com>
> Okay, so that is the top level view. I will try to hit some of the
> specifics:
>
>> Why are the functions _gff_line_map() and _gff_line_reduce() private
>> (leading underscores)? ?I had thought you wanted to make the
>> map/reduce approach available to people trying to parse GFF files on
>> multiple threads (e.g. using disco) which would require them to use
>> these two functions, wouldn't it? ?If so, they should be part of the
>> public API.
>
> I don't think a standard user would want to deal with these
> directly. They just parse lines into their components and build an
> intermediate dictionary object. To parallelize the job, the
> GFFMapReduceFeatureAdder class has a 'disco_host' parameter which
> then runs the job in parallel.
Are you aware of any alternatives to disco for doing map/reduce on
Python, and does that impact your design choices?
>> I don't see any support for the optional FASTA block in a GFF file.
>> Is this something you intend to add later Brad? ?See also my thoughts
>> below for Bio.SeqIO integration.
>
> I haven't added anything for parsing header and footer directives but
> it is on the to do list and I have a good idea how to handle them. Definitely
> pass along a file that uses these you want to parse and we can work on it.
There are some partial examples here:
http://www.sequenceontology.org/gff3.shtml
We should have a peep at BioPerl's unit tests and/or ask Lincoln directly.
>> > I have a couple of suggestions of how to make the GFF parser more
>> > generally usable, and more consistent with other parsers in Biopython.
> [...]
>> > It's not clear to me why we need an iterator for GFF files. Can't we
>> > just use Python's line iterator instead? I would expect code like this:
>> >
>> > from Bio import GFF
>> > handle = open("my_gff_file.gff")
>> > for line in handle:
>> > ? ?# call the appropriate GFF function on the line
>
> Right, so this was tackled in the top level overview above. Michiel,
> does the design make more sense now?
>
>> > The second point is about GFFAddingIterator.get_all_features. If this
>> > is essentially analogous to SeqIO.to_dict, how about calling it GFF.to_dict?
>> > Then the code looks as follows:
>> >
>> > from Bio import GFF
>> > handle = open("my_gff_file.gff")
>> > rec_dict = GFF.to_dict(handle)
>
> Yes, except in the more common cases you are adding to a dictionary
> of records as opposed to generating one from scratch. My thought was
> that copying the SeqIO behavior made it more confusing because it
> doesn't do quite the same thing. After my explanation, what are your
> thoughts?
Maybe there is a role for a to_dict() function for when you start from
scratch, but as you say, it does sound like there is a general need to
add to an existing dict.
>> > Another thing to consider is that IDs in the GFF file do not need to be unique.
>> > For example, consider a GFF file that stores genome mapping locations for
>> > short sequences stored in a Fasta file. Since each sequence can have more
>> > than one mapping location, we can have multiple lines in the GFF file for one
>> > sequence ID.
>
> Yes, this goes back to my explanation above and is why the
> parser works differently than the standard SeqIO parsers. GFF ends
> up being a different beast. I think it makes sense to copy useful
> patterns we have already, but don't want to confuse users with close
> by not the same functionality.
>
>> > The last point is about storing SeqRecords in rec_dict. A GFF file typically
>> > does not store sequences; if it does, it's not clear which field in the GFF file
>> > does. On the other hand, a SeqRecord often does not contain the
>> > chromosomal location, which is what the GFF file stores. So why use a
>> > SeqRecord for GFF information?
>
> Hopefully the SeqRecords make more sense now. What it is really doing is
> adding SeqFeatures to SeqRecords. When the user doesn't provide one,
> it creates an empty SeqRecord with the appropriate ID to use and
> adds SeqFeatures to it.
>
>> If you look at the NCBI FTP site, they often provide genome sequences
>> in a range of file formats including GenBank and GFF.
>> [...]
>> Their GFF3 file only contains the features:
>> ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.gff
>>
>> Some GFF files will include the sequence too, in this case we can
>> fetch it in FASTA format:
>> ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.fna
>
> Right on. So you would first parse the Fasta file with the SeqIO
> parser to_dict functionality, and then feed this dictionary to the
> GFF parser to add the features.
Hmm. I'm with you on the idea that you may need to parse a GFF file
and a separate second file to get the actual sequence (e.g. a FASTA
file), but there is more than one way to combine the two. For a
single sequence, I was thinking more along the lines of:
from Bio import SeqIO
record = SeqIO.read(open("NC_000913.fna"),"fasta")
record.features = SeqIO.read(open("NC_000913.gff"),"gff3").features
Or, depending on what other annotation you can extract, perhaps the
other way round would be best:
from Bio import SeqIO
record = SeqIO.read(open("NC_000913.gff"),"gff3")
record.seq = SeqIO.read(open("NC_000913.fna"),"fasta").seq
The above is pretty trivial I think, as long as we include examples of
this in our documentation. This kind of manipulation is also file
format neutral - it would work equally well with a FASTA file and a
PTT file (assuming we add parsing NCBI protein tables to Bio.SeqIO as
outlined in my earlier email). Or for another example, perhaps an
annotated GenBank file without the sequence (e.g. just a CONTIG
assembly line) plus a FASTA file for the full nucleotide sequence.
If the FASTA and GFF file apply to multiple sequences (e.g. a set of
contigs, rather than a single chromosome), and you have enough memory,
then something using dictionaries should work:
from Bio import SeqIO
records = SeqIO.to_dict(SeqIO.read(open("NC_000913.fna"),"fasta"))
for temp_rec in SeqIO.parse(open("NC_000913.gff"),"gff3") :
records[temp_rec.id].features = temp_rec.features
or,
from Bio import SeqIO
records = SeqIO.to_dict(SeqIO.read(open("NC_000913.gff"),"gff3"))
for temp_rec in SeqIO.parse(open("NC_000913.fna"),"fasta") :
records[temp_rec.id].seq = temp_rec.seq
(You may need to massage the keys to match up, I'm assuming here that
isn't required).
i.e. It can all be done from Bio.SeqIO without needing to dive into
Bio.GFF unless you need to do something special (e.g. filtering the
features).
>> In principle, you could parse this FASTA file and the GFF3 file and
>> put together a GenBank file - or vice versa.
>
> Yes.
>
>> Similarly, I would want Bio.SeqIO to be able to parse a GFF3 file, and
>> give me a SeqRecord with lots of SeqFeature objects. ?If the sequence
>> is present in the file, it should use that (not the case for these
>> NCBI GFF3 files). ?Otherwise, we wouldn't necessarily know the actual
>> sequence length which we'd need to use the new Bio.Seq.UnknownSeq
>> object. ?However, we can infer from the maximum feature coordinates a
>> minimum sequence length. ?For these NCBI GFF3 files, as there is a
>> source feature this does actually give use the genome length, so this
>> should work very nicely.
>
> Using UnknownSeq is a good idea, and I will do.
Great.
> Whew. Michiel and Peter -- hopefully the high level intentions are a
> bit more clear. Thanks for your input so far; let's hash this out so
> it makes sense to everyone.
Good plan :)
As you can probably tell, I am concentrating on getting this to match
up well with the Bio.SeqIO framework. It will be nice to know the
underlying Bio.GFF module has more options, but I expect most people
to start with reading in a GFF file using Bio.SeqIO, and being able to
transfer their existing knowledge of SeqFeature objects learnt from
using Bio.SeqIO to read in GenBank files.
Peter
From jflatow at gmail.com Mon Apr 13 10:41:56 2009
From: jflatow at gmail.com (Jared Flatow)
Date: Mon, 13 Apr 2009 09:41:56 -0500
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <320fb6e00904130719l5b31f09eo39e0f867fbd6749@mail.gmail.com>
References: <20090408124908.GN43636@sobchak.mgh.harvard.edu>
<830379.9837.qm@web62402.mail.re1.yahoo.com>
<320fb6e00904130516x25bf24d8ib76ec36171e93fd8@mail.gmail.com>
<20090413133539.GD5429@sobchak.mgh.harvard.edu>
<320fb6e00904130719l5b31f09eo39e0f867fbd6749@mail.gmail.com>
Message-ID: <3050CC48-7365-4746-B30C-F56C2ACAA2F8@gmail.com>
FYI:
On Apr 13, 2009, at 9:19 AM, Peter wrote:
> Are you aware of any alternatives to disco for doing map/reduce on
> Python, and does that impact your design choices?
You can use Python map/reduce functions with Hadoop via the Streaming
contrib package included with Hadoop.
An overview: http://docs.google.com/Presentation?id=dgr666gg_31cd4n7qdz
Here is an input reader/record reader for FASTA: http://gist.github.com/45551
jared
From bugzilla-daemon at portal.open-bio.org Mon Apr 13 11:41:29 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 13 Apr 2009 11:41:29 -0400
Subject: [Biopython-dev] [Bug 2601] Seq find() method: proposal
In-Reply-To:
Message-ID: <200904131541.n3DFfTGN022460@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2601
------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-13 11:41 EST -------
See also Bug 2809, for the much narrower option of adding string-like
startswith and endswith methods to the Seq object (which as proposed would not
deal with ambiguity characters).
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From biopython at maubp.freeserve.co.uk Mon Apr 13 13:55:53 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 13 Apr 2009 18:55:53 +0100
Subject: [Biopython-dev] EFetch returning ASN.1 not genbank format
Message-ID: <320fb6e00904131055r24ac6e42p1da55cfcf173e@mail.gmail.com>
Hi all,
At then end of last week I found test_SeqIO_online.py was failing and
traced this to a change in Entrez EFetch. EFetch is documented here:
http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html
The issue is with EFetch and the undocumented rettype=genbank argument
which we currently use in our documentation and unit tests. This
isn't an "official" argument in that it isn't listed on their website,
but until recently it returned plain text GenBank files, acting like
the official rettype=gb or gp arguments. However, as of the end of
last week, EFtech returns the default format instead (ASN.1), causing
test_SeqIO_online.py to fail and rendering some of our examples
misleading.
I emailed the NCBI and received a very prompt reply,
> Dear Colleague,
>
>?As the e-Utils continue to be refined our developers sometimes
> address one-off issues, and this was one of them. The 'official'
> parameter for GenBank is rettype=gb. Now if the parameter is not
> correct you will default to ASN.1 in the nucleotide databases. We
> apologize for any inconvenience.
>
> Regards,
>
> Steve Pechous, Ph.D.
> NCBI User Services
I then emailed back (before Easter) to ask if they would reconsider
this change, and have just had a reply:
> Hi Peter,
>
> This will likely not reverse back as the true parameters are laid out
> in the help documents and are now required, so to speak.
>
> Regards,
>
> Steve Pechous, Ph.D.
> NCBI User Services
With hindsight we shouldn't have used rettype="genbank", but it did
seem to make things simpler for our documentation and I really hadn't
expected the NCBI to change this.
I think we have two options:
(1) Add a special case to Bio.Entrez.eftech to map rettype="genbank"
to rettype="gb" (or "gp" for the protein database). This is simple
and causes least disruption to Biopython uses, but is a bad idea in
the long run as it means we are effectively providing our own variant
of the Entrez API.
(2) Update our documentation and unit tests to use rettype="gb" or
"gp" instead of rettype="genbank", and add a special case to
Bio.Entrez.eftech to map rettype="genbank" to rettype="gb" (or "gp"
for the protein database) and issue a warning that the NCBI have
changed their API. At a later point we might change this warning to
an error. This would provide a clear transition for end user scripts,
and keep us consistent with the official Entrez API.
I favour option (2) here. Any other thoughts? Whatever we do should
happen before we release Biopython 1.50.
Peter
From biopython at maubp.freeserve.co.uk Mon Apr 13 14:06:25 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 13 Apr 2009 19:06:25 +0100
Subject: [Biopython-dev] Plan for Biopython 1.50 (final)
Message-ID: <320fb6e00904131106s70028e9el56d334fa732bddf8@mail.gmail.com>
On Tue, Mar 31, 2009 at 10:38 PM, Peter wrote:
> Hi all,
>
> OK guys, after a brief chat off the mailing list, I'm hoping to do the
> Biopython 1.50 beta release roughly this weekend, ...
>
> After the release of Biopython 1.50 beta, we'll reopen CVS again for
> small changes and documentation. ?While the beta is being tested by
> our user base, I'd like us to push to finish any missing documentation
> - in particular for new modules Bio.Motif (Bartek) and
> Bio.Graphics.GenomeDiagram (me and/or Leighton), plus the new
> SeqRecord slicing and UnknownSeq class (me).
That documentation still needs doing, and it would be nice to have it
with Biopython 1.50. If Bartek or Leighton expects to add anything in
the next few days, then I'd be happy to hold back the release for
that. I'll try and do the SeqRecord stuff myself shortly.
> Depending on the feedback from the beta, I'd hope we can do the final
> release of Biopython 1.50 well before the end of April, and then
> reopen CVS for new code.
There haven't been any problems with the beta reported, however there
is the issue of EFetch returning ASN.1 not genbank format (see my
earlier email) which I think we must resolve before Biopython 1.50 is
released.
Apart from these two points (documentation and EFetch), are there any
issues regarding doing the official release of Biopython 1.50? I
think we can aim for a release this week...
Peter
From lpritc at scri.ac.uk Tue Apr 14 04:29:14 2009
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Tue, 14 Apr 2009 09:29:14 +0100
Subject: [Biopython-dev] EFetch returning ASN.1 not genbank format
In-Reply-To: <320fb6e00904131055r24ac6e42p1da55cfcf173e@mail.gmail.com>
Message-ID:
On 13/04/2009 18:55, "Peter" wrote:
[...]
> I think we have two options:
>
> (1) Add a special case to Bio.Entrez.eftech to map rettype="genbank"
> to rettype="gb" (or "gp" for the protein database). This is simple
> and causes least disruption to Biopython uses, but is a bad idea in
> the long run as it means we are effectively providing our own variant
> of the Entrez API.
>
> (2) Update our documentation and unit tests to use rettype="gb" or
> "gp" instead of rettype="genbank", and add a special case to
> Bio.Entrez.eftech to map rettype="genbank" to rettype="gb" (or "gp"
> for the protein database) and issue a warning that the NCBI have
> changed their API. At a later point we might change this warning to
> an error. This would provide a clear transition for end user scripts,
> and keep us consistent with the official Entrez API.
>
> I favour option (2) here. Any other thoughts? Whatever we do should
> happen before we release Biopython 1.50.
Option (2). Option (1) risks cementing an argument into place in Biopython
that could potentially contradict future Entrez API usage.
L.
--
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405
______________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.
The Scottish Crop Research Institute is a charitable company limited by guarantee.
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.
DISCLAIMER:
This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that
addressee.
If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on
this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system.
Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any).
______________________________________________________
From mjldehoon at yahoo.com Tue Apr 14 04:33:48 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 14 Apr 2009 01:33:48 -0700 (PDT)
Subject: [Biopython-dev] EFetch returning ASN.1 not genbank format
In-Reply-To: <320fb6e00904131055r24ac6e42p1da55cfcf173e@mail.gmail.com>
Message-ID: <273080.33626.qm@web62408.mail.re1.yahoo.com>
I am also in favor of option (2).
--Michiel
> I think we have two options:
>
> (1) Add a special case to Bio.Entrez.eftech to map
> rettype="genbank"
> to rettype="gb" (or "gp" for the
> protein database). This is simple
> and causes least disruption to Biopython uses, but is a bad
> idea in
> the long run as it means we are effectively providing our
> own variant
> of the Entrez API.
>
> (2) Update our documentation and unit tests to use
> rettype="gb" or
> "gp" instead of rettype="genbank", and
> add a special case to
> Bio.Entrez.eftech to map rettype="genbank" to
> rettype="gb" (or "gp"
> for the protein database) and issue a warning that the NCBI
> have
> changed their API. At a later point we might change this
> warning to
> an error. This would provide a clear transition for end
> user scripts,
> and keep us consistent with the official Entrez API.
From bugzilla-daemon at portal.open-bio.org Tue Apr 14 04:51:56 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 14 Apr 2009 04:51:56 -0400
Subject: [Biopython-dev] [Bug 2811] New: EFetch returning ASN.1 not GenBank
format for rettype=genbank
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2811
Summary: EFetch returning ASN.1 not GenBank format for
rettype=genbank
Product: Biopython
Version: Not Applicable
Platform: All
OS/Version: All
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk
At the end of last week I found test_SeqIO_online.py was failing and
traced this to a change in Entrez EFetch. EFetch is documented here:
http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html
The issue is with EFetch and the undocumented rettype=genbank argument
which we currently use in our documentation and unit tests. This
isn't an "official" argument in that it isn't listed on their website,
but until recently it returned plain text GenBank files, acting like
the official rettype=gb or gp arguments. However, as of the end of
last week, EFtech returns the default format instead (ASN.1), causing
test_SeqIO_online.py to fail and rendering some of our examples
misleading.
I emailed the NCBI and received a very prompt reply,
> Dear Colleague,
>
> As the e-Utils continue to be refined our developers sometimes
> address one-off issues, and this was one of them. The 'official'
> parameter for GenBank is rettype=gb. Now if the parameter is not
> correct you will default to ASN.1 in the nucleotide databases. We
> apologize for any inconvenience.
>
> Regards,
>
> Steve Pechous, Ph.D.
> NCBI User Services
I then emailed back (before Easter) to ask if they would reconsider
this change, and have just had a reply:
> Hi Peter,
>
> This will likely not reverse back as the true parameters are laid out
> in the help documents and are now required, so to speak.
>
> Regards,
>
> Steve Pechous, Ph.D.
> NCBI User Services
With hindsight we shouldn't have used rettype="genbank", but it did
seem to make things simpler for our documentation and I really hadn't
expected the NCBI to change this.
After discussion on the mailing list, the plan is to update our documentation
and unit tests to use rettype="gb" or "gp" instead of rettype="genbank", and
add a special case to Bio.Entrez.eftech to map rettype="genbank" to
rettype="gb" (or "gp" for the protein database) and issue a warning that the
NCBI have changed their API. At a later point we might change this warning to
an error. This would provide a clear transition for end user scripts, and keep
us consistent with the official Entrez API.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From biopython at maubp.freeserve.co.uk Tue Apr 14 04:53:02 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 14 Apr 2009 09:53:02 +0100
Subject: [Biopython-dev] EFetch returning ASN.1 not genbank format
In-Reply-To: <273080.33626.qm@web62408.mail.re1.yahoo.com>
References: <320fb6e00904131055r24ac6e42p1da55cfcf173e@mail.gmail.com>
<273080.33626.qm@web62408.mail.re1.yahoo.com>
Message-ID: <320fb6e00904140153w4c659655q64f19540f7bd12b7@mail.gmail.com>
On Tue, Apr 14, 2009 at 9:33 AM, Michiel de Hoon wrote:
>
> I am also in favor of option (2).
>
> --Michiel
>
OK. Let's do that then. I've filed Bug 2811 for this issue,
http://bugzilla.open-bio.org/show_bug.cgi?id=2811
Peter
From bugzilla-daemon at portal.open-bio.org Tue Apr 14 05:54:23 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 14 Apr 2009 05:54:23 -0400
Subject: [Biopython-dev] [Bug 2811] EFetch returning ASN.1 not GenBank
format for rettype=genbank
In-Reply-To:
Message-ID: <200904140954.n3E9sND0024084@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2811
------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-14 05:54 EST -------
Tutorial updated, see Doc/Tutorial.tex revision 1.221
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From mjldehoon at yahoo.com Tue Apr 14 06:36:03 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 14 Apr 2009 03:36:03 -0700 (PDT)
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <20090413133539.GD5429@sobchak.mgh.harvard.edu>
Message-ID: <322143.67385.qm@web62403.mail.re1.yahoo.com>
--- On Mon, 4/13/09, Brad Chapman wrote:
> A normal use case would be:
>
> - Use SeqIO to parse a FASTA file with the sequences =>
> SeqRecords
> - Use the GFFParser to add features from a separate GFF
> file to the SeqRecords. These are SeqFeatures, added to
> the right records and nested in a parent/child relationship
> as appropriate.
Usually, when I use a GFF file I either don't have an associated Fasta file, or I am not particularly interested in the original sequences. So while this approach is useful for some people, in its current form it's not exactly generally usable.
First, let's discuss how to represent the information contained in a GFF file. SeqRecords are good if the GFF file is associated with a Fasta file (or contains the sequence itself), but if not it seems to be a bit awkward. How about the following (and I think Peter was hinting at the same idea):
The actual parser lives in Bio.GFF, and produces Bio.GFF.Record objects that closely resemble the GFF file structure. For example, we use the GFF specified fields ( [attributes] [comments]) as attributes to Bio.GFF.Record objects.
Bio.SeqIO then uses the parser in Bio.GFF, and puts its information in the appropriate fields of a SeqRecord. Here, we have to think about two cases: Simply creating a SeqRecord based on the GFF file, and adding the information in the GFF file as annotations to a pre-existing set of SeqRecords. (I am not sure if we need a separate function for that, or, as Peter suggested, let the user do that himself, guided by some examples in the documentation).
Users then have a choice to use Bio.SeqIO to get SeqRecords, or Bio.GFF to see the "raw" GFF data, depending on their needs.
How does that sound?
--Michiel
From biopython at maubp.freeserve.co.uk Tue Apr 14 07:04:39 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 14 Apr 2009 12:04:39 +0100
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <322143.67385.qm@web62403.mail.re1.yahoo.com>
References: <20090413133539.GD5429@sobchak.mgh.harvard.edu>
<322143.67385.qm@web62403.mail.re1.yahoo.com>
Message-ID: <320fb6e00904140404x35f87a00ude242e6c3c4c7971@mail.gmail.com>
On Tue, Apr 14, 2009 at 11:36 AM, Michiel de Hoon wrote:
>
> Usually, when I use a GFF file I either don't have an associated Fasta file,
> or I am not particularly interested in the original sequences. So while this
> approach is useful for some people, in its current form it's not exactly
> generally usable.
>
> First, let's discuss how to represent the information contained in a GFF
> file. SeqRecords are good if the GFF file is associated with a Fasta file
> (or contains the sequence itself), but if not it seems to be a bit awkward.
I think parsing a GFF file with Bio.SeqIO into SeqRecord object(s) can
still be useful even without the sequence. The list of SeqFeature
objects belonging to each SeqRecord can be used for example with
GenomeDiagram to draw a picture of the organism. Because you lack the
sequence, you won't be able to include GC% or GC skew, but it is nice
to visualize the annotation all the same. You could also do things
like looking for the ratio of genic and inter-genic usage, or hunt for
overlapping genes - although for these it may be easier to work with a
more low level representation.
> How about the following (and I think Peter was hinting at the same idea):
>
> The actual parser lives in Bio.GFF, and produces Bio.GFF.Record objects
> that closely resemble the GFF file structure. For example, we use the
> GFF specified fields (
> [attributes] [comments]) as attributes to
> Bio.GFF.Record objects.
That sounds possible to me - although I haven't given the basic
Bio.GFF.Record structure any thought, nor indeed have I examined what
data objects Brad is returning at the moment.
> Bio.SeqIO then uses the parser in Bio.GFF, and puts its information in the
> appropriate fields of a SeqRecord.
Yes - much like how Bio.SeqIO calls other modules like Bio.GenBank and
Bio.SwissProt now. However, regarding the implementation, I wouldn't
automatically insist the Bio.SeqIO GFF wrapper *has* to use a
Bio.GFF.Record internally (assuming we have such a thing) as that
could be a performance bottleneck. I guess it depends on how simple
the Bio.GFF.Record objects are.
> Here, we have to think about two cases:
> Simply creating a SeqRecord based on the GFF file, and adding the
> information in the GFF file as annotations to a pre-existing set of SeqRecords.
> (I am not sure if we need a separate function for that, or, as Peter suggested,
> let the user do that himself, guided by some examples in the documentation).
Simply creating SeqRecord objects from a GFF file is the standard
Bio.SeqIO approach. For combining data from a GFF file and a FASTA
file, this is rather like the FASTA+QUAL situation. Here we do
document (in the docstrings, not yet in the tutorial) how to use
Bio.SeqIO to read in two sets of SeqRecord objects and combine them,
but also provide a "paired file iterator" to do this for you. Right
now this function is in Bio.SeqIO.QualityIO, but I am open to moving
this and the low level bits to somewhere like Bio.Sequencing.Quality
instead (as long as we do this before Biopython 1.50 is released).
I have pondered a "paired file iterator" function for Bio.SeqIO for
dealing with FASTA+QUAL, FASTA+GFF, FASTA+PPT, etc, which would take
TWO file handles and return SeqRecord objects. Interestingly all the
examples thus far are FASTA+other. Anyway, this could be added later
if need be.
> Users then have a choice to use Bio.SeqIO to get SeqRecords, or Bio.GFF to see the "raw" GFF data, depending on their needs.
> How does that sound?
Pretty much what I had in mind - although as I said, I've not given
much thought to how to present the "raw" GFF data.
Peter
From bugzilla-daemon at portal.open-bio.org Tue Apr 14 08:05:07 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 14 Apr 2009 08:05:07 -0400
Subject: [Biopython-dev] [Bug 2811] EFetch returning ASN.1 not GenBank
format for rettype=genbank
In-Reply-To:
Message-ID: <200904141205.n3EC570L032323@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2811
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-14 08:05 EST -------
Bio/Entrez/__init__.py CVS revision 1.41
Tests/test_SeqIO_online.py CVS revision 1.7
DEPRECATED CVS revision 1.50
Marking as fixed (although a proof reading of the tutorial wouldn't hurt).
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Tue Apr 14 19:33:59 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 14 Apr 2009 19:33:59 -0400
Subject: [Biopython-dev] [Bug 2811] EFetch returning ASN.1 not GenBank
format for rettype=genbank
In-Reply-To:
Message-ID: <200904142333.n3ENXxFX018002@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2811
------- Comment #3 from sbassi at gmail.com 2009-04-14 19:33 EST -------
I saw in the online Tutorial this small typo:
"form Bio import SeqIO"
I and have a question regarding this bug: What about adding "gb" as format type
in SeqIO, and mapped to "genbank". This would add consistency (if I retrieve a
sequence using "gb" from Entrez, I expect to save it using SeqIO with "gb"). I
think it won't hurt to have "gb" as an alias for "genbank" in SeqIO.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From peter at maubp.freeserve.co.uk Tue Apr 14 19:34:02 2009
From: peter at maubp.freeserve.co.uk (Peter)
Date: Wed, 15 Apr 2009 00:34:02 +0100
Subject: [Biopython-dev] Bio.Motif breaks epydoc?
Message-ID: <320fb6e00904141634o3c6fd441sf040d7d4c45e8b02@mail.gmail.com>
Hi all,
I forgot to run epydoc when I did Biopython 1.50 beta, but I've just
tried and it is failing - apparently due to an issue with Bio.Motif.
First of all there are some warnings which we should probably address
now, before the Bio.Motif API is officially released:
Warning: Module Bio.Motif.AlignAceParser is shadowed by a variable with the
same name.
Warning: Module Bio.Motif.MEMEParser is shadowed by a variable with the
same name.
Warning: Module Bio.Motif.Motif is shadowed by a variable with the same
name.
Ignoring these warnings for now, epydoc then crashes for me doing
Bio.Motif.Motif.Motif-class.html - which is bigger problem. This was
using Epydoc version 3.0.1 (with python 2.6 on Ubuntu Jaunty). I'll
try another machine tomorrow just to make sure this isn't a local
setup issue.
Also we should probably fix these "shadowing warnings", they can make
the API confusing - in addition to confusing epydoc and making the API
doc pages confusing. GenomeDiagram is also doing this, and we should
try and fix that too:
Warning: Module Bio.Graphics.GenomeDiagram.Diagram is shadowed by a
variable with the same name.
Warning: Module Bio.Graphics.GenomeDiagram.FeatureSet is shadowed by a
variable with the same name.
Warning: Module Bio.Graphics.GenomeDiagram.GraphSet is shadowed by a
variable with the same name.
Warning: Module Bio.Graphics.GenomeDiagram.Track is shadowed by a variable
with the same name.
However it may be a bit late to fix the main source of these warnings,
Bio.PDB, without breaking things (i.e. any fix may not be backwards
compatible). See also this thread from when I was running epydoc for
Biopython 1.49 late last year:
http://lists.open-bio.org/pipermail/biopython-dev/2008-November/004810.html
Peter
From bugzilla-daemon at portal.open-bio.org Tue Apr 14 20:13:49 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 14 Apr 2009 20:13:49 -0400
Subject: [Biopython-dev] [Bug 2811] EFetch returning ASN.1 not GenBank
format for rettype=genbank
In-Reply-To:
Message-ID: <200904150013.n3F0DnkE021278@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2811
------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-14 20:13 EST -------
(In reply to comment #3)
> I saw in the online Tutorial this small typo:
> "form Bio import SeqIO"
I'd fixed at least one occurange of that error before, but you are right -
there were still two left in CVS. Thanks.
> I and have a question regarding this bug: What about adding "gb" as format
> type in SeqIO, and mapped to "genbank". This would add consistency (if I
> retrieve a sequence using "gb" from Entrez, I expect to save it using SeqIO
> with "gb"). I think it won't hurt to have "gb" as an alias for "genbank" in
> SeqIO.
The reason we have this bug in the first place was we used an unofficial return
type in EFetch in order to use the same format name ("genbank") in both
Bio.Entrez and Bio.SeqIO - and this did make the examples straight forward.
Adding aliases (such as "gb", "gp", and maybe also "genpept" for "genbank")
might make Bio.Entrez and Bio.SeqIO a little nicer to use together after the
changes forced by this bug. There are also several aliases used in EMBOSS that
would also make sense (e.g. "pfam" for "stockholm"). On the down side, having
more than one name risks confusion. Bring this up on the mailing list if you
like.
Leaving this bug as fixed.
Peter
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From sbassi at clubdelarazon.org Tue Apr 14 22:05:53 2009
From: sbassi at clubdelarazon.org (Sebastian Bassi)
Date: Tue, 14 Apr 2009 23:05:53 -0300
Subject: [Biopython-dev] Propose: Adding an alias name (gb) for Genbank in
SeqIO
Message-ID: <9e2f512b0904141905x69d10c48s95f5a808e1cc430f@mail.gmail.com>
As a follow up to bug 2811 where "gb" is now a valid name in
Bio.Entrez, I propose to add "gb" as an alias for "genbank" in SeqIO.
This proposal is backward compatible since previous code using
"genbank" is unaffected. The rationale behind my request is that
Entrez.efetch(db=db,id=x,rettype='gb')
When I want to save the sequence I got using rettype='gb', seems
consistent to use SeqIO.write(myseq,fielhandle,'gb')
Bugtrack chat related:
---------- Forwarded message ----------
From:
Date: Tue, Apr 14, 2009 at 9:13 PM
Subject: [Biopython-dev] [Bug 2811] EFetch returning ASN.1 not GenBank
format for rettype=genbank
To: biopython-dev at biopython.org
http://bugzilla.open-bio.org/show_bug.cgi?id=2811
------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk
2009-04-14 20:13 EST -------
(In reply to comment #3)
> I saw in the online Tutorial this small typo:
> "form Bio import SeqIO"
I'd fixed at least one occurange of that error before, but you are right -
there were still two left in CVS. Thanks.
> I and have a question regarding this bug: What about adding "gb" as format
> type in SeqIO, and mapped to "genbank". This would add consistency (if I
> retrieve a sequence using "gb" from Entrez, I expect to save it using SeqIO
> with "gb"). I think it won't hurt to have "gb" as an alias for "genbank" in
> SeqIO.
The reason we have this bug in the first place was we used an unofficial return
type in EFetch in order to use the same format name ("genbank") in both
Bio.Entrez and Bio.SeqIO - and this did make the examples straight forward.
Adding aliases (such as "gb", "gp", and maybe also "genpept" for "genbank")
might make Bio.Entrez and Bio.SeqIO a little nicer to use together after the
changes forced by this bug. There are also several aliases used in EMBOSS that
would also make sense (e.g. "pfam" for "stockholm"). On the down side, having
more than one name risks confusion. Bring this up on the mailing list if you
like.
Leaving this bug as fixed.
Peter
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
_______________________________________________
Biopython-dev mailing list
Biopython-dev at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython-dev
--
Sebasti?n Bassi. Diplomado en Ciencia y Tecnolog?a.
Non standard disclaimer: READ CAREFULLY. By reading this email,
you agree, on behalf of your employer, to release me from all
obligations and waivers arising from any and all NON-NEGOTIATED
agreements, licenses, terms-of-service, shrinkwrap, clickwrap,
browsewrap, confidentiality, non-disclosure, non-compete and
acceptable use policies ("BOGUS AGREEMENTS") that I have
entered into with your employer, its partners, licensors, agents and
assigns, in perpetuity, without prejudice to my ongoing rights and
privileges. You further represent that you have the authority to release
me from any BOGUS AGREEMENTS on behalf of your employer.
From biopython at maubp.freeserve.co.uk Wed Apr 15 05:40:56 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 15 Apr 2009 10:40:56 +0100
Subject: [Biopython-dev] Propose: Adding an alias name (gb) for Genbank
in SeqIO
In-Reply-To: <9e2f512b0904141905x69d10c48s95f5a808e1cc430f@mail.gmail.com>
References: <9e2f512b0904141905x69d10c48s95f5a808e1cc430f@mail.gmail.com>
Message-ID: <320fb6e00904150240l1e6b424tdd34035256876226@mail.gmail.com>
On Wed, Apr 15, 2009 at 3:05 AM, Sebastian Bassi
wrote:
> As a follow up to bug 2811 where "gb" is now a valid name in
> Bio.Entrez, ...
Just to note that in Entrez EFetch, using rettype=gb (and the related
rettype=gb for proteins in GenPept format) has always been a valid
argument (and in fact has always been the documented way to get a
GenBank/GenPept file back).
>From my point of view it was a nice feature of Entrez EFetch that they
used to (unofficially) support retype=genbank, which was consistent with
Bio.SeqIO. I suppose you could all try lobbing the NCBI to put Entrez
EFetch back to the pre Easter 2009 behavior, but realistically we'll just
have to live with it.
Now that Entrez EFetch doesn't support the unofficial rettype=genbank
argument anymore, we have the current situation where you must use
"gb" (or "gp") for Bio.Entrez but "genbank" for Bio.SeqIO. I agree this
isn't so nice, but as I wrote on Bug 2811, I'm not keen on having aliases
in Bio.SeqIO (but I may be in a minority here, hence suggesting a
discussion). On the plus side, EMBOSS offers "gb" (and "ddbj") as
alternative aliases for "genbank", so there is precedent.
In a related approach, I suppose we could have Bio.SeqIO take
"genbank" to mean GenBank or GenPept as determined from the file
or the alphabet (as now), and add "gb" meaning (nucelotide) GenBank
files, and "gb" meaning (protein) GenPept files.
But again, this breaks the Python ideal of there being one clear way to
do things (having multiple names for the same format).
Peter
From peter at maubp.freeserve.co.uk Wed Apr 15 06:43:40 2009
From: peter at maubp.freeserve.co.uk (Peter)
Date: Wed, 15 Apr 2009 11:43:40 +0100
Subject: [Biopython-dev] Bio.Motif breaks epydoc?
In-Reply-To: <320fb6e00904141634o3c6fd441sf040d7d4c45e8b02@mail.gmail.com>
References: <320fb6e00904141634o3c6fd441sf040d7d4c45e8b02@mail.gmail.com>
Message-ID: <320fb6e00904150343u35f66911pd45520c399e2e5f1@mail.gmail.com>
On Wed, Apr 15, 2009 at 12:34 AM, Peter wrote:
> Hi all,
>
> I forgot to run epydoc when I did Biopython 1.50 beta, but I've just tried [...]
> we should probably fix these "shadowing warnings", they can make
> the API confusing - in addition to confusing epydoc and making the API
> doc pages confusing. ?GenomeDiagram is also doing this, and we should
> try and fix that too:
>
> Warning: Module Bio.Graphics.GenomeDiagram.Diagram is shadowed by a
> ? ? ? ? variable with the same name.
> Warning: Module Bio.Graphics.GenomeDiagram.FeatureSet is shadowed by a
> ? ? ? ? variable with the same name.
> Warning: Module Bio.Graphics.GenomeDiagram.GraphSet is shadowed by a
> ? ? ? ? variable with the same name.
> Warning: Module Bio.Graphics.GenomeDiagram.Track is shadowed by a variable
> ? ? ? ? with the same name.
The shadowing issue with GenomeDiagram should be OK in CVS now - this was
an accidental side effect of renaming the internal modules as part of
integrating
GenomeDiagram into Biopython. I discussed this with Leighton (off list) and we
agreed that renaming the modules with the simplest solution, and opted
for adding
an underscore which makes it explicit that the modules concerned are intended to
be private. This doesn't affect the (intended) public API for GenomeDiagram.
Peter
From mjldehoon at yahoo.com Wed Apr 15 06:57:43 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Wed, 15 Apr 2009 03:57:43 -0700 (PDT)
Subject: [Biopython-dev] Propose: Adding an alias name (gb) for Genbank
in SeqIO
In-Reply-To: <320fb6e00904150240l1e6b424tdd34035256876226@mail.gmail.com>
Message-ID: <587664.25168.qm@web62402.mail.re1.yahoo.com>
I think it's nice to be consistent with NCBI, and I don't see a big problem in having an alias for GenBank in SeqIO. At least, having "gb" in Bio.Entrez but "genbank" in Bio.SeqIO would go against the principle of least surprise.
--Michiel.
--- On Wed, 4/15/09, Peter wrote:
> From: Peter
> Subject: Re: [Biopython-dev] Propose: Adding an alias name (gb) for Genbank in SeqIO
> To: "Sebastian Bassi"
> Cc: biopython-dev at lists.open-bio.org
> Date: Wednesday, April 15, 2009, 5:40 AM
> On Wed, Apr 15, 2009 at 3:05 AM, Sebastian Bassi
> wrote:
> > As a follow up to bug 2811 where "gb" is now
> a valid name in
> > Bio.Entrez, ...
>
> Just to note that in Entrez EFetch, using rettype=gb (and
> the related
> rettype=gb for proteins in GenPept format) has always been
> a valid
> argument (and in fact has always been the documented way to
> get a
> GenBank/GenPept file back).
>
> >From my point of view it was a nice feature of Entrez
> EFetch that they
> used to (unofficially) support retype=genbank, which was
> consistent with
> Bio.SeqIO. I suppose you could all try lobbing the NCBI to
> put Entrez
> EFetch back to the pre Easter 2009 behavior, but
> realistically we'll just
> have to live with it.
>
> Now that Entrez EFetch doesn't support the unofficial
> rettype=genbank
> argument anymore, we have the current situation where you
> must use
> "gb" (or "gp") for Bio.Entrez but
> "genbank" for Bio.SeqIO. I agree this
> isn't so nice, but as I wrote on Bug 2811, I'm not
> keen on having aliases
> in Bio.SeqIO (but I may be in a minority here, hence
> suggesting a
> discussion). On the plus side, EMBOSS offers
> "gb" (and "ddbj") as
> alternative aliases for "genbank", so there is
> precedent.
>
> In a related approach, I suppose we could have Bio.SeqIO
> take
> "genbank" to mean GenBank or GenPept as
> determined from the file
> or the alphabet (as now), and add "gb" meaning
> (nucelotide) GenBank
> files, and "gb" meaning (protein) GenPept files.
>
> But again, this breaks the Python ideal of there being one
> clear way to
> do things (having multiple names for the same format).
>
> Peter
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
From biopython at maubp.freeserve.co.uk Wed Apr 15 07:01:54 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 15 Apr 2009 12:01:54 +0100
Subject: [Biopython-dev] Propose: Adding an alias name (gb) for Genbank
in SeqIO
In-Reply-To: <587664.25168.qm@web62402.mail.re1.yahoo.com>
References: <320fb6e00904150240l1e6b424tdd34035256876226@mail.gmail.com>
<587664.25168.qm@web62402.mail.re1.yahoo.com>
Message-ID: <320fb6e00904150401q209ae99id6746f2a0c4e3532@mail.gmail.com>
On Wed, Apr 15, 2009 at 11:57 AM, Michiel de Hoon wrote:
>
> I think it's nice to be consistent with NCBI, and I don't see a big
> problem in having an alias for GenBank in SeqIO. At least,
> having "gb" in Bio.Entrez but "genbank" in Bio.SeqIO would
> go against the principle of least surprise.
True.
Would you support other aliases such as "pfam" for "stockholm", an
alias supported in EMBOSS for this alignment format?
Peter
From biopython at maubp.freeserve.co.uk Wed Apr 15 08:21:17 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 15 Apr 2009 13:21:17 +0100
Subject: [Biopython-dev] Tutorial & Cookbook
In-Reply-To: <320fb6e00904130616t2cd5f029i582f4a2488d1182c@mail.gmail.com>
References: <320fb6e00904100746q2478551dna253f1aa5acbcdb4@mail.gmail.com>
<93403.18413.qm@web62406.mail.re1.yahoo.com>
<20090413125255.GC5429@sobchak.mgh.harvard.edu>
<320fb6e00904130616t2cd5f029i582f4a2488d1182c@mail.gmail.com>
Message-ID: <320fb6e00904150521q536fa27drd54db5e267876b15@mail.gmail.com>
On Mon, Apr 13, 2009 at 2:16 PM, Peter wrote:
> Speaking of doctests, we should do more of those in our docstrings.
> For our online API documentation at
> http://biopython.org/DIST/docs/api/ it would be nice to have the
> python examples within the docstrings (including the doctests) shown
> with syntax colouring. ?See
> http://epydoc.sourceforge.net/manual-epytext.html#doctest-blocks for
> an example, and compare this to
> http://biopython.org/DIST/docs/api/Bio.Seq-module.html - maybe we need
> to adjust our indentation?
We currently explicitly use plain text for epydoc, rather than the
default epytext markup language. If we switch to epytext (or at least
a very simple subset of it, as some of the markup doesn't lend itself
to friendly human readable docstrings) then we do get python syntax
colouring on the doctests. However, this will require some effort to
fine tune the docstrings, and right now it makes a mess of in some
cases.
Peter
From biopython at maubp.freeserve.co.uk Wed Apr 15 09:19:34 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 15 Apr 2009 14:19:34 +0100
Subject: [Biopython-dev] docstrings, doctests and epydoc API pages
Message-ID: <320fb6e00904150619i3d5b4238tf00b29fc6aaeac43@mail.gmail.com>
I've changed the thread title to something a little more specific.
On Wed, Apr 15, 2009 at 1:21 PM, Peter wrote:
> We currently explicitly use plain text for epydoc, rather than the
> default epytext markup language. ?If we switch to epytext (or at least
> a very simple subset of it, as some of the markup doesn't lend itself
> to friendly human readable docstrings) then we do get python syntax
> colouring on the doctests. ?However, this will require some effort to
> fine tune the docstrings, and right now it makes a mess of in some
> cases.
As a test, I was able to update Bio/Seq.py to look good as epytext
(while still being equally readable as plain text for when reading the
API documentation at the python prompt with the help function). I
uploaded one new page to the website:
http://biopython.org/DIST/docs/api/Bio.Seq.Seq-class.html
The rest of the online API pages are currently still from Biopython
1.49, when epydoc parsed the docstrings as plain text. For another
example with quite a few docstrings and doctests, look at:
http://biopython.org/DIST/docs/api/Bio.SeqRecord.SeqRecord-class.html
What do you all think? I don't know if it will encourage more people
to look at the API pages, but I certainly like the new version where
the doctests are shown boxed with syntax colouring.
Note that this would be a lot easier to do if epydoc supported "plaintext
with doctests" as a markup type, or did this automatically when told the
markup is just "plaintext" (as I had originally hoped for). I wonder how
easy that would be to implement... it might be less work than checking
all our API pages by hand and fixing our markup to follow epytext
standards. See also: http://epydoc.sourceforge.net/epytext.html
Peter
From bartek at rezolwenta.eu.org Wed Apr 15 10:43:15 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Wed, 15 Apr 2009 16:43:15 +0200
Subject: [Biopython-dev] Bio.Motif breaks epydoc?
In-Reply-To: <320fb6e00904141634o3c6fd441sf040d7d4c45e8b02@mail.gmail.com>
References: <320fb6e00904141634o3c6fd441sf040d7d4c45e8b02@mail.gmail.com>
Message-ID: <8b34ec180904150743tffaca5dva03c58bbe2a0eade@mail.gmail.com>
Hi,
I'm working on Bio.Motif to fix this. I'll send a patch later today.
cheers
Bartek
On Wed, Apr 15, 2009 at 1:34 AM, Peter wrote:
> Hi all,
>
> I forgot to run epydoc when I did Biopython 1.50 beta, but I've just
> tried and it is failing - apparently due to an issue with Bio.Motif.
>
> First of all there are some warnings which we should probably address
> now, before the Bio.Motif API is officially released:
>
> Warning: Module Bio.Motif.AlignAceParser is shadowed by a variable with the
> ? ? ? ? same name.
> Warning: Module Bio.Motif.MEMEParser is shadowed by a variable with the
> ? ? ? ? same name.
> Warning: Module Bio.Motif.Motif is shadowed by a variable with the same
> ? ? ? ? name.
>
> Ignoring these warnings for now, epydoc then crashes for me doing
> Bio.Motif.Motif.Motif-class.html - which is bigger problem. ?This was
> using Epydoc version 3.0.1 (with python 2.6 on Ubuntu Jaunty). ?I'll
> try another machine tomorrow just to make sure this isn't a local
> setup issue.
>
> Also we should probably fix these "shadowing warnings", they can make
> the API confusing - in addition to confusing epydoc and making the API
> doc pages confusing. ?GenomeDiagram is also doing this, and we should
> try and fix that too:
>
> Warning: Module Bio.Graphics.GenomeDiagram.Diagram is shadowed by a
> ? ? ? ? variable with the same name.
> Warning: Module Bio.Graphics.GenomeDiagram.FeatureSet is shadowed by a
> ? ? ? ? variable with the same name.
> Warning: Module Bio.Graphics.GenomeDiagram.GraphSet is shadowed by a
> ? ? ? ? variable with the same name.
> Warning: Module Bio.Graphics.GenomeDiagram.Track is shadowed by a variable
> ? ? ? ? with the same name.
>
> However it may be a bit late to fix the main source of these warnings,
> Bio.PDB, without breaking things (i.e. any fix may not be backwards
> compatible). ?See also this thread from when I was running epydoc for
> Biopython 1.49 late last year:
> http://lists.open-bio.org/pipermail/biopython-dev/2008-November/004810.html
>
> Peter
>
--
Bartek Wilczynski
==================
Postdoctoral fellow
EMBL, Furlong group
Meyerhoffstrasse 1,
69012 Heidelberg,
Germany
tel: +49 6221 387 8433
From biopython at maubp.freeserve.co.uk Wed Apr 15 12:46:02 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 15 Apr 2009 17:46:02 +0100
Subject: [Biopython-dev] docstrings, doctests and epydoc API pages
In-Reply-To: <320fb6e00904150619i3d5b4238tf00b29fc6aaeac43@mail.gmail.com>
References: <320fb6e00904150619i3d5b4238tf00b29fc6aaeac43@mail.gmail.com>
Message-ID: <320fb6e00904150946y45010c99u8508e8e6fd71eb75@mail.gmail.com>
> As a test, I was able to update Bio/Seq.py to look good as epytext
> (while still being equally readable as plain text for when reading the
> API documentation at the python prompt with the help function). I
> uploaded one new page to the website:
>
> http://biopython.org/DIST/docs/api/Bio.Seq.Seq-class.html
>
> The rest of the online API pages are currently still from Biopython
> 1.49, when epydoc parsed the docstrings as plain text. ?For another
> example with quite a few docstrings and doctests, look at:
>
> http://biopython.org/DIST/docs/api/Bio.SeqRecord.SeqRecord-class.html
>
> What do you all think? ?I don't know if it will encourage more people
> to look at the API pages, but I certainly like the new version where
> the doctests are shown boxed with syntax colouring.
I've done Bio/SeqIO/QualityIO.py as well which proved harder due
to lots of example FASTQ records embedded in the text.
I've also worked out how to set the epydoc markup format on a per
file basis with the __docformat__ setting (see also PEP 258). This
means we can gradually convert existing docstrings on a file by file
basis - I'd suggest we focus on those with docstrings first, as they
will benefit most from this.
The only downside thus far is that the epytext mark up seems rather
fragile, and it is easy to "break" a docstring such that epydoc fails to
render nicely. At least epydoc falls back on plain text in this situation,
so the text is still human readable.
Tip: You need an EMPTY line before and after each doctest in order
for it to work with epydoc as epytext markup. This is annoying as
the doctest framework can cope with a line with spaces in it.
Peter
From sbassi at clubdelarazon.org Wed Apr 15 15:19:13 2009
From: sbassi at clubdelarazon.org (Sebastian Bassi)
Date: Wed, 15 Apr 2009 16:19:13 -0300
Subject: [Biopython-dev] Proposal: Parse and read in SeqIO and NCBIXML
Message-ID: <9e2f512b0904151219i60a8eda0xd06c9c86c690b6e3@mail.gmail.com>
In SeqIO there is parse and read. Parse return an iterable with all
the record found in the file, while read return only a record and it
is used when we know that the file has only one record. This is OK.
But in NCBIXML, there is only parse. If the the ncbiblast output has
only one record (because it was made from 1 query), now we have to
write:
NCBIXML.parse(x).next() or iterate over a "list" of one member. I
think it would be nice to add a read method to NCBIXML, such as the
one in SeqIO.
From biopython at maubp.freeserve.co.uk Wed Apr 15 17:30:55 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 15 Apr 2009 22:30:55 +0100
Subject: [Biopython-dev] Proposal: Parse and read in SeqIO and NCBIXML
In-Reply-To: <9e2f512b0904151219i60a8eda0xd06c9c86c690b6e3@mail.gmail.com>
References: <9e2f512b0904151219i60a8eda0xd06c9c86c690b6e3@mail.gmail.com>
Message-ID: <320fb6e00904151430i19983fafq43ca1c9395579fb3@mail.gmail.com>
On Wed, Apr 15, 2009 at 8:19 PM, Sebastian Bassi
wrote:
> In SeqIO there is parse and read. Parse return an iterable with all
> the record found in the file, while read return only a record and it
> is used when we know that the file has only one record. This is OK.
> But in NCBIXML, there is only parse. If the the ncbiblast output has
> only one record (because it was made from 1 query), now we have to
> write:
> NCBIXML.parse(x).next() or iterate over a "list" of one member. I
> think it would be nice to add a read method to NCBIXML, such as the
> one in SeqIO.
That seems sensible to me, we could probably squeeze that in for
Biopython 1.50 too. Could you file an enhancement bug in case I
forget about this?
Peter
From bugzilla-daemon at portal.open-bio.org Wed Apr 15 17:42:28 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 15 Apr 2009 17:42:28 -0400
Subject: [Biopython-dev] [Bug 2812] New: Adding read method to NCBIXML (just
like SeqIO and SwissProt).
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2812
Summary: Adding read method to NCBIXML (just like SeqIO and
SwissProt).
Product: Biopython
Version: 1.50b
Platform: PC
OS/Version: All
Status: NEW
Severity: enhancement
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: sbassi at gmail.com
NCBIXML should have a "read" method. It has a parse method that returns an
iterable. If the the ncbiblast output has
only one record (because it was made from 1 query), now we have to
write: NCBIXML.parse(x).next() or iterate over a "list" of one member.
Other objects like SeqIO and SwissProt has both "read" and "parse" to deal with
one entry files. I think for the sake of consistency NCBIXML should also have a
read method.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Wed Apr 15 17:58:15 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 15 Apr 2009 17:58:15 -0400
Subject: [Biopython-dev] [Bug 2812] Adding read method to NCBIXML (just like
SeqIO and SwissProt).
In-Reply-To:
Message-ID: <200904152158.n3FLwFYc027155@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2812
------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-15 17:58 EST -------
Adding this should do the trick (based on the SeqIO.read function):
def read(handle, debug=0) :
"""Returns a single Blast record (assumes just one query).
Use the Bio.Blast.NCBIXML.read() function if you expect more than
one BLAST record (i.e. if you have more than one query sequence).
This function is for use when there is one and only one BLAST
result.
"""
iterator = parse(handle, debug)
try :
first = iterator.next()
except StopIteration :
first = None
if first is None :
raise ValueError("No records found in handle")
try :
second = iterator.next()
except StopIteration :
second = None
if second is not None :
raise ValueError("More than one record found in handle")
return first
However, on reflection this needs some special testing for when there is a
single query giving NO hits. I suspect that means the BLAST XML file will
contain no records (at least that's my guess from recent versions - I haven't
tried 2.2.20 yet). Would raising a ValueError in this situation reasonable?
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bartek at rezolwenta.eu.org Wed Apr 15 19:10:03 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Thu, 16 Apr 2009 01:10:03 +0200
Subject: [Biopython-dev] Bio.Motif breaks epydoc?
In-Reply-To: <320fb6e00904151514g2b9709fbj7c3de68d88db3f7d@mail.gmail.com>
References: <320fb6e00904141634o3c6fd441sf040d7d4c45e8b02@mail.gmail.com>
<8b34ec180904150743tffaca5dva03c58bbe2a0eade@mail.gmail.com>
<8b34ec180904151458k39fec681u53fcf64de9f7590d@mail.gmail.com>
<320fb6e00904151514g2b9709fbj7c3de68d88db3f7d@mail.gmail.com>
Message-ID: <8b34ec180904151610j4c62b7d7k51be600420aa73c@mail.gmail.com>
Hi
On Thu, Apr 16, 2009 at 12:14 AM, Peter wrote:
> How about putting it in Bio/Motif/_Motif.py? ?That makes it clear
> people are expected to access it via Bio.Motif.Motif, and not go via
> the module. ?This is what Leighton and I did for GenomeDiagram which
> was a very similar situation. ?Using an underscore denotes a private
> module, so you could at a later date rename it to something else
> without worrying about backwards compatibiltiy (if you do change your
> mind).
>
OK, I'll update the source tomorrow.
> Are you planning any documentation to go with this? ?It would be nice
> to include it with Biopython 1.50 but not essential.
There is a cookbook-style tutorial in Docs/cookbook/motif. I'm not sure
if it's ready for inclusion into the official tutorial. I'm hoping to add some
more features soon and then it could be improved and included into the
tutorial.
cheers
Bartek
From winda002 at student.otago.ac.nz Wed Apr 15 23:30:43 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Thu, 16 Apr 2009 15:30:43 +1200
Subject: [Biopython-dev] Tutorial & Cookbook
In-Reply-To: <320fb6e00904100746q2478551dna253f1aa5acbcdb4@mail.gmail.com>
References: <320fb6e00904100746q2478551dna253f1aa5acbcdb4@mail.gmail.com>
Message-ID: <49E6A663.90900@student.otago.ac.nz>
Hi all,
Sorry about the delay in replying to this, the easter holidays are the
last chance to play in the sun in the southern hemisphere.
Peter wrote:
> David wrote:
>
>>> For me as a n00b the most useful resource by far has been the cookbook -
>>>
>>>
>
> When you said "cookbook", did you mean the Biopython Tutorial & Cookbook?
> http://biopython.org/DIST/docs/tutorial/Tutorial.html
> http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
>
> There are a couple of other documents under the "Cookbook" folder here:
> http://biopython.org/DIST/docs/cookbook/Restriction.html
> http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf
>
I really meant the Tutorial and Cookbook and specifically the examples
in it. The first thing I tried to do with BioPython was parse BLAST
outputs and actually seeing a loop that would work and I that I could
tweak to get what I wanted from by BLAST results was really cool.
From my perspective it makes sense to have a tutorial that walks
through the main features with some relatively simple examples (like the
existing one) with a separate cookbook highlighting what you can
actually do when you bring everything together. I think this would
fulfill the goals I was talking about in my original post (having nicely
documented examples of BioPython in action out there for anyone who's
looking) and adding a cookbook catergory to the wiki achieves this with
the smallest impediment to participation . If anyone's counting I think
that's +3 for wiki and -3 for a new html/pdf document.
David
From peter at maubp.freeserve.co.uk Thu Apr 16 06:56:23 2009
From: peter at maubp.freeserve.co.uk (Peter)
Date: Thu, 16 Apr 2009 11:56:23 +0100
Subject: [Biopython-dev] Bio.Motif breaks epydoc?
In-Reply-To: <8b34ec180904150743tffaca5dva03c58bbe2a0eade@mail.gmail.com>
References: <320fb6e00904141634o3c6fd441sf040d7d4c45e8b02@mail.gmail.com>
<8b34ec180904150743tffaca5dva03c58bbe2a0eade@mail.gmail.com>
Message-ID: <320fb6e00904160356i68ca063ak370faa78eda63876@mail.gmail.com>
On Wed, Apr 15, 2009 at 3:43 PM, Bartek Wilczynski
wrote:
> Hi,
>
> I'm working on Bio.Motif to fix this. [...]
>
> cheers
> Bartek
Bartek has solved the epydoc problem in CVS now, and I have been able
to build the API documentation using a clean installation of Biopython
from CVS. :)
It looks like the LaTeX equation in Bio/Motif/Motif.py (which was full
of backslashes) was causing some of the trouble.
Peter
From biopython at maubp.freeserve.co.uk Thu Apr 16 12:45:13 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 16 Apr 2009 17:45:13 +0100
Subject: [Biopython-dev] Where to put command line wrappers
Message-ID: <320fb6e00904160945p3e15b6daoaa7891dbf9e144ec@mail.gmail.com>
Hi all,
We were recently discussing alignment tools like MUSCLE and ClustalW
and putting together a set of command line wrappers under Bio.Align
for them. I think Bio.Align.Applications was suggested to match
Bio.EMBOSS.Applications.
For EMBOSS we have a single file, Bio/Emboss/Applications.py, which
has about 15 wrappers (all very similar as the EMBOSS applications are
very consistent). This is nice in that all the wrappers are in the
Bio.Emboss.Application namespace.
Bartek and I have been having a similar discussion for Motif tools,
and if the AliceAce wrappers should go in Bio.Motif.Applications to
match. For now Bio.Motif has just one wrapper for AlignACE and sister
tool CompareACE. Now giving each tool-set its own file is possible
(Bio/Motif/Applications/AlignAce.py) but would one (large) file be
simpler? (i.e. Bio/Motif/Applications.py).
I'm not sure how many wrappers we might eventually expect for multiple
sequence alignments, maybe ten or twenty, mostly from different tool
sets. Maybe Bio/Align/Applications/Muscle.py etc is the way to go,
but we can then import all the command line objects under the
Bio.Align.Applications namespace.
Any comments?
Peter
From biopython at maubp.freeserve.co.uk Thu Apr 16 13:16:10 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 16 Apr 2009 18:16:10 +0100
Subject: [Biopython-dev] Where to put command line wrappers
In-Reply-To: <320fb6e00904160945p3e15b6daoaa7891dbf9e144ec@mail.gmail.com>
References: <320fb6e00904160945p3e15b6daoaa7891dbf9e144ec@mail.gmail.com>
Message-ID: <320fb6e00904161016i63c6d149xf414725ee489bfdb@mail.gmail.com>
On Thu, Apr 16, 2009 at 5:45 PM, Peter wrote:
> Hi all,
>
> We were recently discussing alignment tools like MUSCLE and ClustalW
> and putting together a set of command line wrappers under Bio.Align
> for them. ?I think Bio.Align.Applications was suggested to match
> Bio.EMBOSS.Applications.
>
> For EMBOSS we have a single file, Bio/Emboss/Applications.py, which
> has about 15 wrappers (all very similar as the EMBOSS applications are
> very consistent). ?This is nice in that all the wrappers are in the
> Bio.Emboss.Application namespace.
>
> Bartek and I have been having a similar discussion for Motif tools,
> and if the AliceAce wrappers should go in Bio.Motif.Applications to
> match. ?For now Bio.Motif has just one wrapper for AlignACE and sister
> tool CompareACE. ?Now giving each tool-set its own file is possible
> (Bio/Motif/Applications/AlignAce.py) but would one (large) file be
> simpler? (i.e. Bio/Motif/Applications.py).
>
> I'm not sure how many wrappers we might eventually expect for multiple
> sequence alignments, maybe ten or twenty, mostly from different tool
> sets. ?Maybe Bio/Align/Applications/Muscle.py etc is the way to go,
> but we can then import all the command line objects under the
> Bio.Align.Applications namespace.
>
> Any comments?
For any that missed the thread last week, I'd like to link back to the
end of my post:
http://lists.open-bio.org/pipermail/biopython-dev/2009-March/005658.html
I see introducing Bio.Align.Applications as chance to get a more
consistent approach to Biopython's command line wrappers established
(replacing Bio.Clustalw). And as I wrote last month, I think we
should focus on the Bio.Application command line wrapper object. For
reasons explained in the linked email, I would want to rewrite
Bio.Blast.NCBIStandalone in the same way (probably putting the command
line wrapper classes in Bio.Blast.Applications, and if there is
interesting, include other variants like WUBlast). Are there any
other wrappers not using Bio.Application which I have forgotten about?
Bio/AlignAce/Applications.py does use Bio.Application, but we are
planning to replace this module with Bio.Motif which gives us a chance
to review the API without worrying too much about backwards
compatibility. As part of moving it to Bio.Motif, I would remove the
run methods from AlignAceCommandline and CompareAceCommandline (none
of the other Biopython command line objects have them as far as I
know), and also remove the AlignAce and CompareAce helper functions
(in Bio/AlignAce/AlignAceStandalone.py and
Bio/AlignAce/CompareAceStandalone.py). Internally these all call the
Bio.Application.generic_run function, and return stdout and stderr as
wrapped StringIO handles.
Because it reads in all the stdout and stderr output into memory,
Bio.Application.generic_run function is only suitable for tools with
print very little to the console (or nothing, in which case the return
values can be ignored). This method is useless on things like BLAST
XML output to stdout which can be hundreds of megabytes in size. I
would generally discourage the use of the Bio.Application.generic_run
function and instead we should give examples using the command line
object together with the subprocess module (Python 2.3 doesn't have
subprocess, but Biopthyon 1.50 will be the last release to care about
this) which lets the user choose what if any handles they care about.
Peter
From bartek at rezolwenta.eu.org Thu Apr 16 13:37:29 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Thu, 16 Apr 2009 19:37:29 +0200
Subject: [Biopython-dev] Fwd: Where to put command line wrappers
In-Reply-To: <8b34ec180904161034j3a5a778dsfbca8ba2a57d3e88@mail.gmail.com>
References: <320fb6e00904160945p3e15b6daoaa7891dbf9e144ec@mail.gmail.com>
<320fb6e00904161016i63c6d149xf414725ee489bfdb@mail.gmail.com>
<8b34ec180904161034j3a5a778dsfbca8ba2a57d3e88@mail.gmail.com>
Message-ID: <8b34ec180904161037wea0373cnde2964b25f5c357@mail.gmail.com>
Hi All,
On Thu, Apr 16, 2009 at 5:45 PM, Peter wrote:
> For EMBOSS we have a single file, Bio/Emboss/Applications.py, which
> has about 15 wrappers (all very similar as the EMBOSS applications are
> very consistent). ?This is nice in that all the wrappers are in the
> Bio.Emboss.Application namespace.
>
> Bartek and I have been having a similar discussion for Motif tools,
> and if the AliceAce wrappers should go in Bio.Motif.Applications to
> match. ?For now Bio.Motif has just one wrapper for AlignACE and sister
> tool CompareACE. ?Now giving each tool-set its own file is possible
> (Bio/Motif/Applications/AlignAce.py) but would one (large) file be
> simpler? (i.e. Bio/Motif/Applications.py).
>
I think that there is a difference between EMBOSS and
Bio.[Motif|Align]. In EMBOSS we
have a very nicely comoditized set of tools with similar interfaces,
while both for multiple
alignment and motif searching the tools vary a lot. In case of
multiple alignments this is only
with respect to parameters and output format, while in motif searching
there is also a lot of
differences in the types of input (background models etc.). Also,
quite likely the parsers for
different tools will be written by different people.
In this case, I think that it's much easier from the maintainers point
of view to have a directory
with separate files rather than a single module. If people are scared
by nested namespaces,
we can import the important classes into the higher level.
>> I'm not sure how many wrappers we might eventually expect for multiple
>> sequence alignments, maybe ten or twenty, mostly from different tool
>> sets. ?Maybe Bio/Align/Applications/Muscle.py etc is the way to go,
>> but we can then import all the command line objects under the
>> Bio.Align.Applications namespace.
>>
+1 from me.
>
> Bio/AlignAce/Applications.py does use Bio.Application, but we are
> planning to replace this module with Bio.Motif which gives us a chance
> to review the API without worrying too much about backwards
> compatibility. ?As part of moving it to Bio.Motif, I would remove the
> run methods from AlignAceCommandline and CompareAceCommandline (none
> of the other Biopython command line objects have them as far as I
> know), and also remove the AlignAce and CompareAce helper functions
> (in Bio/AlignAce/AlignAceStandalone.py and
> Bio/AlignAce/CompareAceStandalone.py). Internally these all call the
> Bio.Application.generic_run function, and return stdout and stderr as
> wrapped StringIO handles.
>
> Because it reads in all the stdout and stderr output into memory,
> Bio.Application.generic_run function is only suitable for tools with
> print very little to the console (or nothing, in which case the return
> values can be ignored). ?This method is useless on things like BLAST
> XML output to stdout which can be hundreds of megabytes in size. ?I
> would generally discourage the use of the Bio.Application.generic_run
> function and instead we should give examples using the command line
> object together with the subprocess module (Python 2.3 doesn't have
> subprocess, but Biopthyon 1.50 will be the last release to care about
> this) which lets the user choose what if any handles they care about.
Motif finding programs usually output a lot less than there is input. Normally,
you don't want to see more than 10 motifs and each contributes ~1kb so
I don't see this as a huge problem in this case. To be honest, I'm not too keen
on rewriting this old code (as well as MEME parser which was contributed by
Jason Hackney). But if there will be any new motif parsers (I'd like
to have weederand RSAT one day...) I'm happy to conform to any
(reasonable) policy.
cheers
Bartek
From biopython at maubp.freeserve.co.uk Thu Apr 16 14:53:03 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 16 Apr 2009 19:53:03 +0100
Subject: [Biopython-dev] Fwd: Where to put command line wrappers
In-Reply-To: <8b34ec180904161037wea0373cnde2964b25f5c357@mail.gmail.com>
References: <320fb6e00904160945p3e15b6daoaa7891dbf9e144ec@mail.gmail.com>
<320fb6e00904161016i63c6d149xf414725ee489bfdb@mail.gmail.com>
<8b34ec180904161034j3a5a778dsfbca8ba2a57d3e88@mail.gmail.com>
<8b34ec180904161037wea0373cnde2964b25f5c357@mail.gmail.com>
Message-ID: <320fb6e00904161153h70f54b36p94c09e23733c941@mail.gmail.com>
On 4/16/09, Bartek Wilczynski wrote:
> Hi All,
>
> On Thu, Apr 16, 2009 at 5:45 PM, Peter wrote:
> > For EMBOSS we have a single file, Bio/Emboss/Applications.py, which
> > has about 15 wrappers (all very similar as the EMBOSS applications are
> > very consistent). This is nice in that all the wrappers are in the
> > Bio.Emboss.Application namespace.
> >
> > Bartek and I have been having a similar discussion for Motif tools,
> > and if the AliceAce wrappers should go in Bio.Motif.Applications to
> > match. For now Bio.Motif has just one wrapper for AlignACE and sister
> > tool CompareACE. Now giving each tool-set its own file is possible
> > (Bio/Motif/Applications/AlignAce.py) but would one (large) file be
> > simpler? (i.e. Bio/Motif/Applications.py).
> >
> I think that there is a difference between EMBOSS and
> Bio.[Motif|Align]. In EMBOSS we have a very nicely comoditized
> set of tools with similar interfaces, while both for multiple
> alignment and motif searching the tools vary a lot. In case of
> multiple alignments this is only with respect to parameters and
> output format, while in motif searching there is also a lot of
> differences in the types of input (background models etc.).
That is a good argument for using Bio/Align/Applications/XXX.py and
Bio/Motif/Applications/XXX.py while also having
Bio/EMBOSS/Applications.py
> Also, quite likely the parsers for different tools will be written by
> different people.
Biopython's command line wrappers can be quite separate from the
parsers - this is a natural break. One can be useful without the
other, and keeping them separate allows you to for example use a
Biopython wrapper with another parser, or vice versa.
> In this case, I think that it's much easier from the maintainers point
> of view to have a directory with separate files rather than a single
> module. [...]
True.
> >> I'm not sure how many wrappers we might eventually expect for multiple
> >> sequence alignments, maybe ten or twenty, mostly from different tool
> >> sets. Maybe Bio/Align/Applications/Muscle.py etc is the way to go,
> >> but we can then import all the command line objects under the
> >> Bio.Align.Applications namespace.
>
> +1 from me.
>
> > Bio/AlignAce/Applications.py does use Bio.Application, but we are
> > planning to replace this module with Bio.Motif which gives us a chance
> > to review the API without worrying too much about backwards
> > compatibility. As part of moving it to Bio.Motif, I would remove the
> > run methods from AlignAceCommandline and CompareAceCommandline (none
> > of the other Biopython command line objects have them as far as I
> > know), and also remove the AlignAce and CompareAce helper functions
> > (in Bio/AlignAce/AlignAceStandalone.py and
> > Bio/AlignAce/CompareAceStandalone.py). Internally these all call the
> > Bio.Application.generic_run function, and return stdout and stderr as
> > wrapped StringIO handles.
> >
> > Because it reads in all the stdout and stderr output into memory,
> > Bio.Application.generic_run function is only suitable for tools with
> > print very little to the console (or nothing, in which case the return
> > values can be ignored). This method is useless on things like BLAST
> > XML output to stdout which can be hundreds of megabytes in size. I
> > would generally discourage the use of the Bio.Application.generic_run
> > function and instead we should give examples using the command line
> > object together with the subprocess module (Python 2.3 doesn't have
> > subprocess, but Biopthyon 1.50 will be the last release to care about
> > this) which lets the user choose what if any handles they care about.
>
> Motif finding programs usually output a lot less than there is input. Normally,
> you don't want to see more than 10 motifs and each contributes ~1kb so
> I don't see this as a huge problem in this case.
I can see that Bio.Application.generic_run function is often handy,
but sometimes it is quite inappropriate. For AlignAce obviously it
has sufficed.
> To be honest, I'm not too keen on rewriting this old code (as well as
> MEME parser which was contributed by Jason Hackney). But if there
> will be any new motif parsers (I'd like to have weederand RSAT one
> day...) I'm happy to conform to any (reasonable) policy.
In the AlignAce case, in the above I wasn't suggesting rewriting,
rather removing some of the what I saw as redundant bits (in an effort
at consistency).
On reflection, perhaps the core Bio.Application.AbstractCommandline
object might benefit from some "run" like methods? However they do
morph it from a command line string representation into something
bigger... feature creep! ;)
Peter
From biopython at maubp.freeserve.co.uk Thu Apr 16 16:16:04 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 16 Apr 2009 21:16:04 +0100
Subject: [Biopython-dev] Where to put command line wrappers
In-Reply-To: <320fb6e00904161016i63c6d149xf414725ee489bfdb@mail.gmail.com>
References: <320fb6e00904160945p3e15b6daoaa7891dbf9e144ec@mail.gmail.com>
<320fb6e00904161016i63c6d149xf414725ee489bfdb@mail.gmail.com>
Message-ID: <320fb6e00904161316m62162af2s506442502b73c8bc@mail.gmail.com>
> I see introducing Bio.Align.Applications as chance to get a more
> consistent approach to Biopython's command line wrappers established
> (replacing Bio.Clustalw). And as I wrote last month, I think we
> should focus on the Bio.Application command line wrapper object. For
> reasons explained in the linked email, I would want to rewrite
> Bio.Blast.NCBIStandalone in the same way (probably putting the command
> line wrapper classes in Bio.Blast.Applications, and if there is
> interesting, include other variants like WUBlast). Are there any
> other wrappers not using Bio.Application which I have forgotten about?
Funnily enough, there already is a Bio.Blast.Applications module
containing a wrapper for NCBI Fasta and NCBI blastall (a little out of
data, also nothing for rpsblast or blastpgpg). The older
Bio.Blast.NCBIStandalone was never updated to use this internally.
Here's a nice little job for after Biopython 1.50 is out...
Peter
From bugzilla-daemon at portal.open-bio.org Thu Apr 16 18:40:53 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 16 Apr 2009 18:40:53 -0400
Subject: [Biopython-dev] [Bug 2809] Adding startswith and endswith methods
to the Seq object
In-Reply-To:
Message-ID: <200904162240.n3GMerIj001589@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2809
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-16 18:40 EST -------
Checked in after discussion on the mailing list.
Checking in Bio/Seq.py;
/home/repository/biopython/biopython/Bio/Seq.py,v <-- Seq.py
new revision: 1.76; previous revision: 1.75
done
Checking in Tests/test_Seq_objs.py;
/home/repository/biopython/biopython/Tests/test_Seq_objs.py,v <--
test_Seq_objs.py
new revision: 1.5; previous revision: 1.4
done
Marking as fixed.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Apr 16 18:40:54 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 16 Apr 2009 18:40:54 -0400
Subject: [Biopython-dev] [Bug 2351] Make Seq more like a string,
even subclass string?
In-Reply-To:
Message-ID: <200904162240.n3GMesOq001602@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2351
Bug 2351 depends on bug 2809, which changed state.
Bug 2809 Summary: Adding startswith and endswith methods to the Seq object
http://bugzilla.open-bio.org/show_bug.cgi?id=2809
What |Old Value |New Value
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From winda002 at student.otago.ac.nz Fri Apr 17 01:31:45 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Fri, 17 Apr 2009 17:31:45 +1200
Subject: [Biopython-dev] Cookbook recipes on the wiki
Message-ID: <49E81441.8040906@student.otago.ac.nz>
Hi all,
In the recent thread about the cookbook style entries in the tutorial
everyone that had an opinion seemed to think it was best to incorporate
these into the wiki. I've made a very small start at doing this with a
category on the wiki (http://biopython.org/wiki/Category:Cookbook) and
an example of what an entry in the cookbook might look like
(http://biopython.org/wiki/Split_fasta_file).
What do people think of these? If we decide this is the way to go then
to have an entry turn up in the cookbook category you need only to add
[[Category:Cookbook]] to an entry
david
From biopython at maubp.freeserve.co.uk Fri Apr 17 05:32:32 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 17 Apr 2009 10:32:32 +0100
Subject: [Biopython-dev] Cookbook recipes on the wiki
In-Reply-To: <49E81441.8040906@student.otago.ac.nz>
References: <49E81441.8040906@student.otago.ac.nz>
Message-ID: <320fb6e00904170232i75d88a73p5738e54a32de8bdf@mail.gmail.com>
On Fri, Apr 17, 2009 at 6:31 AM, David Winter
wrote:
> Hi all,
>
> In the recent thread about the cookbook style entries in the tutorial
> everyone that had an opinion seemed to think it was best to incorporate
> these into the wiki. I've made a very small start at doing this with a
> category on the wiki (http://biopython.org/wiki/Category:Cookbook) and an
> example of what an entry in the cookbook might look like
> (http://biopython.org/wiki/Split_fasta_file).
>
> What do people think of these? If we decide this is the way to go then to
> have an entry turn up in the cookbook category you need only to add
> [[Category:Cookbook]] to an entry
We'd previously discussed using a cookbook category on the wiki, and
that looks good:
http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005715.html
I'm tempted to get rid of Category:Wiki_Documentation though - it
seems a bit redundant, almost everything on the wiki is documentation.
At least rename this to Category:Documentation?
Peter
From biopython at maubp.freeserve.co.uk Fri Apr 17 07:08:12 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 17 Apr 2009 12:08:12 +0100
Subject: [Biopython-dev] sff reader
In-Reply-To: <200904171246.46568.jblanca@btc.upv.es>
References: <200904161146.28203.jblanca@btc.upv.es>
<320fb6e00904160715o23c05c36jfff9abab521fde8@mail.gmail.com>
<200904171246.46568.jblanca@btc.upv.es>
Message-ID: <320fb6e00904170408w6aa4031dl75d67f321ea85bd8@mail.gmail.com>
On Fri, Apr 17, 2009 at 11:46 AM, Jose Blanca wrote:
> Hi Peter:
> Here you have some code to read the sff files.
Thanks - I'm not sure when I'll get to look at this, maybe next week.
> For the time being it creates a dict for the sequences. I'm not sure about
> how to integrate the generated data in BioPython. The sequence and
> qualities should go to a SeqRecord, but there is also the information
> about the clipping.
For Bio.SeqIO, we would need to use a SeqRecord. Ideally we'd want to
be able to read and write SFF files, and to do that we'll have to record all
the essential annotation (i.e. clipping) somehow. Can you write SFF files?
> For my work I use a kind of SeqRecord with a mask property and the
> mask is a Location that shows which part of the sequence is ok. I don't
> know if that's a valid model for BioPython.
A mask could be done as a list of booleans, and we can treat it as
another per-letter-annotation in the SeqRecord. I'm not sure if this
is helpful or not.
The Roche tools let you choose to extract trimmed reads as FASTA
and QUAL, or untrimmed. Perhaps for reading SFF files with
Bio.SeqIO we should get the user to choose between these
options (e.g. format names "roche-sff" and "roche-sff-notrim")?
Roche's FASTA files use upper case for the trimmed region, and
lower case for the start/end which would get trimmed off. This is
simple and we could do this for Biopython too - meaning you'd get
the same data if you read the SFF file directly, or used Roche's
FASTA+QUAL files with SeqIO. Note that when reading an SFF
file directly, we should probably record the real trim data as well.
> In the extract_sff script we generated three files: the fasta sequences,
> the fasta qualities and the xml with the clippings.
> One option could be to clip the sequences, but I don't know if that's the
> desired behaviour in all cases.
Trimming is probably a sensible default. If we do give the untrimmed
sequences, we'd need a way to easily trim them.
> There's also a couple of more tricks with the clipping.
> In theory there's clip_qual and clip_adapter, but in the files
> we've seen clip_adapter is always zero and clip_quality is used
> instead for both quality and adapter. I think we could generate
> one clipping combining both. Let me know what do you think.
> Also take into account that in some cases the generated clipping
> from the 454 software are just wrong.
I'll need to learn more about the details before coming to any
conclusions about how to deal with this information in Biopython.
> If you want to forward this mail to the list you're more than welcome.
> Best regards,
>
> Jose Blanca
I've CC'd this reply to the list (without the python file attachments).
Regards,
Peter
From chapmanb at 50mail.com Fri Apr 17 09:23:34 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 17 Apr 2009 09:23:34 -0400
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <320fb6e00904130719l5b31f09eo39e0f867fbd6749@mail.gmail.com>
References: <20090408124908.GN43636@sobchak.mgh.harvard.edu>
<830379.9837.qm@web62402.mail.re1.yahoo.com>
<320fb6e00904130516x25bf24d8ib76ec36171e93fd8@mail.gmail.com>
<20090413133539.GD5429@sobchak.mgh.harvard.edu>
<320fb6e00904130719l5b31f09eo39e0f867fbd6749@mail.gmail.com>
Message-ID: <20090417132334.GA16092@sobchak.mgh.harvard.edu>
Peter, Michiel and Jared;
Thanks for the comments. My apologies for the late reply; I've been
sick the past few days and am trying to catch back up. All your
points from the different posts are consolidated below.
[Michiel]
> First, let's discuss how to represent the information contained in
> a GFF file. SeqRecords are good if the GFF file is associated with
> a Fasta file (or contains the sequence itself), but if not it seems
> to be a bit awkward. How about the following (and I think Peter was
> hinting at the same idea):
>
> The actual parser lives in Bio.GFF, and produces Bio.GFF.Record
> objects that closely resemble the GFF file structure. For example, we
> use the GFF specified fields (
> [attributes] [comments]) as attributes
> to Bio.GFF.Record objects.
The GFF parser right now is really generating SeqFeature objects for
each GFF line; the top level SeqRecords are a collection
that holds the individual features. The SeqFeature object is
pretty similar to GFF and the generic object you are proposing. For
instance, here is a GFF line and the relevant attributes from
SeqFeature for the line:
I Orfeome PCR_product 12759747 12764936 . - . PCR_product "mv_B0019.1" ; Amplified 1 ; Amplified 1
type: PCR_product
location: [12759746:12764936]
strand: -1
qualifiers:
Key: amplified, Value: ['1']
Key: pcr_product, Value: ['mv_B0019.1']
Key: source, Value: ['Orfeome']
Things are a bit more generalized as key/value pairs in qualifiers,
but the mapping straightforward. My only suggestion would be that we
add 'start' and 'end' accessors to SeqFeature that map to
feature.location.nofuzzy_start and feature.location.nofuzzy_end,
respectively. SeqFeature is more generalized, for GenBank location
nastiness, but we should make the common simple case simpler.
> Bio.SeqIO then uses the parser in Bio.GFF, and puts its information
> in the appropriate fields of a SeqRecord. Here, we have to think
> about two cases: Simply creating a SeqRecord based on the GFF file,
> and adding the information in the GFF file as annotations to a
> pre-existing set of SeqRecords.
Yes. Both of these cases are handled now -- a user can supply a
seed dictionary of SeqRecords to which SeqFeatures are added.
Alternatively, a new SeqRecord is created for features if one is not
provided.
> Users then have a choice to use Bio.SeqIO to get SeqRecords, or
> Bio.GFF to see the "raw" GFF data, depending on their needs.
>
> How does that sound?
So we could have two ways to access the GFF file:
- An iterator that returns SeqFeature objects for each line in the
file. No other processing is done.
- The higher level interface that we have been discussing, which
adds them to records and nests features.
My only question is concerning the nested features, like coding
sequences. This a very common GFF case (see
http://www.sequenceontology.org/gff3.shtml; The Canonical Gene
section for the GFF). A raw parser iterator cannot handle these as
it needs to read multiple lines to build the nested feature. Is this
still useful for the use cases you were thinking of?
[Peter]
> Hmm. I'm with you on the idea that you may need to parse a GFF file
> and a separate second file to get the actual sequence (e.g. a FASTA
> file), but there is more than one way to combine the two. For a
> single sequence, I was thinking more along the lines of:
>
> from Bio import SeqIO
> record = SeqIO.read(open("NC_000913.fna"),"fasta")
> record.features = SeqIO.read(open("NC_000913.gff"),"gff3").features
Make sense, but this only works for the case where you have a single
FASTA sequence and a single GFF file describing one record. This is a
special case for bacterial genomes and GFF from NCBI, but doesn't work
for other Eukaryotic GFFs and SOLiD GFF files. Do we want different ways
to use the parser for custom cases?
> If the FASTA and GFF file apply to multiple sequences (e.g. a set of
> contigs, rather than a single chromosome), and you have enough memory,
> then something using dictionaries should work:
>
> from Bio import SeqIO
> records = SeqIO.to_dict(SeqIO.read(open("NC_000913.fna"),"fasta"))
> for temp_rec in SeqIO.parse(open("NC_000913.gff"),"gff3") :
> records[temp_rec.id].features = temp_rec.features
Your intention makes good sense here, and this is more or less what
it is doing under the covers. Could we think about expanding SeqIO
to have functionality for this "adding to a record" case? Something
like:
from Bio import SeqIO
records = SeqIO.to_dict(SeqIO.read(open("NC_000913.fna"),"fasta"))
records = SeqIO.add_to_dict(records, open("NC_000913.gff"), "gff3")
This exposes less of the actual implementation details to the user.
> As you can probably tell, I am concentrating on getting this to match
> up well with the Bio.SeqIO framework. It will be nice to know the
> underlying Bio.GFF module has more options, but I expect most people
> to start with reading in a GFF file using Bio.SeqIO, and being able to
> transfer their existing knowledge of SeqFeature objects learnt from
> using Bio.SeqIO to read in GenBank files.
I'm really glad you are thinking about it from this angle. The limit
cases will be pretty common for real life work; most of the
eukaryotic GFF dumps from Ensembl or wherever are quite large and
are going to need some intelligent parsing to not get into memory
issues. I worry that if we try to put this right on top of the
existing SeqIO functionality, which deal with different kinds of
files, we are going to clutter the interface.
> I have pondered a "paired file iterator" function for Bio.SeqIO for
> dealing with FASTA+QUAL, FASTA+GFF, FASTA+PPT, etc, which would take
> TWO file handles and return SeqRecord objects. Interestingly all the
> examples thus far are FASTA+other. Anyway, this could be added later
> if need be.
I like the way you did this for FASTA/Qual files but am not sure if
would map nicely to GFF for the memory reasons mentioned above.
[MapReduce]
> Are you aware of any alternatives to disco for doing map/reduce on
> Python, and does that impact your design choices?
Jared is right on; Hadoop is the another MapReduce framework in wide use.
More generally, I agree with you; the distributed portion needs to
be generalized. Let's lock down the interface and local parsing, and
then I will circle around on that again.
Thanks all again for the thoughts,
Brad
From chapmanb at 50mail.com Fri Apr 17 09:30:20 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 17 Apr 2009 09:30:20 -0400
Subject: [Biopython-dev] docstrings, doctests and epydoc API pages
In-Reply-To: <320fb6e00904150619i3d5b4238tf00b29fc6aaeac43@mail.gmail.com>
References: <320fb6e00904150619i3d5b4238tf00b29fc6aaeac43@mail.gmail.com>
Message-ID: <20090417133020.GB16092@sobchak.mgh.harvard.edu>
Peter;
> As a test, I was able to update Bio/Seq.py to look good as epytext
> (while still being equally readable as plain text for when reading the
> API documentation at the python prompt with the help function). I
> uploaded one new page to the website:
>
> http://biopython.org/DIST/docs/api/Bio.Seq.Seq-class.html
I had some time where I was obsessed with making Biopython look
good in one of these API documentation modules (maybe HappyDoc,
back in the day). Eventually I came to the sad conclusion that not too
many people really seem to actually look at auto generated API docs.
Most will fire up the code in their favorite editor if they are
interested in the fine details.
So, I like the way this looks, but my vote is it is probably not
worth the cycles unless you are having fun with it. Also, be ready
get mad when the preferred method of markup changes from epytext to
structuredtext or someothertext.
Brad
From biopython at maubp.freeserve.co.uk Fri Apr 17 09:45:11 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 17 Apr 2009 14:45:11 +0100
Subject: [Biopython-dev] docstrings, doctests and epydoc API pages
In-Reply-To: <20090417133020.GB16092@sobchak.mgh.harvard.edu>
References: <320fb6e00904150619i3d5b4238tf00b29fc6aaeac43@mail.gmail.com>
<20090417133020.GB16092@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00904170645u463d0b4ej8be66735bd2889e3@mail.gmail.com>
On Fri, Apr 17, 2009 at 2:30 PM, Brad Chapman wrote:
> Peter;
>
>> As a test, I was able to update Bio/Seq.py to look good as epytext
>> (while still being equally readable as plain text for when reading the
>> API documentation at the python prompt with the help function). I
>> uploaded one new page to the website:
>>
>> http://biopython.org/DIST/docs/api/Bio.Seq.Seq-class.html
>
> I had some time where I was obsessed with making Biopython look
> good in one of these API documentation modules (maybe HappyDoc,
> back in the day). Eventually I came to the sad conclusion that not too
> many people really seem to actually look at auto generated API docs.
> Most will fire up the code in their favorite editor if they are
> interested in the fine details.
I agree that we don't push the API docs page enough (and indeed
corresponding built in documentation). This is a shame, as the
built in docstrings should really get more attention. To try and raise
their profile I've added links to the relevant pages from some of the
wiki pages to try and encourage people to look at them. There is
probably a cunning redirect link which will get the frames to work,
but I've just used deep linking on these pages for now:
http://biopython.org/wiki/Seq
http://biopython.org/wiki/SeqRecord
http://biopython.org/wiki/SeqIO
http://biopython.org/wiki/AlignIO
In fact, maybe we should simplify/remove these wiki pages and just
push the API pages and relevant cookbook wiki pages in their place?
Up until now, the wiki was nicer in that it looked better - with the
epydoc mark up that isn't the case. The API docs should be the
definitive documentation, in that the are kept up to date with the
code, and are under version control.
> So, I like the way this looks, but my vote is it is probably not
> worth the cycles unless you are having fun with it. Also, be ready
> get mad when the preferred method of markup changes from epytext to
> structuredtext or someothertext.
I know what you mean - the novelty has worn off now, and doing
further conversions is tedious. I like the idea of a tweak to epydoc
to do "plain text + automatic markup of doctests". If that existed
it would be a great default option for Biopython, as all I really care
about for the markup is getting the python doctests to look good.
Peter
From chapmanb at 50mail.com Fri Apr 17 10:02:41 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 17 Apr 2009 10:02:41 -0400
Subject: [Biopython-dev] Fwd: Where to put command line wrappers
In-Reply-To: <320fb6e00904161153h70f54b36p94c09e23733c941@mail.gmail.com>
References: <320fb6e00904160945p3e15b6daoaa7891dbf9e144ec@mail.gmail.com>
<320fb6e00904161016i63c6d149xf414725ee489bfdb@mail.gmail.com>
<8b34ec180904161034j3a5a778dsfbca8ba2a57d3e88@mail.gmail.com>
<8b34ec180904161037wea0373cnde2964b25f5c357@mail.gmail.com>
<320fb6e00904161153h70f54b36p94c09e23733c941@mail.gmail.com>
Message-ID: <20090417140241.GD16092@sobchak.mgh.harvard.edu>
Hi all;
[Where to put the commandline objects]
> > I think that there is a difference between EMBOSS and
> > Bio.[Motif|Align]. In EMBOSS we have a very nicely comoditized
> > set of tools with similar interfaces, while both for multiple
> > alignment and motif searching the tools vary a lot. In case of
> > multiple alignments this is only with respect to parameters and
> > output format, while in motif searching there is also a lot of
> > differences in the types of input (background models etc.).
>
> That is a good argument for using Bio/Align/Applications/XXX.py and
> Bio/Motif/Applications/XXX.py while also having
> Bio/EMBOSS/Applications.py
There is a natural tension between overgeneralizing and dumping
too much into one file. At one end you have deeply nested Java-like
directories with a few lines of code in each file. I tend towards the
"more in a single file and less nesting" camp. My vote would be that
if the Motif Applications file will only contain commandline
wrappers, they could live in one file.
[generic_run]
> > Motif finding programs usually output a lot less than there is input. Normally,
> > you don't want to see more than 10 motifs and each contributes ~1kb so
> > I don't see this as a huge problem in this case.
>
> I can see that Bio.Application.generic_run function is often handy,
> but sometimes it is quite inappropriate. For AlignAce obviously it
> has sufficed.
Yeah, generic_run is not as generic as it should be. It does have a
lot of hard fought logic for working with multiple python versions
and windows/unix. Could we make generic_run appropriate for the big
standard out cases so we don't end up duplicating that in
Blast/Clustalw/wherever runners?
Brad
From biopython at maubp.freeserve.co.uk Fri Apr 17 10:13:18 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 17 Apr 2009 15:13:18 +0100
Subject: [Biopython-dev] Fwd: Where to put command line wrappers
In-Reply-To: <20090417140241.GD16092@sobchak.mgh.harvard.edu>
References: <320fb6e00904160945p3e15b6daoaa7891dbf9e144ec@mail.gmail.com>
<320fb6e00904161016i63c6d149xf414725ee489bfdb@mail.gmail.com>
<8b34ec180904161034j3a5a778dsfbca8ba2a57d3e88@mail.gmail.com>
<8b34ec180904161037wea0373cnde2964b25f5c357@mail.gmail.com>
<320fb6e00904161153h70f54b36p94c09e23733c941@mail.gmail.com>
<20090417140241.GD16092@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00904170713p30dc4d51m284c897ec1b9b505@mail.gmail.com>
On Fri, Apr 17, 2009 at 3:02 PM, Brad Chapman wrote:
>>
>> I can see that Bio.Application.generic_run function is often handy,
>> but sometimes it is quite inappropriate. ?For AlignAce obviously it
>> has sufficed.
>
> Yeah, generic_run is not as generic as it should be. It does have a
> lot of hard fought logic for working with multiple python versions
> and windows/unix. Could we make generic_run appropriate for the big
> standard out cases so we don't end up duplicating that in
> Blast/Clustalw/wherever runners?
The AlignAce and Clustalw already call generic_run internally - and
for them it is fine. For BLAST, by default the output goes to
standard out, so generic run is a bad idea as this loads all of stdout
into memory. We may want to add some variations on generic_run for
this kind of usage, or say it is up to the user to deal with it as
appropriate for their setup.
Peter
From p.j.a.cock at googlemail.com Fri Apr 17 10:40:42 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 17 Apr 2009 15:40:42 +0100
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <20090417132334.GA16092@sobchak.mgh.harvard.edu>
References: <20090408124908.GN43636@sobchak.mgh.harvard.edu>
<830379.9837.qm@web62402.mail.re1.yahoo.com>
<320fb6e00904130516x25bf24d8ib76ec36171e93fd8@mail.gmail.com>
<20090413133539.GD5429@sobchak.mgh.harvard.edu>
<320fb6e00904130719l5b31f09eo39e0f867fbd6749@mail.gmail.com>
<20090417132334.GA16092@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00904170740y17aec13cta2137d6d45f43e7c@mail.gmail.com>
On Fri, Apr 17, 2009 at 2:23 PM, Brad Chapman wrote:
> Things are a bit more generalized as key/value pairs in qualifiers,
> but the mapping straightforward. My only suggestion would be that we
> add 'start' and 'end' accessors to SeqFeature that map to
> feature.location.nofuzzy_start and feature.location.nofuzzy_end,
> respectively. SeqFeature is more generalized, for GenBank location
> nastiness, but we should make the common simple case simpler.
The SeqFeature already has start and end "attributes", but they are
done with some magic in __getattr__, I was planning to update this
to use a modern python property get. I can't find an enhancement
bug on this so it may just have been on my mental to do list ;)
See also,
http://lists.open-bio.org/pipermail/biopython/2007-September/003703.html
Peter
From biopython at maubp.freeserve.co.uk Fri Apr 17 11:25:35 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 17 Apr 2009 16:25:35 +0100
Subject: [Biopython-dev] Propose: Adding an alias name (gb) for Genbank
in SeqIO
In-Reply-To: <320fb6e00904150401q209ae99id6746f2a0c4e3532@mail.gmail.com>
References: <320fb6e00904150240l1e6b424tdd34035256876226@mail.gmail.com>
<587664.25168.qm@web62402.mail.re1.yahoo.com>
<320fb6e00904150401q209ae99id6746f2a0c4e3532@mail.gmail.com>
Message-ID: <320fb6e00904170825w191f7c90p9cb7f175e3f5be17@mail.gmail.com>
On Wed, Apr 15, 2009 at 12:01 PM, Peter wrote:
> On Wed, Apr 15, 2009 at 11:57 AM, Michiel de Hoon wrote:
>>
>> I think it's nice to be consistent with NCBI, and I don't see a big
>> problem in having an alias for GenBank in SeqIO. At least,
>> having "gb" in Bio.Entrez but "genbank" in Bio.SeqIO would
>> go against the principle of least surprise.
>
> True.
OK, in the absence of any objections, I have added "gb" as an alias
for "genbank" in Bio.SeqIO:
Bio/SeqIO/__init__.py CVS revision 1.52
Tests/test_SeqIO_online.py revision 1.8
Tests/output/test_SeqIO_online CVS revision 1.4
Doc/Tutorial.tex CVS revision 1.229
Peter
From mjldehoon at yahoo.com Fri Apr 17 12:44:34 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Fri, 17 Apr 2009 09:44:34 -0700 (PDT)
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <20090417132334.GA16092@sobchak.mgh.harvard.edu>
Message-ID: <148828.89199.qm@web62404.mail.re1.yahoo.com>
--- On Fri, 4/17/09, Brad Chapman wrote:
> The GFF parser right now is really generating SeqFeature
> objects for each GFF line; the top level SeqRecords are a
> collection that holds the individual features. The SeqFeature
> object is pretty similar to GFF and the generic object you are
> proposing. For instance, here is a GFF line and the relevant
> attributes from SeqFeature for the line:
>
> I Orfeome PCR_product 12759747 12764936 . - . PCR_product "mv_B0019.1" ; Amplified 1 ; Amplified 1
>
> type: PCR_product
> location: [12759746:12764936]
> strand: -1
> qualifiers:
> Key: amplified, Value: ['1']
> Key: pcr_product, Value: ['mv_B0019.1']
> Key: source, Value: ['Orfeome']
>
Just to make I understand how this works, looking at your previous code example:
>>> from BCBio.GFF.GFFParser import GFFAddingIterator
>>> gff_iterator = GFFAddingIterator()
>>> rec_dict = gff_iterator.get_all_features(gff_file)
> The returned dictionary is like a dictionary from SeqIO.to_dict;
> keys are ids and values are SeqRecords.
What will be the key in rec_dict for the example GFF file above? Is that the "I" in the first column, as in
rec_dict["I"] = a SeqRecord with the SeqFeature you described above?
Best,
--Michiel
From bugzilla-daemon at portal.open-bio.org Fri Apr 17 13:03:59 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 17 Apr 2009 13:03:59 -0400
Subject: [Biopython-dev] [Bug 2812] Adding read method to NCBIXML (just like
SeqIO and SwissProt).
In-Reply-To:
Message-ID: <200904171703.n3HH3xrq015467@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2812
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-17 13:03 EST -------
Fixed in CVS (without the read/parse typo in the docstring suggested in comment
1).
Checking in Bio/Blast/NCBIXML.py;
/home/repository/biopython/biopython/Bio/Blast/NCBIXML.py,v <-- NCBIXML.py
new revision: 1.22; previous revision: 1.21
done
Checking in Tests/test_NCBIXML.py;
/home/repository/biopython/biopython/Tests/test_NCBIXML.py,v <--
test_NCBIXML.py
new revision: 1.7; previous revision: 1.6
done
Checking in Tests/test_NCBI_qblast.py;
/home/repository/biopython/biopython/Tests/test_NCBI_qblast.py,v <--
test_NCBI_qblast.py
new revision: 1.6; previous revision: 1.5
done
Checking in Tests/output/test_NCBIXML;
/home/repository/biopython/biopython/Tests/output/test_NCBIXML,v <--
test_NCBIXML
new revision: 1.6; previous revision: 1.5
done
RCS file: /home/repository/biopython/biopython/Tests/Blast/blastp_no_hits.xml,v
done
Checking in blastp_no_hits.xml;
/home/repository/biopython/biopython/Tests/Blast/blastp_no_hits.xml,v <--
blastp_no_hits.xml
initial revision: 1.1
done
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From biopython at maubp.freeserve.co.uk Fri Apr 17 13:16:55 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 17 Apr 2009 18:16:55 +0100
Subject: [Biopython-dev] Plan for Biopython 1.50 (final)
In-Reply-To: <320fb6e00904131106s70028e9el56d334fa732bddf8@mail.gmail.com>
References: <320fb6e00904131106s70028e9el56d334fa732bddf8@mail.gmail.com>
Message-ID: <320fb6e00904171016q40f99a3fjda75b3add17ab8c0@mail.gmail.com>
On Mon, Apr 13, 2009 at 7:06 PM, Peter wrote:
> Apart from these two points (documentation and EFetch), are there any
> issues regarding doing the official release of Biopython 1.50? ?I
> think we can aim for a release this week...
Other than a little more documentation polishing, I think we are ready for
Biopython 1.50 now. Thanks Bartek and Tiago for dealing with the
Bio.Motif and Bio.PopGen issues I raised so promptly :)
Are there any release blocking issues I've missed?
I was going to do it this evening before leaving work, but I'm tired and
wouldn't want to make any mistakes. Instead, I aim to do the release
this weekend, and make the Windows installers at some point on
Monday. The more rain we get this weekend, the more time I'll try
and spend on the docs first - otherwise the lawn needs cutting... ;)
I'll send out a warning email before hand - but until then please
feel free to check in documentation changes (including docstrings
and doctests).
We still don't have much on GenomeDiagram in the main tutorial, but I
have some plans to improve this. We also don't have the misc GC
related functions from the standalone GenomeDiagram which we might
add to Bio.SeqUtils, but I think that can wait till Biopython 1.51.
Bartek has made a start on the Bio.Motif documentation as a separate
"cookbook" LaTeX file (plus we have some basic docstrings done):
http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Doc/cookbook/motif/motif.tex?cvsroot=biopython
For the long term I think we want to get rid of these misc "cookbook"
documents (by moving their content), to focus on the main document
("Biopython Tutorial and Cookbook"), the docstrings, and in future
cookbook entries on the wiki (which can be more user driven).
Peter
From ogmaciel at gnome.org Fri Apr 17 13:23:07 2009
From: ogmaciel at gnome.org (Og Maciel)
Date: Fri, 17 Apr 2009 13:23:07 -0400
Subject: [Biopython-dev] Plan for Biopython 1.50 (final)
In-Reply-To: <320fb6e00904171016q40f99a3fjda75b3add17ab8c0@mail.gmail.com>
References: <320fb6e00904131106s70028e9el56d334fa732bddf8@mail.gmail.com>
<320fb6e00904171016q40f99a3fjda75b3add17ab8c0@mail.gmail.com>
Message-ID: <98a1f5280904171023p13c1e7a9o5686b451fd3da61c@mail.gmail.com>
On Fri, Apr 17, 2009 at 1:16 PM, Peter wrote:
>> Apart from these two points (documentation and EFetch), are there any
>> issues regarding doing the official release of Biopython 1.50? ?I
>> think we can aim for a release this week...
Cool! I have 1.50b packaged for Foresight Linux and will update it
once the new version is released. :)
Cheers,
--
Og B. Maciel
omaciel at foresightlinux.org
ogmaciel at gnome.org
ogmaciel at ubuntu.com
GPG Keys: D5CFC202
http://www.ogmaciel.com (en_US)
http://blog.ogmaciel.com (pt_BR)
From chapmanb at 50mail.com Fri Apr 17 16:05:58 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 17 Apr 2009 16:05:58 -0400
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <320fb6e00904170740y17aec13cta2137d6d45f43e7c@mail.gmail.com>
References: <20090408124908.GN43636@sobchak.mgh.harvard.edu>
<830379.9837.qm@web62402.mail.re1.yahoo.com>
<320fb6e00904130516x25bf24d8ib76ec36171e93fd8@mail.gmail.com>
<20090413133539.GD5429@sobchak.mgh.harvard.edu>
<320fb6e00904130719l5b31f09eo39e0f867fbd6749@mail.gmail.com>
<20090417132334.GA16092@sobchak.mgh.harvard.edu>
<320fb6e00904170740y17aec13cta2137d6d45f43e7c@mail.gmail.com>
Message-ID: <20090417200558.GC19290@sobchak.mgh.harvard.edu>
Peter and Michiel;
[start/end attributes on SeqFeatures]
> The SeqFeature already has start and end "attributes", but they are
> done with some magic in __getattr__, I was planning to update this
> to use a modern python property get. I can't find an enhancement
> bug on this so it may just have been on my mental to do list ;)
These attributes are on the FeatureLocation object. The whole
location hierarchy is a bit complicated to represent all of the
GenBank fuzziness, but it looks like:
SeqFeature -- has_a --> FeatureLocation -- has_two --> Positions (start, end)
So if you wanted to get a non-fuzzy start end, you need to do:
feature.location.nofuzzy_start, feature.location.nofuzzy_end
Your way above would be:
feature.location.start.position
So, I was thinking of hiding this Location/Position stuff from the
end user and just adding a start and end attribute directly on the
feature. For everyone that never touches fuzziness, this would make
more sense; it is also in line with making SeqFeature like Michiel's
proposed GFFRecord object.
[GFF to SeqFeature example]
> > I Orfeome PCR_product 12759747 12764936 . - . PCR_product "mv_B0019.1" ; Amplified 1 ; Amplified 1
> >
> > type: PCR_product
> > location: [12759746:12764936]
> > strand: -1
> > qualifiers:
> > Key: amplified, Value: ['1']
> > Key: pcr_product, Value: ['mv_B0019.1']
> > Key: source, Value: ['Orfeome']
> >
>
> Just to make I understand how this works, looking at your previous code example:
>
> >>> from BCBio.GFF.GFFParser import GFFAddingIterator
> >>> gff_iterator = GFFAddingIterator()
> >>> rec_dict = gff_iterator.get_all_features(gff_file)
>
> > The returned dictionary is like a dictionary from SeqIO.to_dict;
> > keys are ids and values are SeqRecords.
>
> What will be the key in rec_dict for the example GFF file above? Is that the "I" in the first column, as in
>
> rec_dict["I"] = a SeqRecord with the SeqFeature you described above?
Yes, that is exactly right. If we decide to have a SeqFeature
iterator, we should also add a 'rec_id' key/value pair to the
qualifiers that would map to the record -- chromosome 'I' in this
case. This would let the user do the mapping themselves.
Brad
From biopython at maubp.freeserve.co.uk Fri Apr 17 18:12:14 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 17 Apr 2009 23:12:14 +0100
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <20090417200558.GC19290@sobchak.mgh.harvard.edu>
References: <20090408124908.GN43636@sobchak.mgh.harvard.edu>
<830379.9837.qm@web62402.mail.re1.yahoo.com>
<320fb6e00904130516x25bf24d8ib76ec36171e93fd8@mail.gmail.com>
<20090413133539.GD5429@sobchak.mgh.harvard.edu>
<320fb6e00904130719l5b31f09eo39e0f867fbd6749@mail.gmail.com>
<20090417132334.GA16092@sobchak.mgh.harvard.edu>
<320fb6e00904170740y17aec13cta2137d6d45f43e7c@mail.gmail.com>
<20090417200558.GC19290@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00904171512n3ff0090dy8042b1c860cf5a2c@mail.gmail.com>
On 4/17/09, Brad Chapman wrote:
> Peter and Michiel;
>
> [start/end attributes on SeqFeatures]
>
> > The SeqFeature already has start and end "attributes", but they are
> > done with some magic in __getattr__, I was planning to update this
> > to use a modern python property get. I can't find an enhancement
> > bug on this so it may just have been on my mental to do list ;)
>
> These attributes are on the FeatureLocation object.
Sorry - yeah, you're right. I wasn't paying enough attention.
> The whole location hierarchy is a bit complicated to represent all
> of the GenBank fuzziness, but it looks like:
>
> SeqFeature -- has_a --> FeatureLocation -- has_two --> Positions (start, end)
>
And that's the nice case without sub-features and joins ;)
Peter
From mjldehoon at yahoo.com Sat Apr 18 00:28:09 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Fri, 17 Apr 2009 21:28:09 -0700 (PDT)
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <20090417200558.GC19290@sobchak.mgh.harvard.edu>
Message-ID: <252312.21376.qm@web62408.mail.re1.yahoo.com>
I tried this code to read a GFF file from miRBase, containing the genome positions of microRNAs in human. The good news is that the code works as advertised. At the same time, I think that for a basic parser (as opposed to a parser integrated with Bio.SeqIO), the SeqFeatures are way too complicated for my mind.
This is how I used the parser:
>>> from GFFParser import GFFAddingIterator
>>> gff_iterator = GFFAddingIterator()
>>> rec_dict = gff_iterator.get_all_features("Data/miRBase/hsa.gff")
# It would be better to pass a handle to get_all_features
# instead of a file name. The file may be gzipped or bzipped,
# or the user may want to read it from the internet.
>>> len(rec_dict['1'])
50
# fifty microRNAs on chromosome 1
>>> rec_dict['1'].features[0]
Bio.SeqFeature.SeqFeature(Bio.SeqFeature.FeatureLocation(Bio.SeqFeature.ExactPosition(20228),Bio.SeqFeature.ExactPosition(20366)), type='miRNA', strand=1, id='hsa-mir-1302-2')
>>> rec_dict['1'].features[0].qualifiers['ACC']
['MI0006363']
>>> rec_dict['1'].features[0].qualifiers['ID']
['hsa-mir-1302-2']
# This is still OK, though a bit more deeply nested than I would like.
>>> rec_dict['1'].features[0].location
Bio.SeqFeature.FeatureLocation(Bio.SeqFeature.ExactPosition(20228),Bio.SeqFeature.ExactPosition(20366))
>>> rec_dict['1'].features[0].location._start
Bio.SeqFeature.ExactPosition(20228)
# Am I supposed to use _start here? It looks like a private variable.
>>> rec_dict['1'].features[0].location._start.position
20228
# Too much typing for everyday usage. I don't think that I would use it.
For a basic parser, I like the _gff_line_map function much better. Applied to the first line in the GFF file, it returns
>>> result = _gff_line_map(line, params)
[('parent', {'quals': {'ACC': ['MI0006363'], 'ID': ['hsa-mir-1302-2']}, 'rec_id': '1', 'location': [20228, 20366], 'is_gff2': False, 'type': 'miRNA', 'id': 'hsa-mir-1302-2', 'strand': 1})]
>>> print result[0][1]
{'quals': {'ACC': ['MI0006363'], 'ID': ['hsa-mir-1302-2']}, 'rec_id': '1', 'location': [20228, 20366], 'is_gff2': False, 'type': 'miRNA', 'id': 'hsa-mir-1302-2', 'strand': 1}
which is exactly what I need, in (almost) the places where I'd expect them.
--Michiel
From biopython at maubp.freeserve.co.uk Sat Apr 18 09:54:44 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 18 Apr 2009 14:54:44 +0100
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <252312.21376.qm@web62408.mail.re1.yahoo.com>
References: <20090417200558.GC19290@sobchak.mgh.harvard.edu>
<252312.21376.qm@web62408.mail.re1.yahoo.com>
Message-ID: <320fb6e00904180654j7d686963yadb4982fff7eb4e3@mail.gmail.com>
On Sat, Apr 18, 2009 at 5:28 AM, Michiel de Hoon wrote:
>
> This is how I used the parser:
>
>>>> from GFFParser import GFFAddingIterator
>>>> gff_iterator = GFFAddingIterator()
>>>> rec_dict = gff_iterator.get_all_features("Data/miRBase/hsa.gff")
> # It would be better to pass a handle to get_all_features
> # instead of a file name. The file may be gzipped or bzipped,
> # or the user may want to read it from the internet.
>>>> len(rec_dict['1'])
> 50
> # fifty microRNAs on chromosome 1
>>>> rec_dict['1'].features[0]
> Bio.SeqFeature.SeqFeature(Bio.SeqFeature.FeatureLocation(Bio.SeqFeature.ExactPosition(20228),Bio.SeqFeature.ExactPosition(20366)), type='miRNA', strand=1, id='hsa-mir-1302-2')
>>>> rec_dict['1'].features[0].qualifiers['ACC']
> ['MI0006363']
>>>> rec_dict['1'].features[0].qualifiers['ID']
> ['hsa-mir-1302-2']
> # This is still OK, though a bit more deeply nested than I would like.
>>>> rec_dict['1'].features[0].location
> Bio.SeqFeature.FeatureLocation(Bio.SeqFeature.ExactPosition(20228),Bio.SeqFeature.ExactPosition(20366))
>>>> rec_dict['1'].features[0].location._start
> Bio.SeqFeature.ExactPosition(20228)
> # Am I supposed to use _start here? It looks like a private variable.
>>>> rec_dict['1'].features[0].location._start.position
> 20228
No, you are meant to use start, e.g.:
>>> print rec_dict['1'].features[0].location.start
20228
>>> rec_dict['1'].features[0].location.start.position
20228
This is what I was talking about in the earlier email on this thread,
the SeqFeature has start and end "attributes", but they are done
with some magic in __getattr__. I plan to update this to use a
modern python property get (so they will show up in dir(...) and
we can give them docstring), but don't recall filing a bug on this
issue yet.
See also,
http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005734.html
http://lists.open-bio.org/pipermail/biopython/2007-September/003703.html
Related to this, perhaps the position classes (and in particular
the ExactPosition class) should have an __int__ method, so you
can use the object directly (rather than messing about with
subproperties like .position). This should let you do the following
(untested):
record = ... #e.g. a SeqRecord from a GFF file or GenBank
feature = record.features[5] #for example
sub_seq = my_seq[feature.location.start:feature.location.end]
Coupled with a variation of Brad's suggestion of adding start
and end properties to the SeqFeature, if we make these act
as proxies for feature.location.start and feature.location.end
that would become just:
record = ...
feature = record.features[5] #for example
sub_seq = my_seq[feature.start:feature.end]
The fuzzy locations (from GenBank or EMBL files) would need
a bit of care, ideally matching how the NCBI do things (easily
checked by taking an NCBI GenBank files and comparing it to
the simpler locations given in their FASTA, PTT or GFF files).
Peter
From bugzilla-daemon at portal.open-bio.org Sat Apr 18 17:45:12 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 18 Apr 2009 17:45:12 -0400
Subject: [Biopython-dev] [Bug 2814] New: Use properties instead of
__getattr__ in FeatureLocation
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2814
Summary: Use properties instead of __getattr__ in FeatureLocation
Product: Biopython
Version: Not Applicable
Platform: All
OS/Version: All
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk
The SeqFeature's location (i.e. the FeatureLocation object) has start and end
"attributes", but they are done with some magic in __getattr__. We should use
a modern python property get (so they will show up in dir(...) and we can give
them docstrings)
See also,
http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005781.html
http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005734.html
http://lists.open-bio.org/pipermail/biopython/2007-September/003703.html
Patch to follow
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Sat Apr 18 17:47:59 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 18 Apr 2009 17:47:59 -0400
Subject: [Biopython-dev] [Bug 2814] Use properties instead of __getattr__ in
FeatureLocation
In-Reply-To:
Message-ID: <200904182147.n3ILlx88027985@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2814
------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-18 17:47 EST -------
Created an attachment (id=1278)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1278&action=view)
Patch to Bio/SeqFeature.py
This doesn't try and change the functionality or API at all.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From biopython at maubp.freeserve.co.uk Sat Apr 18 17:48:58 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 18 Apr 2009 22:48:58 +0100
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <320fb6e00904180654j7d686963yadb4982fff7eb4e3@mail.gmail.com>
References: <20090417200558.GC19290@sobchak.mgh.harvard.edu>
<252312.21376.qm@web62408.mail.re1.yahoo.com>
<320fb6e00904180654j7d686963yadb4982fff7eb4e3@mail.gmail.com>
Message-ID: <320fb6e00904181448u49e92549t70c3a23c1a0c4d4f@mail.gmail.com>
On Sat, Apr 18, 2009 at 2:54 PM, Peter wrote:
> This is what I was talking about in the earlier email on this thread,
> the SeqFeature has start and end "attributes", but they are done
> with some magic in __getattr__. I plan to update this to use a
> modern python property get (so they will show up in dir(...) and
> we can give them docstring), but don't recall filing a bug on this
> issue yet.
Filed now, Bug 2814 - Use properties instead of __getattr__ in FeatureLocation
http://bugzilla.open-bio.org/show_bug.cgi?id=2814
Something for after Biopython 1.50 is done.
Peter
From bugzilla-daemon at portal.open-bio.org Sat Apr 18 18:42:57 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 18 Apr 2009 18:42:57 -0400
Subject: [Biopython-dev] [Bug 2814] Use properties instead of __getattr__ in
FeatureLocation
In-Reply-To:
Message-ID: <200904182242.n3IMgvOq031013@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2814
------- Comment #2 from eric.talevich at gmail.com 2009-04-18 18:42 EST -------
(In reply to comment #1)
> Created an attachment (id=1278)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1278&action=view) [details]
> Patch to Bio/SeqFeature.py
Peter, you mentioned on the mailing list that this will be applied after the
1.50 release. Since Py2.3 support ends there also, you could use the newer
decorator style instead:
start = property(fget= lambda self : self._start,
doc="Start location (possibly a fuzzy position).")
becomes:
@property
def start(self):
"""Start location (possibly a fuzzy position)."""
return self._start
I think this is the preferred style for Python 2.4 and later.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From biopython at maubp.freeserve.co.uk Mon Apr 20 05:03:47 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 20 Apr 2009 10:03:47 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.50
Message-ID: <320fb6e00904200203h31f3976es76703a0cd85ee41a@mail.gmail.com>
On Fri, Apr 17, 2009 at 6:16 PM, Peter wrote:
>
> Are there any release blocking issues I've missed?
I'm going to assume not.
> I was going to do it this evening before leaving work, but I'm tired and
> wouldn't want to make any mistakes. ?Instead, I aim to do the release
> this weekend, and make the Windows installers at some point on
> Monday. ?The more rain we get this weekend, the more time I'll try
> and spend on the docs first - otherwise the lawn needs cutting... ;)
Well the good news is it didn't rain, I had a nice weekend, and cut
half the grass. The bad news is obviously I didn't do the Biopython
release, although I did work on the documentation. In addition to the
nice weather, my other excuse is I had forgotten I'd upgraded my old
laptop so I didn't have a Python 2.3 machine handy at home. ;)
> I'll send out a warning email before hand - but until then please
> feel free to check in documentation changes (including docstrings
> and doctests).
This is the CVS freeze email. I'm going to do the release in the next
hour or two.
> We still don't have much on GenomeDiagram in the main tutorial, but I
> have some plans to improve this. [...]
I got most of that done at the weekend :)
Peter
From biopython at maubp.freeserve.co.uk Mon Apr 20 08:11:18 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 20 Apr 2009 13:11:18 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.50
In-Reply-To: <320fb6e00904200203h31f3976es76703a0cd85ee41a@mail.gmail.com>
References: <320fb6e00904200203h31f3976es76703a0cd85ee41a@mail.gmail.com>
Message-ID: <320fb6e00904200511r794bc959i6b6719f406b5ca96@mail.gmail.com>
On Mon, Apr 20, 2009 at 10:03 AM, Peter wrote:
>
> This is the CVS freeze email. ?I'm going to do the release in the next
> hour or two.
>
Well, its done. CVS is tagged, the packages are online, I've updated
the wiki, the epydoc API pages, and the online copy of the tutorial.
You can use CVS again, but just in case there are any surprises in the
next few days which would force a re-release, minor changes only
please.
That just leaves the official announcement on the news page (which
will be echoed onto twitter automatically) and to the mailing lists.
I'll circulate a draft after lunch, unless one of our news coordinator
volunteers wants to write something? I realize I should have
suggested this earlier as this is short notice, and you are in
different time zones, but its worth a try.
For reference, here is the 1.50 beta announcement,
http://news.open-bio.org/news/2009/04/biopython-150-beta-released/
I can't find anything on
http://lists.open-bio.org/pipermail/biopython-announce/ or the main
list, so it looks like I forget that :( This might explain the
relatively low amount of feedback...
The NEWS and DEPRECATED files are here:
http://biopython.open-bio.org/SRC/biopython/NEWS
http://biopython.open-bio.org/SRC/biopython/DEPRECATED
Peter
From bugzilla-daemon at portal.open-bio.org Mon Apr 20 08:18:33 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 20 Apr 2009 08:18:33 -0400
Subject: [Biopython-dev] [Bug 2815] New: Bio.Application MUSCLE command line
interface
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2815
Summary: Bio.Application MUSCLE command line interface
Product: Biopython
Version: Not Applicable
Platform: PC
OS/Version: All
Status: NEW
Severity: enhancement
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: cymon.cox at gmail.com
Attached is a module to run the MUSCLE alignment programme based on the
Bio.Applications interface.
A couple of helper functions are included MuscleAlign and ProfileMuscleAlign.
Discussion on the dev-list suggests that helper functions are superfluous.
Maybe, but I thought I'd include them anyway. A couple of unittests are
included for the helper funcs.
C.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Mon Apr 20 08:19:38 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 20 Apr 2009 08:19:38 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application MUSCLE command line
interface
In-Reply-To:
Message-ID: <200904201219.n3KCJcSu009533@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2815
------- Comment #1 from cymon.cox at gmail.com 2009-04-20 08:19 EST -------
Created an attachment (id=1279)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1279&action=view)
MUSCLE module
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Mon Apr 20 08:21:19 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 20 Apr 2009 08:21:19 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application MUSCLE command line
interface
In-Reply-To:
Message-ID: <200904201221.n3KCLJjf009683@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2815
------- Comment #2 from cymon.cox at gmail.com 2009-04-20 08:21 EST -------
Created an attachment (id=1280)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1280&action=view)
unittest for MuscleAlign
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From chapmanb at 50mail.com Mon Apr 20 09:29:46 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 20 Apr 2009 09:29:46 -0400
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <320fb6e00904180654j7d686963yadb4982fff7eb4e3@mail.gmail.com>
References: <20090417200558.GC19290@sobchak.mgh.harvard.edu>
<252312.21376.qm@web62408.mail.re1.yahoo.com>
<320fb6e00904180654j7d686963yadb4982fff7eb4e3@mail.gmail.com>
Message-ID: <20090420132946.GB29652@sobchak.mgh.harvard.edu>
Michiel;
Thanks for trying this out and your thoughts.
> > # It would be better to pass a handle to get_all_features
> > # instead of a file name. The file may be gzipped or bzipped,
> > # or the user may want to read it from the internet.
Yes, this is the way it was originally designed. I changed to files to
be consistent with a distributed Disco implementation, which needs to be
fed a file instead of a handle. Your suggestion is a good one. Let me
give some thought to separating the interfaces, as handles would be more
consistent with the rest of Biopython.
[accessing start and end]
> >>> print rec_dict['1'].features[0].location.start
> 20228
> >>> rec_dict['1'].features[0].location.start.position
> 20228
[...]
> Coupled with a variation of Brad's suggestion of adding start
> and end properties to the SeqFeature, if we make these act
> as proxies for feature.location.start and feature.location.end
> that would become just:
>
> record = ...
> feature = record.features[5] #for example
> sub_seq = my_seq[feature.start:feature.end]
Thanks Peter, that's exactly right. Accessing the start and end
coordinates in SeqFeatures is unnecessarily cumbersome right now,
but can be fixed fairly simply. We should be able to get this in now
that 1.50 is rolled out. Eric's decorator way of doing this was very
nice.
> The fuzzy locations (from GenBank or EMBL files) would need
> a bit of care, ideally matching how the NCBI do things (easily
> checked by taking an NCBI GenBank files and comparing it to
> the simpler locations given in their FASTA, PTT or GFF files).
To be clear, start and end in SeqFeature would be integers and not
handle any fuzzy stuff. All of the representation is still there for
those actually dealing with fuzziness, but the top level attributes
would expose the coordinates nicely for the remaining 99% of cases.
> I think that for a basic parser (as opposed to a parser integrated with Bio.SeqIO),
> the SeqFeatures are way too complicated for my mind.
[...]
> For a basic parser, I like the _gff_line_map function much better.
> Applied to the first line in the GFF file, it returns
[...]
> which is exactly what I need, in (almost) the places where I'd expect them.
Does solving the start/end problem as described above help bridge the
gap between SeqFeatures and the custom representation? Are there other
usability issues you found? I would prefer to expose one data structure
and think SeqFeature can handle the data well. They scale to nested
cases, and will be familiar to those using features in SeqIO or BioSQL.
Brad
From dave.bridges at gmail.com Mon Apr 20 09:55:40 2009
From: dave.bridges at gmail.com (Dave Bridges)
Date: Mon, 20 Apr 2009 09:55:40 -0400
Subject: [Biopython-dev] Bio.Motif Suggestions
Message-ID: <49EC7EDC.2030809@gmail.com>
From an off-list conversation with Bartek
> > Is it possible to give a name to an instance, so that when you
> print, say to
> > fasta it retains that info
>
Yes and no... Motifs have a .name property which can be used for
storing names of motifs, but it is currently not used in fasta output.
BTW. fasta (and other) output functions changed recently in CVS, but
I didn't have time to update my branch in git. Please have a look
at the .format method of Motif class in the main branch. There are
also some (minor) changes in the tutorial, so you may want to merge
them back into your branch. Bio.Motif got refactored quite a bit (on
Peter's request), so you should update the code, but the API didn't
change too much.
Currently, the fasta output prints only Instance 1, Instance 2 and so
on in the ID field but it would be a trivial improvement to add motif
name there.
> > Is there an alphabet that accepts spaces which might be necessary for
> > correct alignment of a motif, and if so will that work with the rest of
> > motif.py?
>
That's a tougher one. It wasn't really needed so far (DNA motifs
rarely have spaces), but I guess that for protein motifs it's a very
important thing.
I have some code for doing that, but I will need to find it. I'll
write you later about it.
> > in to_horizontal_matrix/to_vertical_matrix is it possible to print
> out a
> > legend for the matrices (for ex. the alphabet letters and the position)
> > along the top and side.
>
No, not yet, but again, it would be a nice improvement (and easy to make).
cheers
Bartek
From biopython at maubp.freeserve.co.uk Mon Apr 20 10:35:15 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 20 Apr 2009 15:35:15 +0100
Subject: [Biopython-dev] Bio.Motif Suggestions
In-Reply-To: <49EC7EDC.2030809@gmail.com>
References: <49EC7EDC.2030809@gmail.com>
Message-ID: <320fb6e00904200735y1002ee71i1a2f11c664045567@mail.gmail.com>
On Mon, Apr 20, 2009 at 2:55 PM, Dave Bridges wrote:
>
>> > Is there an alphabet that accepts spaces which might be necessary for
>> > correct alignment of a motif, and if so will that work with the rest of
>> > motif.py?
>>
>
> That's a tougher one. It wasn't really needed so far (DNA motifs
> rarely have spaces), but I guess that for protein motifs it's a very
> important thing.
> I have some code for doing that, but I will need to find it. I'll
> write you later about it.
>
What would a space in a motif mean? Clearly something different from
a wildcard like N or X in nucleotide or protein sequences. Does it
mean a gap of variable length? If it means a gap of one character
then surely just using a "-" would be sensible (as used in multiple
sequence alignments), for which we have a gapped alphabet system
setup.
Note that there are some issues with the current Bio.Motif code and
alphabets, which should be addressed. For example, generic alphabets
don't have a letters property giving the list of expected letters, so
using set() on the sequences themselves might be more appropriate in
places.
Peter
From biopython at maubp.freeserve.co.uk Mon Apr 20 10:37:02 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 20 Apr 2009 15:37:02 +0100
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <20090420132946.GB29652@sobchak.mgh.harvard.edu>
References: <20090417200558.GC19290@sobchak.mgh.harvard.edu>
<252312.21376.qm@web62408.mail.re1.yahoo.com>
<320fb6e00904180654j7d686963yadb4982fff7eb4e3@mail.gmail.com>
<20090420132946.GB29652@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00904200737s71e0dfa2y3d7cfbf36324a79d@mail.gmail.com>
On Mon, Apr 20, 2009 at 2:29 PM, Brad Chapman wrote:
> Michiel;
> Thanks for trying this out and your thoughts.
>
>> > # It would be better to pass a handle to get_all_features
>> > # instead of a file name. The file may be gzipped or bzipped,
>> > # or the user may want to read it from the internet.
>
> Yes, this is the way it was originally designed. I changed to files to
> be consistent with a distributed Disco implementation, which needs to be
> fed a file instead of a handle. Your suggestion is a good one. Let me
> give some thought to separating the interfaces, as handles would be more
> consistent with the rest of Biopython.
I'd second that - definitely go with handles rather than filenames.
Peter
From biopython at maubp.freeserve.co.uk Mon Apr 20 10:55:21 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 20 Apr 2009 15:55:21 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.50
In-Reply-To: <320fb6e00904200511r794bc959i6b6719f406b5ca96@mail.gmail.com>
References: <320fb6e00904200203h31f3976es76703a0cd85ee41a@mail.gmail.com>
<320fb6e00904200511r794bc959i6b6719f406b5ca96@mail.gmail.com>
Message-ID: <320fb6e00904200755h4a32ed91pe2969810e68f256d@mail.gmail.com>
On Mon, Apr 20, 2009 at 1:11 PM, Peter wrote:
> That just leaves the official announcement on the news page (which
> will be echoed onto twitter automatically) and to the mailing lists.
> I'll circulate a draft after lunch, unless one of our news coordinator
> volunteers wants to write something? ?I realize I should have
> suggested this earlier as this is short notice, and you are in
> different time zones, but its worth a try.
And here is my draft - the HTML is just for the links on the news
site. Should we add something about the Entrez EFetch change
("genbank" to "gb")?
Peter
--
We are pleased to announce Biopython release 1.50, featuring some
significant additions since Biopython 1.49 was released late last
year.
GenomeDiagram
by Leighton Pritchard has been integrated into Biopython as the
Bio.Graphics.GenomeDiagram module.
A new module Bio.Motif has been added, which is intended to replace
the existing Bio.AlignAce and Bio.MEME modules. Also have a look at
Bio.ExPASy and the revised Prosite and Enzyme parsers.
As noted in a previous news posting, Bio.SeqIO can now read and
write FASTQ
and QUAL files used in second generation sequencing work. In
connection with this, our SeqRecord object has a
new dictionary attribute, letter_annotations, for
per-letter-annotation information like sequence quality scores or
secondary structure predictions. Also, the SeqRecord object can now be
sliced to give a new SeqRecord covering just part of the sequence.
Biopython 1.50 supports Python 2.3, 2.4, 2.5 and 2.6. However, this is
expected to be the final version to support Python 2.3 (see this previous
announcement ). Also, Biopython 1.50 should be the last release to
include our old deprecated parsing infrastructure (Martel and
Bio.Mindy).
We?ve also updated the Biopython
Tutorial and Cookbook (also available in PDF ),
and not just by adding our logo to the cover ;)
Thank you to everyone who tested the Biopython
1.50 beta release , and to all our contributors.
Source distributions and Windows installers are available from the downloads page on the Biopython website (biopython.org) .
-Peter on behalf of the Biopython developers
From bartek at rezolwenta.eu.org Mon Apr 20 11:04:44 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Mon, 20 Apr 2009 17:04:44 +0200
Subject: [Biopython-dev] Bio.Motif Suggestions
In-Reply-To: <8b34ec180904200800j52f9accdk27bb9c499c7b0761@mail.gmail.com>
References: <49EC7EDC.2030809@gmail.com>
<320fb6e00904200735y1002ee71i1a2f11c664045567@mail.gmail.com>
<8b34ec180904200800j52f9accdk27bb9c499c7b0761@mail.gmail.com>
Message-ID: <8b34ec180904200804o58d531ache7e3110f3b919cfa@mail.gmail.com>
On Mon, Apr 20, 2009 at 4:35 PM, Peter wrote:
> On Mon, Apr 20, 2009 at 2:55 PM, Dave Bridges wrote:
>>
>>> > Is there an alphabet that accepts spaces which might be necessary for
>>> > correct alignment of a motif, and if so will that work with the rest of
>>> > motif.py?
>>>
>>
>> That's a tougher one. It wasn't really needed so far (DNA motifs
>> rarely have spaces), but I guess that for protein motifs it's a very
>> important thing.
>> I have some code for doing that, but I will need to find it. I'll
>> write you later about it.
>>
>
> What would a space in a motif mean? ?Clearly something different from
> a wildcard like N or X in nucleotide or protein sequences. ?Does it
> mean a gap of variable length? ?If it means a gap of one character
> then surely just using a "-" would be sensible (as used in multiple
> sequence alignments), for which we have a gapped alphabet system
> setup.
>
I think that once we start talking about gapped motifs, we are really
talking about
multiple alignments on steroids. This hasn't been done so far because you don't
really need it for DNA motifs, but in case of protein motifs we need to make it
compatible with multiple alignments. I think it would be great to be
able to easily
convert multiple alignments into motifs. This would allow us to ?use
the power of
BIo.AlignIO for IO and Bio.Motif for searching and comparisons.The question is
how to design API for these ?functions. What about:
align= Bio.AlignIO.read(....)
motif=Bio.Motif.from_alignment(align)
...
> Note that there are some issues with the current Bio.Motif code and
> alphabets, which should be addressed. ?For example, generic alphabets
> don't have a letters property giving the list of expected letters, so
> using set() on the sequences themselves might be more appropriate in
> places.
Yes, I was using Bio.Motif only for DNA motifs myself, so there was
not much consideration
given to proper handling of alphabets. I'll need to clear it up now.
cheers
?Bartek
From bartek at rezolwenta.eu.org Mon Apr 20 11:08:57 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Mon, 20 Apr 2009 17:08:57 +0200
Subject: [Biopython-dev] CVS freeze for Biopython 1.50
In-Reply-To: <320fb6e00904200755h4a32ed91pe2969810e68f256d@mail.gmail.com>
References: <320fb6e00904200203h31f3976es76703a0cd85ee41a@mail.gmail.com>
<320fb6e00904200511r794bc959i6b6719f406b5ca96@mail.gmail.com>
<320fb6e00904200755h4a32ed91pe2969810e68f256d@mail.gmail.com>
Message-ID: <8b34ec180904200808p28a2ea9ew8e16a25ae3557744@mail.gmail.com>
Hi Peter,
Looks fine to me.
Thanks for your effort put into this release.
Bio.Motif certainly benefited from refactoring initiated by your
comments before the release.
cheers
Bartek
On Mon, Apr 20, 2009 at 4:55 PM, Peter wrote:
> On Mon, Apr 20, 2009 at 1:11 PM, Peter wrote:
>> That just leaves the official announcement on the news page (which
>> will be echoed onto twitter automatically) and to the mailing lists.
>> I'll circulate a draft after lunch, unless one of our news coordinator
>> volunteers wants to write something? ?I realize I should have
>> suggested this earlier as this is short notice, and you are in
>> different time zones, but its worth a try.
>
> And here is my draft - the HTML is just for the links on the news
> site. ?Should we add something about the Entrez EFetch change
> ("genbank" to "gb")?
>
> Peter
>
> --
>
> We are pleased to announce Biopython release 1.50, featuring some
> significant additions since Biopython 1.49 was released late last
> year.
>
> GenomeDiagram
> by Leighton Pritchard has been integrated into Biopython as the
> Bio.Graphics.GenomeDiagram module.
>
> A new module Bio.Motif has been added, which is intended to replace
> the existing Bio.AlignAce and Bio.MEME modules. Also have a look at
> Bio.ExPASy and the revised Prosite and Enzyme parsers.
>
> As noted in a previous news posting, href="http://biopython.org/wiki/SeqIO">Bio.SeqIO can now read and
> write FASTQ
> and QUAL files used in second generation sequencing work. In
> connection with this, our href="http://biopython.org/wiki/SeqRecord">SeqRecord object has a
> new dictionary attribute, letter_annotations, for
> per-letter-annotation information like sequence quality scores or
> secondary structure predictions. Also, the SeqRecord object can now be
> sliced to give a new SeqRecord covering just part of the sequence.
>
> Biopython 1.50 supports Python 2.3, 2.4, 2.5 and 2.6. However, this is
> expected to be the final version to support Python 2.3 (see this href="http://news.open-bio.org/news/2009/04/2008/11/biopython-and-python-26-and-python-23/">previous
> announcement ). Also, Biopython 1.50 should be the last release to
> include our old deprecated parsing infrastructure (Martel and
> Bio.Mindy).
>
> We?ve also updated the Biopython
> Tutorial and Cookbook (also available in href="http://biopython.org/DIST/docs/tutorial/Tutorial.pdf">PDF ),
> and not just by adding href="http://biopython.org/wiki/Logo">our logo to the cover ;)
>
> Thank you to everyone who tested the href="http://news.open-bio.org/news/2009/04/biopython-150-beta-released/">Biopython
> 1.50 beta release , and to all our contributors.
>
> Source distributions and Windows installers are available from the href="http://biopython.org/wiki/Download">downloads page on the href="http://biopython.org/">Biopython website (biopython.org) .
>
> -Peter on behalf of the Biopython developers
>
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>
>
--
Bartek Wilczynski
==================
Postdoctoral fellow
EMBL, Furlong group
Meyerhoffstrasse 1,
69012 Heidelberg,
Germany
tel: +49 6221 387 8433
From biopython at maubp.freeserve.co.uk Mon Apr 20 12:04:56 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 20 Apr 2009 17:04:56 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.50
In-Reply-To: <8b34ec180904200808p28a2ea9ew8e16a25ae3557744@mail.gmail.com>
References: <320fb6e00904200203h31f3976es76703a0cd85ee41a@mail.gmail.com>
<320fb6e00904200511r794bc959i6b6719f406b5ca96@mail.gmail.com>
<320fb6e00904200755h4a32ed91pe2969810e68f256d@mail.gmail.com>
<8b34ec180904200808p28a2ea9ew8e16a25ae3557744@mail.gmail.com>
Message-ID: <320fb6e00904200904x61cc3d5dj7d9f88c3ec3a7dd6@mail.gmail.com>
On Mon, Apr 20, 2009 at 4:08 PM, Bartek Wilczynski
wrote:
> Hi Peter,
>
> Looks fine to me.
Cool.
> Thanks for your effort put into this release.
Thanks. I'd forgotten how much work these can be - the Biopython 1.50
beta release seemed to go much more smoothly, but there I wasn't
aiming quite so high (e.g. it didn't have any GenomeDiagram
documentation in it, and I hadn't really looked at Bio.Motif in
detail). Michiel did offer this time round, but maybe next time it
should be someone else's turn to do the actual release bit? What I
mean is the project co-ordination is a bit nebulous, but the actual
mechanics of doing a release are fairly simple (assuming you have a
Windows machine already setup to do the installers), pretty well
documented, and that part could be delegated. See
http://biopython.org/wiki/Building_a_release
i.e. Maybe in a few months time I (or Michiel) can say "Right, CVS
freeze while XXX does the release", where person XXX gets to scan the
documentation, double check the NEWS files, check the unit tests etc,
before putting together the packages and uploading them to the server.
And maybe then hand over to our "News Coordinator" to do the release
announcement? Having more people involved will make it take a little
longer, but should mean less minor things get missed (e.g. a typo in
the NEWS file, or a broken unit test specific to a particular OS or
version of python).
> Bio.Motif certainly benefited from refactoring ?initiated by your
> comments before the release.
Well, I hope so :)
Peter
From biopython at maubp.freeserve.co.uk Mon Apr 20 13:27:00 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 20 Apr 2009 18:27:00 +0100
Subject: [Biopython-dev] history on github - where are the tags?
In-Reply-To: <320fb6e00903250328y19165a77t470124ce490cea3d@mail.gmail.com>
References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com>
<8b34ec180903241649p7e81a2cew6587512c0cef16f@mail.gmail.com>
<8b34ec180903241658k21a76269r789600f92c17fbbb@mail.gmail.com>
<320fb6e00903250328y19165a77t470124ce490cea3d@mail.gmail.com>
Message-ID: <320fb6e00904201027o501a0c67x2050538189685825@mail.gmail.com>
On Wed, Mar 25, 2009 at 11:28 AM, Peter wrote:
> On Tue, Mar 24, 2009 at 11:58 PM, Bartek Wilczynski wrote:
>> For the tags, they were not pushed to github before, because I didn't
>> know I need to specifically do it qith git push --tags.
>
> ...
> They also show up in github (near the top, drop down menu next to
> branches) and in gitx (and I assume other GUI clients).
Bartek fixed the tag issue, but I don't like how they show up in
github.
The most visible sign of the tags is in the downloads menu which
lets you get a source code bundle using that tag. If we could turn
that off I would - these bundles won't include the compiled PDF
and HTML documentation, and could cause confusion when
people have a problems and they just say they "downloaded
version X from the website".
My main concern is the tags don't appear to be shown when
looking at the history in github, which is the main reason I
wanted them in the first place. e.g.
http://github.com/biopython/biopython/commits/master/Bio/Blast/NCBIXML.py
Compare this to ViewCVS, which shows the tags in the history:
http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/Blast/NCBIXML.py?cvsroot=biopython
I find this very handy for investigating bugs, and much easier
than messing about at the command line with CVS. The fact
that I can do this from almost any networked computer in the
world is great for triaging bugs or responding to emails - it lets
me look back over the history with our releases clearly labeled.
So right now, the github history is a big step backwards for me.
As an alternative, I had a quick look at GitX (on the Mac) from
the GUI, they don't seem to have a history-of-one-file view, just
a global history. For how I have been using ViewCVS's history,
this is useless. However, interesting for a GUI tool, they have
a command line option which sort of does this, e.g.
$ gitx -- Bio/Blast/NCBIXML.py
Then the history shows all changes affecting the given file (or
path), but as you might guess from git's commit based design,
you also get shown other changes made in the same commit.
This is kind of nice, just different. But still no tags visible :(
Peter
P.S. Tags aside, the github history view hasn't been working
100% for me, e.g.
http://support.github.com/discussions/site/487-commit-history-sorry-this-commit-log-is-taking-too-long-to-generate
From biopython at maubp.freeserve.co.uk Mon Apr 20 13:36:36 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 20 Apr 2009 18:36:36 +0100
Subject: [Biopython-dev] history on github - where are the tags?
In-Reply-To: <320fb6e00904201027o501a0c67x2050538189685825@mail.gmail.com>
References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com>
<8b34ec180903241649p7e81a2cew6587512c0cef16f@mail.gmail.com>
<8b34ec180903241658k21a76269r789600f92c17fbbb@mail.gmail.com>
<320fb6e00903250328y19165a77t470124ce490cea3d@mail.gmail.com>
<320fb6e00904201027o501a0c67x2050538189685825@mail.gmail.com>
Message-ID: <320fb6e00904201036h4ff71aecmd971acfb9fd63410@mail.gmail.com>
On Mon, Apr 20, 2009 at 6:27 PM, Peter wrote:
> On Wed, Mar 25, 2009 at 11:28 AM, Peter wrote:
>> On Tue, Mar 24, 2009 at 11:58 PM, Bartek Wilczynski wrote:
>>> For the tags, they were not pushed to github before, because I didn't
>>> know I need to specifically do it qith git push --tags.
>>
>> ...
>> They also show up in github (near the top, drop down menu next to
>> branches) and in gitx (and I assume other GUI clients).
>
> Bartek fixed the tag issue, but I don't like how they show up in
> github.
>From some more reading this, it sounds like our CVS tags are
essentially turned into commit markers in git. See:
http://www.kernel.org/pub/software/scm/git/docs/user-manual.html#how-git-stores-references
http://book.git-scm.com/3_git_tag.html
This shouldn't rule out showing them in the history, but perhaps the
cvs to git migration confuses things...
Peter
From biopython at maubp.freeserve.co.uk Mon Apr 20 15:02:18 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 20 Apr 2009 20:02:18 +0100
Subject: [Biopython-dev] Biopython 1.50 released
Message-ID: <320fb6e00904201202j4bb9666es18c89136ce973a48@mail.gmail.com>
Dear all,
We are pleased to announce Biopython release 1.50, featuring some
significant additions since Biopython 1.49 was released late last
year.
GenomeDiagram by Leighton Pritchard has been integrated into Biopython
as the Bio.Graphics.GenomeDiagram module.
A new module Bio.Motif has been added, which is intended to replace
the existing Bio.AlignAce and Bio.MEME modules. Also have a look at
Bio.SwissProt and Bio.ExPASy and their revised parsers.
As noted in a previous news posting, Bio.SeqIO can now read and write
FASTQ and QUAL files used in second generation sequencing work. In
connection with this, our SeqRecord object has a new dictionary
attribute, letter_annotations, for per-letter-annotation information
like sequence quality scores or secondary structure predictions. Also,
the SeqRecord object can now be sliced to give a new SeqRecord
covering just part of the sequence.
Biopython 1.50 supports Python 2.3, 2.4, 2.5 and 2.6. However, this is
expected to be the final version to support Python 2.3 (see this
previous announcement). Also, Biopython 1.50 should be the last
release to include our old deprecated parsing infrastructure (Martel
and Bio.Mindy).
We?ve also updated the Biopython Tutorial and Cookbook (also available
in PDF), and not just by adding our logo to the cover ;)
http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
Thank you to everyone who tested the Biopython 1.50 beta release, and
to all our contributors.
Source distributions and Windows installers are available from the
downloads page on the Biopython website:
http://biopython.org/wiki/Download
-Peter, on behalf of the Biopython developers
P.S. This news post is online at
http://news.open-bio.org/news/2009/04/biopython-release-150/
You may wish to subscribe to our news feed. For RSS links etc, see:
http://biopython.org/wiki/News
From lpritc at scri.ac.uk Tue Apr 21 04:34:25 2009
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Tue, 21 Apr 2009 09:34:25 +0100
Subject: [Biopython-dev] Bio.Motif Suggestions
In-Reply-To: <8b34ec180904200804o58d531ache7e3110f3b919cfa@mail.gmail.com>
Message-ID:
Hi,
Some thoughts and a bit of a wishlist...
On 20/04/2009 16:04, "Bartek Wilczynski" wrote:
> On Mon, Apr 20, 2009 at 4:35 PM, Peter
> wrote:
>>
>> What would a space in a motif mean? ?Clearly something different from
>> a wildcard like N or X in nucleotide or protein sequences. ?Does it
>> mean a gap of variable length? ?If it means a gap of one character
>> then surely just using a "-" would be sensible (as used in multiple
>> sequence alignments), for which we have a gapped alphabet system
>> setup.
>>
> I think that once we start talking about gapped motifs, we are really
> talking about
> multiple alignments on steroids. This hasn't been done so far because you
> don't
> really need it for DNA motifs,
It might not be required for the motifs you've been working with, but we've
been doing profile-based searches for bipartite regulatory binding sites in
DNA. These sites have a variable-length spacer region, and so require
gapped alignments for building motifs. The spacer region consensus
(depending on the level of identity required for the consensus) is usually
composed of Ns.
I guess that this comes down to whether we choose to restrict the meaning of
"motif" to an ungapped string of symbols (including ambiguity) representing
nt/aa, or whether we want to permit the inclusion of variable-length gaps,
regions, or ambiguities in a PROSITE or regular expression-like manner (e.g.
C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H, GAACC.{17,21}AAC or
C{,3}A{3,5}TTTT). Although profile methods like HMMer can produce a
consensus output that looks like an ungapped string of symbols to represent
a motif, it doesn't capture important features of the HMM representation.
I think the latter representations are more useful, even if harder to
code/maintain. I think that leaving them out would be a glaring hole in
functionality, and that they're a target Biopython should aim for.
> I think it would be great to be
> able to easily
> convert multiple alignments into motifs. This would allow us to ?use
> the power of
> BIo.AlignIO for IO and Bio.Motif for searching and comparisons.The question is
> how to design API for these ?functions.
I agree. I think that there's another important question: what do we mean,
and need to do, when we talk about converting an alignment into a motif?
Consensus/majority and PSSM methods from a sequence alignment should be
straightforward to implement in Python - even for gapped alignments.
Including a representation of variable-length gaps might be a little more
difficult, and storing an HMM representation may be too much to manage
immediately. That's still three different types of object - with likely
different components to their interfaces - to be stored. In their
relationship to a source alignment, these representations could be
properties of a single alignment, or independent Bio.Motif objects (perhaps
each with a link back to their parent alignment).
The results of searches are also likely to be qualitatively different,
depending on the type of motif used for the search, and the results desired
by the user.
I think that, for anything other than simple searches (string search,
regex), we'd be on a hiding to nothing by implementing search methods within
Python. It's not likely to be as fast as dedicated search packages, and it
would be a headache for maintenance. So, with apologies if I missed this
part of the discussion or documentation, it seems to me that Bio.Motif could
be most powerful in the alignment/searching/comparison process as a 'broker'
within BioPython, providing a consistent API for interface with external
alignment/search/comparison applications that also permits programmatic
manipulation of the profile/HMM/alignment. E.g.
align = Bio.AlignIO.read(alignfilehandle)
consensus = align.build_consensus(threshold=0.9)
pssm = align.build_pssm()
hmmer = align.build_hmmer()
hmm = align.build_hmm(order=3)
Or
consensus = Bio.Motif.consensus_from_alignment(align, threshold=0.9)
pssm = Bio.Motif.build_pssm_from_alignment(align)
hmmer = Bio.Motif.build_hmmer_from_alignment(align)
hmm = Bio.Motif.build_hmm_from_alignment(align, order=3)
(which I don't think is as neat an interface, even if all
align.build_consensus does is call the Bio.Motif.consensus_from_alignment
method)
Followed by things like
pssm.consensus()
pssm.logo()
hmm.generate_sequence(length=100)
hmm.to_graphviz()
And then the consensus, pssm, hmm and hmmer objects could be used as input
to interfaces for the relevant applications.
Converting an alignment into an HMM for this purpose may itself benefit from
a call to HMMer's hmmbuild (and Pythonic representation of the data
structure), rather than implementation of an equivalent internal function -
even though I think one of those would be useful, too.
Cheers,
L.
--
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405
______________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.
The Scottish Crop Research Institute is a charitable company limited by guarantee.
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.
DISCLAIMER:
This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that
addressee.
If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on
this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system.
Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any).
______________________________________________________
From biopython at maubp.freeserve.co.uk Tue Apr 21 06:15:05 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 21 Apr 2009 11:15:05 +0100
Subject: [Biopython-dev] Python 2.3 support
Message-ID: <320fb6e00904210315i7acbcffy40314df645ec4eda@mail.gmail.com>
Hi all,
As we've been warning for the last couple of releases, Biopython 1.50
should be the last release to officially support Python 2.3. No one
has complained yet, but they may not have noticed. I suspect there may
be people out there using a local Biopython installation on an old
Linux/Unix computer where the system Python is rather old. For
Biopython 1.50 I added a warning to setup.py when run on Python 2.3 so
that may get more attention.
Given the small possibility that we may get need to do a fix release
with Python 2.3 support, I propose that we don't actively remove any
Python 2.3 support in CVS yet (maybe not until after Biopython 1.51?).
Any new modules that require Python 2.4+ to run would be OK, but I
would like to avoid breaking existing core functionality on Python 2.3
in the short term.
I know I'm dragging my feed on this, but being a bit cautious here
shouldn't hurt. Plus I have an ulterior motive: I'm one of the few
Biopython users still actually using Python 2.3! To be precise, this
now only on one machine at work - but this is the cluster head node.
However, an upgrade is planned in the next month or so, and once that
is done, maybe I'll relent and we can remove Python 2.3 support in CVS
;)
Peter
From bugzilla-daemon at portal.open-bio.org Tue Apr 21 07:05:48 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 21 Apr 2009 07:05:48 -0400
Subject: [Biopython-dev] [Bug 2817] New: Meta-bug for cleanup once we drop
Python 2.3 support
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2817
Summary: Meta-bug for cleanup once we drop Python 2.3 support
Product: Biopython
Version: Not Applicable
Platform: All
OS/Version: All
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk
We are going to drop support for Python 2.3, see:
http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005812.html
This means we can remove a number of workarounds in the code:
Python 2.4+ includes the built in set, so we can remove numerous uses of the
following:
#TODO - Remove this work around once we drop python 2.3 support
try:
set = set
except NameError:
from sets import Set as set
Python 2.4+ includes the subprocess module, so we can use this unconditionally
in Bio.Application.generic_run() etc.
Python 2.4+ includes support for generator expressions. We should update the
documentation examples as appropriate, and this may also allow some memory
optimizations in places.
Python 2.4+ will also allow us to update our property methods to use decorators
as suggested by Eric Talevich on Bug 2814.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Tue Apr 21 07:12:18 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 21 Apr 2009 07:12:18 -0400
Subject: [Biopython-dev] [Bug 2814] Use properties instead of __getattr__ in
FeatureLocation
In-Reply-To:
Message-ID: <200904211112.n3LBCILI021318@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2814
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-21 07:12 EST -------
(In reply to comment #2)
> Peter, you mentioned on the mailing list that this will be applied after the
> 1.50 release. Since Py2.3 support ends there also, you could use the newer
> decorator style instead:
>
> start = property(fget= lambda self : self._start,
> doc="Start location (possibly a fuzzy position).")
>
> becomes:
>
> @property
> def start(self):
> """Start location (possibly a fuzzy position)."""
> return self._start
>
>
> I think this is the preferred style for Python 2.4 and later.
Thanks for the suggestion Eric. That sounds like a good plan, but not yet. See
Bug 2917.
I've checked in this patch and am marking this bug as fixed. See:
Bio/SeqFeature.py CVS revision 1.17
Peter
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From mjldehoon at yahoo.com Tue Apr 21 07:12:20 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 21 Apr 2009 04:12:20 -0700 (PDT)
Subject: [Biopython-dev] SwissProt parsing inconsistency between Bio.SeqIO,
Bio.SwissProt
Message-ID: <393946.5637.qm@web62408.mail.re1.yahoo.com>
Dear all,
I've noticed an inconsistency between how Bio.SeqIO and Bio.SwissProt parse DE (description) lines in SwissProt files.
For these DE lines:
DE RecName: Full=11S globulin seed storage protein 2;
DE AltName: Full=11S globulin seed storage protein II;
DE AltName: Full=Alpha-globulin;
DE Contains:
DE RecName: Full=11S globulin seed storage protein 2 acidic chain;
DE AltName: Full=11S globulin seed storage protein II acidic chain;
DE Contains:
DE RecName: Full=11S globulin seed storage protein 2 basic chain;
DE AltName: Full=11S globulin seed storage protein II basic chain;
DE Flags: Precursor;
a SwissProt record created by Bio.SwissProt contains the following:
>>> print swiss_record.description
RecName: Full=11S globulin seed storage protein 2;
AltName: Full=11S globulin seed storage protein II;
AltName: Full=Alpha-globulin;
Contains:
RecName: Full=11S globulin seed storage protein 2 acidic chain;
AltName: Full=11S globulin seed storage protein II acidic chain;
Contains:
RecName: Full=11S globulin seed storage protein 2 basic chain;
AltName: Full=11S globulin seed storage protein II basic chain;
Flags: Precursor;
but a SeqRecord returned by Bio.SeqIO contains this:
>>> print seq_record.description
RecName: Full=11S globulin seed storage protein 2;
AltName: Full=11S globulin seed storage protein II;
AltName: Full=Alpha-globulin;
Contains:
RecName: Full=11S globulin seed storage protein 2 acidic chain;
AltName: Full=11S globulin seed storage protein II acidic chain;
Contains:
RecName: Full=11S globulin seed storage protein 2 basic chain;
AltName: Full=11S globulin seed storage protein II basic chain;
Flags: Precursor;
So Bio.SeqIO removes the spaces in front of the line, but Bio.SwissProt doesn't.
For consistency, I think it's better to decide on one of these two styles.
My preference is for the approach used by Bio.SwissProt. Any objections to modifying the code used by Bio.SeqIO?
--Michiel.
From p.j.a.cock at googlemail.com Tue Apr 21 07:26:00 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 21 Apr 2009 12:26:00 +0100
Subject: [Biopython-dev] SwissProt parsing inconsistency between
Bio.SeqIO, Bio.SwissProt
In-Reply-To: <393946.5637.qm@web62408.mail.re1.yahoo.com>
References: <393946.5637.qm@web62408.mail.re1.yahoo.com>
Message-ID: <320fb6e00904210426q19ea7a44xdb902661297ec855@mail.gmail.com>
On Tue, Apr 21, 2009 at 12:12 PM, Michiel de Hoon wrote:
>
> Dear all,
>
> I've noticed an inconsistency between how Bio.SeqIO and Bio.SwissProt parse DE (description) lines in SwissProt files.
>
> For these DE lines:
>
> DE ? RecName: Full=11S globulin seed storage protein 2;
> DE ? AltName: Full=11S globulin seed storage protein II;
> DE ? AltName: Full=Alpha-globulin;
> DE ? Contains:
> DE ? ? RecName: Full=11S globulin seed storage protein 2 acidic chain;
> DE ? ? AltName: Full=11S globulin seed storage protein II acidic chain;
> DE ? Contains:
> DE ? ? RecName: Full=11S globulin seed storage protein 2 basic chain;
> DE ? ? AltName: Full=11S globulin seed storage protein II basic chain;
> DE ? Flags: Precursor;
>
> a SwissProt record created by Bio.SwissProt contains the following:
>>>> print swiss_record.description
> RecName: Full=11S globulin seed storage protein 2;
> AltName: Full=11S globulin seed storage protein II;
> AltName: Full=Alpha-globulin;
> Contains:
> ?RecName: Full=11S globulin seed storage protein 2 acidic chain;
> ?AltName: Full=11S globulin seed storage protein II acidic chain;
> Contains:
> ?RecName: Full=11S globulin seed storage protein 2 basic chain;
> ?AltName: Full=11S globulin seed storage protein II basic chain;
> Flags: Precursor;
>
> but a SeqRecord returned by Bio.SeqIO contains this:
>
>>>> print seq_record.description
> RecName: Full=11S globulin seed storage protein 2;
> AltName: Full=11S globulin seed storage protein II;
> AltName: Full=Alpha-globulin;
> Contains:
> RecName: Full=11S globulin seed storage protein 2 acidic chain;
> AltName: Full=11S globulin seed storage protein II acidic chain;
> Contains:
> RecName: Full=11S globulin seed storage protein 2 basic chain;
> AltName: Full=11S globulin seed storage protein II basic chain;
> Flags: Precursor;
>
> So Bio.SeqIO removes the spaces in front of the line, but Bio.SwissProt doesn't.
> For consistency, I think it's better to decide on one of these two styles.
> My preference is for the approach used by Bio.SwissProt. Any objections to modifying the code used by Bio.SeqIO?
Have you got a link for the full record in your example?
For interaction with other Bio.SeqIO formats, I generally expect the
description to be a single line string (with no embedded newlines).
If you look at the (old) SwissProt files in our unit tests, the
current Bio.SeqIO behaviour makes sense - the DE line(s) just encode a
fairly short simple string. It looks like the SwissProt format has
changed, and we should be parsing the new extended DE lines more
carefully, and splitting these entries up and recording them in the
SeqRecord.annotations dictionary?
Peter
From bartek at rezolwenta.eu.org Tue Apr 21 07:29:39 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Tue, 21 Apr 2009 13:29:39 +0200
Subject: [Biopython-dev] Bio.Motif Suggestions
In-Reply-To:
References: <8b34ec180904200804o58d531ache7e3110f3b919cfa@mail.gmail.com>
Message-ID: <8b34ec180904210429g36d089a6h578dc0197a94516a@mail.gmail.com>
On Tue, Apr 21, 2009 at 10:34 AM, Leighton Pritchard wrote:
> Hi,
>
> Some thoughts and a bit of a wishlist...
These are always welcome. I can make no promises on timing of making
your wishes come true ;)
>>>
>> I think that once we start talking about gapped motifs, we are really
>> talking about
>> multiple alignments on steroids. This hasn't been done so far because you
>> don't
>> really need it for DNA motifs,
>
> It might not be required for the motifs you've been working with, but we've
> been doing profile-based searches for bipartite regulatory binding sites in
> DNA. ?These sites have a variable-length spacer region, and so require
> gapped alignments for building motifs. ?The spacer region consensus
> (depending on the level of identity required for the consensus) is usually
> composed of Ns.
Indeed There are dyadic motifs for some of transcription factors. So
far I was working
only under assumption that that the gap is not too variable (say 3-5
nucleotides) and
this you can fake by using multiple PWMs with different sizes of the gap e.g.:
CACnnnGTG
CACnnnnGTG
CACnnnnnGTG
But it is a workaround rather than a feature... I'd be also interested
in knowing about other
applications where maybe this assumption (small gaps) is violated. Are
there also motifs with multiple
gaps? Implementing this feature would probably require a separate
subclass of Motif, since
the internal implementation of searching would need to be different.
This is a very good feature request, I think it is worth implementing,
though currently
I have no time to do it properly. If You don't care too much about
efficiency, I could write
quickly this dyadic subclass with the implementation based on two
motif instances and a
variable gap.
>
> I guess that this comes down to whether we choose to restrict the meaning of
> "motif" to an ungapped string of symbols (including ambiguity) representing
> nt/aa, or whether we want to permit the inclusion of variable-length gaps,
> regions, or ambiguities in a PROSITE or regular expression-like manner (e.g.
> C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H, GAACC.{17,21}AAC or
> C{,3}A{3,5}TTTT). ?Although profile methods like HMMer can produce a
> consensus output that looks like an ungapped string of symbols to represent
> a motif, it doesn't capture important features of the HMM representation.
>
I think that you are touching on multiple issues here. I'll try to
answer them separately:
- gapped alignemnts are one thing. If we have a gap in one sequence
but not in the others
(frequent in protein motifs, not so much in DNA motifs) we just need a
way to sensibly use it
in creation of PWMs for searching
- dyadic motifs (gaps in otherwise ungapped alignments) are a
different issue, since we have a
gap in all instances, but it may have a variable length. see above.
-regular expressions are a different way of describing motifs. I think
that it is not a purpose of
Bio.Motif to compete with regexps, but it would be certainly valuable
to be able to have a possibility
of creating motifs from some sort of (simplified) regexps. This was,
to some extent, discussed in
a recent thread on Seq.startswith methods
-HMM motifs are totally different kind of beast. These guys introduce
dependencies between positions
(doable also with regexps) and there is currently no support for them
in Bio.Motif. It would be cool to have
support for them, but I'm not an expert here and it looks to me like a
lot of work (also probably the methods
of Bio.Motif are not exactly right for HMMs).
-finally, suporting prosite syntax seems to be depending on the
variable gap feature, but otherwise it's simple
an important input fomat to support.
> I think the latter representations are more useful, even if harder to
> code/maintain. ?I think that leaving them out would be a glaring hole in
> functionality, and that they're a target Biopython should aim for.
Usefulness is hard to define in abstract of a particular problem , so
this is arguable. It is certain that bio.Motif is
not complete suite for all kinds of motif analysis but i don't know of
any tool that is supporting alll these
types of motifs with a single API (if you know one, please tell me).
We should have ambitious goals, but
I wouldn't call it a glaring hole not to have what is currently not
available elsewhere...
>
>> I think it would be great to be able to easily
>> convert multiple alignments into motifs. This would allow us to ?use
>> the power of
>> BIo.AlignIO for IO and Bio.Motif for searching and comparisons.The question is
>> how to design API for these ?functions.
>
> I agree. ?I think that there's another important question: what do we mean,
> and need to do, when we talk about converting an alignment into a motif?
> Consensus/majority and PSSM methods from a sequence alignment should be
> straightforward to implement in Python - even for gapped alignments.
> Including a representation of variable-length gaps might be a little more
> difficult, and storing an HMM representation may be too much to manage
> immediately. ?That's still three different types of object - with likely
> different components to their interfaces - to be stored. ?In their
> relationship to a source alignment, these representations could be
> properties of a single alignment, or independent Bio.Motif objects (perhaps
> each with a link back to their parent alignment).
>
> The results of searches are also likely to be qualitatively different,
> depending on the type of motif used for the search, and the results desired
> by the user.
>
> I think that, for anything other than simple searches (string search,
> regex), we'd be on a hiding to nothing by implementing search methods within
> Python. ?It's not likely to be as fast as dedicated search packages, and it
> would be a headache for maintenance. ?So, with apologies if I missed this
What do you mean by searching here? Searching for a known motif or searching
for a new motif? And what dedicated packages you have on your mind?
> part of the discussion or documentation, it seems to me that Bio.Motif could
> be most powerful in the alignment/searching/comparison process as a 'broker'
> within BioPython, providing a consistent API for interface with external
> alignment/search/comparison applications that also permits programmatic
> manipulation of the profile/HMM/alignment. ?E.g.
>
That's definitely an important field, though I'm not sure if _the_
function for Bio.Motif.
I think that the most valuable thing would be to internalize some of
the compliexity of
different ways of using motifs in bioinformatics. My modest goal for
now is making protein
motifs first class citizens (meaning handling alphabets and gaps
properly etc. ).
The next thing would be to make bio.motif cooperate nicely with
- Bio.Seq (e.g seq.startswith etc.),
- Bio.Align (conversions from-to alignments)
which includes easy motif creation from simple formats like IUPAC and
simple regexps and
would correspond to the "broker" function if I understand it correctly.
Then I think it would be really cool to have spaced motifs, although
here we need to
be careful about performance.
> align = Bio.AlignIO.read(alignfilehandle)
> consensus = align.build_consensus(threshold=0.9)
> pssm = align.build_pssm()
> hmmer = align.build_hmmer()
> hmm = align.build_hmm(order=3)
>
> Or
>
> consensus = Bio.Motif.consensus_from_alignment(align, threshold=0.9)
> pssm = Bio.Motif.build_pssm_from_alignment(align)
> hmmer = Bio.Motif.build_hmmer_from_alignment(align)
> hmm = Bio.Motif.build_hmm_from_alignment(align, order=3)
>
I would guess that the first example is what would be actually used,
but it requires
the functions on the Motif.side to be available.
As for more specific things:
- I don't like the usage of PSSM and consensus here. these are just
different ways of
looking at a Motif.
-Also the difference between HMMer and HMM is unclear to me
(isn't hmmer a tool to make HMMS? Do we support HMMER in Biopython currently?)
But I'm not too concerned about HMMs at the moment.
I would rather think of something like:
align = Bio.AlignIO.read(alignfilehandle)
motif= align.build_motif()
followed by:
motif.consensus()
motif.search_pwm(seq)
motif.search_instances(seq)
motif.weblogo()
>
> And then the consensus, pssm, hmm and hmmer objects could be used as input
> to interfaces for the relevant applications.
>
I don't understand your idea of separating consensus from pssm motifs. These
are not fundamentally different. HMMs though are really different.
> Converting an alignment into an HMM for this purpose may itself benefit from
> a call to HMMer's hmmbuild (and Pythonic representation of the data
> structure), rather than implementation of an equivalent internal function -
> even though I think one of those would be useful, too.
>
Again, I'm not sure whether we have support for HMMer now (it was
mentioned on the
mailing-list once, but I don't know what happened to it).
But I agree it would be useful.
To summarize:
- thanks for so much input, I especially apreciate the input on possible usages
- I will work on the features I mentioned in the direction of unifying
the API for
DNA and protein motifs, and I would definitely appreciate any help from others
- The dyadic motifs (or more generally gapped motifs) are next, and
require taking
care of performance issues
- HMM support is currently further down on my to-do list, mostly because
It needs a rather different API. But once we have the "glue" functions
for motifs, we
can try to make similar "glue" functions for HMMs.
cheers
Bartek
From p.j.a.cock at googlemail.com Tue Apr 21 07:52:26 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 21 Apr 2009 12:52:26 +0100
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <20090420132946.GB29652@sobchak.mgh.harvard.edu>
References: <20090417200558.GC19290@sobchak.mgh.harvard.edu>
<252312.21376.qm@web62408.mail.re1.yahoo.com>
<320fb6e00904180654j7d686963yadb4982fff7eb4e3@mail.gmail.com>
<20090420132946.GB29652@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00904210452n6c17fcafl205abcfcc4e3edd1@mail.gmail.com>
On Mon, Apr 20, 2009 at 2:29 PM, Brad Chapman wrote:
> [accessing start and end]
>> >>> print rec_dict['1'].features[0].location.start
>> 20228
>> >>> rec_dict['1'].features[0].location.start.position
>> 20228
> [...]
>> Coupled with a variation of Brad's suggestion of adding start
>> and end properties to the SeqFeature, if we make these act
>> as proxies for feature.location.start and feature.location.end
>> that would become just:
>>
>> record = ...
>> feature = record.features[5] #for example
>> sub_seq = my_seq[feature.start:feature.end]
>
> Thanks Peter, that's exactly right.
Actually, it isn't - my mistake. Adding start and end properties to
the SeqFeature as proxies for feature.location.start and
feature.location.end wouldn't be a great idea. Currently
feature.location.start and features.location.end are position objects,
and even if they had an __int__ method you can't do this:
record[feature.location.start:record.feature.location.end]
or:
record.seq[feature.location.start:record.feature.location.end]
You would have to do this:
record[int(feature.location.start):int(record.feature.location.end)]
or:
record.seq[int(feature.location.start):int(record.feature.location.end)]
The above wouldn't work well for fuzzy locations, we're better off
with the current explicit option:
record[feature.location.start.position:record.feature.location.end.position]
or:
record.seq[feature.location.start.position:record.feature.location.end.position]
where if the user wants to they can take into account the fuzzy
details, such as adding record.feature.location.end.extension to the
end slice point.
----------------
Now the good news, we can instead simply using the FeatureLocation
shortcuts for (approximated) plain integers:
record[feature.location.nofuzzy_start:record.feature.location.nofuzzy_end]
or:
record.seq[feature.location.nofuzzy_start:record.feature.location.nofuzzy_end]
These methods already take into consideration fuzzy ends, and knows to
treat the start and end differently to get the wider feature.
So, a slight variation of the proposed internal details would be to
make SeqFeature.start and end proxies for
SeqFeature.location.nofuzzy_start and SeqFeature.location.nofuzzy_end
(i.e. plain integers), achieving the goal of just:
record[feature.start:record.feature.end]
or:
record.seq[feature.start:record.feature.location.end]
(Suitable for non-join features, and gives a reasonable approximation
for fuzzy locations).
> Accessing the start and end coordinates in SeqFeatures is unnecessarily
> cumbersome right now, but can be fixed fairly simply. We should be able
> to get this in now that 1.50 is rolled out.
> ...
> To be clear, start and end in SeqFeature would be integers and not
> handle any fuzzy stuff. All of the representation is still there for
> those actually dealing with fuzziness, but the top level attributes
> would expose the coordinates nicely for the remaining 99% of cases.
Right - and with the above correction that SeqFeature.start and end
would be proxies for SeqFeature.location.nofuzzy_start and
SeqFeature.location.nofuzzy_end, you would get plain integers, and
this should cover most use cases. At least for non-Eukaryotes ;)
>> I think that for a basic parser (as opposed to a parser integrated with Bio.SeqIO),
>> the SeqFeatures are way too complicated for my mind.
> [...]
>> For a basic parser, I like the _gff_line_map function much better.
>> Applied to the first line in the GFF file, it returns
> [...]
>> which is exactly what I need, in (almost) the places where I'd expect them.
>
> Does solving the start/end problem as described above help bridge the
> gap between SeqFeatures and the custom representation? Are there other
> usability issues you found? I would prefer to expose one data structure
> and think SeqFeature can handle the data well. They scale to nested
> cases, and will be familiar to those using features in SeqIO or BioSQL.
You must agree that SeqFeature and FeatureLocation objects are not
very lightweight. I understood that one of your goals with Bio.GFF
and map/reduce is to handle massive files, so surely it makes sense to
use a simple object structure here?
Peter
From mjldehoon at yahoo.com Tue Apr 21 07:55:36 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 21 Apr 2009 04:55:36 -0700 (PDT)
Subject: [Biopython-dev] SwissProt parsing inconsistency between
Bio.SeqIO, Bio.SwissProt
In-Reply-To: <320fb6e00904210426q19ea7a44xdb902661297ec855@mail.gmail.com>
Message-ID: <861995.42083.qm@web62406.mail.re1.yahoo.com>
> Have you got a link for the full record in your example?
>
You can find it here:
http://www.uniprot.org/uniprot/Q9XHP0.txt
> For interaction with other Bio.SeqIO formats, I generally
> expect the description to be a single line string (with no
> embedded newlines).
> It looks like the SwissProt format has changed, and we
> should be parsing the new extended DE lines more
> carefully, and splitting these entries up and recording
> them in the SeqRecord.annotations dictionary?
>
That sounds reasonable. The dictionary will have to be nested though. Something like this:
annotations["RecName"] = [{"Full": "11S globulin seed storage protein 2"}]
annotations["AltName"] = [{"Full": "11S globulin seed storage protein II"},
{"Full": "Alpha-globulin"}]
annotations["Contains"] = [{"RecName": {"Full": "11S globulin seed storage protein 2 acidic chain"}},
"AltName": {"Full": "Full=11S globulin seed storage protein II acidic chain"}},
{"RecName": {"Full": "11S globulin seed storage protein 2 basic chain"}},
"AltName": {"Full": "Full=11S globulin seed storage protein II basic chain"}},
]
annotations["Flags"] = "Precursor"
--Michiel
From p.j.a.cock at googlemail.com Tue Apr 21 08:04:44 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 21 Apr 2009 13:04:44 +0100
Subject: [Biopython-dev] SwissProt parsing inconsistency between
Bio.SeqIO, Bio.SwissProt
In-Reply-To: <861995.42083.qm@web62406.mail.re1.yahoo.com>
References: <320fb6e00904210426q19ea7a44xdb902661297ec855@mail.gmail.com>
<861995.42083.qm@web62406.mail.re1.yahoo.com>
Message-ID: <320fb6e00904210504g6c7f60f1o96129c9a6759c256@mail.gmail.com>
On Tue, Apr 21, 2009 at 12:55 PM, Michiel de Hoon wrote:
>
>> Have you got a link for the full record in your example?
>>
> You can find it here:
>
> http://www.uniprot.org/uniprot/Q9XHP0.txt
>
>> For interaction with other Bio.SeqIO formats, I generally
>> expect the description to be a single line string (with no
>> embedded newlines).
>
>> It looks like the SwissProt format has changed, and we
>> should be parsing the new extended DE lines more
>> carefully, and splitting these entries up and recording
>> them in the SeqRecord.annotations dictionary?
>>
> That sounds reasonable. The dictionary will have to be nested though. Something like this:
>
> annotations["RecName"] = [{"Full=11S globulin seed storage protein 2"]
> annotations["AltName"] = ["Full=11S globulin seed storage protein II", "Full=Alpha-globulin"]
> annotations["Contains"] = [{"RecName": {"Full": "11S globulin seed storage protein 2 acidic chain"}},
> ? ? ? ? ? ? ? ? ? ? ? ? ? ?"AltName": {"Full": "Full=11S globulin seed storage protein II acidic chain"}},
> ? ? ? ? ? ? ? ? ? ? ? ? ? {"RecName": {"Full": "11S globulin seed storage protein 2 basic chain"}},
> ? ? ? ? ? ? ? ? ? ? ? ? ? ?"AltName": {"Full": "Full=11S globulin seed storage protein II basic chain"}},
> ? ? ? ? ? ? ? ? ? ? ? ? ?]
> annotations["Flags"] = "Precursor"
>
Possible - but for BioSQL we couldn't store those dictionaries. A
list of strings should work, but isn't as elegant. Maybe something
along these lines?
annotations["RecName"] = ["Full: 11S globulin seed storage protein 2;"}]
annotations["AltName"] = ["Full: 11S globulin seed storage protein
II", "Full: Alpha-globulin"]
annotations["Contains"] = ["RecName: Full=11S globulin seed storage
protein 2 acidic chain;\nAltName: Full=11S globulin seed storage
protein II acidic chain;",
"RecName: Full=11S globulin seed storage protein 2 basic
chain;\nAltName: Full=11S globulin seed storage protein II basic
chain;"]
annotations["Flags"] = "Precursor"
Or for "Contains" just have a flat list of strings, one for each name
(here four names).
Or for "Contains" just drop the AltName entries, and simply have a
list of the RecName entries (here two names).
Peter
From bugzilla-daemon at portal.open-bio.org Tue Apr 21 08:13:04 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 21 Apr 2009 08:13:04 -0400
Subject: [Biopython-dev] [Bug 2818] New: Add start and end properties to
SeqFeature object
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2818
Summary: Add start and end properties to SeqFeature object
Product: Biopython
Version: Not Applicable
Platform: All
OS/Version: All
Status: NEW
Severity: enhancement
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk
An enhancment proposed on the mailing list would add start and end properties
to the SeqFeature returning plain integers (non-fuzzy approximations to the
start and end locations) suitable for slicing most parent sequences. Dealing
with a join location would still be tricky.
Example usage:
>>> from Bio import SeqIO
>>> record = SeqIO.read(open("NC_005816.gb"),"gb")
>>> feature = record.features[2]
>>> print feature
type: gene
location: [86:1109]
ref: None:None
strand: 1
qualifiers:
Key: db_xref, Value: ['GeneID:2767718']
Key: locus_tag, Value: ['YP_pPCP01']
>>> record[feature.start:feature.end]
SeqRecord(seq=Seq('ATGGTCACTTTTGAGACAGTTATGGAAATTAAAATCCTGCACAAGCAGGGAATG...TGA',
IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816', description='Yersinia
pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.',
dbxrefs=[])
>>> record.seq[feature.start:feature.end]
Seq('ATGGTCACTTTTGAGACAGTTATGGAAATTAAAATCCTGCACAAGCAGGGAATG...TGA',
IUPACAmbiguousDNA())
Patch to follow.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Tue Apr 21 08:16:17 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 21 Apr 2009 08:16:17 -0400
Subject: [Biopython-dev] [Bug 2818] Add start and end properties to
SeqFeature object
In-Reply-To:
Message-ID: <200904211216.n3LCGHWZ025657@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2818
------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-21 08:16 EST -------
Created an attachment (id=1281)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1281&action=view)
Patch to Bio/SeqFeature.py
Makes SeqFeature.start and end proxies for
SeqFeature.location.nofuzzy_start and SeqFeature.location.nofuzzy_end
(i.e. plain integers)
See also:
http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005818.html
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From p.j.a.cock at googlemail.com Tue Apr 21 08:17:41 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 21 Apr 2009 13:17:41 +0100
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <320fb6e00904210452n6c17fcafl205abcfcc4e3edd1@mail.gmail.com>
References: <20090417200558.GC19290@sobchak.mgh.harvard.edu>
<252312.21376.qm@web62408.mail.re1.yahoo.com>
<320fb6e00904180654j7d686963yadb4982fff7eb4e3@mail.gmail.com>
<20090420132946.GB29652@sobchak.mgh.harvard.edu>
<320fb6e00904210452n6c17fcafl205abcfcc4e3edd1@mail.gmail.com>
Message-ID: <320fb6e00904210517k63edc766xcb830a7150e4c5d1@mail.gmail.com>
On Tue, Apr 21, 2009 at 12:52 PM, Peter Cock wrote:
>> Accessing the start and end coordinates in SeqFeatures is unnecessarily
>> cumbersome right now, but can be fixed fairly simply. We should be able
>> to get this in now that 1.50 is rolled out.
>> ...
>> To be clear, start and end in SeqFeature would be integers and not
>> handle any fuzzy stuff. All of the representation is still there for
>> those actually dealing with fuzziness, but the top level attributes
>> would expose the coordinates nicely for the remaining 99% of cases.
>
> Right - and with the above correction that SeqFeature.start and end
> would be proxies for SeqFeature.location.nofuzzy_start and
> SeqFeature.location.nofuzzy_end, you would get plain integers, and
> this should cover most use cases. ?At least for non-Eukaryotes ;)
Patch for this proposal on Bug 2818,
http://bugzilla.open-bio.org/show_bug.cgi?id=2818
Peter
From bartek at rezolwenta.eu.org Tue Apr 21 08:17:55 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Tue, 21 Apr 2009 14:17:55 +0200
Subject: [Biopython-dev] history on github - where are the tags?
In-Reply-To: <8b34ec180904210516y21bec2e1r3294b2d15edf386f@mail.gmail.com>
References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com>
<8b34ec180903241649p7e81a2cew6587512c0cef16f@mail.gmail.com>
<8b34ec180903241658k21a76269r789600f92c17fbbb@mail.gmail.com>
<320fb6e00903250328y19165a77t470124ce490cea3d@mail.gmail.com>
<320fb6e00904201027o501a0c67x2050538189685825@mail.gmail.com>
<320fb6e00904201036h4ff71aecmd971acfb9fd63410@mail.gmail.com>
<8b34ec180904210434t2ee76e8bsc91af814f53e2df4@mail.gmail.com>
<320fb6e00904210457p6189e096m966becad772cd610@mail.gmail.com>
<8b34ec180904210516y21bec2e1r3294b2d15edf386f@mail.gmail.com>
Message-ID: <8b34ec180904210517i259762e7t343e0f773c939a15@mail.gmail.com>
On Tue, Apr 21, 2009 at 1:57 PM, Peter wrote:
> Maybe. ?We can double check this by creating a trivial project in
> github, doing a few commits, tag, commits, tag - and checking the
> github interface and also the GitX presentation. ?That should tell us
> if the issue is specific to our converted repository or not.
no it's not specific. You can find a toy repository here:
http://github.com/barwil/testing_tags/tree/master
(please don't consider this link a permanent one, I'll remove it soon.)
cheers
Bartek
From chapmanb at 50mail.com Tue Apr 21 08:20:45 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 21 Apr 2009 08:20:45 -0400
Subject: [Biopython-dev] Rolling new releases
In-Reply-To: <320fb6e00904200904x61cc3d5dj7d9f88c3ec3a7dd6@mail.gmail.com>
References: <320fb6e00904200203h31f3976es76703a0cd85ee41a@mail.gmail.com>
<320fb6e00904200511r794bc959i6b6719f406b5ca96@mail.gmail.com>
<320fb6e00904200755h4a32ed91pe2969810e68f256d@mail.gmail.com>
<8b34ec180904200808p28a2ea9ew8e16a25ae3557744@mail.gmail.com>
<320fb6e00904200904x61cc3d5dj7d9f88c3ec3a7dd6@mail.gmail.com>
Message-ID: <20090421122045.GD30529@sobchak.mgh.harvard.edu>
Hi Peter;
> http://biopython.org/wiki/Building_a_release
>
> i.e. Maybe in a few months time I (or Michiel) can say "Right, CVS
> freeze while XXX does the release", where person XXX gets to scan the
> documentation, double check the NEWS files, check the unit tests etc,
> before putting together the packages and uploading them to the server.
> And maybe then hand over to our "News Coordinator" to do the release
> announcement? Having more people involved will make it take a little
> longer, but should mean less minor things get missed (e.g. a typo in
> the NEWS file, or a broken unit test specific to a particular OS or
> version of python).
It would be great to have others involved in rolling releases.
BioPerl often passes the Release Manager hat around for release to
release, and perhaps we can get the same tradition going here. I
like the idea of people volunteering for this.
It would also be worth thinking about what the worst parts of
building the releases are and seeing if we can automate or eliminate
them. A few things that I can think of:
- Remove support for older python versions, which would eliminate
all those windows installers. I will write more about this in your
other thread.
- Eliminating the beta releases. Biopython is developed as stable
in Git/CVS, so gets testing that way on developer machines. Are we
getting enough feedback from betas to make them worthwhile?
- Automate building the docs nightly/weekly on biopython.org. If the
Tutorial/epydoc stuff is a lot of work, we could work up a script
and cron to eliminate this part.
That's from my fuzzy memory of rolling releases.
Brad
From chapmanb at 50mail.com Tue Apr 21 08:35:31 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 21 Apr 2009 08:35:31 -0400
Subject: [Biopython-dev] Python 2.3 support
In-Reply-To: <320fb6e00904210315i7acbcffy40314df645ec4eda@mail.gmail.com>
References: <320fb6e00904210315i7acbcffy40314df645ec4eda@mail.gmail.com>
Message-ID: <20090421123531.GE30529@sobchak.mgh.harvard.edu>
Hi Peter;
> As we've been warning for the last couple of releases, Biopython 1.50
> should be the last release to officially support Python 2.3. No one
> has complained yet, but they may not have noticed. I suspect there may
> be people out there using a local Biopython installation on an old
> Linux/Unix computer where the system Python is rather old. For
> Biopython 1.50 I added a warning to setup.py when run on Python 2.3 so
> that may get more attention.
Are we getting a lot of feedback that we need to keep supporting these
old versions? 2.3 was released in 2003, 2.4 in 2004, and 2.5 in 2006.
This means people who need anything prior to 2.5 haven't updated in over
3 years. I understand the problem of non-responsive sysadmins and what
not. However, we only have so many cycles for testing and coding; is it
worthwhile spending some on these problems?
One of the nice selling points of Python is that it's a dynamic language,
and I like using new features of the language as much as anyone.
Beyond the 2/3 split, it is very back compatible and I've never had
any problems moving even very large projects forward to new
versions.
Practically, I'd be for dropping 2.4 support in the next release and
being a bit more aggressive in general on moving upwards and onwards.
Brad
From p.j.a.cock at googlemail.com Tue Apr 21 08:43:11 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 21 Apr 2009 13:43:11 +0100
Subject: [Biopython-dev] Rolling new releases
In-Reply-To: <20090421122045.GD30529@sobchak.mgh.harvard.edu>
References: <320fb6e00904200203h31f3976es76703a0cd85ee41a@mail.gmail.com>
<320fb6e00904200511r794bc959i6b6719f406b5ca96@mail.gmail.com>
<320fb6e00904200755h4a32ed91pe2969810e68f256d@mail.gmail.com>
<8b34ec180904200808p28a2ea9ew8e16a25ae3557744@mail.gmail.com>
<320fb6e00904200904x61cc3d5dj7d9f88c3ec3a7dd6@mail.gmail.com>
<20090421122045.GD30529@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00904210543x79c021f3j3f3272841b16d1c6@mail.gmail.com>
On Tue, Apr 21, 2009 at 1:20 PM, Brad Chapman wrote:
> It would also be worth thinking about what the worst parts of
> building the releases are and seeing if we can automate or eliminate
> them. A few things that I can think of:
>
> - Remove support for older python versions, which would eliminate
> ?all those windows installers. I will write more about this in your
> ?other thread.
That makes almost no difference, its just one extra line to do at the
command line:
c:\python23\python setup.py bdist_wininst
c:\python24\python setup.py bdist_wininst
c:\python25\python setup.py bdist_wininst
c:\python26\python setup.py bdist_wininst
Yes, you also have to build and test on each version of python, but
honestly, once the build environment is setup doing the Windows
release on three versus four versions of Python isn't worth worrying
about.
> - Eliminating the beta releases. Biopython is developed as stable
> ?in Git/CVS, so gets testing that way on developer machines. Are we
> ?getting enough feedback from betas to make them worthwhile?
For Biopython's move from Numeric to NumPy, I think doing a beta was
worthwhile. Maybe the feedback from the 1.50 beta release wasn't that
big, but it didn't take that much effort, and it focused us ready for
Biopython 1.50 well. Beta releases are also good for any Windows
users, for whom setting up the build environment is quite a hurdle, so
running the latest code from the repository is more difficult. Beta
releases also give us more press coverage - and gives us a clear way
to ask people to try out particular new stuff.
> - Automate building the docs nightly/weekly on biopython.org. If the
> ?Tutorial/epydoc stuff is a lot of work, we could work up a script
> ?and cron to eliminate this part.
Again, building the docs is pretty trivial. We have in the past
deliberately NOT updated the online copies, so that it is in sync with
the latest release. I suppose we could have two copies on the
website, the "latest release" and the "nightly code".
Peter
From chapmanb at 50mail.com Tue Apr 21 08:44:49 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 21 Apr 2009 08:44:49 -0400
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <320fb6e00904210452n6c17fcafl205abcfcc4e3edd1@mail.gmail.com>
References: <20090417200558.GC19290@sobchak.mgh.harvard.edu>
<252312.21376.qm@web62408.mail.re1.yahoo.com>
<320fb6e00904180654j7d686963yadb4982fff7eb4e3@mail.gmail.com>
<20090420132946.GB29652@sobchak.mgh.harvard.edu>
<320fb6e00904210452n6c17fcafl205abcfcc4e3edd1@mail.gmail.com>
Message-ID: <20090421124449.GF30529@sobchak.mgh.harvard.edu>
Hi Peter;
[...fuzzy handling...]
> Right - and with the above correction that SeqFeature.start and end
> would be proxies for SeqFeature.location.nofuzzy_start and
> SeqFeature.location.nofuzzy_end, you would get plain integers, and
> this should cover most use cases. At least for non-Eukaryotes ;)
Yes, that was my proposal. Thanks for fleshing it out and for the
patch.
> > Does solving the start/end problem as described above help bridge the
> > gap between SeqFeatures and the custom representation? Are there other
> > usability issues you found? I would prefer to expose one data structure
> > and think SeqFeature can handle the data well. They scale to nested
> > cases, and will be familiar to those using features in SeqIO or BioSQL.
>
> You must agree that SeqFeature and FeatureLocation objects are not
> very lightweight. I understood that one of your goals with Bio.GFF
> and map/reduce is to handle massive files, so surely it makes sense to
> use a simple object structure here?
Unless you are thinking of having an object representation as being too
heavy, the non-light part of SeqFeature is all the FeatureLocation
fuzziness.
I would be for a SeqFeatureLite class that is API compatible with
SeqFeature (with the new start/end attributes) and does not support
fuzzy locations. This would handle GFF understandably, be lightweight,
and allow access to BioSQL and SeqIO. How does this sound?
Brad
From p.j.a.cock at googlemail.com Tue Apr 21 08:56:23 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 21 Apr 2009 13:56:23 +0100
Subject: [Biopython-dev] Python 2.3 support
In-Reply-To: <20090421123531.GE30529@sobchak.mgh.harvard.edu>
References: <320fb6e00904210315i7acbcffy40314df645ec4eda@mail.gmail.com>
<20090421123531.GE30529@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00904210556i7fd3efe3pe74a6773d367a3e9@mail.gmail.com>
On Tue, Apr 21, 2009 at 1:35 PM, Brad Chapman wrote:
> Hi Peter;
>
>> As we've been warning for the last couple of releases, Biopython 1.50
>> should be the last release to officially support Python 2.3. ?No one
>> has complained yet, but they may not have noticed. I suspect there may
>> be people out there using a local Biopython installation on an old
>> Linux/Unix computer where the system Python is rather old. For
>> Biopython 1.50 I added a warning to setup.py when run on Python 2.3 so
>> that may get more attention.
>
> Are we getting a lot of feedback that we need to keep supporting these
> old versions? 2.3 was released in 2003, 2.4 in 2004, and 2.5 in 2006.
> This means people who need anything prior to 2.5 haven't updated in over
> 3 years. I understand the problem of non-responsive sysadmins and what
> not. However, we only have so many cycles for testing and coding; is it
> worthwhile spending some on these problems?
Until recently I have a very strong personal interest in keeping
Biopython running on Python 2.3, so I never regarded this as "wasted
cycles".
My personal Windows machine ran Python 2.3 and MSCV 6.0. In order to
update the python version and continue to compile Biopython, I would
also have had to replace the compiler etc. and the hard drive was
pretty full so this didn't appeal. I have recently been trying Ubuntu
on this machine instead (on a second hard drive).
For reference, my current (only) Windows machine (at work) has Python
2.3, 2.4 and 2.5 for which I use mingw32 to compile Biopython (same
setup as Michiel), plus Python 2.6 for which I'm using Microsoft's
free VC++ 2008 Express Edition from
http://www.microsoft.com/express/download/
> Practically, I'd be for dropping 2.4 support in the next release and
> being a bit more aggressive in general on moving upwards and onwards.
I wouldn't support that. I would insist on giving at least one
release's notice as a minimum.
Peter
From p.j.a.cock at googlemail.com Tue Apr 21 09:05:23 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 21 Apr 2009 14:05:23 +0100
Subject: [Biopython-dev] SeqFeature and FeatureLocation objects (was Bio.GFF)
Message-ID: <320fb6e00904210605t5a5af0ek7b2e4e8cb549cd2d@mail.gmail.com>
On Tue, Apr 21, 2009 at 1:44 PM, Brad Chapman wrote:
>> You must agree that SeqFeature and FeatureLocation objects are not
>> very lightweight. ?I understood that one of your goals with Bio.GFF
>> and map/reduce is to handle massive files, so surely it makes sense to
>> use a simple object structure here?
>
> Unless you are thinking of having an object representation as being too
> heavy, the non-light part of SeqFeature is all the FeatureLocation
> fuzziness.
Fair point.
> I would be for a SeqFeatureLite class that is API compatible with
> SeqFeature (with the new start/end attributes) and does not support
> fuzzy locations. This would handle GFF understandably, be lightweight,
> and allow access to BioSQL and SeqIO. How does this sound?
I have also been thinking about how I would (re)design the SeqFeature
and FeatureLocation objects. In particular I would want to put the
strand as part of the same object as the location, and also any
join-locations. I would still want to cope with fuzzy locations, but
make the non-fuzzy approximations more prominent in comparison. Also,
I really don't like the way joins are currently stored as more
SeqFeatures in the sub_features list (plus this kind of blocks
alternative usage for child/parent nesting that might be nice for GFF
files).
The prime use case to keep in mind is taking a feature location (even
a join), and using this to extract that region of nucleotides from the
parent sequence (i.e. a Seq object or a SeqRecord object, as now both
can be sliced).
Peter
From dalloliogm at gmail.com Tue Apr 21 09:25:52 2009
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Tue, 21 Apr 2009 15:25:52 +0200
Subject: [Biopython-dev] Python 2.3 support
In-Reply-To: <320fb6e00904210556i7fd3efe3pe74a6773d367a3e9@mail.gmail.com>
References: <320fb6e00904210315i7acbcffy40314df645ec4eda@mail.gmail.com>
<20090421123531.GE30529@sobchak.mgh.harvard.edu>
<320fb6e00904210556i7fd3efe3pe74a6773d367a3e9@mail.gmail.com>
Message-ID: <5aa3b3570904210625v78e604c4j87c4be62b9a0488c@mail.gmail.com>
On Tue, Apr 21, 2009 at 2:56 PM, Peter Cock wrote:
> On Tue, Apr 21, 2009 at 1:35 PM, Brad Chapman wrote:
> > Hi Peter;
> >
> >> As we've been warning for the last couple of releases, Biopython 1.50
> >> should be the last release to officially support Python 2.3. No one
> >> has complained yet, but they may not have noticed.
>
I know of many people (a whole lab) which until recently were still using
python 2.3.
However, please, drop support for these older version or people won't never
upgrade :)
--
My blog on bioinformatics (now in English): http://bioinfoblog.it
From p.j.a.cock at googlemail.com Tue Apr 21 09:51:26 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 21 Apr 2009 14:51:26 +0100
Subject: [Biopython-dev] SeqFeature and FeatureLocation objects (was
Bio.GFF)
In-Reply-To: <320fb6e00904210605t5a5af0ek7b2e4e8cb549cd2d@mail.gmail.com>
References: <320fb6e00904210605t5a5af0ek7b2e4e8cb549cd2d@mail.gmail.com>
Message-ID: <320fb6e00904210651j3b3cf753m1a143b490e7370c1@mail.gmail.com>
On Tue, Apr 21, 2009 at 1:44 PM, Brad Chapman wrote:
> Unless you are thinking of having an object representation as being too
> heavy, the non-light part of SeqFeature is all the FeatureLocation
> fuzziness.
I've just had a quick go at what should be a 100% backwards compatible
modification to the FeatureLocation class to store ExactPosition start
or end positions as integers. The idea should be more memory
efficient, using the complex position objects only when required.
The new __init__ method would look like this:
def __init__(self, start, end):
"""Specify the start and end of a sequence feature."""
#Keeps exact locations as plain integers
#Calculates the non-fuzzy versions now so make accessing
#them simpler and faster (expected to be used more often)
if isinstance(start, int) or isinstance(start, long):
self._start = None
self._start_int_nofuzzy = start
elif isinstance(start, ExactPosition) :
#Don't need to keep the full object
self._start = None
self._start_int_nofuzzy = start.position
else :
assert isinstance(start, AbstractPosition), repr(start)
self._start = start
self._start_int_nofuzzy = min(start.position,
start.position + start.extension)
if isinstance(end, int) or isinstance(end, long) :
self._end = None
self._end_int_nofuzzy = end
elif isinstance(end, ExactPosition) :
#Don't need to keep the full object
self._end = None
self._end_int_nofuzzy = end.position
else :
assert isinstance(end, AbstractPosition), repr(end)
self._end = end
self._end_int_nofuzzy = max(end.position,
end.position + end.extension)
The associated methods are then updated accordingly. When a position
object is requested, self._start or self._end is used (if it is not
None, when an ExactPosition is generated on the fly from the integer
self.self._start_int_nofuzzy or self._end_int_nofuzzy). When the
non-fuzzy integer approximation is wanted (the typical use case), we
have those cached as the integers.
The unit tests all pass (except test_BioSQL_SeqIO.py), but we'd need
to have some sort of benchmark to demonstrate any memory gains in
order to justify this kind of change. Maybe try it with Brad's GFF
parser on a very large file? I could stick the full patch on Bugzilla
(or perhaps github) is this sounds worth pursuing...
An alternative implementation would use a single private variable to
store either the integer position or the position object, and check
the type when the public properties are accessed. This should be an
even bigger memory saving, but may be slower.
Peter
From p.j.a.cock at googlemail.com Tue Apr 21 09:55:23 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 21 Apr 2009 14:55:23 +0100
Subject: [Biopython-dev] SeqFeature and FeatureLocation objects (was
Bio.GFF)
In-Reply-To: <320fb6e00904210605t5a5af0ek7b2e4e8cb549cd2d@mail.gmail.com>
References: <320fb6e00904210605t5a5af0ek7b2e4e8cb549cd2d@mail.gmail.com>
Message-ID: <320fb6e00904210655u5f497227o24457ec2ee3d711f@mail.gmail.com>
> I have also been thinking about how I would (re)design the SeqFeature
> and FeatureLocation objects. ?In particular I would want to put the
> strand as part of the same object as the location, and also any
> join-locations. ?I would still want to cope with fuzzy locations, but
> make the non-fuzzy approximations more prominent in comparison. ?Also,
> I really don't like the way joins are currently stored as more
> SeqFeatures in the sub_features list (plus this kind of blocks
> alternative usage for child/parent nesting that might be nice for GFF
> files).
>
> The prime use case to keep in mind is taking a feature location (even
> a join), and using this to extract that region of nucleotides from the
> parent sequence (i.e. a Seq object or a SeqRecord object, as now both
> can be sliced).
I forgot to mention the second major use case I'm concerned about,
which is recovering the GenBank/EMBL style location string. I have
looked at this in the past, by adding methods to the FeatureLocation
and all the Position objects, but it is complicated by the fact the
Position objects don't know if they are at the start or end (and for
the start locations we need to add one to convert from Python
counting). This is the main block on having Bio.SeqIO support writing
GenBank (or EMBL) files with their features included.
Peter
From lpritc at scri.ac.uk Tue Apr 21 09:50:01 2009
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Tue, 21 Apr 2009 14:50:01 +0100
Subject: [Biopython-dev] Bio.Motif Suggestions
In-Reply-To: <8b34ec180904210429g36d089a6h578dc0197a94516a@mail.gmail.com>
Message-ID:
Hi Bartek,
It's a long one, this... I expect many TLDR response ;)
On 21/04/2009 12:29, "Bartek Wilczynski" wrote:
> On Tue, Apr 21, 2009 at 10:34 AM, Leighton Pritchard
> wrote:
>> Some thoughts and a bit of a wishlist...
>
> These are always welcome. I can make no promises on timing of making
> your wishes come true ;)
No-one ever does :(
> But it is a workaround rather than a feature... I'd be also interested
> in knowing about other applications where maybe this assumption (small gaps)
> is violated. Are there also motifs with multiple gaps?
Yes - it might be a stretch, but if you wanted to represent the organisation
of protein domains in a multi-domain protein (e.g. a transposase, or some
pathogen effectors) as motifs you might want to do this.
> Implementing this feature would probably require a separate
> subclass of Motif, since the internal implementation of searching would
> need to be different.
I'm not sure that this needs to be true. A motif with no gaps can be
considered as a special case of a motif with an arbitrary number of gaps.
If the base implementation is that of a gapped motif (e.g. Represented as
ACT.{5,10}CCC.{,4}TATCAT.{3}GGG) then the basic method of searching - and
here using the re module might work - doesn't need to be any different for
an ungapped variant representing a particular instance of the
multiply-gapped motif (ACTNNNNNNCCCNNNNTATCATNNNGGG), or for any other
ungapped sequence (e.g. ACTCCCTATCATGGG).
This may not be the case for more complex search algorithms, however.
Other classes of Motif may well be necessary, in any case...
> This is a very good feature request, I think it is worth implementing,
> though currently I have no time to do it properly.
I'm right there with you, unfortunately ;)
>> I guess that this comes down to whether we choose to restrict the meaning of
>> "motif" to an ungapped string of symbols (including ambiguity) representing
>> nt/aa, or whether we want to permit the inclusion of variable-length gaps
>>
> I think that you are touching on multiple issues here.
I was trying to focus on one issue, but it does have lots of implications,
which you cover below.
The one issue I intended is this: A sequence motif can be represented in
more than one way, and those ways are not necessarily interchangeable -
either conceptually or in code. An ungapped string of symbols isn't able to
represent the same information as a regular expression (can do ambiguity of
repeat counts), which in turn isn't able to represent the same information
as a PSSM (can represent probabilities at each position), which in turn
isn't able to represent the same information as an HMM (can represent
variable-order dependency).
However, the things you want to do with that motif, such as use it to search
a set of candidate sequences or produce an example matching sequence for
test purposes, can be the same regardless of the coding or conceptual
representation of that motif.
We come back to this below, but for now this does lead on to...
> - gapped alignemnts are one thing. If we have a gap in one sequence
> but not in the others
> (frequent in protein motifs, not so much in DNA motifs) we just need a
> way to sensibly use it in creation of PWMs for searching
> - dyadic motifs (gaps in otherwise ungapped alignments) are a
> different issue, since we have a
> gap in all instances, but it may have a variable length. see above.
These are, I think, the same issue.
In your first example, PWMs will (mostly) work because the lengths of most
sequences are the same and there are few gaps. However, unless you have a
way of varying the length of your PWM during a query of the target sequence,
the PWM need not match the gapped sequence strongly, potentially leading to
a false negative. As an example:
ABCDE
AB-DE
ABCDE
ABCDE
The PWM will be (shorthand) [A1][B1][C.75,-.25][D1][E1], and when applied to
the target sequence ABDE (which was in your alignment), will not produce as
high a score as it would for the other members of the alignment. For the
alignment:
A-CDE
AB-DE
ABC-E
ABCDE
The PWM is (shorthand) [A1][B.75,-.25][C.75,-.25][D.75,-.25][E1]
With corresponding poor scores (potential false negatives) for target
sequences ACDE, ABDE and ABCE.
Without a way to (intelligently) place gaps in your target sequences, or
otherwise account for gaps when searching, the problem is the same whether
there is one gap or a dyadic motif. The *practical* issue is different, in
that you can probably accept the odd false negative for a motif in which one
training sequence has a gap, but PWMs are poor candidates for alignments
with many gaps, as they can readily produce false negatives.
The key issue is that PWMs are fixed-length, and variable-length
representations are common, desirable, and difficult to express in a
fixed-width framework.
> -regular expressions are a different way of describing motifs.
That is true - they are intermediate between consensus sequence, and PSSMs
in their ability to describe variation, but also have the capacity to
represent variable-length sequences.
> I think that it is not a purpose of Bio.Motif to compete with regexps, but it
> would be certainly valuable to be able to have a possibility of creating
> motifs from some sort of (simplified) regexps. This was, to some extent,
> discussed in a recent thread on Seq.startswith methods
I was involved in that discussion :D
I don't think that Bio.Motif needs to compete with the re module, but
instead could use its robust, stable code to implement a regular expression
representation of sequence motifs, seamlessly.
> -HMM motifs are totally different kind of beast. These guys introduce
> dependencies between positions (doable also with regexps) and there is
> currently no support for them in Bio.Motif. It would be cool to have
> support for them, but I'm not an expert here and it looks to me like a
> lot of work (also probably the methods of Bio.Motif are not exactly right for
> HMMs).
You're right about the dependencies - they're the important features I was
alluding to in my post - but I don't think that regular expressions are a
good way to approach the same problem; they don't encode the same
information.
> -finally, suporting prosite syntax seems to be depending on the variable gap
> feature, but otherwise it's simple an important input fomat to support.
I wasn't suggesting PROSITE syntax as part of any desire for implementation
- though a PROSITE <-> regex/consensus translation would be useful, I think
- rather as an illustration that more people than me need variable length
spacers in their motifs.
>> I think the latter representations are more useful, even if harder to
>> code/maintain. ?I think that leaving them out would be a glaring hole in
>> functionality
>
> Usefulness is hard to define in abstract of a particular problem , so
> this is arguable. It is certain that bio.Motif is not complete suite for all
> kinds of motif analysis but i don't know of any tool that is supporting alll
> these types of motifs with a single API (if you know one, please tell me).
> We should have ambitious goals, but I wouldn't call it a glaring hole not to
> have what is currently not available elsewhere...
I apologise for my poor wording. What I meant was that it would seem odd if
support for motif representation was considered complete without
representing variable-length sequences. Left alone, this would always
represent an obvious target for improvement (i.e. 'a glaring hole in
functionality'). No criticism was meant by it - I think you've done a great
job so far on Bio.Motif - and I apologise if I have caused offence.
>> I think that, for anything other than simple searches (string search,
>> regex), we'd be on a hiding to nothing by implementing search methods within
>> Python. ?It's not likely to be as fast as dedicated search packages, and it
>> would be a headache for maintenance.
> What do you mean by searching here? Searching for a known motif or searching
> for a new motif? And what dedicated packages you have on your mind?
Searching for a known motif in a larger sequence. Three packages - two
biologically-dedicated, one not - spring to mind.
The non-biologically-dedicated one is grep. Representing ambiguity symbols
as combinations of bases, e.g. [ACT] . [TA], [^T] and so on - with FASTA
files where sequences are not punctuated by \n or \r - is highly effective
for finding sequence motifs representable by regular expressions.
Dedicated 1: PSI-BLAST - takes PSSMs representing a sequence profile
Dedicated 2: HMMer - builds and uses an HMM representation of the sequence
profile.
There are others, but I'd have to think hard to recall them. You could
consider HMMer versions 1, 2 and 3 as different, in a number of ways -
including their utility for nucleotide sequence representation...
>> it seems to me that Bio.Motif could
>> be most powerful in the alignment/searching/comparison process as a 'broker'
>> within BioPython, providing a consistent API for interface with external
>> alignment/search/comparison applications that also permits programmatic
>> manipulation of the profile/HMM/alignment. ?E.g.
> I think that the most valuable thing would be to internalize some of
> the compliexity of different ways of using motifs in bioinformatics. My modest
> goal for now is making protein motifs first class citizens (meaning handling
> alphabets and gaps properly etc. ).
> The next thing would be to make bio.motif cooperate nicely with
> - Bio.Seq (e.g seq.startswith etc.),
> - Bio.Align (conversions from-to alignments)
> which includes easy motif creation from simple formats like IUPAC and
> simple regexps and would correspond to the "broker" function if I understand
> it correctly.
> Then I think it would be really cool to have spaced motifs, although
> here we need to be careful about performance.
If I might suggest: the main role of the Bio.Motif module as you intend it
appears to be to represent motifs of biological sequences, and to provide
useful functionality for them. Now, there are several ways of representing
these motifs both conceptually, and in code - and they're not all
interchangeable. Some of them have a many -> one mapping (PSSM -> consensus
sequence), and some have no obvious mapping at all (HMM <-/-> PSSM). There
is a decision to be made concerning how motifs are represented internally:
PSSM, regex and/or HMM. PSSM has the clear benefit that, given a PSSM, you
can easily generate the consensus sequence and a regular expression of
fixed-length - but the mapping to a regular expression is not clear, and may
not produce the one that the user would prefer. HMMs can't readily be
converted to other representations, and regular expressions can't be
expanded to PSSMs, or converted to consensus sequences (unless they have no
length ambiguities). It is not just performance we need to think about, but
the very representation of a motif.
Each of these representations is useful under different circumstances. I
think it is worth avoiding a structure that enforces a single internal
representation and closes off future alternative representations. Giving
the user sufficient flexibility/rope to hang themselves with in their choice
of internal representation is a Good Thing?, in my opinion.
> As for more specific things:
> - I don't like the usage of PSSM and consensus here. these are just
> different ways of looking at a Motif.
> I don't understand your idea of separating consensus from pssm motifs. These
> are not fundamentally different. HMMs though are really different.
I see what you mean, but I think you're associating PSSM with Motif too
strongly. A PSSM can be used to generate a consensus sequence, but the
resulting consensus sequence cannot be used to generate the corresponding
PSSM uniquely. There is not a one-one mapping, and they do not describe the
same information. Consensus sequences, for example, do not indicate the
probability of finding a particular symbol at any given position; PSSMs can.
PSSMs are fundamentally different from consensus sequences in that they
don't encode variability at any position.
Consensus, regex, PSSM and HMM are all different ways of looking at a Motif,
but they're not all internally-compatible - which is my point. If you build
a PSSM motif and make the alignment data nonrecoverable, you cannot
reconstruct a corresponding HMM representation, later, for example. So you
would have to decide what kind of representation you use at motif
build-time, build all of them at once, or keep the alignment around to build
what you need later. I'd prefer to choose at build time, but YMMV.
> -Also the difference between HMMer and HMM is unclear to me
> (isn't hmmer a tool to make HMMS? Do we support HMMER in Biopython currently?)
> But I'm not too concerned about HMMs at the moment.
There is a fair amount of flexibility in how you choose to define your HMM
for a motif, and not just in the order of the HMM. There has been
corresponding variation in how HMMer represents its data internally, over
the years. I was meaning to imply by syntax that a HMMer-specific
representation could be called 'hmmer', but a generic internal HMM
representation could just be called 'hmm', to reflect this. I'm not going
to insist on the convention, but it seems simple and obvious to me (again,
YMMV).
Sorry for the length and likely repetition, but I think these are issues
worth thinking about.
Cheers,
L.
--
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405
______________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.
The Scottish Crop Research Institute is a charitable company limited by guarantee.
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.
DISCLAIMER:
This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that
addressee.
If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on
this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system.
Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any).
______________________________________________________
From biopython at maubp.freeserve.co.uk Tue Apr 21 10:30:20 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 21 Apr 2009 15:30:20 +0100
Subject: [Biopython-dev] history on github - where are the tags?
In-Reply-To: <320fb6e00904201036h4ff71aecmd971acfb9fd63410@mail.gmail.com>
References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com>
<8b34ec180903241649p7e81a2cew6587512c0cef16f@mail.gmail.com>
<8b34ec180903241658k21a76269r789600f92c17fbbb@mail.gmail.com>
<320fb6e00903250328y19165a77t470124ce490cea3d@mail.gmail.com>
<320fb6e00904201027o501a0c67x2050538189685825@mail.gmail.com>
<320fb6e00904201036h4ff71aecmd971acfb9fd63410@mail.gmail.com>
Message-ID: <320fb6e00904210730v440bc483td21ca9aa118cc314@mail.gmail.com>
>
> From some more reading this, it sounds like our CVS tags are
> essentially turned into commit markers in git. ?See:
>
> http://www.kernel.org/pub/software/scm/git/docs/user-manual.html#how-git-stores-references
> http://book.git-scm.com/3_git_tag.html
>
> This shouldn't rule out showing them in the history, but perhaps the
> cvs to git migration confuses things...
By setting up a toy repository with tags done though git itself (I
assume), Bartek has convinced me that GitHub itself never shows the
tags in the history. I think this is big drawback, and that we should
ask GitHub about this.
However, using the Mac GUI tool GitX, I was able to see the tags in
the history using the toy repository (they show up as nice yellow
blobs), but not using the current Biopython CVS to git conversion.
There appears to be something less than ideal about our CVS to git
conversion. I believe this relates to how the tag commits appear in
the commit tree - and it looks like for Biopython they are all tiny
branches off the main trunk. i.e. If you look at the main trunk
history (overall or for any one file) then tags commits are not in it.
This hunch appears to be supported by the git log output:
$ git clone git://github.com/biopython/biopython.git
$ cd biopython
$ git log --graph --all
...
|
* commit 8fb446965d58f266ba8bf41a992a09e4bedbac3e
| Author: peterc
| Date: Mon Apr 20 16:07:41 2009 +0000
|
| Bump the version number now that Biopython 1.50 is released
|
| * commit 4ed11049092d86704a2a15359c77459bad30e291
|/ Author: cvs2dvcs transform
| Date: Mon Apr 20 10:48:32 2009 +0000
|
| This commit was manufactured by cvs2svn to create tag 'biopython-150'.
|
* commit 29aa4df3480cdee803694766f137ab2baf5625b2
| Author: peterc
| Date: Mon Apr 20 10:48:31 2009 +0000
|
| You don't have to email Iddo to get on the CONTRIB file
|
...
In comparison, for Bartek's toy repository there is a single branch shown.
Peter
From sbassi at clubdelarazon.org Tue Apr 21 10:34:20 2009
From: sbassi at clubdelarazon.org (Sebastian Bassi)
Date: Tue, 21 Apr 2009 11:34:20 -0300
Subject: [Biopython-dev] Python 2.3 support
In-Reply-To: <5aa3b3570904210625v78e604c4j87c4be62b9a0488c@mail.gmail.com>
References: <320fb6e00904210315i7acbcffy40314df645ec4eda@mail.gmail.com>
<20090421123531.GE30529@sobchak.mgh.harvard.edu>
<320fb6e00904210556i7fd3efe3pe74a6773d367a3e9@mail.gmail.com>
<5aa3b3570904210625v78e604c4j87c4be62b9a0488c@mail.gmail.com>
Message-ID: <9e2f512b0904210734w46d22856k46c4bdfcddcf0346@mail.gmail.com>
On Tue, Apr 21, 2009 at 10:25 AM, Giovanni Marco Dall'Olio
wrote:
> However, please, drop support for these older version or people won't never
> upgrade :)
That is true, but is also true that you can use a new version without
upgrading. The reason for not upgrading is in most cases avoiding to
break working scripts. I my (old) OS, the WIFI card uses Python 2.3 to
work. But Python allows to install "alternative" versions without
conflicting with your default system version. This way I have Python
2.4, 2.5, 2.6 and 3 all installed in the same machine. Using
alt-install or just compiling a Python version without doing a system
install. I even have more than one 2.5 version and each with a
different Biopython installation (using virtual_env) for testing
purposes. So I don't think there is a valid reason to keep supporting
such an old version.
From biopython at maubp.freeserve.co.uk Tue Apr 21 10:47:40 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 21 Apr 2009 15:47:40 +0100
Subject: [Biopython-dev] Python 2.3 support
In-Reply-To: <9e2f512b0904210734w46d22856k46c4bdfcddcf0346@mail.gmail.com>
References: <320fb6e00904210315i7acbcffy40314df645ec4eda@mail.gmail.com>
<20090421123531.GE30529@sobchak.mgh.harvard.edu>
<320fb6e00904210556i7fd3efe3pe74a6773d367a3e9@mail.gmail.com>
<5aa3b3570904210625v78e604c4j87c4be62b9a0488c@mail.gmail.com>
<9e2f512b0904210734w46d22856k46c4bdfcddcf0346@mail.gmail.com>
Message-ID: <320fb6e00904210747h73b8881dkcfaf8a53f2f7aab@mail.gmail.com>
> ... Python allows to install "alternative" versions without
> conflicting with your default system version. This way I have Python
> 2.4, 2.5, 2.6 and 3 all installed in the same machine. Using
> alt-install or just compiling a Python version without doing a system
> install. I even have more than one 2.5 version and each with a
> different Biopython installation (using virtual_env) for testing
> purposes. So I don't think there is a valid reason to keep supporting
> such an old version.
OK, OK, no one loves Python 2.3 anymore, and you'll all be glad to see
the back of it ;)
Shall we say that at the end of April, unless anyone has come forward
with a strong need to continue using Biopython on Python 2.3 (or we
are forced to do another release to fix something), we'll start work
on removing Python 2.3 specific code in May?
A lot (hopefully most) of the Python 2.3 bits have a comment about
this in the source code, so a quick grep should pull out most of them.
If any of you remember any other specific things we need to change
add a note to Bug 2817 please.
http://bugzilla.open-bio.org/show_bug.cgi?id=2817
Thanks
Peter
From bartek at rezolwenta.eu.org Tue Apr 21 11:19:00 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Tue, 21 Apr 2009 17:19:00 +0200
Subject: [Biopython-dev] history on github - where are the tags?
In-Reply-To: <8b34ec180904210810o77496205n8af85a58f0f83a49@mail.gmail.com>
References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com>
<8b34ec180903241649p7e81a2cew6587512c0cef16f@mail.gmail.com>
<8b34ec180903241658k21a76269r789600f92c17fbbb@mail.gmail.com>
<320fb6e00903250328y19165a77t470124ce490cea3d@mail.gmail.com>
<320fb6e00904201027o501a0c67x2050538189685825@mail.gmail.com>
<320fb6e00904201036h4ff71aecmd971acfb9fd63410@mail.gmail.com>
<320fb6e00904210730v440bc483td21ca9aa118cc314@mail.gmail.com>
<8b34ec180904210810o77496205n8af85a58f0f83a49@mail.gmail.com>
Message-ID: <8b34ec180904210819n1b69b0aao53fc8d10c570abb8@mail.gmail.com>
Hi,
> There appears to be something less than ideal about our CVS to git
> conversion. ?I believe this relates to how the tag commits appear in
> the commit tree - and it looks like for Biopython they are all tiny
> branches off the main trunk. ?i.e. If you look at the main trunk
> history (overall or for any one file) then tags commits are not in it.
>
I haven't noticed this difference. It just seems to be the way cvs2got
handles tags.
This behavior does not seem to be controllable from the config file.
I'll try to ask on the
cvs2git mailing list.
In case it is not possible to change it in cvs2git itself, the worst
scenario would be to re-tag
the git tree manually (or with a help of some script). So there is no
risk of loosing tags.
I'll post when I have any progress onn this issue.
cheers
?Bartek
--
Bartek Wilczynski
==================
Postdoctoral fellow
EMBL, Furlong group
Meyerhoffstrasse 1,
69012 Heidelberg,
Germany
tel: +49 6221 387 8433
From mjldehoon at yahoo.com Tue Apr 21 11:23:03 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 21 Apr 2009 08:23:03 -0700 (PDT)
Subject: [Biopython-dev] Rolling new releases
In-Reply-To: <320fb6e00904210543x79c021f3j3f3272841b16d1c6@mail.gmail.com>
Message-ID: <955867.40270.qm@web62404.mail.re1.yahoo.com>
--- On Tue, 4/21/09, Peter Cock wrote:
> Again, building the docs is pretty trivial. We have in the
> past deliberately NOT updated the online copies, so that it is
> in sync with the latest release. I suppose we could have two
> copies on the website, the "latest release" and the
> "nightly code".
>
That would be nice. In the past, I've done such things by hand to let people look at the documentation for a piece of code that's about to go into CVS.
From mjldehoon at yahoo.com Tue Apr 21 11:28:56 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 21 Apr 2009 08:28:56 -0700 (PDT)
Subject: [Biopython-dev] Rolling new releases
In-Reply-To: <20090421122045.GD30529@sobchak.mgh.harvard.edu>
Message-ID: <268684.34243.qm@web62402.mail.re1.yahoo.com>
--- On Tue, 4/21/09, Brad Chapman wrote:
> - Eliminating the beta releases. Biopython is developed as
> stable in Git/CVS, so gets testing that way on developer
> machines. Are we getting enough feedback from betas to make
> them worthwhile?
I agree. A project like Biopython is destined to be in perpetual beta mode anyway. To my mind, Biopython 1.50-beta is as stable as Biopython 1.49 and Biopython 1.51. In addition, will we be able to remember that Biopython 1.50b is the beta release of version 1.50 (or did we have a 1.50, then a 1.50a, and then a 1.50b release?).
--Michiel
From mjldehoon at yahoo.com Tue Apr 21 11:35:44 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 21 Apr 2009 08:35:44 -0700 (PDT)
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <20090421124449.GF30529@sobchak.mgh.harvard.edu>
Message-ID: <595877.69734.qm@web62408.mail.re1.yahoo.com>
--- On Tue, 4/21/09, Brad Chapman wrote:
> I would be for a SeqFeatureLite class that is API
> compatible with SeqFeature (with the new start/end
> attributes) and does not support
> fuzzy locations. This would handle GFF understandably, be
> lightweight, and allow access to BioSQL and SeqIO.
> How does this sound?
Depends on whether SeqFeatureLite only exists for the benefit of GFF files. If so, we're better off with a light-weight GFF-specific object. If not, then it may make sense. But even then it sounds a bit like class creep.
--Michiel.
From p.j.a.cock at googlemail.com Tue Apr 21 11:58:23 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 21 Apr 2009 16:58:23 +0100
Subject: [Biopython-dev] Rolling new releases
In-Reply-To: <955867.40270.qm@web62404.mail.re1.yahoo.com>
References: <320fb6e00904210543x79c021f3j3f3272841b16d1c6@mail.gmail.com>
<955867.40270.qm@web62404.mail.re1.yahoo.com>
Message-ID: <320fb6e00904210858k1aa4b5cav2a784b75fb3b3f8@mail.gmail.com>
On Tue, Apr 21, 2009 at 4:23 PM, Michiel de Hoon wrote:
>
> Peter wrote:
>> Again, building the docs is pretty trivial. ?We have in the
>> past deliberately NOT updated the online copies, so that it is
>> in sync with the latest release. ?I suppose we could have two
>> copies on the website, the "latest release" and the
>> "nightly code".
>
> That would be nice. In the past, I've done such things by hand to
> let people look at the documentation for a piece of code that's
> about to go into CVS.
>
This should be trivial to get setup - at least as long as our
repository lives on the OBF server. There are already scripts or CVS
hooks in place to update http://biopython.org/SRC/ although I don't
know how exactly this is configured.
On Tue, Apr 21, 2009 at 4:28 PM, Michiel de Hoon wrote:
>Brad wrote:
>>> - Eliminating the beta releases. Biopython is developed as
>>> stable in Git/CVS, so gets testing that way on developer
>>> machines. Are we getting enough feedback from betas to make
>>> them worthwhile?
>
> I agree. A project like Biopython is destined to be in perpetual beta
> mode anyway. To my mind, Biopython 1.50-beta is as stable as
> Biopython 1.49 and Biopython 1.51. In addition, will we be able to
> remember that Biopython 1.50b is the beta release of version 1.50
> (or did we have a 1.50, then a 1.50a, and then a 1.50b release?).
Maybe I have hung about with computer scientists / programmers too
long, as to me there is no confusion about the ordering alpha -> beta
-> release candidate -> final.
However, if the consensus is that explicit beta releases are
redundant, then so be it.
Peter
From bartek at rezolwenta.eu.org Tue Apr 21 11:59:32 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Tue, 21 Apr 2009 17:59:32 +0200
Subject: [Biopython-dev] Bio.Motif Suggestions
In-Reply-To:
References: <8b34ec180904200804o58d531ache7e3110f3b919cfa@mail.gmail.com>
Message-ID: <8b34ec180904210859r34a0a034qdfb54d57c3ca85e3@mail.gmail.com>
Hi,
thanks for your suggestions.
To make the long story short:
- I mostly agree with your points
- I've updated the wiki page to include your requests
http://biopython.org/wiki/MotifDev
- I'll definitely spend some time working on particular requests and
then post specifically.
cheers
Bartek
On Tue, Apr 21, 2009 at 10:34 AM, Leighton Pritchard wrote:
> Hi,
>
> Some thoughts and a bit of a wishlist...
>
> On 20/04/2009 16:04, "Bartek Wilczynski" wrote:
>
>> On Mon, Apr 20, 2009 at 4:35 PM, Peter
>> wrote:
>>>
>>> What would a space in a motif mean? ?Clearly something different from
>>> a wildcard like N or X in nucleotide or protein sequences. ?Does it
>>> mean a gap of variable length? ?If it means a gap of one character
>>> then surely just using a "-" would be sensible (as used in multiple
>>> sequence alignments), for which we have a gapped alphabet system
>>> setup.
>>>
>> I think that once we start talking about gapped motifs, we are really
>> talking about
>> multiple alignments on steroids. This hasn't been done so far because you
>> don't
>> really need it for DNA motifs,
>
> It might not be required for the motifs you've been working with, but we've
> been doing profile-based searches for bipartite regulatory binding sites in
> DNA. ?These sites have a variable-length spacer region, and so require
> gapped alignments for building motifs. ?The spacer region consensus
> (depending on the level of identity required for the consensus) is usually
> composed of Ns.
>
> I guess that this comes down to whether we choose to restrict the meaning of
> "motif" to an ungapped string of symbols (including ambiguity) representing
> nt/aa, or whether we want to permit the inclusion of variable-length gaps,
> regions, or ambiguities in a PROSITE or regular expression-like manner (e.g.
> C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H, GAACC.{17,21}AAC or
> C{,3}A{3,5}TTTT). ?Although profile methods like HMMer can produce a
> consensus output that looks like an ungapped string of symbols to represent
> a motif, it doesn't capture important features of the HMM representation.
>
> I think the latter representations are more useful, even if harder to
> code/maintain. ?I think that leaving them out would be a glaring hole in
> functionality, and that they're a target Biopython should aim for.
>
>> I think it would be great to be
>> able to easily
>> convert multiple alignments into motifs. This would allow us to ?use
>> the power of
>> BIo.AlignIO for IO and Bio.Motif for searching and comparisons.The question is
>> how to design API for these ?functions.
>
> I agree. ?I think that there's another important question: what do we mean,
> and need to do, when we talk about converting an alignment into a motif?
> Consensus/majority and PSSM methods from a sequence alignment should be
> straightforward to implement in Python - even for gapped alignments.
> Including a representation of variable-length gaps might be a little more
> difficult, and storing an HMM representation may be too much to manage
> immediately. ?That's still three different types of object - with likely
> different components to their interfaces - to be stored. ?In their
> relationship to a source alignment, these representations could be
> properties of a single alignment, or independent Bio.Motif objects (perhaps
> each with a link back to their parent alignment).
>
> The results of searches are also likely to be qualitatively different,
> depending on the type of motif used for the search, and the results desired
> by the user.
>
> I think that, for anything other than simple searches (string search,
> regex), we'd be on a hiding to nothing by implementing search methods within
> Python. ?It's not likely to be as fast as dedicated search packages, and it
> would be a headache for maintenance. ?So, with apologies if I missed this
> part of the discussion or documentation, it seems to me that Bio.Motif could
> be most powerful in the alignment/searching/comparison process as a 'broker'
> within BioPython, providing a consistent API for interface with external
> alignment/search/comparison applications that also permits programmatic
> manipulation of the profile/HMM/alignment. ?E.g.
>
> align = Bio.AlignIO.read(alignfilehandle)
> consensus = align.build_consensus(threshold=0.9)
> pssm = align.build_pssm()
> hmmer = align.build_hmmer()
> hmm = align.build_hmm(order=3)
>
> Or
>
> consensus = Bio.Motif.consensus_from_alignment(align, threshold=0.9)
> pssm = Bio.Motif.build_pssm_from_alignment(align)
> hmmer = Bio.Motif.build_hmmer_from_alignment(align)
> hmm = Bio.Motif.build_hmm_from_alignment(align, order=3)
>
> (which I don't think is as neat an interface, even if all
> align.build_consensus does is call the Bio.Motif.consensus_from_alignment
> method)
>
> Followed by things like
>
> pssm.consensus()
> pssm.logo()
> hmm.generate_sequence(length=100)
> hmm.to_graphviz()
>
> And then the consensus, pssm, hmm and hmmer objects could be used as input
> to interfaces for the relevant applications.
>
> Converting an alignment into an HMM for this purpose may itself benefit from
> a call to HMMer's hmmbuild (and Pythonic representation of the data
> structure), rather than implementation of an equivalent internal function -
> even though I think one of those would be useful, too.
>
> Cheers,
>
> L.
>
> --
> Dr Leighton Pritchard MRSC
> D131, Plant Pathology Programme, SCRI
> Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
> e:lpritc at scri.ac.uk ? ? ? w:http://www.scri.ac.uk/staff/leightonpritchard
> gpg/pgp: 0xFEFC205C ? ? ? tel:+44(0)1382 562731 x2405
>
>
> ______________________________________________________
> SCRI, Invergowrie, Dundee, DD2 5DA.
> The Scottish Crop Research Institute is a charitable company limited by guarantee.
> Registered in Scotland No: SC 29367.
> Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.
>
>
> DISCLAIMER:
>
> This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. ?This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. ?It may not be disclosed or used by any other than that
> addressee.
> If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on
> this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system.
>
> Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any).
> ______________________________________________________
>
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>
>
--
Bartek Wilczynski
==================
Postdoctoral fellow
EMBL, Furlong group
Meyerhoffstrasse 1,
69012 Heidelberg,
Germany
tel: +49 6221 387 8433
From biopython at maubp.freeserve.co.uk Tue Apr 21 12:29:19 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 21 Apr 2009 17:29:19 +0100
Subject: [Biopython-dev] history on github - where are the tags?
In-Reply-To: <8b34ec180904210819n1b69b0aao53fc8d10c570abb8@mail.gmail.com>
References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com>
<8b34ec180903241649p7e81a2cew6587512c0cef16f@mail.gmail.com>
<8b34ec180903241658k21a76269r789600f92c17fbbb@mail.gmail.com>
<320fb6e00903250328y19165a77t470124ce490cea3d@mail.gmail.com>
<320fb6e00904201027o501a0c67x2050538189685825@mail.gmail.com>
<320fb6e00904201036h4ff71aecmd971acfb9fd63410@mail.gmail.com>
<320fb6e00904210730v440bc483td21ca9aa118cc314@mail.gmail.com>
<8b34ec180904210810o77496205n8af85a58f0f83a49@mail.gmail.com>
<8b34ec180904210819n1b69b0aao53fc8d10c570abb8@mail.gmail.com>
Message-ID: <320fb6e00904210929k1e4f552fnf9154b0f8c30c1aa@mail.gmail.com>
On Tue, Apr 21, 2009 at 4:19 PM, Bartek Wilczynski
wrote:
> Hi,
>
>> There appears to be something less than ideal about our CVS to git
>> conversion. ?I believe this relates to how the tag commits appear in
>> the commit tree - and it looks like for Biopython they are all tiny
>> branches off the main trunk. ?i.e. If you look at the main trunk
>> history (overall or for any one file) then tags commits are not in it.
>>
>
> I haven't noticed this difference. It just seems to be the way cvs2got
> handles tags. This behavior does not seem to be controllable from
> the config file. I'll try to ask on the cvs2git mailing list.
>
> In case it is not possible to change it in cvs2git itself, the worst
> scenario would be to re-tag the git tree manually (or with a help
> of some script). So there is no risk of loosing tags.
>
> I'll post when I have any progress onn this issue.
There is another option, redo the import using git cvsimport. This
has the downside that we lose all the network history currently in
github, but its only going to affect a couple of people and that was
always a possibility.
I've just done this twice, firstly over the network (just over an
hour, probably a bad idea in terms of wasting the OBF bandwidth).
Then I succeeded in doing it locally (under 15 minutes) on my Mac
after logging into dev.open-bio.org and fetching a zipped up copy of
the CVS files. The hard bit was working out how to get the CVSROOT
directory setup:
cvs -d $PWD/biopython_cvs init
cd biopython_cvs
unzip ../../Biopython-CVS-2009-04-21.zip
cd ..
time nice -n 10 git cvsimport -v -k -d
/Users/pjcock/repositories/bp_cvs_local_to_git/biopython_cvs -C
biopython_git biopython
Both conversion appear to give the same result. Using GitX the
history how shows the tags as I expect them to appear (nice yellow
markers on the main branch), and the tag side branches have gone:
$ cd biopython_git
$ git log --graph --all
...
|
* commit 6283ffe77fdd07ae678d2fa35ae9311ee7fd51ee
| Author: peterc
| Date: Mon Apr 20 16:07:41 2009 +0000
|
| Bump the version number now that Biopython 1.50 is released
|
* commit 17a9b80f89be97fd4cc31d7c3618e82e4c83cafc
| Author: peterc
| Date: Mon Apr 20 10:48:31 2009 +0000
|
| You don't have to email Iddo to get on the CONTRIB file
|
...
I'm not sure if "git log" can be told to show the tags itself.
Also, just like Bartek's conversion using cvs2svn, this also appears
to correctly identify simple file moving (when the add and delete are
done in one CVS operation, obviously not when it was done in two steps
like my recent changes in Bio.Graphics.GenomeDiagram).
Note - we can probably use
http://github.com/guides/change-author-details-in-commit-history to
map author names to github user names later, but in theory git
cvsimport will do this with the -A option.
Peter
From biopython at maubp.freeserve.co.uk Tue Apr 21 12:58:23 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 21 Apr 2009 17:58:23 +0100
Subject: [Biopython-dev] history on github - where are the tags?
In-Reply-To: <320fb6e00904210929k1e4f552fnf9154b0f8c30c1aa@mail.gmail.com>
References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com>
<8b34ec180903241649p7e81a2cew6587512c0cef16f@mail.gmail.com>
<8b34ec180903241658k21a76269r789600f92c17fbbb@mail.gmail.com>
<320fb6e00903250328y19165a77t470124ce490cea3d@mail.gmail.com>
<320fb6e00904201027o501a0c67x2050538189685825@mail.gmail.com>
<320fb6e00904201036h4ff71aecmd971acfb9fd63410@mail.gmail.com>
<320fb6e00904210730v440bc483td21ca9aa118cc314@mail.gmail.com>
<8b34ec180904210810o77496205n8af85a58f0f83a49@mail.gmail.com>
<8b34ec180904210819n1b69b0aao53fc8d10c570abb8@mail.gmail.com>
<320fb6e00904210929k1e4f552fnf9154b0f8c30c1aa@mail.gmail.com>
Message-ID: <320fb6e00904210958y4f386a8ciba86cde9e7c1bc65@mail.gmail.com>
On Tue, Apr 21, 2009 at 5:29 PM, Peter wrote:
>
> There is another option, redo the import using git cvsimport. ?This
> has the downside that we lose all the network history currently in
> github, but its only going to affect a couple of people and that was
> always a possibility.
>
> I've just done this twice, firstly over the network (just over an
> hour, probably a bad idea in terms of wasting the OBF bandwidth).
> Then I succeeded in doing it locally (under 15 minutes) on my Mac
> after logging into dev.open-bio.org and fetching a zipped up copy of
> the CVS files. ?The hard bit was working out how to get the CVSROOT
> directory setup:
>
> cvs -d $PWD/biopython_cvs init
> cd biopython_cvs
> unzip ../../Biopython-CVS-2009-04-21.zip
> cd ..
> time nice -n 10 git cvsimport -v -k -d
> /Users/pjcock/repositories/bp_cvs_local_to_git/biopython_cvs ?-C
> biopython_git biopython
>
> Both conversion appear to give the same result. ?Using GitX the
> history how shows the tags as I expect them to appear (nice yellow
> markers on the main branch), and the tag side branches have gone:
>
I've pushed this to github as
http://github.com/peterjc/biopython-cvs-import/tree/master
$ cd biopython_git
$ git remote add origin git at github.com:peterjc/biopython-cvs-import.git
$ git push origin master
$ git push origin master --tags
This won't be automatically updated, so please don't fork it!
Peter
From biopython at maubp.freeserve.co.uk Tue Apr 21 14:18:12 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 21 Apr 2009 19:18:12 +0100
Subject: [Biopython-dev] Possible re-import from CVS to git
Message-ID: <320fb6e00904211118n23e68b2vc43de8397ff3b08c@mail.gmail.com>
On the thread about the missing history tags in github, I wrote:
>> ... Then I succeeded in doing it locally (under 15 minutes) on my Mac
>> after logging into dev.open-bio.org and fetching a zipped up copy of
>> the CVS files. ?The hard bit was working out how to get the CVSROOT
>> directory setup:
>>
>> cvs -d $PWD/biopython_cvs init
>> cd biopython_cvs
>> unzip ../../Biopython-CVS-2009-04-21.zip
>> cd ..
>> time nice -n 10 git cvsimport -v -k -d
>> /Users/pjcock/repositories/bp_cvs_local_to_git/biopython_cvs ?-C
>> biopython_git biopython
>>
I've been testing the -A option for git cvsimport to map our CVS
usernames to hithub accounts.
http://www.kernel.org/pub/software/scm/git/docs/git-cvsimport.html
The following format omitting the email address does nothing at all
(checking the local repository), which is a shame as I was hoping it
would allow a quick and simple way to map the CVS usernames to the
github usernames:
peterc=peterjc
However, the documented format does work:
peterc=full name
It seems that as long as the email address matches that used for your
github account, once the repository is uploaded to github it will all
work nicely - and your github account will be linked to the commit.
So, if we are going to re-do the git import (and we may have to fix
the tag history), it would be very nice if all the existing CVS users
could first:
(a) setup an account on github, and
(b) tell me the email address you are using for it.
If we do move to github, you would need to do this anyway in order to
be given collaborator status to make commits direct to the main trunk.
> I've pushed this to github as
> http://github.com/peterjc/biopython-cvs-import/tree/master
That is deleted now.
Peter
From p.j.a.cock at googlemail.com Tue Apr 21 16:06:56 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 21 Apr 2009 21:06:56 +0100
Subject: [Biopython-dev] SwissProt parsing inconsistency between
Bio.SeqIO, Bio.SwissProt
In-Reply-To: <320fb6e00904210504g6c7f60f1o96129c9a6759c256@mail.gmail.com>
References: <320fb6e00904210426q19ea7a44xdb902661297ec855@mail.gmail.com>
<861995.42083.qm@web62406.mail.re1.yahoo.com>
<320fb6e00904210504g6c7f60f1o96129c9a6759c256@mail.gmail.com>
Message-ID: <320fb6e00904211306u50955608ndccef5d0cb6ba09b@mail.gmail.com>
On Tue, Apr 21, 2009 at 1:04 PM, Peter Cock wrote:
>>> It looks like the SwissProt format has changed, and we
>>> should be parsing the new extended DE lines more
>>> carefully, and splitting these entries up and recording
>>> them in the SeqRecord.annotations dictionary?
>>
>> That sounds reasonable. The dictionary will have to be
>> nested though. Something like this ...
>>
Thinking this over, we should take that SwissProt file and load it
into BioSQL using BioPerl, and see how they dealt with the DE lines,
and try and do the same for Bio.SeqIO in order that loading it into
BioSQL with Biopython gives more or less the same thing.
Peter
From eric.talevich at gmail.com Wed Apr 22 00:32:33 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Wed, 22 Apr 2009 00:32:33 -0400
Subject: [Biopython-dev] Possible re-import from CVS to git
In-Reply-To: <320fb6e00904211118n23e68b2vc43de8397ff3b08c@mail.gmail.com>
References: <320fb6e00904211118n23e68b2vc43de8397ff3b08c@mail.gmail.com>
Message-ID: <3f6baf360904212132k7110deeft6829c1b4a7b18f24@mail.gmail.com>
On Tue, Apr 21, 2009 at 2:18 PM, Peter wrote:
> So, if we are going to re-do the git import (and we may have to fix
> the tag history), it would be very nice if all the existing CVS users
> could first:
> (a) setup an account on github, and
> (b) tell me the email address you are using for it.
>
> If we do move to github, you would need to do this anyway in order to
> be given collaborator status to make commits direct to the main trunk.
>
>
Eek. Now that the Summer of Code is under way, I guess this is a good time
to bring up the question of how Nick and I should be following the Biopython
trunk and publishing our own code.
In spite of the warning that the CVS tracker in GitHub was tentative, I was
getting comfortable with the setup we had. Should I (we) hold off on pushing
anything substantial to GitHub until this tagging situation is resolved, or
is there a better way to approach this? For example, does anyone know if
it's straightforward to back up a branch's recent history with
git-format-patch and apply it directly onto a new repository with different
references?
Thanks,
Eric
From lpritc at scri.ac.uk Wed Apr 22 03:56:46 2009
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Wed, 22 Apr 2009 08:56:46 +0100
Subject: [Biopython-dev] Bio.Motif Suggestions
In-Reply-To: <8b34ec180904210859r34a0a034qdfb54d57c3ca85e3@mail.gmail.com>
Message-ID:
Hi Bart,
On 21/04/2009 16:59, "Bartek Wilczynski" wrote:
> Hi,
>
> thanks for your suggestions.
>
> To make the long story short:
> - I mostly agree with your points
> - I've updated the wiki page to include your requests
> http://biopython.org/wiki/MotifDev
> - I'll definitely spend some time working on particular requests and
> then post specifically.
Many thanks for the quick response - I've seen your wiki update, too.
Cheers,
L.
--
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405
______________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.
The Scottish Crop Research Institute is a charitable company limited by guarantee.
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.
DISCLAIMER:
This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that
addressee.
If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on
this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system.
Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any).
______________________________________________________
From bartek at rezolwenta.eu.org Wed Apr 22 04:53:21 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Wed, 22 Apr 2009 10:53:21 +0200
Subject: [Biopython-dev] history on github - where are the tags?
In-Reply-To: <8b34ec180904220152rf454c5crb20a6858ecbde468@mail.gmail.com>
References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com>
<320fb6e00903250328y19165a77t470124ce490cea3d@mail.gmail.com>
<320fb6e00904201027o501a0c67x2050538189685825@mail.gmail.com>
<320fb6e00904201036h4ff71aecmd971acfb9fd63410@mail.gmail.com>
<320fb6e00904210730v440bc483td21ca9aa118cc314@mail.gmail.com>
<8b34ec180904210810o77496205n8af85a58f0f83a49@mail.gmail.com>
<8b34ec180904210819n1b69b0aao53fc8d10c570abb8@mail.gmail.com>
<320fb6e00904210929k1e4f552fnf9154b0f8c30c1aa@mail.gmail.com>
<320fb6e00904210958y4f386a8ciba86cde9e7c1bc65@mail.gmail.com>
<8b34ec180904220152rf454c5crb20a6858ecbde468@mail.gmail.com>
Message-ID: <8b34ec180904220153p45f1004emf6e0f9b01ce9db1b@mail.gmail.com>
Hi,
On Tue, Apr 21, 2009 at 6:58 PM, Peter wrote:
> On Tue, Apr 21, 2009 at 5:29 PM, Peter wrote:
>>
>> There is another option, redo the import using git cvsimport. ?This
>> has the downside that we lose all the network history currently in
>> github, but its only going to affect a couple of people and that was
>> always a possibility.
Yes, it is ?an option, but I would be quite reluctant to do it. I think this
issue with tags is possible to get fixed without re-doing the import.
I'm scared by the possibility the we re-import stuff, fix the tags, everybody
swithches, people complain how good it was back then with CVS, ane one
month down the road, we find that there is an issue with something else,
that was not present in the previous import.
I think this is becoming a bit chaotic now. We still haven't removed the
first github conversion: (biopython_old branch: is anyone using it anyway?),
?there is this semi-official one that has a (fixable in my opinion) issue with
tags and now there is a new one made by Peter.
In summary:
I have no objections to using any particular tool for importing stuff to git.
I don't like the idea of not even trying to fix tghe problem we have
but instantly
changing the tool we are using.
I consider now re-importing stuff a major problem: everybody will need to port
their changes which is work.
>>
>> I've just done this twice, firstly over the network (just over an
>> hour, probably a bad idea in terms of wasting the OBF bandwidth).
>> Then I succeeded in doing it locally (under 15 minutes) on my Mac
>> after logging into dev.open-bio.org and fetching a zipped up copy of
>> the CVS files. ?The hard bit was working out how to get the CVSROOT
>> directory setup:
>>
itt's good to know it works, I don't think the time differences are
significant.
>
> This won't be automatically updated, so please don't fork it!
exactly
Bartek
--
Bartek Wilczynski
==================
Postdoctoral fellow
EMBL, Furlong group
Meyerhoffstrasse 1,
69012 Heidelberg,
Germany
tel: +49 6221 387 8433
From biopython at maubp.freeserve.co.uk Wed Apr 22 05:08:13 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 22 Apr 2009 10:08:13 +0100
Subject: [Biopython-dev] Possible re-import from CVS to git
In-Reply-To: <3f6baf360904212132k7110deeft6829c1b4a7b18f24@mail.gmail.com>
References: <320fb6e00904211118n23e68b2vc43de8397ff3b08c@mail.gmail.com>
<3f6baf360904212132k7110deeft6829c1b4a7b18f24@mail.gmail.com>
Message-ID: <320fb6e00904220208oaecd1a5p844ce8642acc51fa@mail.gmail.com>
On Wed, Apr 22, 2009 at 5:32 AM, Eric Talevich wrote:
> On Tue, Apr 21, 2009 at 2:18 PM, Peter wrote:
>
>> So, if we are going to re-do the git import (and we may have to fix
>> the tag history), ...
>
> Eek. Now that the Summer of Code is under way, I guess this is a good time
> to bring up the question of how Nick and I should be following the Biopython
> trunk and publishing our own code.
>
> In spite of the warning that the CVS tracker in GitHub was tentative, I was
> getting comfortable with the setup we had. Should I (we) hold off on pushing
> anything substantial to GitHub until this tagging situation is resolved, or
> is there a better way to approach this? For example, does anyone know if
> it's straightforward to back up a branch's recent history with
> git-format-patch and apply it directly onto a new repository with different
> references?
Bartek is looking into fixing the existing CVS to git mirror on github, but
that may not be possible. And I do think it is worth fixing the tag history
even at the cost of some upheaval in the short term.
In terms of you and Nick, for now carry on using github if you are comfortable
with it. The new phylogenetics stuff will I assume be mostly new python
modules, or modifications to a couple of existing ones (e.g. Bio.Nexus).
Merging this later shouldn't be too bad - you should be able to generate a
diff against CVS (or its current mirror in git) and we can apply that to CVS
(or a new git repository).
Peter
From biopython at maubp.freeserve.co.uk Wed Apr 22 05:23:46 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 22 Apr 2009 10:23:46 +0100
Subject: [Biopython-dev] history on github - where are the tags?
In-Reply-To: <8b34ec180904220153p45f1004emf6e0f9b01ce9db1b@mail.gmail.com>
References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com>
<320fb6e00904201027o501a0c67x2050538189685825@mail.gmail.com>
<320fb6e00904201036h4ff71aecmd971acfb9fd63410@mail.gmail.com>
<320fb6e00904210730v440bc483td21ca9aa118cc314@mail.gmail.com>
<8b34ec180904210810o77496205n8af85a58f0f83a49@mail.gmail.com>
<8b34ec180904210819n1b69b0aao53fc8d10c570abb8@mail.gmail.com>
<320fb6e00904210929k1e4f552fnf9154b0f8c30c1aa@mail.gmail.com>
<320fb6e00904210958y4f386a8ciba86cde9e7c1bc65@mail.gmail.com>
<8b34ec180904220152rf454c5crb20a6858ecbde468@mail.gmail.com>
<8b34ec180904220153p45f1004emf6e0f9b01ce9db1b@mail.gmail.com>
Message-ID: <320fb6e00904220223q119342aem70feb30e0f2cb747@mail.gmail.com>
On Wed, Apr 22, 2009 at 9:53 AM, Bartek Wilczynski
wrote:
> Hi,
>
> Peter wrote:
>>> There is another option, redo the import using git cvsimport. ?This>
>>> has the downside that we lose all the network history currently in
>>> github, but its only going to affect a couple of people and that was
>>> always a possibility.
>
> Yes, it is ?an option, but I would be quite reluctant to do it. I think this
> issue with tags is possible to get fixed without re-doing the import.
If you can fix the current git hub repository, great.
> I'm scared by the possibility the we re-import stuff, fix the tags, everybody
> swithches, people complain how good it was back then with CVS, ane one
> month down the road, we find that there is an issue with something else,
> that was not present in the previous import.
This is why we are testing things: We have found something wrong with
the current import, and it wasn't immediately obvious (partly because
we were still getting to know git and github).
> I think this is becoming a bit chaotic now. We still haven't removed the
> first github conversion: (biopython_old branch: is anyone using it anyway?),
The old conversion's deletion is still in progress, it must have stalled:
http://support.github.com/discussions/repos/485-reposiotry-stuck-in-rename
>?there is this semi-official one that has a (fixable in my opinion) issue with
> tags ...
If we can fix the tags, great. If we can also remap the authors to
their git usernames, even better.
> ... and now there is a new one made by Peter.
I deleted that one - it was just a proof of principle.
> In summary:
> I have no objections to using any particular tool for importing stuff to git.
> I don't like the idea of not even trying to fix the problem we have
> but instantly changing the tool we are using.
It was really to demonstrate to my own satisfaction that we could have
the tags in the history properly.
> I consider now re-importing stuff a major problem: everybody will need to port
> their changes which is work.
True - but this was always a possibility. From browsing the github
network this really will just affect basically just two people:
* Eric - quite a few changes, some of which we can probably look at
merging into CVS now which would solve that.
* Giovanni - quite a few changes (on a couple of files) on one branch,
and a couple of other branches for proposed unit tests
Also:
* Dave Bridges - documentation changes to one file which we can merge
into CVS and then he can delete that branch
* Tiago - trivial changes to one file (stats in PopGen)
* Peter (me) - I have a few test branches, nothing I care about.
Brad, Bartek and Leighton have no changes made.
Peter
From cy at cymon.org Wed Apr 22 05:48:08 2009
From: cy at cymon.org (Cymon Cox)
Date: Wed, 22 Apr 2009 10:48:08 +0100
Subject: [Biopython-dev] Bio.Application interface
Message-ID: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com>
>From reading the previous discussion on the list, I gather there is a
preference for removing helper functions to the Bio.Application command line
interfaces, such that the user interface would be something like:
from Bio import Application
from Bio.Align.Applications import MafftCommandline
cmd = MafftCommandline()
cmd.set_parameter("input", "sample.fa")
[etc...]
i, o, e = Application.generic_run(cmd)
ie the user explicitly sets the cl parameters.
Ive written Application.AbstractCommandline for both MUSCLE and MAFFT.
However, each of these programmes uses a variation on the parameter styles
not easily covered by the current _AbstractParameter classes _Option and
_Argument. The _Option class deals with parameters of the type "-
-append=yes" and "-a yes", and the _Argument returns just the value to the
command line, ie cmd.set_parameter("input", "sample.fa") puts just
"sample.fa" on the cl.
A muscle command might be:
"muscle -in Fasta/f002 -out Fasta/temp_align_out2.fa -objscore sp
-noanchors"
ie with a "-noanchors" command, currently the parameter would need to be an
_Argument and set using:
cmd.set_parameter("noanchors", "-noanchors")
A MAFFT command might be:
"mafft - -maxiterate 200 - -nofft myInputData.fa"
ie with a "- -nofft" parameter which would need to be an _Argument and set
using:
cmd.set_parameter("nofft", "- -nofft")
and a "- -maxiterate 200" parameter which _Option doesnt cover, that is "-
-" params always have an "=" before the value.
So, it looks like a _OptionNoEquals parameter class is required to cover the
"- -param value", and I would suggest a _ArgumentName class that returns the
parameter name to the command line such that:
cmd.set_parameter("- -nofit") returns "- -nofit" to the cl, and
cmd.set_parameter("- -nofit", value) raises and error via the
checker_function
As and aside, MAFFT also has a:
"mafft - -seed file1 - -seed file2 inputData.fa" ie mulitple number of -
-seed parameters which is not covered by the current interface.
Cheers, C.
--
From biopython at maubp.freeserve.co.uk Wed Apr 22 06:26:11 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 22 Apr 2009 11:26:11 +0100
Subject: [Biopython-dev] Bio.Application interface
In-Reply-To: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com>
References: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com>
Message-ID: <320fb6e00904220326t4058427csa061696eeb4e1cdb@mail.gmail.com>
On Wed, Apr 22, 2009 at 10:48 AM, Cymon Cox wrote:
> From reading the previous discussion on the list, I gather there is a
> preference for removing helper functions to the Bio.Application command line
> interfaces, such that the user interface would be something like:
>
> from Bio import Application
> from Bio.Align.Applications import MafftCommandline
> cmd = MafftCommandline()
> cmd.set_parameter("input", "sample.fa")
> [etc...]
> i, o, e = Application.generic_run(cmd)
>
> ie the user explicitly sets the cl parameters.
Yes, that would fit my preference for giving the user direct access to
the command line as a string, to invoke as they choose. We might
want to discuss extending the AbstractCommandline __init__ method
to take **kwargs, allowing the parameters to be set like this:
from Bio import Application
from Bio.Align.Applications import MafftCommandline
cmd = MafftCommandline(input="sample.fa", ...)
return_code, std_handle, err_handle = Application.generic_run(cmd)
I'm not sure how well this would work in practice as the range of
validate argument names in python may not overlap with the valid
parameter names.
> Ive written Application.AbstractCommandline for both MUSCLE and MAFFT.
> However, each of these programmes uses a variation on the parameter styles
> not easily covered by the current _AbstractParameter classes _Option and
> _Argument. The _Option class deals with parameters of the type "-
> -append=yes" and "-a yes", ...
> A muscle command might be:
> "muscle -in Fasta/f002 -out Fasta/temp_align_out2.fa -objscore sp
> -noanchors"
> ie with a "-noanchors" command
Those kind of options which don't take a value are really common on
Unix, I suspect we already have things like this in the other wrappers.
I'd guess they just use the _Option class and omit the value.
> So, it looks like a _OptionNoEquals parameter class is required to cover the
> "- -param value", and I would suggest a _ArgumentName class that returns the
> parameter name to the command line such that:
>
> cmd.set_parameter("- -nofit") returns "- -nofit" to the cl, and
> cmd.set_parameter("- -nofit", value) raises and error via the
> checker_function
You are right, a subclass of _Option which checks there is no value argument
could be sensible. Maybe _OptionNoValue rather than _OptionNoEquals?
Peter
From peter at maubp.freeserve.co.uk Wed Apr 22 07:00:19 2009
From: peter at maubp.freeserve.co.uk (Peter)
Date: Wed, 22 Apr 2009 12:00:19 +0100
Subject: [Biopython-dev] Nice small test case for fuzzy locations
Message-ID: <320fb6e00904220400m5c18ad42gbe301b739d54ce99@mail.gmail.com>
Hi all,
This is a nice small GenBank file with fuzzy locations, joins, and fuzzy joins:
ftp://ftp.ncbi.nih.gov/genomes/Protozoa/Cryptosporidium_parvum/NC_006980.gbk
I think this will make an excellent test case, see new unittest based
Tests/test_SeqIO_feature.py which we can extend to include GFF or PTT
files when they are in Bio.SeqIO too. The good news is our non-fuzzy
locations appear to be doing just what GenBank does - you did a good
job there Brad :)
If anyone comes across a better example file let us know (i.e. also
very small, but with between positions, one of position etc as well).
Peter
From biopython at maubp.freeserve.co.uk Wed Apr 22 09:30:00 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 22 Apr 2009 14:30:00 +0100
Subject: [Biopython-dev] Bio.Application interface
In-Reply-To: <7265d4f0904220623g8f688a6t7a8aa40db240baf4@mail.gmail.com>
References: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com>
<320fb6e00904220326t4058427csa061696eeb4e1cdb@mail.gmail.com>
<7265d4f0904220623g8f688a6t7a8aa40db240baf4@mail.gmail.com>
Message-ID: <320fb6e00904220630u3c109a58qfee7f6ede100c6c9@mail.gmail.com>
On Wed, Apr 22, 2009 at 2:23 PM, Cymon Cox wrote:
> 2009/4/22 Peter
>>
>> On Wed, Apr 22, 2009 at 10:48 AM, Cymon Cox wrote:
>>
>> > Ive written Application.AbstractCommandline for both MUSCLE and MAFFT.
>> > However, each of these programmes uses a variation on the parameter
>> > styles
>> > not easily covered by the current _AbstractParameter classes _Option and
>> > _Argument. The _Option class deals with parameters of the type "-
>> > -append=yes" and "-a yes", ...
>> > A muscle command might be:
>> > "muscle -in Fasta/f002 -out Fasta/temp_align_out2.fa -objscore sp
>> > -noanchors"
>> > ie with a "-noanchors" command
>>
>> Those kind of options which don't take a value are really common on
>> Unix, ?I suspect we already have things like this in the other wrappers.
>> I'd guess they just use the _Option class and omit the value.
>
> Yes, I see now... they need to be _Options with a "lambda x: 0" value
> checker function - for some reason was trying to force them into _Argument
>
> This is the current _Option class:
> ...
> So _Option covers: "- -param=value", "-param value", "-param", "- -param"
>
> What it doesnt cover is "- -param value" and "-param=value"
> ...
This might be a silly question, but do you actually these exact option
layouts for MUSCLE and MAFFT? Many Unix tools use something like
libopt and will actually take slight variations, and may also offer short
and long names for the same option. Perhaps the existing option code
in Bio.Application will suffice?
Peter
From mjldehoon at yahoo.com Wed Apr 22 10:31:48 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Wed, 22 Apr 2009 07:31:48 -0700 (PDT)
Subject: [Biopython-dev] SwissProt parsing inconsistency between
Bio.SeqIO, Bio.SwissProt
In-Reply-To: <320fb6e00904211306u50955608ndccef5d0cb6ba09b@mail.gmail.com>
Message-ID: <218724.9949.qm@web62406.mail.re1.yahoo.com>
--- On Tue, 4/21/09, Peter Cock wrote:
> Thinking this over, we should take that SwissProt file and
> load it into BioSQL using BioPerl, and see how they dealt
> with the DE lines, and try and do the same for Bio.SeqIO
> in order that loading it into BioSQL with Biopython gives
> more or less the same thing.
Good point. Does anybody know how BioPerl stores SwissProt files in SQL databases? I know neither Perl nor SQL ...
--Michiel
From p.j.a.cock at googlemail.com Wed Apr 22 10:44:23 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 22 Apr 2009 15:44:23 +0100
Subject: [Biopython-dev] SwissProt parsing inconsistency between
Bio.SeqIO, Bio.SwissProt
In-Reply-To: <218724.9949.qm@web62406.mail.re1.yahoo.com>
References: <320fb6e00904211306u50955608ndccef5d0cb6ba09b@mail.gmail.com>
<218724.9949.qm@web62406.mail.re1.yahoo.com>
Message-ID: <320fb6e00904220744s1c88c725nb1fa607ce10df723@mail.gmail.com>
On Wed, Apr 22, 2009 at 3:31 PM, Michiel de Hoon wrote:
>
> --- On Tue, 4/21/09, Peter Cock wrote:
>
>> Thinking this over, we should take that SwissProt file and
>> load it into BioSQL using BioPerl, and see how they dealt
>> with the DE lines, and try and do the same for Bio.SeqIO
>> in order that loading it into BioSQL with Biopython gives
>> more or less the same thing.
>
> Good point. Does anybody know how BioPerl stores SwissProt files in SQL databases? I know neither Perl nor SQL ...
>
Not off hand, but I know enough about BioPerl to be able to load the
file into a BioSQL database. I'll post back later (but probably not
today).
Peter
From bugzilla-daemon at portal.open-bio.org Wed Apr 22 12:14:47 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 22 Apr 2009 12:14:47 -0400
Subject: [Biopython-dev] [Bug 2819] New: Bio.SeqIO support for NCBI protein
tables (*.ptt files)
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2819
Summary: Bio.SeqIO support for NCBI protein tables (*.ptt files)
Product: Biopython
Version: Not Applicable
Platform: All
OS/Version: All
Status: NEW
Severity: enhancement
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk
On their FTP site the NCBI provide a range of files for each
genome/plasmid/chromosome, e.g.
ftp://ftp.ncbi.nih.gov/genomes/Protozoa/Cryptosporidium_parvum/
The *.ptt files are simple tab separated tables listing all the proteins. They
correspond to the CDS features in the GenBank file.
This enhancement bug is about adding "ptt" as an input file format in Bio.SeqIO
(and potentially as an output format too), where a single ptt file gives a
single SeqRecord object containing a SeqFeature object for each protein. The
header line gives the sequence length, so an UnknownSeq can be used for the
SeqRecrd's seq property.
One example application of this would be to draw a GenomeDiagram showing the
protein locations. This can be done using the SeqFeature objects from parsing
a GenBank file, but using the ptt file will be much faster.
See earlier suggestions on the mailing list (part of the GFF thread):
http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005725.html
http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005745.html
Patch to follow...
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Wed Apr 22 12:15:26 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 22 Apr 2009 12:15:26 -0400
Subject: [Biopython-dev] [Bug 2819] Bio.SeqIO support for NCBI protein
tables (*.ptt files)
In-Reply-To:
Message-ID: <200904221615.n3MGFQZi027802@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2819
------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-22 12:15 EST -------
Created an attachment (id=1282)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1282&action=view)
New file Bio/SeqIO/ProteinTableIO.py
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Wed Apr 22 12:16:37 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 22 Apr 2009 12:16:37 -0400
Subject: [Biopython-dev] [Bug 2819] Bio.SeqIO support for NCBI protein
tables (*.ptt files)
In-Reply-To:
Message-ID: <200904221616.n3MGGbXh027904@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2819
------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-22 12:16 EST -------
Created an attachment (id=1283)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1283&action=view)
Patch to Bio/SeqIO/__init__.py to use "ptt" files for input
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Wed Apr 22 12:19:15 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 22 Apr 2009 12:19:15 -0400
Subject: [Biopython-dev] [Bug 2819] Bio.SeqIO support for NCBI protein
tables (*.ptt files)
In-Reply-To:
Message-ID: <200904221619.n3MGJF3V028128@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2819
------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-22 12:19 EST -------
Created an attachment (id=1284)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1284&action=view)
Patch to Tests/test_SeqIO_features.py to check "genbank" vs "ptt" parsing
Requires additional input files from the NCBI to go in Tests/GenBank,
ftp://ftp.ncbi.nih.gov/genomes/Protozoa/Cryptosporidium_parvum/NC_006980.ptt
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Yersinia_pestis_biovar_Microtus_91001/NC_005816.ptt
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From biopython at maubp.freeserve.co.uk Wed Apr 22 12:24:36 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 22 Apr 2009 17:24:36 +0100
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <320fb6e00904130516x25bf24d8ib76ec36171e93fd8@mail.gmail.com>
References: <20090408124908.GN43636@sobchak.mgh.harvard.edu>
<830379.9837.qm@web62402.mail.re1.yahoo.com>
<320fb6e00904130516x25bf24d8ib76ec36171e93fd8@mail.gmail.com>
Message-ID: <320fb6e00904220924x38466ac1sc80fe344eec1b200@mail.gmail.com>
On Mon, Apr 13, 2009 at 1:16 PM, Peter wrote:
> I don't think the GFF parser should only return SeqRecord object, but
> I do see a use for this (via Bio.SeqIO). ?GFF files could be
> represented as a list of SeqFeature objects, and using a SeqRecord to
> hold this seems very natural to me. ?It also means we could use
> Bio.SeqIO to load a GFF file into SeqRecord objects for storage in a
> BioSQL database.
>
> If you look at the NCBI FTP site, they often provide genome sequences
> in a range of file formats including GenBank and GFF.
>
> e.g.
> ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/
>
> The GenBank files contain the features plus the sequence,
> ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.gbk
>
> Their GFF3 file only contains the features:
> ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.gff
>
> Some GFF files will include the sequence too, in this case we can
> fetch it in FASTA format:
> ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.fna
>
> In principle, you could parse this FASTA file and the GFF3 file and
> put together a GenBank file - or vice versa.
>
> As an aside, I would also consider adding protein table support on the
> same lines, look at this file:
> ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.ptt
> The header information gives us the genome size, so Bio.SeqIO could
> return a SeqRecord with lots of SeqFeature objects and for the
> SeqRecord's seq property use a Bio.Seq.UnknownSeq of length 4639675bp.
> ?This is something I might look at implementing myself after Biopython
> 1.50 is out. ?We should be able to read in a GenBank file and output a
> PTT file, and verify it matches the NCBI provided version of the PTT
> file.
There is a working NCBI protein table ("ptt") format parser for Bio.SeqIO
on Bug 2819 including unit tests.
http://bugzilla.open-bio.org/show_bug.cgi?id=2819
Hopefully this will be useful in integrating the GFF/GFF3 parser into
Bio.SeqIO, as well as being worth while in its own right. This "ptt"
parser should work fine with BioSQL and GenomeDiagram, offering
a light weight alternative to parsing the GenBank or GFF3 file when
all you care about is the locations of the proteins (CDS features).
Peter
From cy at cymon.org Wed Apr 22 13:00:38 2009
From: cy at cymon.org (Cymon Cox)
Date: Wed, 22 Apr 2009 18:00:38 +0100
Subject: [Biopython-dev] Bio.Application interface
In-Reply-To: <320fb6e00904220630u3c109a58qfee7f6ede100c6c9@mail.gmail.com>
References: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com>
<320fb6e00904220326t4058427csa061696eeb4e1cdb@mail.gmail.com>
<7265d4f0904220623g8f688a6t7a8aa40db240baf4@mail.gmail.com>
<320fb6e00904220630u3c109a58qfee7f6ede100c6c9@mail.gmail.com>
Message-ID: <7265d4f0904221000x61a95187l39a4bc7ae7dc685b@mail.gmail.com>
2009/4/22 Peter
> On Wed, Apr 22, 2009 at 2:23 PM, Cymon Cox wrote:
> > 2009/4/22 Peter
> >>
> >> On Wed, Apr 22, 2009 at 10:48 AM, Cymon Cox wrote:
> >>
> >> > Ive written Application.AbstractCommandline for both MUSCLE and MAFFT.
> >> > However, each of these programmes uses a variation on the parameter
> >> > styles
> >> > not easily covered by the current _AbstractParameter classes _Option
> and
> >> > _Argument. The _Option class deals with parameters of the type "-
> >> > -append=yes" and "-a yes", ...
> >> > A muscle command might be:
> >> > "muscle -in Fasta/f002 -out Fasta/temp_align_out2.fa -objscore sp
> >> > -noanchors"
> >> > ie with a "-noanchors" command
> >>
> >> Those kind of options which don't take a value are really common on
> >> Unix, I suspect we already have things like this in the other wrappers.
> >> I'd guess they just use the _Option class and omit the value.
> >
> > Yes, I see now... they need to be _Options with a "lambda x: 0" value
> > checker function - for some reason was trying to force them into
> _Argument
> >
> > This is the current _Option class:
> > ...
> > So _Option covers: "- -param=value", "-param value", "-param", "- -param"
> >
> > What it doesnt cover is "- -param value" and "-param=value"
> > ...
>
> This might be a silly question, but do you actually these exact option
> layouts for MUSCLE and MAFFT? Many Unix tools use something like
> libopt and will actually take slight variations, and may also offer short
> and long names for the same option. Perhaps the existing option code
> in Bio.Application will suffice?
MAFFT uses "--param value" style options, and won't accept "--param=value"
or "-param value" as alternatives. Neither use "-param=value", but if more
applications it may turn up.
C.
>
>
> Peter
>
--
____________________________________________________________________
Cymon J. Cox
Centro de Ciencias do Mar
Faculdade de Ciencias do Mar e Ambiente (FCMA)
Universidade do Algarve
Campus de Gambelas
8005-139 Faro
Portugal
Phone: +0351 289800909 ext 7909
Fax: +0351 289800051
Email: cy at cymon.org, cymon at ualg.pt, cymon.cox at gmail.com
HomePage : http://biology.duke.edu/bryology/cymon.html
-8.63/-6.77
From biopython at maubp.freeserve.co.uk Wed Apr 22 17:25:35 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 22 Apr 2009 22:25:35 +0100
Subject: [Biopython-dev] Bio.Application interface
In-Reply-To: <7265d4f0904221000x61a95187l39a4bc7ae7dc685b@mail.gmail.com>
References: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com>
<320fb6e00904220326t4058427csa061696eeb4e1cdb@mail.gmail.com>
<7265d4f0904220623g8f688a6t7a8aa40db240baf4@mail.gmail.com>
<320fb6e00904220630u3c109a58qfee7f6ede100c6c9@mail.gmail.com>
<7265d4f0904221000x61a95187l39a4bc7ae7dc685b@mail.gmail.com>
Message-ID: <320fb6e00904221425t6d0a4d42ga544ebb0e79b02ec@mail.gmail.com>
On Wed, Apr 22, 2009 at 6:00 PM, Cymon Cox wrote:
>>
>> This might be a silly question, but do you actually these exact option
>> layouts for MUSCLE and MAFFT? Many Unix tools use something like
>> libopt and will actually take slight variations, and may also offer short
>> and long names for the same option. Perhaps the existing option code
>> in Bio.Application will suffice?
>
> MAFFT uses "--param value" style options, and won't accept "--param=value"
> or "-param value" as alternatives.
OK. Then yes, we should support that. Brad, as Bio.Application is your
module, would you like to comment?
>
> Neither use "-param=value", but if more applications it may turn up.
>
I don't think I have ever see a command line application that used that.
Peter
From chapmanb at 50mail.com Wed Apr 22 18:44:01 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 22 Apr 2009 18:44:01 -0400
Subject: [Biopython-dev] Bio.Application interface
In-Reply-To: <320fb6e00904221425t6d0a4d42ga544ebb0e79b02ec@mail.gmail.com>
References: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com>
<320fb6e00904220326t4058427csa061696eeb4e1cdb@mail.gmail.com>
<7265d4f0904220623g8f688a6t7a8aa40db240baf4@mail.gmail.com>
<320fb6e00904220630u3c109a58qfee7f6ede100c6c9@mail.gmail.com>
<7265d4f0904221000x61a95187l39a4bc7ae7dc685b@mail.gmail.com>
<320fb6e00904221425t6d0a4d42ga544ebb0e79b02ec@mail.gmail.com>
Message-ID: <20090422224401.GC34546@sobchak.mgh.harvard.edu>
Peter and Cymon;
> >> This might be a silly question, but do you actually these exact option
> >> layouts for MUSCLE and MAFFT? Many Unix tools use something like
> >> libopt and will actually take slight variations, and may also offer short
> >> and long names for the same option. Perhaps the existing option code
> >> in Bio.Application will suffice?
> >
> > MAFFT uses "--param value" style options, and won't accept "--param=value"
> > or "-param value" as alternatives.
>
> OK. Then yes, we should support that. Brad, as Bio.Application is your
> module, would you like to comment?
My comment is: I think it is awesome MAFFT made up their own way
of doing the command line.
Seriously, y'all are doing the right thing. Add a new class to
Bio.Application: _OptionAlt or whatever you'd like to call MAFFT's
inventive new way to specify command line arguments. Adapt the
__str__ from _Option to do it the "--param val" way in this class.
Then use this for your MAFFT commandline.
I believe I just summarized your discussion, so you can replace this
whole message with +1.
Brad
From winda002 at student.otago.ac.nz Wed Apr 22 22:14:31 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Thu, 23 Apr 2009 14:14:31 +1200
Subject: [Biopython-dev] main page on wiki
Message-ID: <49EFCF07.2050502@student.otago.ac.nz>
Hi all,
As you probably know the main page of the wiki
(http://biopython.org/wiki/Main_Page) is the first place someone washes
up when they google 'biopython'. As part of this "news coordinator" idea
I have made an alternative version of the main page
(http://biopython.org/wiki/User:Davidw/homepage) which acts a bit more
as a "portal" for the wiki/project. This is born from my own experience
with the wiki as a newcomer; it took me a long time to cotton on to the
fact there was a navigation box on each page so I didn't realise what
the website had to offer (this may say more about me than the design of
the front page).
Which version would you like to see as the main page? Obviously this
isn't an either-or thing, my 'mock-up' version can be edited by anyone
with an account on the wiki (the main page is protected for obvious
reasons) so any ideas that you have can be incorporated to that one
(older versions of the page are all saved so you can edit as bravely as
you like).
Thanks,
David
From sbassi at clubdelarazon.org Wed Apr 22 21:53:09 2009
From: sbassi at clubdelarazon.org (Sebastian Bassi)
Date: Wed, 22 Apr 2009 22:53:09 -0300
Subject: [Biopython-dev] main page on wiki
In-Reply-To: <49EFCF07.2050502@student.otago.ac.nz>
References: <49EFCF07.2050502@student.otago.ac.nz>
Message-ID: <9e2f512b0904221853m50cb1a62l72c297b84936a2d8@mail.gmail.com>
On Wed, Apr 22, 2009 at 11:14 PM, David Winter
wrote:
> Which version would you like to see as the main page? Obviously this isn't
I liked the new version. I would add (if I knew how to do it) some
icons near each of:
Get Started Get help Contribute
From argriffi at ncsu.edu Wed Apr 22 21:42:21 2009
From: argriffi at ncsu.edu (alex)
Date: Wed, 22 Apr 2009 21:42:21 -0400
Subject: [Biopython-dev] main page on wiki
In-Reply-To: <49EFCF07.2050502@student.otago.ac.nz>
References: <49EFCF07.2050502@student.otago.ac.nz>
Message-ID: <49EFC77D.5070307@ncsu.edu>
David Winter wrote:
> Hi all,
>
> As you probably know the main page of the wiki
> (http://biopython.org/wiki/Main_Page) is the first place someone washes
> up when they google 'biopython'. As part of this "news coordinator" idea
> I have made an alternative version of the main page
> (http://biopython.org/wiki/User:Davidw/homepage) which acts a bit more
> as a "portal" for the wiki/project. This is born from my own experience
> with the wiki as a newcomer; it took me a long time to cotton on to the
> fact there was a navigation box on each page so I didn't realise what
> the website had to offer (this may say more about me than the design of
> the front page).
>
> Which version would you like to see as the main page? Obviously this
> isn't an either-or thing, my 'mock-up' version can be edited by anyone
> with an account on the wiki (the main page is protected for obvious
> reasons) so any ideas that you have can be incorporated to that one
> (older versions of the page are all saved so you can edit as bravely as
> you like).
>
> Thanks,
> David
I like your version better than the current main page.
From idoerg at gmail.com Wed Apr 22 23:49:39 2009
From: idoerg at gmail.com (Iddo Friedberg)
Date: Wed, 22 Apr 2009 20:49:39 -0700
Subject: [Biopython-dev] main page on wiki
In-Reply-To: <9e2f512b0904221853m50cb1a62l72c297b84936a2d8@mail.gmail.com>
References: <49EFCF07.2050502@student.otago.ac.nz>
<9e2f512b0904221853m50cb1a62l72c297b84936a2d8@mail.gmail.com>
Message-ID: <49EFE553.6070405@gmail.com>
I second Sebastian on the icons, and third Sebastian and Alex on
preferring David's take on a main page.
Sebastian Bassi wrote:
> On Wed, Apr 22, 2009 at 11:14 PM, David Winter
> wrote:
>> Which version would you like to see as the main page? Obviously this isn't
>
> I liked the new version. I would add (if I knew how to do it) some
> icons near each of:
> Get Started Get help Contribute
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
--
Iddo Friedberg Ph.D.
Atkinson Hall MC 0446
University of California San Diego
9500 Gilman Dr.
La Jolla, CA 92093-0446 USA
http://iddo-friedberg.net
From biopython at maubp.freeserve.co.uk Thu Apr 23 05:16:44 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 23 Apr 2009 10:16:44 +0100
Subject: [Biopython-dev] main page on wiki
In-Reply-To: <49EFE553.6070405@gmail.com>
References: <49EFCF07.2050502@student.otago.ac.nz>
<9e2f512b0904221853m50cb1a62l72c297b84936a2d8@mail.gmail.com>
<49EFE553.6070405@gmail.com>
Message-ID: <320fb6e00904230216l4b2cd91k513baa3f7a08172c@mail.gmail.com>
On Thu, Apr 23, 2009 at 4:49 AM, Iddo Friedberg wrote:
> I second Sebastian on the icons, and third Sebastian and Alex on preferring
> David's take on a main page.
Are you all looking at the *current* home page which already has a few
of David's suggestions (in particular the news feed on the right), or
the old version from memory?
Also, what size screens do you all have? It should ideally look OK on
small screens or windows (e.g. 1024 by 768 is what my laptop uses,
which isn't that old). From playing with my window size, it should be
OK - the proposed layout seems quite flexible :)
If there are no counter comments, I'll put David's changes up later
today or tomorrow.
Peter
From biopython at maubp.freeserve.co.uk Thu Apr 23 05:29:04 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 23 Apr 2009 10:29:04 +0100
Subject: [Biopython-dev] Bio.Application interface
In-Reply-To: <20090422224401.GC34546@sobchak.mgh.harvard.edu>
References: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com>
<320fb6e00904220326t4058427csa061696eeb4e1cdb@mail.gmail.com>
<7265d4f0904220623g8f688a6t7a8aa40db240baf4@mail.gmail.com>
<320fb6e00904220630u3c109a58qfee7f6ede100c6c9@mail.gmail.com>
<7265d4f0904221000x61a95187l39a4bc7ae7dc685b@mail.gmail.com>
<320fb6e00904221425t6d0a4d42ga544ebb0e79b02ec@mail.gmail.com>
<20090422224401.GC34546@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00904230229o7efcfbe0ld2da94f10bd1b3b8@mail.gmail.com>
On Wed, Apr 22, 2009 at 11:44 PM, Brad Chapman wrote:
> Peter and Cymon;
>
> My comment is: I think it is awesome MAFFT made up their own way
> of doing the command line.
Was that sarcasm Brad?
> Seriously, y'all are doing the right thing. Add a new class to
> Bio.Application: _OptionAlt or whatever you'd like to call MAFFT's
> inventive new way to specify command line arguments. Adapt the
> __str__ from _Option to do it the "--param val" way in this class.
> Then use this for your MAFFT commandline.
Maybe _DoubleDashOption for the class name? I haven't looked
at this closely enough to have a firm opinion - but as this will be a
private class anyway, the name doesn't matter so much.
> I believe I just summarized your discussion, so you can replace this
> whole message with +1.
:)
What about this bit I wrote earlier:
>> ... We might want to discuss extending the AbstractCommandline
>> __init__ method to take **kwargs, allowing the parameters to be
>> set like this:
>>
>> from Bio import Application
>> from Bio.Align.Applications import MafftCommandline
>> cmd = MafftCommandline(input="sample.fa", ...)
>> return_code, std_handle, err_handle = Application.generic_run(cmd)
>>
>> I'm not sure how well this would work in practice as the range of
>> valid argument names in python may not overlap with the valid
>> parameter names.
We'll have to see how well the above idea works in practice - it
may not be general enough to be useful.
Also, perhaps we can automatically generate properties for each
argument allowing this:
cmd.input = "sample.fa"
rather than:
cmd.set_parameter("input", "sample.fa")
For the "switch" type arguments which take no value, if these are
implemented with a separate option class (maybe _Switch or
_OptionNoValue) then rather than:
cmd.set_parameter("noanchors")
we might want to do:
cmd.noanchors = True
and allow the switch to be removed with:
cmd.noanchors = False
i.e. For those arguments which take no argument (is "switch" the
right term here?), evaluate the property set value as a boolean to
add/remove -noanchors from the command line string.
I think using properties in this way could make the command line
object more intuitive, but again python puts limits on property names
which might mean for some arguments you'd have to use the
set_parameter version.
Peter
From bugzilla-daemon at portal.open-bio.org Thu Apr 23 05:39:11 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 23 Apr 2009 05:39:11 -0400
Subject: [Biopython-dev] [Bug 2819] Bio.SeqIO support for NCBI protein
tables (*.ptt files)
In-Reply-To:
Message-ID: <200904230939.n3N9dBZ5000718@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2819
------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-23 05:39 EST -------
Just to note that Bio/SeqIO/ProteinTableIO.py needs a minor improvement to cope
with one special case - features which wrap the origin, e.g. NEQ001 in
Nanoarchaeum equitans.
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans/NC_005213.gbk
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans/NC_005213.ptt
This is the first CDS in the GenBank file, location given as:
complement(join(490883..490885,1..879))
It is the last entry in the Protein Table file,
490883..879 - ...
All my code needs to do is spot when start > end, and then add the two
appropriate sub-features (using the known genome length, 490885) and set the
location operator to join (to match what the GenBank parser does). I'll do
this at some point assuming there is interest in adding this parser to
Bio.SeqIO.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From chapmanb at 50mail.com Thu Apr 23 08:36:35 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 23 Apr 2009 08:36:35 -0400
Subject: [Biopython-dev] SeqFeature and FeatureLocation objects (was
Bio.GFF)
In-Reply-To: <320fb6e00904210651j3b3cf753m1a143b490e7370c1@mail.gmail.com>
References: <320fb6e00904210605t5a5af0ek7b2e4e8cb549cd2d@mail.gmail.com>
<320fb6e00904210651j3b3cf753m1a143b490e7370c1@mail.gmail.com>
Message-ID: <20090423123635.GD34546@sobchak.mgh.harvard.edu>
Hi all;
> > Unless you are thinking of having an object representation as being too
> > heavy, the non-light part of SeqFeature is all the FeatureLocation
> > fuzziness.
>
> I've just had a quick go at what should be a 100% backwards compatible
> modification to the FeatureLocation class to store ExactPosition start
> or end positions as integers. The idea should be more memory
> efficient, using the complex position objects only when required.
I like the idea here but I would go a step further and get rid of
FeatureLocation, collapsing the start and end location onto the
SeqFeature itself. FeatureLocation is basically just a holder for a
start and end coordinates. In this version, you would store the
positions plus extensions and fuzzy type on the Feature, and then
instantiate fuzzy objects on demand.
I took a look at the resource usage of these objects versus
a lightweight implementation. For a GFF file with 70k features, the
maximum memory usage is 128M versus 111M for the lightweight
version. So the improvement is rather modest, ~15%.
> I forgot to mention the second major use case I'm concerned about,
> which is recovering the GenBank/EMBL style location string. I have
> looked at this in the past, by adding methods to the FeatureLocation
> and all the Position objects, but it is complicated by the fact the
> Position objects don't know if they are at the start or end (and for
> the start locations we need to add one to convert from Python
> counting). This is the main block on having Bio.SeqIO support writing
> GenBank (or EMBL) files with their features included.
I admittedly haven't looked at this in a while, but this was
designed to be round tripped. The GenBank Record class can be
written out back in GenBank format, and test_GenBank explicitly
checks that the start and end records are the same.
Brad
From chapmanb at 50mail.com Thu Apr 23 08:53:56 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 23 Apr 2009 08:53:56 -0400
Subject: [Biopython-dev] Rolling new releases
In-Reply-To: <320fb6e00904210543x79c021f3j3f3272841b16d1c6@mail.gmail.com>
References: <320fb6e00904200203h31f3976es76703a0cd85ee41a@mail.gmail.com>
<320fb6e00904200511r794bc959i6b6719f406b5ca96@mail.gmail.com>
<320fb6e00904200755h4a32ed91pe2969810e68f256d@mail.gmail.com>
<8b34ec180904200808p28a2ea9ew8e16a25ae3557744@mail.gmail.com>
<320fb6e00904200904x61cc3d5dj7d9f88c3ec3a7dd6@mail.gmail.com>
<20090421122045.GD30529@sobchak.mgh.harvard.edu>
<320fb6e00904210543x79c021f3j3f3272841b16d1c6@mail.gmail.com>
Message-ID: <20090423125356.GE34546@sobchak.mgh.harvard.edu>
Hi all;
> > It would also be worth thinking about what the worst parts of
> > building the releases are and seeing if we can automate or eliminate
> > them. A few things that I can think of:
[Brainstorming a few suggestions]
I feel like I derailed from the main point by making suggestions.
Separate from a debate about betas and version support and
documentation -- how can we make releases easier to roll?
Peter, this started when you mentioned that rolling the release
felt kind of painful and it would be great if others would pitch in.
The idea of soliciting volunteers as release coordinators is great.
In addition to that, we should think about streamlining the release
process -- what are the parts we can get rid of and still have high
quality releases? Peter, since you are doing them right now, what
are your thoughts?
Brad
From lpritc at scri.ac.uk Thu Apr 23 09:43:43 2009
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Thu, 23 Apr 2009 14:43:43 +0100
Subject: [Biopython-dev] main page on wiki
In-Reply-To: <49EFC77D.5070307@ncsu.edu>
Message-ID:
On 23/04/2009 02:42, "alex" wrote:
> David Winter wrote:
>> Hi all,
[...]
>> Which version would you like to see as the main page?
> I like your version better than the current main page.
+1 I like the layout.
Sebastian's idea for icons is also good.
--
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405
______________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.
The Scottish Crop Research Institute is a charitable company limited by guarantee.
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.
DISCLAIMER:
This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that
addressee.
If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on
this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system.
Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any).
______________________________________________________
From p.j.a.cock at googlemail.com Thu Apr 23 09:58:58 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 23 Apr 2009 14:58:58 +0100
Subject: [Biopython-dev] Rolling new releases
In-Reply-To: <20090423125356.GE34546@sobchak.mgh.harvard.edu>
References: <320fb6e00904200203h31f3976es76703a0cd85ee41a@mail.gmail.com>
<320fb6e00904200511r794bc959i6b6719f406b5ca96@mail.gmail.com>
<320fb6e00904200755h4a32ed91pe2969810e68f256d@mail.gmail.com>
<8b34ec180904200808p28a2ea9ew8e16a25ae3557744@mail.gmail.com>
<320fb6e00904200904x61cc3d5dj7d9f88c3ec3a7dd6@mail.gmail.com>
<20090421122045.GD30529@sobchak.mgh.harvard.edu>
<320fb6e00904210543x79c021f3j3f3272841b16d1c6@mail.gmail.com>
<20090423125356.GE34546@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00904230658y310609c8l89bf27c33bd56d56@mail.gmail.com>
On Thu, Apr 23, 2009 at 1:53 PM, Brad Chapman wrote:
> Hi all;
>
>> > It would also be worth thinking about what the worst parts of
>> > building the releases are and seeing if we can automate or eliminate
>> > them. A few things that I can think of:
>
> [Brainstorming a few suggestions]
>
> I feel like I derailed from the main point by making suggestions.
> Separate from a debate about betas and version support and
> documentation -- how can we make releases easier to roll?
>
> Peter, this started when you mentioned that rolling the release
> felt kind of painful and it would be great if others would pitch in.
> The idea of soliciting volunteers as release coordinators is great.
I didn't mean painful, so much as time consuming - but this was mostly
coordinating final polish/bug fixes and documentation. This kind of
thing requires some debate and judgement calls, and will be different
for every release. I spent quite a lot of time on documentation for
things which I really wanted to get into the Tutorial that shipped
with the release (some of which should have happened earlier, so this
was partly my own fault).
In terms of getting the documentation updated for each release, this
would be less effort if we as a group were more diligent about putting
things in the tutorial and/or docstrings as we go along. It's
important that nice new features are demonstrated, otherwise no-one
will know they are there without reading the code itself or from
following the mailing list discussions carefully.
> In addition to that, we should think about streamlining the release
> process -- what are the parts we can get rid of and still have high
> quality releases? Peter, since you are doing them right now, what
> are your thoughts?
The complicated bit is getting the code and documentation in CVS
ready, and that is harder to delegate. Once that is done though, the
actual release process is fairly straight forward - as documented here
- and could be delegated to anyone methodical with suitably setup
development machine(s):
http://biopython.org/wiki/Building_a_release
Maybe some of the release process could be automated literally as a
script - but doing each step methodically by hand and checking as you
go is wise.
For the release process, I'm basically proposing splitting this up
into up to three jobs:
(1) Coordinating final bug fixes and documentation in CVS. This has
recently been handled by me or Michiel with most discussion on the dev
lists, and some module specific details off list, and this works and I
wouldn't change it.
(2) Once CVS is ready, building the documentation, doing the release
archives, doing epydoc, doing the Windows installers, tagging CVS, and
uploading to the website. Part of the job would include scanning the
NEWS and DEPRECATED files, plus recent documentation to make sure
nothing was missed. This can be delegated.
(3) Writing and publishing the release announcement on the news site
and email lists (with the timing coordinated with the people doing
jobs 1 and 2). I suggest having our new news coordinators take over
this bit.
So, while historically (1), (2) and (3) have be done by one person I
think this could be split up into the "Release Director", "Release
Manager" and "News Coordinator" roles (perhaps with different job
titles?).
Peter
From p.j.a.cock at googlemail.com Thu Apr 23 10:06:14 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 23 Apr 2009 15:06:14 +0100
Subject: [Biopython-dev] SeqFeature and FeatureLocation objects (was
Bio.GFF)
In-Reply-To: <20090423123635.GD34546@sobchak.mgh.harvard.edu>
References: <320fb6e00904210605t5a5af0ek7b2e4e8cb549cd2d@mail.gmail.com>
<320fb6e00904210651j3b3cf753m1a143b490e7370c1@mail.gmail.com>
<20090423123635.GD34546@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00904230706j213d6a47iadc6722581e52588@mail.gmail.com>
On Thu, Apr 23, 2009 at 1:36 PM, Brad Chapman wrote:
> Hi all;
>
>> > Unless you are thinking of having an object representation as being too
>> > heavy, the non-light part of SeqFeature is all the FeatureLocation
>> > fuzziness.
>>
>> I've just had a quick go at what should be a 100% backwards compatible
>> modification to the FeatureLocation class to store ExactPosition start
>> or end positions as integers. ?The idea should be more memory
>> efficient, using the complex position objects only when required.
>
> I like the idea here but I would go a step further and get rid of
> FeatureLocation, collapsing the start and end location onto the
> SeqFeature itself. FeatureLocation is basically just a holder for a
> start and end coordinates. In this version, you would store the
> positions plus extensions and fuzzy type on the Feature, and then
> instantiate fuzzy objects on demand.
>
> I took a look at the resource usage of these objects versus
> a lightweight implementation. For a GFF file with 70k features, the
> maximum memory usage is 128M versus 111M for the lightweight
> version. So the improvement is rather modest, ~15%.
Thanks for that. Perhaps the variant idea using a using a single
reference for each location would save more (currently is uses two
references, one for the object and one for the integer - so in general
we are wasting memory on a pointer to None).
Certainly merging the SeqFeature and FeatureLocation should save even
more memory. We could do this with full backward compatibility by
generating the FeatureLocation object on request (using a property
method for the SeqFeature's location), and this can also trigger a
deprecation warning. We'd have to think about what to do with the
SeqFeature's __init__ method more carefully.
>> I forgot to mention the second major use case I'm concerned about,
>> which is recovering the GenBank/EMBL style location string. ?I have
>> looked at this in the past, by adding methods to the FeatureLocation
>> and all the Position objects, but it is complicated by the fact the
>> Position objects don't know if they are at the start or end (and for
>> the start locations we need to add one to convert from Python
>> counting). ?This is the main block on having Bio.SeqIO support writing
>> GenBank (or EMBL) files with their features included.
>
> I admittedly haven't looked at this in a while, but this was
> designed to be round tripped. The GenBank Record class can be
> written out back in GenBank format, and test_GenBank explicitly
> checks that the start and end records are the same.
Yes - The Bio.GenBank.Record class should round-trip, from memory it
stores feature locations as string.
I'm interested in writing a SeqRecord out as a GenBank file (which
already do, but without the features). This would let you do things
like load an EMBL or GFF3 file as a SeqRecord, and output it as a
GenBank file.
Peter
From cy at cymon.org Thu Apr 23 10:32:10 2009
From: cy at cymon.org (Cymon Cox)
Date: Thu, 23 Apr 2009 15:32:10 +0100
Subject: [Biopython-dev] Bio.Application interface
In-Reply-To: <20090422224401.GC34546@sobchak.mgh.harvard.edu>
References: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com>
<320fb6e00904220326t4058427csa061696eeb4e1cdb@mail.gmail.com>
<7265d4f0904220623g8f688a6t7a8aa40db240baf4@mail.gmail.com>
<320fb6e00904220630u3c109a58qfee7f6ede100c6c9@mail.gmail.com>
<7265d4f0904221000x61a95187l39a4bc7ae7dc685b@mail.gmail.com>
<320fb6e00904221425t6d0a4d42ga544ebb0e79b02ec@mail.gmail.com>
<20090422224401.GC34546@sobchak.mgh.harvard.edu>
Message-ID: <7265d4f0904230732i124670ebvf859b2e27943ba37@mail.gmail.com>
2009/4/22 Brad Chapman
> Peter and Cymon;
>
> > >> This might be a silly question, but do you actually these exact option
> > >> layouts for MUSCLE and MAFFT? Many Unix tools use something like
> > >> libopt and will actually take slight variations, and may also offer
> short
> > >> and long names for the same option. Perhaps the existing option code
> > >> in Bio.Application will suffice?
> > >
> > > MAFFT uses "--param value" style options, and won't accept
> "--param=value"
> > > or "-param value" as alternatives.
> >
> > OK. Then yes, we should support that. Brad, as Bio.Application is your
> > module, would you like to comment?
>
> My comment is: I think it is awesome MAFFT made up their own way
> of doing the command line.
I think you'll be likewise inspired by the MUSCLE command line parsing:
[cymon at chara mafft]$ muscle -in Tests/Fasta/f002 -anchorspacing -cluster1
upgmb
Command-line option "upgmb" must start with '-'
But of course, these two are perfectly acceptable:
[cymon at chara mafft]$ muscle -in Tests/Fasta/f002 -anchorspacing
--cluster1=upgmb
[cymon at chara mafft]$ muscle -in Tests/Fasta/f002 -anchorspacing
on-balance-I-think-Ill-go-home -cluster1 upgmb
At present, there is no current way to force a value argument to an option
so cmd.set_parameter("-anchorspacing") is acceptable in the interface. But,
in general, I assume the idea is not 'save' the user from niceties of the
particular programme command line, ie in command line interface I'm allowing
users to set parameters which either dont work or crash the programme...
> Seriously, y'all are doing the right thing. Add a new class to
> Bio.Application: _OptionAlt or whatever you'd like to call MAFFT's
> inventive new way to specify command line arguments. Adapt the
> __str__ from _Option to do it the "--param val" way in this class.
> Then use this for your MAFFT commandline.
class _OptionAlt(_AbstractParameter):
"""Represent an option that can be set for a program.
This holds UNIXish options like:
--append yes
--append
"""
def __str__(self):
"""Return the value of this option for the commandline.
"""
if self.names[0].find("--") >= 0:
output = "%s" % self.names[0]
if self.value is not None:
output += " %s " % self.value
else:
output += " "
else:
raise ValueError("Unrecognized option type: %s" % self.names[0])
return output
C.
--
____________________________________________________________________
Cymon J. Cox
Centro de Ciencias do Mar
Faculdade de Ciencias do Mar e Ambiente (FCMA)
Universidade do Algarve
Campus de Gambelas
8005-139 Faro
Portugal
Phone: +0351 289800909 ext 7909
Fax: +0351 289800051
Email: cy at cymon.org, cymon at ualg.pt, cymon.cox at gmail.com
HomePage : http://biology.duke.edu/bryology/cymon.html
-8.63/-6.77
From bugzilla-daemon at portal.open-bio.org Thu Apr 23 11:22:36 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 23 Apr 2009 11:22:36 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application MUSCLE command line
interface
In-Reply-To:
Message-ID: <200904231522.n3NFMal6026332@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2815
cymon.cox at gmail.com changed:
What |Removed |Added
----------------------------------------------------------------------------
Attachment #1280 is|0 |1
obsolete| |
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Apr 23 11:23:05 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 23 Apr 2009 11:23:05 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application MUSCLE command line
interface
In-Reply-To:
Message-ID: <200904231523.n3NFN5va026431@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2815
cymon.cox at gmail.com changed:
What |Removed |Added
----------------------------------------------------------------------------
Attachment #1279 is|0 |1
obsolete| |
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Apr 23 11:25:43 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 23 Apr 2009 11:25:43 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application MUSCLE command line
interface
In-Reply-To:
Message-ID: <200904231525.n3NFPhPH026661@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2815
------- Comment #3 from cymon.cox at gmail.com 2009-04-23 11:25 EST -------
Created an attachment (id=1285)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1285&action=view)
Bio.Align.Applications.py text
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Apr 23 11:32:34 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 23 Apr 2009 11:32:34 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application MUSCLE command line
interface
In-Reply-To:
Message-ID: <200904231532.n3NFWYkO027258@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2815
------- Comment #4 from cymon.cox at gmail.com 2009-04-23 11:32 EST -------
Created an attachment (id=1286)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1286&action=view)
Patch for Bio.Applications __init__.py
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Apr 23 11:33:09 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 23 Apr 2009 11:33:09 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application MUSCLE command line
interface
In-Reply-To:
Message-ID: <200904231533.n3NFX9kw027294@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2815
------- Comment #5 from cymon.cox at gmail.com 2009-04-23 11:33 EST -------
MUSCLE and MAFFT Bio.Application command lines
Patch for Bio.Applications __init__py
to add _OptionAlt class covering "--param value" style options
C.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Apr 23 11:43:04 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 23 Apr 2009 11:43:04 -0400
Subject: [Biopython-dev] [Bug 2754] Bio.PDB: Parse warnings should print to
stderr, not stdout
In-Reply-To:
Message-ID: <200904231543.n3NFh4cT028184@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2754
------- Comment #13 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-23 11:43 EST -------
In comment #3 Bruce wrote:
>
> I believe that we should be using the using Python warnings module for these
> types of messages:
> http://docs.python.org/library/warnings.html
>
> This permits the user to have a greater control over the output and also
> allows redirecting the output as required. In the Bio directory, there are
> currently 36 and 25 uses of stderr and stdout, respectively.
>
> In terms of the patch, my limited understanding is that local import sys will
> override any global redirection of the output which in my opinion is a bad
> idea.
Good points, and yes, using the warnings module here (and probably elsewhere in
Biopython) makes sense.
Eric wrote in comment #9:
> Yes, something must be done with test_PDB.py, because I don't think
> warnings.warn can be made to play nice with that print-and-compare test
> -- or any print-and-compare, since the warning messages contain extra
> environment-specific information.
I was able to solve this with the following trick:
import warnings
def send_warnings_to_stdout(message, category, filename, lineno, file=None):
print message
warnings.showwarning = send_warnings_to_stdout
This now prints *just* the message text without the stack trace information
etc. This also means it looks like any other output from the print-and-compare
test, to test_PDB.py required only a trivial change.
Note that I haven't taken Eric's patches/branch as is - for one thing I wanted
to use the same import style as elsewhere in Biopython:
i.e.
import warnings
warnings.warn("Message")
rather than:
from warnings import warn
warn("Message")
However, I think we can now close Bug 2754. Eric - please try the latest code
from CVS (or the mirror on github).
Also, could you also open separate bug(s) for the other issues, such as your
new unittest based version of test_PDB.py?
Thanks,
Peter
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From biopython at maubp.freeserve.co.uk Thu Apr 23 12:34:27 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 23 Apr 2009 17:34:27 +0100
Subject: [Biopython-dev] How are people doing their git merges from the
trunk?
Message-ID: <320fb6e00904230934kac2ca0eoc8ba284fbe2c382e@mail.gmail.com>
Hi all,
We have the CVS trunk mirrored here:
http://github.com/biopython/biopython/tree/master
I have a copy of this in my github account here,
http://github.com/peterjc/biopython/tree/master
I decided that I would (initially at least) treat my master branch as
a copy of the master branch, and not commit local changes to this
branch. Instead I periodically grab the latest commits from the
master using the commands:
#Do this once only:
#git remote add official_dist git://github.com/biopython/biopython.git
echo Checking out my local master branch...
git checkout master
echo Updating my local master branch with the official dist...
git pull official_dist master
echo Status:
git status
echo Pushing to my github master branch...
git push origin master
This means the github network diagram only advances by one step, even
if the operation combined 10s of individual commits (which are still
shown individually on my history on github).
Alternatively, I could have used github's cherry pick interface (the
fork queue), or used git cherry pick at the command line. I can see
this is useful if you only want to pick out a few patches. Is there
any reason to use this when you want all the commits from another
branch? Bartek's latest activity on the github network is a series of
points - I think this means he did a "cherry pick", and selected most
(maybe even all) of the changes from the main trunk. Am I
interpreting this right?
Thanks
Peter
From biopython at maubp.freeserve.co.uk Thu Apr 23 17:21:41 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 23 Apr 2009 22:21:41 +0100
Subject: [Biopython-dev] Fwd: Where to put command line wrappers
In-Reply-To: <20090417140241.GD16092@sobchak.mgh.harvard.edu>
References: <320fb6e00904160945p3e15b6daoaa7891dbf9e144ec@mail.gmail.com>
<320fb6e00904161016i63c6d149xf414725ee489bfdb@mail.gmail.com>
<8b34ec180904161034j3a5a778dsfbca8ba2a57d3e88@mail.gmail.com>
<8b34ec180904161037wea0373cnde2964b25f5c357@mail.gmail.com>
<320fb6e00904161153h70f54b36p94c09e23733c941@mail.gmail.com>
<20090417140241.GD16092@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00904231421k3e18d0b1y9003614e906fcb1c@mail.gmail.com>
On Fri, Apr 17, 2009 at 3:02 PM, Brad Chapman wrote:
> Hi all;
>
> [Where to put the commandline objects]
>> > I think that there is a difference between EMBOSS and
>> > Bio.[Motif|Align]. In EMBOSS we have a very nicely comoditized
>> > set of tools with similar interfaces, while both for multiple
>> > alignment and motif searching the tools vary a lot. In case of
>> > multiple alignments this is only with respect to parameters and
>> > output format, while in motif searching there is also a lot of
>> > differences in the types of input (background models etc.).
>>
>> That is a good argument for using Bio/Align/Applications/XXX.py and
>> Bio/Motif/Applications/XXX.py while also having
>> Bio/EMBOSS/Applications.py
>
> There is a natural tension between overgeneralizing and dumping
> too much into one file. At one end you have deeply nested Java-like
> directories with a few lines of code in each file. I tend towards the
> "more in a single file and less nesting" camp. My vote would be that
> if the Motif Applications file will only contain commandline
> wrappers, they could live in one file.
OK, what I propose is that the command line objects are exposed as
Bio.Align.Applications.MuscleCommandline,
Bio.Align.Applications.ClustalwCommandline, etc but that the
implementations live in Bio/Align/Applications/_Muscle.py,
_Clustalw.py etc. To do this the Bio/Align/Applications/__init__.py
file will look like this:
from _Muscle import MuscleCommandline
from _Clustalw import ClustalwCommandline
This avoids having a single massive file, yet keeps the public
namespace simple. For the user, they do this:
from Bio.Align.Applications import MuscleCommandline
cline = MuscleCommandline(...)
or if they prefer,
from Bio.Align import Applications
cline = Applications.MuscleCommandline(...)
>From the user's point of view all the alignment command line wrapper
objects live together under Bio.Align.Applications.
This will be consistent with the public API for the EMBOSS wrappers
where you can do:
from Bio.Emboss.Applications import Primer3Commandline
cline = Primer3Commandline(...)
or variants like that.
For Bio.Motif.Applications we can do the same as for
Bio.Align.Applications, or if there are only one or two wrappers
initially put the classes directly in
Bio/Motif/Applications/__init__.py and then split them into private
files later on if the file gets too big.
Peter
From bugzilla-daemon at portal.open-bio.org Thu Apr 23 17:52:08 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 23 Apr 2009 17:52:08 -0400
Subject: [Biopython-dev] [Bug 2820] New: Convert test_PDB.py to unittest
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2820
Summary: Convert test_PDB.py to unittest
Product: Biopython
Version: Not Applicable
Platform: PC
OS/Version: Linux
Status: NEW
Severity: enhancement
Priority: P3
Component: Unit Tests
AssignedTo: biopython-dev at biopython.org
ReportedBy: eric.talevich at gmail.com
The current test script for Bio.PDB uses the print-and-compare approach. I've
written an equivalent test script using unittest, assuming that style is the
preferred one.
It was written to go with Bug 2754, but now lives on my pdbtidy branch:
http://github.com/etal/biopython/tree/pdbtidy
This script could also live alongside the original test_PDB.py for awhile, as
an additional check on Bio.PDB's error handling.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Apr 23 18:01:16 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 23 Apr 2009 18:01:16 -0400
Subject: [Biopython-dev] [Bug 2754] Bio.PDB: Parse warnings should print to
stderr, not stdout
In-Reply-To:
Message-ID: <200904232201.n3NM1GW1025781@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2754
------- Comment #14 from eric.talevich at gmail.com 2009-04-23 18:01 EST -------
(In reply to comment #13)
> I think we can now close Bug 2754. Eric - please try the latest code
> from CVS (or the mirror on github).
Works for me. I'll delete the bug2754 branch from github.
> Also, could you also open separate bug(s) for the other issues, such as your
> new unittest based version of test_PDB.py?
I opened Bug 2820 for the unittest version of test_PDB.py. The script itself is
living on my pdbtidy branch at Tests/test_PDB_unit.py now, although one of the
tests broke during the merge (there were a lot of conflicts).
I'll open bugs for the other changes once I figure out which modifications are
worth sharing.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Apr 23 18:26:38 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 23 Apr 2009 18:26:38 -0400
Subject: [Biopython-dev] [Bug 2754] Bio.PDB: Parse warnings should print to
stderr, not stdout
In-Reply-To:
Message-ID: <200904232226.n3NMQcPf027372@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2754
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #15 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-23 18:26 EST -------
(In reply to comment #14)
> (In reply to comment #13)
> > I think we can now close Bug 2754. Eric - please try the latest code
> > from CVS (or the mirror on github).
>
> Works for me.
>
Great - marking this bug as fixed :)
>
> I'll delete the bug2754 branch from github.
>
OK - it has served its purpose now :)
> > Also, could you also open separate bug(s) for the other issues,
> > such as your new unittest based version of test_PDB.py?
>
> I opened Bug 2820 for the unittest version of test_PDB.py. The script itself
> is living on my pdbtidy branch at Tests/test_PDB_unit.py now, although one
> of the tests broke during the merge (there were a lot of conflicts).
Thanks.
> I'll open bugs for the other changes once I figure out which modifications
> are worth sharing.
Thank you :)
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From eric.talevich at gmail.com Thu Apr 23 18:54:02 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 23 Apr 2009 18:54:02 -0400
Subject: [Biopython-dev] How are people doing their git merges from the
trunk?
In-Reply-To: <320fb6e00904230934kac2ca0eoc8ba284fbe2c382e@mail.gmail.com>
References: <320fb6e00904230934kac2ca0eoc8ba284fbe2c382e@mail.gmail.com>
Message-ID: <3f6baf360904231554y7e28084ex79bb0a7f60b4cef7@mail.gmail.com>
On Thu, Apr 23, 2009 at 12:34 PM, Peter wrote:
>
> I decided that I would (initially at least) treat my master branch as
> a copy of the master branch, and not commit local changes to this
> branch. Instead I periodically grab the latest commits from the
> master using the commands:
I think this is the recommended way to do it. I read a thread where
Mercurial gurus recommended keeping a clean clone of the upstream
repository, and never committing to that clone. Git seems to have a cleaner
version of this with in-place branches.
After a few bad incidents with git-rebase, I resolved to keep 'master' in
sync with the biopython trunk, and use new named branches for all
modifications. The workflow is:
git checkout master
git pull origin # if I've pushed commits from a different computer
recently
git pull upstream master # upstream is the remote biopython/biopython
git push origin master
git checkout phyloxml # a local branch
git merge master
# hack, commit, repeat
# rebasing commits made in this session on this branch is still safe
git push origin phyloxml
This means the github network diagram only advances by one step, even
> if the operation combined 10s of individual commits (which are still
> shown individually on my history on github).
>
>
I think mine shows up as multiple dots, and I don't use cherry-pick. Pulling
from upstream on the master branch always results in a fast-forward, though.
From bugzilla-daemon at portal.open-bio.org Thu Apr 23 19:36:28 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 23 Apr 2009 19:36:28 -0400
Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest
In-Reply-To:
Message-ID: <200904232336.n3NNaSw6031547@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2820
------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-23 19:36 EST -------
(In reply to comment #0)
> The current test script for Bio.PDB uses the print-and-compare approach. I've
> written an equivalent test script using unittest, assuming that style is the
> preferred one.
Yes, in principle the unittest style is prefferred. In practice I am
pragmatic about this - a print-and-compare test is better than nothing,
and for some things is much easier to write.
> It was written to go with Bug 2754, but now lives on my pdbtidy branch:
> http://github.com/etal/biopython/tree/pdbtidy
>
> This script could also live alongside the original test_PDB.py for awhile, as
> an additional check on Bio.PDB's error handling.
I've checked in a slightly modified version as test_PDB_unit.py - I think
having both this and the original test_PDB.py is sensible in the short term.
You wrote on Bug 2754 comment 14 that "one of the tests broke during the
merge", was that this one:
def test_warnings(self):
"""Parse a flawed PDB file in permissive mode, with warnings"""
# Python 2.6+: rewrite this using warnings.catch_warnings
parser = PDBParser(PERMISSIVE=1)
msg_redef_n = r"Atom N defined twice in residue at line 19\."
msg_blank_alt = r"Blank altlocs in duplicate residue SER \(' ', 4, '
'\) at line 41\."
msg_redef_o = r"Atom O defined twice in residue at line 820\."
warnings.simplefilter('ignore')
# NB: Order is important here!
warnings.filterwarnings('error', msg_redef_n, PDBConstructionWarning)
self.assertRaises(PDBConstructionWarning,
parser.get_structure, "example", "PDB/a_structure.pdb")
warnings.filters.pop(0)
warnings.filterwarnings('error', msg_blank_alt, PDBConstructionWarning)
self.assertRaises(PDBConstructionWarning,
parser.get_structure, "example", "PDB/a_structure.pdb")
warnings.filters.pop(0)
warnings.filterwarnings('error', msg_redef_o, PDBConstructionWarning)
self.assertRaises(PDBConstructionWarning,
parser.get_structure, "example", "PDB/a_structure.pdb")
warnings.filters.pop(0)
warnings.filters.pop(0)
I tried but couldn't get this to work (on Python 2.4.3 on Linux), even with
plenty of warnings.resetwarnings() which seemed cleaner than popping things.
I agree with the idea that we should make sure particular errors do get raised
(this is checked by the print-and-compare test_PDB.py because we capture these
warnings to stdout), but right now how to make it work escapes me. Maybe after
a good night's sleep things will make sense ;)
Leaving this bug open to address this point.
Peter
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Apr 23 23:12:15 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 23 Apr 2009 23:12:15 -0400
Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest
In-Reply-To:
Message-ID: <200904240312.n3O3CFdn011360@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2820
------- Comment #2 from eric.talevich at gmail.com 2009-04-23 23:12 EST -------
(In reply to comment #1)
> You wrote on Bug 2754 comment 14 that "one of the tests broke during the
> merge", was that this one:
>
> def test_warnings(self):
> [...]
>
> I tried but couldn't get this to work (on Python 2.4.3 on Linux), even with
> plenty of warnings.resetwarnings() which seemed cleaner than popping things.
>
Yep, that's the one.
The behavior of the warnings module and resetwarnings() is pathological, I
think. If a warning is triggered before the warnings.simplefilter('always')
function is called, that specific warning will be silent until the interpreter
is restarted. That's why order is sensitive in that function, and why the three
exceptions aren't three separate functions. The attribute warnings.filters is a
list of filters that warnings are checked against as they're raised, and at
startup the list is not empty. Calling warnings.resetwarnings() just empties
this list, including the default filters and any use of 'ignore' or 'always'.
Maybe the popping was just voodoo and an empty filter list is fine... dunno.
Python 2.6 includes a context manager that makes all these problems
*completely* go away, by catching all of the warnings raised within a context
and optionally storing them as a list of warning objects that can be inspected.
Would you be interested in having a unit test that does a more thorough check
of the warnings system, but only runs on Py2.6? I'm guessing no, but hey, worth
a shot.
Most likely, some warnings just aren't being caught because my version of the
unit test assumed a different variety of warnings coming out of PDB. If that's
the case then it should be an easy fix and you can disregard my whining.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Apr 23 23:56:49 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 23 Apr 2009 23:56:49 -0400
Subject: [Biopython-dev] [Bug 2821] New: NCBIXML.parse only returns results
for non-empty hits rather than one per query sequence
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2821
Summary: NCBIXML.parse only returns results for non-empty hits
rather than one per query sequence
Product: Biopython
Version: 1.50b
Platform: Other
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: camilla at ip.id.au
I used NCBIStandalone.blastall to BLAST all records in query database VEKY.faa
(a FASTA-format file of 226 proteins) significantly similar in proteins in
target database VPOO.faa (a FASTA-format file of 80 proteins).
Many of the 'VEKY' proteins do not have a significant hit in the 'VPOO'
database (which is what I expect and this is fine).
To access the results, I iterate using a loop like the following to parse the
raw BLAST results in XML format:
blast_out = _open_file(outraw_file, 'r')
blast_records = NCBIXML.parse(blast_out)
for b_record in blast_records:
# deal with each record here
However, instead of getting 226 records as I expect, some of which have a
description of alignments field of length zero, this returns 64 records - the
records that did not have 'no hits'.
My problem is that I'd like to work out which VEKY query sequence each
'b_record' corresponds to. But so far I have not been able to find any such
information in the b_record. And because it doesn't produce one per query
sequence, I cannot infer that information from the order of the query sequences
in my input VEKY.faa file.
Do you know how I can get around this problem?
Warm thanks in advance for any help or tips,
Camilla
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Fri Apr 24 04:05:29 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 24 Apr 2009 04:05:29 -0400
Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest
In-Reply-To:
Message-ID: <200904240805.n3O85TqY030236@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2820
------- Comment #3 from dalloliogm at gmail.com 2009-04-24 04:05 EST -------
(In reply to comment #0)
> The current test script for Bio.PDB uses the print-and-compare approach. I've
> written an equivalent test script using unittest, assuming that style is the
> preferred one.
>
> It was written to go with Bug 2754, but now lives on my pdbtidy branch:
> http://github.com/etal/biopython/tree/pdbtidy
>
> This script could also live alongside the original test_PDB.py for awhile, as
> an additional check on Bio.PDB's error handling.
>
I also tried to write an unittest-based test for PDB exposure, just for playing
with it a bit:
-
http://github.com/dalloliogm/biopython/blob/7dabfff5f7b523479bf8d6de120d0f6c7d03f7df/Tests/test_PDBexposure.py
I used the approach where one unit test is equivalent to a PDB file, instead of
a set of functions.
For example:
- test case 1: PDB.NeighborSearch is able to read a random generated PDB file
- test case 2: PDB.NeighborSearch is able to read a pdb file with only one
structure
- test case 3: PDB.NeighborSearch is able to read another specific pdb case
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Fri Apr 24 04:06:43 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 24 Apr 2009 04:06:43 -0400
Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest
In-Reply-To:
Message-ID: <200904240806.n3O86h0q030360@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2820
------- Comment #4 from dalloliogm at gmail.com 2009-04-24 04:06 EST -------
(In reply to comment #3)
This has the advantage that you can write a base test class and then apply the
same tests to various files, by subclassing.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Fri Apr 24 05:02:35 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 24 Apr 2009 05:02:35 -0400
Subject: [Biopython-dev] [Bug 2821] NCBIXML.parse only returns results for
non-empty hits rather than one per query sequence
In-Reply-To:
Message-ID: <200904240902.n3O92Z68004987@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2821
------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-24 05:02 EST -------
What version of BLAST do you have, and (assuming its less than say 10 MB) could
you attach the XML file to this bug?
>From memory this is a limitation of the raw XML file from the NCBI - there is
no way to tell if there were additional queries with no hits (so Biopython
can't help directly). I have not checked BLAST 2.2.20, but had been meaning to
ask the NCBI about this. They may not regard it as a "bug", but it was
annoying.
I have used two workarounds in my own code.
(1) Load a list of the query IDs into memory, and as you go though the BLAST
results you can see which queries don't appear - and therefore had no hits.
(2) Use the .next() methods on a FASTA iterator on the query file, and the
NCBIXML iterator on the BLAST XML file to step through the two files in sync.
I have some code to do this somewhere... maybe I should turn this into a
cookbook recipe for the wiki: http://biopython.org/wiki/Category:Cookbook
Peter
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Fri Apr 24 05:07:48 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 24 Apr 2009 05:07:48 -0400
Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest
In-Reply-To:
Message-ID: <200904240907.n3O97mDU005535@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2820
------- Comment #5 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-24 05:07 EST -------
(In reply to comment #3)
>
> I also tried to write an unittest-based test for PDB exposure, just for
> playing with it a bit:
> ...
> I used the approach where one unit test is equivalent to a PDB file,
> instead of a set of functions.
Hi Giovanni,
Isn't Bug 2759 for the PDB exposure test? I was thinking of just adding that
to the new file test_PDB_unit.py, rather than making it into its own file.
Peter
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Fri Apr 24 05:15:05 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 24 Apr 2009 05:15:05 -0400
Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest
In-Reply-To:
Message-ID: <200904240915.n3O9F5hr006324@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2820
------- Comment #6 from dalloliogm at gmail.com 2009-04-24 05:15 EST -------
(In reply to comment #5)
> (In reply to comment #3)
> >
> > I also tried to write an unittest-based test for PDB exposure, just for
> > playing with it a bit:
> > ...
> > I used the approach where one unit test is equivalent to a PDB file,
> > instead of a set of functions.
>
> Hi Giovanni,
>
> Isn't Bug 2759 for the PDB exposure test? I was thinking of just adding that
> to the new file test_PDB_unit.py, rather than making it into its own file.
>
> Peter
Ok, of course :)
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Fri Apr 24 05:50:33 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 24 Apr 2009 05:50:33 -0400
Subject: [Biopython-dev] [Bug 2759] Unit test for Bio.PDB.HSExposure
In-Reply-To:
Message-ID: <200904240950.n3O9oXlV008332@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2759
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Attachment #1234 is|0 |1
obsolete| |
------- Comment #10 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-24 05:50 EST -------
(From update of attachment 1234)
I have checked this initial exposure test in as part of new file
test_PDB_unit.py (created for Bug 2820).
Leaving this bug open to look at Martin and/or Giovanni's
improvements/extensions.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Fri Apr 24 05:59:09 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 24 Apr 2009 05:59:09 -0400
Subject: [Biopython-dev] [Bug 2820] Convert test_PDB.py to unittest
In-Reply-To:
Message-ID: <200904240959.n3O9x9M8008849@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2820
------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-24 05:59 EST -------
(In reply to comment #2)
>
> Yep, that's the one.
>
> The behavior of the warnings module and resetwarnings() is pathological, I
> think. If a warning is triggered before the warnings.simplefilter('always')
> function is called, that specific warning will be silent until the interpreter
> is restarted. That's why order is sensitive in that function, and ...
> Calling warnings.resetwarnings() just empties this list, including the
> default filters and any use of 'ignore' or 'always'.
The reduced warning test in CVS was working until I added more unit tests (for
Bug 2759). This changed the test order, and the warnings were no longer being
triggered. I tried a few things like setting warnings.defaultaction="always"
at the top of the file, and adding and warnings. onceregistry={} to the test
method, but I have given up. We need to be able to *completely* reset the
warnings module for this approach to work.
> Python 2.6 includes a context manager that makes all these problems
> *completely* go away, by catching all of the warnings raised within a
> context and optionally storing them as a list of warning objects that
> can be inspected.
That sounds much better :)
> Would you be interested in having a unit test that does a more thorough
> check of the warnings system, but only runs on Py2.6? I'm guessing no,
> but hey, worth a shot.
Yes - other than using the old print-and-compare test, this seems worth doing
in order to actually test the warnings we expect are being issued. It could be
a whole new file, test_PDB_warnings.py which required Python 2.6+, but as its
just one or two tests, maybe just use conditional method(s) within the
test_PDB_unit.py file.
Peter
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From p.j.a.cock at googlemail.com Fri Apr 24 06:57:03 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 24 Apr 2009 11:57:03 +0100
Subject: [Biopython-dev] SeqFeature and FeatureLocation objects (was
Bio.GFF)
In-Reply-To: <20090423123635.GD34546@sobchak.mgh.harvard.edu>
References: <320fb6e00904210605t5a5af0ek7b2e4e8cb549cd2d@mail.gmail.com>
<320fb6e00904210651j3b3cf753m1a143b490e7370c1@mail.gmail.com>
<20090423123635.GD34546@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00904240357j29663811q80d4f7c2e7cf6382@mail.gmail.com>
On Thu, Apr 23, 2009 at 1:36 PM, Brad Chapman wrote:
> I took a look at the resource usage of these objects versus
> a lightweight implementation. For a GFF file with 70k features, the
> maximum memory usage is 128M versus 111M for the lightweight
> version. So the improvement is rather modest, ~15%.
How did you measure these memory figures?
And was your 15% comparison between the current "heavy" SeqFeature +
FeatureLocation system as in CVS, and my lightweight alternative
described earlier?
Peter
From cy at cymon.org Fri Apr 24 07:43:33 2009
From: cy at cymon.org (Cymon Cox)
Date: Fri, 24 Apr 2009 12:43:33 +0100
Subject: [Biopython-dev] Bio.Application interface
In-Reply-To: <320fb6e00904221425t6d0a4d42ga544ebb0e79b02ec@mail.gmail.com>
References: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com>
<320fb6e00904220326t4058427csa061696eeb4e1cdb@mail.gmail.com>
<7265d4f0904220623g8f688a6t7a8aa40db240baf4@mail.gmail.com>
<320fb6e00904220630u3c109a58qfee7f6ede100c6c9@mail.gmail.com>
<7265d4f0904221000x61a95187l39a4bc7ae7dc685b@mail.gmail.com>
<320fb6e00904221425t6d0a4d42ga544ebb0e79b02ec@mail.gmail.com>
Message-ID: <7265d4f0904240443rbabd763sf358200c48718b8@mail.gmail.com>
2009/4/22 Peter
> On Wed, Apr 22, 2009 at 6:00 PM, Cymon Cox wrote:
> >>
> >> This might be a silly question, but do you actually these exact option
> >> layouts for MUSCLE and MAFFT? Many Unix tools use something like
> >> libopt and will actually take slight variations, and may also offer
> short
> >> and long names for the same option. Perhaps the existing option code
> >> in Bio.Application will suffice?
> >
> > MAFFT uses "--param value" style options, and won't accept
> "--param=value"
> > or "-param value" as alternatives.
>
> OK. Then yes, we should support that. Brad, as Bio.Application is your
> module, would you like to comment?
>
> >
> > Neither use "-param=value", but if more applications it may turn up.
> >
>
> I don't think I have ever see a command line application that used that.
PRANK - Probabilistic Alignment Kit
http://www.ebi.ac.uk/goldman-srv/prank/prank/
Advanced usage: 'prank [optional parameters] -d=sequence_file [optional
parameters]'
Doesn't accept "-d sequence_file" or "- -d=sequence_file"
C.
--
From biopython at maubp.freeserve.co.uk Fri Apr 24 07:51:58 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 24 Apr 2009 12:51:58 +0100
Subject: [Biopython-dev] Bio.Application interface
In-Reply-To: <7265d4f0904240443rbabd763sf358200c48718b8@mail.gmail.com>
References: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com>
<320fb6e00904220326t4058427csa061696eeb4e1cdb@mail.gmail.com>
<7265d4f0904220623g8f688a6t7a8aa40db240baf4@mail.gmail.com>
<320fb6e00904220630u3c109a58qfee7f6ede100c6c9@mail.gmail.com>
<7265d4f0904221000x61a95187l39a4bc7ae7dc685b@mail.gmail.com>
<320fb6e00904221425t6d0a4d42ga544ebb0e79b02ec@mail.gmail.com>
<7265d4f0904240443rbabd763sf358200c48718b8@mail.gmail.com>
Message-ID: <320fb6e00904240451ma79b037y36315270a03724b@mail.gmail.com>
On Fri, Apr 24, 2009 at 12:43 PM, Cymon Cox wrote:
> 2009/4/22 Peter
>
>> On Wed, Apr 22, 2009 at 6:00 PM, Cymon Cox wrote:
>> >>
>> >> This might be a silly question, but do you actually these exact option
>> >> layouts for MUSCLE and MAFFT? ?Many Unix tools use something like
>> >> libopt and will actually take slight variations, and may also offer
>> >> short and long names for the same option. ?Perhaps the existing
>> >> option code in Bio.Application will suffice?
>> >
>> > MAFFT uses "--param value" style options, and won't accept
>> "--param=value"
>> > or "-param value" as alternatives.
>>
>> OK. ?Then yes, we should support that. ?Brad, as Bio.Application is your
>> module, would you like to comment?
>>
>> >
>> > Neither use "-param=value", but if more applications it may turn up.
>> >
>>
>> I don't think I have ever see a command line application that used that.
>
>
> PRANK - Probabilistic Alignment Kit
> http://www.ebi.ac.uk/goldman-srv/prank/prank/
>
> Advanced usage: 'prank [optional parameters] -d=sequence_file [optional
> parameters]'
>
> Doesn't accept "-d sequence_file" or "- -d=sequence_file"
I had misunderstood the quotes to be literally typed on the command line ;)
Peter
From biopython at maubp.freeserve.co.uk Fri Apr 24 08:39:51 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 24 Apr 2009 13:39:51 +0100
Subject: [Biopython-dev] How are people doing their git merges from the
trunk?
In-Reply-To: <3f6baf360904231554y7e28084ex79bb0a7f60b4cef7@mail.gmail.com>
References: <320fb6e00904230934kac2ca0eoc8ba284fbe2c382e@mail.gmail.com>
<3f6baf360904231554y7e28084ex79bb0a7f60b4cef7@mail.gmail.com>
Message-ID: <320fb6e00904240539n616a5e77s7ef4377c2cd4c336@mail.gmail.com>
On Thu, Apr 23, 2009 at 11:54 PM, Eric Talevich wrote:
> On Thu, Apr 23, 2009 at 12:34 PM, Peter wrote:
>
>> I decided that I would (initially at least) treat my master branch as
>> a copy of the master branch, and not commit local changes to this
>> branch. ?Instead I periodically grab the latest commits from the
>> master using the commands:
>
> I think this is the recommended way to do it. I read a thread where
> Mercurial gurus recommended keeping a clean clone of the upstream
> repository, and never committing to that clone. Git seems to have a cleaner
> version of this with in-place branches.
>
> After a few bad incidents with git-rebase, I resolved to keep 'master' in
> sync with the biopython trunk, and use new named branches for all
> modifications. The workflow is:
>
> git checkout master
> git pull origin ? ?# if I've pushed commits from a different computer
> recently
> git pull upstream master ? # upstream is the remote biopython/biopython
> git push origin master
Using "upstream" seems like a very sensible name, I assume you set up:
git remote add upstream git://github.com/biopython/biopython.git
Peter
From chapmanb at 50mail.com Fri Apr 24 08:45:15 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 24 Apr 2009 08:45:15 -0400
Subject: [Biopython-dev] SeqFeature and FeatureLocation objects
(was Bio.GFF)
In-Reply-To: <320fb6e00904240357j29663811q80d4f7c2e7cf6382@mail.gmail.com>
References: <320fb6e00904210605t5a5af0ek7b2e4e8cb549cd2d@mail.gmail.com>
<320fb6e00904210651j3b3cf753m1a143b490e7370c1@mail.gmail.com>
<20090423123635.GD34546@sobchak.mgh.harvard.edu>
<320fb6e00904240357j29663811q80d4f7c2e7cf6382@mail.gmail.com>
Message-ID: <20090424124515.GJ34546@sobchak.mgh.harvard.edu>
Hi Peter;
> > I took a look at the resource usage of these objects versus
> > a lightweight implementation. For a GFF file with 70k features, the
> > maximum memory usage is 128M versus 111M for the lightweight
> > version. So the improvement is rather modest, ~15%.
>
> How did you measure these memory figures?
With the unix 'time' command; those are the values reported by %M,
which is the maximum memory used during the process.
> And was your 15% comparison between the current "heavy" SeqFeature +
> FeatureLocation system as in CVS, and my lightweight alternative
> described earlier?
This was with an even lighter version. I just added start/end as
attributes to the SeqFeatures. So there was no FeatureLocation or
individual position objects. This was a hack to look at the best case
scenario to save memory. The baseline was the default SeqFeatures
before we started thinking about changing them.
> How does this version look? It should save more memory that the
> version I sent you three days ago, and again aims for 100% backwards
> compatibility - all the unit tests pass.
That is nice. Do we still want to keep a FeatureLocation, or
condense this all onto the SeqFeature itself?
Brad
From chapmanb at 50mail.com Fri Apr 24 08:47:06 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 24 Apr 2009 08:47:06 -0400
Subject: [Biopython-dev] How are people doing their git merges from
the trunk?
In-Reply-To: <320fb6e00904240539n616a5e77s7ef4377c2cd4c336@mail.gmail.com>
References: <320fb6e00904230934kac2ca0eoc8ba284fbe2c382e@mail.gmail.com>
<3f6baf360904231554y7e28084ex79bb0a7f60b4cef7@mail.gmail.com>
<320fb6e00904240539n616a5e77s7ef4377c2cd4c336@mail.gmail.com>
Message-ID: <20090424124706.GK34546@sobchak.mgh.harvard.edu>
Eric and Peter;
This is really good stuff. Can we add the details to the wiki? It
looks like this section could use the information from this thread:
http://biopython.org/wiki/GitUsage#Merging_upstream_changes
Brad
> On Thu, Apr 23, 2009 at 11:54 PM, Eric Talevich wrote:
> > On Thu, Apr 23, 2009 at 12:34 PM, Peter wrote:
> >
> >> I decided that I would (initially at least) treat my master branch as
> >> a copy of the master branch, and not commit local changes to this
> >> branch. ?Instead I periodically grab the latest commits from the
> >> master using the commands:
> >
> > I think this is the recommended way to do it. I read a thread where
> > Mercurial gurus recommended keeping a clean clone of the upstream
> > repository, and never committing to that clone. Git seems to have a cleaner
> > version of this with in-place branches.
> >
> > After a few bad incidents with git-rebase, I resolved to keep 'master' in
> > sync with the biopython trunk, and use new named branches for all
> > modifications. The workflow is:
> >
> > git checkout master
> > git pull origin ? ?# if I've pushed commits from a different computer
> > recently
> > git pull upstream master ? # upstream is the remote biopython/biopython
> > git push origin master
>
> Using "upstream" seems like a very sensible name, I assume you set up:
> git remote add upstream git://github.com/biopython/biopython.git
>
> Peter
>
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
From p.j.a.cock at googlemail.com Fri Apr 24 10:14:10 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 24 Apr 2009 15:14:10 +0100
Subject: [Biopython-dev] SeqFeature and FeatureLocation objects (was
Bio.GFF)
In-Reply-To: <20090424124515.GJ34546@sobchak.mgh.harvard.edu>
References: <320fb6e00904210605t5a5af0ek7b2e4e8cb549cd2d@mail.gmail.com>
<320fb6e00904210651j3b3cf753m1a143b490e7370c1@mail.gmail.com>
<20090423123635.GD34546@sobchak.mgh.harvard.edu>
<320fb6e00904240357j29663811q80d4f7c2e7cf6382@mail.gmail.com>
<20090424124515.GJ34546@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00904240714s3a0df8cfk75330fd4025c13a3@mail.gmail.com>
On Fri, Apr 24, 2009 at 1:45 PM, Brad Chapman wrote:
> With the unix 'time' command; those are the values reported by %M,
> which is the maximum memory used during the process.
>
You said 70k features, but how big was the file on disk?
>>
>> And was your 15% comparison between the current "heavy" SeqFeature +
>> FeatureLocation system as in CVS, and my lightweight alternative
>> described earlier?
>>
>
> This was with an even lighter version. I just added start/end as
> attributes to the SeqFeatures. So there was no FeatureLocation or
> individual position objects. This was a hack to look at the best case
> scenario to save memory. The baseline was the default SeqFeatures
> before we started thinking about changing them.
Right - so even if the FeatureLocation is a bit "heavy", getting rid of it
wouldn't make that much difference based on your simple profiling.
>> How does this version look? It should save more memory that the
>> version I sent you three days ago, and again aims for 100% backwards
>> compatibility - all the unit tests pass.
>
> That is nice. Do we still want to keep a FeatureLocation, or
> condense this all onto the SeqFeature itself?
For the moment I was exploring ways to avoid wasting memory in the
FeatureLocation object while retaining 100% compatibility. If your
simple profiling numbers are telling the whole story, then there isn't
a great deal of point in adding any internal complexity for a small
memory saving.
If we do want to preserve the current SeqFeature and FeatureLocation
API, then the proposal on Bug 2818 is a worthwhile incremental
improvement.
However, we can probably come up with something even nicer if we
change the SeqFeature and FeatureLocation in a non-backwards
compatible way. If we did change the API, I would want to stop using
the sub_features list to hold join information as child SeqFeatures.
I was thinking the FeatureLocation object should hold this, but
merging the SeqFeature and FeatureLocation could make sense. Are
there any other non-join location operators we really have to deal
with?
Internally the FeatureLocation (or SeqFeature) could have a list of
child locations held as a private list holding two entry tuples (start
and end positions). Typically for a non-join feature this will be
just _loc_list=[(start,end)], while more generally it would be
_loc_list=[(start1,end1),...,(startN,endN)]. The FeatureLocation (or
SeqFeature) would have (fuzzy/non-fuzzy) start and end properties
which would access _loc_list[0][0] for the start, and loc_list[-1][1]
for the end. I would still use the existing position objects to store
fuzzy positions.
Peter
From cy at cymon.org Fri Apr 24 11:31:28 2009
From: cy at cymon.org (Cymon Cox)
Date: Fri, 24 Apr 2009 16:31:28 +0100
Subject: [Biopython-dev] Bio.Application interface
In-Reply-To: <320fb6e00904240451ma79b037y36315270a03724b@mail.gmail.com>
References: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com>
<320fb6e00904220326t4058427csa061696eeb4e1cdb@mail.gmail.com>
<7265d4f0904220623g8f688a6t7a8aa40db240baf4@mail.gmail.com>
<320fb6e00904220630u3c109a58qfee7f6ede100c6c9@mail.gmail.com>
<7265d4f0904221000x61a95187l39a4bc7ae7dc685b@mail.gmail.com>
<320fb6e00904221425t6d0a4d42ga544ebb0e79b02ec@mail.gmail.com>
<7265d4f0904240443rbabd763sf358200c48718b8@mail.gmail.com>
<320fb6e00904240451ma79b037y36315270a03724b@mail.gmail.com>
Message-ID: <7265d4f0904240831p378a467dx34152951a3f16c52@mail.gmail.com>
2009/4/24 Peter
> >> > MAFFT uses "--param value" style options, and won't accept
> >> "--param=value"
> >> > or "-param value" as alternatives.
> >>
> >> OK. Then yes, we should support that. Brad, as Bio.Application is your
> >> module, would you like to comment?
> >>
> >> >
> >> > Neither use "-param=value", but if more applications it may turn up.
> >> >
> >>
> >> I don't think I have ever see a command line application that used that.
> >
> >
> > PRANK - Probabilistic Alignment Kit
> > http://www.ebi.ac.uk/goldman-srv/prank/prank/
> >
> > Advanced usage: 'prank [optional parameters] -d=sequence_file [optional
> > parameters]'
> >
> > Doesn't accept "-d sequence_file" or "- -d=sequence_file"
>
> I had misunderstood the quotes to be literally typed on the command line ;)
So the upshot is that both "- -param value" and "-param=value" need to be
supported.
Rather than add another variation on _Option, or alter _OptionAlt to cover
"-param=value", and as we only have a few command line interfaces at
present, I'd like to suggest the following simplification to _Option:
_AbstractParameter.__init__(:self, names = [], types = [], checker_function
= None,
is_required = 0, description = "", equate=True):
self.names = names
self.param_types = types
self.checker_function = checker_function
self.description = description
self.is_required = is_required
self.equate = equate
[...]
class _Option(_AbstractParameter):
"""Represent an option that can be set for a program.
This holds UNIXish options like:
--append=yes
--append yes
--append
-append=yes
-a yes
-append
"""
def __str__(self):
"""Return the value of this option for the commandline.
"""
output = "%s" % self.names[0]
if self.value is not None:
output += "%s%s " % \
(self.equate and "=" or " ", self.value)
return output
ie. add an equate flag
C.
--
From biopython at maubp.freeserve.co.uk Fri Apr 24 12:59:28 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 24 Apr 2009 17:59:28 +0100
Subject: [Biopython-dev] Bio.Application interface
In-Reply-To: <7265d4f0904240831p378a467dx34152951a3f16c52@mail.gmail.com>
References: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com>
<320fb6e00904220326t4058427csa061696eeb4e1cdb@mail.gmail.com>
<7265d4f0904220623g8f688a6t7a8aa40db240baf4@mail.gmail.com>
<320fb6e00904220630u3c109a58qfee7f6ede100c6c9@mail.gmail.com>
<7265d4f0904221000x61a95187l39a4bc7ae7dc685b@mail.gmail.com>
<320fb6e00904221425t6d0a4d42ga544ebb0e79b02ec@mail.gmail.com>
<7265d4f0904240443rbabd763sf358200c48718b8@mail.gmail.com>
<320fb6e00904240451ma79b037y36315270a03724b@mail.gmail.com>
<7265d4f0904240831p378a467dx34152951a3f16c52@mail.gmail.com>
Message-ID: <320fb6e00904240959k78d0805bo469dc9666c70d3c0@mail.gmail.com>
On Fri, Apr 24, 2009 at 4:31 PM, Cymon Cox wrote:
>
> So the upshot is that both "- -param value" and "-param=value" need to be
> supported.
>
> Rather than add another variation on _Option, or alter _OptionAlt to cover
> "-param=value", and as we only have a few command line interfaces at
> present, I'd like to suggest the following simplification to _Option:
> ...
> ie. add an equate flag
That looks very sensible. If there are no counter suggestions, I
think that could be checked in :)
Peter
From bugzilla-daemon at portal.open-bio.org Sat Apr 25 16:36:36 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sat, 25 Apr 2009 16:36:36 -0400
Subject: [Biopython-dev] [Bug 2817] Meta-bug for cleanup once we drop Python
2.3 support
In-Reply-To:
Message-ID: <200904252036.n3PKaa7G001530@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2817
------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-25 16:36 EST -------
Python 2.4+ should let us use the package_data option in setup.py to install
the data files needed for Bio.Entrez and Bio.PopGen (and, if we still include
it, Bio.EUtils).
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From biopython at maubp.freeserve.co.uk Sat Apr 25 19:30:15 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 26 Apr 2009 00:30:15 +0100
Subject: [Biopython-dev] Removing Bio.Mindy and Martel
Message-ID: <320fb6e00904251630t43ec275ehd25906476c6afe18@mail.gmail.com>
Hi all,
Bio.Mindy and Martel are the old "regular expressions on steroids"
parsing framework we used to use in Biopython, which needed the
external dependency mxTextTools (v2, we never got things to work fully
with v3). These modules were deprecated in Biopython 1.48 (Sept
2008), and I explicitly wrote in the release announcements for
Biopython 1.50 (and its beta) that this would be the final release to
include them.
I decided to do this in two steps (partly because of the number of
files involved). I've just removed Mindy and associated bits in CVS,
and everything looks fine from a setup and unit test point of view.
Next comes Martel and its remaining dependent modules. Martel is
still used in the following modules, which were also deprecated in
Biopython 1.48 (Sept 2008):
Bio.MetaTool (parser for output from an obsolete version of MetaTool)
Bio.Saf (an obscure alignment format)
Bio.NBRF (replaced with "pir" format in Bio.SeqIO)
Bio.IntelliGenetics (replaced with "ig" format in Bio.SeqIO)
We've actually had three releases where these modules have had a
deprecation warning in place, but not quite the full year as stated in
the written policy: http://biopython.org/wiki/Deprecation_policy
Does anyone have any objections about us pressing ahead with removing
Martel and these modules now?
Peter
From biopython at maubp.freeserve.co.uk Sun Apr 26 06:58:29 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 26 Apr 2009 11:58:29 +0100
Subject: [Biopython-dev] Bio.Application interface
In-Reply-To: <320fb6e00904240959k78d0805bo469dc9666c70d3c0@mail.gmail.com>
References: <7265d4f0904220248p49d66e2bve1495b63f62b3e49@mail.gmail.com>
<320fb6e00904220326t4058427csa061696eeb4e1cdb@mail.gmail.com>
<7265d4f0904220623g8f688a6t7a8aa40db240baf4@mail.gmail.com>
<320fb6e00904220630u3c109a58qfee7f6ede100c6c9@mail.gmail.com>
<7265d4f0904221000x61a95187l39a4bc7ae7dc685b@mail.gmail.com>
<320fb6e00904221425t6d0a4d42ga544ebb0e79b02ec@mail.gmail.com>
<7265d4f0904240443rbabd763sf358200c48718b8@mail.gmail.com>
<320fb6e00904240451ma79b037y36315270a03724b@mail.gmail.com>
<7265d4f0904240831p378a467dx34152951a3f16c52@mail.gmail.com>
<320fb6e00904240959k78d0805bo469dc9666c70d3c0@mail.gmail.com>
Message-ID: <320fb6e00904260358i424a6436v24f21e928fffc073@mail.gmail.com>
On Fri, Apr 24, 2009 at 5:59 PM, Peter wrote:
> On Fri, Apr 24, 2009 at 4:31 PM, Cymon Cox wrote:
>> Rather than add another variation on _Option, or alter _OptionAlt to cover
>> "-param=value", and as we only have a few command line interfaces at
>> present, I'd like to suggest the following simplification to _Option:
>> ...
>> ie. add an equate flag
>
> That looks very sensible. If there are no counter suggestions, I
> think that could be checked in :)
The equate argument is now in CVS.
One catch was that the old code used an equals on options starting
"--", e.g. "--apped=yes", but not on short options starting "-", e.g.
"-append yes" (a bit of magic based on the behaviour of typical Unix
tools?). From a grep for "_Option", the only files concerned are:
AlignAce/Applications.py
Application/__init__.py
Blast/Applications.py
Emboss/Applications.py
Motif/Applications/AlignAce.py
And from looking at these, they all use options with a single leading
dash, so for backwards compatibility I set equate to False by default
(not True as in your outlined code).
Does this work for you Cymon?
Peter
From bartek at rezolwenta.eu.org Sun Apr 26 08:08:51 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Sun, 26 Apr 2009 14:08:51 +0200
Subject: [Biopython-dev] history on github - where are the tags?
In-Reply-To: <320fb6e00904220223q119342aem70feb30e0f2cb747@mail.gmail.com>
References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com>
<320fb6e00904201036h4ff71aecmd971acfb9fd63410@mail.gmail.com>
<320fb6e00904210730v440bc483td21ca9aa118cc314@mail.gmail.com>
<8b34ec180904210810o77496205n8af85a58f0f83a49@mail.gmail.com>
<8b34ec180904210819n1b69b0aao53fc8d10c570abb8@mail.gmail.com>
<320fb6e00904210929k1e4f552fnf9154b0f8c30c1aa@mail.gmail.com>
<320fb6e00904210958y4f386a8ciba86cde9e7c1bc65@mail.gmail.com>
<8b34ec180904220152rf454c5crb20a6858ecbde468@mail.gmail.com>
<8b34ec180904220153p45f1004emf6e0f9b01ce9db1b@mail.gmail.com>
<320fb6e00904220223q119342aem70feb30e0f2cb747@mail.gmail.com>
Message-ID: <8b34ec180904260508w1974cfe1kc2ba6ce4ce57934e@mail.gmail.com>
Hi all,
On Wed, Apr 22, 2009 at 11:23 AM, Peter wrote:
>
> If you can fix the current git hub repository, great.
>
I 've finally found some time to fix the tag issue in our repository.
I've actually spent some time looking at git-rebase (learned a lot,
but nothing useful for our problem. Then I realised that since tags
are just references to commits, we need to move them to the trunk
(instead of re-basing the trunk).
Long story short - assuming you are in the directory of your git repo
you can fix any particular tag wit a single command. E.g. If you want
to fix the biopython-149 tag, you do:
git tag -f biopython-149 biopython-149~1
-f option enforces the replacement of the existing label, while
biopython-149~1 references the parent commit of our empty tag commit
(you can also use ~2 for a grand parent and so on).
You can see the effect of this procedure (as seen in gitx -- a very
nice tool) in the attached images.
If you want to fix all biopython tags, you simply do:
for t in `git tag|grep biopython`; do git tag -f $t $t~1; done
It works locally, the changes can be pushed back to github (need
--tags -f to force tag renames),
I've done this on my branch of biopython on github.
If there are no objections to the way tags are handled, I can try to
update the trunk.
This is a bit tricky, because I need to make the update scripts work
nicely with moving
the tags, but it should be doable.
> The old conversion's deletion is still in progress, it must have stalled:
> http://support.github.com/discussions/repos/485-reposiotry-stuck-in-rename
Seems to be gone now. that's one problem less :)
>
> If we can fix the tags, great. ?If we can also remap the authors to
> their git usernames, even better.
>
This is doable in the current setup. I don't know whether we need to
do this. The old commits
are signed by the same credentials (name, e-mail) as on CVS server. If
we start re-mapping them
now, we are going to have essentially a new commit history, so
everybody would need to rebase their
branches... I don't see a problem of having old commits signed with
old e-mails, and new commits
signed by new. Especially, that everybody can have multiple e-mails
assigned to their github account
(that's how I did with mine).
cheers
Bartek
-------------- next part --------------
A non-text attachment was scrubbed...
Name: before.png
Type: image/png
Size: 18855 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: after.png
Type: image/png
Size: 15150 bytes
Desc: not available
URL:
From biopython at maubp.freeserve.co.uk Sun Apr 26 08:29:01 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 26 Apr 2009 13:29:01 +0100
Subject: [Biopython-dev] history on github - where are the tags?
In-Reply-To: <8b34ec180904260508w1974cfe1kc2ba6ce4ce57934e@mail.gmail.com>
References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com>
<320fb6e00904210730v440bc483td21ca9aa118cc314@mail.gmail.com>
<8b34ec180904210810o77496205n8af85a58f0f83a49@mail.gmail.com>
<8b34ec180904210819n1b69b0aao53fc8d10c570abb8@mail.gmail.com>
<320fb6e00904210929k1e4f552fnf9154b0f8c30c1aa@mail.gmail.com>
<320fb6e00904210958y4f386a8ciba86cde9e7c1bc65@mail.gmail.com>
<8b34ec180904220152rf454c5crb20a6858ecbde468@mail.gmail.com>
<8b34ec180904220153p45f1004emf6e0f9b01ce9db1b@mail.gmail.com>
<320fb6e00904220223q119342aem70feb30e0f2cb747@mail.gmail.com>
<8b34ec180904260508w1974cfe1kc2ba6ce4ce57934e@mail.gmail.com>
Message-ID: <320fb6e00904260529x31043735wd673eb6af34059d4@mail.gmail.com>
On Sun, Apr 26, 2009 at 1:08 PM, Bartek Wilczynski
wrote:
> Hi all,
>
> On Wed, Apr 22, 2009 at 11:23 AM, Peter wrote:
>>
>> If you can fix the current git hub repository, great.
>>
> I 've finally found some time to fix the tag issue in our repository.
> I've actually spent some time looking at git-rebase (learned a lot,
> but nothing useful for our problem. Then I realised that since tags
> are just references to commits, we need to move them to the trunk
> (instead of re-basing the trunk).
>
> Long story short - assuming you are in the directory of your git repo
> you can fix any particular tag wit a single command. E.g. If you want
> to fix the biopython-149 tag, you do:
> git tag -f biopython-149 biopython-149~1
>
> -f option enforces the replacement of the existing label, while
> biopython-149~1 references the parent commit of our empty tag commit
> (you can also use ~2 for a grand parent and so on).
>
> You can see the effect of this procedure (as seen in gitx -- a very
> nice tool) in the attached images.
>
> If you want to fix all biopython tags, you simply do:
>
> for t in `git tag|grep biopython`; do git tag -f $t $t~1; done
>
> It works locally, the changes can be pushed back to github (need
> --tags -f to force tag renames),
> I've done this on my branch of biopython on github.
>
> If there are no objections to the way tags are handled, I can try to
> update the trunk.
> This is a bit tricky, because I need to make the update scripts work
> nicely with moving
> the tags, but it should be doable.
I say give this a go - fingers crossed :)
>> The old conversion's deletion is still in progress, it must have stalled:
>> http://support.github.com/discussions/repos/485-reposiotry-stuck-in-rename
>
> Seems to be gone now. that's one problem less :)
Great. I did have reminded them, but they solved it.
>>
>> If we can fix the tags, great. If we can also remap the authors to
>> their git usernames, even better.
>>
> This is doable in the current setup. I don't know whether we need to
> do this. The old commits
> are signed by the same credentials (name, e-mail) as on CVS server. If
> we start re-mapping them
> now, we are going to have essentially a new commit history, so
> everybody would need to rebase their
> branches... I don't see a problem of having old commits signed with
> old e-mails, and new commits
> signed by new. Especially, that everybody can have multiple e-mails
> assigned to their github account
> (that's how I did with mine).
That would be simpler. I'll have to try on my github account...
Peter
From biopython at maubp.freeserve.co.uk Sun Apr 26 08:46:55 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 26 Apr 2009 13:46:55 +0100
Subject: [Biopython-dev] Properties in Bio.Application interface?
Message-ID: <320fb6e00904260546s2d3eeb73iec08df89d93f9908@mail.gmail.com>
On Thu, Apr 23, 2009 at 10:29 AM, Peter wrote:
> What about this bit I wrote earlier:
>>> ... We might want to discuss extending the AbstractCommandline
>>> __init__ method to take **kwargs, allowing the parameters to be
>>> set like this:
>>>
>>> from Bio import Application
>>> from Bio.Align.Applications import MafftCommandline
>>> cmd = MafftCommandline(input="sample.fa", ...)
>>> return_code, std_handle, err_handle = Application.generic_run(cmd)
>>>
>>> I'm not sure how well this would work in practice as the range of
>>> valid argument names in python may not overlap with the valid
>>> parameter names.
>
> We'll have to see how well the above idea works in practice - it
> may not be general enough to be useful.
>
> Also, perhaps we can automatically generate properties for each
> argument allowing this:
>
> cmd.input = "sample.fa"
>
> rather than:
>
> cmd.set_parameter("input", "sample.fa")
>
> For the "switch" type arguments which take no value, if these are
> implemented with a separate option class (maybe _Switch or
> _OptionNoValue) then rather than:
>
> cmd.set_parameter("noanchors")
>
> we might want to do:
>
> cmd.noanchors = True
>
> and allow the switch to be removed with:
>
> cmd.noanchors = False
>
> i.e. For those arguments which take no argument (is "switch" the
> right term here?), evaluate the property set value as a boolean to
> add/remove -noanchors from the command line string.
>
> I think using properties in this way could make the command line
> object more intuitive, but again python puts limits on property names
> which might mean for some arguments you'd have to use the
> set_parameter version.
>
> Peter
>
I have cleaning up the existing Bio.Application command line objects
in CVS to follow the parameter alias convention already laid out in
Bio.Application. i.e. They all now have human readable paramater
aliases, which are also valid python identifiers. This means these
"human readable names" can also be used for argument names in __init__
(using **kwargs), or as property names.
I think I've got properties working now as an experiment on my
machine, generated at run time using the "human readable name" for
each parameter. We would need to special case "switch" arguments
(i.e. those which take no value) as outlined above.
Does this sound worthwhile? If so, I'll put together an enhancement
bug with a patch, or a branch on github.
Peter
From bugzilla-daemon at portal.open-bio.org Sun Apr 26 09:45:47 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 26 Apr 2009 09:45:47 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application MUSCLE command line
interface
In-Reply-To:
Message-ID: <200904261345.n3QDjlkm022449@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2815
cymon.cox at gmail.com changed:
What |Removed |Added
----------------------------------------------------------------------------
Attachment #1286 is|0 |1
obsolete| |
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Sun Apr 26 09:49:44 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 26 Apr 2009 09:49:44 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To:
Message-ID: <200904261349.n3QDniuI022654@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2815
cymon.cox at gmail.com changed:
What |Removed |Added
----------------------------------------------------------------------------
Summary|Bio.Application MUSCLE |Bio.Application command line
|command line interface |interfaces
------- Comment #6 from cymon.cox at gmail.com 2009-04-26 09:49 EST -------
(Change title of this bug.)
Now tracking github branch:
http://github.com/cymon/biopython-github-master/tree/applic-int
Added command line interfaces for:
MUSCLE, MAFFT, DALIGN, PRANK
To do:
Clustalw, T-coffee
C.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From biopython at maubp.freeserve.co.uk Sun Apr 26 13:22:43 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 26 Apr 2009 18:22:43 +0100
Subject: [Biopython-dev] Bio.EMBOSS wrappers
In-Reply-To: <320fb6e00904130649v699b16d1seea0c51172b9ab71@mail.gmail.com>
References: <320fb6e00904101112h544aba7fla94029f32bbdb01d@mail.gmail.com>
<20090413123219.GB5429@sobchak.mgh.harvard.edu>
<320fb6e00904130553n466972enf565ad3bed861b0e@mail.gmail.com>
<20090413134429.GE5429@sobchak.mgh.harvard.edu>
<320fb6e00904130649v699b16d1seea0c51172b9ab71@mail.gmail.com>
Message-ID: <320fb6e00904261022l43f799a8g8729a47ba15042f8@mail.gmail.com>
On Mon, Apr 13, 2009 at 2:49 PM, Peter wrote:
> On Mon, Apr 13, 2009 at 2:44 PM, Brad Chapman wrote:
>>> > ... Feel free to add away.
>>>
>>> I need to work on my delegation skills - that seems to have back fired ;)
>>
>> Oops. I honestly read that as "do I have your permission?" I can of
>> course tackle this, but am a bit underwater now.
>
> Looking back, I was a bit ambiguous. I don't mind who does it - let's
> see who has time free first.
OK, I've added a minimal needle wrapper based on the water wrapper.
As part of this I remove the -nosimilarity option which doesn't work on
the current versions of EMBOSS needle and water (5.0 or 6.0).
For -auto and -filter, I think we probably should extend the parameter
classes to explicitly cover these switch arguments which take no value
(they are either part of the command line, or omitted). We've touched
on this already on Cymon's thread...
Peter
From bugzilla-daemon at portal.open-bio.org Sun Apr 26 16:09:51 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Sun, 26 Apr 2009 16:09:51 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To:
Message-ID: <200904262009.n3QK9p8U011039@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2815
------- Comment #7 from cymon.cox at gmail.com 2009-04-26 16:09 EST -------
Added CLUSTALW Bio.Application command line interface
C.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From biopython at maubp.freeserve.co.uk Mon Apr 27 05:58:52 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 27 Apr 2009 10:58:52 +0100
Subject: [Biopython-dev] main page on wiki
In-Reply-To: <320fb6e00904230216l4b2cd91k513baa3f7a08172c@mail.gmail.com>
References: <49EFCF07.2050502@student.otago.ac.nz>
<9e2f512b0904221853m50cb1a62l72c297b84936a2d8@mail.gmail.com>
<49EFE553.6070405@gmail.com>
<320fb6e00904230216l4b2cd91k513baa3f7a08172c@mail.gmail.com>
Message-ID: <320fb6e00904270258s523c49a1j1bfc5d4a12ca86a9@mail.gmail.com>
On Thu, Apr 23, 2009 at 10:16 AM, Peter wrote:
>
> If there are no counter comments, I'll put David's changes up later
> today or tomorrow.
>
OK - make that a couple of days later ;)
This isn't exactly as in David's draft - I shortened some of the link
text and omitted a couple of links under "Contribute" which seemed
unnecessary on the home page.
I've also kept the final line giving the latest release and date
(although the text is shorter now). Brad commented (off list?) that
having this is a good indicator of the project's activity, and I
agree. Alternatively, I'd like to try having dates on the news feed,
but the media wiki plugin needs to be updated for that to work...
Peter
From bugzilla-daemon at portal.open-bio.org Mon Apr 27 08:23:00 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 27 Apr 2009 08:23:00 -0400
Subject: [Biopython-dev] [Bug 2671] Including GenomeDiagram in the main
Biopython distribution
In-Reply-To:
Message-ID: <200904271223.n3RCN0GL009972@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2671
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #34 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-27 08:23 EST -------
(In reply to comment #25)
> OK, GenomeDiagram is now in CVS, with some basic tests. Still to do:
>
> * Updating the existing GenomeDiagram manual to match (different imports,
> colour to color), which I think can stay as a separate PDF file.
Leighton can do that...
> * A short introduction to Bio.Graphics including GenomeDiagram as part
> of a new chapter in the tutorial?
Done.
(In reply to comment #33)
> Plus (as pointed out on Bug 2711 / Bug 2710):
>
> * Updating the installation instructions so that the ReportLab
> section also covers renderPM (needed for bitmaps).
Done.
Marking this bug fixed as of Biopython 1.50.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
From bugzilla-daemon at portal.open-bio.org Mon Apr 27 10:12:51 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 27 Apr 2009 10:12:51 -0400
Subject: [Biopython-dev] [Bug 2821] NCBIXML.parse only returns results for
non-empty hits rather than one per query sequence
In-Reply-To:
Message-ID: <200904271412.n3RECpZC019165@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2821
camilla at ip.id.au changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #2 from camilla at ip.id.au 2009-04-27 10:12 EST -------
Hi Peter
Thanks for the suggestions. In the end, I realised that b_record.query contains
the header line of the query sequence all along, so there is no real bug here,
just my misunderstanding of what information is stored where.
I think this issue can be closed.
For anyone else out there with similar problems, if you aren't certain what
data is in an object, you can use the dir() function to list them
all.
Thanks again
Camilla
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From biopython at maubp.freeserve.co.uk Mon Apr 27 11:59:03 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 27 Apr 2009 16:59:03 +0100
Subject: [Biopython-dev] Installation documentation
Message-ID: <320fb6e00904270859q23a9abd8rfb21b10ebee30a77@mail.gmail.com>
I've made some updates to Installation.tex, which I think are an
improvement over the version shipped with Biopython 1.50 and currently
online. I think we could update these files now:
http://biopython.org/DIST/docs/install/Installation.html
http://biopython.org/DIST/docs/install/Installation.pdf
Does that seem sensible? Before that, would anyone like to proof read
the text in CVS, or make further updates? For example, are the bits
on FreeBSD, Fink and RPMs still valid?
Peter
From p.j.a.cock at googlemail.com Mon Apr 27 12:09:57 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 27 Apr 2009 17:09:57 +0100
Subject: [Biopython-dev] Rolling new releases
In-Reply-To: <320fb6e00904230658y310609c8l89bf27c33bd56d56@mail.gmail.com>
References: <320fb6e00904200203h31f3976es76703a0cd85ee41a@mail.gmail.com>
<320fb6e00904200511r794bc959i6b6719f406b5ca96@mail.gmail.com>
<320fb6e00904200755h4a32ed91pe2969810e68f256d@mail.gmail.com>
<8b34ec180904200808p28a2ea9ew8e16a25ae3557744@mail.gmail.com>
<320fb6e00904200904x61cc3d5dj7d9f88c3ec3a7dd6@mail.gmail.com>
<20090421122045.GD30529@sobchak.mgh.harvard.edu>
<320fb6e00904210543x79c021f3j3f3272841b16d1c6@mail.gmail.com>
<20090423125356.GE34546@sobchak.mgh.harvard.edu>
<320fb6e00904230658y310609c8l89bf27c33bd56d56@mail.gmail.com>
Message-ID: <320fb6e00904270909y35ebc841yd2074d6970b71fe4@mail.gmail.com>
On Thu, Apr 23, 2009 at 2:58 PM, Peter Cock wrote:
> The complicated bit is getting the code and documentation in CVS
> ready, and that is harder to delegate. ?Once that is done though, the
> actual release process is fairly straight forward - as documented here
> - and could be delegated to anyone methodical with suitably setup
> development machine(s):
> http://biopython.org/wiki/Building_a_release
> Maybe some of the release process could be automated literally as a
> script - but doing each step methodically by hand and checking as you
> go is wise.
On the bright side, after dropping Martel the "Building a release"
instructions will get a little shorter :)
Peter
From biopython at maubp.freeserve.co.uk Mon Apr 27 12:26:37 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 27 Apr 2009 17:26:37 +0100
Subject: [Biopython-dev] Removing Bio.Mindy and Martel
In-Reply-To: <320fb6e00904251630t43ec275ehd25906476c6afe18@mail.gmail.com>
References: <320fb6e00904251630t43ec275ehd25906476c6afe18@mail.gmail.com>
Message-ID: <320fb6e00904270926l7e2db7e0x21a7bde1e47af4b0@mail.gmail.com>
On Sun, Apr 26, 2009 at 12:30 AM, Peter wrote:
>
> Does anyone have any objections about us pressing ahead with removing
> Martel and these modules now?
>
Well I hope not, as I've just make the changes in CVS. Note that I
have not deleted all the files in the Martel folder, but simply
excluded Martel from setup.py.
Peter
From biopython at maubp.freeserve.co.uk Mon Apr 27 12:28:15 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 27 Apr 2009 17:28:15 +0100
Subject: [Biopython-dev] Installation documentation
In-Reply-To: <320fb6e00904270859q23a9abd8rfb21b10ebee30a77@mail.gmail.com>
References: <320fb6e00904270859q23a9abd8rfb21b10ebee30a77@mail.gmail.com>
Message-ID: <320fb6e00904270928w26fb0f1axbb7be88188d0355f@mail.gmail.com>
On Mon, Apr 27, 2009 at 4:59 PM, Peter wrote:
> I've made some updates to Installation.tex, which I think are an
> improvement over the version shipped with Biopython 1.50 and currently
> online. ?I think we could update these files now:
>
> http://biopython.org/DIST/docs/install/Installation.html
> http://biopython.org/DIST/docs/install/Installation.pdf
>
> Does that seem sensible? ?Before that, would anyone like to proof read
> the text in CVS, or make further updates? ?For example, are the bits
> on ?FreeBSD, Fink and RPMs still valid?
If we are going to update the online version, I'll refrain from
removing the mxTextTools bit from Installation.tex for the time being.
Peter
From biopython at maubp.freeserve.co.uk Mon Apr 27 12:37:09 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 27 Apr 2009 17:37:09 +0100
Subject: [Biopython-dev] history on github - where are the tags?
In-Reply-To: <320fb6e00904260529x31043735wd673eb6af34059d4@mail.gmail.com>
References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com>
<8b34ec180904210810o77496205n8af85a58f0f83a49@mail.gmail.com>
<8b34ec180904210819n1b69b0aao53fc8d10c570abb8@mail.gmail.com>
<320fb6e00904210929k1e4f552fnf9154b0f8c30c1aa@mail.gmail.com>
<320fb6e00904210958y4f386a8ciba86cde9e7c1bc65@mail.gmail.com>
<8b34ec180904220152rf454c5crb20a6858ecbde468@mail.gmail.com>
<8b34ec180904220153p45f1004emf6e0f9b01ce9db1b@mail.gmail.com>
<320fb6e00904220223q119342aem70feb30e0f2cb747@mail.gmail.com>
<8b34ec180904260508w1974cfe1kc2ba6ce4ce57934e@mail.gmail.com>
<320fb6e00904260529x31043735wd673eb6af34059d4@mail.gmail.com>
Message-ID: <320fb6e00904270937x402d7e97oa39e939ab55cad62@mail.gmail.com>
> On Sun, Apr 26, 2009 at 1:08 PM, Bartek Wilczynski
>>>
>>> If we can fix the tags, great. ?If we can also remap the authors to
>>> their git usernames, even better.
>>>
>> This is doable in the current setup. I don't know whether we need to
>> do this. The old commits are signed by the same credentials (name,
>> e-mail) as on CVS server.
>From looking at git log, they just have our CVS usename, e.g.
Author: peterc
i.e. No email address
>> If we start re-mapping them now, we are going to have essentially a
>> new commit history, so everybody would need to rebase their
>> branches... I don't see a problem of having old commits signed with
>> old e-mails, and new commits signed by new. Especially, that
>> everybody can have multiple e-mails assigned to their github
>> account (that's how I did with mine).
>
> That would be simpler. ?I'll have to try on my github account...
>
Given we don't have email addresses embedded in the old commits,
do you think is this going to be possible (without changing the
repository)?
Peter
From chapmanb at 50mail.com Tue Apr 28 08:41:20 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 28 Apr 2009 08:41:20 -0400
Subject: [Biopython-dev] Installation documentation
In-Reply-To: <320fb6e00904270859q23a9abd8rfb21b10ebee30a77@mail.gmail.com>
References: <320fb6e00904270859q23a9abd8rfb21b10ebee30a77@mail.gmail.com>
Message-ID: <20090428124119.GV34546@sobchak.mgh.harvard.edu>
Hi Peter;
> I've made some updates to Installation.tex, which I think are an
> improvement over the version shipped with Biopython 1.50 and currently
> online. I think we could update these files now:
>
> http://biopython.org/DIST/docs/install/Installation.html
> http://biopython.org/DIST/docs/install/Installation.pdf
>
> Does that seem sensible? Before that, would anyone like to proof read
> the text in CVS, or make further updates? For example, are the bits
> on FreeBSD, Fink and RPMs still valid?
The FreeBSD port is out of date now, so I commented that section out
and replaced it with a section on using easy_install. This also
reminded me that I needed to update the version on the Python
Package Index. I added a note to the release details to do this; oh
man, another step.
Peter, if you have an account on pypi, let me know your login and I
can add you as an owner for Biopython.
Brad
From p.j.a.cock at googlemail.com Tue Apr 28 09:36:37 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 28 Apr 2009 14:36:37 +0100
Subject: [Biopython-dev] [Biopython] Parsing large blast files
In-Reply-To: <627305.69090.qm@web62401.mail.re1.yahoo.com>
References: <320fb6e00904270354i351bccb3q6aaa2369db6f82e0@mail.gmail.com>
<627305.69090.qm@web62401.mail.re1.yahoo.com>
Message-ID: <320fb6e00904280636s7966690dh8f109e9e438fdec3@mail.gmail.com>
On Tue, Apr 28, 2009 at 2:00 PM, Michiel de Hoon wrote:
>> NCBIStandalone.Iterator() is the old semi-obsolete plain
>> text parser - it won't parse the XML output, hence the
>> "Invalid header" error. ?Maybe the tutorial
>> (or the error message) could be clearer.
>
> I think part of the problem is the organization of the code in Bio.Blast,
> which seems to have grown historically. Bio.Blast.NCBIStandalone
> contains blastall, blastpgp, and rpsblast, which makes sense, but also
> ?BlastParser and PsiBlastParser, which are not necessarily connected
> to standalone Blast. Bio.Blast.ParseBlastTable contains the parser for
> blastpgp output. Bio.Blast.NCBIWWW contains qblast, but also the
> parser for Blast HTML output, though qblast does not necessarily
> generate output in HTML format.
I presumed that initially the standalone tools only produced plain text,
and the website (qblast) only produced HTML - hence the use of
Bio.Blast.NCBIStandalone for both command line wrappers AND the
plain text parser, and Bio.Blast.NCBIWWW for both the qblast function
AND the HTML parser.
> The usage of this module may be more understandable if all functions
> were accessible from Bio.Blast directly in a fashion more consistent
> with current Biopython. Bio.Blast would then have the following functions:
>
> read(handle, format='xml')
> parse(handle, format='xml')
> blastall
> blastpgp
> rpsblast
> qblast
>
> with most of the actual code hiding in Bio.Blast.NCBIStandalone etcetera.
>
> Any objections, comments?
I do like the idea of moving/importing the qblast function directly
under Bio.Blast, and perhaps removing Bio.Blast.NCBIXML later on.
For read/parse functions, we should probably call the format
"blastxml" to match BioPerl. Would you continue to support the plain
text output here? Also something to keep in mind is there may be
non-NCBI variants of BLAST with their own formats as well.
Rather than continuing to encourage the use of blastall, blastpgp and
rpsblast I would rather bring Bio.Blast.Applications up to date, and
then declare them obsolete . These three "helper" functions are very
limiting in how the command line is invoked - you can't choose the
exact call used (e.g. subprocess options) or what you want back (e.g.
you may not care about the handles). For example, getting BLAST to
write its output to a file is confusingly difficult right now using
these functions. Also, dealing with errors isn't nice.
Peter
From biopython at maubp.freeserve.co.uk Tue Apr 28 09:40:43 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 28 Apr 2009 14:40:43 +0100
Subject: [Biopython-dev] Installation documentation
In-Reply-To: <20090428124119.GV34546@sobchak.mgh.harvard.edu>
References: <320fb6e00904270859q23a9abd8rfb21b10ebee30a77@mail.gmail.com>
<20090428124119.GV34546@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00904280640l58f0098fs467d90646a039e29@mail.gmail.com>
On Tue, Apr 28, 2009 at 1:41 PM, Brad Chapman wrote:
>
> The FreeBSD port is out of date now, so I commented that section out
> and replaced it with a section on using easy_install. This also
> reminded me that I needed to update the version on the Python
> Package Index. I added a note to the release details to do this; oh
> man, another step.
Well, easy_install isn't (yet) an official python standard so I hadn't
previously worried about it - our wiki Downloads page does mention it.
Frankly the less "official" ways the are to install, the less ways it
can go wrong, and then the less questions need to be asked when it
goes wrong.
Nor had I worried about how PyPi's listing might need to be updated.
I assumed it was clever enough to scan the http://biopython.org/DIST/
directory and parse the filenames. Is the real answer you (Brad) kept
it up to date?
http://pypi.python.org/pypi/biopython/
> Peter, if you have an account on pypi, let me know your login and I
> can add you as an owner for Biopython.
I don't have an account on pypi.
Peter
From bugzilla-daemon at portal.open-bio.org Tue Apr 28 11:04:09 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 28 Apr 2009 11:04:09 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To:
Message-ID: <200904281504.n3SF49so024149@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2815
------- Comment #8 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-28 11:04 EST -------
I've checked the MUSCLE wrapper into CVS, and added the -diags option. I also
created test_Muscle_tool.py which requires MUSCLE be installed, and checks we
can invoke it and parse its clustal output OK.
A more general alignment wrapper unit test can simply construct some command
line objects and check them against an expected string (without requiring the
tools to be installed).
Note that I am concerned about the file exists check on the input file
argument. This is helpful, but also prevents certain reasonable usage examples
- e.g. the input file is created on the fly and doesn't exist yet, or, the
command line constructed will be submitted to a cluster where the path will be
valid (even if the path isn't valid on the local machine where Biopython is
running).
Also, perhaps we should think about Bio.Application including automatic quoting
for filenames with spaces in them... see the _escape_filename function used in
Bio.Clustalw and Bio.Blast.NCBIStandalone. This would be only for parameters
explicitly tagged as filenames.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Tue Apr 28 11:25:20 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 28 Apr 2009 11:25:20 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To:
Message-ID: <200904281525.n3SFPKbd025807@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2815
------- Comment #9 from cymon.cox at gmail.com 2009-04-28 11:25 EST -------
(In reply to comment #8)
> I've checked the MUSCLE wrapper into CVS, and added the -diags option.
You pulled this from the applic-int branch yes? (Hmm, missed that -diags...)
I also
> created test_Muscle_tool.py which requires MUSCLE be installed, and checks we
> can invoke it and parse its clustal output OK.
Ive also just checked in (to the github branch) some unittests for MUSCLE,
MAFFT, PRANK, and DIALIGN. These just check that the app runs and the return
code is 0 - a few other checks are made but not much else.
> A more general alignment wrapper unit test can simply construct some command
> line objects and check them against an expected string (without requiring the
> tools to be installed).
I will do these - all in one test_ApplicationCommandlines.py unittest suite.
> Note that I am concerned about the file exists check on the input file
> argument. This is helpful, but also prevents certain reasonable usage examples
> - e.g. the input file is created on the fly and doesn't exist yet, or, the
> command line constructed will be submitted to a cluster where the path will be
> valid (even if the path isn't valid on the local machine where Biopython is
> running).
Good point. Perhaps the os.path.exists on input files needs to be dropped from
all wrappers.
>
> Also, perhaps we should think about Bio.Application including automatic quoting
> for filenames with spaces in them... see the _escape_filename function used in
> Bio.Clustalw and Bio.Blast.NCBIStandalone. This would be only for parameters
> explicitly tagged as filenames.
Yes, I thought about doing that but havent acted.
C.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Tue Apr 28 11:44:06 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 28 Apr 2009 11:44:06 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To:
Message-ID: <200904281544.n3SFi61M027248@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2815
------- Comment #10 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-28 11:44 EST -------
(In reply to comment #9)
> (In reply to comment #8)
> > I've checked the MUSCLE wrapper into CVS, and added the -diags option.
>
> You pulled this from the applic-int branch yes? (Hmm, missed that -diags...)
Yes. I spotted the -diags because it is an example given if you just run
"muscle".
> Ive also just checked in (to the github branch) some unittests for MUSCLE,
> MAFFT, PRANK, and DIALIGN. These just check that the app runs and the return
> code is 0 - a few other checks are made but not much else.
I'll look at that.
> > A more general alignment wrapper unit test can simply construct some command
> > line objects and check them against an expected string (without requiring
> > the tools to be installed).
>
> I will do these - all in one test_ApplicationCommandlines.py unittest suite.
Sounds good. Maybe just test_AlignApps.py if it is just for
Bio.Align.Applications?
> > Note that I am concerned about the file exists check on the input file
> > argument. This is helpful, but also prevents certain reasonable usage
> > examples - e.g. the input file is created on the fly and doesn't exist
> > yet, or, the command line constructed will be submitted to a cluster
> > where the path will be valid (even if the path isn't valid on the local
> > machine where Biopython is running).
>
> Good point. Perhaps the os.path.exists on input files needs to be dropped
> from all wrappers.
Maybe - I dropped most of them from the Muscle and Clustalw ones. The matrix
arguments are a little trickier, where the argument can be either a special
word of a filename. See below for a related issue ...
> > Also, perhaps we should think about Bio.Application including automatic
> > quoting for filenames with spaces in them... see the _escape_filename
> > function used in Bio.Clustalw and Bio.Blast.NCBIStandalone. This would
> > be only for parameters explicitly tagged as filenames.
>
> Yes, I thought about doing that but havent acted.
>
Another issue is any file exists check needs to be aware that filenames may be
quoted (due to containing spaces). i.e. A simple call to os.path.isfile(...)
won't work. I've integrated your Clustalw wrapper into CVS, and in order to
extend my existing unit tests to use this with spaces in file names, I was
forced to drop the existence check.
Peter
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Tue Apr 28 12:18:11 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 28 Apr 2009 12:18:11 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To:
Message-ID: <200904281618.n3SGIBPl029571@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2815
------- Comment #11 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-28 12:18 EST -------
(In reply to comment #9)
>
> Ive also just checked in (to the github branch) some unittests for MUSCLE,
> MAFFT, PRANK, and DIALIGN. These just check that the app runs and the return
> code is 0 - a few other checks are made but not much else.
>
I think I was looking at your master branch, rather than the applic-int branch:
http://github.com/cymon/biopython-github-master/commits/applic-int
I see the changes now...
> > Also, perhaps we should think about Bio.Application including automatic
> > quoting for filenames with spaces in them... see the _escape_filename
> > function used in Bio.Clustalw and Bio.Blast.NCBIStandalone. This would
> > be only for parameters explicitly tagged as filenames.
>
> Yes, I thought about doing that but havent acted.
Seeing as we both think this makes sense, I've done that in CVS.
Peter
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Tue Apr 28 12:30:53 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 28 Apr 2009 12:30:53 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To:
Message-ID: <200904281630.n3SGUrLn030516@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2815
------- Comment #12 from cymon.cox at gmail.com 2009-04-28 12:30 EST -------
(In reply to comment #11)
> (In reply to comment #9)
> >
> > Ive also just checked in (to the github branch) some unittests for MUSCLE,
> > MAFFT, PRANK, and DIALIGN. These just check that the app runs and the return
> > code is 0 - a few other checks are made but not much else.
> >
>
> I think I was looking at your master branch, rather than the applic-int branch:
> http://github.com/cymon/biopython-github-master/commits/applic-int
> I see the changes now...
In those unittests, you'll note that I have no idea about the windows
environment! (dont use window, never have used windows). I just copied from the
Emboss wrapper...
C.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Tue Apr 28 12:39:20 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 28 Apr 2009 12:39:20 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To:
Message-ID: <200904281639.n3SGdKcO030951@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2815
------- Comment #13 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-28 12:39 EST -------
(In reply to comment #11)
> In those unittests, you'll note that I have no idea about the windows
> environment! (dont use window, never have used windows). I just copied
> from the Emboss wrapper...
>
> C.
That explains things :)
I had already guessed you hadn't run any of these tests on Windows, because the
executable isn't recorded properly, and even if it was, you never use it when
creating the command line objects:
#Don't do this if you want to actually run the application, as
#it would only work on Unix where the command is on the path:
#cmdline = MafftCommandline()
#Instead, use the exe name we determined earlier:
cmdline = MafftCommandline(mafft_exe)
The EMBOSS installer is nice and *does* setup EMBOSS_ROOT, which is why
test_Emboss.py looks for it.
However, for test_Clustalw_tool.py I just made a list of the default install
locations, and check them. There is no environment variable!
I haven't looked at the documentation but I would be pleasantly surprised if a
MAFFT_ROOT environment variable was setup by the default method of installing
MAFFT on Windows (and similarly for the other tools).
If the tools do record their install location in the registry, we can do a
win32api call to get the path. Then if win32api isn't installed just raise the
MissingExternalDependencyError exception.
If you look at my test_Muscle_tool.py in CVS, you'll see I haven't yet
determined how best to try and locate MUSCLE.
Peter
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Tue Apr 28 12:54:32 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 28 Apr 2009 12:54:32 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To:
Message-ID: <200904281654.n3SGsWBX032007@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2815
------- Comment #14 from cymon.cox at gmail.com 2009-04-28 12:54 EST -------
(In reply to comment #13)
> (In reply to comment #11)
> > In those unittests, you'll note that I have no idea about the windows
> > environment! (dont use window, never have used windows). I just copied
> > from the Emboss wrapper...
> >
> > C.
>
> That explains things :)
>
> I had already guessed you hadn't run any of these tests on Windows, because the
> executable isn't recorded properly, and even if it was, you never use it when
> creating the command line objects:
>
> #Don't do this if you want to actually run the application, as
> #it would only work on Unix where the command is on the path:
> #cmdline = MafftCommandline()
> #Instead, use the exe name we determined earlier:
> cmdline = MafftCommandline(mafft_exe)
OK, thanks, I'll update the wrappers on my branch.
> If the tools do record their install location in the registry, we can do a
> win32api call to get the path.
> Then if win32api isn't installed just raise the
> MissingExternalDependencyError exception.
We can? ;)
Can you give me some code, or I could just use this in the meantime:
if sys.platform=="win32" :
raise MissingExternalDependencyError("Testing with MUSCLE not implemented
on Windows yet")
C.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Tue Apr 28 13:07:11 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 28 Apr 2009 13:07:11 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To:
Message-ID: <200904281707.n3SH7BTG000522@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2815
------- Comment #15 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-28 13:07 EST -------
(In reply to comment #14)
> > #Don't do this if you want to actually run the application, as
> > #it would only work on Unix where the command is on the path:
> > #cmdline = MafftCommandline()
> > #Instead, use the exe name we determined earlier:
> > cmdline = MafftCommandline(mafft_exe)
>
> OK, thanks, I'll update the wrappers on my branch.
Have a look at this test_Muscle_tool.py CVS revision 1.4 first:
http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Tests/test_Muscle_tool.py?cvsroot=biopython
> > If the tools do record their install location in the registry,
> > we can do a win32api call to get the path. Then if win32api
> > isn't installed just raise the MissingExternalDependencyError
> > exception.
>
> We can? ;)
>
> Can you give me some code, ...
There are a lot of ifs here. The code is fairly simple (I've done this kind of
thing before, but can't find an example right away). The catch is establishing
IF the information we want gets written to the registry during the tool
installation or not.
> or I could just use this in the meantime:
> if sys.platform=="win32" :
> raise MissingExternalDependencyError("Testing with MUSCLE not implemented
> on Windows yet")
Yeah - use something like that, but be aware that the tests shouldn't assume
that the executable name is just "muscle". Hopefully test_Muscle_tool.py does
this right...
Peter
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Tue Apr 28 13:32:14 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 28 Apr 2009 13:32:14 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To:
Message-ID: <200904281732.n3SHWEip002274@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2815
------- Comment #16 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-28 13:32 EST -------
(In reply to comment #15)
>
> Yeah - use something like that, but be aware that the tests shouldn't assume
> that the executable name is just "muscle". Hopefully test_Muscle_tool.py does
> this right...
>
Well, it does now. I've got test_Muscle_tool.py to run on Windows, assuming
the user chooses to put MUSCLE under the program files directory in a
reasonably predictable folder. Given the MUSCLE installation process on
Windows is entirely manual, we can't really do anything else.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From biopython at maubp.freeserve.co.uk Tue Apr 28 13:45:01 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 28 Apr 2009 18:45:01 +0100
Subject: [Biopython-dev] history on github - where are the tags?
In-Reply-To: <320fb6e00904270937x402d7e97oa39e939ab55cad62@mail.gmail.com>
References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com>
<8b34ec180904210819n1b69b0aao53fc8d10c570abb8@mail.gmail.com>
<320fb6e00904210929k1e4f552fnf9154b0f8c30c1aa@mail.gmail.com>
<320fb6e00904210958y4f386a8ciba86cde9e7c1bc65@mail.gmail.com>
<8b34ec180904220152rf454c5crb20a6858ecbde468@mail.gmail.com>
<8b34ec180904220153p45f1004emf6e0f9b01ce9db1b@mail.gmail.com>
<320fb6e00904220223q119342aem70feb30e0f2cb747@mail.gmail.com>
<8b34ec180904260508w1974cfe1kc2ba6ce4ce57934e@mail.gmail.com>
<320fb6e00904260529x31043735wd673eb6af34059d4@mail.gmail.com>
<320fb6e00904270937x402d7e97oa39e939ab55cad62@mail.gmail.com>
Message-ID: <320fb6e00904281045u33c91e2ci9647d85f75804e35@mail.gmail.com>
On Mon, Apr 27, 2009 at 5:37 PM, Peter wrote:
>> On Sun, Apr 26, 2009 at 1:08 PM, Bartek Wilczynski
>>>>
>>>> If we can fix the tags, great. ?If we can also remap the authors to
>>>> their git usernames, even better.
>>>>
>>> This is doable in the current setup. I don't know whether we need to
>>> do this. The old commits are signed by the same credentials (name,
>>> e-mail) as on CVS server.
>
> From looking at git log, they just have our CVS usename, e.g.
> Author: peterc
> i.e. No email address
>
>>> If we start re-mapping them now, we are going to have essentially a
>>> new commit history, so everybody would need to rebase their
>>> branches... I don't see a problem of having old commits signed with
>>> old e-mails, and new commits signed by new. Especially, that
>>> everybody can have multiple e-mails assigned to their github
>>> account (that's how I did with mine).
>>
>> That would be simpler. ?I'll have to try on my github account...
>
> Given we don't have email addresses embedded in the old commits,
> do you think is this going to be possible (without changing the
> repository)?
I take that back - I added an email address of just "peterc" to my
github account (it seems they don't do any validation, perhaps for
this very reason?). This had no immediate effect, but one day later
and all my CVS commits are now shown with my photo in github. Neat -
but it makes it much more obvious that I have a tendency to do lots of
small commits!
Peter
From bartek at rezolwenta.eu.org Tue Apr 28 13:50:20 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Tue, 28 Apr 2009 19:50:20 +0200
Subject: [Biopython-dev] history on github - where are the tags?
In-Reply-To: <320fb6e00904281045u33c91e2ci9647d85f75804e35@mail.gmail.com>
References: <320fb6e00903170206h570989bbgb6b9a761d2aa70ed@mail.gmail.com>
<320fb6e00904210929k1e4f552fnf9154b0f8c30c1aa@mail.gmail.com>
<320fb6e00904210958y4f386a8ciba86cde9e7c1bc65@mail.gmail.com>
<8b34ec180904220152rf454c5crb20a6858ecbde468@mail.gmail.com>
<8b34ec180904220153p45f1004emf6e0f9b01ce9db1b@mail.gmail.com>
<320fb6e00904220223q119342aem70feb30e0f2cb747@mail.gmail.com>
<8b34ec180904260508w1974cfe1kc2ba6ce4ce57934e@mail.gmail.com>
<320fb6e00904260529x31043735wd673eb6af34059d4@mail.gmail.com>
<320fb6e00904270937x402d7e97oa39e939ab55cad62@mail.gmail.com>
<320fb6e00904281045u33c91e2ci9647d85f75804e35@mail.gmail.com>
Message-ID: <8b34ec180904281050w4e8886cbufa4eb3c0adf63f09@mail.gmail.com>
On Tue, Apr 28, 2009 at 7:45 PM, Peter wrote:
> I take that back - I added an email address of just "peterc" to my
> github account (it seems they don't do any validation, perhaps for
> this very reason?). ?This had no immediate effect, but one day later
> and all my CVS commits are now shown with my photo in github. ?Neat -
great
> but it makes it much more obvious that I have a tendency to do lots of
> small commits!
>
That's good practice in git :)
cheers
Bartek
From bugzilla-daemon at portal.open-bio.org Tue Apr 28 13:55:00 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 28 Apr 2009 13:55:00 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To:
Message-ID: <200904281755.n3SHt0FK003782@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2815
------- Comment #17 from cymon.cox at gmail.com 2009-04-28 13:55 EST -------
(In reply to comment #16)
> (In reply to comment #15)
> >
> > Yeah - use something like that, but be aware that the tests shouldn't assume
> > that the executable name is just "muscle". Hopefully test_Muscle_tool.py does
> > this right...
> >
>
> Well, it does now. I've got test_Muscle_tool.py to run on Windows, assuming
> the user chooses to put MUSCLE under the program files directory in a
> reasonably predictable folder. Given the MUSCLE installation process on
> Windows is entirely manual, we can't really do anything else.
>
OK, pushed to applic-int updated unittest for PRANK, MAFFT, and DIALIGN -
skipping tests on windows.
Also changed the names to test_XXXX_tool.py
C.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Tue Apr 28 14:28:31 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 28 Apr 2009 14:28:31 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To:
Message-ID: <200904281828.n3SISVTs005955@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2815
------- Comment #18 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-28 14:28 EST -------
(In reply to comment #17)
>
> OK, pushed to applic-int updated unittest for PRANK, MAFFT, and DIALIGN -
> skipping tests on windows.
>
> Also changed the names to test_XXXX_tool.py
>
> C.
Great. In addition to the MUSCLE and ClustalW stuff, I've got the PRANK code
and unit tests in CVS now. These three tests all work on a Linux, Mac and
Windows machine (with Python 2.4, 2.5 and 2.6). I'm stopping working on this
for today.
It would be great if you could test a clean checkout from CVS, and we'll resume
this merge later on for the remaining tools MAFFT and DIALIGN.
Also, would you be able to look into making the Prank test faster to run? Maybe
use a smaller example input file? After we do that, I'd like to use it to test
the Nexus parser via Bio.AlignIO (just something simple which won't be affected
by gap differences between different versions of PRANK - like my tests for
MUSCLE).
Peter
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From biopython at maubp.freeserve.co.uk Tue Apr 28 15:27:41 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 28 Apr 2009 20:27:41 +0100
Subject: [Biopython-dev] Where to put command line wrappers
In-Reply-To: <320fb6e00904231421k3e18d0b1y9003614e906fcb1c@mail.gmail.com>
References: <320fb6e00904160945p3e15b6daoaa7891dbf9e144ec@mail.gmail.com>
<320fb6e00904161016i63c6d149xf414725ee489bfdb@mail.gmail.com>
<8b34ec180904161034j3a5a778dsfbca8ba2a57d3e88@mail.gmail.com>
<8b34ec180904161037wea0373cnde2964b25f5c357@mail.gmail.com>
<320fb6e00904161153h70f54b36p94c09e23733c941@mail.gmail.com>
<20090417140241.GD16092@sobchak.mgh.harvard.edu>
<320fb6e00904231421k3e18d0b1y9003614e906fcb1c@mail.gmail.com>
Message-ID: <320fb6e00904281227l5e17159g4333fd98d019ad60@mail.gmail.com>
On Thu, Apr 23, 2009 at 10:21 PM, Peter wrote:
>
> OK, what I propose is that the command line objects are exposed as
> Bio.Align.Applications.MuscleCommandline,
> Bio.Align.Applications.ClustalwCommandline, etc but that the
> implementations live in Bio/Align/Applications/_Muscle.py,
> _Clustalw.py etc. To do this the Bio/Align/Applications/__init__.py
> file will look like this:
>
> from _Muscle import MuscleCommandline
> from _Clustalw import ClustalwCommandline
>
> This avoids having a single massive file, yet keeps the public
> namespace simple. For the user, they do this:
>
> from Bio.Align.Applications import MuscleCommandline
> cline = MuscleCommandline(...)
>
> or if they prefer,
>
> from Bio.Align import Applications
> cline = Applications.MuscleCommandline(...)
>
> From the user's point of view all the alignment command line wrapper
> objects live together under Bio.Align.Applications.
As no one objected or put forward an alternative scheme, Cymon and I
have been pressing ahead on Bug 2815 using the above file layout. I
have also updated Bio.Motif.Applications to match (this module was
deliberately left out of Biopython 1.50 while this issue was settled).
Peter
From bugzilla-daemon at portal.open-bio.org Tue Apr 28 17:18:11 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 28 Apr 2009 17:18:11 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To:
Message-ID: <200904282118.n3SLIB0N015984@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2815
------- Comment #19 from cymon.cox at gmail.com 2009-04-28 17:18 EST -------
(In reply to comment #18)
> (In reply to comment #17)
> >
> It would be great if you could test a clean checkout from CVS,
Done - on Ubuntu 9.04 Python2.6.2 - Clustalw_tool and Prank_tool both good.
Cant test Muscle_tool as Muscle 3.7 is broken on this release (builds and
core-dumps).
>
> Also, would you be able to look into making the Prank test faster to run?
Will look into this.
(merged upstream into applic-int)
C.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From mjldehoon at yahoo.com Tue Apr 28 21:28:26 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 28 Apr 2009 18:28:26 -0700 (PDT)
Subject: [Biopython-dev] [Biopython] Parsing large blast files
In-Reply-To: <320fb6e00904280636s7966690dh8f109e9e438fdec3@mail.gmail.com>
Message-ID: <290052.25369.qm@web62407.mail.re1.yahoo.com>
--- On Tue, 4/28/09, Peter Cock wrote:
> I do like the idea of moving/importing the qblast function
> directly under Bio.Blast, and perhaps removing Bio.Blast.NCBIXML
> later on.
Well Bio.Blast.NCBIXML would still be there (containing the code for the XML parser), but users would access it through Bio.Blast.parse/read.
> For read/parse functions, we should probably call the
> format "blastxml" to match BioPerl.
We could have both "xml" and "blastxml" for Blast XML output, "text" and "blasttext" for Blast text output, and "table" and "blasttable" for Blast table (-m 8 and 9) output.
> Would you continue to support the plain text output here?
Yes. I'm more thinking about code reorganization than removing/adding functionality.
> Rather than continuing to encourage the use of blastall,
> blastpgp and rpsblast I would rather bring Bio.Blast.Applications
> up to date, and then declare them obsolete.
How would users typically use Bio.Blast.Applications?
--Michiel.
From p.j.a.cock at googlemail.com Wed Apr 29 04:33:03 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 29 Apr 2009 09:33:03 +0100
Subject: [Biopython-dev] [Biopython] Parsing large blast files
In-Reply-To: <290052.25369.qm@web62407.mail.re1.yahoo.com>
References: <320fb6e00904280636s7966690dh8f109e9e438fdec3@mail.gmail.com>
<290052.25369.qm@web62407.mail.re1.yahoo.com>
Message-ID: <320fb6e00904290133o28d1b45dh5091bc8d9b278fff@mail.gmail.com>
On Wed, Apr 29, 2009 at 2:28 AM, Michiel de Hoon wrote:
>
> How would users typically use Bio.Blast.Applications?
>
In the next release, I would aim to have Bio.Blast.Applications
updated to cover blastall (fully), plus blastpgp and rpsblast
(currently not covered) and for the three helper functions
Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast to all use
Bio.Blast.Applications internally. I would suggest at some point
(perhaps a release later) calling the three helper functions obsolete,
and eventually deprecating them, but I appreciate these are well
documented and well used, so this should be a gradual transistion.
In the future I would see people contructing their application command
line object and then using it to spawn the task as needed. The
Bio.Applicaition.generic_run might suffice for low output tools,
ranging up to using the builtin subprocess module for full control.
The command line string can also be used in other ways, e.g. for
submission to a computing cluster using qsub, or writing to a shell
script etc.
The point about this is decoupling constuction of the command line
string, and actually executing it. Right now the
Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast functions do
both, and there is no way to (a) see what the command line used was,
which makes debugging difficult, and (b) no way to control how it is
invoked (e.g. recent Windows GUI questions).
Another immediate benefit is an example usage that I do quite often:
Running BLAST and saving the output to a file. The cleanest way to do
this is to use the -o option to get BLAST itself to write to a file.
If you do this, then there is no useful output written to the handles
- but the Bio.Blast.NCBIStandalone make this fiddly (see Bug 2654).
Right now the tutorial does something equally indirect - in python
read BLAST output from stdout and save it to a file (and probably not
in a memory efficient way either!).
See also this thread on where to put new command line wrappers:
http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005766.html
If you where asking about the actual code for how to build the command
line object, well I have some thoughts on making the current
Bio.Application base class easier to use (properties and keyword
arguments at init) which I have started to discuss on the dev list.
Peter
From bugzilla-daemon at portal.open-bio.org Wed Apr 29 05:55:13 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 29 Apr 2009 05:55:13 -0400
Subject: [Biopython-dev] [Bug 2822] New: Bio.Application.AbstractCommandline
- properties and kwargs
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2822
Summary: Bio.Application.AbstractCommandline - properties and
kwargs
Product: Biopython
Version: Not Applicable
Platform: All
OS/Version: All
Status: NEW
Severity: enhancement
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk
I have two related proposals to make the command line wrapper objects easier to
use,
(1) Supporting keyword arguments in __init__
(2) Supporting parameters as python properties
These both require each parameter to have a "human readable alias" which is
also a valid python identifier (this should be the case in CVS now). I will
attach patches to this bug, and perhaps put this on github too.
For reference, consider this example (based on one in test_Emboss.py) using the
old code in CVS:
>>> from Bio.Emboss.Applications import WaterCommandline
>>> water_exe = r"C:\Progra~1\Emboss\water.exe"
>>> cline = WaterCommandline(cmd=water_exe)
>>> cline.set_parameter("-asequence", "asis:ACCCGGGCGCGGT")
>>> cline.set_parameter("-bsequence", "asis:ACCCGAGCGCGGT")
>>> cline.set_parameter("-gapopen", "10")
>>> cline.set_parameter("-gapextend", "0.5")
>>> cline.set_parameter("-outfile", "temp_test.water")
>>> print cline
C:\Progra~1\Emboss\water.exe -asequence=asis:ACCCGGGCGCGGT
-bsequence=asis:ACCCGAGCGCGGT -gapopen=10 -gapextend=0.5
-outfile=temp_test.water
Note that the parameters can have aliases (sometimes at the actual command
line, e.g. a long and a short version of the same switch). Here the following
is also supported:
>>> from Bio.Emboss.Applications import WaterCommandline
>>> water_exe = r"C:\Progra~1\Emboss\water.exe"
>>> cline = WaterCommandline(cmd=water_exe)
>>> cline.set_parameter("asequence", "asis:ACCCGGGCGCGGT")
>>> cline.set_parameter("bsequence", "asis:ACCCGAGCGCGGT")
>>> cline.set_parameter("gapopen", "10")
>>> cline.set_parameter("gapextend", "0.5")
>>> cline.set_parameter("outfile", "temp_test.water")
>>> print cline
C:\Progra~1\Emboss\water.exe -asequence=asis:ACCCGGGCGCGGT
-bsequence=asis:ACCCGAGCGCGGT -gapopen=10 -gapextend=0.5
-outfile=temp_test.water
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Wed Apr 29 06:00:14 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 29 Apr 2009 06:00:14 -0400
Subject: [Biopython-dev] [Bug 2822] Bio.Application.AbstractCommandline -
properties and kwargs
In-Reply-To:
Message-ID: <200904291000.n3TA0EBu028672@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2822
------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-29 06:00 EST -------
Created an attachment (id=1287)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1287&action=view)
Adds keyword argument support to the __init__ method
This patch adds keyword argument support to the __init__ method, although for
the purposes of demonstration in this patch I have only updated the EMBOSS
wrappers to use it. As an alternative to the earlier example you would be able
to do:
>>> from Bio.Emboss.Applications import WaterCommandline
>>> water_exe = r"C:\Progra~1\Emboss\water.exe"
>>> cline = WaterCommandline(cmd=water_exe, asequence="asis:ACCCGGGCGCGGT", bsequence="asis:ACCCGAGCGCGGT", gapopen="10", gapextend="0.5", outfile="temp_test.water")
>>> print cline
C:\Progra~1\Emboss\water.exe -asequence=asis:ACCCGGGCGCGGT
-bsequence=asis:ACCCGAGCGCGGT -gapopen=10 -gapextend=0.5
-outfile=temp_test.water
You can of course still use the set_parameter approach as well, for example to
change a setting:
>>> cline.set_parameter("gapopen", "20")
>>> print cline
C:\Progra~1\Emboss\water.exe -asequence=asis:ACCCGGGCGCGGT
-bsequence=asis:ACCCGAGCGCGGT -gapopen=20 -gapextend=0.5
-outfile=temp_test.water
I think this is much nicer, and also more like some of the existing "helper
functions" we have for wrapping command line tools.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From biopython at maubp.freeserve.co.uk Wed Apr 29 06:25:17 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 29 Apr 2009 11:25:17 +0100
Subject: [Biopython-dev] Properties in Bio.Application interface?
In-Reply-To: <320fb6e00904260546s2d3eeb73iec08df89d93f9908@mail.gmail.com>
References: <320fb6e00904260546s2d3eeb73iec08df89d93f9908@mail.gmail.com>
Message-ID: <320fb6e00904290325w756493f2i264f79cd3f7e08b8@mail.gmail.com>
On Sun, Apr 26, 2009 at 1:46 PM, Peter wrote:
>
> I have cleaning up the existing Bio.Application command line objects
> in CVS to follow the parameter alias convention already laid out in
> Bio.Application. ?i.e. They all now have human readable paramater
> aliases, which are also valid python identifiers. ?This means these
> "human readable names" can also be used for argument names in
> __init__ (using **kwargs), or as property names.
>
> I think I've got properties working now as an experiment on my
> machine, generated at run time using the "human readable name" for
> each parameter. ?We would need to special case "switch" arguments
> (i.e. those which take no value) as outlined above.
>
> Does this sound worthwhile? ?If so, I'll put together an enhancement
> bug with a patch, or a branch on github.
I've filed Bug 2822 for these enhancements to the Bio.Application
based command line objects,
http://bugzilla.open-bio.org/show_bug.cgi?id=2822
So far there is just a patch to support keyword arguments (quite
simple really), with an example of how this changes the interface.
I'm still working on the code to do properties as well - I thought I'd
solved this a few days ago but it doesn't quite work...
Peter
From p.j.a.cock at googlemail.com Wed Apr 29 06:31:26 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 29 Apr 2009 11:31:26 +0100
Subject: [Biopython-dev] [Biopython] Parsing large blast files
In-Reply-To: <320fb6e00904290133o28d1b45dh5091bc8d9b278fff@mail.gmail.com>
References: <320fb6e00904280636s7966690dh8f109e9e438fdec3@mail.gmail.com>
<290052.25369.qm@web62407.mail.re1.yahoo.com>
<320fb6e00904290133o28d1b45dh5091bc8d9b278fff@mail.gmail.com>
Message-ID: <320fb6e00904290331n654964bficfc68ae92d477387@mail.gmail.com>
On Apr 29, Peter wrote:
> On Apr 29, Michiel de Hoon wrote:
>>
>> How would users typically use Bio.Blast.Applications?
>>
>
> In the next release, I would aim to have Bio.Blast.Applications
> updated to cover blastall (fully), plus blastpgp and rpsblast
> (currently not covered) and for the three helper functions
> Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast to all use
> Bio.Blast.Applications internally. ?...
>
> If you where asking about the actual code for how to build the command
> line object, well I have some thoughts on making the current
> Bio.Application base class easier to use (properties and keyword
> arguments at init) which I have started to discuss on the dev list.
See this dev list thread:
http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005916.html
And Bug 2822 (with examples):
http://bugzilla.open-bio.org/show_bug.cgi?id=2822
Peter
From bugzilla-daemon at portal.open-bio.org Wed Apr 29 07:05:25 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 29 Apr 2009 07:05:25 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To:
Message-ID: <200904291105.n3TB5PRe000547@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2815
------- Comment #20 from cymon.cox at gmail.com 2009-04-29 07:05 EST -------
(In reply to comment #18)
> (In reply to comment #17)
> Also, would you be able to look into making the Prank test faster to run? Maybe
> use a smaller example input file? After we do that, I'd like to use it to test
> the Nexus parser via Bio.AlignIO (just something simple which won't be affected
> by gap differences between different versions of PRANK - like my tests for
> MUSCLE).
Reduced run time from 8s to 1s, added asserts for Nexus outfile parsing.
Pushed to applic-int
C.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Wed Apr 29 07:40:56 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 29 Apr 2009 07:40:56 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To:
Message-ID: <200904291140.n3TBeu6o002524@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2815
------- Comment #21 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-29 07:40 EST -------
(In reply to comment #20)
> (In reply to comment #18)
> > (In reply to comment #17)
> > Also, would you be able to look into making the Prank test faster to run?
> > Maybe use a smaller example input file? After we do that, I'd like to
> > use it to test the Nexus parser via Bio.AlignIO (just something simple
> > which won't be affected by gap differences between different versions of
> > PRANK - like my tests for MUSCLE).
>
> Reduced run time from 8s to 1s, added asserts for Nexus outfile parsing.
>
> Pushed to applic-int
> C.
Lovely - checked into CVS. On the Linux machine I tested this on it went from
16s to 2s :)
P.S. See also Bug 2822 for some of my ideas on making the Bio.Application base
class easier to use.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From chapmanb at 50mail.com Wed Apr 29 08:11:15 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 29 Apr 2009 08:11:15 -0400
Subject: [Biopython-dev] Installation documentation
In-Reply-To: <320fb6e00904280640l58f0098fs467d90646a039e29@mail.gmail.com>
References: <320fb6e00904270859q23a9abd8rfb21b10ebee30a77@mail.gmail.com>
<20090428124119.GV34546@sobchak.mgh.harvard.edu>
<320fb6e00904280640l58f0098fs467d90646a039e29@mail.gmail.com>
Message-ID: <20090429121115.GX34546@sobchak.mgh.harvard.edu>
Hi Peter;
> Well, easy_install isn't (yet) an official python standard so I hadn't
> previously worried about it - our wiki Downloads page does mention it.
> Frankly the less "official" ways the are to install, the less ways it
> can go wrong, and then the less questions need to be asked when it
> goes wrong.
I hear you about too many options. I am a fan of easy_install
and PyPi seems to have some momentum even if it is not officially
endorsed. The way I normally work on cluster/shared machines is to
have an up to date local version of Python and easy_install
things I need. PyPi can also handle dependencies, which is nice -- I
actually wrote some commented out code in setup.py which will help
enable automatic numpy installation now that we are supporting only
2.4 or better.
> Nor had I worried about how PyPi's listing might need to be updated.
> I assumed it was clever enough to scan the http://biopython.org/DIST/
> directory and parse the filenames. Is the real answer you (Brad) kept
> it up to date?
> http://pypi.python.org/pypi/biopython/
Yes, I've been doing it on PyPi. The -f option you recommended on
the wiki is good in case that is out of date, and I copied that into
the install docs for consistency.
> > Peter, if you have an account on pypi, let me know your login and I
> > can add you as an owner for Biopython.
>
> I don't have an account on pypi.
Cool -- if you end up wanting to play with it just let me know and
I'll add you.
Brad
From bugzilla-daemon at portal.open-bio.org Wed Apr 29 08:23:19 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 29 Apr 2009 08:23:19 -0400
Subject: [Biopython-dev] [Bug 2815] Bio.Application command line interfaces
In-Reply-To:
Message-ID: <200904291223.n3TCNJaG005773@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2815
------- Comment #22 from cymon.cox at gmail.com 2009-04-29 08:23 EST -------
(In reply to comment #21)
> P.S. See also Bug 2822 for some of my ideas on making the Bio.Application base
> class easier to use.
Eagerly anticipating the github branch ;)
C.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Wed Apr 29 08:53:40 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 29 Apr 2009 08:53:40 -0400
Subject: [Biopython-dev] [Bug 2822] Bio.Application.AbstractCommandline -
properties and kwargs
In-Reply-To:
Message-ID: <200904291253.n3TCrec4008244@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2822
------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-29 08:53 EST -------
OK, for the moment I'm going to give up on the property idea. I was trying to
add them dynamically in __init__ or __new__ based on the parameter list, but
this is actually rather tricky. I still think it should be possible though...
We could use __getattr__ but that doesn't create an entry in dir(...), and thus
is not discoverable - nor can use use each parameter's description for a
docstring this way.
Perhaps the simplest idea would be to the properties explicitly in each
subclass, but this would require more upfront effort as all the existing
property object lists would need to be replaced.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Wed Apr 29 10:35:41 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 29 Apr 2009 10:35:41 -0400
Subject: [Biopython-dev] [Bug 2822] Bio.Application.AbstractCommandline -
properties and kwargs
In-Reply-To:
Message-ID: <200904291435.n3TEZfED018571@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2822
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Attachment #1287 is|0 |1
obsolete| |
------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-29 10:35 EST -------
Created an attachment (id=1288)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1288&action=view)
Adds keyword argument support to the __init__ method AND properties
(In reply to comment #2)
> OK, for the moment I'm going to give up on the property idea. I was trying to
> add them dynamically in __init__ or __new__ based on the parameter list, but
> this is actually rather tricky. I still think it should be possible though...
I was close earlier, and think I have solved it now :)
As before, this patch adds keyword argument support to the __init__ method, but
also setups properties dynamically. Again, for the purposes of demonstration
in this patch I have only updated the EMBOSS wrappers to use this.
So, my original example (using the current code) was:
>>> from Bio.Emboss.Applications import WaterCommandline
>>> water_exe = r"C:\Progra~1\Emboss\water.exe"
>>> cline = WaterCommandline(cmd=water_exe)
>>> cline.set_parameter("asequence", "asis:ACCCGGGCGCGGT")
>>> cline.set_parameter("bsequence", "asis:ACCCGAGCGCGGT")
>>> cline.set_parameter("gapopen", "10")
>>> cline.set_parameter("gapextend", "0.5")
>>> cline.set_parameter("outfile", "temp_test.water")
>>> print cline
C:\Progra~1\Emboss\water.exe -asequence=asis:ACCCGGGCGCGGT
-bsequence=asis:ACCCGAGCGCGGT -gapopen=10 -gapextend=0.5
-outfile=temp_test.water
With the __init__ keyword argument support, this becomes valid:
>>> from Bio.Emboss.Applications import WaterCommandline
>>> water_exe = r"C:\Progra~1\Emboss\water.exe"
>>> cline = WaterCommandline(cmd=water_exe, asequence="asis:ACCCGGGCGCGGT", bsequence="asis:ACCCGAGCGCGGT", gapopen="10", gapextend="0.5", outfile="temp_test.water")
>>> print cline
C:\Progra~1\Emboss\water.exe -asequence=asis:ACCCGGGCGCGGT
-bsequence=asis:ACCCGAGCGCGGT -gapopen=10 -gapextend=0.5
-outfile=temp_test.water
You can of course still use the set_parameter approach as well, for example to
change a setting:
>>> cline.set_parameter("gapopen", "20")
>>> print cline
C:\Progra~1\Emboss\water.exe -asequence=asis:ACCCGGGCGCGGT
-bsequence=asis:ACCCGAGCGCGGT -gapopen=20 -gapextend=0.5
-outfile=temp_test.water
With the property support, you can then read/or set parameter values directly:
>>> cline.gapopen
'20'
>>> cline.gapopen = 15
>>> cline.gapopen
15
>>> print cline
C:\Progra~1\Emboss\water.exe -asequence=asis:ACCCGGGCGCGGT
-bsequence=asis:ACCCGAGCGCGGT -gapopen=15 -gapextend=0.5
-outfile=temp_test.water
This is much nicer I think, but perhaps the biggest plus point is the
properties have docstrings which show via:
>>> help(cline)
...
and are discoverable:
>>> dir(cline)
['__class__', '__delattr__', '__dict__', '__doc__', '__getattribute__',
'__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__',
'__repr__', '__setattr__', '__str__', '__weakref__', '_check_value',
'_get_parameter', 'aformat', 'asequence', 'bsequence', 'datafile', 'gapextend',
'gapopen', 'outfile', 'parameters', 'program_name', 'set_parameter',
'similarity', 'snucleotide', 'sprotein']
This makes the parameters all readily discoverable, without having to resort to
looking at Biopython's source code, or the command line application's help.
Right now (using the old code in CVS), the information is there but buried:
>>> print cline.parameters
[, , ,
, , ,
, , ,
]
>>> for p in cline.parameters :
... print p.names, p.description
...
['-asequence', 'asequence'] First sequence to align
['-bsequence', 'bsequence'] Second sequence to align
['-gapopen', 'gapopen'] Gap open penalty
['-gapextend', 'gapextend'] Gap extension penalty
['-outfile', 'outfile'] Output file for the alignment
['-datafile', 'datafile'] Matrix file
['-similarity', 'similarity'] Display percent identity and similarity
['-snucleotide', 'snucleotide'] Sequences are nucleotide (boolean)
['-sprotein', 'sprotein'] Sequences are protein (boolean)
['-aformat', 'aformat'] Display output in a different specified output format
So, comments? We can choose to add EITHER the __init__ keyword arguments OR
the properties. Or of course, BOTH. Or neither, and just leave the interface
as it stand in CVS now.
Peter
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From biopython at maubp.freeserve.co.uk Wed Apr 29 11:34:26 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 29 Apr 2009 16:34:26 +0100
Subject: [Biopython-dev] Properties in Bio.Application interface?
In-Reply-To: <320fb6e00904290325w756493f2i264f79cd3f7e08b8@mail.gmail.com>
References: <320fb6e00904260546s2d3eeb73iec08df89d93f9908@mail.gmail.com>
<320fb6e00904290325w756493f2i264f79cd3f7e08b8@mail.gmail.com>
Message-ID: <320fb6e00904290834g7a73c7au487564e3b103250@mail.gmail.com>
On Wed, Apr 29, 2009 at 11:25 AM, Peter wrote:
> I've filed Bug 2822 for these enhancements to the Bio.Application
> based command line objects,
> http://bugzilla.open-bio.org/show_bug.cgi?id=2822
I think I learnt some more about python in the process, which may be a
sign that the code I've come up with is too complicated, but Bug 2822
now has a patch to support both keyword arguments and properties in
the Bio.Application style command line wrappers. This will require
minor changes to the __init__ method of any command line sub-class
(demonstrated using Bio.Emboss.Applications only thus far). I can
envision a simpler approach to this code by defining the properties
explicitly in each subclass, but that would mean a lot of boring/risky
refactoring (or a clever script to do it for us).
There are examples using the new code in the bug comments. Apart from
preferring this API, the other big difference is the properties
provide built in help.
I'll be away for the next four days so I (probably) won't be able to
reply to any comments or questions till Monday.
Peter
From biopython at maubp.freeserve.co.uk Wed Apr 29 12:10:28 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 29 Apr 2009 17:10:28 +0100
Subject: [Biopython-dev] Git on Windows
Message-ID: <320fb6e00904290910y29c6386ax8975ef3c2597d09e@mail.gmail.com>
Hi all,
I just wanted to say I've had a very quick test of git on Windows,
using the git package from cygwin, and it seems to work OK. After
copying my SSH key over from my main machine, I was able to clone my
github repository, merge from the upstream Biopython branch (i.e. the
one being updated from CVS), and push this back to my personal github
repository.
Why did I use git from cygwin? Well I have cygwin installed anyway
for mingw32 (the compiler used for the Biopython Windows installers
for Python 2.3 to 2.5), and was already using the cvs package from
cygwin, so this seemed simplest.
Peter
From dalloliogm at gmail.com Wed Apr 29 12:42:50 2009
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Wed, 29 Apr 2009 18:42:50 +0200
Subject: [Biopython-dev] Git on Windows
In-Reply-To: <320fb6e00904290910y29c6386ax8975ef3c2597d09e@mail.gmail.com>
References: <320fb6e00904290910y29c6386ax8975ef3c2597d09e@mail.gmail.com>
Message-ID: <5aa3b3570904290942g6a73fae3k3a53c2e13c95c258@mail.gmail.com>
On Wed, Apr 29, 2009 at 6:10 PM, Peter wrote:
> Hi all,
>
> I just wanted to say I've had a very quick test of git on Windows,
Hi,
by the way, this is a document published by google on a comparison hg/git:
- http://code.google.com/p/support/wiki/DVCSAnalysis
In the comments, there is some discussion over git clients for Windows.
--
Giovanni Dall'Olio, phd student
Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain)
My blog on bioinformatics: http://bioinfoblog.it
From eric.talevich at gmail.com Wed Apr 29 15:28:58 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Wed, 29 Apr 2009 15:28:58 -0400
Subject: [Biopython-dev] XML parsing library for new modules
Message-ID: <3f6baf360904291228m1bba6008kb26e345510772e7a@mail.gmail.com>
Hi all,
I'm writing a parser for the PhyloXML format for Google Summer of Code this
year, and as the name would imply, it requires parsing some large XML files.
The existing modules in Biopython for parsing XML formats seem to use
xml.sax in the standard library. In Python 2.5, a faster and more Pythonic
parser was added to the standard lib: ElementTree (xml.etree), in
pure-Python and C-enhanced flavors. How do you feel about each of these
libraries as the basis for a new Biopython module?
Here are some interesting benchmarks:
http://effbot.org/zone/celementtree.htm#benchmarks
The ElementTree library is also available as a standalone package,
compatible back to Python 2.1, and the lxml package also offers an
independent implementation. So maintaining compatibility with Python 2.4
would require the availability of one of these third-party packages, and my
code would try each of these imports in order:
from xml.etree import cElementTree as ElementTree
from xml.etree import ElementTree
# Separate lxml package
from lxml.etree import ElementTree
# Standalone elementtree package
import cElementTree as ElementTree
from elementtree import ElementTree
Then one day, when Python 2.4 is no longer supported, only the first two
lines would be needed. (The second line is for sites that disable C
extensions, like Google App Engine, or alternate Python implementations like
Jython.)
Another option is xml.parsers.expat, but just Googling around, it appears
that the Python zeitgeist is strongly in favor of xml.etree for new code.
Thoughts?
Thanks,
Eric
From chapmanb at 50mail.com Thu Apr 30 08:05:32 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 30 Apr 2009 08:05:32 -0400
Subject: [Biopython-dev] Properties in Bio.Application interface?
In-Reply-To: <320fb6e00904290834g7a73c7au487564e3b103250@mail.gmail.com>
References: <320fb6e00904260546s2d3eeb73iec08df89d93f9908@mail.gmail.com>
<320fb6e00904290325w756493f2i264f79cd3f7e08b8@mail.gmail.com>
<320fb6e00904290834g7a73c7au487564e3b103250@mail.gmail.com>
Message-ID: <20090430120532.GA50777@sobchak.mgh.harvard.edu>
Hi Peter;
> > I've filed Bug 2822 for these enhancements to the Bio.Application
> > based command line objects,
> > http://bugzilla.open-bio.org/show_bug.cgi?id=2822
>
> I think I learnt some more about python in the process, which may be a
> sign that the code I've come up with is too complicated, but Bug 2822
> now has a patch to support both keyword arguments and properties in
> the Bio.Application style command line wrappers. This will require
> minor changes to the __init__ method of any command line sub-class
> (demonstrated using Bio.Emboss.Applications only thus far). I can
> envision a simpler approach to this code by defining the properties
> explicitly in each subclass, but that would mean a lot of boring/risky
> refactoring (or a clever script to do it for us).
I love what you are doing here. The keywords and properties make
it much more Pythonic; the old way reeks of Java-style get/sets. My
vote is to put them both in.
Brad
From marcin.swiatek at mail.mcgill.ca Thu Apr 30 11:23:35 2009
From: marcin.swiatek at mail.mcgill.ca (Marcin Swiatek)
Date: Thu, 30 Apr 2009 11:23:35 -0400
Subject: [Biopython-dev] MUMmer
Message-ID: <176A06E658ED0745965C072C5F2C116A037F057A@EXCHANGE2VS2.campus.mcgill.ca>
Hello,
I guess I should start with a nice 'hi' to everybody, now that I am
sending my first message to this group. So: Hi, Everybody!
Now, that we have the formality out of the way, I will get to the point.
Recently, I have written some Python code for parsing and processing the
output of MUMmer tool (http://mummer.sourceforge.net/). More
specifically, the code I have manages invocations and handles outputs of
the nucmer pipeline (alignment of multiple closely related nucleotide
sequences) and of mummer itself (short exact matches). Obviously, the
results are ultimately rendered as pairs of biopython's Seq objects.
I use this stuff only myself, in work on bacterial genomes, but I would
be more than willing to contribute it to the project. It may be rough
around the edges at the moment, but I think I could easily give it the
necessary polish if there is interest in having it included.
Should that be the case, could one of the project leads point me in the
right direction, please? How should I go about the submission?
Regards,
Marcin Swiatek
From bartek at rezolwenta.eu.org Thu Apr 30 12:50:41 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Thu, 30 Apr 2009 18:50:41 +0200
Subject: [Biopython-dev] MUMmer
In-Reply-To: <176A06E658ED0745965C072C5F2C116A037F057A@EXCHANGE2VS2.campus.mcgill.ca>
References: <176A06E658ED0745965C072C5F2C116A037F057A@EXCHANGE2VS2.campus.mcgill.ca>
Message-ID: <8b34ec180904300950n2c75f010oed27493f52d0da14@mail.gmail.com>
Hi Marcin,
On Thu, Apr 30, 2009 at 5:23 PM, Marcin Swiatek
wrote:
> Hello,
>
>
>
> I use this stuff only myself, in work on bacterial genomes, but I would
> be more than willing to contribute it to the project. It may be rough
> around the edges at the moment, but I think I could easily give it the
> necessary polish if there is interest in having it included.
>
Contributions are always welome
>
>
> Should that be the case, could one of the project leads point me in the
> right direction, please? How should I go about the submission?
>
>
I don't think I qualify as a lead, but nonetheless I think I can help here.
I think that the best way to submit your code currently is to create a
branch (fork) of
biopython on github and submit your changes there and then notify
people on biopython-dev
that there is new code to review. You can also submit an enhancement
bug to bugzilla.
There are a couple of wiki pages which might be of interest to you:
- http://biopython.org/wiki/Contributing
- http://biopython.org/wiki/GitUsage
If you have any questions or problems during the process, ask on the list.
As for the code, I'm not sure, but maybe instead of returning a pair
of sequences, an alignment object might be a better choice?
You might want to also check out a recent code on application wrappers:
http://lists.open-bio.org/pipermail/biopython-dev/2009-April/005766.html
cheers
Bartek
From bugzilla-daemon at portal.open-bio.org Wed Apr 1 11:28:12 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Apr 2009 07:28:12 -0400
Subject: [Biopython-dev] [Bug 2802] New: Loader.py: load SeqRecord comments
as list
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2802
Summary: Loader.py: load SeqRecord comments as list
Product: Biopython
Version: 1.49b
Platform: All
OS/Version: All
Status: NEW
Severity: enhancement
Priority: P2
Component: BioSQL
AssignedTo: biopython-dev at biopython.org
ReportedBy: andrea at biodec.com
Loader.py version: 1.38 or below
python: any
Actually seqrecord.annotation['comment'] is a string. SProt parser and GenBank
parser parse comment as string. SProt record parser, instead, parse comment as
list, according to the "-!-" tag. I'm working on parsing comment as lists,
either for Uniprot and for GenBank (ncbi), and I need to have the possibility
to manage comment as lists.
The biosql schema, also, has in the table "comment", the field "rank" that
is suitable to be used for storing list entries. In this way the table is
ready and implemented to store list data.
The patch is retro-compatible, so the _load_comment function is able to
load either string or list entries, according to the data type.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Wed Apr 1 11:29:02 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Apr 2009 07:29:02 -0400
Subject: [Biopython-dev] [Bug 2802] Loader.py: load SeqRecord comments as
list
In-Reply-To:
Message-ID: <200904011129.n31BT23k007952@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2802
------- Comment #1 from andrea at biodec.com 2009-04-01 07:29 EST -------
Created an attachment (id=1270)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1270&action=view)
proposed Patch
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Wed Apr 1 11:48:15 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Apr 2009 07:48:15 -0400
Subject: [Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug
fixes and cleanup
In-Reply-To:
Message-ID: <200904011148.n31BmFmX009292@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2176
------- Comment #14 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-01 07:48 EST -------
I've updated CVS as per comment 12 to also use record.query_length, and comment
13 to also use record.database_length.
Before:
>>> from Bio.Blast import NCBIXML
>>> for record in NCBIXML.parse(open("xbt007.xml")) :
... print record.query_id
... print record.query_letters, record.query_length
... print record.num_letters_in_database, record.database_letters,
record.database_length
...
gi|585505|sp|Q08386|MOPB_RHOCA
270 None
13958303 None None
gi|129628|sp|P07175.1|PARA_AGRTU
222 None
13958303 None None
Now, with Bio/Blast/NCBIXML.py CVS revision 1.20 or 1.21,
>>> from Bio.Blast import NCBIXML
>>> for record in NCBIXML.parse(open("xbt007.xml")) :
... print record.query_id
... print record.query_letters, record.query_length
... print record.num_letters_in_database, record.database_letters,
record.database_length
...
gi|585505|sp|Q08386|MOPB_RHOCA
270 270
13958303 None 13958303
gi|129628|sp|P07175.1|PARA_AGRTU
222 222
13958303 None 13958303
We could perhaps deprecate record.database_letters immediately, and at a later
point, record.query_letters
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Wed Apr 1 11:50:07 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Apr 2009 07:50:07 -0400
Subject: [Biopython-dev] [Bug 2802] Loader.py: load SeqRecord comments as
list
In-Reply-To:
Message-ID: <200904011150.n31Bo7ib009452@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2802
------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-01 07:50 EST -------
See also Bug 2235 for the SwissProt parsing into SeqRecord objects.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Wed Apr 1 12:33:37 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Apr 2009 08:33:37 -0400
Subject: [Biopython-dev] [Bug 2802] Loader.py: load SeqRecord comments as
list
In-Reply-To:
Message-ID: <200904011233.n31CXbuM012687@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2802
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #3 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-01 08:33 EST -------
Thanks for the report and suggested patch. This is now fixed in CVS (slightly
differently though). I'd be grateful if you could test the latest code. A
fresh CVS checkout would be easiest - you'll need to update several files as I
was working on another issue at the same time:
Checking in BioSQL/BioSeq.py;
/home/repository/biopython/biopython/BioSQL/BioSeq.py,v <-- BioSeq.py
new revision: 1.35; previous revision: 1.34
done
Checking in BioSQL/Loader.py;
/home/repository/biopython/biopython/BioSQL/Loader.py,v <-- Loader.py
new revision: 1.39; previous revision: 1.38
done
Checking in Tests/test_BioSQL_SeqIO.py;
/home/repository/biopython/biopython/Tests/test_BioSQL_SeqIO.py,v <--
test_BioSQL_SeqIO.py
new revision: 1.33; previous revision: 1.32
done
Checking in Tests/output/test_BioSQL_SeqIO;
/home/repository/biopython/biopython/Tests/output/test_BioSQL_SeqIO,v <--
test_BioSQL_SeqIO
new revision: 1.6; previous revision: 1.5
done
Thanks,
Peter
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From biopython at maubp.freeserve.co.uk Wed Apr 1 14:23:45 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 1 Apr 2009 15:23:45 +0100
Subject: [Biopython-dev] Testing Biopython with NumPy 1.3
In-Reply-To: <320fb6e00903310212o29bba163ma9d68a901eabc2c9@mail.gmail.com>
References: <320fb6e00903301535j21ae6659r931c9be0fd17faf3@mail.gmail.com>
<730606.962.qm@web62408.mail.re1.yahoo.com>
<320fb6e00903310212o29bba163ma9d68a901eabc2c9@mail.gmail.com>
Message-ID: <320fb6e00904010723j594bc958kc721a234c54d4ea5@mail.gmail.com>
On Tue, Mar 31, 2009 at 10:12 AM, Peter wrote:
> On Tue, Mar 31, 2009 at 1:08 AM, Michiel de Hoon wrote:
>>
>>> So, whatever is going wrong on test_Cluster.py seems to be
>>> specific to Windows (XP) and Python 2.6 - and possibly just
>>> my Windows development machine.
>>>
>> I believe that the problem is that msvcr90.dll is missing. This
>> is the C runtime from Microsoft. Earlier Pythons used
>> msvcr71.dll, if I'm not mistaken.
>
> You may be right - there is some stuff on the numpy mailing list
> about this and manifest files etc when using mingw32. ?It may
> be simplest to try the appropriate MS compiler instead...
OK, good news using the MS compiler:
I went to http://www.microsoft.com/express/download/ and installed the
free VC++ 2008 Express Edition (using the web install, unticking the
optional silverlight and sql server bits). Using the "Visual Studio
2008 Command Prompt" shortcut I was able to build, test, install
Biopython CVS fine. All this shortcut claims to do is setup suitable
environment variables first, so this last bit can probably be
simplified for every day use. This should mean we can include a
Biopython 1.50 (beta) installer for Windows on Python 2.6 using NumPy
1.3 :)
It would still be nice to resolve the mingw32 issue, but it isn't
critical right now.
Peter
From bugzilla-daemon at portal.open-bio.org Wed Apr 1 14:41:24 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Apr 2009 10:41:24 -0400
Subject: [Biopython-dev] [Bug 2803] New: Insure Alignment objects are passed
to AlignIO.write()
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2803
Summary: Insure Alignment objects are passed to AlignIO.write()
Product: Biopython
Version: Not Applicable
Platform: PC
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: cymon.cox at gmail.com
Insure Alignment objects are passed to AlignIO.write()
Stops this kind of abuse:
records = list(SeqIO.parse(open("Tests/NBRF/DMA_nuc.pir", "r"), "pir"))
AlignIO.write([records], open("alignIO.fasta", "w"), "fasta")
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Wed Apr 1 14:42:55 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Apr 2009 10:42:55 -0400
Subject: [Biopython-dev] [Bug 2803] Insure Alignment objects are passed to
AlignIO.write()
In-Reply-To:
Message-ID: <200904011442.n31EgtlQ023181@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2803
------- Comment #1 from cymon.cox at gmail.com 2009-04-01 10:42 EST -------
Created an attachment (id=1271)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1271&action=view)
nsure-Alignment-objects-are-passed-to-write-AlignIO
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Wed Apr 1 15:25:36 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Apr 2009 11:25:36 -0400
Subject: [Biopython-dev] [Bug 2803] Insure Alignment objects are passed to
AlignIO.write()
In-Reply-To:
Message-ID: <200904011525.n31FPa3V026200@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2803
------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-01 11:25 EST -------
Thanks for filing the bug (originally raised in our discussion on the mailing
list).
There is a major drawback to your proposed fix,
+ if isinstance(alignments, types.GeneratorType):
+ alignments = list(alignments)
This means if you gave the AlignIO.write function a generator returning
hundreds or large alignment objects, they would all get loaded into memory at
once. One of the big aims with Bio.SeqIO and AlignIO in using
generators/iterators is to allow memory efficient working where we try to keep
only one record/alignment in memory at a time.
Anyway, I'll take a look at this. I think we need to just check the case where
Bio.AlignIO.write uses Bio.SeqIO.write internally...
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Wed Apr 1 15:36:54 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Apr 2009 11:36:54 -0400
Subject: [Biopython-dev] [Bug 2803] Insure Alignment objects are passed to
AlignIO.write()
In-Reply-To:
Message-ID: <200904011536.n31Fasdu027053@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2803
------- Comment #3 from cymon.cox at gmail.com 2009-04-01 11:36 EST -------
(In reply to comment #2)
> Thanks for filing the bug (originally raised in our discussion on the mailing
> list).
>
> There is a major drawback to your proposed fix,
>
> + if isinstance(alignments, types.GeneratorType):
> + alignments = list(alignments)
>
> This means if you gave the AlignIO.write function a generator returning
> hundreds or large alignment objects, they would all get loaded into memory at
> once. One of the big aims with Bio.SeqIO and AlignIO in using
> generators/iterators is to allow memory efficient working where we try to keep
> only one record/alignment in memory at a time.
>
> Anyway, I'll take a look at this. I think we need to just check the case where
> Bio.AlignIO.write uses Bio.SeqIO.write internally...
>
Yes, I see. I had originally intended to check the type while looping through
the alignments before calling SeqIO.write, but thought better of it because
some alignments may get written before a error occurs, whereas it seems best
that either all or none at all get written from the call to AlignIO.write.
C.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Wed Apr 1 15:55:26 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Apr 2009 11:55:26 -0400
Subject: [Biopython-dev] [Bug 2803] Insure Alignment objects are passed to
AlignIO.write()
In-Reply-To:
Message-ID: <200904011555.n31FtQ9X028474@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2803
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-01 11:55 EST -------
(In reply to comment #3)
> > Anyway, I'll take a look at this. I think we need to just check the case
> > where Bio.AlignIO.write uses Bio.SeqIO.write internally...
That turned out to be the case, fixed in CVS. See Bio/AlignIO/__init__.py
revision 1.22 and Tests/test_AlignIO.py 1.19
> Yes, I see. I had originally intended to check the type while looping through
> the alignments before calling SeqIO.write, but thought better of it because
> some alignments may get written before a error occurs, whereas it seems best
> that either all or none at all get written from the call to AlignIO.write.
You are right, if we are given a list/iterator containing some real Alignments
but also some non-Alignments we have a problem. We can't pre-check all the
entries before writing without converting to a list (and this ruins the memory
benefits). We just catching the erroneous input when we reach it, even though
it may happen half way through writing to the file.
Marking as fixed.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Wed Apr 1 18:04:05 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Apr 2009 14:04:05 -0400
Subject: [Biopython-dev] [Bug 2804] New: Clustalw subprocess hangs when
large stdout returned
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2804
Summary: Clustalw subprocess hangs when large stdout returned
Product: Biopython
Version: Not Applicable
Platform: PC
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: cymon.cox at gmail.com
As noted on the mailing list, the following hangs waiting for a return:
from Bio import SeqIO
from Bio import Clustalw
from Bio.Clustalw import MultipleAlignCL
records = list(SeqIO.parse(open("Tests/NBRF/Cw_prot.pir", "r"), "pir"))
handle = open("temp.fasta", "w")
SeqIO.write(records, handle, "fasta")
handle.close()
cline = MultipleAlignCL("temp.fasta", command="clustalw")
align = Clustalw.do_alignment(cline)
This appears to be due to a known issue as documented here:
http://docs.python.org/library/subprocess.html#subprocess.Popen.wait
but wasnt being picked up by the tests - presumably because no test file is
large enough to trigger the problem.
Instead of using .wait() it suggests .communicate()
The attached patch works for me on Linux. But as noted in __init__.py this
maybe an issue for Windows:
#We don't need to supply any piped input, but we setup the
#standard input pipe anyway as a work around for a python
#bug if this is called from a Windows GUI program. For
#details, see http://bugs.python.org/issue1124861
Also subprocess.returncode is now /3 so moved "if status: value = status / 256
"so that only done if calling os.popen()
C.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Wed Apr 1 18:05:10 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Apr 2009 14:05:10 -0400
Subject: [Biopython-dev] [Bug 2804] Clustalw subprocess hangs when large
stdout returned
In-Reply-To:
Message-ID: <200904011805.n31I5ACv005787@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2804
------- Comment #1 from cymon.cox at gmail.com 2009-04-01 14:05 EST -------
Created an attachment (id=1272)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1272&action=view)
clustalw subprocess patch
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Wed Apr 1 22:05:40 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Apr 2009 18:05:40 -0400
Subject: [Biopython-dev] [Bug 2804] Clustalw subprocess hangs when large
stdout returned
In-Reply-To:
Message-ID: <200904012205.n31M5eDa024097@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2804
------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-01 18:05 EST -------
It is great that you've found a simple and reproduceable test case. I can
confirm this problem on a Linux machine with Python 2.4.3 (what version of
python do you have?)
(In reply to comment #1)
> Created an attachment (id=1272)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1272&action=view) [details]
> clustalw subprocess patch
Unfortunately the patch is flawed here:
status = child_process.communicate()[1]
We want to get the return code (a numerical error value), but the communicate
method returns two strings giving the contents of stdout and strerr, i.e.
...
CLUSTAL W (1.83) Multiple Sequence Alignments
...
Sequence format is Pearson
Sequence 1: HLA_HLA00401 366 aa
Sequence 2: HLA_HLA00402 366 aa
...
Group 109: Sequences: 3 Score:6519
Group 110: Sequences: 111 Score:4464
Alignment Score 8299041
CLUSTAL-Alignment file created [temp.aln]
for stdout, and an empty string for stderr. Doing this seems to work on Linux
with python 2.4.3,
child_process.communicate() #ignore the stdout and stderr data!
child_process.stdin.close()
child_process.stdout.close()
child_process.stderr.close()
status = child_process.returncode
However, I have only tested this one example far, and not on Windows or the Mac
yet. It would be a good idea to extend test_Clustalw_tool.py to cover some
deliberate failures to check we can read the error level (return code) ClustalW
gives back. Of course, this will need testing with both clustalw 1.x and 2.x
to be safe.
Note that the original code using os.popen still works fine for this example.
We switched to subprocess because os.popen* are being deprecated on Python 2.6,
and didn't work well with names with spaces as I recall.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Wed Apr 1 22:42:39 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 1 Apr 2009 18:42:39 -0400
Subject: [Biopython-dev] [Bug 2804] Clustalw subprocess hangs when large
stdout returned
In-Reply-To:
Message-ID: <200904012242.n31MgdKd026637@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2804
------- Comment #3 from cymon.cox at gmail.com 2009-04-01 18:42 EST -------
(In reply to comment #2)
> It is great that you've found a simple and reproduceable test case. I can
> confirm this problem on a Linux machine with Python 2.4.3 (what version of
> python do you have?)
Python 2.5.2 (r252:60911, Oct 5 2008, 19:24:49)
[GCC 4.3.2] on linux2
on Ubuntu Intrepid
>
> (In reply to comment #1)
> > Created an attachment (id=1272)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1272&action=view) [details] [details]
> > clustalw subprocess patch
>
> Unfortunately the patch is flawed here:
>
> status = child_process.communicate()[1]
Actually, the 'whole' patch is good. Have a look at the second bit of the
patch, where I change my initial commit to my branch:
#Grab stderr
- status = child_process.communicate()[1]
+ child_process.communicate()
+ value = child_process.returncode
except ImportError :
etc...
I've been trying to get to grips with git - and clearly havent succeeded to
yet!
When you run the command "git format-patch" it creates a separate for each
commit to the branch, and I can't figure out how to just get the patch against
only the current version of the file. So git gave me two patches, which I
cat'ed together and submitted as a composite patch.
Sorry I didnt make that clear.
If anyone knows how to get the diff against only the current file version, I'd
appreciate the answer ;)
Cheers, C.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Apr 2 11:00:48 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Apr 2009 07:00:48 -0400
Subject: [Biopython-dev] [Bug 2804] Clustalw subprocess hangs when large
stdout returned
In-Reply-To:
Message-ID: <200904021100.n32B0mEZ014206@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2804
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Attachment #1272 is|0 |1
obsolete| |
------- Comment #4 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-02 07:00 EST -------
Created an attachment (id=1273)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1273&action=view)
Patch to Bio/Clustalw/__init__.py
(In reply to comment #3)
>
> When you run the command "git format-patch" it creates a separate for each
> commit to the branch, and I can't figure out how to just get the patch against
> only the current version of the file. So git gave me two patches, which I
> cat'ed together and submitted as a composite patch.
>
I see - that odd looking patch had confused me. I think you want to look at
"giff diff ..." for this, it also can do things like show the diff between the
remote branches.
I have tested this new patch on both Linux and Mac now, using both ClustalW
1.83 and 2.0.10 - next up Windows, and extending the unit test.
Peter
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Apr 2 11:32:40 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Apr 2009 07:32:40 -0400
Subject: [Biopython-dev] [Bug 2804] Clustalw subprocess hangs when large
stdout returned
In-Reply-To:
Message-ID: <200904021132.n32BWdqU016365@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2804
------- Comment #5 from cymon.cox at gmail.com 2009-04-02 07:32 EST -------
(In reply to comment #4)
> Created an attachment (id=1273)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1273&action=view) [details]
> Patch to Bio/Clustalw/__init__.py
>
> (In reply to comment #3)
> >
> > When you run the command "git format-patch" it creates a separate for each
> > commit to the branch, and I can't figure out how to just get the patch against
> > only the current version of the file. So git gave me two patches, which I
> > cat'ed together and submitted as a composite patch.
> >
>
> I see - that odd looking patch had confused me. I think you want to look at
> "giff diff ..." for this, it also can do things like show the diff between the
> remote branches.
>
> I have tested this new patch on both Linux and Mac now, using both ClustalW
> 1.83 and 2.0.10 - next up Windows, and extending the unit test.
Your new patch doesnt indent the lines (as in my original patch):
113 value = 0
114 if status: value = status / 256
so that they only get executed when run_clust = os.popen(str(command_line))
The return code from child_process.communicate() is already /256
also assign value = child_process.returncode (the return code is 0 for success
and never "")
"""
child_process.communicate()
value = child_process.returncode
except ImportError :
#Fall back for python 2.3
run_clust = os.popen(str(command_line))
status = run_clust.close()
# The exit status is the second byte of the termination status
# TODO - Check this holds on win32...
value = 0
if status: value = status / 256
# check the return value for errors, as on 1.81 the return value
# from Clustalw is actually helpful for figuring out errors
# 1 => bad command line option
if value == 1:
raise ValueError("Bad command line option in the command: %s"
% str(command_line))
"""
C.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Apr 2 14:34:10 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Apr 2009 10:34:10 -0400
Subject: [Biopython-dev] [Bug 2804] Clustalw subprocess hangs when large
stdout returned
In-Reply-To:
Message-ID: <200904021434.n32EYApO032328@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2804
------- Comment #6 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-02 10:34 EST -------
I've updated test_Clustalw_tool.py in CVS to catch this dead lock, and
confirmed the unit test will fail on Mac and Linux when using subprocess (on
the bright side, Python 2.3 should still work), but the test passes with the
fix outlined - or simply using the os.popen code instead.
Interestingly the lockup seems to happen more readily on Linux that on the Mac.
I've yet to test on Windows.
I also added three tests for standard error conditions - interestingly I don't
ever seem to get an error code back (either with subprocess or os.popen). What
about you? This makes testing these special cases for raising specific IOError
exceptions difficult.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Apr 2 15:19:04 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Apr 2009 11:19:04 -0400
Subject: [Biopython-dev] [Bug 2804] Clustalw subprocess hangs when large
stdout returned
In-Reply-To:
Message-ID: <200904021519.n32FJ4DC003715@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2804
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
OS/Version|Linux |All
Resolution| |FIXED
------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-02 11:19 EST -------
Hi Cymon,
I've updated the unit test for Windows on Python 2.3 through 2.6 (had to move
some file deletions to the end, and watch out for extra error message
variations).
Windows also deadlocks on this example when using subprocess - the test should
normally take about four seconds in total (depending on your computer's speed
of course). Using os.popen avoids the deadlock (but can't cope with file names
with spaces). Your fix in comment 5 also works :)
So, now we have a unit test which catches this deadlock on all three operating
systems, which confirms your fix which works on all three. I've checked it
into CVS, and marked this bug as fixed.
[I'm still not sure what is happening with the return values - if you look into
this further please raise a new bug for it.]
Thanks!
Peter
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Apr 2 15:32:39 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Apr 2009 11:32:39 -0400
Subject: [Biopython-dev] [Bug 2806] New: Possible deadlock (hang) in
Bio.Application using subprocess wait()
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2806
Summary: Possible deadlock (hang) in Bio.Application using
subprocess wait()
Product: Biopython
Version: Not Applicable
Platform: PC
OS/Version: All
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk
CC: cymon.cox at gmail.com
See Bug 2804 which demonstrated a reproducible hang on Windows, Linux and Mac
from the subprocess .wait() method, and a work around.
Bio.Application may suffer from the same problem, and could be fixed with the
same approach. Patch to follow ...
Ideally we'd have a suitable unit test covering this - perhaps using
Bio.EMBOSS?
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Apr 2 15:33:30 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Apr 2009 11:33:30 -0400
Subject: [Biopython-dev] [Bug 2806] Possible deadlock (hang) in
Bio.Application using subprocess wait()
In-Reply-To:
Message-ID: <200904021533.n32FXU67004756@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2806
------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-02 11:33 EST -------
Created an attachment (id=1274)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1274&action=view)
Patch to Bio/Application/__init__.py
Use the .communicate() method instead of .wait()
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Apr 2 19:18:56 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Apr 2009 15:18:56 -0400
Subject: [Biopython-dev] [Bug 2734] db.load problem with postgresql and
psycopg2
In-Reply-To:
Message-ID: <200904021918.n32JIuXc023154@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2734
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |INVALID
------- Comment #9 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-02 15:18 EST -------
As per comment 8, I'm going to assume Stephen had an old copy of Biopython on
his machine, which would explain the error. In the absence of any further
information there isn't anything we can do. Marking bug as invalid.
Stephen - if you do work out what was going on, or if you still have a problem
after sorting out any issue with multiple copies of Biopython installed, please
do reopen this report.
Thanks
Peter
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Thu Apr 2 22:29:18 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Thu, 2 Apr 2009 18:29:18 -0400
Subject: [Biopython-dev] [Bug 2807] New: Clustalw return codes
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2807
Summary: Clustalw return codes
Product: Biopython
Version: Not Applicable
Platform: PC
OS/Version: Linux
Status: NEW
Severity: enhancement
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: cymon.cox at gmail.com
see bug 2804
More on clustalw return codes:
Note return codes are the same whether using subprocess.returncode or
(os.popen().close() \3)
clustalw1.81 clustalw2.09
----------------- ------------------
error: Bad command line option in the command: clustalw_bogus
-INFILE=Fasta/f002
127 127
error: can't open sequence file: clustalw -INFILE=no_file_present
2 255
error: wrong format of input file: clustalw -INFILE=Phylip/hennigian.phy
3 255
error: only one sequence in input: clustalw -INFILE=Fasta/f001
4 0
=========================================================
Clustalw.__init__ tries to catch return codes 1, 2, 3, and 4, others get caught
generically.
I dont think it is possible to generate a return code 1 using 1.81 because
interface doesnt allow ad hoc options to be added to the command line. Invalid
values of options are just ignore by clustalw and it aligns the data anyway (ie
return code 0).
Return codes 127 and 255 could be caught for newer versions and a more
informative error returned. But given that there are 9 other clustalw versions
between 1.81 (June 2003) and the latest 2.0.10 (Oct 2008 the latest) for which
I havent checked the return codes, it might be better to just return a generic
command line error if the return value is > 0.
In the case where only one sequence is present, newer versions return code 0,
but throws a ValueError when trying to parse the non-existent output file (see
comment in test_Clustalw_tools.py).
C.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Fri Apr 3 09:50:44 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 3 Apr 2009 05:50:44 -0400
Subject: [Biopython-dev] [Bug 2807] Clustalw return codes
In-Reply-To:
Message-ID: <200904030950.n339oiIx019752@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2807
------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-03 05:50 EST -------
(In reply to comment #0)
> Clustalw.__init__ tries to catch return codes 1, 2, 3, and 4, others get
> caught generically.
With the CVS code, using clustalw1.81, is it definitely catching these errors
and raising specific IOErrors?
> I dont think it is possible to generate a return code 1 using 1.81 because
> interface doesnt allow ad hoc options to be added to the command line.
The Bio.Clustalw.do_alignment() function accepts any command line string, so
you should be able to feed it a clustalw command with invalid arguments.
> Invalid values of options are just ignore by clustalw and it aligns the
> data anyway (ie return code 0).
We'd have to look at the clustalw source code to confirm what should trigger an
return error code of 1.
> Return codes 127 and 255 could be caught for newer versions and a more
> informative error returned.
Yes, that sounds sensible.
> But given that there are 9 other clustalw versions
> between 1.81 (June 2003) and the latest 2.0.10 (Oct 2008 the latest) for which
> I havent checked the return codes, it might be better to just return a generic
> command line error if the return value is > 0.
That also sounds sensible.
> In the case where only one sequence is present, newer versions return code 0,
> but throws a ValueError when trying to parse the non-existent output file (see
> comment in test_Clustalw_tools.py).
Maybe we should report that as a bug, I think clustalw2.0 is intended to be API
compatible with clustalw1.x
Peter
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From thamelry at binf.ku.dk Fri Apr 3 13:31:05 2009
From: thamelry at binf.ku.dk (Thomas Hamelryck)
Date: Fri, 3 Apr 2009 15:31:05 +0200
Subject: [Biopython-dev] PDB tidy script
In-Reply-To: <320fb6e00903231405l479ddcc6of9cd0c1aa8fd98d4@mail.gmail.com>
References: <320fb6e00903231405l479ddcc6of9cd0c1aa8fd98d4@mail.gmail.com>
Message-ID: <2d7c25310904030631u56c642d6k83355d6bc4cd5d19@mail.gmail.com>
Hi everybody,
> I haven't been on this list long enough to know -- is Thomas still
> > supporting the PDB module?
Yes and no. First, I've been pretty busy with establishing a group here in
Copenhagen, but it looks like I will have time for Bio.PDB again in the
future. There's for example a set of classes dealing with RNA structure
coming up. Just have to submit it.
Second, I have no interest in doing anything beyond 3D stuff. I am not going
to implement header parsing for example. I know many people have donated
code, but in general this code is very messy and ad-hoc.
The PDB parser is pretty lean, fast and quite stable now - IMO parsing the
header should be the responsibility of a helper class, in order not to
overload the 3D code with a lot of stuff that most people will not use.
Also, the header info is for most purposes quite useless, especially in PDB
files. It makes no sense to parse the PDB header in fact - if you need
header info, use the MMCIF files.
> If so, would he give his blessing to some more
> > invasive changes to the PDB module, such as unifying PDBParser and
> > parse_pdb_header? That separation has always seemed curiously vestigal to
> > me.
You could provide a uniform interface, but please keep the 3D data
processing and the header processing in separate classes! The Structure
object has functionality to be 'annotated', so you could transfer data from
the header to the Structure object easily.
> If you look back over the history, there initially was no header parsing,
> it was a contribution from Kristian Rother, and I would agree, it is rather
> disjoint from the rest of the code. One thing I personally wanted last
> time I was working with PDB files was to have secondary structure
> information (for them alpha and beta sheet lines in the header)
> mapped onto the residue objects automatically.
This is a good example of why header parsing is something of a red herring.
You really want to recompute that using some decent program like DSSP or
PSEA, or even an internal Bio.PDB procedure. But it's fine of course if you
want to add this!
I would suggest you try and get Thomas involved now for his input
> on the design (before you start coding), but if need be press ahead
> anyway for your own use, and he can always comment on your
> public branch. I hope the two of you can work together on this, and
> if/when Thomas does stand down (or delagate), you could then be
> in an excellent position to take over as the Bio.PDB maintainer if
> that's what you wanted.
Sure, I'm open to this, but I'd like to stay involved if the 3D stuff is
altered, even just to discuss new designs.
Cheers,
-Thomas
From biopython at maubp.freeserve.co.uk Fri Apr 3 16:41:04 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 3 Apr 2009 17:41:04 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.50 (beta)
Message-ID: <320fb6e00904030941y5110cb95r8be1fb37b01b9b98@mail.gmail.com>
On Tue, Mar 31, 2009 at 10:38 PM, Peter wrote:
> Hi all,
>
> OK guys, after a brief chat off the mailing list, I'm hoping to do the
> Biopython 1.50 beta release roughly this weekend, somewhere between
> Friday 4 and Monday 6 April. ?Until then please consider CVS "frozen"
> for anything other that documentation changes or unit test additions,
> or at a push really tiny changes. ?Once I'm ready to actually do the
> release, I'll send out an email requesting no further CVS commits.
I'm going to try and do the release tonight (in the next few hours),
so please consider CVS frozen until further notice.
Thanks,
Peter
From biopython at maubp.freeserve.co.uk Fri Apr 3 18:07:58 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 3 Apr 2009 19:07:58 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.50 (beta)
In-Reply-To: <320fb6e00904030941y5110cb95r8be1fb37b01b9b98@mail.gmail.com>
References: <320fb6e00904030941y5110cb95r8be1fb37b01b9b98@mail.gmail.com>
Message-ID: <320fb6e00904031107q3df63df7q3569c22d4a521b7c@mail.gmail.com>
On Fri, Apr 3, 2009 at 5:41 PM, Peter wrote:
>
> I'm going to try and do the release tonight (in the next few hours),
> so please consider CVS frozen until further notice.
>
OK, its done - uploaded, and tagged in CVS. Could you all give it a
quick test now that would be great, especially the Windows installers
if possible as I currently only have ready access to the one Windows
machine which is where the installers were built.
I'll prepare the news entry and email announcement later on tonight,
based on the current NEWS file. If there is anything missing which
should be mentioned, please email me ASAP.
I'm happy for CVS to be used again to check in documentation changes,
but no code changes yet please.
Thanks
Peter
From tiagoantao at gmail.com Sat Apr 4 16:43:10 2009
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Sat, 4 Apr 2009 17:43:10 +0100
Subject: [Biopython-dev] Merging branches
Message-ID: <6d941f120904040943p60f4fb33m38826b2f72eedb97@mail.gmail.com>
Hi,
This might be a lame question but I am completely stuck and don't seem
to understand why.
I am trying to PARTIALLY merge 2 branches: my popgen branch with Giovanni's.
I want to import his changes to Bio/PopGen/Stats , but only that
(nothing on other Bio directories, and, above all not a new test).
This changes are not conflictual, so I have no warning and everything
gets in: If I do a git-merge I get the whole bang.
Is there any way to just get partial merge? In this case I only want
to merge a single sub dir (although, in general one might just want to
import a single file)
Of course I could do 2 checkouts and copy files across, on the local
filesystem, but is that not loosing the history of connections between
the files?
Many thanks,
Tiago
--
"A man who dares to waste one hour of time has not discovered the
value of life" - Charles Darwin
From biopython at maubp.freeserve.co.uk Sat Apr 4 17:01:53 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 4 Apr 2009 18:01:53 +0100
Subject: [Biopython-dev] Merging branches
In-Reply-To: <6d941f120904040943p60f4fb33m38826b2f72eedb97@mail.gmail.com>
References: <6d941f120904040943p60f4fb33m38826b2f72eedb97@mail.gmail.com>
Message-ID: <320fb6e00904041001q4c5b0067i7ca9eaf6a6e52b43@mail.gmail.com>
2009/4/4 Tiago Ant?o:
> Is there any way to just get partial merge? In this case I only want
> to merge a single sub dir (although, in general one might just want
> to import a single file)
Can you cherry pick the changes you want? Github's fork queue
provides another approach to the same issue. However, these both work
on patches (individual commits) rather than files/directories.
Peter
From tiagoantao at gmail.com Sat Apr 4 17:29:20 2009
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Sat, 4 Apr 2009 18:29:20 +0100
Subject: [Biopython-dev] Merging branches
In-Reply-To: <320fb6e00904041001q4c5b0067i7ca9eaf6a6e52b43@mail.gmail.com>
References: <6d941f120904040943p60f4fb33m38826b2f72eedb97@mail.gmail.com>
<320fb6e00904041001q4c5b0067i7ca9eaf6a6e52b43@mail.gmail.com>
Message-ID: <6d941f120904041029s46b6cb02x30c68885acbfb47f@mail.gmail.com>
Me thinks I need to get a book on git and understand, once and for
all, the basic concepts. I am getting merge conflicts with cherry
picking and I don't even understand why
Anyway it would be nice (but not fundamental) to merge just a single file.
2009/4/4 Peter :
> 2009/4/4 Tiago Ant?o:
>> Is there any way to just get partial merge? In this case I only want
>> to merge a single sub dir (although, in general one might just want
>> to import a single file)
>
> Can you cherry pick the changes you want? ?Github's fork queue
> provides another approach to the same issue. ?However, these both work
> on patches (individual commits) ?rather than files/directories.
>
> Peter
>
--
"A man who dares to waste one hour of time has not discovered the
value of life" - Charles Darwin
From biopython at maubp.freeserve.co.uk Sat Apr 4 19:06:57 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 4 Apr 2009 20:06:57 +0100
Subject: [Biopython-dev] CVS freeze for Biopython 1.50 (beta)
In-Reply-To: <320fb6e00904031107q3df63df7q3569c22d4a521b7c@mail.gmail.com>
References: <320fb6e00904030941y5110cb95r8be1fb37b01b9b98@mail.gmail.com>
<320fb6e00904031107q3df63df7q3569c22d4a521b7c@mail.gmail.com>
Message-ID: <320fb6e00904041206yb0e4a29ja715a54faeeca28e@mail.gmail.com>
On Fri, Apr 3, 2009 at 7:07 PM, Peter wrote:
> I'm happy for CVS to be used again to check in documentation changes,
> but no code changes yet please.
Also I should have said before, those with CVS access, please feel
free to add more unit tests. I've started work on one using the
EMBOSS tools, to check both the command line wrappers in Bio.Emboss
but also our parsers.
I'm repeating myself but if you have some new code you'd like to check
in, while CVS is "frozen" for the release process, this is a nice
chance to try playing with git and github ;)
Peter
From bartek at rezolwenta.eu.org Sun Apr 5 09:49:14 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Sun, 5 Apr 2009 11:49:14 +0200
Subject: [Biopython-dev] Merging branches
In-Reply-To: <6d941f120904041029s46b6cb02x30c68885acbfb47f@mail.gmail.com>
References: <6d941f120904040943p60f4fb33m38826b2f72eedb97@mail.gmail.com>
<320fb6e00904041001q4c5b0067i7ca9eaf6a6e52b43@mail.gmail.com>
<6d941f120904041029s46b6cb02x30c68885acbfb47f@mail.gmail.com>
Message-ID: <8b34ec180904050249n156185abx4538e51876280266@mail.gmail.com>
Hi Tiago,
2009/4/4 Tiago Ant?o :
> Me thinks I need to get a book on git and understand, once and for
> all, the basic concepts. I am getting merge conflicts with cherry
> picking and I don't even understand why
>
If you could be a bit more specific (providing the files and revision numbers
would be great), than it would be easier to help. I know it is an extra work,
but we need some info, also to improve our wiki documents.
> Anyway it would be nice (but not fundamental) to merge just a single file.
>
This is one of the fundamentalo changes between CVS and git. CVS uses
files as the atomic piece of data, while git works with changesets (commits).
This means, that if you only need a part of what was committed as a
big changeset,
you will need to put an extra effort into selecting what you need.
>> 2009/4/4 Tiago Ant?o:
>>> Is there any way to just get partial merge? In this case I only want
>>> to merge a single sub dir (although, in general one might just want
>>> to import a single file)
Looking at specific files is not the default way things work in git.
The idea is that if
someone makes a single commit, it is an atomic contribution that is
either to be
accepted or not. You can of course create a diff file and then split
it into specific files.
I'll look into possible easier ways of doing it.
cheers
Bartek
From eric.talevich at gmail.com Sun Apr 5 16:47:39 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Sun, 5 Apr 2009 12:47:39 -0400
Subject: [Biopython-dev] Merging branches
In-Reply-To: <8b34ec180904050249n156185abx4538e51876280266@mail.gmail.com>
References: <6d941f120904040943p60f4fb33m38826b2f72eedb97@mail.gmail.com>
<320fb6e00904041001q4c5b0067i7ca9eaf6a6e52b43@mail.gmail.com>
<6d941f120904041029s46b6cb02x30c68885acbfb47f@mail.gmail.com>
<8b34ec180904050249n156185abx4538e51876280266@mail.gmail.com>
Message-ID: <3f6baf360904050947m5d9ec75eh18d64c53b8d9e2a6@mail.gmail.com>
2009/4/5 Bartek Wilczynski
> Hi Tiago,
>
> >> 2009/4/4 Tiago Ant?o:
> >>> Is there any way to just get partial merge? In this case I only want
> >>> to merge a single sub dir (although, in general one might just want
> >>> to import a single file)
>
> Looking at specific files is not the default way things work in git.
> The idea is that if
> someone makes a single commit, it is an atomic contribution that is
> either to be
> accepted or not. You can of course create a diff file and then split
> it into specific files.
> I'll look into possible easier ways of doing it.
>
> cheers
> Bartek
>
You can get a list of the changes that affected a single subdirectory by
giving the directory name to git log, e.g. "git log Bio/PopGen/Stats/".
Those commits don't necessarily just affect Bio/PopGen/Stats, but assuming
there aren't any single-commit code bombs, then it's probably a good idea to
take those associated modifications anyway. You can also give a range of
versions to git-log to get the commits that occurred since Gio's branch
diverged from yours -- it looks something like "git log [path] HEAD..[gio's
branch]", details are in the help page for git-rev-parse. Then you can use
that list of commits for cherry-picking, in the original order.
If it's essential to get just a specific file at a specific version, you can
find the SHA1 hash for that blob (probably easiest through github) and use
git-show with a redirect to the file in your tree, or a temporary filename.
This loses the history, though.
Cheers,
Eric
From tiagoantao at gmail.com Mon Apr 6 10:35:47 2009
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Mon, 6 Apr 2009 11:35:47 +0100
Subject: [Biopython-dev] Merging branches
In-Reply-To: <8b34ec180904050249n156185abx4538e51876280266@mail.gmail.com>
References: <6d941f120904040943p60f4fb33m38826b2f72eedb97@mail.gmail.com>
<320fb6e00904041001q4c5b0067i7ca9eaf6a6e52b43@mail.gmail.com>
<6d941f120904041029s46b6cb02x30c68885acbfb47f@mail.gmail.com>
<8b34ec180904050249n156185abx4538e51876280266@mail.gmail.com>
Message-ID: <6d941f120904060335m380ac820k97a558e8332fdf63@mail.gmail.com>
Hi,
2009/4/5 Bartek Wilczynski :
> If you could be a bit more specific (providing the files and revision numbers
> would be great), than it would be easier to help. I know it is an extra work,
> but we need some info, also to improve our wiki documents.
>
I would like to replace this:
http://github.com/tiagoantao/biopython-popgen-test/blob/fa5ebc23e7aaabce94ae594d9a4f83be9bf90215/Bio/PopGen/Stats/Simple.py
With this:
http://github.com/dalloliogm/biopython/blob/cbaf6249cb91ed505cb575f09c2eaef3809872b9/Bio/PopGen/Stats/Simple.py
It would be cool not to loose the history relationship (I suppose that
would be the good practice).
> This means, that if you only need a part of what was committed as a
> big changeset,
> you will need to put an extra effort into selecting what you need.
But how do you do that (other than manually copying files)? Cherry
pick seems to be commit based...
> Looking at specific files is not the default way things work in git.
> The idea is that if
> someone makes a single commit, it is an atomic contribution that is
> either to be
> accepted or not. You can of course create a diff file and then split
> it into specific files.
> I'll look into possible easier ways of doing it.
The point is: wanting to use part of a commit without loosing history.
In my case, I dont want to import a test_PopGen_Fst file that Gio has.
That being said, I dont think this is a big deal. I was just to
preserve the history connectivity between repositiories. I think we
can just use the old fashioned method of copying some files around.
But it would be good to know if there is a "best practice" (which, I
could not find out)
Tiago
PS - I might have to go under surgery this week, if I stop responding
for a long time, my apologies in advance but I am probably recovering.
From biopython at maubp.freeserve.co.uk Mon Apr 6 13:25:29 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 6 Apr 2009 14:25:29 +0100
Subject: [Biopython-dev] Bio.GFF and Brad's code
Message-ID: <320fb6e00904060625v4a49da2au76159eae18f707eb@mail.gmail.com>
Brad has been working on his GFF parsing code - see progress reports
on his blog http://bcbio.wordpress.com/ and his code on github,
http://github.com/chapmanb/bcbb/tree/master/gff
Potentially this could make it into Biopython 1.51, and I was just
thinking about where the code would go. Brad is supporting both GFF3
and the loosely defined GFF2 variants, so Bio.GFF seems a good place.
There would also be a wrapper under Bio.SeqIO for loading GFF files as
SeqRecord objects (I haven't played with Brad's code, but it can do
this already).
However, we already have a Bio.GFF module from Michael Hoffman created
back in 2002 which accesses MySQL General Feature Format (GFF)
databases created with BioPerl. Perhaps we should poll the main
discussion list now, and if there are no responses from people using
it, we could deprecate Bio.GFF for Biopython 1.50? Under our current
deprecation policy we shouldn't then remove Bio.GFF until Biopython
1.52 at the earliest, http://biopython.org/wiki/Deprecation_policy
What do you think Brad? How about using Bio.GFF3 instead?
Peter
From chapmanb at 50mail.com Mon Apr 6 22:08:26 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 6 Apr 2009 18:08:26 -0400
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <320fb6e00904060625v4a49da2au76159eae18f707eb@mail.gmail.com>
References: <320fb6e00904060625v4a49da2au76159eae18f707eb@mail.gmail.com>
Message-ID: <20090406220826.GH43636@sobchak.mgh.harvard.edu>
Peter;
Thanks for the plug. GFF parsing is moving along; the main feature
two things I would like to finish before proposing it for inclusion
are writing of GFF files and putting GFF into BioSQL with the nested
features. The code does work for parsing, and I've been using it for
some real projects; anyone who would like to test it is more than
welcome.
As far as the current Bio.GFF, that is a bit of a conundrum. The
current code does work and for some cases it would be nice of having
the utility of working with GFF from a database. Eventually BioSQL
from GFF may supplant that, but that should be finished and tested
first. I would argue for keeping it in.
However, it is a bit confusing if someone is looking for a parser. It
would make more sense if it lived under a namespace like Bio.GFF.DB.
What do you think about adding a warning that it is going to move to
a new namespace and then moving it there, if we don't hear any
complaints, for 1.51? This is less cumbersome than a removal for
users since it's just an import change.
Brad
> Brad has been working on his GFF parsing code - see progress reports
> on his blog http://bcbio.wordpress.com/ and his code on github,
> http://github.com/chapmanb/bcbb/tree/master/gff
>
> Potentially this could make it into Biopython 1.51, and I was just
> thinking about where the code would go. Brad is supporting both GFF3
> and the loosely defined GFF2 variants, so Bio.GFF seems a good place.
> There would also be a wrapper under Bio.SeqIO for loading GFF files as
> SeqRecord objects (I haven't played with Brad's code, but it can do
> this already).
>
> However, we already have a Bio.GFF module from Michael Hoffman created
> back in 2002 which accesses MySQL General Feature Format (GFF)
> databases created with BioPerl. Perhaps we should poll the main
> discussion list now, and if there are no responses from people using
> it, we could deprecate Bio.GFF for Biopython 1.50? Under our current
> deprecation policy we shouldn't then remove Bio.GFF until Biopython
> 1.52 at the earliest, http://biopython.org/wiki/Deprecation_policy
>
> What do you think Brad? How about using Bio.GFF3 instead?
>
> Peter
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
From mjldehoon at yahoo.com Tue Apr 7 11:32:52 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 7 Apr 2009 04:32:52 -0700 (PDT)
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <20090406220826.GH43636@sobchak.mgh.harvard.edu>
Message-ID: <316000.69837.qm@web62407.mail.re1.yahoo.com>
Hi Brad,
Thanks for your work on the GFF parser; I'm dealing with GFF files quite a lot. Could you maybe give a simple example of how to use your GFF parser, once it's included into Biopython?
--Michiel.
--- On Mon, 4/6/09, Brad Chapman wrote:
> From: Brad Chapman
> Subject: Re: [Biopython-dev] Bio.GFF and Brad's code
> To: biopython-dev at lists.open-bio.org
> Date: Monday, April 6, 2009, 6:08 PM
> Peter;
> Thanks for the plug. GFF parsing is moving along; the main
> feature
> two things I would like to finish before proposing it for
> inclusion
> are writing of GFF files and putting GFF into BioSQL with
> the nested
> features. The code does work for parsing, and I've been
> using it for
> some real projects; anyone who would like to test it is
> more than
> welcome.
>
> As far as the current Bio.GFF, that is a bit of a
> conundrum. The
> current code does work and for some cases it would be nice
> of having
> the utility of working with GFF from a database. Eventually
> BioSQL
> from GFF may supplant that, but that should be finished and
> tested
> first. I would argue for keeping it in.
>
> However, it is a bit confusing if someone is looking for a
> parser. It
> would make more sense if it lived under a namespace like
> Bio.GFF.DB.
> What do you think about adding a warning that it is going
> to move to
> a new namespace and then moving it there, if we don't
> hear any
> complaints, for 1.51? This is less cumbersome than a
> removal for
> users since it's just an import change.
>
> Brad
>
>
>
> > Brad has been working on his GFF parsing code - see
> progress reports
> > on his blog http://bcbio.wordpress.com/ and his code
> on github,
> > http://github.com/chapmanb/bcbb/tree/master/gff
> >
> > Potentially this could make it into Biopython 1.51,
> and I was just
> > thinking about where the code would go. Brad is
> supporting both GFF3
> > and the loosely defined GFF2 variants, so Bio.GFF
> seems a good place.
> > There would also be a wrapper under Bio.SeqIO for
> loading GFF files as
> > SeqRecord objects (I haven't played with
> Brad's code, but it can do
> > this already).
> >
> > However, we already have a Bio.GFF module from Michael
> Hoffman created
> > back in 2002 which accesses MySQL General Feature
> Format (GFF)
> > databases created with BioPerl. Perhaps we should
> poll the main
> > discussion list now, and if there are no responses
> from people using
> > it, we could deprecate Bio.GFF for Biopython 1.50?
> Under our current
> > deprecation policy we shouldn't then remove
> Bio.GFF until Biopython
> > 1.52 at the earliest,
> http://biopython.org/wiki/Deprecation_policy
> >
> > What do you think Brad? How about using Bio.GFF3
> instead?
> >
> > Peter
> > _______________________________________________
> > Biopython-dev mailing list
> > Biopython-dev at lists.open-bio.org
> >
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
From bartek at rezolwenta.eu.org Tue Apr 7 12:35:21 2009
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Tue, 7 Apr 2009 14:35:21 +0200
Subject: [Biopython-dev] Merging branches
In-Reply-To: <6d941f120904060335m380ac820k97a558e8332fdf63@mail.gmail.com>
References: <6d941f120904040943p60f4fb33m38826b2f72eedb97@mail.gmail.com>
<320fb6e00904041001q4c5b0067i7ca9eaf6a6e52b43@mail.gmail.com>
<6d941f120904041029s46b6cb02x30c68885acbfb47f@mail.gmail.com>
<8b34ec180904050249n156185abx4538e51876280266@mail.gmail.com>
<6d941f120904060335m380ac820k97a558e8332fdf63@mail.gmail.com>
Message-ID: <8b34ec180904070535r3a6f23e8w9b917f7592930eda@mail.gmail.com>
Hi,
2009/4/6 Tiago Ant?o :
>> This means, that if you only need a part of what was committed as a
>> big changeset,
>> you will need to put an extra effort into selecting what you need.
>
> But how do you do that (other than manually copying files)?
I think that in this case you need to do this manually.
If you care only about one file, copying it is the easiest option.
> Cherry pick seems to be commit based...
In fact the whole git is commit based. It's not tracking files as
such, but blobs of data.
>I would like to replace this:
>http://github.com/tiagoantao/biopython-popgen-test/blob/fa5ebc23e7aaabce94ae594d9a4f83be9bf90215/Bio/PopGen/Stats/Simple.py
>With this:
>http://github.com/dalloliogm/biopython/blob/cbaf6249cb91ed505cb575f09c2eaef3809872b9/Bio/PopGen/Stats/Simple.py
>It would be cool not to loose the history relationship (I suppose that
>would be the good practice).
Indeed, keeping history is the right thing and it was one of the
reasons to switch to git.
It would be perfect if Giovanni could "redo" some of his commits and
split them into
smaller operations, so that cherry picking commits would be possible.
I know it's a pain...
> The point is: wanting to use part of a commit without loosing history.
> In my case, I dont want to import a test_PopGen_Fst file that Gio has.
> That being said, I dont think this is a big deal. I was just to
> preserve the history connectivity between repositiories. I think we
> can just use the old fashioned method of copying some files around.
> But it would be good to know if there is a "best practice" (which, I
> could not find out)
As far as I can tell, there is no way you could take only a part of a
commit. The best practice is to make smaller, atomic commits. It has
many advantages:
-it's easier to document a smaller change (I think it makes up for
potentially more work because of more commits)
-you can then "undo" small locally committed changes before pushing
them to public repo
-cherry picking of nicely documented small changes is an easy job
In this particular case of changes in tests, I think really changes to
one test should be committed separately from changes in other tests.
cheers
Bartek
From tiagoantao at gmail.com Tue Apr 7 16:43:49 2009
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Tue, 7 Apr 2009 17:43:49 +0100
Subject: [Biopython-dev] PopGen Stats
Message-ID: <6d941f120904070943n7de7afa7m262dd4f4c0149cb@mail.gmail.com>
Hi,
I've started a page documenting the effort to implement statstics here
http://biopython.org/wiki/PopGen_dev_Statistics
anyone is welcomed to participate.
I was expecting to have a personal hurdle during this week, which
didn't happen. So I expect to be working heavily on this (finally).
Tiago
--
"A man who dares to waste one hour of time has not discovered the
value of life" - Charles Darwin
From peter at maubp.freeserve.co.uk Tue Apr 7 19:38:50 2009
From: peter at maubp.freeserve.co.uk (Peter)
Date: Tue, 7 Apr 2009 20:38:50 +0100
Subject: [Biopython-dev] Fwd: [blast-announce] BLAST 2.2.20 now available
In-Reply-To: <9E9BBBAC-1E4D-4C2E-AD8E-7F6DF6F317A5@ncbi.nlm.nih.gov>
References: <9E9BBBAC-1E4D-4C2E-AD8E-7F6DF6F317A5@ncbi.nlm.nih.gov>
Message-ID: <320fb6e00904071238v7ed94541me4af7937fa7ae27b@mail.gmail.com>
Hi all,
There is a new version of BLAST out - we'll need to check if the
NCBI's online server has been updated (if so, our unit test
test_NCBI_qblast.py should catch any obvious issues).
We'll also want to check the standalone version of BLAST is OK.
Point (2) below sounds interesting, previously using BLAST databases
with spaces in the path on Windows was rather hairy.
Peter
---------- Forwarded message ----------
From: mcginnis
Date: Apr 7, 2009 1:50 PM
Subject: [blast-announce] BLAST 2.2.20 now available
To: blast-announce at ncbi.nlm.nih.gov
New BLAST binaries are available on the NCBI FTP site
(ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/)
The list of changes are:
1.) Ungapped blastn searches allow arbitrary reward/penalty scores.
2.) Spaces are allowed in database pathnames on windows
3.) Seedtop now has gilist support.
4.) Fix a bug that caused the number and order of queries to affect
blastx results.
5.) Modified the 2-hit blastn algorithm so that no overlap is allowed
between hits.
From jacobporter2002 at yahoo.com Wed Apr 8 02:27:21 2009
From: jacobporter2002 at yahoo.com (Jacob Porter)
Date: Tue, 7 Apr 2009 19:27:21 -0700 (PDT)
Subject: [Biopython-dev] Phylogeny modules for BioPython
Message-ID: <296822.1198.qm@web33706.mail.mud.yahoo.com>
Hi all,
My name is Jacob Porter, and I am a graduate student in the math department at UC Davis.? I've done work before on phylogeny inference using so-called "phylogenetic invariants" that can be found at the website: http://www.shsu.edu/~ldg005/small-trees/
It appears to me that BioPython doesn't have much support for phylogeny inference and tools related to phylogeny inference.
I have applied to the Google Summer of Code (12 weeks of working part-time on a programming assignment), and I am looking for a project that could work with BioPython as I see a lot of potential in it.? I can bring my expertise on phylogeny inference to this project to add some support for this.
I need three things from the community ASAP:
1) Ideas as to which of my several project ideas are the most useful to the BioPython community
2) Information as to what is already included in BioPython concerning phylogeny inference and related tools
3) A mentor that will help me with the project (and possibly work in conjunction with Nascent (https://www.nescent.org/wg_phyloinformatics/Main_Pagementors)? I?would need a 12 -week schedule of tasks for the project (TBD), and answers to questions related to developing for BioPython.? (I've worked with Python a lot before, so I shouldn't need much help with Python so much as I need help with BioPython).
Project?1:
Add support for popular phylogeny representation standards such as DND files.? Give the ability to read and write such files.? Convert between such files.? I need help in picking which standards to use and need help in picking which operations on these files is the most useful.
Project?2:
Add wrappers for modern (hopefully high throughput and accurate) phylogeny inference software written in C++/C.? Examples of such software include neighbor-joining, MJOIN software (similar to neighbor-joining) (http://bio.math.berkeley.edu/mjoin/), Garli (http://www.molecularevolution.org/si/software/garli/), treeSVD (http://www.stat.uchicago.edu/~eriksson/software.html), and maximum parsimony.? I would like to know which sort of phylogeny inference software is the most useful in your opinion.? I assume no wrappers for such software exist.
Project?3:
Add analytic algorithms that use phylogeny in some way.? Examples include bootstrapping and protein-protein interaction inference algorithms.? (i.e. "Inferring protein interactions from phylogenetic distance matrices" by Gertz et al.)? I need information as to what sort of algorithms would be useful.
Project 4:
Enhance phylogeny inference software further.? MJOIN has bugs (I think it returns negative distances in some cases, and some modifications to it that I developed using phylogenetic invariants are seg-faulting).
Not all of these ideas will probably be able to be developed, so I need information as to what might be the most useful.? I was thinking of focusing on Project 1 and Project 2 for the initial phase.
Any information will be appreciated, and any mentorship will be great.? I would like a response quickly, so that I can inform Nascent of my plans.
Thanks,
Jacob Porter
UC Davis
From p.j.a.cock at googlemail.com Wed Apr 8 08:54:35 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 8 Apr 2009 09:54:35 +0100
Subject: [Biopython-dev] Phylogeny modules for BioPython
In-Reply-To: <296822.1198.qm@web33706.mail.mud.yahoo.com>
References: <296822.1198.qm@web33706.mail.mud.yahoo.com>
Message-ID: <320fb6e00904080154j433ef69cmd99b847240ee5b39@mail.gmail.com>
On 4/8/09, Jacob Porter wrote:
>
> Hi all,
>
> My name is Jacob Porter, and I am a graduate student in the math
> department at UC Davis. I've done work before on phylogeny inference
> ...
> It appears to me that BioPython doesn't have much support for
> phylogeny inference and tools related to phylogeny inference.
I'm sure there is room for improvement.
> I have applied to the Google Summer of Code (12 weeks of
> working part-time on a programming assignment), and I am
> looking for a project that could work with BioPython as I see
> a lot of potential in it. I can bring my expertise on phylogeny
> inference to this project to add some support for this.
>
> I need three things from the community ASAP:
>
> 1) Ideas as to which of my several project ideas are the
> most useful to the BioPython community
Personally, I might pick command line wrappers for existing command
line tools. However, these don't actually make anything new possible,
as writting your own command line is already fairly easy. This in
itself wouldn't be that much work either.
> 2) Information as to what is already included in BioPython
> concerning phylogeny inference and related tools
Look at Bio.Nexus, plus somewhat related, Bio.AlignIO.
> 3) A mentor that will help me with the project (and
> possibly work in conjunction with Nascent
> (https://www.nescent.org/wg_phyloinformatics/Main_Pagementors)
> I would need a 12 -week schedule of tasks for the
> project (TBD), and answers to questions related to
> developing for BioPython. (I've worked with Python
> a lot before, so I shouldn't need much help with
> Python so much as I need help with BioPython).
Brad Chapman may be willing to mentor a GSoC student, have a look back
of the recent email discussions here. In particular, Nick Matzke has
already expressed some interest in Biogeographical and community
phylogenetics for Biopython (there is a wiki page on open-bio.org on
this).
> Project 1:
> Add support for popular phylogeny representation
> standards such as DND files. Give the ability to
> read and write such files. Convert between such
> files. I need help in picking which standards to use
> and need help in picking which operations on these
> files is the most useful.
We have this already in Bio.Nexus, but there is still room for
improvement - see Bug 2788 for example.
> Project 2:
> Add wrappers for modern (hopefully high throughput
> and accurate) phylogeny inference software written in
> C++/C. Examples of such software include
> neighbor-joining, MJOIN software (similar to
> neighbor-joining) (http://bio.math.berkeley.edu/mjoin/),
> Garli (http://www.molecularevolution.org/si/software/garli/),
> treeSVD (http://www.stat.uchicago.edu/~eriksson/software.html),
> and maximum parsimony. I would like to know which
> sort of phylogeny inference software is the most useful
> in your opinion. I assume no wrappers for such software
> exist.
Well, Bio.Nexus is a great help with certain tools. There is scope
for adding more command line wrappers though (I like quick-join and
and also quicktree for NJ tree building).
> Project 3:
> Add analytic algorithms that use phylogeny in some
> way. Examples include bootstrapping and protein-protein
> interaction inference algorithms. (i.e. "Inferring protein
> interactions from phylogenetic distance matrices" by
> Gertz et al.) I need information as to what sort of
> algorithms would be useful.
I feel that this is still very much an active area of research, and
there are no clear gold standards. However, perhaps some published
algorithms may be worth re-implementing in Biopython. I would still
tend to favour more general work for Biopython that would support
people implementing any/their own algorithm.
> Project 4:
> Enhance phylogeny inference software further.
> MJOIN has bugs (I think it returns negative distances
> in some cases, and some modifications to it that I
> developed using phylogenetic invariants are seg-faulting).
Fixing any bug in MJOIN sounds like a good idea - but doesn't really
affect Biopython directly.
> Not all of these ideas will probably be able to be
> developed, so I need information as to what might
> be the most useful. I was thinking of focusing on
> Project 1 and Project 2 for the initial phase.
>
> Any information will be appreciated, and any
> mentorship will be great. I would like a response
> quickly, so that I can inform Nascent of my plans.
Peter.
P.S. Its Biopython, not BioPython
From chapmanb at 50mail.com Wed Apr 8 12:32:26 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 8 Apr 2009 08:32:26 -0400
Subject: [Biopython-dev] Phylogeny modules for BioPython
In-Reply-To: <320fb6e00904080154j433ef69cmd99b847240ee5b39@mail.gmail.com>
References: <296822.1198.qm@web33706.mail.mud.yahoo.com>
<320fb6e00904080154j433ef69cmd99b847240ee5b39@mail.gmail.com>
Message-ID: <20090408123226.GL43636@sobchak.mgh.harvard.edu>
Jacob;
Thanks much for your interest in Biopython for Summer of Code; glad
to see a discussion here about your proposal.
Peter's comments are great; I will add to them from the SoC
perspective.
> > I have applied to the Google Summer of Code (12 weeks of
> > working part-time on a programming assignment)
SoC is a full time commitment for the summer. Your proposal also
lists some conflicts (classes, other research) for the summer
months. On your updated proposal you should be explicit about these
and describe how you plan to make up time you miss during the first
two weeks of the quarter.
More generally, your proposal needs a detailed plan of deliverables
on a week to week basis over the project timeline, starting with
coding on May 23rd:
http://socghop.appspot.com/document/show/program/google/gsoc2009/timeline
This is the last hour for refining proposals, so you will need to
update your proposal quickly for us to still have time to consider
it. I would recommend copying your current proposal to a Google Doc,
adding all of the specifics needed, and then submitting a link to
the open document as a comment to your initial proposal.
> Brad Chapman may be willing to mentor a GSoC student, have a look back
> of the recent email discussions here. In particular, Nick Matzke has
> already expressed some interest in Biogeographical and community
> phylogenetics for Biopython (there is a wiki page on open-bio.org on
> this).
I am definitely willing to help; spots will be very competitive
throughout the program.
Echoing Peter's comments, I would put together a project proposal
that tackles:
- Improving parsing support in Bio.Nexus, based on existing code and
bug reports, and other suggestions you might have.
- Providing code wrapping for other phylogeny software. Since the
usefulness of different algorithms depends heavily on the context
in which it is used, you will not find a consensus about which
program is most useful. My suggestion is to suggest wrappers for
several useful programs covering the spectrum of possibilities.
In additions to the ones you listed, a couple others are:
RAxML http://icwww.epfl.ch/~stamatak/index-Dateien/Page443.htm
FastTree http://www.microbesonline.org/fasttree/index.html
- A higher level API over the parsing and command line program support
that helps users with specific phylogenetic tasks. Based on your
experience and input from the Biopython community of users, this
would have the goal of providing a simple way to do common tasks.
This should be a combination of code to surround repetitive items,
and cookbook style documentation to help people with specific
phylogenetic problems.
Other general suggestions:
- Tests. Please describe your plans to write unit tests for all the
code your write.
- Documentation. Please do leave time in your project plan to fully
document using your proposed code.
- Projects 3 and 4, as Peter suggests, are out of the scope of GSoC.
3, specifically, is more of a research project.
Finally, a few meta-items from your e-mail meant as helpful advice:
> It appears to me that BioPython doesn't have much support for
> phylogeny inference and tools related to phylogeny inference.
I understand this is an attempt to provide motivation for your
proposal, but you should do so in a way that does not disparage the
work of the people you are soliciting advice from. Your request
would be better received if you described it in the context of
improving existing phylogenetic support in Biopython.
> I need three things from the community ASAP:
[...]
> I would like a response quickly
No one likes to be told what to do, much less a group your are
requesting help and hopefully a job from. Again, you should think
about how your phrasing will be interpreted by those reading it.
> Nascent
You twice misspelled this: NESCent. Mistakes happen, but it reflects
badly on your commitment to the project to not be able to spell the
name of the organization you would like to work with. These are the
small things you should be careful and double check.
Thanks again for your interest and looking forward to seeing your
revised project plan,
Brad
From chapmanb at 50mail.com Wed Apr 8 12:49:08 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 8 Apr 2009 08:49:08 -0400
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <316000.69837.qm@web62407.mail.re1.yahoo.com>
References: <20090406220826.GH43636@sobchak.mgh.harvard.edu>
<316000.69837.qm@web62407.mail.re1.yahoo.com>
Message-ID: <20090408124908.GN43636@sobchak.mgh.harvard.edu>
Hi Michiel;
> Thanks for your work on the GFF parser; I'm dealing with GFF files
> quite a lot. Could you maybe give a simple example of how to use your
> GFF parser, once it's included into Biopython?
Awesome; I'm glad it will be useful. I'd definitely welcome any
feedback you have on the API or implementation. At this stage we can
be flexible and hopefully get it finalized before it hits Biopython.
I will get some user documentation together soon, but here is some
basic usage.
To parse an entire GFF file, getting all features at once:
from BCBio.GFF.GFFParser import GFFAddingIterator
gff_iterator = GFFAddingIterator()
rec_dict = gff_iterator.get_all_features(gff_file)
The returned dictionary is like a dictionary from SeqIO.to_dict;
keys are ids and values are SeqRecords.
You can also seed the parser with an initial dictionary containing
sequences or other features, and the features from the GFF file will
be added to those records:
with open(seq_file) as seq_handle:
seq_dict = SeqIO.to_dict(SeqIO.parse(seq_handle, "fasta"))
gff_iterator = GFFAddingIterator(seq_dict)
If a file is very large, you have two ways of limiting the size of
items parsed. The first is to specify which items you are interested
and return only those. This code will parse out coding transcripts
on chromosome I:
cds_limit_info = dict(
gff_source_type = [('Coding_transcript', 'gene'),
('Coding_transcript', 'mRNA'),
('Coding_transcript', 'CDS')],
gff_id = ['I']
)
rec_dict = gff_iterator.get_all_features(gff_file, limit_info=cds_limit_info)
The second is to use an iterator over a section of the file:
for rec_dict in gff_iterator.get_features(gff_file, target_lines=1000000):
# handle partial rec dictionary of first 1000000 lines
Finally, there is an interface to examine a GFF file and figure out
useful ways to limit it. This will give you a dictionary of all
possible ways to limit a file along with the counts in each:
gff_examiner = GFFExaminer()
possible_limits = gff_examiner.available_limits(gff_file)
and this will give a dictionary of the parent-child relationships in
the file:
gff_examiner = GFFExaminer()
pc_map = gff_examiner.parent_child_map(gff_file)
Since GFF providers tend to differ in how they structure their
information, this helps get a quick overview of the file to
determine how to manage it.
Happy to hear about thoughts you might have. Thanks,
Brad
>
> --Michiel.
>
>
> --- On Mon, 4/6/09, Brad Chapman wrote:
>
> > From: Brad Chapman
> > Subject: Re: [Biopython-dev] Bio.GFF and Brad's code
> > To: biopython-dev at lists.open-bio.org
> > Date: Monday, April 6, 2009, 6:08 PM
> > Peter;
> > Thanks for the plug. GFF parsing is moving along; the main
> > feature
> > two things I would like to finish before proposing it for
> > inclusion
> > are writing of GFF files and putting GFF into BioSQL with
> > the nested
> > features. The code does work for parsing, and I've been
> > using it for
> > some real projects; anyone who would like to test it is
> > more than
> > welcome.
> >
> > As far as the current Bio.GFF, that is a bit of a
> > conundrum. The
> > current code does work and for some cases it would be nice
> > of having
> > the utility of working with GFF from a database. Eventually
> > BioSQL
> > from GFF may supplant that, but that should be finished and
> > tested
> > first. I would argue for keeping it in.
> >
> > However, it is a bit confusing if someone is looking for a
> > parser. It
> > would make more sense if it lived under a namespace like
> > Bio.GFF.DB.
> > What do you think about adding a warning that it is going
> > to move to
> > a new namespace and then moving it there, if we don't
> > hear any
> > complaints, for 1.51? This is less cumbersome than a
> > removal for
> > users since it's just an import change.
> >
> > Brad
> >
> >
> >
> > > Brad has been working on his GFF parsing code - see
> > progress reports
> > > on his blog http://bcbio.wordpress.com/ and his code
> > on github,
> > > http://github.com/chapmanb/bcbb/tree/master/gff
> > >
> > > Potentially this could make it into Biopython 1.51,
> > and I was just
> > > thinking about where the code would go. Brad is
> > supporting both GFF3
> > > and the loosely defined GFF2 variants, so Bio.GFF
> > seems a good place.
> > > There would also be a wrapper under Bio.SeqIO for
> > loading GFF files as
> > > SeqRecord objects (I haven't played with
> > Brad's code, but it can do
> > > this already).
> > >
> > > However, we already have a Bio.GFF module from Michael
> > Hoffman created
> > > back in 2002 which accesses MySQL General Feature
> > Format (GFF)
> > > databases created with BioPerl. Perhaps we should
> > poll the main
> > > discussion list now, and if there are no responses
> > from people using
> > > it, we could deprecate Bio.GFF for Biopython 1.50?
> > Under our current
> > > deprecation policy we shouldn't then remove
> > Bio.GFF until Biopython
> > > 1.52 at the earliest,
> > http://biopython.org/wiki/Deprecation_policy
> > >
> > > What do you think Brad? How about using Bio.GFF3
> > instead?
> > >
> > > Peter
> > > _______________________________________________
> > > Biopython-dev mailing list
> > > Biopython-dev at lists.open-bio.org
> > >
> > http://lists.open-bio.org/mailman/listinfo/biopython-dev
> > _______________________________________________
> > Biopython-dev mailing list
> > Biopython-dev at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biopython-dev
>
>
>
From bugzilla-daemon at portal.open-bio.org Wed Apr 8 22:55:59 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Wed, 8 Apr 2009 18:55:59 -0400
Subject: [Biopython-dev] [Bug 2808] New: Bio.SeqIO "ig" format parser
doesn't deal with optional 1 terminator
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2808
Summary: Bio.SeqIO "ig" format parser doesn't deal with optional
1 terminator
Product: Biopython
Version: Not Applicable
Platform: All
OS/Version: All
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk
While working on new unit test test_Emboss.py I noticed that EMBOSS seqret
creates ig files where the sequence includes a terminal digit one. Further
research online suggests this is an optional feature of the file format,
although not commonly used. See:
http://bmerc-www.bu.edu/needle-doc/latest/seq-formats.html#seq-file-format
The Bio.SeqIO "ig" parser should be aware of the (optional) terminal "1"
marker, and not include it in the returned sequence. Perhaps we should even
add this when writing the files.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From chapmanb at 50mail.com Fri Apr 10 13:10:34 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 10 Apr 2009 09:10:34 -0400
Subject: [Biopython-dev] Invitation for Biopython news coordinators
In-Reply-To: <49DD5575.4040901@student.otago.ac.nz>
References: <20090406230542.GK43636@sobchak.mgh.harvard.edu>
<49DD5575.4040901@student.otago.ac.nz>
Message-ID: <20090410131034.GH54672@sobchak.mgh.harvard.edu>
David;
Thanks for taking the time to write; it is great to hear that you are
interested. Copying this to the dev list so others can comment and you
can feel free to discuss as much as you want.
> I'd be keen to help spread the good word about bio-python, I'm a very
> novice programmer who has been using the tools to work on some 454
> transcriptome data. I will probably never be a good enough programmer to
> contribute code to the project so would see this as a way to "give
> something back".
Perfect. Getting involved is the first step; you'd be
surprised how much you can learn just by taking on new tasks. I
started helping with Biopython by writing documentation.
> For me as a n00b the most useful resource by far has been the cookbook -
> seeing some working scripts that I could change to suit my ends has
> helped me get to the point that I can write much more generalised code
> for my project 'from scratch'. To that end I think it would be really
> helpful to highlight work that other people have done, either published
> or made available by authors, with a little detail on the questions
> and the way BioPython was used to get at them. We could extend it to
> show some "use cases" for BioPython working with other programs or how
> new features can be used once they are included in the main release.
>
> To me the most obvious way of presenting such information would be a
> blog, we could invite authors and developers to make short posts and
> failing that I'd be happy write up posts summarising published research.
> We could also try an aggregate blogs from the devs and anyone else
> talking about biopython "in the wild".
This sounds great. You are welcome to use the twitter account, news
posts, the wiki, or a blog -- however you see fit. For your aggregation
idea, you might want to take a look at friendfeed. It's pretty simple
to set up a room and pull in RSS feeds, twitter postings, and what not.
There is a Python for Bioinformatics room:
http://friendfeed.com/rooms/python-for-bioinformatics
Most feeds come from general Python sources so it is a bit more
broad, but is a good starting place. I know some of the admins (Chris,
Paulo, Andrew) are around here, and may want to chime in.
For publications, Peter has done a lot of work on identifying papers
that use Biopython:
http://biopython.org/wiki/Publications
Building on this to include short reusable examples from the research
would be very useful.
> Anyway, those are a few ideas, I'm definitely keen to help out and to
> take on board any other ideas that are out there.
Great, let us know how you want to get started. Feel free to start
with something small and expand from there. Peter can help out
with account information for twitter; if you need other things just
ask away.
Brad
> Cheers,
> David
>
> Brad Chapman wrote:
> > Biopythonistas;
> > Communication is a key component of successful open source projects.
> > The challenges of distributed programming by volunteers can be
> > overcome by ensuring that the whole community is aware of
> > interesting discussions, new contributions, and development goals.
> > Traditionally, this communication has happened through our mailing
> > lists, wiki pages, and bug tracking system. While these will
> > continue to to be useful resources, new methods of disseminating
> > information are changing how we interact through the web.
> >
> > I'd like to issue an invitation for anyone interested in helping
> > revolutionize how Biopython news is disseminated. We are looking for
> > contributors from the community to brainstorm new ways to make the
> > discussions that happen at biopython.org accessible. You would
> > actively follow development here and on the development lists and
> > distill this information into useful quick bullet points for those
> > interested in Biopython but too busy to follow detailed discussions.
> >
> > We are proposing two ways to do this:
> >
> > - Monthly highlights on our news server:
> > http://news.open-bio.org/news/category/obf-projects/biopython/
> > The RSS feed from these posts are currently widely distributed around the
> > internet.
> >
> > - More frequent pointers to interesting discussions or other items
> > of interest happening in Biopython through our Twitter account:
> > http://twitter.com/biopython
> >
> > This is an opportunity for those of you who are looking to become
> > more involved, and would like to learn more about Biopython by
> > following all of the coding activity more closely. The position is
> > very flexible and we are happy to have one or more people take it
> > on; we would also encourage you to be as creative as you want in
> > doing so.
> >
> > I see this as an chance to both provide information and to highlight
> > the great work people do at Biopython. If you are interested in
> > taking on this role please respond with your ideas. Thanks for your
> > interest,
> >
> > Brad
> > _______________________________________________
> > BioPython mailing list - BioPython at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biopython
> >
From bugzilla-daemon at portal.open-bio.org Fri Apr 10 14:13:58 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 10 Apr 2009 10:13:58 -0400
Subject: [Biopython-dev] [Bug 2809] New: Adding startswith and endswith
methods to the Seq object
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2809
Summary: Adding startswith and endswith methods to the Seq object
Product: Biopython
Version: Not Applicable
Platform: All
OS/Version: All
Status: NEW
Severity: enhancement
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk
OtherBugsDependingO 2351
nThis:
As part of making the Seq object more like the Python string (Bug 2351), we
need alphabet aware startswith and endswith methods. Patch to follow.
There are many possible use cases for this. One example which prompted me to
work on this was taking SeqRecord objects from sequencing reads (a FASTQ file
read in with Bio.SeqIO) where some include a PCR primer associated
prefix/suffix which I want to strip off (by slicing the SeqRecord). To do this
I need to know if a given SeqRecord's sequence starts with (or ends with) a
given primer sequence (or tuple of primer sequences).
Current work around, str(record.seq).startswith(prefix)
Patch to follow, which will allow record.seq.startswith(prefix) directly.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Fri Apr 10 14:13:59 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 10 Apr 2009 10:13:59 -0400
Subject: [Biopython-dev] [Bug 2351] Make Seq more like a string,
even subclass string?
In-Reply-To:
Message-ID: <200904101413.n3AEDx5I004913@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2351
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
BugsThisDependsOn| |2809
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From bugzilla-daemon at portal.open-bio.org Fri Apr 10 14:15:27 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 10 Apr 2009 10:15:27 -0400
Subject: [Biopython-dev] [Bug 2809] Adding startswith and endswith methods
to the Seq object
In-Reply-To:
Message-ID: <200904101415.n3AEFRRb005139@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2809
------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-10 10:15 EST -------
Created an attachment (id=1275)
--> (http://bugzilla.open-bio.org/attachment.cgi?id=1275&action=view)
Patch to Bio/Seq.py and Tests/test_Seq_objs.py
Adds startswith and endswith methods to the Seq object, and tests these with
simple doctest and a longer separate unit test.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From biopython at maubp.freeserve.co.uk Fri Apr 10 14:46:02 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 10 Apr 2009 15:46:02 +0100
Subject: [Biopython-dev] Tutorial & Cookbook
Message-ID: <320fb6e00904100746q2478551dna253f1aa5acbcdb4@mail.gmail.com>
David wrote:
>> For me as a n00b the most useful resource by far has been the cookbook -
>> seeing some working scripts that I could change to suit my ends has
>> helped me get to the point that I can write much more generalised code
>> for my project 'from scratch'. ...
When you said "cookbook", did you mean the Biopython Tutorial & Cookbook?
http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
There are a couple of other documents under the "Cookbook" folder here:
http://biopython.org/DIST/docs/cookbook/Restriction.html
http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf
I have been wondering if the "Biopython Tutorial & Cookbook" should be
separated now - it is getting a bit long (which in some ways is a good
thing!). Maybe we should re-title it as just the "Biopython
Tutorial". Some bits of the current "Cookbook chapter" might be moved
into the main body of the tutorial (e.g. the alignment stuff), but
having the cookbook entries separate might be a good idea.
For a separate "Cookbook", we could again use LaTeX for another
HTML/PDF document (or set of documents) but perhaps just a series of
pages on the wiki would be more accessible - and much easier for
people to contribute to? We'd need to organize things (e.g. a
cookbook category on the wiki) to make sure everything is still
accessible. As a bonus, it would give us more hits on Google - which
is probably a good thing.
On the other hand, it would be very good if all our cookbook use cases
could be rolled into the unit test framework - which wouldn't be so
easy if they live on the wiki. Something based on doctests might
work...
Peter
From bugzilla-daemon at portal.open-bio.org Fri Apr 10 17:29:06 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Fri, 10 Apr 2009 13:29:06 -0400
Subject: [Biopython-dev] [Bug 2808] Bio.SeqIO "ig" format parser doesn't
deal with optional 1 terminator
In-Reply-To:
Message-ID: <200904101729.n3AHT6g0020169@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2808
biopython-bugzilla at maubp.freeserve.co.uk changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution| |FIXED
------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-10 13:29 EST -------
(In reply to comment #0)
>
> The Bio.SeqIO "ig" parser should be aware of the (optional) terminal "1"
> marker, and not include it in the returned sequence.
>
Fixed in CVS,
Bio/SeqIO/IgIO.p revision 1.5
Tests/test_Emboss.py revision 1.10
>
> Perhaps we should even add this when writing the files.
>
We don't write out ig files so this isn't an issue at the moment.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From biopython at maubp.freeserve.co.uk Fri Apr 10 18:12:12 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 10 Apr 2009 19:12:12 +0100
Subject: [Biopython-dev] Bio.EMBOSS wrappers
Message-ID: <320fb6e00904101112h544aba7fla94029f32bbdb01d@mail.gmail.com>
Hi
Those of you following the CVS RSS feed will have noticed a lot of
activity on my new unit test test_Emboss.py, which now works on
Windows, Linux and Mac OS (provided EMBOSS is installed), and does
four main tasks:
- runs needle, checks Bio.AlignIO can parse the output
- runs water, checks Bio.AlignIO can parse the output
- runs seqret to check Bio.SeqIO
- runs seqret to check Bio.AlignIO
It would probably be logical to also include tests for the EMBOSS
version of primer3 here too, but I am not familiar with this tool and
the Biopython parsers.
For now I build the command line strings for seqret and needle "by
hand", as Bio.EMBOSS doesn't have wrappers for them yet. I also note
that the existing wrappers in Bio.EMBOSS don't support the very handy
-auto and -filter command line arguments supported by all (or at least
most) of the EMBOSS command line tools. Using -auto turns off any
user prompting for missing arguments (very important for calling from
a script). Using -filter is useful for running the tools with pipes
(i.e. no output file is required as stdout can be used instead, and
potentially no input file if we write to stdin correctly).
Brad, seeing as Bio.EMBOSS is "your baby" how do you feel about adding
these features? The needle wrapper would make an excellent basis for
a new water wrapper. For adding -auto and -filter support, there is
probably a clever approach with a common EMBOSS specific subclass of
Bio.Application.AbstractCommandline, but I haven't tried.
Peter
From mjldehoon at yahoo.com Sat Apr 11 02:26:45 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Fri, 10 Apr 2009 19:26:45 -0700 (PDT)
Subject: [Biopython-dev] Tutorial & Cookbook
In-Reply-To: <320fb6e00904100746q2478551dna253f1aa5acbcdb4@mail.gmail.com>
Message-ID: <93403.18413.qm@web62406.mail.re1.yahoo.com>
--- On Fri, 4/10/09, Peter wrote:
> I have been wondering if the "Biopython Tutorial &
> Cookbook" should be separated now - it is getting
> a bit long (which in some ways is a good thing!).
In my opinion, it doesn't matter if the "Biopython Tutorial & Cookbook" is long. I guess that few people actually print this document anyway.
I am in favor of having one "official" documentation for Biopython. If we have one Tutorial and one Cookbook, we'll have lots of overlap between the two, it'll be unclear what should be in the Tutorial and what in the Cookbook, and we'll have to make sure the two are consistent.
A cookbook on the Wiki could be helpful though, and since the Wiki pages can be fixed easily we won't have to worry so much about inconsistencies with the official documentation.
> Maybe we should re-title it as just the "Biopython Tutorial".
That sounds like a good idea.
> Some bits of the current "Cookbook chapter" might be moved
> into the main body of the tutorial (e.g. the alignment
> stuff),
Yes. The cookbook chapter has the same problem as a cookbook document; it's not clear what should go there. A more logical place for cookbook-style examples is at the end of each chapter in the documentation. For example, Bio.Entrez has a bunch of cookbook-style examples at the end of its chapter in the Biopython Tutorial & Cookbook.
Currently, there are not so many sections left in the cookbook chapter; most of them have become full-fledged chapters and were moved out of the cookbook chapter.
> For a separate "Cookbook", we could again use LaTeX for another
> HTML/PDF document (or set of documents) but perhaps just a
> series of pages on the wiki would be more accessible - and much
> easier for people to contribute to?
+1 for the wiki, -1 for another HTML/PDF document.
> On the other hand, it would be very good if all our
> cookbook use cases
> could be rolled into the unit test framework - which
> wouldn't be so
> easy if they live on the wiki. Something based on doctests
> might work...
Whereas it can be useful if some cookbook examples are part of the unit tests, I don't think it's absolutely required. I see a wiki cookbook more as complementary to the unit tests.
--Michiel.
From mjldehoon at yahoo.com Sat Apr 11 11:29:47 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sat, 11 Apr 2009 04:29:47 -0700 (PDT)
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <20090408124908.GN43636@sobchak.mgh.harvard.edu>
Message-ID: <830379.9837.qm@web62402.mail.re1.yahoo.com>
Hi Brad,
Thanks for the examples; that clarified it a lot.
I have a couple of suggestions of how to make the GFF parser more generally usable, and more consistent with other parsers in Biopython.
Looking at your first example:
> from BCBio.GFF.GFFParser import GFFAddingIterator
>
> gff_iterator = GFFAddingIterator()
> rec_dict = gff_iterator.get_all_features(gff_file)
>
> The returned dictionary is like a dictionary from
> SeqIO.to_dict;
> keys are ids and values are SeqRecords.
It's not clear to me why we need an iterator for GFF files. Can't we just use Python's line iterator instead? I would expect code like this:
from Bio import GFF
handle = open("my_gff_file.gff")
for line in handle:
# call the appropriate GFF function on the line
The second point is about GFFAddingIterator.get_all_features. If this is essentially analogous to SeqIO.to_dict, how about calling it GFF.to_dict?
Then the code looks as follows:
from Bio import GFF
handle = open("my_gff_file.gff")
rec_dict = GFF.to_dict(handle)
Another thing to consider is that IDs in the GFF file do not need to be unique. For example, consider a GFF file that stores genome mapping locations for short sequences stored in a Fasta file. Since each sequence can have more than one mapping location, we can have multiple lines in the GFF file for one sequence ID.
The last point is about storing SeqRecords in rec_dict. A GFF file typically does not store sequences; if it does, it's not clear which field in the GFF file does. On the other hand, a SeqRecord often does not contain the chromosomal location, which is what the GFF file stores. So why use a SeqRecord for GFF information?
Sorry for bringing up lots of issues. But I think that a GFF parser will be heavily used, so we should optimize its design as much as possible.
Best,
--Michiel.
From biopython at maubp.freeserve.co.uk Sun Apr 12 13:16:58 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 12 Apr 2009 14:16:58 +0100
Subject: [Biopython-dev] Bio.EMBOSS wrappers
In-Reply-To: <320fb6e00904101112h544aba7fla94029f32bbdb01d@mail.gmail.com>
References: <320fb6e00904101112h544aba7fla94029f32bbdb01d@mail.gmail.com>
Message-ID: <320fb6e00904120616u390cfe56w3889804d2bffd385@mail.gmail.com>
On 4/10/09, Peter wrote:
> Hi
>
> Those of you following the CVS RSS feed will have noticed a lot of
> activity on my new unit test test_Emboss.py, which now works on
> Windows, Linux and Mac OS (provided EMBOSS is installed), and does
> four main tasks:
>
> - runs needle, checks Bio.AlignIO can parse the output
> - runs water, checks Bio.AlignIO can parse the output
> - runs seqret to check Bio.SeqIO
> - runs seqret to check Bio.AlignIO
It now also runs transeq to check the Bio.Seq translations on all
common tables. This has shown up some differences in our translations
for ambiguous sequences - I may have found a bug in EMBOSS...
Peter
From sbassi at clubdelarazon.org Mon Apr 13 01:57:52 2009
From: sbassi at clubdelarazon.org (Sebastian Bassi)
Date: Sun, 12 Apr 2009 22:57:52 -0300
Subject: [Biopython-dev] Fwd: [blast-announce] BLAST 2.2.20 now available
In-Reply-To: <320fb6e00904071238v7ed94541me4af7937fa7ae27b@mail.gmail.com>
References: <9E9BBBAC-1E4D-4C2E-AD8E-7F6DF6F317A5@ncbi.nlm.nih.gov>
<320fb6e00904071238v7ed94541me4af7937fa7ae27b@mail.gmail.com>
Message-ID: <9e2f512b0904121857o3f8da862ycf40a0510f9dbd51@mail.gmail.com>
On Tue, Apr 7, 2009 at 4:38 PM, Peter wrote:
> Hi all,
....
> We'll also want to check the standalone version of BLAST is OK.
I've made the following check:
Run a blast query (with blast 2.2.20) with output in xml. Run my
python script that converts XML to HTML using Biopython (under
Biopython 1.50beta) and it worked OK. The script deals with most
information bits found in an XML blast file so if there is any change
in the blast output, this program would crash.
From eric.talevich at gmail.com Mon Apr 13 03:13:32 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Sun, 12 Apr 2009 23:13:32 -0400
Subject: [Biopython-dev] PDB tidy script
In-Reply-To: <2d7c25310904030631u56c642d6k83355d6bc4cd5d19@mail.gmail.com>
References: <320fb6e00903231405l479ddcc6of9cd0c1aa8fd98d4@mail.gmail.com>
<2d7c25310904030631u56c642d6k83355d6bc4cd5d19@mail.gmail.com>
Message-ID: <3f6baf360904122013k21aa8efcm4aae0ac872e8e6af@mail.gmail.com>
Hi Thomas & everyone,
I've started a separate branch on GitHub for this work:
http://github.com/etal/biopython/tree/pdbtidy
I pushed one small change just now (partly to play with git branches), which
is basically the example code I gave earlier. It wraps the PDBLoader and
parse_pdb_header classes, and sticks a finger into PDBList too, so that
parsing and building a structure from a PDB file is a one-liner for both
local and RCSB-hosted files:
>>> from Bio import PDB
>>> prot = PDB.load('pdb2hmb.ent')
>>> dir(prot)
['__doc__', '__init__', '__module__', 'author', 'compound',
'deposition_date', 'head', 'journal', 'journal_reference', 'keywords',
'name', 'release_date', 'resolution', 'source', 'structure',
'structure_method', 'structure_reference']
Or:
>>> PDB.fetch('2hmb')
/usr/lib/python2.5/site-packages/Bio/PDB/PDBList.py:240: UserWarning:
Retrieving
ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided/pdb/hm/pdb2hmb.ent.gz
warn("Retrieving %s" % url)
(The warning is supposed to be a comment, but that cleanup is happening in
another branch: http://github.com/etal/biopython/tree/bug2754 ).
My idea is to pull all of the parse_pdb_header data out of the PDBParser and
Structure classes, and store it in the PDBLoader wrapper instead. The
existing "header" attributes can point to the PDBLoader parent if it exists,
or temporarily contain None or "" if necessary to avoid breaking scripts,
according to the deprecation plan. Annotations could either stay in
Structure or move to Loader. Then we'd have a fast, lean, consistent
hierarchy of classes for 3D structure work, and an easy API for loading and
exploring PDB files interactively.
Part of the pdbtidy concept is to check that the PDB header is consistent
with the structure it represents, so I'd like the API for metadata to be
just as nice as the existing one for 3D structure.
So, this is just a start, but I hope the intent is clear enough that someone
will tell me to stop if the whole idea is misguided.
Thanks,
Eric
From biopython at maubp.freeserve.co.uk Mon Apr 13 09:51:38 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 13 Apr 2009 10:51:38 +0100
Subject: [Biopython-dev] Fwd: [blast-announce] BLAST 2.2.20 now available
In-Reply-To: <9e2f512b0904121857o3f8da862ycf40a0510f9dbd51@mail.gmail.com>
References: <9E9BBBAC-1E4D-4C2E-AD8E-7F6DF6F317A5@ncbi.nlm.nih.gov>
<320fb6e00904071238v7ed94541me4af7937fa7ae27b@mail.gmail.com>
<9e2f512b0904121857o3f8da862ycf40a0510f9dbd51@mail.gmail.com>
Message-ID: <320fb6e00904130251k3e3e77f2x20e03fba19fd8ff7@mail.gmail.com>
On Mon, Apr 13, 2009 at 2:57 AM, Sebastian Bassi
wrote:
> On Tue, Apr 7, 2009 at 4:38 PM, Peter wrote:
>> Hi all,
> ....
>> We'll also want to check the standalone version of BLAST is OK.
>
> I've made the following check:
> Run a blast query (with blast 2.2.20) with output in xml. Run my
> python script that converts XML to HTML using Biopython (under
> Biopython 1.50beta) and it worked OK. The script deals with most
> information bits found in an XML blast file so if there is any change
> in the blast output, this program would crash.
Great - thanks for checking that :)
Peter
From biopython at maubp.freeserve.co.uk Mon Apr 13 10:44:29 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 13 Apr 2009 11:44:29 +0100
Subject: [Biopython-dev] BOSC 2009
Message-ID: <320fb6e00904130344l2dfbeb89je62cbe8410181fe2@mail.gmail.com>
Hello Biopythoneers,
Those of you following the dev-mailing list or the OBF news feed will
know that talk abstracts for BOSC 2009 are due in today, see
http://www.open-bio.org/wiki/BOSC_2009
I should to be able to attend and present the Biopython Project
Update, and a few other Biopython developers may also be around too,
so some sort of hackathon is in the air.
It is a bit unfortunate the deadline was scheduled on the Easter
break, as I'm sure quite a few of you will be on holiday, but here is
an outline abstract. If anyone has comments, please let me know (on
the list or directly) in the next couple of hours...
Biopython Project Update (draft abstract for BOSC 2009)
In this talk we present the current status of the Biopython project,
focusing on features developed in the last year, and future plans for
the project. The Oxford University Press journal Bioinformatics has
recently published an application note describing Biopython:
Cock PJ, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I,
Hamelryck T, Kauff F, Wilczynski B, and de Hoon MJ. Biopython: freely
available Python tools for computational molecular biology and
bioinformatics. Bioinformatics 2009 Mar 20.
doi:10.1093/bioinformatics/btp163
Since BOSC 2008, Biopython 1.49 has been released. This was an
important milestone in bringing support for Python 2.6, and in terms
of our dependence on Numerical Python as we made the transition from
the obsolete Numeric library to NumPy. Biopython 1.49 also added more
biological methods to our core sequence object.
April 2009 will see the release of Biopython 1.50 (at the time of
writing, a beta has already been released). Some of the new features
include:
1. GenomeDiagram by Leighton Pritchard has been integrated into
Biopython as the Bio.Graphics.GenomeDiagram module.
2. A new module Bio.Motif has been added, which is intended to replace
the existing Bio.AlignAce and Bio.MEME modules.
3. Bio.SeqIO can now read and write FASTQ and QUAL files used in
second generation sequencing work.
Biopython will celebrate its 10th Birthday later this year, we will
present a brief history of the project and current work. This
includes the evaluation of git (and github) as a possible distributed
version control system (DVCS) to replace our existing very stable CVS
server hosted by the Open Bioinformatics Foundation, which we hope
will encourage more participation in the project.
--
Thanks,
Peter
From biopython at maubp.freeserve.co.uk Mon Apr 13 12:16:10 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 13 Apr 2009 13:16:10 +0100
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <830379.9837.qm@web62402.mail.re1.yahoo.com>
References: <20090408124908.GN43636@sobchak.mgh.harvard.edu>
<830379.9837.qm@web62402.mail.re1.yahoo.com>
Message-ID: <320fb6e00904130516x25bf24d8ib76ec36171e93fd8@mail.gmail.com>
On Sat, Apr 11, 2009 at 12:29 PM, Michiel de Hoon wrote:
>
> Hi Brad,
>
> Thanks for the examples; that clarified it a lot.
I haven't tried the code yet, but I have a GFF file I need to convert
into FASTA format. Hopefully later this week I'll get to that...
There are a few things I can ask now through:
Why are the functions _gff_line_map() and _gff_line_reduce() private
(leading underscores)? I had thought you wanted to make the
map/reduce approach available to people trying to parse GFF files on
multiple threads (e.g. using disco) which would require them to use
these two functions, wouldn't it? If so, they should be part of the
public API.
I don't see any support for the optional FASTA block in a GFF file.
Is this something you intend to add later Brad? See also my thoughts
below for Bio.SeqIO integration.
> I have a couple of suggestions of how to make the GFF parser more generally usable, and more consistent with other parsers in Biopython.
> Looking at your first example:
>
>> from BCBio.GFF.GFFParser import GFFAddingIterator
>>
>> gff_iterator = GFFAddingIterator()
>> rec_dict = gff_iterator.get_all_features(gff_file)
>>
>> The returned dictionary is like a dictionary from
>> SeqIO.to_dict;
>> keys are ids and values are SeqRecords.
>
> It's not clear to me why we need an iterator for GFF files. Can't we just use Python's line iterator instead? I would expect code like this:
>
> from Bio import GFF
> handle = open("my_gff_file.gff")
> for line in handle:
> ? ?# call the appropriate GFF function on the line
I think the appropriate GFF function here might be Brad's
_gff_line_map(). This knows about different GFF line types (e.g. ##
header lines). I'm not sure if a line based approach like this can
cope with the optional ##FASTA block through.
> The second point is about GFFAddingIterator.get_all_features. If this
> is essentially analogous to SeqIO.to_dict, how about calling it GFF.to_dict?
> Then the code looks as follows:
>
> from Bio import GFF
> handle = open("my_gff_file.gff")
> rec_dict = GFF.to_dict(handle)
Well, the Bio.SeqIO.to_dict() function takes a SeqRecord list/iterator
rather than a handle, but that might make sense here.
> Another thing to consider is that IDs in the GFF file do not need to be unique.
> For example, consider a GFF file that stores genome mapping locations for
> short sequences stored in a Fasta file. Since each sequence can have more
> than one mapping location, we can have multiple lines in the GFF file for one
> sequence ID.
That sounds nasty. Do you have any example files of this we could use
for a test case?
> The last point is about storing SeqRecords in rec_dict. A GFF file typically
> does not store sequences; if it does, it's not clear which field in the GFF file
> does. On the other hand, a SeqRecord often does not contain the
> chromosomal location, which is what the GFF file stores. So why use a
> SeqRecord for GFF information?
I don't think the GFF parser should only return SeqRecord object, but
I do see a use for this (via Bio.SeqIO). GFF files could be
represented as a list of SeqFeature objects, and using a SeqRecord to
hold this seems very natural to me. It also means we could use
Bio.SeqIO to load a GFF file into SeqRecord objects for storage in a
BioSQL database.
If you look at the NCBI FTP site, they often provide genome sequences
in a range of file formats including GenBank and GFF.
e.g.
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/
The GenBank files contain the features plus the sequence,
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.gbk
Their GFF3 file only contains the features:
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.gff
Some GFF files will include the sequence too, in this case we can
fetch it in FASTA format:
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.fna
In principle, you could parse this FASTA file and the GFF3 file and
put together a GenBank file - or vice versa.
As an aside, I would also consider adding protein table support on the
same lines, look at this file:
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.ptt
The header information gives us the genome size, so Bio.SeqIO could
return a SeqRecord with lots of SeqFeature objects and for the
SeqRecord's seq property use a Bio.Seq.UnknownSeq of length 4639675bp.
This is something I might look at implementing myself after Biopython
1.50 is out. We should be able to read in a GenBank file and output a
PTT file, and verify it matches the NCBI provided version of the PTT
file.
Similarly, I would want Bio.SeqIO to be able to parse a GFF3 file, and
give me a SeqRecord with lots of SeqFeature objects. If the sequence
is present in the file, it should use that (not the case for these
NCBI GFF3 files). Otherwise, we wouldn't necessarily know the actual
sequence length which we'd need to use the new Bio.Seq.UnknownSeq
object. However, we can infer from the maximum feature coordinates a
minimum sequence length. For these NCBI GFF3 files, as there is a
source feature this does actually give use the genome length, so this
should work very nicely.
Peter
From chapmanb at 50mail.com Mon Apr 13 12:32:19 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 13 Apr 2009 08:32:19 -0400
Subject: [Biopython-dev] Bio.EMBOSS wrappers
In-Reply-To: <320fb6e00904101112h544aba7fla94029f32bbdb01d@mail.gmail.com>
References: <320fb6e00904101112h544aba7fla94029f32bbdb01d@mail.gmail.com>
Message-ID: <20090413123219.GB5429@sobchak.mgh.harvard.edu>
Hi Peter;
The tests from EMBOSS look great; thanks for putting this together.
> For now I build the command line strings for seqret and needle "by
> hand", as Bio.EMBOSS doesn't have wrappers for them yet. I also note
> that the existing wrappers in Bio.EMBOSS don't support the very handy
> -auto and -filter command line arguments supported by all (or at least
> most) of the EMBOSS command line tools. Using -auto turns off any
> user prompting for missing arguments (very important for calling from
> a script). Using -filter is useful for running the tools with pipes
> (i.e. no output file is required as stdout can be used instead, and
> potentially no input file if we write to stdin correctly).
>
> Brad, seeing as Bio.EMBOSS is "your baby" how do you feel about adding
> these features? The needle wrapper would make an excellent basis for
> a new water wrapper. For adding -auto and -filter support, there is
> probably a clever approach with a common EMBOSS specific subclass of
> Bio.Application.AbstractCommandline, but I haven't tried.
Definitely go for it. My approach on this has mostly been to add
command lines as they are requested, or if I need them for something
I am doing. Not ideal.
Having a subclass with -auto and -filter is a really good idea;
unfortunately nothing clever is designed into the command line builders
right now. Feel free to add away.
Brad
From chapmanb at 50mail.com Mon Apr 13 12:52:55 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 13 Apr 2009 08:52:55 -0400
Subject: [Biopython-dev] Tutorial & Cookbook
In-Reply-To: <93403.18413.qm@web62406.mail.re1.yahoo.com>
References: <320fb6e00904100746q2478551dna253f1aa5acbcdb4@mail.gmail.com>
<93403.18413.qm@web62406.mail.re1.yahoo.com>
Message-ID: <20090413125255.GC5429@sobchak.mgh.harvard.edu>
Hi all;
> > I have been wondering if the "Biopython Tutorial &
> > Cookbook" should be separated now - it is getting
> > a bit long (which in some ways is a good thing!).
>
> In my opinion, it doesn't matter if the "Biopython Tutorial &
> Cookbook" is long. I guess that few people actually print this
> document anyway.
>
> I am in favor of having one "official" documentation for Biopython.
> If we have one Tutorial and one Cookbook, we'll have lots of overlap
> between the two, it'll be unclear what should be in the Tutorial
> and what in the Cookbook, and we'll have to make sure the two are
> consistent.
I am for whatever is easiest to maintain. Being long isn't a problem
as people can just skip to whatever they need; reading things online
will be increasingly common.
Agreed with Michiel that minimizing overlap is key. It's the same as
maintaining code; if you have the same thing in multiple places it
is more likely to get out of sync and be confusing. There is a
pretty clear distinction between tutorial documentation and cookbook
examples, so...
> A cookbook on the Wiki could be helpful though, and since the Wiki
> pages can be fixed easily we won't have to worry so much about
> inconsistencies with the official documentation.
[...]
> +1 for the wiki, -1 for another HTML/PDF document.
Same vote for me. I am responsible for the LaTeX file, but if I were
starting it today would do things entirely on the web. The barrier
to contributing is much lower.
> > On the other hand, it would be very good if all our cookbook use cases
> > could be rolled into the unit test framework - which wouldn't be so
> > easy if they live on the wiki. Something based on doctests might work...
This is a good idea; broken examples in documentation are definitely
annoying. If we enforce a common format for cookbook items, then we
could scrape the wiki pages, extract the python code and run it as
part of the tests. The python cookbook could serve as some
inspiration:
http://code.activestate.com/recipes/langs/python/
Brad
From biopython at maubp.freeserve.co.uk Mon Apr 13 12:53:18 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 13 Apr 2009 13:53:18 +0100
Subject: [Biopython-dev] Bio.EMBOSS wrappers
In-Reply-To: <20090413123219.GB5429@sobchak.mgh.harvard.edu>
References: <320fb6e00904101112h544aba7fla94029f32bbdb01d@mail.gmail.com>
<20090413123219.GB5429@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00904130553n466972enf565ad3bed861b0e@mail.gmail.com>
On Mon, Apr 13, 2009 at 1:32 PM, Brad Chapman wrote:
>> Brad, seeing as Bio.EMBOSS is "your baby" how do you feel about adding
>> these features? ?The needle wrapper would make an excellent basis for
>> a new water wrapper. ?For adding -auto and -filter support, there is
>> probably a clever approach with a common EMBOSS specific subclass of
>> Bio.Application.AbstractCommandline, but I haven't tried.
>
> Definitely go for it. My approach on this has mostly been to add
> command lines as they are requested, or if I need them for something
> I am doing. Not ideal.
>
> Having a subclass with -auto and -filter is a really good idea;
> unfortunately nothing clever is designed into the command line builders
> right now. Feel free to add away.
I need to work on my delegation skills - that seems to have back fired ;)
Regarding adding -auto support, I have a question about the needle
wrapper and the gap parameters. Using the needle tool at the command
line will prompt for the gap parameters UNLESS the -auto argument has
been used. i.e. Without -auto, it makes sense to insist on the gap
parameters being included, which is what the current wrapper does.
However, if we add support for -auto, then these parameters can be
optional. We could handle this in the wrapper, but it would be messy
(and there may be similar questions with other EMBOSS tools). What do
you think - stick with the simple option of insisting the Biopython
user set the gap parameters, even if they are using -auto?
Peter
From biopython at maubp.freeserve.co.uk Mon Apr 13 13:16:51 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 13 Apr 2009 14:16:51 +0100
Subject: [Biopython-dev] Tutorial & Cookbook
In-Reply-To: <20090413125255.GC5429@sobchak.mgh.harvard.edu>
References: <320fb6e00904100746q2478551dna253f1aa5acbcdb4@mail.gmail.com>
<93403.18413.qm@web62406.mail.re1.yahoo.com>
<20090413125255.GC5429@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00904130616t2cd5f029i582f4a2488d1182c@mail.gmail.com>
Brad wrote:
>Michiel wrote:
>> A cookbook on the Wiki could be helpful though, and since the Wiki
>> pages can be fixed easily we won't have to worry so much about
>> inconsistencies with the official documentation.
>> [...]
>> +1 for the wiki, -1 for another HTML/PDF document.
>
> Same vote for me. I am responsible for the LaTeX file, but if I were
> starting it today would do things entirely on the web. The barrier
> to contributing is much lower.
One of the nice things about the current PDF (and HTML) file is we can
ship it with each release, meaning it can be used while offline. Also
it means we don't have to worry too much about having our online
documentation deal with older versions of Biopython.
But you are right that LaTeX is a slight barrier to contributing -
although it wasn't an issue for me personally as I learnt LaTeX during
my Maths/Physics undergraduate degree. In anycase, I've previously
said that if people have additions for the tutorial, I'll take plain
text and do the mark up for them.
>> > On the other hand, it would be very good if all our cookbook use cases
>> > could be rolled into the unit test framework - which wouldn't be so
>> > easy if they live on the wiki. ?Something based on doctests might work...
>
> This is a good idea; broken examples in documentation are definitely
> annoying. If we enforce a common format for cookbook items, then we
> could scrape the wiki pages, extract the python code and run it as
> part of the tests.
That sounds possible - we might be able to scrape the wiki page,
reformat it and feed it into doctests... although testing graphical
output will still be a problem.
Speaking of doctests, we should do more of those in our docstrings.
For our online API documentation at
http://biopython.org/DIST/docs/api/ it would be nice to have the
python examples within the docstrings (including the doctests) shown
with syntax colouring. See
http://epydoc.sourceforge.net/manual-epytext.html#doctest-blocks for
an example, and compare this to
http://biopython.org/DIST/docs/api/Bio.Seq-module.html - maybe we need
to adjust our indentation?
Peter
From biopython at maubp.freeserve.co.uk Mon Apr 13 13:33:03 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 13 Apr 2009 14:33:03 +0100
Subject: [Biopython-dev] BOSC 2009
In-Reply-To: <320fb6e00904130344l2dfbeb89je62cbe8410181fe2@mail.gmail.com>
References: <320fb6e00904130344l2dfbeb89je62cbe8410181fe2@mail.gmail.com>
Message-ID: <320fb6e00904130633k68fe32bdj3c0419afc5ada71a@mail.gmail.com>
On Mon, Apr 13, 2009 at 11:44 AM, Peter wrote:
> Hello Biopythoneers,
>
> Those of you following the dev-mailing list or the OBF news feed will
> know that talk abstracts for BOSC 2009 are due in today, see
> http://www.open-bio.org/wiki/BOSC_2009
> I should to be able to attend and present the Biopython Project
> Update, and a few other Biopython developers may also be
> around too, so some sort of hackathon is in the air.
>
> It is a bit unfortunate the deadline was scheduled on the Easter
> break, as I'm sure quite a few of you will be on holiday, but here
> is an outline abstract. ?If anyone has comments, please let me
> know (on the list or directly) in the next couple of hours...
That's been submitted now, although I can still make revisions at the
moment if anyone spots something worth adding/fixing. I did remember
to add the website and license information as BOSC request on their
instructions.
Peter
From chapmanb at 50mail.com Mon Apr 13 13:35:39 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 13 Apr 2009 09:35:39 -0400
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <320fb6e00904130516x25bf24d8ib76ec36171e93fd8@mail.gmail.com>
References: <20090408124908.GN43636@sobchak.mgh.harvard.edu>
<830379.9837.qm@web62402.mail.re1.yahoo.com>
<320fb6e00904130516x25bf24d8ib76ec36171e93fd8@mail.gmail.com>
Message-ID: <20090413133539.GD5429@sobchak.mgh.harvard.edu>
Michiel and Peter;
Thanks for your comments on this. I'm definitely open to modifying
the interface and am happy to y'all giving feedback.
In reading through your comments, there is a bit of a disconnect
between what you are expecting the parser to do and how it is
designed right now. You both are thinking of the GFF parser as a
line oriented parser that emits an object, like a SeqFeature, for
each line in the file. This one way to do it, but the downsides are:
- Many features, like coding regions, are actually represented over
multiple lines.
- As Michiel pointed out, almost all files have many replicating
IDs (the first column). Ideally you want all of these features
consolidated to a single SeqRecord.
So the parser now takes a higher level view and assumes that the
user will want those two things done for them. So it is designed as
an "adder," that puts features onto SeqRecord objects. A normal
use case would be:
- Use SeqIO to parse a FASTA file with the sequences => SeqRecords
- Use the GFFParser to add features from a separate GFF file to the
SeqRecords. These are SeqFeatures, added to the right records and
nested in a parent/child relationship as appropriate.
Ideally you would parse the entire GFF file and do all this feature
adding at once. For big files this fails due to memory issues, which
is why the filtering and iterating features were introduced.
Okay, so that is the top level view. I will try to hit some of the
specifics:
> Why are the functions _gff_line_map() and _gff_line_reduce() private
> (leading underscores)? I had thought you wanted to make the
> map/reduce approach available to people trying to parse GFF files on
> multiple threads (e.g. using disco) which would require them to use
> these two functions, wouldn't it? If so, they should be part of the
> public API.
I don't think a standard user would want to deal with these
directly. They just parse lines into their components and build an
intermediate dictionary object. To parallelize the job, the
GFFMapReduceFeatureAdder class has a 'disco_host' parameter which
then runs the job in parallel.
> I don't see any support for the optional FASTA block in a GFF file.
> Is this something you intend to add later Brad? See also my thoughts
> below for Bio.SeqIO integration.
I haven't added anything for parsing header and footer directives but
it is on the to do list and I have a good idea how to handle them. Definitely
pass along a file that uses these you want to parse and we can work on it.
> > I have a couple of suggestions of how to make the GFF parser more
> > generally usable, and more consistent with other parsers in Biopython.
[...]
> > It's not clear to me why we need an iterator for GFF files. Can't we
> > just use Python's line iterator instead? I would expect code like this:
> >
> > from Bio import GFF
> > handle = open("my_gff_file.gff")
> > for line in handle:
> > ? ?# call the appropriate GFF function on the line
Right, so this was tackled in the top level overview above. Michiel,
does the design make more sense now?
> > The second point is about GFFAddingIterator.get_all_features. If this
> > is essentially analogous to SeqIO.to_dict, how about calling it GFF.to_dict?
> > Then the code looks as follows:
> >
> > from Bio import GFF
> > handle = open("my_gff_file.gff")
> > rec_dict = GFF.to_dict(handle)
Yes, except in the more common cases you are adding to a dictionary
of records as opposed to generating one from scratch. My thought was
that copying the SeqIO behavior made it more confusing because it
doesn't do quite the same thing. After my explanation, what are your
thoughts?
> > Another thing to consider is that IDs in the GFF file do not need to be unique.
> > For example, consider a GFF file that stores genome mapping locations for
> > short sequences stored in a Fasta file. Since each sequence can have more
> > than one mapping location, we can have multiple lines in the GFF file for one
> > sequence ID.
Yes, this goes back to my explanation above and is why the
parser works differently than the standard SeqIO parsers. GFF ends
up being a different beast. I think it makes sense to copy useful
patterns we have already, but don't want to confuse users with close
by not the same functionality.
> > The last point is about storing SeqRecords in rec_dict. A GFF file typically
> > does not store sequences; if it does, it's not clear which field in the GFF file
> > does. On the other hand, a SeqRecord often does not contain the
> > chromosomal location, which is what the GFF file stores. So why use a
> > SeqRecord for GFF information?
Hopefully the SeqRecords make more sense now. What it is really doing is
adding SeqFeatures to SeqRecords. When the user doesn't provide one,
it creates an empty SeqRecord with the appropriate ID to use and
adds SeqFeatures to it.
> If you look at the NCBI FTP site, they often provide genome sequences
> in a range of file formats including GenBank and GFF.
[...]
> Their GFF3 file only contains the features:
> ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.gff
>
> Some GFF files will include the sequence too, in this case we can
> fetch it in FASTA format:
> ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.fna
Right on. So you would first parse the Fasta file with the SeqIO
parser to_dict functionality, and then feed this dictionary to the
GFF parser to add the features.
> In principle, you could parse this FASTA file and the GFF3 file and
> put together a GenBank file - or vice versa.
Yes.
> Similarly, I would want Bio.SeqIO to be able to parse a GFF3 file, and
> give me a SeqRecord with lots of SeqFeature objects. If the sequence
> is present in the file, it should use that (not the case for these
> NCBI GFF3 files). Otherwise, we wouldn't necessarily know the actual
> sequence length which we'd need to use the new Bio.Seq.UnknownSeq
> object. However, we can infer from the maximum feature coordinates a
> minimum sequence length. For these NCBI GFF3 files, as there is a
> source feature this does actually give use the genome length, so this
> should work very nicely.
Using UnknownSeq is a good idea, and I will do.
Whew. Michiel and Peter -- hopefully the high level intentions are a
bit more clear. Thanks for your input so far; let's hash this out so
it makes sense to everyone.
Brad
From chapmanb at 50mail.com Mon Apr 13 13:44:29 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 13 Apr 2009 09:44:29 -0400
Subject: [Biopython-dev] Bio.EMBOSS wrappers
In-Reply-To: <320fb6e00904130553n466972enf565ad3bed861b0e@mail.gmail.com>
References: <320fb6e00904101112h544aba7fla94029f32bbdb01d@mail.gmail.com>
<20090413123219.GB5429@sobchak.mgh.harvard.edu>
<320fb6e00904130553n466972enf565ad3bed861b0e@mail.gmail.com>
Message-ID: <20090413134429.GE5429@sobchak.mgh.harvard.edu>
Hi Peter;
> >> Brad, seeing as Bio.EMBOSS is "your baby" how do you feel about adding
> >> these features? ?The needle wrapper would make an excellent basis for
> >> a new water wrapper. ?For adding -auto and -filter support, there is
> >> probably a clever approach with a common EMBOSS specific subclass of
> >> Bio.Application.AbstractCommandline, but I haven't tried.
> >
> > Definitely go for it. My approach on this has mostly been to add
> > command lines as they are requested, or if I need them for something
> > I am doing. Not ideal.
> >
> > Having a subclass with -auto and -filter is a really good idea;
> > unfortunately nothing clever is designed into the command line builders
> > right now. Feel free to add away.
>
> I need to work on my delegation skills - that seems to have back fired ;)
Oops. I honestly read that as "do I have your permission?" I can of
course tackle this, but am a bit underwater now.
> Regarding adding -auto support, I have a question about the needle
> wrapper and the gap parameters. Using the needle tool at the command
> line will prompt for the gap parameters UNLESS the -auto argument has
> been used. i.e. Without -auto, it makes sense to insist on the gap
> parameters being included, which is what the current wrapper does.
> However, if we add support for -auto, then these parameters can be
> optional. We could handle this in the wrapper, but it would be messy
> (and there may be similar questions with other EMBOSS tools). What do
> you think - stick with the simple option of insisting the Biopython
> user set the gap parameters, even if they are using -auto?
I think we should stick with the simple option. These were meant to
be pretty dumb specifiers that help users write more modular code than
simply pasting in a raw string for the command line. Trying to get
too fancy is probably overkill.
Brad
From biopython at maubp.freeserve.co.uk Mon Apr 13 13:49:56 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 13 Apr 2009 14:49:56 +0100
Subject: [Biopython-dev] Bio.EMBOSS wrappers
In-Reply-To: <20090413134429.GE5429@sobchak.mgh.harvard.edu>
References: <320fb6e00904101112h544aba7fla94029f32bbdb01d@mail.gmail.com>
<20090413123219.GB5429@sobchak.mgh.harvard.edu>
<320fb6e00904130553n466972enf565ad3bed861b0e@mail.gmail.com>
<20090413134429.GE5429@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00904130649v699b16d1seea0c51172b9ab71@mail.gmail.com>
On Mon, Apr 13, 2009 at 2:44 PM, Brad Chapman wrote:
>> > ... Feel free to add away.
>>
>> I need to work on my delegation skills - that seems to have back fired ;)
>
> Oops. I honestly read that as "do I have your permission?" I can of
> course tackle this, but am a bit underwater now.
Looking back, I was a bit ambiguous. I don't mind who does it - let's
see who has time free first.
>> Regarding adding -auto support, I have a question about the needle
>> wrapper and the gap parameters. ?Using the needle tool at the command
>> line will prompt for the gap parameters UNLESS the -auto argument has
>> been used. ?i.e. Without -auto, it makes sense to insist on the gap
>> parameters being included, which is what the current wrapper does.
>> However, if we add support for -auto, then these parameters can be
>> optional. ?We could handle this in the wrapper, but it would be messy
>> (and there may be similar questions with other EMBOSS tools). ?What do
>> you think - stick with the simple option of insisting the Biopython
>> user set the gap parameters, even if they are using -auto?
>
> I think we should stick with the simple option. These were meant to
> be pretty dumb specifiers that help users write more modular code than
> simply pasting in a raw string for the command line. Trying to get
> too fancy is probably overkill.
Agreed.
Peter
From biopython at maubp.freeserve.co.uk Mon Apr 13 14:19:54 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 13 Apr 2009 15:19:54 +0100
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <20090413133539.GD5429@sobchak.mgh.harvard.edu>
References: <20090408124908.GN43636@sobchak.mgh.harvard.edu>
<830379.9837.qm@web62402.mail.re1.yahoo.com>
<320fb6e00904130516x25bf24d8ib76ec36171e93fd8@mail.gmail.com>
<20090413133539.GD5429@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00904130719l5b31f09eo39e0f867fbd6749@mail.gmail.com>
> Okay, so that is the top level view. I will try to hit some of the
> specifics:
>
>> Why are the functions _gff_line_map() and _gff_line_reduce() private
>> (leading underscores)? ?I had thought you wanted to make the
>> map/reduce approach available to people trying to parse GFF files on
>> multiple threads (e.g. using disco) which would require them to use
>> these two functions, wouldn't it? ?If so, they should be part of the
>> public API.
>
> I don't think a standard user would want to deal with these
> directly. They just parse lines into their components and build an
> intermediate dictionary object. To parallelize the job, the
> GFFMapReduceFeatureAdder class has a 'disco_host' parameter which
> then runs the job in parallel.
Are you aware of any alternatives to disco for doing map/reduce on
Python, and does that impact your design choices?
>> I don't see any support for the optional FASTA block in a GFF file.
>> Is this something you intend to add later Brad? ?See also my thoughts
>> below for Bio.SeqIO integration.
>
> I haven't added anything for parsing header and footer directives but
> it is on the to do list and I have a good idea how to handle them. Definitely
> pass along a file that uses these you want to parse and we can work on it.
There are some partial examples here:
http://www.sequenceontology.org/gff3.shtml
We should have a peep at BioPerl's unit tests and/or ask Lincoln directly.
>> > I have a couple of suggestions of how to make the GFF parser more
>> > generally usable, and more consistent with other parsers in Biopython.
> [...]
>> > It's not clear to me why we need an iterator for GFF files. Can't we
>> > just use Python's line iterator instead? I would expect code like this:
>> >
>> > from Bio import GFF
>> > handle = open("my_gff_file.gff")
>> > for line in handle:
>> > ? ?# call the appropriate GFF function on the line
>
> Right, so this was tackled in the top level overview above. Michiel,
> does the design make more sense now?
>
>> > The second point is about GFFAddingIterator.get_all_features. If this
>> > is essentially analogous to SeqIO.to_dict, how about calling it GFF.to_dict?
>> > Then the code looks as follows:
>> >
>> > from Bio import GFF
>> > handle = open("my_gff_file.gff")
>> > rec_dict = GFF.to_dict(handle)
>
> Yes, except in the more common cases you are adding to a dictionary
> of records as opposed to generating one from scratch. My thought was
> that copying the SeqIO behavior made it more confusing because it
> doesn't do quite the same thing. After my explanation, what are your
> thoughts?
Maybe there is a role for a to_dict() function for when you start from
scratch, but as you say, it does sound like there is a general need to
add to an existing dict.
>> > Another thing to consider is that IDs in the GFF file do not need to be unique.
>> > For example, consider a GFF file that stores genome mapping locations for
>> > short sequences stored in a Fasta file. Since each sequence can have more
>> > than one mapping location, we can have multiple lines in the GFF file for one
>> > sequence ID.
>
> Yes, this goes back to my explanation above and is why the
> parser works differently than the standard SeqIO parsers. GFF ends
> up being a different beast. I think it makes sense to copy useful
> patterns we have already, but don't want to confuse users with close
> by not the same functionality.
>
>> > The last point is about storing SeqRecords in rec_dict. A GFF file typically
>> > does not store sequences; if it does, it's not clear which field in the GFF file
>> > does. On the other hand, a SeqRecord often does not contain the
>> > chromosomal location, which is what the GFF file stores. So why use a
>> > SeqRecord for GFF information?
>
> Hopefully the SeqRecords make more sense now. What it is really doing is
> adding SeqFeatures to SeqRecords. When the user doesn't provide one,
> it creates an empty SeqRecord with the appropriate ID to use and
> adds SeqFeatures to it.
>
>> If you look at the NCBI FTP site, they often provide genome sequences
>> in a range of file formats including GenBank and GFF.
>> [...]
>> Their GFF3 file only contains the features:
>> ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.gff
>>
>> Some GFF files will include the sequence too, in this case we can
>> fetch it in FASTA format:
>> ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.fna
>
> Right on. So you would first parse the Fasta file with the SeqIO
> parser to_dict functionality, and then feed this dictionary to the
> GFF parser to add the features.
Hmm. I'm with you on the idea that you may need to parse a GFF file
and a separate second file to get the actual sequence (e.g. a FASTA
file), but there is more than one way to combine the two. For a
single sequence, I was thinking more along the lines of:
from Bio import SeqIO
record = SeqIO.read(open("NC_000913.fna"),"fasta")
record.features = SeqIO.read(open("NC_000913.gff"),"gff3").features
Or, depending on what other annotation you can extract, perhaps the
other way round would be best:
from Bio import SeqIO
record = SeqIO.read(open("NC_000913.gff"),"gff3")
record.seq = SeqIO.read(open("NC_000913.fna"),"fasta").seq
The above is pretty trivial I think, as long as we include examples of
this in our documentation. This kind of manipulation is also file
format neutral - it would work equally well with a FASTA file and a
PTT file (assuming we add parsing NCBI protein tables to Bio.SeqIO as
outlined in my earlier email). Or for another example, perhaps an
annotated GenBank file without the sequence (e.g. just a CONTIG
assembly line) plus a FASTA file for the full nucleotide sequence.
If the FASTA and GFF file apply to multiple sequences (e.g. a set of
contigs, rather than a single chromosome), and you have enough memory,
then something using dictionaries should work:
from Bio import SeqIO
records = SeqIO.to_dict(SeqIO.read(open("NC_000913.fna"),"fasta"))
for temp_rec in SeqIO.parse(open("NC_000913.gff"),"gff3") :
records[temp_rec.id].features = temp_rec.features
or,
from Bio import SeqIO
records = SeqIO.to_dict(SeqIO.read(open("NC_000913.gff"),"gff3"))
for temp_rec in SeqIO.parse(open("NC_000913.fna"),"fasta") :
records[temp_rec.id].seq = temp_rec.seq
(You may need to massage the keys to match up, I'm assuming here that
isn't required).
i.e. It can all be done from Bio.SeqIO without needing to dive into
Bio.GFF unless you need to do something special (e.g. filtering the
features).
>> In principle, you could parse this FASTA file and the GFF3 file and
>> put together a GenBank file - or vice versa.
>
> Yes.
>
>> Similarly, I would want Bio.SeqIO to be able to parse a GFF3 file, and
>> give me a SeqRecord with lots of SeqFeature objects. ?If the sequence
>> is present in the file, it should use that (not the case for these
>> NCBI GFF3 files). ?Otherwise, we wouldn't necessarily know the actual
>> sequence length which we'd need to use the new Bio.Seq.UnknownSeq
>> object. ?However, we can infer from the maximum feature coordinates a
>> minimum sequence length. ?For these NCBI GFF3 files, as there is a
>> source feature this does actually give use the genome length, so this
>> should work very nicely.
>
> Using UnknownSeq is a good idea, and I will do.
Great.
> Whew. Michiel and Peter -- hopefully the high level intentions are a
> bit more clear. Thanks for your input so far; let's hash this out so
> it makes sense to everyone.
Good plan :)
As you can probably tell, I am concentrating on getting this to match
up well with the Bio.SeqIO framework. It will be nice to know the
underlying Bio.GFF module has more options, but I expect most people
to start with reading in a GFF file using Bio.SeqIO, and being able to
transfer their existing knowledge of SeqFeature objects learnt from
using Bio.SeqIO to read in GenBank files.
Peter
From jflatow at gmail.com Mon Apr 13 14:41:56 2009
From: jflatow at gmail.com (Jared Flatow)
Date: Mon, 13 Apr 2009 09:41:56 -0500
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <320fb6e00904130719l5b31f09eo39e0f867fbd6749@mail.gmail.com>
References: <20090408124908.GN43636@sobchak.mgh.harvard.edu>
<830379.9837.qm@web62402.mail.re1.yahoo.com>
<320fb6e00904130516x25bf24d8ib76ec36171e93fd8@mail.gmail.com>
<20090413133539.GD5429@sobchak.mgh.harvard.edu>
<320fb6e00904130719l5b31f09eo39e0f867fbd6749@mail.gmail.com>
Message-ID: <3050CC48-7365-4746-B30C-F56C2ACAA2F8@gmail.com>
FYI:
On Apr 13, 2009, at 9:19 AM, Peter wrote:
> Are you aware of any alternatives to disco for doing map/reduce on
> Python, and does that impact your design choices?
You can use Python map/reduce functions with Hadoop via the Streaming
contrib package included with Hadoop.
An overview: http://docs.google.com/Presentation?id=dgr666gg_31cd4n7qdz
Here is an input reader/record reader for FASTA: http://gist.github.com/45551
jared
From bugzilla-daemon at portal.open-bio.org Mon Apr 13 15:41:29 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Mon, 13 Apr 2009 11:41:29 -0400
Subject: [Biopython-dev] [Bug 2601] Seq find() method: proposal
In-Reply-To:
Message-ID: <200904131541.n3DFfTGN022460@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2601
------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-13 11:41 EST -------
See also Bug 2809, for the much narrower option of adding string-like
startswith and endswith methods to the Seq object (which as proposed would not
deal with ambiguity characters).
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From biopython at maubp.freeserve.co.uk Mon Apr 13 17:55:53 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 13 Apr 2009 18:55:53 +0100
Subject: [Biopython-dev] EFetch returning ASN.1 not genbank format
Message-ID: <320fb6e00904131055r24ac6e42p1da55cfcf173e@mail.gmail.com>
Hi all,
At then end of last week I found test_SeqIO_online.py was failing and
traced this to a change in Entrez EFetch. EFetch is documented here:
http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html
The issue is with EFetch and the undocumented rettype=genbank argument
which we currently use in our documentation and unit tests. This
isn't an "official" argument in that it isn't listed on their website,
but until recently it returned plain text GenBank files, acting like
the official rettype=gb or gp arguments. However, as of the end of
last week, EFtech returns the default format instead (ASN.1), causing
test_SeqIO_online.py to fail and rendering some of our examples
misleading.
I emailed the NCBI and received a very prompt reply,
> Dear Colleague,
>
>?As the e-Utils continue to be refined our developers sometimes
> address one-off issues, and this was one of them. The 'official'
> parameter for GenBank is rettype=gb. Now if the parameter is not
> correct you will default to ASN.1 in the nucleotide databases. We
> apologize for any inconvenience.
>
> Regards,
>
> Steve Pechous, Ph.D.
> NCBI User Services
I then emailed back (before Easter) to ask if they would reconsider
this change, and have just had a reply:
> Hi Peter,
>
> This will likely not reverse back as the true parameters are laid out
> in the help documents and are now required, so to speak.
>
> Regards,
>
> Steve Pechous, Ph.D.
> NCBI User Services
With hindsight we shouldn't have used rettype="genbank", but it did
seem to make things simpler for our documentation and I really hadn't
expected the NCBI to change this.
I think we have two options:
(1) Add a special case to Bio.Entrez.eftech to map rettype="genbank"
to rettype="gb" (or "gp" for the protein database). This is simple
and causes least disruption to Biopython uses, but is a bad idea in
the long run as it means we are effectively providing our own variant
of the Entrez API.
(2) Update our documentation and unit tests to use rettype="gb" or
"gp" instead of rettype="genbank", and add a special case to
Bio.Entrez.eftech to map rettype="genbank" to rettype="gb" (or "gp"
for the protein database) and issue a warning that the NCBI have
changed their API. At a later point we might change this warning to
an error. This would provide a clear transition for end user scripts,
and keep us consistent with the official Entrez API.
I favour option (2) here. Any other thoughts? Whatever we do should
happen before we release Biopython 1.50.
Peter
From biopython at maubp.freeserve.co.uk Mon Apr 13 18:06:25 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 13 Apr 2009 19:06:25 +0100
Subject: [Biopython-dev] Plan for Biopython 1.50 (final)
Message-ID: <320fb6e00904131106s70028e9el56d334fa732bddf8@mail.gmail.com>
On Tue, Mar 31, 2009 at 10:38 PM, Peter wrote:
> Hi all,
>
> OK guys, after a brief chat off the mailing list, I'm hoping to do the
> Biopython 1.50 beta release roughly this weekend, ...
>
> After the release of Biopython 1.50 beta, we'll reopen CVS again for
> small changes and documentation. ?While the beta is being tested by
> our user base, I'd like us to push to finish any missing documentation
> - in particular for new modules Bio.Motif (Bartek) and
> Bio.Graphics.GenomeDiagram (me and/or Leighton), plus the new
> SeqRecord slicing and UnknownSeq class (me).
That documentation still needs doing, and it would be nice to have it
with Biopython 1.50. If Bartek or Leighton expects to add anything in
the next few days, then I'd be happy to hold back the release for
that. I'll try and do the SeqRecord stuff myself shortly.
> Depending on the feedback from the beta, I'd hope we can do the final
> release of Biopython 1.50 well before the end of April, and then
> reopen CVS for new code.
There haven't been any problems with the beta reported, however there
is the issue of EFetch returning ASN.1 not genbank format (see my
earlier email) which I think we must resolve before Biopython 1.50 is
released.
Apart from these two points (documentation and EFetch), are there any
issues regarding doing the official release of Biopython 1.50? I
think we can aim for a release this week...
Peter
From lpritc at scri.ac.uk Tue Apr 14 08:29:14 2009
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Tue, 14 Apr 2009 09:29:14 +0100
Subject: [Biopython-dev] EFetch returning ASN.1 not genbank format
In-Reply-To: <320fb6e00904131055r24ac6e42p1da55cfcf173e@mail.gmail.com>
Message-ID:
On 13/04/2009 18:55, "Peter" wrote:
[...]
> I think we have two options:
>
> (1) Add a special case to Bio.Entrez.eftech to map rettype="genbank"
> to rettype="gb" (or "gp" for the protein database). This is simple
> and causes least disruption to Biopython uses, but is a bad idea in
> the long run as it means we are effectively providing our own variant
> of the Entrez API.
>
> (2) Update our documentation and unit tests to use rettype="gb" or
> "gp" instead of rettype="genbank", and add a special case to
> Bio.Entrez.eftech to map rettype="genbank" to rettype="gb" (or "gp"
> for the protein database) and issue a warning that the NCBI have
> changed their API. At a later point we might change this warning to
> an error. This would provide a clear transition for end user scripts,
> and keep us consistent with the official Entrez API.
>
> I favour option (2) here. Any other thoughts? Whatever we do should
> happen before we release Biopython 1.50.
Option (2). Option (1) risks cementing an argument into place in Biopython
that could potentially contradict future Entrez API usage.
L.
--
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405
______________________________________________________
SCRI, Invergowrie, Dundee, DD2 5DA.
The Scottish Crop Research Institute is a charitable company limited by guarantee.
Registered in Scotland No: SC 29367.
Recognised by the Inland Revenue as a Scottish Charity No: SC 006662.
DISCLAIMER:
This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that
addressee.
If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on
this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system.
Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any).
______________________________________________________
From mjldehoon at yahoo.com Tue Apr 14 08:33:48 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 14 Apr 2009 01:33:48 -0700 (PDT)
Subject: [Biopython-dev] EFetch returning ASN.1 not genbank format
In-Reply-To: <320fb6e00904131055r24ac6e42p1da55cfcf173e@mail.gmail.com>
Message-ID: <273080.33626.qm@web62408.mail.re1.yahoo.com>
I am also in favor of option (2).
--Michiel
> I think we have two options:
>
> (1) Add a special case to Bio.Entrez.eftech to map
> rettype="genbank"
> to rettype="gb" (or "gp" for the
> protein database). This is simple
> and causes least disruption to Biopython uses, but is a bad
> idea in
> the long run as it means we are effectively providing our
> own variant
> of the Entrez API.
>
> (2) Update our documentation and unit tests to use
> rettype="gb" or
> "gp" instead of rettype="genbank", and
> add a special case to
> Bio.Entrez.eftech to map rettype="genbank" to
> rettype="gb" (or "gp"
> for the protein database) and issue a warning that the NCBI
> have
> changed their API. At a later point we might change this
> warning to
> an error. This would provide a clear transition for end
> user scripts,
> and keep us consistent with the official Entrez API.
From bugzilla-daemon at portal.open-bio.org Tue Apr 14 08:51:56 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 14 Apr 2009 04:51:56 -0400
Subject: [Biopython-dev] [Bug 2811] New: EFetch returning ASN.1 not GenBank
format for rettype=genbank
Message-ID:
http://bugzilla.open-bio.org/show_bug.cgi?id=2811
Summary: EFetch returning ASN.1 not GenBank format for
rettype=genbank
Product: Biopython
Version: Not Applicable
Platform: All
OS/Version: All
Status: NEW
Severity: normal
Priority: P2
Component: Main Distribution
AssignedTo: biopython-dev at biopython.org
ReportedBy: biopython-bugzilla at maubp.freeserve.co.uk
At the end of last week I found test_SeqIO_online.py was failing and
traced this to a change in Entrez EFetch. EFetch is documented here:
http://www.ncbi.nlm.nih.gov/entrez/query/static/efetchseq_help.html
The issue is with EFetch and the undocumented rettype=genbank argument
which we currently use in our documentation and unit tests. This
isn't an "official" argument in that it isn't listed on their website,
but until recently it returned plain text GenBank files, acting like
the official rettype=gb or gp arguments. However, as of the end of
last week, EFtech returns the default format instead (ASN.1), causing
test_SeqIO_online.py to fail and rendering some of our examples
misleading.
I emailed the NCBI and received a very prompt reply,
> Dear Colleague,
>
> As the e-Utils continue to be refined our developers sometimes
> address one-off issues, and this was one of them. The 'official'
> parameter for GenBank is rettype=gb. Now if the parameter is not
> correct you will default to ASN.1 in the nucleotide databases. We
> apologize for any inconvenience.
>
> Regards,
>
> Steve Pechous, Ph.D.
> NCBI User Services
I then emailed back (before Easter) to ask if they would reconsider
this change, and have just had a reply:
> Hi Peter,
>
> This will likely not reverse back as the true parameters are laid out
> in the help documents and are now required, so to speak.
>
> Regards,
>
> Steve Pechous, Ph.D.
> NCBI User Services
With hindsight we shouldn't have used rettype="genbank", but it did
seem to make things simpler for our documentation and I really hadn't
expected the NCBI to change this.
After discussion on the mailing list, the plan is to update our documentation
and unit tests to use rettype="gb" or "gp" instead of rettype="genbank", and
add a special case to Bio.Entrez.eftech to map rettype="genbank" to
rettype="gb" (or "gp" for the protein database) and issue a warning that the
NCBI have changed their API. At a later point we might change this warning to
an error. This would provide a clear transition for end user scripts, and keep
us consistent with the official Entrez API.
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From biopython at maubp.freeserve.co.uk Tue Apr 14 08:53:02 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 14 Apr 2009 09:53:02 +0100
Subject: [Biopython-dev] EFetch returning ASN.1 not genbank format
In-Reply-To: <273080.33626.qm@web62408.mail.re1.yahoo.com>
References: <320fb6e00904131055r24ac6e42p1da55cfcf173e@mail.gmail.com>
<273080.33626.qm@web62408.mail.re1.yahoo.com>
Message-ID: <320fb6e00904140153w4c659655q64f19540f7bd12b7@mail.gmail.com>
On Tue, Apr 14, 2009 at 9:33 AM, Michiel de Hoon wrote:
>
> I am also in favor of option (2).
>
> --Michiel
>
OK. Let's do that then. I've filed Bug 2811 for this issue,
http://bugzilla.open-bio.org/show_bug.cgi?id=2811
Peter
From bugzilla-daemon at portal.open-bio.org Tue Apr 14 09:54:23 2009
From: bugzilla-daemon at portal.open-bio.org (bugzilla-daemon at portal.open-bio.org)
Date: Tue, 14 Apr 2009 05:54:23 -0400
Subject: [Biopython-dev] [Bug 2811] EFetch returning ASN.1 not GenBank
format for rettype=genbank
In-Reply-To:
Message-ID: <200904140954.n3E9sND0024084@portal.open-bio.org>
http://bugzilla.open-bio.org/show_bug.cgi?id=2811
------- Comment #1 from biopython-bugzilla at maubp.freeserve.co.uk 2009-04-14 05:54 EST -------
Tutorial updated, see Doc/Tutorial.tex revision 1.221
--
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
From mjldehoon at yahoo.com Tue Apr 14 10:36:03 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 14 Apr 2009 03:36:03 -0700 (PDT)
Subject: [Biopython-dev] Bio.GFF and Brad's code
In-Reply-To: <20090413133539.GD5429@sobchak.mgh.harvard.edu>
Message-ID: <322143.67385.qm@web62403.mail.re1.yahoo.com>
--- On Mon, 4/13/09, Brad Chapman wrote:
> A normal use case would be:
>
> - Use SeqIO to parse a FASTA file with the sequences =>
> SeqRecords
> - Use the GFFParser to add features from a separate GFF
> file to the SeqRecords. These are SeqFeatures, added to
> the right records and nested in a parent/child relationship
> as appropriate.
Usually, when I use a GFF file I either don't have an associated Fasta file, or I am not particularly interested in the original sequences. So while this approach is useful for some people, in its current form it's not exactly generally usable.
First, let's discuss how to represent the information contained in a GFF file. SeqRecords are good if the GFF file is associated with a Fasta file (or contains the sequence itself), but if not it seems to be a bit awkward. How about the following (and I think Peter was hinting at the same idea):
The actual parser lives in Bio.GFF, and produces Bio.GFF.Record objects that closely resemble the GFF file structure. For example, we use the GFF specified fields (