From chapmanb at arches.uga.edu  Sat Sep  1 21:09:35 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:43:03 2005
Subject: [Biopython-dev] next release imminent
In-Reply-To: <p05101000b7b5bcb25eb1@[171.65.33.250]>
References: <p05101000b7b5bcb25eb1@[171.65.33.250]>
Message-ID: <20010901210935.A8873@ci350185-a.athen1.ga.home.com>

Hi Jeff!

> If nobody has any rejections, I'm going to put together the next 
> release this weekend.  Please let me know if I should hold off...

Sorry to be so slow in getting back to your previous message. It's been a
crazy week at lab. I definitely think we should get a new release
together, especially to take care of the bugs and things. Let me know
when you end up getting it together, and I can go ahead and do the
Windows installers and everything for it.

Thanks for getting this together!
Brad

From katel at worldpath.net  Sun Sep  2 01:38:37 2001
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:43:03 2005
Subject: [Biopython-dev] next release imminent
References: <p05101000b7b5bcb25eb1@[171.65.33.250]>
Message-ID: <003301c13371$84f5b220$010a0a0a@cadence.com>

> If nobody has any rejections, I'm going to put together the next
> release this weekend.  Please let me know if I should hold off...
>
  The MetaTool parser won't be ready until the release after this
weekend's.  I'm at least a week away.
But I could check old code for "fossils" like a reference to GenBank in an
Interpro parser.

                                       Cayte


From jchang at SMI.Stanford.EDU  Sun Sep  2 00:38:26 2001
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:43:03 2005
Subject: [Biopython-dev] next release imminent
In-Reply-To: <003301c13371$84f5b220$010a0a0a@cadence.com>
References: <p05101000b7b5bcb25eb1@[171.65.33.250]>
 <003301c13371$84f5b220$010a0a0a@cadence.com>
Message-ID: <p05101003b7b76805701c@[192.168.0.4]>

At 10:38 PM -0700 9/1/01, Cayte wrote:
>> If nobody has any rejections, I'm going to put together the next
>> release this weekend.  Please let me know if I should hold off...
>>
>   The MetaTool parser won't be ready until the release after this
>weekend's.  I'm at least a week away.
>But I could check old code for "fossils" like a reference to GenBank in an
>Interpro parser.
>
>                                        Cayte


Thanks.  It sounds like MetaTool will have to wait for one more
release.  I'll start putting this release together on Monday.  Please
let me know if you find any showstoppers.

Jeff

From xgtl at eth.net  Tue Sep  4 07:45:26 2001
From: xgtl at eth.net (G. DEEPAK REDDY)
Date: Sat Mar  5 14:43:04 2005
Subject: [Biopython-dev] Bioinformatics & Molecular Modeling Training
Message-ID: <000b01c13537$18edde00$b1bf09ca@xgt15>

Dear Members,
 
We have recently started a training division in Bioinformatics &
Molecular Modeling.  We are looking for feedback from experts about
including Biopython as part of the curriculum in the course.  Please
send your suggestions as to what topics, exercises and applications
should be included.
 
Regards
 
Jupudi Srinivas
Director-Technical,
Xpert Global Tech Limited,
INDIA
Jupudi@xpertglobaltech.com
http://www.xpertglobaltech.com
 
From Y.Benita at pharm.uu.nl  Wed Sep  5 05:28:00 2001
From: Y.Benita at pharm.uu.nl (Yair Benita)
Date: Sat Mar  5 14:43:04 2005
Subject: [Biopython-dev] Biopython 1.00a3 for the Mac
Message-ID: <B7BBBCC0.11A9%Y.Benita@pharm.uu.nl>

Hi guys,
I have compiled the new release for the Mac.
Please download it from "http://homepage.mac.com/ybenita" and post it on the
website.

All tests pass except:

======================================================================
FAIL: test_SubsMat
----------------------------------------------------------------------
Traceback (most recent call last):
  File "Yair's G4:Desktop Folder:biopython-1.00a3:Tests:run_tests.py", line 153, in runTest
    expected_handle)
  File "Yair's G4:Desktop Folder:biopython-1.00a3:Tests:run_tests.py", line 247, in compare_output
    assert expected_line == output_line, \
AssertionError: 
Output  : 'H 0.003 0.000 0.003 0.002 0.002 0.003 0.003\n'
Expected: 'H 0.003 0.001 0.003 0.002 0.002 0.003 0.003\n'
----------------------------------------------------------------------

Yair
-- 
Yair Benita
Pharmaceutical Proteomics
Utrecht University


From chapmanb at arches.uga.edu  Wed Sep  5 07:53:20 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:43:04 2005
Subject: [Biopython-dev] Biopython 1.00a3 for the Mac
In-Reply-To: <B7BBBCC0.11A9%Y.Benita@pharm.uu.nl>
Message-ID: <Pine.A41.4.10.10109050751300.40856-100000@archa15.cc.uga.edu>

Hi Yair!

> I have compiled the new release for the Mac.
> Please download it from "http://homepage.mac.com/ybenita" and post it on the
> website.

Sweet! Thanks for doing this. I've put it up on the Download page.

> All test pass except:
> 
> ======================================================================
> FAIL: test_SubsMat
> ----------------------------------------------------------------------

Great, I'm very happy most stuff is passing! Only test_SubsMat fails on
Windows as well (grrrr, we really are going to have to do something about
that test!), so we are definitely doing well on cross-platform this time.
Great to hear!
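
The failing line differs only in the last printed digit (0.000 vs. 0.001),
which looks like platform-dependent rounding. One possible way to make such a
comparison less brittle is to compare numeric fields within a tolerance
instead of comparing whole lines as strings. The sketch below is only an
illustration: the function name lines_match and the default tolerance are
assumptions, not part of run_tests.py.

```python
def lines_match(expected_line, output_line, tolerance=0.0015):
    """Compare two whitespace-separated lines, allowing small numeric drift.

    The tolerance is an assumed value: slightly more than one unit in the
    last printed decimal place of the test_SubsMat output.
    """
    expected_fields = expected_line.split()
    output_fields = output_line.split()
    if len(expected_fields) != len(output_fields):
        return False
    for expected, actual in zip(expected_fields, output_fields):
        try:
            # Numeric fields are compared within the tolerance.
            if abs(float(expected) - float(actual)) > tolerance:
                return False
        except ValueError:
            # Non-numeric fields (such as the row label 'H') must match exactly.
            if expected != actual:
                return False
    return True
```

With this helper, the two lines from the traceback above would be treated as
equal, while a changed row label or a larger numeric difference would still
fail.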

Brad


From johann at egenetics.com  Wed Sep  5 09:46:55 2001
From: johann at egenetics.com (Johann Visagie)
Date: Sat Mar  5 14:43:04 2005
Subject: [Biopython-dev] Re: [BioPython] Biopython 1.00a3 release now available
In-Reply-To: <p05101000b7bad0e686c9@[171.65.33.250]>; from jchang@SMI.Stanford.EDU on Tue, Sep 04, 2001 at 11:45:13AM -0700
References: <p05101000b7bad0e686c9@[171.65.33.250]>
Message-ID: <20010905154655.E57556@fling.sanbi.ac.za>

Jeffrey Chang on 2001-09-04 (Tue) at 11:45:13 -0700:
> 
> A new release of Biopython is now available.

Cool.  :-)

A thought:  Shouldn't these announcements be cross-posted to
python-announce-list@python.org, a.k.a. comp.lang.python.announce?  :-)

-- V

From thomas at cbs.dtu.dk  Wed Sep  5 14:45:40 2001
From: thomas at cbs.dtu.dk (thomas@cbs.dtu.dk)
Date: Sat Mar  5 14:43:04 2005
Subject: [Biopython-dev] sequence format readers ?
Message-ID: <15254.29396.129922.274263@genome.cbs.dtu.dk>

Hej,

To follow up one of the discussions and questions at ISMB in Copenhagen,
- how are we going to proceed with the sequence format reader (the
biopython variant of readseq ...)

Currently we only have parsers for Fasta, Embl and GenBank.  What we
need is an internal format and functions/modules which can read/write:
Fasta
Embl
GenBank
GCG
Phylip
PIR
MSF
Nexus
Clustal
Mase
??? - more suggestions ?

I can write most of the rules, but I guess we have to define a smart base
class/parser - where plugging in a new format should only take 5 seconds ...
If we brain storm on the design of the reader/writer, I could volunteer to
implement the format rules ...

Some things to consider:
* some formats are alignment based (e.g. clustal, phylip, nexus)
* some formats have loads of information which is lost when converted to a
  less info-rich format (e.g. Embl -> Fasta). But Embl -> GenBank should
  not lose any information
* some formats allow multiple entries, some not


back-in-the-sequence-format-jungle'ly yr's
-thomas


-- 
Sicheritz-Ponten Thomas, Ph.D  CBS, Department of Biotechnology
thomas@biopython.org           The Technical University of Denmark
CBS:  +45 45 252489            Building 208, DK-2800 Lyngby
Fax   +45 45 931585            http://www.cbs.dtu.dk/thomas

	De Chelonian Mobile ... The Turtle Moves ...


From reillywu at yahoo.com  Wed Sep  5 15:50:25 2001
From: reillywu at yahoo.com (Chunlei Wu)
Date: Sat Mar  5 14:43:04 2005
Subject: [Biopython-dev] localblast bug?
Message-ID: <20010905195025.67630.qmail@web20503.mail.yahoo.com>

Hi,
   I wrote a script for localblast. It always raised a
TypeError:

   File "e:\python21\Bio\Blast\NCBIStandalone.py", line 1447, in blastall
   r, w, e = popen2.popen3([blastcmd] + params)
   File "e:\python21\lib\popen2.py", line 129, in popen3
   w, r, e = os.popen3(cmd, mode, bufsize)
   TypeError: popen3() argument 1 must be string, not list

   When I modified line 1447 as:

   r, w, e = popen2.popen3(' '.join([blastcmd] + params))

   then it works.


   Chunlei Wu

Python version: Activepython build 210
Biopython version: 1.00a3
OS:       WinNT
source:

def mylocalblast(input_file, output_file, db='nt'):
    """Run a local blastn search and save the raw report to output_file."""

    from Bio.Blast import NCBIStandalone

    my_blast_db = "r:\\blastdb\\" + db
    my_blast_exe = r"r:\localblast\blastall.exe"

    # blast_out is a file-like handle for the report (it is read below)
    blast_out, error_info = NCBIStandalone.blastall(
        my_blast_exe, 'blastn', my_blast_db, input_file)

    output_f = open(output_file, 'w')
    blast_result = blast_out.read()
    output_f.write(blast_result)
    print blast_result
    output_f.close()



From chapmanb at arches.uga.edu  Wed Sep  5 17:27:45 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:43:04 2005
Subject: [Biopython-dev] localblast bug?
In-Reply-To: <20010905195025.67630.qmail@web20503.mail.yahoo.com>
References: <20010905195025.67630.qmail@web20503.mail.yahoo.com>
Message-ID: <20010905172745.A4632@ci350185-a.athen1.ga.home.com>

Hi Chunlei;

>    I wrote a script for localblast. It always raised a
> TypeError:
[...] 
>    When I modified line 1447 as:
> 
>    r, w, e = popen2.popen3(' '.join([blastcmd] +
> params))
> 
>    then it works.

Thanks for the fix. I think you're probably the first to use the
localblast module on Windows, so you get to run into the
platform-specific problems (aren't you lucky :-). Your fix works fine for me
on UNIX as well (with the Doc/examples/local_blast.py script), so I
checked your change into CVS. It is available from anonymous CVS and
should be in the next release.
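
For readers on current Python versions: the popen2 module was removed in
Python 3, and joining arguments with spaces can break when a path contains a
space. A rough modern equivalent, sketched here under assumptions, is
subprocess.Popen, which accepts the command as a list on both UNIX and
Windows. The helper name run_command is hypothetical, and the Python
interpreter stands in for blastall so the example can run anywhere.

```python
import subprocess
import sys


def run_command(argv):
    """Run a program given as a list of arguments; return (stdout, stderr)."""
    process = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # return text instead of bytes
    )
    out, err = process.communicate()
    return out, err


# Stand-in for the blastall command line: run the current Python interpreter.
out, err = run_command([sys.executable, "-c", "print('hello from child')"])
```

Because each argument is passed as its own list element, no manual quoting
or joining is needed.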

Thanks again!
Brad

From chapmanb at arches.uga.edu  Wed Sep  5 17:46:00 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:43:04 2005
Subject: [Biopython-dev] sequence format readers ?
In-Reply-To: <15254.29396.129922.274263@genome.cbs.dtu.dk>
References: <15254.29396.129922.274263@genome.cbs.dtu.dk>
Message-ID: <20010905174600.A4766@ci350185-a.athen1.ga.home.com>

Hi Thomas!

> To follow up one of the discussions and questions at ISMB in Copenhagen,
> - how are we going to proceed with the sequence format reader (the
> biopython variant of readseq ...)

It's great that you're going to work on this! It's definitely much
desired by a lot o' people (in fact I was just having a conversation
today about format conversion).

> Currently we only have parsers for Fasta, Embl and GenBank.  What we
> need is an internal format and functions/modules which can read/write:
[...impressive list o' formats...]
> ??? - more suggestions ?

I think supporting this many would be an *excellent* start :-).

> I can write most of the rules, but I guess we have to define a smart base
> class/parser - where plugging in a new format should only take 5 seconds ...
> If we brain storm on the design of the reader/writer, I could volunteer to
> implement the format rules ...
> 
> Some things to consider:
> * some formats are alignment based (e.g. clustal, phylip, nexus)
> * some formats have loads of information which is lost when converted to a
>   less info-rich format (e.g. Embl -> Fasta). But Embl -> GenBank should
>   not lose any information
> * some formats allow multiple entries, some not

Just as a way of getting things started (I haven't done a lot of
thinking about this), my opinion is that the best way to do this is
to have a SeqIO system kinda like Bioperl. The inputs into the SeqIO
system would be the standard SeqRecord object that we currently
have. The advantage of this is that existing parsers (i.e., Fasta,
GenBank) already parse into this, so all that would need to be done
is to define a mapping that converts a generic SeqRecord object to
and from the format's "native" Record-based representation. So to
convert from GenBank to Fasta you could do:

GenBank Record Format --> SeqRecord --> Fasta Record Format 

Since the Record formats already provide writing capabilities (and
we have the parsers to parse into them) we would already get writing
and parsing "for free." Also, we would make good use of our existing
"generic" Sequence representations.

The advantage of this is that it would help us avoid having to make
a billion GenBank -> Fasta, GenBank -> EMBL, GenBank -> whatever
specific converters. The disadvantage of this is that we may lose
some information in the conversion process (but then again, what
converters don't :-).

The tricky part of doing it this way is that we would then need to
define the Record --> SeqRecord mapping, which, as you mention,
may take some thinking for alignment formats and other
complications.
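
To make the idea concrete, here is a minimal sketch of converting through one
generic record. With N formats this needs N readers and N writers rather
than N*(N-1) direct converters. Everything in it is an assumption made for
illustration: a plain dict stands in for SeqRecord, the "tab" format is made
up, and none of the function names exist in Biopython.

```python
# Hypothetical minimal stand-in for SeqRecord: a dict with 'id' and 'seq'.

def read_fasta(text):
    """Parse a single-entry FASTA string into the generic record."""
    lines = text.strip().splitlines()
    return {"id": lines[0][1:].split()[0], "seq": "".join(lines[1:])}


def write_fasta(record):
    """Write the generic record as a single-entry FASTA string."""
    return ">%s\n%s\n" % (record["id"], record["seq"])


def write_tab(record):
    """Write a simple 'id<TAB>sequence' line (illustrative format only)."""
    return "%s\t%s\n" % (record["id"], record["seq"])


READERS = {"fasta": read_fasta}
WRITERS = {"fasta": write_fasta, "tab": write_tab}


def convert(text, in_format, out_format):
    """Convert by going through the generic record."""
    record = READERS[in_format](text)
    return WRITERS[out_format](record)
```

Adding a format then means adding one entry to READERS and one to WRITERS;
the convert function itself does not change.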

Hopefully-rambling-on-and-on-about-this-helps-a-little-bit-ly yr's,

Brad


From jchang at SMI.Stanford.EDU  Wed Sep  5 19:08:49 2001
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:43:04 2005
Subject: [Biopython-dev] Re: [BioPython] Biopython 1.00a3 release now available
In-Reply-To: <20010905154655.E57556@fling.sanbi.ac.za>
References: <p05101000b7bad0e686c9@[171.65.33.250]>
 <20010905154655.E57556@fling.sanbi.ac.za>
Message-ID: <p05101000b7bc60e142f6@[171.65.33.250]>

At 3:46 PM +0200 9/5/01, Johann Visagie wrote:
>Jeffrey Chang on 2001-09-04 (Tue) at 11:45:13 -0700:
>>
>>  A new release of Biopython is now available.
>
>Cool.  :-)
>
>A thought:  Shouldn't these announcements be cross-posted to
>python-announce-list@python.org, a.k.a. comp.lang.python.announce?  :-)

Yes.  Next time.  :)

Thanks,
Jeff

From thomas at cbs.dtu.dk  Thu Sep  6 05:22:01 2001
From: thomas at cbs.dtu.dk (Thomas Sicheritz-Ponten)
Date: Sat Mar  5 14:43:04 2005
Subject: [Biopython-dev] sequence format readers ?
In-Reply-To: Brad Chapman's message of "Wed, 5 Sep 2001 17:46:00 -0400"
References: <15254.29396.129922.274263@genome.cbs.dtu.dk>
	<20010905174600.A4766@ci350185-a.athen1.ga.home.com>
Message-ID: <y9vheug4qti.fsf@genome.cbs.dtu.dk>

Brad Chapman <chapmanb@arches.uga.edu> writes:

> Hi Thomas!
> 
> > To follow up one of the discussions and questions at ISMB in Copenhagen,
> > - how are we going to proceed with the sequence format reader (the
> > biopython variant of readseq ...)
> 
> It's great that you're going to work on this! It's definitely much
> desired by a lot o' people (in fact I was just having a conversation
> today about format conversion).
> 
> > Currently we only have parsers for Fasta, Embl and GenBank.  What we
> > need is an internal format and functions/modules which can read/write:
> [...impressive list o' formats...]
> > ??? - more suggestions ?
> 
> I think supporting this many would be an *excellent* start :-).
> 
> > I can write most of the rules, but I guess we have to define a smart base
> > class/parser - where plugging in a new format should only take 5 seconds ...
> > If we brain storm on the design of the reader/writer, I could volunteer to
> > implement the format rules ...
> > 
> > Some things to consider:
> > * some formats are alignment based (e.g. clustal, phylip, nexus)
> > * some formats have loads of information which is lost when converted to a
> >   less info-rich format (e.g. Embl -> Fasta). But Embl -> GenBank should
> >   not lose any information
> > * some formats allow multiple entries, some not
> 
> Just as a way of getting things started (I haven't done a lot of
> thinking about this), my opinion is that the best way to do this is
> to have a SeqIO system kinda like Bioperl. The inputs into the SeqIO
> system would be the standard SeqRecord object that we currently
> have. The advantage of this is that existing parsers (i.e., Fasta,
> GenBank) already parse into this, so all that would need to be done
> is to define a mapping that converts a generic SeqRecord object to
> and from the format's "native" Record-based representation. So to
> convert from GenBank to Fasta you could do:
> 
> GenBank Record Format --> SeqRecord --> Fasta Record Format 
> 
> Since the Record formats already provide writing capabilities (and
> we have the parsers to parse into them) we would already get writing
> and parsing "for free." Also, we would make good use of our existing
> "generic" Sequence representations.
> 
> The advantage of this is that it would help us avoid having to make
> a billion GenBank -> Fasta, GenBank -> EMBL, GenBank -> whatever
> specific converters. The disadvantage of this is that we may lose
> some information in the conversion process (but then again, what
> converters don't :-).

I think deriving a SeqIOSeq class from the Seq object is enough.
We just need to add a single dictionary (features) where all
Swiss/EMBL/GenBank extra annotations can be added.

e.g.
class SeqIOSeq(Seq):
    def __init__(self, data=""):
        # Seq.__init__ needs the sequence data
        Seq.__init__(self, data)
        # dictionary for extra annotations (e.g. Embl, GenBank)
        self.features = {}


In the case of
GenBank Record Format --> SeqIOSeq --> Fasta Record Format
we pick only the name and sequence ...

but for
GenBank Record Format --> SeqIOSeq --> EMBL Record Format
the writer function should check if there are any additional features
(self.features.keys()).
That way we shouldn't lose any information.
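
A small sketch of that difference between writers, under assumptions: the
SimpleRecord class is a hypothetical stand-in for SeqIOSeq, and the annotated
output only imitates the look of an EMBL entry; it is not real EMBL syntax.

```python
class SimpleRecord:
    """Hypothetical stand-in for the proposed SeqIOSeq."""

    def __init__(self, name, sequence, features=None):
        self.name = name
        self.sequence = sequence
        self.features = features or {}


def write_fasta(record):
    # FASTA has no place for annotations, so features are dropped.
    return ">%s\n%s\n" % (record.name, record.sequence)


def write_annotated(record):
    """Write an EMBL-like flat layout (illustrative, not real EMBL syntax)."""
    lines = ["ID   %s" % record.name]
    # Every extra annotation in the features dictionary is written out.
    for key in sorted(record.features.keys()):
        lines.append("FT   %s=%s" % (key, record.features[key]))
    lines.append("SQ   %s" % record.sequence)
    lines.append("//")
    return "\n".join(lines) + "\n"
```

The Fasta writer ignores the features dictionary, while the annotated writer
emits every key it finds, so nothing stored there is lost.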


It would be nice if a new format can be added by simply adding functions
for reading, writing and recognizing the format.
I am not completely sure how to define these functions - any ideas ?

example code ...

from Bio.Seq import Seq
NO, YES = 0,1

class SeqIOSeq(Seq):
    def __init__(self, data=""):
        # Seq.__init__ needs the sequence data
        Seq.__init__(self, data)
        # dictionary for extra annotations (e.g. Embl, GenBank)
        self.features = {}


class SeqIO:
    # dictionaries to store functions for
    # recognizing, reading and writing of different sequence formats
    # (class attributes, so registered formats are shared by all instances)
    recognizers = {}
    readers = {}
    writers = {}

    def __init__(self, **kwds):
        self.name = None
        self.format = None
        self.sequence = SeqIOSeq()
        self.is_an_alignment = NO
        self.allow_multiple_entries = YES
        # kwds is a dict, so iterate over its (key, value) pairs
        for k, v in kwds.items(): setattr(self, k, v)

    def AddFormat(self, name, recognizeF, readF, writeF):
        self.recognizers[name] = recognizeF
        self.readers[name] = readF
        self.writers[name] = writeF
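
As a possible answer to the "how to define these functions" question, here is
a self-contained sketch of a registry where each format supplies a
recognizer, a reader and a writer. All names (FormatRegistry, add_format,
guess_format, is_fasta) are hypothetical, the record is a plain
(name, sequence) tuple, and the Fasta functions handle single entries only.

```python
class FormatRegistry:
    """Hypothetical registry keyed by format name."""

    def __init__(self):
        self.recognizers = {}
        self.readers = {}
        self.writers = {}

    def add_format(self, name, recognize, read, write):
        self.recognizers[name] = recognize
        self.readers[name] = read
        self.writers[name] = write

    def guess_format(self, text):
        """Return the first registered format whose recognizer accepts text."""
        for name in sorted(self.recognizers):
            if self.recognizers[name](text):
                return name
        return None

    def read(self, text):
        name = self.guess_format(text)
        if name is None:
            raise ValueError("unrecognized sequence format")
        return self.readers[name](text)


def is_fasta(text):
    # A FASTA entry starts with a '>' title line.
    return text.startswith(">")


def read_fasta(text):
    lines = text.strip().splitlines()
    return (lines[0][1:], "".join(lines[1:]))


def write_fasta(record):
    name, sequence = record
    return ">%s\n%s\n" % (name, sequence)


registry = FormatRegistry()
registry.add_format("fasta", is_fasta, read_fasta, write_fasta)
```

Unlike the class-level dictionaries above, this sketch keeps the tables on
the instance, so two registries do not share formats; which behaviour is
preferable is a design choice left open here.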
        

needing-a-machete-for-the-sequence-format-jungle'ly yr's
-thomas

-- 
Sicheritz-Ponten Thomas, Ph.D  CBS, Department of Biotechnology
thomas@biopython.org           The Technical University of Denmark
CBS:  +45 45 252489            Building 208, DK-2800 Lyngby
Fax   +45 45 931585            http://www.cbs.dtu.dk/thomas

	De Chelonian Mobile ... The Turtle Moves ...

From johann at egenetics.com  Thu Sep  6 08:49:29 2001
From: johann at egenetics.com (Johann Visagie)
Date: Sat Mar  5 14:43:04 2005
Subject: [Biopython-dev] Biopython 1.00a3 for the Mac
In-Reply-To: <B7BBBCC0.11A9%Y.Benita@pharm.uu.nl>; from Y.Benita@pharm.uu.nl on Wed, Sep 05, 2001 at 11:28:00AM +0200
References: <B7BBBCC0.11A9%Y.Benita@pharm.uu.nl>
Message-ID: <20010906144929.F35666@fling.sanbi.ac.za>

Yair Benita on 2001-09-05 (Wed) at 11:28:00 +0200:
> 
> I have compiled the new release for the Mac.

FreeBSD port has also just been updated:
  http://www.freebsd.org/cgi/cvsweb.cgi/ports/biology/py-biopython/

Pre-built package (minus CORBA) should appear here in a couple of days:
  ftp://ftp.freebsd.org/pub/FreeBSD/ports/i386/packages-stable/All/py-biopython-1.00.a3.tgz

Unfortunately, this all comes a day or two too late to make it into
4.4-RELEASE, and hence onto the distribution CDs.  :-(

-- V

From pewilkinson at informaxinc.com  Thu Sep  6 19:14:21 2001
From: pewilkinson at informaxinc.com (Peter Wilkinson)
Date: Sat Mar  5 14:43:04 2005
Subject: [Biopython-dev] RE:Sequence format readers
In-Reply-To: <200109061602.f86G29B28847@pw600a.bioperl.org>
Message-ID: <005001c13729$aa920e50$f70210ac@l001696w00>

I have almost completed code for reading in RefSeq data. I have finished
classes (1st draft, but they function well) for the smaller organisms, and now
I am moving on to the Human records ....

Also, we need a parser for Derwent data, which should inherit from EMBL,
since its formatting is EMBL-like.

Next also is the expression data from different manufacturers ....

There are piles more, I am sure.

Peter Wilkinson

P.S. I am sitting on code for specific FASTA-formatted types .... how about
that?


> -----Original Message-----
> From: biopython-dev-admin@biopython.org
> [mailto:biopython-dev-admin@biopython.org]On Behalf Of
> biopython-dev-request@biopython.org
> Sent: Thursday, September 06, 2001 10:02 AM
> To: biopython-dev@biopython.org
> Subject: Biopython-dev digest, Vol 1 #207 - 7 msgs
>
>
> Send Biopython-dev mailing list submissions to
> 	biopython-dev@biopython.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> 	http://biopython.org/mailman/listinfo/biopython-dev
> or, via email, send a message with subject or body 'help' to
> 	biopython-dev-request@biopython.org
>
> You can reach the person managing the list at
> 	biopython-dev-admin@biopython.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Biopython-dev digest..."
>
>
> Today's Topics:
>
>    1. sequence format readers ? (thomas@cbs.dtu.dk)
>    2. localblast bug? (Chunlei Wu)
>    3. Re: localblast bug? (Brad Chapman)
>    4. Re: sequence format readers ? (Brad Chapman)
>    5. Re: [BioPython] Biopython 1.00a3 release now available
> (Jeffrey Chang)
>    6. Re: sequence format readers ? (Thomas Sicheritz-Ponten)
>    7. Re: Biopython 1.00a3 for the Mac (Johann Visagie)
>
> --__--__--
>
> Message: 1
> Date: Wed, 5 Sep 2001 20:45:40 +0200 (MDT)
> From: thomas@cbs.dtu.dk
> To: biopython-dev@biopython.org
> Reply-To: thomas@cbs.dtu.dk
> Subject: [Biopython-dev] sequence format readers ?
>
> Hej,
>
> To follow up one of the discussions and questions at ISMB in
> Copenhagen,
> - how are we going to proceed with the sequence format reader (the
> biopython variant of readseq ...)
>
> Currently we can only have parsers for Fasta, Embl and
> GenBank.  What we
> need is a internal format and functions/modules which can read/write:
> Fasta
> Embl
> GenBank
> GCG
> Phylip
> PIR
> MSF
> Nexus
> Clustal
> Mase
> ??? - more suggestions ?
>
> I can write most of the rules, but I guess we have to define
> a smart base
> class/parser - where plugging in a new format should only
> take 5 seconds ...
> If we brain storm on the design of the reader/writer, I could
> volunteer to
> implement the format rules ...
>
> Some things to consider:
> * some formats are alignment based (e.g. clustal, phylip, nexus)
> * some formats have loads of information which is lost when
> converted to a
>   lower info-rich format( e.g. Embl -> Fasta). But Embl ->
> GenBank should
>   not lose any information
> * some formats allow multiple entries, some not
>
>
> back-in-the-sequence-format-jungle'ly yr's
> -thomas
>
>
> --
> Sicheritz-Ponten Thomas, Ph.D  CBS, Department of Biotechnology
> thomas@biopython.org           The Technical University of Denmark
> CBS:  +45 45 252489            Building 208, DK-2800 Lyngby
> Fax   +45 45 931585            http://www.cbs.dtu.dk/thomas
>
> 	De Chelonian Mobile ... The Turtle Moves ...
>
>
> --__--__--
>
> Message: 2
> Date: Wed, 5 Sep 2001 12:50:25 -0700 (PDT)
> From: Chunlei Wu <reillywu@yahoo.com>
> To: biopython-dev@biopython.org
> Subject: [Biopython-dev] localblast bug?
>
> Hi,
>    I wrote a script for localblast. It always raised a
> TypeError:
>
>    File "e:\python21\Bio\Blast\NCBIStandalone.py",
> line 1447, in blastall
>    r, w, e = popen2.popen3([blastcmd] + params)
>    File "e:\python21\lib\popen2.py", line 129, in
> popen3
>    w, r, e = os.popen3(cmd, mode, bufsize)
>    TypeError: popen3() argument 1 must be string, not
> list
>
>    When I modified line 1447 as:
>
>    r, w, e = popen2.popen3(' '.join([blastcmd] +
> params))
>
>    then it works.
>
>
>    Chunlei Wu
>
> Python version: Activepython build 210
> Biopython version: 1.00a3
> OS:       WinNT
> source:
>
> def mylocalblast(input_file,output_file,db='nt'):
>     """mylocalblast"""
>
>     from Bio.Blast import NCBIStandalone
>
>     my_blast_db="r:\\blastdb\\"+db
>     my_blast_exe=r"r:\localblast\blastall.exe"
>
>     blast_out, error_info =
> NCBIStandalone.blastall(my_blast_exe,'blastn',my_blast_db,input_file)
>
>     output_f=open(output_file,'w')
>     blast_result=blast_out.read()
>     output_f.write(blast_result)
>     print blast_result
>     output_f.close()
>
>
>
> __________________________________________________
> Do You Yahoo!?
> Get email alerts & NEW webcam video instant messaging with
> Yahoo! Messenger
> http://im.yahoo.com
>
> --__--__--
>
> Message: 3
> Date: Wed, 5 Sep 2001 17:27:45 -0400
> From: Brad Chapman <chapmanb@arches.uga.edu>
> To: Chunlei Wu <reillywu@yahoo.com>
> Cc: biopython-dev@biopython.org
> Subject: Re: [Biopython-dev] localblast bug?
>
> Hi Chunlei;
>
> >    I wrote a script for localblast. It always raised a
> > TypeError:
> [...]
> >    When I modified line 1447 as:
> >
> >    r, w, e = popen2.popen3(' '.join([blastcmd] +
> > params))
> >
> >    then it works.
>
> Thanks for the fix. I think you're probably the first to use the
> localblast module on windows, so you get to run into the platform
> specific problems (aren't you lucky :-). Your fix works fine for me
> on UNIX as well (with the Doc/examples/local_blast.py script), so I
> checked your change into CVS. It is available from anonymous CVS and
> should be in the next release.
>
> Thanks again!
> Brad
>
> --__--__--
>
> Message: 4
> Date: Wed, 5 Sep 2001 17:46:00 -0400
> From: Brad Chapman <chapmanb@arches.uga.edu>
> To: biopython-dev@biopython.org
> Subject: Re: [Biopython-dev] sequence format readers ?
>
> Hi Thomas!
>
> > To follow up one of the discussions and questions at ISMB
> in Copenhagen,
> > - how are we going to proceed with the sequence format reader (the
> > biopython variant of readseq ...)
>
> It's great that you're going to work on this! It's definately much
> desired by a lot o' people (in fact I was just having a conversation
> today about format conversion).
>
> > Currently we can only have parsers for Fasta, Embl and
> GenBank.  What we
> > need is a internal format and functions/modules which can
> read/write:
> [...impressive list o' formats...]
> > ??? - more suggestions ?
>
> I think supporting this many would be an *excellent* start :-).
>
> > I can write most of the rules, but I guess we have to
> define a smart base
> > class/parser - where plugging in a new format should only
> take 5 seconds ...
> > If we brain storm on the design of the reader/writer, I
> could volunteer to
> > implement the format rules ...
> >
> > Some things to consider:
> > * some formats are alignment based (e.g. clustal, phylip, nexus)
> > * some formats have loads of information which is lost when
> converted to a
> >   lower info-rich format( e.g. Embl -> Fasta). But Embl ->
> GenBank should
> >   not lose any information
> > * some formats allow multiple entries, some not
>
> Just as a way of getting things started (I haven't done a lot of
> thinking about this), my opinion is that the best way to do this is
> to have a SeqIO system kinda like Bioperl. The inputs into the SeqIO
> system would be the standard SeqRecord object that we currently
> have. The advantage of this is that existing parsers (ie Fasta,
> GenBank), already parse into this, so all that would need to be done
> is to define a mapping that converts a generic SeqRecord object to
> and from the formats "native" Record based representation. So to
> convert from GenBank to Fasta you could do:
>
> GenBank Record Format --> SeqRecord --> Fasta Record Format
>
> Since the Record formats already provide writing capabilities (and
> we have the parsers to parse into them) we would already get writing
> and parsing "for free." Also, we would make good use of our existing
> "generic" Sequence representations.
>
> The advantages of this is that it would help us avoid having to make
> a billion GenBank -> Fasta, GenBank -> EMBL, GenBank -> whatever
> specific converters. The disadvantage of this is that we may lose
> some information in the conversion process (but than again, what
> converters don't :-).
>
> The tricky part of doing it this way is that we would then need to
> define the Record --> SeqRecord mapping, which, as you mention,
> may take some thinking for alignment formats and other
> complications.
>
> Hopefully-rambling-on-and-on-about-this-helps-a-little-bit-ly yr's,
>
> Brad
>
>
>
> --__--__--
>
> Message: 5
> Date: Wed, 5 Sep 2001 16:08:49 -0700
> To: Johann Visagie <johann@egenetics.com>
> From: Jeffrey Chang <jchang@SMI.Stanford.EDU>
> Cc: biopython-dev@biopython.org
> Subject: [Biopython-dev] Re: [BioPython] Biopython 1.00a3
> release now available
>
> At 3:46 PM +0200 9/5/01, Johann Visagie wrote:
> >Jeffrey Chang on 2001-09-04 (Tue) at 11:45:13 -0700:
> >>
> >>  A new release of Biopython is now available.
> >
> >Cool.  :-)
> >
> >A thought:  Shouldn't these announcements be cross-posted to
> >python-announce-list@python.org, a.k.a
> comp.lang.python.announcee?  :-)
>
> Yes.  Next time.  :)
>
> Thanks,
> Jeff
>
> --__--__--
>
> Message: 6
> To: Brad Chapman <chapmanb@arches.uga.edu>
> Cc: biopython-dev@biopython.org
> Subject: Re: [Biopython-dev] sequence format readers ?
> From: Thomas Sicheritz-Ponten <thomas@cbs.dtu.dk>
> Date: 06 Sep 2001 11:22:01 +0200
>
> Brad Chapman <chapmanb@arches.uga.edu> writes:
>
> > Hi Thomas!
> >
> > > To follow up one of the discussions and questions at ISMB
> in Copenhagen,
> > > - how are we going to proceed with the sequence format reader (the
> > > biopython variant of readseq ...)
> >
> > It's great that you're going to work on this! It's definately much
> > desired by a lot o' people (in fact I was just having a conversation
> > today about format conversion).
> >
> > > Currently we can only have parsers for Fasta, Embl and
> GenBank.  What we
> > > need is a internal format and functions/modules which can
> read/write:
> > [...impressive list o' formats...]
> > > ??? - more suggestions ?
> >
> > I think supporting this many would be an *excellent* start :-).
> >
> > > I can write most of the rules, but I guess we have to
> define a smart base
> > > class/parser - where plugging in a new format should only
> take 5 seconds ...
> > > If we brain storm on the design of the reader/writer, I
> could volunteer to
> > > implement the format rules ...
> > >
> > > Some things to consider:
> > > * some formats are alignment based (e.g. clustal, phylip, nexus)
> > > * some formats have loads of information which is lost
> when converted to a
> > >   lower info-rich format( e.g. Embl -> Fasta). But Embl
> -> GenBank should
> > >   not lose any information
> > > * some formats allow multiple entries, some not
> >
> > Just as a way of getting things started (I haven't done a lot of
> > thinking about this), my opinion is that the best way to do this is
> > to have a SeqIO system kinda like Bioperl. The inputs into the SeqIO
> > system would be the standard SeqRecord object that we currently
> > have. The advantage of this is that existing parsers (ie Fasta,
> > GenBank), already parse into this, so all that would need to be done
> > is to define a mapping that converts a generic SeqRecord object to
> > and from the formats "native" Record based representation. So to
> > convert from GenBank to Fasta you could do:
> >
> > GenBank Record Format --> SeqRecord --> Fasta Record Format
> >
> > Since the Record formats already provide writing capabilities (and
> > we have the parsers to parse into them) we would already get writing
> > and parsing "for free." Also, we would make good use of our existing
> > "generic" Sequence representations.
> >
> > The advantages of this is that it would help us avoid having to make
> > a billion GenBank -> Fasta, GenBank -> EMBL, GenBank -> whatever
> > specific converters. The disadvantage of this is that we may lose
> > some information in the conversion process (but then again, what
> > converters don't :-).
>
> I think inheriting the Seq object to a SeqIOSeq object is enough.
> We just need to add a single dictionary (features) where all
> Swiss/EMBL/GenBank extra annotations can be added.
>
> e.g.
> class SeqIOSeq(Seq):
>     def __init__(self):
>         Seq.__init__(self)
>         # dictionary for extra annotations (e.g. Embl, GenBank)
>         self.features = {}
>
>
> In the case of
> GenBank Record Format --> SeqIOSeq --> Fasta Record Format
> we pick only the name and sequence ...
>
> but for
> GenBank Record Format --> SeqIOSeq --> EMBL Record Format
> the writer function should check if there are any additional features
> (self.features.keys())
> That way we shouldn't lose any information.
>
>
> It would be nice if a new format can be added by simply adding functions
> for reading, writing and recognizing the format.
> I'm not completely sure how to define these functions - any ideas ?
>
> example code ...
>
> import sys
> from Bio.Seq import Seq
> NO, YES = 0,1
>
> class SeqIOSeq(Seq):
>     def __init__(self):
>         Seq.__init__(self)
>         # dictionary for extra annotations (e.g. Embl, GenBank)
>         self.features = {}
>
>
> class SeqIO:
>     # dictionary to store functions for
>     # recognizing, reading and writing of different sequence formats
>     recognizers = {}
>     readers = {}
>     writers = {}
>
>     def __init__(self, **kwds):
>         self.name = None
>         self.format = None
>         self.sequence = SeqIOSeq()
>         self.is_an_alignment = NO
>         self.allow_multiple_entries = YES
>         for k, v in kwds.items(): setattr(self, k, v)
>
>     def AddFormat(self, name, recognizeF, readF, writeF):
>         self.recognizers[name] = recognizeF
>         self.readers[name] = readF
>         self.writers[name] = writeF
>
>
> needing-a-machete-for-the-sequence-format-jungle'ly yr's
> -thomas
>
> --
> Sicheritz-Ponten Thomas, Ph.D  CBS, Department of Biotechnology
> thomas@biopython.org           The Technical University of Denmark
> CBS:  +45 45 252489            Building 208, DK-2800 Lyngby
> Fax   +45 45 931585            http://www.cbs.dtu.dk/thomas
>
> 	De Chelonian Mobile ... The Turtle Moves ...
>
> --__--__--
>
> Message: 7
> Date: Thu, 6 Sep 2001 14:49:29 +0200
> From: Johann Visagie <johann@egenetics.com>
> To: biopython-dev@biopython.org
> Subject: Re: [Biopython-dev] Biopython 1.00a3 for the Mac
>
> Yair Benita on 2001-09-05 (Wed) at 11:28:00 +0200:
> >
> > I have compiled the new release for the Mac.
>
> FreeBSD port has also just been updated:
>   http://www.freebsd.org/cgi/cvsweb.cgi/ports/biology/py-biopython/
>
> Pre-built package (minus CORBA) should appear here in a couple of days:
>
> ftp://ftp.freebsd.org/pub/FreeBSD/ports/i386/packages-stable/All/py-biopython-1.00.a3.tgz

Unfortunately, this all comes a day or two too late to make it into
4.4-RELEASE, and hence onto the distribution CDs.  :-(

-- V


--__--__--

_______________________________________________
Biopython-dev mailing list
Biopython-dev@biopython.org
http://biopython.org/mailman/listinfo/biopython-dev


End of Biopython-dev Digest


From katel at worldpath.net  Thu Sep  6 23:38:32 2001
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:43:04 2005
Subject: [Biopython-dev] RE:Sequence format readers
References: <005001c13729$aa920e50$f70210ac@l001696w00>
Message-ID: <001601c1374e$95d62260$010a0a0a@cadence.com>

----- Original Message -----
From: "Peter Wilkinson" <pewilkinson@informaxinc.com>
To: <biopython-dev@biopython.org>
Sent: Thursday, September 06, 2001 4:14 PM
Subject: [Biopython-dev] RE:Sequence format readers


> I have almost completed code for reading in Refseek data. I have finished
> classes (1st draft, but functions well) for the smaller organisms, and now
> I am moving on to the Human records ....
>
> Also we need a parser for Derwent data, which should inherit from EMBL,
> since its formatting is EMBL like.
>
> Next also is the expression data from different manufacturers ....
>
  We have tons of good ideas.  I think we need to set priorities so we focus
on the most important tasks.  I'm not the right person to set priorities
because I don't use these tools for my day job, but others may have ideas.

                                                             Cayte


From jchang at SMI.Stanford.EDU  Thu Sep  6 23:39:05 2001
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:43:04 2005
Subject: [Biopython-dev] sequence format readers ?
In-Reply-To: <y9vheug4qti.fsf@genome.cbs.dtu.dk>
References: <15254.29396.129922.274263@genome.cbs.dtu.dk>
 <20010905174600.A4766@ci350185-a.athen1.ga.home.com>
 <y9vheug4qti.fsf@genome.cbs.dtu.dk>
Message-ID: <p05101000b7bdee0163b3@[192.168.0.4]>

At 11:22 AM +0200 9/6/01, Thomas Sicheritz-Ponten wrote:
>Brad Chapman <chapmanb@arches.uga.edu> writes:

[Thomas]
>>  > I can write most of the rules, but I guess we have to define a smart base
>>  > class/parser - where plugging in a new format should only take 5 seconds ...
>>  > If we brain storm on the design of the reader/writer, I could volunteer to
>>  > implement the format rules ...
>>  >
>>  > Some things to consider:
>>  > * some formats are alignment based (e.g. clustal, phylip, nexus)
>>  > * some formats have loads of information which is lost when converted to a
>>  >   lower info-rich format( e.g. Embl -> Fasta). But Embl -> GenBank should
>>  >   not lose any information
>>  > * some formats allow multiple entries, some not
>>
>>  Just as a way of getting things started (I haven't done a lot of
>>  thinking about this), my opinion is that the best way to do this is
>>  to have a SeqIO system kinda like Bioperl. The inputs into the SeqIO
>>  system would be the standard SeqRecord object that we currently
>>  have. The advantage of this is that existing parsers (ie Fasta,
>>  GenBank), already parse into this, so all that would need to be done
>>  is to define a mapping that converts a generic SeqRecord object to
>>  and from the formats "native" Record based representation. So to
>>  convert from GenBank to Fasta you could do:
>>
>>  GenBank Record Format --> SeqRecord --> Fasta Record Format
>>
>>  Since the Record formats already provide writing capabilities (and
>>  we have the parsers to parse into them) we would already get writing
>>  and parsing "for free." Also, we would make good use of our existing
>>  "generic" Sequence representations.
>>
>>  The advantages of this is that it would help us avoid having to make
>>  a billion GenBank -> Fasta, GenBank -> EMBL, GenBank -> whatever
>>  specific converters. The disadvantage of this is that we may lose
>>  some information in the conversion process (but then again, what
>>  converters don't :-).

Yes.  It would be nice to have a design where any conversion can be 
done via an intermediate data structure.  However, it should also be 
possible to plug in your own converter if you want.  For example, if 
you really need to have a good GenBank -> EMBL translator, you can 
code one up that bypasses the intermediate, and Biopython should use 
it.  That is, biopython should have 2 methods for translation, 1) 
general, but possibly lossy translation via an intermediate, and 2) 
direct translation if we happen to have a translator for those two 
types; and the methods should work together as seamlessly as possible.
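
A minimal sketch of how the two routes could work together, written in present-day Python purely for illustration. The registry dictionaries, the convert() function and the toy dict-based records are all assumptions made up for this example; none of them are existing Biopython names, and the "EMBL" text is a toy layout, not the real format.

```python
# Illustrative sketch only: the registry layout and every name here are
# assumptions, not Biopython's API.

direct_converters = {}   # (source format, target format) -> function
to_intermediate = {}     # source format -> function returning the intermediate
from_intermediate = {}   # target format -> function taking the intermediate


def convert(record, source, target):
    """Use a direct converter when one is registered, else go via the intermediate."""
    direct = direct_converters.get((source, target))
    if direct is not None:
        return direct(record)
    # General, possibly lossy route through the intermediate structure.
    return from_intermediate[target](to_intermediate[source](record))


# Toy formats: records are plain dicts, output is simplified text.
to_intermediate["genbank"] = lambda rec: {"id": rec["accession"], "seq": rec["seq"]}
from_intermediate["fasta"] = lambda mid: ">%s\n%s\n" % (mid["id"], mid["seq"])
from_intermediate["embl"] = lambda mid: "ID   %s\nSQ   %s\n//\n" % (mid["id"], mid["seq"])

# A hand-written GenBank -> EMBL translator that also keeps the comment,
# which the intermediate above would have dropped.
direct_converters[("genbank", "embl")] = lambda rec: (
    "ID   %s\nCC   %s\nSQ   %s\n//\n" % (rec["accession"], rec["comment"], rec["seq"])
)

genbank_record = {"accession": "AB000001", "comment": "toy record", "seq": "ACGT"}
fasta_text = convert(genbank_record, "genbank", "fasta")   # via the intermediate
embl_text = convert(genbank_record, "genbank", "embl")     # via the direct converter
```

The caller only ever sees convert(); whether a direct translator exists is a lookup detail, which is one way to read "as seamlessly as possible".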


>I think inheriting the Seq object to a SeqIOSeq object is enough.
>We just need to add a single dictionary (features) where all
>Swiss/EMBL/GenBank extra annotations can be added.
>
>e.g.
>class SeqIOSeq(Seq):
>     def __init__(self):
>         Seq.__init__(self)
>         # dictionary for extra annotations (e.g. Embl, GenBank)
>         self.features = {}
>
>
>In the case of
>GenBank Record Format --> SeqIOSeq --> Fasta Record Format
>we pick only the name and sequence ...
>
>but for
>GenBank Record Format --> SeqIOSeq --> EMBL Record Format
>the writer function should check if there are any additional features
>(self.features.keys())
>That way we shouldn't lose any information.

This seems like the same solution as the one that Brad suggested, 
except that SeqRecord is replaced by SeqIOSeq.  The SeqIOSeq is a 
much simpler format, so may be easier to use.  However, it leaves 
unspecified how the features should be stored, which may be 
problematic.  For example, the converter from SeqIOSeq to 
Fasta.Record will have to know what to use as the Fasta description. 
For GenBank, it might be the accession and comments.  For a 
SProt.Record, it might be the entry_name and description.  Thus, 
unless the SeqIOSeq.features elements are specified better, I'm 
afraid the SeqIOSeq -> X converter will have to know about all the 
other formats.

SeqRecord gets around this by defining (theoretically) all the 
information people would care about from a record, with a consistent 
interface.  Thus, a SeqRecord -> Fasta.Record converter will always 
use the SeqRecord.id and SeqRecord.description (or some other 
combination of attributes).
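
To make the point concrete, here is a small sketch in present-day Python. SimpleRecord is a hypothetical stand-in carrying only the attributes used here, and to_fasta() is an invented helper, not an existing Biopython function. The writer touches nothing but .id, .description and .seq, so it does not need to know which format the record originally came from.

```python
class SimpleRecord:
    """Stand-in for SeqRecord with just the attributes used below (an assumption)."""

    def __init__(self, seq, id="<unknown id>", description="<unknown description>"):
        self.seq = seq
        self.id = id
        self.description = description


def to_fasta(record, width=60):
    """Format any record exposing .id, .description and .seq as FASTA text."""
    lines = [">%s %s" % (record.id, record.description)]
    for i in range(0, len(record.seq), width):
        lines.append(record.seq[i:i + width])
    return "\n".join(lines) + "\n"


# Two records that might have come from different parsers; the writer cannot
# tell the difference and does not need to.
from_genbank = SimpleRecord("ACGT" * 20, id="AB000001.1", description="toy GenBank entry")
from_sprot = SimpleRecord("MKV", id="TOY_HUMAN", description="toy SwissProt entry")
fasta_a = to_fasta(from_genbank)
fasta_b = to_fasta(from_sprot)
```

With a free-form features dictionary, the same writer would instead need a branch per source format to find something usable as the description.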


>It would be nice if a new format can be added by simply adding functions
>for reading, writing and recognizing the format.
>I'm not completely sure how to define these functions - any ideas ?

Not exactly, but it would be nice if those functions were exposed. 
For example, there should be a function somewhere called 
"whichformat" (similar to the whichdb package in Python's standard 
library) that returns a best guess at the format.
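
A rough sketch of what such a function might look like, in present-day Python. The name whichformat comes from the paragraph above; the implementation is only an assumption, guessing from the first non-blank line using the usual leading markers of each format. It really is a best guess: SwissProt entries also start with an "ID" line, for instance, so this toy version would report them as EMBL.

```python
def whichformat(text):
    """Return a best guess at the sequence format of text, or None if unknown."""
    for line in text.splitlines():
        if not line.strip():
            continue  # skip leading blank lines
        if line.startswith(">"):
            return "fasta"
        if line.startswith("LOCUS"):
            return "genbank"
        if line.startswith("ID   "):
            return "embl"  # ambiguous: SwissProt entries start the same way
        if line.startswith("CLUSTAL"):
            return "clustal"
        if line.startswith("#NEXUS"):
            return "nexus"
        return None  # first non-blank line matched nothing we know
    return None  # empty input
```

For example, whichformat(">TM0001 hypothetical protein\nMVYG") gives "fasta", and text that matches none of the markers gives None.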

In the past, Andrew's talked about building this kind of 
functionality into Martel...

Jeff

From jchang at SMI.Stanford.EDU  Thu Sep  6 23:40:45 2001
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:43:04 2005
Subject: [Biopython-dev] RE:Sequence format readers
In-Reply-To: <005001c13729$aa920e50$f70210ac@l001696w00>
References: <005001c13729$aa920e50$f70210ac@l001696w00>
Message-ID: <p05101001b7bdf20554a2@[192.168.0.4]>

>P.S. I am sitting on code for specific fasta formatted types .... how about
>that?

Yes, please send it in!

Jeff

From katel at worldpath.net  Sat Sep  8 00:46:36 2001
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:43:04 2005
Subject: [Biopython-dev] MetaTool
Message-ID: <004301c13821$3f6bf340$010a0a0a@cadence.com>

  I just added MetaTool.  It passes a superficial test but I need to take a
closer look.

                            Cayte


From katel at worldpath.net  Sat Sep  8 16:52:18 2001
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:43:04 2005
Subject: [Biopython-dev] MetaTool
References: <004301c13821$3f6bf340$010a0a0a@cadence.com>
Message-ID: <000f01c138a8$27932360$010a0a0a@cadence.com>

  When I've finished checking zillions of Matrixes from the Martel parser,
the next rainy day, I could look into Nexus since that is on the list.  I'm
holding off on more pathway stuff till I find out what Tarjei has.    I want
to post my plans so we don't have two people duplicating effort on the same
parser.

                                                                  Cayte


From tarjei at genome.wi.mit.edu  Sun Sep  9 00:49:23 2001
From: tarjei at genome.wi.mit.edu (Tarjei Mikkelsen)
Date: Sat Mar  5 14:43:04 2005
Subject: [Biopython-dev] MetaTool
In-Reply-To: <000f01c138a8$27932360$010a0a0a@cadence.com>
Message-ID: <000601c138ea$cc57e6a0$6d04fa12@mit.edu>

>I'm holding off on more pathway stuff till I find out what Tarjei has

 I've completed a prototype for the core species/reaction/system classes
as described earlier. 

 Next, I intend to use these classes to implement the missing part from
my KEGG parser - the reaction database. Then I'd like to write a bridge
from the pathway classes to Metatool. Combined with your output parser
this would give us a complete "vertical slice" from database (KEGG/WIT)
through the pathway classes to an analysis program.

 I'll do a first commit as soon as I am satisfied that the pathway
classes are well designed. Time is getting precious these days as school
is starting again, so it might take a while for this to happen.

 Tarjei


From katel at worldpath.net  Sun Sep  9 16:09:43 2001
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:43:04 2005
Subject: [Biopython-dev] MetaTool
References: <000601c138ea$cc57e6a0$6d04fa12@mit.edu>
Message-ID: <001401c1396b$5ed0a320$010a0a0a@cadence.com>

----- Original Message -----
From: "Tarjei Mikkelsen" <tarjei@genome.wi.mit.edu>
To: "'Cayte'" <katel@worldpath.net>; <biopython-dev@biopython.org>
Sent: Saturday, September 08, 2001 9:49 PM
Subject: RE: [Biopython-dev] MetaTool


> >I'm holding off on more pathway stuff till I find out what Tarjei has
>
>  I've completed a prototype for the core species/reaction/system classes
> as described earlier.
>
>  Next, I intend to use these classes to implement the missing part from
> my KEGG parser - the reaction database. Then I'd like to write a bridge
> from the pathway classes to Metatool. Combined with your output parser
> this would give us a complete "vertical slice" from database (KEGG/WIT)
> through the pathway classes to an analysis program.
>
   I may need to retrofit some MetaTool classes to fit your classes.  Please
share your ideas on what's needed for the MetaTool end of the bridge.
>  I'll do a first commit as soon as I am satisfied that the pathway
> classes are well designed. Time is getting precious these days as school
> is starting again, so it might take a while for this to happen.
>
  Committing code early may allow users to offer suggestions on class
structure.
                              Cayte


From tarjei at genome.wi.mit.edu  Sun Sep  9 19:11:16 2001
From: tarjei at genome.wi.mit.edu (Tarjei Mikkelsen)
Date: Sat Mar  5 14:43:04 2005
Subject: [Biopython-dev] MetaTool
In-Reply-To: <001401c1396b$5ed0a320$010a0a0a@cadence.com>
Message-ID: <000601c13984$bb47dd80$0f05f612@mit.edu>

>>  Next, I intend to use these classes to implement the missing part from
>> my KEGG parser - the reaction database. Then I'd like to write a bridge
>> from the pathway classes to Metatool. Combined with your output parser
>> this would give us a complete "vertical slice" from database (KEGG/WIT)
>> through the pathway classes to an analysis program.
>>
>   I may need to retrofit some MetaTool classes to fit your classes.  
>  Please share your ideas on what's needed for the MetaTool end of 
>  the bridge.

 What I'm envisioning is simply a function/class that converts a system
object into a string or text file that can be used directly as the input
to Metatool.

 I haven't looked at your Metatool module in detail, but unless you
already have a comprehensive method for interfacing with the input side
of Metatool there shouldn't be any need for retrofitting to do this.

 How does that sound?

 Tarjei


From katel at worldpath.net  Mon Sep 10 22:08:50 2001
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:43:04 2005
Subject: [Biopython-dev] MetaTool
References: <000601c13984$bb47dd80$0f05f612@mit.edu>
Message-ID: <003201c13a66$b4cb92c0$010a0a0a@cadence.com>

----- Original Message -----
From: "Tarjei Mikkelsen" <tarjei@genome.wi.mit.edu>
>  What I'm envisioning is simply a function/class that converts a system
> object into a string or text file that can be used directly as the input
> to Metatool.

  I was thinking of the output.  You may, for example, want to check if an
elementary mode matches a pathway in a frog.  Pathways, elementary modes,
basis vectors, etc. all look the same to the MetaTool parser.  They are stored
as matrices.  I don't see why they couldn't all be mixed and matched in a
search query.  But they would need to be converted to a format the search
engine understands.
>

>  I haven't looked at your Metatool module in detail, but unless you
> already have a comprehensive method for interfacing with the input side
> of Metatool there shouldn't be any need for retrofitting to do this.
>
  No, I just parse the output of MetaTool.  The parser may have to change
with new revs of MetaTool though.

                    Cayte


From sarah at staff.cs.usyd.edu.au  Tue Sep 11 00:46:17 2001
From: sarah at staff.cs.usyd.edu.au (Sarah Kummerfeld)
Date: Sat Mar  5 14:43:04 2005
Subject: [Biopython-dev] Local blast problem
Message-ID: <Pine.SOL.4.21.0109111438280.97-100000@staff.cs.usyd.edu.au>

Hi,

Just wondering whether anyone knows about an intermittent
problem with locally run blast. 

I'm running lots of blast searches on a very small
database. It will work for a while, or sometimes for
the whole program. But other times it will core dump.

Occasionally I get a python traceback which suggests
that it tried to read a stream that was not there,
but other times there is no error, it just dumps
core.

If I run blast on its own (not from my python program)
it sometimes does the same thing. One time I rebooted
my machine (linux) and found that a blast search I had
run just before and that had crashed, now worked. 

I couldn't find anything helpful in the core file.

I had thought it might be some new memory I put in,
so I had it replaced -- but still have the problem.

Any suggestions would be greatly appreciated!

Sarah


From jchang at SMI.Stanford.EDU  Tue Sep 11 13:27:15 2001
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:43:04 2005
Subject: [Biopython-dev] Local blast problem
In-Reply-To: <Pine.SOL.4.21.0109111438280.97-100000@staff.cs.usyd.edu.au>
References: <Pine.SOL.4.21.0109111438280.97-100000@staff.cs.usyd.edu.au>
Message-ID: <p05101001b7c3f8d06251@[171.65.33.250]>

Hmmm...  I'm not sure.  Is Python core-dumping, or just blast?  I've 
run into BLAST crashing before, but it usually results in a truncated 
stream and hasn't caused python to core-dump.

Some possible workarounds are to:
1) fork off a separate thread to run blast, so if it crashes, it 
won't take down your main application.  The MultiProc library might 
help here.
2) hack the blastall (or blastpgp) function so that it saves the 
output to a file, and then returns a handle to that file.  This is 
technically a variation of the first solution, and might be more 
straightforward to implement.
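
A minimal sketch of the second workaround, in present-day Python and using only the standard library. run_to_file() is a made-up helper name, and the command below is a stand-in so the example runs without BLAST installed; a real blastall command line would go in its place.

```python
import subprocess
import sys
import tempfile


def run_to_file(command):
    """Run command with stdout saved to a temporary file.

    Returns (handle, exit status). The handle is rewound to the start, so
    whatever the program managed to write can be read even if it crashed.
    """
    out = tempfile.TemporaryFile(mode="w+")
    status = subprocess.call(command, stdout=out)
    out.seek(0)  # rewind so the caller reads from the beginning
    return out, status


# Stand-in for a real blastall command line, so the sketch runs anywhere.
handle, status = run_to_file([sys.executable, "-c", "print('fake blast output')"])
report = handle.read()
handle.close()
```

A non-zero status tells the caller that the run failed, so a truncated report can be retried or skipped instead of being handed to the parser.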

Jeff


At 2:46 PM +1000 9/11/01, Sarah Kummerfeld wrote:
>Hi,
>
>Just wondering whether anyone knows about an intermittent
>problem with locally run blast.
>
>I'm running lots of blast searches on a very small
>database. It will work for a while, or sometimes for
>the whole program. But other times it will core dump.
>
>Occasionally I get a python traceback which suggests
>that it tried to read a stream that was not there,
>but other times there is no error, it just dumps
>core.
>
>If I run blast on its own (not from my python program)
>it sometimes does the same thing. One time I rebooted
>my machine (linux) and found that a blast search I had
>run just before and that had crashed, now worked.
>
>I couldn't find anything helpful in the core file.
>
>I had thought it might be some new memory I put in,
>so I had it replaced -- but still have the problem.
>
>Any suggestions would be greatly appreciated!
>
>Sarah


From thomas at cbs.dtu.dk  Tue Sep 11 20:43:43 2001
From: thomas at cbs.dtu.dk (Thomas Sicheritz-Ponten)
Date: Sat Mar  5 14:43:04 2005
Subject: [Biopython-dev] sequence format readers ?
In-Reply-To: Brad Chapman's message of "Wed, 5 Sep 2001 17:46:00 -0400"
References: <15254.29396.129922.274263@genome.cbs.dtu.dk>
	<20010905174600.A4766@ci350185-a.athen1.ga.home.com>
Message-ID: <y9vy9nlutkw.fsf@delphinus.cbs.dtu.dk>

Brad, I made some changes to our initial SeqRecord and FastaReader/Writer
classes in order to use them for inheritance.
Before I start defining rules for the other formats we should brainstorm
over possible drawbacks/pitfalls of the current implementation (e.g. alignments).

Any ideas/suggestions ?

cheers
-thomas

# ----- SNIP ----- SNAP ----- SNIP ----- SNAP -----
import sys
import string
import Bio.Alphabet

from Bio.Seq import Seq
#from Bio.SeqRecord import SeqRecord

class SeqRecord:
    def __init__(self, seq, id = "<unknown id>", name = "<unknown name>",
                 description = "<unknown description>"):
        self.seq = seq
        self.id = id
        self.name = name
        self.description = description
        # annotations about the whole sequence
        self.annotations = {}
        
        # annotations about parts of the sequence
        self.features = []
        
    def __str__(self):
        res = ''
        res += '%s %s' % (self.name, self.seq.data)
        return res
    

class GenericFormat:
    def __init__(self, instream=None, outstream=None, alphabet = Bio.Alphabet.generic_alphabet):
        self.instream = instream
        self.outstream = outstream
        self.alphabet = alphabet
        self._n = -1
        self._lookahead = None

    def find_start(self):
        # find the start of data
        pass
    
    def next(self):
        pass
        
    def __getitem__(self, i):
        # wrapper to the normal Python "for spam in list:" idiom
        assert i == self._n  # forward iteration only!
        x = self.next()
        if x is None:
            raise IndexError, i
        return x

    def write(self, record):
        pass
    
    def write_records(self, records):
        # In general, can assume homogenous records... useful?
        for record in records:
            self.write(record)

    def close(self):
        return self.outstream.close()

    def flush(self):
        return self.outstream.flush()

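class FastaFormat(GenericFormat):
    def __init__(self, instream=None, outstream=None, alphabet = Bio.Alphabet.generic_alphabet):
        # pass outstream and the caller's alphabet through to the base class
        GenericFormat.__init__(self, instream, outstream, alphabet)
        # only look for the first record when there is something to read
        if instream is not None:
            self.find_start()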
        
    def find_start(self):
        line = self.instream.readline()
        while line and line[0] != ">": line = self.instream.readline()
        self._lookahead = line
        self._n = 0

    def next(self):
        self._n = self._n + 1

        line = self._lookahead
        if not line: return None

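        # split the header into the id and an optional description
        x = string.split(string.rstrip(line[1:]), None, 1)
        if len(x) == 2:
            id, desc = x
        elif len(x) == 1:
            id, desc = x[0], ""
        else:
            id, desc = "", ""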
            
        lines = []
        line = self.instream.readline()
        while line:
            if line[0] == ">":
                break
            # rstrip, so a final line without a newline keeps its last residue
            lines.append(string.rstrip(line))
            line = self.instream.readline()
            
        self._lookahead = line

        return SeqRecord(Seq(string.join(lines, ""), self.alphabet),
                         id = id, name = id, description = desc)
        
    def write(self, record):
        id = record.id
        assert "\n" not in id

        description = record.description
        assert "\n" not in description
        
        self.outstream.write(">%s %s\n" % (id, description))

        data = record.seq.tostring()
        for i in range(0, len(data), 60):
            self.outstream.write(data[i:i+60] + "\n")

if __name__ == '__main__':
    txt = """
>TM0001 hypothetical protein
MVYGKEGYGRSKNILLSECVCGIISLELNGFQYFLRGMETL
>TM0002 hypothetical protein
MSPEDWKRLICFHTSKEVLKQTLDDAQQNISDSVSIPLRKY
>TM0003 hypothetical protein
METVKAYEVEDIPAIGFNNSLEVWKLFPASSSRSTSSSFQ
>TM0004 hypothetical protein
MKDLYERFNNSLEVWKLVELFGTSIRIHLFQ
"""
    from StringIO import StringIO
    test = FastaFormat(instream = StringIO(txt))
    while 1:
        r = test.next()
        if not r: break
        print r
        
# ----- SNIP ----- SNAP ----- SNIP ----- SNAP -----

-- 
Sicheritz-Ponten Thomas, Ph.D  CBS, Department of Biotechnology
thomas@biopython.org           The Technical University of Denmark
CBS:  +45 45 252489            Building 208, DK-2800 Lyngby
Fax   +45 45 931585            http://www.cbs.dtu.dk/thomas

	De Chelonian Mobile ... The Turtle Moves ...


From chapmanb at arches.uga.edu  Wed Sep 12 04:50:15 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:43:04 2005
Subject: [Biopython-dev] sequence format readers ?
In-Reply-To: <y9vy9nlutkw.fsf@delphinus.cbs.dtu.dk>
References: <15254.29396.129922.274263@genome.cbs.dtu.dk> <20010905174600.A4766@ci350185-a.athen1.ga.home.com> <y9vy9nlutkw.fsf@delphinus.cbs.dtu.dk>
Message-ID: <20010912045015.A24186@ci350185-a.athen1.ga.home.com>

Hi Thomas;

> Brad, I made some changes to our initial SeqRecord and FastaReader/Writer
> classes in order to use them for inheritance.

Cool! Thanks for working on this. With regards to SeqRecord, adding
__str__ stuff for debugging is great. Abstracting out the
common stuff in Reader/Writer is definitely a plus. I have to admit
to not having looked at or used the SeqIO stuff much, mostly because
I always figured it was a work-in-progress.

One thing that comes to mind is you might want to support the
Iterator stuff coming in python 2.2:

http://www.amk.ca/python/2.2/index.html#SECTION000300000000000000000

Seems like all we need to do is add __iter__ that returns the object
itself and we'll be all set (and it should be back compatible and
all of that).
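
A small sketch of what that could look like, in present-day Python. RecordReader is a toy class invented for this example, not the actual reader. One wrinkle worth noting: the iterator protocol expects the end of data to be signalled by raising StopIteration, while the reader classes return None from next(), so a thin adapter is needed in addition to __iter__.

```python
class RecordReader:
    """Toy reader: hands out items from a list through a next() method."""

    def __init__(self, items):
        self._items = list(items)
        self._pos = 0

    def next(self):
        # Same convention as the existing readers: None signals end of data.
        if self._pos >= len(self._items):
            return None
        item = self._items[self._pos]
        self._pos += 1
        return item

    def __iter__(self):
        # The object is its own iterator.
        return self

    def __next__(self):
        # Python 3 spelling of the protocol. Python 2.2 calls next() directly
        # and expects StopIteration from it rather than a None return value.
        item = self.next()
        if item is None:
            raise StopIteration
        return item


collected = [record for record in RecordReader(["rec1", "rec2", "rec3"])]
```

The old "while 1: r = reader.next()" loops keep working unchanged, since next() itself still returns None at the end.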
          
> Before I start defining rules for the other formats we should 
> brainstorm over possible drawbacks/pitfalls of the current 
> implementation (e.g. alignments).

Hmm, I guess I just figured we would run into pitfalls after it was
already coded :-). Seriously, I'm pretty happy with the SeqRecord +
SeqFeature classes (with a few mistakes I made which I'll write
about in a separate thread in a second), so it might be best to go
forward and see how they handle what we need. Everything does a
decent job of supporting the BioCorba spec, which is a good sign (to
me!) that they can handle "most common cases."

In terms of alignments, I think these will end up being more "high
level" than SeqRecords. For instance, in the Generic alignment stuff
I coded up, an Alignment is basically a collection of SeqRecords. So
the conversions here will be a little different, I guess:

A File of FASTA records (lots of SeqRecords) --> one Alignment
one Alignment --> a bunch of FASTA records
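
As a rough illustration of those two conversions, in present-day Python: ToyAlignment and parse_fasta() are invented for this sketch and are not the Generic alignment classes. The alignment is just a collection of (id, sequence) rows, with the one extra constraint that all rows have the same length.

```python
class ToyAlignment:
    """Minimal stand-in for an alignment: (id, sequence) rows of equal length."""

    def __init__(self, rows):
        rows = list(rows)
        lengths = set(len(seq) for _, seq in rows)
        if len(lengths) > 1:
            raise ValueError("aligned sequences must all have the same length")
        self.rows = rows

    def to_fasta(self):
        # one Alignment --> a bunch of FASTA records
        return "".join(">%s\n%s\n" % (name, seq) for name, seq in self.rows)


def parse_fasta(text):
    """Yield (id, sequence) pairs from FASTA text."""
    name, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split(None, 1)
            name, chunks = (fields[0] if fields else ""), []
        elif name is not None:
            chunks.append(line.strip())
    if name is not None:
        yield name, "".join(chunks)


# A file of FASTA records --> one Alignment, and back again.
fasta_text = ">a\nAC-GT\n>b\nACCGT\n"
alignment = ToyAlignment(parse_fasta(fasta_text))
round_trip = alignment.to_fasta()
```
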

Other than this, I think you're on target (at least with my
understanding of how conversions will work). If you can coerce
Andrew into commenting, he might have some opinions about how the
SeqIO stuff should work, since he wrote it.

May-the-force-be-with-you-on-sequence-conversions-ly yr's,
Brad

From chapmanb at arches.uga.edu  Tue Sep 18 21:25:32 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:43:04 2005
Subject: [Biopython-dev] Re: Parsing Protein GenBank Records
In-Reply-To: <OIEFKBIBGGKFMCHEPFMJMENJCAAA.j.joung@AptusGenomics.com>
References: <OIEFKBIBGGKFMCHEPFMJMENJCAAA.j.joung@AptusGenomics.com>
Message-ID: <20010918212532.A3580@ci350185-a.athen1.ga.home.com>

Hi Joung;
(ccing this to biopython-dev since this is relevant to everyone)

> I'm having trouble parsing GenBank records obtained from the protein
> database. The parser works fine for nucleotide GenBank records , but not for
> protein records. I would appreciate it very much if you can guide me in
> right direction for parsing such records.
> 
> Here is the code and the error that I get back.
> 
> >>> parser = GenBank.RecordParser()
> >>> ncbi = GenBank.NCBIDictionary(database='Protein')
> >>> rec = ncbi['6754304']

The parser does work for proteins in general, but does fail badly on
this particular REFSEQ sequence. In the past, REFSEQ stuff has been
only "sort of" GenBank format, and this record is no exception. It
has a lot of formatting problems (has no identifier for the sequence
type in the LOCUS line, has extra DBSOURCE tag, has non-standard
feature table types and keys (Protein, Region, region_name)).
Anyways, it is a big non-standard formatting mess.

I've fixed the GenBank parser to be able to handle this, and checked
the changes into CVS. Diffs to the relevant files (Record.py,
__init__.py and genbank_format.py in Bio.GenBank) are also attached
to this file in case you don't have CVS access.

Thanks for the bug report. Hope this works for you!

Brad
-- 
PGP public key available from http://pgp.mit.edu/
-------------- next part --------------
*** Record.py.orig	Sat May 19 15:31:16 2001
--- Record.py	Tue Sep 18 21:02:18 2001
***************
*** 106,112 ****
--- 106,114 ----
      o date - The date of submission of the record, in a form like '28-JUL-1998'
      o accession - list of all accession numbers for the sequence.
      o nid - Nucleotide identifier number.
+     o pid - Protein identifier number
      o version - The accession number + version (ie. AB01234.2)
+     o db_source - Information about the database the record came from
      o gi - The NCBI gi identifier for the record.
      o keywords - A list of keywords related to the record.
      o segment - If the record is one of a series, this is info about which
***************
*** 153,159 ****
--- 155,163 ----
          self.definition = ''
          self.accession = []
          self.nid = ''
+         self.pid = ''
          self.version = ''
+         self.db_source = ''
          self.gi = ''
          self.keywords = []
          self.segment = ''
***************
*** 185,191 ****
--- 189,197 ----
          output += self._accession_line()
          output += self._version_line()
          output += self._nid_line()
+         output += self._pid_line()
          output += self._keywords_line()
+         output += self._db_source_line()
          output += self._segment_line()
          output += self._source_line()
          output += self._organism_line()
***************
*** 210,216 ****
          output += "%-9s" % self.locus
          output += " " # 22 space
          output += "%7s" % self.size
!         output += " bp "
  
          # treat circular types differently, since they'll have long residue
          # types
--- 216,225 ----
          output += "%-9s" % self.locus
          output += " " # 22 space
          output += "%7s" % self.size
!         if self.residue_type.find("PROTEIN") >= 0:
!             output += " aa"
!         else:
!             output += " bp "
  
          # treat circular types differently, since they'll have long residue
          # types
***************
*** 272,277 ****
--- 281,296 ----
              output = ""
          return output
  
+     def _pid_line(self):
+         """Output for PID line. Presumedly, PID usage is also obsolete.
+         """
+         if self.pid:
+             output = Record.BASE_FORMAT % "PID"
+             output += "%s\n" % self.pid
+         else:
+             output = ""
+         return output
+ 
      def _keywords_line(self):
          """Output for the KEYWORDS line.
          """
***************
*** 288,293 ****
--- 307,322 ----
              output += _wrapped_genbank(keyword_info,
                                         Record.GB_BASE_INDENT)
  
+         return output
+ 
+     def _db_source_line(self):
+         """Output for DBSOURCE line.
+         """
+         if self.db_source:
+             output = Record.BASE_FORMAT % "DBSOURCE"
+             output += "%s\n" % self.db_source
+         else:
+             output = ""
          return output
  
      def _segment_line(self):
-------------- next part --------------
*** __init__.py.orig	Sat Jul 28 12:02:25 2001
--- __init__.py	Tue Sep 18 21:13:48 2001
***************
*** 98,112 ****
      def __getitem__(self, key):
          """Retrieve an item from the dictionary.
          """
!         print "keys:", self._index.keys()
          # get the location of the record of interest in the file
          start, len = self._index[key]
!         print "start:", start, "len:", len
  
          # read through and get the data from the file
          self._handle.seek(start)
          data = self._handle.read(len)
!         print "data:", data
  
          # run the data through the parser if one is specified
          if self._parser is not None:
--- 98,112 ----
      def __getitem__(self, key):
          """Retrieve an item from the dictionary.
          """
!         # print "keys:", self._index.keys()
          # get the location of the record of interest in the file
          start, len = self._index[key]
!         # print "start:", start, "len:", len
  
          # read through and get the data from the file
          self._handle.seek(start)
          data = self._handle.read(len)
!         # print "data:", data
  
          # run the data through the parser if one is specified
          if self._parser is not None:
***************
*** 434,439 ****
--- 434,442 ----
      def nid(self, content):
          self.data.annotations['nid'] = content
  
+     def pid(self, content):
+         self.data.annotations['pid'] = content
+ 
      def version(self, version_id):
          """Set the version to overwrite the id.
  
***************
*** 443,448 ****
--- 446,454 ----
          """
          self.data.id = version_id
  
+     def db_source(self, content):
+         self.data.annotations['db_source'] = content.rstrip()
+ 
      def gi(self, content):
          self.data.annotations['gi'] = content
  
***************
*** 485,510 ****
          (bases 1 to 86436)
          (sites)
          (bases 1 to 105654; 110423 to 111122)
          """
!         # first remove the parentheses
          ref_base_info = content[1:-1]
  
          all_locations = []
!         # only attempt to get out information if we find the words
!         # 'bases' and 'to'
          if (string.find(ref_base_info, 'bases') != -1 and
              string.find(ref_base_info, 'to') != -1):
              # get rid of the beginning 'bases'
              ref_base_info = ref_base_info[5:]
!             # split possibly multiple locations using the ';'
!             all_base_info = string.split(ref_base_info, ';')
! 
!             for base_info in all_base_info:
!                 start, end = string.split(base_info, 'to')
!                 this_location = \
!                   SeqFeature.FeatureLocation(int(string.strip(start)),
!                                              int(string.strip(end)))
!                 all_locations.append(this_location)
  
          # make sure if we are not finding information then we have
          # the string 'sites' or the string 'bases'
--- 491,516 ----
          (bases 1 to 86436)
          (sites)
          (bases 1 to 105654; 110423 to 111122)
+         1  (residues 1 to 182)
          """
!         # first remove the parentheses or other junk
          ref_base_info = content[1:-1]
  
          all_locations = []
!         # parse if we've got 'bases' and 'to'
          if (string.find(ref_base_info, 'bases') != -1 and
              string.find(ref_base_info, 'to') != -1):
              # get rid of the beginning 'bases'
              ref_base_info = ref_base_info[5:]
!             locations = self._split_reference_locations(ref_base_info)
!             all_locations.extend(locations)
!         elif (ref_base_info.find("residues") >= 0 and
!               ref_base_info.find("to") >= 0):
!             residues_start = ref_base_info.find("residues")
!             # get only the information after "residues"
!             ref_base_info = ref_base_info[(residues_start + len("residues ")):]
!             locations = self._split_reference_locations(ref_base_info)
!             all_locations.extend(locations)
  
          # make sure if we are not finding information then we have
          # the string 'sites' or the string 'bases'
***************
*** 517,523 ****
                               (ref_base_info, self.data.id))
  
          self._current_ref.location = all_locations
!                 
      def authors(self, content):
          self._current_ref.authors = content
  
--- 523,551 ----
                               (ref_base_info, self.data.id))
  
          self._current_ref.location = all_locations
! 
!     def _split_reference_locations(self, location_string):
!         """Get reference locations out of a string of reference information
!         
!         The passed string should be of the form:
! 
!             1 to 20; 20 to 100
! 
!         This splits the information out and returns a list of location objects
!         based on the reference locations.
!         """
!         # split possibly multiple locations using the ';'
!         all_base_info = location_string.split(';')
! 
!         new_locations = []
!         for base_info in all_base_info:
!             start, end = base_info.split('to')
!             this_location = \
!               SeqFeature.FeatureLocation(int(string.strip(start)),
!                                              int(string.strip(end)))
!             new_locations.append(this_location)
!         return new_locations
! 
      def authors(self, content):
          self._current_ref.authors = content
  
***************
*** 905,913 ****
--- 933,947 ----
      def nid(self, content):
          self.data.nid = content
  
+     def pid(self, content):
+         self.data.pid = content
+ 
      def version(self, content):
          self.data.version = content
  
+     def db_source(self, content):
+         self.data.db_source = content.rstrip()
+ 
      def gi(self, content):
          self.data.gi = content
  
***************
*** 1070,1076 ****
          # in the MartelParser
          self.interest_tags = ["locus", "size", "residue_type",
                                "data_file_division", "date",
!                               "definition", "accession", "nid", "version",
                                "gi", "keywords", "segment",
                                "source", "organism",
                                "taxonomy", "reference_num",
--- 1104,1111 ----
          # in the MartelParser
          self.interest_tags = ["locus", "size", "residue_type",
                                "data_file_division", "date",
!                               "definition", "accession", "nid", 
!                               "pid", "version", "db_source",
                                "gi", "keywords", "segment",
                                "source", "organism",
                                "taxonomy", "reference_num",
-------------- next part --------------
*** genbank_format.py.orig	Thu May 10 17:42:43 2001
--- genbank_format.py	Tue Sep 18 21:07:11 2001
***************
*** 142,147 ****
--- 142,156 ----
                          nid +
                          Martel.AnyEol())
  
+ # PID         g6754304
+ pid = Martel.Group("pid",
+                    Martel.Re("[\w\d]+"))
+ pid_line = Martel.Group("pid_line", 
+                         Martel.Str("PID") +
+                         blank_space +
+                         pid + 
+                         Martel.AnyEol())
+ 
  # version and GI line
  # VERSION     AC007323.5  GI:6587720
  version = Martel.Group("version",
***************
*** 159,164 ****
--- 168,181 ----
                              gi +
                              Martel.AnyEol())
  
+ # DBSOURCE    REFSEQ: accession NM_010510.1
+ db_source = Martel.Group("db_source",
+                          Martel.ToEol())
+ db_source_line = Martel.Group("db_source_line",
+                               Martel.Str("DBSOURCE") +
+                               blank_space +
+                               db_source) 
+ 
  # keywords line
  # KEYWORDS    antifreeze protein homology; cold-regulated gene; cor6.6 gene;
  #             KIN1 homology.
***************
*** 312,319 ****
--- 329,338 ----
      "primer",           # Primer binding region used with PCR  XXX not in 
                          #   http://www.ncbi.nlm.nih.gov/collab/FT/index.html
      "promoter",         # A region involved in transcription initiation
+     "Protein",          # A REFSEQ invention for referring to a protein
      "protein_bind",     # Non-covalent protein binding site on DNA or RNA
      "RBS",              # Ribosome binding site
+     "Region",           # Another REFSEQ invention that doesn't make any sense
      "rep_origin",       # Replication origin for duplex DNA
      "repeat_region",    # Sequence containing repeated subsequences
      "repeat_unit",      # One repeated unit of a repeat_region
***************
*** 424,429 ****
--- 443,449 ----
                        #   evidence for a feature
      "clone_lib",      # Clone library from which the sequence was obtained
      "clone",          # Clone from which the sequence was obtained
+     "coded_by",       # REFSEQ invention to specify a crossreference
      "codon_start",    # Indicates the first base of the first complete codon
                        #   in a CDS (as 1 or 2 or 3)
      "codon",          # Specifies a codon that is different from any found
***************
*** 505,510 ****
--- 525,531 ----
      "rearranged",     # If the sequence shown is DNA and a member of the
                        #   immunoglobulin family, this qualifier is used to
                        #   denote that the sequence is from rearranged DNA
+     "region_name",    # REFSEQ invention to go with their Region Type
      "replace",        # Indicates that the sequence identified a feature's
                        #   intervals is replaced by the  sequence shown in
                        #   "text"
***************
*** 624,630 ****
--- 645,653 ----
                        definition_block + \
                        accession_block + \
                        Martel.Opt(nid_line) + \
+                       Martel.Opt(pid_line) + \
                        Martel.Opt(version_line) + \
+                       Martel.Opt(db_source_line) + \
                        keywords_block + \
                        Martel.Opt(segment_line) + \
                        source_block + \
***************
*** 633,639 ****
                        Martel.Opt(comment_block) + \
                        features_line + \
                        Martel.Rep1(feature) + \
!                       base_count_line + \
                        sequence_entry + \
                        record_end)
  
--- 656,662 ----
                        Martel.Opt(comment_block) + \
                        features_line + \
                        Martel.Rep1(feature) + \
!                       Martel.Opt(base_count_line) + \
                        sequence_entry + \
                        record_end)
  
From thomas at cbs.dtu.dk  Wed Sep 19 04:53:49 2001
From: thomas at cbs.dtu.dk (Thomas Sicheritz-Ponten)
Date: Sat Mar  5 14:43:04 2005
Subject: [Biopython-dev] Q: How to undo a CVS update ?
Message-ID: <y9vy9nb5zoi.fsf@genome.cbs.dtu.dk>

ARGHHHHHH !

Unfortunately I made a cvs update before committing my local changes to
generic.py ... :(

Are there any hidden cvs commands to undo this stupid error ?

thx
-thomas
-- 
Sicheritz-Ponten Thomas, Ph.D  CBS, Department of Biotechnology
thomas@biopython.org           The Technical University of Denmark
CBS:  +45 45 252489            Building 208, DK-2800 Lyngby
Fax   +45 45 931585            http://www.cbs.dtu.dk/thomas

	De Chelonian Mobile ... The Turtle Moves ...


From adalke at mindspring.com  Wed Sep 19 06:34:19 2001
From: adalke at mindspring.com (Andrew Dalke)
Date: Sat Mar  5 14:43:04 2005
Subject: [Biopython-dev] Q: How to undo a CVS update ?
Message-ID: <00d501c140f6$a3ef8380$0301a8c0@josiah.dalkescientific.com>

Thomas of the Moving Turtle:
>Unfortunately I made a cvs update before committing my local changes to
>generic.py ... :(
>
>Are there any hidden cvs commands to undo this stupid error ?

Don't know the official way to do that.  Here's what I do.

Suppose you want to revert to version 1.15

  rm filename.py
  cvs co -r1.15 filename.py
  mv filename.py filename.py.tmp
  cvs co filename.py
  rm filename.py
  mv filename.py.tmp filename.py
  cvs commit filename.py

The reason for the mv, co, rm, mv back is because the co with
a version number is sticky.  By moving it then checking out
the current version, I get rid of the sticky part.

There's documentation at http://www.cvshome.org/docs/

Huh.  How about
  http://www.cvshome.org/docs/manual/cvs_4.html#SEC53

] However, this isn't the easiest way, if you are asking
] how to undo a previous checkin (in this example, put
] `file1' back to the way it was as of revision 1.1). In
] that case you are better off using the `-j' option to
] update; for further discussion see 5.8 Merging differences
] between any two revisions.

and points to
 http://www.cvshome.org/docs/manual/cvs_5.html#SEC62

Try out the suggestion on that page.

And let me know if it works :)

                    Andrew
                    dalke@dalkescientific.com


From chapmanb at arches.uga.edu  Wed Sep 19 06:53:36 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:43:04 2005
Subject: [Biopython-dev] Proposed Incompatible Changes to GenBank SeqFeatures
Message-ID: <20010919065335.B4962@ci350185-a.athen1.ga.home.com>

Hello all;
Recently I've been slogging through the biopython implementation of
the new BioCorba spec, and have been forced to come back to some
bad coding I did on the parts of the GenBank parser that convert
GenBank into SeqRecord and SeqFeature objects. Most of the mistakes
that I'm working on fixing are detailed by Andrew:

http://www.biopython.org/pipermail/biopython-dev/2001-July/000451.html

(unfortunately he sent this right before ISMB when I was crazily
coding to get my poster done :-<)

Anyways, I'd like to try and fix the problems he mentions here (and
which are also a problem for biopython-corba), and have got the
changes together. The issue is that some of the changes are
back-incompatible, so I'd like to talk about them here and get
people's thoughts on whether the changes will severely affect their
existing code. Note that this only influences parsing GenBank into
SeqRecord and SeqFeature objects, not into GenBank Record objects (ie.
the FeatureParser, not the RecordParser).

I'll go through the changes step by step, and attach the diffs
(against current CVS) which implement the changes. I'll start with
incompatible changes, and then go on to the more benign back-compatible
changes. Comments are very welcome. Okay, here we go:

Back-incompatible changes
-------------------------
1. 
Andrew:
> feature qualifiers shouldn't really be a dictionary

I have been storing feature qualifiers as a dictionary with the keys
equal to the qualifier keys and the values equal to the qualifier
values. The problem comes when there are multiple keys of the same
name, ie with db_xref in the following CDS Feature:

     CDS             50..250
                     /gene="cor6.6"
                     /db_xref="GI:16230"
                     /db_xref="SWISS-PROT:P31169"

What I did was this hideous hack to add numbers to the end of the
qualifiers to make them unique, so the feature above would be:

{"gene" : "cor6.6",
 "db_xref" : "GI:16230",
 "db_xref1" : "SWISS-PROT:P31169"}

As Andrew points out, this is hideous and all around bad news. I
would like to propose to use a combination dictionary/list structure
to hold multiple keys, so the above now looks like:

{"gene" : ["cor6.6"],
 "db_xref" : ["GI:16230", "SWISS-PROT:P31169"]}

This affects all people using qualifiers (since even single items
are now in a list), but I think it is easier to always have the
values be lists (otherwise you'll always have to check whether the
value is of type("") or type([])).
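
A minimal sketch of the dictionary-of-lists idea, written in plain
Python rather than against the real parser. The helper name
collect_qualifiers and the (key, value) pair input are assumptions
made for illustration; the actual consumer receives keys and values
through separate callbacks.

```python
def collect_qualifiers(pairs):
    """Group (key, value) qualifier pairs into a dictionary of lists.

    Hypothetical helper for illustration; not part of the GenBank parser.
    """
    qualifiers = {}
    for key, value in pairs:
        # setdefault creates the list the first time a key is seen,
        # so repeated keys such as db_xref simply accumulate values
        qualifiers.setdefault(key, []).append(value)
    return qualifiers


pairs = [("gene", "cor6.6"),
         ("db_xref", "GI:16230"),
         ("db_xref", "SWISS-PROT:P31169")]
qualifiers = collect_qualifiers(pairs)

# every value is a list, so callers never need a type check
for xref in qualifiers["db_xref"]:
    print(xref)
```

Because single values are also wrapped in a list, code that wants
"the" gene name would read qualifiers["gene"][0].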

2. 
> - What's the numbering system of the FeatureLocation?

There is a difference between "biological" coordinates used in
GenBank and python/biopython coordinates. In "biological"
coordinates 1 is the first base in a sequence, and if you want to
get 1 to 50, that includes both the first and 50th base.

In python, 0 is the first base in the sequence, and 1 to 50 includes
1, but not 50.

Previously I did no conversion of GenBank "biological" coordinates
to python coordinates, but Andrew argues (and I agree) that I should
do this to make things less confusing. The new implementation does
the conversion, so this will give an "off-by-one" type error to
people using the current numbers.
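
The conversion described above can be sketched as follows. The
function name to_python_coords and the toy sequence are assumptions
for illustration only; the rule itself (subtract one from the start,
keep the end) is the one stated in this message.

```python
def to_python_coords(start, end):
    """Convert a 1-based, inclusive range to a 0-based, half-open one."""
    # the first base moves from position 1 to index 0; the end is
    # unchanged because Python slices already exclude the end index
    return start - 1, end


seq = "ACGTACGTAC"

# "bases 1 to 5" in GenBank terms covers the first five letters
start, end = to_python_coords(1, 5)
first_five = seq[start:end]

# a single base, "7 to 7", becomes a slice of length one
single_start, single_end = to_python_coords(7, 7)
seventh = seq[single_start:single_end]
```

A converted range keeps its length: end - start on the Python side
equals the number of bases in the original GenBank range.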

3. 
Andrew:
> Related to that, what's the type used when there are subfeatures?

Previously, if we had a sequence feature like:
CDS             join(104..160,320..390,504..579)

I would code this as a top level SeqFeature with type "CDS" and
location (104..579), and have sub_features of this top level feature
with type "CDS_join." This is stolen from bioperl, but is not that
great in retrospect, since I'm hacking the type and all of that. 

I'd like to propose adding a location_operator attribute to
SeqFeature (already done in CVS) and have the top level SeqFeature
be type "CDS" with location_operator "join", and all sub_features
also be of the same type and location_operator. This will only
affect people who relied on the previous (fairly ugly)
type/location_operator concatenation mechanism.
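
A rough sketch of the proposed layout, using a hypothetical Feature
class as a stand-in for SeqFeature (the real class has more
attributes and uses FeatureLocation objects). The coordinates are
kept exactly as written in the GenBank line, without any conversion,
to keep the example focused on type and location_operator.

```python
class Feature:
    """Hypothetical stand-in for SeqFeature, for illustration only."""

    def __init__(self, type, start, end, location_operator=""):
        self.type = type
        self.start = start
        self.end = end
        self.location_operator = location_operator
        self.sub_features = []


def build_joined_feature(type, operator, ranges):
    """Build a top-level feature whose sub_features share its type."""
    subs = [Feature(type, low, high, operator) for low, high in ranges]
    # the top-level location spans from the first start to the last end
    top = Feature(type, subs[0].start, subs[-1].end, operator)
    top.sub_features = subs
    return top


# CDS             join(104..160,320..390,504..579)
cds = build_joined_feature("CDS", "join",
                           [(104, 160), (320, 390), (504, 579)])
```

With this layout a caller selects all CDS pieces by type alone and
reads location_operator separately, instead of matching on a
concatenated name like "CDS_join".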

4. strand information

Previously I had the default strand for a SeqFeature be 0, which I now
think is a mistake. In my mind, the strand information is:

None -- No strand information, or not relevant (ie. protein)
1 -- The top strand
-1 -- The bottom strand
0 -- both strands

So, I changed the default strand info to be None. Hopefully this is
a relatively minor change.

Back-compatible changes
-----------------------
1. 
Andrew:
> The biopython SeqFeature currently must be used like:
>    feature = SeqFeature()
>    feature.type = "variation"
>    ...

> I would much rather prefer allowing the values to be set
> through the constructor, as in
>     feature = SeqFeature(type = "variation", ...)
      
I agree with Andrew on this and have implemented it. The only
current issue is that if I try to set sub_features or annotations in
the constructor I'll get infinite recursion problems when printing
out the features (ie. try doing this and running test_GenBank). I
haven't been able to figure out why that is yet.
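
One possible explanation, offered as a guess and not confirmed
anywhere in this thread: a list or dictionary used as a default
argument value is created once and shared by every instance that
relies on the default. The two classes below are hypothetical and
only demonstrate that Python behaviour; they are not the SeqFeature
code.

```python
class SharedDefault:
    """Hypothetical class whose list default is shared by all instances."""

    def __init__(self, sub_features=[]):
        # every instance created without an argument gets the SAME list
        self.sub_features = sub_features


class FreshDefault:
    """Same idea, but each instance gets its own new list."""

    def __init__(self, sub_features=None):
        if sub_features is None:
            sub_features = []
        self.sub_features = sub_features


parent, child = SharedDefault(), SharedDefault()
parent.sub_features.append(child)
# the child shares the list, so it now appears inside its own
# sub_features; walking the tree to print it would never terminate
shared_cycle = child.sub_features[0] is child

parent2, child2 = FreshDefault(), FreshDefault()
parent2.sub_features.append(child2)
fresh_cycle = len(child2.sub_features) > 0
```

If this is indeed the cause, using None as the default and creating
the list or dictionary inside the constructor would avoid the shared
state.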

2. New attributes in SeqFeature

I added location_operator and id attributes to SeqFeatures (to help
support BioCorba), which can now be used. These shouldn't affect
anyone's old code and you can now use 'em if you want.


Whew, I think that is everything :-). Thanks for reading all of the
way through this. I'd really like comments on whether these changes
are good/bad, and most importantly whether I can check 'em in or
should do something different. Thanks much!

Brad
-- 
PGP public key available from http://pgp.mit.edu/
-------------- next part --------------
*** __init__.py	Wed Sep 19 05:56:29 2001
--- __init__.py.orig	Tue Sep 18 21:13:48 2001
***************
*** 374,397 ****
          """
          return text.replace(" ", "")
  
-     def _convert_to_python_numbers(self, start, end):
-         """Convert a start and end range to python notation.
- 
-         In GenBank, starts and ends are defined in "biological" coordinates,
-         where 1 is the first base and [i, j] means to include both i and j.
- 
-         In python, 0 is the first base and [i, j] means to include i, but
-         not j. 
- 
-         So, to convert "biological" to python coordinates, we need to 
-         subtract 1 from the start, and leave the end and things should
-         be converted happily.
-         """
-         new_start = start - 1
-         new_end = end
- 
-         return new_start, new_end
- 
  class _FeatureConsumer(_BaseGenBankConsumer):
      """Create a SeqRecord object with Features to return.
  
--- 374,379 ----
***************
*** 558,567 ****
          new_locations = []
          for base_info in all_base_info:
              start, end = base_info.split('to')
!             new_start, new_end = \
!               self._convert_to_python_numbers(int(start.strip()),
!                                               int(end.strip()))
!             this_location = SeqFeature.FeatureLocation(new_start, new_end)
              new_locations.append(this_location)
          return new_locations
  
--- 540,548 ----
          new_locations = []
          for base_info in all_base_info:
              start, end = base_info.split('to')
!             this_location = \
!               SeqFeature.FeatureLocation(int(string.strip(start)),
!                                              int(string.strip(end)))
              new_locations.append(this_location)
          return new_locations
  
***************
*** 707,717 ****
          # current feature, then get the information for this feature
          for inner_element in function.args:
              new_sub_feature = SeqFeature.SeqFeature()
!             # inherit the type from the parent
!             new_sub_feature.type = cur_feature.type 
!             # add the join or order info to the location_operator
!             cur_feature.location_operator = function.name
!             new_sub_feature.location_operator = function.name
              # inherit references and strand from the parent feature
              new_sub_feature.ref = cur_feature.ref
              new_sub_feature.ref_db = cur_feature.ref_db
--- 688,695 ----
          # current feature, then get the information for this feature
          for inner_element in function.args:
              new_sub_feature = SeqFeature.SeqFeature()
!             # add _join or _order to the name to make the type clear
!             new_sub_feature.type = cur_feature.type + '_' + function.name
              # inherit references and strand from the parent feature
              new_sub_feature.ref = cur_feature.ref
              new_sub_feature.ref_db = cur_feature.ref_db
***************
*** 726,735 ****
          # set the location of the top -- this should be a combination of
          # the start position of the first sub_feature and the end position
          # of the last sub_feature
- 
-         # these positions are already converted to python coordinates 
-         # (when the sub_features were added) so they don't need to
-         # be converted again
          feature_start = cur_feature.sub_features[0].location.start
          feature_end = cur_feature.sub_features[-1].location.end
          cur_feature.location = SeqFeature.FeatureLocation(feature_start,
--- 704,709 ----
***************
*** 788,796 ****
          # check if we just have a single base
          if not(isinstance(range_info, LocationParser.Range)):
              pos = self._get_position(range_info)
!             # move the single position back one to be consistent with how
!             # python indexes numbers (starting at 0)
!             pos.position = pos.position  - 1
              return SeqFeature.FeatureLocation(pos, pos)
          # otherwise we need to get both sides of the range
          else:
--- 762,768 ----
          # check if we just have a single base
          if not(isinstance(range_info, LocationParser.Range)):
              pos = self._get_position(range_info)
! 
              return SeqFeature.FeatureLocation(pos, pos)
          # otherwise we need to get both sides of the range
          else:
***************
*** 798,807 ****
              start_pos = self._get_position(range_info.low)
              end_pos = self._get_position(range_info.high)
  
-             start_pos.position, end_pos.position = \
-               self._convert_to_python_numbers(start_pos.position,
-                                               end_pos.position)
- 
              return SeqFeature.FeatureLocation(start_pos, end_pos)
  
      def _get_position(self, position):
--- 770,775 ----
***************
*** 854,867 ****
          # if we've got a key from before, add it to the dictionary of
          # qualifiers
          if self._cur_qualifier_key:
!             key = self._cur_qualifier_key
!             value = self._cur_qualifier_value
!             # if the qualifier name exists, append the value
!             if self._cur_feature.qualifiers.has_key(key):
!                 self._cur_feature.qualifiers[key].append(value)
!             # otherwise start a new list of the key with its values
!             else:
!                 self._cur_feature.qualifiers[key] = [value]
  
      def qualifier_key(self, content):
          """When we get a qualifier key, use it as a dictionary key.
--- 822,837 ----
          # if we've got a key from before, add it to the dictionary of
          # qualifiers
          if self._cur_qualifier_key:
!             # get a unique name
!             unique_name = self._cur_qualifier_key
!             counter = 1
!             while self._cur_feature.qualifiers.has_key(unique_name):
!                 unique_name = self._cur_qualifier_key + str(counter)
!                 counter = counter + 1
!                 
!                 
!             self._cur_feature.qualifiers[unique_name] = \
!                                                       self._cur_qualifier_value
  
      def qualifier_key(self, content):
          """When we get a qualifier key, use it as a dictionary key.
-------------- next part --------------
*** SeqFeature.py.orig	Tue Sep 18 20:35:03 2001
--- SeqFeature.py	Wed Sep 19 01:30:26 2001
***************
*** 69,75 ****
      40 and 50 to 60, respectively.
      """
      def __init__(self, location = None, type = '', location_operator = '',
!                  strand = 0, id = "<unknown id>", 
                   qualifiers = {}, sub_features = [],
                   ref = None, ref_db = None):
          """Initialize a SeqFeature on a Sequence.
--- 69,75 ----
      40 and 50 to 60, respectively.
      """
      def __init__(self, location = None, type = '', location_operator = '',
!                  strand = None, id = "<unknown id>", 
                   qualifiers = {}, sub_features = [],
                   ref = None, ref_db = None):
          """Initialize a SeqFeature on a Sequence.
From thomas at cbs.dtu.dk  Wed Sep 19 07:28:11 2001
From: thomas at cbs.dtu.dk (Thomas Sicheritz-Ponten)
Date: Sat Mar  5 14:43:04 2005
Subject: [Biopython-dev] SeqIO
Message-ID: <y9vhetz5sj8.fsf@genome.cbs.dtu.dk>

Hej,

ok - I rewrote and _committed_ the lost code for sequence conversion :-)

Brad and/or Andrew: could you check how we can use the GenBank and the SWISS
parser in the SeqIO stuff ?
The current file for sequence format IO is SeqIO/generic.py ... (should
definitely change name, maybe to SeqIO.py ?)

We need to design a method for guessing sequence formats - which is quite
easy with filenames and file streams, but a little trickier with sys.stdin
... (you can't tell and seek in a stdin - can you???)
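
One way to guess a format without tell() or seek() is to read the
first line, inspect it, and then hand back an iterator that yields
that line again ahead of the rest of the stream. The sketch below
assumes text input and only checks three well-known line prefixes;
the function names guess_format and peek_format are hypothetical and
not part of SeqIO.

```python
import io
import itertools


def guess_format(first_line):
    """Guess a sequence format from the first line of a file."""
    if first_line.startswith(">"):
        return "fasta"
    if first_line.startswith("LOCUS"):
        return "genbank"
    if first_line.startswith("ID   "):
        # EMBL and SWISS-PROT entries both begin with an ID line
        return "embl-or-swissprot"
    return "unknown"


def peek_format(handle):
    """Return (format_name, line_iterator) without seek() or tell()."""
    lines = iter(handle)
    first_line = next(lines, None)
    if first_line is None:
        # empty input: nothing to guess from and nothing to replay
        return "unknown", iter(())
    # put the consumed line back in front of the remaining lines
    return guess_format(first_line), itertools.chain([first_line], lines)


handle = io.StringIO(">seq1\nACGT\n")
fmt, lines = peek_format(handle)
all_lines = list(lines)
```

The same call should work on sys.stdin, since it relies only on
iterating over lines: fmt, lines = peek_format(sys.stdin).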

cheers
-thomas


-- 
Sicheritz-Ponten Thomas, Ph.D  CBS, Department of Biotechnology
thomas@biopython.org           The Technical University of Denmark
CBS:  +45 45 252489            Building 208, DK-2800 Lyngby
Fax   +45 45 931585            http://www.cbs.dtu.dk/thomas

	De Chelonian Mobile ... The Turtle Moves ...


From pewilkinson at informaxinc.com  Mon Sep 24 16:43:08 2001
From: pewilkinson at informaxinc.com (Peter Wilkinson)
Date: Sat Mar  5 14:43:04 2005
Subject: [Biopython-dev] Parsing features fuzzie in Genbank annotation att Brad.
In-Reply-To: <200109191055.f8JAt6p14463@pw600a.bioperl.org>
Message-ID: <002401c14539$859ff670$f00210ac@l001696w00>

Since this is being reviewed ....

Brad, please make a note of this; everyone should understand that this is there.

In genbank records the following join format will pop up:

    "join(10000,10200..10450)"

The numbers used here appear to represent a single-base segment joined
to a second exon. Can this happen in biology? I am still working that
out. If this is not a real one-base join, I am also trying to work out
what the annotation is meant to represent biologically.

However, please note that this is a possible annotation in any case.
Programs that use feature information should know how to handle it.
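
As a rough illustration of handling a single position inside a join,
here is a small standalone sketch. It is not the biopython location
parser: parse_simple_join is a hypothetical helper that covers only
plain integers and low..high ranges, and ignores fuzzy positions,
complement(), order() and everything else the feature table allows.

```python
import re


def parse_simple_join(text):
    """Parse 'join(a,b..c,...)' into a list of (start, end) tuples."""
    match = re.fullmatch(r"join\((.*)\)", text.strip())
    if match is None:
        raise ValueError("not a join location: %r" % text)
    spans = []
    for part in match.group(1).split(","):
        if ".." in part:
            low, high = part.split("..")
        else:
            # a bare number is a single base: start and end coincide
            low = high = part
        spans.append((int(low), int(high)))
    return spans


spans = parse_simple_join("join(10000,10200..10450)")
```

The single position comes back as a span whose start and end are
equal, so downstream code can treat it like any other segment.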

This cropped up while I was parsing the Refseq S_cerevisiae data. Go to
NCBI and download Chromosome 9, and you will see what I am talking about.

I will post what I find out, but if anyone else has some insight on
this, please post.

Peter

P.S. pretty unbelievable, is it not?


In response to the following comment --------------------------

3.
Andrew:
> Related to that, what's the type used when there are subfeatures?

Previously, if we had a sequence feature like:
CDS             join(104..160,320..390,504..579)

I would code this as a top level SeqFeature with type "CDS" and
location (104..579), and have sub_features of this top level feature
with type "CDS_join." This is stolen from bioperl, but is not that
great in retrospect, since I'm hacking the type and all of that.

I'd like to propose adding a location_operator attribute to
SeqFeature (already done in CVS) and have the top level SeqFeature
be type "CDS" with location_operator "join", and all sub_features
also be of the same type and location_operator. This will only
affect people who relied on the previous (fairly ugly)
type/location_operator concatenation mechanism.


From chapmanb at arches.uga.edu  Wed Sep 26 22:04:59 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:43:04 2005
Subject: [Biopython-dev] Parsing features fuzzie in Genbank annotation att Brad.
In-Reply-To: <002401c14539$859ff670$f00210ac@l001696w00>
References: <002401c14539$859ff670$f00210ac@l001696w00>
Message-ID: <20010926220458.A27721@ci350185-a.athen1.ga.home.com>

Hi Peter;

> In genbank records the following join format will pop up:
> 
>     "join(10000,10200..10450)"

Thanks for the heads up on this. I tried this location in Andrew's
parser and it seems to handle it just fine, so I'm pretty sure the
GenBank stuff should be able to handle it. If you run across this
case in a record and the parser fails or produces erroneous results,
send the accession number along and I can fix things.

> The numbers used here represent a one base join with a second exon. 
> Can this happen in biology, 

Hmm, I'm not sure if I can think of a biological case off the top of
my head where this makes good sense. It certainly doesn't make sense
for an exon (ie. a 1 base pair exon) but it might make sense if
the location described something like a protein binding location or
something similar.

> P.S. pretty unbelievable, is it not?

:-). GenBank has lots of surprises.

BTW, since I haven't heard any negative comments about my proposed
SeqFeature/GenBank parser changes, I committed 'em to CVS. If anyone
gets problems on account of this, please let me know!

Brad
-- 
PGP public key available from http://pgp.mit.edu/

From chapmanb at arches.uga.edu  Wed Sep 26 22:47:54 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:43:04 2005
Subject: [Biopython-dev] SeqIO
In-Reply-To: <y9vhetz5sj8.fsf@genome.cbs.dtu.dk>
References: <y9vhetz5sj8.fsf@genome.cbs.dtu.dk>
Message-ID: <20010926224754.E27721@ci350185-a.athen1.ga.home.com>

Hi Thomas!

> ok - I rewrote and _committed_ the lost code for sequence conversion :-)

Glad it made it in okay :-)

> Brad or/and Andrew: could you check how we can use the GenBank and the SWISS
> parser in the SeqIO stuff ?

Yeah, I spent some time looking through it (a few more comments are
below), and I think what I'd need to do for GenBank is just create a
converter that takes SeqRecord objects and turns them into a
GenBank.Record object. This way, I could just do str(the_record) to
get the output and re-use the output work I already did. 

One big question I have is, how many of the features do you want to
try and retain in the conversion? So, for GenBank format, do you
want me to just write out the basic information (sequence, type,
etc) and ignore the feature table, or do we want to somehow map the
features from format to format (ie. EMBL <-> GenBank).

If we want to think about feature conversion, this'll be tougher and
we'll need to think about converters between "similar" formats like
EMBL and GenBank.

> The current file for sequence format IO is SeqIO/generic.py ... (should
> definitely change name, maybe to SeqIO.py ?)

You could just change it to __init__.py, like in the other modules
(so we could do from Bio import SeqIO and get it).

I also had a couple of questions from looking at this:

=> Why are you duplicating SeqRecord in the SeqIO stuff instead of
just reusing it? I don't think I understand what you are talking
about with stripping newlines...

=> Is there a way to plug in a specialized converter for similar
formats, like I was talking about above with EMBL/GenBank? I think
Jeff suggested this earlier, and it seems like a good idea to me. I
guess right now you could subclass ReadSeq and define your own
Convert function, but maybe there is another way to do it.
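
One other way to do it, sketched very loosely: keep a registry of specialised converters keyed by (source format, target format) and fall back to a generic conversion when nothing is registered. Every name below is hypothetical and records are plain dictionaries; this is not the SeqIO or ReadSeq API, just an illustration of the plug-in idea.

```python
# Registry of specialised converters keyed by (source, target) format.
# All names here are hypothetical; this is not the SeqIO API.
_converters = {}


def register_converter(source, target, func):
    """Plug in a converter for one specific pair of formats."""
    _converters[(source, target)] = func


def generic_convert(record, target):
    """Fallback: keep only what every format can express."""
    return {"format": target, "id": record["id"], "seq": record["seq"]}


def convert(record, target):
    """Use a specialised converter if one is registered, else the fallback."""
    func = _converters.get((record["format"], target))
    if func is not None:
        return func(record)
    return generic_convert(record, target)


def embl_to_genbank(record):
    # "Similar" formats can carry the feature table across.
    out = generic_convert(record, "genbank")
    out["features"] = list(record.get("features", []))
    return out


register_converter("embl", "genbank", embl_to_genbank)

embl_record = {"format": "embl", "id": "X1", "seq": "ACGT",
               "features": [("CDS", "1..3")]}
print(convert(embl_record, "genbank"))  # features retained
print(convert(embl_record, "fasta"))    # generic path, features dropped
```

The design choice is that the generic path stays simple, and only format pairs that really are close to each other pay for feature mapping.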

Thanks for your work and code on this. Nice to see it progressing
along!

Brad
-- 
PGP public key available from http://pgp.mit.edu/

From biopython-bugs at bioperl.org  Thu Sep 27 05:22:02 2001
From: biopython-bugs at bioperl.org (biopython-bugs@bioperl.org)
Date: Sat Mar  5 14:43:04 2005
Subject: [Biopython-dev] Notification: incoming/43
Message-ID: <200109270922.f8R9M2p18294@pw600a.bioperl.org>

JitterBug notification

new message incoming/43

Message summary for PR#43
	From: mkersz@pasteur.fr
	Subject: GenBank parser fails (on large files?)
	Date: Thu, 27 Sep 2001 05:22:01 -0400
	0 replies 	0 followups

====> ORIGINAL MESSAGE FOLLOWS <====

>From mkersz@pasteur.fr Thu Sep 27 05:22:02 2001
Received: from localhost (localhost [127.0.0.1])
	by pw600a.bioperl.org (8.11.2/8.11.2) with ESMTP id f8R9M1p18288
	for <biopython-bugs@pw600a.bioperl.org>; Thu, 27 Sep 2001 05:22:01 -0400
Date: Thu, 27 Sep 2001 05:22:01 -0400
Message-Id: <200109270922.f8R9M1p18288@pw600a.bioperl.org>
From: mkersz@pasteur.fr
To: biopython-bugs@bioperl.org
Subject: GenBank parser fails (on large files?)

Full_Name: Michel Kerszberg
Module: GenBank
Version: 1.00a3
OS: linux 2.2
Submission from: cache.pasteur.fr (157.99.64.13)


fetch
 
ftp://ncbi.nlm.nih.gov/genbank/genomes/Bacteria/Mycobacterium_tuberculosis_H37Rv/AL123456.gbk

open this with 

file_handle = open( ... ,'r')
pars = GenBank.FeatureParser()
iter = GenBank.Iterator(file_handle, pars)
rec = iter.next()

This fails with:    

rec = iter.next()
  File "/usr/lib/python2.0/site-packages/Bio/GenBank/__init__.py", line 182, in next
    return self._parser.parse(File.StringHandle(data))
  File "/usr/lib/python2.0/site-packages/Bio/GenBank/__init__.py", line 260, in parse
    self._scanner.feed(handle, self._consumer)
  File "/usr/lib/python2.0/site-packages/Bio/GenBank/__init__.py", line 1108, in feed
    self._parser.parseFile(handle)
  File "/usr/lib/python2.0/site-packages/Martel/Parser.py", line 205, in parseFile
    self.parseString(fileobj.read())
  File "/usr/lib/python2.0/site-packages/Martel/Parser.py", line 233, in parseString
    self._err_handler.fatalError(result)
  File "/var/tmp/python-root//usr/lib/python2.0/xml/sax/handler.py", line 38, in fatalError
Martel.Parser.ParserPositionException: error parsing at or beyond character 42

This is in the first line of the record, which seems
correctly formatted. No amount of massaging of the
file seems to help. 

I have seen this problem reported with other large
GenBank records.


From adalke at mindspring.com  Thu Sep 27 15:35:50 2001
From: adalke at mindspring.com (Andrew Dalke)
Date: Sat Mar  5 14:43:04 2005
Subject: [Biopython-dev] GenBank parser fails (on large files?)
Message-ID: <024b01c1478b$af5516e0$0301a8c0@josiah.dalkescientific.com>

>Full_Name: Michel Kerszberg
>ftp://ncbi.nlm.nih.gov/genbank/genomes/Bacteria/Mycobacterium_tuberculosis_H37Rv/AL123456.gbk

>This fails with:

>Martel.Parser.ParserPositionException: error parsing at or
> beyond character 42
>
>This is in the first line of the record, which seems
>correctly formatted. No amount of massaging of the
>file seems to help.
>
>I have seen this problem reported with other large
>GenBank records.

I've found the problem.  Here's the format definition

locus_line = Martel.Group("locus_line",
                          ...
                          blank_space +
                          Martel.Opt(residue_type +

residue_type = Martel.Group("residue_type",
                            Martel.Opt(Martel.Alt(*residue_prefixes)) +
                            Martel.Opt(Martel.Alt(*residue_types)) +
                            Martel.Opt(blank_space +
                                       Martel.Str("circular")))

In this record, the locus line is

LOCUS       AL123456  4411529 bp          circular  BCT       07-JUL-1998
                                ^^^^^^^^^^ all spaces

so there is no residue type.   The 'blank_space' in 'locus_line'
eats up all those spaces, leaving the parser at the word 'circular'.
That doesn't match the residue_prefixes or the residue_types.  There's
no " " so it doesn't match the 'blank_space', so the residue_type
fails.
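
The failure can be reproduced with a small hand-written matcher. This is only a sketch and not Martel code: it assumes, as described above, that the preceding 'blank_space' has already consumed every space and gives none back, and it uses an abbreviated, made-up list of residue types. The "relaxed" variant shows one possible change, making the spaces in front of "circular" optional.

```python
RESIDUE_TYPES = ("DNA", "RNA")  # abbreviated stand-in for residue_types


def skip_spaces(text, pos):
    """Consume every space at pos and return the new position."""
    while pos < len(text) and text[pos] == " ":
        pos += 1
    return pos


def match_any(text, pos, words):
    """Return the position after the first word found at pos, else pos."""
    for word in words:
        if text.startswith(word, pos):
            return pos + len(word)
    return pos


def residue_type_strict(text, pos):
    """Opt(type) + Opt(blank_space + "circular"): spaces are required."""
    pos = match_any(text, pos, RESIDUE_TYPES)
    after = skip_spaces(text, pos)
    if after > pos and text.startswith("circular", after):
        return after + len("circular")
    return pos


def residue_type_relaxed(text, pos):
    """Opt(type) + Opt(Opt(blank_space) + "circular"): spaces optional."""
    pos = match_any(text, pos, RESIDUE_TYPES)
    after = skip_spaces(text, pos)
    if text.startswith("circular", after):
        return after + len("circular")
    return pos


line = "LOCUS       AL123456  4411529 bp          circular  BCT       07-JUL-1998"
# The enclosing locus_line has already consumed all spaces after "bp".
start = skip_spaces(line, line.index(" bp") + 3)

# Strict: nothing is consumed, so the cursor is still sitting on "circular".
print(residue_type_strict(line, start) == start)
# Relaxed: "circular" is consumed and parsing can continue with the division.
print(line[residue_type_relaxed(line, start):].split()[0])
```

With the strict rule the cursor never moves past "circular", so whatever the grammar expects next cannot match; with the relaxed rule the next token is the division code.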

Here's a likely solution - move 'blank_space' to occur after the
residue type:

residue_type = Martel.Group("residue_type",
    Martel.Alt(
         Martel.Opt(Martel.Alt(*residue_prefixes)) + \
           Martel.Alt(*residue_types) + \
           Martel.Opt(blank_space + Martel.Str("circular")),
         Martel.Opt(Martel.Str("circular"))))

I've not tested this, since I think the format definition needs
to be revisited first because I've now more experience in writing
these things, and second because the LOCUS line definition is
changing in the next couple months, according to

  ftp://ncbi.nlm.nih.gov/genbank/gbrel.txt

                    Andrew


From chapmanb at arches.uga.edu  Thu Sep 27 16:05:54 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:43:05 2005
Subject: [Biopython-dev] GenBank parser fails (on large files?)
In-Reply-To: <024b01c1478b$af5516e0$0301a8c0@josiah.dalkescientific.com>
References: <024b01c1478b$af5516e0$0301a8c0@josiah.dalkescientific.com>
Message-ID: <20010927160554.A29159@ci350185-a.athen1.ga.home.com>

Hi Michel, Andrew;

Michel:
> >ftp://ncbi.nlm.nih.gov/genbank/genomes/Bacteria/Mycobacterium_tuberculosis_H37Rv/AL123456.gbk
> 
> >This fails with:
> 
> >Martel.Parser.ParserPositionException: error parsing at or
> > beyond character 42

Andrew:
> I've found the problem.  Here's the format definition
[...]
> In this record, the locus line is
> 
> LOCUS       AL123456  4411529 bp          circular  BCT       07-JUL-1998
>                                 ^^^^^^^^^^ all spaces
> 
> so there is no residue type.   The 'blank_space' in 'locus_line'
> eats up all those spaces, leaving the parser at the word 'circular'.

Thanks for looking at this Andrew -- I've also been checking it out
concurrently and came to the same conclusion. Wow, I never would
have expected to have circular without the residue type :-).

I've fixed this and also a second problem with this file, the
version line has no GI:

VERSION     AL123456

I've added these examples to the GenBankFormat test so that we
should be able to catch them in the future.
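
To illustrate the VERSION case outside of Martel, here is a rough equivalent using Python's re module. The pattern and the helper name `parse_version_line` are assumptions for illustration, not the actual format definition, and the second sample line is made up.

```python
import re

# Assumed layout: "VERSION", spaces, an identifier, then optionally
# spaces and "GI:<digits>". The GI part is wrapped in an optional group.
VERSION_LINE = re.compile(
    r"^VERSION\s+(?P<version>\S+)(?:\s+GI:(?P<gi>\d+))?\s*$"
)


def parse_version_line(line):
    """Return (version, gi); gi is None when the line has no GI."""
    match = VERSION_LINE.match(line)
    if match is None:
        raise ValueError("not a VERSION line: %r" % line)
    return match.group("version"), match.group("gi")


print(parse_version_line("VERSION     AL123456"))
# prints ('AL123456', None)

# Made-up line showing the usual form with a GI number.
print(parse_version_line("VERSION     XX000000.1  GI:1234567"))
# prints ('XX000000.1', '1234567')
```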

For Michel, the fixes are in CVS and the patches to
GenBank/__init__.py and GenBank/genbank_format.py are attached. With
these I can parse your file without problems. I've also added a
couple of things which will (hopefully) speed up dealing with large
sequences some. Thanks for the bug report on this; let us know
if you come across anything else that fails.

> I've not tested this, since I think the format definition needs
> to be revisited first because I've now more experience in writing
> these things, and second because the LOCUS line definition is
> changing in the next couple months, according to
> 
>   ftp://ncbi.nlm.nih.gov/genbank/gbrel.txt

Yeah, I had read about this previously and I _think_ the format will
handle them (after some modifications I made a while back). In
test_GenBankFormat.py there are a couple of example locus lines with
this new format that it'll parse okay. We'll see if it will hold up
when the full-scale change comes on, though.

But, you are still more than welcome to attack the locus line
parsing anytime you feel up to it -- you are definitely the master
o' Martel :-).

Brad
-- 
PGP public key available from http://pgp.mit.edu/
-------------- next part --------------
Index: genbank_format.py
===================================================================
RCS file: /home/repository/biopython/biopython/Bio/GenBank/genbank_format.py,v
retrieving revision 1.8
diff -c -r1.8 genbank_format.py
*** genbank_format.py	2001/09/19 01:15:52	1.8
--- genbank_format.py	2001/09/27 20:02:47
***************
*** 85,92 ****
  residue_type = Martel.Group("residue_type",
                              Martel.Opt(Martel.Alt(*residue_prefixes)) +
                              Martel.Opt(Martel.Alt(*residue_types)) +
!                             Martel.Opt(blank_space +
                                         Martel.Str("circular")))
  date = Martel.Group("date",
                      Martel.Re("[-\w]+"))
  
--- 85,93 ----
  residue_type = Martel.Group("residue_type",
                              Martel.Opt(Martel.Alt(*residue_prefixes)) +
                              Martel.Opt(Martel.Alt(*residue_types)) +
!                             Martel.Opt(Martel.Opt(blank_space) + 
                                         Martel.Str("circular")))
+ 
  date = Martel.Group("date",
                      Martel.Re("[-\w]+"))
  
***************
*** 163,171 ****
                              Martel.Str("VERSION") +
                              blank_space +
                              version +
!                             blank_space +
!                             Martel.Str("GI:") +
!                             gi +
                              Martel.AnyEol())
  
  # DBSOURCE    REFSEQ: accession NM_010510.1
--- 164,172 ----
                              Martel.Str("VERSION") +
                              blank_space +
                              version +
!                             Martel.Opt(blank_space +
!                                        Martel.Str("GI:") +
!                                        gi) +
                              Martel.AnyEol())
  
  # DBSOURCE    REFSEQ: accession NM_010510.1
From biopython-bugs at bioperl.org  Thu Sep 27 16:11:50 2001
From: biopython-bugs at bioperl.org (biopython-bugs@bioperl.org)
Date: Sat Mar  5 14:43:05 2005
Subject: [Biopython-dev] Notification: incoming/43
Message-ID: <200109272011.f8RKBop24395@pw600a.bioperl.org>

JitterBug notification

chapmanb changed notes

Message summary for PR#43
	From: mkersz@pasteur.fr
	Subject: GenBank parser fails (on large files?)
	Date: Thu, 27 Sep 2001 05:22:01 -0400
	0 replies 	0 followups
	Notes: Parser problem was with a LOCUS line containing "circular" but no
sequence type information (ie. DNA, RNA, etc). This is fixed in CVS
in revision 1.24 of __init__.py and 1.9 of genbank_format.py


From biopython-bugs at bioperl.org  Thu Sep 27 16:11:50 2001
From: biopython-bugs at bioperl.org (biopython-bugs@bioperl.org)
Date: Sat Mar  5 14:43:05 2005
Subject: [Biopython-dev] Notification: incoming/43
Message-ID: <200109272011.f8RKBop24399@pw600a.bioperl.org>

JitterBug notification

chapmanb moved PR#43 from incoming to fixed-bugs
Message summary for PR#43
	From: mkersz@pasteur.fr
	Subject: GenBank parser fails (on large files?)
	Date: Thu, 27 Sep 2001 05:22:01 -0400
	0 replies 	0 followups
	Notes: Parser problem was with a LOCUS line containing "circular" but no
sequence type information (ie. DNA, RNA, etc). This is fixed in CVS
in revision 1.24 of __init__.py and 1.9 of genbank_format.py


From biopython-bugs at bioperl.org  Thu Sep 27 16:12:18 2001
From: biopython-bugs at bioperl.org (biopython-bugs@bioperl.org)
Date: Sat Mar  5 14:43:05 2005
Subject: [Biopython-dev] Notification: incoming/40
Message-ID: <200109272012.f8RKCIp24464@pw600a.bioperl.org>

JitterBug notification

chapmanb moved PR#40 from incoming to fixed-bugs
Message summary for PR#40
	From: joungjh@AptusGenomics.com
	Subject: retrieving GenBank records from NCBI
	Date: Tue, 14 Aug 2001 16:44:34 -0400
	0 replies 	0 followups

====> ORIGINAL MESSAGE FOLLOWS <====

>From joungjh@AptusGenomics.com Tue Aug 14 16:44:35 2001
Received: from localhost (localhost [127.0.0.1])
	by pw600a.bioperl.org (8.11.2/8.11.2) with ESMTP id f7EKiYq02770
	for <biopython-bugs@pw600a.bioperl.org>; Tue, 14 Aug 2001 16:44:34 -0400
Date: Tue, 14 Aug 2001 16:44:34 -0400
Message-Id: <200108142044.f7EKiYq02770@pw600a.bioperl.org>
From: joungjh@AptusGenomics.com
To: biopython-bugs@bioperl.org
Subject: retrieving GenBank records from NCBI

Full_Name: J. Joung
Module: GenBank
Version: biopython-1.00a2
OS: UNIX
Submission from: gw-aptusgen1.cust.fast.net (209.92.248.166)


I'm using GenBank NCBIDictionary to retrieve a GenBank record. The retrieved
record is missing the following information: LOCUS, DEFINITION, ACCESSION,
VERSION, and KEYWORDS.  

Is there a way of obtaining the GenBank id from a known locuslink id in
biopython?


From biopython-bugs at bioperl.org  Thu Sep 27 16:12:19 2001
From: biopython-bugs at bioperl.org (biopython-bugs@bioperl.org)
Date: Sat Mar  5 14:43:05 2005
Subject: [Biopython-dev] Notification: incoming/41
Message-ID: <200109272012.f8RKCIp24468@pw600a.bioperl.org>

JitterBug notification

chapmanb moved PR#41 from incoming to trash
Message summary for PR#41
	From: Jeffrey Chang <jchang@SMI.Stanford.EDU>
	Subject: Re: [Biopython-dev] Notification: incoming/40
	Date: Tue, 14 Aug 2001 22:46:45 -0700
	0 replies 	0 followups

====> ORIGINAL MESSAGE FOLLOWS <====

>From jchang@SMI.Stanford.EDU Wed Aug 15 01:45:11 2001
Received: from crg-gw.Stanford.EDU (root@crg-gw.Stanford.EDU [171.65.32.201])
	by pw600a.bioperl.org (8.11.2/8.11.2) with ESMTP id f7F5jAq05966
	for <biopython-bugs@bioperl.org>; Wed, 15 Aug 2001 01:45:11 -0400
Received: from [192.168.0.4] (c1128134-a.stcla1.sfba.home.com [24.176.209.55])
	by crg-gw.Stanford.EDU (8.11.5/8.11.5) with ESMTP id f7F5jDU24945;
	Tue, 14 Aug 2001 22:45:13 -0700 (PDT)
Mime-Version: 1.0
X-Sender: jchang@smi.stanford.edu (Unverified)
Message-Id: <p05101000b79fbcb4bcbf@[192.168.0.4]>
In-Reply-To: <200108142044.f7EKiZq02776@pw600a.bioperl.org>
References: <200108142044.f7EKiZq02776@pw600a.bioperl.org>
Date: Tue, 14 Aug 2001 22:46:45 -0700
To: biopython-bugs@bioperl.org, biopython-dev@biopython.org,
       joungjh@aptusgenomics.com
From: Jeffrey Chang <jchang@SMI.Stanford.EDU>
Subject: Re: [Biopython-dev] Notification: incoming/40
Content-Type: text/plain; charset="us-ascii" ; format="flowed"

At 4:44 PM -0400 8/14/01, biopython-bugs@bioperl.org wrote:
>Full_Name: J. Joung
>I'm using GenBank NCBIDictionary to retrieve a GenBank record. The retrieved
>record is missing the following information: LOCUS, DEFINITION, ACCESSION,
>VERSION, and KEYWORDS.

Is this information that's in the Genbank record?  It should be 
returning whatever NCBI returns, or raising an exception.  Dropping 
information would be odd.  Do you have a reproducible case?  What is 
the accession you're using?


>Is there a way of obtaining the GenBank id from a known locuslink id in
>biopython?

No, we don't have any locuslink functionality at the moment.

Jeff


From biopython-bugs at bioperl.org  Thu Sep 27 16:12:19 2001
From: biopython-bugs at bioperl.org (biopython-bugs@bioperl.org)
Date: Sat Mar  5 14:43:05 2005
Subject: [Biopython-dev] Notification: incoming/42
Message-ID: <200109272012.f8RKCJp24473@pw600a.bioperl.org>

JitterBug notification

chapmanb moved PR#42 from incoming to trash
Message summary for PR#42
	From: joungjh@email.com
	Subject: Re: [Biopython-dev] Notification: incoming/40
	Date: Wed, 15 Aug 2001 08:22:26 -0400
	0 replies 	0 followups

====> ORIGINAL MESSAGE FOLLOWS <====

>From joungjh@email.com Wed Aug 15 08:22:26 2001
Received: from localhost (localhost [127.0.0.1])
	by pw600a.bioperl.org (8.11.2/8.11.2) with ESMTP id f7FCMPq08874
	for <biopython-bugs@pw600a.bioperl.org>; Wed, 15 Aug 2001 08:22:26 -0400
Date: Wed, 15 Aug 2001 08:22:26 -0400
Message-Id: <200108151222.f7FCMPq08874@pw600a.bioperl.org>
From: joungjh@email.com
To: biopython-bugs@bioperl.org
Subject: Re: [Biopython-dev] Notification: incoming/40

Full_Name: 
Module: 
Version: 
OS: 
Submission from: gw-aptusgen1.cust.fast.net (209.92.248.166)


>>I'm using GenBank NCBIDictionary to retrieve a GenBank record. The retrieved
>>record is missing the following information: LOCUS, DEFINITION, ACCESSION,
>>VERSION, and KEYWORDS.

>Is this information that's in the Genbank record?  It should be 
>returning whatever NCBI returns, or raising an exception.  Dropping 
>information would be odd.  Do you have a reproducible?  What is the 
>accession you're using?

Yes, the LOCUS, DEFINITION, ACCESSION, VERSION, and KEYWORDS information is in
the GenBank record. Any GenBank id drops this information on UNIX. You can try
GenBank id '15145772'.  I have installed the biopython-1.00a1 Windows version
on my PC and it seems to return all information correctly. Thank you for your
quick response.


From thomas at cbs.dtu.dk  Fri Sep 28 00:51:45 2001
From: thomas at cbs.dtu.dk (Thomas Sicheritz-Ponten)
Date: Sat Mar  5 14:43:05 2005
Subject: [Biopython-dev] SeqIO
In-Reply-To: Brad Chapman's message of "Wed, 26 Sep 2001 22:47:54 -0400"
References: <y9vhetz5sj8.fsf@genome.cbs.dtu.dk>
	<20010926224754.E27721@ci350185-a.athen1.ga.home.com>
Message-ID: <y9vadzf3oke.fsf@delphinus.cbs.dtu.dk>

Brad Chapman <chapmanb@arches.uga.edu> writes:

thx for the comments !

> One big question I have is, how many of the features do you want to
> try and retain in the conversion? So, for GenBank format, do you
> want me to just write out the basic information (sequence, type,
> etc) and ignore the feature table, or do we want to somehow map the
> features from format to format (ie. EMBL <-> GenBank).
All of them. I think each GenBank feature has an exact equivalence in EMBL
and SwissProt (GenPept). So that leaves us just with the definition of the
corresponding feature names.

> 
> If we want to think about feature conversion, this'll be tougher and
> we'll need to think about converters between "similar" formats like
> EMBL and GenBank.
GenBank, EMBL and SwissProt ... where EMBL and SwissProt are almost
identical (I think...)

> => Why are you duplicating SeqRecord in the SeqIO stuff instead of
> just reusing it? I don't think I understand what you are talking
> about with stripping newlines...
I copied everything so that I could play around without breaking e.g. your
code. Now I think the changes are actually backward compatible - so we
could move it back.
> 
> => Is there a way to plug in a specialized converter for similar
> formats, like I was talking about above with EMBL/GenBank? I think
> Jeff suggested this earlier, and it seems like a good idea to me. I
> guess right now you could subclass ReadSeq and define your own
> Convert function, but maybe there is another way to do it.
I don't know if I understood this question...


A colleague and I are thinking about converting SWISSPROT into an SQL
database for local use ...  which actually gets close to a former
discussion where Andrew and I dreamed about a Python variant of SRS !
My question: does anybody know of already existing SQL tables for
SWISSPROT ? The step after that is actually creating a Python interface for
generic queries, which would beat SRS ... at least on SWISSPROT.


cheers
-thomas

P.S. is anybody going to the Atlanta meeting in November ?
-- 
Sicheritz-Ponten Thomas, Ph.D  CBS, Department of Biotechnology
thomas@biopython.org           The Technical University of Denmark
CBS:  +45 45 252489            Building 208, DK-2800 Lyngby
Fax   +45 45 931585            http://www.cbs.dtu.dk/thomas

	De Chelonian Mobile ... The Turtle Moves ...

From chapmanb at arches.uga.edu  Thu Sep 27 17:02:33 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:43:05 2005
Subject: [Biopython-dev] Re: Blast parser
In-Reply-To: <OIEFKBIBGGKFMCHEPFMJMEOHCAAA.j.joung@AptusGenomics.com>
References: <OIEFKBIBGGKFMCHEPFMJMEOHCAAA.j.joung@AptusGenomics.com>
Message-ID: <20010927170233.A29348@ci350185-a.athen1.ga.home.com>

Hi Jeong;
I'm ccing this message to biopython-dev@biopython.org. By the way,
that list is probably a better place to ask your questions than
asking me directly, as there are lots of people there to help.

> Hello, I would like to know if the blast standalone parser supports parsing
> of the BLASTX results. When I use blast standalone parser to parse BLASTX
> results, I get an error message of the following:
[...]
> SyntaxError: Line does not start with 'length of query':
> length of database: 27,975,647

Yup, it looks like the blastx output format has changed somewhat
since the last time it was used/tested with blastx. The specific
things that have changed are the lack of the following lines in
blastx output:

'length of query'
'effective length of query'
'effective search space:'
'S2'

I've fixed Bio/Blast/NCBIStandalone.py so that it works again on
blastx. The diff to this file is attached. Jeff, if you have a
chance could you give me the okay on this before I check it in? The
current regression tests all pass with these changes. When I check
it in, I can also add the blastx example file I used to fix this.
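
One detail of the fix is worth a small illustration: once the 'effective search space' line becomes optional, its prefix also matches the 'effective search space used' line, so the wrong consumer method could swallow that line. The sketch below uses a simplified stand-in called `attempt_call` (not the real attempt_read_and_call) and a made-up sample line to show why the trailing ':' matters.

```python
def attempt_call(lines, callback, start):
    """Consume lines[0] and pass it to callback if it starts with `start`.

    Simplified stand-in for an optional-line reader; returns True when
    the line was consumed.
    """
    if lines and lines[0].startswith(start):
        callback(lines.pop(0))
        return True
    return False


def run(search_space_prefix):
    """Return the names of the handlers that fired, in order."""
    # Output that has only the "used" line; the value is made up.
    lines = ["effective search space used: 200"]
    calls = []
    attempt_call(lines, lambda line: calls.append("effective_search_space"),
                 search_space_prefix)
    attempt_call(lines, lambda line: calls.append("effective_search_space_used"),
                 "effective search space used")
    return calls


print(run("effective search space"))
# prints ['effective_search_space']  (the wrong handler took the line)
print(run("effective search space:"))
# prints ['effective_search_space_used']
```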

Jeong, thanks for the bug report! Please let us know if this fix
doesn't get things working again for you.

Brad
-- 
PGP public key available from http://pgp.mit.edu/
-------------- next part --------------
*** NCBIStandalone.py.orig	Wed Sep  5 17:22:14 2001
--- NCBIStandalone.py	Thu Sep 27 17:01:34 2001
***************
*** 462,481 ****
                            start="Number of HSP's that")
              read_and_call(uhandle, consumer.hsps_gapped,
                            start="Number of HSP's gapped")
! 
!         read_and_call(uhandle, consumer.query_length,
!                       start='length of query')
          read_and_call(uhandle, consumer.database_length,
                        start='length of database')
  
          read_and_call(uhandle, consumer.effective_hsp_length,
                        start='effective HSP')
!         read_and_call(uhandle, consumer.effective_query_length,
!                       start='effective length of query')
          read_and_call(uhandle, consumer.effective_database_length,
                        start='effective length of database')
!         read_and_call(uhandle, consumer.effective_search_space,
!                       start='effective search space')
          # Does not appear in BLASTP 2.0.5
          attempt_read_and_call(uhandle, consumer.effective_search_space_used,
                                start='effective search space used')
--- 462,484 ----
                            start="Number of HSP's that")
              read_and_call(uhandle, consumer.hsps_gapped,
                            start="Number of HSP's gapped")
!         # not in blastx 2.2.1
!         attempt_read_and_call(uhandle, consumer.query_length,
!                               start='length of query')
          read_and_call(uhandle, consumer.database_length,
                        start='length of database')
  
          read_and_call(uhandle, consumer.effective_hsp_length,
                        start='effective HSP')
!         # Not in blastx 2.2.1
!         attempt_read_and_call(uhandle, consumer.effective_query_length,
!                               start='effective length of query')
          read_and_call(uhandle, consumer.effective_database_length,
                        start='effective length of database')
!         # Not in blastx 2.2.1, added a ':' to distinguish between
!         # this and the 'effective search space used' line
!         attempt_read_and_call(uhandle, consumer.effective_search_space,
!                               start='effective search space:')
          # Does not appear in BLASTP 2.0.5
          attempt_read_and_call(uhandle, consumer.effective_search_space_used,
                                start='effective search space used')
***************
*** 490,496 ****
          attempt_read_and_call(uhandle, consumer.gap_x_dropoff_final,
                                start='X3')
          read_and_call(uhandle, consumer.gap_trigger, start='S1')
!         read_and_call(uhandle, consumer.blast_cutoff, start='S2')
  
          consumer.end_parameters()
  
--- 493,507 ----
          attempt_read_and_call(uhandle, consumer.gap_x_dropoff_final,
                                start='X3')
          read_and_call(uhandle, consumer.gap_trigger, start='S1')
!         # not in blastx 2.2.1
!         # need to enclose this inside a try/except because 
!         # attempt_read_and_call will still complain about end of stream.
!         # All attempts are made to be sure we've got the expected error
!         try:
!             read_and_call(uhandle, consumer.blast_cutoff, start='S2')
!         except SyntaxError, reason:
!             assert str(reason) == "Unexpected end of stream.", \
!               "Unexpected reason: '%s'" % reason
  
          consumer.end_parameters()
  
From jchang at SMI.Stanford.EDU  Thu Sep 27 19:16:15 2001
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:43:05 2005
Subject: [Biopython-dev] Re: Blast parser
In-Reply-To: <20010927170233.A29348@ci350185-a.athen1.ga.home.com>
References: <OIEFKBIBGGKFMCHEPFMJMEOHCAAA.j.joung@AptusGenomics.com>
 <20010927170233.A29348@ci350185-a.athen1.ga.home.com>
Message-ID: <p05101001b7d962d4c403@[171.65.33.250]>

Great!  Thanks a lot.  The patch looks really good.  The only 
thing is, can you change the try: except: to an explicit test for the 
end of the stream?  That would be more robust to changes in the error 
message.

         try:
             read_and_call(uhandle, consumer.blast_cutoff, start='S2')
         except SyntaxError, reason:
             assert str(reason) == "Unexpected end of stream.", \
               "Unexpected reason: '%s'" % reason

(untested)

if uhandle.peekline():
   attempt_read_and_call(uhandle, consumer.blast_cutoff, start='S2')
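
A self-contained sketch of the same idea, for anyone who wants to try it in isolation. `PeekableLines` and this `attempt_read_and_call` are simplified stand-ins written for illustration, modelled on the behaviour described in this thread (raising at end of stream); they are not the actual Biopython implementations, and the sample 'S2' line is made up.

```python
class PeekableLines:
    """Minimal line source with one-line lookahead (hypothetical)."""

    def __init__(self, lines):
        self._lines = list(lines)
        self._pos = 0

    def peekline(self):
        """Return the next line without consuming it, or '' at the end."""
        if self._pos < len(self._lines):
            return self._lines[self._pos]
        return ""

    def readline(self):
        line = self.peekline()
        if line:
            self._pos += 1
        return line


def attempt_read_and_call(handle, callback, start):
    """Call callback on the next line if it starts with `start`."""
    line = handle.peekline()
    if not line:
        raise SyntaxError("Unexpected end of stream.")
    if not line.startswith(start):
        return False
    callback(handle.readline())
    return True


def read_optional_cutoff(handle, callback):
    """Read the 'S2' line only if there is anything left to read."""
    # Explicit end-of-stream test: no reliance on the exception's wording.
    if handle.peekline():
        return attempt_read_and_call(handle, callback, "S2")
    return False


seen = []
print(read_optional_cutoff(PeekableLines(["S2: 1\n"]), seen.append))  # True
print(read_optional_cutoff(PeekableLines([]), seen.append))           # False
```

Because the check happens before the read, a reworded error message elsewhere cannot break it.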


Jeff



From chapmanb at arches.uga.edu  Thu Sep 27 19:59:16 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:43:05 2005
Subject: [Biopython-dev] Re: Blast parser
In-Reply-To: <p05101001b7d962d4c403@[171.65.33.250]>
References: <OIEFKBIBGGKFMCHEPFMJMEOHCAAA.j.joung@AptusGenomics.com> <20010927170233.A29348@ci350185-a.athen1.ga.home.com> <p05101001b7d962d4c403@[171.65.33.250]>
Message-ID: <20010927195916.A29452@ci350185-a.athen1.ga.home.com>

[problems with new blastx output and proposed patch]

> Great!  Thanks a lot.  The patch looks really good.  The only 
> thing is, can you change the try: except: to an explicit test for the 
> end of the stream?

Great idea -- thanks! This is much nicer than the uglish try/except
I was using. I've checked in the patch with your suggested change,
as well as the test blastx file and associated changes to the test
suite. Everything seems to pass a-okay.

Brad
-- 
PGP public key available from http://pgp.mit.edu/

From chapmanb at arches.uga.edu  Fri Sep 28 08:17:02 2001
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:43:05 2005
Subject: [Biopython-dev] GenBank parser fails (on large files?)
In-Reply-To: <200109281141.f8SBfDM188776@electre.pasteur.fr>
References: <200109281141.f8SBfDM188776@electre.pasteur.fr>
Message-ID: <20010928081702.A973@ci350185-a.athen1.ga.home.com>

Hi Michel; 

> Thanks, the fix worked. 

Great to hear. Thanks for reporting back.

> However your solution to make parsing of large sequences 
> faster has currently a side effect. If I print the first 
> feature with qualifier 'translation', I get
> 
> ['MTDDPGSGFTTVWNAVVSELNGDPKVDDGPSSDANLSAPLTPQQ\012                     
>               CRELTDLSLPKIGQAFGRDHTTVMYAQRKILSEMAERREVFDHVKELTTRIRQRSKR']
> 
> whereas before I would have gotten a slightly different result:
> 
> "MTDDPGSGFTTVWNAVVSELNGDPKVDDGPSSDANLSAPLTPQQ\012                     
>               CRELTDLSLPKIGQAFGRDHTTVMYAQRKILSEMAERREVFDHVKELTTRIRQRSKR"

This is actually not a side-effect of the recent changes, but a
deliberate change I made in CVS. I wrote a long message about this
last week concerning non-compatible fixes I made to the GenBank
SeqFeature parser:

http://www.biopython.org/pipermail/biopython-dev/2001-September/000579.html

The part related to your problem involves how I was handling features
that had multiple qualifier keys with the same name (ie. two
'translation' keys). Previously, I was doing something really ugly
-- appending numbers on to the end of multiple keys to make them
unique (translation, translation1, translation2 ...). This allowed
me to have one key and one string value and store things in a
dictionary.

But this is an ugly way to do things and actually makes life very
hard for people who want to get, say, all translation qualifiers
in a feature (if there were multiple translations). The fix was to
use the qualifier key and store the values as a list, ie:

qualifiers = {"translation" : ["CREL", "CRET"]}

When there is only one qualifier with a given name, I also store it as a list
to help people avoid having to do:

if type(qualifier[key]) == type(""):
    # do something with the string
elif type(qualifier[key]) == type([]):
    # do something with the list

in their code.
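
A small sketch of the list-valued storage described above, in current
Python. The helper name add_qualifier and the 'gene' entry are made up
for the example; this is not the parser's actual code.

```python
def add_qualifier(qualifiers, key, value):
    """Append value under key, creating the list the first time key is seen."""
    qualifiers.setdefault(key, []).append(value)


qualifiers = {}
add_qualifier(qualifiers, "translation", "CREL")
add_qualifier(qualifiers, "translation", "CRET")
add_qualifier(qualifiers, "gene", "dnaA")  # a single value is still stored as a list

# Callers can always loop over the values, with no type check needed.
for key in sorted(qualifiers):
    for value in qualifiers[key]:
        print(key, value)
```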

I am definitely sensitive to the fact that the change is bad news
for current code -- I'm sorry about that; it's all due to that bad
design decision I made earlier.

> Now the problem is, I had a hack to shape this string better, namely
> 
> >>> newseq= string.join(string.split(sq.qualifiers['translation']), sep='')
> 
> This works with the " " form, but not with the [' '] form, which is how I 
> noticed the difference. 

Yes, sorry about that. A potential change (untested) would be:

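clean_translations = []
for translation in sq.qualifiers['translation']:
    # split() with no arguments drops the newlines and the padding spaces
    clean_translations.append(''.join(translation.split()))
# store the cleaned list back under the same 'translation' key
sq.qualifiers['translation'] = clean_translations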

But on to the other problem:

[Talking about the translation]
> Note, incidentally, that this is a bit ugly, because the \012's and spaces 
> should have been cleaned out 

I agree with you here -- I haven't yet done any work at massaging
the feature value information. I'll think about a good way to do
this (I'm sure there are other cases where this also needs to be
done), and try to get something done on it this weekend.
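
One possible shape for that cleanup, as a sketch only. The split between
keys whose values lose all whitespace (here just 'translation') and keys
whose wrapped text is re-joined with single spaces is an assumption for
illustration, as are the names WHITESPACE_FREE_KEYS, clean_value and
clean_qualifiers and the sample 'note' value. Written for current Python.

```python
# Assumption: only sequence-like qualifiers should lose every space.
WHITESPACE_FREE_KEYS = {"translation"}


def clean_value(key, value):
    """Undo line wrapping in a single qualifier value."""
    parts = value.split()  # splits on newlines and on runs of spaces
    joiner = "" if key in WHITESPACE_FREE_KEYS else " "
    return joiner.join(parts)


def clean_qualifiers(qualifiers):
    """Return a new dict with every value in every list cleaned."""
    return {
        key: [clean_value(key, value) for value in values]
        for key, values in qualifiers.items()
    }


raw = {
    "translation": ["MTDDPGSGF\n                     CRELTDLSL"],
    "note": ["wrapped free\n                     text value"],  # invented sample
}
cleaned = clean_qualifiers(raw)
```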

Thanks again for the feedback.
Brad
-- 
PGP public key available from http://pgp.mit.edu/

From j.joung at AptusGenomics.com  Fri Sep 28 09:36:02 2001
From: j.joung at AptusGenomics.com (Jeong Joung)
Date: Sat Mar  5 14:43:05 2005
Subject: [Biopython-dev] Re: Blast parser
In-Reply-To: <p05101001b7d962d4c403@[171.65.33.250]>
Message-ID: <OIEFKBIBGGKFMCHEPFMJAEOJCAAA.j.joung@AptusGenomics.com>

Thank you so much for your responses. The changes work really well.

Jeong

-----Original Message-----
From: Jeffrey Chang [mailto:jchang@SMI.Stanford.EDU]
Sent: Thursday, September 27, 2001 7:16 PM
To: Brad Chapman; Jeong Joung
Cc: biopython-dev@biopython.org
Subject: [Biopython-dev] Re: Blast parser


Great!  Thanks a lot.  The patch looks really good.  The only
thing is, can you change the try: except: to an explicit test for the
end of the stream?  That would be more robust to changes in the error
message.

         try:
             read_and_call(uhandle, consumer.blast_cutoff, start='S2')
         except SyntaxError, reason:
             assert str(reason) == "Unexpected end of stream.", \
               "Unexpected reason: '%s'" % reason

(untested)

if uhandle.peekline():
   attempt_read_and_call(uhandle, consumer.blast_cutoff, start='S2')


Jeff


At 5:02 PM -0400 9/27/01, Brad Chapman wrote:
>Hi Jeong;
>I'm ccing this message to biopython-dev@biopython.org. By the way,
>asking your questions there is probably a better place than asking
>me directly, as there are lots of people there to help.
>
>>  Hello, I would like to know if the blast standalone parser supports parsing
>>  of the BLASTX results. When I use blast standalone parser to parse BLASTX
>>  results, I get an error message of the following:
>[...]
>>  SyntaxError: Line does not start with 'length of query':
>>  length of database: 27,975,647
>
>Yup, it looks like the blastx output format has changed somewhat
>since the last time it was used/tested with blastx. The specific
>things that have changed are the lack of the following lines in
>blastx output:
>
>'length of query'
>'effective length of query'
>'effective search space:'
>'S2'
>
>I've fixed Bio/Blast/NCBIStandalone.py so that it works again on
>blastx. The diff to this file is attached. Jeff, if you have a
>chance could you give me the okay on this before I check it in? The
>current regression tests all pass with these changes. When I check
>it in, I can also add the blastx example file I used to fix this.
>
>Jeong, thanks for the bug report! Please let us know if this fix
>doesn't get things working again for you.
>
>Brad
>--
>PGP public key available from http://pgp.mit.edu/
>
>Attachment converted: Macintosh HD:NCBIStandalone.py.diff
>(TEXT/text) (0015B4C0)


From katel at worldpath.net  Sun Sep 30 20:41:35 2001
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:43:05 2005
Subject: [Biopython-dev] what I'm up to
Message-ID: <002201c14a11$d4425160$010a0a0a@cadence.com>

  I'm still looking into nexus, but I'm not sure where the NCL library fits
in.  Do we use it to read in nexus files?  How much continuing support and
development can we expect?  What will the library buy us?  Also, nexus is
more than a sequence format.  It supports phylogenetic and other types of
data.  Mostly, we support sequence data, although we're doing a little with
pathways.

 I'm in the process of moving from our summer home to our winter home and
work has turned into an alligator swamp since I got back from vacation.  For
this reason I plan to continue investigating nexus but at a low level.
I'd like to write a parser for MASE instead (IntelliGenetic format)
because it is an almost-FASTA format and should be doable without a large
time commitment that I can't promise for a month or so.

  Let me know if someone has already written a MASE parser or has ideas
about nexus or NCL.

                                                                 Cayte