From winda002 at student.otago.ac.nz Wed Jul 1 02:22:08 2009 From: winda002 at student.otago.ac.nz (David WInter) Date: Wed, 01 Jul 2009 18:22:08 +1200 Subject: [Biopython] Bio.Sequencing.Ace In-Reply-To: <761477.83949.qm@web65501.mail.ac4.yahoo.com> References: <761477.83949.qm@web65501.mail.ac4.yahoo.com> Message-ID: <4A4B0090.70903@student.otago.ac.nz> Fungazid wrote: > David hi, > > Many many thanks for the diagram. > I'm not sure I understand the differences between contig.af[readn].padded_start, and contig.bs[readn].padded_start, and other unknown parameters. I'll try to compare to the Ace format > > Avi > Hi again Avi, I took me a while to get to grips with the difference, the 'bs' list is a mapping of the contig's consensus to the particular read that was used to as the 'base segment' in that region. If you have a monospaced font in your email client this might help: consensus |===================================| +---read3---x +---read5--x +--read1---x (which would give a contig.bs list with 3 bs instances) I'm not sure that this is particularly important information for a 454 assembly ;) I've updated the examples on the wiki page a little, if you find anything else that you think should be there feel free to add to it Cheers, David From p.j.a.cock at googlemail.com Wed Jul 1 03:44:12 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 1 Jul 2009 08:44:12 +0100 Subject: [Biopython] [Bioperl-l] Next-gen modules In-Reply-To: <1d06cd5d0906300428x59c004f1h200bfe3c23ed769@mail.gmail.com> References: <6FA80489-D779-4247-B9EE-BB08ECEA0F8A@ucl.ac.uk> <92C15E3391F64BAF801754E924122540@NewLife> <200906170927.13273.tristan.lefebure@gmail.com> <1d06cd5d0906300428x59c004f1h200bfe3c23ed769@mail.gmail.com> Message-ID: <320fb6e00907010044v38480030hd5cf89ad149cf738@mail.gmail.com> Hi all (BioPerl and Biopython), This is a continuation of a long thread on the BioPerl mailing list, which I have now CC'd to the Biopython mailing list. See: http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030265.html On this thread we have been discussing next gen sequencing tools and co-coordinating things like consistent file format naming between Biopython, BioPerl and EMBOSS. I've been chatting to Peter Rice (EMBOSS) while at BOSC/ISMB 2009, and he will look into setting up a cross project mailing list for this kind of discussion in future. In the mean time, my replies to Giles below cover both BioPerl and Biopython (and EMBOSS). Giles' original email is here: http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030398.html Peter On 6/30/09, Giles Weaver wrote: > > I'm developing a transcriptomics database for use with next-gen data, and > have found processing the raw data to be a big hurdle. > > I'm a bit late in responding to this thread, so most issues have already > been discussed. One thing that hasn't been mentioned is removal of adapters > from raw Illumina sequence. This is a PITA, and I'm not aware of any well > developed and documented open source software for removal of adapters > (and poor quality sequence) from Illumina reads. > > My current Illumina sequence processing pipeline is an unholy mix of > biopython, bioperl, pure perl, emboss and bowtie. Biopython for converting > the Illumina fastq to Sanger fastq, bioperl to read the quality values, > pure perl to trim the poor quality sequence from each read, and bioperl > with emboss to remove the adapter sequence. I'm aware that the pipeline > contains bugs and would like to simplify it, but at least it does work... 
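The Illumina-to-Sanger FASTQ conversion step Giles describes can be written in a few lines with Bio.SeqIO. Treat this as a sketch only - the file names are placeholders, and the "fastq-illumina" format name is only understood by sufficiently recent Biopython releases, so check the SeqIO documentation for the version you have installed:

-----------------------------------------
# Sketch: convert Illumina 1.3+ style FASTQ into Sanger-style FASTQ.
# File names are placeholders; "fastq" on its own means Sanger PHRED encoding.
from Bio import SeqIO

count = SeqIO.write(
    SeqIO.parse(open("illumina_reads.fastq"), "fastq-illumina"),
    open("sanger_reads.fastq", "w"),
    "fastq")
print "Converted %i reads" % count
-----------------------------------------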
> > Ideally I'd like to replace as much of the pipeline as possible with > bioperl/bioperl-run, but this isn't currently possible due to both a lack > of features and poor performance. I'm sure the features will come with > time, but the performance is more of a concern to me. .. I gather you would rather work with (Bio)Perl, but since you are already using Biopython to do the FASTQ conversion, you could also use it for more of your pipe line. Our tutorial includes examples of simple FASTQ quality filtering, and trimming of primer sequences (something like this might be helpful for removing adaptors). See: http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf Alternatively, with the new release of EMBOSS this July, you will also be able to do the Illumina FASTQ to Sanger standard FASTQ with EMBOSS, and I'm sure BioPerl will offer this soon too. > Regarding trimming bad quality bases (see comments from > Tristan Lefebure) from Solexa/Illumina reads, I did find a mixed > pure/bioperl solution to be much faster than a primarily bioperl > based implementation. I found Bio::Seq->subseq(a,b) and > Bio::Seq->subqual(a,b) to be far too slow. My current code trims > ~1300 sequences/second, including unzipping the raw data and > converting it to sanger fastq with biopython. Processing an entire > sequencing run with the whole pipeline takes in the region of 6-12h. There are several ways of doing quality trimming, and it would make an excellent cookbook example (both for BioPerl and Biopython). Could you go into a bit more detail about your trimming algorithm? e.g. Do you just trim any bases on the right below a certain threshold, perhaps with a minimum length to retain the trimmed read afterwards? > Hope this looooong post was of interest to someone! I was interested at least ;) Peter From stran104 at chapman.edu Wed Jul 1 06:18:42 2009 From: stran104 at chapman.edu (Matthew Strand) Date: Wed, 1 Jul 2009 03:18:42 -0700 Subject: [Biopython] Dealing with Non-RefSeq IDs / InParanoid In-Reply-To: References: <2a63cc350906201854v7de4e7n9991386ce9339305@mail.gmail.com> <320fb6e00906210334j4c9318adhdb5945033acb61fe@mail.gmail.com> <2a63cc350906231653m4ce88a69o351e377931401659@mail.gmail.com> <2a63cc350906231658k5beedfabu5915cba59c66d45f@mail.gmail.com> <320fb6e00906240224x40c62dd5vaeb748308f9843f4@mail.gmail.com> <2a63cc350906302001j62ece72aicdd619b9e8e040d6@mail.gmail.com> Message-ID: <2a63cc350907010318v597f0649u78168decde54d710@mail.gmail.com> Sure, I can create a page tomorrow when I get into the office. Perhaps "Retrieving Sequences Based on ID" would be appropriate. Alternative suggestions are welcome. On Tue, Jun 30, 2009 at 8:53 PM, Iddo Friedberg wrote: > Thanks. There is a wiki-based cookbook in the biopython site. Would you > like to put it up there? > > Iddo Friedberg > http://iddo-friedberg.net/contact.html > > On Jun 30, 2009 8:02 PM, "Matthew Strand" wrote: > > For the benefit of future users who find this thread through a search, I > would like to share how to retreive a sequence from NCBI given a non-NCBI > protein ID (or other ID). This was question 3 in my original message. > > Suppose you have a non-NCBI protein ID, say CE23997 (from WormBase) and you > want to retrieve the sequence from NCBI. > > You can use Bio.Entrez.esearch(db='protein', term='CE23997') to get a list > of NCBI GIs that refrence this identifer. In this case there is only one > (17554770). 
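Pulling the esearch lookup just described together with the efetch call covered in the next paragraph gives a short, self-contained sketch. The WormBase ID and GI are the ones from this thread, and NCBI ask you to supply a contact email address:

-----------------------------------------
# Sketch: map a non-NCBI identifier to a GI with esearch, then fetch the
# FASTA record with efetch (described in the next paragraph).
from Bio import Entrez, SeqIO

Entrez.email = "your.name@example.com"   # placeholder - use your own address

handle = Entrez.esearch(db="protein", term="CE23997")
gi_list = Entrez.read(handle)["IdList"]  # here this is just ['17554770']

if gi_list:
    fetched = Entrez.efetch(db="protein", id=gi_list[0], rettype="fasta")
    record = SeqIO.read(fetched, "fasta")
    print record.id, len(record.seq)
-----------------------------------------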
> > Then you can get the sequence using Entrez.efetch(db="protein", > id='17554770', rettype="fasta"). > > This may be obvious to some, but it was not to me; primarially because I > was > unaware of the esearch functionality. > > -- > Matthew Strand > > _______________________________________________ Biopython mailing list - > Biopython at lists.open-bio.... > > -- Matthew Strand From cjfields at illinois.edu Wed Jul 1 08:35:14 2009 From: cjfields at illinois.edu (Chris Fields) Date: Wed, 1 Jul 2009 07:35:14 -0500 Subject: [Biopython] [Bioperl-l] Next-gen modules In-Reply-To: <320fb6e00907010044v38480030hd5cf89ad149cf738@mail.gmail.com> References: <6FA80489-D779-4247-B9EE-BB08ECEA0F8A@ucl.ac.uk> <92C15E3391F64BAF801754E924122540@NewLife> <200906170927.13273.tristan.lefebure@gmail.com> <1d06cd5d0906300428x59c004f1h200bfe3c23ed769@mail.gmail.com> <320fb6e00907010044v38480030hd5cf89ad149cf738@mail.gmail.com> Message-ID: <30B8D613-EDD6-4F2F-9B29-C34B8F60CB2E@illinois.edu> Peter, I just committed a fix to FASTQ parsing last night to support read/ write for Sanger/Solexa/Illumina following the biopython convention; the only thing needed is more extensive testing for the quality scores. There are a few other oddities with it I intend to address soon, but it appears to be working. The Seq instance iterator actually calls a raw data iterator (hash refs of named arguments to the class constructor). That should act as a decent filtering step if needed. We have automated EMBOSS wrapping but I'm not sure how intuitive it is; we can probably reconfigure some of that. chris On Jul 1, 2009, at 2:44 AM, Peter Cock wrote: > Hi all (BioPerl and Biopython), > > This is a continuation of a long thread on the BioPerl mailing > list, which I have now CC'd to the Biopython mailing list. See: > http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030265.html > > On this thread we have been discussing next gen sequencing > tools and co-coordinating things like consistent file format > naming between Biopython, BioPerl and EMBOSS. I've been > chatting to Peter Rice (EMBOSS) while at BOSC/ISMB 2009, > and he will look into setting up a cross project mailing list for > this kind of discussion in future. > > In the mean time, my replies to Giles below cover both BioPerl > and Biopython (and EMBOSS). Giles' original email is here: > http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030398.html > > Peter > > On 6/30/09, Giles Weaver wrote: >> >> I'm developing a transcriptomics database for use with next-gen >> data, and >> have found processing the raw data to be a big hurdle. >> >> I'm a bit late in responding to this thread, so most issues have >> already >> been discussed. One thing that hasn't been mentioned is removal of >> adapters >> from raw Illumina sequence. This is a PITA, and I'm not aware of >> any well >> developed and documented open source software for removal of adapters >> (and poor quality sequence) from Illumina reads. >> >> My current Illumina sequence processing pipeline is an unholy mix of >> biopython, bioperl, pure perl, emboss and bowtie. Biopython for >> converting >> the Illumina fastq to Sanger fastq, bioperl to read the quality >> values, >> pure perl to trim the poor quality sequence from each read, and >> bioperl >> with emboss to remove the adapter sequence. I'm aware that the >> pipeline >> contains bugs and would like to simplify it, but at least it does >> work... 
>> >> Ideally I'd like to replace as much of the pipeline as possible with >> bioperl/bioperl-run, but this isn't currently possible due to both >> a lack >> of features and poor performance. I'm sure the features will come >> with >> time, but the performance is more of a concern to me. .. > > I gather you would rather work with (Bio)Perl, but since you are > already using Biopython to do the FASTQ conversion, you could > also use it for more of your pipe line. Our tutorial includes examples > of simple FASTQ quality filtering, and trimming of primer sequences > (something like this might be helpful for removing adaptors). See: > http://biopython.org/DIST/docs/tutorial/Tutorial.html > http://biopython.org/DIST/docs/tutorial/Tutorial.pdf > > Alternatively, with the new release of EMBOSS this July, you will > also be able to do the Illumina FASTQ to Sanger standard FASTQ > with EMBOSS, and I'm sure BioPerl will offer this soon too. > >> Regarding trimming bad quality bases (see comments from >> Tristan Lefebure) from Solexa/Illumina reads, I did find a mixed >> pure/bioperl solution to be much faster than a primarily bioperl >> based implementation. I found Bio::Seq->subseq(a,b) and >> Bio::Seq->subqual(a,b) to be far too slow. My current code trims >> ~1300 sequences/second, including unzipping the raw data and >> converting it to sanger fastq with biopython. Processing an entire >> sequencing run with the whole pipeline takes in the region of 6-12h. > > There are several ways of doing quality trimming, and it would > make an excellent cookbook example (both for BioPerl and > Biopython). > > Could you go into a bit more detail about your trimming > algorithm? e.g. Do you just trim any bases on the right below > a certain threshold, perhaps with a minimum length to retain > the trimmed read afterwards? > >> Hope this looooong post was of interest to someone! > > I was interested at least ;) > > Peter > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From giles.weaver at googlemail.com Wed Jul 1 12:27:22 2009 From: giles.weaver at googlemail.com (Giles Weaver) Date: Wed, 1 Jul 2009 17:27:22 +0100 Subject: [Biopython] [Bioperl-l] Next-gen modules In-Reply-To: <30B8D613-EDD6-4F2F-9B29-C34B8F60CB2E@illinois.edu> References: <6FA80489-D779-4247-B9EE-BB08ECEA0F8A@ucl.ac.uk> <92C15E3391F64BAF801754E924122540@NewLife> <200906170927.13273.tristan.lefebure@gmail.com> <1d06cd5d0906300428x59c004f1h200bfe3c23ed769@mail.gmail.com> <320fb6e00907010044v38480030hd5cf89ad149cf738@mail.gmail.com> <30B8D613-EDD6-4F2F-9B29-C34B8F60CB2E@illinois.edu> Message-ID: <1d06cd5d0907010927p4aad2a7re7ce1e65245e67de@mail.gmail.com> Peter, the trimming algorithm I use employs a sliding window, as follows: - For each sequence position calculate the mean phred quality score for a window around that position. - Record whether the mean score is above or below a threshold as an array of zeros and ones. - Use a regular expression on the joined array to find the start and end of the good quality sequence(s). - Extract the quality sequence(s) and replace any bases below the quality threshold with N. - Trim any Ns from the ends. A refinement would be to weight the scores from positions in the window, but this could give a performance hit, and the method seems to work well enough as is. Chris, thanks for committing the fix, I'll give bioperl illumina fastq parsing a workout soon. 
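A rough Python translation of the sliding-window idea above, for anyone who wants to try it on the Biopython side. This is not Giles' code: the window size, threshold and minimum length are illustrative values, and it keeps only the single longest good stretch per read rather than extracting all good regions or masking bases with N:

-----------------------------------------
# Sketch of sliding-window quality trimming on Sanger-encoded FASTQ.
from Bio import SeqIO

WINDOW = 5        # positions averaged around each base (illustrative)
THRESHOLD = 20    # minimum mean PHRED score to call a position "good"
MIN_LENGTH = 15   # shortest trimmed read worth keeping

def longest_good_run(record):
    """Return (start, end) of the longest run of good-quality positions."""
    quals = record.letter_annotations["phred_quality"]
    flags = []
    for i in range(len(quals)):
        window = quals[max(0, i - WINDOW // 2): i + WINDOW // 2 + 1]
        flags.append(1 if float(sum(window)) / len(window) >= THRESHOLD else 0)
    best, start = (0, 0), None
    for i, flag in enumerate(flags + [0]):   # trailing 0 closes a final run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best

trimmed = []
for rec in SeqIO.parse(open("reads_sanger.fastq"), "fastq"):
    start, end = longest_good_run(rec)
    if end - start >= MIN_LENGTH:
        trimmed.append(rec[start:end])   # slicing keeps the quality scores
SeqIO.write(trimmed, open("trimmed.fastq", "w"), "fastq")
-----------------------------------------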
Peter, as much as I'd love to help out with biopython, I'm under too much time pressure right now! Jonathan, some of the Illumina sequencing adapters are listed at http://intron.ccam.uchc.edu/groups/tgcore/wiki/013c0/Solexa_Library_Primer_Sequences.htmland http://seqanswers.com/forums/showthread.php?t=198 Adapter sequence typically appears towards the end of the read, though the latter part of it is often misread as the sequencing quality drops off. I abuse needle (EMBOSS) into aligning the adapter sequence with each read. I then use Bio::AlignIO, Bio::Range and a custom scoring scheme to identify real alignments and trim the sequence. This is not the ideal way of doing things, but it's fast enough, and does seem to work. The adapter sequence shouldn't be gapped, so I'm sure there is a lot of scope for optimising the adapter removal. I'll happily share some code once I've got it to the stage where I'm not embarrassed by it! Giles 2009/7/1 Chris Fields > Peter, > > I just committed a fix to FASTQ parsing last night to support read/write > for Sanger/Solexa/Illumina following the biopython convention; the only > thing needed is more extensive testing for the quality scores. There are a > few other oddities with it I intend to address soon, but it appears to be > working. > > The Seq instance iterator actually calls a raw data iterator (hash refs of > named arguments to the class constructor). That should act as a decent > filtering step if needed. > > We have automated EMBOSS wrapping but I'm not sure how intuitive it is; we > can probably reconfigure some of that. > > chris > > > On Jul 1, 2009, at 2:44 AM, Peter Cock wrote: > > Hi all (BioPerl and Biopython), >> >> This is a continuation of a long thread on the BioPerl mailing >> list, which I have now CC'd to the Biopython mailing list. See: >> http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030265.html >> >> On this thread we have been discussing next gen sequencing >> tools and co-coordinating things like consistent file format >> naming between Biopython, BioPerl and EMBOSS. I've been >> chatting to Peter Rice (EMBOSS) while at BOSC/ISMB 2009, >> and he will look into setting up a cross project mailing list for >> this kind of discussion in future. >> >> In the mean time, my replies to Giles below cover both BioPerl >> and Biopython (and EMBOSS). Giles' original email is here: >> http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030398.html >> >> Peter >> >> On 6/30/09, Giles Weaver wrote: >> >>> >>> I'm developing a transcriptomics database for use with next-gen data, and >>> have found processing the raw data to be a big hurdle. >>> >>> I'm a bit late in responding to this thread, so most issues have already >>> been discussed. One thing that hasn't been mentioned is removal of >>> adapters >>> from raw Illumina sequence. This is a PITA, and I'm not aware of any well >>> developed and documented open source software for removal of adapters >>> (and poor quality sequence) from Illumina reads. >>> >>> My current Illumina sequence processing pipeline is an unholy mix of >>> biopython, bioperl, pure perl, emboss and bowtie. Biopython for >>> converting >>> the Illumina fastq to Sanger fastq, bioperl to read the quality values, >>> pure perl to trim the poor quality sequence from each read, and bioperl >>> with emboss to remove the adapter sequence. I'm aware that the pipeline >>> contains bugs and would like to simplify it, but at least it does work... 
>>> >>> Ideally I'd like to replace as much of the pipeline as possible with >>> bioperl/bioperl-run, but this isn't currently possible due to both a lack >>> of features and poor performance. I'm sure the features will come with >>> time, but the performance is more of a concern to me. .. >>> >> >> I gather you would rather work with (Bio)Perl, but since you are >> already using Biopython to do the FASTQ conversion, you could >> also use it for more of your pipe line. Our tutorial includes examples >> of simple FASTQ quality filtering, and trimming of primer sequences >> (something like this might be helpful for removing adaptors). See: >> http://biopython.org/DIST/docs/tutorial/Tutorial.html >> http://biopython.org/DIST/docs/tutorial/Tutorial.pdf >> >> Alternatively, with the new release of EMBOSS this July, you will >> also be able to do the Illumina FASTQ to Sanger standard FASTQ >> with EMBOSS, and I'm sure BioPerl will offer this soon too. >> >> Regarding trimming bad quality bases (see comments from >>> Tristan Lefebure) from Solexa/Illumina reads, I did find a mixed >>> pure/bioperl solution to be much faster than a primarily bioperl >>> based implementation. I found Bio::Seq->subseq(a,b) and >>> Bio::Seq->subqual(a,b) to be far too slow. My current code trims >>> ~1300 sequences/second, including unzipping the raw data and >>> converting it to sanger fastq with biopython. Processing an entire >>> sequencing run with the whole pipeline takes in the region of 6-12h. >>> >> >> There are several ways of doing quality trimming, and it would >> make an excellent cookbook example (both for BioPerl and >> Biopython). >> >> Could you go into a bit more detail about your trimming >> algorithm? e.g. Do you just trim any bases on the right below >> a certain threshold, perhaps with a minimum length to retain >> the trimmed read afterwards? >> >> Hope this looooong post was of interest to someone! >>> >> >> I was interested at least ;) >> >> Peter >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > From cjfields at illinois.edu Wed Jul 1 12:46:49 2009 From: cjfields at illinois.edu (Chris Fields) Date: Wed, 1 Jul 2009 11:46:49 -0500 Subject: [Biopython] [Bioperl-l] Next-gen modules In-Reply-To: <1d06cd5d0907010927p4aad2a7re7ce1e65245e67de@mail.gmail.com> References: <6FA80489-D779-4247-B9EE-BB08ECEA0F8A@ucl.ac.uk> <92C15E3391F64BAF801754E924122540@NewLife> <200906170927.13273.tristan.lefebure@gmail.com> <1d06cd5d0906300428x59c004f1h200bfe3c23ed769@mail.gmail.com> <320fb6e00907010044v38480030hd5cf89ad149cf738@mail.gmail.com> <30B8D613-EDD6-4F2F-9B29-C34B8F60CB2E@illinois.edu> <1d06cd5d0907010927p4aad2a7re7ce1e65245e67de@mail.gmail.com> Message-ID: <6CAF4023-7D04-4B56-839F-E587A00DEEEA@illinois.edu> On Jul 1, 2009, at 11:27 AM, Giles Weaver wrote: ... > Peter, the trimming algorithm I use employs a sliding window, as > follows: > > - For each sequence position calculate the mean phred quality > score for a > window around that position. > - Record whether the mean score is above or below a threshold as > an array > of zeros and ones. > - Use a regular expression on the joined array to find the start > and end > of the good quality sequence(s). > - Extract the quality sequence(s) and replace any bases below the > quality > threshold with N. > - Trim any Ns from the ends. 
> > A refinement would be to weight the scores from positions in the > window, but > this could give a performance hit, and the method seems to work well > enough > as is. > > Chris, thanks for committing the fix, I'll give bioperl illumina fastq > parsing a workout soon. Peter, as much as I'd love to help out with > biopython, I'm under too much time pressure right now! Just let me know if the qual values match up with what is expected. You can also iterate through the data with hashrefs using next_dataset (faster than objects). This is from the fastq tests in core: ----------------------------------------- $in_qual = Bio::SeqIO->new(-file => test_input_file('fastq','test3_illumina.fastq'), -variant => 'illumina', -format => 'fastq'); $qual = $in_qual->next_dataset(); isa_ok($qual, 'HASH'); is($qual->{-seq}, 'GTTAGCTCCCACCTTAAGATGTTTA'); is($qual->{-raw_quality}, 'SXXTXXXXXXXXXTTSUXSSXKTMQ'); is($qual->{-id}, 'FC12044_91407_8_200_406_24'); is($qual->{-desc}, ''); is($qual->{-descriptor}, 'FC12044_91407_8_200_406_24'); is(join(',',@{$qual->{-qual}}[0..10]), '19,24,24,20,24,24,24,24,24,24,24'); ----------------------------------------- So one could check those values directly and then filter them through as needed directly into Bio::Seq::Quality if necessary (note some of the key values are constructor args): my $qualobj = Bio::Seq::Quality->new(%$qual); chris From p.j.a.cock at googlemail.com Thu Jul 2 03:20:07 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 2 Jul 2009 08:20:07 +0100 Subject: [Biopython] [Bioperl-l] Next-gen modules In-Reply-To: <1d06cd5d0907010927p4aad2a7re7ce1e65245e67de@mail.gmail.com> References: <6FA80489-D779-4247-B9EE-BB08ECEA0F8A@ucl.ac.uk> <92C15E3391F64BAF801754E924122540@NewLife> <200906170927.13273.tristan.lefebure@gmail.com> <1d06cd5d0906300428x59c004f1h200bfe3c23ed769@mail.gmail.com> <320fb6e00907010044v38480030hd5cf89ad149cf738@mail.gmail.com> <30B8D613-EDD6-4F2F-9B29-C34B8F60CB2E@illinois.edu> <1d06cd5d0907010927p4aad2a7re7ce1e65245e67de@mail.gmail.com> Message-ID: <320fb6e00907020020o2fa686d2yab6f185785ad8a08@mail.gmail.com> On 7/1/09, Giles Weaver wrote: > Peter, the trimming algorithm I use employs a sliding window, as follows: > > - For each sequence position calculate the mean phred quality score for a > window around that position. > - Record whether the mean score is above or below a threshold as an array > of zeros and ones. > - Use a regular expression on the joined array to find the start and end > of the good quality sequence(s). > - Extract the quality sequence(s) and replace any bases below the quality > threshold with N. > - Trim any Ns from the ends. > > A refinement would be to weight the scores from positions in the window, but > this could give a performance hit, and the method seems to work well enough > as is. Thanks for the details - that is a bit more complex that what I had been thinking. Do you have any favoured window size and quality threshold, or does this really depend on the data itself? Also, if you find a sequence read that goes "good - poor - good" for example, do you extract the two good regions as two sub reads (presumably with a minimum length)? This may be silly for Illumina where the reads are very short, but might make sense for Roche 454. > Chris, thanks for committing the fix, I'll give bioperl illumina fastq > parsing a workout soon. Peter, as much as I'd love to help out with > biopython, I'm under too much time pressure right now! Even use cases are useful - so thank you. 
> Jonathan, some of the Illumina sequencing adapters are listed at > http://intron.ccam.uchc.edu/groups/tgcore/wiki/013c0/Solexa_Library_Primer_Sequences.htmland > http://seqanswers.com/forums/showthread.php?t=198 > Adapter sequence typically appears towards the end of the read, though the > latter part of it is often misread as the sequencing quality drops off. > I abuse needle (EMBOSS) into aligning the adapter sequence with each read. I > then use Bio::AlignIO, Bio::Range and a custom scoring scheme to identify > real alignments and trim the sequence. This is not the ideal way of doing > things, but it's fast enough, and does seem to work. The adapter sequence > shouldn't be gapped, so I'm sure there is a lot of scope for optimising the > adapter removal. > > I'll happily share some code once I've got it to the stage where I'm not > embarrassed by it! > > Giles Cheers, Peter From vincent.rouilly03 at imperial.ac.uk Thu Jul 2 09:40:46 2009 From: vincent.rouilly03 at imperial.ac.uk (Rouilly, Vincent) Date: Thu, 2 Jul 2009 14:40:46 +0100 Subject: [Biopython] Distributed Annotation System ( DAS ) and BioPython Message-ID: Hi, I have question about Distributed Annotation System (DAS). What is the current best practice to load a SeqRecord from a DAS description ? ------- I found that this topic has been discussed in the past here (see below), but I couldn't find the up-to-date method to deal with DAS in BioPython. [2003] : Draft PyDAS parser from Andrew Dalke: http://portal.open-bio.org/pipermail/biopython/2003-October/001670.html Andrew hints at a DAS2 project that might produce a better python tool. [2006]: Ann Loraine uses a SAX perser to deal with DAS: http://www.bioinformatics.org/pipermail/bbb/2006-December/003694.html [2007]: PPT Presentation from Sanger Feb 2007: "DAS/2: Next generation Distributed Annotation System". Some python code used in the DAS/2 Validation Suite is mentioned. http://sourceforge.net/projects/dasypus/ Project where Andrew Dalke is involved, but it seems inactive since 2006. ------- Sorry if I have missed the post where this issue was last discussed, best wishes, Vincent. From giles.weaver at googlemail.com Fri Jul 3 11:35:00 2009 From: giles.weaver at googlemail.com (Giles Weaver) Date: Fri, 3 Jul 2009 16:35:00 +0100 Subject: [Biopython] [Bioperl-l] Next-gen modules In-Reply-To: <320fb6e00907020020o2fa686d2yab6f185785ad8a08@mail.gmail.com> References: <6FA80489-D779-4247-B9EE-BB08ECEA0F8A@ucl.ac.uk> <92C15E3391F64BAF801754E924122540@NewLife> <200906170927.13273.tristan.lefebure@gmail.com> <1d06cd5d0906300428x59c004f1h200bfe3c23ed769@mail.gmail.com> <320fb6e00907010044v38480030hd5cf89ad149cf738@mail.gmail.com> <30B8D613-EDD6-4F2F-9B29-C34B8F60CB2E@illinois.edu> <1d06cd5d0907010927p4aad2a7re7ce1e65245e67de@mail.gmail.com> <320fb6e00907020020o2fa686d2yab6f185785ad8a08@mail.gmail.com> Message-ID: <1d06cd5d0907030835w14407249l5b47db8893820816@mail.gmail.com> Regarding the trimming algorithm, I've been using a window size of 5, a minimum score of 20 and a minimum length of 15 with the Illumina data. In the past I have used a similar algorithm with a larger window size and much longer minimum length with sequence from ABI 3XXX machines. I imagine that the ideal parameters for ABI SOLiD and Roche 454 would likely be similar to those for Illumina and Sanger sequencing respectively. Window size doesn't appear to affect performance much, if at all. For sequences with multiple good regions, I do extract all good regions. 
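Giles does the adapter step with EMBOSS needle plus Bio::AlignIO/Bio::Range; purely as an alternative sketch on the Biopython side, a local alignment with Bio.pairwise2 can flag an adapter hit and trim at it. The adapter string and score cut-off below are placeholders, pairwise2 is much slower than needle or a dedicated aligner on a full lane of reads, and the trim assumes no gaps open in the read before the hit:

-----------------------------------------
# Alternative sketch (not the needle-based pipeline described above):
# flag an adapter by local alignment and cut the read at the hit.
from Bio import SeqIO, pairwise2

ADAPTER = "GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG"   # example only - use your own
MIN_SCORE = 20                                  # arbitrary cut-off, tune it

def trim_adapter(record):
    hits = pairwise2.align.localms(str(record.seq), ADAPTER,
                                   1, -1, -5, -2,   # match, mismatch, open, extend
                                   one_alignment_only=True)
    if hits:
        aln_read, aln_adapter, score, begin, end = hits[0]
        if score >= MIN_SCORE:
            return record[:begin]   # keep the read up to the adapter hit
    return record

records = [trim_adapter(r) for r in SeqIO.parse(open("reads.fastq"), "fastq")]
SeqIO.write(records, open("no_adapter.fastq", "w"), "fastq")
-----------------------------------------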
Even with the Illumina data there are sometimes two good regions, but usually the second is adapter or junk and gets filtered out later. I haven't seen quality data from a 454 machine recently, and would be interested to know if multiple good regions are commonplace in 454 data. Can anyone with access to 454 data comment on this? Giles 2009/7/2 Peter Cock > On 7/1/09, Giles Weaver wrote: > > Peter, the trimming algorithm I use employs a sliding window, as follows: > > > > - For each sequence position calculate the mean phred quality score > for a > > window around that position. > > - Record whether the mean score is above or below a threshold as an > array > > of zeros and ones. > > - Use a regular expression on the joined array to find the start and > end > > of the good quality sequence(s). > > - Extract the quality sequence(s) and replace any bases below the > quality > > threshold with N. > > - Trim any Ns from the ends. > > > > A refinement would be to weight the scores from positions in the window, > but > > this could give a performance hit, and the method seems to work well > enough > > as is. > > Thanks for the details - that is a bit more complex that what I had been > thinking. Do you have any favoured window size and quality threshold, > or does this really depend on the data itself? > > Also, if you find a sequence read that goes "good - poor - good" for > example, do you extract the two good regions as two sub reads > (presumably with a minimum length)? This may be silly for Illumina > where the reads are very short, but might make sense for Roche 454. > > > Chris, thanks for committing the fix, I'll give bioperl illumina fastq > > parsing a workout soon. Peter, as much as I'd love to help out with > > biopython, I'm under too much time pressure right now! > > Even use cases are useful - so thank you. > > > Jonathan, some of the Illumina sequencing adapters are listed at > > > http://intron.ccam.uchc.edu/groups/tgcore/wiki/013c0/Solexa_Library_Primer_Sequences.htmland > > http://seqanswers.com/forums/showthread.php?t=198 > > Adapter sequence typically appears towards the end of the read, though > the > > latter part of it is often misread as the sequencing quality drops off. > > I abuse needle (EMBOSS) into aligning the adapter sequence with each > read. I > > then use Bio::AlignIO, Bio::Range and a custom scoring scheme to identify > > real alignments and trim the sequence. This is not the ideal way of doing > > things, but it's fast enough, and does seem to work. The adapter sequence > > shouldn't be gapped, so I'm sure there is a lot of scope for optimising > the > > adapter removal. > > > > I'll happily share some code once I've got it to the stage where I'm not > > embarrassed by it! > > > > Giles > > Cheers, > > Peter > From biopython at maubp.freeserve.co.uk Sat Jul 4 09:59:31 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 4 Jul 2009 14:59:31 +0100 Subject: [Biopython] Distributed Annotation System ( DAS ) and BioPython In-Reply-To: References: Message-ID: <320fb6e00907040659ua83a793j94c4920608b0ad28@mail.gmail.com> On Thu, Jul 2, 2009 at 2:40 PM, Rouilly, Vincent wrote: > Hi, > > I have question about Distributed Annotation System (DAS). > What is the current best practice to load a SeqRecord from > a DAS description ? I don't know if anyone has done that. We don't have anything in Biopython for DAS right now (that I know of). Hopefully Andrew Dalke (CC'd) can give us a quick report on the status of his code and the DAS/2 project. 
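Although there is no DAS support in Biopython itself, the DAS 1.5 "sequence" command is plain XML over HTTP, so a SeqRecord can be built by hand. The server URL and data source below are placeholders and the element names are from memory of the spec, so check them against what your server actually returns:

-----------------------------------------
# Rough sketch: fetch a region over DAS and wrap it in a SeqRecord.
# URL/data source are hypothetical; element names are assumptions.
import urllib
from xml.etree import ElementTree
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

url = ("http://das.example.org/das/example_source/sequence"
       "?segment=1:100000,101000")
tree = ElementTree.parse(urllib.urlopen(url))
elem = tree.find(".//SEQUENCE")              # assumed element name
seq_text = "".join(elem.text.split())        # drop whitespace/newlines
record = SeqRecord(Seq(seq_text), id=elem.get("id"),
                   description="fetched via DAS")
print record.id, len(record.seq)
-----------------------------------------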
Could you give a specific example of a DAS service you'd like to use to get a sequence record from? On the bright side, when chatting to Peter Rice from EMBOSS at BOSC/ISMB 2009, he said they had been doing a lot of work with DAS, so it sounds like a lot of the problems Andrew was talking about (like invalid XML files) about may have been addressed. I'm not sure if the new version of EMBOSS due this month will include a DAS client of some kind - that would be worth checking out. P.S. Have you signed up to the DAS mailing list? http://lists.open-bio.org/mailman/listinfo/das Peter From fungazid at yahoo.com Sun Jul 5 18:57:08 2009 From: fungazid at yahoo.com (Fungazid) Date: Sun, 5 Jul 2009 15:57:08 -0700 (PDT) Subject: [Biopython] suggestion for a little change in the ACE cookbook Message-ID: <204841.83488.qm@web65510.mail.ac4.yahoo.com> Hi, About the cookbook here http://biopython.org/wiki/ACE_contig_to_alignment instead of: def cut_ends(read, start, end): return (start-1) * '-' + read[start-1:end] + (end +1) * '-' I think it is better to write: def cut_ends(self,read, start, end): return (start-1) * 'x' + read[start-1:end-1] + (len(read)-end) * 'x' The 2 changes are: 1) correcting the coordinates of the clipped 5' region 2) adding 'x' instead of '-' to separate the clipped region from the gaps From biopython.chen at gmail.com Sun Jul 5 23:27:15 2009 From: biopython.chen at gmail.com (chen Ku) Date: Sun, 5 Jul 2009 20:27:15 -0700 Subject: [Biopython] how to retrieve pdb id of desired keyword Message-ID: <4c2163890907052027s3a2843b4w3ebe6ee4ef7a5472@mail.gmail.com> Dear all, I seek your help again in using Bio.PDBList. As I understood from Bio.PDBList we can only download whole PDB by ( *download_entire_pdb(self, listfile=None) * Actually i want to only fetch the pdb id which are only transcription factor binding to DNA. I think to download all PDB file will be time taking so without mising anydata which is the best way.If you can demonstrate me using PDBList method for this then I can start with next methods and try by my own. Any suggestion or one demonstaration using PDBList will be of great help. Regards Chen From oda.gumail at gmail.com Mon Jul 6 11:19:56 2009 From: oda.gumail at gmail.com (Ogan ABAAN) Date: Mon, 06 Jul 2009 11:19:56 -0400 Subject: [Biopython] retrieve gene name and exon Message-ID: <4A52161C.8070909@gmail.com> Hi all, I have a number of genomic position from the human genome and I want to know which genes these positions belong to. I also would like to know which exon (if they are from a gene, or even intron if possible) the location is on. For example, I want to put in chr1:10,000,000 and would like to see an output as such geneX-exon5 or something like that. I know ensemble stores that information but I couldn't find the proper tool in Biopython, so I would apritiate if anyone could direct me to one. Thank you very much Ogan From biopython at maubp.freeserve.co.uk Mon Jul 6 11:44:28 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 6 Jul 2009 16:44:28 +0100 Subject: [Biopython] retrieve gene name and exon In-Reply-To: <4A52161C.8070909@gmail.com> References: <4A52161C.8070909@gmail.com> Message-ID: <320fb6e00907060844w3de4d860tf127cd6fc3f32eef@mail.gmail.com> On Mon, Jul 6, 2009 at 4:19 PM, Ogan ABAAN wrote: > Hi all, > > I have a number of genomic position from the human genome and I want to know > which genes these positions belong to. I also would like to know which exon > (if they are from a gene, or even intron if possible) the location is on. 
> For example, I want to put in chr1:10,000,000 and would like to see an > output as such geneX-exon5 or something like that. I know ensemble stores > that information but I couldn't find the proper tool in Biopython, so I > would apritiate if anyone could direct me to one. Thank you very much > > Ogan This thread was on a similar topic: http://lists.open-bio.org/pipermail/biopython/2009-June/005193.html Given the GenBank file (or in theory an EMBL file or something else like a GFF file) for a chromosome, and a position within it, how could you determine which feature(s) a given position was within. Note that there are already three different human genomes available in GenBank, so as mentioned in the earlier thread, you need to know which human genome your location refers to - and work from the appropriate GenBank/EMBL/GFF/other data file. Peter P.S. How many of these locations do you have? From oda.gumail at gmail.com Mon Jul 6 12:58:53 2009 From: oda.gumail at gmail.com (Ogan ABAAN) Date: Mon, 06 Jul 2009 12:58:53 -0400 Subject: [Biopython] retrieve gene name and exon In-Reply-To: <320fb6e00907060844w3de4d860tf127cd6fc3f32eef@mail.gmail.com> References: <4A52161C.8070909@gmail.com> <320fb6e00907060844w3de4d860tf127cd6fc3f32eef@mail.gmail.com> Message-ID: <4A522D4D.40602@gmail.com> Thanks Peter, Now that you mention it I remember reading that thread. I don't have an exact number but for chr1 I have about 350 of these. I parsed them out a separate chr files. Thank you Peter wrote: > On Mon, Jul 6, 2009 at 4:19 PM, Ogan ABAAN wrote: > >> Hi all, >> >> I have a number of genomic position from the human genome and I want to know >> which genes these positions belong to. I also would like to know which exon >> (if they are from a gene, or even intron if possible) the location is on. >> For example, I want to put in chr1:10,000,000 and would like to see an >> output as such geneX-exon5 or something like that. I know ensemble stores >> that information but I couldn't find the proper tool in Biopython, so I >> would apritiate if anyone could direct me to one. Thank you very much >> >> Ogan >> > > This thread was on a similar topic: > http://lists.open-bio.org/pipermail/biopython/2009-June/005193.html > Given the GenBank file (or in theory an EMBL file or something else > like a GFF file) for a chromosome, and a position within it, how could > you determine which feature(s) a given position was within. > > Note that there are already three different human genomes available > in GenBank, so as mentioned in the earlier thread, you need to know > which human genome your location refers to - and work from the > appropriate GenBank/EMBL/GFF/other data file. > > Peter > > P.S. How many of these locations do you have? > From winda002 at student.otago.ac.nz Mon Jul 6 19:31:12 2009 From: winda002 at student.otago.ac.nz (David WInter) Date: Tue, 07 Jul 2009 11:31:12 +1200 Subject: [Biopython] suggestion for a little change in the ACE cookbook In-Reply-To: <204841.83488.qm@web65510.mail.ac4.yahoo.com> References: <204841.83488.qm@web65510.mail.ac4.yahoo.com> Message-ID: <4A528940.6070503@student.otago.ac.nz> Fungazid wrote: > Hi, > > About the cookbook here > http://biopython.org/wiki/ACE_contig_to_alignment > > instead of: > > def cut_ends(read, start, end): > return (start-1) * '-' + read[start-1:end] + (end +1) * '-' > > I think it is better to write: > > def cut_ends(self,read, start, end): > return (start-1) * 'x' + read[start-1:end-1] + (len(read)-end) * 'x' > Yep, well spotted. 
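Coming back to the "retrieve gene name and exon" question above: given the GenBank file for the matching chromosome and genome build, the overlapping features can be found with a simple scan. The file name and coordinates are placeholders, whole-chromosome GenBank files are large, and Biopython's parsed feature locations are 0-based:

-----------------------------------------
# Sketch: report which annotated features overlap each genomic position.
from Bio import SeqIO

record = SeqIO.read(open("chr1.gb"), "genbank")   # placeholder file name
positions = [9999999]                             # 0-based, i.e. chr1:10,000,000

for pos in positions:
    hits = []
    for feature in record.features:
        if feature.type not in ("gene", "mRNA", "CDS", "exon"):
            continue
        # nofuzzy_start/nofuzzy_end give plain integer coordinates
        if feature.location.nofuzzy_start <= pos < feature.location.nofuzzy_end:
            name = feature.qualifiers.get("gene", ["?"])[0]
            hits.append("%s (%s)" % (name, feature.type))
    print pos + 1, "->", "; ".join(hits) or "no feature"
-----------------------------------------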
It seems I'd also put an ugly hack in the 'pad_ends' function to deal with the problem (cutting the read to length before returning it) so we can get rid to that too ;) I've changed the code on the wiki. As for adding 'x's instead of '-'s - I think this is really going to be a case by case thing - the contigs I had to play with had asterisks for gaps in the reads so I could tell the difference (and for some strange reason I'm squeamish about using letters to represent a gap even if 'x' is not an ambiguity code). Do you want to add something to the recipe to make it clear that someone could change the 'pad character' to suit the assembly you are using? Cheers, David From pzs at dcs.gla.ac.uk Tue Jul 7 12:41:14 2009 From: pzs at dcs.gla.ac.uk (Peter Saffrey) Date: Tue, 07 Jul 2009 17:41:14 +0100 Subject: [Biopython] Primer3 for testing primers Message-ID: <4A537AAA.5040008@dcs.gla.ac.uk> Has anybody done this through Biopython? I found this posting: http://portal.open-bio.org/pipermail/biopython/2003-October/001673.html but it generates a primer3 input file, rather than using the set_parameter() method provided by Bio.Emboss.Applications.Primer3Commandline. The problem is that by running primer3 from the command line, I can't get it to report problems with (for example) temperature or GC content without using the PRIMER_EXPLAIN_FLAG option, and Primer3Commandline doesn't seem to support that option. This also makes me wonder whether Biopython's primer3 output parsing knows how to read the primer3 "explain" syntax: PRIMER_LEFT_EXPLAIN=considered 1, ok 1 PRIMER_RIGHT_EXPLAIN=considered 1, ok 1 Does anybody know? I'm not finding the primer3 documentation all that helpful either :( There is no mailing list or contact email address... Peter From biopython at maubp.freeserve.co.uk Tue Jul 7 13:05:55 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 7 Jul 2009 18:05:55 +0100 Subject: [Biopython] Primer3 for testing primers In-Reply-To: <4A537AAA.5040008@dcs.gla.ac.uk> References: <4A537AAA.5040008@dcs.gla.ac.uk> Message-ID: <320fb6e00907071005t24d79108u76d23c006c19f297@mail.gmail.com> On Tue, Jul 7, 2009 at 5:41 PM, Peter Saffrey wrote: > Has anybody done this through Biopython? I found this posting: > > http://portal.open-bio.org/pipermail/biopython/2003-October/001673.html > > but it generates a primer3 input file, rather than using the set_parameter() > method provided by Bio.Emboss.Applications.Primer3Commandline. > > The problem is that by running primer3 from the command line, I can't get it > to report problems with (for example) temperature or GC content without > using the PRIMER_EXPLAIN_FLAG option, and Primer3Commandline > doesn't seem to support that option. > > This also makes me wonder whether Biopython's primer3 output parsing knows > how to read the primer3 "explain" syntax: > > PRIMER_LEFT_EXPLAIN=considered 1, ok 1 > PRIMER_RIGHT_EXPLAIN=considered 1, ok 1 > > Does anybody know? > > I'm not finding the primer3 documentation all that helpful either :( There > is no mailing list or contact email address... Are you sure you are using the EMBOSS version of primer3? i.e. the command line tool called eprimer3 (with an "e" at the start). EMBOSS mailing list: http://emboss.sourceforge.net/support/#usermail http://emboss.open-bio.org/mailman/listinfo/emboss EMBOSS docs: http://emboss.sourceforge.net/apps/cvs/emboss/apps/eprimer3.html This does specifically list the "-explainflag" argument, which should be set to a boolean value. 
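For the eprimer3 discussion, here is a sketch of driving the EMBOSS wrapper with the explain flag switched on. How options are set differs between Biopython releases (older wrappers use set_parameter, newer ones also accept keyword/property style), and the file names are placeholders, so treat this as a guide rather than copy-and-paste code:

-----------------------------------------
# Sketch: run EMBOSS eprimer3 with -explainflag so the EXPLAIN lines appear.
import subprocess
from Bio.Emboss.Applications import Primer3Commandline

cline = Primer3Commandline()
cline.set_parameter("-sequence", "template.fasta")   # placeholder input
cline.set_parameter("-outfile", "primers.out")
cline.set_parameter("-explainflag", "1")
cline.set_parameter("-numreturn", "1")

print str(cline)                      # the command line about to be run
subprocess.call(str(cline), shell=True)
-----------------------------------------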
This is supported in the Primer3Commandline wrapper in Biopython. I'm not sure about the parser off hand. Peter From fungazid at yahoo.com Tue Jul 7 15:19:33 2009 From: fungazid at yahoo.com (Fungazid) Date: Tue, 7 Jul 2009 12:19:33 -0700 (PDT) Subject: [Biopython] suggestion for a little change in the ACE cookbook Message-ID: <927677.46270.qm@web65502.mail.ac4.yahoo.com> Hi David, I am working with a version of this cookbook that suits my needs. Right now I do not have extremely existing things to add to the cookbook, but I am working with this code and maybe I can track something important (hopefully not bugs ;) ). Thanks, Avi --- On Tue, 7/7/09, David WInter wrote: > From: David WInter > Subject: Re: [Biopython] suggestion for a little change in the ACE cookbook > To: "Fungazid" > Cc: biopython at lists.open-bio.org > Date: Tuesday, July 7, 2009, 2:31 AM > Fungazid wrote: > > Hi, > > > > About the cookbook here > > http://biopython.org/wiki/ACE_contig_to_alignment > > > > instead of: > > > > def cut_ends(read, start, end): > >???return (start-1) * '-' + > read[start-1:end] + (end +1) * '-' > > > > I think it is better to write: > > > > def cut_ends(self,read, start, end): > >? ???return (start-1) * 'x' + > read[start-1:end-1] + (len(read)-end) * 'x' > >??? > > Yep, well spotted. It seems I'd also put an ugly hack in > the 'pad_ends' function to deal with the problem (cutting > the read to length before returning it) so we can get rid to > that too ;) I've changed the code on the wiki. > > As for adding 'x's instead of '-'s - I think this is really > going to be a case by case thing - the contigs I had to play > with had asterisks for gaps in the reads so I could tell the > difference (and for some strange reason I'm squeamish about > using letters to represent a gap even if 'x' is not an > ambiguity code). Do you want to add something to the recipe > to make it clear that someone could change the 'pad > character' to suit the assembly you are using? > > Cheers, > David > > > > > > > From lueck at ipk-gatersleben.de Wed Jul 8 06:08:56 2009 From: lueck at ipk-gatersleben.de (lueck at ipk-gatersleben.de) Date: Wed, 8 Jul 2009 12:08:56 +0200 Subject: [Biopython] blastall - strange results Message-ID: <20090708120856.c902mgb7eed4w8c8@webmail.ipk-gatersleben.de> Hi! Sorry for the late replay but here is an update: I tried megablast but it doesn't help...But what I found out and is acceptable for the moment: If the query sequence is >235 bp >>> use wordsize 21 If the query sequence is <235 bp >>> use wordsize 11 I don't know the reason for that but at least I can work with it. However now and than BLAST don't find all sequences (rarely) and soon or later I'll switch to a short read aligner or global alignment. Kind regards Stefanie >>> On Thu, May 28, 2009 at 1:02 PM, Brad Chapman <[EMAIL PROTECTED]> wrote: > Hi Stefanie; > >> I get strange results with blast. >> My aim is to blast a query sequence, spitted to 21-mers, against a database. > [...] >> Is this normal? I would expect to find all 21-mers. Why only some? I would check the filtering option is off (by default BLAST will mask low complexity regions). > BLAST isn't the best tool for this sort of problem. For exhaustively > aligning short sequences to a database of target sequences, you > should think about using a short read aligner. This is a nice > summary of available aligners: > > http://www.sanger.ac.uk/Users/lh3/NGSalign.shtml > > Personally, I have had good experiences using Mosaik and Bowtie. 
> > Hope this helps, > Brad Brad is probably right about normal BLAST not being the best tool. However, if you haven't done so already you might want to try megablast instead of blastn, as this is designed for very similar matches. This should be a very small change to your existing Biopython script, so it should be easy to try out. Peter _______________________________________________ Biopython mailing list - [EMAIL PROTECTED] http://lists.open-bio.org/mailman/listinfo/biopython From bartomas at gmail.com Tue Jul 14 07:03:08 2009 From: bartomas at gmail.com (bar tomas) Date: Tue, 14 Jul 2009 12:03:08 +0100 Subject: [Biopython] Record count in pcassay database Message-ID: Hi, I'm using Biopython to access Entrez databases. I've retrieved information of the pcassay database with the following code: handle=Entrez.einfo(db=*"pcassay"*) record=Entrez.read(handle) print record[*'DbInfo'*][*'Count'*] Printing the record count of pcassay gives : *1659* Such a limited number of records seems impossible. Am I using Biopython incorrectly ? Thanks very much From dejmail at gmail.com Tue Jul 14 07:09:49 2009 From: dejmail at gmail.com (Liam Thompson) Date: Tue, 14 Jul 2009 13:09:49 +0200 Subject: [Biopython] cleaning sequences Message-ID: Hi everyone I was wondering if there was a built in method for determining whether a sequence (Genbank or FASTA) is an Ambiguous or Unambiguous sequence. The reason I ask is I am trying to subtype a couple hundred viral DNA sequences, and due to bad sequencing, the sequences often have ambiguous characters in them, which the algorithm used to subtype doesn't like. I realise I can compare each letter of each genome in a loop with GATC to determine ambiguity, but it might be easier if there was a built in function. Thanks Liam -- ----------------------------------------------------------- Antiviral Gene Therapy Research Unit University of the Witwatersrand Faculty of Health Sciences, Room 7Q07 7 York Road, Parktown 2193 Tel: 2711 717 2465/7 Fax: 2711 717 2395 Email: liam.thompson at students.wits.ac.za / dejmail at gmail.com From chapmanb at 50mail.com Tue Jul 14 07:30:09 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 14 Jul 2009 07:30:09 -0400 Subject: [Biopython] Record count in pcassay database In-Reply-To: References: Message-ID: <20090714113009.GP17086@sobchak.mgh.harvard.edu> Hello; > I'm using Biopython to access Entrez databases. > I've retrieved information of the pcassay database with the following code: > > > handle=Entrez.einfo(db=*"pcassay"*) > record=Entrez.read(handle) > print record[*'DbInfo'*][*'Count'*] > > Printing the record count of pcassay gives : > *1659* > Such a limited number of records seems impossible. > Am I using Biopython incorrectly ? That count looks right to me if I manually browse the PubChem BioAssay database: http://www.ncbi.nlm.nih.gov/pcassay?term=all[filt] It looks like you are retrieving the top level assay records. The counts for total compounds assayed will be much higher but you would need to examine individual records of interest to determine those. Hope this helps, Brad From bartomas at gmail.com Tue Jul 14 07:48:51 2009 From: bartomas at gmail.com (bar tomas) Date: Tue, 14 Jul 2009 12:48:51 +0100 Subject: [Biopython] Record count in pcassay database In-Reply-To: <20090714113009.GP17086@sobchak.mgh.harvard.edu> References: <20090714113009.GP17086@sobchak.mgh.harvard.edu> Message-ID: Thanks very much for your reply. 
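On the separate "cleaning sequences" question above (spotting ambiguous bases in GenBank or FASTA records), a simple pure-Python check is enough; which letters you treat as allowed is your choice - here anything outside ACGT is flagged, and the input file name is a placeholder:

-----------------------------------------
# Sketch: flag records containing anything other than unambiguous A/C/G/T.
from Bio import SeqIO

ALLOWED = set("ACGT")

for record in SeqIO.parse(open("sequences.gb"), "genbank"):
    odd = set(str(record.seq).upper()) - ALLOWED
    if odd:
        print record.id, "has ambiguous characters:", ", ".join(sorted(odd))
    else:
        print record.id, "is unambiguous"
-----------------------------------------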
By the way in your http query you specify *term=all[filt]* I've just tried the same with BioPython and it does retireve all records: handle = Entrez.esearch(db=*"pcassay"*, term=*"ALL[filt]"*) Is 'filt' the standard wildcard for Entrez queries ? Thanks. On Tue, Jul 14, 2009 at 12:30 PM, Brad Chapman wrote: > Hello; > > > I'm using Biopython to access Entrez databases. > > I've retrieved information of the pcassay database with the following > code: > > > > > > handle=Entrez.einfo(db=*"pcassay"*) > > record=Entrez.read(handle) > > print record[*'DbInfo'*][*'Count'*] > > > > Printing the record count of pcassay gives : > > *1659* > > Such a limited number of records seems impossible. > > Am I using Biopython incorrectly ? > > That count looks right to me if I manually browse the PubChem > BioAssay database: > > http://www.ncbi.nlm.nih.gov/pcassay?term=all[filt] > > It looks like you are retrieving the top level assay records. The > counts for total compounds assayed will be much higher but you would > need to examine individual records of interest to determine those. > > Hope this helps, > Brad > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From chapmanb at 50mail.com Tue Jul 14 08:50:12 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 14 Jul 2009 08:50:12 -0400 Subject: [Biopython] Record count in pcassay database In-Reply-To: References: <20090714113009.GP17086@sobchak.mgh.harvard.edu> Message-ID: <20090714125012.GS17086@sobchak.mgh.harvard.edu> Hello; > Thanks very much for your reply. > By the way in your http query you specify *term=all[filt]* > I've just tried the same with BioPython and it does retireve all records: It looked like you were getting all the records with your previous query as well. > handle = Entrez.esearch(db=*"pcassay"*, term=*"ALL[filt]"*) > Is 'filt' the standard wildcard for Entrez queries ? I don't know too much about PubChem queries but had just clicked on the "All BioAssays" link from the main page: http://www.ncbi.nlm.nih.gov/sites/entrez?db=pcassay The documentation linked to from there: http://pubchem.ncbi.nlm.nih.gov/help.html#PubChem_index can probably provide additional direction. Thanks, Brad > > Thanks. > > On Tue, Jul 14, 2009 at 12:30 PM, Brad Chapman wrote: > > > Hello; > > > > > I'm using Biopython to access Entrez databases. > > > I've retrieved information of the pcassay database with the following > > code: > > > > > > > > > handle=Entrez.einfo(db=*"pcassay"*) > > > record=Entrez.read(handle) > > > print record[*'DbInfo'*][*'Count'*] > > > > > > Printing the record count of pcassay gives : > > > *1659* > > > Such a limited number of records seems impossible. > > > Am I using Biopython incorrectly ? > > > > That count looks right to me if I manually browse the PubChem > > BioAssay database: > > > > http://www.ncbi.nlm.nih.gov/pcassay?term=all[filt] > > > > It looks like you are retrieving the top level assay records. The > > counts for total compounds assayed will be much higher but you would > > need to examine individual records of interest to determine those. 
> > > > Hope this helps, > > Brad > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > > From chapmanb at 50mail.com Tue Jul 14 08:45:21 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 14 Jul 2009 08:45:21 -0400 Subject: [Biopython] cleaning sequences In-Reply-To: References: Message-ID: <20090714124521.GR17086@sobchak.mgh.harvard.edu> Hi Liam; I don't believe there is built in functionality for doing this. The problem itself is hard because it is a bit underspecified: what should be done when encountering ambiguous characters? Depending on your situation this can be a couple of different things: - Trim the sequence to remove the bases. This might be a post-sequencing step, and there was some discussion between Peter and Giles about the parameters of doing this earlier this month: http://lists.open-bio.org/pipermail/biopython/2009-July/005342.html - Replace the bases with an accepted ambiguity character (say, N or x) So it's a bit hard to generalize. Saying that, we'd be happy for thoughts on an implementation that would tackle these sorts of issues. Brad > I was wondering if there was a built in method for determining whether a > sequence (Genbank or FASTA) is an Ambiguous or Unambiguous sequence. The > reason I ask is I am trying to subtype a couple hundred viral DNA sequences, > and due to bad sequencing, the sequences often have ambiguous characters in > them, which the algorithm used to subtype doesn't like. I realise I can > compare each letter of each genome in a loop with GATC to determine > ambiguity, but it might be easier if there was a built in function. > > Thanks > Liam > > > > -- > ----------------------------------------------------------- > Antiviral Gene Therapy Research Unit > University of the Witwatersrand > Faculty of Health Sciences, Room 7Q07 > 7 York Road, Parktown > 2193 > > Tel: 2711 717 2465/7 > Fax: 2711 717 2395 > Email: liam.thompson at students.wits.ac.za / dejmail at gmail.com > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From bartomas at gmail.com Tue Jul 14 09:22:28 2009 From: bartomas at gmail.com (bar tomas) Date: Tue, 14 Jul 2009 14:22:28 +0100 Subject: [Biopython] Record count in pcassay database In-Reply-To: <20090714125012.GS17086@sobchak.mgh.harvard.edu> References: <20090714113009.GP17086@sobchak.mgh.harvard.edu> <20090714125012.GS17086@sobchak.mgh.harvard.edu> Message-ID: Thanks a lot! On Tue, Jul 14, 2009 at 1:50 PM, Brad Chapman wrote: > Hello; > > > Thanks very much for your reply. > > By the way in your http query you specify *term=all[filt]* > > I've just tried the same with BioPython and it does retireve all records: > > It looked like you were getting all the records with your previous > query as well. > > > handle = Entrez.esearch(db=*"pcassay"*, term=*"ALL[filt]"*) > > Is 'filt' the standard wildcard for Entrez queries ? > > I don't know too much about PubChem queries but had just clicked on the > "All BioAssays" link from the main page: > > http://www.ncbi.nlm.nih.gov/sites/entrez?db=pcassay > > The documentation linked to from there: > > http://pubchem.ncbi.nlm.nih.gov/help.html#PubChem_index > > can probably provide additional direction. Thanks, > Brad > > > > > Thanks. 
> > > > On Tue, Jul 14, 2009 at 12:30 PM, Brad Chapman > wrote: > > > > > Hello; > > > > > > > I'm using Biopython to access Entrez databases. > > > > I've retrieved information of the pcassay database with the following > > > code: > > > > > > > > > > > > handle=Entrez.einfo(db=*"pcassay"*) > > > > record=Entrez.read(handle) > > > > print record[*'DbInfo'*][*'Count'*] > > > > > > > > Printing the record count of pcassay gives : > > > > *1659* > > > > Such a limited number of records seems impossible. > > > > Am I using Biopython incorrectly ? > > > > > > That count looks right to me if I manually browse the PubChem > > > BioAssay database: > > > > > > http://www.ncbi.nlm.nih.gov/pcassay?term=all[filt] > > > > > > It looks like you are retrieving the top level assay records. The > > > counts for total compounds assayed will be much higher but you would > > > need to examine individual records of interest to determine those. > > > > > > Hope this helps, > > > Brad > > > _______________________________________________ > > > Biopython mailing list - Biopython at lists.open-bio.org > > > http://lists.open-bio.org/mailman/listinfo/biopython > > > > From cjfields at illinois.edu Tue Jul 14 10:48:04 2009 From: cjfields at illinois.edu (Chris Fields) Date: Tue, 14 Jul 2009 09:48:04 -0500 Subject: [Biopython] cleaning sequences In-Reply-To: <20090714124521.GR17086@sobchak.mgh.harvard.edu> References: <20090714124521.GR17086@sobchak.mgh.harvard.edu> Message-ID: <16F8D67C-EC52-4C11-8889-B07CAE9D7E1B@illinois.edu> If you do come up with something, let us Bioperl guys know. We have a preliminary trimming/cleaning version that we're thinking of adding, but it would be nice to coalesce around a similar implementation. chris On Jul 14, 2009, at 7:45 AM, Brad Chapman wrote: > Hi Liam; > I don't believe there is built in functionality for doing this. The > problem itself is hard because it is a bit underspecified: what > should be done when encountering ambiguous characters? Depending on > your situation this can be a couple of different things: > > - Trim the sequence to remove the bases. This might be a > post-sequencing step, and there was some discussion between Peter > and Giles about the parameters of doing this earlier this month: > > http://lists.open-bio.org/pipermail/biopython/2009-July/005342.html > > - Replace the bases with an accepted ambiguity character (say, N or > x) > > So it's a bit hard to generalize. Saying that, we'd be happy for > thoughts on an implementation that would tackle these sorts of > issues. > > Brad > >> I was wondering if there was a built in method for determining >> whether a >> sequence (Genbank or FASTA) is an Ambiguous or Unambiguous >> sequence. The >> reason I ask is I am trying to subtype a couple hundred viral DNA >> sequences, >> and due to bad sequencing, the sequences often have ambiguous >> characters in >> them, which the algorithm used to subtype doesn't like. I realise I >> can >> compare each letter of each genome in a loop with GATC to determine >> ambiguity, but it might be easier if there was a built in function. 
>> >> Thanks >> Liam >> >> >> >> -- >> ----------------------------------------------------------- >> Antiviral Gene Therapy Research Unit >> University of the Witwatersrand >> Faculty of Health Sciences, Room 7Q07 >> 7 York Road, Parktown >> 2193 >> >> Tel: 2711 717 2465/7 >> Fax: 2711 717 2395 >> Email: liam.thompson at students.wits.ac.za / dejmail at gmail.com >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From bartomas at gmail.com Tue Jul 14 11:39:08 2009 From: bartomas at gmail.com (bar tomas) Date: Tue, 14 Jul 2009 16:39:08 +0100 Subject: [Biopython] Problem using efetch Message-ID: Hi, I?m using BioPython to access Entrez databases. I?m following the BioPython tutorial. I?ve tried retrieving all record ids from pcassay database with esearch and then retrieving the first full record on the list with efetch: handle = Entrez.esearch(db="pcassay", term="ALL[filt]") print record["IdList"] # This prints the following list of ids: # ['1866', '1865', '1864', '1863', '1862', '1861', '1033', '1860', etc. But when I then try to retrieve the first record: handle2 = Entrez.efetch(db="pcassay", id="1866") I get the following error :

Error occurred: Report 'ASN1' not found in 'pcassay' presentation


  • db=pcassay
  • query_key=
  • report=
  • dispstart=
  • dispmax=
  • mode=html
  • WebEnv=

pmfetch need params:

  • (id=NNNNNN[,NNNN,etc]) or (query_key=NNN, where NNN - number in the history, 0 - clipboard content for current database)
  • db=db_name (mandatory)
  • report=[docsum, brief, abstract, citation, medline, asn.1, mlasn1, uilist, sgml, gen] (Optional; default is asn.1)
  • mode=[html, file, text, asn.1, xml] (Optional; default is html)
  • dispstart - first element to display, from 0 to count - 1, (Optional; default is 0)
  • dispmax - number of items to display (Optional; default is all elements, from dispstart)

  • See help. Do you have an idea of what I?m doing wrong? Thanks very much From dejmail at gmail.com Tue Jul 14 14:21:29 2009 From: dejmail at gmail.com (Liam Thompson) Date: Tue, 14 Jul 2009 20:21:29 +0200 Subject: [Biopython] cleaning sequences In-Reply-To: <20090714124521.GR17086@sobchak.mgh.harvard.edu> References: <20090714124521.GR17086@sobchak.mgh.harvard.edu> Message-ID: Hi Brad Yes, I remember the posts rereading them now. I think my problem is a little less complicated than sequence data, seeing as my sequences are genbank entries, so they just need to be read, even if they're bad quality. I suppose changing the letter would be a better option for me, especially as the reading frame is important for aligning based on peptide sequence. As for implementation, I am a complete greenhorn at python nevermind programming, so I wouldn't even know where to start suggestions, sorry about that. Regards Liam On Tue, Jul 14, 2009 at 2:45 PM, Brad Chapman wrote: > Hi Liam; > I don't believe there is built in functionality for doing this. The > problem itself is hard because it is a bit underspecified: what > should be done when encountering ambiguous characters? Depending on > your situation this can be a couple of different things: > > - Trim the sequence to remove the bases. This might be a > post-sequencing step, and there was some discussion between Peter > and Giles about the parameters of doing this earlier this month: > > http://lists.open-bio.org/pipermail/biopython/2009-July/005342.html > > - Replace the bases with an accepted ambiguity character (say, N or > x) > > So it's a bit hard to generalize. Saying that, we'd be happy for > thoughts on an implementation that would tackle these sorts of > issues. > > Brad > > > I was wondering if there was a built in method for determining whether a > > sequence (Genbank or FASTA) is an Ambiguous or Unambiguous sequence. The > > reason I ask is I am trying to subtype a couple hundred viral DNA > sequences, > > and due to bad sequencing, the sequences often have ambiguous characters > in > > them, which the algorithm used to subtype doesn't like. I realise I can > > compare each letter of each genome in a loop with GATC to determine > > ambiguity, but it might be easier if there was a built in function. > > > > Thanks > > Liam > > > > > > > > -- > > ----------------------------------------------------------- > > Antiviral Gene Therapy Research Unit > > University of the Witwatersrand > > Faculty of Health Sciences, Room 7Q07 > > 7 York Road, Parktown > > 2193 > > > > Tel: 2711 717 2465/7 > > Fax: 2711 717 2395 > > Email: liam.thompson at students.wits.ac.za / dejmail at gmail.com > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > -- ----------------------------------------------------------- Antiviral Gene Therapy Research Unit University of the Witwatersrand Faculty of Health Sciences, Room 7Q07 7 York Road, Parktown 2193 Tel: 2711 717 2465/7 Fax: 2711 717 2395 Email: liam.thompson at students.wits.ac.za / dejmail at gmail.com From biopython at maubp.freeserve.co.uk Tue Jul 14 18:08:50 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 14 Jul 2009 23:08:50 +0100 Subject: [Biopython] Problem using efetch In-Reply-To: References: Message-ID: <320fb6e00907141508l13ed0d2i9ddd466538af8816@mail.gmail.com> On Tue, Jul 14, 2009 at 4:39 PM, bar tomas wrote: > Hi, > > I?m using BioPython to access Entrez databases. 
?I?m following > the BioPython tutorial. I?ve tried retrieving all record ids from > pcassay database with esearch and then retrieving the first full > record on the list with efetch: > > handle = Entrez.esearch(db="pcassay", term="ALL[filt]") > > print record["IdList"] > > # This prints the following list of ids: > > # ['1866', '1865', '1864', '1863', '1862', '1861', '1033', '1860', etc. > > > But when I then try to retrieve the first record: > > handle2 = Entrez.efetch(db="pcassay", id="1866") > > I get the following error : > > > >

    Error occurred: Report 'ASN1' not found in 'pcassay' > presentation


      >
    • db=pcassay
    • > ... > > Do you have an idea of what I?m doing wrong? This isn't anything wrong with Biopython - this is the sort of slightly cryptic error the NCBI gives when the return type and/or return mode isn't supported. Apparently the default (ASN1) isn't supported for this database. The NCBI efetch documentation is a little vague or simply missing for the less main-stream databases. You can make some guesses from playing with the Entrez website, e.g. >>> print Entrez.efetch(db="pcassay", id="1866", rettype="uilist").read() PmFetch response
      1866
      
      >>> print Entrez.efetch(db="pcassay", id="1866", rettype="uilist", retmode="text").read() 1866 >>> print Entrez.efetch(db="pcassay", id="1866", rettype="abstract", retmode="text").read() 1: AID: 1866 Name: Epi-absorbance-based counterscreen assay for selective VIM-2 inhibitors: biochemical high throughput screening assay to identify inhibitors of TEM-1 serine-beta-lactamase. Source: The Scripps Research Institute Molecular Screening Center Description: Source (MLPCN Center Name): The Scripps Research Institute ... You could also try emailing the NCBI for advice. Peter From chapmanb at 50mail.com Wed Jul 15 08:35:40 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 15 Jul 2009 08:35:40 -0400 Subject: [Biopython] cleaning sequences In-Reply-To: References: <20090714124521.GR17086@sobchak.mgh.harvard.edu> Message-ID: <20090715123540.GF17086@sobchak.mgh.harvard.edu> Hi Liam; That makes sense. It's a good suggestion and I added it to the Project Ideas area of the wiki so hopefully it'll get picked up on in the future: http://biopython.org/wiki/Active_projects#Project_ideas For your specific problem, you should be able to do something along the lines of: def convert_ambiguous(orig_seq): new_bases = [] for base in str(orig_seq).upper(): if base in ["G", "A", "T", "C"]: new_bases.append(base) else: new_bases.append("N") return Seq("".join(new_bases), orig_seq.alphabet) which would switch all non GATCs to the N ambiguity character, assuming your downstream program accepts that. Hope this helps, Brad > > Yes, I remember the posts rereading them now. I think my problem is a little > less complicated than sequence data, seeing as my sequences are genbank > entries, so they just need to be read, even if they're bad quality. I > suppose changing the letter would be a better option for me, especially as > the reading frame is important for aligning based on peptide sequence. > > As for implementation, I am a complete greenhorn at python nevermind > programming, so I wouldn't even know where to start suggestions, sorry about > that. > > Regards > Liam > > > > > On Tue, Jul 14, 2009 at 2:45 PM, Brad Chapman wrote: > > > Hi Liam; > > I don't believe there is built in functionality for doing this. The > > problem itself is hard because it is a bit underspecified: what > > should be done when encountering ambiguous characters? Depending on > > your situation this can be a couple of different things: > > > > - Trim the sequence to remove the bases. This might be a > > post-sequencing step, and there was some discussion between Peter > > and Giles about the parameters of doing this earlier this month: > > > > http://lists.open-bio.org/pipermail/biopython/2009-July/005342.html > > > > - Replace the bases with an accepted ambiguity character (say, N or > > x) > > > > So it's a bit hard to generalize. Saying that, we'd be happy for > > thoughts on an implementation that would tackle these sorts of > > issues. > > > > Brad > > > > > I was wondering if there was a built in method for determining whether a > > > sequence (Genbank or FASTA) is an Ambiguous or Unambiguous sequence. The > > > reason I ask is I am trying to subtype a couple hundred viral DNA > > sequences, > > > and due to bad sequencing, the sequences often have ambiguous characters > > in > > > them, which the algorithm used to subtype doesn't like. I realise I can > > > compare each letter of each genome in a loop with GATC to determine > > > ambiguity, but it might be easier if there was a built in function. 
> > > > > > Thanks > > > Liam > > > > > > > > > > > > -- > > > ----------------------------------------------------------- > > > Antiviral Gene Therapy Research Unit > > > University of the Witwatersrand > > > Faculty of Health Sciences, Room 7Q07 > > > 7 York Road, Parktown > > > 2193 > > > > > > Tel: 2711 717 2465/7 > > > Fax: 2711 717 2395 > > > Email: liam.thompson at students.wits.ac.za / dejmail at gmail.com > > > _______________________________________________ > > > Biopython mailing list - Biopython at lists.open-bio.org > > > http://lists.open-bio.org/mailman/listinfo/biopython > > > > > > -- > ----------------------------------------------------------- > Antiviral Gene Therapy Research Unit > University of the Witwatersrand > Faculty of Health Sciences, Room 7Q07 > 7 York Road, Parktown > 2193 > > Tel: 2711 717 2465/7 > Fax: 2711 717 2395 > Email: liam.thompson at students.wits.ac.za / dejmail at gmail.com From bartomas at gmail.com Wed Jul 15 09:12:10 2009 From: bartomas at gmail.com (bar tomas) Date: Wed, 15 Jul 2009 14:12:10 +0100 Subject: [Biopython] How to run esearch in BioPython without specifying any filtering terms Message-ID: Hi, The BioPython tutorial (p.86) shows how once the available fields of an Entrez database have been found with Einfo , queries can be run that use those fields in the term argument of Esearch (for instance Jones[AUTH]). However, I?d like to retrieve all IDs from a database without specifying any filtering term. If I leave the term argument out in the Entrez.efetch method, BioPython returns an error. It tried the following, that came up in a previous email on this mailing list regarding pcassay database: handle = Entrez.esearch(db='pcsubstance', term="ALL[filt]") But this returns a list of 20 ids that obviously cannot comprise the whole pcsubstance database How can you run esearch in BioPython with no filtering terms? Thanks very much. From chapmanb at 50mail.com Wed Jul 15 16:16:55 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 15 Jul 2009 16:16:55 -0400 Subject: [Biopython] How to run esearch in BioPython without specifying any filtering terms In-Reply-To: References: Message-ID: <20090715201655.GH39098@sobchak.mgh.harvard.edu> Hello; > The BioPython tutorial (p.86) shows how once the available fields of an > Entrez database have been found with Einfo , queries can be run that use > those fields in the term argument of Esearch (for instance Jones[AUTH]). > > However, I?d like to retrieve all IDs from a database without specifying any > filtering term. > > If I leave the term argument out in the Entrez.efetch method, BioPython > returns an error. [..] > How can you run esearch in BioPython with no filtering terms? Retrieving all IDs isn't practical for most of the databases due to large numbers of entries. That's why a term is required in Biopython, and why most NCBI databases likely won't have an option to return everything. For example, 'pcsubstance' looks to contain 81 million records from the available downloads: ftp://ftp.ncbi.nlm.nih.gov/pubchem/Substance/CURRENT-Full/XML/ To realistically loop over a query, you'll need to limit your search via some subset of things you are interested in to make the numbers more manageable. 
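A rough sketch of that advice in practice, not taken from the original thread: restrict the query with a search term first, then page through the matching IDs with esearch's retstart/retmax arguments. The database, search term and email address below are only placeholders.

    from Bio import Entrez

    Entrez.email = "your.name at example.org"   # placeholder, NCBI asks for a real address

    search_term = "glucose"                    # placeholder term to restrict the query
    handle = Entrez.esearch(db="pcassay", term=search_term, retmax=0)
    result = Entrez.read(handle)
    handle.close()
    total = int(result["Count"])               # records matching the restricted query

    ids = []
    for start in range(0, total, 500):         # fetch the IDs in pages of 500
        handle = Entrez.esearch(db="pcassay", term=search_term,
                                retstart=start, retmax=500)
        result = Entrez.read(handle)
        handle.close()
        ids.extend(result["IdList"])
    print len(ids), "IDs retrieved"
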
Hope this helps, Brad From dejmail at gmail.com Wed Jul 15 16:39:38 2009 From: dejmail at gmail.com (Liam Thompson) Date: Wed, 15 Jul 2009 22:39:38 +0200 Subject: [Biopython] cleaning sequences In-Reply-To: <20090715123540.GF17086@sobchak.mgh.harvard.edu> References: <20090714124521.GR17086@sobchak.mgh.harvard.edu> <20090715123540.GF17086@sobchak.mgh.harvard.edu> Message-ID: Hi Brad Thanks, it does work really well, and I was quite close, I just need to work on my loop conditions. I would suggest for development a way of interacting with the Unafold software. I know this was talked about a few weeks back, I think someone (Chris ?) wanted to write a wrapper, and it would be really nice if this could be added on. Regards Liam From chapmanb at 50mail.com Thu Jul 16 08:15:07 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 16 Jul 2009 08:15:07 -0400 Subject: [Biopython] cleaning sequences In-Reply-To: References: <20090714124521.GR17086@sobchak.mgh.harvard.edu> <20090715123540.GF17086@sobchak.mgh.harvard.edu> Message-ID: <20090716121507.GD44295@sobchak.mgh.harvard.edu> Hi Liam; > Thanks, it does work really well, and I was quite close, I just need to work > on my loop conditions. Great to hear -- glad you got it all figured out. > I would suggest for development a way of interacting with the Unafold > software. I know this was talked about a few weeks back, I think someone > (Chris ?) wanted to write a wrapper, and it would be really nice if this > could be added on. Sounds good. I'd encourage you to register on the wiki and add these type of ideas to the project ideas section, ideally with links to the relevant discussion lists: http://biopython.org/wiki/Active_projects#Project_ideas This is informal but helps do two things: it keeps the idea from getting lost on the mailing list, and provides a place for people to look if they are interested in contributing but don't know where to start. Brad From mmokrejs at ribosome.natur.cuni.cz Fri Jul 17 05:58:13 2009 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Fri, 17 Jul 2009 11:58:13 +0200 Subject: [Biopython] Moving from Bio.PubMed to Bio.Entrez Message-ID: <4A604B35.5010708@ribosome.natur.cuni.cz> Hi Peter and others, finally am moving my code from Bio.PubMed to Bio.Entrez. I think I have something wrong with my installation biopython-1.49: $ python Python 2.6.2 (r262:71600, Jun 10 2009, 00:54:18) [GCC 4.3.3] on linux2 Type "help", "copyright", "credits" or "license" for more information. 
>>> from Bio import Entrez, Medline, GenBank >>> Entrez.email = "mmokrejs at iresite.org" >>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="text") >>> _records = Entrez.read(_handle) Traceback (most recent call last): File "", line 1, in File "/usr/lib/python2.6/site-packages/Bio/Entrez/__init__.py", line 286, in read record = handler.run(handle) File "/usr/lib/python2.6/site-packages/Bio/Entrez/Parser.py", line 95, in run self.parser.ParseFile(handle) xml.parsers.expat.ExpatError: syntax error: line 1, column 0 >>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="XML") >>> _records = Entrez.read(_handle) Traceback (most recent call last): File "", line 1, in File "/usr/lib/python2.6/site-packages/Bio/Entrez/__init__.py", line 286, in read record = handler.run(handle) File "/usr/lib/python2.6/site-packages/Bio/Entrez/Parser.py", line 95, in run self.parser.ParseFile(handle) File "/usr/lib/python2.6/site-packages/Bio/Entrez/Parser.py", line 283, in external_entity_ref_handler parser.ParseFile(handle) File "/usr/lib/python2.6/site-packages/Bio/Entrez/Parser.py", line 280, in external_entity_ref_handler handle = urllib.urlopen(systemId) File "/usr/lib/python2.6/urllib.py", line 87, in urlopen return opener.open(url) File "/usr/lib/python2.6/urllib.py", line 203, in open return getattr(self, name)(url) File "/usr/lib/python2.6/urllib.py", line 465, in open_file return self.open_local_file(url) File "/usr/lib/python2.6/urllib.py", line 479, in open_local_file raise IOError(e.errno, e.strerror, e.filename) IOError: [Errno 2] No such file or directory: 'nlmmedline_090101.dtd' >>> When I upgrade to 1.51b I get slightly better results: $ python Python 2.5.4 (r254:67916, Jul 15 2009, 19:40:01) [GCC 4.2.2 (Gentoo 4.2.2 p1.0)] on linux2 Type "help", "copyright", "credits" or "license" for more information. 
>>> from Bio import Entrez, Medline, GenBank >>> Entrez.email = "mmokrejs at iresite.org" >>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="text") >>> _records = Entrez.read(_handle) Traceback (most recent call last): File "", line 1, in File "/usr/lib/python2.5/site-packages/Bio/Entrez/__init__.py", line 297, in read record = handler.run(handle) File "/usr/lib/python2.5/site-packages/Bio/Entrez/Parser.py", line 90, in run self.parser.ParseFile(handle) xml.parsers.expat.ExpatError: syntax error: line 1, column 0 >>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="XML") >>> _records = Entrez.read(_handle) >>> _records [{u'MedlineCitation': {u'DateCompleted': {u'Month': '06', u'Day': '29', u'Year': '2000'}, u'OtherID': [], u'DateRevised': {u'Month': '11', u'Day': '14', u'Year': '2007'}, u'MeshHeadingList': [{u'QualifierName': [], u'DescriptorName': '3T3 Cells'}, {u'QualifierName': ['chemistry', 'physiology'], u'DescriptorName': "5' Untranslated Regions"}, {u'QualifierName': [], u'DescriptorName': 'Animals'}, {u'QualifierName': [], u'DescriptorName': 'Base Sequence'}, {u'QualifierName': [], u'DescriptorName': 'Chick Embryo'}, {u'QualifierName': [], u'DescriptorName': 'Mice'}, {u'QualifierName': [], u'DescriptorName': 'Molecular Sequence Data'}, {u'QualifierName': [], u'DescriptorName': 'Protein Biosynthesis'}, {u'QualifierName': ['genetics'], u'DescriptorName': 'Proto-Oncogene Proteins c-jun'}, {u'QualifierName': ['chemistry'], u'DescriptorName': 'RNA, Messenger'}, {u'QualifierName': [], u'DescriptorName': 'Rabbits'}], u'OtherAbstract': [], u'CitationSubset': ['IM'], u'ChemicalList': [{u'Nam eOfSubstance': "5' Untranslated Regions", u'RegistryNumber': '0'}, {u'NameOfSubstance': 'Proto-Oncogene Proteins c-jun', u'RegistryNumber': '0'}, {u'NameOfSubstance': 'RNA, Messenger', u'RegistryNumber': '0'}], u'KeywordList': [], u'DateCreated': {u'Month': '06', u'Day': '29', u'Year': '2000'}, u'SpaceFlightMission': [], u'GeneralNote': [], u'Article': {u'ArticleDate': [], u'Pagination': {u'MedlinePgn': '2836-45'}, u'AuthorList': [{u'LastName': 'Sehgal', u'Initials': 'A', u'ForeName': 'A'}, {u'LastName': 'Briggs', u'Initials': 'J', u'ForeName': 'J'}, {u'LastName': 'Rinehart-Kim', u'Initials': 'J', u'ForeName': 'J'}, {u'LastName': 'Basso', u'Initials': 'J', u'ForeName': 'J'}, {u'LastName': 'Bos', u'Initials': 'TJ', u'ForeName': 'T J'}], u'Language': ['eng'], u'PublicationTypeList': ['Journal Article', "Research Support, Non-U.S. Gov't", "Research Support, U.S. Gov't, P.H.S."], u'Journal': {u'ISSN': '0950-9232', u'ISOAbbreviation': 'Oncogene', u'JournalIssue': {u'Volume': '19', u'Issue': '24', u'PubDate': {u'Month': 'Jun', u'Day': '1', u'Year': '2000'}}, u'Title': 'Oncogene'}, u'Affiliation': 'Department of Microbiology and Molecular Cell Biology, Eastern Virginia Medical School, PO Box 1980, Norfolk, Virginia, VA 23501, USA.', u'ArticleTitle': "The chicken c-Jun 5' untranslated region directs translation by internal initiation.", u'ELocationID': [], u'Abstract': {u'AbstractText': "The 5' untranslated region (UTR) of the chicken c-jun message is exceptionally GC rich and has the potential to form a complex and extremely stable secondary structure. Because stable RNA secondary structures can serve as obstacles to scanning ribosomes, their presence suggests inefficient translation or initiation through alternate mechanisms. We have examined the role of the c-jun 5' UTR with respect to its ability to influence translation both in vitro and in vivo. 
We find, using rabbit reticulocyte lysates, that the presence of the c-jun 5' UTR severely inhibits tran slation of both homologous and heterologous genes in vitro. Furthermore, translational inhibition correlates with the degree of secondary structure exhibited by the 5' UTR. Thus, in the rabbit reticulocyte lysate system, the c-jun 5' UTR likely impedes ribosome scanning resulting in inefficient translation. In contrast to our results in vitro, the c-jun 5' UTR does not inhibit translation in a variety of different cell lines suggesting that it may direct an alternate mechanism of translational initiation in vivo. To distinguish among the alternate mechanisms, we generated a series of bicistronic expression plasmids. Our results demonstrate that the downstream cistron, in the bicistronic gene, is expressed to a much higher level when directly preceded by the c-jun 5' UTR. In addition, inhibition of ribosome scanning on the bicistronic message, through insertion of a synthetic stable hairpin, inhibits translation of the first cistron but does not inhibit translation of the cist ron downstream of the c-jun 5' UTR. These results are consistent with a model by which the c-jun message is translated through cap independent internal initiation. Oncogene (2000) 19, 2836 - 2845"}, u'GrantList': [{u'Acronym': 'CA', u'Country': 'United States', u'Agency': 'NCI NIH HHS', u'GrantID': 'R01 CA51982'}]}, u'PMID': '10851087', u'MedlineJournalInfo': {u'MedlineTA': 'Oncogene', u'Country': 'ENGLAND', u'NlmUniqueID': '8711562'}}, u'PubmedData': {u'ArticleIdList': ['10851087', '10.1038/sj.onc.1203601'], u'PublicationStatus': 'ppublish', u'History': [[{u'Minute': '0', u'Month': '6', u'Day': '13', u'Hour': '9', u'Year': '2000'}, {u'Minute': '0', u'Month': '7', u'Day': '6', u'Hour': '11', u'Year': '2000'}, {u'Minute': '0', u'Month': '6', u'Day': '13', u'Hour': '9', u'Year': '2000'}]]}}] >>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="text") >>> _records = Entrez.read(_handle) Traceback (most recent call last): File "", line 1, in File "/usr/lib/python2.5/site-packages/Bio/Entrez/__init__.py", line 297, in read record = handler.run(handle) File "/usr/lib/python2.5/site-packages/Bio/Entrez/Parser.py", line 90, in run self.parser.ParseFile(handle) xml.parsers.expat.ExpatError: syntax error: line 1, column 0 >>> Any clues what does that mean? TIA, martin From bartomas at gmail.com Fri Jul 17 07:23:28 2009 From: bartomas at gmail.com (bar tomas) Date: Fri, 17 Jul 2009 12:23:28 +0100 Subject: [Biopython] How to run esearch in BioPython without specifying any filtering terms In-Reply-To: <20090715201655.GH39098@sobchak.mgh.harvard.edu> References: <20090715201655.GH39098@sobchak.mgh.harvard.edu> Message-ID: Thanks a lot. I understand now. On Wed, Jul 15, 2009 at 9:16 PM, Brad Chapman wrote: > Hello; > > > The BioPython tutorial (p.86) shows how once the available fields of an > > Entrez database have been found with Einfo , queries can be run that use > > those fields in the term argument of Esearch (for instance Jones[AUTH]). > > > > However, I?d like to retrieve all IDs from a database without specifying > any > > filtering term. > > > > If I leave the term argument out in the Entrez.efetch method, BioPython > > returns an error. > [..] > > How can you run esearch in BioPython with no filtering terms? > > Retrieving all IDs isn't practical for most of the databases due to > large numbers of entries. 
That's why a term is required in Biopython, > and why most NCBI databases likely won't have an option to return > everything. For example, 'pcsubstance' looks to contain 81 million > records from the available downloads: > > ftp://ftp.ncbi.nlm.nih.gov/pubchem/Substance/CURRENT-Full/XML/ > > To realistically loop over a query, you'll need to limit your search > via some subset of things you are interested in to make the numbers > more manageable. > > Hope this helps, > Brad > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From chapmanb at 50mail.com Fri Jul 17 08:01:29 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 17 Jul 2009 08:01:29 -0400 Subject: [Biopython] Moving from Bio.PubMed to Bio.Entrez In-Reply-To: <4A604B35.5010708@ribosome.natur.cuni.cz> References: <4A604B35.5010708@ribosome.natur.cuni.cz> Message-ID: <20090717120129.GE46309@sobchak.mgh.harvard.edu> Hi Martin; Thanks for the e-mail. Let's tackle your up to date 1.51beta work. > When I upgrade to 1.51b I get slightly better results: > > >>> from Bio import Entrez, Medline, GenBank > >>> Entrez.email = "mmokrejs at iresite.org" > >>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="text") > >>> _records = Entrez.read(_handle) [ error ] > >>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="XML") > >>> _records = Entrez.read(_handle) > >>> _records [ worked ] > Any clues what does that mean? TIA, In the first (and also third) example, you are retrieving the text based result. The Entrez parser handles XML output, so it is complaining because it's getting the raw text record instead of XML. Your second example is correct and worked; you specified the correct XML retmode. You should be able to go with this. More generally, since Entrez returns many different file types, you want to be sure and match up what you are getting with the parser you are using. Hope this helps, Brad From mmokrejs at ribosome.natur.cuni.cz Fri Jul 17 09:29:31 2009 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Fri, 17 Jul 2009 15:29:31 +0200 Subject: [Biopython] Moving from Bio.PubMed to Bio.Entrez In-Reply-To: <20090717120129.GE46309@sobchak.mgh.harvard.edu> References: <4A604B35.5010708@ribosome.natur.cuni.cz> <20090717120129.GE46309@sobchak.mgh.harvard.edu> Message-ID: <4A607CBB.106@ribosome.natur.cuni.cz> Hi Brad, thanks for clarification. I somewhat overlooked in the tutorial that Entrez.read() requires me to ask for XML rettype and that it parses the XML result by itself into the dictionary structure. Still I think it should check what values I have passed down to Entrez.efetch() function. I know it might be quite some work to keep it in sync with NCBI website but let's see what others say. Either way, my code works now with Bio.Entrez instead of the deprecated Bio.PubMed. I just had to quickly reinvent all the exceptions because some PubMed entries lack authors, abbreviated journal name, lack year, etc. ;-) Best regards, Martin Brad Chapman wrote: > Hi Martin; > Thanks for the e-mail. Let's tackle your up to date 1.51beta work. 
> >> When I upgrade to 1.51b I get slightly better results: >> >>>>> from Bio import Entrez, Medline, GenBank >>>>> Entrez.email = "mmokrejs at iresite.org" >>>>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="text") >>>>> _records = Entrez.read(_handle) > [ error ] > >>>>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="XML") >>>>> _records = Entrez.read(_handle) >>>>> _records > [ worked ] > >> Any clues what does that mean? TIA, > > In the first (and also third) example, you are retrieving the text > based result. The Entrez parser handles XML output, so it is > complaining because it's getting the raw text record instead of XML. > > Your second example is correct and worked; you specified the correct > XML retmode. You should be able to go with this. > > More generally, since Entrez returns many different file types, you > want to be sure and match up what you are getting with the parser > you are using. From biopython at maubp.freeserve.co.uk Sat Jul 18 07:40:36 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 18 Jul 2009 12:40:36 +0100 Subject: [Biopython] Moving from Bio.PubMed to Bio.Entrez In-Reply-To: <4A604B35.5010708@ribosome.natur.cuni.cz> References: <4A604B35.5010708@ribosome.natur.cuni.cz> Message-ID: <320fb6e00907180440i7a98bef9v8282bb1e2b6b8961@mail.gmail.com> On Fri, Jul 17, 2009 at 10:58 AM, Martin MOKREJ? wrote: > Hi Peter and others, > finally am moving my code from Bio.PubMed to Bio.Entrez. I think I have something > wrong with my installation biopython-1.49: > > ... >>>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="XML") >>>> _records = Entrez.read(_handle) > ... > IOError: [Errno 2] No such file or directory: 'nlmmedline_090101.dtd' The NCBI added some new DTD files in Jan 2009, there are not included with Biopython 1.49, but are in 1.51b which is why this error went away when you upgraded. Peter From p.j.a.cock at googlemail.com Sat Jul 18 07:48:30 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 18 Jul 2009 12:48:30 +0100 Subject: [Biopython] Moving from Bio.PubMed to Bio.Entrez In-Reply-To: <4A607CBB.106@ribosome.natur.cuni.cz> References: <4A604B35.5010708@ribosome.natur.cuni.cz> <20090717120129.GE46309@sobchak.mgh.harvard.edu> <4A607CBB.106@ribosome.natur.cuni.cz> Message-ID: <320fb6e00907180448j4f733b02xac6949048f310103@mail.gmail.com> On Fri, Jul 17, 2009 at 2:29 PM, Martin MOKREJ? wrote: > Hi Brad, > thanks for clarification. I somewhat overlooked in the tutorial that > Entrez.read() requires me to ask for XML rettype and that it parses > the XML result by itself into the dictionary structure. Still I think it should > check what values I have passed down to Entrez.efetch() function. This isn't going to be possible given that Entrez.read() just takes a file handle. This separation between getting the data and parsing it is deliberate. The handle you give to Entrez.read() might be to a file on disk (saved from a previous search) instead of an Internet handle to a live NCBI Entrez connection. > Either way, my code works now with Bio.Entrez instead of the > deprecated Bio.PubMed. Good. Note you didn't have to switch to using the XML from Entrez (e.g. with the Bio.Entrez.read() funciton). It sounds like you were using Bio.PubMed to access the data (in Medline format), and internally this used Bio.Medline to parse it. Therefore, it would have been less upheaval to use Bio.Entrez to fetch the data (as Medline files), and continue to use Bio.Medline to parse this. 
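In code, that combination might look roughly like this (a minimal sketch; the PubMed ID reuses the one from earlier in the thread, and the email address is a placeholder):

    from Bio import Entrez, Medline

    Entrez.email = "your.name at example.org"   # placeholder
    handle = Entrez.efetch(db="pubmed", id="10851087",
                           rettype="medline", retmode="text")
    records = list(Medline.parse(handle))      # Medline records behave like dictionaries
    handle.close()
    for record in records:
        print record.get("TI", "(no title)")   # article title
        print record.get("AU", [])             # author list
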
See the section "Parsing Medline records" in the Entrez chapter of the tutorial. Peter From lthiberiol at gmail.com Mon Jul 20 10:22:38 2009 From: lthiberiol at gmail.com (Luiz Thiberio Rangel) Date: Mon, 20 Jul 2009 11:22:38 -0300 Subject: [Biopython] BLAST footer Message-ID: -- Luiz Thib?rio Rangel From lthiberiol at gmail.com Mon Jul 20 10:29:34 2009 From: lthiberiol at gmail.com (Luiz Thiberio Rangel) Date: Mon, 20 Jul 2009 11:29:34 -0300 Subject: [Biopython] BLAST footer Message-ID: Hi folks, Is there any way to get a complete BLAST footer using NCBIXML.parse? The xml BLAST output generated by blastall doesn't have the complete footer information, but the txt output has. I'm running the BLAST using the xml output because this is the format compatible do BioPython's parser, but I need some information that it doesn't contains. If somebody know how I can calculate the footer information by the xml content would be useful too. thanks... -- Luiz Thib?rio Rangel From biopython at maubp.freeserve.co.uk Mon Jul 20 10:51:51 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 20 Jul 2009 15:51:51 +0100 Subject: [Biopython] BLAST footer In-Reply-To: References: Message-ID: <320fb6e00907200751s42f1387n64d95061a56a382b@mail.gmail.com> On Mon, Jul 20, 2009 at 3:29 PM, Luiz Thiberio Rangel wrote: > Hi folks, > > Is there any way to get a complete BLAST footer using NCBIXML.parse? > The xml BLAST output generated by blastall doesn't have the complete > footer information, but the txt output has. If the information isn't in the XML file, then the BLAST XML parser can't tell you it ;) > I'm running the BLAST using the xml output because this is the format > compatible do BioPython's parser, but I need some information that it > doesn't contains. ?If somebody know how I can calculate the footer > information by the xml content would be useful too. What information in particular do you need? Have you read the BLAST book (Ian Korf, Mark Yandell and Joseph Bedell)? They may explain where some of these numbers come from. Peter From iitlife2008 at gmail.com Mon Jul 20 17:08:21 2009 From: iitlife2008 at gmail.com (life happy) Date: Mon, 20 Jul 2009 14:08:21 -0700 Subject: [Biopython] Writing into a PDB file using PDBIO module Message-ID: <46a813870907201408j5d72e25eg9fffcf61331e4aaa@mail.gmail.com> Hi there, I am new to Biopython and have been working for a couple of weeks on Bio.PDB module.I would appreciate any clue or help in the following matter. I have some short ,closely related peptide sequences.I want to align these short peptides and send the aligned structures into a new PDB file.I used set_atoms class in Superimposer module to align the short peptides. I tried using PDBIO module, and send the aligned structures into a new PDB file. But when I see the output PDB file, I get the whole proteins not the short peptides. I like to have output PDB file with all the short peptides aligned to any particular short peptide. #This is the part of my code. B is list of atoms of peptides. C is a list with PDB ids of each peptide. 
from Bio.PDB.Superimposer import Superimposer fixed = B[0:1*(stop-start+1)] sup = Superimposer() for i in range(1,5) : moving = B[i*(stop-start+1):(i+1)*(stop-start+1)] sup.set_atoms(fixed, moving) print "RMS(%s file %s chain, %s file %s model) = %0.2f" % (C[0][0].split("'")[1],C[0][2].split("'")[1],C[i][0].split("'")[1],C[i][2].split("'")[1], sup.rms) print "Saving %s aligned structure as PDB file %s" % (C[0][2].split("'")[1], pdb_out_filename) io=Bio.PDB.PDBIO() io.set_structure(structure) io.save(pdb_out_filename) thanks in advance!! cheers, Kumar. From biopython at maubp.freeserve.co.uk Mon Jul 20 17:14:50 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 20 Jul 2009 22:14:50 +0100 Subject: [Biopython] Writing into a PDB file using PDBIO module In-Reply-To: <46a813870907201408j5d72e25eg9fffcf61331e4aaa@mail.gmail.com> References: <46a813870907201408j5d72e25eg9fffcf61331e4aaa@mail.gmail.com> Message-ID: <320fb6e00907201414j549e0eefyc556157cf432b327@mail.gmail.com> On Mon, Jul 20, 2009 at 10:08 PM, life happy wrote: > Hi there, > > I am new to Biopython and have been working for a couple of weeks on Bio.PDB > module.I would appreciate any clue or help in the following matter. > > I have some short ,closely related peptide sequences.I want to align these > short peptides and send the aligned structures into a new PDB file.I used > set_atoms class in Superimposer module to align the short peptides. I tried > using PDBIO module, and send the aligned structures into a new PDB file. But > when I see the output PDB file, I get the whole proteins not the short > peptides. I like to have output PDB file with all the short peptides aligned > to any particular short peptide. > > > #This is the part of my code. B is list of atoms of peptides. C is a list > with PDB ids of each peptide. > > from Bio.PDB.Superimposer import Superimposer > fixed = B[0:1*(stop-start+1)] > sup = Superimposer() > for i in range(1,5) : > moving = B[i*(stop-start+1):(i+1)*(stop-start+1)] > sup.set_atoms(fixed, moving) > print "RMS(%s file %s chain, %s file %s model) = %0.2f" % > (C[0][0].split("'")[1],C[0][2].split("'")[1],C[i][0].split("'")[1],C[i][2].split("'")[1], > sup.rms) > print "Saving %s aligned structure as PDB file %s" % > (C[0][2].split("'")[1], pdb_out_filename) > io=Bio.PDB.PDBIO() > io.set_structure(structure) > io.save(pdb_out_filename) > > thanks in advance!! Your example never defines the "structure" variable. I guess it should be pointing at something in the "C" data structure... Peter From biopython at maubp.freeserve.co.uk Mon Jul 20 18:15:54 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 20 Jul 2009 23:15:54 +0100 Subject: [Biopython] Writing into a PDB file using PDBIO module In-Reply-To: <46a813870907201436q716cc4fah2ba5ff1d28d7917a@mail.gmail.com> References: <46a813870907201408j5d72e25eg9fffcf61331e4aaa@mail.gmail.com> <320fb6e00907201414j549e0eefyc556157cf432b327@mail.gmail.com> <46a813870907201436q716cc4fah2ba5ff1d28d7917a@mail.gmail.com> Message-ID: <320fb6e00907201515o517c885ahb2c396efc4281f73@mail.gmail.com> On Mon, Jul 20, 2009 at 10:36 PM, life happy wrote: > No..this is only a piece of code. The structure object 'structure' was > already created. You example never seems to appy the transformation. Have you read this? http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ It is a worked example using Bio.PDB's Superimposer, and it saves the output. 
Peter From biopython at maubp.freeserve.co.uk Tue Jul 21 05:13:13 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Jul 2009 10:13:13 +0100 Subject: [Biopython] Writing into a PDB file using PDBIO module In-Reply-To: <46a813870907201559s44d04599s183ee118d47320cf@mail.gmail.com> References: <46a813870907201408j5d72e25eg9fffcf61331e4aaa@mail.gmail.com> <320fb6e00907201414j549e0eefyc556157cf432b327@mail.gmail.com> <46a813870907201436q716cc4fah2ba5ff1d28d7917a@mail.gmail.com> <320fb6e00907201515o517c885ahb2c396efc4281f73@mail.gmail.com> <46a813870907201559s44d04599s183ee118d47320cf@mail.gmail.com> Message-ID: <320fb6e00907210213p5df40d5dl583a962069ed1867@mail.gmail.com> Please keep the mailing list CC'd. On Mon, Jul 20, 2009 at 11:59 PM, life happy wrote: > Yes! I have read this. I'm glad you found that page (something I'd like to integrate into the main Biopython Tutorial at some point): http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > Which step applies the transformation?Isn't that > set_atoms function? I am able to print RMS value. I did not follow the > superimpose.apply(alt_model.get_atoms()) . As the name should suggest, superimpose.apply(...) actually applies the transformation. This is what you are missing. The set_atoms(...) just tells the code which atoms are going to be superimposed. > According to description in BioPDB faq pdf and > http://www.biopython.org/DIST/docs/api/Bio.PDB.Superimposer%27.Superimposer-class.html > set_atom does the transformation, right? If I am wrong, please correct me! That docstring is rather confusing, we should fix that. > Also,In which step are we sending the transformed co-ordinates into > the PDB file? These lines write out the PDB file for the whole structure: io=Bio.PDB.PDBIO() io.set_structure(structure) io.save(pdb_out_filename) > Also, the output PDB file has whole protein, I only want the short peptides > aligned(only the atom lists that I gave as input must be aligned, not the > whole protein of peptides). If you only want some of the protein written, then you should only give some of the structure to the PDB output code. Peter From iitlife2008 at gmail.com Tue Jul 21 16:35:58 2009 From: iitlife2008 at gmail.com (life happy) Date: Tue, 21 Jul 2009 13:35:58 -0700 Subject: [Biopython] Writing into a PDB file using PDBIO module In-Reply-To: <320fb6e00907210213p5df40d5dl583a962069ed1867@mail.gmail.com> References: <46a813870907201408j5d72e25eg9fffcf61331e4aaa@mail.gmail.com> <320fb6e00907201414j549e0eefyc556157cf432b327@mail.gmail.com> <46a813870907201436q716cc4fah2ba5ff1d28d7917a@mail.gmail.com> <320fb6e00907201515o517c885ahb2c396efc4281f73@mail.gmail.com> <46a813870907201559s44d04599s183ee118d47320cf@mail.gmail.com> <320fb6e00907210213p5df40d5dl583a962069ed1867@mail.gmail.com> Message-ID: <46a813870907211335s24cdcb45w8f511e280743a31f@mail.gmail.com> I have tried using io.save("pdb_out_filename", se.accept_model(alt_model)) I get error as , 'int' object has no attribute 'accept_model' If I use io.save("pdb_out_filename", se = accept_model(alt_model)) I get Error: name 'accept_model' is not defined In both the cases I created 'se' an object of Bio.PDB.Select() Do you have an example for printing out some part of PDB? On Tue, Jul 21, 2009 at 2:13 AM, Peter wrote: > Please keep the mailing list CC'd. > > On Mon, Jul 20, 2009 at 11:59 PM, life happy wrote: > > Yes! I have read this. 
> > I'm glad you found that page (something I'd like to integrate into the > main Biopython Tutorial at some point): > http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > > > Which step applies the transformation?Isn't that > > set_atoms function? I am able to print RMS value. I did not follow the > > superimpose.apply(alt_model.get_atoms()) . > > As the name should suggest, superimpose.apply(...) actually applies the > transformation. This is what you are missing. The set_atoms(...) just tells > the code which atoms are going to be superimposed. > > > According to description in BioPDB faq pdf and > > > http://www.biopython.org/DIST/docs/api/Bio.PDB.Superimposer%27.Superimposer-class.html > > set_atom does the transformation, right? If I am wrong, please correct > me! > > That docstring is rather confusing, we should fix that. > > > Also,In which step are we sending the transformed co-ordinates into > > the PDB file? > > These lines write out the PDB file for the whole structure: > > io=Bio.PDB.PDBIO() > io.set_structure(structure) > io.save(pdb_out_filename) > > > Also, the output PDB file has whole protein, I only want the short > peptides > > aligned(only the atom lists that I gave as input must be aligned, not the > > whole protein of peptides). > > If you only want some of the protein written, then you should only give > some of the structure to the PDB output code. > > Peter > From biopython at maubp.freeserve.co.uk Tue Jul 21 16:48:12 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Jul 2009 21:48:12 +0100 Subject: [Biopython] Writing into a PDB file using PDBIO module In-Reply-To: <46a813870907211335s24cdcb45w8f511e280743a31f@mail.gmail.com> References: <46a813870907201408j5d72e25eg9fffcf61331e4aaa@mail.gmail.com> <320fb6e00907201414j549e0eefyc556157cf432b327@mail.gmail.com> <46a813870907201436q716cc4fah2ba5ff1d28d7917a@mail.gmail.com> <320fb6e00907201515o517c885ahb2c396efc4281f73@mail.gmail.com> <46a813870907201559s44d04599s183ee118d47320cf@mail.gmail.com> <320fb6e00907210213p5df40d5dl583a962069ed1867@mail.gmail.com> <46a813870907211335s24cdcb45w8f511e280743a31f@mail.gmail.com> Message-ID: <320fb6e00907211348t33b7989bx129e6ba00adea398@mail.gmail.com> On Tue, Jul 21, 2009 at 9:35 PM, life happy wrote: > I have tried using?? io.save("pdb_out_filename", se.accept_model(alt_model)) > > ?????? I get error as , 'int' object has no attribute 'accept_model' If "se" really is an integer, that isn't surprising! > If I use? io.save("pdb_out_filename", se = accept_model(alt_model)) > > ????? I get Error: name 'accept_model' is not defined > > In both the cases I created 'se' an object of Bio.PDB.Select() > Do you have an example for printing out some part of PDB? The examples here may help: http://lists.open-bio.org/pipermail/biopython/2009-May/005173.html http://biopython.org/wiki/Remove_PDB_disordered_atoms http://lists.open-bio.org/pipermail/biopython/2009-March/005005.html See also pages 5 and 6 of the Bio.PDB documentation, the bit on the Select class: http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf Peter From biopython at maubp.freeserve.co.uk Thu Jul 23 06:20:11 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Jul 2009 11:20:11 +0100 Subject: [Biopython] Storing SeqRecord objects with annotation Message-ID: <320fb6e00907230320r49809329p620f3d1d4a39fb36@mail.gmail.com> Hi Andrea (and everyone else), This is a continuation of a discussion started on Bug 2883. 
Andrea had a problem with unpickling SeqRecord objects which were pickled using an older version of Biopython. She was using pickle to store complicated annotated SeqRecord objects on disk. See http://bugzilla.open-bio.org/show_bug.cgi?id=2883 for details. http://bugzilla.open-bio.org/show_bug.cgi?id=2883#c6 On Bug 2883 comment 6, Peter wrote: >> >> If your SeqRecord objects are all simply loaded from sequence files in >> the first place (and not modified), I would just keep the original file and >> re-parse it. >> >> If you have generated your own SeqRecords (or modified those from >> reading a file), then it makes sense to save them somehow. The choice >> of file format depends on the nature of annotation. The latest Biopython >> will now record the features in a GenBank file, making that a reasonable >> choice - but this does not cover per-letter-annotations. BioSQL has the >> same limitation. http://bugzilla.open-bio.org/show_bug.cgi?id=2883#c7 On Bug 2883 comment 7, Andrea wrote: > > yes, i'm testing some predictors. I do prediction and i compare the > "newly predicted seqrecords" with the "previously correct predicted > pickled seqrecords". Sorry - when you said "test code" on the Bug discussion, I though you meant you were testing the code - not that this was real work doing biological tests. > I've them (the correct ones) only in pickled seqrecord format. The > correctly predicted seqrecord, before prediction were in fasta format, > but after i parsed them (into seqrecord), i did prediction, and then > i pickled them (during prediction i add to seqrecord features and > annotations). If you have SeqFeatures and SeqRecords with simple string based annotation, then BioSQL should be fine. If you have SeqFeatures, then using GenBank output might be enough. There are no general fields in the GenBank format for arbitary annotation though. > Actually i don't use per-letter-annotation despite the fact it seems > interesting. But i didn't find any example in documentation (that > show how the dictionary is populated...) so i really don't know > how to use it.... even if i've, during prediction, a "per position > annotation". You are right that the SeqRecord chapter in the Tutorial doesn't explicitly cover populating the per-letter-annotation. I can fix that... However, the built in documentation covers this (e.g. the section on slicing a SeqRecord to get a sub-record): >>> from Bio.SeqRecord import SeqRecord >>> help(SeqRecord) ... You can read this online: http://www.biopython.org/DIST/docs/api/Bio.SeqRecord.SeqRecord-class.html > Also if the "per letter annotation" is not managed in the GenBank > format or in the BioSQL format (that i use a lot) i've to wait!! Currently the BioSQL schema doesn't have any explicit support for "per letter annotation", but we could encode it as a string (e.g. using XML or JSON) perhaps. This will require coordination with BioSQL, BioPerl etc - and thus far no one has expressed a strong need for this. The GenBank file format simply doesn't have an concept of "per letter annotation". The PFAM/Stockholm alignment format does (for the special case of a single character per letter of the sequence), and in sequencing the base quality is also held in some file formats. > I was thinking also to store the pssm information somewhere in the > seqrecord.... but this would be a very big change... (and also > manage to store it in BioSQL.... )... but it's better to stop > the discussion here or to move it... 
:-) You can record any object in the SeqRecord's annotation dictionary. However, saving the result to a file will be tricky - and it wouldn't work in BioSQL either. Peter From andrea at biodec.com Thu Jul 23 08:23:19 2009 From: andrea at biodec.com (Andrea) Date: Thu, 23 Jul 2009 14:23:19 +0200 Subject: [Biopython] Storing SeqRecord objects with annotation In-Reply-To: <320fb6e00907230320r49809329p620f3d1d4a39fb36@mail.gmail.com> References: <320fb6e00907230320r49809329p620f3d1d4a39fb36@mail.gmail.com> Message-ID: <4A685637.30806@biodec.com> An HTML attachment was scrubbed... URL: From biopython at maubp.freeserve.co.uk Thu Jul 23 08:54:47 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Jul 2009 13:54:47 +0100 Subject: [Biopython] Storing SeqRecord objects with annotation In-Reply-To: <4A685637.30806@biodec.com> References: <320fb6e00907230320r49809329p620f3d1d4a39fb36@mail.gmail.com> <4A685637.30806@biodec.com> Message-ID: <320fb6e00907230554o1665af8cpbc44328df49c70bf@mail.gmail.com> On Thu, Jul 23, 2009 at 1:23 PM, Andrea wrote: > > To be precise i'm really testing code, my code. My predictors are > implemented in python and to be shure that during time, bug fixes, > modifications.. i won't alter the prediction results, i build some > unittest to compare the results of the modified code with the results > of the old code. > >Peter wrote: >> If you have SeqFeatures and SeqRecords with simple string based >> annotation, then BioSQL should be fine. > > According to me, for unittesting purposes, using Biosql for storing data > is quite expensive? in term of code (or it seems so...), despite the fact, > actually, BioSQL is for sure fine for storing? my annotations and > features. > >> If you have SeqFeatures, then using GenBank output might be >> enough. There are no general fields in the GenBank format for >> arbitrary annotation though. > > Yes, i think that GenBank wont store my "peronal annotations" > (or i've to check it). > >>> Actually i don't use per-letter-annotation despite the fact it seems >>> interesting. But i didn't find any example in documentation (that >>> show how the dictionary is populated...) so i really don't know >>> how to use it.... even if i've, during prediction, a "per position >>> annotation". >> >> You are right that the SeqRecord chapter in the Tutorial doesn't >> explicitly cover populating the per-letter-annotation. I can fix that... The next version of the Tutorial will include a short example of this. >> However, the built in documentation covers this (e.g. the section >> on slicing a SeqRecord to get a sub-record): >> >> from Bio.SeqRecord import SeqRecord >> help(SeqRecord) >> ... >> >> You can read this online: >> http://www.biopython.org/DIST/docs/api/Bio.SeqRecord.SeqRecord-class.html > > Very interesting and easy to use. I can either use it for: > ? - storing per position string representing the "per position label" > of the prediction > ? - storing list of per position reliabilities (raliability of prediction) > ? - storing sequence variant > ? - storing possible aligned sequence > But it's a pity that this is not yet managed in BioSQL .... Some of those might be possible using SeqFeature objects, but I agree, the "per letter annotation" seems more suitable. > Also if the "per letter annotation" is not managed in the GenBank > format or in the BioSQL format (that i use a lot) i've to wait!! Some special cases of "per letter annotation" are supported for file output (PFAM/Stockholm alignments, FASTQ, and QUAL), but that's it. 
The idea of the SeqRecord "per letter annotation" was to be sufficiently general to cover these and other future uses. >> Currently the BioSQL schema doesn't have any explicit support >> for "per letter annotation", but we could encode it as a string >> (e.g. using XML or JSON) perhaps. This will require coordination >> with BioSQL, BioPerl etc - and thus far no one has expressed a >> strong need for this. >> >> ... >> >> You can record any object in the SeqRecord's annotation >> dictionary. However, saving the result to a file will be tricky - >> and it wouldn't work in BioSQL either. > > I could say that i will use it, if it will work in biosql... but until > there won't be the? possibility to store this information (BioSQL, > GenBank...) i think the "per letter annotation" will lose part of its > "charme".... Currently BioSQL just stores strings for general annotation. I think extending BioSQL to store simple per-letter-annotation would be possible - for example strings, integers, and floating point numbers. However, storing objects like a PSSM might not be possible as we would want this to be compatible between the other Bio* bindings. Peter From hlapp at gmx.net Thu Jul 23 09:01:29 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 23 Jul 2009 09:01:29 -0400 Subject: [Biopython] Storing SeqRecord objects with annotation In-Reply-To: <320fb6e00907230320r49809329p620f3d1d4a39fb36@mail.gmail.com> References: <320fb6e00907230320r49809329p620f3d1d4a39fb36@mail.gmail.com> Message-ID: <8E8262E4-FEA0-4839-957B-A5B4A56F8E49@gmx.net> On Jul 23, 2009, at 6:20 AM, Peter wrote: > Currently the BioSQL schema doesn't have any explicit support > for "per letter annotation" I haven't been following the thread closely and so may be missing what is really meant by this. If, however, you mean associating annotation to a specific letter (position) in the sequence, BioSQL does support this - you'd create a seqfeature with appropriate location, and attach the annotation to the seqfeature. Bioentry annotations are location-less, by comparison. > > The GenBank file format simply doesn't have an concept of "per > letter annotation" Since it does for in the above sense, I'm inclined to assume that you really do mean something different than the above? > [...] > You can record any object in the SeqRecord's annotation dictionary. > However, saving the result to a file will be tricky - and it wouldn't > work in BioSQL either. Note that that's not entirely true. If you have a textual serialization (such as XML) of your object, you *can* store it in bioentry_qualifier_value. This is what we do in BioPerl with a TagTree annotation object that supports a nested hierarchical annotation structure needed for lossless representation of some UniProt lines. Obviously, that won't allow you to query very well by individual elements of your custom annotation object. But you can build a custom index (e.g., using Lucene) that does that. 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Thu Jul 23 09:32:39 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Jul 2009 14:32:39 +0100 Subject: [Biopython] Storing SeqRecord objects with annotation In-Reply-To: <8E8262E4-FEA0-4839-957B-A5B4A56F8E49@gmx.net> References: <320fb6e00907230320r49809329p620f3d1d4a39fb36@mail.gmail.com> <8E8262E4-FEA0-4839-957B-A5B4A56F8E49@gmx.net> Message-ID: <320fb6e00907230632q730aa496g4a07c50d5860bd54@mail.gmail.com> Hi Hilmar! I've CC'd this to the BioSQL list. The start of the thread was here: http://lists.open-bio.org/pipermail/biopython/2009-July/005385.html On Thu, Jul 23, 2009 at 2:01 PM, Hilmar Lapp wrote: > > On Jul 23, 2009, at 6:20 AM, Peter wrote: > >> Currently the BioSQL schema doesn't have any explicit support >> for "per letter annotation" > > I haven't been following the thread closely and so may be missing what is > really meant by this. If, however, you mean associating annotation to a > specific letter (position) in the sequence, BioSQL does support this - you'd > create a seqfeature with appropriate location, and attach the annotation to > the seqfeature. > > Bioentry annotations are location-less, by comparison. By "per letter annotation" we mean essentially a list of annotation data, with one entry for each letter in the sequence. For example, a sequencing quality score (from a FASTQ file) where this is one integer per letter (i.e. per base pair). Or, a secondary structure prediction, encoded as one character per letter (which could apply to proteins and nucleotides). This sort of thing could be done by using on feature per letter, but it would be dreadfully inefficient for storing in the database. >> [...] >> You can record any object in the SeqRecord's annotation dictionary. >> However, saving the result to a file will be tricky - and it wouldn't >> work in BioSQL either. > > Note that that's not entirely true. If you have a textual serialization > (such as XML) of your object, you *can* store it in > bioentry_qualifier_value. This is what we do in BioPerl with a TagTree > annotation object that supports a nested hierarchical annotation > structure needed for lossless representation of some UniProt lines. This was what I mentioned earlier in the thread - using XML or JSON to turn the object into a long string. However, we really need the Bio* projects to agree on some standards here, rather than each project adding its own additions ad hoc (which will make interoperation much trickier). For example, I was unaware you (BioPerl) had already pressed ahead with this for the UniProt data - which rather proves my point. > Obviously, that won't allow you to query very well by individual > elements of your custom annotation object. But you can build a > custom index (e.g., using Lucene) that does that. Yes, doing searches on an XML/JSON encoded string is an issue. But right now we are probably more interested in just solving the persistence of more complex objects. 
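As a small illustration of the idea (the key names and values here are invented for the example), the SeqRecord's per-letter annotation is a restricted dictionary whose values must have exactly one entry per letter of the sequence:

    from Bio.Seq import Seq
    from Bio.SeqRecord import SeqRecord

    record = SeqRecord(Seq("ACGTACGT"), id="example1")
    record.letter_annotations["phred_quality"] = [40, 40, 38, 35, 30, 30, 25, 20]
    record.letter_annotations["secondary_structure"] = "HHHH----"
    # slicing the record slices the per-letter annotation with it
    print record[2:6].letter_annotations["phred_quality"]
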
Peter From iitlife2008 at gmail.com Thu Jul 23 13:45:46 2009 From: iitlife2008 at gmail.com (life happy) Date: Thu, 23 Jul 2009 10:45:46 -0700 Subject: [Biopython] Writing into a PDB file using PDBIO module In-Reply-To: <320fb6e00907211348t33b7989bx129e6ba00adea398@mail.gmail.com> References: <46a813870907201408j5d72e25eg9fffcf61331e4aaa@mail.gmail.com> <320fb6e00907201414j549e0eefyc556157cf432b327@mail.gmail.com> <46a813870907201436q716cc4fah2ba5ff1d28d7917a@mail.gmail.com> <320fb6e00907201515o517c885ahb2c396efc4281f73@mail.gmail.com> <46a813870907201559s44d04599s183ee118d47320cf@mail.gmail.com> <320fb6e00907210213p5df40d5dl583a962069ed1867@mail.gmail.com> <46a813870907211335s24cdcb45w8f511e280743a31f@mail.gmail.com> <320fb6e00907211348t33b7989bx129e6ba00adea398@mail.gmail.com> Message-ID: <46a813870907231045g3491a296r4b29ec2278df23ec@mail.gmail.com> Hi Peter , Thanks, the links were helpful. But I am facing this problem. from Bio.PDB.PDBParser import PDBParser parser = PDBParser() filehandle = gzip.open(os.path.join("3dh4.pdb"), 'rb') structure = parser.get_structure( "3DH4", filehandle) filehandle.close() Select = Bio.PDB.Select() class GlySelect(Select): def accept_residue(self, residue): if residue.get_name()=='GLY': return 1 else: return 0 io=PDBIO() io.set_structure(structure) io.save('gly_only.pdb', GlySelect()) I use this code but I am getting the following error! File "aligned_matches_written_to_new_pdb_file.py", line 34, in class GlySelect(Select): TypeError: Error when calling the metaclass bases this constructor takes no arguments I have also tried the example in http://biopython.org/wiki/Remove_PDB_disordered_atoms.I get the same error message. What does this mean? Any remedy? Secondly, I didn't understand your answer to my question.."In which step are we sending the transformed co-ordinates into the PDB file? " The Superimposer is a black box for me. I give it atom lists, it gives me RMSD. But I want the aligned co-ordinates of the given atom lists, so that I can see the alignment in PyMol.I don't know how to extract aligned atom co-ordinates! Your example :- http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/python/protein_superposition/?fromGo=http%3A%2F%2Fgo.warwick.ac.uk%2Fpeter_cock%2Fpython%2Fprotein_superposition%2F does this job perfectly.It aptly prints out aligned models into a new PDB file.But I am working on two atom lists from two different proteins, unlike two models of same structure.Can you give me little push on how to deal superimposing two different structures? sincerely, Kumar. On Tue, Jul 21, 2009 at 1:48 PM, Peter wrote: > On Tue, Jul 21, 2009 at 9:35 PM, life happy wrote: > > I have tried using io.save("pdb_out_filename", > se.accept_model(alt_model)) > > > > I get error as , 'int' object has no attribute 'accept_model' > > If "se" really is an integer, that isn't surprising! > > > If I use io.save("pdb_out_filename", se = accept_model(alt_model)) > > > > I get Error: name 'accept_model' is not defined > > > > In both the cases I created 'se' an object of Bio.PDB.Select() > > Do you have an example for printing out some part of PDB? 
> > The examples here may help: > http://lists.open-bio.org/pipermail/biopython/2009-May/005173.html > http://biopython.org/wiki/Remove_PDB_disordered_atoms > http://lists.open-bio.org/pipermail/biopython/2009-March/005005.html > > See also pages 5 and 6 of the Bio.PDB documentation, the bit > on the Select class: > http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf > > Peter > From idoerg at gmail.com Thu Jul 23 14:09:03 2009 From: idoerg at gmail.com (Iddo Friedberg) Date: Thu, 23 Jul 2009 11:09:03 -0700 Subject: [Biopython] Writing into a PDB file using PDBIO module In-Reply-To: <46a813870907231045g3491a296r4b29ec2278df23ec@mail.gmail.com> References: <46a813870907201408j5d72e25eg9fffcf61331e4aaa@mail.gmail.com> <320fb6e00907201414j549e0eefyc556157cf432b327@mail.gmail.com> <46a813870907201436q716cc4fah2ba5ff1d28d7917a@mail.gmail.com> <320fb6e00907201515o517c885ahb2c396efc4281f73@mail.gmail.com> <46a813870907201559s44d04599s183ee118d47320cf@mail.gmail.com> <320fb6e00907210213p5df40d5dl583a962069ed1867@mail.gmail.com> <46a813870907211335s24cdcb45w8f511e280743a31f@mail.gmail.com> <320fb6e00907211348t33b7989bx129e6ba00adea398@mail.gmail.com> <46a813870907231045g3491a296r4b29ec2278df23ec@mail.gmail.com> Message-ID: Kumar: The following works. The main error you had was that you instantiated Select upon definition like so: Select = Bio.PDB.Select() Instead of: Select = Bio.PDB.Select Also, you used residue.get_name() instead of residue.get_resname() (there is no get_name() method). #!/usr/bin/python import Bio import os from Bio import PDB from Bio.PDB import PDBIO from Bio.PDB.PDBParser import PDBParser parser = PDBParser() mypdb="/home/idoerg/results/libbuilder/einat_blocks/pdb/1ZUG.pdb" filehandle = open(os.path.join(mypdb), 'rb') structure = parser.get_structure( "1ZUG", filehandle) filehandle.close() Select = Bio.PDB.Select class GlySelect(Select): def accept_residue(self, residue): # print dir(residue) if residue.get_resname()=='GLY': return 1 else: return 0 if __name__ == '__main__': io=PDBIO() io.set_structure(structure) io.save('gly_only.pdb', GlySelect()) On Thu, Jul 23, 2009 at 10:45 AM, life happy wrote: > Hi Peter , > > Thanks, the links were helpful. But I am facing this problem. > > from Bio.PDB.PDBParser import PDBParser > parser = PDBParser() > filehandle = gzip.open(os.path.join("3dh4.pdb"), 'rb') > structure = parser.get_structure( "3DH4", filehandle) > filehandle.close() > Select = Bio.PDB.Select() > class GlySelect(Select): > def accept_residue(self, residue): > if residue.get_name()=='GLY': > return 1 > else: > return 0 > io=PDBIO() > io.set_structure(structure) > io.save('gly_only.pdb', GlySelect()) > > I use this code but I am getting the following error! > > File "aligned_matches_written_to_new_pdb_file.py", line 34, in > class GlySelect(Select): > TypeError: Error when calling the metaclass bases > this constructor takes no arguments > > I have also tried the example in > http://biopython.org/wiki/Remove_PDB_disordered_atoms.I get the same error > message. What does this mean? Any remedy? > > Secondly, I didn't understand your answer to my question.."In which step > are > we sending the transformed co-ordinates into the PDB file? " The > Superimposer is a black box for me. I give it atom lists, it gives me RMSD. > But I want the aligned co-ordinates of the given atom lists, so that I can > see the alignment in PyMol.I don't know how to extract aligned atom > co-ordinates! 
> > Your example :- > > > http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/python/protein_superposition/?fromGo=http%3A%2F%2Fgo.warwick.ac.uk%2Fpeter_cock%2Fpython%2Fprotein_superposition%2F > > does this job perfectly.It aptly prints out aligned models into a new PDB > file.But I am working on two atom lists from two different proteins, unlike > two models of same structure.Can you give me little push on how to deal > superimposing two different structures? > > sincerely, > Kumar. > > > On Tue, Jul 21, 2009 at 1:48 PM, Peter >wrote: > > > On Tue, Jul 21, 2009 at 9:35 PM, life happy > wrote: > > > I have tried using io.save("pdb_out_filename", > > se.accept_model(alt_model)) > > > > > > I get error as , 'int' object has no attribute 'accept_model' > > > > If "se" really is an integer, that isn't surprising! > > > > > If I use io.save("pdb_out_filename", se = accept_model(alt_model)) > > > > > > I get Error: name 'accept_model' is not defined > > > > > > In both the cases I created 'se' an object of Bio.PDB.Select() > > > Do you have an example for printing out some part of PDB? > > > > The examples here may help: > > http://lists.open-bio.org/pipermail/biopython/2009-May/005173.html > > http://biopython.org/wiki/Remove_PDB_disordered_atoms > > http://lists.open-bio.org/pipermail/biopython/2009-March/005005.html > > > > See also pages 5 and 6 of the Bio.PDB documentation, the bit > > on the Select class: > > http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf > > > > Peter > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- Iddo Friedberg, Ph.D. Atkinson Hall, mail code 0446 University of California, San Diego 9500 Gilman Drive La Jolla, CA 92093-0446, USA T: +1 (858) 534-0570 http://iddo-friedberg.org From iitlife2008 at gmail.com Thu Jul 23 16:57:17 2009 From: iitlife2008 at gmail.com (life happy) Date: Thu, 23 Jul 2009 13:57:17 -0700 Subject: [Biopython] Creating and adding new models to a structure Message-ID: <46a813870907231357u47501af9jc96369f9f54faa37@mail.gmail.com> Hi Iddo Friedberg, Thanks for correcting me. Its working!! I have a new question. I like to store an atom list as a model in a structure.How can I do this? Kumar. On Thu, Jul 23, 2009 at 11:09 AM, Iddo Friedberg wrote: > Kumar: > > The following works. The main error you had was that you instantiated > Select upon definition like so: > Select = Bio.PDB.Select() > > Instead of: > > Select = Bio.PDB.Select > > Also, you used residue.get_name() instead of residue.get_resname() (there > is no get_name() method). > > #!/usr/bin/python > import Bio > import os > from Bio import PDB > from Bio.PDB import PDBIO > from Bio.PDB.PDBParser import PDBParser > parser = PDBParser() > mypdb="/home/idoerg/results/libbuilder/einat_blocks/pdb/1ZUG.pdb" > filehandle = open(os.path.join(mypdb), 'rb') > structure = parser.get_structure( "1ZUG", filehandle) > filehandle.close() > Select = Bio.PDB.Select > class GlySelect(Select): > def accept_residue(self, residue): > # print dir(residue) > if residue.get_resname()=='GLY': > return 1 > else: > return 0 > if __name__ == '__main__': > io=PDBIO() > io.set_structure(structure) > io.save('gly_only.pdb', GlySelect()) > > > > On Thu, Jul 23, 2009 at 10:45 AM, life happy wrote: > >> Hi Peter , >> >> Thanks, the links were helpful. But I am facing this problem. 
>> >> from Bio.PDB.PDBParser import PDBParser >> parser = PDBParser() >> filehandle = gzip.open(os.path.join("3dh4.pdb"), 'rb') >> structure = parser.get_structure( "3DH4", filehandle) >> filehandle.close() >> Select = Bio.PDB.Select() >> class GlySelect(Select): >> def accept_residue(self, residue): >> if residue.get_name()=='GLY': >> return 1 >> else: >> return 0 >> io=PDBIO() >> io.set_structure(structure) >> io.save('gly_only.pdb', GlySelect()) >> >> I use this code but I am getting the following error! >> >> File "aligned_matches_written_to_new_pdb_file.py", line 34, in >> class GlySelect(Select): >> TypeError: Error when calling the metaclass bases >> this constructor takes no arguments >> >> I have also tried the example in >> http://biopython.org/wiki/Remove_PDB_disordered_atoms.I get the same >> error >> message. What does this mean? Any remedy? >> >> Secondly, I didn't understand your answer to my question.."In which step >> are >> we sending the transformed co-ordinates into the PDB file? " The >> Superimposer is a black box for me. I give it atom lists, it gives me >> RMSD. >> But I want the aligned co-ordinates of the given atom lists, so that I can >> see the alignment in PyMol.I don't know how to extract aligned atom >> co-ordinates! >> >> Your example :- >> >> >> http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/python/protein_superposition/?fromGo=http%3A%2F%2Fgo.warwick.ac.uk%2Fpeter_cock%2Fpython%2Fprotein_superposition%2F >> >> does this job perfectly.It aptly prints out aligned models into a new PDB >> file.But I am working on two atom lists from two different proteins, >> unlike >> two models of same structure.Can you give me little push on how to deal >> superimposing two different structures? >> >> sincerely, >> Kumar. >> >> >> On Tue, Jul 21, 2009 at 1:48 PM, Peter > >wrote: >> >> > On Tue, Jul 21, 2009 at 9:35 PM, life happy >> wrote: >> > > I have tried using io.save("pdb_out_filename", >> > se.accept_model(alt_model)) >> > > >> > > I get error as , 'int' object has no attribute 'accept_model' >> > >> > If "se" really is an integer, that isn't surprising! >> > >> > > If I use io.save("pdb_out_filename", se = accept_model(alt_model)) >> > > >> > > I get Error: name 'accept_model' is not defined >> > > >> > > In both the cases I created 'se' an object of Bio.PDB.Select() >> > > Do you have an example for printing out some part of PDB? >> > >> > The examples here may help: >> > http://lists.open-bio.org/pipermail/biopython/2009-May/005173.html >> > http://biopython.org/wiki/Remove_PDB_disordered_atoms >> > http://lists.open-bio.org/pipermail/biopython/2009-March/005005.html >> > >> > See also pages 5 and 6 of the Bio.PDB documentation, the bit >> > on the Select class: >> > http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf >> > >> > Peter >> > >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> > > > > -- > Iddo Friedberg, Ph.D. 
> Atkinson Hall, mail code 0446 > University of California, San Diego > 9500 Gilman Drive > La Jolla, CA 92093-0446, USA > T: +1 (858) 534-0570 > http://iddo-friedberg.org > > From biopython.chen at gmail.com Thu Jul 23 22:28:21 2009 From: biopython.chen at gmail.com (chen Ku) Date: Thu, 23 Jul 2009 19:28:21 -0700 Subject: [Biopython] Biopython Digest, Vol 79, Issue 15 In-Reply-To: References: Message-ID: <4c2163890907231928x5429929sd82bddcecdd7a26c@mail.gmail.com> Hi I got successed in downloading all the pdb file > by biopython module. But now I want to fectch an output file where my > keyword word is ('carbonic andydrade') > second criteria is >=2 chains > third criteria is homology =30% > > Can you please write me few lines of codes to do it as I have some problem > in doing this.Please suggest me step by step if possible as I am struggling > for few days in this . > > I will be waiting for your kind help. >regards chen On Tue, Jul 21, 2009 at 9:00 AM, wrote: > Send Biopython mailing list submissions to > biopython at lists.open-bio.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://lists.open-bio.org/mailman/listinfo/biopython > or, via email, send a message with subject or body 'help' to > biopython-request at lists.open-bio.org > > You can reach the person managing the list at > biopython-owner at lists.open-bio.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of Biopython digest..." > > > Today's Topics: > > 1. Writing into a PDB file using PDBIO module (life happy) > 2. Re: Writing into a PDB file using PDBIO module (Peter) > 3. Re: Writing into a PDB file using PDBIO module (Peter) > 4. Re: Writing into a PDB file using PDBIO module (Peter) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Mon, 20 Jul 2009 14:08:21 -0700 > From: life happy > Subject: [Biopython] Writing into a PDB file using PDBIO module > To: biopython at lists.open-bio.org > Message-ID: > <46a813870907201408j5d72e25eg9fffcf61331e4aaa at mail.gmail.com> > Content-Type: text/plain; charset=ISO-8859-1 > > Hi there, > > I am new to Biopython and have been working for a couple of weeks on > Bio.PDB > module.I would appreciate any clue or help in the following matter. > > I have some short ,closely related peptide sequences.I want to align these > short peptides and send the aligned structures into a new PDB file.I used > set_atoms class in Superimposer module to align the short peptides. I tried > using PDBIO module, and send the aligned structures into a new PDB file. > But > when I see the output PDB file, I get the whole proteins not the short > peptides. I like to have output PDB file with all the short peptides > aligned > to any particular short peptide. > > > #This is the part of my code. B is list of atoms of peptides. C is a list > with PDB ids of each peptide. > > from Bio.PDB.Superimposer import Superimposer > fixed = B[0:1*(stop-start+1)] > sup = Superimposer() > for i in range(1,5) : > moving = B[i*(stop-start+1):(i+1)*(stop-start+1)] > sup.set_atoms(fixed, moving) > print "RMS(%s file %s chain, %s file %s model) = %0.2f" % > > (C[0][0].split("'")[1],C[0][2].split("'")[1],C[i][0].split("'")[1],C[i][2].split("'")[1], > sup.rms) > print "Saving %s aligned structure as PDB file %s" % > (C[0][2].split("'")[1], pdb_out_filename) > io=Bio.PDB.PDBIO() > io.set_structure(structure) > io.save(pdb_out_filename) > > thanks in advance!! > > cheers, > Kumar. 
> > > ------------------------------ > > Message: 2 > Date: Mon, 20 Jul 2009 22:14:50 +0100 > From: Peter > Subject: Re: [Biopython] Writing into a PDB file using PDBIO module > To: life happy > Cc: biopython at lists.open-bio.org > Message-ID: > <320fb6e00907201414j549e0eefyc556157cf432b327 at mail.gmail.com> > Content-Type: text/plain; charset=ISO-8859-1 > > On Mon, Jul 20, 2009 at 10:08 PM, life happy wrote: > > Hi there, > > > > I am new to Biopython and have been working for a couple of weeks on > Bio.PDB > > module.I would appreciate any clue or help in the following matter. > > > > I have some short ,closely related peptide sequences.I want to align > these > > short peptides and send the aligned structures into a new PDB file.I used > > set_atoms class in Superimposer module to align the short peptides. I > tried > > using PDBIO module, and send the aligned structures into a new PDB file. > But > > when I see the output PDB file, I get the whole proteins not the short > > peptides. I like to have output PDB file with all the short peptides > aligned > > to any particular short peptide. > > > > > > #This is the part of my code. B is list of atoms of peptides. C is a list > > with PDB ids of each peptide. > > > > from Bio.PDB.Superimposer import Superimposer > > fixed = B[0:1*(stop-start+1)] > > sup = Superimposer() > > for i in range(1,5) : > > moving = B[i*(stop-start+1):(i+1)*(stop-start+1)] > > sup.set_atoms(fixed, moving) > > print "RMS(%s file %s chain, %s file %s model) = %0.2f" % > > > (C[0][0].split("'")[1],C[0][2].split("'")[1],C[i][0].split("'")[1],C[i][2].split("'")[1], > > sup.rms) > > print "Saving %s aligned structure as PDB file %s" % > > (C[0][2].split("'")[1], pdb_out_filename) > > io=Bio.PDB.PDBIO() > > io.set_structure(structure) > > io.save(pdb_out_filename) > > > > thanks in advance!! > > Your example never defines the "structure" variable. I guess it should > be pointing at something in the "C" data structure... > > Peter > > > ------------------------------ > > Message: 3 > Date: Mon, 20 Jul 2009 23:15:54 +0100 > From: Peter > Subject: Re: [Biopython] Writing into a PDB file using PDBIO module > To: life happy > Cc: biopython at biopython.org > Message-ID: > <320fb6e00907201515o517c885ahb2c396efc4281f73 at mail.gmail.com> > Content-Type: text/plain; charset=ISO-8859-1 > > On Mon, Jul 20, 2009 at 10:36 PM, life happy wrote: > > No..this is only a piece of code. The structure object 'structure' was > > already created. > > You example never seems to appy the transformation. Have you read this? > http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > > It is a worked example using Bio.PDB's Superimposer, and it saves the > output. > > Peter > > > ------------------------------ > > Message: 4 > Date: Tue, 21 Jul 2009 10:13:13 +0100 > From: Peter > Subject: Re: [Biopython] Writing into a PDB file using PDBIO module > To: life happy > Cc: Biopython Mailing List > Message-ID: > <320fb6e00907210213p5df40d5dl583a962069ed1867 at mail.gmail.com> > Content-Type: text/plain; charset=ISO-8859-1 > > Please keep the mailing list CC'd. > > On Mon, Jul 20, 2009 at 11:59 PM, life happy wrote: > > Yes! I have read this. > > I'm glad you found that page (something I'd like to integrate into the > main Biopython Tutorial at some point): > http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > > > Which step applies the transformation?Isn't that > > set_atoms function? I am able to print RMS value. 
I did not follow the > > superimpose.apply(alt_model.get_atoms()) . > > As the name should suggest, superimpose.apply(...) actually applies the > transformation. This is what you are missing. The set_atoms(...) just tells > the code which atoms are going to be superimposed. > > > According to description in BioPDB faq pdf and > > > http://www.biopython.org/DIST/docs/api/Bio.PDB.Superimposer%27.Superimposer-class.html > > set_atom does the transformation, right? If I am wrong, please correct > me! > > That docstring is rather confusing, we should fix that. > > > Also,In which step are we sending the transformed co-ordinates into > > the PDB file? > > These lines write out the PDB file for the whole structure: > > io=Bio.PDB.PDBIO() > io.set_structure(structure) > io.save(pdb_out_filename) > > > Also, the output PDB file has whole protein, I only want the short > peptides > > aligned(only the atom lists that I gave as input must be aligned, not the > > whole protein of peptides). > > If you only want some of the protein written, then you should only give > some of the structure to the PDB output code. > > Peter > > > ------------------------------ > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > > End of Biopython Digest, Vol 79, Issue 15 > ***************************************** > From jblanca at btc.upv.es Fri Jul 24 04:53:15 2009 From: jblanca at btc.upv.es (Jose Blanca) Date: Fri, 24 Jul 2009 10:53:15 +0200 Subject: [Biopython] next-gen sequencing software Message-ID: <200907241053.15954.jblanca@btc.upv.es> Hi: We have been writting some code that we think that could be interesting to the Biopython community. Right now we're mainly interested in the new sequencing technologies, specially in: - cleaning of the raw reads provided by the sequencers. - parsing of the assembler results (ace, caf and bowtie map files) - SNP detecion and mining. - sequence annotation. We're writing some software to deal with that problems. Currently the software is not finished but it starts to be useful. Everything is written in python. We have used Biopython for some things, but for some others we have used a slighty different approach. If the Biopython developers think that some of our ideas could be of any use we would be willing to incorporate it into Biopython. If you want to take a look just go to: http://bioinf.comav.upv.es/svn/biolib/biolib/src/ Recently we have finished the cleaning infrastructure. We haven't yet pipelines defined for all the new sequencing technologies but we have created a pipeline system very easy to modify. With just a dozen of lines of code a new pipeline suited to a new sequencing technology can be created. There's also an script that runs those pipelines (run_cleannig_pipeline.py). We have also created a set of scripts that create statistics that ease the quality evaluation of the cleaning process. Regarding the SNPs we can get them using ace and caf files and we're finishing the parsing of the bowtie map files. All these files are transformed into an iterator of contig objects. There is also funcionallity to get SNPs and statistics from these contig objects. We're willing to get comments, suggestions, criticisms. Best regards, -- Jose M. 
Blanca Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) P.D. We're using this functionality in a computer cluster, so everything is parallelized. From biopython at maubp.freeserve.co.uk Fri Jul 24 05:38:43 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Jul 2009 10:38:43 +0100 Subject: [Biopython] Searching a local copy of the PDB Message-ID: <320fb6e00907240238t587a7fb8l8492fd036d3ed66@mail.gmail.com> Hi Chen, When replying to a digest email, it is a good idea to change the subject line to something specific. On Fri, Jul 24, 2009 at 3:28 AM, chen Ku wrote: > Hi > I got successed in downloading all the pdb file by biopython module. Good. > But now I want to fectch an output file where my > keyword word is ('carbonic andydrade') > second criteria is >=2 chains > third criteria is homology =30% > > Can you please write me few lines of codes to do it as I have some problem > in doing this.Please suggest me step by step if possible as I am struggling > for few days in this . If I understand you correctly, you have downloaded all the PDB files to your computer (as plain text PDB format data). And now you want to search them? Are you using Unix or Windows? There are several Unix command line tools like grep, which are very good at searching plain text files. That might be a good way to look for PDB files containing the words 'carbonic andydrade'. I'm not sure what the fastest way to count the chains in a PDB file would be. If you only find a few hundred PDB files with 'carbonic andydrade', it might be OK just to parse them with Bio.PDB and count the chains that way. Finally, your third criteria is homology =30% - but homology to what? And how are you measuring homology? I guess you mean 30% sequence identity to a reference carbonic andydrade protein? If what you want to do is take a known carbonic andydrade protein, and search the PDB for similar sequences then there are better ways to do this. I would run BLASTP against the PDB sequences. You can do this at the NCBI via their webpages, or from within Biopython using the Bio.Blast.NCBIWWW.qblast function. Peter From biopython at maubp.freeserve.co.uk Fri Jul 24 05:50:08 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Jul 2009 10:50:08 +0100 Subject: [Biopython] next-gen sequencing software In-Reply-To: <200907241053.15954.jblanca@btc.upv.es> References: <200907241053.15954.jblanca@btc.upv.es> Message-ID: <320fb6e00907240250h128654e4w2e2845255392d205@mail.gmail.com> On Fri, Jul 24, 2009 at 9:53 AM, Jose Blanca wrote: > Hi: > > We have been writting some code that we think that could be interesting to the > Biopython community. ... Currently the software is not finished but it starts to > be useful. Everything is written in python. We have used Biopython for some > things, but for some others we have used a slighty different approach. If the > Biopython developers think that some of our ideas could be of any use we > would be willing to incorporate it into Biopython. > If you want to take a look just go to: > http://bioinf.comav.upv.es/svn/biolib/biolib/src/ Cool. I already knew you had some interesting ideas for contig classes. I see you also have a parser for EMBOSS water output - where you actually collect some useful information from the header, which the Biopython parser ignores.
This was a simplification because the current Biopython alignment object doesn't have a proper annotation system. Work on improving the Biopython alignment object and introducing a contig object is something I would like to see for the next release (once Biopython 1.51 is out). I'm sure there is other stuff in your code that would also be very useful. If you want to contribute code to Biopython is will have to be under our MIT style license, but in the meantime maybe you should stick an an explicit license on your code? Peter From darnells at dnastar.com Fri Jul 24 10:15:09 2009 From: darnells at dnastar.com (Steve Darnell) Date: Fri, 24 Jul 2009 09:15:09 -0500 Subject: [Biopython] Searching a local copy of the PDB In-Reply-To: <320fb6e00907240238t587a7fb8l8492fd036d3ed66@mail.gmail.com> References: <320fb6e00907240238t587a7fb8l8492fd036d3ed66@mail.gmail.com> Message-ID: Greetings, You could also do this using the PDB Advanced Search option. Although not a scriptable solution, it's perfect for a few manual queries. Here are my suggested parameters: Match **all** of the following conditions Subquery 1: Keyword: Advanced, Keywords: **carbonic andydrade** (did you mean anhydrase?), Search Scope: **Full Text** Subquery 2: Sequence Features: Number of Chains, Between: **2** and **** **** Remove Similar Sequences at **30%** Identity Query comes back with 12 structures and 25 unreleased structures for "carbonic anhydrase." No results for "andydrade." Regards, Steve Darnell -----Original Message----- From: biopython-bounces at lists.open-bio.org [mailto:biopython-bounces at lists.open-bio.org] On Behalf Of Peter Sent: Friday, July 24, 2009 4:39 AM To: chen Ku Cc: biopython at lists.open-bio.org Subject: [Biopython] Searching a local copy of the PDB Hi Chen, When replying to a digest email, it is a good idea to change the subject line to something specific. On Fri, Jul 24, 2009 at 3:28 AM, chen Ku wrote: > Hi >? ? ? ? ?I got successed in downloading all the pdb file by biopython module. Good. > But now I want to fectch an output file where my keyword word is >('carbonic andydrade') >?second criteria is >=2 chains > third criteria is homology =30% > > Can you please write me few lines of codes to do it as I have some > problem in doing this.Please suggest me step by step if possible as I > am struggling for few days in this . If I understand you correctly, you have download all the PDB files to your computer (as plain text PDB format data). And now you want to search them? Are you using Unix or Windows? There are several Unix command line tools like grep, which are very good at searching plain text files. That might be a good way to look for PDB files containing the words 'carbonic andydrade'. I'm not sure what the fastest way to count the chains in a PDB file would be. If you only find a few hundred PDB files with 'carbonic andydrade', it might be OK just to parse them with Bio.PDB and count the chains that way. Finally, your third criteria is homology =30% - but homology to what? And how are you measuring homology? I guess you mean 30% sequence identity to a reference carbonic andydrade protein? If what you want to do is take a known carbonic andydrade protein, and search the PDB for similar sequences then there are better ways to do this. I would run BLASTP against the PDB sequences. You can do this at the NCBI via their webpages, or from within Biopython using the Bio.Blast.NCBIWWW.qblast function. 
Peter _______________________________________________ Biopython mailing list - Biopython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From jkhilmer at gmail.com Fri Jul 24 11:19:27 2009 From: jkhilmer at gmail.com (Jonathan Hilmer) Date: Fri, 24 Jul 2009 09:19:27 -0600 Subject: [Biopython] Searching a local copy of the PDB In-Reply-To: References: <320fb6e00907240238t587a7fb8l8492fd036d3ed66@mail.gmail.com> Message-ID: <81277ce10907240819j3710c35j2d336209ba474451@mail.gmail.com> Just for the record, a few years back I ran some Biopython-based code to check structural statistics of a local copy of the entire PDB. I was parsing to the level of each alpha-carbon, but it was still fast enough to be a very viable way to run the calculations. Clearly in this case it's not the best solution to use Bio.PDB, but if you have a local mirror then there's no reason you couldn't do it via structure-parsing. Also, the PDB Advanced search should be scriptable, just not in a convenient way. The Python module ClientForm should handle it. Jonathan On Fri, Jul 24, 2009 at 8:15 AM, Steve Darnell wrote: > Greetings, > > You could also do this using the PDB Advanced Search option. ?Although not a scriptable solution, it's perfect for a few manual queries. ?Here are my suggested parameters: > > Match **all** of the following conditions > > Subquery 1: Keyword: Advanced, Keywords: **carbonic andydrade** (did you mean anhydrase?), Search Scope: **Full Text** > Subquery 2: Sequence Features: Number of Chains, Between: **2** and **** > > **** Remove Similar Sequences at **30%** Identity > > Query comes back with 12 structures and 25 unreleased structures for "carbonic anhydrase." ?No results for "andydrade." > > Regards, > Steve Darnell > > > -----Original Message----- > From: biopython-bounces at lists.open-bio.org [mailto:biopython-bounces at lists.open-bio.org] On Behalf Of Peter > Sent: Friday, July 24, 2009 4:39 AM > To: chen Ku > Cc: biopython at lists.open-bio.org > Subject: [Biopython] Searching a local copy of the PDB > > Hi Chen, > > When replying to a digest email, it is a good idea to change the subject line to something specific. > > On Fri, Jul 24, 2009 at 3:28 AM, chen Ku wrote: >> Hi >>? ? ? ? ?I got successed in downloading all the pdb file by biopython module. > > Good. > >> But now I want to fectch an output file where my ?keyword word is >>('carbonic andydrade') >>?second criteria is >=2 chains >> third criteria is homology =30% >> >> Can you please write me few lines of codes to do it as I have some >> problem in doing this.Please suggest me step by step if possible as I >> am struggling for few days in this . > > If I understand you correctly, you have download all the PDB files to your computer (as plain text PDB format data). And now you want to search them? > > Are you using Unix or Windows? There are several Unix command line tools like grep, which are very good at searching plain text files. That might be a good way to look for PDB files containing the words 'carbonic andydrade'. > > I'm not sure what the fastest way to count the chains in a PDB file would be. If you only find a few hundred PDB files with 'carbonic andydrade', it might be OK just to parse them with Bio.PDB and count the chains that way. > > Finally, your third criteria is homology =30% - but homology to what? > And how are you measuring homology? I guess you mean 30% sequence identity to a reference carbonic andydrade protein? 
> > If what you want to do is take a known carbonic andydrade protein, and search the PDB for similar sequences then there are better ways to do this. I would run BLASTP against the PDB sequences. > You can do this at the NCBI via their webpages, or from within Biopython using the Bio.Blast.NCBIWWW.qblast function. > > Peter > > _______________________________________________ > Biopython mailing list ?- ?Biopython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython > > _______________________________________________ > Biopython mailing list ?- ?Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From matzke at berkeley.edu Wed Jul 29 00:38:44 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Tue, 28 Jul 2009 21:38:44 -0700 Subject: [Biopython] PDBid to Uniprot ID? In-Reply-To: <320fb6e00906250204m7268549eqf37d41f76313a589@mail.gmail.com> References: <4A42A2D4.8060400@berkeley.edu> <320fb6e00906250204m7268549eqf37d41f76313a589@mail.gmail.com> Message-ID: <4A6FD254.2070803@berkeley.edu> Peter wrote: > On Wed, Jun 24, 2009 at 11:04 PM, Nick Matzke wrote: >> Hi all, >> >> I have succeeded in using the BioPython PDB parser to download a PDB file, >> parse the structure, etc. But I am wondering if there is an easy way to retrieve >> the UniProt ID that corresponds to the structure? >> >> I.e., if the structure is 1QFC... >> http://www.pdb.org/pdb/explore/explore.do?structureId=1QFC >> >> ...the Uniprot ID is (click "Sequence" above): P29288 >> http://www.pdb.org/pdb/explore/remediatedSequence.do?structureId=1QFC >> >> I don't see a way to get this out of the current parser, so I guess I will schlep >> through the downloaded structure file for "UNP P29288" unless someone >> has a better idea. > > Well, I would at least look for a line starting "DBREF" and then search that > for the reference. > > Right now the PDB header parsing is minimal, and even that was something > of an after thought - Eric has been looking at this stuff recently, but I image > he will be busy with his GSoC work at the moment. This could be handled > as another tiny incremental addition to parse_pdb_header.py - right now I > don't think it looks at the "DBREF" lines. > > Peter I forgot to post to the list, I wrote a function for parsing the DBREF line a couple of weeks ago, it should be pretty comprehensive as it uses the official specifications for DBREF lines. Here's the code to save other people re-inventing the wheel. Free to use/modify/include in a biopython upgrade whatever... =================== def parse_DBREF_line(line): """ Following format here: http://www.wwpdb.org/documentation/format23/sect3.html Record Format COLUMNS DATA TYPE FIELD DEFINITION ---------------------------------------------------------------- 1 - 6 Record name "DBREF " 8 - 11 IDcode idCode ID code of this entry. 13 Character chainID Chain identifier. 15 - 18 Integer seqBegin Initial sequence number of the PDB sequence segment. 19 AChar insertBegin Initial insertion code of the PDB sequence segment. 21 - 24 Integer seqEnd Ending sequence number of the PDB sequence segment. 25 AChar insertEnd Ending insertion code of the PDB sequence segment. 27 - 32 LString database Sequence database name. 34 - 41 LString dbAccession Sequence database accession code. 43 - 54 LString dbIdCode Sequence database identification code. 56 - 60 Integer dbseqBegin Initial sequence number of the database seqment. 61 AChar idbnsBeg Insertion code of initial residue of the segment, if PDB is the reference. 
63 - 67 Integer dbseqEnd Ending sequence number of the database segment. 68 AChar dbinsEnd Insertion code of the ending residue of the segment, if PDB is the reference. Database name database (code in columns 27 - 32) ---------------------------------------------------------- GenBank GB Protein Data Bank PDB Protein Identification Resource PIR SWISS-PROT SWS TREMBL TREMBL UNIPROT UNP Test line: line=" 1QFC A 1 306 UNP P29288 PPA5_RAT 22 327 " """ data_type_list = ['Record name', 'IDcode', 'Character', 'Integer', 'AChar', 'Integer', 'AChar', 'LString', 'LString', 'LString', 'Integer', 'AChar', 'Integer', 'AChar'] field_list = ['"DBREF "', 'idCode', 'chainID', 'seqBegin', 'insertBegin', 'seqEnd', 'insertEnd', 'database', 'dbAccession', 'dbIdCode', 'dbseqBegin', 'idbnsBeg', 'dbseqEnd', 'dbinsEnd'] def_list = ['', 'ID code of this entry.', 'Chain identifier.', 'Initial sequence number of the PDB sequence segment.', 'Initial insertion code of the PDB sequence segment.', 'Ending sequence number of the PDB sequence segment.', 'Ending insertion code of the PDB sequence segment.', 'Sequence database name.', 'Sequence database accession code.', 'Sequence database identification code.', 'Initial sequence number of the database seqment.', 'Insertion code of initial residue of the segment, if PDB is the reference.', 'Ending sequence number of the database segment.', 'Insertion code of the ending residue of the segment, if PDB is the reference.'] charpos_list = [(1,6), (8,11), (13,13), (15,18), (19,19), (21,24), (25,25), (27,32), (34,41), (43,54), (56,60), (61,61), (63,67), (68,68)] data_list = ['', '', '', '', '', '', '', '', '', '', '', '', '', ''] # Make empty dictionary dbref_dict = {} for index in range(0,len(field_list)): dbref_dict[ field_list[index] ] = [ data_type_list[index], charpos_list[index], data_list[index], def_list[index] ] for field in field_list: #print field #print dbref_dict[field][1] startpos = int(dbref_dict[field][1][0]) endpos = int(dbref_dict[field][1][1]) dbref_dict[field][2] = get_char_range(line, startpos, endpos) return dbref_dict =================== > -- ==================================================== Nicholas J. Matzke Ph.D. Candidate, Graduate Student Researcher Huelsenbeck Lab Center for Theoretical Evolutionary Genomics 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab websites: http://ib.berkeley.edu/people/lab_detail.php?lab=54 http://fisher.berkeley.edu/cteg/hlab.html Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html Lab phone: 510-643-6299 Dept. fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ----------------------------------------------------- "[W]hen people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989. 
http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm ==================================================== From pzs at dcs.gla.ac.uk Wed Jul 29 06:56:11 2009 From: pzs at dcs.gla.ac.uk (Peter Saffrey) Date: Wed, 29 Jul 2009 11:56:11 +0100 Subject: [Biopython] Restriction enzyme digestion gels Message-ID: <4A702ACB.2080204@dcs.gla.ac.uk> I want to run an "in-silico" gel, where I take a nucleotide sequence, cut it with an enzyme (probably using a tool like restrictionmapper): http://www.restrictionmapper.org/ and then produce a picture of what the gel should look like, with bands where the cuts have been made. I was wondering whether biopython has any tools for doing this. Otherwise, I'll hack something up in matplotlib. Cheers, Peter From biopython at maubp.freeserve.co.uk Wed Jul 29 07:35:27 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 29 Jul 2009 12:35:27 +0100 Subject: [Biopython] Restriction enzyme digestion gels In-Reply-To: <4A702ACB.2080204@dcs.gla.ac.uk> References: <4A702ACB.2080204@dcs.gla.ac.uk> Message-ID: <320fb6e00907290435i32a15382l48206bdbedbd7bf6@mail.gmail.com> On Wed, Jul 29, 2009 at 11:56 AM, Peter Saffrey wrote: > I want to run an "in-silico" gel, where I take a nucleotide sequence, cut it > with an enzyme (probably using a tool like restrictionmapper): > > http://www.restrictionmapper.org/ > > and then produce a picture of what the gel should look like, with bands > where the cuts have been made. I was wondering whether biopython has any > tools for doing this. Otherwise, I'll hack something up in matplotlib. Biopython has a restriction digest module which should be able to take care of the first step for you at least: http://biopython.org/DIST/docs/cookbook/Restriction.html There is nothing built into Biopython's graphics module for generating fake gel images - so using matplot seems worth trying. However, I would suggest you talk to Jose Blanca about his work first: http://lists.open-bio.org/pipermail/biopython-dev/2009-March/005472.html http://bioinf.comav.upv.es/svn/gelify/gelifyfsa/ Peter From carlos.borroto at gmail.com Thu Jul 30 13:18:56 2009 From: carlos.borroto at gmail.com (Carlos Javier Borroto) Date: Thu, 30 Jul 2009 13:18:56 -0400 Subject: [Biopython] How to efetch Unigene records? Is it possible at all? Message-ID: <65d4b7fc0907301018v696c2045j791fb0f3a1e00fd6@mail.gmail.com> Hi, I'm very new to Biopython and to Python in general, has a little knowledge of Perl and some previous work with Bioperl. I have the task to from a list of human genes of interest, grab their protein counter parts in the database to do some additional work. In the beginning I was thinking that using Bio.Entrez module and Bio.SeqIO parser I could get the proteins counter parts, but I haven't found a way to do it, oddly I haven't found a way to get the crossreference through the parser even when I can see the genebank files have always one. Any way because I also have the Unigene ID list, and it seems that the Unigene parser have a way to get the crossreference, I now want to download all of the Unigene records and parse from there. 
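For the digestion step, a minimal sketch with Bio.Restriction (the sequence here is just an invented example with three EcoRI sites; the fragment lengths are what you would turn into band positions on the fake gel):

from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
from Bio.Restriction import EcoRI

my_seq = Seq("GAATTC" + "AAAAAA" + "GAATTC" + "CCCCCC" + "GAATTC",
             IUPAC.unambiguous_dna)

# positions at which EcoRI cuts this (linear) sequence:
print EcoRI.search(my_seq)

# the fragments themselves - their lengths give the band sizes:
fragments = EcoRI.catalyse(my_seq)
print [len(f) for f in fragments]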
But efetch is not working with unigene, I mean this is not working: >>> from Bio import Entrez >>> from Bio import UniGene >>> Entrez.email = "carlos.borroto at gmail.com" >>> handle = Entrez.esearch(db="unigene", term="Hs.94542") >>> record = Entrez.read(handle) >>> record {u'Count': '1', u'RetMax': '1', u'IdList': ['141673'], u'TranslationStack': [{u'Count': '1', u'Field': 'All Fields', u'Term': 'Hs.94542[All Fields]', u'Explode': 'Y'}, 'GROUP'], u'TranslationSet': [], u'RetStart': '0', u'QueryTranslation': 'Hs.94542[All Fields]'} >>> handle = Entrez.efetch(db="unigene", id="Hs.94542") >>> print handle.read() This print like a webpage, I assume is NCBI server giving an error response. So there is something I could do to accomplish what I want, either through parsing the Genebank files or fetching the Unigene and then parsing its? Any help or pointing to some helpful documentation will be highly appreciated. Thanks in advance -- Carlos Javier From chapmanb at 50mail.com Thu Jul 30 18:09:02 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 30 Jul 2009 18:09:02 -0400 Subject: [Biopython] How to efetch Unigene records? Is it possible at all? In-Reply-To: <65d4b7fc0907301018v696c2045j791fb0f3a1e00fd6@mail.gmail.com> References: <65d4b7fc0907301018v696c2045j791fb0f3a1e00fd6@mail.gmail.com> Message-ID: <20090730220902.GD84345@sobchak.mgh.harvard.edu> Hi Carlos; > I have the task to from a list of human genes of interest, grab their > protein counter parts in the database to do some additional work. [...] > >>> from Bio import Entrez > >>> from Bio import UniGene > >>> Entrez.email = "carlos.borroto at gmail.com" > >>> handle = Entrez.esearch(db="unigene", term="Hs.94542") > >>> record = Entrez.read(handle) > >>> record > {u'Count': '1', u'RetMax': '1', u'IdList': ['141673'], > u'TranslationStack': [{u'Count': '1', u'Field': 'All Fields', u'Term': > 'Hs.94542[All Fields]', u'Explode': 'Y'}, 'GROUP'], u'TranslationSet': > [], u'RetStart': '0', u'QueryTranslation': 'Hs.94542[All Fields]'} > >>> handle = Entrez.efetch(db="unigene", id="Hs.94542") > >>> print handle.read() > > This print like a webpage, I assume is NCBI server giving an error response. > > So there is something I could do to accomplish what I want, either > through parsing the Genebank files or fetching the Unigene and then > parsing its? It looks like you are doing things correctly, but I'm not sure if NCBI supports retrieving UniGene records through the efetch interface. I tried playing around with it for a bit and got the same problems as you; the documentation on their site is also not very clear about if unigene is supported and what return types to get. Not having a lot of experience with UniGene, my guess is this isn't the right direction to go. My suggestion to get your work done is to download the *.data files from the ftp site: ftp://ftp.ncbi.nih.gov/repository/UniGene/ and write a script that runs through these and pulls out the protein identifiers of interest. You should be able to use the UniGene parser for this and use the protsim attribute of each record. With these, you can get the GI number (protgi attribute) and use this to fetch the relevant GenBank records through Entrez. Hope this helps, Brad From carlos.borroto at gmail.com Thu Jul 30 18:27:24 2009 From: carlos.borroto at gmail.com (Carlos Javier Borroto) Date: Thu, 30 Jul 2009 18:27:24 -0400 Subject: [Biopython] How to efetch Unigene records? Is it possible at all? 
In-Reply-To: <20090730220902.GD84345@sobchak.mgh.harvard.edu> References: <65d4b7fc0907301018v696c2045j791fb0f3a1e00fd6@mail.gmail.com> <20090730220902.GD84345@sobchak.mgh.harvard.edu> Message-ID: <65d4b7fc0907301527r3b11f923mca6834b831631098@mail.gmail.com> On Thu, Jul 30, 2009 at 6:09 PM, Brad Chapman wrote: > Hi Carlos; > >> I have the task to from a list of human genes of interest, grab their >> protein counter parts in the database to do some additional work. > > It looks like you are doing things correctly, but I'm not sure if > NCBI supports retrieving UniGene records through the efetch > interface. I tried playing around with it for a bit and got the same > problems as you; the documentation on their site is also not very > clear about if unigene is supported and what return types to get. > Not having a lot of experience with UniGene, my guess is this isn't > the right direction to go. > > My suggestion to get your work done is to download the *.data files > from the ftp site: > > ftp://ftp.ncbi.nih.gov/repository/UniGene/ > > and write a script that runs through these and pulls out the protein > identifiers of interest. You should be able to use the UniGene > parser for this and use the protsim attribute of each record. With > these, you can get the GI number (protgi attribute) and use this to > fetch the relevant GenBank records through Entrez. > > Hope this helps, > Brad > Thanks, I was wondering because this is the first time I use Biopython or NCBI scripting facilities if I was doing something completely wrong. I'm going to follow your advice. Thank you for taking the time to review my concern. regards, -- Carlos Javier From stran104 at chapman.edu Thu Jul 30 20:10:11 2009 From: stran104 at chapman.edu (Matthew Strand) Date: Thu, 30 Jul 2009 17:10:11 -0700 Subject: [Biopython] How to efetch Unigene records? Is it possible at all? In-Reply-To: <65d4b7fc0907301527r3b11f923mca6834b831631098@mail.gmail.com> References: <65d4b7fc0907301018v696c2045j791fb0f3a1e00fd6@mail.gmail.com> <20090730220902.GD84345@sobchak.mgh.harvard.edu> <65d4b7fc0907301527r3b11f923mca6834b831631098@mail.gmail.com> Message-ID: <2a63cc350907301710w57d4d4b9nb89fea39f9e62b76@mail.gmail.com> Hi Carlos, I did something similar to this a while ago and meant to write a cookbook entry for it but haven't gotten the chance yet. You could also try doing an efetch on the ID of the record returned by esearch. I'm not near my workstation so I can't test it but you might try: handle = Entrez.efetch(db="unigene", id="141673") If that works then you just need to pull the ID out of the esearch result and do an efetch on it. -- Matthew Strand stran104 at chapman.edu From lueck at ipk-gatersleben.de Fri Jul 31 04:27:28 2009 From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=) Date: Fri, 31 Jul 2009 10:27:28 +0200 Subject: [Biopython] blastall several alignment viewings options Message-ID: <049a01ca11b8$be2c0110$1022a8c0@ipkgatersleben.de> Hello! is there a way to set 2 or more alignment viewing options in one blast run? I would like to get the xml and the Query-anchored (and maybe some other) but to run Blast twice would be kind of stupid and slowing down. 
Thanks Stefanie From biopython at maubp.freeserve.co.uk Fri Jul 31 05:18:29 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 31 Jul 2009 10:18:29 +0100 Subject: [Biopython] blastall several alignment viewings options In-Reply-To: <049a01ca11b8$be2c0110$1022a8c0@ipkgatersleben.de> References: <049a01ca11b8$be2c0110$1022a8c0@ipkgatersleben.de> Message-ID: <320fb6e00907310218m549294cdk8187cb310332ce11@mail.gmail.com> On Fri, Jul 31, 2009 at 9:27 AM, Stefanie Lück wrote: > Hello! > > is there a way to set 2 or more alignment viewing options in one blast run? > I would like to get the xml and the Query-anchored (and maybe some other) > but to run Blast twice would be kind of stupid and slowing down. I don't think there is. The XML file should contain enough data to recreate some of the other views (if I recall correctly Sebastian Bassi has a script to do that). However, that may not be possible for the Query-anchored output. Peter From lueck at ipk-gatersleben.de Fri Jul 31 05:25:51 2009 From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=) Date: Fri, 31 Jul 2009 11:25:51 +0200 Subject: [Biopython] blastall several alignment viewings options References: <049a01ca11b8$be2c0110$1022a8c0@ipkgatersleben.de> <320fb6e00907310218m549294cdk8187cb310332ce11@mail.gmail.com> Message-ID: <001201ca11c0$e5dc1800$1022a8c0@ipkgatersleben.de> Thanks Peter! I expected this, I just wanted to be sure since it's stupid to recreate things which are already existing. Have a nice weekend! Stefanie ----- Original Message ----- From: "Peter" To: "Stefanie Lück" Cc: Sent: Friday, July 31, 2009 11:18 AM Subject: Re: [Biopython] blastall several alignment viewings options On Fri, Jul 31, 2009 at 9:27 AM, Stefanie Lück wrote: > Hello! > > is there a way to set 2 or more alignment viewing options in one blast > run? > I would like to get the xml and the Query-anchored (and maybe some other) > but to run Blast twice would be kind of stupid and slowing down. I don't think there is. The XML file should contain enough data to recreate some of the other views (if I recall correctly Sebastian Bassi has a script to do that). However, that may not be possible for the Query-anchored output. Peter From biopython at maubp.freeserve.co.uk Fri Jul 31 06:08:42 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 31 Jul 2009 11:08:42 +0100 Subject: [Biopython] blastall several alignment viewings options In-Reply-To: <001201ca11c0$e5dc1800$1022a8c0@ipkgatersleben.de> References: <049a01ca11b8$be2c0110$1022a8c0@ipkgatersleben.de> <320fb6e00907310218m549294cdk8187cb310332ce11@mail.gmail.com> <001201ca11c0$e5dc1800$1022a8c0@ipkgatersleben.de> Message-ID: <320fb6e00907310308s1d7ab530vef7f531384ecd39e@mail.gmail.com> On Fri, Jul 31, 2009 at 10:25 AM, Stefanie Lück wrote: > Thanks Peter! I expected this, I just wanted to be sure since it's stupid > to > recreate things which are already existing. > Have a nice weekend! > Stefanie I know you are using standalone BLAST (blastall), but if you were doing this online via the NCBI website, you can reformat the output (without recalculating it). This *might* be possible via the QBLAST interface too... it would take some experimentation.
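If you do try recreating other views from the XML output (blastall -m 7), the parsing side is straightforward - a minimal sketch, assuming a results file called my_blast.xml (the filename is just for illustration):

from Bio.Blast import NCBIXML

for record in NCBIXML.parse(open("my_blast.xml")):
    for alignment in record.alignments:
        for hsp in alignment.hsps:
            # the aligned strings and statistics the plain-text views
            # are built from:
            print alignment.title
            print hsp.expect
            print hsp.query
            print hsp.match
            print hsp.sbjct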
Peter From lueck at ipk-gatersleben.de Fri Jul 31 06:28:11 2009 From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=) Date: Fri, 31 Jul 2009 12:28:11 +0200 Subject: [Biopython] blastall several alignment viewings options References: <049a01ca11b8$be2c0110$1022a8c0@ipkgatersleben.de> <320fb6e00907310218m549294cdk8187cb310332ce11@mail.gmail.com> <001201ca11c0$e5dc1800$1022a8c0@ipkgatersleben.de> <320fb6e00907310308s1d7ab530vef7f531384ecd39e@mail.gmail.com> Message-ID: <002901ca11c9$9a9ed680$1022a8c0@ipkgatersleben.de> In my new project I'll do both, online and local BLAST. Anyway I'll recreate it, it's should be done quickly. In case that someone need it too, I can provide it! ----- Original Message ----- From: "Peter" To: "Stefanie L?ck" Cc: Sent: Friday, July 31, 2009 12:08 PM Subject: Re: [Biopython] blastall several alignment viewings options On Fri, Jul 31, 2009 at 10:25 AM, Stefanie L?ck wrote: > Thanks Peter! I expected this, I just wanted to be sure since it's stupid > to > recreate things which are already existing. > Have a nice weekend! > Stefanie I know you are using standalone BLAST (blastall), but if you were doing this online via the NCBI website, you can reformat the output (without recalculating it). This *might* be possible via the QBLAST interface too... it would take some experimentation. Peter From lueck at ipk-gatersleben.de Fri Jul 31 06:37:59 2009 From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=) Date: Fri, 31 Jul 2009 12:37:59 +0200 Subject: [Biopython] EuroSciPy2009 References: <049a01ca11b8$be2c0110$1022a8c0@ipkgatersleben.de> <320fb6e00907310218m549294cdk8187cb310332ce11@mail.gmail.com> <001201ca11c0$e5dc1800$1022a8c0@ipkgatersleben.de> <320fb6e00907310308s1d7ab530vef7f531384ecd39e@mail.gmail.com> Message-ID: <002f01ca11ca$f928d830$1022a8c0@ipkgatersleben.de> Hello! I just wanted to say that the EuroSciPy2009 was a great success and I also got a lot of positive feedback for my talk. I would like to thank all Biopython developers for providing a great library! For anyone who is interested and would like to see for what I use Biopython (and why it's makes my life in the lab easier), here are the links of the abstract and slides: http://www.euroscipy.org/presentations/abstracts/abstract_lueck.html http://www.euroscipy.org/presentations/slides/slides_lueck.pdf Would be nice to see some of you next year! Kind regards, Stefanie From stran104 at chapman.edu Wed Jul 1 03:01:14 2009 From: stran104 at chapman.edu (Matthew Strand) Date: Tue, 30 Jun 2009 20:01:14 -0700 Subject: [Biopython] Dealing with Non-RefSeq IDs / InParanoid In-Reply-To: <320fb6e00906240224x40c62dd5vaeb748308f9843f4@mail.gmail.com> References: <2a63cc350906201854v7de4e7n9991386ce9339305@mail.gmail.com> <320fb6e00906210334j4c9318adhdb5945033acb61fe@mail.gmail.com> <2a63cc350906231653m4ce88a69o351e377931401659@mail.gmail.com> <2a63cc350906231658k5beedfabu5915cba59c66d45f@mail.gmail.com> <320fb6e00906240224x40c62dd5vaeb748308f9843f4@mail.gmail.com> Message-ID: <2a63cc350906302001j62ece72aicdd619b9e8e040d6@mail.gmail.com> For the benefit of future users who find this thread through a search, I would like to share how to retreive a sequence from NCBI given a non-NCBI protein ID (or other ID). This was question 3 in my original message. Suppose you have a non-NCBI protein ID, say CE23997 (from WormBase) and you want to retrieve the sequence from NCBI. You can use Bio.Entrez.esearch(db='protein', term='CE23997') to get a list of NCBI GIs that refrence this identifer. 
In this case there is only one (17554770). Then you can get the sequence using Entrez.efetch(db="protein", id='17554770', rettype="fasta"). This may be obvious to some, but it was not to me; primarily because I was unaware of the esearch functionality. -- Matthew Strand From idoerg at gmail.com Wed Jul 1 03:53:16 2009 From: idoerg at gmail.com (Iddo Friedberg) Date: Tue, 30 Jun 2009 20:53:16 -0700 Subject: [Biopython] Dealing with Non-RefSeq IDs / InParanoid In-Reply-To: References: <2a63cc350906201854v7de4e7n9991386ce9339305@mail.gmail.com> <320fb6e00906210334j4c9318adhdb5945033acb61fe@mail.gmail.com> <2a63cc350906231653m4ce88a69o351e377931401659@mail.gmail.com> <2a63cc350906231658k5beedfabu5915cba59c66d45f@mail.gmail.com> <320fb6e00906240224x40c62dd5vaeb748308f9843f4@mail.gmail.com> <2a63cc350906302001j62ece72aicdd619b9e8e040d6@mail.gmail.com> Message-ID: Thanks. There is a wiki-based cookbook in the biopython site. Would you like to put it up there? Iddo Friedberg http://iddo-friedberg.net/contact.html On Jun 30, 2009 8:02 PM, "Matthew Strand" wrote: For the benefit of future users who find this thread through a search, I would like to share how to retrieve a sequence from NCBI given a non-NCBI protein ID (or other ID). This was question 3 in my original message. Suppose you have a non-NCBI protein ID, say CE23997 (from WormBase) and you want to retrieve the sequence from NCBI. You can use Bio.Entrez.esearch(db='protein', term='CE23997') to get a list of NCBI GIs that reference this identifier. In this case there is only one (17554770). Then you can get the sequence using Entrez.efetch(db="protein", id='17554770', rettype="fasta"). This may be obvious to some, but it was not to me; primarily because I was unaware of the esearch functionality. -- Matthew Strand _______________________________________________ Biopython mailing list - Biopython at lists.open-bio.... From winda002 at student.otago.ac.nz Wed Jul 1 06:22:08 2009 From: winda002 at student.otago.ac.nz (David WInter) Date: Wed, 01 Jul 2009 18:22:08 +1200 Subject: [Biopython] Bio.Sequencing.Ace In-Reply-To: <761477.83949.qm@web65501.mail.ac4.yahoo.com> References: <761477.83949.qm@web65501.mail.ac4.yahoo.com> Message-ID: <4A4B0090.70903@student.otago.ac.nz> Fungazid wrote: > David hi, > > Many many thanks for the diagram. > I'm not sure I understand the differences between contig.af[readn].padded_start, and contig.bs[readn].padded_start, and other unknown parameters. I'll try to compare to the Ace format > > Avi > Hi again Avi, It took me a while to get to grips with the difference, the 'bs' list is a mapping of the contig's consensus to the particular read that was used as the 'base segment' in that region.
If you have a monospaced font in your email client this might help: consensus |===================================| +---read3---x +---read5--x +--read1---x (which would give a contig.bs list with 3 bs instances) I'm not sure that this is particularly important information for a 454 assembly ;) I've updated the examples on the wiki page a little, if you find anything else that you think should be there feel free to add to it Cheers, David From p.j.a.cock at googlemail.com Wed Jul 1 07:44:12 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 1 Jul 2009 08:44:12 +0100 Subject: [Biopython] [Bioperl-l] Next-gen modules In-Reply-To: <1d06cd5d0906300428x59c004f1h200bfe3c23ed769@mail.gmail.com> References: <6FA80489-D779-4247-B9EE-BB08ECEA0F8A@ucl.ac.uk> <92C15E3391F64BAF801754E924122540@NewLife> <200906170927.13273.tristan.lefebure@gmail.com> <1d06cd5d0906300428x59c004f1h200bfe3c23ed769@mail.gmail.com> Message-ID: <320fb6e00907010044v38480030hd5cf89ad149cf738@mail.gmail.com> Hi all (BioPerl and Biopython), This is a continuation of a long thread on the BioPerl mailing list, which I have now CC'd to the Biopython mailing list. See: http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030265.html On this thread we have been discussing next gen sequencing tools and co-coordinating things like consistent file format naming between Biopython, BioPerl and EMBOSS. I've been chatting to Peter Rice (EMBOSS) while at BOSC/ISMB 2009, and he will look into setting up a cross project mailing list for this kind of discussion in future. In the mean time, my replies to Giles below cover both BioPerl and Biopython (and EMBOSS). Giles' original email is here: http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030398.html Peter On 6/30/09, Giles Weaver wrote: > > I'm developing a transcriptomics database for use with next-gen data, and > have found processing the raw data to be a big hurdle. > > I'm a bit late in responding to this thread, so most issues have already > been discussed. One thing that hasn't been mentioned is removal of adapters > from raw Illumina sequence. This is a PITA, and I'm not aware of any well > developed and documented open source software for removal of adapters > (and poor quality sequence) from Illumina reads. > > My current Illumina sequence processing pipeline is an unholy mix of > biopython, bioperl, pure perl, emboss and bowtie. Biopython for converting > the Illumina fastq to Sanger fastq, bioperl to read the quality values, > pure perl to trim the poor quality sequence from each read, and bioperl > with emboss to remove the adapter sequence. I'm aware that the pipeline > contains bugs and would like to simplify it, but at least it does work... > > Ideally I'd like to replace as much of the pipeline as possible with > bioperl/bioperl-run, but this isn't currently possible due to both a lack > of features and poor performance. I'm sure the features will come with > time, but the performance is more of a concern to me. .. I gather you would rather work with (Bio)Perl, but since you are already using Biopython to do the FASTQ conversion, you could also use it for more of your pipe line. Our tutorial includes examples of simple FASTQ quality filtering, and trimming of primer sequences (something like this might be helpful for removing adaptors). 
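For instance, a simple whole-read quality filter along the lines of the tutorial examples could look like this (the cut-off and file names are only illustrative):

-----------------------------------------
from Bio import SeqIO

# Keep reads whose lowest PHRED quality is at least 20 (illustrative cut-off)
records = (rec for rec in SeqIO.parse(open("reads.fastq"), "fastq")
           if min(rec.letter_annotations["phred_quality"]) >= 20)
count = SeqIO.write(records, open("filtered.fastq", "w"), "fastq")
print "Kept %i reads" % count
-----------------------------------------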
See: http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf Alternatively, with the new release of EMBOSS this July, you will also be able to do the Illumina FASTQ to Sanger standard FASTQ with EMBOSS, and I'm sure BioPerl will offer this soon too. > Regarding trimming bad quality bases (see comments from > Tristan Lefebure) from Solexa/Illumina reads, I did find a mixed > pure/bioperl solution to be much faster than a primarily bioperl > based implementation. I found Bio::Seq->subseq(a,b) and > Bio::Seq->subqual(a,b) to be far too slow. My current code trims > ~1300 sequences/second, including unzipping the raw data and > converting it to sanger fastq with biopython. Processing an entire > sequencing run with the whole pipeline takes in the region of 6-12h. There are several ways of doing quality trimming, and it would make an excellent cookbook example (both for BioPerl and Biopython). Could you go into a bit more detail about your trimming algorithm? e.g. Do you just trim any bases on the right below a certain threshold, perhaps with a minimum length to retain the trimmed read afterwards? > Hope this looooong post was of interest to someone! I was interested at least ;) Peter From stran104 at chapman.edu Wed Jul 1 10:18:42 2009 From: stran104 at chapman.edu (Matthew Strand) Date: Wed, 1 Jul 2009 03:18:42 -0700 Subject: [Biopython] Dealing with Non-RefSeq IDs / InParanoid In-Reply-To: References: <2a63cc350906201854v7de4e7n9991386ce9339305@mail.gmail.com> <320fb6e00906210334j4c9318adhdb5945033acb61fe@mail.gmail.com> <2a63cc350906231653m4ce88a69o351e377931401659@mail.gmail.com> <2a63cc350906231658k5beedfabu5915cba59c66d45f@mail.gmail.com> <320fb6e00906240224x40c62dd5vaeb748308f9843f4@mail.gmail.com> <2a63cc350906302001j62ece72aicdd619b9e8e040d6@mail.gmail.com> Message-ID: <2a63cc350907010318v597f0649u78168decde54d710@mail.gmail.com> Sure, I can create a page tomorrow when I get into the office. Perhaps "Retrieving Sequences Based on ID" would be appropriate. Alternative suggestions are welcome. On Tue, Jun 30, 2009 at 8:53 PM, Iddo Friedberg wrote: > Thanks. There is a wiki-based cookbook in the biopython site. Would you > like to put it up there? > > Iddo Friedberg > http://iddo-friedberg.net/contact.html > > On Jun 30, 2009 8:02 PM, "Matthew Strand" wrote: > > For the benefit of future users who find this thread through a search, I > would like to share how to retreive a sequence from NCBI given a non-NCBI > protein ID (or other ID). This was question 3 in my original message. > > Suppose you have a non-NCBI protein ID, say CE23997 (from WormBase) and you > want to retrieve the sequence from NCBI. > > You can use Bio.Entrez.esearch(db='protein', term='CE23997') to get a list > of NCBI GIs that refrence this identifer. In this case there is only one > (17554770). > > Then you can get the sequence using Entrez.efetch(db="protein", > id='17554770', rettype="fasta"). > > This may be obvious to some, but it was not to me; primarially because I > was > unaware of the esearch functionality. > > -- > Matthew Strand > > _______________________________________________ Biopython mailing list - > Biopython at lists.open-bio.... 
> > -- Matthew Strand From cjfields at illinois.edu Wed Jul 1 12:35:14 2009 From: cjfields at illinois.edu (Chris Fields) Date: Wed, 1 Jul 2009 07:35:14 -0500 Subject: [Biopython] [Bioperl-l] Next-gen modules In-Reply-To: <320fb6e00907010044v38480030hd5cf89ad149cf738@mail.gmail.com> References: <6FA80489-D779-4247-B9EE-BB08ECEA0F8A@ucl.ac.uk> <92C15E3391F64BAF801754E924122540@NewLife> <200906170927.13273.tristan.lefebure@gmail.com> <1d06cd5d0906300428x59c004f1h200bfe3c23ed769@mail.gmail.com> <320fb6e00907010044v38480030hd5cf89ad149cf738@mail.gmail.com> Message-ID: <30B8D613-EDD6-4F2F-9B29-C34B8F60CB2E@illinois.edu> Peter, I just committed a fix to FASTQ parsing last night to support read/ write for Sanger/Solexa/Illumina following the biopython convention; the only thing needed is more extensive testing for the quality scores. There are a few other oddities with it I intend to address soon, but it appears to be working. The Seq instance iterator actually calls a raw data iterator (hash refs of named arguments to the class constructor). That should act as a decent filtering step if needed. We have automated EMBOSS wrapping but I'm not sure how intuitive it is; we can probably reconfigure some of that. chris On Jul 1, 2009, at 2:44 AM, Peter Cock wrote: > Hi all (BioPerl and Biopython), > > This is a continuation of a long thread on the BioPerl mailing > list, which I have now CC'd to the Biopython mailing list. See: > http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030265.html > > On this thread we have been discussing next gen sequencing > tools and co-coordinating things like consistent file format > naming between Biopython, BioPerl and EMBOSS. I've been > chatting to Peter Rice (EMBOSS) while at BOSC/ISMB 2009, > and he will look into setting up a cross project mailing list for > this kind of discussion in future. > > In the mean time, my replies to Giles below cover both BioPerl > and Biopython (and EMBOSS). Giles' original email is here: > http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030398.html > > Peter > > On 6/30/09, Giles Weaver wrote: >> >> I'm developing a transcriptomics database for use with next-gen >> data, and >> have found processing the raw data to be a big hurdle. >> >> I'm a bit late in responding to this thread, so most issues have >> already >> been discussed. One thing that hasn't been mentioned is removal of >> adapters >> from raw Illumina sequence. This is a PITA, and I'm not aware of >> any well >> developed and documented open source software for removal of adapters >> (and poor quality sequence) from Illumina reads. >> >> My current Illumina sequence processing pipeline is an unholy mix of >> biopython, bioperl, pure perl, emboss and bowtie. Biopython for >> converting >> the Illumina fastq to Sanger fastq, bioperl to read the quality >> values, >> pure perl to trim the poor quality sequence from each read, and >> bioperl >> with emboss to remove the adapter sequence. I'm aware that the >> pipeline >> contains bugs and would like to simplify it, but at least it does >> work... >> >> Ideally I'd like to replace as much of the pipeline as possible with >> bioperl/bioperl-run, but this isn't currently possible due to both >> a lack >> of features and poor performance. I'm sure the features will come >> with >> time, but the performance is more of a concern to me. .. > > I gather you would rather work with (Bio)Perl, but since you are > already using Biopython to do the FASTQ conversion, you could > also use it for more of your pipe line. 
Our tutorial includes examples > of simple FASTQ quality filtering, and trimming of primer sequences > (something like this might be helpful for removing adaptors). See: > http://biopython.org/DIST/docs/tutorial/Tutorial.html > http://biopython.org/DIST/docs/tutorial/Tutorial.pdf > > Alternatively, with the new release of EMBOSS this July, you will > also be able to do the Illumina FASTQ to Sanger standard FASTQ > with EMBOSS, and I'm sure BioPerl will offer this soon too. > >> Regarding trimming bad quality bases (see comments from >> Tristan Lefebure) from Solexa/Illumina reads, I did find a mixed >> pure/bioperl solution to be much faster than a primarily bioperl >> based implementation. I found Bio::Seq->subseq(a,b) and >> Bio::Seq->subqual(a,b) to be far too slow. My current code trims >> ~1300 sequences/second, including unzipping the raw data and >> converting it to sanger fastq with biopython. Processing an entire >> sequencing run with the whole pipeline takes in the region of 6-12h. > > There are several ways of doing quality trimming, and it would > make an excellent cookbook example (both for BioPerl and > Biopython). > > Could you go into a bit more detail about your trimming > algorithm? e.g. Do you just trim any bases on the right below > a certain threshold, perhaps with a minimum length to retain > the trimmed read afterwards? > >> Hope this looooong post was of interest to someone! > > I was interested at least ;) > > Peter > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From giles.weaver at googlemail.com Wed Jul 1 16:27:22 2009 From: giles.weaver at googlemail.com (Giles Weaver) Date: Wed, 1 Jul 2009 17:27:22 +0100 Subject: [Biopython] [Bioperl-l] Next-gen modules In-Reply-To: <30B8D613-EDD6-4F2F-9B29-C34B8F60CB2E@illinois.edu> References: <6FA80489-D779-4247-B9EE-BB08ECEA0F8A@ucl.ac.uk> <92C15E3391F64BAF801754E924122540@NewLife> <200906170927.13273.tristan.lefebure@gmail.com> <1d06cd5d0906300428x59c004f1h200bfe3c23ed769@mail.gmail.com> <320fb6e00907010044v38480030hd5cf89ad149cf738@mail.gmail.com> <30B8D613-EDD6-4F2F-9B29-C34B8F60CB2E@illinois.edu> Message-ID: <1d06cd5d0907010927p4aad2a7re7ce1e65245e67de@mail.gmail.com> Peter, the trimming algorithm I use employs a sliding window, as follows: - For each sequence position calculate the mean phred quality score for a window around that position. - Record whether the mean score is above or below a threshold as an array of zeros and ones. - Use a regular expression on the joined array to find the start and end of the good quality sequence(s). - Extract the quality sequence(s) and replace any bases below the quality threshold with N. - Trim any Ns from the ends. A refinement would be to weight the scores from positions in the window, but this could give a performance hit, and the method seems to work well enough as is. Chris, thanks for committing the fix, I'll give bioperl illumina fastq parsing a workout soon. Peter, as much as I'd love to help out with biopython, I'm under too much time pressure right now! Jonathan, some of the Illumina sequencing adapters are listed at http://intron.ccam.uchc.edu/groups/tgcore/wiki/013c0/Solexa_Library_Primer_Sequences.htmland http://seqanswers.com/forums/showthread.php?t=198 Adapter sequence typically appears towards the end of the read, though the latter part of it is often misread as the sequencing quality drops off. 
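Going back to the trimming step for a moment, the windowed-mean and regular-expression idea described above could be sketched in Python roughly as follows (window size and threshold are illustrative, and this is not the production code):

-----------------------------------------
import re

def good_regions(quals, window=5, threshold=20):
    """Return (start, end) ranges where the windowed mean quality passes the cut-off."""
    half = window // 2
    flags = []
    for i in range(len(quals)):
        win = quals[max(0, i - half):i + half + 1]
        flags.append("1" if sum(win) / float(len(win)) >= threshold else "0")
    # Runs of "1" mark the good quality stretch(es)
    return [m.span() for m in re.finditer("1+", "".join(flags))]
-----------------------------------------

Each (start, end) pair can then be used to slice the read and its qualities, with a minimum-length check before keeping the piece.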
I abuse needle (EMBOSS) into aligning the adapter sequence with each read. I then use Bio::AlignIO, Bio::Range and a custom scoring scheme to identify real alignments and trim the sequence. This is not the ideal way of doing things, but it's fast enough, and does seem to work. The adapter sequence shouldn't be gapped, so I'm sure there is a lot of scope for optimising the adapter removal. I'll happily share some code once I've got it to the stage where I'm not embarrassed by it! Giles 2009/7/1 Chris Fields > Peter, > > I just committed a fix to FASTQ parsing last night to support read/write > for Sanger/Solexa/Illumina following the biopython convention; the only > thing needed is more extensive testing for the quality scores. There are a > few other oddities with it I intend to address soon, but it appears to be > working. > > The Seq instance iterator actually calls a raw data iterator (hash refs of > named arguments to the class constructor). That should act as a decent > filtering step if needed. > > We have automated EMBOSS wrapping but I'm not sure how intuitive it is; we > can probably reconfigure some of that. > > chris > > > On Jul 1, 2009, at 2:44 AM, Peter Cock wrote: > > Hi all (BioPerl and Biopython), >> >> This is a continuation of a long thread on the BioPerl mailing >> list, which I have now CC'd to the Biopython mailing list. See: >> http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030265.html >> >> On this thread we have been discussing next gen sequencing >> tools and co-coordinating things like consistent file format >> naming between Biopython, BioPerl and EMBOSS. I've been >> chatting to Peter Rice (EMBOSS) while at BOSC/ISMB 2009, >> and he will look into setting up a cross project mailing list for >> this kind of discussion in future. >> >> In the mean time, my replies to Giles below cover both BioPerl >> and Biopython (and EMBOSS). Giles' original email is here: >> http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030398.html >> >> Peter >> >> On 6/30/09, Giles Weaver wrote: >> >>> >>> I'm developing a transcriptomics database for use with next-gen data, and >>> have found processing the raw data to be a big hurdle. >>> >>> I'm a bit late in responding to this thread, so most issues have already >>> been discussed. One thing that hasn't been mentioned is removal of >>> adapters >>> from raw Illumina sequence. This is a PITA, and I'm not aware of any well >>> developed and documented open source software for removal of adapters >>> (and poor quality sequence) from Illumina reads. >>> >>> My current Illumina sequence processing pipeline is an unholy mix of >>> biopython, bioperl, pure perl, emboss and bowtie. Biopython for >>> converting >>> the Illumina fastq to Sanger fastq, bioperl to read the quality values, >>> pure perl to trim the poor quality sequence from each read, and bioperl >>> with emboss to remove the adapter sequence. I'm aware that the pipeline >>> contains bugs and would like to simplify it, but at least it does work... >>> >>> Ideally I'd like to replace as much of the pipeline as possible with >>> bioperl/bioperl-run, but this isn't currently possible due to both a lack >>> of features and poor performance. I'm sure the features will come with >>> time, but the performance is more of a concern to me. .. >>> >> >> I gather you would rather work with (Bio)Perl, but since you are >> already using Biopython to do the FASTQ conversion, you could >> also use it for more of your pipe line. 
Our tutorial includes examples >> of simple FASTQ quality filtering, and trimming of primer sequences >> (something like this might be helpful for removing adaptors). See: >> http://biopython.org/DIST/docs/tutorial/Tutorial.html >> http://biopython.org/DIST/docs/tutorial/Tutorial.pdf >> >> Alternatively, with the new release of EMBOSS this July, you will >> also be able to do the Illumina FASTQ to Sanger standard FASTQ >> with EMBOSS, and I'm sure BioPerl will offer this soon too. >> >> Regarding trimming bad quality bases (see comments from >>> Tristan Lefebure) from Solexa/Illumina reads, I did find a mixed >>> pure/bioperl solution to be much faster than a primarily bioperl >>> based implementation. I found Bio::Seq->subseq(a,b) and >>> Bio::Seq->subqual(a,b) to be far too slow. My current code trims >>> ~1300 sequences/second, including unzipping the raw data and >>> converting it to sanger fastq with biopython. Processing an entire >>> sequencing run with the whole pipeline takes in the region of 6-12h. >>> >> >> There are several ways of doing quality trimming, and it would >> make an excellent cookbook example (both for BioPerl and >> Biopython). >> >> Could you go into a bit more detail about your trimming >> algorithm? e.g. Do you just trim any bases on the right below >> a certain threshold, perhaps with a minimum length to retain >> the trimmed read afterwards? >> >> Hope this looooong post was of interest to someone! >>> >> >> I was interested at least ;) >> >> Peter >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > From cjfields at illinois.edu Wed Jul 1 16:46:49 2009 From: cjfields at illinois.edu (Chris Fields) Date: Wed, 1 Jul 2009 11:46:49 -0500 Subject: [Biopython] [Bioperl-l] Next-gen modules In-Reply-To: <1d06cd5d0907010927p4aad2a7re7ce1e65245e67de@mail.gmail.com> References: <6FA80489-D779-4247-B9EE-BB08ECEA0F8A@ucl.ac.uk> <92C15E3391F64BAF801754E924122540@NewLife> <200906170927.13273.tristan.lefebure@gmail.com> <1d06cd5d0906300428x59c004f1h200bfe3c23ed769@mail.gmail.com> <320fb6e00907010044v38480030hd5cf89ad149cf738@mail.gmail.com> <30B8D613-EDD6-4F2F-9B29-C34B8F60CB2E@illinois.edu> <1d06cd5d0907010927p4aad2a7re7ce1e65245e67de@mail.gmail.com> Message-ID: <6CAF4023-7D04-4B56-839F-E587A00DEEEA@illinois.edu> On Jul 1, 2009, at 11:27 AM, Giles Weaver wrote: ... > Peter, the trimming algorithm I use employs a sliding window, as > follows: > > - For each sequence position calculate the mean phred quality > score for a > window around that position. > - Record whether the mean score is above or below a threshold as > an array > of zeros and ones. > - Use a regular expression on the joined array to find the start > and end > of the good quality sequence(s). > - Extract the quality sequence(s) and replace any bases below the > quality > threshold with N. > - Trim any Ns from the ends. > > A refinement would be to weight the scores from positions in the > window, but > this could give a performance hit, and the method seems to work well > enough > as is. > > Chris, thanks for committing the fix, I'll give bioperl illumina fastq > parsing a workout soon. Peter, as much as I'd love to help out with > biopython, I'm under too much time pressure right now! Just let me know if the qual values match up with what is expected. You can also iterate through the data with hashrefs using next_dataset (faster than objects). 
This is from the fastq tests in core: ----------------------------------------- $in_qual = Bio::SeqIO->new(-file => test_input_file('fastq','test3_illumina.fastq'), -variant => 'illumina', -format => 'fastq'); $qual = $in_qual->next_dataset(); isa_ok($qual, 'HASH'); is($qual->{-seq}, 'GTTAGCTCCCACCTTAAGATGTTTA'); is($qual->{-raw_quality}, 'SXXTXXXXXXXXXTTSUXSSXKTMQ'); is($qual->{-id}, 'FC12044_91407_8_200_406_24'); is($qual->{-desc}, ''); is($qual->{-descriptor}, 'FC12044_91407_8_200_406_24'); is(join(',',@{$qual->{-qual}}[0..10]), '19,24,24,20,24,24,24,24,24,24,24'); ----------------------------------------- So one could check those values directly and then filter them through as needed directly into Bio::Seq::Quality if necessary (note some of the key values are constructor args): my $qualobj = Bio::Seq::Quality->new(%$qual); chris From p.j.a.cock at googlemail.com Thu Jul 2 07:20:07 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 2 Jul 2009 08:20:07 +0100 Subject: [Biopython] [Bioperl-l] Next-gen modules In-Reply-To: <1d06cd5d0907010927p4aad2a7re7ce1e65245e67de@mail.gmail.com> References: <6FA80489-D779-4247-B9EE-BB08ECEA0F8A@ucl.ac.uk> <92C15E3391F64BAF801754E924122540@NewLife> <200906170927.13273.tristan.lefebure@gmail.com> <1d06cd5d0906300428x59c004f1h200bfe3c23ed769@mail.gmail.com> <320fb6e00907010044v38480030hd5cf89ad149cf738@mail.gmail.com> <30B8D613-EDD6-4F2F-9B29-C34B8F60CB2E@illinois.edu> <1d06cd5d0907010927p4aad2a7re7ce1e65245e67de@mail.gmail.com> Message-ID: <320fb6e00907020020o2fa686d2yab6f185785ad8a08@mail.gmail.com> On 7/1/09, Giles Weaver wrote: > Peter, the trimming algorithm I use employs a sliding window, as follows: > > - For each sequence position calculate the mean phred quality score for a > window around that position. > - Record whether the mean score is above or below a threshold as an array > of zeros and ones. > - Use a regular expression on the joined array to find the start and end > of the good quality sequence(s). > - Extract the quality sequence(s) and replace any bases below the quality > threshold with N. > - Trim any Ns from the ends. > > A refinement would be to weight the scores from positions in the window, but > this could give a performance hit, and the method seems to work well enough > as is. Thanks for the details - that is a bit more complex that what I had been thinking. Do you have any favoured window size and quality threshold, or does this really depend on the data itself? Also, if you find a sequence read that goes "good - poor - good" for example, do you extract the two good regions as two sub reads (presumably with a minimum length)? This may be silly for Illumina where the reads are very short, but might make sense for Roche 454. > Chris, thanks for committing the fix, I'll give bioperl illumina fastq > parsing a workout soon. Peter, as much as I'd love to help out with > biopython, I'm under too much time pressure right now! Even use cases are useful - so thank you. > Jonathan, some of the Illumina sequencing adapters are listed at > http://intron.ccam.uchc.edu/groups/tgcore/wiki/013c0/Solexa_Library_Primer_Sequences.htmland > http://seqanswers.com/forums/showthread.php?t=198 > Adapter sequence typically appears towards the end of the read, though the > latter part of it is often misread as the sequencing quality drops off. > I abuse needle (EMBOSS) into aligning the adapter sequence with each read. I > then use Bio::AlignIO, Bio::Range and a custom scoring scheme to identify > real alignments and trim the sequence. 
This is not the ideal way of doing > things, but it's fast enough, and does seem to work. The adapter sequence > shouldn't be gapped, so I'm sure there is a lot of scope for optimising the > adapter removal. > > I'll happily share some code once I've got it to the stage where I'm not > embarrassed by it! > > Giles Cheers, Peter From vincent.rouilly03 at imperial.ac.uk Thu Jul 2 13:40:46 2009 From: vincent.rouilly03 at imperial.ac.uk (Rouilly, Vincent) Date: Thu, 2 Jul 2009 14:40:46 +0100 Subject: [Biopython] Distributed Annotation System ( DAS ) and BioPython Message-ID: Hi, I have question about Distributed Annotation System (DAS). What is the current best practice to load a SeqRecord from a DAS description ? ------- I found that this topic has been discussed in the past here (see below), but I couldn't find the up-to-date method to deal with DAS in BioPython. [2003] : Draft PyDAS parser from Andrew Dalke: http://portal.open-bio.org/pipermail/biopython/2003-October/001670.html Andrew hints at a DAS2 project that might produce a better python tool. [2006]: Ann Loraine uses a SAX perser to deal with DAS: http://www.bioinformatics.org/pipermail/bbb/2006-December/003694.html [2007]: PPT Presentation from Sanger Feb 2007: "DAS/2: Next generation Distributed Annotation System". Some python code used in the DAS/2 Validation Suite is mentioned. http://sourceforge.net/projects/dasypus/ Project where Andrew Dalke is involved, but it seems inactive since 2006. ------- Sorry if I have missed the post where this issue was last discussed, best wishes, Vincent. From giles.weaver at googlemail.com Fri Jul 3 15:35:00 2009 From: giles.weaver at googlemail.com (Giles Weaver) Date: Fri, 3 Jul 2009 16:35:00 +0100 Subject: [Biopython] [Bioperl-l] Next-gen modules In-Reply-To: <320fb6e00907020020o2fa686d2yab6f185785ad8a08@mail.gmail.com> References: <6FA80489-D779-4247-B9EE-BB08ECEA0F8A@ucl.ac.uk> <92C15E3391F64BAF801754E924122540@NewLife> <200906170927.13273.tristan.lefebure@gmail.com> <1d06cd5d0906300428x59c004f1h200bfe3c23ed769@mail.gmail.com> <320fb6e00907010044v38480030hd5cf89ad149cf738@mail.gmail.com> <30B8D613-EDD6-4F2F-9B29-C34B8F60CB2E@illinois.edu> <1d06cd5d0907010927p4aad2a7re7ce1e65245e67de@mail.gmail.com> <320fb6e00907020020o2fa686d2yab6f185785ad8a08@mail.gmail.com> Message-ID: <1d06cd5d0907030835w14407249l5b47db8893820816@mail.gmail.com> Regarding the trimming algorithm, I've been using a window size of 5, a minimum score of 20 and a minimum length of 15 with the Illumina data. In the past I have used a similar algorithm with a larger window size and much longer minimum length with sequence from ABI 3XXX machines. I imagine that the ideal parameters for ABI SOLiD and Roche 454 would likely be similar to those for Illumina and Sanger sequencing respectively. Window size doesn't appear to affect performance much, if at all. For sequences with multiple good regions, I do extract all good regions. Even with the Illumina data there are sometimes two good regions, but usually the second is adapter or junk and gets filtered out later. I haven't seen quality data from a 454 machine recently, and would be interested to know if multiple good regions are commonplace in 454 data. Can anyone with access to 454 data comment on this? Giles 2009/7/2 Peter Cock > On 7/1/09, Giles Weaver wrote: > > Peter, the trimming algorithm I use employs a sliding window, as follows: > > > > - For each sequence position calculate the mean phred quality score > for a > > window around that position. 
> > - Record whether the mean score is above or below a threshold as an > array > > of zeros and ones. > > - Use a regular expression on the joined array to find the start and > end > > of the good quality sequence(s). > > - Extract the quality sequence(s) and replace any bases below the > quality > > threshold with N. > > - Trim any Ns from the ends. > > > > A refinement would be to weight the scores from positions in the window, > but > > this could give a performance hit, and the method seems to work well > enough > > as is. > > Thanks for the details - that is a bit more complex that what I had been > thinking. Do you have any favoured window size and quality threshold, > or does this really depend on the data itself? > > Also, if you find a sequence read that goes "good - poor - good" for > example, do you extract the two good regions as two sub reads > (presumably with a minimum length)? This may be silly for Illumina > where the reads are very short, but might make sense for Roche 454. > > > Chris, thanks for committing the fix, I'll give bioperl illumina fastq > > parsing a workout soon. Peter, as much as I'd love to help out with > > biopython, I'm under too much time pressure right now! > > Even use cases are useful - so thank you. > > > Jonathan, some of the Illumina sequencing adapters are listed at > > > http://intron.ccam.uchc.edu/groups/tgcore/wiki/013c0/Solexa_Library_Primer_Sequences.htmland > > http://seqanswers.com/forums/showthread.php?t=198 > > Adapter sequence typically appears towards the end of the read, though > the > > latter part of it is often misread as the sequencing quality drops off. > > I abuse needle (EMBOSS) into aligning the adapter sequence with each > read. I > > then use Bio::AlignIO, Bio::Range and a custom scoring scheme to identify > > real alignments and trim the sequence. This is not the ideal way of doing > > things, but it's fast enough, and does seem to work. The adapter sequence > > shouldn't be gapped, so I'm sure there is a lot of scope for optimising > the > > adapter removal. > > > > I'll happily share some code once I've got it to the stage where I'm not > > embarrassed by it! > > > > Giles > > Cheers, > > Peter > From biopython at maubp.freeserve.co.uk Sat Jul 4 13:59:31 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 4 Jul 2009 14:59:31 +0100 Subject: [Biopython] Distributed Annotation System ( DAS ) and BioPython In-Reply-To: References: Message-ID: <320fb6e00907040659ua83a793j94c4920608b0ad28@mail.gmail.com> On Thu, Jul 2, 2009 at 2:40 PM, Rouilly, Vincent wrote: > Hi, > > I have question about Distributed Annotation System (DAS). > What is the current best practice to load a SeqRecord from > a DAS description ? I don't know if anyone has done that. We don't have anything in Biopython for DAS right now (that I know of). Hopefully Andrew Dalke (CC'd) can give us a quick report on the status of his code and the DAS/2 project. Could you give a specific example of a DAS service you'd like to use to get a sequence record from? On the bright side, when chatting to Peter Rice from EMBOSS at BOSC/ISMB 2009, he said they had been doing a lot of work with DAS, so it sounds like a lot of the problems Andrew was talking about (like invalid XML files) about may have been addressed. I'm not sure if the new version of EMBOSS due this month will include a DAS client of some kind - that would be worth checking out. P.S. Have you signed up to the DAS mailing list? 
http://lists.open-bio.org/mailman/listinfo/das Peter From fungazid at yahoo.com Sun Jul 5 22:57:08 2009 From: fungazid at yahoo.com (Fungazid) Date: Sun, 5 Jul 2009 15:57:08 -0700 (PDT) Subject: [Biopython] suggestion for a little change in the ACE cookbook Message-ID: <204841.83488.qm@web65510.mail.ac4.yahoo.com> Hi, About the cookbook here http://biopython.org/wiki/ACE_contig_to_alignment instead of: def cut_ends(read, start, end): return (start-1) * '-' + read[start-1:end] + (end +1) * '-' I think it is better to write: def cut_ends(self,read, start, end): return (start-1) * 'x' + read[start-1:end-1] + (len(read)-end) * 'x' The 2 changes are: 1) correcting the coordinates of the clipped 5' region 2) adding 'x' instead of '-' to separate the clipped region from the gaps From biopython.chen at gmail.com Mon Jul 6 03:27:15 2009 From: biopython.chen at gmail.com (chen Ku) Date: Sun, 5 Jul 2009 20:27:15 -0700 Subject: [Biopython] how to retrieve pdb id of desired keyword Message-ID: <4c2163890907052027s3a2843b4w3ebe6ee4ef7a5472@mail.gmail.com> Dear all, I seek your help again in using Bio.PDBList. As I understood from Bio.PDBList we can only download whole PDB by ( *download_entire_pdb(self, listfile=None) * Actually i want to only fetch the pdb id which are only transcription factor binding to DNA. I think to download all PDB file will be time taking so without mising anydata which is the best way.If you can demonstrate me using PDBList method for this then I can start with next methods and try by my own. Any suggestion or one demonstaration using PDBList will be of great help. Regards Chen From oda.gumail at gmail.com Mon Jul 6 15:19:56 2009 From: oda.gumail at gmail.com (Ogan ABAAN) Date: Mon, 06 Jul 2009 11:19:56 -0400 Subject: [Biopython] retrieve gene name and exon Message-ID: <4A52161C.8070909@gmail.com> Hi all, I have a number of genomic position from the human genome and I want to know which genes these positions belong to. I also would like to know which exon (if they are from a gene, or even intron if possible) the location is on. For example, I want to put in chr1:10,000,000 and would like to see an output as such geneX-exon5 or something like that. I know ensemble stores that information but I couldn't find the proper tool in Biopython, so I would apritiate if anyone could direct me to one. Thank you very much Ogan From biopython at maubp.freeserve.co.uk Mon Jul 6 15:44:28 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 6 Jul 2009 16:44:28 +0100 Subject: [Biopython] retrieve gene name and exon In-Reply-To: <4A52161C.8070909@gmail.com> References: <4A52161C.8070909@gmail.com> Message-ID: <320fb6e00907060844w3de4d860tf127cd6fc3f32eef@mail.gmail.com> On Mon, Jul 6, 2009 at 4:19 PM, Ogan ABAAN wrote: > Hi all, > > I have a number of genomic position from the human genome and I want to know > which genes these positions belong to. I also would like to know which exon > (if they are from a gene, or even intron if possible) the location is on. > For example, I want to put in chr1:10,000,000 and would like to see an > output as such geneX-exon5 or something like that. I know ensemble stores > that information but I couldn't find the proper tool in Biopython, so I > would apritiate if anyone could direct me to one. 
Thank you very much > > Ogan This thread was on a similar topic: http://lists.open-bio.org/pipermail/biopython/2009-June/005193.html Given the GenBank file (or in theory an EMBL file or something else like a GFF file) for a chromosome, and a position within it, how could you determine which feature(s) a given position was within. Note that there are already three different human genomes available in GenBank, so as mentioned in the earlier thread, you need to know which human genome your location refers to - and work from the appropriate GenBank/EMBL/GFF/other data file. Peter P.S. How many of these locations do you have? From oda.gumail at gmail.com Mon Jul 6 16:58:53 2009 From: oda.gumail at gmail.com (Ogan ABAAN) Date: Mon, 06 Jul 2009 12:58:53 -0400 Subject: [Biopython] retrieve gene name and exon In-Reply-To: <320fb6e00907060844w3de4d860tf127cd6fc3f32eef@mail.gmail.com> References: <4A52161C.8070909@gmail.com> <320fb6e00907060844w3de4d860tf127cd6fc3f32eef@mail.gmail.com> Message-ID: <4A522D4D.40602@gmail.com> Thanks Peter, Now that you mention it I remember reading that thread. I don't have an exact number but for chr1 I have about 350 of these. I parsed them out a separate chr files. Thank you Peter wrote: > On Mon, Jul 6, 2009 at 4:19 PM, Ogan ABAAN wrote: > >> Hi all, >> >> I have a number of genomic position from the human genome and I want to know >> which genes these positions belong to. I also would like to know which exon >> (if they are from a gene, or even intron if possible) the location is on. >> For example, I want to put in chr1:10,000,000 and would like to see an >> output as such geneX-exon5 or something like that. I know ensemble stores >> that information but I couldn't find the proper tool in Biopython, so I >> would apritiate if anyone could direct me to one. Thank you very much >> >> Ogan >> > > This thread was on a similar topic: > http://lists.open-bio.org/pipermail/biopython/2009-June/005193.html > Given the GenBank file (or in theory an EMBL file or something else > like a GFF file) for a chromosome, and a position within it, how could > you determine which feature(s) a given position was within. > > Note that there are already three different human genomes available > in GenBank, so as mentioned in the earlier thread, you need to know > which human genome your location refers to - and work from the > appropriate GenBank/EMBL/GFF/other data file. > > Peter > > P.S. How many of these locations do you have? > From winda002 at student.otago.ac.nz Mon Jul 6 23:31:12 2009 From: winda002 at student.otago.ac.nz (David WInter) Date: Tue, 07 Jul 2009 11:31:12 +1200 Subject: [Biopython] suggestion for a little change in the ACE cookbook In-Reply-To: <204841.83488.qm@web65510.mail.ac4.yahoo.com> References: <204841.83488.qm@web65510.mail.ac4.yahoo.com> Message-ID: <4A528940.6070503@student.otago.ac.nz> Fungazid wrote: > Hi, > > About the cookbook here > http://biopython.org/wiki/ACE_contig_to_alignment > > instead of: > > def cut_ends(read, start, end): > return (start-1) * '-' + read[start-1:end] + (end +1) * '-' > > I think it is better to write: > > def cut_ends(self,read, start, end): > return (start-1) * 'x' + read[start-1:end-1] + (len(read)-end) * 'x' > Yep, well spotted. It seems I'd also put an ugly hack in the 'pad_ends' function to deal with the problem (cutting the read to length before returning it) so we can get rid to that too ;) I've changed the code on the wiki. 
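For reference, one way to make the pad character an argument would be something like this (just a sketch assuming 1-based inclusive clipping coordinates, not the exact wiki code):

-----------------------------------------
def cut_ends(read, start, end, pad_char="-"):
    # Keep bases start..end (1-based, inclusive) and mask the clipped ends with pad_char
    return (start - 1) * pad_char + read[start - 1:end] + (len(read) - end) * pad_char
-----------------------------------------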
As for adding 'x's instead of '-'s - I think this is really going to be a case by case thing - the contigs I had to play with had asterisks for gaps in the reads so I could tell the difference (and for some strange reason I'm squeamish about using letters to represent a gap even if 'x' is not an ambiguity code). Do you want to add something to the recipe to make it clear that someone could change the 'pad character' to suit the assembly you are using? Cheers, David From pzs at dcs.gla.ac.uk Tue Jul 7 16:41:14 2009 From: pzs at dcs.gla.ac.uk (Peter Saffrey) Date: Tue, 07 Jul 2009 17:41:14 +0100 Subject: [Biopython] Primer3 for testing primers Message-ID: <4A537AAA.5040008@dcs.gla.ac.uk> Has anybody done this through Biopython? I found this posting: http://portal.open-bio.org/pipermail/biopython/2003-October/001673.html but it generates a primer3 input file, rather than using the set_parameter() method provided by Bio.Emboss.Applications.Primer3Commandline. The problem is that by running primer3 from the command line, I can't get it to report problems with (for example) temperature or GC content without using the PRIMER_EXPLAIN_FLAG option, and Primer3Commandline doesn't seem to support that option. This also makes me wonder whether Biopython's primer3 output parsing knows how to read the primer3 "explain" syntax: PRIMER_LEFT_EXPLAIN=considered 1, ok 1 PRIMER_RIGHT_EXPLAIN=considered 1, ok 1 Does anybody know? I'm not finding the primer3 documentation all that helpful either :( There is no mailing list or contact email address... Peter From biopython at maubp.freeserve.co.uk Tue Jul 7 17:05:55 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 7 Jul 2009 18:05:55 +0100 Subject: [Biopython] Primer3 for testing primers In-Reply-To: <4A537AAA.5040008@dcs.gla.ac.uk> References: <4A537AAA.5040008@dcs.gla.ac.uk> Message-ID: <320fb6e00907071005t24d79108u76d23c006c19f297@mail.gmail.com> On Tue, Jul 7, 2009 at 5:41 PM, Peter Saffrey wrote: > Has anybody done this through Biopython? I found this posting: > > http://portal.open-bio.org/pipermail/biopython/2003-October/001673.html > > but it generates a primer3 input file, rather than using the set_parameter() > method provided by Bio.Emboss.Applications.Primer3Commandline. > > The problem is that by running primer3 from the command line, I can't get it > to report problems with (for example) temperature or GC content without > using the PRIMER_EXPLAIN_FLAG option, and Primer3Commandline > doesn't seem to support that option. > > This also makes me wonder whether Biopython's primer3 output parsing knows > how to read the primer3 "explain" syntax: > > PRIMER_LEFT_EXPLAIN=considered 1, ok 1 > PRIMER_RIGHT_EXPLAIN=considered 1, ok 1 > > Does anybody know? > > I'm not finding the primer3 documentation all that helpful either :( There > is no mailing list or contact email address... Are you sure you are using the EMBOSS version of primer3? i.e. the command line tool called eprimer3 (with an "e" at the start). EMBOSS mailing list: http://emboss.sourceforge.net/support/#usermail http://emboss.open-bio.org/mailman/listinfo/emboss EMBOSS docs: http://emboss.sourceforge.net/apps/cvs/emboss/apps/eprimer3.html This does specifically list the "-explainflag" argument, which should be set to a boolean value. This is supported in the Primer3Commandline wrapper in Biopython. I'm not sure about the parser off hand. 
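As a rough illustration of setting that flag through the wrapper (the file names are placeholders, the option names follow the eprimer3 documentation, and this assumes the Bio.Application.generic_run helper from that era):

-----------------------------------------
from Bio.Emboss.Applications import Primer3Commandline
from Bio.Application import generic_run

cline = Primer3Commandline()
cline.set_parameter("-sequence", "in.fasta")    # input sequence file (placeholder name)
cline.set_parameter("-outfile", "primers.txt")  # eprimer3 report file (placeholder name)
cline.set_parameter("-explainflag", True)       # boolean toggle for the PRIMER_*_EXPLAIN lines
result, stdout, stderr = generic_run(cline)     # requires EMBOSS eprimer3 on the PATH
-----------------------------------------

The PRIMER_LEFT_EXPLAIN / PRIMER_RIGHT_EXPLAIN lines can then be read from the outfile, even if the Biopython parser turns out not to handle them.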
Peter From fungazid at yahoo.com Tue Jul 7 19:19:33 2009 From: fungazid at yahoo.com (Fungazid) Date: Tue, 7 Jul 2009 12:19:33 -0700 (PDT) Subject: [Biopython] suggestion for a little change in the ACE cookbook Message-ID: <927677.46270.qm@web65502.mail.ac4.yahoo.com> Hi David, I am working with a version of this cookbook that suits my needs. Right now I do not have extremely exciting things to add to the cookbook, but I am working with this code and maybe I can track something important (hopefully not bugs ;) ). Thanks, Avi --- On Tue, 7/7/09, David WInter wrote: > From: David WInter > Subject: Re: [Biopython] suggestion for a little change in the ACE cookbook > To: "Fungazid" > Cc: biopython at lists.open-bio.org > Date: Tuesday, July 7, 2009, 2:31 AM > Fungazid wrote: > > Hi, > > > > About the cookbook here > > http://biopython.org/wiki/ACE_contig_to_alignment > > > > instead of: > > > > def cut_ends(read, start, end): > > return (start-1) * '-' + > read[start-1:end] + (end +1) * '-' > > > > I think it is better to write: > > > > def cut_ends(self,read, start, end): > > return (start-1) * 'x' + > read[start-1:end-1] + (len(read)-end) * 'x' > > > > Yep, well spotted. It seems I'd also put an ugly hack in > the 'pad_ends' function to deal with the problem (cutting > the read to length before returning it) so we can get rid of > that too ;) I've changed the code on the wiki. > > As for adding 'x's instead of '-'s - I think this is really > going to be a case by case thing - the contigs I had to play > with had asterisks for gaps in the reads so I could tell the > difference (and for some strange reason I'm squeamish about > using letters to represent a gap even if 'x' is not an > ambiguity code). Do you want to add something to the recipe > to make it clear that someone could change the 'pad > character' to suit the assembly you are using? > > Cheers, > David > > > > > > > From lueck at ipk-gatersleben.de Wed Jul 8 10:08:56 2009 From: lueck at ipk-gatersleben.de (lueck at ipk-gatersleben.de) Date: Wed, 8 Jul 2009 12:08:56 +0200 Subject: [Biopython] blastall - strange results Message-ID: <20090708120856.c902mgb7eed4w8c8@webmail.ipk-gatersleben.de> Hi! Sorry for the late reply but here is an update: I tried megablast but it doesn't help... But what I found out is acceptable for the moment: If the query sequence is >235 bp >>> use wordsize 21 If the query sequence is <235 bp >>> use wordsize 11 I don't know the reason for that but at least I can work with it. However, now and then BLAST doesn't find all sequences (rarely), and sooner or later I'll switch to a short read aligner or global alignment. Kind regards Stefanie >>> On Thu, May 28, 2009 at 1:02 PM, Brad Chapman <[EMAIL PROTECTED]> wrote: > Hi Stefanie; > >> I get strange results with blast. >> My aim is to blast a query sequence, split into 21-mers, against a database. > [...] >> Is this normal? I would expect to find all 21-mers. Why only some? I would check the filtering option is off (by default BLAST will mask low complexity regions). > BLAST isn't the best tool for this sort of problem. For exhaustively > aligning short sequences to a database of target sequences, you > should think about using a short read aligner. This is a nice > summary of available aligners: > > http://www.sanger.ac.uk/Users/lh3/NGSalign.shtml > > Personally, I have had good experiences using Mosaik and Bowtie. > > Hope this helps, > Brad Brad is probably right about normal BLAST not being the best tool.
However, if you haven't done so already you might want to try megablast instead of blastn, as this is designed for very similar matches. This should be a very small change to your existing Biopython script, so it should be easy to try out. Peter _______________________________________________ Biopython mailing list - [EMAIL PROTECTED] http://lists.open-bio.org/mailman/listinfo/biopython From bartomas at gmail.com Tue Jul 14 11:03:08 2009 From: bartomas at gmail.com (bar tomas) Date: Tue, 14 Jul 2009 12:03:08 +0100 Subject: [Biopython] Record count in pcassay database Message-ID: Hi, I'm using Biopython to access Entrez databases. I've retrieved information of the pcassay database with the following code: handle=Entrez.einfo(db=*"pcassay"*) record=Entrez.read(handle) print record[*'DbInfo'*][*'Count'*] Printing the record count of pcassay gives : *1659* Such a limited number of records seems impossible. Am I using Biopython incorrectly ? Thanks very much From dejmail at gmail.com Tue Jul 14 11:09:49 2009 From: dejmail at gmail.com (Liam Thompson) Date: Tue, 14 Jul 2009 13:09:49 +0200 Subject: [Biopython] cleaning sequences Message-ID: Hi everyone I was wondering if there was a built in method for determining whether a sequence (Genbank or FASTA) is an Ambiguous or Unambiguous sequence. The reason I ask is I am trying to subtype a couple hundred viral DNA sequences, and due to bad sequencing, the sequences often have ambiguous characters in them, which the algorithm used to subtype doesn't like. I realise I can compare each letter of each genome in a loop with GATC to determine ambiguity, but it might be easier if there was a built in function. Thanks Liam -- ----------------------------------------------------------- Antiviral Gene Therapy Research Unit University of the Witwatersrand Faculty of Health Sciences, Room 7Q07 7 York Road, Parktown 2193 Tel: 2711 717 2465/7 Fax: 2711 717 2395 Email: liam.thompson at students.wits.ac.za / dejmail at gmail.com From chapmanb at 50mail.com Tue Jul 14 11:30:09 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 14 Jul 2009 07:30:09 -0400 Subject: [Biopython] Record count in pcassay database In-Reply-To: References: Message-ID: <20090714113009.GP17086@sobchak.mgh.harvard.edu> Hello; > I'm using Biopython to access Entrez databases. > I've retrieved information of the pcassay database with the following code: > > > handle=Entrez.einfo(db=*"pcassay"*) > record=Entrez.read(handle) > print record[*'DbInfo'*][*'Count'*] > > Printing the record count of pcassay gives : > *1659* > Such a limited number of records seems impossible. > Am I using Biopython incorrectly ? That count looks right to me if I manually browse the PubChem BioAssay database: http://www.ncbi.nlm.nih.gov/pcassay?term=all[filt] It looks like you are retrieving the top level assay records. The counts for total compounds assayed will be much higher but you would need to examine individual records of interest to determine those. Hope this helps, Brad From bartomas at gmail.com Tue Jul 14 11:48:51 2009 From: bartomas at gmail.com (bar tomas) Date: Tue, 14 Jul 2009 12:48:51 +0100 Subject: [Biopython] Record count in pcassay database In-Reply-To: <20090714113009.GP17086@sobchak.mgh.harvard.edu> References: <20090714113009.GP17086@sobchak.mgh.harvard.edu> Message-ID: Thanks very much for your reply. 
By the way in your http query you specify *term=all[filt]* I've just tried the same with BioPython and it does retireve all records: handle = Entrez.esearch(db=*"pcassay"*, term=*"ALL[filt]"*) Is 'filt' the standard wildcard for Entrez queries ? Thanks. On Tue, Jul 14, 2009 at 12:30 PM, Brad Chapman wrote: > Hello; > > > I'm using Biopython to access Entrez databases. > > I've retrieved information of the pcassay database with the following > code: > > > > > > handle=Entrez.einfo(db=*"pcassay"*) > > record=Entrez.read(handle) > > print record[*'DbInfo'*][*'Count'*] > > > > Printing the record count of pcassay gives : > > *1659* > > Such a limited number of records seems impossible. > > Am I using Biopython incorrectly ? > > That count looks right to me if I manually browse the PubChem > BioAssay database: > > http://www.ncbi.nlm.nih.gov/pcassay?term=all[filt] > > It looks like you are retrieving the top level assay records. The > counts for total compounds assayed will be much higher but you would > need to examine individual records of interest to determine those. > > Hope this helps, > Brad > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From chapmanb at 50mail.com Tue Jul 14 12:50:12 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 14 Jul 2009 08:50:12 -0400 Subject: [Biopython] Record count in pcassay database In-Reply-To: References: <20090714113009.GP17086@sobchak.mgh.harvard.edu> Message-ID: <20090714125012.GS17086@sobchak.mgh.harvard.edu> Hello; > Thanks very much for your reply. > By the way in your http query you specify *term=all[filt]* > I've just tried the same with BioPython and it does retireve all records: It looked like you were getting all the records with your previous query as well. > handle = Entrez.esearch(db=*"pcassay"*, term=*"ALL[filt]"*) > Is 'filt' the standard wildcard for Entrez queries ? I don't know too much about PubChem queries but had just clicked on the "All BioAssays" link from the main page: http://www.ncbi.nlm.nih.gov/sites/entrez?db=pcassay The documentation linked to from there: http://pubchem.ncbi.nlm.nih.gov/help.html#PubChem_index can probably provide additional direction. Thanks, Brad > > Thanks. > > On Tue, Jul 14, 2009 at 12:30 PM, Brad Chapman wrote: > > > Hello; > > > > > I'm using Biopython to access Entrez databases. > > > I've retrieved information of the pcassay database with the following > > code: > > > > > > > > > handle=Entrez.einfo(db=*"pcassay"*) > > > record=Entrez.read(handle) > > > print record[*'DbInfo'*][*'Count'*] > > > > > > Printing the record count of pcassay gives : > > > *1659* > > > Such a limited number of records seems impossible. > > > Am I using Biopython incorrectly ? > > > > That count looks right to me if I manually browse the PubChem > > BioAssay database: > > > > http://www.ncbi.nlm.nih.gov/pcassay?term=all[filt] > > > > It looks like you are retrieving the top level assay records. The > > counts for total compounds assayed will be much higher but you would > > need to examine individual records of interest to determine those. 
> > > > Hope this helps, > > Brad > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > > From chapmanb at 50mail.com Tue Jul 14 12:45:21 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 14 Jul 2009 08:45:21 -0400 Subject: [Biopython] cleaning sequences In-Reply-To: References: Message-ID: <20090714124521.GR17086@sobchak.mgh.harvard.edu> Hi Liam; I don't believe there is built in functionality for doing this. The problem itself is hard because it is a bit underspecified: what should be done when encountering ambiguous characters? Depending on your situation this can be a couple of different things: - Trim the sequence to remove the bases. This might be a post-sequencing step, and there was some discussion between Peter and Giles about the parameters of doing this earlier this month: http://lists.open-bio.org/pipermail/biopython/2009-July/005342.html - Replace the bases with an accepted ambiguity character (say, N or x) So it's a bit hard to generalize. Saying that, we'd be happy for thoughts on an implementation that would tackle these sorts of issues. Brad > I was wondering if there was a built in method for determining whether a > sequence (Genbank or FASTA) is an Ambiguous or Unambiguous sequence. The > reason I ask is I am trying to subtype a couple hundred viral DNA sequences, > and due to bad sequencing, the sequences often have ambiguous characters in > them, which the algorithm used to subtype doesn't like. I realise I can > compare each letter of each genome in a loop with GATC to determine > ambiguity, but it might be easier if there was a built in function. > > Thanks > Liam > > > > -- > ----------------------------------------------------------- > Antiviral Gene Therapy Research Unit > University of the Witwatersrand > Faculty of Health Sciences, Room 7Q07 > 7 York Road, Parktown > 2193 > > Tel: 2711 717 2465/7 > Fax: 2711 717 2395 > Email: liam.thompson at students.wits.ac.za / dejmail at gmail.com > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From bartomas at gmail.com Tue Jul 14 13:22:28 2009 From: bartomas at gmail.com (bar tomas) Date: Tue, 14 Jul 2009 14:22:28 +0100 Subject: [Biopython] Record count in pcassay database In-Reply-To: <20090714125012.GS17086@sobchak.mgh.harvard.edu> References: <20090714113009.GP17086@sobchak.mgh.harvard.edu> <20090714125012.GS17086@sobchak.mgh.harvard.edu> Message-ID: Thanks a lot! On Tue, Jul 14, 2009 at 1:50 PM, Brad Chapman wrote: > Hello; > > > Thanks very much for your reply. > > By the way in your http query you specify *term=all[filt]* > > I've just tried the same with BioPython and it does retireve all records: > > It looked like you were getting all the records with your previous > query as well. > > > handle = Entrez.esearch(db=*"pcassay"*, term=*"ALL[filt]"*) > > Is 'filt' the standard wildcard for Entrez queries ? > > I don't know too much about PubChem queries but had just clicked on the > "All BioAssays" link from the main page: > > http://www.ncbi.nlm.nih.gov/sites/entrez?db=pcassay > > The documentation linked to from there: > > http://pubchem.ncbi.nlm.nih.gov/help.html#PubChem_index > > can probably provide additional direction. Thanks, > Brad > > > > > Thanks. 
> > > > On Tue, Jul 14, 2009 at 12:30 PM, Brad Chapman > wrote: > > > > > Hello; > > > > > > > I'm using Biopython to access Entrez databases. > > > > I've retrieved information of the pcassay database with the following > > > code: > > > > > > > > > > > > handle=Entrez.einfo(db=*"pcassay"*) > > > > record=Entrez.read(handle) > > > > print record[*'DbInfo'*][*'Count'*] > > > > > > > > Printing the record count of pcassay gives : > > > > *1659* > > > > Such a limited number of records seems impossible. > > > > Am I using Biopython incorrectly ? > > > > > > That count looks right to me if I manually browse the PubChem > > > BioAssay database: > > > > > > http://www.ncbi.nlm.nih.gov/pcassay?term=all[filt] > > > > > > It looks like you are retrieving the top level assay records. The > > > counts for total compounds assayed will be much higher but you would > > > need to examine individual records of interest to determine those. > > > > > > Hope this helps, > > > Brad > > > _______________________________________________ > > > Biopython mailing list - Biopython at lists.open-bio.org > > > http://lists.open-bio.org/mailman/listinfo/biopython > > > > From cjfields at illinois.edu Tue Jul 14 14:48:04 2009 From: cjfields at illinois.edu (Chris Fields) Date: Tue, 14 Jul 2009 09:48:04 -0500 Subject: [Biopython] cleaning sequences In-Reply-To: <20090714124521.GR17086@sobchak.mgh.harvard.edu> References: <20090714124521.GR17086@sobchak.mgh.harvard.edu> Message-ID: <16F8D67C-EC52-4C11-8889-B07CAE9D7E1B@illinois.edu> If you do come up with something, let us Bioperl guys know. We have a preliminary trimming/cleaning version that we're thinking of adding, but it would be nice to coalesce around a similar implementation. chris On Jul 14, 2009, at 7:45 AM, Brad Chapman wrote: > Hi Liam; > I don't believe there is built in functionality for doing this. The > problem itself is hard because it is a bit underspecified: what > should be done when encountering ambiguous characters? Depending on > your situation this can be a couple of different things: > > - Trim the sequence to remove the bases. This might be a > post-sequencing step, and there was some discussion between Peter > and Giles about the parameters of doing this earlier this month: > > http://lists.open-bio.org/pipermail/biopython/2009-July/005342.html > > - Replace the bases with an accepted ambiguity character (say, N or > x) > > So it's a bit hard to generalize. Saying that, we'd be happy for > thoughts on an implementation that would tackle these sorts of > issues. > > Brad > >> I was wondering if there was a built in method for determining >> whether a >> sequence (Genbank or FASTA) is an Ambiguous or Unambiguous >> sequence. The >> reason I ask is I am trying to subtype a couple hundred viral DNA >> sequences, >> and due to bad sequencing, the sequences often have ambiguous >> characters in >> them, which the algorithm used to subtype doesn't like. I realise I >> can >> compare each letter of each genome in a loop with GATC to determine >> ambiguity, but it might be easier if there was a built in function. 
>> >> Thanks >> Liam >> >> >> >> -- >> ----------------------------------------------------------- >> Antiviral Gene Therapy Research Unit >> University of the Witwatersrand >> Faculty of Health Sciences, Room 7Q07 >> 7 York Road, Parktown >> 2193 >> >> Tel: 2711 717 2465/7 >> Fax: 2711 717 2395 >> Email: liam.thompson at students.wits.ac.za / dejmail at gmail.com >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From bartomas at gmail.com Tue Jul 14 15:39:08 2009 From: bartomas at gmail.com (bar tomas) Date: Tue, 14 Jul 2009 16:39:08 +0100 Subject: [Biopython] Problem using efetch Message-ID: Hi, I?m using BioPython to access Entrez databases. I?m following the BioPython tutorial. I?ve tried retrieving all record ids from pcassay database with esearch and then retrieving the first full record on the list with efetch: handle = Entrez.esearch(db="pcassay", term="ALL[filt]") print record["IdList"] # This prints the following list of ids: # ['1866', '1865', '1864', '1863', '1862', '1861', '1033', '1860', etc. But when I then try to retrieve the first record: handle2 = Entrez.efetch(db="pcassay", id="1866") I get the following error :

      Error occurred: Report 'ASN1' not found in 'pcassay' presentation


      • db=pcassay
      • query_key=
      • report=
      • dispstart=
      • dispmax=
      • mode=html
      • WebEnv=

      pmfetch need params:

    • (id=NNNNNN[,NNNN,etc]) or (query_key=NNN, where NNN - number in the history, 0 - clipboard content for current database)
    • db=db_name (mandatory)
    • report=[docsum, brief, abstract, citation, medline, asn.1, mlasn1, uilist, sgml, gen] (Optional; default is asn.1)
    • mode=[html, file, text, asn.1, xml] (Optional; default is html)
    • dispstart - first element to display, from 0 to count - 1, (Optional; default is 0)
    • dispmax - number of items to display (Optional; default is all elements, from dispstart)

    • See help. Do you have an idea of what I?m doing wrong? Thanks very much From dejmail at gmail.com Tue Jul 14 18:21:29 2009 From: dejmail at gmail.com (Liam Thompson) Date: Tue, 14 Jul 2009 20:21:29 +0200 Subject: [Biopython] cleaning sequences In-Reply-To: <20090714124521.GR17086@sobchak.mgh.harvard.edu> References: <20090714124521.GR17086@sobchak.mgh.harvard.edu> Message-ID: Hi Brad Yes, I remember the posts rereading them now. I think my problem is a little less complicated than sequence data, seeing as my sequences are genbank entries, so they just need to be read, even if they're bad quality. I suppose changing the letter would be a better option for me, especially as the reading frame is important for aligning based on peptide sequence. As for implementation, I am a complete greenhorn at python nevermind programming, so I wouldn't even know where to start suggestions, sorry about that. Regards Liam On Tue, Jul 14, 2009 at 2:45 PM, Brad Chapman wrote: > Hi Liam; > I don't believe there is built in functionality for doing this. The > problem itself is hard because it is a bit underspecified: what > should be done when encountering ambiguous characters? Depending on > your situation this can be a couple of different things: > > - Trim the sequence to remove the bases. This might be a > post-sequencing step, and there was some discussion between Peter > and Giles about the parameters of doing this earlier this month: > > http://lists.open-bio.org/pipermail/biopython/2009-July/005342.html > > - Replace the bases with an accepted ambiguity character (say, N or > x) > > So it's a bit hard to generalize. Saying that, we'd be happy for > thoughts on an implementation that would tackle these sorts of > issues. > > Brad > > > I was wondering if there was a built in method for determining whether a > > sequence (Genbank or FASTA) is an Ambiguous or Unambiguous sequence. The > > reason I ask is I am trying to subtype a couple hundred viral DNA > sequences, > > and due to bad sequencing, the sequences often have ambiguous characters > in > > them, which the algorithm used to subtype doesn't like. I realise I can > > compare each letter of each genome in a loop with GATC to determine > > ambiguity, but it might be easier if there was a built in function. > > > > Thanks > > Liam > > > > > > > > -- > > ----------------------------------------------------------- > > Antiviral Gene Therapy Research Unit > > University of the Witwatersrand > > Faculty of Health Sciences, Room 7Q07 > > 7 York Road, Parktown > > 2193 > > > > Tel: 2711 717 2465/7 > > Fax: 2711 717 2395 > > Email: liam.thompson at students.wits.ac.za / dejmail at gmail.com > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > -- ----------------------------------------------------------- Antiviral Gene Therapy Research Unit University of the Witwatersrand Faculty of Health Sciences, Room 7Q07 7 York Road, Parktown 2193 Tel: 2711 717 2465/7 Fax: 2711 717 2395 Email: liam.thompson at students.wits.ac.za / dejmail at gmail.com From biopython at maubp.freeserve.co.uk Tue Jul 14 22:08:50 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 14 Jul 2009 23:08:50 +0100 Subject: [Biopython] Problem using efetch In-Reply-To: References: Message-ID: <320fb6e00907141508l13ed0d2i9ddd466538af8816@mail.gmail.com> On Tue, Jul 14, 2009 at 4:39 PM, bar tomas wrote: > Hi, > > I?m using BioPython to access Entrez databases. 
I'm following > the BioPython tutorial. I've tried retrieving all record ids from > pcassay database with esearch and then retrieving the first full > record on the list with efetch: > > handle = Entrez.esearch(db="pcassay", term="ALL[filt]") > > print record["IdList"] > > # This prints the following list of ids: > > # ['1866', '1865', '1864', '1863', '1862', '1861', '1033', '1860', etc. > > > But when I then try to retrieve the first record: > > handle2 = Entrez.efetch(db="pcassay", id="1866") > > I get the following error : > > > >

      Error occurred: Report 'ASN1' not found in 'pcassay' > presentation


        >
      • db=pcassay
      • > ... > > Do you have an idea of what I?m doing wrong? This isn't anything wrong with Biopython - this is the sort of slightly cryptic error the NCBI gives when the return type and/or return mode isn't supported. Apparently the default (ASN1) isn't supported for this database. The NCBI efetch documentation is a little vague or simply missing for the less main-stream databases. You can make some guesses from playing with the Entrez website, e.g. >>> print Entrez.efetch(db="pcassay", id="1866", rettype="uilist").read() PmFetch response
        1866
        
        >>> print Entrez.efetch(db="pcassay", id="1866", rettype="uilist", retmode="text").read() 1866 >>> print Entrez.efetch(db="pcassay", id="1866", rettype="abstract", retmode="text").read() 1: AID: 1866 Name: Epi-absorbance-based counterscreen assay for selective VIM-2 inhibitors: biochemical high throughput screening assay to identify inhibitors of TEM-1 serine-beta-lactamase. Source: The Scripps Research Institute Molecular Screening Center Description: Source (MLPCN Center Name): The Scripps Research Institute ... You could also try emailing the NCBI for advice. Peter From chapmanb at 50mail.com Wed Jul 15 12:35:40 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 15 Jul 2009 08:35:40 -0400 Subject: [Biopython] cleaning sequences In-Reply-To: References: <20090714124521.GR17086@sobchak.mgh.harvard.edu> Message-ID: <20090715123540.GF17086@sobchak.mgh.harvard.edu> Hi Liam; That makes sense. It's a good suggestion and I added it to the Project Ideas area of the wiki so hopefully it'll get picked up on in the future: http://biopython.org/wiki/Active_projects#Project_ideas For your specific problem, you should be able to do something along the lines of: def convert_ambiguous(orig_seq): new_bases = [] for base in str(orig_seq).upper(): if base in ["G", "A", "T", "C"]: new_bases.append(base) else: new_bases.append("N") return Seq("".join(new_bases), orig_seq.alphabet) which would switch all non GATCs to the N ambiguity character, assuming your downstream program accepts that. Hope this helps, Brad > > Yes, I remember the posts rereading them now. I think my problem is a little > less complicated than sequence data, seeing as my sequences are genbank > entries, so they just need to be read, even if they're bad quality. I > suppose changing the letter would be a better option for me, especially as > the reading frame is important for aligning based on peptide sequence. > > As for implementation, I am a complete greenhorn at python nevermind > programming, so I wouldn't even know where to start suggestions, sorry about > that. > > Regards > Liam > > > > > On Tue, Jul 14, 2009 at 2:45 PM, Brad Chapman wrote: > > > Hi Liam; > > I don't believe there is built in functionality for doing this. The > > problem itself is hard because it is a bit underspecified: what > > should be done when encountering ambiguous characters? Depending on > > your situation this can be a couple of different things: > > > > - Trim the sequence to remove the bases. This might be a > > post-sequencing step, and there was some discussion between Peter > > and Giles about the parameters of doing this earlier this month: > > > > http://lists.open-bio.org/pipermail/biopython/2009-July/005342.html > > > > - Replace the bases with an accepted ambiguity character (say, N or > > x) > > > > So it's a bit hard to generalize. Saying that, we'd be happy for > > thoughts on an implementation that would tackle these sorts of > > issues. > > > > Brad > > > > > I was wondering if there was a built in method for determining whether a > > > sequence (Genbank or FASTA) is an Ambiguous or Unambiguous sequence. The > > > reason I ask is I am trying to subtype a couple hundred viral DNA > > sequences, > > > and due to bad sequencing, the sequences often have ambiguous characters > > in > > > them, which the algorithm used to subtype doesn't like. I realise I can > > > compare each letter of each genome in a loop with GATC to determine > > > ambiguity, but it might be easier if there was a built in function. 
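For anyone wanting to reuse the snippet Brad posted above, here is the same idea as a small self-contained script (a minimal sketch: the example sequence is invented, and the alphabet argument from the original message is dropped so the function only depends on the Seq class itself):

    from Bio.Seq import Seq

    def convert_ambiguous(orig_seq):
        """Return a copy of the sequence with anything other than G, A, T or C as N."""
        new_bases = []
        for base in str(orig_seq).upper():
            if base in ("G", "A", "T", "C"):
                new_bases.append(base)
            else:
                new_bases.append("N")
        return Seq("".join(new_bases))

    # Example with an invented sequence containing IUPAC ambiguity codes:
    dirty = Seq("ACGTRYSWACGT")
    print(convert_ambiguous(dirty))   # ACGTNNNNACGT

A quick yes/no test for ambiguity, as Liam originally asked for, is simply set(str(seq).upper()) <= set("GATC").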
> > > > > > Thanks > > > Liam > > > > > > > > > > > > -- > > > ----------------------------------------------------------- > > > Antiviral Gene Therapy Research Unit > > > University of the Witwatersrand > > > Faculty of Health Sciences, Room 7Q07 > > > 7 York Road, Parktown > > > 2193 > > > > > > Tel: 2711 717 2465/7 > > > Fax: 2711 717 2395 > > > Email: liam.thompson at students.wits.ac.za / dejmail at gmail.com > > > _______________________________________________ > > > Biopython mailing list - Biopython at lists.open-bio.org > > > http://lists.open-bio.org/mailman/listinfo/biopython > > > > > > -- > ----------------------------------------------------------- > Antiviral Gene Therapy Research Unit > University of the Witwatersrand > Faculty of Health Sciences, Room 7Q07 > 7 York Road, Parktown > 2193 > > Tel: 2711 717 2465/7 > Fax: 2711 717 2395 > Email: liam.thompson at students.wits.ac.za / dejmail at gmail.com From bartomas at gmail.com Wed Jul 15 13:12:10 2009 From: bartomas at gmail.com (bar tomas) Date: Wed, 15 Jul 2009 14:12:10 +0100 Subject: [Biopython] How to run esearch in BioPython without specifying any filtering terms Message-ID: Hi, The BioPython tutorial (p.86) shows how once the available fields of an Entrez database have been found with Einfo , queries can be run that use those fields in the term argument of Esearch (for instance Jones[AUTH]). However, I?d like to retrieve all IDs from a database without specifying any filtering term. If I leave the term argument out in the Entrez.efetch method, BioPython returns an error. It tried the following, that came up in a previous email on this mailing list regarding pcassay database: handle = Entrez.esearch(db='pcsubstance', term="ALL[filt]") But this returns a list of 20 ids that obviously cannot comprise the whole pcsubstance database How can you run esearch in BioPython with no filtering terms? Thanks very much. From chapmanb at 50mail.com Wed Jul 15 20:16:55 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 15 Jul 2009 16:16:55 -0400 Subject: [Biopython] How to run esearch in BioPython without specifying any filtering terms In-Reply-To: References: Message-ID: <20090715201655.GH39098@sobchak.mgh.harvard.edu> Hello; > The BioPython tutorial (p.86) shows how once the available fields of an > Entrez database have been found with Einfo , queries can be run that use > those fields in the term argument of Esearch (for instance Jones[AUTH]). > > However, I?d like to retrieve all IDs from a database without specifying any > filtering term. > > If I leave the term argument out in the Entrez.efetch method, BioPython > returns an error. [..] > How can you run esearch in BioPython with no filtering terms? Retrieving all IDs isn't practical for most of the databases due to large numbers of entries. That's why a term is required in Biopython, and why most NCBI databases likely won't have an option to return everything. For example, 'pcsubstance' looks to contain 81 million records from the available downloads: ftp://ftp.ncbi.nlm.nih.gov/pubchem/Substance/CURRENT-Full/XML/ To realistically loop over a query, you'll need to limit your search via some subset of things you are interested in to make the numbers more manageable. 
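On the "list of 20 ids" point raised earlier in this thread: Entrez.esearch only returns the first batch of identifiers (20 by default) while the total number of matches is reported separately, so a short IdList does not mean the database is small. A minimal sketch, reusing the pcassay query from the earlier thread (the email address is a placeholder and the retmax/retstart values are arbitrary):

    from Bio import Entrez

    Entrez.email = "your.name@example.org"   # placeholder - the NCBI ask for a contact address

    # By default esearch returns at most 20 IDs; the full total is in "Count".
    handle = Entrez.esearch(db="pcassay", term="ALL[filt]")
    record = Entrez.read(handle)
    handle.close()
    print(record["Count"])        # total number of matching records
    print(len(record["IdList"]))  # only 20 unless retmax is raised

    # Ask for a larger batch, or page through the results with retstart:
    handle = Entrez.esearch(db="pcassay", term="ALL[filt]", retmax=200, retstart=0)
    record = Entrez.read(handle)
    handle.close()
    print(len(record["IdList"]))

For the genuinely huge databases Brad mentions, restricting the search term remains the practical approach; paging through tens of millions of IDs is not.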
Hope this helps, Brad From dejmail at gmail.com Wed Jul 15 20:39:38 2009 From: dejmail at gmail.com (Liam Thompson) Date: Wed, 15 Jul 2009 22:39:38 +0200 Subject: [Biopython] cleaning sequences In-Reply-To: <20090715123540.GF17086@sobchak.mgh.harvard.edu> References: <20090714124521.GR17086@sobchak.mgh.harvard.edu> <20090715123540.GF17086@sobchak.mgh.harvard.edu> Message-ID: Hi Brad Thanks, it does work really well, and I was quite close, I just need to work on my loop conditions. I would suggest for development a way of interacting with the Unafold software. I know this was talked about a few weeks back, I think someone (Chris ?) wanted to write a wrapper, and it would be really nice if this could be added on. Regards Liam From chapmanb at 50mail.com Thu Jul 16 12:15:07 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 16 Jul 2009 08:15:07 -0400 Subject: [Biopython] cleaning sequences In-Reply-To: References: <20090714124521.GR17086@sobchak.mgh.harvard.edu> <20090715123540.GF17086@sobchak.mgh.harvard.edu> Message-ID: <20090716121507.GD44295@sobchak.mgh.harvard.edu> Hi Liam; > Thanks, it does work really well, and I was quite close, I just need to work > on my loop conditions. Great to hear -- glad you got it all figured out. > I would suggest for development a way of interacting with the Unafold > software. I know this was talked about a few weeks back, I think someone > (Chris ?) wanted to write a wrapper, and it would be really nice if this > could be added on. Sounds good. I'd encourage you to register on the wiki and add these type of ideas to the project ideas section, ideally with links to the relevant discussion lists: http://biopython.org/wiki/Active_projects#Project_ideas This is informal but helps do two things: it keeps the idea from getting lost on the mailing list, and provides a place for people to look if they are interested in contributing but don't know where to start. Brad From mmokrejs at ribosome.natur.cuni.cz Fri Jul 17 09:58:13 2009 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Fri, 17 Jul 2009 11:58:13 +0200 Subject: [Biopython] Moving from Bio.PubMed to Bio.Entrez Message-ID: <4A604B35.5010708@ribosome.natur.cuni.cz> Hi Peter and others, finally am moving my code from Bio.PubMed to Bio.Entrez. I think I have something wrong with my installation biopython-1.49: $ python Python 2.6.2 (r262:71600, Jun 10 2009, 00:54:18) [GCC 4.3.3] on linux2 Type "help", "copyright", "credits" or "license" for more information. 
>>> from Bio import Entrez, Medline, GenBank >>> Entrez.email = "mmokrejs at iresite.org" >>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="text") >>> _records = Entrez.read(_handle) Traceback (most recent call last): File "", line 1, in File "/usr/lib/python2.6/site-packages/Bio/Entrez/__init__.py", line 286, in read record = handler.run(handle) File "/usr/lib/python2.6/site-packages/Bio/Entrez/Parser.py", line 95, in run self.parser.ParseFile(handle) xml.parsers.expat.ExpatError: syntax error: line 1, column 0 >>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="XML") >>> _records = Entrez.read(_handle) Traceback (most recent call last): File "", line 1, in File "/usr/lib/python2.6/site-packages/Bio/Entrez/__init__.py", line 286, in read record = handler.run(handle) File "/usr/lib/python2.6/site-packages/Bio/Entrez/Parser.py", line 95, in run self.parser.ParseFile(handle) File "/usr/lib/python2.6/site-packages/Bio/Entrez/Parser.py", line 283, in external_entity_ref_handler parser.ParseFile(handle) File "/usr/lib/python2.6/site-packages/Bio/Entrez/Parser.py", line 280, in external_entity_ref_handler handle = urllib.urlopen(systemId) File "/usr/lib/python2.6/urllib.py", line 87, in urlopen return opener.open(url) File "/usr/lib/python2.6/urllib.py", line 203, in open return getattr(self, name)(url) File "/usr/lib/python2.6/urllib.py", line 465, in open_file return self.open_local_file(url) File "/usr/lib/python2.6/urllib.py", line 479, in open_local_file raise IOError(e.errno, e.strerror, e.filename) IOError: [Errno 2] No such file or directory: 'nlmmedline_090101.dtd' >>> When I upgrade to 1.51b I get slightly better results: $ python Python 2.5.4 (r254:67916, Jul 15 2009, 19:40:01) [GCC 4.2.2 (Gentoo 4.2.2 p1.0)] on linux2 Type "help", "copyright", "credits" or "license" for more information. 
>>> from Bio import Entrez, Medline, GenBank >>> Entrez.email = "mmokrejs at iresite.org" >>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="text") >>> _records = Entrez.read(_handle) Traceback (most recent call last): File "", line 1, in File "/usr/lib/python2.5/site-packages/Bio/Entrez/__init__.py", line 297, in read record = handler.run(handle) File "/usr/lib/python2.5/site-packages/Bio/Entrez/Parser.py", line 90, in run self.parser.ParseFile(handle) xml.parsers.expat.ExpatError: syntax error: line 1, column 0 >>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="XML") >>> _records = Entrez.read(_handle) >>> _records [{u'MedlineCitation': {u'DateCompleted': {u'Month': '06', u'Day': '29', u'Year': '2000'}, u'OtherID': [], u'DateRevised': {u'Month': '11', u'Day': '14', u'Year': '2007'}, u'MeshHeadingList': [{u'QualifierName': [], u'DescriptorName': '3T3 Cells'}, {u'QualifierName': ['chemistry', 'physiology'], u'DescriptorName': "5' Untranslated Regions"}, {u'QualifierName': [], u'DescriptorName': 'Animals'}, {u'QualifierName': [], u'DescriptorName': 'Base Sequence'}, {u'QualifierName': [], u'DescriptorName': 'Chick Embryo'}, {u'QualifierName': [], u'DescriptorName': 'Mice'}, {u'QualifierName': [], u'DescriptorName': 'Molecular Sequence Data'}, {u'QualifierName': [], u'DescriptorName': 'Protein Biosynthesis'}, {u'QualifierName': ['genetics'], u'DescriptorName': 'Proto-Oncogene Proteins c-jun'}, {u'QualifierName': ['chemistry'], u'DescriptorName': 'RNA, Messenger'}, {u'QualifierName': [], u'DescriptorName': 'Rabbits'}], u'OtherAbstract': [], u'CitationSubset': ['IM'], u'ChemicalList': [{u'Nam eOfSubstance': "5' Untranslated Regions", u'RegistryNumber': '0'}, {u'NameOfSubstance': 'Proto-Oncogene Proteins c-jun', u'RegistryNumber': '0'}, {u'NameOfSubstance': 'RNA, Messenger', u'RegistryNumber': '0'}], u'KeywordList': [], u'DateCreated': {u'Month': '06', u'Day': '29', u'Year': '2000'}, u'SpaceFlightMission': [], u'GeneralNote': [], u'Article': {u'ArticleDate': [], u'Pagination': {u'MedlinePgn': '2836-45'}, u'AuthorList': [{u'LastName': 'Sehgal', u'Initials': 'A', u'ForeName': 'A'}, {u'LastName': 'Briggs', u'Initials': 'J', u'ForeName': 'J'}, {u'LastName': 'Rinehart-Kim', u'Initials': 'J', u'ForeName': 'J'}, {u'LastName': 'Basso', u'Initials': 'J', u'ForeName': 'J'}, {u'LastName': 'Bos', u'Initials': 'TJ', u'ForeName': 'T J'}], u'Language': ['eng'], u'PublicationTypeList': ['Journal Article', "Research Support, Non-U.S. Gov't", "Research Support, U.S. Gov't, P.H.S."], u'Journal': {u'ISSN': '0950-9232', u'ISOAbbreviation': 'Oncogene', u'JournalIssue': {u'Volume': '19', u'Issue': '24', u'PubDate': {u'Month': 'Jun', u'Day': '1', u'Year': '2000'}}, u'Title': 'Oncogene'}, u'Affiliation': 'Department of Microbiology and Molecular Cell Biology, Eastern Virginia Medical School, PO Box 1980, Norfolk, Virginia, VA 23501, USA.', u'ArticleTitle': "The chicken c-Jun 5' untranslated region directs translation by internal initiation.", u'ELocationID': [], u'Abstract': {u'AbstractText': "The 5' untranslated region (UTR) of the chicken c-jun message is exceptionally GC rich and has the potential to form a complex and extremely stable secondary structure. Because stable RNA secondary structures can serve as obstacles to scanning ribosomes, their presence suggests inefficient translation or initiation through alternate mechanisms. We have examined the role of the c-jun 5' UTR with respect to its ability to influence translation both in vitro and in vivo. 
We find, using rabbit reticulocyte lysates, that the presence of the c-jun 5' UTR severely inhibits tran slation of both homologous and heterologous genes in vitro. Furthermore, translational inhibition correlates with the degree of secondary structure exhibited by the 5' UTR. Thus, in the rabbit reticulocyte lysate system, the c-jun 5' UTR likely impedes ribosome scanning resulting in inefficient translation. In contrast to our results in vitro, the c-jun 5' UTR does not inhibit translation in a variety of different cell lines suggesting that it may direct an alternate mechanism of translational initiation in vivo. To distinguish among the alternate mechanisms, we generated a series of bicistronic expression plasmids. Our results demonstrate that the downstream cistron, in the bicistronic gene, is expressed to a much higher level when directly preceded by the c-jun 5' UTR. In addition, inhibition of ribosome scanning on the bicistronic message, through insertion of a synthetic stable hairpin, inhibits translation of the first cistron but does not inhibit translation of the cist ron downstream of the c-jun 5' UTR. These results are consistent with a model by which the c-jun message is translated through cap independent internal initiation. Oncogene (2000) 19, 2836 - 2845"}, u'GrantList': [{u'Acronym': 'CA', u'Country': 'United States', u'Agency': 'NCI NIH HHS', u'GrantID': 'R01 CA51982'}]}, u'PMID': '10851087', u'MedlineJournalInfo': {u'MedlineTA': 'Oncogene', u'Country': 'ENGLAND', u'NlmUniqueID': '8711562'}}, u'PubmedData': {u'ArticleIdList': ['10851087', '10.1038/sj.onc.1203601'], u'PublicationStatus': 'ppublish', u'History': [[{u'Minute': '0', u'Month': '6', u'Day': '13', u'Hour': '9', u'Year': '2000'}, {u'Minute': '0', u'Month': '7', u'Day': '6', u'Hour': '11', u'Year': '2000'}, {u'Minute': '0', u'Month': '6', u'Day': '13', u'Hour': '9', u'Year': '2000'}]]}}] >>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="text") >>> _records = Entrez.read(_handle) Traceback (most recent call last): File "", line 1, in File "/usr/lib/python2.5/site-packages/Bio/Entrez/__init__.py", line 297, in read record = handler.run(handle) File "/usr/lib/python2.5/site-packages/Bio/Entrez/Parser.py", line 90, in run self.parser.ParseFile(handle) xml.parsers.expat.ExpatError: syntax error: line 1, column 0 >>> Any clues what does that mean? TIA, martin From bartomas at gmail.com Fri Jul 17 11:23:28 2009 From: bartomas at gmail.com (bar tomas) Date: Fri, 17 Jul 2009 12:23:28 +0100 Subject: [Biopython] How to run esearch in BioPython without specifying any filtering terms In-Reply-To: <20090715201655.GH39098@sobchak.mgh.harvard.edu> References: <20090715201655.GH39098@sobchak.mgh.harvard.edu> Message-ID: Thanks a lot. I understand now. On Wed, Jul 15, 2009 at 9:16 PM, Brad Chapman wrote: > Hello; > > > The BioPython tutorial (p.86) shows how once the available fields of an > > Entrez database have been found with Einfo , queries can be run that use > > those fields in the term argument of Esearch (for instance Jones[AUTH]). > > > > However, I?d like to retrieve all IDs from a database without specifying > any > > filtering term. > > > > If I leave the term argument out in the Entrez.efetch method, BioPython > > returns an error. > [..] > > How can you run esearch in BioPython with no filtering terms? > > Retrieving all IDs isn't practical for most of the databases due to > large numbers of entries. 
That's why a term is required in Biopython, > and why most NCBI databases likely won't have an option to return > everything. For example, 'pcsubstance' looks to contain 81 million > records from the available downloads: > > ftp://ftp.ncbi.nlm.nih.gov/pubchem/Substance/CURRENT-Full/XML/ > > To realistically loop over a query, you'll need to limit your search > via some subset of things you are interested in to make the numbers > more manageable. > > Hope this helps, > Brad > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From chapmanb at 50mail.com Fri Jul 17 12:01:29 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 17 Jul 2009 08:01:29 -0400 Subject: [Biopython] Moving from Bio.PubMed to Bio.Entrez In-Reply-To: <4A604B35.5010708@ribosome.natur.cuni.cz> References: <4A604B35.5010708@ribosome.natur.cuni.cz> Message-ID: <20090717120129.GE46309@sobchak.mgh.harvard.edu> Hi Martin; Thanks for the e-mail. Let's tackle your up to date 1.51beta work. > When I upgrade to 1.51b I get slightly better results: > > >>> from Bio import Entrez, Medline, GenBank > >>> Entrez.email = "mmokrejs at iresite.org" > >>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="text") > >>> _records = Entrez.read(_handle) [ error ] > >>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="XML") > >>> _records = Entrez.read(_handle) > >>> _records [ worked ] > Any clues what does that mean? TIA, In the first (and also third) example, you are retrieving the text based result. The Entrez parser handles XML output, so it is complaining because it's getting the raw text record instead of XML. Your second example is correct and worked; you specified the correct XML retmode. You should be able to go with this. More generally, since Entrez returns many different file types, you want to be sure and match up what you are getting with the parser you are using. Hope this helps, Brad From mmokrejs at ribosome.natur.cuni.cz Fri Jul 17 13:29:31 2009 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Fri, 17 Jul 2009 15:29:31 +0200 Subject: [Biopython] Moving from Bio.PubMed to Bio.Entrez In-Reply-To: <20090717120129.GE46309@sobchak.mgh.harvard.edu> References: <4A604B35.5010708@ribosome.natur.cuni.cz> <20090717120129.GE46309@sobchak.mgh.harvard.edu> Message-ID: <4A607CBB.106@ribosome.natur.cuni.cz> Hi Brad, thanks for clarification. I somewhat overlooked in the tutorial that Entrez.read() requires me to ask for XML rettype and that it parses the XML result by itself into the dictionary structure. Still I think it should check what values I have passed down to Entrez.efetch() function. I know it might be quite some work to keep it in sync with NCBI website but let's see what others say. Either way, my code works now with Bio.Entrez instead of the deprecated Bio.PubMed. I just had to quickly reinvent all the exceptions because some PubMed entries lack authors, abbreviated journal name, lack year, etc. ;-) Best regards, Martin Brad Chapman wrote: > Hi Martin; > Thanks for the e-mail. Let's tackle your up to date 1.51beta work. 
> >> When I upgrade to 1.51b I get slightly better results: >> >>>>> from Bio import Entrez, Medline, GenBank >>>>> Entrez.email = "mmokrejs at iresite.org" >>>>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="text") >>>>> _records = Entrez.read(_handle) > [ error ] > >>>>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="XML") >>>>> _records = Entrez.read(_handle) >>>>> _records > [ worked ] > >> Any clues what does that mean? TIA, > > In the first (and also third) example, you are retrieving the text > based result. The Entrez parser handles XML output, so it is > complaining because it's getting the raw text record instead of XML. > > Your second example is correct and worked; you specified the correct > XML retmode. You should be able to go with this. > > More generally, since Entrez returns many different file types, you > want to be sure and match up what you are getting with the parser > you are using. From biopython at maubp.freeserve.co.uk Sat Jul 18 11:40:36 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 18 Jul 2009 12:40:36 +0100 Subject: [Biopython] Moving from Bio.PubMed to Bio.Entrez In-Reply-To: <4A604B35.5010708@ribosome.natur.cuni.cz> References: <4A604B35.5010708@ribosome.natur.cuni.cz> Message-ID: <320fb6e00907180440i7a98bef9v8282bb1e2b6b8961@mail.gmail.com> On Fri, Jul 17, 2009 at 10:58 AM, Martin MOKREJ? wrote: > Hi Peter and others, > finally am moving my code from Bio.PubMed to Bio.Entrez. I think I have something > wrong with my installation biopython-1.49: > > ... >>>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="XML") >>>> _records = Entrez.read(_handle) > ... > IOError: [Errno 2] No such file or directory: 'nlmmedline_090101.dtd' The NCBI added some new DTD files in Jan 2009, there are not included with Biopython 1.49, but are in 1.51b which is why this error went away when you upgraded. Peter From p.j.a.cock at googlemail.com Sat Jul 18 11:48:30 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 18 Jul 2009 12:48:30 +0100 Subject: [Biopython] Moving from Bio.PubMed to Bio.Entrez In-Reply-To: <4A607CBB.106@ribosome.natur.cuni.cz> References: <4A604B35.5010708@ribosome.natur.cuni.cz> <20090717120129.GE46309@sobchak.mgh.harvard.edu> <4A607CBB.106@ribosome.natur.cuni.cz> Message-ID: <320fb6e00907180448j4f733b02xac6949048f310103@mail.gmail.com> On Fri, Jul 17, 2009 at 2:29 PM, Martin MOKREJ? wrote: > Hi Brad, > thanks for clarification. I somewhat overlooked in the tutorial that > Entrez.read() requires me to ask for XML rettype and that it parses > the XML result by itself into the dictionary structure. Still I think it should > check what values I have passed down to Entrez.efetch() function. This isn't going to be possible given that Entrez.read() just takes a file handle. This separation between getting the data and parsing it is deliberate. The handle you give to Entrez.read() might be to a file on disk (saved from a previous search) instead of an Internet handle to a live NCBI Entrez connection. > Either way, my code works now with Bio.Entrez instead of the > deprecated Bio.PubMed. Good. Note you didn't have to switch to using the XML from Entrez (e.g. with the Bio.Entrez.read() funciton). It sounds like you were using Bio.PubMed to access the data (in Medline format), and internally this used Bio.Medline to parse it. Therefore, it would have been less upheaval to use Bio.Entrez to fetch the data (as Medline files), and continue to use Bio.Medline to parse this. 
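To make the two routes concrete, a minimal sketch of both: asking for XML and letting Bio.Entrez build Python objects, versus asking for the Medline text format and parsing it with Bio.Medline (the PMID is the one from Martin's session; the email address is a placeholder):

    from Bio import Entrez, Medline

    Entrez.email = "your.name@example.org"   # placeholder

    # Route 1: XML output parsed by Bio.Entrez.read() into nested dicts/lists.
    handle = Entrez.efetch(db="pubmed", id="10851087", retmode="xml")
    xml_record = Entrez.read(handle)
    handle.close()

    # Route 2: Medline text output parsed by Bio.Medline into dictionaries
    # keyed by the two-letter Medline field codes (PMID, TI, AU, ...).
    handle = Entrez.efetch(db="pubmed", id="10851087", rettype="medline", retmode="text")
    for record in Medline.parse(handle):
        print(record.get("PMID"))
        print(record.get("TI", "(no title)"))
    handle.close()

Whichever route is used, the parser has to match the retmode/rettype actually requested, which is exactly the mismatch behind the ExpatError earlier in this thread.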
See the section "Parsing Medline records" in the Entrez chapter of the tutorial. Peter From lthiberiol at gmail.com Mon Jul 20 14:22:38 2009 From: lthiberiol at gmail.com (Luiz Thiberio Rangel) Date: Mon, 20 Jul 2009 11:22:38 -0300 Subject: [Biopython] BLAST footer Message-ID: -- Luiz Thib?rio Rangel From lthiberiol at gmail.com Mon Jul 20 14:29:34 2009 From: lthiberiol at gmail.com (Luiz Thiberio Rangel) Date: Mon, 20 Jul 2009 11:29:34 -0300 Subject: [Biopython] BLAST footer Message-ID: Hi folks, Is there any way to get a complete BLAST footer using NCBIXML.parse? The xml BLAST output generated by blastall doesn't have the complete footer information, but the txt output has. I'm running the BLAST using the xml output because this is the format compatible do BioPython's parser, but I need some information that it doesn't contains. If somebody know how I can calculate the footer information by the xml content would be useful too. thanks... -- Luiz Thib?rio Rangel From biopython at maubp.freeserve.co.uk Mon Jul 20 14:51:51 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 20 Jul 2009 15:51:51 +0100 Subject: [Biopython] BLAST footer In-Reply-To: References: Message-ID: <320fb6e00907200751s42f1387n64d95061a56a382b@mail.gmail.com> On Mon, Jul 20, 2009 at 3:29 PM, Luiz Thiberio Rangel wrote: > Hi folks, > > Is there any way to get a complete BLAST footer using NCBIXML.parse? > The xml BLAST output generated by blastall doesn't have the complete > footer information, but the txt output has. If the information isn't in the XML file, then the BLAST XML parser can't tell you it ;) > I'm running the BLAST using the xml output because this is the format > compatible do BioPython's parser, but I need some information that it > doesn't contains. ?If somebody know how I can calculate the footer > information by the xml content would be useful too. What information in particular do you need? Have you read the BLAST book (Ian Korf, Mark Yandell and Joseph Bedell)? They may explain where some of these numbers come from. Peter From iitlife2008 at gmail.com Mon Jul 20 21:08:21 2009 From: iitlife2008 at gmail.com (life happy) Date: Mon, 20 Jul 2009 14:08:21 -0700 Subject: [Biopython] Writing into a PDB file using PDBIO module Message-ID: <46a813870907201408j5d72e25eg9fffcf61331e4aaa@mail.gmail.com> Hi there, I am new to Biopython and have been working for a couple of weeks on Bio.PDB module.I would appreciate any clue or help in the following matter. I have some short ,closely related peptide sequences.I want to align these short peptides and send the aligned structures into a new PDB file.I used set_atoms class in Superimposer module to align the short peptides. I tried using PDBIO module, and send the aligned structures into a new PDB file. But when I see the output PDB file, I get the whole proteins not the short peptides. I like to have output PDB file with all the short peptides aligned to any particular short peptide. #This is the part of my code. B is list of atoms of peptides. C is a list with PDB ids of each peptide. 
from Bio.PDB.Superimposer import Superimposer fixed = B[0:1*(stop-start+1)] sup = Superimposer() for i in range(1,5) : moving = B[i*(stop-start+1):(i+1)*(stop-start+1)] sup.set_atoms(fixed, moving) print "RMS(%s file %s chain, %s file %s model) = %0.2f" % (C[0][0].split("'")[1],C[0][2].split("'")[1],C[i][0].split("'")[1],C[i][2].split("'")[1], sup.rms) print "Saving %s aligned structure as PDB file %s" % (C[0][2].split("'")[1], pdb_out_filename) io=Bio.PDB.PDBIO() io.set_structure(structure) io.save(pdb_out_filename) thanks in advance!! cheers, Kumar. From biopython at maubp.freeserve.co.uk Mon Jul 20 21:14:50 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 20 Jul 2009 22:14:50 +0100 Subject: [Biopython] Writing into a PDB file using PDBIO module In-Reply-To: <46a813870907201408j5d72e25eg9fffcf61331e4aaa@mail.gmail.com> References: <46a813870907201408j5d72e25eg9fffcf61331e4aaa@mail.gmail.com> Message-ID: <320fb6e00907201414j549e0eefyc556157cf432b327@mail.gmail.com> On Mon, Jul 20, 2009 at 10:08 PM, life happy wrote: > Hi there, > > I am new to Biopython and have been working for a couple of weeks on Bio.PDB > module.I would appreciate any clue or help in the following matter. > > I have some short ,closely related peptide sequences.I want to align these > short peptides and send the aligned structures into a new PDB file.I used > set_atoms class in Superimposer module to align the short peptides. I tried > using PDBIO module, and send the aligned structures into a new PDB file. But > when I see the output PDB file, I get the whole proteins not the short > peptides. I like to have output PDB file with all the short peptides aligned > to any particular short peptide. > > > #This is the part of my code. B is list of atoms of peptides. C is a list > with PDB ids of each peptide. > > from Bio.PDB.Superimposer import Superimposer > fixed = B[0:1*(stop-start+1)] > sup = Superimposer() > for i in range(1,5) : > moving = B[i*(stop-start+1):(i+1)*(stop-start+1)] > sup.set_atoms(fixed, moving) > print "RMS(%s file %s chain, %s file %s model) = %0.2f" % > (C[0][0].split("'")[1],C[0][2].split("'")[1],C[i][0].split("'")[1],C[i][2].split("'")[1], > sup.rms) > print "Saving %s aligned structure as PDB file %s" % > (C[0][2].split("'")[1], pdb_out_filename) > io=Bio.PDB.PDBIO() > io.set_structure(structure) > io.save(pdb_out_filename) > > thanks in advance!! Your example never defines the "structure" variable. I guess it should be pointing at something in the "C" data structure... Peter From biopython at maubp.freeserve.co.uk Mon Jul 20 22:15:54 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 20 Jul 2009 23:15:54 +0100 Subject: [Biopython] Writing into a PDB file using PDBIO module In-Reply-To: <46a813870907201436q716cc4fah2ba5ff1d28d7917a@mail.gmail.com> References: <46a813870907201408j5d72e25eg9fffcf61331e4aaa@mail.gmail.com> <320fb6e00907201414j549e0eefyc556157cf432b327@mail.gmail.com> <46a813870907201436q716cc4fah2ba5ff1d28d7917a@mail.gmail.com> Message-ID: <320fb6e00907201515o517c885ahb2c396efc4281f73@mail.gmail.com> On Mon, Jul 20, 2009 at 10:36 PM, life happy wrote: > No..this is only a piece of code. The structure object 'structure' was > already created. You example never seems to appy the transformation. Have you read this? http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ It is a worked example using Bio.PDB's Superimposer, and it saves the output. 
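Pulling the pieces of this exchange together, a minimal sketch of the set_atoms / apply / save pattern for two different structures (the file names, the chain ID and the choice of CA atoms are placeholders; the essential point is that set_atoms() only computes the rotation and translation, apply() is what actually moves the coordinates, and PDBIO then writes them out):

    from Bio.PDB import PDBParser, PDBIO, Superimposer

    parser = PDBParser()
    fixed_structure = parser.get_structure("fixed", "fixed.pdb")      # placeholder files
    moving_structure = parser.get_structure("moving", "moving.pdb")

    # Two atom lists of equal length, in matching order - here just the CA atoms
    # of chain A from each structure, truncated to the shorter list.
    fixed_atoms = [res["CA"] for res in fixed_structure[0]["A"] if "CA" in res]
    moving_atoms = [res["CA"] for res in moving_structure[0]["A"] if "CA" in res]
    n = min(len(fixed_atoms), len(moving_atoms))

    sup = Superimposer()
    sup.set_atoms(fixed_atoms[:n], moving_atoms[:n])   # computes rotation/translation only
    print(sup.rms)

    # apply() transforms the coordinates - here the whole moving structure,
    # so the two files can then be viewed superimposed in PyMOL.
    sup.apply(moving_structure.get_atoms())

    io = PDBIO()
    io.set_structure(moving_structure)
    io.save("moving_aligned.pdb")

To write out only the aligned peptides rather than the whole proteins, a Select subclass can be passed to io.save(), which is where the thread goes next.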
Peter From biopython at maubp.freeserve.co.uk Tue Jul 21 09:13:13 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Jul 2009 10:13:13 +0100 Subject: [Biopython] Writing into a PDB file using PDBIO module In-Reply-To: <46a813870907201559s44d04599s183ee118d47320cf@mail.gmail.com> References: <46a813870907201408j5d72e25eg9fffcf61331e4aaa@mail.gmail.com> <320fb6e00907201414j549e0eefyc556157cf432b327@mail.gmail.com> <46a813870907201436q716cc4fah2ba5ff1d28d7917a@mail.gmail.com> <320fb6e00907201515o517c885ahb2c396efc4281f73@mail.gmail.com> <46a813870907201559s44d04599s183ee118d47320cf@mail.gmail.com> Message-ID: <320fb6e00907210213p5df40d5dl583a962069ed1867@mail.gmail.com> Please keep the mailing list CC'd. On Mon, Jul 20, 2009 at 11:59 PM, life happy wrote: > Yes! I have read this. I'm glad you found that page (something I'd like to integrate into the main Biopython Tutorial at some point): http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > Which step applies the transformation?Isn't that > set_atoms function? I am able to print RMS value. I did not follow the > superimpose.apply(alt_model.get_atoms()) . As the name should suggest, superimpose.apply(...) actually applies the transformation. This is what you are missing. The set_atoms(...) just tells the code which atoms are going to be superimposed. > According to description in BioPDB faq pdf and > http://www.biopython.org/DIST/docs/api/Bio.PDB.Superimposer%27.Superimposer-class.html > set_atom does the transformation, right? If I am wrong, please correct me! That docstring is rather confusing, we should fix that. > Also,In which step are we sending the transformed co-ordinates into > the PDB file? These lines write out the PDB file for the whole structure: io=Bio.PDB.PDBIO() io.set_structure(structure) io.save(pdb_out_filename) > Also, the output PDB file has whole protein, I only want the short peptides > aligned(only the atom lists that I gave as input must be aligned, not the > whole protein of peptides). If you only want some of the protein written, then you should only give some of the structure to the PDB output code. Peter From iitlife2008 at gmail.com Tue Jul 21 20:35:58 2009 From: iitlife2008 at gmail.com (life happy) Date: Tue, 21 Jul 2009 13:35:58 -0700 Subject: [Biopython] Writing into a PDB file using PDBIO module In-Reply-To: <320fb6e00907210213p5df40d5dl583a962069ed1867@mail.gmail.com> References: <46a813870907201408j5d72e25eg9fffcf61331e4aaa@mail.gmail.com> <320fb6e00907201414j549e0eefyc556157cf432b327@mail.gmail.com> <46a813870907201436q716cc4fah2ba5ff1d28d7917a@mail.gmail.com> <320fb6e00907201515o517c885ahb2c396efc4281f73@mail.gmail.com> <46a813870907201559s44d04599s183ee118d47320cf@mail.gmail.com> <320fb6e00907210213p5df40d5dl583a962069ed1867@mail.gmail.com> Message-ID: <46a813870907211335s24cdcb45w8f511e280743a31f@mail.gmail.com> I have tried using io.save("pdb_out_filename", se.accept_model(alt_model)) I get error as , 'int' object has no attribute 'accept_model' If I use io.save("pdb_out_filename", se = accept_model(alt_model)) I get Error: name 'accept_model' is not defined In both the cases I created 'se' an object of Bio.PDB.Select() Do you have an example for printing out some part of PDB? On Tue, Jul 21, 2009 at 2:13 AM, Peter wrote: > Please keep the mailing list CC'd. > > On Mon, Jul 20, 2009 at 11:59 PM, life happy wrote: > > Yes! I have read this. 
> > I'm glad you found that page (something I'd like to integrate into the > main Biopython Tutorial at some point): > http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > > > Which step applies the transformation?Isn't that > > set_atoms function? I am able to print RMS value. I did not follow the > > superimpose.apply(alt_model.get_atoms()) . > > As the name should suggest, superimpose.apply(...) actually applies the > transformation. This is what you are missing. The set_atoms(...) just tells > the code which atoms are going to be superimposed. > > > According to description in BioPDB faq pdf and > > > http://www.biopython.org/DIST/docs/api/Bio.PDB.Superimposer%27.Superimposer-class.html > > set_atom does the transformation, right? If I am wrong, please correct > me! > > That docstring is rather confusing, we should fix that. > > > Also,In which step are we sending the transformed co-ordinates into > > the PDB file? > > These lines write out the PDB file for the whole structure: > > io=Bio.PDB.PDBIO() > io.set_structure(structure) > io.save(pdb_out_filename) > > > Also, the output PDB file has whole protein, I only want the short > peptides > > aligned(only the atom lists that I gave as input must be aligned, not the > > whole protein of peptides). > > If you only want some of the protein written, then you should only give > some of the structure to the PDB output code. > > Peter > From biopython at maubp.freeserve.co.uk Tue Jul 21 20:48:12 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Jul 2009 21:48:12 +0100 Subject: [Biopython] Writing into a PDB file using PDBIO module In-Reply-To: <46a813870907211335s24cdcb45w8f511e280743a31f@mail.gmail.com> References: <46a813870907201408j5d72e25eg9fffcf61331e4aaa@mail.gmail.com> <320fb6e00907201414j549e0eefyc556157cf432b327@mail.gmail.com> <46a813870907201436q716cc4fah2ba5ff1d28d7917a@mail.gmail.com> <320fb6e00907201515o517c885ahb2c396efc4281f73@mail.gmail.com> <46a813870907201559s44d04599s183ee118d47320cf@mail.gmail.com> <320fb6e00907210213p5df40d5dl583a962069ed1867@mail.gmail.com> <46a813870907211335s24cdcb45w8f511e280743a31f@mail.gmail.com> Message-ID: <320fb6e00907211348t33b7989bx129e6ba00adea398@mail.gmail.com> On Tue, Jul 21, 2009 at 9:35 PM, life happy wrote: > I have tried using?? io.save("pdb_out_filename", se.accept_model(alt_model)) > > ?????? I get error as , 'int' object has no attribute 'accept_model' If "se" really is an integer, that isn't surprising! > If I use? io.save("pdb_out_filename", se = accept_model(alt_model)) > > ????? I get Error: name 'accept_model' is not defined > > In both the cases I created 'se' an object of Bio.PDB.Select() > Do you have an example for printing out some part of PDB? The examples here may help: http://lists.open-bio.org/pipermail/biopython/2009-May/005173.html http://biopython.org/wiki/Remove_PDB_disordered_atoms http://lists.open-bio.org/pipermail/biopython/2009-March/005005.html See also pages 5 and 6 of the Bio.PDB documentation, the bit on the Select class: http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf Peter From biopython at maubp.freeserve.co.uk Thu Jul 23 10:20:11 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Jul 2009 11:20:11 +0100 Subject: [Biopython] Storing SeqRecord objects with annotation Message-ID: <320fb6e00907230320r49809329p620f3d1d4a39fb36@mail.gmail.com> Hi Andrea (and everyone else), This is a continuation of a discussion started on Bug 2883. 
Andrea had a problem with unpickling SeqRecord objects which were pickled using an older version of Biopython. She was using pickle to store complicated annotated SeqRecord objects on disk. See http://bugzilla.open-bio.org/show_bug.cgi?id=2883 for details. http://bugzilla.open-bio.org/show_bug.cgi?id=2883#c6 On Bug 2883 comment 6, Peter wrote: >> >> If your SeqRecord objects are all simply loaded from sequence files in >> the first place (and not modified), I would just keep the original file and >> re-parse it. >> >> If you have generated your own SeqRecords (or modified those from >> reading a file), then it makes sense to save them somehow. The choice >> of file format depends on the nature of annotation. The latest Biopython >> will now record the features in a GenBank file, making that a reasonable >> choice - but this does not cover per-letter-annotations. BioSQL has the >> same limitation. http://bugzilla.open-bio.org/show_bug.cgi?id=2883#c7 On Bug 2883 comment 7, Andrea wrote: > > yes, i'm testing some predictors. I do prediction and i compare the > "newly predicted seqrecords" with the "previously correct predicted > pickled seqrecords". Sorry - when you said "test code" on the Bug discussion, I though you meant you were testing the code - not that this was real work doing biological tests. > I've them (the correct ones) only in pickled seqrecord format. The > correctly predicted seqrecord, before prediction were in fasta format, > but after i parsed them (into seqrecord), i did prediction, and then > i pickled them (during prediction i add to seqrecord features and > annotations). If you have SeqFeatures and SeqRecords with simple string based annotation, then BioSQL should be fine. If you have SeqFeatures, then using GenBank output might be enough. There are no general fields in the GenBank format for arbitary annotation though. > Actually i don't use per-letter-annotation despite the fact it seems > interesting. But i didn't find any example in documentation (that > show how the dictionary is populated...) so i really don't know > how to use it.... even if i've, during prediction, a "per position > annotation". You are right that the SeqRecord chapter in the Tutorial doesn't explicitly cover populating the per-letter-annotation. I can fix that... However, the built in documentation covers this (e.g. the section on slicing a SeqRecord to get a sub-record): >>> from Bio.SeqRecord import SeqRecord >>> help(SeqRecord) ... You can read this online: http://www.biopython.org/DIST/docs/api/Bio.SeqRecord.SeqRecord-class.html > Also if the "per letter annotation" is not managed in the GenBank > format or in the BioSQL format (that i use a lot) i've to wait!! Currently the BioSQL schema doesn't have any explicit support for "per letter annotation", but we could encode it as a string (e.g. using XML or JSON) perhaps. This will require coordination with BioSQL, BioPerl etc - and thus far no one has expressed a strong need for this. The GenBank file format simply doesn't have an concept of "per letter annotation". The PFAM/Stockholm alignment format does (for the special case of a single character per letter of the sequence), and in sequencing the base quality is also held in some file formats. > I was thinking also to store the pssm information somewhere in the > seqrecord.... but this would be a very big change... (and also > manage to store it in BioSQL.... )... but it's better to stop > the discussion here or to move it... 
:-) You can record any object in the SeqRecord's annotation dictionary. However, saving the result to a file will be tricky - and it wouldn't work in BioSQL either. Peter From andrea at biodec.com Thu Jul 23 12:23:19 2009 From: andrea at biodec.com (Andrea) Date: Thu, 23 Jul 2009 14:23:19 +0200 Subject: [Biopython] Storing SeqRecord objects with annotation In-Reply-To: <320fb6e00907230320r49809329p620f3d1d4a39fb36@mail.gmail.com> References: <320fb6e00907230320r49809329p620f3d1d4a39fb36@mail.gmail.com> Message-ID: <4A685637.30806@biodec.com> An HTML attachment was scrubbed... URL: From biopython at maubp.freeserve.co.uk Thu Jul 23 12:54:47 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Jul 2009 13:54:47 +0100 Subject: [Biopython] Storing SeqRecord objects with annotation In-Reply-To: <4A685637.30806@biodec.com> References: <320fb6e00907230320r49809329p620f3d1d4a39fb36@mail.gmail.com> <4A685637.30806@biodec.com> Message-ID: <320fb6e00907230554o1665af8cpbc44328df49c70bf@mail.gmail.com> On Thu, Jul 23, 2009 at 1:23 PM, Andrea wrote: > > To be precise i'm really testing code, my code. My predictors are > implemented in python and to be shure that during time, bug fixes, > modifications.. i won't alter the prediction results, i build some > unittest to compare the results of the modified code with the results > of the old code. > >Peter wrote: >> If you have SeqFeatures and SeqRecords with simple string based >> annotation, then BioSQL should be fine. > > According to me, for unittesting purposes, using Biosql for storing data > is quite expensive? in term of code (or it seems so...), despite the fact, > actually, BioSQL is for sure fine for storing? my annotations and > features. > >> If you have SeqFeatures, then using GenBank output might be >> enough. There are no general fields in the GenBank format for >> arbitrary annotation though. > > Yes, i think that GenBank wont store my "peronal annotations" > (or i've to check it). > >>> Actually i don't use per-letter-annotation despite the fact it seems >>> interesting. But i didn't find any example in documentation (that >>> show how the dictionary is populated...) so i really don't know >>> how to use it.... even if i've, during prediction, a "per position >>> annotation". >> >> You are right that the SeqRecord chapter in the Tutorial doesn't >> explicitly cover populating the per-letter-annotation. I can fix that... The next version of the Tutorial will include a short example of this. >> However, the built in documentation covers this (e.g. the section >> on slicing a SeqRecord to get a sub-record): >> >> from Bio.SeqRecord import SeqRecord >> help(SeqRecord) >> ... >> >> You can read this online: >> http://www.biopython.org/DIST/docs/api/Bio.SeqRecord.SeqRecord-class.html > > Very interesting and easy to use. I can either use it for: > ? - storing per position string representing the "per position label" > of the prediction > ? - storing list of per position reliabilities (raliability of prediction) > ? - storing sequence variant > ? - storing possible aligned sequence > But it's a pity that this is not yet managed in BioSQL .... Some of those might be possible using SeqFeature objects, but I agree, the "per letter annotation" seems more suitable. > Also if the "per letter annotation" is not managed in the GenBank > format or in the BioSQL format (that i use a lot) i've to wait!! Some special cases of "per letter annotation" are supported for file output (PFAM/Stockholm alignments, FASTQ, and QUAL), but that's it. 
The idea of the SeqRecord "per letter annotation" was to be sufficiently general to cover these and other future uses. >> Currently the BioSQL schema doesn't have any explicit support >> for "per letter annotation", but we could encode it as a string >> (e.g. using XML or JSON) perhaps. This will require coordination >> with BioSQL, BioPerl etc - and thus far no one has expressed a >> strong need for this. >> >> ... >> >> You can record any object in the SeqRecord's annotation >> dictionary. However, saving the result to a file will be tricky - >> and it wouldn't work in BioSQL either. > > I could say that i will use it, if it will work in biosql... but until > there won't be the? possibility to store this information (BioSQL, > GenBank...) i think the "per letter annotation" will lose part of its > "charme".... Currently BioSQL just stores strings for general annotation. I think extending BioSQL to store simple per-letter-annotation would be possible - for example strings, integers, and floating point numbers. However, storing objects like a PSSM might not be possible as we would want this to be compatible between the other Bio* bindings. Peter From hlapp at gmx.net Thu Jul 23 13:01:29 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 23 Jul 2009 09:01:29 -0400 Subject: [Biopython] Storing SeqRecord objects with annotation In-Reply-To: <320fb6e00907230320r49809329p620f3d1d4a39fb36@mail.gmail.com> References: <320fb6e00907230320r49809329p620f3d1d4a39fb36@mail.gmail.com> Message-ID: <8E8262E4-FEA0-4839-957B-A5B4A56F8E49@gmx.net> On Jul 23, 2009, at 6:20 AM, Peter wrote: > Currently the BioSQL schema doesn't have any explicit support > for "per letter annotation" I haven't been following the thread closely and so may be missing what is really meant by this. If, however, you mean associating annotation to a specific letter (position) in the sequence, BioSQL does support this - you'd create a seqfeature with appropriate location, and attach the annotation to the seqfeature. Bioentry annotations are location-less, by comparison. > > The GenBank file format simply doesn't have an concept of "per > letter annotation" Since it does for in the above sense, I'm inclined to assume that you really do mean something different than the above? > [...] > You can record any object in the SeqRecord's annotation dictionary. > However, saving the result to a file will be tricky - and it wouldn't > work in BioSQL either. Note that that's not entirely true. If you have a textual serialization (such as XML) of your object, you *can* store it in bioentry_qualifier_value. This is what we do in BioPerl with a TagTree annotation object that supports a nested hierarchical annotation structure needed for lossless representation of some UniProt lines. Obviously, that won't allow you to query very well by individual elements of your custom annotation object. But you can build a custom index (e.g., using Lucene) that does that. 
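One concrete way to do what Hilmar describes from the Python side - a sketch only, using JSON rather than XML and not any agreed Bio* convention - is to flatten the custom annotation to a plain string before the record goes anywhere near the database, so that it travels as an ordinary key/value annotation:

    import json
    from Bio.Seq import Seq
    from Bio.SeqRecord import SeqRecord

    record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAAL"), id="example1")

    # An invented per-residue reliability score for a prediction.
    reliability = [9, 8, 8, 7, 9, 9, 6, 5, 9, 9, 8, 8, 7, 9, 9, 6, 5, 9, 9, 8, 8, 7, 9]

    # Serialise to a string so it can live in a plain annotation field
    # (and hence in a bioentry_qualifier_value row once loaded into BioSQL).
    record.annotations["reliability_json"] = json.dumps(reliability)

    # ...and restore it after reading the record back:
    restored = json.loads(record.annotations["reliability_json"])
    assert restored == reliability

As Hilmar notes, the trade-off is that the database cannot be queried by the individual elements without building a separate index.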
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Thu Jul 23 13:32:39 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Jul 2009 14:32:39 +0100 Subject: [Biopython] Storing SeqRecord objects with annotation In-Reply-To: <8E8262E4-FEA0-4839-957B-A5B4A56F8E49@gmx.net> References: <320fb6e00907230320r49809329p620f3d1d4a39fb36@mail.gmail.com> <8E8262E4-FEA0-4839-957B-A5B4A56F8E49@gmx.net> Message-ID: <320fb6e00907230632q730aa496g4a07c50d5860bd54@mail.gmail.com> Hi Hilmar! I've CC'd this to the BioSQL list. The start of the thread was here: http://lists.open-bio.org/pipermail/biopython/2009-July/005385.html On Thu, Jul 23, 2009 at 2:01 PM, Hilmar Lapp wrote: > > On Jul 23, 2009, at 6:20 AM, Peter wrote: > >> Currently the BioSQL schema doesn't have any explicit support >> for "per letter annotation" > > I haven't been following the thread closely and so may be missing what is > really meant by this. If, however, you mean associating annotation to a > specific letter (position) in the sequence, BioSQL does support this - you'd > create a seqfeature with appropriate location, and attach the annotation to > the seqfeature. > > Bioentry annotations are location-less, by comparison. By "per letter annotation" we mean essentially a list of annotation data, with one entry for each letter in the sequence. For example, a sequencing quality score (from a FASTQ file) where this is one integer per letter (i.e. per base pair). Or, a secondary structure prediction, encoded as one character per letter (which could apply to proteins and nucleotides). This sort of thing could be done by using on feature per letter, but it would be dreadfully inefficient for storing in the database. >> [...] >> You can record any object in the SeqRecord's annotation dictionary. >> However, saving the result to a file will be tricky - and it wouldn't >> work in BioSQL either. > > Note that that's not entirely true. If you have a textual serialization > (such as XML) of your object, you *can* store it in > bioentry_qualifier_value. This is what we do in BioPerl with a TagTree > annotation object that supports a nested hierarchical annotation > structure needed for lossless representation of some UniProt lines. This was what I mentioned earlier in the thread - using XML or JSON to turn the object into a long string. However, we really need the Bio* projects to agree on some standards here, rather than each project adding its own additions ad hoc (which will make interoperation much trickier). For example, I was unaware you (BioPerl) had already pressed ahead with this for the UniProt data - which rather proves my point. > Obviously, that won't allow you to query very well by individual > elements of your custom annotation object. But you can build a > custom index (e.g., using Lucene) that does that. Yes, doing searches on an XML/JSON encoded string is an issue. But right now we are probably more interested in just solving the persistence of more complex objects. 
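Just to make the string-encoding idea concrete, for the simple cases (lists of plain values) the round trip could be as little as this - a rough sketch, assuming Python 2.6's json module (older Pythons would need simplejson), with made-up quality values:

import json

# e.g. the per-letter-annotation dictionary from a SeqRecord
per_letter = {"phred_quality": [40, 40, 38, 35, 30, 30, 25, 20]}

as_text = json.dumps(per_letter)   # a plain string, which could sit in bioentry_qualifier_value
restored = json.loads(as_text)     # back to a dict of lists

The harder part is agreeing a convention for this with the other Bio* projects, which is really the point above.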
Peter From iitlife2008 at gmail.com Thu Jul 23 17:45:46 2009 From: iitlife2008 at gmail.com (life happy) Date: Thu, 23 Jul 2009 10:45:46 -0700 Subject: [Biopython] Writing into a PDB file using PDBIO module In-Reply-To: <320fb6e00907211348t33b7989bx129e6ba00adea398@mail.gmail.com> References: <46a813870907201408j5d72e25eg9fffcf61331e4aaa@mail.gmail.com> <320fb6e00907201414j549e0eefyc556157cf432b327@mail.gmail.com> <46a813870907201436q716cc4fah2ba5ff1d28d7917a@mail.gmail.com> <320fb6e00907201515o517c885ahb2c396efc4281f73@mail.gmail.com> <46a813870907201559s44d04599s183ee118d47320cf@mail.gmail.com> <320fb6e00907210213p5df40d5dl583a962069ed1867@mail.gmail.com> <46a813870907211335s24cdcb45w8f511e280743a31f@mail.gmail.com> <320fb6e00907211348t33b7989bx129e6ba00adea398@mail.gmail.com> Message-ID: <46a813870907231045g3491a296r4b29ec2278df23ec@mail.gmail.com> Hi Peter , Thanks, the links were helpful. But I am facing this problem. from Bio.PDB.PDBParser import PDBParser parser = PDBParser() filehandle = gzip.open(os.path.join("3dh4.pdb"), 'rb') structure = parser.get_structure( "3DH4", filehandle) filehandle.close() Select = Bio.PDB.Select() class GlySelect(Select): def accept_residue(self, residue): if residue.get_name()=='GLY': return 1 else: return 0 io=PDBIO() io.set_structure(structure) io.save('gly_only.pdb', GlySelect()) I use this code but I am getting the following error! File "aligned_matches_written_to_new_pdb_file.py", line 34, in class GlySelect(Select): TypeError: Error when calling the metaclass bases this constructor takes no arguments I have also tried the example in http://biopython.org/wiki/Remove_PDB_disordered_atoms.I get the same error message. What does this mean? Any remedy? Secondly, I didn't understand your answer to my question.."In which step are we sending the transformed co-ordinates into the PDB file? " The Superimposer is a black box for me. I give it atom lists, it gives me RMSD. But I want the aligned co-ordinates of the given atom lists, so that I can see the alignment in PyMol.I don't know how to extract aligned atom co-ordinates! Your example :- http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/python/protein_superposition/?fromGo=http%3A%2F%2Fgo.warwick.ac.uk%2Fpeter_cock%2Fpython%2Fprotein_superposition%2F does this job perfectly.It aptly prints out aligned models into a new PDB file.But I am working on two atom lists from two different proteins, unlike two models of same structure.Can you give me little push on how to deal superimposing two different structures? sincerely, Kumar. On Tue, Jul 21, 2009 at 1:48 PM, Peter wrote: > On Tue, Jul 21, 2009 at 9:35 PM, life happy wrote: > > I have tried using io.save("pdb_out_filename", > se.accept_model(alt_model)) > > > > I get error as , 'int' object has no attribute 'accept_model' > > If "se" really is an integer, that isn't surprising! > > > If I use io.save("pdb_out_filename", se = accept_model(alt_model)) > > > > I get Error: name 'accept_model' is not defined > > > > In both the cases I created 'se' an object of Bio.PDB.Select() > > Do you have an example for printing out some part of PDB? 
> > The examples here may help: > http://lists.open-bio.org/pipermail/biopython/2009-May/005173.html > http://biopython.org/wiki/Remove_PDB_disordered_atoms > http://lists.open-bio.org/pipermail/biopython/2009-March/005005.html > > See also pages 5 and 6 of the Bio.PDB documentation, the bit > on the Select class: > http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf > > Peter > From idoerg at gmail.com Thu Jul 23 18:09:03 2009 From: idoerg at gmail.com (Iddo Friedberg) Date: Thu, 23 Jul 2009 11:09:03 -0700 Subject: [Biopython] Writing into a PDB file using PDBIO module In-Reply-To: <46a813870907231045g3491a296r4b29ec2278df23ec@mail.gmail.com> References: <46a813870907201408j5d72e25eg9fffcf61331e4aaa@mail.gmail.com> <320fb6e00907201414j549e0eefyc556157cf432b327@mail.gmail.com> <46a813870907201436q716cc4fah2ba5ff1d28d7917a@mail.gmail.com> <320fb6e00907201515o517c885ahb2c396efc4281f73@mail.gmail.com> <46a813870907201559s44d04599s183ee118d47320cf@mail.gmail.com> <320fb6e00907210213p5df40d5dl583a962069ed1867@mail.gmail.com> <46a813870907211335s24cdcb45w8f511e280743a31f@mail.gmail.com> <320fb6e00907211348t33b7989bx129e6ba00adea398@mail.gmail.com> <46a813870907231045g3491a296r4b29ec2278df23ec@mail.gmail.com> Message-ID: Kumar: The following works. The main error you had was that you instantiated Select upon definition like so: Select = Bio.PDB.Select() Instead of: Select = Bio.PDB.Select Also, you used residue.get_name() instead of residue.get_resname() (there is no get_name() method). #!/usr/bin/python import Bio import os from Bio import PDB from Bio.PDB import PDBIO from Bio.PDB.PDBParser import PDBParser parser = PDBParser() mypdb="/home/idoerg/results/libbuilder/einat_blocks/pdb/1ZUG.pdb" filehandle = open(os.path.join(mypdb), 'rb') structure = parser.get_structure( "1ZUG", filehandle) filehandle.close() Select = Bio.PDB.Select class GlySelect(Select): def accept_residue(self, residue): # print dir(residue) if residue.get_resname()=='GLY': return 1 else: return 0 if __name__ == '__main__': io=PDBIO() io.set_structure(structure) io.save('gly_only.pdb', GlySelect()) On Thu, Jul 23, 2009 at 10:45 AM, life happy wrote: > Hi Peter , > > Thanks, the links were helpful. But I am facing this problem. > > from Bio.PDB.PDBParser import PDBParser > parser = PDBParser() > filehandle = gzip.open(os.path.join("3dh4.pdb"), 'rb') > structure = parser.get_structure( "3DH4", filehandle) > filehandle.close() > Select = Bio.PDB.Select() > class GlySelect(Select): > def accept_residue(self, residue): > if residue.get_name()=='GLY': > return 1 > else: > return 0 > io=PDBIO() > io.set_structure(structure) > io.save('gly_only.pdb', GlySelect()) > > I use this code but I am getting the following error! > > File "aligned_matches_written_to_new_pdb_file.py", line 34, in > class GlySelect(Select): > TypeError: Error when calling the metaclass bases > this constructor takes no arguments > > I have also tried the example in > http://biopython.org/wiki/Remove_PDB_disordered_atoms.I get the same error > message. What does this mean? Any remedy? > > Secondly, I didn't understand your answer to my question.."In which step > are > we sending the transformed co-ordinates into the PDB file? " The > Superimposer is a black box for me. I give it atom lists, it gives me RMSD. > But I want the aligned co-ordinates of the given atom lists, so that I can > see the alignment in PyMol.I don't know how to extract aligned atom > co-ordinates! 
> > Your example :- > > > http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/python/protein_superposition/?fromGo=http%3A%2F%2Fgo.warwick.ac.uk%2Fpeter_cock%2Fpython%2Fprotein_superposition%2F > > does this job perfectly.It aptly prints out aligned models into a new PDB > file.But I am working on two atom lists from two different proteins, unlike > two models of same structure.Can you give me little push on how to deal > superimposing two different structures? > > sincerely, > Kumar. > > > On Tue, Jul 21, 2009 at 1:48 PM, Peter >wrote: > > > On Tue, Jul 21, 2009 at 9:35 PM, life happy > wrote: > > > I have tried using io.save("pdb_out_filename", > > se.accept_model(alt_model)) > > > > > > I get error as , 'int' object has no attribute 'accept_model' > > > > If "se" really is an integer, that isn't surprising! > > > > > If I use io.save("pdb_out_filename", se = accept_model(alt_model)) > > > > > > I get Error: name 'accept_model' is not defined > > > > > > In both the cases I created 'se' an object of Bio.PDB.Select() > > > Do you have an example for printing out some part of PDB? > > > > The examples here may help: > > http://lists.open-bio.org/pipermail/biopython/2009-May/005173.html > > http://biopython.org/wiki/Remove_PDB_disordered_atoms > > http://lists.open-bio.org/pipermail/biopython/2009-March/005005.html > > > > See also pages 5 and 6 of the Bio.PDB documentation, the bit > > on the Select class: > > http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf > > > > Peter > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- Iddo Friedberg, Ph.D. Atkinson Hall, mail code 0446 University of California, San Diego 9500 Gilman Drive La Jolla, CA 92093-0446, USA T: +1 (858) 534-0570 http://iddo-friedberg.org From iitlife2008 at gmail.com Thu Jul 23 20:57:17 2009 From: iitlife2008 at gmail.com (life happy) Date: Thu, 23 Jul 2009 13:57:17 -0700 Subject: [Biopython] Creating and adding new models to a structure Message-ID: <46a813870907231357u47501af9jc96369f9f54faa37@mail.gmail.com> Hi Iddo Friedberg, Thanks for correcting me. Its working!! I have a new question. I like to store an atom list as a model in a structure.How can I do this? Kumar. On Thu, Jul 23, 2009 at 11:09 AM, Iddo Friedberg wrote: > Kumar: > > The following works. The main error you had was that you instantiated > Select upon definition like so: > Select = Bio.PDB.Select() > > Instead of: > > Select = Bio.PDB.Select > > Also, you used residue.get_name() instead of residue.get_resname() (there > is no get_name() method). > > #!/usr/bin/python > import Bio > import os > from Bio import PDB > from Bio.PDB import PDBIO > from Bio.PDB.PDBParser import PDBParser > parser = PDBParser() > mypdb="/home/idoerg/results/libbuilder/einat_blocks/pdb/1ZUG.pdb" > filehandle = open(os.path.join(mypdb), 'rb') > structure = parser.get_structure( "1ZUG", filehandle) > filehandle.close() > Select = Bio.PDB.Select > class GlySelect(Select): > def accept_residue(self, residue): > # print dir(residue) > if residue.get_resname()=='GLY': > return 1 > else: > return 0 > if __name__ == '__main__': > io=PDBIO() > io.set_structure(structure) > io.save('gly_only.pdb', GlySelect()) > > > > On Thu, Jul 23, 2009 at 10:45 AM, life happy wrote: > >> Hi Peter , >> >> Thanks, the links were helpful. But I am facing this problem. 
>> >> from Bio.PDB.PDBParser import PDBParser >> parser = PDBParser() >> filehandle = gzip.open(os.path.join("3dh4.pdb"), 'rb') >> structure = parser.get_structure( "3DH4", filehandle) >> filehandle.close() >> Select = Bio.PDB.Select() >> class GlySelect(Select): >> def accept_residue(self, residue): >> if residue.get_name()=='GLY': >> return 1 >> else: >> return 0 >> io=PDBIO() >> io.set_structure(structure) >> io.save('gly_only.pdb', GlySelect()) >> >> I use this code but I am getting the following error! >> >> File "aligned_matches_written_to_new_pdb_file.py", line 34, in >> class GlySelect(Select): >> TypeError: Error when calling the metaclass bases >> this constructor takes no arguments >> >> I have also tried the example in >> http://biopython.org/wiki/Remove_PDB_disordered_atoms.I get the same >> error >> message. What does this mean? Any remedy? >> >> Secondly, I didn't understand your answer to my question.."In which step >> are >> we sending the transformed co-ordinates into the PDB file? " The >> Superimposer is a black box for me. I give it atom lists, it gives me >> RMSD. >> But I want the aligned co-ordinates of the given atom lists, so that I can >> see the alignment in PyMol.I don't know how to extract aligned atom >> co-ordinates! >> >> Your example :- >> >> >> http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/python/protein_superposition/?fromGo=http%3A%2F%2Fgo.warwick.ac.uk%2Fpeter_cock%2Fpython%2Fprotein_superposition%2F >> >> does this job perfectly.It aptly prints out aligned models into a new PDB >> file.But I am working on two atom lists from two different proteins, >> unlike >> two models of same structure.Can you give me little push on how to deal >> superimposing two different structures? >> >> sincerely, >> Kumar. >> >> >> On Tue, Jul 21, 2009 at 1:48 PM, Peter > >wrote: >> >> > On Tue, Jul 21, 2009 at 9:35 PM, life happy >> wrote: >> > > I have tried using io.save("pdb_out_filename", >> > se.accept_model(alt_model)) >> > > >> > > I get error as , 'int' object has no attribute 'accept_model' >> > >> > If "se" really is an integer, that isn't surprising! >> > >> > > If I use io.save("pdb_out_filename", se = accept_model(alt_model)) >> > > >> > > I get Error: name 'accept_model' is not defined >> > > >> > > In both the cases I created 'se' an object of Bio.PDB.Select() >> > > Do you have an example for printing out some part of PDB? >> > >> > The examples here may help: >> > http://lists.open-bio.org/pipermail/biopython/2009-May/005173.html >> > http://biopython.org/wiki/Remove_PDB_disordered_atoms >> > http://lists.open-bio.org/pipermail/biopython/2009-March/005005.html >> > >> > See also pages 5 and 6 of the Bio.PDB documentation, the bit >> > on the Select class: >> > http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf >> > >> > Peter >> > >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> > > > > -- > Iddo Friedberg, Ph.D. 
> Atkinson Hall, mail code 0446 > University of California, San Diego > 9500 Gilman Drive > La Jolla, CA 92093-0446, USA > T: +1 (858) 534-0570 > http://iddo-friedberg.org > > From biopython.chen at gmail.com Fri Jul 24 02:28:21 2009 From: biopython.chen at gmail.com (chen Ku) Date: Thu, 23 Jul 2009 19:28:21 -0700 Subject: [Biopython] Biopython Digest, Vol 79, Issue 15 In-Reply-To: References: Message-ID: <4c2163890907231928x5429929sd82bddcecdd7a26c@mail.gmail.com> Hi I got successed in downloading all the pdb file > by biopython module. But now I want to fectch an output file where my > keyword word is ('carbonic andydrade') > second criteria is >=2 chains > third criteria is homology =30% > > Can you please write me few lines of codes to do it as I have some problem > in doing this.Please suggest me step by step if possible as I am struggling > for few days in this . > > I will be waiting for your kind help. >regards chen On Tue, Jul 21, 2009 at 9:00 AM, wrote: > Send Biopython mailing list submissions to > biopython at lists.open-bio.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://lists.open-bio.org/mailman/listinfo/biopython > or, via email, send a message with subject or body 'help' to > biopython-request at lists.open-bio.org > > You can reach the person managing the list at > biopython-owner at lists.open-bio.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of Biopython digest..." > > > Today's Topics: > > 1. Writing into a PDB file using PDBIO module (life happy) > 2. Re: Writing into a PDB file using PDBIO module (Peter) > 3. Re: Writing into a PDB file using PDBIO module (Peter) > 4. Re: Writing into a PDB file using PDBIO module (Peter) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Mon, 20 Jul 2009 14:08:21 -0700 > From: life happy > Subject: [Biopython] Writing into a PDB file using PDBIO module > To: biopython at lists.open-bio.org > Message-ID: > <46a813870907201408j5d72e25eg9fffcf61331e4aaa at mail.gmail.com> > Content-Type: text/plain; charset=ISO-8859-1 > > Hi there, > > I am new to Biopython and have been working for a couple of weeks on > Bio.PDB > module.I would appreciate any clue or help in the following matter. > > I have some short ,closely related peptide sequences.I want to align these > short peptides and send the aligned structures into a new PDB file.I used > set_atoms class in Superimposer module to align the short peptides. I tried > using PDBIO module, and send the aligned structures into a new PDB file. > But > when I see the output PDB file, I get the whole proteins not the short > peptides. I like to have output PDB file with all the short peptides > aligned > to any particular short peptide. > > > #This is the part of my code. B is list of atoms of peptides. C is a list > with PDB ids of each peptide. > > from Bio.PDB.Superimposer import Superimposer > fixed = B[0:1*(stop-start+1)] > sup = Superimposer() > for i in range(1,5) : > moving = B[i*(stop-start+1):(i+1)*(stop-start+1)] > sup.set_atoms(fixed, moving) > print "RMS(%s file %s chain, %s file %s model) = %0.2f" % > > (C[0][0].split("'")[1],C[0][2].split("'")[1],C[i][0].split("'")[1],C[i][2].split("'")[1], > sup.rms) > print "Saving %s aligned structure as PDB file %s" % > (C[0][2].split("'")[1], pdb_out_filename) > io=Bio.PDB.PDBIO() > io.set_structure(structure) > io.save(pdb_out_filename) > > thanks in advance!! > > cheers, > Kumar. 
> > > ------------------------------ > > Message: 2 > Date: Mon, 20 Jul 2009 22:14:50 +0100 > From: Peter > Subject: Re: [Biopython] Writing into a PDB file using PDBIO module > To: life happy > Cc: biopython at lists.open-bio.org > Message-ID: > <320fb6e00907201414j549e0eefyc556157cf432b327 at mail.gmail.com> > Content-Type: text/plain; charset=ISO-8859-1 > > On Mon, Jul 20, 2009 at 10:08 PM, life happy wrote: > > Hi there, > > > > I am new to Biopython and have been working for a couple of weeks on > Bio.PDB > > module.I would appreciate any clue or help in the following matter. > > > > I have some short ,closely related peptide sequences.I want to align > these > > short peptides and send the aligned structures into a new PDB file.I used > > set_atoms class in Superimposer module to align the short peptides. I > tried > > using PDBIO module, and send the aligned structures into a new PDB file. > But > > when I see the output PDB file, I get the whole proteins not the short > > peptides. I like to have output PDB file with all the short peptides > aligned > > to any particular short peptide. > > > > > > #This is the part of my code. B is list of atoms of peptides. C is a list > > with PDB ids of each peptide. > > > > from Bio.PDB.Superimposer import Superimposer > > fixed = B[0:1*(stop-start+1)] > > sup = Superimposer() > > for i in range(1,5) : > > moving = B[i*(stop-start+1):(i+1)*(stop-start+1)] > > sup.set_atoms(fixed, moving) > > print "RMS(%s file %s chain, %s file %s model) = %0.2f" % > > > (C[0][0].split("'")[1],C[0][2].split("'")[1],C[i][0].split("'")[1],C[i][2].split("'")[1], > > sup.rms) > > print "Saving %s aligned structure as PDB file %s" % > > (C[0][2].split("'")[1], pdb_out_filename) > > io=Bio.PDB.PDBIO() > > io.set_structure(structure) > > io.save(pdb_out_filename) > > > > thanks in advance!! > > Your example never defines the "structure" variable. I guess it should > be pointing at something in the "C" data structure... > > Peter > > > ------------------------------ > > Message: 3 > Date: Mon, 20 Jul 2009 23:15:54 +0100 > From: Peter > Subject: Re: [Biopython] Writing into a PDB file using PDBIO module > To: life happy > Cc: biopython at biopython.org > Message-ID: > <320fb6e00907201515o517c885ahb2c396efc4281f73 at mail.gmail.com> > Content-Type: text/plain; charset=ISO-8859-1 > > On Mon, Jul 20, 2009 at 10:36 PM, life happy wrote: > > No..this is only a piece of code. The structure object 'structure' was > > already created. > > You example never seems to appy the transformation. Have you read this? > http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > > It is a worked example using Bio.PDB's Superimposer, and it saves the > output. > > Peter > > > ------------------------------ > > Message: 4 > Date: Tue, 21 Jul 2009 10:13:13 +0100 > From: Peter > Subject: Re: [Biopython] Writing into a PDB file using PDBIO module > To: life happy > Cc: Biopython Mailing List > Message-ID: > <320fb6e00907210213p5df40d5dl583a962069ed1867 at mail.gmail.com> > Content-Type: text/plain; charset=ISO-8859-1 > > Please keep the mailing list CC'd. > > On Mon, Jul 20, 2009 at 11:59 PM, life happy wrote: > > Yes! I have read this. > > I'm glad you found that page (something I'd like to integrate into the > main Biopython Tutorial at some point): > http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > > > Which step applies the transformation?Isn't that > > set_atoms function? I am able to print RMS value. 
I did not follow the > > superimpose.apply(alt_model.get_atoms()) . > > As the name should suggest, superimpose.apply(...) actually applies the > transformation. This is what you are missing. The set_atoms(...) just tells > the code which atoms are going to be superimposed. > > > According to description in BioPDB faq pdf and > > > http://www.biopython.org/DIST/docs/api/Bio.PDB.Superimposer%27.Superimposer-class.html > > set_atom does the transformation, right? If I am wrong, please correct > me! > > That docstring is rather confusing, we should fix that. > > > Also,In which step are we sending the transformed co-ordinates into > > the PDB file? > > These lines write out the PDB file for the whole structure: > > io=Bio.PDB.PDBIO() > io.set_structure(structure) > io.save(pdb_out_filename) > > > Also, the output PDB file has whole protein, I only want the short > peptides > > aligned(only the atom lists that I gave as input must be aligned, not the > > whole protein of peptides). > > If you only want some of the protein written, then you should only give > some of the structure to the PDB output code. > > Peter > > > ------------------------------ > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > > End of Biopython Digest, Vol 79, Issue 15 > ***************************************** > From jblanca at btc.upv.es Fri Jul 24 08:53:15 2009 From: jblanca at btc.upv.es (Jose Blanca) Date: Fri, 24 Jul 2009 10:53:15 +0200 Subject: [Biopython] next-gen sequencing software Message-ID: <200907241053.15954.jblanca@btc.upv.es> Hi: We have been writting some code that we think that could be interesting to the Biopython community. Right now we're mainly interested in the new sequencing technologies, specially in: - cleaning of the raw reads provided by the sequencers. - parsing of the assembler results (ace, caf and bowtie map files) - SNP detecion and mining. - sequence annotation. We're writing some software to deal with that problems. Currently the software is not finished but it starts to be useful. Everything is written in python. We have used Biopython for some things, but for some others we have used a slighty different approach. If the Biopython developers think that some of our ideas could be of any use we would be willing to incorporate it into Biopython. If you want to take a look just go to: http://bioinf.comav.upv.es/svn/biolib/biolib/src/ Recently we have finished the cleaning infrastructure. We haven't yet pipelines defined for all the new sequencing technologies but we have created a pipeline system very easy to modify. With just a dozen of lines of code a new pipeline suited to a new sequencing technology can be created. There's also an script that runs those pipelines (run_cleannig_pipeline.py). We have also created a set of scripts that create statistics that ease the quality evaluation of the cleaning process. Regarding the SNPs we can get them using ace and caf files and we're finishing the parsing of the bowtie map files. All these files are transformed into an iterator of contig objects. There is also funcionallity to get SNPs and statistics from these contig objects. We're willing to get comments, suggestions, criticisms. Best regards, -- Jose M. 
Blanca Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) P.D. We're using this functionallity in a computer cluster, so everything is parallelized. From biopython at maubp.freeserve.co.uk Fri Jul 24 09:38:43 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Jul 2009 10:38:43 +0100 Subject: [Biopython] Searching a local copy of the PDB Message-ID: <320fb6e00907240238t587a7fb8l8492fd036d3ed66@mail.gmail.com> Hi Chen, When replying to a digest email, it is a good idea to change the subject line to something specific. On Fri, Jul 24, 2009 at 3:28 AM, chen Ku wrote: > Hi >? ? ? ? ?I got successed in downloading all the pdb file by biopython module. Good. > But now I want to fectch an output file where my > keyword word is ('carbonic andydrade') >?second criteria is >=2 chains > third criteria is homology =30% > > Can you please write me few lines of codes to do it as I have some problem > in doing this.Please suggest me step by step if possible as I am struggling > for few days in this . If I understand you correctly, you have download all the PDB files to your computer (as plain text PDB format data). And now you want to search them? Are you using Unix or Windows? There are several Unix command line tools like grep, which are very good at searching plain text files. That might be a good way to look for PDB files containing the words 'carbonic andydrade'. I'm not sure what the fastest way to count the chains in a PDB file would be. If you only find a few hundred PDB files with 'carbonic andydrade', it might be OK just to parse them with Bio.PDB and count the chains that way. Finally, your third criteria is homology =30% - but homology to what? And how are you measuring homology? I guess you mean 30% sequence identity to a reference carbonic andydrade protein? If what you want to do is take a known carbonic andydrade protein, and search the PDB for similar sequences then there are better ways to do this. I would run BLASTP against the PDB sequences. You can do this at the NCBI via their webpages, or from within Biopython using the Bio.Blast.NCBIWWW.qblast function. Peter From biopython at maubp.freeserve.co.uk Fri Jul 24 09:50:08 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Jul 2009 10:50:08 +0100 Subject: [Biopython] next-gen sequencing software In-Reply-To: <200907241053.15954.jblanca@btc.upv.es> References: <200907241053.15954.jblanca@btc.upv.es> Message-ID: <320fb6e00907240250h128654e4w2e2845255392d205@mail.gmail.com> On Fri, Jul 24, 2009 at 9:53 AM, Jose Blanca wrote: > Hi: > > We have been writting some code that we think that could be interesting to the > Biopython community. ... Currently the software is not finished but it starts to > be useful. Everything is written in python. We have used Biopython for some > things, but for some others we have used a slighty different approach. If the > Biopython developers think that some of our ideas could be of any use we > would be willing to incorporate it into Biopython. > If you want to take a look just go to: > http://bioinf.comav.upv.es/svn/biolib/biolib/src/ Cool. I already knew you had some interested ideas for contig classes. I see you also have a parser for EMBOSS water output - where you actually collect some useful information from the header, which the Biopython parser ignores. 
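(For anyone following along, the existing Biopython support just gives back the pairwise alignments themselves, something like this untested sketch with a made-up filename:

from Bio import AlignIO

# water typically writes one or more pairwise alignments per file
for alignment in AlignIO.parse(open("water_output.txt"), "emboss"):
    print alignment.get_alignment_length()

so the extra information in the water header is simply not kept on the alignment object.)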
This was a simplification because the current Biopython alignment object doesn't have a proper annotation system. Work on improving the Biopython alignment object and introducing a contig object is something I would like to see for the next release (once Biopython 1.51 is out). I'm sure there is other stuff in your code that would also be very useful. If you want to contribute code to Biopython is will have to be under our MIT style license, but in the meantime maybe you should stick an an explicit license on your code? Peter From darnells at dnastar.com Fri Jul 24 14:15:09 2009 From: darnells at dnastar.com (Steve Darnell) Date: Fri, 24 Jul 2009 09:15:09 -0500 Subject: [Biopython] Searching a local copy of the PDB In-Reply-To: <320fb6e00907240238t587a7fb8l8492fd036d3ed66@mail.gmail.com> References: <320fb6e00907240238t587a7fb8l8492fd036d3ed66@mail.gmail.com> Message-ID: Greetings, You could also do this using the PDB Advanced Search option. Although not a scriptable solution, it's perfect for a few manual queries. Here are my suggested parameters: Match **all** of the following conditions Subquery 1: Keyword: Advanced, Keywords: **carbonic andydrade** (did you mean anhydrase?), Search Scope: **Full Text** Subquery 2: Sequence Features: Number of Chains, Between: **2** and **** **** Remove Similar Sequences at **30%** Identity Query comes back with 12 structures and 25 unreleased structures for "carbonic anhydrase." No results for "andydrade." Regards, Steve Darnell -----Original Message----- From: biopython-bounces at lists.open-bio.org [mailto:biopython-bounces at lists.open-bio.org] On Behalf Of Peter Sent: Friday, July 24, 2009 4:39 AM To: chen Ku Cc: biopython at lists.open-bio.org Subject: [Biopython] Searching a local copy of the PDB Hi Chen, When replying to a digest email, it is a good idea to change the subject line to something specific. On Fri, Jul 24, 2009 at 3:28 AM, chen Ku wrote: > Hi >? ? ? ? ?I got successed in downloading all the pdb file by biopython module. Good. > But now I want to fectch an output file where my keyword word is >('carbonic andydrade') >?second criteria is >=2 chains > third criteria is homology =30% > > Can you please write me few lines of codes to do it as I have some > problem in doing this.Please suggest me step by step if possible as I > am struggling for few days in this . If I understand you correctly, you have download all the PDB files to your computer (as plain text PDB format data). And now you want to search them? Are you using Unix or Windows? There are several Unix command line tools like grep, which are very good at searching plain text files. That might be a good way to look for PDB files containing the words 'carbonic andydrade'. I'm not sure what the fastest way to count the chains in a PDB file would be. If you only find a few hundred PDB files with 'carbonic andydrade', it might be OK just to parse them with Bio.PDB and count the chains that way. Finally, your third criteria is homology =30% - but homology to what? And how are you measuring homology? I guess you mean 30% sequence identity to a reference carbonic andydrade protein? If what you want to do is take a known carbonic andydrade protein, and search the PDB for similar sequences then there are better ways to do this. I would run BLASTP against the PDB sequences. You can do this at the NCBI via their webpages, or from within Biopython using the Bio.Blast.NCBIWWW.qblast function. 
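For example, something like this rough sketch (not tested, and P29288 here is just a stand-in for whatever reference protein you pick - the query can be a raw sequence, a FASTA string, or an identifier):

from Bio.Blast import NCBIWWW, NCBIXML

result_handle = NCBIWWW.qblast("blastp", "pdb", "P29288")
blast_record = NCBIXML.read(result_handle)
for alignment in blast_record.alignments[:10]:
    print alignment.title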
Peter _______________________________________________ Biopython mailing list - Biopython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From jkhilmer at gmail.com Fri Jul 24 15:19:27 2009 From: jkhilmer at gmail.com (Jonathan Hilmer) Date: Fri, 24 Jul 2009 09:19:27 -0600 Subject: [Biopython] Searching a local copy of the PDB In-Reply-To: References: <320fb6e00907240238t587a7fb8l8492fd036d3ed66@mail.gmail.com> Message-ID: <81277ce10907240819j3710c35j2d336209ba474451@mail.gmail.com> Just for the record, a few years back I ran some Biopython-based code to check structural statistics of a local copy of the entire PDB. I was parsing to the level of each alpha-carbon, but it was still fast enough to be a very viable way to run the calculations. Clearly in this case it's not the best solution to use Bio.PDB, but if you have a local mirror then there's no reason you couldn't do it via structure-parsing. Also, the PDB Advanced search should be scriptable, just not in a convenient way. The Python module ClientForm should handle it. Jonathan On Fri, Jul 24, 2009 at 8:15 AM, Steve Darnell wrote: > Greetings, > > You could also do this using the PDB Advanced Search option. ?Although not a scriptable solution, it's perfect for a few manual queries. ?Here are my suggested parameters: > > Match **all** of the following conditions > > Subquery 1: Keyword: Advanced, Keywords: **carbonic andydrade** (did you mean anhydrase?), Search Scope: **Full Text** > Subquery 2: Sequence Features: Number of Chains, Between: **2** and **** > > **** Remove Similar Sequences at **30%** Identity > > Query comes back with 12 structures and 25 unreleased structures for "carbonic anhydrase." ?No results for "andydrade." > > Regards, > Steve Darnell > > > -----Original Message----- > From: biopython-bounces at lists.open-bio.org [mailto:biopython-bounces at lists.open-bio.org] On Behalf Of Peter > Sent: Friday, July 24, 2009 4:39 AM > To: chen Ku > Cc: biopython at lists.open-bio.org > Subject: [Biopython] Searching a local copy of the PDB > > Hi Chen, > > When replying to a digest email, it is a good idea to change the subject line to something specific. > > On Fri, Jul 24, 2009 at 3:28 AM, chen Ku wrote: >> Hi >>? ? ? ? ?I got successed in downloading all the pdb file by biopython module. > > Good. > >> But now I want to fectch an output file where my ?keyword word is >>('carbonic andydrade') >>?second criteria is >=2 chains >> third criteria is homology =30% >> >> Can you please write me few lines of codes to do it as I have some >> problem in doing this.Please suggest me step by step if possible as I >> am struggling for few days in this . > > If I understand you correctly, you have download all the PDB files to your computer (as plain text PDB format data). And now you want to search them? > > Are you using Unix or Windows? There are several Unix command line tools like grep, which are very good at searching plain text files. That might be a good way to look for PDB files containing the words 'carbonic andydrade'. > > I'm not sure what the fastest way to count the chains in a PDB file would be. If you only find a few hundred PDB files with 'carbonic andydrade', it might be OK just to parse them with Bio.PDB and count the chains that way. > > Finally, your third criteria is homology =30% - but homology to what? > And how are you measuring homology? I guess you mean 30% sequence identity to a reference carbonic andydrade protein? 
> > If what you want to do is take a known carbonic andydrade protein, and search the PDB for similar sequences then there are better ways to do this. I would run BLASTP against the PDB sequences. > You can do this at the NCBI via their webpages, or from within Biopython using the Bio.Blast.NCBIWWW.qblast function. > > Peter > > _______________________________________________ > Biopython mailing list ?- ?Biopython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython > > _______________________________________________ > Biopython mailing list ?- ?Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From matzke at berkeley.edu Wed Jul 29 04:38:44 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Tue, 28 Jul 2009 21:38:44 -0700 Subject: [Biopython] PDBid to Uniprot ID? In-Reply-To: <320fb6e00906250204m7268549eqf37d41f76313a589@mail.gmail.com> References: <4A42A2D4.8060400@berkeley.edu> <320fb6e00906250204m7268549eqf37d41f76313a589@mail.gmail.com> Message-ID: <4A6FD254.2070803@berkeley.edu> Peter wrote: > On Wed, Jun 24, 2009 at 11:04 PM, Nick Matzke wrote: >> Hi all, >> >> I have succeeded in using the BioPython PDB parser to download a PDB file, >> parse the structure, etc. But I am wondering if there is an easy way to retrieve >> the UniProt ID that corresponds to the structure? >> >> I.e., if the structure is 1QFC... >> http://www.pdb.org/pdb/explore/explore.do?structureId=1QFC >> >> ...the Uniprot ID is (click "Sequence" above): P29288 >> http://www.pdb.org/pdb/explore/remediatedSequence.do?structureId=1QFC >> >> I don't see a way to get this out of the current parser, so I guess I will schlep >> through the downloaded structure file for "UNP P29288" unless someone >> has a better idea. > > Well, I would at least look for a line starting "DBREF" and then search that > for the reference. > > Right now the PDB header parsing is minimal, and even that was something > of an after thought - Eric has been looking at this stuff recently, but I image > he will be busy with his GSoC work at the moment. This could be handled > as another tiny incremental addition to parse_pdb_header.py - right now I > don't think it looks at the "DBREF" lines. > > Peter I forgot to post to the list, I wrote a function for parsing the DBREF line a couple of weeks ago, it should be pretty comprehensive as it uses the official specifications for DBREF lines. Here's the code to save other people re-inventing the wheel. Free to use/modify/include in a biopython upgrade whatever... =================== def parse_DBREF_line(line): """ Following format here: http://www.wwpdb.org/documentation/format23/sect3.html Record Format COLUMNS DATA TYPE FIELD DEFINITION ---------------------------------------------------------------- 1 - 6 Record name "DBREF " 8 - 11 IDcode idCode ID code of this entry. 13 Character chainID Chain identifier. 15 - 18 Integer seqBegin Initial sequence number of the PDB sequence segment. 19 AChar insertBegin Initial insertion code of the PDB sequence segment. 21 - 24 Integer seqEnd Ending sequence number of the PDB sequence segment. 25 AChar insertEnd Ending insertion code of the PDB sequence segment. 27 - 32 LString database Sequence database name. 34 - 41 LString dbAccession Sequence database accession code. 43 - 54 LString dbIdCode Sequence database identification code. 56 - 60 Integer dbseqBegin Initial sequence number of the database seqment. 61 AChar idbnsBeg Insertion code of initial residue of the segment, if PDB is the reference. 
63 - 67 Integer dbseqEnd Ending sequence number of the database segment. 68 AChar dbinsEnd Insertion code of the ending residue of the segment, if PDB is the reference. Database name database (code in columns 27 - 32) ---------------------------------------------------------- GenBank GB Protein Data Bank PDB Protein Identification Resource PIR SWISS-PROT SWS TREMBL TREMBL UNIPROT UNP Test line: line=" 1QFC A 1 306 UNP P29288 PPA5_RAT 22 327 " """ data_type_list = ['Record name', 'IDcode', 'Character', 'Integer', 'AChar', 'Integer', 'AChar', 'LString', 'LString', 'LString', 'Integer', 'AChar', 'Integer', 'AChar'] field_list = ['"DBREF "', 'idCode', 'chainID', 'seqBegin', 'insertBegin', 'seqEnd', 'insertEnd', 'database', 'dbAccession', 'dbIdCode', 'dbseqBegin', 'idbnsBeg', 'dbseqEnd', 'dbinsEnd'] def_list = ['', 'ID code of this entry.', 'Chain identifier.', 'Initial sequence number of the PDB sequence segment.', 'Initial insertion code of the PDB sequence segment.', 'Ending sequence number of the PDB sequence segment.', 'Ending insertion code of the PDB sequence segment.', 'Sequence database name.', 'Sequence database accession code.', 'Sequence database identification code.', 'Initial sequence number of the database seqment.', 'Insertion code of initial residue of the segment, if PDB is the reference.', 'Ending sequence number of the database segment.', 'Insertion code of the ending residue of the segment, if PDB is the reference.'] charpos_list = [(1,6), (8,11), (13,13), (15,18), (19,19), (21,24), (25,25), (27,32), (34,41), (43,54), (56,60), (61,61), (63,67), (68,68)] data_list = ['', '', '', '', '', '', '', '', '', '', '', '', '', ''] # Make empty dictionary dbref_dict = {} for index in range(0,len(field_list)): dbref_dict[ field_list[index] ] = [ data_type_list[index], charpos_list[index], data_list[index], def_list[index] ] for field in field_list: #print field #print dbref_dict[field][1] startpos = int(dbref_dict[field][1][0]) endpos = int(dbref_dict[field][1][1]) dbref_dict[field][2] = get_char_range(line, startpos, endpos) return dbref_dict =================== > -- ==================================================== Nicholas J. Matzke Ph.D. Candidate, Graduate Student Researcher Huelsenbeck Lab Center for Theoretical Evolutionary Genomics 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab websites: http://ib.berkeley.edu/people/lab_detail.php?lab=54 http://fisher.berkeley.edu/cteg/hlab.html Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html Lab phone: 510-643-6299 Dept. fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ----------------------------------------------------- "[W]hen people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989. 
http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm ==================================================== From pzs at dcs.gla.ac.uk Wed Jul 29 10:56:11 2009 From: pzs at dcs.gla.ac.uk (Peter Saffrey) Date: Wed, 29 Jul 2009 11:56:11 +0100 Subject: [Biopython] Restriction enzyme digestion gels Message-ID: <4A702ACB.2080204@dcs.gla.ac.uk> I want to run an "in-silico" gel, where I take a nucleotide sequence, cut it with an enzyme (probably using a tool like restrictionmapper): http://www.restrictionmapper.org/ and then produce a picture of what the gel should look like, with bands where the cuts have been made. I was wondering whether biopython has any tools for doing this. Otherwise, I'll hack something up in matplotlib. Cheers, Peter From biopython at maubp.freeserve.co.uk Wed Jul 29 11:35:27 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 29 Jul 2009 12:35:27 +0100 Subject: [Biopython] Restriction enzyme digestion gels In-Reply-To: <4A702ACB.2080204@dcs.gla.ac.uk> References: <4A702ACB.2080204@dcs.gla.ac.uk> Message-ID: <320fb6e00907290435i32a15382l48206bdbedbd7bf6@mail.gmail.com> On Wed, Jul 29, 2009 at 11:56 AM, Peter Saffrey wrote: > I want to run an "in-silico" gel, where I take a nucleotide sequence, cut it > with an enzyme (probably using a tool like restrictionmapper): > > http://www.restrictionmapper.org/ > > and then produce a picture of what the gel should look like, with bands > where the cuts have been made. I was wondering whether biopython has any > tools for doing this. Otherwise, I'll hack something up in matplotlib. Biopython has a restriction digest module which should be able to take care of the first step for you at least: http://biopython.org/DIST/docs/cookbook/Restriction.html There is nothing built into Biopython's graphics module for generating fake gel images - so using matplot seems worth trying. However, I would suggest you talk to Jose Blanca about his work first: http://lists.open-bio.org/pipermail/biopython-dev/2009-March/005472.html http://bioinf.comav.upv.es/svn/gelify/gelifyfsa/ Peter From carlos.borroto at gmail.com Thu Jul 30 17:18:56 2009 From: carlos.borroto at gmail.com (Carlos Javier Borroto) Date: Thu, 30 Jul 2009 13:18:56 -0400 Subject: [Biopython] How to efetch Unigene records? Is it possible at all? Message-ID: <65d4b7fc0907301018v696c2045j791fb0f3a1e00fd6@mail.gmail.com> Hi, I'm very new to Biopython and to Python in general, has a little knowledge of Perl and some previous work with Bioperl. I have the task to from a list of human genes of interest, grab their protein counter parts in the database to do some additional work. In the beginning I was thinking that using Bio.Entrez module and Bio.SeqIO parser I could get the proteins counter parts, but I haven't found a way to do it, oddly I haven't found a way to get the crossreference through the parser even when I can see the genebank files have always one. Any way because I also have the Unigene ID list, and it seems that the Unigene parser have a way to get the crossreference, I now want to download all of the Unigene records and parse from there. 
But efetch is not working with unigene, I mean this is not working: >>> from Bio import Entrez >>> from Bio import UniGene >>> Entrez.email = "carlos.borroto at gmail.com" >>> handle = Entrez.esearch(db="unigene", term="Hs.94542") >>> record = Entrez.read(handle) >>> record {u'Count': '1', u'RetMax': '1', u'IdList': ['141673'], u'TranslationStack': [{u'Count': '1', u'Field': 'All Fields', u'Term': 'Hs.94542[All Fields]', u'Explode': 'Y'}, 'GROUP'], u'TranslationSet': [], u'RetStart': '0', u'QueryTranslation': 'Hs.94542[All Fields]'} >>> handle = Entrez.efetch(db="unigene", id="Hs.94542") >>> print handle.read() This print like a webpage, I assume is NCBI server giving an error response. So there is something I could do to accomplish what I want, either through parsing the Genebank files or fetching the Unigene and then parsing its? Any help or pointing to some helpful documentation will be highly appreciated. Thanks in advance -- Carlos Javier From chapmanb at 50mail.com Thu Jul 30 22:09:02 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 30 Jul 2009 18:09:02 -0400 Subject: [Biopython] How to efetch Unigene records? Is it possible at all? In-Reply-To: <65d4b7fc0907301018v696c2045j791fb0f3a1e00fd6@mail.gmail.com> References: <65d4b7fc0907301018v696c2045j791fb0f3a1e00fd6@mail.gmail.com> Message-ID: <20090730220902.GD84345@sobchak.mgh.harvard.edu> Hi Carlos; > I have the task to from a list of human genes of interest, grab their > protein counter parts in the database to do some additional work. [...] > >>> from Bio import Entrez > >>> from Bio import UniGene > >>> Entrez.email = "carlos.borroto at gmail.com" > >>> handle = Entrez.esearch(db="unigene", term="Hs.94542") > >>> record = Entrez.read(handle) > >>> record > {u'Count': '1', u'RetMax': '1', u'IdList': ['141673'], > u'TranslationStack': [{u'Count': '1', u'Field': 'All Fields', u'Term': > 'Hs.94542[All Fields]', u'Explode': 'Y'}, 'GROUP'], u'TranslationSet': > [], u'RetStart': '0', u'QueryTranslation': 'Hs.94542[All Fields]'} > >>> handle = Entrez.efetch(db="unigene", id="Hs.94542") > >>> print handle.read() > > This print like a webpage, I assume is NCBI server giving an error response. > > So there is something I could do to accomplish what I want, either > through parsing the Genebank files or fetching the Unigene and then > parsing its? It looks like you are doing things correctly, but I'm not sure if NCBI supports retrieving UniGene records through the efetch interface. I tried playing around with it for a bit and got the same problems as you; the documentation on their site is also not very clear about if unigene is supported and what return types to get. Not having a lot of experience with UniGene, my guess is this isn't the right direction to go. My suggestion to get your work done is to download the *.data files from the ftp site: ftp://ftp.ncbi.nih.gov/repository/UniGene/ and write a script that runs through these and pulls out the protein identifiers of interest. You should be able to use the UniGene parser for this and use the protsim attribute of each record. With these, you can get the GI number (protgi attribute) and use this to fetch the relevant GenBank records through Entrez. Hope this helps, Brad From carlos.borroto at gmail.com Thu Jul 30 22:27:24 2009 From: carlos.borroto at gmail.com (Carlos Javier Borroto) Date: Thu, 30 Jul 2009 18:27:24 -0400 Subject: [Biopython] How to efetch Unigene records? Is it possible at all? 
In-Reply-To: <20090730220902.GD84345@sobchak.mgh.harvard.edu> References: <65d4b7fc0907301018v696c2045j791fb0f3a1e00fd6@mail.gmail.com> <20090730220902.GD84345@sobchak.mgh.harvard.edu> Message-ID: <65d4b7fc0907301527r3b11f923mca6834b831631098@mail.gmail.com> On Thu, Jul 30, 2009 at 6:09 PM, Brad Chapman wrote: > Hi Carlos; > >> I have the task to from a list of human genes of interest, grab their >> protein counter parts in the database to do some additional work. > > It looks like you are doing things correctly, but I'm not sure if > NCBI supports retrieving UniGene records through the efetch > interface. I tried playing around with it for a bit and got the same > problems as you; the documentation on their site is also not very > clear about if unigene is supported and what return types to get. > Not having a lot of experience with UniGene, my guess is this isn't > the right direction to go. > > My suggestion to get your work done is to download the *.data files > from the ftp site: > > ftp://ftp.ncbi.nih.gov/repository/UniGene/ > > and write a script that runs through these and pulls out the protein > identifiers of interest. You should be able to use the UniGene > parser for this and use the protsim attribute of each record. With > these, you can get the GI number (protgi attribute) and use this to > fetch the relevant GenBank records through Entrez. > > Hope this helps, > Brad > Thanks, I was wondering because this is the first time I use Biopython or NCBI scripting facilities if I was doing something completely wrong. I'm going to follow your advice. Thank you for taking the time to review my concern. regards, -- Carlos Javier From stran104 at chapman.edu Fri Jul 31 00:10:11 2009 From: stran104 at chapman.edu (Matthew Strand) Date: Thu, 30 Jul 2009 17:10:11 -0700 Subject: [Biopython] How to efetch Unigene records? Is it possible at all? In-Reply-To: <65d4b7fc0907301527r3b11f923mca6834b831631098@mail.gmail.com> References: <65d4b7fc0907301018v696c2045j791fb0f3a1e00fd6@mail.gmail.com> <20090730220902.GD84345@sobchak.mgh.harvard.edu> <65d4b7fc0907301527r3b11f923mca6834b831631098@mail.gmail.com> Message-ID: <2a63cc350907301710w57d4d4b9nb89fea39f9e62b76@mail.gmail.com> Hi Carlos, I did something similar to this a while ago and meant to write a cookbook entry for it but haven't gotten the chance yet. You could also try doing an efetch on the ID of the record returned by esearch. I'm not near my workstation so I can't test it but you might try: handle = Entrez.efetch(db="unigene", id="141673") If that works then you just need to pull the ID out of the esearch result and do an efetch on it. -- Matthew Strand stran104 at chapman.edu From lueck at ipk-gatersleben.de Fri Jul 31 08:27:28 2009 From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=) Date: Fri, 31 Jul 2009 10:27:28 +0200 Subject: [Biopython] blastall several alignment viewings options Message-ID: <049a01ca11b8$be2c0110$1022a8c0@ipkgatersleben.de> Hello! is there a way to set 2 or more alignment viewing options in one blast run? I would like to get the xml and the Query-anchored (and maybe some other) but to run Blast twice would be kind of stupid and slowing down. 
Thanks Stefanie From biopython at maubp.freeserve.co.uk Fri Jul 31 09:18:29 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 31 Jul 2009 10:18:29 +0100 Subject: [Biopython] blastall several alignment viewings options In-Reply-To: <049a01ca11b8$be2c0110$1022a8c0@ipkgatersleben.de> References: <049a01ca11b8$be2c0110$1022a8c0@ipkgatersleben.de> Message-ID: <320fb6e00907310218m549294cdk8187cb310332ce11@mail.gmail.com> On Fri, Jul 31, 2009 at 9:27 AM, Stefanie L?ck wrote: > Hello! > > is there a way to set 2 or more alignment viewing options in one blast run? > I would like to get the xml and the Query-anchored (and maybe some other) > but to run Blast twice would be kind of stupid and slowing down. I don't think there is. The XML file should contain enough data to recreate some of the other views (if I recall correctly Sebastian Bassi has a script to do that). However, that may not be possible for the Query-anchored output. Peter From lueck at ipk-gatersleben.de Fri Jul 31 09:25:51 2009 From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=) Date: Fri, 31 Jul 2009 11:25:51 +0200 Subject: [Biopython] blastall several alignment viewings options References: <049a01ca11b8$be2c0110$1022a8c0@ipkgatersleben.de> <320fb6e00907310218m549294cdk8187cb310332ce11@mail.gmail.com> Message-ID: <001201ca11c0$e5dc1800$1022a8c0@ipkgatersleben.de> Thanks Peter! I expected this, I just wanted to be sure since it's stupid to recreate things which are already existing. Have a nice weekend! Stefanie ----- Original Message ----- From: "Peter" To: "Stefanie L?ck" Cc: Sent: Friday, July 31, 2009 11:18 AM Subject: Re: [Biopython] blastall several alignment viewings options On Fri, Jul 31, 2009 at 9:27 AM, Stefanie L?ck wrote: > Hello! > > is there a way to set 2 or more alignment viewing options in one blast > run? > I would like to get the xml and the Query-anchored (and maybe some other) > but to run Blast twice would be kind of stupid and slowing down. I don't think there is. The XML file should contain enough data to recreate some of the other views (if I recall correctly Sebastian Bassi has a script to do that). However, that may not be possible for the Query-anchored output. Peter From biopython at maubp.freeserve.co.uk Fri Jul 31 10:08:42 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 31 Jul 2009 11:08:42 +0100 Subject: [Biopython] blastall several alignment viewings options In-Reply-To: <001201ca11c0$e5dc1800$1022a8c0@ipkgatersleben.de> References: <049a01ca11b8$be2c0110$1022a8c0@ipkgatersleben.de> <320fb6e00907310218m549294cdk8187cb310332ce11@mail.gmail.com> <001201ca11c0$e5dc1800$1022a8c0@ipkgatersleben.de> Message-ID: <320fb6e00907310308s1d7ab530vef7f531384ecd39e@mail.gmail.com> On Fri, Jul 31, 2009 at 10:25 AM, Stefanie L?ck wrote: > Thanks Peter! I expected this, I just wanted to be sure since it's stupid to > recreate things which are already existing. > Have a nice weekend! > Stefanie I know you are using standalone BLAST (blastall), but if you were doing this online via the NCBI website, you can reformat the output (without recalculating it). This *might* be possible via the QBLAST interface too... it would take some experimentation. 
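The starting point would be the format_type argument of qblast, e.g. this untested sketch (with my_sequence standing in for your query, and I don't know off-hand whether the query-anchored views are exposed this way):

from Bio.Blast import NCBIWWW

# ask QBLAST for the plain text report rather than the default XML
result_handle = NCBIWWW.qblast("blastn", "nr", my_sequence, format_type="Text")
print result_handle.read()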
Peter From lueck at ipk-gatersleben.de Fri Jul 31 10:28:11 2009 From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=) Date: Fri, 31 Jul 2009 12:28:11 +0200 Subject: [Biopython] blastall several alignment viewings options References: <049a01ca11b8$be2c0110$1022a8c0@ipkgatersleben.de> <320fb6e00907310218m549294cdk8187cb310332ce11@mail.gmail.com> <001201ca11c0$e5dc1800$1022a8c0@ipkgatersleben.de> <320fb6e00907310308s1d7ab530vef7f531384ecd39e@mail.gmail.com> Message-ID: <002901ca11c9$9a9ed680$1022a8c0@ipkgatersleben.de> In my new project I'll do both, online and local BLAST. Anyway I'll recreate it, it's should be done quickly. In case that someone need it too, I can provide it! ----- Original Message ----- From: "Peter" To: "Stefanie L?ck" Cc: Sent: Friday, July 31, 2009 12:08 PM Subject: Re: [Biopython] blastall several alignment viewings options On Fri, Jul 31, 2009 at 10:25 AM, Stefanie L?ck wrote: > Thanks Peter! I expected this, I just wanted to be sure since it's stupid > to > recreate things which are already existing. > Have a nice weekend! > Stefanie I know you are using standalone BLAST (blastall), but if you were doing this online via the NCBI website, you can reformat the output (without recalculating it). This *might* be possible via the QBLAST interface too... it would take some experimentation. Peter From lueck at ipk-gatersleben.de Fri Jul 31 10:37:59 2009 From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=) Date: Fri, 31 Jul 2009 12:37:59 +0200 Subject: [Biopython] EuroSciPy2009 References: <049a01ca11b8$be2c0110$1022a8c0@ipkgatersleben.de> <320fb6e00907310218m549294cdk8187cb310332ce11@mail.gmail.com> <001201ca11c0$e5dc1800$1022a8c0@ipkgatersleben.de> <320fb6e00907310308s1d7ab530vef7f531384ecd39e@mail.gmail.com> Message-ID: <002f01ca11ca$f928d830$1022a8c0@ipkgatersleben.de> Hello! I just wanted to say that the EuroSciPy2009 was a great success and I also got a lot of positive feedback for my talk. I would like to thank all Biopython developers for providing a great library! For anyone who is interested and would like to see for what I use Biopython (and why it's makes my life in the lab easier), here are the links of the abstract and slides: http://www.euroscipy.org/presentations/abstracts/abstract_lueck.html http://www.euroscipy.org/presentations/slides/slides_lueck.pdf Would be nice to see some of you next year! Kind regards, Stefanie