[BioPython] blast parsing errors

Mon Mar 5 16:24:58 UTC 2007

Thanks guys,

You are right - I am using BioPython 1.42, and python2.5 installed  
via fink on Mac OS X.

I meant to use an amino acid sequence for the seq variable, and I  
have included the revised code snippet which uses the protein  
sequence that gave me trouble in the first place.  However, there is  
no problem when using the current CVS code.  Thanks for all of your  
help!

I have 3 questions:

1.) Is the documentation for the new NCBIXML and NBCIWWW up to date?
2.) Why is NCBIXML.parse returning an iterator in this case since  
there is only one result?  Or in other words, what are the use cases  
where an iterator is necessary?
3.) How are the fink packages of Biopython maintained?  I am using  
the fink unstable tree, which means that I am getting the most  
current version that fink has.  If Biopython 1.44 is substantially  
different from 1.42 (current fink), can we update the fink version  
faster than we currently are?

Cheers,

Julius

---------- code that works in Biopython 1.44 --------

from Bio import Fasta
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
import StringIO
import re

#BLAST cutoff
cutoff = 1e-4

#Create a fasta record: title and seq are given
title = 'test'
seq = '\
MESFSVQAYLKATDNNFVSTFKDAAKQVQNF\
EKNTNSTMSTVGKVATSTGKTLTKAVTVPII\
GIGVAAAKIGGDFESQMSRVKAISGATGSSF\
EELRQQAIDLGAKTAFSAKESASGMENLASA\
GFNAKEIMEAMPGLLDLAAVSGGDVALASEN\
AATALRGFNLDASQSGHVANVFAKAAADTNA\
EVGDMGEAMKYIAPVANSMGLSIEEVSAAIG\
IMSDAGIKGSQAGTSLRGALSRLADPTDAMQ\
AKMDELGLSFYDSEGKMKPLKDQIGMLKDAF\
KGLTPEQQQNALVTLYGQESLSGMMALIDKG\
PDKLGKLTESLKNSDGAADKMAKTMQDNMNS\
SLEQMMGAFESAAIVVQKILAPAVRKVADSI\
SGLVDKFVSAPEPVQKMIVTIGLIVAAIGPL\
LVIFGQAVVTLQRVKVGFLALRSGLALIGGS\
FTAISLPVLGIIAAIAAVIAIGILVYKNWDK\
ISKFGKEVWANVKKFASDAAEVIKEKWGDIT\
QWFSDTWNNIKNGAKGLWDGTVQGAKNAVDS\
VKNAWNGIKEWFTNLWKGTTSGLSSAWDSVT\
TTLAPFVETIKTIFQPILDFFSGLWGQVQTI\
FGSAWEIIKTVVMGPVLLLIDLITGDFNQFK\
KDFAMLWQTLFTNIQTLVTTYVQIVVGFFTA\
WGQTVSNIWTTVVNTIQSLWGAFTTWVINMA\
KSIVDGIVNGWNSFKQGTVDLWNATVQWVKD\
TWASFKQWVVDSANAIVNGVKQGWENLKQGT\
IDLWNGMINGLKGIWDGLKQSVRNLIDNVKT\
TFNNLKNINLLDIGKAIIDGLVKGLKKKWED\
GMKFISGIGDWIRKHKGPIRKDRKLLIPAGK\
AIMTGLNSGLTGGFRNVQSNVSGMGDMIANA\
INSDYSVDIGANVAAANRSISSQVSHDVNLN\
QGKQPASFTVKLGNQIFKAFVDDISNAQGQA\
INLNMGF*'

fasta_rec = Fasta.Record()

#Sanitize title - blast does not like single quotes or \n in titles
title = re.sub("'","prime",title)
title = re.sub("\n","",title)
fasta_rec.title = title
fasta_rec.sequence = seq

result_handle = NCBIWWW.qblast  
('blastp','nr',fasta_rec,ncbi_gi=1,expect=cutoff,format_type="XML",entre 
z_query="Viruses [ORGN]")
b_records = NCBIXML.parse(result_handle)

for b_record in b_records:
     print "%s found %i results" % (b_record.query, len 
(b_record.alignments))
     for alignment in b_record.alignments:
          titles = alignment.title.split('>')
          print titles

----------

-----------------------------------------------------
http://openwetware.org/wiki/User:Lucks
-----------------------------------------------------

On Mar 5, 2007, at 9:55 AM, Peter wrote:

> Julius Lucks wrote:
>> Hi all,
>> I am trying to parse a bunch of blast results that I gather via   
>> NCBIWWW.qblast().  I have the following code snipit:
>
> You didn't say which version of BioPython you are using, I would  
> guess 1.42 - there have been some Bio.Blast changes since than.
>
> Your example sequence was "ATCG", but you ran a "blastp" search.   
> Did you really mean the peptide Ala-Thr-Cys-Gly here?
>
> If you meant to do a nucleotide search, try using "blastn" and "nr"  
> instead.  That should work better.
>
> However, there is still something funny going on.  I tried your  
> example as is using the CVS code, and it fails before it even gets  
> the blast results back...
>
> Could you save the XML output to a file and email it to me; or even  
> better file a bug an attach the XML file to the bug.
>
> Thanks
>
> Peter