[BioPython] NCBIXML Blast parser Error for TBLASTN

Alaguraj Veluchamy alaguraj.v at gmail.com
Sat Apr 12 21:23:26 UTC 2008


Dear people,
I thought this will give a solution to the problem, i faced.
If you do "formatdb" of blast with ".gbk" files of NCBI database, then
the sequence co-ordinates are processed and gives numbers which doesnt
match with the original sequence.
Record.Py in the lib module doesnt say much about this.
The solution is- do format the blast database with ".fna".
This gives out a result with correct co-ordinates.
This co-ordinate can be used to retrieve the sequence or can be used
to parse the ".gbk" files.
To make it precise, Format with "fna" files and parse with "gbk" files.
This is true for all versions of BLAST.

As two members have asked for the code.. here it is:

##########################
import os
import sys
import dircache
import re
from Bio.Blast import NCBIStandalone
from Bio.Blast import NCBIXML
path="/Users/alaguraj/soca/genomes/Bacteria"
lstDb=os.listdir(path)
blastdbpath="/Users/alaguraj/blastSmVsall/blastDB"
myArray=[]

for db in lstDb:
        eachdirpath=path+"/"+db
        listfiles=dircache.listdir(eachdirpath)
        print listfiles
        for listfile in listfiles:

command='/Users/alaguraj/progs/blast-2.2.17/bin/blastall -i
'+inputfile+' -p tblastn -e 0.001 -d
'+blastdbpath+'/'+db+'/'+listfile+' -m 7 \n'
                w, read, e = os.popen3(command)
                b_parser = NCBIXML.BlastParser()
                blast_records = b_parser.parse(read)
                for b_record in blast_records:
                        myArray.append(b_record)
########I am appending here because i need the data to further process
for my application#########

for eachmyArray in myArray:
    for alignment in eachmyArray.alignments:
        for hsp in alignment.hsps:
            print hsp.sbjct_start
            print hsp.sbjct_end

############################################################

The next code is For parsing the genbank file

parser=GenBank.RecordParser()
	record = parser.parse(open(genomefile))
	for feature in record.features:
		if feature.key=="CDS":
                   for qualifier in feature.qualifiers:
					if qualifier.key=="/protein_id=":
						protid=qualifier.value[1:-1]
                                                print protid
					if qualifier.key=="/product=":
						func=qualifier.value[1:-1]
                                                print protid
#############################################################

Sorry for the long mail.
Hope it helps.

Regards,
Alaguraj.V


Dear people,I am stuck in parsing Blast XML output.
I am trying to extract the co-ordinate and hence the sequence of the hits
for my query in standalone blast.
XML output is coming for query proteins against a genbank genome file.
(using TBLASTN)
The problem for me is- I am unable to get the co-ordinate(hit-from -->
hit-to).
I used hsp.sbjct_start and hsp.sbct_end, but it gives two numbers which is
different from the original.
when i tried extracting those regions from the genbank file and do an
alignment, it nowhere matches.
so what is this hsp.sbjct_start - integers?? how to get the hit region (or
alignment) cordinates?
I read record.py--the library file, but they given it is 3x times larger
(but still it is incorrect)
Regards
-Alaguraj.V



More information about the Biopython mailing list