[BioPython] Is query_length really the length of query?

Peter biopython at maubp.freeserve.co.uk
Tue Apr 14 14:23:03 UTC 2009


On Tue, Apr 14, 2009 at 3:00 PM, Yvan Strahm <yvan.strahm at bccs.uib.no> wrote:
> Hello,
>
> I tried to check the length before sending it to blast.
> My problem is that all the query sequences are in a file so I used SeqIO to
> read/parse them
>
> for record in SeqIO.parse(fh, "fasta"):
>        l_query = len(record.seq)
>        result_handle, error_handle = NCBIStandalone.blastall(my_blast_exe,
> "blastn",
>                                                          my_blast_db,
> record.seq)
>
> doesn't work as NCBIStandalone.blastall takes a file as infile.
>
> Should I write a temporary file with the record.id and record.seq and pass
> it to NCBIStandalone.blastall ?
>
> or is there an easier way?

It sounds like you already have a FASTA file containing the query
sequences, so just use that as the input to standalone BLAST.

i.e. I would do something like this to double check the reported query
length matches up with the actual query length:

from Bio import SeqIO
from Bio.Blast import NCBIXML
from Bio.Blast import NCBIStandalone
query_filename = "example.fasta"
#Load all the queries into memory as a dictionary of SeqRecord objects
query_handle = open("example.fasta")
query_dict = SeqIO.to_dict(SeqIO.parse(query_handle,"fasta"))
query_handle.close()
#Run BLAST and loop over the XML blast results one by one (memory efficient),
result_handle, error_handle = NCBIStandalone.blastall(my_blast_exe, \
                                               "blastn", my_blast_db,
query_filename)
for blast_record in NCBIXML.parse(result_handle) :
   query_record = query_dict[blast_record.query_id] #check this
   assert len(query_record) == blast_record.query_letters
   assert len(query_record) == blast_record.query_length #Biopython
1.50b or later

Note I haven't actually tested this example, but I think the idea is clear.

This approach gives you easy access to the full query sequence, and
its full description.  If all you care about is the length, then
rather than storing a dictionary of the queries as SeqRecords, just
use a dictionary of their lengths as integers.

Peter




More information about the Biopython mailing list