[BioPython] Is query_length really the length of query?

Wed Apr 15 08:32:49 UTC 2009

Peter wrote:
> On Tue, Apr 14, 2009 at 3:00 PM, Yvan Strahm <yvan.strahm at bccs.uib.no> wrote:
>> Hello,
>>
>> I tried to check the length before sending it to blast.
>> My problem is that all the query sequences are in a file so I used SeqIO to
>> read/parse them
>>
>> for record in SeqIO.parse(fh, "fasta"):
>>        l_query = len(record.seq)
>>        result_handle, error_handle = NCBIStandalone.blastall(my_blast_exe,
>> "blastn",
>>                                                          my_blast_db,
>> record.seq)
>>
>> doesn't work as NCBIStandalone.blastall takes a file as infile.
>>
>> Should I write a temporary file with the record.id and record.seq and pass
>> it to NCBIStandalone.blastall ?
>>
>> or is there an easier way?
> 
> It sounds like you already have a FASTA file containing the query
> sequences, so just use that as the input to standalone BLAST.
> 
> i.e. I would do something like this to double check the reported query
> length matches up with the actual query length:
> 
> from Bio import SeqIO
> from Bio.Blast import NCBIXML
> from Bio.Blast import NCBIStandalone
> query_filename = "example.fasta"
> #Load all the queries into memory as a dictionary of SeqRecord objects
> query_handle = open("example.fasta")
> query_dict = SeqIO.to_dict(SeqIO.parse(query_handle,"fasta"))
> query_handle.close()
> #Run BLAST and loop over the XML blast results one by one (memory efficient),
> result_handle, error_handle = NCBIStandalone.blastall(my_blast_exe, \
>                                                "blastn", my_blast_db,
> query_filename)
> for blast_record in NCBIXML.parse(result_handle) :
>    query_record = query_dict[blast_record.query_id] #check this
>    assert len(query_record) == blast_record.query_letters
>    assert len(query_record) == blast_record.query_length #Biopython
> 1.50b or later
> 
> Note I haven't actually tested this example, but I think the idea is clear.
> 
> This approach gives you easy access to the full query sequence, and
> its full description.  If all you care about is the length, then
> rather than storing a dictionary of the queries as SeqRecords, just
> use a dictionary of their lengths as integers.
> 
> Peter

Thanks a lot!
Just have to change the query_record = query_dict[blast_record.query_id]
to
query_record = query_dict[blast_record.query]

because query_id return something like lcl|XXX and not the actual fasta header.
and yes I am interested in the whole SeqRecords ;-).
Does the query_dict size is limited by the memory of the machine ?

yvan