[BioPython] Is query_length really the length of query?

Peter biopython at maubp.freeserve.co.uk
Wed Apr 15 05:16:37 EDT 2009


On Wed, Apr 15, 2009 at 9:32 AM, Yvan Strahm <yvan.strahm at bccs.uib.no> wrote:
>
> Peter wrote:
>> It sounds like you already have a FASTA file containing the query
>> sequences, so just use that as the input to standalone BLAST.
>>
>> i.e. I would do something like this to double check the reported query
>> length matches up with the actual query length:
>>
>> from Bio import SeqIO
>> from Bio.Blast import NCBIXML
>> from Bio.Blast import NCBIStandalone
>> query_filename = "example.fasta"
>> #Load all the queries into memory as a dictionary of SeqRecord objects
>> query_handle = open(query_filename)
>> query_dict = SeqIO.to_dict(SeqIO.parse(query_handle,"fasta"))
>> query_handle.close()
>> #Run BLAST and loop over the XML blast results one by one (memory efficient)
>> result_handle, error_handle = NCBIStandalone.blastall(my_blast_exe, "blastn",
>>                                                       my_blast_db, query_filename)
>> for blast_record in NCBIXML.parse(result_handle):
>>     query_record = query_dict[blast_record.query_id] #check this
>>     assert len(query_record) == blast_record.query_letters
>>     assert len(query_record) == blast_record.query_length #Biopython 1.50b or later
>>
>> Note I haven't actually tested this example, but I think the idea is
>> clear.
>>
>> This approach gives you easy access to the full query sequence, and
>> its full description.  If all you care about is the length, then
>> rather than storing a dictionary of the queries as SeqRecords, just
>> use a dictionary of their lengths as integers.
>>
>> Peter
>
> Thanks a lot!
> I just have to change query_record = query_dict[blast_record.query_id]
> to query_record = query_dict[blast_record.query],
> because query_id returns something like lcl|XXX and not the actual
> FASTA header.

There may be a blastall command line argument to alter this
behaviour, but if that works, great :)
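
For reference, the adjusted loop might look something like this (untested;
the .split call is only needed if your FASTA titles also carry a description
after the identifier, since SeqIO.to_dict keys on just the first word of the
title by default):

for blast_record in NCBIXML.parse(result_handle):
    #Use the query definition line rather than the lcl|... identifier;
    #take just the first word in case the title also has a description.
    query_record = query_dict[blast_record.query.split(None, 1)[0]]
    assert len(query_record) == blast_record.query_letters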

> and yes I am interested in the whole SeqRecords ;-).

OK, good.  That example should be a good starting point then.

> Is the size of the query_dict limited by the memory of the machine?

In the example I gave, yes.  This uses a standard Python
dictionary, where the keys are the record identifiers (strings) and
the values are SeqRecord objects.  A larger query file means more
SeqRecord objects in memory in this dictionary.  Unless you are
using thousands of query sequences (in which case your BLAST
search will be slow), I don't expect this to be a problem.

However, each SeqRecord in this example will waste some memory,
e.g. an empty list of features, an empty list of database
cross-references, and an empty dictionary of annotations.  If you
find you are running out of memory, then perhaps use a dictionary
where the keys are the record identifiers (strings) and the values
are just the sequences (as plain strings).
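
Something along these lines should work (untested; it reuses the
example.fasta filename from above):

from Bio import SeqIO
query_handle = open("example.fasta")
#Map each record identifier to its sequence as a plain string, avoiding
#the overhead of keeping full SeqRecord objects in memory.
query_seqs = dict((rec.id, str(rec.seq))
                  for rec in SeqIO.parse(query_handle, "fasta"))
query_handle.close()
#If only the lengths matter, store integers instead, e.g. (rec.id, len(rec))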

If after that you are still running out of memory, you could index the
FASTA file somehow, or use a full database (e.g. BioSQL).
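
A rough sketch of a simple index (untested): remember the file offset of
each FASTA title line, then seek back and re-parse a single record on
demand, keyed on the first word of the title to match SeqIO.to_dict:

from Bio import SeqIO

def build_fasta_index(filename):
    #Map each record identifier to the offset of its ">" title line.
    offsets = {}
    handle = open(filename)
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(">"):
            offsets[line[1:].split(None, 1)[0]] = offset
    handle.close()
    return offsets

def get_record(filename, offsets, key):
    #Seek to the stored offset and parse just that one record.
    handle = open(filename)
    handle.seek(offsets[key])
    record = SeqIO.parse(handle, "fasta").next()
    handle.close()
    return record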

Peter


