[BioPython] Is query_length really the length of query?
Yvan Strahm
yvan.strahm at bccs.uib.no
Wed Apr 15 08:32:49 UTC 2009
Peter wrote:
> On Tue, Apr 14, 2009 at 3:00 PM, Yvan Strahm <yvan.strahm at bccs.uib.no> wrote:
>> Hello,
>>
>> I tried to check the length before sending it to blast.
>> My problem is that all the query sequences are in a file so I used SeqIO to
>> read/parse them
>>
>> for record in SeqIO.parse(fh, "fasta"):
>> l_query = len(record.seq)
>> result_handle, error_handle = NCBIStandalone.blastall(my_blast_exe,
>> "blastn",
>> my_blast_db,
>> record.seq)
>>
>> doesn't work as NCBIStandalone.blastall takes a file as infile.
>>
>> Should I write a temporary file with the record.id and record.seq and pass
>> it to NCBIStandalone.blastall ?
>>
>> or is there an easier way?
>
> It sounds like you already have a FASTA file containing the query
> sequences, so just use that as the input to standalone BLAST.
>
> i.e. I would do something like this to double check the reported query
> length matches up with the actual query length:
>
> from Bio import SeqIO
> from Bio.Blast import NCBIXML
> from Bio.Blast import NCBIStandalone
> query_filename = "example.fasta"
> #Load all the queries into memory as a dictionary of SeqRecord objects
> query_handle = open("example.fasta")
> query_dict = SeqIO.to_dict(SeqIO.parse(query_handle,"fasta"))
> query_handle.close()
> #Run BLAST and loop over the XML blast results one by one (memory efficient),
> result_handle, error_handle = NCBIStandalone.blastall(my_blast_exe, \
> "blastn", my_blast_db,
> query_filename)
> for blast_record in NCBIXML.parse(result_handle) :
> query_record = query_dict[blast_record.query_id] #check this
> assert len(query_record) == blast_record.query_letters
> assert len(query_record) == blast_record.query_length #Biopython
> 1.50b or later
>
> Note I haven't actually tested this example, but I think the idea is clear.
>
> This approach gives you easy access to the full query sequence, and
> its full description. If all you care about is the length, then
> rather than storing a dictionary of the queries as SeqRecords, just
> use a dictionary of their lengths as integers.
>
> Peter
Thanks a lot!
Just have to change the query_record = query_dict[blast_record.query_id]
to
query_record = query_dict[blast_record.query]
because query_id return something like lcl|XXX and not the actual fasta header.
and yes I am interested in the whole SeqRecords ;-).
Does the query_dict size is limited by the memory of the machine ?
yvan
More information about the Biopython
mailing list