[Biopython] BLAST problems

Fri Apr 15 09:15:40 EDT 2011

On Fri, Apr 15, 2011 at 1:31 PM, Kathe Munk <Kathe.Munk at agrsci.dk> wrote:
> Dear all
>
> I am wondering if any of you know what is going on here. I have made the following script:
>
> #!/usr/bin/python
> from Bio.Blast import NCBIWWW
> from Bio.Blast import NCBIXML
>
> result_handle = NCBIWWW.qblast("blastn", "nr", 58585087, hitlist_size=2)
> blast_record = NCBIXML.read(result_handle)
> print(blast_record.alignments[0].title)
>
>
> The print statement prints the following:
> gi|58585087|ref|NM_001011569.1| Apis mellifera complementary
> sex determiner (Csd), mRNA >gi|46276949|gb|AY569721.1| Apis
> mellifera complementary sex determiner (csd) mRNA, csd-S7-16
> allele, complete cds
>
>
> As you can see I use blastn to retrieve sequences similar to my query.
> The problem is that when I extract the title I get the names of two
> sequences. I would only expect one. Furthermore I have tested the
> script with other query sequences where only a single sequence
> appears in the title. Do I have an error in my script?

No, it is part of how the NCBI presents the NR database - identical
redundant sequences get collapsed into a single entry, with their
descriptions combined as shown (from memory, a control+A character
is used as the separator).
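
If you want the individual entries back, a rough sketch like the
following may help. It assumes the merged descriptions in the XML title
are joined by " >" (as in the output quoted above); the helper name
split_merged_title is made up for illustration and is not part of
Biopython. A description that itself contains " >" would be split
incorrectly.

```python
def split_merged_title(title):
    """Split a merged BLAST hit title into its individual entries.

    Assumption: entries are joined by " >" as in the title quoted above.
    """
    return [part.strip() for part in title.split(" >")]


# The title from the original post, joined onto one line.
title = (
    "gi|58585087|ref|NM_001011569.1| Apis mellifera complementary "
    "sex determiner (Csd), mRNA >gi|46276949|gb|AY569721.1| Apis "
    "mellifera complementary sex determiner (csd) mRNA, csd-S7-16 "
    "allele, complete cds"
)

entries = split_merged_title(title)
for entry in entries:
    print(entry)
```

In the original script this would be applied to
blast_record.alignments[0].title.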

If you are using the BLAST+ tabular output, the optional extra
sallseqid column gives the subject (match) IDs separated by
semicolons.
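
As a minimal sketch of reading that column, assume the tabular output
was requested with a column list such as "6 qseqid sallseqid pident
evalue". The column choice, the function name parse_tabular and the
sample line below are all made up for illustration (the identity and
e-value figures are not real results); only the two IDs come from the
title quoted above.

```python
import io

# Assumed column order; it must match whatever was given to -outfmt.
COLUMNS = ["qseqid", "sallseqid", "pident", "evalue"]


def parse_tabular(handle, columns=COLUMNS):
    """Yield one dict per hit line, with sallseqid split into a list."""
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        record = dict(zip(columns, line.split("\t")))
        record["sallseqid"] = record["sallseqid"].split(";")
        yield record


# Made-up sample line standing in for a real BLAST+ output file.
sample = io.StringIO(
    "query1\t"
    "gi|58585087|ref|NM_001011569.1|;gi|46276949|gb|AY569721.1|\t"
    "100.00\t0.0\n"
)

records = list(parse_tabular(sample))
for record in records:
    print(record["qseqid"], record["sallseqid"])
```

With a real file you would pass an open file handle instead of the
StringIO object.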

Peter