[Biopython] Tutorial Question 7.4 alignment.title

Fri Oct 8 15:45:31 UTC 2010

Peter,

Thanks for your reply. I started to fiddle around with parsing the
string last night but haven't made much progress.

At the moment the output looks like this:

****Alignment****
sequence: gi|302529614|ref|ZP_07281956.1| predicted protein [Streptomyces sp. AA4] >gi|302438509|gb|EFL10325.1| predicted protein [Streptomyces sp. AA4]
e value: 1.89229e-46
length: 1109
start: 7
end: 414

So what I want from the sequence string is the following:
[Streptomyces sp. AA4]
ZP_07281956.1

printed out on separate lines like the rest of the output.
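
A minimal sketch of that parsing, assuming the title follows the NCBI layout shown above: pipe-separated "gi|number|db|accession|" fields, then a description ending in the organism name in square brackets, with any merged entries separated by " >". This is only a convention, so the function returns None for any part it cannot find. The name parse_title is made up for illustration and is not part of Biopython. The brackets are dropped from the organism name here; keeping them would be a one-line change.

```python
import re

# A trailing "[...]" at the end of the description (NCBI convention only).
ORGANISM_RE = re.compile(r"\[([^\[\]]+)\]\s*$")


def parse_title(title):
    """Return (organism, accession) from a BLAST hit title.

    Hypothetical helper: either value is None when the title does not
    follow the assumed "gi|number|db|accession| description [organism]" form.
    """
    # Merged identical sequences are joined with " >"; keep the first entry.
    first = title.split(" >", 1)[0]

    # Expected fields: gi, GI number, database tag, accession, description.
    fields = first.split("|")
    accession = fields[3] if len(fields) >= 5 else None

    match = ORGANISM_RE.search(first)
    organism = match.group(1) if match else None
    return organism, accession


title = (
    "gi|302529614|ref|ZP_07281956.1| predicted protein "
    "[Streptomyces sp. AA4] >gi|302438509|gb|EFL10325.1| predicted protein "
    "[Streptomyces sp. AA4]"
)
organism, accession = parse_title(title)
print("organism:", organism)
print("locus:", accession)
```

Titles from databases that do not use this layout, or organism names that themselves contain square brackets, will need different handling.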

Once that is figured out, I want to write all the information as
tab-separated columns so it can be read into an OpenOffice (OO)
spreadsheet, like this:
Name	Locus #	E_value	Length	Start	End
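
One possible way to produce that layout is Python's standard csv module with a tab delimiter, which a spreadsheet can import directly. The sketch below uses Python 3 syntax; the function name write_hits and the file name "hits.tsv" mentioned in the comment are made up for illustration, and the single row simply repeats the values from the example output above.

```python
import csv
import io

COLUMNS = ["Name", "Locus #", "E_value", "Length", "Start", "End"]


def write_hits(rows, handle):
    """Write a header line plus one tab-separated line per hit (hypothetical helper)."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)


# Values copied from the example output earlier in this message.
rows = [("Streptomyces sp. AA4", "ZP_07281956.1", 1.89229e-46, 1109, 7, 414)]

# StringIO stands in for a real file such as open("hits.tsv", "w", newline="").
buffer = io.StringIO()
write_hits(rows, buffer)
text = buffer.getvalue()
print(text, end="")
```

In the BLAST loop, each passing HSP would contribute one tuple to rows instead of a block of print lines.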

Regards,
Ara

On Oct 8, 2010, at 3:30 AM, Peter wrote:

> On Fri, Oct 8, 2010 at 4:06 AM, Ara Kooser <akooser at unm.edu> wrote:
>> Hello all,
>>
>> I am a new user to Biopython. I've been working my way through the
>> tutorial. I have a question about how alignment.title works in the
>> example given in section 7.4 of the tutorial. I wrote the following code:
>>
>> from Bio.Blast import NCBIXML
>>
>> E_VALUE_THRESH = 1e-30
>>
>> # Parse the BLAST XML output and take the first record
>> result_handle = open("test.xml")
>> blast_records = NCBIXML.parse(result_handle)
>> blast_record = next(blast_records)
>>
>> # Report every HSP whose e-value is below the threshold
>> for alignment in blast_record.alignments:
>>    for hsp in alignment.hsps:
>>        if hsp.expect < E_VALUE_THRESH:
>>            print('****Alignment****')
>>            print('sequence:', alignment.title)
>>            print('e value:', hsp.expect)
>>            print('length:', alignment.length)
>>            print('start:', hsp.query_start)
>>            print('end:', hsp.query_end)
>>
>> This looks at an .xml file that was produced by BLAST. I was wondering if
>> there was a way to break up the string produced by:
>>
>>            print('sequence:', alignment.title)
>>
>> Basically I would like the organism's name first, followed by the locus
>> number. I wasn't sure how to split up the print command.
>>
>> I looked at the docs over at http://biopython.org/DIST/docs/api/ to see if
>> there was a tag specifically for the locus number and organism name.
>>
>> Thank you for your time and help.
>>
>> Regards,
>> Ara
>
> Hi Ara,
>
> An example of the output you are getting and what you want
> would help, but I think this isn't possible in general.
>
> As I recall, the locus number and organism name information is
> just part of the original identifier and/or description in the FASTA
> file used to build the BLAST database. The NCBI tend to include
> the species in the description within square brackets - but this is
> just their convention, it is not a nicely tagged part of the BLAST
> output which the parser could spot.
>
> Basically I think you will have to parse the string yourself.
>
> Peter
>
> P.S. Alternatively if you want the organism name and have the
> GI number (or similar) this can be mapped to the organism via
> the NCBI taxonomy database (either online via Entrez or
> by parsing a downloaded copy of the mapping).