[Bioperl-l] searchio tags
Liam Elbourne
lelbourn at cbms.mq.edu.au
Wed Nov 5 09:04:38 UTC 2008
Thanks Jason and Chris,
I ended up iterating through the features as I needed the locus_tag
anyway - bit slow but gets there in the end...
I've ccd this to the bioperl list as suggested, and will subscribe soon.
Regards,
Liam.
On 05/11/2008, at 6:00 AM, Chris Fields wrote:
> I've already looked into that using TBLASTN output from the web
> interface, but I found no relevant lines in HSP sections in XML
> formatted reports (see below). It is possible the URLAPI interface
> is returning different XML data (I've seen stranger things from
> NCBI). I'll check into that when I can. BTW, the default read
> format for RemoteBlast is still text IIRC (we never changed it to
> XML), so is it possible Liam is referring to the default (text)
> output, and not XML?
>
> Also, I have managed to get this working in GenericHSP, will
> probably commit momentarily with added tests. Liam, if you have XML
> output that contains the relevant data could you attach it to an
> email so I can try parsing it out?
>
> Liam, it might be best to ask these questions on the mail list in
> case Jason or I can't get to them immediately. For SeqIO features,
> in order to retrieve a particular feature region you would need to
> either iterate through or grep the relevant features and retrieve
> the sequence using $feature->seq. If I want surrounding sequence I
> grab start/end from the feature, then +/- the extra bp and use $seq-
> >subseq.
>
> -c
>
> Example Text Hit/HSP block:
>
> >emb|AM408590.1| Mycobacterium bovis BCG Pasteur 1173P2, complete
> genome
> Length=4374522
>
> Features in this part of subject sequence:
> Conserved hypothetical protein
> Probable pyrimidine operon regulatory protein pyrR
>
> Score = 372 bits (955), Expect = 3e-101, Method: Compositional
> matrix adjust.
> Identities = 193/193 (100%), Positives = 193/193 (100%), Gaps =
> 0/193 (0%)
> Frame = +3
>
> Query 1
> MGAAGDAAIGRESRELMSAADVGRTISRIAHQIIEKTALDDPVGPDAPRVVLLGIPTRGV 60
>
> MGAAGDAAIGRESRELMSAADVGRTISRIAHQIIEKTALDDPVGPDAPRVVLLGIPTRGV
> Sbjct 1577814
> MGAAGDAAIGRESRELMSAADVGRTISRIAHQIIEKTALDDPVGPDAPRVVLLGIPTRGV 1577993
>
> Query 61
> TLANRLAGNITEYSGIHVGHGALDITLYRDDLMIKPPRPLASTSIPAGGIDDALVILVDD 120
>
> TLANRLAGNITEYSGIHVGHGALDITLYRDDLMIKPPRPLASTSIPAGGIDDALVILVDD
> Sbjct 1577994
> TLANRLAGNITEYSGIHVGHGALDITLYRDDLMIKPPRPLASTSIPAGGIDDALVILVDD 1578173
>
> Query 121
> VLYSGRSVRSALDALRDVGRPRAVQLAVLVDRGHRELPLRADYVGKNVPTSRSESVHVRL 180
>
> VLYSGRSVRSALDALRDVGRPRAVQLAVLVDRGHRELPLRADYVGKNVPTSRSESVHVRL
> Sbjct 1578174
> VLYSGRSVRSALDALRDVGRPRAVQLAVLVDRGHRELPLRADYVGKNVPTSRSESVHVRL 1578353
>
> Query 181 REHDGRDGVVISR 193
> REHDGRDGVVISR
> Sbjct 1578354 REHDGRDGVVISR 1578392
>
> Matching XML Hit/HSP block:
>
> <Hit>
> <Hit_num>2</Hit_num>
> <Hit_id>gi|121491530|emb|AM408590.1|</Hit_id>
> <Hit_def>Mycobacterium bovis BCG Pasteur 1173P2, complete
> genome</Hit_def>
> <Hit_accession>AM408590</Hit_accession>
> <Hit_len>4374522</Hit_len>
> <Hit_hsps>
> <Hsp>
> <Hsp_num>1</Hsp_num>
> <Hsp_bit-score>372.474</Hsp_bit-score>
> <Hsp_score>955</Hsp_score>
> <Hsp_evalue>3.27024e-101</Hsp_evalue>
> <Hsp_query-from>1</Hsp_query-from>
> <Hsp_query-to>193</Hsp_query-to>
> <Hsp_hit-from>1577814</Hsp_hit-from>
> <Hsp_hit-to>1578392</Hsp_hit-to>
> <Hsp_query-frame>0</Hsp_query-frame>
> <Hsp_hit-frame>3</Hsp_hit-frame>
> <Hsp_identity>193</Hsp_identity>
> <Hsp_positive>193</Hsp_positive>
> <Hsp_gaps>0</Hsp_gaps>
> <Hsp_align-len>193</Hsp_align-len>
>
> <
> Hsp_qseq
> >
> MGAAGDAAIGRESRELMSAADVGRTISRIAHQIIEKTALDDPVGPDAPRVVLLGIPTRGVTLANRLAGNITEYSGIHVGHGALDITLYRDDLMIKPPRPLASTSIPAGGIDDALVILVDDVLYSGRSVRSALDALRDVGRPRAVQLAVLVDRGHRELPLRADYVGKNVPTSRSESVHVRLREHDGRDGVVISR
> </Hsp_qseq>
>
> <
> Hsp_hseq
> >
> MGAAGDAAIGRESRELMSAADVGRTISRIAHQIIEKTALDDPVGPDAPRVVLLGIPTRGVTLANRLAGNITEYSGIHVGHGALDITLYRDDLMIKPPRPLASTSIPAGGIDDALVILVDDVLYSGRSVRSALDALRDVGRPRAVQLAVLVDRGHRELPLRADYVGKNVPTSRSESVHVRLREHDGRDGVVISR
> </Hsp_hseq>
>
> <
> Hsp_midline
> >
> MGAAGDAAIGRESRELMSAADVGRTISRIAHQIIEKTALDDPVGPDAPRVVLLGIPTRGVTLANRLAGNITEYSGIHVGHGALDITLYRDDLMIKPPRPLASTSIPAGGIDDALVILVDDVLYSGRSVRSALDALRDVGRPRAVQLAVLVDRGHRELPLRADYVGKNVPTSRSESVHVRLREHDGRDGVVISR
> </Hsp_midline>
> </Hsp>
> </Hit_hsps>
> </Hit>
>
> On Nov 4, 2008, at 11:55 AM, Jason Stajich wrote:
>
>> FYI
>>
>> -
>> Jason Stajich
>> Sent from my iPod
>>
>> Begin forwarded message:
>>
>>> From: Liam Elbourne <lelbourn at cbms.mq.edu.au>
>>> Date: November 3, 2008 10:50:35 PM PST
>>> To: Jason Stajich <jason at bioperl.org>
>>> Subject: Re: searchio tags
>>>
>>> Jason,
>>>
>>> Thanks for the very rapid response! Excuse the circularity, but
>>> NCBI must now be including this info in the XML format because
>>> RemoteBlast had it in my blast results, or is there a fatal
>>> antipodean past pub time flaw in my logic?
>>>
>>> In any event I'll adopt your suggestion to check back against
>>> SeqIO (and check out DB Cache, that's a new one for me...).
>>>
>>> Is there a method in SeqIO that allows one to go straight to a
>>> particular region of the sequence (ie something like "$seq-
>>> >features_after($curr_hsp->start('hit')") or do I have to iterate
>>> through all the features until $seq->current-feature->start >=
>>> $curr_hsp->start('hit') if you'll excuse (and understand) the
>>> pseudo-bioperl code?
>>>
>>> Regards,
>>> Liam.
>>>
>>> On 04/11/2008, at 5:07 PM, Jason Stajich wrote:
>>>
>>>> Liam -
>>>>
>>>> Sorry - we don't parse this out of the report -- that information
>>>> is only something that is included in the online-only BLAST
>>>> report and is not a standard part of the format. The RemoteBlast
>>>> will typically get the XML-only version of the report which to my
>>>> knowledge does not contain this information. If it is now being
>>>> included the parser could be updated to also parse this out, but
>>>> I am not sure that it is.
>>>> Sorry - this is just a limit of how the data has been
>>>> traditionally available and NCBI makes this kind of thing
>>>> available only in an HTML form that is not-standardized.
>>>>
>>>> Your best bet is to get the sequence accession for your hit, then
>>>> obtain the sequence as a genbank record from remote genbank (and/
>>>> or also cache this sequence in your local DB, there is a DB Cache
>>>> in BioPerl for just this sort of thing) or a local genbank - then
>>>> check and see if your HIT location is close to any of the
>>>> features you are interested in from the annotation of the record.
>>>>
>>>> Hope that gets you pointed in the right direction, may be worth
>>>> sending a note to the bioperl list as well to see if other people
>>>> have similar needs and maybe a better custom solution can be
>>>> cooked up.
>>>>
>>>> Best wishes,
>>>> -jason
>>>> On Nov 3, 2008, at 9:00 PM, Liam Elbourne wrote:
>>>>
>>>>> Hi Jason,
>>>>>
>>>>> I'm trying to parse a bunch of blast reports produced by
>>>>> Bio::Tools::Run::RemoteBlast using SearchIO. Everything is fine
>>>>> except that the:
>>>>>
>>>>> "Features in this part of subject sequence:"
>>>>>
>>>>> information doesn't seem to included in any of the objects
>>>>> produced by SearchIO - the only method I could find that looked
>>>>> vaguely promising was the locus() method for
>>>>> Bio::Search::Hit::HitI, but this isn't initialised.
>>>>>
>>>>> I'm trying to match a set of primers against the genome they
>>>>> were designed to target to check that they are associated with
>>>>> the loci they were intended to targeted (as it appears in
>>>>> retrospect that they have been 'mislabeled'), and if not, which
>>>>> loci they are near or in. I've included an example blast report
>>>>> at the end of the email, where the information I'm still trying
>>>>> to extract (everything else being extractable) is "hypothetical
>>>>> protein".
>>>>>
>>>>> Sorry to trouble you with such a trivial issue, if you are not
>>>>> the right person to contact (and I appreciate that this is not a
>>>>> bug as such, except so far as it is a 'bug' in my knowledge of
>>>>> bioperl!) please tell me who would be. I've attached the code
>>>>> I'm using too for what that is worth.
>>>>>
>>>>>
>>>>> <blast_parse.pl>
>>>>>
>>>>> Regards,
>>>>> Liam Elbourne.
>>>>> Research Fellow
>>>>> Paulsen Lab
>>>>> Macquarie University
>>>>> Sydney Australia.
>>>>>
>>>>> blast report below
>>>>> ****************************
>>>>> BLASTN 2.2.18+
>>>>> Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro
>>>>> A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and
>>>>> David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new
>>>>> generation of protein database search programs", Nucleic
>>>>> Acids Res. 25:3389-3402.
>>>>>
>>>>>
>>>>> RID: GZ1WTPWW016
>>>>>
>>>>>
>>>>> Database: NCBI Genomic Reference Sequences
>>>>> 11,340 sequences; 45,795,413,781 total letters
>>>>>
>>>>> Query= BR0042_r
>>>>> Length=25
>>>>>
>>>>>
>>>>>
>>>>> Score E
>>>>> Sequences producing significant
>>>>> alignments: (Bits) Value
>>>>>
>>>>> ref|NC_004310.3| Brucella suis 1330 chromosome I, complete
>>>>> se... 46.4 3e-07
>>>>>
>>>>> ALIGNMENTS
>>>>> >ref|NC_004310.3| Brucella suis 1330 chromosome I, complete
>>>>> sequence
>>>>> gb|AE014291.4| Brucella suis 1330 chromosome I, complete sequence
>>>>> Length=2107794
>>>>>
>>>>> Features in this part of subject sequence:
>>>>> hypothetical protein
>>>>>
>>>>> Score = 46.4 bits (50), Expect = 3e-07
>>>>> Identities = 25/25 (100%), Gaps = 0/25 (0%)
>>>>> Strand=Plus/Minus
>>>>>
>>>>> Query 1 GTTTTTCGAGCCGCCCTTTTTGCCC 25
>>>>> |||||||||||||||||||||||||
>>>>> Sbjct 494130 GTTTTTCGAGCCGCCCTTTTTGCCC 494106
>>>>>
>>>>>
>>>>> Database: NCBI Genomic Reference Sequences
>>>>> Posted date: Nov 2, 2008 5:47 PM
>>>>> Number of letters in database: 3,315,177
>>>>> Number of sequences in database: 2
>>>>>
>>>>> Lambda K H
>>>>> 0.634 0.408 0.912
>>>>> Gapped
>>>>> Lambda K H
>>>>> 0.634 0.408 0.912
>>>>> Matrix: blastn matrix:2 -3
>>>>> Gap Penalties: Existence: 5, Extension: 2
>>>>> Number of Sequences: 2
>>>>> Number of Hits to DB: 0
>>>>> Number of extensions: 0
>>>>> Number of successful extensions: 0
>>>>> Number of sequences better than 1e-05: 0
>>>>> Number of HSP's better than 1e-05 without gapping: 0
>>>>> Number of HSP's gapped: 0
>>>>> Number of HSP's successfully gapped: 0
>>>>> Length of query: 25
>>>>> Length of database: 3315177
>>>>> Length adjustment: 18
>>>>> Effective length of query: 7
>>>>> Effective length of database: 3315141
>>>>> Effective search space: 23205987
>>>>> Effective search space used: 23205987
>>>>> A: 0
>>>>> X1: 22 (20.1 bits)
>>>>> X2: 33 (29.8 bits)
>>>>> X3: 110 (99.2 bits)
>>>>> S1: 33 (31.0 bits)
>>>>> S2: 45 (41.9 bits)
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>> Jason Stajich
>>>> jason at bioperl.org
>>>>
>>>>
>>>>
>>>
>
> Christopher Fields
> Postdoctoral Researcher
> Lab of Dr. Marie-Claude Hofmann
> College of Veterinary Medicine
> University of Illinois Urbana-Champaign
>
>
>
>
More information about the Bioperl-l
mailing list