[Bioperl-l] matching miRNAs to one or a lot of mRNAs

Ian Korf ik1 at sanger.ac.uk
Sun Sep 28 02:28:39 EDT 2003


Some more reasons to use WU-BLAST:

(1) You can use a nucleotide scoring matrix rather than simple 
match-mismatch values. The main reason for doing this is to make GC and 
GU both positive scoring. You might also make the various match and 
mismatch values a little different based on observed properties of 
matches and mismatches in complementary RNAs.

(2) You get more control over gap costs. In NCBI-BLAST the gap 
initiation cost must always be greater than 0, so you can't use a 
uniform -1 or -2 for gaps; it must always be something like -2 for the 
first gap position and -1 for the rest (or some such variant). You'd 
have to study RNA alignments a bit to determine if this is really an 
advantage or not.

(3) At shorter word lengths, WU-BLAST switches its extension rules and 
no longer uses random values for ambiguities. Probably not a big deal 
unless you're mining ESTs or you have known RNA modifications like 
inosine. You would, of course, want to modify the scoring matrix if you 
had inosines in your sequence.

(4) As you pointed out, unlike FASTA or SW, BLAST finds multiple 
high-scoring pairs rather than a single maximum scoring pair.
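
As a rough illustration of points (1) and (2), here is a minimal Python 
sketch (not WU-BLAST itself) of a Smith-Waterman local alignment that 
takes an arbitrary nucleotide scoring matrix and a uniform per-position 
gap cost. The function names and the wobble score are assumptions made 
for illustration. In particular, wobble_matrix assumes the query is the 
mRNA and the subject is the miRNA complement, so a G:U pair shows up as 
G over A or T over C.

```python
def identity_matrix(match=1, mismatch=-1, alphabet="ACGT"):
    """Simple match/mismatch scores keyed by (query_base, subject_base)."""
    return {(a, b): (match if a == b else mismatch)
            for a in alphabet for b in alphabet}


def wobble_matrix(match=1, mismatch=-1, wobble=1):
    """Identity matrix plus positive scores for G:U wobble pairs.

    Assumption: query = mRNA, subject = miRNA complement (DNA alphabet).
    mRNA G pairing with miRNA U appears as (G, A); mRNA U(T) pairing
    with miRNA G appears as (T, C). The wobble value is illustrative.
    """
    matrix = identity_matrix(match, mismatch)
    matrix[("G", "A")] = wobble
    matrix[("T", "C")] = wobble
    return matrix


def smith_waterman(query, subject, matrix, gap=-2):
    """Local alignment with a uniform (linear) gap cost.

    Returns (best_score, aligned_query, aligned_subject).
    Bases outside the matrix alphabet (e.g. N) raise KeyError.
    """
    rows, cols = len(query) + 1, len(subject) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + matrix[(query[i - 1], subject[j - 1])]
            up = H[i - 1][j] + gap      # gap in the subject
            left = H[i][j - 1] + gap    # gap in the query
            H[i][j] = max(0, diag, up, left)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    q_aln, s_aln = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + matrix[(query[i - 1], subject[j - 1])]:
            q_aln.append(query[i - 1])
            s_aln.append(subject[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            q_aln.append(query[i - 1])
            s_aln.append("-")
            i -= 1
        else:
            q_aln.append("-")
            s_aln.append(subject[j - 1])
            j -= 1
    return best, "".join(reversed(q_aln)), "".join(reversed(s_aln))
```

With identity_matrix() and gap=-2, the two sequences from the WU-BLAST 
alignment quoted below (TGGAACTCACTGGAAAGTGA against 
TGGAAATCCCTGGCAATGTGA) score 12, the same as reported there. Passing 
wobble_matrix() instead can only keep or raise that score, which is 
the point of making the wobble pairs positive.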

Some things to watch out for:

(a) The default setting for E2 may be too low and may need to be raised 
(e.g. to 100). You might also want to raise E if your database is 
large. Make sure the statistics aren't getting in the way of finding 
your true positives.

(b) It might take a long time to do the search. I'd recommend using 
hitdist=20 and W=4. W=5 seems a bit of a stretch to me, but W=4 on its 
own might be too sensitive. Requiring the second hit is a good idea, 
and lots of RNA structures have short symmetric bubbles, so this seems 
like it would work.

(c) You don't want Sum statistics because you're really looking for 
complete matches. Use the -kap option to turn off combined stats. This 
will save some compute time as well.

(d) Don't forget to pick up a copy of the O'Reilly BLAST book which has 
quite a bit of info on BLAST in general (sorry for the shameless plug).
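
The two-hit requirement in (b) can be sketched in a few lines of 
Python. This is a simplified illustration of the idea, not WU-BLAST's 
actual implementation: it assumes both word hits must lie on the same 
diagonal, and that hitdist is measured between the start positions of 
the two words. The function names are hypothetical.

```python
from collections import defaultdict


def word_hits(query, subject, w):
    """Yield (query_pos, subject_pos) for every exact word match of length w."""
    index = defaultdict(list)
    for j in range(len(subject) - w + 1):
        index[subject[j:j + w]].append(j)
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], ()):
            yield i, j


def two_hit_seeds(query, subject, w=4, hitdist=20):
    """Return sorted (diagonal, first_query_pos, second_query_pos) tuples.

    A seed is a pair of non-overlapping word hits on the same diagonal
    whose start positions are at most hitdist apart (assumed definition).
    """
    by_diag = defaultdict(list)
    for i, j in word_hits(query, subject, w):
        by_diag[j - i].append(i)

    seeds = []
    for diag, positions in by_diag.items():
        positions.sort()
        for a_idx, a in enumerate(positions):
            for b in positions[a_idx + 1:]:
                # b - a >= w keeps the two words from overlapping
                if w <= b - a <= hitdist:
                    seeds.append((diag, a, b))
    return sorted(seeds)
```

A short symmetric bubble leaves both word hits on the same diagonal, so 
two_hit_seeds("ACGTTTTTGGCA", "ACGTAAAAGGCA") reports one seed. If the 
bubble is asymmetric (say the subject is "ACGTAAAGGCA"), the second hit 
shifts to a neighboring diagonal and this sketch reports nothing.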

-Ian

On Saturday, September 27, 2003, at 10:17 PM, Peter Stoilov wrote:

> Hi,
>
> they indeed mixed up the transcripts, but to me it looks like an honest
> mistake. There seems to be another unrelated transcript with the same
> name (HES1). The accession for this transcript is NM_004649. All of the
> experiments in the paper with the exception of the ELISA are done on
> the wrong gene (ugly). The funny stuff doesn't end here. Using a FASTA
> search I was able to find one miRNA (hsa-miR-221) that matches the
> transcript (NM_004649) at exactly the same spot much better than
> hsa-miR-23(b).
>
> hsa-miR-221 vs Homo sapiens chromosome 21 open reading frame 33
> (C21orf33), mRNA
> Matches 19 of 23
> 23      CUUUGGGUCGUCUGUUACAU-CG   2
>         :.::...: ..:: :.:: : :.
> 873     GGAACUCACUGGAAAGUG-ACGC   894
>
>
>
> As for the real HES1, hsa-miR-205 and hsa-miR-221 match much better
> than hsa-miR-23b.
>
> hsa-miR-205 vs Homo sapiens hairy and enhancer of split 1,
> (Drosophila) (HES1), mRNA
> Matches 18 of 22
> 21      UCUGAGGCCACCUU-ACUUCCU   1
>         ::..  .:: :::: :::.::.
> 1062    AGGCCGUGGCGGAACUGAGGGG   1083
>
> hsa-miR-221 vs Homo sapiens hairy and enhancer of split 1,
> (Drosophila) (HES1), mRNA
> Matches 21 of 23
> 23      CUUUGG-GUCGUC-UG-UUACAUCGA   1
>         ::.... ..:..: :. .: : .:.:
> 1061    GAGGCCGUGGCGGAACUGAGGGGGCU   1086
>
>
> I'll write to the authors to see what they think about this.
>
>
> Now about searching for miRNA targets. The smallest word size that I
> can use in BLAST is 7 for nucleic acid (thanks for the WU-BLAST idea!),
> so I had to go with FASTA. Now FASTA reports only one hit (should I say
> HSP?) per sequence. The way I get around this is to generate multiple
> sequences for each transcript, in which the transcript except for 35 nt
> is masked with Ns. The unmasked regions are tiled with a 5 nt step
> (30 nt overlap). The problem with this is that the database size gets
> completely out of hand and will not fit on my hard drive ;). Searching
> the database takes forever. But when I do it for individual transcripts
> it works pretty well.
>
> Peter
>
> On Saturday 27 September 2003 03:45, Ian Korf wrote:
>> I don't think the human HES1 8400709 is the sequence from the paper.
>> If you align the sequence in figure 1a against 8400709, you'll
>> find they don't match. There are other HES1 sequences in GenBank
>> though, for example, 1655593, that contain the sequence in the figure.
>> But if you try aligning the miRNA to 1655593 with NCBI-BLAST, you
>> won't find anything.
>>
>> If you do a S-W alignment (match +1, mismatch -1, gap -2) of the miRNA
>> complement against 1655593 you get the following, which is the same
>> alignment reported in the paper.
>>
>> Stats: score=12
>> Alignment: Q:855..874 S:1..21 17/3 1,0
>> Q: TGGAACTCACTGG-AAAGTGA
>>
>> S: TGGAAATCCCTGGCAATGTGA
>>
>> You'll note that the largest ungapped alignment is 5nt. The authors
>> did not say they used BLAST, only that they searched GenBank. 5nt is
>> too short for NCBI-BLAST, which has a minimum word size of 7. WU-BLAST
>> has no limit on word size, and you can find the alignment with
>> WU-BLAST. The same scoring system as above was used here, but note
>> that E2 had to be raised to at least 11 or the alignment would get
>> pruned before being subjected to gapped statistics. Here it is:
>>
>>   Score = 12 (17.3 bits), Expect = 0.037, P = 0.037
>>   Identities = 17/21 (80%), Positives = 17/21 (80%), Strand = Plus / Plus
>>
>> Query:   855 TGGAACTCACTGGAAA-GTGA 874
>>
>> Sbjct:     1 TGGAAATCCCTGGCAATGTGA 21
>>
>> If you make a habit of such searches, don't be surprised if you run
>> into a lot of false positives. I think you might want to use additional
>> criteria such as overlapping the stop codon or being located in the
>> 3' UTR. I'm not aware of any software specifically designed for such
>> searches, but perhaps the authors of the paper have one. The paper was
>> very brief and had no description of the bioinformatics in the methods
>> section (if I was one of the referees, I would have found this
>> unacceptable). I suggest you contact the authors and find out
>> specifically what they did.
>>
>> -Ian
>>
>> On Friday, September 26, 2003, at 07:29 PM, Starr Hazard wrote:
>>> Folks,
>>>
>>> In a recent paper, Kawasaki et al. (PubMed 12808467) report on the
>>> interaction between a specific miRNA (human miRNA23 g.i. 17646028)
>>> and a specific mRNA (human HES1 g.i. 8400709). They suggest they did
>>> a BLAST search and ultimately located the interaction. I cannot
>>> duplicate their data mining and cannot find the association they
>>> describe.
>>>
>>> In general, is there a way to take a library of miRNAs and evaluate
>>> their potential interaction with a particular mRNA? Or is there a
>>> data mining tool that could screen a large pool of mRNAs for
>>> significant interactions with a pool of miRNAs?
>>>
>>> I cannot at present see any BioPerl tools that address this issue
>>> (right now that means I scanned the FAQ for the string RNA and
>>> searched the BioPerl site for RNA but found only some traffic about
>>> Seq.pm). The people I have asked seem divided about whether this is
>>> a text-matching issue or more of a hybridization issue involving
>>> an energy of interaction evaluation.
>>>
>>> Anybody got any pointers to offer?
>>>
>>> Starr
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at portal.open-bio.org
>>> http://portal.open-bio.org/mailman/listinfo/bioperl-l


