[Bioperl-l] [Gmod-gbrowse] scores in Bio::DB::BigBed

Daniel Lang Daniel.Lang at biologie.uni-freiburg.de
Wed Jul 6 07:54:57 UTC 2011


Hi all,

thanks a lot for your input on this!

I want to explore the repeat structure of our model genome derived by
lastz self-alignments (using %id as score).
Since this is a HUGE file and I initially wanted to be able to access
the information for individual repeat regions in gbrowse as well, I
wanted to use BigBed. Now that I have the data in hand, it seems not to
be such a good idea after all, since the resulting repeat graph is much
more complex than I expected. So summarizing by score and/or coverage
will do just fine ;-)

But since they are repeats, they overlap. So if I see it correctly,
BigWig/BedGraph aren't an option. Due to the size limitations, I have
not stored the individual CIGAR strings that I could use to generate
full-blown SAM files. Or can I use BAM without sequence/qual data?

Or is there an existing tool that would allow me to collapse overlapping
ranges with average scores for use in BigWig?
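
For illustration, the collapsing step asked about here can be sketched
in a few lines of Python. This is only a minimal sketch, not an existing
tool: the function names collapse_to_bedgraph and to_bedgraph_lines are
made up, the input is assumed to be (chrom, start, end, score) tuples
with 0-based half-open coordinates as in BED, and "average" is assumed
to mean the plain mean of the scores of all features covering a
position. The output segments do not overlap and are sorted by
chromosome name and start, which is the shape a BedGraph file needs
before it is handed to bedGraphToBigWig.

```python
from collections import defaultdict


def collapse_to_bedgraph(intervals):
    """Collapse overlapping scored intervals into disjoint segments.

    intervals: iterable of (chrom, start, end, score), 0-based half-open.
    Returns a list of (chrom, start, end, mean_score) tuples, where
    mean_score is the mean score of all input intervals covering the
    segment. Positions covered by no interval are left out.
    """
    # One "open" (+1) and one "close" (-1) event per interval.
    events = defaultdict(list)
    for chrom, start, end, score in intervals:
        if end <= start:
            continue  # skip empty or inverted ranges
        events[chrom].append((start, 1, score))
        events[chrom].append((end, -1, score))

    segments = []
    for chrom in sorted(events):
        depth = 0      # number of intervals covering the current position
        total = 0.0    # sum of their scores
        prev = None    # position of the previous event
        for pos, delta, score in sorted(events[chrom]):
            if depth > 0 and pos > prev:
                mean = total / depth
                last = segments[-1] if segments else None
                if last and last[0] == chrom and last[2] == prev and last[3] == mean:
                    # Extend the previous segment if it is adjacent
                    # and has the same value.
                    segments[-1] = (chrom, last[1], pos, mean)
                else:
                    segments.append((chrom, prev, pos, mean))
            depth += delta
            total += delta * score
            if depth == 0:
                total = 0.0  # avoid accumulating floating-point residue
            prev = pos
    return segments


def to_bedgraph_lines(segments):
    """Format segments as tab-separated BedGraph data lines."""
    return [f"{c}\t{s}\t{e}\t{v:g}" for c, s, e, v in segments]
```

With two hits chr1:0-10 (score 80) and chr1:5-15 (score 90), the
overlap 5-10 ends up as its own segment carrying the mean of both
scores, and the flanks keep their original scores. For a really large
file one would probably stream per chromosome instead of holding all
events in memory.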

Otherwise, I'll have to live with the coverage graphs for visualization
in gbrowse and use Bio::DB::BigBed::features to look at conservation
score at individual loci.

Chris, the proposed BP page would be extremely helpful :-D

Again, thanks a lot!

Best,
Daniel

Am 04.07.2011 18:10, schrieb Chris Fields:
> I generally follow these rules where I want a common set of possibly volatile features (e.g. specific transcriptome analysis) separate from my main 'stable' feature database (e.g. gene models):
> 
> 1) BigBed - lightweight bundle of simple features where the ranges may overlap, but I'm not concerned about score.  I have found BED/BigBed scores to be of limited use in most cases unless I scale the data (since they must be integer values from 0 to 1000).  Document it very well if you do any scaling! YMMV
> 
> 2) SAM/BAM - bundle of (possibly overlapping) features where summary stats are needed.  I've seen these used for BLAST/BLAT runs, etc.
> 
> 3) BigWig - quantitative data of fixed or varying ranges covering entire genome, ranges can't overlap
> 
> 4) BedGraph - quantitative sparse data, ranges can't overlap (these are converted over to BigWig for GBrowse, though)
> 
> 5) Of course, one can also set up separate DB::SF::Store databases as well depending on your needs (I have used both the SQLite and MySQL adaptors for this).
> 
> I think this is almost begging for a 'best practices' chart/table somewhere, maybe a GBrowse 'cookbook' of common data representation cases.
> 
> chris
> 
> On Jul 4, 2011, at 8:22 AM, Lincoln Stein wrote:
> 
>> I had a look at the output of bigBedSummary, which is from Jim Kent's source
>> tree (no Perl involved), and it appears that the statistics it provides are
>> limited to coverage; so I don't think you can do anything with the scores if
>> you're using BigBed indexing. Have a look at BedGraph=>BigWig and see if it
>> meets your needs.
>>
>> Lincoln
>>
>> On Mon, Jul 4, 2011 at 9:04 AM, Lincoln Stein <lincoln.stein at gmail.com> wrote:
>>
>>> Hi Dan,
>>>
>>> The documentation for BigBed is scanty; all I know about the bigbed
>>> library is what is in Jim Kent's bigbed.h include file. I had
>>> thought that the scores in BED files would come through into the summary
>>> statistics like those in BigWig, but now I'm looking at the example data
>>> provided in Jim's source code, and see that the BigBed example source file
>>> has scores of "0".
>>>
>>> I'll investigate whether there is an issue in the Perl layer, but it could
>>> easily be a limitation in the library itself. Have you considered using a
>>> BedGraph file and indexing it with bedGraphToBigWig? I know that the
>>> Bio::DB::BigWig interface works perfectly to retrieve and summarize the
>>> scores.
>>>
>>> Lincoln
>>>
>>>
>>> On Sun, Jul 3, 2011 at 5:48 AM, Daniel Lang <
>>> Daniel.Lang at biologie.uni-freiburg.de> wrote:
>>>
>>>> Hi,
>>>>
>>>> quick question about the BigBed adaptor: Is it correct that the bin and
>>>> summary functions only return statistics about the number of features in
>>>> the defined intervals?
>>>> I was expecting them to deliver statistics about the score if the
>>>> respective bb file has a defined score field.
>>>> If this is true, does this also mean that I cannot plot the distribution
>>>> of scores in BigBed files in gbrowse?
>>>>
>>>> This is the first time I'm using BigBed, maybe I'm doing something
>>>> wrong...
>>>>
>>>> I had some trouble formatting the bed files correctly in order to see
>>>> the score in the features returned by the Bio::DB::BigBed::features()
>>>> routine. It seems the bigbed entries will only have a correctly assigned
>>>> score field if you also provide a non-empty name field. Initially I
>>>> thought that the order of columns is irrelevant if you use an .as file
>>>> in the bedToBigBed call, but that doesn't seem to be the case.
>>>>
>>>> Best,
>>>> Daniel
>>>
>>>
>>>
>>> --
>>> Lincoln D. Stein
>>> Director, Informatics and Biocomputing Platform
>>> Ontario Institute for Cancer Research
>>> 101 College St., Suite 800
>>> Toronto, ON, Canada M5G0A3
>>> 416 673-8514
>>> Assistant: Renata Musa <Renata.Musa at oicr.on.ca>
>>>
>>
> 

-- 

Dr. Daniel Lang
University of Freiburg, Plant Biotechnology
Schaenzlestr. 1, D-79104 Freiburg
fax:        +49 761 203 6945
phone:      +49 761 203 6989
homepage:   http://www.plant-biotech.net/
            http://www.cosmoss.org/
e-mail:     daniel.lang at biologie.uni-freiburg.de

#################################################
My software never has bugs.
It just develops random features.
#################################################





