[Bioperl-l] whole genome annotation

Fri Jul 28 13:28:02 UTC 2006

Richard Birnie wrote:
> 
> -----Original Message-----
> From: Sean Davis [mailto:sdavis2 at mail.nih.gov]
> Sent: Fri 7/28/2006 12:59
> To: Richard Birnie
> Cc: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] whole genome annotation
>  
> Richard Birnie wrote:
> 
>>Hello all,
>>
>>I'm just trying to familiarise myself with BioPerl and I'm a little overwhelmed by the sheer volume of information available on the wiki. I'm hoping someone can point in the right direction through the labyrinth. This may become a little longwinded but I'll try and get all the annoying newbie questions out of the way in one go.  
>>
>>Let me try and explain what I'm aiming for. I have some CGH data downloaded from the Progenetix database (http://www.progenetix.de/~pgscripts/progenetix/Aboutprogenetix.html), this data is  simplified to record simply gain/loss/amplification of whole chromosome bands at 862 band resolution to facilitate the combination of data from multiple different studies.
>>
>>What I'd like to be able to do is download a copy of the human genome sequence with annotation describing the locations of chromosome bands and preferably of known genes. I then want to be able to manipulate the genome data based on the CGH data to mimic deletions. The ultimate goal of this is to be able to feed the manipulated genome data into a program (metashark) that predicts the structure of metabolic networks based on genome annotation compared to a reference genome, in this case a complete 'normal' human genome and see what effect that has on the metabolic pathways.
>>
>>I appreciate that is a bit vague but thats sort of my problem, I'm not a bioinformatician really so I'm not sue of the details of what I want. I just happen to have an question to answer and bioperl seems the way to go (for this project and more generally). I've started looking at the HOWTOs and read the main bioperl tutorial. I also looked at the CGL comparative genomics library but I haven't penetrated far into that yet. I'm ok with basic perl although not much object oriented stuff. I don't really have much experience with handling sequence data on a whole genome scale either, a few genbank files for my favourite genes is fine but I need some guidance to work on this scale. 
>>
>>What I'm looking for is someone to give me a start. I'd greatly appreciate it if someone could spell out the general steps for downloading a complete copy of the human genome and its annotations (if this is even a feasible approach) and how to put it all together. Not actual code just the general concept for each step and which tools from the bioperl set would be most appropriate for each step so that I can focus what I need to read about, even a little pseudo-code if I'm lucky. If I can get the genome data downloaded and setup properly I'll work out how to apply the CGH data to it myself. 
>>
>>If example code for what I'm trying to describe is included somewhere, great could someone point to where.
> 
> 
> Hi, Richard.
> 
> Bioperl is good for many things, but for simply grabbing all the 
> locations of human genes in the genome and chromosome band locations, I 
> wouldn't use bioperl.  It sounds to me like you are interested in 
> getting the genes associated with each chromosomal band?  If so, just 
> download the cytoband.txt and refFlat.txt files from the UCSC genome 
> browser site.  cytoband.txt contains the base pair locations for each of 
> the cytobands.  refFlat.txt contains the base pair locations of "refseq" 
> genes.  It is then simply a matter of finding overlapping regions (genes 
> with cytobands) to determine which genes are in which cytobands.  Since 
> the files are tab-delimited text, they are very easy to work with (in 
> perl, excel, python, ...).  Don't get me wrong--I really appreciate the 
> power of bioperl, but in this case, your task lends itself to a simpler 
> (and MUCH) faster approach.
> 
> Sean
> 
> Thanks for the response Sean,
> 
> getting the genes associated with each band is certainly part of what I need and your suggestion will help with that. I did look at the UCSC site but as you know there is such a volume of info on there I didn't really know which files I needed. 
> 
> However my main goal requires slightly more. What I want to be able to do is take the chromosomal band annotation info and compare that against the CGH data I have. From this I'd like to then be able say "OK band 8q13.1 (or whatever) is deleted, so make a copy of the genome with the actual sequence associated with that band removed." Then I could feed both sequences into metashark which predicts the structure of metabolic pathways based on genome annotation and see what effect deleting that region of DNA has on the structure of the metabolic network. Knowing which genes are involved is necessary for identifying what are the important components within the region. Are there tools in Bioperl for making this comparison? It can probably be reduced to a straight comparison of data structures so I may just use regular perl for this part unless there is anything designed for purpose.
> 
> The thing I was struggling with was how to store and manipulate genomic sequence data in such quantities. Since this morning I've had a better look at the CGL library and associated datastore module, I think I can do it using these tools but I'm having a few dependency issues getting it installed right now.  So I'll go back to wrestling with that.

Ahh.  I see.  Metashark actually searches the remaining sequence in the 
human genome?  If that is the case, then you need the start and end 
positions of the chromosomal bands, which you can download from the ucsc 
genome browser.  Follow the links to download and then to the genome of 
your choice and finally get the chromband.txt file.  The other piece of 
the puzzle is the bio::DB::Fasta module.  It allows extremely fast 
access to a set of fasta files, which it first indexes.  Here is the 
documentation for it:

http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/DB/Fasta.html

You could imagine making a hash indexed by chromosome band of a hash of 
starts and ends for each band.  For each CGH experiment, find those 
regions that are deleted.  Exclude those when looping through all the 
chromosome bands, pulling the sequence using Bio::DB::Fasta for each 
band and writing that to a file for metashark.

Sean