[Bioperl-l] whole genome annotation

Fri Jul 28 12:21:46 UTC 2006

-----Original Message-----
From: Sean Davis [mailto:sdavis2 at mail.nih.gov]
Sent: Fri 7/28/2006 12:59
To: Richard Birnie
Cc: bioperl-l at lists.open-bio.org
Subject: Re: [Bioperl-l] whole genome annotation

Richard Birnie wrote:
> Hello all,
> 
> I'm just trying to familiarise myself with BioPerl and I'm a little overwhelmed by the sheer volume of information available on the wiki. I'm hoping someone can point me in the right direction through the labyrinth. This may become a little long-winded but I'll try to get all the annoying newbie questions out of the way in one go.
> 
> Let me try to explain what I'm aiming for. I have some CGH data downloaded from the Progenetix database (http://www.progenetix.de/~pgscripts/progenetix/Aboutprogenetix.html). The data are simplified to record only gain/loss/amplification of whole chromosome bands at 862-band resolution, to facilitate combining data from multiple different studies.
> 
> What I'd like to be able to do is download a copy of the human genome sequence with annotation describing the locations of chromosome bands and, preferably, of known genes. I then want to be able to manipulate the genome data based on the CGH data to mimic deletions. The ultimate goal is to feed the manipulated genome data into a program (metashark) that predicts the structure of metabolic networks from genome annotation compared to a reference genome (in this case a complete 'normal' human genome) and see what effect the deletions have on the metabolic pathways.
> 
> I appreciate that is a bit vague but that's sort of my problem: I'm not really a bioinformatician, so I'm not sure of the details of what I want. I just happen to have a question to answer and bioperl seems the way to go (for this project and more generally). I've started looking at the HOWTOs and read the main bioperl tutorial. I also looked at the CGL comparative genomics library but I haven't penetrated far into that yet. I'm OK with basic perl, although not much object-oriented stuff. I don't really have much experience with handling sequence data on a whole-genome scale either; a few genbank files for my favourite genes is fine, but I need some guidance to work on this scale.
> 
> What I'm looking for is someone to give me a start. I'd greatly appreciate it if someone could spell out the general steps for downloading a complete copy of the human genome and its annotations (if this is even a feasible approach) and how to put it all together. Not actual code, just the general concept for each step and which tools from the bioperl set would be most appropriate for each step, so that I can focus on what I need to read about; even a little pseudo-code if I'm lucky. If I can get the genome data downloaded and set up properly I'll work out how to apply the CGH data to it myself.
> 
> If example code for what I'm trying to describe is included somewhere, great; could someone point me to it?

Hi, Richard.

Bioperl is good for many things, but for simply grabbing all the 
locations of human genes in the genome and chromosome band locations, I 
wouldn't use bioperl.  It sounds to me like you are interested in 
getting the genes associated with each chromosomal band?  If so, just 
download the cytoBand.txt and refFlat.txt files from the UCSC genome 
browser site.  cytoBand.txt contains the base pair locations for each of 
the cytobands.  refFlat.txt contains the base pair locations of "refseq" 
genes.  It is then simply a matter of finding overlapping regions (genes 
with cytobands) to determine which genes are in which cytobands.  Since 
the files are tab-delimited text, they are very easy to work with (in 
perl, excel, python, ...).  Don't get me wrong--I really appreciate the 
power of bioperl, but in this case, your task lends itself to a simpler 
(and MUCH faster) approach.
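
As a rough illustration of the overlap step, here is a minimal Python sketch. 
It assumes the usual UCSC column layouts (cytoBand.txt: chrom, chromStart, 
chromEnd, name, gieStain; refFlat.txt: geneName, name, chrom, strand, 
txStart, txEnd, ...), which should be checked against the table schema on 
the UCSC site before relying on it. The helper names (read_bands, 
genes_by_band) and the toy rows are invented for the example; real input 
would come from the downloaded files.

```python
import csv
import io
from collections import defaultdict

# Toy stand-ins for the two UCSC files. Names and coordinates are invented
# for illustration; they are NOT real genome annotation.
CYTOBAND_TEXT = (
    "chr8\t0\t1000\tp11\tgneg\n"
    "chr8\t1000\t2500\tq13.1\tgpos50\n"
)
REFFLAT_TEXT = (
    "GENE_A\tNM_000001\tchr8\t+\t100\t400\t120\t380\t1\t100,\t400,\n"
    "GENE_B\tNM_000002\tchr8\t-\t900\t1200\t950\t1150\t1\t900,\t1200,\n"
    "GENE_C\tNM_000003\tchr8\t+\t1500\t2000\t1550\t1950\t1\t1500,\t2000,\n"
)


def read_bands(handle):
    """Return {chrom: [(start, end, band_name), ...]} from cytoBand-style rows."""
    bands = defaultdict(list)
    for chrom, start, end, name, _stain in csv.reader(handle, delimiter="\t"):
        bands[chrom].append((int(start), int(end), name))
    return bands


def genes_by_band(band_handle, gene_handle):
    """Return {(chrom, band_name): set of gene names overlapping that band}."""
    bands = read_bands(band_handle)
    result = defaultdict(set)
    for row in csv.reader(gene_handle, delimiter="\t"):
        # Assumed refFlat columns: 0 = geneName, 2 = chrom, 4 = txStart, 5 = txEnd
        gene, chrom = row[0], row[2]
        tx_start, tx_end = int(row[4]), int(row[5])
        for start, end, name in bands.get(chrom, []):
            # Half-open intervals overlap when each starts before the other ends.
            if tx_start < end and tx_end > start:
                result[(chrom, name)].add(gene)
    return result


mapping = genes_by_band(io.StringIO(CYTOBAND_TEXT), io.StringIO(REFFLAT_TEXT))
for (chrom, band), genes in sorted(mapping.items()):
    print(chrom, band, ",".join(sorted(genes)))
```

With real data, the io.StringIO objects would be replaced by open() calls on 
the downloaded files. A gene that spans a band boundary is reported under 
both bands.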

Sean

Thanks for the response Sean,

Getting the genes associated with each band is certainly part of what I need and your suggestion will help with that. I did look at the UCSC site but, as you know, there is such a volume of info on there that I didn't really know which files I needed.

However, my main goal requires slightly more. What I want to be able to do is take the chromosomal band annotation info and compare that against the CGH data I have. From this I'd then like to be able to say "OK, band 8q13.1 (or whatever) is deleted, so make a copy of the genome with the actual sequence associated with that band removed." Then I could feed both sequences into metashark, which predicts the structure of metabolic pathways based on genome annotation, and see what effect deleting that region of DNA has on the structure of the metabolic network. Knowing which genes are involved is necessary for identifying the important components within the region. Are there tools in Bioperl for making this comparison? It can probably be reduced to a straight comparison of data structures, so I may just use regular perl for this part unless there is something designed for the purpose.
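
To make the "remove the sequence for a deleted band" step concrete, here is a minimal Python sketch of the idea, not a tested pipeline. It assumes band coordinates are 0-based, half-open (start, end) pairs on a single chromosome held in memory as a string; the function name delete_regions, the band labels and the toy sequence are all invented for illustration.

```python
def delete_regions(sequence, regions):
    """Return a copy of sequence with every (start, end) region cut out.

    Regions are 0-based, half-open intervals; overlapping or adjacent
    regions are handled by walking them in sorted order.
    """
    kept = []
    cursor = 0  # first position not yet copied or deleted
    for start, end in sorted(regions):
        if start > cursor:
            kept.append(sequence[cursor:start])
        cursor = max(cursor, end)
    kept.append(sequence[cursor:])
    return "".join(kept)


# Toy chromosome and two hypothetical "lost" bands taken from CGH calls.
chrom_seq = "AAAACCCCGGGGTTTT"
lost_bands = {"bandX": (4, 8), "bandY": (10, 12)}

mutant = delete_regions(chrom_seq, lost_bands.values())
print(mutant)
```

One consequence of this approach is that every feature downstream of a deleted band shifts left by the deleted length, so any gene annotation passed along with the edited sequence would need its coordinates adjusted to match.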

The thing I was struggling with was how to store and manipulate genomic sequence data in such quantities. Since this morning I've had a better look at the CGL library and its associated datastore module. I think I can do it using these tools, but I'm having a few dependency issues getting it installed right now, so I'll go back to wrestling with that.

regards,
Richard