[BioRuby] GFF3

Wed Aug 18 01:09:06 UTC 2010

Hi,

> Thanks for the nice example. It shows how you can filter GFF without
> storing everything in memory. Naturally that does not work for
> extracting all transcripts as GFF does not guarantee ordered data.

I think the code does not depend on the order of the GFF file.
All exons are stored in an array holding the exons that belong to
the mRNA.
The output order of the exons depends on the input GFF, but in this
case the output order does not need to be specified.
I could insert exonary.sort{ some ordering rule }
before exonary.each{} if the output order matters.
(Since this program was not meant to persist for a long time and there
was sufficient memory, I made no effort to keep the memory usage low.)
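
To illustrate the idea, here is a minimal sketch of order-independent
grouping. It is not the actual gff2easytrack.rb code: the helper names
parse_attributes and exons_by_mrna are made up for this example, and it
assumes tab-separated GFF3 lines with a Parent attribute on each exon.

```ruby
# Hypothetical helper: split a GFF3 attribute column ("ID=x;Parent=y")
# into a Hash. URL-escaped values are not decoded in this sketch.
def parse_attributes(field)
  field.split(';').each_with_object({}) do |pair, attrs|
    key, value = pair.split('=', 2)
    next if key.nil? || value.nil?
    attrs[key.strip] = value.strip
  end
end

# Hypothetical helper: collect [start, end] of every exon under each
# parent ID. The result does not depend on where the exon lines appear.
def exons_by_mrna(lines)
  exons = Hash.new { |hash, key| hash[key] = [] }
  lines.each do |line|
    next if line.start_with?('#') || line.strip.empty?
    cols = line.chomp.split("\t")
    next unless cols.size == 9 && cols[2] == 'exon'
    parents = parse_attributes(cols[8])['Parent'].to_s.split(',')
    parents.each { |parent| exons[parent] << [cols[3].to_i, cols[4].to_i] }
  end
  exons
end

# The exons arrive out of order and around the mRNA line on purpose.
gff = [
  "Chr1\tTAIR9\texon\t3996\t4276\t.\t+\t.\tParent=AT1G01010.1",
  "Chr1\tTAIR9\tmRNA\t3631\t5899\t.\t+\t.\tID=AT1G01010.1;Parent=AT1G01010",
  "Chr1\tTAIR9\texon\t3631\t3913\t.\t+\t.\tParent=AT1G01010.1"
]

exons_by_mrna(gff).each do |mrna, exonary|
  # Sorting here is only needed when the output order matters.
  puts "#{mrna}: #{exonary.sort.inspect}"
end
```

The price is that all exons stay in memory until the whole file has
been read, which was acceptable in my case.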

> I am thinking we can assume that related data comes with each other.

The nature of genes and genomes is not so simple.
You can read up on trans-splicing: unlinked parts of the genome
can form a mature mRNA and the protein derived from it.
If these parts are collected close together in the GFF file, then
positional order is not preserved.  If the GFF is sorted by position,
the parts end up far apart in the file.

> share parts between genes,

As for shared parts between genes, microRNA genes are frequently
located in introns or exons of other genes.  Also, in compact genomes,
quite a number of genes have overlapping UTRs.
In chloroplast genomes, even overlapping CDSs are known.

> Question, have we ever seen GFF files that are not ordered?

I've never seen an unordered GFF file, but there can be different
orders.
1. The lines are simply sorted by location.

2. Genes are ordered and the parts of each gene come together.
For example, the Arabidopsis GFF file looks like this, and you can see
that the features themselves are not ordered by position: the protein
starting at 3760 comes before the exon starting at 3631.

Chr1    TAIR9   gene    3631    5899    .       +       .       ID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010
Chr1    TAIR9   mRNA    3631    5899    .       +       .       ID=AT1G01010.1;Parent=AT1G01010;Name=AT1G01010.1;Index=1
Chr1    TAIR9   protein 3760    5630    .       +       .       ID=AT1G01010.1-Protein;Name=AT1G01010.1;Derives_from=AT1G01010.1
Chr1    TAIR9   exon    3631    3913    .       +       .       Parent=AT1G01010.1
Chr1    TAIR9   five_prime_UTR  3631    3759    .       +       .       Parent=AT1G01010.1
Chr1    TAIR9   CDS     3760    3913    .       +       0       Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1    TAIR9   exon    3996    4276    .       +       .       Parent=AT1G01010.1
Chr1    TAIR9   CDS     3996    4276    .       +       2       Parent=AT1G01010.1,AT1G01010.1-Protein;
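
A rough way to tell the two orders apart is to check whether the start
coordinate ever decreases within one sequence. The following is only a
sketch under the assumption of tab-separated columns; sorted_by_start?
is a name I made up here, not part of BioRuby.

```ruby
# Hypothetical check: true when, within each seqid, the start column
# (column 4) never decreases from one feature line to the next.
def sorted_by_start?(lines)
  last_start = {}
  lines.each do |line|
    next if line.start_with?('#') || line.strip.empty?
    cols = line.chomp.split("\t")
    next if cols.size < 5
    seqid = cols[0]
    start = cols[3].to_i
    return false if last_start.key?(seqid) && start < last_start[seqid]
    last_start[seqid] = start
  end
  true
end

# First four features of the Arabidopsis excerpt (attributes shortened).
tair = [
  "Chr1\tTAIR9\tgene\t3631\t5899\t.\t+\t.\tID=AT1G01010",
  "Chr1\tTAIR9\tmRNA\t3631\t5899\t.\t+\t.\tID=AT1G01010.1;Parent=AT1G01010",
  "Chr1\tTAIR9\tprotein\t3760\t5630\t.\t+\t.\tID=AT1G01010.1-Protein",
  "Chr1\tTAIR9\texon\t3631\t3913\t.\t+\t.\tParent=AT1G01010.1"
]

puts sorted_by_start?(tair)
```

For these lines the check fails at the exon, because its start (3631)
is smaller than that of the preceding protein line (3760).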

> It makes so much sense to keep genes and their components together.

I think GFF is an exchange format rather than something to work with
directly in parts.
The data can be stored in an RDB and extracted from it relatively
easily.
An index in the RDB allows fast identification of all features in a
specified region or gene. That subset is good to work with.
-- 
Tomoaki NISHIYAMA

Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan

On 2010/08/18, at 1:38, Pjotr Prins wrote:

> On Mon, Aug 16, 2010 at 10:52:02PM +0900, Tomoaki NISHIYAMA wrote:
>> Hi,
>>
>>> A simple implementation would be to store all relations into a
>>> graph (or graphs) and then extracting information.
>>
>> I recently wrote a program to extract all the mRNAs, but up to the
>> addresses and not to the sequences.
>>
>> http://github.com/tomoakin/Bioruby-use/blob/master/src/gff2easytrack.rb
>>
>> This is not designed to be very general, but might be useful as a
>> starting point.
>
> Thanks for the nice example. It shows how you can filter GFF without
> storing everything in memory. Naturally that does not work for
> extracting all transcripts as GFF does not guarantee ordered data.
>
> Still, a good example. What I also like is that there is almost no
> coupling with other BioRuby modules (other than embedded Fasta). We
> should keep it that way.
>
> Question, have we ever seen GFF files that are not ordered? It makes
> so much sense to keep genes and their components together. I think it
> is somewhere argued that you can share parts between genes, but how
> often does that happen - and would they be far apart in the file?
> Even Lincoln states that you can split GFF files. That would not work
> if data is not together.
>
> I am thinking we can assume that related data comes with each other.
> This means we only have to cache a limited number of records in memory
> to resolve dependencies.
>
> I'll probably write something in the coming week, as I need it. I'll
> design it to be a BioRuby plugin. For the time being.
>
> Pj.
>