[BioRuby] GFF3
Tomoaki NISHIYAMA
tomoakin at kenroku.kanazawa-u.ac.jp
Wed Aug 18 01:09:06 UTC 2010
Hi,
> Thanks for the nice example. It shows how you can filter GFF without
> storing everything in memory. Naturally that does not work for
> extracting all transcripts as GFF does not guarantee ordered data.
I think the code does not depend on the order of the GFF file.
All the exons that belong to an mRNA are stored in an array for that
mRNA.
The output order of the exons depends on the input GFF, but in this
case the output order does not need to be specified.
I could insert exonary.sort{ some ordering rule }
before exonary.each{} if the output order matters.
(Since this program was not meant to be long-lived and there was
sufficient memory, I made no effort to keep the memory usage low.)
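To illustrate the point, here is a minimal sketch (not the actual
gff2easytrack.rb code). The method name exons_by_mrna, the hash layout
and the sort rule (ascending start) are my assumptions for this
example; only the name exonary is taken from the program. It assumes
tab-separated GFF3 columns.

```ruby
# Hypothetical helper: collect exons per parent mRNA ID from GFF3 text.
# Each exon is filed under its Parent, so the line order does not matter.
def exons_by_mrna(gff_text)
  exons = Hash.new { |hash, key| hash[key] = [] }
  gff_text.each_line do |line|
    next if line.start_with?("#") || line.strip.empty?
    cols = line.chomp.split("\t")
    next unless cols.size == 9 && cols[2] == "exon"
    # Turn "key=value;key=value" into a Hash, skipping malformed pieces.
    attrs = cols[8].split(";").map { |kv| kv.split("=", 2) }
                   .select { |pair| pair.size == 2 }.to_h
    attrs.fetch("Parent", "").split(",").each do |parent|
      exons[parent] << { start: cols[3].to_i, stop: cols[4].to_i }
    end
  end
  exons
end

# The exons are deliberately listed out of order, before and after the mRNA.
sample = [
  "Chr1\tTAIR9\texon\t3996\t4276\t.\t+\t.\tParent=AT1G01010.1",
  "Chr1\tTAIR9\tmRNA\t3631\t5899\t.\t+\t.\tID=AT1G01010.1;Parent=AT1G01010",
  "Chr1\tTAIR9\texon\t3631\t3913\t.\t+\t.\tParent=AT1G01010.1"
].join("\n")

exons_by_mrna(sample).each do |mrna_id, exonary|
  # Assumed ordering rule: ascending start position.
  sorted = exonary.sort_by { |exon| exon[:start] }
  puts "#{mrna_id}: " + sorted.map { |e| "#{e[:start]}..#{e[:stop]}" }.join(", ")
end
```

The price is that every exon stays in memory until the end of the
file, which is the trade-off mentioned above.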
> I am thinking we can assume that related data comes with each other.
The nature of genes and genomes is not so simple.
You can read about trans-splicing: unlinked parts of the genome can
form a mature mRNA, and the protein derived from it.
If these parts are placed close together in the GFF file, then
positional order is not preserved. If the GFF is sorted by position,
the parts end up far apart in the file.
> share parts between genes,
As for parts shared between genes, microRNA genes are frequently
located in introns or exons of other genes. Also, in compact genomes,
quite a number of genes have overlapping UTRs.
In chloroplast genomes, even overlapping CDSs are known.
> Question, have we ever seen GFF files that are not ordered?
I've never seen an unordered GFF file, but there can be different
orders.
1. The lines are simply sorted by location.
2. Genes are ordered, and the parts of each gene come together.
For example, the Arabidopsis GFF file looks like this, and you can see
that the features themselves are not sorted by position: protein 3760
comes earlier than exon 3631.
Chr1 TAIR9 gene 3631 5899 . + . ID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010
Chr1 TAIR9 mRNA 3631 5899 . + . ID=AT1G01010.1;Parent=AT1G01010;Name=AT1G01010.1;Index=1
Chr1 TAIR9 protein 3760 5630 . + . ID=AT1G01010.1-Protein;Name=AT1G01010.1;Derives_from=AT1G01010.1
Chr1 TAIR9 exon 3631 3913 . + . Parent=AT1G01010.1
Chr1 TAIR9 five_prime_UTR 3631 3759 . + . Parent=AT1G01010.1
Chr1 TAIR9 CDS 3760 3913 . + 0 Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1 TAIR9 exon 3996 4276 . + . Parent=AT1G01010.1
Chr1 TAIR9 CDS 3996 4276 . + 2 Parent=AT1G01010.1,AT1G01010.1-Protein;
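A quick way to tell the two orders apart is to check whether the start
positions ever go backwards within a sequence. This is only a sketch:
the method name sorted_by_start? is made up for this example, and it
assumes tab-separated columns as in a real GFF3 file. The test data
are the first four features of the excerpt above.

```ruby
# Returns true when every feature starts at or after the previous feature
# on the same sequence (order 1 above), and false otherwise.
def sorted_by_start?(gff_text)
  last_start = {}
  gff_text.each_line do |line|
    next if line.start_with?("#") || line.strip.empty?
    cols = line.chomp.split("\t")
    next unless cols.size == 9
    seqid = cols[0]
    start = cols[3].to_i
    return false if last_start.key?(seqid) && start < last_start[seqid]
    last_start[seqid] = start
  end
  true
end

# First four features of the TAIR9 excerpt, rebuilt with tab separators.
excerpt = [
  %w[Chr1 TAIR9 gene 3631 5899 . + . ID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010],
  %w[Chr1 TAIR9 mRNA 3631 5899 . + . ID=AT1G01010.1;Parent=AT1G01010;Name=AT1G01010.1;Index=1],
  %w[Chr1 TAIR9 protein 3760 5630 . + . ID=AT1G01010.1-Protein;Name=AT1G01010.1;Derives_from=AT1G01010.1],
  %w[Chr1 TAIR9 exon 3631 3913 . + . Parent=AT1G01010.1]
].map { |cols| cols.join("\t") }.join("\n")

puts sorted_by_start?(excerpt)
```

For this excerpt the check fails at the exon line, because exon 3631
follows protein 3760.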
> It makes so much sense to keep genes and their components together.
I think GFF is an exchange format rather than a format for working
directly with part of it.
The data can be stored in an RDB and extracted from it relatively
easily.
An index in the RDB allows fast identification of all features in a
specified region or gene. That subset is good to work with.
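As a rough illustration of the two lookups, here is a sketch that uses
plain Ruby instead of a real RDB: a Hash stands in for an index on the
Parent attribute, and a start-sorted Array with a binary search stands
in for an index on position. The names Feature, FeatureStore,
children_of and in_region are hypothetical; in an actual database
these would be a table, two indexes and two SELECT statements.

```ruby
Feature = Struct.new(:seqid, :type, :start, :stop, :parents)

# Hypothetical in-memory stand-in for two database indexes:
# one on the Parent attribute and one on (seqid, start).
class FeatureStore
  def initialize(features)
    @by_parent = Hash.new { |hash, key| hash[key] = [] }
    @by_seqid = Hash.new { |hash, key| hash[key] = [] }
    features.each do |feature|
      feature.parents.each { |parent| @by_parent[parent] << feature }
      @by_seqid[feature.seqid] << feature
    end
    @by_seqid.each_value { |list| list.sort_by!(&:start) }
  end

  # All features that name parent_id in their Parent attribute.
  def children_of(parent_id)
    @by_parent.fetch(parent_id, [])
  end

  # All features overlapping the closed interval [from, to] on seqid.
  def in_region(seqid, from, to)
    list = @by_seqid.fetch(seqid, [])
    # Features starting after `to` cannot overlap; binary search finds the cut.
    cut = list.bsearch_index { |feature| feature.start > to } || list.size
    list.first(cut).select { |feature| feature.stop >= from }
  end
end

store = FeatureStore.new([
  Feature.new("Chr1", "exon", 3631, 3913, ["AT1G01010.1"]),
  Feature.new("Chr1", "CDS", 3760, 3913, ["AT1G01010.1", "AT1G01010.1-Protein"]),
  Feature.new("Chr1", "exon", 3996, 4276, ["AT1G01010.1"])
])

p store.children_of("AT1G01010.1-Protein").map(&:type)
p store.in_region("Chr1", 3900, 3950).map { |f| [f.type, f.start] }
```

Neither lookup depends on the order in which the features were loaded,
so the order of the GFF file no longer matters once the data is stored.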
--
Tomoaki NISHIYAMA
Advanced Science Research Center,
Kanazawa University,
13-1 Takara-machi,
Kanazawa, 920-0934, Japan
On 2010/08/18, at 1:38, Pjotr Prins wrote:
> On Mon, Aug 16, 2010 at 10:52:02PM +0900, Tomoaki NISHIYAMA wrote:
>> Hi,
>>
>>> A simple implementation would be to store all relations into a
>>> graph (or graphs) and then extracting information.
>>
>> I recently wrote a program to extract all the mRNAs, but up to the
>> addresses and not to the sequences.
>>
>> http://github.com/tomoakin/Bioruby-use/blob/master/src/gff2easytrack.rb
>>
>> This is not designed to be very general, but might be useful as a
>> starting point.
>
> Thanks for the nice example. It shows how you can filter GFF without
> storing everything in memory. Naturally that does not work for
> extracting all transcripts as GFF does not guarantee ordered data.
>
> Still, a good example. What I also like is that there is almost no
> coupling with other BioRuby modules (other than embedded Fasta). We
> should keep it that way.
>
> Question, have we ever seen GFF files that are not ordered? It makes
> so much sense to keep genes and their components together. I think it
> is somewhere argued that you can share parts between genes, but how
> often does that happen - and would they be far apart in the file?
> Even Lincoln states that you can split GFF files. That would not work
> if data is not together.
>
> I am thinking we can assume that related data comes with each other.
> This means we only have to cache a limited number records in memory
> to resolve dependencies.
>
> I'll probably write something in the coming week, as I need it. I'll
> design it to be a BioRuby plugin. For the time being.
>
> Pj.
>