[BioSQL-l] getting exon information from genbank files

Hilmar Lapp hlapp at gnf.org
Mon Apr 11 14:55:09 EDT 2005


Ankit, the values you're showing in your sample record, did you make  
them up entirely or is this an actual query result?

Note that all columns in the location table are numeric, so it only  
creates confusion if you choose letters as characters to mask the real  
values. If they are the real values, then you must have changed the
schema and not used load_seqdatabase.pl to load the records.

Note also that generally what's in biosql will closely resemble the  
object model that was built by the SeqIO bioperl parser run on your  
input record(s) - provided you used load_seqdatabase.pl to load the  
record(s). So, what ends up in biosql as the result of loading a  
genbank record greatly depends on the genbank record itself. As a rule,  
what the genbank record had in its feature table you'll also find in  
biosql as a seqfeature record, and what wasn't in the feature table you  
also won't find in biosql. Introns are usually not annotated in Genbank
explicitly; they are only implicit as the region between exons, so
unless the genbank records you loaded were exceptions you won't find
intron features in biosql. How to find
exons again depends on the feature table of the original records: some  
have a single cDNA feature with a composite ('split') location, which  
will end up in biosql as one seqfeature that has many locations  
attached. Genomic contigs sometimes have the exons annotated as  
individual features, and then this is what you'll find in biosql too:  
one seqfeature per exon, each with a single location.
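
To see what a given record actually ended up as, it helps to list every
feature together with its type name and its locations. The following is
only a sketch: it assumes the standard BioSQL table and column names
(seqfeature.type_term_id pointing at term, location.seqfeature_id
pointing at seqfeature), and make_demo_db() is a made-up in-memory
SQLite stand-in with invented values so the example runs on its own.

```python
import sqlite3

# Assumption: names follow the standard BioSQL schema. "rank" is quoted
# because it is a reserved word in some SQL dialects.
FEATURE_LOCATIONS_SQL = """
SELECT f.seqfeature_id, t.name, l.start_pos, l.end_pos, l.strand, l."rank"
FROM seqfeature f
JOIN term t ON t.term_id = f.type_term_id
JOIN location l ON l.seqfeature_id = f.seqfeature_id
WHERE f.bioentry_id = ?
ORDER BY f.seqfeature_id, l."rank"
"""


def feature_locations(conn, bioentry_id):
    """Return (seqfeature_id, type name, start, end, strand, rank) rows."""
    return conn.execute(FEATURE_LOCATIONS_SQL, (bioentry_id,)).fetchall()


def make_demo_db():
    """Hypothetical stand-in database holding only the columns used above."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE term (term_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE seqfeature (
            seqfeature_id INTEGER PRIMARY KEY,
            bioentry_id INTEGER,
            type_term_id INTEGER);
        CREATE TABLE location (
            location_id INTEGER PRIMARY KEY,
            seqfeature_id INTEGER,
            start_pos INTEGER,
            end_pos INTEGER,
            strand INTEGER,
            "rank" INTEGER);
        -- Invented sample data: one gene, one mRNA with a split location.
        INSERT INTO term VALUES (1, 'gene'), (2, 'mRNA');
        INSERT INTO seqfeature VALUES (10, 1, 1), (11, 1, 2);
        INSERT INTO location VALUES
            (100, 10, 101, 900, 1, 1),
            (101, 11, 101, 250, 1, 1),
            (102, 11, 400, 520, 1, 2),
            (103, 11, 700, 900, 1, 3);
    """)
    return conn


if __name__ == "__main__":
    for row in feature_locations(make_demo_db(), 1):
        print(row)
```

The type name in the second column is what tells you whether a location
belongs to a gene, an mRNA, or an exon. Against MySQL the same SELECT
should apply, but the parameter placeholder and the quoting of rank
depend on the driver and server version.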

The bottom line is, if you load through load_seqdatabase.pl the content  
in biosql will closely match the object tree in bioperl - which often  
times will be close to the data structure of the original input record.  
Features that weren't there to begin with you won't find magically  
added.

So, to come back to your question, there is no good answer because it  
greatly depends on what your input was. Most likely though you'll have  
to impute introns by fetching the locations of the cDNA (or mRNA)  
feature or the locations of the exon features, order them properly, and  
then infer introns between consecutive exons.
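
The inference step itself is small. Here is a minimal sketch, assuming
the exon locations have already been fetched as (start_pos, end_pos)
pairs for a single feature or gene on one strand, and that coordinates
are 1-based and inclusive as in the GenBank feature table. The function
name infer_introns is made up for illustration.

```python
def infer_introns(exons):
    """Return the gaps between consecutive exons as (start, end) pairs.

    exons: iterable of (start_pos, end_pos) tuples, 1-based inclusive,
    all on the same sequence and strand (assumption).
    """
    introns = []
    covered_end = None  # right-most position covered by exons seen so far
    for start, end in sorted(exons):
        # A gap of at least one base between the covered region and the
        # next exon is taken to be an intron.
        if covered_end is not None and start > covered_end + 1:
            introns.append((covered_end + 1, start - 1))
        covered_end = end if covered_end is None else max(covered_end, end)
    return introns


if __name__ == "__main__":
    # Invented coordinates, deliberately out of order.
    print(infer_introns([(700, 900), (101, 250), (400, 520)]))
```

Sorting by start position keeps the result independent of the order in
which the locations were stored. For features on the minus strand the
coordinates are the same, but the order of the introns along the
transcript is reversed.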

If this is what you need to do all the time I'd write a script that  
does this in an automated fashion against all newly loaded records and  
inserts the introns as features back into the database.

	-hilmar

On Sunday, April 10, 2005, at 11:04  AM, ankit soni wrote:

> Hi all,
> I have just started using BioSQL for one of my projects and I have
> loaded few genbank files in the MySQL database using BioPerl and the
> standard schema.
> I wanted to ask how can I get the information about the exons, introns
> from the database.
> If I use the following query I get the start and end position but I
> am not able to find out what the limits (start_pos and end_pos) stand
> for, i.e. gene or exon or intron.
> mysql> select * from location where seqfeature_id='XXXX';
> +-------------+---------------+-----------+---------+-----------+---------+--------+------+
> | location_id | seqfeature_id | dbxref_id | term_id | start_pos | end_pos | strand | rank |
> +-------------+---------------+-----------+---------+-----------+---------+--------+------+
> |        YYYY |          XXXX |      NULL |    NULL |       ABC |     EFG |      1 |    1 |
> +-------------+---------------+-----------+---------+-----------+---------+--------+------+
>
> It would be very helpful if somebody can guide me.
> I am sorry if I am unable to use the correct biological terms as I
> know very little of biology.
>
> Ankit Soni
> Junior Undergraduate
> Dept. of Computer Science
> IIT kanpur
> India
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at open-bio.org
> http://open-bio.org/mailman/listinfo/biosql-l
>
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------



