[Bioperl-l] [BioSQL-l] How is is_circular recorded in BioSQL (by BioPerl)?

Mon Jul 25 12:31:41 EDT 2011

On Jul 25, 2011, at 10:39 AM, Peter Cock wrote:

> On Mon, Jul 25, 2011 at 3:12 PM, Roy Chaudhuri <roy.chaudhuri at gmail.com> wrote:
>>>> I don't think there's any specific handling, but (in GenBank files
>>>> at least) mol_type is recorded as a tag in the source feature, so
>>>> it will be stored in BioSQL like any other feature tag (in
>>>> seqfeature_qualifier_value).
>>> 
>>> I'd forgotten in my question this potential slight redundancy in the
>>> GenBank format!
>> 
>> No problem, I forgot in my answer that for some obscure reason people
>> may be interested in looking at GenBank files that aren't bacterial genome
>> sequences.
> 
> Sampling bias ;)
> 
>>> Let me clarify that I'm interested in if and where BioPerl stores
>>> the molecule type from the GenBank LOCUS line in BioSQL (and I'm
>>> expecting this to go in bioentry_qualifier_value table under some tag
>>> name).
>> 
>> As far as I can tell, the only fields stored by default in
>> bioentry_qualifier_value are keyword, date_changed and secondary_accession
>> (although my database only contains GenBank bacterial genomes). As with the
>> is_circular hack, you could store the molecule type by adding it as an
>> annotation in the SequenceProcessor (it's stored as $seq->molecule by
>> BioPerl).
> 
> OK, that makes sense.
> 
>> Actually, when round-tripping a GenBank file through BioSQL, the LOCUS line
>> molecule type ends up in lower case, which makes me wonder if it is coming
>> from alphabet in the biosequence table.
> 
> If so, that may break for viral GenBank files where the LOCUS line may say
> RNA, but the sequence is given using acgt (i.e. the DNA alphabet).

Not sure, but that's worth checking on.  Truthfully, our interest has typically leaned more towards parsing data into the proper classes for downstream analysis than towards round-tripping sequence formats.  Not that the latter isn't important, but there is frankly more interest in doing something beyond rote sequence format conversion.
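
For anyone who wants to check, here is a rough sketch of the kind of query I have in mind: list the tags that actually ended up in bioentry_qualifier_value for an entry, and look at what is in biosequence.alphabet. It is plain Python with the stdlib sqlite3 module, not BioPerl. The tables are a cut-down subset of the BioSQL schema (only the columns the queries touch), and the accession DEMO0001, the demo rows and the function names are all made up for illustration. Against a real database you would skip build_demo() and connect to your own instance.

```python
import sqlite3

# Cut-down subset of the BioSQL schema: only the columns queried below.
# The real tables have more columns and constraints.
SCHEMA = """
CREATE TABLE bioentry (bioentry_id INTEGER PRIMARY KEY, accession TEXT);
CREATE TABLE biosequence (bioentry_id INTEGER PRIMARY KEY, alphabet TEXT, seq TEXT);
CREATE TABLE term (term_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE bioentry_qualifier_value (bioentry_id INTEGER, term_id INTEGER, value TEXT);
"""


def build_demo():
    """Create an in-memory database holding one hypothetical entry."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO bioentry VALUES (1, 'DEMO0001')")
    conn.execute("INSERT INTO biosequence VALUES (1, 'dna', 'acgtacgt')")
    conn.executemany(
        "INSERT INTO term VALUES (?, ?)",
        [(1, "keyword"), (2, "date_changed"), (3, "secondary_accession")],
    )
    conn.executemany(
        "INSERT INTO bioentry_qualifier_value VALUES (1, ?, ?)",
        [(1, "demo"), (2, "25-JUL-2011"), (3, "DEMO0000")],
    )
    return conn


def qualifier_tags(conn, accession):
    """Return the distinct tag names stored for an entry, sorted by name."""
    rows = conn.execute(
        "SELECT DISTINCT t.name "
        "FROM bioentry_qualifier_value q "
        "JOIN term t ON t.term_id = q.term_id "
        "JOIN bioentry b ON b.bioentry_id = q.bioentry_id "
        "WHERE b.accession = ? ORDER BY t.name",
        (accession,),
    )
    return [row[0] for row in rows]


def stored_alphabet(conn, accession):
    """Return biosequence.alphabet for an entry, or None if it is absent."""
    row = conn.execute(
        "SELECT s.alphabet FROM biosequence s "
        "JOIN bioentry b ON b.bioentry_id = s.bioentry_id "
        "WHERE b.accession = ?",
        (accession,),
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    conn = build_demo()
    print(qualifier_tags(conn, "DEMO0001"))
    print(stored_alphabet(conn, "DEMO0001"))
```

If no molecule-type tag shows up in the first list and the round-tripped LOCUS line matches the second value apart from case, that would support Roy's guess that it is coming from alphabet.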

>>> P.S.
>>> 
>>> As has been discussed before, the BioSQL documentation would benefit
>>> from at least one worked example of a (small) GenBank file showing
>>> where each field ends up in the database. It would be a reasonable
>>> amount of work though - but could then be used for a basic compliance
>>> unit test by all the Bio* interfaces to BioSQL.
>> 
>> I agree that this would be very useful - the SearchIO HOWTO has a similar
>> treatment of a BLAST report that I often refer to.
> 
> If only we could clone/fork bioinformaticians ;)
> 
> Peter

:)

chris