[Bioperl-l] genbank/embl format ebnf or other formal description

Hilmar Lapp hlapp at drycafe.net
Tue Sep 11 22:02:46 UTC 2012


One of the problems in Perl with using a language-neutral definition of the format as a context-free grammar has been that Parse::RecDescent was just way too slow for this.

One of the Google Summer of Code students working on fast parsers (for SAM/BAM I think) used Ragel (http://www.complang.org/ragel/), which looks quite cool, but unfortunately doesn't support Perl (nor Go :-)

	-hilmar

On Sep 11, 2012, at 5:39 PM, Dan Kortschak wrote:

> Thanks Chris. It is related to both really, and more.
> 
> Second first, I continue to be amazed at the lack of specification or testing in a significant portion of software in the bioinformatics realm (bioperl is a nice counterexample and one that I am grateful for having had as a training ground - and the work that has obviously gone into working through parsing and formatting un- or under-specified formats by the core and other developers is phenomenal).
> 
> But to the first point, I am unable to use bioperl to parse/format these formats for my project as it is a new project, not written in Perl - apologies for abusing the list - but rather in Go. I could go through the Perl to reimplement based on that, but I was hoping to use a parser generator from a spec, so that I can guarantee the parser/formatter is correct formally.
> 
> I asked here because I believe the developers of bioperl are some of the foremost experts in parsing the collection of "weakly defined, internally redundant, ambiguous, bulky fruit salad[s] of ... data format[s]" [1] that constitute the majority of the file formats out there (this is not a pejorative against the bioperl devs, but rather a testament to their fortitude and strength - I have only implemented the bare minimum of formats in my library so far).
> 
> thanks
> Dan
> 
> 
> [1] http://www.biostars.org/post/show/7126/what-are-the-most-common-stupid-mistakes-in-bioinformatics/#7136
> 
> On 12/09/2012, at 12:09 AM, "Fields, Christopher J" <cjfields at illinois.edu> wrote:
> 
>> Christopher,
>> 
>> I think Dan's question is orthogonal to actually parsing a file; it relates more to proper formatting for a particular format based on a specification as well as potential downstream validation.  Bio::SeqIO::genbank is geared for flexibility and can handle a lot of mis-formatted data; it can massage some data into the proper format if needed.  One must recognize that the primary driver for the parsers is to get data into objects, not to act as a format converter (that just happens to be a nice useful side effect).
>> 
>> The problem is that, as with many formats, a formal specification for GenBank format doesn't exist outside of the NCBI example file (old and incomplete) and the FT definition as far as I know, so calling something 'official' GenBank format isn't possible outside of NCBI.
>> 
>> chris (f)
>> 
>> On Sep 11, 2012, at 9:10 AM, Christopher Bottoms <molecules at cpan.org> wrote:
>> 
>>> Dan,
>>> 
>>> Why not use BioPerl's Bio::SeqIO, which can parse GenBank files?
>>> 
>>> --Christopher Bottoms
>>> 
>>> On Fri, Sep 7, 2012 at 10:43 PM, Dan Kortschak
>>> <dan.kortschak at adelaide.edu.au> wrote:
>>>> Thanks Chris. That's remarkable, so many words and not an actual formal
>>>> specification. I guess I have some work ahead of me. I found the
>>>> example, but examples rarely contain all edges and corners.
>>>> 
>>>> Dan
>>>> 
>>>> On Sat, 2012-09-08 at 03:39 +0000, Fields, Christopher J wrote:
>>>>> Re: GenBank, the only specification I know of is for the feature
>>>>> table portion of the format as you have below.  They do have a
>>>>> (possibly out of date) example file; note it isn't easily found unless
>>>>> you search for it:
>>>>> 
>>>>> http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord
>>>>> 
>>>>> EMBL is better in this regard:
>>>>> 
>>>>> http://www.ebi.ac.uk/embl/Documentation/User_manual/usrman.html
>>>>> 
>>>>> Note that UniProt Knowledgebase also has a user manual outlining the
>>>>> similarities and differences with EMBL:
>>>>> 
>>>>> http://web.expasy.org/docs/userman.html
>>>>> 
>>>>> chris
>>>> 
>>>> 
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> 
> 
> 

-- 
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
===========================================================



