[GSoC] GSoC python variant update

Lenna Peterson arklenna at gmail.com
Sun May 6 17:26:30 EDT 2012


Hi all,

I've written a few new posts on my blog; here's the latest:

http://arklenna.tumblr.com/post/22542372076/spot-isa-dog

I will attach a UML diagram and include the part of the post
addressing the diagram. Click through to the full post for a bonus
Einstein quote!

-------

My main goals include, but are not limited to:

 * Make the structure parser- and file-format-agnostic: an abstracted
OO design should allow anything to be slotted in (for example,
Marjan's C GFF parser?)
 * Maintain encapsulation: limit how much each object can see of
objects above and below it
 * Allow extension at multiple levels: some existing parsers may
process data in different ways; this structure should allow handling
both raw data and data in various formats.

The `Variant` object's constructor allows an end user to change the
default parsers. Practical implementation details of `parse()` and
`write()` will need to be finessed - for example, ways to help the
user sift through immense quantities of data. I'm still in the process
of comparing the data contained in VCF/GVF files as well as the APIs
of PyVCF and BCBio.GFF.
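
To make the constructor idea concrete, here is a minimal sketch of how a
swappable default parser might look. Only `Variant` and `parse()` come from
the design above; `DummyParser`, the `parser` argument, and the
`default_parser` attribute are hypothetical names for illustration, and the
stand-in parser does not reflect the API of PyVCF or BCBio.GFF.

```python
class DummyParser(object):
    """Hypothetical stand-in for a real parser wrapper (not PyVCF)."""

    def parse(self, handle):
        # Yield one record (a list of tab-separated fields) per data line,
        # skipping blank lines and '#' comment/header lines.
        for line in handle:
            if line.strip() and not line.startswith("#"):
                yield line.rstrip("\n").split("\t")


class Variant(object):
    """Sketch only: holds a parser chosen at construction time."""

    # Assumed name: the class used when the caller passes no parser.
    default_parser = DummyParser

    def __init__(self, parser=None):
        # Fall back to the default when the user does not supply a parser.
        self.parser = parser if parser is not None else self.default_parser()

    def parse(self, handle):
        # Delegate to whichever parser was slotted in.
        return self.parser.parse(handle)
```

Calling `Variant()` would use the default, while `Variant(parser=SomeOtherParser())`
would slot in any object that offers a compatible `parse()` method.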

`Parser` and `Writer` are both abstract classes that will define all
methods found in known parsers/writers; each method raises
`NotImplementedError` until a subclass overrides it. I'm wondering
whether a Variant-specific exception would be useful, but a custom
message should suffice.
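
As a rough sketch of the abstract classes described above, assuming just one
method on each (the real classes would define many more), the custom message
could name the concrete class that is missing the method. The method
signatures here are assumptions, not a settled API.

```python
class Parser(object):
    """Abstract base: subclasses override the methods they support."""

    def parse(self, handle):
        # The message names the concrete class, so no custom exception
        # type is needed to see which wrapper lacks the method.
        raise NotImplementedError(
            "%s does not implement parse()" % type(self).__name__)


class Writer(object):
    """Abstract base for writers, mirroring Parser."""

    def write(self, records, handle):
        raise NotImplementedError(
            "%s does not implement write()" % type(self).__name__)
```

A subclass that forgets to override `parse()` would then fail with a message
carrying its own class name rather than a bare `NotImplementedError`.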

Continuing down the diagram, `PyVCFWrapper` and `BCBioGFFWrapper`
would each inherit from both `Parser` and `Writer`. As the name
implies, they would serve as the adapter between the generic `Variant`
and the specific parser.
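
The adapter role can be sketched as follows. To avoid guessing at the real
PyVCF or BCBio.GFF APIs, this example wraps a made-up `ToyVCFLibrary`;
`ToyVCFWrapper` plays the part that `PyVCFWrapper` would play in the diagram.
All method names on the toy library are invented for illustration.

```python
class Parser(object):
    def parse(self, handle):
        raise NotImplementedError("parse() is not implemented")


class Writer(object):
    def write(self, records, handle):
        raise NotImplementedError("write() is not implemented")


class ToyVCFLibrary(object):
    """Hypothetical third-party library with its own method names."""

    def read_lines(self, lines):
        # Split each non-comment line into tab-separated fields.
        return [line.rstrip("\n").split("\t")
                for line in lines if not line.startswith("#")]

    def format_record(self, fields):
        return "\t".join(fields) + "\n"


class ToyVCFWrapper(Parser, Writer):
    """Adapter: translates the generic parse()/write() calls into the
    library-specific ones, so Variant never sees the backend directly."""

    def __init__(self):
        self._backend = ToyVCFLibrary()

    def parse(self, handle):
        return self._backend.read_lines(handle)

    def write(self, records, handle):
        for record in records:
            handle.write(self._backend.format_record(record))
```

Because the wrapper inherits from both base classes, any method it does not
override still falls through to the `NotImplementedError` defaults.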

I anticipate that this structure could easily be extended to allow
intermediate storage in DBs as well as innumerable
sorting/comparing/filtering methods inside `Variant`.

-------

I would appreciate any and all feedback about the overall structure.
Namespace is definitely flexible. I'd also appreciate hearing about
any specific genomic variant workflows, and if somebody can point me
to smallish sample files of the same data in both VCF and GVF, I'd be
eternally grateful.

Regards,

Lenna
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Variant_UML.png
Type: image/png
Size: 23313 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/gsoc/attachments/20120506/38c3290c/attachment-0001.png>

