[GSoC] GSoC python variant update

Brad Chapman chapmanb at 50mail.com
Mon May 7 20:24:39 EDT 2012


Lenna,
This all looks great as a top-level overview of the classes, and it
should give you sufficient flexibility to work on the different file
types. Another approach is to avoid some of the inheritance and have
parse/write dispatch to VCF- or GVF-specific classes based on the
filetype:

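# PyVCFVariants and GVFVariants are the format-specific handler classes.
if filetype == "vcf":
    variant_handler = PyVCFVariants()
elif filetype == "gvf":
    variant_handler = GVFVariants()
else:
    # Fail early instead of hitting an undefined variant_handler below.
    raise ValueError("Unsupported filetype: %s" % filetype)
variant_handler.parse(*args)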

Avoiding layers can be nice to simplify the architecture, as long as it
gives you the flexibility you need.
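
As a rough, self-contained sketch of the dispatch idea above: the two
handler classes here are stand-ins with placeholder bodies (real ones
would presumably wrap PyVCF and BCBio.GFF), and the HANDLERS table and
get_variant_handler() function are hypothetical names used only for
illustration.

```python
class PyVCFVariants:
    """Hypothetical VCF handler; a real one would wrap PyVCF."""

    def parse(self, handle):
        return f"parsed VCF from {handle}"


class GVFVariants:
    """Hypothetical GVF handler; a real one would wrap BCBio.GFF."""

    def parse(self, handle):
        return f"parsed GVF from {handle}"


# Map each supported filetype to the class that handles it.
HANDLERS = {"vcf": PyVCFVariants, "gvf": GVFVariants}


def get_variant_handler(filetype):
    """Return a handler instance for filetype, or raise ValueError."""
    try:
        handler_class = HANDLERS[filetype.lower()]
    except KeyError:
        raise ValueError(f"Unsupported variant filetype: {filetype!r}") from None
    return handler_class()


variant_handler = get_variant_handler("vcf")
result = variant_handler.parse("example.vcf")
```

Supporting another format then only needs one new entry in the
table, with no extra layer of inheritance.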

My suggestion for digging further into the API design would be to start
playing with some VCF files, getting comfortable with the data they
contain and where it would go in Biopython objects. VCF is much more
widely used than GVF, so it's a good practical place to start.
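
To get a feel for what a VCF record holds before deciding where it
goes, a standard-library-only sketch like the one below may help. It
is not PyVCF's API: it reads just the eight fixed VCF columns, ignores
any genotype columns, and the sample records are illustrative rather
than real results. The helper names are made up for this example.

```python
# The eight fixed, tab-separated columns of a VCF data line.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

# Illustrative records only.
SAMPLE_VCF = """\
##fileformat=VCFv4.1
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB
20\t1110696\t.\tA\tG,T\t67\tPASS\tNS=2;DP=10
"""


def parse_info(info):
    """Split 'KEY=VALUE;FLAG' into a dict; bare flags map to True."""
    result = {}
    if info == ".":
        return result
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        result[key] = value if sep else True
    return result


def parse_vcf_lines(lines):
    """Yield one dict per data line, skipping header and blank lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        record = dict(zip(VCF_COLUMNS, fields[:8]))
        record["POS"] = int(record["POS"])
        record["ALT"] = record["ALT"].split(",")  # may list several alleles
        record["INFO"] = parse_info(record["INFO"])
        yield record


records = list(parse_vcf_lines(SAMPLE_VCF.splitlines()))
for record in records:
    print(record["CHROM"], record["POS"], record["REF"], record["ALT"])
```

Values inside INFO are left as strings here; typing them properly
needs the ##INFO header definitions, which this sketch skips.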

Thanks for all this work and best of luck on finals,
Brad

> Hi all,
> 
> I've written a few new posts on my blog; here's the latest:
> 
> http://arklenna.tumblr.com/post/22542372076/spot-isa-dog
> 
> I will attach a UML diagram and include the part of the post
> addressing the diagram. Click through to the full post for a bonus
> Einstein quote!
> 
> -------
> 
> My main goals include, but are not limited to:
> 
>  * Make the structure parser and file-format agnostic: an abstracted
> OO design should allow anything to be slotted in (for example,
> Marjan's C GFF parser?)
>  * Maintain encapsulation: limit how much each object can see of
> objects above and below it
>  * Allow extension at multiple levels: some existing parsers may
> process data in different ways; this structure should allow handling
> both raw data and data in various formats.
> 
> The `Variant` object's constructor allows an end user to change the
> default parsers. Practical implementation details of `parse()` and
> `write()` will need to be finessed - for example, ways to help the
> user sift through immense quantities of data. I'm still in the process
> of comparing the data contained in VCF/GVF files as well as the APIs
> of PyVCF and BCBio.GFF.
> 
> `Parser` and `Writer` are both abstract classes that will define all
> methods found in known parsers/writers with `NotImplementedError`s.
> I'm speculating on whether a Variant-specific exception would be
> useful, but a custom message should suffice.
> 
> Continuing down the diagram, `PyVCFWrapper` and `BCBioGFFWrapper`
> would each inherit from both `Parser` and `Writer`. As the name
> implies, they would serve as the adapter between the generic `Variant`
> and the specific parser.
> 
> I anticipate that this structure could easily be extended to allow
> intermediate storage in DBs as well as innumerable
> sorting/comparing/filtering methods inside `Variant`.
> 
> -------
> 
> I would appreciate any and all feedback about the overall structure.
> Namespace is definitely flexible. I'd also appreciate any specific
> genomic variant workflows, and if somebody can point me to smallish
> sample files of the same data in both VCF and GVF, I'd be eternally
> grateful.
> 
> Regards,
> 
> Lenna
Attachment: Variant_UML.png (image/png)
> _______________________________________________
> GSoC mailing list
> GSoC at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/gsoc
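
For comparison with the dispatch approach, here is a minimal sketch
of the layered design described in the quoted post: abstract `Parser`
and `Writer` classes that raise `NotImplementedError`, a wrapper that
inherits from both, and a `Variant` whose constructor lets the user
swap the defaults. The class names come from the post; the method
signatures and the wrapper's body are assumptions, and the wrapper
does not call PyVCF.

```python
import io


class Parser:
    """Abstract parser: subclasses are expected to override parse()."""

    def parse(self, handle):
        raise NotImplementedError(
            f"{type(self).__name__} does not implement parse()")


class Writer:
    """Abstract writer: subclasses are expected to override write()."""

    def write(self, records, handle):
        raise NotImplementedError(
            f"{type(self).__name__} does not implement write()")


class PyVCFWrapper(Parser, Writer):
    """Stand-in adapter; a real one would delegate to PyVCF."""

    def parse(self, handle):
        # Placeholder logic: one record per non-header line.
        return [line.rstrip("\n") for line in handle
                if not line.startswith("#")]


class Variant:
    """Generic front end; parser and writer can be swapped by the user."""

    def __init__(self, parser=None, writer=None):
        default = PyVCFWrapper()
        self.parser = parser or default
        self.writer = writer or default

    def parse(self, handle):
        return self.parser.parse(handle)

    def write(self, records, handle):
        return self.writer.write(records, handle)


variant = Variant()
parsed = variant.parse(io.StringIO("#header\nrecord1\nrecord2\n"))

try:
    variant.write(parsed, io.StringIO())
except NotImplementedError as exc:
    # The custom message names the class that lacks the method.
    message = str(exc)
```

Since the stand-in wrapper never overrides write(), the call falls
through to the `Writer` base class and raises with a message naming
the wrapper, which is the "custom message" option the post mentions.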

