[Biopython-dev] Fw: Re: Parsing TRANSFAC matrices with Bio.Motif
Michiel de Hoon
mjldehoon at yahoo.com
Tue Aug 7 14:47:00 UTC 2012
Forwarding Bartek's email to the list.
I am pretty much OK with his suggestions, but feel free to comment or suggest other solutions before we start implementing this.
Best,
-Michiel.
--- On Tue, 8/7/12, Bartek Wilczynski <bartek at rezolwenta.eu.org> wrote:
> From: Bartek Wilczynski <bartek at rezolwenta.eu.org>
> Subject: Re: [Biopython-dev] Parsing TRANSFAC matrices with Bio.Motif
> To: "Michiel de Hoon" <mjldehoon at yahoo.com>
> Date: Tuesday, August 7, 2012, 5:16 AM
> On Tue, Aug 7, 2012 at 10:39 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> > Hi Bartek,
> >
> > Thanks for your reply.
> >
> > --- On Tue, 8/7/12, Bartek Wilczynski <bartek at rezolwenta.eu.org> wrote:
> >> If you do, then you get access to a number of interconnected
> >> datasets, including information about what they call "matrices",
> >> "sites" and "transcription factors" and "classes". I think that if
> >> we want to support their filetypes, we probably should think whether
> >> we should support the matrix file only or maybe the other ones as
> >> well.
> >
> > I would suggest supporting just the matrices for now.
> >
> I'm fine with that. Some links between the files might be less
> useful, but that might be added later.
>
> >> The confusing part is that many programs use "transfac-like"
> >> formats, i.e. files very similar to the part in the "matrix"
> >> file that corresponds to the PWM itself. (For example see
> >> http://www.benoslab.pitt.edu/stamp/help.html).
> >
> > This also means that if Bio.Motif can parse TRANSFAC files, then it
> > can parse the transfac-like formats, at least to some degree.
> > Personally I am actually more interested in the SwissRegulon
> > database, which uses a transfac-like format.
> >
>
> In principle yes, but there are slight variants making things "almost
> working". That's the main reason I didn't put the code I was using
> myself into the biopython repository, as it might cause some weird
> breakages. For example, some formats drop the P0 column (the
> "transfac-like" in STAMP, for one) which makes it impossible to figure
> out whether you are interpreting the numbers right unless you agree on
> some ordering of nucleotides. I would suggest that we should support
> databases named directly and, maybe, think about generic methods for
> "raw PSSM" files, that would require the user to give the nucleotide
> order...
>
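The "raw PSSM" idea above could be sketched roughly as follows. This is only an illustration, not existing Bio.Motif code: the function name parse_raw_pssm, its arguments, and the sample numbers are all made up here. It assumes a headerless file with one motif position per line and one whitespace-separated count per nucleotide, so the caller has to supply the column order that the missing P0 header would otherwise have given.

```python
def parse_raw_pssm(lines, order):
    """Hypothetical helper: read a headerless count matrix.

    lines -- iterable of strings, one motif position per line
    order -- the nucleotide order of the columns, e.g. "ACGT";
             it must come from the caller because the file has no header
    Returns a dict mapping each letter to its list of per-position counts.
    """
    order = list(order)
    if len(set(order)) != len(order):
        raise ValueError("nucleotide order contains duplicate letters")
    counts = {letter: [] for letter in order}
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(order):
            raise ValueError(
                "line %d has %d columns, expected %d"
                % (number, len(fields), len(order))
            )
        for letter, field in zip(order, fields):
            counts[letter].append(float(field))
    return counts


# The same numbers read under two different assumed column orders
# (made-up data, not taken from any database).
raw = ["10 0 0 2", "0 12 0 0", "1 1 9 1"]
acgt = parse_raw_pssm(raw, "ACGT")
actg = parse_raw_pssm(raw, "ACTG")
print(acgt["G"])
print(actg["G"])
```

The two calls give different counts for G from identical input, which is the ambiguity described above: without an agreed ordering, nothing in the file says which reading is right.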
> >> Then comes the thing with annotations. I would rather
> >> vote for something more similar to SeqRecord and Seq,
> >> where a new class (MotifRecord?) would hold all the
> >> annotation data from TRANSFAC or somesuch DB, and the
> >> Motif would remain more sequence-like.
> >
> > Are you suggesting that MotifRecord subclasses Bio.Motif._Motif.Motif?
> > For example we could have a Bio.Motif.Parsers.TRANSFAC.Motif class
> > that subclasses Bio.Motif._Motif.Motif. Then Bio.Motif._Motif.Motif
> > remains sequence-like, and Bio.Motif.Parsers.TRANSFAC.Motif takes
> > care of the annotations.
> >
> > Alternatively we could say that Bio.Motif.Parsers.TRANSFAC.read
> > returns a Bio.Motif.Parsers.TRANSFAC.Record object that contains the
> > motif information as an attribute (so record.motif would be an
> > instance of Bio.Motif._Motif.Motif).
> >
>
> For me, personally, the version where transfac motif is a subclass of
> Motif is a more useful one. It is simpler, and it adds annotations as
> attributes of a motif. However, if we decided that we want the whole
> TRANSFAC db with all its annotations, the more natural way would be
> to have separate classes for instances and motifs and maybe even
> separate record classes representing a database record (there might be
> more transfac records referencing the same matrix). I don't think that
> there is so much need for supporting all the stuff from TRANSFAC (I
> don't know anybody who would be using all their annotations, people
> seem to care only about matrices anyway) so I'd vote for the simpler
> way of subclassing Motif.
>
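For comparison, the two designs discussed above might look something like this. It is a rough sketch under stated assumptions: Motif here is a bare stand-in for Bio.Motif._Motif.Motif, and TransfacMotif, TransfacRecord, the annotations attribute and the annotation values are hypothetical names chosen for illustration, not anything that exists in Biopython.

```python
class Motif:
    """Stand-in for Bio.Motif._Motif.Motif (assumption: holds counts only)."""

    def __init__(self, counts=None):
        self.counts = dict(counts or {})


class TransfacMotif(Motif):
    """Option 1 (hypothetical): subclass Motif and carry the annotations."""

    def __init__(self, counts=None, annotations=None):
        super().__init__(counts)
        self.annotations = dict(annotations or {})


class TransfacRecord:
    """Option 2 (hypothetical): a separate record that wraps a plain Motif."""

    def __init__(self, motif, annotations=None):
        self.motif = motif
        self.annotations = dict(annotations or {})


# Made-up counts and annotation values, for illustration only.
counts = {"A": [10, 0], "C": [0, 12], "G": [0, 0], "T": [2, 0]}
notes = {"accession": "X00000", "name": "example"}

subclassed = TransfacMotif(counts, notes)
record = TransfacRecord(Motif(counts), notes)

# Option 1: the object can be passed anywhere a Motif is expected.
print(isinstance(subclassed, Motif), subclassed.annotations["name"])

# Option 2: the motif is reached through the record.
print(isinstance(record, Motif), record.motif.counts["A"])
```

With the subclass, existing code that takes a Motif keeps working unchanged. With the record, several records could share one Motif object, which matters only if more than the matrices is supported.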
> best
> Bartek
> --
> Bartek Wilczynski
>