[Biopython] affy CEL and CDF reader

Thu Apr 8 19:40:01 UTC 2010

On Thu, Apr 8, 2010 at 3:03 PM, Vincent Davis <vincent at vincentdavis.net> wrote:
> Parsing it myself, but based directly on the Affy documentation found here:
> http://www.stat.lsa.umich.edu/~kshedden/Courses/Stat545/Notes/AffxFileFormats/

So, are you covering both binary and text formats for .CEL files?  I
think that modern .CEL files (those produced by GCOS) are binary and
represent the majority of .CEL files produced today.  Some of the I/O
issues that you discuss are almost definitely dealt with by using the
binary .CEL files.
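
For what it's worth, here is a minimal sketch of how a reader might tell the two flavours apart before choosing a parser. It assumes the layouts as I understand them from the Affymetrix format notes: text (version 3) files open with a "[CEL]" section header, and binary (version 4) files open with two little-endian 32-bit integers, a magic number of 64 followed by the version number 4. The function name detect_cel_format and the returned labels are made up for illustration and are not part of Biopython or the Affy SDK.

```python
import struct


def detect_cel_format(data: bytes) -> str:
    """Guess the CEL flavour from the leading bytes of a file.

    Hypothetical helper: the name and the returned labels are
    illustrative only.
    """
    # Text (version 3) CEL files are INI-like and begin with "[CEL]".
    if data.lstrip().startswith(b"[CEL]"):
        return "text-v3"

    # Binary (version 4) CEL files are assumed to begin with two
    # little-endian int32 values: magic number 64, then version 4.
    if len(data) >= 8:
        magic, version = struct.unpack("<ii", data[:8])
        if magic == 64 and version == 4:
            return "binary-v4"

    # Anything else (including other, newer containers) is left undecided.
    return "unknown"
```

In practice one would read the first few bytes with open(path, "rb").read(16) and pass them in, then dispatch to the matching parser.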

I'm certainly not an expert on Affy, so take all these
questions/comments with a grain of salt.

Sean

> On Thu, Apr 8, 2010 at 12:56 PM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
>
>> On Thu, Apr 8, 2010 at 2:33 PM, Vincent Davis <vincent at vincentdavis.net> wrote:
>> > I ended up writing my own modules for reading both Affy CEL and CDF files.
>> > Long story as to why I did not just use what was available in Biopython.
>> > I plan on making what I have done available to Biopython and will upload
>> > it as a fork. I will outline below how what I have is different.
>> > My question is: are there any improvements (features) others would like to
>> > see beyond what is available in the current CelFile.py?
>> > I saw some posts a month or so ago about checking for consistency in CEL
>> > files; I think it was something about making sure the stated number of
>> > probes was consistent with the intensity measurements.
>> >
>> > What is different:
>> > when a file is read with Affycel.read('file'), many attributes are set.
>> > For example:
>> > a = affcel()
>> > a.read('testfile')
>> > a.filename,
>> > a.version,
>> > a.header.items()  # a dictionary of all header items
>> > a.num_intensity
>> > a.intensity
>> > a.num_masks
>> > a.masks
>> > a.num_outliers
>> > a.outliers
>> > a.numb_modified
>> > a.modified
>> >
>> > I plan to add the ability to return intensity values with or without
>> > outliers or masked values.
>> > All data is currently stored in numpy structured arrays;
>> > currently a.intensity returns the structured array, but I plan on making
>> > it an option to easily choose how this is returned.
>> > I also want to add an optional normalized intensity array so that if the
>> > data is normalized it can be stored with the affycel instance. My use case
>> > was that I was opening about 80 CEL files and reading them in was slow.
>> > This allowed me to read each file as an instance of affycel stored in a
>> > list that I then pickled. It was then much faster to open them.
>> >
>> > Are improvements to CelFile.py of value to Biopython?
>> >
>> > I hope to have the code pushed up to my fork on GitHub late tonight. I just
>> > thought I would ask if there were any suggestions before I did.
>> >
>> > I also have a CDF file reader, but have only done some basic testing. I
>> > don't have a lot of use for this; do other Biopython users?
>> >
>> > I am kinda working in a vacuum and am trying to get more involved in
>> > projects to improve my skills and knowledge. Any suggestions would be
>> > appreciated.
>>
>> Just out of curiosity, is your work based on the affy sdk, or are you
>> parsing stuff yourself?
>>
>> Sean
>>
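
To make the structured-array and pickling idea from the quoted message a bit more concrete, here is a rough sketch. It is not Vincent's code: the dtype cell_dtype, its field names (borrowed from the X, Y, MEAN, STDV, NPIXELS columns of a text CEL file) and the helper usable_intensities are all assumptions made up for illustration. It shows intensities held in a numpy structured array, masked and outlier cells dropped by their (x, y) position, and a pickle round trip of the kind used to cache parsed files.

```python
import pickle

import numpy as np

# Hypothetical per-cell record layout; the field names are illustrative.
cell_dtype = np.dtype(
    [("x", "i4"), ("y", "i4"), ("mean", "f4"), ("stdv", "f4"), ("npixels", "i4")]
)


def usable_intensities(cells, masks=(), outliers=()):
    """Return mean intensities, skipping cells flagged as masked or outliers.

    `masks` and `outliers` are iterables of (x, y) tuples (an assumed layout).
    """
    excluded = set(masks) | set(outliers)
    keep = np.array(
        [(int(x), int(y)) not in excluded for x, y in zip(cells["x"], cells["y"])],
        dtype=bool,
    )
    return cells["mean"][keep]


# Three made-up cells, purely for demonstration.
cells = np.array(
    [(0, 0, 100.0, 5.0, 16), (1, 0, 250.0, 9.0, 16), (0, 1, 80.0, 4.0, 16)],
    dtype=cell_dtype,
)

# Drop the cell at (1, 0) as if it were listed in the masks section.
vals = usable_intensities(cells, masks=[(1, 0)])

# Structured arrays pickle cleanly, so parsed data can be cached and reloaded.
restored = pickle.loads(pickle.dumps(cells))
```

With around 80 files, the same pickle.dumps / pickle.loads calls could be applied to a list of such arrays (or of reader instances holding them), so the slow text parsing only has to happen once.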