[Bioperl-l] HG-U133a annotation csv (HG-U133A_annot.csv)

Sun Dec 12 02:11:12 EST 2004

Another good route is the standard module Text::ParseWords, as described in
the Perl Cookbook. Here's the basic approach:

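use strict;
use warnings;
use Text::ParseWords;

while (my $line = <>) {
    chomp $line;
    # keep = 0: strip the enclosing quotes; commas inside quotes stay in the field
    my @data = parse_line(',', 0, $line);
    do_something(@data);    # placeholder for your own per-record handler
}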

Works like a dream: commas inside the quoted fields stay inside their field.
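
As a quick sanity check, here is a minimal sketch that runs parse_line on a
shortened record. The sample string is just the first four fields of the
1007_s_at line from the original post, and parse_annot_line is a hypothetical
helper name I made up for illustration; it is not part of any module.

```perl
use strict;
use warnings;
use Text::ParseWords;

# Hypothetical helper: split one CSV record into its fields.
sub parse_annot_line {
    my ($line) = @_;
    chomp $line;
    # keep = 0 drops the enclosing double quotes from each field
    return parse_line(',', 0, $line);
}

# Shortened sample: the first four fields of the 1007_s_at record.
my $sample = '"1007_s_at","Human Genome U133A Array","Homo sapiens","Oct 11, 2004"';
my @fields = parse_annot_line($sample);

printf "%d fields\n", scalar @fields;
print "[$_]\n" for @fields;
```

The date "Oct 11, 2004" should come back as a single field even though it
contains a comma, which is the case a plain split on ',' gets wrong.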

BTW, Excel should handle these files no problem. What version/platform of
Excel are you using?

Some suggestions: Make sure the file has a .txt extension. Opening it from
within Excel should then activate the Text Import Wizard. Once there, be
sure that the data type is "Delimited", the delimiter is "Comma", the text
qualifier is ", and the column data format is "Text" (general should work
too, since all fields are enclosed in double quotes and should therefore be
interpreted as text).

If you still have problems, let me know.

Steve

> From: Jason Stajich <jason.stajich at duke.edu>
> Date: Fri, 10 Dec 2004 16:49:31 -0500
> To: Peter Robinson <Peter.Robinson at t-online.de>
> Cc: Bioperl List <bioperl-l at bioperl.org>
> Subject: Re: [Bioperl-l] HG-U133a annotation csv (HG-U133A_annot.csv)
> 
> The module Text::CSV_XS could be used as well - it does a pretty good
> job with mixed quoted and non-quoted fields.
> 
> -jason
> 
> On Dec 10, 2004, at 4:34 PM, Peter Robinson wrote:
> 
>> while ($line =~ m/"(.*?)"/g) {
>> print $1;
>> }
>> The "?" keeps * from being greedy, so we match only what is in between
>> each of the quotes. This regex just basically ignores the commas in
>> between the entries.
>> 
>> HTH
>> 
>> Peter
>> 
>> 
>> On Fri, 2004-12-10 at 21:59, D.Enrique ESCOBAR ESPINOZA wrote:
>>> I'm having a hell of a time trying to parse the annotation file with a
>>> regular expression.
>>> The problem is that the file contains fields separated by a comma,
>>> each field starts with a double quote and ends with a double quote,
>>> and each field may also contain some ';' and ','.
>>> An example of that file is at the end of this mail.
>>> Can someone help and give me a trick for parsing the lines of this
>>> file?
>>> It has 38 fields, and Excel is not even opening it correctly,
>>> and if I try to save it back to a csv file,
>>> it makes a complete mess.
>>> Thanks in advance.
>>> "Probe Set ID","GeneChip Array","Species Scientific Name","Annotation
>>> Date","Sequence Type","Sequence Source","Transcript ID","Target
>>> Description","Representative Public ID","UniGene ID","Genome
>>> Version","Alignments","Gene Title","Gene Symbol","Chromosomal
>>> Location","Unigene Cluster
>>> Type","Ensembl","LocusLink","SwissProt","EC","OMIM","RefSeq Protein
>>> ID","RefSeq Transcript ID","FlyBase","AGI","WormBase","MGI Name","RGD
>>> Name","SGD accession number","Gene Ontology Biological Process","Gene
>>> Ontology Cellular Component","Gene Ontology Molecular
>>> Function","Pathway","Protein Families","Protein
>>> Domains","InterPro","Trans Membrane","QTL","Annotation
>>> Description","Annotation Transcript Cluster","Transcript
>>> Assignments","Annotation Notes"
>>> "1007_s_at","Human Genome U133A Array","Homo sapiens","Oct 11,
>>> 2004","Exemplar sequence","Affymetrix Proprietary
>>> Database","U48705mRNA"," U48705 /FEATURE=mRNA /DEFINITION=HSU48705
>>> Human receptor tyrosine kinase DDR gene, complete cds
>>> ","U48705","Hs.423573","May 2004 (NCBI 35)","chr6:30964144-30975910
>>> (+) // 95.63 // p21.33","discoidin domain receptor family, member
>>> 1","DDR1","chr6p21.3","full length","ENSG00000137332","780","BAC85426
>>> /// Q08345 /// Q96T61 /// Q96T62","EC:2.7.1.112","600408","NP_001945
>>> /// NP_054699 /// NP_054700","NM_001954 /// NM_013993 ///
>>> NM_013994","---","---","---","---","---","---","6468 // protein amino
>>> acid phosphorylation // inferred from electronic annotation /// 7155
>>> // cell adhesion // traceable author statement /// 7169 //
>>> transmembrane receptor protein tyrosine kinase signaling pathway //
>>> inferred from electronic annotation","5887 // integral to plasma
>>> membrane // traceable author statement /// 16020 // membrane //
>>> inferred from electronic annotation","4674 // protein
>>> serine/threonine kinase activity // inferred from electronic
>>> annotation /// 4714 // transmembrane receptor protein tyrosine kinase
>>> activity // traceable author statement /// 4872 // receptor activity
>>> // inferred from electronic annotation /// 5524 // ATP binding //
>>> inferred from electronic annotation /// 16740 // transferase activity
>>> // inferred from electronic annotation","---","ec // ZA70_HUMAN //
>>> ZA70_HUMAN EC:2.7.1.112:TYROSINE-PROTEIN KINASE ZAP-70 (EC 2.7.1.112)
>>> (70 KDA ZETA-ASSOCIATED PROTEIN) (SYK-RELATED TYROSINE KINASE). //
>>> 2.0E-65 /// Hanks // DDR // HUMRTK_1 (DDR) KINASES:5.11.1 | PTK Group
>>> B membrane spanning protein tyrosine kinases.PTK XX DDR/TKT family
>>> .DDR // 1.0E-156","scop // d1kexa_ // d1kexa_ SCOP:b.18.1.2:| B1
>>> domain of neuropilin-1 // 5.0E-42","IPR000421 // Coagulation factor
>>> 5/8 type C domain (FA58C) /// IPR000719 // Protein
>>> kinase","NP_054700.1 // span:417-439 // numtm:1","---","This probe
>>> set was annotated using the Matching Probes based pipeline to a Locus
>>> Link identifier using 1 transcripts. // false // Matching Probes //
>>> A","NM_013994(16)","ENST00000259875 // cdna:known
>>> chromosome:NCBI34:6:30958112:30974184:1 // ensembl // 16 // --- ///
>>> NM_013994 // Homo sapiens discoidin domain receptor family, member 1
>>> (DDR1), transcript variant 3, mRNA. // refseq // 16 //
>>> ---","ENST00000325423 // ensembl // 1 // Negative Strand Matching
>>> Probes /// ENST00000340208 // ensembl // 1 // Negative Strand
>>> Matching Probes /// GENSCAN00000025013 // ensembl // 1 // Negative
>>> Strand Matching Probes /// BC026341 // gb // 1 // Negative Strand
>>> Matching Probes /// S57212 // gb // 1 // Negative Strand Matching
>>> Probes"
>>> "1053_at","Human Genome U133A Array","Homo sapiens","Oct 11,
>>> 2004","Exemplar sequence","GenBank","M87338"," M87338 /FEATURE=
>>> /DEFINITION=HUMA1SBU Human replication factor C, 40-kDa subunit (A1)
>>> mRNA, complete cds ","M87338","Hs.139226","May 2004 (NCBI
>>> 35)","chr7:73090653-73113383 (-) // 70.86 // q11.23","replication
>>> factor C (activator 1) 2, 40kDa","RFC2","chr7q11.23","full
>>> length","ENSG00000049541","5982","AAP35707 ///
>>> P35250","---","600404","NP_002905 /// NP_852136","NM_002914 ///
>>> NM_181471","---","---","---","---","---","---","6260 // DNA
>>> replication // inferred from electronic annotation","5634 // nucleus
>>> // inferred from electronic annotation /// 5663 // DNA replication
>>> factor C complex // traceable author statement","166 // nucleotide
>>> binding // inferred from electronic annotation /// 3677 // DNA
>>> binding // inferred from electronic annotation /// 5524 // ATP
>>> binding // traceable author statement","DNA_replication //
>>> GenMAPP","ec // KAD2_HUMAN // KAD2_HUMAN EC:2.7.4.3:ADENYLATE KINASE
>>> ISOENZYME 2, MITOCHONDRIAL (EC 2.7.4.3) (ATP-AMP TRANSPHOSPHORYLASE).
>>> // 8.2","scop // d1nrjb_ // d1nrjb_ SCOP:c.37.1.8:| Signal
>>> recognition particle receptor beta-subunit //
>>> 0.024","---","---","---","This probe set was annotated using the
>>> Matching Probes based pipeline to a Locus Link identifier using 2
>>> transcripts. // false // Matching Probes //
>>> A","M87338(15),NM_181471(12)","ENST00000055077 // cdna:known
>>> chromosome:NCBI34:7:73057931:73080835:-1 // ensembl // 12 // --- ///
>>> ENST00000275627 // cdna:known
>>> chromosome:NCBI34:7:73057931:73080835:-1 // ensembl // 12 // --- ///
>>> M87338 // Human replication factor C, 40-kDa subunit (A1) mRNA,
>>> complete cds. // gb // 15 // --- /// NM_181471 // Homo sapiens
>>> replication factor C (activator 1) 2, 40kDa (RFC2), transcript
>>> variant 1, mRNA. // refseq // 12 // ---","GENSCAN00000014431 //
>>> ensembl // 8 // Cross Hyb Matching Probes"
>>> 
>>> 
>>> =====
>>> --------------------------------------------------
>>> D.Enrique ESCOBAR ESPINOZA (B.Sc.)
>>> http://www.iro.umontreal.ca/~escobard/
>>> http://adn.bioinfo.uqam.ca/~escd07097301/
>>> ICQ#: 201778618
>>> -------------------------------------------------
>>> 1487, Boul. St-Joseph Est Apt4
>>> Tel:  (514) 523-8398
>>> Montreal QC Canada
>>> H2J 1M6
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at portal.open-bio.org
>>> http://portal.open-bio.org/mailman/listinfo/bioperl-l
>> -- 
>> Peter N. Robinson
>> peter.robinson at t-online.de
>> peter.robinson at charite.de
>> http://www.charite.de/ch/medgen/robinson/
>> 
> --
> Jason Stajich
> jason.stajich at duke.edu
> http://www.duke.edu/~jes12/
> 