[Bioperl-l] SeqIO-based parser for Vector NTI sequence files
Scott Markel
SMarkel at accelrys.com
Mon Feb 9 14:46:12 EST 2009
Malcolm,
Thank you for your follow-up email.
> What are you really after...?
Our Pipeline Pilot-based Sequence Analysis Collection can read a
variety of sequence file formats, largely thanks to the BioPerl
parsers. One of my customers would like us to also be able to
read Vector NTI files. They annotate sequences in Vector NTI and
want to use these sequences, with features, in our product the
same way they can work with other sequences. Since the Vector NTI
file format is nominally GenBank format, I can "read" the file,
but I miss the annotations that the customer added. Hence my
interest in parsing these additional lines.
> One thing you should know (if you don't) that might bear on your
> underlying problem (whatever it may be....). Vector NTI can open GFF
> files and display them as an external analysis. The elements of the
> analysis can then be promoted interactively by the user to be features
> (genbank style).
This is good to know. Maybe the solution to my problem is as simple
as having the user appropriately promote the features they create.
I haven't used Vector NTI in ages so I'm not familiar with any of
the save/export options.
Scott
> -----Original Message-----
> From: Cook, Malcolm [mailto:MEC at stowers.org]
> Sent: Monday, 09 February 2009 11:32 AM
> To: Scott Markel; 'bioperl-ml'
> Subject: RE: [Bioperl-l] SeqIO-based parser for Vector NTI sequence files
>
> Hi Scott,
>
> It is my understanding that the Informax developers used the COMMENT lines
> to encode molecular-level attribute-value pairs.
>
> i.e. your VNTDATE|<integers here>|
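A minimal sketch of pulling those attribute-value pairs out of the COMMENT lines, assuming each pair sits on its own line as KEY|value with an optional trailing pipe (the shape of the snippets quoted later in this thread). The function name and the sample values are made up for illustration; this is Python, not BioPerl, and not anything Vector NTI documents.

```python
import re

# Assumed shape: an upper-case key, a pipe, a value, and an optional trailing pipe.
PAIR_RE = re.compile(r"([A-Z]+)\|(.*?)\|?")

def parse_vnti_comment_attributes(lines):
    """Collect KEY|value| pairs from GenBank-style COMMENT lines (hypothetical helper)."""
    attributes = {}
    for line in lines:
        if not line.startswith("COMMENT"):
            continue
        body = line[len("COMMENT"):].strip()
        match = PAIR_RE.fullmatch(body)
        if match:  # lines that are not KEY|value are skipped, not treated as errors
            attributes[match.group(1)] = match.group(2)
    return attributes

# Placeholder data: the integer is invented, the keys come from the thread.
sample = [
    "COMMENT     ORIGDB|GenBank",
    "COMMENT     VNTDATE|123456789|",
    "COMMENT     LSOWNER|",
    "COMMENT     VNTEXTCHREPL|Animal/Other Eukaryotic",
    "COMMENT     Vector_NTI_Display_Data_(Do_Not_Edit!)",
]
attrs = parse_vnti_comment_attributes(sample)
```

Free-text comment lines and the display-data block do not match the pattern and are simply ignored here.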
>
> Some dialect of LISP had a role in the product's early days. The attribute
> 'Vector_NTI_Display_Data_(Do_Not_Edit!)' is a LISP S-expression whose CAR
> is 'SXF' (ever program LISP?).
>
> It should be easily 'parseable' by any LISP interpreter or anything that
> can balance parens.
>
> The only 'rub' is the #NNN= tokens which appear to be some form of forward
> reference used by the lisp serializer to allow for internal references.
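The "balance parens" idea can be sketched in a few lines of Python. This assumes the COMMENT prefixes have already been stripped and the display-data lines joined into one string, and it treats each #NNN= token as a label on the list that follows it (Common Lisp's reader uses #n= for labels and #n# for references to them; whether Vector NTI follows that exactly is a guess). The function name is hypothetical and the two short quoted strings in the sample are placeholders for the redacted ones.

```python
import re

# One token per match: "(", ")", a double-quoted string, or a bare atom.
TOKEN_RE = re.compile(r'\s*(?:(\()|(\))|"((?:[^"\\]|\\.)*)"|([^\s()"]+))')
LABEL_RE = re.compile(r'#(\d+)=')

def parse_sexpr(text):
    """Parse S-expressions into nested lists; return (top_level_items, labels)."""
    labels = {}           # label number -> the list the label was attached to
    stack = [[]]          # stack[0] collects the top-level expressions
    pending_label = None
    text = text.rstrip()
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        pos = match.end()
        open_paren, close_paren, string, atom = match.groups()
        if open_paren:
            new_list = []
            if pending_label is not None:
                labels[pending_label] = new_list
                pending_label = None
            stack[-1].append(new_list)
            stack.append(new_list)
        elif close_paren:
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            stack.pop()
        elif string is not None:
            stack[-1].append(string)   # backslash escapes are kept as written
        else:
            label = LABEL_RE.fullmatch(atom)
            if label:
                pending_label = int(label.group(1))
            elif re.fullmatch(r'-?\d+', atom):
                stack[-1].append(int(atom))
            else:
                stack[-1].append(atom)  # symbols, and any #n# references, stay as text
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0], labels

sample = '''(SXF (CObList
  #0=(COligo "fwd" "ACGT" "Tm: 52.1C Length: 16mer GC: 56.3%" 0 (CStringList) 0)
  #1=(COligo "rev" "TGCA" "Tm: 56.8C Length: 18mer GC: 61.1%" 0 (CStringList) 0)))'''
tree, labels = parse_sexpr(sample)
```

Here tree[0][0] is the CAR of the outer expression, and labels maps each #NNN= number to the list it tags, so the oligo entries can be looked up without walking the whole tree.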
>
> I would be very surprised if you could learn from ABI (was Invitrogen, was
> Informax) anything about the internal structure of this. I once asked
> about 5 years ago. I am/was a supported user. It was arcane historical
> knowledge even then.
>
> I would also be very surprised if there was anything in it that was
> meaningful, unless the creator of the molecule in VNTI was using some very
> odd convention.
>
> I used to know which dialect of LISP was being used by them and might
> track it down if it were important to you....
>
> If you are after the oligo descriptive text in the COMMENT, it is likely
> that the oligos ALSO have a genbank feature associated with them. But you
> probably already looked....
>
> What are you really after...?
>
> One thing you should know (if you don't) that might bear on your
> underlying problem (whatever it may be....). Vector NTI can open GFF
> files and display them as an external analysis. The elements of the
> analysis can then be promoted interactively by the user to be features
> (genbank style).
>
> Malcolm
>
>
> > -----Original Message-----
> > From: Scott Markel [mailto:SMarkel at accelrys.com]
> > Sent: Monday, February 09, 2009 12:35 PM
> > To: Cook, Malcolm; 'bioperl-ml'
> > Subject: RE: [Bioperl-l] SeqIO-based parser for Vector NTI
> > sequence files
> >
> > Malcolm,
> >
> > It looks like Vector NTI puts features into COMMENT lines
> > rather than leveraging the DDBJ/EMBL/GenBank Feature table
> > syntax. I'd like to treat these features the same way I
> > treat other features, hence my interest in parsing them.
> >
> > My only example file is from a customer so the following
> > snippets have been tweaked a bit. My replacements are in
> > angle brackets: <...>.
> >
> > COMMENT <date here> <user name here> wrote:
> > <user comment here>
> >
> > .COMMENT This file is created by Vector NTI
> > http://www.informaxinc.com/
> > COMMENT ORIGDB|GenBank
> > COMMENT VNTDATE|<integers here>|
> > COMMENT VNTDBDATE|<integers here>|
> > COMMENT LSOWNER|
> > COMMENT VNTNAME|<string here>|
> > COMMENT VNTAUTHORNAME|<user name here>|
> > COMMENT VNTREPLTYPE|<string here>
> > COMMENT VNTEXTCHREPL|Animal/Other Eukaryotic
> > COMMENT Vector_NTI_Display_Data_(Do_Not_Edit!)
> > COMMENT (SXF
> > COMMENT (CGexDoc "<string here>" 0 7616
> > COMMENT (CDBMol 0 0 1 1 1 0 0 0 0 "" "" 0 0 0 0 (CObList) (CObList) (CObList)
> > COMMENT (CObList) -1 "")
> > COMMENT (CDocSetData 1 1 0 1 0 1 "MAIN" 1 1 1 1 1 0 1 1 1 0 10 10 4294967295 50 0
> > COMMENT 1 0 (CHomObj 1 0 0 3 100) (CWordArray 23) (CWordArray)
> > COMMENT (CStringList <multiple quoted strings here>)
> > COMMENT (CStringList <multiple quoted strings here>) (CStringList <multiple quoted strings here>)
> > COMMENT (CObList
> > COMMENT #0=(COligo <quoted string here> <quoted string here>
> > COMMENT "Tm: 52.1C Length: 16mer GC: 56.3%" 0 (CStringList) 0)
> > COMMENT #1=(COligo <quoted string here> <quoted string here>
> > COMMENT "Tm: 56.8C Length: 18mer GC: 61.1%" 0 (CStringList) 0)
> >
> > There are also some hierarchical sections.
> >
> > COMMENT (CObList) (CObList) (CObList)
> > COMMENT (CTextView 0
> > COMMENT #120=(CGroupPar (CParagraph 0 (0 0) 1 2 0 0 180)
> > COMMENT (CObjectList
> > COMMENT #121=(CRefLinePar
> > COMMENT (CLinePar (CParagraph 0 (0 0) 0 2 0 1 233) <quoted string here> 2) 5
> > COMMENT "" 0 4)
> > COMMENT #122=(CFolderPar
> > COMMENT (CGroupPar (CParagraph 1 (0 0) 1 1 0 0 178)
> > COMMENT (CObjectList
> > COMMENT #123=(CLinePar (CParagraph 0 (0 0) 1 2 1 0 180)
> > COMMENT <quoted string here> 1)
> > COMMENT #124=(CLinePar (CParagraph 0 (0 0) 1 2 1 0 180)
> > COMMENT <quoted string here> 1)
> >
> > Scott
> >
> > > -----Original Message-----
> > > From: Cook, Malcolm [mailto:MEC at stowers.org]
> > > Sent: Monday, 09 February 2009 6:51 AM
> > > To: Scott Markel; 'bioperl-ml'
> > > Subject: RE: [Bioperl-l] SeqIO-based parser for Vector NTI sequence
> > > files
> > >
> > > Scott,
> > >
> > > What do you expect to extract from the COMMENT lines?
> > >
> > >
> > > Malcolm Cook
> > > Database Applications Manager - Bioinformatics
> > > Stowers Institute for Medical Research - Kansas City, Missouri
> > >
> > > -----Original Message-----
> > > From: bioperl-l-bounces at lists.open-bio.org
> > > [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Scott Markel
> > > Sent: Tuesday, October 21, 2008 3:49 PM
> > > To: bioperl-ml
> > > Cc: smarkel at accelrys.com
> > > Subject: [Bioperl-l] SeqIO-based parser for Vector NTI sequence files
> > >
> > > I'm looking for a BioPerl-related solution to parsing Vector NTI
> > > sequence files. The genbank.pm parser will work, but it doesn't parse
> > > the COMMENT lines beyond grabbing the simple string value, so it
> > > misses all of the added information in those lines.
> > >
> > > If you know of any existing code, I'd be interested in hearing about it.
> > > I checked BioPerl, BioJava, and EMBOSS documentation.
> > > I also checked the Invitrogen web site.
> > >
> > > Scott
> > >
> > > --
> > > Scott Markel, Ph.D.
> > > Principal Bioinformatics Architect email: smarkel at accelrys.com
> > > Accelrys (SciTegic R&D) mobile: +1 858 205 3653
> > > 10188 Telesis Court, Suite 100 voice: +1 858 799 5603
> > > San Diego, CA 92121 fax: +1 858 799 5222
> > > USA web: http://www.accelrys.com
> > >
> > > http://www.linkedin.com/in/smarkel
> > > Board of Directors: International Society for Computational Biology
> > > Co-chair: ISCB Publications Committee
> > > Associate Editor: PLoS Computational Biology
> > > Editorial Board: Briefings in Bioinformatics
> > > _______________________________________________
> > > Bioperl-l mailing list
> > > Bioperl-l at lists.open-bio.org
> > > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >