[Biopython-dev] GenBank parser -- first go
Edwin Steele
edwin.steele at eBioinformatics.com
Mon Dec 11 01:09:09 EST 2000
Brad,
> Here's a justification for this. It's already common practice with
> GenBank files to have subitems indented under the major item. For
> example,
>
> SOURCE      thale cress.
>   ORGANISM  Arabidopsis thaliana
I've come across a few caveats with indenting. Apart from the feature
table, there used to be only one level of subitem. The new PUBMED tag
breaks this paradigm:
REFERENCE   1  (bases 1 to 675)
  AUTHORS   Sant,V.J., Sainani,M.N., Sami-Subbu,R., Ranjekar,P.K. and
            Gupta,V.S.
  TITLE     Ty1-copia retrotransposon-like elements in chickpea genome: their
            identification, distribution and use for diversity analysis
  JOURNAL   Gene 257 (1), 157-166 (2000)
   PUBMED   11054578
It's indented three spaces instead of two...
Brad, this will mean your indent_space definition will break (or pick
up unnecessary stuff).
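
One way around this is to measure the indent rather than hard-code it. The
following is only a rough sketch, not the parser's actual code: the name
split_header_line, the regular expression, and the cut-off of four leading
spaces for keyword lines are all assumptions made for illustration.

```python
import re

# Hypothetical helper, not Biopython's real parser. Assumption: keyword
# lines start within the first few columns, while continuation lines are
# indented much further (12 spaces in the REFERENCE example above).
_KEYWORD_LINE = re.compile(r"^( {0,4})([A-Z]+)(?:\s+(.*))?$")


def split_header_line(line):
    """Return (indent, keyword, value); keyword is None for other lines."""
    line = line.rstrip("\n")
    match = _KEYWORD_LINE.match(line)
    if match is None:
        # Continuation line (or anything else): report its indent and text.
        indent = len(line) - len(line.lstrip(" "))
        return indent, None, line.strip()
    indent, keyword, value = match.groups()
    return len(indent), keyword, (value or "").strip()
```

With this approach JOURNAL (two spaces) and PUBMED (three spaces) are both
recognised as subitems, and the caller can decide what to do with the
reported indent instead of failing on it.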
Also, it's not safe to assume that the initial indent is two spaces.
In some of the larger entries, like LMFLCHR12, which is about 2000000 bp
long, the seven-figure positions in the ORIGIN section leave a
one-character indent instead of the normal two-character minimum.
ORIGIN
       1 TCAGTTTGTG CGGGGTGTGC ATATGCATGT GCATGCATAC ATGCACATAC ACATATATAC
...
 2287441 GCGTCACGTG GCGACGTCGA GGCCCGCAGC TTCTATTTTT TTT
//
I don't think this will break anything in the parser, but it is
something to remember if you become more strict...
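
For illustration, a sequence line can be read without relying on the width
of the indent at all by splitting on whitespace. This is a minimal sketch
under that assumption; parse_origin_line is a made-up name, not an existing
Biopython function.

```python
def parse_origin_line(line):
    """Split an ORIGIN sequence line into (start_position, sequence).

    Hypothetical helper: it ignores how many spaces precede the position
    number, so one-character and wider indents are handled the same way.
    """
    fields = line.split()
    if not fields or not fields[0].isdigit():
        raise ValueError("not a sequence line: %r" % line)
    # The first field is the position; the rest are blocks of bases.
    return int(fields[0]), "".join(fields[1:])
```

A stricter parser could still check the indent separately and warn, rather
than reject, when it is narrower than expected.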
Cheers,
Edwin.
-------------------------------------------------------------------------------
Edwin Steele
QA Manager, eBioinformatics. http://www.ebioinformatics.com
email: edwin.steele at eBioinformatics.com Bay 16/104, Australian Technology Park
ph: +61 (2) 9209-4765 Eveleigh 1430, NSW, Australia.