[Biopython-dev] Creating a NCBIFastaIterator
Peter Cock
p.j.a.cock at googlemail.com
Fri Oct 7 12:49:30 UTC 2011
On Fri, Oct 7, 2011 at 12:18 PM, Keith Hughitt <keith.hughitt at gmail.com> wrote:
> Okay, I took a stab at it. The code is in the master branch of my
> fork: https://github.com/khughitt/biopython/blob/75be77cf28d376329577adf5ec41a8880b7faf5c/Bio/SeqIO/FastaIO.py#L73
You are only handling gi|<gi_num>|ref|<accession>|<description>
whereas the NCBI have a *lot* of other variations to consider:
http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/formatdb_fastacmd.html
This is quite an open-ended bit of work...
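To give a rough idea of what handling more than one variation might look like, here is a minimal sketch in plain Python. It is not the code in the fork and not anything in Biopython: the function name parse_ncbi_title and the NCBI_ID_FIELDS table are made up for illustration, the field counts are my reading of the NCBI page above and are not exhaustive, and the example title is only illustrative of the gi|...|ref|...| style.

```python
# Hypothetical table: number of "|"-separated fields that follow each
# database tag. Assumed from the NCBI formatdb/fastacmd page; incomplete.
NCBI_ID_FIELDS = {
    "gi": 1, "bbs": 1, "lcl": 1,
    "gb": 2, "emb": 2, "dbj": 2, "ref": 2, "sp": 2,
    "pir": 2, "prf": 2, "pdb": 2, "pat": 2, "gnl": 2,
}


def parse_ncbi_title(title):
    """Split an NCBI-style FASTA title into ({tag: [fields]}, description).

    Hypothetical helper: parsing stops at the first tag it does not know.
    """
    # The identifier block runs up to the first space; the rest is free text.
    id_part, _, description = title.lstrip(">").partition(" ")
    tokens = id_part.split("|")
    ids = {}
    i = 0
    while i < len(tokens):
        count = NCBI_ID_FIELDS.get(tokens[i])
        if count is None:
            break  # unknown tag or trailing empty token
        # Fields may be empty strings, e.g. the locus after ref|accession|
        ids[tokens[i]] = tokens[i + 1:i + 1 + count]
        i += 1 + count
    return ids, description.strip()


example_ids, example_desc = parse_ncbi_title(
    "gi|45478711|ref|NC_005816.1| Yersinia pestis plasmid pPCP1"
)
```

A table-driven approach like this keeps each new identifier type to a one-line change, although some of the variations may well need special handling that a simple field count cannot express.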
> I wasn't sure what the best choices are for id/name so for now I stored the
> gid in id (and also in the annotations), and the accession as name. Any
> suggestions?
I suggest collecting a selection of matched NCBI FASTA and
GenBank/GenPept files, looking at how Biopython handles the
GenBank/GenPept version (format name "genbank", alias "gb",
in Bio.SeqIO), and trying to make the FASTA version, parsed as
"fasta-ncbi", give the same result.
e.g. From our unit tests (from the NCBI FTP site), these are
a pair:
Tests/GenBank/NC_005816.gb
Tests/GenBank/NC_005816.fna
> I also haven't written any test code yet. Should I parameterize
> TitleFunctions.simple_check and multi_check, or is there
> another approach you would advise?
> Keith
Probably write some completely new tests, e.g. use the
existing test files mentioned above and verify that the
"genbank" and "fasta-ncbi" parsers give the same results
(ignoring things not in the FASTA file, of course).
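As a sketch of that kind of check, something along these lines could work. The helper name compare_records and the list of fields are assumptions for illustration, and the records here are simple stand-in objects with made-up values so the snippet runs without Biopython or the test files.

```python
from types import SimpleNamespace


def compare_records(gb_record, fasta_record,
                    fields=("id", "name", "description")):
    """Return a list of (field, genbank_value, fasta_value) mismatches.

    Hypothetical helper: only compares fields a FASTA file can carry,
    plus the sequence itself.
    """
    mismatches = []
    for field in fields:
        gb_value = getattr(gb_record, field, None)
        fa_value = getattr(fasta_record, field, None)
        if gb_value != fa_value:
            mismatches.append((field, gb_value, fa_value))
    # Compare sequences as plain strings; report lengths to keep output short.
    if str(gb_record.seq) != str(fasta_record.seq):
        mismatches.append(("seq", len(gb_record.seq), len(fasta_record.seq)))
    return mismatches


# Stand-in records with made-up values (not parsed from the real files).
gb = SimpleNamespace(id="NC_005816.1", name="NC_005816",
                     description="example plasmid", seq="ACGT")
fa_same = SimpleNamespace(id="NC_005816.1", name="NC_005816",
                          description="example plasmid", seq="ACGT")
fa_diff = SimpleNamespace(id="45478711", name="NC_005816.1",
                          description="example plasmid", seq="ACGT")

same_result = compare_records(gb, fa_same)
diff_result = compare_records(gb, fa_diff)
```

In a real test the two arguments would be the records read from Tests/GenBank/NC_005816.gb with the "genbank" format and from Tests/GenBank/NC_005816.fna with the proposed "fasta-ncbi" format, and the test would assert that the returned list is empty.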
Peter