[Biopython-dev] Creating a NCBIFastaIterator

Tue Oct 4 11:46:19 UTC 2011

On Tue, Oct 4, 2011 at 12:31 PM, Keith Hughitt <keith.hughitt at gmail.com> wrote:
> Hi all,
>
> I was thinking recently that it would be nice if the FASTA file reader were
> able to check for known formats (e.g. NCBI) and then use that information to
> choose better values for name, id, etc.
>
> After some discussion with Peter Cock on GitHub, however, he convinced me
> that this would be problematic in terms of backwards compatibility, and that
> instead a better approach might be to add a new sub-format ("fasta-ncbi") to
> the list of supported format readers.
>
> This could go something like:
>
> 1. Create a new function in SeqIO.FastaIO for parsing NCBI-formatted FASTA
> files. Add it to the mapping of iterators.

Yes.

> 2. FastaIO.NCBIFastaIterator will simply call FastaIterator and then modify
> the result by assigning a new id, name, etc. (other suggestions?)

Store the GI number in the SeqRecord's annotations dictionary under
the key "gi" to match the GenBank parser. There may be other things
like this.

If the FASTA header does not match the NCBI style, that should
probably trigger an exception.
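
As a rough sketch of the two points above (not existing Biopython code):
the names parse_ncbi_header and ncbi_fasta_records are hypothetical, and
the regular expression assumes the simple single-identifier layout
"gi|number|db|accession| description". Real NCBI headers have more
variants than this handles. The wrapper only relies on each record
having id, name, description and annotations attributes, as a SeqRecord
from the plain "fasta" parser does.

```python
import re

# Assumed header layout (illustrative only):
#   gi|<number>|<db>|<accession>| <optional description>
_NCBI_HEADER = re.compile(r"^gi\|(\d+)\|([A-Za-z]+)\|([^|\s]*)\|\s*(.*)$")


def parse_ncbi_header(header):
    """Split an NCBI-style FASTA title line (without the leading '>')."""
    match = _NCBI_HEADER.match(header)
    if match is None:
        # Non-NCBI headers are rejected rather than silently passed through.
        raise ValueError("Not an NCBI-style FASTA header: %r" % header)
    gi, database, accession, description = match.groups()
    return {
        "gi": gi,
        "db": database,
        "accession": accession,
        "description": description,
    }


def ncbi_fasta_records(records):
    """Hypothetical wrapper: re-label records from a plain FASTA parser."""
    for record in records:
        fields = parse_ncbi_header(record.description)
        record.id = fields["accession"]
        record.name = fields["accession"]
        record.description = fields["description"]
        # GI number kept as a string under the "gi" key.
        record.annotations["gi"] = fields["gi"]
        yield record
```

With Biopython installed, this could be tried as
ncbi_fasta_records(SeqIO.parse(handle, "fasta")). Using the accession
for both id and name is just one possible choice.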

> 3. FastaIO.NCBIFastaWriter (modify and subclass FastaIO.FastaWriter?)

This will be harder, but yes in principle.

> 4. Modify code that interacts with NCBI services which return FASTA files
> and have it return an NCBIFastaIterator (First use a deprecation/warning to
> let users know of the pending change?)

No need. I'm pretty sure all the NCBI code (like Bio.Entrez) returns
handles so it is up to the end user to decide what to do with the
data, e.g. parse it with the current SeqIO "fasta" format, or save it
straight to disk.

> Does this sound like it would be a useful feature? What about the basic
> approach outlined above? Any suggestions?
>
> Keith

Yes, it sounds useful. I'm not sure where the most current NCBI
documentation is, but this is a good start:
http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/formatdb_fastacmd.html

Peter