[Biopython-dev] Creating a NCBIFastaIterator

Fri Oct 7 15:38:04 UTC 2011

Adding my unsolicited opinion here: what do y'all think of making this 
NCBIFasta parser a more general "callback" parser, where a function 
passed to read() or write() translates some arbitrary delimited text 
into an (id, name, description) tuple? For example:

def x(seqrec):
     # gi|<gi_num>|ref|<accession>|<description>
     y = seqrec.description.strip().split("|")

     #       gi     acc  desc
     return (y[1], y[3], y[4])

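# calls x on every record in the FASTA file; passing x as the third
# argument is the proposed API, not something SeqIO.parse() accepts today
for seqrec in SeqIO.parse(fp, "fasta", x):
     print(seqrec)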

This would be similar to key_function in SeqIO.to_dict() and would shift 
the responsibility of handling variation in formats to the user. 
Alternatively, a few functions to parse different styles of description 
lines could be included in the module.
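
To make the idea concrete, here is a rough sketch that does not touch Bio.SeqIO at all. Everything in it is hypothetical: parse_gi_ref_title, simple_title, iter_fasta and the title_function argument are names I made up for illustration, and the sequence in the usage lines is invented. It only shows how a per-title callback could sit next to a couple of ready-made title parsers.

```python
def parse_gi_ref_title(title):
    """Split 'gi|<gi_num>|ref|<accession>|<description>' into (id, name, description)."""
    # maxsplit=4 keeps any later "|" characters inside the description
    fields = title.strip().split("|", 4)
    if len(fields) != 5 or fields[0] != "gi":
        raise ValueError("not a gi|...|ref|...|... title: %r" % title)
    return fields[1], fields[3], fields[4].strip()


def simple_title(title):
    """Fallback: the first word is both id and name, the whole title is the description."""
    words = title.split(None, 1)
    identifier = words[0] if words else ""
    return identifier, identifier, title


def iter_fasta(lines, title_function=simple_title):
    """Yield (id, name, description, sequence) for each record in an iterable of FASTA lines."""
    title, chunks = None, []
    for line in lines:
        line = line.rstrip()
        if line.startswith(">"):
            # a new header closes the previous record, if there was one
            if title is not None:
                yield title_function(title) + ("".join(chunks),)
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line)
    if title is not None:
        yield title_function(title) + ("".join(chunks),)


# usage: any iterable of lines works, including an open file handle
example = [
    ">gi|45478711|ref|NC_005816.1| Yersinia pestis plasmid pPCP1\n",
    "TGTAACGAACGG\n",
    "TGCAATAG\n",
]
for record in iter_fasta(example, parse_gi_ref_title):
    print(record)
```

The user picks (or writes) the title function, so the module itself would not need to know every NCBI variation, only ship parsers for the common ones.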

Andrew

On 10/07/2011 08:49 AM, Peter Cock wrote:
> On Fri, Oct 7, 2011 at 12:18 PM, Keith Hughitt <keith.hughitt at gmail.com> wrote:
>> Okay, I took a stab at it. The code is in the master branch of my
>> fork: https://github.com/khughitt/biopython/blob/75be77cf28d376329577adf5ec41a8880b7faf5c/Bio/SeqIO/FastaIO.py#L73
>
> You are only handling gi|<gi_num>|ref|<accession>|<description>
> whereas the NCBI have a *lot* of other variations to consider:
>
> http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/formatdb_fastacmd.html
>
> This is quite an open ended bit of work...
>
>> I wasn't sure what the best choices are for id/name so for now I stored the
>> gid in id (and also in the annotations), and the accession as name. Any
>> suggestions?
>
> I suggest collecting a selection of matched NCBI FASTA and
> GenBank/GenPept files, looking at how Biopython handles the
> GenBank/GenPept version (format name "genbank", alias "gb",
> in Bio.SeqIO), and trying to make the FASTA version, parsed as
> "fasta-ncbi", give the same result.
>
> e.g. From our unit tests (from the NCBI FTP site), these are
> a pair:
>
> Tests/GenBank/NC_005816.gb
> Tests/GenBank/NC_005816.fna
>
>> I also haven't written any test code yet. Should I parameterize
>> TitleFunctions.simple_check and multi_check, or is there
>> another approach you would advise?
>> Keith
>
> Probably write some completely new tests. e.g. Use the
> existing test files mentioned above, and verify that both
> the "genbank" and the "fasta-ncbi" parser give the same
> results (ignoring things not in the FASTA file of course).
>
> Peter