[Biopython-dev] Creating a NCBIFastaIterator
Andrew Sczesnak
andrew.sczesnak at med.nyu.edu
Fri Oct 7 15:38:04 UTC 2011
Adding my unsolicited opinion here: what do y'all think of making this
NCBIFasta parser a more general "callback" parser, where a function
passed to read() or write() translates some arbitrary delimited text
into an (id, name, description) tuple, as in:
    def x(seqrec):
        # gi|<gi_num>|ref|<accession>|<description>
        y = seqrec.description.strip().split("|")
        #       gi    acc   desc
        return (y[1], y[3], y[4])

    # calls x on every record in the FASTA
    for seqrec in SeqIO.parse(fp, "fasta", x):
        print(seqrec)
This would be similar to key_function in SeqIO.to_dict() and would shift
the responsibility of handling variation in formats to the user.
Alternatively, a few functions to parse different styles of description
lines could be included in the module.
Andrew
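
For illustration, the callback idea above can be sketched without touching Bio.SeqIO at all. This is only a minimal sketch under stated assumptions, not Biopython's API: parse_fasta, default_title and parse_gi_title are hypothetical names, the callback receives the raw title string rather than a SeqRecord, and the example header is made up.

```python
import io


def parse_gi_title(title):
    """Hypothetical callback for 'gi|<gi_num>|ref|<accession>|<description>' titles."""
    # Split at most 4 times so any "|" inside the description is preserved.
    fields = title.strip().split("|", 4)
    if len(fields) < 5 or fields[0] != "gi":
        raise ValueError("unexpected title format: %r" % title)
    gi_num, accession, description = fields[1], fields[3], fields[4].strip()
    return gi_num, accession, description


def default_title(title):
    """Fallback: the first word is both id and name, the whole title is the description."""
    words = title.split(None, 1)
    first_word = words[0] if words else ""
    return first_word, first_word, title


def parse_fasta(handle, title_function=default_title):
    """Yield (id, name, description, sequence) tuples from a FASTA handle.

    title_function is called once per record with the title line (minus ">")
    and must return an (id, name, description) tuple.
    """
    title, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if title is not None:
                yield title_function(title) + ("".join(chunks),)
            title, chunks = line[1:], []
        elif title is not None and line:
            chunks.append(line)
    if title is not None:
        yield title_function(title) + ("".join(chunks),)


if __name__ == "__main__":
    # Made-up record, used only to show the call pattern.
    example = io.StringIO(">gi|12345|ref|XX_000001.1| example sequence\nACGT\nTTGA\n")
    for rec_id, name, description, sequence in parse_fasta(example, parse_gi_title):
        print(rec_id, name, description, sequence)
```

With this shape, supporting another header style means writing another small callback, while the parser itself stays unchanged.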
On 10/07/2011 08:49 AM, Peter Cock wrote:
> On Fri, Oct 7, 2011 at 12:18 PM, Keith Hughitt <keith.hughitt at gmail.com> wrote:
>> Okay, I took a stab at it. The code is in the master branch of my
>> fork: https://github.com/khughitt/biopython/blob/75be77cf28d376329577adf5ec41a8880b7faf5c/Bio/SeqIO/FastaIO.py#L73
>
> You are only handling gi|<gi_num>|ref|<accession>|<description>
> whereas the NCBI have a *lot* of other variations to consider:
>
> http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/formatdb_fastacmd.html
>
> This is quite an open ended bit of work...
>
>> I wasn't sure what the best choices are for id/name so for now I stored the
>> gid in id (and also in the annotations), and the accession as name. Any
>> suggestions?
>
> I suggest collecting a selection of matched NCBI FASTA and
> GenBank/GenPept files, looking at how Biopython handles the
> GenBank/GenPept version (format name "genbank", alias "gb",
> in Bio.SeqIO), and trying to make the FASTA version, parsed as
> "fasta-ncbi", give the same result.
>
> e.g. From our unit tests (from the NCBI FTP site), these are
> a pair:
>
> Tests/GenBank/NC_005816.gb
> Tests/GenBank/NC_005816.fna
>
>> I also haven't written any test code yet. Should I parameterize
>> TitleFunctions.simple_check and multi_check, or is there
>> another approach you would advise?
>> Keith
>
> Probably write some completely new tests. e.g. Use the
> existing test files mentioned above, and verify that both
> the "genbank" and the "fasta-ncbi" parser give the same
> results (ignoring things not in the FASTA file of course).
>
> Peter