[Biopython-dev] [Bug 2531] Nexus and fasta parsers have a problem with identical taxa names

bugzilla-daemon at portal.open-bio.org
Mon Jun 30 16:12:29 UTC 2008


http://bugzilla.open-bio.org/show_bug.cgi?id=2531





------- Comment #7 from biopython-bugzilla at maubp.freeserve.co.uk  2008-06-30 12:12 EST -------
It looks like I didn't have the latest version of Bio.Nexus on this machine,
which may have added to the confusion.  I've just updated to CVS (i.e. almost
exactly Biopython 1.46), and my issue with the matrix being None has gone away.
Oops.

>>> from Bio.Nexus import Nexus
>>> n = Nexus.Nexus(open('eg.nex'))
>>> n.matrix.keys()
['HI99.Line5.copy', 'am', 'HI99.Line1.copy', 'ezo', 'HI99.Line0.copy',
'DI05.Line5.copy', 'DI05.Line0.copy', 'DI05.Line8.copy1', 'DI05.Line1.copy1',
'HI99.Line3.copy', 'HI99.Line1.copy1', 'DI05.Line1.copy', 'DI05.Line9.copy',
'DI05.Line8.copy', 'HI99.Line4.copy', 'vir', 'DI05.Line8', 'DI05.Line9',
'HI99.Line2.copy', 'DI05.Line2', 'DI05.Line3', 'DI05.Line0', 'DI05.Line1',
'DI05.Line6', 'DI05.Line7', 'DI05.Line4', 'DI05.Line5', 'HI99.Line1',
'HI99.Line0', 'HI99.Line3', 'HI99.Line2', 'HI99.Line5', 'HI99.Line4']
>>> assert [id for id in n.matrix] == n.matrix.keys()
>>> n.matrix['HI99.Line5']
Seq('ATCGATAGCATTGCGG-GGACGACGATGGACATTTGGAAAACGAATATGAAAAT...GAG',
IUPACAmbiguousDNA())
>>> n.matrix['HI99.Line5'][249-1]
'T'
>>> n.matrix['HI99.Line5'][417-1]
'T'
>>> n.matrix['HI99.Line5'][452-1]
'A'
>>> n.matrix['HI99.Line5.copy']
Seq('ATCGATAGCATTGCGGCGGACGACGATGGACATTTGGAAAACGAATATGAAAAT...GAG',
IUPACAmbiguousDNA())
>>> n.matrix['HI99.Line5.copy'][249-1]
'C'
>>> n.matrix['HI99.Line5.copy'][417-1]
'C'
>>> n.matrix['HI99.Line5.copy'][452-1]
'G'

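So far this looks good: the repeated taxon has been renamed with a .copy
suffix, and HI99.Line5 and HI99.Line5.copy hold distinct sequences (they
differ at columns 249, 417 and 452).  However: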

>>> n.original_taxon_order
['vir', 'am', 'ezo', 'DI05.Line5', 'DI05.Line1', 'DI05.Line9', 'DI05.Line2',
'DI05.Line3', 'HI99.Line2', 'HI99.Line1', 'HI99.Line5', 'DI05.Line4',
'DI05.Line1', 'DI05.Line7', 'HI99.Line3', 'DI05.Line6', 'DI05.Line8',
'HI99.Line4', 'DI05.Line1', 'HI99.Line1', 'DI05.Line8', 'DI05.Line5',
'HI99.Line2', 'HI99.Line0', 'HI99.Line0', 'HI99.Line5', 'DI05.Line9',
'HI99.Line3', 'DI05.Line0', 'DI05.Line0', 'HI99.Line4', 'HI99.Line1',
'DI05.Line8']

When writing the Bio.SeqIO code that calls Bio.Nexus, I hadn't realized that
Bio.Nexus keeps the unedited taxon names around in original_taxon_order.
Bio.SeqIO was using this list of non-unique original identifiers rather than
the renamed keys of the matrix, which explains why you end up with two copies
of HI99.Line5.
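
As a rough sketch of the failure mode, here is a plain Python version that
does not use Bio.Nexus or Bio.SeqIO at all.  The uniquify helper and the toy
sequences are made up for illustration, and the .copy / .copy1 suffix scheme
is only inferred from the matrix keys shown above, so treat it as an
assumption rather than the actual Bio.Nexus code.

```python
def uniquify(names):
    """Rename repeats as name.copy, name.copy1, ... (hypothetical helper)."""
    seen = set()
    unique = []
    for name in names:
        candidate = name
        n = 0
        # Keep trying suffixes until the candidate has not been used yet.
        while candidate in seen:
            candidate = name + ".copy" + (str(n) if n else "")
            n += 1
        seen.add(candidate)
        unique.append(candidate)
    return unique


# Toy data standing in for the taxon labels in the file (not the real alignment).
original = ["HI99.Line5", "DI05.Line4", "HI99.Line5"]
unique = uniquify(original)
matrix = dict(zip(unique, ["ATCG-", "ATCGA", "ATCGC"]))

# Looking records up by the original names returns the first record twice.
buggy = [matrix[name] for name in original]
# Looking them up by the unique names returns each record exactly once.
fixed = [matrix[name] for name in unique]
print(buggy)
print(fixed)
```

The point is simply that any lookup driven by the original, non-unique names
can never reach the renamed .copy entries.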

Sorry Frank - I was pointing fingers when it was my own bug after all!

