[Biopython] matching sequences from fasta files

Chris Fields cjfields at illinois.edu
Wed Mar 10 14:31:39 UTC 2010


On Mar 10, 2010, at 5:15 AM, Ivan Rossi wrote:

> On Wed, 10 Mar 2010, Peter wrote:
> 
>> For the special case of looking for perfect matches, you would be fine
>> with just Python - depending on your data files, you may be able to
>> match on the record identifiers
> 
> Don't trust that. We have seen many, many times the sequence change over time (in different releases of the databases) while keeping the same id.

If the database has a proper versioning scheme or date information, this should be detectable; otherwise I agree.

> It is much more robust to compare SHA1 (or MD5) hashes of the sequence, or do string comparisons.

Agreed there; it's probably the only foolproof way.
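
For illustration, here is a minimal sketch of the hash-based comparison in plain Python. Nothing in it comes from the thread itself: the helper names (read_fasta, sequence_digest, index_by_digest, matching_sequences), the sample records, and the choice to upper-case sequences before hashing are all assumptions made for the example. The small FASTA reader only keeps the sketch self-contained; with Biopython installed, Bio.SeqIO.parse(handle, "fasta") could supply the records instead.

```python
import hashlib
from io import StringIO


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA-formatted text handle."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            # Use the first whitespace-delimited token of the header as the id.
            header = line[1:].split()
            identifier = header[0] if header else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


def sequence_digest(sequence):
    """Return the SHA1 hex digest of a sequence.

    Upper-casing first is an assumption: it treats 'acgt' and 'ACGT' as equal.
    """
    return hashlib.sha1(sequence.upper().encode("ascii")).hexdigest()


def index_by_digest(handle):
    """Map each sequence digest to the list of identifiers carrying it."""
    index = {}
    for identifier, sequence in read_fasta(handle):
        index.setdefault(sequence_digest(sequence), []).append(identifier)
    return index


def matching_sequences(handle_a, handle_b):
    """Return {digest: (ids_in_a, ids_in_b)} for sequences present in both."""
    index_a = index_by_digest(handle_a)
    index_b = index_by_digest(handle_b)
    shared = index_a.keys() & index_b.keys()
    return {digest: (index_a[digest], index_b[digest]) for digest in shared}


# Hypothetical data: "seq1" keeps its id but changes sequence between
# releases, while the sequence of "seq2" reappears under the id "seq9".
old_release = ">seq1\nACGT\nACGT\n>seq2\nGGGG\n"
new_release = ">seq1\nACGTACGA\n>seq9\ngggg\n"

matches = matching_sequences(StringIO(old_release), StringIO(new_release))
for ids_old, ids_new in matches.values():
    print(ids_old, "matches", ids_new)
```

Matching on the digest rather than the identifier pairs records whose sequences are identical and ignores a shared id whose sequence has changed. Equal digests are very strong evidence of identical sequences; a direct string comparison of the paired sequences can confirm it if absolute certainty is needed.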

>> or simply do string comparisons of the sequences.
> 
> This is OK.
> 
> --
> Ivan Rossi, PhD - ivan AT biodec dot com OR ivan dot rossi3 AT unibo dot it
> BioDec Srl, Via Calzavecchio 20/2, I-40033 Casalecchio di Reno (BO), Italy
> Phone: (+39)-051-0548263 - Fax: (+39)-051-7459582 - http://www.biodec.com

chris (peeking in from bioperl ;)


