[Biopython] SeqIO.index()

Peter p.j.a.cock at googlemail.com
Sat Jan 30 14:08:57 UTC 2010


Hi

Your request makes perfect sense for FASTA files, but does not  
generalise to all the other supported file formats - hence the  
relatively limited callback support available in Bio.SeqIO.index.

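I suggest subclassing the FASTA indexer to do what you want.
Alternatively, for smaller files, use Bio.SeqIO.to_dict instead: its
key_function is passed the whole SeqRecord, so it can build the key
from the full description line.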
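
As a rough illustration of the idea (keying an index on the whole FASTA
header rather than on its first word), here is a minimal standalone sketch
in plain Python. It is not Biopython's indexer and does not subclass
anything in Bio.SeqIO; the names index_fasta, get_record and
full_header_key are made up for this example, and it assumes an ASCII
FASTA file opened in binary mode.

```python
import io


def full_header_key(header):
    """Hypothetical key function: whole header, whitespace replaced by '_'."""
    return "_".join(header.split())


def index_fasta(handle, key_function=full_header_key):
    """Map key_function(full header) -> byte offset where the record starts."""
    index = {}
    while True:
        offset = handle.tell()          # position of the line about to be read
        line = handle.readline()
        if not line:                    # end of file
            break
        if line.startswith(b">"):
            # Pass the complete header (minus the marker) to the callback,
            # not just the first whitespace-delimited word.
            header = line[1:].strip().decode("ascii")
            key = key_function(header)
            if key in index:
                raise ValueError("Duplicate key %r" % key)
            index[key] = offset
    return index


def get_record(handle, index, key):
    """Seek to a stored offset and return (header, sequence) for one record."""
    handle.seek(index[key])
    header = handle.readline()[1:].strip().decode("ascii")
    chunks = []
    while True:
        line = handle.readline()
        if not line or line.startswith(b">"):   # EOF or start of next record
            break
        chunks.append(line.strip().decode("ascii"))
    return header, "".join(chunks)


if __name__ == "__main__":
    # Two records sharing the same first word ("gene1") in the header.
    demo = io.BytesIO(b">gene1 sample=A\nACGT\nTT\n>gene1 sample=B\nGGCC\n")
    idx = index_fasta(demo)
    print(sorted(idx))
    print(get_record(demo, idx, "gene1_sample=B"))
```

Only the offsets are held in memory, so this scales to large files in the
same spirit as an index; a key function that looks only at the first word
would raise ValueError on the demo data because both records start with
"gene1".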

Regards

Peter

On 30 Jan 2010, at 08:46, Sebastian Schmeier <s.schmeier at gmail.com>  
wrote:

> Dear community,
>
> I am new to the mailing list and have a problem/question regarding the
> SeqIO.index() method/module. Up to now, I have usually used a home-brewed
> FASTA file parser. This time, though, I had a look at the SeqIO
> interface. I am especially interested in the index() method.
>
> The FASTA files I use have non-standardized (if this is even possible)
> headers. I found that the index method uses the first string after the
> marker, up to the first space, as the key for the dictionary (I will
> call this ID in the text below). It is, however, a great idea to have a
> function argument "key_function" that allows adjusting the key values
> via a self-implemented callback function. This is essential in my case
> because the IDs in my FASTA files are not unique per entry.
>
> I had a look at the source code of SeqIO/_index.py and found that,
> unfortunately, in the current implementation the "key_function" only
> acts on the ID. I think it would make more sense to allow extracting a
> key from the complete header. Is this somehow possible with the current
> implementation?
>
> I refer here to the code in SeqIO/_index.py:
>
>
> class _SequentialSeqFileDict(_IndexedSeqFileDict) :
> .
> .
> .
>             if marker_re.match(line) :
>                 #Here we can assume the record.id is the first word after the
>                 #marker. This is generally fine... but not for GenBank, EMBL, Swiss
>                 self._record_key(line[marker_offset:].strip().split(None,1)[0], offset)
>                 ##### here you define that the key_function only acts on the first split
>
>
>
> Thanks,
> Seb
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython


