[Biopython-dev] Arlequin sequence files in Bio.Popgen

Fri Jul 10 22:52:41 UTC 2009

Hi David,

> Gee, I hope I haven't raised your hopes beyond my ability to deliver (both
> in terms of time and skills). I've uploaded my Arlequin classes and
> functions to a branch on github so you can see them (/Bio/PopGen/Arlequin/
> on http://github.com/dwinter/biopython/tree/arleq-branch)

This is great. I took your code and created a new version (nothing
more than an initial sketch, so feel free to disagree or propose
changes). You can find it here:
http://github.com/tiagoantao/biopython/tree/arlequin

Here are a few comments:
1. I've put indentation at 4 spaces, which I think is the biopython standard
2. I've split the code into Record (__init__.py) and your Seq code (Utils.py)
3. Just one note: samples and haplotype tables might not be lists,
but iterators. The problem is with very large files (e.g. thousands of
sequences) which do not fit in memory. While the current
implementation is fine, callers should expect only an iterator there,
not specifically an in-memory list. I think a list should be OK for
Arlequin genetic structures, which I hope are always small...
4. I've put a copyright message with your name in both files ;)
5. I HAVE NOT TESTED THE CODE CHANGES. This is just a proposed first draft.
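
To illustrate point 3, here is a minimal sketch of what "expect an
iterator, not a list" could mean in practice. All names in it
(SketchRecord, count_haplotypes, lazy_samples) are made up for this
example and are not taken from either branch; the (id, frequency,
sequence) tuple is also just an assumed shape for a haplotype entry.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical illustration only: none of these names exist in Bio.PopGen
# or in either branch discussed in this thread.
Haplotype = Tuple[str, int, str]  # assumed shape: (identifier, frequency, sequence)


class SketchRecord:
    """Holds samples as any iterable, so a list or a generator both work."""

    def __init__(self, samples: Iterable[Haplotype]):
        self.samples = samples


def count_haplotypes(record: SketchRecord) -> int:
    """Sum the frequencies in a single pass; never calls len() or indexes."""
    total = 0
    for _name, freq, _seq in record.samples:
        total += freq
    return total


def lazy_samples() -> Iterator[Haplotype]:
    """Stand-in for a parser that produces haplotypes one at a time."""
    yield ("h1", 3, "ACGT")
    yield ("h2", 1, "ACGA")


in_memory = SketchRecord([("h1", 3, "ACGT"), ("h2", 1, "ACGA")])
streamed = SketchRecord(lazy_samples())
```

Code written like count_haplotypes works with both records. Anything
that uses len(record.samples) or record.samples[0] would only work with
the in-memory one, and a generator-backed record can be walked only
once.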

OK, somebody has to write a parser to actually read the files in ;) ,
which is the biggest piece of work to be done. I don't mind doing it
(like in the next month or so - I have some free time now), but you
can do it if you want. In case you decide to do it, I have just one
major point to note: the parser must be able to read big files
(i.e., some files cannot be parsed into memory in one go). I made this
mistake with the genepop parser and some people do complain about it.
Some things cannot be read into memory as lists but have to be read as
iterators (point 3 above).
I think a parser that is able to handle lots of files is also good to
help in building a sound model to represent an arlequin record.
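
As a rough sketch of the kind of parser I mean (not a proposal for the
final API), a generator can walk the file handle and yield one
haplotype at a time, so memory use does not grow with file size. The
function name iter_sample_data is made up, and the layout it assumes
is deliberately simplified: a "SampleData= {" line opens a block, a "}"
line closes it, lines starting with "#" are comments, and every other
line inside is "id frequency sequence". Real Arlequin files have more
sections and data types than this.

```python
import io
from typing import Iterator, TextIO, Tuple


def iter_sample_data(handle: TextIO) -> Iterator[Tuple[str, int, str]]:
    """Lazily yield (id, frequency, sequence) from SampleData blocks.

    Hypothetical helper, assuming the simplified layout described above.
    Only the current line is held in memory at any time.
    """
    in_block = False
    for line in handle:
        stripped = line.strip()
        if not in_block:
            # Tolerate optional spaces around '=' when spotting the opener.
            if stripped.replace(" ", "").startswith("SampleData={"):
                in_block = True
            continue
        if stripped == "}":
            in_block = False
            continue
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments inside the block
        hap_id, freq, seq = stripped.split(None, 2)
        yield hap_id, int(freq), seq.replace(" ", "")


# Tiny made-up input standing in for a real (possibly huge) file.
example = io.StringIO(
    'SampleName="pop1"\n'
    "SampleSize=4\n"
    "SampleData= {\n"
    "h1 3 ACGT\n"
    "h2 1 ACGA\n"
    "}\n"
)
records = iter_sample_data(example)
first = next(records)  # only the lines up to the first haplotype were read
```

A caller that really wants a list can still do list(records), but the
choice is then theirs and not forced by the parser.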

As usual we will need test code and documentation for all this ;)

> By the way, is there a plan to have generic representations of populations,
> alleles etc in PopGen? It would make a parser for Arlequin files a much more
> useful tool. I found a few threads about it on the mailing lists around the
> birth of the module but not since.

I am actually afraid of a single generic representation. My main issue
with this is that I don't believe that it is possible to get it right.
There are many kinds of markers, types of data (frequency,
gametic-phase, non-phased) and population info (e.g. georeferencing)
to cover.

Once we get the genepop code and an arlequin parser fully working
I don't mind revisiting this, but I would like to delay this
discussion until after the genepop code and (if we get it done) the
arlequin code are in the production version.

Any comments would be most welcome,
Tiago