From jbdundas at gmail.com  Thu Jul  2 04:20:09 2009
From: jbdundas at gmail.com (jitesh dundas)
Date: Thu, 2 Jul 2009 13:50:09 +0530
Subject: [emboss-dev] Task Update
Message-ID: <326ea8620907020120w4d286761m48a99996f11f1022@mail.gmail.com>

Dear Sir,

I have created the database design for the task of running EMBOSS tasks in parallel. As this is to be monitored from any place, I thought of making it web-based. Java/Servlets/JSP/Beans is my choice as I am comfortable with these.

Now, I will need an IP address of another machine, located in a distant place, with EMBOSS running on it. The first interface would be the master interface, tracking all the activities. This should be easy, provided I have all the IP addresses and program details.

Next, this is what will happen. I have Jemboss installed on my PC (with internet). I get the first task, say creating a sequence or any simple input function. This interface will need some input from another Jemboss interface on a machine in another location. The details are sent to this machine via the internet, as this is the widest network available. I am using Servlets/JSP with MySQL support.

Note: In the above scenario, to send an input from one interface, I will need a button or a menu item in the main interface of Jemboss; this will send the details on the click event. For the receiving part, when the user receives the details via email, a Jemboss receiving interface will listen for such mails and read the details from the email (all this is done in the background), and thus the details are passed to this interface. This is a little difficult but worth implementing.

Once done, this would mean that a person can execute an interface of Jemboss in India and send the result to a Jemboss interface in the UK, which in turn passes the details on to some other place. All this will be controlled and decided by the Monitor interface. This can be controlled by the user, but a degree of automation in scheduling is provided.
Any feedback is most welcome. I have started writing. Any experts in Java RMI and related areas, please help.

I request your reply.

Regards,
Jitesh

Thanks & Regards,
Jitesh Dundas
Phone:- +91-9860925706
http://jiteshbdundas.blogspot.com

---------- Forwarded message ----------
From: jitesh dundas
Date: Wed, Jun 24, 2009 at 8:31 PM
Subject: Fwd: [emboss-dev] (no subject)
To: Peter Rice
Cc: emboss-dev at lists.open-bio.org

Dear Sir,

This is the logic that I intend to implement:

1) I have EMBOSS installed on my laptop (India). I will need another machine (say in the UK) with EMBOSS installed on it. When I run one interface on my machine, the output will be sent to the UK machine. The UK machine will have another interface thread waiting for this input (from India). Please note that this information will be sent via internet/intranet (in encrypted form). Thus the execution will continue on the UK machine. In the same way, execution on the India machine will continue until it needs some input from the UK machine.

2) The decision to allot the tasks/input and output will be made by an independent monitoring master interface. This will continuously track the progress of both machines for each interface.

3) The information about each interface will have to be sent to a database for storage, e.g. a table with the fields interface number, activity start time, duration, timestamp, allowed execution time, input needed, output to be sent, etc. This will be used continuously by the monitor interface.

I wanted to implement one of my methods for managing these projects in parallel processing. Based on the results obtained, a paper with the results can be published.

Sir, this is my basic idea, which I intend to build on. I will need another machine that I can use for executing this idea. However, it will be needed after 1-2 weeks, by which time I intend to finish the prerequisite parts.

TECHNICAL POINTS:

1) RMI (Remote Method Invocation) will be needed.
2) Internet access / network connection will be needed.
3) MySQL DB.

I request your feedback.

Thanks & Regards,
Jitesh Dundas

---------- Forwarded message ----------
From: jitesh dundas
Date: Tue, Jun 16, 2009 at 1:42 PM
Subject: Re: [emboss-dev] (no subject)
To: Peter Rice

Dear Sir

I have installed BOINC on my PC and am currently studying the code; it could take me some time to get a grip on it. I assure you though that I will get the task done soon. I will update you about progress every 2 days.

Regards,
Jitesh

---------- Forwarded message ----------
From: emboss-dev-request at lists.open-bio.org <emboss-dev-request at lists.open-bio.org>
Date: Jun 15, 2009 10:30 PM
Subject: emboss-dev Digest, Vol 10, Issue 3
To: emboss-dev at lists.open-bio.org

Send emboss-dev mailing list submissions to
        emboss-dev at lists.open-bio.org

To subscribe or unsubscribe via the World Wide Web, visit
        http://lists.open-bio.org/mailman/listinfo/emboss-dev
or, via email, send a message with subject or body 'help' to
        emboss-dev-request at lists.open-bio.org

You can reach the person managing the list at
        emboss-dev-owner at lists.open-bio.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of emboss-dev digest..."

Today's Topics:

   1. Re: (no subject) (jitesh dundas)

----------------------------------------------------------------------

Message: 1
Date: Thu, 11 Jun 2009 20:59:45 +0530
From: jitesh dundas
Subject: Re: [emboss-dev] (no subject)
To: Peter Rice
Cc: emboss-dev at lists.open-bio.org
Message-ID:
        <326ea8620906110829p472a1b06x1e1f38a277c57959 at mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

Dear Sir,

I hope my previous email gave you a clear idea of my plan for this task. I will begin working on the code now. I will keep you posted every 2 days with my progress.

Please let me know if you need anything else from my side.

Regards,
Jitesh Dundas

On 6/10/09, jitesh dundas wrote:
> Dear Sir,
>
> Thank you for your reply.
> PLEASE FIND MY COMMENTS IN BLOCK LETTERS BELOW.
>
> On 6/9/09, Peter Rice wrote:
>> Dear Jitesh,
>>
>>> I need to know the priority on which any script/ application in EMBOSS
>>> is executed.
>>
>> Currently, all EMBOSS applications simply execute.
>>
>> When they terminate, if EMBOSS_LOGFILE is defined they can write a
>> single record to the logfile.
>>
>> However, we can extend this if that is what you are suggesting.
>>
>> All EMBOSS (and EMBASSY) applications start with a call to ajAcdInit
>> (often via embInitP or ajGraphInit)
>>
>> All EMBOSS applications end with a call to ajExit on success. Failed
>> applications should call ajExitBad or ajExitAbort ... unless they crash
>> with a segmentation fault or are otherwise terminated.
>>
>> So we have places to put in additional monitoring code.
>
> I NEED THE CENTRAL LOCATION FROM WHERE THIS MONITORING SCRIPT CAN BE
> ACCESSED. AN INTERFACE THAT WILL BE AT THE HEART OF EMBOSS. THIS WILL
> BE A COMMON SCRIPT AND THUS, IT MUST HAVE ACCESS TO ALL SCRIPTS.
>
>>> If applications in Emboss are to be executed, they need to be assigned
>>> a priority or an impact, besides the following:-
>>>
>>> 1) A master database or a table that stores list of applications
>>> running. These will be updated by a scheduled script running
>>> continuously in the background.
>>
>> This script could, for example, check the list of known running
>> applications and remove any that appear to have crashed.
>>
>>> 2) the front-end GUI needs to show a chart of applications running
>>> and parameters like progress, time consumed etc.
>>>
>>> Measuring progress needs some breakpoints. Their status will be
>>> pending, WIP or completed.
>>
>> We have no breakpoints in EMBOSS at present. Can you give examples of
>> what you have in mind?
>
> FOR EXAMPLE, THERE ARE 5 STAGES/APPLICATIONS RUNNING IN PARALLEL. EACH
> STAGE WILL BE DIVIDED INTO PARTS, WITH EACH PART'S END-POINT ACTING
> AS A COMPLETION SUB-TARGET.
>
> THE ENTRY IN THE DATABASE TABLE WILL HAVE PROCESS, SUB-PROCESS AND STATUS
> FIELDS. DETAILS OF EACH FIELD ARE ENTERED HERE.
> THE SCRIPT OR THE INTERFACE WILL RUN AND MONITOR THE EXECUTION
> PROGRESS OF EACH STAGE. REGULARLY, IT WILL UPDATE THE DATABASE TABLE.
>
>>> I will send further details in 1-2 days. Meanwhile, I request your
>>> feedback.
>>
>> Hope this helps.
>>
>> Peter Rice
>>
>
> PLEASE LET ME KNOW IF YOU NEED ANYTHING ELSE FROM MY SIDE.
>
> --
> Thanks & Regards,
> Jitesh Dundas
>
> Scientist, Edencore Technologies(www.edencore.net)
> Web Developer, JR Technologies, India
>
> Phone:- +91-9860925706
>
> http://jiteshbdundas.blogspot.com
>
> "NO IDEA IS STUPID,EITHER IT IS TOO GOOD TO BE TRUE OR IT IS WAY AHEAD
> OF ITS FUTURE "- GEORGE BERNARD SHAW.
>

From ajb at ebi.ac.uk  Tue Jul  7 08:26:58 2009
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Tue, 7 Jul 2009 13:26:58 +0100 (BST)
Subject: [emboss-dev] Move to libtool 2.2.6a
Message-ID: <58544.88.96.156.129.1246969618.squirrel@webmail.ebi.ac.uk>

Dear developers,

We would like to update the EMBOSS CVS source code from using libtool 1.5.x to libtool 2.2.6a. Libtool 2, in various versions, has been out now for well over a year. Many current distributions, after having spent some time putting it through its paces, have now adopted it e.g. Fedora, OpenSuSE, Mandriva, cygwin, MacOSX Snow Leopard etc.

This puts us in the unenviable position that whatever we do will probably irritate some people. We can't be in the position, though, where we eventually have to advise people to downgrade their software.

1) If we stay with 1.5.x for now then more and more developers using libtool 2.2.x are going to have to type (e.g.)
      autoreconf -fi
   prior to the 'aclocal -I m4' stages.
2) If we move to 2.2.6 then developers on machines using 1.5.x will need to install libtool 2.2.6 somewhere (and usually install fresh versions of autoconf [2.63] and automake [1.11] to the same directory tree to avoid currently installed versions referencing the older libtool).

People on this list are developers and, by definition, obviously more than capable of downloading 3 files from ftp.gnu.org and installing them. It takes about 10 minutes. MacOSX is a bit different but new versions are available from MacPorts.

So, the question is really whether anyone has any strong views for or against a move to libtool 2.2.6 now, given that we will need to get there in the near future?

Alan

From ajb at ebi.ac.uk  Wed Jul 15 07:18:37 2009
From: ajb at ebi.ac.uk (ajb at ebi.ac.uk)
Date: Wed, 15 Jul 2009 12:18:37 +0100 (BST)
Subject: [emboss-dev] EMBOSS 6.1.0 release now available
Message-ID: <36222.86.26.12.63.1247656717.squirrel@webmail.ebi.ac.uk>

Dear EMBOSS users and developers,

A new version of EMBOSS (6.1.0) is now available for download from our ftp server:

ftp://emboss.open-bio.org/pub/EMBOSS/

If you use any of the EMBASSY packages (e.g. PHYLIP, VIENNA etc) then, as usual, remember to re-download and compile those too.

A new version of mEMBOSS, the Windows port, is also available from:

ftp://emboss.open-bio.org/pub/EMBOSS/windows/mEMBOSS-6.1.0-setup.exe

Many new capabilities have been added and bugs fixed throughout.

Release highlights for EMBOSS include:

* Full support for the new SwissProt format.
  In most cases the entry can be read and written exactly
* Full support for EMBL and GenBank entries.
  In most cases the entry can be read and written exactly
* Support for FASTQ short read formats for sequence and quality data
* Full support for protein and nucleotide sequence parsing from PDB entries
* Full support for GFF3 feature format as the new default feature output
* Improved summary information at the end of report output
* Alignment output using multiple sequence formats
* Extended support for distance matrix file formats
* Improved support for regular expression and pattern searching
* Improved support for large sequence alignments
* Support for remote locations in feature table processing, for example
  retrieval in coderet.
* Output directory support extended to allow directories to be created
* Normalisation option for hydrophobicity plots (pepwindow and pepwindowall)
* Processing of methylation sites in restriction mapping
* Embossdata reports results alphabetically sorted
* Command line qualifiers should be unique after 5 characters to allow
  safe abbreviation
* Improved configuration procedures for X11 support
* Support for dasgff report format, making it possible to write
  EMBOSS-based DAS annotation servers

Release highlights for EMBASSY include:

* Support for MEME 4.0
* Phylipnew updated to Phylip 3.68
* Support for the HMMERDB environment variable in Hmmernew.
* Bug fixes for the MSE multiple sequence editor

Release highlights for Jemboss include:

* Refactoring of the source code
* Location of the 'Execution mode' menu moved near to the 'Go' button
  in the application forms. When a user runs a job for the first time
  in 'batch' mode an information message is displayed
* Automatic configuration of the standalone Jemboss GUI on UNIX systems
  after typing "make install" for EMBOSS. This standalone GUI can be
  run using the runJemboss.csh script in the EMBOSS 'bin' directory.
  This assumes that you have a reasonably up-to-date version of Java
  installed (1.6 preferred)

For future extensions, we have added:

* Parsing of cross-reference information from SwissProt and
  EMBL/GenBank formats
* Code to delete and update database indexes

New EMBOSS wiki

EMBOSS now has a Wiki at http://emboss.open-bio.org/wiki where we will maintain the master copies of documentation for the applications and libraries, and where we have sections for planning new features and applications for the next 3 years of funding.

Please contribute any corrections to the documentation and add new ideas to the "Planning" section. We will, of course, be making the wiki prettier as it matures.

Important note for Developers

New distributions of operating systems have started to use the series 2 version of libtool. We therefore now use this in our CVS repository. The latest stable version of libtool is 2.2.6a (reported by libtool itself as 2.2.6). Developers using systems with older (1.5.x) libtool versions will have to install a local copy of libtool. This would typically be done by downloading the source code from the GNU site:

ftp://ftp.gnu.org/

After installing libtool it will usually be necessary to then re-install autoconf (2.63) and automake (1.11) to the same directory root (they are often tied to the version of libtool they were provided with). They too are available from the GNU ftp server. Make sure that your PATH is refreshed between doing the installations of the GNU tools in order that the previous versions aren't referenced.

We note that one system (cygwin) currently provides an experimental version of libtool (2.2.7). Developers on these systems (and, in general, on any system with a higher version of libtool than in our CVS repository) should type:

   autoreconf -fi

before attempting compilation. We will usually keep up-to-date with libtool stable releases within a libtool series.
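
The rule described in this note can be sketched as a small decision helper. This is an illustration only, written in Python: `parse_version` and `advise` are hypothetical names, and treating every version older than 2.2.6 the same way as 1.5.x is an assumption, since the note only mentions 1.5.x explicitly.

```python
import re

# Version used in the EMBOSS CVS repository according to the note above
# (2.2.6a, which libtool itself reports as 2.2.6).
REPO_LIBTOOL = (2, 2, 6)


def parse_version(text):
    """Turn a string such as '2.2.6a' or '1.5.26' into a tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError("unrecognised version: %r" % text)
    return tuple(int(part or 0) for part in match.groups())


def advise(installed):
    """Suggest the step from the note for a given installed libtool."""
    version = parse_version(installed)
    if version < REPO_LIBTOOL:
        # Assumption: any older version is handled like 1.5.x.
        return ("install libtool 2.2.6 locally, "
                "then autoconf 2.63 and automake 1.11")
    if version > REPO_LIBTOOL:
        return "run 'autoreconf -fi' before compiling"
    return "no extra step needed"


for candidate in ("1.5.26", "2.2.6a", "2.2.7"):
    print(candidate, "->", advise(candidate))
```

The helper only compares version numbers; it does not inspect the system, so the installed version string would have to be supplied by the developer.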
New BBSRC funding and future work

As previously announced, we have recently been re-funded by the BBSRC. What we said in that announcement bears repeating here.

The core aims of the funding proposal were to continue support, maintenance and development of EMBOSS, and to provide extensive online training materials for users, developers and system administrators using text from a series of books to be published by Cambridge University Press. We are also explicitly targeting areas where we see EMBOSS can be expanded:

* Richer data content in EMBOSS outputs leading to major improvements
  in the integration and visualisation of results in browsers.
* Processing many more data fields in EMBOSS inputs (taxonomy, genes,
  GO terms, cross-references, keywords).
* Extending and improving database access: better indexing, query
  language support and combining searches across multiple databases,
  support for non-sequence data resources and new data access methods
* Scaling up the libraries and adding new applications to support the
  data volumes generated by next-generation sequencing runs. We
  anticipate many more users will be working with short read data
  mapped to reference sequences over the next few years.
* We aim to add at least 100 new applications in these 3 years.
  Suggestions for new applications are very welcome.
* Major work on new developments and new library code will start from
  August.

Alan

From javierluiso at gmail.com  Fri Jul 17 18:36:36 2009
From: javierluiso at gmail.com (Javier Luiso)
Date: Fri, 17 Jul 2009 19:36:36 -0300
Subject: [emboss-dev] EMBOSS and CUDA
Message-ID: <61d930160907171536t50f25d43g1ae67b524dadc90e@mail.gmail.com>

Hi,

I'm Javier and I work as a software developer in the computer graphics and visualization area. I have experience coding for GPUs and I'm quite interested in working to bring the power of HPC technologies such as CUDA to whichever EMBOSS programs could benefit from it.
Anything dealing with matrices would be a good first start. I'm open to any suggestions if there's someone working in the same direction.

Javier Luiso

From biopython at maubp.freeserve.co.uk  Mon Jul 20 12:56:45 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 20 Jul 2009 17:56:45 +0100
Subject: [emboss-dev] Regression in GenBank/GenPept parsing?
Message-ID: <320fb6e00907200956n7af9912aie833e974d4fd52a5@mail.gmail.com>

Hi all,

One of Biopython's unit tests uses the EMBOSS tools. This is for several tasks, including checking we agree for basic sequence translations using different tables, as well as making sure Biopython can parse the alignments output by needle and water. Another area is cross checking we can read each other's sequence output files.

I've been going over the Biopython unit tests with EMBOSS 6.1.0, and have found a regression compared to EMBOSS 6.0.1. This is to do with how EMBOSS parses a minimal GenBank file written with Biopython.

The file in question is a 10kb GenBank (well, a GenPept file as it holds protein sequences) converted from an IntelliGenetics file. I can email this on request. The file contains 16 records:

$ grep "^LOCUS" VIF_mase-pro.gb | wc -l
16

Using EMBOSS 6.0.1, there are warning messages about the LOCUS line, but all 16 records do get converted into FASTA format fine.
I'm not sure why it is complaining, and would be grateful for feedback:

$ embossversion
Writes the current EMBOSS version number to a file
6.0.1
$ seqret -sequence VIF_mase-pro.gb -sformat genbank -osformat fasta -auto -filter | grep ">" | wc -l
Warning: bad Genbank LOCUS line 'LOCUS most-likely 298 aa UNK 01-JAN-1980 '
Warning: bad Genbank LOCUS line 'LOCUS U455 298 aa UNK 01-JAN-1980 '
Warning: bad Genbank LOCUS line 'LOCUS HXB2R 298 aa UNK 01-JAN-1980 '
Warning: bad Genbank LOCUS line 'LOCUS ELI 298 aa UNK 01-JAN-1980 '
Warning: bad Genbank LOCUS line 'LOCUS MVP5180 298 aa UNK 01-JAN-1980 '
Warning: bad Genbank LOCUS line 'LOCUS AD_MAL 298 aa UNK 01-JAN-1980 '
Warning: bad Genbank LOCUS line 'LOCUS CPZGAB 298 aa UNK 01-JAN-1980 '
Warning: bad Genbank LOCUS line 'LOCUS CPZANT 298 aa UNK 01-JAN-1980 '
Warning: bad Genbank LOCUS line 'LOCUS ROD 298 aa UNK 01-JAN-1980 '
Warning: bad Genbank LOCUS line 'LOCUS EHOA 298 aa UNK 01-JAN-1980 '
Warning: bad Genbank LOCUS line 'LOCUS MM251 298 aa UNK 01-JAN-1980 '
Warning: bad Genbank LOCUS line 'LOCUS STM 298 aa UNK 01-JAN-1980 '
Warning: bad Genbank LOCUS line 'LOCUS AGM3 298 aa UNK 01-JAN-1980 '
Warning: bad Genbank LOCUS line 'LOCUS AGM677 298 aa UNK 01-JAN-1980 '
Warning: bad Genbank LOCUS line 'LOCUS SAB1C 298 aa UNK 01-JAN-1980 '
Warning: bad Genbank LOCUS line 'LOCUS SYK 298 aa UNK 01-JAN-1980 '
16

In any case, seqret 6.0.1 was able to convert this to a FASTA file of 16 records. However, seqret 6.1.0 fails - only the first record is extracted:

$ embossversion
Reports the current EMBOSS version number
6.1.0
$ seqret -sequence VIF_mase-pro.gb -sformat genbank -osformat fasta -auto -filter | grep ">" | wc -l
1

If there is something wrong with my LOCUS lines, I can fix them. Any thoughts? The LOCUS lines are reproduced above in the EMBOSS 6.0.1 warning messages.
One possible issue is the inclusion of an arbitrary date (01-JAN-1980, a common default which shouldn't get confused with a real date), over something equally arbitrary (like the date of the conversion), or simply omitting the date (which may be invalid).

Thanks,

Peter C.
(@Biopython)

From biopython at maubp.freeserve.co.uk  Mon Jul 20 13:57:59 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 20 Jul 2009 18:57:59 +0100
Subject: [emboss-dev] EMBOSS format name "fastq-sanger" in Biopython?
Message-ID: <320fb6e00907201057p50d2c4dar9ecc0a2c8a7cf9a5@mail.gmail.com>

Hi all at Biopython (and EMBOSS-dev CC'd),

Now that EMBOSS 6.1.0 is out I've started checking it against Biopython. As I mentioned on the Biopython mailing list a week ago, in particular I'd like to make sure we agree on the various FASTQ variants.

I'm waiting for EMBOSS to update the documentation on their website, but as I recall from talking to Peter Rice at BOSC/ISMB 2009 and a quick test this afternoon, they are using:

fastq - FASTQ where the qualities are ignored (useful for input?)
fastq-sanger - Standard Sanger style FASTQ using PHRED offset 33
fastq-solexa - Early Solexa/Illumina FASTQ, Solexa scores offset 64
fastq-illumina - Illumina 1.3+ FASTQ using PHRED offset 64

I was expecting "fastq" to be an EMBOSS input only format given how I had understood this to be interpreted (ignore the qualities). This makes sense for tasks like FASTQ to FASTA where the qualities can be ignored. I was however surprised that using "fastq" as an output format in EMBOSS seqret gives quality strings of double quote characters. This ASCII character (34) is outside the range used in the Solexa and Illumina 1.3+ FASTQ variants. If interpreted as a Sanger style FASTQ file this means a PHRED quality of one (meaning about random, a sensible default).

Enough background.
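
The arithmetic behind the double quote observation can be checked with a few lines of Python. This is only an illustrative sketch: `OFFSETS`, `MIN_SCORE` and `decode_char` are hypothetical names, not EMBOSS or Biopython API, and the lowest legal scores (0 for PHRED, -5 for Solexa) are assumptions stated in the code.

```python
# Illustrative only: ASCII offsets for the FASTQ variants listed above.
# OFFSETS, MIN_SCORE and decode_char are hypothetical names, not EMBOSS
# or Biopython API.
OFFSETS = {
    "fastq-sanger": 33,    # PHRED scores, offset 33
    "fastq-solexa": 64,    # Solexa scores, offset 64
    "fastq-illumina": 64,  # PHRED scores, offset 64
}

# Assumed lowest legal score per variant (Solexa scores can be negative).
MIN_SCORE = {"fastq-sanger": 0, "fastq-solexa": -5, "fastq-illumina": 0}


def decode_char(char, variant):
    """Return the score encoded by one quality character, or None if the
    character falls below the assumed valid range for that variant."""
    score = ord(char) - OFFSETS[variant]
    return score if score >= MIN_SCORE[variant] else None


# The double quote character is ASCII 34.
for variant in OFFSETS:
    print(variant, decode_char('"', variant))
```

Under these assumptions the double quote decodes to 1 as Sanger FASTQ (34 - 33) and to nothing valid in the two offset 64 variants, which is the behaviour described above.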
The reason for this email was that (subject to confirmation), Biopython's "fastq" matches EMBOSS's "fastq-sanger", so I'd like to consider adding this as an alias in Bio.SeqIO. I resisted adding aliases initially, but we now have "gb" for "genbank" to make working with Entrez a little easier, so there is a precedent. In this case, it will make some of the test_Emboss.py code cleaner if I can just use "fastq-sanger" everywhere and have both Biopython and EMBOSS understand this.

Peter

From biopython at maubp.freeserve.co.uk  Mon Jul 20 17:46:38 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 20 Jul 2009 22:46:38 +0100
Subject: [emboss-dev] EMBOSS seqret FASTQ support
Message-ID: <320fb6e00907201446x215fa100i3fc09dadb110216b@mail.gmail.com>

Hi all,

I've just been having a play with the FASTQ support in seqret from EMBOSS 6.1.0

This first example is included in Biopython's unit tests, and can be downloaded here:
http://biopython.org/SRC/biopython/Tests/Quality/solexa_example.fastq

This was taken from http://maq.sourceforge.net/fq_all2std.pl where it is given as an example of a Solexa (or early Illumina) format FASTQ file encoding Solexa scores with an ASCII offset of 64, and can be seen by doing:

$ perl fq_all2std.pl example
...
@SLXA-B3_649_FC8437_R1_1_1_610_79
GATGTGCAATACCTTTGTAGAGGAA
+SLXA-B3_649_FC8437_R1_1_1_610_79
YYYYYYYYYYYYYYYYYYWYWYYSU
@SLXA-B3_649_FC8437_R1_1_1_397_389
GGTTTGAGAAAGAGAAATGAGATAA
+SLXA-B3_649_FC8437_R1_1_1_397_389
YYYYYYYYYWYYYYWWYYYWYWYWW
@SLXA-B3_649_FC8437_R1_1_1_850_123
GAGGGTGTTGATCATGATGATGGCG
+SLXA-B3_649_FC8437_R1_1_1_850_123
YYYYYYYYYYYYYWYYWYYSYYYSY
@SLXA-B3_649_FC8437_R1_1_1_362_549
GGAAACAAAGTTTTTCTCAACATAG
+SLXA-B3_649_FC8437_R1_1_1_362_549
YYYYYYYYYYYYYYYYYYWWWWYWY
@SLXA-B3_649_FC8437_R1_1_1_183_714
GTATTATTTAATGGCATACACTCAA
+SLXA-B3_649_FC8437_R1_1_1_183_714
YYYYYYYYYYWYYYYWYWWUWWWQQ

I am pleased to say EMBOSS 6.1.0 will read this and convert it into a standard FASTA file:

$ seqret -sequence solexa_example.fastq -sformat fastq -osformat fasta -filter
>SLXA-B3_649_FC8437_R1_1_1_610_79
GATGTGCAATACCTTTGTAGAGGAA
>SLXA-B3_649_FC8437_R1_1_1_397_389
GGTTTGAGAAAGAGAAATGAGATAA
>SLXA-B3_649_FC8437_R1_1_1_850_123
GAGGGTGTTGATCATGATGATGGCG
>SLXA-B3_649_FC8437_R1_1_1_362_549
GGAAACAAAGTTTTTCTCAACATAG
>SLXA-B3_649_FC8437_R1_1_1_183_714
GTATTATTTAATGGCATACACTCAA

Or, output as a Sanger style FASTQ file (using PHRED qualities with an ASCII offset of 33):

$ seqret -sequence solexa_example.fastq -sformat fastq-solexa -osformat fastq-sanger -filter
@SLXA-B3_649_FC8437_R1_1_1_610_79
GATGTGCAATACCTTTGTAGAGGAA
+SLXA-B3_649_FC8437_R1_1_1_610_79
::::::::::::::::::8:8::46
@SLXA-B3_649_FC8437_R1_1_1_397_389
GGTTTGAGAAAGAGAAATGAGATAA
+SLXA-B3_649_FC8437_R1_1_1_397_389
:::::::::8::::88:::8:8:88
@SLXA-B3_649_FC8437_R1_1_1_850_123
GAGGGTGTTGATCATGATGATGGCG
+SLXA-B3_649_FC8437_R1_1_1_850_123
:::::::::::::8::8::4:::4:
@SLXA-B3_649_FC8437_R1_1_1_362_549
GGAAACAAAGTTTTTCTCAACATAG
+SLXA-B3_649_FC8437_R1_1_1_362_549
::::::::::::::::::8888:8:
@SLXA-B3_649_FC8437_R1_1_1_183_714
GTATTATTTAATGGCATACACTCAA
+SLXA-B3_649_FC8437_R1_1_1_183_714
::::::::::8::::8:88688822

Using Biopython, for example as shown on the following cookbook page, agrees perfectly (except that Biopython omits the
optional repeated title on the plus lines):
http://www.biopython.org/wiki/Reading_from_unix_pipes

This also agrees with the MAQ script - if you ignore its strange bug where it adds a "!" to the end of each quality string, see:
http://sourceforge.net/mailarchive/forum.php?thread_name=320fb6e00906170708lb2ce4f7qbc5dfa43543189a2%40mail.gmail.com&forum_name=maq-help

So far so good :)

Was there any particular reason why EMBOSS includes the redundant second title on the plus lines? I can see that doing this makes the FASTQ files perhaps slightly more likely to work with other parsers, but imposes quite a size penalty.

Peter C.

From biopython at maubp.freeserve.co.uk  Mon Jul 20 18:12:29 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 20 Jul 2009 23:12:29 +0100
Subject: [emboss-dev] EMBOSS seqret FASTQ support
In-Reply-To: <320fb6e00907201446x215fa100i3fc09dadb110216b@mail.gmail.com>
References: <320fb6e00907201446x215fa100i3fc09dadb110216b@mail.gmail.com>
Message-ID: <320fb6e00907201512r47d26224p64d42dc5ebefb11e@mail.gmail.com>

Earlier I wrote:
> Hi all,
>
> I've just been having a play with the FASTQ support in seqret from EMBOSS 6.1.0
> ...
> So far so good :)

Could anyone spot a "but" coming up?

Well, here we are - consider the following single Sanger format FASTQ record (originally from the NCBI SRA, I think SRA000271, but I would have to double check that).

@071113_EAS56_0053:1:1:182:712
ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG
+071113_EAS56_0053:1:1:182:712
@IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+

I would guess the problem is that the quality line starts with a @, meaning care is needed. Likewise of course, quality lines can start with a + character too (although in my quick testing EMBOSS seems happy with these). The ASCII code for @ is 64, meaning for a Sanger style file this is a PHRED quality of 64-33 = 31.
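
The decoding, and the reason a leading @ needs care, can be sketched in Python. This is an illustration rather than EMBOSS or Biopython code: `iter_four_line_fastq` and `sanger_to_phred` are hypothetical helpers, and the reader assumes every record is exactly four lines, which real FASTQ files do not guarantee.

```python
# Illustrative sketch, not EMBOSS or Biopython code.
SANGER_OFFSET = 33  # Sanger style FASTQ: PHRED scores, ASCII offset 33


def iter_four_line_fastq(lines):
    """Yield (title, sequence, quality) tuples, assuming each record is
    exactly four lines. Line position, not the first character, decides
    which line is a title, so a quality line starting with '@' is safe."""
    lines = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(lines) - 3, 4):
        title, seq, plus, qual = lines[i:i + 4]
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a four-line FASTQ record")
        yield title[1:], seq, qual


def sanger_to_phred(quality):
    """Map each quality character to its PHRED score."""
    return [ord(char) - SANGER_OFFSET for char in quality]


record = [
    "@071113_EAS56_0053:1:1:182:712",
    "ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG",
    "+071113_EAS56_0053:1:1:182:712",
    "@IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+",
]
title, seq, qual = next(iter_four_line_fastq(record))
scores = sanger_to_phred(qual)
print(title)
print(" ".join(str(score) for score in scores))
```

A parser that instead treats any line beginning with @ as the start of a new record would split this record in the middle, which is the failure mode guessed at above.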
Here is what Biopython gives for the FASTA conversion:

>071113_EAS56_0053:1:1:182:712
ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG

And this is what Biopython gives for the QUAL conversion, showing the PHRED scores as integers:

>071113_EAS56_0053:1:1:182:712
31 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 34 35 40 40 40 40 40 27 4 27 21 5 12 9 8 13 7 9 4 10

Anyway, EMBOSS doesn't seem to like this example FASTQ record:

$ seqret -sequence tricky_one.fastq -sformat fastq -osformat fasta -filter
Error: Unable to read sequence 'tricky_one.fastq'
Died: seqret terminated: Bad value for '-sequence' with -auto defined

This read is actually one of four records in the following Biopython test file, in which EMBOSS only seems to find the first record:
http://biopython.org/SRC/biopython/Tests/Quality/tricky.fastq

As described here, this is a hand modified version of a real NCBI FASTQ file to showcase several potential gotchas in parsing FASTQ (including some unlikely to occur in real life - unless someone were to concatenate FASTQ files from separate sources or something):
http://www.biopython.org/DIST/docs/api/Bio.SeqIO.QualityIO-module.html#FastqGeneralIterator

In fact, looking at that again now, maybe I should include another record where the sequence line starts with a "+" as well... maybe even a record with the quality split over multiple lines, some starting with @ and some with +. That would be an even better evil test ;)

Regards,

Peter C.

From pmr at ebi.ac.uk  Tue Jul 21 03:43:40 2009
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 21 Jul 2009 08:43:40 +0100
Subject: [emboss-dev] EMBOSS seqret FASTQ support
In-Reply-To: <320fb6e00907201512r47d26224p64d42dc5ebefb11e@mail.gmail.com>
References: <320fb6e00907201446x215fa100i3fc09dadb110216b@mail.gmail.com>
        <320fb6e00907201512r47d26224p64d42dc5ebefb11e@mail.gmail.com>
Message-ID: <4A6571AC.5090801@ebi.ac.uk>

Peter C. wrote:
> Could anyone spot a "but" coming up?
>
> Well, here we are - consider the following single Sanger format
> FASTQ record (originally from the NCBI SRA, I think SRA000271,
> but I would have to double check that).
>
> @071113_EAS56_0053:1:1:182:712
> ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG
> +071113_EAS56_0053:1:1:182:712
> @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+
>
> I would guess the problem is that quality line starts with a @,

Urghh ... I left an extra '@' test in even though I meant to take it out before the release.

I will make a patch for this ... have to look into a couple of your other queries at the same time as they are in the same source file.

Thanks

Peter

From biopython at maubp.freeserve.co.uk  Tue Jul 21 05:44:57 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 21 Jul 2009 10:44:57 +0100
Subject: [emboss-dev] Regression in GenBank/GenPept parsing?
In-Reply-To: <320fb6e00907200956n7af9912aie833e974d4fd52a5@mail.gmail.com>
References: <320fb6e00907200956n7af9912aie833e974d4fd52a5@mail.gmail.com>
Message-ID: <320fb6e00907210244g48da17a2nbf7309eae0bd1356@mail.gmail.com>

I wrote:
> ...
> I've been going over the Biopython unit tests with EMBOSS 6.1.0,
> and have found a regression compared to EMBOSS 6.0.1. This is
> to do with how EMBOSS parses a minimal GenBank file written
> with Biopython.
>
> The file in question is a 10kb GenBank (well, a GenPept file as
> it holds protein sequences) ...

As requested (off list), I have sent the GenBank file to Peter Rice to look at.

Peter C.

From biopython at maubp.freeserve.co.uk  Tue Jul 21 06:21:34 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 21 Jul 2009 11:21:34 +0100
Subject: [emboss-dev] Regression in GenBank/GenPept parsing?
In-Reply-To: <4A6591F6.20107@ebi.ac.uk>
References: <320fb6e00907200956n7af9912aie833e974d4fd52a5@mail.gmail.com>
        <4A6579BC.5080008@ebi.ac.uk>
        <320fb6e00907210207u2a2f03b5x6a69a8bcbb93bf14@mail.gmail.com>
        <4A658E4E.4090008@ebi.ac.uk>
        <320fb6e00907210255u59518deau1487ab564ab014d@mail.gmail.com>
        <4A6591F6.20107@ebi.ac.uk>
Message-ID: <320fb6e00907210321g46655a1h6f797009f9334cf@mail.gmail.com>

On Tue, Jul 21, 2009 at 11:01 AM, Peter Rice wrote:
>
> Peter C. wrote:
>> I guess "refseqp" means refseq protein? Another name for GenPept?
>
> Not quite ... because genpept has yet another variation of GenBank format.
>
> refseqp is the protein part of refseq.
>
>> Is "refseqp" a public EMBOSS format name, or something internal? I've
>> never noticed it in the documentation, e.g.
>> http://emboss.sourceforge.net/docs/themes/SequenceFormats.html#in
>
> We're in the process of updating that. Somewhere in among writing the
> books and creating the wiki the old website got left behind.
>
> My next task (once I've made sure your bugs are fixed) is to regenerate
> all the tables of formats.

Great. This may save you having to answer my next question, which was could you expand on what EMBOSS considers to be the differences between "genbank", "genpept" and "refseqp" as file formats? Of course, I may come up with further questions ;)

>> Biopython treats "genbank" format as meaning either a GenBank file
>> (with nucleotides) or a GenPept file (with amino acids). We detect this
>> based on the LOCUS line containing "bp" or "aa".
>
> So do we ... but we need two versions of the 'aa' LOCUS lines. We try to
> pick up the rest of the details for reuse in output.

Why do you need two versions of the 'aa' LOCUS line? Is this the "genpept" format versus "refseqp" issue alluded to earlier?

>> [Do you want to forward this back to the mailing list?]
>
> Will do.
>
> Peter

I've CC'd this reply to the list.
Peter From pmr at ebi.ac.uk Tue Jul 21 06:40:39 2009 From: pmr at ebi.ac.uk (Peter Rice) Date: Tue, 21 Jul 2009 11:40:39 +0100 Subject: [emboss-dev] Regression in GenBank/GenPept parsing? In-Reply-To: <320fb6e00907210321g46655a1h6f797009f9334cf@mail.gmail.com> References: <320fb6e00907200956n7af9912aie833e974d4fd52a5@mail.gmail.com> <4A6579BC.5080008@ebi.ac.uk> <320fb6e00907210207u2a2f03b5x6a69a8bcbb93bf14@mail.gmail.com> <4A658E4E.4090008@ebi.ac.uk> <320fb6e00907210255u59518deau1487ab564ab014d@mail.gmail.com> <4A6591F6.20107@ebi.ac.uk> <320fb6e00907210321g46655a1h6f797009f9334cf@mail.gmail.com> Message-ID: <4A659B27.4010902@ebi.ac.uk> Peter C. wrote: >> My next task (once I've made sure your bugs are fixed) is to regenerate >> all the tables of formats. > > Great. This may save you having to answer my next question, > which was could you expand on what EMBOSS considers to be > the differences between "genbank", "genpept" and "refseqp" as > file formats? Of course, I may come up with further questions ;) Oh, further questions please! We love answering them. GenPept format expects to find 9 fields on the LOCUS line. RefseqP format expects only 8. The difference is GenPept format including the original GenPept locus name. We may try to merge them one day. If we do, we would keep the format names but use one parser. Your Genpept (refseqp) format problem will be fixed in a patch. It was fine for one sequence but needed to rebuffer the input file to work with multiple input sequences. Meanwhile, could you tar up the biopython test data and scripts http://biopython.open-bio.org/SRC/biopython/Tests/ and I will try running the same data through EMBOSS to see what issues we can find. 
regards, Peter From biopython at maubp.freeserve.co.uk Tue Jul 21 06:52:19 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Jul 2009 11:52:19 +0100 Subject: [emboss-dev] EMBOSS and its FASTA like alignment output Message-ID: <320fb6e00907210352l76503d38n37e4dc4fc0f4cc33@mail.gmail.com> Hi, One of the many things I talked to Peter Rice about in Sweden was the Pearson FASTA like output from needle and water (e.g. what EMBOSS calls the markx10 output format), and why it includes the EMBOSS header and footer lines (which start with a # character), which are not present in real FASTA output. Biopython can parse the pairwise -m 10 output from Bill Pearson's FASTA tools, so in theory we (Biopython) should be able to parse the markx10 output from EMBOSS needle and water. We could probably cope with the extra header and footer, but I think it would be best if EMBOSS could produce something more closely matching the real FASTA output. Unfortunately, it appears to be more than just the headers which upset our parser - even ignoring them, EMBOSS markx10 output still looks rather different to (current) FASTA -m 10 output. Was the markx10 output mimicking a particular (old) version of the FASTA tools? ------------------------------------------------------------------ Peter R. did say it would be simple to turn off this header and footer output, so I thought I would try this myself. It looks like this is handled in file ajax/ajalign.c by function alignWriteMark, but I don't see a switch to disable the headers and footers.

From looking at other writers, to disable the header, I think I just need to replace this line in alignWriteMark:

    alignWriteHeaderNum(thys,iali);

with:

    /* turn off printing of the header, keep the calculation */
    thys->File = NULL;
    alignWriteHeaderNum(thys,iali);
    thys->File = outf;

I have worked out the footer gets printed by ajAlignWriteTail, but am unclear on where this is called by alignWriteMark.
The only place that seems to call it is ajAlignClose, and this calls ajAlignWriteTail unconditionally. Regards, Peter C. (@Biopython) From biopython at maubp.freeserve.co.uk Tue Jul 21 07:32:59 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Jul 2009 12:32:59 +0100 Subject: [emboss-dev] EMBOSS format name "fastq-sanger" in Biopython? In-Reply-To: <320fb6e00907201057p50d2c4dar9ecc0a2c8a7cf9a5@mail.gmail.com> References: <320fb6e00907201057p50d2c4dar9ecc0a2c8a7cf9a5@mail.gmail.com> Message-ID: <320fb6e00907210432h26da39b2ka24ceb1194a1be1a@mail.gmail.com> On Mon, Jul 20, 2009 at 6:57 PM, Peter wrote: > I was expecting "fastq" to be an EMBOSS input only format given > how I had understood this to be interpreted (ignore the qualities). This > makes sense for tasks like FASTQ to FASTQ where the qualities can > be ignored. I meant of course, for FASTQ to FASTA conversion the qualities (and how they are encoded, Sanger versus Solexa versus Illumina 1.3+) can be ignored. Peter From pmr at ebi.ac.uk Tue Jul 21 08:06:43 2009 From: pmr at ebi.ac.uk (Peter Rice) Date: Tue, 21 Jul 2009 13:06:43 +0100 Subject: [emboss-dev] EMBOSS and its FASTA like alignment output In-Reply-To: <320fb6e00907210352l76503d38n37e4dc4fc0f4cc33@mail.gmail.com> References: <320fb6e00907210352l76503d38n37e4dc4fc0f4cc33@mail.gmail.com> Message-ID: <4A65AF53.5090105@ebi.ac.uk> Peter wrote: > Hi, > > One of the many things I talked to Peter Rice about in Sweden > was the Pearson FASTA like output from needle and water (e.g. > what EMBOSS calls the markx10 output format), and why it > includes the EMBOSS header and footer lines (which start with > a # character), which are not present in real FASTA output. > > Biopython can parse the pairwise -m 10 output from Bill > Pearson's FASTA tools, so in theory we (Biopython) should > be able to parse the markx10 output from EMBOSS needle > and water. 
We could probably cope with the extra header > and footer, but I think it would be best if EMBOSS could > produce something more closely matching the real FASTA > output. Unfortunately, it appears to be more than just the > headers which upset our parser - even ignoring them, > EMBOSS markx10 output still looks rather different to > (current) FASTA -m 10 output. Was the markx10 output > mimicking a particular (old) version of the FASTA tools? The source code documentation refers to FASTA 3.4 which may be the last time I took a detailed look at the FASTA alignment outputs. Can you send us some example files so we can check for the significant differences? We plan to install all the bio* projects so it would be helpful to have a set of biopython parser scripts we can use to test locally. We can add them to our routine QA tests and flag up changes as soon as they appear. > Peter R. did say it would be simple to turn off this header and > footer output, so I thought I would try this myself. It looks like > this is handled in file ajax/ajalign.c by function alignWriteMark, > but I don't see a switch to disable the headers and footers. You correctly found how to turn off the header. The footer is reported for anything except pure sequence output. For the next release I will add attributes to the list of alignment formats to say whether the header and footer are needed. That will allow us better control and reporting. Meanwhile, we are very happy to standardise the markx* outputs to make them easier to parse. Biopython is the first project to report problems with this. There are alternatives - specifying -aformat and using some other alignment format for all applications - but we like to conform and will do our best to fit what parsers expect. Also, of course, once we know we are being parsed we will do our best not to let the output change.
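Until the output itself changes, a parser could drop the EMBOSS header and footer lines (those starting with a # character) before handing the rest to a FASTA -m 10 parser. A minimal Python sketch of that filtering step follows. The function name `strip_emboss_comments` and the sample text are hypothetical, chosen only for illustration, and as noted in the thread this step alone would not make markx10 output match real FASTA output, since the body differs too.

```python
def strip_emboss_comments(lines):
    """Yield only the lines that do not start with '#'.

    Hypothetical helper: it assumes every EMBOSS header and footer
    line begins with '#' in the first column.
    """
    for line in lines:
        if not line.startswith("#"):
            yield line


# Made-up stand-in for alignment output, not real markx10 text.
sample = [
    "#### header line\n",
    ">>query ..\n",
    "; some_tag: value\n",
    "#### footer line\n",
]

cleaned = list(strip_emboss_comments(sample))
print("".join(cleaned), end="")
```

Writing it as a generator means it can wrap an open file handle without reading the whole file into memory.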
regards, Peter Rice From biopython at maubp.freeserve.co.uk Tue Jul 21 09:05:35 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Jul 2009 14:05:35 +0100 Subject: [emboss-dev] EMBOSS and its FASTA like alignment output In-Reply-To: <4A65AF53.5090105@ebi.ac.uk> References: <320fb6e00907210352l76503d38n37e4dc4fc0f4cc33@mail.gmail.com> <4A65AF53.5090105@ebi.ac.uk> Message-ID: <320fb6e00907210605v7415b1b6id043af520c1bb8de@mail.gmail.com> Hi all, I've CC'd the Biopython-dev mailing list as this EMBOSS thread is becoming cross project. On Tue, Jul 21, 2009 at 1:06 PM, Peter Rice wrote: > > Peter wrote: >> Hi, >> >> One of the many things I talked to Peter Rice about in Sweden >> was the Pearson FASTA like output from needle and water (e.g. >> what EMBOSS calls the markx10 output format), and why it >> includes the EMBOSS header and footer lines (which start with >> a # character), which are not present in real FASTA output. >> >> Biopython can parse the pairwise -m 10 output from Bill >> Pearson's FASTA tools, so in theory we (Biopython) should >> be able to parse the markx10 output from EMBOSS needle >> and water. We could probably cope with the extra header >> and footer, but I think it would be best if EMBOSS could >> produce something more closely matching the real FASTA >> output. Unfortunately, it appears to be more than just the >> headers which upset our parser - even ignoring them, >> EMBOSS markx10 output still looks rather different to >> (current) FASTA -m 10 output. Was the markx10 output >> mimicking a particular (old) version of the FASTA tools? > > The source code documentation refers to FASTA 3.4 which > may be the last time I took a detailed look at the FASTA > alignment outputs. That might explain it - I've been using FASTA 3.5. > Can you send us some example files so we can check for > the significant differences? Sure. 
There are half a dozen FASTA -m 10 output files here: http://biopython.open-bio.org/SRC/biopython/Tests/Fasta/ > We plan to install all the bio* projects so it would be helpful > to have a set of biopython parser scripts we can use to test > locally. We can add them to our routine QA tests and flag up > changes as soon as they appear. If you have (the latest) Biopython installed, and periodically run the unit tests (in particular, test_Emboss.py), that would be a good start. Right now I know that unit test works with EMBOSS 4.0.0 and 6.0.1 (which happens to be on two of the machines I use for testing), and mostly works with EMBOSS 6.1.0 (everything except the GenBank regression you were just looking into today). I'm considering extending test_Emboss.py in the future to take advantage of the new features in EMBOSS 6.1.0 onwards such as GFF and FASTQ support, or perhaps having a second test script (which will be conditional on the version of EMBOSS installed). >> Peter R. did say it would be simple to turn off this header and >> footer output, so I thought I would try this myself. It looks like >> this is handled in file ajax/ajalign.c by function alignWriteMark, >> but I don't see a switch to disable the headers and footers. > > You correctly found how to turn off the header. The footer is > reported for anything except pure sequence output. > > For the next release I will add attributes to the list of alignment > formats to say whether the header and footer are needed. That > will allow us better control and reporting. > > Meanwhile, we are very happy to standardise the markx* outputs > to make them easier to parse. Biopython is the first project to > report problems with this. There are alternatives - specifying > -aformat and using some other alignment format for all > applications - but we like to conform and will do our best to fir > what parsers expect. > > Also, of course, once we know we are being parsed we will do > our best not to let the output change. 
This isn't really a problem. Biopython can read EMBOSS's own alignment formats (pairs and simple), so there is little need for us to be able to parse EMBOSS's version of the FASTA output. [Although at the moment we ignore all the header information, if that formatting will be consistent, we could parse it too.] However, at least one person wanted to parse EMBOSS markx10 output strongly enough that he wrote a modified version of our FASTA -m 10 parser. I would rather however have EMBOSS revise its output to better match FASTA. See http://bugzilla.open-bio.org/show_bug.cgi?id=2704 Peter C. From biopython at maubp.freeserve.co.uk Tue Jul 21 09:18:01 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Jul 2009 14:18:01 +0100 Subject: [emboss-dev] Regression in GenBank/GenPept parsing? In-Reply-To: <4A659B27.4010902@ebi.ac.uk> References: <320fb6e00907200956n7af9912aie833e974d4fd52a5@mail.gmail.com> <4A6579BC.5080008@ebi.ac.uk> <320fb6e00907210207u2a2f03b5x6a69a8bcbb93bf14@mail.gmail.com> <4A658E4E.4090008@ebi.ac.uk> <320fb6e00907210255u59518deau1487ab564ab014d@mail.gmail.com> <4A6591F6.20107@ebi.ac.uk> <320fb6e00907210321g46655a1h6f797009f9334cf@mail.gmail.com> <4A659B27.4010902@ebi.ac.uk> Message-ID: <320fb6e00907210618x7e258c3ekd9da44f5e82cea42@mail.gmail.com> On Tue, Jul 21, 2009 at 11:40 AM, Peter Rice wrote: > > Peter C. wrote: >>> My next task (once I've made sure your bugs are fixed) is to >>> regenerate all the tables of formats. >> >> Great. This may save you having to answer my next question, >> which was could you expand on what EMBOSS considers to be >> the differences between "genbank", "genpept" and "refseqp" as >> file formats? Of course, I may come up with further questions ;) > > Oh, further questions please! We love answering them. > > GenPept format expects to find 9 fields on the LOCUS line. > RefseqP format expects only 8. > > The difference is GenPept format including the original GenPept locus name. Which 8 or 9 fields? 
> We may try to merge them one day. If we do, we would keep the format > names but use one parser. That makes sense. > Your Genpept (refseqp) format problem will be fixed in a patch. It was > fine for one sequence but needed to rebuffer the input file to work with > multiple input sequences. Grand. Will there be an EMBOSS 6.1.1 in a week or so then (addressing this, the FASTQ @ problem, and any other minor issues)? > Meanwhile, could you tar up the biopython test data and scripts > http://biopython.open-bio.org/SRC/biopython/Tests/ and I will try > running the same data through EMBOSS to see what issues we > can find. http://biopython.open-bio.org/SRC/biopython/ is just a dump from our repository (hourly or something). If you just download the latest Biopython source code, this will have all the unit test files etc: http://biopython.org/DIST/biopython-1.51b.tar.gz You could also grab the latest code from CVS or github - further details on request. Ask if you need clarification on what any of the test data files are for. In some cases searching the Tests/test_*.py files may have informative comments. Peter C. From biopython at maubp.freeserve.co.uk Tue Jul 21 09:21:46 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Jul 2009 14:21:46 +0100 Subject: [emboss-dev] Regression in GenBank/GenPept parsing? In-Reply-To: <320fb6e00907210618x7e258c3ekd9da44f5e82cea42@mail.gmail.com> References: <320fb6e00907200956n7af9912aie833e974d4fd52a5@mail.gmail.com> <4A6579BC.5080008@ebi.ac.uk> <320fb6e00907210207u2a2f03b5x6a69a8bcbb93bf14@mail.gmail.com> <4A658E4E.4090008@ebi.ac.uk> <320fb6e00907210255u59518deau1487ab564ab014d@mail.gmail.com> <4A6591F6.20107@ebi.ac.uk> <320fb6e00907210321g46655a1h6f797009f9334cf@mail.gmail.com> <4A659B27.4010902@ebi.ac.uk> <320fb6e00907210618x7e258c3ekd9da44f5e82cea42@mail.gmail.com> Message-ID: <320fb6e00907210621s1952ad5fm3f62549f7376d292@mail.gmail.com> Peter C. wrote: > Peter Rice wrote: >> >> Peter C. wrote: >>> >>> Great. 
This may save you having to answer my next question, >>> which was could you expand on what EMBOSS considers to be >>> the differences between "genbank", "genpept" and "refseqp" as >>> file formats? Of course, I may come up with further questions ;) >> >> Oh, further questions please! We love answering them. >> >> GenPept format expects to find 9 fields on the LOCUS line. >> RefseqP format expects only 8. >> >> The difference is GenPept format including the original GenPept locus name. > > Which 8 or 9 fields? Oh, and a related question: Can I adjust the GenPept file in question (emailed to Peter Rice off list) to get rid of the warning from EMBOSS 6.0.1 about the bad LOCUS line? If there is something wrong with the GenBank/GenPept LOCUS lines Biopython writes, I'd like to fix it before our next release. Peter C. From pmr at ebi.ac.uk Tue Jul 21 09:30:17 2009 From: pmr at ebi.ac.uk (Peter Rice) Date: Tue, 21 Jul 2009 14:30:17 +0100 Subject: [emboss-dev] Regression in GenBank/GenPept parsing? In-Reply-To: <320fb6e00907210618x7e258c3ekd9da44f5e82cea42@mail.gmail.com> References: <320fb6e00907200956n7af9912aie833e974d4fd52a5@mail.gmail.com> <4A6579BC.5080008@ebi.ac.uk> <320fb6e00907210207u2a2f03b5x6a69a8bcbb93bf14@mail.gmail.com> <4A658E4E.4090008@ebi.ac.uk> <320fb6e00907210255u59518deau1487ab564ab014d@mail.gmail.com> <4A6591F6.20107@ebi.ac.uk> <320fb6e00907210321g46655a1h6f797009f9334cf@mail.gmail.com> <4A659B27.4010902@ebi.ac.uk> <320fb6e00907210618x7e258c3ekd9da44f5e82cea42@mail.gmail.com> Message-ID: <4A65C2E9.1000203@ebi.ac.uk> Peter wrote: > On Tue, Jul 21, 2009 at 11:40 AM, Peter Rice wrote: >> GenPept format expects to find 9 fields on the LOCUS line. >> RefseqP format expects only 8. >> >> The difference is GenPept format including the original GenPept locus name. > > Which 8 or 9 fields? 
'LOCUS'
identifier
Genbank-locus-name      (GenPept format only)
seqlen                  (numeric)
'aa'
molecule-type           (controlled vocabulary - we ignore the protein ones for now)
'circular' or 'linear'
division                (expecting 'UNC' for unclassified)
date                    (last modified date)

> Grand. Will there be an EMBOSS 6.1.1 in a week or so then (addressing > this, the FASTQ @ problem, and any other minor issues)? There will be a patch file in the ftp://emboss.open-bio.org/pub/EMBOSS/patches/ directory

For those (like me) who prefer to manually update there will also be replacement file(s) in the fixes directory. > http://biopython.open-bio.org/SRC/biopython/ is just a dump from > our repository (hourly or something). If you just download the latest > Biopython source code, this will have all the unit test files etc: > http://biopython.org/DIST/biopython-1.51b.tar.gz Super, thanks. > Ask if you need clarification on what any of the test data files are > for. In some cases searching the Tests/test_*.py files may have > informative comments. Thanks. The plan is to include them in the EMBOSS QA tests so I will take a look at the inputs and what you check for in the outputs. At first glance it looks straightforward. regards, Peter From pmr at ebi.ac.uk Tue Jul 21 09:35:26 2009 From: pmr at ebi.ac.uk (Peter Rice) Date: Tue, 21 Jul 2009 14:35:26 +0100 Subject: [emboss-dev] Regression in GenBank/GenPept parsing?
In-Reply-To: <320fb6e00907210621s1952ad5fm3f62549f7376d292@mail.gmail.com> References: <320fb6e00907200956n7af9912aie833e974d4fd52a5@mail.gmail.com> <4A6579BC.5080008@ebi.ac.uk> <320fb6e00907210207u2a2f03b5x6a69a8bcbb93bf14@mail.gmail.com> <4A658E4E.4090008@ebi.ac.uk> <320fb6e00907210255u59518deau1487ab564ab014d@mail.gmail.com> <4A6591F6.20107@ebi.ac.uk> <320fb6e00907210321g46655a1h6f797009f9334cf@mail.gmail.com> <4A659B27.4010902@ebi.ac.uk> <320fb6e00907210618x7e258c3ekd9da44f5e82cea42@mail.gmail.com> <320fb6e00907210621s1952ad5fm3f62549f7376d292@mail.gmail.com> Message-ID: <4A65C41E.20505@ebi.ac.uk> Peter C. wrote: > Oh, and a related question: Can I adjust the GenPept file in question > (emailed to Peter Rice off list) to get rid of the warning from > EMBOSS 6.0.1 about the bad LOCUS line? If there is something > wrong with the GenBank/GenPept LOCUS lines Biopython writes, > I'd like to fix it before our next release. For EMBOSS 6.1.0 it should use -sformat refseqp (but will run without warning after the patch). For 6.0.1 all you can do is lie and change aa to bp. We added the protein formats refseqp and genpept in release 6.1.0. Previous releases warn about the 'aa' tag and continue. You could run with -nowarning on the command line but we don't recommend it :-) regards, Peter Rice From biopython at maubp.freeserve.co.uk Tue Jul 21 13:10:17 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Jul 2009 18:10:17 +0100 Subject: [emboss-dev] EMBOSS seqret FASTQ support In-Reply-To: <4A6571AC.5090801@ebi.ac.uk> References: <320fb6e00907201446x215fa100i3fc09dadb110216b@mail.gmail.com> <320fb6e00907201512r47d26224p64d42dc5ebefb11e@mail.gmail.com> <4A6571AC.5090801@ebi.ac.uk> Message-ID: <320fb6e00907211010v43e0a579je8e6d43112c6ba20@mail.gmail.com> On Tue, Jul 21, 2009 at 8:43 AM, Peter Rice wrote: > > Peter C. wrote: > >> Could anyone spot a "but" coming up? >> ... >> I would guess the problem is that quality line starts with a @, > > Urghh ... 
I left an extra '@' test in even though I meant to take it out > before the release. > > I will make a patch for this ... have to look into a couple of your other > queries at the same time as they are in the same source file. > > Thanks I've got another issue for you, which I think is an rounding problem converting negative Solexa scores into ASCII (which sounds a bit strange), or assuming you store everything as PHRED scores in memory, this could be in how you round negative Solexa scores on conversion back to ASCII. This can be neatly demonstrated with the following artificial FASTQ file which uses the Solexa encoding covering scores 40 to -5 inclusive (which I understand to be the typical range likely to come off an actual Solexa/Illumina machine): $ more solexa_faked.fastq @slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN +slxa_0001_1_0001_01 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<; $ seqret -sequence solexa_faked.fastq -sformat fastq-solexa -osformat fastq-solexa -stdout -auto @slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN +slxa_0001_1_0001_01 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@@?>=< $ embossversion Reports the current EMBOSS version number 6.1.0 As I hope is clear, EMBOSS seqret has inflated the last five scores by one. The original Solexa scores were: 40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5 After putting this file through seqret, they become: 40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 0, -1, -2, -3, -4 Peter C. From biopython at maubp.freeserve.co.uk Tue Jul 21 13:19:58 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Jul 2009 18:19:58 +0100 Subject: [emboss-dev] Regression in GenBank/GenPept parsing? 
In-Reply-To: <4A65C2E9.1000203@ebi.ac.uk> References: <320fb6e00907200956n7af9912aie833e974d4fd52a5@mail.gmail.com> <4A6579BC.5080008@ebi.ac.uk> <320fb6e00907210207u2a2f03b5x6a69a8bcbb93bf14@mail.gmail.com> <4A658E4E.4090008@ebi.ac.uk> <320fb6e00907210255u59518deau1487ab564ab014d@mail.gmail.com> <4A6591F6.20107@ebi.ac.uk> <320fb6e00907210321g46655a1h6f797009f9334cf@mail.gmail.com> <4A659B27.4010902@ebi.ac.uk> <320fb6e00907210618x7e258c3ekd9da44f5e82cea42@mail.gmail.com> <4A65C2E9.1000203@ebi.ac.uk> Message-ID: <320fb6e00907211019r447fca87i87c9143223c6cf8e@mail.gmail.com> On Tue, Jul 21, 2009 at 2:30 PM, Peter Rice wrote: > > Peter wrote: >> On Tue, Jul 21, 2009 at 11:40 AM, Peter Rice wrote: >>> GenPept format expects to find 9 fields on the LOCUS line. >>> RefseqP format expects only 8. >>> >>> The difference is GenPept format including the original GenPept locus name. >> >> Which 8 or 9 fields?
>
> 'LOCUS'
> identifier
> Genbank-locus-name      (GenPept format only)
> seqlen                  (numeric)
> 'aa'
> molecule-type           (controlled vocabulary - we ignore the protein ones for now)
> 'circular' or 'linear'
> division                (expecting 'UNC' for unclassified)
> date                    (last modified date)

Do you have some publicly available examples of these? And if so, are you happy for them to be included within Biopython for unit tests? >> Grand. Will there be an EMBOSS 6.1.1 in a week or so then (addressing >> this, the FASTQ @ problem, and any other minor issues)? > > There will be a patch file in the > ftp://emboss.open-bio.org/pub/EMBOSS/patches/ directory > > For those (like me) who prefer to manually update there will also be > replacement file(s) in the fixes directory. Would there eventually be an EMBOSS 6.1.1 release for the less technical users who won't want to mess about with patches or replacing single files? I hope we don't have to wait 40 days! ;) [This is a joke referencing St Swithin's day and associated legends] Peter C.
From biopython at maubp.freeserve.co.uk Wed Jul 22 07:56:23 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 22 Jul 2009 12:56:23 +0100 Subject: [emboss-dev] Line wrapping in FASTQ output Message-ID: <320fb6e00907220456s7aadff64v9c65a49dc0bc5ea4@mail.gmail.com> Hi Peter R. et al, Up until now I had mostly been trying EMBOSS 6.1.0 with short read data. I've just noticed for longer reads EMBOSS wraps the sequences and qualities lines in FASTQ output (at 60 characters). There is an example of this at the end of the email. My understanding is that while line breaks are allowed in the sequences and qualities lines of a FASTQ file, they are discouraged as it can break simple minded parsers. Unfortunately right now I can't find any references/websites to back up this assertion (other than things I wrote myself since), but I was sure I read this on the MAQ site somewhere. Several sites do simply talk about "the" sequence line and "the" quality line (indeed the early drafts of the wikipedia page had this assumption, which I fixed). This is natural if all you have ever worked with is short read data. Of course, 454 reads are hundreds of bases long, and even the latest Illumina reads now are in the range 70 to 100 bp (or so I hear), so this issue will become more common - so any existing parsers that can't cope with line breaks will soon get broken, and hopefully fixed. For Biopython we should be able cope with any strange line breaks in the sequences and qualities lines on input, but for output don't do any line wrapping. I felt this would result in more widely parseable output. I wondered what your thought process was, and if you think it is worth removing the line wrapping on EMBOSS's FASTQ output (or indeed, if you have a good argument to convince me to make Biopython output FASTQ with line wrapping by default). [I nearly CC'd BioPerl-l with this. 
In fact, this topic strikes me as ideal for an OBF cross project mailing list, something we talked about at BOSC/ISMB 2009. Am I right in thinking you (Peter Rice) were going to look into this?] Regards, Peter C. (at Biopython) e.g. $ embossversion Reports the current EMBOSS version number 6.1.0 $ more sanger_93.fastq @Test PHRED qualities from 93 to 0 inclusive ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGAN + ~}|{zyxwvutsrqponmlkjihgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;:9876543210/.-,+*)('&%$#"! It is likely that email software will mangle the line breaks, but in my example file sanger_93.fastq the sequence and the quality are single line strings (of length 94). Now let's let EMBOSS seqret read this in and write it out again: $ seqret -filter -seq sanger_93.fastq -sformat fastq-sanger -osformat fastq-sanger @Test PHRED qualities from 93 to 0 inclusive ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTG ACTGACTGACTGACTGACTGACTGACTGACTGAN +Test ~}|{zyxwvutsrqponmlkjihgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDC BA@?>=<;:9876543210/.-,+*)('&%$#"! The new lines are real and not just from the email formatting - you can check this by piping the output though hexdump. It appears EMBOSS is using 60 character line wrapping. Peter C. From pmr at ebi.ac.uk Thu Jul 23 04:08:51 2009 From: pmr at ebi.ac.uk (Peter Rice) Date: Thu, 23 Jul 2009 09:08:51 +0100 Subject: [emboss-dev] Line wrapping in FASTQ output In-Reply-To: <320fb6e00907220456s7aadff64v9c65a49dc0bc5ea4@mail.gmail.com> References: <320fb6e00907220456s7aadff64v9c65a49dc0bc5ea4@mail.gmail.com> Message-ID: <4A681A93.9030303@ebi.ac.uk> Peter C. wrote: > Hi Peter R. et al, > > For Biopython we should be able cope with any strange line breaks in > the sequences and qualities lines on input, but for output don't do > any line wrapping. I felt this would result in more widely parseable > output. 
I wondered what your thought process was, and if you think it > is worth removing the line wrapping on EMBOSS's FASTQ output (or > indeed, if you have a good argument to convince me to make Biopython > output FASTQ with line wrapping by default). There is also an issue with making the lines so long that brain-damaged parsers (those that read a line in C and fail to check it was a complete line) will fail. Leaving the line breaks in was deliberate in EMBOSS 6.1.0 to see whether any parsers would object. The obvious compromise is to increase the default line length in EMBOSS to say 500 so that anyone reading up to 512 characters will still be safe. Unfortunately some folk will then assume there will never be a line break. Alternatively, we could truly make everything fit on one line. Or we could double up the fastq outputs with and without line breaks (horrible problems with naming the output formats) I suspect this one-line thing is a simple attempt to avoid the "quality line starting with '@' or '+'" issue. > [I nearly CC'd BioPerl-l with this. In fact, this topic strikes me as > ideal for an OBF cross project mailing list, something we talked about > at BOSC/ISMB 2009. Am I right in thinking you (Peter Rice) were going > to look into this?] Yes indeed I was. Waylaid by the demands of the 6.1.0 EMBOSS release but I will get back on to it. regards, Peter From biopython at maubp.freeserve.co.uk Thu Jul 23 05:14:52 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Jul 2009 10:14:52 +0100 Subject: [emboss-dev] [Biopython-dev] Line wrapping in FASTQ output In-Reply-To: <4A681A93.9030303@ebi.ac.uk> References: <320fb6e00907220456s7aadff64v9c65a49dc0bc5ea4@mail.gmail.com> <4A681A93.9030303@ebi.ac.uk> Message-ID: <320fb6e00907230214l6df7ff76j643e8ddc1f600054@mail.gmail.com> On Thu, Jul 23, 2009 at 9:08 AM, Peter Rice wrote: > Peter C. wrote: >> >> Hi Peter R.
et al, >> >> For Biopython we should be able to cope with any strange line breaks >> in the sequences and qualities lines on input, but for output don't do >> any line wrapping. I felt this would result in more widely parseable >> output. I wondered what your thought process was, and if you think >> it is worth removing the line wrapping on EMBOSS's FASTQ output >> (or indeed, if you have a good argument to convince me to make >> Biopython output FASTQ with line wrapping by default). > > There is also an issue with making the lines so long that brain-damaged > parsers (those that read a line in C and fail to check it was a complete > line) will fail. You mean a C parser with a finite string buffer (say 100 characters) which reads things line by line. Yes, that would be a bit brain dead too. I guess either way could break some parsers out there ;) > Leaving the line breaks in was deliberate in EMBOSS 6.1.0 to see > whether any parsers would object. I see - well I'm not objecting, and neither is the Biopython parser. > The obvious compromise is to increase the default line length in > EMBOSS to say 500 so that anyone reading up to 512 characters > will still be safe. Unfortunately some folk will then assume there will > never be a line break. That seems like a bad idea - especially as Roche 454 reads are in the region of 500+ bp, meaning some would wrap and some wouldn't. Even using a longer wrap like 1000 would probably just postpone the issue. If you are going to wrap, something short like 60 seems more sensible (often used in FASTA files too) given the historical 80 character width of a terminal window. People using early Solexa/Illumina machines will only see a single line, but as their read lengths are already in the range 70 to 100bp, I wonder what the latest Illumina pipelines output (wrt wrapping)? > Alternatively, we could truly make everything fit on one line. That's what Biopython currently does.
But you are right - I hadn't considered brain dead parsers using fixed
buffers.

> Or we could double up the fastq outputs with and without line breaks
> (horrible problems with naming the output formats)

I don't like that plan. For Biopython we could have a wrapping setting
available for people who really need to specify this (as we do for FASTA
already), with a sensible default value.

> I suspect this one-line thing is a simple attempt to avoid the "quality line
> starting with '@' or '+'" issue.

Could be. I think the fact that @ and + are valid entries in the quality
string is the second most annoying thing about the FASTQ format (after
the lack of a clear format definition from Sanger, and the resulting
variants from Solexa/Illumina etc).

>> [I nearly CC'd BioPerl-l with this. In fact, this topic strikes me as
>> ideal for an OBF cross project mailing list, something we talked
>> about at BOSC/ISMB 2009. Am I right in thinking you (Peter Rice)
>> were going to look into this?]
>
> Yes indeed I was. Waylaid by the demands of the 6.1.0 EMBOSS release
> but I will get back on to it.

Thanks!

> regards,
>
> Peter

Cheers,

Peter C.

From pmr at ebi.ac.uk Thu Jul 23 12:24:01 2009
From: pmr at ebi.ac.uk (Peter Rice)
Date: Thu, 23 Jul 2009 17:24:01 +0100
Subject: [emboss-dev] EMBOSS seqret FASTQ support
In-Reply-To: <320fb6e00907211010v43e0a579je8e6d43112c6ba20@mail.gmail.com>
References: <320fb6e00907201446x215fa100i3fc09dadb110216b@mail.gmail.com> <320fb6e00907201512r47d26224p64d42dc5ebefb11e@mail.gmail.com> <4A6571AC.5090801@ebi.ac.uk> <320fb6e00907211010v43e0a579je8e6d43112c6ba20@mail.gmail.com>
Message-ID: <4A688EA1.9060005@ebi.ac.uk>

Peter wrote:
> I've got another issue for you, which I think is a rounding problem
> converting negative Solexa scores into ASCII (which sounds a bit
> strange), or assuming you store everything as PHRED scores in
> memory, this could be in how you round negative Solexa scores
> on conversion back to ASCII.
Yup, it's the rounding on output. It was adding 0.5 and going to the
nearest integer.

For negative values of course it has to subtract 0.5 to get the correct
rounding.

regards,

Peter

From biopython at maubp.freeserve.co.uk Fri Jul 24 05:59:52 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 24 Jul 2009 10:59:52 +0100
Subject: [emboss-dev] EMBOSS seqret FASTQ support
In-Reply-To: <4A688EA1.9060005@ebi.ac.uk>
References: <320fb6e00907201446x215fa100i3fc09dadb110216b@mail.gmail.com> <320fb6e00907201512r47d26224p64d42dc5ebefb11e@mail.gmail.com> <4A6571AC.5090801@ebi.ac.uk> <320fb6e00907211010v43e0a579je8e6d43112c6ba20@mail.gmail.com> <4A688EA1.9060005@ebi.ac.uk>
Message-ID: <320fb6e00907240259v6f57d858nc640afa0b18f2727@mail.gmail.com>

On Thu, Jul 23, 2009 at 5:24 PM, Peter Rice wrote:
>
> Peter C. wrote:
>>
>> I've got another issue for you, which I think is a rounding problem
>> converting negative Solexa scores into ASCII (which sounds a bit
>> strange), or assuming you store everything as PHRED scores in
>> memory, this could be in how you round negative Solexa scores
>> on conversion back to ASCII.
>
> Yup, it's the rounding on output. It was adding 0.5 and going to the
> nearest integer.
>
> For negative values of course it has to subtract 0.5 to get the correct
> rounding.

C can be fun like that - nearest integer versus truncation to lowest
integer.

I'd like to re-test with your fixes. I presume these things are being
fixed in the public CVS repository, so I could try building EMBOSS from
there. Is there a particular branch? Or are you planning an EMBOSS 6.1.1
release shortly?
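To make the rounding point concrete, here is a minimal sketch in plain
Python (not the EMBOSS C source; both function names are made up for
illustration). Python's int() truncates towards zero just as a C cast to
int does, so "add 0.5 and truncate" only gives the nearest integer for
non-negative values:

```python
def round_add_half(x):
    """Hypothetical helper: add 0.5 then truncate, like a C cast to int."""
    return int(x + 0.5)


def round_nearest(x):
    """Hypothetical helper: add 0.5 for non-negative values, subtract 0.5 otherwise."""
    if x >= 0:
        return int(x + 0.5)
    return int(x - 0.5)


# Compare the two approaches on a few positive and negative values.
for value in (4.6, 0.4, -0.4, -4.6):
    print(value, round_add_half(value), round_nearest(value))
```

The two functions agree for the positive values, but for -4.6 the first
returns -4 while the second returns -5.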
Thanks,

Peter

From pmr at ebi.ac.uk Fri Jul 24 06:14:23 2009
From: pmr at ebi.ac.uk (Peter Rice)
Date: Fri, 24 Jul 2009 11:14:23 +0100
Subject: [emboss-dev] EMBOSS seqret FASTQ support
In-Reply-To: <320fb6e00907240259v6f57d858nc640afa0b18f2727@mail.gmail.com>
References: <320fb6e00907201446x215fa100i3fc09dadb110216b@mail.gmail.com> <320fb6e00907201512r47d26224p64d42dc5ebefb11e@mail.gmail.com> <4A6571AC.5090801@ebi.ac.uk> <320fb6e00907211010v43e0a579je8e6d43112c6ba20@mail.gmail.com> <4A688EA1.9060005@ebi.ac.uk> <320fb6e00907240259v6f57d858nc640afa0b18f2727@mail.gmail.com>
Message-ID: <4A69897F.9010301@ebi.ac.uk>

Peter C. wrote:
> I'd like to re-test with your fixes. I presume these things are being fixed
> in the public CVS repository, so I could try building EMBOSS from there.
> Is there a particular branch? Or are you planning an EMBOSS 6.1.1
> release shortly?

You found various things in sequence formats, but all are resolved by
changes to ajseqread.c and ajseqwrite.c

Assuming I am happy with the test I plan to make a patch which will
update those files.

The CVS code would have new things for the next release. For now, if you
are using the 6.1.0 release, patching is the way to go.

Fixes so far:

FASTQ format changes:

* sequence and quality scores on one line
* quality ID line shortened to '+'
* Solexa negative quality score output corrected
* Phred quality score rounding error fixed
* Corrected reading of quality lines starting with '@'

GenBank format changes:

* protein (genpept and refseqp) formats auto-detect fix for multiple
  input sequences

Intelligenetics format:

* Sequence ID corrected for DOS format input file

Did I miss anything?
regards,

Peter Rice

From biopython at maubp.freeserve.co.uk Fri Jul 24 06:32:50 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 24 Jul 2009 11:32:50 +0100
Subject: [emboss-dev] EMBOSS seqret FASTQ support
In-Reply-To: <320fb6e00907240259v6f57d858nc640afa0b18f2727@mail.gmail.com>
References: <320fb6e00907201446x215fa100i3fc09dadb110216b@mail.gmail.com> <320fb6e00907201512r47d26224p64d42dc5ebefb11e@mail.gmail.com> <4A6571AC.5090801@ebi.ac.uk> <320fb6e00907211010v43e0a579je8e6d43112c6ba20@mail.gmail.com> <4A688EA1.9060005@ebi.ac.uk> <320fb6e00907240259v6f57d858nc640afa0b18f2727@mail.gmail.com>
Message-ID: <320fb6e00907240332g21a01511h1a7ddaf677cad962@mail.gmail.com>

Hi again Peter,

I have another query regarding how EMBOSS treats "fastq" as a format
name.

From our earlier discussions I was expecting "fastq" to be an EMBOSS
input *only* format where you would ignore the qualities. This would
allow tasks like FASTQ to FASTA without having to worry if the scores
were encoded following the Sanger standard, the original Solexa scheme,
or the Illumina 1.3+ encoding.

When I found EMBOSS offered "fastq" as an output format, I initially
thought it might produce files with dummy quality values (even if the
input file had qualities). This puzzled me, as I couldn't see a use for
this, but in fact this isn't the case. Instead, "fastq" as an output
format seems to act like the "fastq-sanger" format.

I notice you use dummy values for the quality if they are unknown,
specifically a PHRED quality of one (meaning about random, a sensible
default in some cases). e.g.
$ more example.fasta >EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC >EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA >EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG Converting "fasta" (with no qualities) to "fastq-sanger", seqret assigns a PHRED quality of 1 (the double quote, ASCII 34): $ seqret -sequence example.fasta -sformat fasta -osformat fastq-sanger -stdout -auto @EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC +EAS54_6_R1_2_1_413_324 """"""""""""""""""""""""" @EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA +EAS54_6_R1_2_1_540_792 """"""""""""""""""""""""" @EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG +EAS54_6_R1_2_1_443_348 """"""""""""""""""""""""" Converting "fasta" (with no qualities) to "fastq" seems to act just like conversion to "fastq-sanger": $ seqret -sequence example.fasta -sformat fasta -osformat fastq -stdout -auto @EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC +EAS54_6_R1_2_1_413_324 """"""""""""""""""""""""" @EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA +EAS54_6_R1_2_1_540_792 """"""""""""""""""""""""" @EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG +EAS54_6_R1_2_1_443_348 """"""""""""""""""""""""" As an aside, FASTA to Illumina FASTQ also uses PHRED quality one (ASCII 64+1 = 65 is the letter A): $ seqret -sequence example.fasta -sformat fasta -osformat fastq-illumina -stdout -auto @EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC +EAS54_6_R1_2_1_413_324 AAAAAAAAAAAAAAAAAAAAAAAAA @EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA +EAS54_6_R1_2_1_540_792 AAAAAAAAAAAAAAAAAAAAAAAAA @EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG +EAS54_6_R1_2_1_443_348 AAAAAAAAAAAAAAAAAAAAAAAAA (Due to the rounding issue I have not included a FASTA to Solexa FASTQ example) Have I understood correctly? i.e. 
in EMBOSS 6.1.0: "fastq" on input - ignores quality strings "fastq" on output - acts like "fastq-sanger" "fastq-sanger" - PHRED scores offset 31 "fastq-solexa" - Solexa scores offset 64 "fastq-illumina" - PHRED scores offset 64 If this is correct, the "fastq" format behaviour strikes me as very odd. I would have either made "fastq" and "fastq-solexa" the same, or made "fastq" an input only format. Consider this very surprising behaviour that this results in... $ more example.fastq @EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC + ;;3;;;;;;;;;;;;7;;;;;;;88 @EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA + ;;;;;;;;;;;7;;;;;-;;;3;83 @EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG + ;;;;;;;;;;;9;7;;.7;393333 You might want to use seqret to "clean up" a FASTQ file, for example to standardize the line wrapping and the captions. As this example is a Sanger style FASTQ file, this works: seqret -sequence example.fastq -sformat fastq-sanger -osformat fastq-sanger -stdout -auto @EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC +EAS54_6_R1_2_1_413_324 ;;3;;;;;;;;;;;;7;;;;;;;88 @EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA +EAS54_6_R1_2_1_540_792 ;;;;;;;;;;;7;;;;;-;;;3;83 @EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG +EAS54_6_R1_2_1_443_348 ;;;;;;;;;;;9;7;;.7;393333 Notice EMBOSS has filled in the (optional) repeated caption on the plus lines (and would have wrapped long reads). However, consider the more natural thing to type: $ seqret -sequence example.fastq -sformat fastq -osformat fastq -stdout -auto @EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC +EAS54_6_R1_2_1_413_324 """"""""""""""""""""""""" @EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA +EAS54_6_R1_2_1_540_792 """"""""""""""""""""""""" @EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG +EAS54_6_R1_2_1_443_348 """"""""""""""""""""""""" I was shocked to find using EMBOSS to convert "FASTQ to FASTQ" like this threw away the quality scores - and I'm sure other people will also make this mistake. Peter C. 
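The offsets can be checked by hand with a short sketch in plain Python
(not EMBOSS or Biopython code; the OFFSETS table and the decode function
are names made up here). It uses 33 for "fastq-sanger", since ASCII 33 is
"!" and that is consistent with the double quote (ASCII 34) standing for
PHRED 1 in the seqret examples above:

```python
# ASCII offsets per format name; 33 is "!" and 64 is "@".
OFFSETS = {
    "fastq-sanger": 33,    # PHRED scores
    "fastq-solexa": 64,    # Solexa scores (not PHRED)
    "fastq-illumina": 64,  # PHRED scores
}


def decode(quality, fmt):
    """Hypothetical helper: turn a quality string into integer scores."""
    offset = OFFSETS[fmt]
    return [ord(char) - offset for char in quality]


# First record of example.fastq, read as Sanger FASTQ.
print(decode(";;3;;;;;;;;;;;;7;;;;;;;88", "fastq-sanger"))
# The dummy qualities seqret writes for FASTA input.
print(decode('"""', "fastq-sanger"))
print(decode("AAA", "fastq-illumina"))
```

Both dummy strings decode to a score of one, and ";" in the Sanger file
decodes to 26.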
From biopython at maubp.freeserve.co.uk Fri Jul 24 06:45:07 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Jul 2009 11:45:07 +0100 Subject: [emboss-dev] EMBOSS seqret FASTQ support In-Reply-To: <320fb6e00907240332g21a01511h1a7ddaf677cad962@mail.gmail.com> References: <320fb6e00907201446x215fa100i3fc09dadb110216b@mail.gmail.com> <320fb6e00907201512r47d26224p64d42dc5ebefb11e@mail.gmail.com> <4A6571AC.5090801@ebi.ac.uk> <320fb6e00907211010v43e0a579je8e6d43112c6ba20@mail.gmail.com> <4A688EA1.9060005@ebi.ac.uk> <320fb6e00907240259v6f57d858nc640afa0b18f2727@mail.gmail.com> <320fb6e00907240332g21a01511h1a7ddaf677cad962@mail.gmail.com> Message-ID: <320fb6e00907240345u158715cbg33c8b71741d7588b@mail.gmail.com> On Fri, Jul 24, 2009 at 11:32 AM, Peter wrote: > > Have I understood correctly? i.e. in EMBOSS 6.1.0: > > "fastq" on input - ignores quality strings > "fastq" on output - acts like "fastq-sanger" > "fastq-sanger" - PHRED scores offset 31 [* TYPO - should be offset 33 *] > "fastq-solexa" - Solexa scores offset 64 > "fastq-illumina" - PHRED scores offset 64 > Correction (just for the record) - the Sanger FASTQ files use an offset of 33 (ASCII for "!"). The number 31 is important as the difference between this and the Illumina ASCII offset. Peter C. From biopython at maubp.freeserve.co.uk Fri Jul 24 06:48:04 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Jul 2009 11:48:04 +0100 Subject: [emboss-dev] EMBOSS format name "fastq-sanger" in Biopython? In-Reply-To: <320fb6e00907201057p50d2c4dar9ecc0a2c8a7cf9a5@mail.gmail.com> References: <320fb6e00907201057p50d2c4dar9ecc0a2c8a7cf9a5@mail.gmail.com> Message-ID: <320fb6e00907240348h6a3bc043kc17fbbd62daf8a06@mail.gmail.com> On Mon, Jul 20, 2009 at 6:57 PM, Peter wrote: > Hi all at Biopython (and EMBOSS-dev CC'd), > > Now that EMBOSS 6.1.0 is out I've started checking it against Biopython. 
> As I mentioned on the Biopython mailing list a week ago, in particular I'd > like to make sure we agree on the various FASTQ variants. I'm waiting > for EMBOSS to update the documentation on their website, but as I > recall from talking to Peter Rice at BOSC/ISMB 2009 and a quick test > this afternoon, they are using: > > fastq - FASTQ where the qualities are ignored (useful for input?) > fastq-sanger - Standard Sanger style FASTQ using PHRED offset 33 > fastq-solexa - Early Solexa/Illumina FASTQ, Solexa scores offset 64 > fastq-illumina - Illumina 1.3+ FASTQ using PHRED offset 64 > > I was expecting "fastq" to be an EMBOSS input only format given > how I had understood this to be interpreted (ignore the qualities). > ... I was however surprised that using "fastq" as an output format > in EMBOSS seqret gives quality strings of double quote characters. To be more precise, it looks like "fastq" as an output format in EMBOSS is an alias for "fastq-sanger" (to be confirmed), see: http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000599.html In any case, it would still make sense to include "fastq-sanger" as an alias for the Sanger standard FASTQ files in Biopython's SeqIO, especially if BioPerl is also going to use that name (to be confirmed): http://lists.open-bio.org/pipermail/bioperl-l/2009-July/030688.html Peter From pmr at ebi.ac.uk Fri Jul 24 07:20:13 2009 From: pmr at ebi.ac.uk (Peter Rice) Date: Fri, 24 Jul 2009 12:20:13 +0100 Subject: [emboss-dev] EMBOSS seqret FASTQ support In-Reply-To: <320fb6e00907240332g21a01511h1a7ddaf677cad962@mail.gmail.com> References: <320fb6e00907201446x215fa100i3fc09dadb110216b@mail.gmail.com> <320fb6e00907201512r47d26224p64d42dc5ebefb11e@mail.gmail.com> <4A6571AC.5090801@ebi.ac.uk> <320fb6e00907211010v43e0a579je8e6d43112c6ba20@mail.gmail.com> <4A688EA1.9060005@ebi.ac.uk> <320fb6e00907240259v6f57d858nc640afa0b18f2727@mail.gmail.com> <320fb6e00907240332g21a01511h1a7ddaf677cad962@mail.gmail.com> Message-ID: 
<4A6998ED.9020607@ebi.ac.uk> Peter C. wrote: > I was shocked to find using EMBOSS to convert "FASTQ to FASTQ" like > this threw away the quality scores - and I'm sure other people will also > make this mistake. Hmmm ... good point, but hard to avoid unless we simply delete "fastq" from the list of output formats. On balance, I prefer to keep it available. On input there is simply no way to guarantee reading quality scores without being told which type they are. On output it is reasonable to default to fastq-sanger ... otherwise what else could "fastq" output format write? We can consider, as I say, dropping the fastq output format name from a future release. Let's see how users get on with it first. regards, Peter Rice From biopython at maubp.freeserve.co.uk Fri Jul 24 07:33:33 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Jul 2009 12:33:33 +0100 Subject: [emboss-dev] EMBOSS seqret FASTQ support In-Reply-To: <4A6998ED.9020607@ebi.ac.uk> References: <320fb6e00907201446x215fa100i3fc09dadb110216b@mail.gmail.com> <320fb6e00907201512r47d26224p64d42dc5ebefb11e@mail.gmail.com> <4A6571AC.5090801@ebi.ac.uk> <320fb6e00907211010v43e0a579je8e6d43112c6ba20@mail.gmail.com> <4A688EA1.9060005@ebi.ac.uk> <320fb6e00907240259v6f57d858nc640afa0b18f2727@mail.gmail.com> <320fb6e00907240332g21a01511h1a7ddaf677cad962@mail.gmail.com> <4A6998ED.9020607@ebi.ac.uk> Message-ID: <320fb6e00907240433k2ca27ea4y977063ecb863ebaa@mail.gmail.com> On Fri, Jul 24, 2009 at 12:20 PM, Peter Rice wrote: > > Peter C. wrote: >> I was shocked to find using EMBOSS to convert "FASTQ to FASTQ" like >> this threw away the quality scores - and I'm sure other people will also >> make this mistake. > > Hmmm ... good point, but hard to avoid unless we simply delete "fastq" > from the list of output formats. > > On balance, I prefer to keep it available. > > On input there is simply no way to guarantee reading quality scores > without being told which type they are. 
> > On output it is reasonable to default to fastq-sanger ... otherwise what > else could "fastq" output format write? Well quite. In our chat in Sweden, I never expected you to offer "fastq" as an output format in the first place, so didn't raise the issue. > We can consider, as I say, dropping the fastq output format name from a > future release. Let's see how users get on with it first. As you suggest, let's see if anyone else is concerned about this. Peter C. From biopython at maubp.freeserve.co.uk Fri Jul 24 08:40:55 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Jul 2009 13:40:55 +0100 Subject: [emboss-dev] EMBOSS format name "fastq-sanger" in Biopython? In-Reply-To: <320fb6e00907240348h6a3bc043kc17fbbd62daf8a06@mail.gmail.com> References: <320fb6e00907201057p50d2c4dar9ecc0a2c8a7cf9a5@mail.gmail.com> <320fb6e00907240348h6a3bc043kc17fbbd62daf8a06@mail.gmail.com> Message-ID: <320fb6e00907240540i17f7f3f0kdf144c79ccbfdae@mail.gmail.com> On Fri, Jul 24, 2009 at 11:48 AM, Peter wrote: > > To be more precise, it looks like "fastq" as an output format in > EMBOSS is an alias for "fastq-sanger" (to be confirmed), see: > http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000599.html Confirmed, http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000602.html > In any case, it would still make sense to include "fastq-sanger" as > an alias for the Sanger standard FASTQ files in Biopython's SeqIO, > especially if BioPerl is also going to use that name (to be confirmed): > http://lists.open-bio.org/pipermail/bioperl-l/2009-July/030688.html Confirmed, BioPerl will support "fastq" or "fastq-sanger" to mean the Sanger standard FASTQ files: http://lists.open-bio.org/pipermail/bioperl-l/2009-July/030691.html I've updated Biopython's SeqIO in CVS to support "fastq-sanger" as an alias for "fastq". 
Peter

From biopython at maubp.freeserve.co.uk Fri Jul 24 09:32:49 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 24 Jul 2009 14:32:49 +0100
Subject: [emboss-dev] FASTQ support in Biopython, BioPerl, and EMBOSS
Message-ID: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com>

Hi all,

Peter Rice kindly said he will look into an OBF cross project mailing
list, but in the meantime this has been cross posted to the Biopython,
BioPerl, and EMBOSS development lists.

On Thu, Jul 23, 2009 at 11:58 PM, Chris Fields wrote:
>> I'd like to get comparisons against BioPerl's new FASTQ support
>> going too. To do this I'd need to know which (branch?) of BioPerl I
>> should install, and I'd also like a trivial sample BioPerl script to do
>> piped FASTQ conversion. i.e. read a FASTQ file from stdin (say
>> as "fastq-solexa"), and output it to stdout (say as "fastq" meaning
>> the Sanger Standard FASTQ).
>
> You would have to install svn (bioperl-live) if you want the refactored
> fastq. That commit was within the last month.

I've got SVN bioperl-live installed and apparently working :)

>> i.e. Something like this four line Biopython script would be perfect:
>> http://biopython.org/wiki/Reading_from_unix_pipes
>
> We use named parameters so it's a little more verbose.
>
> use Bio::SeqIO;
> my $in  = Bio::SeqIO->new(-fh => \*STDIN, -format => 'fastq-sanger');
> my $out = Bio::SeqIO->new(-format => 'fastq-solexa');
> while (my $seq = $in->next_seq) { $out->write_seq($seq) }
>
> Don't be surprised if there are still bugs lurking about, just let me know
> and I'll fix 'em.

I've got a bug report coming up in a second email, but the basics work :)

e.g.
Using this Sanger style FASTQ file, and converting it to Solexa style http://biopython.org/SRC/biopython/Tests/Quality/example.fastq $ more example.fastq @EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC + ;;3;;;;;;;;;;;;7;;;;;;;88 @EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA + ;;;;;;;;;;;7;;;;;-;;;3;83 @EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG + ;;;;;;;;;;;9;7;;.7;393333 This is simple three record FASTQ file (in the Sanger format). Using EMBOSS 6.1.0: $ seqret -filter -sformat fastq-sanger -osformat fastq-solexa < example.fastq @EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC +EAS54_6_R1_2_1_413_324 ZZRZZZZZZZZZZZZVZZZZZZZWW @EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA +EAS54_6_R1_2_1_540_792 ZZZZZZZZZZZVZZZZZLZZZRZWR @EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG +EAS54_6_R1_2_1_443_348 ZZZZZZZZZZZXZVZZMVZRXRRRR Using BioPerl: $ perl bioperl_sanger2solexa.pl < example.fastq @EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC +EAS54_6_R1_2_1_413_324 ZZRZZZZZZZZZZZZVZZZZZZZWW @EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA +EAS54_6_R1_2_1_540_792 ZZZZZZZZZZZVZZZZZLZZZRZWR @EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG +EAS54_6_R1_2_1_443_348 ZZZZZZZZZZZXZVZZMVZRXRRRR Using Biopython: $ python biopython_sanger2solexa.py < example.fastq @EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC + ZZRZZZZZZZZZZZZVZZZZZZZWW @EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA + ZZZZZZZZZZZVZZZZZLZZZRZWR @EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG + ZZZZZZZZZZZXZVZZMVZRXRRRR They all agree, except that Biopython has followed the MAQ convention of omitting the (optional) repeat of the captions on the plus lines. 
This is something I'd already asked Peter Rice about for EMBOSS (but I think we got sidetracked): http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000577.html Peter From biopython at maubp.freeserve.co.uk Fri Jul 24 09:53:40 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Jul 2009 14:53:40 +0100 Subject: [emboss-dev] FASTQ support in Biopython, BioPerl, and EMBOSS In-Reply-To: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> Message-ID: <320fb6e00907240653y1d7e7861j98ce45a12f02d9df@mail.gmail.com> On Fri, Jul 24, 2009 at 2:32 PM, Peter wrote: >> >> Don't be surprised if there are still bugs lurking about, just let me >> know and I'll fix 'em. > > I've got a bug report coming up in a second email, but the basics work :) I think I have found a bug in BioPerl's conversion from fastq-solexa to fastq-sanger concerning lower quality scores. Here is an artificial Solexa file using the Solexa scores from 40 down to -5 (which I believe to be the full range expected from an instrument). $ more solexa_faked.fastq @slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN +slxa_0001_1_0001_01 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<; A Solexa quality of 40 maps to ASCII 40+64 = 104, "h" A Solexa quality of -5 maps to ASCII -5+64 = 59, ";" You should find this example has Solexa scores 40, 39, .., -4, -5. This file is in the Biopython repository under biopython/Tests/Quality Here is the conversion using MAQ (with the chomp fix from Tim Yu to remove an extra "!" 
character, see the maq-help mailing list for 10 July 2009): http://sourceforge.net/mailarchive/forum.php?thread_name=320fb6e00906170708lb2ce4f7qbc5dfa43543189a2%40mail.gmail.com&forum_name=maq-help $ perl fq_all2std.pl sol2std < solexa_faked.fastq @slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN + IHGFEDCBA@?>=<;:9876543210/.-,++*)('&&%%$$##"" Here is the Biopython conversion, which is identical: $ python biopython_solexa2sanger.py < solexa_faked.fastq @slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN + IHGFEDCBA@?>=<;:9876543210/.-,++*)('&&%%$$##"" EMBOSS 6.1.0 has a rounding issue with negative Solexa scores, and the last six qualities are up by one - Peter Rice is aware of this, and has a fix: http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000596.html $ seqret -filter -sformat fastq-solexa -osformat fastq-sanger < solexa_faked.fastq @slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN +slxa_0001_1_0001_01 IHGFEDCBA@?>=<;:9876543210/.-,+*)(''&%%$$##""" Now we come to BioPerl, $ perl bioperl_solexa2sanger.pl < solexa_faked.fastq @slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN +slxa_0001_1_0001_01 IHGFEDCBA@?>=<;:9876543210/.-,+++*)(''&&&&%%%% You look fine for the higher qualities, but there is something really wrong for the lower scores (not just the negative ones). 
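For reference, the expected mapping can be reproduced in a few lines of
plain Python. This is only a sketch using the usual Solexa to PHRED
formula, Q_phred = 10 * log10(10^(Q_solexa / 10) + 1), rounded to the
nearest integer; it is not taken from the MAQ, Biopython, EMBOSS or
BioPerl sources, and the names below are made up for illustration:

```python
import math

SOLEXA_OFFSET = 64  # ASCII offset used by "fastq-solexa"
SANGER_OFFSET = 33  # ASCII offset used by "fastq-sanger"


def solexa_to_phred(solexa):
    """Convert one Solexa score to the nearest integer PHRED score."""
    return int(round(10 * math.log10(10 ** (solexa / 10.0) + 1)))


# Rebuild the quality line of solexa_faked.fastq: Solexa scores 40 down to -5.
solexa_line = "".join(chr(q + SOLEXA_OFFSET) for q in range(40, -6, -1))

# Decode to Solexa scores, convert each to PHRED, re-encode as Sanger FASTQ.
solexa_scores = [ord(char) - SOLEXA_OFFSET for char in solexa_line]
phred_scores = [solexa_to_phred(q) for q in solexa_scores]
sanger_line = "".join(chr(q + SANGER_OFFSET) for q in phred_scores)

print(" ".join(str(q) for q in phred_scores))
print(sanger_line)
```

Under these assumptions the printed scores run 40 down to 11 and then
10 10 9 8 7 6 5 5 4 4 3 3 2 2 1 1, and the second printed line is the
same quality string as the MAQ and Biopython output quoted above.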
I'll leave you to double check the details, but here are the Sanger PHRED qualities decoded into integers (using Biopython to convert from "fastq-sanger" to "qual" output): $ perl bioperl_solexa2sanger.pl < solexa_faked.fastq | python biopython_sanger2qual.py >slxa_0001_1_0001_01 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 10 10 9 8 7 6 6 5 5 5 5 4 4 4 4 $ perl fq_all2std.pl sol2std < solexa_faked.fastq | python biopython_sanger2qual.py >slxa_0001_1_0001_01 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 10 9 8 7 6 5 5 4 4 3 3 2 2 1 1 Peter C. P.S. This is the BioPerl script I am using here: $ more bioperl_solexa2sanger.pl use Bio::SeqIO; my $in = Bio::SeqIO->new(-fh => \*STDIN, -format => 'fastq-solexa'); my $out = Bio::SeqIO->new(-format => 'fastq-sanger'); while (my $seq = $in->next_seq) { $out->write_seq($seq) }; From biopython at maubp.freeserve.co.uk Fri Jul 24 10:01:11 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Jul 2009 15:01:11 +0100 Subject: [emboss-dev] EMBOSS seqret FASTQ support In-Reply-To: <4A69897F.9010301@ebi.ac.uk> References: <320fb6e00907201446x215fa100i3fc09dadb110216b@mail.gmail.com> <320fb6e00907201512r47d26224p64d42dc5ebefb11e@mail.gmail.com> <4A6571AC.5090801@ebi.ac.uk> <320fb6e00907211010v43e0a579je8e6d43112c6ba20@mail.gmail.com> <4A688EA1.9060005@ebi.ac.uk> <320fb6e00907240259v6f57d858nc640afa0b18f2727@mail.gmail.com> <4A69897F.9010301@ebi.ac.uk> Message-ID: <320fb6e00907240701i1656fe1bh821e491cdc1958ff@mail.gmail.com> On Fri, Jul 24, 2009 at 11:14 AM, Peter Rice wrote: > > Peter C. wrote: >> I'd like to re-test with your fixes. I presume these things are being fixed >> in the public CVS repository, so I could try building EMBOSS from there. >> Is there a particular branch? Or are you planning an EMBOSS 6.1.1 >> release shortly? 
> > You found various things in sequence formats, but all are resolved by > changes to ajseqread.c and ajseqwrite.c > > Assuming I am happy with the test I plan to make a patch which will > update those files. If issuing patches is how you prefer to handle this, that's fine with me. Will you do updates to the binaries for Windows users etc? > The CVS code would have new things for the next release. For now, > if you are using the 6.1.0 release, patching is the way to go. So if I want to retest with your fixes, I can either use CVS or wait for the patches? > Fixes so far: > > FASTQ format changes: > > * sequence and quality scores on one line That does seem to be preferred in general. > * quality ID line shortened to '+' This is certainly the way MAQ does it, and as a Sanger based tool that gives this some status - in addition to the file size benefit ;) > * Solexa negative quality score output corrected > * Phred quality score rounding error fixed Were the above two the same issue? > * Corrected reading of quality lines starting with '@' Great. Can you read this file fine now? http://biopython.org/SRC/biopython/Tests/Quality/tricky.fastq > GenBank format changes: > > * protein (genpept and refseqp) formats auto-detect fix for multiple > input sequences > > Intelligenetics format: > > * Sequence ID corrected for DOS format input file > > Did I miss anything? I think that's everything. Thank you! 
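On the point about quality lines starting with '@': one way a reader can
avoid being fooled is to use the sequence length to decide when the
quality string is complete, rather than looking for the next '@'. A
minimal sketch in plain Python follows. It is an assumption about one
reasonable approach, not a description of what ajseqread.c or Biopython
actually do, and parse_fastq is a made-up name:

```python
def parse_fastq(lines):
    """Yield (title, sequence, quality) tuples from an iterable of lines.

    Sequence and quality may be wrapped over several lines. The quality
    is read until it is as long as the sequence, so a quality line that
    starts with '@' or '+' is not mistaken for a new record.
    """
    iterator = iter(lines)
    line = next(iterator, None)
    while line is not None:
        line = line.rstrip("\r\n")
        if not line:
            # Skip blank lines between records.
            line = next(iterator, None)
            continue
        if not line.startswith("@"):
            raise ValueError("expected '@' title line, got %r" % line)
        title = line[1:]

        # Sequence lines run until the '+' line.
        seq_parts = []
        line = next(iterator, None)
        while line is not None and not line.startswith("+"):
            seq_parts.append(line.strip())
            line = next(iterator, None)
        if line is None:
            raise ValueError("missing '+' line for %r" % title)
        sequence = "".join(seq_parts)

        # Quality lines run until the length matches the sequence.
        quality = ""
        while len(quality) < len(sequence):
            line = next(iterator, None)
            if line is None:
                raise ValueError("truncated quality for %r" % title)
            quality += line.strip()
        if len(quality) != len(sequence):
            raise ValueError("quality longer than sequence for %r" % title)

        yield title, sequence, quality
        line = next(iterator, None)
```

For example, passing an open file handle (or sys.stdin) to parse_fastq
and looping over the result gives one tuple per record.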
Peter From biopython at maubp.freeserve.co.uk Fri Jul 24 11:12:57 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Jul 2009 16:12:57 +0100 Subject: [emboss-dev] FASTQ support in Biopython, BioPerl, and EMBOSS In-Reply-To: <320fb6e00907240653y1d7e7861j98ce45a12f02d9df@mail.gmail.com> References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> <320fb6e00907240653y1d7e7861j98ce45a12f02d9df@mail.gmail.com> Message-ID: <320fb6e00907240812l25cd222dxf72fee0e3093f7b3@mail.gmail.com> On Fri, Jul 24, 2009 at 2:53 PM, Peter wrote: > On Fri, Jul 24, 2009 at 2:32 PM, Peter wrote: >>> >>> Don't be surprised if there are still bugs lurking about, just let me >>> know and I'll fix 'em. >> >> I've got a bug report coming up in a second email, but the basics work :) > > I think I have found a bug in BioPerl's conversion from fastq-solexa > to fastq-sanger concerning lower quality scores. Next up is an issue with BioPerl converting from Sanger to Illumina. In principle this is simple - the quality strings both use PHRED scores just with different offsets. With lower PHRED scores, everything is fine: $ more sanger_faked.fastq @Test PHRED qualities from 40 to 0 inclusive ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN + IHGFEDCBA@?>=<;:9876543210/.-,+*)('&%$#"! Again, this is an example constructed by hand to cover a broad range of valid scores, and can be found in the Biopython repository under biopython/Tests/Quality $ perl bioperl_sanger2illumina.pl < sanger_faked.fastq @Test PHRED qualities from 40 to 0 inclusive ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN +Test PHRED qualities from 40 to 0 inclusive hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@ $ python biopython_sanger2illumina.py < sanger_faked.fastq @Test PHRED qualities from 40 to 0 inclusive ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN + hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@ So, BioPerl and Biopython (and EMBOSS) agree - apart from the repeating second title on the plus line. 
I understand that EMBOSS will in future omit the repeated title on the plus line: http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000598.html Now, here comes the problem. I believe FASTQ files directly from an Illumina 1.3+ pipeline will have PHRED scores in the range 0 to 40 (as in this example). However, much higher PHRED scores are possible during assembly / contig'ing and read mapping. For example, the tool MAQ will output Sanger style FASTQ files with PHRED scores in the range 0 to 93 inclusive. Now, in the Sanger FASTQ format, PHRED scores of 0 to 93 map onto ASCII values of 33 to 126 (! to ~). There is a reason for stopping at 126, since ASCII 127 is "delete". However, in the Illumina 1.3+ FASTQ format, PHRED scores of 0 to 93 would map to ASCII values of 64 to 157, which includes a lot of non printing characters. Working with such files at the command line or in an editor is a big problem. Clearly, Illumina never intended to include such high scores in their FASTQ files! Nevertheless, it is possible to write a FASTQ format following the Illumina 1.3+ encoding with these values. Biopython and EMBOSS attempt to do this - although I would regard throwing an error as equally acceptable. So, here is another hand constructed example of a Sanger style FASTQ file using the full quality range: $ more sanger_93.fastq @Test PHRED qualities from 93 to 0 inclusive ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGAN + ~}|{zyxwvutsrqponmlkjihgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;:9876543210/.-,+*)('&%$#"! 
Again, this example is in the Biopython repository under
biopython/Tests/Quality

Just to check:

$ python biopython_sanger2qual.py < sanger_93.fastq
>Test PHRED qualities from 93 to 0 inclusive
93 92 91 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75 74
73 72 71 70 69 68 67 66 65 64 63 62 61 60 59 58 57 56 55 54
53 52 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34
33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14
13 12 11 10 9 8 7 6 5 4 3 2 1 0

So, here we go - apologies for the expected line mangling:

$ seqret -filter -sformat fastq-sanger -osformat fastq-illumina < sanger_93.fastq | hexdump -C -v
00000000 40 54 65 73 74 20 50 48 52 45 44 20 71 75 61 6c |@Test PHRED qual|
00000010 69 74 69 65 73 20 66 72 6f 6d 20 39 33 20 74 6f |ities from 93 to|
00000020 20 30 20 69 6e 63 6c 75 73 69 76 65 0a 41 43 54 | 0 inclusive.ACT|
00000030 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT|
00000040 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT|
00000050 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT|
00000060 47 41 43 54 47 41 43 54 47 0a 41 43 54 47 41 43 |GACTGACTG.ACTGAC|
00000070 54 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 |TGACTGACTGACTGAC|
00000080 54 47 41 43 54 47 41 43 54 47 41 4e 0a 2b 54 65 |TGACTGACTGAN.+Te|
00000090 73 74 0a 9d 9c 9b 9a 99 98 97 96 95 94 93 92 91 |st..............|
000000a0 90 8f 8e 8d 8c 8b 8a 89 88 87 86 85 84 83 82 81 |................|
000000b0 80 7f 7e 7d 7c 7b 7a 79 78 77 76 75 74 73 72 71 |..~}|{zyxwvutsrq|
000000c0 70 6f 6e 6d 6c 6b 6a 69 68 67 66 65 64 63 62 0a |ponmlkjihgfedcb.|
000000d0 61 60 5f 5e 5d 5c 5b 5a 59 58 57 56 55 54 53 52 |a`_^]\[ZYXWVUTSR|
000000e0 51 50 4f 4e 4d 4c 4b 4a 49 48 47 46 45 44 43 42 |QPONMLKJIHGFEDCB|
000000f0 41 40 0a |A@.|
000000f3

$ python biopython_sanger2illumina.py < sanger_93.fastq | hexdump -C -v
00000000 40 54 65 73 74 20 50 48 52 45 44 20 71 75 61 6c |@Test PHRED qual|
00000010 69 74 69 65 73 20 66 72 6f 6d 20 39 33 20 74 6f |ities from 93 to|
00000020 20 30 20 69 6e 63 6c 75 73 69 76 65 0a 41 43 54 | 0 inclusive.ACT|
00000030 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT|
00000040 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT|
00000050 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT|
00000060 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT|
00000070 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT|
00000080 47 41 43 54 47 41 43 54 47 41 4e 0a 2b 0a 9d 9c |GACTGACTGAN.+...|
00000090 9b 9a 99 98 97 96 95 94 93 92 91 90 8f 8e 8d 8c |................|
000000a0 8b 8a 89 88 87 86 85 84 83 82 81 80 7f 7e 7d 7c |.............~}||
000000b0 7b 7a 79 78 77 76 75 74 73 72 71 70 6f 6e 6d 6c |{zyxwvutsrqponml|
000000c0 6b 6a 69 68 67 66 65 64 63 62 61 60 5f 5e 5d 5c |kjihgfedcba`_^]\|
000000d0 5b 5a 59 58 57 56 55 54 53 52 51 50 4f 4e 4d 4c |[ZYXWVUTSRQPONML|
000000e0 4b 4a 49 48 47 46 45 44 43 42 41 40 0a |KJIHGFEDCBA@.|
000000ed

Biopython and EMBOSS 6.1.0 differ regarding the plus line, but agree on
the quality string which runs from 0x9d to 0x40 (in hex), or 157 to 64
in decimal, which after subtracting the Illumina offset of 64, gives
PHRED scores of 93 to 0 as desired.

Now to BioPerl,

$ perl bioperl_sanger2illumina.pl < sanger_93.fastq
@Test PHRED qualities from 93 to 0 inclusive
ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGAN
+Test PHRED qualities from 93 to 0 inclusive
hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@

$ perl bioperl_sanger2illumina.pl < sanger_93.fastq | hexdump -C -v
...

BioPerl has output an invalid FASTQ file - it seems to omit the quality
scores for the top scoring nucleotides at the start. The BioPerl quality
string runs from just "h" to "@", or 0x68 to 0x40 (in hex), giving 104
to 64 in decimal, giving PHRED values of 40 to 0.
I think BioPerl should either throw an error, or output the non printing characters as done by Biopython and EMBOSS. Regards, Peter C. (@Biopython) From biopython at maubp.freeserve.co.uk Sat Jul 25 17:12:26 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 25 Jul 2009 22:12:26 +0100 Subject: [emboss-dev] [Bioperl-l] FASTQ support in Biopython, BioPerl, and EMBOSS In-Reply-To: <32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu> References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> <320fb6e00907240653y1d7e7861j98ce45a12f02d9df@mail.gmail.com> <320fb6e00907240812l25cd222dxf72fee0e3093f7b3@mail.gmail.com> <32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu> Message-ID: <320fb6e00907251412u5f53b24eiea618906e607a0e1@mail.gmail.com> On Sat, Jul 25, 2009 at 8:50 PM, Chris Fields wrote: > >> Now, here comes the problem. I believe FASTQ files directly >> from an Illumina 1.3+ pipeline will have PHRED scores in the >> range 0 to 40 (as in this example). However, much higher >> PHRED scores are possible during assembly / contig'ing >> and read mapping. For example, the tool MAQ will output >> Sanger style FASTQ files with PHRED scores in the range >> 0 to 93 inclusive. > > Is this behavior documented anywhere, specifically by Illumina (that values > can exceed 40)? If Illumina 1.3 is specified as being PHRED 0-40, and > another (non-Illumina) software package pushes that limit above the > specified range of Illumina values, I would consider that unfortunately yet > another variant. > > We can support it as Illumina 1.3, but my point is this may getting into a > grey area and may be something that Illumina doesn't/wouldn't support. > Reminds me a little of the multiple GFF2 variations (one of the main > reasons for a GFF3). I agree this is an grey area (high scores in Solexa/Illumina FASTQ files). >> Now, in the Sanger FASTQ format, PHRED scores of 0 to >> 93 map onto ASCII values of 33 to 126 (! to ~). 
There is a >> reason for stopping at 126, since ASCII 127 is "delete". >> >> However, in the Illumina 1.3+ FASTQ format, PHRED >> scores of 0 to 93 would map to ASCII values of 64 to >> 157, which includes a lot of non printing characters. >> Working with such files at the command line or in an >> editor is a big problem. Clearly, Illumina never intended >> to include such high scores in their FASTQ files! > > Exactly. > >> Nevertheless, it is possible to write a FASTQ format >> following the Illumina 1.3+ encoding with these values. >> Biopython and EMBOSS attempt to do this - although I >> would regard throwing an error as equally acceptable. >> >> So, here is another hand constructed example of a >> Sanger style FASTQ file using the full quality range: >> >> ... >> >> Biopython and EMBOSS 6.1.0 differ regarding the plus line, but agree >> on the quality string which runs from 0x9d to 0x40 (in hex), or 157 to >> 64 in decimal, which after subtracting the Illumina offset of 64, gives >> PHRED scores of 93 to 0 as desired. >> >> Now to BioPerl, >> >> $ perl bioperl_sanger2illumina.pl < sanger_93.fastq >> ... >> >> $ perl bioperl_sanger2illumina.pl < sanger_93.fastq | hexdump -C -v >> ... >> >> BioPerl has output an invalid FASTQ file - it seems to omit the >> quality scores for the top scoring nucleotides at the start. The >> BioPerl quality string runs from just "h" to "@", or 0x68 to 0x40 >> (in hex), giving 104 to 64 in decimal, giving PHRED values of >> 40 to 0. I think BioPerl should either throw an error, or output >> the non printing characters as done by Biopython and EMBOSS. > > If this is accepted as common practice between BioPython and EMBOSS > we will follow similarly. I do think it's worth at least a warning for the > reasons outlined above (e.g. it likely isn't Illumina's intent to support qual > values outside the specified range). Might be worth checking into. True. 
I think what EMBOSS and Biopython are doing is reasonable (although a warning in this situation makes sense). Equally, an error is a valid option.

However, one question is when would you issue the warning/error? For a PHRED score above 40? (Assuming we have a definitive reference for Illumina using just 0 to 40). How about if a problem character would result? Since ASCII 64+63=127, the first problem character would be for PHRED score 63. i.e. An Illumina FASTQ format file can hold PHRED scores in the range 0 to 62 without using problem characters. And likewise for a Solexa FASTQ file (Solexa scores up to 62).

> From this it could be summarized that converting to sanger format is least
> problematic, as possible issues may be encountered when converting to the
> other variants.

Yes. The Sanger FASTQ format will hold PHRED scores from 0 to 93 while using nice ASCII characters - this means it is suitable for both raw reads and processed data from assemblies or read mappings. In my personal experience, Solexa/Illumina FASTQ files tend to get converted into the Sanger FASTQ format for downstream analysis (e.g. the MAQ tool, or the NCBI short read archive). i.e. Writing high quality reads (i.e. above PHRED 40) to Solexa or Illumina FASTQ files is unlikely.

> We'll need to fix the solexa quality calculations in the BioPerl
> parser as noted in your previous post; I'll work on that.

Great.

Peter

From pmr at ebi.ac.uk Mon Jul 27 04:55:43 2009
From: pmr at ebi.ac.uk (Peter Rice)
Date: Mon, 27 Jul 2009 09:55:43 +0100
Subject: [emboss-dev] Open-bio cross-project issues
In-Reply-To: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com>
References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com>
Message-ID: <4A6D6B8F.9060108@ebi.ac.uk>

Peter C.
wrote (to bioperl-l, biopython-l, emboss-dev): > Hi all, > > Peter Rice kindly said he will look into an OBF cross project mailing > list, but in the meantime this has been cross posted to the Biopython, > BioPerl, and EMBOSS development lists. There is a list already for this purpose - open-bio-l I think we will also need a cross-project wiki space on the OBF site. Is there something already used by other projects or should we set something up? I am cross-posting this to other OBF project lists to encourage developers interested in combining efforts to address common problems. This started with FASTQ short read formats, and open-bio-l (a low volume list) has also seen discussion of common test data sets. Please sign up to open-bio-l (if you are not there already) and post suggestions for cross-project issues there. The list subscription page is: http://lists.open-bio.org/mailman/listinfo/open-bio-l Please feel free to forward this to any other projects I may have missed (I picked the obvious addresses from the list.open-bio-org server) regards, Peter Rice From biopython at maubp.freeserve.co.uk Mon Jul 27 13:39:49 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 27 Jul 2009 18:39:49 +0100 Subject: [emboss-dev] FASTQ parsing speed in EMBOSS Message-ID: <320fb6e00907271039w15ef3afcsd4a36e3ddbf001e8@mail.gmail.com> Hi all, I've been testing EMBOSS 6.1.0 with a patch from Peter Rice for some of the FASTQ issues I've raised, and I decided to do a few simple benchmarks. For this example, I have used a 1.3 GB standard Sanger FASTQ file from the NCBI short read archive which contains just over seven million short reads of length 36 bp, which I believe were originally from a Solexa/Illumina machine. This is actually one of a pair of FASTQ files as this was a paired end run. 
The file is here (compressed): ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX000/SRX000430/SRR001666_1.fastq.gz Note that some of the quality lines start with "@", so you can't use grep for "^@" to count the records. However, all the reads have an identifier starting SRR so you can do this: $ time grep "^@SRR" SRR001666_1.fastq | wc -l 7047668 real 0m15.886s user 0m18.357s sys 0m1.268s For this example, I want to convert the FASTQ file to FASTA (i.e. ignore and throw away the quality scores). This is a fairly common task, as most all assemblers will take FASTA files, even if they don't understand FASTQ. As I didn't want to waste disk space and I wanted a basic check on the output, I have simply piped the output via grep and wc to count the FASTA records: $ time seqret -filter -sformat fastq-sanger -osformat fasta < SRR001666_1.fastq | grep "^>" | wc -l 7047668 real 2m48.288s user 3m3.994s sys 0m3.525s I've run this several times, and this result is typical. So, using the "fastq-sanger" format this takes about 2m48s. There is a slight speed up using "fastq" as the EMBOSS input format name, as this never has to convert the quality strings into PHRED values: $ time seqret -filter -sformat fastq -osformat fasta < SRR001666_1.fastq | grep "^>" | wc -l 7047668 real 2m43.566s user 2m59.077s sys 0m3.540s i.e. About 2m44, saving about 4s. Just for the record, actually doing the FASTQ to FASTA conversion to a file (without grep and wc) takes about 2m52s: $ time seqret -filter -sformat fastq -osformat fasta -sequence SRR001666_1.fastq -outseq SRR001666_1.fasta real 2m51.791s user 2m40.545s sys 0m4.848s This is over 40 thousand reads per second, but I was still a little disappointed in the run time. Improvements in the FASTQ parsing/writing speed would help get EMBOSS used in sequencing centre pipelines. Once we have the EMBOSS FASTQ input/output working as intended, does trying to speed it up further seem worthwhile? 
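To illustrate why a grep for "^@" over-counts, here is a minimal Python sketch (not from the original message) that counts records by position instead of by pattern. It assumes every record occupies exactly four lines, with no wrapped sequence or quality lines; the function name count_fastq_records and the two-read sample are made up for illustration.

```python
import io


def count_fastq_records(handle):
    """Count FASTQ records, assuming exactly four lines per record."""
    count = 0
    for index, line in enumerate(handle):
        # Only every fourth line (0, 4, 8, ...) is a title line.
        if index % 4 == 0:
            if not line.startswith("@"):
                raise ValueError("line %d: expected '@' title line" % (index + 1))
            count += 1
    return count


# Hypothetical sample: the first quality line happens to start with "@".
sample = "@read1\nACGT\n+\n@III\n@read2\nACGT\n+\nIIII\n"

naive = sum(1 for line in sample.splitlines() if line.startswith("@"))
print(naive)                                          # 3, one too many
print(count_fastq_records(io.StringIO(sample)))       # 2
```

This trades the simplicity of grep for independence from the read naming scheme; it would not cope with files that wrap long sequence or quality lines.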
One specific suggestion is for the "fastq" parser (function seqReadFastq) which doesn't do anything with the quality strings. Other than for a debug statement, there is no need to calculate these lines:

minqual = ajStrGetAsciiLow(qualstr);
maxqual = ajStrGetAsciiHigh(qualstr);
comqual = ajStrGetAsciiCommon(qualstr);

In fact, you don't really need to record qualstr at all. Could you just verify the total length of the quality string, without actually recording it in a buffer?

Another suggestion (although not demonstrated in the above benchmark) is for the Solexa FASTQ parsing (and output). From looking at the code, you map the ASCII to a PHRED score for each letter of every read. This is a relatively expensive operation using powers and logs. I would try using a precomputed look up table (something I have just been working on for Biopython - this made a very big difference, especially when converting to/from Solexa scores to PHRED scores).

Peter C.

From pmr at ebi.ac.uk Tue Jul 28 04:05:47 2009
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 28 Jul 2009 09:05:47 +0100
Subject: [emboss-dev] FASTQ parsing speed in EMBOSS
In-Reply-To: <320fb6e00907271039w15ef3afcsd4a36e3ddbf001e8@mail.gmail.com>
References: <320fb6e00907271039w15ef3afcsd4a36e3ddbf001e8@mail.gmail.com>
Message-ID: <4A6EB15B.20903@ebi.ac.uk>

Peter wrote:
> Hi all,
>
> I've been testing EMBOSS 6.1.0 with a patch from Peter Rice for
> some of the FASTQ issues I've raised, and I decided to do a few
> simple benchmarks.
>
> This is over 40 thousand reads per second, but I was still a
> little disappointed in the run time. Improvements in the FASTQ
> parsing/writing speed would help get EMBOSS used in
> sequencing centre pipelines. Once we have the EMBOSS
> FASTQ input/output working as intended, does trying to
> speed it up further seem worthwhile?

Thanks. I'll take a look. FASTQ parsing is pretty fast - in that writing the output takes about as long as reading the input.
There may be ways to speed that up (output requires making an output sequence object which takes half the output time). Building EMBOSS with --with-gccprofile and compiling with gcc creates a gprof profile. Very useful for catching bottlenecks. Up to the advent of NGS data, large input/output runs have been limited to converting EMBL/GenBank into Fasta as a one-off every few months so looking into the efficiency of sequence reading/writing has been a low priority. Now it does assume much more importance. > Another suggestion (although not demonstrated in the above > benchmark) is for the Solexa FASTQ parsing (and output). >>From looking at the code, you map the ASCII to a PHRED > score for each letter of every read. This is a relatively > expensive operation using powers and logs. I would try > using a precomputed look up table (something I have just > been working on for Biopython - this made a very big > difference, especially when converting to/from Solexa > scores to PHRED scores). Yes, that was on my list of future changes. There wasn't time to fully implement and test before the release freeze. regards, Peter From biopython at maubp.freeserve.co.uk Tue Jul 28 05:21:33 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 28 Jul 2009 10:21:33 +0100 Subject: [emboss-dev] FASTQ parsing speed in EMBOSS In-Reply-To: <4A6EB15B.20903@ebi.ac.uk> References: <320fb6e00907271039w15ef3afcsd4a36e3ddbf001e8@mail.gmail.com> <4A6EB15B.20903@ebi.ac.uk> Message-ID: <320fb6e00907280221y141797fcw81faeefd22429fb1@mail.gmail.com> On Tue, Jul 28, 2009 at 9:05 AM, Peter Rice wrote: > > Thanks. I'll take a look. FASTQ parsing is pretty fast - in that writing the > output takes about as long as reading the input. There may be ways to speed > that up (output requires making an output sequence object which takes half > the output time). > > Building EMBOSS with --with-gccprofile and compiling with gcc creates a > gprof profile. Very useful for catching bottlenecks. Nice tip. 
> Up to the advent of NGS data, large input/output runs have been limited to > converting EMBL/GenBank into Fasta as a one-off every few months so looking > into the efficiency of sequence reading/writing has been a low priority. Now > it does assume much more importance. Exactly :) >> Another suggestion (although not demonstrated in the above >> benchmark) is for the Solexa FASTQ parsing (and output). >> From looking at the code, you map the ASCII to a PHRED >> score for each letter of every read. This is a relatively >> expensive operation using powers and logs. I would try >> using a precomputed look up table (something I have just >> been working on for Biopython - this made a very big >> difference, especially when converting to/from Solexa >> scores to PHRED scores). > > Yes, that was on my list of future changes. There wasn't time to fully > implement and test before the release freeze. That makes sense - and it is a pretty obvious thing to try, so I would have been surprised if you hadn't come up with the same idea. Peter From biopython at maubp.freeserve.co.uk Tue Jul 28 08:51:08 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 28 Jul 2009 13:51:08 +0100 Subject: [emboss-dev] FASTQ parsing speed in EMBOSS Message-ID: <320fb6e00907280551n7a42563byb802016b2342de06@mail.gmail.com> I've retitled this and CC'ed it to the EMBOSS dev list - which is probably a better place for this now! On Tue, Jul 28, 2009 at 1:40 PM, Peter Rice wrote: > Peter wrote: >> On Mon, Jul 27, 2009 at 11:44 PM, Brad Chapman wrote: > >>>> P.S. Anyone care to guess on how EMBOSS, BioPerl, and >>>> Biopython's FASTQ parsing stacks up in terms of run time? >>> >>> We better be the fastest. Everyone knows that C code is bloated >>> and slow. >> >> I pretty sure that was tongue in check, but if you were being mean >> you probably could describe some of the EMBOSS infrastructure >> as bloat. 
In any case, I'm sure that EMBOSS can be made faster >> now that speed matters here with next generation sequencing, see: >> http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000611.html > > EMBOSS code is indeed bloated and slow in some places - for example on > output it constructs a sequence output object from the input sequence. > However, it's C ... if we know what we're doing we can tell the machine > to go faster. Unless the compiler decides it can optimise us away... > > Certainly this is a place where using reference-counted strings shows > gains. We tend to avoid them in EMBOSS because early experience in > optimising had them being deleted at the 'wrong' times and leaving us > with no significant improvement in performance. Sequence output looks > like a good place for them. > > We can also simplify the sequence output objects to avoid some of the > reset operations when reusing the objects. > >> And I've got bad news for you then - currently EMBOSS seqret >> is about twice as fast as CVS Biopython SeqIO (measuring parsing >> versus writing is a bit tricky). However, I have a cunning plan: >> http://lists.open-bio.org/pipermail/biopython-dev/2009-July/006493.html > > Worse news, I can find some speedups in EMBOSS ... though > the split is about 40% in output and 60% in input CPU time. Well, it is only bad news from the point of view of Biopython bragging rights ;) And with those speed ups, I guess my fast lower level Biopython FASTQ to FASTA script will now be about the same speed as seqret! See: http://lists.open-bio.org/pipermail/biopython-dev/2009-July/006493.html Nice work! > I/O time is another issue where we could play with blocked > reads ... though when I tried that some time ago it seemed > the operating systems and file systems were doing a grand > job and it was hard to get a consistent speed gain even for > one specific system. Maybe best avoided, given EMBOSS is truly cross platform. Peter C. 
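The precomputed lookup table suggested earlier in this thread can be sketched as follows. This is an illustrative Python version, not the EMBOSS or Biopython code. It assumes the usual Solexa to PHRED relation Q = 10 * log10(10^(S/10) + 1), an ASCII offset of 64, Solexa scores from -5 to 62, and rounding to the nearest integer; all names are hypothetical.

```python
import math

SOLEXA_OFFSET = 64  # ASCII offset assumed for Solexa FASTQ quality characters


def solexa_to_phred(solexa_score):
    """Convert one Solexa score to a PHRED score (the slow path: power and log)."""
    return 10 * math.log10(10 ** (solexa_score / 10.0) + 1)


# Built once: maps each quality character straight to an integer PHRED score,
# so decoding a read is one dictionary lookup per base with no floating point maths.
PHRED_FROM_SOLEXA_CHAR = {
    chr(score + SOLEXA_OFFSET): int(round(solexa_to_phred(score)))
    for score in range(-5, 63)
}


def decode_solexa_quality(quality_string):
    """Return the PHRED scores for a Solexa-encoded quality string."""
    return [PHRED_FROM_SOLEXA_CHAR[char] for char in quality_string]


# ";" is Solexa -5, "@" is Solexa 0, "h" is Solexa 40.
print(decode_solexa_quality(";@h"))
```

The table has only 68 entries, so building it costs far less than calling the log and power functions for every base of several million reads. A character outside the assumed range raises KeyError, which doubles as a crude validity check.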
From jbdundas at gmail.com Tue Jul 28 21:06:43 2009
From: jbdundas at gmail.com (jitesh dundas)
Date: Wed, 29 Jul 2009 06:36:43 +0530
Subject: [emboss-dev] emboss-dev Digest, Vol 11, Issue 14
In-Reply-To:
References:
Message-ID: <326ea8620907281806x2ffa42sf345cc9a0986aec3@mail.gmail.com>

Dear Sir,

I am going to begin writing code for making parallel program execution in Emboss. I need someone to answer my doubts about Emboss as I am learning.

--
Thanks & Regards,
Jitesh Dundas
Research Associate, DIL Lab, IIT-Bombay(www.dil.iitb.ac.in),
Scientist, Edencore Technologies(www.edencore.net)
Phone:- +91-9860925706
http://jiteshbdundas.blogspot.com
"No idea is stupid,either its too good to be true, or its way ahead of its future"- GEORGE BERNARD SHAW.

From biopython at maubp.freeserve.co.uk Fri Jul 31 08:01:27 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 31 Jul 2009 13:01:27 +0100
Subject: [emboss-dev] FASTQ parsing speed in EMBOSS
In-Reply-To: <320fb6e00907280551n7a42563byb802016b2342de06@mail.gmail.com>
References: <320fb6e00907280551n7a42563byb802016b2342de06@mail.gmail.com>
Message-ID: <320fb6e00907310501t56a136d0yde9882cb3e96c4a2@mail.gmail.com>

On Tue, Jul 28, 2009 at 1:51 PM, Peter wrote:
> I've retitled this and CC'ed it to the EMBOSS dev list - which is
> probably a better place for this now!

Another random thought for speeding up parsing/writing the Solexa/Illumina FASTQ formats: At some point you need to convert from an integer score to an ASCII character using an offset of 64. Would clearing/setting the bit be faster than using integer subtraction/addition? Sadly this trick won't work for the Sanger FASTQ format as the offset is 33, not 32.

Peter C.

Credit where due: This idea was based on a discussion with Leighton Pritchard, where he suggested this could be why Solexa opted for a 64 bit offset in particular.
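The bit-setting idea above can be checked with a short sketch. This Python version only demonstrates that the two operations agree; it says nothing about whether the trick is actually faster in C, which the thread leaves as an open question. The function names are hypothetical, and the trick is assumed to apply only to scores 0 to 63, where bit 6 of the score is not already set.

```python
ILLUMINA_OFFSET = 64  # a single bit, 0b1000000
SANGER_OFFSET = 33    # 0b100001, two bits, so the trick does not carry over


def encode_with_bit(score):
    """Set bit 6 instead of adding 64; valid only for scores 0 to 63."""
    if not 0 <= score <= 63:
        raise ValueError("score %r is outside the range where OR equals addition" % score)
    return chr(score | ILLUMINA_OFFSET)


def decode_with_bit(char):
    """Clear bit 6 instead of subtracting 64; valid only for ASCII 64 to 127."""
    code = ord(char)
    if not 64 <= code <= 127:
        raise ValueError("character %r is outside the range where AND equals subtraction" % char)
    return code & ~ILLUMINA_OFFSET


# For every score below 64, OR and addition give the same character code.
assert all((score | ILLUMINA_OFFSET) == score + ILLUMINA_OFFSET for score in range(64))

# With the Sanger offset the two already disagree at score 1.
print(1 | SANGER_OFFSET, 1 + SANGER_OFFSET)

print(encode_with_bit(40), decode_with_bit("h"))
```

Scores of 64 and above already have bit 6 set, so the shortcut would silently give wrong characters there; a real implementation would need either the range check shown here or a fallback to plain addition.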
From jbdundas at gmail.com Thu Jul 2 08:20:09 2009
From: jbdundas at gmail.com (jitesh dundas)
Date: Thu, 2 Jul 2009 13:50:09 +0530
Subject: [emboss-dev] Task Update
Message-ID: <326ea8620907020120w4d286761m48a99996f11f1022@mail.gmail.com>

Dear Sir,

I have created the database design for the task of running the tasks of
EMBOSS in parallel. As this is to be monitored from any place, I thought of
making it web-based. Java/Servlets/JSP/Beans is my choice as I am comfortable
with these.

Now, I will need an IP address of another machine, located in a distant
place, having EMBOSS running on it.

The first interface would be the master interface, tracking all the
activities. This should be easy, provided I have all the IP addresses and
program details.

Next, this is what will happen. I have Jemboss installed on my PC (with
internet). Next I get the first task, say try creating a sequence or any
simple input function. This interface will need some input from another
Jemboss interface at a machine in another location. The details are sent to
this machine via the internet as this is the widest network available. I am
using Servlets / JSP with MySQL support.

Note:- In the above scenario, to send an input from one interface, I will
need a button or a menu item in the main interface of Jemboss; this will send
the details on the click event.

For the receiving part, when the user receives the details via email, a
Jemboss receiving interface will listen for such mails and get the details
from reading this email (all this is done in the background), and thus the
details are sent to this interface.

This is a little difficult but worth implementing. Once done, this would mean
that a person can execute an interface of Jemboss in India while he sends the
result to a Jemboss interface in the UK, which in turn passes the details on
to some other place. All this will be controlled and decided by the Monitor
interface. This can be controlled by the user, but a degree of automation in
scheduling is provided.
Any feedback is most welcome. I have started writing. Any experts in Java RMI
and related areas, please help.

I request your reply.

Regards,
Jitesh

Thanks & Regards,
Jitesh Dundas

Phone:- +91-9860925706

http://jiteshbdundas.blogspot.com

---------- Forwarded message ----------
From: jitesh dundas
Date: Wed, Jun 24, 2009 at 8:31 PM
Subject: Fwd: [emboss-dev] (no subject)
To: Peter Rice
Cc: emboss-dev at lists.open-bio.org

Dear Sir,

This is the logic that I intend to implement:-

1) I have EMBOSS installed on my laptop (India). I will need another machine
(say in the UK) with EMBOSS installed on it. When I run one interface on my
machine, the output will be sent to the UK machine. The UK machine will have
another interface thread waiting for this input (from India). Please note
that this information will be sent via internet/intranet (in encrypted form).
Thus the execution will continue on the UK machine. In the same way, the
execution for the India machine will continue till it needs some input from
the UK machine.

2) The decision to allot the tasks/input and output will be done by an
independent monitoring master interface. This will track continuously the
progress of both the machines for each interface.

3) The information about each interface will have to be sent to a database
for storage. For e.g.) a table with the fields interface no, activity start
time, duration, timestamp, allowed time to execution, input needed, output to
be sent, etc. This will be continuously used by the monitor interface.

I wanted to implement one of my methods for managing these projects in
parallel processing. Based on the results obtained, a paper with the results
can be published.

Sir, this is my basic idea, which I intend to build on. I will need another
machine that I can use for executing this idea. However, it will be needed
after 1-2 weeks, by which time I intend to finish the prior needed parts.

TECHNICAL POINTS:-
1) RMI (Remote Method Invocation) will be needed.
2) Internet access / network connection will be needed. 3) MySQL Db. I request your feedback. Thanks & Regards, Jitesh Dundas ---------- Forwarded message ---------- From: jitesh dundas Date: Tue, Jun 16, 2009 at 1:42 PM Subject: Re: [emboss-dev] (no subject) To: Peter Rice Dear Sir I have installed BOINC on my PC and currently I am studying the code and it could take me some time in getting my grip on it. I assure you though that I will get the task done soon. I will update you about progress every 2 days. Regards, Jitesh ---------- Forwarded message ---------- From: emboss-dev-request at lists.open-bio.org < emboss-dev-request at lists.open-bio.org> Date: Jun 15, 2009 10:30 PM Subject: emboss-dev Digest, Vol 10, Issue 3 To: emboss-dev at lists.open-bio.org Send emboss-dev mailing list submissions to emboss-dev at lists.open-bio.org To subscribe or unsubscribe via the World Wide Web, visit http://lists.open-bio.org/mailman/listinfo/emboss-dev or, via email, send a message with subject or body 'help' to emboss-dev-request at lists.open-bio.org You can reach the person managing the list at emboss-dev-owner at lists.open-bio.org When replying, please edit your Subject line so it is more specific than "Re: Contents of emboss-dev digest..." Today's Topics: 1. Re: (no subject) (jitesh dundas) ---------------------------------------------------------------------- Message: 1 Date: Thu, 11 Jun 2009 20:59:45 +0530 From: jitesh dundas Subject: Re: [emboss-dev] (no subject) To: Peter Rice Cc: emboss-dev at lists.open-bio.org Message-ID: <326ea8620906110829p472a1b06x1e1f38a277c57959 at mail.gmail.com> Content-Type: text/plain; charset=ISO-8859-1 Dear Sir, I hope my previous email gave you a clear idea of my plan on this task. I will begin working on the code now. I will keep you posted every 2 days with my progress. Please let me know if you need anthing else from my side. Regards, Jitesh Dundas On 6/10/09, jitesh dundas wrote: > Dear Sir, > > Thank you for your reply. 
PLEASE FIND MY COMMENTS IN BLOCK LETTERS BELOW. > > On 6/9/09, Peter Rice wrote: >> Dear Jitesh, >> >>> I need to know the priority on which any script/ application in EMBOSS >>> is executed. >> >> Currently, all EMBOSS applications simply execute. >> >> When they terminate, if EMBOSS_LOGFILE is defined they can write a >> single record the the logfile. >> >> However, we can extend this is that is what you are suggesting. >> >> All EMBOSS (and EMBASSY) applications start with a call to ajAcdInit >> (often via embInitP or ajGraphInit) >> >> All EMBOSS applications end with a call to ajExit on success. Failed >> applications should call ajExitBad or ajExitAbort ... unless they crash >> with a segmentation fault or are otherwise terminated. >> >> So we have places to put in additional monitoring code. > > I NEED THE CENTRAL LOCATION FROM WHERE THIS MONITORING SCRIPT CAN BE > ACCESSED. AN INTERFACE THAT WILL BE AT THE HEART OF EMBOSS. THIS WILL > BE A COMMON SCRIPT AND THUS, IT MUST HAVE ACCESS TO ALL SCRIPTS. > >>> If applications in Emboss are to be executed, they need to be assigned >>> a priority or an impact , besides the following:- >>> >>> 1) A master database or a table that stores list of applications >>> running. These will be updated by a scheduled script running >>> continuously in the background. >> >> This script, could, for example, check the list of known running >> applications and remove any that appear to have crashed. >> >>> 2) the front-end GUI needs to showing a chart of applications running >>> and parameters like progress, time consumed etc. >>> >>> Measuring progress needs some breakpoints. Their status will be >>> pending,WIP or completed. >> >> We have no breakpoints in EMBOSS at present. Can you give examples of >> what you have in mind? > > FOR E.G.) there are 5 stages/applications running in parallel. EACH > STAGE WILL BE DIVIDED INTO PARTS,WHERE EACH PART'S END-POINT ACTING > AS A COMPLETION SUB_TARGET. 
> > THE ENTRY IN DATABASE TABLE WILL HAVE A PROCESS,SUB-PROCESS,STATUS > FIELDS. DETAILS OF EACH FIELD ARE ENTERED HERE. > tHE SCRIPT OR THE INTERFACE WILL RUN AND MONITOR THE EXECUTION > PROGRESS OF EACH STAGE. REGULARLY, IT WILL UPDATE THE DATABASE TABLE. > >>> I will send further details in 1-2 days. Meanwhile, i request your >>> feedback. >> >> Hope this helps. >> >> Peter Rice >> > > PLEASE LET ME KNOW IF YOU NEED ANYTHING ELSE FROM MY SIDE. > > -- > Thanks & Regards, > Jitesh Dundas > > Scientist, Edencore Technologies(www.edencore.net) > Web Developer, JR Technologies, India > > Phone:- +91-9860925706 > > http://jiteshbdundas.blogspot.com > > "NO IDEA IS STUPID,EITHER IT IS TOO GOOD TO BE TRUE OR IT IS WAY AHEAD > OF ITS FUTURE "- GEORGE BERNARD SHAW. > From ajb at ebi.ac.uk Tue Jul 7 12:26:58 2009 From: ajb at ebi.ac.uk (ajb at ebi.ac.uk) Date: Tue, 7 Jul 2009 13:26:58 +0100 (BST) Subject: [emboss-dev] Move to libtool 2.2.6a Message-ID: <58544.88.96.156.129.1246969618.squirrel@webmail.ebi.ac.uk> Dear developers, We would like to update the EMBOSS CVS source code from using libtool 1.5.x to libtool 2.2.6a. Libtool 2, in various versions, has been out now for well over a year. Many current distributions, after having spent some time putting it through its paces, have now adopted it e.g. Fedora, OpenSuSE, Mandriva, cygwin, MacOSX Snow Leopard etc. This puts us in the unenviable position in that whatever we do will probably irritate some people. We can't be in the position, though, where we have to eventually advise people to downgrade their software. 1) If we stay with 1.5.x for now then more and more developers using libtool 2.2.x are going to have to type (e.g.) autoreconf -fi prior to the 'aclocal -I m4' stages. 
2) If we move to 2.2.6 then developers on machines using 1.5.x will need to install libtool 2.2.6 somewhere (and usually install fresh versions of autoconf [2.63] and automake [1.11] to the same directory tree to avoid currently installed versions referencing the older libtool). People on this list are developers and, by definition, obviously more than capable of downloading 3 files from ftp.gnu.org and installing them. It takes about 10 minutes. MacOSX is a bit different but new versions are available from MacPorts. So, the question is really whether anyone has any strong views for or against a move to libtool 2.2.6 now, given that we will need to get there in the near future? Alan From ajb at ebi.ac.uk Wed Jul 15 11:18:37 2009 From: ajb at ebi.ac.uk (ajb at ebi.ac.uk) Date: Wed, 15 Jul 2009 12:18:37 +0100 (BST) Subject: [emboss-dev] EMBOSS 6.1.0 release now available Message-ID: <36222.86.26.12.63.1247656717.squirrel@webmail.ebi.ac.uk> Dear EMBOSS users and developers, A new version of EMBOSS (6.1.0) is now available for download from our ftp server: ftp://emboss.open-bio.org/pub/EMBOSS/ If you use any of the EMBASSY packages (e.g. PHYLIP, VIENNA etc) then, as usual, remember to re-download and compile those too. A new version of the mEMBOSS, the Windows port, is also available from: ftp://emboss.open-bio.org/pub/EMBOSS/windows/mEMBOSS-6.1.0-setup.exe Many new capabilities have been added and bugs fixed throughout. Release highlights for EMBOSS include: * Full support for the new SwissProt format. In most cases the entry can be read and written exactly * Full support for EMBL and GenBank entries. 
In most cases the entry can be read and written exactly * Support for FASTQ short read formats for sequence and quality data * Full support for protein and nucleotide sequence parsing from PDB entries * Full support for GFF3 feature format as the new default feature output * Improved summary information at the end of report output * Alignment output using multiple sequence formats * Extended support for distance matrix file formats * Improved support for regular expression and pattern searching * Improved support for large sequence alignments * Support for remote locations in feature table processing, for example retrieval in coderet. * Output directory support extended to allow directories to be created * Normalisation option for hydrophobicity plots (pepwindow and pepwindowall) * Processing of methylation sites in restriction mapping * Embossdata reports results alphabetically sorted * Command line qualifiers should be unique after 5 characters to allow safe abbreviation * Improved configuration procedures for X11 support * Support for dasgff report format, making it possible to write EMBOSS-based DAS annotation servers Release highlights for EMBASSY include: * Support for MEME 4.0 * Phylipnew updated to Phylip 3.68 * Support for the HMMERDB environment variable in Hmmernew. * Bug fixes for the MSE multiple sequence editor Release highlights for Jemboss include: * Refactoring of the source code * Location of the 'Execution mode' menu moved near to the 'Go' button in the application forms. When a user runs a job for the first time in 'batch' mode an information message is displayed * Automatic configuration of the standalone Jemboss GUI on UNIX systems after typing "make install" for EMBOSS. This standalone GUI can be run using the runJemboss.csh script in the EMBOSS 'bin' directory. 
This assumes that you have a reasonably up-to-date version of Java installed (1.6 preferred) For future extensions, we have added: * Parsing of cross-reference information from SwissProt and EMBL/GenBank formats * Code to delete and update database indexes New EMBOSS wiki EMBOSS now has a Wiki at http://emboss.open-bio.org/wiki where we will maintain the master copies of documentation for the applications and libraries, and where we have sections for planning new features and applications for the next 3 years of funding. Please contribute any corrections to the documentation and add new ideas to the "Planning" section. We will, of course, be making the wiki prettier as it matures. Important note for Developers New distributions of operating systems have started to use the series 2 version of libtool. We therefore now use this in our CVS repository. The latest stable version of libtool is 2.2.6a (reported by libtool itself as 2.2.6). Developers using systems with older (1.5.x) libtool versions will have to install a local copy of libtool. This would typically be done by downloading the source code from the GNU site: ftp://ftp.gnu.org/ After installing libtool it will usually be necessary to then re-install autoconf (2.63) and automake (1.11) to the same directory root (they are often tied to the version of libtool they were provided with). They too are available from the GNU ftp server. Make sure that your PATH is refreshed between doing the installations of the GNU tools in order that the previous versions aren't referenced. We note that one system (cygwin) currently provides an experimental version of libtool (2.2.7). Developers on these systems (and, in general, on any system with a higher version of libtool than in our CVS repository) should type: autoreconf -fi before attempting compilation. We will usually keep up-to-date with libtool stable releases within a libtool series. 
New BBSRC funding and future work

As previously announced, we have recently been re-funded by the BBSRC. What
we said in that announcement bears repeating here.

The core aims of the funding proposal were to continue support, maintenance
and development of EMBOSS, and to provide extensive online training materials
for users, developers and system administrators using text from a series of
books to be published by Cambridge University Press.

We are also explicitly targeting areas where we see EMBOSS can be expanded:

* Richer data content in EMBOSS outputs leading to major improvements in
  the integration and visualisation of results in browsers.

* Processing many more data fields in EMBOSS inputs (taxonomy, genes,
  GO terms, cross-references, keywords).

* Extending and improving database access: better indexing, query language
  support and combining searches across multiple databases, support for
  non-sequence data resources and new data access methods

* Scaling up the libraries and adding new applications to support the data
  volumes generated by next-generation sequencing runs. We anticipate many
  more users will be working with short read data mapped to reference
  sequences over the next few years.

* We aim to add at least 100 new applications in these 3 years.
  Suggestions for new applications are very welcome.

* Major work on new developments and new library code will start from
  August.

Alan

From javierluiso at gmail.com Fri Jul 17 22:36:36 2009
From: javierluiso at gmail.com (Javier Luiso)
Date: Fri, 17 Jul 2009 19:36:36 -0300
Subject: [emboss-dev] EMBOSS and CUDA
Message-ID: <61d930160907171536t50f25d43g1ae67b524dadc90e@mail.gmail.com>

Hi,

I'm Javier and I work as a software developer in the computer graphics and
visualization area. I've got experience coding for GPUs and I'm quite
interested in working to bring the power of HPC technologies such as CUDA to
whichever EMBOSS programs are able to use it.
Anything dealing with matrices would be a good first start. I'm open to any
suggestion if there's someone working in the same direction.

Javier Luiso

From biopython at maubp.freeserve.co.uk Mon Jul 20 16:56:45 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 20 Jul 2009 17:56:45 +0100
Subject: [emboss-dev] Regression in GenBank/GenPept parsing?
Message-ID: <320fb6e00907200956n7af9912aie833e974d4fd52a5@mail.gmail.com>

Hi all,

One of Biopython's unit tests uses the EMBOSS tools. This is
for several tasks, including checking we agree for basic sequence
translations using different tables, as well as making sure Biopython
can parse the alignments output by needle and water. Another
area is cross checking we can read each other's sequence output
files.

I've been going over the Biopython unit tests with EMBOSS 6.1.0,
and have found a regression compared to EMBOSS 6.0.1. This is
to do with how EMBOSS parses a minimal GenBank file written
with Biopython.

The file in question is a 10kb GenBank (well, a GenPept file as
it holds protein sequences) converted from an IntelliGenetics file.
I can email this on request. The file contains 16 records:

$ grep "^LOCUS" VIF_mase-pro.gb | wc -l
16

Using EMBOSS 6.0.1, there are warning messages about the
LOCUS line, but all 16 records do get converted into FASTA
format fine.
I'm not sure why it is complaining, and would be grateful
for feedback:

$ embossversion
Writes the current EMBOSS version number to a file
6.0.1
$ seqret -sequence VIF_mase-pro.gb -sformat genbank -osformat fasta -auto -filter | grep ">" | wc -l
Warning: bad Genbank LOCUS line 'LOCUS most-likely 298 aa UNK 01-JAN-1980 '
Warning: bad Genbank LOCUS line 'LOCUS U455 298 aa UNK 01-JAN-1980 '
Warning: bad Genbank LOCUS line 'LOCUS HXB2R 298 aa UNK 01-JAN-1980 '
Warning: bad Genbank LOCUS line 'LOCUS ELI 298 aa UNK 01-JAN-1980 '
Warning: bad Genbank LOCUS line 'LOCUS MVP5180 298 aa UNK 01-JAN-1980 '
Warning: bad Genbank LOCUS line 'LOCUS AD_MAL 298 aa UNK 01-JAN-1980 '
Warning: bad Genbank LOCUS line 'LOCUS CPZGAB 298 aa UNK 01-JAN-1980 '
Warning: bad Genbank LOCUS line 'LOCUS CPZANT 298 aa UNK 01-JAN-1980 '
Warning: bad Genbank LOCUS line 'LOCUS ROD 298 aa UNK 01-JAN-1980 '
Warning: bad Genbank LOCUS line 'LOCUS EHOA 298 aa UNK 01-JAN-1980 '
Warning: bad Genbank LOCUS line 'LOCUS MM251 298 aa UNK 01-JAN-1980 '
Warning: bad Genbank LOCUS line 'LOCUS STM 298 aa UNK 01-JAN-1980 '
Warning: bad Genbank LOCUS line 'LOCUS AGM3 298 aa UNK 01-JAN-1980 '
Warning: bad Genbank LOCUS line 'LOCUS AGM677 298 aa UNK 01-JAN-1980 '
Warning: bad Genbank LOCUS line 'LOCUS SAB1C 298 aa UNK 01-JAN-1980 '
Warning: bad Genbank LOCUS line 'LOCUS SYK 298 aa UNK 01-JAN-1980 '
16

In any case, seqret 6.0.1 was able to convert this to a FASTA file
of 16 records. However, seqret 6.1.0 fails - only the first record is
extracted:

$ embossversion
Reports the current EMBOSS version number
6.1.0
$ seqret -sequence VIF_mase-pro.gb -sformat genbank -osformat fasta -auto -filter | grep ">" | wc -l
1

If there is something wrong with my LOCUS lines, I can fix them.
Any thoughts? The LOCUS lines are reproduced above in the
EMBOSS 6.0.1 warning messages.
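The record-count check above (grep piped to wc -l on both the input and the converted output) can be sketched in Python as follows. This is only an illustration: the helper name is hypothetical, and the two-record input and the residues in the FASTA lines are toy placeholders, not the contents of VIF_mase-pro.gb.

```python
def count_matching(lines, prefix):
    """Count lines starting with prefix, like: grep "^prefix" | wc -l"""
    return sum(1 for line in lines if line.startswith(prefix))


# Toy stand-ins for the GenPept input and the FASTA output (placeholder data).
genbank_lines = [
    "LOCUS most-likely 298 aa UNK 01-JAN-1980",
    "//",
    "LOCUS U455 298 aa UNK 01-JAN-1980",
    "//",
]
fasta_lines = [">most-likely", "MENR", ">U455", "MENR"]

records_in = count_matching(genbank_lines, "LOCUS")
records_out = count_matching(fasta_lines, ">")

# A mismatch here is the symptom reported for seqret 6.1.0 (16 in, 1 out).
print(records_in, records_out, records_in == records_out)
```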
One possible issue is the inclusion of an arbitrary date
(01-JAN-1980, a common default which shouldn't get confused
with a real date), over something equally arbitrary (like the date
of the conversion), or simply omitting the date (which may be
invalid).

Thanks,

Peter C.
(@Biopython)

From biopython at maubp.freeserve.co.uk Mon Jul 20 17:57:59 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 20 Jul 2009 18:57:59 +0100
Subject: [emboss-dev] EMBOSS format name "fastq-sanger" in Biopython?
Message-ID: <320fb6e00907201057p50d2c4dar9ecc0a2c8a7cf9a5@mail.gmail.com>

Hi all at Biopython (and EMBOSS-dev CC'd),

Now that EMBOSS 6.1.0 is out I've started checking it against
Biopython. As I mentioned on the Biopython mailing list a week
ago, in particular I'd like to make sure we agree on the various
FASTQ variants.

I'm waiting for EMBOSS to update the documentation on their
website, but as I recall from talking to Peter Rice at BOSC/ISMB
2009 and a quick test this afternoon, they are using:

fastq - FASTQ where the qualities are ignored (useful for input?)
fastq-sanger - Standard Sanger style FASTQ using PHRED offset 33
fastq-solexa - Early Solexa/Illumina FASTQ, Solexa scores offset 64
fastq-illumina - Illumina 1.3+ FASTQ using PHRED offset 64

I was expecting "fastq" to be an EMBOSS input only format
given how I had understood this to be interpreted (ignore the
qualities). This makes sense for tasks like FASTQ to FASTA
where the qualities can be ignored.

I was however surprised that using "fastq" as an output format
in EMBOSS seqret gives quality strings of double quote
characters. This ASCII character (34) is outside the range used
in the Solexa and Illumina 1.3+ FASTQ variants. If interpreted
as a Sanger style FASTQ file this means a PHRED quality of one
(meaning about random, a sensible default).

Enough background.
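To make the offsets concrete, here is a small Python sketch of how one quality character decodes under each variant, plus the standard Solexa-to-PHRED mapping, Q_PHRED = 10 * log10(10^(Q_Solexa / 10) + 1). The function names are hypothetical, and rounding to the nearest integer is an assumption of this sketch rather than a statement about what EMBOSS or Biopython do internally.

```python
import math

# ASCII offsets for the three variants, using the format names listed above.
OFFSETS = {"fastq-sanger": 33, "fastq-solexa": 64, "fastq-illumina": 64}


def decode_quality(char, fmt):
    """Return the raw integer score stored in one quality character."""
    return ord(char) - OFFSETS[fmt]


def solexa_to_phred(solexa_score):
    """Map a Solexa score onto the PHRED scale (floating point result)."""
    return 10 * math.log10(10 ** (solexa_score / 10.0) + 1)


def solexa_char_to_sanger_char(char):
    """Re-encode one fastq-solexa character as fastq-sanger (rounding is assumed)."""
    phred = int(round(solexa_to_phred(decode_quality(char, "fastq-solexa"))))
    return chr(phred + OFFSETS["fastq-sanger"])


# The double quote written by the plain "fastq" output is ASCII 34.
print(decode_quality('"', "fastq-sanger"))    # PHRED score under the Sanger variant
print(decode_quality('"', "fastq-illumina"))  # well below zero under offset 64

# A Solexa 'Y' re-encoded for the Sanger variant.
print(solexa_char_to_sanger_char("Y"))
```

The two scales nearly coincide at high scores and diverge at low ones, which is why fastq-solexa and fastq-illumina cannot be treated as the same format despite sharing the offset.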
The reason for this email was that (subject to
confirmation), Biopython's "fastq" matches EMBOSS's
"fastq-sanger", so I'd like to consider adding this as an alias
in Bio.SeqIO. I resisted adding aliases initially, but we now
have "gb" for "genbank" to make working with Entrez a little
easier, so there is a precedent. In this case, it will make some
of the test_Emboss.py code cleaner if I can just use
"fastq-sanger" everywhere and have both Biopython and
EMBOSS understand this.

Peter

From biopython at maubp.freeserve.co.uk Mon Jul 20 21:46:38 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 20 Jul 2009 22:46:38 +0100
Subject: [emboss-dev] EMBOSS seqret FASTQ support
Message-ID: <320fb6e00907201446x215fa100i3fc09dadb110216b@mail.gmail.com>

Hi all,

I've just been having a play with the FASTQ support in seqret from EMBOSS 6.1.0

This first example is included in Biopython's unit tests, and can be
downloaded here:
http://biopython.org/SRC/biopython/Tests/Quality/solexa_example.fastq

This was taken from http://maq.sourceforge.net/fq_all2std.pl where it
is given as an example of a Solexa (or early Illumina) format FASTQ
file encoding Solexa scores with an ASCII offset of 64, and can be
seen by doing:

$ perl fq_all2std.pl example
...
@SLXA-B3_649_FC8437_R1_1_1_610_79
GATGTGCAATACCTTTGTAGAGGAA
+SLXA-B3_649_FC8437_R1_1_1_610_79
YYYYYYYYYYYYYYYYYYWYWYYSU
@SLXA-B3_649_FC8437_R1_1_1_397_389
GGTTTGAGAAAGAGAAATGAGATAA
+SLXA-B3_649_FC8437_R1_1_1_397_389
YYYYYYYYYWYYYYWWYYYWYWYWW
@SLXA-B3_649_FC8437_R1_1_1_850_123
GAGGGTGTTGATCATGATGATGGCG
+SLXA-B3_649_FC8437_R1_1_1_850_123
YYYYYYYYYYYYYWYYWYYSYYYSY
@SLXA-B3_649_FC8437_R1_1_1_362_549
GGAAACAAAGTTTTTCTCAACATAG
+SLXA-B3_649_FC8437_R1_1_1_362_549
YYYYYYYYYYYYYYYYYYWWWWYWY
@SLXA-B3_649_FC8437_R1_1_1_183_714
GTATTATTTAATGGCATACACTCAA
+SLXA-B3_649_FC8437_R1_1_1_183_714
YYYYYYYYYYWYYYYWYWWUWWWQQ

I am pleased to say EMBOSS 6.1.0 will read this and convert it into a
standard FASTA file:

$ seqret -sequence solexa_example.fastq -sformat fastq -osformat fasta -filter
>SLXA-B3_649_FC8437_R1_1_1_610_79
GATGTGCAATACCTTTGTAGAGGAA
>SLXA-B3_649_FC8437_R1_1_1_397_389
GGTTTGAGAAAGAGAAATGAGATAA
>SLXA-B3_649_FC8437_R1_1_1_850_123
GAGGGTGTTGATCATGATGATGGCG
>SLXA-B3_649_FC8437_R1_1_1_362_549
GGAAACAAAGTTTTTCTCAACATAG
>SLXA-B3_649_FC8437_R1_1_1_183_714
GTATTATTTAATGGCATACACTCAA

Or, output as a Sanger style FASTQ file (using PHRED qualities with an
ASCII offset of 33):

$ seqret -sequence solexa_example.fastq -sformat fastq-solexa -osformat fastq-sanger -filter
@SLXA-B3_649_FC8437_R1_1_1_610_79
GATGTGCAATACCTTTGTAGAGGAA
+SLXA-B3_649_FC8437_R1_1_1_610_79
::::::::::::::::::8:8::46
@SLXA-B3_649_FC8437_R1_1_1_397_389
GGTTTGAGAAAGAGAAATGAGATAA
+SLXA-B3_649_FC8437_R1_1_1_397_389
:::::::::8::::88:::8:8:88
@SLXA-B3_649_FC8437_R1_1_1_850_123
GAGGGTGTTGATCATGATGATGGCG
+SLXA-B3_649_FC8437_R1_1_1_850_123
:::::::::::::8::8::4:::4:
@SLXA-B3_649_FC8437_R1_1_1_362_549
GGAAACAAAGTTTTTCTCAACATAG
+SLXA-B3_649_FC8437_R1_1_1_362_549
::::::::::::::::::8888:8:
@SLXA-B3_649_FC8437_R1_1_1_183_714
GTATTATTTAATGGCATACACTCAA
+SLXA-B3_649_FC8437_R1_1_1_183_714
::::::::::8::::8:88688822

Using Biopython, for example as shown on the following cookbook
page, agrees perfectly (except that Biopython omits the
optional repeated title on the plus lines):
http://www.biopython.org/wiki/Reading_from_unix_pipes

This also agrees with the MAQ script - if you ignore its strange bug
where it adds a "!" to the end of each quality string, see:
http://sourceforge.net/mailarchive/forum.php?thread_name=320fb6e00906170708lb2ce4f7qbc5dfa43543189a2%40mail.gmail.com&forum_name=maq-help

So far so good :)

Was there any particular reason why EMBOSS includes the redundant
second title on the plus lines? I can see that doing this makes the
FASTQ files perhaps slightly more likely to work with other parsers,
but imposes quite a size penalty.

Peter C.

From biopython at maubp.freeserve.co.uk Mon Jul 20 22:12:29 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 20 Jul 2009 23:12:29 +0100
Subject: [emboss-dev] EMBOSS seqret FASTQ support
In-Reply-To: <320fb6e00907201446x215fa100i3fc09dadb110216b@mail.gmail.com>
References: <320fb6e00907201446x215fa100i3fc09dadb110216b@mail.gmail.com>
Message-ID: <320fb6e00907201512r47d26224p64d42dc5ebefb11e@mail.gmail.com>

Earlier I wrote:
> Hi all,
>
> I've just been having a play with the FASTQ support in seqret from EMBOSS 6.1.0
> ...
> So far so good :)

Could anyone spot a "but" coming up?

Well, here we are - consider the following single Sanger format
FASTQ record (originally from the NCBI SRA, I think SRA000271,
but I would have to double check that).

@071113_EAS56_0053:1:1:182:712
ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG
+071113_EAS56_0053:1:1:182:712
@IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+

I would guess the problem is that quality line starts with a @,
meaning care is needed. Likewise of course, quality lines can
start with a + character too (although in my quick testing
EMBOSS seems happy with these). The ASCII code for @ is 64,
meaning for a Sanger style file this is a PHRED quality of
64-33 = 31.
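One way to avoid being fooled by a leading @ on the quality line is to rely on record layout instead of marker characters. The Python sketch below is a simplified illustration, not how EMBOSS or Biopython parse FASTQ: the function names are hypothetical, and it assumes every record occupies exactly four lines, so wrapped sequence or quality lines are not handled.

```python
def iter_fastq(lines):
    """Yield (title, sequence, quality) from unwrapped four-line FASTQ records.

    The quality line is located by position and checked against the sequence
    length, so its first character ('@', '+' or anything else) is never
    inspected as a record marker.
    """
    cleaned = [line.rstrip("\n") for line in lines if line.strip()]
    for start in range(0, len(cleaned), 4):
        title, seq, plus, qual = cleaned[start:start + 4]
        if not title.startswith("@"):
            raise ValueError("expected '@' title line, got %r" % title)
        if not plus.startswith("+"):
            raise ValueError("expected '+' line, got %r" % plus)
        if len(qual) != len(seq):
            raise ValueError("quality and sequence lengths differ")
        yield title[1:], seq, qual


def sanger_phred(quality):
    """Decode a Sanger style quality string (ASCII offset 33) to integers."""
    return [ord(char) - 33 for char in quality]


example = [
    "@071113_EAS56_0053:1:1:182:712",
    "ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG",
    "+071113_EAS56_0053:1:1:182:712",
    "@IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+",
]

for title, seq, qual in iter_fastq(example):
    print(title)
    print(seq)
    print(" ".join(str(score) for score in sanger_phred(qual)))
```

Checking the quality length against the sequence length is what makes the leading @ harmless here; a parser that also has to cope with wrapped lines needs to keep reading quality lines until it has collected as many characters as the sequence holds.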
Here is what Biopython gives for the FASTA conversion:

>071113_EAS56_0053:1:1:182:712
ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG

And this is what Biopython gives for the QUAL conversion, showing
the PHRED scores as integers:

>071113_EAS56_0053:1:1:182:712
31 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 34 35 40 40 40 40 40 27 4 27 21 5 12 9 8 13 7 9 4 10

Anyway, EMBOSS doesn't seem to like this example FASTQ record:

$ seqret -sequence tricky_one.fastq -sformat fastq -osformat fasta -filter
Error: Unable to read sequence 'tricky_one.fastq'
Died: seqret terminated: Bad value for '-sequence' with -auto defined

This read is actually one of four records in the following Biopython
test file, in which EMBOSS only seems to find the first record:
http://biopython.org/SRC/biopython/Tests/Quality/tricky.fastq

As described here, this is a hand modified version of a real NCBI
FASTQ file to showcase several potential gotchas in parsing FASTQ
(including some unlikely to occur in real life - unless someone were
to concatenate FASTQ files from separate sources or something):
http://www.biopython.org/DIST/docs/api/Bio.SeqIO.QualityIO-module.html#FastqGeneralIterator

In fact, looking at that again now, maybe I should include another
record where the sequence line starts with a "+" as well... maybe
even a record with the quality split over multiple lines some starting
with @ and some with +. That would be an even better evil test ;)

Regards,

Peter C.

From pmr at ebi.ac.uk Tue Jul 21 07:43:40 2009
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 21 Jul 2009 08:43:40 +0100
Subject: [emboss-dev] EMBOSS seqret FASTQ support
In-Reply-To: <320fb6e00907201512r47d26224p64d42dc5ebefb11e@mail.gmail.com>
References: <320fb6e00907201446x215fa100i3fc09dadb110216b@mail.gmail.com> <320fb6e00907201512r47d26224p64d42dc5ebefb11e@mail.gmail.com>
Message-ID: <4A6571AC.5090801@ebi.ac.uk>

Peter C. wrote:
> Could anyone spot a "but" coming up?
> > Well, here we are - consider the following single Sanger format > FASTQ record (originally from the NCBI SRA, I think SRA000271, > but I would have to double check that). > > @071113_EAS56_0053:1:1:182:712 > ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG > +071113_EAS56_0053:1:1:182:712 > @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+ > > I would guess the problem is that quality line starts with a @, Urghh ... I left an extra '@' test in even though I meant to take it out before the release. I will make a patch for this ... have to look into a couple of your other queries at the same time as they are in the same source file. Thanks Peter From biopython at maubp.freeserve.co.uk Tue Jul 21 09:44:57 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Jul 2009 10:44:57 +0100 Subject: [emboss-dev] Regression in GenBank/GenPept parsing? In-Reply-To: <320fb6e00907200956n7af9912aie833e974d4fd52a5@mail.gmail.com> References: <320fb6e00907200956n7af9912aie833e974d4fd52a5@mail.gmail.com> Message-ID: <320fb6e00907210244g48da17a2nbf7309eae0bd1356@mail.gmail.com> I wrote: > ... > I've been going over the Biopython unit tests with EMBOSS 6.1.0, > and have found a regression compared to EMBOSS 6.0.1. This is > to do with how EMBOSS parses a minimal GenBank file written > with Biopython. > > The file in question is a 10kb GenBank (well, a GenPept file as > it holds protein sequences) ... As requested (off list), I have sent the GenBank file to Peter Rice to look at. Peter C. From biopython at maubp.freeserve.co.uk Tue Jul 21 10:21:34 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Jul 2009 11:21:34 +0100 Subject: [emboss-dev] Regression in GenBank/GenPept parsing? 
In-Reply-To: <4A6591F6.20107@ebi.ac.uk> References: <320fb6e00907200956n7af9912aie833e974d4fd52a5@mail.gmail.com> <4A6579BC.5080008@ebi.ac.uk> <320fb6e00907210207u2a2f03b5x6a69a8bcbb93bf14@mail.gmail.com> <4A658E4E.4090008@ebi.ac.uk> <320fb6e00907210255u59518deau1487ab564ab014d@mail.gmail.com> <4A6591F6.20107@ebi.ac.uk> Message-ID: <320fb6e00907210321g46655a1h6f797009f9334cf@mail.gmail.com> On Tue, Jul 21, 2009 at 11:01 AM, Peter Rice wrote: > > Peter C. wrote: >> I guess "refseqp" means refseq protein? Another name for GenPept? > > Not quite ... because genpept has yet another variation of GenBank format. > > refseqp is the protein part of refseq. > >> Is "refseqp" a public EMBOSS format name, or something internal? I've >> never noticed it in the documentation, e.g. >> http://emboss.sourceforge.net/docs/themes/SequenceFormats.html#in > > We're in the process of updating that. Somewhere in among writing the > books and creating the wiki the old website got left behind. > > My next task (once I've made sure your bugs are fixed) is to regenerate > all the tables of formats. Great. This may save you having to answer my next question, which was could you expand on what EMBOSS considers to be the differences between "genbank", "genpept" and "refseqp" as file formats? Of course, I may come up with further questions ;) >> Biopython treats "genbank" format as meaning either a GenBank file >> (with nucleotides) or a GenPept file (with amino acids). We detect this >> based on the LOCUS line containing "bp" or "aa". > > So do we ... but we need two versions of the 'aa' LOCUS lines. We try to > pick up the rest of the details for reuse in output. Why do you need two versions of the 'aa' LOCUS line? Is this the "genpept" format versus "refseqp" issue alluded to earlier? >> [Do you want to forward this back to the mailing list?] > > Will do. > > Peter I've CC'd this reply to the list. 
Peter From pmr at ebi.ac.uk Tue Jul 21 10:40:39 2009 From: pmr at ebi.ac.uk (Peter Rice) Date: Tue, 21 Jul 2009 11:40:39 +0100 Subject: [emboss-dev] Regression in GenBank/GenPept parsing? In-Reply-To: <320fb6e00907210321g46655a1h6f797009f9334cf@mail.gmail.com> References: <320fb6e00907200956n7af9912aie833e974d4fd52a5@mail.gmail.com> <4A6579BC.5080008@ebi.ac.uk> <320fb6e00907210207u2a2f03b5x6a69a8bcbb93bf14@mail.gmail.com> <4A658E4E.4090008@ebi.ac.uk> <320fb6e00907210255u59518deau1487ab564ab014d@mail.gmail.com> <4A6591F6.20107@ebi.ac.uk> <320fb6e00907210321g46655a1h6f797009f9334cf@mail.gmail.com> Message-ID: <4A659B27.4010902@ebi.ac.uk> Peter C. wrote: >> My next task (once I've made sure your bugs are fixed) is to regenerate >> all the tables of formats. > > Great. This may save you having to answer my next question, > which was could you expand on what EMBOSS considers to be > the differences between "genbank", "genpept" and "refseqp" as > file formats? Of course, I may come up with further questions ;) Oh, further questions please! We love answering them. GenPept format expects to find 9 fields on the LOCUS line. RefseqP format expects only 8. The difference is GenPept format including the original GenPept locus name. We may try to merge them one day. If we do, we would keep the format names but use one parser. Your Genpept (refseqp) format problem will be fixed in a patch. It was fine for one sequence but needed to rebuffer the input file to work with multiple input sequences. Meanwhile, could you tar up the biopython test data and scripts http://biopython.open-bio.org/SRC/biopython/Tests/ and I will try running the same data through EMBOSS to see what issues we can find. 
regards, Peter From biopython at maubp.freeserve.co.uk Tue Jul 21 10:52:19 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Jul 2009 11:52:19 +0100 Subject: [emboss-dev] EMBOSS and its FASTA like alignment output Message-ID: <320fb6e00907210352l76503d38n37e4dc4fc0f4cc33@mail.gmail.com> Hi, One of the many things I talked to Peter Rice about in Sweden was the Pearson FASTA like output from needle and water (e.g. what EMBOSS calls the markx10 output format), and why it includes the EMBOSS header and footer lines (which start with a # character), which are not present in real FASTA output. Biopython can parse the pairwise -m 10 output from Bill Pearson's FASTA tools, so in theory we (Biopython) should be able to parse the markx10 output from EMBOSS needle and water. We could probably cope with the extra header and footer, but I think it would be best if EMBOSS could produce something more closely matching the real FASTA output. Unfortunately, it appears to be more than just the headers which upset our parser - even ignoring them, EMBOSS markx10 output still looks rather different to (current) FASTA -m 10 output. Was the markx10 output mimicking a particular (old) version of the FASTA tools? ------------------------------------------------------------------ Peter R. did say it would be simple to turn off this header and footer output, so I thought I would try this myself. It looks like this is handled in file ajax/ajalign.c by function alignWriteMark, but I don't see a switch to disable the headers and footers. From looking at other writers, to disable the header, I think I just need to replace this line in alignWriteMark:

    alignWriteHeaderNum(thys,iali);

with:

    /* turn off printing of the header, keep the calculation */
    thys->File = NULL;
    alignWriteHeaderNum(thys,iali);
    thys->File = outf;

I have worked out the footer gets printed by ajAlignWriteTail, but am unclear on where this is called by alignWriteMark.
The only place that seems to call it is ajAlignClose, and this calls ajAlignWriteTail unconditionally. Regards, Peter C. (@Biopython) From biopython at maubp.freeserve.co.uk Tue Jul 21 11:32:59 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Jul 2009 12:32:59 +0100 Subject: [emboss-dev] EMBOSS format name "fastq-sanger" in Biopython? In-Reply-To: <320fb6e00907201057p50d2c4dar9ecc0a2c8a7cf9a5@mail.gmail.com> References: <320fb6e00907201057p50d2c4dar9ecc0a2c8a7cf9a5@mail.gmail.com> Message-ID: <320fb6e00907210432h26da39b2ka24ceb1194a1be1a@mail.gmail.com> On Mon, Jul 20, 2009 at 6:57 PM, Peter wrote: > I was expecting "fastq" to be an EMBOSS input only format given > how I had understood this to be interpreted (ignore the qualities). This > makes sense for tasks like FASTQ to FASTQ where the qualities can > be ignored. I meant of course, for FASTQ to FASTA conversion the qualities (and how they are encoded, Sanger versus Solexa versus Illumina 1.3+) can be ignored. Peter From pmr at ebi.ac.uk Tue Jul 21 12:06:43 2009 From: pmr at ebi.ac.uk (Peter Rice) Date: Tue, 21 Jul 2009 13:06:43 +0100 Subject: [emboss-dev] EMBOSS and its FASTA like alignment output In-Reply-To: <320fb6e00907210352l76503d38n37e4dc4fc0f4cc33@mail.gmail.com> References: <320fb6e00907210352l76503d38n37e4dc4fc0f4cc33@mail.gmail.com> Message-ID: <4A65AF53.5090105@ebi.ac.uk> Peter wrote: > Hi, > > One of the many things I talked to Peter Rice about in Sweden > was the Pearson FASTA like output from needle and water (e.g. > what EMBOSS calls the markx10 output format), and why it > includes the EMBOSS header and footer lines (which start with > a # character), which are not present in real FASTA output. > > Biopython can parse the pairwise -m 10 output from Bill > Pearson's FASTA tools, so in theory we (Biopython) should > be able to parse the markx10 output from EMBOSS needle > and water. 
We could probably cope with the extra header > and footer, but I think it would be best if EMBOSS could > produce something more closely matching the real FASTA > output. Unfortunately, it appears to be more than just the > headers which upset our parser - even ignoring them, > EMBOSS markx10 output still looks rather different to > (current) FASTA -m 10 output. Was the markx10 output > mimicking a particular (old) version of the FASTA tools? The source code documentation refers to FASTA 3.4 which may be the last time I took a detailed look at the FASTA alignment outputs. Can you send us some example files so we can check for the significant differences? We plan to install all the bio* projects so it would be helpful to have a set of biopython parser scripts we can use to test locally. We can add them to our routine QA tests and flag up changes as soon as they appear. > Peter R. did say it would be simple to turn off this header and > footer output, so I thought I would try this myself. It looks like > this is handled in file ajax/ajalign.c by function alignWriteMark, > but I don't see a switch to disable the headers and footers. You correctly found how to turn off the header. The footer is reported for anything except pure sequence output. For the next release I will add attributes to the list of alignment formats to say whether the header and footer are needed. That will allow us better control and reporting. Meanwhile, we are very happy to standardise the markx* outputs to make them easier to parse. Biopython is the first project to report problems with this. There are alternatives - specifying -aformat and using some other alignment format for all applications - but we like to conform and will do our best to fit what parsers expect. Also, of course, once we know we are being parsed we will do our best not to let the output change.
regards, Peter Rice From biopython at maubp.freeserve.co.uk Tue Jul 21 13:05:35 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Jul 2009 14:05:35 +0100 Subject: [emboss-dev] EMBOSS and its FASTA like alignment output In-Reply-To: <4A65AF53.5090105@ebi.ac.uk> References: <320fb6e00907210352l76503d38n37e4dc4fc0f4cc33@mail.gmail.com> <4A65AF53.5090105@ebi.ac.uk> Message-ID: <320fb6e00907210605v7415b1b6id043af520c1bb8de@mail.gmail.com> Hi all, I've CC'd the Biopython-dev mailing list as this EMBOSS thread is becoming cross project. On Tue, Jul 21, 2009 at 1:06 PM, Peter Rice wrote: > > Peter wrote: >> Hi, >> >> One of the many things I talked to Peter Rice about in Sweden >> was the Pearson FASTA like output from needle and water (e.g. >> what EMBOSS calls the markx10 output format), and why it >> includes the EMBOSS header and footer lines (which start with >> a # character), which are not present in real FASTA output. >> >> Biopython can parse the pairwise -m 10 output from Bill >> Pearson's FASTA tools, so in theory we (Biopython) should >> be able to parse the markx10 output from EMBOSS needle >> and water. We could probably cope with the extra header >> and footer, but I think it would be best if EMBOSS could >> produce something more closely matching the real FASTA >> output. Unfortunately, it appears to be more than just the >> headers which upset our parser - even ignoring them, >> EMBOSS markx10 output still looks rather different to >> (current) FASTA -m 10 output. Was the markx10 output >> mimicking a particular (old) version of the FASTA tools? > > The source code documentation refers to FASTA 3.4 which > may be the last time I took a detailed look at the FASTA > alignment outputs. That might explain it - I've been using FASTA 3.5. > Can you send us some example files so we can check for > the significant differences? Sure. 
There are half a dozen FASTA -m 10 output files here: http://biopython.open-bio.org/SRC/biopython/Tests/Fasta/ > We plan to install all the bio* projects so it would be helpful > to have a set of biopython parser scripts we can use to test > locally. We can add them to our routine QA tests and flag up > changes as soon as they appear. If you have (the latest) Biopython installed, and periodically run the unit tests (in particular, test_Emboss.py), that would be a good start. Right now I know that unit test works with EMBOSS 4.0.0 and 6.0.1 (which happens to be on two of the machines I use for testing), and mostly works with EMBOSS 6.1.0 (everything except the GenBank regression you were just looking into today). I'm considering extending test_Emboss.py in the future to take advantage of the new features in EMBOSS 6.1.0 onwards such as GFF and FASTQ support, or perhaps having a second test script (which will be conditional on the version of EMBOSS installed). >> Peter R. did say it would be simple to turn off this header and >> footer output, so I thought I would try this myself. It looks like >> this is handled in file ajax/ajalign.c by function alignWriteMark, >> but I don't see a switch to disable the headers and footers. > > You correctly found how to turn off the header. The footer is > reported for anything except pure sequence output. > > For the next release I will add attributes to the list of alignment > formats to say whether the header and footer are needed. That > will allow us better control and reporting. > > Meanwhile, we are very happy to standardise the markx* outputs > to make them easier to parse. Biopython is the first project to > report problems with this. There are alternatives - specifying > -aformat and using some other alignment format for all > applications - but we like to conform and will do our best to fit > what parsers expect. > > Also, of course, once we know we are being parsed we will do > our best not to let the output change.
This isn't really a problem. Biopython can read EMBOSS's own alignment formats (pairs and simple), so there is little need for us to be able to parse EMBOSS's version of the FASTA output. [Although at the moment we ignore all the header information, if that formatting will be consistent, we could parse it too.] However, at least one person wanted to parse EMBOSS markx10 output strongly enough that he wrote a modified version of our FASTA -m 10 parser. I would rather however have EMBOSS revise its output to better match FASTA. See http://bugzilla.open-bio.org/show_bug.cgi?id=2704 Peter C. From biopython at maubp.freeserve.co.uk Tue Jul 21 13:18:01 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Jul 2009 14:18:01 +0100 Subject: [emboss-dev] Regression in GenBank/GenPept parsing? In-Reply-To: <4A659B27.4010902@ebi.ac.uk> References: <320fb6e00907200956n7af9912aie833e974d4fd52a5@mail.gmail.com> <4A6579BC.5080008@ebi.ac.uk> <320fb6e00907210207u2a2f03b5x6a69a8bcbb93bf14@mail.gmail.com> <4A658E4E.4090008@ebi.ac.uk> <320fb6e00907210255u59518deau1487ab564ab014d@mail.gmail.com> <4A6591F6.20107@ebi.ac.uk> <320fb6e00907210321g46655a1h6f797009f9334cf@mail.gmail.com> <4A659B27.4010902@ebi.ac.uk> Message-ID: <320fb6e00907210618x7e258c3ekd9da44f5e82cea42@mail.gmail.com> On Tue, Jul 21, 2009 at 11:40 AM, Peter Rice wrote: > > Peter C. wrote: >>> My next task (once I've made sure your bugs are fixed) is to >>> regenerate all the tables of formats. >> >> Great. This may save you having to answer my next question, >> which was could you expand on what EMBOSS considers to be >> the differences between "genbank", "genpept" and "refseqp" as >> file formats? Of course, I may come up with further questions ;) > > Oh, further questions please! We love answering them. > > GenPept format expects to find 9 fields on the LOCUS line. > RefseqP format expects only 8. > > The difference is GenPept format including the original GenPept locus name. Which 8 or 9 fields? 
> We may try to merge them one day. If we do, we would keep the format > names but use one parser. That makes sense. > Your Genpept (refseqp) format problem will be fixed in a patch. It was > fine for one sequence but needed to rebuffer the input file to work with > multiple input sequences. Grand. Will there be an EMBOSS 6.1.1 in a week or so then (addressing this, the FASTQ @ problem, and any other minor issues)? > Meanwhile, could you tar up the biopython test data and scripts > http://biopython.open-bio.org/SRC/biopython/Tests/ and I will try > running the same data through EMBOSS to see what issues we > can find. http://biopython.open-bio.org/SRC/biopython/ is just a dump from our repository (hourly or something). If you just download the latest Biopython source code, this will have all the unit test files etc: http://biopython.org/DIST/biopython-1.51b.tar.gz You could also grab the latest code from CVS or github - further details on request. Ask if you need clarification on what any of the test data files are for. In some cases searching the Tests/test_*.py files may have informative comments. Peter C. From biopython at maubp.freeserve.co.uk Tue Jul 21 13:21:46 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Jul 2009 14:21:46 +0100 Subject: [emboss-dev] Regression in GenBank/GenPept parsing? In-Reply-To: <320fb6e00907210618x7e258c3ekd9da44f5e82cea42@mail.gmail.com> References: <320fb6e00907200956n7af9912aie833e974d4fd52a5@mail.gmail.com> <4A6579BC.5080008@ebi.ac.uk> <320fb6e00907210207u2a2f03b5x6a69a8bcbb93bf14@mail.gmail.com> <4A658E4E.4090008@ebi.ac.uk> <320fb6e00907210255u59518deau1487ab564ab014d@mail.gmail.com> <4A6591F6.20107@ebi.ac.uk> <320fb6e00907210321g46655a1h6f797009f9334cf@mail.gmail.com> <4A659B27.4010902@ebi.ac.uk> <320fb6e00907210618x7e258c3ekd9da44f5e82cea42@mail.gmail.com> Message-ID: <320fb6e00907210621s1952ad5fm3f62549f7376d292@mail.gmail.com> Peter C. wrote: > Peter Rice wrote: >> >> Peter C. wrote: >>> >>> Great. 
This may save you having to answer my next question, >>> which was could you expand on what EMBOSS considers to be >>> the differences between "genbank", "genpept" and "refseqp" as >>> file formats? Of course, I may come up with further questions ;) >> >> Oh, further questions please! We love answering them. >> >> GenPept format expects to find 9 fields on the LOCUS line. >> RefseqP format expects only 8. >> >> The difference is GenPept format including the original GenPept locus name. > > Which 8 or 9 fields? Oh, and a related question: Can I adjust the GenPept file in question (emailed to Peter Rice off list) to get rid of the warning from EMBOSS 6.0.1 about the bad LOCUS line? If there is something wrong with the GenBank/GenPept LOCUS lines Biopython writes, I'd like to fix it before our next release. Peter C. From pmr at ebi.ac.uk Tue Jul 21 13:30:17 2009 From: pmr at ebi.ac.uk (Peter Rice) Date: Tue, 21 Jul 2009 14:30:17 +0100 Subject: [emboss-dev] Regression in GenBank/GenPept parsing? In-Reply-To: <320fb6e00907210618x7e258c3ekd9da44f5e82cea42@mail.gmail.com> References: <320fb6e00907200956n7af9912aie833e974d4fd52a5@mail.gmail.com> <4A6579BC.5080008@ebi.ac.uk> <320fb6e00907210207u2a2f03b5x6a69a8bcbb93bf14@mail.gmail.com> <4A658E4E.4090008@ebi.ac.uk> <320fb6e00907210255u59518deau1487ab564ab014d@mail.gmail.com> <4A6591F6.20107@ebi.ac.uk> <320fb6e00907210321g46655a1h6f797009f9334cf@mail.gmail.com> <4A659B27.4010902@ebi.ac.uk> <320fb6e00907210618x7e258c3ekd9da44f5e82cea42@mail.gmail.com> Message-ID: <4A65C2E9.1000203@ebi.ac.uk> Peter wrote: > On Tue, Jul 21, 2009 at 11:40 AM, Peter Rice wrote: >> GenPept format expects to find 9 fields on the LOCUS line. >> RefseqP format expects only 8. >> >> The difference is GenPept format including the original GenPept locus name. > > Which 8 or 9 fields? 
'LOCUS' identifier Genbank-locus-name (GenPept format only) seqlen (numeric) 'aa' molecule-type (controlled vocabulary - we ignore the protein ones for now) 'circular' or 'linear' division (expecting 'UNC' for unclassified) date (last modified date) > Grand. Will there be an EMBOSS 6.1.1 in a week or so then (addressing > this, the FASTQ @ problem, and any other minor issues)? There will be a patch file in the ftp://emboss.open-bio.org/pub/EMBOSS/patches/ directory For those (like me) who prefer to manually update there will also be replacement file(s) in the fixes directory. > http://biopython.open-bio.org/SRC/biopython/ is just a dump from > our repository (hourly or something). If you just download the latest > Biopython source code, this will have all the unit test files etc: > http://biopython.org/DIST/biopython-1.51b.tar.gz Super, thanks. > Ask if you need clarification on what any of the test data files are > for. In some cases searching the Tests/test_*.py files may have > informative comments. Thanks. The plan is to include them in the EMBOSS QA tests so I will take a look at the inputs and what you check for in the outputs. At first glance it looks straightforward. regards, Peter From pmr at ebi.ac.uk Tue Jul 21 13:35:26 2009 From: pmr at ebi.ac.uk (Peter Rice) Date: Tue, 21 Jul 2009 14:35:26 +0100 Subject: [emboss-dev] Regression in GenBank/GenPept parsing? 
In-Reply-To: <320fb6e00907210621s1952ad5fm3f62549f7376d292@mail.gmail.com> References: <320fb6e00907200956n7af9912aie833e974d4fd52a5@mail.gmail.com> <4A6579BC.5080008@ebi.ac.uk> <320fb6e00907210207u2a2f03b5x6a69a8bcbb93bf14@mail.gmail.com> <4A658E4E.4090008@ebi.ac.uk> <320fb6e00907210255u59518deau1487ab564ab014d@mail.gmail.com> <4A6591F6.20107@ebi.ac.uk> <320fb6e00907210321g46655a1h6f797009f9334cf@mail.gmail.com> <4A659B27.4010902@ebi.ac.uk> <320fb6e00907210618x7e258c3ekd9da44f5e82cea42@mail.gmail.com> <320fb6e00907210621s1952ad5fm3f62549f7376d292@mail.gmail.com> Message-ID: <4A65C41E.20505@ebi.ac.uk> Peter C. wrote: > Oh, and a related question: Can I adjust the GenPept file in question > (emailed to Peter Rice off list) to get rid of the warning from > EMBOSS 6.0.1 about the bad LOCUS line? If there is something > wrong with the GenBank/GenPept LOCUS lines Biopython writes, > I'd like to fix it before our next release. For EMBOSS 6.1.0 it should use -sformat refseqp (but will run without warning after the patch). For 6.0.1 all you can do is lie and change aa to bp. We added the protein formats refseqp and genpept in release 6.1.0. Previous releases warn about the 'aa' tag and continue. You could run with -nowarning on the command line but we don't recommend it :-) regards, Peter Rice From biopython at maubp.freeserve.co.uk Tue Jul 21 17:10:17 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Jul 2009 18:10:17 +0100 Subject: [emboss-dev] EMBOSS seqret FASTQ support In-Reply-To: <4A6571AC.5090801@ebi.ac.uk> References: <320fb6e00907201446x215fa100i3fc09dadb110216b@mail.gmail.com> <320fb6e00907201512r47d26224p64d42dc5ebefb11e@mail.gmail.com> <4A6571AC.5090801@ebi.ac.uk> Message-ID: <320fb6e00907211010v43e0a579je8e6d43112c6ba20@mail.gmail.com> On Tue, Jul 21, 2009 at 8:43 AM, Peter Rice wrote: > > Peter C. wrote: > >> Could anyone spot a "but" coming up? >> ... >> I would guess the problem is that quality line starts with a @, > > Urghh ... 
I left an extra '@' test in even though I meant to take it out > before the release. > > I will make a patch for this ... have to look into a couple of your other > queries at the same time as they are in the same source file. > > Thanks I've got another issue for you, which I think is a rounding problem converting negative Solexa scores into ASCII (which sounds a bit strange), or assuming you store everything as PHRED scores in memory, this could be in how you round negative Solexa scores on conversion back to ASCII. This can be neatly demonstrated with the following artificial FASTQ file which uses the Solexa encoding covering scores 40 to -5 inclusive (which I understand to be the typical range likely to come off an actual Solexa/Illumina machine):

$ more solexa_faked.fastq
@slxa_0001_1_0001_01
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
+slxa_0001_1_0001_01
hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;

$ seqret -sequence solexa_faked.fastq -sformat fastq-solexa -osformat fastq-solexa -stdout -auto
@slxa_0001_1_0001_01
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
+slxa_0001_1_0001_01
hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@@?>=<

$ embossversion
Reports the current EMBOSS version number
6.1.0

As I hope is clear, EMBOSS seqret has inflated the last five scores by one. The original Solexa scores were: 40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5 After putting this file through seqret, they become: 40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 0, -1, -2, -3, -4 Peter C. From biopython at maubp.freeserve.co.uk Tue Jul 21 17:19:58 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Jul 2009 18:19:58 +0100 Subject: [emboss-dev] Regression in GenBank/GenPept parsing?
In-Reply-To: <4A65C2E9.1000203@ebi.ac.uk> References: <320fb6e00907200956n7af9912aie833e974d4fd52a5@mail.gmail.com> <4A6579BC.5080008@ebi.ac.uk> <320fb6e00907210207u2a2f03b5x6a69a8bcbb93bf14@mail.gmail.com> <4A658E4E.4090008@ebi.ac.uk> <320fb6e00907210255u59518deau1487ab564ab014d@mail.gmail.com> <4A6591F6.20107@ebi.ac.uk> <320fb6e00907210321g46655a1h6f797009f9334cf@mail.gmail.com> <4A659B27.4010902@ebi.ac.uk> <320fb6e00907210618x7e258c3ekd9da44f5e82cea42@mail.gmail.com> <4A65C2E9.1000203@ebi.ac.uk> Message-ID: <320fb6e00907211019r447fca87i87c9143223c6cf8e@mail.gmail.com> On Tue, Jul 21, 2009 at 2:30 PM, Peter Rice wrote: > > Peter wrote: >> On Tue, Jul 21, 2009 at 11:40 AM, Peter Rice wrote: >>> GenPept format expects to find 9 fields on the LOCUS line. >>> RefseqP format expects only 8. >>> >>> The difference is GenPept format including the original GenPept locus name. >> >> Which 8 or 9 fields?
>
> 'LOCUS'
> identifier
> Genbank-locus-name (GenPept format only)
> seqlen (numeric)
> 'aa'
> molecule-type (controlled vocabulary - we ignore the protein ones for now)
> 'circular' or 'linear'
> division (expecting 'UNC' for unclassified)
> date (last modified date)

Do you have some publicly available examples of these? And if so, are you happy for them to be included within Biopython for unit tests? >> Grand. Will there be an EMBOSS 6.1.1 in a week or so then (addressing >> this, the FASTQ @ problem, and any other minor issues)? > > There will be a patch file in the > ftp://emboss.open-bio.org/pub/EMBOSS/patches/ directory > > For those (like me) who prefer to manually update there will also be > replacement file(s) in the fixes directory. Would there eventually be an EMBOSS 6.1.1 release for the less technical users who won't want to mess about with patches or replacing single files? I hope we don't have to wait 40 days! ;) [This is a joke referencing St Swithin's day and associated legends] Peter C.
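The field list quoted above can be turned into a rough check of which layout a protein LOCUS line follows. This is only a sketch of the description given in this thread (8 whitespace-separated fields for refseqp, 9 for genpept), not EMBOSS's actual parser. The function name classify_locus, the label "topology", and the sample lines (ID0001, LOCNAME1, PROTEIN) are hypothetical placeholders, not taken from real records.

```python
# Field names follow the list in the thread; "topology" is a label invented
# here for the 'circular' or 'linear' slot.
REFSEQP_FIELDS = ["LOCUS", "identifier", "seqlen", "aa",
                  "molecule-type", "topology", "division", "date"]
# Per the thread, GenPept adds the original locus name after the identifier.
GENPEPT_FIELDS = (REFSEQP_FIELDS[:2] + ["genbank-locus-name"]
                  + REFSEQP_FIELDS[2:])


def classify_locus(line):
    """Guess the layout of a protein LOCUS line from its field count.

    Returns (layout, fields) where layout is "genpept", "refseqp" or
    "unknown". Hypothetical helper for illustration only.
    """
    tokens = line.split()
    if not tokens or tokens[0] != "LOCUS":
        raise ValueError("not a LOCUS line")
    if len(tokens) == len(GENPEPT_FIELDS):
        return "genpept", dict(zip(GENPEPT_FIELDS, tokens))
    if len(tokens) == len(REFSEQP_FIELDS):
        return "refseqp", dict(zip(REFSEQP_FIELDS, tokens))
    return "unknown", {}


# Made-up sample lines, built only to match the field counts described.
eight = "LOCUS ID0001 120 aa PROTEIN linear UNC 21-JUL-2009"
nine = "LOCUS ID0001 LOCNAME1 120 aa PROTEIN linear UNC 21-JUL-2009"
print(classify_locus(eight)[0])
print(classify_locus(nine)[0])
```

A line with any other field count comes back as "unknown", which roughly corresponds to the case where a stricter parser would warn about a bad LOCUS line.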
From biopython at maubp.freeserve.co.uk Wed Jul 22 11:56:23 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 22 Jul 2009 12:56:23 +0100 Subject: [emboss-dev] Line wrapping in FASTQ output Message-ID: <320fb6e00907220456s7aadff64v9c65a49dc0bc5ea4@mail.gmail.com> Hi Peter R. et al, Up until now I had mostly been trying EMBOSS 6.1.0 with short read data. I've just noticed for longer reads EMBOSS wraps the sequences and qualities lines in FASTQ output (at 60 characters). There is an example of this at the end of the email. My understanding is that while line breaks are allowed in the sequences and qualities lines of a FASTQ file, they are discouraged as it can break simple minded parsers. Unfortunately right now I can't find any references/websites to back up this assertion (other than things I wrote myself since), but I was sure I read this on the MAQ site somewhere. Several sites do simply talk about "the" sequence line and "the" quality line (indeed the early drafts of the Wikipedia page had this assumption, which I fixed). This is natural if all you have ever worked with is short read data. Of course, 454 reads are hundreds of bases long, and even the latest Illumina reads now are in the range 70 to 100 bp (or so I hear), so this issue will become more common - so any existing parsers that can't cope with line breaks will soon get broken, and hopefully fixed. For Biopython we should be able to cope with any strange line breaks in the sequences and qualities lines on input, but for output don't do any line wrapping. I felt this would result in more widely parseable output. I wondered what your thought process was, and if you think it is worth removing the line wrapping on EMBOSS's FASTQ output (or indeed, if you have a good argument to convince me to make Biopython output FASTQ with line wrapping by default). [I nearly CC'd BioPerl-l with this.
In fact, this topic strikes me as ideal for an OBF cross project mailing list, something we talked about at BOSC/ISMB 2009. Am I right in thinking you (Peter Rice) were going to look into this?] Regards, Peter C. (at Biopython) e.g.

$ embossversion
Reports the current EMBOSS version number
6.1.0

$ more sanger_93.fastq
@Test PHRED qualities from 93 to 0 inclusive
ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGAN
+
~}|{zyxwvutsrqponmlkjihgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;:9876543210/.-,+*)('&%$#"!

It is likely that email software will mangle the line breaks, but in my example file sanger_93.fastq the sequence and the quality are single line strings (of length 94). Now let's let EMBOSS seqret read this in and write it out again:

$ seqret -filter -seq sanger_93.fastq -sformat fastq-sanger -osformat fastq-sanger
@Test PHRED qualities from 93 to 0 inclusive
ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTGACTGACTGACTGAN
+Test
~}|{zyxwvutsrqponmlkjihgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDC
BA@?>=<;:9876543210/.-,+*)('&%$#"!

The new lines are real and not just from the email formatting - you can check this by piping the output through hexdump. It appears EMBOSS is using 60 character line wrapping. Peter C. From pmr at ebi.ac.uk Thu Jul 23 08:08:51 2009 From: pmr at ebi.ac.uk (Peter Rice) Date: Thu, 23 Jul 2009 09:08:51 +0100 Subject: [emboss-dev] Line wrapping in FASTQ output In-Reply-To: <320fb6e00907220456s7aadff64v9c65a49dc0bc5ea4@mail.gmail.com> References: <320fb6e00907220456s7aadff64v9c65a49dc0bc5ea4@mail.gmail.com> Message-ID: <4A681A93.9030303@ebi.ac.uk> Peter C. wrote: > Hi Peter R. et al, > > For Biopython we should be able to cope with any strange line breaks in > the sequences and qualities lines on input, but for output don't do > any line wrapping. I felt this would result in more widely parseable > output.
I wondered what your thought process was, and if you think it > is worth removing the line wrapping on EMBOSS's FASTQ output (or > indeed, if you have a good argument to convince me to make Biopython > output FASTQ with line wrapping by default). There is also an issue with making the lines so long that brain-damaged parsers (those that read a line in C and fail to check it was a complete line) will fail. Leaving the line breaks in was deliberate in EMBOSS 6.1.0 to see whether any parsers would object. The obvious compromise is to increase the default line length in EMBOSS to say 500 so that anyone reading up to 512 characters will still be safe. Unfortunately some folk will then assume there will never be a line break. Alternatively, we could truly make everything fit on one line. Or we could double up the fastq outputs with and without line breaks (horrible problems with naming the output formats) I suspect this one-line thing is a simple attempt to avoid the "quality line starting with '@' or '+'" issue. > [I nearly CC'd BioPerl-l with this. In fact, this topic strikes me as > ideal for an OBF cross project mailing list, something we talked about > at BOSC/ISMB 2009. Am I right in thinking you (Peter Rice) were going > to look into this?] Yes indeed I was. Waylaid by the demands of the 6.1.0 EMBOSS release but I will get back on to it. regards, Peter From biopython at maubp.freeserve.co.uk Thu Jul 23 09:14:52 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Jul 2009 10:14:52 +0100 Subject: [emboss-dev] [Biopython-dev] Line wrapping in FASTQ output In-Reply-To: <4A681A93.9030303@ebi.ac.uk> References: <320fb6e00907220456s7aadff64v9c65a49dc0bc5ea4@mail.gmail.com> <4A681A93.9030303@ebi.ac.uk> Message-ID: <320fb6e00907230214l6df7ff76j643e8ddc1f600054@mail.gmail.com> On Thu, Jul 23, 2009 at 9:08 AM, Peter Rice wrote: > Peter C. wrote: >> >> Hi Peter R.
et al, >> >> For Biopython we should be able to cope with any strange line breaks >> in the sequences and qualities lines on input, but for output don't do >> any line wrapping. I felt this would result in more widely parseable >> output. I wondered what your thought process was, and if you think >> it is worth removing the line wrapping on EMBOSS's FASTQ output >> (or indeed, if you have a good argument to convince me to make >> Biopython output FASTQ with line wrapping by default). > > There is also an issue with making the lines so long that brain-damaged > parsers (those that read a line in C and fail to check it was a complete > line) will fail. You mean a C parser with a finite string buffer (say 100 characters) which reads things line by line. Yes, that would be a bit brain dead too. I guess either way could break some parsers out there ;) > Leaving the line breaks in was deliberate in EMBOSS 6.1.0 to see > whether any parsers would object. I see - well I'm not objecting, and neither is the Biopython parser. > The obvious compromise is to increase the default line length in > EMBOSS to say 500 so that anyone reading up to 512 characters > will still be safe. Unfortunately some folk will then assume there will > never be a line break. That seems like a bad idea - especially as Roche 454 reads are in the region of 500+ bp, meaning some would wrap and some wouldn't. Even using a longer wrap like 1000 would probably just postpone the issue. If you are going to wrap, something short like 60 seems more sensible (often used in FASTA files too) given the historical 80 character width of a terminal window. People using early Solexa/Illumina machines will only see a single line, but as their read lengths are already in the range 70 to 100bp, I wonder what the latest Illumina pipelines output (wrt wrapping)? > Alternatively, we could truly make everything fit on one line. That's what Biopython currently does.
But you are right - I hadn't considered brain dead parsers using fixed buffers. > Or we could double up the fastq outputs with and without line breaks > (horrible problems with naming the ouptut formats) I don't like that plan. For Biopython we could have a wrapping setting available for people who really need to specify this (as we do for FASTA already), with a sensible default value. > I suspect this one-line thing is a simple attempt to avoid the "quality line > starting with '@' or '+'" issue. Could be. I think the fact that @ and + are valid entries in the quality string is the second most annoying thing about the FASTQ format (after the lack of a clear format definition from Sanger, and the resulting variants from Solexa/Illumina etc). >> [I nearly CC'd BioPerl-l with this. In fact, this topic strikes me as >> ideal for an OBF cross project mailing list, something we talked >> about at BOSC/ISMB 2009. Am I right in thinking you (Peter Rice) >> were going to look into this?] > > Yes indeed I was. Waylaid by the demands of the 6.1.0 EMOSS release > but I will get back on to it. Thanks! > regards, > > Peter Cheers, Peter C. From pmr at ebi.ac.uk Thu Jul 23 16:24:01 2009 From: pmr at ebi.ac.uk (Peter Rice) Date: Thu, 23 Jul 2009 17:24:01 +0100 Subject: [emboss-dev] EMBOSS seqret FASTQ support In-Reply-To: <320fb6e00907211010v43e0a579je8e6d43112c6ba20@mail.gmail.com> References: <320fb6e00907201446x215fa100i3fc09dadb110216b@mail.gmail.com> <320fb6e00907201512r47d26224p64d42dc5ebefb11e@mail.gmail.com> <4A6571AC.5090801@ebi.ac.uk> <320fb6e00907211010v43e0a579je8e6d43112c6ba20@mail.gmail.com> Message-ID: <4A688EA1.9060005@ebi.ac.uk> Peter wrote: > I've got another issue for you, which I think is an rounding problem > converting negative Solexa scores into ASCII (which sounds a bit > strange), or assuming you store everything as PHRED scores in > memory, this could be in how you round negative Solexa scores > on conversion back to ASCII. 
Yup, it's the rounding on output. It was adding 0.5 and going to the
nearest integer.

For negative values of course it has to subtract 0.5 to get the correct
rounding.

regards,

Peter

From biopython at maubp.freeserve.co.uk Fri Jul 24 09:59:52 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 24 Jul 2009 10:59:52 +0100
Subject: [emboss-dev] EMBOSS seqret FASTQ support
In-Reply-To: <4A688EA1.9060005@ebi.ac.uk>
References: <320fb6e00907201446x215fa100i3fc09dadb110216b@mail.gmail.com>
 <320fb6e00907201512r47d26224p64d42dc5ebefb11e@mail.gmail.com>
 <4A6571AC.5090801@ebi.ac.uk>
 <320fb6e00907211010v43e0a579je8e6d43112c6ba20@mail.gmail.com>
 <4A688EA1.9060005@ebi.ac.uk>
Message-ID: <320fb6e00907240259v6f57d858nc640afa0b18f2727@mail.gmail.com>

On Thu, Jul 23, 2009 at 5:24 PM, Peter Rice wrote:
>
> Peter C. wrote:
>>
>> I've got another issue for you, which I think is a rounding problem
>> converting negative Solexa scores into ASCII (which sounds a bit
>> strange), or assuming you store everything as PHRED scores in
>> memory, this could be in how you round negative Solexa scores
>> on conversion back to ASCII.
>
> Yup, it's the rounding on output. It was adding 0.5 and going to the
> nearest integer.
>
> For negative values of course it has to subtract 0.5 to get the correct
> rounding.

C can be fun like that - nearest integer versus truncation to lowest
integer.

I'd like to re-test with your fixes. I presume these things are being
fixed in the public CVS repository, so I could try building EMBOSS from
there. Is there a particular branch? Or are you planning an EMBOSS 6.1.1
release shortly?
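The rounding pitfall described above can be sketched in a few lines of Python. This is an illustration only, not the EMBOSS code: the function names are hypothetical, and `int()` stands in for a C float-to-integer conversion, which likewise truncates towards zero.

```python
def round_add_half(x):
    """Add 0.5, then truncate towards zero (goes wrong for negative x)."""
    return int(x + 0.5)


def round_half_away(x):
    """Add 0.5 for non-negative x, subtract 0.5 for negative x, then truncate."""
    return int(x + 0.5) if x >= 0 else int(x - 0.5)


# Compare the two approaches on a positive and two negative values.
for value in (3.7, -3.7, -4.2):
    print(value, round_add_half(value), round_half_away(value))
```

For positive inputs the two functions agree; for negative inputs the first one lands one step too high, which is consistent with the "up by one" symptom reported for negative Solexa scores.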
Thanks,

Peter

From pmr at ebi.ac.uk Fri Jul 24 10:14:23 2009
From: pmr at ebi.ac.uk (Peter Rice)
Date: Fri, 24 Jul 2009 11:14:23 +0100
Subject: [emboss-dev] EMBOSS seqret FASTQ support
In-Reply-To: <320fb6e00907240259v6f57d858nc640afa0b18f2727@mail.gmail.com>
References: <320fb6e00907201446x215fa100i3fc09dadb110216b@mail.gmail.com>
 <320fb6e00907201512r47d26224p64d42dc5ebefb11e@mail.gmail.com>
 <4A6571AC.5090801@ebi.ac.uk>
 <320fb6e00907211010v43e0a579je8e6d43112c6ba20@mail.gmail.com>
 <4A688EA1.9060005@ebi.ac.uk>
 <320fb6e00907240259v6f57d858nc640afa0b18f2727@mail.gmail.com>
Message-ID: <4A69897F.9010301@ebi.ac.uk>

Peter C. wrote:
> I'd like to re-test with your fixes. I presume these things are being fixed
> in the public CVS repository, so I could try building EMBOSS from there.
> Is there a particular branch? Or are you planning an EMBOSS 6.1.1
> release shortly?

You found various things in sequence formats, but all are resolved by
changes to ajseqread.c and ajseqwrite.c

Assuming I am happy with the test I plan to make a patch which will
update those files.

The CVS code would have new things for the next release. For now, if you
are using the 6.1.0 release, patching is the way to go.

Fixes so far:

FASTQ format changes:

* sequence and quality scores on one line
* quality ID line shortened to '+'
* Solexa negative quality score output corrected
* Phred quality score rounding error fixed
* Corrected reading of quality lines starting with '@'

GenBank format changes:

* protein (genpept and refseqp) formats auto-detect fix for multiple
  input sequences

Intelligenetics format:

* Sequence ID corrected for DOS format input file

Did I miss anything?
regards,

Peter Rice

From biopython at maubp.freeserve.co.uk Fri Jul 24 10:32:50 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 24 Jul 2009 11:32:50 +0100
Subject: [emboss-dev] EMBOSS seqret FASTQ support
In-Reply-To: <320fb6e00907240259v6f57d858nc640afa0b18f2727@mail.gmail.com>
References: <320fb6e00907201446x215fa100i3fc09dadb110216b@mail.gmail.com>
 <320fb6e00907201512r47d26224p64d42dc5ebefb11e@mail.gmail.com>
 <4A6571AC.5090801@ebi.ac.uk>
 <320fb6e00907211010v43e0a579je8e6d43112c6ba20@mail.gmail.com>
 <4A688EA1.9060005@ebi.ac.uk>
 <320fb6e00907240259v6f57d858nc640afa0b18f2727@mail.gmail.com>
Message-ID: <320fb6e00907240332g21a01511h1a7ddaf677cad962@mail.gmail.com>

Hi again Peter,

I have another query regarding how EMBOSS treats "fastq" as a format
name.

From our earlier discussions I was expecting "fastq" to be an EMBOSS
input *only* format where you would ignore the qualities. This would
allow tasks like FASTQ to FASTA without having to worry if the scores
were encoded following the Sanger standard, the original Solexa scheme,
or the Illumina 1.3+ encoding.

When I found EMBOSS offered "fastq" as an output format, I initially
thought it might produce files with dummy quality values (even if the
input file had qualities). This puzzled me, as I couldn't see a use for
this, but in fact this isn't the case. Instead, "fastq" as an output
format seems to act like the "fastq-sanger" format.

I notice you use dummy values for the quality if they are unknown,
specifically a PHRED quality of one (meaning about random, a sensible
default in some cases). e.g.
$ more example.fasta >EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC >EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA >EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG Converting "fasta" (with no qualities) to "fastq-sanger", seqret assigns a PHRED quality of 1 (the double quote, ASCII 34): $ seqret -sequence example.fasta -sformat fasta -osformat fastq-sanger -stdout -auto @EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC +EAS54_6_R1_2_1_413_324 """"""""""""""""""""""""" @EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA +EAS54_6_R1_2_1_540_792 """"""""""""""""""""""""" @EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG +EAS54_6_R1_2_1_443_348 """"""""""""""""""""""""" Converting "fasta" (with no qualities) to "fastq" seems to act just like conversion to "fastq-sanger": $ seqret -sequence example.fasta -sformat fasta -osformat fastq -stdout -auto @EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC +EAS54_6_R1_2_1_413_324 """"""""""""""""""""""""" @EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA +EAS54_6_R1_2_1_540_792 """"""""""""""""""""""""" @EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG +EAS54_6_R1_2_1_443_348 """"""""""""""""""""""""" As an aside, FASTA to Illumina FASTQ also uses PHRED quality one (ASCII 64+1 = 65 is the letter A): $ seqret -sequence example.fasta -sformat fasta -osformat fastq-illumina -stdout -auto @EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC +EAS54_6_R1_2_1_413_324 AAAAAAAAAAAAAAAAAAAAAAAAA @EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA +EAS54_6_R1_2_1_540_792 AAAAAAAAAAAAAAAAAAAAAAAAA @EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG +EAS54_6_R1_2_1_443_348 AAAAAAAAAAAAAAAAAAAAAAAAA (Due to the rounding issue I have not included a FASTA to Solexa FASTQ example) Have I understood correctly? i.e. 
in EMBOSS 6.1.0:

"fastq" on input - ignores quality strings
"fastq" on output - acts like "fastq-sanger"
"fastq-sanger" - PHRED scores offset 31
"fastq-solexa" - Solexa scores offset 64
"fastq-illumina" - PHRED scores offset 64

If this is correct, the "fastq" format behaviour strikes me as very odd.
I would have either made "fastq" and "fastq-solexa" the same, or made
"fastq" an input only format. Consider the very surprising behaviour
this results in...

$ more example.fastq
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+
;;;;;;;;;;;7;;;;;-;;;3;83
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+
;;;;;;;;;;;9;7;;.7;393333

You might want to use seqret to "clean up" a FASTQ file, for example to
standardize the line wrapping and the captions. As this example is a
Sanger style FASTQ file, this works:

$ seqret -sequence example.fastq -sformat fastq-sanger -osformat fastq-sanger -stdout -auto
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+EAS54_6_R1_2_1_413_324
;;3;;;;;;;;;;;;7;;;;;;;88
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+EAS54_6_R1_2_1_540_792
;;;;;;;;;;;7;;;;;-;;;3;83
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
;;;;;;;;;;;9;7;;.7;393333

Notice EMBOSS has filled in the (optional) repeated caption on the plus
lines (and would have wrapped long reads).

However, consider the more natural thing to type:

$ seqret -sequence example.fastq -sformat fastq -osformat fastq -stdout -auto
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+EAS54_6_R1_2_1_413_324
"""""""""""""""""""""""""
@EAS54_6_R1_2_1_540_792
TTGGCAGGCCAAGGCCGATGGATCA
+EAS54_6_R1_2_1_540_792
"""""""""""""""""""""""""
@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
"""""""""""""""""""""""""

I was shocked to find using EMBOSS to convert "FASTQ to FASTQ" like
this threw away the quality scores - and I'm sure other people will also
make this mistake.

Peter C.
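As a rough illustration of the encodings listed in the message above, the sketch below decodes and encodes quality strings by ASCII offset. It assumes offset 33 for "fastq-sanger" (the figure given in the correction that follows this message) and 64 for the other two names. The table and function names are hypothetical, and this is not how EMBOSS or Biopython implement it.

```python
# ASCII offsets for the format names discussed in this thread. Note that
# "fastq-solexa" stores Solexa scores, not PHRED scores, at its offset.
OFFSETS = {"fastq-sanger": 33, "fastq-solexa": 64, "fastq-illumina": 64}


def decode_quality(quality, fmt):
    """Turn a FASTQ quality string into a list of integer scores."""
    offset = OFFSETS[fmt]
    return [ord(char) - offset for char in quality]


def encode_quality(scores, fmt):
    """Turn a list of integer scores back into a FASTQ quality string."""
    offset = OFFSETS[fmt]
    return "".join(chr(score + offset) for score in scores)


# First record of example.fastq, read as Sanger style.
print(decode_quality(";;3;;;;;;;;;;;;7;;;;;;;88", "fastq-sanger"))
# A dummy PHRED quality of one in the two PHRED based encodings.
print(encode_quality([1] * 5, "fastq-sanger"))
print(encode_quality([1] * 5, "fastq-illumina"))
```

With these offsets a PHRED quality of one comes out as a double quote in the Sanger encoding and as the letter A in the Illumina encoding, matching the seqret output quoted earlier in the thread.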
From biopython at maubp.freeserve.co.uk Fri Jul 24 10:45:07 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Jul 2009 11:45:07 +0100 Subject: [emboss-dev] EMBOSS seqret FASTQ support In-Reply-To: <320fb6e00907240332g21a01511h1a7ddaf677cad962@mail.gmail.com> References: <320fb6e00907201446x215fa100i3fc09dadb110216b@mail.gmail.com> <320fb6e00907201512r47d26224p64d42dc5ebefb11e@mail.gmail.com> <4A6571AC.5090801@ebi.ac.uk> <320fb6e00907211010v43e0a579je8e6d43112c6ba20@mail.gmail.com> <4A688EA1.9060005@ebi.ac.uk> <320fb6e00907240259v6f57d858nc640afa0b18f2727@mail.gmail.com> <320fb6e00907240332g21a01511h1a7ddaf677cad962@mail.gmail.com> Message-ID: <320fb6e00907240345u158715cbg33c8b71741d7588b@mail.gmail.com> On Fri, Jul 24, 2009 at 11:32 AM, Peter wrote: > > Have I understood correctly? i.e. in EMBOSS 6.1.0: > > "fastq" on input - ignores quality strings > "fastq" on output - acts like "fastq-sanger" > "fastq-sanger" - PHRED scores offset 31 [* TYPO - should be offset 33 *] > "fastq-solexa" - Solexa scores offset 64 > "fastq-illumina" - PHRED scores offset 64 > Correction (just for the record) - the Sanger FASTQ files use an offset of 33 (ASCII for "!"). The number 31 is important as the difference between this and the Illumina ASCII offset. Peter C. From biopython at maubp.freeserve.co.uk Fri Jul 24 10:48:04 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Jul 2009 11:48:04 +0100 Subject: [emboss-dev] EMBOSS format name "fastq-sanger" in Biopython? In-Reply-To: <320fb6e00907201057p50d2c4dar9ecc0a2c8a7cf9a5@mail.gmail.com> References: <320fb6e00907201057p50d2c4dar9ecc0a2c8a7cf9a5@mail.gmail.com> Message-ID: <320fb6e00907240348h6a3bc043kc17fbbd62daf8a06@mail.gmail.com> On Mon, Jul 20, 2009 at 6:57 PM, Peter wrote: > Hi all at Biopython (and EMBOSS-dev CC'd), > > Now that EMBOSS 6.1.0 is out I've started checking it against Biopython. 
> As I mentioned on the Biopython mailing list a week ago, in particular I'd > like to make sure we agree on the various FASTQ variants. I'm waiting > for EMBOSS to update the documentation on their website, but as I > recall from talking to Peter Rice at BOSC/ISMB 2009 and a quick test > this afternoon, they are using: > > fastq - FASTQ where the qualities are ignored (useful for input?) > fastq-sanger - Standard Sanger style FASTQ using PHRED offset 33 > fastq-solexa - Early Solexa/Illumina FASTQ, Solexa scores offset 64 > fastq-illumina - Illumina 1.3+ FASTQ using PHRED offset 64 > > I was expecting "fastq" to be an EMBOSS input only format given > how I had understood this to be interpreted (ignore the qualities). > ... I was however surprised that using "fastq" as an output format > in EMBOSS seqret gives quality strings of double quote characters. To be more precise, it looks like "fastq" as an output format in EMBOSS is an alias for "fastq-sanger" (to be confirmed), see: http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000599.html In any case, it would still make sense to include "fastq-sanger" as an alias for the Sanger standard FASTQ files in Biopython's SeqIO, especially if BioPerl is also going to use that name (to be confirmed): http://lists.open-bio.org/pipermail/bioperl-l/2009-July/030688.html Peter From pmr at ebi.ac.uk Fri Jul 24 11:20:13 2009 From: pmr at ebi.ac.uk (Peter Rice) Date: Fri, 24 Jul 2009 12:20:13 +0100 Subject: [emboss-dev] EMBOSS seqret FASTQ support In-Reply-To: <320fb6e00907240332g21a01511h1a7ddaf677cad962@mail.gmail.com> References: <320fb6e00907201446x215fa100i3fc09dadb110216b@mail.gmail.com> <320fb6e00907201512r47d26224p64d42dc5ebefb11e@mail.gmail.com> <4A6571AC.5090801@ebi.ac.uk> <320fb6e00907211010v43e0a579je8e6d43112c6ba20@mail.gmail.com> <4A688EA1.9060005@ebi.ac.uk> <320fb6e00907240259v6f57d858nc640afa0b18f2727@mail.gmail.com> <320fb6e00907240332g21a01511h1a7ddaf677cad962@mail.gmail.com> Message-ID: 
<4A6998ED.9020607@ebi.ac.uk> Peter C. wrote: > I was shocked to find using EMBOSS to convert "FASTQ to FASTQ" like > this threw away the quality scores - and I'm sure other people will also > make this mistake. Hmmm ... good point, but hard to avoid unless we simply delete "fastq" from the list of output formats. On balance, I prefer to keep it available. On input there is simply no way to guarantee reading quality scores without being told which type they are. On output it is reasonable to default to fastq-sanger ... otherwise what else could "fastq" output format write? We can consider, as I say, dropping the fastq output format name from a future release. Let's see how users get on with it first. regards, Peter Rice From biopython at maubp.freeserve.co.uk Fri Jul 24 11:33:33 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Jul 2009 12:33:33 +0100 Subject: [emboss-dev] EMBOSS seqret FASTQ support In-Reply-To: <4A6998ED.9020607@ebi.ac.uk> References: <320fb6e00907201446x215fa100i3fc09dadb110216b@mail.gmail.com> <320fb6e00907201512r47d26224p64d42dc5ebefb11e@mail.gmail.com> <4A6571AC.5090801@ebi.ac.uk> <320fb6e00907211010v43e0a579je8e6d43112c6ba20@mail.gmail.com> <4A688EA1.9060005@ebi.ac.uk> <320fb6e00907240259v6f57d858nc640afa0b18f2727@mail.gmail.com> <320fb6e00907240332g21a01511h1a7ddaf677cad962@mail.gmail.com> <4A6998ED.9020607@ebi.ac.uk> Message-ID: <320fb6e00907240433k2ca27ea4y977063ecb863ebaa@mail.gmail.com> On Fri, Jul 24, 2009 at 12:20 PM, Peter Rice wrote: > > Peter C. wrote: >> I was shocked to find using EMBOSS to convert "FASTQ to FASTQ" like >> this threw away the quality scores - and I'm sure other people will also >> make this mistake. > > Hmmm ... good point, but hard to avoid unless we simply delete "fastq" > from the list of output formats. > > On balance, I prefer to keep it available. > > On input there is simply no way to guarantee reading quality scores > without being told which type they are. 
> > On output it is reasonable to default to fastq-sanger ... otherwise what > else could "fastq" output format write? Well quite. In our chat in Sweden, I never expected you to offer "fastq" as an output format in the first place, so didn't raise the issue. > We can consider, as I say, dropping the fastq output format name from a > future release. Let's see how users get on with it first. As you suggest, let's see if anyone else is concerned about this. Peter C. From biopython at maubp.freeserve.co.uk Fri Jul 24 12:40:55 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Jul 2009 13:40:55 +0100 Subject: [emboss-dev] EMBOSS format name "fastq-sanger" in Biopython? In-Reply-To: <320fb6e00907240348h6a3bc043kc17fbbd62daf8a06@mail.gmail.com> References: <320fb6e00907201057p50d2c4dar9ecc0a2c8a7cf9a5@mail.gmail.com> <320fb6e00907240348h6a3bc043kc17fbbd62daf8a06@mail.gmail.com> Message-ID: <320fb6e00907240540i17f7f3f0kdf144c79ccbfdae@mail.gmail.com> On Fri, Jul 24, 2009 at 11:48 AM, Peter wrote: > > To be more precise, it looks like "fastq" as an output format in > EMBOSS is an alias for "fastq-sanger" (to be confirmed), see: > http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000599.html Confirmed, http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000602.html > In any case, it would still make sense to include "fastq-sanger" as > an alias for the Sanger standard FASTQ files in Biopython's SeqIO, > especially if BioPerl is also going to use that name (to be confirmed): > http://lists.open-bio.org/pipermail/bioperl-l/2009-July/030688.html Confirmed, BioPerl will support "fastq" or "fastq-sanger" to mean the Sanger standard FASTQ files: http://lists.open-bio.org/pipermail/bioperl-l/2009-July/030691.html I've updated Biopython's SeqIO in CVS to support "fastq-sanger" as an alias for "fastq". 
Peter

From biopython at maubp.freeserve.co.uk Fri Jul 24 13:32:49 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 24 Jul 2009 14:32:49 +0100
Subject: [emboss-dev] FASTQ support in Biopython, BioPerl, and EMBOSS
Message-ID: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com>

Hi all,

Peter Rice kindly said he will look into an OBF cross project mailing
list, but in the meantime this has been cross posted to the Biopython,
BioPerl, and EMBOSS development lists.

On Thu, Jul 23, 2009 at 11:58 PM, Chris Fields wrote:
>> I'd like to get comparisons against BioPerl's new FASTQ support
>> going too. To do this I'd need to know which (branch?) of BioPerl I
>> should install, and I'd also like a trivial sample BioPerl script to do
>> piped FASTQ conversion. i.e. read a FASTQ file from stdin (say
>> as "fastq-solexa"), and output it to stdout (say as "fastq" meaning
>> the Sanger Standard FASTQ).
>
> You would have to install svn (bioperl-live) if you want the refactored
> fastq.  That commit was within the last month.

I've got SVN bioperl-live installed and apparently working :)

>> i.e. Something like this four line Biopython script would be perfect:
>> http://biopython.org/wiki/Reading_from_unix_pipes
>
> We use named parameters so it's a little more verbose.
>
> use Bio::SeqIO;
> my $in  = Bio::SeqIO->new(-fh => \*STDIN, -format => 'fastq-sanger');
> my $out = Bio::SeqIO->new(-format => 'fastq-solexa');
> while (my $seq = $in->next_seq) { $out->write_seq($seq) }
>
> Don't be surprised if there are still bugs lurking about, just let me know
> and I'll fix 'em.

I've got a bug report coming up in a second email, but the basics work :)

e.g.
Using this Sanger style FASTQ file, and converting it to Solexa style http://biopython.org/SRC/biopython/Tests/Quality/example.fastq $ more example.fastq @EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC + ;;3;;;;;;;;;;;;7;;;;;;;88 @EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA + ;;;;;;;;;;;7;;;;;-;;;3;83 @EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG + ;;;;;;;;;;;9;7;;.7;393333 This is simple three record FASTQ file (in the Sanger format). Using EMBOSS 6.1.0: $ seqret -filter -sformat fastq-sanger -osformat fastq-solexa < example.fastq @EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC +EAS54_6_R1_2_1_413_324 ZZRZZZZZZZZZZZZVZZZZZZZWW @EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA +EAS54_6_R1_2_1_540_792 ZZZZZZZZZZZVZZZZZLZZZRZWR @EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG +EAS54_6_R1_2_1_443_348 ZZZZZZZZZZZXZVZZMVZRXRRRR Using BioPerl: $ perl bioperl_sanger2solexa.pl < example.fastq @EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC +EAS54_6_R1_2_1_413_324 ZZRZZZZZZZZZZZZVZZZZZZZWW @EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA +EAS54_6_R1_2_1_540_792 ZZZZZZZZZZZVZZZZZLZZZRZWR @EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG +EAS54_6_R1_2_1_443_348 ZZZZZZZZZZZXZVZZMVZRXRRRR Using Biopython: $ python biopython_sanger2solexa.py < example.fastq @EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC + ZZRZZZZZZZZZZZZVZZZZZZZWW @EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA + ZZZZZZZZZZZVZZZZZLZZZRZWR @EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG + ZZZZZZZZZZZXZVZZMVZRXRRRR They all agree, except that Biopython has followed the MAQ convention of omitting the (optional) repeat of the captions on the plus lines. 
This is something I'd already asked Peter Rice about for EMBOSS (but I think we got sidetracked): http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000577.html Peter From biopython at maubp.freeserve.co.uk Fri Jul 24 13:53:40 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Jul 2009 14:53:40 +0100 Subject: [emboss-dev] FASTQ support in Biopython, BioPerl, and EMBOSS In-Reply-To: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> Message-ID: <320fb6e00907240653y1d7e7861j98ce45a12f02d9df@mail.gmail.com> On Fri, Jul 24, 2009 at 2:32 PM, Peter wrote: >> >> Don't be surprised if there are still bugs lurking about, just let me >> know and I'll fix 'em. > > I've got a bug report coming up in a second email, but the basics work :) I think I have found a bug in BioPerl's conversion from fastq-solexa to fastq-sanger concerning lower quality scores. Here is an artificial Solexa file using the Solexa scores from 40 down to -5 (which I believe to be the full range expected from an instrument). $ more solexa_faked.fastq @slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN +slxa_0001_1_0001_01 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<; A Solexa quality of 40 maps to ASCII 40+64 = 104, "h" A Solexa quality of -5 maps to ASCII -5+64 = 59, ";" You should find this example has Solexa scores 40, 39, .., -4, -5. This file is in the Biopython repository under biopython/Tests/Quality Here is the conversion using MAQ (with the chomp fix from Tim Yu to remove an extra "!" 
character, see the maq-help mailing list for 10 July 2009): http://sourceforge.net/mailarchive/forum.php?thread_name=320fb6e00906170708lb2ce4f7qbc5dfa43543189a2%40mail.gmail.com&forum_name=maq-help $ perl fq_all2std.pl sol2std < solexa_faked.fastq @slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN + IHGFEDCBA@?>=<;:9876543210/.-,++*)('&&%%$$##"" Here is the Biopython conversion, which is identical: $ python biopython_solexa2sanger.py < solexa_faked.fastq @slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN + IHGFEDCBA@?>=<;:9876543210/.-,++*)('&&%%$$##"" EMBOSS 6.1.0 has a rounding issue with negative Solexa scores, and the last six qualities are up by one - Peter Rice is aware of this, and has a fix: http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000596.html $ seqret -filter -sformat fastq-solexa -osformat fastq-sanger < solexa_faked.fastq @slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN +slxa_0001_1_0001_01 IHGFEDCBA@?>=<;:9876543210/.-,+*)(''&%%$$##""" Now we come to BioPerl, $ perl bioperl_solexa2sanger.pl < solexa_faked.fastq @slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN +slxa_0001_1_0001_01 IHGFEDCBA@?>=<;:9876543210/.-,+++*)(''&&&&%%%% You look fine for the higher qualities, but there is something really wrong for the lower scores (not just the negative ones). 
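For reference, the usual conversion from a Solexa score to a PHRED score is Q_phred = 10 * log10(10^(Q_solexa / 10) + 1), rounded to the nearest integer. The sketch below applies that formula to the Solexa scores 40 down to -5 used in solexa_faked.fastq. The function name is hypothetical and the code is not taken from MAQ, Biopython, EMBOSS or BioPerl.

```python
from math import log10


def solexa_to_phred(solexa):
    """Map a Solexa score to the nearest integer PHRED score."""
    return int(round(10 * log10(10 ** (solexa / 10.0) + 1)))


# Solexa scores 40, 39, ..., -4, -5 as in the faked Solexa file.
solexa_scores = list(range(40, -6, -1))
phred_scores = [solexa_to_phred(score) for score in solexa_scores]

# Encode the PHRED scores Sanger style (ASCII offset 33).
sanger_line = "".join(chr(score + 33) for score in phred_scores)
print(sanger_line)
```

Under these assumptions the printed quality string is the same as the one shown above for MAQ and Biopython, which suggests the BioPerl discrepancy lies in how the low scores are mapped rather than in the test file.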
I'll leave you to double check the details, but here are the Sanger PHRED qualities decoded into integers (using Biopython to convert from "fastq-sanger" to "qual" output): $ perl bioperl_solexa2sanger.pl < solexa_faked.fastq | python biopython_sanger2qual.py >slxa_0001_1_0001_01 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 10 10 9 8 7 6 6 5 5 5 5 4 4 4 4 $ perl fq_all2std.pl sol2std < solexa_faked.fastq | python biopython_sanger2qual.py >slxa_0001_1_0001_01 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 10 9 8 7 6 5 5 4 4 3 3 2 2 1 1 Peter C. P.S. This is the BioPerl script I am using here: $ more bioperl_solexa2sanger.pl use Bio::SeqIO; my $in = Bio::SeqIO->new(-fh => \*STDIN, -format => 'fastq-solexa'); my $out = Bio::SeqIO->new(-format => 'fastq-sanger'); while (my $seq = $in->next_seq) { $out->write_seq($seq) }; From biopython at maubp.freeserve.co.uk Fri Jul 24 14:01:11 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Jul 2009 15:01:11 +0100 Subject: [emboss-dev] EMBOSS seqret FASTQ support In-Reply-To: <4A69897F.9010301@ebi.ac.uk> References: <320fb6e00907201446x215fa100i3fc09dadb110216b@mail.gmail.com> <320fb6e00907201512r47d26224p64d42dc5ebefb11e@mail.gmail.com> <4A6571AC.5090801@ebi.ac.uk> <320fb6e00907211010v43e0a579je8e6d43112c6ba20@mail.gmail.com> <4A688EA1.9060005@ebi.ac.uk> <320fb6e00907240259v6f57d858nc640afa0b18f2727@mail.gmail.com> <4A69897F.9010301@ebi.ac.uk> Message-ID: <320fb6e00907240701i1656fe1bh821e491cdc1958ff@mail.gmail.com> On Fri, Jul 24, 2009 at 11:14 AM, Peter Rice wrote: > > Peter C. wrote: >> I'd like to re-test with your fixes. I presume these things are being fixed >> in the public CVS repository, so I could try building EMBOSS from there. >> Is there a particular branch? Or are you planning an EMBOSS 6.1.1 >> release shortly? 
> > You found various things in sequence formats, but all are resolved by > changes to ajseqread.c and ajseqwrite.c > > Assuming I am happy with the test I plan to make a patch which will > update those files. If issuing patches is how you prefer to handle this, that's fine with me. Will you do updates to the binaries for Windows users etc? > The CVS code would have new things for the next release. For now, > if you are using the 6.1.0 release, patching is the way to go. So if I want to retest with your fixes, I can either use CVS or wait for the patches? > Fixes so far: > > FASTQ format changes: > > * sequence and quality scores on one line That does seem to be preferred in general. > * quality ID line shortened to '+' This is certainly the way MAQ does it, and as a Sanger based tool that gives this some status - in addition to the file size benefit ;) > * Solexa negative quality score output corrected > * Phred quality score rounding error fixed Were the above two the same issue? > * Corrected reading of quality lines starting with '@' Great. Can you read this file fine now? http://biopython.org/SRC/biopython/Tests/Quality/tricky.fastq > GenBank format changes: > > * protein (genpept and refseqp) formats auto-detect fix for multiple > input sequences > > Intelligenetics format: > > * Sequence ID corrected for DOS format input file > > Did I miss anything? I think that's everything. Thank you! 
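One common way for a reader to cope with both wrapped records and quality lines that begin with '@' or '+' is to count characters instead of looking for marker characters: collect sequence lines until the '+' line, then collect quality lines until as many quality characters as sequence characters have been read. A minimal sketch of that idea follows. The name `read_fastq` is hypothetical and this is not the EMBOSS or Biopython implementation.

```python
def read_fastq(lines):
    """Yield (title, sequence, quality) tuples from an iterable of text lines."""
    handle = iter(lines)
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines between records
        if not line.startswith("@"):
            raise ValueError("expected '@' title line, got %r" % line)
        title = line[1:]
        # Sequence lines run until the '+' line (wrapped or not).
        seq_parts = []
        for line in handle:
            line = line.rstrip("\r\n")
            if line.startswith("+"):
                break
            seq_parts.append(line)
        else:
            raise ValueError("missing '+' line for record %r" % title)
        seq = "".join(seq_parts)
        # Quality lines run until their total length matches the sequence,
        # so a quality line starting with '@' or '+' is not misread.
        qual = ""
        while len(qual) < len(seq):
            try:
                qual += next(handle).rstrip("\r\n")
            except StopIteration:
                raise ValueError("truncated quality for record %r" % title)
        if len(qual) != len(seq):
            raise ValueError("quality longer than sequence for %r" % title)
        yield title, seq, qual


example = ["@r1\n", "ACGT\n", "ACGT\n", "+r1\n", "@;;;\n", "+;;;\n",
           "@r2\n", "AC\n", "+\n", "!!\n"]
for record in read_fastq(example):
    print(record)
```

The design choice here is that the sequence length, not the first character of a line, decides where the quality block ends; the cost is that a file with mismatched lengths is rejected rather than guessed at.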
Peter From biopython at maubp.freeserve.co.uk Fri Jul 24 15:12:57 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Jul 2009 16:12:57 +0100 Subject: [emboss-dev] FASTQ support in Biopython, BioPerl, and EMBOSS In-Reply-To: <320fb6e00907240653y1d7e7861j98ce45a12f02d9df@mail.gmail.com> References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com> <320fb6e00907240653y1d7e7861j98ce45a12f02d9df@mail.gmail.com> Message-ID: <320fb6e00907240812l25cd222dxf72fee0e3093f7b3@mail.gmail.com> On Fri, Jul 24, 2009 at 2:53 PM, Peter wrote: > On Fri, Jul 24, 2009 at 2:32 PM, Peter wrote: >>> >>> Don't be surprised if there are still bugs lurking about, just let me >>> know and I'll fix 'em. >> >> I've got a bug report coming up in a second email, but the basics work :) > > I think I have found a bug in BioPerl's conversion from fastq-solexa > to fastq-sanger concerning lower quality scores. Next up is an issue with BioPerl converting from Sanger to Illumina. In principle this is simple - the quality strings both use PHRED scores just with different offsets. With lower PHRED scores, everything is fine: $ more sanger_faked.fastq @Test PHRED qualities from 40 to 0 inclusive ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN + IHGFEDCBA@?>=<;:9876543210/.-,+*)('&%$#"! Again, this is an example constructed by hand to cover a broad range of valid scores, and can be found in the Biopython repository under biopython/Tests/Quality $ perl bioperl_sanger2illumina.pl < sanger_faked.fastq @Test PHRED qualities from 40 to 0 inclusive ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN +Test PHRED qualities from 40 to 0 inclusive hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@ $ python biopython_sanger2illumina.py < sanger_faked.fastq @Test PHRED qualities from 40 to 0 inclusive ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN + hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@ So, BioPerl and Biopython (and EMBOSS) agree - apart from the repeating second title on the plus line. 
I understand that EMBOSS will in future omit the repeated title on the plus line: http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000598.html Now, here comes the problem. I believe FASTQ files directly from an Illumina 1.3+ pipeline will have PHRED scores in the range 0 to 40 (as in this example). However, much higher PHRED scores are possible during assembly / contig'ing and read mapping. For example, the tool MAQ will output Sanger style FASTQ files with PHRED scores in the range 0 to 93 inclusive. Now, in the Sanger FASTQ format, PHRED scores of 0 to 93 map onto ASCII values of 33 to 126 (! to ~). There is a reason for stopping at 126, since ASCII 127 is "delete". However, in the Illumina 1.3+ FASTQ format, PHRED scores of 0 to 93 would map to ASCII values of 64 to 157, which includes a lot of non printing characters. Working with such files at the command line or in an editor is a big problem. Clearly, Illumina never intended to include such high scores in their FASTQ files! Nevertheless, it is possible to write a FASTQ format following the Illumina 1.3+ encoding with these values. Biopython and EMBOSS attempt to do this - although I would regard throwing an error as equally acceptable. So, here is another hand constructed example of a Sanger style FASTQ file using the full quality range: $ more sanger_93.fastq @Test PHRED qualities from 93 to 0 inclusive ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGAN + ~}|{zyxwvutsrqponmlkjihgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;:9876543210/.-,+*)('&%$#"! 
Again, this example is in the Biopython repository under
biopython/Tests/Quality

Just to check:

$ python biopython_sanger2qual.py < sanger_93.fastq
>Test PHRED qualities from 93 to 0 inclusive
93 92 91 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75 74
73 72 71 70 69 68 67 66 65 64 63 62 61 60 59 58 57 56 55 54
53 52 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34
33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14
13 12 11 10 9 8 7 6 5 4 3 2 1 0

So, here we go - apologies for the expected line mangling:

$ seqret -filter -sformat fastq-sanger -osformat fastq-illumina < sanger_93.fastq | hexdump -C -v
00000000 40 54 65 73 74 20 50 48 52 45 44 20 71 75 61 6c |@Test PHRED qual|
00000010 69 74 69 65 73 20 66 72 6f 6d 20 39 33 20 74 6f |ities from 93 to|
00000020 20 30 20 69 6e 63 6c 75 73 69 76 65 0a 41 43 54 | 0 inclusive.ACT|
00000030 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT|
00000040 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT|
00000050 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT|
00000060 47 41 43 54 47 41 43 54 47 0a 41 43 54 47 41 43 |GACTGACTG.ACTGAC|
00000070 54 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 |TGACTGACTGACTGAC|
00000080 54 47 41 43 54 47 41 43 54 47 41 4e 0a 2b 54 65 |TGACTGACTGAN.+Te|
00000090 73 74 0a 9d 9c 9b 9a 99 98 97 96 95 94 93 92 91 |st..............|
000000a0 90 8f 8e 8d 8c 8b 8a 89 88 87 86 85 84 83 82 81 |................|
000000b0 80 7f 7e 7d 7c 7b 7a 79 78 77 76 75 74 73 72 71 |..~}|{zyxwvutsrq|
000000c0 70 6f 6e 6d 6c 6b 6a 69 68 67 66 65 64 63 62 0a |ponmlkjihgfedcb.|
000000d0 61 60 5f 5e 5d 5c 5b 5a 59 58 57 56 55 54 53 52 |a`_^]\[ZYXWVUTSR|
000000e0 51 50 4f 4e 4d 4c 4b 4a 49 48 47 46 45 44 43 42 |QPONMLKJIHGFEDCB|
000000f0 41 40 0a |A@.|
000000f3

$ python biopython_sanger2illumina.py < sanger_93.fastq | hexdump -C -v
00000000 40 54 65 73 74 20 50 48 52 45 44 20 71 75 61 6c |@Test PHRED qual|
00000010 69 74 69 65 73 20 66 72 6f 6d 20 39 33 20 74 6f |ities from 93 to|
00000020 20 30 20 69 6e 63 6c 75 73 69 76 65 0a 41 43 54 | 0 inclusive.ACT|
00000030 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT|
00000040 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT|
00000050 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT|
00000060 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT|
00000070 47 41 43 54 47 41 43 54 47 41 43 54 47 41 43 54 |GACTGACTGACTGACT|
00000080 47 41 43 54 47 41 43 54 47 41 4e 0a 2b 0a 9d 9c |GACTGACTGAN.+...|
00000090 9b 9a 99 98 97 96 95 94 93 92 91 90 8f 8e 8d 8c |................|
000000a0 8b 8a 89 88 87 86 85 84 83 82 81 80 7f 7e 7d 7c |.............~}||
000000b0 7b 7a 79 78 77 76 75 74 73 72 71 70 6f 6e 6d 6c |{zyxwvutsrqponml|
000000c0 6b 6a 69 68 67 66 65 64 63 62 61 60 5f 5e 5d 5c |kjihgfedcba`_^]\|
000000d0 5b 5a 59 58 57 56 55 54 53 52 51 50 4f 4e 4d 4c |[ZYXWVUTSRQPONML|
000000e0 4b 4a 49 48 47 46 45 44 43 42 41 40 0a |KJIHGFEDCBA@.|
000000ed

Biopython and EMBOSS 6.1.0 differ regarding the plus line, but agree on
the quality string which runs from 0x9d to 0x40 (in hex), or 157 to 64
in decimal, which after subtracting the Illumina offset of 64, gives
PHRED scores of 93 to 0 as desired.

Now to BioPerl,

$ perl bioperl_sanger2illumina.pl < sanger_93.fastq
@Test PHRED qualities from 93 to 0 inclusive
ACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGAN
+Test PHRED qualities from 93 to 0 inclusive
hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@

$ perl bioperl_sanger2illumina.pl < sanger_93.fastq | hexdump -C -v
...

BioPerl has output an invalid FASTQ file - it seems to omit the quality
scores for the top scoring nucleotides at the start. The BioPerl quality
string runs from just "h" to "@", or 0x68 to 0x40 (in hex), giving 104
to 64 in decimal, giving PHRED values of 40 to 0.
I think BioPerl should either throw an error, or output
the non printing characters as done by Biopython and EMBOSS.

Regards,

Peter C.
(@Biopython)

From biopython at maubp.freeserve.co.uk Sat Jul 25 21:12:26 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 25 Jul 2009 22:12:26 +0100
Subject: [emboss-dev] [Bioperl-l] FASTQ support in Biopython, BioPerl, and EMBOSS
In-Reply-To: <32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu>
References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com>
 <320fb6e00907240653y1d7e7861j98ce45a12f02d9df@mail.gmail.com>
 <320fb6e00907240812l25cd222dxf72fee0e3093f7b3@mail.gmail.com>
 <32BA007E-949A-4BF2-9F73-8FE0F98807CC@illinois.edu>
Message-ID: <320fb6e00907251412u5f53b24eiea618906e607a0e1@mail.gmail.com>

On Sat, Jul 25, 2009 at 8:50 PM, Chris Fields wrote:
>
>> Now, here comes the problem. I believe FASTQ files directly
>> from an Illumina 1.3+ pipeline will have PHRED scores in the
>> range 0 to 40 (as in this example). However, much higher
>> PHRED scores are possible during assembly / contig'ing
>> and read mapping. For example, the tool MAQ will output
>> Sanger style FASTQ files with PHRED scores in the range
>> 0 to 93 inclusive.
>
> Is this behavior documented anywhere, specifically by Illumina (that values
> can exceed 40)? If Illumina 1.3 is specified as being PHRED 0-40, and
> another (non-Illumina) software package pushes that limit above the
> specified range of Illumina values, I would consider that unfortunately yet
> another variant.
>
> We can support it as Illumina 1.3, but my point is this may be getting into a
> grey area and may be something that Illumina doesn't/wouldn't support.
> Reminds me a little of the multiple GFF2 variations (one of the main
> reasons for a GFF3).

I agree this is a grey area (high scores in Solexa/Illumina
FASTQ files).

>> Now, in the Sanger FASTQ format, PHRED scores of 0 to
>> 93 map onto ASCII values of 33 to 126 (! to ~). There is a
>> reason for stopping at 126, since ASCII 127 is "delete".
>>
>> However, in the Illumina 1.3+ FASTQ format, PHRED
>> scores of 0 to 93 would map to ASCII values of 64 to
>> 157, which includes a lot of non printing characters.
>> Working with such files at the command line or in an
>> editor is a big problem. Clearly, Illumina never intended
>> to include such high scores in their FASTQ files!
>
> Exactly.
>
>> Nevertheless, it is possible to write a FASTQ format
>> following the Illumina 1.3+ encoding with these values.
>> Biopython and EMBOSS attempt to do this - although I
>> would regard throwing an error as equally acceptable.
>>
>> So, here is another hand constructed example of a
>> Sanger style FASTQ file using the full quality range:
>>
>> ...
>>
>> Biopython and EMBOSS 6.1.0 differ regarding the plus line, but agree
>> on the quality string which runs from 0x9d to 0x40 (in hex), or 157 to
>> 64 in decimal, which after subtracting the Illumina offset of 64, gives
>> PHRED scores of 93 to 0 as desired.
>>
>> Now to BioPerl,
>>
>> $ perl bioperl_sanger2illumina.pl < sanger_93.fastq
>> ...
>>
>> $ perl bioperl_sanger2illumina.pl < sanger_93.fastq | hexdump -C -v
>> ...
>>
>> BioPerl has output an invalid FASTQ file - it seems to omit the
>> quality scores for the top scoring nucleotides at the start. The
>> BioPerl quality string runs from just "h" to "@", or 0x68 to 0x40
>> (in hex), giving 104 to 64 in decimal, giving PHRED values of
>> 40 to 0. I think BioPerl should either throw an error, or output
>> the non printing characters as done by Biopython and EMBOSS.
>
> If this is accepted as common practice between BioPython and EMBOSS
> we will follow similarly. I do think it's worth at least a warning for the
> reasons outlined above (e.g. it likely isn't Illumina's intent to support qual
> values outside the specified range). Might be worth checking into.

True.
I think what EMBOSS and Biopython are doing is reasonable
(although a warning in this situation makes sense). Equally, an
error is a valid option.

However, one question is when would you issue the warning/error?
For a PHRED score above 40? (Assuming we have a definitive
reference for Illumina using just 0 to 40).

How about if a problem character would result? Since ASCII
64+63=127, the first problem character would be for PHRED
score 63. i.e. An Illumina FASTQ format file can hold PHRED
scores in the range 0 to 62 without using problem characters.
And likewise for a Solexa FASTQ file (Solexa scores up to 62).

> From this it could be summarized that converting to sanger format is least
> problematic, as possible issues may be encountered when converting to the
> other variants.

Yes. The Sanger FASTQ format will hold PHRED scores from
0 to 93 while using nice ASCII characters - this means it is
suitable for both raw reads and processed data from assemblies
or read mappings.

In my personal experience, Solexa/Illumina FASTQ files tend
to get converted into the Sanger FASTQ format for downstream
analysis (e.g. the MAQ tool, or the NCBI short read archive).
i.e. Writing high quality reads (i.e. above PHRED 40) to Solexa
or Illumina FASTQ files is unlikely.

> We'll need to fix the solexa quality calculations in the BioPerl
> parser as noted in your previous post; I'll work on that.

Great.

Peter

From pmr at ebi.ac.uk Mon Jul 27 08:55:43 2009
From: pmr at ebi.ac.uk (Peter Rice)
Date: Mon, 27 Jul 2009 09:55:43 +0100
Subject: [emboss-dev] Open-bio cross-project issues
In-Reply-To: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com>
References: <320fb6e00907240632h53600e73s63590a8deb4e8ffe@mail.gmail.com>
Message-ID: <4A6D6B8F.9060108@ebi.ac.uk>

Peter C. wrote (to bioperl-l, biopython-l, emboss-dev):
> Hi all,
>
> Peter Rice kindly said he will look into an OBF cross project mailing
> list, but in the meantime this has been cross posted to the Biopython,
> BioPerl, and EMBOSS development lists.

There is a list already for this purpose - open-bio-l

I think we will also need a cross-project wiki space on the OBF site. Is
there something already used by other projects or should we set
something up?

I am cross-posting this to other OBF project lists to encourage
developers interested in combining efforts to address common problems.

This started with FASTQ short read formats, and open-bio-l (a low volume
list) has also seen discussion of common test data sets.

Please sign up to open-bio-l (if you are not there already) and post
suggestions for cross-project issues there.

The list subscription page is:

http://lists.open-bio.org/mailman/listinfo/open-bio-l

Please feel free to forward this to any other projects I may have missed
(I picked the obvious addresses from the lists.open-bio.org server)

regards,

Peter Rice

From biopython at maubp.freeserve.co.uk Mon Jul 27 17:39:49 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 27 Jul 2009 18:39:49 +0100
Subject: [emboss-dev] FASTQ parsing speed in EMBOSS
Message-ID: <320fb6e00907271039w15ef3afcsd4a36e3ddbf001e8@mail.gmail.com>

Hi all,

I've been testing EMBOSS 6.1.0 with a patch from Peter Rice for
some of the FASTQ issues I've raised, and I decided to do a few
simple benchmarks.

For this example, I have used a 1.3 GB standard Sanger FASTQ
file from the NCBI short read archive which contains just over
seven million short reads of length 36 bp, which I believe were
originally from a Solexa/Illumina machine. This is actually one
of a pair of FASTQ files as this was a paired end run.
The file is here (compressed):

ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX000/SRX000430/SRR001666_1.fastq.gz

Note that some of the quality lines start with "@", so you can't
use grep for "^@" to count the records. However, all the reads
have an identifier starting SRR so you can do this:

$ time grep "^@SRR" SRR001666_1.fastq | wc -l
7047668

real 0m15.886s
user 0m18.357s
sys 0m1.268s

For this example, I want to convert the FASTQ file to FASTA
(i.e. ignore and throw away the quality scores). This is a fairly
common task, as most all assemblers will take FASTA files,
even if they don't understand FASTQ.

As I didn't want to waste disk space and I wanted a basic
check on the output, I have simply piped the output via
grep and wc to count the FASTA records:

$ time seqret -filter -sformat fastq-sanger -osformat fasta < SRR001666_1.fastq | grep "^>" | wc -l
7047668

real 2m48.288s
user 3m3.994s
sys 0m3.525s

I've run this several times, and this result is typical. So, using
the "fastq-sanger" format this takes about 2m48s. There is a
slight speed up using "fastq" as the EMBOSS input format
name, as this never has to convert the quality strings into
PHRED values:

$ time seqret -filter -sformat fastq -osformat fasta < SRR001666_1.fastq | grep "^>" | wc -l
7047668

real 2m43.566s
user 2m59.077s
sys 0m3.540s

i.e. About 2m44, saving about 4s.

Just for the record, actually doing the FASTQ to FASTA conversion
to a file (without grep and wc) takes about 2m52s:

$ time seqret -filter -sformat fastq -osformat fasta -sequence SRR001666_1.fastq -outseq SRR001666_1.fasta

real 2m51.791s
user 2m40.545s
sys 0m4.848s

This is over 40 thousand reads per second, but I was still a
little disappointed in the run time. Improvements in the FASTQ
parsing/writing speed would help get EMBOSS used in
sequencing centre pipelines. Once we have the EMBOSS
FASTQ input/output working as intended, does trying to
speed it up further seem worthwhile?
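
As a rough check on the figures above, the following Python sketch counts FASTQ records without relying on a leading "@" and recomputes the throughput. It assumes exactly four lines per record (no wrapped sequence or quality lines), which is an assumption about this kind of file rather than something FASTQ guarantees; the function names and the SRR_demo identifiers are made up for illustration.

```python
import io


def count_fastq_records(handle):
    """Count records, assuming exactly four lines per record (an assumption)."""
    count = 0
    total_lines = 0
    for total_lines, line in enumerate(handle, start=1):
        # Only the first line of each group of four is treated as a title,
        # so a quality line that happens to start with "@" is not miscounted.
        if total_lines % 4 == 1:
            if not line.startswith("@"):
                raise ValueError("Expected '@' title at line %d" % total_lines)
            count += 1
    if total_lines % 4 != 0:
        raise ValueError("Truncated final record")
    return count


def reads_per_second(reads, minutes, seconds):
    """Throughput from a read count and a wall clock time."""
    return reads / (minutes * 60 + seconds)


# Hypothetical two-record file; the first quality line starts with "@".
example = io.StringIO(
    "@SRR_demo.1\nACGT\n+\n@III\n"
    "@SRR_demo.2\nTTGA\n+\nIIII\n"
)
print(count_fastq_records(example))

# 7047668 reads in 2m51.791s, taken from the timing quoted above.
print(round(reads_per_second(7047668, 2, 51.791)))
```

The second figure comes out at roughly 41 thousand reads per second, consistent with the "over 40 thousand" quoted above.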
One specific suggestion is for the "fastq" parser (function
seqReadFastq) which doesn't do anything with the quality
strings. Other than for a debug statement, there is no need
to calculate these lines:

    minqual = ajStrGetAsciiLow(qualstr);
    maxqual = ajStrGetAsciiHigh(qualstr);
    comqual = ajStrGetAsciiCommon(qualstr);

In fact, you don't really need to record qualstr at all. Could
you just verify the total length of the quality string, without
actually recording it in a buffer?

Another suggestion (although not demonstrated in the above
benchmark) is for the Solexa FASTQ parsing (and output).
From looking at the code, you map the ASCII to a PHRED
score for each letter of every read. This is a relatively
expensive operation using powers and logs. I would try
using a precomputed look up table (something I have just
been working on for Biopython - this made a very big
difference, especially when converting to/from Solexa
scores to PHRED scores).

Peter C.

From pmr at ebi.ac.uk Tue Jul 28 08:05:47 2009
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 28 Jul 2009 09:05:47 +0100
Subject: [emboss-dev] FASTQ parsing speed in EMBOSS
In-Reply-To: <320fb6e00907271039w15ef3afcsd4a36e3ddbf001e8@mail.gmail.com>
References: <320fb6e00907271039w15ef3afcsd4a36e3ddbf001e8@mail.gmail.com>
Message-ID: <4A6EB15B.20903@ebi.ac.uk>

Peter wrote:
> Hi all,
>
> I've been testing EMBOSS 6.1.0 with a patch from Peter Rice for
> some of the FASTQ issues I've raised, and I decided to do a few
> simple benchmarks.
>
> This is over 40 thousand reads per second, but I was still a
> little disappointed in the run time. Improvements in the FASTQ
> parsing/writing speed would help get EMBOSS used in
> sequencing centre pipelines. Once we have the EMBOSS
> FASTQ input/output working as intended, does trying to
> speed it up further seem worthwhile?

Thanks. I'll take a look. FASTQ parsing is pretty fast - in that writing
the output takes about as long as reading the input. There may be ways
to speed that up (output requires making an output sequence object which
takes half the output time).

Building EMBOSS with --with-gccprofile and compiling with gcc creates a
gprof profile. Very useful for catching bottlenecks.

Up to the advent of NGS data, large input/output runs have been limited
to converting EMBL/GenBank into Fasta as a one-off every few months so
looking into the efficiency of sequence reading/writing has been a low
priority. Now it does assume much more importance.

> Another suggestion (although not demonstrated in the above
> benchmark) is for the Solexa FASTQ parsing (and output).
> From looking at the code, you map the ASCII to a PHRED
> score for each letter of every read. This is a relatively
> expensive operation using powers and logs. I would try
> using a precomputed look up table (something I have just
> been working on for Biopython - this made a very big
> difference, especially when converting to/from Solexa
> scores to PHRED scores).

Yes, that was on my list of future changes. There wasn't time to fully
implement and test before the release freeze.

regards,

Peter

From biopython at maubp.freeserve.co.uk Tue Jul 28 09:21:33 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 28 Jul 2009 10:21:33 +0100
Subject: [emboss-dev] FASTQ parsing speed in EMBOSS
In-Reply-To: <4A6EB15B.20903@ebi.ac.uk>
References: <320fb6e00907271039w15ef3afcsd4a36e3ddbf001e8@mail.gmail.com>
 <4A6EB15B.20903@ebi.ac.uk>
Message-ID: <320fb6e00907280221y141797fcw81faeefd22429fb1@mail.gmail.com>

On Tue, Jul 28, 2009 at 9:05 AM, Peter Rice wrote:
>
> Thanks. I'll take a look. FASTQ parsing is pretty fast - in that writing the
> output takes about as long as reading the input. There may be ways to speed
> that up (output requires making an output sequence object which takes half
> the output time).
>
> Building EMBOSS with --with-gccprofile and compiling with gcc creates a
> gprof profile. Very useful for catching bottlenecks.

Nice tip.
> Up to the advent of NGS data, large input/output runs have been limited to
> converting EMBL/GenBank into Fasta as a one-off every few months so looking
> into the efficiency of sequence reading/writing has been a low priority. Now
> it does assume much more importance.

Exactly :)

>> Another suggestion (although not demonstrated in the above
>> benchmark) is for the Solexa FASTQ parsing (and output).
>> From looking at the code, you map the ASCII to a PHRED
>> score for each letter of every read. This is a relatively
>> expensive operation using powers and logs. I would try
>> using a precomputed look up table (something I have just
>> been working on for Biopython - this made a very big
>> difference, especially when converting to/from Solexa
>> scores to PHRED scores).
>
> Yes, that was on my list of future changes. There wasn't time to fully
> implement and test before the release freeze.

That makes sense - and it is a pretty obvious thing to try, so
I would have been surprised if you hadn't come up with the
same idea.

Peter

From biopython at maubp.freeserve.co.uk Tue Jul 28 12:51:08 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 28 Jul 2009 13:51:08 +0100
Subject: [emboss-dev] FASTQ parsing speed in EMBOSS
Message-ID: <320fb6e00907280551n7a42563byb802016b2342de06@mail.gmail.com>

I've retitled this and CC'ed it to the EMBOSS dev list - which is
probably a better place for this now!

On Tue, Jul 28, 2009 at 1:40 PM, Peter Rice wrote:
> Peter wrote:
>> On Mon, Jul 27, 2009 at 11:44 PM, Brad Chapman wrote:
>
>>>> P.S. Anyone care to guess on how EMBOSS, BioPerl, and
>>>> Biopython's FASTQ parsing stacks up in terms of run time?
>>>
>>> We better be the fastest. Everyone knows that C code is bloated
>>> and slow.
>>
>> I'm pretty sure that was tongue in cheek, but if you were being mean
>> you probably could describe some of the EMBOSS infrastructure
>> as bloat. In any case, I'm sure that EMBOSS can be made faster
>> now that speed matters here with next generation sequencing, see:
>> http://lists.open-bio.org/pipermail/emboss-dev/2009-July/000611.html
>
> EMBOSS code is indeed bloated and slow in some places - for example on
> output it constructs a sequence output object from the input sequence.
> However, it's C ... if we know what we're doing we can tell the machine
> to go faster. Unless the compiler decides it can optimise us away...
>
> Certainly this is a place where using reference-counted strings shows
> gains. We tend to avoid them in EMBOSS because early experience in
> optimising had them being deleted at the 'wrong' times and leaving us
> with no significant improvement in performance. Sequence output looks
> like a good place for them.
>
> We can also simplify the sequence output objects to avoid some of the
> reset operations when reusing the objects.
>
>> And I've got bad news for you then - currently EMBOSS seqret
>> is about twice as fast as CVS Biopython SeqIO (measuring parsing
>> versus writing is a bit tricky). However, I have a cunning plan:
>> http://lists.open-bio.org/pipermail/biopython-dev/2009-July/006493.html
>
> Worse news, I can find some speedups in EMBOSS ... though
> the split is about 40% in output and 60% in input CPU time.

Well, it is only bad news from the point of view of Biopython
bragging rights ;)

And with those speed ups, I guess my fast lower level Biopython
FASTQ to FASTA script will now be about the same speed as
seqret! See:
http://lists.open-bio.org/pipermail/biopython-dev/2009-July/006493.html

Nice work!

> I/O time is another issue where we could play with blocked
> reads ... though when I tried that some time ago it seemed
> the operating systems and file systems were doing a grand
> job and it was hard to get a consistent speed gain even for
> one specific system.

Maybe best avoided, given EMBOSS is truly cross platform.

Peter C.
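
The precomputed look up table suggested earlier in this thread for Solexa scores can be sketched in Python as follows. This is only an illustration, not the EMBOSS or Biopython code: it assumes the Solexa FASTQ ASCII offset of 64, Solexa scores from -5 to 62, the usual relation PHRED = 10 * log10(10^(Solexa/10) + 1), and rounding to the nearest integer. All names are hypothetical.

```python
import math

SOLEXA_OFFSET = 64       # ASCII offset assumed for Solexa FASTQ
MIN_SOLEXA_SCORE = -5    # assumed lowest Solexa score
MAX_PRINTABLE_ASCII = 126


def solexa_to_phred(solexa_score):
    """Convert one Solexa score to a PHRED score (the slow log/power route)."""
    return 10 * math.log10(10 ** (solexa_score / 10.0) + 1)


# Built once at import time: quality character -> rounded PHRED score.
PHRED_FROM_SOLEXA_CHAR = {
    chr(code): int(round(solexa_to_phred(code - SOLEXA_OFFSET)))
    for code in range(SOLEXA_OFFSET + MIN_SOLEXA_SCORE, MAX_PRINTABLE_ASCII + 1)
}


def decode_solexa_quality(quality_string):
    """Map a Solexa quality string to PHRED scores using only table lookups."""
    return [PHRED_FROM_SOLEXA_CHAR[char] for char in quality_string]


# ";" is Solexa -5, "@" is Solexa 0, "J" is Solexa 10, "h" is Solexa 40.
print(decode_solexa_quality(";@Jh"))
```

The logs and powers are evaluated once per possible character (68 entries here) rather than once per base of every read, which is the point of the suggestion.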
From jbdundas at gmail.com Wed Jul 29 01:06:43 2009
From: jbdundas at gmail.com (jitesh dundas)
Date: Wed, 29 Jul 2009 06:36:43 +0530
Subject: [emboss-dev] emboss-dev Digest, Vol 11, Issue 14
In-Reply-To:
References:
Message-ID: <326ea8620907281806x2ffa42sf345cc9a0986aec3@mail.gmail.com>

Dear Sir,

I am going to begin writing code for making parallel program execution
in Emboss. I need someone to answer my doubts about Emboss as I am
learning.

--
Thanks & Regards,
Jitesh Dundas
Research Associate, DIL Lab, IIT-Bombay(www.dil.iitb.ac.in),
Scientist, Edencore Technologies(www.edencore.net)
Phone:- +91-9860925706
http://jiteshbdundas.blogspot.com

"No idea is stupid,either its too good to be true, or its way ahead of its future"- GEORGE BERNARD SHAW.

From biopython at maubp.freeserve.co.uk Fri Jul 31 12:01:27 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 31 Jul 2009 13:01:27 +0100
Subject: [emboss-dev] FASTQ parsing speed in EMBOSS
In-Reply-To: <320fb6e00907280551n7a42563byb802016b2342de06@mail.gmail.com>
References: <320fb6e00907280551n7a42563byb802016b2342de06@mail.gmail.com>
Message-ID: <320fb6e00907310501t56a136d0yde9882cb3e96c4a2@mail.gmail.com>

On Tue, Jul 28, 2009 at 1:51 PM, Peter wrote:
> I've retitled this and CC'ed it to the EMBOSS dev list - which is
> probably a better place for this now!

Another random thought for speeding up parsing/writing the
Solexa/Illumina FASTQ formats: At some point you need to
convert from an integer score to an ASCII character using an
offset of 64. Would clearing/setting the bit be faster than using
integer subtraction/addition? Sadly this trick won't work for
the Sanger FASTQ format as the offset is 33, not 32.

Peter C.

Credit where due: This idea was based on a discussion with
Leighton Pritchard, where he suggested this could be why
Solexa opted for a 64 bit offset in particular.
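
The bit setting/clearing idea above can be checked with a small Python sketch. It only demonstrates that the two approaches give the same characters for scores 0 to 63, and that the offset of 33 breaks the trick. It says nothing about speed, which would have to be measured in C. The function names are hypothetical.

```python
ILLUMINA_OFFSET = 64  # 0x40, a single bit
SANGER_OFFSET = 33    # 0x21, two bits set


def encode_by_addition(score, offset=ILLUMINA_OFFSET):
    """Conventional encoding: add the offset to the score."""
    return chr(score + offset)


def encode_by_bit(score):
    """Set bit 6 instead of adding 64; only valid for 0 <= score <= 63."""
    if not 0 <= score <= 63:
        raise ValueError("bit trick only covers scores 0 to 63")
    return chr(score | ILLUMINA_OFFSET)


def decode_by_bit(char):
    """Clear bit 6 instead of subtracting 64; only valid for ASCII 64 to 127."""
    return ord(char) & ~ILLUMINA_OFFSET


# Setting the bit and adding 64 agree while the score stays below 64.
assert all(encode_by_bit(s) == encode_by_addition(s) for s in range(64))
assert all(decode_by_bit(encode_by_addition(s)) == s for s in range(64))

# With the Sanger offset of 33, OR and addition disagree once bits overlap.
sanger_mismatches = [
    s for s in range(94) if (s | SANGER_OFFSET) != s + SANGER_OFFSET
]
print(sanger_mismatches[:3])
```

Note that the same limitation would apply to negative Solexa scores, whose characters fall below ASCII 64 and so do not have that bit set.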