From mauricio at open-bio.org Fri Feb 5 10:48:30 2010
From: mauricio at open-bio.org (Mauricio Herrera Cuadra)
Date: Fri, 05 Feb 2010 09:48:30 -0600
Subject: [Open-bio-l] Fwd: Changes to NCBI BLAST and E-utilities.
Message-ID: <4B6C3DCE.2070808@open-bio.org>

Forwarding to the proper lists...

-------- Original Message --------
Subject: [O|B|F Helpdesk #889] Changes to NCBI BLAST and E-utilities.
Date: Fri, 5 Feb 2010 10:08:51 -0500
From: mcginnis via RT
Reply-To: support at helpdesk.open-bio.org
To: chris at bioteam.net, heikki at sanbi.ac.za, hlapp at gmx.net, jason at bioperl.org, mauricio at open-bio.org

Fri Feb 05 10:08:51 2010: Request 889 was acted upon.
Transaction: Ticket created by mcginnis at ncbi.nlm.nih.gov
Queue: support at open-bio.org
Subject: Changes to NCBI BLAST and E-utilities.
Owner: Nobody
Requestors: mcginnis at ncbi.nlm.nih.gov
Status: new Ticket

Dear Colleague:

There are two changes I'd like to make you aware of.

As you may or may not have noticed, we have been working on a new C++ version of the BLAST binaries. In the coming months we will be moving the C++ binaries into prominence and (slowly) phasing out the C toolkit binaries. There are many changes, not least of which is a move to individual binaries for each program (blastn, blastp, etc).

We are not sure how many of your users use BioPerl with the BLAST binaries; my understanding is that many use BioPerl to do remote BLAST. However, there is a change to the BLAST results in Text and presumably HTML. This could have an effect on any parsers which scrape these formats and do not use XML. For obvious reasons, we want to support only the XML format for parsing, but we thought we should give you a heads up on this.
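As an illustration of the parser issue raised in this message, a text scraper can be made tolerant of both layouts by treating the colon after the label and the coordinates as optional. The following is only a minimal sketch in Python: the pattern and the function name `parse_alignment_line` are hypothetical and are not taken from BioPerl or any other toolkit, and XML remains the only format NCBI says it wants to support for parsing.

```python
import re

# Hypothetical pattern: label, optional colon, optional start coordinate,
# the aligned residues (letters, '*' or '-'), and an optional end coordinate.
ALIGN_LINE = re.compile(
    r"^(Query|Sbjct):?\s+(?:(\d+)\s+)?([A-Za-z*-]+)(?:\s+(\d+))?\s*$"
)


def parse_alignment_line(line):
    """Return (label, start, residues, end); start/end are None when absent."""
    match = ALIGN_LINE.match(line)
    if match is None:
        raise ValueError("not an alignment line: %r" % line)
    label, start, residues, end = match.groups()
    return (
        label,
        int(start) if start is not None else None,
        residues,
        int(end) if end is not None else None,
    )


# Old C toolkit style: colon after the label, coordinates on a gap-only line.
old_style = parse_alignment_line("Query: 3307 ---------- 3307")
# blast+ style: no colon, and no coordinates on a gap-only line.
new_style = parse_alignment_line("Query       ----------")
```

Making the coordinates optional means downstream code has to cope with `None`, which is usually preferable to a parser that fails on a gap-only line.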
blast 2.2.22

Query: 3307 ------------------------------------------------------------ 3307
Sbjct: 390  GSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGP 449

blast 2.2.22+

Query       ------------------------------------------------------------
Sbjct  390  GSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGP 449

A single line of gaps lacks the Query numbering in the blast+ output. The C version of blast has numbering in this case. Sample alignment shown below. According to users the blast+ output without the numbering breaks bioperl parsers. We have heard from a few but I think they may be older parsers?

The second issue is a policy concerning E-utilities. This was announced on the utilities-announce at ncbi.nlm.nih.gov mail-list but you may not have seen it.

As part of an ongoing effort to ensure efficient access to the Entrez Utilities (E-utilities) by all users, NCBI has decided to change the usage policy for the E-utilities effective June 1, 2010. Effective on June 1, 2010, all E-utility requests, either using standard URLs or SOAP, must contain non-null values for both the &tool and &email parameters. Any E-utility request made after June 1, 2010 that does not contain values for both parameters will return an error explaining that these parameters must be included in E-utility requests.

The value of the &tool parameter should be a URI-safe string that is the name of the software package, script or web page producing the E-utility request. The value of the &email parameter should be a valid e-mail address for the appropriate contact person or group responsible for maintaining the tool producing the E-utility request. NCBI uses these parameters to contact users whose use of the E-utilities violates the standard usage policies described at http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html#UserSystemRequirements.
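For scripts that build E-utility URLs by hand, the requirement above amounts to always appending non-empty `tool` and `email` values to the query string. The sketch below shows one way to enforce that in Python. The helper `build_eutils_url`, the base URL, and the `esearch.fcgi` endpoint are assumptions made for illustration and do not come from the announcement; the e-mail address is a placeholder. No request is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Assumed base URL for the E-utilities; not stated in the announcement itself.
EUTILS_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_eutils_url(endpoint, tool, email, **params):
    """Build an E-utility URL, refusing to omit the tool or email values."""
    if not tool or not email:
        raise ValueError("both tool and email must be non-empty")
    query = dict(params, tool=tool, email=email)
    # urlencode percent-escapes the values, which keeps them URI-safe.
    return EUTILS_BASE + endpoint + "?" + urlencode(sorted(query.items()))


# Hypothetical script name and contact address.
url = build_eutils_url(
    "esearch.fcgi", "my_sff_script", "maintainer@example.org",
    db="pubmed", term="454 sequencing",
)
fields = parse_qs(urlparse(url).query)
```

Centralising URL construction in one helper means a script cannot accidentally issue a request without the two parameters.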
These usage policies are designed to prevent excessive requests from a small group of users from reducing or eliminating the wider community's access to the E-utilities. NCBI will attempt to contact a user at the e-mail address provided in the &email parameter prior to blocking access to the E-utilities.

NCBI realizes that this policy change will require many of our users to change their code. Based on past experience, we anticipate that most of our users should be able to make the necessary changes before the June 1, 2010 deadline. If you have any concerns about making these changes by that date, or if you have any questions about these policies, please contact eutilities at ncbi.nlm.nih.gov.

Thank you for your understanding and cooperation in helping us continue to deliver a reliable and efficient web service.

I think you already adhere to this policy, but should a user's script not meet these requirements, then the script will fail and requests will be turned away with an error message.

Scott D. McGinnis M.A.
NCBI/NLM/NIH
45 Center Drive, MSC 6511
Bldg 45, Room 4AN.44C
Bethesda, MD 20892
mcginnis at ncbi.nlm.nih.gov

From biopython at maubp.freeserve.co.uk Mon Feb 8 19:59:37 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 9 Feb 2010 00:59:37 +0000
Subject: [Open-bio-l] [Biojava-l] .sff support
In-Reply-To: <56be91b61002081324t3423359dm917b283c6a1f2474@mail.gmail.com>
References: <4B703AF3.4000300@imbusch.net> <56be91b61002081324t3423359dm917b283c6a1f2474@mail.gmail.com>
Message-ID: <320fb6e01002081659u793228d1g17abb4f8e0100837@mail.gmail.com>

> 2010/2/8 Charles Imbusch
>> Hello,
>>
>> I have been wondering whether Biojava is able to
>> handle sff files coming from 454 sequencing runs.
>>
>> I found something here:
>> http://lists.open-bio.org/pipermail/biojava-dev/2009-July/003907.html
>>
>> Does somebody know about the current status on Biojava and sff files?
>>
>>
>> Thanks in advance,
>> Charles

On Mon, Feb 8, 2010 at 9:24 PM, Paolo Pavan wrote:
>
> Unfortunately, after spending some time on it, I didn't anything, sorry.
> There is just a post more I sent to Andreas Prlic without enclose the list
> by mistake, in which I report a few info more, coming from my reading on
> BioPerl's way to manage contigs and assembly informations.
> Nothing more.
>
> Paolo

Hi,

I've CC'd the common OpenBio mailing list as this is probably of interest beyond just BioJava.

Based on code from Jose Blanca (author of sff_extract), I implemented support for the SFF (Roche 454) sequencing reads for Biopython last year on a branch that I hope to merge into our next release, currently here:
http://github.com/peterjc/biopython/tree/sff-seqio

In addition to the Roche Manuals (which may not be that easy to get a copy of), the SFF format is described on this NCBI webpage:
http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=formats#sff

I'm happy to answer questions on how the file format works (including the undocumented index block which I had to reverse engineer).

Peter

P.S. Just to clarify (from the old BioJava thread), the SFF file just holds the raw reads - it is an input file for doing an assembly or mapping.

From hlapp at drycafe.net Sat Feb 13 18:02:32 2010
From: hlapp at drycafe.net (Hilmar Lapp)
Date: Sat, 13 Feb 2010 18:02:32 -0500
Subject: [Open-bio-l] Registration Open for Conference on Informatics for Phylogenetics, Evolution, and Biodiversity (iEvoBio)
Message-ID: <41DDC850-053B-451E-9C5C-8F4ED6539996@drycafe.net>

Registration is now open for the inaugural conference on Informatics for Phylogenetics, Evolution, and Biodiversity (iEvoBio), at http://www.evolutionsociety.org/SSE2010/Register.html .
iEvoBio aims to bring together biologists working in evolution, systematics, and biodiversity, with software developers, and mathematicians, both to develop new tools, and to increase awareness of existing technologies (ranging from standards and reusable toolkits to mega-scale data analysis to rich visualization).

The 2-day meeting will take place June 29-30, 2010, in Portland, OR, jointly with the Evolution Meetings as a satellite conference. The event will feature traditional elements, including a keynote presentation at the beginning of each day and contributed talks, as well as more dynamic and interactive elements, including a challenge, lightning talk-style sessions, a software bazaar, and Birds-of-a-Feather gatherings.

Attendees can register jointly for Evolution and iEvoBio at a discount, or only for the iEvoBio conference. For more information about registration, venue, travel, or accommodations visit the Evolution 2010 website at http://www.evolutionsociety.org/SSE2010/.

More details about the program and guidelines for contributing content are available at http://ievobio.org. You can also find continuous updates on the conference's Twitter feed at http://twitter.com/iEvoBio.

iEvoBio is sponsored by the US National Evolutionary Synthesis Center (NESCent) in partnership with the Society of Systematic Biologists (SSB). Additional support has been provided by the Encyclopedia of Life (EOL).
The iEvoBio 2010 Organizing Committee:
Rod Page (University of Glasgow)
Cecile Ane (University of Wisconsin at Madison)
Rob Guralnick (University of Colorado at Boulder)
Hilmar Lapp (NESCent)
Cynthia Parr (Encyclopedia of Life)
Michael Sanderson (University of Arizona)

From biopython at maubp.freeserve.co.uk Mon Feb 22 06:35:11 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 22 Feb 2010 11:35:11 +0000
Subject: [Open-bio-l] [Biojava-l] .sff support
In-Reply-To: <4B79CB7C.3040008@imbusch.net>
References: <4B703AF3.4000300@imbusch.net> <56be91b61002081324t3423359dm917b283c6a1f2474@mail.gmail.com> <320fb6e01002081659u793228d1g17abb4f8e0100837@mail.gmail.com> <4B79CB7C.3040008@imbusch.net>
Message-ID: <320fb6e01002220335x6899a44bl68789cd4d7d772e3@mail.gmail.com>

On Mon, Feb 15, 2010 at 10:32 PM, Charles Imbusch wrote:
>
> Hi all,
>
> I've been playing around with the sff file based on the file
> format definition at NCBI.
> I uploaded the output which includes the common header,
> the read header and read data section for the first read
> of that file.
>
> http://home.arcor.de/cimbusch/output.txt

Looks like you've been making excellent progress :)

Sorry for the delay in my reply, I was on leave last week (and without internet access for most of it).

>> I'm happy to answer questions on how the file format works
>> (including the undocumented index block which I had to reverse
>> engineer).
>>
>
> Yes, I would like to know how that works.
> index_magic_number:778921588 .mft
> version:1.00
> Couldn't find anything about ".mft" version 1.

I believe ".mft" stands for "Manifest format", and Roche 454 use this block to hold both a read index and an XML string (the manifest). Immediately after the ".mft1.00" string are two longs which give the lengths of the XML string and the actual index data. Then comes the XML manifest string, followed by the actual index data (same format as Roche's older ".srt" index only block, uses base 256).
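The block layout described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two "longs" are taken to be 4-byte big-endian unsigned integers (in keeping with the rest of the SFF format), and the function name `split_mft_block` is hypothetical. It also shows why the reported index_magic_number of 778921588 is simply the four ASCII bytes ".mft" read as a big-endian integer.

```python
import struct

# The magic number quoted in the thread is the ASCII text ".mft" as an integer.
MFT_MAGIC = int.from_bytes(b".mft", "big")


def split_mft_block(block):
    """Split a ".mft1.00" index block into (xml_manifest, index_data).

    Assumes the two length fields are 4-byte big-endian unsigned integers.
    """
    magic, version, xml_size, data_size = struct.unpack(">4s4sLL", block[:16])
    if magic != b".mft" or version != b"1.00":
        raise ValueError("not a .mft version 1.00 index block")
    xml_end = 16 + xml_size
    return block[16:xml_end], block[xml_end:xml_end + data_size]


# Tiny synthetic block: 8-byte tag, two lengths, XML string, index bytes.
demo_xml = b"<manifest/>"
demo_index = b"abc"
demo_block = (
    b".mft1.00"
    + struct.pack(">LL", len(demo_xml), len(demo_index))
    + demo_xml
    + demo_index
)
xml_part, index_part = split_mft_block(demo_block)
```

The synthetic block only demonstrates the framing; a real manifest and index would be read from the offset given in the SFF common header.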
Note the Biopython SFF code has now been merged into our trunk:
http://github.com/biopython/biopython/blob/master/Bio/SeqIO/SffIO.py

> At the moment I have two classes: sffParser and sffFile
> My idea was that sffParser can hold one or multiple sff files.
> Each instance of sffFile has a hashtable with the identifiers
> as keys and the filepointers are stored as the values.

Not all SFF files will have an index, but the Roche .srt and .mft index blocks will let you map from the ID to the offset. I take advantage of this in Biopython for our Bio.SeqIO.index(...) functionality with a slower fall back on scanning the file to build the index if the index information is missing (or in an unsupported format). The Biopython index code then uses a Python dictionary (hash) to hold the mapping from read name to file offset.

See also:
http://github.com/biopython/biopython/blob/master/Bio/SeqIO/_index.py

> Now I would like to find a good representation of one single "read" object,
> which shall be accessible with an identifier like EV5RTWS02JXUUH

I think this is a Java question, so not my area of expertise.

Peter

From biopython at maubp.freeserve.co.uk Fri Feb 26 08:33:19 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 26 Feb 2010 13:33:19 +0000
Subject: [Open-bio-l] [Biojava-l] .sff support
In-Reply-To: <4B864C26.3050709@imbusch.net>
References: <4B703AF3.4000300@imbusch.net> <56be91b61002081324t3423359dm917b283c6a1f2474@mail.gmail.com> <320fb6e01002081659u793228d1g17abb4f8e0100837@mail.gmail.com> <4B79CB7C.3040008@imbusch.net> <320fb6e01002220335x6899a44bl68789cd4d7d772e3@mail.gmail.com> <4B864C26.3050709@imbusch.net>
Message-ID: <320fb6e01002260533y148936fdg36a5c8c814deb141@mail.gmail.com>

On Thu, Feb 25, 2010 at 10:08 AM, Charles Imbusch wrote:
>
> Dear Peter,
>
> thanks for your mail. I will try to make use of that index
> to speed things up when I have time available.
>
> Cheers,
> Charles

Hi Charles,

I've found that when you want random access to the reads, loading the provided .mft or .srt index is MUCH faster than scanning the whole file to build the index manually. So this really is worth the effort.

I hope the comments in my code are reasonably clear, but to recap, the key idea of the index block is that you get chunks of data of varying length (although typically all the same length, since by default all the Roche reads have the same name length) like this:

name, null char, four character offset, terminator char of 0xFF

You divide the index block into entries for each read by finding the 0xFF terminators. Because 0xFF (decimal 255) is used in this way, it cannot be used to encode the offsets, which must only use 0x00 to 0xFE (decimal 0 to 254). The offset therefore uses base 255 instead of base 256.

Note that this means that the largest offset the current Roche index blocks can hold is 255^4, or a little under 4GB. If you use the Roche tools to try and merge SFF files to make an example SFF file over 4GB you get a warning that there will be no index (and no manifest).

The index holds the reads sorted alphabetically by name. We don't take advantage of this in Biopython since I use a Python dictionary (like a Perl hash) to store the offsets.
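The entry layout and the base 255 offsets described above can be sketched as follows. This is a minimal illustration rather than the Biopython code: the function names `decode_offset` and `parse_index_data` are hypothetical, the most significant offset digit is assumed to come first, and any bytes after the final 0xFF terminator are assumed to be padding and are ignored.

```python
def decode_offset(four_bytes):
    """Decode a four-digit base 255 offset (most significant digit first)."""
    d0, d1, d2, d3 = four_bytes  # each digit is in the range 0..254
    return ((d0 * 255 + d1) * 255 + d2) * 255 + d3


def parse_index_data(data):
    """Map read name -> file offset from Roche style index data."""
    offsets = {}
    # Assumption: anything after the last 0xFF terminator is padding.
    body = data[: data.rfind(b"\xff") + 1]
    for entry in body.split(b"\xff"):
        if not entry:
            continue  # the empty piece after the final terminator
        # Layout: name, null char, four offset characters.
        if len(entry) < 6 or entry[-5] != 0:
            raise ValueError("malformed index entry: %r" % entry)
        offsets[entry[:-5].decode("ascii")] = decode_offset(entry[-4:])
    return offsets


# Two synthetic entries plus two padding bytes; the read names are made up.
demo_data = (
    b"READ1\x00" + b"\x00\x00\x03\xeb" + b"\xff"    # 3 * 255 + 235 = 1000
    + b"READ2\x00" + b"\xfe\xfe\xfe\xfe" + b"\xff"  # largest encodable value
    + b"\x00\x00"
)
demo_offsets = parse_index_data(demo_data)
# demo_offsets == {"READ1": 1000, "READ2": 4228250624}
```

The largest value four such digits can hold is 255**4 - 1 = 4228250624, which is where the limit of roughly 4GB mentioned above comes from.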
In case you missed them, I'd like to draw your attention to the SFF files I am using in the Biopython unit tests:
http://github.com/biopython/biopython/tree/master/Tests/Roche/

Regards,

Peter

From charles at imbusch.net Thu Feb 25 05:08:38 2010
From: charles at imbusch.net (Charles Imbusch)
Date: Thu, 25 Feb 2010 11:08:38 +0100
Subject: [Open-bio-l] [Biojava-l] .sff support
In-Reply-To: <320fb6e01002220335x6899a44bl68789cd4d7d772e3@mail.gmail.com>
References: <4B703AF3.4000300@imbusch.net> <56be91b61002081324t3423359dm917b283c6a1f2474@mail.gmail.com> <320fb6e01002081659u793228d1g17abb4f8e0100837@mail.gmail.com> <4B79CB7C.3040008@imbusch.net> <320fb6e01002220335x6899a44bl68789cd4d7d772e3@mail.gmail.com>
Message-ID: <4B864C26.3050709@imbusch.net>

Dear Peter,

thanks for your mail. I will try to make use of that index to speed things up when I have time available.

Cheers,
Charles

Peter wrote:
> I believe ".mft" stands for "Manifest format", and Roche 454 use this
> block to hold both a read index and an XML string (the manifest).
> Immediately after the ".mft1.00" string are two longs which give the
> lengths of the XML string and the actual index data. Then comes
> the XML manifest string, followed by the actual index data (same
> format as Roche's older ".srt" index only block, uses base 256).
>
> Note the Biopython SFF code has now been merged into our trunk:
> http://github.com/biopython/biopython/blob/master/Bio/SeqIO/SffIO.py
>