From biojava at hannes.oib.com Thu Jan 5 01:32:11 2012
From: biojava at hannes.oib.com (Hannes Brandstätter-Müller)
Date: Thu, 5 Jan 2012 07:32:11 +0100
Subject: [Biojava-l] BioJava in education
In-Reply-To:
References:
Message-ID:

I am considering creating a course/presentation series for our bioinformatics program at the Upper Austria University of Applied Sciences - presenting various bioinformatics libraries (bioperl, biopython, biojava...) so any material would be welcome!

Hannes

On Mon, Nov 7, 2011 at 15:48, Amr AL-Hossary wrote:
> Sorry for the late reply.
> Although I came into it on my own, I had the interest to voluntarily teach
> it to the Bioinformatics group of ITi (a famous IT institute here in Egypt).
> I focused on BioJava 1.7 and I may send you the presentation if you like.
> Unfortunately, this department was closed after 2 successive intakes, so I
> didn't update it.
>
> Amr
>
> --------------------------------------------------
> From: "Andreas Prlic"
> Sent: Friday, November 04, 2011 8:21 AM
> To: "Biojava"
> Subject: [Biojava-l] BioJava in education
>
>> Hi,
>>
>> Is anybody using BioJava in teaching, or has been introduced to
>> BioJava as part of a course?
>>
>> Andreas
>> _______________________________________________
>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biojava-l

From jw12 at sanger.ac.uk Thu Jan 5 08:44:26 2012
From: jw12 at sanger.ac.uk (Jonathan Warren)
Date: Thu, 5 Jan 2012 13:44:26 +0000
Subject: [Biojava-l] Registrations for DAS Workshop 2012
Message-ID: <3637D8C0-AF24-4D42-90E8-85346C5F706D@sanger.ac.uk>

DAS is currently being used to share annotations on genomes, protein alignments, structural and interaction information. If you are interested in sharing biological information the DAS workshop below may be of interest to you. Learn of and contribute to current developments in DAS such as: DAS in the cloud, DAS for Genotype Data, DAS searching, DAS for collaborative annotation projects, DAS alternative formats.
Registration is open for the 2012 DAS workshop (27-29 February) at the Genome Campus, Hinxton UK. If you are interested in attending, please find out more by going to http://www.ebi.ac.uk/training/onsite/120227_DAS.html and register via the web link at the bottom of the page. This workshop will cater for novice to expert DAS users as each day is optional. Please register early as places will be limited. Registration closes 10 February 2012 - 12:00.

If you are interested in giving a 15 minute talk on the second day please email Jonathan Warren using jonathan.warren at sanger.ac.uk

Many thanks
The Sanger/EBI DAS team.

Jonathan Warren
Senior Developer and DAS coordinator
blog: http://biodasman.wordpress.com/
jw12 at sanger.ac.uk
Ext: 2314
Telephone: 01223 492314

--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

From jdr0887 at renci.org Fri Jan 6 14:11:13 2012
From: jdr0887 at renci.org (Jason)
Date: Fri, 6 Jan 2012 14:11:13 -0500
Subject: [Biojava-l] trim adapter segment
Message-ID: <4F074751.5050803@renci.org>

Hi all,

I am new to BioJava...how would I go about trimming adapter segments?

Thanks,
Jason

From biojava at hannes.oib.com Mon Jan 9 02:54:49 2012
From: biojava at hannes.oib.com (Hannes Brandstätter-Müller)
Date: Mon, 9 Jan 2012 08:54:49 +0100
Subject: [Biojava-l] trim adapter segment
In-Reply-To: <4F074751.5050803@renci.org>
References: <4F074751.5050803@renci.org>
Message-ID:

Hi!

On Fri, Jan 6, 2012 at 20:11, Jason wrote:
> I am new to BioJava...how would I go about trimming adapter segments?

Could you give a more verbose example? I do not know what you are talking about.
Hannes

From biojava at hannes.oib.com Wed Jan 11 09:28:55 2012
From: biojava at hannes.oib.com (Hannes Brandstätter-Müller)
Date: Wed, 11 Jan 2012 15:28:55 +0100
Subject: [Biojava-l] FASTA Header Parser
Message-ID:

Hi there -

I just came across a puzzling "feature" of the GenericFastaHeaderParser.
It seems to throw away everything in the header after (and including) "length="
(see GenericFastaHeaderParser.java lines 71-76)

... Why?

Also, is there a Fasta Header Parser I can use that does not mess about with the header?

I really would like to have that as key (still working on my FASTA/QUAL parsing) and not having that (only in the originalHeader, not in the Hashmap key) really breaks stuff.

Hannes

From biojava at hannes.oib.com Wed Jan 11 09:45:54 2012
From: biojava at hannes.oib.com (Hannes Brandstätter-Müller)
Date: Wed, 11 Jan 2012 15:45:54 +0100
Subject: [Biojava-l] FASTA Header Parser
In-Reply-To:
References:
Message-ID:

nope, the header is in the hashmap in total, except for everything after length= -- there are whitespaces before that and these are still left in the header that is used as key.

either make it work like you say or even better, leave the header as-is. I need to quickly find the sequence, I don't want to iterate over all my 35k sequences and look up the original headers.

Hannes

On Wed, Jan 11, 2012 at 15:38, Scooter Willis wrote:
> It should parse until the first space as the unique id. Lots of extra info
> gets added in to the header. You should find a getOriginalHeader method that
> will preserve the contents of the header. I use this when writing the
> sequences back to disk to restore the original header.
>
> You can also do your own custom header parser which we use to support the
> known different fasta headers. If you have extra information in the header
> you can formally associate that with the sequence at the time of the parse.
> We can also add support for your header if it is standard output from a
> device.
>
> Thanks
>
> Scooter
>
>
> ----- Reply message -----
> From: "Hannes Brandstätter-Müller"
> To: "biojava-l"
> Subject: [Biojava-l] FASTA Header Parser
> Date: Wed, Jan 11, 2012 9:30 am
>
> Hi there -
>
> I just came across a puzzling "feature" of the GenericFastaHeaderParser.
> It seems to throw away everything in the header after (and including)
> "length="
> (see GenericFastaHeaderParser.java lines 71-76)
>
> ... Why?
>
> Also, is there a Fasta Header Parser I can use that does not mess
> about with the header?
>
> I really would like to have that as key (still working on my
> FASTA/QUAL parsing) and not having that (only in the originalHeader,
> not in the Hashmap key) really breaks stuff.
>
> Hannes

From HWillis at scripps.edu Wed Jan 11 09:38:21 2012
From: HWillis at scripps.edu (Scooter Willis)
Date: Wed, 11 Jan 2012 09:38:21 -0500
Subject: [Biojava-l] FASTA Header Parser
Message-ID:

It should parse until the first space as the unique id. Lots of extra info gets added in to the header. You should find a getOriginalHeader method that will preserve the contents of the header. I use this when writing the sequences back to disk to restore the original header.

You can also do your own custom header parser which we use to support the known different fasta headers. If you have extra information in the header you can formally associate that with the sequence at the time of the parse.

We can also add support for your header if it is standard output from a device.

Thanks

Scooter

----- Reply message -----
From: "Hannes Brandstätter-Müller"
To: "biojava-l"
Subject: [Biojava-l] FASTA Header Parser
Date: Wed, Jan 11, 2012 9:30 am

Hi there -

I just came across a puzzling "feature" of the GenericFastaHeaderParser.
It seems to throw away everything in the header after (and including) "length="
(see GenericFastaHeaderParser.java lines 71-76)

... Why?

Also, is there a Fasta Header Parser I can use that does not mess about with the header?

I really would like to have that as key (still working on my FASTA/QUAL parsing) and not having that (only in the originalHeader, not in the Hashmap key) really breaks stuff.

Hannes

From mictadlo at gmail.com Mon Jan 16 01:24:28 2012
From: mictadlo at gmail.com (Mic)
Date: Mon, 16 Jan 2012 16:24:28 +1000
Subject: [Biojava-l] Unique reads
Message-ID:

Hello,
I read in many papers that they made unique reads before the reads were aligned and later on the SNPs were called. However, I could not find out how they do it. Which tool can be used to do it?

Thank you in advance.

From mictadlo at gmail.com Mon Jan 16 08:01:56 2012
From: mictadlo at gmail.com (Mic)
Date: Mon, 16 Jan 2012 23:01:56 +1000
Subject: [Biojava-l] compare sequences
Message-ID:

Hello,
Is there a memory-efficient way to compare sequences like those from NGS?

Thank you in advance.

From p.j.a.cock at googlemail.com Mon Jan 16 08:30:01 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 16 Jan 2012 13:30:01 +0000
Subject: [Biojava-l] compare sequences
In-Reply-To:
References:
Message-ID:

On Mon, Jan 16, 2012 at 1:01 PM, Mic wrote:
> Hello,
> Is there a memory-efficient way to compare sequences like those from NGS?
>
> Thank you in advance.

Hi Mic,

Could you stop posting such broad questions to multiple mailing lists simultaneously please? Perhaps you would find Biostars Q&A more useful?
http://biostar.stackexchange.com/

See also: http://dx.doi.org/10.1371/journal.pcbi.1002202

Peter

From mictadlo at gmail.com Tue Jan 24 03:37:02 2012
From: mictadlo at gmail.com (Mic)
Date: Tue, 24 Jan 2012 18:37:02 +1000
Subject: [Biojava-l] Fastq benchmark
Message-ID:

Hello,
I have found the following benchmark ( http://biostar.stackexchange.com/questions/10376/how-to-efficiently-parse-a-huge-fastq-file/11279#11279 ) and I just wonder whether it is possible to make the Java example even faster?

Thank you in advance.

From HWillis at scripps.edu Tue Jan 24 07:08:22 2012
From: HWillis at scripps.edu (Scooter Willis)
Date: Tue, 24 Jan 2012 07:08:22 -0500
Subject: [Biojava-l] Fastq benchmark
In-Reply-To:
Message-ID:

You can try a FASTA version of the file to measure performance gain.

File file = new File("filename");
Boolean lazySequenceLoad = true;

LinkedHashMap<String, DNASequence> sequences =
    FastaReaderHelper.readFastaDNASequence(file, lazySequenceLoad);

This will go through and index the accession id and not load any sequence data, which avoids memory allocation and improves speed. You can then reference the DNASequence by name and when you need the sequence data it will use the file index to load the sequence data from the file for that specific sequence. The same approach can be applied to FASTQ files.

Scooter

On 1/24/12 3:37 AM, "Mic" wrote:

>Hello,
>I have found the following benchmark (
>http://biostar.stackexchange.com/questions/10376/how-to-efficiently-parse-a-huge-fastq-file/11279#11279
>)
>and I just wonder whether it is possible to make the Java example even faster?
>
>Thank you in advance.
From heuermh at gmail.com Tue Jan 24 13:00:58 2012
From: heuermh at gmail.com (Michael Heuer)
Date: Tue, 24 Jan 2012 12:00:58 -0600
Subject: [Biojava-l] Fastq benchmark
In-Reply-To:
References:
Message-ID:

Hello Mic,

That is an interesting benchmark, and you could probably squeeze a bit more performance out of fqextract.java by tweaking the data structures (e.g. provide expected size to the HashMap constructor, use ImmutableMap from Guava, etc.).

Using bioperl, biopython, bioruby, or biojava for this task will be much slower than just spitting out lines from a file since they are all validating the FASTQ format against the specification.

michael

On Tue, Jan 24, 2012 at 6:08 AM, Scooter Willis wrote:
> You can try a FASTA version of the file to measure performance gain.
>
> File file = new File("filename");
> Boolean lazySequenceLoad = true;
>
> LinkedHashMap<String, DNASequence> sequences =
> FastaReaderHelper.readFastaDNASequence(file, lazySequenceLoad);
>
> This will go through and index the accession id and not load any sequence
> data, which avoids memory allocation and improves speed. You can then reference
> the DNASequence by name and when you need the sequence data it will use
> the file index to load the sequence data from the file for that specific
> sequence. The same approach can be applied to FASTQ files.
>
> Scooter
>
> On 1/24/12 3:37 AM, "Mic" wrote:
>
>>Hello,
>>I have found the following benchmark (
>>http://biostar.stackexchange.com/questions/10376/how-to-efficiently-parse-a-huge-fastq-file/11279#11279
>>)
>>and I just wonder whether it is possible to make the Java example even faster?
>>
>>Thank you in advance.
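
A minimal sketch of the data-structure tweak suggested in the last reply (giving the hash-based ID collection its expected size up front), in plain Java with no BioJava dependency. The fqextract.java program from the benchmark is not reproduced here: the class name FastqExtractSketch, its method names, and the assumption that every record is exactly four lines are illustrative only. It also skips the format validation that the reply notes the Bio* libraries perform, so it is a sketch of the "just spit out lines" approach, not a replacement for a real FASTQ parser.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical class for illustration; not part of BioJava or the benchmark code.
public class FastqExtractSketch {

    /** Copies the wanted read IDs into a HashSet sized up front, so it does not rehash while filling. */
    public static Set<String> presizedIdSet(Collection<String> ids) {
        // A HashSet grows once its size exceeds capacity * 0.75 (the default load factor).
        int capacity = (int) (ids.size() / 0.75f) + 1;
        Set<String> wanted = new HashSet<>(capacity);
        wanted.addAll(ids);
        return wanted;
    }

    /** Returns each four-line record whose ID is in wanted; performs no FASTQ validation. */
    public static List<String> extract(BufferedReader in, Set<String> wanted) {
        List<String> hits = new ArrayList<>();
        try {
            String header;
            while ((header = in.readLine()) != null) {
                if (header.isEmpty()) {
                    continue; // tolerate blank lines between records
                }
                String sequence = in.readLine();
                String separator = in.readLine();
                String quality = in.readLine();
                if (quality == null) {
                    break; // truncated final record: stop rather than guess
                }
                // ID = header text after the leading '@', up to the first whitespace.
                String id = header.substring(1).split("\\s+", 2)[0];
                if (wanted.contains(id)) {
                    hits.add(String.join("\n", header, sequence, separator, quality));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hits;
    }

    public static void main(String[] args) {
        String fastq = "@r1 first read\nACGT\n+\nIIII\n@r2\nGGCC\n+\nFFFF\n";
        Set<String> wanted = presizedIdSet(Arrays.asList("r2"));
        List<String> hits = extract(new BufferedReader(new StringReader(fastq)), wanted);
        for (String record : hits) {
            System.out.println(record);
        }
    }
}
```

For a real file, the StringReader would be replaced by a buffered file reader. Whether presizing gives a measurable gain depends on how many IDs are loaded, so it is worth timing against the unsized version before relying on it.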