[Biojava-l] Genbank feature parsing performance
Khalil El Mazouari
khalil.elmazouari at gmail.com
Fri Jun 17 16:21:43 UTC 2011
Hi,
Execution time for parsing Genbank, EMBL and EMBL-XML is roughly the same.
However, writing sequences in EMBL format was 87% slower than writing them in Genbank format.
Regards,
khalil
On 17 Jun 2011, at 12:36, Martin Jones wrote:
> Yes, this approach won't be much use if you are interested in the
> contents of every genbank record.
>
> Have you thought about parsing the gb files in parallel? In my
> experience, parsing genbank files scales quite nicely when done in
> multiple threads. I have used the GPars library for this type of job
> and it is very nice to use:
>
> http://gpars.codehaus.org/Parallelizer
>
>
> M
>
>
>
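For reference, the multi-threaded idea above might be sketched in plain Java with java.util.concurrent rather than GPars. This is only a rough outline under a few assumptions: the class and method names (ParallelRecordParser, splitRecords, parseAll) are made up for illustration, the per-record parser is passed in as a placeholder function standing in for the real (slow) feature parsing, and records are assumed to end with a line containing only "//", as in GenBank flat files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper, not part of BioJava or GPars.
final class ParallelRecordParser {

    // Split a multi-record flat file into single records.
    // Assumption: each record ends with a line containing only "//".
    static List<String> splitRecords(String flatFile) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : flatFile.split("\n")) {
            current.append(line).append('\n');
            if (line.trim().equals("//")) {
                records.add(current.toString());
                current.setLength(0);
            }
        }
        return records;
    }

    // Run the given parser on every record using a fixed-size thread pool.
    // Results are returned in the same order as the input records.
    static <T> List<T> parseAll(List<String> records,
                                Function<String, T> parser,
                                int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<T>> futures = new ArrayList<>();
            for (String record : records) {
                Callable<T> task = () -> parser.apply(record);
                futures.add(pool.submit(task));
            }
            List<T> results = new ArrayList<>();
            for (Future<T> future : futures) {
                results.add(future.get()); // blocks until this record is done
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The parser function should create whatever reader or parser objects it needs inside the task, so that nothing mutable is shared between threads.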
> On 17 June 2011 11:33, Khalil El Mazouari <khalil.elmazouari at gmail.com> wrote:
>> Thanks Martin,
>>
>> I already tried the regex. The performance increase was < 10%.
>>
>> My situation differs in two respects:
>> 1. The info to extract from the genbank file is always present.
>> 2. There are multiple features to extract from each record.
>>
>> I agree with you. Extracting a single field from a genbank file is much faster with a simple regex than with FeatureFilter.
>>
>> Regards,
>>
>> khalil
>>
>> On 17 Jun 2011, at 12:12, Martin Jones wrote:
>>
>>> Hi,
>>>
>>> I have had the same issue when parsing large sets of genbank files. In
>>> my case, the workaround was to first treat the whole genbank record as
>>> a string, and do a quick regex match to check if it contained
>>> something of interest (in my case I was searching for specific
>>> taxids):
>>>
>>> // First do a quick pattern match to extract the taxid so we can
>>> // exit early without the overhead of parsing the whole file.
>>> private final Pattern taxidPattern =
>>>         Pattern.compile("db_xref=\\\"taxon:(\\d+)");
>>> Matcher taxidMatcher = taxidPattern.matcher(currentRecord);
>>> if (taxidMatcher.find()) {
>>>     def taxid = taxidMatcher[0][1].toInteger()
>>>     if (!taxidList.contains(taxid)) {
>>>         return
>>>     }
>>>     // here do the slow part of actually parsing all the features
>>> }
>>>
>>>
>>> This is in Groovy so there are a few syntactical differences. If you
>>> are only interested in a subset of the GenBank records, then this
>>> approach might be of use.
>>>
>>> M
>>>
>>>
>>>
>>>
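For anyone reading along in plain Java, the same pre-filter might look roughly like the sketch below. The class and method names (TaxidPrefilter, isOfInterest) are made up for illustration and are not part of BioJava. As in the Groovy snippet, only the first db_xref taxon entry in the record is checked, and a record with no taxon entry is treated as not of interest.

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of BioJava.
final class TaxidPrefilter {

    // Matches e.g.  /db_xref="taxon:9606"  and captures the numeric taxid.
    private static final Pattern TAXID =
            Pattern.compile("db_xref=\"taxon:(\\d+)");

    // Returns true only if the raw record text contains a taxon entry
    // whose id is in the wanted set. Cheap compared with a full parse.
    static boolean isOfInterest(String rawRecord, Set<Integer> wanted) {
        Matcher m = TAXID.matcher(rawRecord);
        return m.find() && wanted.contains(Integer.valueOf(m.group(1)));
    }
}
```

The caller would then hand the record to the full parser only when isOfInterest returns true.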
>>> On 17 June 2011 10:16, Khalil El Mazouari <khalil.elmazouari at gmail.com> wrote:
>>>> Hi,
>>>>
>>>> I am developing an app where features are extracted from a large genbank file, and processed: multiple alignment, annotation....
>>>>
>>>> The feature extraction is a real bottleneck in my app. It consumes 87% of total execution time.
>>>>
>>>> Feature extraction is done via:
>>>>
>>>> FeatureFilter ff = new FeatureFilter.ByAnnotation(key, value);
>>>> FeatureHolder fh = richSequence.filter(ff);
>>>> Feature feat = fh.features().next();
>>>> ...
>>>>
>>>> Any suggestions on how to improve the performance of feature extraction are welcome.
>>>>
>>>> Thanks,
>>>>
>>>> khalil
>>>> _______________________________________________
>>>> Biojava-l mailing list - Biojava-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>
>>>>
>>
>>
>>