[Biopython] About BLAST parser

Thu Oct 22 07:10:11 EDT 2009

Thanks very much for your help and suggestions! I think I'll manage  
from here on!
Manu

On Oct 22, 2009, at 1:51 PM, Peter wrote:

> On Thu, Oct 22, 2009 at 11:34 AM, Manu Tamminen <mavata at gmail.com> wrote:
>>
>> With all BLAST hits included, the output file is around 1 gigabyte.
>> Therefore just opening and searching for the broken parts is challenging
>> with regular text editors. Furthermore, I'm not very familiar with XML
>> syntax and therefore would probably not recognize the broken parts.
>
> There is probably a neat way to extract a chunk using Unix command
> line tools. Or just try something like this in Python:
>
> error_line = 82921
> input_handle = open("really_big.xml")
> output_handle = open("fragment.txt", "w")
> for line_number, line in enumerate(input_handle):
>     # Keep only the lines within 1000 lines of the reported error
>     if error_line - 1000 < line_number < error_line + 1000:
>         output_handle.write(line)
> input_handle.close()
> output_handle.close()
>
> I would still suggest you re-try copying it from the cluster to your
> machine, in case it was just a network error corrupting the file.
>
>> Breaking down the search into smaller parts sounds like a good idea.
>> However, I'm also considering writing a more robust script. Would it be
>> possible to make the script ignore the broken entries in the XML file and
>> skip to the next correct one?
>
> I think that will be tricky. Part of the idea of XML is that it is a
> strictly defined file format, with standards saying how parsers should
> interpret it and abort on bad data. Tolerant XML parsers are considered
> to be a bad thing.
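
To illustrate the strictness described above, here is a minimal sketch using
Python's standard library xml.etree.ElementTree (not the Biopython BLAST
parser itself). The helper name is_well_formed and the two small <Hit>
strings are made up for this example; the point is only that a conforming
parser rejects a truncated document rather than guessing.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Hypothetical fragments, loosely modelled on BLAST XML element names
good = "<Hit><Hit_num>1</Hit_num></Hit>"
bad = "<Hit><Hit_num>1</Hit_num>"  # truncated: the closing </Hit> is missing

print(is_well_formed(good), is_well_formed(bad))
```
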
>
> What should be possible is a simple script that removes the broken
> section of the file, giving a partial but valid XML file covering most
> of the sequences. It might be more effort than just re-doing the search
> (in parts this time).
>
> Peter
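
A rough sketch of the kind of clean-up script suggested above, written for
Python 3. It rests on several assumptions that should be checked against the
real file: that each query's results sit in an <Iteration> ... </Iteration>
block with those tags on lines of their own, that the header before the first
<Iteration> is intact, and that the document closes with
</BlastOutput_iterations> and </BlastOutput>. The function name
salvage_blast_xml is made up for this example. Each block is parsed on its
own, so only one block is held in memory at a time.

```python
import xml.etree.ElementTree as ET


def salvage_blast_xml(in_path, out_path):
    """Copy in_path to out_path, dropping <Iteration> blocks that fail to parse.

    Returns a (kept, dropped) tuple of block counts.
    """
    kept = dropped = 0
    block = None            # lines of the <Iteration> block being collected
    seen_iteration = False  # becomes True at the first <Iteration> line
    with open(in_path, errors="replace") as src, open(out_path, "w") as dst:
        for line in src:
            tag = line.strip()
            if tag == "<Iteration>":
                if block is not None:
                    dropped += 1  # previous block was never closed
                block = [line]
                seen_iteration = True
            elif block is not None:
                block.append(line)
                if tag == "</Iteration>":
                    try:
                        ET.fromstring("".join(block))
                    except ET.ParseError:
                        dropped += 1
                    else:
                        dst.writelines(block)
                        kept += 1
                    block = None
            elif not seen_iteration:
                dst.write(line)  # header lines, copied unchanged
            # Anything else (footer or stray text between blocks) is skipped
        if block is not None:
            dropped += 1  # the file ended in the middle of a block
        # Write the closing tags ourselves in case the original footer is gone
        dst.write("</BlastOutput_iterations>\n</BlastOutput>\n")
    return kept, dropped
```

Calling salvage_blast_xml("really_big.xml", "salvaged.xml") would then report
how many blocks were kept and dropped. Queries in the dropped blocks would
still need to be searched again.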