[Bioperl-l] Processing large fasta sequences through SeqIO

Jason Stajich jason@chg.mc.duke.edu
Sat, 1 Sep 2001 16:41:09 -0400 (EDT)


---559023410-851401618-999376869=:7819
Content-Type: TEXT/PLAIN; charset=US-ASCII

Josep - 
Tracked down the bug - it is in Bio::SeqIO::largefasta.pm

I wrote the attached test script to diagnose the problem, which turned out
to be a lovely infinite loop.  It appears this loop is what is filling up
your /tmp directory, hence the 'Too many links' error.
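
If you want to confirm that before changing anything, counting the
subdirectories in /tmp should show it.  This is only a rough sketch:
count_subdirs is a name I made up here (it is not part of bioperl), and it
assumes the leftover temp directories sit directly under the directory you
pass in.

```perl
#!/usr/bin/perl -w
use strict;

# count_subdirs is a hypothetical helper, not part of bioperl.
# It counts the immediate subdirectories of $dir.  Each real
# subdirectory adds one link to $dir, and running out of links is
# what produces the 'Too many links' error on mkdir.
sub count_subdirs {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "cannot open $dir: $!\n";
    my $count = grep {
        $_ ne '.' && $_ ne '..' && -d "$dir/$_" && ! -l "$dir/$_"
    } readdir($dh);
    closedir($dh);
    return $count;
}

# Directory to inspect: first command-line argument, else /tmp.
my $tmp = @ARGV ? $ARGV[0] : '/tmp';
print "$tmp contains ", count_subdirs($tmp), " subdirectories\n";
```

A count in the tens of thousands would fit the error you are seeing.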

You can do the following to fix your code without upgrading your local
bioperl installation (the real fix is so far only checked in to the bioperl
CVS repository).

Where you have a loop getting all the sequences from the seqio stream -
> while ( $seq = $seqio->next_seq ) 
change it to
> while ( ($seq = $seqio->next_seq) && $seq->length() > 0 )
The inner parentheses are needed because && binds more tightly than =.

This is of course a workaround, but should take care of things.  
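
To show the workaround loop stopping where the plain one would spin, here
is a small self-contained sketch.  FakeSeq and FakeStream are stand-ins I
made up for illustration (they are not bioperl classes); FakeStream only
imitates the symptom by handing back empty sequences forever once the real
ones are used up, which is my assumption about how the bug looks from the
caller's side.

```perl
#!/usr/bin/perl -w
use strict;

# FakeSeq and FakeStream are hypothetical stand-ins, not bioperl classes.
package FakeSeq;
sub new    { my ($class, $str) = @_; return bless { str => $str }, $class; }
sub length { my ($self) = @_; return CORE::length($self->{str}); }

package FakeStream;
sub new { my ($class, @seqs) = @_; return bless { queue => [@seqs] }, $class; }

# Imitates the symptom: once the real sequences run out, keep returning
# empty sequence objects instead of undef.
sub next_seq {
    my ($self) = @_;
    my $str = shift @{ $self->{queue} };
    return FakeSeq->new(defined $str ? $str : '');
}

package main;

# Collect the length of every non-empty sequence in the stream.
sub read_all {
    my ($seqio) = @_;
    my @lengths;
    my $seq;
    # The assignment is parenthesized so $seq holds the sequence object,
    # not the result of the && expression.
    while ( ($seq = $seqio->next_seq) && $seq->length() > 0 ) {
        push @lengths, $seq->length();
    }
    return @lengths;
}

my @lengths = read_all( FakeStream->new('cagt', 'gatagtgatagt') );
print "read ", scalar(@lengths), " sequences: @lengths\n";
```

With the two sequences above this prints "read 2 sequences: 4 12" and then
exits, because the first empty sequence ends the loop.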

Please let us know if the suggestion helps.

I have propagated this fix to branch-07 and main trunk.  Thanks for your
patience and I hope this helps you accomplish your task.

Attached is the test script for those interested in playing around with
this more.

-jason

--------------------------------------------------------------
On Fri, 31 Aug 2001, Josep Francesc Abril Ferrando wrote:

> Hi Jason,
> 
> > > Error in tempdir() using /tmp/XXXXXXXXXX: Could not create directory
> > > /tmp/Z0gD8R0rlB: Too many links at
> > > /usr/lib/perl5/site_perl/5.005//Bio/Root/IO.pm line 457
> >
> > Is your tmp dir really full of files/directories, or does it not have
> > enough space for the collection of all the sequence data?  This seems
> > like a system problem.
> 
> Currently, "/tmp" is only ~150Mb and I have more than 1Gb of free hard disk space (on a PC box with
> 386Mb of RAM, Red Hat 6.2 with kernel version 2.2.14, and perl 5.6.1). Maybe it could be a
> permissions issue.
> 
> > Do you have File::Temp installed?  There is a known bug in the 0.7
> > release: if you do not have File::Temp installed, the application will
> > not clean up its tempdirs/tempfiles.  Installing File::Temp will take
> > care of that.
> 
> It is installed and it is version 0.12. Do I have to include the corresponding "use File::Temp;" in
> the script ?
> Maybe I have to tell our sysadmin to update both, File::Temp and BioPerl.
> 
> > > If I look at the saved file, the sequence is OK (it does not have more
> > > or fewer nucleotides than expected and they are in the correct order)
> > > but the file contains a lot of empty lines (or lines with just '>')
> > > after the finished sequence. Any idea what could be wrong in the
> > > following script:
> >
> > Nothing obvious is jumping out right now by looking at your code -
> > How large are your files?
> 
> At the moment I am working with sequences around 50Mbp long, but I would like to be able to scale
> up to 250Mbp.
> 
> > > Is that the right way to use "Bio::SeqIO" for processing large fasta
> > > files? Do I have to include "Bio::Seq::LargeSeq" and, if yes, how can
> > > I do that?
> >
> > you could add the line
> > use Bio::Seq::LargeSeq;
> > just below --> use Bio::SeqIO <--
> > if you wanted, but it is included by the largefasta modules so it is
> > optional.
> 
> Well, I've run some tests, first including "use Bio::Seq::LargeSeq" and then also "use
> File::Temp", and I got the same results (the same error/warning, with only the name of the
> temporary directory that cannot be created changing, and the same trailing extra lines).
> 
> Thanks again... Josep F.
> 
> ________________________________________
> 
>     Josep Francesc ABRIL FERRANDO
> 
> RESEARCH GROUP on BIOMEDICAL INFORMATICS
>         GENOME INFORMATICS LAB
>               IMIM - UPF
>           C/ Dr. Aiguader 80
>        08003 - Barcelona  (SPAIN)
> 
>     Ph:  +34 93 2211009 ext 2016
>     Fax: +34 93 2213237
> 
>     http://www1.imim.es/~jabril/
> 


---559023410-851401618-999376869=:7819
Content-Type: APPLICATION/x-perl; name="josep_test.pl"
Content-Transfer-Encoding: 7BIT
Content-ID: <Pine.GSO.4.05.10109011641090.7819@peptide.mc.duke.edu>
Content-Description: 
Content-Disposition: attachment; filename="josep_test.pl"

#!/usr/bin/perl -w
use strict;
use File::Temp qw(tempfile tempdir);
use File::Path;

use Bio::Root::IO;
use Bio::Seq::LargePrimarySeq;
use Bio::SeqIO;
my $DEBUG = 1;

my ($dir,$fh,$filename,$io, $seqio, $seq);


# test File::Temp
( $dir)  = tempdir(CLEANUP => 1);
if( ! $dir ) { die "error getting tempdir\n"; }
( $fh, $filename) = tempfile(DIR => $dir);
if( ! $fh || ! $filename ) { die "error getting tempfile\n"; }

print $fh "testing output\n";
if( $DEBUG ) {
    print "filename is $filename, dir is $dir\n";
}

File::Path::rmtree($dir);

$fh = undef;
$filename = undef;
$dir = undef;

# test Bio::Root::IO
$io = new Bio::Root::IO(-verbose => $DEBUG );

($dir) = $io->tempdir(CLEANUP => 1);
if( ! $dir ) { die "error getting Root::IO tempdir\n"; }

($fh, $filename) = $io->tempfile(DIR => $dir);
if( ! $fh || ! $filename ) { die "error getting Root::IO tempfile\n"; }

$io->_io_cleanup();
undef $io;

if( -e $filename ) {
    print STDERR "cleanup by Root::IO did not work\n";
}
File::Path::rmtree($dir);

if( -e $dir  ) {
    print STDERR "cleanup by rmtree did not work\n";
}

# test Bio::Seq::LargePrimarySeq

$seq = new Bio::Seq::LargePrimarySeq(-id => 'test1',
                                     -seq => 'cagt');
$seq->add_sequence_as_string('GATAGTGATAGT');

if( lc $seq->subseq(1, 10) ne 'cagtgatagt') {
    die("error with Bio::Seq::LargePrimarySeq implementation");
}

$seq = undef;

# test Bio::SeqIO::largefasta in manner that Josep is using it

my @bases = qw(C A G T);
($dir) = tempdir(CLEANUP => 1);
my @files;
foreach ( 1..10 ) {
    my $sequence = '';
    foreach ( 1..3000 ) { $sequence .= $bases[ int rand(4)];   }

    ( $fh, $filename) = tempfile(DIR => $dir);
    print "new tmpfile is $filename\n";
    push @files, $filename;
    $seqio = new Bio::SeqIO(-fh => $fh, -format => 'fasta');

    $seq = new Bio::Seq::LargePrimarySeq(-id => "test_$_",
                                         -seq => $sequence);
    $seqio->write_seq($seq);
    $seqio = undef;
    close($fh);
}

print "about to process aggregate files\n";

( $fh, $filename) = tempfile( DIR => $dir);
my $seqout = new Bio::SeqIO(-fh => $fh, -format => 'largefasta');
my $bigseq = new Bio::Seq::LargePrimarySeq(-id => 'bigseq');

foreach my $file ( @files ) {
    print "processing file: $file\n";

    $seqio = new Bio::SeqIO(-file => $file, -format => 'largefasta');
    while( defined ( $seq = $seqio->next_seq) ) {
        $seqout->write_seq($seq);

        # this is to build a giant aggregate sequence
        # not sure if it is what Josep is really doing
        # have to play these games because cannot call
        # seq->seq() if seq is sufficently large
        # because entire seq may not fit into memory

        my $start = 1;
        my $length = $seq->length();
        while( $start < $length ) {
            $bigseq->add_sequence_as_string($seq->subseq($start,$start+999));
            $start += 1000;
        }
    }
}
---559023410-851401618-999376869=:7819--