#316 closed defect (fixed)
spiffy hangs sometimes
Reported by: | felix winkelmann | Owned by: | |
---|---|---|---|
Priority: | critical | Milestone: | |
Component: | extensions | Version: | 4.5.x |
Keywords: | spiffy call-cc.org madness | Cc: | sjamaan, Mario Domenech Goulart |
Estimated difficulty: |
Description
There seem to be scheduler problems with Spiffy on call-cc.org, as it has hung up several times, now.
Suggested plan:
- try to reproduce it by hammering the server locally with ab or Siege.
- checkout the scheduler patches submitted recently by Jörg Wittenberger (spiffy on call-cc.org runs on 4.5.0, the diff is attached).
- possibly try Joerg's "high-load-scheduler" (needs to be updated to the newest version, first, since JW sent me some code recently that introduced some changes that haven't been integrated).
- Stay calm.
Attachments (2)
Change History (16)
Changed 14 years ago by
Attachment: | scheduler.diff added |
---|
comment:1 follow-up: 4 Changed 14 years ago by
What's the purpose of trying the high-load scheduler? call-cc is sustaining two hits per second max, usually much lower.
What I would be curious to see is the number of threads that were running at the time. If there were 1024 stuck threads, spiffy will refuse to spawn any more (it will block via mutex until a thread finishes).
Another less likely possibility is that it ran out of file descriptors and/or ports are not being closed properly on exception.
If you want to cut down on traffic, it is probably a good idea to forbid the robots from crawling the edit pages, the history pages, the git objects in chicken-core, chicken.wiki.br, and the https:// schema. That's a lot of duplicate or irrelevant bandwidth.
comment:2 Changed 14 years ago by
Pointed out the fd issue because there are a bunch of (file-exists?) errors in the logs, which could cause an FD leak (I have experienced hangs from this before). On the other hand there are only 418 total errors so there would have to be multiple leaks per error.
Maybe someone could run lsof on spiffy right now (I don't have root) and check for open FDs. Errors are still happening at least once a minute.
comment:3 follow-up: 6 Changed 14 years ago by
Keywords: | madness added |
---|
sjamaan confirmed there were 600+ files open due to a bug in svn-client, and fixed this issue.
Unfortunately we experienced another hang on :80 and later on :443, and we haven't gotten good debugging information on it yet.
I would recommend another lsof (to confirm FD leak issue is fixed) and strace to see if any system calls are being made or are in progress. Then maybe get a map of code pages and see if you can narrow down which library it is hanging in (maybe via core dump). Then I guess we try felix's idea of debugging libraries, because I'm running out of ideas.
Chickadee has not experienced this problem. Even though it's behind a caching proxy, which IIRC accesses spiffy over HTTP/1.0 (no persistent connections), it's still handled about 7000 hits in the last 6 days. It is running on Chicken 4.5.5 right now but I also ran it on 4.4.x and probably 4.3.x earlier.
comment:4 Changed 14 years ago by
Replying to zbigniew:
What's the purpose of trying the high-load scheduler? call-cc is sustaining two hits per second max, usually much lower.
It contains some potential bugfixes.
What I would be curious to see is the number of threads that were running at the time. If there were 1024 stuck threads, spiffy will refuse to spawn any more (it will block via mutex until a thread finishes).
In chicken 4.5.7 there is an internal procedure `(##sys#dump-heap-state)' which prints statistics to stderr about all objects in the heap.
Another less likely possibility is that it ran out of file descriptors and/or ports are not being closed properly on exception.
It wouldn't hang in such a case, it would generate a "too many open files" error.
comment:5 Changed 14 years ago by
The FD issue wasn't fixed. I have made a new commit that hopefully fixes it for good. I've updated and restarted spiffy.
comment:6 Changed 14 years ago by
Replying to zbigniew:
Unfortunately we experienced another hang on :80 and later on :443, and we haven't gotten good debugging information on it yet.
I would recommend another lsof (to confirm FD leak issue is fixed) and strace to see if any system calls are being made or are in progress. Then maybe get a map of code pages and see if you can narrow down which library it is hanging in (maybe via core dump). Then I guess we try felix's idea of debugging libraries, because I'm running out of ideas.
I think running spiffy with full debugging info is the best option. That way we can attach to it when it hangs and have full source-level debugging in gdb. It's just a matter of running spiffy via /home/felix/chicken-core-spiffy-debug/bin/csi
and keeping it running when it hangs.
comment:7 Changed 14 years ago by
Look what I found in the access log!
https://wiki.call-cc.org/eggref/3/web-scheme?action=show&rev=5821
For some reason this causes an error inside libsvn or the svn-client egg which causes it to hold open a reference to the tempfile. However, for other files the history functionality seems to work fine.
This revision is listed in the history page for eggref/3/web-scheme, so I'm not sure what's going on here. Something's fishy!
comment:8 follow-up: 9 Changed 14 years ago by
I could set up a qwiki instance for debugging on call-cc.org (Felix: I didn't use your home not to mess too much with it).
It's on /home/mario/wiki-debug
. Chicken is on /home/mario/wiki-debug/chicken
and it was compiled with DEBUGBUILD=1
. I've installed the latest trunk code for svn-client, spiffy and qwiki.
/home/mario/wiki-debug
contains a copy of the svn repo and the same directory structure used by spiffy (e.g., /home/mario/wiki-debug/etc
for /etc
, /home/mario/wiki-debug/var
for /var
and so on).
Peter and me run a debugging session today and found out that one of the problems is still fd leak, caused by svn-client (or the underlying libsvn-client).
We've run the server and stressed it with wget (wget -r -l5
). Using lsof, we observed the number of lines like
csi 7634 mario 961u REG 8,6 736 1020 /tmp/svnwiki-cat-QNZIrJ (deleted) csi 7634 mario 962u REG 8,6 701 1021 /tmp/svnwiki-cat-YQhjmY (deleted) csi 7634 mario 963u REG 8,6 11834 1022 /tmp/svnwiki-cat-9hY3od (deleted)
increased during the test. According to ulimit, the max number of open files is 1024, and spiffy died when the number of svnwiki-cat
files from lsof got close to 1024.
The latest messages before segmentation fault were:
svn: '/home/mario/wiki-debug/tmp/qwiki/hive' is not under version control svn: Can't open file '/home/mario/wiki-debug/usr/local/repos/chicken-eggs/db/revs/4718': Too many open files svn: Can't open file '/home/mario/wiki-debug/usr/local/repos/chicken-eggs/db/revs/4718': Too many open files svn: Can't open file '/home/mario/wiki-debug/tmp/qwiki/.svn/entries': Too many open files
If you want to make some more tests, feel free to use my home dir. Be sure to set the PATH
to get things from the debug dir:
PATH=/home/mario/wiki-debug/chicken/bin/:/home/mario/wiki-debug/bin:$PATH csi -s bin/spiffy
If you get permissions problems, feel free to change them.
comment:9 follow-up: 10 Changed 14 years ago by
Replying to mario:
I could set up a qwiki instance for debugging on call-cc.org (Felix: I didn't use your home not to mess too much with it).
It's on
/home/mario/wiki-debug
. Chicken is on/home/mario/wiki-debug/chicken
and it was compiled withDEBUGBUILD=1
. I've installed the latest trunk code for svn-client, spiffy and qwiki.
Have you installed all the extension from a local checkout, with the "-k" option? That is required to have full source-level debugging of all extensions.
Thanks for the setup. I will try it out. I'll need some time to figure out where all the files are. On what port is the spiffy instance listening?
comment:10 Changed 14 years ago by
Replying to felix:
Have you installed all the extension from a local checkout, with the "-k" option? That is required to have full source-level debugging of all extensions.
Unfortunately not. I installed them the usual way (chicken-install eggs
), except for spiffy, svn-client and qwiki, that I changed to the svn dir for each egg and installed them providing the .setup
file for chicken-install.
Thanks for the setup. I will try it out. I'll need some time to figure out where all the files are. On what port is the spiffy instance listening?
Let me try to summarize some things (consider prefix = /home/mario/wiki-debug
)
- the spiffy configuration is
prefix/etc/spiffy/spiffy.conf
- the qwiki configuration is
prefix/etc/spiffy/qwiki.conf
- logs go to
prefix/var/log/spiffy
We've tried setting spiffy's max-connections
parameter to 4, it doesn't change anything, it seems. It's still configured that way.
The server listens on port 8080.
You can start it by running
cd /home/mario/wiki-debug PATH=/home/mario/wiki-debug/chicken/bin/:/home/mario/wiki-debug/bin:$PATH csi -s bin/spiffy
It will load the spiffy and qwiki configuration from prefix/etc/spiffy/spiffy.conf
and
prefix/etc/spiffy/qwiki.conf
You can use
wget -r -l5 http://www.call-cc.org:8080
to stress the server.
I've made a copy of wiki-debug (to wiki-debug.bkp). Feel free to do whatever you want with wiki-debug
. I think you can even copy the whole dir to your home if you want (I _think_ you just need to change wiki-debug/chroot.scm
to point to the new dir -- haven't tested it).
I've set the permissions in a way that users from the chicken group can write to wiki-debug
.
comment:11 Changed 14 years ago by
I think the fix Peter applied to svn-client solved the crash problem.
I've installed the latest svn-client version on the wiki-debug environment and run the following script against spiffy:
wget -r -L -l5 http://wiki.call-cc.org:8080 & wget -r -L -l5 http://wiki.call-cc.org:8080 & wget -r -L -l5 http://wiki.call-cc.org:8080 & wget -r -L -l5 http://wiki.call-cc.org:8080 & wget -r -L -l5 http://wiki.call-cc.org:8080
(Nevermind the java-style for repeating code)
The server didn't crash after serving the total of 8400 requests. lsof didn't report leaking svnwiki-cats.
comment:12 Changed 14 years ago by
May this be really fixed? Wouldn't that be nice...
Note that there definitely is a scheduler bug (a short while ago reported by N. Pelletier on chicken-users), where thread-timeouts overflow after about 13 days. Since this bug exists for a long time and hasn't (with the one exception) ever be reported,
it may be a non-issue, but this is nevertheless a reminder. That particular problem is fixed in "experimental", but I recommend not to try it out light-heartedly, as it didn't get enough testing yet.
comment:13 Changed 14 years ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
Spiffy has been running for three days now without interruption. It has never been able to run that long while the svn-client bug was in there, so it's highly likely that was the cause of this particular hang. I think we can declare this ticket closed. If the scheduler bug causes a hang, we should start a new ticket as the problem would be caused a different bug even though the symptoms would (probably?) be the same.
git diff master..experimental scheduler.scm