Opened 14 years ago
Closed 14 years ago
#340 closed defect (fixed)
Spiffy hangs occasionally (again!)
Reported by: | sjamaan | Owned by: | |
---|---|---|---|
Priority: | critical | Milestone: | |
Component: | wiki | Version: | 4.5.x |
Keywords: | spiffy, wiki, hang, call-cc.org, infrastructure | Cc: | Mario Domenech Goulart, felix winkelmann, sjamaan |
Estimated difficulty: |
Description
A new problem has cropped up. One of the Spiffies on call-cc.org (the one listening on HTTP) was no longer accepting connections. I checked with lsof, but it had a normal number of file handles open. The HTTP socket was in state LISTEN.
I had to restart it, with no clue as to what was going on. It had been running since 12 August, so 7 days.
Change History (19)
comment:1 Changed 14 years ago by
Milestone: | 4.6.0 |
---|
comment:2 follow-up: 3 Changed 14 years ago by
I'll try to run something like recursive wgets (maybe parallel) in a loop until spiffy hangs.
Any suggestions for the debug environment besides using DEBUGBUILD=1 for chicken?
comment:3 follow-up: 4 Changed 14 years ago by
Replying to mario:
I'll try to run something like recursive wgets (maybe parallel) in a loop until spiffy hangs.
Any suggestions for the debug environment besides using DEBUGBUILD=1 for chicken?
Right, DEBUGBUILD=1 and the eggs should be installed from a local repository checkout, with
chicken-install -k -l <localdir> -t local
This will keep the compiled .c files on the local filesystem and we will be able to debug all code with gdb.
It's a bit brute force, but gives us maximum introspection. If the server hangs, we can attach via gdb. IIRC, the hanging spiffy does not bog down the CPU, correct? Then you can just keep it running and I can attach when I find the time. Or, if the behaviour and the testing are reproducible, I can run it myself.
comment:4 Changed 14 years ago by
Replying to felix:
IIRC, the hanging spiffy does not bog down the CPU, correct?
Indeed. It was not consuming any CPU before I restarted it.
comment:5 Changed 14 years ago by
As a side note regarding the 13-day scheduler issue: spiffy has been running on parenteses.org for more than 13 days now:
$ date
Mon Aug 23 09:03:47 EDT 2010
$ ps -ef | grep [a]wful
spiffy 31816 1 0 Aug09 ? 00:04:57 awful /etc/awful/awful.conf
It has served ~5000 requests within that period.
comment:6 Changed 14 years ago by
We just had another hang. strace(1) showed us this output:
select(1024, [], [], NULL, NULL
I don't know why the select call isn't finished in strace. Buffering? Anyway, it looks like it's doing a select() call on an empty set of file descriptors.
comment:7 Changed 14 years ago by
I just tried this code on the call-cc.org box and it hung indeed:
#include <sys/select.h>
#include <stdio.h>

int main(void)
{
  fd_set in, out;
  FD_ZERO(&in);
  FD_ZERO(&out);
  /* Both sets are empty and the timeout is NULL, so this never returns. */
  select(1024, &in, &out, NULL, NULL);
  return 0;
}
It also hangs on NetBSD, so I guess Chicken would need a test to see if the sets are both empty before actually calling into select().
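The kind of check suggested here could look roughly like the following in C. This is only a sketch, not Chicken's actual code: `fdset_is_empty` and `guarded_select` are hypothetical names, and returning -1 with `EINVAL` is an assumed convention for "nothing to wait for".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>

/* Hypothetical helper: returns 1 if no descriptor below nfds is set. */
int fdset_is_empty(fd_set *set, int nfds)
{
    for (int fd = 0; fd < nfds; fd++) {
        if (FD_ISSET(fd, set))
            return 0;
    }
    return 1;
}

/*
 * Hypothetical wrapper: refuse to block forever on nothing.
 * With a NULL timeout and two empty sets, select() can never return,
 * so report an error (EINVAL is an assumed choice) instead of calling it.
 */
int guarded_select(int nfds, fd_set *in, fd_set *out, struct timeval *timeout)
{
    assert(nfds >= 0 && nfds <= FD_SETSIZE);
    if (timeout == NULL && fdset_is_empty(in, nfds) && fdset_is_empty(out, nfds)) {
        errno = EINVAL;
        return -1;
    }
    return select(nfds, in, out, NULL, timeout);
}
```

With a non-NULL timeout the call is passed through unchanged, since select() on empty sets then simply sleeps and returns 0.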
comment:8 Changed 14 years ago by
Owner: | set to felix winkelmann |
---|---|
Status: | new → assigned |
Likely to be a scheduler bug. See ##sys#fdset-select-timeout and thereabouts. I will try to add code to catch this situation. Is there a way to count the descriptors in an fd_set?
Can't someone just reimplement the damn thing?
comment:9 Changed 14 years ago by
There doesn't seem to be a way to count them except by looping from zero to FD_SETSIZE and doing an FD_ISSET check on each.
You can't even compare with FD_ZERO because the implementation is (allowed to be) an array of elements, which means you'd just be comparing pointers.
We could keep a shadow copy around in Scheme as a list and use that to count the set size... Checking for the empty list is simple enough. Keeping it synced shouldn't be hard, but of course it's a little annoying.
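For illustration, the shadow-copy idea can also be rendered on the C side with a plain counter instead of a Scheme list. This is a sketch under that assumption; the `counted_fdset` type and its functions are hypothetical and not part of Chicken.

```c
#include <assert.h>
#include <sys/select.h>

/* Hypothetical type: an fd_set plus a shadow count of its members. */
struct counted_fdset {
    fd_set set;
    int count;
};

void counted_fdset_clear(struct counted_fdset *s)
{
    FD_ZERO(&s->set);
    s->count = 0;
}

/* Add fd; the count only changes if fd was not already a member. */
void counted_fdset_add(struct counted_fdset *s, int fd)
{
    assert(fd >= 0 && fd < FD_SETSIZE);
    if (!FD_ISSET(fd, &s->set)) {
        FD_SET(fd, &s->set);
        s->count++;
    }
}

/* Remove fd; the count only changes if fd was a member. */
void counted_fdset_remove(struct counted_fdset *s, int fd)
{
    assert(fd >= 0 && fd < FD_SETSIZE);
    if (FD_ISSET(fd, &s->set)) {
        FD_CLR(fd, &s->set);
        s->count--;
    }
}
```

Checking for emptiness then reduces to `s.count == 0`. Note that select() overwrites the sets it is given, so the count describes the set of interest rather than the result of a call.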
comment:10 Changed 14 years ago by
I've run a simplified stress test here. Just spiffy serving some static files (the Emacs manual).
I couldn't make spiffy hang after more than 225 thousand requests (~1.5h of stress -- ten recursive wgets running in parallel).
I used chicken 4.5.0 built with DEBUGBUILD=1 and spiffy 4.8.
comment:11 Changed 14 years ago by
Tested again using (force-implementation 'read-write) for sendfile. No hang after more than 150K requests.
comment:12 Changed 14 years ago by
Tested again using (tcp-buffer-size 8192) for tcp. No hang after more than 150K requests.
comment:13 Changed 14 years ago by
Tested again using (tcp-buffer-size 512) for tcp. No hang after more than 50K requests.
comment:14 Changed 14 years ago by
I have finally started to take a look at the scheduler. I haven't a clue why this situation crops up. Apparently there are entries in the "fd-list" (the list that holds FDs and the threads waiting for them) for which no bits were set in the fdsets. The bits are set just once, when the thread is blocked for I/O, and then retained. The operations to keep an fdset and later restore it may be broken (I just use assignment, but this may be invalid for certain implementations of fdsets), or the fdset and the fdlist are getting out of sync for some other reason.
I'm trying to provide a patch that simplifies the fdset handling. With it, the fdset is always initialized from scratch from the fdlist before doing the select(2) call. It isn't working yet.
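As a rough C illustration of rebuilding the sets from the list before every select(2) call (the real scheduler is written in Scheme, and this is not the patch itself): `struct fd_waiter` and `build_fdsets` are hypothetical names standing in for an fd-list entry and the rebuild step.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>

/* Hypothetical stand-in for one entry of the scheduler's fd-list. */
struct fd_waiter {
    int fd;
    int wants_read;   /* nonzero: a thread is waiting for input */
    int wants_write;  /* nonzero: a thread is waiting for output */
};

/*
 * Rebuild both sets from scratch so they cannot drift out of sync
 * with the list. Returns the nfds argument for select() (highest
 * registered fd + 1), or 0 when no entry is waiting for anything,
 * in which case the caller should not block in select().
 */
int build_fdsets(const struct fd_waiter *list, size_t n, fd_set *in, fd_set *out)
{
    int maxfd = -1;

    FD_ZERO(in);
    FD_ZERO(out);
    for (size_t i = 0; i < n; i++) {
        int fd = list[i].fd;
        assert(fd >= 0 && fd < FD_SETSIZE);
        if (list[i].wants_read)
            FD_SET(fd, in);
        if (list[i].wants_write)
            FD_SET(fd, out);
        if ((list[i].wants_read || list[i].wants_write) && fd > maxfd)
            maxfd = fd;
    }
    return maxfd + 1;
}
```

Because nothing is retained between calls, there is no saved fdset to restore, and a return value of 0 makes the empty case explicit.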
comment:15 Changed 14 years ago by
Owner: | felix winkelmann deleted |
---|---|
Priority: | major → not urgent at all |
I have done some work on the scheduler and simplified the fdset handling, so that the situation that select(2) waits for empty fdsets should not normally occur. Also, some sanity checks have been added. It's in the "safer-scheduler" branch, taken from the current "master" (4.6.0rc1).
The right steps now would be to build chicken with this branch, which (considering that abnf is still broken, and the problems with sendfile) may be a bit of a challenge.
Also, by changing the commented-out debugging code at the start of scheduler.scm, it is possible to get quite verbose debugging output about scheduling. It may be worthwhile to enable this.
comment:16 Changed 14 years ago by
Priority: | not urgent at all → critical |
---|
comment:17 Changed 14 years ago by
Using the head of the safe-scheduler branch: no hang after 2 million requests served to 16 parallel wgets in a 3.5h period.
comment:18 follow-up: 19 Changed 14 years ago by
I think the safe-scheduler is a lot safer now :) - and so is openssl :P
Maybe this ticket can finally be closed?
comment:19 Changed 14 years ago by
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
Replying to sjamaan:
I think the safe-scheduler is a lot safer now :) - and so is openssl :P
Maybe this ticket can finally be closed?
If you think so, ok.
BTW, you are l33t.
Two suggestions: