Opened 14 years ago

Closed 13 years ago

#340 closed defect (fixed)

Spiffy hangs occasionally (again!)

Reported by: sjamaan
Owned by:
Priority: critical
Milestone:
Component: wiki
Version: 4.5.x
Keywords: spiffy, wiki, hang, call-cc.org, infrastructure
Cc: Mario Domenech Goulart, felix winkelmann, sjamaan
Estimated difficulty:

Description

A new problem has cropped up. One of the Spiffies on call-cc.org (the one listening on HTTP) was no longer accepting connections. I checked with lsof, but it had a normal number of file handles open. The HTTP socket was in state LISTEN.

I had to restart it, with no clue as to what was going on. It had been running since 12 August, so 7 days.

Change History (19)

comment:1 Changed 14 years ago by felix winkelmann

Milestone: 4.6.0

Two suggestions:

  1. as an experiment, try using a newer version (say, 4.5.7 which has the timeout scheduler fixes, or even the "experimental" HEAD)
  2. let's set up a version for debugging (as described recently) and run that until it hangs, killing it in a way suitable to generate a corefile, which I would be happy to analyse

comment:2 Changed 14 years ago by Mario Domenech Goulart

I'll try to run something like recursive wgets (maybe parallel) in a loop until spiffy hangs.

Any suggestion for the debug environment besides using DEBUGBUILD=1 for chicken?

comment:3 in reply to:  2 ; Changed 14 years ago by felix winkelmann

Replying to mario:

> I'll try to run something like recursive wgets (maybe parallel) in a loop until spiffy hangs.
>
> Any suggestion for the debug environment besides using DEBUGBUILD=1 for chicken?

Right, DEBUGBUILD=1 and the eggs should be installed from a local repository checkout, with

chicken-install -k -l <localdir> -t local

This will keep the compiled .c files on the local filesystem and we will be able to debug all code with gdb.

It's a bit brute force, but gives us maximum introspection. If the server hangs, we can attach via gdb. IIRC, the hanging spiffy does not bog down the CPU, correct? Then you can just keep it running and I'll attach when I find the time. Or, if the behaviour and testing are reproducible, I can run it myself.

comment:4 in reply to:  3 Changed 14 years ago by sjamaan

Replying to felix:

> IIRC, the hanging spiffy does not bog down the CPU, correct?

Indeed. It was not consuming any CPU before I restarted it.

comment:5 Changed 14 years ago by Mario Domenech Goulart

As a side note regarding the 13-day scheduler issue: spiffy has been running on parenteses.org for more than 13 days now:

$ date
Mon Aug 23 09:03:47 EDT 2010

$ ps -ef | grep [a]wful
spiffy   31816     1  0 Aug09 ?        00:04:57 awful /etc/awful/awful.conf

It has served ~5000 requests within that period.

comment:6 Changed 14 years ago by sjamaan

We just had another hang. strace(1) showed us this output:

select(1024, [], [], NULL, NULL

I don't know why the select call isn't finished in strace. Buffering? Anyway, it looks like it's doing a select() call on an empty set of file descriptors.

comment:7 Changed 14 years ago by sjamaan

I just tried this code on the call-cc.org box and it hung indeed:

#include <sys/select.h>
#include <stdio.h>

int main(void) {
  fd_set in, out;
  FD_ZERO(&in);
  FD_ZERO(&out);
  select(1024, &in, &out, NULL, NULL);
  return 0;
}

It also hangs on NetBSD, so I guess Chicken would need a test to see if the sets are both empty before actually calling into select().

comment:8 Changed 14 years ago by felix winkelmann

Owner: set to felix winkelmann
Status: new → assigned

Likely to be a scheduler bug. See ##sys#fdset-select-timeout and thereabouts. I will try to add code to catch this situation. Is there a way to count selectors in an FDSET?

Can't someone just reimplement the damn thing?

comment:9 Changed 14 years ago by sjamaan

There doesn't seem to be a way to count them except by looping from zero to FD_SETSIZE and doing an FD_ISSET check on each descriptor.
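A minimal sketch of such a loop in C. The helper name `fdset_count` is hypothetical (it is neither part of Chicken nor of POSIX); only the standard `FD_ISSET` macro and `FD_SETSIZE` are used:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>

/* Hypothetical helper: count the descriptors present in an fd_set by
   probing every possible descriptor number with FD_ISSET. */
static int fdset_count(fd_set *set) {
  int fd, count = 0;
  assert(set != NULL);
  for (fd = 0; fd < FD_SETSIZE; ++fd) {
    if (FD_ISSET(fd, set)) ++count;
  }
  return count;
}
```

A caller could skip the select() call when `fdset_count` returns 0 for both sets, at the cost of scanning all FD_SETSIZE slots on every check.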

You can't even compare with FD_ZERO because the implementation is (allowed to be) an array of elements, which means you'd just be comparing pointers.

We could keep a shadow copy around in Scheme as a list and use that to count the set size... Checking for the empty list is simple enough. Keeping it synced shouldn't be hard, but of course it's a little annoying.
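The same bookkeeping idea can be sketched in C instead of Scheme, with a plain counter standing in for the shadow list. This is only an illustration under that assumption; the type `counted_fdset` and the `cfd_*` functions are hypothetical names, not anything from Chicken:

```c
#include <assert.h>
#include <sys/select.h>

/* Hypothetical wrapper: an fd_set plus a shadow count of its members. */
typedef struct {
  fd_set set;
  int count;
} counted_fdset;

/* Empty the set and reset the shadow count together. */
static void cfd_clear(counted_fdset *c) {
  FD_ZERO(&c->set);
  c->count = 0;
}

/* Add fd; the count only changes if fd was not already a member. */
static void cfd_add(counted_fdset *c, int fd) {
  assert(fd >= 0 && fd < FD_SETSIZE);
  if (!FD_ISSET(fd, &c->set)) {
    FD_SET(fd, &c->set);
    c->count++;
  }
}

/* Remove fd; the count only changes if fd was a member. */
static void cfd_remove(counted_fdset *c, int fd) {
  assert(fd >= 0 && fd < FD_SETSIZE);
  if (FD_ISSET(fd, &c->set)) {
    FD_CLR(fd, &c->set);
    c->count--;
  }
}

/* Constant-time emptiness check, no scan over FD_SETSIZE needed. */
static int cfd_is_empty(const counted_fdset *c) {
  return c->count == 0;
}
```

The sync burden mentioned above shows up here as the rule that the set must only ever be modified through these functions.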

comment:10 Changed 14 years ago by Mario Domenech Goulart

I've run a simplified stress test here. Just spiffy serving some static files (the Emacs manual).

I couldn't make spiffy hang after more than 225 thousand requests (~1.5h of stress -- ten recursive wgets running in parallel).

I used chicken 4.5.0 built with DEBUGBUILD=1 and spiffy 4.8.

comment:11 Changed 14 years ago by Mario Domenech Goulart

Tested again using (force-implementation 'read-write) for sendfile. No hang after more than 150K requests.

comment:12 Changed 14 years ago by Mario Domenech Goulart

Tested again using (tcp-buffer-size 8192) for tcp. No hang after more than 150K requests.

comment:13 Changed 14 years ago by Mario Domenech Goulart

Tested again using (tcp-buffer-size 512) for tcp. No hang after more than 50K requests.

comment:14 Changed 14 years ago by felix winkelmann

I have finally started to take a look at the scheduler. I haven't a clue why this situation crops up. Apparently there are entries in the "fd-list" (the list that holds FDs and the threads waiting for them) for which no bits were set in the fdsets. These bits are set just once, when the thread is blocked for I/O, and then retained. The operations to keep an fdset and later restore it may be broken (I just use assignment, but this may be invalid for certain implementations of fdsets), or the fdset and the fd-list are getting out of sync for some other reason.

I'm trying to provide a patch that simplifies the fdset handling. With it, the fdset is always initialized from scratch from the fd-list before doing the select(2) call. It isn't working yet.
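To illustrate the "rebuild from scratch" approach, here is a rough C sketch. It is not the actual scheduler change (that code is Scheme, in scheduler.scm); `struct fd_waiter`, `build_fdsets`, `select_waiters` and the `-2` return value are all hypothetical and only stand in for the fd-list and its handling:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Hypothetical stand-in for one entry of the scheduler's fd-list. */
struct fd_waiter {
  int fd;
  int want_read;   /* non-zero: a thread waits for input on fd */
  int want_write;  /* non-zero: a thread waits for output on fd */
};

/* Rebuild both sets from the waiter list, discarding any old contents.
   Returns the highest descriptor added, or -1 if nothing was added. */
static int build_fdsets(const struct fd_waiter *w, size_t n,
                        fd_set *in, fd_set *out) {
  int maxfd = -1;
  size_t i;
  assert(in != NULL && out != NULL);
  FD_ZERO(in);
  FD_ZERO(out);
  for (i = 0; i < n; ++i) {
    int fd = w[i].fd;
    if (fd < 0 || fd >= FD_SETSIZE) continue;  /* cannot be represented */
    if (w[i].want_read) FD_SET(fd, in);
    if (w[i].want_write) FD_SET(fd, out);
    if ((w[i].want_read || w[i].want_write) && fd > maxfd) maxfd = fd;
  }
  return maxfd;
}

/* Sanity check: with empty sets and no timeout, select() would block
   forever, so return -2 (an arbitrary sentinel) instead of calling it. */
static int select_waiters(const struct fd_waiter *w, size_t n,
                          fd_set *in, fd_set *out, struct timeval *timeout) {
  int maxfd = build_fdsets(w, n, in, out);
  if (maxfd < 0 && timeout == NULL) return -2;
  return select(maxfd + 1, in, out, NULL, timeout);
}
```

Because the sets are derived from the list on every call, they cannot drift out of sync with it, and the emptiness check falls out of the rebuild for free.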

comment:15 Changed 14 years ago by felix winkelmann

Owner: felix winkelmann deleted
Priority: major → not urgent at all

I have done some work on the scheduler and simplified the fdset handling, so that the situation that select(2) waits for empty fdsets should not normally occur. Also, some sanity checks have been added. It's in the "safer-scheduler" branch, taken from the current "master" (4.6.0rc1).

The right steps now would be to build chicken with this branch, which (considering that abnf is still broken, and the problems with sendfile) may be a bit of a challenge.

Also, by changing the commented-out debugging code at the start of scheduler.scm, it is possible to get quite verbose debugging output about scheduling. It may be worthwhile to enable this.

comment:16 Changed 14 years ago by felix winkelmann

Priority: not urgent at all → critical

comment:17 Changed 14 years ago by Mario Domenech Goulart

Using the head of the safe-scheduler branch: no hang after 2 million requests served to 16 parallel wgets in a 3.5h period.

comment:18 Changed 13 years ago by sjamaan

I think the safe-scheduler is a lot safer now :) - and so is openssl :P

Maybe this ticket can finally be closed?

comment:19 in reply to:  18 Changed 13 years ago by felix winkelmann

Resolution: fixed
Status: assigned → closed

Replying to sjamaan:

> I think the safe-scheduler is a lot safer now :) - and so is openssl :P
>
> Maybe this ticket can finally be closed?

If you think so, ok.

BTW, you are l33t.
