Opened 12 years ago

Closed 10 years ago

#877 closed defect (fixed)

tests/signal-tests.scm hangs sometimes

Reported by: Mario Domenech Goulart Owned by:
Priority: major Milestone: 4.9.0
Component: build system Version: 4.8.x
Keywords: signal-tests.scm Cc:
Estimated difficulty:


Sometimes, when the daily salmonella tests at runs, we get an a.out process that hangs eating CPU:

root     14821  0.0  0.0   2496   992 ?        S    00:30   0:00  \_ /USR/SBIN/CRON
root     14822  0.0  0.0   1756   472 ?        Ss   00:30   0:00      \_ /bin/sh -c /root/
root     14823  0.0  0.0   1756   480 ?        S    00:30   0:00          \_ /bin/sh /root/
root     14850  0.0  0.3  15300  6984 ?        S    00:30   0:00              \_ csi -s src/salmonella-run-publish/salmonella-run-publish.scm src/chicken-infrastructure/salmonella/salmonella-master.conf
root     18441  0.0  0.0      0     0 ?        Z    00:47   0:00                  \_ [sh] <defunct>
root     19322 98.0  0.0   6308  1712 ?        R    00:49 549:13 ./a.out

It seems that the a.out file is generated during some test run by make check. We've got to this conclusion by analyzing /proc/<pid of a.out>. Based on that information, we found out that tests/signal-tests.scm may be the culprit.

Attached is the output of strings /proc/<pid of a.out>/exe, which contains some hints about the file that originated a.out.

Once we kill a.out, salmonella's execution continues.

The behavior described here doesn't show up everyday. It doesn't look easy to reproduce.

Attachments (3)

a.out-exe-strings.txt (5.2 KB) - added by Mario Domenech Goulart 12 years ago.
strings of /proc/<pid of a.out>/exe
a.out-exe-strings-reopen.txt (3.9 KB) - added by Mario Domenech Goulart 12 years ago.
out.diff (846 bytes) - added by felix winkelmann 12 years ago.

Download all attachments as: .zip

Change History (17)

Changed 12 years ago by Mario Domenech Goulart

Attachment: a.out-exe-strings.txt added

strings of /proc/<pid of a.out>/exe

comment:1 Changed 12 years ago by felix winkelmann

Can you modify the build to log the output to a file? signal-tests.scm produces output that may indicate the operation that causes the hang.

comment:2 Changed 12 years ago by Mario Domenech Goulart

Unfortunately the script which builds chicken and runs salmonella only logs output when the exit status != 0, so there's nothing logged. I've modified it to always log the output.

comment:3 Changed 12 years ago by Mario Domenech Goulart

I've modified salmonella-run-publish (he script which builds chicken and runs salmonella) to make it log the output of all commands. So, now if "make check" fails or hangs, we can see what it prints in run-salmonella.log (example).

comment:4 Changed 12 years ago by felix winkelmann

BTW, I think this is more a problem with the test than with the compiler or runtime system. This seems to be a timing issue.

comment:5 Changed 12 years ago by felix winkelmann

Milestone: 4.8.0

comment:6 Changed 12 years ago by Mario Domenech Goulart

Resolution: fixed
Status: newclosed

The cause of this issue seems to be the same as #907's. I'm closing it, since d856a53aa194922b73f480f0e943c38268ef93f1 fixes #907.

comment:7 Changed 12 years ago by Mario Domenech Goulart

Component: unknownbuild system
Resolution: fixed
Status: closedreopened

Today it hanged again, so I'm reopening this ticket. I'm attaching the output of strings /proc/<pid of a.out>/exe

The output of after a.out was killed is

======================================== signal tests ...
../chicken signal-tests.scm -output-file a.c -verbose -include-path ..
gcc a.c -o a.o -c  -fno-strict-aliasing -fwrapv -DHAVE_CHICKEN_CONFIG_H -DC_ENABLE_PTABLES -Os -fomit-frame-pointer -I.. -I"/root/salmonella-run-publish/chicken/include/chicken"
rm a.c
gcc a.o -o a.out -L.. -L"/root/salmonella-run-publish/chicken/lib"  -Wl,-R"/root/salmonella-run-publish/chicken/lib" -lchicken -lm -ldl
rm a.o
child started
_._sending signals to 31829
signals sent: 726 USR1, 729 USR2
terminating child process ...

Changed 12 years ago by Mario Domenech Goulart

comment:8 Changed 12 years ago by felix winkelmann

Thanks for the information. I have no idea why this hangs. If the process is still running, we could send a SIGTERM, which should trigger the "fini" function in the child process.

It *might* be possible that the child process gets delayed somehow - the signal handlers are established after setting up of the thread and *perhaps* the parent process sends the termination signal before the handlers are installed (that would be very curious indeed, but perhaps... on a heavily loaded machine... I don't know).

I'll run a slightly modified test (diff attached) repeatedly, now.

Changed 12 years ago by felix winkelmann

Attachment: out.diff added

comment:9 Changed 11 years ago by sjamaan

I believe this is related to #989

comment:10 Changed 11 years ago by sjamaan

Milestone: 4.9.0
Version: 4.7.x4.8.x

Hopefully this is fixed now. I'm putting it on 4.9.0 so it stays on our radar for a while (note to self: put it on 4.10.0 as soon as 4.9.0 is released)

comment:11 Changed 11 years ago by sjamaan

The problem is still around. It looks like it is easy to reproduce on slow machines (like, because I can pretty consistently reproduce it in a VM with Haiku.

comment:12 Changed 10 years ago by sjamaan

OK, findings so far: It seems to only hang on multicore/multi-CPU machines.

When it hangs it *looks* like the child process' secondary thread is no longer "ticking": there are no dots in the last stretch of the output any more.

comment:13 Changed 10 years ago by sjamaan

I think I know why this is so hard to reproduce: the parent waits for a random number of milliseconds (which is probably desirable, otherwise it might consistently NOT have triggered this bug). Together with the unpredictability of CPU load and OS process scheduling indeterminism it's hard to predict if it will go wrong.

See chicken-hackers for a possible patch which hopefully fixes this _and_ detects the error.

comment:14 Changed 10 years ago by sjamaan

Resolution: fixed
Status: reopenedclosed

I think we can safely assume it's fixed by 6cf5b12b275455ff9c9adfd9e7d6e0ffeed98478

Note: See TracTickets for help on using tickets.