Opened 13 years ago
Closed 12 years ago
#877 closed defect (fixed)
tests/signal-tests.scm hangs sometimes
| Reported by: | Mario Domenech Goulart | Owned by: | |
|---|---|---|---|
| Priority: | major | Milestone: | 4.9.0 |
| Component: | build system | Version: | 4.8.x |
| Keywords: | signal-tests.scm | Cc: | |
| Estimated difficulty: |
Description
Sometimes, when the daily salmonella tests at call-cc.org runs, we get an a.out process that hangs eating CPU:
root 14821 0.0 0.0 2496 992 ? S 00:30 0:00 \_ /USR/SBIN/CRON root 14822 0.0 0.0 1756 472 ? Ss 00:30 0:00 \_ /bin/sh -c /root/run-salmonella.sh root 14823 0.0 0.0 1756 480 ? S 00:30 0:00 \_ /bin/sh /root/run-salmonella.sh root 14850 0.0 0.3 15300 6984 ? S 00:30 0:00 \_ csi -s src/salmonella-run-publish/salmonella-run-publish.scm src/chicken-infrastructure/salmonella/salmonella-master.conf root 18441 0.0 0.0 0 0 ? Z 00:47 0:00 \_ [sh] <defunct> ... root 19322 98.0 0.0 6308 1712 ? R 00:49 549:13 ./a.out
It seems that the a.out file is generated during some test run by make check. We've got to this conclusion by analyzing /proc/<pid of a.out>. Based on that information, we found out that tests/signal-tests.scm may be the culprit.
Attached is the output of strings /proc/<pid of a.out>/exe, which contains some hints about the file that originated a.out.
Once we kill a.out, salmonella's execution continues.
The behavior described here doesn't show up everyday. It doesn't look easy to reproduce.
Attachments (3)
Change History (17)
by , 13 years ago
| Attachment: | a.out-exe-strings.txt added |
|---|
comment:1 by , 13 years ago
Can you modify the build to log the output to a file? signal-tests.scm produces output that may indicate the operation that causes the hang.
comment:2 by , 13 years ago
Unfortunately the script which builds chicken and runs salmonella only logs output when the exit status != 0, so there's nothing logged. I've modified it to always log the output.
comment:3 by , 13 years ago
I've modified salmonella-run-publish (he script which builds chicken and runs salmonella) to make it log the output of all commands. So, now if "make check" fails or hangs, we can see what it prints in run-salmonella.log (example).
comment:4 by , 13 years ago
BTW, I think this is more a problem with the test than with the compiler or runtime system. This seems to be a timing issue.
comment:5 by , 13 years ago
| Milestone: | 4.8.0 |
|---|
comment:6 by , 13 years ago
| Resolution: | → fixed |
|---|---|
| Status: | new → closed |
comment:7 by , 13 years ago
| Component: | unknown → build system |
|---|---|
| Resolution: | fixed |
| Status: | closed → reopened |
Today it hanged again, so I'm reopening this ticket. I'm attaching the output of strings /proc/<pid of a.out>/exe
The output of runtests.sh after a.out was killed is
======================================== signal tests ... ../chicken signal-tests.scm -output-file a.c -verbose -include-path .. gcc a.c -o a.o -c -fno-strict-aliasing -fwrapv -DHAVE_CHICKEN_CONFIG_H -DC_ENABLE_PTABLES -Os -fomit-frame-pointer -I.. -I"/root/salmonella-run-publish/chicken/include/chicken" rm a.c gcc a.o -o a.out -L.. -L"/root/salmonella-run-publish/chicken/lib" -Wl,-R"/root/salmonella-run-publish/chicken/lib" -lchicken -lm -ldl rm a.o child started _._sending signals to 31829 BA21.AAAA111.A1.BAB21.BBB22.A1.B2.BBB22.ABAB2121.BAB212.ABA21.A1.A1.AB21.B2.AB21.BAB21.AAB21.AAA11.B2.AA1.B2.AAAA1BB221.AAB21.B2.AB21.ABB21.AA1.BB2.AAA1.AB21.BA21.AB21.AAA1.AAA1.A1.A1.BB2.BAA211.A1.BAAAA11112.A1.B2.BBB2.BB2.BAA211.AABB221.AA1.BBB22.AA1.ABB212.AA1.B2.A1.AB21.BAB21.BB2.BBA2B12.AA1.BB2.AA1.AAB21.BB2.AA1.BA21.BB2.BAB212.BBA2A11.ABB221.AB21.ABB21.A1.B2.B2.ABB212.BB2.BB2.B2.AA1._B2.BAA211.A1.B2.B2.BABB2212.BBA2B122.BA21.AAA1.AB21.A1.AA1.B2.A1.B2.A1.A1A.1BA21.B2.BB2.B2.BB2.ABA211.AB21.A1.BAB21.AB12.B2.BA21.ABB21.A1.BA21.B2.AB21.B2.BAA21.BAB21.A1.AAB21.BAB21.AAAB2111.A1.BB2.BB2.AA1.ABAAA21111.B2.AAB21.BBBA212.AA1.BB2.BBB22.A1.ABB122.B2.BB2.AAB21.BA1A2A11.AA1AA11.BB2.B2.ABA211.AA1.B2.BA21.AAB21.B2.BB2.BA21.AA1.A1.AB21.A1.A1.BAB21.B2.A1._AAAB121.BA21.BAB21.AAA11.BAA211.B2.AAA1.AA1.BB2.ABA211.ABA211.B2.BA21.AAA1.BA21.A1.B2.B2.B2.BBB2.B2.BA21.ABA211.BBA21.BAA211.BB2.B2.BBB2.BAB221.B2.AB21.B2.BAA21.ABA211.A1.AB21.BBA21.B2.A1.BBB2.BB2.A1.A1.AAB21.ABA211.BA21.B2.B2.BBA21.B2.BBA21.AA1.A1.BBB22.BB2.A1.BAA211.ABB212.BBB22.B2.BA1B22BBA.122BB22.BA21.B2.BA21.AAB12.B2.BAA1B2AA2111.AA11.BA12.BB2.BA21.AAB112.A1.AA1B21.ABB2AA2111.BAB122._A1.AA11.A1.ABB122.B2.B2.BAA211.A1.AB21.AAA11.BBAA112.BBB22.BA21A.1A1.BA2A11.BA21.BA2A11.BBBA2212.BBA221.A1.A1.AB1A12.A1.B2.B2.AA11.AB12.A1.B2.BA12.BBA221.B2.BB2.BBB22.A1.A1.BA21.BB22.B2.BA21.BB22.AB12.BA21.BA21.AB12.AB12A.1A1.AA1.BAA121.AA11.BBA21.BA12.B2.AB21.ABA211.BBB222.BBB22.AB12.ABB2BA2121.BAB21.AB21.AAA111.BAA121.BB2.AB21.A1.AAA11.ABA112.AAA111.BB22.B2.A1.A1.A1.A1.AAA111.B2.B2.AAB112.AA1.B2.BB22.AA11._BAA211.BB22.A1.A1.AA11.A1.A1.BBA221.BA21.AA11.B2.AA11.AAA11.B2.AAA111.ABA211.BAA121.BBAA2112.B2.BBA221.BB22.A1.A1.BAB221.BAB221.A1.AAA111.B2.BB22.BBB222.BAA121.AB12.BA21.ABA112.A1.BAB221.BAA211.A1.AB21.B2.AA11.B2.ABB212.ABB221.B2.B2.AAA11.BAA211.AB21.BBB22.AA11.BA12.A1.BA21.BA12.B2.AA11.BAB221.A1.BB22.BBBAB1222.A1.AB12.BA21.A1.B2.BB22.B2._A1.BAB221.AA1B21.AB12.B2.B2.ABB212.A1.BBA212.BBA21.B2.A1.BAB221.B2.BB22.B2.B2.A1.B2.B2.ABAB1ABBABAAAABBAAABBBBAAAABBABABABAAABBAAAABAAAAABBBBAABAAAAABBABBAABBBAAAAAAAABBBAAAAABBABABABBBAABABABABBABABBBABBABABBAABBBABBBAABABAAAAABBBBBAABBAABBBBBBAABBABAABBAABBAAABBBAABBBAABBBBAABBBBBABBBABAABABAABBABABAABABABBBAAAAABBBBABBABBBABAABBBAAAAABABAAAAABAABBABAAABABABAAAABBBBBBABABBBABBBBBBBBBABBAABBAABAAAABBBBAABBBAABABBAABABAABAABBBBABAAAAABBABBBABABABBBBABBABAABBABBBABABBAAABBBBAAAAABAABBBABAAAAAAABAAAAABBBABAAABABABAABABBBAABABAABBBABBAABBABAAABBBABAAABABBABAABBBAAABBAAABBAAAAABABAABAABAAAAAABAAAAABABABBABABABBBBAAAABBBBAAABBAABABBABBABBBBBABAAAABBABBBBABBBAABBBABABBBAAAAABBBBABBBBAABBAAAA signals sent: 726 USR1, 729 USR2 terminating child process ...
by , 13 years ago
| Attachment: | a.out-exe-strings-reopen.txt added |
|---|
comment:8 by , 13 years ago
Thanks for the information. I have no idea why this hangs. If the process is still running, we could send a SIGTERM, which should trigger the "fini" function in the child process.
It *might* be possible that the child process gets delayed somehow - the signal handlers are established after setting up of the thread and *perhaps* the parent process sends the termination signal before the handlers are installed (that would be very curious indeed, but perhaps... on a heavily loaded machine... I don't know).
I'll run a slightly modified test (diff attached) repeatedly, now.
by , 13 years ago
comment:10 by , 12 years ago
| Milestone: | → 4.9.0 |
|---|---|
| Version: | 4.7.x → 4.8.x |
Hopefully this is fixed now. I'm putting it on 4.9.0 so it stays on our radar for a while (note to self: put it on 4.10.0 as soon as 4.9.0 is released)
comment:11 by , 12 years ago
The problem is still around. It looks like it is easy to reproduce on slow machines (like call-cc.org), because I can pretty consistently reproduce it in a VM with Haiku.
comment:12 by , 12 years ago
OK, findings so far: It seems to only hang on multicore/multi-CPU machines.
When it hangs it *looks* like the child process' secondary thread is no longer "ticking": there are no dots in the last stretch of the output any more.
comment:13 by , 12 years ago
I think I know why this is so hard to reproduce: the parent waits for a random number of milliseconds (which is probably desirable, otherwise it might consistently NOT have triggered this bug). Together with the unpredictability of CPU load and OS process scheduling indeterminism it's hard to predict if it will go wrong.
See chicken-hackers for a possible patch which hopefully fixes this _and_ detects the error.
comment:14 by , 12 years ago
| Resolution: | → fixed |
|---|---|
| Status: | reopened → closed |
I think we can safely assume it's fixed by 6cf5b12b275455ff9c9adfd9e7d6e0ffeed98478

strings of /proc/<pid of a.out>/exe