Opened 12 years ago
Closed 11 years ago
#877 closed defect (fixed)
tests/signal-tests.scm hangs sometimes
Reported by: | Mario Domenech Goulart | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | 4.9.0 |
Component: | build system | Version: | 4.8.x |
Keywords: | signal-tests.scm | Cc: | |
Estimated difficulty: |
Description
Sometimes, when the daily salmonella tests at call-cc.org runs, we get an a.out process that hangs eating CPU:
root 14821 0.0 0.0 2496 992 ? S 00:30 0:00 \_ /USR/SBIN/CRON root 14822 0.0 0.0 1756 472 ? Ss 00:30 0:00 \_ /bin/sh -c /root/run-salmonella.sh root 14823 0.0 0.0 1756 480 ? S 00:30 0:00 \_ /bin/sh /root/run-salmonella.sh root 14850 0.0 0.3 15300 6984 ? S 00:30 0:00 \_ csi -s src/salmonella-run-publish/salmonella-run-publish.scm src/chicken-infrastructure/salmonella/salmonella-master.conf root 18441 0.0 0.0 0 0 ? Z 00:47 0:00 \_ [sh] <defunct> ... root 19322 98.0 0.0 6308 1712 ? R 00:49 549:13 ./a.out
It seems that the a.out file is generated during some test run by make check
. We've got to this conclusion by analyzing /proc/<pid of a.out>
. Based on that information, we found out that tests/signal-tests.scm may be the culprit.
Attached is the output of strings /proc/<pid of a.out>/exe
, which contains some hints about the file that originated a.out.
Once we kill a.out, salmonella's execution continues.
The behavior described here doesn't show up everyday. It doesn't look easy to reproduce.
Attachments (3)
Change History (17)
Changed 12 years ago by
Attachment: | a.out-exe-strings.txt added |
---|
comment:1 Changed 12 years ago by
Can you modify the build to log the output to a file? signal-tests.scm produces output that may indicate the operation that causes the hang.
comment:2 Changed 12 years ago by
Unfortunately the script which builds chicken and runs salmonella only logs output when the exit status != 0, so there's nothing logged. I've modified it to always log the output.
comment:3 Changed 12 years ago by
I've modified salmonella-run-publish (he script which builds chicken and runs salmonella) to make it log the output of all commands. So, now if "make check" fails or hangs, we can see what it prints in run-salmonella.log (example).
comment:4 Changed 12 years ago by
BTW, I think this is more a problem with the test than with the compiler or runtime system. This seems to be a timing issue.
comment:5 Changed 12 years ago by
Milestone: | 4.8.0 |
---|
comment:6 Changed 12 years ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
comment:7 Changed 12 years ago by
Component: | unknown → build system |
---|---|
Resolution: | fixed |
Status: | closed → reopened |
Today it hanged again, so I'm reopening this ticket. I'm attaching the output of strings /proc/<pid of a.out>/exe
The output of runtests.sh after a.out was killed is
======================================== signal tests ... ../chicken signal-tests.scm -output-file a.c -verbose -include-path .. gcc a.c -o a.o -c -fno-strict-aliasing -fwrapv -DHAVE_CHICKEN_CONFIG_H -DC_ENABLE_PTABLES -Os -fomit-frame-pointer -I.. -I"/root/salmonella-run-publish/chicken/include/chicken" rm a.c gcc a.o -o a.out -L.. -L"/root/salmonella-run-publish/chicken/lib" -Wl,-R"/root/salmonella-run-publish/chicken/lib" -lchicken -lm -ldl rm a.o child started _._sending signals to 31829 BA21.AAAA111.A1.BAB21.BBB22.A1.B2.BBB22.ABAB2121.BAB212.ABA21.A1.A1.AB21.B2.AB21.BAB21.AAB21.AAA11.B2.AA1.B2.AAAA1BB221.AAB21.B2.AB21.ABB21.AA1.BB2.AAA1.AB21.BA21.AB21.AAA1.AAA1.A1.A1.BB2.BAA211.A1.BAAAA11112.A1.B2.BBB2.BB2.BAA211.AABB221.AA1.BBB22.AA1.ABB212.AA1.B2.A1.AB21.BAB21.BB2.BBA2B12.AA1.BB2.AA1.AAB21.BB2.AA1.BA21.BB2.BAB212.BBA2A11.ABB221.AB21.ABB21.A1.B2.B2.ABB212.BB2.BB2.B2.AA1._B2.BAA211.A1.B2.B2.BABB2212.BBA2B122.BA21.AAA1.AB21.A1.AA1.B2.A1.B2.A1.A1A.1BA21.B2.BB2.B2.BB2.ABA211.AB21.A1.BAB21.AB12.B2.BA21.ABB21.A1.BA21.B2.AB21.B2.BAA21.BAB21.A1.AAB21.BAB21.AAAB2111.A1.BB2.BB2.AA1.ABAAA21111.B2.AAB21.BBBA212.AA1.BB2.BBB22.A1.ABB122.B2.BB2.AAB21.BA1A2A11.AA1AA11.BB2.B2.ABA211.AA1.B2.BA21.AAB21.B2.BB2.BA21.AA1.A1.AB21.A1.A1.BAB21.B2.A1._AAAB121.BA21.BAB21.AAA11.BAA211.B2.AAA1.AA1.BB2.ABA211.ABA211.B2.BA21.AAA1.BA21.A1.B2.B2.B2.BBB2.B2.BA21.ABA211.BBA21.BAA211.BB2.B2.BBB2.BAB221.B2.AB21.B2.BAA21.ABA211.A1.AB21.BBA21.B2.A1.BBB2.BB2.A1.A1.AAB21.ABA211.BA21.B2.B2.BBA21.B2.BBA21.AA1.A1.BBB22.BB2.A1.BAA211.ABB212.BBB22.B2.BA1B22BBA.122BB22.BA21.B2.BA21.AAB12.B2.BAA1B2AA2111.AA11.BA12.BB2.BA21.AAB112.A1.AA1B21.ABB2AA2111.BAB122._A1.AA11.A1.ABB122.B2.B2.BAA211.A1.AB21.AAA11.BBAA112.BBB22.BA21A.1A1.BA2A11.BA21.BA2A11.BBBA2212.BBA221.A1.A1.AB1A12.A1.B2.B2.AA11.AB12.A1.B2.BA12.BBA221.B2.BB2.BBB22.A1.A1.BA21.BB22.B2.BA21.BB22.AB12.BA21.BA21.AB12.AB12A.1A1.AA1.BAA121.AA11.BBA21.BA12.B2.AB21.ABA211.BBB222.BBB22.AB12.ABB2BA2121.BAB21.AB21.AAA111.BAA121.BB2.AB21.A1.AAA11.ABA112.AAA111.BB22.B2.A1.A1.A1.A1.AAA111.B2.B2.AAB112.AA1.B2.BB22.AA11._BAA211.BB22.A1.A1.AA11.A1.A1.BBA221.BA21.AA11.B2.AA11.AAA11.B2.AAA111.ABA211.BAA121.BBAA2112.B2.BBA221.BB22.A1.A1.BAB221.BAB221.A1.AAA111.B2.BB22.BBB222.BAA121.AB12.BA21.ABA112.A1.BAB221.BAA211.A1.AB21.B2.AA11.B2.ABB212.ABB221.B2.B2.AAA11.BAA211.AB21.BBB22.AA11.BA12.A1.BA21.BA12.B2.AA11.BAB221.A1.BB22.BBBAB1222.A1.AB12.BA21.A1.B2.BB22.B2._A1.BAB221.AA1B21.AB12.B2.B2.ABB212.A1.BBA212.BBA21.B2.A1.BAB221.B2.BB22.B2.B2.A1.B2.B2.ABAB1ABBABAAAABBAAABBBBAAAABBABABABAAABBAAAABAAAAABBBBAABAAAAABBABBAABBBAAAAAAAABBBAAAAABBABABABBBAABABABABBABABBBABBABABBAABBBABBBAABABAAAAABBBBBAABBAABBBBBBAABBABAABBAABBAAABBBAABBBAABBBBAABBBBBABBBABAABABAABBABABAABABABBBAAAAABBBBABBABBBABAABBBAAAAABABAAAAABAABBABAAABABABAAAABBBBBBABABBBABBBBBBBBBABBAABBAABAAAABBBBAABBBAABABBAABABAABAABBBBABAAAAABBABBBABABABBBBABBABAABBABBBABABBAAABBBBAAAAABAABBBABAAAAAAABAAAAABBBABAAABABABAABABBBAABABAABBBABBAABBABAAABBBABAAABABBABAABBBAAABBAAABBAAAAABABAABAABAAAAAABAAAAABABABBABABABBBBAAAABBBBAAABBAABABBABBABBBBBABAAAABBABBBBABBBAABBBABABBBAAAAABBBBABBBBAABBAAAA signals sent: 726 USR1, 729 USR2 terminating child process ...
Changed 12 years ago by
Attachment: | a.out-exe-strings-reopen.txt added |
---|
comment:8 Changed 12 years ago by
Thanks for the information. I have no idea why this hangs. If the process is still running, we could send a SIGTERM, which should trigger the "fini" function in the child process.
It *might* be possible that the child process gets delayed somehow - the signal handlers are established after setting up of the thread and *perhaps* the parent process sends the termination signal before the handlers are installed (that would be very curious indeed, but perhaps... on a heavily loaded machine... I don't know).
I'll run a slightly modified test (diff attached) repeatedly, now.
Changed 12 years ago by
comment:10 Changed 11 years ago by
Milestone: | → 4.9.0 |
---|---|
Version: | 4.7.x → 4.8.x |
Hopefully this is fixed now. I'm putting it on 4.9.0 so it stays on our radar for a while (note to self: put it on 4.10.0 as soon as 4.9.0 is released)
comment:11 Changed 11 years ago by
The problem is still around. It looks like it is easy to reproduce on slow machines (like call-cc.org), because I can pretty consistently reproduce it in a VM with Haiku.
comment:12 Changed 11 years ago by
OK, findings so far: It seems to only hang on multicore/multi-CPU machines.
When it hangs it *looks* like the child process' secondary thread is no longer "ticking": there are no dots in the last stretch of the output any more.
comment:13 Changed 11 years ago by
I think I know why this is so hard to reproduce: the parent waits for a random number of milliseconds (which is probably desirable, otherwise it might consistently NOT have triggered this bug). Together with the unpredictability of CPU load and OS process scheduling indeterminism it's hard to predict if it will go wrong.
See chicken-hackers for a possible patch which hopefully fixes this _and_ detects the error.
comment:14 Changed 11 years ago by
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
I think we can safely assume it's fixed by 6cf5b12b275455ff9c9adfd9e7d6e0ffeed98478
strings of /proc/<pid of a.out>/exe