Opened 9 years ago

Closed 7 years ago

#877 closed defect (fixed)

tests/signal-tests.scm hangs sometimes

Reported by: Mario Domenech Goulart Owned by:
Priority: major Milestone: 4.9.0
Component: build system Version: 4.8.x
Keywords: signal-tests.scm Cc:
Estimated difficulty:

Description

Sometimes, when the daily salmonella tests at call-cc.org runs, we get an a.out process that hangs eating CPU:

root     14821  0.0  0.0   2496   992 ?        S    00:30   0:00  \_ /USR/SBIN/CRON
root     14822  0.0  0.0   1756   472 ?        Ss   00:30   0:00      \_ /bin/sh -c /root/run-salmonella.sh
root     14823  0.0  0.0   1756   480 ?        S    00:30   0:00          \_ /bin/sh /root/run-salmonella.sh
root     14850  0.0  0.3  15300  6984 ?        S    00:30   0:00              \_ csi -s src/salmonella-run-publish/salmonella-run-publish.scm src/chicken-infrastructure/salmonella/salmonella-master.conf
root     18441  0.0  0.0      0     0 ?        Z    00:47   0:00                  \_ [sh] <defunct>
...
root     19322 98.0  0.0   6308  1712 ?        R    00:49 549:13 ./a.out

It seems that the a.out file is generated during some test run by make check. We've got to this conclusion by analyzing /proc/<pid of a.out>. Based on that information, we found out that tests/signal-tests.scm may be the culprit.

Attached is the output of strings /proc/<pid of a.out>/exe, which contains some hints about the file that originated a.out.

Once we kill a.out, salmonella's execution continues.

The behavior described here doesn't show up everyday. It doesn't look easy to reproduce.

Attachments (3)

a.out-exe-strings.txt (5.2 KB) - added by Mario Domenech Goulart 9 years ago.
strings of /proc/<pid of a.out>/exe
a.out-exe-strings-reopen.txt (3.9 KB) - added by Mario Domenech Goulart 9 years ago.
out.diff (846 bytes) - added by felix winkelmann 8 years ago.

Download all attachments as: .zip

Change History (17)

Changed 9 years ago by Mario Domenech Goulart

Attachment: a.out-exe-strings.txt added

strings of /proc/<pid of a.out>/exe

comment:1 Changed 9 years ago by felix winkelmann

Can you modify the build to log the output to a file? signal-tests.scm produces output that may indicate the operation that causes the hang.

comment:2 Changed 9 years ago by Mario Domenech Goulart

Unfortunately the script which builds chicken and runs salmonella only logs output when the exit status != 0, so there's nothing logged. I've modified it to always log the output.

comment:3 Changed 9 years ago by Mario Domenech Goulart

I've modified salmonella-run-publish (he script which builds chicken and runs salmonella) to make it log the output of all commands. So, now if "make check" fails or hangs, we can see what it prints in run-salmonella.log (example).

comment:4 Changed 9 years ago by felix winkelmann

BTW, I think this is more a problem with the test than with the compiler or runtime system. This seems to be a timing issue.

comment:5 Changed 9 years ago by felix winkelmann

Milestone: 4.8.0

comment:6 Changed 9 years ago by Mario Domenech Goulart

Resolution: fixed
Status: newclosed

The cause of this issue seems to be the same as #907's. I'm closing it, since d856a53aa194922b73f480f0e943c38268ef93f1 fixes #907.

comment:7 Changed 9 years ago by Mario Domenech Goulart

Component: unknownbuild system
Resolution: fixed
Status: closedreopened

Today it hanged again, so I'm reopening this ticket. I'm attaching the output of strings /proc/<pid of a.out>/exe

The output of runtests.sh after a.out was killed is

======================================== signal tests ...
../chicken signal-tests.scm -output-file a.c -verbose -include-path ..
gcc a.c -o a.o -c  -fno-strict-aliasing -fwrapv -DHAVE_CHICKEN_CONFIG_H -DC_ENABLE_PTABLES -Os -fomit-frame-pointer -I.. -I"/root/salmonella-run-publish/chicken/include/chicken"
rm a.c
gcc a.o -o a.out -L.. -L"/root/salmonella-run-publish/chicken/lib"  -Wl,-R"/root/salmonella-run-publish/chicken/lib" -lchicken -lm -ldl
rm a.o
child started
_._sending signals to 31829
BA21.AAAA111.A1.BAB21.BBB22.A1.B2.BBB22.ABAB2121.BAB212.ABA21.A1.A1.AB21.B2.AB21.BAB21.AAB21.AAA11.B2.AA1.B2.AAAA1BB221.AAB21.B2.AB21.ABB21.AA1.BB2.AAA1.AB21.BA21.AB21.AAA1.AAA1.A1.A1.BB2.BAA211.A1.BAAAA11112.A1.B2.BBB2.BB2.BAA211.AABB221.AA1.BBB22.AA1.ABB212.AA1.B2.A1.AB21.BAB21.BB2.BBA2B12.AA1.BB2.AA1.AAB21.BB2.AA1.BA21.BB2.BAB212.BBA2A11.ABB221.AB21.ABB21.A1.B2.B2.ABB212.BB2.BB2.B2.AA1._B2.BAA211.A1.B2.B2.BABB2212.BBA2B122.BA21.AAA1.AB21.A1.AA1.B2.A1.B2.A1.A1A.1BA21.B2.BB2.B2.BB2.ABA211.AB21.A1.BAB21.AB12.B2.BA21.ABB21.A1.BA21.B2.AB21.B2.BAA21.BAB21.A1.AAB21.BAB21.AAAB2111.A1.BB2.BB2.AA1.ABAAA21111.B2.AAB21.BBBA212.AA1.BB2.BBB22.A1.ABB122.B2.BB2.AAB21.BA1A2A11.AA1AA11.BB2.B2.ABA211.AA1.B2.BA21.AAB21.B2.BB2.BA21.AA1.A1.AB21.A1.A1.BAB21.B2.A1._AAAB121.BA21.BAB21.AAA11.BAA211.B2.AAA1.AA1.BB2.ABA211.ABA211.B2.BA21.AAA1.BA21.A1.B2.B2.B2.BBB2.B2.BA21.ABA211.BBA21.BAA211.BB2.B2.BBB2.BAB221.B2.AB21.B2.BAA21.ABA211.A1.AB21.BBA21.B2.A1.BBB2.BB2.A1.A1.AAB21.ABA211.BA21.B2.B2.BBA21.B2.BBA21.AA1.A1.BBB22.BB2.A1.BAA211.ABB212.BBB22.B2.BA1B22BBA.122BB22.BA21.B2.BA21.AAB12.B2.BAA1B2AA2111.AA11.BA12.BB2.BA21.AAB112.A1.AA1B21.ABB2AA2111.BAB122._A1.AA11.A1.ABB122.B2.B2.BAA211.A1.AB21.AAA11.BBAA112.BBB22.BA21A.1A1.BA2A11.BA21.BA2A11.BBBA2212.BBA221.A1.A1.AB1A12.A1.B2.B2.AA11.AB12.A1.B2.BA12.BBA221.B2.BB2.BBB22.A1.A1.BA21.BB22.B2.BA21.BB22.AB12.BA21.BA21.AB12.AB12A.1A1.AA1.BAA121.AA11.BBA21.BA12.B2.AB21.ABA211.BBB222.BBB22.AB12.ABB2BA2121.BAB21.AB21.AAA111.BAA121.BB2.AB21.A1.AAA11.ABA112.AAA111.BB22.B2.A1.A1.A1.A1.AAA111.B2.B2.AAB112.AA1.B2.BB22.AA11._BAA211.BB22.A1.A1.AA11.A1.A1.BBA221.BA21.AA11.B2.AA11.AAA11.B2.AAA111.ABA211.BAA121.BBAA2112.B2.BBA221.BB22.A1.A1.BAB221.BAB221.A1.AAA111.B2.BB22.BBB222.BAA121.AB12.BA21.ABA112.A1.BAB221.BAA211.A1.AB21.B2.AA11.B2.ABB212.ABB221.B2.B2.AAA11.BAA211.AB21.BBB22.AA11.BA12.A1.BA21.BA12.B2.AA11.BAB221.A1.BB22.BBBAB1222.A1.AB12.BA21.A1.B2.BB22.B2._A1.BAB221.AA1B21.AB12.B2.B2.ABB212.A1.BBA212.BBA21.B2.A1.BAB221.B2.BB22.B2.B2.A1.B2.B2.ABAB1ABBABAAAABBAAABBBBAAAABBABABABAAABBAAAABAAAAABBBBAABAAAAABBABBAABBBAAAAAAAABBBAAAAABBABABABBBAABABABABBABABBBABBABABBAABBBABBBAABABAAAAABBBBBAABBAABBBBBBAABBABAABBAABBAAABBBAABBBAABBBBAABBBBBABBBABAABABAABBABABAABABABBBAAAAABBBBABBABBBABAABBBAAAAABABAAAAABAABBABAAABABABAAAABBBBBBABABBBABBBBBBBBBABBAABBAABAAAABBBBAABBBAABABBAABABAABAABBBBABAAAAABBABBBABABABBBBABBABAABBABBBABABBAAABBBBAAAAABAABBBABAAAAAAABAAAAABBBABAAABABABAABABBBAABABAABBBABBAABBABAAABBBABAAABABBABAABBBAAABBAAABBAAAAABABAABAABAAAAAABAAAAABABABBABABABBBBAAAABBBBAAABBAABABBABBABBBBBABAAAABBABBBBABBBAABBBABABBBAAAAABBBBABBBBAABBAAAA
signals sent: 726 USR1, 729 USR2
terminating child process ...

Changed 9 years ago by Mario Domenech Goulart

comment:8 Changed 8 years ago by felix winkelmann

Thanks for the information. I have no idea why this hangs. If the process is still running, we could send a SIGTERM, which should trigger the "fini" function in the child process.

It *might* be possible that the child process gets delayed somehow - the signal handlers are established after setting up of the thread and *perhaps* the parent process sends the termination signal before the handlers are installed (that would be very curious indeed, but perhaps... on a heavily loaded machine... I don't know).

I'll run a slightly modified test (diff attached) repeatedly, now.

Changed 8 years ago by felix winkelmann

Attachment: out.diff added

comment:9 Changed 7 years ago by sjamaan

I believe this is related to #989

comment:10 Changed 7 years ago by sjamaan

Milestone: 4.9.0
Version: 4.7.x4.8.x

Hopefully this is fixed now. I'm putting it on 4.9.0 so it stays on our radar for a while (note to self: put it on 4.10.0 as soon as 4.9.0 is released)

comment:11 Changed 7 years ago by sjamaan

The problem is still around. It looks like it is easy to reproduce on slow machines (like call-cc.org), because I can pretty consistently reproduce it in a VM with Haiku.

comment:12 Changed 7 years ago by sjamaan

OK, findings so far: It seems to only hang on multicore/multi-CPU machines.

When it hangs it *looks* like the child process' secondary thread is no longer "ticking": there are no dots in the last stretch of the output any more.

comment:13 Changed 7 years ago by sjamaan

I think I know why this is so hard to reproduce: the parent waits for a random number of milliseconds (which is probably desirable, otherwise it might consistently NOT have triggered this bug). Together with the unpredictability of CPU load and OS process scheduling indeterminism it's hard to predict if it will go wrong.

See chicken-hackers for a possible patch which hopefully fixes this _and_ detects the error.

comment:14 Changed 7 years ago by sjamaan

Resolution: fixed
Status: reopenedclosed

I think we can safely assume it's fixed by 6cf5b12b275455ff9c9adfd9e7d6e0ffeed98478

Note: See TracTickets for help on using tickets.