Opened 11 years ago

Closed 10 years ago

#989 closed defect (fixed)

High CPU usage when calling signal handler multiple times

Reported by: Mario Domenech Goulart
Owned by:
Priority: critical
Milestone: 4.9.0
Component: core libraries
Version: 4.8.x
Keywords: signal handling
Cc:
Estimated difficulty:

Description

Here's an example to demonstrate the aforementioned behavior:

;; Press C-c multiple times in an interval < 20s

(use posix)

(set-signal-handler!
 signal/int
 (lambda (signal)
   (print "caught signal" signal)
   (sleep 20)))


(let loop ()
  (sleep 1)
  (loop))

I cannot reproduce that behavior on 4.7.0. Peter mentioned on IRC that b7995839c0b481280bdeda117eb68bc0e78a40bf triggered it.

Attachments (1)

signal-handler-hack.patch (1.4 KB) - added by megane 11 years ago.
a hack that makes the symptoms go away

Change History (12)

comment:1 Changed 11 years ago by sjamaan

Oddly enough, if I throw out the HAVE_SIGACTION from Makefile.bsd, make spotless and rebuild, it still fails the same way (on master). It seems to be related to the reworking of the signal handling itself.

gdb seems to point to an infinite loop in the GC, but that's rather difficult to understand given the setjmp/longjmp stuff. ktrace shows that it keeps setting and resetting a signal mask.

comment:2 Changed 11 years ago by sjamaan

Ripping out the queueing up of pending events (so that interrupt_reason is overwritten like it used to be) doesn't help either.
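For readers unfamiliar with the distinction drawn here, the following is a minimal C sketch of the two bookkeeping styles: a single slot that later signals overwrite, and a small backlog of pending signals. It is not CHICKEN's runtime code. The names `note_signal`, `next_pending`, `last_reason`, `pending` and `PENDING_MAX` are all hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <signal.h>

#define PENDING_MAX 8 /* hypothetical backlog size, not taken from CHICKEN */

/* Overwrite style: one slot, so a later signal replaces an earlier one. */
static volatile sig_atomic_t last_reason = 0;

/* Queued style: a small ring buffer of pending signal numbers. */
static volatile sig_atomic_t pending[PENDING_MAX];
static volatile sig_atomic_t pending_head = 0;  /* index of the oldest entry */
static volatile sig_atomic_t pending_count = 0; /* number of queued entries */

/* Intended to be installed as the signal handler: it only records the
   signal so the real work can happen later, outside the handler. */
static void note_signal(int signum)
{
    last_reason = signum; /* overwrite style */

    if (pending_count < PENDING_MAX) { /* queued style; drops when full */
        pending[(pending_head + pending_count) % PENDING_MAX] = signum;
        pending_count++;
    }
}

/* Returns the oldest pending signal number, or 0 if none is queued.
   Real code would block signals around this call; this sketch assumes
   no handler runs concurrently with it. */
static int next_pending(void)
{
    int signum;

    assert(pending_head >= 0 && pending_head < PENDING_MAX);
    if (pending_count == 0)
        return 0;
    signum = pending[pending_head];
    pending_head = (pending_head + 1) % PENDING_MAX;
    pending_count--;
    return signum;
}
```

With the overwrite style, two signals arriving before the main loop looks at `last_reason` leave only the second one visible. With the queue, repeated calls to `next_pending()` yield both, in arrival order.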

comment:3 Changed 11 years ago by sjamaan

C_stack_probe(&a) keeps returning false in f_19521 (the procedure marked with /* loop in interrupt-hook in k19511 in k19465 in k17778 in k17771 in k17767 in k17763 in k17759 in k21506 in k21500 in k17281 in k17278 in k17276 in k17274 in k17272 in k17270 in k17267 in k17265 in k17263 in k17261 in k17259 in ... */), even after the C_reclaim has "returned" and the procedure is re-invoked via its trampoline.

Changed 11 years ago by megane

Attachment: signal-handler-hack.patch added

a hack that makes the symptoms go away

comment:4 Changed 11 years ago by megane

Now that I try to think why that patch works, I cannot come up with an explanation.

But something seems to go wrong when multiple signal-handlers are running at the same time.

comment:5 Changed 11 years ago by sjamaan

Extremely interesting: I cannot reproduce the problem when I add a single print statement after the (sleep 20) in the signal handler:

;; Press C-c multiple times in an interval < 20s

(use posix)

(set-signal-handler!
 signal/int
 (lambda (signal)
   (print "caught signal" signal)
   (sleep 20)
   (print "done")))

(let loop ()
  (sleep 1)
  (loop))

So far, I can't explain this yet.

comment:6 Changed 11 years ago by sjamaan

Still unclear, but the continuation of sleep seems to be doing something very strange. If we explicitly capture the continuation (which should be identical to the implicit continuation) and pass it to the procedure when we invoke it, it works. If we use the continuation supplied by the foreign-lambda wrapper, it breaks.

However, if sleep(x) is replaced by a constant integer, it works with either continuation.

(use posix)

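;; Use only one of the two definitions below at a time.

;; BROKEN: returns through C_k, the continuation supplied by the
;; foreign-primitive wrapper.
(define do-sleep
  (foreign-primitive ((scheme-object k) (int x))
                     "printf(\"%d\\n\", sleep(x)); C_values(2, C_SCHEME_UNDEFINED, C_k);"))

;; OKAY: returns through k, the continuation captured explicitly with
;; call/cc and passed in as an argument.
(define do-sleep
  (foreign-primitive ((scheme-object k) (int x))
                     "printf(\"%d\\n\", sleep(x)); C_values(2, C_SCHEME_UNDEFINED, k);"))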


(set-signal-handler!
 signal/int
 (lambda (signal)
   (print "caught signal" signal)
   (call/cc (lambda (k) (do-sleep k 10)))))


(let loop ()
  (sleep 1)
  (loop))

It gets weirder and weirder!

comment:7 Changed 11 years ago by sjamaan

This could be related to #1058. If either is fixed, the other should be re-tested.

comment:8 Changed 10 years ago by sjamaan

I think #877 is due to the same underlying problem. There seems to be something strange going on when handling interrupts. I think the reason the bug goes away when messing with the continuation is that we're now allowing interrupts to get handled the "regular" way, through the GC, rather than explicitly by C_pending_interrupt.

I still don't quite grok it, but this seems to be the case. One fix is to not call C_pending_interrupt, but this would kill the ability to save a backlog of interrupts (effectively undoing b7995839c0b481280bdeda117eb68bc0e78a40bf).

comment:9 Changed 10 years ago by sjamaan

A better solution is in the works

comment:10 Changed 10 years ago by Mario Domenech Goulart

Fixed by 3f43b7c808269ff9eba0702800018ed327d7758e

comment:11 Changed 10 years ago by sjamaan

Resolution: fixed
Status: new → closed