#614 closed defect (fixed)
TCP sockets get closed prematurely
Reported by: | Moritz Heidkamp | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | 4.9.0 |
Component: | core libraries | Version: | 4.7.x |
Keywords: | | Cc: | sjamaan |
Estimated difficulty: | | | |
Description
Running the attached programs leads to "Connection reset by peer" errors.
Attachments (2)
Change History (14)
Changed 13 years ago by
Attachment: | listen.scm added |
---|
Changed 13 years ago by
Attachment: | client.scm added |
---|
comment:1 Changed 13 years ago by
Cc: | sjamaan added |
---|---|
Component: | unknown → core libraries |
comment:2 Changed 13 years ago by
Sorry for replying late, this somehow slipped my attention.
I have only tested it on my machine through the loopback interface so far. Maybe a tcpdump or something like this would reveal if that's what's happening. I'm a bit clueless about all that TCP stuff but will try to figure it out!
comment:3 follow-up: 9 Changed 13 years ago by
I tried to run the test code again, but keep getting segfaults in the client with HEAD chicken :(
It crashes pretty consistently in f_943 in scheduler.c (which is ##sys#interrupt-hook), even if the server testing code is not running.
Enabling different debug options causes it to fail in slightly different places, but it's always just before it calls ##sys#schedule.
It looks like this was caused by changeset 7cd875bb13f2efd2af83a9f8d82217101e4e85a3 because reverting it causes the code to work again.
comment:5 Changed 13 years ago by
After reverting the patch I get several of these in the client:
Warning (#<thread: thread843>): (tcp-connect) cannot connect to socket - Bad file descriptor "localhost" 8080
and then finally the familiar
Warning (#<thread: thread271>): (tcp-connect) cannot create socket - Connection reset by peer
comment:6 Changed 13 years ago by
The connect error happens because the backlog is 4 by default. If I change listen.scm to use a larger backlog by doing (tcp-listen 8080 100), the client continues doing its work and stops normally after a very short time.
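The effect of the listen backlog can be sketched outside of Chicken with plain BSD-style sockets. This is a minimal Python illustration, not the attached listen.scm or client.scm; the helper names make_listener and connect_burst are hypothetical, and the exact number of connections a given backlog admits depends on the operating system.

```python
import socket


def make_listener(backlog, host="127.0.0.1", port=0):
    """Create a listening TCP socket with the given backlog (port 0 = any free port)."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    # The backlog bounds how many completed connections may wait for accept().
    srv.listen(backlog)
    return srv


def connect_burst(address, count, timeout=2.0):
    """Open `count` connections while the server never calls accept().

    Returns how many connections were established before one failed
    or timed out. With a small backlog the later connects tend to stall.
    """
    clients = []
    try:
        for _ in range(count):
            c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            c.settimeout(timeout)
            try:
                c.connect(address)
            except OSError:
                c.close()
                break
            clients.append(c)
        return len(clients)
    finally:
        for c in clients:
            c.close()


if __name__ == "__main__":
    srv = make_listener(100)
    print("connected:", connect_burst(srv.getsockname(), 20))
    srv.close()
```

Calling make_listener with a backlog of 4 and a burst larger than that is one way to watch connects stall on a given system, although how the kernel treats the overflow (dropping the SYN, resetting, or rounding the backlog up) varies.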
comment:7 Changed 13 years ago by
Probably the default should be raised, as the low value gives networked Chicken apps a bad name; if it's raised, the client exits almost immediately instead of hanging for a while. Now Spiffy's performance is closer to node.js.
comment:9 Changed 13 years ago by
Replying to sjamaan:
It looks like this was caused by changeset 7cd875bb13f2efd2af83a9f8d82217101e4e85a3 because reverting it causes the code to work again.
Very strange. I have removed all uses of the error fdset from the scheduler (it wouldn't have been useful anyway). Why this crashes I can't explain.
comment:10 Changed 13 years ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
I raised (and documented) the default backlog to 100.
comment:11 Changed 13 years ago by
What's the expected behavior of this test? When I run client.scm against listen.scm, I get errors like
Warning (#<thread: thread871>): read operation timed out 60000 178
Error: uncaught exception: #<condition: (exn i/o net timeout)>
While digging through my copy of Stevens' "UNIX Network Programming", I came upon a section that deals with an example TCP client that, waiting for input from a server, gets into a state where it assumes the connection is reset. Not exactly the same, but this passage looked interesting:
"... That is why the client's call to readline returns 0 because the FIN that was received earlier is ready to be read. But if we run the client and server on the same host, or if we were to put a slight pause in the client before its call to readline, then the received RST takes precedence over the FIN that was received earlier. This would cause readline to return an error and errno would contain ECONNRESET."
I tried to find more information in the RFC, but that's much too dense for me. The gist of this is: is it possible that there isn't really an error situation, but an RST segment is sent by the server (out of whatever reason, perhaps just because the connection is dropped after closing, or so) and overtakes the FIN which would indicate a closing connection to the client? Perhaps there is a time window between the call to select and reading from the socket that is too long?
Has this test been run on two different machines in the same LAN? (I just tested it on a single host (4.6.7, Linux) and everything runs fine.)
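How a client comes to see ECONNRESET rather than a clean end-of-file can be sketched in a few lines of Python. This is only an illustration of the RST case described in the quoted passage, not a reproduction of the ticket's test: the server here forces an RST by closing with SO_LINGER set to zero, which is an assumption made to keep the example deterministic, and the function names are hypothetical. It was written with Linux in mind.

```python
import errno
import socket
import struct
import threading


def reset_on_accept(listener):
    """Accept one connection and abort it so the peer gets an RST, not a FIN."""
    conn, _ = listener.accept()
    # l_onoff=1, l_linger=0: close() discards the connection and sends RST.
    # (The "ii" layout matches Linux and most Unix systems.)
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    conn.close()


def observe_reset():
    """Return 'eof', 'data', or the errno name seen by the client's read."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]

    server = threading.Thread(target=reset_on_accept, args=(listener,))
    server.start()
    client = socket.create_connection(("127.0.0.1", port))
    server.join()  # the server side has aborted the connection by now

    try:
        data = client.recv(1024)
        # An orderly shutdown (FIN) would show up as an empty read.
        result = "eof" if data == b"" else "data"
    except ConnectionResetError as exc:
        # An RST surfaces as an error on the read instead.
        result = errno.errorcode[exc.errno]
    finally:
        client.close()
        listener.close()
    return result


if __name__ == "__main__":
    print(observe_reset())
```

If the server closed normally instead, the client's read would return an empty result, which corresponds to readline returning 0 in the quoted text.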