Opened 10 years ago

Closed 10 years ago

Last modified 9 years ago

#614 closed defect (fixed)

TCP sockets get closed prematurely

Reported by: Moritz Heidkamp Owned by:
Priority: major Milestone: 4.9.0
Component: core libraries Version: 4.7.x
Keywords: Cc: sjamaan
Estimated difficulty:

Description

Running the attached programs leads to "Connection reset by peer" errors.

Attachments (2)

listen.scm (575 bytes) - added by Moritz Heidkamp 10 years ago.
client.scm (564 bytes) - added by Moritz Heidkamp 10 years ago.


Change History (14)

Changed 10 years ago by Moritz Heidkamp

Attachment: listen.scm added

Changed 10 years ago by Moritz Heidkamp

Attachment: client.scm added

comment:1 Changed 10 years ago by felix winkelmann

Cc: sjamaan added
Component: unknown → core libraries

While digging through my copy of Stevens' "UNIX Network Programming", I came upon a section dealing with an example TCP client that, while waiting for input from a server, gets into a state where it assumes the connection has been reset. Not exactly the same situation, but this passage looked interesting:

"... That is why the client's call to readline returns 0 because the FIN that was received earlier is ready to be read. But if we run the client and server on the same host, or if we were to put a slight pause in the client before its call to readline, then the received RST takes precedence over the FIN that was received earlier. This would cause readline to return an error and errno would contain ECONNRESET."

I tried to find more information in the RFC, but that's much too dense for me. The gist of this is: is it possible that there isn't really an error situation at all, but that an RST segment is sent by the server (for whatever reason, perhaps just because the connection is dropped after closing) and overtakes the FIN that would indicate an orderly connection close to the client? Perhaps the time window between the call to select and the read from the socket is too long?

Has this test been run on two different machines in the same LAN? (I just tested it on a single host (4.6.7, Linux) and everything runs fine).
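The RST-versus-FIN scenario described above can be reproduced directly. Here is a minimal Python sketch (not the attached Scheme test programs; the ephemeral port and function name are mine, and Linux behavior is assumed): the server closes its end with SO_LINGER set to a zero timeout, which makes the kernel send an RST instead of the usual FIN, so the client's subsequent read fails with ECONNRESET rather than seeing a clean end-of-file:

```python
import errno
import socket
import struct
import threading
import time

def run_demo():
    # Listener on an ephemeral loopback port (the attached test uses
    # port 8080; the port here is arbitrary).
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    port = srv.getsockname()[1]

    def server():
        conn, _ = srv.accept()
        # SO_LINGER with a zero timeout makes close() send an RST
        # segment instead of the normal FIN.
        conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                        struct.pack("ii", 1, 0))
        conn.close()

    t = threading.Thread(target=server)
    t.start()

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(("127.0.0.1", port))
    t.join()
    time.sleep(0.2)  # make sure the RST has arrived before we read
    try:
        client.recv(1)    # after a plain FIN this would return b""
        result = None
    except OSError as e:
        result = e.errno  # an RST surfaces as ECONNRESET
    finally:
        client.close()
        srv.close()
    return result

print(run_demo() == errno.ECONNRESET)
```

If the server instead closed normally, the client's recv would return an empty byte string (the FIN case from the Stevens quote) rather than raising an error.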

comment:2 Changed 10 years ago by Moritz Heidkamp

Sorry for replying late, this somehow slipped my attention.

I have only tested it on my machine through the loopback interface so far. Maybe a tcpdump trace or something like that would reveal whether that's what's happening. I'm a bit clueless about all that TCP stuff but will try to figure it out!

comment:3 Changed 10 years ago by sjamaan

I tried to run the test code again, but keep getting segfaults in the client with HEAD chicken :(

It crashes pretty consistently in f_943 in scheduler.c (which is ##sys#interrupt-hook), even if the server testing code is not running.

Enabling different debug options causes it to fail in slightly different places, but it's always just before the call to ##sys#schedule.

It looks like this was caused by changeset 7cd875bb13f2efd2af83a9f8d82217101e4e85a3 because reverting it causes the code to work again.

comment:4 Changed 10 years ago by sjamaan

(this is on netbsd/amd64)

comment:5 Changed 10 years ago by sjamaan

After reverting the patch I get several of these in the client:

Warning (#<thread: thread843>): (tcp-connect) cannot connect to socket - Bad file descriptor
"localhost"
8080

and then finally the familiar

Warning (#<thread: thread271>): (tcp-connect) cannot create socket - Connection reset by peer

comment:6 Changed 10 years ago by sjamaan

The connect error happens because the backlog is 4 by default. If I change listen.scm to use a larger backlog via (tcp-listen 8080 100), the client keeps doing its work and terminates normally after a very short time.
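For reference, the backlog is just the second argument to listen(2). A minimal Python sketch of what (tcp-listen 8080 100) asks the kernel for (the helper name and the ephemeral-port choice are mine, not part of the CHICKEN API):

```python
import socket

def make_listener(port=0, backlog=100):
    # Roughly the kernel-level effect of (tcp-listen 8080 100): bind a
    # TCP socket and request a queue of up to `backlog` connections
    # that have completed the handshake but have not yet been
    # accept()ed.  Port 0 picks an ephemeral port here.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("127.0.0.1", port))
    s.listen(backlog)
    return s

srv = make_listener()
print(srv.getsockname()[1] > 0)  # socket is bound and listening
srv.close()
```

Note that the backlog is only a hint: kernels may round it up or clamp it (on Linux it is capped by net.core.somaxconn), so the value actually in effect can differ from the one requested.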

comment:7 Changed 10 years ago by sjamaan

The default should probably be raised, as this gives networked Chicken apps a bad name; with a raised backlog the client exits almost immediately instead of hanging for a while. Now Spiffy's performance is closer to node.js's.

comment:8 Changed 10 years ago by sjamaan

(actually, the backlog is 10, not 4; the docs are wrong)

comment:9 in reply to:  3 Changed 10 years ago by felix winkelmann

Replying to sjamaan:

It looks like this was caused by changeset 7cd875bb13f2efd2af83a9f8d82217101e4e85a3 because reverting it causes the code to work again.

Very strange. I have removed all uses of the error fdset from the scheduler (it wouldn't have been useful anyway). Why this crashes I can't explain.

comment:10 Changed 10 years ago by felix winkelmann

Resolution: fixed
Status: new → closed

I raised (and documented) the default backlog to 100.

comment:11 Changed 10 years ago by Mario Domenech Goulart

What's the expected behavior of this test? When I run client.scm against listen.scm, I get errors like:

Warning (#<thread: thread871>): read operation timed out
60000
178

Error: uncaught exception: #<condition: (exn i/o net timeout)>

comment:12 Changed 9 years ago by felix winkelmann

Milestone: 4.8.0 → 4.9.0

Milestone 4.8.0 deleted
