Opened 11 years ago

Closed 11 years ago

Last modified 11 years ago

#316 closed defect (fixed)

spiffy hangs sometimes

Reported by: felix winkelmann Owned by:
Priority: critical Milestone:
Component: extensions Version: 4.5.x
Keywords: spiffy call-cc.org madness Cc: sjamaan, Mario Domenech Goulart
Estimated difficulty:

Description

There seem to be scheduler problems with Spiffy on call-cc.org, as it has hung up several times, now.

Suggested plan:

  • try to reproduce it by hammering the server locally with ab or Siege.
  • checkout the scheduler patches submitted recently by Jörg Wittenberger (spiffy on call-cc.org runs on 4.5.0, the diff is attached).
  • possibly try Joerg's "high-load-scheduler" (needs to be updated to the newest version, first, since JW sent me some code recently that introduced some changes that haven't been integrated).
  • Stay calm.

Attachments (2)

scheduler.diff (3.0 KB) - added by felix winkelmann 11 years ago.
git diff master..experimental scheduler.scm
library.diff (951 bytes) - added by felix winkelmann 11 years ago.
also needed

Download all attachments as: .zip

Change History (16)

Changed 11 years ago by felix winkelmann

Attachment: scheduler.diff added

git diff master..experimental scheduler.scm

Changed 11 years ago by felix winkelmann

Attachment: library.diff added

also needed

comment:1 Changed 11 years ago by Jim Ursetto

What's the purpose of trying the high-load scheduler? call-cc is sustaining two hits per second max, usually much lower.

What I would be curious to see is the number of threads that were running at the time. If there were 1024 stuck threads, spiffy will refuse to spawn any more (it will block via mutex until a thread finishes).

Another less likely possibility is that it ran out of file descriptors and/or ports are not being closed properly on exception.

If you want to cut down on traffic, it is probably a good idea to forbid the robots from crawling the edit pages, the history pages, the git objects in chicken-core, chicken.wiki.br, and the https:// schema. That's a lot of duplicate or irrelevant bandwidth.

comment:2 Changed 11 years ago by Jim Ursetto

Pointed out the fd issue because there are a bunch of (file-exists?) errors in the logs, which could cause an FD leak (I have experienced hangs from this before). On the other hand there are only 418 total errors so there would have to be multiple leaks per error.

Maybe someone could run lsof on spiffy right now (I don't have root) and check for open FDs. Errors are still happening at least once a minute.

comment:3 Changed 11 years ago by Jim Ursetto

Keywords: madness added

sjamaan confirmed there were 600+ files open due to a bug in svn-client, and fixed this issue.

Unfortunately we experienced another hang on :80 and later on :443, and we haven't gotten good debugging information on it yet.

I would recommend another lsof (to confirm FD leak issue is fixed) and strace to see if any system calls are being made or are in progress. Then maybe get a map of code pages and see if you can narrow down which library it is hanging in (maybe via core dump). Then I guess we try felix's idea of debugging libraries, because I'm running out of ideas.

Chickadee has not experienced this problem. Even though it's behind a caching proxy, which IIRC accesses spiffy over HTTP/1.0 (no persistent connections), it's still handled about 7000 hits in the last 6 days. It is running on Chicken 4.5.5 right now but I also ran it on 4.4.x and probably 4.3.x earlier.

comment:4 in reply to:  1 Changed 11 years ago by felix winkelmann

Replying to zbigniew:

What's the purpose of trying the high-load scheduler? call-cc is sustaining two hits per second max, usually much lower.

It contains some potential bugfixes.

What I would be curious to see is the number of threads that were running at the time. If there were 1024 stuck threads, spiffy will refuse to spawn any more (it will block via mutex until a thread finishes).

In chicken 4.5.7 there is an internal procedure `(##sys#dump-heap-state)' which prints statistics to stderr about all objects in the heap.

Another less likely possibility is that it ran out of file descriptors and/or ports are not being closed properly on exception.

It wouldn't hang in such a case, it would generate a "too many open files" error.

comment:5 Changed 11 years ago by sjamaan

The FD issue wasn't fixed. I have made a new commit that hopefully fixes it for good. I've updated and restarted spiffy.

comment:6 in reply to:  3 Changed 11 years ago by felix winkelmann

Replying to zbigniew:

Unfortunately we experienced another hang on :80 and later on :443, and we haven't gotten good debugging information on it yet.

I would recommend another lsof (to confirm FD leak issue is fixed) and strace to see if any system calls are being made or are in progress. Then maybe get a map of code pages and see if you can narrow down which library it is hanging in (maybe via core dump). Then I guess we try felix's idea of debugging libraries, because I'm running out of ideas.

I think running spiffy with full debugging info is the best option. That way we can attach to it when it hangs and have full source-level debugging in gdb. It's just a matter of running spiffy via /home/felix/chicken-core-spiffy-debug/bin/csi and keeping it running when it hangs.

comment:7 Changed 11 years ago by sjamaan

Look what I found in the access log!

https://wiki.call-cc.org/eggref/3/web-scheme?action=show&rev=5821

For some reason this causes an error inside libsvn or the svn-client egg which causes it to hold open a reference to the tempfile. However, for other files the history functionality seems to work fine.

This revision is listed in the history page for eggref/3/web-scheme, so I'm not sure what's going on here. Something's fishy!

comment:8 Changed 11 years ago by Mario Domenech Goulart

I could set up a qwiki instance for debugging on call-cc.org (Felix: I didn't use your home not to mess too much with it).

It's on /home/mario/wiki-debug. Chicken is on /home/mario/wiki-debug/chicken and it was compiled with DEBUGBUILD=1. I've installed the latest trunk code for svn-client, spiffy and qwiki.

/home/mario/wiki-debug contains a copy of the svn repo and the same directory structure used by spiffy (e.g., /home/mario/wiki-debug/etc for /etc, /home/mario/wiki-debug/var for /var and so on).

Peter and me run a debugging session today and found out that one of the problems is still fd leak, caused by svn-client (or the underlying libsvn-client).

We've run the server and stressed it with wget (wget -r -l5). Using lsof, we observed the number of lines like

csi     7634 mario  961u   REG       8,6     736    1020 /tmp/svnwiki-cat-QNZIrJ (deleted)
csi     7634 mario  962u   REG       8,6     701    1021 /tmp/svnwiki-cat-YQhjmY (deleted)
csi     7634 mario  963u   REG       8,6   11834    1022 /tmp/svnwiki-cat-9hY3od (deleted)

increased during the test. According to ulimit, the max number of open files is 1024, and spiffy died when the number of svnwiki-cat files from lsof got close to 1024.

The latest messages before segmentation fault were:

svn: '/home/mario/wiki-debug/tmp/qwiki/hive' is not under version control
svn: Can't open file '/home/mario/wiki-debug/usr/local/repos/chicken-eggs/db/revs/4718': Too many open files
svn: Can't open file '/home/mario/wiki-debug/usr/local/repos/chicken-eggs/db/revs/4718': Too many open files
svn: Can't open file '/home/mario/wiki-debug/tmp/qwiki/.svn/entries': Too many open files

If you want to make some more tests, feel free to use my home dir. Be sure to set the PATH to get things from the debug dir:

PATH=/home/mario/wiki-debug/chicken/bin/:/home/mario/wiki-debug/bin:$PATH  csi -s bin/spiffy

If you get permissions problems, feel free to change them.

comment:9 in reply to:  8 ; Changed 11 years ago by felix winkelmann

Replying to mario:

I could set up a qwiki instance for debugging on call-cc.org (Felix: I didn't use your home not to mess too much with it).

It's on /home/mario/wiki-debug. Chicken is on /home/mario/wiki-debug/chicken and it was compiled with DEBUGBUILD=1. I've installed the latest trunk code for svn-client, spiffy and qwiki.

Have you installed all the extension from a local checkout, with the "-k" option? That is required to have full source-level debugging of all extensions.

Thanks for the setup. I will try it out. I'll need some time to figure out where all the files are. On what port is the spiffy instance listening?

comment:10 in reply to:  9 Changed 11 years ago by Mario Domenech Goulart

Replying to felix:

Have you installed all the extension from a local checkout, with the "-k" option? That is required to have full source-level debugging of all extensions.

Unfortunately not. I installed them the usual way (chicken-install eggs), except for spiffy, svn-client and qwiki, that I changed to the svn dir for each egg and installed them providing the .setup file for chicken-install.

Thanks for the setup. I will try it out. I'll need some time to figure out where all the files are. On what port is the spiffy instance listening?

Let me try to summarize some things (consider prefix = /home/mario/wiki-debug)

  • the spiffy configuration is prefix/etc/spiffy/spiffy.conf
  • the qwiki configuration is prefix/etc/spiffy/qwiki.conf
  • logs go to prefix/var/log/spiffy

We've tried setting spiffy's max-connections parameter to 4, it doesn't change anything, it seems. It's still configured that way.

The server listens on port 8080.

You can start it by running

cd /home/mario/wiki-debug
PATH=/home/mario/wiki-debug/chicken/bin/:/home/mario/wiki-debug/bin:$PATH  csi -s bin/spiffy

It will load the spiffy and qwiki configuration from prefix/etc/spiffy/spiffy.conf

and prefix/etc/spiffy/qwiki.conf

You can use

wget -r -l5 http://www.call-cc.org:8080

to stress the server.

I've made a copy of wiki-debug (to wiki-debug.bkp). Feel free to do whatever you want with wiki-debug. I think you can even copy the whole dir to your home if you want (I _think_ you just need to change wiki-debug/chroot.scm to point to the new dir -- haven't tested it).

I've set the permissions in a way that users from the chicken group can write to wiki-debug.

comment:11 Changed 11 years ago by Mario Domenech Goulart

I think the fix Peter applied to svn-client solved the crash problem.

I've installed the latest svn-client version on the wiki-debug environment and run the following script against spiffy:

wget -r -L -l5 http://wiki.call-cc.org:8080 &
wget -r -L -l5 http://wiki.call-cc.org:8080 &
wget -r -L -l5 http://wiki.call-cc.org:8080 &
wget -r -L -l5 http://wiki.call-cc.org:8080 &
wget -r -L -l5 http://wiki.call-cc.org:8080

(Nevermind the java-style for repeating code)

The server didn't crash after serving the total of 8400 requests. lsof didn't report leaking svnwiki-cats.

comment:12 Changed 11 years ago by felix winkelmann

May this be really fixed? Wouldn't that be nice...

Note that there definitely is a scheduler bug (a short while ago reported by N. Pelletier on chicken-users), where thread-timeouts overflow after about 13 days. Since this bug exists for a long time and hasn't (with the one exception) ever be reported,
it may be a non-issue, but this is nevertheless a reminder. That particular problem is fixed in "experimental", but I recommend not to try it out light-heartedly, as it didn't get enough testing yet.

comment:13 Changed 11 years ago by sjamaan

Resolution: fixed
Status: newclosed

Spiffy has been running for three days now without interruption. It has never been able to run that long while the svn-client bug was in there, so it's highly likely that was the cause of this particular hang. I think we can declare this ticket closed. If the scheduler bug causes a hang, we should start a new ticket as the problem would be caused a different bug even though the symptoms would (probably?) be the same.

comment:14 Changed 11 years ago by felix winkelmann

Milestone: 4.6.0

Milestone 4.6.0 deleted

Note: See TracTickets for help on using tickets.