Date: Tue, 24 Oct 2017 21:02:50 +0000
From: Eric Wong
To: Marek Majkowski
Cc: Jason Baron, linux-kernel@vger.kernel.org,
	cmogstored-public@bogomips.org
Subject: Re: blog.cloudflare.com/the-sad-state-of-linux-socket-balancing/
Message-ID: <20171024210250.GA14673@dcvr>

Hi Marek,

I'm replying to
http://blog.cloudflare.com/the-sad-state-of-linux-socket-balancing/
via email so Jason and linux-kernel see it.  I also don't believe in
using centralized, proprietary messaging like Disqus for discussing
Open Source, nor do I deal with JavaScript.

I still believe the best way to balance connections across multiple
processes in any server with persistent socket connections is to
create a dedicated thread in each worker process which performs
nothing but blocking accept4() + EPOLL_CTL_ADD calls, acting as the
queue producer with EPOLLONESHOT.  (The "queue" in this case is the
epoll or kqueue file description.)

The usual worker thread(s) act as the queue consumer, calling
epoll_wait as usual with minimal modification.  These worker threads
only change the epoll watch set with EPOLL_CTL_MOD (and maybe _DEL),
never _ADD, always keeping EPOLLONESHOT set.

Perhaps some pseudo code can describe this better (a fuller wiring
sketch follows a few paragraphs down):

thread_acceptor: /* this thread never does epoll_wait */

	while (running) {
		/*
		 * blocking accept, but create non-blocking client socket;
		 * lfd may be shared across any number of processes
		 */
		int cfd = accept4(lfd, ..., SOCK_NONBLOCK);

		if (cfd >= 0) {
			struct epoll_event event;

			event.events = EPOLLONESHOT|EPOLLIN;
			event.data.ptr = client_new(cfd);

			/* epfd is per-process */
			epoll_ctl(epfd, EPOLL_CTL_ADD, cfd, &event);
		}
	}

thread_worker: /* this never does EPOLL_CTL_ADD */

	/*
	 * If there are multiple worker threads, maxevents can be 1
	 * for optimal fairness (at the expense of throughput)
	 */
	while (running) {
		int i;
		int n = epoll_wait(epfd, events, maxevents, timeout);

		for (i = 0; i < n; i++) {
			struct client *client = events[i].data.ptr;

			/*
			 * The usual non-blocking server processing;
			 * any socket read/writes are non-blocking here:
			 */
			enum next_action next = client_read_write(client);
			int want = 0;

			switch (next) {
			case NEXT_RDONLY: want = EPOLLIN; break;
			case NEXT_WRONLY: want = EPOLLOUT; break;
			case NEXT_RDWR: want = EPOLLOUT|EPOLLIN; break;
			case NEXT_CLOSE:
				close(client->fd);
				break; /* want stays 0: nothing to rearm */
			}
			if (want) {
				/* rearm the oneshot watch for this client */
				events[i].events = want | EPOLLONESHOT;
				epoll_ctl(epfd, EPOLL_CTL_MOD,
					  client->fd, &events[i]);
			}
		}
	}

I came up with this design back around 2011, before EPOLLEXCLUSIVE
and SO_REUSEPORT came about.  I based it instead on the
ancient-but-still-accurate document on blocking accept() and
exclusive wakeups:

  http://www.citi.umich.edu/projects/linux-scalability/reports/accept.html

All this is applied in cmogstored, which has had the same basic
design since 2012.  Since then, it has evenly distributed persistent
connections across multiple processes.
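For illustration, here's a minimal wiring sketch of the above.  To be
clear, this is only a sketch under my own assumptions: the
process/thread counts, the port, and everything else not in the
pseudo code are mine and not details of cmogstored; it would link
against implementations of the two loops above, and error checking
is omitted:

#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <sys/epoll.h>
#include <sys/wait.h>
#include <pthread.h>
#include <string.h>
#include <unistd.h>

#define NPROC 4		/* worker processes (assumed count) */
#define NWORKER 2	/* epoll_wait-ing threads per process */

int lfd;	/* shared across processes, left blocking */
int epfd;	/* per-process epoll instance */

void *thread_acceptor(void *);	/* the loops from the pseudo code */
void *thread_worker(void *);	/* above, adapted to these globals */

static void worker_process(void)
{
	pthread_t acc, wrk[NWORKER];
	int i;

	epfd = epoll_create1(0); /* never shared between processes */
	pthread_create(&acc, NULL, thread_acceptor, NULL);
	for (i = 0; i < NWORKER; i++)
		pthread_create(&wrk[i], NULL, thread_worker, NULL);
	pthread_join(acc, NULL);
	for (i = 0; i < NWORKER; i++)
		pthread_join(wrk[i], NULL);
}

int main(void)
{
	struct sockaddr_in addr;
	int i;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_port = htons(8080); /* arbitrary port for this sketch */
	addr.sin_addr.s_addr = htonl(INADDR_ANY);

	/* note: no SOCK_NONBLOCK; the acceptor threads want to block */
	lfd = socket(AF_INET, SOCK_STREAM, 0);
	bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
	listen(lfd, SOMAXCONN);

	for (i = 0; i < NPROC; i++)
		if (fork() == 0) {
			worker_process();
			_exit(0);
		}
	while (wait(NULL) > 0)
		; /* reap worker processes */
	return 0;
}

The parts that matter are that lfd stays blocking and is inherited
across fork(), while each process creates its own private epfd.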
cmogstored supports hundreds of rotational disks in a JBOD
configuration while maintaining thousands of persistent connections
indefinitely, for both fast and slow clients, over HTTP and a
MogileFS-specific sidechannel protocol (TCP).  It applies the
Waitrose "combined queue model" in every aspect to avoid worst-case
latency.  For cmogstored, any new latency from lock contention
(ep->mtx and ep->lock) is inconsequential compared to the huge
latencies of storage devices.

Some documentation around the design is here:

  https://bogomips.org/cmogstored/queues.txt

Multiple-process support isn't documented, as it wasn't in the
original Perl mogstored; but it's there since I figured somebody
might run into contention with FD allocation, and it provides some
safety in case of segfaults [1].

All the code (GPL-3.0+) is available at:

  git clone git://bogomips.org/cmogstored/

It also works with kqueue and is in the FreeBSD ports collection.

[1] Ironically, the only segfault I've encountered in cmogstored was
    because I accidentally shared a DIR * (from opendir) across
    processes :x  And, really, cmogstored probably doesn't benefit
    from multiple processes the way nginx does, as cmogstored was
    always designed to be MT.

Anyways, I believe nginx can apply this design of dedicated blocking
acceptor threads for each worker process to its existing model to
improve client balancing across worker processes.  However, this
change breaks the nginx SIGUSR2 upgrade/backout, because old workers
depend on O_NONBLOCK being set on the listener while the new ones do
not want it.  I half-heartedly proposed SOCK_DONTWAIT (and maybe
SOCK_MUSTWAIT) for accept4 to get around this, but never cared
enough to push for it: <20150513023712.GA4206@dcvr.yhbt.net>

Fwiw, in old-fashioned servers without epoll/kqueue multiplexing
(such as Apache with mpm_prefork), the imbalance from non-blocking
accept is beneficial: it keeps active workers hot.  I'd never expose
one of those servers to the Internet without something like nginx
protecting it from Slowloris, though.

For the same reason, I prefer the LIFO wakeup behavior of multiple
epoll_wait callers on the same epfd with cmogstored.  The actual
order of events is FIFO, of course.

> 1. Of course comparing blocking accept() with a full featured
>    epoll() event loop is not fair.  Epoll is more powerful and
>    allows us to create rich event driven programs.  Using
>    blocking accept is rather cumbersome or just not useful at
>    all.  To make any sense, blocking accept programs would
>    require careful multi-threading programming, with a
>    dedicated thread per request.

Of course I disagree :)

> 2. Another surprise lurking in the corner - using blocking
>    accept() on Linux is technically incorrect!  Alan Burlison
>    pointed out that calling close() on listen socket that
>    has blocking accepts() will not interrupt them.  This can
>    result in a buggy behavior - you may get a successful
>    accept() on a listen socket that no longer exists.  When in
>    doubt - avoid using blocking accept() in multithreaded
>    programs.  The workaround is to call shutdown() first, but
>    this is not POSIX compliant.  It's a mess.

Right, I use pthread_kill to trigger EINTR in accept4 and check a
`running' flag in the loop as above (sketched just below).  This
won't cause other processes to lose connections.  pthread
cancellation would cause lost connections with accept4 (and lost
events in epoll_wait, as confirmed by glibc folks in libc-help (*)),
so pthread_kill seems the best option where available.
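Only as a sketch under my own assumptions (the choice of SIGUSR2 and
the bookkeeping here are mine, not necessarily what cmogstored
does), the pthread_kill path can look like this:

#define _GNU_SOURCE /* for accept4 */
#include <sys/socket.h>
#include <pthread.h>
#include <signal.h>
#include <string.h>
#include <errno.h>

static volatile sig_atomic_t running = 1;
static pthread_t acceptor; /* set when the acceptor thread is spawned */

static void wake(int sig)
{
	(void)sig; /* exists only so accept4 fails with EINTR */
}

static void install_wakeup_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = wake;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = 0; /* no SA_RESTART: accept4 must NOT auto-restart */
	sigaction(SIGUSR2, &sa, NULL);
}

void *thread_acceptor(void *lfdp)
{
	int lfd = *(int *)lfdp;

	while (running) {
		int cfd = accept4(lfd, NULL, NULL, SOCK_NONBLOCK);

		if (cfd < 0) {
			if (errno == EINTR)
				continue; /* rechecks `running' */
			continue; /* a real server handles errors here */
		}
		/* ... client_new + EPOLL_CTL_ADD with EPOLLONESHOT ... */
	}
	return NULL;
}

/* called from any other thread to stop the acceptor */
static void stop_acceptor(void)
{
	running = 0;
	pthread_kill(acceptor, SIGUSR2); /* EINTRs a blocked accept4 */
	pthread_join(acceptor, NULL);
}

A real implementation also has to consider the race where `running'
is checked just before the thread blocks in accept4; periodically
re-sending the signal until the join completes is one way to close
that window.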
(*) For higher-level languages and some VMs (I think it was
Rubinius) without pthread_kill, I create fake clients which connect
to the listener and eventually kick the acceptor out of its blocking
accept4.  For epoll_wait, I create a pipe/eventfd object,
EPOLL_CTL_ADD it with EPOLLOUT (no EPOLLONESHOT), and make the
client_read_write function exit the current thread when it sees the
pipe/eventfd.  Lacking EPOLLONESHOT, that event will just bounce
around the threads until all the epoll_wait-ing threads have exited
(a sketch of this follows at the end of this mail).  I still find it
fun to imagine this object bouncing around threads to stop them :)

Anyways, thanks for bringing this up.
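As promised, here's a rough sketch of that thread-stopping object
for the epoll_wait side.  The naming (shutdown_efd,
is_shutdown_event) is mine, and cmogstored's actual implementation
may differ:

#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <stddef.h>

static int epfd;		/* per-process, as above */
static int shutdown_efd = -1;

/* run once to begin stopping all epoll_wait-ing threads */
static void broadcast_shutdown(void)
{
	struct epoll_event ev;

	shutdown_efd = eventfd(0, 0);
	ev.events = EPOLLOUT;	/* note: no EPOLLONESHOT */
	ev.data.ptr = &shutdown_efd; /* sentinel pointer, not a client */
	epoll_ctl(epfd, EPOLL_CTL_ADD, shutdown_efd, &ev);
}

/* check before treating data.ptr as a struct client */
static int is_shutdown_event(void *ptr)
{
	return ptr == &shutdown_efd;
}

/* worker loop fragment */
void *thread_worker(void *unused)
{
	struct epoll_event events[1];

	(void)unused;
	for (;;) {
		int n = epoll_wait(epfd, events, 1, -1);

		if (n > 0 && is_shutdown_event(events[0].data.ptr))
			return NULL; /* exit; event stays armed for others */
		/* ... the usual client_read_write handling ... */
	}
}

An unwritten eventfd is always writable, so the level-triggered
EPOLLOUT (with no EPOLLONESHOT to disarm it) keeps firing and wakes
each epoll_wait-ing thread in turn until every one has exited.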