Date: Tue, 24 Oct 2017 21:02:50 +0000
From: Eric Wong
To: Marek Majkowski
Cc: Jason Baron, linux-kernel@vger.kernel.org,
	cmogstored-public@bogomips.org
Subject: Re: blog.cloudflare.com/the-sad-state-of-linux-socket-balancing/
Message-ID: <20171024210250.GA14673@dcvr>

Hi Marek,

I'm replying to
http://blog.cloudflare.com/the-sad-state-of-linux-socket-balancing/
via email so Jason and linux-kernel see it.  I also don't believe in
using centralized, proprietary messaging like Disqus for discussing
Open Source, nor do I deal with JavaScript.

I still believe the best way to balance connections across multiple
processes in any server with persistent socket connections is to
create a dedicated thread in each worker process which performs
nothing but blocking accept4() + EPOLL_CTL_ADD calls, acting as the
queue producer with EPOLLONESHOT.  (The "queue" in this case is the
epoll or kqueue file description.)

The usual worker thread(s) act as the queue consumer, calling
epoll_wait as usual with minimal modification.  These worker threads
only change the epoll watch set with EPOLL_CTL_MOD (and maybe _DEL),
never _ADD, always keeping EPOLLONESHOT set.

Perhaps some pseudo code can describe this better (a fuller wiring
sketch follows a few paragraphs down):

thread_acceptor: /* this thread never does epoll_wait */

	while (running) {
		/*
		 * blocking accept, but create non-blocking client socket;
		 * lfd may be shared across any number of processes
		 */
		int cfd = accept4(lfd, ..., SOCK_NONBLOCK);

		if (cfd >= 0) {
			struct epoll_event event;

			event.events = EPOLLONESHOT|EPOLLIN;
			event.data.ptr = client_new(cfd);

			/* epfd is per-process */
			epoll_ctl(epfd, EPOLL_CTL_ADD, cfd, &event);
		}
	}

thread_worker: /* this never does EPOLL_CTL_ADD */

	/*
	 * If there are multiple worker threads, maxevents can be 1
	 * for optimal fairness (at the expense of throughput)
	 */
	while (running) {
		int i;
		int n = epoll_wait(epfd, events, maxevents, timeout);

		for (i = 0; i < n; i++) {
			struct client *client = events[i].data.ptr;

			/*
			 * The usual non-blocking server processing;
			 * any socket read/writes are non-blocking here:
			 */
			enum next_action next = client_read_write(client);
			int want = 0;

			switch (next) {
			case NEXT_RDONLY: want = EPOLLIN; break;
			case NEXT_WRONLY: want = EPOLLOUT; break;
			case NEXT_RDWR: want = EPOLLOUT|EPOLLIN; break;
			case NEXT_CLOSE:
				close(client->fd);
				break; /* want stays 0: nothing to rearm */
			}
			if (want) {
				/* rearm the oneshot watch for this client */
				events[i].events = want | EPOLLONESHOT;
				epoll_ctl(epfd, EPOLL_CTL_MOD,
					  client->fd, &events[i]);
			}
		}
	}

I came up with this design back around 2011, before EPOLLEXCLUSIVE
and SO_REUSEPORT came about.  I based it instead on the
ancient-but-still-accurate document on blocking accept() and
exclusive wakeups:

  http://www.citi.umich.edu/projects/linux-scalability/reports/accept.html

All this is applied in cmogstored, which has had the same basic
design since 2012.  Since then, it has evenly distributed persistent
connections across multiple processes.
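For illustration, here's a minimal wiring sketch of the above.  To be
clear, this is only a sketch under my own assumptions: the
process/thread counts, the port, and everything else not in the
pseudo code are mine and not details of cmogstored; it would link
against implementations of the two loops above, and error checking
is omitted:

#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <sys/epoll.h>
#include <sys/wait.h>
#include <pthread.h>
#include <string.h>
#include <unistd.h>

#define NPROC 4		/* worker processes (assumed count) */
#define NWORKER 2	/* epoll_wait-ing threads per process */

int lfd;	/* shared across processes, left blocking */
int epfd;	/* per-process epoll instance */

void *thread_acceptor(void *);	/* the loops from the pseudo code */
void *thread_worker(void *);	/* above, adapted to these globals */

static void worker_process(void)
{
	pthread_t acc, wrk[NWORKER];
	int i;

	epfd = epoll_create1(0); /* never shared between processes */
	pthread_create(&acc, NULL, thread_acceptor, NULL);
	for (i = 0; i < NWORKER; i++)
		pthread_create(&wrk[i], NULL, thread_worker, NULL);
	pthread_join(acc, NULL);
	for (i = 0; i < NWORKER; i++)
		pthread_join(wrk[i], NULL);
}

int main(void)
{
	struct sockaddr_in addr;
	int i;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_port = htons(8080); /* arbitrary port for this sketch */
	addr.sin_addr.s_addr = htonl(INADDR_ANY);

	/* note: no SOCK_NONBLOCK; the acceptor threads want to block */
	lfd = socket(AF_INET, SOCK_STREAM, 0);
	bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
	listen(lfd, SOMAXCONN);

	for (i = 0; i < NPROC; i++)
		if (fork() == 0) {
			worker_process();
			_exit(0);
		}
	while (wait(NULL) > 0)
		; /* reap worker processes */
	return 0;
}

The parts that matter are that lfd stays blocking and is inherited
across fork(), while each process creates its own private epfd.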
cmogstored supports hundreds of rotational disks in a JBOD
configuration while maintaining thousands of persistent connections
indefinitely, for both fast and slow clients, over HTTP and a
MogileFS-specific sidechannel protocol (TCP).  It applies the
Waitrose "combined queue model" in every aspect to avoid worst-case
latency.  For cmogstored, any new latency from lock contention
(ep->mtx and ep->lock) is inconsequential compared to the huge
latencies of storage devices.

Some documentation around the design is here:

  https://bogomips.org/cmogstored/queues.txt

Multiple-process support isn't documented, as it wasn't in the
original Perl mogstored; but it's there since I figured somebody
might run into contention with FD allocation, and it provides some
safety in case of segfaults [1].

All the code (GPL-3.0+) is available at:

  git clone git://bogomips.org/cmogstored/

It also works with kqueue and is in the FreeBSD ports collection.

[1] Ironically, the only segfault I've encountered in cmogstored was
    because I accidentally shared a DIR * (from opendir) across
    processes :x  And, really, cmogstored probably doesn't benefit
    from multiple processes the way nginx does, as cmogstored was
    always designed to be MT.

Anyways, I believe nginx can apply this design of dedicated blocking
acceptor threads for each worker process to its existing model to
improve client balancing across worker processes.  However, this
change breaks the nginx SIGUSR2 upgrade/backout, because old workers
depend on O_NONBLOCK being set on the listener while the new ones do
not want it.  I half-heartedly proposed SOCK_DONTWAIT (and maybe
SOCK_MUSTWAIT) for accept4 to get around this, but never cared
enough to push for it: <20150513023712.GA4206@dcvr.yhbt.net>

Fwiw, in old-fashioned servers without epoll/kqueue multiplexing
(such as Apache with mpm_prefork), the imbalance from non-blocking
accept is beneficial: it keeps active workers hot.  I'd never expose
one of those servers to the Internet without something like nginx
protecting it from Slowloris, though.

For the same reason, I prefer the LIFO wakeup behavior of multiple
epoll_wait callers on the same epfd with cmogstored.  The actual
order of events is FIFO, of course.

> 1. Of course comparing blocking accept() with a full featured
>    epoll() event loop is not fair.  Epoll is more powerful and
>    allows us to create rich event driven programs.  Using
>    blocking accept is rather cumbersome or just not useful at
>    all.  To make any sense, blocking accept programs would
>    require careful multi-threading programming, with a
>    dedicated thread per request.

Of course I disagree :)

> 2. Another surprise lurking in the corner - using blocking
>    accept() on Linux is technically incorrect!  Alan Burlison
>    pointed out that calling close() on listen socket that
>    has blocking accepts() will not interrupt them.  This can
>    result in a buggy behavior - you may get a successful
>    accept() on a listen socket that no longer exists.  When in
>    doubt - avoid using blocking accept() in multithreaded
>    programs.  The workaround is to call shutdown() first, but
>    this is not POSIX compliant.  It's a mess.

Right, I use pthread_kill to trigger EINTR in accept4 and check a
`running' flag in the loop as above (sketched just below).  This
won't cause other processes to lose connections.  pthread
cancellation would cause lost connections with accept4 (and lost
events in epoll_wait, as confirmed by glibc folks in libc-help (*)),
so pthread_kill seems the best option where available.
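Only as a sketch under my own assumptions (the choice of SIGUSR2 and
the bookkeeping here are mine, not necessarily what cmogstored
does), the pthread_kill path can look like this:

#define _GNU_SOURCE /* for accept4 */
#include <sys/socket.h>
#include <pthread.h>
#include <signal.h>
#include <string.h>
#include <errno.h>

static volatile sig_atomic_t running = 1;
static pthread_t acceptor; /* set when the acceptor thread is spawned */

static void wake(int sig)
{
	(void)sig; /* exists only so accept4 fails with EINTR */
}

static void install_wakeup_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = wake;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = 0; /* no SA_RESTART: accept4 must NOT auto-restart */
	sigaction(SIGUSR2, &sa, NULL);
}

void *thread_acceptor(void *lfdp)
{
	int lfd = *(int *)lfdp;

	while (running) {
		int cfd = accept4(lfd, NULL, NULL, SOCK_NONBLOCK);

		if (cfd < 0) {
			if (errno == EINTR)
				continue; /* rechecks `running' */
			continue; /* a real server handles errors here */
		}
		/* ... client_new + EPOLL_CTL_ADD with EPOLLONESHOT ... */
	}
	return NULL;
}

/* called from any other thread to stop the acceptor */
static void stop_acceptor(void)
{
	running = 0;
	pthread_kill(acceptor, SIGUSR2); /* EINTRs a blocked accept4 */
	pthread_join(acceptor, NULL);
}

A real implementation also has to consider the race where `running'
is checked just before the thread blocks in accept4; periodically
re-sending the signal until the join completes is one way to close
that window.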
(*) For higher-level languages and some VMs (I think it was
Rubinius) without pthread_kill, I create fake clients which connect
to the listener and eventually kick the acceptor out of its blocking
accept4.  For epoll_wait, I create a pipe/eventfd object,
EPOLL_CTL_ADD it with EPOLLOUT (no EPOLLONESHOT), and make the
client_read_write function exit the current thread when it sees the
pipe/eventfd.  Lacking EPOLLONESHOT, that event will just bounce
around the threads until all the epoll_wait-ing threads have exited
(a sketch of this follows at the end of this mail).  I still find it
fun to imagine this object bouncing around threads to stop them :)

Anyways, thanks for bringing this up.
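As promised, here's a rough sketch of that thread-stopping object
for the epoll_wait side.  The naming (shutdown_efd,
is_shutdown_event) is mine, and cmogstored's actual implementation
may differ:

#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <stddef.h>

static int epfd;		/* per-process, as above */
static int shutdown_efd = -1;

/* run once to begin stopping all epoll_wait-ing threads */
static void broadcast_shutdown(void)
{
	struct epoll_event ev;

	shutdown_efd = eventfd(0, 0);
	ev.events = EPOLLOUT;	/* note: no EPOLLONESHOT */
	ev.data.ptr = &shutdown_efd; /* sentinel pointer, not a client */
	epoll_ctl(epfd, EPOLL_CTL_ADD, shutdown_efd, &ev);
}

/* check before treating data.ptr as a struct client */
static int is_shutdown_event(void *ptr)
{
	return ptr == &shutdown_efd;
}

/* worker loop fragment */
void *thread_worker(void *unused)
{
	struct epoll_event events[1];

	(void)unused;
	for (;;) {
		int n = epoll_wait(epfd, events, 1, -1);

		if (n > 0 && is_shutdown_event(events[0].data.ptr))
			return NULL; /* exit; event stays armed for others */
		/* ... the usual client_read_write handling ... */
	}
}

An unwritten eventfd is always writable, so the level-triggered
EPOLLOUT (with no EPOLLONESHOT to disarm it) keeps firing and wakes
each epoll_wait-ing thread in turn until every one has exited.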