From a4126a4bef3708c6f3b63f8a8877a3ce2213470b Mon Sep 17 00:00:00 2001
From: Eric Wong
Date: Mon, 30 Sep 2013 08:44:15 +0000
Subject: cmogstored 1.3.0 - many improvements

There are no changes from 1.3.0rc2.

For the most part, cmogstored 1.2.2 works well, but 1.3 contains some
fairly major changes and improvements.

cmogstored CPU usage may be higher than that of other servers because
it is designed to use whatever resources it has at its disposal to
distribute load to different storage devices.  cmogstored 1.3
continues this, but it should now be safer to lower thread counts
without hurting performance too much on non-dedicated servers.

cmogstored 1.3 contains improvements for storage hosts at both extreme
ends of the performance scale.  For large machines with many cores,
memory and thread usage is reduced because we had too many acceptor
threads.  There are more improvements for smaller machines, especially
those with slow or imbalanced drive speeds and few CPUs.  Some of the
improvements came from my testing with ancient single-core machines,
others came from testing on 24-core machines :)

Major features in 1.3:

ioq - I/O queues for all MogileFS requests
------------------------------------------

The new I/O queue (ioq) implements the equivalent of the AIO channels
functionality from Perlbal/mogstored.  This feature prevents a
failing/overloaded disk from monopolizing all the threads in the
system.

Since cmogstored uses threads directly (and not AIO), the common
(uncontended) case behaves like a successful sem_wait with POSIX
semaphores.  Queueing+rescheduling only occurs in the contended case
(unlike with AIO-style APIs, where requests are always queued).  I
experimented with, but did not use, POSIX semaphores, as contention
would still starve the thread pool.

Unlike the old fsck_queue, ioq is keyed on the MogileFS devid in the
URL and not the st_dev ID of the actual underlying file.  This is less
correct from a systems perspective, but should make no difference for
normal production deployments (which are expected to use one MogileFS
devid for each st_dev ID) and has several advantages:

1) testing this feature with mock deploys is easier

2) we do not require any additional filesystem syscall (open/*stat)
   to look up the ioq based on st_dev, so we can use ioq to avoid
   stalls from slow open/openat/stat/fstatat/unlink/unlinkat syscalls.

Otherwise, the implementation very closely resembles the old fsck
queue implementation, but is generic across HTTP and sidechannel
clients.  The existing fsck queue functionality is now implemented
using ioq; as a result, fsck queueing is keyed by the MogileFS devid
and not the system st_dev ID.

One benefit of this feature is the ability to safely run fewer
aio_threads without worrying about cross-device contention on machines
with limited resources or few disks (or machines not solely dedicated
to MogileFS storage).  The capacity of these I/O queues is
automatically scaled to the number of available aio_threads, so
capacity changes dynamically while your admin tunes
"SERVER aio_threads = XX".

However, on a dedicated storage node, running many aio_threads (as is
the default) should still be beneficial.  Having more threads can keep
the internal I/O queues of the kernel and storage hardware more
populated and can improve throughput.
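For illustration only, here is a minimal sketch of how such a queue
can behave like an uncontended semaphore on the fast path while
FIFO-queueing clients under contention.  The names here (struct ioq,
ioq_acquire, client_resume) are simplifications for this note, not the
actual cmogstored internals:

    /* sketch only: single-lock FIFO with a sem_trywait-like fast path */
    #include <pthread.h>
    #include <stdbool.h>
    #include <stddef.h>

    struct client {
        struct client *ioq_next; /* intrusive FIFO link */
        /* ... rest of the per-connection state ... */
    };

    struct ioq {
        pthread_mutex_t mtx;
        unsigned capacity;  /* scaled to "SERVER aio_threads = XX" */
        unsigned in_flight; /* requests currently using the device */
        struct client *head, *tail; /* clients waiting for a slot */
    };

    /* hypothetical hook: hand a queued client back to the event loop */
    extern void client_resume(struct client *);

    /* returns true if the caller may perform disk I/O right away */
    bool ioq_acquire(struct ioq *q, struct client *c)
    {
        bool ready;

        pthread_mutex_lock(&q->mtx);
        ready = q->in_flight < q->capacity;
        if (ready) {
            q->in_flight++; /* fast path: like a successful sem_wait */
        } else {            /* contended: queue for rescheduling */
            c->ioq_next = NULL;
            if (q->tail)
                q->tail->ioq_next = c;
            else
                q->head = c;
            q->tail = c;
        }
        pthread_mutex_unlock(&q->mtx);
        return ready;
    }

    /* called when a request is done with the device */
    void ioq_release(struct ioq *q)
    {
        struct client *next = NULL;

        pthread_mutex_lock(&q->mtx);
        if (q->head) { /* wake one waiter; the slot stays occupied */
            next = q->head;
            q->head = next->ioq_next;
            if (!q->head)
                q->tail = NULL;
        } else {
            q->in_flight--; /* uncontended: just free the slot */
        }
        pthread_mutex_unlock(&q->mtx);

        if (next)
            client_resume(next); /* reschedule outside the lock */
    }

A deferred client is handed back to the event loop rather than tying
up a thread, which is what keeps a slow disk from pinning down the
whole aio_threads pool.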
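Similarly, mapping a request to its ioq needs nothing but the devid
already present in the request URL.  A rough sketch, assuming the
usual "/dev<NNN>/..." MogileFS path layout (the function name is
hypothetical):

    #include <stdlib.h>
    #include <string.h>

    /*
     * sketch only: extract the MogileFS devid from a request path
     * such as "/dev42/0/000/000/0000000123.fid"; no open()/stat() is
     * needed, so a hung disk cannot stall the lookup itself.
     */
    long devid_from_path(const char *path)
    {
        char *end;
        long devid;

        if (strncmp(path, "/dev", 4) != 0)
            return -1;
        devid = strtol(path + 4, &end, 10);
        if (end == path + 4 || *end != '/')
            return -1; /* not a devid-prefixed path */
        return devid;
    }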
thread shutdown fixes (epoll)
-----------------------------

Our previous reliance on pthreads cancellation primitives left us open
to a small race condition where I/O events (from epoll) could be lost
during graceful shutdown or thread reduction via
"SERVER aio_threads = XX".  We no longer rely on pthreads cancellation
for stopping threads and instead implement explicit checkpoints for
epoll.  This did not affect kqueue users, but the code is now simpler
and more consistent across the epoll and kqueue implementations.

Graceful shutdown improvements
------------------------------

The addition of our I/O queueing and the use of our custom thread
shutdown API also allowed us to improve responsiveness and fairness
when the process enters graceful shutdown mode.  This avoids
client-side timeouts when large PUT requests are issued over a fast
network to slow disks during graceful shutdown.

Currently, graceful shutdown remains single-threaded, but it will
likely become multi-threaded in the future (as during normal runtime).

Miscellaneous fixes and improvements
------------------------------------

Further improved matching for (Linux) device-mapper setups where the
same device (not symlinks) appears multiple times in /dev.

The aio_threads count is automatically updated when new devices are
added or removed.  This is currently synced to
MOG_DISK_USAGE_INTERVAL, but will use inotify (or the kqueue
equivalent) in the future.

HTTP read buffers grow monotonically (up to 64K) and always use
aligned memory.  This allows deployments which pass large HTTP headers
to avoid unnecessary reallocations.  Deployments which use small HTTP
headers should notice no memory increase.

Acceptor threads are now limited to two per process instead of being
scaled to the CPU count.  This avoids excessive thread/memory usage
and contention on kernel-level mutexes on large multi-core machines.

The gnulib version used for building the tarball is now included in
the tarball for ease of reproducibility.

Additional tests cover uncommon error conditions using the
fault-injection capabilities of GNU ld.

The "shutdown" command over the sidechannel is more responsive for
epoll users.

Improved reporting of failed requests during PUT requests.  Again, I
run MogileFS instances on some of the most horrible networks on the
planet[2].

Fixed LIB_CLOCK_GETTIME linkage on some toolchains.

"SERVER mogstored.persist_client = (0|1)" over the sidechannel is
supported for compatibility with Perlbal/mogstored.

The Status: header is no longer returned on HTTP responses.  All known
MogileFS clients parse the HTTP status line correctly without needing
the Status: header.  Neither Perlbal nor nginx sets the Status: header
on responses, so this is unlikely to introduce incompatibilities.  The
Status: header was originally inherited from HTTP servers which had to
deal with a much larger range of (non-compliant) clients.
---
 README | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/README b/README
index e6f79d2..546199e 100644
--- a/README
+++ b/README
@@ -56,10 +56,7 @@ Source tarballs suitable for distribution are housed here:
 
 * http://bogomips.org/cmogstored/files/
 
 The latest stable release is:
-  http://bogomips.org/cmogstored/files/cmogstored-1.2.2.tar.gz
-
-The latest release candidate is:
-  http://bogomips.org/cmogstored/files/pre/cmogstored-1.3.0rc2.tar.gz
+  http://bogomips.org/cmogstored/files/cmogstored-1.3.0.tar.gz
 
 See http://bogomips.org/cmogstored/NEWS for release notes
--
cgit v1.2.3-24-ge0c7