It is helpful to know the address of the listener on the server
which accepted the client socket. Additionally, the PID,FD combination
should be safely unique at any point in time.
|
|
I needed to spend time to convince myself this was safe, so
leave a note to others (and future self) in case there is
cause for concern.
Basically, this is highly dependent on our overall one-shot-based
concurrency model and safe as long as basic rules are followed.
|
|
Willy Tarreau cherry-picked the relevant fix into the 2.6.32 longterm
stable tree
ref:
commit 1c137a47bbdd6e86298627e04f547afd7f35d523
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
|
|
This function is no longer used as we now attempt to reattach
rbufs to the TLS space of each thread.
|
|
It's unlikely we'll even come close to seeing 2-4 billion devices in a
MogileFS instance for a while. Meanwhile, it's also unlikely the
kernel will ever run that many threads, either. So make it easier
to pack and shrink data structures to save a few bytes and perhaps
get better memory alignment.
For reference, the POSIX semaphore API specifies initial values
with unsigned (int) values, too (a small sketch follows the
bloat-o-meter output below).
This leads to a minor size reduction (and we're not even packing):
$ ~/linux/scripts/bloat-o-meter cmogstored.before cmogstored
add/remove: 0/0 grow/shrink: 0/13 up/down: 0/-86 (-86)
function old new delta
mog_svc_dev_quit_prepare 13 12 -1
mog_mgmt_fn_aio_threads 147 146 -1
mog_dev_user_rescale_i 27 26 -1
mog_ioq_requeue_prepare 52 50 -2
mog_ioq_init 80 78 -2
mog_thrpool_start 101 96 -5
mog_svc_dev_user_rescale 143 137 -6
mog_svc_start_each 264 256 -8
mog_svc_aio_threads_handler 257 249 -8
mog_ioq_ready 263 255 -8
mog_ioq_next 303 295 -8
mog_svc_thrpool_rescale 206 197 -9
mog_thrpool_set_size 1028 1001 -27
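For context, a small sketch of the kind of narrowing this refers to;
the struct and field names are made up, but the POSIX prototype is the
one mentioned above:

    #include <semaphore.h>

    /* POSIX takes the initial semaphore count as an unsigned int:
     *   int sem_init(sem_t *sem, int pshared, unsigned int value);
     * so narrowing our own counters to unsigned loses nothing. */
    struct thrpool_sketch {
        unsigned n_threads;     /* was a wider type; 32 bits is plenty */
        /* ... */
    };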
|
|
For the most part, cmogstored 1.2.2 works well, but 1.3 contains some
fairly major changes and improvements.
cmogstored CPU usage may be higher than that of other servers because it's
designed to use whatever resources it has at its disposal to distribute
load to different storage devices. cmogstored 1.3 will continue this,
but it should be safer to lower thread counts without hurting
performance too much for non-dedicated servers.
Unfortunately, the minor, Linux-only bug affecting 1.2.2 for (uncommon)
thread shutdowns required some fairly intrusive changes to fix, so I'm
not sure if releasing a 1.2.3 is worth it. If you're happy with 1.2.x,
I recommend marking the host down via mogadm before lowering
"SERVER aio_threads = XX" or sending SIGQUIT to cmogstored. But
I think thread shutdown is uncommon enough to not affect normal
deployments.
cmogstored 1.3 will contain improvements for storage hosts at the
extreme ends of the performance scale. For large machines with many
cores, memory/thread usage is reduced because we had too many acceptor
threads. There are more improvements for smaller machines, especially
those with slow/imbalanced drive speeds and few CPUs. Some of the
improvements came from my testing with ancient single-core machines,
others came from testing on 24-core machines :)
The SystemTap tracing work is still in-progress (although the 1.3 cycle
was originally intended to focus on this :x). I expect the remaining
changes to be non-intrusive and will work on them through the RC cycle.
Major features in 1.3:
ioq - I/O queues for all MogileFS requests
------------------------------------------
The new I/O queue (ioq) implements the equivalent of AIO channels
functionality from Perlbal/mogstored. This feature prevents a
failing/overloaded disk from monopolizing all the threads in the system.
Since cmogstored uses threads directly (and not AIO), the common
(uncontended) case behaves like a successful sem_wait with POSIX
semaphores. Queueing+rescheduling only occurs in the contended case
(unlike with AIO-style APIs, where requests are always queued). I
experimented with, but did not use, POSIX semaphores, as contention
would still starve the thread pool.
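To illustrate, a minimal sketch of that acquire path; the names
(ioq_acquire, the struct fields) are hypothetical and not the actual
cmogstored internals:

    #include <stdbool.h>
    #include <pthread.h>

    struct client {
        struct client *ioq_next;        /* next waiter on the same devid */
        /* ... */
    };

    struct ioq {
        pthread_mutex_t lock;
        unsigned cur;                   /* requests currently in flight */
        unsigned max;                   /* capacity, scaled to aio_threads */
        struct client *head, *tail;     /* FIFO of waiting clients */
    };

    /* returns true if the caller may proceed now (uncontended case) */
    static bool ioq_acquire(struct ioq *q, struct client *c)
    {
        bool ready;

        pthread_mutex_lock(&q->lock);
        ready = q->cur < q->max;
        if (ready) {
            q->cur++;                   /* like a successful sem_wait */
        } else {                        /* contended: queue for a later wakeup */
            c->ioq_next = NULL;
            if (q->tail)
                q->tail->ioq_next = c;
            else
                q->head = c;
            q->tail = c;
        }
        pthread_mutex_unlock(&q->lock);
        return ready;
    }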
Unlike the old fsck_queue, ioq is based on the MogileFS devid in the URL
and not the st_dev ID of the actual underlying file. This is less
correct from a systems perspective, but should make no difference for
normal production deployments (which are expected to use one MogileFS
devid for each st_dev ID) and has several advantages:
1) testing this feature with mock deploys is easier
2) we do not require any additional filesystem syscall (open/*stat)
to look up the ioq based on st_dev, so we can use ioq to avoid
stalls from slow open/openat/stat/fstatat/unlink/unlinkat syscalls.
Otherwise, the implementation of this very closely resembles the old
fsck queue implementation, but is generic across HTTP and sidechannel
clients. The existing fsck queue functionality is now implemented using
ioq. Thus, fsck queue functionality is mapped by the MogileFS devid and
not the system st_dev ID as a result of this change.
One benefit of this feature is the ability to run fewer aio_threads
safely without worrying about cross-device contention on machines with
limited resources or few disks (or not solely dedicated to MogileFS
storage).
The capacity of these I/O queues is automatically scaled to the number
of available aio_threads, so they can change dynamically while your
admin is tuning "SERVER aio_threads = XX".
However, on a dedicated storage node, running many aio_threads (as is
the default) should still be beneficial. Having more threads can keep
the internal I/O queues of the kernel and storage hardware more
populated and can improve throughput.
thread shutdown fixes (epoll)
-----------------------------
Our previous reliance on pthreads cancellation primitives left us open
to a small race condition where I/O events (from epoll) could be lost
during graceful shutdown or thread reduction via
"SERVER aio_threads = XX". We no longer rely on pthreads cancellation
for stopping threads and instead implement explicit check points for
epoll.
This did not affect kqueue users, but the code is simpler and more
consistent across epoll/kqueue implementations.
Graceful shutdown improvements
------------------------------
The addition of our I/O queueing and use of our custom thread shutdown
API also allowed us to improve the responsiveness and fairness when the
process enters graceful shutdown mode. This avoids client-side timeouts
when large PUT requests are being issued over a fast network to slow
disks during graceful shutdown.
Currently, graceful shutdown remains single-threaded, but it will likely
become multi-threaded in the future (like normal runtime).
Miscellaneous fixes and improvements
------------------------------------
Further improved matching for (Linux) device-mapper setups where the
same device (not symlinks) appears multiple times in /dev
aio_threads count is automatically updated when new devices are
added/removed. This is currently synced to MOG_DISK_USAGE_INTERVAL, but
will use inotify (or the kqueue equivalent) in the future.
HTTP read buffers grow monotonically (up to 64K) and always use aligned
memory. This allows deployments which pass large HTTP headers to avoid
unnecessary reallocations. Deployments which use small HTTP headers
should notice no memory increase.
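Roughly, the growth policy looks like the following sketch (hypothetical
names, not the actual rbuf code); note posix_memalign reports failure via
its return value rather than errno:

    #include <stdlib.h>
    #include <string.h>

    #define RBUF_MAX (64 * 1024)

    /* grow (never shrink) an aligned read buffer; returns NULL on failure */
    static void *rbuf_grow(void *old, size_t *cap, size_t want, size_t align)
    {
        void *ptr;
        size_t new_cap = *cap;

        if (want > RBUF_MAX)
            return NULL;        /* oversized headers get rejected */
        if (want <= new_cap)
            return old;         /* monotonic: never reallocate smaller */

        if (new_cap == 0)
            new_cap = 4096;     /* assumed initial size */
        while (new_cap < want)
            new_cap *= 2;
        if (new_cap > RBUF_MAX)
            new_cap = RBUF_MAX;

        if (posix_memalign(&ptr, align, new_cap))
            return NULL;        /* does not set errno */
        if (old) {
            memcpy(ptr, old, *cap);
            free(old);
        }
        *cap = new_cap;
        return ptr;
    }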
Acceptor threads are now limited to two per process instead of being
scaled to CPU count. This avoids excessive threads/memory usage and
contention of kernel-level mutexes for large multi-core machines.
The gnulib version used for building the tarball is now included in the
tarball for ease-of-reproducibility.
Additional tests for uncommon error conditions using the fault-injection
capabilities of GNU ld.
The "shutdown" command over the sidechannel is more responsive for epoll
users.
Improved reporting of failed requests during PUT requests. Again, I run
MogileFS instances on some of the most horrible networks on the planet[2].
Fix LIB_CLOCK_GETTIME linkage on some toolchains.
"SERVER mogstored.persist_client = (0|1)" over the sidechannel is supported
for compatibility with Perlbal/mogstored
|
|
Only relying on dtrace leads to build problems on FreeBSD which
I haven't had a chance to fix.
|
|
This should avoid concurrency bugs where a client may run in
multiple threads if we switch to multi-threaded graceful shutdown.
|
|
This test is too slow and timing-sensitive under valgrind, so
disable it for now until we have a better solution.
|
|
We could be completely out of threads upon acquiring an ioq, so the
last thread to acquire a lock slot must trigger a yield soon to
avoid starvation and fairness issues. Otherwise, all threads
for a given device could remain pinned indefinitely.
|
|
Tests need to clean up by stopping running processes.
|
|
This allows us to capture/trace the listen address which
accepted the request without consuming additional stack space.
|
|
This will allow us to properly report the listen address the client
connected to.
|
|
This makes it easier to write tapsets which key objects
by PID,FD for uniqueness. This also avoids some mog_fd_of()
calls.
|
|
This avoids noise in config.log
|
|
The update prefix is bounded in size, so this will save us NR_DEVICES
malloc/free pairs each second from typical iostat output.
|
|
No need to recreate mog_mgmt_fn_blank for sending blank responses.
|
|
test_head_response_time does not test anything which would
not be otherwise tested by other tests under valgrind.
This test is only needed for occasional validation of
fuckups regarding TCP_NOPUSH on FreeBSD, and not necessary
for general use.
|
|
We don't want to drop in-flight pipelined requests when disabling
persistent connections. Disabling persistent connections will
always be potentially racy, but hopefully this makes the race
small enough that lower-level latencies are the only thing
which affects it.
|
|
While we always properly disconnected clients during shutdown, we
explicitly set "Connection: close" now to inform clients of our
pending shutdown. This avoids potentially confusing clients when we
disconnect them as there may still be a race condition where we shut
down a client while their request packets are in-flight.
|
|
This is Perlbal functionality which works in Perl mogstored,
so we will also support it here, as it makes upgrading to new
versions easier.
|
|
By reducing the capacity of each ioq, we force each running worker
thread to yield the current client and hit an exit point
(epoll_wait/kqueue) sooner.
|
|
Without this, test_iostat_watch fails sometimes under valgrind.
|
|
pwrite can be a slow, blocking function on an overloaded
system, so simulating a slow pwrite requires a wrapper.
This allows us to have coverage of the:
if (mog_ioq_contended())
return MOG_NEXT_WAIT_RD;
cases in http_put.c
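The "wrapper" here refers to the GNU ld --wrap fault-injection mechanism
mentioned in the 1.3 notes; a simulated slow pwrite for such a test might
look like this sketch (the delay and names are illustrative, not the
actual test code):

    /* link the test binary with -Wl,--wrap=pwrite so calls land here */
    #include <unistd.h>
    #include <time.h>

    ssize_t __real_pwrite(int fd, const void *buf, size_t count, off_t offset);

    ssize_t __wrap_pwrite(int fd, const void *buf, size_t count, off_t offset)
    {
        struct timespec delay = { 0, 100000000 };   /* 100ms */

        nanosleep(&delay, NULL);    /* simulate an overloaded disk */
        return __real_pwrite(fd, buf, count, offset);
    }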
|
|
When users reduce or increase thread counts, the ioq capacity should
change to match; otherwise there is little point in running more or
fewer threads, since they remain limited by the old ioq capacity.
|
|
We want to yield dying threads as soon as possible during
thread shutdown, so we check the quit flag and yield the
running thread to trigger a MOG_NEXT_ACTIVE.
|
|
This will allow us to detect I/O contention on our queue
and yield the current thread to other clients for fairness.
This can prevent a client from hogging the thread in situations
where the network is much faster than the filesystem/disk.
|
|
This is somewhat strange, but makes the code base slightly easier
to reuse for non-HTTP purposes.
|
|
This should hopefully make failures easier to track down.
|
|
This only triggered if the (undocumented) --worker-processes
option is used. This assertion is no longer valid as of
commit d5a52618ca1f9b5d7f6998716fbfe7714f927112
(refactor handling of "server aio_threads = " command)
|
|
This allows us to avoid a redundant hash lookup every time we
"activate" an open file for reading or writing.
|
|
This will allow us to limit concurrency on a per-device basis with
limited impact on HTTP header reading/parsing. This prevents
pathological slowness on a single device from bringing down an entire
host. This also allows users to more safely run with fewer aio_threads
(e.g. 1:1 thread:device mapping) on fast devices with smaller low-level
(kernel/hardware) I/O queues.
|
|
"struct sockaddr" turns out to be smaller than "struct sockaddr_in6",
so we can avoid complicated casting and just add that to the union.
We continue avoiding "struct sockaddr_storage", however, as it is
unnecessarily large for our needs.
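Illustratively (not the actual cmogstored struct), the union ends up
shaped like:

    #include <sys/socket.h>
    #include <netinet/in.h>

    union mog_sockaddr_sketch {
        struct sockaddr sa;         /* generic view for bind/accept APIs */
        struct sockaddr_in in;      /* IPv4 */
        struct sockaddr_in6 in6;    /* IPv6, the largest member we need */
    };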
|
|
This was triggering warnings with Ruby 2.0.0-p195
|
|
Reattaching/reusing read buffers allows us to avoid repeated
reallocation/growth/free when clients repeatedly send us large headers.
This may also increase cache-hits by favoring recently-used buffers as
long as fragmentation is kept in check. The fragmentation should be
no worse than it is currently, due to the existing detachable nature of
rbufs.
|
|
We'll be allowing the migration of buffers between threads
and from waiting clients back to thread-local storage.
|
|
Some setups use clients which pass large headers (User-Agent, or
even cookies(!)) to cmogstored, so large rbufs may be used often
and repeatedly in those cases.
We limit rbuf sizes to 64K anyway, so keeping "larger" buffers
around should not be much of an issue for modern systems.
This prepares us for reusing/recycling large rbufs as TLS buffers.
|
|
This will allow us to use control flow similar to the http client
handling code when we queue clients based on I/O channel.
|
|
This replaces the fsck_queue internals with a generic
ioq implementation which is based on the MogileFS devid,
and not the operating system devid.
|
|
We need to ensure we do not introduce code to launch
http_process_client while we have buffered data (or socket write
errors).
|
|
We will have structures inside the dev struct accessed by multiple
threads frequently, so keep it cache-aligned.
To reduce memory usage for large-numbered devices, avoid storing the
prefix on output and instead just rely on the printf-family of
routines to generate stringified output in uncommon code paths.
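A rough sketch of the idea (the names and the 64-byte line size are
assumptions, not the actual struct mog_dev):

    #include <stdio.h>

    #define CACHE_ALIGNED __attribute__((aligned(64)))  /* assumed line size */

    struct dev_sketch {
        unsigned devid;     /* MogileFS devid */
        /* ioq and other fields shared by multiple threads ... */
    } CACHE_ALIGNED;

    /* uncommon path: build "devNNN" on demand instead of caching the string */
    static int dev_prefix(const struct dev_sketch *dev, char *dst, size_t len)
    {
        return snprintf(dst, len, "dev%u", dev->devid);
    }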
|
|
Detachers MUST set rsize properly. This API is unfortunately fragile
and will eventually be fixed to be more difficult to misuse.
|
|
According to the m4/clock_gettime.m4 documentation (from gnulib),
the LIB_CLOCK_GETTIME variable should be added to a *LDADD variable
and not AM_LDFLAGS. This is also consistent with GNU automake
documentation.
Thanks to Cody Pisto for reporting this problem under Ubuntu 12.04
ref: http://www.gnu.org/software/automake/manual/html_node/Linking.html
|
|
* 1.2-stable:
cmogstored 1.2.2 - minor maintenance release
INSTALL: update versions and URLs
INSTALL: clarify between starting from tarball vs git
test/cmogstored-cfg: ensure TMPDIR is absolute for valgrind
iostat_parser: allow '-' for device names
alloc: posix_memalign does not set errno
|
|
For difficult-to-trigger errors, fault injection is necessary for
testing our error handling. I have confirmed this test fails with
"avoid leaks on epoll/kqueue resources exhaustion" reverted.
|
|
Simply releasing the descriptor triggering ENOSPC/ENOMEM errors from
epoll_ctl and kevent is not good enough, as those descriptors may
have other descriptors (e.g. files to be served) hanging off of them.
|
|
While pthread_yield is non-standard, it is relatively common and
preferable for systems where pthreads are _not_ 1:1 mapped to kernel
threads. This also provides a stronger yield to weaken the priority
of the calling thread wherever we previously used sched_yield.
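A sketch of the wrapper this implies; HAVE_PTHREAD_YIELD stands in for
whatever the configure check actually defines:

    #define _GNU_SOURCE     /* pthread_yield needs this on glibc */
    #include <pthread.h>
    #include <sched.h>

    static void yield_thread(void)
    {
    #ifdef HAVE_PTHREAD_YIELD
        pthread_yield();    /* non-standard: yields the pthread itself */
    #else
        sched_yield();      /* POSIX fallback: yields the kernel thread */
    #endif
    }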
|
|
This should allow the threads we're terminating to more quickly
enter a safe state where they're allowed to exit. On SMP systems,
we need to yield the signalling thread more times to increase the
probability the interrupted thread can run (and exit).
|
|
Our tests over-link (to save developer time :P), so we must
link in probes with our tests. Also, we must keep probes.h
around for distclean (but not maintainerclean)
|
|
We cannot assume sa_family_t is the first element of "struct
sockaddr_in" or "struct sockaddr_in6". FreeBSD has a "sa_len"
member as the first element while Linux does not.
So only keep the parts of the "struct sockaddr*" we need and use
inet_ntop instead of getnameinfo. This also gives us a little more
space to add additional fields to "struct mog_http" in the future
without increasing memory (or CPU cache) use.
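For illustration (the struct and function names are hypothetical),
keeping only the family, port, and raw address bytes and formatting with
inet_ntop might look like:

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <stdio.h>

    struct mog_packed_addr_sketch {
        sa_family_t family;     /* AF_INET or AF_INET6 */
        in_port_t port;         /* network byte order */
        union {
            struct in_addr in4;
            struct in6_addr in6;
        } as;
    };

    static void addr_to_str(const struct mog_packed_addr_sketch *a,
                            char *dst, socklen_t len)
    {
        const void *src = (a->family == AF_INET) ?
                          (const void *)&a->as.in4 : (const void *)&a->as.in6;

        /* inet_ntop is lighter than getnameinfo for numeric output */
        if (!inet_ntop(a->family, src, dst, len))
            snprintf(dst, len, "(bad address)");
    }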
|