glibc malloc creates arenas aggressively to avoid malloc contention.
This is good for CPU-bound multithreaded programs which are
malloc-dependent. However cmogstored uses multiple threads for
concurrent disk/FS activity and avoids malloc in hot/common paths.
Thus malloc should _never_ be a bottleneck for cmogstored. Although
physical memory allocation is lazy on Linux kernels, the metadata
overhead of the virtually allocated pages can still add up on a
system with many disks/devices.
I've observed 6-7G VmSize on cmogstored processes with only ~5M VmRSS
on machines with many cores/devices and a few hundred clients.
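A minimal sketch of one way to cap arena creation on glibc (M_ARENA_MAX
is a real glibc mallopt knob, but whether cmogstored uses mallopt or the
MALLOC_ARENA_MAX environment variable is an assumption here):

    #include <malloc.h> /* glibc-specific: mallopt, M_ARENA_MAX */

    /* hypothetical helper, called once at startup */
    static void limit_malloc_arenas(void)
    {
    #ifdef M_ARENA_MAX
            /* one arena is enough when hot paths avoid malloc */
            mallopt(M_ARENA_MAX, 1);
    #endif
    }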
|
|
This release fixes a bug which only affects users of the
undocumented multi-process configuration feature
(which is also multi-threaded).
* avoid use-after-free with multi-process setups
readdir on the same DIR pointer is undefined if DIR was inherited by
multiple children. Using the reentrant readdir_r would not have
helped, since the underlying file descriptor and kernel file handle
were still shared (and we need rewinddir, too).
This readdir usage bug existed in cmogstored since the earliest
releases, but was harmless until the cmogstored 1.3 series.
This misuse of readdir led to hitting a leftover call to free().
So this bug only manifested since
commit 1fab1e7a7f03f3bc0abb1b5181117f2d4605ce3b
(svc: implement top-level by_mog_devid hash)
Fortunately, this bug only affects users of the undocumented
multi-process feature; purely multi-threaded (single-process)
deployments are unaffected.
|
|
readdir on the same DIR pointer is undefined if DIR was inherited by
multiple children. Using the reentrant readdir_r would not have
helped, since the underlying file descriptor and kernel file handle
were still shared (and we need rewinddir, too).
This readdir usage bug existed in cmogstored since the earliest
releases, but was harmless until the cmogstored 1.3 series.
This misuse of readdir led to hitting a leftover call to free().
So this bug only manifested since
commit 1fab1e7a7f03f3bc0abb1b5181117f2d4605ce3b
(svc: implement top-level by_mog_devid hash)
Fortunately, this bug only affects users of the undocumented
multi-process feature; purely multi-threaded (single-process)
deployments are unaffected.
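A minimal sketch of the safe pattern, assuming each child opens its own
DIR after fork instead of inheriting the parent's (the function and
path names here are hypothetical):

    #include <dirent.h>
    #include <stdio.h>

    /* each process gets a private DIR, fd, and kernel file handle,
     * so readdir/rewinddir never race with sibling processes */
    static void scan_docroot(const char *path)
    {
            DIR *dir = opendir(path);
            struct dirent *ent;

            if (!dir)
                    return;
            while ((ent = readdir(dir)) != NULL)
                    printf("%s\n", ent->d_name); /* stand-in for real work */
            rewinddir(dir); /* safe: nothing else shares this handle */
            closedir(dir);
    }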
|
|
There are no changes from 1.3.0rc2.
For the most part, cmogstored 1.2.2 works well, but 1.3 contains some
fairly major changes and improvements.
cmogstored CPU usage may be higher than other servers because it's
designed to use whatever resources it has at its disposal to
distribute load to different storage devices. cmogstored 1.3
continues this, but it should be safer to lower thread counts
without hurting performance too much for non-dedicated servers.
cmogstored 1.3 contains improvements for storage hosts at the
extreme ends of the performance scale. For large machines with many
cores, memory/thread usage is reduced because we had too many acceptor
threads. There are more improvements for smaller machines, especially
those with slow/imbalanced drive speeds and few CPUs. Some of the
improvements came from my testing with ancient single-core machines,
others came from testing on 24-core machines :)
Major features in 1.3:
ioq - I/O queues for all MogileFS requests
------------------------------------------
The new I/O queue (ioq) implements the equivalent of AIO channels
functionality from Perlbal/mogstored. This feature prevents a
failing/overloaded disk from monopolizing all the threads in the system.
Since cmogstored uses threads directly (and not AIO), the common
(uncontended) case behaves like a successful sem_wait with POSIX
semaphores. Queueing+rescheduling only occurs in the contended case
(unlike with AIO-style APIs, where requests are always queued). I
experimented with POSIX semaphores but did not use them, as contention
would still starve the thread pool.
Unlike the old fsck_queue, ioq is based on the MogileFS devid in the URL
and not the st_dev ID of the actual underlying file. This is less
correct from a systems perspective, but should make no difference for
normal production deployments (which are expected to use one MogileFS
devid for each st_dev ID) and has several advantages:
1) testing this feature with mock deploys is easier
2) we do not require any additional filesystem syscall (open/*stat)
to look up the ioq based on st_dev, so we can use ioq to avoid
stalls from slow open/openat/stat/fstatat/unlink/unlinkat syscalls.
Otherwise, the implementation of this very closely resembles the old
fsck queue implementation, but is generic across HTTP and sidechannel
clients. The existing fsck queue functionality is now implemented using
ioq. Thus, fsck queue functionality is mapped by the MogileFS devid and
not the system st_dev ID as a result of this change.
One benefit of this feature is the ability to run fewer aio_threads
safely without worrying about cross-device contention on machines with
limited resources or few disks (or not solely dedicated to MogileFS
storage).
The capacity of these I/O queues is automatically scaled to the number
of available aio_threads, so capacity can change dynamically while your
admin is tuning "SERVER aio_threads = XX".
However, on a dedicated storage node, running many aio_threads (as is
the default) should still be beneficial. Having more threads can keep
the internal I/O queues of the kernel and storage hardware more
populated and can improve throughput.
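To illustrate the uncontended-vs-contended distinction described above,
here is a minimal sketch of a semaphore-like acquire which queues on
contention; the struct layout and helper names are hypothetical, not
the actual cmogstored internals:

    #include <pthread.h>
    #include <stdbool.h>

    struct client;  /* opaque; enqueued while waiting */
    static void client_enqueue(struct client **head, struct client *c);

    struct ioq {
            pthread_mutex_t mtx;
            unsigned cur;           /* slots currently in use */
            unsigned max;           /* scaled to aio_threads */
            struct client *waitq;   /* FIFO of waiting clients */
    };

    /* true: caller may do I/O now (like a successful sem_wait);
     * false: client was queued and will be rescheduled later */
    static bool ioq_acquire(struct ioq *q, struct client *c)
    {
            bool ready;

            pthread_mutex_lock(&q->mtx);
            ready = q->cur < q->max;
            if (ready)
                    q->cur++;
            else
                    client_enqueue(&q->waitq, c);
            pthread_mutex_unlock(&q->mtx);
            return ready;
    }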
thread shutdown fixes (epoll)
-----------------------------
Our previous reliance on pthreads cancellation primitives left us open
to a small race condition where I/O events (from epoll) could be lost
during graceful shutdown or thread reduction via
"SERVER aio_threads = XX". We no longer rely on pthreads cancellation
for stopping threads and instead implement explicit checkpoints for
epoll.
This did not affect kqueue users, but the code is simpler and more
consistent across epoll/kqueue implementations.
Graceful shutdown improvements
------------------------------
The addition of our I/O queueing and use of our custom thread shutdown
API also allowed us to improve responsiveness and fairness when the
process enters graceful shutdown mode. This avoids client-side timeouts
when large PUT requests are being issued over a fast network to slow
disks during graceful shutdown.
Currently, graceful shutdown remains single-threaded, but it will likely
become multi-threaded in the future (like normal runtime).
Miscellaneous fixes and improvements
------------------------------------
Further improved matching for (Linux) device-mapper setups where the
same device (not symlinks) appears multiple times in /dev.
aio_threads count is automatically updated when new devices are
added/removed. This is currently synced to MOG_DISK_USAGE_INTERVAL, but
will use inotify (or the kqueue equivalent) in the future.
HTTP read buffers grow monotonically (up to 64K) and always use aligned
memory, so deployments which pass large HTTP headers do not trigger
unnecessary reallocations (see the sketch below). Deployments which use
small HTTP headers should notice no memory increase.
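For illustration, a minimal sketch of grow-only, aligned buffer
handling; apart from the 64K cap mentioned above, the alignment and
names are assumptions:

    #include <stdlib.h>
    #include <string.h>

    #define RBUF_MAX (64 * 1024) /* growth is capped at 64K */

    /* grow-only: never shrink, so repeated large headers stop
     * triggering reallocations after the first growth */
    static void *rbuf_grow(void *old, size_t oldlen, size_t newlen)
    {
            void *ptr;

            if (newlen <= oldlen || newlen > RBUF_MAX)
                    return old;
            if (posix_memalign(&ptr, 4096, newlen)) /* aligned memory */
                    return NULL;
            memcpy(ptr, old, oldlen);
            free(old);
            return ptr;
    }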
Acceptor threads are now limited to two per process instead of being
scaled to CPU count. This avoids excessive threads/memory usage and
contention of kernel-level mutexes for large multi-core machines.
The gnulib version used for building the tarball is now included in the
tarball for ease-of-reproducibility.
Additional tests for uncommon error conditions using the fault-injection
capabilities of GNU ld.
The "shutdown" command over the sidechannel is more responsive for epoll
users.
Improved reporting of failed requests during PUT requests. Again, I run
MogileFS instances on some of the most horrible networks on the planet[2].
Fix LIB_CLOCK_GETTIME linkage on some toolchains.
"SERVER mogstored.persist_client = (0|1)" over the sidechannel is supported
for compatibility with Perlbal/mogstored
The Status: header is no longer returned on HTTP responses. All known
MogileFS clients parse the HTTP status response correctly without the
need for the Status: header. Neither Perlbal nor nginx set the Status:
header on responses, so this is unlikely to introduce incompatibilities.
The Status: header was originally inherited from HTTP servers which had
to deal with a much larger range of (non-compliant) clients.
|
|
The Status: header is no longer returned on HTTP responses. All known
MogileFS clients parse the HTTP status response correctly without the
need for the Status: header. Neither Perlbal nor nginx set the Status:
header on responses, so this is unlikely to introduce incompatibilities.
The Status: header was originally inherited from HTTP servers which had
to deal with a much larger range of (non-compliant) clients.
SystemTap support is mostly fleshed out. There are some bundled awk
scripts which should make better sense of the output of all.stp, which
logs just about everything.
Raising aio_threads now correctly increases ioq capacity. This
regression was only introduced in the 1.3.0 rc series, as ioq
was not in 1.2.x.
|
|
|
|
Otherwise, re-enqueueing only one mfd at a time is pointless
and prevents cmogstored from utilizing new threads.
|
|
We do not need to set the contended flag again until we're certain
we have no free slots in the ioq, not merely when we assume the client
is the last one to take a slot. This is because ioq access itself
is serialized, and the last client taking the ioq could be getting
a false positive while another thread waiting on ioq->mtx is about
to release the ioq.
This prevents throughput loss while recovering from a situation
where an ioq is oversubscribed. This was reproduced under heavy
load by temporarily switching to "SERVER aio_threads = 1"
and then bringing aio_threads back up to a high value.
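A rough sketch of the distinction, with hypothetical names (the real
mog_ioq code may differ): the contended flag is set only on the
certainly-full path, never merely for taking the last slot:

    #include <pthread.h>
    #include <stdbool.h>

    struct ioq {
            pthread_mutex_t mtx;
            unsigned cur, max;
            bool contended;
    };

    static bool ioq_take_slot(struct ioq *q)
    {
            bool taken = false;

            pthread_mutex_lock(&q->mtx);
            if (q->cur < q->max) {
                    q->cur++;
                    taken = true;
                    /* do NOT set the contended flag just for taking the
                     * last slot: a thread blocked on q->mtx may be about
                     * to release one, making that a false positive */
            } else {
                    q->contended = true; /* certain: no free slots left */
            }
            pthread_mutex_unlock(&q->mtx);
            return taken;
    }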
|
|
The variable may not be defined at all, so it must be
quoted to avoid spewing a warning if dtrace/stap are not
found.
|
|
Otherwise I will forget what they output one day and will
have to read the code again.
|
|
SystemTap support is implemented, and hopefully dtrace works, too.
|
|
Our "all.stp" tapset now generates awk-friendly output for feeding
some sample awk scripts.
Using awk (and gawk) was necessary to avoid reimplementing strftime
in guru mode for generating CLF (Common Log Format) HTTP access logs.
Using awk also gives us several advantages:
* floating point number support (for time differences)
* a more familiar language to systems administrators
(given this is for MogileFS, perhaps Perl would be even
more familiar...).
* fast edit/run cycle, so the slowness of using stap to
rebuild/reload the kernel module for all.stp changes can
be avoided when output must be customized.
|
|
This was inherited from a server which needed to deal with
some broken clients; MogileFS does not have this problem.
Neither Perlbal nor nginx sets this response header, either,
so let's save ourselves a few bytes.
|
|
While we're fortunate enough to not have encountered a case
where send/writev returns zero with a non-zero-length buffer,
it's not inconceivable that it could strike us one day. In that
case, error out the connection instead of looping infinitely.
Dropping a connection is safer than letting a thread run in
an infinite loop.
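A minimal sketch of the defensive check (the helper name and the
choice of errno are assumptions):

    #include <sys/types.h>
    #include <sys/uio.h>
    #include <errno.h>

    /* returns bytes written, or -1 so the caller drops the connection */
    static ssize_t checked_writev(int fd, struct iovec *iov, int cnt,
                                  size_t total)
    {
            ssize_t w = writev(fd, iov, cnt);

            if (w == 0 && total > 0) {
                    /* should not happen, but drop the connection rather
                     * than risk spinning forever on zero-byte writes */
                    errno = EIO;
                    return -1;
            }
            return w;
    }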
|
|
Unfortunately, slow mount points still cause minor reliability
issues with the test suite.
|
|
This seems to fail more under heavy load, so wait a bit longer for
iostat to become aware of the new devices.
|
|
We'll have tracing everywhere, so it's too much maintenance overhead
to add it to every file which wants it. Increased build times are
a problem, but less than the maintenance overhead of finding the
right headers.
|
|
This tapset will contain every probe point and acts as a
check/documentation for extracting useful probes.
|
|
Incomplete request headers are uncommon, so if we see them,
something is probably off or strange. This should make it
easier to maintain probe points to watch for this behavior.
|
|
Growing the rbufs should be uncommon, but it should set off alarms
if it happens too often.
|
|
mgmt may now encounter large rbufs, so ensure that uncommon case
is tested.
|
|
This should allow easier tracing of rbuf growth, and should
hopefully make the code more explicit and harder to screw up.
|
|
ioq tracing will allow users to notice when devices are saturated
(from a cmogstored POV) and increase aio_threads if necessary.
|
|
It is helpful to know the address of the listener on the server
which accepted the client socket. Additionally, the PID,FD combination
should be safely unique for any point in time.
|
|
I needed to spend time to convince myself this was safe, so
leave a note to others (and future self) in case there is
cause for concern.
Basically, this is highly dependent on our overall one-shot-based
concurrency model and safe as long as basic rules are followed.
|
|
Willy Tarreau cherry-picked the relevant fix into the 2.6.32
longterm stable tree
ref:
commit 1c137a47bbdd6e86298627e04f547afd7f35d523
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
|
|
This function is no longer used as we now attempt to reattach
rbufs to the TLS space of each thread.
|
|
It's unlikely we'll even come close to seeing 2-4 billion devices in a
MogileFS instance for a while. Meanwhile, it's also unlikely the
kernel will ever run that many threads, either. So make it easier
to pack and shrink data structures to save a few bytes and perhaps
get better memory alignment.
For reference, the POSIX semaphore API specifies initial values
with unsigned (int) values, too.
This leads to a minor size reduction (and we're not even packing):
$ ~/linux/scripts/bloat-o-meter cmogstored.before cmogstored
add/remove: 0/0 grow/shrink: 0/13 up/down: 0/-86 (-86)
function old new delta
mog_svc_dev_quit_prepare 13 12 -1
mog_mgmt_fn_aio_threads 147 146 -1
mog_dev_user_rescale_i 27 26 -1
mog_ioq_requeue_prepare 52 50 -2
mog_ioq_init 80 78 -2
mog_thrpool_start 101 96 -5
mog_svc_dev_user_rescale 143 137 -6
mog_svc_start_each 264 256 -8
mog_svc_aio_threads_handler 257 249 -8
mog_ioq_ready 263 255 -8
mog_ioq_next 303 295 -8
mog_svc_thrpool_rescale 206 197 -9
mog_thrpool_set_size 1028 1001 -27
|
|
For the most part, cmogstored 1.2.2 works well, but 1.3 contains some
fairly major changes and improvements.
cmogstored CPU usage may be higher than other servers because it's
designed to use whatever resources it has at its disposal to distribute
load to different storage devices. cmogstored 1.3 will continue this,
but it should be safer to lower thread counts without hurting
performance too much for non-dedicated servers.
Unfortunately, the minor, Linux-only bug affecting 1.2.2 for (uncommon)
thread shutdowns required some fairly intrusive changes to fix, so I'm
not sure if releasing a 1.2.3 is worth it. If you're happy with 1.2.x,
I recommend marking the host down via mogadm before lowering
"SERVER aio_threads = XX" or sending SIGQUIT to cmogstored. But
I think thread shutdown is uncommon enough to not affect normal
deployments.
cmogstored 1.3 will contain improvements for storage hosts at the
extreme ends of the performance scale. For large machines with many
cores, memory/thread usage is reduced because we had too many acceptor
threads. There are more improvements for smaller machines, especially
those with slow/imbalanced drive speeds and few CPUs. Some of the
improvements came from my testing with ancient single-core machines,
others came from testing on 24-core machines :)
The SystemTap tracing work is still in-progress (although the 1.3 cycle
was originally intended to focus on this :x). I expect the remaining
changes to be non-intrusive and will work on them through the RC cycle.
Major features in 1.3:
ioq - I/O queues for all MogileFS requests
------------------------------------------
The new I/O queue (ioq) implements the equivalent of AIO channels
functionality from Perlbal/mogstored. This feature prevents a
failing/overloaded disk from monopolizing all the threads in the system.
Since cmogstored uses threads directly (and not AIO), the common
(uncontended) case behaves like a successful sem_wait with POSIX
semaphores. Queueing+rescheduling only occurs in the contended case
(unlike with AIO-style APIs, where requests are always queued). I
experimented with POSIX semaphores but did not use them, as contention
would still starve the thread pool.
Unlike the old fsck_queue, ioq is based on the MogileFS devid in the URL
and not the st_dev ID of the actual underlying file. This is less
correct from a systems perspective, but should make no difference for
normal production deployments (which are expected to use one MogileFS
devid for each st_dev ID) and has several advantages:
1) testing this feature with mock deploys is easier
2) we do not require any additional filesystem syscall (open/*stat)
to look up the ioq based on st_dev, so we can use ioq to avoid
stalls from slow open/openat/stat/fstatat/unlink/unlinkat syscalls.
Otherwise, the implementation of this very closely resembles the old
fsck queue implementation, but is generic across HTTP and sidechannel
clients. The existing fsck queue functionality is now implemented using
ioq. Thus, fsck queue functionality is mapped by the MogileFS devid and
not the system st_dev ID as a result of this change.
One benefit of this feature is the ability to run fewer aio_threads
safely without worrying about cross-device contention on machines with
limited resources or few disks (or not solely dedicated to MogileFS
storage).
The capacity of these I/O queues is automatically scaled to the number
of available aio_threads, so capacity can change dynamically while your
admin is tuning "SERVER aio_threads = XX".
However, on a dedicated storage node, running many aio_threads (as is
the default) should still be beneficial. Having more threads can keep
the internal I/O queues of the kernel and storage hardware more
populated and can improve throughput.
thread shutdown fixes (epoll)
-----------------------------
Our previous reliance on pthreads cancellation primitives left us open
to a small race condition where I/O events (from epoll) could be lost
during graceful shutdown or thread reduction via
"SERVER aio_threads = XX". We no longer rely on pthreads cancellation
for stopping threads and instead implement explicit checkpoints for
epoll.
This did not affect kqueue users, but the code is simpler and more
consistent across epoll/kqueue implementations.
Graceful shutdown improvements
------------------------------
The addition of our I/O queueing and use of our custom thread shutdown
API also allowed us to improve responsiveness and fairness when the
process enters graceful shutdown mode. This avoids client-side timeouts
when large PUT requests are being issued over a fast network to slow
disks during graceful shutdown.
Currently, graceful shutdown remains single-threaded, but it will likely
become multi-threaded in the future (like normal runtime).
Miscellaneous fixes and improvements
------------------------------------
Further improved matching for (Linux) device-mapper setups where the
same device (not symlinks) appears multiple times in /dev.
aio_threads count is automatically updated when new devices are
added/removed. This is currently synced to MOG_DISK_USAGE_INTERVAL, but
will use inotify (or the kqueue equivalent) in the future.
HTTP read buffers grow monotonically (up to 64K) and always use aligned
memory, so deployments which pass large HTTP headers do not trigger
unnecessary reallocations. Deployments which use small HTTP
headers should notice no memory increase.
Acceptor threads are now limited to two per process instead of being
scaled to CPU count. This avoids excessive threads/memory usage and
contention of kernel-level mutexes for large multi-core machines.
The gnulib version used for building the tarball is now included in the
tarball for ease-of-reproducibility.
Additional tests for uncommon error conditions using the fault-injection
capabilities of GNU ld.
The "shutdown" command over the sidechannel is more responsive for epoll
users.
Improved reporting of failed requests during PUT requests. Again, I run
MogileFS instances on some of the most horrible networks on the planet[2].
Fix LIB_CLOCK_GETTIME linkage on some toolchains.
"SERVER mogstored.persist_client = (0|1)" over the sidechannel is supported
for compatibility with Perlbal/mogstored
|
|
Only relying on dtrace leads to build problems on FreeBSD which
I haven't had a chance to fix.
|
|
This should avoid concurrency bugs where a client may run in
multiple threads if we switch to multi-threaded graceful shutdown.
|
|
This test is too slow and timing-sensitive under valgrind, so
disable it for now until we have a better solution.
|
|
We could be completely out of threads upon acquiring an ioq, so the
last thread to acquire a lock slot must trigger a yield soon to
avoid starvation and fairness issues. Otherwise, all threads
for a given device could remain pinned indefinitely.
|
|
Tests need to clean up by stopping running processes.
|
|
This allows us to capture/trace the listen address which
accepted the request without consuming additional stack space.
|
|
This will allow us to properly report the listen address the client
connected to.
|
|
This makes it easier to write tapsets which key objects
by PID,FD for uniqueness. This also avoids some mog_fd_of()
calls.
|
|
This avoids noise in config.log.
|
|
The update prefix is bounded in size, so this will save us NR_DEVICES
malloc/free pairs each second from typical iostat output.
|
|
No need to recreate mog_mgmt_fn_blank for sending blank responses.
|
|
test_head_response_time does not test anything which would
not be otherwise tested by other tests under valgrind.
This test is only needed for occasional validation of
fuckups regarding TCP_NOPUSH on FreeBSD, and not necessary
for general use.
|
|
We don't want to drop in-flight pipelined requests when disabling
persistent connections. Disabling persistent connections will
always be potentially racy, but hopefully this makes the race
small enough that lower-level latencies are the only thing
which affects it.
|
|
While we always properly disconnected clients during shutdown, we
now explicitly set "Connection: close" to inform clients of our
pending shutdown. This avoids confusing clients when we disconnect
them, as there may still be a race condition where we shut down a
client while their request packets are in-flight.
|
|
This is Perlbal functionality which works in Perl mogstored,
so we will also support it here, as it makes upgrading to new
versions easier.
|
|
By reducing the capacity of each ioq, we force each running worker
thread to yield the current client and hit an exit point
(epoll_wait/kqueue) sooner.
|
|
Without this, test_iostat_watch fails sometimes under valgrind.
|
|
pwrite can be a slow, blocking function on an overloaded
system, so simulating a slow pwrite requires a wrapper.
This allows us to have coverage of the:
if (mog_ioq_contended())
return MOG_NEXT_WAIT_RD;
cases in http_put.c
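For illustration, a slow-pwrite shim along these lines can be linked
into the tests with GNU ld's --wrap=pwrite (the fault-injection
mechanism mentioned in the release notes); the details of the actual
test wrapper are an assumption:

    #include <sys/types.h>
    #include <unistd.h>

    ssize_t __real_pwrite(int fd, const void *buf, size_t count,
                          off_t offset);

    /* with -Wl,--wrap=pwrite, every pwrite() call lands here */
    ssize_t __wrap_pwrite(int fd, const void *buf, size_t count,
                          off_t offset)
    {
            usleep(100000); /* pretend the disk is overloaded */
            return __real_pwrite(fd, buf, count, offset);
    }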
|
|
Users reducing or increasing thread counts should see ioq capacity
reduced or increased to match; otherwise there's no point in having
more or fewer threads, since threads are synced to the ioq capacity.
|
|
We want to yield dying threads as soon as possible during
thread shutdown, so we check the quit flag and yield the
running thread to trigger a MOG_NEXT_ACTIVE.
|
|
This will allow us to detect I/O contention on our queue
and yield the current thread to other clients for fairness.
This can prevent a client from hogging the thread in situations
where the network is much faster than the filesystem/disk.
|