|
|
|
We'll have tracing everywhere, so it's too much maintenance overhead
to add it to every file which wants it. Increased build times are
a problem, but less than the maintenance overhead of finding the
right headers.
|
|
Incomplete request headers are uncommon, so if we see them,
something is probably off or strange. This should make it
easier to maintain probe points to watch for this behavior.
|
|
This should allow easier tracing of rbuf growth, and should
hopefully make the code more explicit and harder to screw up.
|
|
I needed to spend time to convince myself this was safe, so
leave a note to others (and future self) in case there is
cause for concern.
Basically, this is highly dependent on our overall one-shot-based
concurrency model and safe as long as basic rules are followed.
|
|
This allows us to capture/trace the listen address which
accepted the request without consuming additional stack space.
|
|
This makes it easier to write tapsets which key objects
by PID + FD for uniqueness. This also avoids some mog_fd_of()
calls.
|
|
This should hopefully make failures easier to track down.
|
|
This allows us to avoid a redundant hash lookup every time we
"activate" an open file for reading or writing.
|
|
This will allow us to limit concurrency on a per-device basis with
limited impact on HTTP header reading/parsing. This prevents
pathological slowness on a single device from bringing down an entire
host. This also allows users to more safely run with fewer aio_threads
(e.g. 1:1 thread:device mapping) on fast devices with smaller low-level
(kernel/hardware) I/O queues.
|
|
Reattaching/reusing read buffers allows us to avoid repeated
reallocation/growth/free when clients repeatedly send us large headers.
This may also increase cache-hits by favoring recently-used buffers as
long as fragmentation is kept in check. The fragmentation should be
no worse than it is currently, given the existing detached nature of rbufs.
|
|
We need to ensure we do not introduce code to launch
http_process_client while we have buffered data (or socket write
errors).
|
|
Simply releasing the descriptor triggering ENOSPC/ENOMEM errors from
epoll_ctl and kevent is not good enough, as those descriptors may
have other descriptors (e.g. files to be served) hanging off of them.
|
|
We cannot assume sa_family_t is the first element of "struct
sockaddr_in" or "struct sockaddr_in6". FreeBSD has a "sa_len"
member as the first element while Linux does not.
So only keep the parts of the "struct sockaddr*" we need and use
inet_ntop instead of getnameinfo. This also gives us a little more
space to add additional fields to "struct mog_http" in the future
without increasing memory (or CPU cache) use.
|
|
This will allow us to do lookups for IO queues/semaphores before
we attempt to fstatat/stat a path.
|
|
We will key most client events by pid() and file descriptors,
as this is least ambiguous. There are some minor refactorings
to pass "struct mog_fd *" around as much as possible instead of
"struct mog_http *".
|
|
getpeername() does not work on unconnected sockets. For error handling,
unconnected sockets are a fairly common occurrence, so we want to get
the address early on when we know the address is still valid.
For IPv4 addresses, this does not increase memory overhead at all. IPv6
addresses[1] do require an additional heap allocation, but it does not
need to be aligned since it is infrequently accessed. If IPv6 becomes
common, we may need to expand our per-client storage to 192 bytes (from
128) on 64-bit (or see if we may pack data more carefully).
[1] IPv6 addresses are rare with MogileFS, as MogileFS does not
currently support them.
|
|
This will allow easy use of memset to reset attributes in
between requests without clobbering more important data.
|
|
We need to signal we do not have more bytes to write to the
socket when generating HTTP HEAD responses. This avoids a
200ms delay between HTTP responses. This regression only
appeared in commit 14e0684507c06439ee9c7a731fd6ca90b7b9adcb
and was never in a release.
|
|
Since we no longer use TCP_CORK under Linux (where we use
MSG_MORE instead), we can cleanup the nomenclature and avoid
confusing people by mentioning TCP_CORK.
|
|
gnulib did it for us in m4/gnulib-cache.m4, so we'll match.
|
|
We do not need to track queue state any longer since accept
threads always inject directly into the epoll/kqueue watcher
nowadays.
|
|
kevent() has the ability to insert items into the kqueue
and retrieve with the same syscall. This allows us to
reduce syscalls on systems with kqueue support.
Regardless of whether this potential optimization can
improve performance, this makes the code smaller and
possibly easier to follow.
|
|
Content-Length, Content-Range, chunk size can all overflow
the limit of off_t, so return a more informative 507 instead
of a 400.
|
|
The rbuf may grow sometimes to accommodate larger requests,
so use rbuf->rcapa instead.
|
|
This offloads work from the kernel into userspace and helps us
get around the lack of useful/non-buggy TCP_DEFER_ACCEPT
semantics.
After this, we may now reduce the number of acceptor threads
as the acceptor threads will no longer be bound by disk
performance.
|
|
For many years now, TCP_NOPUSH has behaved exactly like TCP_CORK
on Linux so we can just enable it to save system calls on
the /client/ side.
Using the integrated writev-like facility of the BSD sendfile()
implementation may not be worth it as it complicates error handling.
Tested on Debian GNU/kFreeBSD 6.0
|
|
This is mainly to prevent triggering potential bugs in some HTTP
clients that rely on the Perl mogstored (which uses TCP_CORK).
This should also make HTTP GET responses slightly more efficient
in terms of network traffic. Low-latency clients may see some
improvement because clients may process the response headers and
body with fewer wakeups and waiting.
The downside of this is slightly slower DELETE/PUT/HEAD
responses due to the additional syscalls on the server.
|
|
We don't want accidental /dev* directories being created
due to misconfiguration. This can help prevent configuration
errors from spilling over or going unnoticed.
|
|
Hopefully things are less error-prone this way.
|
|
Not only do we have to be careful about not changing a
bit, we also need to be careful about actually setting
it for current cases...
Found by valgrind.
|
|
This makes it easy to support read-only HTTP traffic on a
different listen port.
This reduces listen queue contention and allows using iptables
to block off DAV traffic from certain hosts while serving
freely.
|
|
We want to avoid global resources like the active queue
as much as possible.
Unnecessary bouncing of clients between different threads
and contention for the active queue lock hurts concurrency.
This contention is witnessed when parallel MD5 requests
are serviced during parallel fsck runs.
|
|
Try to drain (or fill up) the socket as much as possible.
We want to be able to do some work without
putting additional contention in the active queue and
potentially bouncing data between CPU caches.
|
|
"detach" makes more sense than "defer" here. This function
detaches a per-thread buffer from its owner.
|
|
Some folks with reproxy setups end up forwarding large headers
(e.g. session cookies) to mogstored backends.
Since our per-client HTTP buffer offsets are uint16_t,
UINT16_MAX was chosen. Perlbal actually allows 100K, but I
doubt anybody would ever actually need that much.
|
|
We didn't have rcapa in the past, but now we do, so use it.
rsize is only used for stashing buffers in per-client (fdmap)
areas.
|
|
They're the same, so it should result in less fragmentation from
resizing if we _keep_ them the same moving forward.
|
|
This stores the original size of the struct and makes
it easier to know how much of it is used.
|
|
By going into single-threaded mode, we can drastically simplify our
shutdown sequence to avoid race conditions. This also allows us
to not have additional overhead during normal runtime, as all the
shutdown-specific logic is isolated to only a few portions of
the code.
Like all graceful shutdown schemes, this one is still vulnerable to
race conditions due to network latency, but this one should be no worse
than any other server. Fortunately all requests we service are
idempotent.
|
|
This will help us avoid bugs if we're transferring mog_fd
structs between queues.
|
|
This forces us to invalidate the mog_fd structure before calling
close() on the file descriptor. Eventually, this lets us
gracefully shutdown by scanning fdmap to invalidate old
connections.
|
|
We want to be able to override keepalive/persistence
set by our parser if our svc is being shut down.
|
|
They're far too common and will just flood syslog.
|
|
Found by clang, apparently GCC gets confused when it
comes to small-sized enums.
|
|
Unlimited-length streams are trickier to parse with minimal
buffering, so we need to be careful with corner cases clients
may put us through...
|
|
In case MogileFS clients rely on these fields, we're
closer to being a "real" HTTP server.
|
|
Still a bit iffy on the details, but it seems to basically work.
There will probably be cases where this code falls down badly
so it needs much more testing...
|
|
The Perl MogileFS::Client library still sends requests
with Content-Range for partial PUTs.
|
|
The good thing is that pipelined and persistent PUT works
out-of-the-box, too. We use O_EXCL when opening files,
so there's currently no risk of overwriting anything;
maybe that's a good thing?
TODOs:
* partial write (Content-Range header)
* overhaul the mog_open* API for Content-Range
* support overwriting existing files (maybe)
* Content-MD5 verification (in trailers, too)
* Transfer-Encoding: chunked support (for Content-MD5 trailers)
* mmap() write support.
|