|
|
|
We'll have tracing everywhere, so it's too much maintenance overhead
to add it to every file which wants it. Increased build times are
a problem, but less than the maintenance overhead of finding the
right headers.
|
|
Incomplete request headers are uncommon, so if we see them,
something is probably off or strange. This should make it
easier to maintain probe points to watch for this behavior.
|
|
This should allow easier tracing of rbuf growth, and should
hopefully make the code more explicit and harder to screw up.
|
|
I needed to spend time to convince myself this was safe, so
leave a note to others (and future self) in case there is
cause for concern.
Basically, this is highly dependent on our overall one-shot-based
concurrency model and safe as long as basic rules are followed.
|
|
This allows us to capture/trace the listen address which
accepted the request without consuming additional stack space.
|
|
This makes it easier to write tapsets which key objects
by PID + FD for uniqueness. This also avoids some mog_fd_of()
calls.
|
|
This should hopefully make failures easier to track down.
|
|
This allows us to avoid a redundant hash lookup every time we
"activate" an open file for reading or writing.
|
|
This will allow us to limit concurrency on a per-device basis with
limited impact on HTTP header reading/parsing. This prevents
pathological slowness on a single device from bringing down an entire
host. This also allows users to more safely run with fewer aio_threads
(e.g. 1:1 thread:device mapping) on fast devices with smaller low-level
(kernel/hardware) I/O queues.
|
|
Reattaching/reusing read buffers allows us to avoid repeated
reallocation/growth/free when clients repeatedly send us large headers.
This may also increase cache-hits by favoring recently-used buffers as
long as fragmentation is kept in check. The fragmentation should be
no worse than it is currently, given the existing detached nature of rbufs.
|
|
We need to ensure we do not introduce code to launch
http_process_client while we have buffered data (or socket write
errors).
|
|
Simply releasing the descriptor triggering ENOSPC/ENOMEM errors from
epoll_ctl and kevent is not good enough, as those descriptors may
have other descriptors (e.g. files to be served) hanging off of them.
|
|
We cannot assume sa_family_t is the first element of "struct
sockaddr_in" or "struct sockaddr_in6". FreeBSD has a "sa_len"
member as the first element while Linux does not.
So only keep the parts of the "struct sockaddr*" we need and use
inet_ntop instead of getnameinfo. This also gives us a little more
space to add additional fields to "struct mog_http" in the future
without increasing memory (or CPU cache) use.
|
|
This will allow us to do lookups for IO queues/semaphores before
we attempt to fstatat/stat a path.
|
|
We will key most client events by pid() and file descriptors,
as this is least ambiguous. There are some minor refactorings
to pass "struct mog_fd *" around as much as possible instead of
"struct mog_http *".
|
|
getpeername() does not work on unconnected sockets. For error handling,
unconnected sockets are a fairly common occurrence, so we want to get
the address early on when we know the address is still valid.
For IPv4 addresses, this does not increase memory overhead at all. IPv6
addresses[1] do require an additional heap allocation, but it does not
need to be aligned since it is infrequently accessed. If IPv6 becomes
common, we may need to expand our per-client storage to 192 bytes (from
128) on 64-bit (or see if we may pack data more carefully).
[1] IPv6 addresses are rare with MogileFS, as MogileFS does not
currently support them.
|
|
This will allow easy use of memset to reset attributes in
between requests without clobbering more important data.
|
|
We need to signal we do not have more bytes to write to the
socket when generating HTTP HEAD responses. This avoids a
200ms delay between HTTP responses. This regression only
appeared in commit 14e0684507c06439ee9c7a731fd6ca90b7b9adcb
and was never in a release.
|
|
Since we no longer use TCP_CORK under Linux (where we use
MSG_MORE instead), we can cleanup the nomenclature and avoid
confusing people by mentioning TCP_CORK.
|
|
gnulib did it for us in m4/gnulib-cache.m4, so we'll match.
|
|
We do not need to track queue state any longer since accept
threads always inject directly into the epoll/kqueue watcher
nowadays.
|
|
kevent() has the ability to insert items into the kqueue
and retrieve with the same syscall. This allows us to
reduce syscalls on systems with kqueue support.
Regardless of whether this potential optimization can
improve performance, this makes the code smaller and
possibly easier to follow.
|
|
Content-Length, Content-Range, chunk size can all overflow
the limit of off_t, so return a more informative 507 instead
of a 400.
|
|
The rbuf may grow sometimes to accommodate larger requests,
so use rbuf->rcapa instead.
|
|
This offloads work from the kernel into userspace and helps us
get around the lack of useful/non-buggy TCP_DEFER_ACCEPT
semantics.
After this, we may now reduce the number of acceptor threads
as the acceptor threads will no longer be bound by disk
performance.
|
|
For many years now, TCP_NOPUSH has behaved exactly like TCP_CORK
on Linux so we can just enable it to save system calls on
the /client/ side.
Using the integrated writev-like facility of the BSD sendfile()
implementation may not be worth it as it complicates error handling.
Tested on Debian GNU/kFreeBSD 6.0
|
|
This is mainly to prevent triggering potential bugs in some HTTP
clients that rely on the Perl mogstored (which uses TCP_CORK).
This should also make HTTP GET responses slightly more efficient
in terms of network traffic. Low-latency clients may see some
improvement because clients may process the response headers and
body with fewer wakeups and waiting.
The downside of this is slightly slower DELETE/PUT/HEAD
responses due to the additional syscalls on the server.
|
|
We don't want accidental /dev* directories being created
due to misconfiguration. This can help prevent configuration
errors from spilling over or going unnoticed.
|
|
Hopefully things are less error-prone this way.
|
|
Not only do we have to be careful about not changing a
bit, we also need to be careful about actually setting
it for current cases...
Found by valgrind.
|
|
This makes it easy to support read-only HTTP traffic on a
different listen port.
This reduces listen queue contention and allows using iptables
to block off DAV traffic from certain hosts while serving
freely.
|
|
We want to avoid global resources like the active queue
as much as possible.
Unnecessary bouncing of clients between different threads
and contention for the active queue lock hurts concurrency.
This contention is witnessed when parallel MD5 requests
are serviced during parallel fsck runs.
|
|
Try to drain (or fill up) the socket as much as possible.
We want to be able to do some work without
putting additional contention in the active queue and
potentially bouncing data between CPU caches.
|
|
"detach" makes more sense than "defer" here. This function
detaches a per-thread buffer from its owner.
|
|
Some folks with reproxy setups end up forwarding large headers
(e.g. session cookies) to mogstored backends.
Since our per-client HTTP buffer offsets are uint16_t,
UINT16_MAX was chosen. Perlbal actually allows 100K, but I
doubt anybody would ever actually need that much.
|
|
We didn't have rcapa in the past, but now we do, so use it.
rsize is only used for stashing buffers in per-client (fdmap)
areas.
|
|
They're the same, so it should result in less fragmentation from
resizing if we _keep_ them the same moving forward.
|
|
This stores the original size of the struct and makes
it easier to know how much of it is used.
|
|
By going into single-threaded mode, we can drastically simplify our
shutdown sequence to avoid race conditions. This also allows us
to not have additional overhead during normal runtime, as all the
shutdown-specific logic is isolated to only a few portions of
the code.
Like all graceful shutdown schemes, this one is still vulnerable to
race conditions due to network latency, but this one should be no worse
than any other server. Fortunately all requests we service are
idempotent.
|
|
This will help us avoid bugs if we're transferring mog_fd
structs between queues.
|
|
This forces us to invalidate the mog_fd structure before calling
close() on the file descriptor. Eventually, this lets us
gracefully shutdown by scanning fdmap to invalidate old
connections.
|
|
We want to be able to override keepalive/persistence
set by our parser if our svc is being shut down.
|
|
They're far too common and will just flood syslog.
|
|
Found by clang, apparently GCC gets confused when it
comes to small-sized enums.
|
|
Unlimited-length streams are trickier to parse with minimal
buffering, so we need to be careful with corner cases clients
may put us through...
|
|
In case MogileFS clients rely on these fields, we're
closer to being a "real" HTTP server.
|
|
Still a bit iffy on the details, but it seems to basically work.
There will probably be cases where this code falls down badly
so it needs much more testing...
|
|
The Perl MogileFS::Client library still sends requests
with Content-Range for partial PUTs.
|
|
The good thing is that pipelined and persistent PUT works
out-of-the-box, too. We use O_EXCL when opening files,
so there's currently no risk of overwriting anything;
maybe that's a good thing?
TODOs:
* partial write (Content-Range header)
* overhaul the mog_open* API for Content-Range
* support overwriting existing files (maybe)
* Content-MD5 verification (in trailers, too)
* Transfer-Encoding: chunked support (for Content-MD5 trailers)
* mmap() write support.
|