From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on dcvr.yhbt.net X-Spam-Level: X-Spam-ASN: X-Spam-Status: No, score=-2.9 required=3.0 tests=ALL_TRUSTED,BAYES_00, URIBL_BLOCKED shortcircuit=no autolearn=unavailable version=3.3.2 X-Original-To: unicorn-public@bogomips.org Received: from localhost (dcvr.yhbt.net [127.0.0.1]) by dcvr.yhbt.net (Postfix) with ESMTP id DCAAA1FD47; Thu, 12 Mar 2015 01:45:15 +0000 (UTC) Date: Thu, 12 Mar 2015 01:45:15 +0000 From: Eric Wong To: Kevin Yank Cc: unicorn-public@bogomips.org Subject: Re: On USR2, new master runs with same PID Message-ID: <20150312014515.GA10605@dcvr.yhbt.net> References: <2C0638F8-338C-4B06-B641-0DCF48D4D16A@avalanche.com.au> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <2C0638F8-338C-4B06-B641-0DCF48D4D16A@avalanche.com.au> List-Id: Kevin Yank wrote: > Having recently migrated our Rails app to MRI 2.2.0 (which may or may > not be related), we’re experiencing problems with our Unicorn > zero-downtime restarts. > > When I send USR2 to the master process (PID 19216 in this example), I > get the following in the Unicorn log: > > I, [2015-03-11T23:47:33.992274 #6848] INFO -- : executing ["/srv/ourapp/shared/bundle/ruby/2.2.0/bin/unicorn", "/srv/ourapp/current/config.ru", "-Dc", "/srv/ourapp/shared/config/unicorn.rb", {10=>#}] (in /srv/ourapp/releases/a0e8b5df474ad5129200654f92a76af00a750f47) > I, [2015-03-11T23:47:36.504235 #6848] INFO -- : inherited addr=/srv/ourapp/shared/sockets/unicorn.sock fd=10 > /srv/ourapp/shared/bundle/ruby/2.2.0/gems/unicorn-4.8.1/lib/unicorn/http_server.rb:206:in `pid=': Already running on PID:19216 (or pid=/srv/ourapp/shared/pids/unicorn.pid is stale) (ArgumentError) Nothing suspicious until the above line... > And when I check, indeed, there is now unicorn.pid and > unicorn.pid.oldbin, both containing 19216. > > What could cause this situation to arise? Any chance you have a process manager or something else creating the (non-.oldbin) pid file behind unicorn's back? Can you check your process table to ensure there's not multiple unicorn instances running off the same config and pid files, too? Also, were there other activity (another USR2 or HUP) in the logs a few seconds beforehand? What kind of filesystem / kernel is the pid file on? A network FS or anything without the consistency guarantees provided by POSIX would not work for pid files. pid files are unfortunately prone to to nasty race conditions, but I'm not sure what you're seeing happens very often. Likewise, the check for stale Unix domain socket paths at startup is inevitably racy, too, but the window is usually small enough to be unnoticeable. But yes, just in case, check the process table to make sure there aren't multiple, non-related masters running on off the same paths. > before_fork do |server, worker| > ActiveRecord::Base.connection.disconnect! -------------------8<------------------- > sleep 10 > > old_pid = "#{server.config[:pid]}.oldbin" > if File.exists?(old_pid) && server.pid != old_pid > begin > Process.kill("QUIT", File.read(old_pid).to_i) > rescue Errno::ENOENT, Errno::ESRCH > # someone else did our job for us > end > end -------------------8<------------------- I'd get rid of that hunk starting with the "sleep 10" (at least while debugging this issue). If you did a USR2 previously, maybe it could affect the current USR2 upgrade process. Sleeping so long in the master like that is pretty bad it throws off timing and delays signal handling. That's a pretty fragile config and I regret ever including it in the example config files