"unicorn 4.8.1"

* Re: On USR2, new master runs with same PID
  2015-03-12  1:45  0% ` Eric Wong
@ 2015-03-12  6:26  0%   ` Kevin Yank
  0 siblings, 0 replies; 4+ results
From: Kevin Yank @ 2015-03-12  6:26 UTC (permalink / raw)
  To: Eric Wong; +Cc: unicorn-public

[-- Attachment #1: Type: text/plain, Size: 5164 bytes --]

Thanks for your help, Eric!

> On 12 Mar 2015, at 12:45 pm, Eric Wong <e@80x24.org> wrote:
> 
> Kevin Yank <kyank@avalanche.com.au> wrote:
>> When I send USR2 to the master process (PID 19216 in this example), I
>> get the following in the Unicorn log:
>> 
>> I, [2015-03-11T23:47:33.992274 #6848]  INFO -- : executing ["/srv/ourapp/shared/bundle/ruby/2.2.0/bin/unicorn", "/srv/ourapp/current/config.ru", "-Dc", "/srv/ourapp/shared/config/unicorn.rb", {10=>#<Kgio::UNIXServer:/srv/ourapp/shared/sockets/unicorn.sock>}] (in /srv/ourapp/releases/a0e8b5df474ad5129200654f92a76af00a750f47)
>> I, [2015-03-11T23:47:36.504235 #6848]  INFO -- : inherited addr=/srv/ourapp/shared/sockets/unicorn.sock fd=10
>> /srv/ourapp/shared/bundle/ruby/2.2.0/gems/unicorn-4.8.1/lib/unicorn/http_server.rb:206:in `pid=': Already running on PID:19216 (or pid=/srv/ourapp/shared/pids/unicorn.pid is stale) (ArgumentError)
> 
> Nothing suspicious until the above line...

That’s right.

>> And when I check, indeed, there is now unicorn.pid and
>> unicorn.pid.oldbin, both containing 19216.
> 
> Any chance you have a process manager or something else creating the
> (non-.oldbin) pid file behind unicorn's back?

It’s possible; I’m using eye (https://github.com/kostya/eye) as a process monitor. I’m not aware of it writing .pid files for processes that self-daemonize like Unicorn, though. And once one of my servers “goes bad” (i.e. Unicorn starts failing to restart in response to a USR2), it does so 100% consistently until I stop and restart Unicorn entirely. Based on that, I don’t believe it’s a race condition where my process monitor is slipping in a new unicorn.pid file some of the time.

> Can you check your process table to ensure there's not multiple
> unicorn instances running off the same config and pid files, too?

As far as I can tell, I have some servers that have gotten themselves into this “USR2 restart fails” state, while others are working just fine. In both cases, the Unicorn process tree (as show in htop) looks like this “at rest” (i.e. no deployments/restarts in progress):

unicorn master
`- unicorn master
`- unicorn worker[2]
|  `- unicorn worker[2]
`- unicorn worker[1]
|  `- unicorn worker[1]
`- unicorn worker[0]
   `- unicorn worker[0]

At first glance I’d definitely say it appears that I have two masters running from the same config files. However, there’s only one unicorn.pid file of course (the root process in the tree above), and when I try to kill -TERM the master process that doesn’t have a .pid file, the entire process tree exits. Am I misinterpreting the process table? Is this process list actually normal?

Thus far I’ve been unable to find any difference in state between a properly-behaving server and a misbehaving server, apart from the behaviour of the Unicorn master when it receives a USR2.

> Also, were there other activity (another USR2 or HUP) in the logs
> a few seconds beforehand?

No, didn’t see anything like that (and I was looking for it).

> What kind of filesystem / kernel is the pid file on?

EXT4 / Ubuntu Server 12.04 LTS

> A network FS or anything without the consistency guarantees provided
> by POSIX would not work for pid files.

Given my environment above, I should be OK, right?

> pid files are unfortunately prone to to nasty race conditions,
> but I'm not sure what you're seeing happens very often.

This has been happening pretty frequently across multiple server instances, and again once it starts happening on an instance, it keeps happening 100% of the time (until I restart Unicorn completely). So it’s not a rare edge case.

> -------------------8<-------------------
>>  sleep 10
>> 
>>  old_pid = "#{server.config[:pid]}.oldbin"
>>  if File.exists?(old_pid) && server.pid != old_pid
>>    begin
>>      Process.kill("QUIT", File.read(old_pid).to_i)
>>    rescue Errno::ENOENT, Errno::ESRCH
>>      # someone else did our job for us
>>    end
>>  end
> -------------------8<-------------------
> 
> I'd get rid of that hunk starting with the "sleep 10" (at least while
> debugging this issue).

If I simply delete this hunk, I’ll have old masters left running on my servers because they’ll never get sent the QUIT signal. I can definitely remove it temporarily (and kill the old master myself) while debugging, though.

> If you did a USR2 previously, maybe it could
> affect the current USR2 upgrade process.  Sleeping so long in the master
> like that is pretty bad it throws off timing and delays signal handling.

I’d definitely like to get rid of the sleep, as my restarts definitely feel slow. I’m not clear on what a better approach would be, though.

> That's a pretty fragile config and I regret ever including it in the
> example config files

Any better examples/docs you’d recommend I consult for guidance? Or is expecting to achieve a robust zero-downtime restart using before_fork/after_fork hooks unrealistic?

--
Kevin Yank
Chief Technology Officer, Avalanche Technology Group
http://avalanche.com.au/

ph: +61 4 2241 0083

^ permalink raw reply	[relevance 0%]