after_worker_exit on murder

unicorn Ruby/Rack server user+dev discussion/patches/pulls/bugs/help
 help / color / mirror / code / Atom feed

* after_worker_exit on murder
@ 2017-04-04 14:08 Simon Eskildsen
  2017-04-04 14:32 ` Jeremy Evans
  2017-04-05  1:19 ` Eric Wong
  0 siblings, 2 replies; 6+ messages in thread
From: Simon Eskildsen @ 2017-04-04 14:08 UTC (permalink / raw)
  To: unicorn-public, Jeremy Evans

Hi!

With Jeremy Evans' work on `after_worker_exit`, I was hoping I could
replace our internal fork which has a `before_murder` hook to report
to monitoring systems when workers are forcibly killed. However, the
`after_worker_exit` is only called on reaping—not when murdering.

How would you feel about executing this on the murder path?
Alternatively, we could add an `after_murder` hook for this path.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: after_worker_exit on murder
  2017-04-04 14:08 after_worker_exit on murder Simon Eskildsen
@ 2017-04-04 14:32 ` Jeremy Evans
  2017-04-04 14:36   ` Simon Eskildsen
  2017-04-05  1:19 ` Eric Wong
  1 sibling, 1 reply; 6+ messages in thread
From: Jeremy Evans @ 2017-04-04 14:32 UTC (permalink / raw)
  To: Simon Eskildsen; +Cc: unicorn-public

On 04/04 10:08, Simon Eskildsen wrote:
> Hi!
> 
> With Jeremy Evans' work on `after_worker_exit`, I was hoping I could
> replace our internal fork which has a `before_murder` hook to report
> to monitoring systems when workers are forcibly killed. However, the
> `after_worker_exit` is only called on reaping???not when murdering.
> 
> How would you feel about executing this on the murder path?
> Alternatively, we could add an `after_murder` hook for this path.

I checked and after_worker_exit is called when reaping the worker
after murdering, so I don't think any changes are necessary. I would
guess if you added it when murdering, it would be called twice on the
same worker.

Thanks,
Jeremy

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: after_worker_exit on murder
  2017-04-04 14:32 ` Jeremy Evans
@ 2017-04-04 14:36   ` Simon Eskildsen
  0 siblings, 0 replies; 6+ messages in thread
From: Simon Eskildsen @ 2017-04-04 14:36 UTC (permalink / raw)
  To: Jeremy Evans; +Cc: unicorn-public

You're right, I misread the code—I thought the wait(2) would happen in
the kill worker call, but the reap worker call takes care of that.

Thanks for clearing that up Jeremy!

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: after_worker_exit on murder
  2017-04-04 14:08 after_worker_exit on murder Simon Eskildsen
  2017-04-04 14:32 ` Jeremy Evans
@ 2017-04-05  1:19 ` Eric Wong
  2017-04-05 10:55   ` Simon Eskildsen
  1 sibling, 1 reply; 6+ messages in thread
From: Eric Wong @ 2017-04-05  1:19 UTC (permalink / raw)
  To: Simon Eskildsen; +Cc: unicorn-public, Jeremy Evans

Simon Eskildsen <simon.eskildsen@shopify.com> wrote:
> With Jeremy Evans' work on `after_worker_exit`, I was hoping I could
> replace our internal fork which has a `before_murder` hook to report
> to monitoring systems when workers are forcibly killed. However, the
> `after_worker_exit` is only called on reaping—not when murdering.

Hi Simon, it looks like Jeremy clarified after_worker_exit for you.

Anyways...  I remember rejecting patches to add more to timeout
support in unicorn over the years since I did not want it to be
a crutch for application developers, or worse; a reason for
people to feel locked into unicorn.  Instead I wrote things like
https://bogomips.org/unicorn/Application_Timeouts.html to
discourage relying on unicorn's built-in `timeout' feature.

But, it seems like there's still a reliance on the built-in
timeout...

Why is that?  (If you're allowed to disclose)

I don't mean to put you guys (Shopify) on the spot,
as I'm sure other folks do it, too; but you're here :)

Anyways, is this something that could or should be improved
in Rack or ruby itself?  Or are there buggy external libraries
or even external dependencies like NFS to blame?

(Or, perhaps "I don't know" is fine :)

The Ruby timeout handling in stdlib could be way better if
done natively, I think.  Perhaps `ensure` could be better
declared, too; but there's still the problem of getting
external code to work well with it...

Anyways I am of two minds; on one hand, buggy code is fine!
unicorn handles it by nuking a process.  But on the other
hand... it's irritatingly inefficient and shoving things under
the rug still rots and stinks eventually.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: after_worker_exit on murder
  2017-04-05  1:19 ` Eric Wong
@ 2017-04-05 10:55   ` Simon Eskildsen
  2017-04-05 18:33     ` Eric Wong
  0 siblings, 1 reply; 6+ messages in thread
From: Simon Eskildsen @ 2017-04-05 10:55 UTC (permalink / raw)
  To: Eric Wong; +Cc: unicorn-public, Jeremy Evans

I agree with you in principle, absolutely. However, when you have a
code-base the size of ours (100Ks of lines of Ruby) with 100s of
developers, and new ones coming on every month with no prior Ruby or
Rails experience, we can't rely on everyone doing the right thing all
the time. With a surface area of that size, there will be things that
are missed, especially when you run gems like Liquid that have
billions of different ways of composing templates—some of these paths,
unfortunately, are going to be slow. We definitely chase down all the
worst offenders, but when new ones creep up, we do our best to chase
them down when time allows. Using this hook, allows us to monitor when
that happens, and how often it happens, and for which endpoints. With
100Ks of lines of code, 100s of developers, and 10s of thousands of
requests per second—once in a million happens every couple of seconds.
Multiply that with the size of the code-base, and Unicorn timeouts due
to the conditions below will happen somewhat often.

It becomes difficult, because sometimes you have legitimate requests
that take 10-20s, because the merchant's data set is so large that it
exposes anomalies. Again, with the size of our code-base, we need this
wiggle room in the global timeout to not just error on users. You can
have endpoints that do 4 HTTP requests, 5 RPC requests, 4 MySQL
queries, and 30 calls to Memcached. In that case, your worst case is
the timeout of all of those actions, which easily exceeds the Unicorn
timeout. We've debated having "budgets" and "shitlisting"
(http://sirupsen.com/shitlists/) paths that obviously take longer than
the budget for a single resource. The probability of more than one
resource being very slow at once, is quite low (and if it is, again,
we rely on the Unicorn timeout).

In other words, the Unicorn timeout is not a crutch for timeouts in
the application, but a global timeout to as a last line of defense
against many timeouts, or some bug we didn't foresee. This seems
unavoidable in my eyes, unless you have very aggressive timeouts and
meticulously keep track of the budgets in a testing environment and
raise if the budget is exceeded. When we hit many timeouts, we use
Semian (http://github.com/shopify/semian) to trigger circuit
breakers—so the reliance on Unicorn should be brief.

Some of these bugs are even deep in Ruby, Jean B, one of my co-workers
submitted a bug about there being no write_timeout in Net::HTTP (you
even replied!): https://bugs.ruby-lang.org/issues/13396

BTW we deployed 5.3.0 and replaced our `before_murder` hook with
`after_worker_exit`. Everything works perfectly and we finally are not
using a forked version of Unicorn anymore. Thanks for the release!

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: after_worker_exit on murder
  2017-04-05 10:55   ` Simon Eskildsen
@ 2017-04-05 18:33     ` Eric Wong
  0 siblings, 0 replies; 6+ messages in thread
From: Eric Wong @ 2017-04-05 18:33 UTC (permalink / raw)
  To: Simon Eskildsen; +Cc: unicorn-public, Jeremy Evans

Simon Eskildsen <simon.eskildsen@shopify.com> wrote:

Thank you for your reply.  It is a good reminder of how far away
I am from the rest of the web development world :>

<snip>

> It becomes difficult, because sometimes you have legitimate requests
> that take 10-20s, because the merchant's data set is so large that it
> exposes anomalies. Again, with the size of our code-base, we need this
> wiggle room in the global timeout to not just error on users. You can
> have endpoints that do 4 HTTP requests, 5 RPC requests, 4 MySQL
> queries, and 30 calls to Memcached. In that case, your worst case is
> the timeout of all of those actions, which easily exceeds the Unicorn
> timeout.

Wow, that is frightening... People will actually wait for a web
page to load in that case?  I guess that's why ccc exists :>

> We've debated having "budgets" and "shitlisting"
> (http://sirupsen.com/shitlists/) paths that obviously take longer than
> the budget for a single resource. The probability of more than one
> resource being very slow at once, is quite low (and if it is, again,
> we rely on the Unicorn timeout).

Interesting.  I guess for now, you can use nginx or similar to
route to differently-configured unicorns with different timeouts
(or even other servers)?

Anyways, I'm coming around to reconciling the two mindsets of
"typical" code running on unicorn ("it's alright to crash")
and the "no room for error: nuclear war starts if you screw up"
mindset I adopt for other projects.

> Some of these bugs are even deep in Ruby, Jean B, one of my co-workers
> submitted a bug about there being no write_timeout in Net::HTTP (you
> even replied!): https://bugs.ruby-lang.org/issues/13396

Yeah, that got me thinking of improving core timeouts again...

One big problem is the lack of a portable standard asynchronous
name resolution mechanism in the C standard library.  I'm not
sure how well resolv.rb/resolv-replace.rb holds up in real-world
usage, nor if pulling in something like ares2 would be an
acceptable dependency for ruby-core...

> BTW we deployed 5.3.0 and replaced our `before_murder` hook with
> `after_worker_exit`. Everything works perfectly and we finally are not
> using a forked version of Unicorn anymore. Thanks for the release!

Cool, good to know.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2017-04-05 18:33 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-04 14:08 after_worker_exit on murder Simon Eskildsen
2017-04-04 14:32 ` Jeremy Evans
2017-04-04 14:36   ` Simon Eskildsen
2017-04-05  1:19 ` Eric Wong
2017-04-05 10:55   ` Simon Eskildsen
2017-04-05 18:33     ` Eric Wong

Code repositories for project(s) associated with this public inbox

	https://yhbt.net/unicorn.git/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).